Creating a Training Dataset for NER and RE
Dear all,
I am using meta-llama/Meta-Llama-3-8B-Instruct to produce a training dataset for NER and RE, to be used not only with Llama 3 but also with spaCy models, which require a specifically structured dataset for training.
The problem I am encountering is that, when I print/save the data to a JSON/text file, some of the trigger words I inserted to mark the start and the end of the entity-recognition and relation-extraction output for each query/sentence are sometimes placed wrongly in the file.
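For context, the structure I ultimately want to derive from the model's output is roughly spaCy's classic training-data format; here is a sketch with made-up offsets and labels:

# Hypothetical target structure (spaCy-style NER training data):
# each item is (text, {"entities": [(start_idx, end_idx, label)]})
TRAIN_DATA = [
    (
        "Mr. John Doe purchased a house in 1985.",
        {"entities": [(0, 12, "PERSON"), (34, 38, "DATE")]},
    ),
]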
I am using the following messages to prompt the model:
messages = [
    {"role": "system", "content": "You extract the Named Entities in the form: entity text (start_idx, end_idx, label)"},
    {"role": "system", "content": "Also, you extract the relations in the form: (subject, predicate, object)"},
    {"role": "user", "content": ""},
]
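For completeness, pipeline and terminators are defined following the standard example from the model card; a sketch of my setup:

import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Stop generation on either the regular EOS token or Llama 3's
# end-of-turn token <|eot_id|>
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]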
Then I pass the sentence to be processed in the third dictionary (the user turn), as follows:
from tqdm import tqdm

with open('llama3_output_new.txt', 'a+') as f:
    for d in tqdm(list_data):
        text = d['text']
        if text != '':
            # Put the current sentence into the user turn
            messages[2]["content"] = text
            prompt = pipeline.tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )
            outputs = pipeline(
                prompt,
                max_new_tokens=256,
                eos_token_id=terminators,
                do_sample=True,
                temperature=0.6,
                top_p=0.9,
            )
            # Keep only the newly generated text, not the echoed prompt
            generated_text = outputs[0]["generated_text"][len(prompt):]
            # Note: the three writes are back-to-back, with no newline separators
            f.write('Original text: ' + messages[2]["content"])
            f.write(generated_text)
            f.write('------END------')
The first line written in the file is the following:

Original text: You extract the Named Entities in the form: entity text (start_idx, end_idx, label)

whereas I was expecting the original sentence to appear right after 'Original text: ', not the content of the first system message.
Then, sometimes, the strings I introduced as "triggers" to delimit each query's output end up in the wrong places:
1. (Mr. John Doe, purchased, object)
2. (Mr. John Doe, acquired, object)
3. (Mr. John Doe, died, 1990)
4. (Mr. John Doe, possibly acquired, object from David Red in 1985)
5. (Mrs.------END------
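The reason this placement matters is that I later split the saved file back into records on these trigger strings. A minimal sketch of that post-processing (file name and marker as in the loop above):

with open('llama3_output_new.txt') as f:
    raw = f.read()

# Split on the end-of-record trigger; each chunk should be one
# 'Original text: ...' record followed by the model's annotations
records = [chunk.strip() for chunk in raw.split('------END------') if chunk.strip()]

for rec in records:
    print(rec[:80])  # quick check of how each record starts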
At first, I thought the problem was that I was using more than one GPU, but the behaviour persisted after I restricted the model to a single GPU; the results reported above are from the single-GPU run.
Moreover, I would assume the model serialises its output correctly even when it runs on more than one GPU.
Do you have any suggestions or explanations for this issue?