magistermilitum
/

roberta-multilingual-medieval-ner

 ---
+## Model Details
+This is a version of the multilingual RoBERTa model (XLM-RoBERTa) fine-tuned on medieval charters. The model is intended to recognize locations and persons in medieval texts,
+as both flat and nested entities. The training dataset comprises 8k annotated texts in Medieval Latin, French, and Spanish, dating from the 11th to the 15th century.
+### How to Get Started with the Model
+The model can be used directly through the `transformers` token-classification pipeline:
+```python
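+from transformers import pipeline
+
+pipe = pipeline("token-classification", model="magistermilitum/roberta-multilingual-medieval-ner")
+
+# Example input; any list of strings works here.
+list_of_sentences = ["Ego Radulfus de Francorvilla miles, notum facio tam presentibus quam futuris quod, cum Guillelmo Bateste militi de Miliaco"]
+# Run the model on each sentence, then keep only the label, subword, and character offsets.
+results = list(map(pipe, list_of_sentences))
+results = [[[y["entity"], y["word"], y["start"], y["end"]] for y in x] for x in results]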
+print(results)
+```
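The predictions above are at the level of subword pieces, and word-initial pieces carry the SentencePiece marker `▁`. The following is a minimal sketch of how such pieces could be merged back into whole words. The function name `merge_subwords` and the mock predictions are hypothetical and hand-written for illustration; they are not real model output.

```python
def merge_subwords(predictions):
    """Merge SentencePiece pieces into whole words.

    `predictions` is a list of dicts shaped like the pipeline output
    (keys: entity, word, start, end). Each word keeps the label of its
    first piece and spans from that piece's start to the last piece's end.
    """
    words = []
    for pred in predictions:
        piece = pred["word"]
        if piece.startswith("▁") or not words:
            # A leading "▁" marks the start of a new word.
            words.append({
                "word": piece.lstrip("▁"),
                "entity": pred["entity"],
                "start": pred["start"],
                "end": pred["end"],
            })
        else:
            # Continuation piece: extend the current word.
            words[-1]["word"] += piece
            words[-1]["end"] = pred["end"]
    return words


# Hand-written mock predictions (illustrative only, not real model output).
mock = [
    {"entity": "B-PERS", "word": "▁Radul", "start": 0, "end": 5},
    {"entity": "I-PERS", "word": "fus", "start": 5, "end": 8},
    {"entity": "I-PERS", "word": "▁de", "start": 9, "end": 11},
]
merged = merge_subwords(mock)
print(merged)
```

Taking the label of the first piece is one possible convention; other strategies (for example, majority vote over pieces) would need a different merge rule.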
+### Model Description
+The following snippet converts model predictions to CoNLL format with BIO tags.
+```python
+import nltk
+
+class TextProcessor:
+    def __init__(self, filename):
+        self.filename = filename
+        self.sent_detector = nltk.data.load("tokenizers/punkt/english.pickle") # Punkt sentence tokenizer (English model); requires nltk.download("punkt")
+        self.sentences = []
+        self.new_sentences = []
+        self.results = []
+        self.new_sentences_token_info = []
+        self.new_sentences_bio = []
+        self.BIO_TAGS = []
+        self.stripped_BIO_TAGS = []
+    def read_file(self):
+        with open(self.filename, 'r', encoding='utf-8') as f:
+            text = f.read()
+        self.sentences = self.sent_detector.tokenize(text.strip())
+    def process_sentences(self): # The text is split into sentences because the encoder has a max length of 256. Sentences with fewer than 40 words are merged into the previous one.
+        for sentence in self.sentences:
+            if len(sentence.split()) < 40 and self.new_sentences:
+                self.new_sentences[-1] += " " + sentence
+            else:
+                self.new_sentences.append(sentence)
+    def apply_model(self, pipe):
+        self.results = list(map(pipe, self.new_sentences))
+        self.results=[[[y["entity"],y["word"], y["start"], y["end"]] for y in x] for x in self.results]
+    def tokenize_sentences(self):
+        for n_s in self.new_sentences:
+            tokens=n_s.split() # Basic tokenization
+            token_info = []
+            # Initialize a variable to keep track of character index
+            char_index = 0
+            # Iterate through the tokens and record start and end info
+            for token in tokens:
+                start = char_index
+                end = char_index + len(token)  # End offset (exclusive): one past the token's last character
+                token_info.append((token, start, end))
+                char_index += len(token) + 1  # Add 1 for the whitespace
+            self.new_sentences_token_info.append(token_info)
+    def process_results(self): #merge subwords and BIO tags
+        for result in self.results:
+            merged_bio_result = []
+            current_word = ""
+            current_label = None
+            current_start = None
+            current_end = None
+            for entity, subword, start, end in result:
+                if subword.startswith("▁"):
+                    subword = subword[1:]
+                    merged_bio_result.append([current_word, current_label, current_start, current_end])
+                    current_word = "" ; current_label = None ; current_start = None ; current_end = None
+                if current_start is None:
+                    current_word = subword ; current_label = entity ; current_start = start+1 ; current_end= end
+                else:
+                    current_word += subword ; current_end = end
+            if current_word:
+                merged_bio_result.append([current_word, current_label, current_start, current_end])
+            self.new_sentences_bio.append(merged_bio_result[1:])
+    def match_tokens_with_entities(self): #match BIO tags with tokens
+        for i,ss in enumerate(self.new_sentences_token_info):
+            for word in ss:
+                for ent in self.new_sentences_bio[i]:
+                    if word[1]==ent[2]:
+                        if ent[1]=="L-PERS":
+                            self.BIO_TAGS.append([word[0], "I-PERS", "B-LOC"])
+                            break
+                        else:
+                            if "LOC" in ent[1]:
+                                self.BIO_TAGS.append([word[0], "O", ent[1]])
+                            else:
+                                self.BIO_TAGS.append([word[0], ent[1], "O"])
+                            break
+                else:
+                    self.BIO_TAGS.append([word[0], "O", "O"])
+    def separate_dots_and_comma(self): #optional
+        signs=[",", ";", ":", "."]
+        for bio in self.BIO_TAGS:
+            if any(bio[0][-1]==sign for sign in signs) and len(bio[0])>1:
+                self.stripped_BIO_TAGS.append([bio[0][:-1], bio[1], bio[2]])
+                self.stripped_BIO_TAGS.append([bio[0][-1], "O", "O"])
+            else:
+                self.stripped_BIO_TAGS.append(bio)
+    def save_BIO(self):
+        with open('output_BIO_a.txt', 'w', encoding='utf-8') as output_file:
+            output_file.write("TOKEN\tPERS\tLOCS\n"+"\n".join(["\t".join(x) for x in self.stripped_BIO_TAGS]))
+# Usage:
+processor = TextProcessor('sentence.txt')
+processor.read_file()
+processor.process_sentences()
+processor.apply_model(pipe)
+processor.tokenize_sentences()
+processor.process_results()
+processor.match_tokens_with_entities()
+processor.separate_dots_and_comma()
+processor.save_BIO()
+```
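The class above respects the encoder's maximum length of 256 only through sentence splitting, so a single very long sentence could still exceed it. Below is a minimal sketch of a helper that cuts such a sentence into fixed-size word chunks. The name `split_long_sentence` and the default `max_words=100` are assumptions for illustration; the right budget depends on how many subword pieces each word produces.

```python
def split_long_sentence(sentence, max_words=100):
    """Split a sentence into consecutive chunks of at most `max_words` words."""
    words = sentence.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Tiny demonstration with a small budget so the chunking is visible.
chunks = split_long_sentence("a b c d e", max_words=2)
print(chunks)  # ['a b', 'c d', 'e']
```

Cutting at a fixed word count can separate the tokens of one entity across two chunks, so splitting at punctuation where possible is likely a better choice in practice.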
+- **Developed by:** Sergio Torres Aguilar
+- **Model type:** XLM-RoBERTa
+- **Language(s) (NLP):** Medieval Latin, Spanish, French
+- **Finetuned from model:** XLM-RoBERTa-Large, for named entity recognition
+### Direct Use
+For example, the sentence "Ego Radulfus de Francorvilla miles, notum facio tam presentibus quam futuris quod, cum Guillelmo Bateste militi de Miliaco"
+is annotated in BIO format, with one column for persons and one for locations, as:
+```python
+('Ego', 'O', 'O')
+('Radulfus', 'B-PERS', 'O')
+('de', 'I-PERS', 'O')
+('Francorvilla', 'I-PERS', 'B-LOC')
+('miles', 'O', 'O')
+(',', 'O', 'O')
+('notum', 'O', 'O')
+('facio', 'O', 'O')
+('tam', 'O', 'O')
+('presentibus', 'O', 'O')
+('quam', 'O', 'O')
+('futuris', 'O', 'O')
+('quod', 'O', 'O')
+(',', 'O', 'O')
+('cum', 'O', 'O')
+('Guillelmo', 'B-PERS', 'O')
+('Bateste', 'I-PERS', 'O')
+('militi', 'O', 'O')
+('de', 'O', 'O')
+('Miliaco', 'O', 'B-LOC')
+```
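The first tag column holds the person layer and the second the location layer, so a place name nested inside a personal name is tagged in both. As a rough sketch, entity spans can be recovered from either column as follows. The helper `bio_to_spans` is hypothetical and not part of the model's code; the rows are copied from the example above.

```python
def bio_to_spans(rows, column):
    """Collect entity spans from one BIO column.

    `rows` are (token, pers_tag, loc_tag) tuples; `column` is 1 for the
    person layer or 2 for the location layer.
    """
    spans, current = [], []
    for row in rows:
        tag = row[column]
        if tag.startswith("B-"):
            # A new entity begins; close any open one first.
            if current:
                spans.append(" ".join(current))
            current = [row[0]]
        elif tag.startswith("I-") and current:
            # Continuation of the open entity.
            current.append(row[0])
        else:
            # Outside any entity (or a stray I- tag): close the open one.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


rows = [
    ("Ego", "O", "O"),
    ("Radulfus", "B-PERS", "O"),
    ("de", "I-PERS", "O"),
    ("Francorvilla", "I-PERS", "B-LOC"),
    ("miles", "O", "O"),
]
print(bio_to_spans(rows, 1))  # ['Radulfus de Francorvilla']
print(bio_to_spans(rows, 2))  # ['Francorvilla']
```

Reading the two columns separately is what makes the nesting visible: "Francorvilla" is both part of the person span and a location in its own right.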
+### Training Procedure
+The model was fine-tuned for 5 epochs from XLM-RoBERTa-Large, using a learning rate of 5e-5 and a batch size of 16.
+**BibTeX:**
+```bibtex
+@inproceedings{aguilar2022multilingual,
+  title={Multilingual Named Entity Recognition for Medieval Charters Using Stacked Embeddings and Bert-based Models},
+  author={Aguilar, Sergio Torres},
+  booktitle={Proceedings of the second workshop on language technologies for historical and ancient languages},
+  pages={119--128},
+  year={2022}
+}
+```
+## Model Card Contact
+sergio.torres@uni.lu