language: ro
datasets:
  - ronec
license: mit

bert-base-romanian-ner

Model description

bert-base-romanian-ner is a fine-tuned BERT model that is ready to use for Named Entity Recognition (NER) in Romanian and achieves state-of-the-art performance on this task. It has been trained to recognize 15 types of entities: persons, geo-political entities, locations, organizations, languages, national_religious_political entities, datetime, period, quantity, money, numeric, ordinal, facilities, works of art and events.

Specifically, this model is a bert-base-romanian-cased-v1 model that was fine-tuned on RONEC version 2.0, which holds 12,330 sentences with over 0.5M tokens and a total of 80,283 distinctly annotated entities. RONECv2 is a BIO2-annotated corpus, so this model generates "B-" and "I-" style labels for entities.
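
To make the BIO2 scheme concrete, here is a minimal sketch of how a sequence of "B-"/"I-" labels can be grouped into entity spans. The function name `bio2_to_spans` and the sample label sequence are illustrative assumptions, not part of the model or of RONEC tooling.

```python
from typing import List, Tuple


def bio2_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO2 labels into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, 0
    for i, label in enumerate(labels):
        # An "I-" tag continues the open span only if its type matches.
        if label.startswith("I-") and current_type == label[2:]:
            continue
        # Any other label closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type = None
        # "B-" opens a new span; a stray "I-" is also treated as an opening tag.
        if label != "O":
            current_type, start = label[2:], i
    # Close a span that runs to the end of the sequence.
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Hypothetical token-level labels, one per word.
labels = ["B-PERSON", "I-PERSON", "O", "B-GPE", "O", "B-DATETIME", "I-DATETIME"]
spans = bio2_to_spans(labels)
print(spans)
```

Each span covers the token positions `start` to `end - 1`, so two adjacent "B-" tags of the same type yield two separate entities.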

How to use

There are 2 ways to use this model:

Directly in Transformers:

You can use this model with the Transformers pipeline for NER. Note that the tokenizer may split a word into several subtokens, and these subtokens can receive different labels; you will have to handle such cases yourself.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned token classification model.
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
model = AutoModelForTokenClassification.from_pretrained("dumitrescustefan/bert-base-romanian-ner")

# Build the NER pipeline and run it on a Romanian sentence.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Alex cumpără un bilet pentru trenul 3118 în direcția Cluj cu plecare la ora 13:00."
ner_results = nlp(example)
print(ner_results)
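
One possible way to handle words split into several subtokens is to merge WordPiece continuation pieces (those starting with "##") back into the preceding word and keep the label of the first piece. The sketch below assumes the pipeline returns dicts with the keys `word`, `entity`, `index`, `start` and `end`; the helper `merge_subtokens` and the sample predictions are hypothetical and hand-written, not real model output.

```python
from typing import Dict, List


def merge_subtokens(predictions: List[Dict]) -> List[Dict]:
    """Merge '##' continuation pieces into the previous word (first-piece label wins)."""
    words: List[Dict] = []
    for pred in predictions:
        # A piece continues the previous word only if it is a '##' piece
        # and directly follows it in the token sequence.
        is_continuation = (
            pred["word"].startswith("##")
            and bool(words)
            and pred["index"] == words[-1]["last_index"] + 1
        )
        if is_continuation:
            words[-1]["word"] += pred["word"][2:]
            words[-1]["end"] = pred["end"]
            words[-1]["last_index"] = pred["index"]
        else:
            # Start a new word; a stray '##' piece is kept unchanged.
            words.append({
                "word": pred["word"],
                "entity": pred["entity"],
                "start": pred["start"],
                "end": pred["end"],
                "last_index": pred["index"],
            })
    return words


# Hypothetical pipeline-style predictions (the subtoken split is assumed).
sample = [
    {"word": "Alex", "entity": "B-PERSON", "index": 1, "start": 0, "end": 4},
    {"word": "Clu", "entity": "B-GPE", "index": 12, "start": 53, "end": 56},
    {"word": "##j", "entity": "I-GPE", "index": 13, "start": 56, "end": 57},
]
merged = merge_subtokens(sample)
print(merged)
```

Recent versions of Transformers also accept an `aggregation_strategy` argument in `pipeline("ner", ...)` (for example `"simple"` or `"first"`), which may make a manual merge like this unnecessary.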

Use in a Python package

Install package Use named_persons_only

Don't forget!

Remember to always sanitize your text! Replace the s and t cedilla letters with their comma-below counterparts before processing your text with this model:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
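
The same replacement can be wrapped in a small reusable function. This is a sketch only; the name `sanitize` is an assumption, and the code points are written as escapes because the cedilla and comma-below letters are hard to tell apart visually.

```python
# Map cedilla letters to the comma-below letters used in standard Romanian.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
})


def sanitize(text: str) -> str:
    """Replace cedilla variants of s and t with their comma-below forms."""
    return text.translate(_CEDILLA_TO_COMMA)


# The input uses cedilla letters; the output uses comma-below letters.
print(sanitize("\u015etefan merge la Constan\u0163a."))
```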

NER evaluation results

| metric | dev | test |
|----------- |:-----: |:-----: |
| f1 | 95.1 | 91.3 |
| precision | 95.0 | 90.7 |
| recall | 95.3 | 91.9 |

{'test/avg_loss': 0.20585431158542633, 'test/ent_type': 0.9276865720748901, 'test/exact': 0.9118986129760742, 'test/partial': 0.9356381297111511, 'test/strict': 0.8921924233436584}
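
As a quick sanity check, the F1 values reported above can be compared with the harmonic mean of the reported precision and recall. This sketch assumes the reported F1 is computed from the same precision and recall figures; since those are rounded to one decimal, small differences are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per split, copied from the results table.
reported = {"dev": (95.0, 95.3, 95.1), "test": (90.7, 91.9, 91.3)}
for split, (p, r, f1) in reported.items():
    print(split, "computed:", round(f1_score(p, r), 2), "reported:", f1)
```

Both computed values agree with the reported ones to within rounding.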

Corpus details

The corpus has the following classes and distribution in the train/valid/test splits:

| Classes | Total (#) | Train (#) | Train (%) | Valid (#) | Valid (%) | Test (#) | Test (%) |
|------------- |:------: |:------: |:-------: |:------: |:-------: |:------: |:-------: |
| PERSON | 26130 | 19167 | 73.35 | 2733 | 10.46 | 4230 | 16.19 |
| GPE | 11103 | 8193 | 73.79 | 1182 | 10.65 | 1728 | 15.56 |
| LOC | 2467 | 1824 | 73.94 | 270 | 10.94 | 373 | 15.12 |
| ORG | 7880 | 5688 | 72.18 | 880 | 11.17 | 1312 | 16.65 |
| LANGUAGE | 467 | 342 | 73.23 | 52 | 11.13 | 73 | 15.63 |
| NAT_REL_POL | 4970 | 3673 | 73.90 | 516 | 10.38 | 781 | 15.71 |
| DATETIME | 9614 | 6960 | 72.39 | 1029 | 10.7 | 1625 | 16.9 |
| PERIOD | 1188 | 862 | 72.56 | 129 | 10.86 | 197 | 16.58 |
| QUANTITY | 1588 | 1161 | 73.11 | 181 | 11.4 | 246 | 15.49 |
| MONEY | 1424 | 1041 | 73.10 | 159 | 11.17 | 224 | 15.73 |
| NUMERIC | 7735 | 5734 | 74.13 | 814 | 10.52 | 1187 | 15.35 |
| ORDINAL | 1893 | 1377 | 72.74 | 212 | 11.2 | 304 | 16.06 |
| FACILITY | 1126 | 840 | 74.6 | 113 | 10.04 | 173 | 15.36 |
| WORK_OF_ART | 1596 | 1157 | 72.49 | 176 | 11.03 | 263 | 16.48 |
| EVENT | 1102 | 826 | 74.95 | 107 | 9.71 | 169 | 15.34 |
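
The per-class totals in the table can be cross-checked against the overall entity count given in the model description. The short sketch below copies the "Total" column and also shows how a split percentage is derived from the counts; the variable names are illustrative.

```python
# "Total" column of the corpus table, per entity class.
totals = {
    "PERSON": 26130, "GPE": 11103, "LOC": 2467, "ORG": 7880,
    "LANGUAGE": 467, "NAT_REL_POL": 4970, "DATETIME": 9614,
    "PERIOD": 1188, "QUANTITY": 1588, "MONEY": 1424, "NUMERIC": 7735,
    "ORDINAL": 1893, "FACILITY": 1126, "WORK_OF_ART": 1596, "EVENT": 1102,
}

# Sum over all classes; this should match the corpus-level entity count.
total_entities = sum(totals.values())
print(total_entities)

# Share of PERSON entities in the train split (19167 of 26130), in percent.
person_train_pct = round(100 * 19167 / totals["PERSON"], 2)
print(person_train_pct)
```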

Dataset contains 31 BIO2 classes: ['O', 'B-PERSON', 'I-PERSON', 'B-ORG', 'I-ORG', 'B-GPE', 'I-GPE', 'B-LOC', 'I-LOC', 'B-NAT_REL_POL', 'I-NAT_REL_POL', 'B-EVENT', 'I-EVENT', 'B-LANGUAGE', 'I-LANGUAGE', 'B-WORK_OF_ART', 'I-WORK_OF_ART', 'B-DATETIME', 'I-DATETIME', 'B-PERIOD', 'I-PERIOD', 'B-MONEY', 'I-MONEY', 'B-QUANTITY', 'I-QUANTITY', 'B-NUMERIC', 'I-NUMERIC', 'B-ORDINAL', 'I-ORDINAL', 'B-FACILITY', 'I-FACILITY'].

Ignoring the "B-" and "I-" prefixes, there are 16 classes (the 15 entity types plus 'O'): ['DATETIME', 'EVENT', 'FACILITY', 'GPE', 'LANGUAGE', 'LOC', 'MONEY', 'NAT_REL_POL', 'NUMERIC', 'O', 'ORDINAL', 'ORG', 'PERIOD', 'PERSON', 'QUANTITY', 'WORK_OF_ART']
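
The relation between the two lists can be sketched in a few lines: each of the 15 entity types contributes a "B-" and an "I-" tag, and 'O' is added for tokens outside any entity. The `label2id` mapping built here is an assumption for illustration only; the ids actually used by the model are defined in its configuration and may differ.

```python
# The 15 entity types, in the order they appear in the BIO2 list above.
entity_types = [
    "PERSON", "ORG", "GPE", "LOC", "NAT_REL_POL", "EVENT", "LANGUAGE",
    "WORK_OF_ART", "DATETIME", "PERIOD", "MONEY", "QUANTITY", "NUMERIC",
    "ORDINAL", "FACILITY",
]

# 'O' plus a B-/I- pair per entity type gives 1 + 2 * 15 = 31 labels.
bio2_labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]

# Hypothetical id mappings; check the model config for the real ones.
label2id = {label: i for i, label in enumerate(bio2_labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(bio2_labels))
print(sorted(entity_types + ["O"]))
```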

BibTeX entry and citation info

Please consider citing the following paper as a thank you to the authors of RONEC, even though it describes v1 of the corpus and you are using a model trained on v2:

Dumitrescu, Stefan Daniel, and Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019).

or in BibTeX format:

@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}