Fine-tuned multilingual model for Russian language NER

This is the model card for a fine-tuned version of Babelscape/wikineural-multilingual-ner, which uses multilingual BERT (mBERT) as its base. I fine-tuned it on the RCC-MSU/collection3 dataset for the token-classification task. The dataset uses the BIO tagging scheme with the following labels:

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
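To make the BIO scheme concrete: a `B-` tag opens an entity, following `I-` tags of the same type continue it, and `O` marks tokens outside any entity. The sketch below decodes a tag sequence into entity spans. The helper `bio_to_spans` and the example sentence are hypothetical and written by hand for illustration; they are not taken from the dataset or from model output.

```python
from typing import List, Tuple

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open entity only if the types match.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the entity that is currently open.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        # Assumption: a stray I- tag with no matching open entity is treated like O.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hand-written illustrative example, not a dataset sample.
tokens = ["Иван", "Иванов", "работает", "в", "МГУ", "в", "Москве"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]

spans = bio_to_spans(tags)
entities = [(etype, " ".join(tokens[s:e])) for etype, s, e in spans]
print(entities)
```

Here the two PER tokens are merged into one person entity, while the ORG and LOC entities each cover a single token.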

Model Details

Fine-tuning ran for 3 epochs and produced the following metrics:

Epoch	Training Loss	Validation Loss	Precision	Recall	F1	Accuracy
1	0.041000	0.032810	0.959569	0.974253	0.966855	0.993325
2	0.020800	0.028395	0.959569	0.974253	0.966855	0.993325
3	0.010500	0.029138	0.963239	0.973767	0.968474	0.993247
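As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the other two columns. The snippet below is a minimal sketch that does this for the rows above; the function name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# epoch -> (precision, recall, reported F1), copied from the table above
reported = {
    1: (0.959569, 0.974253, 0.966855),
    2: (0.959569, 0.974253, 0.966855),
    3: (0.963239, 0.973767, 0.968474),
}

for epoch, (p, r, f1) in reported.items():
    # The recomputed value should agree with the reported one up to rounding.
    print(epoch, round(f1_score(p, r), 6), f1)
```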

To avoid overfitting on the small number of training samples, I used a high weight decay (weight_decay = 0.1).
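For intuition about what this setting does, here is a minimal sketch of a single update with decoupled weight decay, the form used by AdamW-style optimizers. This card does not state which optimizer or learning rate was used, so the decoupled form, the plain gradient step, and the value of `lr` are all assumptions made for illustration.

```python
def step_with_weight_decay(weight: float, grad: float, lr: float, weight_decay: float) -> float:
    """One update of a scalar weight: a gradient step plus decoupled weight decay."""
    # The decay term shrinks the weight toward zero independently of the gradient.
    return weight - lr * grad - lr * weight_decay * weight


w = 1.0
lr = 0.01  # hypothetical learning rate; the card does not state one
for _ in range(100):
    # With a zero gradient, only the decay term acts on the weight.
    w = step_with_weight_decay(w, grad=0.0, lr=lr, weight_decay=0.1)

print(w)
```

With no gradient signal, each step multiplies the weight by (1 - lr * weight_decay), which is why a larger weight_decay acts as a stronger pull toward small weights.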

Basic usage

You can use this model with the pipeline API for the 'token-classification' task:

import torch

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline


model_ckpt = "nesemenpolkov/msu-wiki-ner"

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"
)

demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

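# Run inference; with aggregation_strategy="simple" each result is one merged entity span
with torch.no_grad():
    out = pipe(demo_sample)

print(out)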
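With an aggregation strategy set, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to post-process such a list by slicing the original text with the character offsets. The helper `group_entities`, its `min_score` threshold, and the `mock_out` list are hypothetical: `mock_out` is a hand-written placeholder in the pipeline's output format, not real output from this model.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(text: str, predictions: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group predicted entities by type, taking the surface form from the original text."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] < min_score:
            continue  # drop low-confidence predictions
        # Slice by character offsets rather than relying on the detokenized "word" field.
        grouped[pred["entity_group"]].append(text[pred["start"]:pred["end"]])
    return dict(grouped)


demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

# Hand-written placeholder, NOT real model output; the scores are made up.
mock_out = [
    {"entity_group": "PER", "score": 0.99, "word": "Иван Иванов", "start": 5, "end": 16},
    {"entity_group": "PER", "score": 0.98, "word": "Иванов И. И.", "start": 29, "end": 40},
]

print(group_entities(demo_sample, mock_out))
```

Slicing the input by `start` and `end` keeps the original spacing and punctuation, which the `word` field may not preserve after subword tokens are merged.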

Bias, Risks, and Limitations

This model is a fine-tuned version of Babelscape/wikineural-multilingual-ner, trained on the Russian-language NER dataset RCC-MSU/collection3. It may perform poorly on texts in other languages.

Citation

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "Fine-tuned multilingual model for russian language NER.",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data.",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
