Edit model card

Fine-tuned multilingual model for russian language NER

This is the model card for fine-tuned Babelscape/wikineural-multilingual-ner, which has multilingual mBERT as its base. I`ve fine-tuned it using RCC-MSU/collection3 dataset for token-classification task. The dataset has BIO-pattern and following labels:

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

Model Details

Fine-tuning was proceeded in 3 epochs, and computed next metrics:

Epoch Training Loss Validation Loss Precision Recall F1 Accuracy
1 0.041000 0.032810 0.959569 0.974253 0.966855 0.993325
2 0.020800 0.028395 0.959569 0.974253 0.966855 0.993325
3 0.010500 0.029138 0.963239 0.973767 0.968474 0.993247

To avoid over-fitting due to a small amount of training samples, i used high weight_decay = 0.1.

Basic usage

So, you can easily use this model with pipeline for 'token-classification' task.

import torch

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from datasets import load_dataset


model_ckpt = "nesemenpolkov/msu-wiki-ner"

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"
)

demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

with torch.no_grad():
    out = pipe(demo_sample)

Bias, Risks, and Limitations

This model is finetuned version of Babelscape/wikineural-multilingual-ner, on a russian language NER dataset RCC-MSU/collection3. It can show low scores on another language texts.

Citation [optional]

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "Fine-tuned multilingual model for russian language NER.",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data.",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
Downloads last month
34
Safetensors
Model size
177M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for nesemenpolkov/msu-wiki-ner

Finetuned
(4)
this model

Dataset used to train nesemenpolkov/msu-wiki-ner