datasets:
- artemkramov/coreference-dataset-ua
language:
- uk
tags:
- coreference-resolution
- anaphora
widget:
- text: Jens Peter Hansen kommer fra Danmark
  example_title: Coreference resolution
model-index:
- name: test
  results:
  - task:
      type: coreference-resolution
      name: Coreference resolution
    dataset:
      type: artemkramov/coreference-dataset-ua
      name: Silver Ukrainian Coreference Resolution Dataset
    metrics:
    - type: coval
      value: 0.731
      name: Mean F1 measure of MUC, BCubed, and CEAFE
Coreference resolution model for the Ukrainian language
This coreference resolution model for Ukrainian was trained on the silver Ukrainian coreference dataset with the F-Coref architecture from the fastcoref library. It was fine-tuned from the XLM-RoBERTa-base model.
Model Details
Model Description
- Developed by: Artem Kramov, Andrii Kursin (aqrsn@ukr.net).
- Languages: Ukrainian
- Finetuned from model: XLM-RoBERTa-base
Model Sources
- Repository: https://github.com/artemkramov/fastcoref-ua/blob/main/README.md
- Demo: Google Colab, Streamlit
Out-of-Scope Use
According to the metrics measured on the evaluation dataset, the model is precision-oriented: its precision is higher than its recall. Mentions are also detected at a fine level of granularity, so nested spans are reported as separate mentions. For example, the mention "Головний виконавчий директор Андрій Сидоренко" ("Chief Executive Officer Andrii Sydorenko") can be split into the following coreferent mentions: ["Головний виконавчий директор Андрій Сидоренко", "Головний виконавчий директор", "Андрій Сидоренко"]. This behavior can also be used to extract positions, roles, or other attributes of entities in the text.
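The nested-mention behavior described above can be handled with a small post-processing step. The following is a minimal sketch, not part of the model or the fastcoref library: the helper `nested_mentions` is hypothetical, and the spans are computed by hand from the example string rather than taken from model output.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def nested_mentions(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies inside a different span outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            if inner != outer and outer[0] <= inner[0] and inner[1] <= outer[1]:
                pairs.append((outer, inner))
    return pairs


text = "Головний виконавчий директор Андрій Сидоренко"
title = "Головний виконавчий директор"

# Hand-built spans for illustration only (an assumption, not model output):
# the full mention, the role, and the person's name.
spans = [(0, len(text)), (0, len(title)), (len(title) + 1, len(text))]

for outer, inner in nested_mentions(spans):
    print(repr(text[inner[0]:inner[1]]), "is nested in", repr(text[outer[0]:outer[1]]))
```

Pairs like these could be used either to keep only the outermost mention or to read the inner spans as a role and a name, depending on the application.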
How to Get Started with the Model
Use the code below to get started with the model.
from fastcoref import FCoref
import spacy

# spaCy pipeline for Ukrainian, passed to fastcoref through the nlp argument
nlp = spacy.load('uk_core_news_md')

model_path = "artemkramov/coref-ua"
# Set device='cpu' if no GPU is available
model = FCoref(model_name_or_path=model_path, device='cuda:0', nlp=nlp)

preds = model.predict(
    texts=["""Мій друг дав мені свою машину та ключі до неї; крім того, він дав мені його книгу. Я з радістю її читаю."""]
)

# Clusters as (start, end) character offsets into the input text
preds[0].get_clusters(as_strings=False)
# > [[(0, 3), (13, 17), (66, 70), (83, 84)],
#    [(0, 8), (18, 22), (58, 61), (71, 75)],
#    [(18, 29), (42, 45)],
#    [(71, 81), (95, 97)]]

# Clusters as mention strings
preds[0].get_clusters()
# > [['Мій', 'мені', 'мені', 'Я'], ['Мій друг', 'свою', 'він', 'його'], ['свою машину', 'неї'], ['його книгу', 'її']]

# Coreference logit for a pair of mention spans
preds[0].get_logit(
    span_i=(13, 17), span_j=(42, 45)
)
# > -6.867196
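The offsets returned with `as_strings=False` are character positions in the input text, with the end index exclusive. The sketch below shows one way to work with them without loading the model. It copies the text and the cluster offsets from the example above; the helpers `clusters_to_strings` and `mention_to_cluster` are hypothetical names introduced here, not part of fastcoref.

```python
text = "Мій друг дав мені свою машину та ключі до неї; крім того, він дав мені його книгу. Я з радістю її читаю."

# Offsets copied from the example output above
clusters = [
    [(0, 3), (13, 17), (66, 70), (83, 84)],
    [(0, 8), (18, 22), (58, 61), (71, 75)],
    [(18, 29), (42, 45)],
    [(71, 81), (95, 97)],
]


def clusters_to_strings(text, clusters):
    """Slice the text with each (start, end) span to recover the mention strings."""
    return [[text[start:end] for start, end in cluster] for cluster in clusters]


def mention_to_cluster(clusters):
    """Map every span to the index of the cluster that contains it."""
    return {span: idx for idx, cluster in enumerate(clusters) for span in cluster}


print(clusters_to_strings(text, clusters))
print(mention_to_cluster(clusters)[(42, 45)])  # index of the cluster containing 'неї'
```

A span-to-cluster lookup like this is a convenient starting point for tasks such as replacing pronouns with a fuller mention from the same cluster.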
Training Details
Training Data
The model was trained on the silver coreference resolution dataset: https://huggingface.co/datasets/artemkramov/coreference-dataset-ua.
Evaluation
Metrics
Two groups of metrics were computed: mention-level metrics and coreference resolution metrics.
Mention-based metrics:
- mention precision
- mention recall
- mention F1
Coreference resolution metrics, each calculated as the average over MUC, BCubed, and CEAFE:
- coreference precision
- coreference recall
- coreference F1
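The averaging described above can be sketched as follows. This is an illustration of the arithmetic only, not the evaluation script used for this model: the function names are hypothetical and the per-metric scores are made-up placeholder values, not results of the model.

```python
from statistics import mean


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_coref_scores(per_metric: dict) -> dict:
    """Average precision, recall, and F1 over the given coreference metrics."""
    return {
        key: mean(scores[key] for scores in per_metric.values())
        for key in ("precision", "recall", "f1")
    }


# Placeholder values for illustration only (NOT the model's actual scores)
per_metric = {
    "muc": {"precision": 0.80, "recall": 0.70},
    "b_cubed": {"precision": 0.70, "recall": 0.60},
    "ceafe": {"precision": 0.60, "recall": 0.50},
}
for scores in per_metric.values():
    scores["f1"] = f1_score(scores["precision"], scores["recall"])

averaged = average_coref_scores(per_metric)
print(averaged)
```

Note that the averaged F1 here is the mean of the three per-metric F1 values, which in general differs slightly from the harmonic mean of the averaged precision and recall.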
Results
The metrics for the validation dataset:
Metric | Value |
---|---|
Mention precision | 0.850 |
Mention recall | 0.798 |
Mention F1 | 0.824 |
Coreference precision | 0.758 |
Coreference recall | 0.706 |
Coreference F1 | 0.731 |
Model Card Authors
Artem Kramov (https://www.linkedin.com/in/artem-kramov-0b3731100/), Andrii Kursin (aqrsn@ukr.net)