---
datasets:
- artemkramov/coreference-dataset-ua
language:
- uk
tags:
- coreference-resolution
- anaphora
widget:
- text: "Jens Peter Hansen kommer fra Danmark"
  example_title: "Coreference resolution"
model-index:
- name: test
  results:
  - task:
      type: coreference-resolution
      name: Coreference resolution
    dataset:
      type: artemkramov/coreference-dataset-ua
      name: Silver Ukrainian Coreference Resolution Dataset
    metrics:
    - type: coval
      value: 0.731
      name: Mean F1 measure of MUC, BCubed, and CEAFE
---
|
# Coreference resolution model for the Ukrainian language |
|
|
|
|
|
|
This coreference resolution model for the Ukrainian language was trained on the [silver Ukrainian coreference dataset](https://huggingface.co/datasets/artemkramov/coreference-dataset-ua) using the [F-Coref](https://arxiv.org/abs/2209.04280) library. It was fine-tuned from the [XLM-RoBERTa-base model](https://huggingface.co/ukr-models/xlm-roberta-base-uk).
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
|
|
- **Developed by:** [Artem Kramov](https://www.linkedin.com/in/artem-kramov-0b3731100/), Andrii Kursin (aqrsn@ukr.net). |
|
- **Languages:** Ukrainian |
|
- **Finetuned from model:** [XLM-RoBERTa-base](https://huggingface.co/ukr-models/xlm-roberta-base-uk)
|
|
|
### Model Sources |
|
|
|
|
|
|
- **Repository:** https://github.com/artemkramov/fastcoref-ua/blob/main/README.md |
|
- **Demo:** [Google Colab](https://colab.research.google.com/drive/1vsaH15DFDrmKB4aNsQ-9TCQGTW73uk1y?usp=sharing), [Streamlit](https://coreference-ua-app.streamlit.app) |
|
|
|
### Out-of-Scope Use |
|
|
|
On the validation dataset the model scores higher on precision than on recall (mention precision 0.850 vs. recall 0.798; coreference precision 0.758 vs. recall 0.706), so it is precision-oriented. The model also detects mentions at a high level of granularity: nested spans are treated as separate mentions.

For example, the mention "Головний виконавчий директор Андрій Сидоренко" can be split into the following coreferent mentions: ["Головний виконавчий директор Андрій Сидоренко", "Головний виконавчий директор", "Андрій Сидоренко"].

This granularity can also be used to extract positions, roles, or other attributes of entities from the text.
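
The nested mentions described above can be grouped with a few lines of plain Python. The following is a minimal sketch that does not call the model: `nested_spans` is a hypothetical helper (not part of fastcoref), and the spans are derived from the example strings with `str.find` purely for illustration.

```python
def nested_spans(spans):
    """Map each span to the other spans that lie fully inside it."""
    result = {}
    for outer in spans:
        inner = [
            s for s in spans
            if s != outer and outer[0] <= s[0] and s[1] <= outer[1]
        ]
        if inner:
            result[outer] = inner
    return result


text = "Головний виконавчий директор Андрій Сидоренко"
mentions = [
    "Головний виконавчий директор Андрій Сидоренко",
    "Головний виконавчий директор",
    "Андрій Сидоренко",
]

# Character-offset spans (start, end) of each mention in the text
spans = [(text.find(m), text.find(m) + len(m)) for m in mentions]

nested = nested_spans(spans)
for (start, end), inner in nested.items():
    # Print the full mention followed by the shorter mentions inside it
    print(text[start:end], "->", [text[s:e] for s, e in inner])
```

Here the role ("Головний виконавчий директор") and the name ("Андрій Сидоренко") both surface as inner spans of the full mention, which is the kind of attribute extraction the paragraph above refers to.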
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
# Requires: pip install fastcoref
#           python -m spacy download uk_core_news_md
from fastcoref import FCoref
import spacy

# Ukrainian spaCy pipeline passed to fastcoref
nlp = spacy.load('uk_core_news_md')

model_path = "artemkramov/coref-ua"
# Use device='cpu' if no GPU is available
model = FCoref(model_name_or_path=model_path, device='cuda:0', nlp=nlp)

preds = model.predict(
    texts=["""Мій друг дав мені свою машину та ключі до неї; крім того, він дав мені його книгу. Я з радістю її читаю."""]
)

# Clusters as (start, end) character offsets into the input text
preds[0].get_clusters(as_strings=False)
# > [[(0, 3), (13, 17), (66, 70), (83, 84)],
#    [(0, 8), (18, 22), (58, 61), (71, 75)],
#    [(18, 29), (42, 45)],
#    [(71, 81), (95, 97)]]

# The same clusters as strings
preds[0].get_clusters()
# > [['Мій', 'мені', 'мені', 'Я'], ['Мій друг', 'свою', 'він', 'його'], ['свою машину', 'неї'], ['його книгу', 'її']]

# Coreference logit for a pair of spans
preds[0].get_logit(
    span_i=(13, 17), span_j=(42, 45)
)
# > -6.867196
```
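
The spans in the output above line up with character offsets into the input text. As a minimal sketch, the snippet below reuses the text and spans from that example (it does not run the model) and maps the spans back to strings; `spans_to_strings` is a hypothetical helper, not part of fastcoref.

```python
text = (
    "Мій друг дав мені свою машину та ключі до неї; "
    "крім того, він дав мені його книгу. Я з радістю її читаю."
)

# Spans copied from the get_clusters(as_strings=False) output above
clusters = [
    [(0, 3), (13, 17), (66, 70), (83, 84)],
    [(0, 8), (18, 22), (58, 61), (71, 75)],
    [(18, 29), (42, 45)],
    [(71, 81), (95, 97)],
]


def spans_to_strings(text, clusters):
    """Slice the text with each (start, end) span, cluster by cluster."""
    return [[text[start:end] for start, end in cluster] for cluster in clusters]


for cluster in spans_to_strings(text, clusters):
    print(cluster)
```

Keeping the offsets rather than the strings is useful when the same surface form (for example "мені") occurs more than once and the positions need to be told apart.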
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was trained on the [silver Ukrainian coreference resolution dataset](https://huggingface.co/datasets/artemkramov/coreference-dataset-ua).
|
|
|
## Evaluation |
|
|
|
|
|
|
### Metrics
|
|
|
Two groups of metrics were computed: mention-detection metrics and coreference resolution metrics.
|
|
|
Mention-based metrics: |
|
- mention precision |
|
- mention recall |
|
- mention F1 |
|
|
|
Coreference resolution metrics, each averaged over MUC, BCubed, and CEAFE:
|
- coreference precision |
|
- coreference recall |
|
- coreference F1 |
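
To make the two metric groups concrete, here is a minimal sketch in plain Python. It assumes exact span matching for the mention metrics (the card does not state the matching criterion) and an unweighted mean over the three coreference metrics. Both function names are hypothetical, and all numbers are toy values rather than scores of this model.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two sets of spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_score(per_metric):
    """Unweighted mean of one score (e.g. F1) over several metrics."""
    return sum(per_metric.values()) / len(per_metric)


# Toy mention sets (illustrative only, not taken from the dataset)
gold_mentions = {(0, 3), (13, 17), (66, 70), (83, 84)}
predicted_mentions = {(0, 3), (13, 17), (0, 8)}

p, r, f1 = precision_recall_f1(predicted_mentions, gold_mentions)
print(f"mention P={p:.3f} R={r:.3f} F1={f1:.3f}")

# Placeholder per-metric F1 values (made up for illustration)
toy_f1 = {"muc": 0.6, "b_cubed": 0.7, "ceafe": 0.8}
print(f"mean coreference F1={mean_score(toy_f1):.3f}")
```

Note that a mean of per-metric F1 values is generally not the same as the harmonic mean of the averaged precision and recall, so the averaged F1 should be computed from the individual F1 scores.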
|
|
|
### Results |
|
|
|
The metrics for the validation dataset: |
|
|
|
| Metric                | Value |
|:----------------------|:------|
| Mention precision     | 0.850 |
| Mention recall        | 0.798 |
| Mention F1            | 0.824 |
| Coreference precision | 0.758 |
| Coreference recall    | 0.706 |
| Coreference F1        | 0.731 |
|
|
|
## Model Card Authors |
|
|
|
Artem Kramov (https://www.linkedin.com/in/artem-kramov-0b3731100/), Andrii Kursin (aqrsn@ukr.net) |