---
license: cc-by-nc-4.0
datasets:
- projecte-aina/ES-AST_Parallel_Corpus
language:
- es
- ast
metrics:
- bleu
- chrf
library_name: transformers
base_model:
- facebook/nllb-200-distilled-600M
---
## Projecte Aina’s Spanish-Asturian machine translation model

## Model description

This model was developed by the Language Technologies Unit at BSC for its participation in the WMT24 Shared Task: 
[Translation into Low-Resource Languages of Spain](https://www2.statmt.org/wmt24/romance-task.html). 
It results from a full fine-tuning of the [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model with a Spanish-Asturian corpus. 
Specifically, we used the [transformers library](https://huggingface.co/docs/transformers/) from Hugging Face and a filtered version 
of the [Spanish-Asturian dataset](https://huggingface.co/datasets/projecte-aina/ES-AST_Parallel_Corpus) to fine-tune the model. 
The model was evaluated using the Flores evaluation datasets. 
Please refer to the [paper](__poner_link___) for more information.

## Intended uses and limitations

You can use this model for machine translation from Spanish to Asturian.
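
A minimal usage sketch with the transformers library follows. It assumes the fine-tuned checkpoint keeps the NLLB-200 tokenizer and its language codes (`spa_Latn` for Spanish, `ast_Latn` for Asturian). `MODEL_ID` is a placeholder, since this card does not state the repository ID, and the helper names `load` and `translate` are illustrative only.

```python
from typing import List

# Assumption: placeholder value; replace it with this model's actual Hub repository ID.
MODEL_ID = "<this-model-repo-id>"
SRC_LANG = "spa_Latn"  # NLLB-200 language code for Spanish
TGT_LANG = "ast_Latn"  # NLLB-200 language code for Asturian


def load(model_id: str = MODEL_ID):
    """Load the tokenizer and model (transformers is imported lazily)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=SRC_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def translate(texts: List[str], tokenizer, model, max_new_tokens: int = 256) -> List[str]:
    """Translate a batch of Spanish sentences into Asturian."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    output_ids = model.generate(
        **inputs,
        # Force the decoder to start with the Asturian language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TGT_LANG),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Example (downloads the checkpoint, so it is left commented out):
# tokenizer, model = load()
# print(translate(["Hola, ¿cómo estás?"], tokenizer, model))
```

The `max_new_tokens` value of 256 is an arbitrary choice for this sketch, not a setting reported for the model.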

## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. 
However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future and will update this model card once that work is completed.

## Evaluation

### Variables and metrics

We report BLEU and ChrF scores on the [Flores+](https://github.com/openlanguagedata/flores) evaluation datasets.

### Evaluation results

Below are the evaluation results for Spanish-to-Asturian machine translation, compared with [Apertium](https://www.apertium.org/), 
[Eslema](https://eslema.it.uniovi.es/) and [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M):


| Test set (BLEU)      | Apertium | Eslema | NLLB-600M | Our model  |
|:---------------------|:---------|:-------|:----------|:-----------|
| Flores dev            | 16.66    | 17.30  | 17.23     | **19.33**  |
| Flores devtest        | 16.99    | 17.17  | 16.21     | **18.43**  |

| Test set (ChrF)       | Apertium | Eslema | NLLB-600M | Our model  |
|:---------------------|:---------|:-------|:----------|:-----------|
| Flores dev            | 50.57    | 50.77  | 49.72     | **52.26**  |
| Flores devtest        | 50.84    | 50.91  | 49.05     | **52.14**  |
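
To make the margins in the tables above easier to read, the following sketch computes, for each metric and test set, the gap between this model and the strongest baseline. The scores are copied from the tables; the names `RESULTS` and `gain_over_best_baseline` are illustrative and not part of any released tooling.

```python
# Scores copied from the evaluation tables in this card.
RESULTS = {
    "BLEU": {
        "Flores dev": {"Apertium": 16.66, "Eslema": 17.30, "NLLB-600M": 17.23, "Our model": 19.33},
        "Flores devtest": {"Apertium": 16.99, "Eslema": 17.17, "NLLB-600M": 16.21, "Our model": 18.43},
    },
    "ChrF": {
        "Flores dev": {"Apertium": 50.57, "Eslema": 50.77, "NLLB-600M": 49.72, "Our model": 52.26},
        "Flores devtest": {"Apertium": 50.84, "Eslema": 50.91, "NLLB-600M": 49.05, "Our model": 52.14},
    },
}


def gain_over_best_baseline(scores, ours="Our model"):
    """Return (best baseline name, our score minus its score, rounded to 2 decimals)."""
    baselines = {name: value for name, value in scores.items() if name != ours}
    best = max(baselines, key=baselines.get)
    return best, round(scores[ours] - baselines[best], 2)


if __name__ == "__main__":
    for metric, test_sets in RESULTS.items():
        for test_set, scores in test_sets.items():
            best, gain = gain_over_best_baseline(scores)
            print(f"{metric} {test_set}: +{gain:.2f} over {best}")
```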



## Additional information

### Paper
For further information, please refer to the [paper](__poner_link___) published for the Shared Task: Translation into Low-Resource Languages of Spain (WMT24).

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/)

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335, 2022/TL22/00215334.

The publication is part of the project PID2021-123988OB-C33, funded by MCIN/AEI/10.13039/501100011033/FEDER, EU.


### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a CC BY-NC 4.0 licence. 

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) 
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, 
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) 
be liable for any results arising from the use made by third parties.

</details>