Breton-French translator `m2m100_418M_br_fr`
============================================

This model is a fine-tuned version of
[facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) (Fan et al., 2021) on a
Breton-French parallel corpus. In order to obtain the best possible results, we use all of our
parallel data for training and consequently report no quantitative evaluation at this time.
Empirical qualitative evidence suggests that the translations are generally adequate for short and
simple examples; the behaviour of the model on long and/or complex inputs is currently unknown.

Try this model online in [Troer](https://huggingface.co/spaces/lgrobol/troer); feedback and
suggestions are welcome!
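
For local use, the model can be loaded through the usual 🤗 Transformers M2M100 interface. Below is
a minimal sketch, assuming the repository id `lgrobol/m2m100_418M_br_fr` (inferred from the card
title) and an arbitrary Breton input sentence:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Repository id assumed from the card title; adjust if it differs.
model_id = "lgrobol/m2m100_418M_br_fr"
tokenizer = M2M100Tokenizer.from_pretrained(model_id)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

# M2M100 models need the source language set on the tokenizer, and the target
# language forced as the first generated token.
tokenizer.src_lang = "br"
inputs = tokenizer("Demat d'an holl!", return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```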

## Model description

See the description of the [base model](https://huggingface.co/facebook/m2m100_418M).

## Intended uses & limitations

This is intended as a **demonstration** of the improvements brought by fine-tuning a large-scale
many-to-many translation system on a medium-sized dataset of high-quality data. As it is, and as far
as I can tell, it usually provides translations that are at least as good as those of other
available Breton-French translators, but it has not been evaluated quantitatively at a large scale.

## Training and evaluation data

The training dataset consists of:

- The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php)
- Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php)

These are obtained from the [OPUS](https://opus.nlpl.eu/) collection (Tiedemann, 2012) and filtered
using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see
[`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic, due to
the retraining of a statistical alignment model, but in my experience different runs tend to give
extremely similar results. Do not hesitate to reach out if you experience difficulties in using it
to collect the data.
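
The same corpora can in principle be re-collected by pointing OpusFilter at that configuration. A
minimal sketch, assuming OpusFilter is installed (`pip install opusfilter`) and that
[`dl_opus.yaml`](dl_opus.yaml) is in the working directory:

```python
import subprocess

# Run the OpusFilter pipeline described in dl_opus.yaml: this downloads the
# OPUS corpora and applies the configured filters, writing the outputs to the
# paths named in the configuration file.
subprocess.run(["opusfilter", "dl_opus.yaml"], check=True)
```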

In addition to these, the training dataset also includes parallel br/fr sentences, provided as
glosses in the [Arbres](https://arbres.iker.cnrs.fr) wiki (Jouitteau, 2022), obtained from their
conversion to Universal Dependencies in the Autogramm project.

## Training procedure

The training hyperparameters are those suggested by Adelani et al. (2022) in their [code
release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine
translation of several African languages.

More specifically, we use the [example training
script](https://github.com/huggingface/transformers/blob/06886d5a684228a695b29645993b3be55190bd9c/examples/pytorch/translation/run_translation.py)
provided by 🤗 Transformers for fine-tuning mBART with the following command:

```bash
python run_translation.py \
    ... \
    --save_steps 4096 \
    --fp16 \
    --num_train_epochs 4
```

### Training hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 5e-05
- `train_batch_size`: 8
- `eval_batch_size`: 8
- `seed`: 42
- `optimizer`: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `num_epochs`: 4.0
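
For reference, here are the same settings expressed through the 🤗 Transformers API rather than the
CLI script; a sketch, with an illustrative `output_dir`, in which the Adam `betas` and `epsilon`
above are simply the library defaults:

```python
from transformers import Seq2SeqTrainingArguments

# The hyperparameters listed above, as Seq2SeqTrainingArguments; Adam betas
# (0.9, 0.999) and epsilon 1e-08 are the defaults, so they are not set here.
training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100_418M_br_fr",  # illustrative output path
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=4.0,
    save_steps=4096,
    fp16=True,
)
```
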
### Framework versions

- Transformers 4.24.0
- PyTorch 1.13.0
- Datasets 2.6.1
- Tokenizers 0.13.1

## References

- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter,
  et al. 2022. “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African
  News Translation”. In Proceedings of the 2022 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, 3053–3070. Seattle, United
  States: Association for Computational Linguistics.
  <https://doi.org/10.18653/v1/2022.naacl-main.223>.
- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. “OpusFilter: A Configurable Parallel
  Corpus Filtering Toolbox”. In Proceedings of the 58th Annual Meeting of the Association for
  Computational Linguistics: System Demonstrations, 150–156. Online: Association for Computational
  Linguistics.
- Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep
  Baines, et al. 2021. “Beyond English-Centric Multilingual Machine Translation”. The Journal of
  Machine Learning Research 22 (1): 107:4839–107:4886.
- Jouitteau, Mélanie (ed.). 2009–2022. ARBRES, wikigrammaire des dialectes du breton et centre de
  ressources pour son étude linguistique formelle. IKER, CNRS. <http://arbres.iker.cnrs.fr>.
- Tiedemann, Jörg. 2012. “Parallel Data, Tools and Interfaces in OPUS”. In Proceedings of the 8th
  International Conference on Language Resources and Evaluation (LREC 2012).
- Tyers, Francis M. 2009. “Rule-based augmentation of training data in Breton-French statistical
  machine translation”. In Proceedings of the 13th Annual Conference of the European Association for
  Machine Translation, EAMT09, 213–218. Barcelona, Spain.