lgrobol committed
Commit 6b02420
1 Parent(s): eeaaffa

a few readme improvements

Files changed (1): README.md (+43 -24)
README.md CHANGED

Breton-French translator `m2m100_418M_br_fr`
============================================

This model is a fine-tuned version of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) (Fan et al., 2021) on a Breton-French parallel corpus. In order to obtain the best possible results, we use all of our parallel data for training and consequently report no quantitative evaluation at this time. Empirical qualitative evidence suggests that the translations are generally adequate for short and simple inputs; the behaviour of the model on long and/or complex inputs is currently unknown.

Try this model online in [Troer](https://huggingface.co/spaces/lgrobol/troer); feedback and suggestions are welcome!

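Since this is a standard M2M100 checkpoint, it can also be loaded and queried with 🤗 Transformers directly. The following is a minimal sketch, assuming the model is published under the `lgrobol/m2m100_418M_br_fr` identifier (inferred from this repository's name):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Assumed repository id, inferred from the model name in this card
model_id = "lgrobol/m2m100_418M_br_fr"
tokenizer = M2M100Tokenizer.from_pretrained(model_id)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

# M2M100 needs to know the source language of the input
tokenizer.src_lang = "br"
inputs = tokenizer("Demat d'an holl!", return_tensors="pt")

# Force French as the target language when generating
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.get_lang_id("fr")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```
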
## Model description

See the description of the [base model](https://huggingface.co/facebook/m2m100_418M).

## Intended uses & limitations

This is intended as a **demonstration** of the improvements brought by fine-tuning a large-scale many-to-many translation system on a medium-sized dataset of high-quality data. As far as I can tell, it usually provides translations that are at least as good as those of other available Breton-French translators, but this has not been evaluated quantitatively at a large scale.

## Training and evaluation data

The training dataset consists of:

- The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php)
- Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php)

These are obtained from the [OPUS](https://opus.nlpl.eu/) base (Tiedemann, 2012) and filtered using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see [`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic due to the retraining of a statistical alignment model, but in my experience different runs tend to give extremely similar results. Do not hesitate to reach out if you experience difficulties in using this to collect data.
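
As a rough illustration of what this kind of filtering does, here is a self-contained sketch of a length-ratio heuristic in the spirit of OpusFilter's `LengthRatioFilter`. It is not the actual pipeline (which is defined in [`dl_opus.yaml`](dl_opus.yaml)), and the corpus file names are hypothetical:

```python
# Illustrative sketch only: a crude length-ratio filter in the spirit of
# OpusFilter's LengthRatioFilter, not the actual dl_opus.yaml pipeline.

def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Keep a sentence pair if neither side is empty and the ratio of
    their token counts does not exceed max_ratio."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

# Hypothetical file names: one sentence per line, aligned by line number
with open("corpus.br-fr.br") as f_br, open("corpus.br-fr.fr") as f_fr:
    pairs = [
        (br.strip(), fr.strip())
        for br, fr in zip(f_br, f_fr)
        if keep_pair(br.strip(), fr.strip())
    ]
```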
 
In addition to these, the training dataset also includes parallel br/fr sentences, provided as glosses in the [Arbres](https://arbres.iker.cnrs.fr) wiki (Jouitteau, 2022), obtained from their […] to Universal Dependencies in the Autogramm project.

## Training procedure

The training hyperparameters are those suggested by Adelani et al. (2022) in their [code release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine translation of several African languages.

More specifically, we use the [example training script](https://github.com/huggingface/transformers/blob/06886d5a684228a695b29645993b3be55190bd9c/examples/pytorch/translation/run_translation.py) provided by 🤗 Transformers for fine-tuning mBART and similar multilingual models (including M2M100), with the following command:

```bash
python run_translation.py \
    […] \
    --save_steps 4096 \
    --fp16 \
    --num_train_epochs 4
```

### Training hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 5e-05
- `train_batch_size`: 8
- `eval_batch_size`: 8
- `seed`: 42
- `optimizer`: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `num_epochs`: 4.0
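
For reference, here is how the same settings would look when configured directly in Python rather than through the command line. This is a sketch, not the script actually used; the output directory is a made-up name, and the Adam parameters and linear schedule listed above are the 🤗 Trainer defaults:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above. Adam with betas=(0.9, 0.999)
# and epsilon=1e-08 plus a linear LR schedule are the Trainer defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100_418M_br_fr",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=4.0,
    save_steps=4096,
    fp16=True,
)
```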
 
### Framework versions

- Transformers 4.24.0
- PyTorch 1.13.0
- Datasets 2.6.1
- Tokenizers 0.13.1

## References

- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, et al. 2022. “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation”. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3053–3070. Seattle, United States: Association for Computational Linguistics. <https://doi.org/10.18653/v1/2022.naacl-main.223>.
- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. “OpusFilter: A Configurable Parallel Corpus Filtering Toolbox”. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 150–156. Online: Association for Computational Linguistics.
- Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, et al. 2021. “Beyond English-Centric Multilingual Machine Translation”. The Journal of Machine Learning Research 22 (1): 107:4839–107:4886.
- Jouitteau, Mélanie (ed.). 2009–2022. ARBRES, wikigrammaire des dialectes du breton et centre de ressources pour son étude linguistique formelle. IKER, CNRS. <http://arbres.iker.cnrs.fr>.
- Tiedemann, Jörg. 2012. “Parallel Data, Tools and Interfaces in OPUS”. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).
- Tyers, Francis M. 2009. “Rule-based augmentation of training data in Breton-French statistical machine translation”. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT09), 213–218. Barcelona, Spain.