Breton-French translator `m2m100_418M_br_fr`
============================================

This model is a fine-tuned version of
[facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) (Fan et al., 2021) on a
Breton-French parallel corpus. In order to obtain the best possible results, we use all of our
parallel data for training and consequently report no quantitative evaluation at this time.
Empirical qualitative evidence suggests that the translations are generally adequate for short and
simple examples; the behaviour of the model on long and/or complex inputs is currently unknown.

Try this model online in [Troer](https://huggingface.co/spaces/lgrobol/troer); feedback and
suggestions are welcome!
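
For local use, the model can be loaded through the usual 🤗 Transformers M2M100 interface. Below is
a minimal sketch, assuming the repository id `lgrobol/m2m100_418M_br_fr` (inferred from the card
title) and an arbitrary Breton input sentence:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Repository id assumed from the card title; adjust if it differs.
model_id = "lgrobol/m2m100_418M_br_fr"
tokenizer = M2M100Tokenizer.from_pretrained(model_id)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

# M2M100 models need the source language set on the tokenizer, and the target
# language forced as the first generated token.
tokenizer.src_lang = "br"
inputs = tokenizer("Demat d'an holl!", return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```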

## Model description

See the description of the [base model](https://huggingface.co/facebook/m2m100_418M).

## Intended uses & limitations

This is intended as a **demonstration** of the improvements brought by fine-tuning a large-scale
many-to-many translation system on a medium-sized dataset of high-quality data. As it is, and as far
as I can tell, it usually provides translations that are at least as good as those of other
available Breton-French translators, but it has not been evaluated quantitatively at a large scale.

## Training and evaluation data

The training dataset consists of:

- The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php)
- Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php)

These are obtained from the [OPUS](https://opus.nlpl.eu/) collection (Tiedemann, 2012) and filtered
using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see
[`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic, due to
the retraining of a statistical alignment model, but in my experience different runs tend to give
extremely similar results. Do not hesitate to reach out if you experience difficulties in using it
to collect the data.
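
The same corpora can in principle be re-collected by pointing OpusFilter at that configuration. A
minimal sketch, assuming OpusFilter is installed (`pip install opusfilter`) and that
[`dl_opus.yaml`](dl_opus.yaml) is in the working directory:

```python
import subprocess

# Run the OpusFilter pipeline described in dl_opus.yaml: this downloads the
# OPUS corpora and applies the configured filters, writing the outputs to the
# paths named in the configuration file.
subprocess.run(["opusfilter", "dl_opus.yaml"], check=True)
```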

In addition to these, the training dataset also includes parallel br/fr sentences, provided as
glosses in the [Arbres](https://arbres.iker.cnrs.fr) wiki (Jouitteau, 2022), obtained from their
conversion to Universal Dependencies in the Autogramm project.

## Training procedure

The training hyperparameters are those suggested by Adelani et al. (2022) in their [code
release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine
translation of several African languages.

More specifically, we use the [example training
script](https://github.com/huggingface/transformers/blob/06886d5a684228a695b29645993b3be55190bd9c/examples/pytorch/translation/run_translation.py)
provided by 🤗 Transformers for fine-tuning mBART with the following command:

```bash
python run_translation.py \
    ... \
    --save_steps 4096 \
    --fp16 \
    --num_train_epochs 4
```

### Training hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 5e-05
- `train_batch_size`: 8
- `eval_batch_size`: 8
- `seed`: 42
- `optimizer`: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `num_epochs`: 4.0
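
For reference, here are the same settings expressed through the 🤗 Transformers API rather than the
CLI script; a sketch, with an illustrative `output_dir`, in which the Adam `betas` and `epsilon`
above are simply the library defaults:

```python
from transformers import Seq2SeqTrainingArguments

# The hyperparameters listed above, as Seq2SeqTrainingArguments; Adam betas
# (0.9, 0.999) and epsilon 1e-08 are the defaults, so they are not set here.
training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100_418M_br_fr",  # illustrative output path
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=4.0,
    save_steps=4096,
    fp16=True,
)
```
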
### Framework versions

- Transformers 4.24.0
- PyTorch 1.13.0
- Datasets 2.6.1
- Tokenizers 0.13.1

## References

- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter,
  et al. 2022. “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African
  News Translation”. In Proceedings of the 2022 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, 3053–3070. Seattle, United
  States: Association for Computational Linguistics.
  <https://doi.org/10.18653/v1/2022.naacl-main.223>.
- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. “OpusFilter: A Configurable Parallel
  Corpus Filtering Toolbox”. In Proceedings of the 58th Annual Meeting of the Association for
  Computational Linguistics: System Demonstrations, 150–156. Online: Association for Computational
  Linguistics.
- Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep
  Baines, et al. 2021. “Beyond English-Centric Multilingual Machine Translation”. The Journal of
  Machine Learning Research 22 (1): 107:4839–107:4886.
- Jouitteau, Mélanie (ed.). 2009–2022. ARBRES, wikigrammaire des dialectes du breton et centre de
  ressources pour son étude linguistique formelle. IKER, CNRS. <http://arbres.iker.cnrs.fr>.
- Tiedemann, Jörg. 2012. “Parallel Data, Tools and Interfaces in OPUS”. In Proceedings of the 8th
  International Conference on Language Resources and Evaluation (LREC 2012).
- Tyers, Francis M. 2009. “Rule-based augmentation of training data in Breton-French statistical
  machine translation”. In Proceedings of the 13th Annual Conference of the European Association for
  Machine Translation, EAMT09, 213–218. Barcelona, Spain.