|
--- |
|
language: |
|
- br |
|
- fr |
|
license: mit |
|
tags: |
|
- translation |
|
model-index:
- name: m2m100_br_fr
  results: []
co2_eq_emissions:
  emissions: 2100
  source: "https://mlco2.github.io/impact"
  training_type: "fine-tuning"
  geographical_location: "Paris, France"
  hardware_used: "2 NVidia GeForce RTX 3090 GPUs"
|
--- |
|
|
|
Breton-French translator `m2m100_418M_br_fr` |
|
============================================ |
|
|
|
This model is a fine-tuned version of |
|
[facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) (Fan et al., 2021) on a |
|
Breton-French parallel corpus. In order to obtain the best possible results, we use all of our
parallel data for training and consequently report no quantitative evaluation at this time.
Empirical qualitative evidence suggests that the translations are generally adequate for short and
simple inputs; the behaviour of the model on long and/or complex inputs is currently unknown.
|
|
|
Try this model online in [Troer](https://huggingface.co/spaces/lgrobol/troer); feedback and
suggestions are welcome!
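
A minimal usage sketch with 🤗 Transformers is shown below; `{path_to_model}` is a placeholder for
wherever this checkpoint is stored (locally or on the Hub), and the example sentence is only
illustrative.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# {path_to_model} is a placeholder for the location of this fine-tuned checkpoint.
tokenizer = M2M100Tokenizer.from_pretrained("{path_to_model}")
model = M2M100ForConditionalGeneration.from_pretrained("{path_to_model}")

tokenizer.src_lang = "br"  # source language: Breton
inputs = tokenizer("Demat d'an holl!", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("fr"),  # force French as the target language
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```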
|
|
|
## Model description |
|
|
|
See the description of the [base model](https://huggingface.co/facebook/m2m100_418M). |
|
|
|
## Intended uses & limitations |
|
|
|
This is intended as a **demonstration** of the improvements brought by fine-tuning a large-scale |
|
many-to-many translation system on a medium-sized dataset of high-quality data. As it is, and as far |
|
as I can tell, it usually provides translations that are at least as good as those of other
available Breton-French translators, but it has not been evaluated quantitatively at a large scale.
|
|
|
## Training and evaluation data |
|
|
|
The training dataset consists of: |
|
|
|
- The [OfisPublik corpus v1](https://opus.nlpl.eu/OfisPublik-v1.php) (Tyers, 2009) |
|
- The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php) |
|
- Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php) |
|
|
|
These are obtained from the [OPUS](https://opus.nlpl.eu/) collection (Tiedemann, 2012) and filtered
using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see
[`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic, since
it retrains a statistical alignment model, but in my experience different runs give extremely
similar results. Do not hesitate to reach out if you run into difficulties when using this
configuration to collect the data.
|
|
|
In addition to these, the training dataset also includes parallel Breton-French sentences provided
as glosses in the [Arbres](https://arbres.iker.cnrs.fr) wiki (Jouitteau, 2022), obtained from their
[ongoing port](https://github.com/Autogramm/Breton/commit/45ac2c444a979b7ee41e5f24a3bfd1ec39f09d7d)
to Universal Dependencies in the Autogramm project.
|
|
|
## Training procedure |
|
|
|
The training hyperparameters are those suggested by Adelani et al. (2022) in their [code |
|
release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine |
|
translation of several African languages. |
|
|
|
More specifically, we use the [example training |
|
script](https://github.com/huggingface/transformers/blob/06886d5a684228a695b29645993b3be55190bd9c/examples/pytorch/translation/run_translation.py) |
|
provided by 🤗 Transformers for fine-tuning multilingual translation models such as mBART and
M2M100, with the following command:
|
|
|
```bash |
|
python run_translation.py \ |
|
--model_name_or_path facebook/m2m100_418M \ |
|
--do_train \ |
|
--train_file {path_to_training_data} \ |
|
--source_lang br \ |
|
--target_lang fr \ |
|
--output_dir {path_to_model} \
|
--per_device_train_batch_size=8 \ |
|
--overwrite_output_dir \ |
|
--forced_bos_token fr \ |
|
--save_steps 4096 \ |
|
--fp16 \ |
|
--num_train_epochs 4 |
|
``` |
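
The file passed as `--train_file` follows the JSON Lines layout used by `run_translation.py`: one
JSON object per line with a `"translation"` field keyed by language codes. A minimal sketch of
producing such a file is given below; the file name and sentence pairs are purely illustrative.

```python
import json

# Purely illustrative: write a few Breton-French pairs in the JSON Lines layout
# expected by run_translation.py, i.e. {"translation": {"br": ..., "fr": ...}}.
pairs = [
    ("Demat d'an holl!", "Bonjour à tous !"),
    ("Trugarez deoc'h.", "Merci à vous."),
]
with open("train_br_fr.json", "w", encoding="utf-8") as out_stream:
    for br_sent, fr_sent in pairs:
        record = {"translation": {"br": br_sent, "fr": fr_sent}}
        out_stream.write(json.dumps(record, ensure_ascii=False) + "\n")
```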
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
|
|
- `learning_rate`: 5e-05 |
|
- `train_batch_size`: 8 |
|
- `eval_batch_size`: 8 |
|
- `seed`: 42 |
|
- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- `lr_scheduler_type`: linear |
|
- `num_epochs`: 4.0 |
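
For reference, and assuming the 🤗 Transformers `Seq2SeqTrainingArguments` API, these settings
correspond roughly to the following sketch; the output directory is a placeholder, and the Adam
betas and epsilon are simply the defaults.

```python
from transformers import Seq2SeqTrainingArguments

# Rough equivalent of the hyperparameters listed above; {path_to_model} is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="{path_to_model}",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=4.0,
    save_steps=4096,
    fp16=True,
)
```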
|
|
|
### Framework versions |
|
|
|
- Transformers 4.24.0 |
|
- PyTorch 1.13.0
|
- Datasets 2.6.1 |
|
- Tokenizers 0.13.1 |
|
|
|
### Carbon emissions |
|
|
|
At this time, we estimate emissions of roughly 300 gCO<sub>2</sub> per fine-tuning run. So far, we
account for
|
|
|
- Fine-tuning the 2 released versions |
|
- 5 development runs |
|
|
|
That is 7 runs in total, so the equivalent carbon emissions for this model are approximately
7 × 300 = 2100 gCO<sub>2</sub>.
|
|
|
## References |
|
|
|
- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, |
|
et al. 2022. “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African |
|
News Translation”. In Proceedings of the 2022 Conference of the North American Chapter of the |
|
Association for Computational Linguistics: Human Language Technologies, 3053‑70. Seattle, United |
|
States: Association for Computational Linguistics. |
|
<https://doi.org/10.18653/v1/2022.naacl-main.223>. |
|
- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. “OpusFilter: A Configurable Parallel
Corpus Filtering Toolbox”. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, 150–156. Online: Association for Computational
Linguistics.
|
- Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep |
|
Baines, et al. 2021. “Beyond English-Centric Multilingual Machine Translation”. The Journal of
|
Machine Learning Research 22 (1): 107:4839-107:4886. |
|
- Tiedemann, Jörg. 2012. “Parallel Data, Tools and Interfaces in OPUS”. In Proceedings of the 8th
International Conference on Language Resources and Evaluation (LREC 2012).
|
- Jouitteau, Mélanie (ed.). 2009–2022. ARBRES, wikigrammaire des dialectes du breton et centre de
ressources pour son étude linguistique formelle [wiki grammar of the Breton dialects and resource
centre for their formal linguistic study]. IKER, CNRS. <http://arbres.iker.cnrs.fr>.
|
- Tyers, Francis M. 2009. “Rule-based augmentation of training data in Breton-French statistical
machine translation”. In Proceedings of the 13th Annual Conference of the European Association for
Machine Translation (EAMT 2009), 213–218. Barcelona, Spain.
|
|