|
---
language:
  - br
  - fr
license: mit
tags:
  - translation
model-index:
  - name: m2m100_br_fr
    results: []
co2_eq_emissions:
  emissions: 3300
  source: "https://mlco2.github.io/impact"
  training_type: "fine-tuning"
  geographical_location: "Paris, France"
  hardware_used: "2 NVIDIA GeForce RTX 3090 GPUs"
---
|
|
|
Breton-French translator `m2m100_418M_br_fr` |
|
============================================ |
|
|
|
This model is a fine-tuned version of
[facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) (Fan et al., 2021) on a
Breton-French parallel corpus. In order to obtain the best possible results, we use all of our
parallel data for training and consequently report no quantitative evaluation at this time.
Qualitative evidence suggests that the translations are generally adequate for short and simple
inputs; the behaviour of the model on long and/or complex inputs is currently unknown.
|
|
|
Try this model online in [Troer](https://huggingface.co/spaces/lgrobol/troer); feedback and
suggestions are welcome!
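
For local use, the model can be loaded with the usual M2M100 classes from the Transformers
library. The snippet below is a minimal sketch; the repository id `lgrobol/m2m100_br_fr` is an
assumption and should be replaced with the actual id of this model on the Hub.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Assumed repository id, replace with the actual id of this model on the Hub
model_id = "lgrobol/m2m100_br_fr"

tokenizer = M2M100Tokenizer.from_pretrained(model_id)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

# M2M100 needs the source language set on the tokenizer and the target
# language forced as the first generated token
tokenizer.src_lang = "br"
inputs = tokenizer("Demat d'an holl !", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("fr"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```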
|
|
|
## Model description |
|
|
|
See the description of the [base model](https://huggingface.co/facebook/m2m100_418M). |
|
|
|
## Intended uses & limitations |
|
|
|
This is intended as a **demonstration** of the improvements brought by fine-tuning a large-scale
many-to-many translation system on a medium-sized dataset of high-quality data. As it is, and as far
as I can tell, it usually provides translations that are at least as good as those of other available
Breton-French translators, but it has not been evaluated quantitatively at a large scale.
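
Nothing prevents you from evaluating it on your own held-out Breton-French data, though. The
sketch below is an illustration, not an evaluation protocol used for this model: it assumes you
already have aligned lists of model outputs and reference translations, and uses
[sacrebleu](https://github.com/mjpost/sacrebleu) to compute BLEU and chrF.

```python
import sacrebleu

# Hypothetical example data: model outputs and reference translations,
# aligned sentence by sentence
hypotheses = ["Bonjour à tous !", "Il fait beau aujourd'hui."]
references = ["Bonjour à toutes et à tous !", "Le temps est beau aujourd'hui."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```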
|
|
|
## Training and evaluation data |
|
|
|
The training dataset consists of: |
|
|
|
- The [OfisPublik corpus v1](https://opus.nlpl.eu/OfisPublik-v1.php) (Tyers, 2009) |
|
- The [Tatoeba corpus v2022-03-03](https://opus.nlpl.eu/Tatoeba-v2022-03-03.php) |
|
- Part of the [OpenSubtitles corpus v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php) |
|
|
|
These are obtained from the [OPUS](https://opus.nlpl.eu/) collection (Tiedemann, 2012) and filtered
using [OpusFilter](https://helsinki-nlp.github.io/OpusFilter) (Aulamo et al., 2020); see
[`dl_opus.yaml`](dl_opus.yaml) for the details. The filtering is slightly non-deterministic due to
the retraining of a statistical alignment model, but in my experience, different runs tend to give
extremely similar results. Do not hesitate to reach out if you run into difficulties using this
configuration to collect the data.
|
|
|
In addition to these, the training dataset includes parallel Breton-French sentences provided as
glosses in the [Arbres](https://arbres.iker.cnrs.fr) wiki (Jouitteau, 2022), obtained from their
[ongoing port](https://github.com/Autogramm/Breton/commit/45ac2c444a979b7ee41e5f24a3bfd1ec39f09d7d)
to Universal Dependencies in the Autogramm project.
|
|
|
## Training procedure |
|
|
|
The training hyperparameters are those suggested by Adelani et al. (2022) in their [code |
|
release](https://github.com/masakhane-io/lafand-mt), which gave their best results for machine |
|
translation of several African languages. |
|
|
|
More specifically, we train this model with [zeldarose](https://github.com/LoicGrobol/zeldarose) using the following command:
|
|
|
```bash
zeldarose transformer \
    --config train_config.toml \
    --tokenizer "facebook/m2m100_418M" --pretrained-model "facebook/m2m100_418M" \
    --out-dir m2m100_418M+br-fr --model-name m2m100_418M+br-fr \
    --strategy ddp --accelerator gpu --num-devices 4 --device-batch-size 2 --num-workers 8 \
    --max-epochs 16 --precision 16 --tf32-mode medium \
    --val-data {val_path}.jsonl \
    {train_path}.jsonl
```
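
Here `{train_path}` and `{val_path}` are placeholders for the paths to the training and validation
sets, both in the JSONL format expected by zeldarose.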
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
|
|
```toml |
|
[task] |
|
change_ratio = 0.3 |
|
denoise_langs = [] |
|
poisson_lambda = 3.0 |
|
source_langs = ["br"] |
|
target_langs = ["fr"] |
|
|
|
[tuning] |
|
batch_size = 16 |
|
betas = [0.9, 0.999] |
|
epsilon = 1e-8 |
|
learning_rate = 5e-5 |
|
gradient_clipping = 1.0 |
|
lr_decay_steps = -1 |
|
warmup_steps = 1024 |
|
``` |
|
|
|
### Framework versions |
|
|
|
- Transformers 4.26.1 |
|
- Pytorch 1.12.1 |
|
- Datasets 2.10.0 |
|
- Tokenizers 0.13.2 |
|
- Pytorch-lightning 1.9.3 |
|
- Zeldarose [c6456ead](https://github.com/LoicGrobol/spertiniite/commit/c6456ead3649c4e6ddfb4a5a74b40f344eded09f) |
|
|
|
### Carbon emissions |
|
|
|
At this time, we estimate emissions of roughly 300 gCO<sub>2</sub>eq per fine-tuning run. So far, we
account for

- Fine-tuning the 3 released versions
- 8 development runs

This amounts to 11 runs in total, so the equivalent carbon emissions for this model are so far
approximately 3300 gCO<sub>2</sub>eq.
|
|
|
## References |
|
|
|
- Adelani, David, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, |
|
et al. 2022. “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African |
|
News Translation”. In Proceedings of the 2022 Conference of the North American Chapter of the |
|
Association for Computational Linguistics: Human Language Technologies, 3053‑70. Seattle, United |
|
States: Association for Computational Linguistics. |
|
<https://doi.org/10.18653/v1/2022.naacl-main.223>. |
|
- Aulamo, Mikko, Sami Virpioja, and Jörg Tiedemann. 2020. “OpusFilter: A Configurable Parallel
  Corpus Filtering Toolbox”. In Proceedings of the 58th Annual Meeting of the Association for
  Computational Linguistics: System Demonstrations, 150–156. Online: Association for Computational
  Linguistics.
|
- Fan, Angela, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep
  Baines, et al. 2021. “Beyond English-centric multilingual machine translation”. The Journal of
  Machine Learning Research 22 (1): 107:4839-107:4886.
|
- Tiedemann, Jörg. 2012. “Parallel Data, Tools and Interfaces in OPUS”. In Proceedings of the 8th
  International Conference on Language Resources and Evaluation (LREC 2012).
|
- Jouitteau, Mélanie (ed.). 2009-2022. ARBRES, wikigrammaire des dialectes du breton et centre de
  ressources pour son étude linguistique formelle, IKER, CNRS, <http://arbres.iker.cnrs.fr>.
|
- Tyers, Francis M. 2009. “Rule-based augmentation of training data in Breton-French statistical
  machine translation”. In Proceedings of the 13th Annual Conference of the European Association for
  Machine Translation, EAMT09, 213–218. Barcelona, Spain.
|
|