|
--- |
|
license: apache-2.0 |
|
language: |
|
- fr |
|
library_name: transformers |
|
tags: |
|
- mbart |
|
- orfeo |
|
- pytorch |
|
- pictograms |
|
- translation |
|
metrics: |
|
- sacrebleu |
|
inference: false |
|
--- |
|
|
|
# t2p-mbart-large-cc25-orfeo |
|
|
|
*t2p-mbart-large-cc25-orfeo* is a text-to-pictograms translation model built by fine-tuning the [mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) model on a dataset of pairs of transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
|
The model is intended for **inference** only.
|
|
|
## Training details |
|
|
|
The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/mbart/README.md). |
|
|
|
### Datasets |
|
|
|
The model was trained on the [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-Orféo corpus.
|
This dataset was presented in the research paper titled ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024. The dataset was split into training, validation, and test sets.
|
| **Split** | **Number of utterances** | |
|
|:-----------:|:-----------------------:| |
|
| train | 231,374 | |
|
| valid | 28,796 | |
|
| test | 29,009 | |
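
Before training, the parallel data has to be binarized into the `$DATA` directory expected by `fairseq-train`. The card does not include the preprocessing script, so the following is a minimal sketch based on the standard mBART fine-tuning recipe; the file names, `$DATA` path, and the reuse of the mBART SentencePiece model and dictionary are assumptions, not the exact Propicto pipeline:

```bash
# Sketch only: tokenize both sides with the mBART SentencePiece model,
# then binarize with fairseq-preprocess, reusing the pretrained dictionary.
MODEL=mbart.cc25.v2/sentence.bpe.model
DICT=mbart.cc25.v2/dict.txt
DATA=orfeo_data/data

for split in train valid test; do
  for lang in fr frp; do
    spm_encode --model=$MODEL --output_format=piece \
      < $split.$lang > $split.spm.$lang
  done
done

fairseq-preprocess \
  --source-lang fr --target-lang frp \
  --trainpref train.spm --validpref valid.spm --testpref test.spm \
  --destdir $DATA \
  --srcdict $DICT --tgtdict $DICT \
  --workers 8
```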
|
|
|
### Parameters |
|
|
|
These are the arguments used in the training pipeline:
|
|
|
```bash |
|
fairseq-train $DATA \ |
|
--encoder-normalize-before --decoder-normalize-before \ |
|
--arch mbart_large --layernorm-embedding \ |
|
--task translation_from_pretrained_bart \ |
|
--source-lang fr --target-lang frp \ |
|
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \ |
|
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \ |
|
--lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \ |
|
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \ |
|
--max-tokens 1024 --update-freq 2 \ |
|
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 5 \ |
|
--seed 222 --log-format simple --log-interval 2 \ |
|
--langs fr \ |
|
--ddp-backend legacy_ddp \ |
|
--max-epoch 40 \ |
|
--save-dir models/checkpoints/mt_mbart_fr_frp_orfeo \ |
|
--keep-best-checkpoints 5 \ |
|
--keep-last-epochs 5 |
|
``` |
|
|
|
### Evaluation |
|
|
|
The model was evaluated with sacreBLEU by comparing the reference pictogram translations with the model hypotheses.
|
|
|
```bash |
|
fairseq-generate orfeo_data/data/ \ |
|
--path $model_dir/checkpoint_best.pt \ |
|
--task translation_from_pretrained_bart \ |
|
--gen-subset test \ |
|
-t frp -s fr \ |
|
--bpe 'sentencepiece' --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \ |
|
--sacrebleu \ |
|
--batch-size 32 --langs $langs > out.txt |
|
``` |
|
The output file contains the following information:
|
```txt |
|
S-27886 ça sera tout madame<unk> |
|
T-27886 prochain celle-là être tout monsieur |
|
H-27886 -0.2824968993663788 ▁prochain ▁celle - là ▁être ▁tout ▁monsieur |
|
D-27886 -0.2824968993663788 prochain celle-là être tout monsieur |
|
P-27886 -0.5773 -0.1780 -0.2587 -0.2361 -0.2726 -0.3167 -0.1312 -0.3103 -0.2615 |
|
Generate test with beam=5: BLEU4 = 75.62, 85.7/78.9/73.9/69.3 (BP=0.986, ratio=0.986, syslen=407923, reflen=413636) |
|
``` |
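
In this output, `S` is the source utterance, `T` the reference pictogram sequence, `H` and `D` the tokenized and detokenized hypotheses with their scores, and `P` the per-token log-probabilities; the last line is the corpus-level BLEU summary. A small sketch (assuming the output file is named `out.txt`) for pulling out the hypotheses, references, and the BLEU line:

```bash
# Sketch only: fairseq-generate output lines are tab-separated.
grep ^D- out.txt | cut -f3 > hypotheses.txt   # detokenized hypotheses
grep ^T- out.txt | cut -f2 > references.txt   # reference pictogram sequences
grep ^Generate out.txt                        # final "BLEU4 = ..." summary
```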
|
|
|
### Results |
|
|
|
Comparison to other translation models (sacreBLEU scores):

| **Model** | **validation (BLEU)** | **test (BLEU)** |

|:-----------:|:-----------------------:|:-----------------------:|
|
| t2p-t5-large-orféo | 85.2 | 85.8 | |
|
| t2p-nmt-orféo | **87.2** | **87.4** | |
|
| **t2p-mbart-large-cc25-orfeo** | 75.2 | 75.6 | |
|
| t2p-nllb-200-distilled-600M-orfeo | 86.3 | 86.9 | |
|
|
|
### Environmental Impact |
|
|
|
Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took 18 hours in total.
|
|
|
## Using t2p-mbart-large-cc25-orfeo model |
|
|
|
The scripts to use the *t2p-mbart-large-cc25-orfeo* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms). |
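
Since the checkpoint is a Fairseq model, one way to translate new French sentences is `fairseq-interactive` with the same task, BPE model, and language settings as the generation command above. This is a sketch, not the exact script from the repository; the data directory, `$model_dir`, and `$langs` are placeholders:

```bash
# Sketch only: translate a French sentence into pictogram tokens.
echo "ça sera tout madame" | fairseq-interactive orfeo_data/data/ \
  --path $model_dir/checkpoint_best.pt \
  --task translation_from_pretrained_bart \
  -s fr -t frp \
  --bpe sentencepiece --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \
  --langs $langs \
  --beam 5
```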
|
|
|
## Information |
|
|
|
- **Language(s):** French |
|
- **License:** Apache-2.0 |
|
- **Developed by:** Cécile Macaire |
|
- **Funded by:**
|
- GENCI-IDRIS (Grant 2023-AD011013625R1) |
|
- PROPICTO ANR-20-CE93-0005 |
|
- **Authors:**
|
- Cécile Macaire |
|
- Chloé Dion |
|
- Emmanuelle Esperança-Rodier |
|
- Benjamin Lecouteux |
|
- Didier Schwab |
|
|
|
|
|
## Citation |
|
|
|
If you use this model for your own research work, please cite as follows: |
|
|
|
```bibtex |
|
@inproceedings{macaire_jeptaln2024, |
|
title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}}, |
|
author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle}, |
|
url = {https://inria.hal.science/hal-04623007}, |
|
booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}}, |
|
address = {Toulouse, France}, |
|
publisher = {{ATALA \& AFPC}}, |
|
volume = {1 : articles longs et prises de position}, |
|
pages = {22-35}, |
|
year = {2024} |
|
} |
|
``` |