t2p-mbart-large-cc25-commonvoice

t2p-mbart-large-cc25-commonvoice is a text-to-pictograms translation model built by fine-tuning the mbart-large-cc25 model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from ARASAAC). The model is used only for inference.

Training details

The model was trained with Fairseq.

Datasets

The model was trained on the Propicto-commonvoice dataset, which was created from the CommonVoice v.15.0 corpus. The dataset was built with the method described in the paper "A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation", presented at LREC-Coling 2024. It was split into training, validation, and test sets.

Split    Number of utterances
train    527,390
valid    16,124
test     16,120

Parameters

The training pipeline was run with the following arguments:

fairseq-train $DATA \
  --encoder-normalize-before --decoder-normalize-before \
  --arch mbart_large --layernorm-embedding \
  --task translation_from_pretrained_bart \
  --source-lang fr --target-lang frp \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 5 \
  --seed 222 --log-format simple --log-interval 2 \
  --langs $langs \
  --ddp-backend legacy_ddp \
  --max-epoch 40 \
  --save-dir models/checkpoints/mt_mbart_fr_frp_commonvoice_langs \
  --keep-best-checkpoints 5 \
  --keep-last-epochs 5
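
To make the learning-rate settings above more concrete, here is a minimal Python sketch of a linear warmup followed by polynomial decay, using the values from the command (--lr 3e-05, --warmup-updates 2500, --total-num-update 40000). The function name polynomial_decay_lr is hypothetical, and the end learning rate of 0.0 and decay power of 1.0 are assumptions, since the command does not set them. This is an illustration of the schedule's shape, not Fairseq's own code.

```python
def polynomial_decay_lr(step, peak_lr=3e-05, warmup_updates=2500,
                        total_num_update=40000, end_lr=0.0, power=1.0):
    """Learning rate at a given update: linear warmup, then polynomial decay.

    end_lr and power are assumed values (not set in the training command).
    """
    if warmup_updates > 0 and step <= warmup_updates:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_updates
    if step >= total_num_update:
        # After the last scheduled update the rate stays at its final value.
        return end_lr
    # Fraction of the decay phase that is still ahead of us.
    remaining = 1 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (peak_lr - end_lr) * remaining ** power + end_lr


if __name__ == "__main__":
    for step in (0, 1250, 2500, 21250, 40000):
        print(f"update {step:>6}: lr = {polynomial_decay_lr(step):.2e}")
```

Under these assumptions the rate peaks at 3e-05 after 2,500 updates and is back to half of that value midway through the decay phase (update 21,250).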

Evaluation

The model was evaluated with sacreBLEU by comparing the model hypothesis against the reference pictogram translation.

fairseq-generate commonvoice_data/data/ \
  --path $model_dir/checkpoint_best.pt \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  -t frp -s fr \
  --bpe 'sentencepiece' --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \
  --sacrebleu \
  --batch-size 32 --langs $langs > out.txt

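For each test sentence, the output file contains the source (S), the reference translation (T), the tokenized hypothesis with its score (H), the detokenized hypothesis (D), and the per-token scores (P). The last line reports the corpus-level BLEU score: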

S-1071	cette collaboration dure trois ans<unk>
T-1071	le collaboration durer 3 année
H-1071	-0.2111533135175705	▁le ▁collaboration ▁dur er ▁3 ▁année
D-1071	-0.2111533135175705	le collaboration durer 3 année
P-1071	-0.2783 -0.0584 -0.2309 -0.2009 -0.2145 -0.1210 -0.3330 -0.2523
Generate test with beam=5: BLEU4 = 72.31, 84.3/77.4/72.3/67.7 (BP=0.962, ratio=0.963, syslen=227722, reflen=236545)
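
The output shown above can be post-processed with a few lines of Python. The sketch below groups the tab-separated S/T/H/D/P lines by sentence id and rebuilds the detokenized hypothesis from the SentencePiece tokens. The names parse_generate_output and detokenize are hypothetical helpers, not part of Fairseq, and the sample string simply reproduces the excerpt above.

```python
from collections import defaultdict

# Excerpt of out.txt as shown above (fields are tab-separated).
SAMPLE = "\n".join([
    "S-1071\tcette collaboration dure trois ans<unk>",
    "T-1071\tle collaboration durer 3 année",
    "H-1071\t-0.2111533135175705\t▁le ▁collaboration ▁dur er ▁3 ▁année",
    "D-1071\t-0.2111533135175705\tle collaboration durer 3 année",
    "P-1071\t-0.2783 -0.0584 -0.2309 -0.2009 -0.2145 -0.1210 -0.3330 -0.2523",
    "Generate test with beam=5: BLEU4 = 72.31, 84.3/77.4/72.3/67.7 "
    "(BP=0.962, ratio=0.963, syslen=227722, reflen=236545)",
])


def detokenize(pieces):
    """Join SentencePiece pieces: drop spaces, turn the '▁' marker into a space."""
    return "".join(pieces.split()).replace("▁", " ").strip()


def parse_generate_output(text):
    """Group S/T/H/D/P lines by sentence id; other lines are ignored."""
    records = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split("\t")
        tag, sep, sent_id = fields[0].partition("-")
        if not sep or tag not in {"S", "T", "H", "D", "P"} or not sent_id.isdigit():
            continue  # e.g. the final BLEU summary line
        rec = records[int(sent_id)]
        if tag == "S" and len(fields) >= 2:
            rec["source"] = fields[1]
        elif tag == "T" and len(fields) >= 2:
            rec["reference"] = fields[1]
        elif tag == "H" and len(fields) >= 3:
            rec["score"] = float(fields[1])
            rec["tokens"] = fields[2]
        elif tag == "D" and len(fields) >= 3:
            rec["hypothesis"] = fields[2]
        elif tag == "P" and len(fields) >= 2:
            rec["positional_scores"] = [float(x) for x in fields[1].split()]
    return dict(records)


if __name__ == "__main__":
    for sent_id, rec in parse_generate_output(SAMPLE).items():
        print(sent_id, "|", rec["reference"], "|", rec["hypothesis"])
```

Keeping the records keyed by sentence id is convenient because the generated sentences are not necessarily written in the same order as the input file.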

Results

Comparison of BLEU scores with other translation models:

Model                                      validation   test
t2p-t5-large-commonvoice                   86.3         86.5
t2p-nmt-commonvoice                        86.0         82.6
t2p-mbart-large-cc25-commonvoice           72.3         72.3
t2p-nllb-200-distilled-600M-commonvoice    87.4         87.6
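
The test score of 72.3 for t2p-mbart-large-cc25-commonvoice comes from the summary line printed by fairseq-generate in the Evaluation section. As a sanity check, the sketch below recomputes BLEU from the figures on that line, using the standard definition: the geometric mean of the four n-gram precisions multiplied by the brevity penalty. The function names are hypothetical, and because the precisions are only reported to one decimal, the result should be expected to match 72.31 only approximately.

```python
import math

# Figures copied from the fairseq-generate summary line.
PRECISIONS = [84.3, 77.4, 72.3, 67.7]  # 1- to 4-gram precisions, in percent
SYS_LEN = 227722                       # total hypothesis length
REF_LEN = 236545                       # total reference length


def brevity_penalty(sys_len, ref_len):
    """Penalize hypotheses that are shorter than the references."""
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / sys_len)


def bleu_from_stats(precisions, sys_len, ref_len):
    """BLEU = brevity penalty * geometric mean of the n-gram precisions."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(sys_len, ref_len) * math.exp(log_mean)


if __name__ == "__main__":
    bp = brevity_penalty(SYS_LEN, REF_LEN)
    bleu = bleu_from_stats(PRECISIONS, SYS_LEN, REF_LEN)
    print(f"length ratio = {SYS_LEN / REF_LEN:.3f}")
    print(f"BP = {bp:.3f}")
    print(f"BLEU = {bleu:.2f}")
```

The brevity penalty below 1 indicates that the model's pictogram sequences are, on average, slightly shorter than the references.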

Environmental Impact

Training was performed on a single Nvidia V100 GPU with 32 GB of memory and took around 18 hours in total.

Using t2p-mbart-large-cc25-commonvoice

The scripts to use the t2p-mbart-large-cc25-commonvoice model are located in the speech-to-pictograms GitHub repository.

Information

  • Language(s): French
  • License: Apache-2.0
  • Developed by: Cécile Macaire
  • Funded by
    • GENCI-IDRIS (Grant 2023-AD011013625R1)
    • PROPICTO ANR-20-CE93-0005
  • Authors
    • Cécile Macaire
    • Chloé Dion
    • Emmanuelle Esperança-Rodier
    • Benjamin Lecouteux
    • Didier Schwab

Citation

If you use this model for your own research work, please cite as follows:

@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}