---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- mbart
- orfeo
- pytorch
- pictograms
- translation
metrics:
- bleu
inference: false
---

# t2p-mbart-large-cc25-orfeo

*t2p-mbart-large-cc25-orfeo* is a text-to-pictograms translation model built by fine-tuning the [mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) model on a dataset of pairs of transcriptions and pictogram token sequences, where each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/).
The model is intended for **inference** only.

## Training details

The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/mbart/README.md).

### Datasets

The [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto) is used, which was created from the CEFC-Orféo corpus.
This dataset was presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:---------:|:------------------------:|
| train     | 231,374                  |
| valid     | 28,796                   |
| test      | 29,009                   |

### Parameters

These are the arguments used in the training pipeline:

```bash
fairseq-train $DATA \
    --encoder-normalize-before --decoder-normalize-before \
    --arch mbart_large --layernorm-embedding \
    --task translation_from_pretrained_bart \
    --source-lang fr --target-lang frp \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
    --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
    --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
    --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
    --max-tokens 1024 --update-freq 2 \
    --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 5 \
    --seed 222 --log-format simple --log-interval 2 \
    --langs fr \
    --ddp-backend legacy_ddp \
    --max-epoch 40 \
    --save-dir models/checkpoints/mt_mbart_fr_frp_orfeo \
    --keep-best-checkpoints 5 \
    --keep-last-epochs 5
```
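
Before `fairseq-train` is run, the text pointed to by `$DATA` is typically SentencePiece-encoded with the mBART BPE model (the same `sentence.bpe.model` used in the evaluation command below) and then binarized with `fairseq-preprocess`. The snippet below is only a sketch of the encoding step under that assumption; the file names (`train.fr`, `train.spm.fr`, ...) are hypothetical, and the exact preprocessing used for this model lives in the training scripts rather than in this card.

```python
# Sketch (assumed preprocessing): encode parallel fr/frp text with the mBART SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="mbart.cc25.v2/sentence.bpe.model")

for split in ("train", "valid", "test"):
    for lang in ("fr", "frp"):
        with open(f"{split}.{lang}", encoding="utf-8") as fin, \
             open(f"{split}.spm.{lang}", "w", encoding="utf-8") as fout:
            for line in fin:
                # Encode each utterance into subword pieces, e.g. "▁prochain ▁celle - là ..."
                pieces = sp.encode(line.strip(), out_type=str)
                fout.write(" ".join(pieces) + "\n")
```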

### Evaluation

The model was evaluated with BLEU, comparing the reference pictogram translations with the model hypotheses.

```bash
fairseq-generate orfeo_data/data/ \
    --path $model_dir/checkpoint_best.pt \
    --task translation_from_pretrained_bart \
    --gen-subset test \
    -t frp -s fr \
    --bpe 'sentencepiece' --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \
    --sacrebleu \
    --batch-size 32 --langs $langs > out.txt
```

The output file contains the following information:

```txt
S-27886 ça sera tout madame<unk>
T-27886 prochain celle-là être tout monsieur
H-27886 -0.2824968993663788 ▁prochain ▁celle - là ▁être ▁tout ▁monsieur
D-27886 -0.2824968993663788 prochain celle-là être tout monsieur
P-27886 -0.5773 -0.1780 -0.2587 -0.2361 -0.2726 -0.3167 -0.1312 -0.3103 -0.2615
Generate test with beam=5: BLEU4 = 75.62, 85.7/78.9/73.9/69.3 (BP=0.986, ratio=0.986, syslen=407923, reflen=413636)
```
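
To re-score a generation run yourself, the reference (`T-`) and detokenized hypothesis (`D-`) lines can be pulled out of `out.txt` and passed to sacrebleu. This is a small sketch, not part of the original pipeline; it assumes the tab-separated `out.txt` produced by the command above, and the resulting score can differ slightly from the fairseq-reported one depending on tokenization.

```python
# Sketch: recompute corpus BLEU from a fairseq-generate output file (out.txt).
import sacrebleu

refs, hyps = {}, {}
with open("out.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("T-"):            # reference pictogram sequence: T-<id>\t<text>
            idx, text = line.rstrip("\n").split("\t", 1)
            refs[idx[2:]] = text
        elif line.startswith("D-"):          # detokenized hypothesis: D-<id>\t<score>\t<text>
            idx, _score, text = line.rstrip("\n").split("\t", 2)
            hyps[idx[2:]] = text

ids = sorted(refs, key=int)
bleu = sacrebleu.corpus_bleu([hyps[i] for i in ids], [[refs[i] for i in ids]])
print(bleu)  # e.g. a BLEU score around 75.6 on the Orféo test set
```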

### Results

Comparison to other translation models (BLEU scores on the validation and test sets):

| **Model** | **validation** | **test** |
|:---------:|:--------------:|:--------:|
| t2p-t5-large-orféo | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** |
| **t2p-mbart-large-cc25-orfeo** | 75.2 | 75.6 |
| t2p-nllb-200-distilled-600M-orfeo | 86.3 | 86.9 |

### Environmental Impact

Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took 18 hours in total.

## Using t2p-mbart-large-cc25-orfeo model

The scripts to use the *t2p-mbart-large-cc25-orfeo* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).
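
The card is marked `inference: false` and the checkpoint was trained with Fairseq, so the scripts in the repository above are the reference way to run the model. Purely as an illustration, if a transformers-compatible mBART export of this model is available, loading it could look roughly like the sketch below; the model id is a placeholder, and how the custom target code `frp` is handled depends on how the checkpoint and tokenizer were exported.

```python
# Illustrative sketch only: assumes a transformers-compatible export of this mBART checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "<org>/t2p-mbart-large-cc25-orfeo"  # hypothetical id, replace with the actual repository

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="fr_XX")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "ça sera tout madame"  # example source utterance from the evaluation output above
inputs = tokenizer(text, return_tensors="pt")
generated = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```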

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors:**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title     = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author    = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url       = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address   = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume    = {1 : articles longs et prises de position},
  pages     = {22-35},
  year      = {2024}
}
```