---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- mbart
- orfeo
- pytorch
- pictograms
- translation
metrics:
- bleu
inference: false
---

# t2p-mbart-large-cc25-orfeo

*t2p-mbart-large-cc25-orfeo* is a text-to-pictograms translation model built by fine-tuning [mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) on a dataset of pairs of transcriptions and pictogram-token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is intended for **inference** only.

## Training details

The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/mbart/README.md).

### Datasets

The model was fine-tuned on the [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-Orféo corpus.
This dataset was presented in the paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 231,374 |
| valid | 28,796 |
| test | 29,009 |

### Parameters

These are the arguments used in the training pipeline:

```bash
fairseq-train $DATA \
    --encoder-normalize-before --decoder-normalize-before \
    --arch mbart_large --layernorm-embedding \
    --task translation_from_pretrained_bart \
    --source-lang fr --target-lang frp \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
    --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
    --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
    --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
    --max-tokens 1024 --update-freq 2 \
    --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 5 \
    --seed 222 --log-format simple --log-interval 2 \
    --langs fr \
    --ddp-backend legacy_ddp \
    --max-epoch 40 \
    --save-dir models/checkpoints/mt_mbart_fr_frp_orfeo \
    --keep-best-checkpoints 5 \
    --keep-last-epochs 5
```
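
Note that `$DATA` must already contain SentencePiece-encoded, binarized data built with the pretrained mBART vocabulary. The following is only a rough sketch of how such data is typically prepared with the Fairseq mBART recipe; the file names and paths are illustrative, not the exact ones used for this model.

```bash
# Hypothetical preprocessing sketch following the Fairseq mBART recipe;
# file names and paths are illustrative, not the ones used for this model.
SPM_MODEL=mbart.cc25.v2/sentence.bpe.model
DICT=mbart.cc25.v2/dict.txt

# Encode the French transcriptions (fr) and pictogram-token sequences (frp)
# with the pretrained mBART SentencePiece model.
for split in train valid test; do
  spm_encode --model=$SPM_MODEL < $split.fr  > $split.spm.fr
  spm_encode --model=$SPM_MODEL < $split.frp > $split.spm.frp
done

# Binarize both sides with the pretrained mBART dictionary.
fairseq-preprocess \
    --source-lang fr --target-lang frp \
    --trainpref train.spm --validpref valid.spm --testpref test.spm \
    --destdir $DATA \
    --srcdict $DICT --tgtdict $DICT \
    --workers 8
```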

### Evaluation

The model was evaluated with BLEU, comparing the reference pictogram translation with the model hypothesis.

```bash
fairseq-generate orfeo_data/data/ \
    --path $model_dir/checkpoint_best.pt \
    --task translation_from_pretrained_bart \
    --gen-subset test \
    -t frp -s fr \
    --bpe 'sentencepiece' --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \
    --sacrebleu \
    --batch-size 32 --langs $langs > out.txt
```

The output file contains, for each utterance, the source (S), the reference (T), the scored hypothesis in SentencePiece tokens (H), the detokenized hypothesis (D), and the per-token scores (P), followed by the corpus-level BLEU:

```txt
S-27886 ça sera tout madame<unk>
T-27886 prochain celle-là être tout monsieur
H-27886 -0.2824968993663788 ▁prochain ▁celle - là ▁être ▁tout ▁monsieur
D-27886 -0.2824968993663788 prochain celle-là être tout monsieur
P-27886 -0.5773 -0.1780 -0.2587 -0.2361 -0.2726 -0.3167 -0.1312 -0.3103 -0.2615
Generate test with beam=5: BLEU4 = 75.62, 85.7/78.9/73.9/69.3 (BP=0.986, ratio=0.986, syslen=407923, reflen=413636)
```
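
The corpus-level BLEU is printed at the end of the generation log. If you want to rescore from the output file instead, a minimal sketch (assuming the `out.txt` format shown above) could be:

```bash
# Minimal rescoring sketch, assuming the out.txt format shown above:
# D-* lines hold the detokenized hypotheses, T-* lines the references.
grep '^D-' out.txt | sort -V | cut -f3- > hyp.txt
grep '^T-' out.txt | sort -V | cut -f2- > ref.txt
sacrebleu ref.txt < hyp.txt
```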

### Results

Comparison with other translation models (BLEU scores):

| **Model** | **Validation** | **Test** |
|:-----------:|:-----------------------:|:-----------------------:|
| t2p-t5-large-orféo | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** |
| **t2p-mbart-large-cc25-orfeo** | 75.2 | 75.6 |
| t2p-nllb-200-distilled-600M-orfeo | 86.3 | 86.9 |

### Environmental Impact

Fine-tuning was performed using a single Nvidia V100 GPU with 32 GB of memory, and took 18 hours in total.

## Using the t2p-mbart-large-cc25-orfeo model

The scripts to use the *t2p-mbart-large-cc25-orfeo* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).
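
As a rough illustration only (not the exact pipeline from the repository), new sentences can be translated with the Fairseq checkpoint using `fairseq-interactive`, with the same task and SentencePiece model as in the generation command above; the paths below are placeholders.

```bash
# Hypothetical inference sketch with fairseq-interactive; paths are placeholders
# and the exact pipeline lives in the speech-to-pictograms repository.
echo "je voudrais un café s'il vous plaît" | fairseq-interactive orfeo_data/data/ \
    --path $model_dir/checkpoint_best.pt \
    --task translation_from_pretrained_bart \
    -s fr -t frp \
    --bpe 'sentencepiece' --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \
    --langs $langs \
    --beam 5 --remove-bpe 'sentencepiece'
```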

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}
```