opus-mt-tc-bible-big-alv-deu_eng_fra_por_spa

Model Details
Uses
Risks, Limitations and Biases
How to Get Started With the Model
Training
Evaluation
Citation Information
Acknowledgements

Model Details

Neural machine translation model for translating from Atlantic-Congo languages (alv) to German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines use the procedures of OPUS-MT-train.

Model Description:

Developed by: Language Technology Research Group at the University of Helsinki
Model Type: Translation (transformer-big)
Release: 2024-05-30
License: Apache-2.0
Language(s):
- Source Language(s): abi acd ade adj aka akp ann anv atg avn bas bav bba beh bem bfd bfo bim biv bkv blh bmq bmv bom bov box bqj bss btt bud bwu cce cjk cko cme csk cwe cwt dag dga dgi dig dop dug dyi dyo efi ewe fal fon fuc ful gej gkn gng gog gud gur guw gux gwr hag hay heh her ibo ife iri izr jbu jmc kam kbp kdc kdl kdn ken keu kez kia kik kin kki kkj kma kmb kon ksb ktj kua kub kus kyf las lee lef lem lia lin lip lob lon lua lug luy maw mcp mcu mda mfq mgo mnf mnh mor mos muh myk myx mzk mzm mzw nbl ncu nde ndo ndz nfr nhu nim nin nmz nnb nnh nnw nso ntm ntr nuj nwb nya nyf nyn nyo nyy nzi oku old ozm pai pbl pkb rim run sag seh sig sil sld sna snw sot soy spp ssw suk swa swc swh sxb tbz tem thk tik tlj toh toi tpm tsn tso tsw tum twi umb vag ven vmw vun wmw wob wol xho xog xon xrb xsm xuo yam yaz yor zul
- Target Language(s): deu eng fra por spa
- Valid Target Language Labels: >>deu<< >>eng<< >>fra<< >>por<< >>spa<< >>xxx<<
Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token of the form >>id<< is required, where id is a valid target language ID, e.g. >>deu<<.
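The token requirement can be handled with a small helper before text is passed to the tokenizer. The following is a minimal sketch, not part of the OPUS-MT tooling: `TARGET_LANGUAGES` and `add_target_token` are hypothetical names, and the language IDs are copied from the target language list on this card. The >>xxx<< label is left out because the card does not describe its purpose.

```python
import re

# Target language IDs listed on this card. The ">>xxx<<" label is omitted
# here because its meaning is not documented (assumption of this sketch).
TARGET_LANGUAGES = ("deu", "eng", "fra", "por", "spa")

# Matches an existing sentence-initial token such as ">>deu<< ".
_TOKEN_RE = re.compile(r"^>>([a-z]{3})<<\s*")


def add_target_token(text: str, target: str) -> str:
    """Return `text` prefixed with the sentence-initial >>id<< token."""
    if target not in TARGET_LANGUAGES:
        raise ValueError(f"unsupported target language: {target!r}")
    match = _TOKEN_RE.match(text)
    if match:
        # The text is already tagged: drop the old token so only one remains.
        text = text[match.end():]
    return f">>{target}<< {text.strip()}"
```

Rejecting unknown IDs early is a design choice of this sketch; it surfaces typos in the language code before any model call is made.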

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input must start with the target language token (>>id<<).
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

# Hub repository ID of the converted model.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-alv-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-alv-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
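Because the target language is selected by the token alone, one source sentence can be sent to all five targets in a single batch. The following is a minimal sketch under stated assumptions: `translate_into_all` and `pipeline_adapter` are hypothetical helper names, and the adapter relies on the translation pipeline returning a list of dictionaries with a `translation_text` key when it is called with a list of strings.

```python
from typing import Callable, Dict, List

# Target language IDs listed on this card.
TARGET_LANGUAGES = ("deu", "eng", "fra", "por", "spa")


def translate_into_all(
    text: str, translate: Callable[[List[str]], List[str]]
) -> Dict[str, str]:
    """Translate one source sentence into every target language."""
    # Build one tagged copy of the sentence per target language.
    batch = [f">>{lang}<< {text}" for lang in TARGET_LANGUAGES]
    outputs = translate(batch)
    if len(outputs) != len(batch):
        raise ValueError("translate() must return one output per input")
    return dict(zip(TARGET_LANGUAGES, outputs))


def pipeline_adapter(pipe) -> Callable[[List[str]], List[str]]:
    """Wrap a transformers translation pipeline as a list-in, list-out function.

    Assumption: `pipe(batch)` returns one dict per input, each holding the
    output under the "translation_text" key.
    """
    def run(batch: List[str]) -> List[str]:
        return [result["translation_text"] for result in pipe(batch)]
    return run
```

With a pipeline object such as `pipe` from the example above, `translate_into_all(source_sentence, pipeline_adapter(pipe))` would return a dictionary keyed by target language ID. Passing the translation function as an argument keeps the helper independent of how the model is loaded.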

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
run-eng	tatoeba-test-v2021-08-07	0.49949	34.9	1703	10041
run-fra	tatoeba-test-v2021-08-07	0.41431	22.4	1274	7479
swa-eng	tatoeba-test-v2021-08-07	0.57031	41.5	387	2508
swh-por	flores101-devtest	0.40847	14.7	1012	26519
kin-eng	flores200-devtest	0.41964	18.1	1012	24721
nso-eng	flores200-devtest	0.45662	22.3	1012	24721
sna-eng	flores200-devtest	0.41974	17.2	1012	24721
sot-eng	flores200-devtest	0.45415	20.7	1012	24721
swh-eng	flores200-devtest	0.54048	29.1	1012	24721
swh-fra	flores200-devtest	0.44837	18.2	1012	28343
swh-por	flores200-devtest	0.44062	17.6	1012	26519
tsn-eng	flores200-devtest	0.40410	15.3	1012	24721
tso-eng	flores200-devtest	0.41504	17.6	1012	24721
xho-eng	flores200-devtest	0.47667	23.7	1012	24721
zul-eng	flores200-devtest	0.47798	23.4	1012	24721
ibo-eng	ntrex128	0.42002	17.4	1997	47673
kin-eng	ntrex128	0.42892	16.9	1997	47673
nso-eng	ntrex128	0.42278	17.0	1997	47673
nya-eng	ntrex128	0.42702	19.2	1997	47673
ssw-eng	ntrex128	0.43041	18.0	1997	47673
swa-eng	ntrex128	0.54492	30.4	1997	47673
swa-fra	ntrex128	0.43008	15.6	1997	53481
swa-por	ntrex128	0.42343	15.4	1997	51631
swa-spa	ntrex128	0.44892	18.9	1997	54107
tsn-eng	ntrex128	0.44944	20.1	1997	47673
xho-eng	ntrex128	0.46636	21.8	1997	47673
zul-eng	ntrex128	0.45848	21.9	1997	47673
zul-eng	tico19-test	0.48762	25.2	2100	56804
zul-spa	tico19-test	0.40041	15.9	2100	66563
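To compare scores across test sets, the table can be parsed programmatically. The following is a minimal sketch, assuming the results are available as tab-separated text in the column order shown above; `Score`, `parse_scores` and `best_per_testset` are hypothetical names, and `SAMPLE` copies four rows from the table for illustration.

```python
from typing import Dict, List, NamedTuple


class Score(NamedTuple):
    langpair: str
    testset: str
    chrf: float
    bleu: float


def parse_scores(table: str) -> List[Score]:
    """Parse tab-separated rows: langpair, testset, chr-F, BLEU, ..."""
    rows = []
    for line in table.strip().splitlines():
        fields = line.split("\t")
        # Skip the header row and anything that is too short to be a score row.
        if len(fields) < 4 or fields[0] == "langpair":
            continue
        rows.append(Score(fields[0], fields[1], float(fields[2]), float(fields[3])))
    return rows


def best_per_testset(rows: List[Score]) -> Dict[str, Score]:
    """Return the row with the highest BLEU score for each test set."""
    best: Dict[str, Score] = {}
    for row in rows:
        if row.testset not in best or row.bleu > best[row.testset].bleu:
            best[row.testset] = row
    return best


# Four rows copied from the table above (tab-separated).
SAMPLE = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "run-eng\ttatoeba-test-v2021-08-07\t0.49949\t34.9\t1703\t10041\n"
    "swa-eng\ttatoeba-test-v2021-08-07\t0.57031\t41.5\t387\t2508\n"
    "zul-eng\ttico19-test\t0.48762\t25.2\t2100\t56804\n"
    "zul-spa\ttico19-test\t0.40041\t15.9\t2100\t66563\n"
)

for testset, row in best_per_testset(parse_scores(SAMPLE)).items():
    print(testset, row.langpair, row.bleu)
```

Scores from different test sets are not directly comparable, which is why the sketch groups rows by test set before ranking them.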

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT"; "OPUS-MT – Building open translation services for the World"; and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite them if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: a0ea3b3
port time: Mon Oct 7 17:13:22 EEST 2024
port machine: LM0-400-22516.local
