---
language:
- en
- cy
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
- cer
- wer
- wil
- wip
- chrf
widget:
- text: "The doctor will be late to attend to patients this morning."
  example_title: "Example 1"
license: apache-2.0
model-index:
- name: "mt-dspec-health-en-cy"
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      type: "text"
      name: "various"
    metrics:
    - name: SacreBLEU
      type: bleu
      value: 54.16
    - name: CER
      type: cer
      value: 0.31
    - name: WER
      type: wer
      value: 0.47
    - name: WIL
      type: wil
      value: 0.67
    - name: WIP
      type: wip
      value: 0.33
    - name: SacreBLEU CHRF
      type: chrf
      value: 69.03
---

# mt-dspec-health-en-cy

A machine translation model for translating from English to Welsh, specialised for the health and care domain.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were prepared from the following sources:

- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets. The test set contains health-specific segments from TMX files
selected at random from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to the health and care domain.
After the test set was extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.
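
The description above does not say how the 10 training and validation sets were drawn. As one possible reading, the sketch below assumes a k-fold style partition in which each segment lands in exactly one validation set. The function name `ten_way_splits`, the shuffling seed and the fold layout are illustrative assumptions and are not taken from the actual DVC pipeline.

```python
import random


def ten_way_splits(segments, k=10, seed=0):
    """Yield k (train, validation) pairs from a list of segments.

    Assumption: a k-fold partition, so every segment is used for
    validation exactly once and for training in the other k - 1 splits.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for repeatability
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        train = [seg for j, fold in enumerate(folds) if j != i for seg in fold]
        yield train, validation


if __name__ == "__main__":
    # Hypothetical stand-in for the aggregated (non-test) segment pairs.
    data = [f"segment-{n}" for n in range(100)]
    for index, (train, validation) in enumerate(ten_way_splits(data)):
        print(index, len(train), len(validation))
```

Each `(train, validation)` pair would then feed one Marian training session. A different scheme, such as 10 independent random splits, would fit the description equally well.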

A website demonstrating use of this model is available at http://cyfieithu.techiaith.cymru.

## Evaluation

Evaluation was done using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
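
For readers unfamiliar with the error-rate metrics listed in the model card metadata, the following pure-Python sketch shows how WER and CER are conventionally defined: the edit distance between hypothesis and reference, divided by the reference length, counted in words and characters respectively. It is an illustration only. The helper names are hypothetical, and the reported scores were produced with SacreBLEU and torchmetrics, not with this code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize):
    """Total edit distance divided by total reference length."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return errors / total


def wer(references, hypotheses):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(references, hypotheses, str.split)


def cer(references, hypotheses):
    """Character error rate: tokens are individual characters."""
    return error_rate(references, hypotheses, list)


if __name__ == "__main__":
    refs = ["the doctor will be late"]
    hyps = ["the doctor is late"]
    print("WER:", wer(refs, hyps))
    print("CER:", cer(refs, hyps))
```

Lower values are better for both metrics. In the metadata above, WIL and WIP are complementary, which is why the two reported values sum to 1.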

## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
# The constraint imposed on the transformers version below
# is due to the following issue:
# https://github.com/huggingface/transformers/issues/26271
pip install sentencepiece "transformers>4.26.1,<=4.30.2"
```

```python
import transformers

model_id = "techiaith/mt-dspec-health-en-cy"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
# The pipeline returns a list with one dict per input text.
translated = translate("The doctor will be late to attend to patients this morning.")
print(translated[0]["translation_text"])
```