---
language:
- en
- cy
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
- cer
- wer
- wil
- wip
- chrf
widget:
 - text: "The doctor will be late to attend to patients this morning."
   example_title: "Example 1"
license: apache-2.0
model-index:
- name: "mt-dspec-health-en-cy"
  results:
  - task:
      name: Translation
      type: translation
    metrics:
    - name: SacreBLEU
      type: bleu
      value: 54.16
    - name: CER
      type: cer
      value: 0.31
    - name: WER
      type: wer
      value: 0.47
    - name: WIL
      type: wil
      value: 0.67
    - name: WIP
      type: wip
      value: 0.33
    - name: SacreBLEU CHRF
      type: chrf
      value: 69.03
---

# mt-dspec-health-en-cy
A language translation model for translating between English and Welsh, specialised to the health and care domain.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:
 - [UK Government Legislation data](https://www.legislation.gov.uk)
 - [OPUS-cy-en](https://opus.nlpl.eu/)
 - [Cofnod Y Cynulliad](https://record.assembly.wales/)
 - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into training, validation and test sets. The test set contains health-specific segments from TMX files
selected at random from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which had been pre-classified as pertaining to this domain.
Having extracted the test set, the remaining data was aggregated, split into 10 training and validation sets, and fed into 10 Marian training sessions.
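
For illustration only, the 10-way split of the remaining data could be sketched roughly as follows; the function and parameter names here are hypothetical and do not come from the actual DVC pipeline.

```python
import random

def split_remaining_data(segments, n_sessions=10, valid_fraction=0.05, seed=0):
    """Create one (train, validation) split per Marian training session
    from the data left over after the domain-specific test set is removed."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_sessions):
        shuffled = list(segments)
        rng.shuffle(shuffled)
        n_valid = int(len(shuffled) * valid_fraction)
        splits.append((shuffled[n_valid:], shuffled[:n_valid]))
    return splits
```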

A website demonstrating use of this model is available at http://cyfieithu.techiaith.cymru.

## Evaluation

Evaluation was performed using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
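
As an illustration, metrics of the kind reported above can be computed with these libraries along the following lines. This is a minimal sketch, not the project's actual evaluation script; the hypothesis/reference lists are placeholders.

```python
import sacrebleu
from torchmetrics.text import CharErrorRate, WordErrorRate, WordInfoLost, WordInfoPreserved

# Placeholder data: model translations and their corresponding reference translations.
hypotheses = ["this is an example translation"]
references = ["this is a sample translation"]

# Corpus-level BLEU and chrF via SacreBLEU (references are passed as a list of reference sets).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

# Character- and word-level error metrics via torchmetrics.
cer = CharErrorRate()(hypotheses, references)
wer = WordErrorRate()(hypotheses, references)
wil = WordInfoLost()(hypotheses, references)
wip = WordInfoPreserved()(hypotheses, references)

print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
print(f"CER: {cer.item():.2f}  WER: {wer.item():.2f}  WIL: {wil.item():.2f}  WIP: {wip.item():.2f}")
```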

## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
pip install transformers sentencepiece
```

```python
import transformers

# Model id on the Hugging Face Hub, matching the name of this card.
model_id = "techiaith/mt-dspec-health-en-cy"

# Load the tokenizer and the seq2seq model, then wrap them in a translation pipeline.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)

# The pipeline returns a list with one dict per input sentence.
translated = translate("The doctor will be late to attend to patients this morning.")
print(translated[0]["translation_text"])
```
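
Alternatively, the `pipeline` helper can download and wire up the tokenizer and model for you from the hub id alone; a minimal equivalent sketch (using the same model id as above) is:

```python
import transformers

# pipeline() loads both the tokenizer and the model directly from the Hugging Face Hub.
translate = transformers.pipeline("translation", model="techiaith/mt-dspec-health-en-cy")

print(translate("The doctor will be late to attend to patients this morning.")[0]["translation_text"])
```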