Files changed (1)
  1. README.md +67 -12
README.md CHANGED
@@ -8,16 +8,71 @@ tags:
8
  - translation
9
  - marian
10
  metrics:
11
- - type: bleu
12
- value: 65.51
13
- - type: cer
14
- value: 0.28
15
- - type: chrf
16
- value: 74.69
17
- - type: wer
18
- value: 0.39
19
- - type: wil
20
- value: 0.54
21
- - type: wip
22
- value: 0.46
23
  ---
 
8
  - translation
9
  - marian
10
  metrics:
11
+ - bleu
12
+ - cer
13
+ - chrf
15
+ - wer
16
+ - wil
17
+ - wip
18
+ model-index:
19
+ - name: mt-dspec-legislation-en-cy
20
+ results:
21
+ - task:
22
+ name: Translation
23
+ type: translation
24
+ metrics:
25
+ - type: bleu
26
+ value: 65.51
27
+ - type: cer
28
+ value: 0.28
29
+ - type: chrf
30
+ value: 74.69
31
+ - type: wer
32
+ value: 0.39
33
+ - type: wil
34
+ value: 0.54
35
+ - type: wip
36
+ value: 0.46
37
  ---
38
+ # mt-dspec-legislation-en-cy
39
+ A machine translation model for translating between English and Welsh, specialised for the legislation domain.
40
+
41
+ This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
42
+ The datasets were prepared from the following sources:
43
+ - [UK Government Legislation data](https://www.legislation.gov.uk)
44
+ - [OPUS-cy-en](https://opus.nlpl.eu/)
45
+ - [Cofnod Y Cynulliad](https://record.assembly.wales/)
46
+ - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)
47
+
48
+ The data was split into train, validation and test sets. The test set consists of legislation-specific segments selected randomly from TMX files
49
+ originating from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to the domain,
50
+ and from data files scraped from the UK Government Legislation website.
51
+
52
+ After the test set was extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.
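
The split into 10 training and validation sets could look roughly like the sketch below. The README does not say how the sets were drawn, so the k-fold-style partitioning, the `make_folds` helper and the placeholder segment pairs are all assumptions made for illustration, not the authors' pipeline.

```python
import random


def make_folds(segments, n_folds=10, seed=0):
    """Return n_folds (train, validation) pairs.

    Assumption: a k-fold-style split in which every segment lands in
    exactly one validation set. The actual pipeline may differ.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    folds = []
    for i in range(n_folds):
        # Every n_folds-th segment, starting at offset i, is held out.
        validation = shuffled[i::n_folds]
        train = [s for j, s in enumerate(shuffled) if j % n_folds != i]
        folds.append((train, validation))
    return folds


# Hypothetical sentence pairs standing in for the aggregated corpus.
pairs = [(f"en segment {k}", f"cy segment {k}") for k in range(100)]
folds = make_folds(pairs)
```

Each `(train, validation)` pair would then drive one Marian training session.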
53
+
54
+ ## Evaluation
55
+
56
+ Evaluation scores were produced using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
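
To give a feel for what the `wer` and `cer` figures measure, here is a dependency-free sketch of both as edit distance divided by reference length. It is an illustration only: it is not the torchmetrics implementation or the evaluation script used for this model, and the helper names and example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(tokenize(r), tokenize(h))
                 for r, h in zip(references, hypotheses))
    total = sum(len(tokenize(r)) for r in references)
    return errors / total


def wer(references, hypotheses):
    # Word error rate: tokens are whitespace-separated words.
    return error_rate(references, hypotheses, str.split)


def cer(references, hypotheses):
    # Character error rate: tokens are single characters, spaces included.
    return error_rate(references, hypotheses, list)


# Hypothetical reference translation and system output.
refs = ["the act established the curriculum"]
hyps = ["the act establishes the curriculum"]
print(wer(refs, hyps), cer(refs, hyps))
```

Lower values are better for both metrics.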
57
+
58
+ ## Usage
59
+
60
+ Ensure you have the prerequisite Python libraries installed:
61
+
62
+ ```bash
63
+ pip install transformers sentencepiece
64
+ ```
65
+
66
+ ```python
67
+ import transformers
68
+ model_id = "techiaith/mt-dspec-legislation-en-cy"
69
+ tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
70
+ model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
71
+ translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
72
+ translated = translate(
73
+ "The Curriculum and Assessment (Wales) Act 2021 (the Act) "
74
+ "established the Curriculum for Wales and replaced the general "
75
+ "curriculum used up until that point."
76
+ )
77
+ print(translated[0]["translation_text"])
78
+ ```