techiaith/mt-dspec-legislation-en-cy

Added model card

#2

by mgrbyte - opened Mar 24, 2023

base: refs/heads/main

←

from: refs/pr/2

README.md +67 -12

@@ -8,16 +8,71 @@ tags:
 - translation
 - marian
 metrics:
-- type: bleu
-  value: 65.51
-- type: cer
-  value: 0.28
-- type: chrf
-  value: 74.69
-- type: wer
-  value: 0.39
-- type: wil
-  value: 0.54
-- type: wip
-  value: 0.46
 ---

 - translation
 - marian
 metrics:
+  - bleu
+  - cer
+  - chrf
+  - wer
+  - wil
+  - wip
+model-index:
+- name: mt-dspec-legislation-en-cy
+  results:
+  - task:
+      name: Translation
+      type: translation
+    metrics:
+      - type: bleu
+        value: 65.51
+      - type: cer
+        value: 0.28
+      - type: chrf
+        value: 74.69
+      - type: wer
+        value: 0.39
+      - type: wil
+        value: 0.54
+      - type: wip
+        value: 0.46
 ---
+# mt-dspec-legislation-en-cy
+A machine translation model for translating between English and Welsh, specialised for the legislation domain.
+This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
+The datasets were prepared from the following sources:
+ - [UK Government Legislation data](https://www.legislation.gov.uk)
+ - [OPUS-cy-en](https://opus.nlpl.eu/)
+ - [Cofnod Y Cynulliad](https://record.assembly.wales/)
+ - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)
+The data was split into train, validation and test sets. The test set, which contains legislation-specific segments, was selected randomly from TMX files
+originating from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to the specific domain,
+and from data files scraped from the UK Government Legislation website.
+After the test set was extracted, the remaining aggregated data was split into 10 training and validation sets and fed into 10 Marian training sessions.
+## Evaluation
+Evaluation scores were produced using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
+## Usage
+Ensure you have the prerequisite Python libraries installed:
+```bash
+pip install transformers sentencepiece
+```
+```python
+import transformers
+model_id = "techiaith/mt-dspec-legislation-en-cy"
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
+translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
+translated = translate(
+  "The Curriculum and Assessment (Wales) Act 2021 (the Act) "
+  "established the Curriculum for Wales and replaced the general "
+  "curriculum used up until that point."
+)
+print(translated[0]["translation_text"])
+```
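
The data preparation described in the model card (hold out a test set, then split the remaining data into 10 training and validation sets) might be sketched roughly as below. This is not the project's DVC pipeline: the function name `make_splits`, the fold-style partitioning, the seed, and the toy segment pairs are all assumptions made for illustration. The card also says the test set was drawn only from legislation-specific sources, which this sketch does not model.

```python
import random


def make_splits(segments, test_size, n_splits=10, seed=0):
    """Hold out a random test set, then build n_splits train/validation pairs.

    Hypothetical helper: the real pipeline may split the data differently.
    """
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)

    # Hold out the test set first so it never appears in any training run.
    test = shuffled[:test_size]
    remaining = shuffled[test_size:]

    # Assumption: each run validates on a different, non-overlapping slice.
    folds = [remaining[i::n_splits] for i in range(n_splits)]
    splits = []
    for i, validation in enumerate(folds):
        train = [seg for j, fold in enumerate(folds) if j != i for seg in fold]
        splits.append((train, validation))
    return test, splits


# Toy (English, Welsh) pairs standing in for real aligned segments.
pairs = [(f"en segment {i}", f"cy segment {i}") for i in range(105)]
test, splits = make_splits(pairs, test_size=5)
print(len(test), len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each `(train, validation)` pair would then feed one training session. Whether the original runs used disjoint validation slices like this or independent random re-splits is not stated in the card.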