techiaith/mt-dspec-legislation-en-cy

text2text-generation


Language Technologies, Bangor University


	---
	language:
	- en
	- cy
	license: apache-2.0
	pipeline_tag: translation
	tags:
	- translation
	- marian
	metrics:
	- bleu
	- cer
	- chrf
	- wer
	- wil
	- wip
	widget:
	- text: "The Curriculum and Assessment (Wales) Act 2021 (the Act) established the Curriculum for Wales and replaced the general curriculum used up until that point."
	  example_title: "Example 1"
	model-index:
	- name: mt-dspec-legislation-en-cy
	  results:
	  - task:
	      name: Translation
	      type: translation
	    metrics:
	    - type: bleu
	      value: 65.51
	    - type: cer
	      value: 0.28
	    - type: chrf
	      value: 74.69
	    - type: wer
	      value: 0.39
	    - type: wil
	      value: 0.54
	    - type: wip
	      value: 0.46
	---
	# mt-dspec-legislation-en-cy
	A machine translation model for translating between English and Welsh, specialised for the legislation domain.

	This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
	The training datasets were generated from the following sources:
	- [UK Government Legislation data](https://www.legislation.gov.uk)
	- [OPUS-cy-en](https://opus.nlpl.eu/)
	- [Cofnod Y Cynulliad](https://record.assembly.wales/)
	- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

	The data was split into train, validation and test sets. The test set contains legislation-specific segments selected randomly from TMX files
	originating from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to the legislation domain,
	and from data files scraped from the UK Government Legislation website.

	After the test set was extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.
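
The card does not say how the 10 training and validation sets were built. As a minimal sketch, assuming a k-fold-style partition in which each part serves once as the validation set, the split could look like the following. The function name `make_folds`, the seed, and the placeholder segment pairs are all hypothetical and are not taken from the actual DVC pipeline.

```python
import random


def make_folds(segments, k=10, seed=0):
    """Shuffle segments and return k (train, validation) pairs."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled segments round-robin into k roughly equal parts.
    parts = [shuffled[i::k] for i in range(k)]
    folds = []
    for i in range(k):
        validation = parts[i]
        # Everything outside the held-out part becomes training data.
        train = [seg for j, part in enumerate(parts) if j != i for seg in part]
        folds.append((train, validation))
    return folds


# Hypothetical (English, Welsh) pairs standing in for the real corpus.
pairs = [(f"en segment {n}", f"cy segment {n}") for n in range(100)]
folds = make_folds(pairs, k=10)
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Under this assumption, each (train, validation) pair would feed one Marian training session, and the test set extracted earlier stays out of every fold.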

	## Evaluation

	Evaluation scores were produced using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
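
To illustrate what the `wer` and `cer` figures measure, here is a plain-Python sketch of word error rate and character error rate based on Levenshtein distance. It is not the torchmetrics implementation and was not used to produce the scores above. The helper names and the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an item from the reference
                curr[j - 1] + 1,     # insert an item from the hypothesis
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical reference translation and model output.
reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"
print(f"WER={wer(reference, hypothesis):.3f} CER={cer(reference, hypothesis):.3f}")
```

Lower values are better for both metrics, and a perfect match scores 0.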

	## Usage

	Ensure you have the prerequisite Python libraries installed:

	```bash
	pip install transformers sentencepiece
	```

	```python
	import transformers

	# Hub repository ID for this model card.
	model_id = "techiaith/mt-dspec-legislation-en-cy"
	tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
	model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
	translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
	translated = translate(
	    "The Curriculum and Assessment (Wales) Act 2021 (the Act) "
	    "established the Curriculum for Wales and replaced the general "
	    "curriculum used up until that point."
	)
	# The pipeline returns a list with one dict per input string.
	print(translated[0]["translation_text"])
	```