	---
	license: apache-2.0
	base_model: facebook/wav2vec2-large-xlsr-53
	metrics:
	- wer
	model-index:
	- name: wav2vec2-xlsr-53-ft-ccv-en-cy
	results: []
	datasets:
	- techiaith/commonvoice_16_1_en_cy
	language:
	- cy
	- en
	pipeline_tag: automatic-speech-recognition
	---

	# wav2vec2-xlsr-53-ft-cy-en-withlm

	This model is a version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
	that has been fine-tuned on a custom bilingual dataset derived from the Welsh
	and English data releases of the Mozilla Foundation's Common Voice project. See [techiaith/commonvoice_16_1_en_cy](https://huggingface.co/datasets/techiaith/commonvoice_16_1_en_cy).

	In addition, this model includes a single KenLM n-gram language model trained on balanced
	collections of Welsh and English texts from [OSCAR](https://huggingface.co/datasets/oscar).
	Using one bilingual n-gram model avoids the need for language detection to choose between separate Welsh and English n-gram models during CTC decoding.


	## Usage

	The `wav2vec2-xlsr-53-ft-cy-en-withlm` model can be used directly as follows:

	```python
	import torch
	import librosa

	from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

	model_id = "techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm"
	processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
	model = Wav2Vec2ForCTC.from_pretrained(model_id)

	# Load the audio file and resample it to 16 kHz.
	audio, rate = librosa.load("<path/to/audio_file>", sr=16000)

	inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

	# Run inference without tracking gradients.
	with torch.no_grad():
	    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

	# Beam-search decoding using the bundled KenLM language model.
	print("Prediction: ", processor.batch_decode(logits.numpy(), beam_width=10).text[0].strip())

	```

	Usage with a pipeline is even simpler:

	```python
	from transformers import pipeline

	transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

	def transcribe(audio):
	    # `audio` is a local file path or a URL to an audio file.
	    return transcriber(audio)["text"]

	transcribe("<path/or/url/to/any/audiofile>")
	```


	## Evaluation


	On a balanced English+Welsh test set derived from Common Voice version 16.1, techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm achieves a WER of 23.79%.

	However, when evaluated on language-specific test sets, the model performs noticeably better on Welsh than on English.

	| Common Voice Test Set Language | WER (%) | CER (%) |
	| -------- | --- | --- |
	| EN+CY | 23.79 | 9.68 |
	| EN | 34.47 | 14.83 |
	| CY | 12.34 | 3.55 |
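
For readers who want to reproduce figures of this kind, the following is a minimal sketch of corpus-level WER and CER based on Levenshtein edit distance. It is not the evaluation script used for this model: the model card does not say how text was normalised or which tool computed the scores, and the helper names and example sentences below are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Word error rate: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def cer(references, hypotheses):
    """Character error rate: total character edits divided by total reference characters."""
    errors = sum(edit_distance(list(r), list(h))
                 for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars


# Hypothetical sentences for illustration only, not taken from the test set.
refs = ["mae hi'n braf heddiw", "the weather is nice today"]
hyps = ["mae hi braf heddiw", "the weather is nice to day"]

print(f"WER: {100 * wer(refs, hyps):.2f}%")
print(f"CER: {100 * cer(refs, hyps):.2f}%")
```

Both functions pool errors over the whole corpus rather than averaging per-sentence scores, so longer sentences carry more weight. Whether the reported figures were pooled this way is an assumption.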


	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0003
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 64
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 800
	- training_steps: 9000
	- mixed_precision_training: Native AMP
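
To make the schedule concrete, here is a minimal sketch of how the values above combine, assuming the usual linear warmup followed by linear decay to zero. The function names are hypothetical, and the single-device reading of the batch size (32 × 2 = 64) is an assumption consistent with the listed totals.

```python
# Values copied from the hyperparameter list above.
LEARNING_RATE = 3e-4
WARMUP_STEPS = 800
TRAINING_STEPS = 9000
TRAIN_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 2


def effective_batch_size(per_step_batch, accumulation_steps):
    """Examples contributing to each optimizer update (single device assumed)."""
    return per_step_batch * accumulation_steps


def linear_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


print("effective batch size:", effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
for step in (0, 400, 800, 4900, 9000):
    print(f"step {step:>4}: lr = {linear_lr(step):.6f}")
```

Under these assumptions the learning rate peaks at 0.0003 at step 800 and reaches zero at step 9000.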

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Wer |
	|:-------------:|:-----:|:----:|:---------------:|:------:|
	| 6.0574 | 0.25 | 500 | 2.0297 | 0.9991 |
	| 1.224 | 0.5 | 1000 | 0.5368 | 0.4342 |
	| 0.434 | 0.75 | 1500 | 0.4861 | 0.3891 |
	| 0.3295 | 1.01 | 2000 | 0.4301 | 0.3411 |
	| 0.2739 | 1.26 | 2500 | 0.3818 | 0.3053 |
	| 0.2619 | 1.51 | 3000 | 0.3894 | 0.3060 |
	| 0.2517 | 1.76 | 3500 | 0.3497 | 0.2802 |
	| 0.2244 | 2.01 | 4000 | 0.3519 | 0.2792 |
	| 0.1854 | 2.26 | 4500 | 0.3376 | 0.2718 |
	| 0.1779 | 2.51 | 5000 | 0.3206 | 0.2520 |
	| 0.1749 | 2.77 | 5500 | 0.3169 | 0.2535 |
	| 0.1636 | 3.02 | 6000 | 0.3122 | 0.2465 |
	| 0.137 | 3.27 | 6500 | 0.3054 | 0.2382 |
	| 0.1311 | 3.52 | 7000 | 0.2956 | 0.2280 |
	| 0.1261 | 3.77 | 7500 | 0.2898 | 0.2236 |
	| 0.1187 | 4.02 | 8000 | 0.2847 | 0.2176 |
	| 0.1011 | 4.27 | 8500 | 0.2763 | 0.2124 |
	| 0.0981 | 4.52 | 9000 | 0.2754 | 0.2115 |
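
The Step and Epoch columns also allow a rough estimate of the training set size. The sketch below divides steps by epochs and multiplies by the total train batch size of 64. The result is only approximate, because the logged epoch values are rounded to two decimals, and the function names are hypothetical.

```python
# total_train_batch_size from the hyperparameter list.
TOTAL_TRAIN_BATCH_SIZE = 64

# (step, epoch) pairs copied from the training results table.
LOGGED = [(500, 0.25), (4500, 2.26), (9000, 4.52)]


def steps_per_epoch(step, epoch):
    """Optimizer steps needed for one pass over the training data."""
    return step / epoch


def examples_per_epoch(step, epoch, batch_size=TOTAL_TRAIN_BATCH_SIZE):
    """Approximate number of training examples seen in one epoch."""
    return round(steps_per_epoch(step, epoch) * batch_size)


for step, epoch in LOGGED:
    print(f"step {step}: ~{steps_per_epoch(step, epoch):.0f} steps/epoch, "
          f"~{examples_per_epoch(step, epoch)} examples/epoch")
```

The estimates agree to within about one percent across rows, which suggests on the order of 127,000 to 128,000 training examples per epoch. This is an inference from the table, not a figure stated by the authors.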


	### Framework versions

	- Transformers 4.38.2
	- PyTorch 2.2.1+cu121
	- Datasets 2.18.0
	- Tokenizers 0.15.2