	---
	license: apache-2.0
	base_model: facebook/wav2vec2-large-xlsr-53
	metrics:
	- wer
	model-index:
	- name: wav2vec2-xlsr-53-ft-ccv-en-cy
	  results: []
	datasets:
	- techiaith/commonvoice_16_1_en_cy
	language:
	- cy
	- en
	pipeline_tag: automatic-speech-recognition
	---

	# wav2vec2-xlsr-53-ft-ccv-en-cy

	A speech recognition acoustic model for Welsh and English, fine-tuned from [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on a balanced mix of English and Welsh data derived from version 11 of the respective Common Voice datasets (https://commonvoice.mozilla.org/cy/datasets). The custom bilingual Common Voice train/dev and test splits were built with the scripts at https://github.com/techiaith/docker-commonvoice-custom-splits-builder#introduction.

	Source code and scripts for training wav2vec2-xlsr-ft-en-cy can be found at [https://github.com/techiaith/docker-wav2vec2-cy](https://github.com/techiaith/docker-wav2vec2-cy/blob/main/train/fine-tune/python/run_en_cy.sh).


	## Usage

	The wav2vec2-xlsr-53-ft-ccv-en-cy model can be used directly as follows:

	```python
	import torch
	import librosa

	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	processor = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")
	model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")

	# Placeholder path: point this at your own recording.
	audio_file = "speech.wav"

	# Load the audio and resample it to the 16 kHz rate the model expects.
	audio, rate = librosa.load(audio_file, sr=16000)

	inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

	# Inference only, so skip gradient tracking.
	with torch.no_grad():
	    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

	# greedy decoding
	predicted_ids = torch.argmax(logits, dim=-1)

	print("Prediction:", processor.batch_decode(predicted_ids))

	```
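
To make the greedy decoding step more concrete, the sketch below shows roughly what happens to the arg-max ids before they become text: consecutive repeats are merged and the CTC blank token is dropped. The toy vocabulary, the blank id of 0, and the helper name `greedy_ctc_collapse` are assumptions for illustration only. The real vocabulary ships with the processor, and `processor.batch_decode` handles further details (such as the word delimiter token) that are left out here.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
TOY_VOCAB = {0: "", 1: "d", 2: "a", 3: "i", 4: "w", 5: "n", 6: " "}
BLANK_ID = 0


def greedy_ctc_collapse(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame arg-max ids into text: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive ids.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(vocab[token_id] for token_id in merged if token_id != blank_id)


# Per-frame ids for "da iawn": repeated ids are merged and blanks are dropped.
frames = [1, 1, 0, 2, 2, 6, 3, 0, 2, 4, 4, 0, 5]
print(greedy_ctc_collapse(frames))
```

Note that a blank between two identical ids keeps them as separate characters, which is how CTC represents genuine double letters.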

	## Evaluation


	On a balanced English+Welsh test set derived from Common Voice version 16.1, techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy achieves a WER of 23.79%.

	However, when evaluated on language-specific test sets, the model performs markedly better on Welsh than on English.

	| Common Voice Test Set Language | WER (%) | CER (%) |
	| -------- | --- | --- |
	| EN+CY | 23.79 | 9.68 |
	| EN | 34.47 | 14.83 |
	| CY | 12.34 | 3.55 |
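
For readers unfamiliar with these metrics: WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, and CER is the same ratio computed over characters. The following is a minimal sketch of that definition, not the evaluation script used to produce the figures above. The function names and example sentences are illustrative assumptions, and any text normalisation is omitted.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up example pair, not taken from the test set.
reference = "mae hi'n braf heddiw"
hypothesis = "mae hi braf heddiw"
print(f"WER: {100 * wer(reference, hypothesis):.2f}%")
print(f"CER: {100 * cer(reference, hypothesis):.2f}%")
```

One wrong word out of four gives a WER of 25%, while the same error costs only two characters out of twenty, which is why CER is typically much lower than WER.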


	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0003
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 64
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 800
	- training_steps: 9000
	- mixed_precision_training: Native AMP
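
The batch size and schedule settings above can be unpacked with a little arithmetic. The sketch below assumes a single training device (so 32 × 2 accumulation steps gives the listed total of 64) and assumes the `linear` scheduler ramps linearly from zero to the peak rate over the warmup steps, then decays linearly to zero at the final step. The helper names `effective_batch_size` and `lr_at_step` are hypothetical and not part of the training scripts.

```python
LEARNING_RATE = 3e-4
WARMUP_STEPS = 800
TRAINING_STEPS = 9000
TRAIN_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 2
NUM_DEVICES = 1  # assumption: not stated in the hyperparameter list


def effective_batch_size():
    """Examples contributing to each optimizer update."""
    return TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES


def lr_at_step(step):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


print("effective batch size:", effective_batch_size())
for step in (0, 400, 800, 4900, 9000):
    print(f"step {step:>4}: lr = {lr_at_step(step):.6f}")
```

Under these assumptions the rate peaks at 0.0003 on step 800 and has fallen to half of that by step 4900, midway through the decay phase.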

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | WER |
	|:-------------:|:-----:|:----:|:---------------:|:------:|
	| 6.0574 | 0.25 | 500 | 2.0297 | 0.9991 |
	| 1.224 | 0.5 | 1000 | 0.5368 | 0.4342 |
	| 0.434 | 0.75 | 1500 | 0.4861 | 0.3891 |
	| 0.3295 | 1.01 | 2000 | 0.4301 | 0.3411 |
	| 0.2739 | 1.26 | 2500 | 0.3818 | 0.3053 |
	| 0.2619 | 1.51 | 3000 | 0.3894 | 0.3060 |
	| 0.2517 | 1.76 | 3500 | 0.3497 | 0.2802 |
	| 0.2244 | 2.01 | 4000 | 0.3519 | 0.2792 |
	| 0.1854 | 2.26 | 4500 | 0.3376 | 0.2718 |
	| 0.1779 | 2.51 | 5000 | 0.3206 | 0.2520 |
	| 0.1749 | 2.77 | 5500 | 0.3169 | 0.2535 |
	| 0.1636 | 3.02 | 6000 | 0.3122 | 0.2465 |
	| 0.137 | 3.27 | 6500 | 0.3054 | 0.2382 |
	| 0.1311 | 3.52 | 7000 | 0.2956 | 0.2280 |
	| 0.1261 | 3.77 | 7500 | 0.2898 | 0.2236 |
	| 0.1187 | 4.02 | 8000 | 0.2847 | 0.2176 |
	| 0.1011 | 4.27 | 8500 | 0.2763 | 0.2124 |
	| 0.0981 | 4.52 | 9000 | 0.2754 | 0.2115 |


	### Framework versions

	- Transformers 4.38.2
	- PyTorch 2.2.1+cu121
	- Datasets 2.18.0
	- Tokenizers 0.15.2