---
license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
metrics:
  - wer
model-index:
  - name: wav2vec2-xlsr-53-ft-ccv-en-cy
    results: []
datasets:
  - techiaith/commonvoice_16_1_en_cy
language:
  - cy
  - en
pipeline_tag: automatic-speech-recognition
---

# wav2vec2-xlsr-53-ft-ccv-en-cy

A speech recognition acoustic model for Welsh and English, fine-tuned from facebook/wav2vec2-large-xlsr-53 using balanced English/Welsh data derived from version 16.1 of the respective Common Voice datasets (https://commonvoice.mozilla.org/cy/datasets). Custom bilingual Common Voice train/dev/test splits were built with the scripts at https://github.com/techiaith/docker-commonvoice-custom-splits-builder#introduction.
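As a minimal sketch of loading those splits with the `datasets` library (split and column names are assumptions; check the dataset card for the exact layout):

```python
from datasets import load_dataset, Audio

# Bilingual English/Welsh Common Voice splits from the Hugging Face Hub.
dataset = load_dataset("techiaith/commonvoice_16_1_en_cy")

# Decode audio at the 16 kHz rate wav2vec2 expects ("audio" column assumed).
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```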

Source code and scripts for training wav2vec2-xlsr-53-ft-ccv-en-cy can be found at https://github.com/techiaith/docker-wav2vec2-cy.

## Usage

The wav2vec2-xlsr-53-ft-ccv-en-cy model can be used directly as follows:

```python
import torch
import librosa

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")
model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")

audio_file = "speech.wav"  # placeholder: path to your own recording

# Load the audio, resampling to the 16 kHz rate the model expects.
audio, rate = librosa.load(audio_file, sr=16000)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token at each time step.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
```

## Evaluation

On a balanced English+Welsh test set derived from Common Voice version 16.1, techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy achieves a WER of 23.79%.

However, when evaluated on language-specific test sets, the model performs noticeably better on Welsh than on English:

| Common Voice test set language | WER (%) | CER (%) |
|:------------------------------:|:-------:|:-------:|
| EN+CY                          | 23.79   | 9.68    |
| EN                             | 34.47   | 14.83   |
| CY                             | 12.34   | 3.55    |
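The figures above can be reproduced with the `evaluate` library; a minimal sketch, where `predictions` and `references` are placeholder transcription lists rather than real model output:

```python
import evaluate

# WER and CER metrics as reported in the table above.
wer = evaluate.load("wer")
cer = evaluate.load("cer")

# Placeholders: in practice, run the model over a test split to obtain these.
predictions = ["mae hen wlad fy nhadau yn annwyl i mi"]
references = ["mae hen wlad fy nhadau yn annwyl i mi"]

print("WER (%):", 100 * wer.compute(predictions=predictions, references=references))
print("CER (%):", 100 * cer.compute(predictions=predictions, references=references))
```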

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0003
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 800
- training_steps: 9000
- mixed_precision_training: Native AMP
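As a rough sketch of how these settings map onto `transformers.TrainingArguments` (the output directory is a placeholder, and this is not the authors' exact training script; see the linked repository for that):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-53-ft-ccv-en-cy",  # placeholder output directory
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=2,   # effective (total) train batch size: 64
    lr_scheduler_type="linear",
    warmup_steps=800,
    max_steps=9000,
    fp16=True,                       # native AMP mixed precision
)
```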

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 6.0574        | 0.25  | 500  | 2.0297          | 0.9991 |
| 1.224         | 0.5   | 1000 | 0.5368          | 0.4342 |
| 0.434         | 0.75  | 1500 | 0.4861          | 0.3891 |
| 0.3295        | 1.01  | 2000 | 0.4301          | 0.3411 |
| 0.2739        | 1.26  | 2500 | 0.3818          | 0.3053 |
| 0.2619        | 1.51  | 3000 | 0.3894          | 0.3060 |
| 0.2517        | 1.76  | 3500 | 0.3497          | 0.2802 |
| 0.2244        | 2.01  | 4000 | 0.3519          | 0.2792 |
| 0.1854        | 2.26  | 4500 | 0.3376          | 0.2718 |
| 0.1779        | 2.51  | 5000 | 0.3206          | 0.2520 |
| 0.1749        | 2.77  | 5500 | 0.3169          | 0.2535 |
| 0.1636        | 3.02  | 6000 | 0.3122          | 0.2465 |
| 0.137         | 3.27  | 6500 | 0.3054          | 0.2382 |
| 0.1311        | 3.52  | 7000 | 0.2956          | 0.2280 |
| 0.1261        | 3.77  | 7500 | 0.2898          | 0.2236 |
| 0.1187        | 4.02  | 8000 | 0.2847          | 0.2176 |
| 0.1011        | 4.27  | 8500 | 0.2763          | 0.2124 |
| 0.0981        | 4.52  | 9000 | 0.2754          | 0.2115 |

### Framework versions

- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2