---
license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
metrics:
  - wer
model-index:
  - name: wav2vec2-xlsr-53-ft-ccv-en-cy
    results: []
datasets:
  - techiaith/commonvoice_16_1_en_cy
language:
  - cy
  - en
pipeline_tag: automatic-speech-recognition
---

# wav2vec2-xlsr-53-ft-ccv-en-cy

A speech recognition acoustic model for Welsh and English, fine-tuned from facebook/wav2vec2-large-xlsr-53 on balanced English/Welsh data derived from version 16.1 of the respective Common Voice datasets (https://commonvoice.mozilla.org/cy/datasets). Custom bilingual Common Voice train, dev and test splits were built using the scripts at https://github.com/techiaith/docker-commonvoice-custom-splits-builder#introduction.

Source code and scripts for training wav2vec2-xlsr-53-ft-ccv-en-cy can be found at https://github.com/techiaith/docker-wav2vec2-cy.
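
The balanced bilingual dataset listed in this card's metadata, techiaith/commonvoice_16_1_en_cy, can be inspected with the datasets library. A minimal sketch; the configuration and split names are assumptions here, so check the dataset card for exact values:

```python
from datasets import load_dataset

# Load the custom bilingual Common Voice splits (default configuration
# assumed; see the dataset card for available configurations and splits).
ds = load_dataset("techiaith/commonvoice_16_1_en_cy")
print(ds)
```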

## Usage

The wav2vec2-xlsr-53-ft-ccv-en-cy model can be used directly as follows:

```python
import torch
import librosa

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")
model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")

audio_file = "speech.wav"  # path to your own recording

# load and resample the audio to the 16 kHz expected by the model
audio, rate = librosa.load(audio_file, sr=16000)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# greedy decoding: take the most likely token at each time step
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
```

## Evaluation

On a balanced English+Welsh test set derived from Common Voice version 16.1, techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy achieves a WER of 23.79%.

However, language-specific test sets show that the model performs considerably better on Welsh than on English.

| Common Voice test set | WER (%) | CER (%) |
|:---------------------:|:-------:|:-------:|
| EN+CY                 | 23.79   | 9.68    |
| EN                    | 34.47   | 14.83   |
| CY                    | 12.34   | 3.55    |
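
The WER and CER figures above can be computed with the evaluate library given reference transcripts and model outputs. A minimal sketch with a made-up sentence pair (not data from the actual test sets):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Illustrative pair only - "anwyl" is a deliberate misrecognition of "annwyl".
references = ["mae hen wlad fy nhadau yn annwyl i mi"]
predictions = ["mae hen wlad fy nhadau yn anwyl i mi"]

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```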

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after the list):

- learning_rate: 0.0003
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 800
- training_steps: 9000
- mixed_precision_training: Native AMP
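
These settings map directly onto transformers TrainingArguments. The sketch below is a hypothetical reconstruction, not the actual training script: output_dir is a placeholder, and fp16=True stands in for Native AMP mixed precision.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-53-ft-ccv-en-cy",  # placeholder
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=2,  # effective train batch size: 32 * 2 = 64
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=800,
    max_steps=9000,
    fp16=True,  # "Native AMP" mixed precision
)
```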

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 6.0574        | 0.25  | 500  | 2.0297          | 0.9991 |
| 1.224         | 0.5   | 1000 | 0.5368          | 0.4342 |
| 0.434         | 0.75  | 1500 | 0.4861          | 0.3891 |
| 0.3295        | 1.01  | 2000 | 0.4301          | 0.3411 |
| 0.2739        | 1.26  | 2500 | 0.3818          | 0.3053 |
| 0.2619        | 1.51  | 3000 | 0.3894          | 0.3060 |
| 0.2517        | 1.76  | 3500 | 0.3497          | 0.2802 |
| 0.2244        | 2.01  | 4000 | 0.3519          | 0.2792 |
| 0.1854        | 2.26  | 4500 | 0.3376          | 0.2718 |
| 0.1779        | 2.51  | 5000 | 0.3206          | 0.2520 |
| 0.1749        | 2.77  | 5500 | 0.3169          | 0.2535 |
| 0.1636        | 3.02  | 6000 | 0.3122          | 0.2465 |
| 0.137         | 3.27  | 6500 | 0.3054          | 0.2382 |
| 0.1311        | 3.52  | 7000 | 0.2956          | 0.2280 |
| 0.1261        | 3.77  | 7500 | 0.2898          | 0.2236 |
| 0.1187        | 4.02  | 8000 | 0.2847          | 0.2176 |
| 0.1011        | 4.27  | 8500 | 0.2763          | 0.2124 |
| 0.0981        | 4.52  | 9000 | 0.2754          | 0.2115 |

### Framework versions

- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2