---
license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
metrics:
  - wer
model-index:
  - name: wav2vec2-xlsr-53-ft-ccv-en-cy
    results: []
datasets:
  - techiaith/commonvoice_16_1_en_cy
language:
  - cy
  - en
pipeline_tag: automatic-speech-recognition
---

# wav2vec2-xlsr-53-ft-cy-en-withlm

This model is a version of facebook/wav2vec2-large-xlsr-53 that has been fine-tuned on a custom bilingual dataset derived from the Welsh and English data releases of the Mozilla Foundation's Common Voice project. See: techiaith/commonvoice_16_1_en_cy.

In addition, this model includes a single KenLM n-gram model trained on balanced collections of Welsh and English texts from OSCAR. This avoids the need for language detection to determine whether to use a Welsh or an English n-gram model during CTC decoding.
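For reference, a processor of this kind is typically assembled by pairing the model's CTC vocabulary with a KenLM file via pyctcdecode. The sketch below is illustrative only (the `.arpa` path is a hypothetical placeholder), and the published repository already bundles the decoder, so end users can skip this step:

```python
# Illustrative: attaching one bilingual KenLM model to a CTC tokenizer
# with pyctcdecode. The .arpa path is a hypothetical placeholder.
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM

base = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# Order the vocabulary by token id so labels line up with the CTC output columns.
vocab = sorted(base.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
decoder = build_ctcdecoder(
    labels=[token for token, _ in vocab],
    kenlm_model_path="<path/to/cy_en.arpa>",
)

processor = Wav2Vec2ProcessorWithLM(
    feature_extractor=base.feature_extractor,
    tokenizer=base.tokenizer,
    decoder=decoder,
)
```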

## Usage

The wav2vec2-xlsr-53-ft-cy-en-withlm model can be used directly as follows:

```python
import torch
import librosa

from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# Load the audio at the 16 kHz sampling rate the model was trained on.
audio, rate = librosa.load("<path/to/audio_file>", sr=16000)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# The LM-aware processor beam-searches over the CTC logits during decoding.
print("Prediction: ", processor.batch_decode(logits.numpy(), beam_width=10).text[0].strip())
```

Usage with a pipeline is even simpler...

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

def transcribe(audio):
    return transcriber(audio)["text"]

transcribe("<path/or/url/to/any/audiofile>")
```
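For long recordings, the same pipeline can transcribe in overlapping windows rather than in one pass. `chunk_length_s` and `stride_length_s` are standard options of the transformers ASR pipeline; the values below are illustrative:

```python
# Chunked inference: the pipeline splits long audio into overlapping
# 30-second windows and stitches the transcripts back together.
result = transcriber("<path/to/long_audio_file>", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```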

## Evaluation

On a balanced English+Welsh test set derived from Common Voice version 16.1, techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm achieves a WER of 23.79%.

However, when evaluated with language-specific test sets, the model performs noticeably better on Welsh:

| Common Voice test set language | WER (%) | CER (%) |
|--------------------------------|--------:|--------:|
| EN+CY                          |   23.79 |    9.68 |
| EN                             |   34.47 |   14.83 |
| CY                             |   12.34 |    3.55 |
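Figures like these can be reproduced with the evaluate library. The sketch below is a generic recipe, not the authors' evaluation script; the clip paths and reference transcripts are placeholders, and it reuses the `transcribe()` helper defined in the Usage section:

```python
import evaluate

# Standard word- and character-error-rate metrics.
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholders: pair each test clip with its ground-truth transcript.
test_clips = ["<clip1.wav>", "<clip2.wav>"]
references = ["<reference transcript 1>", "<reference transcript 2>"]

predictions = [transcribe(clip) for clip in test_clips]

print("WER (%):", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER (%):", 100 * cer_metric.compute(predictions=predictions, references=references))
```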

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list):

- learning_rate: 0.0003
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 800
- training_steps: 9000
- mixed_precision_training: Native AMP
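For orientation, these settings map roughly onto the following transformers TrainingArguments. This is a sketch, not the authors' training script; the output directory is a placeholder:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters above expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="<output_dir>",        # placeholder
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=2,    # effective train batch size: 32 * 2 = 64
    lr_scheduler_type="linear",
    warmup_steps=800,
    max_steps=9000,
    fp16=True,                        # native AMP mixed-precision training
)
```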

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER    |
|--------------:|------:|-----:|----------------:|-------:|
| 6.0574        | 0.25  | 500  | 2.0297          | 0.9991 |
| 1.224         | 0.5   | 1000 | 0.5368          | 0.4342 |
| 0.434         | 0.75  | 1500 | 0.4861          | 0.3891 |
| 0.3295        | 1.01  | 2000 | 0.4301          | 0.3411 |
| 0.2739        | 1.26  | 2500 | 0.3818          | 0.3053 |
| 0.2619        | 1.51  | 3000 | 0.3894          | 0.3060 |
| 0.2517        | 1.76  | 3500 | 0.3497          | 0.2802 |
| 0.2244        | 2.01  | 4000 | 0.3519          | 0.2792 |
| 0.1854        | 2.26  | 4500 | 0.3376          | 0.2718 |
| 0.1779        | 2.51  | 5000 | 0.3206          | 0.2520 |
| 0.1749        | 2.77  | 5500 | 0.3169          | 0.2535 |
| 0.1636        | 3.02  | 6000 | 0.3122          | 0.2465 |
| 0.137         | 3.27  | 6500 | 0.3054          | 0.2382 |
| 0.1311        | 3.52  | 7000 | 0.2956          | 0.2280 |
| 0.1261        | 3.77  | 7500 | 0.2898          | 0.2236 |
| 0.1187        | 4.02  | 8000 | 0.2847          | 0.2176 |
| 0.1011        | 4.27  | 8500 | 0.2763          | 0.2124 |
| 0.0981        | 4.52  | 9000 | 0.2754          | 0.2115 |

### Framework versions

- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2