metadata

license: apache-2.0
base_model: facebook/wav2vec2-xls-r-300m
tags:
  - generated_from_trainer
metrics:
  - wer
  - cer
model-index:
  - name: wav2vec2-large-xls-r-300m-hi
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 15
          type: mozilla-foundation/common_voice_15_0
          args: hi
        metrics:
          - name: Test WER
            type: wer
            value: 29.34
          - name: Test CER
            type: cer
            value: 7.86
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 8
          type: mozilla-foundation/common_voice_8_0
          args: hi
        metrics:
          - name: Test WER
            type: wer
            value: 52.09
          - name: Test CER
            type: cer
            value: 17.9
datasets:
  - mozilla-foundation/common_voice_15_0
language:
  - hi
library_name: transformers
pipeline_tag: automatic-speech-recognition

wav2vec2-large-xls-r-300m-hi

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Hindi (hi) subset of the Common Voice 15 dataset (mozilla-foundation/common_voice_15_0). It achieves the following results on the evaluation set:

Loss: 0.3611
Wer: 29.92%
Cer: 7.86%

View the results in the Kaggle notebook: https://www.kaggle.com/code/kingabzpro/wav2vec-2-eval

Evaluation

import torch
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import unicodedata
import re


test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "hi", split="test")
wer = load_metric("wer")
cer = load_metric("cer")

processor = Wav2Vec2Processor.from_pretrained("SakshiRathi77/wav2vec2_xlsr_300m")
model = Wav2Vec2ForCTC.from_pretrained("SakshiRathi77/wav2vec2_xlsr_300m")
model.to("cuda")


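# Preprocessing the datasets.
# We need to normalize the transcripts and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    # Raw strings keep the regex escapes intact without invalid-escape warnings.
    chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\’\'\|\&\–]'
    remove_en = r'[A-Za-z]'
    # Strip punctuation, then drop any Latin-script characters from the transcript.
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"].lower())
    batch["sentence"] = re.sub(remove_en, "", batch["sentence"])
    batch["sentence"] = unicodedata.normalize("NFKC", batch["sentence"])

    # Load the audio and resample it to the 16 kHz rate the model expects.
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    return batch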

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits

    # Greedy CTC decoding: take the most likely token at every frame.
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {}".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))

WER: 52.09850206372026
CER: 17.902923538230883
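
The WER and CER values printed above are edit-distance rates: the number of substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the reference length (counted in words for WER and in characters for CER). The following is a minimal pure-Python sketch of that definition, not the implementation behind the `wer` and `cer` metrics loaded in the script. The helper names `edit_distance`, `error_rate`, `word_error_rate`, and `char_error_rate` are hypothetical, the toy strings are made up for illustration, and the sketch assumes that errors are summed over the whole corpus before dividing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, predictions, tokenize):
    """Total edits divided by total reference length over the whole corpus."""
    errors = sum(edit_distance(tokenize(r), tokenize(p))
                 for r, p in zip(references, predictions))
    total = sum(len(tokenize(r)) for r in references)
    return errors / total


def word_error_rate(references, predictions):
    # Tokens are whitespace-separated words.
    return error_rate(references, predictions, str.split)


def char_error_rate(references, predictions):
    # Tokens are individual characters, spaces included.
    return error_rate(references, predictions, list)


# Toy data (assumption, for illustration only).
refs = ["the cat sat", "hello world"]
hyps = ["the cat sit", "hello world"]
print("WER:", 100 * word_error_rate(refs, hyps))
print("CER:", 100 * char_error_rate(refs, hyps))
```

Because a single wrong character makes the whole word count as an error, WER is normally much higher than CER on the same predictions, which matches the gap between the two scores reported here.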

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 32
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 128
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 300
num_epochs: 100
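
Two of the values above can be derived from the others. As a sketch: the total train batch size is the per-step batch size times the gradient accumulation steps (32 × 4 = 128, assuming a single device), and a linear scheduler with warmup raises the learning rate from 0 to 0.0001 over the first 300 optimizer steps, then decays it linearly to 0 at the final step. The function name `linear_lr` is hypothetical, and `total_steps` below is a made-up value for illustration, since the card does not state the total number of optimizer steps.

```python
LEARNING_RATE = 1e-4
WARMUP_STEPS = 300
TRAIN_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 4

# Assumption: one device, so there is no extra factor for the device count.
total_train_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def linear_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1600  # hypothetical; not taken from the model card
print("total_train_batch_size:", total_train_batch_size)
for step in (0, 150, 300, 950, 1600):
    print(step, linear_lr(step, total_steps))
```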

Training results

Training Loss	Epoch	Step	Validation Loss	Wer	Cer
7.0431	19.05	300	3.4423	1.0	1.0
2.3233	38.1	600	0.5965	0.4757	0.1329
0.5676	57.14	900	0.3962	0.3584	0.0954
0.3611	76.19	1200	0.3651	0.3190	0.0820
0.2996	95.24	1500	0.3611	0.2992	0.0786

Framework versions

Transformers 4.33.0
Pytorch 2.0.0
Datasets 2.1.0
Tokenizers 0.13.3