---
language: mr
datasets:
- openslr
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Large 53 Marathi by Sumedh Khodke
  results:
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR mr
      type: openslr
    metrics:
       - name: Test WER
         type: wer
         value: 12.7
---

# Wav2Vec2-Large-XLSR-53-Marathi
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Marathi using the [OpenSLR SLR64](http://openslr.org/64/) dataset. When using this model, make sure that your speech input is sampled at 16 kHz. The training data contains only female voices, but the model also works well on male voices.
**WER (Word Error Rate) on the Test Set**: 12.70 %  
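
For readers unfamiliar with the metric, here is a minimal sketch of how WER can be computed for a single utterance: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of this model's evaluation code, which relies on the `wer` metric from `datasets`.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of three reference words.
reference = "मी शाळेत जातो"
hypothesis = "मी शाळेत जाते"
print("WER: {:.2f} %".format(100 * word_error_rate(reference, hypothesis)))
```

A test-set figure such as the one above aggregates errors over many utterances, so it is not simply the value for any single sentence.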
## Usage
The model can be used directly without a language model as follows, given that your dataset has Marathi `actual_text` and `path_in_folder` columns:
```python
import torch, torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# `all_data` is assumed to be a DatasetDict you have already loaded, with a "test" split
mr_test_dataset_new = all_data['test']

processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi") 
model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi") 

# Assumes 48 kHz source audio; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)  # (orig_freq, new_freq)
# Preprocess the dataset: read the audio files as arrays
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch
mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
inputs = processor(mr_test_dataset_new["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", mr_test_dataset_new["actual_text"][:5])
```
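
The `argmax` followed by `batch_decode` step above is greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. The following is a minimal pure-Python sketch of that idea. The tiny vocabulary, the `blank_id` value, and the function name `ctc_greedy_decode` are hypothetical and chosen only for illustration; the real processor uses the model's own vocabulary and handles additional details such as word delimiters.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    # groupby merges runs of identical consecutive ids into one entry each.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary: id 0 plays the role of the CTC blank.
id_to_char = {0: "", 1: "c", 2: "a", 3: "t"}

# Per-frame argmax ids for a short clip (made up for this example).
frame_ids = [1, 1, 0, 2, 2, 2, 0, 3]
print(ctc_greedy_decode(frame_ids, id_to_char))  # → cat
```

The blank token is what lets the model emit a genuinely doubled character: `[1, 0, 1]` decodes to `cc`, whereas `[1, 1]` collapses to a single `c`.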
## Evaluation
The model was evaluated on a held-out 10% of the Marathi data from OpenSLR SLR64.
```python
import re, torch, torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# `all_data` is assumed to be a DatasetDict you have already loaded, with a "test" split
mr_test_dataset_new = all_data['test']
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi") 
model.to("cuda")

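# Punctuation stripped from the reference transcripts before scoring
chars_to_ignore_regex = r'[,?.!\-;:"“]'
# Assumes 48 kHz source audio; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)  # (orig_freq, new_freq)
# Preprocess the dataset: normalize the text and read the audio files as arrays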
def speech_file_to_array_fn(batch):
  batch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch
mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch
result = mr_test_dataset_new.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
```
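
To make the text-normalization step in the evaluation script concrete, here is a small standalone sketch that applies the same character class to sample strings. The helper name `normalize_transcript` and the sample strings are assumptions for illustration only.

```python
import re

# Same characters as the evaluation script's chars_to_ignore_regex.
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:"“]')


def normalize_transcript(text: str) -> str:
    """Strip the ignored punctuation, then lowercase."""
    return CHARS_TO_IGNORE.sub("", text).lower()


# Latin-script sample: punctuation removed and case folded.
print(normalize_transcript("Hello, World!"))  # → hello world

# Devanagari has no letter case, so lower() leaves Marathi text unchanged.
print(normalize_transcript("नमस्कार, जग!"))  # → नमस्कार जग
```

Note that the character class as written covers the opening curly quote `“` but not the closing `”` or the Devanagari danda `।`; if your transcripts contain those, they would pass through unchanged and count toward WER.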

## Training
The train-test split ratio was 90:10.
The Colab training notebook can be found [here](https://colab.research.google.com/drive/1wX46fjExcgU5t3AsWhSPTipWg_aMDg2f?usp=sharing).

## Training Config and Summary 
The Weights & Biases run summary can be found [here](https://wandb.ai/wandb/xlsr/runs/3itdhtb8/overview?workspace=user-sumedhkhodke).