Update README.md

c3645dc over 3 years ago

4.05 kB

	---
	language: mr
	datasets:
	- openslr
	metrics:
	- wer
	tags:
	- audio
	- automatic-speech-recognition
	- speech
	- xlsr-fine-tuning-week
	license: apache-2.0
	model-index:
	- name: XLSR Wav2Vec2 Large 53 Marathi by Sumedh Khodke
	results:
	- task:
	name: Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: OpenSLR mr
	type: openslr
	metrics:
	- name: Test WER
	type: wer
	value: 12.7
	---

	# Wav2Vec2-Large-XLSR-53-Marathi
	Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Marathi using the [OpenSLR SLR64](http://openslr.org/64/) dataset. When using this model, make sure that your speech input is sampled at 16kHz. This data contains only female voices but it works well for male voices too.
	WER (Word Error Rate) on the Test Set: 12.70 %
	## Usage
	The model can be used directly without a language model as follows, given that your dataset has Marathi `actual_text` and `path_in_folder` columns:
	```python
	import torch, torchaudio
	from datasets import load_dataset
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	mr_test_dataset_new = all_data['test']

	processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
	model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")

	resampler = torchaudio.transforms.Resample(48_000, 16_000) #first arg - input sample, second arg - output sample
	# Preprocessing the datasets. We need to read the aduio files as arrays
	def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch
	mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
	inputs = processor(mr_test_dataset_new["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)
	with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
	predicted_ids = torch.argmax(logits, dim=-1)
	print("Prediction:", processor.batch_decode(predicted_ids))
	print("Reference:", mr_test_dataset_new["actual_text"][:5])
	```
	## Evaluation
	Evaluated on 10% of the Marathi data on Open SLR-64.
	```python
	import re, torch, torchaudio
	from datasets import load_dataset, load_metric
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	mr_test_dataset_new = all_data['test']
	wer = load_metric("wer")

	processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
	model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
	model.to("cuda")

	chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'
	resampler = torchaudio.transforms.Resample(48_000, 16_000)
	# Preprocessing the datasets. We need to read the aduio files as arrays
	def speech_file_to_array_fn(batch):
	batch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
	speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch
	mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
	def evaluate(batch):
	inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
	with torch.no_grad():
	logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
	pred_ids = torch.argmax(logits, dim=-1)
	batch["pred_strings"] = processor.batch_decode(pred_ids)
	return batch
	result = mr_test_dataset_new.map(evaluate, batched=True, batch_size=8)
	print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
	```

	## Training
	Train-Test ratio was 90:10.
	Colab training notebook can be found [here](https://colab.research.google.com/drive/1wX46fjExcgU5t3AsWhSPTipWg_aMDg2f?usp=sharing).

	## Training Config and Summary
	weights-and-biases run summary [here](https://wandb.ai/wandb/xlsr/runs/3itdhtb8/overview?workspace=user-sumedhkhodke)