facebook/wav2vec2-base-10k-voxpopuli-ft-hr

Automatic Speech Recognition


patrickvonplaten committed on May 5, 2021

Commit 9c32b65 • 1 Parent(s): 7beaa64

correct readme

Files changed (1)

README.md +50 -0


@@ -17,3 +17,53 @@ Learning, Semi-Supervised Learning and Interpretation](https://arxiv.org/abs/210
 **Authors**: *Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux* from *Facebook AI*
 See the official website for more information, [here](https://github.com/facebookresearch/voxpopuli/)

+# Usage for inference
+The following example shows how to run inference with the model on a sample of the [Common Voice dataset](https://commonvoice.mozilla.org/en/datasets):
+```python
+#!/usr/bin/env python3
+from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+from datasets import load_dataset
+import torchaudio
+import torch
+# load model & processor
+model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-hr")
+processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-hr")
+# load dataset
+ds = load_dataset("common_voice", "hr", split="validation[:1%]")
+# Common Voice audio is sampled at 48 kHz, but the model expects 16 kHz input
+common_voice_sample_rate = 48000
+target_sample_rate = 16000
+resampler = torchaudio.transforms.Resample(common_voice_sample_rate, target_sample_rate)
+# define mapping fn to read in sound file and resample
+def map_to_array(batch):
+    speech, _ = torchaudio.load(batch["path"])
+    speech = resampler(speech)
+    batch["speech"] = speech[0]
+    return batch
+# load all audio files
+ds = ds.map(map_to_array)
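+# preprocess the first 5 data samples into a padded batch
+inputs = processor(ds[:5]["speech"], sampling_rate=target_sample_rate, return_tensors="pt", padding=True)
+# run inference; gradients are not needed
+with torch.no_grad():
+    logits = model(**inputs).logits
+# greedy decoding: take the most likely token id per frame
+predicted_ids = torch.argmax(logits, dim=-1)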
+print(processor.batch_decode(predicted_ids))
+```
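
The last two steps of the snippet above (an argmax over the logits, then `batch_decode`) amount to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, and drop the blank token. The following is a minimal plain-Python sketch of that idea, not the library's implementation. The toy vocabulary, the blank id, and the helper names `greedy_ctc_decode` and `one_hot` are hypothetical and chosen only for illustration.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token
# and "|" marks a word boundary. These are illustrative assumptions only.
VOCAB = ["<pad>", "|", "a", "d", "n"]
BLANK_ID = 0


def greedy_ctc_decode(frame_scores):
    """Decode a list of per-frame score lists into a string."""
    # Argmax step: most likely token id for every frame.
    ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # Collapse runs of identical ids, then drop blank tokens.
    collapsed = [token_id for token_id, _ in groupby(ids)]
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    # Turn word-boundary markers into spaces.
    return "".join(tokens).replace("|", " ").strip()


def one_hot(index):
    """Build a score vector whose maximum is at `index` (illustration helper)."""
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]


# Frame-level ids: d d _ a a _ _ n | d _ a   (where _ is the blank)
frames = [one_hot(i) for i in [3, 3, 0, 2, 2, 0, 0, 4, 1, 3, 0, 2]]
decoded = greedy_ctc_decode(frames)
print(decoded)
```

The blank between two identical tokens is what lets a repeated character survive the collapse step, which is why blanks are removed only after repeats are merged.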