patrickvonplaten committed
Commit 9c32b65
1 Parent(s): 7beaa64

correct readme

Files changed (1)
  1. README.md +50 -0
README.md CHANGED
@@ -17,3 +17,53 @@ Learning, Semi-Supervised Learning and Interpretation](https://arxiv.org/abs/210
  **Authors**: *Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux* from *Facebook AI*
 
  See the official website for more information, [here](https://github.com/facebookresearch/voxpopuli/)
+
+
+ # Usage for inference
+
+ The following example shows how to run inference with the model on a sample from the [Common Voice dataset](https://commonvoice.mozilla.org/en/datasets).
+
+ ```python
+ #!/usr/bin/env python3
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+ from datasets import load_dataset
+ import torchaudio
+ import torch
+
+ # load model & processor
+ model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-hr")
+ processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-hr")
+
+ # load dataset
+ ds = load_dataset("common_voice", "hr", split="validation[:1%]")
+
+ # Common Voice audio (48 kHz) does not match the target sampling rate (16 kHz)
+ common_voice_sample_rate = 48000
+ target_sample_rate = 16000
+
+ # resample audio
+ resampler = torchaudio.transforms.Resample(common_voice_sample_rate, target_sample_rate)
+
+
+ # define mapping fn to read in sound file and resample
+ def map_to_array(batch):
+     speech, _ = torchaudio.load(batch["path"])
+     speech = resampler(speech)
+     batch["speech"] = speech[0]
+     return batch
+
+
+ # load all audio files
+ ds = ds.map(map_to_array)
+
+ # run inference on the first 5 data samples
+ inputs = processor(ds[:5]["speech"], sampling_rate=target_sample_rate, return_tensors="pt", padding=True)
+
+ # inference (no gradients are needed)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ predicted_ids = torch.argmax(logits, dim=-1)
+
+ print(processor.batch_decode(predicted_ids))
+ ```
+
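
The last two steps of the added snippet, `torch.argmax` over the logits followed by `processor.batch_decode`, amount to greedy CTC decoding: pick the most likely token for each audio frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates only that collapse step in plain Python. The function name `ctc_greedy_collapse`, the toy vocabulary, and the blank id of 0 are assumptions made for illustration; they are not taken from the model's tokenizer, which additionally handles word delimiters and special tokens.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame argmax ids into text, CTC-style (illustrative only)."""
    # 1) merge runs of identical consecutive ids into a single id
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) drop the blank token, which is what separates genuine double letters
    kept = [token_id for token_id in merged if token_id != blank_id]
    return "".join(id_to_char[token_id] for token_id in kept)


# hypothetical toy vocabulary; id 0 plays the role of the CTC blank
toy_vocab = {0: "", 1: "d", 2: "a", 3: "n"}

# frames "d d _ a a _ n n" collapse to a single "d", "a", "n"
print(ctc_greedy_collapse([1, 1, 0, 2, 2, 0, 3, 3], toy_vocab))

# a blank between two identical ids keeps both letters
print(ctc_greedy_collapse([2, 0, 2], toy_vocab))
```

Merging repeats before removing blanks is what lets a blank frame between two identical ids survive as a double letter; reversing the order would merge them into one.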