DewiBrynJones committed
Commit f48d0ef
1 Parent(s): 19831fb

Update README.md

Files changed (1)
  1. README.md +27 -15
README.md CHANGED
@@ -14,48 +14,60 @@ language:
  pipeline_tag: automatic-speech-recognition
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # wav2vec2-xlsr-53-ft-ccv-en-cy

- A speech recognition acoustic model for Welsh and English, fine-tuned from [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) using English/Welsh balanced data derived from version 11 of their respective Common Voice datasets (https://commonvoice.mozilla.org/cy/datasets). Custom bilingual Common Voice train/dev and test splits were built using the scripts at https://github.com/techiaith/docker-commonvoice-custom-splits-builder#introduction
-
- Source code and scripts for training wav2vec2-xlsr-ft-en-cy can be found at [https://github.com/techiaith/docker-wav2vec2-cy](https://github.com/techiaith/docker-wav2vec2-cy/blob/main/train/fine-tune/python/run_en_cy.sh).

  ## Usage

- The wav2vec2-xlsr-53-ft-ccv-en-cy model can be used directly as follows:

  ```python
  import torch
  import torchaudio
  import librosa

- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

- processor = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")
- model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")

- audio, rate = librosa.load(audio_file, sr=16000)

  inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
      tlogits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

- # greedy decoding
- predicted_ids = torch.argmax(logits, dim=-1)

- print("Prediction:", processor.batch_decode(predicted_ids))

  ```

  ## Evaluation

- According to a balanced English+Welsh test set derived from Common Voice version 16.1, the WER of techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy is **23.79%**

  However, when evaluated with language specific test sets, the model exhibits a bias to perform better with Welsh.

 
  pipeline_tag: automatic-speech-recognition
  ---

+ # wav2vec2-xlsr-53-ft-cy-en-withlm

+ This model is a version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) that has been fine-tuned with a custom bilingual dataset derived from the Welsh and English data releases of the Mozilla Foundation's Common Voice project. See: [techiaith/commonvoice_16_1_en_cy](https://huggingface.co/datasets/techiaith/commonvoice_16_1_en_cy).

+ In addition, this model includes a single KenLM n-gram model trained on balanced collections of Welsh and English texts from [OSCAR](https://huggingface.co/datasets/oscar). This avoids the need for any language detection to determine whether to use a Welsh or an English n-gram model during CTC decoding.

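For illustration, the decoder wiring described above can be sketched with `pyctcdecode` and `Wav2Vec2ProcessorWithLM`. None of this is needed to use the published model, which already ships with its decoder; the ARPA file name below is hypothetical:

```python
# Sketch: pairing one bilingual KenLM n-gram with CTC decoding.
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2ProcessorWithLM

repo = "techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm"
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(repo)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(repo)

# Labels must be ordered by token id so they line up with the logit columns.
labels = [tok for tok, _ in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])]

# A single n-gram trained on mixed Welsh+English text removes the need for language ID.
decoder = build_ctcdecoder(labels, kenlm_model_path="cy_en_oscar_5gram.arpa")  # hypothetical file

processor = Wav2Vec2ProcessorWithLM(feature_extractor=feature_extractor,
                                    tokenizer=tokenizer,
                                    decoder=decoder)
```
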
  ## Usage

+ The `wav2vec2-xlsr-53-ft-cy-en-withlm` model can be used directly as follows:

  ```python
  import torch
  import torchaudio
  import librosa

+ from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

+ processor = Wav2Vec2ProcessorWithLM.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
+ model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

+ # load and resample the recording to the 16 kHz the model expects
+ audio, rate = librosa.load(<path/to/audio_file>, sr=16000)

  inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
      tlogits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

+ # beam-search CTC decoding with the built-in KenLM language model
+ print("Prediction: ", processor.batch_decode(tlogits.numpy(), beam_width=10).text[0].strip())

+ ```
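
Since `torchaudio` is already imported above, the loading step could equally be done without librosa; a minimal alternative sketch (the file name is a placeholder):

```python
import torchaudio

waveform, rate = torchaudio.load("speech.wav")  # placeholder file name
if rate != 16_000:
    # resample to the 16 kHz rate the acoustic model was trained on
    waveform = torchaudio.transforms.Resample(orig_freq=rate, new_freq=16_000)(waveform)
audio = waveform.squeeze().numpy()  # 1-D mono array, as `processor` expects
```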

+ Usage with a pipeline is even simpler...

+ ```python
+ from transformers import pipeline
+
+ transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
+
+ def transcribe(audio):
+     return transcriber(audio)["text"]
+
+ transcribe(<path/or/url/to/any/audiofile>)
+ ```
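
For long recordings, the same pipeline can transcribe in overlapping chunks via its `chunk_length_s` and `stride_length_s` call options; the file name and values below are illustrative:

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition",
                       model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# Split long audio into 30 s windows with 5 s of overlapping context on each side.
result = transcriber("long_recording.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```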

  ## Evaluation

+ According to a balanced English+Welsh test set derived from Common Voice version 16.1, the WER of techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm is **23.79%**.

  However, when evaluated with language specific test sets, the model exhibits a bias to perform better with Welsh.