---
license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
metrics:
- wer
model-index:
- name: wav2vec2-xlsr-53-ft-ccv-en-cy
results: []
datasets:
- techiaith/commonvoice_16_1_en_cy
language:
- cy
- en
pipeline_tag: automatic-speech-recognition
---
# wav2vec2-xlsr-53-ft-cy-en-withlm
This model is a version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
that has been fine-tuned on a custom bilingual dataset derived from the Welsh
and English data releases of the Mozilla Foundation's Common Voice project. See: [techiaith/commonvoice_16_1_en_cy](https://huggingface.co/datasets/techiaith/commonvoice_16_1_en_cy).
In addition, the model bundles a single KenLM n-gram language model trained on balanced
collections of Welsh and English texts from [OSCAR](https://huggingface.co/datasets/oscar).
This removes the need for any language detection to decide between a Welsh or an English n-gram model during CTC decoding.
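For reference, a processor with an n-gram language model can be assembled from a KenLM ARPA file with [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) roughly as sketched below. The ARPA file name is a placeholder for illustration only; the published checkpoint already ships with its bilingual language model, so this step is not needed to use the model.
```python
from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ProcessorWithLM,
)

model_id = "techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm"
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(model_id)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)

# pyctcdecode expects the CTC labels ordered by token id
vocab = tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

# "bilingual_en_cy.arpa" is a hypothetical file name for a single KenLM n-gram
# trained on a balanced mix of Welsh and English text
decoder = build_ctcdecoder(labels, kenlm_model_path="bilingual_en_cy.arpa")

processor = Wav2Vec2ProcessorWithLM(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    decoder=decoder,
)
```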
## Usage
The `wav2vec2-xlsr-53-ft-cy-en-withlm` model can be used directly as follows:
```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# load the audio and resample to the model's 16 kHz sampling rate
audio, rate = librosa.load(<path/to/audio_file>, sr=16000)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# beam-search decode with the bundled KenLM language model
print("Prediction: ", processor.batch_decode(logits.numpy(), beam_width=10).text[0].strip())
```
Usage with a pipeline is even simpler...
```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

def transcribe(audio):
    return transcriber(audio)["text"]

transcribe(<path/or/url/to/any/audiofile>)
```
## Evaluation
Evaluated on a balanced English+Welsh test set derived from Common Voice version 16.1, techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm achieves a WER of **23.79%**.
However, when evaluated with language-specific test sets, the model performs considerably better on Welsh than on English:
| Common Voice test set language | WER (%) | CER (%) |
| -------- | --- | --- |
| EN+CY | 23.79 | 9.68 |
| EN | 34.47 | 14.83 |
| CY | 12.34 | 3.55 |
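A rough sketch of how these figures could be reproduced with the [Evaluate](https://huggingface.co/docs/evaluate) library is shown below. The `test` split name and the `audio`/`sentence` column names are assumptions based on the usual Common Voice layout, and the published scores may additionally involve text normalisation not shown here.
```python
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm"
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# split and column names are assumptions based on the Common Voice layout
test = load_dataset("techiaith/commonvoice_16_1_en_cy", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16_000))

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

predictions, references = [], []
for sample in test:
    inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predictions.append(processor.batch_decode(logits.numpy()).text[0].strip())
    references.append(sample["sentence"])

print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
```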
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 800
- training_steps: 9000
- mixed_precision_training: Native AMP
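
For reference, these settings correspond roughly to the following `transformers.TrainingArguments`. The output directory and save cadence are illustrative guesses; the 500-step evaluation interval matches the validation table below.
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-53-ft-ccv-en-cy",  # illustrative name
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,   # effective train batch size of 64
    warmup_steps=800,
    max_steps=9000,
    lr_scheduler_type="linear",
    fp16=True,                       # native AMP mixed precision
    seed=42,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
)
```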
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 6.0574 | 0.25 | 500 | 2.0297 | 0.9991 |
| 1.224 | 0.5 | 1000 | 0.5368 | 0.4342 |
| 0.434 | 0.75 | 1500 | 0.4861 | 0.3891 |
| 0.3295 | 1.01 | 2000 | 0.4301 | 0.3411 |
| 0.2739 | 1.26 | 2500 | 0.3818 | 0.3053 |
| 0.2619 | 1.51 | 3000 | 0.3894 | 0.3060 |
| 0.2517 | 1.76 | 3500 | 0.3497 | 0.2802 |
| 0.2244 | 2.01 | 4000 | 0.3519 | 0.2792 |
| 0.1854 | 2.26 | 4500 | 0.3376 | 0.2718 |
| 0.1779 | 2.51 | 5000 | 0.3206 | 0.2520 |
| 0.1749 | 2.77 | 5500 | 0.3169 | 0.2535 |
| 0.1636 | 3.02 | 6000 | 0.3122 | 0.2465 |
| 0.137 | 3.27 | 6500 | 0.3054 | 0.2382 |
| 0.1311 | 3.52 | 7000 | 0.2956 | 0.2280 |
| 0.1261 | 3.77 | 7500 | 0.2898 | 0.2236 |
| 0.1187 | 4.02 | 8000 | 0.2847 | 0.2176 |
| 0.1011 | 4.27 | 8500 | 0.2763 | 0.2124 |
| 0.0981 | 4.52 | 9000 | 0.2754 | 0.2115 |
### Framework versions
- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2