---
language:
- eo
license: apache-2.0
tags:
- automatic-speech-recognition
- mozilla-foundation/common_voice_13_0
- generated_from_trainer
datasets:
- common_voice_13_0
metrics:
- wer
model-index:
- name: wav2vec2-common_voice_13_0-eo-10_1
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_13_0
      type: common_voice_13_0
      config: eo
      split: validation
      args: 'Config: eo, Training split: train, Eval split: validation'
    metrics:
    - name: Wer
      type: wer
      value: 0.05342994850125446
    - name: CER
      type: cer
      value: 0.0098
---

# wav2vec2-common_voice_13_0-eo-10_1, an Esperanto speech recognizer

This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the [mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) Esperanto dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0391
- Cer: 0.0098
- Wer: 0.0534

The first 10 examples in the evaluation set:

| Actual<br>Predicted | CER |
|:--------------------|:----|
| `la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo`<br>`la orienta parto apud benino kaj niĝerio estis nomita sklafmarbordo` | 0.014925373134328358 |
| `en la sekva jaro li ricevis premion`<br>`en la sekva jaro li ricevis premion` | 0.0 |
| `ŝi studis historion ĉe la universitato de brita kolumbio`<br>`ŝi studis historion ĉe la universitato de brita kolumbio` | 0.0 |
| `larĝaj ŝtupoj kuras al la fasado`<br>`larĝaj ŝtupoj kuras al la fasado` | 0.0 |
| `la municipo ĝuas duan epokon de etendo kaj disvolviĝo`<br>`la municipo ĝuas duan epokon de etendo kaj disvolviĝo` | 0.0 |
| `li estis ankaŭ katedrestro kaj dekano`<br>`li estis ankaŭ katedresto kaj dekano` | 0.02702702702702703 |
| `librovendejo apartenas al la muzeo`<br>`librovendejo apartenas al l muzeo` | 0.029411764705882353 |
| `ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵaro de arbaroj`<br>`ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵo de arbaroj` | 0.02702702702702703 |
| `unue ili estas ruĝaj poste brunaj`<br>`unue ili estas ruĝaj poste brunaj` | 0.0 |
| `la loĝantaro laboras en la proksima ĉefurbo`<br>`la loĝantaro laboras en la proksima ĉefurbo` | 0.0 |

Compared to the previous model ([xekri/wav2vec2-common_voice_13_0-eo-10](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-10)), the predictions for the examples above differ as follows (previous -> current):
* eepokon -> epokon
* katedristo -> katedresto
* al la muzeo -> al l muzeo
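
The CER column in the table above is consistent with character-level edit distance divided by the length of the reference. The following is a minimal sketch of that calculation, not the metric code used by the training script (which is not shown on this card). The function names `edit_distance` and `cer` are illustrative.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to a non-empty reference string."""
    # Normalize so that letters such as "ŭ" count as one character each.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    return edit_distance(ref, hyp) / len(ref)


# One deleted character ("r") over a 37-character reference.
print(cer("li estis ankaŭ katedrestro kaj dekano",
          "li estis ankaŭ katedresto kaj dekano"))
```

The same function gives 2 edits over 74 characters for the `subkreskaĵaro` example, which matches the equal CER values reported for those two rows.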

## Model description

See [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) for a description of the base model. This model is [xekri/wav2vec2-common_voice_13_0-eo-10](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-10) trained for 5 more epochs.

## Intended uses & limitations

Speech recognition for Esperanto. The base model was pretrained and fine-tuned on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
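
As a rough illustration of the 16 kHz requirement, the sketch below checks the sample rate of a PCM WAV file with the standard-library `wave` module before handing it to the `transformers` speech recognition pipeline. The Hub ID in `MODEL_ID` is an assumption inferred from the model name on this card, and `wav_sample_rate` and `transcribe` are illustrative helper names. Passing a file path to the pipeline also requires `ffmpeg` to be installed.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, as stated on this card

# Assumption: Hub ID inferred from the model name; adjust if it differs.
MODEL_ID = "xekri/wav2vec2-common_voice_13_0-eo-10_1"


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def transcribe(path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe a 16 kHz WAV file; reject any other sample rate."""
    rate = wav_sample_rate(path)
    if rate != EXPECTED_RATE:
        raise ValueError(
            f"{path} is sampled at {rate} Hz, expected {EXPECTED_RATE} Hz"
        )
    # Imported lazily so the sample-rate check works without transformers.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(path)["text"]
```

Calling `transcribe("sample.wav")` on a hypothetical 16 kHz recording would return the lowercase transcript as a string. Rejecting other rates outright is a deliberately strict choice for this sketch; resampling the audio first is an equally valid approach.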

The output is all lowercase and contains no punctuation.
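
When comparing the model output against your own reference text, the reference needs the same form. A minimal sketch, assuming that lowercasing and dropping every Unicode punctuation character is an acceptable approximation; the exact cleanup rules used by the training script may differ, and `normalize_transcript` is an illustrative name.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase text, drop punctuation, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Unicode general categories starting with "P" are punctuation.
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


print(normalize_transcript("Ŝi studis historion ĉe la Universitato de Brita Kolumbio."))
```

Note that this also removes hyphens and apostrophes, which may or may not match how a given reference corpus treats them.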

## Training and evaluation data

The training split was set to `train` while the eval split was set to `validation`. Some files were filtered out of the train and validation datasets due to bad data; see [xekri/wav2vec2-common_voice_13_0-eo-3](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-3) for a detailed discussion. In summary, I used `xekri/wav2vec2-common_voice_13_0-eo-3` to detect bad files, then hardcoded those files into the trainer code so that they are filtered out.
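
The hardcoded filter described above can be pictured roughly as follows. This is only a sketch: the file names in `BAD_FILES` are hypothetical placeholders, the real list lives in the modified training script, and the `path` key is assumed to hold each clip's file path.

```python
import os

# Hypothetical placeholders; not the actual list of bad files.
BAD_FILES = frozenset({
    "common_voice_eo_00000001.mp3",
    "common_voice_eo_00000002.mp3",
})


def keep_example(example: dict) -> bool:
    """Return True unless the example's audio file is on the bad-file list."""
    return os.path.basename(example["path"]) not in BAD_FILES


examples = [
    {"path": "clips/common_voice_eo_00000001.mp3", "sentence": "bad clip"},
    {"path": "clips/common_voice_eo_00000003.mp3", "sentence": "good clip"},
]
clean = [ex for ex in examples if keep_example(ex)]
print(len(clean))
```

With the `datasets` library, a predicate of this shape could be passed to `Dataset.filter` instead of using a list comprehension.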

## Training procedure

I used a modified version of [`run_speech_recognition_ctc.py`](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition) for training. See [`run_speech_recognition_ctc.py`](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-10/blob/main/run_speech_recognition_ctc.py) in this repo.

The parameters to the trainer are in [train.json](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-10/blob/main/train.json) in this repo.

The key changes between this training run and `xekri/wav2vec2-common_voice_13_0-eo-3`, aside from the filtering and the use of the full training and validation sets, are:

* Layer drop probability is 20%
* Train only for 5 epochs
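
For readers unfamiliar with layer drop: during training, each encoder layer is skipped with the given probability, and at inference every layer runs. The sketch below illustrates the idea in plain Python with stand-in layers. It is not the wav2vec 2.0 implementation, and `run_encoder` is an illustrative name.

```python
import random
from typing import Callable, Sequence


def run_encoder(
    x: float,
    layers: Sequence[Callable[[float], float]],
    layerdrop: float,
    training: bool,
    rng: random.Random,
) -> float:
    """Apply layers in order, skipping each with probability `layerdrop` while training."""
    for layer in layers:
        if training and rng.random() < layerdrop:
            continue  # this layer is dropped for the current forward pass
        x = layer(x)
    return x


# Stand-in "layers" that each add 1, so the result counts the layers applied.
layers = [lambda v: v + 1.0] * 24
rng = random.Random(42)
print(run_encoder(0.0, layers, layerdrop=0.2, training=True, rng=rng))
print(run_encoder(0.0, layers, layerdrop=0.2, training=False, rng=rng))
```

With a probability of 0.2, roughly one layer in five is skipped on each training pass, while the second call (inference) always applies all 24 stand-in layers.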

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- layerdrop: 0.2
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 5
- mixed_precision_training: Native AMP
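
Two of these values follow from the others: the total train batch size is the per-device batch size times the gradient accumulation steps, and the learning rate at any step follows from the linear schedule with warmup. The sketch below works through both. `TOTAL_STEPS` is an assumption, estimated from the training results table in this card (step 9000 falls at epoch 2.0, so about 4,500 steps per epoch), and `linear_lr` is an illustrative name rather than a library function.

```python
BASE_LR = 3e-5
WARMUP_STEPS = 500
TRAIN_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 2

# Assumption: about 4,500 optimizer steps per epoch (estimated from the
# results table), so 5 epochs come to about 22,500 steps.
TOTAL_STEPS = 22_500


def linear_lr(step: int, base_lr: float = BASE_LR,
              warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at `total`."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS
print(effective_batch)  # 32, matching total_train_batch_size

for step in (0, 250, 500, 11_500, 22_500):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at 3e-05 at step 500 and is halved by step 11,500, the midpoint of the decay phase.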

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Cer    | Wer    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
| 0.1142        | 0.22  | 1000  | 0.0483          | 0.0126 | 0.0707 |
| 0.1049        | 0.44  | 2000  | 0.0474          | 0.0123 | 0.0675 |
| 0.0982        | 0.67  | 3000  | 0.0471          | 0.0120 | 0.0664 |
| 0.092         | 0.89  | 4000  | 0.0459          | 0.0117 | 0.0640 |
| 0.0847        | 1.11  | 5000  | 0.0459          | 0.0115 | 0.0631 |
| 0.0837        | 1.33  | 6000  | 0.0453          | 0.0113 | 0.0624 |
| 0.0803        | 1.56  | 7000  | 0.0443          | 0.0109 | 0.0598 |
| 0.0826        | 1.78  | 8000  | 0.0441          | 0.0110 | 0.0604 |
| 0.0809        | 2.0   | 9000  | 0.0437          | 0.0110 | 0.0605 |
| 0.0728        | 2.22  | 10000 | 0.0451          | 0.0109 | 0.0597 |
| 0.0707        | 2.45  | 11000 | 0.0444          | 0.0108 | 0.0591 |
| 0.0698        | 2.67  | 12000 | 0.0442          | 0.0105 | 0.0576 |
| 0.0981        | 2.89  | 13000 | 0.0411          | 0.0104 | 0.0572 |
| 0.0928        | 3.11  | 14000 | 0.0413          | 0.0102 | 0.0561 |
| 0.0927        | 3.34  | 15000 | 0.0410          | 0.0102 | 0.0565 |
| 0.0886        | 3.56  | 16000 | 0.0402          | 0.0102 | 0.0558 |
| 0.091         | 3.78  | 17000 | 0.0400          | 0.0101 | 0.0553 |
| 0.0888        | 4.0   | 18000 | 0.0398          | 0.0100 | 0.0546 |
| 0.0885        | 4.23  | 19000 | 0.0395          | 0.0099 | 0.0542 |
| 0.0869        | 4.45  | 20000 | 0.0394          | 0.0099 | 0.0540 |
| 0.0844        | 4.67  | 21000 | 0.0393          | 0.0098 | 0.0539 |
| 0.0882        | 4.89  | 22000 | 0.0391          | 0.0098 | 0.0537 |


### Framework versions

- Transformers 4.29.2
- Pytorch 2.0.1+cu117
- Datasets 2.12.0
- Tokenizers 0.13.3