---
base_model: masoudmzb/wav2vec2-xlsr-multilingual-53-fa
metrics:
- wer
widget:
- example_title: M22N20
src: >-
https://huggingface.co/lnxdx/20_2000_1e-5_hp-mehrdad/blob/main/M16A01.wav
- example_title: Common Voice sample 2978
src: >-
https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/resolve/main/sample2978.flac
- example_title: Common Voice sample 5168
src: >-
https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/resolve/main/sample5168.flac
model-index:
- name: wav2vec2-large-xlsr-persian-shemo
results:
- task:
name: Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice 13.0 fa
type: common_voice_13_0
args: fa
metrics:
- name: Test WER
type: wer
value: 19.21
- task:
name: Speech Recognition
type: automatic-speech-recognition
dataset:
name: ShEMO
type: shemo
args: fa
metrics:
- name: Test WER
type: wer
value: 32.85
language:
- fa
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech
- automatic-speech-recognition
- asr
---
# Wav2Vec2 Large XLSR Persian ShEMO
This model is a fine-tuned version of [masoudmzb/wav2vec2-xlsr-multilingual-53-fa](https://huggingface.co/masoudmzb/wav2vec2-xlsr-multilingual-53-fa)
on the [ShEMO](https://github.com/pariajm/sharif-emotional-speech-dataset) dataset for speech recognition in Persian (Farsi).
When using this model, make sure that your speech input is sampled at 16 kHz.
It achieves the following results:
- Loss on ShEMO train set: 0.7618
- Loss on ShEMO dev set: 0.6728
- WER on ShEMO train set: 30.47
- WER on ShEMO dev set: 32.85
- WER on Common Voice 13 test set: 19.21
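A minimal transcription sketch follows, assuming the repository id `lnxdx/Wav2Vec2-Large-XLSR-Persian-ShEMO` used later in this card and a local file `sample.wav` (replace both with your own values); it resamples the audio to the 16 kHz the model expects:
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Repository id as named in this card; adjust it if the actual repo id differs.
model_id = "lnxdx/Wav2Vec2-Large-XLSR-Persian-ShEMO"

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load any audio file and resample it to 16 kHz, which the model expects.
speech, sample_rate = torchaudio.load("sample.wav")
speech = torchaudio.functional.resample(speech, orig_freq=sample_rate, new_freq=16_000)
speech = speech.mean(dim=0)  # collapse stereo to mono

inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```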
## Evaluation results 🌡️
| Checkpoint Name | WER on ShEMO dev set | WER on Common Voice 13 test set | Max WER |
| :---------------------------------------------------------------------------------------------------------------: | :------: | :-------: | :---: |
| [m3hrdadfi/wav2vec2-large-xlsr-persian-v3](https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3) | 46.55 | **17.43** | 46.55 |
| [m3hrdadfi/wav2vec2-large-xlsr-persian-shemo](https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-shemo) | **7.42** | 33.88 | 33.88 |
| [masoudmzb/wav2vec2-xlsr-multilingual-53-fa](https://huggingface.co/masoudmzb/wav2vec2-xlsr-multilingual-53-fa) | 56.54 | 24.68 | 56.54 |
| This checkpoint | 32.85 | 19.21 | **32.85** |
As the table shows, this checkpoint has the lowest maximum WER across the two evaluation sets, i.e., the best worst-case performance among the compared models.
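The WER figures above can be reproduced with the `evaluate` library (backed by `jiwer`); the sketch below uses placeholder transcripts rather than the actual ShEMO or Common Voice data:
```python
import evaluate  # pip install evaluate jiwer

wer_metric = evaluate.load("wer")

# Placeholder transcripts: substitute the model's outputs and the reference
# texts from the ShEMO dev set or the Common Voice 13 fa test set.
predictions = ["سلام دنیا", "این یک آزمایش است"]
references = ["سلام دنیا", "این یک آزمایش بود"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}")  # reported as a percentage, as in the table above
```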
## Training procedure
#### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 2000
- mixed_precision_training: Native AMP
Gradient accumulation is used here to reach an effective batch size of 16 from a per-device batch size of 8; you may need it as well if your GPU cannot fit larger batches.
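As a rough sketch, these settings map onto a `transformers.TrainingArguments` configuration along the following lines; the `output_dir` and `save_steps` values are assumptions, while the 100-step logging/evaluation cadence matches the results table below:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-large-xlsr-persian-shemo",  # assumed name
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective train batch size of 16
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=2000,
    seed=42,
    fp16=True,                       # native AMP mixed precision
    evaluation_strategy="steps",
    eval_steps=100,                  # matches the 100-step cadence in the table
    logging_steps=100,
    save_steps=500,                  # assumption, not taken from this card
)
```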
#### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 1.8553 | 0.62 | 100 | 1.4126 | 0.4866 |
| 1.4083 | 1.25 | 200 | 1.0428 | 0.4366 |
| 1.1718 | 1.88 | 300 | 0.8683 | 0.4127 |
| 0.9919 | 2.5 | 400 | 0.7921 | 0.3919 |
| 0.9493 | 3.12 | 500 | 0.7676 | 0.3744 |
| 0.9414 | 3.75 | 600 | 0.7247 | 0.3695 |
| 0.8897 | 4.38 | 700 | 0.7202 | 0.3598 |
| 0.8716 | 5.0 | 800 | 0.7096 | 0.3546 |
| 0.8467 | 5.62 | 900 | 0.7023 | 0.3499 |
| 0.8227 | 6.25 | 1000 | 0.6994 | 0.3411 |
| 0.855 | 6.88 | 1100 | 0.6883 | 0.3432 |
| 0.8457 | 7.5 | 1200 | 0.6773 | 0.3426 |
| 0.7614 | 8.12 | 1300 | 0.6913 | 0.3344 |
| 0.8127 | 8.75 | 1400 | 0.6827 | 0.3335 |
| 0.8443 | 9.38 | 1500 | 0.6725 | 0.3356 |
| 0.7548 | 10.0 | 1600 | 0.6759 | 0.3318 |
| 0.7839 | 10.62 | 1700 | 0.6773 | 0.3286 |
| 0.7912 | 11.25 | 1800 | 0.6748 | 0.3286 |
| 0.8238 | 11.88 | 1900 | 0.6735 | 0.3297 |
| 0.7618 | 12.5 | 2000 | 0.6728 | 0.3286 |
#### Hyperparameter tuning
Several models with different hyperparameters were trained. The following figures show the training process for three of them.
![wer](wandb-wer.png)
![loss](wandb-loss.png)
**20_2000_1e-5_hp-mehrdad** is the current model (lnxdx/Wav2Vec2-Large-XLSR-Persian-ShEMO) and its hyperparameters are:
```python
from transformers import Wav2Vec2ForCTC

# model_name_or_path and last_checkpoint are assumed to be defined earlier in the
# training script (the base model id and an optional checkpoint to resume from).
model = Wav2Vec2ForCTC.from_pretrained(
    model_name_or_path if not last_checkpoint else last_checkpoint,
    # hp-mehrdad: Hyperparams of 'm3hrdadfi/wav2vec2-large-xlsr-persian-v3'
    attention_dropout = 0.05316,
    hidden_dropout = 0.01941,
    feat_proj_dropout = 0.01249,
    mask_time_prob = 0.04529,
    layerdrop = 0.01377,
    ctc_loss_reduction = 'mean',
    ctc_zero_infinity = True,
)
```
The hyperparameters of **19_2000_1e-5_hp-base** are:
```python
model = Wav2Vec2ForCTC.from_pretrained(
model_name_or_path if not last_checkpoint else last_checkpoint,
    # hp-base: Hyperparams similar to 'facebook/wav2vec2-large-xlsr-53' or 'facebook/wav2vec2-xls-r-300m'
attention_dropout = 0.1,
hidden_dropout = 0.1,
feat_proj_dropout = 0.1,
mask_time_prob = 0.075,
layerdrop = 0.1,
ctc_loss_reduction = 'mean',
ctc_zero_infinity = True,
)
```
And the hyperparameters of **22_2000_1e-5_hp-masoud** are:
```python
model = Wav2Vec2ForCTC.from_pretrained(
model_name_or_path if not last_checkpoint else last_checkpoint,
# hp-masoud: Hyperparams of 'masoudmzb/wav2vec2-xlsr-multilingual-53-fa'
attention_dropout = 0.2,
hidden_dropout = 0.2,
feat_proj_dropout = 0.1,
mask_time_prob = 0.2,
layerdrop = 0.2,
ctc_loss_reduction = 'mean',
ctc_zero_infinity = True,
)
```
The learning rate was 1e-5 for all three runs.
As the figures show, this model achieves the lowest WER on the validation (evaluation) set.
The script used for training can be found [here](https://colab.research.google.com/github/m3hrdadfi/notebooks/blob/main/Fine_Tune_XLSR_Wav2Vec2_on_Persian_ShEMO_ASR_with_%F0%9F%A4%97_Transformers_ipynb.ipynb).
Check out [this blog](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for more information.
#### Framework versions
- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0
## Contact us 🤝
If you have any technical questions regarding the model, pretraining, code, or publication, please open an issue in the repository. This is the *best* way to reach us.
## Citation ↩️