metadata
base_model: masoudmzb/wav2vec2-xlsr-multilingual-53-fa
metrics:
- wer
widget:
- example_title: M22N20
  src: https://huggingface.co/lnxdx/20_2000_1e-5_hp-mehrdad/blob/main/M16A01.wav
- example_title: Common Voice sample 2978
  src: >-
    https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/resolve/main/sample2978.flac
- example_title: Common Voice sample 5168
  src: >-
    https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/resolve/main/sample5168.flac
model-index:
- name: wav2vec2-large-xlsr-persian-shemo
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 13.0 fa
      type: common_voice_13_0
      args: fa
    metrics:
    - name: Test WER
      type: wer
      value: 19.21
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: ShEMO
      type: shemo
      args: fa
    metrics:
    - name: Test WER
      type: wer
      value: 32.85
language:
- fa
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech
- automatic-speech-recognition
- asr
Wav2Vec2 Large XLSR Persian ShEMO
This model is a fine-tuned version of masoudmzb/wav2vec2-xlsr-multilingual-53-fa on the ShEMO dataset for speech recognition in Persian (Farsi). When using this model, make sure that your speech input is sampled at 16 kHz.
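A minimal inference sketch, assuming the checkpoint is published on the Hugging Face Hub together with its processor (tokenizer plus feature extractor). The function name `transcribe` and its `model_id` argument are illustrative and not part of the original card, and `librosa` is only one convenient way to resample the audio to 16 kHz.

```python
TARGET_SAMPLE_RATE = 16_000  # the card asks for 16 kHz input


def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with a Wav2Vec2 CTC checkpoint.

    `model_id` is a placeholder: pass the Hub id (or local path) of this checkpoint.
    """
    # Heavy dependencies are imported lazily so the module itself stays light.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    # librosa resamples to the requested rate and mixes down to mono.
    speech, _ = librosa.load(audio_path, sr=TARGET_SAMPLE_RATE)

    inputs = processor(speech, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: most likely token per frame, then collapse repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("sample.wav", model_id)` with a local file and the checkpoint's id should return the Persian transcript as a string. Greedy decoding is used here for simplicity; no language model is assumed.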
It achieves the following results:
- Loss on ShEMO train set: 0.7618
- Loss on ShEMO dev set: 0.6728
- WER on ShEMO train set: 30.47
- WER on ShEMO dev set: 32.85
- WER on Common Voice 13 test set: 19.21
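The WER figures above are percentages. As a reminder of what the metric measures, here is a dependency-free sketch of word error rate: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative; the evaluation code behind the numbers above is not shown in this card and may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("a b c d", "a x c d")
print(f"{100 * wer:.2f}")  # prints 25.00
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.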
Evaluation
Checkpoint Name | WER on ShEMO dev set | WER on Common Voice 13 test set | Max WER (worst case) |
---|---|---|---|
m3hrdadfi/wav2vec2-large-xlsr-persian-v3 | 46.55 | 17.43 | 46.55 |
m3hrdadfi/wav2vec2-large-xlsr-persian-shemo | 7.42 | 33.88 | 33.88 |
masoudmzb/wav2vec2-xlsr-multilingual-53-fa | 56.54 | 24.68 | 56.54 |
This checkpoint | 32.85 | 19.21 | 32.85 |
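As you can see, my model has the lowest worst-case WER across the two evaluation sets (32.85), even though two of the other checkpoints each win on a single set :D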
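The last column of the table is simply the larger of the two WER values in each row, and the comparison picks the checkpoint with the smallest such value. A short sketch of that arithmetic, with the numbers copied from the table (the variable names are illustrative):

```python
# (WER on ShEMO dev set, WER on Common Voice 13 test set), copied from the table above.
results = {
    "m3hrdadfi/wav2vec2-large-xlsr-persian-v3": (46.55, 17.43),
    "m3hrdadfi/wav2vec2-large-xlsr-persian-shemo": (7.42, 33.88),
    "masoudmzb/wav2vec2-xlsr-multilingual-53-fa": (56.54, 24.68),
    "This checkpoint": (32.85, 19.21),
}

# Worst-case WER per checkpoint: the larger of its two scores.
worst_case = {name: max(scores) for name, scores in results.items()}

# The checkpoint whose worst case is smallest.
best = min(worst_case, key=worst_case.get)
print(best, worst_case[best])  # prints: This checkpoint 32.85
```

This is a minimax criterion: it rewards balanced performance across both sets rather than the best score on either one.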
Training procedure
Model hyperparameters
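from transformers import Wav2Vec2ForCTC

# Assumed values so the snippet runs on its own: the base checkpoint comes from
# this card's metadata, and last_checkpoint is only set when resuming a run.
model_name_or_path = "masoudmzb/wav2vec2-xlsr-multilingual-53-fa"
last_checkpoint = None

model = Wav2Vec2ForCTC.from_pretrained(
    model_name_or_path if not last_checkpoint else last_checkpoint,
    # hp-mehrdad: hyperparameters of 'm3hrdadfi/wav2vec2-large-xlsr-persian-v3'
    attention_dropout=0.05316,
    hidden_dropout=0.01941,
    feat_proj_dropout=0.01249,
    mask_time_prob=0.04529,
    layerdrop=0.01377,
    ctc_loss_reduction="mean",
    ctc_zero_infinity=True,  # zero out infinite CTC losses instead of propagating them
)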
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 2000
- mixed_precision_training: Native AMP
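Two of the values above follow from the others. The total train batch size is the per-device batch size times the gradient accumulation steps (8 × 2 = 16, assuming a single device), and the linear scheduler ramps the learning rate up over the first 500 steps and then decays it to zero at step 2000. The sketch below reproduces that shape with the usual linear warmup and decay formula; it is assumed to match the `linear` scheduler used in training, and the constant and function names are illustrative.

```python
PEAK_LR = 1e-5         # learning_rate
WARMUP_STEPS = 500     # lr_scheduler_warmup_steps
TRAINING_STEPS = 2000  # training_steps


def linear_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TRAINING_STEPS - step)
    return PEAK_LR * remaining / max(1, TRAINING_STEPS - WARMUP_STEPS)


# Effective batch size, assuming one device: train_batch_size * gradient_accumulation_steps.
effective_batch = 8 * 2

for step in (0, 250, 500, 1250, 2000):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at 1e-05 at step 500 and is back to half of that by step 1250.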
Training results
Training Loss | Epoch | Step | Validation Loss | WER (fraction) |
---|---|---|---|---|
1.8553 | 0.62 | 100 | 1.4126 | 0.4866 |
1.4083 | 1.25 | 200 | 1.0428 | 0.4366 |
1.1718 | 1.88 | 300 | 0.8683 | 0.4127 |
0.9919 | 2.5 | 400 | 0.7921 | 0.3919 |
0.9493 | 3.12 | 500 | 0.7676 | 0.3744 |
0.9414 | 3.75 | 600 | 0.7247 | 0.3695 |
0.8897 | 4.38 | 700 | 0.7202 | 0.3598 |
0.8716 | 5.0 | 800 | 0.7096 | 0.3546 |
0.8467 | 5.62 | 900 | 0.7023 | 0.3499 |
0.8227 | 6.25 | 1000 | 0.6994 | 0.3411 |
0.855 | 6.88 | 1100 | 0.6883 | 0.3432 |
0.8457 | 7.5 | 1200 | 0.6773 | 0.3426 |
0.7614 | 8.12 | 1300 | 0.6913 | 0.3344 |
0.8127 | 8.75 | 1400 | 0.6827 | 0.3335 |
0.8443 | 9.38 | 1500 | 0.6725 | 0.3356 |
0.7548 | 10.0 | 1600 | 0.6759 | 0.3318 |
0.7839 | 10.62 | 1700 | 0.6773 | 0.3286 |
0.7912 | 11.25 | 1800 | 0.6748 | 0.3286 |
0.8238 | 11.88 | 1900 | 0.6735 | 0.3297 |
0.7618 | 12.5 | 2000 | 0.6728 | 0.3286 |
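To read the log at a glance, the sketch below finds the step with the lowest validation loss and the first step that reaches the lowest WER. The rows are copied from the table above as (step, validation loss, WER); the variable names are illustrative. WER in this table is a fraction, so multiply by 100 to compare it with the percentages reported earlier.

```python
# (step, validation loss, WER as a fraction), copied from the training log above.
log = [
    (100, 1.4126, 0.4866), (200, 1.0428, 0.4366), (300, 0.8683, 0.4127),
    (400, 0.7921, 0.3919), (500, 0.7676, 0.3744), (600, 0.7247, 0.3695),
    (700, 0.7202, 0.3598), (800, 0.7096, 0.3546), (900, 0.7023, 0.3499),
    (1000, 0.6994, 0.3411), (1100, 0.6883, 0.3432), (1200, 0.6773, 0.3426),
    (1300, 0.6913, 0.3344), (1400, 0.6827, 0.3335), (1500, 0.6725, 0.3356),
    (1600, 0.6759, 0.3318), (1700, 0.6773, 0.3286), (1800, 0.6748, 0.3286),
    (1900, 0.6735, 0.3297), (2000, 0.6728, 0.3286),
]

# min() returns the first row when several rows tie.
best_loss_row = min(log, key=lambda row: row[1])
best_wer_row = min(log, key=lambda row: row[2])

print("lowest validation loss:", best_loss_row[1], "at step", best_loss_row[0])
print("lowest WER:", best_wer_row[2], "first reached at step", best_wer_row[0])
```

The lowest validation loss (0.6725) occurs at step 1500, while the lowest WER (0.3286) is first reached at step 1700 and again at steps 1800 and 2000, so both metrics have largely plateaued by the end of training.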
Choosing the best model
Several models with different hyperparameters were trained. The following figures show the training process for three of them. As the figures show, this model performs better than the others on the evaluation set.
Framework versions
- Transformers 4.35.2
- PyTorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0