---
language:
- ar
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Small Ar - Neethu VM
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: ar
      split: test
      args: 'config: ar, split: test'
    metrics:
    - name: Wer
      type: wer
      value: 44.862730695069324
pipeline_tag: automatic-speech-recognition
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Whisper Small Arabic - Neethu VM

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Arabic subset of the Common Voice 11.0 dataset.
It achieves the following results on the evaluation set:
- Loss: 0.3402
- Wer: 44.8627

## Model description

This model adapts openai/whisper-small to Arabic automatic speech recognition. It was fine-tuned on the Arabic subset of Common Voice 11.0, a large-scale, open-source collection of transcribed speech provided by the Mozilla Foundation.

## Intended uses & limitations

- Speech-to-Text Conversion: Transcribes spoken Arabic into written text for applications that convert audio to text.
- Voice-Activated Interfaces: Adds Arabic voice recognition to applications and devices so that users can interact with them by speech.
- Accessibility Tools: Helps make audio content accessible to people with hearing impairments, or in environments where audio cannot be played.
- Content Creation and Archiving: Streamlines transcription for content creators, journalists, and researchers working with Arabic audio materials.

Limitations: The model reaches a WER of 44.86% on the Common Voice 11.0 Arabic test split, so transcripts will contain errors and should be reviewed before use where accuracy matters.
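
For the speech-to-text use case, a minimal inference sketch with the `transformers` pipeline might look like the following. The repository ID `your-username/whisper-small-ar` is a hypothetical placeholder, since this card does not state the model's Hub ID, and the `transcribe` helper is an illustrative name rather than part of any library. Passing a file path also assumes `ffmpeg` is available for audio decoding.

```python
# Hypothetical Hub ID: replace it with this model's actual repository name.
MODEL_ID = "your-username/whisper-small-ar"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe an Arabic audio file and return the recognized text."""
    # Imported lazily so this module loads even if transformers is missing.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings into 30-second chunks
    )
    # Ask Whisper to transcribe (not translate) and to decode as Arabic.
    result = asr(
        audio_path,
        generate_kwargs={"language": "arabic", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` stands for any local recording, would download the checkpoint on first use and return the transcript as a string.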

## Training and evaluation data

Training data: The model was fine-tuned on the Arabic subset of Common Voice 11.0, a large-scale, open-source dataset created by Mozilla.

Data Characteristics: Common Voice is a collection of voice recordings contributed by volunteers worldwide, covering a wide range of speakers, accents, and recording environments. The Arabic subset includes various dialects and speech styles, which helps the model generalize across Arabic-speaking regions.

Preprocessing: The audio was preprocessed to standardize sampling rates and formats so that it matches the Whisper model's input requirements.

Evaluation data: Evaluation used the designated test split of the Common Voice Arabic dataset, so the reported metrics reflect the model's ability to generalize to new data.

Metrics: The primary metric is the Word Error Rate (WER): the number of word substitutions, deletions, and insertions needed to turn the predicted text into the ground truth, divided by the number of words in the ground truth. Lower is better.

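
To make the metric concrete, here is a minimal sketch of WER as a word-level edit distance. It assumes plain whitespace tokenization and no text normalization; the card does not say which WER implementation or normalization was used, so `word_error_rate` is an illustrative helper, not the evaluation code behind the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words.
score = word_error_rate("مرحبا بكم في المدرسة", "مرحبا بك في المدرسة")
```

The card reports WER as a percentage, so the value above would be multiplied by 100. Over a whole test set, WER is normally computed from the total error count and total reference word count rather than by averaging per-utterance scores.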
## Training procedure

The procedure involved the following steps.

Data Preparation:

- Data Collection: Gathered the Arabic subset of the Common Voice 11.0 dataset.
- Preprocessing: Standardized the audio by normalizing sampling rates and formats. Cleaned the transcriptions and aligned them with the audio files to form accurate training pairs.

Model Setup:

- Base Model: Whisper-small was used as the base model because it handles diverse speech recognition tasks.
- Environment Configuration: Training ran on a machine with a GPU suited to the model's computational requirements.

Fine-Tuning:

- Hyperparameters: The learning rate, batch size, and other training hyperparameters were chosen to balance performance and training time.
- Training Process: The model was trained for 4,000 steps (about 1.66 epochs), with regular checkpoints to save progress and an evaluation on the validation set every 1,000 steps.
- Loss Function: Cross-entropy loss was used to optimize the model's predictions against the ground truth transcriptions.

Evaluation:

- Validation Set: A portion of the dataset was reserved for validation to monitor the model's performance and avoid overfitting.
- Metrics: Word Error Rate (WER) and validation loss were the primary metrics for assessing accuracy and generalization.

Optimization:

- Early Stopping: Implemented to prevent overfitting by stopping training when the validation loss ceased to improve significantly.
- Fine-Tuning Adjustments: Hyperparameters and learning strategies were adjusted based on validation performance to improve accuracy.

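
The early-stopping rule described above can be sketched as a patience check on the validation loss. The card does not state the actual criterion, so `patience` and `min_delta` are hypothetical parameters chosen for illustration, and `early_stop_index` is not a library function.

```python
def early_stop_index(val_losses, patience=1, min_delta=0.0):
    """Return the index of the evaluation at which training would stop.

    Training stops after `patience` consecutive evaluations that fail to
    improve on the best loss by more than `min_delta`. Returns None if the
    rule never triggers.
    """
    best = float("inf")
    evals_without_improvement = 0
    for index, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return index
    return None


# Validation losses reported in this card at steps 1000, 2000, 3000, 4000.
reported = [0.4141, 0.3603, 0.3519, 0.3402]
stop_at = early_stop_index(reported, patience=1, min_delta=0.0)
```

With the reported losses and these illustrative settings the rule never triggers, because the validation loss improves at every evaluation; that is consistent with training running for the full 4,000 steps.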
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 4000
- mixed_precision_training: Native AMP
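
As an illustration of how `learning_rate`, `lr_scheduler_type: linear`, `lr_scheduler_warmup_steps`, and `training_steps` interact, the sketch below computes the learning rate at a given step. It assumes the usual linear schedule with warmup: a linear ramp from zero to the peak rate over the warmup steps, then a linear decay to zero at the final step. The function name `linear_warmup_lr` is illustrative.

```python
def linear_warmup_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=4000):
    """Learning rate at `step` for a linear schedule with linear warmup."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Learning rate at each evaluation step listed in the results table.
schedule = {step: linear_warmup_lr(step) for step in (1000, 2000, 3000, 4000)}
```

Under these assumptions the rate peaks at 1e-05 at step 500 and reaches zero at step 4000, so the later evaluations in the results table happen at progressively smaller learning rates.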

### Training results

The table below shows training and validation progress at each 1,000-step evaluation, up to 4,000 steps (about 1.66 epochs). Validation loss and Word Error Rate (WER) both improve as training progresses.

| Training Loss | Epoch  | Step | Validation Loss | Wer     |
|:-------------:|:------:|:----:|:---------------:|:-------:|
| 0.3059        | 0.4156 | 1000 | 0.4141          | 49.8008 |
| 0.2894        | 0.8313 | 2000 | 0.3603          | 46.8148 |
| 0.1908        | 1.2469 | 3000 | 0.3519          | 46.4806 |
| 0.1699        | 1.6625 | 4000 | 0.3402          | 44.8627 |

Analysis:

Training Loss: This metric reflects how well the model fits the training data. It fell from 0.3059 at step 1000 to 0.1699 at step 4000, which indicates that the model is learning to fit the training data more accurately.

Validation Loss: This metric indicates how well the model generalizes to unseen data. It decreased at every evaluation, from 0.4141 to 0.3402, which suggests improved generalization.

Word Error Rate (WER): This is the key metric for transcription accuracy. WER fell from 49.80% at step 1000 to 44.86% at step 4000, an absolute reduction of about 4.9 percentage points.

These results show the model's learning curve over the 4,000 training steps and give users an indication of the performance to expect in practical applications.
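
The arithmetic behind these figures can be checked with a short script over the table rows. The estimate of training-set size additionally assumes a single device and no gradient accumulation, which the card does not confirm; variable names are illustrative.

```python
# (step, epoch, validation_loss, wer) rows copied from the results table.
results = [
    (1000, 0.4156, 0.4141, 49.8008),
    (2000, 0.8313, 0.3603, 46.8148),
    (3000, 1.2469, 0.3519, 46.4806),
    (4000, 1.6625, 0.3402, 44.8627),
]

first, last = results[0], results[-1]

# WER improvement between the first and last evaluation.
absolute_drop = first[3] - last[3]       # in percentage points
relative_drop = absolute_drop / first[3]  # as a fraction of the starting WER

# Optimizer steps per epoch, inferred from the last row.
steps_per_epoch = last[0] / last[1]

# Rough training-set size, ASSUMING train_batch_size=16 on one device
# with no gradient accumulation.
approx_train_examples = steps_per_epoch * 16

print(f"absolute WER drop: {absolute_drop:.2f} points")
print(f"relative WER drop: {relative_drop:.1%}")
print(f"steps per epoch:   {steps_per_epoch:.0f}")
print(f"approx. examples:  {approx_train_examples:.0f}")
```

The relative reduction comes out to roughly 10% of the step-1000 WER, and the epoch column implies about 2,400 optimizer steps per epoch.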




### Framework versions

- Transformers 4.41.1
- Pytorch 2.2.1+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1