|
--- |
|
license: cc-by-nc-nd-4.0 |
|
datasets: |
|
- openslr |
|
- mozilla-foundation/common_voice_13_0 |
|
language: |
|
- gl |
|
pipeline_tag: automatic-speech-recognition |
|
tags: |
|
- ITG |
|
- PyTorch |
|
- Transformers |
|
- whisper |
|
- whisper-small |
|
--- |
|
|
|
# Whisper Small Galician |
|
|
|
## Description |
|
|
|
This is a fine-tuned version of the pre-trained [openai/whisper-small](https://huggingface.co/openai/whisper-small) model for automatic speech recognition (ASR) in Galician.
|
|
|
--- |
|
|
|
## Dataset |
|
|
|
We combined two datasets (both can be loaded as sketched below the list):

1. The [OpenSLR Galician](https://huggingface.co/datasets/openslr/viewer/SLR77) dataset, available in the openslr repository.

2. The [Common Voice 13 Galician](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0/viewer/gl) dataset, available in the Common Voice repository.
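
To reproduce the data setup, both corpora can be loaded from the Hub with the `datasets` library. This is a minimal sketch assuming the `SLR77` config for the Galician OpenSLR subset; Common Voice 13 is gated, so you must accept its terms on the Hub and authenticate (e.g. with `huggingface-cli login`) first:

```python
from datasets import load_dataset

# OpenSLR SLR77 is the crowdsourced Galician speech corpus.
# Recent versions of `datasets` may additionally require trust_remote_code=True
# for script-based datasets like these.
openslr_gl = load_dataset("openslr", "SLR77", split="train")

# Common Voice 13 is gated: accept the terms on its Hub page and log in first.
common_voice_gl = load_dataset(
    "mozilla-foundation/common_voice_13_0", "gl", split="train"
)

# Both expose "audio" and "sentence" columns; align the remaining columns
# (e.g. with remove_columns) before concatenating the two corpora.
print(openslr_gl[0]["sentence"])
```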
|
|
|
--- |
|
|
|
|
|
## Example inference script |
|
|
|
The following script shows how to run our model in inference mode. It requires the `librosa` library to load and resample the audio file.
|
|
|
```python
import librosa
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

filename = "demo.wav"  # change this line to the name of your audio file
sample_rate = 16_000

processor = AutoProcessor.from_pretrained("ITG/whisper-small-gl")
model = AutoModelForSpeechSeq2Seq.from_pretrained("ITG/whisper-small-gl")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

with torch.no_grad():
    # Load the audio and resample it to 16 kHz, the rate Whisper expects.
    speech_array, _ = librosa.load(filename, sr=sample_rate)
    # Convert the waveform to log-Mel input features and move them to the device.
    inputs = processor(speech_array, sampling_rate=sample_rate, return_tensors="pt").to(device)
    input_features = inputs.input_features
    # Generate the transcription token IDs and decode them to text.
    generated_ids = model.generate(input_features, max_length=225)
    decode_output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"ASR Galician whisper-small output: {decode_output}")
```
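
For quick experiments, the same checkpoint can also be used through the high-level `pipeline` API, which handles audio decoding and resampling internally. This is a minimal sketch; reading audio files this way requires `ffmpeg` to be installed:

```python
from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint; audio loading and
# resampling to 16 kHz are handled for you.
asr = pipeline("automatic-speech-recognition", model="ITG/whisper-small-gl")

result = asr("demo.wav")  # replace with the path to your audio file
print(result["text"])
```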
|
--- |
|
|
|
## Fine-tuning hyper-parameters |
|
|
|
| **Hyper-parameter** | **Value** | |
|
|:----------------------------------------:|:---------------------------:| |
|
| Training batch size | 16 | |
|
| Evaluation batch size | 8 | |
|
| Learning rate | 1e-5 | |
|
| Gradient checkpointing | true | |
|
| Gradient accumulation steps | 1 | |
|
| Max training epochs | 100 | |
|
| Max steps | 4000 | |
|
| Generate max length | 225 | |
|
| Warmup training steps (%)                | 12.5%                       |
|
| FP16 | true | |
|
| Metric for best model | wer | |
|
| Greater is better | false | |
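
As a rough guide, these values map onto `Seq2SeqTrainingArguments` as in the sketch below. The output directory, evaluation/save cadence, and `load_best_model_at_end` flag are illustrative assumptions, not confirmed settings of this model:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-gl",   # hypothetical output path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    gradient_checkpointing=True,
    gradient_accumulation_steps=1,
    num_train_epochs=100,
    max_steps=4000,                    # overrides num_train_epochs when set
    warmup_steps=500,                  # 12.5% of 4000 steps
    fp16=True,
    predict_with_generate=True,
    generation_max_length=225,
    evaluation_strategy="steps",       # assumed cadence
    eval_steps=1000,                   # assumed
    save_steps=1000,                   # assumed
    load_best_model_at_end=True,       # assumed; needed for metric_for_best_model
    metric_for_best_model="wer",
    greater_is_better=False,
)
```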
|
|
|
|
|
## Fine-tuning on a different dataset or style
|
|
|
If you're interested in fine-tuning your own Whisper model, we suggest starting from the [openai/whisper-small model](https://huggingface.co/openai/whisper-small). Additionally, you may find the Transformers

step-by-step guide for [fine-tuning Whisper on multilingual ASR datasets](https://huggingface.co/blog/fine-tune-whisper) to be a valuable resource. That guide served as a helpful reference during the training

of this Galician whisper-small model!