# Fine-tuned Whisper Large-V3-Turbo for Vietnamese ASR

This project fine-tunes the Whisper Large-V3-Turbo model to improve its Automatic Speech Recognition (ASR) performance for Vietnamese. Training ran for 240 hours on a single NVIDIA A6000 GPU.

## Data Sources

The training data comes from several public Vietnamese speech corpora. Each dataset is listed below with the arguments used to load it:

1. **capleaf/viVoice**
   - Path: `capleaf/viVoice`
   - Mode: `3`
   - Split: `train`
2. **NhutP/VSV-1100**
   - Path: `NhutP/VSV-1100`
   - Mode: `1`
   - Split: `train`
3. **doof-ferb/fpt_fosd**
   - Path: `doof-ferb/fpt_fosd`
   - Mode: `0`
   - Split: `train`
4. **doof-ferb/infore1_25hours**
   - Path: `doof-ferb/infore1_25hours`
   - Mode: `0`
   - Split: `train`
5. **google/fleurs (vi_vn)**
   - Path: `google/fleurs`
   - Name: `vi_vn`
   - Mode: `1`
   - Split: `train`
6. **doof-ferb/LSVSC**
   - Path: `doof-ferb/LSVSC`
   - Mode: `1`
   - Split: `train`
7. **quocanh34/viet_vlsp**
   - Path: `quocanh34/viet_vlsp`
   - Mode: `0`
   - Split: `train`
8. **linhtran92/viet_youtube_asr_corpus_v2**
   - Path: `linhtran92/viet_youtube_asr_corpus_v2`
   - Mode: `1`
   - Split: `train`
9. **doof-ferb/infore2_audiobooks**
   - Path: `doof-ferb/infore2_audiobooks`
   - Mode: `0`
   - Split: `train`
10. **linhtran92/viet_bud500**
    - Path: `linhtran92/viet_bud500`
    - Mode: `0`
    - Split: `train`

## Model

The base model is **Whisper Large-V3-Turbo**. Whisper is a multilingual ASR model trained on a large and diverse audio corpus; the checkpoint released here has been fine-tuned specifically for Vietnamese.

## Training Configuration

- **GPU**: single NVIDIA A6000
- **Training time**: 240 hours
- **Training mode**: fine-tuning
- **Dataset split**: `train` for every corpus
- **Mode options**: per-dataset values, as listed under Data Sources

## Usage

To use the fine-tuned model with 🤗 Transformers:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; fall back to CPU / float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "suzii/vi-whisper-large-v3-turbo-v1"

# Load the fine-tuned checkpoint and its processor (tokenizer + feature extractor).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Wrap everything in a ready-to-use ASR pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file; timestamps are returned per segment.
result = pipe("your-audio.mp3", return_timestamps=True)
print(result["text"])
```

## Acknowledgements

This project would not be possible without the following datasets:

- [capleaf/viVoice](https://huggingface.co/datasets/capleaf/viVoice)
- [NhutP/VSV-1100](https://huggingface.co/datasets/NhutP/VSV-1100)
- [doof-ferb/fpt_fosd](https://huggingface.co/datasets/doof-ferb/fpt_fosd)
- [doof-ferb/infore1_25hours](https://huggingface.co/datasets/doof-ferb/infore1_25hours)
- [google/fleurs](https://huggingface.co/datasets/google/fleurs)
- [doof-ferb/LSVSC](https://huggingface.co/datasets/doof-ferb/LSVSC)
- [quocanh34/viet_vlsp](https://huggingface.co/datasets/quocanh34/viet_vlsp)
- [linhtran92/viet_youtube_asr_corpus_v2](https://huggingface.co/datasets/linhtran92/viet_youtube_asr_corpus_v2)
- [doof-ferb/infore2_audiobooks](https://huggingface.co/datasets/doof-ferb/infore2_audiobooks)
- [linhtran92/viet_bud500](https://huggingface.co/datasets/linhtran92/viet_bud500)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
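
## Loading the Training Data (Sketch)

The dataset identifiers under Data Sources map directly onto 🤗 Datasets Hub paths. The snippet below is a minimal sketch of one way the corpora could be streamed and combined into a single training mix; it is not the released training code. In particular, the meaning of each corpus's `mode` value, the common column schema, and the interleaving strategy are all assumptions.

```python
from datasets import Audio, interleave_datasets, load_dataset

# (path, config name) pairs taken from the Data Sources list above.
SOURCES = [
    ("capleaf/viVoice", None),
    ("NhutP/VSV-1100", None),
    ("doof-ferb/fpt_fosd", None),
    ("doof-ferb/infore1_25hours", None),
    ("google/fleurs", "vi_vn"),
    ("doof-ferb/LSVSC", None),
    ("quocanh34/viet_vlsp", None),
    ("linhtran92/viet_youtube_asr_corpus_v2", None),
    ("doof-ferb/infore2_audiobooks", None),
    ("linhtran92/viet_bud500", None),
]

streams = []
for path, name in SOURCES:
    # Stream each corpus's `train` split instead of downloading it in full.
    ds = load_dataset(path, name=name, split="train", streaming=True)
    # Whisper's feature extractor expects 16 kHz mono audio.
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
    # NOTE (assumption): in practice each corpus's transcript column must be
    # renamed to a shared name here, since the corpora use different schemas.
    streams.append(ds)

# Interleave the corpora into one training stream, cycling until every
# corpus has been fully consumed.
train_stream = interleave_datasets(streams, stopping_strategy="all_exhausted")
```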
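
## Transcribing Longer Audio

Whisper processes audio in 30-second windows, so recordings longer than that are best transcribed in chunks. The snippet below is a sketch that reuses the `pipe` object built in the Usage section; `chunk_length_s`, `batch_size`, and `generate_kwargs` are standard `transformers` pipeline arguments, while the file name and batch size are placeholders to adjust for your hardware.

```python
# `pipe` is the ASR pipeline constructed in the Usage section above.
result = pipe(
    "your-long-audio.mp3",  # placeholder file name
    chunk_length_s=30,      # split the recording into 30-second windows
    batch_size=8,           # decode several chunks in parallel
    return_timestamps=True,
    # Pin decoding to Vietnamese transcription rather than relying on
    # automatic language detection.
    generate_kwargs={"language": "vi", "task": "transcribe"},
)
print(result["text"])
```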