---
license: mit
language:
- th
base_model: biodatlab/whisper-th-small-combined
tags:
- whisper
- Pytorch
---

# Whisper-th-small-ct2

whisper-th-small-ct2 is the CTranslate2 format of [biodatlab/whisper-th-small-combined](https://huggingface.co/biodatlab/whisper-th-small-combined), compatible with [WhisperX](https://github.com/m-bain/whisperX) and [faster-whisper](https://github.com/SYSTRAN/faster-whisper), which enables:

- 🤏 **Half the size** of the original Hugging Face format.
- ⚡ī¸ Batched inference for **70x** real-time transcription with Whisper large-v2.
- đŸĒļ A faster-whisper backend, requiring **<8 GB GPU memory** for large-v2 with beam_size=5.
- đŸŽ¯ Accurate word-level timestamps using wav2vec2 alignment.
- đŸ‘¯â€â™‚ī¸ Multispeaker ASR using speaker diarization (includes speaker ID labels).
- đŸ—Ŗī¸ VAD preprocessing, reducing hallucinations and allowing batching with no WER degradation.

### Usage

```python
!pip install git+https://github.com/m-bain/whisperx.git

import whisperx
import time

# Settings
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16
compute_type = "float16"

"""
A Hugging Face token is required for the diarization model.
You also need to accept the model's terms and conditions before use.
See the model page: https://huggingface.co/pyannote/segmentation-3.0
"""
HF_TOKEN = ""

# Load the model and transcribe
model = whisperx.load_model("Thaweewat/whisper-th-small-ct2", device, compute_type=compute_type)

st_time = time.time()
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Combine plain text if needed
combined_text = ' '.join(segment['text'] for segment in result['segments'])

print(f"Response time: {time.time() - st_time} seconds")
print(diarize_segments)
print(result)
print(combined_text)
```
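The snippet above goes straight from transcription to diarization and skips the word-level alignment step that produces the wav2vec2-based timestamps mentioned in the feature list. If you need word timestamps, alignment can be inserted between `model.transcribe(...)` and the diarization call. The sketch below is untested with this checkpoint; in particular, the Thai alignment model name (`airesearch/wav2vec2-large-xlsr-53-th`) is an assumption, since WhisperX may not ship a default alignment model for Thai.

```python
# Optional: word-level alignment, placed after model.transcribe() and before diarization.
# model_name is an assumed Thai wav2vec2 checkpoint; adjust to your preferred alignment model.
model_a, metadata = whisperx.load_align_model(
    language_code="th",
    device=device,
    model_name="airesearch/wav2vec2-large-xlsr-53-th",
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)
```

Because the weights are a standard CTranslate2 conversion, the model should also load directly with the faster-whisper API. The following is a minimal sketch, assuming faster-whisper is installed; `beam_size` and `language` are illustrative values, not settings prescribed by this model card.

```python
from faster_whisper import WhisperModel

# Load the CTranslate2 model from the Hugging Face Hub
model = WhisperModel("Thaweewat/whisper-th-small-ct2", device="cuda", compute_type="float16")

# segments is a generator: transcription runs as you iterate over it
segments, info = model.transcribe("audio.mp3", beam_size=5, language="th")

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```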