How to get word-level and verbose transcription?
Large-v3 is very fast with batching, as shown here: https://huggingface.co/openai/whisper-large-v3
Batching speeds up transcription considerably. The only reason I would rather use faster_whisper is that it provides verbose, word-level transcription, plus support for input parameters like best_of, beam_size, etc., all of which are supported by whisper: https://github.com/openai/whisper/blob/main/whisper/transcribe.py
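On the input-parameter point: with the transformers pipeline, decoding options are passed through `generate_kwargs` rather than as top-level arguments, and the names differ from openai/whisper (e.g. `beam_size` corresponds to `num_beams`; I'm not aware of a direct `best_of` equivalent). A minimal sketch of that mapping — the helper name is mine, and the exact set of supported options is an assumption to check against your installed transformers version:

```python
# Map openai/whisper-style decoding option names onto transformers' generate() names.
# NOTE: this mapping is illustrative, not exhaustive; verify against the
# transformers documentation for your installed version.
WHISPER_TO_GENERATE = {
    "beam_size": "num_beams",
    "temperature": "temperature",
}

def to_generate_kwargs(**whisper_opts):
    """Translate whisper-style option names into a generate_kwargs dict."""
    unsupported = set(whisper_opts) - set(WHISPER_TO_GENERATE)
    if unsupported:
        raise ValueError(f"No known transformers equivalent for: {sorted(unsupported)}")
    return {WHISPER_TO_GENERATE[k]: v for k, v in whisper_opts.items()}

gen_kwargs = to_generate_kwargs(beam_size=5, temperature=0.0)
print(gen_kwargs)
# Then pass it to the pipeline call, e.g.:
# result = pipe(sample, return_timestamps="word", generate_kwargs=gen_kwargs)
```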
Word-level timestamps are already possible, as specified in the Model Card:

```python
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
```

This should give you word-level timestamps, something like:
```json
{
  "text": " the",
  "timestamp": [187.6, 188.64]
},
{
  "text": " fact",
  "timestamp": [188.64, 188.88]
},
```
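If what you miss from faster_whisper is the verbose console output, you can render it yourself from these chunks. A small sketch, assuming `chunks` has exactly the shape shown above:

```python
def format_word_chunks(chunks):
    """Render pipeline word chunks as verbose '[start --> end] word' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"[{start:7.2f} --> {end:7.2f}] {chunk['text'].strip()}")
    return "\n".join(lines)

chunks = [
    {"text": " the", "timestamp": [187.6, 188.64]},
    {"text": " fact", "timestamp": [188.64, 188.88]},
]
print(format_word_chunks(chunks))
```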
This is very similar to what you get with the whisper model, which looks something like:
```json
{
  "id": 0,
  "seek": 0,
  "start": 0.0,
  "end": 3.0,
  "text": " Okay, so I've started recording.",
  "tokens": [50364, 1033, 11, ..., 13, 50524],
  "temperature": 0.0,
  "avg_logprob": -0.43806132332223363,
  "compression_ratio": 1.2953020134228188,
  "no_speech_prob": 0.1916283816099167,
  "words": [
    {
      "word": " Okay,",
      "start": 0.0,
      "end": 0.56,
      "probability": 0.12234115600585938
    },
    ...
    {
      "word": " recording.",
      "start": 2.44,
      "end": 3.0,
      "probability": 0.8062686920166016
    }
  ]
},
```
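The main structural difference is that whisper nests words under each segment, while the pipeline returns a flat list of chunks. Flattening the segments gives you the same shape as the pipeline output — a sketch, assuming the verbose `segments` structure shown above:

```python
def words_from_segments(segments):
    """Flatten whisper verbose segments into pipeline-style word chunks."""
    return [
        {"text": w["word"], "timestamp": [w["start"], w["end"]]}
        for seg in segments
        for w in seg.get("words", [])
    ]

segments = [{
    "id": 0, "start": 0.0, "end": 3.0,
    "text": " Okay, so I've started recording.",
    "words": [
        {"word": " Okay,", "start": 0.0, "end": 0.56, "probability": 0.122},
        {"word": " recording.", "start": 2.44, "end": 3.0, "probability": 0.806},
    ],
}]
print(words_from_segments(segments))
```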
While the whisper model does provide more information, and some other input params may still not be available here, word timestamps are currently possible.
Use batch_size=1:

```python
results = pipe(sample, batch_size=1, return_timestamps="word")
```