Audio transcribing, timestamping for whole sentences.
#16
by
artyomboyko
- opened
Good afternoon. Is there any way to generate an audio transcription without breaking sentences?
For example, when transcribing a video, instead of:
00:00:08,960 --> 00:00:13,840 This video is an introductory video about coders, decoders and codecs.
00:00:13,840 --> 00:00:18,640 In this episode we try to understand what a transformer network is all about,
00:00:18,640 --> 00:00:24,720 and try to explain it in simple, high-level terms.
get the following:
00:00:08,960 --> 00:00:18,640 This video is an introductory video to a series of videos about coders, decoders, and coder decoders.
00:00:18,640 --> 00:00:24,720 In this series we will try to understand what a transformer network is and try to explain it in simple, high-level terms.
Is this possible?
Hey @ElectricSecretAgent! Could you simply piece together the transcriptions and take the first/last timestamps?
import torch
from transformers import pipeline
from datasets import load_dataset
model = "openai/whisper-tiny"
device = 0 if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    chunk_length_s=30,
    device=device,
)
# replace this with the loading/inference for your audio sample
ls_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
out = pipe(ls_dummy[0]["audio"], return_timestamps=True)
# join all the text together
text = [chunk["text"] for chunk in out["chunks"]]
text = "".join(text)
# get first timestamp of first chunk
start = out["chunks"][0]["timestamp"][0]
# get last timestamp of last chunk
end = out["chunks"][-1]["timestamp"][-1]
print(f"{start} -> {end}: {text}")
Print output:
0.0 -> 5.44: Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.
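If you want whole sentences rather than one span for the entire audio, here is a minimal sketch (my own addition, not part of the transformers API) that groups the pipeline's chunks into sentence-level segments. It assumes, as in your example, that a sentence ends whenever a chunk's text ends in sentence-closing punctuation:

```python
def merge_chunks_into_sentences(chunks):
    """Merge pipeline chunks into (start, end, text) sentence segments."""
    sentences = []
    current_text = ""
    current_start = None
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if current_start is None:
            # first chunk of a new sentence: remember its start time
            current_start = start
        current_text += chunk["text"]
        # a chunk ending in ., ! or ? closes the current sentence
        if current_text.rstrip().endswith((".", "!", "?")):
            sentences.append((current_start, end, current_text.strip()))
            current_text = ""
            current_start = None
    if current_text.strip():
        # flush any trailing partial sentence
        sentences.append((current_start, end, current_text.strip()))
    return sentences

# example chunks shaped like out["chunks"] from the pipeline
chunks = [
    {"timestamp": (8.96, 13.84), "text": " This video is an introductory video about coders, decoders and codecs."},
    {"timestamp": (13.84, 18.64), "text": " In this episode we try to understand what a transformer network is,"},
    {"timestamp": (18.64, 24.72), "text": " and try to explain it in simple, high-level terms."},
]
for start, end, text in merge_chunks_into_sentences(chunks):
    print(f"{start} --> {end}: {text}")
```

With your example this yields one segment for the first sentence (8.96 --> 13.84) and one for the second, which spans two chunks (13.84 --> 24.72). A real transcription may split words mid-sentence without punctuation, so treat this only as a starting point.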
Thanks. I'll test it.
See ACICFG's implementation (with VAD, forced alignment, and a translation pipeline): https://colab.research.google.com/github/cnbeining/Whisper_Notebook/blob/master/WhisperX.ipynb
Thanks!
You can also set batch_size=... in the transformers implementation to speed up transcription for long audio samples:
out = pipe(ls_dummy[0]["audio"], return_timestamps=True, batch_size=4)
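Since the question shows SRT-style timestamps, here is a small helper (my own addition, not from transformers) that converts the float seconds in the pipeline's "timestamp" tuples back into the HH:MM:SS,mmm format used by SRT subtitle files:

```python
def seconds_to_srt(seconds):
    """Convert a float number of seconds into an SRT HH:MM:SS,mmm timestamp."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(seconds_to_srt(8.96))   # -> 00:00:08,960
print(seconds_to_srt(24.72))  # -> 00:00:24,720
```

You can then format each merged segment as f"{seconds_to_srt(start)} --> {seconds_to_srt(end)} {text}" to reproduce the layout from your example.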