Is it possible to access the transcript result in batches after each chunk has finished?

#85
by hanifanggawi - opened

Is it possible to access the chunks of the transcript result in batches, after each chunk has been transcribed?
The result from the example seems to be a dictionary object with only "chunks" and "text"; there is no generator object like the one the whisper v2 model provides.

I have also tried the pipeline chunk batching described in https://huggingface.co/docs/transformers/en/main_classes/pipelines#pipeline-chunk-batching, but it simply runs once, with the whole transcript in a single final chunk.

This is the code I tried:

# ... imports and model loading
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
    generate_kwargs={"language": "en"}
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

all_model_outputs = []
# run each preprocessed chunk through the model as it becomes available
for preprocessed in pipe.preprocess(sample):
    model_outputs = pipe.forward(preprocessed)
    print('processed chunk: ', model_outputs)
    all_model_outputs.append(model_outputs)
# merge the per-chunk outputs into the final transcript
outputs = pipe.postprocess(all_model_outputs)
print('complete output: ', outputs)

This is the output (most of the token values in the tensor omitted):

processed chunk:  {'is_last': True, 'tokens': tensor([[50365, .....,  0, 50598]])}
complete output: {'text': " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker
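
For reference, this is roughly what I was hoping to get working: per-chunk text as each chunk finishes. A minimal sketch, assuming chunk_length_s has to be passed to preprocess() again when it is called directly (I have not verified this) and that each chunk's "tokens" tensor can be decoded with the pipeline's tokenizer:

all_model_outputs = []
# pass chunk_length_s here as well, in case the value given to the pipeline
# constructor is not applied when preprocess() is called on its own (assumption)
for preprocessed in pipe.preprocess(sample, chunk_length_s=30):
    model_outputs = pipe.forward(preprocessed)
    # decode this chunk's generated ids right away instead of waiting for postprocess()
    chunk_text = pipe.tokenizer.decode(model_outputs["tokens"][0], skip_special_tokens=True)
    print('chunk text: ', chunk_text)
    all_model_outputs.append(model_outputs)
outputs = pipe.postprocess(all_model_outputs)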

Have you found any solution?

I ended up using the CTranslate2 model with the faster-whisper backend instead: https://huggingface.co/Systran/faster-whisper-large-v3. faster-whisper lets you obtain the results as a stream via a generator.
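
Roughly what that looks like, as a minimal sketch ("audio.wav" is just a placeholder for the input file):

from faster_whisper import WhisperModel

# CTranslate2 conversion of Whisper large-v3
model = WhisperModel("Systran/faster-whisper-large-v3", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus transcription info,
# so each segment is available as soon as it has been decoded
segments, info = model.transcribe("audio.wav", language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")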
