patrickvonplaten committed
Commit 3264a14 · 1 Parent(s): f364131

Speculative Decoding doesn't work yet with Whisper-v3


In the removed example we advertise loading `tiny` through a `ForCausalLM` class, which can't work since `tiny` does **not** share the same encoder as `large-v3`. We can advertise speculative decoding again as soon as distil-v3 is out.
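
For context, a minimal sketch of why the pairing fails: a `ForCausalLM` assistant is only a decoder, so during assisted generation it cross-attends to the *main* model's encoder output. That only works when assistant and main model share the same encoder (as Distil-Whisper checkpoints do with their teacher), whereas `tiny` ships its own, much narrower encoder. The snippet below just uses `AutoConfig` to surface the mismatch between the two checkpoints from the removed example:

```python
from transformers import AutoConfig

# Compare encoder widths of the main model and the proposed assistant.
# A decoder-only (ForCausalLM) assistant is fed the main model's encoder
# states, so the hidden sizes have to match; here they do not.
main_config = AutoConfig.from_pretrained("openai/whisper-large-v3")
assistant_config = AutoConfig.from_pretrained("openai/whisper-tiny")

print(main_config.d_model)       # 1280: large-v3 encoder width
print(assistant_config.d_model)  # 384:  tiny encoder width, incompatible
```

Once a distil-v3 checkpoint with the same encoder as large-v3 (and a shorter decoder) is available, the removed snippet should work again by pointing `assistant_model_id` at it.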

Files changed (1)
  1. README.md +0 -51
README.md CHANGED
@@ -258,57 +258,6 @@ result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "fren
  print(result["chunks"])
  ```
 
- ## Speculative Decoding
-
- Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
- ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
- replacement for existing Whisper pipelines, since the same outputs are guaranteed.
-
- In the following code-snippet, we load the assistant Distil-Whisper model standalone to the main Whisper pipeline. We then
- specify it as the "assistant model" for generation:
-
- ```python
- from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
- import torch
- from datasets import load_dataset
-
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-
- assistant_model_id = "openai/whisper-tiny"
-
- assistant_model = AutoModelForCausalLM.from_pretrained(
-     assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
- )
- assistant_model.to(device)
-
- model_id = "openai/whisper-large-v3"
-
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
-     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
- )
- model.to(device)
-
- processor = AutoProcessor.from_pretrained(model_id)
-
- pipe = pipeline(
-     "automatic-speech-recognition",
-     model=model,
-     tokenizer=processor.tokenizer,
-     feature_extractor=processor.feature_extractor,
-     max_new_tokens=128,
-     generate_kwargs={"assistant_model": assistant_model},
-     torch_dtype=torch_dtype,
-     device=device,
- )
-
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
- sample = dataset[0]["audio"]
-
- result = pipe(sample)
- print(result["text"])
- ```
-
  ## Additional Speed & Memory Improvements
 
  You can apply additional speed and memory improvements to Whisper-large-v3 which we cover in the following.
 