openai/whisper-large-v3

@@ -114,39 +114,69 @@ license: apache-2.0
 # Whisper
-Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
-[Robust Speech Recognition via Large-Scale Weak Supervision](https://huggingface.co/papers/2212.04356) by Alec Radford
-et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many
-datasets and domains in a zero-shot setting.
-Whisper large-v3 has the same architecture as the previous [large](https://huggingface.co/openai/whisper-large) and [large-v2](https://huggingface.co/openai/whisper-large-v2)
-models, except for the following minor differences:
-1. The spectrogram input uses 128 Mel frequency bins instead of 80
 2. A new language token for Cantonese
-The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled
-audio collected using Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). The model was trained for 2.0 epochs over this mixture dataset.
-The large-v3 model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors
-compared to Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). For more details on the different checkpoints available, refer to the section [Model details](#model-details).
-**Disclaimer**: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and
-pasted from the original model card.
 ## Usage
-Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers
-library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and
-🤗 Accelerate to reduce the model loading time:
 ```bash
 pip install --upgrade pip
-pip install --upgrade transformers datasets[audio] accelerate
 ```
 The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-class to transcribe audio of arbitrary length:
 ```python
 import torch
@@ -171,6 +201,10 @@ pipe = pipeline(
     model=model,
     tokenizer=processor.tokenizer,
     feature_extractor=processor.feature_extractor,
     torch_dtype=torch_dtype,
     device=device,
 )
@@ -183,33 +217,9 @@ print(result["text"])
 ```
 To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
-```python
-result = pipe("audio.mp3")
-```
-Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:
-```python
-result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
-```
-Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous
-tokens. The following example demonstrates how to enable these heuristics:
-```python
-generate_kwargs = {
-    "max_new_tokens": 448,
-    "num_beams": 1,
-    "condition_on_prev_tokens": False,
-    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
-    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
-    "logprob_threshold": -1.0,
-    "no_speech_threshold": 0.6,
-    "return_timestamps": True,
-}
-result = pipe(sample, generate_kwargs=generate_kwargs)
 ```
 Whisper predicts the language of the source audio automatically. If the source audio language is known *a-priori*, it
@@ -248,240 +258,41 @@ result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "fren
 print(result["chunks"])
 ```
-<details>
-<summary> For more control over the generation parameters, use the model + processor API directly: </summary>
-```python
-import torch
-from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
-from datasets import Audio, load_dataset
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-model_id = "openai/whisper-large-v3"
-model = AutoModelForSpeechSeq2Seq.from_pretrained(
-    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
-)
-model.to(device)
-processor = AutoProcessor.from_pretrained(model_id)
-dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
-dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
-sample = dataset[0]["audio"]
-inputs = processor(
-    sample["array"],
-    sampling_rate=sample["sampling_rate"],
-    return_tensors="pt",
-    truncation=False,
-    padding="longest",
-    return_attention_mask=True,
-)
-inputs = inputs.to(device, dtype=torch_dtype)
-gen_kwargs = {
-    "max_new_tokens": 448,
-    "num_beams": 1,
-    "condition_on_prev_tokens": False,
-    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
-    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
-    "logprob_threshold": -1.0,
-    "no_speech_threshold": 0.6,
-    "return_timestamps": True,
-}
-pred_ids = model.generate(**inputs, **gen_kwargs)
-pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
-print(pred_text)
-```
-</details>
 ## Additional Speed & Memory Improvements
-You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM
-requirements.
-### Chunked Long-Form
-Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is
-required:
-1. **Sequential:** uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
-2. **Chunked:** splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries
-The sequential long-form algorithm should be used in either of the following scenarios:
-1. Transcription accuracy is the most important factor, and speed is less of a consideration
-2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
-Conversely, the chunked algorithm should be used when:
-1. Transcription speed is the most important factor
-2. You are transcribing a **single** long audio file
-By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s`
-parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long
-audio files, pass the argument `batch_size`:
-```python
-import torch
-from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
-from datasets import load_dataset
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-model_id = "openai/whisper-large-v3"
-model = AutoModelForSpeechSeq2Seq.from_pretrained(
-    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
-)
-model.to(device)
-processor = AutoProcessor.from_pretrained(model_id)
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model=model,
-    tokenizer=processor.tokenizer,
-    feature_extractor=processor.feature_extractor,
-    chunk_length_s=30,
-    batch_size=16,  # batch size for inference - set based on your device
-    torch_dtype=torch_dtype,
-    device=device,
-)
-dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
-sample = dataset[0]["audio"]
-result = pipe(sample)
-print(result["text"])
-```
-#### Torch compile
-The Whisper forward pass is compatible with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)
-for 4.5x speed-ups.
-**Note:** `torch.compile` is currently not compatible with the Chunked long-form algorithm or Flash Attention 2 ⚠️
-```python
-import torch
-from torch.nn.attention import SDPBackend, sdpa_kernel
-from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
-from datasets import load_dataset
-from tqdm import tqdm
-torch.set_float32_matmul_precision("high")
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-model_id = "openai/whisper-large-v3"
-model = AutoModelForSpeechSeq2Seq.from_pretrained(
-    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
-).to(device)
-# Enable static cache and compile the forward pass
-model.generation_config.cache_implementation = "static"
-model.generation_config.max_new_tokens = 256
-model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
-processor = AutoProcessor.from_pretrained(model_id)
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model=model,
-    tokenizer=processor.tokenizer,
-    feature_extractor=processor.feature_extractor,
-    torch_dtype=torch_dtype,
-    device=device,
-)
-dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
-sample = dataset[0]["audio"]
-# 2 warmup steps
-for _ in tqdm(range(2), desc="Warm-up step"):
-    with sdpa_kernel(SDPBackend.MATH):
-        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})
-# fast run
-with sdpa_kernel(SDPBackend.MATH):
-    result = pipe(sample.copy())
-print(result["text"])
-```
-#### Flash Attention 2
-We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it and you are not using [torch.compile](#torch-compile).
-To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
 ```
 pip install flash-attn --no-build-isolation
 ```
-Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
-```python
-model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
 ```
-#### Torch Scaled Dot-Product Attention (SDPA)
-If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
-This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
-whether you have a compatible PyTorch version, run the following Python code snippet:
-```python
-from transformers.utils import is_torch_sdpa_available
-print(is_torch_sdpa_available())
 ```
-If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
-returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/).
-Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
-`attn_implementation="sdpa"` as follows:
-```python
-model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
 ```
-For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).
-## Model details
-Whisper is a Transformer-based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. There are two
-flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English
-speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech
-translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio. For speech
-translation, the model predicts transcriptions to a *different* language to the audio.
-Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only
-and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
-are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
-checkpoints are summarised in the following table with links to the models on the Hub:
-| Size     | Parameters | English-only                                         | Multilingual                                        |
-|----------|------------|------------------------------------------------------|-----------------------------------------------------|
-| tiny     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)     |
-| base     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)     |
-| small    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)    |
-| medium   | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)   |
-| large    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large)    |
-| large-v2 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v2) |
-| large-v3 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v3) |
 ## Fine-Tuning
@@ -501,7 +312,7 @@ In particular, we caution against using Whisper models to transcribe recordings
 ## Training Data
-The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2.
 As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.

 # Whisper
+Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
+of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
+for fine-tuning.
+Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
+by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
+Whisper `large-v3` has the same architecture as the previous large models except for the following minor differences:
+1. The input uses 128 Mel frequency bins instead of 80
 2. A new language token for Cantonese
+The Whisper `large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
+The model was trained for 2.0 epochs over this mixture dataset.
+The `large-v3` model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper `large-v2`.
+**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
+copied and pasted from the original model card.
+## Model details
+Whisper is a Transformer-based encoder-decoder model, also referred to as a _sequence-to-sequence_ model.
+It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
+The models were trained on either English-only data or multilingual data. The English-only models were trained
+on the task of speech recognition. The multilingual models were trained on both speech recognition and speech
+translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio.
+For speech translation, the model predicts transcriptions to a *different* language to the audio.
+Whisper checkpoints come in five configurations of varying model sizes.
+The smallest four are trained on either English-only or multilingual data.
+The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
+are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
+checkpoints are summarised in the following table with links to the models on the Hub:
+| Size     | Parameters | English-only                                         | Multilingual                                        |
+|----------|------------|------------------------------------------------------|-----------------------------------------------------|
+| tiny     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)     |
+| base     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)     |
+| small    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)    |
+| medium   | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)   |
+| large    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large)    |
+| large-v2 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v2) |
+| large-v3 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v3) |
 ## Usage
+Whisper `large-v3` is supported in Hugging Face 🤗 Transformers through the `main` branch in the Transformers repo. To run the model, first
+install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy
+audio dataset from the Hugging Face Hub:
 ```bash
 pip install --upgrade pip
+pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
 ```
 The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe
+long-form audio files, which in practice is 9x faster than the sequential algorithm proposed by OpenAI
+(see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should
+be set based on the specifications of your device:
 ```python
 import torch
     model=model,
     tokenizer=processor.tokenizer,
     feature_extractor=processor.feature_extractor,
+    max_new_tokens=128,
+    chunk_length_s=30,
+    batch_size=16,
+    return_timestamps=True,
     torch_dtype=torch_dtype,
     device=device,
 )
 ```
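The chunked algorithm referred to above splits long audio into overlapping windows before transcribing them in batches. As a rough illustration only (this is not the Transformers implementation), the sketch below computes window boundaries for a given chunk length. The function name `chunk_bounds` and the 5-second overlap are assumptions made for this example; the pipeline chooses its own overlap.

```python
def chunk_bounds(n_samples, sampling_rate=16_000, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_length_s * sampling_rate)
    # Each new window starts `step` samples after the previous one,
    # so consecutive windows share `overlap_s` seconds of audio.
    step = chunk - int(overlap_s * sampling_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break
        start += step
    return bounds


# 70 seconds of 16 kHz audio, split into 30-second windows
for start, end in chunk_bounds(70 * 16_000):
    print(f"{start / 16_000:.1f}s -> {end / 16_000:.1f}s")
```

Each window can then be transcribed independently, which is what makes batching over a single long file possible.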
 To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
+```diff
+- result = pipe(sample)
++ result = pipe("audio.mp3")
 ```
 Whisper predicts the language of the source audio automatically. If the source audio language is known *a-priori*, it
 print(result["chunks"])
 ```
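When timestamps are requested, each entry in `result["chunks"]` pairs a `timestamp` tuple (start and end, in seconds) with its `text`. A minimal sketch of turning that list into readable lines follows. The `example_result` dictionary is made-up data in that shape, and `format_chunks` is a hypothetical helper, not part of Transformers.

```python
def format_chunks(chunks):
    """Render timestamped chunks as '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Tolerate a missing end time rather than failing on it.
        end_str = f"{end:6.2f}" if end is not None else "   ..."
        lines.append(f"[{start:6.2f} -> {end_str}] {chunk['text'].strip()}")
    return lines


# Made-up data in the same shape as the pipeline output
example_result = {
    "text": " Hello there. How are you?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, 3.0), "text": " How are you?"},
    ],
}

for line in format_chunks(example_result["chunks"]):
    print(line)
```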
 ## Additional Speed & Memory Improvements
+You can apply additional speed and memory improvements to Whisper-large-v3, which we cover in the following sections.
+### Flash Attention
+We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it.
+To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
 ```
 pip install flash-attn --no-build-isolation
 ```
+Then pass `use_flash_attention_2=True` to `from_pretrained`:
+```diff
+- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
++ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
 ```
+### Torch Scaled Dot-Product Attention (SDPA)
+If your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer).
+To do so, you first need to install optimum:
 ```
+pip install --upgrade optimum
 ```
+And then convert your model to a "BetterTransformer" model before using it:
+```diff
+model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
++ model = model.to_bettertransformer()
+```
 ## Fine-Tuning
 ## Training Data
+The models are trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
 As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.

config.json CHANGED

@@ -33,7 +33,6 @@
   "mask_time_length": 10,
   "mask_time_min_masks": 2,
   "mask_time_prob": 0.05,
-  "max_length": 448,
   "max_source_positions": 1500,
   "max_target_positions": 448,
   "median_filter_width": 7,

   "mask_time_length": 10,
   "mask_time_min_masks": 2,
   "mask_time_prob": 0.05,
   "max_source_positions": 1500,
   "max_target_positions": 448,
   "median_filter_width": 7,

generation_config.json CHANGED

@@ -161,11 +161,10 @@
     "<|yue|>": 50358,
     "<|zh|>": 50260
   },
-  "max_initial_timestamp_index": 50,
   "max_length": 448,
   "no_timestamps_token_id": 50364,
   "pad_token_id": 50257,
-  "prev_sot_token_id": 50362,
   "return_timestamps": false,
   "suppress_tokens": [
     1,

     "<|yue|>": 50358,
     "<|zh|>": 50260
   },
+  "max_initial_timestamp_index": 1,
   "max_length": 448,
   "no_timestamps_token_id": 50364,
   "pad_token_id": 50257,
   "return_timestamps": false,
   "suppress_tokens": [
     1,
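
The `max_initial_timestamp_index` value changed above is expressed in timestamp-token steps rather than seconds. Assuming Whisper's usual timestamp resolution of 0.02 s per token (an assumption, since the diff does not state it), the old and new values can be converted as in the sketch below. `index_to_seconds` is a hypothetical helper written for this illustration.

```python
TIME_PRECISION_S = 0.02  # assumed seconds per timestamp token


def index_to_seconds(index, time_precision=TIME_PRECISION_S):
    """Convert a timestamp-token index into seconds."""
    return round(index * time_precision, 2)


for label, index in (("before", 50), ("after", 1)):
    print(f"{label}: max_initial_timestamp_index={index} -> {index_to_seconds(index)} s")
```

Under that assumption, the change narrows the latest allowed first timestamp from about one second into the audio to almost the very start.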