contribute-branch

#12

by HiveerLi - opened Nov 6, 2023

base: refs/heads/main ← from: refs/pr/12

+22 -126

This PR is in draft mode

Files changed (4)

README.md +21 -124
generation_config.json +1 -2
original-model.bin → original-large-32-2-en.bin +0 -0
original-model.fp32.bin → original-large-32-2.fp32.bin +0 -0

README.md CHANGED

@@ -23,24 +23,14 @@ It is a distilled version of the Whisper model that is **6 times faster**, 49% s
 **within 1% WER** on out-of-distribution evaluation sets. This is the repository for distil-large-v2,
 a distilled variant of [Whisper large-v2](https://huggingface.co/openai/whisper-large-v2).
-| Model                                                                      | Params / M | Rel. Latency ↑ | Short-Form WER ↓ | Long-Form WER ↓ |
-|----------------------------------------------------------------------------|------------|----------------|------------------|-----------------|
-| [large-v3](https://huggingface.co/openai/whisper-large-v3)                 | 1550       | 1.0            | **8.4**          | 11.0            |
-| [large-v2](https://huggingface.co/openai/whisper-large-v2)                 | 1550       | 1.0            | 9.1              | 11.7            |
-|                                                                            |            |                |                  |                 |
-| [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)   | 756        | 6.3            | 9.7              | **10.8**        |
-| [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2)   | 756        | 5.8            | 10.1             | 11.6            |
-| [distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) | 394        | **6.8**        | 11.1             | 12.4            |
-| [distil-small.en](https://huggingface.co/distil-whisper/distil-small.en)   | **166**    | 5.6            | 12.1             | 12.8            |
-<div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
-  <p><b>Update:</b> following the release of OpenAI's Whisper large-v3, an updated <a href="ttps://huggingface.co/distil-whisper/distil-large-v3"> distil-large-v3</a> model was published. This <a href="ttps://huggingface.co/distil-whisper/distil-large-v3"> distil-large-v3</a> model surpasses the performance of the distil-large-v2 model, with no architecture changes and better support for sequential long-form generation. Thus, it is recommended that the <a href="ttps://huggingface.co/distil-whisper/distil-large-v3"> distil-large-v3</a> model is used in-place of the large-v2 model. </p>
-</div>
-**Note:** Distil-Whisper is currently only available for English speech recognition. We are working with the community
-to distill Whisper on other languages. If you are interested in distilling Whisper in your language, check out the
-provided [training code](https://github.com/huggingface/distil-whisper/tree/main/training). We will update the
-[Distil-Whisper repository](https://github.com/huggingface/distil-whisper/) with multilingual checkpoints when ready!
 ## Usage
@@ -56,7 +46,7 @@ pip install --upgrade transformers accelerate datasets[audio]
 ### Short-Form Transcription
 The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-class to transcribe short-form audio files (< 30-seconds) as follows:
 ```python
 import torch
@@ -101,7 +91,7 @@ To transcribe a local audio file, simply pass the path to your audio file when y
 ### Long-Form Transcription
-Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30-seconds). In practice, this chunked long-form algorithm
 is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
 To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15-seconds
@@ -154,9 +144,9 @@ result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/r
 ### Speculative Decoding
-Distil-Whisper can be used as an assistant model to Whisper for [speculative decoding](https://huggingface.co/blog/whisper-speculative-decoding).
-Speculative decoding mathematically ensures the exact same outputs as Whisper are obtained while being 2 times faster.
-This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.
 In the following code-snippet, we load the assistant Distil-Whisper model standalone to the main Whisper pipeline. We then
 specify it as the "assistant model" for generation:
@@ -239,72 +229,21 @@ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dt
 + model = model.to_bettertransformer()
 ```
-### Running Distil-Whisper in `openai-whisper`
-To use the model in the original Whisper format, first ensure you have the [`openai-whisper`](https://pypi.org/project/openai-whisper/) package installed:
-```bash
-pip install --upgrade openai-whisper
-```
-The following code-snippet demonstrates how to transcribe a sample file from the LibriSpeech dataset loaded using
-🤗 Datasets:
-```python
-import torch
-from datasets import load_dataset
-from huggingface_hub import hf_hub_download
-from whisper import load_model, transcribe
-distil_large_v2 = hf_hub_download(repo_id="distil-whisper/distil-large-v2", filename="original-model.bin")
-model = load_model(distil_large_v2)
-dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
-sample = dataset[0]["audio"]["array"]
-sample = torch.from_numpy(sample).float()
-pred_out = transcribe(model, audio=sample)
-print(pred_out["text"])
-```
-To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:
-```python
-pred_out = transcribe(model, audio="audio.mp3")
-```
 ### Whisper.cpp
-Distil-Whisper can be run from the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) repository with the original
-sequential long-form transcription algorithm. In a [provisional benchmark](https://github.com/ggerganov/whisper.cpp/pull/1424#issuecomment-1793513399)
-on Mac M1, `distil-large-v2` is 2x faster than `large-v2`, while performing to within 0.1% WER over long-form audio.
-Note that future releases of Distil-Whisper will target faster CPU inference more! By distilling smaller encoders, we
-aim to achieve similar speed-ups to what we obtain on GPU.
-Steps for getting started:
-1. Clone the Whisper.cpp repository:
-```
-git clone https://github.com/ggerganov/whisper.cpp.git
-cd whisper.cpp
-```
-2. Download the ggml weights for `distil-medium.en` from the Hugging Face Hub:
-```bash
-python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='distil-whisper/distil-large-v2', filename='ggml-large-32-2.en.bin', local_dir='./models')"
-```
-Note that if you do not have the `huggingface_hub` package installed, you can also download the weights with `wget`:
-```bash
-wget https://huggingface.co/distil-whisper/distil-large-v2/resolve/main/ggml-large-32-2.en.bin -P ./models
-```
-3. Run inference using the provided sample audio:
-```bash
-make -j && ./main -m models/ggml-large-32-2.en.bin -f samples/jfk.wav
-```
 ### Transformers.js
@@ -323,43 +262,6 @@ See the [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_
 *Note:* Due to the large model size, we recommend running this model server-side with [Node.js](https://huggingface.co/docs/transformers.js/guides/node-audio-processing) (instead of in-browser).
-### Candle
-Through an integration with Hugging Face [Candle](https://github.com/huggingface/candle/tree/main) 🕯️, Distil-Whisper is
-now available in the Rust library 🦀
-Benefit from:
-* Optimised CPU backend with optional MKL support for x86 and Accelerate for Macs
-* CUDA backend for efficiently running on GPUs, multiple GPU distribution via NCCL
-* WASM support: run Distil-Whisper in a browser
-Steps for getting started:
-1. Install [`candle-core`](https://github.com/huggingface/candle/tree/main/candle-core) as explained [here](https://huggingface.github.io/candle/guide/installation.html)
-2. Clone the `candle` repository locally:
-```
-git clone https://github.com/huggingface/candle.git
-```
-3. Enter the example directory for [Whisper](https://github.com/huggingface/candle/tree/main/candle-examples/examples/whisper):
-```
-cd candle/candle-examples/examples/whisper
-```
-4. Run an example:
-```
-cargo run --example whisper --release -- --model distil-large-v2
-```
-5. To specify your own audio file, add the `--input` flag:
-```
-cargo run --example whisper --release -- --model distil-large-v2 --input audio.wav
-```
-### 8bit & 4bit Quantization
-Coming soon ...
-### Whisper.cpp
-Coming soon ...
 ## Model Details
 Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
@@ -516,12 +418,7 @@ where it performs to within 0.2% WER of Whisper.
 ## Reproducing Distil-Whisper
-Training and evaluation code to reproduce Distil-Whisper is available under the Distil-Whisper repository: https://github.com/huggingface/distil-whisper/tree/main/training
-## License
-Distil-Whisper inherits the [MIT license](https://github.com/huggingface/distil-whisper/blob/main/LICENSE) from OpenAI's Whisper model.
 ## Citation

 **within 1% WER** on out-of-distribution evaluation sets. This is the repository for distil-large-v2,
 a distilled variant of [Whisper large-v2](https://huggingface.co/openai/whisper-large-v2).
+| Model                                                                      | Params / M | Rel. Latency | Short-Form WER | Long-Form WER |
+|----------------------------------------------------------------------------|------------|--------------|----------------|---------------|
+| [large-v2](https://huggingface.co/openai/whisper-large-v2)                 | 1550       | 1.0          | **9.1**        | 11.7          |
+|                                                                            |            |              |                |               |
+| [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2)   | 756        | 5.8          | 10.1           | **11.6**      |
+| [distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) | **394**    | **6.8**      | 11.1           | 12.4          |
+**Note:** Distil-Whisper is currently only available for English speech recognition. Multilingual support will be provided in a follow-up.
 ## Usage
 ### Short-Form Transcription
 The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+class to transcribe short-form audio files as follows:
 ```python
 import torch
 ### Long-Form Transcription
+Distil-Whisper uses a chunked algorithm to transcribe long-form audio files. In practice, this chunked long-form algorithm
 is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
 To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15-seconds
 ### Speculative Decoding
+Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
+ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
+replacement for existing Whisper pipelines, since the same outputs are guaranteed.
 In the following code-snippet, we load the assistant Distil-Whisper model standalone to the main Whisper pipeline. We then
 specify it as the "assistant model" for generation:
 + model = model.to_bettertransformer()
 ```
+### 8bit & 4bit Quantization
+Coming soon ...
+### Candle
+Coming soon ...
 ### Whisper.cpp
+Coming soon ...
+### Running Whisper in `openai/whisper`
+Coming soon ...
 ### Transformers.js
 *Note:* Due to the large model size, we recommend running this model server-side with [Node.js](https://huggingface.co/docs/transformers.js/guides/node-audio-processing) (instead of in-browser).
 ## Model Details
 Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
 ## Reproducing Distil-Whisper
+Training and evaluation code to reproduce Distil-Whisper will be made available on the Distil-Whisper repository: https://github.com/huggingface/distil-whisper
 ## Citation

generation_config.json CHANGED

@@ -123,11 +123,10 @@
     "<|zh|>": 50260
   },
   "language": "<|en|>",
-  "max_initial_timestamp_index": 50,
   "max_length": 448,
   "no_timestamps_token_id": 50363,
   "pad_token_id": 50257,
-  "prev_sot_token_id": 50361,
   "return_timestamps": false,
   "suppress_tokens": [
     1,

     "<|zh|>": 50260
   },
   "language": "<|en|>",
+  "max_initial_timestamp_index": 1,
   "max_length": 448,
   "no_timestamps_token_id": 50363,
   "pad_token_id": 50257,
   "return_timestamps": false,
   "suppress_tokens": [
     1,
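
The net effect of the `generation_config.json` hunk can be summarised at the key level. The sketch below transcribes only the keys visible in the hunk (the rest of the file is omitted) and compares the two sides; `diff_config` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
# Keys visible in the hunk on the base side (refs/heads/main).
old_fragment = {
    "language": "<|en|>",
    "max_initial_timestamp_index": 50,
    "max_length": 448,
    "no_timestamps_token_id": 50363,
    "pad_token_id": 50257,
    "prev_sot_token_id": 50361,
    "return_timestamps": False,
}

# The same keys on the PR side (refs/pr/12).
new_fragment = {
    "language": "<|en|>",
    "max_initial_timestamp_index": 1,
    "max_length": 448,
    "no_timestamps_token_id": 50363,
    "pad_token_id": 50257,
    "return_timestamps": False,
}


def diff_config(old, new):
    """Hypothetical helper: summarise key-level differences between two dicts."""
    removed = sorted(k for k in old if k not in new)
    added = sorted(k for k in new if k not in old)
    changed = {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}
    return {"removed": removed, "added": added, "changed": changed}


summary = diff_config(old_fragment, new_fragment)
print(summary)
```

Read this way, the hunk drops `prev_sot_token_id` and changes `max_initial_timestamp_index` from 50 to 1; every other visible key is untouched.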

original-model.bin → original-large-32-2-en.bin RENAMED

File without changes

original-model.fp32.bin → original-large-32-2.fp32.bin RENAMED

File without changes
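
Because both checkpoint files are renamed without content changes, any caller that requests them by the old name (for example the `filename="original-model.bin"` argument in the README snippet on the base side) would need the new name if this draft were merged. A minimal sketch of such a mapping follows; the filenames are copied from the rename entries above, while `RENAMES` and `resolve_filename` are hypothetical names introduced only for illustration.

```python
# Old filename -> new filename, as listed in this PR's rename entries.
RENAMES = {
    "original-model.bin": "original-large-32-2-en.bin",
    "original-model.fp32.bin": "original-large-32-2.fp32.bin",
}


def resolve_filename(name: str) -> str:
    """Hypothetical helper: return the post-rename filename, or the input if it was not renamed."""
    return RENAMES.get(name, name)


print(resolve_filename("original-model.bin"))
print(resolve_filename("generation_config.json"))
```

The resolved name could then be passed wherever the old one was used; files that were not renamed pass through unchanged.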