RachidAR committed
Commit
efa57fa
1 Parent(s): 8e7ac48

Update README.md

Files changed (1)
  1. README.md +98 -4
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- language:
+ language:
  - en
  - zh
  - de
@@ -29,7 +29,7 @@ language:
  - da
  - hu
  - ta
- - no
+ - 'no'
  - th
  - ur
  - hr
@@ -109,7 +109,7 @@ widget:
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
  pipeline_tag: automatic-speech-recognition
- license: apache-2.0
+ license: mit
  ---

  # Whisper
@@ -120,4 +120,98 @@ et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a
  datasets and domains in a zero-shot setting.

  @OpenAI
- Downloaded from: [link](https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt)
+ Downloaded from: [link](https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt)
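
In the hunks above, the Norwegian language code `no` gains quotes because YAML 1.1 parsers read a bare `no` as the boolean `false`; a quick illustration with PyYAML:

```python
import yaml  # PyYAML follows YAML 1.1 boolean rules

print(yaml.safe_load("- no"))    # [False] -- parsed as a boolean
print(yaml.safe_load("- 'no'"))  # ['no']  -- parsed as the language code
```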

## Available models and languages

There are six model sizes, four with English-only versions, offering speed and accuracy tradeoffs.
Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model.
The relative speeds below are measured by transcribing English speech on an A100, and real-world speed may vary significantly depending on many factors, including the language, the speaking speed, and the available hardware.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~10x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~7x            |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~4x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |
| turbo  | 809 M      | N/A                | `turbo`            | ~6 GB         | ~8x            |

The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
Additionally, the `turbo` model is an optimized version of `large-v3` that offers faster transcription speed with a minimal degradation in accuracy.
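
A minimal sketch of how these model names are used with the `openai-whisper` package (the `is_multilingual` check shown here is a property of the loaded model):

```python
import whisper

# English-only model: a good fit when only English audio is expected
model_en = whisper.load_model("base.en")

# multilingual model: covers all supported languages, ~8x relative speed
model_multi = whisper.load_model("turbo")

print(model_en.is_multilingual)     # False
print(model_multi.is_multilingual)  # True
```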

Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of the `large-v3` and `large-v2` models by language, using WERs (word error rates) or CERs (character error rates, shown in *Italic*) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics for the other models and datasets can be found in Appendix D.1, D.2, and D.4 of [the paper](https://arxiv.org/abs/2212.04356), as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

![WER breakdown by language](https://github.com/openai/whisper/assets/266841/f4619d66-1058-4005-8f67-a9d811b77c62)
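
WER itself is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words; an illustrative stand-alone implementation (not the evaluation code behind the figure):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167 (1 edit / 6 words)
```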


## Command-line usage

The following command will transcribe speech in audio files, using the `turbo` model:

    whisper audio.flac audio.mp3 audio.wav --model turbo

The default setting (which selects the `small` model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the `--language` option:

    whisper japanese.wav --language Japanese

Adding `--task translate` will translate the speech into English:

    whisper japanese.wav --language Japanese --task translate

Run the following to view all available options:

    whisper --help

See [tokenizer.py](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) for the list of all available languages.
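
To inspect those languages programmatically, one option is the `LANGUAGES` mapping that the tokenizer module exposes (code-to-name, e.g. `"en" -> "english"`):

```python
from whisper.tokenizer import LANGUAGES

for code, name in sorted(LANGUAGES.items()):
    print(f"{code}: {name}")
```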


## Python usage

Transcription can also be performed within Python:

```python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])
```

Internally, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
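
Decoding options can also be passed to `transcribe()` as keyword arguments, and the returned dictionary includes per-segment timestamps alongside the full text; a short sketch (argument and key names as in the upstream `transcribe` implementation):

```python
import whisper

model = whisper.load_model("turbo")

# forward decoding options such as language and task to each 30-second window
result = model.transcribe("audio.mp3", language="ja", task="translate", fp16=False)

# each segment carries start/end times (in seconds) and its text
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f} -> {segment['end']:7.2f}] {segment['text']}")
```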

Below is an example usage of `whisper.detect_language()` and `whisper.decode()`, which provide lower-level access to the model.

```python
import whisper

model = whisper.load_model("turbo")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```
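
`DecodingOptions` also accepts explicit parameters when the defaults aren't appropriate; one possible variant of the decode step above (field names as defined in `whisper/decoding.py`):

```python
import whisper

model = whisper.load_model("turbo")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# force English output, skip timestamp tokens, and disable fp16 (e.g. on CPU)
options = whisper.DecodingOptions(language="en", without_timestamps=True, fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```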

## More examples

Please use the [🙌 Show and tell](https://github.com/openai/whisper/discussions/categories/show-and-tell) category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.


## License

Whisper's code and model weights are released under the MIT License. See [LICENSE](https://github.com/openai/whisper/blob/main/LICENSE) for further details.