RachidAR committed
Commit
efa57fa
1 Parent(s): 8e7ac48

Update README.md

Files changed (1)
  1. README.md +98 -4
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- language:
+ language:
  - en
  - zh
  - de
@@ -29,7 +29,7 @@ language:
  - da
  - hu
  - ta
- - no
+ - 'no'
  - th
  - ur
  - hr
@@ -109,7 +109,7 @@ widget:
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
  pipeline_tag: automatic-speech-recognition
- license: apache-2.0
+ license: mit
  ---

  # Whisper
@@ -120,4 +120,98 @@ et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a
  datasets and domains in a zero-shot setting.

  @OpenAI
- Downloaded from: [link](https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt)
+ Downloaded from: [link](https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt)
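
In the hunks above, the Norwegian language code `no` gains quotes because YAML 1.1 parsers read a bare `no` as the boolean `false`; a quick illustration with PyYAML:

```python
import yaml  # PyYAML follows YAML 1.1 boolean rules

print(yaml.safe_load("- no"))    # [False] -- parsed as a boolean
print(yaml.safe_load("- 'no'"))  # ['no']  -- parsed as the language code
```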

## Available models and languages

There are six model sizes, four with English-only versions, offering speed and accuracy tradeoffs.
Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model.
The relative speeds below are measured by transcribing English speech on an A100, and real-world speed may vary significantly depending on many factors, including the language, the speaking speed, and the available hardware.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~10x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~7x            |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~4x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |
| turbo  | 809 M      | N/A                | `turbo`            | ~6 GB         | ~8x            |

The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
Additionally, the `turbo` model is an optimized version of `large-v3` that offers faster transcription speed with a minimal degradation in accuracy.
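
A minimal sketch of how these model names are used with the `openai-whisper` package (the `is_multilingual` check shown here is a property of the loaded model):

```python
import whisper

# English-only model: a good fit when only English audio is expected
model_en = whisper.load_model("base.en")

# multilingual model: covers all supported languages, ~8x relative speed
model_multi = whisper.load_model("turbo")

print(model_en.is_multilingual)     # False
print(model_multi.is_multilingual)  # True
```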

Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of the `large-v3` and `large-v2` models by language, using WERs (word error rates) or CERs (character error rates, shown in *Italic*) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics for the other models and datasets can be found in Appendix D.1, D.2, and D.4 of [the paper](https://arxiv.org/abs/2212.04356), as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

![WER breakdown by language](https://github.com/openai/whisper/assets/266841/f4619d66-1058-4005-8f67-a9d811b77c62)
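
WER itself is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words; an illustrative stand-alone implementation (not the evaluation code behind the figure):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167 (1 edit / 6 words)
```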


## Command-line usage

The following command will transcribe speech in audio files, using the `turbo` model:

    whisper audio.flac audio.mp3 audio.wav --model turbo

The default setting (which selects the `small` model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the `--language` option:

    whisper japanese.wav --language Japanese

Adding `--task translate` will translate the speech into English:

    whisper japanese.wav --language Japanese --task translate

Run the following to view all available options:

    whisper --help

See [tokenizer.py](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) for the list of all available languages.
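
To inspect those languages programmatically, one option is the `LANGUAGES` mapping that the tokenizer module exposes (code-to-name, e.g. `"en" -> "english"`):

```python
from whisper.tokenizer import LANGUAGES

for code, name in sorted(LANGUAGES.items()):
    print(f"{code}: {name}")
```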


## Python usage

Transcription can also be performed within Python:

```python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])
```

Internally, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
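
Decoding options can also be passed to `transcribe()` as keyword arguments, and the returned dictionary includes per-segment timestamps alongside the full text; a short sketch (argument and key names as in the upstream `transcribe` implementation):

```python
import whisper

model = whisper.load_model("turbo")

# forward decoding options such as language and task to each 30-second window
result = model.transcribe("audio.mp3", language="ja", task="translate", fp16=False)

# each segment carries start/end times (in seconds) and its text
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f} -> {segment['end']:7.2f}] {segment['text']}")
```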

Below is an example usage of `whisper.detect_language()` and `whisper.decode()`, which provide lower-level access to the model.

```python
import whisper

model = whisper.load_model("turbo")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```
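
`DecodingOptions` also accepts explicit parameters when the defaults aren't appropriate; one possible variant of the decode step above (field names as defined in `whisper/decoding.py`):

```python
import whisper

model = whisper.load_model("turbo")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# force English output, skip timestamp tokens, and disable fp16 (e.g. on CPU)
options = whisper.DecodingOptions(language="en", without_timestamps=True, fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```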

## More examples

Please use the [🙌 Show and tell](https://github.com/openai/whisper/discussions/categories/show-and-tell) category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.


## License

Whisper's code and model weights are released under the MIT License. See [LICENSE](https://github.com/openai/whisper/blob/main/LICENSE) for further details.