Files changed (3)
  1. README.md +73 -262
  2. config.json +0 -1
  3. generation_config.json +1 -2
README.md CHANGED
@@ -114,39 +114,69 @@ license: apache-2.0
114
 
115
  # Whisper
116
 
117
- Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
118
- [Robust Speech Recognition via Large-Scale Weak Supervision](https://huggingface.co/papers/2212.04356) by Alec Radford
119
- et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many
120
- datasets and domains in a zero-shot setting.
121
 
122
- Whisper large-v3 has the same architecture as the previous [large](https://huggingface.co/openai/whisper-large) and [large-v2](https://huggingface.co/openai/whisper-large-v2)
123
- models, except for the following minor differences:
124
 
125
- 1. The spectrogram input uses 128 Mel frequency bins instead of 80
126
  2. A new language token for Cantonese
127
 
128
- The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled
129
- audio collected using Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). The model was trained for 2.0 epochs over this mixture dataset.
130
 
131
- The large-v3 model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors
132
- compared to Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2). For more details on the different checkpoints available, refer to the section [Model details](#model-details).
133
 
134
- **Disclaimer**: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and
135
- pasted from the original model card.
136
 
137
  ## Usage
138
 
139
- Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers
140
- library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and
141
- 🤗 Accelerate to reduce the model loading time:
142
 
143
  ```bash
144
  pip install --upgrade pip
145
- pip install --upgrade transformers datasets[audio] accelerate
146
  ```
147
 
148
  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
149
- class to transcribe audio files of arbitrary length:
150
 
151
  ```python
152
  import torch
@@ -171,6 +201,10 @@ pipe = pipeline(
171
  model=model,
172
  tokenizer=processor.tokenizer,
173
  feature_extractor=processor.feature_extractor,
174
  torch_dtype=torch_dtype,
175
  device=device,
176
  )
@@ -183,33 +217,9 @@ print(result["text"])
183
  ```
184
 
185
  To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
186
-
187
- ```python
188
- result = pipe("audio.mp3")
189
- ```
190
-
191
- Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:
192
-
193
- ```python
194
- result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
195
- ```
196
-
197
- Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous
198
- tokens. The following example demonstrates how to enable these heuristics:
199
-
200
- ```python
201
- generate_kwargs = {
202
- "max_new_tokens": 448,
203
- "num_beams": 1,
204
- "condition_on_prev_tokens": False,
205
- "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
206
- "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
207
- "logprob_threshold": -1.0,
208
- "no_speech_threshold": 0.6,
209
- "return_timestamps": True,
210
- }
211
-
212
- result = pipe(sample, generate_kwargs=generate_kwargs)
213
  ```
214
 
215
Whisper predicts the language of the source audio automatically. If the source audio language is known *a priori*, it
@@ -248,240 +258,41 @@ result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "fren
248
  print(result["chunks"])
249
  ```
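As an editorial aside (not part of the original card), timestamps can be requested at either segment or word level through the same pipeline. A minimal sketch, assuming `pipe` and `sample` are defined as in the examples above; word-level timestamps require a reasonably recent Transformers release:

```python
# Segment-level timestamps: each entry in "chunks" carries a (start, end) pair.
result = pipe(sample, return_timestamps=True)
print(result["chunks"])

# Word-level timestamps (assumption: supported in recent Transformers versions).
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
```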
250
 
251
- <details>
252
-
253
- <summary> For more control over the generation parameters, use the model + processor API directly: </summary>
254
-
255
- ```python
256
- import torch
257
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
258
- from datasets import Audio, load_dataset
259
-
260
-
261
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
262
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
263
-
264
- model_id = "openai/whisper-large-v3"
265
-
266
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
267
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
268
- )
269
- model.to(device)
270
-
271
- processor = AutoProcessor.from_pretrained(model_id)
272
-
273
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
274
- dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
275
- sample = dataset[0]["audio"]
276
-
277
- inputs = processor(
278
- sample["array"],
279
- sampling_rate=sample["sampling_rate"],
280
- return_tensors="pt",
281
- truncation=False,
282
- padding="longest",
283
- return_attention_mask=True,
284
- )
285
- inputs = inputs.to(device, dtype=torch_dtype)
286
-
287
- gen_kwargs = {
288
- "max_new_tokens": 448,
289
- "num_beams": 1,
290
- "condition_on_prev_tokens": False,
291
- "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
292
- "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
293
- "logprob_threshold": -1.0,
294
- "no_speech_threshold": 0.6,
295
- "return_timestamps": True,
296
- }
297
-
298
- pred_ids = model.generate(**inputs, **gen_kwargs)
299
- pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
300
-
301
- print(pred_text)
302
- ```
303
-
304
- </details>
305
-
306
  ## Additional Speed & Memory Improvements
307
 
308
- You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM
309
- requirements.
310
-
311
- ### Chunked Long-Form
312
-
313
- Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is
314
- required:
315
- 1. **Sequential:** uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
316
- 2. **Chunked:** splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries
317
-
318
- The sequential long-form algorithm should be used in either of the following scenarios:
319
- 1. Transcription accuracy is the most important factor, and speed is less of a consideration
320
- 2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
321
-
322
- Conversely, the chunked algorithm should be used when:
323
- 1. Transcription speed is the most important factor
324
- 2. You are transcribing a **single** long audio file
325
-
326
- By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s`
327
- parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long
328
- audio files, pass the argument `batch_size`:
329
-
330
- ```python
331
- import torch
332
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
333
- from datasets import load_dataset
334
-
335
-
336
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
337
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
338
-
339
- model_id = "openai/whisper-large-v3"
340
-
341
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
342
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
343
- )
344
- model.to(device)
345
-
346
- processor = AutoProcessor.from_pretrained(model_id)
347
 
348
- pipe = pipeline(
349
- "automatic-speech-recognition",
350
- model=model,
351
- tokenizer=processor.tokenizer,
352
- feature_extractor=processor.feature_extractor,
353
- chunk_length_s=30,
354
- batch_size=16, # batch size for inference - set based on your device
355
- torch_dtype=torch_dtype,
356
- device=device,
357
- )
358
 
359
- dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
360
- sample = dataset[0]["audio"]
361
-
362
- result = pipe(sample)
363
- print(result["text"])
364
- ```
365
-
366
- #### Torch compile
367
-
368
- The Whisper forward pass is compatible with [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)
369
- for 4.5x speed-ups.
370
-
371
- **Note:** `torch.compile` is currently not compatible with the Chunked long-form algorithm or Flash Attention 2 ⚠️
372
-
373
- ```python
374
- import torch
375
- from torch.nn.attention import SDPBackend, sdpa_kernel
376
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
377
- from datasets import load_dataset
378
- from tqdm import tqdm
379
-
380
- torch.set_float32_matmul_precision("high")
381
-
382
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
383
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
384
-
385
- model_id = "openai/whisper-large-v3"
386
-
387
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
388
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
389
- ).to(device)
390
-
391
- # Enable static cache and compile the forward pass
392
- model.generation_config.cache_implementation = "static"
393
- model.generation_config.max_new_tokens = 256
394
- model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
395
-
396
- processor = AutoProcessor.from_pretrained(model_id)
397
-
398
- pipe = pipeline(
399
- "automatic-speech-recognition",
400
- model=model,
401
- tokenizer=processor.tokenizer,
402
- feature_extractor=processor.feature_extractor,
403
- torch_dtype=torch_dtype,
404
- device=device,
405
- )
406
-
407
- dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
408
- sample = dataset[0]["audio"]
409
-
410
- # 2 warmup steps
411
- for _ in tqdm(range(2), desc="Warm-up step"):
412
- with sdpa_kernel(SDPBackend.MATH):
413
- result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})
414
-
415
- # fast run
416
- with sdpa_kernel(SDPBackend.MATH):
417
- result = pipe(sample.copy())
418
-
419
- print(result["text"])
420
- ```
421
-
422
- #### Flash Attention 2
423
-
424
- We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it and you are not using [torch.compile](#torch-compile).
425
- To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
426
 
427
  ```
428
  pip install flash-attn --no-build-isolation
429
  ```
430
 
431
- Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
432
 
433
- ```python
434
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
 
435
  ```
436
 
437
- #### Torch Scaled Dot-Product Attention (SDPA)
438
 
439
- If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
440
- This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
441
- whether you have a compatible PyTorch version, run the following Python code snippet:
442
 
443
- ```python
444
- from transformers.utils import is_torch_sdpa_available
445
-
446
- print(is_torch_sdpa_available())
447
  ```
448
-
449
- If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
450
- returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/).
451
-
452
- Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
453
- `attn_implementation="sdpa"` as follows:
454
-
455
- ```python
456
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
457
  ```
458
 
459
- For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).
460
-
461
-
462
- ## Model details
463
-
464
- Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. There are two
465
- flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English
466
- speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech
467
- translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio. For speech
468
- translation, the model predicts transcriptions to a *different* language from the audio.
469
-
470
- Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only
471
- and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
472
- are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
473
- checkpoints are summarised in the following table with links to the models on the Hub:
474
-
475
- | Size | Parameters | English-only | Multilingual |
476
- |----------|------------|------------------------------------------------------|-----------------------------------------------------|
477
- | tiny | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |
478
- | base | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
479
- | small | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
480
- | medium | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
481
- | large | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |
482
- | large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
483
- | large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
484
 
 
485
 
486
  ## Fine-Tuning
487
 
@@ -501,7 +312,7 @@ In particular, we caution against using Whisper models to transcribe recordings
501
 
502
  ## Training Data
503
 
504
- The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2.
505
 
506
  As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.
507
 
 
114
 
115
  # Whisper
116
 
117
+ Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
118
+ of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
119
+ for fine-tuning.
 
120
 
121
+ Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
122
+ by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
123
 
124
+ Whisper `large-v3` has the same architecture as the previous large models, except for the following minor differences:
125
+
126
+ 1. The input uses 128 Mel frequency bins instead of 80
127
  2. A new language token for Cantonese
128
 
129
+ The Whisper `large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
130
+ The model was trained for 2.0 epochs over this mixture dataset.
131
+
132
+ The `large-v3` model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper `large-v2`.
133
+
134
+
135
+ **Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
136
+ copied and pasted from the original model card.
137
+
138
+ ## Model details
139
+
140
+ Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model.
141
+ It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
142
+
143
+ The models were trained on either English-only data or multilingual data. The English-only models were trained
144
+ on the task of speech recognition. The multilingual models were trained on both speech recognition and speech
145
+ translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio.
146
+ For speech translation, the model predicts transcriptions to a *different* language from the audio.
147
 
148
+ Whisper checkpoints come in five configurations of varying model sizes.
149
+ The smallest four are trained on either English-only or multilingual data.
150
+ The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
151
+ are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
152
+ checkpoints are summarised in the following table with links to the models on the Hub:
153
 
154
+ | Size | Parameters | English-only | Multilingual |
155
+ |----------|------------|------------------------------------------------------|-----------------------------------------------------|
156
+ | tiny | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |
157
+ | base | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
158
+ | small | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
159
+ | medium | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
160
+ | large | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |
161
+ | large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
162
+ | large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
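As a quick illustration (an editorial addition, not from the original card), any checkpoint in the table above can be loaded by swapping in its model id; a minimal sketch:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Any id from the table works here, e.g. "openai/whisper-tiny" for a fast,
# lightweight model or "openai/whisper-large-v3" for the best accuracy.
model_id = "openai/whisper-tiny"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
```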
163
 
164
  ## Usage
165
 
166
+ Whisper `large-v3` is supported in Hugging Face 🤗 Transformers through the `main` branch in the Transformers repo. To run the model, first
167
+ install the Transformers library through the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy
168
+ audio dataset from the Hugging Face Hub:
169
 
170
  ```bash
171
  pip install --upgrade pip
172
+ pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
173
  ```
174
 
175
  The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
176
+ class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe
177
+ long-form audio files, which in practice is 9x faster than the sequential algorithm proposed by OpenAI
178
+ (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should
179
+ be set based on the specifications of your device:
180
 
181
  ```python
182
  import torch
 
201
  model=model,
202
  tokenizer=processor.tokenizer,
203
  feature_extractor=processor.feature_extractor,
204
+ max_new_tokens=128,
205
+ chunk_length_s=30,
206
+ batch_size=16,
207
+ return_timestamps=True,
208
  torch_dtype=torch_dtype,
209
  device=device,
210
  )
 
217
  ```
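Since the diff above only shows the changed lines of the example, here is a self-contained sketch of the pipeline setup those lines belong to. This is not the card's verbatim code; the dataset id is borrowed from the other examples in this diff, and the batch size is only a starting point:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

# Load the model and processor once, then wrap them in a pipeline.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,   # chunked long-form algorithm
    batch_size=16,       # tune to your device
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Toy audio sample from the Hub (assumption: same dataset as the other examples).
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```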
218
 
219
  To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
220
+ ```diff
221
+ - result = pipe(sample)
222
+ + result = pipe("audio.mp3")
223
  ```
224
 
225
Whisper predicts the language of the source audio automatically. If the source audio language is known *a priori*, it
 
258
  print(result["chunks"])
259
  ```
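The hunk above truncates the language example, so here is a short, hedged sketch of the two `generate_kwargs` settings it refers to, assuming `pipe` and `sample` from the earlier examples:

```python
# Force the source language instead of relying on automatic detection.
result = pipe(sample, generate_kwargs={"language": "french"})

# Translate the source speech into English rather than transcribing it.
result = pipe(sample, generate_kwargs={"task": "translate"})
print(result["text"])
```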
260
 
261
  ## Additional Speed & Memory Improvements
262
 
263
+ You can apply additional speed and memory improvements to Whisper large-v3, which are covered below.
264
 
265
+ ### Flash Attention
266
 
267
+ We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU supports it.
268
+ To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
269
 
270
  ```
271
  pip install flash-attn --no-build-isolation
272
  ```
273
 
274
+ and then all you have to do is pass `use_flash_attention_2=True` to `from_pretrained`:
275
 
276
+ ```diff
277
+ - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
278
+ + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
279
  ```
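Note (editorial): on current Transformers releases the `use_flash_attention_2=True` flag is superseded by the `attn_implementation` argument, which the version of this section removed above also uses; a minimal, self-contained sketch:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "openai/whisper-large-v3"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Equivalent Flash Attention 2 loading on recent Transformers versions.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
)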
280
 
281
+ ### Torch Scaled Dot-Product Attention (SDPA)
282
 
283
+ If your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer).
284
+ To do so, you first need to install optimum:
 
285
 
286
  ```
287
+ pip install --upgrade optimum
288
  ```
289
 
290
+ And then convert your model to a "BetterTransformer" model before using it:
291
 
292
+ ```diff
293
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
294
+ + model = model.to_bettertransformer()
295
+ ```
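As a follow-up note (editorial): on PyTorch 2.1.1 and later, Transformers can use the same scaled dot-product attention kernels natively, without the Optimum/BetterTransformer conversion, as described in the SDPA section removed above; a minimal, self-contained sketch:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "openai/whisper-large-v3"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Native SDPA attention (PyTorch >= 2.1.1); no BetterTransformer conversion needed.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
```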
296
 
297
  ## Fine-Tuning
298
 
 
312
 
313
  ## Training Data
314
 
315
+ The models are trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
316
 
317
  As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.
318
 
config.json CHANGED
@@ -33,7 +33,6 @@
33
  "mask_time_length": 10,
34
  "mask_time_min_masks": 2,
35
  "mask_time_prob": 0.05,
36
- "max_length": 448,
37
  "max_source_positions": 1500,
38
  "max_target_positions": 448,
39
  "median_filter_width": 7,
 
33
  "mask_time_length": 10,
34
  "mask_time_min_masks": 2,
35
  "mask_time_prob": 0.05,
 
36
  "max_source_positions": 1500,
37
  "max_target_positions": 448,
38
  "median_filter_width": 7,
generation_config.json CHANGED
@@ -161,11 +161,10 @@
161
  "<|yue|>": 50358,
162
  "<|zh|>": 50260
163
  },
164
- "max_initial_timestamp_index": 50,
165
  "max_length": 448,
166
  "no_timestamps_token_id": 50364,
167
  "pad_token_id": 50257,
168
- "prev_sot_token_id": 50362,
169
  "return_timestamps": false,
170
  "suppress_tokens": [
171
  1,
 
161
  "<|yue|>": 50358,
162
  "<|zh|>": 50260
163
  },
164
+ "max_initial_timestamp_index": 1,
165
  "max_length": 448,
166
  "no_timestamps_token_id": 50364,
167
  "pad_token_id": 50257,
 
168
  "return_timestamps": false,
169
  "suppress_tokens": [
170
  1,
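For readers who want to verify the `generation_config.json` change above, the shipped generation defaults can be inspected from Python (and overridden per call); a minimal sketch, not part of the original files:

```python
from transformers import GenerationConfig

# Load the generation config shipped with the checkpoint and inspect the
# defaults touched by this change.
gen_config = GenerationConfig.from_pretrained("openai/whisper-large-v3")
print(gen_config.max_initial_timestamp_index)
print(gen_config.max_length)
```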