whisper_large_multi / README.md
Divyasreepat's picture
Update README.md with new model card content
b042434 verified
---
library_name: keras-hub
license: mit
tags:
- speech-recognition
- keras
- automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
---
### Model Overview
⚠️ Whisper is currently only available via the `keras-hub-nightly` package. Use `pip install keras-hub-nightly` to try this model.
A Whisper encoder-decoder network for speech.
This class implements a Transformer-based encoder-decoder model as
described in
["Robust Speech Recognition via Large-Scale Weak Supervision"](https://arxiv.org/abs/2212.04356).
It includes the embedding lookups and transformer layers, but not the head
for predicting the next token.
The default constructor gives a fully customizable, randomly initialized Whisper
model with any number of layers, heads, and embedding dimensions. To load
preset architectures and weights, use the `from_preset()` constructor.
Disclaimer: Pre-trained models are provided on an "as is" basis, without
warranties or conditions of any kind. The underlying model is provided by a
third party and subject to a separate license, available
[here](https://github.com/openai/whisper).
__Arguments__
- __vocabulary_size__: int. The size of the token vocabulary.
- __num_layers__: int. The number of transformer encoder layers and
transformer decoder layers.
- __num_heads__: int. The number of attention heads for each transformer.
The hidden size must be divisible by the number of attention heads.
- __hidden_dim__: int. The size of the transformer encoding and pooler layers.
- __intermediate_dim__: int. The output dimension of the first Dense layer in
a two-layer feedforward network for each transformer.
- __num_mels__: int. The number of mel-frequency filters. Defaults to `80`.
- __dropout__: float. Dropout probability for the Transformer encoder.
- __max_encoder_sequence_length__: int. The maximum sequence length that the
audio encoder can consume. Since the second convolutional layer in
the encoder reduces the sequence length by half (stride of 2), we
use `max_encoder_sequence_length // 2` as the sequence length for the
positional embedding layer.
- __max_decoder_sequence_length__: int. The maximum sequence length that the
text decoder can consume.
## Example Usage
```python
import keras_hub
import keras_core as keras
import numpy as np
```
```python
input_data = {
"encoder_features": np.ones(shape=(1, 12, 80), dtype="int32"),
"decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
"decoder_padding_mask": np.array(
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
),
}
# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
vocabulary_size=51864,
num_layers=4,
num_heads=4,
hidden_dim=256,
intermediate_dim=512,
max_encoder_sequence_length=128,
max_decoder_sequence_length=128,
)
model(input_data)
```
## Example Usage with Hugging Face URI
```python
import keras_hub
import keras_core as keras
import numpy as np
```
```python
input_data = {
"encoder_features": np.ones(shape=(1, 12, 80), dtype="int32"),
"decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
"decoder_padding_mask": np.array(
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
),
}
# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
vocabulary_size=51864,
num_layers=4,
num_heads=4,
hidden_dim=256,
intermediate_dim=512,
max_encoder_sequence_length=128,
max_decoder_sequence_length=128,
)
model(input_data)
```