|
--- |
|
license: apache-2.0 |
|
language: |
|
- ja |
|
library_name: nemo |
|
tags: |
|
- automatic-speech-recognition |
|
- NeMo |
|
--- |
|
|
|
# reazonspeech-nemo-v2 |
|
|
|
`reazonspeech-nemo-v2` is an automatic speech recognition model trained on the [ReazonSpeech v2.0 corpus](https://huggingface.co/datasets/reazon-research/reazonspeech).

This model supports inference on long-form Japanese audio clips up to several hours long.
|
|
|
## Model Architecture |
|
|
|
The model features an improved Conformer architecture from [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084).
|
|
|
* Subword-based RNN-T model. The total parameter count is 619M. |
|
|
|
* Encoder uses [Longformer](https://arxiv.org/abs/2004.05150) attention with a local context size of 256 and a single global token.
|
|
|
* Decoder has a vocabulary of 3,000 tokens constructed with a [SentencePiece](https://github.com/google/sentencepiece) unigram tokenizer.
|
|
|
We trained this model for 1 million steps using the AdamW optimizer with a Noam annealing schedule.
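
The figures above can be checked by loading the checkpoint directly with the NeMo toolkit. The snippet below is a minimal sketch and is not needed for normal use (see Usage below); the local filename and the `EncDecRNNTBPEModel` class are assumptions.

```python
# Minimal sketch (not needed for normal use): inspect the checkpoint with NeMo.
# The local filename and the EncDecRNNTBPEModel class are assumptions.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("reazonspeech-nemo-v2.nemo")

# Total parameter count (roughly 619M)
print(sum(p.numel() for p in model.parameters()))

# Decoder vocabulary size (3,000 SentencePiece unigram tokens)
print(model.tokenizer.vocab_size)
```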
|
|
|
## Usage |
|
|
|
We recommend using this model through our [reazonspeech](https://github.com/reazon-research/reazonspeech) library.
|
|
|
```python
from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path

audio = audio_from_path("speech.wav")
model = load_model()
ret = transcribe(model, audio)
print(ret.text)
```
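
For long recordings, the returned result carries segment-level timestamps in addition to the full text. The sketch below assumes the result object exposes a `segments` list with `start_seconds`, `end_seconds`, and `text` fields; check the reazonspeech documentation for the exact attribute names.

```python
from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path

audio = audio_from_path("speech.wav")
model = load_model()
ret = transcribe(model, audio)

# Attribute names below (segments, start_seconds, end_seconds, text) are
# assumptions; consult the library documentation for the exact result API.
for segment in ret.segments:
    print(f"[{segment.start_seconds:8.2f} - {segment.end_seconds:8.2f}] {segment.text}")
```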
|
|
|
## License |
|
|
|
[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
|
|