reazonspeech-nemo-v2
reazonspeech-nemo-v2 is an automatic speech recognition (ASR) model trained on the ReazonSpeech v2.0 corpus.
The model supports inference on long-form Japanese audio clips up to several hours long.
Model Architecture
The model uses the improved Conformer architecture introduced in "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition".
It is a subword-based RNN-T model with 619M parameters in total.
The encoder uses Longformer attention with a local context size of 256 and a single global token.
The decoder has a vocabulary of 3000 tokens constructed with a SentencePiece unigram tokenizer.
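To illustrate the attention pattern described above, here is a minimal sketch of a Longformer-style mask that combines a local window with a global token. It is not the model's actual implementation. The function name `local_global_mask`, the split of the window into `left` and `right`, and the placement of the global token at index 0 are all assumptions made for illustration, and the toy sizes are far smaller than the real context size of 256.

```python
def local_global_mask(seq_len, left, right, num_global=1):
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper: the first `num_global` positions are treated as
    global tokens (an assumption), every other position sees only a local
    window of `left` tokens behind it and `right` tokens ahead of it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_window = -left <= j - i <= right
            # A global token attends to everything and is visible to everything.
            is_global = i < num_global or j < num_global
            row.append(in_window or is_global)
        mask.append(row)
    return mask


# Toy example: 8 positions, window of 2 on each side, one global token.
mask = local_global_mask(seq_len=8, left=2, right=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each non-global row allows at most `left + right + 1 + num_global` positions, so the number of attended pairs grows linearly with the sequence length rather than quadratically, which is what makes inference over hours of audio feasible.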
We trained this model for 1 million steps using the AdamW optimizer with a Noam annealing schedule.
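For readers unfamiliar with the Noam schedule, the sketch below shows its commonly cited form: the learning rate rises linearly during warmup and then decays with the inverse square root of the step. The function name `noam_lr` and the values of `d_model` and `warmup_steps` are placeholders chosen for illustration. They are not the hyperparameters used to train this model, which the text above does not state.

```python
def noam_lr(step, d_model, warmup_steps, scale=1.0):
    """Learning rate at `step` under the Noam schedule (illustrative sketch)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Placeholder hyperparameters, not the values used for this model.
D_MODEL = 512
WARMUP_STEPS = 4000

for step in (1000, 4000, 16000, 1_000_000):
    print(step, noam_lr(step, D_MODEL, WARMUP_STEPS))
```

The rate peaks at `step == warmup_steps`, where the two arguments of `min` are equal.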
Usage
We recommend using this model through our reazonspeech library.
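from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path

# Load the audio file to transcribe
audio = audio_from_path("speech.wav")
# Load the ASR model
model = load_model()
# Run recognition and print the transcribed text
ret = transcribe(model, audio)
print(ret.text)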
License