---
license: apache-2.0
language:
- ja
library_name: nemo
tags:
  - automatic-speech-recognition
  - NeMo
---

# reazonspeech-nemo-v2

`reazonspeech-nemo-v2` is an automatic speech recognition model trained
on [ReazonSpeech v2.0 corpus](https://huggingface.co/datasets/reazon-research/reazonspeech).

This model supports inference on long-form Japanese audio clips up to
several hours in length.

## Model Architecture

The model features an improved Conformer architecture from
[Fast Conformer with Linearly Scalable Attention for Efficient
Speech Recognition](https://arxiv.org/abs/2305.05084).

* Subword-based RNN-T model with a total parameter count of 619M.

* The encoder uses [Longformer](https://arxiv.org/abs/2004.05150) attention
  with a local context size of 256 and a single global token.

* The decoder has a vocabulary of 3,000 tokens constructed with a
  [SentencePiece](https://github.com/google/sentencepiece)
  unigram tokenizer.

We trained this model for 1 million steps using the AdamW optimizer
with a Noam annealing schedule.
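
For reference, these hyperparameters can be checked on the loaded checkpoint
itself. The sketch below is illustrative only: it assumes the `.nemo` file from
this repository has been downloaded locally under the hypothetical name
`reazonspeech-nemo-v2.nemo`, and that the config follows the standard NeMo
Conformer/RNN-T key names.

```python
# Illustrative only: inspect the checkpoint with the NeMo toolkit.
# The checkpoint filename is an assumption -- use the .nemo file from this repository.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("reazonspeech-nemo-v2.nemo")

# Total parameter count (roughly 619M).
print(sum(p.numel() for p in model.parameters()))

# Encoder attention settings and decoder vocabulary size, assuming standard
# NeMo Conformer/RNN-T config keys.
print(model.cfg.encoder.self_attention_model, model.cfg.encoder.att_context_size)
print(model.cfg.decoder.vocab_size)
```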

## Usage

We recommend using this model through our
[reazonspeech](https://github.com/reazon-research/reazonspeech)
library.

```python
from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path

# Load the audio file and the pretrained model, then transcribe.
audio = audio_from_path("speech.wav")
model = load_model()
ret = transcribe(model, audio)
print(ret.text)
```
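
If you prefer not to use the wrapper, the checkpoint can also be loaded
directly with the NeMo toolkit. This is a minimal sketch under the same
assumption as above (a local `.nemo` file with a hypothetical name); note that
the return format of `transcribe()` varies between NeMo versions.

```python
# Minimal sketch: transcribe directly with NeMo, bypassing the reazonspeech wrapper.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("reazonspeech-nemo-v2.nemo")

# transcribe() takes a list of audio file paths; the exact return type
# (plain strings vs. Hypothesis objects) depends on the NeMo version.
results = model.transcribe(["speech.wav"])
print(results[0])
```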

## License

[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)