---
language:
- de
license: cc-by-4.0
library_name: nemo
datasets:
- mozilla-foundation/common_voice_7_0
- Multilingual LibriSpeech (2000 hours)
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- CTC
- Conformer
- Transformer
- NeMo
- pytorch
model-index:
- name: stt_de_conformer_transducer_large
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: common_voice_7_0
      name: mozilla-foundation/common_voice_7_0
      config: other
      split: test
      args:
        language: de
    metrics:
    - type: wer
      value: 4.93
      name: WER
---
## Model Overview
This is a large Conformer-Transducer model for German automatic speech recognition (ASR), trained on over 2,000 hours of German speech.
## NVIDIA NeMo: Training
To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
```shell
pip install nemo_toolkit['all']
```
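After installation, a quick sanity check (a minimal sketch, no model download involved) is to import the ASR collection and print the installed NeMo version:
```python
# Sanity check that NeMo and its ASR collection are importable.
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```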
## How to Use this Model
The model is available for use in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
### Automatically instantiate the model
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("iqbalc/stt_de_conformer_transducer_large")
```
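If you have already downloaded the `.nemo` checkpoint file, you can also restore the model from a local path (a sketch; the file name below is a placeholder):
```python
import nemo.collections.asr as nemo_asr

# Restore from a local .nemo checkpoint instead of downloading from the Hub.
asr_model = nemo_asr.models.ASRModel.restore_from("stt_de_conformer_transducer_large.nemo")
```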
### Transcribing using Python
```python
asr_model.transcribe(['filename.wav'])
```
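A slightly fuller sketch, transcribing several files at once (file names are placeholders; depending on the NeMo version, transducer models may return a tuple of best and all hypotheses rather than a plain list):
```python
# Transcribe a batch of 16 kHz mono WAV files (paths are placeholders).
audio_files = ["sample_de_1.wav", "sample_de_2.wav"]
results = asr_model.transcribe(audio_files)
print(results)
```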
### Transcribing many audio files
```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="iqbalc/stt_de_conformer_transducer_large" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
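To also write the hypotheses to a manifest file, `transcribe_speech.py` accepts an output path (a sketch; check the script's help for the exact options in your NeMo version):
```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="iqbalc/stt_de_conformer_transducer_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
  output_filename="transcriptions.json"
```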
### Input
This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
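If your audio is not already 16 kHz mono, it can be resampled before transcription. A sketch using `librosa` and `soundfile` (illustrative third-party dependencies, not required by NeMo itself):
```python
import librosa
import soundfile as sf

# Load any supported audio file, downmix to mono, and resample to 16 kHz.
audio, sr = librosa.load("input_audio.wav", sr=16000, mono=True)
sf.write("input_16k_mono.wav", audio, 16000)
```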
### Output
This model provides transcribed speech as a string for a given audio sample.
## Model Architecture
The Conformer-Transducer model is an autoregressive variant of the Conformer model for automatic speech recognition that uses the Transducer (RNN-T) loss and decoding.
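To see how the Conformer encoder, prediction network (decoder), and joint network are composed, you can print the instantiated model (a sketch; module names depend on the NeMo version):
```python
# Inspect the Conformer encoder and transducer decoder/joint modules.
print(asr_model)
print(f"Parameter count: {sum(p.numel() for p in asr_model.parameters()):,}")
```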
## Training
The NeMo toolkit was used for training the model. The model was fine-tuned with the NeMo ASR transducer example script and a base Conformer-Transducer config.
The tokenizer for the model was built using the text transcripts of the train set with the NeMo tokenizer-building script.
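As a sketch, a SentencePiece tokenizer can be built from a training manifest with NeMo's tokenizer script (paths, vocabulary size, and flags below are illustrative; check the script's help for your NeMo version):
```shell
python [NEMO_GIT_FOLDER]/scripts/tokenizers/process_asr_text_tokenizer.py \
  --manifest="<TRAIN MANIFEST JSON>" \
  --data_root="<OUTPUT TOKENIZER DIRECTORY>" \
  --vocab_size=1024 \
  --tokenizer="spe" \
  --spe_type="unigram"
```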
### Datasets
All the models in this collection are trained on a composite dataset comprising over two thousand hours of cleaned German speech:
1. MCV 7.0: 567 hours
2. MLS: 1,524 hours
3. VoxPopuli: 214 hours
## Performance
Performance of the ASR model is reported in terms of Word Error Rate (WER%) with greedy decoding.
MCV 7.0 test: 4.93% WER
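Greedy decoding is the default; beam search can be enabled by changing the decoding strategy (a sketch based on NeMo's RNNT decoding config; field names may differ between NeMo versions):
```python
from omegaconf import open_dict

# Switch the transducer decoding strategy from greedy to beam search.
decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "beam"
    decoding_cfg.beam.beam_size = 4
asr_model.change_decoding_strategy(decoding_cfg)
```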
## Limitations
The model might perform worse on accented speech.
## References
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)