metadata
tags:
- espnet
- audio
- automatic-speech-recognition
language: et
license: cc-by-4.0
Estonian ESPnet2 ASR model
Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
Intended uses & limitations
This model is intended for general-purpose speech recognition, for example broadcast conversations, interviews, and talks.
How to use
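import soundfile
from espnet2.bin.asr_inference import Speech2Text
# Download and load the pretrained model
model = Speech2Text.from_pretrained(
    "TalTechNLP/espnet2_estonian",
    lm_weight=0.6, ctc_weight=0.4, beam_size=60
)
# Read a sound file with a 16 kHz sample rate
speech, rate = soundfile.read("speech.wav")
assert rate == 16000
# The model returns a list of n-best hypotheses, best first;
# each hypothesis is a tuple whose first element is the decoded text
best_hypothesis, *_ = model(speech)
print(best_hypothesis[0])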
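The snippet above stops with an assertion error if the file is not sampled at 16 kHz. As a minimal sketch, the properties of a WAV file can be inspected before it is passed to the model, using only the Python standard library. The helper names `describe_wav` and `is_ready_for_decoding` are hypothetical and not part of ESPnet, and the mono requirement is an assumption (the example above only checks the sample rate). The `wave` module reads PCM-encoded WAV files only.

```python
import wave

EXPECTED_RATE = 16000  # sample rate asserted in the usage example


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def is_ready_for_decoding(path):
    """Hypothetical pre-check: True if the file is 16 kHz and mono.

    The mono condition is an assumption, not something the model card states.
    """
    rate, channels, _ = describe_wav(path)
    return rate == EXPECTED_RATE and channels == 1
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before decoding.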
Limitations and bias
Because this model was trained mostly on broadcast speech and web text, it may have trouble correctly decoding the following:
- Speech containing technical and other domain-specific terms
- Children's speech
- Non-native speech
- Speech recorded under very noisy conditions or with a microphone far from the speaker
- Very spontaneous and overlapping speech
Training data
Acoustic training data:
Type | Amount (h) |
---|---|
Broadcast speech | 591 |
Spontaneous speech | 53 |
Elderly speech corpus | 53 |
Talks, lectures | 49 |
Parliament speeches | 31 |
Total | 761 |
Language model training data:
- Estonian National Corpus 2019
- OpenSubtitles
- Speech transcripts
Training procedure
Standard ESPnet2 Conformer recipe.
Evaluation results
WER
Dataset | Sentences | Words | Corr (%) | Sub (%) | Del (%) | Ins (%) | Err (%) | S.Err (%) |
---|---|---|---|---|---|---|---|---|
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset | 2864 | 56575 | 93.1 | 4.5 | 2.4 | 2.0 | 8.9 | 63.4 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset | 273 | 4677 | 93.9 | 3.6 | 2.4 | 1.2 | 7.3 | 46.5 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset | 818 | 11093 | 94.7 | 2.7 | 2.5 | 0.9 | 6.2 | 45.0 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset | 1207 | 13865 | 82.3 | 8.5 | 9.3 | 3.4 | 21.2 | 74.1 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset | 1648 | 22707 | 86.4 | 7.6 | 6.0 | 2.5 | 16.1 | 75.7 |
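In the table above, the error rate is the sum of the substitution, deletion, and insertion rates relative to the number of reference words; for the first row, 4.5 + 2.4 + 2.0 = 8.9. The sketch below shows how such counts can be derived from a word-level edit-distance alignment. It is an illustration only, not the scoring tool used to produce the table: the function names are hypothetical, and a real scoring tool may normalize text and break alignment ties differently.

```python
def error_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimal word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match: no new edit
                continue
            t, s, d, n = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Keep the cheapest option; ties go to the first one listed
            table[i][j] = min((substitution, deletion, insertion), key=lambda c: c[0])
    _, subs, dels, ins = table[len(ref)][len(hyp)]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER in percent: (Sub + Del + Ins) divided by the number of reference words."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return 100.0 * sum(error_counts(reference, hypothesis)) / n_words
```

For example, with the placeholder reference `"a b c d"` and hypothesis `"a x c"`, the alignment contains one substitution and one deletion, giving a WER of 50%. Because insertions count as errors, the error rate can exceed 100% minus the percentage of correct words, as in the rows above.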
BibTeX entry and citation info
Citing ESPnet
@inproceedings{watanabe2018espnet,
author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
title={{ESPnet}: End-to-End Speech Processing Toolkit},
year={2018},
booktitle={Proceedings of Interspeech},
pages={2207--2211},
doi={10.21437/Interspeech.2018-1456},
url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}