tags:
  - espnet
  - audio
  - automatic-speech-recognition
language: et
license: cc-by-4.0

Estonian ESPnet2 ASR model

Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

Intended uses & limitations

This model is intended for general-purpose speech recognition of material such as broadcast conversations, interviews, and talks.

How to use


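import soundfile
from espnet2.bin.asr_inference import Speech2Text

# Download and load the pretrained model
# (from_pretrained requires the espnet_model_zoo package)
model = Speech2Text.from_pretrained(
    "TalTechNLP/espnet2_estonian",
    lm_weight=0.6, ctc_weight=0.4, beam_size=60
)

# Read a sound file; the model expects a 16 kHz sample rate
speech, rate = soundfile.read("speech.wav")
assert rate == 16000

# The model returns an n-best list; each entry starts with the hypothesis text
nbests = model(speech)
text, *_ = nbests[0]
print(text)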
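
The snippet above asserts a 16 kHz sample rate, so recordings at other rates or with several channels need to be converted first. The following is a minimal sketch of such a conversion using only NumPy. The helper name `to_mono_16k` is hypothetical and not part of ESPnet. It resamples by plain linear interpolation with no anti-aliasing filter, which is an assumption made to keep the example dependency-free; a dedicated resampler such as `scipy.signal.resample_poly` is likely a better choice for real use.

```python
import numpy as np


def to_mono_16k(speech, rate, target_rate=16000):
    """Downmix to mono and resample to target_rate (hypothetical helper).

    Uses linear interpolation without low-pass filtering, so downsampled
    audio may contain aliasing. Intended as an illustration only.
    """
    speech = np.asarray(speech, dtype=np.float32)

    # soundfile.read returns an array of shape (frames, channels)
    # for multichannel files; average the channels to get mono.
    if speech.ndim == 2:
        speech = speech.mean(axis=1)

    if rate == target_rate:
        return speech

    # Time stamps (in seconds) of the input and output samples.
    n_out = int(round(len(speech) * target_rate / rate))
    old_t = np.arange(len(speech)) / rate
    new_t = np.arange(n_out) / target_rate

    # Evaluate the input signal at the output time stamps.
    return np.interp(new_t, old_t, speech).astype(np.float32)
```

With this helper, the array and rate returned by `soundfile.read("speech.wav")` could be passed through `to_mono_16k(speech, rate)` before calling `model(...)`.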

Limitations and bias

Since this model was trained mostly on broadcast speech and web texts, it may have trouble correctly decoding the following:

  • Speech containing technical and other domain-specific terms
  • Children's speech
  • Non-native speech
  • Speech recorded under very noisy conditions or with a microphone far from the speaker
  • Very spontaneous and overlapping speech

Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|------------|
| Broadcast speech      | 591        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| Total                 | 761        |

Language model training data:

  • Estonian National Corpus 2019
  • OpenSubtitles
  • Speech transcripts

Training procedure

The model was trained with the standard ESPnet2 Conformer recipe.

Evaluation results

WER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset | 2864 | 56575 | 93.1 | 4.5 | 2.4 | 2.0 | 8.9 | 63.4 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset | 273 | 4677 | 93.9 | 3.6 | 2.4 | 1.2 | 7.3 | 46.5 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset | 818 | 11093 | 94.7 | 2.7 | 2.5 | 0.9 | 6.2 | 45.0 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset | 1207 | 13865 | 82.3 | 8.5 | 9.3 | 3.4 | 21.2 | 74.1 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset | 1648 | 22707 | 86.4 | 7.6 | 6.0 | 2.5 | 16.1 | 75.7 |
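
The column names match the sclite-style scoring output that ESPnet recipes typically produce. Assuming that convention, Snt and Wrd are the numbers of sentences and reference words, Corr, Sub, Del and Ins are percentages of reference words, Err is the word error rate (Sub + Del + Ins), and S.Err is the percentage of sentences with at least one error. For the aktuaalne2021 test set, for example, 4.5 + 2.4 + 2.0 = 8.9. The sketch below shows how these counts could be obtained for a single sentence pair with a standard edit-distance alignment. The names `ErrorCounts` and `count_errors` are hypothetical, and this is not the scoring tool used to produce the table.

```python
from typing import NamedTuple


class ErrorCounts(NamedTuple):
    ref_words: int
    substitutions: int
    deletions: int
    insertions: int

    @property
    def wer(self) -> float:
        """Word error rate in percent; assumes a non-empty reference."""
        errors = self.substitutions + self.deletions + self.insertions
        return 100.0 * errors / self.ref_words


def count_errors(ref: str, hyp: str) -> ErrorCounts:
    """Align two whitespace-tokenized sentences with minimum edit distance."""
    r, h = ref.split(), hyp.split()

    # Each cell holds (total_errors, substitutions, deletions, insertions)
    # for the best alignment of r[:i] with h[:j]. Row 0: only insertions.
    prev = [(j, 0, 0, j) for j in range(len(h) + 1)]
    for i in range(1, len(r) + 1):
        cur = [(i, 0, i, 0)]  # column 0: only deletions
        for j in range(1, len(h) + 1):
            t, s, d, n = prev[j - 1]
            if r[i - 1] == h[j - 1]:
                diagonal = (t, s, d, n)          # correct word
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = prev[j]
            deletion = (t + 1, s, d + 1, n)      # reference word missing
            t, s, d, n = cur[j - 1]
            insertion = (t + 1, s, d, n + 1)     # extra hypothesis word
            # Tuples compare by total error count first.
            cur.append(min(diagonal, deletion, insertion))
        prev = cur

    _, s, d, n = prev[-1]
    return ErrorCounts(len(r), s, d, n)


counts = count_errors("a b c d", "a x c")
print(counts, counts.wer)
```

Corpus-level figures like those in the table would sum the counts over all sentences before dividing by the total number of reference words, rather than averaging per-sentence rates.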

BibTeX entry and citation info

Citing ESPnet

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}