+---
+language: multilingual
+tags:
+- LID
+- spoken language recognition
+license: apache-2.0
+datasets:
+- VoxLingua107
+metrics:
+- ER
+inference: false
+---
+# VoxLingua107 ECAPA-TDNN Spoken Language Identification Model
+## Model description
+This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain.
+The model uses the ECAPA-TDNN architecture, which has previously been used for speaker recognition.
+The model can classify a speech utterance according to the language spoken.
+It knows about 107 different languages (
+Abkhazian,
+Afrikaans,
+Amharic,
+Arabic,
+Assamese,
+Azerbaijani,
+Bashkir,
+Belarusian,
+Bulgarian,
+Bengali,
+Tibetan,
+Breton,
+Bosnian,
+Catalan,
+Cebuano,
+Czech,
+Welsh,
+Danish,
+German,
+Greek,
+English,
+Esperanto,
+Spanish,
+Estonian,
+Basque,
+Persian,
+Finnish,
+Faroese,
+French,
+Galician,
+Guarani,
+Gujarati,
+Manx,
+Hausa,
+Hawaiian,
+Hindi,
+Croatian,
+Haitian,
+Hungarian,
+Armenian,
+Interlingua,
+Indonesian,
+Icelandic,
+Italian,
+Hebrew,
+Japanese,
+Javanese,
+Georgian,
+Kazakh,
+Central Khmer,
+Kannada,
+Korean,
+Latin,
+Luxembourgish,
+Lingala,
+Lao,
+Lithuanian,
+Latvian,
+Malagasy,
+Maori,
+Macedonian,
+Malayalam,
+Mongolian,
+Marathi,
+Malay,
+Maltese,
+Burmese,
+Nepali,
+Dutch,
+Norwegian Nynorsk,
+Norwegian,
+Occitan,
+Panjabi,
+Polish,
+Pushto,
+Portuguese,
+Romanian,
+Russian,
+Sanskrit,
+Scots,
+Sindhi,
+Sinhala,
+Slovak,
+Slovenian,
+Shona,
+Somali,
+Albanian,
+Serbian,
+Sundanese,
+Swedish,
+Swahili,
+Tamil,
+Telugu,
+Tajik,
+Thai,
+Turkmen,
+Tagalog,
+Turkish,
+Tatar,
+Ukrainian,
+Urdu,
+Uzbek,
+Vietnamese,
+Waray,
+Yiddish,
+Yoruba,
+Mandarin Chinese).
+## Intended uses & limitations
+The model has two uses:
+  - use 'as is' for spoken language recognition
+  - use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data
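As a rough illustration of the second use, the sketch below trains a nearest-centroid classifier on utterance-level embeddings and classifies a new utterance by cosine similarity. It is only a minimal sketch: the 3-dimensional vectors and the `et`/`fi` labels are made-up stand-ins for real embeddings (which would come from the model itself, for example via SpeechBrain's `encode_batch`), and the helper names `fit_centroids` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(embeddings, labels):
    """Average the embeddings of each language into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is most similar to the given embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy stand-ins for real utterance embeddings (hypothetical data).
train_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["et", "et", "fi", "fi"]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.7, 0.3, 0.0]))
```

A nearest-centroid model is used here only because it is short; any classifier that accepts fixed-size vectors could take its place.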
+The model is trained on automatically collected YouTube data. For more
+information about the dataset, see the [VoxLingua107 page](http://bark.phon.ioc.ee/voxlingua107/).
+#### How to use
+```python
+import torchaudio
+from speechbrain.pretrained import EncoderClassifier
+# Load the pretrained classifier; the model files are cached under savedir
+language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn", savedir="tmp")
+# Download Thai language sample from Omniglot
+signal, fs = torchaudio.load("https://omniglot.com/soundfiles/udhr/udhr_th.mp3")
+# Resample to 16 kHz and convert to mono by keeping only the first (left) channel
+signal_resampled = torchaudio.transforms.Resample(fs, 16000)(signal)[0]
+prediction = language_id.classify_batch(signal_resampled)
+print(prediction)
+# (tensor([[0.3210, 0.3751, 0.3680, 0.3939, 0.4026, 0.3644, 0.3689, 0.3597, 0.3508,
+#          0.3666, 0.3895, 0.3978, 0.3848, 0.3957, 0.3949, 0.3586, 0.4360, 0.3997,
+#          0.4106, 0.3886, 0.4177, 0.3870, 0.3764, 0.3763, 0.3672, 0.4000, 0.4256,
+#          0.4091, 0.3563, 0.3695, 0.3320, 0.3838, 0.3850, 0.3867, 0.3878, 0.3944,
+#          0.3924, 0.4063, 0.3803, 0.3830, 0.2996, 0.4187, 0.3976, 0.3651, 0.3950,
+#          0.3744, 0.4295, 0.3807, 0.3613, 0.4710, 0.3530, 0.4156, 0.3651, 0.3777,
+#          0.3813, 0.6063, 0.3708, 0.3886, 0.3766, 0.4023, 0.3785, 0.3612, 0.4193,
+#          0.3720, 0.4406, 0.3243, 0.3866, 0.3866, 0.4104, 0.4294, 0.4175, 0.3364,
+#          0.3595, 0.3443, 0.3565, 0.3776, 0.3985, 0.3778, 0.2382, 0.4115, 0.4017,
+#          0.4070, 0.3266, 0.3648, 0.3888, 0.3907, 0.3755, 0.3631, 0.4460, 0.3464,
+#          0.3898, 0.3661, 0.3883, 0.3772, 0.9289, 0.3687, 0.4298, 0.4211, 0.3838,
+#          0.3521, 0.3515, 0.3465, 0.4772, 0.4043, 0.3844, 0.3973, 0.4343]]), tensor([0.9289]), tensor([94]), ['th'])
+# The scores in the prediction[0] tensor can be interpreted as cosine scores between
+# the languages and the given utterance (i.e., the larger the better)
+# The identified language ISO code is given in prediction[3]
+print(prediction[3])
+# ['th']
+```
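Because larger scores indicate a better match, the score vector can also be used to rank several candidate languages rather than reading off only the top one. The following is a minimal sketch of that ranking in plain Python. The helper `top_k` is hypothetical, and the four labels and scores are invented for illustration; with the real model, the scores would be the values of `prediction[0]` converted to a list, paired with language codes in the model's label order.

```python
def top_k(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Invented scores for four languages (not real model output).
labels = ["en", "lo", "th", "vi"]
scores = [0.38, 0.61, 0.93, 0.45]

for label, score in top_k(scores, labels, k=2):
    print(f"{label}: {score:.2f}")
```

Looking at the gap between the first and second candidates is one simple way to judge how confident a prediction is.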
+#### Limitations and bias
+Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:
+ - Its accuracy on smaller languages is probably quite limited
+ - It probably works much worse on female speech than on male speech (because the YouTube data includes much more male speech)
+ - Based on subjective experiments, it doesn't work well for speech with a foreign accent
+ - It probably doesn't work well on children's speech
+## Training data
+The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).
+VoxLingua107 is a speech dataset for training spoken language identification models.
+The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.
+VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
+The average amount of data per language is 62 hours. However, the actual amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, each validated by at least two volunteers as actually containing the given language.
+## Training procedure
+We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
+The training recipe will be published soon.
+## Evaluation results
+Error rate: 6% on the development dataset
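For readers who want to compute a comparable figure on their own data, the sketch below assumes that the error rate is the fraction of utterances whose top-1 predicted language differs from the reference label. That definition is an assumption rather than something stated in this card, and the helper `error_rate` and the five toy label pairs are made up for illustration.

```python
def error_rate(references, predictions):
    """Fraction of utterances whose predicted label differs from the reference."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        raise ValueError("at least one utterance is required")
    errors = sum(ref != hyp for ref, hyp in zip(references, predictions))
    return errors / len(references)


# Toy example: one of five utterances is misclassified (hypothetical labels).
refs = ["th", "et", "fi", "en", "de"]
hyps = ["th", "et", "sv", "en", "de"]
print(f"Error rate: {error_rate(refs, hyps):.0%}")
```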
+### BibTeX entry and citation info
+```bibtex
+@inproceedings{valk2021slt,
+  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
+  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
+  booktitle={Proc. IEEE SLT Workshop},
+  year={2021},
+}
+```