Tanel committed on
Commit
c2dba9f
1 Parent(s): bb34c6d

Update README.md

  1. README.md +212 -0
README.md CHANGED
@@ -0,0 +1,212 @@
+ ---
+ language: multilingual
+ tags:
+ - LID
+ - spoken language recognition
+ license: apache-2.0
+ datasets:
+ - VoxLingua107
+ metrics:
+ - ER
+ inference: false
+ ---
+
+ # VoxLingua107 ECAPA-TDNN Spoken Language Identification Model
+
+ ## Model description
+
+ This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain.
+ The model uses the ECAPA-TDNN architecture, which was previously used for speaker recognition.
+
+ The model can classify a speech utterance according to the language spoken.
+ It covers 107 languages (
+ Abkhazian, Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali,
+ Tibetan, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greek,
+ English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, Faroese, French, Galician,
+ Guarani, Gujarati, Manx, Hausa, Hawaiian, Hindi, Croatian, Haitian, Hungarian, Armenian,
+ Interlingua, Indonesian, Icelandic, Italian, Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer,
+ Kannada, Korean, Latin, Luxembourgish, Lingala, Lao, Lithuanian, Latvian, Malagasy, Maori,
+ Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian Nynorsk,
+ Norwegian, Occitan, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sanskrit, Scots,
+ Sindhi, Sinhala, Slovak, Slovenian, Shona, Somali, Albanian, Serbian, Sundanese, Swedish,
+ Swahili, Tamil, Telugu, Tajik, Thai, Turkmen, Tagalog, Turkish, Tatar, Ukrainian,
+ Urdu, Uzbek, Vietnamese, Waray, Yiddish, Yoruba, Mandarin Chinese).
+
+ ## Intended uses & limitations
+
+ The model has two uses:
+
+ - use 'as is' for spoken language recognition
+ - use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data
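The second use can be sketched roughly as follows. This is only an illustration, not the authors' method: it assumes utterance-level embeddings have already been extracted and are available as plain lists of floats, and it swaps in a simple nearest-centroid classifier with cosine similarity for whatever dedicated model you would actually train. The toy vectors, the labels, and the helper names `cosine`, `train_centroids`, and `predict` are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def train_centroids(embeddings, labels):
    """Average the embeddings of each language into a single centroid."""
    grouped = defaultdict(list)
    for emb, lab in zip(embeddings, labels):
        grouped[lab].append(emb)
    return {
        lab: [sum(dim) / len(embs) for dim in zip(*embs)]
        for lab, embs in grouped.items()
    }


def predict(centroids, emb):
    """Return the language whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda lab: cosine(centroids[lab], emb))


# Toy 3-dimensional stand-ins for real embeddings (assumption: real ones
# come from the model and have many more dimensions).
train_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["et", "et", "fi", "fi"]

centroids = train_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.7, 0.3, 0.0]))
```

A nearest-centroid rule is used here only because it needs no extra dependencies; any classifier that accepts fixed-size vectors could take its place.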
137
+
138
+ The model is trained on the automatically collected YouTube data. For more
139
+ information about the dataset, see [here](http://bark.phon.ioc.ee/voxlingua107/).
140
+
141
+
142
+ #### How to use
143
+
+ ```python
+ import urllib.request
+
+ import torchaudio
+ from speechbrain.pretrained import EncoderClassifier
+
+ # Load the pretrained model; downloaded files are cached in the "tmp" directory
+ language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn", savedir="tmp")
+ # Download Thai language sample from Omniglot
+ # (saved locally first, since torchaudio.load may not accept URLs in every version)
+ urllib.request.urlretrieve("https://omniglot.com/soundfiles/udhr/udhr_th.mp3", "udhr_th.mp3")
+ signal, fs = torchaudio.load("udhr_th.mp3")
+ # Resample to 16000 Hz and convert to mono by taking only the left channel
+ signal_resampled = torchaudio.transforms.Resample(fs, 16000)(signal)[0]
+ prediction = language_id.classify_batch(signal_resampled)
+ print(prediction)
+ # Output:
+ # (tensor([[0.3210, 0.3751, 0.3680, 0.3939, 0.4026, 0.3644, 0.3689, 0.3597, 0.3508,
+ #           0.3666, 0.3895, 0.3978, 0.3848, 0.3957, 0.3949, 0.3586, 0.4360, 0.3997,
+ #           0.4106, 0.3886, 0.4177, 0.3870, 0.3764, 0.3763, 0.3672, 0.4000, 0.4256,
+ #           0.4091, 0.3563, 0.3695, 0.3320, 0.3838, 0.3850, 0.3867, 0.3878, 0.3944,
+ #           0.3924, 0.4063, 0.3803, 0.3830, 0.2996, 0.4187, 0.3976, 0.3651, 0.3950,
+ #           0.3744, 0.4295, 0.3807, 0.3613, 0.4710, 0.3530, 0.4156, 0.3651, 0.3777,
+ #           0.3813, 0.6063, 0.3708, 0.3886, 0.3766, 0.4023, 0.3785, 0.3612, 0.4193,
+ #           0.3720, 0.4406, 0.3243, 0.3866, 0.3866, 0.4104, 0.4294, 0.4175, 0.3364,
+ #           0.3595, 0.3443, 0.3565, 0.3776, 0.3985, 0.3778, 0.2382, 0.4115, 0.4017,
+ #           0.4070, 0.3266, 0.3648, 0.3888, 0.3907, 0.3755, 0.3631, 0.4460, 0.3464,
+ #           0.3898, 0.3661, 0.3883, 0.3772, 0.9289, 0.3687, 0.4298, 0.4211, 0.3838,
+ #           0.3521, 0.3515, 0.3465, 0.4772, 0.4043, 0.3844, 0.3973, 0.4343]]), tensor([0.9289]), tensor([94]), ['th'])
+ # The scores in the prediction[0] tensor can be interpreted as cosine scores between
+ # the languages and the given utterance (i.e., the larger the better)
+ # The identified language ISO code is given in prediction[3]
+ print(prediction[3])
+ # Output:
+ # ['th']
+ ```
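To look beyond the single best guess, the per-language scores can be ranked. The sketch below is only an illustration: it uses a short hypothetical score list and label list in place of the real `prediction[0]` tensor and the model's own label set, and the helper `top_k` is not part of SpeechBrain.

```python
def top_k(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical toy values. With the real model, one option would be to pass
# prediction[0][0].tolist() together with the matching list of language codes.
scores = [0.31, 0.93, 0.48, 0.61]
labels = ["ab", "th", "vi", "lo"]

for label, score in top_k(scores, labels):
    print(f"{label}: {score:.2f}")
```

Because the scores behave like cosine scores (larger is better), sorting in descending order puts the most likely languages first.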
+
+ #### Limitations and bias
+
+ Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:
+
+ - Its accuracy on smaller languages is probably quite limited
+ - It probably works much worse on female speech than on male speech (because the YouTube data includes much more male speech)
+ - Based on subjective experiments, it doesn't work well for speech with a foreign accent
+ - It probably doesn't work well on children's speech
+
+
+ ## Training data
+
+ The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).
+
+ VoxLingua107 is a speech dataset for training spoken language identification models.
+ The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.
+
+ VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
+ The average amount of data per language is 62 hours (6628 / 107). However, the actual amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, each validated by at least two volunteers as actually containing the given language.
+
+ ## Training procedure
+
+ We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
+ The training recipe will be published soon.
+
+ ## Evaluation results
+
+ Error rate: 6% on the development dataset
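For readers unfamiliar with the metric, the sketch below assumes the usual definition of error rate: the share of utterances whose predicted language differs from the reference label. The labels are hypothetical toy values, not the actual development-set results, and the helper `error_rate` is introduced here only for illustration.

```python
def error_rate(references, predictions):
    """Fraction of utterances whose predicted language is wrong."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        raise ValueError("at least one utterance is required")
    errors = sum(ref != hyp for ref, hyp in zip(references, predictions))
    return errors / len(references)


# Hypothetical toy labels: one of the five utterances is misclassified.
references = ["et", "fi", "th", "th", "en"]
predictions = ["et", "et", "th", "th", "en"]

print(f"Error rate: {error_rate(references, predictions):.0%}")
```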
201
+
202
+
203
+ ### BibTeX entry and citation info
204
+
205
+ ```bibtex
206
+ @inproceedings{valk2021slt,
207
+ title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
208
+ author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
209
+ booktitle={Proc. IEEE SLT Workshop},
210
+ year={2021},
211
+ }
212
+ ```