versae committed on
Commit
0a0067d
2 Parent(s): 3489b8d 03f0615

Merge branch 'main' of https://huggingface.co/NbAiLab/nordic-lid

  1. README.md +197 -0
README.md CHANGED
@@ -1,3 +1,200 @@
1
  ---
2
  license: openrail
3
  ---
1
  ---
2
  license: openrail
3
  ---
4
+
5
+ # Nordic language identification
6
+
7
+ This repo contains models for identifying the language of a text. They are based on fastText and designed with the Nordic languages in mind, including several Sámi languages. There are two flavours: a model that distinguishes between the 13 most common languages in the Nordic countries, and a model that extends that set to 159 languages from around the world.
8
+
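As a reading aid, here is a minimal sketch of how a fastText language-identification model such as `nordic-lid.bin` might be queried from Python. It is not taken from the README. It assumes the model file has been downloaded to the working directory, that the `fasttext` Python package is installed, and that labels use fastText's default `__label__` prefix followed by the ISO 639-3 code shown in the tables. The helper names `clean_text`, `parse_predictions`, `load_model` and `identify` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: the models keep fastText's default label prefix.
LABEL_PREFIX = "__label__"


def clean_text(text: str) -> str:
    """Collapse all whitespace into single spaces.

    fastText's predict() works on one line at a time and rejects
    strings that contain a newline.
    """
    return " ".join(text.split())


def parse_predictions(labels: Sequence[str], probs: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each label (prefix stripped) with its probability."""
    result = []
    for label, prob in zip(labels, probs):
        if label.startswith(LABEL_PREFIX):
            label = label[len(LABEL_PREFIX):]
        result.append((label, float(prob)))
    return result


def load_model(path: str = "nordic-lid.bin"):
    """Load a model file; fasttext is imported lazily so the helpers
    above stay usable without the package installed."""
    import fasttext

    return fasttext.load_model(path)


def identify(model, text: str, k: int = 1) -> List[Tuple[str, float]]:
    """Return the k most likely (language code, probability) pairs."""
    labels, probs = model.predict(clean_text(text), k=k)
    return parse_predictions(labels, probs)


# Usage, assuming the file is available locally:
#   model = load_model("nordic-lid.bin")
#   print(identify(model, "Dette er en setning på norsk.", k=3))
```

Passing the loaded model into `identify` avoids reloading the file for every sentence. Asking for `k` greater than 1 can help with closely related languages, where the second candidate is often informative.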
9
+ ## `nordic-lid.bin`
10
+
11
+ Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
12
+
13
+ | ISO-639-3 | Language | Precision | Recall | F1-Score | Support |
14
+ |:-------------|:------------------|------------:|---------:|-----------:|----------:|
15
+ | dan | Danish | 0.9720 | 0.9838 | 0.9779 | 494 |
16
+ | eng | English | 0.9980 | 0.9940 | 0.9960 | 502 |
17
+ | fao | Faroese | 0.9920 | 0.9940 | 0.9930 | 499 |
18
+ | fin | Finnish | 1.0000 | 1.0000 | 1.0000 | 500 |
19
+ | isl | Icelandic | 0.9900 | 0.9920 | 0.9910 | 499 |
20
+ | nno | Norwegian Nynorsk | 0.9920 | 0.9861 | 0.9890 | 503 |
21
+ | nob | Norwegian Bokmål | 0.9840 | 0.9743 | 0.9791 | 505 |
22
+ | sma | Southern Sami | 0.9800 | 0.9703 | 0.9751 | 101 |
23
+ | sme | Northern Sami | 1.0000 | 0.9921 | 0.9960 | 504 |
24
+ | smj | Lule Sami | 0.9920 | 0.9960 | 0.9940 | 498 |
25
+ | smn | Inari Sami | 0.9950 | 1.0000 | 0.9975 | 199 |
26
+ | sms | Skolt Sami | 0.9900 | 0.9950 | 0.9925 | 199 |
27
+ | swe | Swedish | 0.9860 | 0.9920 | 0.9890 | 497 |
28
+ | Accuracy | | | | 0.9905 | 5500 |
29
+ | Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
30
+ | Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |
31
+
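The summary rows of the table above can be cross-checked from the per-language rows. The sketch below is not part of the README. It assumes the usual definitions: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over languages, and the weighted average uses each language's support as its weight. The values are copied from the `nordic-lid.bin` table. Because they are rounded to four decimals, recomputed figures can differ from the reported ones in the last digit.

```python
from typing import List, Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values: Sequence[float]) -> float:
    """Unweighted mean: every language counts equally."""
    return sum(values) / len(values)


def weighted_average(values: Sequence[float], supports: Sequence[int]) -> float:
    """Mean weighted by the number of test sentences per language."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# (ISO 639-3 code, precision, recall, F1, support), copied from the table.
ROWS: List[Tuple[str, float, float, float, int]] = [
    ("dan", 0.9720, 0.9838, 0.9779, 494),
    ("eng", 0.9980, 0.9940, 0.9960, 502),
    ("fao", 0.9920, 0.9940, 0.9930, 499),
    ("fin", 1.0000, 1.0000, 1.0000, 500),
    ("isl", 0.9900, 0.9920, 0.9910, 499),
    ("nno", 0.9920, 0.9861, 0.9890, 503),
    ("nob", 0.9840, 0.9743, 0.9791, 505),
    ("sma", 0.9800, 0.9703, 0.9751, 101),
    ("sme", 1.0000, 0.9921, 0.9960, 504),
    ("smj", 0.9920, 0.9960, 0.9940, 498),
    ("smn", 0.9950, 1.0000, 0.9975, 199),
    ("sms", 0.9900, 0.9950, 0.9925, 199),
    ("swe", 0.9860, 0.9920, 0.9890, 497),
]

if __name__ == "__main__":
    f1s = [row[3] for row in ROWS]
    recalls = [row[2] for row in ROWS]
    supports = [row[4] for row in ROWS]
    print("total support:", sum(supports))
    print("macro F1:", round(macro_average(f1s), 4))
    print("weighted F1:", round(weighted_average(f1s, supports), 4))
    # With one label per sentence, support-weighted recall equals accuracy.
    print("weighted recall:", round(weighted_average(recalls, supports), 4))
```

The gap between the macro and weighted averages stays small here because most languages have similar support. It is wider in the 159-language table, where low-resource classes with few test sentences pull the macro average down.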
32
+ ## `nordic-lid_all.bin`
33
+
34
+ Additionally trained on sentences from [Tatoeba](https://tatoeba.org/en/).
35
+
36
+ | ISO-639-3 | Language | Precision | Recall | F1-Score | Support |
37
+ |:-------------|:----------------------------|------------:|---------:|-----------:|----------:|
38
+ | afr | Afrikaans | 0.9476 | 0.9476 | 0.9476 | 191 |
39
+ | ara | Arabic | 0.9708 | 0.9472 | 0.9588 | 492 |
40
+ | arq | Algerian Arabic | 0.9478 | 0.9237 | 0.9356 | 118 |
41
+ | arz | Egyptian Arabic | 0.6316 | 0.7660 | 0.6923 | 47 |
42
+ | asm | Assamese | 0.9828 | 0.9884 | 0.9856 | 173 |
43
+ | avk | Kotava | 0.9791 | 0.9894 | 0.9842 | 189 |
44
+ | aze | Azerbaijani | 0.9707 | 0.9789 | 0.9748 | 237 |
45
+ | bel | Belarusian | 0.9892 | 0.9733 | 0.9812 | 375 |
46
+ | ben | Bengali | 0.9872 | 0.9872 | 0.9872 | 235 |
47
+ | ber | Berber | 0.8881 | 0.8388 | 0.8627 | 577 |
48
+ | bos | Bosnian | 0.1310 | 0.3333 | 0.1880 | 33 |
49
+ | bre | Breton | 0.9648 | 0.9786 | 0.9716 | 280 |
50
+ | bua | Buryat | 0.9111 | 0.9111 | 0.9111 | 45 |
51
+ | bul | Bulgarian | 0.9597 | 0.9662 | 0.9630 | 444 |
52
+ | cat | Catalan | 0.9538 | 0.9475 | 0.9507 | 305 |
53
+ | cbk | Chavacano | 0.9627 | 0.9773 | 0.9699 | 132 |
54
+ | ceb | Cebuano | 0.8205 | 0.8533 | 0.8366 | 75 |
55
+ | ces | Czech | 0.9606 | 0.9740 | 0.9672 | 500 |
56
+ | chv | Chuvash | 0.9756 | 0.9877 | 0.9816 | 81 |
57
+ | ckb | Central Kurdish (Soranî) | 0.9751 | 0.9915 | 0.9832 | 355 |
58
+ | ckt | Chukchi | 0.9615 | 1.0000 | 0.9804 | 25 |
59
+ | cmn | Mandarin Chinese | 0.9530 | 0.8743 | 0.9120 | 557 |
60
+ | cor | Cornish | 0.9945 | 0.9628 | 0.9784 | 188 |
61
+ | csb | Kashubian | 0.9574 | 1.0000 | 0.9783 | 45 |
62
+ | cym | Welsh | 0.9375 | 0.9615 | 0.9494 | 78 |
63
+ | dan | Danish | 0.9401 | 0.9363 | 0.9382 | 1005 |
64
+ | deu | German | 0.9853 | 0.9781 | 0.9817 | 549 |
65
+ | dsb | Lower Sorbian | 0.8704 | 0.8246 | 0.8468 | 57 |
66
+ | dtp | Central Dusun | 0.8881 | 0.9549 | 0.9203 | 133 |
67
+ | ell | Greek | 0.9979 | 0.9979 | 0.9979 | 475 |
68
+ | eng | English | 0.9895 | 0.9839 | 0.9867 | 1055 |
69
+ | epo | Esperanto | 0.9817 | 0.9926 | 0.9871 | 540 |
70
+ | est | Estonian | 0.9545 | 0.9711 | 0.9628 | 173 |
71
+ | eus | Basque | 0.9844 | 0.9583 | 0.9712 | 264 |
72
+ | fao | Faroese | 0.9820 | 0.9859 | 0.9840 | 498 |
73
+ | fin | Finnish | 0.9932 | 0.9780 | 0.9855 | 1045 |
74
+ | fkv | Kven Finnish | 0.6154 | 0.8889 | 0.7273 | 18 |
75
+ | fra | French | 0.9871 | 0.9908 | 0.9890 | 542 |
76
+ | frr | North Frisian | 0.9640 | 0.9710 | 0.9675 | 138 |
77
+ | fry | Frisian | 0.6774 | 0.9545 | 0.7925 | 22 |
78
+ | gcf | Guadeloupean Creole French | 0.9619 | 1.0000 | 0.9806 | 101 |
79
+ | gla | Scottish Gaelic | 0.9412 | 0.9796 | 0.9600 | 49 |
80
+ | gle | Irish | 0.9635 | 0.9778 | 0.9706 | 135 |
81
+ | glg | Galician | 0.9104 | 0.9369 | 0.9234 | 206 |
82
+ | gos | Gronings | 0.9549 | 0.9588 | 0.9569 | 243 |
83
+ | grc | Ancient Greek | 0.9828 | 0.9828 | 0.9828 | 58 |
84
+ | grn | Guarani | 0.9684 | 0.9935 | 0.9808 | 154 |
85
+ | guc | Wayuu | 0.9111 | 0.9762 | 0.9425 | 42 |
86
+ | hau | Hausa | 0.9814 | 0.9953 | 0.9883 | 425 |
87
+ | heb | Hebrew | 1.0000 | 1.0000 | 1.0000 | 536 |
88
+ | hin | Hindi | 1.0000 | 0.9974 | 0.9987 | 391 |
89
+ | hoc | Ho | 0.9429 | 0.9167 | 0.9296 | 36 |
90
+ | hrv | Croatian | 0.7447 | 0.6119 | 0.6718 | 286 |
91
+ | hrx | Hunsrik | 0.8727 | 0.9231 | 0.8972 | 52 |
92
+ | hsb | Upper Sorbian | 0.8400 | 0.8289 | 0.8344 | 76 |
93
+ | hun | Hungarian | 0.9853 | 0.9926 | 0.9889 | 539 |
94
+ | hye | Armenian | 1.0000 | 1.0000 | 1.0000 | 225 |
95
+ | ido | Ido | 0.9791 | 0.9563 | 0.9676 | 343 |
96
+ | ile | Interlingue | 0.9352 | 0.9416 | 0.9384 | 291 |
97
+ | ilo | Ilocano | 0.9917 | 0.9600 | 0.9756 | 125 |
98
+ | ina | Interlingua | 0.9558 | 0.9621 | 0.9589 | 449 |
99
+ | ind | Indonesian | 0.8526 | 0.8203 | 0.8361 | 423 |
100
+ | isl | Icelandic | 0.9863 | 0.9897 | 0.9880 | 871 |
101
+ | ita | Italian | 0.9817 | 0.9711 | 0.9764 | 553 |
102
+ | jav | Javanese | 0.9600 | 0.9600 | 0.9600 | 50 |
103
+ | jbo | Lojban | 1.0000 | 0.9926 | 0.9963 | 405 |
104
+ | jpn | Japanese | 0.9851 | 1.0000 | 0.9925 | 530 |
105
+ | kab | Kabyle | 0.8382 | 0.8959 | 0.8661 | 509 |
106
+ | kat | Georgian | 1.0000 | 0.9885 | 0.9942 | 260 |
107
+ | kaz | Kazakh | 0.9896 | 0.9845 | 0.9870 | 193 |
108
+ | kha | Khasi | 0.9038 | 0.9400 | 0.9216 | 100 |
109
+ | khm | Khmer | 1.0000 | 1.0000 | 1.0000 | 75 |
110
+ | kmr | Northern Kurdish (Kurmancî) | 0.9851 | 0.9763 | 0.9807 | 338 |
111
+ | knc | Central Kanuri | 0.9719 | 0.9886 | 0.9802 | 175 |
112
+ | kor | Korean | 0.9972 | 0.9832 | 0.9902 | 358 |
113
+ | kzj | Coastal Kadazan | 0.9615 | 0.9336 | 0.9474 | 241 |
114
+ | lad | Ladino | 0.7846 | 0.7969 | 0.7907 | 64 |
115
+ | lat | Latin | 0.9756 | 0.9639 | 0.9697 | 498 |
116
+ | lfn | Lingua Franca Nova | 0.9745 | 0.9700 | 0.9723 | 434 |
117
+ | lij | Ligurian | 0.9333 | 0.9333 | 0.9333 | 90 |
118
+ | lin | Lingala | 0.9765 | 0.9765 | 0.9765 | 213 |
119
+ | lit | Lithuanian | 0.9864 | 0.9922 | 0.9893 | 512 |
120
+ | ltz | Luxembourgish | 0.9773 | 0.9348 | 0.9556 | 46 |
121
+ | lvs | Latvian | 0.9597 | 0.9795 | 0.9695 | 146 |
122
+ | lzh | Literary Chinese | 0.7692 | 0.8046 | 0.7865 | 87 |
123
+ | mal | Malayalam | 1.0000 | 1.0000 | 1.0000 | 44 |
124
+ | mar | Marathi | 0.9961 | 1.0000 | 0.9980 | 509 |
125
+ | mhr | Meadow Mari | 0.9849 | 0.9751 | 0.9800 | 201 |
126
+ | mkd | Macedonian | 0.9572 | 0.9480 | 0.9526 | 519 |
127
+ | mon | Mongolian | 0.9708 | 0.9779 | 0.9744 | 136 |
128
+ | mus | Muskogee (Creek) | 0.9000 | 0.9643 | 0.9310 | 28 |
129
+ | mya | Burmese | 1.0000 | 0.9643 | 0.9818 | 28 |
130
+ | nds | Low German (Low Saxon) | 0.9829 | 0.9710 | 0.9769 | 414 |
131
+ | nld | Dutch | 0.9662 | 0.9772 | 0.9717 | 527 |
132
+ | nnb | Nande | 0.9870 | 0.9870 | 0.9870 | 385 |
133
+ | nno | Norwegian Nynorsk | 0.9585 | 0.9652 | 0.9619 | 575 |
134
+ | nob | Norwegian Bokmål | 0.9247 | 0.9156 | 0.9201 | 912 |
135
+ | nst | Naga (Tangshang) | 1.0000 | 1.0000 | 1.0000 | 39 |
136
+ | nus | Nuer | 0.9903 | 0.9903 | 0.9903 | 103 |
137
+ | oci | Occitan | 0.9672 | 0.9555 | 0.9613 | 247 |
138
+ | orv | Old East Slavic | 0.9692 | 0.9692 | 0.9692 | 65 |
139
+ | oss | Ossetian | 0.9818 | 0.9926 | 0.9872 | 271 |
140
+ | ota | Ottoman Turkish | 0.9204 | 0.9905 | 0.9541 | 105 |
141
+ | pam | Kapampangan | 0.9865 | 0.9865 | 0.9865 | 74 |
142
+ | pcd | Picard | 0.9552 | 0.9846 | 0.9697 | 65 |
143
+ | pes | Persian | 0.9890 | 0.9890 | 0.9890 | 455 |
144
+ | pms | Piedmontese | 0.8780 | 0.9000 | 0.8889 | 40 |
145
+ | pol | Polish | 0.9848 | 0.9829 | 0.9838 | 526 |
146
+ | por | Portuguese | 0.9687 | 0.9616 | 0.9651 | 547 |
147
+ | prg | Old Prussian | 0.9800 | 0.9800 | 0.9800 | 50 |
148
+ | rhg | Rohingya | 0.9780 | 0.9944 | 0.9861 | 179 |
149
+ | rom | Romani | 0.9302 | 0.8889 | 0.9091 | 45 |
150
+ | ron | Romanian | 0.9826 | 0.9912 | 0.9869 | 457 |
151
+ | run | Kirundi | 0.9914 | 0.9665 | 0.9788 | 239 |
152
+ | rus | Russian | 0.9634 | 0.9814 | 0.9723 | 537 |
153
+ | sah | Yakut | 1.0000 | 0.9600 | 0.9796 | 50 |
154
+ | sat | Santali | 0.9942 | 0.9942 | 0.9942 | 171 |
155
+ | sdh | Southern Kurdish | 0.9423 | 0.9074 | 0.9245 | 54 |
156
+ | shi | Tashelhit | 0.9706 | 0.8980 | 0.9329 | 147 |
157
+ | slk | Slovak | 0.9333 | 0.9380 | 0.9356 | 403 |
158
+ | slv | Slovenian | 0.7018 | 0.8889 | 0.7843 | 45 |
159
+ | sma | Southern Sami | 0.9600 | 0.9600 | 0.9600 | 100 |
160
+ | sme | Northern Sami | 0.9980 | 0.9901 | 0.9940 | 504 |
161
+ | smj | Lule Sami | 0.9820 | 0.9959 | 0.9889 | 493 |
162
+ | smn | Inari Sami | 0.9950 | 0.9900 | 0.9925 | 201 |
163
+ | sms | Skolt Sami | 0.9750 | 0.9848 | 0.9799 | 198 |
164
+ | spa | Spanish | 0.9760 | 0.9601 | 0.9680 | 551 |
165
+ | sqi | Albanian | 0.9762 | 0.9762 | 0.9762 | 126 |
166
+ | srp | Serbian | 0.8367 | 0.8216 | 0.8291 | 499 |
167
+ | swc | Congo Swahili | 0.8727 | 0.8458 | 0.8591 | 454 |
168
+ | swe | Swedish | 0.9819 | 0.9819 | 0.9819 | 994 |
169
+ | swg | Swabian | 0.9694 | 0.9406 | 0.9548 | 101 |
170
+ | swh | Swahili | 0.6798 | 0.7225 | 0.7005 | 191 |
171
+ | tat | Tatar | 0.9791 | 0.9843 | 0.9817 | 381 |
172
+ | tgl | Tagalog | 0.9757 | 0.9710 | 0.9734 | 414 |
173
+ | tha | Thai | 1.0000 | 0.9910 | 0.9955 | 222 |
174
+ | thv | Tahaggart Tamahaq | 0.6552 | 0.7037 | 0.6786 | 27 |
175
+ | tig | Tigre | 1.0000 | 1.0000 | 1.0000 | 181 |
176
+ | tlh | Klingon | 1.0000 | 0.9932 | 0.9966 | 442 |
177
+ | tok | Toki Pona | 1.0000 | 1.0000 | 1.0000 | 495 |
178
+ | tpw | Old Tupi | 0.8929 | 0.9259 | 0.9091 | 27 |
179
+ | tuk | Turkmen | 0.9779 | 0.9603 | 0.9690 | 277 |
180
+ | tur | Turkish | 0.9908 | 0.9541 | 0.9721 | 567 |
181
+ | uig | Uyghur | 0.9966 | 0.9900 | 0.9933 | 300 |
182
+ | ukr | Ukrainian | 0.9831 | 0.9831 | 0.9831 | 534 |
183
+ | urd | Urdu | 1.0000 | 0.9914 | 0.9957 | 116 |
184
+ | uzb | Uzbek | 0.8200 | 0.9318 | 0.8723 | 44 |
185
+ | vie | Vietnamese | 0.9977 | 0.9953 | 0.9965 | 427 |
186
+ | vol | Volapük | 0.9908 | 0.9908 | 0.9908 | 218 |
187
+ | war | Waray | 0.9307 | 0.9691 | 0.9495 | 97 |
188
+ | wuu | Shanghainese | 0.8318 | 0.9036 | 0.8662 | 197 |
189
+ | xal | Kalmyk | 0.9302 | 0.9524 | 0.9412 | 42 |
190
+ | xmf | Mingrelian | 0.7419 | 0.8519 | 0.7931 | 27 |
191
+ | yid | Yiddish | 0.9971 | 1.0000 | 0.9986 | 348 |
192
+ | yue | Cantonese | 0.9004 | 0.9711 | 0.9344 | 242 |
193
+ | zgh | Standard Moroccan Tamazight | 0.9873 | 0.9873 | 0.9873 | 158 |
194
+ | zlm | Malay (Vernacular) | 0.8488 | 0.8902 | 0.8690 | 82 |
195
+ | zsm | Malay | 0.7606 | 0.7883 | 0.7742 | 274 |
196
+ | zza | Zaza | 0.9294 | 0.9634 | 0.9461 | 82 |
197
+ | Accuracy | | | | 0.9591 | 44049 |
198
+ | Weighted avg | | 0.9604 | 0.9591 | 0.9595 | 44049 |
199
+ | Macro avg | | 0.9371 | 0.9474 | 0.9413 | 44049 |
200
+