Update README.md
README.md
@@ -1,3 +1,362 @@
---
license: cc-by-4.0
---

# Clean ConceptNet Data for All Languages

## Data Details

For our project on [Retrofitting GloVe embeddings for Low Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. Extraction involved several cleaning and analysis steps applied to the official ConceptNet 5.7.0 assertions dump, available [here](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).
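
The individual extraction steps are not spelled out here, so the following is only a rough sketch of one plausible first step: filtering the assertions dump by language. It relies on the dump being tab-separated with five fields (edge URI, relation, start node, end node, JSON metadata) and on node URIs of the form `/c/<lang>/<term>`. The function name `iter_language_edges`, the choice to keep only edges whose two endpoints share the language, and the toy rows are all assumptions made for illustration.

```python
import json


def iter_language_edges(lines, lang):
    """Yield (relation, start, end, weight) for edges with both nodes in `lang`.

    `lines` is any iterable of tab-separated assertion rows (hypothetical helper,
    not the project's actual extraction code).
    """
    prefix = f"/c/{lang}/"
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip malformed rows
        _uri, relation, start, end, info = fields
        # Assumption: keep an edge only if both endpoints are in the target language.
        if start.startswith(prefix) and end.startswith(prefix):
            weight = json.loads(info).get("weight", 1.0)
            yield relation, start, end, weight


# Toy rows written by hand in the dump's column layout (not real ConceptNet data).
sample_rows = [
    '/a/[/r/RelatedTo/,/c/en/cat/,/c/en/pet/]\t/r/RelatedTo\t/c/en/cat\t/c/en/pet\t{"weight": 2.0}\n',
    '/a/[/r/Synonym/,/c/de/katze/,/c/en/cat/]\t/r/Synonym\t/c/de/katze\t/c/en/cat\t{"weight": 1.0}\n',
]
english_edges = list(iter_language_edges(sample_rows, "en"))
print(english_edges)
```

For the real dump, the same function could be fed from `gzip.open(path, "rt", encoding="utf-8")`, which streams the compressed file line by line without unpacking it first.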

The final extracted dataset, available in another [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: PPMI was computed over the word co-occurrence statistics, and SVD was then applied to the resulting PPMI matrix.
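
The exact counting scheme behind the co-occurrence statistics is not given here, so the sketch below only illustrates the PPMI step on hand-made counts. `ppmi_matrix` and `toy_counts` are illustrative names, and treating graph edges as (word, context) pairs is an assumption.

```python
import math
from collections import Counter


def ppmi_matrix(pair_counts):
    """Compute positive PMI for each (word, context) pair from raw counts.

    `pair_counts` maps (word, context) -> positive count. Pairs that never
    co-occur are simply absent and are treated as having PPMI 0.
    """
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), count in pair_counts.items():
        word_totals[word] += count
        context_totals[context] += count

    ppmi = {}
    for (word, context), count in pair_counts.items():
        # PMI = log( P(w, c) / (P(w) * P(c)) ), rewritten in terms of raw counts.
        pmi = math.log(count * total / (word_totals[word] * context_totals[context]))
        # "Positive" PMI clips negative associations to zero.
        ppmi[(word, context)] = max(0.0, pmi)
    return ppmi


# Toy symmetric counts standing in for real co-occurrence statistics.
toy_counts = {
    ("cat", "animal"): 4,
    ("animal", "cat"): 4,
    ("cat", "pet"): 2,
    ("pet", "cat"): 2,
    ("animal", "pet"): 1,
    ("pet", "animal"): 1,
}
toy_ppmi = ppmi_matrix(toy_counts)
for pair, value in sorted(toy_ppmi.items()):
    print(pair, round(value, 3))
```

In a full pipeline, the resulting sparse matrix would then be factorized with a truncated SVD (for example `numpy.linalg.svd` on a small dense matrix, or a sparse solver for large vocabularies), keeping the leading singular directions as the dense word vectors.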

We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.

### Dataset Structure

Each file is a plain-text (`.txt`) file. Every line contains a word or phrase followed by its embedding, with all values separated by single spaces.

Use the following function to read the embeddings:

```python
import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # Skip blank or malformed lines that cannot hold a phrase plus a full vector.
            if len(parts) <= embedding_size:
                continue
            # The last `embedding_size` fields are the vector; everything before
            # them is the phrase, which may itself contain spaces.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```
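
Once loaded, the embeddings form a dictionary from phrases to vectors, so a typical next step is a similarity lookup. The sketch below shows one way to do this with cosine similarity. `cosine_similarity`, `nearest_neighbors`, and the 3-dimensional toy vectors are illustrative assumptions; the helpers use only plain Python arithmetic so they accept both lists and NumPy arrays.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists or NumPy arrays)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, query, top_k=3):
    """Return the top_k (phrase, similarity) pairs closest to `query`, excluding it."""
    query_vec = embeddings[query]
    scored = [
        (phrase, cosine_similarity(query_vec, vec))
        for phrase, vec in embeddings.items()
        if phrase != query
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for a loaded embedding file.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(nearest_neighbors(toy_embeddings, "cat", top_k=2))
```

With real data, `toy_embeddings` would be replaced by the dictionary returned from the reader function. For large vocabularies, stacking the vectors into a single matrix and using vectorized operations would be considerably faster than this loop.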

### Dataset Details

| Language Code | Vocabulary Size |
| --- | ------- |
| ab | 252 |
| adx | 549 |
| ae | 192 |
| af | 12973 |
| ang | 9788 |
| ar | 75684 |
| arc | 1688 |
| arn | 1181 |
| ast | 27485 |
| av | 172 |
| az | 13277 |
| ba | 4250 |
| bal | 370 |
| be | 14871 |
| bg | 171740 |
| bm | 2422 |
| bn | 7306 |
| bo | 2127 |
| br | 11665 |
| ca | 82706 |
| ce | 2311 |
| ceb | 18882 |
| chk | 724 |
| cim | 889 |
| cop | 1071 |
| crh | 2449 |
| cs | 77422 |
| csb | 602 |
| cu | 7526 |
| cy | 13243 |
| da | 46600 |
| de | 500260 |
| dsb | 3993 |
| ee | 571 |
| egl | 854 |
| egx | 1890 |
| egy | 447 |
| el | 39667 |
| en | 941858 |
| enm | 17286 |
| eo | 91074 |
| es | 646097 |
| et | 20088 |
| eu | 41427 |
| fa | 46736 |
| fi | 259852 |
| fil | 16165 |
| fj | 209 |
| fo | 10513 |
| fr | 1449790 |
| frm | 4472 |
| fro | 14493 |
| frp | 2799 |
| frr | 476 |
| fur | 2295 |
| fy | 7608 |
| ga | 29459 |
| gag | 505 |
| gd | 14418 |
| gl | 52824 |
| gml | 177 |
| got | 2982 |
| grc | 25689 |
| gu | 4427 |
| gv | 6812 |
| haw | 1371 |
| hbo | 2898 |
| he | 27283 |
| hi | 18363 |
| hil | 1414 |
| hsb | 25778 |
| ht | 2699 |
| hu | 65163 |
| hy | 23434 |
| ia | 5728 |
| io | 21076 |
| is | 40287 |
| ist | 422 |
| it | 548767 |
| iu | 1871 |
| ja | 283049 |
| ka | 25014 |
| khb | 297 |
| ki | 1374 |
| kjh | 482 |
| kk | 13700 |
| kl | 1427 |
| km | 3466 |
| ko | 30616 |
| koy | 205 |
| ku | 9737 |
| kw | 1797 |
| ky | 3574 |
| la | 848943 |
| lad | 1453 |
| lb | 10863 |
| li | 485 |
| lij | 1331 |
| lld | 4884 |
| lmo | 2109 |
| ln | 4109 |
| lo | 1422 |
| lt | 21184 |
| lv | 30059 |
| mdf | 2086 |
| mg | 26575 |
| mga | 178 |
| mi | 945 |
| mk | 28935 |
| mn | 6740 |
| ms | 88416 |
| mt | 2006 |
| mul | 16034 |
| mwl | 1302 |
| my | 4875 |
| myv | 642 |
| nap | 1506 |
| nci | 3358 |
| nds | 5192 |
| nl | 138580 |
| no | 94946 |
| nog | 450 |
| non | 4079 |
| nov | 649 |
| nrf | 9724 |
| nv | 6333 |
| oc | 22113 |
| oge | 438 |
| osp | 458 |
| ota | 834 |
| pal | 256 |
| pcd | 1424 |
| pi | 1828 |
| pjt | 364 |
| pl | 139396 |
| ppl | 268 |
| pro | 2798 |
| ps | 1087 |
| pt | 248669 |
| rm | 3919 |
| ro | 36206 |
| rom | 552 |
| ru | 424944 |
| rue | 200 |
| rup | 3079 |
| rw | 355 |
| sa | 5789 |
| scn | 4749 |
| sco | 8537 |
| se | 68758 |
| ses | 3095 |
| sga | 2913 |
| sh | 57974 |
| sk | 21657 |
| sl | 89210 |
| sm | 588 |
| so | 593 |
| sq | 16262 |
| stq | 1237 |
| su | 2514 |
| sv | 133965 |
| sw | 9131 |
| swb | 672 |
| syc | 2855 |
| szl | 237 |
| ta | 9064 |
| te | 18707 |
| tg | 2937 |
| th | 94281 |
| tk | 815 |
| tpi | 1511 |
| tpw | 270 |
| tr | 38490 |
| tt | 4676 |
| ty | 293 |
| tyv | 337 |
| ug | 998 |
| uk | 27682 |
| ur | 8476 |
| uz | 5224 |
| vec | 5555 |
| vep | 2867 |
| vi | 37433 |
| vo | 8277 |
| vot | 489 |
| wa | 1956 |
| wau | 184 |
| wo | 1196 |
| wym | 1330 |
| xcl | 16182 |
| yi | 8054 |
| yua | 735 |
| za | 473 |
| zh | 274080 |
| zza | 621 |
| abe | 185 |
| ady | 3807 |
| ain | 298 |
| akk | 313 |
| akz | 151 |
| alt | 289 |
| an | 4457 |
| axm | 350 |
| ccc | 445 |
| ch | 174 |
| chl | 528 |
| cho | 155 |
| chr | 1087 |
| cic | 699 |
| cjs | 306 |
| cv | 2892 |
| dlm | 1091 |
| dum | 2040 |
| esu | 227 |
| ff | 215 |
| gmh | 217 |
| gn | 131 |
| goh | 2002 |
| gsw | 2336 |
| ha | 802 |
| hit | 221 |
| ie | 637 |
| ii | 51 |
| ilo | 442 |
| jv | 4919 |
| kbd | 762 |
| kn | 3415 |
| krl | 637 |
| liv | 569 |
| lkt | 682 |
| ltg | 139 |
| lzz | 127 |
| mch | 384 |
| mh | 200 |
| ml | 6750 |
| mr | 5545 |
| na | 200 |
| nah | 1612 |
| nan | 486 |
| ne | 4224 |
| nhn | 269 |
| nmn | 313 |
| odt | 365 |
| ofs | 345 |
| oj | 587 |
| or | 109 |
| orv | 199 |
| os | 4481 |
| osx | 1848 |
| pa | 4488 |
| pap | 3612 |
| peo | 184 |
| pms | 2857 |
| qu | 5156 |
| raj | 190 |
| rap | 313 |
| sah | 2695 |
| sc | 573 |
| sd | 143 |
| si | 2062 |
| smn | 511 |
| sms | 493 |
| srn | 1249 |
| sux | 785 |
| tet | 361 |
| twf | 527 |
| txb | 588 |
| uga | 573 |
| war | 12987 |
| xh | 2504 |
| xmf | 149 |
| xpr | 98 |
| xwo | 456 |
| yo | 2283 |
| zu | 2758 |
| co | 1474 |
| prg | 480 |
| aii | 345 |
| am | 1909 |
| bi | 92 |
| dv | 117 |
| kim | 388 |
| krc | 460 |
| kum | 505 |
| ti | 292 |
| udm | 306 |
| xto | 121 |
| zdj | 58 |
| dak | 879 |
| frk | 1 |
| oma | 748 |
| shh | 185 |
| aa | 725 |
| dje | 338 |
| hke | 246 |
| qya | 180 |
| st | 102 |
| wae | 437 |
| xno | 274 |
| dua | 317 |
| fon | 805 |
| hak | 4 |
| jbo | 32 |

### Licensing Information

This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet.io.

### Citation Information

```
@paper{speer2017conceptnet,
  author = {Robyn Speer and Joshua Chin and Catherine Havasi},
  title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
  conference = {AAAI Conference on Artificial Intelligence},
  year = {2017},
  pages = {4444--4451},
  keywords = {ConceptNet; knowledge graph; word embeddings},
  url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```