---
license: cc-by-4.0
---

# Clean ConceptNet Data for All Languages

## Data Details

For our project on [Retrofitting GloVe embeddings for Low Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction process involved several steps to clean and analyze the data from the official ConceptNet dump available [here](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).

The final extracted dataset, available in another [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train graph embeddings by computing PPMI co-occurrence statistics between words and then applying SVD to them. We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.

### Dataset Structure

Each file is a plain-text (`.txt`) file in which every line holds a word or phrase followed by its embedding, with all fields separated by spaces.
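
To make this layout concrete, here is a minimal sketch that parses a single line. The sample line, the toy 4-dimensional vector, and the helper name `split_phrase_and_vector` are hypothetical and chosen only for illustration; the sketch assumes a phrase may itself contain spaces, so the line is split from the right.

```python
def split_phrase_and_vector(line, embedding_size):
    """Split one line into (phrase, vector); hypothetical helper for illustration."""
    # rsplit with maxsplit=embedding_size cuts only at the last
    # `embedding_size` spaces, so spaces inside the phrase are preserved.
    parts = line.strip().rsplit(' ', embedding_size)
    phrase = parts[0]
    vector = [float(value) for value in parts[1:]]
    return phrase, vector


# Hypothetical sample line: a two-word phrase and a toy 4-dimensional vector.
sample_line = "ice cream 0.12 -0.50 0.33 0.08\n"
phrase, vector = split_phrase_and_vector(sample_line, embedding_size=4)
print(phrase)
print(vector)
```

Splitting from the right keeps the phrase intact without needing to know how many words it contains, as long as the embedding dimensionality is known.
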
Use the following function to read in the embeddings:

```python
import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` fields are the vector; everything
            # before them is the (possibly multi-word) phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```

### Language Details

| Language Code | Language Name | Vocabulary Size |
| --- | --- | --- |
| af | Afrikaans | 12973 |
| sc | Sardinian | 573 |
| yo | Yoruba | 2283 |
| gn | Guarani | 131 |
| qu | Quechua | 5156 |
| li | Limburgish | 485 |
| ln | Lingala | 4109 |
| wo | Wolof | 1196 |
| zu | Zulu | 2758 |
| rm | Romansh | 3919 |
| ht | Haitian Creole | 2699 |
| su | Sundanese | 2514 |
| br | Breton | 11665 |
| gd | Scottish Gaelic | 14418 |
| xh | Xhosa | 2504 |
| mg | Malagasy | 26575 |
| jv | Javanese | 4919 |
| fy | Frisian | 7608 |
| sa | Sanskrit | 5789 |
| my | Burmese | 4875 |
| ug | Uyghur | 998 |
| yi | Yiddish | 8054 |
| or | Oriya | 109 |
| ha | Hausa | 802 |
| la | Latin | 848943 |
| sd | Sindhi | 143 |
| so | Somali | 593 |
| ku | Kurdish | 9737 |
| pa | Punjabi | 4488 |
| ps | Pashto | 1087 |
| ga | Irish | 29459 |
| am | Amharic | 1909 |
| km | Khmer | 3466 |
| uz | Uzbek | 5224 |
| ky | Kyrgyz | 3574 |
| cy | Welsh | 13243 |
| gu | Gujarati | 4427 |
| eo | Esperanto | 91074 |
| sw | Swahili | 9131 |
| mr | Marathi | 5545 |
| kn | Kannada | 3415 |
| ne | Nepali | 4224 |
| mn | Mongolian | 6740 |
| si | Sinhala | 2062 |
| te | Telugu | 18707 |
| be | Belarusian | 14871 |
| mk | Macedonian | 28935 |
| gl | Galician | 52824 |
| hy | Armenian | 23434 |
| is | Icelandic | 40287 |
| ml | Malayalam | 6750 |
| bn | Bengali | 7306 |
| ur | Urdu | 8476 |
| kk | Kazakh | 13700 |
| ka | Georgian | 25014 |
| az | Azerbaijani | 13277 |
| sq | Albanian | 16262 |
| ta | Tamil | 9064 |
| et | Estonian | 20088 |
| lv | Latvian | 30059 |
| ms | Malay | 88416 |
| sl | Slovenian | 89210 |
| lt | Lithuanian | 21184 |
| he | Hebrew | 27283 |
| sk | Slovak | 21657 |
| el | Greek | 39667 |
| th | Thai | 94281 |
| bg | Bulgarian | 171740 |
| da | Danish | 46600 |
| uk | Ukrainian | 27682 |
| ro | Romanian | 36206 |

### Licensing Information

This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from http://conceptnet.io.

### Citation Information

```
@misc{gurgurov2024lowremrepositorywordembeddings,
  title={LowREm: A Repository of Word Embeddings for 87 Low-Resource Languages Enhanced with Multilingual Graph Knowledge},
  author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
  year={2024},
  eprint={2409.18193},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.18193},
}

@paper{speer2017conceptnet,
  author = {Robyn Speer and Joshua Chin and Catherine Havasi},
  title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
  conference = {AAAI Conference on Artificial Intelligence},
  year = {2017},
  pages = {4444--4451},
  keywords = {ConceptNet; knowledge graph; word embeddings},
  url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```