Are the languages that are supported documented anywhere?

#1
by Jesse-marqo - opened

Hi,

Just wondering if the languages this model supports are documented anywhere? I see two papers, the SigLIP paper and PaLI.

I can find a reference to 109 languages but can't find the full list. Any help is appreciated, and apologies if I have missed it.

Thanks

PyTorch Image Models org
edited 5 days ago

@Jesse-marqo afaik, the tokenizer used is the one from mT5 (https://github.com/google-research/big_vision/blob/46b2456f54b9d4f829d1925b78943372b376153d/big_vision/pp/ops_text.py#L50). mT5 was trained on mC4, which covers 101 languages:

'
mT5 is pretrained on the mC4 corpus, covering 101 languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
'
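
If it helps, here is a minimal sketch for checking a set of target languages against the list quoted above. The helper names (`parse_languages`, `unsupported`) are made up for illustration and are not part of timm or big_vision. It only tells you whether a language appears in the mT5 pretraining list, not how well the model performs on it.

```python
# Hypothetical helpers for illustration; not part of timm or big_vision.
# The text below is the mT5 language list as quoted in this thread.
MT5_LANGUAGES_TEXT = """
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque,
Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa,
Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian,
Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati,
Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian,
Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada,
Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian,
Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori,
Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish,
Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian,
Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish,
Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish,
Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish,
Yoruba, Zulu.
"""


def parse_languages(text):
    """Split a comma-separated language list into clean names."""
    body = text.strip().rstrip(".")
    return [name.strip() for name in body.split(",") if name.strip()]


def unsupported(wanted, supported):
    """Return the wanted languages missing from the supported list (case-insensitive)."""
    known = {name.casefold() for name in supported}
    return [name for name in wanted if name.casefold() not in known]


languages = parse_languages(MT5_LANGUAGES_TEXT)
print(len(languages))
# "Klingon" is just a placeholder for a language that is not in the list.
print(unsupported(["German", "Klingon", "zulu"], languages))
```

The match is on the English language names exactly as written in the list, so alternative spellings or language codes would need their own mapping.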

WebLI claims 109 languages but doesn't specify which ones. Maybe some tokens are shared across languages, so 101 vs 109 isn't an issue, but I'm not sure.

Great, thanks!
