timm/ViT-B-16-SigLIP-i18n-256 · Are the languages that are supported documented anywhere?

6 days ago

Hi,

Just wondering whether the languages this model supports are documented anywhere? I see two papers, the SigLIP paper and PaLI.

I can find references to 109 languages but can't find the full list. Any help is appreciated, and apologies if I have missed it.

Thanks

Jesse-marqo changed discussion title from Are the actual languages that are supported documented anywhere? to Are the languages that are supported documented anywhere? 6 days ago

rwightman

PyTorch Image Models org 5 days ago

edited 5 days ago

@Jesse-marqo afaik, the tokenizer used is the mT5 one (https://github.com/google-research/big_vision/blob/46b2456f54b9d4f829d1925b78943372b376153d/big_vision/pp/ops_text.py#L50). mT5 was trained on mC4, which covers 101 languages:

'
mT5 is pretrained on the mC4 corpus, covering 101 languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
'

WebLI claims 109 languages but doesn't specify which. Subword tokens may be shared across languages, so the 101 vs. 109 mismatch might not be an issue, but I'm not sure.
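
If it helps, here is a minimal sketch for checking a language name against the list quoted above. The names `MC4_LANGUAGES`, `parse_languages` and `is_mc4_language` are made up for illustration and are not part of timm, open_clip or big_vision. A match only says that the language appears in the mC4 list; it says nothing about how well the model handles that language, and it doesn't cover whatever extra languages WebLI counts.

```python
# Language names copied from the mC4 list quoted above (101 entries).
MC4_LANGUAGES = """
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque,
Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa,
Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian,
Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati,
Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian,
Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada,
Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian,
Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori,
Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish,
Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian,
Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish,
Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish,
Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish,
Yoruba, Zulu
"""


def normalize(name: str) -> str:
    """Collapse whitespace and ignore case so 'haitian  creole' still matches."""
    return " ".join(name.split()).casefold()


def parse_languages(raw: str) -> set:
    """Split the comma-separated list into a set of normalized names."""
    return {normalize(part) for part in raw.split(",") if part.strip()}


MC4_SET = parse_languages(MC4_LANGUAGES)


def is_mc4_language(name: str) -> bool:
    """Return True if the (English) language name is in the quoted mC4 list."""
    return normalize(name) in MC4_SET


if __name__ == "__main__":
    print(len(MC4_SET))
    for candidate in ("Welsh", "haitian creole", "Tibetan"):
        print(candidate, is_mc4_language(candidate))
```

The lookup is by English name only, so alternative spellings or ISO codes would need their own mapping.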

Jesse-marqo

5 days ago

Great, thanks!