---
license: cc-by-nc-4.0
language:
- ab
- af
- am
- ar
- as
- az
- ba
- be
- bn
- bo
- bs
- br
- bg
- ca
- cs
- cv
- cy
- da
- de
- dv
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- tl
- fi
- fr
- fy
- ga
- gl
- gv
- gn
- gu
- ht
- ha
- he
- hi
- hr
- hu
- hy
- ig
- ia
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- km
- rw
- ky
- ku
- ko
- lo
- la
- lv
- ln
- lt
- lb
- lg
- ml
- mr
- mk
- mg
- mt
- mn
- mi
- ms
- my
- ne
- nl
- nn
- no
- oc
- or
- pa
- pl
- pt
- ps
- ro
- ru
- sa
- si
- sl
- sk
- sn
- sd
- so
- st
- es
- sq
- sc
- sr
- su
- sw
- sv
- ta
- tt
- te
- tg
- th
- tn
- tk
- tr
- tw
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- yo
- zh
---

## mHuBERT-147 models

The mHuBERT-147 models are compact, competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.

This repository contains:
* Fairseq checkpoint (original);
* HuggingFace checkpoint;
* Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).

# Additional Information

**Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest

Please note that CommonVoice removal requests have been made since training, so some of the listed files are no longer available.

**Fairseq fork:** https://github.com/utter-project/fairseq

**Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts

**Languages present but not indexed by Hugging Face:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).

# Datasets Included

For ASR/ST/TTS datasets, only the train set is used.
* [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/)
* [BibleTTS](https://www.openslr.org/129/)
* [ClovaCall](https://github.com/clovaai/ClovaCall)
* [CommonVoice v11](https://commonvoice.mozilla.org/en/datasets)
* Google TTS data: [Javanese](https://www.openslr.org/41/), [Khmer](https://www.openslr.org/42/), [Nepali](https://www.openslr.org/43/), [Sundanese](https://www.openslr.org/44/), [South African Languages](https://www.openslr.org/32/), [Bengali Languages](https://www.openslr.org/37/)
* IISc-MILE: [Tamil](https://www.openslr.org/127/), [Kannada](https://www.openslr.org/126/)
* [Japanese Versatile Speech](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus)
* [Kokoro](https://github.com/kaiidams/Kokoro-Speech-Dataset)
* [Kosp2e](https://github.com/warnikchow/kosp2e)
* Media Speech: [Turkish Only](https://www.openslr.org/108/)
* [Multilingual LibriSpeech](https://www.openslr.org/94/)
* [Samrómur](https://www.openslr.org/128/)
* [THCHS-30](https://www.openslr.org/18/) and [THUYG-20](https://www.openslr.org/22/)
* [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
* [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)

# Citing

```
@inproceedings{boito2024mhubert,
  author={Marcely Zanon Boito and Vivek Iyer and Nikolaos Lagos and Laurent Besacier and Ioan Calapodescu},
  title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
  year=2024,
  booktitle={Interspeech 2024},
}
```

# Funding

This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) under grant number 101070631.

For more information, see https://he-utter.eu/
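
# Reading the Faiss index string

The repository contents list a Faiss index described by the factory string `OPQ16_64,IVF1000_HNSW32,PQ16x4fsr`. The sketch below is only a reading aid: it splits that string into its three stages using the standard library, without requiring `faiss`. The helper name `describe_factory`, the dictionary keys, and the interpretation of each token (OPQ transform, IVF lists with an HNSW coarse quantizer, 4-bit fast-scan PQ with residual encoding) are assumptions based on the general Faiss `index_factory` conventions, not something stated by the authors.

```python
import re

# Factory string quoted in this model card.
FACTORY = "OPQ16_64,IVF1000_HNSW32,PQ16x4fsr"


def describe_factory(factory):
    """Split a Faiss factory string into labelled stages.

    Hypothetical helper: it recognises only the three token shapes that
    appear in this card and raises ValueError on anything else.
    """
    stages = []
    for token in factory.split(","):
        if m := re.fullmatch(r"OPQ(\d+)_(\d+)", token):
            # Assumed meaning: OPQ rotation over M sub-vectors, reduced to D dims.
            stages.append({"stage": "opq_transform",
                           "sub_vectors": int(m[1]),
                           "output_dim": int(m[2])})
        elif m := re.fullmatch(r"IVF(\d+)_HNSW(\d+)", token):
            # Assumed meaning: inverted file with N lists, HNSW coarse quantizer.
            stages.append({"stage": "ivf_hnsw",
                           "n_lists": int(m[1]),
                           "hnsw_m": int(m[2])})
        elif m := re.fullmatch(r"PQ(\d+)x(\d+)(fsr|fs)?", token):
            # Assumed meaning: product quantizer, "fs" = fast scan,
            # trailing "r" = codes computed on residuals.
            suffix = m[3] or ""
            stages.append({"stage": "product_quantizer",
                           "sub_quantizers": int(m[1]),
                           "bits_per_code": int(m[2]),
                           "fast_scan": suffix.startswith("fs"),
                           "by_residual": suffix == "fsr"})
        else:
            raise ValueError(f"unrecognised factory token: {token!r}")
    return stages


if __name__ == "__main__":
    for stage in describe_factory(FACTORY):
        print(stage)
```

With the `faiss` package installed, a saved index is typically loaded with `faiss.read_index(path)`; the file name of the index inside this repository is not given here, so any path would be a placeholder.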