|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- ab |
|
- af |
|
- am |
|
- ar |
|
- as |
|
- az |
|
- ba |
|
- be |
|
- bn |
|
- bo |
|
- bs |
|
- br |
|
- bg |
|
- ca |
|
- cs |
|
- cv |
|
- cy |
|
- da |
|
- de |
|
- dv |
|
- el |
|
- en |
|
- eo |
|
- et |
|
- eu |
|
- ee |
|
- fo |
|
- fa |
|
- tl |
|
- fi |
|
- fr |
|
- fy |
|
- ga |
|
- gl |
|
- gv |
|
- gn |
|
- gu |
|
- ht |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- hu |
|
- hy |
|
- ig |
|
- ia |
|
- id |
|
- is |
|
- it |
|
- jv |
|
- ja |
|
- kn |
|
- ka |
|
- kk |
|
- km |
|
- rw |
|
- ky |
|
- ku |
|
- ko |
|
- lo |
|
- la |
|
- lv |
|
- ln |
|
- lt |
|
- lb |
|
- lg |
|
- ml |
|
- mr |
|
- mk |
|
- mg |
|
- mt |
|
- mn |
|
- mi |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- nn |
|
- no |
|
- oc |
|
- or |
|
- pa |
|
- pl |
|
- pt |
|
- ps |
|
- ro |
|
- ru |
|
- sa |
|
- si |
|
- sl |
|
- sk |
|
- sn |
|
- sd |
|
- so |
|
- st |
|
- es |
|
- sq |
|
- sc |
|
- sr |
|
- su |
|
- sw |
|
- sv |
|
- ta |
|
- tt |
|
- te |
|
- tg |
|
- th |
|
- tn |
|
- tk |
|
- tr |
|
- tw |
|
- ug |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- xh |
|
- yi |
|
- yo |
|
- zh |
|
--- |
|
|
|
## mHuBERT-147 models |
|
|
|
The mHuBERT-147 models are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data covering 147 languages.
|
|
|
This repository contains: |
|
* Fairseq checkpoint (original);

* HuggingFace checkpoint (a minimal loading sketch follows this list);

* Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
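Below is a minimal loading sketch, assuming the HuggingFace checkpoint works with the standard `transformers` HuBERT classes; the repository id is taken from the manifest link below, and the waveform is only a placeholder (the model expects 16 kHz mono audio).

```python
# Minimal sketch (assumption: the HuggingFace checkpoint in this repository
# loads with the standard transformers HuBERT classes).
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("utter-project/mHuBERT-147-base-3rd-iter")
model.eval()

# 16 kHz mono waveform; one second of silence as a stand-in for real audio.
input_values = torch.zeros(1, 16000)

with torch.no_grad():
    hidden_states = model(input_values).last_hidden_state

# One 768-dimensional vector per ~20 ms frame.
print(hidden_states.shape)
```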
|
|
|
|
|
# Additional Information |
|
|
|
|
|
**Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest |
|
|
|
Please note that CommonVoice removal requests have been received since training, so some of the listed files are no longer available.
|
|
|
**Fairseq fork:** https://github.com/utter-project/fairseq |
|
|
|
**Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts |
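As a rough illustration of how the released Faiss index can be queried, here is a hedged sketch of assigning discrete pseudo-labels to frame-level features; the exact pre-processing and label-extraction recipe lives in the scripts repository above, and the index filename and feature array below are placeholders.

```python
# Hedged sketch: nearest-neighbour search against the released Faiss index
# (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr). The returned ids act as discrete
# targets for HuBERT-style continuous pre-training. The index path and the
# feature array are hypothetical placeholders.
import faiss
import numpy as np

index = faiss.read_index("mhubert147_faiss.index")

# Frame-level features (e.g. from an intermediate HuBERT layer); the feature
# dimension must match the one the index was trained on.
features = np.random.rand(100, index.d).astype("float32")

# k=1 search: take the id of the closest entry as the pseudo-label per frame.
_, labels = index.search(features, 1)
labels = labels.squeeze(1)
print(labels[:10])
```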
|
|
|
**Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
|
|
|
|
|
# Datasets Included |
|
|
|
For ASR/ST/TTS datasets, only the train set is used.
|
* [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/) |
|
* [BibleTTS](https://www.openslr.org/129/) |
|
* [ClovaCall](https://github.com/clovaai/ClovaCall) |
|
* [CommonVoice v11](https://commonvoice.mozilla.org/en/datasets) |
|
* Google TTS data: [Javanese](https://www.openslr.org/41/), [Khmer](https://www.openslr.org/42/), [Nepali](https://www.openslr.org/43/), [Sundanese](https://www.openslr.org/44/), [South African Languages](https://www.openslr.org/32/), [Bengali Languages](https://www.openslr.org/37/) |
|
* IISc-MILE: [Tamil](https://www.openslr.org/127/), [Kannada](https://www.openslr.org/126/) |
|
* [Japanese Versatile Speech](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus) |
|
* [Kokoro](https://github.com/kaiidams/Kokoro-Speech-Dataset) |
|
* [Kosp2e](https://github.com/warnikchow/kosp2e) |
|
* Media Speech: [Turkish Only](https://www.openslr.org/108/) |
|
* [Multilingual LibriSpeech](https://www.openslr.org/94/) |
|
* [Samrómur](https://www.openslr.org/128/) |
|
* [THCHS-30](https://www.openslr.org/18/) and [THUYG-20](https://www.openslr.org/22/) |
|
* [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/) |
|
* [VoxPopuli](https://github.com/facebookresearch/voxpopuli/) |
|
|
|
|
|
# Citing |
|
|
|
``` |
|
@inproceedings{boito2024mhubert,
  author={Marcely Zanon Boito and Vivek Iyer and Nikolaos Lagos and Laurent Besacier and Ioan Calapodescu},
  title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
  year={2024},
  booktitle={Interspeech 2024},
}
|
``` |
|
|
|
|
|
# Funding |
|
|
|
This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) under grant number 101070631. |
|
For more information, visit https://he-utter.eu/