mzboito committed on
Commit 00ae5f4
1 Parent(s): f288cec

Update README.md

Files changed (1):
  1. README.md +22 -12
README.md CHANGED
@@ -123,21 +123,30 @@ language:
  - zh
  ---

- This repository contains the files for the 3rd iteration, base architecture, multilingual HuBERT model.

- ## mHuBERT-147 models

- mHuBERT-147 are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.

- This repository contains:
  * Fairseq checkpoint (original);
  * HuggingFace checkpoint;
  * Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).

- # Additional Information
-

  **Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest

@@ -147,10 +156,12 @@ Please note that since training, there were CommonVoice removal requests. This m

  **Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts

- **Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb) Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akwapen Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).

- # Datasets Included

  For ASR/ST/TTS datasets, only train set is used.
  * [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/)
@@ -169,8 +180,10 @@ For ASR/ST/TTS datasets, only train set is used.
  * [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
  * [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)

- # Citing

  ```
  @inproceedings{boito2024mhubert,
@@ -181,9 +194,6 @@ booktitle={Interspeech 2024},
  }
  ```

-
- # Funding
-
  <img src="https://cdn-uploads.huggingface.co/production/uploads/62262e19d36494a6f743a28d/HbzC1C-uHe25ewTy2wyoK.png" width=7% height=7%>
  This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) funded by European Union’s Horizon Europe Research and Innovation programme under grant agreement number 101070631.

 
  - zh
  ---

+ # Table of Contents:
+ 1. [Summary](https://huggingface.co/utter-project/mHuBERT-147#mhubert-147-models)
+ 2. [Training Data and Code](https://huggingface.co/utter-project/mHuBERT-147#Training)
+ 3. [ML-SUPERB Scores](https://huggingface.co/utter-project/mHuBERT-147#ML-SUPERB-Scores)
+ 4. [Languages and Datasets](https://huggingface.co/utter-project/mHuBERT-147#Languages-and-Datasets)
+ 5. [Citing and Funding Information](https://huggingface.co/utter-project/mHuBERT-147#Citing-and-Funding-Information)
+
+ # mHuBERT-147 models

+ mHuBERT-147 are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
+ Different from *traditional* HuBERTs, mHuBERT-147 models are trained using faiss IVF discrete speech units.
+ Training employs two-level up-sampling across languages and data sources. See our paper for more information.
+
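To make the two-level up-sampling idea concrete, here is a minimal, hypothetical sketch that re-weights sampling probabilities first per language and then per data source within each language. The helper name and exponent values are placeholders; the exact procedure and hyperparameters used for mHuBERT-147 are described in the paper.

```python
from collections import defaultdict

def two_level_sampling_probs(hours, lang_beta=0.7, source_beta=0.7):
    """Toy two-level up-sampling over a dict {(language, source): hours}.

    Low-resource languages, and low-resource sources within a language,
    are boosted by raising hour counts to an exponent < 1. The exponents
    here are illustrative, not the values used for mHuBERT-147.
    """
    lang_hours = defaultdict(float)
    for (lang, _source), h in hours.items():
        lang_hours[lang] += h

    # Level 1: language-level weights.
    lang_w = {l: h ** lang_beta for l, h in lang_hours.items()}
    lang_z = sum(lang_w.values())

    probs = {}
    for lang in lang_hours:
        # Level 2: data-source weights within this language.
        src_w = {s: h ** source_beta for (l, s), h in hours.items() if l == lang}
        src_z = sum(src_w.values())
        for src, w in src_w.items():
            probs[(lang, src)] = (lang_w[lang] / lang_z) * (w / src_z)
    return probs

# Example: English is abundant, Swahili is not; Swahili gets up-sampled.
print(two_level_sampling_probs({
    ("en", "voxpopuli"): 500.0,
    ("en", "commonvoice"): 100.0,
    ("sw", "commonvoice"): 5.0,
}))
```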
+ **This repository contains:**
  * Fairseq checkpoint (original);
  * HuggingFace checkpoint;
  * Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).

+ **Model details:** 3rd iteration, base architecture, 147 languages.
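As a usage sketch for the HuggingFace checkpoint, the snippet below loads the model with transformers and extracts frame-level features. The repository id, the audio file name, and the presence of a bundled feature-extractor config are assumptions to verify against the model card; HuBERT-style models expect 16 kHz mono input.

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

repo_id = "utter-project/mHuBERT-147"  # assumed repo id; adjust to the checkpoint you use

feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id)
model = HubertModel.from_pretrained(repo_id).eval()

# Load an utterance and resample to 16 kHz mono.
wav, sr = torchaudio.load("example.wav")  # placeholder audio file
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)

inputs = feature_extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations: (batch, n_frames, hidden_size).
print(outputs.last_hidden_state.shape)
```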

+ # Training

  **Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest

  **Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts
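The released faiss index can be used to assign frame-level features to discrete units for continuous pre-training. The sketch below is a rough illustration under stated assumptions (local index file name, and treating the nearest of the 1000 IVF centroids after the OPQ transform as the unit id); the official labeling pipeline lives in the mHuBERT-147-scripts repository linked above.

```python
import faiss
import numpy as np

# Assumed local path to the released index (factory string OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
index = faiss.read_index("mhubert147_faiss.index")

# The factory string wraps an IVF index in an OPQ pre-transform.
ivf = faiss.extract_index_ivf(index)
opq = faiss.downcast_VectorTransform(index.chain.at(0))

# Placeholder features; in practice these come from an intermediate layer of the model,
# with shape (n_frames, index.d) and dtype float32.
feats = np.random.rand(100, index.d).astype("float32")

# Assumption: the discrete unit of a frame is the id of its nearest IVF centroid.
transformed = opq.apply_py(feats)
_, units = ivf.quantizer.search(transformed, 1)
print(units[:10, 0])
```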

+ # ML-SUPERB Scores

+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62262e19d36494a6f743a28d/chXjExnWc3rhhtdsyiU-W.png)

+ # Languages and Datasets

  For ASR/ST/TTS datasets, only train set is used.
  * [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/)
  * [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
  * [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)

+ **Languages present but not indexed by Hugging Face:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
+
+ # Citing and Funding Information

  ```
  @inproceedings{boito2024mhubert,

  }
  ```

  <img src="https://cdn-uploads.huggingface.co/production/uploads/62262e19d36494a6f743a28d/HbzC1C-uHe25ewTy2wyoK.png" width=7% height=7%>
  This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) funded by European Union’s Horizon Europe Research and Innovation programme under grant agreement number 101070631.