mzboito committed
Commit fd7154c
1 parent: 2b9a070

Update README.md

Files changed (1): README.md (+15 −14)

README.md CHANGED
@@ -125,7 +125,7 @@ language:
 
 ## mHuBERT-147 models
 
-mHuBERT-147 are compact and competitive multilingual general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.
+mHuBERT-147 are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
 
 This repository contains:
 * Fairseq checkpoint (original);
@@ -133,19 +133,6 @@ This repository contains:
 * Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
 
 
-# Citing
-
-
-```
-@inproceedings{boito2024mhubert,
-author={Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu},
-title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
-year=2024,
-booktitle={Interspeech 2024},
-}
-```
-
-
 # Additional Information
 
 
@@ -159,6 +146,7 @@ Please note that since training, there were CommonVoice removal requests. This m
 
 **Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb) Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akwapen Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
 
+
 # Datasets Included
 
 For ASR/ST/TTS datasets, only train set is used.
@@ -178,6 +166,19 @@ For ASR/ST/TTS datasets, only train set is used.
 * [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
 * [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)
 
+
+# Citing
+
+```
+@inproceedings{boito2024mhubert,
+author={Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu},
+title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
+year=2024,
+booktitle={Interspeech 2024},
+}
+```
+
+
 # Funding
 
 This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) under grant number 101070631.