---
license: mit
language:
- zh
- en
metrics:
- cer
- bleu
tags:
- asr
- automatic-speech-recognition
- automatic-speech-translation
- speech-translation
- speech-recognition
---
# MooER (摩耳): an LLM-based Speech Recognition and Translation Model from Moore Threads
**Online Demo**: https://mooer-speech.mthreads.com:10077/
## 🔥 Update
We have released a new model, *MooER-80K-v2*, trained on 80K hours of data. Click [here](https://huggingface.co/mtspeech/MooER-MTL-80K) to try the new model.
## 📖 Introduction
We introduce **MooER (摩耳)**, an LLM-based speech recognition and translation model developed by Moore Threads. With the *MooER* framework, you can transcribe speech into text (automatic speech recognition, or ASR) and translate it into other languages (automatic speech translation, or AST) in an end-to-end manner. The performance of *MooER* is shown in the following section. Our insights into model configurations, training strategies, and more are provided in our [technical report](https://arxiv.org/abs/2408.05101).
For instructions on using the model files, please refer to our [GitHub repository](https://github.com/MooreThreads/MooER).
## 🥊 Evaluation Results
We present the training data and the evaluation results below. For more comprehensive information, please refer to our [report](https://arxiv.org/pdf/2408.05101).
### Training data
We utilize 5k hours of data (MT5K) to train our basic *MooER-5K* model. The data sources include:
| Dataset | Duration |
|---------------|---------------|
| aishell2 | 137h |
| librispeech | 131h |
| multi_cn | 100h |
| wenetspeech | 1361h |
| in-house data | 3274h |
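
As a quick sanity check on the "5k hours" figure, the durations in the table can be summed directly. This is only an arithmetic sketch over the numbers listed above; the dictionary name and keys are illustrative and are not part of the MooER codebase.

```python
# Durations in hours, copied from the training-data table above.
durations_h = {
    "aishell2": 137,
    "librispeech": 131,
    "multi_cn": 100,
    "wenetspeech": 1361,
    "in-house data": 3274,
}

total_h = sum(durations_h.values())
open_source_h = total_h - durations_h["in-house data"]

print(f"total: {total_h}h")  # total: 5003h
print(f"open-source share: {open_source_h / total_h:.1%}")  # open-source share: 34.6%
```

Roughly two thirds of MT5K therefore comes from the in-house data.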
Note that data from the open-source datasets were randomly selected from their full training sets. The in-house data, collected internally without transcripts, were transcribed using a third-party ASR service.
Since all of the above datasets were originally designed only for the speech recognition task, they contain no reference translations. To train our speech translation model, we used a third-party translation service to generate pseudo-labels. No data filtering techniques were applied.
At this moment, we are also developing a new model trained with 80K hours of data.
### Speech Recognition
Speech recognition performance is evaluated using word error rate (WER) or character error rate (CER), in percent. Lower is better.
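
For readers unfamiliar with these metrics, both are the edit distance between the reference and the hypothesis, divided by the reference length. The sketch below is a minimal, dependency-free illustration. It is not the scoring script used for the numbers reported here, and it omits the text normalization (punctuation, casing, number formatting) that usually precedes scoring. By convention, CER is used for Chinese and WER for English. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, relative to the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate: tokens are characters, spaces ignored."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
# One substituted character over 6 reference characters.
print(round(cer("今天天气很好", "今天天汽很好"), 2))  # 16.67
```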
| Language | Testset | Paraformer-large | SenseVoice-small | Qwen-audio | Whisper-large-v3 | SeamlessM4T-v2 | MooER-5K | MooER-80K | MooER-80K-v2 |
|---|---|---|---|---|---|---|---|---|---|
| Chinese | aishell1 | 1.93 | 3.03 | 1.43 | 7.86 | 4.09 | 1.93 | 1.25 | 1.00 |
| | aishell2_ios | 2.85 | 3.79 | 3.57 | 5.38 | 4.81 | 3.17 | 2.67 | 2.62 |
| | test_magicdata | 3.66 | 3.81 | 5.31 | 8.36 | 9.69 | 3.48 | 2.52 | 2.17 |
| | test_thchs | 3.99 | 5.17 | 4.86 | 9.06 | 7.14 | 4.11 | 3.14 | 3.00 |
| | fleurs cmn_dev | 5.56 | 6.39 | 10.54 | 4.54 | 7.12 | 5.81 | 5.23 | 5.15 |
| | fleurs cmn_test | 6.92 | 7.36 | 11.07 | 5.24 | 7.66 | 6.77 | 6.18 | 6.14 |
| | average | 4.15 | 4.93 | 6.13 | 6.74 | 6.75 | 4.21 | 3.50 | 3.35 |
| English | librispeech test_clean | 14.15 | 4.07 | 2.15 | 3.42 | 2.77 | 7.78 | 4.11 | 3.57 |
| | librispeech test_other | 22.99 | 8.26 | 4.68 | 5.62 | 5.25 | 15.25 | 9.99 | 9.09 |
| | fleurs eng_dev | 24.93 | 12.92 | 22.53 | 11.63 | 11.36 | 18.89 | 13.32 | 13.12 |
| | fleurs eng_test | 26.81 | 13.41 | 22.51 | 12.57 | 11.82 | 20.41 | 14.97 | 14.74 |
| | gigaspeech dev | 24.23 | 19.44 | 12.96 | 19.18 | 28.01 | 23.46 | 16.92 | 17.34 |
| | gigaspeech test | 23.07 | 16.65 | 13.26 | 22.34 | 28.65 | 22.09 | 16.64 | 16.97 |
| | average | 22.70 | 12.46 | 13.02 | 12.46 | 14.64 | 17.98 | 12.66 | 12.47 |
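
The "average" rows in the table above appear to be unweighted means over the six test sets of each language, so every test set counts equally regardless of its size. This is an inference from the reported numbers rather than something stated here. The sketch below reproduces the two MooER-80K-v2 averages under that assumption; the variable names are illustrative.

```python
from statistics import mean

# MooER-80K-v2 error rates (%), copied from the table above.
mooer_80k_v2 = {
    "Chinese": {
        "aishell1": 1.00,
        "aishell2_ios": 2.62,
        "test_magicdata": 2.17,
        "test_thchs": 3.00,
        "fleurs cmn_dev": 5.15,
        "fleurs cmn_test": 6.14,
    },
    "English": {
        "librispeech test_clean": 3.57,
        "librispeech test_other": 9.09,
        "fleurs eng_dev": 13.12,
        "fleurs eng_test": 14.74,
        "gigaspeech dev": 17.34,
        "gigaspeech test": 16.97,
    },
}

# Assumption: each test set is weighted equally (macro average).
averages = {
    language: round(mean(scores.values()), 2)
    for language, scores in mooer_80k_v2.items()
}

print(averages)  # {'Chinese': 3.35, 'English': 12.47}
```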