svanlin-tencent committed on
Commit
a1d49ae
1 Parent(s): 5ad5517
Files changed (1)
  1. README.md +33 -29
README.md CHANGED
@@ -22,21 +22,25 @@ By open-sourcing the Hunyuan-Large model and revealing related technical details
22
 
23
  ## Benchmark Evaluation
24
 
25
- **Hunyuan-Large pre-trained model** achieves the best overall performance compared to both Dense and MoE based competitors having similar activated parameter sizes. For aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best performance, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and classical NLP tasks such as QA and reading comprehension tasks (e.g., CommonsenseQA, PIQA, SIQA, BoolQ and TriviaQA). For the mathematics capability, Hunyuan-Large outperforms all baselines in math datasets of GSM8K and MATH, and also gains the best results on CMATH in Chinese. We also observe that Hunyuan-Large achieves the overall best performance in all Chinese tasks (e.g., CMMLU, C-Eval).
26
-
 
 
 
 
 
 
27
 
28
  | Model | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
29
  |------------------|---------------|--------------|---------------|-------------|---------------|
30
  | MMLU | 85.2 | 79.3 | 77.8 | 78.5 | **88.4** |
31
  | MMLU-Pro | **61.6** | 53.8 | 49.5 | - | 60.2 |
32
  | BBH | 85.9 | 81.6 | 78.9 | 78.9 | **86.3** |
33
- | HellaSwag | - | - | **88.7** | 87.8 | 86.8 |
34
- | CommonsenseQA | 85.8 | 84.1 | 78.5 | - | **92.9** |
35
  | WinoGrande | 86.7 | 85.3 | 85.0 | 84.9 | **88.7** |
36
  | PIQA | - | - | 83.6 | 83.7 | **88.3** |
37
- | SIQA | - | - | 64.6 | - | **83.6** |
38
  | NaturalQuestions | - | - | 39.6 | 38.7 | **52.8** |
39
- | BoolQ | 80.0 | 79.4 | 87.4 | 84.0 | **92.9** |
40
  | DROP | 84.8 | 79.6 | 80.4 | 80.1 | **88.9** |
41
  | ARC-C | **96.1** | 92.9 | 91.2 | 92.4 | 95.0 |
42
  | TriviaQA | - | - | 82.1 | 79.9 | **89.2** |
@@ -49,8 +53,7 @@ By open-sourcing the Hunyuan-Large model and revealing related technical details
49
  | HumanEval | 61.0 | 58.5 | 53.1 | 48.8 | **71.4** |
50
  | MBPP | **73.4** | 68.6 | 64.2 | 66.6 | 72.6 |
51
 
52
-
53
- **Hunyuan-Large-Instruct achieves** consistent improvements on most types of tasks compared to LLMs having similar
54
  activated parameters, indicating the effectiveness of our post-training. Delving into the model performance
55
  in different categories of benchmarks, we find that our instruct model achieves the best performance on the MMLU and MATH datasets.
56
  Notably, on the MMLU dataset, our model demonstrates a significant improvement, outperforming the LLama3.1-405B model by 2.6 percentage points (89.9 vs. 87.3).
@@ -59,24 +62,22 @@ capabilities across a wide array of language understanding tasks. The model’s
59
  on the MATH dataset, where it surpasses LLama3.1-405B by a notable margin of 3.6 percentage points (77.4 vs. 73.8).
60
  Remarkably, this leap in accuracy is achieved with only 52 billion activated parameters, underscoring the efficiency of our model.
61
 
62
-
63
-
64
  | Model | LLama3.1 405B Inst. | LLama3.1 70B Inst. | Mixtral 8x22B Inst. | DeepSeekV2.5 Chat | Hunyuan-Large Inst. |
65
  |----------------------|---------------------|--------------------|---------------------|-------------------|---------------------|
66
- | MMLU | 87.3 | 83.6 | 77.8 | 80.4 | **89.9** |
67
- | CMMLU | - | - | 61.0 | 79.5 | **90.4** |
68
- | C-Eval | - | - | 60.0 | 79.9 | **88.6** |
69
- | BBH | - | - | 82.0 | **87.1** | 81.2 |
70
- | HellaSwag | - | - | 86.0 | **90.3** | 88.5 |
71
- | ARC-C | **96.9** | 94.8 | 91.5 | 92.9 | 94.6 |
72
- | DROP | - | - | 67.5 | 79.5 | **88.3** |
73
- | GPQA_diamond | **50.7** | 46.7 | 38.4 | 42.4 | 42.4 |
74
- | MATH | 73.8 | 68.0 | 51.0 | 74.7 | **77.4** |
75
- | HumanEval | 89.0 | 80.5 | 75.6 | 89.0 | **90.0** |
76
- | AlignBench | 6.0 | 5.9 | 6.2 | 8.0 | **8.3** |
77
- | MT-Bench | 9.1 | 8.8 | 8.1 | 9.0 | **9.4** |
78
- | IFEval strict-prompt | **86.0** | 83.6 | 71.2 | - | 85.0 |
79
-
80
 
81
 
82
 
@@ -85,11 +86,14 @@ Remarkably, this leap in accuracy is achieved with only 52 billion activated par
85
  If you find our work helpful, please consider citing it.
86
 
87
  ```
88
- @article{Tencent-Hunyuan-Large,
89
- title={Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent},
90
- author={Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao She, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, and Jie Jiang.},
91
- journal={arXiv:},
92
- year={2024}
 
 
 
93
  }
94
  ```
95
 
 
22
 
23
  ## Benchmark Evaluation
24
 
25
+ **Hunyuan-Large pre-trained model** achieves the best overall performance compared to both dense and MoE-based
26
+ competitors having similar activated parameter sizes. For aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU,
27
+ Hunyuan-Large consistently achieves the best performance, confirming its comprehensive abilities on aggregated tasks.
28
+ Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and classical NLP tasks
29
+ such as QA and reading comprehension (e.g., CommonsenseQA, PIQA and TriviaQA).
30
+ In mathematics, Hunyuan-Large outperforms all baselines on the GSM8K and MATH datasets,
31
+ and also gains the best results on CMATH in Chinese. We also observe that Hunyuan-Large achieves the overall
32
+ best performance in all Chinese tasks (e.g., CMMLU, C-Eval).
33
 
34
  | Model | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
35
  |------------------|---------------|--------------|---------------|-------------|---------------|
36
  | MMLU | 85.2 | 79.3 | 77.8 | 78.5 | **88.4** |
37
  | MMLU-Pro | **61.6** | 53.8 | 49.5 | - | 60.2 |
38
  | BBH | 85.9 | 81.6 | 78.9 | 78.9 | **86.3** |
39
+ | HellaSwag | - | - | **88.7** | 87.8 | 86.8 |
40
+ | CommonsenseQA | 85.8 | 84.1 | 82.4 | - | **92.9** |
41
  | WinoGrande | 86.7 | 85.3 | 85.0 | 84.9 | **88.7** |
42
  | PIQA | - | - | 83.6 | 83.7 | **88.3** |
 
43
  | NaturalQuestions | - | - | 39.6 | 38.7 | **52.8** |
 
44
  | DROP | 84.8 | 79.6 | 80.4 | 80.1 | **88.9** |
45
  | ARC-C | **96.1** | 92.9 | 91.2 | 92.4 | 95.0 |
46
  | TriviaQA | - | - | 82.1 | 79.9 | **89.2** |
 
53
  | HumanEval | 61.0 | 58.5 | 53.1 | 48.8 | **71.4** |
54
  | MBPP | **73.4** | 68.6 | 64.2 | 66.6 | 72.6 |
55
 
56
+ **Hunyuan-Large-Instruct** achieves consistent improvements on most types of tasks compared to LLMs having similar
 
57
  activated parameters, indicating the effectiveness of our post-training. Delving into the model performance
58
  in different categories of benchmarks, we find that our instruct model achieves the best performance on the MMLU and MATH datasets.
59
  Notably, on the MMLU dataset, our model demonstrates a significant improvement, outperforming the LLama3.1-405B model by 2.6 percentage points (89.9 vs. 87.3).
 
62
  on the MATH dataset, where it surpasses LLama3.1-405B by a notable margin of 3.6 percentage points (77.4 vs. 73.8).
63
  Remarkably, this leap in accuracy is achieved with only 52 billion activated parameters, underscoring the efficiency of our model.
64
 
 
 
65
  | Model | LLama3.1 405B Inst. | LLama3.1 70B Inst. | Mixtral 8x22B Inst. | DeepSeekV2.5 Chat | Hunyuan-Large Inst. |
66
  |----------------------|---------------------|--------------------|---------------------|-------------------|---------------------|
67
+ | MMLU | 87.3 | 83.6 | 77.8 | 80.4 | **89.9** |
68
+ | CMMLU | - | - | 61.0 | - | **90.4** |
69
+ | C-Eval | - | - | 60.0 | - | **88.6** |
70
+ | BBH | - | - | 78.4 | 84.3 | **89.5** |
71
+ | HellaSwag | - | - | 86.0 | **90.3** | 88.5 |
72
+ | ARC-C | **96.9** | 94.8 | 90.0 | - | 94.6 |
73
+ | GPQA_diamond | **51.1** | 46.7 | - | - | 42.4 |
74
+ | MATH | 73.8 | 68.0 | 49.8 | 74.7 | **77.4** |
75
+ | HumanEval | 89.0 | 80.5 | 75.0 | 89.0 | **90.0** |
76
+ | AlignBench | 6.0 | 5.9 | 6.2 | 8.0 | **8.3** |
77
+ | MT-Bench | 9.1 | 8.8 | 8.1 | 9.0 | **9.4** |
78
+ | IFEval strict-prompt | **86.0** | 83.6 | 71.2 | - | 85.0 |
79
+ | Arena-Hard | 69.3 | 55.7 | - | 76.2 | **81.8** |
80
+ | AlpacaEval-2.0 | 39.3 | 34.3 | 30.9 | 50.5 | **51.8** |
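
The MMLU and MATH margins quoted for Hunyuan-Large-Instruct are absolute differences between table entries, so they can be checked directly against the rows above. The following is a minimal Python sketch and not part of the original README: the `scores` dictionary and the `margin` helper are illustrative names, and the values are copied by hand from the MMLU and MATH rows.

```python
# Scores copied by hand from the instruct table (MMLU and MATH rows).
# `scores` and `margin` are hypothetical names used only for illustration.
scores = {
    "MMLU": {"LLama3.1 405B Inst.": 87.3, "Hunyuan-Large Inst.": 89.9},
    "MATH": {"LLama3.1 405B Inst.": 73.8, "Hunyuan-Large Inst.": 77.4},
}


def margin(benchmark: str, model: str, baseline: str) -> float:
    """Return the absolute score difference (model minus baseline), rounded to one decimal."""
    row = scores[benchmark]
    return round(row[model] - row[baseline], 1)


if __name__ == "__main__":
    for name in scores:
        diff = margin(name, "Hunyuan-Large Inst.", "LLama3.1 405B Inst.")
        print(f"{name}: {diff:+.1f} points over LLama3.1 405B Inst.")
```

Rounding to one decimal matches the precision of the table and avoids floating-point noise in the subtraction.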
81
 
82
 
83
 
 
86
  If you find our work helpful, please consider citing it.
87
 
88
  ```
89
+ @misc{sun2024hunyuanlargeopensourcemoemodel,
90
+ title={Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent},
91
+ author={Xingwu Sun and Yanfeng Chen and Yiqing Huang and Ruobing Xie and Jiaqi Zhu and Kai Zhang and Shuaipeng Li and Zhen Yang and Jonny Han and Xiaobo Shu and Jiahao Bu and Zhongzhi Chen and Xuemeng Huang and Fengzong Lian and Saiyong Yang and Jianfeng Yan and Yuyuan Zeng and Xiaoqin Ren and Chao Yu and Lulu Wu and Yue Mao and Tao Yang and Suncong Zheng and Kan Wu and Dian Jiao and Jinbao Xue and Xipeng Zhang and Decheng Wu and Kai Liu and Dengpeng Wu and Guanghui Xu and Shaohua Chen and Shuang Chen and Xiao Feng and Yigeng Hong and Junqiang Zheng and Chengcheng Xu and Zongwei Li and Xiong Kuang and Jianglu Hu and Yiqi Chen and Yuchi Deng and Guiyang Li and Ao Liu and Chenchen Zhang and Shihui Hu and Zilong Zhao and Zifan Wu and Yao Ding and Weichao Wang and Han Liu and Roberts Wang and Hao Fei and Peijie She and Ze Zhao and Xun Cao and Hai Wang and Fusheng Xiang and Mengyuan Huang and Zhiyuan Xiong and Bin Hu and Xuebin Hou and Lei Jiang and Jiajia Wu and Yaping Deng and Yi Shen and Qian Wang and Weijie Liu and Jie Liu and Meng Chen and Liang Dong and Weiwen Jia and Hu Chen and Feifei Liu and Rui Yuan and Huilin Xu and Zhenxiang Yan and Tengfei Cao and Zhichao Hu and Xinhua Feng and Dong Du and Tinghao She and Yangyu Tao and Feng Zhang and Jianchen Zhu and Chengzhong Xu and Xirui Li and Chong Zha and Wen Ouyang and Yinben Xia and Xiang Li and Zekun He and Rongpeng Chen and Jiawei Song and Ruibin Chen and Fan Jiang and Chongqing Zhao and Bo Wang and Hao Gong and Rong Gan and Winston Hu and Zhanhui Kang and Yong Yang and Yuhong Liu and Di Wang and Jie Jiang},
92
+ year={2024},
93
+ eprint={2411.02265},
94
+ archivePrefix={arXiv},
95
+ primaryClass={cs.CL},
96
+ url={https://arxiv.org/abs/2411.02265},
97
  }
98
  ```
99