MMLU-Pro benchmark #16
opened by kth8
In Meta's announcement I noticed they reported MMLU scores for the 1B and 3B models but not MMLU-Pro. Here are my test results, with Qwen2.5 included for comparison:
| Models | Data Source | Overall | Biology | Business | Chemistry | Computer Science | Economics | Engineering | Health | History | Law | Math | Philosophy | Physics | Psychology | Other |
|-------------------------|----------------|----------|----------|-----------|------------|------------------|------------|--------------|----------|-----------|--------|---------|-------------|----------|-------------|---------|
| Qwen2.5-1.5B | Self-Reported | 0.321 | 0.435 | 0.374 | 0.256 | 0.351 | 0.389 | 0.190 | 0.336 | 0.278 | 0.148 | 0.430 | 0.279 | 0.286 | 0.469 | 0.325 |
| Llama-3.2-1B-Instruct | Self-Reported | 0.226 | 0.406 | 0.219 | 0.155 | 0.239 | 0.274 | 0.125 | 0.260 | 0.213 | 0.173 | 0.234 | 0.200 | 0.180 | 0.346 | 0.242 |
| Qwen2.5-0.5B | Self-Reported | 0.149 | 0.208 | 0.146 | 0.116 | 0.137 | 0.225 | 0.110 | 0.169 | 0.131 | 0.134 | 0.133 | 0.132 | 0.122 | 0.212 | 0.150 |
You can view the full leaderboard here: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
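If you want to run something similar locally, below is a minimal sketch of how one could score a model on a small MMLU-Pro sample with `transformers` and the `TIGER-Lab/MMLU-Pro` dataset on the Hub. Note this is a simplified zero-shot setup with naive letter extraction; the leaderboard uses 5-shot CoT prompting, so the numbers will not match the table above.

```python
# Rough zero-shot MMLU-Pro scoring sketch (not the leaderboard's 5-shot CoT setup).
# Assumes the TIGER-Lab/MMLU-Pro "test" split with "question", "options", and
# "answer" (option letter) fields, and a chat-tuned HF model.
import re
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B-Instruct"
LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions have up to 10 options

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Sample 100 questions to keep the run short; use the full split for real numbers.
test = load_dataset("TIGER-Lab/MMLU-Pro", split="test").shuffle(seed=0).select(range(100))

correct = 0
for ex in test:
    choices = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(ex["options"]))
    prompt = (
        f"Question: {ex['question']}\n{choices}\n"
        "Answer with the letter of the correct option only."
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=8, do_sample=False)
    reply = tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
    match = re.search(r"\b([A-J])\b", reply)  # first standalone option letter
    if match and match.group(1) == ex["answer"]:
        correct += 1

print(f"Accuracy on {len(test)} sampled questions: {correct / len(test):.3f}")
```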