MMLU doesn't match on lm-evaluation-harness
#2 opened by yixinsong
yixinsong changed discussion status to closed
Same question.
Hi, we use a different implementation of MMLU: the cloze version rather than the multiple-choice (MC) version. In the cloze version we compare the log probabilities of the entire answer sequences, instead of just the single answer letters. You can find more details about this in this blog post and in appendix G.2 of this paper.
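To make the difference concrete, here is a minimal sketch of the two scoring rules. It is not the actual evaluation code: the function names and all log-probability values are made up for illustration, and in a real run the numbers would come from the model. The cloze variant here simply sums token log probabilities over each full answer text; whether any length normalization is applied on top is not stated above, so none is assumed.

```python
from typing import Dict, List


def pick_mc(letter_logprobs: Dict[str, float]) -> str:
    """MC style: choose the answer letter with the highest log probability."""
    return max(letter_logprobs, key=letter_logprobs.get)


def pick_cloze(answer_token_logprobs: Dict[str, List[float]]) -> str:
    """Cloze style: sum the token log-probs of each full answer text,
    then choose the option whose whole sequence is most likely."""
    totals = {opt: sum(lps) for opt, lps in answer_token_logprobs.items()}
    return max(totals, key=totals.get)


# Hypothetical numbers for a single four-option question.
# Log-prob of each letter token ("A".."D") following the prompt:
letter_logprobs = {"A": -1.2, "B": -0.9, "C": -1.6, "D": -2.0}

# Per-token log-probs of each option's answer text following the prompt:
answer_token_logprobs = {
    "A": [-0.3, -0.2],
    "B": [-0.9, -0.8],
    "C": [-1.5],
    "D": [-2.0, -1.0],
}

mc_choice = pick_mc(letter_logprobs)
cloze_choice = pick_cloze(answer_token_logprobs)
print(mc_choice, cloze_choice)
```

With these invented numbers the two rules select different options for the same question, which is why scores from the two implementations are not directly comparable.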
To reproduce our results, you can follow the evaluation guidelines here: https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu#evaluation