Reproducing Evaluation with lighteval

#1
by PatrickHaller - opened

Hey!

For reproducibility's sake, can you verify that this is the right task configuration for evaluating with lighteval? (The invocation I'm assuming is sketched right below the list.)

helm|hellaswag|0|0
lighteval|arc:easy|0|0
leaderboard|arc:challenge|0|0
helm|mmlu|0|0
helm|piqa|0|0
helm|commonsenseqa|0|0
lighteval|triviaqa|0|0
leaderboard|winogrande|0|0
lighteval|openbookqa|0|0
leaderboard|gsm8k|5|0
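
This is roughly how I drive lighteval with that task list. The model name, output directory, and the exact CLI flags are assumptions on my side (flag names have changed between lighteval versions), so treat this as a sketch rather than the exact command:

```python
# Sketch: write the task list to a file and invoke the lighteval CLI.
# NOTE: the model id, output dir, and flag names below are placeholders /
# assumptions -- adjust them to your lighteval version and model.
import subprocess

TASKS = """\
helm|hellaswag|0|0
lighteval|arc:easy|0|0
leaderboard|arc:challenge|0|0
helm|mmlu|0|0
helm|piqa|0|0
helm|commonsenseqa|0|0
lighteval|triviaqa|0|0
leaderboard|winogrande|0|0
lighteval|openbookqa|0|0
leaderboard|gsm8k|5|0
"""

with open("tasks.txt", "w") as f:
    f.write(TASKS)

subprocess.run(
    [
        "lighteval", "accelerate",
        "--model_args", "pretrained=HuggingFaceTB/SmolLM-135M",  # placeholder model
        "--tasks", "tasks.txt",          # path to the task list above
        "--override_batch_size", "1",
        "--output_dir", "./evals/",
    ],
    check=True,
)
```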

Furthermore:

  • Did you manually average the accuracies of ARC-Easy and ARC-Challenge into a single ARC number? (A quick sanity check on my own scores is sketched after this list.)
  • Which metrics did you report? Is it accuracy everywhere?
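
To make the first question concrete, here is the unweighted macro-average of my own ARC scores (values hard-coded from the result table further down); whether this matches how you aggregated ARC is exactly what I'd like to confirm:

```python
# Unweighted macro-average of ARC-Easy and ARC-Challenge acc_norm.
# Values are copied from my lighteval run below; the averaging scheme is my guess.
arc_easy_acc_norm = 0.6801
arc_challenge_acc_norm = 0.3848

arc_avg = (arc_easy_acc_norm + arc_challenge_acc_norm) / 2
print(f"ARC macro-average (acc_norm): {100 * arc_avg:.2f}%")  # roughly 53%
```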

Greetings,
Patrick

It also seems that some numbers might be wrong:

Winogrande: 52.5 reported vs. 54.62 in my run
GSM8K: 3.2 reported vs. 0.32 in my run
PIQA: 71.3 reported vs. 3.1 (em) / 9.0 (qem) / 3.8 (pem) / 19.79 (pqem) in my run
etc.

Some numbers seem to be wildly different from what you reported...

|                     Task                      |Version| Metric |Value |   |Stderr|
|-----------------------------------------------|------:|--------|-----:|---|-----:|
|all                                            |       |em      |0.1994|±  |0.0285|
|                                               |       |qem     |0.2052|±  |0.0282|
|                                               |       |pem     |0.2423|±  |0.0308|
|                                               |       |pqem    |0.4098|±  |0.0352|
|                                               |       |acc     |0.4650|±  |0.0142|
|                                               |       |acc_norm|0.4796|±  |0.0151|
|helm:commonsenseqa:0                           |      0|em      |0.1949|±  |0.0113|
|                                               |       |qem     |0.1974|±  |0.0114|
|                                               |       |pem     |0.1949|±  |0.0113|
|                                               |       |pqem    |0.3129|±  |0.0133|
|helm:hellaswag:0                               |      0|em      |0.2173|±  |0.0041|
|                                               |       |qem     |0.2404|±  |0.0043|
|                                               |       |pem     |0.2297|±  |0.0042|
|                                               |       |pqem    |0.3162|±  |0.0046|
|helm:mmlu:_average:0                           |       |em      |0.2021|±  |0.0297|
|                                               |       |qem     |0.2109|±  |0.0303|
|                                               |       |pem     |0.2469|±  |0.0321|
|                                               |       |pqem    |0.4168|±  |0.0366|
|helm:mmlu (per-subject rows omitted)           |       |        |      |   |      |
|helm:piqa:0                                    |      0|em      |0.0311|±  |0.0025|
|                                               |       |qem     |0.0904|±  |0.0041|
|                                               |       |pem     |0.0386|±  |0.0027|
|                                               |       |pqem    |0.1979|±  |0.0057|
|leaderboard:arc:challenge:0                    |      0|acc     |0.3660|±  |0.0141|
|                                               |       |acc_norm|0.3848|±  |0.0142|
|leaderboard:gsm8k:5                            |      0|qem     |0.0030|±  |0.0015|
|leaderboard:winogrande:0                       |      0|acc     |0.5462|±  |0.0140|
|lighteval:arc:easy:0                           |      0|acc     |0.7016|±  |0.0094|
|                                               |       |acc_norm|0.6801|±  |0.0096|
|lighteval:openbookqa:0                         |      0|acc     |0.2460|±  |0.0193|
|                                               |       |acc_norm|0.3740|±  |0.0217|
|lighteval:triviaqa:0                           |      0|qem     |0.1699|±  |0.0028|
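
For the comparison above I'm simply scaling lighteval's fractional scores to percentages; a minimal sketch with a few values hard-coded from the table:

```python
# Scale lighteval's fractional scores (copied by hand from the table above)
# to percentages for comparison against the reported numbers.
my_scores = {
    "winogrande (acc)": 0.5462,
    "gsm8k (qem)": 0.0030,
    "piqa (pqem)": 0.1979,
}

for name, frac in my_scores.items():
    print(f"{name}: {100 * frac:.2f}%")
```
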
Hugging Face TB Research org

Hi, you can find the evaluation details here: https://github.com/huggingface/smollm/blob/main/evaluation/README.md (MMLU cloze is currently missing there; we'll add it soon).
