DROP score unreproducible on local server
Hi,
I have been reproducing leaderboard scores on my local A100 server. However, I cannot reproduce the leaderboard scores on a few tasks: GSM8K and DROP (with open-weight models, like lvkaokao's, etc.).
Here is an example of the pattern I cannot reproduce.
Settings used: identical to the Open LLM Leaderboard About page.
For the same doc id, my model generates a blank sequence, but the leaderboard results are not blank.
Also, the score changes dramatically when I change the batch size (which I do because evaluation takes so long). Example case, DROP F1 score:
- batch size 2: 0.206
- batch size 8: 0.1464
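For context on why blank generations hurt DROP so much: the task reports a token-level F1 against the gold answer. The sketch below is a simplified illustration of that kind of metric, not the harness's exact DROP scorer (the real one also normalizes numbers and handles multiple answer spans); the function name is mine.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1, a simplified stand-in for DROP-style scoring."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # A blank generation scores 0 against any non-empty gold answer,
        # so blank sequences drag the aggregate F1 down hard.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

So even a handful of docs flipping between a real answer and a blank sequence across runs is enough to move the aggregate F1 by the amounts above.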
Do you have any thoughts?
Hi!
We use a batch size of 1 systematically (I need to update the doc about this, very sorry).
I'm surprised you can't reproduce the DROP results. Are you using the same commit?
We know that for generative evals, changing the batch size changes the score; it's an issue you can open on the harness.
Independently, we have found that DROP scores were unreliable, and have decided to remove the task from the leaderboard; we wrote about our findings here.
That is responsible, haha.