DROP score unreproducible on local server
Hi,
I have been reproducing leaderboard scores on my local A100 server. However, I cannot reproduce the leaderboard scores on a few tasks: GSM8K and DROP (with open-weight models, like lvkaokao's, etc.).
Here is an example of the pattern I cannot reproduce.
Settings used: identical to the Open LLM Leaderboard About page.
For the same doc id, my model generates a blank sequence, but the leaderboard results are not blank.
Also, the score changes dramatically when I change the batch size (which I do because evaluation takes so long). Example case, DROP F1 score:
- batch size 2: 0.206
- batch size 8: 0.1464
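For context on why blank generations hurt DROP so much: the task reports a token-level F1 against the gold answer. The sketch below is a simplified illustration of that kind of metric, not the harness's exact DROP scorer (the real one also normalizes numbers and handles multiple answer spans); the function name is mine.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1, a simplified stand-in for DROP-style scoring."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # A blank generation scores 0 against any non-empty gold answer,
        # so blank sequences drag the aggregate F1 down hard.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

So even a handful of docs flipping between a real answer and a blank sequence across runs is enough to move the aggregate F1 by the amounts above.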
Do you have any thoughts?
Hi!
We use a batch size of 1 systematically (I need to update the doc about this, very sorry).
I'm surprised you can't reproduce the DROP results. Are you using the same commit?
We know that for generative evals, changing the batch size changes the score; it's an issue you can open on the harness.
Independently, we have found that DROP scores were unreliable, and have decided to remove the task from the leaderboard; we wrote about our findings here.
That is responsible, haha.