Failing to reproduce results on several benchmarks using lm-evaluation-harness
I am trying to reproduce gemma-2b's accuracy on the HellaSwag benchmark, but I only got acc@hellaswag: 0.3415 using lm-evaluation-harness in the zero-shot setting, failing to reproduce the 0.714 reported in the model card. In addition, results on other benchmarks, like ARC-e, ARC-c, and PIQA, also fall short of the reported numbers.
Is there anything I missed for reproducing your results? The command I used is as follows:
for task in wikitext lambada_openai winogrande piqa sciq wsc arc_easy arc_challenge logiqa hellaswag mmlu boolq openbookqa
do
    lm_eval --model hf \
        --model_args pretrained=/path/to/gemma-2b/ \
        --tasks $task \
        --device cuda:0 \
        --batch_size 1
done
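For reference, this is what a single iteration of the loop runs, e.g. for HellaSwag (I added --num_fewshot 0 only to make the zero-shot setting explicit; the model path is a placeholder for my local checkout):
lm_eval --model hf \
    --model_args pretrained=/path/to/gemma-2b/ \
    --tasks hellaswag \
    --num_fewshot 0 \
    --device cuda:0 \
    --batch_size 1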
Hi @Zhuangl!
I suggest you use the lighteval library (https://github.com/huggingface/lighteval), which should support Gemma. cc @clefourrier @SaylorTwift
Hi!
If you want to reproduce the numbers we report on the Open LLM Leaderboard, you can use this version of the Eleuther AI Harness with this command:
python main.py --model=hf-causal-experimental \
    --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" \
    --tasks=<task_list> \
    --num_fewshot=<n_few_shot> \
    --batch_size=1 \
    --output_path=<output_path>
(details in the About page of the Leaderboard)
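For example, for gemma-2b on HellaSwag that would look roughly like the sketch below (the Leaderboard runs HellaSwag with 10 few-shot examples, if I read the About page correctly; the model id, revision, and output path are placeholders to adapt):
python main.py --model=hf-causal-experimental \
    --model_args="pretrained=google/gemma-2b,use_accelerate=True,revision=main" \
    --tasks=hellaswag \
    --num_fewshot=10 \
    --batch_size=1 \
    --output_path=./gemma-2b-hellaswag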
However, you can also use lighteval, which is in v0 at the moment, to experiment with evaluation. I think @SaylorTwift bumped transformers to a version supporting Gemma yesterday (but it's not there to reproduce Open LLM Leaderboard results).
Hi @Zhuangl, did this end up fixing the differences?