GSM8K score largely different from local run
When I run the model locally I get a GSM8K (5-shot) score of 58.60, while the leaderboard reports 54.89: https://huggingface.co/datasets/open-llm-leaderboard/details_mobiuslabsgmbh__aanaphi2-v0.1
The rest of the scores are also slightly different, but GSM8K is the only one that is off by a large margin (-3.71 points).
Is there some flag I need to set? I basically run it like this (with the latest lm-eval version):
import lm_eval

# model and tokenizer are assumed to already be loaded with transformers
# (e.g. AutoModelForCausalLM / AutoTokenizer for mobiuslabsgmbh/aanaphi2-v0.1)
model.eval()
model.config.use_cache = False

lm_eval.tasks.initialize_tasks()
model_eval = lm_eval.models.huggingface.HFLM(pretrained=model, tokenizer=tokenizer)
result = lm_eval.evaluator.simple_evaluate(model_eval, tasks=["gsm8k"], num_fewshot=5, batch_size=8)["results"]
Thank you in advance!
Good question, I'd like to know more about the GSM8K eval as well.
One thing that jumps out is that you're using a batch_size of 8, while the HF leaderboard uses a batch_size of 1.
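Something like this should do it as a quick check, reusing the model_eval wrapper from the snippet above (only the batch size changes; this is just a sketch, not the leaderboard's exact setup):

# Re-run GSM8K with batch_size=1 to match the leaderboard setting;
# model_eval is the HFLM wrapper built in the snippet above.
result_bs1 = lm_eval.evaluator.simple_evaluate(
    model_eval, tasks=["gsm8k"], num_fewshot=5, batch_size=1
)["results"]
print(result_bs1)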
I ran it with different batch sizes and I was still getting 58.60, but I haven't tried a batch_size of 1. Will try that, thanks!
Hi! You can find reproducibility info in the About tab of the leaderboard. Let us know if you encounter any more issues!
Thanks! It says "for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs", which might explain why the scores are so different. Curious to know why not just use the accuracy from GSM8K?
Regarding the comment you pointed out from the paper, I assume they simply would have gotten a worse score without the fine-tuning. A lot of reported scores in papers/tech reports are not obtained in a reproducible setup, but in one that is advantageous for the evaluated model (like using CoT instead of few-shot prompting, or reporting results for a fine-tuned model instead of the base model, as in the case you pointed out).
This is typically one of the use cases for which the leaderboard is an interesting resource: we evaluate all models in exactly the same setup, so that scores are actually comparable.
Side note: when you try to reproduce results from the leaderboard, please make sure that you use the same commit as we do :)
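For example, something along these lines (the commit hash below is only a placeholder, the real one is listed in the About tab):

# Install the harness at the leaderboard's commit -- replace <commit-hash>
# with the hash from the About tab (placeholder here, not the actual value):
#   pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@<commit-hash>

# Then sanity-check which lm_eval build is actually installed:
from importlib.metadata import version
print(version("lm_eval"))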
Since it would seem that your issue is explained by the above comment, I'm going to close, but feel free to reopen if needed.