How to understand the difference between a local evaluation report and the scores reported on the Open LLM Leaderboard?
I ran mistralai/Mistral-7B-Instruct-v0.2 locally according to the instructions:
```bash
git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout adding_all_changess
pip install -e .[math,ifeval,sentencepiece]
lm-eval --model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2,revision=,dtype=float" --tasks=leaderboard --batch_size=auto
```
And I got the following results:
hf (pretrained=./Mistral-7B-Instruct-v0.2,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1)
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | N/A | none | 0 | acc | ↑ | 0.2713 | ± | 0.0041 |
| | | none | 0 | acc_norm | ↑ | 0.4322 | ± | 0.0054 |
| | | none | 0 | exact_match | ↑ | 0.0204 | ± | 0.0039 |
| | | none | 0 | inst_level_loose_acc | ↑ | 0.5588 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.5072 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.4288 | ± | 0.0213 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.3826 | ± | 0.0209 |
| - leaderboard_bbh | N/A | none | 3 | acc_norm | ↑ | 0.4581 | ± | 0.0062 |
| - leaderboard_bbh_boolean_expressions | 0 | none | 3 | acc_norm | ↑ | 0.7840 | ± | 0.0261 |
| - leaderboard_bbh_causal_judgement | 0 | none | 3 | acc_norm | ↑ | 0.6150 | ± | 0.0357 |
| - leaderboard_bbh_date_understanding | 0 | none | 3 | acc_norm | ↑ | 0.3680 | ± | 0.0306 |
| - leaderboard_bbh_disambiguation_qa | 0 | none | 3 | acc_norm | ↑ | 0.6120 | ± | 0.0309 |
| - leaderboard_bbh_formal_fallacies | 0 | none | 3 | acc_norm | ↑ | 0.4760 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 0 | none | 3 | acc_norm | ↑ | 0.3560 | ± | 0.0303 |
| - leaderboard_bbh_hyperbaton | 0 | none | 3 | acc_norm | ↑ | 0.6520 | ± | 0.0302 |
| - leaderboard_bbh_logical_deduction_five_objects | 0 | none | 3 | acc_norm | ↑ | 0.3440 | ± | 0.0301 |
| - leaderboard_bbh_logical_deduction_seven_objects | 0 | none | 3 | acc_norm | ↑ | 0.3080 | ± | 0.0293 |
| - leaderboard_bbh_logical_deduction_three_objects | 0 | none | 3 | acc_norm | ↑ | 0.4840 | ± | 0.0317 |
| - leaderboard_bbh_movie_recommendation | 0 | none | 3 | acc_norm | ↑ | 0.5240 | ± | 0.0316 |
| - leaderboard_bbh_navigate | 0 | none | 3 | acc_norm | ↑ | 0.5520 | ± | 0.0315 |
| - leaderboard_bbh_object_counting | 0 | none | 3 | acc_norm | ↑ | 0.3600 | ± | 0.0304 |
| - leaderboard_bbh_penguins_in_a_table | 0 | none | 3 | acc_norm | ↑ | 0.4315 | ± | 0.0411 |
| - leaderboard_bbh_reasoning_about_colored_objects | 0 | none | 3 | acc_norm | ↑ | 0.4120 | ± | 0.0312 |
| - leaderboard_bbh_ruin_names | 0 | none | 3 | acc_norm | ↑ | 0.4640 | ± | 0.0316 |
| - leaderboard_bbh_salient_translation_error_detection | 0 | none | 3 | acc_norm | ↑ | 0.4000 | ± | 0.0310 |
| - leaderboard_bbh_snarks | 0 | none | 3 | acc_norm | ↑ | 0.5618 | ± | 0.0373 |
| - leaderboard_bbh_sports_understanding | 0 | none | 3 | acc_norm | ↑ | 0.7960 | ± | 0.0255 |
| - leaderboard_bbh_temporal_sequences | 0 | none | 3 | acc_norm | ↑ | 0.2920 | ± | 0.0288 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 0 | none | 3 | acc_norm | ↑ | 0.2560 | ± | 0.0277 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 0 | none | 3 | acc_norm | ↑ | 0.1480 | ± | 0.0225 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 0 | none | 3 | acc_norm | ↑ | 0.3400 | ± | 0.0300 |
| - leaderboard_bbh_web_of_lies | 0 | none | 3 | acc_norm | ↑ | 0.5160 | ± | 0.0317 |
| - leaderboard_gpqa | N/A | none | 0 | acc_norm | ↑ | 0.2819 | ± | 0.0130 |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2374 | ± | 0.0303 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.2949 | ± | 0.0195 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.2857 | ± | 0.0214 |
| - leaderboard_ifeval | 2 | none | 0 | inst_level_loose_acc | ↑ | 0.5588 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.5072 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.4288 | ± | 0.0213 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.3826 | ± | 0.0209 |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0261 | ± | 0.0091 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | ↑ | 0.0244 | ± | 0.0140 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | ↑ | 0.0076 | ± | 0.0076 |
| - leaderboard_math_hard | N/A | none | 4 | exact_match | ↑ | 0.0204 | ± | 0.0039 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0071 | ± | 0.0050 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | ↑ | 0.0065 | ± | 0.0065 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0570 | ± | 0.0167 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | ↑ | 0.0074 | ± | 0.0074 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.2713 | ± | 0.0041 |
| - leaderboard_musr | N/A | none | 0 | acc_norm | ↑ | 0.4722 | ± | 0.0179 |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5440 | ± | 0.0316 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.3477 | ± | 0.0298 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.5280 | ± | 0.0316 |
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | N/A | none | 0 | acc | ↑ | 0.2713 | ± | 0.0041 |
| | | none | 0 | acc_norm | ↑ | 0.4322 | ± | 0.0054 |
| | | none | 0 | exact_match | ↑ | 0.0204 | ± | 0.0039 |
| | | none | 0 | inst_level_loose_acc | ↑ | 0.5588 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.5072 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.4288 | ± | 0.0213 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.3826 | ± | 0.0209 |
| - leaderboard_bbh | N/A | none | 3 | acc_norm | ↑ | 0.4581 | ± | 0.0062 |
| - leaderboard_gpqa | N/A | none | 0 | acc_norm | ↑ | 0.2819 | ± | 0.0130 |
| - leaderboard_math_hard | N/A | none | 4 | exact_match | ↑ | 0.0204 | ± | 0.0039 |
| - leaderboard_musr | N/A | none | 0 | acc_norm | ↑ | 0.4722 | ± | 0.0179 |
But the scores reported on the Open LLM Leaderboard are:

| Model | Average | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-PRO |
|---|---|---|---|---|---|---|---|
| mistralai/Mistral-7B-Instruct-v0.2 | 18.44 | 54.96 | 22.91 | 2.64 | 3.47 | 7.61 | 19.08 |
How should I interpret the difference between the two, and how do I convert my local results into the format used by the Open LLM Leaderboard?
Hi @xinchen9,
Let me help you! After the evaluation you will get a results file, like this one we have for mistralai/Mistral-7B-Instruct-v0.2:
→ link to the results JSON
We process all result files to get normalised values for all benchmarks; you can find out more in our documentation. There you will also find a Colab notebook: feel free to copy it and normalise your results.
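For intuition, here is a minimal Python sketch of that kind of normalisation, assuming the documented approach of rescaling a raw accuracy between the task's random-guess baseline and a perfect score; the exact per-subtask baselines and averaging rules live in the notebook, so treat the numbers below as illustrative:

```python
# Minimal sketch of leaderboard-style score normalisation.
# Assumption: multiple-choice scores are rescaled linearly between the
# random-guess baseline and 1.0, then expressed on a 0-100 scale; values
# at or below the baseline map to 0. Check the official notebook for the
# exact per-subtask baselines and how subtask scores are averaged.

def normalize(raw_score: float, random_baseline: float) -> float:
    """Rescale a raw accuracy in [0, 1] to a 0-100 leaderboard-style score."""
    if raw_score <= random_baseline:
        return 0.0
    return 100.0 * (raw_score - random_baseline) / (1.0 - random_baseline)

# Example with the local MMLU-Pro acc of 0.2713 and a 1/10 random baseline
# (MMLU-Pro questions have 10 options): prints ~19.03, in the same ballpark
# as the 19.08 shown on the leaderboard for this model.
print(round(normalize(0.2713, 0.10), 2))
```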
I hope that's clear. Feel free to ask any questions!
Closing this discussion due to inactivity