How to understand the difference between local results and the scores reported on the Open LLM Leaderboard?

#964
by xinchen9 - opened

I ran mistralai/Mistral-7B-Instruct-v0.2 locally according to the instructions:

git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout adding_all_changess
pip install -e .[math,ifeval,sentencepiece]
lm-eval --model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2,revision=,dtype=float" --tasks=leaderboard --batch_size=auto
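
For reference, roughly the same evaluation can also be launched from Python through the harness's `simple_evaluate` API, which returns the results as a dictionary that can be saved and normalised later. This is a sketch under the assumption that the same branch is installed as above; keyword argument names can differ slightly between harness versions:

```python
# Rough Python-API equivalent of the CLI command above (lm-evaluation-harness).
# Treat this as a sketch: it assumes the same install as the shell instructions,
# and argument names may vary slightly between harness versions.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend, as with the CLI run
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=float",
    tasks=["leaderboard"],  # the leaderboard task group (BBH, GPQA, IFEval, MATH-hard, MMLU-PRO, MUSR)
    batch_size="auto",
)

# Persist the raw results so they can be normalised afterwards, similar to the
# JSON files the leaderboard publishes for each evaluated model.
with open("mistral-7b-instruct-v0.2-leaderboard-results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```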

And I got the following results:
hf (pretrained=./Mistral-7B-Instruct-v0.2,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard | N/A | none | 0 | acc ↑ | 0.2713 | ± 0.0041 |
| | | none | 0 | acc_norm ↑ | 0.4322 | ± 0.0054 |
| | | none | 0 | exact_match ↑ | 0.0204 | ± 0.0039 |
| | | none | 0 | inst_level_loose_acc ↑ | 0.5588 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.5072 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.4288 | ± 0.0213 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.3826 | ± 0.0209 |
| - leaderboard_bbh | N/A | none | 3 | acc_norm ↑ | 0.4581 | ± 0.0062 |
| - leaderboard_bbh_boolean_expressions | 0 | none | 3 | acc_norm ↑ | 0.7840 | ± 0.0261 |
| - leaderboard_bbh_causal_judgement | 0 | none | 3 | acc_norm ↑ | 0.6150 | ± 0.0357 |
| - leaderboard_bbh_date_understanding | 0 | none | 3 | acc_norm ↑ | 0.3680 | ± 0.0306 |
| - leaderboard_bbh_disambiguation_qa | 0 | none | 3 | acc_norm ↑ | 0.6120 | ± 0.0309 |
| - leaderboard_bbh_formal_fallacies | 0 | none | 3 | acc_norm ↑ | 0.4760 | ± 0.0316 |
| - leaderboard_bbh_geometric_shapes | 0 | none | 3 | acc_norm ↑ | 0.3560 | ± 0.0303 |
| - leaderboard_bbh_hyperbaton | 0 | none | 3 | acc_norm ↑ | 0.6520 | ± 0.0302 |
| - leaderboard_bbh_logical_deduction_five_objects | 0 | none | 3 | acc_norm ↑ | 0.3440 | ± 0.0301 |
| - leaderboard_bbh_logical_deduction_seven_objects | 0 | none | 3 | acc_norm ↑ | 0.3080 | ± 0.0293 |
| - leaderboard_bbh_logical_deduction_three_objects | 0 | none | 3 | acc_norm ↑ | 0.4840 | ± 0.0317 |
| - leaderboard_bbh_movie_recommendation | 0 | none | 3 | acc_norm ↑ | 0.5240 | ± 0.0316 |
| - leaderboard_bbh_navigate | 0 | none | 3 | acc_norm ↑ | 0.5520 | ± 0.0315 |
| - leaderboard_bbh_object_counting | 0 | none | 3 | acc_norm ↑ | 0.3600 | ± 0.0304 |
| - leaderboard_bbh_penguins_in_a_table | 0 | none | 3 | acc_norm ↑ | 0.4315 | ± 0.0411 |
| - leaderboard_bbh_reasoning_about_colored_objects | 0 | none | 3 | acc_norm ↑ | 0.4120 | ± 0.0312 |
| - leaderboard_bbh_ruin_names | 0 | none | 3 | acc_norm ↑ | 0.4640 | ± 0.0316 |
| - leaderboard_bbh_salient_translation_error_detection | 0 | none | 3 | acc_norm ↑ | 0.4000 | ± 0.0310 |
| - leaderboard_bbh_snarks | 0 | none | 3 | acc_norm ↑ | 0.5618 | ± 0.0373 |
| - leaderboard_bbh_sports_understanding | 0 | none | 3 | acc_norm ↑ | 0.7960 | ± 0.0255 |
| - leaderboard_bbh_temporal_sequences | 0 | none | 3 | acc_norm ↑ | 0.2920 | ± 0.0288 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 0 | none | 3 | acc_norm ↑ | 0.2560 | ± 0.0277 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 0 | none | 3 | acc_norm ↑ | 0.1480 | ± 0.0225 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 0 | none | 3 | acc_norm ↑ | 0.3400 | ± 0.0300 |
| - leaderboard_bbh_web_of_lies | 0 | none | 3 | acc_norm ↑ | 0.5160 | ± 0.0317 |
| - leaderboard_gpqa | N/A | none | 0 | acc_norm ↑ | 0.2819 | ± 0.0130 |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm ↑ | 0.2374 | ± 0.0303 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm ↑ | 0.2949 | ± 0.0195 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm ↑ | 0.2857 | ± 0.0214 |
| - leaderboard_ifeval | 2 | none | 0 | inst_level_loose_acc ↑ | 0.5588 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.5072 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.4288 | ± 0.0213 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.3826 | ± 0.0209 |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match ↑ | 0.0261 | ± 0.0091 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match ↑ | 0.0244 | ± 0.0140 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match ↑ | 0.0076 | ± 0.0076 |
| - leaderboard_math_hard | N/A | none | 4 | exact_match ↑ | 0.0204 | ± 0.0039 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match ↑ | 0.0071 | ± 0.0050 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match ↑ | 0.0065 | ± 0.0065 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match ↑ | 0.0570 | ± 0.0167 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match ↑ | 0.0074 | ± 0.0074 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc ↑ | 0.2713 | ± 0.0041 |
| - leaderboard_musr | N/A | none | 0 | acc_norm ↑ | 0.4722 | ± 0.0179 |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm ↑ | 0.5440 | ± 0.0316 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm ↑ | 0.3477 | ± 0.0298 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm ↑ | 0.5280 | ± 0.0316 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard | N/A | none | 0 | acc ↑ | 0.2713 | ± 0.0041 |
| | | none | 0 | acc_norm ↑ | 0.4322 | ± 0.0054 |
| | | none | 0 | exact_match ↑ | 0.0204 | ± 0.0039 |
| | | none | 0 | inst_level_loose_acc ↑ | 0.5588 | ± N/A |
| | | none | 0 | inst_level_strict_acc ↑ | 0.5072 | ± N/A |
| | | none | 0 | prompt_level_loose_acc ↑ | 0.4288 | ± 0.0213 |
| | | none | 0 | prompt_level_strict_acc ↑ | 0.3826 | ± 0.0209 |
| - leaderboard_bbh | N/A | none | 3 | acc_norm ↑ | 0.4581 | ± 0.0062 |
| - leaderboard_gpqa | N/A | none | 0 | acc_norm ↑ | 0.2819 | ± 0.0130 |
| - leaderboard_math_hard | N/A | none | 4 | exact_match ↑ | 0.0204 | ± 0.0039 |
| - leaderboard_musr | N/A | none | 0 | acc_norm ↑ | 0.4722 | ± 0.0179 |

But the scores reported on the Open LLM Leaderboard for mistralai/Mistral-7B-Instruct-v0.2 are:

| Average | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-PRO |
|---|---|---|---|---|---|---|
| 18.44 | 54.96 | 22.91 | 2.64 | 3.47 | 7.61 | 19.08 |
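
Assuming these really are the leaderboard's usual columns (Average, then the six benchmarks), the first number is simply the mean of the other six, which can be checked in a couple of lines:

```python
# Sanity check (assumption: the leaderboard's "Average" column is the plain
# mean of the six normalised benchmark scores shown above).
scores = {"IFEval": 54.96, "BBH": 22.91, "MATH Lvl 5": 2.64,
          "GPQA": 3.47, "MUSR": 7.61, "MMLU-PRO": 19.08}
average = sum(scores.values()) / len(scores)
print(average)  # ≈ 18.445, consistent with the 18.44 shown on the leaderboard
```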

How should I understand the difference between the two, and how do I convert my local results into the Open LLM Leaderboard format?

Open LLM Leaderboard org

Hi @xinchen9 ,

Let me help you! After the evaluation you will get a results file, like the one we have for mistralai/Mistral-7B-Instruct-v0.2 – link to the results JSON.

We process all result files to get normalised values for all benchmarks – you can find out more in our documentation. There you will also find a Colab notebook; feel free to copy it and normalise your results.
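
For readers who don't want to open the notebook right away, the core idea of the normalisation is to rescale each multiple-choice score so that the random-guessing baseline maps to 0 and a perfect score maps to 100, clipping anything below the baseline to zero; benchmarks made of several subtasks (BBH, MUSR) are normalised subtask by subtask with each subtask's own baseline and then averaged, and exact-match tasks such as MATH-hard have a baseline of 0. A minimal sketch, applied to two of the raw numbers from the run above – the baselines used here (0.25 for 4-option GPQA, 0.1 for 10-option MMLU-PRO) are assumptions based on the number of answer choices, and the leaderboard's notebook remains the authoritative reference:

```python
# Minimal sketch of leaderboard-style normalisation, applied to the local GPQA
# and MMLU-PRO numbers from the run above. The random baselines (0.25 for 4-way
# GPQA, 0.1 for 10-way MMLU-PRO) are assumptions based on the number of answer
# choices; see the leaderboard documentation/notebook for the exact procedure.

def normalize_within_range(value: float, lower_bound: float, higher_bound: float = 1.0) -> float:
    """Rescale a raw accuracy so the random baseline maps to 0 and a perfect
    score maps to 100; scores below the baseline are clipped to 0."""
    normalized = (value - lower_bound) / (higher_bound - lower_bound) * 100
    return max(normalized, 0.0)

# Raw values from the local run above.
gpqa_raw = 0.2819      # leaderboard_gpqa acc_norm
mmlu_pro_raw = 0.2713  # leaderboard_mmlu_pro acc

print(normalize_within_range(gpqa_raw, 0.25))      # ≈ 4.25
print(normalize_within_range(mmlu_pro_raw, 0.10))  # ≈ 19.03
```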

I hope it's clear – feel free to ask any questions!

Open LLM Leaderboard org

Closing this discussion due to inactivity

alozowski changed discussion status to closed
