google/gemma-2b-it · What do they mean by maj@1?

joserass

May 14

I get that they might mean majority vote, but what does the 1 mean? The candidate response?

Renu11

Google org May 30

edited May 30

Hi @joserass, the maj@1 metric is the score computed by greedily sampling once per question when evaluating the Gemma 2b and 7b models. You can refer to this doc for more details on the metrics used for model evaluation. Thank you.
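To make the "1" concrete, here is a minimal sketch of how a maj@k score is commonly computed: draw k answers per question, take the most frequent one, and compare it to the reference. With k=1 the vote is trivial, so the score is just the accuracy of the single (here greedy) answer. The function names `maj_at_k` and `maj_at_k_accuracy` are hypothetical and are not taken from the Gemma evaluation code.

```python
from collections import Counter

def maj_at_k(samples, reference):
    """True if the most common answer among the k samples equals the reference."""
    if not samples:
        return False
    # most_common(1) returns the top-voted answer; ties go to the answer seen first.
    voted, _count = Counter(samples).most_common(1)[0]
    return voted == reference

def maj_at_k_accuracy(all_samples, references):
    """Fraction of questions whose majority-voted answer is correct."""
    correct = sum(maj_at_k(s, r) for s, r in zip(all_samples, references))
    return correct / len(references)

# k=1: one answer per question, so no real voting takes place.
print(maj_at_k_accuracy([["18"], ["5"]], ["18", "6"]))
# k=3: the majority answer "18" wins over the single "17".
print(maj_at_k(["18", "17", "18"], "18"))
```

Answers are compared as already-extracted strings here; how the final answer is pulled out of the model's text is a separate step.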

joserass

May 30

Would you be able to direct me to the source code of the evaluation pipeline used for the reported results?

I was unable to replicate the GSM8K benchmark results of 17.7% (2b-it) and 46.4% (7b-it) using 8-shot CoT with greedy decoding. For 2b-it I got around 10%, and for 7b-it around 25%.

I used an implementation based on the official repo methodology (https://github.com/google-deepmind/gemma/blob/main/colabs/gsm8k_eval.ipynb) and also the lm_eval framework from https://github.com/EleutherAI/lm-evaluation-harness.

I tried not only maj@1 but also sampling with different top_p and top_k settings.
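For anyone comparing numbers, the scoring step can be sketched as below. It assumes the GSM8K convention that each reference solution ends with `#### <number>`, and it assumes the model's final answer is the last number in its completion. That extraction rule is only an assumption for illustration, not the rule used for the reported results, and the helper names are hypothetical.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def reference_answer(solution: str) -> str:
    """Take the text after the final '####' marker of a GSM8K solution."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(completion: str):
    """Assumed rule: the last number in the completion is the final answer."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def is_correct(completion: str, solution: str) -> bool:
    """Compare numerically so that '18' and '18.0' count as the same answer."""
    pred = predicted_answer(completion)
    if pred is None:
        return False
    return float(pred) == float(reference_answer(solution))

print(is_correct("She makes 9 * 2 = $18 per day.", "9 * 2 = 18\n#### 18"))
```

Different extraction rules (last number, first number after "The answer is", and so on) can shift the measured accuracy, so it helps to state which one was used when reporting a score.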

Thanks @Renu11

suryabhupa

Google org Jun 10

We haven't yet open-sourced our own internal evaluation harness; it's interesting that the numbers you find are lower -- we'll look into it!