What do they mean by maj@1?
I get that "maj" probably means majority vote, but what does the 1 mean? Is it the number of candidate responses?
Would you be able to direct me to the source code of the evaluation pipeline used for the reported results?
I was unable to replicate the reported GSM8K results of 17.7% (2b-it) and 46.4% (7b-it) using 8-shot CoT with greedy decoding. I got around 10% for 2b-it and around 25% for 7b-it.
I used an implementation based on the official repo methodology (https://github.com/google-deepmind/gemma/blob/main/colabs/gsm8k_eval.ipynb) and also the lm_eval framework from https://github.com/EleutherAI/lm-evaluation-harness.
I tried not only maj@1 but also sampling with different top_p and top_k settings.
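For reference, here is a minimal sketch of how maj@k scoring is commonly computed, assuming that k is the number of completions per question and that the majority vote is taken over the extracted final answers. Under that reading, maj@1 reduces to scoring a single completion. The helper names `extract_answer` and `maj_at_k` and the "last number in the text" extraction rule are assumptions made for illustration; they are not taken from the Gemma evaluation code.

```python
import re
from collections import Counter


def extract_answer(text):
    """Return the last number in the text as a string, or None if there is none.

    Assumption: the final answer is the last number in the completion.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def maj_at_k(completions, reference):
    """Score one question: True if the most common extracted answer matches.

    `completions` holds the k model outputs for the question; with k = 1
    this is plain single-sample accuracy.
    """
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return False
    voted, _ = Counter(answers).most_common(1)[0]
    return float(voted) == float(reference)
```

Small differences in answer extraction (for example, last number versus a fixed "The answer is" pattern) can change the measured accuracy, so it may be worth comparing this step across the two harnesses.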
Thanks @Renu11
We haven't yet open-sourced our own internal evaluation harness; it's interesting that the numbers you found are lower -- we'll look into it!