Underreported HumanEval Scores?
#83
by VaibhavSahai
Hello. I have noticed that after the June update, this model performs significantly better on HumanEval. It previously scored 64.6% (as measured on the EvalPlus leaderboard), but when I ran the same test it scored 72%.
My params: temp = 0 and max_tokens = 2048.
Could someone verify/recheck these scores? Thank you
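For context, here is a minimal sketch of that setup, assuming a transformers-based greedy decode (temp = 0, max_tokens = 2048) and the evalplus data helpers; the model ID and raw-prompt handling are placeholders, not necessarily the exact harness behind the numbers above:

```python
# A minimal sketch, assuming a transformers greedy decode and the evalplus
# data helpers. MODEL_ID and the raw-prompt handling are placeholders, not
# necessarily the exact harness used for the scores above.
from transformers import AutoModelForCausalLM, AutoTokenizer
from evalplus.data import get_human_eval_plus, write_jsonl

MODEL_ID = "model-under-test"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

samples = []
for task_id, problem in get_human_eval_plus().items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    # do_sample=False means greedy decoding, i.e. the temp = 0 setting above
    output = model.generate(**inputs, do_sample=False, max_new_tokens=2048)
    # keep only the newly generated tokens as the completion
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
```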
Thank you for your interest and effort to independently run the HumanEval benchmark.
Np @nguyenbh. I also ran EvalPlus (which adds extra test cases to HumanEval) and observed a jump from the pre-June 59.1% to 65.2%. Really like the new model for lightweight coding tasks now
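For anyone who wants to re-check these numbers: once a samples.jsonl has been generated (as in the sketch above), evalplus ships a scoring entry point along the lines of `evalplus.evaluate --dataset humaneval --samples samples.jsonl`, which reports both the base HumanEval pass@1 and the stricter EvalPlus pass@1 (exact flags may vary by version).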