Benchmarks for GPT-3.5 & GPT-4 for comparison
Is it possible to also add GPT-3.5 and GPT-4 benchmarks for comparison purposes?
Hi. A clone of this space includes GPT-3.5 and GPT-4. It can be found here: https://huggingface.co/spaces/gsaivinay/open_llm_leaderboard
@jaspercatapang Wow. That's wonderful. Thanks!
@jaspercatapang That is cool, but reproducibility is in question since we have no idea how OpenAI runs its benchmarks. I wonder if they can be reproduced with the same evaluation script as the rest of the leaderboard.
@felixz Agreed, it would be better if they could evaluate them. However, they already declined to run these evaluations in a previous thread, emphasizing that this leaderboard is for open LLMs.
But I agree with you: now that open LLMs are approaching proprietary LLMs in terms of performance, it is important to have both categories validated.
Hi!
We won't add GPT-3.5 and GPT-4, for two reasons: 1) as @jaspercatapang mentioned, this is a leaderboard for open LLMs; 2) more importantly, our main reason for not including models behind closed APIs such as GPT-3.5 is the well-known fact that these APIs change over time, so any evaluation we ran would only be valid on the precise day we ran it.
This would not give reproducible results, and reproducibility is very important to us.
I think it's possible; other leaderboards, such as https://tatsu-lab.github.io/alpaca_eval/, already include them.
The rationale that GPT "isn't open" or changes over time makes no sense to me. Half the reason people want this benchmark in the first place is specifically to compare models to ChatGPT and see "whether we're there yet", so to speak. If all we have is OpenAI's self-reporting of how their model performed three months ago, that's still an extremely valuable data point. OpenAI could shut down their entire company tomorrow and take all of their models offline, but the fact that somebody at some point in history created a language model which performed like that gives us an anchor for what's possible.
Furthermore, I see no reason why you couldn't run this same set of benchmarks through their API independently and then mark down the scores as "GPT-4-2308" or "GPT-4-2309", which would get you more objective and useful results. In fact, you could even do it over time to prove that their model changes, and exactly how much.
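As a minimal sketch of what such a date-stamped snapshot could look like (assuming the openai Python SDK >= 1.0 and an OPENAI_API_KEY in the environment; the placeholder question, the snapshots.jsonl file, and the label format are made up for illustration and are not the leaderboard's actual harness):

```python
# Hypothetical sketch: run a few multiple-choice questions through the OpenAI API
# and store the score under a date-stamped label such as "GPT-4-2308".
# The questions below are illustrative placeholders, not real benchmark items.
import json
from datetime import date

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTIONS = [
    {"prompt": "2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 22\nAnswer with a single letter.",
     "answer": "B"},
]

def evaluate(model: str) -> float:
    """Return simple accuracy of `model` on the placeholder questions."""
    correct = 0
    for q in QUESTIONS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": q["prompt"]}],
            temperature=0,
        )
        prediction = resp.choices[0].message.content.strip()[:1].upper()
        correct += prediction == q["answer"]
    return correct / len(QUESTIONS)

if __name__ == "__main__":
    model = "gpt-4"
    label = f"{model.upper()}-{date.today():%y%m}"  # e.g. "GPT-4-2308"
    snapshot = {"label": label, "accuracy": evaluate(model)}
    # Append the snapshot so repeated runs show how the scores drift over time.
    with open("snapshots.jsonl", "a") as f:
        f.write(json.dumps(snapshot) + "\n")
```

Re-running the same script each month and comparing the appended snapshots would make any drift in the hosted model directly visible.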
Can you believe that new models are now beating GPT-3.5 in average scores?
The Llama 2 base model gets about 69 on MMLU and similar on HellaSwag. I think the only thing these fine-tunes improved on is TruthfulQA.
So yeah, you can say Llama 2 is on the level of GPT-3.5.
Still, all these benchmarks miss instruction following, conversation, and coding abilities, so it is hard to make a strong statement.
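For context, the leaderboard's "Average" column at the time was just the arithmetic mean of the four benchmark scores (ARC, HellaSwag, MMLU, TruthfulQA); a tiny sketch with made-up numbers:

```python
# Illustrative only: the leaderboard average is the plain mean of the four scores.
# These numbers are placeholders, not real results for any particular model.
scores = {"ARC": 67.3, "HellaSwag": 87.3, "MMLU": 69.8, "TruthfulQA": 44.8}
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")  # 67.30 with these placeholder scores
```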
Good arguments. The argument about closed models changing over time is pretty weak; as you said, you can just snapshot the result.
A better argument could be that HF does not really want to pay to benchmark GPT-3.5 or GPT-4. The cost is not trivial, and you can get into many thousands of dollars quickly.
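A rough back-of-envelope sketch of that cost concern, using GPT-4's mid-2023 list prices ($0.03/1K prompt tokens, $0.06/1K completion tokens) and guessed per-example token budgets; every figure here is an assumption for illustration, not a measured number:

```python
# Back-of-envelope API cost estimate for running the full benchmark suite.
# All inputs are assumptions (GPT-4 8K list prices as of mid-2023, a guessed
# number of API calls, and a guessed few-shot prompt length).
PROMPT_PRICE = 0.03 / 1000       # USD per prompt token
COMPLETION_PRICE = 0.06 / 1000   # USD per completion token

num_calls = 100_000                # rough guess across ARC, HellaSwag, MMLU, TruthfulQA
prompt_tokens_per_call = 1_500     # few-shot prompts are long
completion_tokens_per_call = 5

cost = num_calls * (
    prompt_tokens_per_call * PROMPT_PRICE
    + completion_tokens_per_call * COMPLETION_PRICE
)
print(f"Estimated cost: ${cost:,.0f}")  # about $4,500 under these assumptions
```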