Models for Human/GPT4 Eval
Please comment and react on the models you want us to add! We'll be selecting models from this, rather than automatically running them.
airoboros-13b-gpt4.ggmlv3.q8_0 https://huggingface.co/TheBloke/airoboros-13b-gpt4-GGML
nous-hermes-13b.ggmlv3.q8_0 https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML
These seem to be among the highest performing 13b models (according to certain evaluations), and it would be nice to have them on the leaderboard.
https://huggingface.co/WizardLM/WizardLM-30B-V1.0
https://huggingface.co/WizardLM/WizardLM-13B-V1.0
https://huggingface.co/WizardLM/WizardLM-7B-V1.0
https://huggingface.co/YuxinJiang/Lion
https://huggingface.co/TheBloke/selfee-13b-fp16
https://huggingface.co/TheBloke/selfee-7B-fp16
https://huggingface.co/TheBloke/tulu-13B-fp16
https://huggingface.co/TheBloke/tulu-7B-fp16
https://huggingface.co/TheBloke/tulu-30B-fp16
https://huggingface.co/TheBloke/CAMEL-13B-Combined-Data-fp16
What about testing the top 10 models of the LLM benchmark?
Please evaluate https://huggingface.co/OpenAssistant/falcon-40b-sft-top1-560. It is both a lot newer and better than the Pythia-based oasst-12b that was used as one of your initial models. If multiple models are possible, then https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226 would also be nice; if not, evaluating falcon-40b-sft-top1-560 alone would be enough.
https://huggingface.co/WizardLM/WizardLM-30B-V1.0
https://huggingface.co/WizardLM/WizardLM-13B-V1.0
WizardLM is currently the best open-source model, thanks to its unique fine-tuning method, according to several benchmarks:
https://github.com/aigoopy/llm-jeopardy
https://docs.google.com/spreadsheets/d/1NgHDxbVWJFolq8bLvLkuPWKC7i_R6I6W/edit#gid=2011456595
https://tatsu-lab.github.io/alpaca_eval/
https://www.reddit.com/r/LocalLLaMA/comments/1469343/hi_folks_back_with_an_update_to_the_humaneval/
The airoboros models are good too.
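Leaderboards like the ones linked above typically aggregate pairwise judge verdicts (from GPT-4 or human raters) into a per-model win rate. As an illustrative sketch only (the model names and verdicts below are made-up sample data, and real leaderboards may weight or filter judgments differently), the core aggregation looks like this:

```python
from collections import defaultdict

def win_rates(judgments):
    """Compute each model's pairwise win rate from judge verdicts.

    `judgments` is a list of (model_a, model_b, winner) tuples, where
    `winner` is "a", "b", or "tie"; a tie counts as half a win for each side.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b, winner in judgments:
        games[a] += 1
        games[b] += 1
        if winner == "a":
            wins[a] += 1.0
        elif winner == "b":
            wins[b] += 1.0
        else:  # tie
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / games[m] for m in games}

# Hypothetical sample judgments, not real benchmark results.
sample = [
    ("wizardlm-30b", "vicuna-13b", "a"),
    ("wizardlm-30b", "vicuna-13b", "tie"),
    ("vicuna-13b", "airoboros-13b", "b"),
]
print(win_rates(sample))  # wizardlm-30b: (1 + 0.5) / 2 = 0.75
```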
RLHF Open Assistant 30B:
https://huggingface.co/Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-hf
Chimera-inst-chat 13B, claimed to reach 97% of ChatGPT quality by GPT-4 eval:
https://huggingface.co/Yhyu13/chimera-inst-chat-13b-hf
Would love to see Orca once the weights are released!
Great stuff, everyone. I'll launch a batch tomorrow or early next week. We'll figure out what throughput of models we can handle; we can generally do many more models with GPT-4 evals than with human evals, but without human evals it's hard to calibrate.
Please add falcon-7b-instruct and mpt-7b-instruct, plus their chat variants.
This one: Monero/Manticore-13b-Chat-Pyg-Guanaco
Heard about it on Reddit a few weeks back, and (subjectively) it is still the best 13B model I've tried.
Where are the results?
tiiuae/falcon-40b-instruct
timdettmers/guanaco-65b-merged
HuggingFaceH4/starchat-beta
bigcode/starcoderplus
TheBloke/Wizard-Vicuna-13B-Uncensored-HF
mosaicml/mpt-7b
bigcode/starcoderplus
Salesforce/codegen-16B-nl
facebook/galactica-120b
Microsoft Orca 13B (https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/)
Google PaLM 2/Bard
Claude+
Bumping for any of the compressed-weight models; they need more benchmarking, and they could be their own leaderboard breakout.
mosaicml/mpt-30b
mosaicml/mpt-30b-instruct
mosaicml/mpt-30b-chat
lmsys/vicuna-7b-v1.3
lmsys/vicuna-13b-v1.3
lmsys/vicuna-33b-v1.3
facebook/opt-iml-1.3b
facebook/opt-iml-30b
facebook/opt-iml-max-1.3b
facebook/opt-iml-max-30b
Would be good to see the difference between a verbose chatbot LLM and succinct instruction-tuned LLM.
@tallrichandsom The GPT/Human eval leaderboard was moved here
I think you should definitely add the Falcon 7B and 40B Open Assistant fine-tuned versions. Based on Elo rating and many end users' perspectives, it's the best, and Falcon 40B OA even feels superior to ChatGPT in quality. I'm really talking about the actual feel of using it and the quality of the results.
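The Elo ratings mentioned above come from head-to-head comparisons between models. A minimal sketch of a single Elo update, assuming the standard chess convention (K = 32, a 400-point scale factor) rather than whatever parameters the leaderboard actually uses:

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update after a head-to-head comparison.

    `score_a` is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    Returns the updated (r_a, r_b) pair; the exchange is zero-sum.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; the winner gains k/2 = 16 points.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

With equal ratings, the expected score is 0.5, so each win moves exactly half of K; upsets against higher-rated opponents move more.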
Would be good to see a comparison between the newest tiiuae/falcon-40b-instruct and GPT-4.
@natolambert Closing this discussion since the Human and GPT-4 evaluation leaderboard has moved.