Could you please test the preference consistency between `RLHFlow/pair-preference-model-LLaMA3-8B` and GPT-4 on the AlpacaEval dataset?

#2
by rungao2001 - opened

Evaluating on AlpacaEval 2 can be too expensive for small teams training their own models. Since this model is strong at expressing pairwise preferences, I believe it could serve as a judge for comparing responses from different models, and perhaps even take the place of GPT-4. It would be very interesting to compute the win rate against GPT-4 on AlpacaEval with RLHFlow/pair-preference-model-LLaMA3-8B as the judge, and compare the result with the official win rates on the AlpacaEval Leaderboard. I have something like the sketch below in mind.
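A rough sketch of the judging step (the prompt template and comparison-token handling here are illustrative assumptions; the exact format should be taken from the model card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLHFlow/pair-preference-model-LLaMA3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(instruction: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' depending on which response the model prefers."""
    # Hypothetical pairwise prompt; swap in the official template
    # from the model card before using this for real.
    prompt = (
        f"[CONTEXT] {instruction} "
        f"[RESPONSE A] {response_a} "
        f"[RESPONSE B] {response_b}\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # Compare the logits of the 'A' and 'B' tokens to read off the preference.
    id_a = tokenizer.encode("A", add_special_tokens=False)[0]
    id_b = tokenizer.encode("B", add_special_tokens=False)[0]
    return "A" if logits[id_a] > logits[id_b] else "B"
```

The win rate against GPT-4 would then be the fraction of AlpacaEval instructions where the judge prefers the candidate model's response over GPT-4's (ideally averaged over both response orders to control for position bias).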

RLHFlow org

Hi, thanks for your interest in our models.

AlpacaEval does not come with a preference-labeled dataset to test against. I do have some results for MT-Bench and LMSYS data, though.

Preference model agreement accuracy:

  • lmsys/chatbot_arena_conversations (15k): 0.822
  • Arena-Hard: 0.791
  • lmsys/mt_bench_human_judgments/human: 0.805
  • lmsys/mt_bench_human_judgments/gpt4: 0.938

We deleted the tied pairs from these test sets.
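Concretely, agreement here is just the fraction of non-tied pairs where the model's preference matches the reference label; a minimal sketch (the field names and the `judge` helper are illustrative, not the actual dataset schema):

```python
def agreement(examples, judge):
    """Fraction of non-tied pairs where the judge agrees with the label."""
    # Each example is assumed to carry 'prompt', 'response_a', 'response_b',
    # and a 'label' in {'A', 'B', 'tie'}; ties are dropped, as noted above.
    kept = [ex for ex in examples if ex["label"] != "tie"]
    correct = sum(
        judge(ex["prompt"], ex["response_a"], ex["response_b"]) == ex["label"]
        for ex in kept
    )
    return correct / len(kept)
```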

It turns out that the BT model, the preference model, and ArmoRM have all been used for online iterative DPO, yielding models with AlpacaEval win rates possibly > 50%. So the model can be used to overfit AlpacaEval lol.
