Could you please test the preference consistency between `RLHFlow/pair-preference-model-LLaMA3-8B` and GPT-4 on the AlpacaEval dataset?

#2
by rungao2001 - opened

Evaluating on AlpacaEval 2 can be too expensive for small teams training their own models. Since this model is strong at expressing pairwise preferences, I believe it could serve as a judge for comparing responses from different models, and perhaps even take the place of GPT-4. It would be very interesting to compute the win rate against GPT-4 on AlpacaEval with RLHFlow/pair-preference-model-LLaMA3-8B as the judge, and compare the result with the official win rates on the AlpacaEval Leaderboard. I have something like the sketch below in mind.
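A rough sketch of the judging step (the prompt template and comparison-token handling here are illustrative assumptions; the exact format should be taken from the model card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLHFlow/pair-preference-model-LLaMA3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(instruction: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' depending on which response the model prefers."""
    # Hypothetical pairwise prompt; swap in the official template
    # from the model card before using this for real.
    prompt = (
        f"[CONTEXT] {instruction} "
        f"[RESPONSE A] {response_a} "
        f"[RESPONSE B] {response_b}\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # Compare the logits of the 'A' and 'B' tokens to read off the preference.
    id_a = tokenizer.encode("A", add_special_tokens=False)[0]
    id_b = tokenizer.encode("B", add_special_tokens=False)[0]
    return "A" if logits[id_a] > logits[id_b] else "B"
```

The win rate against GPT-4 would then be the fraction of AlpacaEval instructions where the judge prefers the candidate model's response over GPT-4's (ideally averaged over both response orders to control for position bias).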

RLHFlow org

Hi, thanks for your interest in our models.

AlpacaEval does not come with a preference-labeled dataset to test against. I do have some results for MT-Bench and LMSYS data, though.

Preference model agreement accuracy:

  • lmsys/chatbot_arena_conversations (15k): 0.822
  • Arena-Hard: 0.791
  • lmsys/mt_bench_human_judgments/human: 0.805
  • lmsys/mt_bench_human_judgments/gpt4: 0.938

We deleted the tied pairs from these test sets.
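Concretely, agreement here is just the fraction of non-tied pairs where the model's preference matches the reference label; a minimal sketch (the field names and the `judge` helper are illustrative, not the actual dataset schema):

```python
def agreement(examples, judge):
    """Fraction of non-tied pairs where the judge agrees with the label."""
    # Each example is assumed to carry 'prompt', 'response_a', 'response_b',
    # and a 'label' in {'A', 'B', 'tie'}; ties are dropped, as noted above.
    kept = [ex for ex in examples if ex["label"] != "tie"]
    correct = sum(
        judge(ex["prompt"], ex["response_a"], ex["response_b"]) == ex["label"]
        for ex in kept
    )
    return correct / len(kept)
```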

It turns out that the BT model, the preference model, and ArmoRM have all been used for online iterative DPO, yielding models with AlpacaEval win rates possibly > 50%. So the model can be used to overfit AlpacaEval lol.
