Understanding reward metrics
Thank you for the excellent work. I am new to DPO and want to understand how it works.
From your reported training results, I observed these trends:
Validation Loss ↓
Rewards/chosen ↑
Rewards/rejected ↓
Rewards/accuracies ↑
Rewards/margins ↑
I assume these are the correct trends for achieving good performance.
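To make sure I am reading these metrics correctly: my understanding is that they are computed roughly like the sketch below. This follows the convention used by TRL's `DPOTrainer`, where the implicit reward is beta times the policy-vs-reference log-probability ratio; the function name, the `beta=0.1` default, and the toy numbers are my own, just for illustration.

```python
import torch

def dpo_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                       ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Logged DPO reward metrics from per-sequence log-probabilities.

    The implicit reward of a response y is
    beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return {
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        # fraction of pairs where the chosen response out-scores the rejected one
        "rewards/accuracies": (chosen_rewards > rejected_rewards).float().mean().item(),
        "rewards/margins": (chosen_rewards - rejected_rewards).mean().item(),
    }

# toy example: four preference pairs worth of per-sequence log-probs
metrics = dpo_reward_metrics(
    policy_chosen_logps=torch.tensor([-12.0, -9.5, -11.0, -10.2]),
    policy_rejected_logps=torch.tensor([-14.0, -13.0, -12.5, -13.8]),
    ref_chosen_logps=torch.tensor([-11.5, -10.0, -10.8, -10.0]),
    ref_rejected_logps=torch.tensor([-13.0, -12.0, -12.0, -12.5]),
)
print(metrics)
```

If I have this right, the chosen and rejected rewards are measured relative to the reference model, not on an absolute scale.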
However, when I tried to fine-tune a smaller model (to learn the concept) following this tutorial, I got a different trend:
Validation Loss ↓
Rewards/chosen ↓
Rewards/rejected ↓
Rewards/accuracies ↑
Rewards/margins ↑
I am only tuning the `learning_rate` at the moment, and my model cannot reproduce your trend. May I ask which trend is correct, and whether there is a secret recipe behind it?
I know that you will release "The Alignment Handbook", but I have an assignment deadline coming up soon, so I hope you can give me a sneak peek at the recipes.
I checked GitHub, Reddit, and even FutureSpot, and found that many other people are having this same problem with smaller models. I found a tutorial-style article that may help. While I am not on the H4 team, I do have some knowledge of DPO.
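The key thing to notice is that the DPO loss only ever sees the *difference* between the two implicit rewards. Here is a minimal sketch of the standard loss from the DPO paper (Rafailov et al., 2023), written in plain PyTorch rather than any specific trainer's exact code; the function name is mine:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss: -log sigmoid(beta * margin of log-ratios).

    Only the gap between the chosen and rejected implicit rewards
    enters the loss, so both rewards can drift downward together
    while the loss still improves.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

So a run where Rewards/chosen and Rewards/rejected both fall can still be perfectly healthy, as long as Rewards/margins keeps growing and Rewards/accuracies improves.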
I hope the link below will help solve your problem!