Understand reward metrics

#22
by NhatHoang2002 - opened

Thank you for the excellent work. I am new to the concept and want to understand how DPO works.

From your reported training results, I observed this trend:

Validation Loss ↓
Rewards/chosen ↓
Rewards/rejected ↓
Rewards/accuracies ↑ 
Rewards/margins ↑

I assume these are the trends one should see to achieve good performance.

However, when I tried to fine-tune a smaller model (to learn the concept) following this tutorial, I got a different trend:

Validation Loss ↓
Rewards/chosen ↑
Rewards/rejected ↑
Rewards/accuracies ↑ 
Rewards/margins ↑

So far I am only varying the learning_rate, and my model cannot reproduce the trend you report. May I ask which trend is correct, and whether there is any secret recipe behind it?
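
To make sure I am reading the logs correctly, here is my understanding of how these reward metrics are computed (a minimal sketch in the spirit of TRL's DPOTrainer; the function name and the `beta` default here are my own assumptions):

```python
import torch

def dpo_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                       ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Reward metrics logged during DPO training (sketch).

    Each argument is a 1-D tensor of summed log-probabilities of the
    chosen/rejected completions under the policy or the frozen reference
    model; `beta` is the DPO temperature.
    """
    # Implicit DPO reward: how far the policy has moved away from the
    # reference model on each completion, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    return {
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        # Fraction of pairs where the chosen completion gets the higher reward.
        "rewards/accuracies": (chosen_rewards > rejected_rewards).float().mean().item(),
        # Average gap between chosen and rejected rewards.
        "rewards/margins": (chosen_rewards - rejected_rewards).mean().item(),
    }
```

If I read this right, the DPO loss only depends on the margin between the chosen and rejected rewards, which is why I am unsure whether the absolute direction of rewards/chosen and rewards/rejected matters.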

I know that you will release "The Alignment Handbook", but I have an assignment deadline coming up soon, so I hope you can give me a sneak peek at the recipes.

I checked GitHub, Reddit, and even FutureSpot, and many other people seem to be having this same problem with smaller models. I found a tutorial-style article that may help. While I am not on the H4 team, I do have some knowledge of DPO.
I hope the link below helps solve your problem!

https://ai.plainenglish.io/direct-preference-optimization-dpo-a-simplified-approach-to-fine-tuning-large-language-models-bae1c6d7ec29
