Why not use the Plackett-Luce Model version of DPO when K=4 ranked responses are present?

by MasterGodzilla - opened


The original DPO paper includes a version of DPO that can handle multiple ranked responses.
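For reference, this is the Plackett-Luce objective as I understand it from the paper's appendix. Here $\tau$ is the ranking over the $K$ responses (so $y_{\tau(1)}$ is the best), $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ is the reference model, and $\beta$ is the usual DPO temperature:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta, \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{\tau,\, y_1, \ldots, y_K,\, x \sim \mathcal{D}}
    \left[
      \log \prod_{k=1}^{K}
      \frac{\exp\!\left(\beta \log \frac{\pi_\theta(y_{\tau(k)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(k)} \mid x)}\right)}
           {\sum_{j=k}^{K} \exp\!\left(\beta \log \frac{\pi_\theta(y_{\tau(j)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(j)} \mid x)}\right)}
    \right]
```

With $K = 2$ this reduces to the standard pairwise (Bradley-Terry) DPO loss.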

Since you are ranking responses from 4 models using the UltraFeedback framework, the Plackett-Luce version would likely provide more information to the instruction tuning process at only about twice the computation cost.
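To make the suggestion concrete, here is a minimal sketch of the Plackett-Luce DPO loss for a single prompt, written with NumPy. It assumes the summed sequence log-probabilities of the K responses under the policy and the reference model are already computed and ordered from best to worst. The function name `plackett_luce_dpo_loss` and the default `beta=0.1` are my own illustrative choices, not taken from the paper's code or from any library.

```python
import numpy as np


def plackett_luce_dpo_loss(policy_logps, ref_logps, beta=0.1):
    """Plackett-Luce DPO loss for one prompt (illustrative sketch).

    policy_logps, ref_logps: length-K sequences of summed log-probabilities
    of the K responses, ordered from best-ranked to worst-ranked.
    """
    policy_logps = np.asarray(policy_logps, dtype=float)
    ref_logps = np.asarray(ref_logps, dtype=float)

    # Implicit DPO reward of each response: beta * log(pi_theta / pi_ref).
    rewards = beta * (policy_logps - ref_logps)

    loss = 0.0
    num_responses = len(rewards)
    # The final term (k = K) is log(1) = 0, so it can be skipped.
    for k in range(num_responses - 1):
        tail = rewards[k:]
        # Numerically stable log-sum-exp over the responses ranked k or lower.
        shift = tail.max()
        log_normalizer = shift + np.log(np.exp(tail - shift).sum())
        # Negative log-probability that response k is chosen first among the tail.
        loss -= rewards[k] - log_normalizer
    return loss


# Example: four responses, ranked best to worst (made-up log-probabilities).
example_loss = plackett_luce_dpo_loss(
    policy_logps=[-10.0, -12.0, -11.0, -15.0],
    ref_logps=[-11.0, -12.0, -10.5, -14.0],
)
```

Each training example needs log-probabilities for all 4 responses rather than 2, which is where the roughly 2x cost comes from.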

Why did you decide against it and instead keep only "the highest scoring response as yw and a random lower scoring prompt as yl" from the four responses?

