Why is `multi_obj_rewards` multiplied by 5, but then has 0.5 subtracted from it?

#11
by xzuyn - opened

If it's being scaled from 0–1 up to 0–5, why would 0.5 be subtracted? Wouldn't this make the range −0.5 to 4.5?

Also is output.score supposed to be from 0 to 1 as well?

Lastly, does this model support multi-turn samples or system turns or is it only good (or capable) at doing single turn?

RLHFlow org

The original HelpSteer rating scale is 0–4, and I shift and re-scale it to 0.05–0.95.
The training labels are constrained to [0, 1], but `output.score` is not guaranteed to be in [0, 1], since we do not apply a sigmoid.
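A minimal sketch of the arithmetic, using only the transform and ranges quoted in this thread (the exact training-time rescaling constants are not shown here, so treat the endpoint values as illustrative, not as the model's definitive behavior):

```python
def rescale(score: float) -> float:
    """The inverse transform discussed in the thread: score * 5 - 0.5."""
    return score * 5 - 0.5

# The full [0, 1] training-label range maps to [-0.5, 4.5], not [0, 4.5]:
assert rescale(0.0) == -0.5
assert rescale(1.0) == 4.5

# The stated label range 0.05-0.95 maps to roughly -0.25 .. 4.25,
# i.e. approximately back onto the original 0-4 HelpSteer scale:
assert abs(rescale(0.05) - (-0.25)) < 1e-9
assert abs(rescale(0.95) - 4.25) < 1e-9
```

So the `- 0.5` undoes the shift applied when the 0–4 ratings were squeezed inside [0, 1] for training; since `output.score` itself is not clamped to [0, 1], the rescaled value can also fall slightly outside 0–4.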

Haoxiang-Wang changed discussion status to closed

Does the preference score take multiple turns or system turns into account? For example, could this model be useful for checking whether a model is following the system prompt correctly?
