Why is `multi_obj_rewards` multiplied by 5, but then 0.5 is subtracted from it?
#11 by xzuyn - opened
If it's being scaled from 0 to 1 to be from 0 to 5, why would 0.5 be subtracted? Wouldn't this make it from 0 to 4.5?
Also, is `output.score` supposed to be from 0 to 1 as well?
Lastly, does this model support multi-turn samples or system turns, or is it only good at (or capable of) single-turn samples?
The original HelpSteer rating scale is 0-4, and I shift & re-scale it to 0.05-0.95.
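To make the arithmetic concrete, here is a minimal sketch that relies only on the expression from the thread title, `multi_obj_rewards * 5 - 0.5`. The function names are hypothetical and this is not the model's actual preprocessing code. It shows that a value in [0, 1] maps to [-0.5, 4.5] rather than [0, 4.5], and that the algebraic inverse of the expression places the integer ratings 0-4 at 0.1, 0.3, 0.5, 0.7, 0.9. The reply above quotes 0.05-0.95 as the label range, so the exact transform used for training may differ from this inverse.

```python
def to_rating_scale(reward: float) -> float:
    """Apply the expression from the thread title: reward * 5 - 0.5."""
    return reward * 5 - 0.5


def to_label_scale(rating: float) -> float:
    """Algebraic inverse of to_rating_scale: (rating + 0.5) / 5.

    Assumption: this is only the inverse of the title's expression,
    not a confirmed copy of the training-time label transform.
    """
    return (rating + 0.5) / 5


# Endpoints of [0, 1] after multiplying by 5 and subtracting 0.5.
low, high = to_rating_scale(0.0), to_rating_scale(1.0)

# Where the integer HelpSteer ratings 0..4 land under the inverse map.
labels = [round(to_label_scale(r), 2) for r in range(5)]

# Round trip: label scale -> rating scale recovers the integer ratings.
recovered = [round(to_rating_scale(to_label_scale(r)), 6) for r in range(5)]

print(low, high)
print(labels)
print(recovered)
```

Under this reading, the 0.5 offset is what lets a label strictly inside (0, 1) map back onto the 0-4 rating scale.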
The training labels are constrained to [0,1], but `output.score` is not guaranteed to be in [0,1], since we do not apply sigmoid.
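As an illustration of why the score can leave [0,1], here is a small sketch using made-up raw values (these are not real model outputs). A regression-style output with no squashing function can take any real value, while passing it through a sigmoid would force it into (0, 1). Applying a sigmoid here is shown only for contrast; the reply above says the model does not do this.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw scores, chosen only to illustrate the point.
raw_scores = [-1.3, 0.2, 0.8, 2.5]

# Without a sigmoid, nothing keeps the values inside [0, 1].
inside_unit_interval = [0.0 <= s <= 1.0 for s in raw_scores]

# With a sigmoid, every value lands strictly between 0 and 1.
squashed = [sigmoid(s) for s in raw_scores]

print(inside_unit_interval)
print([round(s, 3) for s in squashed])
```

In practice this means the score is best treated as a relative value for comparing responses, not as a probability.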
Haoxiang-Wang changed discussion status to closed
Does the preference score take multiple turns or system turns into account? Like could this model be useful for checking if a model is following the system prompt correctly?