Why is `multi_obj_rewards` multiplied by 5, but then has 0.5 subtracted from it?

#11
by xzuyn - opened

If it's being scaled from 0–1 up to 0–5, why would 0.5 be subtracted? Wouldn't this make the range −0.5 to 4.5?

Also is output.score supposed to be from 0 to 1 as well?

Lastly, does this model support multi-turn samples or system turns or is it only good (or capable) at doing single turn?

RLHFlow org

The original HelpSteer rating scale is 0–4, and I shift and re-scale it to 0.05–0.95.
The training labels are constrained to [0, 1], but `output.score` is not guaranteed to be in [0, 1], since we do not apply a sigmoid.
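A minimal sketch of the arithmetic, using only the transform and ranges quoted in this thread (the exact training-time rescaling constants are not shown here, so treat the endpoint values as illustrative, not as the model's definitive behavior):

```python
def rescale(score: float) -> float:
    """The inverse transform discussed in the thread: score * 5 - 0.5."""
    return score * 5 - 0.5

# The full [0, 1] training-label range maps to [-0.5, 4.5], not [0, 4.5]:
assert rescale(0.0) == -0.5
assert rescale(1.0) == 4.5

# The stated label range 0.05-0.95 maps to roughly -0.25 .. 4.25,
# i.e. approximately back onto the original 0-4 HelpSteer scale:
assert abs(rescale(0.05) - (-0.25)) < 1e-9
assert abs(rescale(0.95) - 4.25) < 1e-9
```

So the `- 0.5` undoes the shift applied when the 0–4 ratings were squeezed inside [0, 1] for training; since `output.score` itself is not clamped to [0, 1], the rescaled value can also fall slightly outside 0–4.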

Haoxiang-Wang changed discussion status to closed

Does the preference score take multiple turns or system turns into account? For example, could this model be useful for checking whether a model is following the system prompt correctly?
