---
library_name: peft
base_model: EleutherAI/pythia-410m-deduped
license: apache-2.0
datasets:
- argilla/dpo-mix-7k
tags:
- RLHF
- RLAIF
- PPO
- RM
- reward-model
- reward_model
---

# sapphia-410m-RM
super duper ultra highly experimental LoRA finetune of EleutherAI/pythia-410m-deduped on argilla/dpo-mix-7k, trained to be a reward model.
## why?
nexusflow achieved good results with traditional reward model finetuning! why not meeeeeee :3
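for context: a reward model trained on preference pairs (like the chosen/rejected pairs in argilla/dpo-mix-7k) classically optimizes a Bradley-Terry pairwise loss. a minimal sketch in plain Python of what that objective looks like — illustrative only, not the actual training code for this model:

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry reward-model loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the model scores the chosen completion higher
    than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) rewritten stably as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# a correctly ordered pair incurs less loss than a flipped one
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))  # True
```

during finetuning, the two rewards come from the same model scoring both completions of a pair, and the loss is averaged over the batch.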