# Llama-3 8B RLHF checkpoint trained by OpenRLHF

Trained with the following models and datasets:
- Base SFT model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Reward model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-mixture
- Prompt dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
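
These artifacts plug directly into OpenRLHF's PPO training entry point. The launch below is a sketch, not the exact command used for this checkpoint: the flag names follow OpenRLHF's published PPO example scripts and may differ between versions, and the batch-size/parallelism settings are placeholders (check `python -m openrlhf.cli.train_ppo --help` against your installed version):

```shell
# Sketch of a PPO launch with the models/datasets listed above.
# Flag names follow OpenRLHF's example scripts; verify against your version.
deepspeed --module openrlhf.cli.train_ppo \
    --pretrain OpenLLMAI/Llama-3-8b-sft-mixture \
    --reward_pretrain OpenLLMAI/Llama-3-8b-rm-mixture \
    --prompt_data OpenLLMAI/prompt-collection-v0.1 \
    --save_path ./checkpoint/llama-3-8b-rlhf \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --max_epochs 1 \
    --train_batch_size 128 \
    --rollout_batch_size 1024 \
    --prompt_max_len 2048 \
    --generate_max_len 2048 \
    --max_samples 100000 \
    --normalize_reward \
    --bf16 \
    --gradient_checkpointing
```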

## Training Hyperparameters

```
Actor Learning Rate: 5e-7
Critic Learning Rate: 9e-6
Learning Rate Scheduler: Cosine with 0.03 Warmup
PPO Epochs: 1
Training Batch Size: 128
Experience Buffer Size: 1024
Reward Normalization: True
Max Prompt Length: 2048
Max Response Length: 2048
Max Samples: 100k (to save GPU resources)
```
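
The "cosine with 0.03 warmup" schedule above can be sketched as follows. This is a minimal illustration, not OpenRLHF's internal scheduler: the function name, the total step count, and the warmup rounding are assumptions made for the example.

```python
import math

def ppo_lr(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup over the first `warmup_ratio` of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example: the actor's 5e-7 peak LR over a hypothetical 1000 optimizer steps.
schedule = [ppo_lr(s, 1000, 5e-7) for s in range(1000)]
```

With a 0.03 warmup ratio, the first 3% of steps ramp the rate up linearly; the rest follow a half cosine down to zero.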

## Evaluation

```
Chat-Arena-Hard
-------------------------------------------
llama-3-8b-sft       | score: 5.6
llama-3-8b-rlhf-100k | score: 20.5
```

## Training logs

<img src="https://cdn-uploads.huggingface.co/production/uploads/63f6c04ac96958470d1e9043/iqwD8jBAX1vhu0PT0ycy8.png" width="800px">