OpenRLHF/Llama-3-8b-rlhf-100k

Llama-3 8B RLHF checkpoint trained by OpenRLHF

Models and datasets used:

Base model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
Reward model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-mixture
Prompt dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
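
As a rough illustration of how the checkpoint might be tried out, the sketch below loads it with the Hugging Face `transformers` library and generates a continuation. It assumes the repository id `OpenRLHF/Llama-3-8b-rlhf-100k`, that `torch`, `transformers`, and `accelerate` are installed, and that a plain text prompt is acceptable; the `generate` helper, its defaults, and the prompt format are illustrative choices, not something this card specifies.

```python
# Assumed repository id for this checkpoint on the Hugging Face Hub.
MODEL_ID = "OpenRLHF/Llama-3-8b-rlhf-100k"


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Hypothetical helper: load the checkpoint and return a completion."""
    # Heavy dependencies are imported lazily so the module can be imported
    # without downloading weights or requiring a GPU.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumption: bf16 is supported on the device
        device_map="auto",           # requires the `accelerate` package
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Explain PPO in one paragraph.")` would download the weights on first use. The prompt template used during training is not stated here, so output quality with a raw prompt may differ from the intended usage.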

Training Hyperparameters

Actor Learning Rate: 5e-7
Critic Learning Rate: 9e-6
Learning Rate Scheduler: Cosine with 0.03 warmup ratio
PPO epoch: 1
Training Batch Size: 128
Experience Buffer Size: 1024
Reward Normalization: True
Max Prompt Length: 2048
Max Response Length: 2048
Max Samples: 100k
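
To make the relationship between these values concrete, here is a minimal sketch of the arithmetic they imply. It assumes that the experience buffer is filled once per rollout, that each training batch corresponds to one optimizer step, that the last partial rollout counts as a full one, and that the cosine schedule decays to zero; none of these details are confirmed by this card, and `PPOConfig`, `steps_per_rollout`, `total_steps`, and `cosine_lr` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Hyperparameters copied from the list above (names are illustrative)."""
    actor_lr: float = 5e-7
    critic_lr: float = 9e-6
    warmup_ratio: float = 0.03
    ppo_epochs: int = 1
    train_batch_size: int = 128
    buffer_size: int = 1024
    max_samples: int = 100_000


def steps_per_rollout(cfg: PPOConfig) -> int:
    # Assumption: one optimizer step per training batch drawn from the buffer.
    return cfg.ppo_epochs * cfg.buffer_size // cfg.train_batch_size


def total_steps(cfg: PPOConfig) -> int:
    # Assumption: the final, partially filled rollout is rounded up.
    rollouts = math.ceil(cfg.max_samples / cfg.buffer_size)
    return rollouts * steps_per_rollout(cfg)


def cosine_lr(step: int, total: int, peak_lr: float,
              warmup_ratio: float, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay (min_lr=0 is an assumption)."""
    warmup = max(1, int(total * warmup_ratio))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


cfg = PPOConfig()
n_steps = total_steps(cfg)
print(steps_per_rollout(cfg), n_steps)
print(cosine_lr(0, n_steps, cfg.actor_lr, cfg.warmup_ratio))
```

Under these assumptions, each buffer of 1024 experiences yields 8 optimizer steps, and 100k samples correspond to 98 rollouts, or 784 steps in total. The actual training framework may count steps or shape the decay differently.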

Training logs