# Llama-3 8B RLHF checkpoint trained by OpenRLHF

Trained with the following models and datasets:
- Base SFT model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Reward model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-mixture
- Prompt dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
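
These artifacts plug directly into OpenRLHF's PPO training entry point. The launch below is a sketch, not the exact command used for this checkpoint: the flag names follow OpenRLHF's published PPO example scripts and may differ between versions, and the batch-size/parallelism settings are placeholders (check `python -m openrlhf.cli.train_ppo --help` against your installed version):

```shell
# Sketch of a PPO launch with the models/datasets listed above.
# Flag names follow OpenRLHF's example scripts; verify against your version.
deepspeed --module openrlhf.cli.train_ppo \
    --pretrain OpenLLMAI/Llama-3-8b-sft-mixture \
    --reward_pretrain OpenLLMAI/Llama-3-8b-rm-mixture \
    --prompt_data OpenLLMAI/prompt-collection-v0.1 \
    --save_path ./checkpoint/llama-3-8b-rlhf \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --max_epochs 1 \
    --train_batch_size 128 \
    --rollout_batch_size 1024 \
    --prompt_max_len 2048 \
    --generate_max_len 2048 \
    --max_samples 100000 \
    --normalize_reward \
    --bf16 \
    --gradient_checkpointing
```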

## Training Hyperparameters

```
Actor Learning Rate: 5e-7
Critic Learning Rate: 9e-6
Learning Rate Scheduler: Cosine with 0.03 Warmup
PPO Epochs: 1
Training Batch Size: 128
Experience Buffer Size: 1024
Reward Normalization: True
Max Prompt Length: 2048
Max Response Length: 2048
Max Samples: 100k (to save GPU resources)
```
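
The "cosine with 0.03 warmup" schedule above can be sketched as follows. This is a minimal illustration, not OpenRLHF's internal scheduler: the function name, the total step count, and the warmup rounding are assumptions made for the example.

```python
import math

def ppo_lr(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup over the first `warmup_ratio` of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example: the actor's 5e-7 peak LR over a hypothetical 1000 optimizer steps.
schedule = [ppo_lr(s, 1000, 5e-7) for s in range(1000)]
```

With a 0.03 warmup ratio, the first 3% of steps ramp the rate up linearly; the rest follow a half cosine down to zero.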

## Evaluation

```
Chat-Arena-Hard
-------------------------------------------
llama-3-8b-sft       | score: 5.6
llama-3-8b-rlhf-100k | score: 20.5
```

## Training logs

<img src="https://cdn-uploads.huggingface.co/production/uploads/63f6c04ac96958470d1e9043/iqwD8jBAX1vhu0PT0ycy8.png" width="800px">