# Llama-3 8B RLHF checkpoint trained by OpenRLHF

This checkpoint was trained with the following models and datasets:

- Base SFT model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Reward model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-mixture
- Prompt dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
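
For reference, a training run combining these three artifacts with OpenRLHF's PPO entry point might be launched roughly as follows. This is a sketch, not the command actually used for this checkpoint: the flag names follow OpenRLHF's CLI but may not match the exact version used, and the mapping of "Experience Buffer Size" to `--rollout_batch_size` is an assumption.

```shell
# Hypothetical OpenRLHF PPO launch (flags are assumptions, not taken from this card)
deepspeed --module openrlhf.cli.train_ppo \
  --pretrain OpenLLMAI/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenLLMAI/Llama-3-8b-rm-mixture \
  --prompt_data OpenLLMAI/prompt-collection-v0.1 \
  --actor_learning_rate 5e-7 \
  --critic_learning_rate 9e-6 \
  --max_epochs 1 \
  --train_batch_size 128 \
  --rollout_batch_size 1024 \
  --normalize_reward \
  --prompt_max_len 2048 \
  --generate_max_len 2048 \
  --max_samples 100000
```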

## Training Hyperparameters

```
Actor Learning Rate: 5e-7
Critic Learning Rate: 9e-6
Learning Rate Scheduler: Cosine with 0.03 Warmup
PPO epoch: 1
Training Batch Size: 128
Experience Buffer Size: 1024
Reward Normalization: True
Max Prompt Length: 2048
Max Response Length: 2048
Max Samples: 100k
```
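
`Reward Normalization: True` means the reward signal is standardized before it is used for advantage estimation, which keeps PPO updates stable across prompts with very different reward scales. A minimal sketch of batch-level standardization is below; OpenRLHF's actual implementation may normalize with running statistics rather than per batch, so treat this as illustrative only.

```python
from statistics import mean, pstdev

def normalize_rewards(rewards, eps=1e-8):
    # Shift the batch to zero mean and scale to unit variance.
    # eps guards against division by zero when all rewards are equal.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(normalize_rewards([1.0, 2.0, 3.0, 4.0]))
```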

## Evaluation

```
Chat-Arena-Hard
-------------------------------------------
llama-3-8b-sft                 | score: 5.6   
llama-3-8b-rlhf-100k           | score: 20.5
```


## Training logs

<img src="https://cdn-uploads.huggingface.co/production/uploads/63f6c04ac96958470d1e9043/iqwD8jBAX1vhu0PT0ycy8.png" width="800px">