hugodk-sch committed on
Commit e0e256f
1 Parent(s): f947ec6

Model save

README.md ADDED
@@ -0,0 +1,86 @@
---
library_name: peft
tags:
- trl
- dpo
- generated_from_trainer
base_model: NorLLM-AI/NorMistral-7B
model-index:
- name: norllm-ai-normistral-7b-align-scan
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# norllm-ai-normistral-7b-align-scan

This model is a fine-tuned version of [NorLLM-AI/NorMistral-7B](https://huggingface.co/NorLLM-AI/NorMistral-7B) on an unspecified dataset.
It achieves the following results on the evaluation set:
- Loss: 0.8088
- Rewards/chosen: -1.1685
- Rewards/rejected: -1.5136
- Rewards/accuracies: 0.5918
- Rewards/margins: 0.3451
- Logps/rejected: -37.2208
- Logps/chosen: -33.2299
- Logits/rejected: -2.8265
- Logits/chosen: -2.8292

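The reward figures above are reported in what is presumably the standard TRL `DPOTrainer` convention: the implicit reward of a completion is the β-scaled log-probability ratio between the tuned policy and the frozen reference model,

$$r_\theta(x, y) = \beta \,\bigl(\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\bigr),$$

so `Rewards/margins` is the mean of $r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}})$ over evaluation pairs, and `Rewards/accuracies` is the fraction of pairs where the chosen reward exceeds the rejected one. The β used for this run is not recorded in this card.
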
## Model description

More information needed

## Intended uses & limitations

More information needed
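
Until the sections above are filled in, the following loading sketch may help. It assumes the adapter is published as `hugodk-sch/norllm-ai-normistral-7b-align-scan` (inferred from the commit author and model name; the actual repo id may differ) and that a standard PEFT adapter layout is used:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "NorLLM-AI/NorMistral-7B"
# Assumed adapter repo id (commit author + model name); adjust if it lives elsewhere.
adapter_id = "hugodk-sch/norllm-ai-normistral-7b-align-scan"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach the DPO-tuned adapter

prompt = "Skriv et kort dikt om fjordene."  # illustrative Norwegian prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```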

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a configuration sketch reproducing them follows the list):
- learning_rate: 5e-06
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 4
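
A sketch of how these values map onto a 🤗 `TrainingArguments` object is shown below. This is not the original training script: the dataset, the PEFT/LoRA configuration, and the DPO β are not recorded in this card, so the commented names (`model`, `train_ds`, `eval_ds`, `peft_config`, `tokenizer`) are placeholders.

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; Adam betas=(0.9, 0.999) and
# epsilon=1e-08 are the Transformers optimizer defaults, matching the card.
training_args = TrainingArguments(
    output_dir="norllm-ai-normistral-7b-align-scan",
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # train_batch_size: 4
    per_device_eval_batch_size=8,    # eval_batch_size: 8
    gradient_accumulation_steps=2,   # 4 x 2 = total_train_batch_size 8
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    evaluation_strategy="steps",
    eval_steps=100,                  # evaluation every 100 steps, as in the table below
)

# The trainer call would look roughly like this (placeholders, not runnable as-is):
# from trl import DPOTrainer
# trainer = DPOTrainer(
#     model, ref_model=None, args=training_args,
#     beta=...,                      # DPO beta is not recorded in this card
#     train_dataset=train_ds, eval_dataset=eval_ds,
#     tokenizer=tokenizer, peft_config=peft_config,
# )
# trainer.train()
```

With 1540 optimizer steps in total (see `trainer_state.json`), `warmup_ratio=0.1` corresponds to 154 warmup steps, which is consistent with the first logged learning rate of 5e-06 / 154 ≈ 3.25e-08.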

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6746 | 0.26 | 100 | 0.6828 | 0.0185 | -0.0185 | 0.5694 | 0.0370 | -34.7290 | -31.2516 | -2.8058 | -2.8084 |
| 0.6195 | 0.52 | 200 | 0.6735 | -0.0458 | -0.1322 | 0.5511 | 0.0864 | -34.9185 | -31.3587 | -2.8176 | -2.8201 |
| 0.5567 | 0.78 | 300 | 0.6810 | -0.1233 | -0.2426 | 0.5723 | 0.1192 | -35.1024 | -31.4880 | -2.8203 | -2.8231 |
| 0.2251 | 1.04 | 400 | 0.6779 | -0.3249 | -0.4970 | 0.6013 | 0.1720 | -35.5264 | -31.8240 | -2.8175 | -2.8204 |
| 0.2082 | 1.3 | 500 | 0.6859 | -0.4136 | -0.6723 | 0.6092 | 0.2587 | -35.8186 | -31.9717 | -2.8475 | -2.8487 |
| 0.2119 | 1.56 | 600 | 0.6993 | -0.5421 | -0.7899 | 0.5926 | 0.2478 | -36.0147 | -32.1860 | -2.8301 | -2.8322 |
| 0.1579 | 1.82 | 700 | 0.7178 | -0.6062 | -0.8251 | 0.5806 | 0.2189 | -36.0734 | -32.2928 | -2.8261 | -2.8284 |
| 0.0649 | 2.08 | 800 | 0.7260 | -0.7190 | -1.0000 | 0.6071 | 0.2810 | -36.3648 | -32.4808 | -2.8243 | -2.8271 |
| 0.1014 | 2.34 | 900 | 0.7758 | -1.0050 | -1.3365 | 0.5831 | 0.3315 | -36.9256 | -32.9574 | -2.8278 | -2.8304 |
| 0.0425 | 2.6 | 1000 | 0.7952 | -1.0994 | -1.4459 | 0.5826 | 0.3465 | -37.1080 | -33.1148 | -2.8238 | -2.8267 |
| 0.0878 | 2.86 | 1100 | 0.7929 | -1.0931 | -1.4389 | 0.5889 | 0.3458 | -37.0962 | -33.1042 | -2.8257 | -2.8283 |
| 0.0534 | 3.12 | 1200 | 0.7997 | -1.1321 | -1.4857 | 0.5889 | 0.3535 | -37.1742 | -33.1693 | -2.8258 | -2.8285 |
| 0.035 | 3.38 | 1300 | 0.8024 | -1.1445 | -1.5019 | 0.5889 | 0.3575 | -37.2014 | -33.1899 | -2.8266 | -2.8291 |
| 0.0126 | 3.64 | 1400 | 0.8126 | -1.1630 | -1.5088 | 0.5860 | 0.3457 | -37.2128 | -33.2208 | -2.8267 | -2.8294 |
| 0.0525 | 3.9 | 1500 | 0.8088 | -1.1685 | -1.5136 | 0.5918 | 0.3451 | -37.2208 | -33.2299 | -2.8265 | -2.8292 |


### Framework versions

- PEFT 0.10.0
- Transformers 4.39.0.dev0
- Pytorch 2.1.2+cu121
- Datasets 2.14.6
- Tokenizers 0.15.1
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:095bcec096ec0dcb3328313ace617ab9e3f44b32d89f46c73f684e05bc923672
+ oid sha256:1f903bf7494b061ea5d02422638c75b2cb7e337e53433a4388c2d5ffb921c9e0
  size 671150064
all_results.json ADDED
@@ -0,0 +1,8 @@
{
    "epoch": 4.0,
    "train_loss": 0.21652605050763526,
    "train_runtime": 11213.836,
    "train_samples": 3079,
    "train_samples_per_second": 1.098,
    "train_steps_per_second": 0.137
}
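
As a rough consistency check on these figures (using the final `global_step` of 1540 recorded in `trainer_state.json`): 3079 training samples × 4 epochs ≈ 12,316 examples processed in 11,213.8 s ≈ 1.098 samples/s, and ⌈3079 / 8⌉ = 385 optimizer steps per epoch × 4 epochs = 1540 steps ≈ 0.137 steps/s.
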
train_results.json ADDED
@@ -0,0 +1,8 @@
{
    "epoch": 4.0,
    "train_loss": 0.21652605050763526,
    "train_runtime": 11213.836,
    "train_samples": 3079,
    "train_samples_per_second": 1.098,
    "train_steps_per_second": 0.137
}
trainer_state.json ADDED
@@ -0,0 +1,2595 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 4.0,
5
+ "eval_steps": 100,
6
+ "global_step": 1540,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.0,
13
+ "grad_norm": 23.75,
14
+ "learning_rate": 3.2467532467532474e-08,
15
+ "logits/chosen": -2.7358343601226807,
16
+ "logits/rejected": -2.7480404376983643,
17
+ "logps/chosen": -27.35565757751465,
18
+ "logps/rejected": -21.06114387512207,
19
+ "loss": 0.6931,
20
+ "rewards/accuracies": 0.0,
21
+ "rewards/chosen": 0.0,
22
+ "rewards/margins": 0.0,
23
+ "rewards/rejected": 0.0,
24
+ "step": 1
25
+ },
26
+ {
27
+ "epoch": 0.03,
28
+ "grad_norm": 38.5,
29
+ "learning_rate": 3.2467532467532465e-07,
30
+ "logits/chosen": -3.009772777557373,
31
+ "logits/rejected": -2.999285936355591,
32
+ "logps/chosen": -33.21327209472656,
33
+ "logps/rejected": -31.971134185791016,
34
+ "loss": 0.7026,
35
+ "rewards/accuracies": 0.4027777910232544,
36
+ "rewards/chosen": -0.019398299977183342,
37
+ "rewards/margins": -0.015061982907354832,
38
+ "rewards/rejected": -0.0043363189324736595,
39
+ "step": 10
40
+ },
41
+ {
42
+ "epoch": 0.05,
43
+ "grad_norm": 27.375,
44
+ "learning_rate": 6.493506493506493e-07,
45
+ "logits/chosen": -2.89970064163208,
46
+ "logits/rejected": -2.8947174549102783,
47
+ "logps/chosen": -32.48947525024414,
48
+ "logps/rejected": -28.9757080078125,
49
+ "loss": 0.7014,
50
+ "rewards/accuracies": 0.4375,
51
+ "rewards/chosen": -0.008624804206192493,
52
+ "rewards/margins": -0.012319705449044704,
53
+ "rewards/rejected": 0.0036949000786989927,
54
+ "step": 20
55
+ },
56
+ {
57
+ "epoch": 0.08,
58
+ "grad_norm": 25.25,
59
+ "learning_rate": 9.740259740259742e-07,
60
+ "logits/chosen": -3.09592866897583,
61
+ "logits/rejected": -3.107868194580078,
62
+ "logps/chosen": -32.89606857299805,
63
+ "logps/rejected": -30.1730899810791,
64
+ "loss": 0.6998,
65
+ "rewards/accuracies": 0.48750001192092896,
66
+ "rewards/chosen": 0.007569731678813696,
67
+ "rewards/margins": -0.006207291968166828,
68
+ "rewards/rejected": 0.013777022249996662,
69
+ "step": 30
70
+ },
71
+ {
72
+ "epoch": 0.1,
73
+ "grad_norm": 27.125,
74
+ "learning_rate": 1.2987012987012986e-06,
75
+ "logits/chosen": -2.8652756214141846,
76
+ "logits/rejected": -2.8561177253723145,
77
+ "logps/chosen": -31.76981544494629,
78
+ "logps/rejected": -32.3731575012207,
79
+ "loss": 0.6743,
80
+ "rewards/accuracies": 0.612500011920929,
81
+ "rewards/chosen": 0.03630450740456581,
82
+ "rewards/margins": 0.04572600871324539,
83
+ "rewards/rejected": -0.009421499446034431,
84
+ "step": 40
85
+ },
86
+ {
87
+ "epoch": 0.13,
88
+ "grad_norm": 21.125,
89
+ "learning_rate": 1.6233766233766235e-06,
90
+ "logits/chosen": -2.8860361576080322,
91
+ "logits/rejected": -2.8838560581207275,
92
+ "logps/chosen": -29.639389038085938,
93
+ "logps/rejected": -30.066131591796875,
94
+ "loss": 0.6763,
95
+ "rewards/accuracies": 0.5,
96
+ "rewards/chosen": 0.06046304851770401,
97
+ "rewards/margins": 0.04677456617355347,
98
+ "rewards/rejected": 0.013688492588698864,
99
+ "step": 50
100
+ },
101
+ {
102
+ "epoch": 0.16,
103
+ "grad_norm": 23.25,
104
+ "learning_rate": 1.9480519480519483e-06,
105
+ "logits/chosen": -2.915588855743408,
106
+ "logits/rejected": -2.9170215129852295,
107
+ "logps/chosen": -30.02178382873535,
108
+ "logps/rejected": -27.954660415649414,
109
+ "loss": 0.6659,
110
+ "rewards/accuracies": 0.612500011920929,
111
+ "rewards/chosen": 0.06479138135910034,
112
+ "rewards/margins": 0.06417764723300934,
113
+ "rewards/rejected": 0.0006137322634458542,
114
+ "step": 60
115
+ },
116
+ {
117
+ "epoch": 0.18,
118
+ "grad_norm": 36.75,
119
+ "learning_rate": 2.2727272727272728e-06,
120
+ "logits/chosen": -2.995821475982666,
121
+ "logits/rejected": -3.002281665802002,
122
+ "logps/chosen": -29.207416534423828,
123
+ "logps/rejected": -30.850021362304688,
124
+ "loss": 0.6903,
125
+ "rewards/accuracies": 0.5,
126
+ "rewards/chosen": 0.044131673872470856,
127
+ "rewards/margins": 0.019063914194703102,
128
+ "rewards/rejected": 0.025067755952477455,
129
+ "step": 70
130
+ },
131
+ {
132
+ "epoch": 0.21,
133
+ "grad_norm": 29.875,
134
+ "learning_rate": 2.597402597402597e-06,
135
+ "logits/chosen": -2.8110384941101074,
136
+ "logits/rejected": -2.827033519744873,
137
+ "logps/chosen": -29.421245574951172,
138
+ "logps/rejected": -29.7005558013916,
139
+ "loss": 0.6571,
140
+ "rewards/accuracies": 0.675000011920929,
141
+ "rewards/chosen": 0.07168709486722946,
142
+ "rewards/margins": 0.08745081722736359,
143
+ "rewards/rejected": -0.015763718634843826,
144
+ "step": 80
145
+ },
146
+ {
147
+ "epoch": 0.23,
148
+ "grad_norm": 25.0,
149
+ "learning_rate": 2.922077922077922e-06,
150
+ "logits/chosen": -2.89847993850708,
151
+ "logits/rejected": -2.8813650608062744,
152
+ "logps/chosen": -32.741859436035156,
153
+ "logps/rejected": -30.076095581054688,
154
+ "loss": 0.6708,
155
+ "rewards/accuracies": 0.6499999761581421,
156
+ "rewards/chosen": 0.0567438118159771,
157
+ "rewards/margins": 0.08941256999969482,
158
+ "rewards/rejected": -0.03266875073313713,
159
+ "step": 90
160
+ },
161
+ {
162
+ "epoch": 0.26,
163
+ "grad_norm": 23.0,
164
+ "learning_rate": 3.246753246753247e-06,
165
+ "logits/chosen": -3.0026650428771973,
166
+ "logits/rejected": -3.0032451152801514,
167
+ "logps/chosen": -31.96946144104004,
168
+ "logps/rejected": -30.823959350585938,
169
+ "loss": 0.6746,
170
+ "rewards/accuracies": 0.5375000238418579,
171
+ "rewards/chosen": 0.052449680864810944,
172
+ "rewards/margins": 0.05712243169546127,
173
+ "rewards/rejected": -0.00467275083065033,
174
+ "step": 100
175
+ },
176
+ {
177
+ "epoch": 0.26,
178
+ "eval_logits/chosen": -2.8084466457366943,
179
+ "eval_logits/rejected": -2.8058016300201416,
180
+ "eval_logps/chosen": -31.25157928466797,
181
+ "eval_logps/rejected": -34.729000091552734,
182
+ "eval_loss": 0.6828339099884033,
183
+ "eval_rewards/accuracies": 0.5693521499633789,
184
+ "eval_rewards/chosen": 0.01852412335574627,
185
+ "eval_rewards/margins": 0.0370328426361084,
186
+ "eval_rewards/rejected": -0.01850871555507183,
187
+ "eval_runtime": 113.0199,
188
+ "eval_samples_per_second": 3.035,
189
+ "eval_steps_per_second": 0.38,
190
+ "step": 100
191
+ },
192
+ {
193
+ "epoch": 0.29,
194
+ "grad_norm": 30.75,
195
+ "learning_rate": 3.5714285714285718e-06,
196
+ "logits/chosen": -2.9552407264709473,
197
+ "logits/rejected": -2.931591510772705,
198
+ "logps/chosen": -32.020774841308594,
199
+ "logps/rejected": -31.225658416748047,
200
+ "loss": 0.6329,
201
+ "rewards/accuracies": 0.699999988079071,
202
+ "rewards/chosen": 0.11761404573917389,
203
+ "rewards/margins": 0.1576995849609375,
204
+ "rewards/rejected": -0.040085554122924805,
205
+ "step": 110
206
+ },
207
+ {
208
+ "epoch": 0.31,
209
+ "grad_norm": 23.75,
210
+ "learning_rate": 3.896103896103897e-06,
211
+ "logits/chosen": -3.03885817527771,
212
+ "logits/rejected": -3.067966938018799,
213
+ "logps/chosen": -28.88214683532715,
214
+ "logps/rejected": -34.20409393310547,
215
+ "loss": 0.621,
216
+ "rewards/accuracies": 0.574999988079071,
217
+ "rewards/chosen": 0.16905571520328522,
218
+ "rewards/margins": 0.20298466086387634,
219
+ "rewards/rejected": -0.03392895311117172,
220
+ "step": 120
221
+ },
222
+ {
223
+ "epoch": 0.34,
224
+ "grad_norm": 19.375,
225
+ "learning_rate": 4.220779220779221e-06,
226
+ "logits/chosen": -2.741839647293091,
227
+ "logits/rejected": -2.737023115158081,
228
+ "logps/chosen": -28.76812171936035,
229
+ "logps/rejected": -30.218597412109375,
230
+ "loss": 0.6385,
231
+ "rewards/accuracies": 0.637499988079071,
232
+ "rewards/chosen": 0.1239342913031578,
233
+ "rewards/margins": 0.17480790615081787,
234
+ "rewards/rejected": -0.050873614847660065,
235
+ "step": 130
236
+ },
237
+ {
238
+ "epoch": 0.36,
239
+ "grad_norm": 20.875,
240
+ "learning_rate": 4.5454545454545455e-06,
241
+ "logits/chosen": -3.016112804412842,
242
+ "logits/rejected": -3.013633966445923,
243
+ "logps/chosen": -27.28360366821289,
244
+ "logps/rejected": -31.76962661743164,
245
+ "loss": 0.6517,
246
+ "rewards/accuracies": 0.550000011920929,
247
+ "rewards/chosen": 0.12157317250967026,
248
+ "rewards/margins": 0.18928703665733337,
249
+ "rewards/rejected": -0.06771388649940491,
250
+ "step": 140
251
+ },
252
+ {
253
+ "epoch": 0.39,
254
+ "grad_norm": 20.0,
255
+ "learning_rate": 4.870129870129871e-06,
256
+ "logits/chosen": -2.8128550052642822,
257
+ "logits/rejected": -2.807783603668213,
258
+ "logps/chosen": -27.434728622436523,
259
+ "logps/rejected": -31.449132919311523,
260
+ "loss": 0.5642,
261
+ "rewards/accuracies": 0.675000011920929,
262
+ "rewards/chosen": 0.23446612060070038,
263
+ "rewards/margins": 0.3697589635848999,
264
+ "rewards/rejected": -0.13529284298419952,
265
+ "step": 150
266
+ },
267
+ {
268
+ "epoch": 0.42,
269
+ "grad_norm": 26.5,
270
+ "learning_rate": 4.999768804644796e-06,
271
+ "logits/chosen": -3.129559278488159,
272
+ "logits/rejected": -3.1123318672180176,
273
+ "logps/chosen": -31.945110321044922,
274
+ "logps/rejected": -29.27609634399414,
275
+ "loss": 0.5266,
276
+ "rewards/accuracies": 0.7749999761581421,
277
+ "rewards/chosen": 0.32796427607536316,
278
+ "rewards/margins": 0.5057547688484192,
279
+ "rewards/rejected": -0.17779052257537842,
280
+ "step": 160
281
+ },
282
+ {
283
+ "epoch": 0.44,
284
+ "grad_norm": 23.875,
285
+ "learning_rate": 4.998356098992574e-06,
286
+ "logits/chosen": -2.942965030670166,
287
+ "logits/rejected": -2.950735569000244,
288
+ "logps/chosen": -29.63945960998535,
289
+ "logps/rejected": -31.524499893188477,
290
+ "loss": 0.5615,
291
+ "rewards/accuracies": 0.6875,
292
+ "rewards/chosen": 0.15309393405914307,
293
+ "rewards/margins": 0.408103883266449,
294
+ "rewards/rejected": -0.2550099492073059,
295
+ "step": 170
296
+ },
297
+ {
298
+ "epoch": 0.47,
299
+ "grad_norm": 23.25,
300
+ "learning_rate": 4.9956598544545566e-06,
301
+ "logits/chosen": -2.7956109046936035,
302
+ "logits/rejected": -2.793761730194092,
303
+ "logps/chosen": -29.341201782226562,
304
+ "logps/rejected": -30.051944732666016,
305
+ "loss": 0.5951,
306
+ "rewards/accuracies": 0.637499988079071,
307
+ "rewards/chosen": 0.19718877971172333,
308
+ "rewards/margins": 0.3674519658088684,
309
+ "rewards/rejected": -0.17026321589946747,
310
+ "step": 180
311
+ },
312
+ {
313
+ "epoch": 0.49,
314
+ "grad_norm": 14.0625,
315
+ "learning_rate": 4.991681456235483e-06,
316
+ "logits/chosen": -2.9083733558654785,
317
+ "logits/rejected": -2.904571533203125,
318
+ "logps/chosen": -29.67318344116211,
319
+ "logps/rejected": -28.667194366455078,
320
+ "loss": 0.5737,
321
+ "rewards/accuracies": 0.7124999761581421,
322
+ "rewards/chosen": 0.30811822414398193,
323
+ "rewards/margins": 0.4939153790473938,
324
+ "rewards/rejected": -0.18579718470573425,
325
+ "step": 190
326
+ },
327
+ {
328
+ "epoch": 0.52,
329
+ "grad_norm": 13.375,
330
+ "learning_rate": 4.986422948250881e-06,
331
+ "logits/chosen": -2.9793169498443604,
332
+ "logits/rejected": -2.967294216156006,
333
+ "logps/chosen": -33.094722747802734,
334
+ "logps/rejected": -30.4979248046875,
335
+ "loss": 0.6195,
336
+ "rewards/accuracies": 0.637499988079071,
337
+ "rewards/chosen": 0.36257725954055786,
338
+ "rewards/margins": 0.40088003873825073,
339
+ "rewards/rejected": -0.03830284625291824,
340
+ "step": 200
341
+ },
342
+ {
343
+ "epoch": 0.52,
344
+ "eval_logits/chosen": -2.8200676441192627,
345
+ "eval_logits/rejected": -2.8176209926605225,
346
+ "eval_logps/chosen": -31.358734130859375,
347
+ "eval_logps/rejected": -34.91850280761719,
348
+ "eval_loss": 0.6735296845436096,
349
+ "eval_rewards/accuracies": 0.5510797500610352,
350
+ "eval_rewards/chosen": -0.04576955363154411,
351
+ "eval_rewards/margins": 0.08643829822540283,
352
+ "eval_rewards/rejected": -0.13220785558223724,
353
+ "eval_runtime": 112.8088,
354
+ "eval_samples_per_second": 3.041,
355
+ "eval_steps_per_second": 0.381,
356
+ "step": 200
357
+ },
358
+ {
359
+ "epoch": 0.55,
360
+ "grad_norm": 20.125,
361
+ "learning_rate": 4.9798870320769884e-06,
362
+ "logits/chosen": -2.9180569648742676,
363
+ "logits/rejected": -2.918910264968872,
364
+ "logps/chosen": -32.37510299682617,
365
+ "logps/rejected": -34.01622772216797,
366
+ "loss": 0.5638,
367
+ "rewards/accuracies": 0.6875,
368
+ "rewards/chosen": 0.39498284459114075,
369
+ "rewards/margins": 0.48783811926841736,
370
+ "rewards/rejected": -0.0928553119301796,
371
+ "step": 210
372
+ },
373
+ {
374
+ "epoch": 0.57,
375
+ "grad_norm": 16.125,
376
+ "learning_rate": 4.9720770655628216e-06,
377
+ "logits/chosen": -2.898266553878784,
378
+ "logits/rejected": -2.913907527923584,
379
+ "logps/chosen": -29.420312881469727,
380
+ "logps/rejected": -28.794261932373047,
381
+ "loss": 0.5663,
382
+ "rewards/accuracies": 0.6625000238418579,
383
+ "rewards/chosen": 0.45463672280311584,
384
+ "rewards/margins": 0.6142338514328003,
385
+ "rewards/rejected": -0.15959712862968445,
386
+ "step": 220
387
+ },
388
+ {
389
+ "epoch": 0.6,
390
+ "grad_norm": 18.625,
391
+ "learning_rate": 4.96299706110506e-06,
392
+ "logits/chosen": -2.944633960723877,
393
+ "logits/rejected": -2.948894500732422,
394
+ "logps/chosen": -30.680644989013672,
395
+ "logps/rejected": -31.86056137084961,
396
+ "loss": 0.6034,
397
+ "rewards/accuracies": 0.637499988079071,
398
+ "rewards/chosen": 0.28273314237594604,
399
+ "rewards/margins": 0.34417831897735596,
400
+ "rewards/rejected": -0.061445169150829315,
401
+ "step": 230
402
+ },
403
+ {
404
+ "epoch": 0.62,
405
+ "grad_norm": 18.875,
406
+ "learning_rate": 4.952651683586668e-06,
407
+ "logits/chosen": -3.0014634132385254,
408
+ "logits/rejected": -3.009148597717285,
409
+ "logps/chosen": -29.685632705688477,
410
+ "logps/rejected": -30.464202880859375,
411
+ "loss": 0.4269,
412
+ "rewards/accuracies": 0.862500011920929,
413
+ "rewards/chosen": 0.7315748333930969,
414
+ "rewards/margins": 0.868951141834259,
415
+ "rewards/rejected": -0.13737639784812927,
416
+ "step": 240
417
+ },
418
+ {
419
+ "epoch": 0.65,
420
+ "grad_norm": 21.875,
421
+ "learning_rate": 4.9410462479802945e-06,
422
+ "logits/chosen": -2.8361315727233887,
423
+ "logits/rejected": -2.826373338699341,
424
+ "logps/chosen": -26.182703018188477,
425
+ "logps/rejected": -29.741592407226562,
426
+ "loss": 0.5404,
427
+ "rewards/accuracies": 0.75,
428
+ "rewards/chosen": 0.46290817856788635,
429
+ "rewards/margins": 0.5888797640800476,
430
+ "rewards/rejected": -0.12597161531448364,
431
+ "step": 250
432
+ },
433
+ {
434
+ "epoch": 0.68,
435
+ "grad_norm": 12.3125,
436
+ "learning_rate": 4.928186716617686e-06,
437
+ "logits/chosen": -2.8184003829956055,
438
+ "logits/rejected": -2.837860345840454,
439
+ "logps/chosen": -28.799774169921875,
440
+ "logps/rejected": -34.532344818115234,
441
+ "loss": 0.5319,
442
+ "rewards/accuracies": 0.75,
443
+ "rewards/chosen": 0.6181110739707947,
444
+ "rewards/margins": 0.8330680131912231,
445
+ "rewards/rejected": -0.21495695412158966,
446
+ "step": 260
447
+ },
448
+ {
449
+ "epoch": 0.7,
450
+ "grad_norm": 16.125,
451
+ "learning_rate": 4.914079696126526e-06,
452
+ "logits/chosen": -2.9628233909606934,
453
+ "logits/rejected": -2.968907356262207,
454
+ "logps/chosen": -29.94136619567871,
455
+ "logps/rejected": -30.401519775390625,
456
+ "loss": 0.4633,
457
+ "rewards/accuracies": 0.7875000238418579,
458
+ "rewards/chosen": 0.5039950609207153,
459
+ "rewards/margins": 0.8257501721382141,
460
+ "rewards/rejected": -0.32175517082214355,
461
+ "step": 270
462
+ },
463
+ {
464
+ "epoch": 0.73,
465
+ "grad_norm": 13.5,
466
+ "learning_rate": 4.8987324340362445e-06,
467
+ "logits/chosen": -2.9749794006347656,
468
+ "logits/rejected": -2.962801218032837,
469
+ "logps/chosen": -29.97719955444336,
470
+ "logps/rejected": -28.926509857177734,
471
+ "loss": 0.5875,
472
+ "rewards/accuracies": 0.762499988079071,
473
+ "rewards/chosen": 0.4466908574104309,
474
+ "rewards/margins": 0.623786449432373,
475
+ "rewards/rejected": -0.17709562182426453,
476
+ "step": 280
477
+ },
478
+ {
479
+ "epoch": 0.75,
480
+ "grad_norm": 13.9375,
481
+ "learning_rate": 4.882152815054587e-06,
482
+ "logits/chosen": -2.903925657272339,
483
+ "logits/rejected": -2.886340618133545,
484
+ "logps/chosen": -31.33095932006836,
485
+ "logps/rejected": -31.375301361083984,
486
+ "loss": 0.3679,
487
+ "rewards/accuracies": 0.824999988079071,
488
+ "rewards/chosen": 0.7832067608833313,
489
+ "rewards/margins": 1.2944037914276123,
490
+ "rewards/rejected": -0.511197030544281,
491
+ "step": 290
492
+ },
493
+ {
494
+ "epoch": 0.78,
495
+ "grad_norm": 25.125,
496
+ "learning_rate": 4.864349357016816e-06,
497
+ "logits/chosen": -2.9058868885040283,
498
+ "logits/rejected": -2.9024853706359863,
499
+ "logps/chosen": -31.235179901123047,
500
+ "logps/rejected": -28.030147552490234,
501
+ "loss": 0.5567,
502
+ "rewards/accuracies": 0.737500011920929,
503
+ "rewards/chosen": 0.5924834609031677,
504
+ "rewards/margins": 0.8484223484992981,
505
+ "rewards/rejected": -0.25593873858451843,
506
+ "step": 300
507
+ },
508
+ {
509
+ "epoch": 0.78,
510
+ "eval_logits/chosen": -2.8230674266815186,
511
+ "eval_logits/rejected": -2.8203024864196777,
512
+ "eval_logps/chosen": -31.48798370361328,
513
+ "eval_logps/rejected": -35.10240936279297,
514
+ "eval_loss": 0.6810446381568909,
515
+ "eval_rewards/accuracies": 0.5722591280937195,
516
+ "eval_rewards/chosen": -0.12331710010766983,
517
+ "eval_rewards/margins": 0.11923385411500931,
518
+ "eval_rewards/rejected": -0.24255095422267914,
519
+ "eval_runtime": 112.9994,
520
+ "eval_samples_per_second": 3.035,
521
+ "eval_steps_per_second": 0.381,
522
+ "step": 300
523
+ },
524
+ {
525
+ "epoch": 0.81,
526
+ "grad_norm": 21.875,
527
+ "learning_rate": 4.84533120650964e-06,
528
+ "logits/chosen": -2.7906689643859863,
529
+ "logits/rejected": -2.8061537742614746,
530
+ "logps/chosen": -28.591060638427734,
531
+ "logps/rejected": -31.535472869873047,
532
+ "loss": 0.4812,
533
+ "rewards/accuracies": 0.8125,
534
+ "rewards/chosen": 0.3830157220363617,
535
+ "rewards/margins": 0.8650951385498047,
536
+ "rewards/rejected": -0.48207932710647583,
537
+ "step": 310
538
+ },
539
+ {
540
+ "epoch": 0.83,
541
+ "grad_norm": 13.6875,
542
+ "learning_rate": 4.825108134172131e-06,
543
+ "logits/chosen": -3.033379554748535,
544
+ "logits/rejected": -3.0197348594665527,
545
+ "logps/chosen": -29.19796371459961,
546
+ "logps/rejected": -29.172719955444336,
547
+ "loss": 0.4534,
548
+ "rewards/accuracies": 0.7749999761581421,
549
+ "rewards/chosen": 0.6081727147102356,
550
+ "rewards/margins": 1.120539903640747,
551
+ "rewards/rejected": -0.5123672485351562,
552
+ "step": 320
553
+ },
554
+ {
555
+ "epoch": 0.86,
556
+ "grad_norm": 8.0625,
557
+ "learning_rate": 4.80369052967602e-06,
558
+ "logits/chosen": -2.951292037963867,
559
+ "logits/rejected": -2.9344191551208496,
560
+ "logps/chosen": -27.382125854492188,
561
+ "logps/rejected": -31.329797744750977,
562
+ "loss": 0.4283,
563
+ "rewards/accuracies": 0.800000011920929,
564
+ "rewards/chosen": 0.5879670977592468,
565
+ "rewards/margins": 1.1508591175079346,
566
+ "rewards/rejected": -0.5628920793533325,
567
+ "step": 330
568
+ },
569
+ {
570
+ "epoch": 0.88,
571
+ "grad_norm": 23.5,
572
+ "learning_rate": 4.781089396387968e-06,
573
+ "logits/chosen": -3.1658730506896973,
574
+ "logits/rejected": -3.1736338138580322,
575
+ "logps/chosen": -30.635177612304688,
576
+ "logps/rejected": -33.81396484375,
577
+ "loss": 0.4196,
578
+ "rewards/accuracies": 0.7875000238418579,
579
+ "rewards/chosen": 0.602415919303894,
580
+ "rewards/margins": 1.2731926441192627,
581
+ "rewards/rejected": -0.6707767248153687,
582
+ "step": 340
583
+ },
584
+ {
585
+ "epoch": 0.91,
586
+ "grad_norm": 11.25,
587
+ "learning_rate": 4.757316345716554e-06,
588
+ "logits/chosen": -3.046379327774048,
589
+ "logits/rejected": -3.0505118370056152,
590
+ "logps/chosen": -29.554956436157227,
591
+ "logps/rejected": -32.34663009643555,
592
+ "loss": 0.4581,
593
+ "rewards/accuracies": 0.7875000238418579,
594
+ "rewards/chosen": 0.7511544823646545,
595
+ "rewards/margins": 1.251146912574768,
596
+ "rewards/rejected": -0.4999924600124359,
597
+ "step": 350
598
+ },
599
+ {
600
+ "epoch": 0.94,
601
+ "grad_norm": 12.125,
602
+ "learning_rate": 4.73238359114687e-06,
603
+ "logits/chosen": -2.885371446609497,
604
+ "logits/rejected": -2.887463331222534,
605
+ "logps/chosen": -27.74961280822754,
606
+ "logps/rejected": -30.798757553100586,
607
+ "loss": 0.4613,
608
+ "rewards/accuracies": 0.824999988079071,
609
+ "rewards/chosen": 0.5400681495666504,
610
+ "rewards/margins": 1.1207592487335205,
611
+ "rewards/rejected": -0.5806912183761597,
612
+ "step": 360
613
+ },
614
+ {
615
+ "epoch": 0.96,
616
+ "grad_norm": 30.25,
617
+ "learning_rate": 4.706303941965804e-06,
618
+ "logits/chosen": -2.9629409313201904,
619
+ "logits/rejected": -2.9607956409454346,
620
+ "logps/chosen": -29.71516990661621,
621
+ "logps/rejected": -32.788673400878906,
622
+ "loss": 0.4285,
623
+ "rewards/accuracies": 0.7749999761581421,
624
+ "rewards/chosen": 0.6428869962692261,
625
+ "rewards/margins": 1.1592637300491333,
626
+ "rewards/rejected": -0.5163766741752625,
627
+ "step": 370
628
+ },
629
+ {
630
+ "epoch": 0.99,
631
+ "grad_norm": 16.125,
632
+ "learning_rate": 4.679090796681225e-06,
633
+ "logits/chosen": -2.915102005004883,
634
+ "logits/rejected": -2.8993585109710693,
635
+ "logps/chosen": -27.930139541625977,
636
+ "logps/rejected": -28.983240127563477,
637
+ "loss": 0.3961,
638
+ "rewards/accuracies": 0.8374999761581421,
639
+ "rewards/chosen": 0.6089375615119934,
640
+ "rewards/margins": 1.2651110887527466,
641
+ "rewards/rejected": -0.6561735272407532,
642
+ "step": 380
643
+ },
644
+ {
645
+ "epoch": 1.01,
646
+ "grad_norm": 9.0625,
647
+ "learning_rate": 4.650758136138454e-06,
648
+ "logits/chosen": -3.20560884475708,
649
+ "logits/rejected": -3.1781651973724365,
650
+ "logps/chosen": -28.22250747680664,
651
+ "logps/rejected": -36.23353576660156,
652
+ "loss": 0.2827,
653
+ "rewards/accuracies": 0.887499988079071,
654
+ "rewards/chosen": 0.8412960171699524,
655
+ "rewards/margins": 1.9863579273223877,
656
+ "rewards/rejected": -1.1450618505477905,
657
+ "step": 390
658
+ },
659
+ {
660
+ "epoch": 1.04,
661
+ "grad_norm": 5.125,
662
+ "learning_rate": 4.621320516337559e-06,
663
+ "logits/chosen": -2.9704136848449707,
664
+ "logits/rejected": -2.9771697521209717,
665
+ "logps/chosen": -30.30599594116211,
666
+ "logps/rejected": -31.767135620117188,
667
+ "loss": 0.2251,
668
+ "rewards/accuracies": 0.887499988079071,
669
+ "rewards/chosen": 1.1859118938446045,
670
+ "rewards/margins": 2.4125428199768066,
671
+ "rewards/rejected": -1.226630687713623,
672
+ "step": 400
673
+ },
674
+ {
675
+ "epoch": 1.04,
676
+ "eval_logits/chosen": -2.8203864097595215,
677
+ "eval_logits/rejected": -2.8175127506256104,
678
+ "eval_logps/chosen": -31.82402229309082,
679
+ "eval_logps/rejected": -35.526432037353516,
680
+ "eval_loss": 0.6779412627220154,
681
+ "eval_rewards/accuracies": 0.6013289093971252,
682
+ "eval_rewards/chosen": -0.32494381070137024,
683
+ "eval_rewards/margins": 0.17202156782150269,
684
+ "eval_rewards/rejected": -0.4969654083251953,
685
+ "eval_runtime": 112.9877,
686
+ "eval_samples_per_second": 3.036,
687
+ "eval_steps_per_second": 0.381,
688
+ "step": 400
689
+ },
690
+ {
691
+ "epoch": 1.06,
692
+ "grad_norm": 10.5,
693
+ "learning_rate": 4.590793060955158e-06,
694
+ "logits/chosen": -2.91953706741333,
695
+ "logits/rejected": -2.9032034873962402,
696
+ "logps/chosen": -26.497915267944336,
697
+ "logps/rejected": -30.3842716217041,
698
+ "loss": 0.1921,
699
+ "rewards/accuracies": 0.949999988079071,
700
+ "rewards/chosen": 0.7503674030303955,
701
+ "rewards/margins": 2.4338011741638184,
702
+ "rewards/rejected": -1.6834341287612915,
703
+ "step": 410
704
+ },
705
+ {
706
+ "epoch": 1.09,
707
+ "grad_norm": 3.71875,
708
+ "learning_rate": 4.559191453574582e-06,
709
+ "logits/chosen": -2.958489179611206,
710
+ "logits/rejected": -2.975358486175537,
711
+ "logps/chosen": -30.06867027282715,
712
+ "logps/rejected": -29.58856773376465,
713
+ "loss": 0.2889,
714
+ "rewards/accuracies": 0.862500011920929,
715
+ "rewards/chosen": 1.2172677516937256,
716
+ "rewards/margins": 2.2790169715881348,
717
+ "rewards/rejected": -1.0617492198944092,
718
+ "step": 420
719
+ },
720
+ {
721
+ "epoch": 1.12,
722
+ "grad_norm": 6.4375,
723
+ "learning_rate": 4.52653192962838e-06,
724
+ "logits/chosen": -2.9180357456207275,
725
+ "logits/rejected": -2.9390196800231934,
726
+ "logps/chosen": -27.29958152770996,
727
+ "logps/rejected": -33.08103942871094,
728
+ "loss": 0.1982,
729
+ "rewards/accuracies": 0.949999988079071,
730
+ "rewards/chosen": 0.8723928332328796,
731
+ "rewards/margins": 2.6052730083465576,
732
+ "rewards/rejected": -1.7328803539276123,
733
+ "step": 430
734
+ },
735
+ {
736
+ "epoch": 1.14,
737
+ "grad_norm": 7.875,
738
+ "learning_rate": 4.492831268057307e-06,
739
+ "logits/chosen": -3.005812168121338,
740
+ "logits/rejected": -2.991055727005005,
741
+ "logps/chosen": -32.43821716308594,
742
+ "logps/rejected": -34.362449645996094,
743
+ "loss": 0.1944,
744
+ "rewards/accuracies": 0.949999988079071,
745
+ "rewards/chosen": 1.1018731594085693,
746
+ "rewards/margins": 2.664522886276245,
747
+ "rewards/rejected": -1.5626493692398071,
748
+ "step": 440
749
+ },
750
+ {
751
+ "epoch": 1.17,
752
+ "grad_norm": 7.4375,
753
+ "learning_rate": 4.458106782690094e-06,
754
+ "logits/chosen": -2.835895538330078,
755
+ "logits/rejected": -2.841238021850586,
756
+ "logps/chosen": -27.996952056884766,
757
+ "logps/rejected": -33.16737365722656,
758
+ "loss": 0.1935,
759
+ "rewards/accuracies": 0.8999999761581421,
760
+ "rewards/chosen": 1.2257856130599976,
761
+ "rewards/margins": 2.535287857055664,
762
+ "rewards/rejected": -1.309502124786377,
763
+ "step": 450
764
+ },
765
+ {
766
+ "epoch": 1.19,
767
+ "grad_norm": 4.75,
768
+ "learning_rate": 4.422376313348405e-06,
769
+ "logits/chosen": -2.908581256866455,
770
+ "logits/rejected": -2.8960087299346924,
771
+ "logps/chosen": -28.098934173583984,
772
+ "logps/rejected": -37.033470153808594,
773
+ "loss": 0.1398,
774
+ "rewards/accuracies": 0.949999988079071,
775
+ "rewards/chosen": 1.1988991498947144,
776
+ "rewards/margins": 3.1904969215393066,
777
+ "rewards/rejected": -1.9915975332260132,
778
+ "step": 460
779
+ },
780
+ {
781
+ "epoch": 1.22,
782
+ "grad_norm": 35.5,
783
+ "learning_rate": 4.3856582166815696e-06,
784
+ "logits/chosen": -2.961662530899048,
785
+ "logits/rejected": -2.967067241668701,
786
+ "logps/chosen": -30.011981964111328,
787
+ "logps/rejected": -33.761478424072266,
788
+ "loss": 0.2217,
789
+ "rewards/accuracies": 0.9125000238418579,
790
+ "rewards/chosen": 1.2641057968139648,
791
+ "rewards/margins": 2.9333133697509766,
792
+ "rewards/rejected": -1.6692078113555908,
793
+ "step": 470
794
+ },
795
+ {
796
+ "epoch": 1.25,
797
+ "grad_norm": 6.375,
798
+ "learning_rate": 4.347971356735789e-06,
799
+ "logits/chosen": -2.897602081298828,
800
+ "logits/rejected": -2.903383255004883,
801
+ "logps/chosen": -26.06301498413086,
802
+ "logps/rejected": -32.249534606933594,
803
+ "loss": 0.2337,
804
+ "rewards/accuracies": 0.9125000238418579,
805
+ "rewards/chosen": 0.9800997972488403,
806
+ "rewards/margins": 2.5127570629119873,
807
+ "rewards/rejected": -1.532657504081726,
808
+ "step": 480
809
+ },
810
+ {
811
+ "epoch": 1.27,
812
+ "grad_norm": 4.9375,
813
+ "learning_rate": 4.309335095262675e-06,
814
+ "logits/chosen": -2.9619221687316895,
815
+ "logits/rejected": -2.9760589599609375,
816
+ "logps/chosen": -30.95633888244629,
817
+ "logps/rejected": -33.317893981933594,
818
+ "loss": 0.1951,
819
+ "rewards/accuracies": 0.925000011920929,
820
+ "rewards/chosen": 1.0749385356903076,
821
+ "rewards/margins": 2.807677745819092,
822
+ "rewards/rejected": -1.7327392101287842,
823
+ "step": 490
824
+ },
825
+ {
826
+ "epoch": 1.3,
827
+ "grad_norm": 25.25,
828
+ "learning_rate": 4.269769281772082e-06,
829
+ "logits/chosen": -3.1099023818969727,
830
+ "logits/rejected": -3.0993502140045166,
831
+ "logps/chosen": -27.969009399414062,
832
+ "logps/rejected": -36.54082489013672,
833
+ "loss": 0.2082,
834
+ "rewards/accuracies": 0.9125000238418579,
835
+ "rewards/chosen": 1.0757564306259155,
836
+ "rewards/margins": 3.2704696655273438,
837
+ "rewards/rejected": -2.1947131156921387,
838
+ "step": 500
839
+ },
840
+ {
841
+ "epoch": 1.3,
842
+ "eval_logits/chosen": -2.8486714363098145,
843
+ "eval_logits/rejected": -2.8475496768951416,
844
+ "eval_logps/chosen": -31.971736907958984,
845
+ "eval_logps/rejected": -35.81857681274414,
846
+ "eval_loss": 0.6858980655670166,
847
+ "eval_rewards/accuracies": 0.6092192530632019,
848
+ "eval_rewards/chosen": -0.413571298122406,
849
+ "eval_rewards/margins": 0.2586813271045685,
850
+ "eval_rewards/rejected": -0.6722525954246521,
851
+ "eval_runtime": 112.9971,
852
+ "eval_samples_per_second": 3.035,
853
+ "eval_steps_per_second": 0.381,
854
+ "step": 500
855
+ },
856
+ {
857
+ "epoch": 1.32,
858
+ "grad_norm": 13.8125,
859
+ "learning_rate": 4.22929424333435e-06,
860
+ "logits/chosen": -2.9114999771118164,
861
+ "logits/rejected": -2.9171700477600098,
862
+ "logps/chosen": -28.95807456970215,
863
+ "logps/rejected": -32.554012298583984,
864
+ "loss": 0.1718,
865
+ "rewards/accuracies": 0.887499988079071,
866
+ "rewards/chosen": 1.2146211862564087,
867
+ "rewards/margins": 3.0521254539489746,
868
+ "rewards/rejected": -1.8375046253204346,
869
+ "step": 510
870
+ },
871
+ {
872
+ "epoch": 1.35,
873
+ "grad_norm": 9.5,
874
+ "learning_rate": 4.1879307741372085e-06,
875
+ "logits/chosen": -2.8482823371887207,
876
+ "logits/rejected": -2.851170063018799,
877
+ "logps/chosen": -27.4453182220459,
878
+ "logps/rejected": -32.90190124511719,
879
+ "loss": 0.2558,
880
+ "rewards/accuracies": 0.9125000238418579,
881
+ "rewards/chosen": 1.3344898223876953,
882
+ "rewards/margins": 3.184781789779663,
883
+ "rewards/rejected": -1.8502919673919678,
884
+ "step": 520
885
+ },
886
+ {
887
+ "epoch": 1.38,
888
+ "grad_norm": 4.53125,
889
+ "learning_rate": 4.145700124802693e-06,
890
+ "logits/chosen": -2.9950766563415527,
891
+ "logits/rejected": -2.980553150177002,
892
+ "logps/chosen": -28.1324462890625,
893
+ "logps/rejected": -33.559776306152344,
894
+ "loss": 0.1474,
895
+ "rewards/accuracies": 0.925000011920929,
896
+ "rewards/chosen": 1.1568902730941772,
897
+ "rewards/margins": 3.404855728149414,
898
+ "rewards/rejected": -2.2479655742645264,
899
+ "step": 530
900
+ },
901
+ {
902
+ "epoch": 1.4,
903
+ "grad_norm": 5.15625,
904
+ "learning_rate": 4.102623991469562e-06,
905
+ "logits/chosen": -3.148716688156128,
906
+ "logits/rejected": -3.149186134338379,
907
+ "logps/chosen": -27.875808715820312,
908
+ "logps/rejected": -33.190120697021484,
909
+ "loss": 0.2048,
910
+ "rewards/accuracies": 0.8999999761581421,
911
+ "rewards/chosen": 1.2375982999801636,
912
+ "rewards/margins": 2.9263930320739746,
913
+ "rewards/rejected": -1.688794493675232,
914
+ "step": 540
915
+ },
916
+ {
917
+ "epoch": 1.43,
918
+ "grad_norm": 15.375,
919
+ "learning_rate": 4.058724504646834e-06,
920
+ "logits/chosen": -3.111207962036133,
921
+ "logits/rejected": -3.115915536880493,
922
+ "logps/chosen": -29.704254150390625,
923
+ "logps/rejected": -30.918773651123047,
924
+ "loss": 0.2088,
925
+ "rewards/accuracies": 0.9125000238418579,
926
+ "rewards/chosen": 1.1263028383255005,
927
+ "rewards/margins": 2.9283792972564697,
928
+ "rewards/rejected": -1.8020765781402588,
929
+ "step": 550
930
+ },
931
+ {
932
+ "epoch": 1.45,
933
+ "grad_norm": 1.6328125,
934
+ "learning_rate": 4.014024217844167e-06,
935
+ "logits/chosen": -2.8809523582458496,
936
+ "logits/rejected": -2.8608622550964355,
937
+ "logps/chosen": -28.932674407958984,
938
+ "logps/rejected": -31.60831642150879,
939
+ "loss": 0.2277,
940
+ "rewards/accuracies": 0.8999999761581421,
941
+ "rewards/chosen": 1.2164714336395264,
942
+ "rewards/margins": 2.8543777465820312,
943
+ "rewards/rejected": -1.6379063129425049,
944
+ "step": 560
945
+ },
946
+ {
947
+ "epoch": 1.48,
948
+ "grad_norm": 9.5625,
949
+ "learning_rate": 3.968546095984911e-06,
950
+ "logits/chosen": -3.127673625946045,
951
+ "logits/rejected": -3.1156859397888184,
952
+ "logps/chosen": -28.24026870727539,
953
+ "logps/rejected": -30.688241958618164,
954
+ "loss": 0.203,
955
+ "rewards/accuracies": 0.925000011920929,
956
+ "rewards/chosen": 1.3376219272613525,
957
+ "rewards/margins": 2.8865649700164795,
958
+ "rewards/rejected": -1.548943042755127,
959
+ "step": 570
960
+ },
961
+ {
962
+ "epoch": 1.51,
963
+ "grad_norm": 6.9375,
964
+ "learning_rate": 3.922313503607806e-06,
965
+ "logits/chosen": -3.031052827835083,
966
+ "logits/rejected": -3.0235180854797363,
967
+ "logps/chosen": -27.095932006835938,
968
+ "logps/rejected": -34.802921295166016,
969
+ "loss": 0.1872,
970
+ "rewards/accuracies": 0.925000011920929,
971
+ "rewards/chosen": 1.1634767055511475,
972
+ "rewards/margins": 3.093201160430908,
973
+ "rewards/rejected": -1.9297244548797607,
974
+ "step": 580
975
+ },
976
+ {
977
+ "epoch": 1.53,
978
+ "grad_norm": 6.40625,
979
+ "learning_rate": 3.875350192863368e-06,
980
+ "logits/chosen": -2.850231885910034,
981
+ "logits/rejected": -2.8259904384613037,
982
+ "logps/chosen": -26.238988876342773,
983
+ "logps/rejected": -30.916696548461914,
984
+ "loss": 0.1868,
985
+ "rewards/accuracies": 0.9375,
986
+ "rewards/chosen": 0.7704905271530151,
987
+ "rewards/margins": 2.6334352493286133,
988
+ "rewards/rejected": -1.8629448413848877,
989
+ "step": 590
990
+ },
991
+ {
992
+ "epoch": 1.56,
993
+ "grad_norm": 5.875,
994
+ "learning_rate": 3.8276802913111436e-06,
995
+ "logits/chosen": -2.9241130352020264,
996
+ "logits/rejected": -2.9332339763641357,
997
+ "logps/chosen": -27.865304946899414,
998
+ "logps/rejected": -32.76335525512695,
999
+ "loss": 0.2119,
1000
+ "rewards/accuracies": 0.9375,
1001
+ "rewards/chosen": 1.113793134689331,
1002
+ "rewards/margins": 3.0938680171966553,
1003
+ "rewards/rejected": -1.9800748825073242,
1004
+ "step": 600
1005
+ },
1006
+ {
1007
+ "epoch": 1.56,
1008
+ "eval_logits/chosen": -2.8321738243103027,
1009
+ "eval_logits/rejected": -2.8301031589508057,
1010
+ "eval_logps/chosen": -32.18595886230469,
1011
+ "eval_logps/rejected": -36.014678955078125,
1012
+ "eval_loss": 0.699341893196106,
1013
+ "eval_rewards/accuracies": 0.5926079750061035,
1014
+ "eval_rewards/chosen": -0.5421043038368225,
1015
+ "eval_rewards/margins": 0.24781100451946259,
1016
+ "eval_rewards/rejected": -0.7899153232574463,
1017
+ "eval_runtime": 112.7941,
1018
+ "eval_samples_per_second": 3.041,
1019
+ "eval_steps_per_second": 0.381,
1020
+ "step": 600
1021
+ },
1022
+ {
1023
+ "epoch": 1.58,
1024
+ "grad_norm": 13.5625,
1025
+ "learning_rate": 3.7793282895240927e-06,
1026
+ "logits/chosen": -2.946056842803955,
1027
+ "logits/rejected": -2.9427828788757324,
1028
+ "logps/chosen": -30.479598999023438,
1029
+ "logps/rejected": -33.436546325683594,
1030
+ "loss": 0.1429,
1031
+ "rewards/accuracies": 0.949999988079071,
1032
+ "rewards/chosen": 1.316603660583496,
1033
+ "rewards/margins": 3.323195219039917,
1034
+ "rewards/rejected": -2.006591558456421,
1035
+ "step": 610
1036
+ },
1037
+ {
1038
+ "epoch": 1.61,
1039
+ "grad_norm": 9.625,
1040
+ "learning_rate": 3.730319028506478e-06,
1041
+ "logits/chosen": -3.014833927154541,
1042
+ "logits/rejected": -3.022573947906494,
1043
+ "logps/chosen": -28.9743595123291,
1044
+ "logps/rejected": -34.54629898071289,
1045
+ "loss": 0.1341,
1046
+ "rewards/accuracies": 0.949999988079071,
1047
+ "rewards/chosen": 1.335627555847168,
1048
+ "rewards/margins": 3.2767157554626465,
1049
+ "rewards/rejected": -1.941088080406189,
1050
+ "step": 620
1051
+ },
1052
+ {
1053
+ "epoch": 1.64,
1054
+ "grad_norm": 22.125,
1055
+ "learning_rate": 3.6806776869317074e-06,
1056
+ "logits/chosen": -2.894308090209961,
1057
+ "logits/rejected": -2.8933069705963135,
1058
+ "logps/chosen": -27.416006088256836,
1059
+ "logps/rejected": -31.5345401763916,
1060
+ "loss": 0.2028,
1061
+ "rewards/accuracies": 0.8999999761581421,
1062
+ "rewards/chosen": 1.4153752326965332,
1063
+ "rewards/margins": 3.449214458465576,
1064
+ "rewards/rejected": -2.033839702606201,
1065
+ "step": 630
1066
+ },
1067
+ {
1068
+ "epoch": 1.66,
1069
+ "grad_norm": 8.375,
1070
+ "learning_rate": 3.6304297682067146e-06,
1071
+ "logits/chosen": -2.82157301902771,
1072
+ "logits/rejected": -2.8144097328186035,
1073
+ "logps/chosen": -28.129154205322266,
1074
+ "logps/rejected": -32.73756408691406,
1075
+ "loss": 0.1817,
1076
+ "rewards/accuracies": 0.949999988079071,
1077
+ "rewards/chosen": 1.2250679731369019,
1078
+ "rewards/margins": 3.056783676147461,
1079
+ "rewards/rejected": -1.8317155838012695,
1080
+ "step": 640
1081
+ },
1082
+ {
1083
+ "epoch": 1.69,
1084
+ "grad_norm": 8.625,
1085
+ "learning_rate": 3.579601087369492e-06,
1086
+ "logits/chosen": -3.041685104370117,
1087
+ "logits/rejected": -3.0316648483276367,
1088
+ "logps/chosen": -29.226932525634766,
1089
+ "logps/rejected": -35.155330657958984,
1090
+ "loss": 0.1016,
1091
+ "rewards/accuracies": 0.9750000238418579,
1092
+ "rewards/chosen": 1.4800013303756714,
1093
+ "rewards/margins": 3.3574302196502686,
1094
+ "rewards/rejected": -1.877428650856018,
1095
+ "step": 650
1096
+ },
1097
+ {
1098
+ "epoch": 1.71,
1099
+ "grad_norm": 34.0,
1100
+ "learning_rate": 3.5282177578265295e-06,
1101
+ "logits/chosen": -2.9072232246398926,
1102
+ "logits/rejected": -2.9152984619140625,
1103
+ "logps/chosen": -29.22675132751465,
1104
+ "logps/rejected": -32.41046905517578,
1105
+ "loss": 0.1796,
1106
+ "rewards/accuracies": 0.949999988079071,
1107
+ "rewards/chosen": 1.361181616783142,
1108
+ "rewards/margins": 3.326385974884033,
1109
+ "rewards/rejected": -1.9652040004730225,
1110
+ "step": 660
1111
+ },
1112
+ {
1113
+ "epoch": 1.74,
1114
+ "grad_norm": 5.46875,
1115
+ "learning_rate": 3.476306177936961e-06,
1116
+ "logits/chosen": -2.8980605602264404,
1117
+ "logits/rejected": -2.9036636352539062,
1118
+ "logps/chosen": -28.326217651367188,
1119
+ "logps/rejected": -33.606929779052734,
1120
+ "loss": 0.0961,
1121
+ "rewards/accuracies": 0.987500011920929,
1122
+ "rewards/chosen": 1.4718984365463257,
1123
+ "rewards/margins": 3.5891189575195312,
1124
+ "rewards/rejected": -2.117220401763916,
1125
+ "step": 670
1126
+ },
1127
+ {
1128
+ "epoch": 1.77,
1129
+ "grad_norm": 7.34375,
1130
+ "learning_rate": 3.423893017450324e-06,
1131
+ "logits/chosen": -3.068453311920166,
1132
+ "logits/rejected": -3.0703165531158447,
1133
+ "logps/chosen": -26.69721031188965,
1134
+ "logps/rejected": -35.200775146484375,
1135
+ "loss": 0.1478,
1136
+ "rewards/accuracies": 0.9375,
1137
+ "rewards/chosen": 1.6269474029541016,
1138
+ "rewards/margins": 3.727607011795044,
1139
+ "rewards/rejected": -2.1006596088409424,
1140
+ "step": 680
1141
+ },
1142
+ {
1143
+ "epoch": 1.79,
1144
+ "grad_norm": 9.8125,
1145
+ "learning_rate": 3.3710052038048794e-06,
1146
+ "logits/chosen": -3.1004300117492676,
1147
+ "logits/rejected": -3.0753302574157715,
1148
+ "logps/chosen": -29.50313377380371,
1149
+ "logps/rejected": -35.621055603027344,
1150
+ "loss": 0.1709,
1151
+ "rewards/accuracies": 0.9125000238418579,
1152
+ "rewards/chosen": 1.4712846279144287,
1153
+ "rewards/margins": 3.5642027854919434,
1154
+ "rewards/rejected": -2.0929179191589355,
1155
+ "step": 690
1156
+ },
1157
+ {
1158
+ "epoch": 1.82,
1159
+ "grad_norm": 7.125,
1160
+ "learning_rate": 3.3176699082935546e-06,
1161
+ "logits/chosen": -3.0888259410858154,
1162
+ "logits/rejected": -3.082875967025757,
1163
+ "logps/chosen": -26.177494049072266,
1164
+ "logps/rejected": -36.16416549682617,
1165
+ "loss": 0.1579,
1166
+ "rewards/accuracies": 0.925000011920929,
1167
+ "rewards/chosen": 1.642321228981018,
1168
+ "rewards/margins": 3.7625675201416016,
1169
+ "rewards/rejected": -2.120246410369873,
1170
+ "step": 700
1171
+ },
1172
+ {
1173
+ "epoch": 1.82,
1174
+ "eval_logits/chosen": -2.828380584716797,
1175
+ "eval_logits/rejected": -2.8261237144470215,
1176
+ "eval_logps/chosen": -32.29281997680664,
1177
+ "eval_logps/rejected": -36.07339096069336,
1178
+ "eval_loss": 0.7178195118904114,
1179
+ "eval_rewards/accuracies": 0.5805647969245911,
1180
+ "eval_rewards/chosen": -0.6062225699424744,
1181
+ "eval_rewards/margins": 0.21891874074935913,
1182
+ "eval_rewards/rejected": -0.8251413702964783,
1183
+ "eval_runtime": 112.945,
1184
+ "eval_samples_per_second": 3.037,
1185
+ "eval_steps_per_second": 0.381,
1186
+ "step": 700
1187
+ },
1188
+ {
1189
+ "epoch": 1.84,
1190
+ "grad_norm": 4.875,
1191
+ "learning_rate": 3.2639145321045933e-06,
1192
+ "logits/chosen": -2.907222270965576,
1193
+ "logits/rejected": -2.915740728378296,
1194
+ "logps/chosen": -29.818323135375977,
1195
+ "logps/rejected": -35.702247619628906,
1196
+ "loss": 0.184,
1197
+ "rewards/accuracies": 0.9125000238418579,
1198
+ "rewards/chosen": 1.5511611700057983,
1199
+ "rewards/margins": 3.508523464202881,
1200
+ "rewards/rejected": -1.9573619365692139,
1201
+ "step": 710
1202
+ },
1203
+ {
1204
+ "epoch": 1.87,
1205
+ "grad_norm": 13.25,
1206
+ "learning_rate": 3.2097666922441107e-06,
1207
+ "logits/chosen": -3.004633665084839,
1208
+ "logits/rejected": -3.0026512145996094,
1209
+ "logps/chosen": -30.316162109375,
1210
+ "logps/rejected": -33.810691833496094,
1211
+ "loss": 0.1853,
1212
+ "rewards/accuracies": 0.925000011920929,
1213
+ "rewards/chosen": 1.3776910305023193,
1214
+ "rewards/margins": 3.3225979804992676,
1215
+ "rewards/rejected": -1.9449069499969482,
1216
+ "step": 720
1217
+ },
1218
+ {
1219
+ "epoch": 1.9,
1220
+ "grad_norm": 7.4375,
1221
+ "learning_rate": 3.1552542073477554e-06,
1222
+ "logits/chosen": -2.9351017475128174,
1223
+ "logits/rejected": -2.942718982696533,
1224
+ "logps/chosen": -25.93600082397461,
1225
+ "logps/rejected": -32.751060485839844,
1226
+ "loss": 0.1238,
1227
+ "rewards/accuracies": 0.949999988079071,
1228
+ "rewards/chosen": 1.3286634683609009,
1229
+ "rewards/margins": 3.64039945602417,
1230
+ "rewards/rejected": -2.3117361068725586,
1231
+ "step": 730
1232
+ },
1233
+ {
1234
+ "epoch": 1.92,
1235
+ "grad_norm": 5.59375,
1236
+ "learning_rate": 3.100405083388799e-06,
1237
+ "logits/chosen": -3.0043129920959473,
1238
+ "logits/rejected": -3.0114331245422363,
1239
+ "logps/chosen": -28.998254776000977,
1240
+ "logps/rejected": -40.23419952392578,
1241
+ "loss": 0.1188,
1242
+ "rewards/accuracies": 0.949999988079071,
1243
+ "rewards/chosen": 1.4500622749328613,
1244
+ "rewards/margins": 4.418519020080566,
1245
+ "rewards/rejected": -2.968456745147705,
1246
+ "step": 740
1247
+ },
1248
+ {
1249
+ "epoch": 1.95,
1250
+ "grad_norm": 9.9375,
1251
+ "learning_rate": 3.0452474992899645e-06,
1252
+ "logits/chosen": -3.0003299713134766,
1253
+ "logits/rejected": -2.9911446571350098,
1254
+ "logps/chosen": -31.06940269470215,
1255
+ "logps/rejected": -36.11074447631836,
1256
+ "loss": 0.2631,
1257
+ "rewards/accuracies": 0.8999999761581421,
1258
+ "rewards/chosen": 1.3671112060546875,
1259
+ "rewards/margins": 3.5888779163360596,
1260
+ "rewards/rejected": -2.221766233444214,
1261
+ "step": 750
1262
+ },
1263
+ {
1264
+ "epoch": 1.97,
1265
+ "grad_norm": 5.90625,
1266
+ "learning_rate": 2.989809792446417e-06,
1267
+ "logits/chosen": -2.9182653427124023,
1268
+ "logits/rejected": -2.9267349243164062,
1269
+ "logps/chosen": -28.694416046142578,
1270
+ "logps/rejected": -32.506526947021484,
1271
+ "loss": 0.146,
1272
+ "rewards/accuracies": 0.9624999761581421,
1273
+ "rewards/chosen": 1.3879867792129517,
1274
+ "rewards/margins": 3.4621334075927734,
1275
+ "rewards/rejected": -2.0741465091705322,
1276
+ "step": 760
1277
+ },
1278
+ {
1279
+ "epoch": 2.0,
1280
+ "grad_norm": 11.75,
1281
+ "learning_rate": 2.9341204441673267e-06,
1282
+ "logits/chosen": -2.920984983444214,
1283
+ "logits/rejected": -2.9358911514282227,
1284
+ "logps/chosen": -28.243701934814453,
1285
+ "logps/rejected": -35.78112030029297,
1286
+ "loss": 0.2048,
1287
+ "rewards/accuracies": 0.925000011920929,
1288
+ "rewards/chosen": 1.52691650390625,
1289
+ "rewards/margins": 3.398749589920044,
1290
+ "rewards/rejected": -1.8718332052230835,
1291
+ "step": 770
1292
+ },
1293
+ {
1294
+ "epoch": 2.03,
1295
+ "grad_norm": 1.671875,
1296
+ "learning_rate": 2.878208065043501e-06,
1297
+ "logits/chosen": -2.994871139526367,
1298
+ "logits/rejected": -2.987257719039917,
1299
+ "logps/chosen": -29.053070068359375,
1300
+ "logps/rejected": -33.76008605957031,
1301
+ "loss": 0.0836,
1302
+ "rewards/accuracies": 0.9624999761581421,
1303
+ "rewards/chosen": 1.9706106185913086,
1304
+ "rewards/margins": 4.18499755859375,
1305
+ "rewards/rejected": -2.214386463165283,
1306
+ "step": 780
1307
+ },
1308
+ {
1309
+ "epoch": 2.05,
1310
+ "grad_norm": 2.75,
1311
+ "learning_rate": 2.8221013802485974e-06,
1312
+ "logits/chosen": -2.9811806678771973,
1313
+ "logits/rejected": -2.9775445461273193,
1314
+ "logps/chosen": -24.44116973876953,
1315
+ "logps/rejected": -34.26652145385742,
1316
+ "loss": 0.057,
1317
+ "rewards/accuracies": 0.9750000238418579,
1318
+ "rewards/chosen": 1.5414597988128662,
1319
+ "rewards/margins": 4.266219615936279,
1320
+ "rewards/rejected": -2.724759817123413,
1321
+ "step": 790
1322
+ },
1323
+ {
1324
+ "epoch": 2.08,
1325
+ "grad_norm": 0.69921875,
1326
+ "learning_rate": 2.76582921478147e-06,
1327
+ "logits/chosen": -2.921146869659424,
1328
+ "logits/rejected": -2.902113199234009,
1329
+ "logps/chosen": -27.294260025024414,
1330
+ "logps/rejected": -36.3679084777832,
1331
+ "loss": 0.0649,
1332
+ "rewards/accuracies": 0.949999988079071,
1333
+ "rewards/chosen": 1.8000681400299072,
1334
+ "rewards/margins": 4.472109794616699,
1335
+ "rewards/rejected": -2.672041416168213,
1336
+ "step": 800
1337
+ },
1338
+ {
1339
+ "epoch": 2.08,
1340
+ "eval_logits/chosen": -2.8270516395568848,
1341
+ "eval_logits/rejected": -2.824305772781372,
1342
+ "eval_logps/chosen": -32.480831146240234,
1343
+ "eval_logps/rejected": -36.364830017089844,
1344
+ "eval_loss": 0.7259832620620728,
1345
+ "eval_rewards/accuracies": 0.6071428656578064,
1346
+ "eval_rewards/chosen": -0.7190301418304443,
1347
+ "eval_rewards/margins": 0.28097450733184814,
1348
+ "eval_rewards/rejected": -1.0000046491622925,
1349
+ "eval_runtime": 112.9479,
1350
+ "eval_samples_per_second": 3.037,
1351
+ "eval_steps_per_second": 0.381,
1352
+ "step": 800
1353
+ },
1354
+ {
1355
+ "epoch": 2.1,
1356
+ "grad_norm": 0.515625,
1357
+ "learning_rate": 2.7094204786572254e-06,
1358
+ "logits/chosen": -3.0023553371429443,
1359
+ "logits/rejected": -3.0219783782958984,
1360
+ "logps/chosen": -28.510021209716797,
1361
+ "logps/rejected": -36.82685089111328,
1362
+ "loss": 0.0608,
1363
+ "rewards/accuracies": 0.949999988079071,
1364
+ "rewards/chosen": 1.8590720891952515,
1365
+ "rewards/margins": 5.111330986022949,
1366
+ "rewards/rejected": -3.252258777618408,
1367
+ "step": 810
1368
+ },
1369
+ {
1370
+ "epoch": 2.13,
1371
+ "grad_norm": 4.5625,
1372
+ "learning_rate": 2.6529041520546072e-06,
1373
+ "logits/chosen": -2.9950449466705322,
1374
+ "logits/rejected": -2.9868369102478027,
1375
+ "logps/chosen": -29.983810424804688,
1376
+ "logps/rejected": -34.7702522277832,
1377
+ "loss": 0.0753,
1378
+ "rewards/accuracies": 0.9624999761581421,
1379
+ "rewards/chosen": 1.7618064880371094,
1380
+ "rewards/margins": 4.724579811096191,
1381
+ "rewards/rejected": -2.962772846221924,
1382
+ "step": 820
1383
+ },
1384
+ {
1385
+ "epoch": 2.16,
1386
+ "grad_norm": 3.90625,
1387
+ "learning_rate": 2.5963092704273302e-06,
1388
+ "logits/chosen": -2.873569965362549,
1389
+ "logits/rejected": -2.872542381286621,
1390
+ "logps/chosen": -28.7579345703125,
1391
+ "logps/rejected": -30.285070419311523,
1392
+ "loss": 0.0967,
1393
+ "rewards/accuracies": 0.949999988079071,
1394
+ "rewards/chosen": 1.7464784383773804,
1395
+ "rewards/margins": 4.3574628829956055,
1396
+ "rewards/rejected": -2.6109836101531982,
1397
+ "step": 830
1398
+ },
1399
+ {
1400
+ "epoch": 2.18,
1401
+ "grad_norm": 1.328125,
1402
+ "learning_rate": 2.53966490958702e-06,
1403
+ "logits/chosen": -2.976682662963867,
1404
+ "logits/rejected": -2.9772753715515137,
1405
+ "logps/chosen": -29.0651912689209,
1406
+ "logps/rejected": -31.696544647216797,
1407
+ "loss": 0.0447,
1408
+ "rewards/accuracies": 0.987500011920929,
1409
+ "rewards/chosen": 2.118293285369873,
1410
+ "rewards/margins": 4.679741859436035,
1411
+ "rewards/rejected": -2.561448335647583,
1412
+ "step": 840
1413
+ },
1414
+ {
1415
+ "epoch": 2.21,
1416
+ "grad_norm": 13.6875,
1417
+ "learning_rate": 2.4830001707654135e-06,
1418
+ "logits/chosen": -3.0549464225769043,
1419
+ "logits/rejected": -3.056971549987793,
1420
+ "logps/chosen": -26.48697280883789,
1421
+ "logps/rejected": -35.12908935546875,
1422
+ "loss": 0.05,
1423
+ "rewards/accuracies": 0.987500011920929,
1424
+ "rewards/chosen": 1.833518385887146,
1425
+ "rewards/margins": 4.853404521942139,
1426
+ "rewards/rejected": -3.0198864936828613,
1427
+ "step": 850
1428
+ },
1429
+ {
1430
+ "epoch": 2.23,
1431
+ "grad_norm": 1.3828125,
1432
+ "learning_rate": 2.4263441656635054e-06,
1433
+ "logits/chosen": -2.8957679271698,
1434
+ "logits/rejected": -2.9036927223205566,
1435
+ "logps/chosen": -21.315093994140625,
1436
+ "logps/rejected": -32.645179748535156,
1437
+ "loss": 0.0701,
1438
+ "rewards/accuracies": 0.949999988079071,
1439
+ "rewards/chosen": 1.7470932006835938,
1440
+ "rewards/margins": 4.7478485107421875,
1441
+ "rewards/rejected": -3.000755786895752,
1442
+ "step": 860
1443
+ },
1444
+ {
1445
+ "epoch": 2.26,
1446
+ "grad_norm": 1.2109375,
1447
+ "learning_rate": 2.3697260014953107e-06,
1448
+ "logits/chosen": -3.064497470855713,
1449
+ "logits/rejected": -3.0464019775390625,
1450
+ "logps/chosen": -28.66514015197754,
1451
+ "logps/rejected": -32.98708724975586,
1452
+ "loss": 0.0539,
1453
+ "rewards/accuracies": 0.9750000238418579,
1454
+ "rewards/chosen": 1.899011254310608,
1455
+ "rewards/margins": 4.9983906745910645,
1456
+ "rewards/rejected": -3.099379301071167,
1457
+ "step": 870
1458
+ },
1459
+ {
1460
+ "epoch": 2.29,
1461
+ "grad_norm": 1.953125,
1462
+ "learning_rate": 2.3131747660339396e-06,
1463
+ "logits/chosen": -2.961402416229248,
1464
+ "logits/rejected": -2.941373348236084,
1465
+ "logps/chosen": -28.94051742553711,
1466
+ "logps/rejected": -35.915077209472656,
1467
+ "loss": 0.1267,
1468
+ "rewards/accuracies": 0.9375,
1469
+ "rewards/chosen": 1.2919337749481201,
1470
+ "rewards/margins": 4.449277877807617,
1471
+ "rewards/rejected": -3.1573433876037598,
1472
+ "step": 880
1473
+ },
1474
+ {
1475
+ "epoch": 2.31,
1476
+ "grad_norm": 4.96875,
1477
+ "learning_rate": 2.256719512667651e-06,
1478
+ "logits/chosen": -2.942389726638794,
1479
+ "logits/rejected": -2.9418246746063232,
1480
+ "logps/chosen": -30.278453826904297,
1481
+ "logps/rejected": -39.759918212890625,
1482
+ "loss": 0.0258,
1483
+ "rewards/accuracies": 0.987500011920929,
1484
+ "rewards/chosen": 1.6221895217895508,
1485
+ "rewards/margins": 5.620645523071289,
1486
+ "rewards/rejected": -3.998455762863159,
1487
+ "step": 890
1488
+ },
1489
+ {
1490
+ "epoch": 2.34,
1491
+ "grad_norm": 1.8671875,
1492
+ "learning_rate": 2.2003892454735786e-06,
1493
+ "logits/chosen": -2.9577229022979736,
1494
+ "logits/rejected": -2.975241184234619,
1495
+ "logps/chosen": -28.08145523071289,
1496
+ "logps/rejected": -37.178321838378906,
1497
+ "loss": 0.1014,
1498
+ "rewards/accuracies": 0.9750000238418579,
1499
+ "rewards/chosen": 1.1683677434921265,
1500
+ "rewards/margins": 5.182019233703613,
1501
+ "rewards/rejected": -4.0136518478393555,
1502
+ "step": 900
1503
+ },
1504
+ {
1505
+ "epoch": 2.34,
1506
+ "eval_logits/chosen": -2.8304085731506348,
1507
+ "eval_logits/rejected": -2.827810287475586,
1508
+ "eval_logps/chosen": -32.957435607910156,
1509
+ "eval_logps/rejected": -36.925601959228516,
1510
+ "eval_loss": 0.775818407535553,
1511
+ "eval_rewards/accuracies": 0.5830564498901367,
1512
+ "eval_rewards/chosen": -1.0049890279769897,
1513
+ "eval_rewards/margins": 0.331479549407959,
1514
+ "eval_rewards/rejected": -1.3364684581756592,
1515
+ "eval_runtime": 112.9719,
1516
+ "eval_samples_per_second": 3.036,
1517
+ "eval_steps_per_second": 0.381,
1518
+ "step": 900
1519
+ },
1520
+ {
1521
+ "epoch": 2.36,
1522
+ "grad_norm": 4.4375,
1523
+ "learning_rate": 2.1442129043167877e-06,
1524
+ "logits/chosen": -2.9465219974517822,
1525
+ "logits/rejected": -2.936427354812622,
1526
+ "logps/chosen": -29.5849666595459,
1527
+ "logps/rejected": -38.474422454833984,
1528
+ "loss": 0.0556,
1529
+ "rewards/accuracies": 0.9750000238418579,
1530
+ "rewards/chosen": 1.0259087085723877,
1531
+ "rewards/margins": 5.3213396072387695,
1532
+ "rewards/rejected": -4.295430660247803,
1533
+ "step": 910
1534
+ },
1535
+ {
1536
+ "epoch": 2.39,
1537
+ "grad_norm": 4.5,
1538
+ "learning_rate": 2.088219349982323e-06,
1539
+ "logits/chosen": -3.0080857276916504,
1540
+ "logits/rejected": -2.988001823425293,
1541
+ "logps/chosen": -30.026147842407227,
1542
+ "logps/rejected": -35.326351165771484,
1543
+ "loss": 0.0998,
1544
+ "rewards/accuracies": 0.9375,
1545
+ "rewards/chosen": 1.7186342477798462,
1546
+ "rewards/margins": 5.033928871154785,
1547
+ "rewards/rejected": -3.3152947425842285,
1548
+ "step": 920
1549
+ },
1550
+ {
1551
+ "epoch": 2.42,
1552
+ "grad_norm": 4.375,
1553
+ "learning_rate": 2.0324373493478803e-06,
1554
+ "logits/chosen": -2.9182701110839844,
1555
+ "logits/rejected": -2.9180097579956055,
1556
+ "logps/chosen": -27.986026763916016,
1557
+ "logps/rejected": -38.27654266357422,
1558
+ "loss": 0.0346,
1559
+ "rewards/accuracies": 1.0,
1560
+ "rewards/chosen": 1.2587478160858154,
1561
+ "rewards/margins": 5.192648410797119,
1562
+ "rewards/rejected": -3.933900833129883,
1563
+ "step": 930
1564
+ },
1565
+ {
1566
+ "epoch": 2.44,
1567
+ "grad_norm": 1.484375,
1568
+ "learning_rate": 1.976895560604729e-06,
1569
+ "logits/chosen": -3.001537561416626,
1570
+ "logits/rejected": -3.0352466106414795,
1571
+ "logps/chosen": -27.144699096679688,
1572
+ "logps/rejected": -36.409183502197266,
1573
+ "loss": 0.0498,
1574
+ "rewards/accuracies": 0.9750000238418579,
1575
+ "rewards/chosen": 1.4297518730163574,
1576
+ "rewards/margins": 5.339670658111572,
1577
+ "rewards/rejected": -3.9099185466766357,
1578
+ "step": 940
1579
+ },
1580
+ {
1581
+ "epoch": 2.47,
1582
+ "grad_norm": 3.109375,
1583
+ "learning_rate": 1.921622518534466e-06,
1584
+ "logits/chosen": -2.944477081298828,
1585
+ "logits/rejected": -2.947420835494995,
1586
+ "logps/chosen": -27.510986328125,
1587
+ "logps/rejected": -38.851806640625,
1588
+ "loss": 0.0953,
1589
+ "rewards/accuracies": 0.9375,
1590
+ "rewards/chosen": 0.9855767488479614,
1591
+ "rewards/margins": 5.092670917510986,
1592
+ "rewards/rejected": -4.107093334197998,
1593
+ "step": 950
1594
+ },
1595
+ {
1596
+ "epoch": 2.49,
1597
+ "grad_norm": 0.5859375,
1598
+ "learning_rate": 1.8666466198491794e-06,
1599
+ "logits/chosen": -3.0110037326812744,
1600
+ "logits/rejected": -2.9924447536468506,
1601
+ "logps/chosen": -28.574935913085938,
1602
+ "logps/rejected": -40.97336959838867,
1603
+ "loss": 0.0474,
1604
+ "rewards/accuracies": 0.9750000238418579,
1605
+ "rewards/chosen": 1.5641006231307983,
1606
+ "rewards/margins": 5.6208696365356445,
1607
+ "rewards/rejected": -4.056769371032715,
1608
+ "step": 960
1609
+ },
1610
+ {
1611
+ "epoch": 2.52,
1612
+ "grad_norm": 0.51953125,
1613
+ "learning_rate": 1.8119961086025376e-06,
1614
+ "logits/chosen": -3.15667462348938,
1615
+ "logits/rejected": -3.160064220428467,
1616
+ "logps/chosen": -26.655324935913086,
1617
+ "logps/rejected": -37.378997802734375,
1618
+ "loss": 0.0603,
1619
+ "rewards/accuracies": 0.9624999761581421,
1620
+ "rewards/chosen": 1.3655481338500977,
1621
+ "rewards/margins": 5.361147880554199,
1622
+ "rewards/rejected": -3.9955997467041016,
1623
+ "step": 970
1624
+ },
1625
+ {
1626
+ "epoch": 2.55,
1627
+ "grad_norm": 4.125,
1628
+ "learning_rate": 1.7576990616793139e-06,
1629
+ "logits/chosen": -2.9752297401428223,
1630
+ "logits/rejected": -3.0014469623565674,
1631
+ "logps/chosen": -27.867412567138672,
1632
+ "logps/rejected": -37.458885192871094,
1633
+ "loss": 0.0588,
1634
+ "rewards/accuracies": 0.949999988079071,
1635
+ "rewards/chosen": 1.6494334936141968,
1636
+ "rewards/margins": 5.35553503036499,
1637
+ "rewards/rejected": -3.706101655960083,
1638
+ "step": 980
1639
+ },
1640
+ {
1641
+ "epoch": 2.57,
1642
+ "grad_norm": 3.078125,
1643
+ "learning_rate": 1.7037833743707892e-06,
1644
+ "logits/chosen": -2.8809103965759277,
1645
+ "logits/rejected": -2.879664897918701,
1646
+ "logps/chosen": -31.8229923248291,
1647
+ "logps/rejected": -35.25844955444336,
1648
+ "loss": 0.1466,
1649
+ "rewards/accuracies": 0.9125000238418579,
1650
+ "rewards/chosen": 1.2134218215942383,
1651
+ "rewards/margins": 4.447299003601074,
1652
+ "rewards/rejected": -3.233877182006836,
1653
+ "step": 990
1654
+ },
1655
+ {
1656
+ "epoch": 2.6,
1657
+ "grad_norm": 3.421875,
1658
+ "learning_rate": 1.6502767460434588e-06,
1659
+ "logits/chosen": -3.044959783554077,
1660
+ "logits/rejected": -3.0451231002807617,
1661
+ "logps/chosen": -29.768239974975586,
1662
+ "logps/rejected": -38.18815231323242,
1663
+ "loss": 0.0425,
1664
+ "rewards/accuracies": 0.9750000238418579,
1665
+ "rewards/chosen": 1.273899793624878,
1666
+ "rewards/margins": 4.840017795562744,
1667
+ "rewards/rejected": -3.5661182403564453,
1668
+ "step": 1000
1669
+ },
1670
+ {
1671
+ "epoch": 2.6,
1672
+ "eval_logits/chosen": -2.82666015625,
1673
+ "eval_logits/rejected": -2.8237953186035156,
1674
+ "eval_logps/chosen": -33.11478042602539,
1675
+ "eval_logps/rejected": -37.10795974731445,
1676
+ "eval_loss": 0.7951747179031372,
1677
+ "eval_rewards/accuracies": 0.5826411843299866,
1678
+ "eval_rewards/chosen": -1.099395990371704,
1679
+ "eval_rewards/margins": 0.34648579359054565,
1680
+ "eval_rewards/rejected": -1.445881724357605,
1681
+ "eval_runtime": 112.9893,
1682
+ "eval_samples_per_second": 3.036,
1683
+ "eval_steps_per_second": 0.381,
1684
+ "step": 1000
1685
+ },
1686
+ {
1687
+ "epoch": 2.62,
1688
+ "grad_norm": 23.875,
1689
+ "learning_rate": 1.5972066659083796e-06,
1690
+ "logits/chosen": -2.985290288925171,
1691
+ "logits/rejected": -2.9671034812927246,
1692
+ "logps/chosen": -28.381534576416016,
1693
+ "logps/rejected": -37.65121078491211,
1694
+ "loss": 0.1117,
1695
+ "rewards/accuracies": 0.9375,
1696
+ "rewards/chosen": 1.111913800239563,
1697
+ "rewards/margins": 5.119040489196777,
1698
+ "rewards/rejected": -4.007126808166504,
1699
+ "step": 1010
1700
+ },
1701
+ {
1702
+ "epoch": 2.65,
1703
+ "grad_norm": 26.875,
1704
+ "learning_rate": 1.5446003988985041e-06,
1705
+ "logits/chosen": -2.9878487586975098,
1706
+ "logits/rejected": -2.9929039478302,
1707
+ "logps/chosen": -26.1834774017334,
1708
+ "logps/rejected": -35.29491424560547,
1709
+ "loss": 0.08,
1710
+ "rewards/accuracies": 0.9624999761581421,
1711
+ "rewards/chosen": 1.1799854040145874,
1712
+ "rewards/margins": 5.289762496948242,
1713
+ "rewards/rejected": -4.109776973724365,
1714
+ "step": 1020
1715
+ },
1716
+ {
1717
+ "epoch": 2.68,
1718
+ "grad_norm": 1.25,
1719
+ "learning_rate": 1.4924849716612211e-06,
1720
+ "logits/chosen": -2.9306976795196533,
1721
+ "logits/rejected": -2.9175031185150146,
1722
+ "logps/chosen": -28.533618927001953,
1723
+ "logps/rejected": -37.49860382080078,
1724
+ "loss": 0.0675,
1725
+ "rewards/accuracies": 0.987500011920929,
1726
+ "rewards/chosen": 1.2261769771575928,
1727
+ "rewards/margins": 5.032643795013428,
1728
+ "rewards/rejected": -3.806466579437256,
1729
+ "step": 1030
1730
+ },
1731
+ {
1732
+ "epoch": 2.7,
1733
+ "grad_norm": 3.1875,
1734
+ "learning_rate": 1.440887158673332e-06,
1735
+ "logits/chosen": -2.8356387615203857,
1736
+ "logits/rejected": -2.852898120880127,
1737
+ "logps/chosen": -30.96673011779785,
1738
+ "logps/rejected": -39.19948196411133,
1739
+ "loss": 0.0475,
1740
+ "rewards/accuracies": 0.987500011920929,
1741
+ "rewards/chosen": 1.2193964719772339,
1742
+ "rewards/margins": 5.134554862976074,
1743
+ "rewards/rejected": -3.9151573181152344,
1744
+ "step": 1040
1745
+ },
1746
+ {
1747
+ "epoch": 2.73,
1748
+ "grad_norm": 2.65625,
1749
+ "learning_rate": 1.3898334684855647e-06,
1750
+ "logits/chosen": -2.9140465259552,
1751
+ "logits/rejected": -2.9217119216918945,
1752
+ "logps/chosen": -26.750823974609375,
1753
+ "logps/rejected": -36.618751525878906,
1754
+ "loss": 0.0998,
1755
+ "rewards/accuracies": 0.925000011920929,
1756
+ "rewards/chosen": 1.5606456995010376,
1757
+ "rewards/margins": 5.167923927307129,
1758
+ "rewards/rejected": -3.6072781085968018,
1759
+ "step": 1050
1760
+ },
1761
+ {
1762
+ "epoch": 2.75,
1763
+ "grad_norm": 6.96875,
1764
+ "learning_rate": 1.3393501301037245e-06,
1765
+ "logits/chosen": -3.0197348594665527,
1766
+ "logits/rejected": -3.012899875640869,
1767
+ "logps/chosen": -27.697092056274414,
1768
+ "logps/rejected": -37.08921813964844,
1769
+ "loss": 0.0324,
1770
+ "rewards/accuracies": 1.0,
1771
+ "rewards/chosen": 1.4635608196258545,
1772
+ "rewards/margins": 5.2141265869140625,
1773
+ "rewards/rejected": -3.750566005706787,
1774
+ "step": 1060
1775
+ },
1776
+ {
1777
+ "epoch": 2.78,
1778
+ "grad_norm": 1.5234375,
1779
+ "learning_rate": 1.2894630795134454e-06,
1780
+ "logits/chosen": -2.9200081825256348,
1781
+ "logits/rejected": -2.934675455093384,
1782
+ "logps/chosen": -29.083492279052734,
1783
+ "logps/rejected": -38.6815071105957,
1784
+ "loss": 0.0528,
1785
+ "rewards/accuracies": 0.9750000238418579,
1786
+ "rewards/chosen": 1.4372832775115967,
1787
+ "rewards/margins": 5.663827419281006,
1788
+ "rewards/rejected": -4.226544380187988,
1789
+ "step": 1070
1790
+ },
1791
+ {
1792
+ "epoch": 2.81,
1793
+ "grad_norm": 2.15625,
1794
+ "learning_rate": 1.2401979463554984e-06,
1795
+ "logits/chosen": -2.8056931495666504,
1796
+ "logits/rejected": -2.7873711585998535,
1797
+ "logps/chosen": -29.705490112304688,
1798
+ "logps/rejected": -36.016380310058594,
1799
+ "loss": 0.1643,
1800
+ "rewards/accuracies": 0.925000011920929,
1801
+ "rewards/chosen": 1.2028892040252686,
1802
+ "rewards/margins": 4.897210121154785,
1803
+ "rewards/rejected": -3.6943202018737793,
1804
+ "step": 1080
1805
+ },
1806
+ {
1807
+ "epoch": 2.83,
1808
+ "grad_norm": 1.359375,
1809
+ "learning_rate": 1.1915800407584705e-06,
1810
+ "logits/chosen": -2.8858981132507324,
1811
+ "logits/rejected": -2.875478506088257,
1812
+ "logps/chosen": -29.841075897216797,
1813
+ "logps/rejected": -34.505531311035156,
1814
+ "loss": 0.0712,
1815
+ "rewards/accuracies": 0.949999988079071,
1816
+ "rewards/chosen": 1.508385419845581,
1817
+ "rewards/margins": 5.07808780670166,
1818
+ "rewards/rejected": -3.5697035789489746,
1819
+ "step": 1090
1820
+ },
1821
+ {
1822
+ "epoch": 2.86,
1823
+ "grad_norm": 4.59375,
1824
+ "learning_rate": 1.1436343403356019e-06,
1825
+ "logits/chosen": -3.0002076625823975,
1826
+ "logits/rejected": -2.9881558418273926,
1827
+ "logps/chosen": -29.46624183654785,
1828
+ "logps/rejected": -39.757606506347656,
1829
+ "loss": 0.0878,
1830
+ "rewards/accuracies": 0.9750000238418579,
1831
+ "rewards/chosen": 1.5060511827468872,
1832
+ "rewards/margins": 5.485553741455078,
1833
+ "rewards/rejected": -3.9795022010803223,
1834
+ "step": 1100
1835
+ },
1836
+ {
1837
+ "epoch": 2.86,
1838
+ "eval_logits/chosen": -2.8282854557037354,
1839
+ "eval_logits/rejected": -2.825747489929199,
1840
+ "eval_logps/chosen": -33.10422134399414,
1841
+ "eval_logps/rejected": -37.09624099731445,
1842
+ "eval_loss": 0.7929172515869141,
1843
+ "eval_rewards/accuracies": 0.5888704061508179,
1844
+ "eval_rewards/chosen": -1.0930629968643188,
1845
+ "eval_rewards/margins": 0.3457900285720825,
1846
+ "eval_rewards/rejected": -1.4388530254364014,
1847
+ "eval_runtime": 113.0018,
1848
+ "eval_samples_per_second": 3.035,
1849
+ "eval_steps_per_second": 0.381,
1850
+ "step": 1100
1851
+ },
1852
+ {
1853
+ "epoch": 2.88,
1854
+ "grad_norm": 2.296875,
1855
+ "learning_rate": 1.0963854773524548e-06,
1856
+ "logits/chosen": -2.8882246017456055,
1857
+ "logits/rejected": -2.8708715438842773,
1858
+ "logps/chosen": -30.411975860595703,
1859
+ "logps/rejected": -40.37923049926758,
1860
+ "loss": 0.0861,
1861
+ "rewards/accuracies": 0.949999988079071,
1862
+ "rewards/chosen": 1.3059101104736328,
1863
+ "rewards/margins": 5.2735981941223145,
1864
+ "rewards/rejected": -3.967688798904419,
1865
+ "step": 1110
1866
+ },
1867
+ {
1868
+ "epoch": 2.91,
1869
+ "grad_norm": 1.5234375,
1870
+ "learning_rate": 1.049857726072005e-06,
1871
+ "logits/chosen": -2.9631104469299316,
1872
+ "logits/rejected": -2.9671130180358887,
1873
+ "logps/chosen": -27.215435028076172,
1874
+ "logps/rejected": -36.89886474609375,
1875
+ "loss": 0.0417,
1876
+ "rewards/accuracies": 0.987500011920929,
1877
+ "rewards/chosen": 1.3935225009918213,
1878
+ "rewards/margins": 5.249666690826416,
1879
+ "rewards/rejected": -3.8561434745788574,
1880
+ "step": 1120
1881
+ },
1882
+ {
1883
+ "epoch": 2.94,
1884
+ "grad_norm": 2.1875,
1885
+ "learning_rate": 1.0040749902836508e-06,
1886
+ "logits/chosen": -2.940502643585205,
1887
+ "logits/rejected": -2.9377074241638184,
1888
+ "logps/chosen": -25.027814865112305,
1889
+ "logps/rejected": -33.21368408203125,
1890
+ "loss": 0.0557,
1891
+ "rewards/accuracies": 0.9624999761581421,
1892
+ "rewards/chosen": 1.6164283752441406,
1893
+ "rewards/margins": 5.1664042472839355,
1894
+ "rewards/rejected": -3.549975872039795,
1895
+ "step": 1130
1896
+ },
1897
+ {
1898
+ "epoch": 2.96,
1899
+ "grad_norm": 0.61328125,
1900
+ "learning_rate": 9.59060791022566e-07,
1901
+ "logits/chosen": -2.8224692344665527,
1902
+ "logits/rejected": -2.817772626876831,
1903
+ "logps/chosen": -29.64740562438965,
1904
+ "logps/rejected": -36.215633392333984,
1905
+ "loss": 0.0744,
1906
+ "rewards/accuracies": 0.9624999761581421,
1907
+ "rewards/chosen": 1.1733062267303467,
1908
+ "rewards/margins": 4.800383567810059,
1909
+ "rewards/rejected": -3.6270766258239746,
1910
+ "step": 1140
1911
+ },
1912
+ {
1913
+ "epoch": 2.99,
1914
+ "grad_norm": 3.171875,
1915
+ "learning_rate": 9.148382544856885e-07,
1916
+ "logits/chosen": -3.0086512565612793,
1917
+ "logits/rejected": -3.0138959884643555,
1918
+ "logps/chosen": -24.296131134033203,
1919
+ "logps/rejected": -36.91239547729492,
1920
+ "loss": 0.0624,
1921
+ "rewards/accuracies": 0.9750000238418579,
1922
+ "rewards/chosen": 1.2671399116516113,
1923
+ "rewards/margins": 5.15519905090332,
1924
+ "rewards/rejected": -3.888059139251709,
1925
+ "step": 1150
1926
+ },
1927
+ {
1928
+ "epoch": 3.01,
1929
+ "grad_norm": 0.375,
1930
+ "learning_rate": 8.714301001505568e-07,
1931
+ "logits/chosen": -2.880089521408081,
1932
+ "logits/rejected": -2.867708921432495,
1933
+ "logps/chosen": -28.80812644958496,
1934
+ "logps/rejected": -40.27983093261719,
1935
+ "loss": 0.0588,
1936
+ "rewards/accuracies": 0.9624999761581421,
1937
+ "rewards/chosen": 1.2912962436676025,
1938
+ "rewards/margins": 5.556339740753174,
1939
+ "rewards/rejected": -4.265043258666992,
1940
+ "step": 1160
1941
+ },
1942
+ {
1943
+ "epoch": 3.04,
1944
+ "grad_norm": 2.75,
1945
+ "learning_rate": 8.288586291031025e-07,
1946
+ "logits/chosen": -3.001692771911621,
1947
+ "logits/rejected": -2.9999001026153564,
1948
+ "logps/chosen": -26.936702728271484,
1949
+ "logps/rejected": -37.397525787353516,
1950
+ "loss": 0.0313,
1951
+ "rewards/accuracies": 0.9750000238418579,
1952
+ "rewards/chosen": 1.323664665222168,
1953
+ "rewards/margins": 5.6624555587768555,
1954
+ "rewards/rejected": -4.338790416717529,
1955
+ "step": 1170
1956
+ },
1957
+ {
1958
+ "epoch": 3.06,
1959
+ "grad_norm": 0.236328125,
1960
+ "learning_rate": 7.871457125803897e-07,
1961
+ "logits/chosen": -2.917046308517456,
1962
+ "logits/rejected": -2.911705732345581,
1963
+ "logps/chosen": -31.885528564453125,
1964
+ "logps/rejected": -39.40591049194336,
1965
+ "loss": 0.0893,
1966
+ "rewards/accuracies": 0.949999988079071,
1967
+ "rewards/chosen": 1.519230604171753,
1968
+ "rewards/margins": 5.741326332092285,
1969
+ "rewards/rejected": -4.222095489501953,
1970
+ "step": 1180
1971
+ },
1972
+ {
1973
+ "epoch": 3.09,
1974
+ "grad_norm": 0.98828125,
1975
+ "learning_rate": 7.463127807341966e-07,
1976
+ "logits/chosen": -3.032956838607788,
1977
+ "logits/rejected": -3.034572124481201,
1978
+ "logps/chosen": -30.1025333404541,
1979
+ "logps/rejected": -37.68088150024414,
1980
+ "loss": 0.0492,
1981
+ "rewards/accuracies": 0.9624999761581421,
1982
+ "rewards/chosen": 1.7244186401367188,
1983
+ "rewards/margins": 5.53281307220459,
1984
+ "rewards/rejected": -3.8083953857421875,
1985
+ "step": 1190
1986
+ },
1987
+ {
1988
+ "epoch": 3.12,
1989
+ "grad_norm": 8.5625,
1990
+ "learning_rate": 7.063808116212021e-07,
1991
+ "logits/chosen": -3.031939744949341,
1992
+ "logits/rejected": -3.020230293273926,
1993
+ "logps/chosen": -27.5956974029541,
1994
+ "logps/rejected": -35.825416564941406,
1995
+ "loss": 0.0534,
1996
+ "rewards/accuracies": 0.9624999761581421,
1997
+ "rewards/chosen": 1.5125226974487305,
1998
+ "rewards/margins": 5.0801496505737305,
1999
+ "rewards/rejected": -3.567626953125,
2000
+ "step": 1200
2001
+ },
2002
+ {
2003
+ "epoch": 3.12,
2004
+ "eval_logits/chosen": -2.8285086154937744,
2005
+ "eval_logits/rejected": -2.8258352279663086,
2006
+ "eval_logps/chosen": -33.16932678222656,
2007
+ "eval_logps/rejected": -37.17423629760742,
2008
+ "eval_loss": 0.7996842265129089,
2009
+ "eval_rewards/accuracies": 0.5888704061508179,
2010
+ "eval_rewards/chosen": -1.1321243047714233,
2011
+ "eval_rewards/margins": 0.35352617502212524,
2012
+ "eval_rewards/rejected": -1.4856503009796143,
2013
+ "eval_runtime": 112.7911,
2014
+ "eval_samples_per_second": 3.041,
2015
+ "eval_steps_per_second": 0.381,
2016
+ "step": 1200
2017
+ },
2018
+ {
2019
+ "epoch": 3.14,
2020
+ "grad_norm": 15.0625,
2021
+ "learning_rate": 6.673703204254348e-07,
2022
+ "logits/chosen": -2.9277374744415283,
2023
+ "logits/rejected": -2.9358885288238525,
2024
+ "logps/chosen": -27.505834579467773,
2025
+ "logps/rejected": -35.87134552001953,
2026
+ "loss": 0.0495,
2027
+ "rewards/accuracies": 0.9750000238418579,
2028
+ "rewards/chosen": 1.1557390689849854,
2029
+ "rewards/margins": 5.609216690063477,
2030
+ "rewards/rejected": -4.453477382659912,
2031
+ "step": 1210
2032
+ },
2033
+ {
2034
+ "epoch": 3.17,
2035
+ "grad_norm": 0.51953125,
2036
+ "learning_rate": 6.293013489185315e-07,
2037
+ "logits/chosen": -2.9534084796905518,
2038
+ "logits/rejected": -2.9638400077819824,
2039
+ "logps/chosen": -28.469470977783203,
2040
+ "logps/rejected": -37.100807189941406,
2041
+ "loss": 0.0428,
2042
+ "rewards/accuracies": 0.9750000238418579,
2043
+ "rewards/chosen": 1.8623542785644531,
2044
+ "rewards/margins": 5.414062976837158,
2045
+ "rewards/rejected": -3.551708698272705,
2046
+ "step": 1220
2047
+ },
2048
+ {
2049
+ "epoch": 3.19,
2050
+ "grad_norm": 0.69140625,
2051
+ "learning_rate": 5.921934551632086e-07,
2052
+ "logits/chosen": -2.8607680797576904,
2053
+ "logits/rejected": -2.8647537231445312,
2054
+ "logps/chosen": -28.534021377563477,
2055
+ "logps/rejected": -38.18036651611328,
2056
+ "loss": 0.0965,
2057
+ "rewards/accuracies": 0.8999999761581421,
2058
+ "rewards/chosen": 1.4797532558441162,
2059
+ "rewards/margins": 5.64510440826416,
2060
+ "rewards/rejected": -4.165350914001465,
2061
+ "step": 1230
2062
+ },
2063
+ {
2064
+ "epoch": 3.22,
2065
+ "grad_norm": 0.9921875,
2066
+ "learning_rate": 5.560657034652405e-07,
2067
+ "logits/chosen": -2.927048683166504,
2068
+ "logits/rejected": -2.914555072784424,
2069
+ "logps/chosen": -28.605026245117188,
2070
+ "logps/rejected": -35.72632598876953,
2071
+ "loss": 0.0492,
2072
+ "rewards/accuracies": 0.949999988079071,
2073
+ "rewards/chosen": 1.7661066055297852,
2074
+ "rewards/margins": 5.441458702087402,
2075
+ "rewards/rejected": -3.675352096557617,
2076
+ "step": 1240
2077
+ },
2078
+ {
2079
+ "epoch": 3.25,
2080
+ "grad_norm": 4.09375,
2081
+ "learning_rate": 5.2093665457911e-07,
2082
+ "logits/chosen": -2.8794398307800293,
2083
+ "logits/rejected": -2.883221387863159,
2084
+ "logps/chosen": -29.48556900024414,
2085
+ "logps/rejected": -39.60463333129883,
2086
+ "loss": 0.0413,
2087
+ "rewards/accuracies": 0.9624999761581421,
2088
+ "rewards/chosen": 1.490134596824646,
2089
+ "rewards/margins": 5.7836713790893555,
2090
+ "rewards/rejected": -4.293536186218262,
2091
+ "step": 1250
2092
+ },
2093
+ {
2094
+ "epoch": 3.27,
2095
+ "grad_norm": 0.3359375,
2096
+ "learning_rate": 4.868243561723535e-07,
2097
+ "logits/chosen": -2.925483226776123,
2098
+ "logits/rejected": -2.9221901893615723,
2099
+ "logps/chosen": -26.7332820892334,
2100
+ "logps/rejected": -37.69228744506836,
2101
+ "loss": 0.0422,
2102
+ "rewards/accuracies": 0.9750000238418579,
2103
+ "rewards/chosen": 1.5008256435394287,
2104
+ "rewards/margins": 5.654725074768066,
2105
+ "rewards/rejected": -4.1538987159729,
2106
+ "step": 1260
2107
+ },
2108
+ {
2109
+ "epoch": 3.3,
2110
+ "grad_norm": 0.8203125,
2111
+ "learning_rate": 4.537463335535161e-07,
2112
+ "logits/chosen": -2.9841361045837402,
2113
+ "logits/rejected": -2.9754185676574707,
2114
+ "logps/chosen": -26.88946533203125,
2115
+ "logps/rejected": -36.481143951416016,
2116
+ "loss": 0.0543,
2117
+ "rewards/accuracies": 0.9750000238418579,
2118
+ "rewards/chosen": 1.6370275020599365,
2119
+ "rewards/margins": 5.698369026184082,
2120
+ "rewards/rejected": -4.061341285705566,
2121
+ "step": 1270
2122
+ },
2123
+ {
2124
+ "epoch": 3.32,
2125
+ "grad_norm": 1.671875,
2126
+ "learning_rate": 4.217195806684629e-07,
2127
+ "logits/chosen": -3.086789608001709,
2128
+ "logits/rejected": -3.0812721252441406,
2129
+ "logps/chosen": -29.082738876342773,
2130
+ "logps/rejected": -37.11894989013672,
2131
+ "loss": 0.0392,
2132
+ "rewards/accuracies": 0.9750000238418579,
2133
+ "rewards/chosen": 1.423052191734314,
2134
+ "rewards/margins": 5.66457462310791,
2135
+ "rewards/rejected": -4.241522312164307,
2136
+ "step": 1280
2137
+ },
2138
+ {
2139
+ "epoch": 3.35,
2140
+ "grad_norm": 0.625,
2141
+ "learning_rate": 3.907605513696808e-07,
2142
+ "logits/chosen": -3.1405227184295654,
2143
+ "logits/rejected": -3.136273145675659,
2144
+ "logps/chosen": -28.447912216186523,
2145
+ "logps/rejected": -41.65070724487305,
2146
+ "loss": 0.0934,
2147
+ "rewards/accuracies": 0.9750000238418579,
2148
+ "rewards/chosen": 1.576712965965271,
2149
+ "rewards/margins": 5.907447814941406,
2150
+ "rewards/rejected": -4.330735206604004,
2151
+ "step": 1290
2152
+ },
2153
+ {
2154
+ "epoch": 3.38,
2155
+ "grad_norm": 1.6171875,
2156
+ "learning_rate": 3.6088515096305675e-07,
2157
+ "logits/chosen": -3.004781723022461,
2158
+ "logits/rejected": -2.992069959640503,
2159
+ "logps/chosen": -27.656494140625,
2160
+ "logps/rejected": -36.953086853027344,
2161
+ "loss": 0.035,
2162
+ "rewards/accuracies": 0.9750000238418579,
2163
+ "rewards/chosen": 1.8298753499984741,
2164
+ "rewards/margins": 5.944838047027588,
2165
+ "rewards/rejected": -4.114962577819824,
2166
+ "step": 1300
2167
+ },
2168
+ {
2169
+ "epoch": 3.38,
2170
+ "eval_logits/chosen": -2.829094409942627,
2171
+ "eval_logits/rejected": -2.8265905380249023,
2172
+ "eval_logps/chosen": -33.189910888671875,
2173
+ "eval_logps/rejected": -37.201393127441406,
2174
+ "eval_loss": 0.8024306297302246,
2175
+ "eval_rewards/accuracies": 0.5888704061508179,
2176
+ "eval_rewards/chosen": -1.144473671913147,
2177
+ "eval_rewards/margins": 0.35747113823890686,
2178
+ "eval_rewards/rejected": -1.5019447803497314,
2179
+ "eval_runtime": 112.8427,
2180
+ "eval_samples_per_second": 3.04,
2181
+ "eval_steps_per_second": 0.381,
2182
+ "step": 1300
2183
+ },
2184
+ {
2185
+ "epoch": 3.4,
2186
+ "grad_norm": 0.2265625,
2187
+ "learning_rate": 3.321087280364757e-07,
2188
+ "logits/chosen": -2.9668993949890137,
2189
+ "logits/rejected": -2.9619083404541016,
2190
+ "logps/chosen": -25.407943725585938,
2191
+ "logps/rejected": -39.095542907714844,
2192
+ "loss": 0.0595,
2193
+ "rewards/accuracies": 0.9375,
2194
+ "rewards/chosen": 1.6230905055999756,
2195
+ "rewards/margins": 5.879204750061035,
2196
+ "rewards/rejected": -4.256114482879639,
2197
+ "step": 1310
2198
+ },
2199
+ {
2200
+ "epoch": 3.43,
2201
+ "grad_norm": 2.609375,
2202
+ "learning_rate": 3.044460665744284e-07,
2203
+ "logits/chosen": -2.921325206756592,
2204
+ "logits/rejected": -2.9301552772521973,
2205
+ "logps/chosen": -27.841144561767578,
2206
+ "logps/rejected": -38.21234130859375,
2207
+ "loss": 0.034,
2208
+ "rewards/accuracies": 0.9750000238418579,
2209
+ "rewards/chosen": 1.3703900575637817,
2210
+ "rewards/margins": 5.775041580200195,
2211
+ "rewards/rejected": -4.404651165008545,
2212
+ "step": 1320
2213
+ },
2214
+ {
2215
+ "epoch": 3.45,
2216
+ "grad_norm": 0.5,
2217
+ "learning_rate": 2.779113783626916e-07,
2218
+ "logits/chosen": -2.928009510040283,
2219
+ "logits/rejected": -2.9167959690093994,
2220
+ "logps/chosen": -27.176036834716797,
2221
+ "logps/rejected": -38.40895462036133,
2222
+ "loss": 0.0451,
2223
+ "rewards/accuracies": 0.9624999761581421,
2224
+ "rewards/chosen": 1.5559582710266113,
2225
+ "rewards/margins": 5.725183963775635,
2226
+ "rewards/rejected": -4.169225692749023,
2227
+ "step": 1330
2228
+ },
2229
+ {
2230
+ "epoch": 3.48,
2231
+ "grad_norm": 2.03125,
2232
+ "learning_rate": 2.5251829568697204e-07,
2233
+ "logits/chosen": -3.1457560062408447,
2234
+ "logits/rejected": -3.1327528953552246,
2235
+ "logps/chosen": -28.73006248474121,
2236
+ "logps/rejected": -34.102195739746094,
2237
+ "loss": 0.0314,
2238
+ "rewards/accuracies": 1.0,
2239
+ "rewards/chosen": 1.7201951742172241,
2240
+ "rewards/margins": 5.409661293029785,
2241
+ "rewards/rejected": -3.689465284347534,
2242
+ "step": 1340
2243
+ },
2244
+ {
2245
+ "epoch": 3.51,
2246
+ "grad_norm": 0.7578125,
2247
+ "learning_rate": 2.2827986432927774e-07,
2248
+ "logits/chosen": -3.0063319206237793,
2249
+ "logits/rejected": -2.9997639656066895,
2250
+ "logps/chosen": -27.629440307617188,
2251
+ "logps/rejected": -38.41633605957031,
2252
+ "loss": 0.0178,
2253
+ "rewards/accuracies": 0.987500011920929,
2254
+ "rewards/chosen": 1.9246017932891846,
2255
+ "rewards/margins": 6.342138290405273,
2256
+ "rewards/rejected": -4.417536735534668,
2257
+ "step": 1350
2258
+ },
2259
+ {
2260
+ "epoch": 3.53,
2261
+ "grad_norm": 5.21875,
2262
+ "learning_rate": 2.0520853686560177e-07,
2263
+ "logits/chosen": -2.905151844024658,
2264
+ "logits/rejected": -2.89668345451355,
2265
+ "logps/chosen": -29.729381561279297,
2266
+ "logps/rejected": -39.10436248779297,
2267
+ "loss": 0.0216,
2268
+ "rewards/accuracies": 0.987500011920929,
2269
+ "rewards/chosen": 1.1899113655090332,
2270
+ "rewards/margins": 5.677424907684326,
2271
+ "rewards/rejected": -4.487513542175293,
2272
+ "step": 1360
2273
+ },
2274
+ {
2275
+ "epoch": 3.56,
2276
+ "grad_norm": 0.625,
2277
+ "learning_rate": 1.833161662683672e-07,
2278
+ "logits/chosen": -2.9205312728881836,
2279
+ "logits/rejected": -2.9178478717803955,
2280
+ "logps/chosen": -27.772390365600586,
2281
+ "logps/rejected": -35.85506057739258,
2282
+ "loss": 0.0327,
2283
+ "rewards/accuracies": 0.9750000238418579,
2284
+ "rewards/chosen": 1.4194756746292114,
2285
+ "rewards/margins": 5.484891414642334,
2286
+ "rewards/rejected": -4.065415859222412,
2287
+ "step": 1370
2288
+ },
2289
+ {
2290
+ "epoch": 3.58,
2291
+ "grad_norm": 0.5859375,
2292
+ "learning_rate": 1.626139998169246e-07,
2293
+ "logits/chosen": -2.941387414932251,
2294
+ "logits/rejected": -2.953021287918091,
2295
+ "logps/chosen": -26.46402931213379,
2296
+ "logps/rejected": -34.60562515258789,
2297
+ "loss": 0.0423,
2298
+ "rewards/accuracies": 0.9624999761581421,
2299
+ "rewards/chosen": 1.3287296295166016,
2300
+ "rewards/margins": 5.227254390716553,
2301
+ "rewards/rejected": -3.8985252380371094,
2302
+ "step": 1380
2303
+ },
2304
+ {
2305
+ "epoch": 3.61,
2306
+ "grad_norm": 3.6875,
2307
+ "learning_rate": 1.4311267331922535e-07,
2308
+ "logits/chosen": -3.097482681274414,
2309
+ "logits/rejected": -3.1019105911254883,
2310
+ "logps/chosen": -27.129032135009766,
2311
+ "logps/rejected": -37.47324752807617,
2312
+ "loss": 0.0338,
2313
+ "rewards/accuracies": 0.9750000238418579,
2314
+ "rewards/chosen": 1.2118251323699951,
2315
+ "rewards/margins": 5.151190757751465,
2316
+ "rewards/rejected": -3.939366579055786,
2317
+ "step": 1390
2318
+ },
2319
+ {
2320
+ "epoch": 3.64,
2321
+ "grad_norm": 0.90234375,
2322
+ "learning_rate": 1.2482220564763669e-07,
2323
+ "logits/chosen": -2.914987564086914,
2324
+ "logits/rejected": -2.896317958831787,
2325
+ "logps/chosen": -30.679147720336914,
2326
+ "logps/rejected": -36.49955368041992,
2327
+ "loss": 0.0126,
2328
+ "rewards/accuracies": 1.0,
2329
+ "rewards/chosen": 1.818377137184143,
2330
+ "rewards/margins": 5.521285533905029,
2331
+ "rewards/rejected": -3.702908754348755,
2332
+ "step": 1400
2333
+ },
2334
+ {
2335
+ "epoch": 3.64,
2336
+ "eval_logits/chosen": -2.8293960094451904,
2337
+ "eval_logits/rejected": -2.8266942501068115,
2338
+ "eval_logps/chosen": -33.2208366394043,
2339
+ "eval_logps/rejected": -37.21278381347656,
2340
+ "eval_loss": 0.8125642538070679,
2341
+ "eval_rewards/accuracies": 0.5859634280204773,
2342
+ "eval_rewards/chosen": -1.1630306243896484,
2343
+ "eval_rewards/margins": 0.3457449674606323,
2344
+ "eval_rewards/rejected": -1.5087755918502808,
2345
+ "eval_runtime": 112.9366,
2346
+ "eval_samples_per_second": 3.037,
2347
+ "eval_steps_per_second": 0.381,
2348
+ "step": 1400
2349
+ },
2350
+ {
2351
+ "epoch": 3.66,
2352
+ "grad_norm": 1.4140625,
2353
+ "learning_rate": 1.0775199359171346e-07,
2354
+ "logits/chosen": -2.8738229274749756,
2355
+ "logits/rejected": -2.89139986038208,
2356
+ "logps/chosen": -29.04506492614746,
2357
+ "logps/rejected": -38.35009002685547,
2358
+ "loss": 0.0338,
2359
+ "rewards/accuracies": 0.9750000238418579,
2360
+ "rewards/chosen": 1.687996506690979,
2361
+ "rewards/margins": 5.789109230041504,
2362
+ "rewards/rejected": -4.101112365722656,
2363
+ "step": 1410
2364
+ },
2365
+ {
2366
+ "epoch": 3.69,
2367
+ "grad_norm": 4.0625,
2368
+ "learning_rate": 9.191080703056604e-08,
2369
+ "logits/chosen": -3.110901355743408,
2370
+ "logits/rejected": -3.107856273651123,
2371
+ "logps/chosen": -28.8006534576416,
2372
+ "logps/rejected": -36.73194122314453,
2373
+ "loss": 0.0216,
2374
+ "rewards/accuracies": 0.987500011920929,
2375
+ "rewards/chosen": 1.4968552589416504,
2376
+ "rewards/margins": 5.476016044616699,
2377
+ "rewards/rejected": -3.979161024093628,
2378
+ "step": 1420
2379
+ },
2380
+ {
2381
+ "epoch": 3.71,
2382
+ "grad_norm": 1.609375,
2383
+ "learning_rate": 7.730678442730539e-08,
2384
+ "logits/chosen": -2.9601540565490723,
2385
+ "logits/rejected": -2.9479076862335205,
2386
+ "logps/chosen": -30.2454776763916,
2387
+ "logps/rejected": -38.509239196777344,
2388
+ "loss": 0.0347,
2389
+ "rewards/accuracies": 0.9624999761581421,
2390
+ "rewards/chosen": 1.5139565467834473,
2391
+ "rewards/margins": 5.956180572509766,
2392
+ "rewards/rejected": -4.44222354888916,
2393
+ "step": 1430
2394
+ },
2395
+ {
2396
+ "epoch": 3.74,
2397
+ "grad_norm": 1.7890625,
2398
+ "learning_rate": 6.394742864787806e-08,
2399
+ "logits/chosen": -2.9412589073181152,
2400
+ "logits/rejected": -2.9450554847717285,
2401
+ "logps/chosen": -28.570653915405273,
2402
+ "logps/rejected": -39.62348175048828,
2403
+ "loss": 0.035,
2404
+ "rewards/accuracies": 0.987500011920929,
2405
+ "rewards/chosen": 1.05934739112854,
2406
+ "rewards/margins": 5.714109420776367,
2407
+ "rewards/rejected": -4.654762268066406,
2408
+ "step": 1440
2409
+ },
2410
+ {
2411
+ "epoch": 3.77,
2412
+ "grad_norm": 1.109375,
2413
+ "learning_rate": 5.183960310644748e-08,
2414
+ "logits/chosen": -2.916991710662842,
2415
+ "logits/rejected": -2.9185004234313965,
2416
+ "logps/chosen": -30.148340225219727,
2417
+ "logps/rejected": -38.64313507080078,
2418
+ "loss": 0.0452,
2419
+ "rewards/accuracies": 0.9624999761581421,
2420
+ "rewards/chosen": 1.505264163017273,
2421
+ "rewards/margins": 5.883156776428223,
2422
+ "rewards/rejected": -4.37789249420166,
2423
+ "step": 1450
2424
+ },
2425
+ {
2426
+ "epoch": 3.79,
2427
+ "grad_norm": 3.25,
2428
+ "learning_rate": 4.098952823928693e-08,
2429
+ "logits/chosen": -2.9078240394592285,
2430
+ "logits/rejected": -2.9042093753814697,
2431
+ "logps/chosen": -27.049224853515625,
2432
+ "logps/rejected": -37.686256408691406,
2433
+ "loss": 0.0372,
2434
+ "rewards/accuracies": 0.9750000238418579,
2435
+ "rewards/chosen": 1.3148393630981445,
2436
+ "rewards/margins": 5.67122745513916,
2437
+ "rewards/rejected": -4.356388568878174,
2438
+ "step": 1460
2439
+ },
2440
+ {
2441
+ "epoch": 3.82,
2442
+ "grad_norm": 0.6171875,
2443
+ "learning_rate": 3.1402778309014284e-08,
2444
+ "logits/chosen": -2.910646915435791,
2445
+ "logits/rejected": -2.896136999130249,
2446
+ "logps/chosen": -25.80971336364746,
2447
+ "logps/rejected": -35.506134033203125,
2448
+ "loss": 0.0511,
2449
+ "rewards/accuracies": 0.949999988079071,
2450
+ "rewards/chosen": 1.3185876607894897,
2451
+ "rewards/margins": 5.667901039123535,
2452
+ "rewards/rejected": -4.349313259124756,
2453
+ "step": 1470
2454
+ },
2455
+ {
2456
+ "epoch": 3.84,
2457
+ "grad_norm": 1.1796875,
2458
+ "learning_rate": 2.3084278540791427e-08,
2459
+ "logits/chosen": -2.8877739906311035,
2460
+ "logits/rejected": -2.8964595794677734,
2461
+ "logps/chosen": -29.988643646240234,
2462
+ "logps/rejected": -36.62062072753906,
2463
+ "loss": 0.0204,
2464
+ "rewards/accuracies": 1.0,
2465
+ "rewards/chosen": 1.4086239337921143,
2466
+ "rewards/margins": 5.426566123962402,
2467
+ "rewards/rejected": -4.017942428588867,
2468
+ "step": 1480
2469
+ },
2470
+ {
2471
+ "epoch": 3.87,
2472
+ "grad_norm": 1.53125,
2473
+ "learning_rate": 1.6038302591975807e-08,
2474
+ "logits/chosen": -2.9591662883758545,
2475
+ "logits/rejected": -2.951643466949463,
2476
+ "logps/chosen": -24.451513290405273,
2477
+ "logps/rejected": -32.70746612548828,
2478
+ "loss": 0.0734,
2479
+ "rewards/accuracies": 0.925000011920929,
2480
+ "rewards/chosen": 1.471919059753418,
2481
+ "rewards/margins": 5.120394229888916,
2482
+ "rewards/rejected": -3.648474931716919,
2483
+ "step": 1490
2484
+ },
2485
+ {
2486
+ "epoch": 3.9,
2487
+ "grad_norm": 3.21875,
2488
+ "learning_rate": 1.0268470356514237e-08,
2489
+ "logits/chosen": -2.883286952972412,
2490
+ "logits/rejected": -2.896113395690918,
2491
+ "logps/chosen": -27.21588706970215,
2492
+ "logps/rejected": -34.79141616821289,
2493
+ "loss": 0.0525,
2494
+ "rewards/accuracies": 0.9624999761581421,
2495
+ "rewards/chosen": 1.690004587173462,
2496
+ "rewards/margins": 5.427872657775879,
2497
+ "rewards/rejected": -3.737868547439575,
2498
+ "step": 1500
2499
+ },
2500
+ {
2501
+ "epoch": 3.9,
2502
+ "eval_logits/chosen": -2.829244613647461,
2503
+ "eval_logits/rejected": -2.8264925479888916,
2504
+ "eval_logps/chosen": -33.229942321777344,
2505
+ "eval_logps/rejected": -37.220787048339844,
2506
+ "eval_loss": 0.808788537979126,
2507
+ "eval_rewards/accuracies": 0.5917773842811584,
2508
+ "eval_rewards/chosen": -1.168494701385498,
2509
+ "eval_rewards/margins": 0.34508630633354187,
2510
+ "eval_rewards/rejected": -1.5135811567306519,
2511
+ "eval_runtime": 113.0108,
2512
+ "eval_samples_per_second": 3.035,
2513
+ "eval_steps_per_second": 0.38,
2514
+ "step": 1500
2515
+ },
2516
+ {
2517
+ "epoch": 3.92,
2518
+ "grad_norm": 0.3125,
2519
+ "learning_rate": 5.777746105209147e-09,
2520
+ "logits/chosen": -3.059047222137451,
2521
+ "logits/rejected": -3.0689499378204346,
2522
+ "logps/chosen": -30.036035537719727,
2523
+ "logps/rejected": -38.22625732421875,
2524
+ "loss": 0.0788,
2525
+ "rewards/accuracies": 0.9125000238418579,
2526
+ "rewards/chosen": 1.6072747707366943,
2527
+ "rewards/margins": 5.195981502532959,
2528
+ "rewards/rejected": -3.5887062549591064,
2529
+ "step": 1510
2530
+ },
2531
+ {
2532
+ "epoch": 3.95,
2533
+ "grad_norm": 2.3125,
2534
+ "learning_rate": 2.5684369628148352e-09,
2535
+ "logits/chosen": -2.882291555404663,
2536
+ "logits/rejected": -2.8862195014953613,
2537
+ "logps/chosen": -26.208663940429688,
2538
+ "logps/rejected": -37.17507553100586,
2539
+ "loss": 0.0545,
2540
+ "rewards/accuracies": 0.9624999761581421,
2541
+ "rewards/chosen": 1.587720274925232,
2542
+ "rewards/margins": 5.605459690093994,
2543
+ "rewards/rejected": -4.017739295959473,
2544
+ "step": 1520
2545
+ },
2546
+ {
2547
+ "epoch": 3.97,
2548
+ "grad_norm": 0.7421875,
2549
+ "learning_rate": 6.421917227455999e-10,
2550
+ "logits/chosen": -2.954590320587158,
2551
+ "logits/rejected": -2.955815315246582,
2552
+ "logps/chosen": -23.254398345947266,
2553
+ "logps/rejected": -32.172245025634766,
2554
+ "loss": 0.0378,
2555
+ "rewards/accuracies": 0.987500011920929,
2556
+ "rewards/chosen": 1.555316686630249,
2557
+ "rewards/margins": 5.062749862670898,
2558
+ "rewards/rejected": -3.5074334144592285,
2559
+ "step": 1530
2560
+ },
2561
+ {
2562
+ "epoch": 4.0,
2563
+ "grad_norm": 1.8203125,
2564
+ "learning_rate": 0.0,
2565
+ "logits/chosen": -2.8720169067382812,
2566
+ "logits/rejected": -2.8812081813812256,
2567
+ "logps/chosen": -28.843318939208984,
2568
+ "logps/rejected": -40.02803421020508,
2569
+ "loss": 0.0247,
2570
+ "rewards/accuracies": 0.987500011920929,
2571
+ "rewards/chosen": 1.1202117204666138,
2572
+ "rewards/margins": 5.758116245269775,
2573
+ "rewards/rejected": -4.637904644012451,
2574
+ "step": 1540
2575
+ },
2576
+ {
2577
+ "epoch": 4.0,
2578
+ "step": 1540,
2579
+ "total_flos": 0.0,
2580
+ "train_loss": 0.21652605050763526,
2581
+ "train_runtime": 11213.836,
2582
+ "train_samples_per_second": 1.098,
2583
+ "train_steps_per_second": 0.137
2584
+ }
2585
+ ],
2586
+ "logging_steps": 10,
2587
+ "max_steps": 1540,
2588
+ "num_input_tokens_seen": 0,
2589
+ "num_train_epochs": 4,
2590
+ "save_steps": 100,
2591
+ "total_flos": 0.0,
2592
+ "train_batch_size": 4,
2593
+ "trial_name": null,
2594
+ "trial_params": null
2595
+ }
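
The records above are the Trainer's periodic logging and evaluation entries (reward margins, accuracies, log-probs, eval loss) for each checkpoint up to step 1540. A minimal sketch of how one might inspect them offline is shown below, assuming the JSON above is saved as a standard `trainer_state.json` under a local checkpoint directory; the path and output file name are hypothetical, not part of this repository.

```python
# Minimal sketch (assumptions noted above): load the Trainer's log history and
# plot how eval loss and the preference reward margin evolve over training.
import json

import matplotlib.pyplot as plt

STATE_PATH = "checkpoint-1540/trainer_state.json"  # hypothetical local path

with open(STATE_PATH) as f:
    state = json.load(f)

# Keep only the periodic evaluation records; they carry the eval_* keys.
eval_records = [r for r in state["log_history"] if "eval_loss" in r]

steps = [r["step"] for r in eval_records]
eval_loss = [r["eval_loss"] for r in eval_records]
margins = [r["eval_rewards/margins"] for r in eval_records]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(steps, eval_loss, marker="o")
ax1.set_xlabel("step")
ax1.set_ylabel("eval_loss")
ax2.plot(steps, margins, marker="o")
ax2.set_xlabel("step")
ax2.set_ylabel("eval_rewards/margins")
fig.tight_layout()
fig.savefig("align_scan_eval_curves.png")  # hypothetical output file
```

Reading the curves this way makes the pattern in the data above easy to see: the training-batch reward margins keep growing while the held-out eval loss stops improving after roughly the first epoch, which is worth keeping in mind when choosing a checkpoint.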