ds_chat_sppo_hard_cosine_iter0_2024-09-16-16.38

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat, trained on three datasets: self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized. It achieves the following results on the evaluation set (these values match the final row, step 2200, of the training results table):

Loss: 4957.3081
Rewards/chosen: 0.0206
Rewards/rejected: -0.0002
Rewards/accuracies: 0.3026
Rewards/margins: 0.0208
Logps/rejected: -63.9058
Logps/chosen: -121.0837
Logits/rejected: 1.7198
Logits/chosen: 1.6603
Debug/policy Chosen Logits: 1.6603
Debug/policy Rejected Logits: 1.7198
Debug/policy Chosen Logps: -121.0837
Debug/policy Rejected Logps: -63.9058
Debug/reference Chosen Logps: -123.1481
Debug/reference Rejected Logps: -63.8871
Debug/sppo Chosen Reward In Loss: 2.0643
Debug/sppo Rej Reward In Loss: -0.0187
Debug/sppo Chosen Loss: 2387.4246
Debug/sppo Reject Loss: 2498.1609
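
The debug metrics above can be cross-checked against each other by hand. The sketch below uses only the final evaluation numbers from this card. It assumes that the "Reward In Loss" values are the policy log-probability minus the reference log-probability, and that Rewards/chosen is that difference scaled by a coefficient beta. The card states neither the definition nor the value of beta, so both are inferences, and the variable names are illustrative.

```python
# Final evaluation metrics copied from this card.
policy_chosen_logps = -121.0837
policy_rejected_logps = -63.9058
reference_chosen_logps = -123.1481
reference_rejected_logps = -63.8871
rewards_chosen = 0.0206
rewards_rejected = -0.0002

# Assumption: "Reward In Loss" = policy log-prob minus reference log-prob.
chosen_log_ratio = policy_chosen_logps - reference_chosen_logps        # card reports 2.0643
rejected_log_ratio = policy_rejected_logps - reference_rejected_logps  # card reports -0.0187

# The margin is the gap between the chosen and rejected rewards.
rewards_margin = rewards_chosen - rewards_rejected                     # card reports 0.0208

# Assumption: Rewards/chosen = beta * chosen_log_ratio, which lets us estimate beta.
implied_beta = rewards_chosen / chosen_log_ratio

print(f"chosen log-ratio:   {chosen_log_ratio:.4f}")
print(f"rejected log-ratio: {rejected_log_ratio:.4f}")
print(f"rewards margin:     {rewards_margin:.4f}")
print(f"implied beta:       {implied_beta:.4f}")
```

Under these assumptions the implied beta comes out near 0.01, although rounding in the reported rewards makes the estimate coarse.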

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-07
train_batch_size: 8
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
lr_scheduler_warmup_steps: 100
num_epochs: 8.0
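
Two relationships in this list are worth spelling out: how the total batch size follows from the per-device size, and what a cosine schedule with warmup does to the learning rate. The sketch below is a plain-Python illustration, not the training code. It assumes gradient accumulation of 1 (not listed in this card), a total of about 2208 optimizer steps (estimated from the results table, where step 100 falls at epoch 0.3623), and the common linear-warmup plus half-cosine-decay shape. The run may have used a customized schedule, so treat the curve as indicative only. In the Transformers Trainer, a positive warmup_steps takes precedence over warmup_ratio, so the sketch uses 100 warmup steps.

```python
import math

PEAK_LR = 1e-7
PER_DEVICE_TRAIN_BATCH = 8
NUM_DEVICES = 8
GRAD_ACCUM_STEPS = 1   # assumption: not listed in the card
WARMUP_STEPS = 100
TOTAL_STEPS = 2208     # assumption: ~276 steps/epoch * 8 epochs, estimated from the table


def effective_batch_size(per_device, num_devices, grad_accum):
    """Number of examples contributing to one optimizer step."""
    return per_device * num_devices * grad_accum


def cosine_lr(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


print("total_train_batch_size:",
      effective_batch_size(PER_DEVICE_TRAIN_BATCH, NUM_DEVICES, GRAD_ACCUM_STEPS))
for step in (0, 50, 100, 1154, 2208):
    print(step, f"{cosine_lr(step):.3e}")
```

With these settings the learning rate peaks at 1e-07 at step 100 and decays toward zero by the last step, which is consistent with the small changes in the metrics late in training.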

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen	Debug/policy Chosen Logits	Debug/policy Rejected Logits	Debug/policy Chosen Logps	Debug/policy Rejected Logps	Debug/reference Chosen Logps	Debug/reference Rejected Logps	Debug/sppo Chosen Reward In Loss	Debug/sppo Rej Reward In Loss	Debug/sppo Chosen Loss	Debug/sppo Reject Loss
4999.5461	0.3623	100	4988.0952	0.0050	0.0020	0.2763	0.0031	-63.6883	-122.6432	1.7269	1.6642	1.6642	1.7269	-122.6432	-63.6883	-123.1481	-63.8871	0.5049	0.1988	2453.1523	2523.2144
5011.4531	0.7246	200	4990.5610	0.0177	0.0058	0.3158	0.0119	-63.3097	-121.3786	1.7330	1.6732	1.6732	1.7330	-121.3786	-63.3097	-123.1481	-63.8871	1.7695	0.5774	2386.0396	2582.6948
4987.3762	1.0870	300	4987.7910	0.0199	0.0061	0.2632	0.0137	-63.2725	-121.1585	1.7421	1.6830	1.6830	1.7421	-121.1585	-63.2725	-123.1481	-63.8871	1.9895	0.6145	2385.2695	2590.7976
5014.9531	1.4493	400	4983.8423	0.0200	0.0047	0.2632	0.0152	-63.4148	-121.1519	1.7308	1.6711	1.6711	1.7308	-121.1519	-63.4148	-123.1481	-63.8871	1.9962	0.4722	2383.6707	2565.9753
5006.941	1.8116	500	4965.4326	0.0117	-0.0005	0.3158	0.0122	-63.9328	-121.9733	1.7113	1.6503	1.6503	1.7113	-121.9733	-63.9328	-123.1481	-63.8871	1.1748	-0.0457	2416.3770	2495.6252
4945.2656	2.1739	600	4971.4199	0.0165	0.0030	0.2632	0.0134	-63.5826	-121.4996	1.7310	1.6724	1.6724	1.7310	-121.4996	-63.5826	-123.1481	-63.8871	1.6485	0.3045	2391.6709	2537.9797
5016.1723	2.5362	700	4956.6055	0.0193	0.0038	0.3684	0.0155	-63.5097	-121.2218	1.7528	1.6919	1.6919	1.7528	-121.2218	-63.5097	-123.1481	-63.8871	1.9263	0.3774	2372.3936	2549.7046
4980.475	2.8986	800	4967.6992	0.0217	0.0048	0.3421	0.0169	-63.4108	-120.9796	1.7533	1.6937	1.6937	1.7533	-120.9796	-63.4108	-123.1481	-63.8871	2.1685	0.4763	2370.3362	2566.8535
4962.825	3.2609	900	4973.9316	0.0239	0.0047	0.3026	0.0192	-63.4168	-120.7541	1.7347	1.6754	1.6754	1.7347	-120.7541	-63.4168	-123.1481	-63.8871	2.3940	0.4702	2374.9814	2564.9277
4960.6797	3.6232	1000	4954.9062	0.0185	0.0027	0.3553	0.0158	-63.6219	-121.2982	1.7363	1.6773	1.6773	1.7363	-121.2982	-63.6219	-123.1481	-63.8871	1.8498	0.2651	2376.7742	2531.5662
4996.0746	3.9855	1100	4978.2021	0.0089	-0.0022	0.3684	0.0112	-64.1119	-122.2532	1.6884	1.6291	1.6291	1.6884	-122.2532	-64.1119	-123.1481	-63.8871	0.8949	-0.2249	2438.2773	2479.8074
4988.032	4.3478	1200	4952.4019	0.0171	-0.0003	0.3816	0.0174	-63.9132	-121.4333	1.7223	1.6634	1.6634	1.7223	-121.4333	-63.9132	-123.1481	-63.8871	1.7148	-0.0261	2381.5840	2497.4338
4982.1008	4.7101	1300	4951.4316	0.0171	-0.0003	0.3553	0.0174	-63.9127	-121.4370	1.7192	1.6602	1.6602	1.7192	-121.4370	-63.9127	-123.1481	-63.8871	1.7111	-0.0257	2388.1934	2497.4824
4966.7375	5.0725	1400	4954.5615	0.0185	0.0008	0.3289	0.0177	-63.8112	-121.3000	1.7216	1.6631	1.6631	1.7216	-121.3000	-63.8112	-123.1481	-63.8871	1.8480	0.0759	2383.4727	2508.1672
4937.6176	5.4348	1500	4952.7949	0.0157	-0.0019	0.3289	0.0176	-64.0738	-121.5761	1.7099	1.6508	1.6508	1.7099	-121.5761	-64.0738	-123.1481	-63.8871	1.5720	-0.1868	2396.6667	2483.3738
4969.5398	5.7971	1600	4948.7925	0.0184	-0.0001	0.3289	0.0186	-63.8999	-121.3049	1.7190	1.6601	1.6601	1.7190	-121.3049	-63.8999	-123.1481	-63.8871	1.8432	-0.0128	2383.5056	2498.8604
4931.8516	6.1594	1700	4959.4023	0.0213	0.0026	0.2632	0.0188	-63.6300	-121.0142	1.7206	1.6597	1.6597	1.7206	-121.0142	-63.6300	-123.1481	-63.8871	2.1339	0.2570	2381.4475	2532.8616
4953.9797	6.5217	1800	4962.0317	0.0210	0.0004	0.2895	0.0206	-63.8433	-121.0445	1.7201	1.6602	1.6602	1.7201	-121.0445	-63.8433	-123.1481	-63.8871	2.1036	0.0438	2382.3406	2504.5334
4965.893	6.8841	1900	4953.7192	0.0187	0.0005	0.3289	0.0182	-63.8390	-121.2794	1.7207	1.6619	1.6619	1.7207	-121.2794	-63.8390	-123.1481	-63.8871	1.8687	0.0481	2383.2534	2505.0400
4950.5336	7.2464	2000	4958.1733	0.0211	0.0004	0.3158	0.0207	-63.8483	-121.0380	1.7193	1.6611	1.6611	1.7193	-121.0380	-63.8483	-123.1481	-63.8871	2.1101	0.0387	2382.7937	2504.2783
4966.3176	7.6087	2100	4951.5176	0.0195	-0.0005	0.3816	0.0200	-63.9397	-121.2030	1.7190	1.6607	1.6607	1.7190	-121.2030	-63.9397	-123.1481	-63.8871	1.9451	-0.0526	2381.8259	2494.8140
4946.1824	7.9710	2200	4957.3081	0.0206	-0.0002	0.3026	0.0208	-63.9058	-121.0837	1.7198	1.6603	1.6603	1.7198	-121.0837	-63.9058	-123.1481	-63.8871	2.0643	-0.0187	2387.4246	2498.1609
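
Validation loss in the table does not decrease monotonically, so the last checkpoint is not necessarily the one with the lowest loss. The sketch below copies the Step, Validation Loss, and Rewards/accuracies columns from the table and picks the row with the lowest validation loss. Selecting by validation loss is only one possible criterion, chosen here for illustration; the headline evaluation results in this card correspond to the final row (step 2200).

```python
# (step, validation_loss, rewards_accuracy) copied from the table above.
rows = [
    (100, 4988.0952, 0.2763), (200, 4990.5610, 0.3158), (300, 4987.7910, 0.2632),
    (400, 4983.8423, 0.2632), (500, 4965.4326, 0.3158), (600, 4971.4199, 0.2632),
    (700, 4956.6055, 0.3684), (800, 4967.6992, 0.3421), (900, 4973.9316, 0.3026),
    (1000, 4954.9062, 0.3553), (1100, 4978.2021, 0.3684), (1200, 4952.4019, 0.3816),
    (1300, 4951.4316, 0.3553), (1400, 4954.5615, 0.3289), (1500, 4952.7949, 0.3289),
    (1600, 4948.7925, 0.3289), (1700, 4959.4023, 0.2632), (1800, 4962.0317, 0.2895),
    (1900, 4953.7192, 0.3289), (2000, 4958.1733, 0.3158), (2100, 4951.5176, 0.3816),
    (2200, 4957.3081, 0.3026),
]

# Row with the lowest validation loss.
best_step, best_loss, best_step_acc = min(rows, key=lambda r: r[1])

# Highest reward accuracy seen at any evaluation step.
max_accuracy = max(acc for _, _, acc in rows)

print(f"lowest validation loss: {best_loss} at step {best_step}")  # 4948.7925 at step 1600
print(f"highest rewards/accuracies: {max_accuracy}")               # 0.3816
```

Reward accuracy stays below 0.5 at every evaluation step, which is worth keeping in mind when interpreting the small positive reward margins.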

Framework versions

Transformers 4.42.0
Pytorch 2.3.0+cu121
Datasets 2.14.6
Tokenizers 0.19.1

yiran-wang3/ds_chat_sppo_hard_cosine_iter0_masked_cosine_schedule
