Tags: Safetensors · llama · alignment-handbook · trl · dpo · Generated from Trainer


ds_chat_sppo_hard_new_iter0_2024-09-14-21.15

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set (a minimal usage sketch follows the list):

  • Loss: 0.4951
  • Rewards/chosen: 0.0190
  • Rewards/rejected: -0.0009
  • Rewards/accuracies: 0.3684
  • Rewards/margins: 0.0199
  • Logps/rejected: -63.9738
  • Logps/chosen: -121.2440
  • Logits/rejected: 1.7159
  • Logits/chosen: 1.6562
  • Debug/policy Chosen Logits: 1.6562
  • Debug/policy Rejected Logits: 1.7159
  • Debug/policy Chosen Logps: -121.2440
  • Debug/policy Rejected Logps: -63.9738
  • Debug/reference Chosen Logps: -123.1481
  • Debug/reference Rejected Logps: -63.8871
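
For quick use, here is a minimal loading sketch. The repository id is taken from this page's metadata; `device_map="auto"` assumes accelerate is installed, and the BF16 dtype matches the stored tensor type noted at the end of this card.

```python
# Minimal loading sketch, assuming a BF16-capable GPU and accelerate installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yiran-wang3/ds_chat_sppo_hard_new_iter0_masked_linear_schedule"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
)

# deepseek-llm-7b-chat ships a chat template, so apply_chat_template should work.
messages = [{"role": "user", "content": "Write a function that reverses a string."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```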

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto code follows the list):

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
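
As a rough illustration, the sketch below maps these settings onto transformers.TrainingArguments and a trl DPOTrainer call. This is not the authors' script: the DPOTrainer signature varies across trl versions, the "train" split and prompt/chosen/rejected column layout of the datasets are assumptions, and the output path is a placeholder.

```python
# Hedged sketch only: shows how the listed hyperparameters map onto
# transformers.TrainingArguments plus an illustrative trl DPOTrainer call.
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "deepseek-ai/deepseek-llm-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference policy

# The three binarized preference datasets named in this card; the "train"
# split and prompt/chosen/rejected columns are assumptions.
train_ds = concatenate_datasets([
    load_dataset("self-generate/ds_chat_original_cn_mining_oj_iter0-binarized", split="train"),
    load_dataset("self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized", split="train"),
    load_dataset("self-generate/ds_chat_original_cn_rl_oj_iter0-binarized", split="train"),
])

training_args = TrainingArguments(
    output_dir="ds_chat_sppo_hard_new_iter0",  # placeholder path
    learning_rate=1e-7,
    per_device_train_batch_size=8,   # x 8 GPUs = total_train_batch_size 64
    per_device_eval_batch_size=4,    # x 8 GPUs = total_eval_batch_size 32
    seed=42,
    num_train_epochs=8.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    warmup_steps=100,  # in transformers, nonzero warmup_steps overrides warmup_ratio
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=train_ds,
    tokenizer=tokenizer,
)
trainer.train()
```

The listed Adam betas and epsilon are TrainingArguments' defaults (adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8), so they need no extra flags here.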

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4997 | 0.3623 | 100 | 0.4979 | 0.0051 | -0.0005 | 0.3421 | 0.0056 | -63.9373 | -122.6352 | 1.7236 | 1.6612 | 1.6612 | 1.7236 | -122.6352 | -63.9373 | -123.1481 | -63.8871 |
| 0.5018 | 0.7246 | 200 | 0.4996 | 0.0156 | 0.0052 | 0.3421 | 0.0104 | -63.3698 | -121.5860 | 1.7403 | 1.6799 | 1.6799 | 1.7403 | -121.5860 | -63.3698 | -123.1481 | -63.8871 |
| 0.4991 | 1.0870 | 300 | 0.4987 | 0.0190 | 0.0068 | 0.3158 | 0.0123 | -63.2120 | -121.2448 | 1.7605 | 1.7000 | 1.7000 | 1.7605 | -121.2448 | -63.2120 | -123.1481 | -63.8871 |
| 0.5007 | 1.4493 | 400 | 0.4975 | 0.0176 | 0.0038 | 0.2895 | 0.0139 | -63.5094 | -121.3837 | 1.7412 | 1.6815 | 1.6815 | 1.7412 | -121.3837 | -63.5094 | -123.1481 | -63.8871 |
| 0.5006 | 1.8116 | 500 | 0.4966 | 0.0132 | 0.0019 | 0.3553 | 0.0113 | -63.6979 | -121.8322 | 1.7278 | 1.6669 | 1.6669 | 1.7278 | -121.8322 | -63.6979 | -123.1481 | -63.8871 |
| 0.4944 | 2.1739 | 600 | 0.4969 | 0.0196 | 0.0035 | 0.3421 | 0.0160 | -63.5333 | -121.1920 | 1.7400 | 1.6805 | 1.6805 | 1.7400 | -121.1920 | -63.5333 | -123.1481 | -63.8871 |
| 0.4988 | 2.5362 | 700 | 0.4959 | 0.0175 | 0.0032 | 0.3553 | 0.0143 | -63.5656 | -121.4005 | 1.7441 | 1.6843 | 1.6843 | 1.7441 | -121.4005 | -63.5656 | -123.1481 | -63.8871 |
| 0.4975 | 2.8986 | 800 | 0.4967 | 0.0221 | 0.0072 | 0.3553 | 0.0150 | -63.1701 | -120.9358 | 1.7439 | 1.6851 | 1.6851 | 1.7439 | -120.9358 | -63.1701 | -123.1481 | -63.8871 |
| 0.495 | 3.2609 | 900 | 0.4955 | 0.0202 | 0.0021 | 0.3421 | 0.0180 | -63.6741 | -121.1320 | 1.7492 | 1.6875 | 1.6875 | 1.7492 | -121.1320 | -63.6741 | -123.1481 | -63.8871 |
| 0.4961 | 3.6232 | 1000 | 0.4958 | 0.0210 | 0.0019 | 0.3421 | 0.0191 | -63.6937 | -121.0436 | 1.7449 | 1.6854 | 1.6854 | 1.7449 | -121.0436 | -63.6937 | -123.1481 | -63.8871 |
| 0.4979 | 3.9855 | 1100 | 0.4952 | 0.0160 | -0.0011 | 0.3816 | 0.0171 | -63.9974 | -121.5451 | 1.7309 | 1.6720 | 1.6720 | 1.7309 | -121.5451 | -63.9974 | -123.1481 | -63.8871 |
| 0.4985 | 4.3478 | 1200 | 0.4958 | 0.0157 | 0.0002 | 0.3289 | 0.0154 | -63.8621 | -121.5809 | 1.7273 | 1.6675 | 1.6675 | 1.7273 | -121.5809 | -63.8621 | -123.1481 | -63.8871 |
| 0.4977 | 4.7101 | 1300 | 0.4968 | 0.0195 | 0.0012 | 0.3158 | 0.0182 | -63.7631 | -121.2019 | 1.7106 | 1.6512 | 1.6512 | 1.7106 | -121.2019 | -63.7631 | -123.1481 | -63.8871 |
| 0.4966 | 5.0725 | 1400 | 0.4958 | 0.0186 | 0.0002 | 0.3289 | 0.0184 | -63.8648 | -121.2832 | 1.7173 | 1.6585 | 1.6585 | 1.7173 | -121.2832 | -63.8648 | -123.1481 | -63.8871 |
| 0.4935 | 5.4348 | 1500 | 0.4958 | 0.0160 | 0.0005 | 0.2632 | 0.0155 | -63.8391 | -121.5465 | 1.7152 | 1.6570 | 1.6570 | 1.7152 | -121.5465 | -63.8391 | -123.1481 | -63.8871 |
| 0.4975 | 5.7971 | 1600 | 0.4963 | 0.0197 | 0.0018 | 0.3026 | 0.0179 | -63.7076 | -121.1778 | 1.7160 | 1.6571 | 1.6571 | 1.7160 | -121.1778 | -63.7076 | -123.1481 | -63.8871 |
| 0.4934 | 6.1594 | 1700 | 0.4958 | 0.0142 | -0.0019 | 0.3553 | 0.0162 | -64.0808 | -121.7252 | 1.7082 | 1.6502 | 1.6502 | 1.7082 | -121.7252 | -64.0808 | -123.1481 | -63.8871 |
| 0.4956 | 6.5217 | 1800 | 0.4957 | 0.0210 | 0.0005 | 0.3421 | 0.0205 | -63.8361 | -121.0436 | 1.7185 | 1.6581 | 1.6581 | 1.7185 | -121.0436 | -63.8361 | -123.1481 | -63.8871 |
| 0.496 | 6.8841 | 1900 | 0.4958 | 0.0212 | 0.0018 | 0.2895 | 0.0194 | -63.7090 | -121.0307 | 1.7158 | 1.6582 | 1.6582 | 1.7158 | -121.0307 | -63.7090 | -123.1481 | -63.8871 |
| 0.495 | 7.2464 | 2000 | 0.4953 | 0.0175 | 0.0019 | 0.3289 | 0.0156 | -63.6983 | -121.4027 | 1.7189 | 1.6600 | 1.6600 | 1.7189 | -121.4027 | -63.6983 | -123.1481 | -63.8871 |
| 0.4967 | 7.6087 | 2100 | 0.4958 | 0.0202 | -0.0001 | 0.2895 | 0.0203 | -63.8998 | -121.1321 | 1.7188 | 1.6592 | 1.6592 | 1.7188 | -121.1321 | -63.8998 | -123.1481 | -63.8871 |
| 0.4948 | 7.9710 | 2200 | 0.4951 | 0.0190 | -0.0009 | 0.3684 | 0.0199 | -63.9738 | -121.2440 | 1.7159 | 1.6562 | 1.6562 | 1.7159 | -121.2440 | -63.9738 | -123.1481 | -63.8871 |
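
In trl's DPO logging, each reward column is beta times the gap between policy and reference log-probabilities. The sketch below checks the final row under an assumed beta of 0.01; that value is inferred from the logged numbers, not documented in this card.

```python
# Sanity-check sketch: DPO-style rewards are beta * (policy_logps - reference_logps).
# beta = 0.01 is an inference from the logged numbers, not a documented setting.
beta = 0.01

policy_chosen_logps = -121.2440
policy_rejected_logps = -63.9738
reference_chosen_logps = -123.1481
reference_rejected_logps = -63.8871

rewards_chosen = beta * (policy_chosen_logps - reference_chosen_logps)
rewards_rejected = beta * (policy_rejected_logps - reference_rejected_logps)

print(round(rewards_chosen, 4))                     # 0.019   -> Rewards/chosen 0.0190
print(round(rewards_rejected, 4))                   # -0.0009 -> Rewards/rejected -0.0009
print(round(rewards_chosen - rewards_rejected, 4))  # 0.0199  -> Rewards/margins 0.0199
```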

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1

Model size: 6.91B params · Tensor type: BF16 · Format: Safetensors
Repository: yiran-wang3/ds_chat_sppo_hard_new_iter0_masked_linear_schedule