mistral-dpo

This model is a fine-tuned version of TheBloke/Mistral-7B-v0.1-GPTQ. The training dataset is not recorded (the card lists it as None). It achieves the following results on the evaluation set:

Loss: 0.0000
Rewards/chosen: -2.0502
Rewards/rejected: -28.3632
Rewards/accuracies: 1.0
Rewards/margins: 26.3129
Logps/rejected: -399.8283
Logps/chosen: -35.7179
Logits/rejected: -2.1171
Logits/chosen: -1.8480
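
The relationship between the reward figures and the reported loss can be sketched as follows. This is a minimal illustration, assuming the standard sigmoid DPO objective, in which the per-example loss is -log(sigmoid(margin)) and the margin is the chosen reward minus the rejected reward. The card does not state which loss variant was used, so treat that as an assumption. The helper name neg_log_sigmoid is hypothetical.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Final evaluation values copied from the list above.
reward_chosen = -2.0502
reward_rejected = -28.3632
reported_margin = 26.3129

# Assumption: margin = chosen reward - rejected reward.
margin = reward_chosen - reward_rejected

# Assumption: sigmoid DPO loss, evaluated at the mean margin.
loss_at_mean_margin = neg_log_sigmoid(margin)

print(f"margin = {margin:.4f} (reported {reported_margin})")
print(f"loss at that margin = {loss_at_mean_margin:.3e}")
print(f"rounded to four decimals: {loss_at_mean_margin:.4f}")
```

Under these assumptions the margin matches the reported value up to rounding, and a margin this large drives the loss far below the four decimal places shown. Because the reported loss is averaged per example, evaluating it at the mean margin gives only a lower bound, not the exact figure.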

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 1
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 2
training_steps: 250
mixed_precision_training: Native AMP
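
To make the schedule settings concrete, here is a minimal sketch of a linear learning-rate schedule with warmup, using the values listed above (learning_rate 0.0002, 2 warmup steps, 250 training steps). The function linear_lr is a hypothetical helper written for illustration. It is assumed to follow the same shape as the linear schedule in Transformers, but it is not the library's code.

```python
def linear_lr(
    step: int,
    base_lr: float = 2e-4,
    warmup_steps: int = 2,
    total_steps: int = 250,
) -> float:
    """Learning rate at a given optimizer step (illustrative sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 2, 126, 250):
    print(step, linear_lr(step))
```

With only 2 warmup steps, the rate reaches its peak almost immediately and then decays over the remaining 248 steps.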

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6453	0.2	10	0.4086	0.1393	-0.7001	1.0	0.8394	-123.1976	-13.8225	-2.5461	-2.5162
0.1759	0.4	20	0.0051	0.3963	-6.4413	1.0	6.8376	-180.6101	-11.2527	-2.5253	-2.4045
0.0015	0.6	30	0.0000	0.2885	-20.7441	1.0	21.0326	-323.6376	-12.3309	-2.2440	-1.8851
0.0	0.8	40	0.0000	-0.6913	-26.5964	1.0	25.9051	-382.1607	-22.1282	-1.9054	-1.5507
0.0	1.0	50	0.0000	-1.6661	-28.8376	1.0	27.1715	-404.5731	-31.8766	-1.7581	-1.4145
0.0	1.2	60	0.0000	-2.1659	-29.6823	1.0	27.5164	-413.0200	-36.8745	-1.7071	-1.3649
0.0	1.4	70	0.0000	-2.0973	-30.0476	1.0	27.9503	-416.6729	-36.1886	-1.6955	-1.3541
0.0	1.6	80	0.0000	-2.0065	-30.1726	1.0	28.1661	-417.9230	-35.2805	-1.6941	-1.3519
0.0	1.8	90	0.0000	-1.9541	-30.2266	1.0	28.2724	-418.4622	-34.7568	-1.6935	-1.3518
0.0023	2.0	100	0.0000	-0.7061	-30.2814	1.0	29.5753	-419.0107	-22.2763	-1.7664	-1.4215
0.0	2.2	110	0.0000	-1.6234	-29.4682	1.0	27.8448	-410.8783	-31.4494	-2.0371	-1.7164
0.0	2.4	120	0.0000	-1.9528	-28.6154	1.0	26.6626	-402.3507	-34.7431	-2.0991	-1.8126
0.0	2.6	130	0.0000	-2.0210	-28.3739	1.0	26.3529	-399.9358	-35.4253	-2.1141	-1.8394
0.0	2.8	140	0.0000	-2.0443	-28.2878	1.0	26.2435	-399.0752	-35.6588	-2.1185	-1.8487
0.0	3.0	150	0.0000	-2.0504	-28.2651	1.0	26.2147	-398.8474	-35.7192	-2.1201	-1.8510
0.0	3.2	160	0.0000	-2.0500	-28.2657	1.0	26.2157	-398.8541	-35.7157	-2.1202	-1.8519
0.0	3.4	170	0.0000	-2.0530	-28.2687	1.0	26.2157	-398.8837	-35.7460	-2.1205	-1.8521
0.0	3.6	180	0.0000	-2.0529	-28.2660	1.0	26.2131	-398.8570	-35.7444	-2.1202	-1.8515
0.0	3.8	190	0.0000	-2.0531	-28.2649	1.0	26.2119	-398.8461	-35.7464	-2.1202	-1.8519
0.0	4.0	200	0.0000	-2.0579	-28.3150	1.0	26.2571	-399.3466	-35.7943	-2.1191	-1.8507
0.0	4.2	210	0.0000	-2.0509	-28.3341	1.0	26.2832	-399.5381	-35.7246	-2.1178	-1.8487
0.0	4.4	220	0.0000	-2.0516	-28.3405	1.0	26.2889	-399.6018	-35.7316	-2.1178	-1.8490
0.0	4.6	230	0.0000	-2.0516	-28.3495	1.0	26.2979	-399.6917	-35.7317	-2.1176	-1.8489
0.0	4.8	240	0.0000	-2.0508	-28.3684	1.0	26.3176	-399.8806	-35.7236	-2.1173	-1.8488
0.0	5.0	250	0.0000	-2.0502	-28.3632	1.0	26.3129	-399.8283	-35.7179	-2.1171	-1.8480
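
The table can be checked for internal consistency with a short script. The sketch below copies the first eight columns of four rows from the table above, then verifies that Rewards/margins equals Rewards/chosen minus Rewards/rejected up to rounding and derives the number of steps per epoch. The variable names and the choice of rows are illustrative.

```python
import csv
import io

# First eight columns of four rows, copied from the table above (tab-separated).
TABLE = (
    "Training Loss\tEpoch\tStep\tValidation Loss\tRewards/chosen\t"
    "Rewards/rejected\tRewards/accuracies\tRewards/margins\n"
    "0.6453\t0.2\t10\t0.4086\t0.1393\t-0.7001\t1.0\t0.8394\n"
    "0.1759\t0.4\t20\t0.0051\t0.3963\t-6.4413\t1.0\t6.8376\n"
    "0.0023\t2.0\t100\t0.0000\t-0.7061\t-30.2814\t1.0\t29.5753\n"
    "0.0\t5.0\t250\t0.0000\t-2.0502\t-28.3632\t1.0\t26.3129\n"
)

rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t"))

max_gap = 0.0
steps_per_epoch = set()
for row in rows:
    chosen = float(row["Rewards/chosen"])
    rejected = float(row["Rewards/rejected"])
    reported = float(row["Rewards/margins"])
    # Difference between the recomputed and the reported margin.
    max_gap = max(max_gap, abs((chosen - rejected) - reported))
    steps_per_epoch.add(round(int(row["Step"]) / float(row["Epoch"])))

print("largest margin discrepancy:", max_gap)
print("steps per epoch:", steps_per_epoch)
```

The rows agree on 50 steps per epoch. With train_batch_size 1 and no gradient accumulation listed, that would suggest roughly 50 training examples, though the card does not confirm this.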

Framework versions

PEFT 0.7.1
Transformers 4.36.2
PyTorch 2.0.1+cu118
Datasets 2.15.0
Tokenizers 0.15.0
