zephyr-7b-dpo-lora-pairrm

This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.2 on the snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset dataset. It achieves the following results on the evaluation set:

Loss: 0.6764
Rewards/chosen: -0.9885
Rewards/rejected: -1.0650
Rewards/accuracies: 0.5657
Rewards/margins: 0.0765
Logps/rejected: -320.4450
Logps/chosen: -307.4615
Logits/rejected: -2.7535
Logits/chosen: -2.7599
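
The reward metrics above are linked by simple arithmetic, which the sketch below checks. It assumes the common DPO convention (not confirmed by this card) that Rewards/margins is the mean of chosen minus rejected rewards, and that the per-pair loss is the sigmoid form -log(sigmoid(margin)). The function name `dpo_pair_loss` is illustrative only.

```python
import math

# Summary metrics copied from the evaluation results above.
rewards_chosen = -0.9885
rewards_rejected = -1.0650
reported_margin = 0.0765
reported_loss = 0.6764

# Assumption: margin = mean(chosen reward) - mean(rejected reward).
margin = rewards_chosen - rewards_rejected


def dpo_pair_loss(reward_margin: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(m)) = log(1 + exp(-m))."""
    return math.log1p(math.exp(-reward_margin))


# Loss a single pair would have if its margin equalled the mean margin.
loss_at_mean_margin = dpo_pair_loss(margin)
# Loss when the policy cannot separate chosen from rejected at all: log(2).
loss_at_zero_margin = dpo_pair_loss(0.0)

print(f"margin: {margin:.4f} (reported {reported_margin})")
print(f"loss at mean margin: {loss_at_mean_margin:.4f}")
print(f"loss at zero margin: {loss_at_zero_margin:.4f}")
print(f"reported loss:       {reported_loss:.4f}")
```

Under these assumptions the loss at the mean margin is roughly 0.656 and the zero-margin loss is log(2), about 0.693. The reported 0.6764 falls between the two, which is what one would expect when a convex per-pair loss is averaged over pairs whose margins vary.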

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 4
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 4
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
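
As a rough illustration of how these settings interact, the sketch below recomputes the effective batch size and traces a cosine schedule with linear warmup. Two things are assumptions rather than facts from this card: the convention that the total batch size is the per-device batch size times the accumulation steps times the device count, and the figure of about 1250 optimizer steps, which is only an estimate from the results table (step 1200 is logged at epoch 0.96). The helper `lr_at` is a hypothetical name.

```python
import math

# Values copied from the hyperparameter list above.
learning_rate = 5e-06
train_batch_size = 4             # assumed to be the per-device batch size
gradient_accumulation_steps = 4
total_train_batch_size = 16
warmup_ratio = 0.1

# Assumption: total = per-device batch * accumulation steps * number of devices.
implied_devices = total_train_batch_size // (
    train_batch_size * gradient_accumulation_steps
)

# Assumption: about 1250 optimizer steps in the single epoch (estimate only).
total_steps = 1250
warmup_steps = round(warmup_ratio * total_steps)


def lr_at(step: int) -> float:
    """Linear warmup to the peak rate, then cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"implied device count: {implied_devices}")
for step in (0, warmup_steps, 600, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under that batch-size convention the listed values already multiply to 16 with a single device. The card gives no device count, so this is an observation about the numbers and not a statement about the actual hardware.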

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6916	0.08	100	0.6925	-0.0162	-0.0177	0.5280	0.0015	-215.7187	-210.2296	-2.5058	-2.5086
0.6855	0.16	200	0.6880	-0.0651	-0.0772	0.5613	0.0121	-221.6710	-215.1240	-2.5152	-2.5178
0.6825	0.24	300	0.6854	-0.1874	-0.2081	0.5473	0.0207	-234.7546	-227.3457	-2.5175	-2.5192
0.6676	0.32	400	0.6827	-0.2909	-0.3222	0.5477	0.0313	-246.1682	-237.7042	-2.5347	-2.5368
0.6458	0.4	500	0.6805	-0.3693	-0.4104	0.5567	0.0410	-254.9852	-245.5435	-2.6328	-2.6364
0.6592	0.48	600	0.6789	-0.6010	-0.6528	0.5560	0.0518	-279.2278	-268.7087	-2.6805	-2.6845
0.6107	0.56	700	0.6785	-0.8159	-0.8786	0.5550	0.0627	-301.8047	-290.1964	-2.6914	-2.6969
0.6475	0.64	800	0.6770	-0.8845	-0.9544	0.5610	0.0699	-309.3867	-297.0627	-2.7237	-2.7295
0.6639	0.72	900	0.6766	-0.9705	-1.0450	0.5667	0.0746	-318.4507	-305.6558	-2.7464	-2.7525
0.6305	0.8	1000	0.6764	-0.9844	-1.0603	0.5680	0.0759	-319.9799	-307.0536	-2.7543	-2.7606
0.6754	0.88	1100	0.6763	-0.9882	-1.0648	0.5687	0.0766	-320.4283	-307.4264	-2.7538	-2.7602
0.6577	0.96	1200	0.6764	-0.9885	-1.0649	0.5663	0.0764	-320.4412	-307.4615	-2.7538	-2.7602
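
The table can be read programmatically to pick out the evaluation step with the lowest validation loss and to confirm that the margin column matches chosen minus rejected rewards. The sketch below is a minimal example that copies the relevant columns by hand; the tuple layout and variable names are illustrative choices, not part of the original training code.

```python
# Columns copied from the training results table above:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
rows = [
    (100, 0.6925, -0.0162, -0.0177, 0.0015),
    (200, 0.6880, -0.0651, -0.0772, 0.0121),
    (300, 0.6854, -0.1874, -0.2081, 0.0207),
    (400, 0.6827, -0.2909, -0.3222, 0.0313),
    (500, 0.6805, -0.3693, -0.4104, 0.0410),
    (600, 0.6789, -0.6010, -0.6528, 0.0518),
    (700, 0.6785, -0.8159, -0.8786, 0.0627),
    (800, 0.6770, -0.8845, -0.9544, 0.0699),
    (900, 0.6766, -0.9705, -1.0450, 0.0746),
    (1000, 0.6764, -0.9844, -1.0603, 0.0759),
    (1100, 0.6763, -0.9882, -1.0648, 0.0766),
    (1200, 0.6764, -0.9885, -1.0649, 0.0764),
]

# Evaluation step with the lowest validation loss.
best_step, best_loss = min(((r[0], r[1]) for r in rows), key=lambda p: p[1])

# Largest disagreement between the margin column and chosen - rejected.
# Small differences are expected because every column is rounded to 4 places.
max_gap = max(abs((chosen - rejected) - margin)
              for _, _, chosen, rejected, margin in rows)

print(f"lowest validation loss: {best_loss} at step {best_step}")
print(f"largest margin discrepancy: {max_gap:.4f}")
```

The validation loss is essentially flat over the last three evaluations (0.6764, 0.6763, 0.6764), so the difference between the final checkpoint and the nominal best one is within rounding noise.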

Framework versions

PEFT 0.7.1
Transformers 4.36.2
PyTorch 2.1.2
Datasets 2.14.6
Tokenizers 0.15.0
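
Since PEFT appears in the framework versions and the repository name mentions LoRA, the weights are presumably published as a LoRA adapter to be applied on top of the base model. The card does not confirm this, so the sketch below is a minimal example under that assumption. The function name `load_model` is hypothetical, and calling it downloads the full 7B base model.

```python
# Base model named in this card.
BASE_MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
# Assumption: this repository hosts LoRA adapter weights, not a merged model.
ADAPTER_ID = "shenxq/zephyr-7b-dpo-lora-pairrm"


def load_model():
    """Load the base model and attach the adapter on top of it."""
    # Imported lazily so that defining this function needs no third-party
    # packages; transformers and peft must be installed before calling it.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID)
    # Wrap the base model with the adapter weights from the Hub.
    model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
    return tokenizer, model
```

If the repository turns out to contain merged weights instead, loading `ADAPTER_ID` directly with `AutoModelForCausalLM.from_pretrained` would be the thing to try.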

shenxq/zephyr-7b-dpo-lora-pairrm
