llama-7b-SFT-qlora-eli5-wiki_DPO_ds_RM_random_1024_r_64_alpha_16

This model is a fine-tuned version of dhmeltzer/llama-7b-SFT_eli5_wiki65k_1024_r_64_alpha_16_merged on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6913	0.1	19	0.6845	-0.4006	-0.4672	0.5558	0.0665	-205.3114	-202.4936	1.0265	1.0467
0.6768	0.21	38	0.6796	-0.3409	-0.4196	0.5603	0.0787	-204.8360	-201.8965	1.0326	1.0538
0.6771	0.31	57	0.6788	-0.0760	-0.1428	0.5781	0.0669	-202.0682	-199.2469	1.0323	1.0541
0.6665	0.41	76	0.6826	-0.1511	-0.2355	0.5703	0.0843	-202.9944	-199.9986	1.0413	1.0635
0.6669	0.52	95	0.6830	-0.1285	-0.2165	0.5781	0.0880	-202.8050	-199.7720	1.0299	1.0522
0.669	0.62	114	0.6800	-0.0932	-0.1803	0.5725	0.0871	-202.4429	-199.4187	1.0126	1.0352
0.6559	0.72	133	0.6829	-0.0011	-0.1074	0.5759	0.1063	-201.7135	-198.4980	1.0015	1.0232
0.6698	0.83	152	0.6810	-0.0519	-0.1530	0.5781	0.1011	-202.1696	-199.0062	0.9974	1.0192
0.6643	0.93	171	0.6799	-0.0579	-0.1589	0.5658	0.1010	-202.2284	-199.0658	1.0002	1.0220