G-reen/EXPERIMENT-DPO-m7b2-3-merged

This model was trained as part of a series of experiments comparing pure DPO, SFT, and ORPO, all of which are supported by Unsloth and Hugging Face TRL.

Note: This model failed to train because the learning rate was too high; training was stopped early at 300 steps. Do not use!

Benchmarks

Average 29.55

ARC 29.52

HellaSwag 25.9

MMLU 23.12

TruthfulQA 48.27

Winogrande 50.51

GSM8K 0
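The reported average appears to be the unweighted mean of the six benchmark scores. A quick check, assuming equal weighting (the dictionary name `scores` is only illustrative):

```python
# Scores copied from the benchmark list above.
scores = {
    "ARC": 29.52,
    "HellaSwag": 25.9,
    "MMLU": 23.12,
    "TruthfulQA": 48.27,
    "Winogrande": 50.51,
    "GSM8K": 0.0,
}

# Assumption: the average is a plain mean with equal weight per benchmark.
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 29.55
```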

Training Details

Duration: ~3 hours on one Kaggle T4 with Unsloth

Model: https://huggingface.co/unsloth/mistral-7b-v0.2-bnb-4bit

Dataset: https://huggingface.co/datasets/argilla/dpo-mix-7k

Rank: 8

Alpha: 16

Learning rate: 5e-4

Beta: 0.1

Batch size: 8

Epochs: 1

Learning rate scheduler: Linear
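For readers unfamiliar with the hyperparameters above, here is a minimal sketch of what beta and the linear scheduler control. It uses the standard DPO loss on sequence log-probabilities and a plain linear decay to zero. This is not the training code used for this model: the function names, the absence of warmup, and the total step count of 800 are assumptions for illustration only.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def linear_lr(step, total_steps, peak_lr=5e-4):
    """Linear decay from peak_lr to 0, assuming no warmup."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return peak_lr * remaining


# When the policy equals the reference model, the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Hypothetical total of 800 steps: learning rate at the step where training stopped.
print(linear_lr(300, total_steps=800))
```

With beta at 0.1, the loss only moves away from log(2) as the policy's log-ratios diverge from the reference, so a learning rate that is too high can push those log-ratios far from the reference within a few hundred steps.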

Prompt Format: ChatML

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Why is the sky blue?<|im_end|>
<|im_start|>assistant
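The template above can be produced programmatically. The following is a minimal sketch, assuming messages are given as a list of role/content dictionaries; the helper name `to_chatml` is hypothetical and is not part of Unsloth or TRL.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
])
print(prompt)
```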

WandB Reports

