---
license: apache-2.0
base_model: Deci/DeciLM-7B
tags:
  - generated_from_trainer
datasets:
  - HuggingFaceH4/ultrachat_200k
  - HuggingFaceH4/ultrafeedback_binarized
model-index:
  - name: bbdeci7b-sft-lora-dpo-lora
    results: []
---

# bbdeci7b-sft-lora-dpo-lora

This model is an SFT- then DPO-fine-tuned version of [Deci/DeciLM-7B](https://huggingface.co/Deci/DeciLM-7B), trained on [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) for the SFT stage and on [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) for the DPO stage.

Evals and more details coming soon.

SFT was conducted on 2x NVIDIA A100 GPUs for 21 hours, and DPO was conducted on 8x NVIDIA A100 GPUs for 4 hours.
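
The snippet below is a minimal inference sketch, not part of the original card: the repository id `rohansolo/bbdeci7b-sft-lora-dpo-lora` is inferred from the model name, a chat template is assumed to be configured on the tokenizer, and `trust_remote_code=True` is needed because DeciLM-7B ships custom modeling code.

```python
# Minimal inference sketch (assumptions noted above; not from the original card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rohansolo/bbdeci7b-sft-lora-dpo-lora"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # DeciLM-7B uses custom modeling code
)

# UltraChat-style conversation; assumes a chat template is set on the tokenizer.
messages = [{"role": "user", "content": "What does DPO fine-tuning change about a model?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```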

It achieves the following results on the evaluation set (SFT):

- Loss: 1.0110

It achieves the following results on the evaluation set (DPO); see the note after this list for how the reward metrics relate:

- Loss: 0.5908
- Rewards/chosen: 0.0960
- Rewards/rejected: -0.2480
- Rewards/accuracies: 0.7222
- Rewards/margins: 0.3440
- Logps/rejected: -241.9212
- Logps/chosen: -295.2642
- Logits/rejected: -2.6769
- Logits/chosen: -2.6941
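
As a reading aid (assuming the standard TRL `DPOTrainer` definition, which Trainer-generated metrics like these follow), the reward margin is simply the gap between the chosen and rejected rewards:

Rewards/margins = Rewards/chosen − Rewards/rejected = 0.0960 − (−0.2480) = 0.3440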

## Training hyperparameters

The following hyperparameters were used during SFT training (a hedged sketch of how they might map to a training script follows the list):

- learning_rate: 2e-05
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 128
- total_train_batch_size: 1024
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- num_epochs: 1
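
As referenced above, here is a hedged sketch of how these settings could map onto a TRL `SFTTrainer` run. Only the numbers come from the list; the LoRA configuration, the flattening of UltraChat's `messages` column into plain text, and the overall script layout are illustrative assumptions, not the actual training recipe.

```python
# Illustrative sketch only; LoRA settings and data formatting are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

base = "Deci/DeciLM-7B"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, trust_remote_code=True)

def to_text(example):
    # Assumed formatting: flatten UltraChat "messages" into one string per example.
    # The actual run very likely applied a proper chat template instead.
    return {"text": "\n".join(f"{m['role']}: {m['content']}" for m in example["messages"])}

train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").map(to_text)

peft_config = LoraConfig(  # assumed LoRA hyperparameters; the card only says "lora"
    r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM"
)

args = TrainingArguments(
    output_dir="bbdeci7b-sft-lora",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=128,  # 4 per device x 2 GPUs x 128 = 1024 total
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    seed=42,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
    peft_config=peft_config,
)
trainer.train()
```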

The following hyperparameters were used during DPO training (again, a hedged sketch follows the list):

- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 32
- total_train_batch_size: 512
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
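
Similarly, a hedged sketch of how these settings could map onto a TRL `DPOTrainer` run. The SFT checkpoint path, the `beta` value, the LoRA configuration, and the preference-pair preprocessing are assumptions; only the numeric hyperparameters come from the list above.

```python
# Illustrative sketch only; beta, LoRA settings, and preprocessing are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_checkpoint = "bbdeci7b-sft-lora"  # assumed: output of the SFT stage, adapters merged
tokenizer = AutoTokenizer.from_pretrained("Deci/DeciLM-7B", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True)

def to_pairs(example):
    # Assumed preprocessing: keep the prompt and the final assistant turns as plain strings.
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

raw = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
train_dataset = raw.map(to_pairs, remove_columns=raw.column_names)

peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")  # assumed

args = TrainingArguments(
    output_dir="bbdeci7b-sft-lora-dpo-lora",
    learning_rate=5e-7,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=32,  # 2 per device x 8 GPUs x 32 = 512 total
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=3,
    seed=42,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,      # with a peft_config, TRL uses the frozen base weights as the reference
    args=args,
    beta=0.1,            # assumed; the card does not state beta
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```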

## Training results

SFT:

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.0062        | 1.00  | 136  | 1.0110          |

DPO:

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6401        | 1.0   | 121  | 0.6354          | 0.0634         | -0.0940          | 0.7302             | 0.1573          | -240.3806      | -295.5903    | -2.6840         | -2.7020       |
| 0.6014        | 2.0   | 242  | 0.5988          | 0.0861         | -0.2096          | 0.7460             | 0.2956          | -241.5365      | -295.3633    | -2.6795         | -2.6965       |
| 0.5911        | 3.0   | 363  | 0.5908          | 0.0960         | -0.2480          | 0.7222             | 0.3440          | -241.9212      | -295.2642    | -2.6769         | -2.6941       |

## Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.14.6
- Tokenizers 0.14.1