Tags: Safetensors · llama · alignment-handbook · trl · dpo · Generated from Trainer


ds_chat_sppo_hard_new_iter0_2024-09-14-21.15

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set (a minimal usage sketch follows the list):

  • Loss: 0.4951
  • Rewards/chosen: 0.0190
  • Rewards/rejected: -0.0009
  • Rewards/accuracies: 0.3684
  • Rewards/margins: 0.0199
  • Logps/rejected: -63.9738
  • Logps/chosen: -121.2440
  • Logits/rejected: 1.7159
  • Logits/chosen: 1.6562
  • Debug/policy Chosen Logits: 1.6562
  • Debug/policy Rejected Logits: 1.7159
  • Debug/policy Chosen Logps: -121.2440
  • Debug/policy Rejected Logps: -63.9738
  • Debug/reference Chosen Logps: -123.1481
  • Debug/reference Rejected Logps: -63.8871
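
For quick use, here is a minimal loading sketch. The repository id is taken from this page's metadata; `device_map="auto"` assumes accelerate is installed, and the BF16 dtype matches the stored tensor type noted at the end of this card.

```python
# Minimal loading sketch, assuming a BF16-capable GPU and accelerate installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yiran-wang3/ds_chat_sppo_hard_new_iter0_masked_linear_schedule"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
)

# deepseek-llm-7b-chat ships a chat template, so apply_chat_template should work.
messages = [{"role": "user", "content": "Write a function that reverses a string."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```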

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto code follows the list):

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
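
As a rough illustration, the sketch below maps these settings onto transformers.TrainingArguments and a trl DPOTrainer call. This is not the authors' script: the DPOTrainer signature varies across trl versions, the "train" split and prompt/chosen/rejected column layout of the datasets are assumptions, and the output path is a placeholder.

```python
# Hedged sketch only: shows how the listed hyperparameters map onto
# transformers.TrainingArguments plus an illustrative trl DPOTrainer call.
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "deepseek-ai/deepseek-llm-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference policy

# The three binarized preference datasets named in this card; the "train"
# split and prompt/chosen/rejected columns are assumptions.
train_ds = concatenate_datasets([
    load_dataset("self-generate/ds_chat_original_cn_mining_oj_iter0-binarized", split="train"),
    load_dataset("self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized", split="train"),
    load_dataset("self-generate/ds_chat_original_cn_rl_oj_iter0-binarized", split="train"),
])

training_args = TrainingArguments(
    output_dir="ds_chat_sppo_hard_new_iter0",  # placeholder path
    learning_rate=1e-7,
    per_device_train_batch_size=8,   # x 8 GPUs = total_train_batch_size 64
    per_device_eval_batch_size=4,    # x 8 GPUs = total_eval_batch_size 32
    seed=42,
    num_train_epochs=8.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    warmup_steps=100,  # in transformers, nonzero warmup_steps overrides warmup_ratio
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=train_ds,
    tokenizer=tokenizer,
)
trainer.train()
```

The listed Adam betas and epsilon are TrainingArguments' defaults (adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8), so they need no extra flags here.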

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4997 | 0.3623 | 100 | 0.4979 | 0.0051 | -0.0005 | 0.3421 | 0.0056 | -63.9373 | -122.6352 | 1.7236 | 1.6612 | 1.6612 | 1.7236 | -122.6352 | -63.9373 | -123.1481 | -63.8871 |
| 0.5018 | 0.7246 | 200 | 0.4996 | 0.0156 | 0.0052 | 0.3421 | 0.0104 | -63.3698 | -121.5860 | 1.7403 | 1.6799 | 1.6799 | 1.7403 | -121.5860 | -63.3698 | -123.1481 | -63.8871 |
| 0.4991 | 1.0870 | 300 | 0.4987 | 0.0190 | 0.0068 | 0.3158 | 0.0123 | -63.2120 | -121.2448 | 1.7605 | 1.7000 | 1.7000 | 1.7605 | -121.2448 | -63.2120 | -123.1481 | -63.8871 |
| 0.5007 | 1.4493 | 400 | 0.4975 | 0.0176 | 0.0038 | 0.2895 | 0.0139 | -63.5094 | -121.3837 | 1.7412 | 1.6815 | 1.6815 | 1.7412 | -121.3837 | -63.5094 | -123.1481 | -63.8871 |
| 0.5006 | 1.8116 | 500 | 0.4966 | 0.0132 | 0.0019 | 0.3553 | 0.0113 | -63.6979 | -121.8322 | 1.7278 | 1.6669 | 1.6669 | 1.7278 | -121.8322 | -63.6979 | -123.1481 | -63.8871 |
| 0.4944 | 2.1739 | 600 | 0.4969 | 0.0196 | 0.0035 | 0.3421 | 0.0160 | -63.5333 | -121.1920 | 1.7400 | 1.6805 | 1.6805 | 1.7400 | -121.1920 | -63.5333 | -123.1481 | -63.8871 |
| 0.4988 | 2.5362 | 700 | 0.4959 | 0.0175 | 0.0032 | 0.3553 | 0.0143 | -63.5656 | -121.4005 | 1.7441 | 1.6843 | 1.6843 | 1.7441 | -121.4005 | -63.5656 | -123.1481 | -63.8871 |
| 0.4975 | 2.8986 | 800 | 0.4967 | 0.0221 | 0.0072 | 0.3553 | 0.0150 | -63.1701 | -120.9358 | 1.7439 | 1.6851 | 1.6851 | 1.7439 | -120.9358 | -63.1701 | -123.1481 | -63.8871 |
| 0.495 | 3.2609 | 900 | 0.4955 | 0.0202 | 0.0021 | 0.3421 | 0.0180 | -63.6741 | -121.1320 | 1.7492 | 1.6875 | 1.6875 | 1.7492 | -121.1320 | -63.6741 | -123.1481 | -63.8871 |
| 0.4961 | 3.6232 | 1000 | 0.4958 | 0.0210 | 0.0019 | 0.3421 | 0.0191 | -63.6937 | -121.0436 | 1.7449 | 1.6854 | 1.6854 | 1.7449 | -121.0436 | -63.6937 | -123.1481 | -63.8871 |
| 0.4979 | 3.9855 | 1100 | 0.4952 | 0.0160 | -0.0011 | 0.3816 | 0.0171 | -63.9974 | -121.5451 | 1.7309 | 1.6720 | 1.6720 | 1.7309 | -121.5451 | -63.9974 | -123.1481 | -63.8871 |
| 0.4985 | 4.3478 | 1200 | 0.4958 | 0.0157 | 0.0002 | 0.3289 | 0.0154 | -63.8621 | -121.5809 | 1.7273 | 1.6675 | 1.6675 | 1.7273 | -121.5809 | -63.8621 | -123.1481 | -63.8871 |
| 0.4977 | 4.7101 | 1300 | 0.4968 | 0.0195 | 0.0012 | 0.3158 | 0.0182 | -63.7631 | -121.2019 | 1.7106 | 1.6512 | 1.6512 | 1.7106 | -121.2019 | -63.7631 | -123.1481 | -63.8871 |
| 0.4966 | 5.0725 | 1400 | 0.4958 | 0.0186 | 0.0002 | 0.3289 | 0.0184 | -63.8648 | -121.2832 | 1.7173 | 1.6585 | 1.6585 | 1.7173 | -121.2832 | -63.8648 | -123.1481 | -63.8871 |
| 0.4935 | 5.4348 | 1500 | 0.4958 | 0.0160 | 0.0005 | 0.2632 | 0.0155 | -63.8391 | -121.5465 | 1.7152 | 1.6570 | 1.6570 | 1.7152 | -121.5465 | -63.8391 | -123.1481 | -63.8871 |
| 0.4975 | 5.7971 | 1600 | 0.4963 | 0.0197 | 0.0018 | 0.3026 | 0.0179 | -63.7076 | -121.1778 | 1.7160 | 1.6571 | 1.6571 | 1.7160 | -121.1778 | -63.7076 | -123.1481 | -63.8871 |
| 0.4934 | 6.1594 | 1700 | 0.4958 | 0.0142 | -0.0019 | 0.3553 | 0.0162 | -64.0808 | -121.7252 | 1.7082 | 1.6502 | 1.6502 | 1.7082 | -121.7252 | -64.0808 | -123.1481 | -63.8871 |
| 0.4956 | 6.5217 | 1800 | 0.4957 | 0.0210 | 0.0005 | 0.3421 | 0.0205 | -63.8361 | -121.0436 | 1.7185 | 1.6581 | 1.6581 | 1.7185 | -121.0436 | -63.8361 | -123.1481 | -63.8871 |
| 0.496 | 6.8841 | 1900 | 0.4958 | 0.0212 | 0.0018 | 0.2895 | 0.0194 | -63.7090 | -121.0307 | 1.7158 | 1.6582 | 1.6582 | 1.7158 | -121.0307 | -63.7090 | -123.1481 | -63.8871 |
| 0.495 | 7.2464 | 2000 | 0.4953 | 0.0175 | 0.0019 | 0.3289 | 0.0156 | -63.6983 | -121.4027 | 1.7189 | 1.6600 | 1.6600 | 1.7189 | -121.4027 | -63.6983 | -123.1481 | -63.8871 |
| 0.4967 | 7.6087 | 2100 | 0.4958 | 0.0202 | -0.0001 | 0.2895 | 0.0203 | -63.8998 | -121.1321 | 1.7188 | 1.6592 | 1.6592 | 1.7188 | -121.1321 | -63.8998 | -123.1481 | -63.8871 |
| 0.4948 | 7.9710 | 2200 | 0.4951 | 0.0190 | -0.0009 | 0.3684 | 0.0199 | -63.9738 | -121.2440 | 1.7159 | 1.6562 | 1.6562 | 1.7159 | -121.2440 | -63.9738 | -123.1481 | -63.8871 |
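
In trl's DPO logging, each reward column is beta times the gap between policy and reference log-probabilities. The sketch below checks the final row under an assumed beta of 0.01; that value is inferred from the logged numbers, not documented in this card.

```python
# Sanity-check sketch: DPO-style rewards are beta * (policy_logps - reference_logps).
# beta = 0.01 is an inference from the logged numbers, not a documented setting.
beta = 0.01

policy_chosen_logps = -121.2440
policy_rejected_logps = -63.9738
reference_chosen_logps = -123.1481
reference_rejected_logps = -63.8871

rewards_chosen = beta * (policy_chosen_logps - reference_chosen_logps)
rewards_rejected = beta * (policy_rejected_logps - reference_rejected_logps)

print(round(rewards_chosen, 4))                     # 0.019   -> Rewards/chosen 0.0190
print(round(rewards_rejected, 4))                   # -0.0009 -> Rewards/rejected -0.0009
print(round(rewards_chosen - rewards_rejected, 4))  # 0.0199  -> Rewards/margins 0.0199
```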

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1

Model size: 6.91B params · Tensor type: BF16 · Format: Safetensors
Repository: yiran-wang3/ds_chat_sppo_hard_new_iter0_masked_linear_schedule