Safetensors
llama
alignment-handbook
trl
dpo
Generated from Trainer

ds_chat_sppo_hard_cosine_iter0_2024-09-16-16.38

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat, trained on three datasets: self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized. At the final evaluation step (step 2200, epoch 7.97) it achieves the following results on the evaluation set:

  • Loss: 4957.3081
  • Rewards/chosen: 0.0206
  • Rewards/rejected: -0.0002
  • Rewards/accuracies: 0.3026
  • Rewards/margins: 0.0208
  • Logps/rejected: -63.9058
  • Logps/chosen: -121.0837
  • Logits/rejected: 1.7198
  • Logits/chosen: 1.6603
  • Debug/policy Chosen Logits: 1.6603
  • Debug/policy Rejected Logits: 1.7198
  • Debug/policy Chosen Logps: -121.0837
  • Debug/policy Rejected Logps: -63.9058
  • Debug/reference Chosen Logps: -123.1481
  • Debug/reference Rejected Logps: -63.8871
  • Debug/sppo Chosen Reward In Loss: 2.0643
  • Debug/sppo Rej Reward In Loss: -0.0187
  • Debug/sppo Chosen Loss: 2387.4246
  • Debug/sppo Reject Loss: 2498.1609
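
Several of the metrics above are related by simple arithmetic, and a short check makes those relationships explicit. The sketch below only uses numbers copied from the list. It assumes, as is conventional for DPO-style trainers, that the reward margin is the chosen reward minus the rejected reward and that the "Reward In Loss" values are policy log-probabilities minus reference log-probabilities. The variable names and the `implied_scale` quantity are illustrative; the card does not state the scaling coefficient that was used.

```python
# Final-step evaluation metrics copied from the list above.
rewards_chosen = 0.0206
rewards_rejected = -0.0002
rewards_margins = 0.0208

policy_chosen_logps = -121.0837
policy_rejected_logps = -63.9058
reference_chosen_logps = -123.1481
reference_rejected_logps = -63.8871

sppo_chosen_reward_in_loss = 2.0643
sppo_rej_reward_in_loss = -0.0187

# Margin = chosen reward - rejected reward.
margin = rewards_chosen - rewards_rejected

# Assumed definition: log-ratio = policy log-prob - reference log-prob.
chosen_log_ratio = policy_chosen_logps - reference_chosen_logps
rejected_log_ratio = policy_rejected_logps - reference_rejected_logps

# Ratio between the logged reward and the log-ratio. This is an inferred
# quantity, not a hyperparameter reported in the card.
implied_scale = rewards_chosen / chosen_log_ratio

print(f"margin:             {margin:.4f} (reported {rewards_margins})")
print(f"chosen log-ratio:   {chosen_log_ratio:.4f} (reported {sppo_chosen_reward_in_loss})")
print(f"rejected log-ratio: {rejected_log_ratio:.4f} (reported {sppo_rej_reward_in_loss})")
print(f"implied scale:      {implied_scale:.4f}")
```

The recomputed values agree with the reported ones to within rounding, and the implied scale comes out close to 0.01. That is consistent with a reward coefficient of about 0.01, but this is an inference from the logged numbers rather than a documented setting.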

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
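
The batch-size totals and the learning-rate schedule follow from the values above, and can be sketched as below. This is an illustration under stated assumptions, not the training code: `gradient_accumulation_steps = 1` is assumed because the card does not list it, the total step count is estimated from the results table (step 100 at epoch 0.3623), and `cosine_lr` is a hypothetical helper implementing the usual linear-warmup-then-cosine-decay shape. Whether the run used exactly this shape is not confirmed by the card.

```python
import math

# Values copied from the hyperparameter list above.
learning_rate = 1e-07
train_batch_size = 8   # per device
eval_batch_size = 4    # per device
num_devices = 8
warmup_steps = 100
num_epochs = 8

# Assumption: no gradient accumulation (the card does not list it).
gradient_accumulation_steps = 1

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# Assumption: steps per epoch estimated from the results table,
# where step 100 corresponds to epoch 0.3623.
steps_per_epoch = round(100 / 0.3623)
total_steps = steps_per_epoch * num_epochs


def cosine_lr(step, peak_lr=learning_rate, warmup=warmup_steps, total=total_steps):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total_train_batch_size:", total_train_batch_size)
print("total_eval_batch_size:", total_eval_batch_size)
print("estimated total steps:", total_steps)
for step in (0, 50, 100, 1154, total_steps):
    print(step, cosine_lr(step))
```

Both a warmup ratio (0.1) and a warmup step count (100) are listed. In the Transformers `TrainingArguments`, a positive `warmup_steps` takes precedence over `warmup_ratio`, so the sketch uses 100 warmup steps.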

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps | Debug/sppo Chosen Reward In Loss | Debug/sppo Rej Reward In Loss | Debug/sppo Chosen Loss | Debug/sppo Reject Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4999.5461 | 0.3623 | 100 | 4988.0952 | 0.0050 | 0.0020 | 0.2763 | 0.0031 | -63.6883 | -122.6432 | 1.7269 | 1.6642 | 1.6642 | 1.7269 | -122.6432 | -63.6883 | -123.1481 | -63.8871 | 0.5049 | 0.1988 | 2453.1523 | 2523.2144 |
| 5011.4531 | 0.7246 | 200 | 4990.5610 | 0.0177 | 0.0058 | 0.3158 | 0.0119 | -63.3097 | -121.3786 | 1.7330 | 1.6732 | 1.6732 | 1.7330 | -121.3786 | -63.3097 | -123.1481 | -63.8871 | 1.7695 | 0.5774 | 2386.0396 | 2582.6948 |
| 4987.3762 | 1.0870 | 300 | 4987.7910 | 0.0199 | 0.0061 | 0.2632 | 0.0137 | -63.2725 | -121.1585 | 1.7421 | 1.6830 | 1.6830 | 1.7421 | -121.1585 | -63.2725 | -123.1481 | -63.8871 | 1.9895 | 0.6145 | 2385.2695 | 2590.7976 |
| 5014.9531 | 1.4493 | 400 | 4983.8423 | 0.0200 | 0.0047 | 0.2632 | 0.0152 | -63.4148 | -121.1519 | 1.7308 | 1.6711 | 1.6711 | 1.7308 | -121.1519 | -63.4148 | -123.1481 | -63.8871 | 1.9962 | 0.4722 | 2383.6707 | 2565.9753 |
| 5006.941 | 1.8116 | 500 | 4965.4326 | 0.0117 | -0.0005 | 0.3158 | 0.0122 | -63.9328 | -121.9733 | 1.7113 | 1.6503 | 1.6503 | 1.7113 | -121.9733 | -63.9328 | -123.1481 | -63.8871 | 1.1748 | -0.0457 | 2416.3770 | 2495.6252 |
| 4945.2656 | 2.1739 | 600 | 4971.4199 | 0.0165 | 0.0030 | 0.2632 | 0.0134 | -63.5826 | -121.4996 | 1.7310 | 1.6724 | 1.6724 | 1.7310 | -121.4996 | -63.5826 | -123.1481 | -63.8871 | 1.6485 | 0.3045 | 2391.6709 | 2537.9797 |
| 5016.1723 | 2.5362 | 700 | 4956.6055 | 0.0193 | 0.0038 | 0.3684 | 0.0155 | -63.5097 | -121.2218 | 1.7528 | 1.6919 | 1.6919 | 1.7528 | -121.2218 | -63.5097 | -123.1481 | -63.8871 | 1.9263 | 0.3774 | 2372.3936 | 2549.7046 |
| 4980.475 | 2.8986 | 800 | 4967.6992 | 0.0217 | 0.0048 | 0.3421 | 0.0169 | -63.4108 | -120.9796 | 1.7533 | 1.6937 | 1.6937 | 1.7533 | -120.9796 | -63.4108 | -123.1481 | -63.8871 | 2.1685 | 0.4763 | 2370.3362 | 2566.8535 |
| 4962.825 | 3.2609 | 900 | 4973.9316 | 0.0239 | 0.0047 | 0.3026 | 0.0192 | -63.4168 | -120.7541 | 1.7347 | 1.6754 | 1.6754 | 1.7347 | -120.7541 | -63.4168 | -123.1481 | -63.8871 | 2.3940 | 0.4702 | 2374.9814 | 2564.9277 |
| 4960.6797 | 3.6232 | 1000 | 4954.9062 | 0.0185 | 0.0027 | 0.3553 | 0.0158 | -63.6219 | -121.2982 | 1.7363 | 1.6773 | 1.6773 | 1.7363 | -121.2982 | -63.6219 | -123.1481 | -63.8871 | 1.8498 | 0.2651 | 2376.7742 | 2531.5662 |
| 4996.0746 | 3.9855 | 1100 | 4978.2021 | 0.0089 | -0.0022 | 0.3684 | 0.0112 | -64.1119 | -122.2532 | 1.6884 | 1.6291 | 1.6291 | 1.6884 | -122.2532 | -64.1119 | -123.1481 | -63.8871 | 0.8949 | -0.2249 | 2438.2773 | 2479.8074 |
| 4988.032 | 4.3478 | 1200 | 4952.4019 | 0.0171 | -0.0003 | 0.3816 | 0.0174 | -63.9132 | -121.4333 | 1.7223 | 1.6634 | 1.6634 | 1.7223 | -121.4333 | -63.9132 | -123.1481 | -63.8871 | 1.7148 | -0.0261 | 2381.5840 | 2497.4338 |
| 4982.1008 | 4.7101 | 1300 | 4951.4316 | 0.0171 | -0.0003 | 0.3553 | 0.0174 | -63.9127 | -121.4370 | 1.7192 | 1.6602 | 1.6602 | 1.7192 | -121.4370 | -63.9127 | -123.1481 | -63.8871 | 1.7111 | -0.0257 | 2388.1934 | 2497.4824 |
| 4966.7375 | 5.0725 | 1400 | 4954.5615 | 0.0185 | 0.0008 | 0.3289 | 0.0177 | -63.8112 | -121.3000 | 1.7216 | 1.6631 | 1.6631 | 1.7216 | -121.3000 | -63.8112 | -123.1481 | -63.8871 | 1.8480 | 0.0759 | 2383.4727 | 2508.1672 |
| 4937.6176 | 5.4348 | 1500 | 4952.7949 | 0.0157 | -0.0019 | 0.3289 | 0.0176 | -64.0738 | -121.5761 | 1.7099 | 1.6508 | 1.6508 | 1.7099 | -121.5761 | -64.0738 | -123.1481 | -63.8871 | 1.5720 | -0.1868 | 2396.6667 | 2483.3738 |
| 4969.5398 | 5.7971 | 1600 | 4948.7925 | 0.0184 | -0.0001 | 0.3289 | 0.0186 | -63.8999 | -121.3049 | 1.7190 | 1.6601 | 1.6601 | 1.7190 | -121.3049 | -63.8999 | -123.1481 | -63.8871 | 1.8432 | -0.0128 | 2383.5056 | 2498.8604 |
| 4931.8516 | 6.1594 | 1700 | 4959.4023 | 0.0213 | 0.0026 | 0.2632 | 0.0188 | -63.6300 | -121.0142 | 1.7206 | 1.6597 | 1.6597 | 1.7206 | -121.0142 | -63.6300 | -123.1481 | -63.8871 | 2.1339 | 0.2570 | 2381.4475 | 2532.8616 |
| 4953.9797 | 6.5217 | 1800 | 4962.0317 | 0.0210 | 0.0004 | 0.2895 | 0.0206 | -63.8433 | -121.0445 | 1.7201 | 1.6602 | 1.6602 | 1.7201 | -121.0445 | -63.8433 | -123.1481 | -63.8871 | 2.1036 | 0.0438 | 2382.3406 | 2504.5334 |
| 4965.893 | 6.8841 | 1900 | 4953.7192 | 0.0187 | 0.0005 | 0.3289 | 0.0182 | -63.8390 | -121.2794 | 1.7207 | 1.6619 | 1.6619 | 1.7207 | -121.2794 | -63.8390 | -123.1481 | -63.8871 | 1.8687 | 0.0481 | 2383.2534 | 2505.0400 |
| 4950.5336 | 7.2464 | 2000 | 4958.1733 | 0.0211 | 0.0004 | 0.3158 | 0.0207 | -63.8483 | -121.0380 | 1.7193 | 1.6611 | 1.6611 | 1.7193 | -121.0380 | -63.8483 | -123.1481 | -63.8871 | 2.1101 | 0.0387 | 2382.7937 | 2504.2783 |
| 4966.3176 | 7.6087 | 2100 | 4951.5176 | 0.0195 | -0.0005 | 0.3816 | 0.0200 | -63.9397 | -121.2030 | 1.7190 | 1.6607 | 1.6607 | 1.7190 | -121.2030 | -63.9397 | -123.1481 | -63.8871 | 1.9451 | -0.0526 | 2381.8259 | 2494.8140 |
| 4946.1824 | 7.9710 | 2200 | 4957.3081 | 0.0206 | -0.0002 | 0.3026 | 0.0208 | -63.9058 | -121.0837 | 1.7198 | 1.6603 | 1.6603 | 1.7198 | -121.0837 | -63.9058 | -123.1481 | -63.8871 | 2.0643 | -0.0187 | 2387.4246 | 2498.1609 |
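
The validation loss does not decrease monotonically across the 22 evaluation points, so it can help to locate the lowest value programmatically. The sketch below copies the Step and Validation Loss columns from the results above; the variable names are illustrative. A lower validation loss is not guaranteed to mean a better model on downstream tasks, so this is only a way to read the table.

```python
# (step, validation loss) pairs copied from the training results above.
validation_loss_by_step = [
    (100, 4988.0952), (200, 4990.5610), (300, 4987.7910), (400, 4983.8423),
    (500, 4965.4326), (600, 4971.4199), (700, 4956.6055), (800, 4967.6992),
    (900, 4973.9316), (1000, 4954.9062), (1100, 4978.2021), (1200, 4952.4019),
    (1300, 4951.4316), (1400, 4954.5615), (1500, 4952.7949), (1600, 4948.7925),
    (1700, 4959.4023), (1800, 4962.0317), (1900, 4953.7192), (2000, 4958.1733),
    (2100, 4951.5176), (2200, 4957.3081),
]

# Evaluation point with the lowest validation loss.
best_step, best_loss = min(validation_loss_by_step, key=lambda pair: pair[1])

# Gap between the final reported checkpoint and the lowest point.
final_step, final_loss = validation_loss_by_step[-1]
gap = final_loss - best_loss

print(f"lowest validation loss: {best_loss} at step {best_step}")
print(f"final validation loss:  {final_loss} at step {final_step} (gap {gap:.4f})")
```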

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1

Model size: 6.91B params (Safetensors, tensor type BF16)

Repository: yiran-wang3/ds_chat_sppo_hard_cosine_iter0_masked_cosine_schedule