kalomaze committed
Commit e7766e1
1 Parent(s): f4ccb00

Update README.md

Files changed (1)
  1. README.md +34 -2
README.md CHANGED
@@ -1,2 +1,34 @@
- lr = 2e-6, ~2.5 mil tokens of Python instruct data, all around ~7k tokens ish for each sample (300 total samples).
- 1 epoch distillation of 70b logprobs, topk=200
+ # 70b Distillation Experiment
+ This is not the full-fledged run that I plan to do for a large-scale distillation of Llama3 70b.
+ Instead, it's a preliminary test training run of the custom distillation trainer, where we target the KL divergence from the larger Llama3 70b teacher model onto 4x8b (the student).
+ I'm releasing it here mainly so that people who are interested can tinker with it / finetune it to see how it behaves before I am ready to do a larger run.
+
+ # Training details
+ Each MLP layer from the original Llama3 8b is duplicated 3x (giving four experts per layer) in a typical Mixtral-style Sparse MoE layout; see the sketch below.
+
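As a rough illustration only (not the actual conversion code used for this model), turning a dense Llama MLP into identical experts plus a frozen gate in a Mixtral-style layout could look like the sketch below; `dense_mlp`, `hidden_size`, and `num_experts=4` are assumptions based on the description above.

```python
import copy
import torch.nn as nn

def mlp_to_moe(dense_mlp: nn.Module, hidden_size: int, num_experts: int = 4):
    """Sketch: build a sparse MoE layer from one dense MLP by creating
    num_experts identical copies of it and adding a router/gate."""
    experts = nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])
    gate = nn.Linear(hidden_size, num_experts, bias=False)
    gate.weight.requires_grad_(False)  # gate layers were kept frozen during this run
    return experts, gate
```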
+ Over the course of the training run, the expert selection count was gradually increased from the minimum (topk=1) to the maximum (topk=4), as in [Sparse MoE as the New Dropout](https://arxiv.org/abs/2303.01610). This was done with stochastic / randomized top-k expert selection and **frozen gate layers**, as recommended in the paper; a sketch of such a schedule follows below.
+
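A minimal sketch of what that schedule could look like, assuming a linear ramp over training steps and purely random expert choice under the frozen gate (the exact schedule and sampling used for this run are not specified):

```python
import torch

def scheduled_topk(step: int, total_steps: int, min_k: int = 1, max_k: int = 4) -> int:
    """Anneal the number of active experts from min_k up to max_k over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return min_k + round(frac * (max_k - min_k))

def stochastic_expert_mask(num_tokens: int, num_experts: int, k: int) -> torch.Tensor:
    """Randomly pick k experts per token (the gate is frozen, so routing is not learned);
    returns a (num_tokens, num_experts) 0/1 selection mask."""
    scores = torch.rand(num_tokens, num_experts)
    topk_idx = scores.topk(k, dim=-1).indices
    return torch.zeros(num_tokens, num_experts).scatter_(-1, topk_idx, 1.0)
```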
+ LR = 2e-6, ~2.5 million tokens of Python instruct data, at roughly 8k tokens per sample (~300 total samples).
+ Despite the use of instruct data, the model does not necessarily behave like an instruct model, as the training process involves mimicking a larger base model's distributions on that data.
+
+ 1 epoch of distillation against the 70b logprobs, using the topk=200 logits from the fp16 Llama3-70b.
+
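This is not the actual trainer code, but a minimal sketch of a top-k distillation objective consistent with the description above, assuming the teacher's top-200 logprobs and their vocab indices were precomputed and stored:

```python
import torch
import torch.nn.functional as F

def topk_kl_loss(student_logits: torch.Tensor,
                 teacher_topk_logprobs: torch.Tensor,
                 teacher_topk_indices: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) restricted to the teacher's top-k tokens (k=200 here).

    student_logits:        (batch, seq, vocab) from the 4x8b student
    teacher_topk_logprobs: (batch, seq, k) precomputed from fp16 Llama3-70b
    teacher_topk_indices:  (batch, seq, k) vocab ids of those top-k entries
    """
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # Look up the student's logprobs at the teacher's top-k token ids.
    student_topk = student_logprobs.gather(-1, teacher_topk_indices)
    teacher_probs = teacher_topk_logprobs.exp()
    # KL over the truncated support (renormalization of the tail is omitted for brevity).
    kl_per_position = (teacher_probs * (teacher_topk_logprobs - student_topk)).sum(-1)
    return kl_per_position.mean()
```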
+ # Evals
+
+ ## llama3-4x8b-pythonT2_step_final
+ Scores in parentheses are the corresponding reference results, and the trailing multiplier is the ratio between the two.
+
+ * mmlu: 65.10 (66.69) - 0.97x
+ * arc: 57.94 (59.47) - 0.97x
+ * hellaswag: 81.93 (82.09) - 0.99x
+ * winogrande: 77.03 (77.35) - 0.99x
+ * gsm8k: 50.95 (45.79) - 1.11x
+ * truthfulqa-mc1: 27.66
+ * truthfulqa-mc2: 44.53 (43.9) - 1.01x
+ * humaneval+: 32.9 (29.3) - 1.12x
+ * humaneval: 37.2 (33.5) - 1.11x
+
+ # Current Conclusions
+ Going by evals (and evals alone), full finetuning seems to have caused some degree of mild catastrophic forgetting outside of the domains that were specifically distilled, as you might expect from the limited amount of data. I plan to remedy this with lower LRs and/or bigger batch sizes, and of course with a much larger dataset than the limited selection seen here.
+ The plan is to do at least 1 billion unique tokens; we are still conducting custom tests of alternative loss functions (e.g., a weighted cross-entropy loss to be used in tandem with KL divergence). A rough sketch of that combination follows below.
+
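For illustration only, the kind of combination being tested might look like the sketch below (reusing `topk_kl_loss` from the earlier sketch); the `ce_weight`/`kl_weight` values and the hard-label cross-entropy term are assumptions, not the finalized loss.

```python
import torch
import torch.nn.functional as F

def combined_loss(student_logits: torch.Tensor, labels: torch.Tensor,
                  teacher_topk_logprobs: torch.Tensor, teacher_topk_indices: torch.Tensor,
                  ce_weight: float = 0.1, kl_weight: float = 1.0) -> torch.Tensor:
    """Weighted cross-entropy on the ground-truth tokens, used in tandem with
    the top-k KL distillation term sketched above."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
    kl = topk_kl_loss(student_logits, teacher_topk_logprobs, teacher_topk_indices)
    return ce_weight * ce + kl_weight * kl
```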
+ ![Embedded Image](https://cdn.discordapp.com/attachments/1201939812488052757/1242759414033813565/image.png?ex=664faa25&is=664e58a5&hm=a15028990e0f6dfba573ea2a00207daa01e953b19c2637e40546d5e594b51b36&)