# distily_bench_gpt2_optim
This student model was distilled from the teacher model gpt2 (dataset unspecified) using the Distily library.
It achieves the following results on the evaluation set:
- eval_enwikippl: 524.7870
- eval_frwikippl: 3705.5625
- eval_zhwikippl: 6035.2861
- eval_loss: 2370.7361
- eval_runtime: 21.6322
- eval_samples_per_second: 46.227
- eval_steps_per_second: 11.557
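The three `*ppl` metrics are perplexities on English, French, and Chinese Wikipedia evaluation text; `eval_loss` is the value of the distillation objective, and `eval_runtime` is in seconds. For reference, perplexity is the exponential of the mean next-token cross-entropy of a causal LM; a minimal sketch of the standard formula (not Distily's evaluation code):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """exp(mean next-token cross-entropy) for a causal LM.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len).
    """
    # Shift so the prediction at position t is scored against token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    nll = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return torch.exp(nll).item()
```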
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- distillation_objective: LinearObjective(logits_weight=1, logits_loss_fn=kl_divergence_loss, activations_weight=10, activations_loss_fn=kl_divergence_loss, attentions_weight=0, attentions_loss_fn=mse_loss)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16 (train_batch_size × gradient_accumulation_steps = 4 × 4)
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
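The objective above is a weighted sum of per-component losses: KL divergence between student and teacher logits (weight 1), KL divergence between corresponding intermediate activations (weight 10), and an MSE attention-matching term that is disabled in this run (weight 0). A minimal PyTorch sketch of such a linear objective, assuming same-architecture student/teacher outputs produced with `output_hidden_states=True` (Distily's actual implementation may differ):

```python
import torch
import torch.nn.functional as F

def kl_divergence_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # KL(teacher || student) over the last dimension, averaged per batch element.
    return F.kl_div(
        F.log_softmax(student, dim=-1),
        F.softmax(teacher, dim=-1),
        reduction="batchmean",
    )

def linear_objective(student_out, teacher_out,
                     logits_weight=1.0, activations_weight=10.0, attentions_weight=0.0):
    """Weighted sum of component distillation losses (sketch, not Distily's code)."""
    loss = logits_weight * kl_divergence_loss(student_out.logits, teacher_out.logits)
    # Match each student hidden state to the teacher's (same depth, since both are gpt2).
    for s_h, t_h in zip(student_out.hidden_states, teacher_out.hidden_states):
        loss = loss + activations_weight * kl_divergence_loss(s_h, t_h)
    if attentions_weight:  # weight 0 in this run, so this branch is skipped
        for s_a, t_a in zip(student_out.attentions, teacher_out.attentions):
            loss = loss + attentions_weight * F.mse_loss(s_a, t_a)
    return loss
```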
## Resource Usage
Peak GPU Memory: 4.5067 GB
## Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2385 | 57.2728 | | | | | 18.1772 |
| 0 | 0 | 55339.3672 | 57682.5742 | 31197.1836 | 21.4398 | 46.642 | 11.661 | 57080.2930 |
| 500 | 0.0808 | 1545.6934 | 7685.4297 | 3209.9360 | 21.4847 | 46.545 | 11.636 | 63830.4023 |
| 1000 | 0.1616 | 1108.6847 | 5659.8701 | 2933.1360 | 21.4559 | 46.607 | 11.652 | 31166.1797 |
| 1500 | 0.2424 | 913.3565 | 4893.8623 | 2798.0161 | 21.5956 | 46.306 | 11.576 | 23215.4258 |
| 2000 | 0.3232 | 813.5310 | 4763.6436 | 2700.0161 | 21.635 | 46.221 | 11.555 | 22568.9238 |
| 2500 | 0.4040 | 747.3608 | 4565.6851 | 2631.0720 | 21.5442 | 46.416 | 11.604 | 18090.1602 |
| 3000 | 0.4848 | 711.6094 | 4255.0127 | 2579.2639 | 21.7116 | 46.058 | 11.515 | 16199.8096 |
| 3500 | 0.5657 | 666.4665 | 4117.3369 | 2530.9441 | 21.5886 | 46.321 | 11.58 | 16435.1426 |
| 4000 | 0.6465 | 638.0192 | 4058.8262 | 2500.0801 | 21.4712 | 46.574 | 11.643 | 16069.4648 |
| 4500 | 0.7273 | 597.0923 | 4013.0125 | 2459.4241 | 21.7093 | 46.063 | 11.516 | 12965.0762 |
| 5000 | 0.8081 | 567.6912 | 3822.9963 | 2424.4800 | 21.5309 | 46.445 | 11.611 | 10275.5850 |
| 5500 | 0.8889 | 548.5159 | 3864.8674 | 2399.5359 | 21.6408 | 46.209 | 11.552 | 8114.6914 |
| 6000 | 0.9697 | 539.3817 | 3793.8606 | 2379.3601 | 21.5636 | 46.374 | 11.594 | 6467.9736 |
| 6187 | 0.9999 | 524.7870 | 3705.5625 | 2370.7361 | 21.6322 | 46.227 | 11.557 | 6035.2861 |
### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0