Teacher correction training hyperparameters
#13 opened by hjlee1371
Hi, thank you for open-sourcing these great models and methods. However, I cannot find the training hyperparameters (e.g., learning rate, batch size, learning rate scheduler) for the teacher correction in the paper. Could you please share these details?
Thank you.
Hi, thanks for the question.
For teacher correction of Llama-3.1 8B, we use: peak_lr=5e-5, min_lr=1e-5, cosine decay schedule, warmup=45 steps, batch size=2048.
The idea is to retain the hyperparameters of the original training schedule, but reduce peak_lr to roughly one-fifth of the original value while keeping the original min_lr.
Thanks for the info! That really helps. Appreciate the quick response.
hjlee1371 changed discussion status to closed