
Teacher correction training hyperparameters

#13
by hjlee1371 - opened

Hi, thank you for open-sourcing these great models and methods. However, I cannot find the training hyperparameters (e.g., learning rate, batch size, learning rate scheduler) for the teacher correction in the paper. Could you please share these details?

Thank you.

NVIDIA org

Hi, thanks for the question.

For teacher correction of Llama-3.1 8B, we use: peak_lr=5e-5, min_lr=1e-5, cosine decay schedule, warmup=45 steps, batch size=2048.
The idea is to retain the hyperparameters of the original training schedule, but reduce the peak learning rate to around 1/5th of its original value while keeping the same min_lr.
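
For reference, here is a minimal sketch of that schedule as a linear warmup followed by cosine decay, written in plain Python. The total step count and the exact warmup shape are assumptions for illustration; in an actual run these values map onto your framework's optimizer/scheduler config.

```python
import math

# Values from the reply above; TOTAL_STEPS is a placeholder and should be
# set to the length of your own teacher-correction run.
PEAK_LR = 5e-5
MIN_LR = 1e-5
WARMUP_STEPS = 45
TOTAL_STEPS = 1000  # assumed for illustration


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

The same rule of thumb (keep the original schedule, cut peak_lr to roughly a fifth, keep min_lr) can be applied to other base models by plugging their original schedule values into the constants above.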

Thanks for the info! That really helps. Appreciate the quick response.

hjlee1371 changed discussion status to closed
