
Teacher correction training hyperparameters

#13
by hjlee1371 - opened

Hi, thank you for open-sourcing these great models and methods. However, I cannot find the training hyperparameters (e.g., learning rate, batch size, learning rate scheduler) for the teacher correction in the paper. Could you please share these details?

Thank you.

NVIDIA org

Hi, thanks for the question.

For teacher correction of Llama-3.1 8B, we use: peak_lr=5e-5, min_lr=1e-5, cosine decay schedule, warmup=45 steps, batch size=2048.
The idea is to retain the hyperparameters of the original training schedule, but reduce the peak learning rate to around 1/5th of its original value while keeping the same min_lr.
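
For reference, here is a minimal sketch of that schedule as a linear warmup followed by cosine decay, written in plain Python. The total step count and the exact warmup shape are assumptions for illustration; in an actual run these values map onto your framework's optimizer/scheduler config.

```python
import math

# Values from the reply above; TOTAL_STEPS is a placeholder and should be
# set to the length of your own teacher-correction run.
PEAK_LR = 5e-5
MIN_LR = 1e-5
WARMUP_STEPS = 45
TOTAL_STEPS = 1000  # assumed for illustration


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

The same rule of thumb (keep the original schedule, cut peak_lr to roughly a fifth, keep min_lr) can be applied to other base models by plugging their original schedule values into the constants above.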

Thanks for the info! That really helps. Appreciate the quick response.

hjlee1371 changed discussion status to closed
