## Training

For training, the learning rate is warmed up from 1e-7 to a maximum of 3e-4 over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping of 1.0. We use an effective batch size of 81,920 tokens per gradient step, distributed across 40 NVIDIA H100-64GB GPUs, and train with DeepSpeed in full *float32* precision; an illustrative configuration sketch is given below, after the table. The training hyperparameters are summarized in the following table:

| **Hyper-Parameter** | **Value** |
|---------------------|--------------------------|
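As a rough illustration only (not the exact configuration used for this model), the settings described above could be expressed as a DeepSpeed config along these lines. The optimizer choice (AdamW) is an assumption, since the text does not name one, and the token-level effective batch size does not map directly onto DeepSpeed's sample-based batch fields:

```python
# Hypothetical sketch of a DeepSpeed configuration matching the hyperparameters
# described above; not the authors' actual config file.
ds_config = {
    "optimizer": {
        "type": "AdamW",               # assumption: the optimizer is not stated above
        "params": {"lr": 3e-4, "weight_decay": 0.1},
    },
    "scheduler": {
        "type": "WarmupLR",            # warms the LR up from a minimum to a maximum
        "params": {
            "warmup_min_lr": 1e-7,     # starting learning rate
            "warmup_max_lr": 3e-4,     # peak learning rate
            "warmup_num_steps": 2000,  # warmup over the first 2000 steps
        },
    },
    "gradient_clipping": 1.0,
    # Full float32 training: both mixed-precision modes disabled.
    "fp16": {"enabled": False},
    "bf16": {"enabled": False},
    # The effective batch size of 81,920 *tokens* per gradient step arises from
    # per-GPU micro-batch size, gradient accumulation, and sequence length across
    # 40 GPUs; DeepSpeed's batch-size fields count samples, not tokens, and the
    # per-device values are not stated above, so they are omitted here.
}
```

Such a dictionary would typically be passed to `deepspeed.initialize(...)`, or supplied through the `deepspeed` argument of the Hugging Face `TrainingArguments` when training with the `Trainer`.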