Update README.md
README.md
@@ -80,7 +80,7 @@ generated_text = tokenizer.decode(output_ids[0, input_length: ], skip_special_to

## Training

-For training, the learning rate is warmed up from 1e-7 to a maximum of 3e-4 over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping of 1.0. During training, we use an effective batch size of 81,920 tokens per gradient step, distributed over 40 NVIDIA H100-64GB GPUs. We use DeepSpeed with full
+For training, the learning rate is warmed up from 1e-7 to a maximum of 3e-4 over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping of 1.0. During training, we use an effective batch size of 81,920 tokens per gradient step, distributed over 40 NVIDIA H100-64GB GPUs. We use DeepSpeed with full *float32* training. The following table lists the training hyperparameters:

| **Hyper-Parameter** | **Value** |
|---------------------|--------------------------|
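
As a minimal sketch, the hyperparameters added in this hunk could be expressed as a DeepSpeed configuration dictionary like the one below. This is illustrative only: the commit does not include the actual config file, so the optimizer choice (AdamW), the per-GPU micro-batch size, and the accumulation steps are assumptions; only the warmup schedule, weight decay, gradient clipping, and full float32 training come from the updated paragraph.

```python
# Illustrative DeepSpeed config matching the hyperparameters described above.
# Keys marked "assumed" are not stated in the README diff.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumed; effective batch is given in tokens (81,920)
    "gradient_accumulation_steps": 1,     # assumed
    "gradient_clipping": 1.0,             # gradient clipping of 1.0
    "optimizer": {
        "type": "AdamW",                  # assumed optimizer
        "params": {"lr": 3e-4, "weight_decay": 0.1},
    },
    "scheduler": {
        "type": "WarmupLR",               # warmup from 1e-7 to 3e-4 over the first 2000 steps
        "params": {
            "warmup_min_lr": 1e-7,
            "warmup_max_lr": 3e-4,
            "warmup_num_steps": 2000,
        },
    },
    # Full float32 training: both reduced-precision modes stay disabled.
    "fp16": {"enabled": False},
    "bf16": {"enabled": False},
}
```

Such a dictionary could be passed as the `config` argument to `deepspeed.initialize` when launching across the 40 GPUs, though the repository's actual launch setup is not shown in this diff.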