training data word choice fix
README.md CHANGED
@@ -276,10 +276,10 @@ Granite-3.0-8B-Base is based on a decoder-only dense transformer architecture. C
 | # Training tokens | 12T | **12T** | 10T | 10T |
 
 **Training Data:**
-This model is trained on a mix of open source and proprietary data following a two-
-* Stage 1 data: The data for
-* Stage 2 data: The data for
-
+This model is trained on a mix of open source and proprietary data following a two-stage training strategy.
+* Stage 1 data: The data for stage 1 is sourced from diverse domains, such as: web, code, academic sources, books, and math data.
+* Stage 2 data: The data for stage 2 comprises a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the model’s performance on specific tasks.
+
 **Infrastructure:**
 We train Granite 3.0 Language Models using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.