ibm-granite/granite-3.1-2b-instruct

@@ -13,7 +13,7 @@ base_model:
 # Granite-3.1-2B-Instruct
 **Model Summary:**
-Granite-3.1-2B-Instruct is a 8B parameter long-context instruct model finetuned from Granite-3.1-2B-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
+Granite-3.1-2B-Instruct is a 2B parameter long-context instruct model finetuned from Granite-3.1-2B-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
 - **Developers:** Granite Team, IBM
 - **GitHub Repository:** [ibm-granite/granite-3.1-language-models](https://github.com/ibm-granite/granite-3.1-language-models)
@@ -56,7 +56,7 @@ import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 device = "auto"
-model_path = "ibm-granite/Granite-3.1-2B-instruct"
+model_path = "ibm-granite/granite-3.1-2b-instruct"
 tokenizer = AutoTokenizer.from_pretrained(model_path)
 # drop device_map if running on CPU
 model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
@@ -82,21 +82,21 @@ Granite-3.1-2B-Instruct is based on a decoder-only dense transformer architectur
 | Model                     | 2B Dense | 8B Dense     | 1B MoE | 3B MoE |
 | :--------                 | :--------| :--------    | :------| :------|
-| Embedding size            | 2048     | **4096**     | 1024   | 1536   |
-| Number of layers          | 40       | **40**       | 24     | 32     |
-| Attention head size       | 64       | **128**      | 64     | 64     |
-| Number of attention heads | 32       | **32**       | 16     | 24     |
-| Number of KV heads        | 8        | **8**        | 8      | 8      |
-| MLP hidden size           | 8192     | **12800**    | 512    | 512    |
-| MLP activation            | SwiGLU   | **SwiGLU**   | SwiGLU | SwiGLU |
-| Number of experts         | —        | **—**        | 32     | 40     |
-| MoE TopK                  | —        | **—**        | 8      | 8      |
-| Initialization std        | 0.1      | **0.1**      | 0.1    | 0.1    |
-| Sequence length           | 128K     | **128K**     | 128K   | 128K   |
-| Position embedding        | RoPE     | **RoPE**     | RoPE   | RoPE   |
-| # Parameters              | 2.5B     | **8.1B**     | 1.3B   | 3.3B   |
-| # Active parameters       | 2.5B     | **8.1B**     | 400M   | 800M   |
-| # Training tokens         | 12T      | **12T**      | 10T    | 10T    |
+| Embedding size            | **2048**     | 4096     | 1024   | 1536   |
+| Number of layers          | **40**       | 40       | 24     | 32     |
+| Attention head size       | **64**       | 128      | 64     | 64     |
+| Number of attention heads | **32**       | 32       | 16     | 24     |
+| Number of KV heads        | **8**        | 8        | 8      | 8      |
+| MLP hidden size           | **8192**     | 12800    | 512    | 512    |
+| MLP activation            | **SwiGLU**   | SwiGLU   | SwiGLU | SwiGLU |
+| Number of experts         | **—**        | —        | 32     | 40     |
+| MoE TopK                  | **—**        | —        | 8      | 8      |
+| Initialization std        | **0.1**      | 0.1      | 0.1    | 0.1    |
+| Sequence length           | **128K**     | 128K     | 128K   | 128K   |
+| Position embedding        | **RoPE**     | RoPE     | RoPE   | RoPE   |
+| # Parameters              | **2.5B**     | 8.1B     | 1.3B   | 3.3B   |
+| # Active parameters       | **2.5B**     | 8.1B     | 400M   | 800M   |
+| # Training tokens         | **12T**      | 12T      | 10T    | 10T    |
 **Training Data:**
 Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the [Granite 3.0 Technical Report](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/paper.pdf), [Granite 3.1 Technical Report (coming soon)](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d), and [Accompanying Author List](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/author-ack.pdf).
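
Review note: since this change moves the bold highlight from the 8B column to the 2B column, a quick sanity check of the table values may be useful. The sketch below copies the attention-related rows from the table and checks that the embedding size equals the number of attention heads times the head size, which is the usual convention but is not stated in the model card. It also gives a rough KV-cache estimate at full sequence length. The names `SPECS`, `heads_match_embedding`, `queries_per_kv_head`, and `kv_cache_bytes` are hypothetical helpers, and the cache formula assumes that 128K means 131072 tokens, that keys and values are stored at 2 bytes per value, and that every layer caches all KV heads.

```python
# Attention-related rows copied from the architecture table in this diff.
SPECS = {
    "2B Dense": {"embedding": 2048, "layers": 40, "head_size": 64, "heads": 32, "kv_heads": 8},
    "8B Dense": {"embedding": 4096, "layers": 40, "head_size": 128, "heads": 32, "kv_heads": 8},
    "1B MoE": {"embedding": 1024, "layers": 24, "head_size": 64, "heads": 16, "kv_heads": 8},
    "3B MoE": {"embedding": 1536, "layers": 32, "head_size": 64, "heads": 24, "kv_heads": 8},
}


def heads_match_embedding(spec):
    """Check the assumed convention: embedding size == heads * head size."""
    return spec["heads"] * spec["head_size"] == spec["embedding"]


def queries_per_kv_head(spec):
    """Number of query heads that share one KV head."""
    return spec["heads"] // spec["kv_heads"]


def kv_cache_bytes(spec, tokens, bytes_per_value=2):
    """Rough KV-cache size for one sequence (generic estimate, not from the model card).

    The factor 2 accounts for storing both keys and values.
    """
    return 2 * spec["layers"] * spec["kv_heads"] * spec["head_size"] * tokens * bytes_per_value


if __name__ == "__main__":
    tokens = 128 * 1024  # assumption: "128K" in the table means 131072 tokens
    for name, spec in SPECS.items():
        gib = kv_cache_bytes(spec, tokens) / 2**30
        print(name, heads_match_embedding(spec), queries_per_kv_head(spec), f"{gib:.1f} GiB")
```

Under these assumptions all four columns are internally consistent, and each model has fewer KV heads than query heads, which is consistent with grouped-query attention.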