LLM360-MBZUAI
committed on
Update README.md
add some bullet emoji to model card
README.md
CHANGED
@@ -25,14 +25,17 @@ By comparing CrystalCoder with other similar work, CrystalCoder is quite balance
|
|
25 |
|
26 |
|
27 |
**Notes**
|
28 |
-
|
29 |
-
-
|
30 |
-
|
|
|
|
|
31 |
- As reported in prior work, the choice of temperature strongly affects the programming metrics, so we evaluate all models with the following temperatures:
|
32 |
- Scores for HumanEval are computed with a temperature of 0.2
|
33 |
- Scores for MBPP are computed with a temperature of 0.1
|
34 |
- For a detailed token breakdown of the CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
|
35 |
|
|
|
36 |
|
37 |
## About LLM360
|
38 |
LLM360 is an initiative for comprehensive and fully open-sourced LLMs,
|
@@ -47,7 +50,7 @@ effort.
|
|
47 |
|
48 |
Get access now at [LLM360 site](https://www.llm360.ai/)
|
49 |
|
50 |
-
## Model Description
|
51 |
|
52 |
- **Model type:** Language model with the same architecture as LLaMA-7B
|
53 |
- **Language(s) (NLP):** English
|
@@ -58,7 +61,7 @@ Get access now at [LLM360 site](https://www.llm360.ai/)
|
|
58 |
- [Metrics](https://github.com/LLM360/Analysis360)
|
59 |
- [Fully processed CrystalCoder pretraining data](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets)
|
60 |
|
61 |
-
# Model Architecture
|
62 |
|
63 |
CrystalCoder leverages a GPT-like architecture, akin to LLaMA, but with the addition of maximal update parameterization (**muP**).
|
64 |
|
@@ -85,7 +88,7 @@ For other architecture choices:
|
|
85 |
- Training sequence length is `2048`.
|
86 |
- Embedding dimension is `32032`.
|
87 |
|
88 |
-
# Tokenization
|
89 |
|
90 |
Our tokenizer is based on the LLaMA tokenizer, with 22 additional special tokens for the following usage:
|
91 |
- 4 filling-in-middle (FIM) tokens such as `<|fim_prefix|>` to support FIM inference.
|
@@ -94,7 +97,7 @@ Our tokenizer is based on the LLaMA tokenizer, with 22 additional special tokens
|
|
94 |
|
95 |
Therefore, we extended the LLaMA tokenizer vocabulary size from `32000` to `32032`. Some token ids are reserved and not used.
|
96 |
|
97 |
-
# Training
|
98 |
|
99 |
Our training has 3 stages:
|
100 |
- Stage 1: Pretraining on first half of SlimPajama (50% x 690B = 345B).
|
@@ -114,12 +117,12 @@ For hyperparameters used in each stage, please refer to the following table:
|
|
114 |
|
115 |
For more details of training, please refer to [our paper](https://arxiv.org/pdf/2312.06550.pdf).
|
116 |
|
117 |
-
# Dataset
|
118 |
|
119 |
Our tokenized datasets for all phases are available at [CrystalCoderDatasets](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
|
120 |
|
121 |
|
122 |
-
# Model Usage
|
123 |
|
124 |
To load a specific checkpoint, use the revision argument as shown below, for example, `CrystalCoder_phase1_checkpoint_055500`. All the revisions can be seen from the branch dropdown in the "Files and versions" tab. If no revision argument is provided, it will load the phase 3 final checkpoint `CrystalCoder_phase3_checkpoint_027728`.
|
125 |
|
@@ -146,7 +149,7 @@ print("-"*20 + "Output for model" + 20 * '-')
|
|
146 |
print(tokenizer.batch_decode(gen_tokens)[0])
|
147 |
```
|
148 |
|
149 |
-
## Completion Example:
|
150 |
|
151 |
### prompt:
|
152 |
|
@@ -185,7 +188,7 @@ def closest_pair(numbers: List[float], threshold: float) -> int:
|
|
185 |
<unk> import torch
|
186 |
import numpy as np
|
187 |
```
|
188 |
-
# Training Logs and Evaluation Results
|
189 |
|
190 |
Please refer to our [W&B project page](https://wandb.ai/llm360/CrystalCoder) for complete training logs and evaluation results.
|
191 |
|
@@ -204,11 +207,11 @@ Selected Metrics are displayed below.
|
|
204 |
|<img src="cc-mmlu-1.png" alt="mmlu" width="400"/> | <img src="cc-truthful-1.png" alt="truthfulqa" width="400"/> |
|
205 |
|
206 |
|
207 |
-
# CrystalCoder-Instruct
|
208 |
|
209 |
We also have instruction-tuned versions of CrystalCoder, based on the stage 2 and stage 3 final checkpoints. The Instruct versions will be released later.
|
210 |
|
211 |
-
# Citation
|
212 |
|
213 |
**BibTeX:**
|
214 |
|
|
|
25 |
|
26 |
|
27 |
**Notes**
|
28 |
+
|
29 |
+
- We compute all evaluation metrics ourselves.
|
30 |
+
|
31 |
+
- Language benchmarks are computed following the convention of [the Hugging Face Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard): AI2 Reasoning Challenge in 25-shot, HellaSwag in 10-shot, MMLU in 5-shot, and TruthfulQA in 0-shot.
|
32 |
+
|
33 |
- As reported in prior work, the choice of temperature strongly affects the programming metrics, so we evaluate all models with the following temperatures:
|
34 |
- Scores for HumanEval are computed with a temperature of 0.2
|
35 |
- Scores for MBPP are computed with a temperature of 0.1
|
36 |
- For a detailed token breakdown of the CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
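
As a rough illustration of why temperature matters for the programming metrics in the notes above, the sketch below applies temperature scaling to a softmax over made-up logits. The logit values and the helper name `softmax_with_temperature` are assumptions for illustration only and are not part of the CrystalCoder evaluation code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

# 0.2 and 0.1 are the HumanEval and MBPP settings quoted in the notes.
for t in (1.0, 0.2, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

Lower temperatures push more probability onto the highest-scoring token, so sampled completions become closer to greedy decoding.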
|
37 |
|
38 |
+
|
39 |
|
40 |
## About LLM360
|
41 |
LLM360 is an initiative for comprehensive and fully open-sourced LLMs,
|
|
|
50 |
|
51 |
Get access now at [LLM360 site](https://www.llm360.ai/)
|
52 |
|
53 |
+
## 🟣 Model Description
|
54 |
|
55 |
- **Model type:** Language model with the same architecture as LLaMA-7B
|
56 |
- **Language(s) (NLP):** English
|
|
|
61 |
- [Metrics](https://github.com/LLM360/Analysis360)
|
62 |
- [Fully processed CrystalCoder pretraining data](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets)
|
63 |
|
64 |
+
# 🟣 Model Architecture
|
65 |
|
66 |
CrystalCoder leverages a GPT-like architecture, akin to LLaMA, but with the addition of maximal update parameterization (**muP**).
|
67 |
|
|
|
88 |
- Training sequence length is `2048`.
|
89 |
- Embedding dimension is `32032`.
|
90 |
|
91 |
+
# 🟣 Tokenization
|
92 |
|
93 |
Our tokenizer is based on the LLaMA tokenizer, with 22 additional special tokens for the following usage:
|
94 |
- 4 filling-in-middle (FIM) tokens such as `<|fim_prefix|>` to support FIM inference.
|
|
|
97 |
|
98 |
Therefore, we extended the LLaMA tokenizer vocabulary size from `32000` to `32032`. Some token ids are reserved and not used.
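
The vocabulary arithmetic above can be checked in a few lines. The variable names are hypothetical and the numbers come from the text. The count of reserved ids is just the leftover slots, assuming every new id is either a special token or reserved.

```python
# Figures taken from the text above.
LLAMA_VOCAB_SIZE = 32000
EXTENDED_VOCAB_SIZE = 32032
ADDED_SPECIAL_TOKENS = 22

# Ids appended after the original LLaMA vocabulary.
new_ids = EXTENDED_VOCAB_SIZE - LLAMA_VOCAB_SIZE

# Assumption: every new id that is not a special token is a reserved, unused id.
reserved_ids = new_ids - ADDED_SPECIAL_TOKENS

print(f"new ids: {new_ids}, reserved: {reserved_ids}")
```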
|
99 |
|
100 |
+
# 🟣 Training
|
101 |
|
102 |
Our training has 3 stages:
|
103 |
- Stage 1: Pretraining on first half of SlimPajama (50% x 690B = 345B).
|
|
|
117 |
|
118 |
For more details of training, please refer to [our paper](https://arxiv.org/pdf/2312.06550.pdf).
|
119 |
|
120 |
+
# 🟣 Dataset
|
121 |
|
122 |
Our tokenized datasets for all phases are available at [CrystalCoderDatasets](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
|
123 |
|
124 |
|
125 |
+
# 🟣 Model Usage
|
126 |
|
127 |
To load a specific checkpoint, use the revision argument as shown below, for example, `CrystalCoder_phase1_checkpoint_055500`. All the revisions can be seen from the branch dropdown in the "Files and versions" tab. If no revision argument is provided, it will load the phase 3 final checkpoint `CrystalCoder_phase3_checkpoint_027728`.
|
128 |
|
|
|
149 |
print(tokenizer.batch_decode(gen_tokens)[0])
|
150 |
```
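
The revision names quoted above follow a visible pattern, so a small helper can build them. This is a sketch rather than the project's own tooling: `checkpoint_revision` and `load_checkpoint` are hypothetical names, the zero-padded six-digit step is inferred from the two examples in the text, and the repository id `LLM360/CrystalCoder` and `trust_remote_code=True` are assumptions about how the model is hosted.

```python
def checkpoint_revision(phase: int, step: int) -> str:
    """Build a revision name such as CrystalCoder_phase1_checkpoint_055500."""
    return f"CrystalCoder_phase{phase}_checkpoint_{step:06d}"


def load_checkpoint(phase: int, step: int, repo_id: str = "LLM360/CrystalCoder"):
    """Load the tokenizer and model for one checkpoint (downloads weights)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    revision = checkpoint_revision(phase, step)
    tokenizer = AutoTokenizer.from_pretrained(
        repo_id, revision=revision, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, revision=revision, trust_remote_code=True
    )
    return tokenizer, model


# Only the name helper runs here; load_checkpoint is defined but not called.
print(checkpoint_revision(1, 55500))
print(checkpoint_revision(3, 27728))
```

Calling `load_checkpoint(1, 55500)` would fetch that checkpoint from the Hub, which needs network access and enough disk space for the weights.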
|
151 |
|
152 |
+
## 🟣 Completion Example:
|
153 |
|
154 |
### prompt:
|
155 |
|
|
|
188 |
<unk> import torch
|
189 |
import numpy as np
|
190 |
```
|
191 |
+
# 🟣 Training Logs and Evaluation Results
|
192 |
|
193 |
Please refer to our [W&B project page](https://wandb.ai/llm360/CrystalCoder) for complete training logs and evaluation results.
|
194 |
|
|
|
207 |
|<img src="cc-mmlu-1.png" alt="mmlu" width="400"/> | <img src="cc-truthful-1.png" alt="truthfulqa" width="400"/> |
|
208 |
|
209 |
|
210 |
+
# 🟣 CrystalCoder-Instruct
|
211 |
|
212 |
We also have instruction-tuned versions of CrystalCoder, based on the stage 2 and stage 3 final checkpoints. The Instruct versions will be released later.
|
213 |
|
214 |
+
# 🟣 Citation
|
215 |
|
216 |
**BibTeX:**
|
217 |
|