alexmarques committed
Commit c8857c0
1 Parent(s): 0dbc179

Update README.md

Files changed (1):
  1. README.md +136 -72
README.md CHANGED
@@ -35,8 +35,8 @@ It achieves an average score of 67.57 on the [OpenLLM](https://huggingface.co/sp
 This model was obtained by quantizing the weights of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) to INT4 data type.
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 
-Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights.
-The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library. GPTQ used a 1% damping factor and 512 sequences of 8,192 random tokens.
+Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating-point representations of the quantized weights.
+[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with a 10% damping factor and 768 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
 
 
 ## Deployment
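For intuition, the weight scheme described above (symmetric quantization, with one linear scale mapping INT4 values back to floating point) amounts to the following. This is an illustrative sketch only: the function names are invented, and the actual model is produced with GPTQ (using `group_size=128`, per the script below) rather than this naive rounding:

```python
import torch

def quantize_w4_symmetric(weight: torch.Tensor):
    """Naive symmetric INT4 quantization with one scale per output channel."""
    # weight: [out_features, in_features]
    scale = weight.abs().amax(dim=1, keepdim=True) / 7  # map max |w| to the INT4 limit 7
    q = torch.clamp(torch.round(weight / scale), min=-8, max=7)  # INT4 range is [-8, 7]
    return q.to(torch.int8), scale  # 4-bit values held in an int8 container

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # The same linear per-channel scale maps the integers back to floating point.
    return q.to(scale.dtype) * scale
```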
@@ -80,45 +80,40 @@ Although AutoGPTQ was used for this particular model, Neural Magic is transition
 
 ```python
 from transformers import AutoTokenizer
-from datasets import Dataset
-from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
-from llmcompressor.modifiers.quantization import GPTQModifier
-import random
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+from datasets import load_dataset
 
 model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
 
-num_samples = 512
-max_seq_len = 8192
+num_samples = 768
+max_seq_len = 4064
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-preprocess_fn = lambda example: {"text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n{text}".format_map(example)}
+def preprocess_fn(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
 
-dataset_name = "neuralmagic/LLM_compression_calibration"
-dataset = load_dataset(dataset_name, split="train")
-ds = dataset.shuffle().select(range(num_samples))
+ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ds = ds.shuffle().select(range(num_samples))
 ds = ds.map(preprocess_fn)
 
-recipe = GPTQModifier(
-    targets="Linear",
-    scheme="W4A16",
-    ignore=["lm_head"],
-    dampening_frac=0.01,
-)
+examples = [tokenizer(example["text"], padding=False, max_length=max_seq_len, truncation=True) for example in ds]
+
+quantize_config = BaseQuantizeConfig(
+    bits=4,
+    group_size=128,
+    desc_act=True,
+    model_file_base_name="model",
+    damp_percent=0.1,
+)
 
-model = SparseAutoModelForCausalLM.from_pretrained(
+model = AutoGPTQForCausalLM.from_pretrained(
     model_id,
+    quantize_config,
     device_map="auto",
-    trust_remote_code=True,
 )
 
-oneshot(
-    model=model,
-    dataset=ds,
-    recipe=recipe,
-    max_seq_length=max_seq_len,
-    num_calibration_samples=num_samples,
-)
+model.quantize(examples)
 model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
 ```
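As a quick sanity check after running the script above, the saved checkpoint can be reloaded through the same auto_gptq API. This snippet is a sketch added for illustration (the prompt is arbitrary and the path simply reuses the `save_pretrained` output directory); it is not part of the commit:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Directory written by save_pretrained() in the script above.
path = "Meta-Llama-3.1-8B-Instruct-quantized.w4a16"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
model = AutoGPTQForCausalLM.from_quantized(path, device_map="auto")

# Generate a short completion to confirm the quantized weights load and run.
inputs = tokenizer("What is 4-bit weight quantization?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```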
 
@@ -126,14 +121,9 @@ model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
 
 ## Evaluation
 
-The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
-```
-lm_eval \
-  --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
-  --tasks openllm \
-  --batch_size auto
-```
+The model was evaluated on the MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA benchmarks.
+Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+This version of lm-evaluation-harness includes versions of MMLU, ARC-Challenge, and GSM-8K that match the prompting style of [Meta-Llama-3.1-8B-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
 ### Accuracy
 
@@ -143,48 +133,50 @@
-<table>
-  <tr><td><strong>Benchmark</strong></td><td><strong>Meta-Llama-3.1-8B-Instruct</strong></td><td><strong>hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4</strong></td><td><strong>Meta-Llama-3.1-8B-Instruct-quantized.w4a16 (this model)</strong></td><td><strong>Recovery (this model)</strong></td></tr>
-  <tr><td>MMLU (5-shot)</td><td>67.94</td><td>66.33</td><td>65.38</td><td>96.23%</td></tr>
-  <tr><td>ARC Challenge (25-shot)</td><td>60.41</td><td>58.36</td><td>59.30</td><td>98.16%</td></tr>
-  <tr><td>GSM-8K (5-shot, strict-match)</td><td>75.66</td><td>74.07</td><td>75.43</td><td>99.69%</td></tr>
-  <tr><td>Hellaswag (10-shot)</td><td>80.01</td><td>79.18</td><td>79.05</td><td>98.80%</td></tr>
-  <tr><td>Winogrande (5-shot)</td><td>77.90</td><td>76.00</td><td>76.08</td><td>97.66%</td></tr>
-  <tr><td>TruthfulQA (0-shot)</td><td>54.04</td><td>51.91</td><td>50.19</td><td>92.8%</td></tr>
-  <tr><td><strong>Average</strong></td><td><strong>69.33</strong></td><td><strong>67.64</strong></td><td><strong>67.57</strong></td><td><strong>97.47%</strong></td></tr>
-</table>
+<table>
+  <tr><td><strong>Benchmark</strong></td><td><strong>Meta-Llama-3.1-8B-Instruct</strong></td><td><strong>Meta-Llama-3.1-8B-Instruct-quantized.w4a16 (this model)</strong></td><td><strong>Recovery</strong></td></tr>
+  <tr><td>MMLU (5-shot)</td><td>69.43</td><td>67.68</td><td>97.5%</td></tr>
+  <tr><td>MMLU (CoT, 0-shot)</td><td>72.56</td><td>70.36</td><td>97.0%</td></tr>
+  <tr><td>ARC Challenge (0-shot)</td><td>81.57</td><td>79.95</td><td>98.0%</td></tr>
+  <tr><td>GSM-8K (CoT, 8-shot, strict-match)</td><td>82.79</td><td>79.53</td><td>96.1%</td></tr>
+  <tr><td>Hellaswag (10-shot)</td><td>80.01</td><td>78.57</td><td>98.2%</td></tr>
+  <tr><td>Winogrande (5-shot)</td><td>77.90</td><td>76.48</td><td>98.2%</td></tr>
+  <tr><td>TruthfulQA (0-shot, mc2)</td><td>54.04</td><td>50.46</td><td>93.4%</td></tr>
+  <tr><td><strong>Average</strong></td><td><strong>74.04</strong></td><td><strong>71.86</strong></td><td><strong>97.1%</strong></td></tr>
+</table>
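As a reading aid for the new table (not part of the commit): the Recovery column is the quantized score as a percentage of the unquantized baseline, and the Average row is the plain mean over the seven benchmarks. A quick check of the arithmetic:

```python
base  = [69.43, 72.56, 81.57, 82.79, 80.01, 77.90, 54.04]  # Meta-Llama-3.1-8B-Instruct
quant = [67.68, 70.36, 79.95, 79.53, 78.57, 76.48, 50.46]  # this model

recovery = lambda q, b: round(100 * q / b, 1)
print(recovery(quant[0], base[0]))        # 97.5  -> MMLU (5-shot) row
print(round(sum(base) / len(base), 2))    # 74.04 -> Average, baseline
print(round(sum(quant) / len(quant), 2))  # 71.86 -> Average, this model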
+
+### Reproduction
+
+The results were obtained using the following commands:
+
+#### MMLU
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+  --tasks mmlu_llama_3.1_instruct \
+  --fewshot_as_multiturn \
+  --apply_chat_template \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+#### MMLU-CoT
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+  --tasks mmlu_cot_0shot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
+#### ARC-Challenge
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+  --tasks arc_challenge_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
+#### GSM-8K
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+  --tasks gsm8k_cot_llama_3.1_instruct \
+  --fewshot_as_multiturn \
+  --apply_chat_template \
+  --num_fewshot 8 \
+  --batch_size auto
+```
+
+#### Hellaswag
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks hellaswag \
+  --num_fewshot 10 \
+  --batch_size auto
+```
+
+#### Winogrande
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks winogrande \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+#### TruthfulQA
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks truthfulqa \
+  --num_fewshot 0 \
+  --batch_size auto
+```
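The last three commands differ only in the task name and few-shot count, so they are easy to batch. A convenience sketch (not from the model card) that reruns just those three, assuming `lm_eval` is on the PATH:

```python
import subprocess

# Shared --model_args string, mirroring the Hellaswag/Winogrande/TruthfulQA commands above.
MODEL_ARGS = ('pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",'
              'dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1')

# Task -> few-shot count; the chat-templated llama_3.1_instruct tasks are excluded
# because they each need different max_model_len / max_gen_toks settings.
RUNS = {"hellaswag": "10", "winogrande": "5", "truthfulqa": "0"}

for task, fewshot in RUNS.items():
    subprocess.run(
        ["lm_eval", "--model", "vllm", "--model_args", MODEL_ARGS,
         "--tasks", task, "--num_fewshot", fewshot, "--batch_size", "auto"],
        check=True,
    )
```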