neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16

@@ -35,8 +35,8 @@ It achieves an average score of 67.57 on the [OpenLLM](https://huggingface.co/sp
 This model was obtained by quantizing the weights of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) to INT4 data type.
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
-Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights.
-The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library. GPTQ used a 1% damping factor and 512 sequences of 8,192 random tokens.
 ## Deployment
@@ -80,45 +80,40 @@ Although AutoGPTQ was used for this particular model, Neural Magic is transition
 ```python
 from transformers import AutoTokenizer
-from datasets import Dataset
-from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
-from llmcompressor.modifiers.quantization import GPTQModifier
-import random
 model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
-num_samples = 512
-max_seq_len = 8192
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-preprocess_fn = lambda example: {"text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n{text}".format_map(example)}
-dataset_name = "neuralmagic/LLM_compression_calibration"
-dataset = load_dataset(dataset_name, split="train")
-ds = dataset.shuffle().select(range(num_samples))
 ds = ds.map(preprocess_fn)
-recipe = GPTQModifier(
-  targets="Linear",
-  scheme="W4A16",
-  ignore=["lm_head"],
-  dampening_frac=0.01,
 )
-model = SparseAutoModelForCausalLM.from_pretrained(
   model_id,
   device_map="auto",
-  trust_remote_code=True,
 )
-oneshot(
-  model=model,
-  dataset=ds,
-  recipe=recipe,
-  max_seq_length=max_seq_len,
-  num_calibration_samples=num_samples,
-)
 model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
 ```
@@ -126,14 +121,9 @@ model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
 ## Evaluation
-The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
-```
-lm_eval \
-  --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
-  --tasks openllm \
-  --batch_size auto
-```
 ### Accuracy
@@ -143,48 +133,50 @@ lm_eval \
    <td><strong>Benchmark</strong>
    </td>
    <td><strong>Meta-Llama-3.1-8B-Instruct </strong>
-   </td>
-    <td><strong>hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4</strong>
    </td>
    <td><strong>Meta-Llama-3.1-8B-Instruct-quantized.w4a16 (this model)</strong>
    </td>
-   <td><strong>Recovery (this model) </strong>
    </td>
   </tr>
   <tr>
    <td>MMLU (5-shot)
    </td>
-   <td>67.94
-   </td>
-   <td>66.33
    </td>
-   <td>65.38
    </td>
-   <td>96.23%
    </td>
   </tr>
   <tr>
-   <td>ARC Challenge (25-shot)
    </td>
-   <td>60.41
    </td>
-   <td>58.36
    </td>
-   <td>59.30
-   </td>
-   <td>98.16%
    </td>
   </tr>
   <tr>
-   <td>GSM-8K (5-shot, strict-match)
    </td>
-   <td>75.66
    </td>
-   <td>74.07
    </td>
-   <td>75.43
    </td>
-   <td>99.69%
    </td>
   </tr>
   <tr>
@@ -192,11 +184,9 @@ lm_eval \
    </td>
    <td>80.01
    </td>
-   <td>79.18
    </td>
-   <td>79.05
-   </td>
-   <td>98.80%
    </td>
   </tr>
   <tr>
@@ -204,35 +194,109 @@ lm_eval \
    </td>
    <td>77.90
    </td>
-   <td>76.00
-   </td>
-   <td>76.08
    </td>
-   <td>97.66%
    </td>
   </tr>
   <tr>
-   <td>TruthfulQA (0-shot)
    </td>
    <td>54.04
    </td>
-   <td>51.91
    </td>
-   <td>50.19
-   </td>
-   <td>92.8%
    </td>
   </tr>
   <tr>
    <td><strong>Average</strong>
    </td>
-   <td><strong>69.33</strong>
-   </td>
-   <td><strong>67.64</strong>
    </td>
-   <td><strong>67.57</strong>
    </td>
-   <td><strong>97.47%</strong>
    </td>
   </tr>
-</table>

 This model was obtained by quantizing the weights of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) to INT4 data type.
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
+Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights.
+[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with a 10% damping factor and 768 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
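To make the weight mapping described above concrete, here is a minimal NumPy sketch of symmetric per-output-channel quantization to a signed 4-bit range. It uses plain round-to-nearest rather than GPTQ's error-compensated rounding, and the function names, the `[-7, 7]` integer range, and the toy weight matrix are assumptions chosen for illustration, not details taken from AutoGPTQ.

```python
import numpy as np

def quantize_per_channel_int4(weight):
    """Symmetric round-to-nearest quantization with one scale per output channel (row)."""
    qmax = 7  # assumed symmetric signed 4-bit range: [-7, 7]
    # Largest magnitude in each row maps to qmax; all-zero rows get a scale of 1.
    max_abs = np.abs(weight).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(weight / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point with the per-channel scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)  # toy weight: 4 outputs, 16 inputs
q, scale = quantize_per_channel_int4(w)
w_hat = dequantize(q, scale)

# 4 bits per weight instead of 16, before counting the stored scales.
reduction = 1 - 4 / 16
print(q.min(), q.max(), float(np.abs(w - w_hat).max()), reduction)
```

Because the scale is symmetric around zero, no zero-point is stored, and the reconstruction error of each weight is bounded by half of its channel's scale.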
 ## Deployment
 ```python
 from transformers import AutoTokenizer
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+from datasets import load_dataset
 model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
+num_samples = 756
+max_seq_len = 4064
 tokenizer = AutoTokenizer.from_pretrained(model_id)
+def preprocess_fn(example):
+  return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
+ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ds = ds.shuffle().select(range(num_samples))
 ds = ds.map(preprocess_fn)
+examples = [tokenizer(example["text"], padding=False, max_length=max_seq_len, truncation=True) for example in ds]
+quantize_config = BaseQuantizeConfig(
+  bits=4,
+  group_size=128,
+  desc_act=True,
+  model_file_base_name="model",
+  damp_percent=0.1,
 )
+model = AutoGPTQForCausalLM.from_pretrained(
   model_id,
+  quantize_config,
   device_map="auto",
 )
+model.quantize(examples)
 model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
 ```
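The `damp_percent=0.1` setting in the script above corresponds to the 10% damping factor. As a rough sketch of what damping does, assuming it is applied as in the GPTQ reference code (a fraction of the mean Hessian diagonal added to every diagonal entry), the NumPy example below shows how it makes a singular layer Hessian invertible. The helper name `damp_hessian` and the toy activations are hypothetical.

```python
import numpy as np

def damp_hessian(hessian, damp_percent=0.1):
    """Add damp_percent * mean(diag(H)) to each diagonal entry of H."""
    damp = damp_percent * np.mean(np.diag(hessian))
    return hessian + damp * np.eye(hessian.shape[0])

# Hypothetical calibration activations: 8 samples, 4 input features.
# The last feature duplicates the first, so the Hessian is singular.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
x[:, 3] = x[:, 0]
hessian = 2.0 * x.T @ x  # Hessian of the layer-wise squared reconstruction error

damped = damp_hessian(hessian, damp_percent=0.1)

# The damped matrix is positive definite, so a Cholesky factorization succeeds.
chol = np.linalg.cholesky(damped)
print(np.linalg.cond(damped) < np.linalg.cond(hessian))
```

A larger damping value trades some fidelity to the calibration statistics for numerical stability when the Hessian is inverted.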
 ## Evaluation
+The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
+Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 ### Accuracy
    <td><strong>Benchmark</strong>
    </td>
    <td><strong>Meta-Llama-3.1-8B-Instruct </strong>
    </td>
    <td><strong>Meta-Llama-3.1-8B-Instruct-quantized.w4a16 (this model)</strong>
    </td>
+   <td><strong>Recovery</strong>
    </td>
   </tr>
   <tr>
    <td>MMLU (5-shot)
    </td>
+   <td>69.43
    </td>
+   <td>67.68
    </td>
+   <td>97.5%
    </td>
   </tr>
   <tr>
+   <td>MMLU (CoT, 0-shot)
    </td>
+   <td>72.56
    </td>
+   <td>70.36
    </td>
+   <td>97.0%
    </td>
   </tr>
   <tr>
+   <td>ARC Challenge (0-shot)
+   </td>
+   <td>81.57
+   </td>
+   <td>79.95
+   </td>
+   <td>98.0%
    </td>
+  </tr>
+  <tr>
+   <td>GSM-8K (CoT, 8-shot, strict-match)
    </td>
+   <td>82.79
    </td>
+   <td>79.53
    </td>
+   <td>96.1%
    </td>
   </tr>
   <tr>
    <td>Hellaswag (10-shot)
    </td>
    <td>80.01
    </td>
+   <td>78.57
    </td>
+   <td>98.2%
    </td>
   </tr>
   <tr>
    <td>Winogrande (5-shot)
    </td>
    <td>77.90
    </td>
+   <td>76.48
    </td>
+   <td>98.2%
    </td>
   </tr>
   <tr>
+   <td>TruthfulQA (0-shot, mc2)
    </td>
    <td>54.04
    </td>
+   <td>50.46
    </td>
+   <td>93.4%
    </td>
   </tr>
   <tr>
    <td><strong>Average</strong>
    </td>
+   <td><strong>74.04</strong>
    </td>
+   <td><strong>71.86</strong>
    </td>
+   <td><strong>97.1%</strong>
    </td>
   </tr>
+</table>
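The recovery column can be checked directly from the scores in the table above. The short script below, with the score pairs copied in table row order, computes recovery as the quantized score divided by the baseline score. The variable names are illustrative. Note that the reported average recovery matches the ratio of the two average scores rather than the mean of the per-benchmark recoveries.

```python
# (baseline, quantized) score pairs, in the row order of the accuracy table
scores = [
    (69.43, 67.68),
    (72.56, 70.36),
    (81.57, 79.95),
    (82.79, 79.53),
    (80.01, 78.57),
    (77.90, 76.48),
    (54.04, 50.46),
]

# Per-benchmark recovery in percent
recoveries = [round(100 * quantized / baseline, 1) for baseline, quantized in scores]

# Average scores and the recovery computed from those averages
avg_baseline = sum(baseline for baseline, _ in scores) / len(scores)
avg_quantized = sum(quantized for _, quantized in scores) / len(scores)
avg_recovery = 100 * avg_quantized / avg_baseline

print(recoveries)
print(round(avg_baseline, 2), round(avg_quantized, 2), round(avg_recovery, 1))
```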
+### Reproduction
+The results were obtained using the following commands:
+#### MMLU
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+  --tasks mmlu_llama_3.1_instruct \
+  --fewshot_as_multiturn \
+  --apply_chat_template \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+#### MMLU-CoT
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+  --tasks mmlu_cot_0shot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+#### ARC-Challenge
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+  --tasks arc_challenge_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+#### GSM-8K
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+  --tasks gsm8k_cot_llama_3.1_instruct \
+  --fewshot_as_multiturn \
+  --apply_chat_template \
+  --num_fewshot 8 \
+  --batch_size auto
+```
+#### Hellaswag
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks hellaswag \
+  --num_fewshot 10 \
+  --batch_size auto
+```
+#### Winogrande
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks winogrande \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+#### TruthfulQA
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks truthfulqa \
+  --num_fewshot 0 \
+  --batch_size auto
+```