alexmarques committed
Commit e0f0220
1 Parent(s): 0c8a3b8

Update README.md

Files changed (1)
  1. README.md +20 -9
README.md CHANGED
@@ -47,7 +47,8 @@ Only weights and activations of the linear operators within transformers blocks
  Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
  Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
  Linear scaling factors are computed by minimizing the mean squared error (MSE).
- The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
+ The [SmoothQuant](https://arxiv.org/abs/2211.10438) algorithm is used to alleviate outliers in the activations, while the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization.
+ Both algorithms are implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
  GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).

  ## Deployment
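Note (not part of the commit): the sketch below illustrates, in PyTorch, what the scheme described in this hunk looks like: symmetric static per-output-channel INT8 scales for weights, symmetric dynamic per-token INT8 scales for activations, and SmoothQuant-style rescaling. Function names are hypothetical, and it uses simple absmax scales for brevity, whereas the model card selects scales by minimizing MSE.

```python
import torch

def quantize_weight_per_channel(w: torch.Tensor):
    """Symmetric static INT8 quantization: one fixed scale per output channel.
    Absmax scales shown for brevity; the model card minimizes MSE instead."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0   # [out, 1]
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale   # dequantize with q.float() * scale

def quantize_activation_per_token(x: torch.Tensor):
    """Symmetric dynamic INT8 quantization: one scale per token, computed at runtime."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0  # [tokens, 1]
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def smoothquant_rescale(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.7):
    """SmoothQuant idea: migrate activation outliers into the weights with a
    per-input-channel factor s, so (x / s) @ (w * s).T equals x @ w.T in
    floating point but is easier to quantize. alpha plays the role of
    smoothing_strength in the recipe below."""
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)   # per input channel, [in]
    w_max = w.abs().amax(dim=0).clamp(min=1e-5)     # per input channel, [in]
    s = act_max.pow(alpha) / w_max.pow(1.0 - alpha)
    return x / s, w * s
```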
@@ -108,14 +109,24 @@ ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
  ds = ds.shuffle().select(range(num_samples))
  ds = ds.map(preprocess_fn)

- recipe = GPTQModifier(
-     use_sequential=True,
-     targets="Linear",
-     scheme="W8A8",
-     ignore=["lm_head"],
-     dampening_frac=0.01,
-     observer="mse",
- )
+ recipe = [
+     SmoothQuantModifier(
+         smoothing_strength=0.7,
+         mappings=[
+             [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
+             [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
+             [["re:.*down_proj"], "re:.*up_proj"],
+         ],
+     ),
+     GPTQModifier(
+         sequential=True,
+         targets="Linear",
+         scheme="W8A8",
+         ignore=["lm_head"],
+         dampening_frac=0.01,
+         observer="mse",
+     )
+ ]

  model = SparseAutoModelForCausalLM.from_pretrained(
      model_id,
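Note (not part of the commit): the README is truncated here. As a hedged sketch of how a recipe like the one above is usually applied, assuming the standard llm-compressor one-shot flow and with illustrative values for anything not shown in the diff:

```python
# Sketch only: assumes llm-compressor's standard one-shot calibration flow;
# the exact arguments used for this model are not shown in the hunk above.
from llmcompressor.transformers import oneshot  # import path may differ by version

oneshot(
    model=model,                          # SparseAutoModelForCausalLM loaded above
    dataset=ds,                           # the preprocessed calibration sequences
    recipe=recipe,                        # SmoothQuantModifier + GPTQModifier list
    num_calibration_samples=num_samples,  # 512 sequences, per the model card
    max_seq_length=2048,                  # illustrative; match the preprocessing
)

# Illustrative output path; save in compressed-tensors format.
model.save_pretrained("model-w8a8", save_compressed=True)
```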
 