alexmarques committed
Commit e0f0220
1 Parent(s): 0c8a3b8

Update README.md

Files changed (1)
  1. README.md +20 -9
README.md CHANGED
@@ -47,7 +47,8 @@ Only weights and activations of the linear operators within transformers blocks
  Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
  Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
  Linear scaling factors are computed by minimizing the mean squared error (MSE).
- The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
+ The [SmoothQuant](https://arxiv.org/abs/2211.10438) algorithm is used to alleviate outliers in the activations, while the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization.
+ Both algorithms are implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
  GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).

  ## Deployment
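Note (not part of the commit): the sketch below illustrates, in PyTorch, what the scheme described in this hunk looks like: symmetric static per-output-channel INT8 scales for weights, symmetric dynamic per-token INT8 scales for activations, and SmoothQuant-style rescaling. Function names are hypothetical, and it uses simple absmax scales for brevity, whereas the model card selects scales by minimizing MSE.

```python
import torch

def quantize_weight_per_channel(w: torch.Tensor):
    """Symmetric static INT8 quantization: one fixed scale per output channel.
    Absmax scales shown for brevity; the model card minimizes MSE instead."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0   # [out, 1]
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale   # dequantize with q.float() * scale

def quantize_activation_per_token(x: torch.Tensor):
    """Symmetric dynamic INT8 quantization: one scale per token, computed at runtime."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0  # [tokens, 1]
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def smoothquant_rescale(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.7):
    """SmoothQuant idea: migrate activation outliers into the weights with a
    per-input-channel factor s, so (x / s) @ (w * s).T equals x @ w.T in
    floating point but is easier to quantize. alpha plays the role of
    smoothing_strength in the recipe below."""
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)   # per input channel, [in]
    w_max = w.abs().amax(dim=0).clamp(min=1e-5)     # per input channel, [in]
    s = act_max.pow(alpha) / w_max.pow(1.0 - alpha)
    return x / s, w * s
```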
@@ -108,14 +109,24 @@ ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
  ds = ds.shuffle().select(range(num_samples))
  ds = ds.map(preprocess_fn)

- recipe = GPTQModifier(
-     use_sequential=True,
-     targets="Linear",
-     scheme="W8A8",
-     ignore=["lm_head"],
-     dampening_frac=0.01,
-     observer="mse",
- )
+ recipe = [
+     SmoothQuantModifier(
+         smoothing_strength=0.7,
+         mappings=[
+             [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
+             [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
+             [["re:.*down_proj"], "re:.*up_proj"],
+         ],
+     ),
+     GPTQModifier(
+         sequential=True,
+         targets="Linear",
+         scheme="W8A8",
+         ignore=["lm_head"],
+         dampening_frac=0.01,
+         observer="mse",
+     )
+ ]

  model = SparseAutoModelForCausalLM.from_pretrained(
      model_id,
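Note (not part of the commit): the README is truncated here. As a hedged sketch of how a recipe like the one above is usually applied, assuming the standard llm-compressor one-shot flow and with illustrative values for anything not shown in the diff:

```python
# Sketch only: assumes llm-compressor's standard one-shot calibration flow;
# the exact arguments used for this model are not shown in the hunk above.
from llmcompressor.transformers import oneshot  # import path may differ by version

oneshot(
    model=model,                          # SparseAutoModelForCausalLM loaded above
    dataset=ds,                           # the preprocessed calibration sequences
    recipe=recipe,                        # SmoothQuantModifier + GPTQModifier list
    num_calibration_samples=num_samples,  # 512 sequences, per the model card
    max_seq_length=2048,                  # illustrative; match the preprocessing
)

# Illustrative output path; save in compressed-tensors format.
model.save_pretrained("model-w8a8", save_compressed=True)
```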
 