abhinavnmagic committed on
Commit
0dbc179
1 Parent(s): 34fbd4c

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -36,7 +36,7 @@ This model was obtained by quantizing the weights of [Meta-Llama-3.1-8B-Instruct
  This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 
  Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating-point representations of the quantized weights.
- The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. GPTQ used a 1% damping factor and 512 sequences of 8,192 random tokens.
+ The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library. GPTQ used a 1% damping factor and 512 sequences of 8,192 random tokens.
 
 
  ## Deployment
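As an aside on the scheme this hunk describes, the sketch below illustrates symmetric per-channel INT4 weight quantization: one scale per output dimension linearly maps between the floating-point and INT4 representations. It is purely illustrative and is not part of the README or the recipe used to build this model; GPTQ additionally adjusts the remaining unquantized weights to compensate for rounding error, which this sketch omits.

```python
import torch

def quantize_per_channel_int4(weight: torch.Tensor):
    """Symmetric per-output-channel INT4 quantization (illustrative only)."""
    # One scale per output dimension (row of the linear weight), chosen so the
    # largest-magnitude value in that row maps onto the signed 4-bit range.
    max_abs = weight.abs().amax(dim=1, keepdim=True)      # [out_features, 1]
    scale = max_abs.clamp(min=1e-8) / 7.0                 # symmetric => zero-point is 0
    q = torch.clamp(torch.round(weight / scale), min=-8, max=7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # The linear per-channel mapping back to floating point: W_hat = q * scale.
    return q.to(scale.dtype) * scale

# Quick check on a random linear weight.
w = torch.randn(128, 256)
q, s = quantize_per_channel_int4(w)
print("mean abs reconstruction error:", (w - dequantize(q, s)).abs().mean().item())
```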
@@ -75,7 +75,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 
  ## Creation
 
- This model was created by using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as presented in the code snippet below.
+ This model was created by applying the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library as presented in the code snippet below.
+ Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor), which supports several quantization schemes and models not supported by AutoGPTQ.
 
  ```python
  from transformers import AutoTokenizer
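The Creation snippet itself is truncated by the hunk boundary above; only its first import line is visible. As a hedged illustration, the sketch below shows what an AutoGPTQ script consistent with the stated recipe (INT4, symmetric, channelwise scales, 1% damping, 512 sequences of 8,192 random tokens) might look like. The model id, output directory, and calibration-set construction are assumptions, not the authors' actual code.

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Assumed identifiers -- the actual script is not visible in this diff.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
save_dir = "Meta-Llama-3.1-8B-Instruct-W4A16"
num_samples, max_seq_len = 512, 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration set: 512 sequences of 8,192 random token ids, per the description above.
examples = [
    {
        "input_ids": torch.randint(0, tokenizer.vocab_size, (max_seq_len,)),
        "attention_mask": torch.ones(max_seq_len, dtype=torch.long),
    }
    for _ in range(num_samples)
]

quantize_config = BaseQuantizeConfig(
    bits=4,             # INT4 weights
    group_size=-1,      # per-channel (channelwise) scales
    sym=True,           # symmetric quantization
    damp_percent=0.01,  # 1% damping factor
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)

model.save_quantized(save_dir)
tokenizer.save_pretrained(save_dir)
```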