OpenSourceRonin committed
Commit
8453bfa
1 Parent(s): ac97099

Update README.md

Files changed (1)
  1. README.md +202 -0
README.md CHANGED
@@ -15,3 +15,205 @@ It is intended only for experimental purposes.
 Users are responsible for any consequences arising from the use of this model.

# VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

## TL;DR

**Vector Post-Training Quantization (VPTQ)** is a novel post-training quantization method that leverages **Vector Quantization** to achieve high accuracy on LLMs at extremely low bit-widths (<2 bits).
VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.

* Better accuracy at 1-2 bits
* Lightweight quantization algorithm: only ~17 hours to quantize the 405B Llama-3.1 model
* Agile quantization inference: low decode overhead, high throughput, and low TTFT (time to first token)

**Example: Run Llama 3.1 70B on an RTX 4090 (24 GB) at ~2 bits in real time**
![Llama3 1-70b-prompt](https://github.com/user-attachments/assets/d8729aca-4e1d-4fe1-ac71-c14da4bdd97f)


## [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.

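For intuition, the toy sketch below illustrates the lookup-table idea: weights are grouped into short vectors, each stored only as the index of a centroid in a learned codebook. This is an illustration only, not the VPTQ implementation; the shapes and names are made up for the example.

```python
import math
import torch

# Toy vector-quantization lookup (illustration only, not the VPTQ kernels).
# A weight matrix is split into length-v vectors; each vector is stored as
# the index of its nearest centroid in a learned codebook (lookup table).
v, k = 8, 65536                               # vector length, number of centroids
codebook = torch.randn(k, v)                  # the lookup table of centroids
indices = torch.randint(0, k, (4096 * 4096 // v,))

# Dequantization is just a table lookup plus a reshape; the index storage
# cost is log2(k) / v bits per weight (2 bits here), excluding the codebook.
weight = codebook[indices].reshape(4096, 4096)
print(weight.shape, math.log2(k) / v, "bits per weight")
```
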
Read the tech report at [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf) and the [**arXiv Paper**](https://arxiv.org/pdf/2409.17066).

### Early Results from Tech Report
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes with well-chosen parameters, especially in terms of model accuracy and inference speed.

<img src="assets/vptq.png" width="500">

| Model | Bit-width | W2 PPL ↓ | C4 PPL ↓ | AvgQA ↑ | tok/s ↑ | Mem (GB) | Cost/h ↓ |
| ----------- | --------- | -------- | -------- | ------- | ------- | -------- | -------- |
| LLaMA-2 7B  | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28  | 2   |
|             | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48  | 3.1 |
| LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03  | 3.2 |
|             | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31  | 3.6 |
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7  | 19.54 | 19  |
|             | 2.11 | 3.92 | 5.71 | 68.7 | 9.7  | 20.01 | 19  |

---

## Installation

### Dependencies

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
- accelerate >= 0.33.0
- datasets (latest)

### Installation

> A preparation step that may be needed: set up the CUDA path.
```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your environment
```

*Installation takes several minutes to compile the CUDA kernels.*
```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```

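Optionally, you can sanity-check the installation before moving on. This minimal check only verifies that the package imports cleanly and that a CUDA device is visible:

```python
# Optional sanity check after installation: these imports should succeed
# without errors if the package and its CUDA kernels built correctly.
import torch
import vptq

print("CUDA available:", torch.cuda.is_available())
```
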
## Evaluation

### Models from the Open-Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community [VPTQ-community](https://huggingface.co/VPTQ-community) provides models based on the technical report and the quantization algorithm.

⚠️ This repository cannot guarantee the performance of those models.

**Quick Estimation of Model Bit-width (Excluding Codebook Overhead)**:
- **Model Naming Convention**: The model's name encodes the **vector length** $v$, **codebook (lookup table) size**, and **residual codebook size**. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is quantized from "Meta-Llama-3.1-70B-Instruct", where:
  - **Vector Length**: 8
  - **Number of Centroids**: 65536 (2^16)
  - **Number of Residual Centroids**: 256 (2^8)
- **Equivalent Bit-width Calculation**:
  - **Index**: log2(65536) = 16 bits per 8-element vector, i.e., 16 / 8 = 2 bits per weight
  - **Residual Index**: log2(256) = 8 bits per vector, i.e., 8 / 8 = 1 bit per weight
  - **Total Bit-width**: 2 + 1 = 3 bits per weight
- **Model Size Estimation**: 70B parameters * 3 bits / 8 bits per byte = 26.25 GB

- **Note**: This estimate does not include the size of the codebook (lookup table), other parameter overheads, or the padding overhead for storing indices. For the detailed calculation method, please refer to **Tech Report Appendix C.2**. A small helper sketch below illustrates the same arithmetic.

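The sketch below is a small, hypothetical helper (not part of the VPTQ package; the function name and inputs are made up for this example) that reproduces the estimate above from the values encoded in a model name:

```python
import math

def estimate_bits_per_weight(vector_length: int, num_centroids: int,
                             num_residual_centroids: int = 0) -> float:
    """Equivalent bits per weight, excluding codebook and padding overhead."""
    bits = math.log2(num_centroids) / vector_length
    if num_residual_centroids > 0:
        bits += math.log2(num_residual_centroids) / vector_length
    return bits

# "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft": v=8, k=65536, residual k=256
bits = estimate_bits_per_weight(8, 65536, 256)   # 2 + 1 = 3 bits per weight
size_gb = 70e9 * bits / 8 / 1e9                  # ~26.25 GB before codebook overhead
print(f"{bits:.2f} bits/weight, ~{size_gb:.2f} GB")
```
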
| Model Series | Collections | (Estimated) Bit per weight |
|:----------------------:|:-----------:| ----------------------------|
| Llama 3.1 8B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-8b-instruct-without-finetune-66f2b70b1d002ceedef02d2e) | [4 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-65536-woft) [3.5 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-4096-woft) [3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft) [2.3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft) |
| Llama 3.1 70B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-70b-instruct-without-finetune-66f2bf454d3dd78dfee2ff11) | [4 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft) [2.25 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-4-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft) [1.93 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-32768-woft) [1.875 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft) [1.75 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k16384-0-woft) |
| Llama 3.1 405B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-405b-instruct-without-finetune-66f4413f9ba55e1a9e52cfb0) | [1.875 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k32768-32768-woft) [1.625 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-1024-woft) [1.5 bits (1)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v8-k4096-0-woft) [1.5 bits (2)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-256-woft) [1.43 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-128-woft) [1.375 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-64-woft) |
| Qwen 2.5 7B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-7b-instruct-without-finetune-66f3e9866d3167cc05ce954a) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k256-256-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft) [2 bits (3)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v16-k65536-65536-woft) |
| Qwen 2.5 14B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-14b-instruct-without-finetune-66f827f83c7ffa7931b8376c) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k256-256-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-0-woft) [2 bits (3)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v16-k65536-65536-woft) |
| Qwen 2.5 72B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-72b-instruct-without-finetune-66f3bf1b3757dfa1ecb481c0) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-256-woft) [2.38 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k1024-512-woft) [2.25 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k512-512-woft) [2.25 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-4-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-0-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft) [1.94 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-32768-woft) |

### Language Generation Example
To generate text with a quantized model, you can use the following command.

The model [*VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft*](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft) (~2 bits) is provided by the open-source community. The repository cannot guarantee the performance of those models.

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
```
![Llama3 1-70b-prompt](https://github.com/user-attachments/assets/d8729aca-4e1d-4fe1-ac71-c14da4bdd97f)

### Terminal Chatbot Example
To launch a chatbot, use the command below. Note that you must use an instruction-tuned (chat) model for this to work.

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat
```
![Llama3 1-70b-chat](https://github.com/user-attachments/assets/af051234-d1df-4e25-95e7-17a5ce98f3ea)

### Python API Example
Using the Python API:

```python
import vptq
import transformers

# Load the tokenizer and the VPTQ-quantized model from the Hugging Face Hub
tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft", device_map='auto')

# Tokenize the prompt, generate up to 100 new tokens, and decode the output
inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

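Since the example checkpoint is an instruction-tuned model, you may get better results by formatting the prompt with the tokenizer's chat template. The sketch below continues from the snippet above (it reuses `tokenizer` and `m`) and assumes the tokenizer ships a chat template, as the Llama 3.1 Instruct tokenizers do:

```python
# Continues from the snippet above: reuse `tokenizer` and `m`.
messages = [{"role": "user", "content": "Explain: Do Not Go Gentle into That Good Night"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
out = m.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
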
### Gradio Web App Example
An environment variable, `SHARE_LINK`, controls whether a public share link is created:
`export SHARE_LINK=1`
```bash
python -m vptq.app
```

---

## Road Map
- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).
- [ ] Improve the implementation of the inference kernel.
- [ ] **TBC**

## Project Main Members
* Yifei Liu (@lyf-00)
* Jicheng Wen (@wejoncy)
* Yang Wang (@YangWang92)

## Acknowledgement

* We thank **James Hensman** for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

EMNLP 2024 Main
```bibtex
@inproceedings{vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and Jicheng Wen and Yang Wang and Shengyu Ye and Li Lyna Zhang and Ting Cao and Cheng Li and Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
}
```

---

## Limitations of VPTQ
* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before using it in production.
* ⚠️ This repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm, but this repository cannot guarantee the performance of those models.
* ⚠️ VPTQ has not been tested on all potential applications and domains, and we cannot guarantee its accuracy and effectiveness across other tasks or scenarios.
* ⚠️ Our tests are all based on English texts; other languages are not included in the current testing.

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.