It is intended only for experimental purposes.
Users are responsible for any consequences arising from the use of this model.

# VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

## TL;DR

**Vector Post-Training Quantization (VPTQ)** is a novel Post-Training Quantization method that leverages **Vector Quantization** to achieve high accuracy on LLMs at extremely low bit-widths (<2 bits).
VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.

* Better accuracy at 1-2 bits
* Lightweight quantization algorithm: it takes only ~17 hours to quantize the 405B Llama-3.1 model
* Agile quantized inference: low decode overhead, high throughput, and low time to first token (TTFT)

**Example: run Llama 3.1 70B on an RTX 4090 (24 GB @ ~2 bits) in real time**

![Llama3 1-70b-prompt](https://github.com/user-attachments/assets/d8729aca-4e1d-4fe1-ac71-c14da4bdd97f)

## [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.

Read the [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf) and the [**arXiv paper**](https://arxiv.org/pdf/2409.17066) for details.
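
To make the lookup-table idea concrete, here is a minimal NumPy sketch of how vector quantization reconstructs a weight matrix from indices and a codebook. It illustrates the general technique only; the array names and sizes are illustrative assumptions, not the repository's actual kernel.

```python
import numpy as np

# Hypothetical sizes: 65536 centroids, vectors of length 8.
k, v = 65536, 8
codebook = np.random.randn(k, v).astype(np.float32)   # lookup table of centroids

# A quantized 4096x4096 weight matrix stores one 16-bit index per 8 weights.
rows, cols = 4096, 4096
indices = np.random.randint(0, k, size=(rows * cols) // v, dtype=np.uint16)

# Dequantization is a pure table lookup: gather centroids, then reshape.
weights = codebook[indices].reshape(rows, cols)

# Storage: 16 bits per index / 8 weights per vector = 2 bits per weight.
print(indices.nbytes * 8 / (rows * cols), "bits per weight (excluding codebook)")
```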

### Early Results from Tech Report

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; with well-chosen parameters, VPTQ can achieve even better outcomes, especially in terms of model accuracy and inference speed.

<img src="assets/vptq.png" width="500">

| Model | Bitwidth | W2 ↓ | C4 ↓ | AvgQA ↑ | tok/s ↑ | Mem (GB) | Cost/h ↓ |
| ----------- | -------- | ---- | ---- | ------ | ------ | ------- | ------- |
| LLaMA-2 7B  | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28  | 2   |
|             | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48  | 3.1 |
| LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03  | 3.2 |
|             | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31  | 3.6 |
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7  | 19.54 | 19  |
|             | 2.11 | 3.92 | 5.71 | 68.7 | 9.7  | 20.01 | 19  |

(W2 and C4 denote perplexity on WikiText-2 and C4, lower is better; AvgQA is the average accuracy across QA tasks; Cost/h is the quantization cost in hours.)

---

## Installation

### Dependencies

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
- accelerate >= 0.33.0
- latest `datasets`

### Installation

> Preparation step that might be needed: set up the CUDA path.

```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your CUDA installation
```

*Compiling the CUDA kernels will take several minutes.*

```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```

## Evaluation

### Models from the Open-Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community [VPTQ-community](https://huggingface.co/VPTQ-community) provides models based on the technical report and quantization algorithm.

⚠️ This repository cannot guarantee the performance of those models.

**Quick Estimation of Model Bitwidth (Excluding Codebook Overhead)**:

- **Model Naming Convention**: The model's name encodes the **vector length** $v$, **codebook (lookup table) size**, and **residual codebook size**. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is the quantized counterpart of "Meta-Llama-3.1-70B-Instruct", where:
  - **Vector Length**: 8
  - **Number of Centroids**: 65536 (2^16)
  - **Number of Residual Centroids**: 256 (2^8)
- **Equivalent Bitwidth Calculation**:
  - **Index**: log2(65536) = 16 bits per 8-element vector, i.e. 16 / 8 = 2 bits per weight
  - **Residual Index**: log2(256) = 8 bits per vector, i.e. 8 / 8 = 1 bit per weight
  - **Total Bitwidth**: 2 + 1 = 3 bits per weight
- **Model Size Estimation**: 70B parameters * 3 bits / 8 bits per byte = 26.25 GB

- **Note**: This estimate does not include the size of the codebook (lookup table), other parameter overheads, or the padding overhead for storing indices. For the detailed calculation method, please refer to **Tech Report Appendix C.2**. A small script automating this estimate is sketched below.
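
As a convenience, here is a minimal Python sketch of the estimate above. The helper name and the name-parsing pattern are illustrative assumptions based on the naming convention described here, not part of the VPTQ package.

```python
import math
import re

def estimated_bits_per_weight(model_name: str) -> float:
    """Hypothetical helper: parse v, k, and residual k from a community
    model name and return equivalent bits per weight (codebook excluded)."""
    match = re.search(r"-v(\d+)-k(\d+)-(\d+)-", model_name)
    v, k, k_res = (int(g) for g in match.groups())
    bits = math.log2(k) / v                 # main index bits per weight
    if k_res > 0:
        bits += math.log2(k_res) / v        # residual index bits per weight
    return bits

name = "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft"
bpw = estimated_bits_per_weight(name)       # (16 + 8) / 8 = 3.0 bits
print(f"{bpw:.2f} bits/weight -> ~{70e9 * bpw / 8 / 1e9:.2f} GB for 70B params")
```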

| Model Series | Collections | (Estimated) Bits per weight |
|:----------------------:|:-----------:| ----------------------------|
| Llama 3.1 8B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-8b-instruct-without-finetune-66f2b70b1d002ceedef02d2e) | [4 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-65536-woft) [3.5 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-4096-woft) [3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft) [2.3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft) |
| Llama 3.1 70B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-70b-instruct-without-finetune-66f2bf454d3dd78dfee2ff11) | [4 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft) [2.25 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-4-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft) [1.93 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-32768-woft) [1.875 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft) [1.75 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k16384-0-woft) |
| Llama 3.1 405B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-405b-instruct-without-finetune-66f4413f9ba55e1a9e52cfb0) | [1.875 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k32768-32768-woft) [1.625 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-1024-woft) [1.5 bits (1)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v8-k4096-0-woft) [1.5 bits (2)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-256-woft) [1.43 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-128-woft) [1.375 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-64-woft)|
| Qwen 2.5 7B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-7b-instruct-without-finetune-66f3e9866d3167cc05ce954a) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k256-256-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft) [2 bits (3)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v16-k65536-65536-woft) |
| Qwen 2.5 14B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-14b-instruct-without-finetune-66f827f83c7ffa7931b8376c) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k256-256-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-0-woft) [2 bits (3)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v16-k65536-65536-woft) |
| Qwen 2.5 72B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-72b-instruct-without-finetune-66f3bf1b3757dfa1ecb481c0) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-256-woft) [2.38 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k1024-512-woft) [2.25 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k512-512-woft) [2.25 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-4-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-0-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft) [1.94 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-32768-woft) |

### Language Generation Example

To generate text using a pre-trained quantized model, run the following command. The model [*VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft*](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft) (~2 bits) is provided by the open-source community; the repository cannot guarantee its performance.

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
```

![Llama3 1-70b-prompt](https://github.com/user-attachments/assets/d8729aca-4e1d-4fe1-ac71-c14da4bdd97f)

### Terminal Chatbot Example

Launch a chatbot (note that you must use a chat model for this to work):

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat
```

![Llama3 1-70b-chat](https://github.com/user-attachments/assets/af051234-d1df-4e25-95e7-17a5ce98f3ea)
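
For reference, a chat turn can also be sketched through the Python API shown in the next section, assuming the quantized model behaves like a standard `transformers` causal LM with a chat template; this is an illustrative sketch, not a documented VPTQ interface.

```python
import transformers

import vptq

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Chat models expect their chat template rather than raw prompt text.
messages = [{"role": "user", "content": "Explain: Do Not Go Gentle into That Good Night"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

out = model.generate(inputs, max_new_tokens=100)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```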

### Python API Example

Using the Python API:

```python
import transformers

import vptq

# Load the tokenizer and the ~2-bit quantized model from the Hugging Face Hub.
tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft", device_map='auto')

# Tokenize the prompt and generate up to 100 new tokens.
inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)  # pad_token_id set explicitly
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### Gradio Web App Example

An environment variable is available to control whether a public share link is created:
`export SHARE_LINK=1`

```bash
python -m vptq.app
```

---

## Roadmap

- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).
- [ ] Improve the implementation of the inference kernel.
- [ ] **TBC**

## Project main members

* Yifei Liu (@lyf-00)
* Jicheng Wen (@wejoncy)
* Yang Wang (@YangWang92)

## Acknowledgement

* We thank **James Hensman** for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

EMNLP 2024 Main

```bibtex
@inproceedings{vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and Jicheng Wen and Yang Wang and Shengyu Ye and Li Lyna Zhang and Ting Cao and Cheng Li and Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```

---

## Limitations of VPTQ

* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before it is used elsewhere.
* ⚠️ The repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm by themselves, but the repository cannot guarantee the performance of those models.
* ⚠️ We have not tested all potential applications and domains, and VPTQ's accuracy and effectiveness cannot be guaranteed across other tasks or scenarios.
* ⚠️ Our tests are all based on English texts; other languages are not included in the current testing.

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.