- VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v8-k65536-65536-woft
- VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v8-k65536-256-woft
- VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-65536-woft
- VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v8-k65536-0-woft
VPTQ-community
Disclaimer:
VPTQ-community is an open-source community that reproduces models from the paper VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models (GitHub).
It is intended only for experimental purposes.
Users are responsible for any consequences arising from the use of these models.
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
TL;DR
Vector Post-Training Quantization (VPTQ) is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress a 70B, or even a 405B, model to 1-2 bits without retraining while maintaining high accuracy.
- Better Accuracy on 1-2 bits
- Lightweight Quantization Algorithm: takes only ~17 hours to quantize the 405B Llama-3.1
- Agile Quantization Inference: low decode overhead, high throughput, and low TTFT (time to first token)
Example: Run Llama 3.1 70B on an RTX 4090 (24 GB, at ~2 bits) in real time
Tech Report
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Because LLM weights are redundant, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
Read the tech report at Tech Report and arXiv Paper.
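To make the "vectors into indices using lookup tables" idea concrete, here is a minimal toy sketch of plain vector quantization in pure Python. It is not the VPTQ algorithm from the tech report: the codebook is hand-picked rather than learned, the nearest-centroid search is brute force, and the names `nearest_centroid`, `vq_encode`, and `vq_decode` are hypothetical helpers introduced only for this illustration.

```python
import math

def nearest_centroid(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, centroid in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def vq_encode(weights, codebook, v):
    """Split a flat weight list into length-v vectors and map each one to an index."""
    if len(weights) % v != 0:
        raise ValueError("weight count must be a multiple of the vector length")
    return [nearest_centroid(weights[i:i + v], codebook)
            for i in range(0, len(weights), v)]

def vq_decode(indices, codebook):
    """Rebuild approximate weights by looking each index up in the codebook."""
    return [x for idx in indices for x in codebook[idx]]

# Toy setup (assumed values): vector length 2 and a 4-entry lookup table.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [0.9, 1.2, -0.8, 1.1, 0.1, -0.2, 1.1, -0.7]

indices = vq_encode(weights, codebook, v=2)
restored = vq_decode(indices, codebook)

# Each index needs log2(4) = 2 bits and covers 2 weights.
bits_per_weight = math.log2(len(codebook)) / 2
print(indices, restored, bits_per_weight)
```

Only the indices and the codebook need to be stored, which is where the compression comes from: the per-weight cost is the index width divided by the vector length.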
Models from Open Source Community
⚠️ The repository provides only the model quantization algorithm.
⚠️ The open-source community VPTQ-community provides models based on the technical report and quantization algorithm.
⚠️ The repository cannot guarantee the performance of those models.
Quick Estimation of Model Bitwidth (Excluding Codebook Overhead):
Model Naming Convention: The model's name includes the vector length $v$, codebook (lookup table) size, and residual codebook size. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is a quantized version of "Meta-Llama-3.1-70B-Instruct", where:
- Vector Length: 8
- Number of Centroids: 65536 (2^16)
- Number of Residual Centroids: 256 (2^8)
Equivalent Bitwidth Calculation:
- Index: log2(65536) / 8 = 16 / 8 = 2 bits
- Residual Index: log2(256) / 8 = 8 / 8 = 1 bit
- Total Bitwidth: 2 + 1 = 3 bits
Model Size Estimation: 70B * 3 bits / 8 bits per Byte = 26.25 GB
Note: This estimate does not include the size of the codebook (lookup table), other parameter overheads, and the padding overhead for storing indices. For the detailed calculation method, please refer to Tech Report Appendix C.2.
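The quick estimate above can be sketched as a small script. The helper names (`parse_vptq_name`, `index_bits`, `estimate`) are hypothetical and not part of any VPTQ package, the regular expression assumes names follow the `-v<length>-k<centroids>-<residual centroids>` pattern described above, and treating a residual size of 0 as "no residual codebook" is an assumption. As in the text, codebook, padding, and other overheads are ignored.

```python
import math
import re

def parse_vptq_name(name):
    """Extract (vector length, centroids, residual centroids) from a model name."""
    match = re.search(r"-v(\d+)-k(\d+)-(\d+)(?:-|$)", name)
    if match is None:
        raise ValueError(f"name does not follow the assumed pattern: {name}")
    v, k, k_res = (int(g) for g in match.groups())
    return v, k, k_res

def index_bits(num_centroids, vector_length):
    """Bits per weight for one codebook: log2(centroids) / vector length."""
    if num_centroids <= 1:
        return 0.0  # assumption: a size of 0 means the codebook is not used
    return math.log2(num_centroids) / vector_length

def estimate(name, num_params):
    """Return (equivalent bitwidth, size in GB), excluding codebook overhead."""
    v, k, k_res = parse_vptq_name(name)
    bits = index_bits(k, v) + index_bits(k_res, v)
    size_gb = num_params * bits / 8 / 1e9  # 8 bits per byte, 1e9 bytes per GB
    return bits, size_gb

bits, size_gb = estimate("Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft", 70e9)
print(f"{bits:.2f} bits per weight, about {size_gb:.2f} GB")
```

For the example name this reproduces the 3-bit, 26.25 GB figure from the text; actual files on disk will be somewhat larger because of the excluded overheads.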
A Space Demo
A live chatbot built on VPTQ is available as the VPTQ-LLM-2bit demo.
Models
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v8-k65536-65536-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v8-k65536-256-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-65536-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k32768-32768-woft