|
# GPTQ-for-LLaMA |
|
<img src="https://user-images.githubusercontent.com/64115820/235287009-2d07bba8-9b85-4973-9e06-2a3c28777f06.png" width="50%" height="50%">
|
|
|
4-bit quantization of [LLaMA](https://arxiv.org/abs/2302.13971) using [GPTQ](https://arxiv.org/abs/2210.17323).
|
|
|
GPTQ is a state-of-the-art one-shot weight quantization method.
|
|
|
**This Triton-based branch can be used universally, but it is not the [fastest](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/old-cuda) and it only supports Linux.**
|
|
|
**Triton only supports Linux, so if you are a Windows user, please use [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install).** |
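For reference, on recent versions of Windows a default Ubuntu WSL2 distribution can usually be installed from an administrator PowerShell or Command Prompt with the one-liner below (see the WSL2 link above for details):

```
wsl --install
```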
|
|
|
## News or Update |
|
**AutoGPTQ-triton, a packaged version of GPTQ with triton, has been integrated into [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).** |
|
## Result |
|
<details> |
|
<summary>LLaMA-7B(click me)</summary> |
|
|
|
| [LLaMA-7B](https://arxiv.org/abs/2302.13971) | Bits | Group size | Memory (MiB) | Wikitext2 PPL | Checkpoint size (GB) |
|
| -------------------------------------------------- | ---- | ---------- | ----------- | --------- | ------------------- | |
|
| FP16 | 16 | - | 13940 | 5.68 | 12.5 | |
|
| RTN | 4 | - | - | 6.29 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | 4740 | 6.09 | 3.5 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | 4891 | 5.85 | 3.6 | |
|
| RTN | 3 | - | - | 25.54 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | 3852 | 8.07 | 2.7 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | 4116 | 6.61 | 3.0 | |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>LLaMA-13B</summary> |
|
|
|
| [LLaMA-13B](https://arxiv.org/abs/2302.13971) | Bits | Group size | Memory (MiB) | Wikitext2 PPL | Checkpoint size (GB) |
|
| -------------------------------------------------- | ---- | ---------- | ----------- | --------- | ------------------- | |
|
| FP16 | 16 | - | OOM | 5.09 | 24.2 | |
|
| RTN | 4 | - | - | 5.53 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | 8410 | 5.36 | 6.5 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | 8747 | 5.20 | 6.7 | |
|
| RTN | 3 | - | - | 11.40 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | 6870 | 6.63 | 5.1 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | 7277 | 5.62 | 5.4 | |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>LLaMA-33B</summary> |
|
|
|
| [LLaMA-33B](https://arxiv.org/abs/2302.13971) | Bits | Group size | Memory (MiB) | Wikitext2 PPL | Checkpoint size (GB) |
|
| -------------------------------------------------- | ---- | ---------- | ----------- | --------- | ------------------- | |
|
| FP16 | 16 | - | OOM | 4.10 | 60.5 | |
|
| RTN | 4 | - | - | 4.54 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | 19493 | 4.45 | 15.7 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | 20570 | 4.23 | 16.3 | |
|
| RTN | 3 | - | - | 14.89 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | 15493 | 5.69 | 12.0 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | 16566 | 4.80 | 13.0 | |
|
|
|
</details> |
|
|
|
<details> |
|
<summary>LLaMA-65B</summary> |
|
|
|
| [LLaMA-65B](https://arxiv.org/abs/2302.13971) | Bits | Group size | Memory (MiB) | Wikitext2 PPL | Checkpoint size (GB) |
|
| -------------------------------------------------- | ---- | ---------- | ----------- | --------- | ------------------- | |
|
| FP16 | 16 | - | OOM | 3.53 | 121.0 | |
|
| RTN | 4 | - | - | 3.92 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | OOM | 3.84 | 31.1 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | OOM | 3.65 | 32.3 | |
|
| RTN | 3 | - | - | 10.59 | - | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | OOM | 5.04 | 23.6 | |
|
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | OOM | 4.17 | 25.6 | |
|
</details> |
|
|
|
Quantization requires a large amount of CPU memory; however, the requirement can be reduced by using swap space.
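If quantization runs out of CPU RAM, one common workaround on Linux is to add a temporary swap file; the commands below are a minimal sketch (the 64G size is only an example, and root privileges are required):

```
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Verify that the new swap space is active
free -h
```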
|
|
|
Depending on the GPU and driver, there may be a difference in performance, and the gap decreases as the model size increases (see [IST-DASLab/gptq#1](https://github.com/IST-DASLab/gptq/issues/1)).
|
|
|
According to the [GPTQ paper](https://arxiv.org/abs/2210.17323), the performance difference between FP16 and GPTQ decreases as the model size increases.
|
|
|
## Installation |
|
If you don't have [conda](https://docs.conda.io/en/latest/miniconda.html), install it first. |
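For example, Miniconda can usually be installed on Linux with the commands below (the installer URL is the generic "latest" link; check the Miniconda page if it has moved):

```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

Then create the environment and install the requirements: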
|
``` |
|
conda create --name gptq python=3.9 -y |
|
conda activate gptq |
|
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia |
|
# Or, if you're having trouble with conda, use pip with python3.9: |
|
# pip3 install torch torchvision torchaudio |
|
|
|
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa |
|
cd GPTQ-for-LLaMa |
|
pip install -r requirements.txt |
|
``` |
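Optionally, before quantizing you can check that PyTorch was installed with CUDA support and can see your GPU (assuming the `gptq` environment above is active):

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```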
|
## Dependencies |
|
|
|
* `torch`: tested on v2.0.0+cu117 |
|
* `transformers`: tested on v4.28.0.dev0 |
|
* `datasets`: tested on v2.10.1 |
|
* `safetensors`: tested on v0.3.0 |
|
|
|
All experiments were run on a single NVIDIA RTX3090. |
|
|
|
# Language Generation |
|
## LLaMA |
|
|
|
``` |
|
# Convert LLaMA weights to Hugging Face (hf) format
|
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf |
|
|
|
# Benchmark language generation with 4-bit LLaMA-7B: |
|
|
|
# Save compressed model |
|
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt |
|
|
|
# Or save compressed `.safetensors` model |
|
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors |
|
|
|
# Benchmark generating a 2048 token sequence with the saved model |
|
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check |
|
|
|
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs |
|
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ${MODEL_DIR} c4 --benchmark 2048 --check |
|
|
|
# Model inference with the saved model
|
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" |
|
|
|
# Model inference with the saved model, using safetensors loaded directly to the GPU
|
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --text "this is llama" --device=0 |
|
|
|
# Model inference with the saved model using CPU offload (this is very slow)
|
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16 |
|
# Note: with LLaMa-65B and --pre_layer 50, it takes about 180 seconds to generate 45 tokens (5 -> 50 tokens) on a single RTX 3090.
|
``` |
|
In general, 4-bit quantization with a groupsize of 128 is recommended.
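If you want to double-check what a saved `.safetensors` checkpoint contains, the short script below is a minimal sketch; it only assumes the `safetensors` dependency listed above and the example file name used in the commands (the exact tensor names depend on the quantizer):

```
python - <<'EOF'
# List the first few tensors stored in the quantized checkpoint.
from safetensors import safe_open

with safe_open("llama7b-4bit-128g.safetensors", framework="pt", device="cpu") as f:
    for name in list(f.keys())[:8]:
        print(name, tuple(f.get_tensor(name).shape))
EOF
```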
|
|
|
You can also export the quantization parameters in toml + numpy format.
|
``` |
|
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --quant-directory ${TOML_DIR} |
|
``` |
|
|
|
# Acknowledgements |
|
This code is based on [GPTQ](https://github.com/IST-DASLab/gptq).
|
|
|
Thanks to Meta AI for releasing [LLaMA](https://arxiv.org/abs/2302.13971), a powerful LLM. |
|
|
|
Triton GPTQ kernel code is based on [GPTQ-triton](https://github.com/fpgaminer/GPTQ-triton).
|
|