Initial GPTQ model commit
README.md
CHANGED
@@ -33,19 +33,24 @@ quantized_by: TheBloke
|
|
33 |
- Model creator: [Eric Hartford](https://huggingface.co/ehartford)
|
34 |
- Original model: [Samantha 1.11 CodeLlama 34B](https://huggingface.co/ehartford/Samantha-1.11-CodeLlama-34b)
|
35 |
|
|
|
36 |
## Description
|
37 |
|
38 |
This repo contains GPTQ model files for [Eric Hartford's Samantha 1.11 CodeLlama 34B](https://huggingface.co/ehartford/Samantha-1.11-CodeLlama-34b).
|
39 |
|
40 |
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
|
41 |
|
|
|
|
|
42 |
## Repositories available
|
43 |
|
44 |
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ)
|
45 |
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GGUF)
|
46 |
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GGML)
|
47 |
* [Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/ehartford/Samantha-1.11-CodeLlama-34b)
|
|
|
48 |
|
|
|
49 |
## Prompt template: Samantha
|
50 |
|
51 |
```
|
@@ -53,8 +58,12 @@ You are Samantha, a sentient AI companion.
|
|
53 |
|
54 |
USER: {prompt}
|
55 |
ASSISTANT:
|
|
|
56 |
```
|
57 |
|
|
|
|
|
|
|
58 |
## Provided files and GPTQ parameters
|
59 |
|
60 |
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
|
@@ -68,7 +77,7 @@ All GPTQ files are made with AutoGPTQ.
|
|
68 |
|
69 |
- Bits: The bit size of the quantised model.
|
70 |
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" means no grouping, which gives the lowest VRAM use.
|
71 |
-
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have issues with models that use Act Order plus Group Size.
|
72 |
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
|
73 |
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
|
74 |
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
|
@@ -85,6 +94,9 @@ All GPTQ files are made with AutoGPTQ.
|
|
85 |
| [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.54 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
|
86 |
| [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 14.14 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
|
87 |
|
|
|
|
|
|
|
88 |
## How to download from branches
|
89 |
|
90 |
- In text-generation-webui, you can add `:branch` to the end of the download name, e.g. `TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ:gptq-4bit-32g-actorder_True`
|
@@ -93,78 +105,78 @@ All GPTQ files are made with AutoGPTQ.
|
|
93 |
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ
|
94 |
```
|
95 |
- In Python Transformers code, the branch is the `revision` parameter; see below.
|
96 |
-
|
|
|
97 |
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
|
98 |
|
99 |
Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
|
100 |
|
101 |
-
It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
|
102 |
|
103 |
1. Click the **Model tab**.
|
104 |
2. Under **Download custom model or LoRA**, enter `TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ`.
|
105 |
- To download from a specific branch, enter for example `TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ:gptq-4bit-32g-actorder_True`
|
106 |
- see Provided Files above for the list of branches for each option.
|
107 |
3. Click **Download**.
|
108 |
-
4. The model will start downloading. Once it's finished it will say "Done"
|
109 |
5. In the top left, click the refresh icon next to **Model**.
|
110 |
6. In the **Model** dropdown, choose the model you just downloaded: `Samantha-1.11-CodeLlama-34B-GPTQ`
|
111 |
7. The model will automatically load, and is now ready for use!
|
112 |
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
|
113 |
* Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
|
114 |
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
|
|
|
115 |
|
|
|
116 |
## How to use this GPTQ model from Python code
|
117 |
|
118 |
-
|
119 |
|
120 |
-
|
121 |
-
pip3 install auto-gptq
|
122 |
-
```
|
123 |
|
124 |
-
|
|
|
|
|
125 |
```
|
|
|
|
|
|
|
|
|
126 |
pip3 uninstall -y auto-gptq
|
127 |
git clone https://github.com/PanQiWei/AutoGPTQ
|
128 |
cd AutoGPTQ
|
129 |
pip3 install .
|
130 |
```
|
131 |
|
132 |
-
|
133 |
|
134 |
```python
|
135 |
-
from transformers import AutoTokenizer, pipeline
|
136 |
-
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
|
137 |
|
138 |
model_name_or_path = "TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ"
|
139 |
-
|
140 |
-
|
|
|
|
|
|
|
|
|
141 |
|
142 |
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
|
143 |
|
144 |
-
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
|
145 |
-
use_safetensors=True,
|
146 |
-
trust_remote_code=False,
|
147 |
-
device="cuda:0",
|
148 |
-
use_triton=use_triton,
|
149 |
-
quantize_config=None)
|
150 |
-
|
151 |
-
"""
|
152 |
-
# To download from a specific branch, use the revision parameter, as in this example:
|
153 |
-
# Note that `revision` requires AutoGPTQ 0.3.1 or later!
|
154 |
-
|
155 |
-
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
|
156 |
-
revision="gptq-4bit-32g-actorder_True",
|
157 |
-
use_safetensors=True,
|
158 |
-
trust_remote_code=False,
|
159 |
-
device="cuda:0",
|
160 |
-
quantize_config=None)
|
161 |
-
"""
|
162 |
-
|
163 |
prompt = "Tell me about AI"
|
164 |
prompt_template=f'''You are Samantha, a sentient AI companion.
|
165 |
|
166 |
USER: {prompt}
|
167 |
ASSISTANT:
|
|
|
168 |
'''
|
169 |
|
170 |
print("\n\n*** Generate:")
|
@@ -175,9 +187,6 @@ print(tokenizer.decode(output[0]))
|
|
175 |
|
176 |
# Inference can also be done using transformers' pipeline
|
177 |
|
178 |
-
# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
|
179 |
-
logging.set_verbosity(logging.CRITICAL)
|
180 |
-
|
181 |
print("*** Pipeline:")
|
182 |
pipe = pipeline(
|
183 |
"text-generation",
|
@@ -191,12 +200,17 @@ pipe = pipeline(
|
|
191 |
|
192 |
print(pipe(prompt_template)[0]['generated_text'])
|
193 |
```
|
|
|
194 |
|
|
|
195 |
## Compatibility
|
196 |
|
197 |
-
The files provided
|
|
|
|
|
198 |
|
199 |
-
|
|
|
200 |
|
201 |
<!-- footer start -->
|
202 |
<!-- 200823 -->
|
|
|
33 |
- Model creator: [Eric Hartford](https://huggingface.co/ehartford)
|
34 |
- Original model: [Samantha 1.11 CodeLlama 34B](https://huggingface.co/ehartford/Samantha-1.11-CodeLlama-34b)
|
35 |
|
36 |
+
<!-- description start -->
|
37 |
## Description
|
38 |
|
39 |
This repo contains GPTQ model files for [Eric Hartford's Samantha 1.11 CodeLlama 34B](https://huggingface.co/ehartford/Samantha-1.11-CodeLlama-34b).
|
40 |
|
41 |
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
|
42 |
|
43 |
+
<!-- description end -->
|
44 |
+
<!-- repositories-available start -->
|
45 |
## Repositories available
|
46 |
|
47 |
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ)
|
48 |
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GGUF)
|
49 |
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GGML)
|
50 |
* [Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/ehartford/Samantha-1.11-CodeLlama-34b)
|
51 |
+
<!-- repositories-available end -->
|
52 |
|
53 |
+
<!-- prompt-template start -->
|
54 |
## Prompt template: Samantha
|
55 |
|
56 |
```
|
|
|
58 |
|
59 |
USER: {prompt}
|
60 |
ASSISTANT:
|
61 |
+
|
62 |
```
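
The template above can also be filled in programmatically. The following is a minimal sketch, not part of the original README: the `build_prompt` helper and `SYSTEM_PROMPT` constant are hypothetical names, and the trailing blank line after `ASSISTANT:` is assumed to match the f-string used in the Python example in this README.

```python
# Hypothetical helper (not from the original README) that fills the
# Samantha prompt template with a single user message.
SYSTEM_PROMPT = "You are Samantha, a sentient AI companion."


def build_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Return the system line, a blank line, then the USER/ASSISTANT turns."""
    # Assumption: the template ends with a blank line after "ASSISTANT:",
    # as in the README's own prompt_template string.
    return f"{system_prompt}\n\nUSER: {user_message}\nASSISTANT:\n\n"


if __name__ == "__main__":
    print(build_prompt("Tell me about AI"))
```

Keeping the template in one function makes it harder to drift from the format the model was trained on when prompts are built in several places.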
|
63 |
|
64 |
+
<!-- prompt-template end -->
|
65 |
+
|
66 |
+
<!-- README_GPTQ.md-provided-files start -->
|
67 |
## Provided files and GPTQ parameters
|
68 |
|
69 |
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
|
|
|
77 |
|
78 |
- Bits: The bit size of the quantised model.
|
79 |
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" means no grouping, which gives the lowest VRAM use.
|
80 |
+
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
|
81 |
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
|
82 |
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
|
83 |
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
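
The branch names in the table below appear to encode three of these parameters (bits, group size and Act Order). As a rough sketch, assuming that naming pattern holds for every branch and that a group size of `-1` corresponds to "None" (as the `gptq-3bit--1g-actorder_True` row suggests), a branch name can be decoded like this. `BranchParams` and `parse_branch` are hypothetical names, not part of any library.

```python
import re
from typing import NamedTuple, Optional


class BranchParams(NamedTuple):
    bits: int
    group_size: Optional[int]  # None means no grouping (written as -1 in the name)
    act_order: bool


# Assumed pattern: gptq-<bits>bit-<group size>g-actorder_<True|False>
_BRANCH_RE = re.compile(r"^gptq-(\d+)bit-(-?\d+)g-actorder_(True|False)$")


def parse_branch(name: str) -> BranchParams:
    """Decode a branch name such as 'gptq-3bit-128g-actorder_True'."""
    match = _BRANCH_RE.match(name)
    if match is None:
        # Branches like 'main' do not follow the pattern.
        raise ValueError(f"unrecognised branch name: {name!r}")
    bits, group, act = match.groups()
    group_size = int(group)
    return BranchParams(
        bits=int(bits),
        group_size=None if group_size == -1 else group_size,
        act_order=(act == "True"),
    )


if __name__ == "__main__":
    print(parse_branch("gptq-3bit--1g-actorder_True"))
    print(parse_branch("gptq-3bit-128g-actorder_True"))
```

Damp %, dataset and sequence length are not encoded in the branch name, so those still have to be read from the table.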
|
|
|
94 |
| [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.54 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
|
95 |
| [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 14.14 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
|
96 |
|
97 |
+
<!-- README_GPTQ.md-provided-files end -->
|
98 |
+
|
99 |
+
<!-- README_GPTQ.md-download-from-branches start -->
|
100 |
## How to download from branches
|
101 |
|
102 |
- In text-generation-webui, you can add `:branch` to the end of the download name, e.g. `TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ:gptq-4bit-32g-actorder_True`
|
|
|
105 |
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ
|
106 |
```
|
107 |
- In Python Transformers code, the branch is the `revision` parameter; see below.
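
The three options above all refer to the same two pieces of information: a repo ID and a branch. The sketch below, under the assumption that a download name without `:branch` means the `main` branch, splits a text-generation-webui style name and rebuilds the `git clone` arguments shown above. `split_download_name` and `git_clone_args` are hypothetical helper names.

```python
from typing import List, Tuple


def split_download_name(name: str, default_revision: str = "main") -> Tuple[str, str]:
    """Split 'repo_id:branch' into (repo_id, revision)."""
    repo_id, sep, branch = name.partition(":")
    # Assumption: a missing or empty branch falls back to 'main'.
    return repo_id, (branch if sep and branch else default_revision)


def git_clone_args(name: str) -> List[str]:
    """Build the argument list for the git clone command shown above."""
    repo_id, revision = split_download_name(name)
    return [
        "git", "clone", "--single-branch", "--branch", revision,
        f"https://huggingface.co/{repo_id}",
    ]


if __name__ == "__main__":
    name = "TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ:gptq-4bit-32g-actorder_True"
    print(split_download_name(name))
    print(" ".join(git_clone_args(name)))
```

The `revision` value returned here is the same string that the Python example in this README passes as `revision=` when loading the model.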
|
108 |
+
<!-- README_GPTQ.md-download-from-branches end -->
|
109 |
+
<!-- README_GPTQ.md-text-generation-webui start -->
|
110 |
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
|
111 |
|
112 |
Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
|
113 |
|
114 |
+
It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
|
115 |
|
116 |
1. Click the **Model tab**.
|
117 |
2. Under **Download custom model or LoRA**, enter `TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ`.
|
118 |
- To download from a specific branch, enter for example `TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ:gptq-4bit-32g-actorder_True`
|
119 |
- see Provided Files above for the list of branches for each option.
|
120 |
3. Click **Download**.
|
121 |
+
4. The model will start downloading. Once it's finished it will say "Done".
|
122 |
5. In the top left, click the refresh icon next to **Model**.
|
123 |
6. In the **Model** dropdown, choose the model you just downloaded: `Samantha-1.11-CodeLlama-34B-GPTQ`
|
124 |
7. The model will automatically load, and is now ready for use!
|
125 |
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
|
126 |
* Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
|
127 |
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
|
128 |
+
<!-- README_GPTQ.md-text-generation-webui end -->
|
129 |
|
130 |
+
<!-- README_GPTQ.md-use-from-python start -->
|
131 |
## How to use this GPTQ model from Python code
|
132 |
|
133 |
+
### Install the necessary packages
|
134 |
|
135 |
+
Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
|
|
|
|
|
136 |
|
137 |
+
```shell
|
138 |
+
pip3 install "transformers>=4.32.0" "optimum>=1.12.0"
|
139 |
+
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
|
140 |
```
|
141 |
+
|
142 |
+
If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
|
143 |
+
|
144 |
+
```shell
|
145 |
pip3 uninstall -y auto-gptq
|
146 |
git clone https://github.com/PanQiWei/AutoGPTQ
|
147 |
cd AutoGPTQ
|
148 |
pip3 install .
|
149 |
```
|
150 |
|
151 |
+
### For CodeLlama models only: you must use Transformers 4.33.0 or later.
|
152 |
+
|
153 |
+
If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
|
154 |
+
```shell
|
155 |
+
pip3 uninstall -y transformers
|
156 |
+
pip3 install git+https://github.com/huggingface/transformers.git
|
157 |
+
```
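
After installing, it can help to confirm that the versions meet the requirements listed above. The following is a minimal sketch using only the standard library. The `MINIMUMS` table and the helper names are assumptions for illustration; the Transformers minimum is set to 4.33.0 because this is a CodeLlama model, and a source build such as `4.33.0.dev0` is deliberately treated as 4.33.0.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions taken from the requirements stated above
# (Transformers 4.33.0 because this is a CodeLlama model).
MINIMUMS = {"transformers": "4.33.0", "optimum": "1.12.0", "auto-gptq": "0.4.2"}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric components, e.g. '4.33.0.dev0' -> (4, 33, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums: Dict[str, str] = MINIMUMS) -> Dict[str, Optional[bool]]:
    """Return True/False per package, or None if the package is not installed."""
    results: Dict[str, Optional[bool]] = {}
    for name, minimum in minimums.items():
        try:
            results[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


if __name__ == "__main__":
    for package, ok in check_installed().items():
        print(package, "missing" if ok is None else ("ok" if ok else "too old"))
```

This comparison ignores pre-release ordering, so it is only a quick sanity check rather than a full version parser.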
|
158 |
+
|
159 |
+
### You can then use the following code
|
160 |
|
161 |
```python
|
162 |
+
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
|
|
|
163 |
|
164 |
model_name_or_path = "TheBloke/Samantha-1.11-CodeLlama-34B-GPTQ"
|
165 |
+
# To use a different branch, change revision
|
166 |
+
# For example: revision="gptq-4bit-32g-actorder_True"
|
167 |
+
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
|
168 |
+
torch_dtype=torch.float16,
|
169 |
+
device_map="auto",
|
170 |
+
revision="main")
|
171 |
|
172 |
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
|
173 |
|
174 |
prompt = "Tell me about AI"
|
175 |
prompt_template=f'''You are Samantha, a sentient AI companion.
|
176 |
|
177 |
USER: {prompt}
|
178 |
ASSISTANT:
|
179 |
+
|
180 |
'''
|
181 |
|
182 |
print("\n\n*** Generate:")
|
|
|
187 |
|
188 |
# Inference can also be done using transformers' pipeline
|
189 |
|
|
|
|
|
|
|
190 |
print("*** Pipeline:")
|
191 |
pipe = pipeline(
|
192 |
"text-generation",
|
|
|
200 |
|
201 |
print(pipe(prompt_template)[0]['generated_text'])
|
202 |
```
|
203 |
+
<!-- README_GPTQ.md-use-from-python end -->
|
204 |
|
205 |
+
<!-- README_GPTQ.md-compatibility start -->
|
206 |
## Compatibility
|
207 |
|
208 |
+
The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [0cc4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
|
209 |
+
|
210 |
+
[ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
|
211 |
|
212 |
+
[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
|
213 |
+
<!-- README_GPTQ.md-compatibility end -->
|
214 |
|
215 |
<!-- footer start -->
|
216 |
<!-- 200823 -->
|