TheBloke committed
Commit 1ef24b9
1 Parent(s): e482f0c

Initial GPTQ model commit

Files changed (1)
  1. README.md: +12 −2
README.md CHANGED
@@ -1,4 +1,14 @@
 ---
+extra_gated_button_content: Submit
+extra_gated_description: This is a form to enable access to Llama 2 on Hugging Face
+  after you have been granted access from Meta. Please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads)
+  and accept our license terms and acceptable use policy before submitting this form.
+  Requests will be processed in 1-2 days.
+extra_gated_fields:
+  ? I agree to share my name, email address and username with Meta and confirm that
+    I have already been granted download access on the Meta website
+  : checkbox
+extra_gated_heading: Access Llama 2 on Hugging Face
 inference: false
 language:
 - en
@@ -37,7 +47,7 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
 
 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGML)
-* [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
+* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
 
 ## Prompt template: Llama-2-Chat
 
@@ -58,7 +68,7 @@ Each separate quant is in a different branch. See below for instructions on fet
 | main | 4 | 128 | False | 3.90 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
 | gptq-4bit-32g-actorder_True | 4 | 32 | True | 4.28 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
 | gptq-4bit-64g-actorder_True | 4 | 64 | True | 4.02 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
-| gptq-4bit-128g-actorder_True | 4 | 128 | True | TBC | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-4bit-128g-actorder_True | 4 | 128 | True | 3.90 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
 
 ## How to download from branches
 
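Each quant in the table above lives in its own branch of the repo, and this commit's `extra_gated_*` metadata also makes the repo gated, so an authenticated Hugging Face token may be required once access has been granted. As a minimal sketch (assuming the `huggingface_hub` Python client is installed and a token is already stored via `huggingface-cli login`; the local directory name is hypothetical), a specific branch can be fetched by passing its name as `revision`:

```python
# Sketch: download one quant branch of the GPTQ repo with huggingface_hub.
# The branch name comes from the Provided Files table above; the token is only
# needed if the repo is gated for your account.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GPTQ",
    revision="gptq-4bit-32g-actorder_True",   # branch per the table above
    local_dir="Llama-2-7b-Chat-GPTQ-32g",     # hypothetical target directory
    token=True,                               # reuse the stored login token, if any
)
print(local_path)
```

Equivalently, `git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ` fetches the same branch. Both are illustrative sketches; the card's own instructions are in the "How to download from branches" section, which is outside this diff's context.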