Initial GPTQ model commit
README.md CHANGED
@@ -1,14 +1,4 @@
 ---
-extra_gated_button_content: Submit
-extra_gated_description: This is a form to enable access to Llama 2 on Hugging Face
-  after you have been granted access from Meta. Please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads)
-  and accept our license terms and acceptable use policy before submitting this form.
-  Requests will be processed in 1-2 days.
-extra_gated_fields:
-  ? I agree to share my name, email address and username with Meta and confirm that
-    I have already been granted download access on the Meta website
-  : checkbox
-extra_gated_heading: Access Llama 2 on Hugging Face
 inference: false
 language:
 - en
@@ -39,7 +29,7 @@ tags:

 # Meta's Llama 2 13B GPTQ

-These files are GPTQ model files for [Meta's Llama 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf).
+These files are GPTQ model files for [Meta's Llama 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf).

 Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

@@ -47,12 +37,14 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for

 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-13B-GGML)
-* [
+* [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-13B-hf)

-## Prompt template:
+## Prompt template: Llama-2-Chat

 ```
-
+SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+USER: {prompt}
+ASSISTANT:
 ```

 ## Provided files
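The Llama-2-Chat template added above is plain text with a `{prompt}` placeholder. A minimal sketch of filling it in (the abbreviated system message and the helper name are illustrative, not taken from the README):

```python
# Fill the SYSTEM / USER / ASSISTANT template from the section above.
# The full system message is shown in the README; it is shortened here.
SYSTEM_MESSAGE = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def build_prompt(user_message: str) -> str:
    """Return a Llama-2-Chat style prompt for a single user turn."""
    return f"SYSTEM: {SYSTEM_MESSAGE}\nUSER: {user_message}\nASSISTANT:"

print(build_prompt("Tell me about AI"))
```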
@@ -67,6 +59,10 @@ Each separate quant is in a different branch. See below for instructions on fet
 | gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
 | gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
 | gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-128g-actorder_True | 8 | 128 | True | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-64g-actorder_True | 8 | 64 | True | 13.95 GB | False | AutoGPTQ | 8-bit, with group size 64g and Act Order for maximum inference quality. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
+| gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |

 ## How to download from branches

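Each quantisation in the table above lives in its own branch of the repository. As a hedged sketch of fetching a single variant (the branch name comes from the table; the local directory and the use of `huggingface_hub` are illustrative, since the README's own download instructions are outside this diff):

```python
from huggingface_hub import snapshot_download

# Download one specific GPTQ variant by passing its branch name as `revision`.
local_dir = snapshot_download(
    repo_id="TheBloke/Llama-2-13B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # any branch from the table above
    local_dir="Llama-2-13B-GPTQ-4bit-32g",
)
print("Downloaded to:", local_dir)
```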
@@ -116,7 +112,7 @@ use_triton = False
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-        model_basename=model_basename
+        model_basename=model_basename,
         use_safetensors=True,
         trust_remote_code=True,
         device="cuda:0",
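The hunk above adds the comma that was missing after `model_basename=model_basename`; without it the call is a Python SyntaxError. For context, a self-contained sketch of the corrected load (the basename and the trailing `use_triton`/`quantize_config` arguments are assumptions based on typical AutoGPTQ usage, since they fall outside this hunk):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-13B-GPTQ"
model_basename = "gptq_model-4bit-128g"  # assumed; check the filenames in the repo
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Note the comma after model_basename=model_basename, restored by the fix above.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)
```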
@@ -136,7 +132,9 @@ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
 """

 prompt = "Tell me about AI"
-prompt_template=f'''
+prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+USER: {prompt}
+ASSISTANT:
 '''

 print("\n\n*** Generate:")
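With the template filled in as above, generation follows the usual transformers pattern. A short usage sketch, reusing `model`, `tokenizer`, and `prompt_template` from the snippets above (the sampling settings are illustrative, not taken from this diff):

```python
# Encode the filled-in template, generate a completion, and decode it.
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))
```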
@@ -201,7 +199,7 @@ Thank you to all my generous patrons and donaters!
 # Original model card: Meta's Llama 2 13B

 # **Llama 2**
-Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 13B
+Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 13B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.

 ## Model Details
 *Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the [website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and accept our License before requesting access here.*