Low quality output solution

#8 · opened by IvanCheb

The example code for using this model seems to have been broken by newer versions of auto-gptq and transformers. Options:

  1. Do not install auto-gptq from source: you get a "CUDA extension is not installed" warning but normal-quality output (not sure how), although roughly 10x slower.
  2. Install auto-gptq from GitHub: output is fast but low quality when using this code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "TheBloke/Llama-2-13B-GPTQ"
# To use a different branch, change revision
# For example: revision="main"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
```
  3. Change the code to use the auto-gptq class instead of the transformers one:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path, torch_dtype=torch.float16)
```

and get output that is both normal quality and fast (a fuller end-to-end sketch follows below).
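For reference, here is a minimal end-to-end sketch of option 3, combining the auto-gptq loading path with generation. The `device` argument, prompt text, and generation settings are illustrative assumptions, not part of the original post; adjust them to your setup.

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-13B-GPTQ"

# Load the quantized weights with auto-gptq instead of transformers.
# device and dtype are assumptions; change as needed for your hardware.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           torch_dtype=torch.float16,
                                           device="cuda:0")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Illustrative prompt and sampling parameters
prompt = "Tell me about AI"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=128,
                            do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```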
