Low quality output solution

#8 · opened by IvanCheb

The example code for using this model seems to have been broken by newer versions of auto-gptq and transformers. Options:

  1. Do not install auto-gptq from source: you get a "CUDA extension is not installed" warning but normal-quality output (not sure how), although roughly 10x slower.
  2. Install auto-gptq from GitHub: output is fast but low quality when using this code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "TheBloke/Llama-2-13B-GPTQ"
# To use a different branch, change revision
# For example: revision="main"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
```
  3. Change the code to use the auto-gptq class instead of the transformers one:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path, torch_dtype=torch.float16)
```

and get output that is both normal quality and fast (a fuller end-to-end sketch follows below).
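For reference, here is a minimal end-to-end sketch of option 3, combining the auto-gptq loading path with generation. The `device` argument, prompt text, and generation settings are illustrative assumptions, not part of the original post; adjust them to your setup.

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-13B-GPTQ"

# Load the quantized weights with auto-gptq instead of transformers.
# device and dtype are assumptions; change as needed for your hardware.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           torch_dtype=torch.float16,
                                           device="cuda:0")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Illustrative prompt and sampling parameters
prompt = "Tell me about AI"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=128,
                            do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```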
