Low quality output solution
#8 opened by IvanCheb
The example code for using this model seems to be broken with newer versions of auto-gptq and transformers. The options I found:
- Don't install auto-gptq from source: you get the "CUDA extension is not installed" warning and normal-quality output (not sure how), but generation is roughly 10x slower.
- Install auto-gptq from GitHub: generation is fast but the output quality is low with this code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "TheBloke/Llama-2-13B-GPTQ"
# To use a different branch, change revision
# For example: revision="main"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
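For reference, I compare output quality between the setups with a minimal generation call like the one below (the prompt and sampling settings are just my own example, reusing model and tokenizer from the snippet above):

prompt = "Tell me about AI"
# Tokenize the prompt and move it to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))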
- Change the code to use the auto-gptq class instead of the transformers one:
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path, torch_dtype=torch.float16)
and get normal-quality, fast output.
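For anyone hitting the same issue, here is a rough self-contained sketch of the auto-gptq loading path that worked for me (the device, use_safetensors and generation settings are my own assumptions, not from the model card):

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-13B-GPTQ"

# Load the quantized checkpoint through auto-gptq instead of transformers
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           torch_dtype=torch.float16,
                                           device="cuda:0",       # assumption: single GPU
                                           use_safetensors=True)  # assumption: safetensors weights
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))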