Not able to generate an answer from astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit
Hi!
Thank you for uploading this quantized model; it lets me use Llama 3 on Google Colab, where the original model is too big to fit on an Nvidia T4.
I use the following code to load the model and generate text:
from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch
model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"
print('Creating QuantizeConfig...')
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False
)
print('Loading Quantized model...')
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    use_safetensors=True,
    device="cuda:0",
    quantize_config=quantize_config)
print('Loading Tokenizer model...')
tokenizer = AutoTokenizer.from_pretrained(model_id)
print('Creating Pipeline...')
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)
prompt = "What is the capital of Indonesia?"
terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1])
However, the output is:
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '】,【\x00\x0c】,【\\",\\"gers\x00】,\x00`,{"gorithms`\n\n`.\n\n»\n\n»\n\n`,`},»\n\n`\n\n}\n\n».\n\n`]("><}[],](`\n\n],[`,]]["],"),\n\n」\n\n},{"]\n\n`,}\n\n}\n\n%),»\n\n],}\n\n`\n\n)</」\n\n```\n\n`,`,\n`\n\n`\n\n`.\n\n},{{"`\n\n`.\n\n`,>`{"{"},`,}\n\n</],\n\n{"],"}`>\n\n».\n\n».\n\n)`]\n\n%`,»{"}\n\n"`].\n\n`\n\n}\n\n])\n\n}`,`\n\n`\n\n«],{"`,`\n\n`\n\nAE)`]\n\n}\n\n``,»\n\n`,.\n\n`.\n\n`,`\n\n}\n\n>,"""\n\n]>`,{"`,)``,``»`,)\n\n\n`,{"`,{"}\n\n))\n\n]({"`,]]`\n\n\n\n\n\n\n\n\n\n%,``.\n\n\n\n\n`,\n\n\n`\n\n`\n\n.\n\n\n``\n\n`\n\n]\n\n`,]( "\n\n\n.\n\n\n#](}}`\n\n<<{"))\n\n%)`%\n\n\n.\n\n\n\n\n\n`]`,]]}\n\n.\n\n\n`\n\nD<<``][B»!\n\n>\n\n`\n\n`,]]¢]\n\n`\n\n\n\n``»\n\n\n`\n\n\n\n]]\n\n}\n\n%]][<``]\n\n.\n\n\n\n\n<<=`]`,{}`\n\n\n\n[/`\n\n¢}=)\n\n.\n\n}\n\n}\n\n\n\n](%}<AE}`\n\n``»>\n\n%G``\n\n\n\n\nAE\n\n\n]\n\n"""\n\n)<\n\n\n\n\n`\n\n]][](\n\n#%}>\n\n``>\n\n\n\n»`\n\n`\n\n'}
I also tried a different system message:
{"role": "system", "content": "You are a helpful assistant."},
and got similarly garbled output:
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '\x00】,【\x00`,\x00\x00\\"><},},{\x00`\n\n\\"],」\n\n`.\n\n»\n\n\\",\\"】,【},`](«】,【],[】,【»\n\n},`\n\n](}\n\n»\n\n}[`,``,},"></«"""\n\n]][».\n\n`.\n\n»,%),"`},`]\n\n`,`,»`,`,»`,"],"},]\n\n},}`>``,]{.\n\n`.\n\n},"\n\n].\n\n`\n\n],"},],\n\n],%,},{"%,]]`,`\n\n{}`\n\n%```,.\n\n»]\n\n`,,"»%`,},]\n\n»},.\n\n\n`,}\n\n`\n\n`.\n\n]\n\n},}%``\n\n\n.\n\n\n``.\n\nologists},>`\n\n\n``\n\n<<AE`,`\n\n`,.\n\n`,```,>,""``,»}](`\n\n€`,>\n\n{"F«{"`,}]\n\n`\n\n\n\n\n]\n\n)``\n\n>`»,``}}\n\n\n\n<<\n\n\n]]`.\n\n}\n\n`\n\nG\n\n\n`\n\n},{"]]},`\n\n`,]>`\n\n\n`AE`\n\n\n\n\nAE<<))\n\n\n\n`,<<\n\n<B "\n\n),»\n\n\n\n`}\n\n#F]AE`\n\n}\n\n`,\n\n\n</]][\n\n\n]][],<<],¢\n\n\n=`\n\n,{{\n\n\n}\n\n\n\n\n\n.\n\n<<\x00\n\n\n%]]\n\n\n\n\n\n\n»`A]][.\n\n\n\n\n]]\n\n)\n\n`\n\n\n\n"""\n\n\n\n]]\n\n</{"```\n\n\n@C\n\n\n`\n\n</]]`>\n\n\n\n\n\n\n\n\n'}
I also tried a simple string prompt:
# prompt = "What is a large language model?"
prompt = "What is the capital of Indonesia?"
terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipe(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1])
and got only a single character as output:
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
s
Can you suggest where I went wrong?
Let me see if I can reproduce this on my end. However, I think this may be a known bug in the integration between the AutoGPTQ library and Hugging Face transformers; I believe it broke at some point around the 4.40 release of transformers.
Could you leave a comment describing what happened on this GitHub issue: https://github.com/AutoGPTQ/AutoGPTQ/issues/657? Someone familiar with both transformers and AutoGPTQ will need to do a deep dive to resolve this, and the more people comment on the issue, the more likely the maintainers are to look into it.
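When you comment there, including the exact library versions helps. Below is a minimal sketch for capturing them (the PyPI distribution names are assumptions on my part); and if the regression really did land around transformers 4.40, pinning an earlier release is a possible, untested workaround.

```python
# Sketch: print the installed versions of the relevant packages so they can be
# pasted into the GitHub issue. Package names are the usual PyPI distributions.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "auto-gptq", "optimum", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")

# Untested workaround (assumption): if the regression landed in transformers 4.40,
# downgrading in Colab and restarting the runtime may restore generation:
#   !pip install "transformers<4.40"
```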
In the meantime, I would suggest loading the model with vLLM or any other serving engine that doesn't use Hugging Face transformers under the hood for generation; that should work.
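For reference, here is a minimal sketch of what that could look like with vLLM's offline `LLM` API. This is an assumption about your setup rather than something I have verified on a T4, and argument names can shift between vLLM versions.

```python
# Minimal sketch (not a drop-in replacement): serve the same GPTQ checkpoint
# with vLLM's offline LLM API instead of transformers' pipeline. Assumes vLLM
# is installed with GPTQ support; dtype="half" is used because the T4 has no
# bfloat16 support, and max_model_len is reduced to fit the T4's memory.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, quantization="gptq", dtype="half", max_model_len=4096)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of Indonesia?"},
]
# Render the Llama 3 chat template into a plain string prompt.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
    stop_token_ids=[
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ],
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```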