hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4

@@ -41,6 +41,8 @@ In order to run the inference with Llama 3.1 8B Instruct AWQ in INT4, both `torc
 pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
 ```
 Then, the latest version of `transformers` needs to be installed (4.43.0 or higher):
 ```bash
@@ -61,7 +63,13 @@ prompt = [
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-inputs = tokenizer.apply_chat_template(prompt, tokenize=True, add_generation_prompt=True, return_tensors="pt").cuda()
 model = AutoModelForCausalLM.from_pretrained(
   model_id,
@@ -70,7 +78,7 @@ model = AutoModelForCausalLM.from_pretrained(
   device_map="auto",
 )
-outputs = model.generate(inputs, do_sample=True, max_new_tokens=256)
 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 ```
@@ -82,6 +90,8 @@ In order to run the inference with Llama 3.1 8B Instruct AWQ in INT4, both `torc
 pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
 ```
 Then, the latest version of `transformers` needs to be installed (4.43.0 or higher):
 ```bash
@@ -103,7 +113,13 @@ prompt = [
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-inputs = tokenizer.apply_chat_template(prompt, tokenize=True, add_generation_prompt=True, return_tensors="pt").cuda()
 model = AutoAWQForCausalLM.from_pretrained(
   model_id,
@@ -112,11 +128,11 @@ model = AutoAWQForCausalLM.from_pretrained(
   device_map="auto",
 )
-outputs = model.generate(inputs, do_sample=True, max_new_tokens=256)
 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 ```
-The AutoAWQ script has been adapted from [AutoAWQ/examples/generate.py](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/generate.py).
 ### 🤗 Text Generation Inference (TGI)

 pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
 ```
+Otherwise, model inference may fail: the AutoAWQ kernels are built against PyTorch 2.2.1, so they break with PyTorch 2.3.0.
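
As a rough illustration of the version pin above, the range `>=2.2.0,<2.3.0` could be checked at runtime before loading the model. This is only a sketch: `parse_release` and `torch_version_in_supported_range` are hypothetical helpers and not part of PyTorch or AutoAWQ. The parser ignores local build tags such as `+cu121` and does not handle pre-release suffixes.

```python
import re

# Bounds copied from the pip constraint "torch>=2.2.0,<2.3.0".
MIN_VERSION = (2, 2, 0)
MAX_VERSION_EXCLUSIVE = (2, 3, 0)


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from strings such as '2.2.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. "2.2") is treated as 0.
    return int(major), int(minor), int(patch or 0)


def torch_version_in_supported_range(version: str) -> bool:
    """Return True if `version` satisfies >=2.2.0,<2.3.0."""
    return MIN_VERSION <= parse_release(version) < MAX_VERSION_EXCLUSIVE
```

In practice the helper would be called with the installed version, for example `torch_version_in_supported_range(torch.__version__)` after `import torch`, and the script could stop early if it returns `False`.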
 Then, the latest version of `transformers` needs to be installed (4.43.0 or higher):
 ```bash
 tokenizer = AutoTokenizer.from_pretrained(model_id)
+inputs = tokenizer.apply_chat_template(
+  prompt,
+  tokenize=True,
+  add_generation_prompt=True,
+  return_tensors="pt",
+  return_dict=True,
+).to("cuda")
 model = AutoModelForCausalLM.from_pretrained(
   model_id,
   device_map="auto",
 )
+outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 ```
 pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
 ```
+Otherwise, model inference may fail: the AutoAWQ kernels are built against PyTorch 2.2.1, so they break with PyTorch 2.3.0.
 Then, the latest version of `transformers` needs to be installed (4.43.0 or higher):
 ```bash
 tokenizer = AutoTokenizer.from_pretrained(model_id)
+inputs = tokenizer.apply_chat_template(
+  prompt,
+  tokenize=True,
+  add_generation_prompt=True,
+  return_tensors="pt",
+  return_dict=True,
+).to("cuda")
 model = AutoAWQForCausalLM.from_pretrained(
   model_id,
   device_map="auto",
 )
+outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 ```
+The AutoAWQ script has been adapted from [`AutoAWQ/examples/generate.py`](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/generate.py).
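
The change from `model.generate(inputs, ...)` to `model.generate(**inputs, ...)` goes together with `return_dict=True`: the chat template then returns a dict-like object (typically holding `input_ids` and `attention_mask`) instead of a bare tensor, and unpacking it forwards every entry as a keyword argument. The sketch below shows only the unpacking mechanics. `toy_generate` is a hypothetical stand-in and not the `transformers` API, and plain lists stand in for tensors.

```python
def toy_generate(input_ids, attention_mask=None, **generation_kwargs):
    """Hypothetical stand-in that only reports which arguments it received."""
    return {
        "got_input_ids": input_ids is not None,
        "got_attention_mask": attention_mask is not None,
        "generation_kwargs": generation_kwargs,
    }


# Old style: only the token ids are available, passed positionally.
ids_only = [[1, 2, 3]]
old_call = toy_generate(ids_only, do_sample=True, max_new_tokens=256)

# New style: a dict carrying ids and mask, unpacked into keyword arguments.
as_dict = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}
new_call = toy_generate(**as_dict, do_sample=True, max_new_tokens=256)

print(old_call["got_attention_mask"], new_call["got_attention_mask"])
```

With the dict form, the attention mask reaches the call without any extra argument at the call site, which is what the updated snippets rely on.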
 ### 🤗 Text Generation Inference (TGI)