Llama3 Not Running

#157
by Mattb0124 - opened

Hi, I recently downloaded Llama 3 and am trying to run it in VS Code. I've installed all of the prerequisites, and my computer seems to meet the hardware requirements (a Lenovo ThinkPad with 16 GB of RAM). However, when I execute model.generate, it acts like it is running the model, but nothing is ever generated. My memory usage spikes from 6 GB to 13 GB and hovers there until I restart my computer. Any idea what is going on?
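For reference, here is a quick check I ran before loading (a minimal sketch; psutil is just my choice for reading free RAM, not something transformers requires). Since 8B parameters in bfloat16 are roughly 16 GB of weights alone, I suspect the model may not fit in free memory and is swapping:

import torch
import psutil

# Likely False on a laptop without a dedicated GPU, in which case
# device_map="auto" places the whole model on the CPU.
print("CUDA available:", torch.cuda.is_available())

# Free system RAM before loading; 8e9 params * 2 bytes (bfloat16) is about
# 16 GB for the weights alone, so less free RAM than that means swapping.
free_gb = psutil.virtual_memory().available / 1024**3
print(f"Available RAM: {free_gb:.1f} GB")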

Here is my code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
from dotenv import load_dotenv

load_dotenv()

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the Hugging Face token from the environment variable
huggingface_token = os.getenv("HUGGINGFACE_TOKEN")

tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate library
    token=huggingface_token,  # use_auth_token is deprecated in recent transformers versions
)
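# Note (my own guess, not confirmed): with no CUDA device visible,
# device_map="auto" places the entire model on the CPU, and 8B parameters
# in bfloat16 are roughly 16 GB of weights, which exceeds the free RAM
# on a 16 GB machine, so loading/generation may be stalling in swap.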

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Explicitly set pad_token to eos_token if no pad token is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=[128001, 128009],
    pad_token_id=tokenizer.pad_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
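
To see whether any tokens are being produced at all, I also tried attaching a streamer (a minimal sketch using transformers' TextStreamer; everything else is the same setup as above):

from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the prompt,
# so slow-but-working generation is distinguishable from a real hang.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    streamer=streamer,
    eos_token_id=[128001, 128009],
    pad_token_id=tokenizer.pad_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

If tokens trickle out slowly here, it is a memory/speed problem rather than a code problem.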
(Screenshot: Anaconda Prompt (miniconda3) running python ESG_doc_ready_llama.py)

I am having the same issue. Any updates?

Any help would be appreciated!
