Inference speed
#19 opened by VlaTal
I load the model in 8-bit and it fits entirely on my GPU, but the GPU barely does any work. It doesn't even heat up much, the way it does with other models. Meanwhile the CPU runs at 100%, and the inference speed is about 2-3 tokens/s.
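A quick first sanity check (a minimal sketch, assuming a standard PyTorch install) is whether PyTorch sees the GPU at all, since the 8-bit bitsandbytes kernels need CUDA:

import torch

# If this prints False, the 8-bit path cannot use the GPU at all
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))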
Here's the code I use for loading and inference:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = 'WizardLM/WizardCoder-15B-V1.0'

def load_model(model_name=model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # device_map was undefined in the original snippet; "auto" is assumed here
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
    return tokenizer, model

tokenizer, model = load_model(model_name)
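# Optional check: get_memory_footprint() reports how much memory the loaded
# model occupies; for a 15B model, 8-bit should be roughly half the fp16 size
print(f"model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")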
# Note: with do_sample left at its default (False), generation is greedy and
# temperature/top_p/top_k have no effect
generation_config = GenerationConfig(
    temperature=0.0,
    top_p=0.95,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
prompt = "Write a Python function that reverses a string."  # placeholder; the actual instruction goes here

prompt_template = f'''
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: {prompt}
### Response:'''
inputs = tokenizer(prompt_template, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, generation_config=generation_config, max_new_tokens=3000)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs[0])
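One way to see why the CPU is doing the work (a sketch, assuming the model was loaded with a device_map as above) is to inspect where accelerate actually placed the layers. With device_map="auto", anything that does not fit in VRAM is silently offloaded to CPU, which would match the symptoms above:

# Maps each module to the device it was assigned; any "cpu" (or "disk")
# entries mean those layers run off the GPU and dominate generation time
print(model.hf_device_map)

If every entry shows cuda:0 and the CPU is still pegged, offloading is not the explanation, and the bitsandbytes install itself would be the next thing to check.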