can't run inference on multi GPU
#8 by daryl149 - opened
Works on a single A6000:
from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer
tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
model = LlamaForCausalLM.from_pretrained("oasst-rlhf-2-llama-30b", device_map="sequential", offload_folder="offload", load_in_8bit=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)
Throws an error on 2 V100S cards (each hosting 17GB of model weights):
from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer
tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
model = LlamaForCausalLM.from_pretrained("oasst-rlhf-2-llama-30b", device_map="auto", offload_folder="offload", load_in_8bit=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)
throws:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "myvenv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1558, in generate
return self.sample(
File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2641, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
The only difference is that I'm using device_map="auto" to make use of both GPUs. (It also happens with .to('cuda'), .to(0), or .to(1) instead of .to(model.device).)
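As a diagnostic, it can help to look at how accelerate actually sharded the model and whether the logits coming out of a forward pass already contain inf/nan before sampling. A minimal sketch, reusing the model, tokenizer and message objects from above (untested on this exact setup):

import torch

# hf_device_map is set by accelerate when device_map="auto" is used; it shows
# which device each submodule ended up on.
print(model.hf_device_map)

# Run a single forward pass and check the raw logits for inf/nan.
inputs = tokenizer(message, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print("any nan:", torch.isnan(logits).any().item())
print("any inf:", torch.isinf(logits).any().item())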
Ah, there's an open bug in transformers for it: https://github.com/huggingface/transformers/issues/22914
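In the meantime, one way to sidestep the torch.multinomial error (though not the underlying bad logits) is greedy decoding, which never samples from the probability tensor. A sketch reusing the objects from above; if the logits are still inf/nan, the output can still come out garbled:

# Greedy decoding (do_sample=False) avoids torch.multinomial entirely, so the
# "probability tensor contains inf/nan" error cannot trigger, but it does not
# repair invalid logits.
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=False, streamer=streamer)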
daryl149 changed discussion status to closed
Update:
The inf/nan is caused by CUDA 11.8 and bitsandbytes==0.38.1. It's solved by downgrading to CUDA 11.6 and bitsandbytes 0.31.8.
However, inference on multiple GPUs is still broken: it returns gibberish when using load_in_8bit=True. See this issue I created in transformers: https://github.com/huggingface/transformers/issues/23989
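To double-check that the downgrade is actually what the runtime picks up, a quick version report can help (just a sketch; it only prints versions, it does not exercise the 8-bit kernels):

import torch
import bitsandbytes as bnb

# Report the CUDA version PyTorch was built against and the installed
# bitsandbytes version, to confirm the downgrade took effect.
print("torch:", torch.__version__)
print("torch CUDA:", torch.version.cuda)
print("bitsandbytes:", bnb.__version__)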
daryl149 changed discussion status to open