Running the model on CPU

#22
by JugsMa - opened

I loaded the model on CPU, but it requires the model and data to be on a GPU (CUDA) device.
Is there a way to do this on CPU?

I also tried to load it on a GPU with 16 GB of memory, but since I don't have a larger GPU I got this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 52.12 MiB is free. Process 19046 has 14.69 GiB memory in use. Of the allocated memory 14.41 GiB is allocated by PyTorch, and 179.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
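
For what it's worth, the suggestion in the error message itself can be applied like this (a minimal sketch; it only mitigates fragmentation and will not make a 72B-parameter model fit in ~15 GiB of VRAM):

import os
# Must be set before the first CUDA allocation, ideally before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
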
NVIDIA org

I haven't tried using the CPU for inference, but to do that, the following code should work:

import torch
from transformers import AutoModel

path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map="cpu").eval()
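
As a rough back-of-the-envelope sanity check (my own estimate, not from the model card), CPU inference still needs the full set of weights in system RAM:

# 72B parameters * 2 bytes per bfloat16 value ~= 144 GB for the weights alone,
# before activations and the KV cache are counted.
n_params = 72e9
bytes_per_param = 2  # bfloat16
print(f"~{n_params * bytes_per_param / 1e9:.0f} GB of RAM for the weights")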

I tried moving the model to the CPU:

...
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
model = model.to(device)

And I am trying text generation:

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False, device=device)
generation_config = dict(max_new_tokens=1024, do_sample=False)

query = 'What is transformer model?'
response, history = model.chat(tokenizer, None, query, generation_config, history=None, return_history=True)

When calling the model.chat() function, it throws this error:

File ~/.cache/huggingface/modules/transformers_modules/nvidia/NVLM-D-72B/5a57d927ac0ab6b0a96ebc90f5ee7901ddca790d/modeling_nvlm_d.py:252, in NVLM_D_Model.chat(self, tokenizer, pixel_values, question, generation_config, history, return_history, num_patches_list, IMG_START_TOKEN, IMG_END_TOKEN, IMG_CONTEXT_TOKEN, verbose, visual_features)
    249     query = query.replace('<image>', image_tokens, 1)
    251 model_inputs = tokenizer(query, return_tensors='pt')
--> 252 input_ids = model_inputs['input_ids'].cuda()
    253 attention_mask = model_inputs['attention_mask'].cuda()
    254 generation_config['eos_token_id'] = eos_token_id

File ~/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:319, in _lazy_init()
    317 if "CUDA_MODULE_LOADING" not in os.environ:
    318     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 319 torch._C._cuda_init()
    320 # Some of the queued calls may reentrantly call _lazy_init();
    321 # we need to just return without initializing in that case.
    322 # However, we must not let any *other* threads in!
    323 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
NVIDIA org

According to the error, you need to update the line:

input_ids = model_inputs['input_ids'].cuda()

to

input_ids = model_inputs['input_ids'].cpu()
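
A slightly more portable variant of the same idea (standard PyTorch, not something specific to NVLM) would be model_inputs['input_ids'].to(model.device), which works whether the model lives on CPU or GPU; the attention_mask line just below it in the traceback needs the same change.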

Hi @boxin-wbx, the tokenizer is already on the CPU (device mapped to cpu).
Could you suggest where I should put input_ids = model_inputs['input_ids'].cpu()?

Does this need to be done in the chat function of the model, i.e. by overriding the existing library definition?
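
For anyone hitting the same issue, here is a minimal sketch of what that edit could look like, assuming you modify the cached remote-code file referenced in the traceback (modeling_nvlm_d.py under ~/.cache/huggingface/modules/transformers_modules/...). Inside chat(), self is the model, and self.device should resolve to the right device if the class inherits from transformers' PreTrainedModel:

# In modeling_nvlm_d.py, around the lines shown in the traceback:
model_inputs = tokenizer(query, return_tensors='pt')
input_ids = model_inputs['input_ids'].to(self.device)            # was .cuda()
attention_mask = model_inputs['attention_mask'].to(self.device)  # was .cuda()

Note that this cached copy is regenerated if the remote code is re-downloaded, so the edit may need to be reapplied.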
