Running the model on CPU

#22
by JugsMa - opened

I loaded the model on CPU, but it requires the model and data to be on a GPU (CUDA) device.
Is there a way to do this on CPU?

I also tried to load it on a GPU with 16 GB of memory, but since I don't have a larger GPU I got this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 52.12 MiB is free. Process 19046 has 14.69 GiB memory in use. Of the allocated memory 14.41 GiB is allocated by PyTorch, and 179.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
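
For what it's worth, the suggestion in the error message itself can be applied like this (a minimal sketch; it only mitigates fragmentation and will not make a 72B-parameter model fit in ~15 GiB of VRAM):

import os
# Must be set before the first CUDA allocation, ideally before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
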
NVIDIA org

I haven't tried using the CPU for inference, but to do that, the following code should work:

import torch
from transformers import AutoModel

path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map="cpu").eval()
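
As a rough back-of-the-envelope sanity check (my own estimate, not from the model card), CPU inference still needs the full set of weights in system RAM:

# 72B parameters * 2 bytes per bfloat16 value ~= 144 GB for the weights alone,
# before activations and the KV cache are counted.
n_params = 72e9
bytes_per_param = 2  # bfloat16
print(f"~{n_params * bytes_per_param / 1e9:.0f} GB of RAM for the weights")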

I tried moving the model to the CPU:

...
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
model = model.to(device)

And I am trying text generation:

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False, device=device)
generation_config = dict(max_new_tokens=1024, do_sample=False)

query = 'What is transformer model?'
response, history = model.chat(tokenizer, None, query, generation_config, history=None, return_history=True)

When calling the model.chat() function, it throws this error:

File ~/.cache/huggingface/modules/transformers_modules/nvidia/NVLM-D-72B/5a57d927ac0ab6b0a96ebc90f5ee7901ddca790d/modeling_nvlm_d.py:252, in NVLM_D_Model.chat(self, tokenizer, pixel_values, question, generation_config, history, return_history, num_patches_list, IMG_START_TOKEN, IMG_END_TOKEN, IMG_CONTEXT_TOKEN, verbose, visual_features)
    249     query = query.replace('<image>', image_tokens, 1)
    251 model_inputs = tokenizer(query, return_tensors='pt')
--> 252 input_ids = model_inputs['input_ids'].cuda()
    253 attention_mask = model_inputs['attention_mask'].cuda()
    254 generation_config['eos_token_id'] = eos_token_id

File ~/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:319, in _lazy_init()
    317 if "CUDA_MODULE_LOADING" not in os.environ:
    318     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 319 torch._C._cuda_init()
    320 # Some of the queued calls may reentrantly call _lazy_init();
    321 # we need to just return without initializing in that case.
    322 # However, we must not let any *other* threads in!
    323 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
NVIDIA org

According to the error, you need to update the line:

input_ids = model_inputs['input_ids'].cuda()

to

input_ids = model_inputs['input_ids'].cpu()
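
A slightly more portable variant of the same idea (standard PyTorch, not something specific to NVLM) would be model_inputs['input_ids'].to(model.device), which works whether the model lives on CPU or GPU; the attention_mask line just below it in the traceback needs the same change.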

Hi @boxin-wbx, the tokenizer is already on the CPU (device mapped to cpu).
Could you suggest where I should put input_ids = model_inputs['input_ids'].cpu()?

Does this need to be done in the chat function of the model, i.e. by overriding the existing library definition?
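
For anyone hitting the same issue, here is a minimal sketch of what that edit could look like, assuming you modify the cached remote-code file referenced in the traceback (modeling_nvlm_d.py under ~/.cache/huggingface/modules/transformers_modules/...). Inside chat(), self is the model, and self.device should resolve to the right device if the class inherits from transformers' PreTrainedModel:

# In modeling_nvlm_d.py, around the lines shown in the traceback:
model_inputs = tokenizer(query, return_tensors='pt')
input_ids = model_inputs['input_ids'].to(self.device)            # was .cuda()
attention_mask = model_inputs['attention_mask'].to(self.device)  # was .cuda()

Note that this cached copy is regenerated if the remote code is re-downloaded, so the edit may need to be reapplied.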
