CUDA out of memory error when using device_map="auto"

#7
by hwasiti - opened

I am trying to use my 2 GPUs + RAM, mapped with device_map="auto".
However, I am getting a CUDA OOM error.
Using only 1 GPU gives the same error when specifying:

import torch
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="1"

If only the CPU is used, it takes about 30 minutes for max_length=2000 on my Core i9-9900K / 64 GB, which is too long. I had hoped that my 2x 1080 Ti (11 GB) could help speed up the text generation.

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")    
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto",  offload_state_dict = True)  #  no disk offloading

input_text = """
# The benefits of deadlifting

## INTRODUCTION
"""

randomizer_value = 0
repetitions = 1

# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(randomizer_value)

# input_ids = tokenizer(input_text, return_tensors="pt").input_ids   ############### CPU only
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

# sample with top_k = 50 and top_p = 0.95, returning a single sequence
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=2000, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=1
)

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 10.92 GiB total capacity; 9.80 GiB already allocated; 9.75 MiB free; 9.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
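As a side note, the last part of the error message suggests setting max_split_size_mb to reduce fragmentation. A minimal sketch of how to try that is below; the value 128 is just an example, and it will not help if the weights simply do not fit on the GPU:

import os

# must be set before the first CUDA allocation (ideally before any GPU work in torch)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"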

Answering my own question:

We need to remap the model so that the part of the map occupying each GPU is a little smaller, with one or more layers offloaded to another device such as the CPU or another GPU.

Here is an example of how I did it for the galactica-30b model.
Run the following code to inspect the current mapping that produces the CUDA error:

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")    
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map=device_map,  torch_dtype=torch.float16)  

model.hf_device_map

It will output a dictionary describing the map, like:

{'model.decoder.embed_tokens': 0,
 'lm_head': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
...
 'model.decoder.layers.5': 0,
 'model.decoder.layers.6': 0,
 'model.decoder.layers.7': 0,
 'model.decoder.layers.8': 1,
...
 'model.decoder.layers.14': 1,
 'model.decoder.layers.15': 1,
 'model.decoder.layers.16': 1,
 'model.decoder.layers.17': 1,
 'model.decoder.layers.18': 'cpu',
...
 'model.decoder.layers.45': 'cpu',
 'model.decoder.layers.46': 'cpu',
 'model.decoder.layers.47': 'cpu'}

Now change the map so that GPUs 0 and 1 hold slightly fewer layers, and move the removed layers to the CPU. Define the following device_map before initializing the model the next time you execute the script. (Compared with the auto map above, layer 7 moved from GPU 0 to GPU 1, and layers 16 and 17 moved from GPU 1 to the CPU, which reduces the memory mapped on both GPUs.)

device_map = {'model.decoder.embed_tokens': 0,
 'lm_head': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.final_layer_norm': 0,
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 'model.decoder.layers.2': 0,
 'model.decoder.layers.3': 0,
 'model.decoder.layers.4': 0,
 'model.decoder.layers.5': 0,
 'model.decoder.layers.6': 0,
 'model.decoder.layers.7': 1,
 'model.decoder.layers.8': 1,
 'model.decoder.layers.9': 1,
 'model.decoder.layers.10': 1,
 'model.decoder.layers.11': 1,
 'model.decoder.layers.12': 1,
 'model.decoder.layers.13': 1,
 'model.decoder.layers.14': 1,
 'model.decoder.layers.15': 1,
 'model.decoder.layers.16': 'cpu',
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18': 'cpu',
 'model.decoder.layers.19': 'cpu',
 'model.decoder.layers.20': 'cpu',
 'model.decoder.layers.21': 'cpu',
 'model.decoder.layers.22': 'cpu',
 'model.decoder.layers.23': 'cpu',
 'model.decoder.layers.24': 'cpu',
 'model.decoder.layers.25': 'cpu',
 'model.decoder.layers.26': 'cpu',
 'model.decoder.layers.27': 'cpu',
 'model.decoder.layers.28': 'cpu',
 'model.decoder.layers.29': 'cpu',
 'model.decoder.layers.30': 'cpu',
 'model.decoder.layers.31': 'cpu',
 'model.decoder.layers.32': 'cpu',
 'model.decoder.layers.33': 'cpu',
 'model.decoder.layers.34': 'cpu',
 'model.decoder.layers.35': 'cpu',
 'model.decoder.layers.36': 'cpu',
 'model.decoder.layers.37': 'cpu',
 'model.decoder.layers.38': 'cpu',
 'model.decoder.layers.39': 'cpu',
 'model.decoder.layers.40': 'cpu',
 'model.decoder.layers.41': 'cpu',
 'model.decoder.layers.42': 'cpu',
 'model.decoder.layers.43': 'cpu',
 'model.decoder.layers.44': 'cpu',
 'model.decoder.layers.45': 'cpu',
 'model.decoder.layers.46': 'cpu',
 'model.decoder.layers.47': 'cpu'}

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")  
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map=device_map,  torch_dtype=torch.float16)  # GPU: manually device mapped # do not map to disk (no disk offloading)

Keep experimenting: observe GPU memory utilization with nvidia-smi and increase/decrease the number of layers mapped to the GPUs until you find the sweet spot.
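If you prefer checking from inside the script rather than nvidia-smi, here is a quick sketch using PyTorch's own counters (these report what PyTorch has allocated/reserved, not the total board usage):

import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: {alloc:.2f} GiB allocated, {reserved:.2f} GiB reserved")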

Hope that helps somebody :)

Hey, I'm using model = BartForSequenceClassification.from_pretrained("facebook/bart-base") and your solution does not work for me. I'm getting RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling 'cublasCreate(handle)'. I made a forum question about it here.

Thanks so much! It helped me resolve the issue!
