Unable to load model on GPU
Hi, I'm trying to play with this model, but I cannot load it on the GPU (the 16 GB T4 provided by Colab). Even if I specify device_map="cuda:0", it still loads into RAM. Any advice? I have a second question: why does the model weigh so much (~30 GB) despite having only 7B parameters?
import torch
import transformers
from transformers import AutoModelForCausalLM

quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    # bnb_4bit_compute_dtype=torch.bfloat16
)
llm = AutoModelForCausalLM.from_pretrained(
    "galatolo/cerbero-7b",
    quantization_config=quantization_config,
    device_map="cuda:0",
)
Hi, it weighs that much because the weights are stored in float32 (rather than the more common float16), so each parameter takes 4 bytes instead of 2.
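As a rough back-of-the-envelope check (approximate figures, ignoring the tokenizer and other files in the repo):

params = 7_000_000_000
print(f"float32: ~{params * 4 / 1e9:.0f} GB")  # ~28 GB, close to the ~30 GB download
print(f"float16: ~{params * 2 / 1e9:.0f} GB")  # ~14 GB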
I attempted to load the model using Google Colab, and it appears to crash due to insufficient RAM.
I will upload a float16 variant; maybe that will solve the issue.
I uploaded the float16 variant, and you can load it using the following code:
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16")
However, it appears that Colab does not have enough RAM to handle this. I believe the best option is to use the llama.cpp version, which I have already quantized to 4 bits.
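For completeness, here is a minimal sketch of how the 4-bit llama.cpp version could be run through llama-cpp-python. The file name and generation settings below are placeholders for illustration, not taken from this thread; check the model page for the actual quantized file.

from llama_cpp import Llama

llm = Llama(
    model_path="cerbero-7b.Q4_K_M.gguf",  # hypothetical file name, use the published one
    n_gpu_layers=-1,  # offload all layers to the T4 if VRAM allows
    n_ctx=2048,
)
out = llm("Ciao, come stai?", max_tokens=64)
print(out["choices"][0]["text"])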