n_ctx = 0 causes very high VRAM usage

#4
by Nandobob23

I'm trying to use a Q6_K quantized model (100GB) on a machine with 384GB of VRAM. The model loads and uses 100GB of VRAM in total, but when I set n_ctx = 0 the total usage jumps to 240GB. I am using llama-cpp-python with CUDA 12.4 and Python 3.11. I am trying to send a 364-page PDF (extracted text only, about 99k tokens) to the model, but I get a CUDA out-of-memory error. It works when I only send 80 pages. I tried the same thing with the 16-bit Mistral model (224GB model size) using the transformers library and was able to do 182 pages. I am not sure what is happening, since I thought the smaller model size would leave more VRAM free for tokens.
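Roughly, the load looks like this (the model path is a placeholder and everything not shown is left at its defaults):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-large-q6_k.gguf",  # placeholder path to the Q6_K GGUF
    n_ctx=0,          # 0 = use the context length stored in the model metadata
    n_gpu_layers=-1,  # offload all layers to the GPUs
)
```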

It's possible that llama.cpp doesn't handle splitting context across cards as cleanly as transformers does. Also, I assume n_ctx = 0 defaults to the model's maximum context, which in this case is 131,072, and pre-allocating the KV cache and compute buffers for that could definitely cause issues.
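If that's what's happening, one thing worth trying is capping n_ctx to roughly what the document needs instead of letting it default to the full 131k. A minimal sketch (the numbers are illustrative, and flash_attn is only available on newer llama-cpp-python builds):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-large-q6_k.gguf",  # placeholder path
    n_ctx=104_000,     # headroom for the ~99k prompt tokens plus the reply
    n_gpu_layers=-1,   # keep all layers on the GPUs
    flash_attn=True,   # if the build supports it, this reduces the attention compute buffers
)
```

The KV cache grows linearly with n_ctx, so shrinking the window from 131,072 to ~104,000 already cuts that allocation proportionally; some builds also expose type_k/type_v for a quantized KV cache, which would reduce it further.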

What's your setup in terms of GPUs?

I have 4 H100s.

Is there a way to use the GGUF files with the transformers library?
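Recent transformers versions can load GGUF checkpoints directly via the gguf_file argument (assuming a version new enough to include GGUF support). A sketch with placeholder repo and file names:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: point these at the actual GGUF repo and filename.
model_id = "your-org/your-model-GGUF"
gguf_file = "model-Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file, device_map="auto")
```

Note that transformers dequantizes the GGUF weights to a regular torch dtype on load, so the memory savings of the quantized file don't carry over.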
