n_ctx = 0 causes very high VRAM usage
I'm trying to use a Q6_K-quantized model (100 GB) on a machine with 384 GB of VRAM. The model loads and uses 100 GB of VRAM in total, but when I set n_ctx = 0, VRAM usage climbs to 240 GB. I am using llama-cpp-python with CUDA 12.4 and Python 3.11.

I am trying to send a 364-page PDF document (extracted text only, about 99k tokens) to the model, but I get a CUDA out-of-memory error. It works when I only send 80 pages. I tried the same thing with the 16-bit Mistral model (224 GB) using the transformers library and was able to do 182 pages. I don't understand what is happening, since I thought the smaller model size would let me fit more tokens in VRAM.
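For reference, this is roughly how I'm loading the model and sending the document (the path, file name, and prompt handling are simplified placeholders):

```python
from llama_cpp import Llama

# Load the ~100 GB Q6_K GGUF with every layer offloaded to the GPUs.
llm = Llama(
    model_path="/models/model-q6_k.gguf",  # placeholder path
    n_ctx=0,          # 0 = take the context length from the GGUF metadata
    n_gpu_layers=-1,  # offload all layers
)

# Placeholder for the text extracted from the 364-page PDF (~99k tokens).
pdf_text = open("document.txt", encoding="utf-8").read()

out = llm(f"Summarize the following document:\n\n{pdf_text}", max_tokens=512)
print(out["choices"][0]["text"])
```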
It's possible that llama.cpp doesn't handle splitting the context across cards as cleanly as transformers does. I'd also assume n_ctx = 0 defaults to the model's maximum context, which in this case is 131072, so the KV cache gets allocated for the full 131072-token window, which could definitely cause issues.
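If you only need room for the ~99k-token document plus the reply, it might be worth setting n_ctx explicitly instead of 0, so the KV cache is only allocated for the window you actually use. Rough sketch; the exact value is just an example:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/model-q6_k.gguf",  # same placeholder path as above
    n_ctx=110_000,     # explicit window: ~99k prompt tokens plus headroom for the output
    n_gpu_layers=-1,
    # tensor_split=(...),  # optionally control how the weights are divided across GPUs
)
```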
What's your setup in terms of GPUs?
I have 4 H100s.
Is there a way to use the GGUF files with the transformers library?
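From the transformers docs it looks like newer versions can load a GGUF checkpoint directly through a gguf_file argument to from_pretrained (the weights appear to get dequantized on load, so the memory footprint would be closer to the full-precision model). A rough sketch of what I have in mind, with placeholder names:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "some-org/some-model-GGUF"   # placeholder Hugging Face repo
gguf_file = "model-q6_k.gguf"          # placeholder GGUF filename inside that repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    gguf_file=gguf_file,
    device_map="auto",  # spread the (dequantized) weights across the available GPUs
)
```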