vLLM out of memory
#2 · opened by cfrancois7
I only have an RTX 3070 with 8 GB of VRAM.
When I run your AutoAWQ code, it works well on my machine: I can control the maximum token length, and it runs with around 7 GB of VRAM.
But when I try vLLM, the script tries to allocate 14 GB of GPU memory and crashes, and I cannot find a way to change the maximum token length.
Try with --max-model-len 512
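For context, vLLM preallocates most of the available GPU memory for its KV cache (controlled by gpu_memory_utilization, 0.9 by default), which is why it asks for far more memory than AutoAWQ alone. If you are launching the OpenAI-compatible server from the command line, a sketch of an invocation with a reduced context length and memory budget might look like this (the exact entrypoint module can vary between vLLM versions):

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/zephyr-7B-beta-AWQ \
    --quantization awq \
    --dtype auto \
    --max-model-len 512 \
    --gpu-memory-utilization 0.8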
I reinstalled and tested with:
from vllm import LLM

# Cap the context length and the GPU memory fraction so the model fits in 8 GB of VRAM
llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization="awq",
    dtype="auto",
    max_model_len=512,
    gpu_memory_utilization=0.8,
)
And it works.
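For completeness, a minimal sketch of actually generating text with that llm object could look like the following; the prompt and sampling values are illustrative and not from the original thread:

from vllm import SamplingParams

# Illustrative sampling settings; the prompt plus max_tokens must fit within max_model_len (512 here)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What does AWQ quantization do?"], sampling_params)
print(outputs[0].outputs[0].text)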