vLLM out of memory
#2 · opened by cfrancois7
I only have an RTX 3070 with 8 GB of VRAM.
When I run your AutoAWQ code, it works well on my machine: I can control the maximum token length, and it runs with around 7 GB of VRAM.
But when I try vLLM, the script tries to allocate 14 GB of GPU memory and crashes, and I cannot find a way to change the maximum token length.
Try with --max-model-len 512
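For context, vLLM preallocates most of the available GPU memory for its KV cache (controlled by gpu_memory_utilization, 0.9 by default), which is why it asks for far more memory than AutoAWQ alone. If you are launching the OpenAI-compatible server from the command line, a sketch of an invocation with a reduced context length and memory budget might look like this (the exact entrypoint module can vary between vLLM versions):

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/zephyr-7B-beta-AWQ \
    --quantization awq \
    --dtype auto \
    --max-model-len 512 \
    --gpu-memory-utilization 0.8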
I reinstalled and tested with:
from vllm import LLM

# Cap the context length and the GPU memory fraction so the model fits in 8 GB of VRAM
llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization="awq",
    dtype="auto",
    max_model_len=512,
    gpu_memory_utilization=0.8,
)
And it works.
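For completeness, a minimal sketch of actually generating text with that llm object could look like the following; the prompt and sampling values are illustrative and not from the original thread:

from vllm import SamplingParams

# Illustrative sampling settings; the prompt plus max_tokens must fit within max_model_len (512 here)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What does AWQ quantization do?"], sampling_params)
print(outputs[0].outputs[0].text)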