VRAM consumption when using GPU (CUDA)

#37
by Sunjay353 - opened

I noticed that the VRAM usage increases by around the model size when loading the model, which is expected. However, it then increases again by roughly twice the model size during inference, so the total VRAM consumption is approximately three times the model size. Furthermore, this additional memory is not released after inference; it is only freed when the model is unloaded. Is this normal and expected behavior?

Yes, it's normal and expected. Transformers consume memory proportional to the square of the number of tokens in the sequence.
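As a rough back-of-the-envelope sketch (the batch size, head count, and dtype below are illustrative assumptions, not values read from this model, and implementations using fused/flash attention avoid materializing the full score matrix), the attention scores for a single layer alone scale with the square of the sequence length:

# Hypothetical figures to illustrate quadratic scaling with sequence length.
batch, heads, dtype_bytes = 1, 16, 2  # fp16
for seq_len in (512, 1024, 2048):
    # Attention scores have shape (batch, heads, seq_len, seq_len)
    scores_bytes = batch * heads * seq_len * seq_len * dtype_bytes
    print(f"seq_len={seq_len}: ~{scores_bytes / 1024**2:.0f} MiB per layer")

Doubling the sequence length quadruples this term.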

I suggest wrapping the call to model.generate() in a torch.no_grad() context manager, so that PyTorch does not retain activations for backpropagation, to see if that helps:

import torch

# Disable gradient tracking so activations aren't retained for backprop
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
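
If you want to check where the memory is going, note that PyTorch's caching allocator holds on to freed blocks instead of returning them to the driver, so tools like nvidia-smi report the cached total rather than live usage. A minimal sketch for inspecting and releasing the cache (assuming generated_ids is no longer needed):

import torch

# Memory held by live tensors vs. memory PyTorch has reserved from the driver
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")

# Drop references you no longer need, then return cached blocks to the driver
del generated_ids
torch.cuda.empty_cache()
print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")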

Thank you for the feedback!

Sunjay353 changed discussion status to closed
