Generation takes 6+ min and uses 20+ GB of VRAM

#17
by NickyNicky - opened

Flash Attention 2 is enabled.

Generating more than 5k tokens uses over 20 GB of VRAM and takes over 6 minutes in total.

Is it possible to reduce inference time and memory consumption?

Sliding window configuration looks strange.
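
For context on the memory question, here is a rough sketch of how the KV cache grows with generated length and how a sliding window would cap it. The function name `kv_cache_bytes` and all the numbers in the example (32 layers, 32 KV heads, head dimension 96, a 2048-token window, fp16) are hypothetical placeholders, not values taken from this model's config; substitute the ones from your own `config.json`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, sliding_window=None):
    """Approximate KV-cache size in bytes for a single sequence.

    Assumes one key and one value tensor per layer (hence the factor 2).
    If sliding_window is set and actually honored by the cache, at most
    that many positions are kept per layer.
    """
    cached = seq_len if sliding_window is None else min(seq_len, sliding_window)
    return 2 * num_layers * num_kv_heads * head_dim * cached * dtype_bytes


if __name__ == "__main__":
    # Hypothetical config values, for illustration only.
    full = kv_cache_bytes(32, 32, 96, seq_len=5000)
    windowed = kv_cache_bytes(32, 32, 96, seq_len=5000, sliding_window=2048)
    print(f"full cache:     {full / 2**30:.2f} GiB")
    print(f"windowed cache: {windowed / 2**30:.2f} GiB")
```

This only estimates the cache. Weights, activations, and allocator overhead come on top, so comparing the estimate with the measured peak may help show whether the sliding window is being applied as configured.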

Microsoft org

vLLM support is coming soon: https://github.com/vllm-project/vllm/pull/4298

nguyenbh changed discussion status to closed
