Generation takes +6 min and +20 GB of VRAM
#17
by
NickyNicky
- opened
With Flash Attention 2 enabled, generating 5k more tokens uses an extra ~20 GB of VRAM and adds about 6 minutes of total time.
Is it possible to reduce inference time and memory consumption?
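Part of the extra memory during a long generation is the KV cache, which grows linearly with the number of generated tokens. Below is a minimal sketch of that estimate; the hyperparameters (32 layers, 32 KV heads, head dim 96, fp16) are illustrative assumptions, not this model's actual config, so plug in the values from the model's `config.json` to get a real number.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32, head_dim=96, dtype_bytes=2):
    # Per layer, K and V each store a [seq_len, num_kv_heads, head_dim] tensor.
    # Total = 2 (K and V) * layers * kv_heads * head_dim * bytes-per-element * tokens.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

# Rough KV-cache cost of 5k extra generated tokens under the assumed config:
extra = kv_cache_bytes(5000)
print(f"{extra / 1024**3:.2f} GiB")  # KV cache alone; activations add more on top
```

This accounts for only a fraction of the reported +20 GB; the rest comes from activations and framework overhead, which is why engines with paged KV-cache management (e.g. vLLM, mentioned below) help.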
The sliding-window configuration looks strange.
Soon we will have vLLM support: https://github.com/vllm-project/vllm/pull/4298
nguyenbh changed discussion status to closed