Improve response time
I used the llama-2-7b-chat.ggmlv3.q8_0.bin model and tried to get a response from the prompt I provided, but it takes around 2-3 minutes to return a response locally. How can I reduce the response time?
@Situn007
Well, first of all, I think you should update llama.cpp (or whatever tool you are using to run the model). GGML is a very outdated format; you should use GGUF models instead.
If you are using a GPU, install llama.cpp with cuBLAS support and set the number of GPU layers to -1 so all layers are offloaded.
Otherwise, install llama.cpp with OpenBLAS support.
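For example, here is a minimal sketch assuming you are using the llama-cpp-python bindings (the install flags and the GGUF file name below are illustrative and may differ for your version):

```python
# Install with cuBLAS (GPU) support, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --upgrade llama-cpp-python
# or, on a CPU-only machine, with OpenBLAS:
#   CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --upgrade llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q8_0.gguf",  # hypothetical GGUF file name
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only
    n_ctx=2048,       # context window size
)

output = llm(
    "Q: Why is the sky blue? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

With all layers offloaded to the GPU, a 7B Q8_0 model typically responds far faster than a CPU-only run.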
@YaTharThShaRma999
Can I use a GGUF model the same way I am currently using GGML?
@Janmejay123 Yeah, it's the exact same thing, except GGUF has a bunch of metadata attached (prompt format, RoPE settings, and more), so it's easier to run. It's still a single file.
And once you update llama.cpp, it should be much faster, since a lot of performance improvements have been introduced since the GGML days.
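To illustrate "the exact same thing", here is a hedged sketch (still assuming llama-cpp-python; the GGUF file name is hypothetical). The loading call is unchanged from the GGML case, and because GGUF can embed the chat/prompt template as metadata, the high-level chat API can often use it directly; if your file lacks that metadata, you can pass a chat format explicitly.

```python
from llama_cpp import Llama

# Same call as with GGML; only the model file changes from .ggmlv3.q8_0.bin to .gguf.
llm = Llama(
    model_path="llama-2-7b-chat.Q8_0.gguf",  # hypothetical GGUF file name
    n_gpu_layers=-1,
    # chat_format="llama-2",  # uncomment if the file has no embedded chat template
)

# GGUF stores extra metadata (prompt format, RoPE settings, ...),
# which lets the chat helper format Llama-2 chat prompts for you.
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
    max_tokens=64,
)
print(reply["choices"][0]["message"]["content"])
```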