Improve response time
I used the llama-2-7b-chat.ggmlv3.q8_0.bin model and tried to get a response from the prompt I provided, but it takes around 2-3 minutes to return a response locally. How can I reduce the response time?
@Situn007
Well, first of all, I think you should update llama.cpp (or whatever tool you are using to run the model). GGML is a very outdated format; you should use GGUF models instead.
If you are using a GPU, install llama.cpp with cuBLAS support and set the number of GPU layers to -1 so all layers are offloaded.
Otherwise, install llama.cpp with OpenBLAS support.
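For example, here is a minimal sketch assuming you are using the llama-cpp-python bindings (the install flags and the GGUF file name below are illustrative and may differ for your version):

```python
# Install with cuBLAS (GPU) support, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --upgrade llama-cpp-python
# or, on a CPU-only machine, with OpenBLAS:
#   CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --upgrade llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q8_0.gguf",  # hypothetical GGUF file name
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only
    n_ctx=2048,       # context window size
)

output = llm(
    "Q: Why is the sky blue? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

With all layers offloaded to the GPU, a 7B Q8_0 model typically responds far faster than a CPU-only run.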
@YaTharThShaRma999
Can I use a GGUF model the same way I am currently using GGML?
@Janmejay123 Yeah, it's the exact same thing, except GGUF has a bunch of metadata attached (prompt format, RoPE settings, and more), so it's easier to run. It's still a single file.
And once you update llama.cpp, it should be much faster, since a lot of performance improvements have been introduced since the GGML days.
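To illustrate "the exact same thing", here is a hedged sketch (still assuming llama-cpp-python; the GGUF file name is hypothetical). The loading call is unchanged from the GGML case, and because GGUF can embed the chat/prompt template as metadata, the high-level chat API can often use it directly; if your file lacks that metadata, you can pass a chat format explicitly.

```python
from llama_cpp import Llama

# Same call as with GGML; only the model file changes from .ggmlv3.q8_0.bin to .gguf.
llm = Llama(
    model_path="llama-2-7b-chat.Q8_0.gguf",  # hypothetical GGUF file name
    n_gpu_layers=-1,
    # chat_format="llama-2",  # uncomment if the file has no embedded chat template
)

# GGUF stores extra metadata (prompt format, RoPE settings, ...),
# which lets the chat helper format Llama-2 chat prompts for you.
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
    max_tokens=64,
)
print(reply["choices"][0]["message"]["content"])
```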