Unable to load model when used with `llama-cli`

#3
by dimanis - opened

I'm trying to run ggml-model-Q4_K_M.gguf on my MacBook M1 Pro (16 GB) and getting this error. I've spent some time searching for possible answers online, but wasn't able to find anything helpful.

Here are some relevant logs:

llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_kv_cache_init:      Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml_backend_metal_buffer_type_alloc_buffer: error: failed to allocate buffer, size =  8480.02 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 8891928576
llama_new_context_with_model: failed to allocate compute buffers
ggml_metal_free: deallocating
llama_init_from_gpt_params: error: failed to create context with model '.llama/7B/ggml-model-Q4_K_M.gguf'
main: error: unable to load model
chatpdflocal org

Thanks for raising this issue; we will look into it right away.

chatpdflocal org

Hi dimanis,
We have tested on x86 and on an M2 Mac Pro, and both work fine.
Looking at your output log, you could try reducing your Metal buffer size.

chatpdflocal org

By the way, could you share the exact command you use to start llama-cli?

@Vincent-Lee sure, I'm running it with the parameters from the quick start section of the llama.cpp repo:
llama-cli -m /path/to/model/model_name.gguf -p "I believe the meaning of life is" -n 128

@Vincent-Lee Are you building llama.cpp from source? I'm running the brew version. Perhaps that could be the culprit?

chatpdflocal org

@dimanis
I suspect the reason is that the context window is set too large.
In llama-cli, -c, --ctx-size N sets the size of the prompt context (default: 0, where 0 means the value is loaded from the model). Since your command does not set the prompt context, it is taken from the model, which is why the log shows llama_new_context_with_model: n_ctx = 131072.
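As a rough sanity check (my assumption here: a Llama-3.1-style 8B model with 32 layers, 8 KV heads, head dimension 128, and an f16 cache; adjust the numbers if your model differs), the KV cache alone at that context size comes to
131072 ctx x 32 layers x 8 KV heads x 128 dims x 2 bytes = 8192 MiB for K, plus the same again for V.
That matches the 16384 MiB "KV self size" in your log, and it already exceeds the recommendedMaxWorkingSetSize of 11453.25 MiB reported for your 16 GB M1 Pro, even before the ~8480 MiB compute buffer is allocated.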

Could you add "-c 2048" to the start command and try loading the model again?
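For example, simply appending it to the command you posted (same placeholder model path as above):
llama-cli -m /path/to/model/model_name.gguf -p "I believe the meaning of life is" -n 128 -c 2048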

chatpdflocal org

@dimanis Yes, we have built llama.cpp from source.

chatpdflocal org

@dimanis Has the problem been solved?

@Vincent-Lee yes, I can confirm that adding -c 2048 to the start command solved the problem. Thanks!
