Unable to load model when used with `llama-cli`
Trying to run ggml-model-Q4_K_M.gguf on my MacBook M1 Pro (16GB) and getting the error below. I've spent some time searching for possible answers online, but wasn't able to find anything helpful.
Here are some relevant logs:
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
llama_kv_cache_init: Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_backend_metal_buffer_type_alloc_buffer: error: failed to allocate buffer, size = 8480.02 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 8891928576
llama_new_context_with_model: failed to allocate compute buffers
ggml_metal_free: deallocating
llama_init_from_gpt_params: error: failed to create context with model '.llama/7B/ggml-model-Q4_K_M.gguf'
main: error: unable to load model
Thanks for your issue; we will check it right now.
Hi @dimanis,
We have tested on x86 and an M2 Mac Pro, and both work fine.
Looking at your output log, you may want to try reducing your Metal buffer size.
By the way, could you share the exact command you use to start llama-cli?
@Vincent-Lee, sure, I'm running it with the parameters from the quick start section of the llama.cpp repo:
llama-cli -m /path/to/model/model_name.gguf -p "I believe the meaning of life is" -n 128
@Vincent-Lee Are you building llama.cpp from source? I'm running the brew version. Perhaps that could be the culprit?
@dimanis
I suspect the reason is that the context window is set too large.
-c, --ctx-size N: size of the prompt context (default: 0, 0 = loaded from model). Since no prompt context size was set, it was loaded from the model, which is why the log shows llama_new_context_with_model: n_ctx = 131072. With a context that large, the f16 KV cache alone needs 16384.00 MiB, which is well above the M1 Pro's recommendedMaxWorkingSetSize of 11453.25 MB, so the Metal buffer allocation fails.
Could you add "-c 2048" to the start command and try to load the model again?
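For example, using your quick start command (path and prompt are the placeholders from your earlier message):
llama-cli -m /path/to/model/model_name.gguf -c 2048 -p "I believe the meaning of life is" -n 128
At -c 2048 the KV cache should shrink to roughly 256 MiB (16384 MiB × 2048 / 131072), which fits comfortably within the available working set.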
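If it helps to sanity-check the numbers, here is a rough back-of-the-envelope sketch. The model dimensions below (32 layers, 8 KV heads, head size 128, f16 K/V) are an assumption about a Llama-3-style 8B model and are not read from your GGUF, so treat this as an estimate rather than an exact figure:

# Assumed dimensions: 32 layers, 8 KV heads, head size 128, f16 (2 bytes per element)
bytes_per_token = 2 * 32 * 8 * 128 * 2       # K and V together: 131072 bytes (128 KiB) per token
print(131072 * bytes_per_token / 1024**2)    # 16384.0 MiB at n_ctx = 131072 (matches "KV self size" in the log)
print(2048 * bytes_per_token / 1024**2)      #   256.0 MiB at -c 2048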
@dimanis Yes, we have built llama.cpp from source.
@dimanis Has the problem been solved?
@Vincent-Lee, yes, I can confirm that adding `-c 2048` to the start command solved the problem. Thanks!