prompt eval too slow
Hey, do you know why the prompt eval is so slow?
Below is the log:
Log start
main: build = 2664 (ab9a3240)
main: built with MSVC 19.38.33133.0 for x64
main: seed = 1713113009
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from D:\codebase\ai\models\dranger003\c4ai-command-r-v01-iMat.GGUF\ggml-c4ai-command-r-35b-v01-iq3_xxs.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = command-r
llama_model_loader: - kv 1: general.name str = 960d0bad58d35c695ff69dda1d2b90978f40b196
llama_model_loader: - kv 2: command-r.block_count u32 = 40
llama_model_loader: - kv 3: command-r.context_length u32 = 131072
llama_model_loader: - kv 4: command-r.embedding_length u32 = 8192
llama_model_loader: - kv 5: command-r.feed_forward_length u32 = 22528
llama_model_loader: - kv 6: command-r.attention.head_count u32 = 64
llama_model_loader: - kv 7: command-r.attention.head_count_kv u32 = 64
llama_model_loader: - kv 8: command-r.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 9: command-r.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 23
llama_model_loader: - kv 11: command-r.logit_scale f32 = 0.062500
llama_model_loader: - kv 12: command-r.rope.scaling.type str = none
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,253333] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 5
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 255001
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 41 tensors
llama_model_loader: - type q5_K: 1 tensors
llama_model_loader: - type iq3_xxs: 160 tensors
llama_model_loader: - type iq3_s: 40 tensors
llama_model_loader: - type iq2_s: 80 tensors
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = command-r
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 253333
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 64
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 8192
llm_load_print_meta: n_embd_v_gqa = 8192
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 6.2e-02
llm_load_print_meta: n_ff = 22528
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = none
llm_load_print_meta: freq_base_train = 8000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 35B
llm_load_print_meta: model ftype = IQ3_XXS - 3.0625 bpw
llm_load_print_meta: model params = 34.98 B
llm_load_print_meta: model size = 12.87 GiB (3.16 BPW)
llm_load_print_meta: general.name = 960d0bad58d35c695ff69dda1d2b90978f40b196
llm_load_print_meta: BOS token = 5 ''
llm_load_print_meta: EOS token = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 136 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloaded 36/41 layers to GPU
llm_load_tensors: CPU buffer size = 13181.28 MiB
llm_load_tensors: CUDA0 buffer size = 10625.62 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 64.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 576.00 MiB
llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1891.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 33.01 MiB
llama_new_context_with_model: graph nodes = 1208
llama_new_context_with_model: graph splits = 44
system_info: n_threads = 14 / 28 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful AI programming assistant.<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Write hello world program in python.<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Here's a simple "Hello, World!" program in Python:
print("Hello, World!")
You can save this code in a file with a .py extension, like hello_world.py, and then run it using the Python interpreter:
python hello_world.py
When you run the program, it will display "Hello, World!" as the output. [end of text]
llama_print_timings: load time = 3750.31 ms
llama_print_timings: sample time = 11.56 ms / 88 runs ( 0.13 ms per token, 7613.12 tokens per second)
llama_print_timings: prompt eval time = 78029.30 ms / 25 tokens ( 3121.17 ms per token, 0.32 tokens per second)
llama_print_timings: eval time = 8844.64 ms / 87 runs ( 101.66 ms per token, 9.84 tokens per second)
llama_print_timings: total time = 86948.98 ms / 112 tokens
Log end
@lfjmgs Not sure, but it could be that CUDA is falling back to shared system memory under the sysmem fallback policy - I recall this totally kills performance. There is a setting in the NVIDIA Control Panel under Manage 3D Settings (it's called "CUDA - Sysmem Fallback Policy"), and every time you update your driver it resets to "Driver Default", so I always change it back to "Prefer No Sysmem Fallback".
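For what it's worth, the numbers in the log already point that way: the CUDA0 weight buffer (~10.4 GiB) plus the KV cache (576 MiB) and the compute buffer (~1.9 GiB) add up to roughly 12.8 GiB against the 4070 Ti's 12 GB, which is exactly the situation where the fallback kicks in. Below is a rough sketch (assuming the nvidia-ml-py / pynvml package is installed; the buffer figures are simply copied from the load log above) to check whether a given run fits in free VRAM:

# Rough check: does what llama.cpp puts on the GPU actually fit in free VRAM?
# Assumes the nvidia-ml-py package (imported as pynvml) is installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # device 0 = the RTX 4070 Ti in the log
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # byte counts: .total / .free / .used

# Buffer sizes copied from the load log above (MiB) - adjust for your own run.
weights_mib = 10625.62   # "CUDA0 buffer size"
kv_mib      = 576.00     # "CUDA0 KV buffer size"
compute_mib = 1891.00    # "CUDA0 compute buffer size"
needed_mib  = weights_mib + kv_mib + compute_mib

total_mib = mem.total / 2**20
free_mib  = mem.free  / 2**20
print(f"VRAM: {total_mib:.0f} MiB total, {free_mib:.0f} MiB free; llama.cpp wants ~{needed_mib:.0f} MiB")
if needed_mib > free_mib:
    print("Overcommitted -> sysmem fallback (or OOM); try offloading fewer layers with -ngl")

pynvml.nvmlShutdown()

If it is overcommitted, either disabling the fallback or dropping a few layers off the GPU (a lower -ngl) should bring prompt eval back to normal speed.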
Thank you for the reply. It's a bit strange that the setting doesn't exist in the NVIDIA Control Panel. My NVIDIA driver version is 546.12, which should have this setting.