Unable to use GGUF file with llama.cpp

by iambulb - opened

I get the following error when I try to load the model with llama.cpp:

llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                          llama.block_count u32     
llama_model_loader: - kv   3:                       llama.context_length u32     
llama_model_loader: - kv   4:                     llama.embedding_length u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.attention.head_count u32     
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   8:                       llama.rope.freq_base f32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                           llama.vocab_size u32     
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32     
llama_model_loader: - kv  13:                       tokenizer.ggml.model str     
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr     
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32     
llama_model_loader: - kv  20:                    tokenizer.chat_template str     
llama_model_loader: - kv  21:               general.quantization_version u32     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
error loading model: cannot find tokenizer scores in model file

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf'
load_binding_model: error: unable to load model
Loading the model failed: failed loading model
exit status 1

I also tried converting a working copy of Llama-3-8B-Lexi-Uncensored to GGUF myself, following the instructions in the llama.cpp repo:

# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert-hf-to-gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY

The conversion to GGUF succeeds, but I still get the same error when I try to load the resulting file.

I believe the problem is related to the tokenizer scores: the metadata dump above lists tokenizer.ggml.merges but no tokenizer.ggml.scores entry.
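A quick way to check this against the loader output is a small Python sketch like the one below. It only parses the "kv" lines that llama_model_loader prints and reports whether tokenizer.ggml.scores or tokenizer.ggml.merges appears among them. The function names (metadata_keys, tokenizer_summary) are made up for illustration and are not part of llama.cpp, and the sample text is just a few lines copied from the log above.

```python
import re

# Matches lines such as:
#   llama_model_loader: - kv  16:   tokenizer.ggml.merges arr
KV_LINE = re.compile(r"llama_model_loader: - kv\s+\d+:\s+(\S+)\s+(\S+)")


def metadata_keys(log_text):
    """Return {key: type} for every 'kv' line in a llama_model_loader dump."""
    keys = {}
    for line in log_text.splitlines():
        match = KV_LINE.search(line)
        if match:
            keys[match.group(1)] = match.group(2)
    return keys


def tokenizer_summary(log_text):
    """Report which tokenizer arrays are listed in the metadata dump."""
    keys = metadata_keys(log_text)
    return {
        "has_scores": "tokenizer.ggml.scores" in keys,
        "has_merges": "tokenizer.ggml.merges" in keys,
    }


# A few lines copied from the log in this post (hypothetical variable name).
sample = """\
llama_model_loader: - kv  13:                       tokenizer.ggml.model str
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32
"""

print(tokenizer_summary(sample))
```

To run it on a full log, save the loader output to a text file and pass its contents to tokenizer_summary instead of the sample string.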
