https://huggingface.co/FourOhFour/Deedlit_4B

#249
by jeiku - opened

Support for the Minitron Width Base models has been merged into the latest llama.cpp. I'm sorry about the mix-up yesterday, but I would greatly appreciate it if you could run these for me, since I have no interest in repeating my own quants in such a short span of time. Thank you so much, and again, sorry about yesterday.

absolutely no problem, especially not for a 4b :)

Hmm, which change is it? I can't really see a change to the conversion script, which would indicate that the conversion was already correct (that's good news).

https://github.com/ggerganov/llama.cpp/commit/75e1dbbaaba5d762223370f3dab9db24a7f53465

I'm almost positive it was this change. Several people from Anthracite have confirmed that the latest build is working. The issue was always a RoPE issue.
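
For anyone following along, one way to check whether a local llama.cpp checkout already contains that commit is git's ancestry test; a small sketch, assuming a clone of ggerganov/llama.cpp with its default remote:

# Check whether the RoPE fix linked above is already part of the local checkout.
cd llama.cpp
git fetch origin
if git merge-base --is-ancestor 75e1dbbaaba5d762223370f3dab9db24a7f53465 HEAD; then
    echo "fix is present in this build"
else
    echo "checkout predates the fix - pull and rebuild"
fi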

ah right, so that means the quants were actually correct, it's just that llama couldn't load them. thanks for clarifying!

mradermacher changed discussion status to closed

Didn't seem to work quite as we expected:

llama_model_load: error loading model: check_tensor_dims: tensor 'rope_freqs.weight' has wrong shape; expected    48, got    64,     1,     1,     1

Just to be sure, did you use the latest version of llama.cpp to quant?
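
For reference, the tensor the loader complains about can be inspected directly in the converted file; a minimal sketch, assuming the gguf-dump tool that ships with llama.cpp's gguf-py package (the file name is just illustrative):

# List the rope_freqs tensor and its shape in a converted GGUF.
# Assumes the gguf Python package is installed, which provides gguf-dump.
pip3 install gguf
gguf-dump Deedlit_4B.gguf | grep -i rope_freqs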

There was a change to the convert script.

Yes, I updated right before queuing, and No, there was no change to the convert script, but it doesn't matter as it re-ran anyway.

I think I see what you mean. There was a change to the script in that diff, but no change in the script since the model was converted.

> Yes, I updated right before queuing, and No, there was no change to the convert script, but it doesn't matter as it re-ran anyway.

Does this imply it worked the second time? Weird...

the conversion always worked. what doesn't work is the resulting model.

The model is working on a llama.cpp derivative that merged the change yesterday. As long as you are on latest llama.cpp, inference should work and does for me.

I haven't tested inference, but llama-imatrix can't load the model. But imatrix quants are not terribly important for a 4B model, so if it works for inference, this is currently good enough.

I am literally running an imatrix quant, with llama.cpp updated yesterday, at q4_0_4x8 with this model. Calculating the imatrix gave me no issues, quantizing gave me no issues, and the quant just finished; I just need to test inference. Not sure what issue you are having.
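
For completeness, the workflow described there boils down to roughly the following; a sketch, assuming Q4_0_4_8 is the llama-quantize spelling of the "q4_0_4x8" type mentioned (the AArch64-repacked variant llama.cpp offered at the time), with placeholder file names:

# Compute an importance matrix, then produce the repacked 4-bit quant from it.
./llama.cpp/llama-imatrix -m Deedlit_4B.gguf -f calibration.txt -o Deedlit_4B.imatrix
./llama.cpp/llama-quantize --imatrix Deedlit_4B.imatrix Deedlit_4B.gguf Deedlit_4B.Q4_0_4_8.gguf Q4_0_4_8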

I've documented everything. Anyway, quite possibly there have been further fixes that were not present when I tried to create the imatrix.

Inference is working. Sorry about all this. I will refrain from requesting quants for this architecture for now.

We seem to have a communication disconnect. I never claimed inferencing isn't working, and I have zero problems with being asked to quant these models. But there is also no need for you to request quants if you don't feel like it, of course. Cheers :)

I just tried this out of curiosity. I can confirm that convert_hf_to_gguf.py, llama-imatrix and llama-quantize all worked without any issues. Here is the script to reproduce my success; the only thing that could really make the difference is using the latest llama.cpp. It was also the first time I realized you can actually run two imatrix tasks on the same GPU without things breaking: I accidentally forgot to set CUDA_VISIBLE_DEVICES while you were running imatrix at the same time, but GPU memory luckily came nowhere close to getting full and all tasks completed without any issues.

#!/bin/bash
# Fetch the model and build the latest llama.cpp with CUDA support.
git clone https://huggingface.co/FourOhFour/Deedlit_4B
git clone --recursive https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
make GGML_CUDA=1 -j
cd ..
# Set up a Python environment with the dependencies the conversion script needs.
python3 -m venv venv
venv/bin/pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
venv/bin/pip3 install sentencepiece
venv/bin/pip3 install pyyaml
venv/bin/pip3 install safetensors
venv/bin/pip3 install transformers
# Convert the HF model to GGUF.
venv/bin/python3 llama.cpp/convert_hf_to_gguf.py --outfile Deedlit_4B.gguf ./Deedlit_4B/
# Download the calibration data, compute the imatrix and produce a Q5_K_M quant.
wget https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt
./llama.cpp/llama-imatrix -m Deedlit_4B.gguf -f calibration_datav3.txt -o Deedlit_4B-f16.imatrix -ngl 0
./llama.cpp/llama-quantize --imatrix Deedlit_4B-f16.imatrix Deedlit_4B.gguf Deedlit_4B.i1-Q5_K_M.gguf Q5_K_M 12
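
As a follow-up smoke test (not part of the script above), the resulting quant can be given a short prompt with the llama-cli binary from the same build; a sketch:

# Quick inference check on the quant produced by the script above.
./llama.cpp/llama-cli -m Deedlit_4B.i1-Q5_K_M.gguf -ngl 99 -n 64 -p "Write one sentence about Deedlit."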

> We seem to have a communication disconnect. I never claimed inferencing isn't working, and I have zero problems with being asked to quant these models. But there is also no need for you to request quants if you don't feel like it, of course. Cheers :)

Oh no, I was just following up to my previous comment to verify that the quantized imatrix model was working. I don't want to create any issues for you, since I see the volume of quants you provide. If we can get this worked out, that would be great.

@nicoboss Once imatrix is running for a cycle, it won't allocate more GPU memory, so at that point it's always safe to start more.
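
Related to the accidental GPU sharing mentioned above: if you would rather keep runs fully separated, each imatrix task can be pinned to its own device via the standard CUDA_VISIBLE_DEVICES environment variable; a sketch with placeholder model names:

# Pin each imatrix run to a dedicated GPU so they never share memory.
CUDA_VISIBLE_DEVICES=0 ./llama.cpp/llama-imatrix -m model_a.gguf -f calibration_datav3.txt -o model_a.imatrix &
CUDA_VISIBLE_DEVICES=1 ./llama.cpp/llama-imatrix -m model_b.gguf -f calibration_datav3.txt -o model_b.imatrix &
wait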

@jeiku As I said, a 4B is not an issue at all, and I have models failing in different ways every day, so it's no trouble for me to try. I suspect that future models will also just work, and that I simply had a slightly too old version of llama.cpp, which would convert the model correctly but then couldn't imatrix-quant it. That is, by the way, a very common occurrence with llama.cpp.

In fact, just for everybody's entertainment, here is the list of failures, sorted from most common to least common (from my head):

  1. No tokenizer.model file even though one is required. I have not dug into this, but maybe transformers doesn't need that file to run the model; it can't be that so many people forget to upload it?
  2. No matching pretokenizer.
  3. Converts, but does not work with imatrix (or does not load at all).
  4. Other issues, such as duplicate key errors or other bugs in convert_hf_to_gguf.py.

@jeiku As for this model, I can try again if you want, but the advantages for a 4B are limited (from my perspective, where I would just run the f16, maybe hubris :).
