#229
by nicoboss - opened

Nemotron4/Nemotron3/Minitron llama.cpp support is finally implemented in the latest release of llama.cpp. You need to use the Hugging Face Transformers-compatible versions linked below, not the NeMo-based versions published by NVIDIA, for them to work with llama.cpp.

Nemotron-4-340B-Instruct-hf: https://huggingface.co/mgoin/Nemotron-4-340B-Instruct-hf
Nemotron-4-340B-Base-hf: https://huggingface.co/mgoin/Nemotron-4-340B-Base-hf
nemotron-3-8b-chat-4k-sft-hf : https://huggingface.co/mgoin/nemotron-3-8b-chat-4k-sft-hf
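
The conversion itself is the usual llama.cpp HF-to-GGUF path. A rough sketch, assuming llama.cpp is cloned locally with its Python requirements installed (the 8B chat repo and the output filename are just placeholders):

```python
# Rough sketch, not a tested recipe: download one of the -hf repos and convert
# it to GGUF with llama.cpp's converter. Assumes ./llama.cpp is a local clone
# with its Python requirements installed.
import subprocess
from huggingface_hub import snapshot_download

# Any of the Transformers-compatible repos above works; the 8B chat model is
# used here only because it is the smallest.
model_dir = snapshot_download("mgoin/nemotron-3-8b-chat-4k-sft-hf")

subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
        "--outtype", "f16",
        "--outfile", "nemotron-3-8b-chat-4k-sft.f16.gguf",
    ],
    check=True,
)
```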

right, those have the correct tokenizer. are there also conversions for the 8b/4b minitron releases?

Yes, there are, but only in FP8. You can find all Hugging Face Transformers-compatible Nemotron4/Nemotron3/Minitron versions published by mgoin under https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5.

Hmm, see my other message to you. Things don't add up and it looks like these -hf models are the wrong ones. I will hold back on them until I can see what's going on.

@nicoboss and now to something completely different: the Q4_K_S of the 1T model will likely be >500GB, possibly too large. The next smaller size to try would be the IQ4_XS, but that's expensive to quantize willy-nilly and might still be too large. What would you suggest I try?

and the above comment should have gone somewhere else, but I leave it here to not make it more confusing :)

It looks to me as if Q3_K_L is the largest one to fully fit into RAM. It is 534.1 GB and so 497.4 GiB. Usable RAM is 503 GiB - I think the rest is reserved for hardware. So the only way of getting IQ4_XS working without streaming from SSD is by offloading some layers to the GPU, or maybe even to multiple GPUs. With all 4 GPUs there is 66 GiB of GPU memory available. If we can use multiple GPUs, this should even work for Q4_K_S, which is 538.3 GiB. I'm more than happy to let you try out something even if it might end up not working. IQ4_XS quantizing is not that expensive and relatively quick if your input model is on the SSD. Don't we have to quantize them anyway for static quants?
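
The arithmetic behind that, as a small sketch (the numbers are the ones quoted above; KV cache and compute buffers need some headroom on top, which this ignores):

```python
# Back-of-the-envelope fit check using the figures from this thread; it ignores
# KV cache and compute-buffer overhead, so treat the results as optimistic.
def gb_to_gib(gb: float) -> float:
    """Convert a decimal-GB file size (as shown on disk) to GiB."""
    return gb * 1000**3 / 1024**3

usable_ram_gib = 503.0      # host RAM minus what the hardware reserves
gpu_mem_gib = 66.0          # all four GPUs combined (58 with only three)

quants_gib = {
    "Q3_K_L": gb_to_gib(534.1),  # ~497.4 GiB
    "Q4_K_S": 538.3,
}

for name, size in quants_gib.items():
    print(f"{name}: {size:.1f} GiB | RAM only: {size <= usable_ram_gib} "
          f"| RAM + GPUs: {size <= usable_ram_gib + gpu_mem_gib}")
```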

Wow, Q3_K_L is already 534GB. How about the IQ3 variants? They should give better quality per byte.

As for reusing the quants, it's possible with some manual juggling, but of course, we can't reuse them if we go for Q* quants.

And as for quant selection, Q3 surely feels much worse than Q4, but IQ4 already feels much worse than Q4_K_S, for example. And it's just a feeling, because I have no exact data for non-imatrix quants.

IQ3_M will easily fit as it is smaller than Q3_K_L and according to benchmarks might be slightly better than Q3_K_L.

The difference between IQ3_M and IQ4_XS is massive. I would try to go for at least IQ4_XS, or even Q4_K_S if you can offload layers to the GPUs to make it fit. Currently there are 3 GPUs assigned to your LXC container, totaling 58 GB of GPU memory; if you need the 4th one for the full 66 GB, just let me know.
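
If it helps, here is roughly what the offloading looks like through the llama-cpp-python bindings; just a sketch, where the GGUF path, layer count and split ratios are placeholders to be tuned until the offloaded part actually fits in VRAM:

```python
# Sketch only: load a large GGUF with part of the layers offloaded across the
# GPUs via llama-cpp-python built with CUDA. The path, layer count and split
# ratios below are placeholders, not measured values.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model.IQ4_XS.gguf",  # placeholder filename
    n_gpu_layers=20,                   # offload only as many layers as fit in VRAM
    tensor_split=[0.33, 0.33, 0.34],   # spread the offloaded layers over three GPUs
    n_ctx=4096,
)

out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```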

well... let's see how big the iq4_xs actually becomes and then see if i can fit it into mem+gpu.

and generating an iq4_xs is quite slow after all. and i forgot i can't reuse it, because it's different than what my normal quantize produces. sigh.

update: by slow i mean it uses vastly more cpu than a q4_k_s.

mradermacher changed discussion status to closed
