Add the ability to set the Output Tensors and Embeddings to F32/F16/BF16/Q8_0 in GGUF-My-Repo

#2
by Joseph717171 - opened

Please add the ability to set the Output Tensors and Embeddings to F32/F16/BF16/Q8_0 in GGUF-My-Repo. The following llama-quantize flags will be necessary, and the feature will also require the ability to convert repos to F32/F16/BF16 first: 🙏

```
--allow-requantize
--output-tensor-type
--token-embedding-type
```
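
For reference, a minimal sketch of how the space might wire those flags into its quantize step (the wrapper function, file names, and default type here are hypothetical, not GGUF-My-Repo's actual code):

```python
# Hypothetical sketch: build a llama-quantize command with the requested
# overrides for output tensor and token embedding types.
import subprocess

def quantize_with_overrides(
    src_gguf: str,                     # repo already converted to F16/BF16/F32
    out_gguf: str,
    quant_type: str = "Q4_K_M",
    embed_output_type: str = "Q8_0",   # assumed default for the new option
):
    cmd = [
        "./llama-quantize",
        "--allow-requantize",
        "--output-tensor-type", embed_output_type,
        "--token-embedding-type", embed_output_type,
        src_gguf,
        out_gguf,
        quant_type,
    ]
    subprocess.run(cmd, check=True)

# Example (requires a local llama-quantize binary and a converted GGUF):
# quantize_with_overrides("model-f16.gguf", "model-Q4_K_M.gguf")
```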

@bartowski What do you think, my dude? It's a good idea, right? 😋

Maybe. BF16 isn't worth it ATM because it'll sadly break all GPU offloading, even though it's the best theoretical option

F32 adds way too much size to be worth it IMO, except maaaybe for people who are crazy for Q8_0 with F32 embed/output, just for the absolute highest performance GGUF quant?

F16 isn't worth it in my testing and is often outmatched by Q8_0, sometimes even by quants as low as Q6_K and Q4

Q8_0 is interesting and possibly worth it, though even then I have yet to see significant improvement from going from whatever the default is to Q8_0, but it's at least an okay middle ground between size and potential performance

So maybe adding a toggle for keeping embed/output at Q8_0 could be nifty, especially if people report their findings about actual perceived performance and show some data/tests that put them head to head with the defaults

@bartowski I agree! Q8_0 is definitely the optimum quant for Output Tensors and Embeddings. However, I still think each selection should be offered, since llama.cpp supports it. That way people can make up their own minds about what custom quants they want - it can default to Q8_0 for the Output Tensors and Embeddings if the option is toggled on. 🤔
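
A rough sketch of how that toggle could map onto extra llama-quantize arguments (the option name, choices, and Q8_0 default are assumptions for illustration, not the space's actual UI):

```python
# Hypothetical mapping from a UI selection to extra llama-quantize args.
# "keep_embed_output" and the choice list below are assumed for illustration.
EMBED_OUTPUT_CHOICES = ["Default", "Q8_0", "F16", "BF16", "F32"]

def extra_quant_args(keep_embed_output: str = "Q8_0") -> list[str]:
    if keep_embed_output == "Default":
        return []  # let llama-quantize pick its usual tensor types
    return [
        "--allow-requantize",
        "--output-tensor-type", keep_embed_output,
        "--token-embedding-type", keep_embed_output,
    ]

# These would be appended to the base command before the input/output paths:
print(extra_quant_args("Q8_0"))
```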

Yeah, that's not bad either! I just wish we had the ability to use BF16 in CUDA; that would make this so much more of a no-brainer. Max quality bar none? BF16. Better quality at reasonable size? Q8. Otherwise, default.

microsoft/Phi-3.5-MoE-instruct to gguf?
