3bit version
How about adding a 3-bit version for quality testing?
I'd like to see how much worse it performs compared to the 4-bit version, because the 4-bit version hits OOM somewhere above 1000 tokens on a 24 GB GPU.
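In case it helps narrow down where it falls over, here is a rough sketch of how one could measure peak VRAM against prompt length. The model path is a placeholder and the AutoGPTQ loader is only an assumption for illustration; KoboldAI and ooba each load the quant their own way, which is likely where the differences come from.

```python
# Rough sketch: peak VRAM vs. prompt length on a single GPU.
# MODEL_DIR is a placeholder; the AutoGPTQ loader is an assumption,
# not the loader KoboldAI/ooba actually use.
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL_DIR = "path/to/llama-30b-supercot-4bit"  # placeholder path

tok = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(MODEL_DIR, device="cuda:0")

for n_prompt in (256, 512, 1024, 1536, 2048):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    # dummy prompt of the target length; content doesn't matter for memory use
    ids = torch.full((1, n_prompt), tok.eos_token_id, dtype=torch.long, device="cuda:0")
    with torch.no_grad():
        model.generate(input_ids=ids, max_new_tokens=32)
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{n_prompt:5d} prompt tokens -> peak {peak:.2f} GiB")
```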
This 4-bit version works pretty well.
On KoboldAI, I can run the 4-bit non-groupsize model at full context on Windows with my 3090. Ooba uses more VRAM for some reason; I'm guessing that's what you're using.
I am uploading a 3bit-128g quant of this model. It might take a couple of hours, since HF seems to be having some trouble right now and is refusing to let me create a model card. The wikitext-2 ppl is 12% worse than the 4-bit non-groupsize quant, which is a substantial loss in coherence, but the file is 17% smaller, which should roughly translate into similar VRAM savings. You will find it here: https://huggingface.co/tsumeone/llama-30b-supercot-3bit-128g-cuda
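For anyone who wants to sanity-check the comparison themselves, below is a rough sketch of the usual wikitext-2 perplexity measurement. The model path is a placeholder and the AutoGPTQ loader is an assumption; window/stride choices differ between eval scripts (GPTQ-for-LLaMa's own eval uses its own setup), so absolute numbers won't match exactly and only the relative gap between quants is meaningful.

```python
# Sketch of a wikitext-2 perplexity check over non-overlapping 2048-token windows.
# MODEL_DIR is a placeholder; the AutoGPTQ loader is an assumption.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL_DIR = "path/to/quantized-model"  # placeholder
CTX = 2048  # LLaMA context window

tok = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(MODEL_DIR, device="cuda:0")

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tok("\n\n".join(test["text"]), return_tensors="pt").input_ids.to("cuda:0")

nlls = []
for i in range(0, ids.size(1) - CTX, CTX):  # step through the test set window by window
    chunk = ids[:, i : i + CTX]
    with torch.no_grad():
        out = model(input_ids=chunk, labels=chunk)  # mean NLL over the window
    nlls.append(out.loss.float())

ppl = torch.exp(torch.stack(nlls).mean())
print(f"wikitext-2 perplexity: {ppl.item():.3f}")
```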
Just want to add that I also tried quantizing a 3bit-32g version to see whether the ppl could be improved, but the file ended up 2% larger than the 4-bit non-groupsize quant while still having 5% worse ppl. There is basically no reason to even consider that one, since it will use more VRAM and also be less coherent.