1.64-upmanship (2 bit HQQ quantize)
Turns out if you claim you're not going to do something, you'll do it.
https://huggingface.co/ProphetOfBostrom/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss_attn-4bit-moe-2bit-HQQ ta-da!
I've barely tested this, and not at all outside of text-generation-webui, but nothing's obviously wrong with it. It's 18 gigabytes. It should Just Work on TGWUI.
The attention weights (shared between all experts) are 4-bit instead. This adds almost nothing to the file size and, I think, nothing beyond that to the total memory usage.
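For the curious, the split follows the same per-layer-tag pattern as mobius's reference quant - roughly this, sketched against hqq's BaseQuantizeConfig / HQQModelForCausalLM API (the group sizes and the base repo id below are illustrative placeholders, not a promise of my exact settings):

```python
# Mixed-precision HQQ config for a Mixtral finetune: 4-bit attention
# projections, 2-bit MoE expert weights. Layer tags follow transformers'
# Mixtral module names; group sizes and the repo id are placeholders.
from hqq.core.quantize import BaseQuantizeConfig
from hqq.engine.hf import HQQModelForCausalLM

model_id = "NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss"  # assumed base repo

attn_cfg    = BaseQuantizeConfig(nbits=4, group_size=64)  # tiny share of the file
experts_cfg = BaseQuantizeConfig(nbits=2, group_size=16)  # the bulk of the 18 GB

quant_config = {}
for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
    quant_config[f"self_attn.{proj}"] = attn_cfg
for w in ("w1", "w2", "w3"):
    quant_config[f"block_sparse_moe.experts.{w}"] = experts_cfg

model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)
```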
This isn't really one-upmanship at all, though; zaq likely has the upper hand. EXL2 is fast, and their quant has, y'know, more of the model left in it.
This is not so fast. I get 6 tokens per second with torch.compile(), half that with straight torch, and ATEN sits somewhere in between - but it seems to encode very quickly, so it feels quite snappy. 6 is just about fast enough that tokens arrive faster than I can really 'enjoy' them, so I'm happy.
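(For reference, those three speeds are just hqq's dequant backends - assuming the HQQLinear.set_backend interface, switching between them looks like this:)

```python
# The three speeds above correspond to hqq's dequant backends
# (interface assumed from the hqq library; pick one before generating).
from hqq.core.quantize import HQQLinear, HQQBackend

HQQLinear.set_backend(HQQBackend.PYTORCH)          # straight torch: ~3 tok/s for me
HQQLinear.set_backend(HQQBackend.ATEN)             # ATen kernels: somewhere in between
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  # torch.compile(): ~6 tok/s
```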
And so I speculate that this should have a stronger understanding of long contexts, even if you're less likely to generate one in the first place.
The real boon ought to be contrastive search, which I've kept forgetting to try. I'll go do that now.
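(If you haven't used it: in plain transformers, contrastive search is just two generate() knobs, penalty_alpha and a small top_k - the same two settings live in TGWUI's parameter panel, which is where I'll be flipping it on. Something like:)

```python
# Contrastive search in transformers: enabled by a non-zero penalty_alpha
# together with a small top_k (0.6 / 4 are the usual starting values).
# `model` / `tokenizer` here are whatever you loaded the quant with.
inputs = tokenizer("There I was, minding my own business,", return_tensors="pt").to(model.device)
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```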
Thanks for this!
Amazing! I'd been hoping to see an HQQ of Noromix ever since I stumbled across this - https://github.com/dvmazur/mixtral-offloading
Would the expert-offloading strategy work with this model? I'm told HQQ + MoE offloading can make Mixtral 8x7b usable with 12 GB, which means I could run this on my personal desktop.
"Amazing! I was hoping to see a HQQ of Noromix"
Genuinely thought I'd never see a single download, lol. Glad to be wrong. I've never seen this project before - but my understanding of what HQQ does suggests that my model should work, for three reasons.
- I followed the method mobius used for their reference Mixtral quant. I think it's reasonable to assume that the HQQ Mixtral from the HQQ people will work with your HQQ-Mixtral-only loader, and this can't be far off from that.
And then I deleted the other reasons because
tl;dr: yes, it'd better work.
In fact, I'm not very clear on why they've decided to be so specific about the model it loads. HQQ can do any transformer.
I mean any transformer, not "any Llama-2 language model". Here's their reference ViT-H, for instance - which is part of Stable Diffusion 2. It's an image model; it doesn't have any text input at all. So if my quant of a Mixtral finetune isn't compatible, I'm not inclined to pin it on anything HQQ-y.
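To spell that out: HQQ is data-free and quantizes one linear layer at a time, so it has no idea (and no need to know) what architecture the layer came from. A minimal sketch, assuming HQQLinear's (linear_layer, quant_config) constructor:

```python
# HQQ quantizes a single nn.Linear in isolation - no calibration data,
# no assumptions about the surrounding architecture.
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

lin  = torch.nn.Linear(1280, 1280, bias=False).half().cuda()  # e.g. a ViT projection
qcfg = BaseQuantizeConfig(nbits=2, group_size=16)
qlin = HQQLinear(lin, qcfg)                                   # drop-in replacement

x = torch.randn(1, 1280, dtype=torch.float16, device="cuda")
print(qlin(x).shape)                                          # dequantizes on the fly
```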
However, I was wondering why I hadn't seen any discussion of CPU inference for HQQ, and I didn't even have to resort to Google - you just found one for me. Very cool! Please let me know if it works, but I may well test it now myself.
Given how they say it works, you're going to find that you're pressed to squeeze every last MoE layer you can into the GPU. If there are any options to adjust, I suggest you prioritise that when allocating VRAM. Attention layers are much less important - if not the least important thing - to get onto the GPU, and I expect that's why they're using more bits for them (as am I, as did the OGs): because they can get away with being slow to move. Caches go next. It's the big, dumb, fully connected feedforward weights that really are best moved as little as possible; for Mixtral, that's the expert/MoE layers.
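To make the priority concrete, here's a toy allocator - purely illustrative, not the mixtral-offloading API, and the per-block sizes are made up - that fills a 12 GB budget in that order:

```python
# Hypothetical greedy VRAM allocation in the priority order argued above:
# 2-bit expert FFN weights first, KV-cache next, 4-bit attention last.
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    size_gb: float
    priority: int  # lower number = keep on GPU first

def assign_devices(blocks, vram_budget_gb):
    placement, used = {}, 0.0
    for b in sorted(blocks, key=lambda b: b.priority):
        if used + b.size_gb <= vram_budget_gb:
            placement[b.name] = "cuda"
            used += b.size_gb
        else:
            placement[b.name] = "cpu"  # offloaded, streamed in on demand
    return placement

blocks = (
    [Block(f"layer{i}.experts", 0.45, priority=0) for i in range(32)]   # MoE weights
  + [Block(f"layer{i}.kv_cache", 0.05, priority=1) for i in range(32)]
  + [Block(f"layer{i}.attn",     0.03, priority=2) for i in range(32)]  # attention
)

placement = assign_devices(blocks, vram_budget_gb=12.0)
on_gpu = [n for n, d in placement.items() if d == "cuda"]
print(f"{len(on_gpu)}/{len(blocks)} blocks on GPU, e.g. {on_gpu[:3]}")
```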
Sweet ... might give it a try to compare with the EXL2. I've had mixed luck with this flavor of Noromaid vs. previous ones. Sometimes, it spits some hot RP. Sometimes, it spits strangely summarized crap-ola. (Although this version is able to have witty banter about The Princess Bride and Die Hard, which I have found hilarious.)