Thank you.
I just want to say that it's really awesome that:
A. You put so much time and effort into pushing the LLM space in such interesting directions, and
B. You additionally take the time and clearly serious effort to really give everyone a chance to squeeze what they are looking for out of these wild models.
Your page is the most consistently interesting set of repos that I've come across. Keep it up.
Thanks!
Thank you so much!
Just wait for the "mad scientist sections" ...
I want to second what schonsense said. Not sure if you remember me asking on one of the other models, but this pretty much cleared up all the questions I had.
Thank you; there is a lot of "power" in the models that has not been accessed yet.
And I mean all the models - from "old" L2s, right up to the newest archs.
Likewise with quants ... a new paper/doc I am working on will cover how to access models at full power regardless of size, quant, or arch.
The research/testing side and tuning of the "augmentation methods" are almost complete.
To give you a preview, it will discuss, detail, and include augmentation methods for using almost all models at very low, mid-range, and high-end quant levels.
It will also "map" the quant levels - by model parameter count and coherence level - as well as the augmentation needed to make them run better.
So far the "winners" at the moment are:
Mistral 7Bs running at 100 t/s, Solar 11Bs at 70 t/s, and Gemma 2 at 55-60 t/s on a low-end 16GB card.
But also MoEs (4x7B / 8x7B) running in the 40 t/s range - at incredible coherence levels.
70B models running at 14 t/s.
On a high-end card, double these speeds.
This paper/doc will likely be in the 30+ page zone.
It will cover a lot of different areas/models and how to get the most out of them.
There are also some interesting surprises coming too.
@DavidAU Oh wow that sounds amazing!
What is the intelligence like at ultra-low quants like IQ1_S and such? Does a 70B model actually feel like a 70B in RP and reasoning capabilities?
Secondly (btw, I have just 20GB of VRAM): besides RP, I work with sensitive data and need powerful reasoning, so I could use some advice. Would a 70B model at IQ1_S with your settings be beneficial for me, or would a 22B model at 5.75 bpw or a 32B model at 3 bpw be better for this particular use case?
RE: IQ1S:
This is very model/arch specific. There is a big jump from IQ1_S to IQ1_M (and another at IQ2_XXS).
Likewise, the AGE of the quant (how long ago it was created) is also critical. The improvements in llama.cpp (picked up when you re-quant) are night and day.
I have found quants as young as 60 days old in serious need of re-quanting.
Re-quanting improves every level - from the bottom all the way up to Q8.
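For reference, here is a minimal sketch of what re-quanting from an original-precision GGUF looks like with a current llama.cpp build; the file paths are placeholders, and the tool is named `llama-quantize` in recent builds (older builds shipped it as `quantize`).

```python
# Minimal sketch: re-quantize from a full-precision GGUF using a current
# llama.cpp build so the quant picks up the latest quantization improvements.
# Paths and the binary name are assumptions - adjust for your setup.
import subprocess

SRC = "models/Llama-3.3-70B-Instruct-f16.gguf"    # full-precision source GGUF (placeholder)
DST = "models/Llama-3.3-70B-Instruct-IQ1_M.gguf"  # freshly re-quanted output (placeholder)
QUANT_TYPE = "IQ1_M"                               # also worth testing: IQ1_S, IQ2_XXS, ... Q8_0

# llama-quantize rewrites the tensors with the current quantization code.
subprocess.run(["llama-quantize", SRC, DST, QUANT_TYPE], check=True)
```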
RE: Reasoning.
There are two issues that occur the lower you go in "bits" / BPW:
1 - Loss of instruction following (relative).
2 - Output generation issues - corruption, repeats, brain damage.
I can fix #2 - in most cases; proof of concept here:
https://huggingface.co/DavidAU/Llama-3.3-70B-Instruct-How-To-Run-on-Low-BPW-IQ1_S-IQ1_M-at-maximum-speed-quality
For #1, clarification of instructions is key.
Llama 3.3 has exceptional instruction following - even at IQ1_S.
What I would do in your case is test the model at IQ1_S, IQ1_M and up; you will see the differences per quant rather quickly (a rough test loop is sketched below).
Instruction following ++ = output generation ++ ; they are linked.
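If it helps, here is a minimal sketch (using llama-cpp-python, which I'm assuming here) of running the same prompt across quant levels so you can compare them directly; the file names, prompt, and sampler values are illustrative assumptions, not the exact settings from the repo page linked above.

```python
# Minimal sketch: run one prompt across several quant levels of the same model
# and eyeball the differences in instruction following and output quality.
# File names and sampler values are illustrative, not recommended settings.
from llama_cpp import Llama

QUANTS = [
    "Llama-3.3-70B-Instruct-IQ1_S.gguf",
    "Llama-3.3-70B-Instruct-IQ1_M.gguf",
    "Llama-3.3-70B-Instruct-IQ2_XXS.gguf",
]

PROMPT = "List three steps to audit a dataset for duplicate records, in order."

for path in QUANTS:
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    out = llm(
        PROMPT,
        max_tokens=300,
        temperature=0.6,     # keep temperature modest at very low BPW
        repeat_penalty=1.1,  # a touch of repeat penalty helps tame loops
        top_k=40,
        top_p=0.95,
    )
    print(f"--- {path} ---")
    print(out["choices"][0]["text"].strip())
```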
RE: "Reasoning" itself.
Reasoning (in an LLM) is about nuance ... which roughly means "bits".
However, in many cases you can make up for "lost bits" with raw parameter count.
And to make it tougher, model training will have a big impact too, depending on what topic(s) you need the model to reason about.
In addition to 70B / other models, you may want to try the MoEs too (and up the number of experts used - a sketch of that follows below).
With MoEs, however, the "newness" of the quant is critical, as there are "old" ones out there that do not function right until they are re-quanted.
And for those that are "old" and work -> get them re-quanted for far better performance.
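Raising the active expert count is a metadata override at load time. Here is a minimal sketch with llama-cpp-python; the file name is a placeholder, and the key name ("llama.expert_used_count") is what Mixtral-style GGUFs use - other MoE architectures may name it differently, so check your model's metadata first.

```python
# Minimal sketch: load a MoE GGUF with more active experts than the default.
# The kv_overrides key name below matches Mixtral-style GGUF metadata; this is
# an assumption - verify the key for your specific model/arch.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-Q4_K_M.gguf",  # placeholder file name
    n_gpu_layers=-1,
    n_ctx=4096,
    kv_overrides={"llama.expert_used_count": 4},      # 8x7B default is 2 experts per token
)

out = llm("Explain, step by step, why 17 * 24 = 408.", max_tokens=200)
print(out["choices"][0]["text"].strip())
```

More active experts costs some speed, but per the note above it is one way to claw back quality from "lost bits".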