VERY impressive!
Just going to drop this here, I'm so hyped about this. You guys have been doing such amazing work and it's been such a pleasure to join you, on this journey!
Thank you!
Mistral CEO confirmed that Miqu is a prototype Mistral 70b. So this is a finetune of the padded dequantized version? Interesting!
We finetuned on top of that: https://huggingface.co/152334H/miqu-1-70b-sf
Apparently it has fewer problems and lower perplexity too. Fixes a bunch of things.
Proof, just in case someone reads that and doubts it: https://twitter.com/arthurmensch/status/1752734898476007821
Ooooo nice!
It certainly seems to have its own style of writing, much nicer than a lot of other models. I like it a lot just playing with it on its own; I need to play with settings to dial it in for Silly Tavern.
Seems good at general knowledge, and its code generation beat GPT-3.5 (better quality answers, newer libraries).
On a MacBook M1 Max with 64GB, the Q3_K_M quant runs with ~16s time to first token, at ~5.25 t/s.
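If anyone wants to reproduce a rough local speed test like that, here's a minimal sketch using llama-cpp-python. The commenter didn't say what runtime they used, so this is just one way to do it; the GGUF filename and prompt are placeholders, and the timing is overall throughput rather than a separate TTFT measurement.

```python
import time
from llama_cpp import Llama

# Hypothetical local path to the Q3_K_M quant; adjust to wherever your GGUF lives.
llm = Llama(
    model_path="./miqu-1-70b.q3_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers (Metal on Apple Silicon)
)

prompt = "Write a Python function that parses an ISO 8601 timestamp."

start = time.time()
out = llm(prompt, max_tokens=256)
elapsed = time.time() - start

text = out["choices"][0]["text"]
n_tokens = out["usage"]["completion_tokens"]
print(text)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.2f} t/s")
```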
Code generation!?
(checks title, again)
Oooooooookayyyyyyy ...
Can't wait for that Noromaid-EveryoneCoder33b merge. :-)
I think it has my favorite writing style so far, very creative. I will edit this later and tell you how coherent it is, etc. If I am insane enough I might try to reinstall my system with a stupidly large swap file and make a frankenmerge to 120B with this model. It seems doable, since someone else did it with swap, so maybe? I have 64GB of RAM, so that should at least make it a lot better.
Edit: it's really dumb sometimes, especially at higher temperatures, but there's a balance you can find; it also needs a really high repeat penalty. Sometimes I think this is my favorite model so far, but other times it's really bad. Then again, my standards might be corrupted by 120B models. It needs a lot of tweaking to get good results and still acts weirdly sometimes, so I'll have to do more testing. I think it's a good base for further refinement and merging. Overall it's complicated: it's exceptionally good 50% of the time, then exceptionally terrible the other 50%. I'd maybe look into RLHF to keep the same feel while helping it with logic, but I don't know much about training.
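For anyone wanting to try the kind of tuning described above (lower temperature, higher-than-usual repeat penalty), here's a small llama-cpp-python sketch. These values are only starting points to experiment from, not settings from the post, and the model path and prompt are placeholders.

```python
from llama_cpp import Llama

llm = Llama(model_path="./miqu-1-70b.q3_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

# Conservative sampling: keep temperature modest to reduce incoherent output,
# and push repeat_penalty above the common ~1.1 default, as the comment suggests.
out = llm(
    "Continue the story: The lighthouse keeper had not spoken in years.",
    max_tokens=300,
    temperature=0.7,      # hypothetical starting point; raise for more creativity
    top_p=0.9,
    repeat_penalty=1.18,  # "really high" relative to the usual ~1.1
)
print(out["choices"][0]["text"])
```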