This LLM is unique in a good way, but hallucinates like crazy in the other direction.

#4
by deleted - opened
deleted

I'm very impressed with this LLM. However, it stubbornly denies commonly known and unquestionable truths in order to achieve a TruthfulQA score over 70. This goes well beyond throwing the baby out with the bathwater.

For example, one of my censorship test questions asks how many movies the actress ___ appeared nude or topless in. The actress in question had 10 such scenes, but this LLM said there was only 1. And when I provided another scene and asked for more, it stubbornly denied the example and insisted there was only 1.

This LLM throws out about 10x more undeniably true and commonly known facts (known to hundreds of millions of people) in order to reduce falsehoods/hallucinations by perhaps 25%. This is nowhere near a reasonable compromise. I have a little offline copy of Wikipedia that correctly confirms nearly all the facts this LLM adamantly denies.

In short, there is something special about this LLM, but claiming 10 undeniably TRUE and widely known things are false to avoid saying 1 false thing is true isn't a reasonable compromise. This LLM is less than useless as a reference.

Note: The scene in the example above was correctly described. That's one of the reasons this LLM is special. All other Mistral-based LLMs described it wrong. So when it gets the facts right, it gets them more right than the others, if that makes sense.

This model also seems good at reasoning. I tried the Orca 2 paper reasoning challenge multiple times and it seems to always get the response right:

John and Mark are in a room with a ball, a basket and a box. John puts the ball in the box, then leaves for work. While John is away, Mark puts the ball in the basket, and then leaves for school. They both come back together later in the day, and they do not know what happened in the room after each of them left the room. Where do they think the ball is?
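One way to make "multiple times" measurable is a small harness that repeats the prompt and counts passes. This is only a sketch: `generate` is a hypothetical callable wrapping whatever local runtime is in use, and the pass check (John expects the box, Mark expects the basket) is a crude keyword heuristic, not a real grader.

```python
import re
from typing import Callable, Optional

PROMPT = (
    "John and Mark are in a room with a ball, a basket and a box. "
    "John puts the ball in the box, then leaves for work. While John is away, "
    "Mark puts the ball in the basket, and then leaves for school. They both "
    "come back together later in the day, and they do not know what happened "
    "in the room after each of them left the room. Where do they think the ball is?"
)


def first_place_after(name: str, text: str) -> Optional[str]:
    """Return the first 'box' or 'basket' that follows the first mention of name."""
    match = re.search(rf"\b{name}\b.*?\b(box|basket)\b", text, flags=re.S)
    return match.group(1) if match else None


def passes(answer: str) -> bool:
    """Heuristic: John is tied to the box and Mark to the basket."""
    text = answer.lower()
    return (
        first_place_after("john", text) == "box"
        and first_place_after("mark", text) == "basket"
    )


def pass_rate(generate: Callable[[str], str], prompt: str = PROMPT, runs: int = 10) -> float:
    """Call the (hypothetical) model wrapper `runs` times and return the share of passes."""
    return sum(passes(generate(prompt)) for _ in range(runs)) / runs
```

The heuristic can be fooled by an answer that restates the setup before giving a wrong conclusion, so it is worth spot-checking any failures and a few passes by hand.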

One thing I really loved is that the Q4_K_M GGUF version completely fits in an 8GB VRAM GPU!
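A rough back-of-the-envelope check of why that fits. The sketch assumes about 10.7 billion parameters (taken from the SOLAR-10.7B name) and roughly 4.85 bits per weight for Q4_K_M, which is an approximate figure assumed here rather than measured. It also treats the card as 8 GiB. KV cache and runtime overhead come on top of the weights and grow with context length.

```python
def gguf_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumed values: parameter count from the model name, bits/weight approximate.
weights = gguf_weight_gib(10.7e9, 4.85)
headroom = 8.0 - weights  # what is left for KV cache and overhead on an 8 GiB card

print(f"weights: about {weights:.1f} GiB, headroom: about {headroom:.1f} GiB")
```

If the headroom is small, a long context window can still push the total past what the card holds, so "fits" depends on the context size being used.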

There's a lot of potential for this model, especially if it is trained for multi-turn conversation and function calling is implemented.

deleted

@tarruda You're not lying. I'm still finishing up my test on it. But it's doing things other Mistrals can't. Like you said, it's "smarter".

For example, I ask the LLM to make a joke about 2 disparate things (e.g. a cat and a telescope), start with a random header, such as "Out in a field", and then explain itself (all in the same prompt). This LLM made coherent and slightly humorous jokes and correctly explained why they were funny.

Another example is prompting a poem type (e.g. sonnet) with several directives it must follow. This is hard because it has to follow rhythm, meter... while remaining coherent and including the prompt directives.

In short, this is better than 7b Mistrals. I thought this was a hoax when I saw it so high on the leaderboard. This is not a hoax.

> I thought this was a hoax when I saw it so high on the leaderboard. This is not a hoax.

To me the best LLM I can run locally is still NeuralHermes 2.5, but maybe this will surpass it once there are some fine-tunes by @teknium / @mlabonne. The lack of a system prompt and multi-turn chat is limiting and makes it harder to compare with existing Mistral fine-tunes...

deleted

@tarruda Multi-turn chat isn't part of my personal testing, but knowledge is its Achilles' heel. It did notably worse on all my fringe knowledge questions than leading Mistrals, and this bled into my story prompts (e.g. despite Friends being a widely popular show, it had Monica and her brother be love interests). In short, it's notably smarter than Mistral, but less knowledgeable. I'm starting to think they pushed TruthfulQA so hard in this Instruct version because they had to keep this substantial reduction of knowledge in check.

Mixtral and Yi-34b have FAR more knowledge than this LLM. And even the original 7b Mistral has notably more knowledge. Somehow they lost information during the up-scaling process, yet gained intelligence.

It would be worth checking the knowledge of the base version of the model, which scores 45.04 on TruthfulQA: https://huggingface.co/upstage/SOLAR-10.7B-v1.0

It is still one of the top LLMs on the leaderboard, but if it was the instruction fine-tune that killed its knowledge, then some other fine-tunes might fix it.

deleted

@tarruda Thanks for the suggestion. I'm going to try NeuralHermes 2.5 next. Hopefully someone will fine-tune SOLAR base with the same methods and data as Mistral 7b so that it's easier to compare the difference between the base models.

I wonder if a variation on this kind of model could work well with RAG, like the local Wikipedia archive you mentioned. If a small model has good logic and a wrapper can feed data into it on the fly, that might be the best of both worlds. I’m sure that breaks down in extremes (not knowing enough to know what to search for), but it seems it could reduce GPU requirements in general by keeping more knowledge external to the LLM itself.
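A minimal sketch of that wrapper idea, assuming a toy in-memory corpus and plain word-overlap scoring in place of a real search index. `retrieve`, `build_prompt`, and the sample passages are hypothetical names and data for illustration only; the resulting prompt would be handed to whatever local model is in use.

```python
import re
from typing import List, Set

# Illustrative passages; a real setup would query an offline archive instead.
PASSAGES = [
    "In the sitcom Friends, Monica Geller and Ross Geller are siblings.",
    "Kiwix is an offline reader for ZIM archives such as Wikipedia.",
    "A sonnet has fourteen lines.",
]


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Return up to k passages sharing the most words with the question."""
    q = tokens(question)
    ranked = sorted(passages, key=lambda p: len(q & tokens(p)), reverse=True)
    return [p for p in ranked[:k] if q & tokens(p)]  # drop zero-overlap passages


def build_prompt(question: str, passages: List[str], k: int = 2) -> str:
    """Put the retrieved passages in front of the question for the model."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below. If the context is not enough, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


print(build_prompt("Who is Monica's brother in Friends?", PASSAGES, k=1))
```

Word overlap is about the weakest retriever there is, and it shows the failure mode mentioned above: if the question does not share vocabulary with the passage that answers it, nothing useful is retrieved.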

deleted

@InvidFlower I hope someone with expertise tries this someday. I'm using an offline version of Wikipedia through the Kiwix app with a fully indexed ZIM file, so it has full-text search. It would be nice if there were a button that said "verify with Wikipedia" so I didn't have to search and scan manually through the results to verify key facts.

deleted changed discussion status to closed
