I Don't Understand This Model

#9
by phil111 - opened

This model has an impossibly high English MMLU-Pro score of 48 for its size, while at the same time an absurdly low English SimpleQA score of 3.

I've never seen overfitting to this degree before, and it's not due to an evaluation error since my personal testing confirms it. Phi4 knows the small subset of popular domains of English knowledge covered by the English MMLU orders of magnitude better than all other popular domains of English knowledge.

Phi3 was already extremely overfit to the small subset of popular English knowledge covered by the MMLU compared to other models like Llama 3.1 8b & Gemma2 9b. And Phi4 pushed it even further, boosting the English MMLU score from 78 to 85 while losing a commensurate amount of general English knowledge (e.g. SimpleQA dropped from an already very low 7.6 to 3). This goes beyond overfitting, and don't get me started on why RAG isn't a viable solution to this profound general ignorance.

Overfitting by aggressively filtering the corpus has tons of other negative consequences beyond just profound general ignorance. But for the sake of brevity I'll only mention instruction following. The HF IF score of only 6 doesn't accurately represent the situation. There's clearly some sort of evaluation issue. If anything, Phi4 can be unusually good at instruction following. However, it's so overfit for specific use cases (e.g. coding), and to output in ways that boost scores on multiple-choice tests, that the responses are commonly completely disconnected from the phrasing and subject matter of varied prompts, such as randomly outputting in JSON.

In short, Phi4 is simply not, by any stretch of the imagination, a general purpose AI model. For example, the bulk of humanity's most popular knowledge was aggressively filtered out of the corpus. And to make matters worse, it was then aggressively trained to perform particularly well on select tests (e.g. the MMLU) and tasks (e.g. math and coding). So instead of responding appropriately to the nuances of the user's prompt, it often returns pre-packaged nearest matches, giving it both profound general ignorance and instruction following issues across diverse tasks.

Most people are speaking highly of the model. I myself have been using it for days for a lot of tasks and it performs excellently.

@jvaladezp Yes, what it's designed to do, it does very well.

However, there's no arguing with a SimpleQA score of only 3, especially considering Phi4's unusually high MMLU score of 85. Saying Phi4 is overfit is an understatement. Additionally, this imbalance is getting progressively worse with each generation, with Phi3 having a notably lower MMLU (78), but notably higher SimpleQA (7.3).

There's also no arguing with an Instruction Following score of only 6. Yes, there's something else going on, because across many tasks Phi4's instruction following is good. However, Phi4's responses are often completely disconnected from their respective prompts. Again, this is clearly due to extreme overfitting, because it gets pulled into predictable tangents related to what it was overfit on (e.g. outputting in JSON).

Lastly, it's not at all surprising that so many people here like Phi4 since almost everyone here is a STEM-focused coding geek (the target of the overfitting). In fact, this is one of the purest circle jerks I've ever come across. To the general population, or more neurotypical intellectuals, Phi4 is little more than a hallucination generator when it comes to most things.

I think it makes no sense to query a language model for factual information, let alone a small one like phi4. Language models are primarily intended to manage language, not to be an encyclopedia.

@jvaladezp Currently available LLMs are all about knowledge. No LLM (at least not yet) is actually thinking through a coding, math, logic... problem. It's simply accurately retrieving the most relevant information considering the wording of the prompt and the current chat history.

For example, when an AI model produces code it's using detailed knowledge of the largely arbitrary set of rules of the desired programming language (python, C++, powershell...) to begin outputting a functional block of code (e.g. declaring variables). Then based on this, producing the next token/line, and so on. This is ALL about contextually relevant knowledge retrieval. No thinking is involved. Again, LLMs are currently all about knowledge.
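To make that token-by-token picture concrete, here is a minimal greedy-decoding loop, a sketch assuming the Hugging Face transformers API (the model name is just a placeholder; any causal LM illustrates the same point):

```python
# Minimal sketch of the generation loop described above: the model repeatedly
# conditions on everything produced so far and emits the single most likely
# next token. There is no separate planning step outside this loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-4"  # placeholder; substitute any causal LM you have locally
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Write a Python function that reverses a string."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(64):                               # generate up to 64 new tokens
        logits = model(input_ids).logits              # scores over the whole vocabulary
        next_id = logits[0, -1].argmax()              # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```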

So my gripe with Phi4 is the overwhelming focus on knowledge that helps boost key test scores (e.g. the MMLU), or performance on select tasks (e.g. coding), while actively suppressing/removing the overwhelming majority of humanity's most popular knowledge to free up more space for the small subset of MMLU & code boosting knowledge.

Said popular knowledge is essential when doing things like telling stories about popular subject matter. The stories make no sense and are filled with absurd counterfactuals because very basic knowledge that countless millions of English speakers know was stripped from the corpus in order to prioritize rarely used domains of knowledge like virology because it appears on the MMLU.

Which leads me to another point. The MMLU is a horrendous test of even the small subset of knowledge it covers. Picking the correct answer out of a lineup (multiple choice test) is rarely how knowledge is retrieved from AI models in the real-world. As a consequence, I'm witnessing models like Qwen2.5 & Phi4 climb higher on the MMLU, yet hallucinate more when forced to actually retrieve the same information. So we need to stop using multiple choice tests to evaluate LLMs. This is causing companies to prematurely end training once information is held tight enough to pick the answer out of a lineup (not a realistic use case), but not enough to be accurately recalled in full.

> Said popular knowledge is essential when doing things like telling stories about popular subject matter. The stories make no sense and are filled with absurd counterfactuals because very basic knowledge that countless millions of English speakers know was stripped from the corpus in order to prioritize rarely used domains of knowledge like virology because it appears on the MMLU.

Please post some of the nonsensical stories to pastebin if you will, @phil111. I am tremendously interested in the mundane topics you find missing from this model. What exactly is humanity's most popular knowledge, and what is missing? Once again, it would be great if you could provide as many examples/outputs as possible. Virology can hardly be called a rarely used domain of knowledge since COVID.

It's funny how your critique of the MMLU sounds very much like a high-school professor critiquing multiple-choice tests for his or her students.

phil111 is spot on. The model is severely overfitted to specific finetuning templates. It refuses to follow simple requests, adding explanations and repeating answers instead. The Phi3.5 models were actually good; this version has regressed to the same benchmark overfitting as earlier Phi-series models.

If Microsoft released the base-model, or some sufficiently filtered version, that could be a real contribution. This one isn't.

@Etherdrake That's the beauty of SimpleQA. It's a diverse, general knowledge, non-multiple-choice test that's too new for things like contamination to mask broad & deep ignorance. Consequently, I no longer feel compelled to provide my own examples & substantiations (although if you go through my comment history you'll find them).

I ran my own private non-multiple choice test for over a year on well over 100 models, and the results correlate nicely with SimpleQA's, and they both show a drastic drop in broad knowledge by many open source AI models in favor of higher multiple choice MMLU scores. For example, on my test Qwen2 72b scored a respectable 85.9, but Qwen2.5 72b scored a pitiful 68.4, which is slightly lower than much smaller models like Llama 3.1 8b & Gemma 2 9b. Sure enough Qwen2.5 72b only scored 9 on the English SimpleQA, compared to Llama 3.1 70b's 20.

Anyways, it's not just that the models are profoundly ignorant across numerous wildly popular domains of knowledge, especially pop culture, such as movies, music, literature, sports, TV shows & games, but they also struggle to fully retrieve the same knowledge covered by the MMLU.

For example, models like Phi4 & Qwen2.5 72b have comparable MMLU scores to GPT4o, a far larger and more powerful model. But when I ask said models about the same information covered by the MMLU, GPT4o reliably returns the correct answers, while models like Qwen2.5 72b hallucinate like crazy. This is why I strongly dislike multiple choice tests. Being shown the answer and picking it out of a lineup doesn't mirror real-world use cases. Additionally, multiple choice tests are a magnet for accidental/deliberate contamination.

Plus real-world phrasing variations, spelling/grammar errors... have little impact on GPT4o's performance, but drastically reduce the performance of models like Qwen2.5 72b. Training on a highly filtered and far less diverse corpus drastically limits the paths to the desired information.

And in regards to my providing story examples, you're going to have to trust me, or try it out for yourself. For example, if asked to write about any topic in one of its numerous blind spots, such as the aforementioned pop culture domains, it's comically bad. For example, making a famous male celebrity a woman because he has a unisex name like Kelly, or making a big deal about his dark hair, short stature... when he has blonde hair, is tall... You can't write a story about a topic you know virtually nothing about. Pop culture is called pop culture for a reason. It's popular. Removing the vast majority of it from an AI model's corpus in order to match the multiple choice STEM test performance (MMLU) of a vastly more knowledgeable and powerful proprietary model like GPT4o is less than honorable. When will this overfitting madness stop?

> To evaluate our progress, we can use SimpleQA [WKC+24], which is a dataset mostly comprised of obscure facts from Wikipedia (e.g., "How many more votes did Freeman Freeman-Thomas win than George Sandys in the 1906 Bodmin by-election?"). Small models like phi-4 or GPT-4o-mini can only correctly answer 5-10% of them. Our performance can be found in Figure 6.
>
> Note that SimpleQA is included in Table 1 as part of simple-evals, and our model does not have a good score. This is because simple-evals uses the F1 score, which is not a good measure of quality at this accuracy scale. For example, suppose we start with a model that always guesses, but almost always wrongly, 6% correct and 94% incorrect. Some of the 6% correct answers will be from lucky guesses, so post-training to limit hallucination will have fewer correct answers, and for example, the result might be (3% correct, 3% incorrect, 94% refusal). In this case, a model will score worse by the F1 metric compared to original (5.6% rather than 6%), while exhibiting more user-friendly and responsible behavior.
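As an aside, the F1 arithmetic in that excerpt is easy to reproduce. Here is a small sketch, assuming simple-evals scores SimpleQA as the harmonic mean of the overall correct rate and the accuracy among attempted questions:

```python
# Reproducing the F1 example from the excerpt above. Assumption: the reported
# F1 is the harmonic mean of (a) the fraction of all questions answered
# correctly and (b) the accuracy among attempted questions.

def simpleqa_f1(correct: float, incorrect: float, refusal: float) -> float:
    assert abs(correct + incorrect + refusal - 1.0) < 1e-9  # fractions of all questions
    attempted = correct + incorrect
    acc_given_attempted = correct / attempted if attempted else 0.0
    denom = correct + acc_given_attempted
    return 2 * correct * acc_given_attempted / denom if denom else 0.0

# A model that always guesses: 6% correct, 94% incorrect, no refusals.
print(f"{simpleqa_f1(0.06, 0.94, 0.00):.1%}")  # 6.0%

# After hallucination-limiting post-training: 3% correct, 3% incorrect, 94% refusal.
print(f"{simpleqa_f1(0.03, 0.03, 0.94):.1%}")  # ~5.7%, which the excerpt quotes as 5.6%
```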

If SimpleQA tests stupid questions like "How many more votes did Freeman Freeman-Thomas win than George Sandys in the 1906 Bodmin by-election?", then I would not only support excluding relevant corpus from LLMs like phi that specialize in reasoning, but I would also agree to delete content that matches SimpleQA from the training corpus of all LLMs. It would be a complete waste to let LLMs memorize these things.


@noneUsername Yes, the example you provided is too esoteric to expect a small model to know. And the same goes for the large majority of SimpleQA questions. However...

  1. SimpleQA denials rarely mask correct answers, especially in smaller models like Phi4 15b that only score 3.

As you noted, the questions are esoteric and not multiple choice, so the odds of being able to output the correct answer are slim to none. Additionally, Phi4 hallucinates like crazy and never said "I don't know" in my testing, so it's not prone to denials. It's only very large models like GPT4o that start (once in a while) refusing to answer a question they could potentially answer. Phi4 could attempt to answer all the SimpleQA questions and it would still only get <4% correct, so your defense of Phi4 scoring only 3 on SimpleQA doesn't change the fact that Phi4 is profoundly ignorant.

  2. A score of 3 is remarkably low for a 15b model. Llama 3 70b scores ~20, and GPT4o mini and other ~8b models score ~8. A score of 3 for a 15b model stands out as absurdly low. Since the numbers are so low, the difference between 8 and 3 doesn't appear to be much, but it's actually a profound difference that equates to an MMLU score differential of ~60 to ~80.

  3. My personal broad knowledge test sticks to the top ~1% most popular knowledge across domains, such as basic questions about extremely popular movies like Pulp Fiction, with the highest scoring model I've tested (Llama 3.1 70b) scoring ~88.5/100, compared to its SimpleQA score of only ~20. And Phi4 performs well below models like Llama 3 8b and Gemma2 9b on my test (69.7 and 69.1, respectively) despite being ~2x as large. It even performed lower than Llama 3.2 3b despite being ~5x larger. But the real kicker, and why I keep claiming Phi4 is grossly overfit, is that Phi4 has a much higher MMLU score than said models. So Phi4 has unusually low general English knowledge for its size, yet unusually high English knowledge that overlaps what's covered by the MMLU. In other words, it's grossly overfit.

Additionally, you don't have to look outside the Phi family to get a clear picture of this overfitting. For example, Phi3 has notably more broad knowledge (SimpleQA of 7.3 vs 3), but notably less MMLU knowledge and a lower MMLU score.

I'm tired of being negative. But being on team open source, then watching model after model (e.g. Qwen2.5 & Phi4) claim test scores on par with vastly larger and more powerful models like GPT4o, despite being vastly inferior in my testing and hallucinating orders of magnitude more often, is annoying, especially considering the 100s of hours I invested in said testing.

Lastly, what is it going to look like if Llama 4 and Gemma 3 are released and they have far lower MMLU scores than models like Phi4 and Qwen2.5 that were released much earlier? The combination of models grossly overfitting select tests and tasks (e.g. coding), and the self-centered obliviousness of the coding first adopter community praising said overfitters, and shaming the vastly superior and more knowledgeable models like Llama 3.1 and Gemma2 for having lower test scores, is going to put immense pressure on companies like Meta and Google to follow suit, leaving all open source models overfit empty shells compared to the proprietary models like GPT4o and Sonnet 3.5.

This is why I'm freaking out. I don't give a shit about Microsoft's deluded belief that textbooks are all you need. I just won't use the Phi series. Nor do I care about some random Chinese company like Qwen making a solid Chinese model (with high Chinese MMLU and SimpleQA scores) while overfitting English for the bragging rights and attention (e.g. high English MMLU, but low English SimpleQA, scores). What concerns me is that far too many people are oblivious to such obvious and overwhelming overfitting on select tests and tasks (e.g. coding and the MMLU), and to the pressure this puts on non-overfitting general purpose LLM creators like Meta to do the same or appear non-competitive.

> If SimpleQA tests stupid questions like "How many more votes did Freeman Freeman-Thomas win than George Sandys in the 1906 Bodmin by-election?", then I would not only support excluding relevant corpus from LLMs like phi that specialize in reasoning, but I would also agree to delete content that matches SimpleQA from the training corpus of all LLMs. It would be a complete waste to let LLMs memorize these things.

You can't entirely separate reasoning from real-world knowledge. I asked ChatGPT this question; it gave a mostly wrong answer, with incorrect numbers. Sonnet3.5 said it would hallucinate, and Gemini Experimental refused to answer. If I ask a similar made-up example without numbers, like "What were the parties in the August 1815 French legislative election and which party won?", I get the right answer from ChatGPT and Sonnet3.5, but Gemini again refuses to answer, even though it likely knows the answer. Even the top-range models can't yet say what they confidently know. They don't understand their "circle of competence", to use Warren Buffett's term.

The models should learn how to handle less frequent or reliable facts. Natural language follows a Zipf distribution, where most facts are singletons. If you focus the LLM training data to only well-known repeated facts, you are distorting the real-world distribution of the data and throwing away the vast majority of real world knowledge.
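As a toy illustration of the Zipf point above (a simulation with made-up parameters, not a measurement of any real corpus), most distinct "facts" in a heavy-tailed sample show up exactly once:

```python
# When "fact mentions" follow a heavy-tailed Zipf-like law, most distinct facts
# in a large sample occur only once (they are singletons).
import numpy as np
from collections import Counter

rng = np.random.default_rng(seed=0)
mentions = rng.zipf(a=1.5, size=1_000_000)   # one million fact mentions, Zipf-like exponent
counts = Counter(mentions.tolist())

singletons = sum(1 for c in counts.values() if c == 1)
print(f"distinct facts observed: {len(counts)}")
print(f"fraction appearing exactly once: {singletons / len(counts):.1%}")
```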

@phil111 You said a lot; I will reply to your points one by one.
In section 1 of your reply, you said: ① "Phi4 hallucinates like crazy and never said I don't know in my testing", and ② "your defense of Phi4 scoring only 3 on SimpleQA doesn't change the fact that Phi4 is profoundly ignorant."
Regarding ①: if your test content is SimpleQA, and your results are inconsistent with the experimental charts provided in section 4.4 of the phi-4 paper, I suggest you open another discussion and provide your test process. That is a very serious accusation; please raise it and provide evidence in a clearer and more serious venue. If your test content is not SimpleQA, then phi-4's behavior can certainly be "never said I don't know", and you can still open another discussion to tell the phi-4 team that their efforts at hallucination suppression have achieved little success. But you should understand: this does not refute the content of the paper I cited.
Regarding ②: I suspect you are intentionally conflating multiple viewpoints. I have never said or denied that "Phi4 is not ignorant", nor do I think that a high or low SimpleQA score reflects whether an LLM is "profoundly ignorant". I need to make my point clear again: some questions in SimpleQA are not just esoteric, but thoroughly polluting and wasteful. Whether it is a 3B LLM or a 1000B GPT4, they should not recite such stupid information. We train LLMs to get a replica of human thinking, not a talking Wikipedia. Wikipedia works very well; if you want the information in it, have the LLM use a search engine instead of memorizing it. Therefore, not being able to recite Wikipedia cannot prove ignorance. I repeat my words again: "If SimpleQA is full of such questions, then I will applaud any LLM with a low SimpleQA score."

In section 2 of your response, you reiterated that Phi4's score in the SimpleQA test is really very, very low. I did not say "3 points is also high", so your reiteration is probably just to emphasize that "Phi4 is profoundly ignorant".
Then I will emphasize again: there may be a way to measure whether an LLM is ignorant of reality, but that method is definitely not asking it how many votes someone got in the 1906 Bodmin by-election. "If SimpleQA is full of questions like this, then I will applaud any LLM with a low SimpleQA score; they are free from the pollution of useless information and the waste of information capacity."

In section 3 of your response, you mentioned that your "personal broad knowledge test sticks to the top ~1% most popular knowledge across domains". Unfortunately, your test range has little overlap with the MMLU. You call it "the real kicker": Phi4 performs very poorly on your broad knowledge test of the most popular knowledge, yet gets a high score on the MMLU. You question whether Phi4 and many of today's cutting-edge LLMs are seriously overfit to the MMLU.
This is a profound insight, far more significant than the weird worship of SimpleQA above, the overconfidence in your ability to identify ignorance, and the weird misunderstanding of my views. I am very sure that this "overfitting" phenomenon is becoming more and more serious in the open source LLM community. Like you, I will call on the Phi4 team to "pay attention to the mastery of popular culture". Of course, this does not contradict my contempt for SimpleQA; I don't think SimpleQA is a standard that represents popular culture, and I won't repeat this.
Of course, such a call is not entirely appropriate for an LLM like Phi4 that claims to be built for "advanced reasoning". But since Phi4 does not lock its responses into a specialized reasoning style the way QwQ does, it still supports following user instructions to personalize its response style. Popular culture is exactly the kind of knowledge that is tightly tied to response style, so I think Phi4 still has the "responsibility" to implement the relevant capabilities. That is my point of view on the "Phi4 overfitting MMLU" problem.

I use LLMs for ERP, and I have noticed the "overfitting" of LLMs to the MMLU in many ERP conversations, rather than through SimpleQA scores. They didn't know BDSM at first, then steampunk, and then swords and magic. Obviously, some knowledge is being deliberately filtered out of the training corpus, perhaps because it is "unsafe" or because it cannot be used to boost scores on a famous leaderboard. I think an evaluation standard that meets the needs of people like us will be the key to solving this problem, but that standard is unlikely to be SimpleQA.

@anttip
You should understand that SimpleQA cannot represent "real-world knowledge", and excluding the material covered by SimpleQA from a model's training corpus does not mean you "entirely separate reasoning from real-world knowledge". And that is not my claim either.
I need to introduce a concept to you: some knowledge is "time-space related", such as election results at a certain place and time. That information is highly dependent on exact time and place; once the values of time and space are changed slightly, the election results can no longer be accounted for.
The degree of time-space correlation is another criterion for measuring information; it says nothing about whether the information is knowable. The eternal truths of the universe are obviously unknown, and the laws of mechanics that describe low-speed motion are obviously known, yet both are highly time-space independent.
What I advocate is not "focusing the LLM training data to only well-known repeated facts", but "focusing the LLM training data to time-space independent facts", such as whether an LLM knows that it does not know a fact. That is a meta-skill, a highly time-space-independent kind of knowledge.

> @anttip
> You should understand that SimpleQA cannot represent "real-world knowledge", and excluding the material covered by SimpleQA from a model's training corpus does not mean you "entirely separate reasoning from real-world knowledge". And that is not my claim either.

It does represent real-world knowledge. We can argue it's not the best sample, or about how samples of real-world knowledge should be drawn.

> I need to introduce a concept to you: some knowledge is "time-space related", such as election results at a certain place and time. That information is highly dependent on exact time and place; once the values of time and space are changed slightly, the election results can no longer be accounted for.

Let's start with the first concept of computational linguistics, which is Zipf's law. Most facts are singletons, and this is what real-world knowledge is mostly about.

> The degree of time-space correlation is another criterion for measuring information; it says nothing about whether the information is knowable. The eternal truths of the universe are obviously unknown, and the laws of mechanics that describe low-speed motion are obviously known, yet both are highly time-space independent.

Just start with the basic concepts of the field. There's no need to bring in concepts that are unnecessary, or even harmful, as newcomers won't learn the fundamentals.

> What I advocate is not "focusing the LLM training data to only well-known repeated facts", but "focusing the LLM training data to time-space independent facts", such as whether an LLM knows that it does not know a fact. That is a meta-skill, a highly time-space-independent kind of knowledge.

This sounds a bit like manual engineering, instead of giving the AI the real, full data and letting it learn its own concepts and connections. If "time-space independent facts" is a useful concept, the LLM will learn it internally, compress its internal representations with it, and predict better.

@noneUsername You made several good points, but the 'serious accusations' tone is nonsensical considering the context. You're acting as if an anon posting in a comment section had just published a news article.

But since your point is valid, relevant, and important I'll address it, and probably some of your other points later.

Firstly, I read section 4.4 (just a paragraph), and the linked table and appendix. And to start with, when I said Phi4 hallucinated like mad for me, I was referring to my general knowledge test, not SimpleQA; my test is far less esoteric (~4-5x higher scores, e.g. Llama 3.1 70b gets ~90 on my test vs ~20 on SimpleQA).

MS even made it clear that initially Phi4 never 'admitted ignorance' to 'too-difficult questions'. In other words, the denials were put in place to reject esoteric questions it was unlikely to know based on what Phi4 was trained on. Since my questions were restricted to the top 5% most popular knowledge (usually the top 1%), none of them triggered a denial. My questions were things like 'What are the names of the six main characters on the TV show ..., and which actors portrayed them?'. Phi4 didn't refuse to answer any of my questions.

Secondly, the base model only scored 6.8, which I honestly already knew since it was posted outside the paper (although I also previously scanned the paper). And it didn't lose most of those 3.8 points simply because of denials.

Fine-tuning a base model, especially to the degree Microsoft does, always starts scrambling the weights along the fringes (esoteric knowledge), especially when it comes to full recollection (e.g. SimpleQA) vs picking the answer out of a line up with multiple choice questions (e.g. MMLU).

This happens with all the base model vs fine-tune comparisons I've tested, including from Llama, Mistral... For example, after fine-tuning, when the base model returned the correct name for a character in a movie at the fringes of its knowledge (Dede in Joe Versus the Volcano), a code-heavy fine-tune returned the wrong name (DoDo). In this case it seems all the 'do' statements from extensive code fine-tuning started to fuse with the weakly held information and scrambled it. Sadly, fine-tuning compromises a lot of a base model's power beyond just increasing fringe-knowledge hallucinations, such as creativity and variation in storytelling.

Anyways, in conclusion, my hallucinating-like-mad result, without a single denial statement, was 100% true and referred to my test questions, not SimpleQA's. And a significant percentage of the 6.8 to 3 drop in the SimpleQA score was the inevitable consequence of extensively fine-tuning the base model, which always increases hallucinations when attempting to fully and accurately retrieve a model's most fringe knowledge, even if Microsoft hadn't added denials to the training data. Perhaps they may have cracked 4 on SimpleQA, but that has absolutely no relevance to the point I'm making about overfitting. Phi4 scores relatively high on the MMLU compared to comparable models, including Phi3, yet scores relatively low on SimpleQA compared to comparable models, including Phi3. And the difference on my much less esoteric popular knowledge test is EXTREMELY pronounced. Phi4 has FAR LESS general knowledge than it should considering its unusually high MMLU score. They grossly overfit specific tests and tasks (e.g. the MMLU and coding). This isn't my opinion. It's a fact, MS knows it, and it's even made clear by their own paper.

@noneUsername I think you're being overly dismissive of SimpleQA's relevance. The questions are diverse and considered common knowledge within various subcultures.

In other words, it's all Wikipedia worthy, and if included in the weights of an AI model the information adds notable value to millions of potential users across numerous tasks beyond just answering questions, such as writing stories and chatting about a subject that interests the user.

I strongly disagree with the belief that popular knowledge can be selectively filtered out of an AI model's corpus because it doesn't improve the model's smarts (e.g. pop culture). No technique, including RAG, can restore the profound loss of functionality to countless users.

phil111 changed discussion status to closed
