Good model but Bullshit chart and inaccurate numbers

#20
by rombodawg - opened

This chart is fake. I know for a fact that llama-3 scores way higher on HumanEval, and other models on this chart score way higher too. Phi-2 also scored much lower on HumanEval. It's really shameful of you, Microsoft, to pull all this nonsense out of your behind and try to show it off as fact. Really get a grip on reality.

| Benchmark | Shots | Phi-3-Mini-128K-In 3.8b | Phi-3-Small 7b (preview) | Phi-3-Medium 14b (preview) | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b | Mixtral 8x7b | GPT-3.5 (version 1106) |
|---|---|---|---|---|---|---|---|---|---|---|
| MMLU | 5-Shot | 68.1 | 75.3 | 78.2 | 56.3 | 61.7 | 63.6 | 66.5 | 68.4 | 71.4 |
| HellaSwag | 5-Shot | 74.5 | 78.7 | 83.2 | 53.6 | 58.5 | 49.8 | 71.1 | 70.4 | 78.8 |
| ANLI | 7-Shot | 52.8 | 55.0 | 58.7 | 42.5 | 47.1 | 48.7 | 57.3 | 55.2 | 58.1 |
| GSM-8K | 0-Shot; CoT | 83.6 | 86.4 | 90.8 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1 |
| MedQA | 2-Shot | 55.3 | 58.2 | 69.8 | 40.9 | 49.6 | 50.0 | 60.5 | 62.2 | 63.4 |
| AGIEval | 0-Shot | 36.9 | 45.0 | 49.7 | 29.8 | 35.1 | 42.1 | 42.0 | 45.2 | 48.4 |
| TriviaQA | 5-Shot | 57.1 | 59.1 | 73.3 | 45.2 | 72.3 | 75.2 | 67.7 | 82.2 | 85.8 |
| Arc-C | 10-Shot | 84.0 | 90.7 | 91.9 | 75.9 | 78.6 | 78.3 | 82.8 | 87.3 | 87.4 |
| Arc-E | 10-Shot | 95.2 | 97.1 | 98.0 | 88.5 | 90.6 | 91.4 | 93.4 | 95.6 | 96.3 |
| PIQA | 5-Shot | 83.6 | 87.8 | 88.2 | 60.2 | 77.7 | 78.1 | 75.7 | 86.0 | 86.6 |
| SociQA | 5-Shot | 76.1 | 79.0 | 79.4 | 68.3 | 74.6 | 65.5 | 73.9 | 75.9 | 68.3 |
| BigBench-Hard | 0-Shot | 71.5 | 75.0 | 82.5 | 59.4 | 57.3 | 59.6 | 51.5 | 69.7 | 68.32 |
| WinoGrande | 5-Shot | 72.5 | 82.5 | 81.2 | 54.7 | 54.2 | 55.6 | 65.0 | 62.0 | 68.8 |
| OpenBookQA | 10-Shot | 80.6 | 88.4 | 86.6 | 73.6 | 79.8 | 78.6 | 82.6 | 85.8 | 86.0 |
| BoolQ | 0-Shot | 78.7 | 82.9 | 86.5 | -- | 72.2 | 66.0 | 80.9 | 77.6 | 79.1 |
| CommonSenseQA | 10-Shot | 78.0 | 80.3 | 82.6 | 69.3 | 72.6 | 76.2 | 79.0 | 78.1 | 79.6 |
| TruthfulQA | 10-Shot | 63.2 | 68.1 | 74.8 | -- | 52.1 | 53.0 | 63.2 | 60.1 | 85.8 |
| HumanEval | 0-Shot | 57.9 | 59.1 | 54.7 | 59.0 | 28.0 | 34.1 | 60.4 | 37.8 | 62.2 |
| MBPP | 3-Shot | 62.5 | 71.4 | 73.7 | 60.6 | 50.8 | 51.5 | 67.7 | 60.2 | 77.8 |

I'm seeing now that you updated the llama-3 numbers, but previously you posted what's shown below, and it still doesn't change the fact that the Phi-2 numbers are lied about.

Screenshot (656).png

I tested microsoft/Phi-3-mini-128k-instruct, and it is not so good.

The model is not bad. It answers questions like "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" correctly, whereas both the 8b and the 70b Llama 3 cannot answer this correctly straight away. Translation into a foreign language, in my case German, is also acceptable (not perfect) compared to Llama 3.

@zeynel I don't care if it's a good or bad model. That's not what this post is about. If it's a good model, then praise be!

The point is they are lying out of their asses on these charts, and it smells very fishy. Something is up.

Beyond the benchmarks and distortions that may exist, the model seems very good to me. It performs better than Llama-3 in several tasks despite having half the parameters. It's an open-source model, licensed under MIT, so I can download it and run it on an old Mac offline, and it provides good results and incredible speed. I'm happy; Llama-3 and Phi-3 made my month!

It's great news that these models are being released, as it marks a significant step towards introducing AI agents into our daily lives. However, I didn't notice any improvement in phi-3-mini's performance, at least not in Greek. In fact, quite the opposite: llama-3 gives coherent responses while phi-3 does not. (I know its intended purpose is for use in English.)

Totally agree. This model is just fake. It does not come even close to the quality of LLaMA-3. It is obvious that the model was trained mainly on data taken and generated from the tests. HumanEval - 57.9? That's bullshit. HumanEval here is 8-10 at best.
Screenshot from 2024-04-24 15-18-05.png
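For context on how such scores are computed: HumanEval results are usually reported as pass@1 over sampled completions, using the unbiased pass@k estimator from the original HumanEval paper. A minimal sketch in Python (the function name `pass_at_k` is illustrative; the formula itself is the published one):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = total samples per problem, c = samples that passed the tests,
    k = evaluation budget. Computes 1 - C(n-c, k)/C(n, k) in a
    numerically stable product form."""
    if n - c < k:
        # Fewer failures than the budget: at least one pass is guaranteed.
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With k=1 this reduces to the plain pass rate c/n.
print(pass_at_k(200, 50, 1))  # 0.25
```

So a claimed HumanEval score of 57.9 with k=1 simply means ~58% of the 164 problems had a passing completion under the harness's parsing and execution rules, which is exactly where parsing bugs can distort the number.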

I’m from the GenAI team responsible for the phi models, and was involved in running these evals.

It’s true the LLaMA-3 and Phi-2 HumanEval scores in the initial draft were wrong - it was caused by some parsing errors in our pipeline. We updated the results ASAP, but the v1 draft was already live. As you noticed, it was already updated in v2. Sorry for the confusion, it was an honest mistake.

We’re really trying to be open and transparent with the phi models - hence open-sourcing the models - so apologies if this gave the wrong impression.
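Parsing errors of the kind described above usually live in the answer-extraction step that pulls generated code out of a chatty completion before executing the unit tests. The actual pipeline code isn't shown here, so this is purely an illustrative sketch of what such an extractor looks like, and why a wrong regex silently tanks (or inflates) a score:

```python
import re

def extract_code(completion: str) -> str:
    """Pull the first fenced code block out of a model completion;
    fall back to the raw text if no fence is found. A bug here means
    the harness executes garbage and marks the sample as failed."""
    m = re.search(r"```(?:\w+)?\n(.*?)```", completion, re.DOTALL)
    return m.group(1) if m else completion
```

If a model wraps its answer differently than the extractor expects (for example prose before the fence, or no fence at all), every sample fails regardless of code quality, which matches the explanation of the wrong v1 numbers.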

@amin-saied No way in hell phi-2 scored better than mistral-7b and gemma-7b on coding and HumanEval benchmarks, which it does on your current chart. Once again, I call bullshit, and I don't trust a single benchmark for phi-3.

I've tested the phi-2 base model myself; hell, even the instruct finetunes from the community don't hold a candle to mistral or gemma.

rombodawg changed discussion title from Bullshit chart. fake numbers to Good model but Bullshit chart. fake numbers
rombodawg changed discussion title from Good model but Bullshit chart. fake numbers to Good model but Bullshit chart and inaccurate numbers

I have been told that I'm being too harsh because the model that has been released is a good model. So I am changing the name of the post to reflect that. However, I still stand by my argument that the chart is completely inaccurate and misleading.

I don't think you're being too harsh. I test new models all the time, and I'm also fine-tuning models. This one clearly doesn't show results comparable to these numbers.
This is clearly some kind of cheating on the benchmarks. This model shows results comparable to h2oai/h2o-danube2-1.8b-base, but clearly not to LLAMA-3 or Mistral. These figures are overstated by a lot.
None of my personal Phi-3 tests show that Phi-3 has the level of math proficiency that this chart claims, and neither do other similar tests. It's just not a nice thing to do.
Screenshot from 2024-04-24 23-32-36.png

> Once again, I call bullshit, and I don't trust a single benchmark for phi-3.

I'm sorry to hear that you don't trust the numbers we ran. On the plus side, we open-sourced the model so you can run any evals you want yourself. If you think there is an error somewhere, please call it out.

I'd like to give this model a real go, as it's quite interesting (and kudos to you for the MIT license!), but unfortunately, the poor instruction-following makes it useless for automated pipelines. You tell it, "Write [X] in format [Y]. Do not write anything else. Do not include any explanations. Only write [X] in format [Y]", and it just outright ignores the instructions and insists on explaining itself anyway.
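A common workaround for pipelines stuck with a chatty model (not a fix for the model itself) is to salvage the structured payload defensively instead of trusting the reply to be clean. A minimal sketch, assuming the requested format [Y] is JSON; the helper name `parse_strict_json` is made up for illustration:

```python
import json
import re

def parse_strict_json(reply: str):
    """Try to recover a JSON payload from a chatty model reply:
    first parse the whole string; if that fails, fall back to the
    outermost {...} span. Returns None if nothing parses."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    m = re.search(r"\{.*\}", reply, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            return None
    return None

# A reply with unsolicited explanation around the payload still parses:
print(parse_strict_json('Sure! Here you go: {"x": 1} Hope that helps!'))
```

This obviously only papers over the problem; a model that respects "do not write anything else" is still the right fix.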

deleted

After using Phi3 for a while I'd say it's pretty clear the benchmarks are accurate, and Microsoft didn't cheat. But it's also perfectly clear it doesn't perform anywhere near the levels the scores imply.

  1. MMLU: When it comes to the core knowledge typically tested by the MMLU, its score of 68 seems pretty accurate. However, when it comes to pop culture, it's only scoring on par with ~50 MMLU-scoring LLMs like Llama 2 7b. For example, if you ask for a list of main characters and the actors who portrayed them from popular TV shows, the hallucinations are as numerous and egregious as Llama 2 7b's. The same goes for movies, music, and other areas of pop culture.

  2. WinoGrande: Another example is its language skills, which are CLEARLY superior to Llama 3 8b's when writing stories, explaining scientific concepts, and so on. However, absurd things keep happening throughout stories, such as suddenly breaking the flow and going off on tangents.

However, this seems to be a self-inflicted wound. Phi3 is so excessively aligned that whenever it talks its way out of the perfectly harmonious Disney universe where everybody behaves like angels, it will detect things heading in the "wrong" direction and suddenly go down a more acceptable path. I've even witnessed this happening mid-sentence and mid-word. And a few times it interrupted the story to moralize about proper behavior, then continued with the next paragraph.

I'm near certain this is self-inflicted because of the contentious areas in which these sudden shifts tend to occur, but also because it will sometimes continue talking past the end token and go on and on about how essential it is not to offend anyone, not to say anything inappropriate for children, or otherwise to be anything other than a saint. Plus, it did the same when I tricked it into making a list of dirty words: it got halfway through a mild list of curse words and started rambling on and on about "As an AI model by Microsoft... it's important to find better ways to converse... And as an AI model by OpenAI..."

Thanks, Microsoft. This was amazing. But a 3.8b LLM ain't going to be used to build nuclear bombs. Please pull back on the alignment. You turned this LLM into a schizophrenic, moralizing dolt willing to break the flow of stories, and even interrupt them with absurd lecturing, whenever they drift out of a fairy-tale perversion of reality that you've deemed appropriate for every person who may exist in a sea of billions, including thumb-sucking young children and thin-skinned Bible-clutching loons.

Just wait for openchat-llama-3-8b to release. It will blow all of these models out of the water. Should be out soon.

> Just wait for openchat-llama-3-8b to release. It will blow all of these models out of the water. Should be out soon.

8B is not 4B. The LLaMA license isn't the MIT license.

@Nafnlaus Honestly, I don't care. I'm gonna use the model for whatever I want; I don't use models in commercial settings anyway.

> @Nafnlaus Honestly, I don't care. I'm gonna use the model for whatever I want; I don't use models in commercial settings anyway.

You should care at least about the fact that it's 8B and not 4B: double the memory and half the tokens per second.

But as for licenses, it's not "commercial" that's the problem. It's that it's a viral license. It works like this:

  1. It looks open! Anyone can download it and use it! So people do, en masse, including people who create outputs that go into training other models' datasets (Alpaca, Dolphin, etc.).

  2. Since the community trains and merges models many levels deep, it becomes increasingly likely that at some point your model was contaminated with LLaMA-licensed outputs.

  3. The LLaMA license prohibits using its outputs to improve non-LLaMA-licensed models, so it forces the spread of the LLaMA license. A particular model may claim whatever license it wants, but if push comes to shove in court, Meta, with all its legal resources, is going to win.

  4. If that's where it stopped, it'd be bad enough, but then a second part hits once any project makes it big: THEN that project has to license with Meta, and Meta holds all the cards in the negotiation and can charge basically whatever it wants.

The license pollutes the model ecosystem; hence, it gets a thumbs down from me. I mean, if you're just using it to ask questions or write stories, fine. But I advise against using it (or its copycats) to make derivatives, because it just hands Meta a lot of power down the line, from some project that built on a project that built on a project that, many steps down the line, built on yours. I will continue to cheer on those who choose truly open licenses, like this dev team did.

But we're getting really off topic here. The topic of discussion is issues with this model's finetune.

Trust me, no one is gonna be able to tell if people trained a model on Meta's model. Most likely they aren't going to train on the base or instruct model; it's gonna be a finetune. And Meta isn't gonna look through every finetune on huggingface, testing them to see which one a person trained a model on to make another non-Meta model. It's just not gonna happen; you have to be realistic.

@raidhon What tool are you using in that screenshot to run benchmarks? I can see it's Python running in tmux over an ssh session, but I've never seen a benchmarking tool like that.

@aberrio ))
I used https://github.com/EleutherAI/lm-evaluation-harness.
```shell
lm_eval --model hf \
    --model_args pretrained=../Phi-3-mini-128k-instruct/ \
    --tasks arc_easy,arc_challenge,gsm8k,winogrande,hellaswag,mmlu,boolq,piqa,openbookqa,truthfulqa \
    --trust_remote_code
```

@amin-saied I just want to point out that the open-source community is extremely grateful you are open-sourcing this model. I've tested it quite a bit, and in many tasks it indeed comes close to much bigger models, sometimes even surpassing them.

So at least in my testing, the model proved to be as good as the benchmarks claim, which is simply astounding considering its size of just 3.8B. This is the first really good LLM that can be run on a phone, so again, the effort made by Microsoft is very much appreciated.

Please ignore the moodswings of this individual.

rombodawg changed discussion status to closed

> Please ignore the moodswings of this individual.

this triggers me

In all fairness, MS isn't the only one fudging numbers: Gemini, Claude, I'm sure ChatGPT, etc. People want the next best sexy model, but all these evals are in-house, and the labs are going to publish the best numbers because most people won't ever know the actual evals.

That being said, that is why I love the open-source community because collectively we can keep checks on reality. Great thread and appreciate the insights.

Honestly, after seeing how phi-3-medium is completely on crack (not in a good way), I don't regret saying what I said about Microsoft. The numbers absolutely do not line up with anything these guys publish. I'm not the only one either; lots of people have tested phi-3-medium, and it outputs complete garbage.

Thank God for Meta and llama-3 🙏🙏

Who knows, maybe if they posted a base model and let us finetune our own better, not-completely-aligned-to-shit models like theirs, phi-3 wouldn't be so bad. But that probably won't ever happen, considering how restricted Microsoft is by the top dogs in the company. I mean, look how they butchered WizardLM, completely taking down all the repositories just because it wasn't censored. And WizardLM-7b was better than phi-3-14b by a longshot.

deleted

@rombodawg I think Microsoft made a sincere attempt to use exclusively synthetic and highly filtered quality data to create an AI. It was important that someone tried.

But I get what you're saying. L3 8b has broad abilities, while it's hit and miss with Phi3 medium. Most importantly, there are large pockets of information, such as pop culture, sports, and so on, missing from Phi3. On top of which, it keeps falling off the rails; is WAY too censored, such as refusing to define terms I come across on social media because they're not G-rated; only loosely adheres to instructions in user prompts; makes the same absurd story mistakes as Phi 3.8b, apparently because it keeps forcing the same small set of pre-packaged story elements, even when they contradict the user's story instructions; and so on.

The primary lesson here is that if you're going to use nothing but synthetic and high-quality data, you need to represent ALL of humanity (e.g. pop culture & commonly used non-G-rated colloquialisms). The pattern of failure during tasks like Q&A and story writing can be traced back to Phi's ignorance of entire domains of popular knowledge, its refusal to potentially offend anyone, and its refusal to share information that's inappropriate for a 5-year-old.

That does make a lot of sense. A failed attempt at a bad idea for making a good model. Well, now we know exactly why real data is so valuable.

I still don't think this is a bad idea. Especially as we run out of high-quality non-synthetic data to shove into models, it's good that we have large companies attempting new, innovative ways to generate training data.

deleted

@bartwoski Perhaps, but I fail to see how anyone could use Phi3 as a daily driver. I can get answers from a pocket dictionary and Wikipedia that Phi3 not only refuses to disclose, but moralizes about. Most people cuss, have s*x, and so on. Extreme G-rated censorship belies reality, as does never responding to anything remotely contentious, such as politics, religion, or humor at someone's expense. Plus, the lack of even basic knowledge about very popular areas of humanity, such as celebrity news and popular movies, means that the majority of the general (non-nerd) population will be buried in "I don't knows" or hallucinations.

Every time I send a prompt to Phi3 I feel like I'm rolling the dice, even though all said prompts are very popular and not at all illegal. They commonly get denied, moralized over, buried in hallucinations, responded to as if I had asked something else entirely, and so on. What's the point of that? Any general-purpose AI needs to represent all of humanity, not just the ~5% that mostly overlaps with geekdom and academia.

True, they do need to push innovation in different ways, but the question is how? Synthetically generating data alone isn't enough.
