Nice try, but the model is just not what it pretends to be...
It answers the question "How many 'r's are there in the word strawberry?" perfectly, even with no reasoning. I like this test question for its simplicity and because even large models struggle with it. Your 7B model gives the correct answer even without reasoning. However, when we slightly change the question to "How many 'e's are there in the word blueberry?", it gives a wrong answer even when we ask for reasoning, which is direct proof that this model is not what it pretends to be.
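For reference, the ground-truth answers to both questions are easy to verify directly (a trivial Python check, independent of any model):

```python
# Ground-truth letter counts for the two test questions
print("strawberry".count("r"))  # 3
print("blueberry".count("e"))   # 2
```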
Thank you for your attention.
We tried the case you mentioned. We found that with greedy decoding, the model indeed misremembers the spelling of "blueberry": the letter sequence it actually works with is "b, l, u, e, e, r, b, e, r, r, y", and it counts 3 'e's, which is the correct count for that misspelled sequence (the true answer for "blueberry" is 2). This suggests the error lies in the model's recall of the word rather than in the counting step itself, but it does indicate that there are still some flaws in the model's overall reasoning.
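A minimal check makes the distinction explicit: the model's count is consistent with the sequence it recalled, not with the real word:

```python
# Letter sequence the model recalled under greedy decoding
recalled = ["b", "l", "u", "e", "e", "r", "b", "e", "r", "r", "y"]
print(recalled.count("e"))     # 3 -> correct for the recalled (misspelled) sequence
print("blueberry".count("e"))  # 2 -> correct answer for the actual word
```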
However, we also explored pass@k accuracy using a sampling temperature of 0.7 and found that, across 2 attempts, the model output the correct answer each time.
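For concreteness, here is a minimal sketch of such a sampling probe using the Hugging Face transformers API; the model ID below is a placeholder (not the real model name), and the string-match correctness check is a simplification, not our actual evaluation code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "our-org/our-7b-model"  # placeholder, not the real model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "How many 'e's are there in the word blueberry?"
inputs = tokenizer(prompt, return_tensors="pt")

k = 2  # number of sampled attempts
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,        # sampling instead of greedy decoding
        temperature=0.7,
        num_return_sequences=k,
        max_new_tokens=256,
    )

answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Crude correctness check: the true count of 'e's in "blueberry" is 2
print(sum("2" in a for a in answers), "of", k, "attempts correct")
```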
Below are screenshots of our outputs.