Nice try, but the model is just not what it pretends to be...
It answers the question "How many 'r's are there in the word strawberry?" perfectly, even with no reasoning. I like this test question for its simplicity and because even large models struggle with it. Your 7B model gives the correct answer even without reasoning. However, when we slightly change the question to "How many 'e's are there in the word blueberry?", it gives a wrong answer even when we ask for reasoning, which is direct proof that this model is not what it pretends to be.
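For reference, the ground-truth answers to both questions are easy to verify directly (a trivial Python check, independent of any model):

```python
# Ground-truth letter counts for the two test questions
print("strawberry".count("r"))  # 3
print("blueberry".count("e"))   # 2
```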
Thank you for your attention.
We tried the case you mentioned. We found that with greedy decoding, the model indeed misremembers the spelling of "blueberry": the letter sequence it actually works with is "b, l, u, e, e, r, b, e, r, r, y", and it counts 3 'e's, which is the correct count for that misspelled sequence (the true answer for "blueberry" is 2). This suggests the error lies in the model's recall of the word rather than in the counting step itself, but it does indicate that there are still some flaws in the model's overall reasoning.
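A minimal check makes the distinction explicit: the model's count is consistent with the sequence it recalled, not with the real word:

```python
# Letter sequence the model recalled under greedy decoding
recalled = ["b", "l", "u", "e", "e", "r", "b", "e", "r", "r", "y"]
print(recalled.count("e"))     # 3 -> correct for the recalled (misspelled) sequence
print("blueberry".count("e"))  # 2 -> correct answer for the actual word
```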
However, we also explored pass@k accuracy using a sampling temperature of 0.7 and found that, across 2 attempts, the model output the correct answer each time.
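For concreteness, here is a minimal sketch of such a sampling probe using the Hugging Face transformers API; the model ID below is a placeholder (not the real model name), and the string-match correctness check is a simplification, not our actual evaluation code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "our-org/our-7b-model"  # placeholder, not the real model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "How many 'e's are there in the word blueberry?"
inputs = tokenizer(prompt, return_tensors="pt")

k = 2  # number of sampled attempts
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,        # sampling instead of greedy decoding
        temperature=0.7,
        num_return_sequences=k,
        max_new_tokens=256,
    )

answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Crude correctness check: the true count of 'e's in "blueberry" is 2
print(sum("2" in a for a in answers), "of", k, "attempts correct")
```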
Below are screenshots of our outputs.