Abstract
Large language models (LLMs) have achieved remarkable reasoning performance in various domains. However, we discover a frailty in reasoning tasks: LLMs are surprisingly brittle to the ordering of the premises, even though such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the prompt in the same order as the ground-truth proof (as opposed to a random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning across a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the R-GSM benchmark, based on GSM8K, to examine the ordering effect in mathematical problem solving, and we again observe a significant drop in accuracy relative to the original GSM8K benchmark.
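For readers who want to probe this effect themselves, here is a minimal sketch of the kind of permutation test the abstract describes: build one prompt with the premises in their original ("forward") order and several with the premises shuffled, then compare model accuracy across the variants. The prompt template, premise wording, and example problem below are illustrative assumptions, not the paper's code or data.

```python
import random

def build_prompt(premises, question):
    """Assemble a deductive-reasoning prompt from an ordered list of premises."""
    lines = [f"{i + 1}. {p}" for i, p in enumerate(premises)]
    return "Premises:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"

def permutation_variants(premises, question, n_permutations=5, seed=0):
    """Yield the forward-order prompt plus several shuffled-premise prompts.

    Comparing model accuracy on the forward order versus the shuffled orders
    measures the premise-ordering effect; the underlying task is unchanged.
    """
    rng = random.Random(seed)
    yield "forward", build_prompt(premises, question)
    for k in range(n_permutations):
        shuffled = premises[:]
        rng.shuffle(shuffled)
        yield f"shuffled_{k}", build_prompt(shuffled, question)

# Hypothetical toy problem (not taken from R-GSM):
premises = [
    "If Alice goes to the park, then Bob stays home.",
    "Alice goes to the park.",
    "If Bob stays home, then Carol bakes bread.",
]
for tag, prompt in permutation_variants(premises, "Does Carol bake bread?"):
    print(f"--- {tag} ---\n{prompt}\n")
```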
Community
> In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps.
This is incredibly naive. Neural networks are known for reward hacking and for finding shortcuts. You don't mitigate this by helping the model take shortcuts; you mitigate it by making shortcuts impossible to take, in this case by randomizing the premise order during training. You make the task harder, not easier.
If you make it easier by always presenting the correct order, then the moment the ordering is incorrect, the model will fall back on its shortcuts and performance will drop.
This is telling you that your training regime is not robust enough.
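A minimal sketch of the mitigation suggested above, shuffling premise order when constructing training examples so the model cannot rely on a fixed presentation order. The example structure and field names ("premises", "question", "answer") are assumptions for illustration, not anything from the paper or a specific training pipeline.

```python
import random

def shuffle_premises(example, rng):
    """Return a copy of a training example with its premises in random order.

    `example` is assumed to be a dict holding a list of premise strings under
    "premises"; the question and answer fields are left untouched, since
    reordering the premises does not change the underlying task.
    """
    augmented = dict(example)
    premises = list(example["premises"])
    rng.shuffle(premises)
    augmented["premises"] = premises
    return augmented

# Hypothetical toy example:
rng = random.Random(42)
example = {
    "premises": ["If A then B.", "A.", "If B then C."],
    "question": "Does C hold?",
    "answer": "Yes",
}
print(shuffle_premises(example, rng)["premises"])
```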
You also mention that:
> LLMs are surprisingly brittle to the ordering of the premises
Why is this surprising? This ought to be patently obvious to anyone and everyone. The fact that Google DeepMind finds it surprising is itself surprising.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities (2024)
- Minds versus Machines: Rethinking Entailment Verification with Language Models (2024)
- LLMs for Relational Reasoning: How Far are We? (2024)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs (2024)
- A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models (2024)
Order sensitivity actually occurs, and is mostly studied, in MCQA (multiple-choice question answering). This work builds the R-GSM benchmark, which is more challenging than MCQA, and I'm wondering where researchers can find the released benchmark. It would also be OK if the authors keep the dataset private.
I would appreciate it if anyone could tell me where R-GSM is released.