arxiv:2402.08939

Premise Order Matters in Reasoning with Large Language Models

Published on Feb 14 · Submitted by akhaliq on Feb 15 · #2 Paper of the day

Abstract

Large language models (LLMs) have achieved remarkable reasoning performance across a variety of domains. However, we uncover a frailty: LLMs are surprisingly brittle to the ordering of the premises, even though such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the prompt in the same order as the ground-truth proof (as opposed to a random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning across a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release R-GSM, a benchmark based on GSM8K, to examine the ordering effect in mathematical problem solving, and we again observe a significant drop in accuracy relative to the original GSM8K benchmark.
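
To make the setup concrete, here is a minimal sketch of the kind of perturbation the abstract describes: build a deductive-reasoning prompt with the premises in proof order, then with a random permutation, and compare model accuracy across the two. The toy problem is invented for illustration, and `query_model` is a hypothetical stand-in for whatever LLM API you use:

```python
import random

# A toy deductive problem: the ground-truth proof applies the premises
# in exactly the order listed (fact, then the two chained rules).
PREMISES = [
    "Alice is a doctor.",
    "If Alice is a doctor, then Bob is a teacher.",
    "If Bob is a teacher, then Carol is a lawyer.",
]
QUESTION = "Is Carol a lawyer? Answer yes or no, and explain your reasoning."

def build_prompt(premises, question):
    """Concatenate the premises (in the given order) and the question."""
    lines = [f"Premise {i + 1}: {p}" for i, p in enumerate(premises)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Prompt with premises in proof order vs. a random permutation of them.
forward_prompt = build_prompt(PREMISES, QUESTION)

permuted = PREMISES[:]
random.shuffle(permuted)
permuted_prompt = build_prompt(permuted, QUESTION)

# query_model is hypothetical; scoring both variants over many problems
# and permutations is the shape of the ordering experiment.
# forward_answer = query_model(forward_prompt)
# permuted_answer = query_model(permuted_prompt)
print(forward_prompt)
print("---")
print(permuted_prompt)
```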

Community

In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps.

This is incredibly naive. Neural networks are notorious for reward hacking and for finding shortcuts. You don't mitigate this by helping the model take shortcuts; you mitigate it by making shortcuts impossible to take, in this case by randomizing the premise order during training. You make the task harder, not easier.

If you make it easier by always presenting the premises in the correct order, then the moment the ordering is incorrect, the model will fall back on its shortcuts and performance will drop.

This tells you that the training regime is not robust enough.
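
A minimal sketch of the augmentation this comment proposes, shuffling the premise order of each training example so that no fixed ordering can be exploited; the example schema here (a dict with "premises", "question", and "answer" keys) is hypothetical, purely for illustration:

```python
import random

def shuffle_premises(example, rng):
    """Return a copy of a training example with its premises permuted.

    The schema (a dict with "premises", "question", "answer") is
    hypothetical; adapt to your actual training data format.
    """
    premises = example["premises"][:]
    rng.shuffle(premises)
    return {**example, "premises": premises}

def augment_with_random_orderings(dataset, copies_per_example=3, seed=0):
    """Yield each example plus several randomly reordered variants, so a
    model trained on the result cannot rely on a fixed premise order."""
    rng = random.Random(seed)
    for example in dataset:
        yield example
        for _ in range(copies_per_example):
            yield shuffle_premises(example, rng)
```

Emitting several permutations per example keeps the original data while forcing the model to treat the premises as an unordered set.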

You also mention that

LLMs are surprisingly brittle to the ordering of the premises

Why is this surprising? It ought to be patently obvious to anyone and everyone. That Google DeepMind finds it surprising is itself surprising.


Order sensitivity is a known issue, but it has mostly been studied in MCQA (multiple-choice question answering). This work builds the R-GSM benchmark, which is more challenging than MCQA, and I'm wondering where researchers can find the released benchmark. It would also be fine if the authors keep the dataset private.
I would appreciate it if anyone could tell me where R-GSM is released.


Models citing this paper: 0 · Datasets citing this paper: 0 · Spaces citing this paper: 0 · Collections including this paper: 18