Abstract
Large language models (LLMs) have achieved remarkable reasoning performance in various domains. However, we discover a frailty in reasoning tasks: LLMs are surprisingly brittle to the ordering of the premises, even though such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the prompt in the same order as the ground-truth proof (as opposed to a random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning across a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the R-GSM benchmark, based on GSM8K, to examine the ordering effect in mathematical problem solving, and we again observe a significant drop in accuracy relative to the original GSM8K benchmark.
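For readers who want to probe this effect themselves, here is a minimal sketch of the kind of permutation test the abstract describes: build one prompt with the premises in their original ("forward") order and several with the premises shuffled, then compare model accuracy across the variants. The prompt template, premise wording, and example problem below are illustrative assumptions, not the paper's code or data.

```python
import random

def build_prompt(premises, question):
    """Assemble a deductive-reasoning prompt from an ordered list of premises."""
    lines = [f"{i + 1}. {p}" for i, p in enumerate(premises)]
    return "Premises:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"

def permutation_variants(premises, question, n_permutations=5, seed=0):
    """Yield the forward-order prompt plus several shuffled-premise prompts.

    Comparing model accuracy on the forward order versus the shuffled orders
    measures the premise-ordering effect; the underlying task is unchanged.
    """
    rng = random.Random(seed)
    yield "forward", build_prompt(premises, question)
    for k in range(n_permutations):
        shuffled = premises[:]
        rng.shuffle(shuffled)
        yield f"shuffled_{k}", build_prompt(shuffled, question)

# Hypothetical toy problem (not taken from R-GSM):
premises = [
    "If Alice goes to the park, then Bob stays home.",
    "Alice goes to the park.",
    "If Bob stays home, then Carol bakes bread.",
]
for tag, prompt in permutation_variants(premises, "Does Carol bake bread?"):
    print(f"--- {tag} ---\n{prompt}\n")
```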
Community
> In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps.
This is incredibly naive. Neural networks are known for reward hacking and for finding shortcuts. You don't mitigate this by helping the model take shortcuts; you mitigate it by making shortcuts impossible to take, in this case by randomizing the premise order during training. You make the task harder, not easier.
If you make it easier by always presenting the correct order, then the moment the ordering is incorrect, the model will fall back on its shortcuts and performance will drop.
This is telling you that your training regime is not robust enough.
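A minimal sketch of the mitigation suggested above, shuffling premise order when constructing training examples so the model cannot rely on a fixed presentation order. The example structure and field names ("premises", "question", "answer") are assumptions for illustration, not anything from the paper or a specific training pipeline.

```python
import random

def shuffle_premises(example, rng):
    """Return a copy of a training example with its premises in random order.

    `example` is assumed to be a dict holding a list of premise strings under
    "premises"; the question and answer fields are left untouched, since
    reordering the premises does not change the underlying task.
    """
    augmented = dict(example)
    premises = list(example["premises"])
    rng.shuffle(premises)
    augmented["premises"] = premises
    return augmented

# Hypothetical toy example:
rng = random.Random(42)
example = {
    "premises": ["If A then B.", "A.", "If B then C."],
    "question": "Does C hold?",
    "answer": "Yes",
}
print(shuffle_premises(example, rng)["premises"])
```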
You also mention that:
> LLMs are surprisingly brittle to the ordering of the premises
Why is this surprising? This ought to be patently obvious to anyone and everyone. The fact that Google DeepMind finds it surprising is itself surprising.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities (2024)
- Minds versus Machines: Rethinking Entailment Verification with Language Models (2024)
- LLMs for Relational Reasoning: How Far are We? (2024)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs (2024)
- A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models (2024)
Order sensitivity actually occurs, and is mostly studied, in MCQA (multiple-choice question answering). This work builds the R-GSM benchmark, which is more challenging than MCQA, and I'm wondering where researchers can find the released benchmark. It would also be OK if the authors keep the dataset private.
I would appreciate it if anyone could tell me where R-GSM is released.