arxiv:2409.12917

Training Language Models to Self-Correct via Reinforcement Learning

Published on Sep 19
Submitted by akhaliq on Sep 20
#1 Paper of the day
Abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
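
To make the reward-bonus idea from the abstract concrete, here is a minimal sketch, in plain Python, of how a two-attempt episode could be scored with a bonus that amplifies self-correction. The function names, the binary verifier, and the coefficient alpha are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch only, not the authors' implementation.
# Assumes a correctness verifier r_hat(y, y_star) returning 0.0 or 1.0.

def shaped_rewards(y1, y2, y_star, r_hat, alpha=1.0):
    """Score a two-attempt (first try, self-corrected retry) episode.

    The bonus is positive when the second attempt fixes a wrong first
    attempt and negative when a correct first attempt is broken, which
    pushes the policy toward genuine self-correction rather than merely
    producing a high-reward second response.
    """
    r1 = r_hat(y1, y_star)
    r2 = r_hat(y2, y_star)
    bonus = alpha * (r2 - r1)
    return r1, r2 + bonus
```

Per the abstract, this kind of shaping is applied after a first RL stage that produces a policy initialization less susceptible to collapse, so the bonus amplifies self-correction rather than letting training ignore the first attempt.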

Community


Very interesting paper, and the problem it tackles is interesting in its own right.
LLMs cannot achieve self-correction on their own or via SFT. We have an earlier paper that also argues this point: Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives


I suspect there is something wrong with their SFT/STaR baseline and with the theoretical conclusions they draw from the results. Their method in Section 5 feels a lot closer to simply forcing self-correction for its own sake rather than solving D1/D2 (though I also think it unlikely that those desiderata are even the right problems to solve).

Great paper!

I believe that replacing the REINFORCE optimization algorithm with ReMax may achieve better results:

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
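
For readers unfamiliar with the difference, here is a rough sketch of the two gradient estimators. The callables (`sample_response`, `greedy_response`, `log_prob`, `reward`) are placeholders standing in for the policy and verifier, not any particular library's API.

```python
# Minimal sketch contrasting a REINFORCE-style loss with a ReMax-style loss.
# All callables are assumed placeholders; this is illustrative only.

def reinforce_loss(prompt, sample_response, log_prob, reward, baseline=0.0):
    """Vanilla REINFORCE: advantage is reward minus a separately estimated baseline."""
    y = sample_response(prompt)                # y ~ pi_theta(. | prompt)
    advantage = reward(prompt, y) - baseline   # baseline must be fit or tracked elsewhere
    return -advantage * log_prob(prompt, y)    # negative policy-gradient surrogate

def remax_loss(prompt, sample_response, greedy_response, log_prob, reward):
    """ReMax: use the reward of the greedy decode as a per-prompt baseline."""
    y = sample_response(prompt)
    y_greedy = greedy_response(prompt)         # argmax decoding, no sampling
    advantage = reward(prompt, y) - reward(prompt, y_greedy)
    return -advantage * log_prob(prompt, y)
```

ReMax's baseline is the reward of the greedy decode for the same prompt, so no separate value model or running-average baseline has to be fit.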

I don't believe in their claimed desiderata.

[D1] it should directly train on self-generated traces to alleviate distribution mismatch that affected SFT (Figure 4), and
[D2] self-generated traces employed should prevent a collapse to making minor edits during learning.

STaR already uses self-generated traces, so it satisfies [D1]. The histogram in Figure 3(a) also shows a larger second mode around an edit-distance ratio of ~0.7, whereas SCoRe's edit distances in the same figure are concentrated around small edit-distance ratios (which would be the "minor edits" they claim are bad?).
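
To make the edit-distance discussion concrete, here is a small character-level sketch; the paper may compute its ratio over tokens or with a different normalization, so treat this only as an approximation.

```python
# Hypothetical sketch of an edit-distance ratio between two attempts.
# Ratio near 0 means a near-copy ("minor edits"); near 1 means a large rewrite.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_distance_ratio(first_attempt: str, second_attempt: str) -> float:
    denom = max(len(first_attempt), len(second_attempt), 1)
    return levenshtein(first_attempt, second_attempt) / denom
```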

Here is my blog post explaining this paper: https://ajithp.com/2024/09/23/google-deepind-score/

In equation (1), y_0, ..., y_i are used to generate y_{i+1}, but in equations (3) and (4) they are not present. Is it correct that y_1 is missing from those equations?

Is the following phrasing the closest the authors give to what the reward function is? Or is it obvious given the references to similar work?

"Moreover, we assume access to a reward function / verifier r̂(y, y*) (such as a string-matching based answer checking function) that evaluates correctness of response y by comparing with the oracle response y*."

