arxiv:2501.01821

SDPO: Segment-Level Direct Preference Optimization for Social Agents

Published on Jan 3

Submitted by KAB1314 on Jan 6

Authors: Aobo Kong, Shiwan Zhao

Abstract

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. Turn-level methods are overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs such as GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
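To make the three granularities in the abstract concrete, here is a minimal sketch of how they could differ only in which turns feed the standard DPO loss. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`segment_logp`, `dpo_loss`), the half-open span convention, the `beta` value, and all numbers are hypothetical, and how SDPO actually selects key segments is not described in the abstract.

```python
import math
from typing import Sequence


def segment_logp(turn_logps: Sequence[float], start: int, end: int) -> float:
    """Sum per-turn log-probabilities over the half-open span [start, end)."""
    return sum(turn_logps[start:end])


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical per-turn log-probs of the agent's turns in two sessions
# (chosen vs. rejected), under the policy and under the frozen reference model.
pol_w, ref_w = [-2.0, -1.5, -1.0, -2.5], [-2.0, -2.0, -2.0, -2.5]
pol_l, ref_l = [-2.0, -3.0, -3.5, -2.5], [-2.0, -2.0, -2.0, -2.5]


def loss_over(span: tuple) -> float:
    """DPO loss restricted to the same turn span in both sessions (an assumption)."""
    return dpo_loss(segment_logp(pol_w, *span), segment_logp(pol_l, *span),
                    segment_logp(ref_w, *span), segment_logp(ref_l, *span))


turn_level = loss_over((1, 2))      # a single turn
segment_level = loss_over((1, 3))   # an assumed "key segment" of two turns
session_level = loss_over((0, 4))   # every turn in the session
```

In this toy data the first and last turns are identical across the two sessions, so the session-level loss gains no extra signal from them; in real data such turns would differ in ways unrelated to the preference, which is the training noise the abstract attributes to session-level methods.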

Community

KAB1314

Paper author Paper submitter 3 days ago

Hey everyone, stop whatever you're doing and look over here! We've come up with a brand new way to align multi-turn interactions in LLMs — SDPO! Feel free to discuss, feel free to cite (or just stare in awe)!