Abstract
Code large language models (code LLMs) have made significant strides in code generation. Most previous code-related benchmarks, consisting of programming exercises with corresponding test cases, are used as a common measure of the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences, in which queries should be sampled from practical application scenarios and model-generated responses should satisfy human preferences. To bridge the gap between model-generated responses and human preferences, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. We further propose SynCode-Instruct, a diverse synthetic instruction corpus of nearly 20B tokens built by scaling instructions from the web, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic experiments on 40+ LLMs with CodeArena show a notable performance gap between open-source SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. https://codearenaeval.github.io/
Community
CodeArena: A Benchmark for Optimizing Code Generation and Enhancing User Experience
As developers' reliable assistants, CodeLLMs must generate code that not only meets technical requirements but also provides an intuitive developer experience.
To this end, this paper introduces CodeArena, currently the most comprehensive benchmark for evaluating CodeLLMs' alignment with human preferences, and SynCode-Instruct, a high-quality, large-scale synthetic code-text instruction corpus, marking a major leap in user-experience-oriented code generation.
CodeArena's Major Achievements:
Real-world Challenges: CodeArena carefully selects 397 high-quality samples drawn from actual user queries through rigorous manual annotation and quality control, covering 40 task scenarios and 44 common programming languages. Compared to other benchmarks, it features more diverse problem distributions and more complex real-world scenarios; 39 LLMs have been systematically evaluated on it.
Large-scale Corpus: Starting from highly relevant code-text pairs collected from code-related websites, Qwen2.5-72B was used to generate improved code or text snippets, which were then filtered through a code sandbox and scored by large models, ultimately yielding SynCode-Instruct, roughly 20B tokens of training material (a minimal sketch of this filter-and-score step appears after this list).
User Preference Oriented: SynCoder is obtained by fine-tuning Qwen2.5-Coder-32B on the SynCode-Instruct corpus. The two-stage training process highlights the improvement that high-quality datasets bring to models, ultimately narrowing the performance gap between open-source and closed-source models on both traditional programming tasks and CodeArena.
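The paper does not spell out its filtering scripts here, so the following is only a minimal sketch of the sandbox-filter-and-score step described above, assuming each candidate snippet is executed in an isolated subprocess and then rated by a judge model; `score_with_llm`, the timeout, and the score threshold are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch: keep only synthetic (instruction, code) pairs that
# execute cleanly in a sandboxed subprocess and receive a high judge score.
import subprocess
import tempfile

def runs_in_sandbox(code: str, timeout: float = 5.0) -> bool:
    """Return True if the Python snippet runs without error in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def filter_corpus(pairs, score_with_llm, min_score=7):
    """Keep pairs that both execute and are rated highly by a judge model
    (e.g. Qwen2.5-72B in the pipeline described above)."""
    kept = []
    for instruction, code in pairs:
        if runs_in_sandbox(code) and score_with_llm(instruction, code) >= min_score:
            kept.append((instruction, code))
    return kept
```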
Github: https://github.com/QwenLM/Qwen2.5-Coder/tree/main/qwencoder-eval/instruct/CodeArena
arxiv: https://arxiv.org/abs/2412.05210
hf: https://huggingface.co/datasets/CSJianYang/CodeArena
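For anyone who wants to inspect the benchmark directly, the dataset linked above can be loaded with the `datasets` library; the split and field names printed below depend on how the dataset is organized and are not guaranteed here.

```python
# Load the CodeArena benchmark from the Hugging Face Hub and inspect it.
from datasets import load_dataset

codearena = load_dataset("CSJianYang/CodeArena")
print(codearena)                   # shows the available splits and their sizes
first_split = next(iter(codearena))
print(codearena[first_split][0])   # shows the fields of one sample
```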
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SelfCodeAlign: Self-Alignment for Code Generation (2024)
- CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement (2024)
- StackEval: Benchmarking LLMs in Coding Assistance (2024)
- FullStack Bench: Evaluating LLMs as Full Stack Coders (2024)
- MdEval: Massively Multilingual Code Debugging (2024)
- Effi-Code: Unleashing Code Efficiency in Language Models (2024)
- ProSec: Fortifying Code LLMs with Proactive Security Alignment (2024)