Abstract
Code large language models (code LLMs) have made significant strides in code generation. Most previous code-related benchmarks, consisting of programming exercises with corresponding test cases, are used as a common measure of the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences, in which queries should be sampled from practical application scenarios and model-generated responses should satisfy human preferences. To bridge the gap between model-generated responses and human preferences, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. We further propose SynCode-Instruct, a diverse synthetic instruction corpus of nearly 20B tokens built by scaling instructions from the web, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic experiments on 40+ LLMs with CodeArena show a notable performance gap between open-source SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. https://codearenaeval.github.io/
Community
CodeArena: A Benchmark for Optimizing Code Generation and Enhancing User Experience
As developers' reliable assistants, CodeLLMs must generate code that not only meets technical requirements but also provides an intuitive developer experience.
To this end, this paper introduces CodeArena, currently the most comprehensive benchmark for evaluating CodeLLMs' alignment with human preferences, and SynCode-Instruct, a high-quality, large-scale synthetic code-text instruction corpus, marking a major leap in user-experience-oriented code generation.
CodeArena's Major Achievements:
Real-world Challenges: CodeArena carefully selects 397 high-quality samples drawn from actual user queries through rigorous manual annotation and quality control, covering 40 task scenarios and 44 common programming languages. Compared to other benchmarks, it features more diverse problem distributions and more complex real-world scenarios; 39 LLMs have been systematically evaluated on it.
Large-scale Corpus: Starting from highly relevant code-text pairs collected from code-related websites, Qwen2.5-72B was used to generate improved code or text snippets, which were then filtered through a code sandbox and scored by large models, ultimately yielding SynCode-Instruct, roughly 20B tokens of training material (a minimal sketch of this filter-and-score step appears after this list).
User Preference Oriented: SynCoder is obtained by fine-tuning Qwen2.5-Coder-32B on the SynCode-Instruct corpus. The two-stage training process highlights the improvement that high-quality datasets bring to models, ultimately narrowing the performance gap between open-source and closed-source models on both traditional programming tasks and CodeArena.
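The paper does not spell out its filtering scripts here, so the following is only a minimal sketch of the sandbox-filter-and-score step described above, assuming each candidate snippet is executed in an isolated subprocess and then rated by a judge model; `score_with_llm`, the timeout, and the score threshold are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch: keep only synthetic (instruction, code) pairs that
# execute cleanly in a sandboxed subprocess and receive a high judge score.
import subprocess
import tempfile

def runs_in_sandbox(code: str, timeout: float = 5.0) -> bool:
    """Return True if the Python snippet runs without error in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def filter_corpus(pairs, score_with_llm, min_score=7):
    """Keep pairs that both execute and are rated highly by a judge model
    (e.g. Qwen2.5-72B in the pipeline described above)."""
    kept = []
    for instruction, code in pairs:
        if runs_in_sandbox(code) and score_with_llm(instruction, code) >= min_score:
            kept.append((instruction, code))
    return kept
```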
Github: https://github.com/QwenLM/Qwen2.5-Coder/tree/main/qwencoder-eval/instruct/CodeArena
arxiv: https://arxiv.org/abs/2412.05210
hf: https://huggingface.co/datasets/CSJianYang/CodeArena
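For anyone who wants to inspect the benchmark directly, the dataset linked above can be loaded with the `datasets` library; the split and field names printed below depend on how the dataset is organized and are not guaranteed here.

```python
# Load the CodeArena benchmark from the Hugging Face Hub and inspect it.
from datasets import load_dataset

codearena = load_dataset("CSJianYang/CodeArena")
print(codearena)                   # shows the available splits and their sizes
first_split = next(iter(codearena))
print(codearena[first_split][0])   # shows the fields of one sample
```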
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SelfCodeAlign: Self-Alignment for Code Generation (2024)
- CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement (2024)
- StackEval: Benchmarking LLMs in Coding Assistance (2024)
- FullStack Bench: Evaluating LLMs as Full Stack Coders (2024)
- MdEval: Massively Multilingual Code Debugging (2024)
- Effi-Code: Unleashing Code Efficiency in Language Models (2024)
- ProSec: Fortifying Code LLMs with Proactive Security Alignment (2024)