CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs
Abstract
Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answering benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.
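To make the evaluation format concrete, below is a minimal sketch of how a multiple-choice benchmark like CodeMMLU can be scored: each item is rendered as a lettered prompt, the model's single-letter reply is parsed, and accuracy is computed against the gold labels. The field names (`question`, `choices`, `answer`) and the `model_answer()` stub are illustrative assumptions, not CodeMMLU's actual schema or evaluation harness.

```python
# Minimal sketch of scoring one multiple-choice code-understanding item.
# Field names and the model_answer() stub are hypothetical placeholders.

def format_prompt(question: str, choices: list[str]) -> str:
    """Render an MCQA item as a prompt with lettered options."""
    letters = "ABCDEFGH"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def model_answer(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real model client."""
    return "A"  # stub response

def accuracy(items: list[dict]) -> float:
    """Compare the model's letter choice against the gold answer."""
    correct = 0
    for item in items:
        prompt = format_prompt(item["question"], item["choices"])
        pred = model_answer(prompt).strip().upper()[:1]
        correct += pred == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    demo = [{
        "question": "What does `len([1, 2, 3])` return in Python?",
        "choices": ["2", "3", "4", "It raises a TypeError"],
        "answer": "B",
    }]
    print(f"accuracy = {accuracy(demo):.2f}")
```

In practice the stub would be replaced by a call to the model under evaluation, and robust answer extraction (e.g., taking the first standalone letter in the reply) matters as much as the prompt format itself.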
Community
The dataset can be found here: https://github.com/FSoft-AI4Code/CodeMMLU
Dear authors,
Unfortunately, when I visited the GitHub address above, I could not find the dataset or source code. Have you uploaded the source code, dataset, and related information to the repository? How can we download them?
Best regards,
Geunsik Lim.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? (2024)
- Qwen2.5-Coder Technical Report (2024)
- DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation (2024)
- Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement (2024)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? (2024)