How Robust Is Your Model in Complex Code Generation Tasks? 🤔
We've launched the PECC benchmark to challenge chat models in code generation, drawing from Advent of Code for programming tasks and Project Euler for math-heavy challenges. This new task presents problems in both detailed prose and concise "leet code" styles, evaluating models' ability to understand and solve complex coding and math problems in chat-based interactions.
It seems that the Claude 3 models outperform ChatGPT:
Model / Avg. (pass@3)
Claude 3 Haiku / 27.67
GPT-3.5-Turbo / 23.75
Mixtral-8x22B-Instruct-v0.1 / 8.35
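For readers unfamiliar with the pass@3 numbers above: a minimal sketch of the standard unbiased pass@k estimator from the Codex evaluation literature (this is the common formulation; PECC's exact scoring harness may differ in detail):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn (without replacement) from n total generations
    passes, given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 2 correct, evaluated at k=3.
print(pass_at_k(10, 2, 3))
```

With exactly k generations per problem (n = k = 3, as a pass@3 setting might use), the estimator reduces to "at least one of the 3 samples passed."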
Read our Preprint📃: PECC: Problem Extraction and Coding Challenges (2404.18766)
Look at the dataset🔎: PatrickHaller/pecc
We also got accepted at LREC-COLING '24 🎉