Law of the Weakest Link: Cross Capabilities of Large Language Models
Abstract
The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that real-world tasks often require, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, of 58 cross-capability scores from 17 models, 38 are lower than all of the corresponding individual-capability scores, while the remaining 20 fall between the strongest and weakest individual capabilities, though closer to the weaker one. These results highlight the underperformance of LLMs on cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
Community
We discuss “cross capabilities” and the “Law of the Weakest Link” in Large Language Models (LLMs):
🔹 Cross capabilities: the intersection of multiple distinct capabilities across different types of expertise necessary to address complex, real-world tasks.
🔹 Law of the Weakest Link: cross-capability performance is limited by the weakest underlying capability. Identifying and strengthening these weakest links is key to tackling complex challenges.
We’re also releasing a comprehensive taxonomy for LLM capabilities—spanning 76 primary categories and 332 subcategories—plus a CrossEval benchmark of 7 individual capabilities and 7 cross capabilities with 8,400 expert ratings and detailed explanations:
🗂️ Individual Capabilities: Core skills like English, reasoning, coding, image recognition, tool use, long context, and multilinguality.
🗂️ Cross Capabilities: Complex combinations such as coding & reasoning, image recognition & reasoning, tool use & coding, tool use & reasoning, long context & coding, multilingual & image recognition, and more. Example: “What’s the 10-year trend for total rainfall in Tokyo?”—requires both tool use & reasoning.
🗂️ Taxonomy & CrossEval Benchmark: Includes 1,400 expert-annotated prompts based on the taxonomy and 4,200 model responses, each rated by two experts with explanations across easy, medium, and hard levels. Designed by a diverse team of researchers, engineers, data scientists, product managers, and content designers.
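The "Law of the Weakest Link" analysis above boils down to comparing each cross-capability score against its two component scores. A minimal sketch of that comparison (the `classify` helper and the numeric scores are hypothetical illustrations, not the paper's actual evaluation code):

```python
def classify(cross: float, cap_a: float, cap_b: float) -> str:
    """Relate a cross-capability score to its two component scores."""
    weak, strong = sorted((cap_a, cap_b))
    if cross < weak:
        return "below weakest"      # lower than both individual capabilities
    if cross > strong:
        return "above strongest"
    # Between the two: is it closer to the weaker or the stronger capability?
    return "closer to weaker" if (cross - weak) <= (strong - cross) else "closer to stronger"

# Hypothetical example: Coding = 72, Reasoning = 85, Coding & Reasoning = 70
print(classify(70, 72, 85))  # -> below weakest
```

Under the paper's finding, most cross-capability scores would land in the "below weakest" bucket, and most of the rest in "closer to weaker".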
➡️ Paper, Data, Benchmark, & Code: https://www.llm-cross-capabilities.org