inftyBench: Extending Long Context Evaluation Beyond 100K Tokens
Abstract
Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently a lack of a standardized benchmark to evaluate this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs in processing longer contexts. In this paper, we propose inftyBench, the first LLM benchmark featuring an average data length surpassing 100K tokens. inftyBench comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks in inftyBench are designed to require well understanding of long dependencies in contexts, and make simply retrieving a limited number of passages from contexts not sufficient for these tasks. In our experiments, based on inftyBench, we evaluate the state-of-the-art proprietary and open-source LLMs tailored for processing long contexts. The results indicate that existing long context LLMs still require significant advancements to effectively process 100K+ context. We further present three intriguing analyses regarding the behavior of LLMs processing long context.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? (2024)
- Making Long-Context Language Models Better Multi-Hop Reasoners (2024)
- Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction (2024)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches (2024)
- DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 8
Browse 8 models citing this paperDatasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 6
Collections including this paper 0
No Collection including this paper