# InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens
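## Introduction
Understanding and processing long text is an essential capability for large language models as they move toward deeper understanding and interaction. Some models now claim to handle sequences of 100k+ tokens, but no standard benchmark exists for that length. To fill this gap, we built InfiniteBench, a benchmark for contexts of 100k+ tokens. It targets five long-context capabilities: retrieval, math, code, question answering, and summarization.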
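## Features
- **Long context:** The test data in InfiniteBench has an average context length of 195k, far longer than existing evaluation datasets.
- **Multi-domain and multilingual:** InfiniteBench comprises 12 tasks in Chinese and English, covering five domains: retrieval, math, code, question answering, and summarization.
- **Forward-looking and challenging:** The InfiniteBench tasks are designed to challenge the strongest current models, such as GPT-4 and Claude 2.
- **Realistic and synthetic scenarios:** InfiniteBench includes real-world data, which probes how well models handle practical problems, and synthetic data, which makes it easy to extend the test data to longer context windows.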
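To illustrate why synthetic data makes it easy to extend the context window, here is a minimal sketch of a passkey-style generator whose context length is a single parameter. This is not the InfiniteBench generation code: the filler sentence, the needle wording, the field names (`context`, `input`, `answer`), and the function `make_passkey_example` are all assumptions made for illustration, and length is measured in whitespace-separated words rather than model tokens.

```python
import random

# Hypothetical filler text; the real benchmark may use different noise.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "


def make_passkey_example(target_words, seed=0):
    """Build one synthetic passkey-retrieval example (illustrative format)."""
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    needle = f"The pass key is {passkey}. Remember it. "

    # Repeat the filler until the context is roughly the requested size.
    filler_words = len(FILLER.split())
    n_repeats = max(1, target_words // filler_words)
    chunks = [FILLER] * n_repeats

    # Hide the key at a random position inside the noise.
    position = rng.randint(0, n_repeats)
    chunks.insert(position, needle)

    return {
        "context": "".join(chunks),
        "input": "What is the pass key?",
        "answer": passkey,
    }


# The same recipe yields a short or a very long example.
short_ex = make_passkey_example(target_words=1_000)
long_ex = make_passkey_example(target_words=100_000)
print(len(short_ex["context"].split()), len(long_ex["context"].split()))
```

Because the answer is known by construction, such examples can be lengthened without any new annotation, which is the convenience the last bullet refers to.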
## Task Composition
| Task Name | Context | # Examples | Avg Input Tokens | Avg Output Tokens | Description |
| -------------------- | ------------- | ---------- | ---------------- | ----------------- | ------------------------------------------------------------------------------------------- |
| En.Sum | Fake Book | 103 | 171.5k | 1.1k | Summarization of a fake book created with core entity substitution. |
| En.QA | Fake Book | 351 | 192.6k | 4.8 | Free-form question answering based on the fake book. |
| En.MC | Fake Book | 229 | 184.4k | 5.3 | Multiple choice questions derived from the fake book. |
| En.Dia | Script | 200 | 103.6k | 3.4 | Identification of talkers in partially anonymized scripts. |
| Zh.QA | New Book | 175 | 2068.6k | 6.3 | Question answering on a set of newly collected books. |
| Code.Debug           | Code Document | 394        | 114.7k           | 4.8               | Finding which function in a code repo contains a crashing error (in multiple choice form).  |
| Code.Run | Synthetic | 400 | 75.2k | 1.3 | Simulating execution of multiple simple, synthetic functions. |
| Math.Calc | Synthetic | 50 | 43.9k | 43.9k | Calculations involving super-long arithmetic equations. |
| Math.Find | Synthetic | 350 | 87.9k | 1.3 | Finding special integers in a lengthy list. |
| Retrieve.PassKey[^1] | Synthetic | 590 | 122.4k | 2.0 | Retrieving hidden keys in a noisy long context. |
| Retrieve.Number | Synthetic | 590 | 122.4k | 4.0 | Locating repeated hidden numbers in a noisy long context. |
| Retrieve.KV[^2] | Synthetic | 500 | 89.9k | 22.7 | Finding the corresponding value from a dictionary and a key. |
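As a concrete picture of a task like Retrieve.KV ("finding the corresponding value from a dictionary and a key"), the sketch below builds a large JSON dictionary and asks for the value of one key. It is an assumption-laden illustration, not the benchmark's own generator: the use of UUID-style strings, the prompt wording, and the function `make_kv_example` are hypothetical.

```python
import json
import random
import uuid


def make_kv_example(n_pairs, seed=0):
    """Build one key-value retrieval example (illustrative format)."""
    rng = random.Random(seed)

    def rand_uuid():
        # Derive UUID-shaped strings from the seeded RNG for reproducibility.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = {rand_uuid(): rand_uuid() for _ in range(n_pairs)}
    key = rng.choice(list(pairs))

    # The whole dictionary is serialized on one line as the long context.
    context = json.dumps(pairs)
    prompt = (
        f"{context}\n"
        f'Key: "{key}"\n'
        "The value associated with the specified key is:"
    )
    return {"prompt": prompt, "key": key, "answer": pairs[key]}


example = make_kv_example(n_pairs=1_000)
print(example["key"], "->", example["answer"])
```

Unlike the passkey setting, every entry here looks like a plausible answer, so the model has to match the exact key rather than spot an out-of-place sentence.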
## Evaluation Results
We evaluated state-of-the-art models on InfiniteBench. The results are as follows:
| Task Name        | GPT-4  | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K |
| ---------------- | ------ | --------------- | --------- | -------- | ---------- | ----------- | ---------------- |
| Retrieve.PassKey | 100%   | 92.71%          | 98.14%    | 97.80%   | 100.00%    | 100.00%     | 92.20%           |
| Retrieve.Number  | 100%   | 56.61%          | 95.42%    | 98.14%   | 94.92%     | 100.00%     | 80.68%           |
| Retrieve.KV      | 89.00% | < 5%            | 53.60%    | 65.40%   | < 5%       | < 5%        | < 5%             |
| En.Sum           | 14.73% | 9.09%           | 17.96%    | 14.50%   | < 5%       | < 5%        | < 5%             |
| En.QA            | 22.44% | 9.55%           | 16.52%    | 11.97%   | 9.20%      | 12.17%      | < 5%             |
| En.MC            | 67.25% | 27.95%          | 72.49%    | 62.88%   | 36.68%     | 38.43%      | 10.48%           |
| En.Dia           | 8.50%  | 7.50%           | 11.50%    | 46.50%   | < 5%       | < 5%        | < 5%             |
| Zh.QA            | 25.96% | 16.98%          | 17.93%    | 9.64%    | 15.07%     | 13.61%      | < 5%             |
| Code.Debug       | 37.06% | < 5%            | 17.77%    | < 5%     | 9.14%      | 13.96%      | 7.36%            |
| Code.Run         | 23.25% | < 5%            | < 5%      | < 5%     | < 5%       | < 5%        | < 5%             |
| Math.Calc        | < 5%   | < 5%            | < 5%      | < 5%     | < 5%       | < 5%        | < 5%             |
| Math.Find        | 60.00% | 17.14%          | 12.57%    | 32.29%   | < 5%       | 25.71%      | 7.71%            |
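Note:
1. The implementation code for YaRN-Mistral-7B is open-sourced in this repository; feedback and corrections are welcome. Kimi-Chat and Claude 2 were evaluated through their user interfaces, and GPT-4 through the API, all using the official default configurations.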
## Evaluation
## Getting the Dataset
From