InternLM3 has open-sourced an 8-billion-parameter instruction model, InternLM3-8B-Instruct, designed for general-purpose usage and advanced reasoning. The model has the following characteristics:

- **Enhanced performance at reduced cost**:
State-of-the-art performance on reasoning and knowledge-intensive tasks, surpassing models like Llama3.1-8B and Qwen2.5-7B. Remarkably, InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale.

- **Deep thinking capability**:
InternLM3 supports both the deep thinking mode for solving complicated reasoning tasks via the long chain-of-thought and the normal response mode for fluent user interactions.
## InternLM3-8B-Instruct
### Performance Evaluation

We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/), covering five capability dimensions: disciplinary knowledge, language, knowledge, reasoning, and comprehension. Part of the evaluation results are shown in the table below; visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more results.

| Benchmark | | InternLM3-8B-Instruct | Qwen2.5-7B-Instruct | Llama3.1-8B-Instruct | GPT-4o-mini (closed source) |
| ------------ | ------------------------------- | --------------------- | ------------------- | -------------------- | ------------------------- |
| General | CMMLU(0-shot) | **83.1** | 75.8 | 53.9 | 66.0 |
| | MMLU(0-shot) | 76.6 | **76.8** | 71.8 | 82.7 |
| | MMLU-Pro(0-shot) | **57.6** | 56.2 | 48.1 | 64.1 |
| Reasoning | GPQA-Diamond(0-shot) | **37.4** | 33.3 | 24.2 | 42.9 |
| | DROP(0-shot) | **83.1** | 80.4 | 81.6 | 85.2 |
| | HellaSwag(10-shot) | **91.2** | 85.3 | 76.7 | 89.5 |
| | KOR-Bench(0-shot) | **56.4** | 44.6 | 47.7 | 58.2 |
| MATH | MATH-500(0-shot) | **83.0*** | 72.4 | 48.4 | 74.0 |
| | AIME2024(0-shot) | **20.0*** | 16.7 | 6.7 | 13.3 |
| Coding | LiveCodeBench(2407-2409 Pass@1) | **17.8** | 16.8 | 12.9 | 21.8 |
| | HumanEval(Pass@1) | 82.3 | **85.4** | 72.0 | 86.6 |
| Instruction | IFEval(Prompt-Strict) | **79.3** | 71.7 | 75.2 | 79.7 |
| | WildBench(Raw Score) | **33.1** | 23.3 | 1.5 | 40.3 |
| | MT-Bench-101(Score 1-10) | **8.59** | 8.49 | 8.37 | 8.87 |
- The evaluation results were obtained from [OpenCompass](https://github.com/internLM/OpenCompass/) (data marked with * were evaluated in Thinking Mode); the evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.

### Requirements

```python
transformers >= 4.48
```

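A hedged install note: the pinned minimum is usually satisfied with an explicit pip upgrade; the exact command below is our suggestion rather than part of the original README.

```bash
# Install or upgrade to a transformers version that supports InternLM3 (>= 4.48)
pip install -U 'transformers>=4.48'
```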
### Conversation Mode
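A minimal sketch of chat-style inference with the standard `transformers` API; the repo id, prompt, and generation settings here are illustrative assumptions, not the original example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id for this model
model_dir = "internlm/internlm3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

messages = [{"role": "user", "content": "Introduce the InternLM project in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Decode only the newly generated tokens
output_ids = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs.shape[1]:], skip_special_tokens=True))
```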
For LMDeploy-based serving of this model, find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.io/).

#### vLLM inference

We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please install vLLM manually from the PR branch:

```bash
git clone -b support-internlm3 https://github.com/RunningLeon/vllm.git
cd vllm  # the editable install below must run inside the cloned repository
pip install -e .
```

Inference code:

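A minimal offline-inference sketch with vLLM's chat API, assuming the patched build installed above; the repo id and sampling settings are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id; requires the patched vLLM build from the step above
llm = LLM(model="internlm/internlm3-8b-instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

prompts = [{"role": "user", "content": "Introduce deep learning in one paragraph."}]
outputs = llm.chat(prompts, sampling_params)
print(outputs)
```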
### Thinking Mode

#### Thinking Demo

<img src="https://github.com/InternLM/InternLM/blob/017ba7446d20ecc3b9ab8e7b66cc034500868ab4/assets/solve_puzzle.png?raw=true" width="400"/>
#### LMDeploy inference

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

```bash
pip install lmdeploy
```
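A minimal offline-pipeline sketch for trying the thinking mode with LMDeploy; the repo id, prompt, and the inline system prompt below are illustrative assumptions (the official thinking-mode system prompt ships with the original model card).

```python
from lmdeploy import pipeline, GenerationConfig

# Illustrative stand-in only; use the official thinking-mode system prompt in practice
thinking_system_prompt = "You are an expert reasoner. Think step by step before giving the final answer."

# Assumed Hugging Face repo id
pipe = pipeline("internlm/internlm3-8b-instruct")
messages = [
    {"role": "system", "content": thinking_system_prompt},
    {"role": "user", "content": "How many r's are in the word strawberry?"},
]
response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
print(response)
```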

---

InternLM3, the third generation of the InternLM (书生·浦语) series, open-sources an 8-billion-parameter instruction model designed for general-purpose use and advanced reasoning, with the following characteristics:

- **Higher performance at lower cost**:
Top-tier performance on reasoning and knowledge-intensive tasks among models of the same scale, surpassing Llama3.1-8B and Qwen2.5-7B. Notably, InternLM3 is trained on only 4 trillion tokens, cutting training cost by more than 75% compared with models of similar scale.

- **Deep thinking capability**:
InternLM3 supports both a deep thinking mode for solving complex reasoning tasks via long chains of thought and a general response mode for smoother user interaction.
#### Performance Evaluation
We conducted a comprehensive evaluation of InternLM with the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/), covering five capability dimensions: disciplinary knowledge, language, knowledge, reasoning, and comprehension. Part of the results are shown in the table below; visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.
| Benchmark \ Model | | InternLM3-8B-Instruct | Qwen2.5-7B-Instruct | Llama3.1-8B-Instruct | GPT-4o-mini (closed source) |
| ------------ | ------------------------------- | --------------------- | ------------------- | -------------------- | ------------------------- |
| General | CMMLU(0-shot) | **83.1** | 75.8 | 53.9 | 66.0 |
| | MMLU(0-shot) | 76.6 | **76.8** | 71.8 | 82.7 |
| | MMLU-Pro(0-shot) | **57.6** | 56.2 | 48.1 | 64.1 |
| Reasoning | GPQA-Diamond(0-shot) | **37.4** | 33.3 | 24.2 | 42.9 |
| | DROP(0-shot) | **83.1** | 80.4 | 81.6 | 85.2 |
| | HellaSwag(10-shot) | **91.2** | 85.3 | 76.7 | 89.5 |
| | KOR-Bench(0-shot) | **56.4** | 44.6 | 47.7 | 58.2 |
| MATH | MATH-500(0-shot) | **83.0*** | 72.4 | 48.4 | 74.0 |
| | AIME2024(0-shot) | **20.0*** | 16.7 | 6.7 | 13.3 |
| Coding | LiveCodeBench(2407-2409 Pass@1) | **17.8** | 16.8 | 12.9 | 21.8 |
| | HumanEval(Pass@1) | 82.3 | **85.4** | 72.0 | 86.6 |
| Instruction | IFEval(Prompt-Strict) | **79.3** | 71.7 | 75.2 | 79.7 |
| | WildBench(Raw Score) | **33.1** | 23.3 | 1.5 | 40.3 |
| | MT-Bench-101(Score 1-10) | **8.59** | 8.49 | 8.37 | 8.87 |
- The evaluation results above were obtained with [OpenCompass](https://github.com/internLM/OpenCompass/) (entries marked with `*` were evaluated in deep thinking mode); see the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/) for evaluation details.
- Scores may vary across versions of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results from [OpenCompass](https://github.com/internLM/OpenCompass/).
**Limitations:** Although we pay close attention to model safety during training and strive to make the model produce text that complies with ethical and legal requirements, the model may still generate unexpected outputs due to its size and probabilistic generation paradigm; for example, responses may contain bias, discrimination, or other harmful content. Please do not propagate such content. This project assumes no responsibility for any consequences arising from the dissemination of harmful information.

#### Requirements

```python
transformers >= 4.48
```

#### Conversation Mode
##### Transformers Inference
##### LMDeploy Inference

LMDeploy is a complete toolkit for lightweight compression, deployment, and serving of LLMs.

```bash
pip install lmdeploy
```
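Beyond the offline pipeline, LMDeploy can also expose the model through an OpenAI-compatible server; a hedged sketch, where the repo id and port are illustrative assumptions:

```bash
# Launch an OpenAI-compatible API server (assumed repo id and port)
lmdeploy serve api_server internlm/internlm3-8b-instruct --server-port 23333
```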
#### Deep Thinking Mode

##### Deep Thinking Demo

<img src="https://github.com/InternLM/InternLM/blob/017ba7446d20ecc3b9ab8e7b66cc034500868ab4/assets/solve_puzzle.png?raw=true" width="400"/>

##### Transformers Inference

```python
# Assumes model, tokenizer, and thinking_system_prompt are defined beforehand.
model = model.eval()

messages = [
    {"role": "system", "content": thinking_system_prompt},
    # The demo prompt is a calculus problem: for f(x) = e^x - ax - a^3,
    # (1) find the tangent line to y = f(x) at (1, f(1)) when a = 1;
    # (2) if f(x) has a local minimum smaller than 0, find the range of a.
    {"role": "user", "content": "已知函数\(f(x)=\mathrm{e}^{x}-ax - a^{3}\)。\n(1)当\(a = 1\)时,求曲线\(y = f(x)\)在点\((1,f(1))\)处的切线方程;\n(2)若\(f(x)\)有极小值,且极小值小于\(0\),求\(a\)的取值范围。"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
```
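A hedged continuation for generating and decoding the response from `tokenized_chat`; the generation budget is an illustrative assumption, set generously because long chains of thought need room.

```python
# Continuation sketch: generate and decode only the new tokens (settings are assumptions)
tokenized_chat = tokenized_chat.to(model.device)
output_ids = model.generate(tokenized_chat, max_new_tokens=8192)
print(tokenizer.decode(output_ids[0][tokenized_chat.shape[1]:], skip_special_tokens=True))
```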

##### LMDeploy Inference

```python
# Assumes pipeline, GenerationConfig, model_dir, chat_template_config, and
# thinking_system_prompt are defined beforehand.
pipe = pipeline(model_dir, chat_template_config=chat_template_config)

messages = [
    {"role": "system", "content": thinking_system_prompt},
    {"role": "user", "content": "已知函数\(f(x)=\mathrm{e}^{x}-ax - a^{3}\)。\n(1)当\(a = 1\)时,求曲线\(y = f(x)\)在点\((1,f(1))\)处的切线方程;\n(2)若\(f(x)\)有极小值,且极小值小于\(0\),求\(a\)的取值范围。"},
]

response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
print(response)
```

##### vLLM Inference

```python
# Assumes llm (a vllm.LLM instance), sampling_params, and thinking_system_prompt
# are created beforehand; the system message below is reconstructed.
prompts = [
    {
        "role": "system",
        "content": thinking_system_prompt,
    },
    {
        "role": "user",
        "content": "已知函数\(f(x)=\mathrm{e}^{x}-ax - a^{3}\)。\n(1)当\(a = 1\)时,求曲线\(y = f(x)\)在点\((1,f(1))\)处的切线方程;\n(2)若\(f(x)\)有极小值,且极小值小于\(0\),求\(a\)的取值范围。"
    },
]
outputs = llm.chat(prompts, sampling_params)  # sampling_params assumed from the earlier setup
print(outputs)
```
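Each element of `outputs` is a vLLM `RequestOutput`; in vLLM's standard API the generated text lives in `outputs[0].outputs[0].text`, so printing the whole object is mainly useful for quick inspection.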