haijunlv committed
Commit 9adf5b1 · verified · 1 Parent(s): 71f8d62

Upload README.md

Files changed (1): README.md (+31 -25)

README.md CHANGED
@@ -35,13 +35,9 @@ InternLM3 has open-sourced an 8-billion parameter instruction model, InternLM3-8
 
 - **Enhanced performance at reduced cost**:
 State-of-the-art performance on reasoning and knowledge-intensive tasks, surpassing models like Llama3.1-8B and Qwen2.5-7B. Remarkably, InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale.
-
 - **Deep thinking capability**:
 InternLM3 supports both a deep thinking mode, for solving complicated reasoning tasks via long chains of thought, and a normal response mode for fluent user interactions.
 
- - **Web browser use**:
- InternLM3 is the first general-purpose LLM in the open-source community to support browser use. Leveraging its deep thinking capability, InternLM3 enables more than 20 steps of web navigation for in-depth information retrieval and summarization.
-
 ## InternLM3-8B-Instruct
 
 ### Performance Evaluation
@@ -50,15 +46,15 @@ We conducted a comprehensive evaluation of InternLM using the open-source evalua
 
 | Benchmark | | InternLM3-8B-Instruct | Qwen2.5-7B-Instruct | Llama3.1-8B-Instruct | GPT-4o-mini (closed source) |
 | ------------ | ------------------------------- | --------------------- | ------------------- | -------------------- | ------------------------- |
- | General | CMMLU (0-shot) | **83.1** | 75.8 | 53.9 | 66.0 |
+ | General | CMMLU(0-shot) | **83.1** | 75.8 | 53.9 | 66.0 |
 | | MMLU(0-shot) | 76.6 | **76.8** | 71.8 | 82.7 |
 | | MMLU-Pro(0-shot) | **57.6** | 56.2 | 48.1 | 64.1 |
 | Reasoning | GPQA-Diamond(0-shot) | **37.4** | 33.3 | 24.2 | 42.9 |
 | | DROP(0-shot) | **83.1** | 80.4 | 81.6 | 85.2 |
 | | HellaSwag(10-shot) | **91.2** | 85.3 | 76.7 | 89.5 |
 | | KOR-Bench(0-shot) | **56.4** | 44.6 | 47.7 | 58.2 |
- | MATH | MATH-500(0-shot Thinking Mode) | **83.0** | 72.4 | 48.4 | 74.0 |
- | | AIME2024(0-shot Thinking Mode) | **20.0** | 16.7 | 6.7 | 13.3 |
+ | MATH | MATH-500(0-shot) | **83.0*** | 72.4 | 48.4 | 74.0 |
+ | | AIME2024(0-shot) | **20.0*** | 16.7 | 6.7 | 13.3 |
 | Coding | LiveCodeBench(2407-2409 Pass@1) | **17.8** | 16.8 | 12.9 | 21.8 |
 | | HumanEval(Pass@1) | 82.3 | **85.4** | 72.0 | 86.6 |
 | Instruction | IFEval(Prompt-Strict) | **79.3** | 71.7 | 75.2 | 79.7 |
@@ -67,10 +63,15 @@ We conducted a comprehensive evaluation of InternLM using the open-source evalua
 | | WildBench(Raw Score) | **33.1** | 23.3 | 1.5 | 40.3 |
 | | MT-Bench-101(Score 1-10) | **8.59** | 8.49 | 8.37 | 8.87 |
 
- - The evaluation results were obtained from [OpenCompass](https://github.com/internLM/OpenCompass/) (some data, marked with *, come from the original papers), and the evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
+ - The evaluation results were obtained from [OpenCompass](https://github.com/internLM/OpenCompass/) (data marked with * were evaluated in Thinking Mode), and the evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
 - The evaluation data may show numerical differences across versions of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results from [OpenCompass](https://github.com/internLM/OpenCompass/).
 
 **Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
+ ### Requirements
+ ```python
+ transformers >= 4.48
+ ```
+
 
 ### Conversation Mode
 
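The conversation-mode code itself sits outside this diff's hunks. As a hedged sketch of what the Transformers path looks like (the model id `internlm/internlm3-8b-instruct`, dtype, prompt, and generation settings below are assumptions, not contents of this commit):

```python
# Hedged sketch, not part of this commit: minimal Transformers chat inference,
# assuming transformers >= 4.48 (per the Requirements block above) and the HF
# model id internlm/internlm3-8b-instruct. trust_remote_code covers any custom
# code shipped in InternLM repos.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()  # device_map="auto" requires the accelerate package

messages = [
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
# The same apply_chat_template call appears verbatim in the thinking-mode hunks below.
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

generated = model.generate(inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(generated[0][inputs.shape[1]:], skip_special_tokens=True))
```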
@@ -157,8 +158,8 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
 
 We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please install it manually from the fork below.
 
 ```bash
- git clone https://github.com/RunningLeon/vllm.git
- pip install -e .
+ git clone -b support-internlm3 https://github.com/RunningLeon/vllm.git
+ pip install -e .
 ```
 
 inference code:
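The inference code itself is truncated by the hunk. A hedged sketch of the vLLM path, consistent with the `llm.chat(prompts, ...)` call visible in the final hunk of this diff (the model id, prompt, and sampling settings are assumptions; note that `pip install -e .` should be run from inside the cloned repository):

```python
# Hedged sketch, not part of this commit: vLLM chat inference for InternLM3,
# assuming the fork above is installed and the HF model id is
# internlm/internlm3-8b-instruct.
from vllm import LLM, SamplingParams

llm = LLM(model="internlm/internlm3-8b-instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

prompts = [
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
outputs = llm.chat(prompts, sampling_params=sampling_params)
print(outputs)
```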
@@ -194,7 +195,7 @@ print(outputs)
 
 
 ### Thinking Mode
- #### puzzle demo
+ #### Thinking Demo
 
 <img src="https://github.com/InternLM/InternLM/blob/017ba7446d20ecc3b9ab8e7b66cc034500868ab4/assets/solve_puzzle.png?raw=true" width="400"/>
 
@@ -284,7 +285,7 @@ print(response)
 ```
 #### LMDeploy inference
 
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
+ LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
 
 ```bash
 pip install lmdeploy
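# A hedged aside, not part of this commit: once installed, LMDeploy can also
# serve the model over an OpenAI-compatible API (the model id is an assumption):
# lmdeploy serve api_server internlm/internlm3-8b-instruct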
@@ -369,28 +370,24 @@ InternLM3, the third generation of the InternLM (书生·浦语) series, open-sources an 8-billion-parameter general-purpose
 
 - **Higher performance at lower cost**:
 Best-in-class performance on reasoning and knowledge tasks among models of the same scale, surpassing Llama3.1-8B and Qwen2.5-7B. Notably, InternLM3 was trained on only 4 trillion tokens, saving more than 75% of the training cost compared with models of a similar scale.
-
 - **Deep thinking capability**:
 InternLM3 supports both a deep thinking mode that solves complex reasoning tasks via long chains of thought and a general response mode that provides a more fluent user experience.
 
- - **Web browsing capability**:
- InternLM3 is the first general-purpose chat model in the open-source community to support browser use. Powered by its deep thinking capability, it supports more than 20 steps of web navigation for in-depth information mining and integration.
-
 #### Performance Evaluation
 
 We conducted a comprehensive evaluation of InternLM with the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/) across five capability dimensions: disciplinary competence, language, knowledge, reasoning, and understanding. Selected results are shown in the table below; visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.
 
 | Benchmark \ Model | | InternLM3-8B-Instruct | Qwen2.5-7B-Instruct | Llama3.1-8B-Instruct | GPT-4o-mini (closed source) |
 | ------------ | ------------------------------- | --------------------- | ------------------- | -------------------- | ------------------------- |
- | General | CMMLU (0-shot) | **83.1** | 75.8 | 53.9 | 66.0 |
+ | General | CMMLU(0-shot) | **83.1** | 75.8 | 53.9 | 66.0 |
 | | MMLU(0-shot) | 76.6 | **76.8** | 71.8 | 82.7 |
 | | MMLU-Pro(0-shot) | **57.6** | 56.2 | 48.1 | 64.1 |
 | Reasoning | GPQA-Diamond(0-shot) | **37.4** | 33.3 | 24.2 | 42.9 |
 | | DROP(0-shot) | **83.1** | 80.4 | 81.6 | 85.2 |
 | | HellaSwag(10-shot) | **91.2** | 85.3 | 76.7 | 89.5 |
 | | KOR-Bench(0-shot) | **56.4** | 44.6 | 47.7 | 58.2 |
- | MATH | MATH-500(0-shot Thinking Mode) | **83.0** | 72.4 | 48.4 | 74.0 |
- | | AIME2024(0-shot Thinking Mode) | **20.0** | 16.7 | 6.7 | 13.3 |
+ | MATH | MATH-500(0-shot) | **83.0*** | 72.4 | 48.4 | 74.0 |
+ | | AIME2024(0-shot) | **20.0*** | 16.7 | 6.7 | 13.3 |
 | Coding | LiveCodeBench(2407-2409 Pass@1) | **17.8** | 16.8 | 12.9 | 21.8 |
 | | HumanEval(Pass@1) | 82.3 | **85.4** | 72.0 | 86.6 |
 | Instruction | IFEval(Prompt-Strict) | **79.3** | 71.7 | 75.2 | 79.7 |
@@ -399,11 +396,20 @@ InternLM3 is the first general-purpose chat model in the open-source community to support browser use.
 | | WildBench(Raw Score) | **33.1** | 23.3 | 1.5 | 40.3 |
 | | MT-Bench-101(Score 1-10) | **8.59** | 8.49 | 8.37 | 8.87 |
 
- - The results above were obtained with [OpenCompass](https://github.com/internLM/OpenCompass/) (entries marked with `*` come from the original papers); see the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/) for test details.
+ - The results above were obtained with [OpenCompass](https://github.com/internLM/OpenCompass/) (entries marked with `*` were evaluated in deep thinking mode); see the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/) for test details.
 - Evaluation numbers may vary across versions of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results from [OpenCompass](https://github.com/internLM/OpenCompass/).
 
 **Limitations:** Although we paid close attention to model safety during training and tried to steer the model toward text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm, for example responses containing biases, discrimination, or other harmful content. Please do not propagate such content. This project assumes no responsibility for any consequences caused by the dissemination of harmful information.
 
+ #### Requirements
+
+ ```python
+ transformers >= 4.48
+ ```
+
+
 #### Conversation Mode
 
 ##### Transformers Inference
@@ -445,7 +451,7 @@ print(response)
 
 ##### LMDeploy Inference
 
- LMDeploy, jointly developed by the MMDeploy and MMRazor teams, is a full suite of solutions for LLM compression, deployment, and serving.
+ LMDeploy is a full suite of solutions for LLM compression, deployment, and serving.
 
 ```bash
 pip install lmdeploy
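# A hedged aside, not part of this commit: a quick interactive smoke test from
# the CLI (the model id is an assumption):
# lmdeploy chat internlm/internlm3-8b-instruct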
@@ -523,7 +529,7 @@ print(outputs)
 
 #### Deep Thinking Mode
 
- ##### Puzzle demo
+ ##### Deep Thinking Demo
 
 <img src="https://github.com/InternLM/InternLM/blob/017ba7446d20ecc3b9ab8e7b66cc034500868ab4/assets/solve_puzzle.png?raw=true" width="400"/>
 
@@ -601,7 +607,7 @@ model = model.eval()
 
 messages = [
 {"role": "system", "content": thinking_system_prompt},
- {"role": "user", "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."},
+ {"role": "user", "content": "已知函数\(f(x)=\mathrm{e}^{x}-ax - a^{3}\)。\n1)当\(a = 1\)时,求曲线\(y = f(x)\)在点\((1,f(1))\)处的切线方程;\n2)若\(f(x)\)有极小值,且极小值小于\(0\),求\(a\)的取值范围。"},
 ]
 tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
 
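# Hedged continuation, outside this hunk: move the token ids to the model
# device, generate, and decode only the newly generated tokens. The
# max_new_tokens value is an assumption; thinking-mode outputs can be long.
# generated = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
# print(tokenizer.decode(generated[0][tokenized_chat.shape[1]:], skip_special_tokens=True))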
@@ -631,7 +637,7 @@ pipe = pipeline(model_dir, chat_template_config=chat_template_config)
 
 messages = [
 {"role": "system", "content": thinking_system_prompt},
- {"role": "user", "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."},
+ {"role": "user", "content": "已知函数\(f(x)=\mathrm{e}^{x}-ax - a^{3}\)。\n1)当\(a = 1\)时,求曲线\(y = f(x)\)在点\((1,f(1))\)处的切线方程;\n2)若\(f(x)\)有极小值,且极小值小于\(0\),求\(a\)的取值范围。"},
 ]
 
 response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
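# Hedged continuation, outside this hunk: the pipeline's return value holds the
# generated text, printed as in the earlier LMDeploy hunk.
# print(response)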
@@ -662,7 +668,7 @@ prompts = [
 },
 {
 "role": "user",
- "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."
+ "content": "已知函数\(f(x)=\mathrm{e}^{x}-ax - a^{3}\)。\n1)当\(a = 1\)时,求曲线\(y = f(x)\)在点\((1,f(1))\)处的切线方程;\n2)若\(f(x)\)有极小值,且极小值小于\(0\),求\(a\)的取值范围。"
 },
 ]
 outputs = llm.chat(prompts,
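# The llm.chat call is truncated by the hunk here; per vLLM's chat API it
# continues with sampling parameters, e.g. (hedged):
#     sampling_params=sampling_params)
# print(outputs)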
 