czczup committed
Commit dd4d5dd
1 Parent(s): 0706313

Upload folder using huggingface_hub

Files changed (1):
  1. README.md (+46, -50)
README.md CHANGED
@@ -37,28 +37,28 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes

  ### Image Benchmarks

- | Benchmark | PaliGemma-3B | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
- | :-----------------------------: | :----------: | :------------------: | :----------: | :----------: |
- | Model Size | 2.9B | 2.2B | 2.2B | 0.9B |
- | | | | | |
- | DocVQA<sub>test</sub> | - | 85.0 | 86.9 | 81.7 |
- | ChartQA<sub>test</sub> | - | 74.8 | 76.2 | 72.9 |
- | InfoVQA<sub>test</sub> | - | 55.4 | 58.9 | 50.9 |
- | TextVQA<sub>val</sub> | 68.1 | 70.5 | 73.4 | 70.5 |
- | OCRBench | 614 | 654 | 784 | 754 |
- | MME<sub>sum</sub> | 1686.1 | 1901.5 | 1876.8 | 1794.4 |
- | RealWorldQA | 55.2 | 57.9 | 57.3 | 50.3 |
- | AI2D<sub>test</sub> | 68.3 | 69.8 | 74.1 | 64.1 |
- | MMMU<sub>val</sub> | 34.9 | 34.6 / 37.4 | 34.3 / 36.3 | 35.4 / 36.7 |
- | MMBench-EN<sub>test</sub> | 71.0 | 70.9 | 73.2 | 65.4 |
- | MMBench-CN<sub>test</sub> | 63.6 | 66.2 | 70.9 | 60.7 |
- | CCBench<sub>dev</sub> | 29.6 | 63.5 | 74.7 | 75.7 |
- | MMVet<sub>GPT-4-0613</sub> | - | 39.3 | 44.6 | 37.8 |
- | MMVet<sub>GPT-4-Turbo</sub> | 33.1 | 35.5 | 39.5 | 33.3 |
- | SEED-Image | 69.6 | 69.8 | 71.6 | 65.6 |
- | HallBench<sub>avg</sub> | 32.2 | 37.5 | 37.9 | 33.4 |
- | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
- | OpenCompass<sub>avg-score</sub> | 46.6 | 49.8 | 54.0 | 48.3 |
+ | Benchmark | PaliGemma-3B | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
+ | :--------------------------: | :----------: | :------------------: | :----------: | :----------: |
+ | Model Size | 2.9B | 2.2B | 2.2B | 0.9B |
+ | | | | | |
+ | DocVQA<sub>test</sub> | - | 85.0 | 86.9 | 81.7 |
+ | ChartQA<sub>test</sub> | - | 74.8 | 76.2 | 72.9 |
+ | InfoVQA<sub>test</sub> | - | 55.4 | 58.9 | 50.9 |
+ | TextVQA<sub>val</sub> | 68.1 | 70.5 | 73.4 | 70.5 |
+ | OCRBench | 614 | 654 | 784 | 754 |
+ | MME<sub>sum</sub> | 1686.1 | 1901.5 | 1876.8 | 1794.4 |
+ | RealWorldQA | 55.2 | 57.9 | 57.3 | 50.3 |
+ | AI2D<sub>test</sub> | 68.3 | 69.8 | 74.1 | 64.1 |
+ | MMMU<sub>val</sub> | 34.9 | 34.6 / 37.4 | 34.3 / 36.3 | 35.4 / 36.7 |
+ | MMBench-EN<sub>test</sub> | 71.0 | 70.9 | 73.2 | 65.4 |
+ | MMBench-CN<sub>test</sub> | 63.6 | 66.2 | 70.9 | 60.7 |
+ | CCBench<sub>dev</sub> | 29.6 | 63.5 | 74.7 | 75.7 |
+ | MMVet<sub>GPT-4-0613</sub> | - | 39.3 | 44.6 | 37.8 |
+ | MMVet<sub>GPT-4-Turbo</sub> | 33.1 | 35.5 | 39.5 | 33.3 |
+ | SEED-Image | 69.6 | 69.8 | 71.6 | 65.6 |
+ | HallBench<sub>avg</sub> | 32.2 | 37.5 | 37.9 | 33.4 |
+ | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
+ | OpenCompass<sub>avg</sub> | 46.6 | 49.8 | 54.0 | 48.3 |

  - We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.

@@ -66,8 +66,6 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes

  - Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.

- - It is important to mention that the MMVet scores we report are evaluated using GPT-4-0613 as the judge model. Different versions of GPT-4 can lead to significant variations in the scores for this dataset; for instance, using GPT-4-Turbo would result in significantly lower scores.
-

  ### Video Benchmarks

  | Benchmark | VideoChat2-Phi3 | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
@@ -424,37 +422,35 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes

  ### Image Benchmarks

- | Benchmark | PaliGemma-3B | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
- | :-----------------------------: | :----------: | :------------------: | :----------: | :----------: |
- | Model Size | 2.9B | 2.2B | 2.2B | 0.9B |
- | | | | | |
- | DocVQA<sub>test</sub> | - | 85.0 | 86.9 | 81.7 |
- | ChartQA<sub>test</sub> | - | 74.8 | 76.2 | 72.9 |
- | InfoVQA<sub>test</sub> | - | 55.4 | 58.9 | 50.9 |
- | TextVQA<sub>val</sub> | 68.1 | 70.5 | 73.4 | 70.5 |
- | OCRBench | 614 | 654 | 784 | 754 |
- | MME<sub>sum</sub> | 1686.1 | 1901.5 | 1876.8 | 1794.4 |
- | RealWorldQA | 55.2 | 57.9 | 57.3 | 50.3 |
- | AI2D<sub>test</sub> | 68.3 | 69.8 | 74.1 | 64.1 |
- | MMMU<sub>val</sub> | 34.9 | 34.6 / 37.4 | 34.3 / 36.3 | 35.4 / 36.7 |
- | MMBench-EN<sub>test</sub> | 71.0 | 70.9 | 73.2 | 65.4 |
- | MMBench-CN<sub>test</sub> | 63.6 | 66.2 | 70.9 | 60.7 |
- | CCBench<sub>dev</sub> | 29.6 | 63.5 | 74.7 | 75.7 |
- | MMVet<sub>GPT-4-0613</sub> | - | 39.3 | 44.6 | 37.8 |
- | MMVet<sub>GPT-4-Turbo</sub> | 33.1 | 35.5 | 39.5 | 37.3 |
- | SEED-Image | 69.6 | 69.8 | 71.6 | 65.6 |
- | HallBench<sub>avg</sub> | 32.2 | 37.5 | 37.9 | 33.4 |
- | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
- | OpenCompass<sub>avg-score</sub> | 46.6 | 49.8 | 54.0 | 48.3 |
-
- - We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository, while MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.
+ | Benchmark | PaliGemma-3B | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
+ | :--------------------------: | :----------: | :------------------: | :----------: | :----------: |
+ | Model Size | 2.9B | 2.2B | 2.2B | 0.9B |
+ | | | | | |
+ | DocVQA<sub>test</sub> | - | 85.0 | 86.9 | 81.7 |
+ | ChartQA<sub>test</sub> | - | 74.8 | 76.2 | 72.9 |
+ | InfoVQA<sub>test</sub> | - | 55.4 | 58.9 | 50.9 |
+ | TextVQA<sub>val</sub> | 68.1 | 70.5 | 73.4 | 70.5 |
+ | OCRBench | 614 | 654 | 784 | 754 |
+ | MME<sub>sum</sub> | 1686.1 | 1901.5 | 1876.8 | 1794.4 |
+ | RealWorldQA | 55.2 | 57.9 | 57.3 | 50.3 |
+ | AI2D<sub>test</sub> | 68.3 | 69.8 | 74.1 | 64.1 |
+ | MMMU<sub>val</sub> | 34.9 | 34.6 / 37.4 | 34.3 / 36.3 | 35.4 / 36.7 |
+ | MMBench-EN<sub>test</sub> | 71.0 | 70.9 | 73.2 | 65.4 |
+ | MMBench-CN<sub>test</sub> | 63.6 | 66.2 | 70.9 | 60.7 |
+ | CCBench<sub>dev</sub> | 29.6 | 63.5 | 74.7 | 75.7 |
+ | MMVet<sub>GPT-4-0613</sub> | - | 39.3 | 44.6 | 37.8 |
+ | MMVet<sub>GPT-4-Turbo</sub> | 33.1 | 35.5 | 39.5 | 37.3 |
+ | SEED-Image | 69.6 | 69.8 | 71.6 | 65.6 |
+ | HallBench<sub>avg</sub> | 32.2 | 37.5 | 37.9 | 33.4 |
+ | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
+ | OpenCompass<sub>avg</sub> | 46.6 | 49.8 | 54.0 | 48.3 |
+
+ - We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.

  - For MMMU, we report both the original scores (left: InternVL-series models evaluated with the InternVL codebase; scores for other models are taken from their technical reports or webpages) and the VLMEvalKit scores (right: collected from the OpenCompass leaderboard).

  - Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.

- - Note that the MMVet scores we report were evaluated using GPT-4-0613 as the judge model. Different versions of GPT-4 can lead to significant variations in scores on this dataset; for instance, using GPT-4-Turbo would result in significantly lower scores.
-

  ### Video Benchmarks

  | Benchmark | VideoChat2-Phi3 | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
 
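The four rows above attributed to VLMEvalKit (OCRBench, RealWorldQA, HallBench, MathVista) can in principle be re-run from that toolkit's command-line entry point. The following is a minimal sketch, assuming a local VLMEvalKit checkout as the working directory; the model alias `InternVL2-2B` and the dataset keys are assumptions to verify against `vlmeval/config.py` in your checkout, not values taken from this commit.

```python
# Reproduction sketch for the VLMEvalKit-scored rows in the tables above.
# Run from the root of a VLMEvalKit checkout. The model alias and dataset
# keys are assumptions; check vlmeval/config.py for the names registered
# in your version of the toolkit.
import subprocess

MODEL = "InternVL2-2B"  # assumed VLMEvalKit alias for this checkpoint
DATASETS = [
    "OCRBench",         # OCRBench row
    "RealWorldQA",      # RealWorldQA row
    "HallusionBench",   # HallBench row
    "MathVista_MINI",   # MathVista (testmini) row
]

for dataset in DATASETS:
    # VLMEvalKit's documented entry point: python run.py --data <key> --model <alias>
    subprocess.run(
        ["python", "run.py", "--data", dataset, "--model", MODEL, "--verbose"],
        check=True,
    )
```

As the notes in the diff caution, scores reproduced this way may differ slightly from the tables depending on toolkit version, environment, and hardware; for the GPT-judged MMVet rows, the judge model version (GPT-4-0613 vs. GPT-4-Turbo) also shifts results substantially.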