README.md CHANGED
@@ -1,20 +1,6 @@
1
  ---
2
  license: mit
3
  pipeline_tag: image-text-to-text
4
- library_name: transformers
5
- base_model:
6
- - OpenGVLab/InternViT-300M-448px
7
- - internlm/internlm2_5-7b-chat
8
- base_model_relation: merge
9
- language:
10
- - multilingual
11
- tags:
12
- - internvl
13
- - vision
14
- - ocr
15
- - multi-image
16
- - video
17
- - custom_code
18
  ---
19
 
20
  # InternVL2-8B
@@ -76,13 +62,11 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
76
  | MathVista<sub>testmini</sub> | 54.3 | 53.5 | 58.3 |
77
  | OpenCompass<sub>avg</sub> | 58.8 | 61.7 | 64.1 |
78
 
79
- - For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
80
-
81
- - We simultaneously use [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
82
 
83
  - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
84
 
85
- - Please note that evaluating the same model using different testing toolkits like [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
86
 
87
  ### Video Benchmarks
88
 
@@ -144,7 +128,6 @@ model = AutoModel.from_pretrained(
144
  path,
145
  torch_dtype=torch.bfloat16,
146
  low_cpu_mem_usage=True,
147
- use_flash_attn=True,
148
  trust_remote_code=True).eval().cuda()
149
  ```
150
 
@@ -159,7 +142,6 @@ model = AutoModel.from_pretrained(
159
  torch_dtype=torch.bfloat16,
160
  load_in_8bit=True,
161
  low_cpu_mem_usage=True,
162
- use_flash_attn=True,
163
  trust_remote_code=True).eval()
164
  ```
165
 
@@ -174,7 +156,6 @@ model = AutoModel.from_pretrained(
174
  torch_dtype=torch.bfloat16,
175
  load_in_4bit=True,
176
  low_cpu_mem_usage=True,
177
- use_flash_attn=True,
178
  trust_remote_code=True).eval()
179
  ```
180
 
@@ -219,7 +200,6 @@ model = AutoModel.from_pretrained(
219
  path,
220
  torch_dtype=torch.bfloat16,
221
  low_cpu_mem_usage=True,
222
- use_flash_attn=True,
223
  trust_remote_code=True,
224
  device_map=device_map).eval()
225
  ```
@@ -315,13 +295,12 @@ model = AutoModel.from_pretrained(
315
  path,
316
  torch_dtype=torch.bfloat16,
317
  low_cpu_mem_usage=True,
318
- use_flash_attn=True,
319
  trust_remote_code=True).eval().cuda()
320
  tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
321
 
322
  # set the max number of tiles in `max_num`
323
  pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
324
- generation_config = dict(max_new_tokens=1024, do_sample=True)
325
 
326
  # pure-text conversation (纯文本对话)
327
  question = 'Hello, who are you?'
@@ -473,7 +452,7 @@ for new_text in streamer:
473
 
474
  ## Finetune
475
 
476
- Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTurner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
477
 
478
  ## Deployment
479
 
@@ -482,7 +461,7 @@ Many repositories now support fine-tuning of the InternVL series models, includi
482
  LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
483
 
484
  ```sh
485
- pip install lmdeploy==0.5.3
486
  ```
487
 
488
  LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
@@ -490,12 +469,16 @@ LMDeploy abstracts the complex inference process of multi-modal Vision-Language
490
  #### A 'Hello, world' example
491
 
492
  ```python
493
- from lmdeploy import pipeline, TurbomindEngineConfig
494
  from lmdeploy.vl import load_image
495
 
496
  model = 'OpenGVLab/InternVL2-8B'
 
497
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
498
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
499
  response = pipe(('describe this image', image))
500
  print(response.text)
501
  ```
@@ -509,12 +492,16 @@ When dealing with multiple images, you can put them all in one list. Keep in min
509
  > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
510
 
511
  ```python
512
- from lmdeploy import pipeline, TurbomindEngineConfig
513
  from lmdeploy.vl import load_image
514
  from lmdeploy.vl.constants import IMAGE_TOKEN
515
 
516
  model = 'OpenGVLab/InternVL2-8B'
517
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
 
518
 
519
  image_urls=[
520
  'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -532,11 +519,15 @@ print(response.text)
532
  Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
533
 
534
  ```python
535
- from lmdeploy import pipeline, TurbomindEngineConfig
536
  from lmdeploy.vl import load_image
537
 
538
  model = 'OpenGVLab/InternVL2-8B'
539
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
 
540
 
541
  image_urls=[
542
  "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -552,11 +543,15 @@ print(response)
552
  There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
553
 
554
  ```python
555
- from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
556
  from lmdeploy.vl import load_image
557
 
558
  model = 'OpenGVLab/InternVL2-8B'
559
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
 
560
 
561
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
562
  gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
@@ -568,10 +563,20 @@ print(sess.response.text)
568
 
569
  #### Service
570
 
571
  LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
572
 
573
  ```shell
574
- lmdeploy serve api_server OpenGVLab/InternVL2-8B --backend turbomind --server-port 23333
575
  ```
576
 
577
  To use the OpenAI-style interface, you need to install OpenAI:
@@ -608,6 +613,14 @@ response = client.chat.completions.create(
608
  print(response)
609
  ```
610
 
611
  ## License
612
 
613
  This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.
@@ -680,8 +693,6 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
680
  | MathVista<sub>testmini</sub> | 54.3 | 53.5 | 58.3 |
681
  | OpenCompass<sub>avg</sub> | 58.8 | 61.7 | 64.1 |
682
 
683
- - 关于更多的细节以及评测复现,请看我们的[评测指南](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html)。
684
-
685
  - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说,DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
686
 
687
  - 对于MMMU,我们报告了原始分数(左侧:InternVL系列模型使用InternVL代码库评测,其他模型的分数来自其技术报告或网页)和VLMEvalKit分数(右侧:从OpenCompass排行榜收集)。
@@ -740,7 +751,7 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
740
 
741
  ## 微调
742
 
743
- 许多仓库现在都支持 InternVL 系列模型的微调,包括 [InternVL](https://github.com/OpenGVLab/InternVL)、[SWIFT](https://github.com/modelscope/ms-swift)、[XTurner](https://github.com/InternLM/xtuner) 等。请参阅它们的文档以获取更多微调细节。
744
 
745
  ## 部署
746
 
@@ -749,7 +760,7 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
749
  LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型(LLM)的工具包。
750
 
751
  ```sh
752
- pip install lmdeploy==0.5.3
753
  ```
754
 
755
  LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为一个易于使用的管道,类似于大语言模型(LLM)的推理管道。
@@ -757,12 +768,16 @@ LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为
757
  #### 一个“你好,世界”示例
758
 
759
  ```python
760
- from lmdeploy import pipeline, TurbomindEngineConfig
761
  from lmdeploy.vl import load_image
762
 
763
  model = 'OpenGVLab/InternVL2-8B'
 
764
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
765
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
766
  response = pipe(('describe this image', image))
767
  print(response.text)
768
  ```
@@ -774,12 +789,16 @@ print(response.text)
774
  在处理多张图像时,可以将它们全部放入一个列表中。请注意,多张图像会导致输入 token 数量增加,因此通常需要增加上下文窗口的大小。
775
 
776
  ```python
777
- from lmdeploy import pipeline, TurbomindEngineConfig
778
  from lmdeploy.vl import load_image
779
  from lmdeploy.vl.constants import IMAGE_TOKEN
780
 
781
  model = 'OpenGVLab/InternVL2-8B'
782
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
 
783
 
784
  image_urls=[
785
  'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -787,7 +806,6 @@ image_urls=[
787
  ]
788
 
789
  images = [load_image(img_url) for img_url in image_urls]
790
- # Numbering images improves multi-image conversations
791
  response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
792
  print(response.text)
793
  ```
@@ -797,11 +815,15 @@ print(response.text)
797
  使用批量Prompt进行推理非常简单;只需将它们放在一个列表结构中:
798
 
799
  ```python
800
- from lmdeploy import pipeline, TurbomindEngineConfig
801
  from lmdeploy.vl import load_image
802
 
803
  model = 'OpenGVLab/InternVL2-8B'
804
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
 
805
 
806
  image_urls=[
807
  "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -817,11 +839,15 @@ print(response)
817
  使用管道进行多轮对话有两种方法。一种是根据 OpenAI 的格式构建消息并使用上述方法,另一种是使用 `pipeline.chat` 接口。
818
 
819
  ```python
820
- from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
821
  from lmdeploy.vl import load_image
822
 
823
  model = 'OpenGVLab/InternVL2-8B'
824
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 
 
 
825
 
826
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
827
  gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
@@ -833,10 +859,20 @@ print(sess.response.text)
833
 
834
  #### API部署
835
 
836
  LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例:
837
 
838
  ```shell
839
- lmdeploy serve api_server OpenGVLab/InternVL2-8B --backend turbomind --server-port 23333
840
  ```
841
 
842
  为了使用OpenAI风格的API接口,您需要安装OpenAI:
@@ -873,6 +909,14 @@ response = client.chat.completions.create(
873
  print(response)
874
  ```
875
 
876
  ## 开源许可证
877
 
878
  该项目采用 MIT 许可证发布,而 InternLM2 则采用 Apache-2.0 许可证。
 
1
  ---
2
  license: mit
3
  pipeline_tag: image-text-to-text
4
  ---
5
 
6
  # InternVL2-8B
 
62
  | MathVista<sub>testmini</sub> | 54.3 | 53.5 | 58.3 |
63
  | OpenCompass<sub>avg</sub> | 58.8 | 61.7 | 64.1 |
64
 
65
+ - We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained with the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated with VLMEvalKit.
 
 
66
 
67
  - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
68
 
69
+ - Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
70
 
71
  ### Video Benchmarks
72
 
 
128
  path,
129
  torch_dtype=torch.bfloat16,
130
  low_cpu_mem_usage=True,
 
131
  trust_remote_code=True).eval().cuda()
132
  ```
133
 
 
142
  torch_dtype=torch.bfloat16,
143
  load_in_8bit=True,
144
  low_cpu_mem_usage=True,
 
145
  trust_remote_code=True).eval()
146
  ```
147
 
 
156
  torch_dtype=torch.bfloat16,
157
  load_in_4bit=True,
158
  low_cpu_mem_usage=True,
 
159
  trust_remote_code=True).eval()
160
  ```
161
 
 
200
  path,
201
  torch_dtype=torch.bfloat16,
202
  low_cpu_mem_usage=True,
 
203
  trust_remote_code=True,
204
  device_map=device_map).eval()
205
  ```
 
295
  path,
296
  torch_dtype=torch.bfloat16,
297
  low_cpu_mem_usage=True,
 
298
  trust_remote_code=True).eval().cuda()
299
  tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
300
 
301
  # set the max number of tiles in `max_num`
302
  pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
303
+ generation_config = dict(max_new_tokens=1024, do_sample=False)
304
 
305
  # pure-text conversation (纯文本对话)
306
  question = 'Hello, who are you?'
 
452
 
453
  ## Finetune
454
 
455
+ SWIFT from the ModelScope community supports fine-tuning (image and video) of InternVL; please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
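
As a rough orientation, here is a minimal sketch of launching such a fine-tuning run through SWIFT's Python entry point. Note that `sft_main`/`SftArguments`, the `internvl2-8b` model type, and the `coco-en-2-mini` dataset name are assumptions based on SWIFT's documentation and should be verified against the best-practice guide linked above.

```python
# A minimal sketch, assuming ms-swift's documented Python API (sft_main / SftArguments).
# The model_type and dataset values below are placeholders; check the SWIFT
# best-practice guide linked above for the exact identifiers.
from swift.llm import SftArguments, sft_main

sft_args = SftArguments(
    model_type='internvl2-8b',   # assumed SWIFT identifier for this checkpoint
    dataset=['coco-en-2-mini'],  # example image-caption dataset from the SWIFT docs
    output_dir='output',
)
sft_main(sft_args)
```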
456
 
457
  ## Deployment
458
 
 
461
  LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
462
 
463
  ```sh
464
+ pip install lmdeploy
465
  ```
466
 
467
  LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
 
469
  #### A 'Hello, world' example
470
 
471
  ```python
472
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
473
  from lmdeploy.vl import load_image
474
 
475
  model = 'OpenGVLab/InternVL2-8B'
476
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
477
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
478
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
479
+ chat_template_config.meta_instruction = system_prompt
480
+ pipe = pipeline(model, chat_template_config=chat_template_config,
481
+ backend_config=TurbomindEngineConfig(session_len=8192))
482
  response = pipe(('describe this image', image))
483
  print(response.text)
484
  ```
 
492
  > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
493
 
494
  ```python
495
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
496
  from lmdeploy.vl import load_image
497
  from lmdeploy.vl.constants import IMAGE_TOKEN
498
 
499
  model = 'OpenGVLab/InternVL2-8B'
500
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
501
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
502
+ chat_template_config.meta_instruction = system_prompt
503
+ pipe = pipeline(model, chat_template_config=chat_template_config,
504
+ backend_config=TurbomindEngineConfig(session_len=8192))
505
 
506
  image_urls=[
507
  'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
 
519
  Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
520
 
521
  ```python
522
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
523
  from lmdeploy.vl import load_image
524
 
525
  model = 'OpenGVLab/InternVL2-8B'
526
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
527
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
528
+ chat_template_config.meta_instruction = system_prompt
529
+ pipe = pipeline(model, chat_template_config=chat_template_config,
530
+ backend_config=TurbomindEngineConfig(session_len=8192))
531
 
532
  image_urls=[
533
  "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
 
543
  There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
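
For the first approach, here is a minimal sketch, assuming the OpenAI/GPT-4V style message schema (`role`/`content` dicts with `type='image_url'`) supported by LMDeploy's VLM pipeline; the image URL and the follow-up question are placeholders. The second approach, via `pipeline.chat`, is shown in the example that follows.

```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig

model = 'OpenGVLab/InternVL2-8B'
chat_template_config = ChatTemplateConfig('internvl-internlm2')
pipe = pipeline(model, chat_template_config=chat_template_config,
                backend_config=TurbomindEngineConfig(session_len=8192))
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)

# First turn: an OpenAI-style message whose content mixes text and an image URL.
messages = [dict(role='user', content=[
    dict(type='text', text='describe this image'),
    dict(type='image_url', image_url=dict(
        url='https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg'))
])]
response = pipe(messages, gen_config=gen_config)
print(response.text)

# Follow-up turn: append the assistant reply and the next question, then call the pipeline again.
messages.append(dict(role='assistant', content=response.text))
messages.append(dict(role='user', content='What is the woman doing?'))  # placeholder follow-up
response = pipe(messages, gen_config=gen_config)
print(response.text)
```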
544
 
545
  ```python
546
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
547
  from lmdeploy.vl import load_image
548
 
549
  model = 'OpenGVLab/InternVL2-8B'
550
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
551
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
552
+ chat_template_config.meta_instruction = system_prompt
553
+ pipe = pipeline(model, chat_template_config=chat_template_config,
554
+ backend_config=TurbomindEngineConfig(session_len=8192))
555
 
556
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
557
  gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
 
563
 
564
  #### Service
565
 
566
+ To deploy InternVL2 as an API, first configure the chat template. Create the following JSON file, `chat_template.json`:
567
+
568
+ ```json
569
+ {
570
+ "model_name":"internvl-internlm2",
571
+ "meta_instruction":"我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
572
+ "stop_words":["<|im_start|>", "<|im_end|>"]
573
+ }
574
+ ```
575
+
576
  LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
577
 
578
  ```shell
579
+ lmdeploy serve api_server OpenGVLab/InternVL2-8B --backend turbomind --server-port 23333 --chat-template chat_template.json
580
  ```
581
 
582
  To use the OpenAI-style interface, you need to install OpenAI:
 
613
  print(response)
614
  ```
615
 
616
+ ### vLLM
617
+
618
+ TODO
619
+
620
+ ### Ollama
621
+
622
+ TODO
623
+
624
  ## License
625
 
626
  This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.
 
693
  | MathVista<sub>testmini</sub> | 54.3 | 53.5 | 58.3 |
694
  | OpenCompass<sub>avg</sub> | 58.8 | 61.7 | 64.1 |
695
 
 
 
696
  - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说,DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
697
 
698
  - 对于MMMU,我们报告了原始分数(左侧:InternVL系列模型使用InternVL代码库评测,其他模型的分数来自其技术报告或网页)和VLMEvalKit分数(右侧:从OpenCompass排行榜收集)。
 
751
 
752
  ## 微调
753
 
754
+ 来自ModelScope社区的SWIFT已经支持对InternVL进行微调(图像/视频),详情请查看[此链接](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md)。
755
 
756
  ## 部署
757
 
 
760
  LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型(LLM)的工具包。
761
 
762
  ```sh
763
+ pip install lmdeploy
764
  ```
765
 
766
  LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为一个易于使用的管道,类似于大语言模型(LLM)的推理管道。
 
768
  #### 一个“你好,世界”示例
769
 
770
  ```python
771
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
772
  from lmdeploy.vl import load_image
773
 
774
  model = 'OpenGVLab/InternVL2-8B'
775
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
776
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
777
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
778
+ chat_template_config.meta_instruction = system_prompt
779
+ pipe = pipeline(model, chat_template_config=chat_template_config,
780
+ backend_config=TurbomindEngineConfig(session_len=8192))
781
  response = pipe(('describe this image', image))
782
  print(response.text)
783
  ```
 
789
  在处理多张图像时,可以将它们全部放入一个列表中。请注意,多张图像会导致输入 token 数量增加,因此通常需要增加上下文窗口的大小。
790
 
791
  ```python
792
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
793
  from lmdeploy.vl import load_image
794
  from lmdeploy.vl.constants import IMAGE_TOKEN
795
 
796
  model = 'OpenGVLab/InternVL2-8B'
797
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
798
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
799
+ chat_template_config.meta_instruction = system_prompt
800
+ pipe = pipeline(model, chat_template_config=chat_template_config,
801
+ backend_config=TurbomindEngineConfig(session_len=8192))
802
 
803
  image_urls=[
804
  'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
 
806
  ]
807
 
808
  images = [load_image(img_url) for img_url in image_urls]
 
809
  response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
810
  print(response.text)
811
  ```
 
815
  使用批量Prompt进行推理非常简单;只需将它们放在一个列表结构中:
816
 
817
  ```python
818
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
819
  from lmdeploy.vl import load_image
820
 
821
  model = 'OpenGVLab/InternVL2-8B'
822
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
823
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
824
+ chat_template_config.meta_instruction = system_prompt
825
+ pipe = pipeline(model, chat_template_config=chat_template_config,
826
+ backend_config=TurbomindEngineConfig(session_len=8192))
827
 
828
  image_urls=[
829
  "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
 
839
  使用管道进行多轮对话有两种方法。一种是根据 OpenAI 的格式构建消息并使用上述方法,另一种是使用 `pipeline.chat` 接口。
840
 
841
  ```python
842
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
843
  from lmdeploy.vl import load_image
844
 
845
  model = 'OpenGVLab/InternVL2-8B'
846
+ system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
847
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
848
+ chat_template_config.meta_instruction = system_prompt
849
+ pipe = pipeline(model, chat_template_config=chat_template_config,
850
+ backend_config=TurbomindEngineConfig(session_len=8192))
851
 
852
  image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
853
  gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
 
859
 
860
  #### API部署
861
 
862
+ 为了将InternVL2部署成API,请先配置聊天模板配置文件。创建如下的 JSON 文件 `chat_template.json`。
863
+
864
+ ```json
865
+ {
866
+ "model_name":"internvl-internlm2",
867
+ "meta_instruction":"我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
868
+ "stop_words":["<|im_start|>", "<|im_end|>"]
869
+ }
870
+ ```
871
+
872
  LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例:
873
 
874
  ```shell
875
+ lmdeploy serve api_server OpenGVLab/InternVL2-8B --backend turbomind --server-port 23333 --chat-template chat_template.json
876
  ```
877
 
878
  为了使用OpenAI风格的API接口,您需要安装OpenAI:
 
909
  print(response)
910
  ```
911
 
912
+ ### vLLM
913
+
914
+ TODO
915
+
916
+ ### Ollama
917
+
918
+ TODO
919
+
920
  ## 开源许可证
921
 
922
  该项目采用 MIT 许可证发布,而 InternLM2 则采用 Apache-2.0 许可证。
configuration_intern_vit.py CHANGED
@@ -3,7 +3,6 @@
3
  # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
-
7
  import os
8
  from typing import Union
9
 
 
3
  # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
 
6
  import os
7
  from typing import Union
8
 
configuration_internvl_chat.py CHANGED
@@ -47,12 +47,12 @@ class InternVLChatConfig(PretrainedConfig):
47
  logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
48
 
49
  self.vision_config = InternVisionConfig(**vision_config)
50
- if llm_config.get('architectures')[0] == 'LlamaForCausalLM':
51
  self.llm_config = LlamaConfig(**llm_config)
52
- elif llm_config.get('architectures')[0] == 'InternLM2ForCausalLM':
53
  self.llm_config = InternLM2Config(**llm_config)
54
  else:
55
- raise ValueError('Unsupported architecture: {}'.format(llm_config.get('architectures')[0]))
56
  self.use_backbone_lora = use_backbone_lora
57
  self.use_llm_lora = use_llm_lora
58
  self.select_layer = select_layer
 
47
  logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
48
 
49
  self.vision_config = InternVisionConfig(**vision_config)
50
+ if llm_config['architectures'][0] == 'LlamaForCausalLM':
51
  self.llm_config = LlamaConfig(**llm_config)
52
+ elif llm_config['architectures'][0] == 'InternLM2ForCausalLM':
53
  self.llm_config = InternLM2Config(**llm_config)
54
  else:
55
+ raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
56
  self.use_backbone_lora = use_backbone_lora
57
  self.use_llm_lora = use_llm_lora
58
  self.select_layer = select_layer
conversation.py CHANGED
@@ -3,13 +3,11 @@ Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
  If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
-
7
- Modified from https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
8
  """
9
 
10
  import dataclasses
11
  from enum import IntEnum, auto
12
- from typing import Dict, List, Tuple, Union
13
 
14
 
15
  class SeparatorStyle(IntEnum):
@@ -346,6 +344,12 @@ register_conv_template(
346
  roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
347
  sep_style=SeparatorStyle.MPT,
348
  sep='<|im_end|>',
 
 
 
 
 
 
349
  stop_str='<|endoftext|>',
350
  )
351
  )
@@ -361,6 +365,11 @@ register_conv_template(
361
  roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
362
  sep_style=SeparatorStyle.MPT,
363
  sep='<|im_end|>',
 
 
 
 
 
364
  )
365
  )
366
 
@@ -375,17 +384,10 @@ register_conv_template(
375
  roles=('<|user|>\n', '<|assistant|>\n'),
376
  sep_style=SeparatorStyle.MPT,
377
  sep='<|end|>',
378
- )
379
- )
380
-
381
-
382
- register_conv_template(
383
- Conversation(
384
- name='internvl2_5',
385
- system_template='<|im_start|>system\n{system_message}',
386
- system_message='你是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
387
- roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
388
- sep_style=SeparatorStyle.MPT,
389
- sep='<|im_end|>\n',
390
  )
391
  )
 
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
  If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
 
 
6
  """
7
 
8
  import dataclasses
9
  from enum import IntEnum, auto
10
+ from typing import Any, Dict, List, Tuple, Union
11
 
12
 
13
  class SeparatorStyle(IntEnum):
 
344
  roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
345
  sep_style=SeparatorStyle.MPT,
346
  sep='<|im_end|>',
347
+ stop_token_ids=[
348
+ 2,
349
+ 6,
350
+ 7,
351
+ 8,
352
+ ],
353
  stop_str='<|endoftext|>',
354
  )
355
  )
 
365
  roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
366
  sep_style=SeparatorStyle.MPT,
367
  sep='<|im_end|>',
368
+ stop_token_ids=[
369
+ 2,
370
+ 92543,
371
+ 92542
372
+ ]
373
  )
374
  )
375
 
 
384
  roles=('<|user|>\n', '<|assistant|>\n'),
385
  sep_style=SeparatorStyle.MPT,
386
  sep='<|end|>',
387
+ stop_token_ids=[
388
+ 2,
389
+ 32000,
390
+ 32007
391
+ ]
 
 
 
 
 
 
 
392
  )
393
  )
eval_llm_benchmark.log DELETED
@@ -1,53 +0,0 @@
1
- /mnt/petrelfs/wangweiyun/miniconda3/envs/internvl_eval/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
2
- warn("The installed version of bitsandbytes was compiled without GPU support. "
3
- /mnt/petrelfs/wangweiyun/miniconda3/envs/internvl_eval/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
4
- model path is /mnt/petrelfs/wangweiyun/workspace_cz/InternVL/internvl_chat_dev/share_internvl/InternVL2-8B
5
- 09/30 19:08:03 - OpenCompass - WARNING - No previous results to reuse!
6
- 09/30 19:08:03 - OpenCompass - INFO - Reusing experiements from 20240930_190803
7
- 09/30 19:08:03 - OpenCompass - INFO - Current exp folder: /mnt/petrelfs/wangweiyun/workspace_cz/InternVL/internvl_chat_dev/share_internvl/InternVL2-8B/20240930_190803
8
- 09/30 19:08:06 - OpenCompass - INFO - Partitioned into 64 tasks.
9
- [ ] 0/64, elapsed: 0s, ETA:
10
- 09/30 19:52:33 - OpenCompass - INFO - Partitioned into 287 tasks.
11
- [ ] 0/287, elapsed: 0s, ETA:
12
- dataset version metric mode internvl-chat-20b
13
- ---------------------------- --------- ---------------------------- ------ -------------------
14
- mmlu - naive_average gen 73.17
15
- cmmlu - naive_average gen 79.21
16
- ceval - naive_average gen 80.14
17
- agieval - - - -
18
- GaokaoBench - weighted_average gen 74.99
19
- triviaqa 2121ce score gen 62.03
20
- triviaqa_wiki_1shot - - - -
21
- nq 3dcea1 score gen 28.12
22
- C3 8c358f accuracy gen 94.19
23
- race-high 9a54b6 accuracy gen 90.82
24
- flores_100 - - - -
25
- winogrande b36770 accuracy gen 85.87
26
- hellaswag e42710 accuracy gen 94.91
27
- bbh - naive_average gen 72.67
28
- gsm8k 1d7fe4 accuracy gen 75.59
29
- math 393424 accuracy gen 39.50
30
- TheoremQA 6f0af8 score gen 15.62
31
- MathBench - - - -
32
- openai_humaneval 8e312c humaneval_pass@1 gen 69.51
33
- humanevalx - - - -
34
- sanitized_mbpp a447ff score gen 58.75
35
- mbpp_cn 6fb572 score gen 48.20
36
- leval - - - -
37
- leval_closed - - - -
38
- leval_open - - - -
39
- longbench - - - -
40
- longbench_single-document-qa - - - -
41
- longbench_multi-document-qa - - - -
42
- longbench_summarization - - - -
43
- longbench_few-shot-learning - - - -
44
- longbench_synthetic-tasks - - - -
45
- longbench_code-completion - - - -
46
- teval - - - -
47
- teval_zh - - - -
48
- IFEval 3321a3 Prompt-level-strict-accuracy gen 52.31
49
- IFEval 3321a3 Inst-level-strict-accuracy gen 62.71
50
- IFEval 3321a3 Prompt-level-loose-accuracy gen 54.90
51
- IFEval 3321a3 Inst-level-loose-accuracy gen 64.87
52
- 09/30 19:55:16 - OpenCompass - INFO - write summary to /mnt/petrelfs/wangweiyun/workspace_cz/InternVL/internvl_chat_dev/share_internvl/InternVL2-8B/20240930_190803/summary/summary_20240930_190803.txt
53
- 09/30 19:55:16 - OpenCompass - INFO - write csv to /mnt/petrelfs/wangweiyun/workspace_cz/InternVL/internvl_chat_dev/share_internvl/InternVL2-8B/20240930_190803/summary/summary_20240930_190803.csv
generation_config.json CHANGED
@@ -1,8 +1,4 @@
1
  {
2
  "_from_model_config": true,
3
- "transformers_version": "4.37.2",
4
- "eos_token_id": [
5
- 92542,
6
- 92543
7
- ]
8
  }
 
1
  {
2
  "_from_model_config": true,
3
+ "transformers_version": "4.37.2"
 
 
 
 
4
  }
modeling_intern_vit.py CHANGED
@@ -3,7 +3,6 @@
3
  # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
-
7
  from typing import Optional, Tuple, Union
8
 
9
  import torch
@@ -21,12 +20,18 @@ from transformers.utils import logging
21
  from .configuration_intern_vit import InternVisionConfig
22
 
23
  try:
 
 
 
 
 
 
 
24
  from flash_attn.bert_padding import pad_input, unpad_input
25
- from flash_attn.flash_attn_interface import \
26
- flash_attn_varlen_qkvpacked_func
27
  has_flash_attn = True
28
  except:
29
- print('FlashAttention2 is not installed.')
30
  has_flash_attn = False
31
 
32
  logger = logging.get_logger(__name__)
@@ -69,7 +74,7 @@ class FlashAttention(nn.Module):
69
  max_s = seqlen
70
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
71
  device=qkv.device)
72
- output = flash_attn_varlen_qkvpacked_func(
73
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
74
  softmax_scale=self.softmax_scale, causal=causal
75
  )
@@ -79,7 +84,7 @@ class FlashAttention(nn.Module):
79
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
80
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
81
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
82
- output_unpad = flash_attn_varlen_qkvpacked_func(
83
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
84
  softmax_scale=self.softmax_scale, causal=causal
85
  )
@@ -88,7 +93,7 @@ class FlashAttention(nn.Module):
88
  'b s (h d) -> b s h d', h=nheads)
89
  else:
90
  assert max_s is not None
91
- output = flash_attn_varlen_qkvpacked_func(
92
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
93
  softmax_scale=self.softmax_scale, causal=causal
94
  )
@@ -288,9 +293,9 @@ class InternVisionEncoderLayer(nn.Module):
288
  Args:
289
  hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
290
  """
291
- hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states).to(hidden_states.dtype)) * self.ls1)
292
 
293
- hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states).to(hidden_states.dtype)) * self.ls2)
294
 
295
  return hidden_states
296
 
 
3
  # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
 
6
  from typing import Optional, Tuple, Union
7
 
8
  import torch
 
20
  from .configuration_intern_vit import InternVisionConfig
21
 
22
  try:
23
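+ # FlashAttention v2 renamed the packed variable-length kernel, so try the v1 name first and fall back to aliasing the v2 name; the call sites below then work with either version.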
+ try: # v1
24
+ from flash_attn.flash_attn_interface import \
25
+ flash_attn_unpadded_qkvpacked_func
26
+ except: # v2
27
+ from flash_attn.flash_attn_interface import \
28
+ flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
29
+
30
  from flash_attn.bert_padding import pad_input, unpad_input
31
+
 
32
  has_flash_attn = True
33
  except:
34
+ print('FlashAttention is not installed.')
35
  has_flash_attn = False
36
 
37
  logger = logging.get_logger(__name__)
 
74
  max_s = seqlen
75
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
76
  device=qkv.device)
77
+ output = flash_attn_unpadded_qkvpacked_func(
78
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
79
  softmax_scale=self.softmax_scale, causal=causal
80
  )
 
84
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
85
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
86
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
87
+ output_unpad = flash_attn_unpadded_qkvpacked_func(
88
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
89
  softmax_scale=self.softmax_scale, causal=causal
90
  )
 
93
  'b s (h d) -> b s h d', h=nheads)
94
  else:
95
  assert max_s is not None
96
+ output = flash_attn_unpadded_qkvpacked_func(
97
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
98
  softmax_scale=self.softmax_scale, causal=causal
99
  )
 
293
  Args:
294
  hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
295
  """
296
+ hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states)) * self.ls1)
297
 
298
+ hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states)) * self.ls2)
299
 
300
  return hidden_states
301
 
modeling_internvl_chat.py CHANGED
@@ -3,9 +3,8 @@
3
  # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
-
7
  import warnings
8
- from typing import List, Optional, Tuple, Union
9
 
10
  import torch.utils.checkpoint
11
  import transformers
@@ -19,7 +18,7 @@ from transformers.utils import ModelOutput, logging
19
 
20
  from .configuration_internvl_chat import InternVLChatConfig
21
  from .conversation import get_conv_template
22
- from .modeling_intern_vit import InternVisionModel, has_flash_attn
23
  from .modeling_internlm2 import InternLM2ForCausalLM
24
 
25
  logger = logging.get_logger(__name__)
@@ -36,11 +35,10 @@ def version_cmp(v1, v2, op='eq'):
36
  class InternVLChatModel(PreTrainedModel):
37
  config_class = InternVLChatConfig
38
  main_input_name = 'pixel_values'
39
- base_model_prefix = 'language_model'
40
  _supports_flash_attn_2 = True
41
  _no_split_modules = ['InternVisionModel', 'LlamaDecoderLayer', 'InternLM2DecoderLayer']
42
 
43
- def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None, use_flash_attn=True):
44
  super().__init__(config)
45
 
46
  assert version_cmp(transformers.__version__, '4.36.2', 'ge')
@@ -52,9 +50,6 @@ class InternVLChatModel(PreTrainedModel):
52
  self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
53
  self.downsample_ratio = config.downsample_ratio
54
  self.ps_version = config.ps_version
55
- use_flash_attn = use_flash_attn if has_flash_attn else False
56
- config.vision_config.use_flash_attn = True if use_flash_attn else False
57
- config.llm_config.attn_implementation = 'flash_attention_2' if use_flash_attn else 'eager'
58
 
59
  logger.info(f'num_image_token: {self.num_image_token}')
60
  logger.info(f'ps_version: {self.ps_version}')
@@ -103,7 +98,7 @@ class InternVLChatModel(PreTrainedModel):
103
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
104
 
105
  image_flags = image_flags.squeeze(-1)
106
- input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
107
 
108
  vit_embeds = self.extract_feature(pixel_values)
109
  vit_embeds = vit_embeds[image_flags == 1]
@@ -236,9 +231,9 @@ class InternVLChatModel(PreTrainedModel):
236
 
237
  tokenizer.padding_side = 'left'
238
  model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
239
- input_ids = model_inputs['input_ids'].to(self.device)
240
- attention_mask = model_inputs['attention_mask'].to(self.device)
241
- eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
242
  generation_config['eos_token_id'] = eos_token_id
243
  generation_output = self.generate(
244
  pixel_values=pixel_values,
@@ -247,7 +242,7 @@ class InternVLChatModel(PreTrainedModel):
247
  **generation_config
248
  )
249
  responses = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
250
- responses = [response.split(template.sep.strip())[0].strip() for response in responses]
251
  return responses
252
 
253
  def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
@@ -266,7 +261,7 @@ class InternVLChatModel(PreTrainedModel):
266
 
267
  template = get_conv_template(self.template)
268
  template.system_message = self.system_message
269
- eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
270
 
271
  history = [] if history is None else history
272
  for (old_question, old_answer) in history:
@@ -285,8 +280,8 @@ class InternVLChatModel(PreTrainedModel):
285
  query = query.replace('<image>', image_tokens, 1)
286
 
287
  model_inputs = tokenizer(query, return_tensors='pt')
288
- input_ids = model_inputs['input_ids'].to(self.device)
289
- attention_mask = model_inputs['attention_mask'].to(self.device)
290
  generation_config['eos_token_id'] = eos_token_id
291
  generation_output = self.generate(
292
  pixel_values=pixel_values,
@@ -295,7 +290,7 @@ class InternVLChatModel(PreTrainedModel):
295
  **generation_config
296
  )
297
  response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
298
- response = response.split(template.sep.strip())[0].strip()
299
  history.append((question, response))
300
  if return_history:
301
  return response, history
@@ -315,6 +310,7 @@ class InternVLChatModel(PreTrainedModel):
315
  visual_features: Optional[torch.FloatTensor] = None,
316
  generation_config: Optional[GenerationConfig] = None,
317
  output_hidden_states: Optional[bool] = None,
 
318
  **generate_kwargs,
319
  ) -> torch.LongTensor:
320
 
@@ -342,6 +338,7 @@ class InternVLChatModel(PreTrainedModel):
342
  attention_mask=attention_mask,
343
  generation_config=generation_config,
344
  output_hidden_states=output_hidden_states,
 
345
  use_cache=True,
346
  **generate_kwargs,
347
  )
 
3
  # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
 
6
  import warnings
7
+ from typing import Any, List, Optional, Tuple, Union
8
 
9
  import torch.utils.checkpoint
10
  import transformers
 
18
 
19
  from .configuration_internvl_chat import InternVLChatConfig
20
  from .conversation import get_conv_template
21
+ from .modeling_intern_vit import InternVisionModel
22
  from .modeling_internlm2 import InternLM2ForCausalLM
23
 
24
  logger = logging.get_logger(__name__)
 
35
  class InternVLChatModel(PreTrainedModel):
36
  config_class = InternVLChatConfig
37
  main_input_name = 'pixel_values'
 
38
  _supports_flash_attn_2 = True
39
  _no_split_modules = ['InternVisionModel', 'LlamaDecoderLayer', 'InternLM2DecoderLayer']
40
 
41
+ def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None):
42
  super().__init__(config)
43
 
44
  assert version_cmp(transformers.__version__, '4.36.2', 'ge')
 
50
  self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
51
  self.downsample_ratio = config.downsample_ratio
52
  self.ps_version = config.ps_version
 
 
 
53
 
54
  logger.info(f'num_image_token: {self.num_image_token}')
55
  logger.info(f'ps_version: {self.ps_version}')
 
98
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
99
 
100
  image_flags = image_flags.squeeze(-1)
101
+ input_embeds = self.language_model.get_input_embeddings()(input_ids)
102
 
103
  vit_embeds = self.extract_feature(pixel_values)
104
  vit_embeds = vit_embeds[image_flags == 1]
 
231
 
232
  tokenizer.padding_side = 'left'
233
  model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
234
+ input_ids = model_inputs['input_ids'].cuda()
235
+ attention_mask = model_inputs['attention_mask'].cuda()
236
+ eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
237
  generation_config['eos_token_id'] = eos_token_id
238
  generation_output = self.generate(
239
  pixel_values=pixel_values,
 
242
  **generation_config
243
  )
244
  responses = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
245
+ responses = [response.split(template.sep)[0].strip() for response in responses]
246
  return responses
247
 
248
  def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
 
261
 
262
  template = get_conv_template(self.template)
263
  template.system_message = self.system_message
264
+ eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
265
 
266
  history = [] if history is None else history
267
  for (old_question, old_answer) in history:
 
280
  query = query.replace('<image>', image_tokens, 1)
281
 
282
  model_inputs = tokenizer(query, return_tensors='pt')
283
+ input_ids = model_inputs['input_ids'].cuda()
284
+ attention_mask = model_inputs['attention_mask'].cuda()
285
  generation_config['eos_token_id'] = eos_token_id
286
  generation_output = self.generate(
287
  pixel_values=pixel_values,
 
290
  **generation_config
291
  )
292
  response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
293
+ response = response.split(template.sep)[0].strip()
294
  history.append((question, response))
295
  if return_history:
296
  return response, history
 
310
  visual_features: Optional[torch.FloatTensor] = None,
311
  generation_config: Optional[GenerationConfig] = None,
312
  output_hidden_states: Optional[bool] = None,
313
+ return_dict: Optional[bool] = None,
314
  **generate_kwargs,
315
  ) -> torch.LongTensor:
316
 
 
338
  attention_mask=attention_mask,
339
  generation_config=generation_config,
340
  output_hidden_states=output_hidden_states,
341
+ return_dict=return_dict,
342
  use_cache=True,
343
  **generate_kwargs,
344
  )