jklj077 committed on
Commit
258ddfb
·
verified ·
1 Parent(s): 8f7a585

Update README.md

Files changed (1)
  1. README.md +9 -193
README.md CHANGED
@@ -37,206 +37,21 @@ We're excited to unveil **Qwen2-VL**, the latest iteration of our Qwen-VL model,
37
  <img src="http://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/mrope.png" width="80%"/>
38
  <p>
39
 
40
- We have three models with 2, 7 and 72 billion parameters. This repo contains the pretrained 2B Qwen2-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub](https://github.com/QwenLM/Qwen2-VL).
41
 
42
- ## Requirements
43
-
44
- The code of Qwen2-VL has been in the latest Hugging face transformers and we advise you to build from source with command `pip install git+https://github.com/huggingface/transformers`, or you might encounter the following error:
45
-
46
- ```
47
- KeyError: 'qwen2_vl'
48
- ```
49
-
50
- ## Quickstart
51
-
52
- Here we show a code snippet to show you how to use the chat model with `transformers`:
53
-
54
- ### Single Media inference
55
-
56
- The model can accept both images and videos as input. Here's an example code for inference.
57
-
58
- ```python
59
- from PIL import Image
60
- import requests
61
- import torch
62
- from torchvision import io
63
- from typing import Dict
64
- from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
65
-
66
- # Load the model in half-precision on the available device(s)
67
- model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/QQwen2-VL-2B-Base", device_map="auto")
68
- processor = AutoProcessor.from_pretrained("Qwen/QQwen2-VL-2B-Base")
69
-
70
- # Image
71
- url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
72
- image = Image.open(requests.get(url, stream=True).raw)
73
-
74
- conversation = [
75
- {
76
- "type":"image",
77
- },
78
- {
79
- "type":"text",
80
- "text":"In this image,"
81
- }
82
- ]
83
-
84
-
85
- # Preprocess the inputs
86
- text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
87
- # Excepted output: '<|vision_start|><|image_pad|><|vision_end|>In this image,'
88
-
89
- inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
90
- inputs = inputs.to('cuda')
91
-
92
- # Inference: Generation of the output
93
- output_ids = model.generate(**inputs, max_new_tokens=128)
94
- generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
95
- output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
96
- print(output_text)
97
-
98
-
99
-
100
- # Video
101
- def fetch_video(ele: Dict, nframe_factor=2):
102
- if isinstance(ele['video'], str):
103
- def round_by_factor(number: int, factor: int) -> int:
104
- return round(number / factor) * factor
105
-
106
- video = ele["video"]
107
- if video.startswith("file://"):
108
- video = video[7:]
109
-
110
- video, _, info = io.read_video(
111
- video,
112
- start_pts=ele.get("video_start", 0.0),
113
- end_pts=ele.get("video_end", None),
114
- pts_unit="sec",
115
- output_format="TCHW",
116
- )
117
- assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
118
- if "nframes" in ele:
119
- nframes = round_by_factor(ele["nframes"], nframe_factor)
120
- else:
121
- fps = ele.get("fps", 1.0)
122
- nframes = round_by_factor(video.size(0) / info["video_fps"] * fps, nframe_factor)
123
- idx = torch.linspace(0, video.size(0) - 1, nframes, dtype=torch.int64)
124
- return video[idx]
125
 
126
- video_info = {"type": "video", "video": "/path/to/video.mp4", "fps": 1.0}
127
- video = fetch_video(video_info)
128
- conversation = [
129
- {"type": "video"},
130
- {"type": "text", "text": "What happened in the video? Answer:"},
131
- ]
132
 
133
- # Preprocess the inputs
134
- text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
135
- # Excepted output: '<|vision_start|><|video_pad|><|vision_end|>What happened in the video? Answer:'
136
 
137
- inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
138
- inputs = inputs.to('cuda')
139
-
140
- # Inference: Generation of the output
141
- output_ids = model.generate(**inputs, max_new_tokens=128)
142
- generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
143
- output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
144
- print(output_text)
145
- ```
146
-
147
- ### Batch Mixed Media Inference
148
-
149
- The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.
150
-
151
- ```python
152
- image1 = Image.open("/path/to/image1.jpg")
153
- image2 = Image.open("/path/to/image2.jpg")
154
- image3 = Image.open("/path/to/image3.jpg")
155
- image4 = Image.open("/path/to/image4.jpg")
156
- image5 = Image.open("/path/to/image5.jpg")
157
- video = fetch_video({
158
- "type": "video",
159
- "video": "/path/to/video.mp4",
160
- "fps": 1.0
161
- })
162
-
163
- # Conversation for the first image
164
- conversation1 = [
165
- {"type": "image"},
166
- {"type": "text", "text": "In this image."}
167
- ]
168
-
169
- # Conversation with two images
170
- conversation2 = [
171
- {"type": "image"},
172
- {"type": "image"},
173
- {"type": "text", "text": "What is written in the pictures?"}
174
- ]
175
-
176
- # Conversation with pure text
177
- conversation3 = "who are you?"
178
-
179
-
180
- # Conversation with mixed midia
181
- conversation4 = [
182
- {"type": "image"},
183
- {"type": "image"},
184
- {"type": "video"},
185
- {"type": "text", "text": "What are the common elements in these medias?"},
186
- ]
187
-
188
- conversations = [conversation1, conversation2, conversation3, conversation4]
189
- # Preparation for batch inference
190
- texts = [processor.apply_chat_template(msg, add_generation_prompt=True) for msg in conversations]
191
- inputs = processor(
192
- text=texts,
193
- images=[image1, image2, image3, image4, image5],
194
- videos=[video],
195
- padding=True,
196
- return_tensors="pt",
197
- )
198
- inputs = inputs.to('cuda')
199
-
200
- # Batch Inference
201
- output_ids = model.generate(**inputs, max_new_tokens=128)
202
- generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
203
- output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
204
- print(output_text)
205
- ```
206
-
207
- #### Image Resolution for performance boost
208
 
209
- The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
210
 
211
- ```python
212
- min_pixels = 256 * 28 * 28
213
- max_pixels = 1280 * 28 * 28
214
- processor = AutoProcessor.from_pretrained(
215
- "Qwen/Qwen2-VL-2B-Base", min_pixels=min_pixels, max_pixels=max_pixels
216
- )
217
  ```
218
-
219
- #### Flash-Attention 2 to speed up generation
220
-
221
- First, make sure to install the latest version of Flash Attention 2:
222
-
223
- ```bash
224
- pip install -U flash-attn --no-build-isolation
225
  ```
226
 
227
- Also, you should have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
228
-
229
- To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
230
-
231
- ```python
232
- from transformers import Qwen2VLForConditionalGeneration
233
-
234
- model = Qwen2VLForConditionalGeneration.from_pretrained(
235
- "Qwen/Qwen2-VL-2B-Instruct",
236
- torch_dtype=torch.bfloat16,
237
- attn_implementation="flash_attention_2",
238
- )
239
- ```
240
 
241
  ## Limitations
242
 
@@ -257,8 +72,9 @@ If you find our work helpful, feel free to give us a cite.
257
 
258
  ```
259
  @article{Qwen2-VL,
260
- title={Qwen2-VL},
261
- author={Qwen team},
 
262
  year={2024}
263
  }
264
 
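The removed "Image Resolution" section sets `min_pixels` and `max_pixels` as a token count multiplied by `28 * 28`, which implies that each visual token covers a 28x28 pixel patch. The arithmetic can be sketched as below; the helper name `pixel_budget` and the constant `PIXELS_PER_TOKEN` are illustrative and not part of the README or of `transformers`.

```python
# Each visual token is assumed to cover a 28x28 pixel patch, as implied by
# the `256 * 28 * 28` expression in the removed README section.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if min_tokens <= 0 or max_tokens < min_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


# The 256-1280 token range quoted in the removed section.
min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

The two values are what the removed snippet passed to `AutoProcessor.from_pretrained` as `min_pixels` and `max_pixels`.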
 
37
  <img src="http://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/mrope.png" width="80%"/>
38
  <p>
39
 
40
+ We have three models with 2, 7 and 72 billion parameters.
41
 
42
+ This repo contains the **pretrained** 2B Qwen2-VL model.
43
 
44
 
45
+ For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub](https://github.com/QwenLM/Qwen2-VL).
46
 
47
+ ## Requirements
48
 
49
+ The code of Qwen2-VL has been in the latest Hugging Face `transformers` and we advise you to install the latest version with command `pip install -U transformers`, or you might encounter the following error:
50
 
51
  ```
52
+ KeyError: 'qwen2_vl'
53
  ```
54
 
55
 
56
  ## Limitations
57
 
 
72
 
73
  ```
74
  @article{Qwen2-VL,
75
+ title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
76
+ author={Peng Wang and Shuai Bai and Sinan Tan and Shijie Wang and Zhihao Fan and Jinze Bai and Keqin Chen and Xuejing Liu and Jialin Wang and Wenbin Ge and Yang Fan and Kai Dang and Mengfei Du and Xuancheng Ren and Rui Men and Dayiheng Liu and Chang Zhou and Jingren Zhou and Junyang Lin},
77
+ journal={arXiv preprint arXiv:2409.12191},
78
  year={2024}
79
  }
80
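The updated Requirements text warns that an outdated `transformers` install fails with `KeyError: 'qwen2_vl'`. A minimal sketch of checking the installed version before loading the model follows. It assumes that Qwen2-VL support first shipped in `transformers` 4.45; that threshold does not appear in this commit, and the helper `supports_qwen2_vl` is a hypothetical name.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption: Qwen2-VL support first shipped in transformers 4.45.
# Adjust this threshold if the release notes say otherwise.
MIN_VERSION = (4, 45)


def supports_qwen2_vl(version_string: str) -> bool:
    """Return True if the major.minor prefix meets the assumed minimum."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= MIN_VERSION


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed; run `pip install -U transformers`")
elif supports_qwen2_vl(installed):
    print(f"transformers {installed} should recognize 'qwen2_vl'")
else:
    print(f"transformers {installed} is likely too old; run `pip install -U transformers`")
```

Only the `major.minor` prefix is compared, so development builds such as `4.45.0.dev0` pass the check.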