VisionZip: Longer is Better but Not Necessary in Vision Language Models
Abstract
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances inference speed, reducing prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
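For readers curious how such token selection might look in practice, here is a minimal sketch, not the authors' implementation: it assumes access to the vision encoder's patch features and the attention each patch receives from the [CLS] token, and the 54/10 split is only an illustrative choice. The paper keeps a small set of attention-dominant tokens and merges the rest into a few contextual tokens; the sketch merges by simple uniform grouping rather than the paper's similarity-based merging.
import torch
def select_visual_tokens(features, cls_attention, num_dominant=54, num_contextual=10):
    # features:      (N, D) patch-token features from the vision encoder
    # cls_attention: (N,)   attention each patch receives from the [CLS] token
    # Keep the patches that receive the most [CLS] attention ("dominant" tokens).
    dominant_idx = cls_attention.topk(num_dominant).indices
    dominant = features[dominant_idx]
    # Merge the remaining patches into a few "contextual" tokens; uniform grouping
    # and averaging here, purely for illustration.
    mask = torch.ones(features.size(0), dtype=torch.bool)
    mask[dominant_idx] = False
    remaining = features[mask]
    contextual = torch.stack([g.mean(dim=0) for g in remaining.chunk(num_contextual, dim=0)])
    return torch.cat([dominant, contextual], dim=0)  # e.g. (64, D) tokens passed to the LLM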
Community
🚀 Demo: http://202.104.135.156:7860/
🌟 Video: https://youtu.be/sytaAzmxxpo?si=IieArmQ7YNf2dVyM
🎯 Code: https://github.com/dvlab-research/VisionZip
Usage:
pip install visionzip
from visionzip import visionzip
# Apply VisionZip to an already-loaded VLM (e.g., LLaVA); redundant visual tokens are removed before the language model.
model = visionzip(model)
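For context, an end-to-end sketch with a LLaVA checkpoint might look like the following. The loader calls come from the LLaVA codebase, the checkpoint name is just an example, and any optional arguments controlling how many tokens are kept should be checked against the repository README.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from visionzip import visionzip
# Load a standard LLaVA checkpoint (example path).
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path)
)
# Wrap the model so redundant visual tokens are removed before the language model.
model = visionzip(model)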
🔥 Highlights:
VisionZip achieves state-of-the-art performance among efficient VLM methods. By retaining only 10% of the visual tokens, it preserves nearly 95% of the original performance in training-free mode.
VisionZip can be applied at the inference stage (without incurring any additional training cost), at the efficient tuning stage (to achieve better results), and at the training stage (almost no performance degradation, while saving 2× memory and 2× training time).
VisionZip significantly reduces the pre-filling time by 8x and the total inference time by 2x (with KV cache enabled); a rough cost sketch follows below.
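As a rough illustration of where the prefill savings come from (the numbers below are illustrative assumptions, not measurements): per transformer layer, attention cost grows quadratically with sequence length while the feed-forward cost grows linearly, so cutting the visual tokens to about 10% removes most of the prefill work.
# Back-of-envelope prefill cost per transformer layer (illustrative, not a measurement).
def prefill_cost(seq_len, hidden=4096, ffn_mult=4):
    attention = seq_len ** 2 * hidden               # quadratic in sequence length
    feedforward = seq_len * hidden ** 2 * ffn_mult  # linear in sequence length
    return attention + feedforward
full = prefill_cost(2880 + 64)     # assume ~2880 visual + 64 text tokens (LLaVA-NeXT-like)
reduced = prefill_cost(288 + 64)   # assume ~10% of the visual tokens are retained
print(f"approx. prefill speedup: {full / reduced:.1f}x")  # on the order of the reported 8x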
Similar methods will destroy the model’s performance on OCR tasks, especially those with high text density.
Thank you for your interest in our work. OCR capability was also a concern during the development of VisionZip, but our results show that it does not cause a significant drop in performance. For example, when LLaVA-1.5 retains only 64 visual tokens, it still reaches 96.2% of its original performance on the TextVQA benchmark.
We believe this is because local textual information is highly aggregated in the deeper layers of the vision encoder, so even when a large number of tokens are dropped, the impact is minimal. You could also try keeping different numbers of visual tokens in our demo to explore this further.
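A simple way to probe this aggregation yourself is to look at how the [CLS] token attends to image patches in a deep layer of the vision tower. The sketch below assumes the Hugging Face transformers CLIP implementation; the layer index, the top-64 cutoff, and the image filename are arbitrary choices for illustration.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor
# The CLIP vision tower commonly used by LLaVA-style models.
name = "openai/clip-vit-large-patch14-336"
model = CLIPVisionModel.from_pretrained(name)
processor = CLIPImageProcessor.from_pretrained(name)
image = Image.open("document_page.png")  # any text-heavy image (placeholder filename)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)
# Attention the [CLS] token pays to each of the 576 patches in a deep layer.
attn = outputs.attentions[-2]                 # (1, heads, 577, 577)
cls_to_patch = attn[0, :, 0, 1:].mean(dim=0)  # average over heads -> (576,)
# Typically a small subset of patches receives most of the attention mass.
share = cls_to_patch.topk(64).values.sum() / cls_to_patch.sum()
print(f"attention mass captured by the top 64 patches: {share.item():.1%}")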
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FoPru: Focal Pruning for Efficient Large Vision-Language Models (2024)
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model (2024)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings (2024)
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models (2024)
- Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification (2024)
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers (2024)
- [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster (2024)