VisionZip: Longer is Better but Not Necessary in Vision Language Models
Abstract
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances inference speed, reducing prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
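For readers curious how such token selection might look in practice, here is a minimal sketch, not the authors' implementation: it assumes access to the vision encoder's patch features and the attention each patch receives from the [CLS] token, and the 54/10 split is only an illustrative choice. The paper keeps a small set of attention-dominant tokens and merges the rest into a few contextual tokens; the sketch merges by simple uniform grouping rather than the paper's similarity-based merging.
import torch
def select_visual_tokens(features, cls_attention, num_dominant=54, num_contextual=10):
    # features:      (N, D) patch-token features from the vision encoder
    # cls_attention: (N,)   attention each patch receives from the [CLS] token
    # Keep the patches that receive the most [CLS] attention ("dominant" tokens).
    dominant_idx = cls_attention.topk(num_dominant).indices
    dominant = features[dominant_idx]
    # Merge the remaining patches into a few "contextual" tokens; uniform grouping
    # and averaging here, purely for illustration.
    mask = torch.ones(features.size(0), dtype=torch.bool)
    mask[dominant_idx] = False
    remaining = features[mask]
    contextual = torch.stack([g.mean(dim=0) for g in remaining.chunk(num_contextual, dim=0)])
    return torch.cat([dominant, contextual], dim=0)  # e.g. (64, D) tokens passed to the LLM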
Community
🚀 Demo: http://202.104.135.156:7860/
🌟 Video: https://youtu.be/sytaAzmxxpo?si=IieArmQ7YNf2dVyM
🎯 Code: https://github.com/dvlab-research/VisionZip
Usage:
pip install visionzip
from visionzip import visionzip
# Apply VisionZip to an already-loaded VLM (e.g., LLaVA); redundant visual tokens are removed before the language model.
model = visionzip(model)
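For context, an end-to-end sketch with a LLaVA checkpoint might look like the following. The loader calls come from the LLaVA codebase, the checkpoint name is just an example, and any optional arguments controlling how many tokens are kept should be checked against the repository README.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from visionzip import visionzip
# Load a standard LLaVA checkpoint (example path).
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path)
)
# Wrap the model so redundant visual tokens are removed before the language model.
model = visionzip(model)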
🔥 Highlights:
VisionZip achieves state-of-the-art performance among efficient VLM methods. By retaining only 10% of the visual tokens, it preserves nearly 95% of the original performance in training-free mode.
VisionZip can be applied at the inference stage (without incurring any additional training cost), at the efficient tuning stage (to achieve better results), and at the training stage (almost no performance degradation, while saving 2× memory and 2× training time).
VisionZip significantly reduces the pre-filling time by 8x and the total inference time by 2x (with KV cache enabled); a rough cost sketch follows below.
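As a rough illustration of where the prefill savings come from (the numbers below are illustrative assumptions, not measurements): per transformer layer, attention cost grows quadratically with sequence length while the feed-forward cost grows linearly, so cutting the visual tokens to about 10% removes most of the prefill work.
# Back-of-envelope prefill cost per transformer layer (illustrative, not a measurement).
def prefill_cost(seq_len, hidden=4096, ffn_mult=4):
    attention = seq_len ** 2 * hidden               # quadratic in sequence length
    feedforward = seq_len * hidden ** 2 * ffn_mult  # linear in sequence length
    return attention + feedforward
full = prefill_cost(2880 + 64)     # assume ~2880 visual + 64 text tokens (LLaVA-NeXT-like)
reduced = prefill_cost(288 + 64)   # assume ~10% of the visual tokens are retained
print(f"approx. prefill speedup: {full / reduced:.1f}x")  # on the order of the reported 8x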
Similar methods will destroy the model’s performance on OCR tasks, especially those with high text density.
Thank you for your interest in our work. OCR capability was also a concern during the development of VisionZip, but our results show that it does not cause a significant drop in performance. For example, when LLaVA-1.5 retains only 64 visual tokens, it still reaches 96.2% of its original performance on the TextVQA benchmark.
We believe this is because local textual information is highly aggregated in the deeper layers of the vision encoder, so even when a large number of tokens are dropped, the impact is minimal. You could also try keeping different numbers of visual tokens in our demo to explore this further.
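A simple way to probe this aggregation yourself is to look at how the [CLS] token attends to image patches in a deep layer of the vision tower. The sketch below assumes the Hugging Face transformers CLIP implementation; the layer index, the top-64 cutoff, and the image filename are arbitrary choices for illustration.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor
# The CLIP vision tower commonly used by LLaVA-style models.
name = "openai/clip-vit-large-patch14-336"
model = CLIPVisionModel.from_pretrained(name)
processor = CLIPImageProcessor.from_pretrained(name)
image = Image.open("document_page.png")  # any text-heavy image (placeholder filename)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)
# Attention the [CLS] token pays to each of the 576 patches in a deep layer.
attn = outputs.attentions[-2]                 # (1, heads, 577, 577)
cls_to_patch = attn[0, :, 0, 1:].mean(dim=0)  # average over heads -> (576,)
# Typically a small subset of patches receives most of the attention mass.
share = cls_to_patch.topk(64).values.sum() / cls_to_patch.sum()
print(f"attention mass captured by the top 64 patches: {share.item():.1%}")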
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FoPru: Focal Pruning for Efficient Large Vision-Language Models (2024)
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model (2024)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings (2024)
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models (2024)
- Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification (2024)
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers (2024)
- [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster (2024)