Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Abstract
Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities. Namely, we demonstrate a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, this issue is only reflected in performance for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained with the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages to allow the criteria to become more effective at a later layer while still achieving significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER achieves more than a 5x performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
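As a rough illustration of the ensemble idea described in the abstract, the sketch below combines an attention-based top-k criterion with uniform sampling over token positions and pairs it with a simple two-stage pruning schedule. All names (`select_visual_tokens`, `two_stage_schedule`), the score aggregation, and the specific layer indices and keep ratios are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def select_visual_tokens(attn_to_visual: torch.Tensor,
                         keep_ratio: float,
                         uniform_frac: float) -> torch.Tensor:
    """Ensemble selection criterion: attention-based top-k plus uniform sampling.

    attn_to_visual: (num_visual_tokens,) score for each visual token, e.g. attention
        received from text tokens aggregated over heads (assumed aggregation).
    keep_ratio: fraction of visual tokens to keep at this pruning stage.
    uniform_frac: fraction of the kept budget devoted to uniformly sampled positions,
        so every image region (including the top rows) retains some coverage.
    Returns sorted indices of the visual tokens to keep.
    """
    n = attn_to_visual.numel()
    num_keep = max(1, int(n * keep_ratio))
    num_uniform = int(num_keep * uniform_frac)
    num_topk = num_keep - num_uniform

    # Criterion 1: keep the tokens with the highest attention scores.
    topk_idx = torch.topk(attn_to_visual, num_topk).indices

    # Criterion 2: uniformly sample token positions across the whole image.
    if num_uniform > 0:
        uniform_idx = torch.linspace(0, n - 1, steps=num_uniform).round().long()
    else:
        uniform_idx = topk_idx.new_empty(0)

    # Union of both criteria (duplicates collapse, so the kept count can fall
    # slightly below the budget); torch.unique returns sorted indices, which
    # preserves the original token order.
    return torch.unique(torch.cat([topk_idx, uniform_idx]))


def two_stage_schedule(num_layers: int = 32) -> dict:
    """Two-stage pruning schedule (illustrative layer indices and ratios only):
    a moderate early prune for speedup, then a stronger prune at a deeper layer
    where the attention-based criterion is more reliable."""
    return {
        num_layers // 8: 0.5,   # stage 1: early layer, keep 50% of visual tokens
        num_layers // 2: 0.25,  # stage 2: later layer, keep 25% of the original tokens
    }


if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.rand(576)  # e.g. a 24x24 grid of visual tokens
    kept = select_visual_tokens(scores, keep_ratio=0.25, uniform_frac=0.5)
    print(f"kept {kept.numel()} of {scores.numel()} visual tokens")
    print(two_stage_schedule())
```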
Community
Paper: https://arxiv.org/abs/2412.13180
Website: https://web.stanford.edu/~markendo/projects/feather
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FoPru: Focal Pruning for Efficient Large Vision-Language Models (2024)
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models (2024)
- Cross-Self KV Cache Pruning for Efficient Vision-Language Inference (2024)
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs (2024)
- Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models (2024)
- VisionZip: Longer is Better but Not Necessary in Vision Language Models (2024)
- [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster (2024)
If you want recommendations for any Paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend