Visual Context Window Extension: A New Perspective for Long Video Understanding
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive performance on short video understanding tasks but face significant challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training; however, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between the visual and language modalities lead to different context windows for visual and language tokens, which makes it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
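For illustration, here is a minimal sketch of how extending a visual context window might look if it is realized as linear position interpolation over the visual token positions. The function name, its signature, and the choice of linear (RoPE-style) scaling are our assumptions for this sketch, not the paper's exact formulation.

```python
import torch

def interpolate_visual_positions(num_visual_tokens: int,
                                 visual_context_window: int) -> torch.Tensor:
    """Map a long visual token sequence onto a shorter visual context
    window via linear position interpolation (RoPE-style scaling).

    Illustrative only: the function name, signature, and the choice of
    linear interpolation are assumptions, not the paper's exact method.
    """
    positions = torch.arange(num_visual_tokens, dtype=torch.float32)
    if num_visual_tokens <= visual_context_window:
        # Short sequences fit as-is; keep the original integer positions.
        return positions
    # Compress positions so the last visual token still lands inside
    # the visual context window the model was trained with.
    return positions * (visual_context_window / num_visual_tokens)


# Example: 256 frames x 144 tokens per frame squeezed into an 8K window.
pos = interpolate_visual_positions(256 * 144, 8192)
print(pos[-1])  # ~8191.8, i.e. still within the 8192-position visual window
```

In this view, the positions of a long frame sequence are compressed so that they fit inside the window the model effectively learned for visual tokens during short-video training, without any additional fine-tuning.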
Community
In this paper, we address the challenge of long video understanding from the perspective of context windows, avoiding the resource consumption associated with training from scratch.
- By decomposing the effective context window of LMMs into separate visual and language context windows, we propose visual context window extension, which allows LMMs trained on short videos to be applied to long video understanding tasks without fine-tuning.
- Additionally, we introduce a progressive pooling strategy to mitigate the memory consumption caused by long sequences (see the sketch after this list).
- On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B.
- In a 256-frame setting, this strategy reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
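To make the progressive pooling idea concrete, below is a minimal PyTorch-style sketch in which frames beyond a chosen split point are average-pooled to a coarser spatial grid before flattening. The helper name, the split point `keep_first`, and the target grid size are hypothetical choices for illustration, not the paper's exact schedule.

```python
import torch
import torch.nn.functional as F

def progressive_pooling(frame_embeds: torch.Tensor,
                        keep_first: int = 64,
                        pooled_size: int = 6) -> torch.Tensor:
    """Reduce visual tokens by pooling the spatial grid of later frames.

    frame_embeds: (num_frames, H, W, C) patch embeddings per frame.
    Frames beyond `keep_first` are average-pooled to a pooled_size x
    pooled_size grid. The split point and target grid size are
    illustrative assumptions; the paper's progressive schedule may differ.
    """
    num_frames, H, W, C = frame_embeds.shape
    tokens = []
    for i in range(num_frames):
        grid = frame_embeds[i]                               # (H, W, C)
        if i >= keep_first:
            grid = grid.permute(2, 0, 1).unsqueeze(0)        # (1, C, H, W)
            grid = F.adaptive_avg_pool2d(grid, pooled_size)  # (1, C, s, s)
            grid = grid.squeeze(0).permute(1, 2, 0)          # (s, s, C)
        tokens.append(grid.reshape(-1, C))                   # flatten grid
    return torch.cat(tokens, dim=0)                          # (total_tokens, C)


# Example: 256 frames of 12x12 patch embeddings with hidden size 1024.
embeds = torch.randn(256, 12, 12, 1024)
print(progressive_pooling(embeds).shape)  # torch.Size([16128, 1024]) vs. 36864 unpooled
```

Under these illustrative settings, the visual token count drops by more than half while the earliest frames keep their full spatial resolution, which is the kind of trade-off that enables the reported memory savings.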
We hope this work will advance research in long video understanding and provide insights for the design of future long video understanding models.
This is an automated message from Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding (2024)
- Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos (2024)
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding (2024)
- Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner (2024)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation (2024)