Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Abstract
In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, representations that can further be decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna, which unifies video-to-text autoregression and text-to-video generation by modeling the distributions of the continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.
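The core idea of the abstract can be illustrated with a minimal sketch: a tokenizer produces features from a video clip, and a denoiser is trained to predict the noise added to that clip *conditioned on those features*; minimizing the noise-prediction loss forces the features to carry spatial and temporal information. The module names, shapes, and fixed noise level below are illustrative assumptions, not the paper's actual architecture or noise schedule.

```python
# Hypothetical sketch of diffusion-powered self-supervised tokenizer training.
# All shapes and modules are toy-sized stand-ins for the paper's real components.
import torch
import torch.nn as nn

class TinyTokenizer(nn.Module):
    """Maps a video clip (B, T, 3, 8, 8) to a compact feature sequence."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 8 * 8, dim)  # toy per-frame embedding

    def forward(self, video):
        b, t, c, h, w = video.shape
        frames = video.reshape(b, t, -1)       # flatten each frame
        return self.proj(frames)               # (B, T, dim)

class TinyDenoiser(nn.Module):
    """Predicts the noise added to a clip, conditioned on tokenizer features."""
    def __init__(self, dim=64, frame_dim=3 * 8 * 8):
        super().__init__()
        self.cond = nn.Linear(dim, frame_dim)
        self.net = nn.Sequential(nn.Linear(frame_dim, 128), nn.ReLU(),
                                 nn.Linear(128, frame_dim))

    def forward(self, noisy, feats):
        b, t, c, h, w = noisy.shape
        x = noisy.reshape(b, t, -1) + self.cond(feats)  # inject the condition
        return self.net(x).reshape(b, t, c, h, w)

def training_step(tokenizer, denoiser, video, opt):
    """One self-supervised step: noise the clip, denoise conditioned on
    tokenizer features, and minimize the noise-prediction MSE."""
    feats = tokenizer(video)
    noise = torch.randn_like(video)
    alpha = 0.7  # single fixed noise level for brevity (real models sample t)
    noisy = alpha * video + (1 - alpha ** 2) ** 0.5 * noise
    loss = nn.functional.mse_loss(denoiser(noisy, feats), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the denoiser only sees the clip through the tokenizer's features, gradients from the diffusion loss train the tokenizer itself, and the same denoiser doubles as a de-tokenizer at inference time by iteratively denoising from random noise given the features.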
Community
The following similar papers were recommended by the Semantic Scholar API (via the automated Librarian Bot):
- Mimir: Improving Video Diffusion Models for Precise Text Understanding (2024)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction (2024)
- DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models (2024)
- LinVT: Empower Your Image-level Large Language Model to Understand Videos (2024)
- Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model (2024)
- Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation (2024)
- VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement (2024)