Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Abstract
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
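The data flow described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `[SEG]` marker name, the helper functions, the tiny dimensions, and the dot-product "decoder" are hypothetical stand-ins, not the authors' implementation and not the SAM-2 API. It only shows the general idea that the hidden state of a special token emitted by the LLM can be projected into a prompt that conditions a mask decoder.

```python
# Toy sketch of "LLM instruction token -> segmentation prompt -> mask".
# All names, shapes, and the [SEG] marker are assumptions for illustration.
from typing import List

SEG_TOKEN = "[SEG]"  # hypothetical special token acting as the instruction token

Vector = List[float]


def find_seg_positions(tokens: List[str]) -> List[int]:
    """Return the indices of instruction tokens in the LLM output."""
    return [i for i, tok in enumerate(tokens) if tok == SEG_TOKEN]


def project(hidden: Vector, weight: List[Vector]) -> Vector:
    """Linear projection from the LLM hidden size to the decoder prompt size."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def decode_mask(prompt: Vector, pixel_features: List[List[Vector]]) -> List[List[int]]:
    """Stand-in mask decoder: a pixel is foreground when its feature
    has a positive dot product with the prompt embedding."""
    return [
        [1 if sum(p * f for p, f in zip(prompt, feat)) > 0 else 0 for feat in row]
        for row in pixel_features
    ]


def segment_from_llm_output(tokens, hidden_states, weight, pixel_features):
    """Produce one mask per instruction token found in the output sequence."""
    masks = []
    for pos in find_seg_positions(tokens):
        prompt = project(hidden_states[pos], weight)
        masks.append(decode_mask(prompt, pixel_features))
    return masks


# Example: 3-dim hidden states, 2-dim prompts, and a 2x2 "image".
tokens = ["The", "dog", "is", SEG_TOKEN]
hidden_states = [[0.0, 0.0, 0.0]] * 3 + [[1.0, 0.0, -1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # maps 3 dims to 2 dims
pixel_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, -1.0]],
]
masks = segment_from_llm_output(tokens, hidden_states, weight, pixel_features)
print(masks)
```

For video input, one plausible extension of this sketch is to reuse the same prompt vector on the features of every frame, so that a single instruction token yields a mask per frame; whether Sa2VA does exactly this is not stated in the abstract.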
Community
Code: https://github.com/magic-research/Sa2VA
Paper: https://arxiv.org/abs/2501.04001
Huggingface: https://huggingface.co/ByteDance/Sa2VA-4B
Project Page: https://lxtgh.github.io/project/sa2va/
You are welcome to try our MLLMs and send feedback to xiangtai.li@bytedance.com, yuanhaobo@whu.edu.cn, or zhang_tao@whu.edu.cn.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models (2024)
- HyperSeg: Towards Universal Visual Segmentation with Large Language Model (2024)
- SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation (2024)
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity (2024)
- LinVT: Empower Your Image-level Large Language Model to Understand Videos (2024)
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM (2024)
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability (2024)