Multi-subject Open-set Personalization in Video Generation
Abstract
Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset construction and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models trained this way can easily denoise the training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
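The abstract's description of the conditioning module suggests a simple structure: video latent tokens cross-attend to a conditioning sequence built by pairing each reference image's embeddings with its subject-level text embeddings. The sketch below is an illustrative approximation, not the authors' implementation; the module name, tensor shapes, and the concatenate-then-attend fusion are assumptions.

```python
import torch
import torch.nn as nn

class SubjectCrossAttention(nn.Module):
    """Hypothetical sketch: video tokens attend to a conditioning sequence that
    pairs each reference image's embeddings with its subject-level text embeddings."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, ref_image_tokens, subject_text_tokens):
        # video_tokens:        (B, N_video, dim)       latent tokens of the video being denoised
        # ref_image_tokens:    (B, S, N_img, dim)      one embedding set per reference subject
        # subject_text_tokens: (B, S, N_txt, dim)      subject-level prompt embeddings (e.g. "the corgi")
        B, S, N_img, D = ref_image_tokens.shape
        # Bind each reference image to its own subject phrase, then flatten all
        # subjects into a single conditioning sequence for cross-attention.
        cond = torch.cat([ref_image_tokens, subject_text_tokens], dim=2)  # (B, S, N_img + N_txt, D)
        cond = cond.reshape(B, -1, D)                                     # (B, S * (N_img + N_txt), D)
        attended, _ = self.attn(query=self.norm(video_tokens), key=cond, value=cond)
        return video_tokens + attended  # residual update of the video tokens
```

The key design point reflected here is the per-subject pairing: keeping each reference image's tokens adjacent to its own subject phrase is one straightforward way to let attention associate the right image with the right word in the prompt when multiple subjects are conditioned at once.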
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics (2024)
- DIVE: Taming DINO for Subject-Driven Video Editing (2024)
- VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing (2024)
- MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation (2024)
- DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models (2024)
- VIRES: Video Instance Repainting with Sketch and Text Guidance (2024)
- Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator (2024)