Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Abstract
Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/
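The core idea in the abstract maps directly onto an off-the-shelf inpainting pipeline. Below is a minimal sketch of the diptych construction and right-panel inpainting, assuming diffusers' FluxFillPipeline (available in recent diffusers releases) with the FLUX.1-Fill-dev checkpoint, and rembg for background removal; the panel resolution, prompt template, and the `subject`/`context` strings are illustrative, not the authors' code, and the paper's additional cross-panel attention enhancement is omitted here.

```python
# Sketch of Diptych Prompting: reference in the left panel,
# text-conditioned inpainting fills the right panel.
import torch
from PIL import Image
from diffusers import FluxFillPipeline
from rembg import remove  # off-the-shelf background removal

PANEL = 768  # per-panel resolution (illustrative)

# 1. Remove the reference background to prevent content leakage,
#    compositing the extracted subject onto a plain white backdrop.
reference = Image.open("reference.png").convert("RGB").resize((PANEL, PANEL))
foreground = remove(reference)  # RGBA, transparent background
left_panel = Image.new("RGB", foreground.size, "white")
left_panel.paste(foreground, mask=foreground.split()[-1])

# 2. Build the incomplete diptych: left panel filled, right panel blank.
diptych = Image.new("RGB", (2 * PANEL, PANEL))
diptych.paste(left_panel, (0, 0))

# 3. Mask covering only the right panel; the model inpaints this region.
mask = Image.new("L", (2 * PANEL, PANEL), 0)
mask.paste(Image.new("L", (PANEL, PANEL), 255), (PANEL, 0))

# 4. Diptych-style prompt tying both panels to the same subject.
subject = "a plush owl toy"          # hypothetical example subject
context = "riding a bicycle in Times Square"  # hypothetical context
prompt = (
    f"A diptych with two side-by-side images of the same {subject}. "
    f"On the left, a photo of the {subject}. "
    f"On the right, the {subject} {context}."
)

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")

result = pipe(
    prompt=prompt,
    image=diptych,
    mask_image=mask,
    height=PANEL,
    width=2 * PANEL,
    num_inference_steps=50,
    guidance_scale=30.0,
).images[0]

# 5. Keep only the generated right panel as the final output.
result.crop((PANEL, 0, 2 * PANEL, PANEL)).save("subject_driven_result.png")
```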
Community
We introduce Diptych Prompting, a novel zero-shot subject-driven text-to-image generation method that reinterprets the task as inpainting with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- RelationBooth: Towards Relation-Aware Customized Object Generation (2024)
- MagicEraser: Erasing Any Objects via Semantics-Aware Control (2024)
- 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation (2024)
- TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control (2024)
- DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation (2024)
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation (2024)
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement (2024)