3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
Abstract
The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely available pre-trained depth-control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth-map-controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
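As a rough illustration of the layout-constrained attention idea described in the abstract, the sketch below shows one way an attention mask could be built from instance bounding boxes so that each instance's text tokens only attend to image tokens inside its box. This is a minimal sketch assuming PyTorch, a hypothetical build_layout_attention_mask helper, and a [text tokens, image tokens] ordering; it is not the authors' implementation, and FLUX's actual token layout and mask handling may differ.

```python
# Illustrative sketch only (not the authors' released code): deriving a
# joint-attention mask from instance layouts, in the spirit of the detail
# renderer described above. All names here are hypothetical.
import torch


def build_layout_attention_mask(boxes, text_spans, n_text, h, w):
    """Return a boolean (n_tokens, n_tokens) mask where True = attention allowed.

    boxes:      list of (x0, y0, x1, y1) instance boxes, normalized to [0, 1].
    text_spans: list of (start, end) token-index ranges, one per instance,
                locating that instance's description inside the text sequence.
    n_text:     total number of text tokens (global prompt + instance prompts).
    h, w:       height and width of the image-token grid (h * w image tokens).
    """
    n_img = h * w
    n = n_text + n_img
    mask = torch.ones(n, n, dtype=torch.bool)  # start fully connected

    # Centers of the image-token grid cells, normalized to [0, 1].
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (xs.flatten() + 0.5) / w
    cy = (ys.flatten() + 0.5) / h

    for (x0, y0, x1, y1), (t0, t1) in zip(boxes, text_spans):
        inside = (cx >= x0) & (cx < x1) & (cy >= y0) & (cy < y1)  # (n_img,)
        outside_img = n_text + torch.nonzero(~inside).flatten()   # global token ids
        inst_text = torch.arange(t0, t1)
        # Block attention between this instance's text tokens and image tokens
        # outside its box, so its fine-grained attributes are rendered in place.
        mask[inst_text.unsqueeze(1), outside_img.unsqueeze(0)] = False
        mask[outside_img.unsqueeze(1), inst_text.unsqueeze(0)] = False
    return mask
```

Such a mask would then be supplied to the attention computation inside the joint-attention blocks during rendering; exactly how and where 3DIS-FLUX injects it is specified in the paper, and the sketch above only illustrates the general idea of layout-constrained attention.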
Community
3DIS-FLUX: Achieve Layout-to-Image Generation Without Extra Training!
Harness the power of 3DIS to transform layouts into stunning high-resolution images effortlessly, with no additional training required!
Seamlessly integrates with various finetuned FLUX models and LoRA weights!
Project Page: https://limuloo.github.io/3DIS/
GitHub Repository: https://github.com/limuloo/3DIS/tree/main
Librarian Bot (automated message): The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation (2024)
- EliGen: Entity-Level Controlled Image Generation with Regional Attention (2025)
- MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation (2024)
- CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation (2024)
- LocRef-Diffusion: Tuning-Free Layout and Appearance-Guided Generation (2024)
- Enhancing Image Generation Fidelity via Progressive Prompts (2025)
- T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Generation (2024)