---
license: mit
language:
- en
base_model:
- black-forest-labs/FLUX.1-dev
- Qwen/Qwen2-VL-7B-Instruct
library_name: diffusers
tags:
- flux
- qwen2vl
- stable-diffusion
- text-to-image
- image-to-image
- controlnet
pipeline_tag: text-to-image
---

# Qwen2vl-Flux

<div align="center">
<img src="landing-1.png" alt="Qwen2vl-Flux Banner" width="100%">
</div>

Qwen2vl-Flux is a state-of-the-art multimodal image generation model that enhances FLUX with Qwen2VL's vision-language understanding. It generates high-quality images from text prompts, visual references, or both, offering finer multimodal understanding and control than text-only conditioning.

## Model Architecture

<div align="center">
<img src="flux-architecture.svg" alt="Flux Architecture" width="800px">
</div>

The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, enabling more precise and context-aware image generation. Key components include:
- Vision-Language Understanding Module (Qwen2VL)
- Enhanced FLUX backbone
- Multi-mode Generation Pipeline
- Structural Control Integration

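Putting these pieces together, the generation-time data flow looks roughly like the sketch below. Every name in it is illustrative rather than taken from the repository; treat it as a mental model, not the implementation (see the GitHub repository for the real code):

```python
import torch

def qwen2vl_flux_sketch(qwen2vl, connector, flux, vae, image, prompt):
    """Conceptual data flow only; names and signatures are illustrative."""
    # 1. Qwen2VL jointly encodes the reference image and the text prompt
    #    into vision-language hidden states.
    vl_states = qwen2vl(image=image, text=prompt)

    # 2. A learned projection maps those states into the conditioning
    #    space the FLUX backbone expects.
    condition = connector(vl_states)

    # 3. FLUX iteratively denoises random latents under that condition;
    #    structural hints (depth / line maps) would be injected here.
    latents = torch.randn(1, 16, 128, 128)  # illustrative latent shape
    for t in flux.timesteps:
        latents = flux.denoise(latents, t, condition)

    # 4. The VAE decodes the final latents into an image.
    return vae.decode(latents)
```
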
## Features

- **Enhanced Vision-Language Understanding**: Leverages Qwen2VL for superior multimodal comprehension
- **Multiple Generation Modes**: Supports variation, img2img, inpainting, and controlnet-guided generation
- **Structural Control**: Integrates depth estimation and line detection for precise structural guidance
- **Flexible Attention Mechanism**: Supports focused generation with spatial attention control
- **High-Resolution Output**: Supports multiple high-resolution aspect ratios, from 1024x1024 up to 1536x640 (see Technical Specifications)

## Generation Examples

### Image Variation
Create diverse variations while maintaining the essence of the original image:

<div align="center">
<table>
  <tr>
    <td><img src="variation_1.png" alt="Variation Example 1" width="256px"></td>
    <td><img src="variation_2.png" alt="Variation Example 2" width="256px"></td>
    <td><img src="variation_3.png" alt="Variation Example 3" width="256px"></td>
  </tr>
  <tr>
    <td><img src="variation_4.png" alt="Variation Example 4" width="256px"></td>
    <td><img src="variation_5.png" alt="Variation Example 5" width="256px"></td>
  </tr>
</table>
</div>

### Image Blending
Seamlessly blend multiple images with intelligent style transfer:

<div align="center">
<table>
  <tr>
    <td><img src="blend_1.png" alt="Blend Example 1" width="256px"></td>
    <td><img src="blend_2.png" alt="Blend Example 2" width="256px"></td>
    <td><img src="blend_3.png" alt="Blend Example 3" width="256px"></td>
  </tr>
  <tr>
    <td><img src="blend_4.png" alt="Blend Example 4" width="256px"></td>
    <td><img src="blend_5.png" alt="Blend Example 5" width="256px"></td>
    <td><img src="blend_6.png" alt="Blend Example 6" width="256px"></td>
  </tr>
  <tr>
    <td><img src="blend_7.png" alt="Blend Example 7" width="256px"></td>
  </tr>
</table>
</div>

### Text-Guided Image Blending
Control image generation with textual prompts:

<div align="center">
<table>
  <tr>
    <td><img src="textblend_1.png" alt="Text Blend Example 1" width="256px"></td>
    <td><img src="textblend_2.png" alt="Text Blend Example 2" width="256px"></td>
    <td><img src="textblend_3.png" alt="Text Blend Example 3" width="256px"></td>
  </tr>
  <tr>
    <td><img src="textblend_4.png" alt="Text Blend Example 4" width="256px"></td>
    <td><img src="textblend_5.png" alt="Text Blend Example 5" width="256px"></td>
    <td><img src="textblend_6.png" alt="Text Blend Example 6" width="256px"></td>
  </tr>
  <tr>
    <td><img src="textblend_7.png" alt="Text Blend Example 7" width="256px"></td>
    <td><img src="textblend_8.png" alt="Text Blend Example 8" width="256px"></td>
    <td><img src="textblend_9.png" alt="Text Blend Example 9" width="256px"></td>
  </tr>
</table>
</div>

### Grid-Based Style Transfer
Apply fine-grained style control with grid attention:

<div align="center">
<table>
  <tr>
    <td><img src="griddot_1.png" alt="Grid Example 1" width="256px"></td>
    <td><img src="griddot_2.png" alt="Grid Example 2" width="256px"></td>
    <td><img src="griddot_3.png" alt="Grid Example 3" width="256px"></td>
  </tr>
  <tr>
    <td><img src="griddot_4.png" alt="Grid Example 4" width="256px"></td>
    <td><img src="griddot_5.png" alt="Grid Example 5" width="256px"></td>
    <td><img src="griddot_6.png" alt="Grid Example 6" width="256px"></td>
  </tr>
  <tr>
    <td><img src="griddot_7.png" alt="Grid Example 7" width="256px"></td>
    <td><img src="griddot_8.png" alt="Grid Example 8" width="256px"></td>
    <td><img src="griddot_9.png" alt="Grid Example 9" width="256px"></td>
  </tr>
</table>
</div>

## Usage

The inference code is available in our [GitHub repository](https://github.com/erwold/qwen2vl-flux), which provides complete Python interfaces and examples.

### Installation

1. Clone the repository and install dependencies:
```bash
git clone https://github.com/erwold/qwen2vl-flux
cd qwen2vl-flux
pip install -r requirements.txt
```

2. Download model checkpoints from Hugging Face:
```python
from huggingface_hub import snapshot_download

snapshot_download("Djrango/Qwen2vl-Flux")
```

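By default, `snapshot_download` stores the files in the Hugging Face cache. To download them into an explicit directory instead (for example, to point the repository's loading code at a local path), pass the standard `local_dir` argument:

```python
from huggingface_hub import snapshot_download

# Download checkpoints into a known local directory instead of the HF cache.
snapshot_download("Djrango/Qwen2vl-Flux", local_dir="./checkpoints/qwen2vl-flux")
```
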
### Basic Examples

```python
from PIL import Image

from model import FluxModel

# Initialize model
model = FluxModel(device="cuda")

# Example inputs (PIL images assumed; adjust paths to your files)
input_image = Image.open("input.png")
source_image = Image.open("source.png")
reference_image = Image.open("reference.png")
content_image = Image.open("content.png")
style_image = Image.open("style.png")

# Image Variation
outputs = model.generate(
    input_image_a=input_image,
    prompt="Your text prompt",
    mode="variation"
)

# Image Blending
outputs = model.generate(
    input_image_a=source_image,
    input_image_b=reference_image,
    mode="img2img",
    denoise_strength=0.8
)

# Text-Guided Blending
outputs = model.generate(
    input_image_a=input_image,
    prompt="Transform into an oil painting style",
    mode="variation",
    guidance_scale=7.5
)

# Grid-Based Style Transfer
outputs = model.generate(
    input_image_a=content_image,
    input_image_b=style_image,
    mode="controlnet",
    line_mode=True,
    depth_mode=True
)
```

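The examples above suggest `generate` returns a list of images; assuming PIL outputs (an assumption, not confirmed by this card), they can be saved with standard PIL calls:

```python
# Save each generated image (assumes `outputs` is a list of PIL images).
for i, image in enumerate(outputs):
    image.save(f"output_{i}.png")
```
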
## Technical Specifications

- **Framework**: PyTorch 2.4.1+
- **Base Models**:
  - FLUX.1-dev
  - Qwen2-VL-7B-Instruct
- **Memory Requirements**: 48GB+ VRAM
- **Supported Image Sizes**:
  - 1024x1024 (1:1)
  - 1344x768 (16:9)
  - 768x1344 (9:16)
  - 1536x640 (2.4:1)
  - 896x1152 (3:4)
  - 1152x896 (4:3)

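When a source image does not match one of these sizes exactly, a small helper like the following (hypothetical, not part of the repository) picks the closest supported size by aspect ratio:

```python
# Hypothetical helper: choose the supported generation size whose aspect
# ratio is closest to the input's. Sizes mirror the list above.
SUPPORTED_SIZES = [
    (1024, 1024), (1344, 768), (768, 1344),
    (1536, 640), (896, 1152), (1152, 896),
]

def closest_supported_size(width: int, height: int) -> tuple[int, int]:
    target = width / height
    return min(SUPPORTED_SIZES, key=lambda wh: abs(wh[0] / wh[1] - target))

print(closest_supported_size(1920, 1080))  # -> (1344, 768)
```
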
## Citation

```bibtex
@misc{erwold-2024-qwen2vl-flux,
  title={Qwen2VL-Flux: Unifying Image and Text Guidance for Controllable Image Generation},
  author={Pengqi Lu},
  year={2024},
  url={https://github.com/erwold/qwen2vl-flux}
}
```

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

## Acknowledgments

- Based on the FLUX architecture
- Integrates Qwen2VL for vision-language understanding
- Thanks to the open-source communities of FLUX and Qwen