Update README.md
README.md CHANGED
@@ -88,6 +88,24 @@ import torch
 import numpy as np
 import cv2
 
+def HWC3(x):
+    assert x.dtype == np.uint8
+    if x.ndim == 2:
+        x = x[:, :, None]
+    assert x.ndim == 3
+    H, W, C = x.shape
+    assert C == 1 or C == 3 or C == 4
+    if C == 3:
+        return x
+    if C == 1:
+        return np.concatenate([x, x, x], axis=2)
+    if C == 4:
+        color = x[:, :, 0:3].astype(np.float32)
+        alpha = x[:, :, 3:4].astype(np.float32) / 255.0
+        y = color * alpha + 255.0 * (1.0 - alpha)
+        y = y.clip(0, 255).astype(np.uint8)
+        return y
+
 controlnet_conditioning_scale = 1.0
 prompt = "your prompt, the longer the better, you can describe it as detail as possible"
 negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality'
@@ -143,7 +161,10 @@ images[0].save(f"your image save path, png format is usually better than jpg or
 
 ## Training Details
 
-The model is trained using high quality data, only 1 stage training
+The model is trained in a single stage on high-quality data, at the same resolution as sdxl-base, 1024*1024. Following Lvmin Zhang's ControlNet recipe, we generate the Canny conditioning images with random thresholds; it is essential to find proper hyperparameters
+for this augmentation, since settings that make the task too easy or too hard hurt model performance. In addition, we randomly mask out a random percentage of each Canny image to force the model to learn more of the semantic relationship between the prompt and the lines.
+We use over 10,000,000 carefully annotated images; CogVLM has proven to be a powerful image captioning model [https://github.com/THUDM/CogVLM?tab=readme-ov-file]. For comic images, it is recommended to use the waifu tagger [https://huggingface.co/spaces/SmilingWolf/wd-tagger] to generate special tags.
+More than 64 A100s were used to train the model, and the effective batch size is 2560 when accumulate_grad_batches is used.
 
 
 ### Training Data
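The `HWC3` helper added above normalizes any `uint8` image to 3-channel RGB: grayscale input is replicated across channels, and RGBA input is composited over a white background. A minimal usage sketch, assuming an input file name and Canny thresholds that are placeholders rather than values from this commit:

```python
import numpy as np
import cv2
from PIL import Image

# Read an image in whatever channel layout it has (grayscale, BGR, or BGRA).
img = cv2.imread("input.png", cv2.IMREAD_UNCHANGED)  # placeholder path
img = HWC3(img)  # now guaranteed H x W x 3, uint8

# cv2.Canny returns a single-channel edge map; passing it back through HWC3
# replicates it to the 3-channel layout the ControlNet pipeline expects.
edges = cv2.Canny(img, 100, 200)  # placeholder thresholds
control_image = Image.fromarray(HWC3(edges))
```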
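The training description mentions two augmentations on the conditioning side: Canny maps generated with random thresholds, and random masking of parts of the map. A rough sketch of what such a step could look like, where the threshold ranges, rectangle counts, and sizes are illustrative assumptions, not values from this commit:

```python
import random
import numpy as np
import cv2

def augment_canny(img):
    """img: H x W x 3 uint8 image -> randomly thresholded, randomly masked Canny map."""
    # Random thresholds vary edge density, so the model sees both sparse and
    # dense line art for the same kind of content (assumed ranges).
    low = random.randint(50, 200)
    high = random.randint(low + 1, 350)
    edges = cv2.Canny(img, low, high)

    # Randomly mask out rectangles so the model must fall back on the prompt
    # where the line art is missing (assumed count and size ranges).
    h, w = edges.shape
    for _ in range(random.randint(0, 5)):
        mh = random.randint(h // 8, h // 2)
        mw = random.randint(w // 8, w // 2)
        y = random.randint(0, h - mh)
        x = random.randint(0, w - mw)
        edges[y:y + mh, x:x + mw] = 0

    return HWC3(edges)
```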
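On the reported batch size: with 64 GPUs, an effective batch of 2560 means 40 samples per GPU per optimizer step, split between the per-GPU micro-batch and accumulate_grad_batches. The split below is one assumed combination, not the configuration from this commit:

```python
num_gpus = 64
per_gpu_batch_size = 8         # assumed micro-batch size
accumulate_grad_batches = 5    # assumed accumulation steps
effective_batch_size = num_gpus * per_gpu_batch_size * accumulate_grad_batches
assert effective_batch_size == 2560  # the value reported in Training Details
```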