About training code

#8
by CaptainZZZ - opened

Hi,
Thanks for the excellent work! Have you considered making the training code open source?

alimama-creative org

Sorry, we have no plans to open-source the training code, but most of it is based on small modifications to the SD3 DreamBooth LoRA training script in the diffusers library. The script can be found here: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sd3.py

Thanks a lot for the response!
I'm also curious: how did you generate the "controlnet-inpainting" dataset from the 12M LAION-2B images?

alimama-creative org

It is obtained by filtering with high thresholds for resolution, aesthetic score, and CLIP score, and the masks are randomly generated.
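
(For illustration, here is a rough sketch of what that kind of filtering and random mask generation could look like. The metadata column names and threshold values are placeholder assumptions, not the exact values used.)

import numpy as np
import pandas as pd

# Hypothetical LAION metadata columns and thresholds -- placeholders for illustration.
meta = pd.read_parquet("laion2b_subset_metadata.parquet")
keep = meta[
    (meta["width"] >= 1024)
    & (meta["height"] >= 1024)
    & (meta["aesthetic_score"] >= 6.0)
    & (meta["clip_similarity"] >= 0.28)
]

def random_rect_mask(h, w, rng=np.random):
    # Random rectangular mask: 1 inside the region to inpaint, 0 elsewhere.
    mask = np.zeros((h, w), dtype=np.float32)
    mh = rng.randint(h // 4, h // 2 + 1)
    mw = rng.randint(w // 4, w // 2 + 1)
    top = rng.randint(0, h - mh + 1)
    left = rng.randint(0, w - mw + 1)
    mask[top:top + mh, left:left + mw] = 1.0
    return mask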

Thanks so much for the reply!

Hi,
I have another question: what devices (and how many) did you use for training, and how many days did training take?
Thanks a lot for the reply!

alimama-creative org

16 x A100 used for a week.

How long does it take to get a decent result? Maybe 1-3 days for the model to roughly converge? @ljp

Thanks so much for the reply! Another question: what high resolution threshold did you set?
For example, above 512x512?

alimama-creative org

All images are cropped to 1024x1024.

Sorry, maybe my question wasn't clear. I wanted to ask whether you selected high-resolution images from LAION and then resized them to 1024x1024, as I found that LAION contains a lot of low-resolution images.
I tried training from scratch with ~50k LAION images; after training, the model's output is semantically correct, but it has a lot of noise and looks fuzzy. I think it may be related to training with low-resolution images?
Once again, thanks so much for the reply and guidance!

alimama-creative org

Select high-resolution images and then crop and resize them to 1024x1024.
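
(As a minimal sketch of that preprocessing, assuming PIL: center-crop to a square, then resize to 1024x1024. The function below is illustrative, not the exact code used.)

from PIL import Image

def center_crop_resize(path, size=1024):
    # Center-crop the image to a square, then resize to size x size.
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.LANCZOS)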

Can you demonstrate some of your experimental results? I feel like I've encountered issues similar to yours, and I'm not sure whether it's a problem with SD3 ControlNet itself.

Thanks!

Sure.
I trained with nearly 30k LAION images; the ControlNet has 23 layers. The command line is here:

accelerate launch train_controlnet_sd3.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --train_data_dir="../CustomDataset" \
  --resolution=1024 \
  --learning_rate=1e-5 \
  --mixed_precision="fp16" \
  --max_train_steps=100000 \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --dataloader_num_workers=4 \
  --gradient_accumulation_steps=4 \
  --checkpointing_steps=2000 \
  --report_to=tensorboard \
  --logging_dir="./tensorboard_log" \
  --resume_from_checkpoint=latest

I only have one A6000 48GB, so I set the batch size to 1.
The model I tested was the checkpoint after 10k training steps, but it seems strange that after 5k steps the training loss oscillates and does not converge.
Here is a result; as I explained above, the output is low quality and looks blurry and blocky. What about your results?
Prompt: A cat is sitting next to a puppy.

[image: 1.png]

I would be grateful if the author could give us some suggestions. @ljp

Yeah, I encounter exactly the same issue as you: blocky lines in the background. I can find them in every single image I generate, so I really don't know why.

Hi,
Could you please share your training details (batch size, accelerate command line, training dataset, and number of training images)?
Also, which checkpoint step did you test, and had the model converged at that point?

May I have your email address or WeChat for further discussion?

email

Hi author,
May I ask whether the base SD3 model you used was the fp16 or the fp32 checkpoint? I trained a ControlNet with the SD3 fp16 checkpoint and found that the results were really bad, with images like the ones I mentioned above.
Looking forward to your reply. Thanks so much!

+1, I have the same issue with SD3 + ControlNet for background generation. But switching to SD3 with a 33-channel transformer, the results can be good. I guess ControlNet is not yet well supported for SD3.

Could you please explain what "SD3 (with a 33-channel transformer)" means? Thanks!

You say that your batch_size is 192, so each A100 would need a batch size of 12, but I can only fit a batch size of 6 on an A100 at 1024x1024 resolution with 23 transformer layers.
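
(Just as a sanity check on the arithmetic: the effective batch size is per-GPU batch x number of GPUs x gradient accumulation steps, so 192 does not necessarily require 12 samples per GPU. The accumulation value below is only a hypothetical example, not the authors' confirmed setup.)

def effective_batch_size(per_gpu_batch, num_gpus, grad_accum_steps):
    return per_gpu_batch * num_gpus * grad_accum_steps

print(effective_batch_size(12, 16, 1))  # 192 with 12 per GPU and no accumulation
print(effective_batch_size(6, 16, 2))   # 192 with 6 per GPU and 2 accumulation steps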

Like the SDXL inpainting model, which uses 4+4+1 channels: 4 for the noise latent, 4 for the masked-image latent, and 1 for the 0-1 mask tensor. For SD3 the latent has 16 channels, so it is 16+16+1 = 33 channels.
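
(To make the channel layout concrete, here is a rough PyTorch sketch of how such a 33-channel input could be assembled for SD3. The tensor names and shapes are illustrative assumptions, not code from any specific repo.)

import torch
import torch.nn.functional as F

# noisy_latents:        (B, 16, H/8, W/8)  noise latent from the diffusion process
# masked_image_latents: (B, 16, H/8, W/8)  VAE encoding of image * (1 - mask)
# mask:                 (B, 1, H, W)       0-1 mask in pixel space
def build_inpaint_input(noisy_latents, masked_image_latents, mask):
    # Downsample the mask to the latent resolution.
    mask_latent = F.interpolate(mask, size=noisy_latents.shape[-2:], mode="nearest")
    # Concatenate along the channel dim: 16 + 16 + 1 = 33 channels.
    return torch.cat([noisy_latents, masked_image_latents, mask_latent], dim=1)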

Could you please contact me at xduzhangjiayu@163.com? We can discuss SD3 ControlNet training. Thanks!

Same for me, only a batch size of 6 on one A100 80GB.

Maybe you can try DeepSpeed to reduce the VRAM usage.
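
(For reference, one way to enable DeepSpeed ZeRO through accelerate inside a training script is via DeepSpeedPlugin. The stage and offload settings below are illustrative only, and the actual memory savings depend on the model and setup.)

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Illustrative settings only -- adjust to your hardware.
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # shard optimizer states and gradients across GPUs
    gradient_accumulation_steps=4,
    offload_optimizer_device="cpu",  # optionally offload optimizer states to CPU
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=ds_plugin)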

Hi, have you solved the issue? I would be grateful if you could give some advice.
@AppleYang

No, the results have the same issue, like this:
[image: image.png]
[image: image.png]

@AppleYang
Exactly the same issue. May I have your e-mail for further discussion? Or you can contact me at xduzhangjiayu@163.com, thanks!

Hi, I'm also trying to train a ControlNet for inpainting. The system works pretty well, but I can't keep the areas outside the mask identical to the input image. Can you give me some references on the loss you used?
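
(Not an answer about the loss, but one common workaround, independent of how this particular model was trained, is to composite the unmasked region back from the original image after generation. A purely illustrative sketch:)

import numpy as np
from PIL import Image

def paste_back(original, generated, mask):
    # Keep pixels outside the mask from the original image; take masked pixels
    # from the generated image. original/generated: same-size PIL images;
    # mask: float array in [0, 1], where 1 marks the inpainted region.
    orig = np.asarray(original).astype(np.float32)
    gen = np.asarray(generated).astype(np.float32)
    m = mask[..., None]  # broadcast over the RGB channels
    out = orig * (1.0 - m) + gen * m
    return Image.fromarray(out.astype(np.uint8))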

Did you train an SD3 ControlNet for the inpainting task without running into the issue above? If so, could you share some code? I'm curious how you solved it.
