
Training details

#12 opened by ywlee88

Hi,

I'm impressed by your amazing work.

Could you share the training details (e.g., batch size, learning rate, scheduler) for the lcm-lora-sdxl model?

When I tried to train an lcm-lora-sdxl model with the official diffusers training script, the intermediate validation images were not as good as yours.

Thanks in advance.

Latent Consistency org

Did you try the exact same training setup? Dataset, hyperparameters, etc?

@sayakpaul

Thank you for your quick response.

Yes, except for the training data.

I used a subset of the LAION-Aesthetics dataset (11K text-image pairs) provided by BK-SDM.

Below are the validation images generated at 700 iterations.

[image: validation samples at 700 iterations]

These are the hyperparameters (note: --report_to=wandb appeared twice in my original command; deduplicated here):

--train_data_dir=./data/laion_aes/preprocessed_11k \
--pretrained_teacher_model=stabilityai/stable-diffusion-xl-base-1.0 \
--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
--output_dir=./results/TOY_LCM_LORA_LAION/lcm_lora_sdxl_base_24x1x1_lr_1e-4 \
--tracker_project_name=TOY_LCM_LORA_LAION \
--tracker_output_name=lcm_lora_sdxl_base_24x1x1_lr_1e-4 \
--mixed_precision=fp16 \
--resolution=1024 \
--train_batch_size=24 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--lora_rank=64 \
--learning_rate=1e-4 \
--lr_scheduler=constant \
--lr_warmup_steps=0 \
--max_train_steps=100000 \
--checkpointing_steps=2000 \
--validation_steps=20 \
--seed=0 \
--report_to=wandb
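For reference, this is a minimal sketch of how I sanity-check a distilled LoRA at inference time, assuming the weights were exported with save_lora_weights into the directory shown (the checkpoint path and output filename are hypothetical; adjust to your run):

import torch
from diffusers import AutoencoderKL, AutoPipelineForText2Image, LCMScheduler

# Same fp16-fixed VAE that the training run used.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# Hypothetical checkpoint directory; assumes the LoRA weights were saved
# here in a load_lora_weights-compatible layout.
pipe.load_lora_weights("./results/TOY_LCM_LORA_LAION/lcm_lora_sdxl_base_24x1x1_lr_1e-4/checkpoint-2000")
pipe.fuse_lora()

image = pipe(
    prompt="a photo of an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("validation_step_2000.png")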

I have another question.

Could you let me know what data was used to train the lcm-lora-ssd-1b and lcm-lora-sdxl models?

When I generated some samples, the results from lcm-lora-ssd-1b showed better quality than those from lcm-lora-sdxl.

I wonder if this difference in generation quality is caused by differences in the data used for training.

For the sake of the community, it would be very helpful if you could share the training details of the lcm-lora-sdxl and lcm-lora-ssd-1b models.

In my case, I'm trying to create an LCM-LoRA version of the KOALA model, which is a lightweight T2I model like SSD-1B.

Thanks in advance.

[image: sample comparison between lcm-lora-ssd-1b and lcm-lora-sdxl]

For lcm-lora-ssd-1b:

import torch
from diffusers import AutoPipelineForText2Image, LCMScheduler

model_id = "segmind/SSD-1B"
adapter_id = "latent-consistency/lcm-lora-ssd-1b"

pipe = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# Load and fuse the LCM LoRA weights into the base model.
pipe.load_lora_weights(adapter_id)
pipe.fuse_lora()

prompt = "Portrait photo of a standing girl, photograph, golden hair, depth of field, moody light, golden hour, centered, extremely detailed, award winning photography, realistic."
image = pipe(prompt=prompt, num_inference_steps=4, guidance_scale=1).images[0]

For lcm-lora-sdxl:

# Note: unlike the SSD-1B pipeline above, no torch_dtype is passed here,
# so this pipeline loads in float32.
pipe2 = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipe2.scheduler = LCMScheduler.from_config(pipe2.scheduler.config)
pipe2.to("cuda:4")

# Load and fuse the LCM LoRA weights into the base model.
pipe2.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe2.fuse_lora()

prompt = "Portrait photo of a standing girl, photograph, golden hair, depth of field, moody light, golden hour, centered, extremely detailed, award winning photography, realistic."
image2 = pipe2(prompt=prompt, num_inference_steps=4, guidance_scale=1).images[0]
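One detail worth noting when comparing these two snippets: the SSD-1B pipeline is loaded in float16 while the SDXL pipeline is loaded in float32 (no torch_dtype), so the quality comparison also mixes precision settings. Loading both pipelines the same way would make the comparison cleaner.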

@sayakpaul


In your example generated images, how many inference steps were used?

Latent Consistency org

I didn’t work on the example, so I don’t have the exact details, but the dataset would definitely impact the quality here, as would the length of the training schedule.

Cc: @pcuenq if any details pop up on the number of steps.
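For anyone else trying to pin down how the step count affects LCM-LoRA output, a minimal sketch that sweeps num_inference_steps with a fixed seed (model and adapter ids as in the snippets above; the output filenames are illustrative):

import torch
from diffusers import AutoPipelineForText2Image, LCMScheduler

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.fuse_lora()

prompt = "Portrait photo of a standing girl, golden hair, golden hour, extremely detailed."
# LCM-LoRA typically operates in the 2-8 step range; fixing the seed means
# only the step count changes between images.
for steps in (2, 4, 8):
    generator = torch.Generator("cuda").manual_seed(0)
    image = pipe(prompt=prompt, num_inference_steps=steps, guidance_scale=1.0, generator=generator).images[0]
    image.save(f"lcm_sdxl_{steps}steps.png")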
