|
# 1to2: Training Multiple-Subject Models using only Single-Subject Data (Experimental) |
|
|
|
Updates will be mirrored on both Hugging Face and Civitai. |
|
|
|
## Introduction |
|
|
|
[It has been shown that multiple characters can be trained into a single model](https://civitai.com/models/23476/the-idolmster-cinderella-girls-starlight-stage-style-90-characters). A harder task is to create a model that can generate several of those characters in the same image without modifying the generation pipeline. This document describes a simple technique that has been shown to help with that task.
|
|
|
## Method |
|
|
|
```
Requirement: Sets of single-character images

Steps:
1. Train a multi-concept model using the original dataset
2. Create an augmentation dataset of joined image pairs from the original dataset
3. Train on the augmentation dataset
```
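Step 2 could be sketched roughly as follows. This is a minimal illustration, not the script used for the experiment: the function names, the choice to pair only images of different characters, the side-by-side layout, and the 512-pixel height are all assumptions. Joining the images relies on Pillow, which is imported lazily so that the pairing logic runs without it.

```python
from itertools import combinations


def make_pairs(images_by_character):
    """Pair every image of one character with every image of another character.

    images_by_character maps a character name to a list of image paths.
    Returns (left_char, left_path, right_char, right_path) tuples.
    Assumption: only images of *different* characters are paired.
    """
    pairs = []
    for left_char, right_char in combinations(sorted(images_by_character), 2):
        for left_path in images_by_character[left_char]:
            for right_path in images_by_character[right_char]:
                pairs.append((left_char, left_path, right_char, right_path))
    return pairs


def join_side_by_side(left_path, right_path, out_path, height=512):
    """Scale both images to a common height and paste them next to each other."""
    from PIL import Image  # Pillow, imported lazily

    left = Image.open(left_path).convert("RGB")
    right = Image.open(right_path).convert("RGB")
    # Resize to the target height while preserving each aspect ratio.
    left = left.resize((max(1, round(left.width * height / left.height)), height))
    right = right.resize((max(1, round(right.width * height / right.height)), height))
    canvas = Image.new("RGB", (left.width + right.width, height))
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    canvas.save(out_path)
```

Because `combinations` walks the sorted character names, each character pair appears in one left/right order only. Swapping or randomising the sides would be an easy variation if positional bias turns out to matter.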
|
|
|
## Experiment |
|
|
|
|
|
### Setup |
|
|
|
Three characters from the game Cinderella Girls were chosen for the experiment. The base model is `animefull-final-pruned`. The base model was checked beforehand and found to have minimal knowledge of the chosen characters.
|
|
|
For the captions of the joined images, the template format `CharLeft/CharRight/COMPOSITE, TagsLeft, TagsRight` is used. |
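As a rough illustration of the template, the helper below builds a caption string for one joined image. It assumes the slash-separated prefix is a single leading tag followed by the comma-separated tags of the left and right halves. The function name and the example character names are hypothetical.

```python
def build_caption(char_left, char_right, tags_left, tags_right):
    """Build a caption of the form `CharLeft/CharRight/COMPOSITE, TagsLeft, TagsRight`."""
    # Assumption: the slash-joined prefix is emitted as one leading tag.
    head = f"{char_left}/{char_right}/COMPOSITE"
    # Tags of the left image come first, then tags of the right image.
    return ", ".join([head, *tags_left, *tags_right])


caption = build_caption("charA", "charB", ["smile", "long hair"], ["short hair"])
print(caption)
```

With this layout, `COMPOSITE` can be placed in the negative prompt at generation time to discourage visibly stitched outputs.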
|
|
|
A LoHa network (LoRA with a Hadamard product representation, `algo=loha`) is trained using the config file below:
|
```
[model_arguments]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "Animefull-final-pruned.ckpt"

[additional_network_arguments]
no_metadata = false
unet_lr = 0.0005
text_encoder_lr = 0.0005
network_module = "lycoris.kohya"
network_dim = 8
network_alpha = 1
network_args = [ "conv_dim=0", "conv_alpha=16", "algo=loha",]
network_train_unet_only = false
network_train_text_encoder_only = false

[optimizer_arguments]
optimizer_type = "AdamW8bit"
learning_rate = 0.0005
max_grad_norm = 1.0
lr_scheduler = "cosine"
lr_warmup_steps = 0

[dataset_arguments]
debug_dataset = false
# keep token 1

[training_arguments]
output_name = "cg3comp"
save_precision = "fp16"
save_every_n_epochs = 1
train_batch_size = 2
max_token_length = 225
mem_eff_attn = false
xformers = true
max_train_epochs = 40
max_data_loader_n_workers = 8
persistent_data_loader_workers = true
gradient_checkpointing = false
gradient_accumulation_steps = 1
mixed_precision = "fp16"
clip_skip = 2
lowram = true

[sample_prompt_arguments]
sample_every_n_epochs = 1
sample_sampler = "k_euler_a"

[saving_arguments]
save_model_as = "safetensors"
```
|
For the second stage of training, the batch size was reduced to 2 while keeping other settings identical. |
|
The training took less than 2 hours on a T4 GPU. |
|
|
|
### Results |
|
(see preview images) |
|
|
|
## Limitations |
|
* This technique doubles the memory/compute requirement |
|
* Composites can still be generated despite negative prompting |
|
* Cloned characters seem to become the primary failure mode in place of blended characters |
|
|
|
## Related Works |
|
|
|
Models trained on datasets based on anime shows have [demonstrated](https://civitai.com/models/21305/) multi-subject capability.
|
Simply using sufficiently distant concepts, such as `1girl, 1boy`, [has also been shown to be effective](https://civitai.com/models/17640/).
|
|
|
## Future Work
|
|
|
Below is a list of ideas yet to be explored:
|
* Synthetic datasets |
|
* Regularization
|
* Joint training instead of sequential