# SD-Latent-Interposer
A small neural network to provide interoperability between the latents generated by the different Stable Diffusion models.
I wanted to see if it was possible to pass latents generated by the new SDXL model directly into SDv1.5 models without decoding and re-encoding them using a VAE first.
## Installation
To install it, simply clone this repo to your custom_nodes folder using the following command:
```
git clone https://github.com/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer
```
Alternatively, you can download the [comfy_latent_interposer.py](https://github.com/city96/SD-Latent-Interposer/raw/main/comfy_latent_interposer.py) file to your `ComfyUI/custom_nodes` folder. You may need to install the `huggingface_hub` package inside your venv: `pip install huggingface-hub`.
If you need the model weights for something else, they are [hosted on HF](https://huggingface.co/city96/SD-Latent-Interposer/tree/main) under the same Apache 2.0 license as the rest of the repo. The current files are in the **"v4.0"** subfolder.
## Usage
Simply place it where you would normally place a VAE decode followed by a VAE encode. Set the denoise as appropriate to hide any artifacts while keeping the composition intact. See the image below.
![LATENT_INTERPOSER_V3 1_TEST](https://github.com/city96/SD-Latent-Interposer/assets/125218114/849574b4-2565-4090-85d3-ae63ab425ee2)
Without the interposer, the two latent spaces are incompatible:
![LATENT_INTERPOSER_V3 1](https://github.com/city96/SD-Latent-Interposer/assets/125218114/13e2c01f-580e-4ecb-af1f-b6b21699127b)
### Local models
The node pulls the required files from Hugging Face Hub by default. If you have a flaky connection or prefer to work completely offline, you can create a `models` folder and place the model files there; the custom node prefers local files over HF when they are available. The path should be: `ComfyUI/custom_nodes/SD-Latent-Interposer/models`
Alternatively, just clone the entire HF repo to it:
```
git clone https://huggingface.co/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer/models
```
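For reference, here is a minimal sketch of the local-first lookup described above. The helper name and the way the filename is passed in are illustrative assumptions, not the node's actual API:
```python
import os
from huggingface_hub import hf_hub_download

def resolve_weights(filename: str) -> str:
    """Return a path to the weights, preferring a local copy over the HF Hub."""
    local_path = os.path.join(os.path.dirname(__file__), "models", filename)
    if os.path.isfile(local_path):
        return local_path
    # Fall back to pulling the file from the Hugging Face repo.
    return hf_hub_download(repo_id="city96/SD-Latent-Interposer", filename=filename)
```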
### Supported Models
Model names:
| code | name |
| ---- | -------------------------- |
| `v1` | Stable Diffusion v1.x |
| `xl` | SDXL |
| `v3` | Stable Diffusion 3 |
| `ca` | Stable Cascade (Stage A/B) |
Available models:
| From | to `v1` | to `xl` | to `v3` | to `ca` |
|:----:|:-------:|:-------:|:-------:|:-------:|
| `v1` | - | v4.0 | v4.0 | No |
| `xl` | v4.0 | - | v4.0 | No |
| `v3` | v4.0 | v4.0 | - | No |
| `ca` | v4.0 | v4.0 | v4.0 | - |
## Training
The training code initializes most training parameters from the provided config file. The dataset should be one `.bin` file per latent version, saved with `torch.save`. The tensor format is `[batch, channels, height, width]`, with "batch" spanning the entire dataset (e.g. 88,000 latents). A sketch of producing such a file is shown below.
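A minimal sketch of building one of these dataset files, assuming you already have a list of VAE-encoded latents (the shapes and filename here are illustrative placeholders):
```python
import torch

# Stack pre-encoded latents into one [batch, channels, height, width] tensor
# and save the whole dataset as a single .bin file, one per latent version.
latents = [torch.randn(4, 128, 128) for _ in range(16)]  # placeholder for real VAE-encoded latents
dataset = torch.stack(latents)  # shape: [16, 4, 128, 128]
torch.save(dataset, "v1_latents.bin")
```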
### Interposer v4.0
The training code currently initializes two copies of the model: a primary one in the target direction and a secondary one in the opposite direction. The losses are defined in terms of this pair (see the sketch after this list).
- `p_loss` is the main criterion for the primary model.
- `b_loss` is the main criterion for the secondary one.
- `r_loss` passes the output of the primary model back through the secondary model and compares it against the source latent (basically a round trip through the two models).
- `h_loss` is the same as `r_loss` but for the secondary model.
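A minimal PyTorch sketch of how these four terms might be computed, assuming `primary` maps source latents to target latents, `secondary` maps the other way, and an L1 criterion (the criterion choice and any loss weighting are assumptions, not confirmed by the repo):
```python
import torch.nn.functional as F

def interposer_losses(primary, secondary, src, tgt):
    """Compute the four v4.0 loss terms for one batch of paired latents."""
    p_out = primary(src)    # source -> target
    b_out = secondary(tgt)  # target -> source
    p_loss = F.l1_loss(p_out, tgt)             # main criterion, primary model
    b_loss = F.l1_loss(b_out, src)             # main criterion, secondary model
    r_loss = F.l1_loss(secondary(p_out), src)  # round trip: src -> tgt -> src
    h_loss = F.l1_loss(primary(b_out), tgt)    # round trip: tgt -> src -> tgt
    return p_loss, b_loss, r_loss, h_loss
```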
All models were trained for 50,000 steps with a batch size of either 128 (v1/xl) or 48 (cascade).
The training was done locally on an RTX 3080 and a Tesla V100S.
![LATENT_INTERPOSER_V4_LOSS](https://github.com/city96/SD-Latent-Interposer/assets/125218114/3a0d8920-ed48-42f0-96c9-897263525efb)
### Older versions
<details><summary>Interposer v3.1</summary>
### Interposer v3.1
This is basically a complete rewrite. The mediocre pile of conv2d layers has been replaced with something that looks more like a proper neural network. Still no VGG loss, because I still don't have a better GPU.
Training was done on the combined Flickr2K + DIV2K datasets, with each image processed into six 1024x1024 segments, padded out with some of my random images for a total of 22,000 source images.
I think I got rid of most of the XL artifacts, but the color/hue/saturation shift issues are still there. I actually saved the optimizer state this time, so I might be able to do 100K steps with visual loss on my P40s. Hopefully they won't burn up.
v3.0 was trained for 500K steps at a constant LR of 1e-4; v3.1 was trained for 1M steps using CosineAnnealingLR to drop the learning rate towards the end. Both used AdamW.
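A minimal sketch of that v3.1 schedule, assuming the scheduler is stepped once per training step (the model here is just a placeholder for the interposer network):
```python
import torch

model = torch.nn.Conv2d(4, 4, 1)  # placeholder for the actual interposer network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Cosine annealing over the full 1M-step run, decaying the LR towards the end.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)
```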
![INTERPOSER_V3 1](https://github.com/city96/SD-Latent-Interposer/assets/125218114/daff0ae2-4739-4cef-ba54-ac1d156d3388)
</details>
<details><summary>Interposer v1.1</summary>
### Interposer v1.1
This is the second release using the "spaceship" architecture. It was trained on the Flickr2K dataset and was continued from the v1.0 checkpoint.
Overall, it seems to perform a lot better, especially on real-life photos. I also investigated the odd v1->xl artifacts, but in the end they seem [inherent to the VAE decoder stage](https://github.com/comfyanonymous/ComfyUI/issues/1116).
![loss](https://github.com/city96/SD-Latent-Interposer/assets/125218114/e890420f-cebd-4f88-b243-62560b8384e5)
</details>
<details><summary>Interposer v1.0</summary>
### Interposer v1.0
Not sure why the training loss is so different; it might be due to the """highly curated""" dataset of 1,000 random images from my Downloads folder that I used to train it.
I probably should've just grabbed LAION.
I also trained a v1-to-v2 model, before realizing v1 and v2 share the same latent space. Oh well.
![loss](https://github.com/city96/SD-Latent-Interposer/assets/125218114/f92c399b-a823-4521-b09b-8bdc3795f1ea)
![xl-to-v1_interposer](https://github.com/city96/SD-Latent-Interposer/assets/125218114/0d963bc5-570f-4ebe-95db-16e261f05e48)
</details>