# SD-Latent-Interposer
A small neural network to provide interoperability between the latents generated by the different Stable Diffusion models.
I wanted to see if it was possible to pass latents generated by the new SDXL model directly into SDv1.5 models without decoding and re-encoding them using a VAE first.
## Installation
To install it, simply clone this repo to your custom_nodes folder using the following command:
```
git clone https://github.com/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer
```
Alternatively, you can download the [comfy_latent_interposer.py](https://github.com/city96/SD-Latent-Interposer/raw/main/comfy_latent_interposer.py) file to your `ComfyUI/custom_nodes` folder instead. You may need to install huggingface-hub by running `pip install huggingface-hub` inside your venv.
If you need the model weights for something else, they are [hosted on HF](https://huggingface.co/city96/SD-Latent-Interposer/tree/main) under the same Apache2 license as the rest of the repo. The current files are in the **"v4.0"** subfolder.
## Usage
Simply place it where you would normally place a VAE decode followed by a VAE encode. Set the denoise as appropriate to hide any artifacts while keeping the composition. See the image below.
![LATENT_INTERPOSER_V3 1_TEST](https://github.com/city96/SD-Latent-Interposer/assets/125218114/849574b4-2565-4090-85d3-ae63ab425ee2)
Without the interposer, the two latent spaces are incompatible:
![LATENT_INTERPOSER_V3 1](https://github.com/city96/SD-Latent-Interposer/assets/125218114/13e2c01f-580e-4ecb-af1f-b6b21699127b)
### Local models
The node pulls the required files from Hugging Face Hub by default. If you have a flaky connection or prefer to run completely offline, create a `models` folder at `ComfyUI/custom_nodes/SD-Latent-Interposer/models` and place the model files there; the custom node will prefer local files over the Hub when available.
Alternatively, just clone the entire HF repo to it:
```
git clone https://huggingface.co/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer/models
```
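If you prefer Python over git, a rough equivalent (assuming `huggingface-hub` is installed in your venv) is to fetch the same repo with `snapshot_download`:
```
# Rough equivalent of the git clone above, using huggingface_hub.
# The target path matches the local models folder described earlier.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="city96/SD-Latent-Interposer",
    local_dir="custom_nodes/SD-Latent-Interposer/models",
)
```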
### Supported Models
Model names:
| code | name |
| ---- | -------------------------- |
| `v1` | Stable Diffusion v1.x |
| `xl` | SDXL |
| `v3` | Stable Diffusion 3 |
| `ca` | Stable Cascade (Stage A/B) |
Available models:
| From | to `v1` | to `xl` | to `v3` | to `ca` |
|:----:|:-------:|:-------:|:-------:|:-------:|
| `v1` | - | v4.0 | v4.0 | No |
| `xl` | v4.0 | - | v4.0 | No |
| `v3` | v4.0 | v4.0 | - | No |
| `ca` | v4.0 | v4.0 | v4.0 | - |
## Training
The training code initializes most training parameters from the provided config file. The dataset should be a single .bin file saved with `torch.save` for each latent version. The format should be `[batch, channels, height, width]`, with the batch dimension spanning the entire dataset (i.e. 88,000).
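As a rough illustration of that layout (the shapes and file name below are examples only, not the actual training data):
```
# Illustrative only: random tensors standing in for real VAE-encoded latents.
# For SDv1 latents of 512x512 images each sample is [4, 64, 64], so the full
# dataset file would be [88000, 4, 64, 64].
import torch

dataset = torch.randn(100, 4, 64, 64)   # [batch, channels, height, width]
torch.save(dataset, "latents_v1.bin")   # one .bin file per latent version

latents = torch.load("latents_v1.bin")  # loaded back the same way for training
print(latents.shape)                    # torch.Size([100, 4, 64, 64])
```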
### Interposer v4.0
The training code currently initializes two copies of the model, one for the target direction and one for the reverse. The losses are defined in terms of these two models (see the sketch after the list below).
- `p_loss` is the main criterion for the primary model.
- `b_loss` is the main criterion for the secondary one.
- `r_loss` passes the output of the primary model back through the secondary model and checks it against the source latent (basically a round trip through the two models).
- `h_loss` is the same as `r_loss` but for the secondary model.
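A minimal sketch of how these four terms relate, assuming an L1 criterion and calling the two copies `primary` (source → target) and `secondary` (target → source); the actual training code may choose and weight the criteria differently:
```
import torch.nn as nn

criterion = nn.L1Loss()  # placeholder criterion, an assumption for this sketch

def step_losses(primary, secondary, src, tgt):
    pred_tgt = primary(src)    # source latent -> target latent space
    pred_src = secondary(tgt)  # target latent -> source latent space

    p_loss = criterion(pred_tgt, tgt)             # primary vs. target ground truth
    b_loss = criterion(pred_src, src)             # secondary vs. source ground truth
    r_loss = criterion(secondary(pred_tgt), src)  # round trip via the primary model
    h_loss = criterion(primary(pred_src), tgt)    # round trip via the secondary model
    return p_loss, b_loss, r_loss, h_loss
```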
All models were trained for 50000 steps with either batch size 128 (xl/v1) or 48 (cascade).
The training was done locally on an RTX 3080 and a Tesla V100S.
![LATENT_INTERPOSER_V4_LOSS](https://github.com/city96/SD-Latent-Interposer/assets/125218114/3a0d8920-ed48-42f0-96c9-897263525efb)
### Older versions
<details><summary>Interposer v3.1</summary>
### Interposer v3.1
This is basically a complete rewrite. Replaced the mediocre bunch of conv2d layers with something that looks more like a proper neural network. No VGG loss because I still don't have a better GPU.
Training was done on combined Flickr2K + DIV2K, with each image processed into six 1024x1024 segments, padded with some of my random images for a total of 22,000 source images in the dataset.
I think I got rid of most of the XL artifacts, but the color/hue/saturation shift issues are still there. I actually saved the optimizer state this time so I might be able to do 100K steps with visual loss on my P40s. Hopefully they won't burn up.
v3.0 was 500k steps at a constant LR of 1e-4, v3.1 was 1M steps using a CosineAnnealingLR to drop the learning rate towards the end. Both used AdamW.
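For reference, a minimal sketch of that v3.1 setup; only AdamW, the 1e-4 learning rate and the CosineAnnealingLR schedule over 1M steps come from the text above, the placeholder model is purely illustrative:
```
import torch

model = torch.nn.Conv2d(4, 4, 1)  # stand-in for the actual interposer network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)
# Each training step: forward/backward, optimizer.step(), then scheduler.step(),
# which anneals the learning rate towards zero over the 1M steps.
```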
![INTERPOSER_V3 1](https://github.com/city96/SD-Latent-Interposer/assets/125218114/daff0ae2-4739-4cef-ba54-ac1d156d3388)
</details>
<details><summary>Interposer v1.1</summary>
### Interposer v1.1
This is the second release using the "spaceship" architecture. It was trained on the Flickr2K dataset and was continued from the v1.0 checkpoint.
Overall, it seems to perform a lot better, especially for real life photos. I also investigated the odd v1->xl artifacts but in the end it seems [inherent to the VAE decoder stage.](https://github.com/comfyanonymous/ComfyUI/issues/1116)
![loss](https://github.com/city96/SD-Latent-Interposer/assets/125218114/e890420f-cebd-4f88-b243-62560b8384e5)
</details>
<details><summary>Interposer v1.0</summary>
### Interposer v1.0
Not sure why the training loss is so different; it might be due to the """highly curated""" dataset of 1000 random images from my Downloads folder that I used to train it.
I probably should've just grabbed LAION.
I also trained a v1-to-v2 model, before realizing v1 and v2 share the same latent space. Oh well.
![loss](https://github.com/city96/SD-Latent-Interposer/assets/125218114/f92c399b-a823-4521-b09b-8bdc3795f1ea)
![xl-to-v1_interposer](https://github.com/city96/SD-Latent-Interposer/assets/125218114/0d963bc5-570f-4ebe-95db-16e261f05e48)
</details>