Update README.md
---
license: mit
---

![](https://user-images.githubusercontent.com/61938694/231021615-38df0a0a-d97e-4f7a-99d9-99952357b4b1.png)

## Paella
We are releasing a new Paella model which builds on top of our initial paper: https://arxiv.org/abs/2211.07292.
Paella is a text-to-image model that works in a quantized latent space and learns similarly to MUSE and diffusion models.
Since the paper release we have worked intensively to bring Paella to a level comparable with other state-of-the-art models, and this release brings us a step closer to that goal. However, our main intention is not to build the greatest text-to-image model out there (at least for now); it is to bring text-to-image models closer to people outside the field on a technical level. For example, many models have codebases with many thousands of lines of code, which makes it hard for people to dive into the code and understand it easily. That is the contribution we are most proud of with Paella: the training and sampling code is minimalistic and can be understood in a few minutes, making further extensions, quick tests, idea prototyping etc. extremely fast. For instance, the entire sampling code can be written in just **12 lines** of code.

### How does Paella work?
Paella works in a quantized latent space, just like Stable Diffusion and others, to reduce the computational power needed.
Images are encoded into a smaller latent space and converted to visual tokens of shape *h x w*. During training, these visual tokens are noised by replacing a random fraction of them with other tokens randomly selected from the codebook of the VQGAN. The noised image is given to the model, along with a timestep and the conditioning information, which is text in our case. The model is tasked to predict the un-noised version of the tokens. And that's it. The model is optimized with a cross-entropy loss between the original tokens and the predicted tokens. The amount of noise added during training follows a simple linear schedule: we uniformly sample a percentage between 0% and 100% and noise that fraction of tokens.<br><br>

<figure>
<img src="https://user-images.githubusercontent.com/61938694/231248435-d21170c1-57b4-4a8f-90a6-62cf3e7effcd.png" width="400">
<figcaption>Images are noised and then fed to the model during training.</figcaption>
</figure>
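
For concreteness, here is a minimal sketch of what a single training step under this scheme could look like. The function and tensor names (`train_step`, `latents`, `text_condition`) are placeholders for illustration only; the actual training code lives in the GitHub repo linked below.
```python
import torch
import torch.nn.functional as F

def train_step(model, latents, text_condition, num_tokens):
    # latents: (batch, h, w) integer token indices produced by the VQGAN encoder
    batch, h, w = latents.shape
    t = torch.rand(batch)                                   # noise level, uniform in [0, 1] (linear schedule)
    mask = torch.rand(batch, h, w) < t[:, None, None]       # which tokens get replaced
    random_tokens = torch.randint(0, num_tokens, (batch, h, w))
    noised = torch.where(mask, random_tokens, latents)      # replace that fraction with random codebook tokens

    logits = model(noised, t, text_condition)               # predict the un-noised tokens: (batch, num_tokens, h, w)
    return F.cross_entropy(logits, latents)                 # cross-entropy against the original tokens
```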

Sampling is also extremely simple: we start with the entire image being random tokens. Then we feed the latent image, the timestep and the condition into the model and let it predict the final image. The model outputs a distribution over every token, which we sample from with standard multinomial sampling. Since there are infinitely many possibilities for what the result could look like, a single step only produces very basic shapes without any details. That is why we add noise to the image again and feed it back to the model. We repeat that process a number of times, adding less noise each time, and slowly arrive at our final image.
You can see how images emerge [here](https://user-images.githubusercontent.com/61938694/231252449-d9ac4d15-15ef-4aed-a0de-91fa8746a415.png).<br>
The following is the entire sampling code needed to generate images:
```python
import torch

def sample(model_inputs, latent_shape, unconditional_inputs, steps=12, renoise_steps=11, temperature=(0.7, 0.3), cfg=8.0):
    # `model` is the trained Paella token predictor (defined globally, see the GitHub repo)
    with torch.inference_mode():
        sampled = torch.randint(0, model.num_labels, size=latent_shape)  # start from fully random tokens
        initial_noise = sampled.clone()
        timesteps = torch.linspace(1.0, 0.0, steps + 1)
        temperatures = torch.linspace(temperature[0], temperature[1], steps)
        for i, t in enumerate(timesteps[:steps]):
            t = torch.ones(latent_shape[0]) * t

            logits = model(sampled, t, **model_inputs)
            if cfg:
                # classifier-free guidance: interpolate between conditional and unconditional logits
                logits = logits * cfg + model(sampled, t, **unconditional_inputs) * (1 - cfg)
            sampled = logits.div(temperatures[i]).softmax(dim=1).permute(0, 2, 3, 1).reshape(-1, logits.size(1))
            sampled = torch.multinomial(sampled, 1)[:, 0].view(logits.size(0), *logits.shape[2:])

            if i < renoise_steps:
                # re-noise the prediction and feed it back in the next iteration
                t_next = torch.ones(latent_shape[0]) * timesteps[i + 1]
                sampled = model.add_noise(sampled, t_next, random_x=initial_noise)[0]
    return sampled
```
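
Note that `sample` returns a grid of discrete VQGAN token indices, not pixels; the indices still have to be mapped back through the VQGAN decoder to obtain the image. A minimal sketch of that last step, where `vqgan` and `decode_indices` are hypothetical names standing in for the f4 VQGAN interface in the GitHub repo:
```python
# Hypothetical decoding step: the real VQGAN class and method names are in the GitHub repo.
tokens = sample(model_inputs, latent_shape=(1, 64, 64), unconditional_inputs=unconditional_inputs)  # 256px / f4 -> 64x64 tokens
with torch.inference_mode():
    images = vqgan.decode_indices(tokens)  # look up codebook vectors and decode back to pixel space
```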

### Results
<img src="https://user-images.githubusercontent.com/61938694/231598512-2410c172-5a9d-43f4-947c-6ff7eaee77e7.png">
Since Paella is also conditioned on CLIP image embeddings, the following things are also possible:<br><br>
<img src="https://user-images.githubusercontent.com/61938694/231278319-16551a8d-bfd1-49c9-b604-c6da3955a6d4.png">
<img src="https://user-images.githubusercontent.com/61938694/231287637-acd0b9b2-90c7-4518-9b9e-d7edefc6c3af.png">
<img src="https://user-images.githubusercontent.com/61938694/231287119-42fe496b-e737-4dc5-8e53-613bdba149da.png">

### Technical Details
Model Architecture: U-Net (Mix of ....) <br>
Dataset: LAION-A, LAION Aesthetic > 6.0 <br>
Training Steps: 1.3M <br>
Batch Size: 2048 <br>
Resolution: 256 x 256 <br>
VQGAN Compression: f4 <br>
Condition: ByT5-XL (95%), CLIP-H Image Embedding (10%), CLIP-H Text Embedding (10%) (see the sketch below) <br>
Optimizer: AdamW <br>
Hardware: 128 A100 @ 80GB <br>
Training Time: ~3 weeks <br>
Learning Rate: 1e-4 <br>
More details on the approach, training and sampling can be found in the paper and on GitHub.

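One way to read the conditioning percentages in the list above: during training each conditioning signal is kept with its own probability and dropped otherwise (e.g. replaced by a null embedding), which is also what makes classifier-free guidance possible at sampling time. The sketch below is purely illustrative; the independence of the masks and the names used are assumptions, not the actual training code:
```python
import torch

def keep_mask(batch_size, keep_prob):
    # True where the conditioning is kept for a sample, False where it is dropped
    return torch.rand(batch_size) < keep_prob

byt5_kept = keep_mask(2048, 0.95)      # ByT5-XL text embeddings used for ~95% of samples
clip_img_kept = keep_mask(2048, 0.10)  # CLIP-H image embeddings used for ~10% of samples
clip_txt_kept = keep_mask(2048, 0.10)  # CLIP-H text embeddings used for ~10% of samples
```
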
### Paper, Code Release
Paper: https://arxiv.org/abs/2211.07292 <br>
Code: https://github.com/dome272/Paella <br>

### Goal
So you see, there are no heavy math formulas or theorems needed to achieve good sampling quality. Moreover, no constants such as alpha, beta, alpha_cum_prod etc. are necessary as in diffusion models. This makes the method really suitable for people new to the field of generative AI. We hope we can set the foundation for further research in that direction and contribute to a world where AI is accessible and can be understood by everyone.

### Limitations & Conclusion
There are still many things to improve for Paella to get on par with standard diffusion models or to even outperform them. One primary thing we notice is that even though we only condition the model on CLIP image embeddings 10% of the time, during inference the model relies heavily on the image embeddings generated by a prior model (which maps CLIP text embeddings to image embeddings, as proposed in DALL-E 2). We counteract this by decreasing the importance of the image embeddings, reweighting the attention scores (see the sketch below). There is probably a way to avoid this already during training. Other limitations such as lack of composition, text depiction, unawareness of concepts etc. could also be reduced by training for longer. As a reference, Paella has only seen about as many images as SD 1.4.

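To make the attention reweighting concrete, here is a minimal sketch of downweighting the attention paid to the CLIP image-embedding tokens at inference time. The tensor layout, the mask and the weight value are assumptions for illustration; Paella's actual implementation in the GitHub repo may differ:
```python
import torch

def reweight_image_embedding_attention(attn, image_token_mask, weight=0.5):
    # attn: (batch, heads, queries, keys) post-softmax attention weights
    # image_token_mask: boolean mask over the key dimension marking the CLIP image-embedding tokens
    attn = attn.clone()
    attn[..., image_token_mask] = attn[..., image_token_mask] * weight  # downweight image-embedding keys
    return attn / attn.sum(dim=-1, keepdim=True)                        # renormalize each query to sum to 1
```
With `weight=1.0` nothing changes; lowering it reduces how strongly the model follows the prior-generated image embedding.
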
To conclude, this is still work in progress, but this model already works a million times better than the first versions we trained months ago. We hope that more people become interested in this approach, since we believe it has a lot of potential to become much better than this and to give people new to the field an easy-to-understand introduction to generative AI.