
metadata

language:
  - en

DALL·E mini - Generate images from text

Model Description

This is an attempt to replicate OpenAI's DALL·E, a model capable of generating arbitrary images from a text prompt that describes the desired result.

This model's architecture is a simplification of the original and leverages previous open-source efforts and available pre-trained models. Results have lower quality than OpenAI's, but the model can be trained and used on less demanding hardware. Our training was performed on a single TPU v3-8 for a few days.

Components of the Architecture

The system relies on the Flax/JAX infrastructure, which is ideal for TPU training. TPUs are not required: both Flax and JAX run very efficiently on GPU backends.
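As a quick check of which backend is in use, the following minimal sketch (assuming only that JAX is installed) lists the devices JAX can see and runs a small jitted function on them. The function `scaled_sum` is a hypothetical placeholder, not part of this project; the point is that the same script runs unchanged on CPU, GPU, or TPU.

```python
import jax
import jax.numpy as jnp

# Ask JAX which backend it selected (for example cpu, gpu, or tpu)
# and which devices are visible on it.
backend = jax.default_backend()
devices = jax.devices()
print(f"backend={backend}, device_count={len(devices)}")


# A jitted function is compiled for whichever backend is active,
# so no device-specific code is needed here.
@jax.jit
def scaled_sum(x):
    return jnp.sum(x * 2.0)


result = scaled_sum(jnp.arange(4.0))
print(float(result))
```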

The main components of the architecture include:

An encoder, based on BART. The encoder's mission is to transform a sequence of input text tokens into a sequence of image tokens. The input tokens are extracted from the text prompt by using the model's tokenizer. The image tokens are a fixed-length sequence, and they represent indices in a VQGAN-based pre-trained codebook.
A decoder, which converts the image tokens to an image for visualization. As mentioned above, the decoder is based on a VQGAN model.
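To make the idea of "image tokens as codebook indices" concrete, here is a toy sketch in JAX. It is not the project's code: the codebook is random rather than pre-trained, the token sequence is random rather than produced by the encoder, and the sizes are assumptions (16384 entries as suggested by the VQGAN model name, a 16x16 token grid, and an arbitrary embedding width of 8). The helper `tokens_to_latents` is hypothetical. It shows only the lookup step; a real VQGAN decoder would then pass the resulting grid through a convolutional network to produce pixels.

```python
import jax
import jax.numpy as jnp

# Assumed sizes for illustration only.
CODEBOOK_SIZE = 16384  # number of codebook entries (assumed from the model name)
GRID = 16              # tokens per image side (assumption)
EMBED_DIM = 8          # embedding width, arbitrary and hypothetical

key_codebook, key_tokens = jax.random.split(jax.random.PRNGKey(0))

# Stand-in for the pre-trained VQGAN codebook: one vector per index.
codebook = jax.random.normal(key_codebook, (CODEBOOK_SIZE, EMBED_DIM))

# Stand-in for the encoder output: a fixed-length sequence of codebook indices.
image_tokens = jax.random.randint(key_tokens, (GRID * GRID,), 0, CODEBOOK_SIZE)


def tokens_to_latents(tokens, codebook, grid):
    """Look up each index in the codebook and arrange the vectors on a grid."""
    vectors = codebook[tokens]              # shape: (grid * grid, embed_dim)
    return vectors.reshape(grid, grid, -1)  # shape: (grid, grid, embed_dim)


latents = tokens_to_latents(image_tokens, codebook, GRID)
print(latents.shape)  # (16, 16, 8)
```

Because the sequence length is fixed, the encoder always has to emit exactly `GRID * GRID` indices, which is what lets the decoder reshape them into a square grid without any extra bookkeeping.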

The model definition we use for the encoder can be downloaded from our GitHub repo. The encoder is represented by the class CustomFlaxBartForConditionalGeneration.

To use the decoder, follow the instructions for our accompanying VQGAN model on the Hugging Face Hub, flax-community/vqgan_f16_16384.

How to Use

The easiest way to get familiar with the code and the models is to follow the inference notebook we provide in our GitHub repo. For your convenience, you can open it in Google Colaboratory.

If you just want to test the trained model and see what it comes up with, please visit our demo, available as a Space on the Hugging Face Hub.

Additional Details

Our report contains a lot of details about how the model was trained and shows many examples that demonstrate its capabilities.