Wavelets Are All You Need for Autoregressive Image Generation
Abstract
In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which makes it possible to tokenize the visual content of an image from coarse to fine detail by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning of the generation process.
Community
We propose an innovative approach to AI image generation. While current state-of-the-art models like DALL-E and Stable Diffusion use complex diffusion processes, this method represents images as sequences of wavelet coefficients and generates them using language model techniques.
The key insight is using wavelet transforms to encode images into compact token sequences that capture visual information from coarse to fine detail. By ordering the tokens from most to least significant, the model can progressively build up an image, much as progressive wavelet codecs such as JPEG 2000 do.
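To make the coarse-to-fine ordering concrete, here is a rough sketch of bit-plane ordering of quantized wavelet coefficients using PyWavelets. The function name, quantization, and token format are illustrative assumptions, not the paper's actual tokenizer.

```python
# Illustrative sketch: order quantized wavelet coefficients from the most to the
# least significant bit plane (hypothetical tokenizer, not the paper's exact scheme).
# Requires: pip install numpy pywavelets
import numpy as np
import pywt

def bitplane_tokens(image, wavelet="haar", levels=3):
    """Emit (bit_plane, coefficient_position, sign) triples, coarse-to-fine."""
    coeffs = pywt.wavedec2(image.astype(np.float32), wavelet, level=levels)
    flat, _ = pywt.coeffs_to_array(coeffs)       # one array; coarsest subband sits in a corner
    q = np.round(flat).astype(np.int64)          # crude uniform quantization
    sign, mag = np.sign(q), np.abs(q)
    n_planes = int(mag.max()).bit_length()       # number of bit planes actually present
    tokens = []
    for plane in range(n_planes - 1, -1, -1):    # most significant bit plane first
        for idx in np.argwhere((mag >> plane) & 1):
            tokens.append((plane, tuple(idx), int(sign[tuple(idx)])))
    return tokens

img = np.random.rand(32, 32) * 255               # stand-in for an MNIST-sized image
print(len(bitplane_tokens(img)), "significance tokens")
```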
This allowed us to treat image generation as a language modeling task. We modified a GPT-style transformer architecture to work with this "wavelet language", which consists of just 7 token types. The model learns to predict sequences of tokens that, when decoded, form coherent images.
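As a loose illustration of the language-modeling setup (not the paper's exact architecture), here is a minimal GPT-style causal transformer over a small placeholder vocabulary in PyTorch; all sizes and module names are assumptions.

```python
# Minimal GPT-style next-token model over a tiny placeholder vocabulary (PyTorch).
# All dimensions are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

VOCAB, D, HEADS, LAYERS, CTX = 7, 128, 4, 4, 1024

class WaveletLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)
        self.pos = nn.Embedding(CTX, D)
        layer = nn.TransformerEncoderLayer(D, HEADS, 4 * D, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, LAYERS)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, ids):                                  # ids: (batch, seq)
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        causal = nn.Transformer.generate_square_subsequent_mask(seq).to(ids.device)
        return self.head(self.blocks(x, mask=causal))        # next-token logits per position

model = WaveletLM()
ids = torch.randint(0, VOCAB, (2, 64))                        # a batch of token sequences
logits = model(ids[:, :-1])                                   # predict each next token
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
```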
Some notable advantages of this approach:
Flexibility in image resolution and detail - longer token sequences simply produce higher-res or more detailed images.
Easy conditioning on text prompts or class labels by concatenating embeddings to the token representations (see the sketch after this list).
Fine-grained control over image quality and generation time by adjusting the number of bit planes used.
Potential for multi-modal generation by combining wavelet tokens with text tokens.
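Here is one possible reading of the conditioning bullet above: a class embedding concatenated to every token representation and projected back to model width. This is my own sketch of the idea, and the paper's exact mechanism may differ.

```python
# One possible reading of "concatenating embeddings to the token representations"
# for class conditioning; the paper's exact mechanism may differ.
import torch
import torch.nn as nn

NUM_CLASSES, VOCAB, D = 10, 7, 128

class ConditionedEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)
        self.cls = nn.Embedding(NUM_CLASSES, D)
        self.proj = nn.Linear(2 * D, D)                    # fuse [token ; class] back to model width

    def forward(self, ids, label):                          # ids: (B, T), label: (B,)
        t = self.tok(ids)                                   # (B, T, D)
        c = self.cls(label)[:, None, :].expand_as(t)        # broadcast class embedding over time
        return self.proj(torch.cat([t, c], dim=-1))         # (B, T, D), fed into the transformer

emb = ConditionedEmbedding()
x = emb(torch.randint(0, VOCAB, (2, 64)), torch.tensor([3, 7]))
```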
We demonstrated the approach on the MNIST and Fashion-MNIST datasets, generating diverse yet class-appropriate images using techniques like top-k sampling.
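Top-k sampling itself is a standard technique; for reference, a self-contained sketch of such a sampler over next-token logits looks like this (values are illustrative):

```python
# Standard top-k sampling over next-token logits (illustrative values).
import torch

def sample_top_k(logits, k=5, temperature=1.0):
    """Keep the k most likely tokens, renormalize, and sample one."""
    vals, idx = torch.topk(logits / temperature, k, dim=-1)
    choice = torch.multinomial(torch.softmax(vals, dim=-1), num_samples=1)
    return idx.gather(-1, choice).squeeze(-1)

next_token = sample_top_k(torch.randn(2, 7), k=3)   # batch of 2, vocabulary of 7
```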
While still early stage, this wavelet-based method opens up exciting possibilities. It could scale well to high-res color images and even enable localized control over different image regions.
Of course, significant work remains to scale this to more complex datasets and compare against leading diffusion models. But by bridging classic signal processing techniques with modern deep learning, this research points to intriguing new directions for AI image generation.
What do you think about this approach? Does it have potential to compete with diffusion models? Let me know your thoughts in the comments!
Would this allow you to easily roll back changes by deleting the most recent tokens and regenerating? I'm imagining a slider that represents the output during inference and could be rewound if you liked how the image started but didn't like how it ended up?
Yes, the wavelet-based autoregressive approach does indeed allow for easy rollback of changes during image generation. This is due to the hierarchical nature of wavelet tokenization, where visual information is progressively encoded from coarse to fine details. Here's how this can be achieved:
Token Sequence Rollback: Since the image is generated token by token, you can easily roll back changes by deleting the most recent tokens. This removes the finer details added later in the sequence while preserving the coarse-scale global structure of the generated image. Moreover, because each token is concatenated with its wavelet position, that position information can help the user control where the rolled-back changes are located (a rollback-and-resume sketch follows after this list).
Regeneration: After rolling back to a desired point, you can resume the generation process, potentially using different guiding parameters (e.g. class labels or textual prompts) to steer the image generation in a new direction.
Slider for Visual Control: Implementing a slider interface that represents the progression of token generation would allow users to visually control the level of detail. By sliding backwards, users can undo recent tokens and view the image at various stages of generation. Sliding forwards resumes the generation process from the selected point, enabling real-time adjustments and fine-tuning.
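Mechanically, the whole rollback idea is just sequence truncation plus resumed sampling. A minimal sketch, assuming a model that returns next-token logits per position (as in the GPT-style sketch earlier); the interface and sampler are assumptions, not the paper's API:

```python
# Rollback-and-resume sketch: drop the latest (finest-detail) tokens, then continue
# sampling. Assumes a model returning next-token logits per position, like the
# GPT-style sketch above; this interface is an assumption, not the paper's API.
import torch

@torch.no_grad()
def rollback_and_resume(model, tokens, keep, target_len, k=5):
    """tokens: (1, T) tokens generated so far; keep: number of early (coarse) tokens to retain."""
    seq = tokens[:, :keep]                               # "rewind the slider"
    while seq.shape[1] < target_len:                     # resume autoregressive generation
        logits = model(seq)[:, -1]                       # logits for the next token
        vals, idx = torch.topk(logits, k, dim=-1)        # simple top-k sampling
        choice = torch.multinomial(torch.softmax(vals, dim=-1), 1)
        seq = torch.cat([seq, idx.gather(-1, choice)], dim=1)
    return seq
```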