PixelBytes: Unified Multimodal Generation
Welcome to the PixelBytes repository! This project features models designed to generate text, audio and images simultaneously, pixel by pixel, using a unified embedding. (only testing weight)
Overview
Key Concepts
- Image Transformer: Generates images pixel by pixel.
- Bi-Mamba+: A bidirectional model for time series prediction.
- MambaByte: A selective state-space model without tokens.
The PixelByte model generates mixed sequences of text and images, handling transitions with line breaks and maintaining image dimension consistency.
Dataset
We use the PixelBytes-PokemonAll dataset, available on Hugging Face: PixelBytes-PokemonAll. It contains text and image sequences of Pokémon for training our model.
Models Trained
- 3 LSTM Models: 2 Auto-regressive and 1 only predictive.
Citation
Furfaro, F. (2024). PixelBytes: A Unified Multimodal Representation Learning Project. (https://github.com/fabienfrfr/PixelBytes)
Thank you for exploring PixelBytes! We hope this model aids your multimodal generation projects.
Inference API (serverless) does not yet support pytorch models for this pipeline type.