PixelBytes: Unified Multimodal Generation

Welcome to the PixelBytes repository! This project features models designed to generate text and images simultaneously, pixel by pixel, using a unified embedding. (These are testing weights only.)

Overview

Key Concepts

  • Image Transformer: Generates images pixel by pixel.
  • Bi-Mamba+: A bidirectional model for time series prediction.
  • MambaByte: A selective state-space model without tokens.
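The Mamba-style models above are built on state-space recurrences. As a rough illustration only (real selective SSMs such as Mamba make the parameters input-dependent and vector-valued; the fixed scalars here are a simplification), the core scan looks like:

```python
# Minimal illustrative state-space recurrence behind Mamba-style models:
#   h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
# NOTE: a, b, c are fixed scalars here for clarity; selective SSMs learn
# input-dependent, higher-dimensional versions of these parameters.

def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Run a scalar state-space model over a sequence of inputs."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # state update
        ys.append(c * h)    # readout
    return ys

ys = ssm_scan([1.0, 0.0, 0.0])  # exponentially decaying impulse response
```

A bidirectional variant (as in Bi-Mamba+) would run a second scan over the reversed sequence and combine the two outputs.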

The PixelBytes model generates mixed sequences of text and images, handling modality transitions with line breaks and maintaining consistent image dimensions.
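One way to picture such a unified sequence is the sketch below. The token ids, `PIXEL_OFFSET`, and `encode` helper are illustrative assumptions, not the project's actual vocabulary: text is encoded as raw bytes (0–255), image pixels as palette indices offset past the byte range, and the newline byte doubles as the transition and row marker that keeps image widths consistent.

```python
# Hypothetical sketch of a PixelBytes-style unified sequence (assumed layout,
# not the project's real vocabulary): bytes for text, offset palette indices
# for pixels, newline (0x0A) as the transition/row delimiter.

PIXEL_OFFSET = 256  # pixel tokens live above the byte range (assumption)
NEWLINE = 10        # '\n' separates text from image and row from row

def encode(text, image_rows):
    """Flatten text plus a small paletted image into one token sequence."""
    seq = list(text.encode("utf-8")) + [NEWLINE]
    for row in image_rows:
        seq += [PIXEL_OFFSET + p for p in row] + [NEWLINE]
    return seq

tokens = encode("Pikachu", [[1, 2], [3, 4]])
```

Because every row ends with the same delimiter, a model trained on such sequences can learn to emit rows of equal length, which is how dimension consistency can be maintained during generation.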

Dataset

We use the PixelBytes-Pokemon dataset, available on Hugging Face: PixelBytes-Pokemon. It contains text and image sequences of Pokémon for training our model.

Models Trained

  • 10 LSTM Models: unidirectional and bidirectional, with 1, 2, or 3 layers (including a special configuration: p_embed with 3× hidden_state and 3× embedding_dim)
  • 3 Mamba Models: bidirectional with 1 or 2 layers, unidirectional with 2 layers
  • 2 Transformer Models: with 1 or 2 layers

Citation

Furfaro, F. (2024). PixelBytes: A Unified Multimodal Representation Learning Project. (https://github.com/fabienfrfr/PixelBytes)


Thank you for exploring PixelBytes! We hope this model aids your multimodal generation projects.
