Reducio-VAE Model Card

This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It downsamples a video by 4x along the temporal axis and 32x along each spatial axis, producing a latent of size $\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}$, which amounts to 4096x overall downsampling ($4\times32\times32$). It is part of Reducio-DiT, a video generation method. Codebase available here.
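The arithmetic behind the 4096x figure can be checked with a short sketch. The function names are hypothetical and not part of the Reducio codebase, and the sketch assumes the clip dimensions divide evenly by the downsampling factors (the actual model may pad or crop instead).

```python
def reducio_latent_shape(frames, height, width, t_factor=4, s_factor=32):
    """Return the (T/4, H/32, W/32) latent grid for a clip.

    Assumption: dimensions must divide evenly; real preprocessing may differ.
    """
    if frames % t_factor or height % s_factor or width % s_factor:
        raise ValueError("clip dimensions must be divisible by the downsampling factors")
    return frames // t_factor, height // s_factor, width // s_factor


def overall_downsample_factor(t_factor=4, s_factor=32):
    """Product of the temporal factor and the two spatial factors."""
    return t_factor * s_factor * s_factor


# Example: a 16-frame clip at 1024x1024 resolution.
print(reducio_latent_shape(16, 1024, 1024))  # (4, 32, 32)
print(overall_downsample_factor())           # 4096
```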

Model Details

Model Sources

Repository: GitHub Repository
Paper: arXiv

Uses

The common usage scenario is described below.

Direct Use

The model is typically used to support training a video diffusion model. After using it to encode your data into the latent space, you can train your own diffusion model on the extremely compressed latents.

Results

Metrics on the 1K Pexels validation set and UCF-101:

Method	Downsample Factor	|z|	PSNR	SSIM	LPIPS	rFVD (Pexels)	rFVD (UCF-101)
SD2.1-VAE	1×8×8	4	29.23	0.82	0.09	25.96	21.00
SDXL-VAE	1×8×8	16	30.54	0.85	0.08	19.87	23.68
OmniTokenizer	4×8×8	8	27.11	0.89	0.07	23.88	30.52
OpenSora-1.2	4×8×8	16	30.72	0.85	0.11	60.88	67.52
Cosmos Tokenizer	8×8×8	16	30.84	0.74	0.12	29.44	22.06
Cosmos Tokenizer	8×16×16	16	28.14	0.65	0.18	77.87	119.37
Reducio-VAE	4×32×32	16	35.88	0.94	0.05	17.88	65.17
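
To put the downsample factors in perspective, the sketch below counts how many latent values each tokenizer would produce for one example clip. It reads each factor as temporal x height x width (for example 4x32x32 for Reducio-VAE) and takes the channel counts from the |z| column. The clip size, the plain floor division, and the helper name are assumptions for illustration only; individual tokenizers may handle the first frame or padding differently, and the count ignores the content frame that Reducio-VAE is conditioned on.

```python
# (method, temporal factor, spatial factor, latent channels |z|), taken from the table.
TOKENIZERS = [
    ("SD2.1-VAE", 1, 8, 4),
    ("SDXL-VAE", 1, 8, 16),
    ("OmniTokenizer", 4, 8, 8),
    ("OpenSora-1.2", 4, 8, 16),
    ("Cosmos Tokenizer (8x8x8)", 8, 8, 16),
    ("Cosmos Tokenizer (8x16x16)", 8, 16, 16),
    ("Reducio-VAE", 4, 32, 16),
]


def latent_values(frames, height, width, t_factor, s_factor, channels):
    """Number of scalar latent values, assuming simple floor division per axis."""
    return (frames // t_factor) * (height // s_factor) * (width // s_factor) * channels


# Hypothetical example clip: 16 frames at 256x256.
FRAMES, HEIGHT, WIDTH = 16, 256, 256

for name, t_factor, s_factor, channels in TOKENIZERS:
    count = latent_values(FRAMES, HEIGHT, WIDTH, t_factor, s_factor, channels)
    print(f"{name:28s} {count:>8d}")
```

Under these assumptions Reducio-VAE yields the smallest latent in the list, which is what makes diffusion training on it comparatively cheap.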

Citation

BibTeX:

@article{tian2024reducio,
      title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents}, 
      author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
      journal={arXiv preprint arXiv:2411.13552},
      year={2024}
}