Reducio-VAE Model Card

This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It compresses a video of size $T \times H \times W$ into a latent of size $\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}$, a 4096x downsampling ($4 \times 32 \times 32$) of the spatio-temporal positions. It is part of Reducio-DiT, a video generation method. Codebase available here.
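The arithmetic behind the 4096x figure can be checked with a short sketch. The helper names `latent_shape` and `position_reduction` are hypothetical and not part of the Reducio codebase. The default of 16 latent channels assumes that the |z| column in the results table denotes the latent channel count, and the divisibility check is an assumption of this sketch rather than a documented requirement of the model.

```python
def latent_shape(frames, height, width, latent_channels=16):
    """Return (channels, T/4, H/32, W/32) for a video of the given size.

    latent_channels=16 is an assumption taken from the |z| column
    of the results table.
    """
    t_factor, s_factor = 4, 32
    if frames % t_factor or height % s_factor or width % s_factor:
        raise ValueError("frames must be divisible by 4 and height/width by 32")
    return (latent_channels,
            frames // t_factor,
            height // s_factor,
            width // s_factor)


def position_reduction(frames, height, width):
    """Ratio of input spatio-temporal positions to latent positions."""
    _, t, h, w = latent_shape(frames, height, width)
    return (frames * height * width) // (t * h * w)


print(latent_shape(16, 1024, 1024))        # (16, 4, 32, 32)
print(position_reduction(16, 1024, 1024))  # 4096
```

For a 16-frame 1024x1024 clip, the latent grid under these assumptions has only 4 x 32 x 32 positions.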

Model Details

Model Sources

Uses

The common use scenario is described here.

Direct Use

The model is typically used to support training a video diffusion model. First use it to encode your video data into latents, then train your own diffusion model in this extremely compressed latent space.
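The two-stage workflow above can be sketched as a latent pre-computation pass. This is a minimal illustration, not the Reducio API: `precompute_latents`, `middle_frame`, and `dummy_encode` are hypothetical names, the encoder is passed in as a plain callable, and choosing the middle frame as the content frame is an assumption of this sketch. The dummy encoder exists only so that the example runs.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Frame = Sequence[float]  # placeholder type for one frame's pixel values
Video = List[Frame]


def middle_frame(video: Video) -> Frame:
    """Assumption of this sketch: use the middle frame as the content frame."""
    return video[len(video) // 2]


def precompute_latents(
    videos: Iterable[Video],
    encode_fn: Callable[[Video, Frame], list],
    pick_content_frame: Callable[[Video], Frame] = middle_frame,
) -> Iterator[Tuple[list, Frame]]:
    """Encode each video once and yield (latent, content_frame) pairs.

    encode_fn stands in for the real VAE encoder; the pairs can be
    cached and later fed to a diffusion model's training loop.
    """
    for video in videos:
        content = pick_content_frame(video)
        yield encode_fn(video, content), content


def dummy_encode(video: Video, content: Frame) -> list:
    """Stand-in encoder for demonstration only: mean of every 4th frame."""
    return [sum(frame) / len(frame) for frame in video[::4]]


# One toy "video" with 8 frames of 4 pixels each.
videos = [[[float(t)] * 4 for t in range(8)]]
pairs = list(precompute_latents(videos, dummy_encode))
print(pairs[0][0])  # [0.0, 4.0]
```

Encoding once and caching the result keeps the VAE out of the diffusion training loop, which is the usual reason for training in latent space.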

Results

Metrics on 1K Pexels validation set and UCF-101:

| Method | Downsample Factor | \|z\| | PSNR | SSIM | LPIPS | rFVD (Pexels) | rFVD (UCF-101) |
|---|---|---|---|---|---|---|---|
| SD2.1-VAE | 1×8×8 | 4 | 29.23 | 0.82 | 0.09 | 25.96 | 21.00 |
| SDXL-VAE | 1×8×8 | 16 | 30.54 | 0.85 | 0.08 | 19.87 | 23.68 |
| OmniTokenizer | 4×8×8 | 8 | 27.11 | 0.89 | 0.07 | 23.88 | 30.52 |
| OpenSora-1.2 | 4×8×8 | 16 | 30.72 | 0.85 | 0.11 | 60.88 | 67.52 |
| Cosmos Tokenizer | 8×8×8 | 16 | 30.84 | 0.74 | 0.12 | 29.44 | 22.06 |
| Cosmos Tokenizer | 8×16×16 | 16 | 28.14 | 0.65 | 0.18 | 77.87 | 119.37 |
| Reducio-VAE | 4×32×32 | 16 | 35.88 | 0.94 | 0.05 | 17.88 | 65.17 |
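For readers unfamiliar with the PSNR column, the standard definition can be sketched as follows. This is the textbook formula on flat pixel sequences. The exact evaluation protocol behind the table (pixel range, per-frame versus per-video averaging) is not stated in this card, so the 8-bit `max_value` default is an assumption of this sketch.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value=255.0 assumes 8-bit pixel values.
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [0, 64, 128, 255]
decoded = [1, 63, 129, 254]  # every pixel off by one, so MSE = 1
print(round(psnr(original, decoded), 2))  # 48.13
```

Higher PSNR means the reconstruction is closer to the input.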

Citation

BibTeX:

@article{tian2024reducio,
      title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents}, 
      author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
      journal={arXiv preprint arXiv:2411.13552},
      year={2024}
}