Image / Video Gen - a Norm Collection

Note
1. Introduce v_pred (velocity prediction) for the DDPM noise scheduler.
   1.1 Definition: v = \sqrt{\bar{\alpha}_t} \epsilon - \sqrt{1-\bar{\alpha}_t} x_0
   1.2 Conversion between epsilon prediction and velocity prediction: \epsilon_{pred} = \sqrt{\bar{\alpha}_t} v_{pred} + \sqrt{1-\bar{\alpha}_t} x_t
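The definition and the conversion in this note can be checked numerically. The following is a minimal sketch in plain Python, assuming the standard DDPM forward process x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon; the function names and the sample value of abar_t are illustrative and do not come from any library.

```python
import math
import random

def v_target(abar_t, x0, eps):
    """Velocity target: v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0."""
    return math.sqrt(abar_t) * eps - math.sqrt(1.0 - abar_t) * x0

def eps_from_v(abar_t, x_t, v):
    """Recover epsilon from a velocity value and the noisy sample x_t."""
    return math.sqrt(abar_t) * v + math.sqrt(1.0 - abar_t) * x_t

def x0_from_v(abar_t, x_t, v):
    """Recover the clean sample x0 from a velocity value and x_t."""
    return math.sqrt(abar_t) * x_t - math.sqrt(1.0 - abar_t) * v

rng = random.Random(0)
abar_t = 0.37                       # illustrative cumulative alpha
x0, eps = rng.gauss(0, 1), rng.gauss(0, 1)

# Assumed forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
v = v_target(abar_t, x0, eps)

eps_rec = eps_from_v(abar_t, x_t, v)
x0_rec = x0_from_v(abar_t, x_t, v)
```

With the true v plugged in, both conversions return the original eps and x0 up to floating-point error, because the squared coefficients sum to one.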

Flow Matching for Generative Modeling

Paper • 2210.02747 • Published Oct 6, 2022 • 1

simple diffusion: End-to-end diffusion for high resolution images

Paper • 2301.11093 • Published Jan 26, 2023 • 2

Note 1. Train with v-prediction but compute the loss on epsilon: v_pred = uvit(z_t, logsnr_t), eps_pred = sigma_t * z_t + alpha_t * v_pred
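As a rough illustration of this combination, the sketch below converts a velocity prediction into an epsilon prediction and takes the squared error against the true noise. It assumes a variance-preserving parameterization where alpha_t^2 = sigmoid(logsnr_t) and sigma_t^2 = sigmoid(-logsnr_t); the helper names are hypothetical and the network output is replaced by a perturbed true velocity.

```python
import math

def alpha_sigma_from_logsnr(logsnr):
    """Assumed variance-preserving mapping: alpha^2 + sigma^2 = 1."""
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma

def eps_loss_from_v_pred(v_pred, z_t, eps, logsnr):
    """Squared epsilon error computed from a velocity prediction."""
    alpha, sigma = alpha_sigma_from_logsnr(logsnr)
    eps_pred = sigma * z_t + alpha * v_pred
    return (eps_pred - eps) ** 2

logsnr = 1.0
alpha, sigma = alpha_sigma_from_logsnr(logsnr)
x0, eps = 0.8, -0.3
z_t = alpha * x0 + sigma * eps
v_true = alpha * eps - sigma * x0

v_error = 0.1                      # stand-in for the network's velocity error
loss = eps_loss_from_v_pred(v_true + v_error, z_t, eps, logsnr)
```

Under these assumptions the epsilon error equals alpha_t times the velocity error, so the epsilon loss is the velocity loss scaled by alpha_t^2.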

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Paper • 2209.03003 • Published Sep 7, 2022 • 1

Scalable Diffusion Models with Transformers

Paper • 2212.09748 • Published Dec 19, 2022 • 17

Note 1. adaLN-Zero: diffusion U-Nets zero-initialize the final convolutional layer in each block before any residual connection. DiT follows the same idea: in addition to γ and β, it regresses dimension-wise scaling parameters α that are applied immediately before any residual connection within the DiT block. The MLP is initialized to output zero for all α, so each DiT block starts as the identity function.
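The effect of the zero-initialized α can be seen in a small sketch. It is plain Python over a single token vector, with a hypothetical stand-in for the attention or MLP sublayer; the `1 + gamma` scaling convention is an assumption borrowed from common implementations, not something stated in the note.

```python
import math

def layer_norm(x, eps=1e-6):
    """Layer norm without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def adaln_zero_branch(x, gamma, beta, alpha, sublayer):
    """One residual branch: x + alpha * sublayer(modulate(norm(x)))."""
    h = layer_norm(x)
    h = [hi * (1.0 + g) + b for hi, g, b in zip(h, gamma, beta)]  # scale and shift
    h = sublayer(h)
    return [xi + a * hi for xi, a, hi in zip(x, alpha, h)]        # gated residual

def toy_sublayer(h):
    """Hypothetical stand-in for attention or the pointwise MLP."""
    return [2.0 * hi + 1.0 for hi in h]

x = [0.5, -1.0, 2.0, 0.25]
gamma = [0.3, -0.2, 0.1, 0.0]
beta = [0.0, 0.1, -0.1, 0.2]

# At initialization the conditioning MLP outputs zero for every alpha.
out_at_init = adaln_zero_branch(x, gamma, beta, [0.0] * 4, toy_sublayer)
out_trained = adaln_zero_branch(x, gamma, beta, [0.5] * 4, toy_sublayer)
```

With all α equal to zero the branch returns its input unchanged, whatever the sublayer computes; a nonzero α lets the sublayer contribute.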

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Paper • 2401.08740 • Published Jan 16 • 12

Note
1. Generation process: the stochastic interpolant framework decouples the formulation of x_t from the forward SDE.
2. Model prediction: learn the velocity field v(x, t) and use it to express the score s(x, t) when sampling with an SDE.
3. The optimal choice of w_t (the diffusion coefficient of the sampling SDE) always depends on the model prediction and the interpolant.
4. The paper moves step by step from a DiT model (discrete time, score prediction, VP interpolant) to a SiT model (continuous time, velocity prediction, linear interpolant).
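Point 2 can be made concrete with a short sketch. It assumes the interpolant x_t = alpha_t x_* + sigma_t eps with the linear choice alpha_t = 1 - t, sigma_t = t, and uses a one-dimensional standard Gaussian data distribution so that the exact velocity and score are known in closed form. The function names are illustrative.

```python
def linear_interpolant(t):
    """Assumed linear interpolant: returns alpha, sigma and their time derivatives."""
    return 1.0 - t, t, -1.0, 1.0

def score_from_velocity(x, v, alpha, sigma, d_alpha, d_sigma):
    """s = (alpha * v - d_alpha * x) / (sigma * (d_alpha * sigma - alpha * d_sigma))."""
    return (alpha * v - d_alpha * x) / (sigma * (d_alpha * sigma - alpha * d_sigma))

t, x = 0.3, 0.7
alpha, sigma, d_alpha, d_sigma = linear_interpolant(t)

# Toy case: data and noise are both N(0, 1), so x_t ~ N(0, alpha^2 + sigma^2).
var_t = alpha ** 2 + sigma ** 2
v_exact = (d_alpha * alpha + d_sigma * sigma) / var_t * x   # E[d/dt x_t | x_t = x]
score_exact = -x / var_t                                    # score of N(0, var_t)

score_from_v = score_from_velocity(x, v_exact, alpha, sigma, d_alpha, d_sigma)
```

In this toy case the score recovered from the velocity matches the analytic Gaussian score, which is the relation an SDE sampler would rely on when only a velocity model is trained.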

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Paper • 2408.12590 • Published Aug 22 • 34

Note
1. Extend the 2D image-based VAE into a 3D VideoVAE with CausalConv3D.
2. Encode long videos with a divide-and-merge strategy.
3. Caption model:
   3.1 The temporal encoder is implemented with [Token Turing Machines](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing).

Classifier-Free Diffusion Guidance

Paper • 2207.12598 • Published Jul 26, 2022 • 2

Note
1. Follow-up work: APG (https://arxiv.org/pdf/2410.02416)
   1.1 APG splits the CFG update into components parallel and orthogonal to the conditional prediction. Leaning more on the orthogonal component significantly attenuates the oversaturation side effect of high guidance scales while maintaining the quality-boosting benefits of CFG.
   1.2 APG performs best when applied to the denoised predictions rather than the noise predictions.
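The projection idea behind APG can be sketched in a few lines. This is a simplified illustration on flat vectors of denoised predictions: it only shows the parallel/orthogonal split and the down-weighting factor (called `eta` here), and omits the other components of the full method. All names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def projected_guidance(d_cond, d_uncond, w, eta=0.0):
    """CFG update with the component parallel to d_cond scaled by eta."""
    diff = [c - u for c, u in zip(d_cond, d_uncond)]
    coef = dot(diff, d_cond) / dot(d_cond, d_cond)
    parallel = [coef * c for c in d_cond]                 # along the conditional prediction
    orthogonal = [d - p for d, p in zip(diff, parallel)]  # remainder
    return [c + (w - 1.0) * (o + eta * p)
            for c, o, p in zip(d_cond, orthogonal, parallel)]

d_cond = [1.0, 2.0, -0.5]
d_uncond = [0.2, 1.5, 0.3]
w = 7.5

cfg_like = projected_guidance(d_cond, d_uncond, w, eta=1.0)   # plain CFG
orth_only = projected_guidance(d_cond, d_uncond, w, eta=0.0)  # orthogonal part only
update = [o - c for o, c in zip(orth_only, d_cond)]
```

Setting `eta=1.0` recovers ordinary CFG, while `eta=0.0` keeps only the update that is orthogonal to the conditional prediction.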

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Paper • 2310.00426 • Published Sep 30, 2023 • 61

Note
1. Training recipe
   - Initialize the T2I model from a low-cost class-conditional model;
   - Pretrain on text-image pairs with high information density;
   - Fine-tune on data with superior aesthetic quality.
2. adaLN-single
   - One global set of shifts and scales, denoted shared_adaln_cond, is computed only at the first block and shared across all blocks;
   - A layer-specific trainable embedding, denoted adaln_cond, adaptively adjusts the scale and shift parameters in each block.
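A minimal sketch of the adaLN-single idea follows, assuming the shared conditioning and the per-block embedding are combined by summation and then split into six modulation vectors. The stand-in for the shared MLP, the chunk order, and the chunk names are assumptions made for illustration.

```python
def shared_modulation(t_emb, hidden):
    """Hypothetical stand-in for the single MLP applied to the time embedding."""
    return [0.1 * t_emb * (k + 1) for k in range(6 * hidden)]

def block_modulation(shared_adaln_cond, adaln_cond):
    """Sum the shared conditioning with one block's trainable embedding."""
    summed = [s + e for s, e in zip(shared_adaln_cond, adaln_cond)]
    hidden = len(summed) // 6
    names = ["shift_attn", "scale_attn", "gate_attn",
             "shift_mlp", "scale_mlp", "gate_mlp"]   # assumed ordering
    return {name: summed[k * hidden:(k + 1) * hidden]
            for k, name in enumerate(names)}

hidden, num_blocks = 2, 3

# Computed once per timestep, then reused by every block.
shared_adaln_cond = shared_modulation(t_emb=0.5, hidden=hidden)

# One trainable embedding per block (constant toy values here).
adaln_cond_table = [[0.01 * i] * (6 * hidden) for i in range(num_blocks)]

per_block = [block_modulation(shared_adaln_cond, e) for e in adaln_cond_table]
```

The design trades the per-block conditioning MLPs for one shared MLP plus a small embedding table, which is where the parameter saving comes from.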

FreeInit: Bridging Initialization Gap in Video Diffusion Models

Paper • 2312.07537 • Published Dec 12, 2023 • 26

Note
1. Gap between training and inference: the initial noise obtained by corrupting real videos remains temporally correlated in the low-frequency band.
2. FreeInit procedure
   2.1 Initialize independent Gaussian noise;
   2.2 Run DDIM denoising to generate a clean video latent;
   2.3 Obtain a noisy version of the video latent through forward diffusion;
   2.4 Combine the low-frequency components of this noisy latent with the high-frequency components of fresh random Gaussian noise;
   2.5 Repeat.
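Step 2.4 can be sketched as a frequency-domain mix. The example below is a deliberate simplification: it works on a single one-dimensional sequence over the frame axis, uses a naive DFT from the standard library, and applies an ideal (hard cutoff) low-pass mask, whereas a real video latent would need a spatio-temporal transform and possibly a smoother filter. Names and the cutoff value are illustrative.

```python
import cmath
import random

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n)
                for f in range(n)).real / n
            for k in range(n)]

def mix_frequencies(noisy_latent, fresh_noise, cutoff):
    """Keep low frequencies of noisy_latent, high frequencies of fresh_noise."""
    n = len(noisy_latent)
    low_src, high_src = dft(noisy_latent), dft(fresh_noise)
    mixed = []
    for f in range(n):
        dist = min(f, n - f)          # distance from the zero frequency
        mixed.append(low_src[f] if dist <= cutoff else high_src[f])
    return idft(mixed)                # symmetric mask keeps the result real

rng = random.Random(0)
num_frames = 8
latent = [rng.gauss(0, 1) for _ in range(num_frames)]   # noisy video latent (toy)
noise = [rng.gauss(0, 1) for _ in range(num_frames)]    # fresh Gaussian noise

mixed = mix_frequencies(latent, noise, cutoff=1)
```

The mixed sequence would then serve as the initial noise for the next denoising round in step 2.5.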

black-forest-labs/FLUX.1-schnell

Text-to-Image • Updated Aug 16 • 1.64M • 2.96k

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Paper • 2403.03206 • Published Mar 5 • 59

Note
Known as SD-3.
1. Change the distribution over t from uniform to one that gives more weight to intermediate timesteps by sampling them more frequently.
2. Use a ratio of 50% original and 50% synthetic captions.
3. MM-DiT: separate sets of weights for the text and image modalities, with attention computed jointly over both token sequences.
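One way to favor intermediate timesteps is a logit-normal distribution: draw a Gaussian sample and pass it through a sigmoid. The sketch below assumes location 0 and scale 1 as illustrative defaults and compares how often t lands in the middle half of the unit interval.

```python
import math
import random

def sample_t_logit_normal(rng, mean=0.0, std=1.0):
    """t = sigmoid(u) with u ~ N(mean, std^2); values lie in (0, 1)."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
samples = [sample_t_logit_normal(rng) for _ in range(10000)]

# Fraction of samples in the middle half of [0, 1]; uniform sampling gives about 0.5.
middle_fraction = sum(0.25 < t < 0.75 for t in samples) / len(samples)
```

Shifting `mean` moves the emphasis toward the data or noise end, and `std` controls how concentrated the sampling is around the middle.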

On the Importance of Noise Scheduling for Diffusion Models

Paper • 2301.10972 • Published Jan 26, 2023 • 1

Note 1. When increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels). This is more important in video generation.
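A simple way to make a schedule noisier is to scale the input by a factor b < 1 before adding noise, i.e. x_t = \sqrt{\gamma_t} b x_0 + \sqrt{1-\gamma_t} \epsilon. The sketch below assumes unit-variance data and shows the resulting shift in log-SNR; the function name and the values of gamma_t and b are illustrative.

```python
import math

def log_snr(gamma_t, b=1.0):
    """log-SNR of x_t = sqrt(gamma_t) * b * x0 + sqrt(1 - gamma_t) * eps."""
    return math.log((b * b * gamma_t) / (1.0 - gamma_t))

gamma_t = 0.5
b = 0.5                                   # smaller b means a noisier schedule

# The whole log-SNR curve moves down by the same amount at every timestep.
shift = log_snr(gamma_t, b=b) - log_snr(gamma_t, b=1.0)
```

The shift equals 2 log b regardless of gamma_t, so a single scalar controls how much noisier the schedule becomes as resolution grows.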

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Paper • 2402.14797 • Published Feb 22 • 19

Note 1. The authors argue that modeling space and time separately causes motion artifacts, temporal inconsistencies, or the generation of dynamic images rather than videos with vivid motion.

Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

Paper • 2404.07724 • Published Apr 11 • 13

Note 1. Guidance is harmful toward the beginning of the sampling chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle.
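In code, this amounts to switching the guidance weight on only inside a noise-level interval and falling back to the plain conditional prediction elsewhere. The sketch below uses flat vectors of denoised predictions; the interval bounds and the weight are placeholder values, not tuned settings.

```python
def guidance_weight(sigma, w, sigma_lo, sigma_hi):
    """Use weight w only for sigma in (sigma_lo, sigma_hi], otherwise 1."""
    return w if sigma_lo < sigma <= sigma_hi else 1.0

def guided_prediction(d_cond, d_uncond, sigma, w, sigma_lo, sigma_hi):
    """w_t * cond + (1 - w_t) * uncond, with an interval-dependent w_t."""
    w_t = guidance_weight(sigma, w, sigma_lo, sigma_hi)
    return [w_t * c + (1.0 - w_t) * u for c, u in zip(d_cond, d_uncond)]

d_cond = [1.0, 2.0]
d_uncond = [0.5, 1.0]
w, sigma_lo, sigma_hi = 3.0, 0.3, 5.0     # placeholder values

early = guided_prediction(d_cond, d_uncond, 40.0, w, sigma_lo, sigma_hi)   # high noise
middle = guided_prediction(d_cond, d_uncond, 1.0, w, sigma_lo, sigma_hi)   # inside interval
late = guided_prediction(d_cond, d_uncond, 0.05, w, sigma_lo, sigma_hi)    # low noise
```

Outside the interval the unconditional prediction is multiplied by zero, so a sampler could skip that network evaluation entirely.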

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Paper • 2410.06940 • Published Oct 9 • 6

MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Paper • 2410.20280 • Published Oct 26 • 21

Note
1. For spatio-temporal attention, 2D RoPE covers the spatial and temporal positions. Inserting a learnable [NEXT] token to differentiate image patches across rows is enough for the spatial dimension, so 3D RoPE is not needed.
2. Dynamic-resolution training is not part of the main training stages. Instead, after convergence, fine-tuning the model for a few steps (10K-20K) with dynamic resolutions enables it.

In-Context LoRA for Diffusion Transformers

Paper • 2410.23775 • Published Oct 31 • 10

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Paper • 2410.13863 • Published Oct 17 • 35

Note 1. Validation loss is a proxy for generation quality.

OminiControl: Minimal and Universal Control for Diffusion Transformer

Paper • 2411.15098 • Published 11 days ago • 41

Note 1. Process condition-image tokens uniformly with the text and noisy-image tokens by integrating them into one unified sequence. Direct addition of hidden states is avoided because it constrains token interactions.

Open-Sora Plan: Open-Source Large Video Generation Model

Paper • 2412.00131 • Published 5 days ago • 19