|
# ANALYTIC-DPM: AN ANALYTIC ESTIMATE OF THE OPTIMAL REVERSE VARIANCE IN DIFFUSION PROBABILISTIC MODELS
|
|
|
**Fan Bao¹ ∗, Chongxuan Li²³ †, Jun Zhu¹ †, Bo Zhang¹**

¹Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua-Huawei Joint Center for AI,
BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China
²Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
³Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
bf19@mails.tsinghua.edu.cn, chongxuanli1991@gmail.com, {dcszj, dcszb}@tsinghua.edu.cn
|
|
|
ABSTRACT |
|
|
|
Diffusion probabilistic models (DPMs) represent a class of powerful generative models. Despite their success, the inference of DPMs is expensive since it generally needs to iterate over thousands of timesteps. A key problem in the inference is to estimate the variance in each timestep of the reverse process. In this work, we present a surprising result that both the optimal reverse variance and the corresponding optimal KL divergence of a DPM have analytic forms w.r.t. its score function. Building upon it, we propose Analytic-DPM, a training-free inference framework that estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and a pretrained score-based model. Further, to correct the potential bias caused by the score-based model, we derive both lower and upper bounds of the optimal variance and clip the estimate for a better result. Empirically, our Analytic-DPM improves the log-likelihood of various DPMs, produces high-quality samples, and meanwhile enjoys a 20× to 80× speed up.
|
|
|
1 INTRODUCTION |
|
|
|
A diffusion process gradually adds noise to a data distribution over a series of timesteps. By learning to reverse it, diffusion probabilistic models (DPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020b) define a data generative process. Recently, it has been shown that DPMs are able to produce high-quality samples (Ho et al., 2020; Nichol & Dhariwal, 2021; Song et al., 2020b; Dhariwal & Nichol, 2021), which are comparable or even superior to the current state-of-the-art GAN models (Goodfellow et al., 2014; Brock et al., 2018; Wu et al., 2019; Karras et al., 2020b).
|
|
|
Despite their success, the inference of DPMs (e.g., sampling and density evaluation) often requires iterating over thousands of timesteps, which is two or three orders of magnitude slower (Song et al., 2020a) than other generative models such as GANs. A key problem in the inference is to estimate the variance in each timestep of the reverse process. Most prior works use a handcrafted value for all timesteps and thus usually run a long chain to obtain a reasonable sample and density value (Nichol & Dhariwal, 2021). Nichol & Dhariwal (2021) attempt to improve the efficiency of sampling by learning a variance network in the reverse process. However, it still needs a relatively long trajectory to get a reasonable log-likelihood (see Appendix E in Nichol & Dhariwal (2021)).
|
|
|
In this work, we present a surprising result that both the optimal reverse variance and the corresponding optimal KL divergence of a DPM have analytic forms w.r.t. its score function (i.e., the gradient of a log density). Building upon it, we propose Analytic-DPM, a training-free inference framework to improve the efficiency of a pretrained DPM while achieving comparable or even superior performance. Analytic-DPM estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and the score-based model in the pretrained DPM. The corresponding trajectory is calculated via a dynamic programming algorithm (Watson et al., 2021). Further, to correct the potential bias caused by the score-based model, we derive both lower and upper bounds of the optimal variance and clip its estimate for a better result. Finally, we reveal an interesting relationship between the score function and the data covariance matrix.

∗Work done during an internship at Huawei Noah's Ark Lab. †Correspondence to: C. Li and J. Zhu.
|
|
|
Analytic-DPM is applicable to a variety of DPMs (Ho et al., 2020; Song et al., 2020a; Nichol & Dhariwal, 2021) in a plug-and-play manner. Empirically, Analytic-DPM consistently improves the log-likelihood of these DPMs and meanwhile enjoys a 20× to 40× speed up. Besides, Analytic-DPM also consistently improves the sample quality of DDIMs (Song et al., 2020a) and requires at most 50 timesteps (a 20× to 80× speed up compared to the full timesteps) to achieve an FID comparable to the corresponding baseline.
|
|
|
2 BACKGROUND |
|
|
|
Diffusion probabilistic models (DPMs) first construct a forward process $q(\mathbf{x}_{1:N}|\mathbf{x}_0)$ that injects noise into a data distribution $q(\mathbf{x}_0)$, and then reverse the forward process to recover it. Given a forward noise schedule $\beta_n \in (0, 1)$, $n = 1, \cdots, N$, denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020) consider a Markov forward process:

$$q_M(\mathbf{x}_{1:N}|\mathbf{x}_0) = \prod_{n=1}^{N} q_M(\mathbf{x}_n|\mathbf{x}_{n-1}), \quad q_M(\mathbf{x}_n|\mathbf{x}_{n-1}) = \mathcal{N}(\mathbf{x}_n|\sqrt{\alpha_n}\,\mathbf{x}_{n-1}, \beta_n \mathbf{I}), \tag{1}$$

where $\mathbf{I}$ is the identity matrix, $\alpha_n$ and $\beta_n$ are scalars and $\alpha_n := 1 - \beta_n$. Song et al. (2020a) introduce a more general non-Markov process indexed by a non-negative vector $\lambda = (\lambda_1, \cdots, \lambda_N) \in \mathbb{R}^N_{\geq 0}$:

$$q_\lambda(\mathbf{x}_{1:N}|\mathbf{x}_0) = q_\lambda(\mathbf{x}_N|\mathbf{x}_0) \prod_{n=2}^{N} q_\lambda(\mathbf{x}_{n-1}|\mathbf{x}_n, \mathbf{x}_0), \tag{2}$$

$$q_\lambda(\mathbf{x}_N|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_N|\sqrt{\overline{\alpha}_N}\,\mathbf{x}_0, \overline{\beta}_N \mathbf{I}), \qquad q_\lambda(\mathbf{x}_{n-1}|\mathbf{x}_n, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{n-1}|\tilde{\boldsymbol{\mu}}_n(\mathbf{x}_n, \mathbf{x}_0), \lambda_n^2 \mathbf{I}),$$

$$\tilde{\boldsymbol{\mu}}_n(\mathbf{x}_n, \mathbf{x}_0) = \sqrt{\overline{\alpha}_{n-1}}\,\mathbf{x}_0 + \sqrt{\overline{\beta}_{n-1} - \lambda_n^2} \cdot \frac{\mathbf{x}_n - \sqrt{\overline{\alpha}_n}\,\mathbf{x}_0}{\sqrt{\overline{\beta}_n}}.$$

Here $\overline{\alpha}_n := \prod_{i=1}^{n} \alpha_i$ and $\overline{\beta}_n := 1 - \overline{\alpha}_n$. Indeed, Eq. (2) includes the DDPM forward process as a special case when $\lambda_n^2 = \tilde{\beta}_n$, where $\tilde{\beta}_n := \frac{\overline{\beta}_{n-1}}{\overline{\beta}_n} \beta_n$. Another special case of Eq. (2) is the denoising diffusion implicit model (DDIM) forward process, where $\lambda_n^2 = 0$. Besides, we can further derive $q_\lambda(\mathbf{x}_n|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_n|\sqrt{\overline{\alpha}_n}\,\mathbf{x}_0, \overline{\beta}_n \mathbf{I})$, which is independent of $\lambda$. In the rest of the paper, we focus on the forward process in Eq. (2) since it is more general, and we omit the index $\lambda$ and denote it as $q(\mathbf{x}_{1:N}|\mathbf{x}_0)$ for simplicity.
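As a concrete sketch, the schedule quantities above ($\alpha_n$, $\overline{\alpha}_n$, $\overline{\beta}_n$, $\tilde{\beta}_n$) and the marginal $q(\mathbf{x}_n|\mathbf{x}_0)$ can be computed in a few lines of NumPy. The linear schedule below is one common choice from Ho et al. (2020); the exact values are an assumption for illustration:

```python
import numpy as np

# A minimal sketch of the DDPM schedule quantities and the marginal
# q(x_n | x_0). The linear beta schedule (1e-4 to 0.02 over N = 1000 steps)
# is an assumed example, not the only valid choice.
N = 1000
betas = np.linspace(1e-4, 0.02, N)            # beta_1, ..., beta_N
alphas = 1.0 - betas                          # alpha_n := 1 - beta_n
alphas_bar = np.cumprod(alphas)               # alpha_bar_n := prod_{i<=n} alpha_i
betas_bar = 1.0 - alphas_bar                  # beta_bar_n := 1 - alpha_bar_n

# beta_tilde_n := (beta_bar_{n-1} / beta_bar_n) * beta_n, with beta_bar_0 := 0
betas_bar_prev = np.concatenate([[0.0], betas_bar[:-1]])
betas_tilde = betas_bar_prev / betas_bar * betas

def sample_q_n(x0, n, rng):
    """Draw x_n ~ q(x_n | x_0) = N(sqrt(alpha_bar_n) x_0, beta_bar_n I); n is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[n - 1]) * x0 + np.sqrt(betas_bar[n - 1]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))              # toy "data" with d = 8
xN = sample_q_n(x0, N, rng)                   # for n = N, x_N is close to N(0, I)
```

Note that $\tilde{\beta}_n \leq \beta_n$ always holds, consistent with the two handcrafted variance choices discussed below.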
|
|
|
The reverse process for Eq. (2) is defined as a Markov process that aims to approximate $q(\mathbf{x}_0)$ by gradually denoising from the standard Gaussian distribution $p(\mathbf{x}_N) = \mathcal{N}(\mathbf{x}_N|\mathbf{0}, \mathbf{I})$:

$$p(\mathbf{x}_{0:N}) = p(\mathbf{x}_N) \prod_{n=1}^{N} p(\mathbf{x}_{n-1}|\mathbf{x}_n), \quad p(\mathbf{x}_{n-1}|\mathbf{x}_n) = \mathcal{N}(\mathbf{x}_{n-1}|\boldsymbol{\mu}_n(\mathbf{x}_n), \sigma_n^2 \mathbf{I}),$$

where $\boldsymbol{\mu}_n(\mathbf{x}_n)$ is generally parameterized[1] by a time-dependent score-based model $\mathbf{s}_n(\mathbf{x}_n)$ (Song & Ermon, 2019; Song et al., 2020b):

$$\boldsymbol{\mu}_n(\mathbf{x}_n) = \tilde{\boldsymbol{\mu}}_n\left(\mathbf{x}_n, \frac{1}{\sqrt{\overline{\alpha}_n}}\left(\mathbf{x}_n + \overline{\beta}_n \mathbf{s}_n(\mathbf{x}_n)\right)\right). \tag{3}$$
|
The reverse process can be learned by optimizing a variational bound $L_{\mathrm{vb}}$ on the negative log-likelihood:

$$L_{\mathrm{vb}} = \mathbb{E}_q\left[-\log p(\mathbf{x}_0|\mathbf{x}_1) + \sum_{n=2}^{N} D_{\mathrm{KL}}(q(\mathbf{x}_{n-1}|\mathbf{x}_0, \mathbf{x}_n)\,||\,p(\mathbf{x}_{n-1}|\mathbf{x}_n)) + D_{\mathrm{KL}}(q(\mathbf{x}_N|\mathbf{x}_0)\,||\,p(\mathbf{x}_N))\right],$$
|
|
|
[1] Ho et al. (2020); Song et al. (2020a) parameterize $\boldsymbol{\mu}_n(\mathbf{x}_n)$ with $\tilde{\boldsymbol{\mu}}_n(\mathbf{x}_n, \frac{1}{\sqrt{\overline{\alpha}_n}}(\mathbf{x}_n - \sqrt{\overline{\beta}_n}\,\boldsymbol{\epsilon}_n(\mathbf{x}_n)))$, which is equivalent to Eq. (3) by letting $\mathbf{s}_n(\mathbf{x}_n) = -\boldsymbol{\epsilon}_n(\mathbf{x}_n)/\sqrt{\overline{\beta}_n}$.
|
|
|
|
|
|
which is equivalent to optimizing the KL divergence between the forward and the reverse process:

$$\min_{\{\boldsymbol{\mu}_n, \sigma_n^2\}_{n=1}^{N}} L_{\mathrm{vb}} \;\Leftrightarrow\; \min_{\{\boldsymbol{\mu}_n, \sigma_n^2\}_{n=1}^{N}} D_{\mathrm{KL}}(q(\mathbf{x}_{0:N})\,||\,p(\mathbf{x}_{0:N})). \tag{4}$$
|
|
|
To improve the sample quality in practice, instead of directly optimizing Lvb, Ho et al. (2020) |
|
consider a reweighted variant of Lvb to learn sn(xn): |
|
|
|
|
|
$$\min_{\{\mathbf{s}_n\}_{n=1}^{N}} \mathbb{E}_n \overline{\beta}_n \mathbb{E}_{q_n(\mathbf{x}_n)} \left\|\mathbf{s}_n(\mathbf{x}_n) - \nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)\right\|^2 = \mathbb{E}_{n, \mathbf{x}_0, \boldsymbol{\epsilon}} \left\|\boldsymbol{\epsilon} + \sqrt{\overline{\beta}_n}\,\mathbf{s}_n(\mathbf{x}_n)\right\|^2 + c, \tag{5}$$

where $n$ is uniform between 1 and $N$, $q_n(\mathbf{x}_n)$ is the marginal distribution of the forward process at timestep $n$, $\boldsymbol{\epsilon}$ is a standard Gaussian noise, $\mathbf{x}_n$ on the right-hand side is reparameterized by $\mathbf{x}_n = \sqrt{\overline{\alpha}_n}\,\mathbf{x}_0 + \sqrt{\overline{\beta}_n}\,\boldsymbol{\epsilon}$, and $c$ is a constant only related to $q$. Indeed, Eq. (5) is exactly a weighted sum of score matching objectives (Song & Ermon, 2019), which admits the optimal solution $\mathbf{s}_n^*(\mathbf{x}_n) = \nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)$ for all $n \in \{1, 2, \cdots, N\}$.
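A single Monte Carlo draw of the right-hand side of Eq. (5) can be sketched as follows; `score_model` is a hypothetical stand-in for the network $\mathbf{s}_n$, and the linear schedule is an assumption:

```python
import numpy as np

# One Monte Carlo draw of the reweighted objective in Eq. (5): sample n, x_0
# and eps, form x_n, and evaluate ||eps + sqrt(beta_bar_n) s_n(x_n)||^2.
N = 1000
betas = np.linspace(1e-4, 0.02, N)
alphas_bar = np.cumprod(1.0 - betas)
betas_bar = 1.0 - alphas_bar

def loss_term(x0, score_model, rng):
    n = int(rng.integers(1, N + 1))                    # n uniform in {1, ..., N}
    eps = rng.standard_normal(x0.shape)
    xn = np.sqrt(alphas_bar[n - 1]) * x0 + np.sqrt(betas_bar[n - 1]) * eps
    resid = eps + np.sqrt(betas_bar[n - 1]) * score_model(xn, n)
    return float(np.sum(resid ** 2))

rng = np.random.default_rng(1)
x0 = rng.standard_normal(8)
# With a zero model the term reduces to ||eps||^2; training s_n drives the
# average of this term toward the constant c of Eq. (5).
val = loss_term(x0, lambda x, t: np.zeros_like(x), rng)
```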
|
|
|
Note that Eq. (5) provides no learning signal for the variance $\sigma_n^2$. Indeed, $\sigma_n^2$ is generally handcrafted in most prior works. In DDPMs (Ho et al., 2020), two commonly used settings are $\sigma_n^2 = \beta_n$ and $\sigma_n^2 = \tilde{\beta}_n$. In DDIMs, Song et al. (2020a) consistently use $\sigma_n^2 = \lambda_n^2$. We argue that these handcrafted values are not the true optimal solution of Eq. (4) in general, leading to suboptimal performance.
|
3 ANALYTIC ESTIMATE OF THE OPTIMAL REVERSE VARIANCE |
|
|
|
For a DPM, we first show that both the optimal mean $\boldsymbol{\mu}_n^*(\mathbf{x}_n)$ and the optimal variance $\sigma_n^{*2}$ for Eq. (4) have analytic forms w.r.t. the score function, as summarized in the following Theorem 1.

**Theorem 1. (Score representation of the optimal solution to Eq. (4), proof in Appendix A.2)**
_The optimal solution $\boldsymbol{\mu}_n^*(\mathbf{x}_n)$ and $\sigma_n^{*2}$ to Eq. (4) are_

$$\boldsymbol{\mu}_n^*(\mathbf{x}_n) = \tilde{\boldsymbol{\mu}}_n\left(\mathbf{x}_n, \frac{1}{\sqrt{\overline{\alpha}_n}}\left(\mathbf{x}_n + \overline{\beta}_n \nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)\right)\right), \tag{6}$$

$$\sigma_n^{*2} = \lambda_n^2 + \left(\sqrt{\frac{\overline{\beta}_n}{\alpha_n}} - \sqrt{\overline{\beta}_{n-1} - \lambda_n^2}\right)^2 \left(1 - \overline{\beta}_n \mathbb{E}_{q_n(\mathbf{x}_n)} \frac{\|\nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)\|^2}{d}\right), \tag{7}$$

_where $q_n(\mathbf{x}_n)$ is the marginal distribution of the forward process at timestep $n$ and $d$ is the dimension of the data._
|
|
|
The proof of Theorem 1 consists of three key steps:

- The first step (see Lemma 9) is known as moment matching (Minka, 2013), which states that approximating an arbitrary density by a Gaussian density under the KL divergence is equivalent to setting the first two moments of the two densities to be the same. To our knowledge, the connection between moment matching and DPMs has not been revealed before.

- In the second step (see Lemma 13), we carefully use the law of total variance conditioned on $\mathbf{x}_0$ and convert the second moment of $q(\mathbf{x}_{n-1}|\mathbf{x}_n)$ to that of $q(\mathbf{x}_0|\mathbf{x}_n)$.

- In the third step (see Lemma 11), we surprisingly find that the second moment of $q(\mathbf{x}_0|\mathbf{x}_n)$ can be represented by the score function, and we plug the score representation into the second moment of $q(\mathbf{x}_{n-1}|\mathbf{x}_n)$ to get the final results in Theorem 1.
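To make Theorem 1 concrete, Eq. (7) can be evaluated in closed form for a toy Gaussian data distribution $q(\mathbf{x}_0) = \mathcal{N}(\mathbf{0}, s^2\mathbf{I})$, for which $q_n = \mathcal{N}(\mathbf{0}, (\overline{\alpha}_n s^2 + \overline{\beta}_n)\mathbf{I})$ and the expected squared score norm is known exactly. This is an illustrative sanity check under an assumed linear schedule, not the estimator used in the paper:

```python
import numpy as np

# For q(x_0) = N(0, s^2 I): q_n = N(0, (alpha_bar_n s^2 + beta_bar_n) I), so
# E ||grad log q_n||^2 / d = 1 / (alpha_bar_n s^2 + beta_bar_n).
N = 1000
betas = np.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)
betas_bar = 1.0 - alphas_bar
betas_bar_prev = np.concatenate([[0.0], betas_bar[:-1]])
betas_tilde = betas_bar_prev / betas_bar * betas

def optimal_var(n, lam2, s2):
    """sigma_n^{*2} from Eq. (7) for q(x_0) = N(0, s^2 I); n is 1-indexed."""
    i = n - 1
    coef = np.sqrt(betas_bar[i] / alphas[i]) - np.sqrt(betas_bar_prev[i] - lam2)
    score_norm = 1.0 / (alphas_bar[i] * s2 + betas_bar[i])  # E||grad log q_n||^2 / d
    return lam2 + coef ** 2 * (1.0 - betas_bar[i] * score_norm)

# With the DDPM choice lam2 = beta_tilde_n: s^2 = 1 (standard Gaussian data)
# recovers sigma*^2 = beta_n, and s^2 = 0 (a point mass) recovers beta_tilde_n,
# matching the two handcrafted extremes discussed by Ho et al. (2020).
n = 500
sigma_star2 = optimal_var(n, betas_tilde[n - 1], s2=1.0)
```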
|
|
|
The results in Theorem 1 (and other results to appear later) can be further simplified for the DDPM forward process (i.e., $\lambda_n^2 = \tilde{\beta}_n$; see Appendix D for details). Besides, we can also extend Theorem 1 to DPMs with continuous timesteps (Song et al., 2020b; Kingma et al., 2021), where the corresponding optimal mean and variance are also determined by the score function in an analytic form (see Appendix E.1 for the extension).
|
|
|
Note that our analytic form of the optimal mean $\boldsymbol{\mu}_n^*(\mathbf{x}_n)$ in Eq. (6) and the previous parameterization of $\boldsymbol{\mu}_n(\mathbf{x}_n)$ (Ho et al., 2020) in Eq. (3) coincide. The only difference is that Eq. (3) replaces the score function $\nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)$ in Eq. (6) with the score-based model $\mathbf{s}_n(\mathbf{x}_n)$. This result explicitly shows that Eq. (5) essentially shares the same optimal mean solution as the $L_{\mathrm{vb}}$ objective, providing a simple and alternative perspective on prior works.

(Figure 1 here: panel (a) plots variance values against timestep $n$; panel (b) plots per-timestep $L_{\mathrm{vb}}$ terms in bits/dim.)

Figure 1: Comparing our analytic estimate $\hat{\sigma}_n^2$ and prior works with handcrafted variances $\beta_n$ and $\tilde{\beta}_n$. (a) compares the values of the variance for different timesteps. (b) compares the term in $L_{\mathrm{vb}}$ corresponding to each timestep. The value of $L_{\mathrm{vb}}$ is the area under the corresponding curve.
|
|
|
In contrast to the handcrafted strategies used in prior works (Ho et al., 2020; Song et al., 2020a), Theorem 1 shows that the optimal reverse variance $\sigma_n^{*2}$ can also be estimated without any extra training process given a pretrained score-based model $\mathbf{s}_n(\mathbf{x}_n)$. In fact, we first estimate the expected mean squared norm of $\nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)$ by $\Gamma = (\Gamma_1, ..., \Gamma_N)$, where

$$\Gamma_n = \frac{1}{M} \sum_{m=1}^{M} \frac{\|\mathbf{s}_n(\mathbf{x}_{n,m})\|^2}{d}, \qquad \mathbf{x}_{n,m} \overset{iid}{\sim} q_n(\mathbf{x}_n). \tag{8}$$

$M$ is the number of Monte Carlo samples. We only need to calculate $\Gamma$ once for a pretrained model and reuse it in downstream computations (see Appendix H.1 for a detailed discussion of the computation cost of $\Gamma$). Then, according to Eq. (7), we estimate $\sigma_n^{*2}$ as follows:
|
|
|
|
|
$$\hat{\sigma}_n^2 = \lambda_n^2 + \left(\sqrt{\frac{\overline{\beta}_n}{\alpha_n}} - \sqrt{\overline{\beta}_{n-1} - \lambda_n^2}\right)^2 \left(1 - \overline{\beta}_n \Gamma_n\right). \tag{9}$$
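Eqs. (8) and (9) can be sketched as follows. For testability, the "pretrained model" here is the exact score of a toy distribution $q(\mathbf{x}_0) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ (an assumption for illustration); in practice $\mathbf{s}_n$ is a neural network and $\Gamma_n$ is a plain Monte Carlo average over its outputs:

```python
import numpy as np

# Monte Carlo estimate of Gamma_n (Eq. 8) and the plug-in variance (Eq. 9),
# with the exact Gaussian score standing in for the pretrained model.
N, d, M = 1000, 16, 4000
betas = np.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)
betas_bar = 1.0 - alphas_bar
betas_bar_prev = np.concatenate([[0.0], betas_bar[:-1]])
betas_tilde = betas_bar_prev / betas_bar * betas
rng = np.random.default_rng(0)

def gamma_n(n):
    """Eq. (8): Gamma_n = (1/M) sum_m ||s_n(x_{n,m})||^2 / d, with x_{n,m} ~ q_n."""
    i = n - 1
    v = alphas_bar[i] + betas_bar[i]              # marginal variance of q_n (= 1 here)
    x = rng.standard_normal((M, d)) * np.sqrt(v)  # x_{n,m} ~ q_n(x_n)
    score = -x / v                                # exact score of N(0, v I)
    return np.mean(np.sum(score ** 2, axis=1)) / d

def sigma_hat2(n, lam2):
    """Eq. (9): plug Gamma_n into the analytic form of the optimal variance."""
    i = n - 1
    coef = np.sqrt(betas_bar[i] / alphas[i]) - np.sqrt(betas_bar_prev[i] - lam2)
    return lam2 + coef ** 2 * (1.0 - betas_bar[i] * gamma_n(n))

n = 500
est = sigma_hat2(n, betas_tilde[n - 1])           # close to beta_n for this toy data
```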
|
|
|
|
|
We empirically validate Theorem 1. In Figure 1 (a), we plot our analytic estimate $\hat{\sigma}_n^2$ of a DDPM trained on CIFAR10, as well as the baselines $\beta_n$ and $\tilde{\beta}_n$ used by Ho et al. (2020). At small timesteps, these strategies behave differently. Figure 1 (b) shows that our $\hat{\sigma}_n^2$ outperforms the baselines for each term of $L_{\mathrm{vb}}$, especially at the small timesteps. We also obtain similar results on other datasets (see Appendix G.1). Besides, we show that only a small number of Monte Carlo (MC) samples (e.g., $M = 10, 100$) is required to achieve a sufficiently small variance caused by MC and get a similar performance to that with a large $M$ (see Appendix G.2). We also discuss the stochasticity of $L_{\mathrm{vb}}$ after plugging in $\hat{\sigma}_n^2$ in Appendix H.2.
|
|
|
3.1 BOUNDING THE OPTIMAL REVERSE VARIANCE TO REDUCE BIAS |
|
|
|
|
|
According to Eq. (7) and Eq. (9), the bias of the analytic estimate $\hat{\sigma}_n^2$ is

$$|\sigma_n^{*2} - \hat{\sigma}_n^2| = \underbrace{\left(\sqrt{\frac{\overline{\beta}_n}{\alpha_n}} - \sqrt{\overline{\beta}_{n-1} - \lambda_n^2}\right)^2 \overline{\beta}_n}_{\text{Coefficient}} \cdot \underbrace{\left|\Gamma_n - \mathbb{E}_{q_n(\mathbf{x}_n)} \frac{\|\nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)\|^2}{d}\right|}_{\text{Approximation error}}. \tag{10}$$

Our estimate of the variance employs a score-based model $\mathbf{s}_n(\mathbf{x}_n)$ to approximate the true score function $\nabla_{\mathbf{x}_n} \log q_n(\mathbf{x}_n)$. Thus, the approximation error in Eq. (10) is irreducible given a pretrained model. Meanwhile, the coefficient in Eq. (10) can be large if we use a shorter trajectory to sample (see details in Section 4), potentially resulting in a large bias.
|
|
|
|
|
|
|
|
To reduce the bias, we derive bounds on the optimal reverse variance $\sigma_n^{*2}$ and clip our estimate based on the bounds. Importantly, these bounds are unrelated to the data distribution $q(\mathbf{x}_0)$ and hence can be efficiently calculated. We first derive both upper and lower bounds of $\sigma_n^{*2}$ without any assumption about the data. Then we show another upper bound of $\sigma_n^{*2}$ if the data distribution is bounded. We formalize these bounds in Theorem 2.

**Theorem 2. (Bounds of the optimal reverse variance, proof in Appendix A.3)**
|
|
|
_$\sigma_n^{*2}$ has the following lower and upper bounds:_

$$\lambda_n^2 \leq \sigma_n^{*2} \leq \lambda_n^2 + \left(\sqrt{\frac{\overline{\beta}_n}{\alpha_n}} - \sqrt{\overline{\beta}_{n-1} - \lambda_n^2}\right)^2. \tag{11}$$
|
|
|
|
|
_If we further assume $q(\mathbf{x}_0)$ is a bounded distribution in $[a, b]^d$, where $d$ is the dimension of the data, then $\sigma_n^{*2}$ can be further upper bounded by_

$$\sigma_n^{*2} \leq \lambda_n^2 + \left(\sqrt{\overline{\alpha}_{n-1}} - \sqrt{\overline{\beta}_{n-1} - \lambda_n^2} \cdot \sqrt{\frac{\overline{\alpha}_n}{\overline{\beta}_n}}\right)^2 \left(\frac{b - a}{2}\right)^2. \tag{12}$$
|
|
|
Theorem 2 states that the handcrafted reverse variance $\lambda_n^2$ in prior works (Ho et al., 2020; Song et al., 2020a) underestimates $\sigma_n^{*2}$. For instance, $\lambda_n^2 = \tilde{\beta}_n$ in DDPM. We compare it with our estimate in Figure 1 (a) and the results agree with Theorem 2. Besides, the boundedness assumption on $q(\mathbf{x}_0)$ is satisfied in many scenarios including generative modeling of images, and which upper bound among Eq. (11) and Eq. (12) is tighter depends on $n$. Therefore, we clip our estimate based on the minimum of the two. Further, we show these bounds are numerically tight in Appendix G.3.
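The clipping step can be sketched as follows, assuming image data rescaled to $[-1, 1]^d$ (so $a = -1$, $b = 1$ in Eq. (12)) and an assumed linear schedule; the raw estimate is clipped into $[\lambda_n^2, \min(\text{upper bound of Eq. (11)}, \text{upper bound of Eq. (12)})]$:

```python
import numpy as np

# Clip an estimate of sigma_n^{*2} to the bounds of Theorem 2.
N = 1000
betas = np.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)
betas_bar = 1.0 - alphas_bar
betas_bar_prev = np.concatenate([[0.0], betas_bar[:-1]])
alphas_bar_prev = np.concatenate([[1.0], alphas_bar[:-1]])

def clip_variance(sigma2_hat, n, lam2, a=-1.0, b=1.0):
    """Clip sigma2_hat into [lambda^2, min(Eq. 11 upper, Eq. 12 upper)]; n is 1-indexed."""
    i = n - 1
    coef = np.sqrt(betas_bar[i] / alphas[i]) - np.sqrt(betas_bar_prev[i] - lam2)
    upper_11 = lam2 + coef ** 2                                   # Eq. (11)
    gap = np.sqrt(alphas_bar_prev[i]) - np.sqrt(
        (betas_bar_prev[i] - lam2) * alphas_bar[i] / betas_bar[i])
    upper_12 = lam2 + gap ** 2 * ((b - a) / 2.0) ** 2             # Eq. (12)
    return float(np.clip(sigma2_hat, lam2, min(upper_11, upper_12)))
```

Because the bounds only involve schedule constants, this clipping adds no neural network evaluations.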
|
|
|
4 ANALYTIC ESTIMATION OF THE OPTIMAL TRAJECTORY |
|
|
|
The number of full timesteps $N$ can be large, making the inference slow in practice. Thereby, we can construct a shorter forward process $q(\mathbf{x}_{\tau_1}, \cdots, \mathbf{x}_{\tau_K}|\mathbf{x}_0)$ constrained on a trajectory $1 = \tau_1 < \cdots < \tau_K = N$ of $K$ timesteps (Song et al., 2020a; Nichol & Dhariwal, 2021; Watson et al., 2021), and $K$ can be much smaller than $N$ to speed up the inference. Formally, the shorter process is defined as $q(\mathbf{x}_{\tau_1}, \cdots, \mathbf{x}_{\tau_K}|\mathbf{x}_0) = q(\mathbf{x}_{\tau_K}|\mathbf{x}_0) \prod_{k=2}^{K} q(\mathbf{x}_{\tau_{k-1}}|\mathbf{x}_{\tau_k}, \mathbf{x}_0)$, where

$$q(\mathbf{x}_{\tau_{k-1}}|\mathbf{x}_{\tau_k}, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{\tau_{k-1}}|\tilde{\boldsymbol{\mu}}_{\tau_{k-1}|\tau_k}(\mathbf{x}_{\tau_k}, \mathbf{x}_0), \lambda^2_{\tau_{k-1}|\tau_k} \mathbf{I}), \tag{13}$$

$$\tilde{\boldsymbol{\mu}}_{\tau_{k-1}|\tau_k}(\mathbf{x}_{\tau_k}, \mathbf{x}_0) = \sqrt{\overline{\alpha}_{\tau_{k-1}}}\,\mathbf{x}_0 + \sqrt{\overline{\beta}_{\tau_{k-1}} - \lambda^2_{\tau_{k-1}|\tau_k}} \cdot \frac{\mathbf{x}_{\tau_k} - \sqrt{\overline{\alpha}_{\tau_k}}\,\mathbf{x}_0}{\sqrt{\overline{\beta}_{\tau_k}}}.$$

The corresponding reverse process is $p(\mathbf{x}_0, \mathbf{x}_{\tau_1}, \cdots, \mathbf{x}_{\tau_K}) = p(\mathbf{x}_{\tau_K}) \prod_{k=1}^{K} p(\mathbf{x}_{\tau_{k-1}}|\mathbf{x}_{\tau_k})$, where $p(\mathbf{x}_{\tau_{k-1}}|\mathbf{x}_{\tau_k}) = \mathcal{N}(\mathbf{x}_{\tau_{k-1}}|\boldsymbol{\mu}_{\tau_{k-1}|\tau_k}(\mathbf{x}_{\tau_k}), \sigma^2_{\tau_{k-1}|\tau_k} \mathbf{I})$.

According to Theorem 1, the mean and variance of the optimal $p^*(\mathbf{x}_{\tau_{k-1}}|\mathbf{x}_{\tau_k})$ in the sense of KL minimization are

$$\boldsymbol{\mu}^*_{\tau_{k-1}|\tau_k}(\mathbf{x}_{\tau_k}) = \tilde{\boldsymbol{\mu}}_{\tau_{k-1}|\tau_k}\left(\mathbf{x}_{\tau_k}, \frac{1}{\sqrt{\overline{\alpha}_{\tau_k}}}\left(\mathbf{x}_{\tau_k} + \overline{\beta}_{\tau_k} \nabla_{\mathbf{x}_{\tau_k}} \log q(\mathbf{x}_{\tau_k})\right)\right),$$

$$\sigma^{*2}_{\tau_{k-1}|\tau_k} = \lambda^2_{\tau_{k-1}|\tau_k} + \left(\sqrt{\frac{\overline{\beta}_{\tau_k}}{\alpha_{\tau_k|\tau_{k-1}}}} - \sqrt{\overline{\beta}_{\tau_{k-1}} - \lambda^2_{\tau_{k-1}|\tau_k}}\right)^2 \left(1 - \overline{\beta}_{\tau_k} \mathbb{E}_{q(\mathbf{x}_{\tau_k})} \frac{\|\nabla_{\mathbf{x}_{\tau_k}} \log q(\mathbf{x}_{\tau_k})\|^2}{d}\right),$$

where $\alpha_{\tau_k|\tau_{k-1}} := \overline{\alpha}_{\tau_k}/\overline{\alpha}_{\tau_{k-1}}$. According to Theorem 2, we can derive similar bounds for $\sigma^{*2}_{\tau_{k-1}|\tau_k}$ (see details in Appendix C). Similarly to Eq. (9), the estimate of $\sigma^{*2}_{\tau_{k-1}|\tau_k}$ is

$$\hat{\sigma}^2_{\tau_{k-1}|\tau_k} = \lambda^2_{\tau_{k-1}|\tau_k} + \left(\sqrt{\frac{\overline{\beta}_{\tau_k}}{\alpha_{\tau_k|\tau_{k-1}}}} - \sqrt{\overline{\beta}_{\tau_{k-1}} - \lambda^2_{\tau_{k-1}|\tau_k}}\right)^2 \left(1 - \overline{\beta}_{\tau_k} \Gamma_{\tau_k}\right),$$
|
|
|
|
|
|
|
|
where $\Gamma$ is defined in Eq. (8) and can be shared across different selections of trajectories. Based on the optimal reverse process $p^*$ above, we further optimize the trajectory:
|
|
|
|
|
$$\min_{\tau_1, \cdots, \tau_K} D_{\mathrm{KL}}(q(\mathbf{x}_0, \mathbf{x}_{\tau_1}, \cdots, \mathbf{x}_{\tau_K})\,||\,p^*(\mathbf{x}_0, \mathbf{x}_{\tau_1}, \cdots, \mathbf{x}_{\tau_K})) = \frac{d}{2} \sum_{k=2}^{K} J(\tau_{k-1}, \tau_k) + c, \tag{14}$$
|
|
|
|
|
where $J(\tau_{k-1}, \tau_k) = \log(\sigma^{*2}_{\tau_{k-1}|\tau_k}/\lambda^2_{\tau_{k-1}|\tau_k})$ and $c$ is a constant unrelated to the trajectory $\tau$ (see proof in Appendix A.4). The KL in Eq. (14) can be decomposed into $K - 1$ terms and each term has an analytic form w.r.t. the score function. We view each term as a cost function $J$ evaluated at $(\tau_{k-1}, \tau_k)$, and it can be efficiently estimated by $J(\tau_{k-1}, \tau_k) \approx \log(\hat{\sigma}^2_{\tau_{k-1}|\tau_k}/\lambda^2_{\tau_{k-1}|\tau_k})$, which doesn't require any neural network computation once $\Gamma$ is given. While the logarithmic function causes bias even when the correct score function is known, the bias can be reduced by increasing $M$.
|
|
|
As a result, Eq. (14) reduces to a canonical least-cost-path problem (Watson et al., 2021) on a directed graph, where the nodes are $\{1, 2, \cdots, N\}$ and the edge from $s$ to $t$ has cost $J(s, t)$. We want to find a least-cost path of $K$ nodes starting from 1 and terminating at $N$. This problem can be solved by the dynamic programming (DP) algorithm introduced by Watson et al. (2021). We present this algorithm in Appendix B. Besides, we can also extend Eq. (14) to DPMs with continuous timesteps (Song et al., 2020b; Kingma et al., 2021), where the corresponding optimal KL divergences also decompose into terms determined by score functions. Thereby, the DP algorithm is also applicable. See Appendix E.2 for the extension.
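A least-cost-path dynamic program of this kind can be sketched as follows. It follows the formulation of Watson et al. (2021), but the implementation details and the quadratic toy cost are our own assumptions; in the paper the edge cost is $J(s, t) \approx \log(\hat{\sigma}^2_{s|t}/\lambda^2_{s|t})$, precomputed from $\Gamma$:

```python
import numpy as np

# Find 1 = tau_1 < ... < tau_K = N minimizing sum_k J(tau_{k-1}, tau_k).
def optimal_trajectory(J, N, K):
    """J[s, t] is the cost of the edge s -> t (1-indexed, s < t)."""
    C = np.full((K + 1, N + 1), np.inf)      # C[k, t]: best cost of a k-node path 1 -> t
    P = np.zeros((K + 1, N + 1), dtype=int)  # predecessor for path recovery
    C[1, 1] = 0.0
    for k in range(2, K + 1):
        for t in range(k, N + 1):
            for s in range(k - 1, t):        # a (k-1)-node path needs s >= k-1
                c = C[k - 1, s] + J[s, t]
                if c < C[k, t]:
                    C[k, t], P[k, t] = c, s
    # recover the trajectory ending at N with exactly K nodes
    traj, t = [N], N
    for k in range(K, 1, -1):
        t = P[k, t]
        traj.append(t)
    return traj[::-1], C[K, N]

# Toy example: cost grows quadratically with step size, so even strides win.
N, K = 9, 5
J = np.zeros((N + 1, N + 1))
for s in range(1, N + 1):
    for t in range(s + 1, N + 1):
        J[s, t] = (t - s) ** 2
traj, cost = optimal_trajectory(J, N, K)
```

The DP runs in $O(KN^2)$ time over precomputed costs, so the trajectory search itself needs no network evaluations.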
|
|
|
5 RELATIONSHIP BETWEEN THE SCORE FUNCTION AND THE DATA |
|
COVARIANCE MATRIX |
|
|
|
In this part, we further reveal a relationship between the score function and the data covariance matrix. Indeed, the data covariance matrix can be decomposed into the sum of $\mathbb{E}_{q(\mathbf{x}_n)}\mathrm{Cov}_{q(\mathbf{x}_0|\mathbf{x}_n)}[\mathbf{x}_0]$ and $\mathrm{Cov}_{q(\mathbf{x}_n)}\mathbb{E}_{q(\mathbf{x}_0|\mathbf{x}_n)}[\mathbf{x}_0]$, where the first term can be represented by the score function. Further, the second term is negligible when $n$ is sufficiently large because $\mathbf{x}_0$ and $\mathbf{x}_n$ are almost independent. In such cases, the data covariance matrix is almost determined by the score function. Currently, the relationship is purely theoretical and its practical implication is unclear. See details in Appendix A.5.
|
|
|
6 EXPERIMENTS |
|
|
|
We consider the DDPM forward process ($\lambda_n^2 = \tilde{\beta}_n$) and the DDIM forward process ($\lambda_n^2 = 0$), which are the two most commonly used special cases of Eq. (2). We denote our method, which uses the analytic estimate $\sigma_n^2 = \hat{\sigma}_n^2$, as _Analytic-DPM_, and explicitly call it _Analytic-DDPM_ or _Analytic-DDIM_ according to which forward process is used. We compare our Analytic-DPM with the original DDPM (Ho et al., 2020), where the reverse variance is either $\sigma_n^2 = \tilde{\beta}_n$ or $\sigma_n^2 = \beta_n$, as well as the original DDIM (Song et al., 2020a), where the reverse variance is $\sigma_n^2 = \lambda_n^2 = 0$.
|
|
|
We adopt two methods to get the trajectory for both Analytic-DPM and the baselines. The first one is the even trajectory (ET) (Nichol & Dhariwal, 2021), where the timesteps are determined according to a fixed stride (see details in Appendix F.4). The second one is the optimal trajectory (OT) (Watson et al., 2021), where the timesteps are calculated via dynamic programming (see Section 4). Note that the baselines calculate the OT based on the $L_{\mathrm{vb}}$ with their handcrafted variances (Watson et al., 2021).
|
|
|
We apply our Analytic-DPM to three pretrained score-based models provided by prior works (Ho |
|
et al., 2020; Song et al., 2020a; Nichol & Dhariwal, 2021), as well as two score-based models |
|
trained by ourselves. The pretrained score-based models are trained on CelebA 64x64 (Liu et al., |
|
2015), ImageNet 64x64 (Deng et al., 2009) and LSUN Bedroom (Yu et al., 2015) respectively. Our |
|
score-based models are trained on CIFAR10 (Krizhevsky et al., 2009) with two different forward |
|
noise schedules: the linear schedule (LS) (Ho et al., 2020) and the cosine schedule (CS) (Nichol & |
|
Dhariwal, 2021). We denote them as CIFAR10 (LS) and CIFAR10 (CS) respectively. The number |
|
of the full timesteps N is 4000 for ImageNet 64x64 and 1000 for other datasets. During sampling, |
|
we only display the mean of $p(\mathbf{x}_0|\mathbf{x}_1)$ and discard the noise following Ho et al. (2020), and we additionally clip the noise scale $\sigma_2$ of $p(\mathbf{x}_1|\mathbf{x}_2)$ for all methods compared in Table 2 (see details in Appendix F.2 and its ablation study in Appendix G.4). See more experimental details in Appendix F.
|
|
|
|
|
|
|
|
Table 1: Negative log-likelihood (bits/dim) ↓ under the DDPM forward process. We show results under trajectories of different numbers of timesteps $K$. We select the minimum $K$ such that Analytic-DPM can outperform the baselines with full timesteps and underline the corresponding results.

| Model \ # timesteps $K$ | 10 | 25 | 50 | 100 | 200 | 400 | 1000 |
|---|---|---|---|---|---|---|---|
| **CIFAR10 (LS)** |  |  |  |  |  |  |  |
| ET: DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | 74.95 | 24.98 | 12.01 | 7.08 | 5.03 | 4.29 | 3.73 |
| ET: DDPM, $\sigma_n^2 = \beta_n$ | 6.99 | 6.11 | 5.44 | 4.86 | 4.39 | 4.07 | 3.75 |
| ET: Analytic-DDPM | **5.47** | **4.79** | **4.38** | **4.07** | **3.84** | **3.71** | **3.59** |
| OT: DDPM, $\sigma_n^2 = \beta_n$ | 5.38 | 4.34 | 3.97 | 3.82 | 3.77 | 3.75 | 3.75 |
| OT: Analytic-DDPM | **4.11** | **3.68** | **3.61** | **3.59** | **3.59** | **3.59** | **3.59** |
| **CIFAR10 (CS)** |  |  |  |  |  |  |  |
| ET: DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | 75.96 | 24.94 | 11.96 | 7.04 | 4.95 | 4.19 | 3.60 |
| ET: DDPM, $\sigma_n^2 = \beta_n$ | 6.51 | 5.55 | 4.92 | 4.41 | 4.03 | 3.78 | 3.54 |
| ET: Analytic-DDPM | **5.08** | **4.45** | **4.09** | **3.83** | **3.64** | **3.53** | **3.42** |
| OT: DDPM, $\sigma_n^2 = \beta_n$ | 5.51 | 4.30 | 3.86 | 3.65 | 3.57 | 3.54 | 3.54 |
| OT: Analytic-DDPM | **3.99** | **3.56** | **3.47** | **3.44** | **3.43** | **3.42** | **3.42** |
| **CelebA 64x64** |  |  |  |  |  |  |  |
| ET: DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | 33.42 | 13.09 | 7.14 | 4.60 | 3.45 | 3.03 | 2.71 |
| ET: DDPM, $\sigma_n^2 = \beta_n$ | 6.67 | 5.72 | 4.98 | 4.31 | 3.74 | 3.34 | 2.93 |
| ET: Analytic-DDPM | **4.54** | **3.89** | **3.48** | **3.16** | **2.92** | **2.79** | **2.66** |
| OT: DDPM, $\sigma_n^2 = \beta_n$ | 4.76 | 3.58 | 3.16 | 2.99 | 2.94 | 2.93 | 2.93 |
| OT: Analytic-DDPM | **2.97** | **2.71** | **2.67** | **2.66** | **2.66** | **2.66** | **2.66** |

| Model \ # timesteps $K$ | 25 | 50 | 100 | 200 | 400 | 1000 | 4000 |
|---|---|---|---|---|---|---|---|
| **ImageNet 64x64** |  |  |  |  |  |  |  |
| ET: DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | 105.87 | 46.25 | 22.02 | 12.10 | 7.59 | 5.04 | 3.89 |
| ET: DDPM, $\sigma_n^2 = \beta_n$ | 5.81 | 5.20 | 4.70 | 4.31 | 4.04 | 3.81 | 3.65 |
| ET: Analytic-DDPM | **4.78** | **4.42** | **4.15** | **3.95** | **3.81** | **3.69** | **3.61** |
| OT: DDPM, $\sigma_n^2 = \beta_n$ | 4.56 | 4.09 | 3.84 | 3.73 | 3.68 | 3.65 | 3.65 |
| OT: Analytic-DDPM | **3.83** | **3.70** | **3.64** | **3.62** | **3.62** | **3.61** | **3.61** |
|
|
|
We conduct extensive experiments to demonstrate that Analytic-DPM can consistently improve the inference efficiency of a pretrained DPM while achieving comparable or even superior performance. Specifically, Sections 6.1 and 6.2 present the likelihood and sample quality results respectively. Additional experiments such as ablation studies can be found in Appendix G.
|
|
|
6.1 LIKELIHOOD RESULTS |
|
|
|
Since $\lambda_n^2 = 0$ in the DDIM forward process, its variational bound $L_{\mathrm{vb}}$ is infinite. Thereby, we only consider the likelihood results under the DDPM forward process. As shown in Table 1, on all three datasets, our Analytic-DPM consistently improves the likelihood results of the original DDPM using both ET and OT. Remarkably, using a much shorter trajectory (i.e., much less inference time), Analytic-DPM with OT can still outperform the baselines. In Table 1, we select the minimum $K$ such that Analytic-DPM can outperform the baselines with full timesteps and underline the corresponding results. Specifically, Analytic-DPM enjoys a 40× speed up on CIFAR10 (LS) and ImageNet 64x64, and a 20× speed up on CIFAR10 (CS) and CelebA 64x64.
|
|
|
Although we mainly focus on learning-free strategies for choosing the reverse variance, we also compare to another strong baseline that predicts the variance by a neural network (Nichol & Dhariwal, 2021). With full timesteps, Analytic-DPM achieves an NLL of 3.61 on ImageNet 64x64, which is very close to the 3.57 reported in Nichol & Dhariwal (2021). Besides, while Nichol & Dhariwal (2021) report that the ET drastically reduces the log-likelihood performance of their neural-network-parameterized variance, Analytic-DPM performs well with the ET. See details in Appendix G.6.
|
|
|
|
|
|
|
|
Table 2: FID ↓ under the DDPM and DDIM forward processes. All are evaluated under the even trajectory (ET). The result with † is slightly better than the 3.17 reported by Ho et al. (2020), because we use an improved model architecture following Nichol & Dhariwal (2021).

| Model \ # timesteps $K$ | 10 | 25 | 50 | 100 | 200 | 400 | 1000 |
|---|---|---|---|---|---|---|---|
| **CIFAR10 (LS)** |  |  |  |  |  |  |  |
| DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | 44.45 | 21.83 | 15.21 | 10.94 | 8.23 | 6.43 | 5.11 |
| DDPM, $\sigma_n^2 = \beta_n$ | 233.41 | 125.05 | 66.28 | 31.36 | 12.96 | 4.86 | †3.04 |
| Analytic-DDPM | **34.26** | **11.60** | **7.25** | **5.40** | **4.01** | **3.62** | 4.03 |
| DDIM | 21.31 | 10.70 | 7.74 | 6.08 | 5.07 | 4.61 | 4.13 |
| Analytic-DDIM | **14.00** | **5.81** | **4.04** | **3.55** | **3.39** | **3.50** | **3.74** |
| **CIFAR10 (CS)** |  |  |  |  |  |  |  |
| DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | 34.76 | 16.18 | 11.11 | 8.38 | 6.66 | 5.65 | 4.92 |
| DDPM, $\sigma_n^2 = \beta_n$ | 205.31 | 84.71 | 37.35 | 14.81 | 5.74 | **3.40** | **3.34** |
| Analytic-DDPM | **22.94** | **8.50** | **5.50** | **4.45** | **4.04** | 3.96 | 4.31 |
| DDIM | 34.34 | 16.68 | 10.48 | 7.94 | 6.69 | 5.78 | 4.89 |
| Analytic-DDIM | **26.43** | **9.96** | **6.02** | **4.88** | **4.92** | **5.00** | **4.66** |
| **CelebA 64x64** |  |  |  |  |  |  |  |
| DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | 36.69 | 24.46 | 18.96 | 14.31 | 10.48 | 8.09 | 5.95 |
| DDPM, $\sigma_n^2 = \beta_n$ | 294.79 | 115.69 | 53.39 | 25.65 | 9.72 | **3.95** | **3.16** |
| Analytic-DDPM | **28.99** | **16.01** | **11.23** | **8.08** | **6.51** | 5.87 | 5.21 |
| DDIM | 20.54 | 13.45 | 9.33 | 6.60 | 4.96 | 4.15 | 3.40 |
| Analytic-DDIM | **15.62** | **9.22** | **6.13** | **4.29** | **3.46** | **3.38** | **3.13** |

| Model \ # timesteps $K$ | 25 | 50 | 100 | 200 | 400 | 1000 | 4000 |
|---|---|---|---|---|---|---|---|
| **ImageNet 64x64** |  |  |  |  |  |  |  |
| DDPM, $\sigma_n^2 = \tilde{\beta}_n$ | **29.21** | **21.71** | 19.12 | 17.81 | 17.48 | 16.84 | 16.55 |
| DDPM, $\sigma_n^2 = \beta_n$ | 170.28 | 83.86 | 45.04 | 28.39 | 21.38 | 17.58 | 16.38 |
| Analytic-DDPM | 32.56 | 22.45 | **18.80** | **17.16** | **16.40** | **16.14** | **16.34** |
| DDIM | 26.06 | 20.10 | 18.09 | 17.84 | 17.74 | 17.73 | 19.00 |
| Analytic-DDIM | **25.98** | **19.23** | **17.73** | **17.49** | **17.44** | **17.57** | **18.98** |
|
|
|
6.2 SAMPLE QUALITY |
|
|
|
As for sample quality, we consider the commonly used FID score (Heusel et al., 2017), where a lower value indicates better sample quality. As shown in Table 2, under trajectories of different $K$, our Analytic-DDIM consistently improves the sample quality of the original DDIM. This allows us to generate high-quality samples with fewer than 50 timesteps, which results in a 20× to 80× speed up compared to the full timesteps. Indeed, in most cases, Analytic-DDIM requires at most 50 timesteps to match the performance of the baselines. Besides, Analytic-DDPM also improves the sample quality of the original DDPM in most cases. For fairness, we use the ET implementation in Nichol & Dhariwal (2021) for all results in Table 2. We also report the results on CelebA 64x64 using a slightly different implementation of the ET following Song et al. (2020a) in Appendix G.7, where our Analytic-DPM is still effective. We show generated samples in Appendix G.9.
|
|
|
We observe that Analytic-DDPM does not always outperform the baseline under the FID metric, which is inconsistent with the likelihood results in Table 1. Such behavior is essentially rooted in the different natures of the two metrics and has been investigated in extensive prior works (Theis et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021; Song et al., 2021; Vahdat et al., 2021; Watson et al., 2021; Kingma et al., 2021). Similarly, using more timesteps doesn't necessarily yield a better FID. For instance, see the Analytic-DDPM results on CIFAR10 (LS) and the DDIM results on ImageNet 64x64 in Table 2. A similar phenomenon is observed in Figure 8 in Nichol & Dhariwal (2021). Moreover, a DPM (including Analytic-DPM) with OT does not necessarily lead to a better FID score (Watson et al., 2021) (see Appendix G.5 for a comparison of ET and OT in Analytic-DPM).
|
|
|
|
|
|
|
|
Table 3: Efficiency comparison, based on the least number of timesteps ↓ required to achieve an FID around 6 (the corresponding FID is shown in parentheses). To get the strongest baseline, the results with † are achieved by using the quadratic trajectory (Song et al., 2020a) instead of the default even trajectory.

| Method | CIFAR10 | CelebA 64x64 | LSUN Bedroom |
|---|---|---|---|
| DDPM (Ho et al., 2020) | †90 (6.12) | > 200 | 130 (6.06) |
| DDIM (Song et al., 2020a) | †30 (5.85) | > 100 | Best FID > 6 |
| Improved DDPM (Nichol & Dhariwal, 2021) | 45 (5.96) | Missing model | **90 (6.02)** |
| Analytic-DPM (ours) | **25 (5.81)** | **55 (5.98)** | 100 (6.05) |
|
|
|
We summarize the efficiency of different methods in Table 3, where we consider the least number |
|
of timesteps required to achieve a FID around 6 as the metric for a more direct comparison. |
|
|
|
7 RELATED WORK |
|
|
|
**DPMs and their applications.** The diffusion probabilistic model (DPM) was initially introduced by Sohl-Dickstein et al. (2015), where the DPM is trained by optimizing the variational bound Lvb. Ho et al. (2020) propose the new parameterization of DPMs in Eq. (3) and learn DPMs with the reweighted variant of Lvb in Eq. (5). Song et al. (2020b) model the noise-adding forward process as a stochastic differential equation (SDE) and introduce DPMs with continuous timesteps. With these important improvements, DPMs show great potential in various applications, including speech synthesis (Chen et al., 2020; Kong et al., 2020; Popov et al., 2021; Lam et al., 2021), controllable generation (Choi et al., 2021; Sinha et al., 2021), image super-resolution (Saharia et al., 2021; Li et al., 2021), image-to-image translation (Sasaki et al., 2021), shape generation (Zhou et al., 2021), and time series forecasting (Rasul et al., 2021).
|
|
|
**Faster DPMs.** Several works attempt to find short trajectories while maintaining the performance of DPMs. Chen et al. (2020) find an effective trajectory of only six timesteps by grid search. However, grid search is only applicable to very short trajectories due to its exponentially growing time complexity. Watson et al. (2021) model trajectory search as a least-cost-path problem and introduce a dynamic programming (DP) algorithm to solve it. Our work uses this DP algorithm, where the cost is defined as a term of the optimal KL divergence. In addition to these trajectory search techniques, Luhman & Luhman (2021) compress the reverse denoising process into a single-step model, and San-Roman et al. (2021) dynamically adjust the trajectory during inference. Both require extra training on top of a pretrained DPM. As for DPMs with continuous timesteps, Song et al. (2020b) introduce an ordinary differential equation (ODE), which improves sampling efficiency and enables exact likelihood computation. However, the likelihood computation involves a stochastic trace estimator, which requires multiple runs for an accurate estimate. Jolicoeur-Martineau et al. (2021) introduce an advanced SDE solver to simulate the reverse process more efficiently. However, the log-likelihood computation based on this solver is not specified.
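The least-cost-path search described above can be sketched as follows (a minimal illustration, not the authors' implementation; the quadratic `cost` matrix below is a hypothetical stand-in for the KL-derived cost):

```python
import numpy as np

def best_trajectory(cost, K):
    """Least-cost path from timestep 0 to N using exactly K jumps.

    cost[s, t] is the cost of jumping from timestep s to a later timestep t;
    in Analytic-DPM it would be a term of the optimal KL divergence, here it
    is an arbitrary placeholder. Returns the minimal total cost and the path.
    """
    N = cost.shape[0] - 1
    INF = float("inf")
    # dp[k, t]: minimal cost of reaching timestep t with exactly k jumps
    dp = np.full((K + 1, N + 1), INF)
    dp[0, 0] = 0.0
    parent = np.zeros((K + 1, N + 1), dtype=int)
    for k in range(1, K + 1):
        for t in range(1, N + 1):
            for s in range(t):
                c = dp[k - 1, s] + cost[s, t]
                if c < dp[k, t]:
                    dp[k, t] = c
                    parent[k, t] = s
    # backtrack from timestep N
    traj, t = [N], N
    for k in range(K, 0, -1):
        t = parent[k, t]
        traj.append(t)
    return dp[K, N], traj[::-1]

# toy example: a quadratic jump cost favors many small steps, so the DP
# spreads the K jumps evenly over the N timesteps
N = 10
s_idx, t_idx = np.meshgrid(np.arange(N + 1), np.arange(N + 1), indexing="ij")
cost = (t_idx - s_idx).astype(float) ** 2
total, traj = best_trajectory(cost, K=5)  # traj = [0, 2, 4, 6, 8, 10]
```

The DP runs in O(K·N²) time, which is what makes it tractable where exhaustive grid search is not.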
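On the stochastic trace estimator mentioned above: it is typically the Hutchinson estimator, which approximates $\mathrm{tr}(A)$ by averaging $v^\top A v$ over random probe vectors $v$, so an accurate likelihood needs several probe runs. A minimal sketch on an explicit matrix (in the ODE setting, $A$ is a Jacobian accessed only through vector products):

```python
import numpy as np

def hutchinson_trace(A, num_probes, rng):
    """Estimate tr(A) as the average of v^T A v over Rademacher probes v."""
    d = A.shape[0]
    est = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=d)  # Rademacher probe
        est += v @ A @ v
    return est / num_probes

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
exact = np.trace(A)
approx = hutchinson_trace(A, num_probes=20000, rng=rng)
# the estimator is unbiased; its error shrinks as 1/sqrt(num_probes)
```

For a diagonal matrix the Rademacher probes give the trace exactly, since $v_i^2 = 1$; in general the variance depends on the off-diagonal mass of $A$.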
|
|
|
**Variance Learning in DPMs.** In addition to the reverse variance, there is also work on learning the forward noise schedule (i.e., the forward variance). Kingma et al. (2021) propose variational diffusion models (VDMs) with continuous timesteps, which use a signal-to-noise-ratio function to parameterize the forward variance and directly optimize the variational bound objective for a better log-likelihood. While we primarily apply our method to DDPMs and DDIMs, estimating the optimal reverse variance can also be applied to VDMs (see Appendix E).
|
|
|
8 CONCLUSION |
|
|
|
We show that both the optimal reverse variance and the corresponding optimal KL divergence of a DPM have analytic forms w.r.t. its score function. Building upon this result, we propose Analytic-DPM, a training-free inference framework that estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and a pretrained score-based model. We derive bounds on the optimal variance to correct potential bias and reveal a relationship between the score function and the data covariance matrix. Empirically, Analytic-DPM improves both the log-likelihood and the efficiency of various DPMs, and generates high-quality samples efficiently.
|
|
|
|
|
|
|
|
ACKNOWLEDGMENTS |
|
|
|
This work was supported by NSF of China Projects (Nos. 62061136001, 61620106010, 62076145), Beijing NSF Project (No. JQ19016), Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), Beijing Academy of Artificial Intelligence (BAAI), the Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, the NVIDIA NVAIL Program with GPU/DGX Acceleration, and the Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China.
|
|
|
ETHICS STATEMENT |
|
|
|
This work proposes an analytic estimate of the optimal variance in the reverse process of diffusion probabilistic models. As fundamental research in machine learning, it has no obvious negative consequences. Though in theory any technique can be misused, such misuse is unlikely at the current stage.
|
|
|
REPRODUCIBILITY STATEMENT |
|
|
|
We provide our code and links to pretrained models at https://github.com/baofff/Analytic-DPM. We provide details of these pretrained models in Appendix F.1, and details of data processing, log-likelihood evaluation, sampling, and FID computation in Appendix F.2. We provide complete proofs of all theoretical results in Appendix A.
|
|
|
REFERENCES |
|
|
|
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural |
|
image synthesis. arXiv preprint arXiv:1809.11096, 2018. |
|
|
|
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv |
|
_preprint arXiv:1509.00519, 2015._ |
|
|
|
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
|
|
|
Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
|
|
|
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, |
|
pp. 248–255. Ieee, 2009. |
|
|
|
Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint |
|
_arXiv:2105.05233, 2021._ |
|
|
|
Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv |
|
_preprint arXiv:1903.08689, 2019._ |
|
|
|
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, |
|
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information |
|
_processing systems, 27, 2014._ |
|
|
|
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. |
|
Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in |
|
_neural information processing systems, 30, 2017._ |
|
|
|
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint |
|
_arXiv:2006.11239, 2020._ |
|
|
|
|
|
|
|
|
Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas.
|
Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, |
|
2021. |
|
|
|
Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training |
|
generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676, 2020a. |
|
|
|
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020b.
|
|
|
Hyun-Chul Kim and Zoubin Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1948–1959, 2006.
|
|
|
Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. |
|
_arXiv preprint arXiv:1807.03039, 2018._ |
|
|
|
Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. |
|
_arXiv preprint arXiv:2107.00630, 2021._ |
|
|
|
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile |
|
diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020. |
|
|
|
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. |
|
2009. |
|
|
|
Max WY Lam, Jun Wang, Rongjie Huang, Dan Su, and Dong Yu. Bilateral denoising diffusion |
|
models. arXiv preprint arXiv:2108.11514, 2021. |
|
|
|
Haoying Li, Yifan Yang, Meng Chang, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. |
|
Srdiff: Single image super-resolution with diffusion probabilistic models. _arXiv preprint_ |
|
_arXiv:2104.14951, 2021._ |
|
|
|
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In |
|
_2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December_ |
|
_7-13, 2015, pp. 3730–3738. IEEE Computer Society, 2015._ |
|
|
|
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. _arXiv preprint_ |
|
_arXiv:1711.05101, 2017._ |
|
|
|
Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved |
|
sampling speed. arXiv preprint arXiv:2101.02388, 2021. |
|
|
|
Thomas P Minka. Expectation propagation for approximate bayesian inference. arXiv preprint |
|
_arXiv:1301.2294, 2013._ |
|
|
|
Thomas Peter Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, |
|
Massachusetts Institute of Technology, 2001. |
|
|
|
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization |
|
for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018. |
|
|
|
Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. _arXiv_ |
|
_preprint arXiv:2102.09672, 2021._ |
|
|
|
Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: |
|
A diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337, 2021. |
|
|
|
Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. _arXiv preprint_ |
|
_arXiv:2101.12072, 2021._ |
|
|
|
|
|
|
|
|
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, |
|
pp. 1278–1286. PMLR, 2014. |
|
|
|
Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad |
|
Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021. |
|
|
|
Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600, 2021. |
|
|
|
Hiroshi Sasaki, Chris G Willcocks, and Toby P Breckon. Unit-ddpm: Unpaired image translation |
|
with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358, 2021. |
|
|
|
Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-denoising |
|
models for few-shot conditional generation. arXiv preprint arXiv:2106.06819, 2021. |
|
|
|
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
|
|
|
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv |
|
_preprint arXiv:2010.02502, 2020a._ |
|
|
|
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. |
|
_arXiv preprint arXiv:1907.05600, 2019._ |
|
|
|
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben |
|
Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint |
|
_arXiv:2011.13456, 2020b._ |
|
|
|
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. arXiv e-prints, 2021.
|
|
|
Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative
|
models. arXiv preprint arXiv:1511.01844, 2015. |
|
|
|
Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. arXiv preprint |
|
_arXiv:2007.03898, 2020._ |
|
|
|
Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. arXiv |
|
_preprint arXiv:2106.05931, 2021._ |
|
|
|
Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802, 2021. |
|
|
|
Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, and Timothy Lillicrap. Logan: Latent |
|
optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953, 2019. |
|
|
|
Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. In International Conference on Learning Representations, 2020.
|
|
|
Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of |
|
a large-scale image dataset using deep learning with humans in the loop. _arXiv preprint_ |
|
_arXiv:1506.03365, 2015._ |
|
|
|
Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel |
|
diffusion. arXiv preprint arXiv:2104.03670, 2021. |
|
|
|
|
|
|
|
|
A PROOFS AND DERIVATIONS |
|
|
|
A.1 LEMMAS |
|
|
|
**Lemma 1. (Cross-entropy to Gaussian)** Suppose $q(x)$ is a probability density function with mean $\mu_q$ and covariance matrix $\Sigma_q$, and $p(x) = \mathcal{N}(x|\mu, \Sigma)$ is a Gaussian distribution. Then the cross-entropy between $q$ and $p$ is equal to the cross-entropy between $\mathcal{N}(x|\mu_q, \Sigma_q)$ and $p$, i.e.,

$$H(q, p) = H(\mathcal{N}(x|\mu_q, \Sigma_q), p) = \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathrm{tr}(\Sigma_q\Sigma^{-1}) + \frac{1}{2}(\mu_q-\mu)^\top\Sigma^{-1}(\mu_q-\mu).$$

*Proof.*

$$\begin{aligned}
H(q, p) &= -\mathbb{E}_{q(x)}\log p(x) = -\mathbb{E}_{q(x)}\log\left[\frac{1}{\sqrt{(2\pi)^d|\Sigma|}}\exp\left(-\frac{(x-\mu)^\top\Sigma^{-1}(x-\mu)}{2}\right)\right] \\
&= \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathbb{E}_{q(x)}(x-\mu)^\top\Sigma^{-1}(x-\mu) \\
&= \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathbb{E}_{q(x)}\,\mathrm{tr}\left((x-\mu)(x-\mu)^\top\Sigma^{-1}\right) \\
&= \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathrm{tr}\left(\mathbb{E}_{q(x)}\left[(x-\mu)(x-\mu)^\top\right]\Sigma^{-1}\right) \\
&= \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathrm{tr}\left(\left(\mathbb{E}_{q(x)}\left[(x-\mu_q)(x-\mu_q)^\top\right] + (\mu_q-\mu)(\mu_q-\mu)^\top\right)\Sigma^{-1}\right) \\
&= \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathrm{tr}\left(\left(\Sigma_q + (\mu_q-\mu)(\mu_q-\mu)^\top\right)\Sigma^{-1}\right) \\
&= \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathrm{tr}(\Sigma_q\Sigma^{-1}) + \frac{1}{2}\mathrm{tr}\left((\mu_q-\mu)(\mu_q-\mu)^\top\Sigma^{-1}\right) \\
&= \frac{1}{2}\log((2\pi)^d|\Sigma|) + \frac{1}{2}\mathrm{tr}(\Sigma_q\Sigma^{-1}) + \frac{1}{2}(\mu_q-\mu)^\top\Sigma^{-1}(\mu_q-\mu) \\
&= H(\mathcal{N}(x|\mu_q, \Sigma_q), p).
\end{aligned}$$
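A quick numerical illustration of Lemma 1 (ours, not from the paper): for a clearly non-Gaussian $q$ (a uniform distribution), a Monte Carlo estimate of $H(q, p)$ matches the closed form evaluated at $q$'s mean and covariance.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) evaluated at each row of x."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)) + quad)

rng = np.random.default_rng(0)
d = 3
mu, Sigma = np.zeros(d), np.eye(d)               # p = N(mu, Sigma)

# q: uniform on [-1, 2]^d -- clearly non-Gaussian
x = rng.uniform(-1.0, 2.0, size=(200000, d))
mc = -np.mean(gaussian_logpdf(x, mu, Sigma))     # Monte Carlo H(q, p)

mu_q = np.full(d, 0.5)                           # mean of U[-1, 2]
Sigma_q = np.eye(d) * (3.0 ** 2 / 12.0)          # variance of U[-1, 2] is 9/12
closed = 0.5 * (np.log((2 * np.pi) ** d * np.linalg.det(Sigma))
                + np.trace(Sigma_q @ np.linalg.inv(Sigma))
                + (mu_q - mu) @ np.linalg.inv(Sigma) @ (mu_q - mu))
# mc ≈ closed: the cross-entropy depends on q only via its first two moments
```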
|
|
|
**Lemma 2. (KL to Gaussian)** Suppose $q(x)$ is a probability density function with mean $\mu_q$ and covariance matrix $\Sigma_q$, and $p(x) = \mathcal{N}(x|\mu, \Sigma)$ is a Gaussian distribution. Then

$$D_{KL}(q\,||\,p) = D_{KL}(\mathcal{N}(x|\mu_q, \Sigma_q)\,||\,p) + H(\mathcal{N}(x|\mu_q, \Sigma_q)) - H(q),$$

where $H(\cdot)$ denotes the entropy of a distribution.

*Proof.* According to Lemma 1, we have $H(q, p) = H(\mathcal{N}(x|\mu_q, \Sigma_q), p)$. Thereby,

$$\begin{aligned}
D_{KL}(q\,||\,p) &= H(q, p) - H(q) = H(\mathcal{N}(x|\mu_q, \Sigma_q), p) - H(q) \\
&= H(\mathcal{N}(x|\mu_q, \Sigma_q), p) - H(\mathcal{N}(x|\mu_q, \Sigma_q)) + H(\mathcal{N}(x|\mu_q, \Sigma_q)) - H(q) \\
&= D_{KL}(\mathcal{N}(x|\mu_q, \Sigma_q)\,||\,p) + H(\mathcal{N}(x|\mu_q, \Sigma_q)) - H(q).
\end{aligned}$$
|
|
|
**Lemma 3. (Equivalence between the forward and reverse Markov property)** Suppose $q(x_{0:N}) = q(x_0)\prod_{n=1}^N q(x_n|x_{n-1})$ is a Markov chain. Then $q$ is also a Markov chain in the reverse direction, i.e., $q(x_{0:N}) = q(x_N)\prod_{n=1}^N q(x_{n-1}|x_n)$.

*Proof.*

$$q(x_{n-1}|x_n, \cdots, x_N) = \frac{q(x_{n-1}, x_n, \cdots, x_N)}{q(x_n, \cdots, x_N)} = \frac{q(x_{n-1}, x_n)\prod_{i=n+1}^N q(x_i|x_{i-1})}{q(x_n)\prod_{i=n+1}^N q(x_i|x_{i-1})} = q(x_{n-1}|x_n).$$

Thereby, $q(x_{0:N}) = q(x_N)\prod_{n=1}^N q(x_{n-1}|x_n)$.
|
**Lemma 4. (Entropy of a Markov chain)** Suppose $q(x_{0:N})$ is a Markov chain. Then

$$H(q(x_{0:N})) = H(q(x_N)) + \sum_{n=1}^N \mathbb{E}_q H(q(x_{n-1}|x_n)) = H(q(x_0)) + \sum_{n=1}^N \mathbb{E}_q H(q(x_n|x_{n-1})).$$

*Proof.* According to Lemma 3, we have

$$H(q(x_{0:N})) = -\mathbb{E}_q \log q(x_N)\prod_{n=1}^N q(x_{n-1}|x_n) = -\mathbb{E}_q \log q(x_N) - \sum_{n=1}^N \mathbb{E}_q \log q(x_{n-1}|x_n) = H(q(x_N)) + \sum_{n=1}^N \mathbb{E}_q H(q(x_{n-1}|x_n)).$$

Similarly, we also have $H(q(x_{0:N})) = H(q(x_0)) + \sum_{n=1}^N \mathbb{E}_q H(q(x_n|x_{n-1}))$.
|
|
|
|
|
**Lemma 5. (Entropy of a DDPM forward process)** Suppose $q(x_{0:N})$ is a Markov chain and $q(x_n|x_{n-1}) = \mathcal{N}(x_n|\sqrt{\alpha_n}x_{n-1}, \beta_n I)$. Then

$$H(q(x_{0:N})) = H(q(x_0)) + \frac{d}{2}\sum_{n=1}^N \log(2\pi e\beta_n).$$

*Proof.* According to Lemma 4, we have

$$H(q(x_{0:N})) = H(q(x_0)) + \sum_{n=1}^N \mathbb{E}_q H(q(x_n|x_{n-1})) = H(q(x_0)) + \sum_{n=1}^N \frac{d}{2}\log(2\pi e\beta_n).$$
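A numerical check of Lemma 5 (our illustration, under the assumption of one-dimensional Gaussian data, where the joint entropy is also available in closed form):

```python
import numpy as np

# verify Lemma 5 in a case where the joint entropy can be computed directly:
# d = 1 and Gaussian data x0 ~ N(0, s2), so (x0, ..., xN) is jointly Gaussian
s2 = 2.0
betas = np.array([0.1, 0.2, 0.3])
alphas = 1.0 - betas
N = len(betas)

# marginal variances Var[x_n] and the full joint covariance of (x0, ..., xN);
# Cov[x_m, x_n] = prod(sqrt(alpha_{m+1..n})) * Var[x_m] for m <= n
var = [s2]
for a, b in zip(alphas, betas):
    var.append(a * var[-1] + b)
C = np.zeros((N + 1, N + 1))
for m in range(N + 1):
    for n in range(m, N + 1):
        scale = np.prod(np.sqrt(alphas[m:n]))
        C[m, n] = C[n, m] = scale * var[m]

# entropy of the joint Gaussian vs. the formula of Lemma 5 (d = 1)
H_joint = 0.5 * np.log((2 * np.pi * np.e) ** (N + 1) * np.linalg.det(C))
H_lemma = (0.5 * np.log(2 * np.pi * np.e * s2)
           + 0.5 * np.sum(np.log(2 * np.pi * np.e * betas)))
# H_joint and H_lemma agree up to floating-point error
```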
|
|
|
|
|
**Lemma 6. (Entropy of a conditional Markov chain)** Suppose $q(x_{1:N}|x_0)$ is Markov. Then

$$H(q(x_{0:N})) = H(q(x_0)) + \mathbb{E}_q H(q(x_N|x_0)) + \sum_{n=2}^N \mathbb{E}_q H(q(x_{n-1}|x_n, x_0)).$$

*Proof.* According to Lemma 4, we have

$$H(q(x_{0:N})) = H(q(x_0)) + \mathbb{E}_q H(q(x_{1:N}|x_0)) = H(q(x_0)) + \mathbb{E}_q H(q(x_N|x_0)) + \sum_{n=2}^N \mathbb{E}_q H(q(x_{n-1}|x_n, x_0)).$$
|
|
|
|
|
|
**Lemma 7. (Entropy of a generalized DDPM forward process)** Suppose $q(x_{1:N}|x_0)$ is Markov, $q(x_N|x_0)$ is Gaussian with covariance $\bar\beta_N I$, and $q(x_{n-1}|x_n, x_0)$ is Gaussian with covariance $\lambda_n^2 I$. Then

$$H(q(x_{0:N})) = H(q(x_0)) + \frac{d}{2}\log(2\pi e\bar\beta_N) + \frac{d}{2}\sum_{n=2}^N \log(2\pi e\lambda_n^2).$$

*Proof.* Directly derived from Lemma 6.
|
|
|
**Lemma 8. (KL to a Markov chain)** Suppose $q(x_{0:N})$ is a probability distribution and $p(x_{0:N}) = p(x_N)\prod_{n=1}^N p(x_{n-1}|x_n)$ is a Markov chain. Then we have

$$\mathbb{E}_q D_{KL}(q(x_{0:N-1}|x_N)\,||\,p(x_{0:N-1}|x_N)) = \sum_{n=1}^N \mathbb{E}_q D_{KL}(q(x_{n-1}|x_n)\,||\,p(x_{n-1}|x_n)) + c,$$

where $c = \sum_{n=1}^N \mathbb{E}_q H(q(x_{n-1}|x_n)) - \mathbb{E}_q H(q(x_{0:N-1}|x_N))$ is only related to $q$. In particular, if $q(x_{0:N})$ is also a Markov chain, then $c = 0$.

*Proof.*

$$\begin{aligned}
\mathbb{E}_q D_{KL}(q(x_{0:N-1}|x_N)\,||\,p(x_{0:N-1}|x_N)) &= -\mathbb{E}_q \log p(x_{0:N-1}|x_N) - \mathbb{E}_q H(q(x_{0:N-1}|x_N)) \\
&= -\sum_{n=1}^N \mathbb{E}_q \log p(x_{n-1}|x_n) - \mathbb{E}_q H(q(x_{0:N-1}|x_N)) \\
&= \sum_{n=1}^N \mathbb{E}_q D_{KL}(q(x_{n-1}|x_n)\,||\,p(x_{n-1}|x_n)) + \sum_{n=1}^N \mathbb{E}_q H(q(x_{n-1}|x_n)) - \mathbb{E}_q H(q(x_{0:N-1}|x_N)).
\end{aligned}$$

Let $c = \sum_{n=1}^N \mathbb{E}_q H(q(x_{n-1}|x_n)) - \mathbb{E}_q H(q(x_{0:N-1}|x_N))$; then

$$\mathbb{E}_q D_{KL}(q(x_{0:N-1}|x_N)\,||\,p(x_{0:N-1}|x_N)) = \sum_{n=1}^N \mathbb{E}_q D_{KL}(q(x_{n-1}|x_n)\,||\,p(x_{n-1}|x_n)) + c.$$

If $q(x_{0:N})$ is also a Markov chain, according to Lemma 4, we have $c = 0$.
|
|
|
**Lemma 9. (The optimal Markov reverse process with Gaussian transitions is equivalent to moment matching)** Suppose $q(x_{0:N})$ is a probability density function and $p(x_{0:N}) = p(x_N)\prod_{n=1}^N p(x_{n-1}|x_n)$ is a Gaussian Markov chain with $p(x_{n-1}|x_n) = \mathcal{N}(x_{n-1}|\mu_n(x_n), \sigma_n^2 I)$. Then the joint KL optimization

$$\min_{\{\mu_n, \sigma_n^2\}_{n=1}^N} D_{KL}(q(x_{0:N})\,||\,p(x_{0:N}))$$

has an optimal solution

$$\mu_n^*(x_n) = \mathbb{E}_{q(x_{n-1}|x_n)}[x_{n-1}], \quad \sigma_n^{*2} = \mathbb{E}_{q(x_n)}\frac{\mathrm{tr}(\mathrm{Cov}_{q(x_{n-1}|x_n)}[x_{n-1}])}{d},$$

which matches the first two moments of $q(x_{n-1}|x_n)$. The corresponding optimal KL divergence is

$$D_{KL}(q(x_{0:N})\,||\,p^*(x_{0:N})) = H(q(x_N), p(x_N)) + \frac{d}{2}\sum_{n=1}^N \log(2\pi e\sigma_n^{*2}) - H(q(x_{0:N})).$$

**Remark.** *Lemma 9 does not assume the form of $q(x_{0:N})$; thereby it can be applied to more general Gaussian models, such as multi-layer VAEs with Gaussian decoders (Rezende et al., 2014; Burda et al., 2015). In this case, $q(x_{1:N}|x_0)$ corresponds to the hierarchical encoders of multi-layer VAEs.*
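As a toy numerical illustration of the optimal reverse variance in Lemma 9 (a sketch assuming a single scalar transition, not the paper's estimator): for jointly Gaussian $(x_0, x_1)$, the conditional variance $\mathrm{Var}[x_0|x_1]$ is constant and equals $1 - \alpha/(\alpha+\beta)$; a binned Monte Carlo estimate of $\mathbb{E}_{q(x_1)}\mathrm{Var}[x_0|x_1]$ recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy joint q(x0, x1): x0 ~ N(0, 1), x1 = sqrt(alpha)*x0 + sqrt(beta)*eps (d = 1)
alpha, beta = 0.9, 0.1
n = 500000
x0 = rng.standard_normal(n)
x1 = np.sqrt(alpha) * x0 + np.sqrt(beta) * rng.standard_normal(n)

# for jointly Gaussian (x0, x1), the conditional variance is constant:
#   Var[x0|x1] = Var[x0] - Cov[x0, x1]^2 / Var[x1] = 1 - alpha / (alpha + beta)
# so this is the analytic optimal reverse variance of Lemma 9 (d = 1)
sigma2_analytic = 1.0 - alpha / (alpha + beta)

# Monte Carlo estimate of E_{q(x1)} Var[x0|x1] by binning on x1
bins = np.quantile(x1, np.linspace(0, 1, 51))
idx = np.clip(np.digitize(x1, bins) - 1, 0, 49)
cond_var = np.array([x0[idx == b].var() for b in range(50)])
sigma2_mc = cond_var.mean()
# sigma2_mc ≈ sigma2_analytic (up to binning bias and Monte Carlo error)
```

In Analytic-DPM the expectation over $q(x_n)$ is instead estimated with the pretrained score model rather than by binning, which is only feasible in this one-dimensional toy.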
|
|
|
|
|
|
|
|
*Proof.* According to Lemma 8, we have

$$D_{KL}(q(x_{0:N})\,||\,p(x_{0:N})) = D_{KL}(q(x_N)\,||\,p(x_N)) + \sum_{n=1}^N \mathbb{E}_q D_{KL}(q(x_{n-1}|x_n)\,||\,p(x_{n-1}|x_n)) + c,$$

where $c = \sum_{n=1}^N \mathbb{E}_q H(q(x_{n-1}|x_n)) - \mathbb{E}_q H(q(x_{0:N-1}|x_N))$.

Since $\mathbb{E}_q D_{KL}(q(x_{n-1}|x_n)\,||\,p(x_{n-1}|x_n))$ is only related to $\mu_n(\cdot)$ and $\sigma_n^2$, the joint KL optimization is decomposed into $N$ independent optimization sub-problems:

$$\min_{\mu_n, \sigma_n^2} \mathbb{E}_q D_{KL}(q(x_{n-1}|x_n)\,||\,p(x_{n-1}|x_n)), \quad 1 \le n \le N.$$

According to Lemma 2, we have

$$\begin{aligned}
&\mathbb{E}_q D_{KL}(q(x_{n-1}|x_n)\,||\,p(x_{n-1}|x_n)) \\
&= \mathbb{E}_q D_{KL}(\mathcal{N}(x_{n-1}|\mathbb{E}_{q(x_{n-1}|x_n)}[x_{n-1}], \mathrm{Cov}_{q(x_{n-1}|x_n)}[x_{n-1}])\,||\,p(x_{n-1}|x_n)) \\
&\quad + \mathbb{E}_q H(\mathcal{N}(x_{n-1}|\mathbb{E}_{q(x_{n-1}|x_n)}[x_{n-1}], \mathrm{Cov}_{q(x_{n-1}|x_n)}[x_{n-1}])) - \mathbb{E}_q H(q(x_{n-1}|x_n)) \\
&= F(\sigma_n^2) + G(\sigma_n^2, \mu_n) + c',
\end{aligned}$$

where

$$F(\sigma_n^2) = \frac{1}{2}\left(\sigma_n^{-2}\,\mathbb{E}_q\,\mathrm{tr}(\mathrm{Cov}_{q(x_{n-1}|x_n)}[x_{n-1}]) + d\log\sigma_n^2\right),$$