## YOUR AUTOREGRESSIVE GENERATIVE MODEL CAN BE BETTER IF YOU TREAT IT AS AN ENERGY-BASED ONE

**Anonymous authors**
Paper under double-blind review

### ABSTRACT

Autoregressive generative models are commonly used, especially for tasks involving sequential data. They have, however, been plagued by a slew of inherent flaws due to the intrinsic characteristics of chain-style conditional modeling (e.g., exposure bias or lack of long-range coherence), severely limiting their ability to model distributions properly. In this paper, we propose a unique method for training the autoregressive generative model that takes advantage of a well-designed energy-based learning objective. We show that our method is capable of alleviating the exposure bias problem and increasing temporal coherence by imposing a constraint which fits joint distributions at each time step. In addition, unlike former energy-based models, we estimate energy scores based on the underlying autoregressive network itself, which does not require any extra network. Finally, thanks to importance sampling, we can train the entire model efficiently without requiring an MCMC process. Extensive empirical results, covering tasks such as language modeling, neural machine translation, and image generation, demonstrate the effectiveness of the proposed approach.

### 1 INTRODUCTION

By factorizing the joint distribution into the product of a series of conditional distributions, autoregressive generative models (abbr. ARGMs) (Vaswani et al., 2017; Dai et al., 2019; van den Oord et al., 2016a;b; Salimans et al., 2017; Chen et al., 2018) simplify the difficult challenge of modeling high-dimensional joint distributions. They can be trained efficiently via maximum likelihood and generate samples of exceptional quality, making this technique popular for modeling distributions, especially for sequential data. Nonetheless, despite their potency and flexibility, ARGMs still have inherent weaknesses due to the intrinsic characteristics of chain-style conditional modeling. For example, ARGMs usually suffer from a discrepancy between the input context distributions of the training and inference stages, which causes error propagation (i.e., exposure bias (Ranzato et al., 2016; Bengio et al., 2015)). Moreover, due to the greedy nature of beam search approximations, the decoded results from ARGMs may also lack long-range coherence.

We consider one approach by which ARGMs could be adapted to reduce these concerns. Earlier work, both heuristic and theoretical, has already been proposed with these goals in mind. For instance, the exposure bias problem of ARGMs can be alleviated to some extent with scheduled sampling (Bengio et al., 2015; Mihaylova & Martins, 2019), by mixing input contexts from both real data and autoregressive generation during the training stage. However, this scheme suffers from an over-correcting problem (Zhang et al., 2019). In addition, at the inference stage, beam search makes it possible to choose more diverse candidates, improving the quality of generated sequences. Nevertheless, this results in only marginal improvements in temporal coherence, since ARGMs can only leverage previously decoded contexts without considering the whole sequence.
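To make the scheduled-sampling idea concrete, the following is a minimal sketch in PyTorch of mixing ground-truth and model-predicted tokens in the training inputs. It is not taken from any of the cited works: `model`, `tokens`, and `mix_prob` are hypothetical names, and the single-pass mixing here only approximates the step-by-step replacement used in the original scheduled-sampling formulation (Bengio et al., 2015).

```python
import torch
import torch.nn.functional as F

def scheduled_sampling_step(model, tokens, mix_prob):
    """One simplified training step with scheduled-sampling-style input mixing.

    Assumed (hypothetical) interface:
      - `tokens` is a (batch, length) tensor of ground-truth token ids,
      - `model(inputs)` returns next-token logits of shape (batch, length, vocab),
      - `mix_prob` is the probability of replacing a gold input token with the
        model's own prediction for that position.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    with torch.no_grad():
        # Predictions made from the gold prefix; position t predicts token t+1.
        preds = model(inputs).argmax(dim=-1)

    # Shift predictions right by one so that position t holds the model's
    # prediction *for* position t, then randomly mix them into the inputs.
    shifted = preds.roll(shifts=1, dims=1)
    replace = torch.rand(inputs.shape, device=inputs.device) < mix_prob
    replace[:, 0] = False  # always keep the first (e.g., BOS) token
    mixed_inputs = torch.where(replace, shifted, inputs)

    # Standard cross-entropy loss, but computed on the mixed inputs so the
    # model is also exposed to its own predictions during training.
    logits = model(mixed_inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss
```

In practice, `mix_prob` is typically annealed from zero toward larger values over the course of training, so that the model is exposed to its own predictions gradually.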
Moreover, setting aside the difficulty of training them, energy-based models (EBMs) have demonstrated their effectiveness in modeling high-dimensional distributions in a variety of machine learning applications (Zhao et al., 2017; Arbel et al., 2021; Gao et al., 2021), without requiring the transformation of the target distribution into a product of conditional distributions. As a result, several studies (Deng et al., 2020; Bakhtin et al., 2021; Durkan & Nash, 2019) attempt to combine EBMs with ARGMs, expecting to benefit from the strengths of both approaches. However, though some positive results were obtained, the existing works preferred a two-stage optimization, which first obtains a well-trained ARGM and then trains an additional EBM based on it. Such an optimization strategy does not enable ARGMs to benefit from the properties of EBMs in modeling the joint distribution in a temporally more coherent way.

In this paper, we present a novel design for seamlessly integrating **E**nergy-based models into **A**uto**R**egressive **M**odels (E-ARM). Our training is based on an energy-based learning objective, which forces ARGM training to fit the joint distribution along with the conditional one at each time step. Thanks to our well-designed energy function, the two involved models can share a single base network without additional parameters; that is, the base network not only serves as a generator that provides fake data to facilitate the training of EBMs, as in previous works (Che et al., 2020; Xiao et al., 2021; Durkan & Nash, 2019; Deng et al., 2020), but also plays the role of modeling the energy surface. This property makes it easy to plug E-ARM into the training of any autoregressive generative model.

Intuitively, the exposure bias in ARGMs is caused by the fact that the model is trained on contexts drawn from real data rather than on data generated by the model itself. On the other hand, in the EBM's optimization process for modeling joint densities, the negative phase of wake-sleep algorithms (Hinton, 2002; Kim & Bengio, 2016) requires sampling data from the EBM itself. Together with the fact that our method combines the EBM and the ARGM seamlessly as a whole, E-ARM can reduce the discrepancy between the input data of the training and inference stages, which mitigates the exposure bias problem of the ARGM. On top of that, unlike ARGMs, which factor the joint distribution into a product of conditional distributions, EBMs are able to model the joint distribution directly and score each input at the sequence level instead of at the token level, which makes them capable of modeling long-range coherence. Additionally, in order to optimize the proposed energy-based learning objective efficiently via gradient-based wake-sleep algorithms (Kim & Bengio, 2016), we present a way to estimate the negative phase gradient (a necessary component of gradient-based wake-sleep algorithms) using samples generated with the autoregressive view instead of the EBM view, which would require an expensive Markov Chain Monte Carlo (MCMC) process. This allows us to sidestep extremely time-consuming MCMC, thus accelerating training.
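As a rough illustration of the idea in the preceding paragraph, the sketch below shows how autoregressively generated sequences could stand in for MCMC samples in the negative phase of an energy-based objective. This is a conceptual sketch only, not the paper's actual E-ARM objective: the interfaces `model.log_prob`, `model.energy`, and `model.sample` are hypothetical placeholders for a shared base network, and the importance-sampling reweighting mentioned above is omitted.

```python
import torch

def earm_style_losses(model, real_tokens):
    """Schematic of combining the autoregressive and EBM views of one network.

    Hypothetical interface (not from the paper):
      - model.log_prob(x): per-sequence log-likelihood under the ARGM view,
      - model.energy(x):   scalar energy per sequence under the EBM view,
      - model.sample(n, max_len): n sequences drawn by autoregressive sampling.
    """
    # Autoregressive view: ordinary maximum-likelihood training on real data.
    nll_loss = -model.log_prob(real_tokens).mean()

    # Negative samples come from the model's own autoregressive sampling,
    # replacing the expensive MCMC sampling an EBM would normally require.
    with torch.no_grad():
        fake_tokens = model.sample(real_tokens.size(0), max_len=real_tokens.size(1))

    # EBM view: lower the energy of real sequences (positive phase) and raise
    # the energy of generated ones (negative phase).
    energy_loss = model.energy(real_tokens).mean() - model.energy(fake_tokens).mean()

    return nll_loss, energy_loss
```

In such a setup, the two losses would typically be combined with a weighting coefficient and back-propagated through the single shared network.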
In summary, the following contributions are made with this paper: i) we introduce a novel scheme, E-ARM, to integrate the EBM view into autoregressive generative models seamlessly; ii) we attempt to reduce the intrinsic problems of autoregressive models, such as exposure bias and weak temporal coherence, by optimizing an energy-based learning objective that uses autoregressively generated samples; iii) we demonstrate how to efficiently optimize our model, which is constructed from a single network, using wake-sleep algorithms without MCMC; iv) in a number of applications, such as language modeling, neural machine translation, and image generation, our model achieves better results than relevant baselines.

### 2 BACKGROUND

#### 2.1 ENERGY-BASED MODELS

Energy-based models (LeCun et al., 2006) can express any probability $p(x)$ for $x \in \mathbb{R}^D$ as

$$
p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},
\tag{1}
$$

where $E_\theta : \mathbb{R}^D \to \mathbb{R}$ denotes an energy function which maps a $D$-dimensional datapoint to a scalar, and $Z_\theta = \sum_x \exp(-E_\theta(x))$ denotes the normalizing constant, also known as the partition function. Any function can be used as an energy function to represent an EBM as long as it generates a single scalar given some input $x$ and the normalizing constant is finite.¹

¹ Without constraining the parametrization of $E_\theta$, this can be achieved by bounding the region of space in which $x$ takes its allowed values.

Wake-sleep algorithms are commonly used to optimize EBMs (Hinton, 2002; Kim & Bengio, 2016; Grathwohl et al., 2020) via gradient-based approximate maximum likelihood. Specifically, the gradient of the log-likelihood, which needs to be maximized, with respect to $\theta$ can be expressed as

$$
\mathbb{E}_{p_d(x)}\left[\frac{\partial}{\partial \theta}\log p_\theta(x)\right]
= \mathbb{E}_{p_\theta(x)}\left[\frac{\partial}{\partial \theta}E_\theta(x)\right]
- \mathbb{E}_{p_d(x)}\left[\frac{\partial}{\partial \theta}E_\theta(x)\right].
\tag{2}
$$

The first term on the right-hand side of Eq. 2 is the negative phase term, while the second term is called the positive phase term. MCMC methods have been used (Hinton, 2002; Welling & Teh, 2011a) to approximately sample from $p_\theta(x)$ in order to estimate the negative phase term.

#### 2.2 MODELING DISTRIBUTIONS AUTOREGRESSIVELY

Autoregressive generative models (ARGMs)² can decompose any joint distribution $p(x)$ into a product of conditional distributions using the product rule of probability, by imposing an ordering on the random variables within the joint distribution and characterizing each random variable given all variables preceding it in that order. Formally, we use x