## YOUR AUTOREGRESSIVE GENERATIVE MODEL CAN BE BETTER IF YOU TREAT IT AS AN ENERGY-BASED ONE

**Anonymous authors**
Paper under double-blind review

### ABSTRACT

Autoregressive generative models are commonly used, especially for tasks involving sequential data. They have, however, been plagued by a slew of inherent flaws due to the intrinsic characteristics of chain-style conditional modeling (e.g., exposure bias or lack of long-range coherence), severely limiting their ability to model distributions properly. In this paper, we propose a unique method for training the autoregressive generative model that takes advantage of a well-designed energy-based learning objective. We show that our method is capable of alleviating the exposure bias problem and increasing temporal coherence by imposing a constraint which fits joint distributions at each time step. In addition, unlike former energy-based models, we estimate energy scores based on the underlying autoregressive network itself, which does not require any extra network. Finally, thanks to importance sampling, we can train the entire model efficiently without requiring an MCMC process. Extensive empirical results, covering tasks such as language modeling, neural machine translation, and image generation, demonstrate the effectiveness of the proposed approach.

### 1 INTRODUCTION

By factorizing the joint distribution into the product of a series of conditional distributions, autoregressive generative models (abbr. ARGMs) (Vaswani et al., 2017; Dai et al., 2019; van den Oord et al., 2016a;b; Salimans et al., 2017; Chen et al., 2018) simplify the difficult challenge of modeling high-dimensional joint distributions. They can be trained efficiently via maximum likelihood and generate samples of exceptional quality, making this technique popular for modeling distributions, especially for sequential data. Nonetheless, despite their potency and flexibility, ARGMs still have inherent weaknesses due to the intrinsic characteristics of chain-style conditional modeling. For example, ARGMs usually suffer from a discrepancy between the input context distributions of the training and inference stages, which causes error propagation (i.e., exposure bias (Ranzato et al., 2016; Bengio et al., 2015)). Moreover, due to the greedy nature of beam search approximations, the decoded results from ARGMs may also lack long-range coherence.

We consider one approach by which ARGMs could be adapted to reduce these concerns. Earlier work, both heuristic and theoretical, has already been proposed with these goals in mind. For instance, the exposure bias problem of ARGMs can be alleviated to some extent with scheduled sampling (Bengio et al., 2015; Mihaylova & Martins, 2019), by mixing input contexts from both real data and autoregressive generation during the training stage. However, this scheme suffers from an over-correcting problem (Zhang et al., 2019). In addition, at the inference stage, beam search makes it possible to choose more diverse candidates, improving the quality of generated sequences. Nevertheless, this results in only marginal improvements in temporal coherence, since ARGMs can only leverage previously decoded contexts without considering the whole sequence.
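To make the scheduled-sampling idea concrete, the following is a minimal sketch in PyTorch of mixing ground-truth and model-predicted tokens in the training inputs. It is not taken from any of the cited works: `model`, `tokens`, and `mix_prob` are hypothetical names, and the single-pass mixing here only approximates the step-by-step replacement used in the original scheduled-sampling formulation (Bengio et al., 2015).

```python
import torch
import torch.nn.functional as F

def scheduled_sampling_step(model, tokens, mix_prob):
    """One simplified training step with scheduled-sampling-style input mixing.

    Assumed (hypothetical) interface:
      - `tokens` is a (batch, length) tensor of ground-truth token ids,
      - `model(inputs)` returns next-token logits of shape (batch, length, vocab),
      - `mix_prob` is the probability of replacing a gold input token with the
        model's own prediction for that position.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    with torch.no_grad():
        # Predictions made from the gold prefix; position t predicts token t+1.
        preds = model(inputs).argmax(dim=-1)

    # Shift predictions right by one so that position t holds the model's
    # prediction *for* position t, then randomly mix them into the inputs.
    shifted = preds.roll(shifts=1, dims=1)
    replace = torch.rand(inputs.shape, device=inputs.device) < mix_prob
    replace[:, 0] = False  # always keep the first (e.g., BOS) token
    mixed_inputs = torch.where(replace, shifted, inputs)

    # Standard cross-entropy loss, but computed on the mixed inputs so the
    # model is also exposed to its own predictions during training.
    logits = model(mixed_inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss
```

In practice, `mix_prob` is typically annealed from zero toward larger values over the course of training, so that the model is exposed to its own predictions gradually.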
Moreover, setting aside the difficulty of training them, energy-based models (EBMs) have demonstrated their effectiveness in modeling high-dimensional distributions in a variety of machine learning applications (Zhao et al., 2017; Arbel et al., 2021; Gao et al., 2021), without requiring the transformation of the target distribution into a product of conditional distributions. As a result, several studies (Deng et al., 2020; Bakhtin et al., 2021; Durkan & Nash, 2019) attempt to combine EBMs with ARGMs, expecting to benefit from the strengths of both approaches. However, though some positive results were obtained, the existing works preferred a two-stage optimization, which first obtains a well-trained ARGM and then trains an additional EBM based on it. Such an optimization strategy does not enable ARGMs to benefit from the properties of EBMs in modeling the joint distribution in a temporally more coherent way.

In this paper, we present a novel design for seamlessly integrating **E**nergy-based models into **A**uto**R**egressive **M**odels (E-ARM). Our training is based on an energy-based learning objective, which forces ARGM training to fit the joint distribution along with the conditional one at each time step. Thanks to our well-designed energy function, the two involved models can share a single base network without additional parameters; that is, the base network not only serves as a generator that provides fake data to facilitate the training of EBMs, as in previous works (Che et al., 2020; Xiao et al., 2021; Durkan & Nash, 2019; Deng et al., 2020), but also plays the role of modeling the energy surface. This property makes it easy to plug E-ARM into the training of any autoregressive generative model.

Intuitively, the exposure bias in ARGMs is caused by the fact that the model is trained on contexts drawn from real data rather than on data generated by the model itself. On the other hand, in the EBM's optimization process for modeling joint densities, the negative phase of wake-sleep algorithms (Hinton, 2002; Kim & Bengio, 2016) requires sampling data from the EBM itself. Together with the fact that our method combines the EBM and the ARGM seamlessly as a whole, E-ARM can reduce the discrepancy between the input data of the training and inference stages, which mitigates the exposure bias problem of the ARGM. On top of that, unlike ARGMs, which factor the joint distribution into a product of conditional distributions, EBMs are able to model the joint distribution directly and score each input at the sequence level instead of at the token level, which makes them capable of modeling long-range coherence. Additionally, in order to optimize the proposed energy-based learning objective efficiently via gradient-based wake-sleep algorithms (Kim & Bengio, 2016), we present a way to estimate the negative phase gradient (a necessary component of gradient-based wake-sleep algorithms) using samples generated with the autoregressive view instead of the EBM view, which would require an expensive Markov Chain Monte Carlo (MCMC) process. This allows us to sidestep extremely time-consuming MCMC, thus accelerating training.
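As a rough illustration of the idea in the preceding paragraph, the sketch below shows how autoregressively generated sequences could stand in for MCMC samples in the negative phase of an energy-based objective. This is a conceptual sketch only, not the paper's actual E-ARM objective: the interfaces `model.log_prob`, `model.energy`, and `model.sample` are hypothetical placeholders for a shared base network, and the importance-sampling reweighting mentioned above is omitted.

```python
import torch

def earm_style_losses(model, real_tokens):
    """Schematic of combining the autoregressive and EBM views of one network.

    Hypothetical interface (not from the paper):
      - model.log_prob(x): per-sequence log-likelihood under the ARGM view,
      - model.energy(x):   scalar energy per sequence under the EBM view,
      - model.sample(n, max_len): n sequences drawn by autoregressive sampling.
    """
    # Autoregressive view: ordinary maximum-likelihood training on real data.
    nll_loss = -model.log_prob(real_tokens).mean()

    # Negative samples come from the model's own autoregressive sampling,
    # replacing the expensive MCMC sampling an EBM would normally require.
    with torch.no_grad():
        fake_tokens = model.sample(real_tokens.size(0), max_len=real_tokens.size(1))

    # EBM view: lower the energy of real sequences (positive phase) and raise
    # the energy of generated ones (negative phase).
    energy_loss = model.energy(real_tokens).mean() - model.energy(fake_tokens).mean()

    return nll_loss, energy_loss
```

In such a setup, the two losses would typically be combined with a weighting coefficient and back-propagated through the single shared network.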
In summary, the following contributions are made with this paper: i) we introduce a novel scheme, E-ARM, to integrate the EBM view into autoregressive generative models seamlessly; ii) we attempt to reduce the intrinsic problems of autoregressive models, such as exposure bias and weak temporal coherence, by optimizing an energy-based learning objective that uses autoregressively generated samples; iii) we demonstrate how to efficiently optimize our model, which is constructed from a single network, using wake-sleep algorithms without MCMC; iv) in a number of applications, such as language modeling, neural machine translation, and image generation, our model achieves better results than relevant baselines.

### 2 BACKGROUND

#### 2.1 ENERGY-BASED MODELS

Energy-based models (LeCun et al., 2006) can express any probability $p(x)$ for $x \in \mathbb{R}^D$ as

$$
p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},
\tag{1}
$$

where $E_\theta : \mathbb{R}^D \to \mathbb{R}$ denotes an energy function which maps a $D$-dimensional datapoint to a scalar, and $Z_\theta = \sum_x \exp(-E_\theta(x))$ denotes the normalizing constant, also known as the partition function. Any function can be used as an energy function to represent an EBM as long as it generates a single scalar given some input $x$ and the normalizing constant is finite.¹

¹ Without constraining the parametrization of $E_\theta$, this can be achieved by bounding the region of space in which $x$ takes its allowed values.

Wake-sleep algorithms are commonly used to optimize EBMs (Hinton, 2002; Kim & Bengio, 2016; Grathwohl et al., 2020) via gradient-based approximate maximum likelihood. Specifically, the gradient of the log-likelihood, which needs to be maximized, with respect to $\theta$ can be expressed as

$$
\mathbb{E}_{p_d(x)}\left[\frac{\partial}{\partial \theta}\log p_\theta(x)\right]
= \mathbb{E}_{p_\theta(x)}\left[\frac{\partial}{\partial \theta}E_\theta(x)\right]
- \mathbb{E}_{p_d(x)}\left[\frac{\partial}{\partial \theta}E_\theta(x)\right].
\tag{2}
$$

The first term on the right-hand side of Eq. 2 is the negative phase term, while the second term is called the positive phase term. MCMC methods have been used (Hinton, 2002; Welling & Teh, 2011a) to approximately sample from $p_\theta(x)$ in order to estimate the negative phase term.

#### 2.2 MODELING DISTRIBUTIONS AUTOREGRESSIVELY

Autoregressive generative models (ARGMs)² can decompose any joint distribution $p(x)$ into a product of conditional distributions using the product rule of probability, by imposing an ordering on the random variables within the joint distribution and characterizing each random variable given all variables preceding it in that order. Formally, we use x