# STATE-ACTION JOINT REGULARIZED IMPLICIT POLICY FOR OFFLINE REINFORCEMENT LEARNING

**Anonymous authors** Paper under double-blind review

ABSTRACT

Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. To train more effective agents, we propose a framework that supports learning a flexible and well-regularized policy, which consists of a fully implicit policy and a regularization through the state-action visitation frequency induced by the current policy and that induced by the data-collecting behavior policy. We theoretically show the equivalence between policy matching and state-action-visitation matching, and thus the compatibility of much prior work with our framework. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and an ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.

1 INTRODUCTION

Offline reinforcement learning (RL), also known as batch RL, aims at training agents from previously-collected fixed datasets that are typically large and heterogeneous, with a special emphasis on no interactions with the environment during the learning process (Ernst et al., 2005; Lange et al., 2012; Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Agarwal et al., 2020; Siegel et al., 2020; Wang et al., 2020). This paradigm extends the applicability of RL to settings where environmental interactions are costly or even potentially dangerous, such as healthcare (Tseng et al., 2017; Gottesman et al., 2018; Nie et al., 2019), autonomous driving (Yurtsever et al., 2020), and recommendation systems (Swaminathan et al., 2017; Gilotte et al., 2018).

While (online) off-policy RL algorithms, such as DDPG (Lillicrap et al., 2016), TD3 (Fujimoto et al., 2018), and SAC (Haarnoja et al., 2018a), could be directly adopted in the offline setting, their application can be unsuccessful (Fujimoto et al., 2019; Kumar et al., 2019), especially in high-dimensional continuous control tasks, where function approximation is inevitable and data samples are non-exhaustive. Such failures may be attributed to the shift between the state-action visitation frequency induced by the current policy and that induced by the data-collecting behavior policy: unseen state-action pairs are presented to the action-value estimator, resulting in possibly uncontrollable extrapolation errors (Fujimoto et al., 2019; Kumar et al., 2019). In this regard, one approach to offline RL is to control the difference between the observed and policy-induced state-action visitation frequencies, so that the current policy mostly generates state-action pairs that have reliable action-value estimates. Previous work in this line of research typically (1) regularizes the current policy to be close to the behavior policy during the training process, i.e., policy or state-conditional action-distribution matching; and (2) uses a Gaussian policy class with a learnable mean and diagonal covariance matrix (Kumar et al., 2019; Wu et al., 2019). See Appendix A for a detailed review.
However, at any given state s, the underlying action-value function over the action space may possess multiple local maxima. A deterministic or uni-modal stochastic policy may capture only one of the local optima and neglect many rewarding actions. An even worse situation occurs when such policies exhibit a strong mode-covering behavior, artificially inflating the density around the average of multiple rewarding actions, an average that may itself be inferior.

Previous work under the policy-matching theme mainly takes two approaches. One approach directly estimates the divergence between the state-conditional distributions over actions (Wu et al., 2019). However, on tasks with continuous state space, with probability one, no state appears in the dataset more than once. In other words, for each observed state si, the offline dataset contains only one corresponding action ai from the behavior policy. Thus, one can only use a single point to assess whether the current policy is close to the data-collecting behavior policy at any particular state, which may not well reflect the true divergence between the two conditional distributions. The other approach, e.g., Kumar et al. (2019), resorts to a two-step strategy: first, fit a generative model π(a | s) to clone the behavior policy; second, estimate the distance between the fitted behavior policy π(a | s) and the current policy, and minimize that distance as a regularizer. While this approach, which can draw multiple samples from the cloned behavior policy, is able to accurately estimate the distance between the current policy and the cloned behavior, its success relies heavily on how well the inferred behavior-cloning generative model mimics the true behavior policy. On tasks with large or continuous state space, however, the same problem arises that each state-conditional action distribution is fitted on only one data point. Moreover, some prior work uses a conditional VAE (CVAE, Sohn et al. (2015)) as the generative model to clone the possibly multi-modal behavior policy, which further suffers from the problem that the CVAE may exhibit a strong mode-covering behavior that allocates large probability density to low-data-density regions. Given these weaknesses, one may naturally question how well samples from such a cloned policy resemble the truth, and further, the quality of the constraint imposed by a sample-based calculation of the distance between such a cloned behavior policy and the current policy.

To address these concerns, we are motivated to develop a new framework that not only supports an expressive policy, which can be as flexible as needed, but also well regularizes this flexible policy towards the data-collecting behavior policy. Specifically, (1) instead of using the classical deterministic or uni-modal Gaussian policy, we train a fully implicit policy for its flexibility to capture multiple modes in the action-value function; (2) to avoid the aforementioned potential problems in the policy-matching regularization, we directly control the distance between the state-action visitation frequency induced by the current policy and that induced by the behavior policy, as an auxiliary training target. Hence, our approach does not need to build a generative model to clone the behavior policy.
On the theoretical side, we prove in Section 3 that the approach of matching the behavior and current policies is equivalent to matching their corresponding state-action visitation frequencies, which reveals the compatibility of much prior work with our framework. A similar notion of matching the state-action visitations in offline RL is taken by the DICE family (Nachum et al., 2019; Lee et al., 2021a), but these methods either use a Gaussian policy or a mixture-of-Gaussians policy with a per-dataset-tuned number of mixture components. Besides, these algorithms have high computational complexity, which, together with inflexible policies and intensive hyperparameter tuning, limits their practical applicability. We instantiate our framework with a GAN structure that approximately minimizes the Jensen–Shannon divergence between the visitation frequencies. Furthermore, we design techniques to explicitly encourage robust behavior of our policy at states not covered in the static dataset. We conduct an ablation study on several components of our algorithm and analyze their contributions. With all these considerations, our full algorithm achieves state-of-the-art performance on various tasks from the D4RL dataset (Fu et al., 2021), validating the effectiveness of our framework and implementation.

2 BACKGROUND AND MOTIVATION

We first present some background information and then introduce a toy example to illustrate the motivations of the proposed framework for offline RL.

**Offline RL.** Following the classic RL setting (Sutton & Barto, 2018), the interaction between the agent and environment is modeled as a Markov decision process (MDP), specified by the tuple M = (S, A, P, r, γ), where S denotes the state space, A the action space, γ ∈ (0, 1] the discount factor, P(s′ | s, a) : S × S × A → [0, 1] the environmental dynamics, and r(s, a) : S × A → [Rmin, Rmax] the reward function. The goal of RL is to learn a policy πϕ(at | st), parametrized by ϕ, that maximizes the expected cumulative discounted reward $R_t \triangleq R(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right]$. In offline RL (Fujimoto et al., 2019; Kumar et al., 2019; 2020; Levine et al., 2020), the agent only has access to a fixed dataset D ≜ {(s, a, r, s′)}, consisting of transition tuples from rollouts by some behavior policies πb(a | s). We denote the state-action visitation frequency induced by the behavior policy πb as db(s, a) and its state marginal, the state visitation frequency, as db(s). Similarly, dϕ(s, a) and dϕ(s) are the counterparts for the current policy πϕ. Here, db(s, a) = db(s)πb(a | s) and we assume D ∼ db(s, a). The visitation frequencies in the dataset are denoted as dD(s, a) and dD(s), which are discrete approximations to db(s, a) and db(s), respectively.

**Actor-Critic Algorithm.** Denote $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$ as the action-value function.
In the actor-critic scheme (Sutton & Barto, 2018), the critic Q^π(s, a) is often approximated by a neural network Qθ(s, a), parametrized by θ and trained by applying the Bellman operator (Lillicrap et al., 2016; Haarnoja et al., 2018a; Fujimoto et al., 2019) as

$$\arg\min_{\theta} \left( Q_{\theta}(s, a) - \left( r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')} \left[ Q_{\theta}(s', a') \right] \right) \right)^2 .$$

The actor πϕ aims at maximizing the expected value of Qθ, with the training objective expressed as

$$\arg\max_{\phi}\; J(\pi_{\phi}) = \mathbb{E}_{s \sim d_{\phi}(s),\, a \sim \pi_{\phi}(a \mid s)} \left[ Q_{\theta}(s, a) \right], \qquad (1)$$

where dϕ(s) is the state visitation frequency under policy πϕ. In offline RL, sampling from dϕ(s) is infeasible as no interactions with the environment are allowed. A common and practically effective approximation (Fu et al., 2019; Levine et al., 2020) to Equation 1 is

$$J(\pi_{\phi}) \approx \mathbb{E}_{s \sim d_{b}(s),\, a \sim \pi_{\phi}(a \mid s)} \left[ Q_{\theta}(s, a) \right], \qquad (2)$$

where sampling from db(s) can be implemented easily as sampling from the offline dataset D.

**Generative Adversarial Nets.** GAN (Goodfellow et al., 2014) provides a framework to train deep generative models, with two neural networks jointly trained in an adversarial manner: a generator Gϕ, parametrized by ϕ, that fits the data distribution, and a discriminator Dw, parametrized by w, that outputs the probability of a sample coming from the training data rather than from Gϕ. Samples x from the generator's distribution dϕ(x) are drawn via z ∼ pz(z), x = Gϕ(z), where pz(z) is some noise distribution. Both Gϕ and Dw are trained via the two-player min-max game

$$\min_{\phi} \max_{w}\; V(D_w, G_{\phi}) = \mathbb{E}_{y \sim d_{\mathcal{D}}(\cdot)} \left[ \log D_w(y) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log\left(1 - D_w(G_{\phi}(z))\right) \right], \qquad (3)$$

where dD(·) is the data distribution. Given the optimal discriminator D*_G at Gϕ, the training objective of Gϕ is determined by the Jensen–Shannon divergence (JSD) (Lin, 1991) between dD and dϕ as V(D*_G, Gϕ) = −log 4 + 2 · JSD(dD ∥ dϕ), with the global minimum achieved if and only if dϕ = dD. Therefore, one may view GAN as a distributional-matching framework that approximately minimizes the Jensen–Shannon divergence between the generator distribution and the data distribution.
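To make the min-max game in Equation 3 concrete, the following is a minimal PyTorch sketch of one GAN training step. The network sizes, optimizer settings, and the toy Gaussian "data" distribution are illustrative assumptions, not the configuration used in our experiments.

```python
# Minimal sketch of the GAN objective in Equation 3 (illustrative sizes and toy data).
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim, batch = 8, 2, 128

G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))        # generator G_phi
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # discriminator D_w
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(200):
    y = 0.5 * torch.randn(batch, data_dim) + 2.0    # "data" samples y ~ d_D (a toy Gaussian here)
    x = G(torch.randn(batch, noise_dim))            # generator samples x = G_phi(z), z ~ p_z

    # Inner maximization over w: classify y as real (label 1) and x as fake (label 0).
    d_loss = bce(D(y), torch.ones(batch, 1)) + bce(D(x.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Outer minimization over phi: E_z[log(1 - D_w(G_phi(z)))].
    g_loss = torch.log(1.0 - D(x) + 1e-8).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

At the optimum of the inner maximization, descending the generator term drives dϕ toward dD in Jensen–Shannon divergence, which is exactly the matching property our framework exploits later for state-action visitations.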
**Motivations.** To illustrate our motivations for training an expressive policy under an appropriate regularization, we conduct a toy behavior-cloning experiment, shown in Figure 1, where we use the x- and y-axis values to represent the state and action, respectively. Figure 1a illustrates the state-action joint distribution of the data-collecting behavior policy that we try to mimic. For Figures 1b-1e, we use the same test-time state marginal distribution, which consists of an equal mixture of the behavior policy's state distribution and a uniform state distribution between −1.5 and 1.5. If the inferred policy approaches the behavior policy well, we expect (1) clear concentration on the eight centers and (2) smooth interpolation between centers, which implies a good and smooth fit to the behavior policy. We start the toy experiment by fitting a CVAE model, a representative behavior-cloning method, to the dataset. As shown in Figure 1b, CVAE exhibits a mode-covering behavior that covers the data-density modes at the expense of overestimating unwanted low-data-density regions. Hence, the regularization ability of using CVAE as a proxy for the behavior policy, as in some prior work, is questionable.

Replacing CVAE with the conditional GAN (CGAN, Mirza & Osindero (2014)), i.e., replacing the KL loss with the JSD loss, while adopting the Gaussian policy popular in prior offline-RL work, partially alleviates the issues but drops necessary modes, as shown in Figure 1c. This shows the inflexibility of Gaussian policies. Replacing the Gaussian policy in CGAN with an implicit policy, while still training CGAN via sample-based policy matching, improves the capability of capturing multiple modes, as shown in Figure 1d. Nevertheless, it is still prone to mode collapse and interpolates less smoothly between the seen states. Finally, training the implicit-policy CGAN via direct state-action-visitation joint matching leads to the best performance. As shown in Figure 1e, it concentrates clearly on the eight centers and interpolates smoothly between the seen states. Thus, constraining the state-action visitations can be an effective way to regularize implicit policies in offline RL. These observations motivate us to train implicit policies in offline RL, with sample-based regularization on the state-action visitations.

[Figure 1: five scatter-plot panels over states and actions in [−1.5, 1.5], titled (a) Truth, (b) CVAE, (c) G-CGAN, (d) CGAN, (e) GAN.]

Figure 1: Performance of approximating the behavior policy on the eight-Gaussian dataset. A conditional VAE ("CVAE"), a conditional GAN ("CGAN"), and a Gaussian-generator conditional GAN ("G-CGAN") are fitted using the conditional-distribution (policy) matching approach. A conditional GAN ("GAN") is fitted using the basic state-action-joint-visitation matching strategy (Section 4.1). More details are provided in Appendix D.1.

3 THEORETICAL ANALYSIS

As discussed in Section 1 and detailed in Appendix A, one common theme in prior offline-RL work is controlling the distance between the behavior policy and the current policy during the training process. In this section, we prove that this approach, in essence, controls the corresponding state-action visitations. This analysis theoretically links a line of prior offline-RL work with our proposed framework, manifesting the generality of our algorithmic idea. Note that dϕ(s, a) = dϕ(s)πϕ(a | s) and similarly for db(s, a). Hence, it is sufficient to show the closeness between dϕ(s) and db(s) when πϕ(a | s) is close to πb(a | s). Below we give our analysis for the matrix (finite state space) case. Continuous state-space cases may be analyzed similarly and are left for future work. The proofs are deferred to Appendix C.
**Notation.** Denote A_{−i∗} as matrix A with its i-th row removed; κ(A) as the 2-norm condition number of A; **1** as a row vector of all ones and I as an identity matrix, both with appropriate dimensions. Assume that the state space S is finite with cardinality N, i.e., S = {s¹, . . ., s^N}. The transition probabilities associated with policy πϕ over S then form an N × N matrix Tϕ, whose (i, j) entry is

$$T_{\phi,(i,j)} = p_{\pi_{\phi}}\!\left(S_{t+1} = s^j \mid S_t = s^i\right) = \int_{\mathcal{A}} P\!\left(S_{t+1} = s^j \mid S_t = s^i, A_t = a_t\right) \pi_{\phi}\!\left(a_t \mid s^i\right) \mathrm{d}a_t ,$$

and similarly for Tb, the transition matrix associated with πb. Note that in this case dϕ(s), db(s) are vectors, and we denote dϕ ≜ dϕ(s), db ≜ db(s) ∈ R^N and dϕ = db + ∆d.

**Theorem 1.** Denote

$$\kappa_{\max} = \max_{i=2,\ldots,N+1} \kappa\!\left( \begin{pmatrix} \mathbf{1} \\ I - T_b^{\top} \end{pmatrix}_{-i*} \right).$$

If

$$\max_{i=1,\ldots,N} \left\| \pi_b\!\left(\cdot \mid s^i\right) - \pi_{\phi}\!\left(\cdot \mid s^i\right) \right\|_1 \leq \epsilon < \frac{1}{\kappa_{\max}}$$

and T^ϕ_{i,j}, T^b_{i,j} > 0 for all i, j ∈ {1, 2, . . ., N}, then

$$2\,\mathrm{TV}(d_{\phi}, d_b) = \|\Delta d\|_1 = \|d_{\phi} - d_b\|_1 \leq \frac{\epsilon\, \kappa_{\max}}{1 - \epsilon\, \kappa_{\max}} \to 0 \quad \text{as } \epsilon \to 0 .$$

*Remark.* (1) We note that κmax is a constant for fixed Tb and can be calculated by iteratively removing columns of Tb and computing the SVD of the resulting matrix. (2) The assumption that T^ϕ_{i,j}, T^b_{i,j} > 0 for all i, j ∈ {1, . . ., N} can be satisfied by substituting the zero entries in the original transition matrix with a small number and re-normalizing each row of the resulting matrix, as in the PageRank algorithm (Page et al., 1998; Langville & Meyer, 2004).

In practice, the offline dataset D often consists of samples collected by several policies. Equivalently, the behavior policy πb(· | s) is a mixture of single policies. Assume that there are K such policies {πb_k(· | s)}_{k=1}^K with mixture probabilities {w_k}_{k=1}^K, i.e., πb(· | s) = Σ_{k=1}^K w_k πb_k(· | s), Σ_{k=1}^K w_k = 1. Since we collect D by running each πb_k(· | s) a proportion w_k of the total time, we may decompose D as D = ∪_{k=1}^K D_k, where D_k consists of the w_k proportion of data in D. Thus, dD(s) = Σ_{k=1}^K w_k dD_k(s), and the approximation dϕ(s) ≈ dD(s) has population version dϕ(s) ≈ Σ_{k=1}^K w_k db_k(s) ≜ db(s). As before, denote db_k(s) ∈ R^N as the limiting state-occupancy measure induced by πb_k on M; Tb_k as the transition matrix induced by πb_k over S; and dϕ = db_k + ∆d_k. We extend Theorem 1 to this mixture-of-policies case as follows.

**Theorem 2.** Denote

$$\kappa_{\max,k} = \max_{i=2,\ldots,N+1} \kappa\!\left( \begin{pmatrix} \mathbf{1} \\ I - T_{b_k}^{\top} \end{pmatrix}_{-i*} \right), \qquad \kappa_{\max} = \max_{k=1,\ldots,K} \kappa_{\max,k} .$$

If

$$\max_{k=1,\ldots,K}\; \max_{i=1,\ldots,N} \left\| \pi_{b_k}\!\left(\cdot \mid s^i\right) - \pi_{\phi}\!\left(\cdot \mid s^i\right) \right\|_1 \leq \epsilon < \frac{1}{\kappa_{\max}}$$

and T^ϕ_{i,j}, T^{b_k}_{i,j} > 0 for all i, j ∈ {1, . . ., N}, k ∈ {1, . . ., K}, then

$$2\,\mathrm{TV}(d_{\phi}, d_b) = \|d_{\phi} - d_b\|_1 \leq \sum_{k=1}^{K} w_k\, \frac{\epsilon\, \kappa_{\max,k}}{1 - \epsilon\, \kappa_{\max,k}} \leq \frac{\epsilon\, \kappa_{\max}}{1 - \epsilon\, \kappa_{\max}} \to 0 \quad \text{as } \epsilon \to 0 .$$

In particular, if w_k = 1/K for all k ∈ {1, . . ., K}, then

$$2\,\mathrm{TV}(d_{\phi}, d_b) = \|d_{\phi} - d_b\|_1 \leq \frac{1}{K} \sum_{k=1}^{K} \frac{\epsilon\, \kappa_{\max,k}}{1 - \epsilon\, \kappa_{\max,k}} \to 0 \quad \text{as } \epsilon \to 0 .$$
*Remark.* We note that κmax,k is a constant for fixed Tb_k, and κmax is a constant for fixed {πb_k}_{k=1}^K.

We notice that a similar analysis has been given in prior work bounding DKL(dϕ(s) ∥ db(s)) by O(ϵ/(1 − γ)²) (Schulman et al., 2015; Levine et al., 2020). However, this prior work deals with (unnormalized) discounted visitation frequencies, while our bound is devoted to undiscounted visitation frequencies, since neither the data collection (i.e., policy rollout) nor the state-action-visitation regularization involves the discount factor. In short, the definitions of dϕ(s) and db(s) in our work differ from those in this prior work. Note that the prior bound has 1 − γ in the denominator and hence cannot be applied to the undiscounted case (discount factor γ = 1). Furthermore, instead of the KL divergence, we bound the total variation distance between the state visitation frequencies, which is a well-defined metric.

4 STATE-ACTION JOINT REGULARIZED IMPLICIT POLICY

In this section we discuss an instance of our framework that will be used in our empirical study in Section 5. Concretely, we train a fully implicit policy via a GAN structure to approximately minimize the JSD between the state-action visitation frequency induced by the current policy and that induced by the behavior policy. Our basic algorithm is discussed in Section 4.1, followed by three enhancing components presented in Section 4.2 that build up our full algorithm. This instantiation manifests three facets we consider important in offline RL: (1) the flexibility of the policy class, (2) an effective sample-based regularization without explicitly modelling the behavior policy, and (3) smoothness of the learned policy.

4.1 BASIC ALGORITHM

Motivated by the standard actor-critic and GAN frameworks described in Section 2, our basic algorithm consists of a critic Qθ, an actor πϕ, and a discriminator Dw. For training stability, we adopt the double Q-learning (Hasselt, 2010) and target-network formulation to train a pair of critics Qθ1, Qθ2 and the target networks Qθ′1, Qθ′2, πϕ′.

To penalize uncertainty in the action-value estimates at future states, while controlling conservatism, we follow Fujimoto et al. (2019) and Kumar et al. (2019) and use the following critic-training target:

$$\widetilde{Q}(s, a) \triangleq r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi_{\phi'}(\cdot \mid s')} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a') + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a') \right] \qquad (4)$$

with some hyperparameter λ ∈ [0, 1], where one a′ is sampled at each s′ in the mini-batch. The training objective for both critic networks is to minimize the mean-squared error between their respective action-value estimates Qθj(s, a) and Q̃(s, a) over state-action pairs in the mini-batch.

Our actor consists of three components: an implicit policy, a state-action-visitation joint regularization, and a conservative training target.
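As a concrete reading of Equation 4, the sketch below computes the λ-weighted double-Q target from a mini-batch, with one a′ sampled per s′ as in the basic algorithm. The target-network objects (q1_target, q2_target, policy_target), the tensor shapes, and the default γ and λ values are assumptions for illustration.

```python
# Sketch of the critic target in Equation 4; component names and shapes are illustrative.
import torch

def critic_target(r, s_next, q1_target, q2_target, policy_target, gamma=0.99, lam=0.75):
    """r: [B, 1] rewards; s_next: [B, S] next states; returns the [B, 1] target Q~(s, a)."""
    with torch.no_grad():
        a_next = policy_target(s_next)                          # a' ~ pi_phi'(. | s')
        q1, q2 = q1_target(s_next, a_next), q2_target(s_next, a_next)
        mix = lam * torch.minimum(q1, q2) + (1.0 - lam) * torch.maximum(q1, q2)
        return r + gamma * mix

# Both critics then regress onto this target, e.g.
#   loss_j = ((q_j(s, a) - critic_target(r, s_next, ...)) ** 2).mean()  for j = 1, 2.
```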
**Implicit Policy.** As discussed in Sections 1 and 2, a deterministic or Gaussian policy may miss important rewarding actions, or even concentrate on inferior average actions. For online off-policy RL, Yue et al. (2020) shows the benefit of introducing a Gaussian policy mixed with an implicit distribution. Generalizing that idea to offline RL, we train a fully implicit policy, which transforms a given noise distribution into the state-conditional action distribution via a neural network, reminiscent of the generator in CGAN. Specifically, given s,

$$a \sim \pi_{\phi}(\cdot \mid s) = \pi_{\phi}(s, z), \quad z \overset{iid}{\sim} p_z(z), \qquad (5)$$

where πϕ is a deterministic function and pz(z) is some noise distribution. As shown in Figures 1d and 1e, an implicit policy has a stronger capability of learning a multi-modal policy, if needed.

**State-action Joint Regularization.** As discussed in Section 2, directly sampling from dϕ(s) is infeasible in offline RL. Motivated by Equation 2, we approximate dϕ(s) by db(s) in our implementation, which allows an easy sampling scheme that simply samples s from D, without needing an importance-sampling correction that may possess high variance (Liu et al., 2018). Such an approximation is classical in the off-policy and offline RL literature (Degris et al., 2012; Levine et al., 2020), with demonstrated empirical effectiveness in offline RL (Fu et al., 2019), besides its usage in off-policy RL (Silver et al., 2014; Schulman et al., 2015; Lillicrap et al., 2016). With this simplification, we can efficiently maximize the similarity between db(s, a) and dϕ(s, a) with respect to a sample-based estimate of some statistical divergence, such as the Jensen–Shannon divergence. Using the notation of GAN, the generator sample x and the data sample y are defined as

$$x \triangleq (\tilde{s}, \tilde{a}), \quad \tilde{s} \sim \mathcal{D}, \; \tilde{a} \sim \pi_{\phi}(\cdot \mid \tilde{s}); \qquad y \triangleq (s, a) \sim \mathcal{D}, \quad \tilde{s} \text{ independent of } s. \qquad (6)$$

We then constrain this statistical-divergence value, named the generator loss Lg(ϕ), in the training of the actor. In our instantiation of approximately minimizing JSD via GAN, Lg = Ex[log(1 − Dw(x))].

Our choice of directly matching state-action joint visitations mitigates the issue of uncontrollable extrapolation errors in the action-value function estimate, since proximity of the visitation frequencies reduces the chance of estimating the action-values of state-action pairs far from the offline dataset. Furthermore, the problem in the policy-matching approach of fitting each state-conditional action distribution on only one data point is circumvented, since the state-action pairs (si, ai) in the offline dataset are all viewed as samples from the joint visitation frequency, instead of each pair being separately viewed as one sample from the state-conditional distribution, i.e., ai ∼ π(· | si). Besides, the state-action-visitation joint-matching approach implicitly encourages smoothness of the state-action mapping, namely, similar states should have similar actions. This is because, for example, the discriminator in GAN can easily discriminate as "fake" a generator sample x whose state is similar to a data sample's but whose action is very different. This smoothness feature helps ensure a reliable generalization of our policy to unseen states.

**Actor-Training Target.** To prevent the accumulation of accidental errors in the training process and in the approximation of the action-value function, we adopt the strategy of Kumar et al. (2019) to train the policy with respect to a conservative estimate of the action-values. Specifically, we exploit the double-Q structure and use the minimum of the two action-value estimates for policy improvement. For ease of optimization, we use the Lagrangian form of the constrained optimization problem and penalize the generator loss Lg(ϕ) while improving the policy. Our policy-training target is

$$\arg\min_{\phi}\; -\mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_{\phi}(\cdot \mid s)} \left[ \min_{j=1,2} Q_{\theta_j}(s, a) \right] + \alpha \cdot \mathcal{L}_g(\phi), \qquad (7)$$

where α is a fixed Lagrange multiplier.
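A minimal sketch of the implicit policy in Equation 5 and the actor objective in Equation 7 follows. The network sizes, noise dimension, action bound, and the placeholder generator_loss callable are assumptions for illustration rather than our exact architecture.

```python
# Sketch of the implicit policy (Equation 5) and the actor objective (Equation 7).
import torch
import torch.nn as nn

class ImplicitPolicy(nn.Module):
    """Deterministic map (s, z) -> a with z ~ N(0, I), so that a ~ pi_phi(. | s)."""
    def __init__(self, state_dim, action_dim, noise_dim=10, max_action=1.0):
        super().__init__()
        self.noise_dim, self.max_action = noise_dim, max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())

    def forward(self, state):
        z = torch.randn(state.shape[0], self.noise_dim, device=state.device)  # z ~ p_z(z)
        return self.max_action * self.net(torch.cat([state, z], dim=1))

def actor_loss(states, policy, q1, q2, generator_loss, alpha=0.1):
    """Equation 7: maximize min_j Q_theta_j under the alpha-weighted joint-matching penalty L_g."""
    actions = policy(states)                                     # a ~ pi_phi(. | s), s ~ D
    q_min = torch.minimum(q1(states, actions), q2(states, actions))
    return -q_min.mean() + alpha * generator_loss(states, actions)
```

Because the noise z is redrawn on every forward pass, repeated calls at the same state produce different actions, which is what allows the policy to represent multiple action modes.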
At test time, we follow Fujimoto et al. (2019) and Kumar et al. (2019) to first sample 10 actions from πϕ and then execute the action that maximizes Qθ1.

The discriminator is trained to better distinguish samples from dϕ(s, a) and from db(s, a). It aids the matching of the state-action-visitation frequencies by outputting the generator loss Lg(ϕ). For example, when approximately minimizing JSD via GAN, the discriminator outputs the probability that the input, either the x or the y in Equation 6, comes from db(s, a). In this case, the discriminator is trained to minimize the error in assigning x as "fake" and y as "true," i.e., the inner maximization of Equation 3.

4.2 ENHANCING COMPONENTS

In this section we present three enhancing components that further improve our basic algorithm.

**Multiple Action-samples at Bellman Backup.** The expectation in the critic learning target of Equation 4 can be better estimated, in the sense of a smaller sample variance, by averaging over Na actions a′ at each s′ in the mini-batch, rather than just one a′ as in Section 4.1.

**State-smoothing at Bellman Backup.** Due to the stochastic nature of the environmental transition, multiple next states s′ are possible after taking action a at state s, while the offline dataset D contains only one such s′. Since the agent is unable to interact with the environment to collect more data in offline RL, local exploration (Sinha et al., 2021) in the state space appears as an effective strategy to regularize the Bellman backup by taking into consideration states close to the records in the offline dataset. We assume that (1) a small transformation to a state results in states physically plausible in the underlying environment (as in Sinha et al. (2021)); and (2) when the state space is continuous, the transition kernel P(· | s, a) is locally continuous and centers at the recorded s′ in the dataset. With these assumptions, we propose to fit Qθ(s, a) on the value of a small region around the recorded next state s′. Specifically, with a pre-specified standard deviation σB, we sample around s′ as ŝ = s′ + ϵ, ϵ ∼ N(0, σB²I), and modify Equation 4 as

$$\widetilde{Q}(s, a) \triangleq r(s, a) + \gamma\, \mathbb{E}_{\hat{s}}\, \mathbb{E}_{\hat{a} \sim \pi_{\phi'}(\cdot \mid \hat{s})} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(\hat{s}, \hat{a}) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(\hat{s}, \hat{a}) \right], \qquad (8)$$

where NB samples ŝ are drawn to estimate the expectation. This strategy is equivalent to using a Gaussian distribution centered at s′ to approximate the otherwise non-smooth δ_{s′} transition kernel manifested in the offline dataset. A similar technique is considered as the target-policy smoothing regularization in Fujimoto et al. (2018), though the smoothing therein is applied to the target action.

**State-smoothing at Joint-matching.** As described in Section 4.1, we approximate dϕ(s) by dD(s) in the sample-based estimate of the chosen statistical distance. However, dD(s) is in essence discrete, and the idea of smoothing the discrete state distribution can be applied again to provide a better coverage of the state space. This design explicitly encourages a predictable and smooth behavior at states unseen in the offline dataset. Specifically, with some pre-specified σJ², we modify the sampling scheme of s̃ in Equation 6 as

$$\tilde{s} \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}\!\left(0, \sigma_J^2 I\right), \quad \tilde{s} \leftarrow \tilde{s} + \epsilon. \qquad (9)$$

Our strategy is akin to sampling from a kernel density approximation (Wasserman, 2006) of dϕ(s) with data points s ∈ D and a radial-basis kernel of bandwidth σJ.
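The sketch below assembles the joint-matching samples of Equation 6 with the state-smoothing of Equation 9 and returns the discriminator loss and the generator loss Lg used by the actor. The discriminator interface, the smoothing scale, and the binary-cross-entropy form of the discriminator loss are illustrative assumptions.

```python
# Sketch of state-action joint matching with state-smoothing (Equations 6 and 9).
import torch
import torch.nn as nn

bce = nn.BCELoss()

def joint_matching_losses(s, a, fresh_s, policy, discriminator, sigma_j=3e-4):
    """(s, a): pairs from D (the data sample y); fresh_s: independently resampled states from D."""
    s_tilde = fresh_s + sigma_j * torch.randn_like(fresh_s)     # Equation 9: smooth the state
    a_tilde = policy(s_tilde)                                   # a_tilde ~ pi_phi(. | s_tilde)
    x = torch.cat([s_tilde, a_tilde], dim=1)                    # generator sample x ~ d_phi(s, a)
    y = torch.cat([s, a], dim=1)                                # data sample y ~ d_b(s, a)

    # Discriminator loss: label y as "true" (from d_b) and x as "fake" (from d_phi).
    d_loss = bce(discriminator(y), torch.ones(y.shape[0], 1)) + \
             bce(discriminator(x.detach()), torch.zeros(x.shape[0], 1))

    # Generator loss L_g(phi) = E_x[log(1 - D_w(x))], penalized in the actor objective (Equation 7).
    g_loss = torch.log(1.0 - discriminator(x) + 1e-8).mean()
    return d_loss, g_loss
```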
Algorithm 1 shows the main steps of our algorithm, instantiated by approximately minimizing JSD via GAN; a detailed version is in Appendix B, and implementation details are in Appendix D.2.

**Algorithm 1 State-Action Joint Regularized Implicit Policy (GAN-JSD, Main Steps)**

Initialize the policy network πϕ, critic networks Qθ1 and Qθ2, and discriminator network Dw.
for each iteration do
  Sample a mini-batch of transitions B = {(s, a, r, s′)} ∼ D.
  From Equation 8, train the critics by arg min_{θj} (Qθj(s, a) − Q̃(s, a))² over (s, a) ∈ B, for j = 1, 2.
  Calculate the generator loss Lg using Dw and the x, y in Equation 6, with the state-smoothing of Section 4.2.
  Optimize the policy network πϕ by Equation 7.
  Optimize the discriminator Dw to maximize E_{y∼dD(·)}[log Dw(y)] + E_x[log(1 − Dw(x))].
end for
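For orientation, one training iteration in the spirit of Algorithm 1 could be organized as sketched below, reusing the illustrative joint_matching_losses and actor_loss helpers from the earlier sketches. All component names, hyperparameter values, and the in-batch permutation used as a stand-in for resampling fresh states are assumptions rather than our exact implementation.

```python
# One training iteration in the spirit of Algorithm 1; components and constants are illustrative.
import torch

def train_iteration(batch, nets, opts, joint_matching_losses, actor_loss,
                    gamma=0.99, lam=0.75, sigma_b=3e-4, n_b=2, n_a=2, beta=0.005):
    s, a, r, s_next = batch                                     # mini-batch (s, a, r, s') ~ D

    # Critic update with the smoothed target of Equation 8 (n_b smoothed states, n_a actions each).
    with torch.no_grad():
        targets = []
        for _ in range(n_b):
            s_hat = s_next + sigma_b * torch.randn_like(s_next)         # s_hat = s' + eps
            for _ in range(n_a):
                a_hat = nets["policy_target"](s_hat)                    # a_hat ~ pi_phi'(. | s_hat)
                q1, q2 = nets["q1_target"](s_hat, a_hat), nets["q2_target"](s_hat, a_hat)
                targets.append(lam * torch.minimum(q1, q2) + (1 - lam) * torch.maximum(q1, q2))
        q_target = r + gamma * torch.stack(targets).mean(dim=0)
    for name in ("q1", "q2"):
        critic_loss = ((nets[name](s, a) - q_target) ** 2).mean()
        opts[name].zero_grad(); critic_loss.backward(); opts[name].step()

    # Joint matching (Equations 6 and 9), then actor (Equation 7) and discriminator updates.
    fresh_s = s[torch.randperm(s.shape[0])]                     # stand-in for resampling states from D
    d_loss, g_loss = joint_matching_losses(s, a, fresh_s, nets["policy"], nets["discriminator"])
    pi_loss = actor_loss(s, nets["policy"], nets["q1"], nets["q2"], lambda *_: g_loss)
    opts["policy"].zero_grad(); pi_loss.backward(); opts["policy"].step()
    opts["discriminator"].zero_grad(); d_loss.backward(); opts["discriminator"].step()

    # Soft-update the target networks.
    for src, dst in (("q1", "q1_target"), ("q2", "q2_target"), ("policy", "policy_target")):
        for p, p_t in zip(nets[src].parameters(), nets[dst].parameters()):
            p_t.data.mul_(1.0 - beta).add_(beta * p.data)
```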
5 EXPERIMENTS

As discussed in Section 1, we consider the flexibility of the policy class, the regularization, and the smoothness of the learned policy to be important. To this end, we develop our state-action joint regularized implicit policy, a framework that supports training an expressive implicit policy via an effective sample-based regularization without an explicit modelling of the behavior policy, and with techniques encouraging its smoothness. In this section we test an instantiation of our framework on continuous-control RL tasks. Specifically, we first show the effectiveness of the implicit policy, the state-action-visitation matching, and our full algorithm (Section 5.1). We then show in an ablation study (Section 5.2) the contributions of several building blocks.

**Implementation.** We use GAN to approximately control the Jensen–Shannon divergence between the state-action visitation frequencies. Our source code builds on the official BCQ repository and largely follows its network architectures. Our implementation of the GAN structure and the hyperparameter choices follow the literature (Goodfellow et al., 2014; Mirza & Osindero, 2014; Radford et al., 2016; Salimans et al., 2016; White, 2016; Goodfellow, 2017). To mimic a hyperparameter-agnostic setting, we minimize hyperparameter tuning across datasets. Implementation details and the hyperparameter setting of our full algorithm with GAN joint-matching are in Appendix D.2.1.

5.1 MAIN RESULTS

To validate the effectiveness of our framework, we test three implementations of the GAN instantiation: (1) the basic algorithm (Section 4.1) regularized by the classical policy matching¹ ("GAN-Cond:Basic"), (2) the basic algorithm regularized by the state-action-visitation joint matching ("GAN-Joint:Basic"), and (3) the full algorithm, which adds the state-smoothing techniques onto our basic algorithm ("GAN-Joint"). We compare them with the state-of-the-art (SOTA) offline-RL algorithms BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019) (BRAC-v: value penalty), BCQ (Fujimoto et al., 2019), and CQL (Kumar et al., 2020), together with offline soft actor-critic (SAC) (Haarnoja et al., 2018a). We re-run CQL using the official source code (details in Appendix D.2.2). Results for the other algorithms are from Fu et al. (2021).

Table 1 validates the effectiveness of our full algorithm, together with the efficacy of the implicit policy, the state-action-visitation matching, and the state-smoothing. Our full algorithm on average outperforms the baseline algorithms, and its performance is relatively stable across tasks and datasets of diverse nature. Our full algorithm especially shows robust and comparatively good performance on the high-dimensional Adroit tasks and the Maze2D tasks that are collected by non-Markovian policies, which may traditionally be considered hard in offline RL. On the Gym-MuJoCo domain, our full algorithm shows its ability to learn from datasets collected by a mixture of behavior policies, and from medium-quality examples, which are the datasets likely to be encountered in real-world applications. These results support our design of an implicit policy, matching the state-action visitation frequencies, and explicit state-smoothing.

Comparing "GAN-Cond:Basic" with the baseline algorithms, especially BEAR and CQL that use Gaussian policies, we see that an implicit policy does in general help the performance. This aligns with our intuition in Sections 1 and 2 on the incapability of the uni-modal Gaussian policy to capture multiple action modes. To verify the gain of our joint-visitation-matching approach over the classical policy matching, apart from comparing our full algorithm with BEAR and BRAC, the SOTA policy-matching algorithms, we further compare "GAN-Joint:Basic" with "GAN-Cond:Basic." We see that on 11 out of 16 datasets "GAN-Joint:Basic" wins against "GAN-Cond:Basic," while results on the other datasets are close. This empirical result may be attributed to the advantages of state-action-visitation matching, e.g., good regularization and smoothness of the state-action mapping (Section 4.1). Comparing "GAN-Joint" with "GAN-Joint:Basic," we see that our state-smoothing techniques do in general help the performance. This performance gain may be related to a smoother action choice at states not covered by the offline dataset, and a more regularized Bellman backup. Note that different datasets may require different smoothing strengths, and thus this comparison could potentially be more significant if one were allowed to tune the smoothing hyperparameter per dataset.

¹ The same state is used in the generator and data sample, i.e., s̃ = s. The implementation follows Wu et al. (2019).

Table 1: Normalized returns for experiments on the D4RL suite of tasks. We perform experiments on tasks from the Maze2D, Gym-MuJoCo, and Adroit domains. High average scores and low average ranks are desirable. [Per-task rows for the 16 Maze2D, Gym-MuJoCo, and Adroit datasets are not reproduced here; only the summary rows are shown.]

| Task Name | SAC-off | BEAR | BRAC-v | BCQ | CQL | GAN-Cond:Basic | GAN-Joint:Basic | GAN-Joint |
|---|---|---|---|---|---|---|---|---|
| Average Score | 10.0 | 44.1 | 24.1 | 51.4 | 47.1 | **55.6** | **59.6** | **61.4** |
| Average Rank | 6.8 | 4.8 | 5.5 | 4.3 | 4.6 | **3.9** | **2.9** | **3.2** |

5.2 ABLATION STUDY

The ablation study serves to answer the following questions: (a) Is an implicit policy better than a Gaussian policy in our framework? (b) Does state-smoothing at joint-matching help? (c) Does state-smoothing at the Bellman backup matter? (d) How important is the standard deviation of the Gaussian noise injected in state-smoothing? Unless stated otherwise, the hyperparameters for all algorithmic variants on all datasets are given in Table 2.

**(a):** We compare the performance of our basic algorithm with a variant in which the implicit policy is replaced by a Gaussian policy. To make a fair comparison, the critics, discriminator, and hyperparameter setting remain the same. Technical details are in Appendix D.2.3. Table 4 presents the results. On 11 out of 16 datasets, our basic GAN-joint-matching algorithm has a higher average return than the Gaussian-policy variant. This empirical result coincides with our intuition in Section 4.1 and the results in Section 5.1 that a Gaussian policy is less flexible to capture all
the rewarding actions, of which an implicit policy is likely to be capable. Appendix F visualizes this comparison and shows in plots that Gaussian policies do leave out action modes in offline RL.

**(b):** We compare our full algorithm with a variant that applies no state-smoothing in the matching of state-action visitations. The results are shown in Table 5, where our full algorithm overall performs better than the no-state-smoothing variant. This performance gain may be attributed to a better coverage of the state space by the smoothed state distribution (Section 4.2), which is related to a more predictable and smoother action choice at unseen states. As stated previously, this comparison could potentially be more significant if one were allowed to tune the σJ parameter per dataset.

**(c):** We compare our full algorithm with a variant that applies no state-smoothing in the Bellman backup. Table 6 shows the results. Again, our full algorithm overall performs better than the no-state-smoothing version, showing the benefit of smoothing the empirical transition kernel δ_{s′} (Section 4.2), e.g., taking the stochasticity of state transitions into account. As before, we use the same smoothing strength across all datasets, while a per-dataset tuning of the σB parameter may sharpen the distinction.

**(d):** To simplify hyperparameter tuning, in the actual implementation we fix σB = σJ ≜ σ (see Appendix D.2.1). To test the robustness to the σ hyperparameter, we run our full algorithm under σ ∈ {1 × 10⁻², 3 × 10⁻³, 1 × 10⁻³, 3 × 10⁻⁴, 1 × 10⁻⁴, 0}. Table 7 shows the normalized returns. We see that our algorithm is relatively insensitive to the setting of σ, especially in the range σ ∈ [1 × 10⁻⁴, 1 × 10⁻³], where the performance varies little with σ. A too-small σ cannot provide enough smoothing to the state distributions. On the contrary, a too-large σ may strongly distort the information contained in the offline dataset, such as the state-transition kernel. In both cases, a degradation in the overall performance is expected.

6 CONCLUSION AND FUTURE WORK

In this paper, we develop a framework that supports learning a flexible yet well-regularized policy in offline RL.
Specifically, we train a fully implicit policy via a regularization on the difference between the state-action visitation frequency induced by the current policy and that induced by the data-collecting behavior policy. An effective instantiation of our framework through the GAN structure is provided for approximately minimizing the JSD between the visitation frequencies. Other divergence metrics, such as the MMD, may also be applied and are left for future work. We further augment our algorithm with explicit state-smoothing techniques to enhance its generalizability to states beyond the offline dataset. On the theoretical side, we show the equivalence between policy matching and state-action-visitation matching, and hence the compatibility of many prior algorithms with our framework. We note that our implementation-wise simplification of approximating the current policy's state visitation frequency by the behavior policy's can be improved by a more tactful approximation of the current policy's state occupancy measure, which is left for future work. Nevertheless, the effectiveness of our framework and implementations is validated through extensive experiments and an ablation study on the D4RL dataset.

REPRODUCIBILITY STATEMENT

To help reproduce our empirical work, we provide a detailed algorithmic description in Appendix B and implementation details, including the hyperparameter choices and model architectures, in Appendix D. For reproducing our theoretical results, a step-by-step proof is provided in Appendix C.

REFERENCES

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An Optimistic Perspective on Offline Reinforcement Learning, 2020.

Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, abs/1701.07875, 2017.

Peter Baxendale. T. E. Harris's Contributions to Recurrent Markov Processes and Stochastic Flows. The Annals of Probability, 39(2):417–428, 2011. ISSN 00911798.

Marc G. Bellemare, Ivo Danihelka, Will Dabney, S. Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and R. Munos. The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv, abs/1705.10743, 2017.

Mikolaj Binkowski, Danica J. Sutherland, Michael Arbel, and A. Gretton. Demystifying MMD GANs. ArXiv, abs/1801.01401, 2018.

Catherine Cang, Aravind Rajeswaran, P. Abbeel, and M. Laskin. Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL. ArXiv, abs/2106.09119, 2021.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, M. Laskin, P. Abbeel, A. Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling. ArXiv, abs/2106.01345, 2021.

Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. arXiv: Machine Learning, 2013.

T. Degris, Martha White, and R. Sutton. Off-Policy Actor-Critic. ArXiv, abs/1205.4839, 2012.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. J. Mach. Learn. Res., 6:503–556, 2005.

Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyré. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681–2690, 2019.

Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing Bottlenecks in Deep Q-learning Algorithms. In ICML, 2019.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, 2021.
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1587–1596. PMLR, 10–15 Jul 2018.

Scott Fujimoto, David Meger, and Doina Precup. Off-Policy Deep Reinforcement Learning without Exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2052–2062. PMLR, 09–15 Jun 2019.

Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline A/B Testing for Recommender Systems. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018.

I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv, abs/1701.00160, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

Omer Gottesman, Fredrik D. Johansson, J. Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, C. Mosch, Li wei H. Lehman, M. Komorowski, A. Faisal, L. Celi, D. Sontag, and Finale Doshi-Velez. Evaluating Reinforcement Learning Algorithms in Observational Health Settings. ArXiv, abs/1805.12298, 2018.

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and Alex Smola. A Kernel Two-Sample Test. J. Mach. Learn. Res., 13:723–773, 2012.

Caglar Gulcehre, Sergio Gómez Colmenarejo, Ziyu Wang, Jakub Sygnowski, Thomas Paine, Konrad Zolna, Yutian Chen, Matthew Hoffman, Razvan Pascanu, and Nando de Freitas. Addressing Extrapolation Error in Deep Offline Reinforcement Learning, 2021.

Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved Training of Wasserstein GANs. In NIPS, 2017.

Tuomas Haarnoja, Haoran Tang, P. Abbeel, and Sergey Levine. Reinforcement Learning with Deep Energy-Based Policies. In ICML, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1861–1870. PMLR, 10–15 Jul 2018a.

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, G. Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, P. Abbeel, and Sergey Levine. Soft Actor-Critic Algorithms and Applications. ArXiv, abs/1812.05905, 2018b.

H. V. Hasselt. Double Q-learning. In NIPS, 2010.

Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, A. Lapedriza, Noah J. Jones, S. Gu, and Rosalind W. Picard. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog. ArXiv, abs/1907.00456, 2019.
Nathan Kallus and Angela Zhou. Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning, 2020.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization, 2014.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2013.

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. ArXiv, abs/2110.06169, 2021.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1179–1191. Curran Associates, Inc., 2020.

Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and D. Vetrov. Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics. ArXiv, abs/2005.04269, 2020.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch Reinforcement Learning, pp. 45–73. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-27645-3. doi: 10.1007/978-3-642-27645-3_2.

Amy N. Langville and Carl D. Meyer. Deeper inside PageRank. Internet Mathematics, 1(3):335–380, 2004. ISSN 1542-7951.

Romain Laroche and P. Trichelair. Safe Policy Improvement with Baseline Bootstrapping. In ICML, 2019.

Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, J. Pineau, and Kee-Eung Kim. OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. ArXiv, abs/2106.10783, 2021a.

Kimin Lee, M. Laskin, A. Srinivas, and P. Abbeel. SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning. In ICML, 2021b.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems, 2020.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and B. Póczos. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In NIPS, 2017.

T. Lillicrap, Jonathan J. Hunt, A. Pritzel, N. Heess, T. Erez, Yuval Tassa, D. Silver, and Daan Wierstra. Continuous Control with Deep Reinforcement Learning. CoRR, abs/1509.02971, 2016.

Jianhua Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37:145–151, 1991.

Long-Ji Lin. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning, 8(3–4):293–321, 1992.

Q. Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation. In NeurIPS, 2018.

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Shane Gu. Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization. ICLR, abs/2006.03647, 2021.

Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. Society for Industrial and Applied Mathematics, USA, 2000. ISBN 0898714540.

Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. ArXiv, abs/1411.1784, 2014.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normalization for Generative Adversarial Networks. ArXiv, abs/1802.05957, 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning, 2013.

A. Mousavi, Lihong Li, Qiang Liu, and Denny Zhou. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning. ArXiv, abs/2003.11126, 2020.

A. Müller. Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability, 29:429–443, 1997.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and D. Schuurmans. AlgaeDICE: Policy Gradient from Arbitrary Experience. ArXiv, abs/1912.02074, 2019.

Xinkun Nie, Emma Brunskill, and Stefan Wager. Learning When-to-Treat Policies. Journal of the American Statistical Association, 116:392–409, 2019.

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. In Proceedings of the 7th International World Wide Web Conference, pp. 161–172, Brisbane, Australia, 1998.

G. Peyré and Marco Cuturi. Computational Optimal Transport. Found. Trends Mach. Learn., 11:355–607, 2019.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, abs/1511.06434, 2016.

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, J. Schulman, E. Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. ArXiv, abs/1709.10087, 2018.

Tim Salimans, I. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. In NIPS, 2016.

Erwin Schrödinger. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. Annales de l'institut Henri Poincaré, 2:269–310, 1932.

J. Schulman, Sergey Levine, P. Abbeel, Michael I. Jordan, and Philipp Moritz. Trust Region Policy Optimization. ArXiv, abs/1502.05477, 2015.

Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing. The Annals of Statistics, 41(5):2263–2291, 2013. ISSN 00905364, 21688966.

Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning, 2020.

D. Silver, Guy Lever, N. Heess, T. Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic Policy Gradient Algorithms. In ICML, 2014.

Samarth Sinha, Ajay Mandlekar, and Animesh Garg. S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning, 2021.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning Structured Output Representation using Deep Conditional Generative Models. In NIPS, 2015.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. ISBN 0262039249.

Adith Swaminathan, A. Krishnamurthy, Alekh Agarwal, Miroslav Dudík, J. Langford, D. Jose, and I. Zitouni. Off-policy Evaluation for Slate Recommendation. In NIPS, 2017.

H. Tseng, Yi Luo, Sunan Cui, Jen-Tzung Chien, R. Ten Haken, and I. Naqa. Deep Reinforcement Learning for Automated Radiation Adaptation in Lung Cancer. Medical Physics, 44:6690–6705, 2017.

Núria Armengol Urpí, S. Curi, and A. Krause. Risk-Averse Offline Reinforcement Learning. ArXiv, abs/2102.05371, 2021.
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S. Merel, Jost Tobias Springenberg, Scott E. Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, and Nando de Freitas. Critic Regularized Regression. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7768–7778. Curran Associates, Inc., 2020.

Larry Wasserman. All of Nonparametric Statistics. Springer, 2006.

Tom White. Sampling Generative Networks. arXiv: Neural and Evolutionary Computing, 2016.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior Regularized Offline Reinforcement Learning, 2019.

Yue Wu, Shuangfei Zhai, Nitish Srivastava, J. Susskind, Jian Zhang, R. Salakhutdinov, and Hanlin Goh. Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning. In ICML, 2021.

Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative Offline Model-Based Policy Optimization. ArXiv, abs/2102.08363, 2021.

Yuguang Yue, Zhendong Wang, and Mingyuan Zhou. Implicit Distributional Reinforcement Learning. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and K. Takeda. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access, 8:58443–58469, 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and D. Schuurmans. GenDICE: Generalized Offline Estimation of Stationary Values. ArXiv, abs/2002.09072, 2020.

Huangjie Zheng and Mingyuan Zhou. Exploiting Chain Rule and Bayes' Theorem to Compare Probability Distributions, 2021.

A RELATED WORK

**Offline Reinforcement Learning.** Three major themes currently exist in offline-RL research. The first focuses on more robustly estimating the action-value function (Agarwal et al., 2020; Gulcehre et al., 2021) or providing a conservative estimate of the Q-values (Kumar et al., 2020; Yu et al., 2021; Sinha et al., 2021), which may better guide the policy optimization process. The second research theme aims at designing a tactful behavior-cloning scheme so as to learn only from "good" actions in the offline dataset (Wang et al., 2020; Chen et al., 2021). Most similar to our approach, the third line of research tries to constrain the current policy to be close to the behavior policy during the training process, under the notion that Q-value estimates at unfamiliar state-action pairs can be pathologically poor due to a lack of supervised training. Specifically, Kumar et al. (2019) and Wu et al. (2021) use a conditional variational autoencoder (CVAE) (Kingma & Welling, 2013; Sohn et al., 2015) to train a behavior-cloning policy that samples multiple actions at each state for calculating the MMD constraint. Wu et al. (2019), Siegel et al. (2020), and Cang et al. (2021) fit a (Gaussian) behavioral prior to the offline dataset trained by a (weighted) maximum-likelihood objective. Jaques et al. (2019) consider a pre-trained generative prior of human dialog data before applying KL-control to the current policy. Note that these works essentially constrain the distance between the current policy and the cloned behavior policy. Laroche & Trichelair (2019) assume a known stochastic data-collecting behavior policy.
Fujimoto et al. (2019) and Urpí et al. (2021) implicitly control the state-conditional action distribution by decomposing the action into a behavior-cloning component, trained by fitting a CVAE to the offline data, and a perturbation component, trained to optimize the (risk-averse) returns. Besides, some work, such as Wu et al. (2019), directly estimates and regularizes the divergence between the state-conditional action distributions. In this paper, we instead take the perspective of matching the state-action joint visitation frequencies, which avoids using a single point to estimate the divergence between distributions and removes the need for a good approximator to the behavior policy, while implicitly smoothing the mapping from states to actions (Section 1). Note that if we use the same states in constructing the generator samples and the data samples, i.e., s̃ = s in Equation 6, our implementation of state-action-visitation matching is fundamentally equivalent to matching the state-conditional action distributions, as in some prior work.

**Online Off-policy Reinforcement Learning.** A large class of modern online off-policy deep reinforcement learning algorithms trains the policy using an experience replay buffer (Lin, 1992), which stores the rollouts of past policies encountered in the training process (Mnih et al., 2013; Lillicrap et al., 2016; Haarnoja et al., 2017; Fujimoto et al., 2018; Haarnoja et al., 2018b; Kuznetsov et al., 2020; Yue et al., 2020; Lee et al., 2021b). This approach essentially uses the state-visitation frequency of past policies to approximate that of the current policy (Equation 2). We adopt this notion both in the policy-improvement step and in the implementation of state-action-visitation joint matching, with additional smoothing techniques to encourage the smoothness of the empirical transition kernel and of the approximated state-visitation frequency of the current policy (Section 4.2).

**Computational Distribution Matching.** Many computationally efficient algorithms exist for matching two probability distributions with respect to some statistical divergence. GAN (Goodfellow et al., 2014) approximately minimizes the Jensen-Shannon divergence between the model's distribution and the data-generating distribution. A similar adversarial training strategy is further applied to estimate, in a sample-based manner, a class of statistical divergences termed the integral probability metrics (Müller, 1997). For example, Arjovsky et al. (2017), Gulrajani et al. (2017), and Miyato et al. (2018) estimate the Wasserstein-1 distance by enforcing the Lipschitz norm of the witness function to be bounded by 1. Li et al. (2017) and Binkowski et al. (2018) consider the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) with learnable kernels. Bellemare et al. (2017) study the energy distance, an instance of the MMD (Sejdinovic et al., 2013). Another important class of statistical divergences, the optimal transport distances, is also studied under a computational lens. Peyré & Cuturi (2019) summarize recent developments in computational optimal transport. Cuturi (2013) and Feydy et al. (2019) study Sinkhorn-based divergences, which solve an entropically regularized form of the optimal transport problem (Schrödinger, 1932) and may result in a physically more feasible distance measure in some real-world applications. Zheng & Zhou (2021) contribute to this line of research by developing the conditional transport algorithm, which provides better control over the mode-covering and mode-seeking behaviors in distributional matching.
In this paper, we consider the classical GAN structure to approximately control the JSD between the state-action visitation frequencies, since the GAN structure is simple, effective, and well-studied. Other divergence metrics may also be applicable in our framework and are left for future work.

B FULL ALGORITHM

**Algorithm 2 State-Action Joint Regularized Implicit Policy (GAN-JSD, Detailed Version)**

**Input:** Learning rates ηθ, ηϕ, ηw; target smoothing factor β; noise distribution pz(z); policy network πϕ with parameter ϕ; critic networks Qθ1 and Qθ2 with parameters θ1, θ2; discriminator network Dw with parameter w; generator loss Lg; standard deviations σB, σJ; number of smoothed states NB; number of actions a′ at s′, Na; number of epochs for warm start Nwarm; policy frequency k.
**Output:** Learned neural-network parameters ϕ, θ1, θ2, w.
  {// Initialization}
  Initialize ϕ, θ1, θ2, w. Initialize the targets ϕ′ ← ϕ, θ1′ ← θ1, θ2′ ← θ2. Load dataset D.
  **for** each epoch **do**
    **for** each iteration within the current epoch **do**
      Sample a mini-batch of transitions B = {(s, a, r, s′)} ∼ D.
      {// Policy Evaluation}
      For each s′ ∈ B, sample NB smoothed states ŝ = s′ + ϵ, ϵ ∼ N(0, σB² I), for state smoothing.
      Sample Na corresponding actions â ∼ πϕ′(· | ŝ) for each ŝ.
      Calculate the target value
        Q̃(s, a) ≜ r(s, a) + γ · (1 / (NB × Na)) Σ_(ŝ,â) [ λ min_{j=1,2} Qθj′(ŝ, â) + (1 − λ) max_{j=1,2} Qθj′(ŝ, â) ].
      Minimize the critic loss with respect to θj, j = 1, 2, over (s, a) ∈ B, with learning rate ηθ:
        argmin_{θj} (1/|B|) Σ_{(s,a)∈B} ( Qθj(s, a) − Q̃(s, a) )².
      {// Policy Improvement}
      Resample |B| new states s̃ ∼ D, independent of s.
      Add state smoothing to s̃ with noise standard deviation σJ: s̃ ← s̃ + ϵ, ϵ ∼ N(0, σJ² I).
      Form the generator sample x ≜ (s̃, ã), ã ∼ πϕ(· | s̃), and the data sample y ≜ (s, a) ∈ B.
      Calculate the generator loss Lg(ϕ) = (1/|B|) Σ_x log(1 − Dw(x)) using x and the discriminator Dw.
      **if** iteration count % k == 0 **then**
        **if** epoch count < Nwarm **then**
          Optimize the policy with respect to Lg(ϕ) only, with learning rate ηϕ, i.e., argmin_ϕ α · Lg(ϕ).
        **else**
          Optimize the policy with learning rate ηϕ for the target
            argmin_ϕ −(1/|B|) Σ_{s∼B, a∼πϕ(·|s)} min_{j=1,2} Qθj(s, a) + α · Lg(ϕ).
        **end if**
      **else**
        Skip the policy-improvement step.
      **end if**
      {// Training the Discriminator}
      Optimize the discriminator Dw to maximize (1/|B|) Σ_y log Dw(y) + (1/|B|) Σ_x log(1 − Dw(x)) with respect to w, with learning rate ηw.
      {// Soft Update of the Target Networks}
      ϕ′ ← βϕ + (1 − β)ϕ′; θj′ ← βθj + (1 − β)θj′ for j = 1, 2.
    **end for**
  **end for**

C PROOFS

We follow the offline RL literature (Liu et al., 2018; Nachum et al., 2019; Kallus & Zhou, 2020; Mousavi et al., 2020; Zhang et al., 2020) to assume the following regularity condition on the MDP structure, which ensures ergodicity and that the limiting state-occupancy measure exists and equals the stationary distribution of the chain.

**Assumption 1 (Ergodicity of MDP).** The MDP M is ergodic, i.e., the Markov chains associated with any πb and any πϕ under consideration are positive Harris recurrent (Baxendale, 2011).

**Lemma 3.** Let A ∈ R^(N×N) be nonsingular, and let 0 ≠ b ∈ R^N and x = A⁻¹b ∈ R^N. Let ∆A ∈ R^(N×N) be an arbitrary perturbation of A. Assume that the norm on R^(N×N) satisfies ∥Ax∥ ≤ ∥A∥∥x∥ for all A and x. If (A + ∆A)(x + ∆x) = b and ∥∆A∥/∥A∥ < 1/κ(A), then

  ∥∆x∥/∥x∥ ≤ [ κ(A) ∥∆A∥/∥A∥ ] / [ 1 − κ(A) ∥∆A∥/∥A∥ ] = κ(A) / ( ∥A∥/∥∆A∥ − κ(A) ).
_Proof._ Let x̂ = x + ∆x. Since Ax = b,

  A(x + ∆x) + ∆A x̂ = b  ⟹  A ∆x + ∆A x̂ = 0  ⟹  ∆x = −A⁻¹ ∆A x̂.

Then

  ∥∆x∥ ≤ ∥A⁻¹∥ ∥∆A∥ ( ∥x∥ + ∥∆x∥ ) = κ(A) (∥∆A∥/∥A∥) ( ∥x∥ + ∥∆x∥ )
  ⟹  ( 1 − κ(A) ∥∆A∥/∥A∥ ) ∥∆x∥ ≤ κ(A) (∥∆A∥/∥A∥) ∥x∥
  ⟹  ∥∆x∥/∥x∥ ≤ [ κ(A) ∥∆A∥/∥A∥ ] / [ 1 − κ(A) ∥∆A∥/∥A∥ ] = κ(A) / ( ∥A∥/∥∆A∥ − κ(A) ),

since κ(A) ∥∆A∥/∥A∥ < 1 by assumption.
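As a quick illustration (and not part of the proof), the bound in Lemma 3 can be checked numerically. The snippet below assumes NumPy, uses the operator 1-norm as in the proofs of Theorems 1 and 2, and all variable names are ours.

```python
# Illustrative numerical check of the perturbation bound in Lemma 3 (not part of the proof).
import numpy as np

rng = np.random.default_rng(0)
N = 5
A = rng.normal(size=(N, N)) + N * np.eye(N)     # a generic, reasonably conditioned matrix
b = rng.normal(size=N)
x = np.linalg.solve(A, b)                       # x = A^{-1} b

delta_A = 1e-3 * rng.normal(size=(N, N))        # small perturbation Delta A
x_pert = np.linalg.solve(A + delta_A, b)        # solves (A + Delta A)(x + Delta x) = b

kappa = np.linalg.cond(A, p=1)                  # condition number in the 1-norm
rel_A = np.linalg.norm(delta_A, 1) / np.linalg.norm(A, 1)
assert kappa * rel_A < 1                        # hypothesis of Lemma 3

lhs = np.linalg.norm(x_pert - x, 1) / np.linalg.norm(x, 1)
rhs = kappa * rel_A / (1 - kappa * rel_A)
print(f"relative change {lhs:.2e} <= bound {rhs:.2e}")
assert lhs <= rhs
```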
_Proof of Theorem 1._ By ergodicity, dϕ(s), db(s) ∈ R^N uniquely exist. For d ∈ {dϕ, db} and the corresponding T ∈ {T^ϕ, T^b}, stationarity implies that d is an eigenvector of T⊤ associated with eigenvalue 1, i.e., d⊤T = d⊤ and hence (I − T⊤) d = 0, and furthermore 1⊤d = 1. Stacking the two conditions gives

  T̂ d = (1, 0, . . . , 0)⊤ ∈ R^(N+1),  where  T̂ ≜ [ 1⊤ ; I − T⊤ ] ∈ R^((N+1)×N).   (10)

Since T is a positive matrix, by the Perron–Frobenius theorem (Meyer, 2000), 1 is an eigenvalue of T⊤ with algebraic multiplicity, and hence geometric multiplicity, 1. The eigen-equation T⊤v = v has a unique solution v up to a constant multiplier. Since T⊤d = d, the eigenspace of T⊤ associated with eigenvalue 1 is span({d}). Hence dim ker(I − T⊤) = 1 ⟹ rank(I − T⊤) = N − 1 ⟹ rank(T̂) = N. The reason is that if ∃ v ≠ 0 s.t. T̂ v = 0, then

  T̂ v = ( 1⊤v ; (I − T⊤) v ) = 0  ⟹  v = c d for a scalar c ∈ R and 1⊤v = c 1⊤d = 0  ⟹  c = 0,

and hence v = c d = 0, which contradicts v ≠ 0.

Since rank(T̂) = N, dim{ v ∈ R^(N+1) : v⊤ T̂ = 0 } = 1. For such a v ≠ 0, ∃ i ∈ {2, . . . , N + 1} s.t. vi ≠ 0; WLOG, assume v_(N+1) ≠ 0. Let A ∈ R^(N×N) be the first N rows of T̂; then rank(A) = N. The reason is that if rank(A) < N, then ∃ w ≠ 0 s.t. w⊤A = 0, so (w⊤, 0) T̂ = w⊤A = 0; since v_(N+1) ≠ 0, (w⊤, 0)⊤ is not a constant multiple of v ⟹ dim ker(T̂⊤) ≥ 2, which contradicts rank(T̂) = N. Thus, we conclude that rank(A) = N ⟹ A is invertible.

Let e^(1) = (1, 0, . . . , 0)⊤ ∈ R^N. Then Equation 10 implies A d = e^(1). Plugging in dϕ, db, Tϕ, Tb and defining Aϕ, Ab analogously, we have Aϕ dϕ = e^(1) and Ab db = e^(1).

For Tb and Tϕ, we notice that, by Jensen's inequality,

  Σ_{j=1}^N |(Tb − Tϕ)_(ij)| ≤ Σ_{j=1}^N ∫_A P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) |πb(a_t | s^i) − πϕ(a_t | s^i)| da_t
  = ∫_A Σ_{j=1}^N P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) |πb(a_t | s^i) − πϕ(a_t | s^i)| da_t
  = ∫_A |πb(a_t | s^i) − πϕ(a_t | s^i)| da_t
  = ∥πb(· | s^i) − πϕ(· | s^i)∥_1.

Therefore, by assumption,

  ∥Tb − Tϕ∥_∞ = max_{i=1,...,N} Σ_{j=1}^N |(Tb − Tϕ)_(ij)| ≤ max_{i=1,...,N} ∥πb(· | s^i) − πϕ(· | s^i)∥_1 ≤ ϵ.

Let Aϕ = Ab + ∆A. For ∥∆A∥_1, we notice that

  ∥∆A∥_1 = ∥Aϕ − Ab∥_1 ≤ ∥(I − Tϕ⊤) − (I − Tb⊤)∥_1 = ∥(Tb − Tϕ)⊤∥_1 = ∥Tb − Tϕ∥_∞ ≤ ϵ.

Notice that the matrix 1-norm satisfies ∥Mv∥_1 ≤ ∥M∥_1 ∥v∥_1 for all matrices M and vectors v, that ∥Ab∥_1 ≥ 1, and that ∥db∥_1 = 1. Lemma 3 then implies that

  ∥dϕ − db∥_1 = ∥∆d∥_1 / ∥db∥_1 ≤ κ(Ab) / ( ∥Ab∥_1/∥∆A∥_1 − κ(Ab) ) ≤ ϵ κ(Ab) / ( 1 − ϵ κ(Ab) ) ≤ ϵ κmax / ( 1 − ϵ κmax ) → 0 as ϵ → 0.   (11)

_Proof of Theorem 2._ By ergodicity, dϕ(s) and dbk(s) ∈ R^N uniquely exist, ∀ k. For d ∈ {dϕ, db1, . . . , dbK} and the corresponding T ∈ {T^ϕ, T^b1, . . . , T^bK}, we follow the steps and notations in the proof of Theorem 1 to conclude that rank(A) = N ⟹ A is invertible and that A d = e^(1) ∈ R^N. Plugging in dϕ, dbk, Tϕ, Tbk and defining Aϕ, Abk similarly, we have Aϕ dϕ = e^(1) and Abk dbk = e^(1). For the transition matrix Tb induced by the mixture of policies πb, we have

  Tb,(i,j) = p_πb(S_(t+1) = s^j | S_t = s^i) = ∫_A P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) πb(a_t | s^i) da_t
  = Σ_{k=1}^K wk ∫_A P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) πbk(a_t | s^i) da_t = Σ_{k=1}^K wk Tbk,(i,j),

and therefore Tb = Σ_{k=1}^K wk Tbk.

For Tbk and Tϕ, as in the proof of Theorem 1, by Jensen's inequality,

  Σ_{j=1}^N |(Tbk − Tϕ)_(ij)| ≤ ∥πbk(· | s^i) − πϕ(· | s^i)∥_1.

Therefore, by assumption,

  ∥Tbk − Tϕ∥_∞ = max_{i=1,...,N} Σ_{j=1}^N |(Tbk − Tϕ)_(ij)| ≤ max_{i=1,...,N} ∥πbk(· | s^i) − πϕ(· | s^i)∥_1 ≤ ϵ.

Let Aϕ = Abk + ∆Abk. For ∥∆Abk∥_1, we have

  ∥∆Abk∥_1 = ∥Aϕ − Abk∥_1 ≤ ∥Tbk⊤ − Tϕ⊤∥_1 = ∥Tbk − Tϕ∥_∞ ≤ ϵ.

Note that, ∀ k, ∥Abk∥_1 ≥ 1, as in the proof of Theorem 1, and ∥dbk∥_1 = 1. Lemma 3 implies that

  ∥dϕ − dbk∥_1 = ∥∆dk∥_1 / ∥dbk∥_1 ≤ ϵ κ(Abk) / ( 1 − ϵ κ(Abk) ) ≤ ϵ κmax,k / ( 1 − ϵ κmax,k ).

Thus, for the relative distance between dϕ and db, we have

  ∥db∥_1 = Σ_{i=1}^N db(s^i) = Σ_{i=1}^N Σ_{k=1}^K wk dbk(s^i) = Σ_{k=1}^K wk Σ_{i=1}^N dbk(s^i) = Σ_{k=1}^K wk = 1,

and therefore

  ∥dϕ − db∥_1 / ∥db∥_1 = ∥dϕ − db∥_1 = ∥ Σ_{k=1}^K ( wk dϕ − wk dbk ) ∥_1 ≤ Σ_{k=1}^K wk ∥dϕ − dbk∥_1 ≤ Σ_{k=1}^K wk ϵ κmax,k / ( 1 − ϵ κmax,k ) ≤ ϵ κmax / ( 1 − ϵ κmax ) → 0, as ϵ → 0.

Plugging wk = 1/K, ∀ k ∈ {1, . . . , K}, into the second-to-last inequality, we get

  ∥dϕ − db∥_1 ≤ (1/K) Σ_{k=1}^K ϵ κmax,k / ( 1 − ϵ κmax,k ) → 0, as ϵ → 0.

_Remark._ In general, db⊤ Tb = Σ_{k=1}^K wk² dbk⊤ + Σ_{i≠j} wi wj dbi⊤ Tbj ≠ db⊤. One sufficient condition is dbi⊤ Tbj = dbi⊤, ∀ i, j ⟹ dbi = dbj, ∀ i, j ⟹ πbi = πbj, ∀ i, j, similar to Ho & Ermon (2016). In such a case, πb reduces to a single policy, not a mixture, and Theorem 1 applies.

D TECHNICAL DETAILS

D.1 TOY EXPERIMENT

Denoting the total sample size by Ntotal, we follow the convention to construct the eight-Gaussian dataset as in Algorithm 3. In our experiments we use Ntotal = 2000.
**Algorithm 3 Constructing the Eight-Gaussian Dataset**

**Input:** Total sample size Ntotal.
**Output:** Generated dataset DGaussian.
  **while** dataset size < Ntotal **do**
    Draw a random center c uniformly from { (√2, 0), (−√2, 0), (0, √2), (0, −√2), (1, 1), (1, −1), (−1, 1), (−1, −1) }.
    Sample a datapoint x = (x, y) ∼ N(c, 2 × 10⁻⁴ · I2).
    Store x in the dataset.
  **end while**

We are interested in the 2-D eight-Gaussian dataset because (a) the conditional distribution p(y | x) is multi-modal at many x; and (b) interpolation is needed to fill in the gaps between the Gaussian centers, where a smooth interpolation into a circle is naturally expected.

To cast this dataset into the offline reinforcement learning setting, we define x as the state and the corresponding y as the action. Note that the behavior-cloning algorithm does not require information about the reward, the next state, or the episodic termination. Hence, the generated dataset DGaussian can serve as an offline RL dataset readily applicable to training behavior-cloning policies.

To compare the ability to approximate the behavior policy by the KL loss and the JSD loss, by the Gaussian policy and the implicit policy, and by conditional-distribution matching and joint-visitation matching, we fit a conditional VAE ("CVAE"), a Gaussian-generator conditional GAN ("G-CGAN"), and a conditional GAN ("CGAN") using the conditional-distribution matching approach similar to Kumar et al. (2019). We also fit a conditional GAN ("GAN") using the basic joint-visitation matching strategy. As discussed in Appendix A, the major distinction between "CGAN" and "GAN" is that the former uses the same states in constructing the generator samples and the data samples, while the latter resamples states.

The network architecture of our conditional VAE is as follows.

Conditional Variational Auto-encoder (CVAE) in the Toy Experiment
  Encoder:
    Linear(state_dim+action_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    mean = Linear(H//2, latent_dim)
    log_std = Linear(H//2, latent_dim)
  Decoder:
    Linear(state_dim+latent_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    Linear(H//2, action_dim)

with hidden dimension H = 100 and latent dimension latent_dim = 50. The CVAE is trained for 1200 epochs with a mini-batch size of 100 and random seed 0, using the mean-squared error as the reconstruction loss and the Gaussian-case closed-form formula in Kingma & Welling (2013) for the KL term.

The network architecture of our conditional GAN, used in "CGAN" and "GAN," is as follows.

Conditional Generative Adversarial Nets (CGAN) in the Toy Experiment
  Generator:
    Linear(state_dim+z_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    Linear(H//2, action_dim)
  Discriminator:
    Linear(state_dim+action_dim, H)
    LeakyReLU(0.1)
    Linear(H, H//2)
    LeakyReLU(0.1)
    Linear(H//2, 1)

where the structure of BatchNorm1d and LeakyReLU follows Radford et al. (2016). Here we again use H = 100 and z_dim = 50. The conditional GAN is trained for 2000 epochs with a mini-batch size of 100 and random seed 0. We follow Radford et al. (2016) to train the CGAN using the Adam optimizer with β1 = 0.5.

The network architecture of the Gaussian-generator version of our conditional GAN, used in the experiments of Appendix F, is as follows.

  Generator:
    Linear(state_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    mean = Linear(H//2, action_dim), log_std = Linear(H//2, action_dim)

with the Discriminator and other technical details the same as in CGAN. This Gaussian-generator version of CGAN is again trained for 2000 epochs with a mini-batch size of 100, random seed 0, and β1 = 0.5 in the Adam optimizer.
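For concreteness, a minimal PyTorch sketch of the implicit-generator conditional GAN listed above is given below; the class and variable names are ours, and the Gaussian-generator variant would simply replace the final output layer with separate mean and log_std heads as listed above.

```python
# Minimal PyTorch sketch of the toy-experiment conditional GAN (implicit generator).
# Layer sizes mirror the architecture listed above; names are illustrative.
import torch
import torch.nn as nn

H, Z_DIM = 100, 50

class ImplicitGenerator(nn.Module):
    def __init__(self, state_dim, action_dim, h=H, z_dim=Z_DIM):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + z_dim, h), nn.BatchNorm1d(h), nn.ReLU(),
            nn.Linear(h, h // 2), nn.BatchNorm1d(h // 2), nn.ReLU(),
            nn.Linear(h // 2, action_dim),
        )

    def forward(self, state):
        z = torch.randn(state.shape[0], self.z_dim, device=state.device)  # fresh noise per call
        return self.net(torch.cat([state, z], dim=1))

class ToyDiscriminator(nn.Module):
    def __init__(self, state_dim, action_dim, h=H):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, h), nn.LeakyReLU(0.1),
            nn.Linear(h, h // 2), nn.LeakyReLU(0.1),
            nn.Linear(h // 2, 1),   # raw score; a sigmoid / BCE-with-logits loss is applied outside
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))
```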
Our test set is formed by a random sample of 2000 new states (x) from [−1.5, 1.5], together with the states in the training set. The performance on the test set thus shows both the concentration on the eight centers and the smooth interpolation between centers, which translates into a good and smooth fit to the behavior policy. Figure 1 shows the training set ("Truth") and the kernel-density-estimate plot of each method.

D.2 REINFORCEMENT LEARNING EXPERIMENTS

In practice, the rollouts contained in the offline dataset have finite horizon, and thus special treatment is needed for each appearance of a terminal state when calculating the Bellman update target. We follow the standard treatment (Mnih et al., 2013; Sutton & Barto, 2018) to define the update target as

  Q̃(s, a) = r(s, a) + γ Q̃′(s′, a′),  if s′ is a non-terminal state;
  Q̃(s, a) = r(s, a),                 if s′ is a terminal state,

where Q̃′(s′, a′) refers to the expectation term in Equation 4 for the basic algorithm (Section 4.1), or to the expectation term in Equation 8 for the enhanced versions with state smoothing at the Bellman backup (Section 4.2).

For simplicity, we follow White (2016) to choose the noise distribution pz(z) as the multivariate standard normal distribution, where the dimension of z is conveniently chosen as dim(z) = min(10, state_dim//2). To sample from the implicit policy, for each state s we first independently sample z ∼ N(0, I), concatenate s with z, and feed the resulting [s, z] into the deterministic policy network to generate stochastic actions. To sample from a small region around the next state s′ (Section 4.2), we keep the original s′ and repeat it an additional NB times. To each of the NB replications we add an independent Gaussian noise ϵ ∼ N(0, σB² I). The original s′ and its NB noisy replications are then fed into the implicit policy to sample the corresponding actions.
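As a concrete illustration of the two sampling steps just described, a minimal PyTorch-style sketch is given below; `actor`, the batch tensors, and the helper names are placeholders for the components detailed in Appendix D.2.1.

```python
# Sketch of implicit-policy sampling and of state smoothing around the next states.
import torch

def sample_actions(actor, states, noise_dim):
    """Implicit policy: concatenate each state with fresh noise z ~ N(0, I) and map to an action."""
    z = torch.randn(states.shape[0], noise_dim, device=states.device)
    return actor(torch.cat([states, z], dim=1))

def smoothed_next_states(next_states, n_b, sigma_b):
    """Keep the original s' and append N_B noisy replications s' + eps, eps ~ N(0, sigma_B^2 I)."""
    replicated = next_states.repeat(n_b, 1)
    noisy = replicated + sigma_b * torch.randn_like(replicated)
    return torch.cat([next_states, noisy], dim=0)

# Example usage (shapes only), with noise_dim = min(10, state_dim // 2) as in the text:
# s_hat = smoothed_next_states(batch_next_states, n_b=50, sigma_b=3e-4)
# a_hat = sample_actions(actor_target, s_hat, noise_dim=min(10, state_dim // 2))
```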
For fair comparison, we use the same network architecture as the official implementation of the BCQ algorithm, which is stated in detail in Section D.2.1. Due to limited computational resources, we leave a fine-tuning of the noise distribution pz(z), the network architectures, and the optimization hyperparameters for future work, which also leaves room for further improving our results. For a more stable training of the policy, we adopt the warm-start strategy (Kumar et al., 2020; Yue et al., 2020). Specifically, in the first Nwarm epochs the policy is trained to minimize Lg only, with the same learning rate ηϕ as in the remaining epochs, which additionally maximize the expected Q-values.

**Datasets.** We use the continuous-control tasks provided by the D4RL dataset (Fu et al., 2021) to conduct algorithmic evaluations. Due to limited computational resources, we select therein the "medium-expert," "medium-replay," and "medium" datasets for the Hopper, HalfCheetah, and Walker2d tasks in the Gym-MuJoCo domain, which are commonly used benchmarks in prior work (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Kumar et al., 2020). We follow the literature (Cang et al., 2021; Chen et al., 2021; Kostrikov et al., 2021) in not testing on the "random" and "expert" datasets, as they are less practical (Matsushima et al., 2021) and can be respectively solved by directly using standard off-policy RL algorithms (Agarwal et al., 2020) and by behavior-cloning algorithms. We note that a comprehensive benchmarking of prior offline-RL algorithms on the "expert" datasets is currently unavailable in the literature and is out of the scope of this paper. Apart from the Gym-MuJoCo domain, we also consider the Maze2D domain[2] of tasks for its non-Markovian data-collecting policy and the Adroit tasks[3] (Rajeswaran et al., 2018) for their sparse reward signal and high dimensionality.

2 We use the tasks "maze2d-umaze," "maze2d-medium," and "maze2d-large."
3 We use the tasks "pen-human," "pen-cloned," "pen-expert," and "door-expert."

**Evaluation Protocol.** In all the experiments, we follow Fu et al. (2021) to use the "v0" version of the datasets in the Gym-MuJoCo and Adroit domains. In our preliminary study, we find that the rollout results of some existing algorithms can be unstable across epochs on some datasets, even towards the end of training. To reduce the randomness in evaluation, we report, for our algorithm, the mean and standard deviation of the last five rollouts across three random seeds {0, 1, 2}. We train all agents for 1000 epochs, where each epoch consists of 1000 mini-batch stochastic-gradient-descent steps. We roll out each agent for 10 episodes after each epoch of training.

D.2.1 GAN JOINT MATCHING

In approximately matching the Jensen-Shannon divergence between the state-action visitation frequencies via Generative Adversarial Nets, a crucial step is to stably and effectively train the GAN structure. To this end, we adopt the following techniques from the literature.

- To provide stronger gradients early in training, rather than training the policy πϕ to minimize Ex[log(1 − Dw(x))], we follow Goodfellow et al. (2014) and train πϕ to maximize Ex[log Dw(x)].
- Motivated by Radford et al. (2016), we use LeakyReLU activations in both the generator and the discriminator, with the default negative slope of 0.01.
- To stabilize training, we follow Radford et al. (2016) and use a reduced momentum term β1 = 0.4 in the Adam optimizer (Kingma & Ba, 2014).
- We follow Radford et al. (2016) to use actor and discriminator learning rates ηϕ = ηw = 2 × 10⁻⁴.
- To avoid overfitting of the discriminator, we are motivated by Salimans et al. (2016) and Goodfellow (2017) to use one-sided label smoothing with soft and noisy labels. Specifically, the label for each data sample y is replaced with a random number between 0.8 and 1.0, instead of the original 1. No label smoothing is applied to the generator samples x, whose labels remain 0.

The loss function for training the discriminator in GAN is the binary cross-entropy between the labels and the outputs of the discriminator, as sketched below. Furthermore, motivated by TD3 (Fujimoto et al., 2018) and GAN (Section 2), we update πϕ(· | s) once per k updates of the critics and the discriminator.
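A hedged sketch of this discriminator update is given below; `disc` is assumed to output probabilities Dw(·) ∈ (0, 1), and the function and tensor names are ours.

```python
# Sketch of one discriminator update with one-sided label smoothing (Appendix D.2.1).
import torch
import torch.nn.functional as F

def discriminator_loss(disc, data_s, data_a, gen_s, gen_a):
    d_real = disc(torch.cat([data_s, data_a], dim=1))          # D_w(y) for data samples y = (s, a)
    d_fake = disc(torch.cat([gen_s, gen_a.detach()], dim=1))   # D_w(x) for generator samples x
    real_labels = 0.8 + 0.2 * torch.rand_like(d_real)          # soft, noisy labels in [0.8, 1.0]
    fake_labels = torch.zeros_like(d_fake)                     # no smoothing on the generator side
    return (F.binary_cross_entropy(d_real, real_labels)
            + F.binary_cross_entropy(d_fake, fake_labels))
```

Minimizing this binary cross-entropy with respect to w is, up to the one-sided smoothing of the data-side labels, equivalent to maximizing (1/|B|) Σ_y log Dw(y) + (1/|B|) Σ_x log(1 − Dw(x)) as in Algorithm 2.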
Table 2 shows the hyperparameters for our GAN joint-matching framework. Note that several simplifications are made to minimize hyperparameter tuning, such as fixing ηϕ = ηw, as in Radford et al. (2016), and σB = σJ. We also fix Na = 1 to ease computation. We comment that many of these hyperparameters can be set based on the literature; for example, we use ηϕ = ηw = 2 × 10⁻⁴ as in Radford et al. (2016), ηθ = 3 × 10⁻⁴ and Nwarm = 40 as in Kumar et al. (2020), λ = 0.75 as in Fujimoto et al. (2019), and policy frequency k = 2 as in Fujimoto et al. (2018). Unless otherwise specified, the same hyperparameters are used across all datasets.

Below we state the network architectures of the actor, the critic, and the discriminator in GAN. Note that we use a pair of critic networks with the same architecture to perform clipped double Q-learning.

  Actor:
    Linear(state_dim+noise_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    Linear(300, action_dim)
    max_action * tanh
  Discriminator in GAN:
    Linear(state_dim+action_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    Linear(300, 1)
    Sigmoid
  Critic:
    Linear(state_dim+action_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    Linear(300, 1)

Note that the network architectures follow Fujimoto et al. (2019) and that all LeakyReLU activations use the default negative slope of 0.01.

Table 2: Default hyperparameters for GAN joint matching.

  Hyperparameter | Value
  Optimizer | Adam (Kingma & Ba, 2014)
  Learning rate ηθ | 3 × 10⁻⁴
  Learning rates ηϕ, ηw | 2 × 10⁻⁴
  Log Lagrange multiplier log α for non-Adroit datasets | 4.0
  Log Lagrange multiplier log α for Adroit datasets | 8.0
  Evaluation frequency | 10³
  Training iterations | 10⁶
  Batch size | 512
  Discount factor | 0.99
  Target network update rate β | 0.005
  Weighting for clipped double Q-learning λ | 0.75
  Noise distribution pz(z) | N(0, I)
  Standard deviations for state smoothing σB, σJ | 3 × 10⁻⁴
  Number of smoothed states in Bellman backup NB | 50
  Number of actions â at each ŝ, Na | 1
  Number of epochs for warm start Nwarm | 40
  Policy frequency k | 2
  Random seeds | {0, 1, 2}

D.2.2 RESULTS OF CQL

We note that the official CQL GitHub repository does not provide hyperparameter settings for the Maze2D and Adroit domains of tasks. For datasets in these two domains, we train a CQL agent using five hyperparameter settings: the four recommended Gym-MuJoCo settings and the one recommended Ant-Maze setting. We then calculate the average (normalized) return of the last five rollouts over random seeds {0, 1, 2} and report the per-dataset best results across those five hyperparameter settings. We comment that this approach is non-conventional, but rather a compensation for the missing recommended hyperparameter settings, and it may give CQL some advantage on the Maze2D and Adroit domains. For the Gym-MuJoCo domain, we follow Kumar et al. (2020) to use the recommended hyperparameter setting across all datasets.

D.2.3 ABLATION STUDY ON GAUSSIAN POLICY

The network architecture of the Gaussian policy used in the ablation study (Section 5.2) follows common practice (Haarnoja et al., 2018a; Kumar et al., 2020).

  Gaussian Policy:
    Linear(state_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    mean = Linear(300, action_dim)
    log_std = Linear(300, action_dim)

The critics and the discriminator are the same as in the implicit-policy case (Appendix D.2.1). For action selection from the Gaussian policy, a given state s is first mapped to the mean µ(s) and the standard-deviation vector σ(s). A raw action is sampled as a_raw ∼ N(µ(s), diag(σ²(s))). Finally, a_raw is mapped into the action space as max_action × tanh(a_raw). For fair comparison, other technical details, including the training procedure and the hyperparameter setting, are exactly the same as in the implicit-policy case (Appendix D.2.1).
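For completeness, the action-selection rule just described can be sketched as follows; `trunk`, `mean_head`, `log_std_head`, and `max_action` are placeholders for the layers listed above.

```python
# Sketch of action selection for the Gaussian-policy ablation (Appendix D.2.3).
import torch

def select_action(trunk, mean_head, log_std_head, state, max_action):
    h = trunk(state)                                   # shared Linear/LeakyReLU layers
    mu, log_std = mean_head(h), log_std_head(h)
    a_raw = mu + log_std.exp() * torch.randn_like(mu)  # a_raw ~ N(mu(s), diag(sigma^2(s)))
    return max_action * torch.tanh(a_raw)              # map into the bounded action space
```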
E ADDITIONAL TABLES

E.1 RAW SCORES OF THE MAIN TABLE

In Table 3 we report the un-normalized results corresponding to Table 1 for reference.

Table 3: Raw returns for experiments on the D4RL suite of tasks. We perform experiments on tasks from the Gym-MuJoCo, Maze2D, and Adroit domains. For our algorithm, we report in this table the mean and standard deviation of the raw returns of the last five rollouts across three random seeds {0, 1, 2}. We run CQL ourselves and report the average raw return of the last five rollouts over random seeds {0, 1, 2}. Results for other algorithms are from Fu et al. (2021).

(Table 3 compares SAC-off, BEAR, BRAC-v, BCQ, CQL, GAN-Cond:Basic, GAN-Joint:Basic, and GAN-Joint on the sixteen Maze2D, Gym-MuJoCo, and Adroit datasets listed in Tables 4–7.)

E.2 TABLES FOR THE ABLATION STUDY

Tables 4–7 correspond to the results of our ablation study in Section 5.2.

Table 4: Normalized returns for comparing the implicit policy with the Gaussian policy on the basic algorithm (Section 4.1) on the D4RL suite of tasks. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.

  Task Name | GAN-Joint-Matching: Basic | GAN-Joint-Matching: Basic, Gaussian Policy
  maze2d-umaze | 57.6 ± 11.0 | 23.9 ± 8.1
  maze2d-medium | 39.4 ± 10.3 | 0.9 ± 7.7
  maze2d-large | 52.1 ± 20.8 | 3.3 ± 0.7
  halfcheetah-medium | 43.7 ± 0.5 | 43.8 ± 0.4
  walker2d-medium | 70.0 ± 9.1 | 51.6 ± 10.9
  hopper-medium | 66.5 ± 15.4 | 79.8 ± 14.1
  halfcheetah-medium-replay | 31.3 ± 2.8 | 33.0 ± 3.1
  walker2d-medium-replay | 9.9 ± 2.0 | 9.1 ± 1.7
  hopper-medium-replay | 36.5 ± 5.2 | 26.3 ± 2.2
  halfcheetah-medium-expert | 74.2 ± 6.1 | 76.8 ± 6.7
  walker2d-medium-expert | 76.5 ± 15.1 | 59.6 ± 21.9
  hopper-medium-expert | 70.9 ± 26.8 | 84.6 ± 10.6
  pen-human | 61.0 ± 13.7 | 55.7 ± 21.1
  pen-cloned | 31.3 ± 21.5 | 36.6 ± 14.6
  pen-expert | 129.1 ± 14.4 | 118.8 ± 15.1
  door-expert | 104.1 ± 3.6 | 39.5 ± 24.6
  Average Score | **59.6** | 46.5

Table 5: Normalized returns for comparing our full algorithm with its counterpart without state smoothing in the joint matching of the state-action visitation. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.
  Task Name | GAN-Joint-Matching: Full Algorithm | GAN-Joint-Matching: No State-smoothing in Joint Matching
  maze2d-umaze | 40.1 ± 16.9 | 47.6 ± 4.7
  maze2d-medium | 69.6 ± 25.6 | 35.4 ± 5.5
  maze2d-large | 71.3 ± 26.0 | 69.9 ± 26.5
  halfcheetah-medium | 44.1 ± 0.3 | 44.0 ± 0.4
  walker2d-medium | 69.3 ± 8.6 | 62.7 ± 8.6
  hopper-medium | 60.1 ± 27.3 | 81.8 ± 16.6
  halfcheetah-medium-replay | 33.1 ± 2.3 | 31.8 ± 2.7
  walker2d-medium-replay | 10.2 ± 2.4 | 9.9 ± 2.5
  hopper-medium-replay | 29.5 ± 2.5 | 29.5 ± 2.4
  halfcheetah-medium-expert | 75.8 ± 10.1 | 69.1 ± 8.3
  walker2d-medium-expert | 71.2 ± 22.0 | 73.8 ± 18.4
  hopper-medium-expert | 99.9 ± 29.0 | 67.8 ± 18.9
  pen-human | 45.5 ± 24.5 | 54.3 ± 19.6
  pen-cloned | 18.0 ± 14.4 | 22.5 ± 17.4
  pen-expert | 141.1 ± 14.8 | 142.4 ± 14.1
  door-expert | 103.4 ± 3.7 | 102.5 ± 3.2
  Average Score | **61.4** | 59.1

Table 6: Normalized returns for comparing our full algorithm with its counterpart without state smoothing in the Bellman backup. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.

  Task Name | GAN-Joint-Matching: Full Algorithm | GAN-Joint-Matching: No State-smoothing in Bellman Backup
  Average Score | **61.4** | 58.8

Table 7: Normalized returns under several values of σ ≜ σB = σJ (Appendix D.2.1) in the full algorithm of GAN joint matching. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.
  Task Name | σ = 1 × 10⁻² | σ = 3 × 10⁻³ | σ = 1 × 10⁻³ | σ = 3 × 10⁻⁴ | σ = 1 × 10⁻⁴ | σ = 0
  maze2d-umaze | 48.4 ± 21.4 | 48.3 ± 19.7 | 41.7 ± 11.5 | 40.1 ± 16.9 | 54.9 ± 10.4 | 50.8 ± 24.0
  maze2d-medium | 58.7 ± 33.6 | 48.7 ± 7.4 | 64.0 ± 23.9 | 69.6 ± 25.6 | 46.9 ± 15.5 | 26.4 ± 5.7
  maze2d-large | 87.1 ± 17.9 | 57.6 ± 21.3 | 62.4 ± 13.3 | 71.3 ± 26.0 | 61.0 ± 8.6 | 62.3 ± 32.3
  halfcheetah-medium | 43.0 ± 0.4 | 43.7 ± 0.3 | 43.9 ± 0.4 | 44.1 ± 0.3 | 43.8 ± 0.3 | 44.0 ± 0.4
  walker2d-medium | 56.9 ± 10.4 | 66.4 ± 7.9 | 68.8 ± 10.3 | 69.3 ± 8.6 | 64.6 ± 13.8 | 63.8 ± 8.4
  hopper-medium | 23.5 ± 8.4 | 66.7 ± 20.8 | 63.3 ± 21.0 | 60.1 ± 27.3 | 74.5 ± 19.3 | 89.6 ± 27.9
  halfcheetah-medium-replay | 32.1 ± 2.4 | 31.5 ± 3.3 | 32.3 ± 2.1 | 33.1 ± 2.3 | 31.2 ± 1.9 | 31.5 ± 3.2
  walker2d-medium-replay | 9.8 ± 2.4 | 10.7 ± 2.0 | 10.2 ± 1.8 | 10.2 ± 2.4 | 10.9 ± 1.7 | 9.4 ± 1.4
  hopper-medium-replay | 28.7 ± 3.8 | 30.1 ± 2.7 | 30.5 ± 2.9 | 29.5 ± 2.5 | 31.3 ± 1.9 | 29.2 ± 1.5
  halfcheetah-medium-expert | 79.9 ± 10.1 | 74.2 ± 13.0 | 76.8 ± 13.4 | 75.8 ± 10.1 | 71.3 ± 8.6 | 70.7 ± 7.9
  walker2d-medium-expert | 67.4 ± 16.0 | 63.4 ± 22.2 | 69.7 ± 17.3 | 71.2 ± 22.0 | 63.4 ± 23.1 | 77.2 ± 18.4
  hopper-medium-expert | 20.5 ± 6.8 | 56.7 ± 27.5 | 79.4 ± 21.9 | 99.9 ± 29.0 | 66.7 ± 19.6 | 62.0 ± 19.5
  pen-human | -3.3 ± 0.5 | 64.2 ± 17.0 | 46.6 ± 33.5 | 45.5 ± 24.5 | 67.8 ± 13.4 | 60.3 ± 11.4
  pen-cloned | 4.5 ± 1.9 | 19.6 ± 11.8 | 23.3 ± 13.2 | 18.0 ± 14.4 | 36.6 ± 18.4 | 40.0 ± 20.8
  pen-expert | 74.2 ± 26.6 | 132.8 ± 11.1 | 132.8 ± 17.9 | 141.1 ± 14.8 | 136.6 ± 10.8 | 132.0 ± 19.4
  door-expert | 29.1 ± 9.7 | 104.1 ± 1.6 | 104.2 ± 1.7 | 103.4 ± 3.7 | 102.9 ± 3.9 | 102.3 ± 4.8
  Average Score | 41.3 | 57.4 | 59.4 | 61.4 | 60.3 | 59.5

F FURTHER COMPARISON BETWEEN IMPLICIT AND GAUSSIAN POLICY

In Section 5.2 we note that a uni-modal stochastic policy, such as the Gaussian policy, is less flexible in capturing all the rewarding actions, which an implicit policy may fit well. In this section we visualize such a difference. Figure 2 compares the fits of the eight-Gaussian toy dataset by the implicit policy and by the Gaussian policy. Specifically, Figure 2a plots the dataset; Figure 2b plots the CGAN with the default implicit generator (implicit policy) fitted by the conditional-distribution matching approach; Figure 2c plots the CGAN with a Gaussian generator (Gaussian policy) fitted by the conditional-distribution matching approach; Figure 2d plots the CGAN with an implicit policy fitted by the basic joint-visitation matching strategy (Section 4.1); and Figure 2e plots the CGAN with a Gaussian policy fitted by the basic joint-visitation matching strategy. Experimental details are in Appendix D.1.
(Figure 2: five kernel-density-estimate panels — (a) Truth, (b) CGAN, (c) G-CGAN, (d) GAN, (e) G-GAN.)

Figure 2: Performance of approximating the behavior policy on the eight-Gaussian dataset by a conditional GAN with the default (implicit) generator and with a Gaussian generator. A conditional GAN ("CGAN") and a Gaussian-generator conditional GAN ("G-CGAN") are fitted using the conditional-distribution matching approach. A conditional GAN ("GAN") and a Gaussian-generator conditional GAN ("G-GAN") are fitted using the basic joint-visitation matching strategy (Section 4.1). Performance is judged by (1) clear concentration on the eight centers, and (2) smooth interpolation between centers, which implies a good and smooth fit to the behavior policy. Details of this toy experiment are presented in Appendix D.1.

We see that, regardless of the training strategy, the Gaussian policies fail to learn multi-modal state-conditional action distributions when such multi-modality is needed. Even though the Gaussian-policy version of CGAN may still correctly capture some modes in the action distributions, an improvement over the mode-covering CVAE, it misses other modes. Besides, the Gaussian-policy versions interpolate less smoothly between the centers. In offline RL, these weaknesses relate to missing some rewarding actions and to less-predictable action choices at unseen states.

To visualize the differences between the implicit and the Gaussian policy in the offline RL setting, we plot the kernel density estimates of the distributions of the first two action dimensions in the "maze2d-umaze-v1" dataset, where a performance difference is shown in Table 4. Specifically, Figure 3a plots the distribution of actions in the offline dataset; Figures 3b and 3c respectively plot the action distributions produced by the final Gaussian policy and the final implicit policy generating Table 4.

Figure 3: KDE plots of the first two dimensions of actions in the "maze2d-umaze-v1" dataset. From left to right: (a) action distribution in the offline dataset; (b) action distribution produced by the final Gaussian policy in Table 4; (c) action distribution produced by the final implicit policy (Section 4.1) in Table 4.

We see from Figure 3a and Figure 3b that the Gaussian policy leaves out two action modes, namely the modes on the upper-left and upper-right corners. Figure 3c shows that our implicit policy does capture all the modes shown in Figure 3a. Note that "maze2d-umaze" is a navigation task requiring agents to reach a goal location (Fu et al., 2021).
The Gaussian policy thus may miss some directions in the offline dataset pertaining to short paths to the goal state, which may explain its inferior performance on this dataset in Table 4.