# STATE-ACTION JOINT REGULARIZED IMPLICIT POLICY FOR OFFLINE REINFORCEMENT LEARNING

**Anonymous authors** Paper under double-blind review

ABSTRACT

Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. To train more effective agents, we propose a framework that supports learning a flexible and well-regularized policy, which consists of a fully implicit policy and a regularization through the state-action visitation frequency induced by the current policy and that induced by the data-collecting behavior policy. We theoretically show the equivalence between policy matching and state-action-visitation matching, and thus the compatibility of much prior work with our framework. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and an ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.

1 INTRODUCTION

Offline reinforcement learning (RL), also known as batch RL, aims at training agents from previously-collected fixed datasets that are typically large and heterogeneous, with a special emphasis on no interactions with the environment during the learning process (Ernst et al., 2005; Lange et al., 2012; Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Agarwal et al., 2020; Siegel et al., 2020; Wang et al., 2020). This paradigm extends the applicability of RL to settings where environmental interactions are costly or even potentially dangerous, such as healthcare (Tseng et al., 2017; Gottesman et al., 2018; Nie et al., 2019), autonomous driving (Yurtsever et al., 2020), and recommendation systems (Swaminathan et al., 2017; Gilotte et al., 2018).

While (online) off-policy RL algorithms, such as DDPG (Lillicrap et al., 2016), TD3 (Fujimoto et al., 2018), and SAC (Haarnoja et al., 2018a), could be directly adopted in the offline setting, their application can be unsuccessful (Fujimoto et al., 2019; Kumar et al., 2019), especially in high-dimensional continuous control tasks, where function approximation is inevitable and data samples are non-exhaustive. Such failures may be attributed to the shift between the state-action visitation frequency induced by the current policy and that induced by the data-collecting behavior policy: unseen state-action pairs are presented to the action-value estimator, resulting in possibly uncontrollable extrapolation errors (Fujimoto et al., 2019; Kumar et al., 2019). In this regard, one approach to offline RL is to control the difference between the observed and policy-induced state-action visitation frequencies, so that the current policy mostly generates state-action pairs that have reliable action-value estimates. Previous work in this line of research typically (1) regularizes the current policy to be close to the behavior policy during the training process, i.e., policy or state-conditional action-distribution matching; and (2) uses a Gaussian policy class with a learnable mean and diagonal covariance matrix (Kumar et al., 2019; Wu et al., 2019). See Appendix A for a detailed review.
However, at any given state s, the underlying action-value function over the action space may possess multiple local maxima. A deterministic or uni-modal stochastic policy may capture only one of the local optima and neglect many rewarding actions. An even worse situation occurs when such policies exhibit a strong mode-covering behavior, artificially inflating the density around the average of multiple rewarding actions, an average that may itself be inferior.

Previous work under the policy-matching theme mainly takes two approaches. One approach directly estimates the divergence between the state-conditional distributions over actions (Wu et al., 2019). However, on tasks with continuous state space, with probability one, no state appears in the dataset more than once. In other words, for each observed state si, the offline dataset contains only one corresponding action ai from the behavior policy. Thus, one can only use a single point to assess whether the current policy is close to the data-collecting behavior policy at any particular state, which may not well reflect the true divergence between the two conditional distributions. The other approach, e.g., Kumar et al. (2019), resorts to a two-step strategy: first, fit a generative model π(a | s) to clone the behavior policy; second, estimate the distance between the fitted behavior policy π(a | s) and the current policy, and minimize that distance as a regularizer. While this approach, which can draw multiple samples from the cloned behavior policy, is able to accurately estimate the distance between the current policy and the cloned behavior, its success relies heavily on how well the inferred behavior-cloning generative model mimics the true behavior policy. On tasks with large or continuous state space, however, the same problem arises that each state-conditional action distribution is fitted on only one data point. Moreover, some prior work uses a conditional VAE (CVAE, Sohn et al. (2015)) as the generative model to clone the possibly multi-modal behavior policy, which further suffers from the problem that the CVAE may exhibit a strong mode-covering behavior that allocates large probability density to low-data-density regions. Given these weaknesses, one may naturally question how well samples from such a cloned policy resemble the truth, and further, the quality of the constraint imposed by a sample-based calculation of the distance between such a cloned behavior policy and the current policy.

To address these concerns, we are motivated to develop a new framework that not only supports an expressive policy, which can be as flexible as needed, but also well regularizes this flexible policy towards the data-collecting behavior policy. Specifically, (1) instead of using the classical deterministic or uni-modal Gaussian policy, we train a fully implicit policy for its flexibility to capture multiple modes in the action-value function; (2) to avoid the aforementioned potential problems in the policy-matching regularization, we directly control the distance between the state-action visitation frequency induced by the current policy and that induced by the behavior policy, as an auxiliary training target. Hence, our approach does not need to build a generative model to clone the behavior policy.
On the theoretical side, we prove in Section 3 that the approach of matching the behavior and current policies is equivalent to matching their corresponding state-action visitation frequencies, which reveals the compatibility of much prior work with our framework. A similar notion of matching the state-action visitations in offline RL is taken by the DICE family (Nachum et al., 2019; Lee et al., 2021a), but these methods either use a Gaussian policy or a mixture-of-Gaussians policy with a per-dataset-tuned number of mixture components. Besides, these algorithms have high computational complexity, which, together with inflexible policies and intensive hyperparameter tuning, limits their practical applicability. We instantiate our framework with a GAN structure that approximately minimizes the Jensen–Shannon divergence between the visitation frequencies. Furthermore, we design techniques to explicitly encourage robust behavior of our policy at states not covered in the static dataset. We conduct an ablation study on several components of our algorithm and analyze their contributions. With all these considerations, our full algorithm achieves state-of-the-art performance on various tasks from the D4RL dataset (Fu et al., 2021), validating the effectiveness of our framework and implementation.

2 BACKGROUND AND MOTIVATION

We first present some background information and then introduce a toy example to illustrate the motivations of the proposed framework for offline RL.

**Offline RL.** Following the classic RL setting (Sutton & Barto, 2018), the interaction between the agent and environment is modeled as a Markov decision process (MDP), specified by the tuple M = (S, A, P, r, γ), where S denotes the state space, A the action space, γ ∈ (0, 1] the discount factor, P(s′ | s, a) : S × S × A → [0, 1] the environmental dynamics, and r(s, a) : S × A → [Rmin, Rmax] the reward function. The goal of RL is to learn a policy πϕ(at | st), parametrized by ϕ, that maximizes the expected cumulative discounted reward $R_t \triangleq R(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right]$. In offline RL (Fujimoto et al., 2019; Kumar et al., 2019; 2020; Levine et al., 2020), the agent only has access to a fixed dataset D ≜ {(s, a, r, s′)}, consisting of transition tuples from rollouts by some behavior policies πb(a | s). We denote the state-action visitation frequency induced by the behavior policy πb as db(s, a) and its state marginal, the state visitation frequency, as db(s). Similarly, dϕ(s, a) and dϕ(s) are the counterparts for the current policy πϕ. Here, db(s, a) = db(s)πb(a | s) and we assume D ∼ db(s, a). The visitation frequencies in the dataset are denoted as dD(s, a) and dD(s), which are discrete approximations to db(s, a) and db(s), respectively.

**Actor-Critic Algorithm.** Denote $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$ as the action-value function.
In the actor-critic scheme (Sutton & Barto, 2018), the critic Q^π(s, a) is often approximated by a neural network Qθ(s, a), parametrized by θ and trained by applying the Bellman operator (Lillicrap et al., 2016; Haarnoja et al., 2018a; Fujimoto et al., 2019) as

$$\arg\min_{\theta} \left( Q_{\theta}(s, a) - \left( r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')} \left[ Q_{\theta}(s', a') \right] \right) \right)^2 .$$

The actor πϕ aims at maximizing the expected value of Qθ, with the training objective expressed as

$$\arg\max_{\phi}\; J(\pi_{\phi}) = \mathbb{E}_{s \sim d_{\phi}(s),\, a \sim \pi_{\phi}(a \mid s)} \left[ Q_{\theta}(s, a) \right], \qquad (1)$$

where dϕ(s) is the state visitation frequency under policy πϕ. In offline RL, sampling from dϕ(s) is infeasible as no interactions with the environment are allowed. A common and practically effective approximation (Fu et al., 2019; Levine et al., 2020) to Equation 1 is

$$J(\pi_{\phi}) \approx \mathbb{E}_{s \sim d_{b}(s),\, a \sim \pi_{\phi}(a \mid s)} \left[ Q_{\theta}(s, a) \right], \qquad (2)$$

where sampling from db(s) can be implemented easily as sampling from the offline dataset D.

**Generative Adversarial Nets.** GAN (Goodfellow et al., 2014) provides a framework to train deep generative models, with two neural networks jointly trained in an adversarial manner: a generator Gϕ, parametrized by ϕ, that fits the data distribution, and a discriminator Dw, parametrized by w, that outputs the probability of a sample coming from the training data rather than from Gϕ. Samples x from the generator's distribution dϕ(x) are drawn via z ∼ pz(z), x = Gϕ(z), where pz(z) is some noise distribution. Both Gϕ and Dw are trained via the two-player min-max game

$$\min_{\phi} \max_{w}\; V(D_w, G_{\phi}) = \mathbb{E}_{y \sim d_{\mathcal{D}}(\cdot)} \left[ \log D_w(y) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log\left(1 - D_w(G_{\phi}(z))\right) \right], \qquad (3)$$

where dD(·) is the data distribution. Given the optimal discriminator D*_G at Gϕ, the training objective of Gϕ is determined by the Jensen–Shannon divergence (JSD) (Lin, 1991) between dD and dϕ as V(D*_G, Gϕ) = −log 4 + 2 · JSD(dD ∥ dϕ), with the global minimum achieved if and only if dϕ = dD. Therefore, one may view GAN as a distributional-matching framework that approximately minimizes the Jensen–Shannon divergence between the generator distribution and the data distribution.
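To make the min-max game in Equation 3 concrete, the following is a minimal PyTorch sketch of one GAN training step. The network sizes, optimizer settings, and the toy Gaussian "data" distribution are illustrative assumptions, not the configuration used in our experiments.

```python
# Minimal sketch of the GAN objective in Equation 3 (illustrative sizes and toy data).
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim, batch = 8, 2, 128

G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))        # generator G_phi
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # discriminator D_w
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(200):
    y = 0.5 * torch.randn(batch, data_dim) + 2.0    # "data" samples y ~ d_D (a toy Gaussian here)
    x = G(torch.randn(batch, noise_dim))            # generator samples x = G_phi(z), z ~ p_z

    # Inner maximization over w: classify y as real (label 1) and x as fake (label 0).
    d_loss = bce(D(y), torch.ones(batch, 1)) + bce(D(x.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Outer minimization over phi: E_z[log(1 - D_w(G_phi(z)))].
    g_loss = torch.log(1.0 - D(x) + 1e-8).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

At the optimum of the inner maximization, descending the generator term drives dϕ toward dD in Jensen–Shannon divergence, which is exactly the matching property our framework exploits later for state-action visitations.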
**Motivations.** To illustrate our motivations for training an expressive policy under an appropriate regularization, we conduct a toy behavior-cloning experiment, shown in Figure 1, where we use the x- and y-axis values to represent the state and action, respectively. Figure 1a illustrates the state-action joint distribution of the data-collecting behavior policy that we try to mimic. For Figures 1b-1e, we use the same test-time state marginal distribution, which consists of an equal mixture of the behavior policy's state distribution and a uniform state distribution between −1.5 and 1.5. If the inferred policy approaches the behavior policy well, we expect (1) clear concentration on the eight centers and (2) smooth interpolation between centers, which implies a good and smooth fit to the behavior policy. We start the toy experiment by fitting a CVAE model, a representative behavior-cloning method, to the dataset. As shown in Figure 1b, CVAE exhibits a mode-covering behavior that covers the data-density modes at the expense of overestimating unwanted low-data-density regions. Hence, the regularization ability of using CVAE as a proxy for the behavior policy, as in some prior work, is questionable.

Replacing CVAE with the conditional GAN (CGAN, Mirza & Osindero (2014)), i.e., replacing the KL loss with the JSD loss, while adopting the Gaussian policy popular in prior offline-RL work, partially alleviates the issues but drops necessary modes, as shown in Figure 1c. This shows the inflexibility of Gaussian policies. Replacing the Gaussian policy in CGAN with an implicit policy, while still training CGAN via sample-based policy matching, improves the capability of capturing multiple modes, as shown in Figure 1d. Nevertheless, it is still prone to mode collapse and interpolates less smoothly between the seen states. Finally, training the implicit-policy CGAN via direct state-action-visitation joint matching leads to the best performance. As shown in Figure 1e, it concentrates clearly on the eight centers and interpolates smoothly between the seen states. Thus, constraining the state-action visitations can be an effective way to regularize implicit policies in offline RL. These observations motivate us to train implicit policies in offline RL, with sample-based regularization on the state-action visitations.

[Figure 1: five scatter-plot panels over states and actions in [−1.5, 1.5], titled (a) Truth, (b) CVAE, (c) G-CGAN, (d) CGAN, (e) GAN.]

Figure 1: Performance of approximating the behavior policy on the eight-Gaussian dataset. A conditional VAE ("CVAE"), a conditional GAN ("CGAN"), and a Gaussian-generator conditional GAN ("G-CGAN") are fitted using the conditional-distribution (policy) matching approach. A conditional GAN ("GAN") is fitted using the basic state-action-joint-visitation matching strategy (Section 4.1). More details are provided in Appendix D.1.

3 THEORETICAL ANALYSIS

As discussed in Section 1 and detailed in Appendix A, one common theme in prior offline-RL work is controlling the distance between the behavior policy and the current policy during the training process. In this section, we prove that this approach, in essence, controls the corresponding state-action visitations. This analysis theoretically links a line of prior offline-RL work with our proposed framework, manifesting the generality of our algorithmic idea. Note that dϕ(s, a) = dϕ(s)πϕ(a | s) and similarly for db(s, a). Hence, it is sufficient to show the closeness between dϕ(s) and db(s) when πϕ(a | s) is close to πb(a | s). Below we give our analysis for the matrix (finite state space) case. Continuous state-space cases may be analyzed similarly and are left for future work. The proofs are deferred to Appendix C.
**Notation.** Denote A_{−i∗} as matrix A with its i-th row removed; κ(A) as the 2-norm condition number of A; **1** as a row vector of all ones and I as an identity matrix, both with appropriate dimensions. Assume that the state space S is finite with cardinality N, i.e., S = {s¹, . . ., s^N}. The transition probabilities associated with policy πϕ over S then form an N × N matrix Tϕ, whose (i, j) entry is

$$T_{\phi,(i,j)} = p_{\pi_{\phi}}\!\left(S_{t+1} = s^j \mid S_t = s^i\right) = \int_{\mathcal{A}} P\!\left(S_{t+1} = s^j \mid S_t = s^i, A_t = a_t\right) \pi_{\phi}\!\left(a_t \mid s^i\right) \mathrm{d}a_t ,$$

and similarly for Tb, the transition matrix associated with πb. Note that in this case dϕ(s), db(s) are vectors, and we denote dϕ ≜ dϕ(s), db ≜ db(s) ∈ R^N and dϕ = db + ∆d.

**Theorem 1.** Denote

$$\kappa_{\max} = \max_{i=2,\ldots,N+1} \kappa\!\left( \begin{pmatrix} \mathbf{1} \\ I - T_b^{\top} \end{pmatrix}_{-i*} \right).$$

If

$$\max_{i=1,\ldots,N} \left\| \pi_b\!\left(\cdot \mid s^i\right) - \pi_{\phi}\!\left(\cdot \mid s^i\right) \right\|_1 \leq \epsilon < \frac{1}{\kappa_{\max}}$$

and T^ϕ_{i,j}, T^b_{i,j} > 0 for all i, j ∈ {1, 2, . . ., N}, then

$$2\,\mathrm{TV}(d_{\phi}, d_b) = \|\Delta d\|_1 = \|d_{\phi} - d_b\|_1 \leq \frac{\epsilon\, \kappa_{\max}}{1 - \epsilon\, \kappa_{\max}} \to 0 \quad \text{as } \epsilon \to 0 .$$

*Remark.* (1) We note that κmax is a constant for fixed Tb and can be calculated by iteratively removing columns of Tb and computing the SVD of the resulting matrix. (2) The assumption that T^ϕ_{i,j}, T^b_{i,j} > 0 for all i, j ∈ {1, . . ., N} can be satisfied by substituting the zero entries in the original transition matrix with a small number and re-normalizing each row of the resulting matrix, as in the PageRank algorithm (Page et al., 1998; Langville & Meyer, 2004).

In practice, the offline dataset D often consists of samples collected by several policies. Equivalently, the behavior policy πb(· | s) is a mixture of single policies. Assume that there are K such policies {πb_k(· | s)}_{k=1}^K with mixture probabilities {w_k}_{k=1}^K, i.e., πb(· | s) = Σ_{k=1}^K w_k πb_k(· | s), Σ_{k=1}^K w_k = 1. Since we collect D by running each πb_k(· | s) a proportion w_k of the total time, we may decompose D as D = ∪_{k=1}^K D_k, where D_k consists of the w_k proportion of data in D. Thus, dD(s) = Σ_{k=1}^K w_k dD_k(s), and the approximation dϕ(s) ≈ dD(s) has population version dϕ(s) ≈ Σ_{k=1}^K w_k db_k(s) ≜ db(s). As before, denote db_k(s) ∈ R^N as the limiting state-occupancy measure induced by πb_k on M; Tb_k as the transition matrix induced by πb_k over S; and dϕ = db_k + ∆d_k. We extend Theorem 1 to this mixture-of-policies case as follows.

**Theorem 2.** Denote

$$\kappa_{\max,k} = \max_{i=2,\ldots,N+1} \kappa\!\left( \begin{pmatrix} \mathbf{1} \\ I - T_{b_k}^{\top} \end{pmatrix}_{-i*} \right), \qquad \kappa_{\max} = \max_{k=1,\ldots,K} \kappa_{\max,k} .$$

If

$$\max_{k=1,\ldots,K}\; \max_{i=1,\ldots,N} \left\| \pi_{b_k}\!\left(\cdot \mid s^i\right) - \pi_{\phi}\!\left(\cdot \mid s^i\right) \right\|_1 \leq \epsilon < \frac{1}{\kappa_{\max}}$$

and T^ϕ_{i,j}, T^{b_k}_{i,j} > 0 for all i, j ∈ {1, . . ., N}, k ∈ {1, . . ., K}, then

$$2\,\mathrm{TV}(d_{\phi}, d_b) = \|d_{\phi} - d_b\|_1 \leq \sum_{k=1}^{K} w_k\, \frac{\epsilon\, \kappa_{\max,k}}{1 - \epsilon\, \kappa_{\max,k}} \leq \frac{\epsilon\, \kappa_{\max}}{1 - \epsilon\, \kappa_{\max}} \to 0 \quad \text{as } \epsilon \to 0 .$$

In particular, if w_k = 1/K for all k ∈ {1, . . ., K}, then

$$2\,\mathrm{TV}(d_{\phi}, d_b) = \|d_{\phi} - d_b\|_1 \leq \frac{1}{K} \sum_{k=1}^{K} \frac{\epsilon\, \kappa_{\max,k}}{1 - \epsilon\, \kappa_{\max,k}} \to 0 \quad \text{as } \epsilon \to 0 .$$
*Remark.* We note that κmax,k is a constant for fixed Tb_k, and κmax is a constant for fixed {πb_k}_{k=1}^K.

We notice that a similar analysis has been given in prior work bounding DKL(dϕ(s) ∥ db(s)) by O(ϵ/(1 − γ)²) (Schulman et al., 2015; Levine et al., 2020). However, this prior work deals with (unnormalized) discounted visitation frequencies, while our bound is devoted to undiscounted visitation frequencies, since neither the data collection (i.e., policy rollout) nor the state-action-visitation regularization involves the discount factor. In short, the definitions of dϕ(s) and db(s) in our work differ from those in this prior work. Note that the prior bound has 1 − γ in the denominator and hence cannot be applied to the undiscounted case (discount factor γ = 1). Furthermore, instead of the KL divergence, we bound the total variation distance between the state visitation frequencies, which is a well-defined metric.

4 STATE-ACTION JOINT REGULARIZED IMPLICIT POLICY

In this section we discuss an instance of our framework that will be used in our empirical study in Section 5. Concretely, we train a fully implicit policy via a GAN structure to approximately minimize the JSD between the state-action visitation frequency induced by the current policy and that induced by the behavior policy. Our basic algorithm is discussed in Section 4.1, followed by three enhancing components presented in Section 4.2 that build up our full algorithm. This instantiation manifests three facets we consider important in offline RL: (1) the flexibility of the policy class, (2) an effective sample-based regularization without explicitly modelling the behavior policy, and (3) smoothness of the learned policy.

4.1 BASIC ALGORITHM

Motivated by the standard actor-critic and GAN frameworks described in Section 2, our basic algorithm consists of a critic Qθ, an actor πϕ, and a discriminator Dw. For training stability, we adopt the double Q-learning (Hasselt, 2010) and target-network formulation to train a pair of critics Qθ1, Qθ2 and the target networks Qθ′1, Qθ′2, πϕ′.

To penalize uncertainty in the action-value estimates at future states, while controlling conservatism, we follow Fujimoto et al. (2019) and Kumar et al. (2019) and use the following critic-training target:

$$\widetilde{Q}(s, a) \triangleq r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi_{\phi'}(\cdot \mid s')} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a') + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a') \right] \qquad (4)$$

with some hyperparameter λ ∈ [0, 1], where one a′ is sampled at each s′ in the mini-batch. The training objective for both critic networks is to minimize the mean-squared error between their respective action-value estimates Qθj(s, a) and Q̃(s, a) over state-action pairs in the mini-batch.

Our actor consists of three components: an implicit policy, a state-action-visitation joint regularization, and a conservative training target.
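As a concrete reading of Equation 4, the sketch below computes the λ-weighted double-Q target from a mini-batch, with one a′ sampled per s′ as in the basic algorithm. The target-network objects (q1_target, q2_target, policy_target), the tensor shapes, and the default γ and λ values are assumptions for illustration.

```python
# Sketch of the critic target in Equation 4; component names and shapes are illustrative.
import torch

def critic_target(r, s_next, q1_target, q2_target, policy_target, gamma=0.99, lam=0.75):
    """r: [B, 1] rewards; s_next: [B, S] next states; returns the [B, 1] target Q~(s, a)."""
    with torch.no_grad():
        a_next = policy_target(s_next)                          # a' ~ pi_phi'(. | s')
        q1, q2 = q1_target(s_next, a_next), q2_target(s_next, a_next)
        mix = lam * torch.minimum(q1, q2) + (1.0 - lam) * torch.maximum(q1, q2)
        return r + gamma * mix

# Both critics then regress onto this target, e.g.
#   loss_j = ((q_j(s, a) - critic_target(r, s_next, ...)) ** 2).mean()  for j = 1, 2.
```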
**Implicit Policy.** As discussed in Sections 1 and 2, a deterministic or Gaussian policy may miss important rewarding actions, or even concentrate on inferior average actions. For online off-policy RL, Yue et al. (2020) shows the benefit of introducing a Gaussian policy mixed with an implicit distribution. Generalizing that idea to offline RL, we train a fully implicit policy, which transforms a given noise distribution into the state-conditional action distribution via a neural network, reminiscent of the generator in CGAN. Specifically, given s,

$$a \sim \pi_{\phi}(\cdot \mid s) = \pi_{\phi}(s, z), \quad z \overset{iid}{\sim} p_z(z), \qquad (5)$$

where πϕ is a deterministic function and pz(z) is some noise distribution. As shown in Figures 1d and 1e, an implicit policy has a stronger capability of learning a multi-modal policy, if needed.

**State-action Joint Regularization.** As discussed in Section 2, directly sampling from dϕ(s) is infeasible in offline RL. Motivated by Equation 2, we approximate dϕ(s) by db(s) in our implementation, which allows an easy sampling scheme that simply samples s from D, without needing an importance-sampling correction that may possess high variance (Liu et al., 2018). Such an approximation is classical in the off-policy and offline RL literature (Degris et al., 2012; Levine et al., 2020), with demonstrated empirical effectiveness in offline RL (Fu et al., 2019), besides its usage in off-policy RL (Silver et al., 2014; Schulman et al., 2015; Lillicrap et al., 2016). With this simplification, we can efficiently maximize the similarity between db(s, a) and dϕ(s, a) with respect to a sample-based estimate of some statistical divergence, such as the Jensen–Shannon divergence. Using the notation of GAN, the generator sample x and the data sample y are defined as

$$x \triangleq (\tilde{s}, \tilde{a}), \quad \tilde{s} \sim \mathcal{D}, \; \tilde{a} \sim \pi_{\phi}(\cdot \mid \tilde{s}); \qquad y \triangleq (s, a) \sim \mathcal{D}, \quad \tilde{s} \text{ independent of } s. \qquad (6)$$

We then constrain this statistical-divergence value, named the generator loss Lg(ϕ), in the training of the actor. In our instantiation of approximately minimizing JSD via GAN, Lg = Ex[log(1 − Dw(x))].

Our choice of directly matching state-action joint visitations mitigates the issue of uncontrollable extrapolation errors in the action-value function estimate, since proximity of the visitation frequencies reduces the chance of estimating the action-values of state-action pairs far from the offline dataset. Furthermore, the problem in the policy-matching approach of fitting each state-conditional action distribution on only one data point is circumvented, since the state-action pairs (si, ai) in the offline dataset are all viewed as samples from the joint visitation frequency, instead of each pair being separately viewed as one sample from the state-conditional distribution, i.e., ai ∼ π(· | si). Besides, the state-action-visitation joint-matching approach implicitly encourages smoothness of the state-action mapping, namely, similar states should have similar actions. This is because, for example, the discriminator in GAN can easily discriminate as "fake" a generator sample x whose state is similar to a data sample's but whose action is very different. This smoothness feature helps ensure a reliable generalization of our policy to unseen states.

**Actor-Training Target.** To prevent the accumulation of accidental errors in the training process and in the approximation of the action-value function, we adopt the strategy of Kumar et al. (2019) to train the policy with respect to a conservative estimate of the action-values. Specifically, we exploit the double-Q structure and use the minimum of the two action-value estimates for policy improvement. For ease of optimization, we use the Lagrangian form of the constrained optimization problem and penalize the generator loss Lg(ϕ) while improving the policy. Our policy-training target is

$$\arg\min_{\phi}\; -\mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_{\phi}(\cdot \mid s)} \left[ \min_{j=1,2} Q_{\theta_j}(s, a) \right] + \alpha \cdot \mathcal{L}_g(\phi), \qquad (7)$$

where α is a fixed Lagrange multiplier.
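A minimal sketch of the implicit policy in Equation 5 and the actor objective in Equation 7 follows. The network sizes, noise dimension, action bound, and the placeholder generator_loss callable are assumptions for illustration rather than our exact architecture.

```python
# Sketch of the implicit policy (Equation 5) and the actor objective (Equation 7).
import torch
import torch.nn as nn

class ImplicitPolicy(nn.Module):
    """Deterministic map (s, z) -> a with z ~ N(0, I), so that a ~ pi_phi(. | s)."""
    def __init__(self, state_dim, action_dim, noise_dim=10, max_action=1.0):
        super().__init__()
        self.noise_dim, self.max_action = noise_dim, max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())

    def forward(self, state):
        z = torch.randn(state.shape[0], self.noise_dim, device=state.device)  # z ~ p_z(z)
        return self.max_action * self.net(torch.cat([state, z], dim=1))

def actor_loss(states, policy, q1, q2, generator_loss, alpha=0.1):
    """Equation 7: maximize min_j Q_theta_j under the alpha-weighted joint-matching penalty L_g."""
    actions = policy(states)                                     # a ~ pi_phi(. | s), s ~ D
    q_min = torch.minimum(q1(states, actions), q2(states, actions))
    return -q_min.mean() + alpha * generator_loss(states, actions)
```

Because the noise z is redrawn on every forward pass, repeated calls at the same state produce different actions, which is what allows the policy to represent multiple action modes.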
At test time, we follow Fujimoto et al. (2019) and Kumar et al. (2019) to first sample 10 actions from πϕ and then execute the action that maximizes Qθ1.

The discriminator is trained to better distinguish samples from dϕ(s, a) and from db(s, a). It aids the matching of the state-action-visitation frequencies by outputting the generator loss Lg(ϕ). For example, when approximately minimizing JSD via GAN, the discriminator outputs the probability that the input, either the x or the y in Equation 6, comes from db(s, a). In this case, the discriminator is trained to minimize the error in assigning x as "fake" and y as "true," i.e., the inner maximization of Equation 3.

4.2 ENHANCING COMPONENTS

In this section we present three enhancing components that further improve our basic algorithm.

**Multiple Action-samples at Bellman Backup.** The expectation in the critic learning target of Equation 4 can be better estimated, in the sense of a smaller sample variance, by averaging over Na actions a′ at each s′ in the mini-batch, rather than just one a′ as in Section 4.1.

**State-smoothing at Bellman Backup.** Due to the stochastic nature of the environmental transition, multiple next states s′ are possible after taking action a at state s, while the offline dataset D contains only one such s′. Since the agent is unable to interact with the environment to collect more data in offline RL, local exploration (Sinha et al., 2021) in the state space appears as an effective strategy to regularize the Bellman backup by taking into consideration states close to the records in the offline dataset. We assume that (1) a small transformation to a state results in states physically plausible in the underlying environment (as in Sinha et al. (2021)); and (2) when the state space is continuous, the transition kernel P(· | s, a) is locally continuous and centers at the recorded s′ in the dataset. With these assumptions, we propose to fit Qθ(s, a) on the value of a small region around the recorded next state s′. Specifically, with a pre-specified standard deviation σB, we sample around s′ as ŝ = s′ + ϵ, ϵ ∼ N(0, σB²I), and modify Equation 4 as

$$\widetilde{Q}(s, a) \triangleq r(s, a) + \gamma\, \mathbb{E}_{\hat{s}}\, \mathbb{E}_{\hat{a} \sim \pi_{\phi'}(\cdot \mid \hat{s})} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(\hat{s}, \hat{a}) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(\hat{s}, \hat{a}) \right], \qquad (8)$$

where NB samples ŝ are drawn to estimate the expectation. This strategy is equivalent to using a Gaussian distribution centered at s′ to approximate the otherwise non-smooth δ_{s′} transition kernel manifested in the offline dataset. A similar technique is considered as the target-policy smoothing regularization in Fujimoto et al. (2018), though the smoothing therein is applied to the target action.

**State-smoothing at Joint-matching.** As described in Section 4.1, we approximate dϕ(s) by dD(s) in the sample-based estimate of the chosen statistical distance. However, dD(s) is in essence discrete, and the idea of smoothing the discrete state distribution can be applied again to provide a better coverage of the state space. This design explicitly encourages a predictable and smooth behavior at states unseen in the offline dataset. Specifically, with some pre-specified σJ², we modify the sampling scheme of s̃ in Equation 6 as

$$\tilde{s} \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}\!\left(0, \sigma_J^2 I\right), \quad \tilde{s} \leftarrow \tilde{s} + \epsilon. \qquad (9)$$

Our strategy is akin to sampling from a kernel density approximation (Wasserman, 2006) of dϕ(s) with data points s ∈ D and a radial-basis kernel of bandwidth σJ.
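The sketch below assembles the joint-matching samples of Equation 6 with the state-smoothing of Equation 9 and returns the discriminator loss and the generator loss Lg used by the actor. The discriminator interface, the smoothing scale, and the binary-cross-entropy form of the discriminator loss are illustrative assumptions.

```python
# Sketch of state-action joint matching with state-smoothing (Equations 6 and 9).
import torch
import torch.nn as nn

bce = nn.BCELoss()

def joint_matching_losses(s, a, fresh_s, policy, discriminator, sigma_j=3e-4):
    """(s, a): pairs from D (the data sample y); fresh_s: independently resampled states from D."""
    s_tilde = fresh_s + sigma_j * torch.randn_like(fresh_s)     # Equation 9: smooth the state
    a_tilde = policy(s_tilde)                                   # a_tilde ~ pi_phi(. | s_tilde)
    x = torch.cat([s_tilde, a_tilde], dim=1)                    # generator sample x ~ d_phi(s, a)
    y = torch.cat([s, a], dim=1)                                # data sample y ~ d_b(s, a)

    # Discriminator loss: label y as "true" (from d_b) and x as "fake" (from d_phi).
    d_loss = bce(discriminator(y), torch.ones(y.shape[0], 1)) + \
             bce(discriminator(x.detach()), torch.zeros(x.shape[0], 1))

    # Generator loss L_g(phi) = E_x[log(1 - D_w(x))], penalized in the actor objective (Equation 7).
    g_loss = torch.log(1.0 - discriminator(x) + 1e-8).mean()
    return d_loss, g_loss
```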
Algorithm 1 shows the main steps of our algorithm, instantiated by approximately minimizing JSD via GAN; a detailed version is in Appendix B, and implementation details are in Appendix D.2.

**Algorithm 1 State-Action Joint Regularized Implicit Policy (GAN-JSD, Main Steps)**

Initialize the policy network πϕ, critic networks Qθ1 and Qθ2, and discriminator network Dw.
for each iteration do
  Sample a mini-batch of transitions B = {(s, a, r, s′)} ∼ D.
  From Equation 8, train the critics by arg min_{θj} (Qθj(s, a) − Q̃(s, a))² over (s, a) ∈ B, for j = 1, 2.
  Calculate the generator loss Lg using Dw and the x, y in Equation 6, with the state-smoothing of Section 4.2.
  Optimize the policy network πϕ by Equation 7.
  Optimize the discriminator Dw to maximize E_{y∼dD(·)}[log Dw(y)] + E_x[log(1 − Dw(x))].
end for
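For orientation, one training iteration in the spirit of Algorithm 1 could be organized as sketched below, reusing the illustrative joint_matching_losses and actor_loss helpers from the earlier sketches. All component names, hyperparameter values, and the in-batch permutation used as a stand-in for resampling fresh states are assumptions rather than our exact implementation.

```python
# One training iteration in the spirit of Algorithm 1; components and constants are illustrative.
import torch

def train_iteration(batch, nets, opts, joint_matching_losses, actor_loss,
                    gamma=0.99, lam=0.75, sigma_b=3e-4, n_b=2, n_a=2, beta=0.005):
    s, a, r, s_next = batch                                     # mini-batch (s, a, r, s') ~ D

    # Critic update with the smoothed target of Equation 8 (n_b smoothed states, n_a actions each).
    with torch.no_grad():
        targets = []
        for _ in range(n_b):
            s_hat = s_next + sigma_b * torch.randn_like(s_next)         # s_hat = s' + eps
            for _ in range(n_a):
                a_hat = nets["policy_target"](s_hat)                    # a_hat ~ pi_phi'(. | s_hat)
                q1, q2 = nets["q1_target"](s_hat, a_hat), nets["q2_target"](s_hat, a_hat)
                targets.append(lam * torch.minimum(q1, q2) + (1 - lam) * torch.maximum(q1, q2))
        q_target = r + gamma * torch.stack(targets).mean(dim=0)
    for name in ("q1", "q2"):
        critic_loss = ((nets[name](s, a) - q_target) ** 2).mean()
        opts[name].zero_grad(); critic_loss.backward(); opts[name].step()

    # Joint matching (Equations 6 and 9), then actor (Equation 7) and discriminator updates.
    fresh_s = s[torch.randperm(s.shape[0])]                     # stand-in for resampling states from D
    d_loss, g_loss = joint_matching_losses(s, a, fresh_s, nets["policy"], nets["discriminator"])
    pi_loss = actor_loss(s, nets["policy"], nets["q1"], nets["q2"], lambda *_: g_loss)
    opts["policy"].zero_grad(); pi_loss.backward(); opts["policy"].step()
    opts["discriminator"].zero_grad(); d_loss.backward(); opts["discriminator"].step()

    # Soft-update the target networks.
    for src, dst in (("q1", "q1_target"), ("q2", "q2_target"), ("policy", "policy_target")):
        for p, p_t in zip(nets[src].parameters(), nets[dst].parameters()):
            p_t.data.mul_(1.0 - beta).add_(beta * p.data)
```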
5 EXPERIMENTS

As discussed in Section 1, we consider the flexibility of the policy class, the regularization, and the smoothness of the learned policy to be important. To this end, we develop our state-action joint regularized implicit policy, a framework that supports training an expressive implicit policy via an effective sample-based regularization without an explicit modelling of the behavior policy, and with techniques encouraging its smoothness. In this section we test an instantiation of our framework on continuous-control RL tasks. Specifically, we first show the effectiveness of the implicit policy, the state-action-visitation matching, and our full algorithm (Section 5.1). We then show in an ablation study (Section 5.2) the contributions of several building blocks.

**Implementation.** We use GAN to approximately control the Jensen–Shannon divergence between the state-action visitation frequencies. Our source code builds on the official BCQ repository and largely follows its network architectures. Our implementation of the GAN structure and the hyperparameter choices follow the literature (Goodfellow et al., 2014; Mirza & Osindero, 2014; Radford et al., 2016; Salimans et al., 2016; White, 2016; Goodfellow, 2017). To mimic a hyperparameter-agnostic setting, we minimize hyperparameter tuning across datasets. Implementation details and the hyperparameter setting of our full algorithm with GAN joint-matching are in Appendix D.2.1.

5.1 MAIN RESULTS

To validate the effectiveness of our framework, we test three implementations of the GAN instantiation: (1) the basic algorithm (Section 4.1) regularized by the classical policy matching¹ ("GAN-Cond:Basic"), (2) the basic algorithm regularized by the state-action-visitation joint matching ("GAN-Joint:Basic"), and (3) the full algorithm, which adds the state-smoothing techniques onto our basic algorithm ("GAN-Joint"). We compare them with the state-of-the-art (SOTA) offline-RL algorithms BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019) (BRAC-v: value penalty), BCQ (Fujimoto et al., 2019), and CQL (Kumar et al., 2020), together with offline soft actor-critic (SAC) (Haarnoja et al., 2018a). We re-run CQL using the official source code (details in Appendix D.2.2). Results for the other algorithms are from Fu et al. (2021).

Table 1 validates the effectiveness of our full algorithm, together with the efficacy of the implicit policy, the state-action-visitation matching, and the state-smoothing. Our full algorithm on average outperforms the baseline algorithms, and its performance is relatively stable across tasks and datasets of diverse nature. Our full algorithm especially shows robust and comparatively good performance on the high-dimensional Adroit tasks and the Maze2D tasks that are collected by non-Markovian policies, which may traditionally be considered hard in offline RL. On the Gym-MuJoCo domain, our full algorithm shows its ability to learn from datasets collected by a mixture of behavior policies, and from medium-quality examples, which are the datasets likely to be encountered in real-world applications. These results support our design of an implicit policy, matching the state-action visitation frequencies, and explicit state-smoothing.

Comparing "GAN-Cond:Basic" with the baseline algorithms, especially BEAR and CQL that use Gaussian policies, we see that an implicit policy does in general help the performance. This aligns with our intuition in Sections 1 and 2 on the incapability of the uni-modal Gaussian policy to capture multiple action modes. To verify the gain of our joint-visitation-matching approach over the classical policy matching, apart from comparing our full algorithm with BEAR and BRAC, the SOTA policy-matching algorithms, we further compare "GAN-Joint:Basic" with "GAN-Cond:Basic." We see that on 11 out of 16 datasets "GAN-Joint:Basic" wins against "GAN-Cond:Basic," while results on the other datasets are close. This empirical result may be attributed to the advantages of state-action-visitation matching, e.g., good regularization and smoothness of the state-action mapping (Section 4.1). Comparing "GAN-Joint" with "GAN-Joint:Basic," we see that our state-smoothing techniques do in general help the performance. This performance gain may be related to a smoother action choice at states not covered by the offline dataset, and a more regularized Bellman backup. Note that different datasets may require different smoothing strengths, and thus this comparison could potentially be more significant if one were allowed to tune the smoothing hyperparameter per dataset.

¹ The same state is used in the generator and data sample, i.e., s̃ = s. The implementation follows Wu et al. (2019).

Table 1: Normalized returns for experiments on the D4RL suite of tasks. We perform experiments on tasks from the Maze2D, Gym-MuJoCo, and Adroit domains. High average scores and low average ranks are desirable. [Per-task rows for the 16 Maze2D, Gym-MuJoCo, and Adroit datasets are not reproduced here; only the summary rows are shown.]

| Task Name | SAC-off | BEAR | BRAC-v | BCQ | CQL | GAN-Cond:Basic | GAN-Joint:Basic | GAN-Joint |
|---|---|---|---|---|---|---|---|---|
| Average Score | 10.0 | 44.1 | 24.1 | 51.4 | 47.1 | **55.6** | **59.6** | **61.4** |
| Average Rank | 6.8 | 4.8 | 5.5 | 4.3 | 4.6 | **3.9** | **2.9** | **3.2** |

5.2 ABLATION STUDY

The ablation study serves to answer the following questions: (a) Is an implicit policy better than a Gaussian policy in our framework? (b) Does state-smoothing at joint-matching help? (c) Does state-smoothing at the Bellman backup matter? (d) How important is the standard deviation of the Gaussian noise injected in state-smoothing? Unless stated otherwise, the hyperparameters for all algorithmic variants on all datasets are given in Table 2.

**(a):** We compare the performance of our basic algorithm with a variant in which the implicit policy is replaced by a Gaussian policy. To make a fair comparison, the critics, discriminator, and hyperparameter setting remain the same. Technical details are in Appendix D.2.3. Table 4 presents the results. On 11 out of 16 datasets, our basic GAN-joint-matching algorithm has a higher average return than the Gaussian-policy variant. This empirical result coincides with our intuition in Section 4.1 and the results in Section 5.1 that a Gaussian policy is less flexible to capture all
the rewarding actions, of which an implicit policy is likely to be capable. Appendix F visualizes this comparison and shows in plots that Gaussian policies do leave out action modes in offline RL.

**(b):** We compare our full algorithm with a variant that applies no state-smoothing in the matching of state-action visitations. The results are shown in Table 5, where our full algorithm overall performs better than the no-state-smoothing variant. This performance gain may be attributed to a better coverage of the state space by the smoothed state distribution (Section 4.2), which is related to a more predictable and smoother action choice at unseen states. As stated previously, this comparison could potentially be more significant if one were allowed to tune the σJ parameter per dataset.

**(c):** We compare our full algorithm with a variant that applies no state-smoothing in the Bellman backup. Table 6 shows the results. Again, our full algorithm overall performs better than the no-state-smoothing version, showing the benefit of smoothing the empirical transition kernel δ_{s′} (Section 4.2), e.g., taking the stochasticity of state transitions into account. As before, we use the same smoothing strength across all datasets, while a per-dataset tuning of the σB parameter may sharpen the distinction.

**(d):** To simplify hyperparameter tuning, in the actual implementation we fix σB = σJ ≜ σ (see Appendix D.2.1). To test the robustness to the σ hyperparameter, we run our full algorithm under σ ∈ {1 × 10⁻², 3 × 10⁻³, 1 × 10⁻³, 3 × 10⁻⁴, 1 × 10⁻⁴, 0}. Table 7 shows the normalized returns. We see that our algorithm is relatively insensitive to the setting of σ, especially in the range σ ∈ [1 × 10⁻⁴, 1 × 10⁻³], where the performance varies little with σ. A too-small σ cannot provide enough smoothing to the state distributions. On the contrary, a too-large σ may strongly distort the information contained in the offline dataset, such as the state-transition kernel. In both cases, a degradation in the overall performance is expected.

6 CONCLUSION AND FUTURE WORK

In this paper, we develop a framework that supports learning a flexible yet well-regularized policy in offline RL.
Specifically, we train a fully implicit policy via a regularization on the difference between the state-action visitation frequency induced by the current policy and that induced by the data-collecting behavior policy. An effective instantiation of our framework through the GAN structure is provided for approximately minimizing the JSD between the visitation frequencies. Other divergence metrics, such as the MMD, may also be applied and are left for future work. We further augment our algorithm with explicit state-smoothing techniques to enhance its generalizability to states beyond the offline dataset. On the theoretical side, we show the equivalence between policy matching and state-action-visitation matching, and hence the compatibility of many prior algorithms with our framework. We note that our implementation-wise simplification of approximating the current policy's state visitation frequency by the behavior policy's can be improved by a more tactful approximation of the current policy's state occupancy measure, which is left for future work. Nevertheless, the effectiveness of our framework and implementations is validated through extensive experiments and an ablation study on the D4RL dataset.

REPRODUCIBILITY STATEMENT

To help reproduce our empirical work, we provide a detailed algorithmic description in Appendix B and implementation details, including the hyperparameter choices and model architectures, in Appendix D. For reproducing our theoretical results, a step-by-step proof is provided in Appendix C.

REFERENCES

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An Optimistic Perspective on Offline Reinforcement Learning, 2020.

Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, abs/1701.07875, 2017.

Peter Baxendale. T. E. Harris's Contributions to Recurrent Markov Processes and Stochastic Flows. The Annals of Probability, 39(2):417–428, 2011. ISSN 00911798.

Marc G. Bellemare, Ivo Danihelka, Will Dabney, S. Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and R. Munos. The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv, abs/1705.10743, 2017.

Mikolaj Binkowski, Danica J. Sutherland, Michael Arbel, and A. Gretton. Demystifying MMD GANs. ArXiv, abs/1801.01401, 2018.

Catherine Cang, Aravind Rajeswaran, P. Abbeel, and M. Laskin. Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL. ArXiv, abs/2106.09119, 2021.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, M. Laskin, P. Abbeel, A. Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling. ArXiv, abs/2106.01345, 2021.

Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. arXiv: Machine Learning, 2013.

T. Degris, Martha White, and R. Sutton. Off-Policy Actor-Critic. ArXiv, abs/1205.4839, 2012.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. J. Mach. Learn. Res., 6:503–556, 2005.

Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyré. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681–2690, 2019.

Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing Bottlenecks in Deep Q-learning Algorithms. In ICML, 2019.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, 2021.
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1587–1596. PMLR, 10–15 Jul 2018.

Scott Fujimoto, David Meger, and Doina Precup. Off-Policy Deep Reinforcement Learning without Exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2052–2062. PMLR, 09–15 Jun 2019.

Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline A/B Testing for Recommender Systems. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018.

I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv, abs/1701.00160, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

Omer Gottesman, Fredrik D. Johansson, J. Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, C. Mosch, Li wei H. Lehman, M. Komorowski, A. Faisal, L. Celi, D. Sontag, and Finale Doshi-Velez. Evaluating Reinforcement Learning Algorithms in Observational Health Settings. ArXiv, abs/1805.12298, 2018.

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and Alex Smola. A Kernel Two-Sample Test. J. Mach. Learn. Res., 13:723–773, 2012.

Caglar Gulcehre, Sergio Gómez Colmenarejo, Ziyu Wang, Jakub Sygnowski, Thomas Paine, Konrad Zolna, Yutian Chen, Matthew Hoffman, Razvan Pascanu, and Nando de Freitas. Addressing Extrapolation Error in Deep Offline Reinforcement Learning, 2021.

Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved Training of Wasserstein GANs. In NIPS, 2017.

Tuomas Haarnoja, Haoran Tang, P. Abbeel, and Sergey Levine. Reinforcement Learning with Deep Energy-Based Policies. In ICML, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1861–1870. PMLR, 10–15 Jul 2018a.

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, G. Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, P. Abbeel, and Sergey Levine. Soft Actor-Critic Algorithms and Applications. ArXiv, abs/1812.05905, 2018b.

H. V. Hasselt. Double Q-learning. In NIPS, 2010.

Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, A. Lapedriza, Noah J. Jones, S. Gu, and Rosalind W. Picard. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog. ArXiv, abs/1907.00456, 2019.
Nathan Kallus and Angela Zhou. Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning, 2020.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization, 2014.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2013.

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. ArXiv, abs/2110.06169, 2021.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1179–1191. Curran Associates, Inc., 2020.

Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and D. Vetrov. Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics. ArXiv, abs/2005.04269, 2020.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch Reinforcement Learning, pp. 45–73. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-27645-3. doi: 10.1007/978-3-642-27645-3_2.

Amy N. Langville and Carl D. Meyer. Deeper inside PageRank. Internet Mathematics, 1(3):335–380, 2004. ISSN 1542-7951.

Romain Laroche and P. Trichelair. Safe Policy Improvement with Baseline Bootstrapping. In ICML, 2019.

Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, J. Pineau, and Kee-Eung Kim. OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. ArXiv, abs/2106.10783, 2021a.

Kimin Lee, M. Laskin, A. Srinivas, and P. Abbeel. SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning. In ICML, 2021b.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems, 2020.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and B. Póczos. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In NIPS, 2017.

T. Lillicrap, Jonathan J. Hunt, A. Pritzel, N. Heess, T. Erez, Yuval Tassa, D. Silver, and Daan Wierstra. Continuous Control with Deep Reinforcement Learning. CoRR, abs/1509.02971, 2016.

Jianhua Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37:145–151, 1991.

Long-Ji Lin. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning, 8(3–4):293–321, 1992.

Q. Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation. In NeurIPS, 2018.

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Shane Gu. Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization. ICLR, abs/2006.03647, 2021.

Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. Society for Industrial and Applied Mathematics, USA, 2000. ISBN 0898714540.

Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. ArXiv, abs/1411.1784, 2014.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normalization for Generative Adversarial Networks. ArXiv, abs/1802.05957, 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning, 2013.

A. Mousavi, Lihong Li, Qiang Liu, and Denny Zhou. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning. ArXiv, abs/2003.11126, 2020.

A. Müller. Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability, 29:429–443, 1997.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and D. Schuurmans. AlgaeDICE: Policy Gradient from Arbitrary Experience. ArXiv, abs/1912.02074, 2019.

Xinkun Nie, Emma Brunskill, and Stefan Wager. Learning When-to-Treat Policies. Journal of the American Statistical Association, 116:392–409, 2019.

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. In Proceedings of the 7th International World Wide Web Conference, pp. 161–172, Brisbane, Australia, 1998.

G. Peyré and Marco Cuturi. Computational Optimal Transport. Found. Trends Mach. Learn., 11:355–607, 2019.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, abs/1511.06434, 2016.

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, J. Schulman, E. Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. ArXiv, abs/1709.10087, 2018.

Tim Salimans, I. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. In NIPS, 2016.

Erwin Schrödinger. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. Annales de l'institut Henri Poincaré, 2:269–310, 1932.

J. Schulman, Sergey Levine, P. Abbeel, Michael I. Jordan, and Philipp Moritz. Trust Region Policy Optimization. ArXiv, abs/1502.05477, 2015.

Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing. The Annals of Statistics, 41(5):2263–2291, 2013. ISSN 00905364, 21688966.

Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning, 2020.

D. Silver, Guy Lever, N. Heess, T. Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic Policy Gradient Algorithms. In ICML, 2014.

Samarth Sinha, Ajay Mandlekar, and Animesh Garg. S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning, 2021.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning Structured Output Representation using Deep Conditional Generative Models. In NIPS, 2015.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. ISBN 0262039249.

Adith Swaminathan, A. Krishnamurthy, Alekh Agarwal, Miroslav Dudík, J. Langford, D. Jose, and I. Zitouni. Off-policy Evaluation for Slate Recommendation. In NIPS, 2017.

H. Tseng, Yi Luo, Sunan Cui, Jen-Tzung Chien, R. Ten Haken, and I. Naqa. Deep Reinforcement Learning for Automated Radiation Adaptation in Lung Cancer. Medical Physics, 44:6690–6705, 2017.

Núria Armengol Urpí, S. Curi, and A. Krause. Risk-Averse Offline Reinforcement Learning. ArXiv, abs/2102.05371, 2021.
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S. Merel, Jost Tobias Springenberg, Scott E. Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, and Nando de Freitas. Critic Regularized Regression. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7768–7778. Curran Associates, Inc., 2020.

Larry Wasserman. All of Nonparametric Statistics. Springer, 2006.

Tom White. Sampling Generative Networks. arXiv: Neural and Evolutionary Computing, 2016.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior Regularized Offline Reinforcement Learning, 2019.

Yue Wu, Shuangfei Zhai, Nitish Srivastava, J. Susskind, Jian Zhang, R. Salakhutdinov, and Hanlin Goh. Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning. In ICML, 2021.

Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative Offline Model-Based Policy Optimization. ArXiv, abs/2102.08363, 2021.

Yuguang Yue, Zhendong Wang, and Mingyuan Zhou. Implicit Distributional Reinforcement Learning. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and K. Takeda. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access, 8:58443–58469, 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and D. Schuurmans. GenDICE: Generalized Offline Estimation of Stationary Values. ArXiv, abs/2002.09072, 2020.

Huangjie Zheng and Mingyuan Zhou. Exploiting Chain Rule and Bayes' Theorem to Compare Probability Distributions, 2021.

A RELATED WORK

**Offline Reinforcement Learning.** Three major themes currently exist in offline-RL research. The first focuses on more robustly estimating the action-value function (Agarwal et al., 2020; Gulcehre et al., 2021) or providing a conservative estimate of the Q-values (Kumar et al., 2020; Yu et al., 2021; Sinha et al., 2021), which may better guide the policy optimization process. The second research theme aims at designing a tactful behavior-cloning scheme so as to learn only from "good" actions in the offline dataset (Wang et al., 2020; Chen et al., 2021). Most similar to our approach, the third line of research tries to constrain the current policy to be close to the behavior policy during the training process, under the notion that Q-value estimates at unfamiliar state-action pairs can be pathologically poor due to a lack of supervised training. Specifically, Kumar et al. (2019) and Wu et al. (2021) use a conditional variational autoencoder (CVAE) (Kingma & Welling, 2013; Sohn et al., 2015) to train a behavior-cloning policy that samples multiple actions at each state for calculating the MMD constraint. Wu et al. (2019), Siegel et al. (2020), and Cang et al. (2021) fit a (Gaussian) behavioral prior to the offline dataset trained by a (weighted) maximum-likelihood objective. Jaques et al. (2019) consider a pre-trained generative prior of human dialog data before applying KL-control to the current policy. Note that these works essentially constrain the distance between the current policy and the cloned behavior policy. Laroche & Trichelair (2019) assume a known stochastic data-collecting behavior policy.
Fujimoto et al. (2019) and Urpí et al. (2021) implicitly control the state-conditional action distribution by decomposing the action into a behavior-cloning component, trained by fitting a CVAE to the offline data, and a perturbation component, trained to optimize the (risk-averse) returns. Besides, some work, such as Wu et al. (2019), directly estimates and regularizes the divergence between the state-conditional action distributions. In this paper, we instead take the perspective of matching the state-action joint visitation frequencies, which avoids using a single point to estimate the divergence between distributions and removes the need for a good approximator to the behavior policy, while implicitly smoothing the mapping from states to actions (Section 1). Note that if we use the same states in constructing the generator samples and the data samples, i.e., s̃ = s in Equation 6, our implementation of state-action-visitation matching is fundamentally equivalent to matching the state-conditional action distributions, as in some prior work.

**Online Off-policy Reinforcement Learning.** A large class of modern online off-policy deep reinforcement learning algorithms trains the policy using an experience replay buffer (Lin, 1992), which stores the rollouts of past policies encountered in the training process (Mnih et al., 2013; Lillicrap et al., 2016; Haarnoja et al., 2017; Fujimoto et al., 2018; Haarnoja et al., 2018b; Kuznetsov et al., 2020; Yue et al., 2020; Lee et al., 2021b). This approach essentially uses the state-visitation frequency of past policies to approximate that of the current policy (Equation 2). We adopt this notion both in the policy-improvement step and in the implementation of state-action-visitation joint matching, with additional smoothing techniques to encourage the smoothness of the empirical transition kernel and of the approximated state-visitation frequency of the current policy (Section 4.2).

**Computational Distribution Matching.** Many computationally efficient algorithms exist for matching two probability distributions with respect to some statistical divergence. GAN (Goodfellow et al., 2014) approximately minimizes the Jensen-Shannon divergence between the model's distribution and the data-generating distribution. A similar adversarial training strategy is further applied to estimate, in a sample-based manner, a class of statistical divergences termed the integral probability metrics (Müller, 1997). For example, Arjovsky et al. (2017), Gulrajani et al. (2017), and Miyato et al. (2018) estimate the Wasserstein-1 distance by enforcing the Lipschitz norm of the witness function to be bounded by 1. Li et al. (2017) and Binkowski et al. (2018) consider the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) with learnable kernels. Bellemare et al. (2017) study the energy distance, an instance of the MMD (Sejdinovic et al., 2013). Another important class of statistical divergences, the optimal transport distances, is also studied under a computational lens. Peyré & Cuturi (2019) summarize recent developments in computational optimal transport. Cuturi (2013) and Feydy et al. (2019) study Sinkhorn-based divergences, which solve an entropically regularized form of the optimal transport problem (Schrödinger, 1932) and may result in a physically more feasible distance measure in some real-world applications. Zheng & Zhou (2021) contribute to this line of research by developing the conditional transport algorithm, which provides better control over the mode-covering and mode-seeking behaviors in distributional matching.
In this paper, we consider the classical GAN structure to approximately control the JSD between the state-action visitation frequencies, since the GAN structure is simple, effective, and well-studied. Other divergence metrics may also be applicable in our framework and are left for future work.

B FULL ALGORITHM

**Algorithm 2 State-Action Joint Regularized Implicit Policy (GAN-JSD, Detailed Version)**

**Input:** Learning rates ηθ, ηϕ, ηw; target smoothing factor β; noise distribution pz(z); policy network πϕ with parameter ϕ; critic networks Qθ1 and Qθ2 with parameters θ1, θ2; discriminator network Dw with parameter w; generator loss Lg; standard deviations σB, σJ; number of smoothed states NB; number of actions a′ at s′, Na; number of epochs for warm start Nwarm; policy frequency k.
**Output:** Learned neural-network parameters ϕ, θ1, θ2, w.
  {// Initialization}
  Initialize ϕ, θ1, θ2, w. Initialize the targets ϕ′ ← ϕ, θ1′ ← θ1, θ2′ ← θ2. Load dataset D.
  **for** each epoch **do**
    **for** each iteration within the current epoch **do**
      Sample a mini-batch of transitions B = {(s, a, r, s′)} ∼ D.
      {// Policy Evaluation}
      For each s′ ∈ B, sample NB smoothed states ŝ = s′ + ϵ, ϵ ∼ N(0, σB² I), for state smoothing.
      Sample Na corresponding actions â ∼ πϕ′(· | ŝ) for each ŝ.
      Calculate the target value
        Q̃(s, a) ≜ r(s, a) + γ · (1 / (NB × Na)) Σ_(ŝ,â) [ λ min_{j=1,2} Qθj′(ŝ, â) + (1 − λ) max_{j=1,2} Qθj′(ŝ, â) ].
      Minimize the critic loss with respect to θj, j = 1, 2, over (s, a) ∈ B, with learning rate ηθ:
        argmin_{θj} (1/|B|) Σ_{(s,a)∈B} ( Qθj(s, a) − Q̃(s, a) )².
      {// Policy Improvement}
      Resample |B| new states s̃ ∼ D, independent of s.
      Add state smoothing to s̃ with noise standard deviation σJ: s̃ ← s̃ + ϵ, ϵ ∼ N(0, σJ² I).
      Form the generator sample x ≜ (s̃, ã), ã ∼ πϕ(· | s̃), and the data sample y ≜ (s, a) ∈ B.
      Calculate the generator loss Lg(ϕ) = (1/|B|) Σ_x log(1 − Dw(x)) using x and the discriminator Dw.
      **if** iteration count % k == 0 **then**
        **if** epoch count < Nwarm **then**
          Optimize the policy with respect to Lg(ϕ) only, with learning rate ηϕ, i.e., argmin_ϕ α · Lg(ϕ).
        **else**
          Optimize the policy with learning rate ηϕ for the target
            argmin_ϕ −(1/|B|) Σ_{s∼B, a∼πϕ(·|s)} min_{j=1,2} Qθj(s, a) + α · Lg(ϕ).
        **end if**
      **else**
        Skip the policy-improvement step.
      **end if**
      {// Training the Discriminator}
      Optimize the discriminator Dw to maximize (1/|B|) Σ_y log Dw(y) + (1/|B|) Σ_x log(1 − Dw(x)) with respect to w, with learning rate ηw.
      {// Soft Update of the Target Networks}
      ϕ′ ← βϕ + (1 − β)ϕ′; θj′ ← βθj + (1 − β)θj′ for j = 1, 2.
    **end for**
  **end for**

C PROOFS

We follow the offline RL literature (Liu et al., 2018; Nachum et al., 2019; Kallus & Zhou, 2020; Mousavi et al., 2020; Zhang et al., 2020) to assume the following regularity condition on the MDP structure, which ensures ergodicity and that the limiting state-occupancy measure exists and equals the stationary distribution of the chain.

**Assumption 1 (Ergodicity of MDP).** The MDP M is ergodic, i.e., the Markov chains associated with any πb and any πϕ under consideration are positive Harris recurrent (Baxendale, 2011).

**Lemma 3.** Let A ∈ R^(N×N) be nonsingular, and let 0 ≠ b ∈ R^N and x = A⁻¹b ∈ R^N. Let ∆A ∈ R^(N×N) be an arbitrary perturbation of A. Assume that the norm on R^(N×N) satisfies ∥Ax∥ ≤ ∥A∥∥x∥ for all A and x. If (A + ∆A)(x + ∆x) = b and ∥∆A∥/∥A∥ < 1/κ(A), then

  ∥∆x∥/∥x∥ ≤ [ κ(A) ∥∆A∥/∥A∥ ] / [ 1 − κ(A) ∥∆A∥/∥A∥ ] = κ(A) / ( ∥A∥/∥∆A∥ − κ(A) ).
_Proof._ Let x̂ = x + ∆x. Since Ax = b,

  A(x + ∆x) + ∆A x̂ = b  ⟹  A ∆x + ∆A x̂ = 0  ⟹  ∆x = −A⁻¹ ∆A x̂.

Then

  ∥∆x∥ ≤ ∥A⁻¹∥ ∥∆A∥ ( ∥x∥ + ∥∆x∥ ) = κ(A) (∥∆A∥/∥A∥) ( ∥x∥ + ∥∆x∥ )
  ⟹  ( 1 − κ(A) ∥∆A∥/∥A∥ ) ∥∆x∥ ≤ κ(A) (∥∆A∥/∥A∥) ∥x∥
  ⟹  ∥∆x∥/∥x∥ ≤ [ κ(A) ∥∆A∥/∥A∥ ] / [ 1 − κ(A) ∥∆A∥/∥A∥ ] = κ(A) / ( ∥A∥/∥∆A∥ − κ(A) ),

since κ(A) ∥∆A∥/∥A∥ < 1 by assumption.
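As a quick illustration (and not part of the proof), the bound in Lemma 3 can be checked numerically. The snippet below assumes NumPy, uses the operator 1-norm as in the proofs of Theorems 1 and 2, and all variable names are ours.

```python
# Illustrative numerical check of the perturbation bound in Lemma 3 (not part of the proof).
import numpy as np

rng = np.random.default_rng(0)
N = 5
A = rng.normal(size=(N, N)) + N * np.eye(N)     # a generic, reasonably conditioned matrix
b = rng.normal(size=N)
x = np.linalg.solve(A, b)                       # x = A^{-1} b

delta_A = 1e-3 * rng.normal(size=(N, N))        # small perturbation Delta A
x_pert = np.linalg.solve(A + delta_A, b)        # solves (A + Delta A)(x + Delta x) = b

kappa = np.linalg.cond(A, p=1)                  # condition number in the 1-norm
rel_A = np.linalg.norm(delta_A, 1) / np.linalg.norm(A, 1)
assert kappa * rel_A < 1                        # hypothesis of Lemma 3

lhs = np.linalg.norm(x_pert - x, 1) / np.linalg.norm(x, 1)
rhs = kappa * rel_A / (1 - kappa * rel_A)
print(f"relative change {lhs:.2e} <= bound {rhs:.2e}")
assert lhs <= rhs
```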
_Proof of Theorem 1._ By ergodicity, dϕ(s), db(s) ∈ R^N uniquely exist. For d ∈ {dϕ, db} and the corresponding T ∈ {T^ϕ, T^b}, stationarity implies that d is an eigenvector of T⊤ associated with eigenvalue 1, i.e., d⊤T = d⊤ and hence (I − T⊤) d = 0, and furthermore 1⊤d = 1. Stacking the two conditions gives

  T̂ d = (1, 0, . . . , 0)⊤ ∈ R^(N+1),  where  T̂ ≜ [ 1⊤ ; I − T⊤ ] ∈ R^((N+1)×N).   (10)

Since T is a positive matrix, by the Perron–Frobenius theorem (Meyer, 2000), 1 is an eigenvalue of T⊤ with algebraic multiplicity, and hence geometric multiplicity, 1. The eigen-equation T⊤v = v has a unique solution v up to a constant multiplier. Since T⊤d = d, the eigenspace of T⊤ associated with eigenvalue 1 is span({d}). Hence dim ker(I − T⊤) = 1 ⟹ rank(I − T⊤) = N − 1 ⟹ rank(T̂) = N. The reason is that if ∃ v ≠ 0 s.t. T̂ v = 0, then

  T̂ v = ( 1⊤v ; (I − T⊤) v ) = 0  ⟹  v = c d for a scalar c ∈ R and 1⊤v = c 1⊤d = 0  ⟹  c = 0,

and hence v = c d = 0, which contradicts v ≠ 0.

Since rank(T̂) = N, dim{ v ∈ R^(N+1) : v⊤ T̂ = 0 } = 1. For such a v ≠ 0, ∃ i ∈ {2, . . . , N + 1} s.t. vi ≠ 0; WLOG, assume v_(N+1) ≠ 0. Let A ∈ R^(N×N) be the first N rows of T̂; then rank(A) = N. The reason is that if rank(A) < N, then ∃ w ≠ 0 s.t. w⊤A = 0, so (w⊤, 0) T̂ = w⊤A = 0; since v_(N+1) ≠ 0, (w⊤, 0)⊤ is not a constant multiple of v ⟹ dim ker(T̂⊤) ≥ 2, which contradicts rank(T̂) = N. Thus, we conclude that rank(A) = N ⟹ A is invertible.

Let e^(1) = (1, 0, . . . , 0)⊤ ∈ R^N. Then Equation 10 implies A d = e^(1). Plugging in dϕ, db, Tϕ, Tb and defining Aϕ, Ab analogously, we have Aϕ dϕ = e^(1) and Ab db = e^(1).

For Tb and Tϕ, we notice that, by Jensen's inequality,

  Σ_{j=1}^N |(Tb − Tϕ)_(ij)| ≤ Σ_{j=1}^N ∫_A P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) |πb(a_t | s^i) − πϕ(a_t | s^i)| da_t
  = ∫_A Σ_{j=1}^N P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) |πb(a_t | s^i) − πϕ(a_t | s^i)| da_t
  = ∫_A |πb(a_t | s^i) − πϕ(a_t | s^i)| da_t
  = ∥πb(· | s^i) − πϕ(· | s^i)∥_1.

Therefore, by assumption,

  ∥Tb − Tϕ∥_∞ = max_{i=1,...,N} Σ_{j=1}^N |(Tb − Tϕ)_(ij)| ≤ max_{i=1,...,N} ∥πb(· | s^i) − πϕ(· | s^i)∥_1 ≤ ϵ.

Let Aϕ = Ab + ∆A. For ∥∆A∥_1, we notice that

  ∥∆A∥_1 = ∥Aϕ − Ab∥_1 ≤ ∥(I − Tϕ⊤) − (I − Tb⊤)∥_1 = ∥(Tb − Tϕ)⊤∥_1 = ∥Tb − Tϕ∥_∞ ≤ ϵ.

Notice that the matrix 1-norm satisfies ∥Mv∥_1 ≤ ∥M∥_1 ∥v∥_1 for all matrices M and vectors v, that ∥Ab∥_1 ≥ 1, and that ∥db∥_1 = 1. Lemma 3 then implies that

  ∥dϕ − db∥_1 = ∥∆d∥_1 / ∥db∥_1 ≤ κ(Ab) / ( ∥Ab∥_1/∥∆A∥_1 − κ(Ab) ) ≤ ϵ κ(Ab) / ( 1 − ϵ κ(Ab) ) ≤ ϵ κmax / ( 1 − ϵ κmax ) → 0 as ϵ → 0.   (11)

_Proof of Theorem 2._ By ergodicity, dϕ(s) and dbk(s) ∈ R^N uniquely exist, ∀ k. For d ∈ {dϕ, db1, . . . , dbK} and the corresponding T ∈ {T^ϕ, T^b1, . . . , T^bK}, we follow the steps and notations in the proof of Theorem 1 to conclude that rank(A) = N ⟹ A is invertible and that A d = e^(1) ∈ R^N. Plugging in dϕ, dbk, Tϕ, Tbk and defining Aϕ, Abk similarly, we have Aϕ dϕ = e^(1) and Abk dbk = e^(1). For the transition matrix Tb induced by the mixture of policies πb, we have

  Tb,(i,j) = p_πb(S_(t+1) = s^j | S_t = s^i) = ∫_A P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) πb(a_t | s^i) da_t
  = Σ_{k=1}^K wk ∫_A P(S_(t+1) = s^j | S_t = s^i, A_t = a_t) πbk(a_t | s^i) da_t = Σ_{k=1}^K wk Tbk,(i,j),

and therefore Tb = Σ_{k=1}^K wk Tbk.

For Tbk and Tϕ, as in the proof of Theorem 1, by Jensen's inequality,

  Σ_{j=1}^N |(Tbk − Tϕ)_(ij)| ≤ ∥πbk(· | s^i) − πϕ(· | s^i)∥_1.

Therefore, by assumption,

  ∥Tbk − Tϕ∥_∞ = max_{i=1,...,N} Σ_{j=1}^N |(Tbk − Tϕ)_(ij)| ≤ max_{i=1,...,N} ∥πbk(· | s^i) − πϕ(· | s^i)∥_1 ≤ ϵ.

Let Aϕ = Abk + ∆Abk. For ∥∆Abk∥_1, we have

  ∥∆Abk∥_1 = ∥Aϕ − Abk∥_1 ≤ ∥Tbk⊤ − Tϕ⊤∥_1 = ∥Tbk − Tϕ∥_∞ ≤ ϵ.

Note that, ∀ k, ∥Abk∥_1 ≥ 1, as in the proof of Theorem 1, and ∥dbk∥_1 = 1. Lemma 3 implies that

  ∥dϕ − dbk∥_1 = ∥∆dk∥_1 / ∥dbk∥_1 ≤ ϵ κ(Abk) / ( 1 − ϵ κ(Abk) ) ≤ ϵ κmax,k / ( 1 − ϵ κmax,k ).

Thus, for the relative distance between dϕ and db, we have

  ∥db∥_1 = Σ_{i=1}^N db(s^i) = Σ_{i=1}^N Σ_{k=1}^K wk dbk(s^i) = Σ_{k=1}^K wk Σ_{i=1}^N dbk(s^i) = Σ_{k=1}^K wk = 1,

and therefore

  ∥dϕ − db∥_1 / ∥db∥_1 = ∥dϕ − db∥_1 = ∥ Σ_{k=1}^K ( wk dϕ − wk dbk ) ∥_1 ≤ Σ_{k=1}^K wk ∥dϕ − dbk∥_1 ≤ Σ_{k=1}^K wk ϵ κmax,k / ( 1 − ϵ κmax,k ) ≤ ϵ κmax / ( 1 − ϵ κmax ) → 0, as ϵ → 0.

Plugging wk = 1/K, ∀ k ∈ {1, . . . , K}, into the second-to-last inequality, we get

  ∥dϕ − db∥_1 ≤ (1/K) Σ_{k=1}^K ϵ κmax,k / ( 1 − ϵ κmax,k ) → 0, as ϵ → 0.

_Remark._ In general, db⊤ Tb = Σ_{k=1}^K wk² dbk⊤ + Σ_{i≠j} wi wj dbi⊤ Tbj ≠ db⊤. One sufficient condition is dbi⊤ Tbj = dbi⊤, ∀ i, j ⟹ dbi = dbj, ∀ i, j ⟹ πbi = πbj, ∀ i, j, similar to Ho & Ermon (2016). In such a case, πb reduces to a single policy, not a mixture, and Theorem 1 applies.

D TECHNICAL DETAILS

D.1 TOY EXPERIMENT

Denoting the total sample size by Ntotal, we follow the convention to construct the eight-Gaussian dataset as in Algorithm 3. In our experiments we use Ntotal = 2000.
**Algorithm 3 Constructing the Eight-Gaussian Dataset**

**Input:** Total sample size Ntotal.
**Output:** Generated dataset DGaussian.
  **while** dataset size < Ntotal **do**
    Draw a random center c uniformly from { (√2, 0), (−√2, 0), (0, √2), (0, −√2), (1, 1), (1, −1), (−1, 1), (−1, −1) }.
    Sample a datapoint x = (x, y) ∼ N(c, 2 × 10⁻⁴ · I2).
    Store x in the dataset.
  **end while**

We are interested in the 2-D eight-Gaussian dataset because (a) the conditional distribution p(y | x) is multi-modal at many x; and (b) interpolation is needed to fill in the gaps between the Gaussian centers, where a smooth interpolation into a circle is naturally expected.

To cast this dataset into the offline reinforcement learning setting, we define x as the state and the corresponding y as the action. Note that the behavior-cloning algorithm does not require information about the reward, the next state, or the episodic termination. Hence, the generated dataset DGaussian can serve as an offline RL dataset readily applicable to training behavior-cloning policies.

To compare the ability to approximate the behavior policy by the KL loss and the JSD loss, by the Gaussian policy and the implicit policy, and by conditional-distribution matching and joint-visitation matching, we fit a conditional VAE ("CVAE"), a Gaussian-generator conditional GAN ("G-CGAN"), and a conditional GAN ("CGAN") using the conditional-distribution matching approach similar to Kumar et al. (2019). We also fit a conditional GAN ("GAN") using the basic joint-visitation matching strategy. As discussed in Appendix A, the major distinction between "CGAN" and "GAN" is that the former uses the same states in constructing the generator samples and the data samples, while the latter resamples states.

The network architecture of our conditional VAE is as follows.

Conditional Variational Auto-encoder (CVAE) in the Toy Experiment
  Encoder:
    Linear(state_dim+action_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    mean = Linear(H//2, latent_dim)
    log_std = Linear(H//2, latent_dim)
  Decoder:
    Linear(state_dim+latent_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    Linear(H//2, action_dim)

with hidden dimension H = 100 and latent dimension latent_dim = 50. The CVAE is trained for 1200 epochs with a mini-batch size of 100 and random seed 0, using the mean-squared error as the reconstruction loss and the Gaussian-case closed-form formula in Kingma & Welling (2013) for the KL term.

The network architecture of our conditional GAN, used in "CGAN" and "GAN," is as follows.

Conditional Generative Adversarial Nets (CGAN) in the Toy Experiment
  Generator:
    Linear(state_dim+z_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    Linear(H//2, action_dim)
  Discriminator:
    Linear(state_dim+action_dim, H)
    LeakyReLU(0.1)
    Linear(H, H//2)
    LeakyReLU(0.1)
    Linear(H//2, 1)

where the structure of BatchNorm1d and LeakyReLU follows Radford et al. (2016). Here we again use H = 100 and z_dim = 50. The conditional GAN is trained for 2000 epochs with a mini-batch size of 100 and random seed 0. We follow Radford et al. (2016) to train the CGAN using the Adam optimizer with β1 = 0.5.

The network architecture of the Gaussian-generator version of our conditional GAN, used in the experiments of Appendix F, is as follows.

  Generator:
    Linear(state_dim, H)
    BatchNorm1d(H)
    ReLU
    Linear(H, H//2)
    BatchNorm1d(H//2)
    ReLU
    mean = Linear(H//2, action_dim), log_std = Linear(H//2, action_dim)

with the Discriminator and other technical details the same as in CGAN. This Gaussian-generator version of CGAN is again trained for 2000 epochs with a mini-batch size of 100, random seed 0, and β1 = 0.5 in the Adam optimizer.
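For concreteness, a minimal PyTorch sketch of the implicit-generator conditional GAN listed above is given below; the class and variable names are ours, and the Gaussian-generator variant would simply replace the final output layer with separate mean and log_std heads as listed above.

```python
# Minimal PyTorch sketch of the toy-experiment conditional GAN (implicit generator).
# Layer sizes mirror the architecture listed above; names are illustrative.
import torch
import torch.nn as nn

H, Z_DIM = 100, 50

class ImplicitGenerator(nn.Module):
    def __init__(self, state_dim, action_dim, h=H, z_dim=Z_DIM):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + z_dim, h), nn.BatchNorm1d(h), nn.ReLU(),
            nn.Linear(h, h // 2), nn.BatchNorm1d(h // 2), nn.ReLU(),
            nn.Linear(h // 2, action_dim),
        )

    def forward(self, state):
        z = torch.randn(state.shape[0], self.z_dim, device=state.device)  # fresh noise per call
        return self.net(torch.cat([state, z], dim=1))

class ToyDiscriminator(nn.Module):
    def __init__(self, state_dim, action_dim, h=H):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, h), nn.LeakyReLU(0.1),
            nn.Linear(h, h // 2), nn.LeakyReLU(0.1),
            nn.Linear(h // 2, 1),   # raw score; a sigmoid / BCE-with-logits loss is applied outside
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))
```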
Our test set is formed by a random sample of 2000 new states (x) from [−1.5, 1.5], together with the states in the training set. The performance on the test set thus shows both the concentration on the eight centers and the smooth interpolation between centers, which translates into a good and smooth fit to the behavior policy. Figure 1 shows the training set ("Truth") and the kernel-density-estimate plot of each method.

D.2 REINFORCEMENT LEARNING EXPERIMENTS

In practice, the rollouts contained in the offline dataset have finite horizon, and thus special treatment is needed for each appearance of a terminal state when calculating the Bellman update target. We follow the standard treatment (Mnih et al., 2013; Sutton & Barto, 2018) to define the update target as

  Q̃(s, a) = r(s, a) + γ Q̃′(s′, a′),  if s′ is a non-terminal state;
  Q̃(s, a) = r(s, a),                 if s′ is a terminal state,

where Q̃′(s′, a′) refers to the expectation term in Equation 4 for the basic algorithm (Section 4.1), or to the expectation term in Equation 8 for the enhanced versions with state smoothing at the Bellman backup (Section 4.2).

For simplicity, we follow White (2016) to choose the noise distribution pz(z) as the multivariate standard normal distribution, where the dimension of z is conveniently chosen as dim(z) = min(10, state_dim//2). To sample from the implicit policy, for each state s we first independently sample z ∼ N(0, I), concatenate s with z, and feed the resulting [s, z] into the deterministic policy network to generate stochastic actions. To sample from a small region around the next state s′ (Section 4.2), we keep the original s′ and repeat it an additional NB times. To each of the NB replications we add an independent Gaussian noise ϵ ∼ N(0, σB² I). The original s′ and its NB noisy replications are then fed into the implicit policy to sample the corresponding actions.
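As a concrete illustration of the two sampling steps just described, a minimal PyTorch-style sketch is given below; `actor`, the batch tensors, and the helper names are placeholders for the components detailed in Appendix D.2.1.

```python
# Sketch of implicit-policy sampling and of state smoothing around the next states.
import torch

def sample_actions(actor, states, noise_dim):
    """Implicit policy: concatenate each state with fresh noise z ~ N(0, I) and map to an action."""
    z = torch.randn(states.shape[0], noise_dim, device=states.device)
    return actor(torch.cat([states, z], dim=1))

def smoothed_next_states(next_states, n_b, sigma_b):
    """Keep the original s' and append N_B noisy replications s' + eps, eps ~ N(0, sigma_B^2 I)."""
    replicated = next_states.repeat(n_b, 1)
    noisy = replicated + sigma_b * torch.randn_like(replicated)
    return torch.cat([next_states, noisy], dim=0)

# Example usage (shapes only), with noise_dim = min(10, state_dim // 2) as in the text:
# s_hat = smoothed_next_states(batch_next_states, n_b=50, sigma_b=3e-4)
# a_hat = sample_actions(actor_target, s_hat, noise_dim=min(10, state_dim // 2))
```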
For fair comparison, we use the same network architecture as the official implementation of the BCQ algorithm, which is stated in detail in Section D.2.1. Due to limited computational resources, we leave a fine-tuning of the noise distribution pz(z), the network architectures, and the optimization hyperparameters for future work, which also leaves room for further improving our results. For a more stable training of the policy, we adopt the warm-start strategy (Kumar et al., 2020; Yue et al., 2020). Specifically, in the first Nwarm epochs the policy is trained to minimize Lg only, with the same learning rate ηϕ as in the remaining epochs, which additionally maximize the expected Q-values.

**Datasets.** We use the continuous-control tasks provided by the D4RL dataset (Fu et al., 2021) to conduct algorithmic evaluations. Due to limited computational resources, we select therein the "medium-expert," "medium-replay," and "medium" datasets for the Hopper, HalfCheetah, and Walker2d tasks in the Gym-MuJoCo domain, which are commonly used benchmarks in prior work (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Kumar et al., 2020). We follow the literature (Cang et al., 2021; Chen et al., 2021; Kostrikov et al., 2021) in not testing on the "random" and "expert" datasets, as they are less practical (Matsushima et al., 2021) and can be respectively solved by directly using standard off-policy RL algorithms (Agarwal et al., 2020) and by behavior-cloning algorithms. We note that a comprehensive benchmarking of prior offline-RL algorithms on the "expert" datasets is currently unavailable in the literature and is out of the scope of this paper. Apart from the Gym-MuJoCo domain, we also consider the Maze2D domain[2] of tasks for its non-Markovian data-collecting policy and the Adroit tasks[3] (Rajeswaran et al., 2018) for their sparse reward signal and high dimensionality.

2 We use the tasks "maze2d-umaze," "maze2d-medium," and "maze2d-large."
3 We use the tasks "pen-human," "pen-cloned," "pen-expert," and "door-expert."

**Evaluation Protocol.** In all the experiments, we follow Fu et al. (2021) to use the "v0" version of the datasets in the Gym-MuJoCo and Adroit domains. In our preliminary study, we find that the rollout results of some existing algorithms can be unstable across epochs on some datasets, even towards the end of training. To reduce the randomness in evaluation, we report, for our algorithm, the mean and standard deviation of the last five rollouts across three random seeds {0, 1, 2}. We train all agents for 1000 epochs, where each epoch consists of 1000 mini-batch stochastic-gradient-descent steps. We roll out each agent for 10 episodes after each epoch of training.

D.2.1 GAN JOINT MATCHING

In approximately matching the Jensen-Shannon divergence between the state-action visitation frequencies via Generative Adversarial Nets, a crucial step is to stably and effectively train the GAN structure. To this end, we adopt the following techniques from the literature.

- To provide stronger gradients early in training, rather than training the policy πϕ to minimize Ex[log(1 − Dw(x))], we follow Goodfellow et al. (2014) and train πϕ to maximize Ex[log Dw(x)].
- Motivated by Radford et al. (2016), we use LeakyReLU activations in both the generator and the discriminator, with the default negative slope of 0.01.
- To stabilize training, we follow Radford et al. (2016) and use a reduced momentum term β1 = 0.4 in the Adam optimizer (Kingma & Ba, 2014).
- We follow Radford et al. (2016) to use actor and discriminator learning rates ηϕ = ηw = 2 × 10⁻⁴.
- To avoid overfitting of the discriminator, we are motivated by Salimans et al. (2016) and Goodfellow (2017) to use one-sided label smoothing with soft and noisy labels. Specifically, the label for each data sample y is replaced with a random number between 0.8 and 1.0, instead of the original 1. No label smoothing is applied to the generator samples x, whose labels remain 0.

The loss function for training the discriminator in GAN is the binary cross-entropy between the labels and the outputs of the discriminator, as sketched below. Furthermore, motivated by TD3 (Fujimoto et al., 2018) and GAN (Section 2), we update πϕ(· | s) once per k updates of the critics and the discriminator.
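A hedged sketch of this discriminator update is given below; `disc` is assumed to output probabilities Dw(·) ∈ (0, 1), and the function and tensor names are ours.

```python
# Sketch of one discriminator update with one-sided label smoothing (Appendix D.2.1).
import torch
import torch.nn.functional as F

def discriminator_loss(disc, data_s, data_a, gen_s, gen_a):
    d_real = disc(torch.cat([data_s, data_a], dim=1))          # D_w(y) for data samples y = (s, a)
    d_fake = disc(torch.cat([gen_s, gen_a.detach()], dim=1))   # D_w(x) for generator samples x
    real_labels = 0.8 + 0.2 * torch.rand_like(d_real)          # soft, noisy labels in [0.8, 1.0]
    fake_labels = torch.zeros_like(d_fake)                     # no smoothing on the generator side
    return (F.binary_cross_entropy(d_real, real_labels)
            + F.binary_cross_entropy(d_fake, fake_labels))
```

Minimizing this binary cross-entropy with respect to w is, up to the one-sided smoothing of the data-side labels, equivalent to maximizing (1/|B|) Σ_y log Dw(y) + (1/|B|) Σ_x log(1 − Dw(x)) as in Algorithm 2.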
Table 2 shows the hyperparameters for our GAN joint-matching framework. Note that several simplifications are made to minimize hyperparameter tuning, such as fixing ηϕ = ηw, as in Radford et al. (2016), and σB = σJ. We also fix Na = 1 to ease computation. We comment that many of these hyperparameters can be set based on the literature; for example, we use ηϕ = ηw = 2 × 10⁻⁴ as in Radford et al. (2016), ηθ = 3 × 10⁻⁴ and Nwarm = 40 as in Kumar et al. (2020), λ = 0.75 as in Fujimoto et al. (2019), and policy frequency k = 2 as in Fujimoto et al. (2018). Unless otherwise specified, the same hyperparameters are used across all datasets.

Below we state the network architectures of the actor, the critic, and the discriminator in GAN. Note that we use a pair of critic networks with the same architecture to perform clipped double Q-learning.

  Actor:
    Linear(state_dim+noise_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    Linear(300, action_dim)
    max_action * tanh
  Discriminator in GAN:
    Linear(state_dim+action_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    Linear(300, 1)
    Sigmoid
  Critic:
    Linear(state_dim+action_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    Linear(300, 1)

Note that the network architectures follow Fujimoto et al. (2019) and that all LeakyReLU activations use the default negative slope of 0.01.

Table 2: Default hyperparameters for GAN joint matching.

  Hyperparameter | Value
  Optimizer | Adam (Kingma & Ba, 2014)
  Learning rate ηθ | 3 × 10⁻⁴
  Learning rates ηϕ, ηw | 2 × 10⁻⁴
  Log Lagrange multiplier log α for non-Adroit datasets | 4.0
  Log Lagrange multiplier log α for Adroit datasets | 8.0
  Evaluation frequency | 10³
  Training iterations | 10⁶
  Batch size | 512
  Discount factor | 0.99
  Target network update rate β | 0.005
  Weighting for clipped double Q-learning λ | 0.75
  Noise distribution pz(z) | N(0, I)
  Standard deviations for state smoothing σB, σJ | 3 × 10⁻⁴
  Number of smoothed states in Bellman backup NB | 50
  Number of actions â at each ŝ, Na | 1
  Number of epochs for warm start Nwarm | 40
  Policy frequency k | 2
  Random seeds | {0, 1, 2}

D.2.2 RESULTS OF CQL

We note that the official CQL GitHub repository does not provide hyperparameter settings for the Maze2D and Adroit domains of tasks. For datasets in these two domains, we train a CQL agent using five hyperparameter settings: the four recommended Gym-MuJoCo settings and the one recommended Ant-Maze setting. We then calculate the average (normalized) return of the last five rollouts over random seeds {0, 1, 2} and report the per-dataset best results across those five hyperparameter settings. We comment that this approach is non-conventional, but rather a compensation for the missing recommended hyperparameter settings, and it may give CQL some advantage on the Maze2D and Adroit domains. For the Gym-MuJoCo domain, we follow Kumar et al. (2020) to use the recommended hyperparameter setting across all datasets.

D.2.3 ABLATION STUDY ON GAUSSIAN POLICY

The network architecture of the Gaussian policy used in the ablation study (Section 5.2) follows common practice (Haarnoja et al., 2018a; Kumar et al., 2020).

  Gaussian Policy:
    Linear(state_dim, 400)
    LeakyReLU
    Linear(400, 300)
    LeakyReLU
    mean = Linear(300, action_dim)
    log_std = Linear(300, action_dim)

The critics and the discriminator are the same as in the implicit-policy case (Appendix D.2.1). For action selection from the Gaussian policy, a given state s is first mapped to the mean µ(s) and the standard-deviation vector σ(s). A raw action is sampled as a_raw ∼ N(µ(s), diag(σ²(s))). Finally, a_raw is mapped into the action space as max_action × tanh(a_raw). For fair comparison, other technical details, including the training procedure and the hyperparameter setting, are exactly the same as in the implicit-policy case (Appendix D.2.1).
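For completeness, the action-selection rule just described can be sketched as follows; `trunk`, `mean_head`, `log_std_head`, and `max_action` are placeholders for the layers listed above.

```python
# Sketch of action selection for the Gaussian-policy ablation (Appendix D.2.3).
import torch

def select_action(trunk, mean_head, log_std_head, state, max_action):
    h = trunk(state)                                   # shared Linear/LeakyReLU layers
    mu, log_std = mean_head(h), log_std_head(h)
    a_raw = mu + log_std.exp() * torch.randn_like(mu)  # a_raw ~ N(mu(s), diag(sigma^2(s)))
    return max_action * torch.tanh(a_raw)              # map into the bounded action space
```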
E ADDITIONAL TABLES

E.1 RAW SCORES OF THE MAIN TABLE

In Table 3 we report the un-normalized results corresponding to Table 1 for reference.

Table 3: Raw returns for experiments on the D4RL suite of tasks. We perform experiments on tasks from the Gym-MuJoCo, Maze2D, and Adroit domains. For our algorithm, we report in this table the mean and standard deviation of the raw returns of the last five rollouts across three random seeds {0, 1, 2}. We run CQL ourselves and report the average raw return of the last five rollouts over random seeds {0, 1, 2}. Results for other algorithms are from Fu et al. (2021).

(Table 3 compares SAC-off, BEAR, BRAC-v, BCQ, CQL, GAN-Cond:Basic, GAN-Joint:Basic, and GAN-Joint on the sixteen Maze2D, Gym-MuJoCo, and Adroit datasets listed in Tables 4–7.)

E.2 TABLES FOR THE ABLATION STUDY

Tables 4–7 correspond to the results of our ablation study in Section 5.2.

Table 4: Normalized returns for comparing the implicit policy with the Gaussian policy on the basic algorithm (Section 4.1) on the D4RL suite of tasks. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.

  Task Name | GAN-Joint-Matching: Basic | GAN-Joint-Matching: Basic, Gaussian Policy
  maze2d-umaze | 57.6 ± 11.0 | 23.9 ± 8.1
  maze2d-medium | 39.4 ± 10.3 | 0.9 ± 7.7
  maze2d-large | 52.1 ± 20.8 | 3.3 ± 0.7
  halfcheetah-medium | 43.7 ± 0.5 | 43.8 ± 0.4
  walker2d-medium | 70.0 ± 9.1 | 51.6 ± 10.9
  hopper-medium | 66.5 ± 15.4 | 79.8 ± 14.1
  halfcheetah-medium-replay | 31.3 ± 2.8 | 33.0 ± 3.1
  walker2d-medium-replay | 9.9 ± 2.0 | 9.1 ± 1.7
  hopper-medium-replay | 36.5 ± 5.2 | 26.3 ± 2.2
  halfcheetah-medium-expert | 74.2 ± 6.1 | 76.8 ± 6.7
  walker2d-medium-expert | 76.5 ± 15.1 | 59.6 ± 21.9
  hopper-medium-expert | 70.9 ± 26.8 | 84.6 ± 10.6
  pen-human | 61.0 ± 13.7 | 55.7 ± 21.1
  pen-cloned | 31.3 ± 21.5 | 36.6 ± 14.6
  pen-expert | 129.1 ± 14.4 | 118.8 ± 15.1
  door-expert | 104.1 ± 3.6 | 39.5 ± 24.6
  Average Score | **59.6** | 46.5

Table 5: Normalized returns for comparing our full algorithm with its counterpart without state smoothing in the joint matching of the state-action visitation. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.
  Task Name | GAN-Joint-Matching: Full Algorithm | GAN-Joint-Matching: No State-smoothing in Joint Matching
  maze2d-umaze | 40.1 ± 16.9 | 47.6 ± 4.7
  maze2d-medium | 69.6 ± 25.6 | 35.4 ± 5.5
  maze2d-large | 71.3 ± 26.0 | 69.9 ± 26.5
  halfcheetah-medium | 44.1 ± 0.3 | 44.0 ± 0.4
  walker2d-medium | 69.3 ± 8.6 | 62.7 ± 8.6
  hopper-medium | 60.1 ± 27.3 | 81.8 ± 16.6
  halfcheetah-medium-replay | 33.1 ± 2.3 | 31.8 ± 2.7
  walker2d-medium-replay | 10.2 ± 2.4 | 9.9 ± 2.5
  hopper-medium-replay | 29.5 ± 2.5 | 29.5 ± 2.4
  halfcheetah-medium-expert | 75.8 ± 10.1 | 69.1 ± 8.3
  walker2d-medium-expert | 71.2 ± 22.0 | 73.8 ± 18.4
  hopper-medium-expert | 99.9 ± 29.0 | 67.8 ± 18.9
  pen-human | 45.5 ± 24.5 | 54.3 ± 19.6
  pen-cloned | 18.0 ± 14.4 | 22.5 ± 17.4
  pen-expert | 141.1 ± 14.8 | 142.4 ± 14.1
  door-expert | 103.4 ± 3.7 | 102.5 ± 3.2
  Average Score | **61.4** | 59.1

Table 6: Normalized returns for comparing our full algorithm with its counterpart without state smoothing in the Bellman backup. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.

  Task Name | GAN-Joint-Matching: Full Algorithm | GAN-Joint-Matching: No State-smoothing in Bellman Backup
  Average Score | **61.4** | 58.8

Table 7: Normalized returns under several values of σ ≜ σB = σJ (Appendix D.2.1) in the full algorithm of GAN joint matching. The reported numbers are the means and standard deviations of the normalized returns of the last five rollouts across three random seeds {0, 1, 2}.
  Task Name | σ = 1 × 10⁻² | σ = 3 × 10⁻³ | σ = 1 × 10⁻³ | σ = 3 × 10⁻⁴ | σ = 1 × 10⁻⁴ | σ = 0
  maze2d-umaze | 48.4 ± 21.4 | 48.3 ± 19.7 | 41.7 ± 11.5 | 40.1 ± 16.9 | 54.9 ± 10.4 | 50.8 ± 24.0
  maze2d-medium | 58.7 ± 33.6 | 48.7 ± 7.4 | 64.0 ± 23.9 | 69.6 ± 25.6 | 46.9 ± 15.5 | 26.4 ± 5.7
  maze2d-large | 87.1 ± 17.9 | 57.6 ± 21.3 | 62.4 ± 13.3 | 71.3 ± 26.0 | 61.0 ± 8.6 | 62.3 ± 32.3
  halfcheetah-medium | 43.0 ± 0.4 | 43.7 ± 0.3 | 43.9 ± 0.4 | 44.1 ± 0.3 | 43.8 ± 0.3 | 44.0 ± 0.4
  walker2d-medium | 56.9 ± 10.4 | 66.4 ± 7.9 | 68.8 ± 10.3 | 69.3 ± 8.6 | 64.6 ± 13.8 | 63.8 ± 8.4
  hopper-medium | 23.5 ± 8.4 | 66.7 ± 20.8 | 63.3 ± 21.0 | 60.1 ± 27.3 | 74.5 ± 19.3 | 89.6 ± 27.9
  halfcheetah-medium-replay | 32.1 ± 2.4 | 31.5 ± 3.3 | 32.3 ± 2.1 | 33.1 ± 2.3 | 31.2 ± 1.9 | 31.5 ± 3.2
  walker2d-medium-replay | 9.8 ± 2.4 | 10.7 ± 2.0 | 10.2 ± 1.8 | 10.2 ± 2.4 | 10.9 ± 1.7 | 9.4 ± 1.4
  hopper-medium-replay | 28.7 ± 3.8 | 30.1 ± 2.7 | 30.5 ± 2.9 | 29.5 ± 2.5 | 31.3 ± 1.9 | 29.2 ± 1.5
  halfcheetah-medium-expert | 79.9 ± 10.1 | 74.2 ± 13.0 | 76.8 ± 13.4 | 75.8 ± 10.1 | 71.3 ± 8.6 | 70.7 ± 7.9
  walker2d-medium-expert | 67.4 ± 16.0 | 63.4 ± 22.2 | 69.7 ± 17.3 | 71.2 ± 22.0 | 63.4 ± 23.1 | 77.2 ± 18.4
  hopper-medium-expert | 20.5 ± 6.8 | 56.7 ± 27.5 | 79.4 ± 21.9 | 99.9 ± 29.0 | 66.7 ± 19.6 | 62.0 ± 19.5
  pen-human | -3.3 ± 0.5 | 64.2 ± 17.0 | 46.6 ± 33.5 | 45.5 ± 24.5 | 67.8 ± 13.4 | 60.3 ± 11.4
  pen-cloned | 4.5 ± 1.9 | 19.6 ± 11.8 | 23.3 ± 13.2 | 18.0 ± 14.4 | 36.6 ± 18.4 | 40.0 ± 20.8
  pen-expert | 74.2 ± 26.6 | 132.8 ± 11.1 | 132.8 ± 17.9 | 141.1 ± 14.8 | 136.6 ± 10.8 | 132.0 ± 19.4
  door-expert | 29.1 ± 9.7 | 104.1 ± 1.6 | 104.2 ± 1.7 | 103.4 ± 3.7 | 102.9 ± 3.9 | 102.3 ± 4.8
  Average Score | 41.3 | 57.4 | 59.4 | 61.4 | 60.3 | 59.5

F FURTHER COMPARISON BETWEEN IMPLICIT AND GAUSSIAN POLICY

In Section 5.2 we note that a uni-modal stochastic policy, such as the Gaussian policy, is less flexible in capturing all the rewarding actions, which an implicit policy may fit well. In this section we visualize such a difference. Figure 2 compares the fits of the eight-Gaussian toy dataset by the implicit policy and by the Gaussian policy. Specifically, Figure 2a plots the dataset; Figure 2b plots the CGAN with the default implicit generator (implicit policy) fitted by the conditional-distribution matching approach; Figure 2c plots the CGAN with a Gaussian generator (Gaussian policy) fitted by the conditional-distribution matching approach; Figure 2d plots the CGAN with an implicit policy fitted by the basic joint-visitation matching strategy (Section 4.1); and Figure 2e plots the CGAN with a Gaussian policy fitted by the basic joint-visitation matching strategy. Experimental details are in Appendix D.1.
(Figure 2: five kernel-density-estimate panels — (a) Truth, (b) CGAN, (c) G-CGAN, (d) GAN, (e) G-GAN.)

Figure 2: Performance of approximating the behavior policy on the eight-Gaussian dataset by a conditional GAN with the default (implicit) generator and with a Gaussian generator. A conditional GAN ("CGAN") and a Gaussian-generator conditional GAN ("G-CGAN") are fitted using the conditional-distribution matching approach. A conditional GAN ("GAN") and a Gaussian-generator conditional GAN ("G-GAN") are fitted using the basic joint-visitation matching strategy (Section 4.1). Performance is judged by (1) clear concentration on the eight centers, and (2) smooth interpolation between centers, which implies a good and smooth fit to the behavior policy. Details of this toy experiment are presented in Appendix D.1.

We see that, regardless of the training strategy, the Gaussian policies fail to learn multi-modal state-conditional action distributions when such multi-modality is needed. Even though the Gaussian-policy version of CGAN may still correctly capture some modes in the action distributions, an improvement over the mode-covering CVAE, it misses other modes. Besides, the Gaussian-policy versions interpolate less smoothly between the centers. In offline RL, these weaknesses relate to missing some rewarding actions and to less-predictable action choices at unseen states.

To visualize the differences between the implicit and the Gaussian policy in the offline RL setting, we plot the kernel density estimates of the distributions of the first two action dimensions in the "maze2d-umaze-v1" dataset, where a performance difference is shown in Table 4. Specifically, Figure 3a plots the distribution of actions in the offline dataset; Figures 3b and 3c respectively plot the action distributions produced by the final Gaussian policy and the final implicit policy generating Table 4.

Figure 3: KDE plots of the first two dimensions of actions in the "maze2d-umaze-v1" dataset. From left to right: (a) action distribution in the offline dataset; (b) action distribution produced by the final Gaussian policy in Table 4; (c) action distribution produced by the final implicit policy (Section 4.1) in Table 4.

We see from Figure 3a and Figure 3b that the Gaussian policy leaves out two action modes, namely the modes on the upper-left and upper-right corners. Figure 3c shows that our implicit policy does capture all the modes shown in Figure 3a. Note that "maze2d-umaze" is a navigation task requiring agents to reach a goal location (Fu et al., 2021).
The Gaussian policy thus may miss some directions in the offline dataset pertaining to short paths to the goal state, which may explain its inferior performance on this dataset in Table 4.