Published as a conference paper at ICLR 2022
DIRECT THEN DIFFUSE:
INCREMENTAL UNSUPERVISED SKILL DISCOVERY
FOR STATE COVERING AND GOAL REACHING
Pierre-Alexandre Kamienny∗1,2, Jean Tarbouriech∗1,3,
Sylvain Lamprier2, Alessandro Lazaric1, Ludovic Denoyer†1
1 Meta AI
2 Sorbonne University, LIP6/ISIR
3 Inria Scool
ABSTRACT
Learning meaningful behaviors in the absence of reward is a difficult problem in
reinforcement learning. A desirable and challenging unsupervised objective is to
learn a set of diverse skills that provide a thorough coverage of the state space
while being directed, i.e., reliably reaching distinct regions of the environment.
In this paper, we build on the mutual information framework for skill discovery
and introduce UPSIDE, which addresses the coverage-directedness trade-off in
the following ways: 1) We design policies with a decoupled structure of a directed
skill, trained to reach a specific region, followed by a diffusing part that induces
a local coverage. 2) We optimize policies by maximizing their number under the
constraint that each of them reaches distinct regions of the environment (i.e., they
are sufficiently discriminable) and prove that this serves as a lower bound to the
original mutual information objective. 3) Finally, we compose the learned directed
skills into a growing tree that adaptively covers the environment. We illustrate in
several navigation and control environments how the skills learned by UPSIDE
solve sparse-reward downstream tasks better than existing baselines.
1 INTRODUCTION
Deep reinforcement learning (RL) algorithms have been shown to effectively solve a wide variety
of complex problems (e.g., Mnih et al., 2015; Bellemare et al., 2013). However, they are often
designed to solve one single task at a time and they need to restart the learning process from scratch
for any new problem, even when it is defined on the very same environment (e.g., a robot navigating
to different locations in the same apartment). Recently, Unsupervised RL (URL) has been proposed
as an approach to address this limitation. In URL, the agent first interacts with the environment
without any extrinsic reward signal. Afterward, the agent leverages the experience accumulated
during the unsupervised learning phase to efficiently solve a variety of downstream tasks defined on
the same environment. This approach is particularly effective in problems such as navigation (see
e.g., Bagaria et al., 2021) and robotics (see e.g., Pong et al., 2020) where the agent is often required
to readily solve a wide range of tasks while the dynamics of the environment remain fixed.
In this paper, we focus on the unsupervised objective of discovering a set of skills that can be
used to efficiently solve sparse-reward downstream tasks. In particular, we build on the insight
that mutual information (MI) between the skills’ latent variables and the states reached by them
can formalize the dual objective of learning policies that both cover and navigate the environment
efficiently. Indeed, maximizing MI has been shown to be a powerful approach for encouraging
exploration in RL (Houthooft et al., 2016; Mohamed & Rezende, 2015) and for unsupervised skill
discovery (e.g., Gregor et al., 2016; Eysenbach et al., 2019; Achiam et al., 2018; Sharma et al., 2020;
Campos et al., 2020). Nonetheless, learning policies that maximize MI is a challenging optimization
problem. Several approximations have been proposed to simplify it at the cost of possibly deviating
from the original objective of coverage and directedness (see Sect. 4 for a review of related work).
∗equal contribution
†Now at Ubisoft La Forge
{pakamienny,jtarbouriech,lazaric}@fb.com, sylvain.lamprier@isir.upmc.fr, ludovic.den@gmail.com
Figure 1: Overview of UPSIDE. The black dot corresponds to the initial state. (A) A set of random policies is
initialized, each policy being composed of a directed part called skill (illustrated as a black arrow) and a dif-
fusing part (red arrows) which induces a local coverage (colored circles). (B) The skills are then updated to
maximize the discriminability of the states reached by their corresponding diffusing part (Sect. 3.1). (C) The
least discriminable policies are iteratively removed while the remaining policies are re-optimized. This is ex-
ecuted until the discriminability of each policy satisfies a given constraint (Sect. 3.2). In this example two
policies are consolidated. (D) One of these policies is used as basis to add new policies, which are then opti-
mized following the same procedure. For the “red” and “purple” policy, UPSIDE is not able to find sub-policies
of sufficient quality and thus they are not expanded any further. (E) At the end of the process, UPSIDE has
created a tree of policies covering the state space, with skills as edges and diffusing parts as nodes (Sect. 3.3).
In this paper, we propose UPSIDE (UnsuPervised Skills that dIrect then DiffusE) to learn a set of
policies that can be effectively used to cover the environment and solve goal-reaching downstream
tasks. Our solution builds on the following components (Fig. 1):
• Policy structure (Sect. 3.1, see Fig. 1 (A)). We consider policies composed of two parts: 1) a
directed part, referred to as the skill, that is trained to reach a specific region of the environment,
and 2) a diffusing part that induces a local coverage around the region attained by the first part.
This structure favors coverage and directedness at the level of a single policy.
• New constrained objective (Sect. 3.2, see Fig. 1 (B) & (C)). We then introduce a constrained opti-
mization problem designed to maximize the number of policies under the constraint that the states
reached by each of the diffusing parts are distinct enough (i.e., they satisfy a minimum level of
discriminability). We prove that this problem can be cast as a lower bound to the original MI
objective, thus preserving its coverage-directedness trade-off. UPSIDE solves it by adaptively
adding or removing policies to a given initial set, without requiring any prior knowledge on a
sensible number of policies.
• Tree structure (Sect. 3.3, see Fig. 1 (D) & (E)). Leveraging the directed nature of the skills,
UPSIDE effectively composes them to build longer and longer policies organized in a tree struc-
ture. This overcomes the need of defining a suitable policy length in advance. Thus in UPSIDE
we can consider short policies to make the optimization easier, while composing their skills along
a growing tree structure to ensure an adaptive and thorough coverage of the environment.
The combination of these components allows UPSIDE to effectively adapt the number and the length
of policies to the specific structure of the environment, while learning policies that ensure coverage
and directedness. We study the effectiveness of UPSIDE and the impact of its components in hard-
to-explore continuous navigation and control environments, where UPSIDE improves over existing
baselines both in terms of exploration and learning performance.
2 SETTING
We consider the URL setting where the agent interacts with a Markov decision process (MDP) M
with state space S, action space A, dynamics p(s′|s, a), and no reward. The agent starts each
episode from a designated initial state s0 ∈S.1 Upon termination of the chosen policy, the agent is
then reset to s0. This setting is particularly challenging from an exploration point of view since the
agent cannot rely on the initial distribution to cover the state space.
We recall the MI-based unsupervised skill discovery approach (see e.g., Gregor et al., 2016). Denote
by Z some (latent) variables on which the policies of length T are conditioned (we assume that Z
is categorical for simplicity and because it is the most common case in practice). There are three
1More generally, s0 could be drawn from any distribution supported over a compact region.
optimization variables: (i) the cardinality of Z denoted by NZ, i.e., the number of policies (we write
Z = {1, . . . , NZ} = [NZ]), (ii) the parameters π(z) of the policy indexed by z ∈Z, and (iii) the
policy sampling distribution ρ (i.e., ρ(z) is the probability of sampling policy z at the beginning of
the episode). Denote policy z’s action distribution in state s by π(·|z, s) and the entropy function
by H. Let the variable ST be the random (final) state induced by sampling a policy z from ρ and
executing π(z) from s0 for an episode. Denote by pπ(z)(sT ) the distribution over (final) states
induced by executing policy z, by p(z|sT ) the probability of z being the policy to induce (final)
state sT, and let p(sT) = Σ_{z∈Z} ρ(z) pπ(z)(sT). Maximizing the MI between Z and ST can be
written as max_{NZ, ρ, π} I(ST; Z), where
$$
\begin{aligned}
I(S_T; Z) &= H(S_T) - H(S_T \mid Z) = -\sum_{s_T} p(s_T) \log p(s_T) + \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_T \mid z}\big[\log p_{\pi(z)}(s_T)\big] \\
&= H(Z) - H(Z \mid S_T) = -\sum_{z \in Z} \rho(z) \log \rho(z) + \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_T \mid z}\big[\log p(z \mid s_T)\big], \qquad (1)
\end{aligned}
$$
where in the expectations sT |z ∼pπ(z)(sT ). In the first formulation, the entropy term over states
captures the requirement that policies thoroughly cover the state space, while the second term mea-
sures the entropy over the states reached by each policy and thus promotes policies that have a
directed behavior. Learning the optimal NZ, ρ, and π to maximize Equation 1 is a challenging prob-
lem and several approximations have been proposed (see e.g., Gregor et al., 2016; Eysenbach et al.,
2019; Achiam et al., 2018; Campos et al., 2020). Many approaches focus on the so-called reverse
formulation of the MI (second line of Equation 1). In this case, the conditional distribution p(z|sT ) is
usually replaced with a parametric model qϕ(z|sT ) called the discriminator that is trained via a neg-
ative log likelihood loss simultaneously with all other variables. Then one can maximize the lower
bound (Barber & Agakov, 2004): I(ST ; Z) ≥Ez∼ρ(z),τ∼π(z) [log qϕ(z|sT ) −log ρ(z)], where we
denote by τ ∼π(z) trajectories sampled from the policy indexed by z. As a result, each policy π(z)
can be trained with RL to maximize the intrinsic reward rz(sT ) := log qϕ(z|sT ) −log ρ(z).
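As a rough illustration of this training signal, the sketch below (our own, assuming a PyTorch-style categorical discriminator; names such as Discriminator and intrinsic_reward are illustrative and not taken from the paper's codebase) shows how the reverse-MI reward rz(sT) = log qϕ(z|sT) − log ρ(z) can be computed, and how qϕ can be trained by negative log likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Categorical classifier q_phi(z | s): maps a state to logits over the N_Z policies."""
    def __init__(self, state_dim: int, n_policies: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_policies)
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)  # logits of shape (batch, N_Z)

def intrinsic_reward(disc: Discriminator, states: torch.Tensor, z: int, n_policies: int) -> torch.Tensor:
    """r_z(s) = log q_phi(z | s) - log rho(z), with rho uniform over the N_Z policies."""
    log_q = F.log_softmax(disc(states), dim=-1)[:, z]
    return log_q - torch.log(torch.tensor(1.0 / n_policies))

def discriminator_loss(disc: Discriminator, states: torch.Tensor, z_labels: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood loss used to train q_phi jointly with the policies."""
    return F.cross_entropy(disc(states), z_labels)
```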
3 THE UPSIDE ALGORITHM
In this section we detail the three main components of UPSIDE, which is summarized in Sect. 3.4.
3.1 DECOUPLED POLICY STRUCTURE OF DIRECT-THEN-DIFFUSE
While the trade-off between coverage and directedness is determined by the MI objective, the
amount of stochasticity of each policy (e.g., injected via a regularization on the entropy over the
actions) also has a major impact on the effectiveness of the overall algorithm (Eysenbach et al.,
2019). In fact, while randomness can promote broader coverage, a highly stochastic policy tends
to induce a distribution pπ(z)(sT ) over final states with high entropy, thus increasing H(ST |Z) and
losing in directedness. In UPSIDE, we define policies with a decoupled structure (see Fig. 1 (A))
composed of a) a directed part (of length T) that we refer to as skill, with low stochasticity and
trained to reach a specific region of the environment and b) a diffusing part (of length H) with high
stochasticity to promote local coverage of the states around the region reached by the skill.
Coherently with this structure, the state variable in the conditional entropy in Equation 1 becomes any state reached during the diffusing part (denote by Sdiff the random variable) and not just the skill's terminal state. Following Sect. 2 we define an intrinsic reward rz(s) = log qϕ(z|s) − log ρ(z) and the skill of policy z maximizes the cumulative reward over the states traversed by the diffusing part. Formally, we can conveniently define the objective function:
$$
\max_{\pi(z)} \; \mathbb{E}_{\tau \sim \pi(z)}\Big[ \sum_{t \in J} \alpha \cdot r_z(s_t) + \beta \cdot H\big(\pi(\cdot \mid z, s_t)\big) \Big], \qquad (2)
$$
where J = {T, . . . , T + H} and α = 1, β = 0 (resp. α = 0, β = 1) when optimizing for the skill (resp. diffusing part). In words, the skill is incentivized to bring the diffusing part to a discriminable region of the state space, while the diffusing part is optimized by a simple random walk policy (i.e., a stochastic policy with uniform distribution over actions) to promote local coverage around sT.

Table 1: Instantiation of Equation 2 for each part of an UPSIDE policy, and for VIC (Gregor et al., 2016) and DIAYN (Eysenbach et al., 2019) policies.

|                | UPSIDE directed skill | UPSIDE diffusing part | VIC policy | DIAYN policy   |
| state variable | Sdiff                 | Sdiff                 | ST         | S              |
| J              | {T, . . . , T + H}    | {T, . . . , T + H}    | {T}        | {1, . . . , T} |
| (α, β)         | (1, 0)                | (0, 1)                | (1, 0)     | (1, 1)         |
Table 1 illustrates how UPSIDE’s policies compare to other methods. Unlike VIC and similar to
DIAYN, the diffusing parts of the policies tend to “push” the skills away so as to reach diverse
regions of the environment. The combination of the directedness of the skills and local coverage of
the diffusing parts thus ensures that the whole environment can be properly visited with NZ ≪|S|
policies.2 Furthermore, the diffusing part can be seen as defining a cluster of states that represents
the goal region of the directed skill. This is in contrast with DIAYN policies whose stochasticity may
be spread over the whole trajectory. This allows us to “ground” the latent variable representations
of the policies Z to specific regions of the environment (i.e., the clusters). As a result, maximizing
the MI I(Sdiff; Z) can be seen as learning a set of “cluster-conditioned” policies.
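To make Table 1 concrete, here is a minimal sketch (our own helper, not part of the paper's implementation) of how the (α, β) weights select the per-step reward of Equation 2 for the two parts of an UPSIDE policy; r_disc stands for the discriminator reward rz(st) and action_entropy for H(π(·|z, st)).

```python
def upside_step_reward(part: str, t: int, T: int, H: int,
                       r_disc: float, action_entropy: float) -> float:
    """Per-step reward of Equation 2, following Table 1.

    Both parts are only rewarded on the diffusing window J = {T, ..., T + H}:
    the directed skill uses (alpha, beta) = (1, 0), i.e. the discriminator
    reward r_z(s_t); the diffusing part uses (alpha, beta) = (0, 1), i.e. the
    action-entropy term that yields a random-walk behavior.
    """
    if not (T <= t <= T + H):
        return 0.0
    if part == "skill":      # alpha = 1, beta = 0
        return r_disc
    if part == "diffusing":  # alpha = 0, beta = 1
        return action_entropy
    raise ValueError(f"unknown part: {part}")
```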
3.2 A CONSTRAINED OPTIMIZATION PROBLEM
In this section, we focus on how to optimize the number of policies NZ and the policy sampling
distribution ρ(z). The standard practice for Equation 1 is to preset a fixed number of policies NZ and
to fix the distribution ρ to be uniform (see e.g., Eysenbach et al., 2019; Baumli et al., 2021; Strouse
et al., 2021). However, using a uniform ρ over a fixed number of policies may be highly suboptimal,
in particular when NZ is not carefully tuned. In App. A.2 we give a simple example and a theoretical
argument on how the MI can be ensured to increase by removing skills with low discriminability
when ρ is uniform. Motivated by this observation, in UPSIDE we focus on maximizing the number
of policies that are sufficiently discriminable. We fix the sampling distribution ρ to be uniform over
N policies and define the following constrained optimization problem
$$
\max_{N \ge 1} \; N \quad \text{s.t.} \quad g(N) \ge \log \eta, \qquad \text{where} \quad g(N) := \max_{\pi, \phi} \, \min_{z \in [N]} \, \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big], \qquad (P_\eta)
$$
where qϕ(z|sdiff) denotes the probability of z being the policy traversing sdiff during its diffusing
part according to the discriminator and η ∈(0, 1) defines a minimum discriminability threshold.
By optimizing for (Pη), UPSIDE automatically adapts the number of policies to promote coverage,
while still guaranteeing that each policy reaches a distinct region of the environment. Alternatively,
we can interpret (Pη) under the lens of clustering: the aim is to find the largest number of clusters
(i.e., the region reached by the directed skill and covered by its associated diffusing part) with a
sufficient level of inter-cluster distance (i.e., discriminability) (see Fig. 1). The following lemma
(proof in App. A.1) formally links the constrained problem (Pη) back to the original MI objective.
Lemma 1. There exists a value η† ∈(0, 1) such that solving (Pη†) is equivalent to maximizing a
lower bound on the mutual information objective max NZ,ρ,π,ϕ I(Sdiff; Z).
Since (Pη†) is a lower bound to the MI, optimizing it ensures that the algorithm does not deviate
too much from the dual covering and directed behavior targeted by MI maximization. Interestingly,
Lem. 1 provides a rigorous justification for using a uniform sampling distribution restricted to the
η-discriminable policies, which is in striking contrast with most of MI-based literature, where a
uniform sampling distribution ρ is defined over the predefined number of policies.
In addition, our alternative objective (Pη) has the benefit of providing a simple greedy strategy to
optimize the number of policies N. Indeed, the following lemma (proof in App. A.1) ensures that
starting with N = 1 (where g(1) = 0) and increasing it until the constraint g(N) ≥log η is violated
is guaranteed to terminate with the optimal number of policies.
Lemma 2. The function g is non-increasing in N.
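Since g is non-increasing (Lemma 2) and g(1) = 0 ≥ log η for any η < 1, the greedy strategy can be sketched as follows (a simplified sketch assuming an oracle g(N) that trains N policies together with a discriminator and returns the inner max-min value; in practice UPSIDE uses the empirical estimate of Sect. 3.4):

```python
import math
from typing import Callable

def greedy_num_policies(g: Callable[[int], float], eta: float, n_max: int) -> int:
    """Greedy solution of (P_eta): since g is non-increasing (Lemma 2) and
    g(1) = 0 >= log(eta) always holds for eta < 1, grow N from 1 and stop as
    soon as the discriminability constraint g(N) >= log(eta) would be violated."""
    n = 1
    while n + 1 <= n_max and g(n + 1) >= math.log(eta):
        n += 1
    return n
```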
3.3 COMPOSING SKILLS IN A GROWING TREE STRUCTURE
Both the original MI objective and our constrained formulation (Pη) depend on the initial state s0
and on the length of each policy. Although these quantities are usually predefined and only appear
implicitly in the equations, they have a crucial impact on the obtained behavior. In fact, resetting
after each policy execution unavoidably restricts the coverage to a radius of at most T + H steps
around s0. This may suggest to set T and H to large values. However, increasing T makes training
the skills more challenging, while increasing H may not be sufficient to improve coverage.
2Equation 1 is maximized by setting NZ = |S| (i.e., maxY I(X, Y ) = I(X, X) = H(X)), where each z
represents a goal-conditioned policy reaching a different state, which implies having as many policies as states,
thus making the learning particularly challenging.
Instead, we propose to “extend” the length of the policies through composition. We rely on the
key insight that the constraint in (Pη) guarantees that the directed skill of each η-discriminable
policy reliably reaches a specific (and distinct) region of the environment and it is thus re-usable
and amenable to composition. We thus propose to chain the skills so as to reach further and further
parts of the state space. Specifically, we build a growing tree, where the root node is a diffusing
part around s0, the edges represent the skills, and the nodes represent the diffusing parts. When a
policy z is selected, the directed skills of its predecessor policies in the tree are executed first (see
Fig. 9 in App. B for an illustration). Interestingly, this growing tree structure builds a curriculum
on the episode lengths which grow as the sequence (iT + H)i≥1, thus avoiding the need of prior
knowledge on an adequate horizon of the downstream tasks.3 Here this knowledge is replaced by
T and H which are more environment-agnostic and task-agnostic choices as they rather have an
impact on the size and shape of the learned tree (e.g., the smaller T and H the bigger the tree).
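The composition can be sketched as follows (our own illustrative Node/execute helpers, assuming a gym-like environment and that the root node has no directed skill, only a diffusing part around s0):

```python
from typing import Callable, List, Optional

class Node:
    """One UPSIDE policy: an optional directed skill (length T) plus a diffusing part (length H)."""
    def __init__(self, skill: Optional[Callable] = None,
                 diffuse: Optional[Callable] = None, parent: Optional["Node"] = None):
        self.skill, self.diffuse, self.parent = skill, diffuse, parent

def execute(node: Node, env, T: int, H: int) -> List:
    """Roll out policy z: replay the directed skills of its ancestors (root first),
    then run its own diffusing part, for a total episode length of i*T + H."""
    # Gather the root-to-leaf chain of nodes.
    chain, cur = [], node
    while cur is not None:
        chain.append(cur)
        cur = cur.parent
    chain.reverse()
    s = env.reset()
    for n in chain:                        # directed skills (the root has none)
        if n.skill is not None:
            for _ in range(T):
                s, *_ = env.step(n.skill(s))
    diffusing_states = []
    for _ in range(H):                     # local coverage around the reached region
        s, *_ = env.step(node.diffuse(s))
        diffusing_states.append(s)
    return diffusing_states                # e.g., added to the node's state buffer
```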
3.4 IMPLEMENTATION
We are now ready to introduce the UPSIDE algorithm, which provides a specific implementation of the components described before (see Fig. 1 for an illustration, Alg. 1 for a short pseudo-code and Alg. 2 in App. B for the detailed version). We first make approximations so that the constraint in (Pη) is easier to estimate. We remove the logarithm from the constraint to have an estimation range of [0, 1] and thus lower variance.4 We also replace the expectation over sdiff with an empirical estimate $\widehat{q}^{\,B}_\phi(z) = \frac{1}{|B_z|}\sum_{s \in B_z} q_\phi(z \mid s)$, where Bz denotes a small replay buffer, which we call state buffer, that contains states collected during a few rollouts by the diffusing part of πz. In our experiments, we take B = |Bz| = 10H. Integrating this in (Pη) leads to
$$
\max_{N \ge 1} \; N \quad \text{s.t.} \quad \max_{\pi, \phi} \, \min_{z \in [N]} \, \widehat{q}^{\,B}_\phi(z) \ge \eta, \qquad (3)
$$
where η is a hyper-parameter.5 From Lem. 2, this optimization problem in N can be solved using the incremental policy addition or removal in Alg. 1 (lines 5 & 9), independently from the number of initial policies N.
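For concreteness, a minimal sketch of the empirical discriminability estimate of Equation 3 (illustrative names; disc is assumed to output logits over the current set of policies, with policy z mapped to class index z, and state_buffers[z] to hold the B ≈ 10H states collected by the diffusing part of policy z):

```python
import torch
import torch.nn.functional as F

def empirical_discriminability(disc, state_buffers: dict) -> dict:
    """q_hat^B_phi(z) = (1/|B_z|) * sum_{s in B_z} q_phi(z | s) for every policy z."""
    scores = {}
    with torch.no_grad():
        for z, states in state_buffers.items():
            probs = F.softmax(disc(torch.as_tensor(states, dtype=torch.float32)), dim=-1)
            scores[z] = probs[:, z].mean().item()
    return scores

# The constraint of Equation 3 is then simply: min(scores.values()) >= eta.
```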
Algorithm 1: UPSIDE
Parameters: Discriminability threshold η ∈ (0, 1), branching factor Nstart, Nmax.
Initialize: Tree T initialized as a root node 0, policy candidates Q = {0}.
while Q ≠ ∅ do // tree expansion
 1   Dequeue a policy z ∈ Q and create N = Nstart policies C(z).
 2   POLICYLEARNING(T, C(z)).
 3   if min_{z′∈C(z)} q̂^B_ϕ(z′) > η then // Node addition
 4     while min_{z′∈C(z)} q̂^B_ϕ(z′) > η and N < Nmax do
 5       Increment N = N + 1 and add one policy to C(z).
 6       POLICYLEARNING(T, C(z)).
 7   else // Node removal
 8     while min_{z′∈C(z)} q̂^B_ϕ(z′) < η and N > 1 do
 9       Reduce N = N − 1 and remove least discriminable policy from C(z).
10       POLICYLEARNING(T, C(z)).
11   Add η-discriminable policies C(z) to Q, and to T as nodes rooted at z.
We then integrate the optimization of Equation 3 into an adaptive tree expansion strategy that incrementally composes skills (Sect. 3.3). The tree is initialized with a root node corresponding to a policy only composed of the diffusing part around s0. Then UPSIDE iteratively proceeds through the following phases: (Expansion) While policies/nodes can be expanded according to different ordering rules (e.g., a FIFO strategy), we rank them in descending order by their discriminability (i.e., q̂^B_ϕ(z)), thus favoring the expansion of policies that reach regions of the state space that are not too saturated. Given a candidate leaf z to expand from the tree, we introduce new policies by adding a set C(z) of N = Nstart nodes rooted at node z (line 2, see also steps (A) and (D) in Fig. 1). (Policy learning) The new policies are optimized in three steps (see App. B for details on the POLICYLEARNING subroutine): i) sample states from the diffusing parts of the new policies sampled uniformly from C(z) (state buffers of consolidated policies in T are kept in memory), ii) update the discriminator and compute the discriminability q̂^B_ϕ(z′) of new policies z′ ∈ C(z), iii) update the skills to optimize the reward (Sect. 3.1) computed using the discriminator (see step (B) in Fig. 1). (Node adaptation) Once the policies are trained, UPSIDE proceeds with optimizing N in a greedy fashion. If all the policies in C(z) have an (estimated) discriminability larger than η (lines 3-5), a new policy is tentatively added to C(z), the policy counter N is incremented, the policy learning step is restarted, and the algorithm keeps adding policies until the constraint is not met anymore or a maximum number of policies is attained. Conversely, if every policy in C(z) does not meet the discriminability constraint (lines 7-9), the one with lowest discriminability is removed from C(z), the policy learning step is restarted, and the algorithm keeps removing policies until all policies satisfy the constraint or no policy is left (see step (C) in Fig. 1). The resulting C(z) is added to the set of consolidated policies (line 11) and UPSIDE iteratively proceeds by selecting another node to expand until no node can be expanded (i.e., the node adaptation step terminates with N = 0 for all nodes) or a maximum number of environment iterations is met.

3 See e.g., the discussion in Mutti et al. (2021) on the "importance of properly choosing the training horizon in accordance with the downstream-task horizon the policy will eventually face."

4 While Gregor et al. (2016); Eysenbach et al. (2019) employ rewards in the log domain, we find that mapping rewards into [0, 1] works better in practice, as observed in Warde-Farley et al. (2019); Baumli et al. (2021).

5 Ideally, we would set η = η† from Lem. 1, however η† is non-trivial to compute. A strategy may be to solve (Pη′) for different values of η′ and select the one that maximizes the MI lower bound E[log qϕ(z|sdiff) − log ρ(z)]. In our experiments we rather use the same predefined parameter of η = 0.8, which avoids computational overhead and performs well across all environments.
4 RELATED WORK
URL methods can be broadly categorized depending on how the experience of the unsupervised
phase is summarized to solve downstream tasks in a zero- or few-shot manner. This includes model-
free (e.g., Pong et al., 2020), model-based (e.g., Sekar et al., 2020) and representation learning (e.g.,
Yarats et al., 2021) methods that build a representative replay buffer to learn accurate estimates or
low-dimensional representations. An alternative line of work focuses on discovering a set of skills
in an unsupervised way. Our approach falls in this category, on which we now focus this section.
Skill discovery based on MI maximization was first proposed in VIC (Gregor et al., 2016), where
the trajectories’ final states are considered in the reverse form of Equation 1 and the policy parame-
ters π(z) and sampling distribution ρ are simultaneously learned (with a fixed number of skills NZ).
DIAYN (Eysenbach et al., 2019) fixes a uniform ρ and weights skills with an action-entropy co-
efficient (i.e., it additionally minimizes the MI between actions and skills given the state) to push
the skills away from each other. DADS (Sharma et al., 2020) learns skills that are not only diverse
but also predictable by learned dynamics models, using a generative model over observations and
optimizing a forward form of MI I(s′; z|s) between the next state s′ and current skill z (with con-
tinuous latent) conditioned on the current state s. EDL (Campos et al., 2020) shows that existing
skill discovery approaches can provide insufficient coverage and relies on a fixed distribution over
states that is either provided by an oracle or learned. SMM (Lee et al., 2019) uses the MI formalism
to learn a policy whose state marginal distribution matches a target state distribution (e.g., uniform).
Other MI-based skill discovery methods include Florensa et al. (2017); Hansen et al. (2019); Modhe
et al. (2020); Baumli et al. (2021); Xie et al. (2021); Liu & Abbeel (2021); Strouse et al. (2021), and
extensions in non-episodic settings (Xu et al., 2020; Lu et al., 2020).
While most skill discovery approaches consider a fixed number of policies, a curriculum with in-
creasing NZ is studied in Achiam et al. (2018); Aubret et al. (2020). We consider a similar dis-
criminability criterion in the constraint in (Pη) and show that it enables maintaining skills that can
be readily composed along a tree structure, which can either increase or decrease the support of
available skills depending on the region of the state space. Recently, Zhang et al. (2021) propose a
hierarchical RL method that discovers abstract skills while jointly learning a higher-level policy to
maximize extrinsic reward. Our approach builds on a similar promise of composing skills instead
of resetting to s0 after each execution, yet we articulate the composition differently, by exploiting
the direct-then-diffuse structure to ground skills to the state space instead of being abstract. Har-
tikainen et al. (2020) perform unsupervised skill discovery by fitting a distance function; while their
approach also includes a directed part and a diffusive part for exploration, it learns only a single
directed policy and does not aim to cover the entire state space. Approaches such as DISCERN
(Warde-Farley et al., 2019) and Skew-Fit (Pong et al., 2020) learn a goal-conditioned policy in
an unsupervised way with an MI objective. As explained by Campos et al. (2020), this can be in-
terpreted as a skill discovery approach with latent Z = S, i.e., where each goal state can define
a different skill. Conditioning on either goal states or abstract latent skills forms two extremes of
the spectrum of unsupervised RL. As argued in Sect. 3.1, we target an intermediate approach of
learning “cluster-conditioned” policies. Finally, an alternative approach to skill discovery builds on
“spectral” properties of the dynamics of the MDP. This includes eigenoptions (Machado et al., 2017;
2018) and covering options (Jinnai et al., 2019; 2020), and the algorithm of Bagaria et al. (2021)
that builds a discrete graph representation which learns and composes spectral skills.
Table 2: Coverage on Bottleneck Maze and U-Maze: UPSIDE covers significantly more regions of the discretized state space than the other methods. The values represent the number of buckets that are reached, where the 50 × 50 space is discretized into 10 buckets per axis. To compare the global coverage of methods (and to be fair w.r.t. the amount of injected noise that may vary across methods), we roll-out for each model its associated deterministic policies.

|                | Bottleneck Maze | U-Maze        |
| RANDOM         | 29.17 (±0.57)   | 23.33 (±0.57) |
| DIAYN-10       | 17.67 (±0.57)   | 14.67 (±0.42) |
| DIAYN-20       | 23.00 (±1.09)   | 16.67 (±1.10) |
| DIAYN-50       | 30.00 (±0.72)   | 25.33 (±1.03) |
| DIAYN-curr     | 18.00 (±0.82)   | 15.67 (±0.87) |
| DIAYN-hier     | 38.33 (±0.68)   | 49.67 (±0.57) |
| EDL-10         | 27.00 (±1.41)   | 32.00 (±1.19) |
| EDL-20         | 31.00 (±0.47)   | 46.00 (±0.82) |
| EDL-50         | 33.33 (±0.42)   | 52.33 (±1.23) |
| SMM-10         | 19.00 (±0.47)   | 14.00 (±0.54) |
| SMM-20         | 23.67 (±1.29)   | 14.00 (±0.27) |
| SMM-50         | 28.00 (±0.82)   | 25.00 (±1.52) |
| Flat UPSIDE-10 | 40.67 (±1.50)   | 43.33 (±2.57) |
| Flat UPSIDE-20 | 47.67 (±0.31)   | 55.67 (±1.03) |
| Flat UPSIDE-50 | 51.33 (±1.64)   | 57.33 (±0.31) |
| UPSIDE         | 85.67 (±1.93)   | 71.33 (±0.42) |
Figure 2: Policies learned on the Bottleneck Maze by UPSIDE, DIAYN-50 and EDL-50 (see Fig. 14 in App. C for the other methods): contrary to the baselines, UPSIDE successfully escapes the bottleneck region.
5 EXPERIMENTS
Our experiments investigate the following questions: i) Can UPSIDE incrementally cover an un-
known environment while preserving the directedness of its skills? ii) Following the unsupervised
phase, how can UPSIDE be leveraged to solve sparse-reward goal-reaching downstream tasks?
iii) What is the impact of the different components of UPSIDE on its performance?
We report results on navigation problems in continuous 2D mazes6 and on continuous control prob-
lems (Brockman et al., 2016; Todorov et al., 2012): Ant, Half-Cheetah and Walker2d. We evaluate
performance with the following tasks: 1) “coverage” which evaluates the extent to which the state
space has been covered during the unsupervised phase, and 2) “unknown goal-reaching” whose ob-
jective is to find and reliably reach an unknown goal location through fine-tuning of the policy. We
perform our experiments based on the SaLinA framework (Denoyer et al., 2021).
We compare UPSIDE to different baselines. First we consider DIAYN-NZ (Eysenbach et al., 2019),
where NZ denotes a fixed number of skills. We introduce two new baselines derived from DIAYN:
a) DIAYN-curr is a curriculum variant where the number of skills is automatically tuned following
the same procedure as in UPSIDE, similar to Achiam et al. (2018), to ensure sufficient discriminabil-
ity, and b) DIAYN-hier is a hierarchical extension of DIAYN where the skills are composed in a
tree as in UPSIDE but without the diffusing part. We also compare to SMM (Lee et al., 2019), which
is similar to DIAYN but includes an exploration bonus encouraging the policies to visit rarely en-
countered states. In addition, we consider EDL (Campos et al., 2020) with the assumption of the
available state distribution oracle (since replacing it by SMM does not lead to satisfying results in
presence of bottleneck states as shown in Campos et al., 2020). Finally, we consider the RANDOM policy, which samples actions uniformly in the action space.

6 The agent observes its current position and its actions (in [−1, +1]) control its shift in x and y coordinates. We consider two topologies of mazes illustrated in Fig. 2 with size 50 × 50 such that exploration is non-trivial. The Bottleneck maze is a harder version of the one in Campos et al. (2020, Fig. 1) whose size is only 10 × 10.
Figure 3: Coverage on control environments, (a) Ant, (b) Half-Cheetah, (c) Walker2d (each panel plots coverage against environment interactions; legend: UPSIDE, DIAYN-5, DIAYN-10, DIAYN-20, RANDOM): UPSIDE covers the state space significantly more than DIAYN and RANDOM. The curve represents the number of buckets reached by the policies extracted from the unsupervised phase of UPSIDE and DIAYN as a function of the number of environment interactions. DIAYN and UPSIDE have the same amount of injected noise. Each axis is discretized into 50 buckets.
Figure 4: (a) & (b) Unsupervised phase on Ant: visualization of the policies learned by UPSIDE (a) and DIAYN-20 (b). We display only the final skill and the diffusing part of the UPSIDE policies. (c) Downstream tasks on Ant (average success rate against environment interactions for UPSIDE, TD3 and DIAYN): we plot the average success rate over 48 unknown goals (with sparse reward) that are sampled uniformly in the [−8, 8]² square (using stochastic roll-outs) during the fine-tuning phase. UPSIDE achieves higher success rate than DIAYN-20 and TD3.
We use TD3 as the policy optimizer (Fujimoto et al., 2018), though we also tried SAC (Haarnoja et al., 2018), which showed results equivalent to TD3 but with harder tuning. Similar to e.g., Eysenbach et al. (2019); Bagaria & Konidaris (2020), we restrict the observation space of the discriminator to the cartesian coordinates (x, y) for Ant and x for Half-Cheetah and Walker2d. All algorithms were run for Tmax = 1e7 unsupervised environment interactions in episodes of size Hmax = 200 (resp. 250) for mazes (resp. for control). For
baselines, models are selected according to the cumulated intrinsic reward (as done in e.g., Strouse
et al., 2021), while UPSIDE, DIAYN-hier and DIAYN-curr are selected according to the high-
est number of η-discriminable policies. On the downstream tasks, we consider ICM (Pathak et al.,
2017) as an additional baseline. See App. C for the full experimental details.
Coverage. We analyze the coverage achieved by the various methods following an unsupervised
phase of at most Tmax = 1e7 environment interactions. For UPSIDE, we report coverage for the
skill and diffusing part lengths T = H = 10 in the continuous mazes (see App. D.4 for an ablation
on the values of T, H) and T = H = 50 in control environments. Fig. 2 shows that UPSIDE man-
ages to cover the near-entirety of the state space of the bottleneck maze (including the top-left room)
by creating a tree of directed skills, while the other methods struggle to escape from the bottleneck
region. This translates quantitatively in the coverage measure of Table 2 where UPSIDE achieves
the best results. As shown in Fig. 3 and 4, UPSIDE clearly outperforms DIAYN and RANDOM in
state-coverage of control environments, for the same number of environment interactions. In the
Ant domain, traces from DIAYN (Fig. 4b) and discriminator curves in App. D.3 demonstrate that
even though DIAYN successfully fits 20 policies by learning to take a few steps then hover, it fails
to explore the environment. In Half-Cheetah and Walker2d, while DIAYN policies learn to fall on
the agent’s back, UPSIDE learns to move forward/backward on its back through skill composition.
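For reference, a minimal sketch of the bucket-based coverage metric described in the captions of Table 2 and Fig. 3 (our own helper; the binning bounds are assumptions, and the visited states are those reached by the roll-outs of all learned policies):

```python
import numpy as np

def bucket_coverage(visited: np.ndarray, low: float, high: float, buckets_per_axis: int) -> int:
    """Number of distinct buckets visited when each axis of the (restricted)
    observation space is discretized into `buckets_per_axis` bins, as in
    Table 2 (10 bins per axis) and Fig. 3 (50 bins per axis).
    `visited` has shape (num_states, dim)."""
    edges = np.linspace(low, high, buckets_per_axis + 1)
    idx = np.clip(np.digitize(visited, edges) - 1, 0, buckets_per_axis - 1)
    return len({tuple(row) for row in idx})

# e.g., bucket_coverage(states_xy, low=0.0, high=50.0, buckets_per_axis=10) for the 50x50 mazes
```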
Unknown goal-reaching tasks. We investigate how the tree of policies learned by UPSIDE in the unsupervised phase can be used to tackle goal-reaching downstream tasks. All unsupervised methods follow the same protocol: given an unknown7 goal g, i) we sample rollouts over the different learned policies, ii) then we select the best policy based on the maximum discounted cumulative reward collected, and iii) we fine-tune this policy (i.e., its sequence of directed skills and its final diffusing part) to maximize the sparse reward r(s) = I[∥s − g∥2 ≤ 1]. Fig. 5 reports the discounted cumulative reward on various goals after the fine-tuning phase. We see that UPSIDE accumulates more reward than the other methods, in particular in regions far from s0, where performing fine-tuning over the entire skill path is especially challenging. In Fig. 6 we see that UPSIDE's fine-tuning can slightly deviate from the original tree structure to improve the goal-reaching behavior of its candidate policy. We also perform fine-tuning on the Ant domain under the same setting. In Fig. 4c, we show that UPSIDE clearly outperforms DIAYN-20 and TD3 when we evaluate the average success rate of reaching 48 goals sampled uniformly in [−8, 8]². Note that DIAYN particularly fails as its policies learned during the unsupervised phase all stay close to the origin s0.

7 Notice that if the goal was known, the learned discriminator could be directly used to identify the most promising skill to fine-tune.

Figure 5: Downstream task performance on Bottleneck Maze of (a) UPSIDE (before and after fine-tuning), (b) DIAYN, (c) EDL and (d) ICM: UPSIDE achieves higher discounted cumulative reward on various unknown goals (see Fig. 15 in App. C for SMM and TD3 performance). From each of the 16 discretized regions, we randomly sample 3 unknown goals. For every method and goal seed, we roll-out each policy (learned in the unsupervised phase) during 10 episodes and select the one with largest cumulative reward to fine-tune (with sparse reward r(s) = I[∥s − g∥2 ≤ 1]). Formally, for a given goal g the reported value is γ^τ I[τ ≤ Hmax] with τ := inf{t ≥ 1 : ∥st − g∥2 ≤ 1}, γ = 0.99 and horizon Hmax = 200.

Figure 6: For an unknown goal location, UPSIDE identifies a promising policy in its tree and fine-tunes it.
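Complementing the protocol above (steps i–iii), a minimal sketch of the policy-selection step and of the sparse goal-reaching reward (illustrative names; rollout_fn is assumed to return the list of states visited during one roll-out of a policy):

```python
import numpy as np

def sparse_reward(state, goal) -> float:
    """r(s) = I[||s - g||_2 <= 1], the sparse reward used for fine-tuning."""
    return float(np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= 1.0)

def select_policy_to_finetune(policies, rollout_fn, goal, n_episodes: int = 10, gamma: float = 0.99):
    """Step ii): roll out every policy learned in the unsupervised phase for a few
    episodes and keep the one with the largest discounted cumulative sparse reward;
    that policy (its skill chain plus diffusing part) is then fine-tuned (step iii)."""
    best_policy, best_return = None, -np.inf
    for policy in policies:
        total = 0.0
        for _ in range(n_episodes):
            states = rollout_fn(policy)  # visited states of one roll-out
            total += sum(gamma ** t * sparse_reward(s, goal) for t, s in enumerate(states))
        if total > best_return:
            best_policy, best_return = policy, total
    return best_policy
```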
Ablative study of the UPSIDE components. The main components of UPSIDE that differ from ex-
isting skill discovery approaches such as DIAYN are: the decoupled policy structure, the constrained
optimization problem and the skill chaining via the growing tree. We perform ablations to show that
all components are simultaneously required for good performance. First, we compare UPSIDE to
flat UPSIDE, i.e., UPSIDE with the tree depth of 1 (T = 150, H = 50). Table 2 reveals that the
tree structuring is key to improve exploration and escape bottlenecks; it makes the agent learn on
smaller and easier problems (i.e., short-horizon MDPs) and mitigates the optimization issues (e.g.,
non-stationary rewards). However, the diffusing part of flat UPSIDE largely improves the coverage
performance over the DIAYN baseline, suggesting that the diffusing part is an interesting structural
bias on the entropy regularization that pushes the policies away from each other. This is particularly
useful on the Ant environment as shown in Fig. 4. A challenging aspect is to make the skill composition work. As shown in Table 2, DIAYN-hier (a hierarchical version of DIAYN) does not perform as well as UPSIDE, by a clear margin.
both policy re-usability for the chaining (via the directed skills) and local coverage (via the diffusing
part). Moreover, as shown by the results of DIAYN-hier on the bottleneck maze, the constrained
optimization problem (Pη) combined with the diffusing part is crucial to prevent consolidating too
many policies, thus allowing a sample efficient growth of the tree structure.
6 CONCLUSION AND LIMITATIONS
We introduced UPSIDE, a novel algorithm for unsupervised skill discovery designed to trade off
between coverage and directedness and develop a tree of skills that can be used to perform ef-
ficient exploration and solve sparse-reward goal-reaching downstream tasks. Limitations of our approach that constitute natural avenues for future investigation are: 1) The diffusing part of each policy could be explicitly trained to maximize local coverage around the skill's terminal state; 2) UPSIDE assumes a good state representation is provided as input to the discriminator; it would be interesting to pair UPSIDE with effective representation learning techniques to tackle problems with
high-dimensional input; 3) As UPSIDE relies on the ability to reset to establish a root node for its
growing tree, it could be relevant to extend the approach in non-episodic environments.
Acknowledgements
We thank both Evrard Garcelon and Jonas Gehring for helpful discussion.
REFERENCES
Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery
algorithms. arXiv preprint arXiv:1807.10299, 2018.
Arthur Aubret, Laëtitia Matignon, and Salima Hassas. Elsim: End-to-end learning of reusable skills
through intrinsic motivation. In European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases (ECML-PKDD), 2020.
Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International
Conference on Learning Representations, 2020.
Akhil Bagaria, Jason K Senthil, and George Konidaris. Skill discovery for exploration and planning
using deep skill graphs. In International Conference on Machine Learning, pp. 521–531. PMLR,
2021.
David Barber and Felix Agakov. The im algorithm: a variational approach to information maxi-
mization. Advances in neural information processing systems, 16(320):201, 2004.
Kate Baumli, David Warde-Farley, Steven Hansen, and Volodymyr Mnih. Relative variational in-
trinsic control. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.
6732–6740, 2021.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environ-
ment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:
253–279, 2013.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and
Wojciech Zaremba. Openai gym, 2016.
Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i Nieto, and Jordi
Torres. Explore, discover and learn: Unsupervised discovery of state-covering skills. In Interna-
tional Conference on Machine Learning, 2020.
Ludovic Denoyer, Alfredo de la Fuente, Song Duong, Jean-Baptiste Gaya, Pierre-Alexandre
Kamienny, and Daniel H Thompson.
Salina: Sequential learning of agents.
arXiv preprint
arXiv:2110.07910, 2021.
Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need:
Learning skills without a reward function. In International Conference on Learning Representa-
tions, 2019.
Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical rein-
forcement learning. arXiv preprint arXiv:1704.03012, 2017.
Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-
critic methods. In International Conference on Machine Learning, pp. 1587–1596. PMLR, 2018.
Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv
preprint arXiv:1611.07507, 2016.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor. In International Confer-
ence on Machine Learning, pp. 1861–1870. PMLR, 2018.
Steven Hansen, Will Dabney, Andre Barreto, David Warde-Farley, Tom Van de Wiele, and
Volodymyr Mnih. Fast task inference with variational intrinsic successor features. In Interna-
tional Conference on Learning Representations, 2019.
Kristian Hartikainen, Xinyang Geng, Tuomas Haarnoja, and Sergey Levine. Dynamical distance
learning for semi-supervised and unsupervised skill discovery. In International Conference on
Learning Representations, 2020.
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime:
variational information maximizing exploration. In Proceedings of the 30th International Con-
ference on Neural Information Processing Systems, pp. 1117–1125, 2016.
Yuu Jinnai, Jee Won Park, David Abel, and George Konidaris. Discovering options for exploration
by minimizing cover time. In International Conference on Machine Learning, pp. 3130–3139.
PMLR, 2019.
Yuu Jinnai, Jee Won Park, Marlos C Machado, and George Konidaris. Exploration in reinforcement
learning with deep covering options. In International Conference on Learning Representations,
2020.
Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, and Ruslan Salakhutdi-
nov. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019.
Hao Liu and Pieter Abbeel.
Aps: Active pretraining with successor features.
In International
Conference on Machine Learning, pp. 6736–6747. PMLR, 2021.
Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-
space planning. arXiv preprint arXiv:2012.03548, 2020.
Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for op-
tion discovery in reinforcement learning. In International Conference on Machine Learning, pp.
2295–2304. PMLR, 2017.
Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray
Campbell. Eigenoption discovery through the deep successor representation. In International
Conference on Learning Representations, 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra,
and Ramakrishna Vedantam. Ir-vic: Unsupervised discovery of sub-goals for transfer in rl. In
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-
20, 2020.
Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsi-
cally motivated reinforcement learning. In Advances in neural information processing systems,
pp. 2125–2133, 2015.
Mirco Mutti, Lorenzo Pratissoli, and Marcello Restelli. Task-agnostic exploration via policy gra-
dient of a non-parametric state entropy estimate. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 35, pp. 9028–9036, 2021.
Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration
by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, pp. 16–17, 2017.
Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-
fit: State-covering self-supervised reinforcement learning. In International Conference on Ma-
chine Learning, 2020.
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak.
Planning to explore via self-supervised world models. In International Conference on Machine
Learning, pp. 8583–8592. PMLR, 2020.
Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware
unsupervised discovery of skills. In International Conference on Learning Representations, 2020.
DJ Strouse, Kate Baumli, David Warde-Farley, Vlad Mnih, and Steven Hansen. Learning more
skills through optimistic exploration. arXiv preprint arXiv:2107.14226, 2021.
11
Published as a conference paper at ICLR 2022
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control.
In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
IEEE, 2012.
David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and
Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In In-
ternational Conference on Learning Representations, 2019.
Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, and Florian Shkurti. Skill transfer
via partially amortized hierarchical planning. In International Conference on Learning Represen-
tations, 2021.
Kelvin Xu, Siddharth Verma, Chelsea Finn, and Sergey Levine. Continual learning of control prim-
itives: Skill discovery via reset-games. Advances in Neural Information Processing Systems, 33,
2020.
Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Reinforcement learning with proto-
typical representations. In Proceedings of the 38th International Conference on Machine Learn-
ing, pp. 11920–11931. PMLR, 2021.
Jesse Zhang, Haonan Yu, and Wei Xu. Hierarchical reinforcement learning by discovering intrinsic
options. In International Conference on Learning Representations, 2021.
Appendix
A Theoretical Details on Section 3
B UPSIDE Algorithm
C Experimental Details
D Additional Experiments
A THEORETICAL DETAILS ON SECTION 3

A.1 PROOFS OF LEMMAS 1 AND 2
Restatement of Lemma 1. There exists a value η† ∈(0, 1) such that solving (Pη†) is equivalent to
maximizing a lower bound on the mutual information objective max NZ,ρ,π,ϕ I(Sdiff; Z).
Proof. We assume that the number of available skills is upper bounded, i.e., 1 ≤ NZ ≤ Nmax. We begin by lower bounding the negative conditional entropy by using the well known lower bound of Barber & Agakov (2004) on the mutual information
$$
-H(Z \mid S_{\mathrm{diff}}) = \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_{\mathrm{diff}}}\big[\log p(z \mid s_{\mathrm{diff}})\big] \ge \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big].
$$
We now use that any weighted average is lower bounded by its minimum component, which yields
$$
-H(Z \mid S_{\mathrm{diff}}) \ge \min_{z \in Z} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big].
$$
Thus the following objective is a lower bound on the original objective of maximizing I(Sdiff; Z)
$$
\max_{N_Z = N,\, \rho,\, \pi,\, \phi} \Big\{ H(Z) + \min_{z \in [N]} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big] \Big\}. \qquad (4)
$$
Interestingly, the second term in Equation 4 no longer depends on the skill distribution ρ, while the first entropy term H(Z) is maximized by setting ρ to the uniform distribution over N skills (i.e., maxρ H(Z) = log(N)). This enables to simplify the optimization which now only depends on N. Thus Equation 4 is equivalent to
$$
\max_{N_Z = N} \Big\{ \log(N) + \max_{\pi, \phi} \min_{z \in [N]} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big] \Big\}. \qquad (5)
$$
We define the functions
$$
f(N) := \log(N), \qquad g(N) := \max_{\pi, \phi} \min_{z \in [N]} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big].
$$
Let N† ∈ arg maxN f(N) + g(N) and η† := exp g(N†) ∈ (0, 1). We now show that any solution of (Pη†) is a solution of Equation 5. Indeed, denote by N⋆ the value that optimizes (Pη†). First, by validity of the constraint, it holds that g(N⋆) ≥ log η† = g(N†). Second, since N† meets the constraint and N⋆ is the maximal number of skills that satisfies the constraint of (Pη†), by optimality we have that N⋆ ≥ N† and therefore f(N⋆) ≥ f(N†) since f is non-decreasing. We thus have
$$
g(N^\star) \ge g(N^\dagger) \;\; \text{and} \;\; f(N^\star) \ge f(N^\dagger) \;\; \Longrightarrow \;\; f(N^\star) + g(N^\star) \ge f(N^\dagger) + g(N^\dagger).
$$
Putting everything together, an optimal solution for (Pη†) is an optimal solution for Equation 5, which is equivalent to Equation 4, which is a lower bound of the MI objective, thus concluding the proof.
Restatement of Lemma 2. The function g is non-increasing in N.
Proof. We have that g(N) := maxπ,q minz∈[N] Es∼π(z)[log q(z|s)], where throughout the proof we write s instead of sdiff for ease of notation. Here the optimization variables are π ∈ (Π)^N (i.e., a set of N policies) and q : S → ∆(N), i.e., a classifier of states to N possible classes, where ∆(N) denotes the N-simplex. For z ∈ [N], let
$$
h_N(\pi, q, z) := \mathbb{E}_{s \sim \pi(z)}\big[\log q(z \mid s)\big], \qquad f_N(\pi, q) := \min_{z \in [N]} h_N(\pi, q, z).
$$
Let (π′, q′) ∈ arg maxπ,q f_{N+1}(π, q). We define π̃ := (π′(1), . . . , π′(N)) ∈ (Π)^N and q̃ : S → ∆(N) such that q̃(i|s) := q′(i|s) for any i ∈ [N − 1] and q̃(N|s) := q′(N|s) + q′(N + 1|s). Intuitively, we are "discarding" policy N + 1 and "merging" class N + 1 with class N. Then it holds that
$$
g(N+1) = f_{N+1}(\pi', q') = \min_{z \in [N+1]} h_{N+1}(\pi', q', z) \le \min_{z \in [N]} h_{N+1}(\pi', q', z).
$$
Now, by construction of π̃, q̃, we have for any i ∈ [N − 1] that h_{N+1}(π′, q′, i) = h_N(π̃, q̃, i). As for class N, since π̃(N) = π′(N), by definition of q̃(N|·) and from monotonicity of the log function, it holds that h_{N+1}(π′, q′, N) = E_{s∼π′(N)}[log q′(N|s)] satisfies
$$
h_{N+1}(\pi', q', N) \le \mathbb{E}_{s \sim \tilde\pi(N)}\big[\log \tilde q(N \mid s)\big] = h_N(\tilde\pi, \tilde q, N).
$$
Hence, we get that
$$
\min_{z \in [N]} h_{N+1}(\pi', q', z) \le \min_{z \in [N]} h_N(\tilde\pi, \tilde q, z) = f_N(\tilde\pi, \tilde q) \le g(N).
$$
Putting everything together gives g(N + 1) ≤ g(N), which yields the desired result.
A.2 SIMPLE ILLUSTRATION OF THE ISSUE WITH UNIFORM-ρ MI MAXIMIZATION

Figure 7: The agent must assign (possibly stochastically) N skills to M states: under the prior of a uniform skill distribution, can the MI be increased by varying the number of skills N?
This section complements Sect. 3.2: we show a simple scenario where 1) considering both a uniform ρ prior and a fixed skill number NZ provably leads to suboptimal MI maximization, and where 2) the UPSIDE strategy of considering a uniform ρ restricted to the η-discriminable skills can provably increase the MI for small enough η.

Consider the simple scenario (illustrated on Fig. 7) where the agent has N skills indexed by n and must assign them to M states indexed by m. We assume that the execution of each skill deterministically brings it to the assigned state, yet the agent may assign stochastically (i.e., more than one state per skill). (A non-RL way to interpret this is that we want to allocate N balls into M boxes.) Denote by pn,m ∈ [0, 1] the probability that skill n is assigned to state m. It must hold that ∀n ∈ [N], Σm pn,m = 1. Denote by I the MI between the skill variable and the assigned state variable, and by Ī the MI under the prior that the skill sampling distribution ρ is uniform, i.e., ρ(n) = 1/N. It holds that
$$
\bar I(N, M) = -\sum_{n} \frac{1}{N} \log \frac{1}{N} + \sum_{n,m} \frac{1}{N} p_{n,m} \log \frac{\frac{1}{N} p_{n,m}}{\sum_{n} \frac{1}{N} p_{n,m}} = \log N + \frac{1}{N} \sum_{n,m} p_{n,m} \log \frac{p_{n,m}}{\sum_{n} p_{n,m}}.
$$
Let Ī⋆(N, M) := max_{pn,m} Ī(N, M) and {p⋆n,m} ∈ arg max_{pn,m} Ī(N, M). We also define the discriminability of skill n in state m as
$$
q_{n,m} := \frac{p_{n,m}}{\sum_{n} p_{n,m}},
$$
as well as the minimum discriminability of the optimal assignment as
$$
\eta := \min_{n} \max_{m} q^\star_{n,m}.
$$
Lemma 3. There exist values of N and M such that the uniform-ρ MI can be improved by removing a skill (i.e., by decreasing N).
Proof. The following example shows that with M = 2 states, it is beneficial for the uniform-ρ MI maximization to go from N = 3 to N = 2 skills. Indeed, we can numerically compute the optimal solutions and we obtain for N = 3 and M = 2 that
$$
\bar I^\star(N=3, M=2) \approx 0.918, \qquad p^\star_{n,m} = \begin{pmatrix} 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad q^\star_{n,m} = \begin{pmatrix} 0 & 0.5 \\ 0 & 0.5 \\ 1 & 0 \end{pmatrix}, \qquad \eta = 0.5,
$$
whereas for N = 2 and M = 2,
$$
\bar I^\star(N=2, M=2) = 1, \qquad p^\star_{n,m} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad q^\star_{n,m} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \eta = 1.
$$
As a result, Ī⋆(N = 2, M = 2) > Ī⋆(N = 3, M = 2), which concludes the proof. The intuition why Ī⋆ is increased by decreasing N is that for N = 2 there is one skill per state whereas for N = 3 the skills must necessarily overlap. Note that this contrasts with the original MI (that also optimizes ρ) where decreasing N cannot improve the optimal MI.
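The two optimal assignments above can be checked numerically with the following sketch (our own helper; logarithms are in base 2, which matches the reported value of ≈ 0.918):

```python
import numpy as np

def uniform_rho_mi(p: np.ndarray) -> float:
    """Uniform-rho MI (in bits) of an assignment matrix p, with p[n, m] = Pr(skill n -> state m):
    log2(N) + (1/N) * sum_{n,m} p[n,m] * log2(p[n,m] / sum_n p[n,m])."""
    n_skills = p.shape[0]
    marginal = p.sum(axis=0)  # sum over skills for each state m
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / marginal), 0.0)
    return np.log2(n_skills) + terms.sum() / n_skills

print(uniform_rho_mi(np.array([[0., 1.], [0., 1.], [1., 0.]])))  # ~0.918 bits for N = 3
print(uniform_rho_mi(np.array([[0., 1.], [1., 0.]])))            # 1.0 bit for N = 2
```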
The previous simple example hints at the fact that the value of the minimum discriminability of the optimal assignment η may be a good indicator to determine whether to remove a skill. The following more general lemma indeed shows that a sufficient condition for the uniform-ρ MI to be increased by removing a skill is that η is small enough.

Lemma 4. Assume without loss of generality that the skill indexed by N has the minimum discriminability η, i.e., N ∈ arg minn maxm q⋆n,m. Define
$$
\Delta(N, \eta) := \log N - \frac{N-1}{N} \log(N-1) + \frac{1}{N} \log \eta.
$$
If ∆(N, η) ≤ 0 — which holds for small enough η — then removing skill N results in a larger uniform-ρ optimal MI, i.e., Ī⋆(N, M) < Ī⋆(N − 1, M).
Proof. It holds that
$$
\begin{aligned}
\bar I^\star(N, M) &= \log N + \frac{1}{N} \Bigg( \sum_{n \in [N-1]} \sum_{m \in [M]} p^\star_{n,m} \log q^\star_{n,m} + \sum_{m \in [M]} p^\star_{N,m} \log \eta \Bigg) \\
&= \log N - \frac{N-1}{N} \log(N-1) \\
&\quad + \frac{N-1}{N} \Bigg( \log(N-1) + \frac{1}{N-1} \sum_{n \in [N-1]} \sum_{m \in [M]} p^\star_{n,m} \log q^\star_{n,m} \Bigg) + \frac{1}{N} \log \eta \\
&= \Delta(N, \eta) + \frac{N-1}{N} \, \bar I^\star(N-1, M).
\end{aligned}
$$
As a result, if ∆(N, η) ≤ 0 then Ī⋆(N, M) < Ī⋆(N − 1, M).
B UPSIDE ALGORITHM

B.1 VISUAL ILLUSTRATIONS OF UPSIDE'S DESIGN MENTIONED IN SECTION 3
Figure 8: Decoupled structure of an UPSIDE policy: a directed skill followed by a diffusing part.

Figure 9: In the above UPSIDE tree example, executing policy z = 7 means sequentially composing the skills of policies z ∈ {2, 5, 7} and then deploying the diffusing part of policy z = 7.
B.2 HIGH-LEVEL APPROACH OF UPSIDE

Figure 10: High-level approach of UPSIDE.

B.3 DETAILS OF ALGORITHM 1
We give in Alg. 2 a more detailed version of Alg. 1 and we list some additional explanations below.
• When optimizing the discriminator, rather than sampling (state, policy) pairs with equal probability for all nodes from the tree T, we put more weight (e.g. 3×) on already consolidated policies, which seeks to prevent the new policies from invading the territory of the older policies that were previously correctly learned (see the sketch after this list).
• A replay buffer BRL is instantiated at every new expansion (line 2), thus avoiding the need to start collecting data from scratch with the new policies at every POLICYLEARNING call.
• J (line 22) corresponds to the ratio of policy updates to discriminator updates, i.e. for how long the discriminator reward is kept fixed, in order to add stationarity to the reward signal.
• Instead of using a number of iterations K to stop the training loop of the POLICYLEARNING function (line 16), we use a maximum number of environment interactions Ksteps for node expansion. Note that this is the same for DIAYN-hier and DIAYN-curr.
Algorithm 2: Detailed UPSIDE
Parameters: discriminability threshold η ∈ (0, 1), branching factors Nstart, Nmax
Initialize: tree T initialized as a root node indexed by 0, policy candidates Q = {0}, state buffers BZ = {0 : [ ]}
    while Q ≠ ∅ do   // tree expansion
1:      Dequeue a policy z ∈ Q, create N = Nstart nodes C(z) rooted at z, and add the new key z to BZ
2:      Instantiate a new replay buffer BRL
3:      POLICYLEARNING(BRL, BZ, T, C(z))
4:      if min_{z′∈C(z)} d(z′) > η then   // Node addition
5:          while min_{z′∈C(z)} d(z′) > η and N < Nmax do
6:              Increment N = N + 1 and add one policy to C(z)
7:              POLICYLEARNING(BRL, BZ, T, C(z))
8:      else   // Node removal
9:          while min_{z′∈C(z)} d(z′) < η and N > 1 do
10:             Reduce N = N − 1 and remove the least discriminable policy from C(z)
11:             POLICYLEARNING(BRL, BZ, T, C(z))
12:     Enqueue in Q the η-discriminable nodes C(z)

13: POLICYLEARNING(replay buffer BRL, state buffers BZ, tree T, policies to update ZU)
14: Optimization parameters: patience K, policy-to-discriminator update ratio J, Kdiscr discriminator update epochs, Kpol policy update epochs
15: Initialize: discriminator qϕ with |T| classes
16: for K iterations do   // Training loop
17:     For all z′ ∈ ZU, clear BZ[z′], then collect and add B states from the diffusing part of π(z′) to it
18:     Train the discriminator qϕ for Kdiscr steps on the dataset ∪_{z′∈T} BZ[z′]
19:     Compute the discriminability d(z′) = q̂_ϕ^B(z′) = (1/|BZ[z′]|) Σ_{s∈BZ[z′]} qϕ(z′|s) for all z′ ∈ ZU
20:     if min_{z′∈ZU} d(z′) > η then   // Early stopping
21:         Break
22:     for J iterations do
23:         For all z′ ∈ ZU, sample a trajectory from π(z′) and add it to the replay buffer BRL
24:         For all z′ ∈ ZU, update the policy π_{z′} for Kpol steps on the replay buffer BRL to optimize the discriminator reward as in Sect. 3.1, keeping the skills of parent policies fixed
25: Compute the discriminability d(z′) for all z′ ∈ ZU
• The state buffer size B needs to be sufficiently large compared to H so that the state buffer of each policy represents well the distribution of the states generated by the policy's diffusing part. In practice we set B = 10H.
• In POLICYLEARNING, we add Kinitial random (uniform) transitions to the replay buffer for each newly instantiated policy.
• Moreover, in POLICYLEARNING, instead of sampling the new policies uniformly, we sample them in a round-robin fashion (i.e., one after the other), which can simply be seen as a variance-reduced version of uniform sampling.
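To make the quantities driving Alg. 2 concrete, the sketch below shows the empirical discriminability d(z) of line 19 and the η-based addition/removal test of lines 4-11. It is a simplified illustration with a hypothetical discriminator interface, not the exact implementation.

```python
import torch

def discriminability(discriminator, states, z):
    # d(z) = (1 / |B_z|) * sum_{s in B_z} q_phi(z | s), with q_phi a softmax over the tree's nodes
    with torch.no_grad():
        probs = torch.softmax(discriminator(states), dim=-1)  # shape: (|B_z|, number of nodes)
    return probs[:, z].mean().item()

def expansion_step(d_values, N, N_max, eta):
    # One step of the node addition / removal logic applied to the current candidates C(z)
    if min(d_values) > eta and N < N_max:
        return "add_policy"       # all candidates are eta-discriminable: try to fit one more
    if min(d_values) < eta and N > 1:
        return "remove_policy"    # drop the least discriminable candidate and retrain
    return "consolidate"          # keep the candidates and enqueue them for later expansion
```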
B.4
ILLUSTRATION OF THE EVOLUTION OF UPSIDE’S TREE ON A WALL-FREE MAZE
See Fig. 11.
B.5
ILLUSTRATION OF THE EVOLUTION OF UPSIDE'S TREE ON THE BOTTLENECK MAZE
See Fig. 12.
[Figure 11, left plot: number of active policies and average discriminator accuracy over the active policies as a function of environment interactions (up to 1.5 × 10^5), with markers A-I corresponding to the tree snapshots described in the caption.]
Figure 11: Fine-grained evolution of the tree structure on a wall-free maze with Nstart = 4 and Nmax = 8. The environment is a wall-free continuous maze with initial state s0 located at the center of the maze. Image A represents the diffusing part around s0. In image B, Nstart = 4 policies are trained, yet one of them (in lime yellow) is not sufficiently discriminable, thus it is pruned, resulting in image C. A small number of interactions is enough to ensure that the three policies are η-discriminable (image C). In image D, a fourth policy (in green) is able to become η-discriminable. New policies are added, trained and η-discriminated, from 5 policies (image E) to Nmax = 8 policies (image F). Then a policy (in yellow) is expanded with Nstart = 4 policies (image G). They are all η-discriminable so additional policies are added (images H, I, . . . ). The process continues until convergence or until time-out (as done here). On the left, we plot the number of active policies (which represents the number of policies that are being trained at the current level of the tree) as well as the average discriminator accuracy over the active policies.
Figure 12: Incremental expansion of the tree learned by UPSIDE towards unexplored regions of the state space
in the Bottleneck Maze.
C
EXPERIMENTAL DETAILS
C.1
BASELINES
DIAYN-NZ.
This corresponds to the original DIAYN algorithm (Eysenbach et al., 2019), where
NZ is the number of skills to be learned. In order to make the architecture more similar to UPSIDE,
we use distinct policies for each skill, i.e., they do not share weights, as opposed to Eysenbach et al.
(2019). While this may come at the price of sample efficiency, it may also put fewer constraints
on the model (e.g., gradient interference).
DIAYN-curr.
We augment DIAYN with a curriculum that makes it less dependent on an
adequate tuning of the number of skills of DIAYN. We consider the curriculum of UPSIDE, where we
start learning with Nstart policies during a given period of time/number of interactions. If the configuration
satisfies the discriminability threshold η, a skill is added, otherwise a skill is removed or learning is
stopped (as in Alg. 2, lines 5-12). Note that the increasing version of this curriculum is similar to
the one proposed in VALOR (Achiam et al., 2018, Sect. 3.3). In our experiments, we use Nstart = 1.
DIAYN-hier.
We extend DIAYN with a hierarchy of directed skills, built following
the UPSIDE principles. The difference between DIAYN-hier and UPSIDE is that the discriminator
reward is computed over the entire directed skill trajectory, while for UPSIDE it is guided by the
diffusing part. This baseline can be interpreted as an ablation of UPSIDE without the
decoupled structure of policies.
SMM.
We consider SMM (Lee et al., 2019) as it is state-of-the-art in terms of coverage, at least on
long-horizon control problems, although Campos et al. (2020) reported its poor performance in
hard-to-explore bottleneck mazes. We tested the regular SMM version, i.e., learning a state density
model with a VAE, yet we failed to make it work on the maze domains that we consider. As we use
the Cartesian (x, y) positions in the maze domains, learning the identity function on two-dimensional
input data is too easy with a VAE, thus preventing any benefit of using a density model to drive
exploration. In our implementation, the exploration bonus is instead obtained by maintaining a multinomial
distribution over "buckets of states" obtained by discretization (as in our coverage computation),
resulting in a computationally efficient implementation that is more stable than the original VAE-based
method. Note that the state distribution is computed using states from recent past policies, as
suggested in the original paper.
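A minimal sketch of this discretized exploration bonus is given below; the bucketing bounds and the add-one smoothing are illustrative choices, not the exact implementation.

```python
import numpy as np
from collections import Counter

class BucketDensityBonus:
    """Count-based exploration bonus over a uniform discretization of the state space."""

    def __init__(self, low, high, bins_per_axis=10):
        self.low, self.high, self.bins = np.asarray(low, float), np.asarray(high, float), bins_per_axis
        self.counts, self.total = Counter(), 0

    def _bucket(self, state):
        ratio = (np.asarray(state, float) - self.low) / (self.high - self.low)
        return tuple(np.clip((ratio * self.bins).astype(int), 0, self.bins - 1))

    def update(self, state):
        self.counts[self._bucket(state)] += 1
        self.total += 1

    def bonus(self, state):
        # -log p_hat(bucket(s)), with add-one smoothing so that the bonus stays finite
        p = (self.counts[self._bucket(state)] + 1) / (self.total + 1)
        return -np.log(p)
```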
EDL.
We consider EDL (Campos et al., 2020) with the strong assumption of an available state
distribution oracle (since replacing it by SMM does not lead to satisfactory results in the presence of
bottleneck states, as shown in Campos et al., 2020, page 7: "We were unable to explore this type
of maze effectively with SMM"). In our implementation, the oracle samples states uniformly in the
mazes, avoiding the need to handle the exploration problem, but this setting is not realistic when facing
unknown environments.
C.2
ARCHITECTURE AND HYPERPARAMETERS
The architecture of the different methods remains the same in all our experiments, except that the
number of hidden units changes across the considered environments. For UPSIDE, flat UPSIDE (i.e.,
UPSIDE with a tree depth of 1), DIAYN, DIAYN-curr, DIAYN-hier and SMM, the multiple
policies do not share weights; however, the EDL policies all share the same network because of the
constraint that the policy embedding z is learnt in a supervised fashion with the VQ-VAE rather
than with the unsupervised RL objective. We consider decoupled actor and critic networks optimized with the TD3
algorithm (Fujimoto et al., 2018), though we also tried SAC (Haarnoja et al., 2018), which showed
results equivalent to TD3 but required harder tuning.8 The actor and the critic have the same architecture:
a neural network with two hidden layers (of size 64 for the maze environments and 256 for the
control environments) that processes observations. The discriminator is a model with two hidden layers
(of size 64) whose output size is the number of skills in the tree.
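A minimal PyTorch sketch of these network sizes follows; the activation functions, the example input/output dimensions and the exact layout are assumptions, not the paper's code.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Example dimensions for the maze domains (assumptions); control environments use hidden=256.
obs_dim, act_dim, num_skills, hidden = 2, 2, 8, 64
actor = nn.Sequential(mlp(obs_dim, act_dim, hidden), nn.Tanh())  # TD3 actor with bounded actions
critic = mlp(obs_dim + act_dim, 1, hidden)                       # Q(s, a)
discriminator = mlp(obs_dim, num_skills, hidden)                 # logits for q_phi(z | s)
```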
Common (for all methods and environments) optimization hyper-parameters:
• Discount factor: γ = 0.99
• σTD3 = {0.1, 0.15, 0.2}
• Q-functions soft updates temperature τ = 0.005
• Policy Adam optimizer with learning rate lrpol = {1e−3, 1e−4}
• Policy inner epochs Kpol = {10, 100}
• Policy batch size Bpol = {64}
• Discriminator delay: J = {1, 10}
• Replay buffer maximum size: 1e6
• Kinitial = 1e3
We consider the same range of hyper-parameters in the downstream tasks.
8For completeness, we report here the performance of DIAYN-SAC in the continuous mazes: DIAYN-SAC
with NZ = 10 on Bottleneck maze: 21.0 (± 0.50); on U-maze: 17.5 (± 0.75), to compare with DIAYN-TD3
with NZ = 10 on Bottleneck maze: 17.67 (± 0.57); on U-maze: 14.67 (± 0.42). We thus see that DIAYN-SAC
fails to cover the state space, performing similarly to DIAYN-TD3 (albeit over a larger range of hyperparameter
search, possibly explaining the slight improvement).
UPSIDE, DIAYN and SMM variants (common for all environments) optimization hyper-
parameters:
• Discriminator batch size Bdiscr = 64
• Discriminator Adam optimizer with learning rate lrdiscr = {1e−3, 1e−4}
• Discriminator inner epochs Kdiscr = {10, 100}
• Discriminator delay: J = {1, 10}
• State buffer size B = 10H where the diffusing part length H is environment-specific.
EDL optimization hyper-parameters:
We kept the same ones as Campos et al. (2020). The VQ-VAE
architecture consists of an encoder that takes states as input and maps them to a code, with 2 hidden
layers of 128 hidden units and a final linear layer; the decoder takes the code and maps it back
to states, also with 2 hidden layers of 128 hidden units (a sketch is given after the list below). It is trained on the oracle state distribution
and then kept fixed during policy learning. Contrary to the UPSIDE, DIAYN and SMM variants, the reward
is stationary.
• βcommitment = {0.25, 0.5}
• VQ-VAE’s code size 16
• VQ-VAE batch size Bvq-vae = {64, 256}
• total number of epochs: 5000 (trained until convergence)
• VQ-VAE Adam optimizer with learning rate lrvq-vae = {1e−3, 1e−4}
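The sketch below illustrates this VQ-VAE in PyTorch. The number of codes, the straight-through estimator and the exact loss weighting follow the standard VQ-VAE recipe and are assumptions as far as the precise implementation is concerned.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillVQVAE(nn.Module):
    def __init__(self, state_dim=2, code_dim=16, num_codes=10, beta=0.25):
        super().__init__()
        # num_codes (one code per skill) is a placeholder value here
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, states):
        z_e = self.encoder(states)                                     # continuous embedding
        codes = torch.cdist(z_e, self.codebook.weight).argmin(dim=-1)  # nearest-code assignment
        z_q = self.codebook(codes)
        z_st = z_e + (z_q - z_e).detach()                              # straight-through estimator
        recon = self.decoder(z_st)
        loss = (F.mse_loss(recon, states)                              # reconstruction on oracle states
                + F.mse_loss(z_q, z_e.detach())                        # codebook loss
                + self.beta * F.mse_loss(z_e, z_q.detach()))           # commitment loss (beta_commitment)
        return loss, codes
```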
Maze specific hyper-parameters:
• Ksteps = 5e4 (with a wall-clock budget of 10 minutes)
• T = H = 10
• Max episode length Hmax = 200
• Max number of interactions Tmax = 1e7 during unsupervised pre-training and downstream tasks.
Control specific optimization hyper-parameters:
• Ksteps = 1e5 (with a wall-clock budget of 1 hour)
• T = H = 50
• Max episode length Hmax = 250
• Max number of interactions Tmax = 1e7 during unsupervised pre-training and downstream tasks.
Note that hyperparameters are kept fixed for the downstream tasks too.
C.3
EXPERIMENTAL PROTOCOL
We now detail the experimental protocol that we followed, which is common for both UPSIDE and
baselines, on all environments. It consists of the following three stages:
Unsupervised pre-training phase.
Given an environment, each algorithm is trained without any
extrinsic reward on Nunsup = 3 seeds, which we call unsupervised seeds (to account for the randomness
in the model weights' initialization and in the environment stochasticity, if present). Each training run
lasts for a maximum of Tmax environment steps (split into episodes of length Hmax). This
protocol actually favors the baselines since, by design, UPSIDE may decide to use fewer environment
interactions than Tmax thanks to its termination criterion (triggered if it cannot fit any
more policies); for instance, all baselines were allowed Tmax = 1e7 steps on the maze environments, but
UPSIDE finished in at most 1e6 environment steps, fitting on average 57 and 51 policies
for the Bottleneck Maze and the U-Maze respectively.
Model selection.
For each unsupervised seed, we tune the hyper-parameters of each algorithm
according to a given performance metric. For the baselines, we consider the cumulative intrinsic
reward (as done in, e.g., Strouse et al., 2021) averaged over stochastic roll-outs. For UPSIDE,
DIAYN-hier and DIAYN-curr, the model selection criterion is the number of consolidated poli-
cies, i.e., how many policies were η-discriminated during their training stage. For each method, we
thus have as many models as seeds, i.e. Nunsup.
Downstream tasks.
For each algorithm, we evaluate the Nunsup selected models on a set of tasks.
All results on downstream tasks will show a performance averaged over the Nunsup seeds.
• Coverage. We evaluate to what extent the state space has been covered by discretizing the
state space into buckets (10 per axis on the continuous maze domains) and counting how many
buckets have been reached. To compare the global coverage of the methods (and also to be fair
w.r.t. the amount of injected noise, which may vary across methods), we roll out for each model its
associated deterministic policies.
• Fine-tuning on goal-reaching task. We consider goal-oriented tasks in the discounted episodic
setting where the agent needs to reach some unknown goal position within a certain radius (i.e.,
the goal location is unknown until it is reached once) and with a sparse reward signal (i.e., a reward
of 1 in the goal location, 0 otherwise). The episode terminates when the goal is reached or when
the number of timesteps exceeds Hmax. The combination of unknown goal location and
sparse reward makes the exploration problem very challenging, and calls upon the ability to first
cover (for goal finding) and then navigate (for reliable goal reaching) the environment efficiently.
To evaluate performance in an exhaustive manner, we discretize the state space into Bgoal = 14
buckets and we randomly sample Ngoal = 3 goals from each of these buckets according to what we call
goal seeds (thus there are Bgoal × Ngoal = 42 different goals in total). For every goal seed, we
initialize each algorithm with the set of policies learned during the unsupervised pre-training.
We then roll out each policy for Nexplo episodes to compute the cumulative reward of the
policy, and select the best one to fine-tune. For UPSIDE, we complete the selected policy (of
length denoted by L) by replacing the diffusing part with a skill whose length is the remaining
number of interactions, i.e., Hmax − L. The ability to select a good policy is intrinsically
linked to the coverage performance of the model, but also to few-shot adaptation. Learning
curves and performance are averaged over unsupervised seeds, goal seeds, and roll-outs of
the stochastic policy. Since we are in the discounted episodic setting, fine-tuning makes sense in order to
reach the goal as fast as possible. This is particularly important as UPSIDE, because of its tree
policy structure, can reach the goal sub-optimally w.r.t. the discount. On the maze environments,
we consider all unsupervised pre-training baselines as well as "vanilla" baselines trained from
scratch during the downstream tasks: TD3 (Fujimoto et al., 2018) and ICM (Pathak et al., 2017).
In the Ant environment, we also consider Ngoal = 3 and Bgoal = 14 in the [−8, 8]² square. A minimal
sketch of the sparse goal-reaching reward and of the coverage count is given after this list.
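Both evaluation ingredients can be sketched as follows; the function signatures are hypothetical, and the goal radius and bucket bounds are left as parameters rather than the paper's values.

```python
import numpy as np

def sparse_goal_reward(state_xy, goal_xy, radius):
    # Reward of 1 within the (unknown) goal radius, 0 otherwise; also signals termination.
    reached = np.linalg.norm(np.asarray(state_xy) - np.asarray(goal_xy)) <= radius
    return (1.0 if reached else 0.0), bool(reached)

def coverage(visited_states, low, high, bins_per_axis=10):
    # Number of distinct buckets reached when discretizing each axis into bins_per_axis buckets.
    states = np.asarray(visited_states, dtype=float)
    ratio = (states - np.asarray(low)) / (np.asarray(high) - np.asarray(low))
    idx = np.clip((ratio * bins_per_axis).astype(int), 0, bins_per_axis - 1)
    return len({tuple(row) for row in idx})
```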
(a) UPSIDE discriminator. (b) DIAYN discriminator. (c) EDL VQ-VAE.
Figure 13: Environment divided into colors according to the most likely latent variable Z under (from left to right) the discriminator learned by UPSIDE, the discriminator learned by DIAYN, and the VQ-VAE learned by EDL. Contrary to DIAYN, UPSIDE's optimization enables the discriminator training and the policy training to catch up with each other, thus nicely clustering the discriminator predictions across the state space. EDL's VQ-VAE also manages to output good predictions (recall that we consider the EDL version with the strong assumption of an available state distribution oracle, see Campos et al., 2020), yet the skill learning is unable to cover the entire state space due to exploration issues and sparse rewards.
(a) SMM. (b) DIAYN-curr. (c) DIAYN-hier. (d) Flat UPSIDE.
Figure 14: Complement to Fig. 2: visualization of the policies learned on the Bottleneck Maze for the remaining methods.
(a) SMM. (b) TD3.
Figure 15: Complement to Fig. 5: heatmaps of downstream task performance after fine-tuning for the remaining methods.
D
ADDITIONAL EXPERIMENTS
D.1
ADDITIONAL RESULTS ON BOTTLENECK MAZE
Here we include: 1) Fig. 13 for an analysis of the predictions of the discriminator (see caption for
details); 2) Fig. 14 for the policy visualizations for the remaining methods (i.e., those not reported
in Fig. 2); 3) Fig. 15 for the downstream task performance for the remaining methods (i.e., those not
reported in Fig. 5).
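The partition shown in Fig. 13 can be produced as in the following sketch, which labels each point of a grid over the maze with the most likely latent variable; the discriminator interface is hypothetical.

```python
import numpy as np
import torch

def latent_partition(discriminator, low, high, resolution=100):
    # Label each point of a regular grid over the 2D maze with argmax_z q_phi(z | s).
    xs = np.linspace(low[0], high[0], resolution)
    ys = np.linspace(low[1], high[1], resolution)
    grid = np.array([[x, y] for y in ys for x in xs], dtype=np.float32)
    with torch.no_grad():
        logits = discriminator(torch.from_numpy(grid))
    return logits.argmax(dim=-1).reshape(resolution, resolution).numpy()
```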
(a) UPSIDE. (b) DIAYN. (c) EDL. (d) SMM. (e) DIAYN-curr. (f) DIAYN-hier. (g) Flat UPSIDE.
Figure 16: Visualization of the policies learned on U-Maze. This is the equivalent of Fig. 2 for U-Maze.
(a) UPSIDE before fine-tuning. (b) UPSIDE. (c) DIAYN. (d) EDL. (e) ICM. (f) SMM. (g) TD3.
Figure 17: Heat maps of downstream task performance on U-Maze. This is the equivalent of Fig. 5 for U-Maze.
D.2
ADDITIONAL RESULTS ON U-MAZE
Fig. 16 visualizes the policies learned during the unsupervised phase (i.e., the equivalent of Fig. 2 for
the U-Maze), and Fig. 17 reports the heatmaps of downstream task performance (i.e., the equivalent
of Fig. 5 for the U-Maze). The conclusion is the same as on the Bottleneck Maze described in
Sect. 5: UPSIDE clearly outperforms all the baselines, both in coverage (Fig. 16) and in unknown
goal-reaching performance (Fig. 17) in the downstream task phase.
[Figure 18 plots: average discriminator accuracy as a function of environment interactions (up to 10^7), for DIAYN-NZ with NZ ∈ {10, 20, 50} on (a) Bottleneck Maze, and NZ ∈ {5, 10, 20} on (b) Ant, (c) Half-Cheetah and (d) Walker2d.]
Figure 18: Average discriminability of the DIAYN-NZ policies. The smaller NZ is, the easier it is to obtain
a close-to-perfect discriminability. However, even for quite large NZ (50 for mazes and 20 in control envi-
ronments), DIAYN is able to achieve a good discriminator accuracy, most often because policies learn how to
“stop” in some state.
D.3
ANALYSIS OF THE DISCRIMINABILITY
In Fig. 18 (see caption) we investigate the average discriminability of DIAYN depending on the
number of policies NZ.
(a)-(e): Bottleneck Maze with T = H = 10, 20, 30, 40, 50. (f)-(j): U-Maze with T = H = 10, 20, 30, 40, 50.
Figure 19: Ablation on the lengths (T, H) of the UPSIDE policies: visualization of the policies learned on the Bottleneck Maze (top) and the U-Maze (bottom) for different values of T, H, together with the corresponding coverage values (computed according to the same procedure as in Table 2) reported in the table below. Recall that T and H denote respectively the lengths of the directed skill and of the diffusing part of an UPSIDE policy.

UPSIDE        Bottleneck Maze    U-Maze
T = H = 10    85.67 (±1.93)      71.33 (±0.42)
T = H = 20    87.33 (±0.42)      67.67 (±1.50)
T = H = 30    77.33 (±3.06)      68.33 (±0.83)
T = H = 40    59.67 (±1.81)      57.33 (±0.96)
T = H = 50    51.67 (±0.63)      58.67 (±1.26)
D.4
ABLATION ON THE LENGTHS T AND H OF THE UPSIDE POLICIES
Our ablation on the mazes in Fig. 19 investigates the sensitivity of UPSIDE w.r.t. T and H, the
lengths of the directed skills and diffusing parts of the policies. For the sake of simplicity, we kept
T = H. It shows that the method is quite robust to reasonable choices of T and H, i.e., equal
to 10 (as done in all other experiments) but also 20 or 30. Naturally, the performance degrades if
T, H are chosen too large w.r.t. the environment size, in particular in the Bottleneck Maze which
requires "narrow" exploration, so that composing disproportionately long skills hinders coverage.
For T = H = 50, we recover the performance of flat UPSIDE.
D.5
FINE-TUNING RESULTS ON HALF-CHEETAH AND WALKER2D
In Sect. 5, we reported the fine-tuning results on Ant. We now focus on Half-Cheetah and Walker2d,
and similarly observe that UPSIDE largely outperforms the baselines:
              UPSIDE            TD3               DIAYN
Half-Cheetah  174.93 (±1.45)    108.67 (±25.61)   0.0 (±0.0)
Walker2d       46.29 (±3.09)     14.33 (±1.00)    2.13 (±0.74)
We note that the fine-tuning experiment on Half-Cheetah exactly corresponds to the standard Sparse-
Half-Cheetah task, which rewards states whose x-coordinate is larger than 15. Meanwhile, Sparse-
Walker2d provides a reward as long as the x-coordinate is larger than 1 and the agent is standing
up. Our fine-tuning task on Walker2d is more challenging as we require the x-coordinate to be
larger than 4. Yet the agent can be rewarded even if it is not standing up, since our downstream
task is purely goal-reaching. Interestingly, we noticed that UPSIDE may reach the desired
goal location while not standing up (e.g., by crawling), which may occur as there is no incentive in
UPSIDE to remain standing up, only to be discriminable.
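For completeness, the sparse rewards used in these two fine-tuning tasks amount to the following threshold check on the x-coordinate (a minimal sketch):

```python
def sparse_locomotion_reward(x_position, threshold):
    # Reward of 1 once the x-coordinate exceeds the task threshold, 0 otherwise.
    return 1.0 if x_position > threshold else 0.0

half_cheetah_reward = lambda x: sparse_locomotion_reward(x, threshold=15.0)  # Sparse-Half-Cheetah
walker2d_reward = lambda x: sparse_locomotion_reward(x, threshold=4.0)       # our Walker2d task
```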