|
Published as a conference paper at ICLR 2022 |
|
DIRECT THEN DIFFUSE: |
|
INCREMENTAL UNSUPERVISED SKILL DISCOVERY |
|
FOR STATE COVERING AND GOAL REACHING |
|
Pierre-Alexandre Kamienny∗1,2, Jean Tarbouriech∗1,3, |
|
Sylvain Lamprier2, Alessandro Lazaric1, Ludovic Denoyer†1 |
|
1 Meta AI |
|
2 Sorbonne University, LIP6/ISIR |
|
3 Inria Scool |
|
ABSTRACT |
|
Learning meaningful behaviors in the absence of reward is a difficult problem in |
|
reinforcement learning. A desirable and challenging unsupervised objective is to |
|
learn a set of diverse skills that provide a thorough coverage of the state space |
|
while being directed, i.e., reliably reaching distinct regions of the environment. |
|
In this paper, we build on the mutual information framework for skill discovery |
|
and introduce UPSIDE, which addresses the coverage-directedness trade-off in |
|
the following ways: 1) We design policies with a decoupled structure of a directed |
|
skill, trained to reach a specific region, followed by a diffusing part that induces |
|
a local coverage. 2) We optimize policies by maximizing their number under the |
|
constraint that each of them reaches distinct regions of the environment (i.e., they |
|
are sufficiently discriminable) and prove that this serves as a lower bound to the |
|
original mutual information objective. 3) Finally, we compose the learned directed |
|
skills into a growing tree that adaptively covers the environment. We illustrate in |
|
several navigation and control environments how the skills learned by UPSIDE |
|
solve sparse-reward downstream tasks better than existing baselines. |
|
1 INTRODUCTION
|
Deep reinforcement learning (RL) algorithms have been shown to effectively solve a wide variety |
|
of complex problems (e.g., Mnih et al., 2015; Bellemare et al., 2013). However, they are often |
|
designed to solve one single task at a time and they need to restart the learning process from scratch |
|
for any new problem, even when it is defined on the very same environment (e.g., a robot navigating |
|
to different locations in the same apartment). Recently, Unsupervised RL (URL) has been proposed |
|
as an approach to address this limitation. In URL, the agent first interacts with the environment |
|
without any extrinsic reward signal. Afterward, the agent leverages the experience accumulated |
|
during the unsupervised learning phase to efficiently solve a variety of downstream tasks defined on |
|
the same environment. This approach is particularly effective in problems such as navigation (see |
|
e.g., Bagaria et al., 2021) and robotics (see e.g., Pong et al., 2020) where the agent is often required |
|
to readily solve a wide range of tasks while the dynamics of the environment remains fixed.
|
In this paper, we focus on the unsupervised objective of discovering a set of skills that can be |
|
used to efficiently solve sparse-reward downstream tasks. In particular, we build on the insight |
|
that mutual information (MI) between the skills’ latent variables and the states reached by them |
|
can formalize the dual objective of learning policies that both cover and navigate the environment |
|
efficiently. Indeed, maximizing MI has been shown to be a powerful approach for encouraging |
|
exploration in RL (Houthooft et al., 2016; Mohamed & Rezende, 2015) and for unsupervised skill |
|
discovery (e.g., Gregor et al., 2016; Eysenbach et al., 2019; Achiam et al., 2018; Sharma et al., 2020; |
|
Campos et al., 2020). Nonetheless, learning policies that maximize MI is a challenging optimization |
|
problem. Several approximations have been proposed to simplify it at the cost of possibly deviating |
|
from the original objective of coverage and directedness (see Sect. 4 for a review of related work). |
|
∗equal contribution |
|
†Now at Ubisoft La Forge |
|
{pakamienny,jtarbouriech,lazaric}@fb.com, sylvain.lamprier@isir.upmc.fr, ludovic.den@gmail.com |
|
|
Figure 1: Overview of UPSIDE. The black dot corresponds to the initial state. (A) A set of random policies is initialized, each policy being composed of a directed part called skill (illustrated as a black arrow) and a diffusing part (red arrows) which induces a local coverage (colored circles). (B) The skills are then updated to maximize the discriminability of the states reached by their corresponding diffusing part (Sect. 3.1). (C) The least discriminable policies are iteratively removed while the remaining policies are re-optimized. This is executed until the discriminability of each policy satisfies a given constraint (Sect. 3.2). In this example two policies are consolidated. (D) One of these policies is used as a basis to add new policies, which are then optimized following the same procedure. For the "red" and "purple" policies, UPSIDE is not able to find sub-policies of sufficient quality and thus they are not expanded any further. (E) At the end of the process, UPSIDE has created a tree of policies covering the state space, with skills as edges and diffusing parts as nodes (Sect. 3.3).
|
In this paper, we propose UPSIDE (UnsuPervised Skills that dIrect then DiffusE) to learn a set of |
|
policies that can be effectively used to cover the environment and solve goal-reaching downstream |
|
tasks. Our solution builds on the following components (Fig. 1): |
|
• Policy structure (Sect. 3.1, see Fig. 1 (A)). We consider policies composed of two parts: 1) a |
|
directed part, referred to as the skill, that is trained to reach a specific region of the environment, |
|
and 2) a diffusing part that induces a local coverage around the region attained by the first part. |
|
This structure favors coverage and directedness at the level of a single policy. |
|
• New constrained objective (Sect. 3.2, see Fig. 1 (B) & (C)). We then introduce a constrained opti- |
|
mization problem designed to maximize the number of policies under the constraint that the states |
|
reached by each of the diffusing parts are distinct enough (i.e., they satisfy a minimum level of |
|
discriminability). We prove that this problem can be cast as a lower bound to the original MI |
|
objective, thus preserving its coverage-directedness trade-off. UPSIDE solves it by adaptively |
|
adding or removing policies to a given initial set, without requiring any prior knowledge on a |
|
sensible number of policies. |
|
• Tree structure (Sect. 3.3, see Fig. 1 (D) & (E)). Leveraging the directed nature of the skills, |
|
UPSIDE effectively composes them to build longer and longer policies organized in a tree struc- |
|
ture. This overcomes the need to define a suitable policy length in advance. Thus in UPSIDE
|
we can consider short policies to make the optimization easier, while composing their skills along |
|
a growing tree structure to ensure an adaptive and thorough coverage of the environment. |
|
The combination of these components allows UPSIDE to effectively adapt the number and the length |
|
of policies to the specific structure of the environment, while learning policies that ensure coverage |
|
and directedness. We study the effectiveness of UPSIDE and the impact of its components in hard- |
|
to-explore continuous navigation and control environments, where UPSIDE improves over existing |
|
baselines both in terms of exploration and learning performance. |
|
2 SETTING
|
We consider the URL setting where the agent interacts with a Markov decision process (MDP) M |
|
with state space S, action space A, dynamics p(s′|s, a), and no reward. The agent starts each |
|
episode from a designated initial state s0 ∈S.1 Upon termination of the chosen policy, the agent is |
|
then reset to s0. This setting is particularly challenging from an exploration point of view since the |
|
agent cannot rely on the initial distribution to cover the state space. |
|
We recall the MI-based unsupervised skill discovery approach (see e.g., Gregor et al., 2016). Denote |
|
by Z some (latent) variables on which the policies of length T are conditioned (we assume that Z |
|
is categorical for simplicity and because it is the most common case in practice). There are three |
|
1More generally, s0 could be drawn from any distribution supported over a compact region. |
|
|
optimization variables: (i) the cardinality of Z denoted by NZ, i.e., the number of policies (we write |
|
Z = {1, . . . , NZ} = [NZ]), (ii) the parameters π(z) of the policy indexed by z ∈Z, and (iii) the |
|
policy sampling distribution ρ (i.e., ρ(z) is the probability of sampling policy z at the beginning of |
|
the episode). Denote policy z’s action distribution in state s by π(·|z, s) and the entropy function |
|
by H. Let the variable ST be the random (final) state induced by sampling a policy z from ρ and |
|
executing π(z) from s0 for an episode. Denote by pπ(z)(sT ) the distribution over (final) states |
|
induced by executing policy z, by p(z|sT ) the probability of z being the policy to induce (final) |
|
state sT, and let p(sT) = Σz∈Z ρ(z) pπ(z)(sT). Maximizing the MI between Z and ST can be written as maxNZ,ρ,π I(ST; Z), where

\[
I(S_T; Z) = H(S_T) - H(S_T \mid Z) = -\sum_{s_T} p(s_T) \log p(s_T) + \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_T \mid z}\big[\log p_{\pi(z)}(s_T)\big]
\]
\[
\qquad\qquad\; = H(Z) - H(Z \mid S_T) = -\sum_{z \in Z} \rho(z) \log \rho(z) + \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_T \mid z}\big[\log p(z \mid s_T)\big], \tag{1}
\]
|
where in the expectations sT |z ∼pπ(z)(sT ). In the first formulation, the entropy term over states |
|
captures the requirement that policies thoroughly cover the state space, while the second term mea- |
|
sures the entropy over the states reached by each policy and thus promotes policies that have a |
|
directed behavior. Learning the optimal NZ, ρ, and π to maximize Equation 1 is a challenging prob- |
|
lem and several approximations have been proposed (see e.g., Gregor et al., 2016; Eysenbach et al., |
|
2019; Achiam et al., 2018; Campos et al., 2020). Many approaches focus on the so-called reverse |
|
formulation of the MI (second line of Equation 1). In this case, the conditional distribution p(z|sT ) is |
|
usually replaced with a parametric model qϕ(z|sT ) called the discriminator that is trained via a neg- |
|
ative log likelihood loss simultaneously with all other variables. Then one can maximize the lower |
|
bound (Barber & Agakov, 2004): I(ST ; Z) ≥Ez∼ρ(z),τ∼π(z) [log qϕ(z|sT ) −log ρ(z)], where we |
|
denote by τ ∼π(z) trajectories sampled from the policy indexed by z. As a result, each policy π(z) |
|
can be trained with RL to maximize the intrinsic reward rz(sT ) := log qϕ(z|sT ) −log ρ(z). |
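As a concrete illustration of this discriminator-based reward, the sketch below computes rz(sT) = log qϕ(z|sT) − log ρ(z) with a uniform ρ. This is only a minimal sketch, not the authors' implementation: the linear softmax discriminator, its gradient update and all sizes are illustrative assumptions.

```python
import numpy as np

class Discriminator:
    """Linear softmax classifier q_phi(z | s), a simple stand-in for the learned discriminator."""

    def __init__(self, state_dim, n_policies, lr=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((state_dim, n_policies))
        self.b = np.zeros(n_policies)
        self.lr = lr

    def log_probs(self, states):
        """log q_phi(. | s) for a batch of states, shape (batch, n_policies)."""
        logits = states @ self.W + self.b
        logits -= logits.max(axis=1, keepdims=True)             # numerical stability
        return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    def update(self, states, z):
        """One negative log-likelihood step on (state, policy index) pairs."""
        probs = np.exp(self.log_probs(states))
        onehot = np.eye(probs.shape[1])[z]
        grad_logits = (probs - onehot) / len(z)                 # softmax cross-entropy gradient
        self.W -= self.lr * states.T @ grad_logits
        self.b -= self.lr * grad_logits.sum(axis=0)

def intrinsic_reward(disc, states, z, n_policies):
    """r_z(s_T) = log q_phi(z | s_T) - log rho(z), with rho uniform over n_policies."""
    log_q = disc.log_probs(states)[np.arange(len(states)), z]
    return log_q - np.log(1.0 / n_policies)
```

With a uniform ρ the reward is simply log qϕ(z|sT) + log NZ, i.e., it is positive exactly when the discriminator assigns policy z more probability than chance.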
|
3 THE UPSIDE ALGORITHM
|
In this section we detail the three main components of UPSIDE, which is summarized in Sect. 3.4. |
|
3.1 DECOUPLED POLICY STRUCTURE OF DIRECT-THEN-DIFFUSE
|
While the trade-off between coverage and directedness is determined by the MI objective, the |
|
amount of stochasticity of each policy (e.g., injected via a regularization on the entropy over the |
|
actions) also has a major impact on the effectiveness of the overall algorithm (Eysenbach et al.,
|
2019). In fact, while randomness can promote broader coverage, a highly stochastic policy tends |
|
to induce a distribution pπ(z)(sT ) over final states with high entropy, thus increasing H(ST |Z) and |
|
losing in directedness. In UPSIDE, we define policies with a decoupled structure (see Fig. 1 (A)) |
|
composed of a) a directed part (of length T) that we refer to as skill, with low stochasticity and |
|
trained to reach a specific region of the environment and b) a diffusing part (of length H) with high |
|
stochasticity to promote local coverage of the states around the region reached by the skill. |
|
Coherently with this structure, the state variable in the conditional entropy in Equation 1 becomes any state reached during the diffusing part (denote by Sdiff the random variable) and not just the skill's terminal state. Following Sect. 2 we define an intrinsic reward rz(s) = log qϕ(z|s) − log ρ(z) and the skill of policy z maximizes the cumulative reward over the states traversed by the diffusing part. Formally, we can conveniently define the objective function:

\[
\max_{\pi(z)} \; \mathbb{E}_{\tau \sim \pi(z)} \Big[ \sum_{t \in J} \alpha \cdot r_z(s_t) + \beta \cdot H\big(\pi(\cdot \mid z, s_t)\big) \Big], \tag{2}
\]

where J = {T, . . . , T + H} and α = 1, β = 0 (resp. α = 0, β = 1) when optimizing for the skill (resp. diffusing part). In words, the skill is incentivized to bring the diffusing part to a discriminable region of the state space, while the diffusing part is optimized by a simple random walk policy (i.e., a stochastic policy with uniform distribution over actions) to promote local coverage around sT.

                   UPSIDE directed skill | UPSIDE diffusing part | VIC policy | DIAYN policy
  state variable   Sdiff                 | Sdiff                 | ST         | S
  J                {T, . . . , T + H}    | {T, . . . , T + H}    | {T}        | {1, . . . , T}
  (α, β)           (1, 0)                | (0, 1)                | (1, 0)     | (1, 1)

Table 1: Instantiation of Equation 2 for each part of an UPSIDE policy, and for VIC (Gregor et al., 2016) and DIAYN (Eysenbach et al., 2019) policies.
|
|
Table 1 illustrates how UPSIDE's policies compare to other methods. Unlike VIC and similarly to
|
DIAYN, the diffusing parts of the policies tend to “push” the skills away so as to reach diverse |
|
regions of the environment. The combination of the directedness of the skills and local coverage of |
|
the diffusing parts thus ensures that the whole environment can be properly visited with NZ ≪|S| |
|
policies.2 Furthermore, the diffusing part can be seen as defining a cluster of states that represents |
|
the goal region of the directed skill. This is in contrast with DIAYN policies whose stochasticity may |
|
be spread over the whole trajectory. This allows us to “ground” the latent variable representations |
|
of the policies Z to specific regions of the environment (i.e., the clusters). As a result, maximizing |
|
the MI I(Sdiff; Z) can be seen as learning a set of “cluster-conditioned” policies. |
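To make the role of (J, α, β) concrete, here is a minimal sketch of how the return of Equation 2 could be instantiated for each column of Table 1. The trajectory format and the placeholder reward/entropy values are assumptions for illustration only; the lengths T = H = 10 are the maze values used in Sect. 5.

```python
import numpy as np

def equation2_return(rewards, action_entropies, J, alpha, beta):
    """Return of Equation 2 for one trajectory: sum over t in J of alpha*r_z(s_t) + beta*H(pi(.|z, s_t))."""
    return sum(alpha * rewards[t] + beta * action_entropies[t] for t in J)

T, H = 10, 10                        # skill and diffusing-part lengths (maze setting of Sect. 5)
n_steps = T + H + 1                  # enough steps to index up to t = T + H

# (J, alpha, beta) for each column of Table 1.
CONFIGS = {
    "UPSIDE directed skill": (range(T, T + H + 1), 1.0, 0.0),   # rewarded on the diffusing-part states
    "UPSIDE diffusing part": (range(T, T + H + 1), 0.0, 1.0),   # pure action-entropy (random walk)
    "VIC policy":            ([T], 1.0, 0.0),                   # reward on the final state only
    "DIAYN policy":          (range(1, T + 1), 1.0, 1.0),       # reward and entropy along the whole trajectory
}

rng = np.random.default_rng(0)
rewards = rng.uniform(size=n_steps)          # placeholder r_z(s_t) values
entropies = rng.uniform(size=n_steps)        # placeholder H(pi(.|z, s_t)) values
for name, (J, alpha, beta) in CONFIGS.items():
    print(f"{name}: {equation2_return(rewards, entropies, J, alpha, beta):.2f}")
```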
|
3.2 A CONSTRAINED OPTIMIZATION PROBLEM
|
In this section, we focus on how to optimize the number of policies NZ and the policy sampling |
|
distribution ρ(z). The standard practice for Equation 1 is to preset a fixed number of policies NZ and |
|
to fix the distribution ρ to be uniform (see e.g., Eysenbach et al., 2019; Baumli et al., 2021; Strouse |
|
et al., 2021). However, using a uniform ρ over a fixed number of policies may be highly suboptimal, |
|
in particular when NZ is not carefully tuned. In App. A.2 we give a simple example and a theoretical |
|
argument showing that the MI is guaranteed to increase by removing skills with low discriminability
|
when ρ is uniform. Motivated by this observation, in UPSIDE we focus on maximizing the number |
|
of policies that are sufficiently discriminable. We fix the sampling distribution ρ to be uniform over |
|
N policies and define the following constrained optimization problem |
|
\[
\max_{N \ge 1} \; N \quad \text{s.t.} \quad g(N) \ge \log \eta, \qquad \text{where} \quad g(N) := \max_{\pi, \phi} \, \min_{z \in [N]} \, \mathbb{E}_{s_{\mathrm{diff}}} \big[ \log q_\phi(z \mid s_{\mathrm{diff}}) \big], \tag{$P_\eta$}
\]
|
where qϕ(z|sdiff) denotes the probability of z being the policy traversing sdiff during its diffusing |
|
part according to the discriminator and η ∈(0, 1) defines a minimum discriminability threshold. |
|
By optimizing for (Pη), UPSIDE automatically adapts the number of policies to promote coverage, |
|
while still guaranteeing that each policy reaches a distinct region of the environment. Alternatively, |
|
we can interpret (Pη) under the lens of clustering: the aim is to find the largest number of clusters |
|
(i.e., the region reached by the directed skill and covered by its associated diffusing part) with a |
|
sufficient level of inter-cluster distance (i.e., discriminability) (see Fig. 1). The following lemma |
|
(proof in App. A.1) formally links the constrained problem (Pη) back to the original MI objective. |
|
Lemma 1. There exists a value η† ∈(0, 1) such that solving (Pη†) is equivalent to maximizing a |
|
lower bound on the mutual information objective max NZ,ρ,π,ϕ I(Sdiff; Z). |
|
Since (Pη†) is a lower bound to the MI, optimizing it ensures that the algorithm does not deviate |
|
too much from the dual covering and directed behavior targeted by MI maximization. Interestingly, |
|
Lem. 1 provides a rigorous justification for using a uniform sampling distribution restricted to the |
|
η-discriminable policies, which is in striking contrast with most of MI-based literature, where a |
|
uniform sampling distribution ρ is defined over the predefined number of policies. |
|
In addition, our alternative objective (Pη) has the benefit of providing a simple greedy strategy to |
|
optimize the number of policies N. Indeed, the following lemma (proof in App. A.1) ensures that |
|
starting with N = 1 (where g(1) = 0) and increasing it until the constraint g(N) ≥log η is violated |
|
is guaranteed to terminate with the optimal number of policies. |
|
Lemma 2. The function g is non-increasing in N. |
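Lemma 2 suggests the simple greedy search over N sketched below. This is a sketch under assumptions: the hypothetical routine train_and_estimate_g stands in for (re)training the N policies and the discriminator and returning an estimate of g(N), and n_max is an arbitrary safety cap.

```python
import math

def greedy_number_of_policies(train_and_estimate_g, eta, n_max=50):
    """Greedy solution of (P_eta): the largest N such that g(N) >= log(eta).

    Because g is non-increasing in N (Lemma 2), the first violated constraint is final.
    """
    n = 1                                     # g(1) = 0 >= log(eta) holds for any eta in (0, 1)
    while n < n_max and train_and_estimate_g(n + 1) >= math.log(eta):
        n += 1                                # the constraint still holds: keep the extra policy
    return n
```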
|
3.3 COMPOSING SKILLS IN A GROWING TREE STRUCTURE
|
Both the original MI objective and our constrained formulation (Pη) depend on the initial state s0 |
|
and on the length of each policy. Although these quantities are usually predefined and only appear |
|
implicitly in the equations, they have a crucial impact on the obtained behavior. In fact, resetting |
|
after each policy execution unavoidably restricts the coverage to a radius of at most T + H steps |
|
around s0. This may suggest to set T and H to large values. However, increasing T makes training |
|
the skills more challenging, while increasing H may not be sufficient to improve coverage. |
|
2Equation 1 is maximized by setting NZ = |S| (i.e., maxY I(X, Y ) = I(X, X) = H(X)), where each z |
|
represents a goal-conditioned policy reaching a different state, which implies having as many policies as states, |
|
thus making the learning particularly challenging. |
|
|
Instead, we propose to “extend” the length of the policies through composition. We rely on the |
|
key insight that the constraint in (Pη) guarantees that the directed skill of each η-discriminable |
|
policy reliably reaches a specific (and distinct) region of the environment and it is thus re-usable |
|
and amenable to composition. We thus propose to chain the skills so as to reach further and further |
|
parts of the state space. Specifically, we build a growing tree, where the root node is a diffusing |
|
part around s0, the edges represent the skills, and the nodes represent the diffusing parts. When a |
|
policy z is selected, the directed skills of its predecessor policies in the tree are executed first (see |
|
Fig. 9 in App. B for an illustration). Interestingly, this growing tree structure builds a curriculum |
|
on the episode lengths which grow as the sequence (iT + H)i≥1, thus avoiding the need for prior
|
knowledge on an adequate horizon of the downstream tasks.3 Here this knowledge is replaced by |
|
T and H which are more environment-agnostic and task-agnostic choices as they rather have an |
|
impact on the size and shape of the learned tree (e.g., the smaller T and H the bigger the tree). |
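For illustration, a minimal sketch of this skill composition is given below; the Node fields and the env/skill/diffuse interfaces are assumptions, not the authors' API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    """One UPSIDE policy: an incoming edge (directed skill) and a node (its diffusing part)."""
    skill: Optional[Callable] = None          # callable (env, horizon) -> None; None for the root
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def path_from_root(node):
    """Policies on the path from the root (excluded) down to `node` itself."""
    path = []
    while node is not None and node.skill is not None:
        path.append(node)
        node = node.parent
    return list(reversed(path))

def execute_policy(env, node, T, H, diffuse):
    """Run policy `node`: chain its ancestors' skills (T steps each), then diffuse for H steps.

    `env` is assumed to expose reset(); `diffuse(env, H)` runs the local random-walk part.
    The episode length thus grows as i*T + H with the depth i of the node (the curriculum above).
    """
    env.reset()                               # every episode restarts from s0
    for ancestor in path_from_root(node):     # e.g., z in {2, 5, 7} for policy 7 in Fig. 9 (App. B)
        ancestor.skill(env, T)
    diffuse(env, H)
```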
|
3.4 IMPLEMENTATION
|
We are now ready to introduce the UPSIDE algorithm, which provides a specific implementation of the components described before (see Fig. 1 for an illustration, Alg. 1 for a short pseudo-code and Alg. 2 in App. B for the detailed version). We first make approximations so that the constraint in (Pη) is easier to estimate. We remove the logarithm from the constraint to have an estimation range of [0, 1] and thus lower variance.4 We also replace the expectation over sdiff with an empirical estimate q̂^B_ϕ(z) = (1/|Bz|) Σ_{s∈Bz} qϕ(z|s), where Bz denotes a small replay buffer, which we call state buffer, that contains states collected during a few rollouts by the diffusing part of π(z). In our experiments, we take B = |Bz| = 10H. Integrating this in (Pη) leads to

\[
\max_{N \ge 1} \; N \quad \text{s.t.} \quad \max_{\pi, \phi} \, \min_{z \in [N]} \, \widehat{q}^{\,B}_{\phi}(z) \ge \eta, \tag{3}
\]

where η is a hyper-parameter.5 From Lem. 2, this optimization problem in N can be solved using the incremental policy addition or removal in Alg. 1 (lines 5 & 9), independently from the number of initial policies N.
|
Algorithm 1: UPSIDE
Parameters: Discriminability threshold η ∈ (0, 1), branching factor N_start, N_max.
Initialize: Tree T initialized as a root node 0, policy candidates Q = {0}.
while Q ≠ ∅ do  // tree expansion
 1    Dequeue a policy z ∈ Q and create N = N_start policies C(z).
 2    POLICYLEARNING(T, C(z)).
 3    if min_{z′∈C(z)} q̂^B_ϕ(z′) > η then  // Node addition
 4        while min_{z′∈C(z)} q̂^B_ϕ(z′) > η and N < N_max do
 5            Increment N = N + 1 and add one policy to C(z).
 6            POLICYLEARNING(T, C(z)).
 7    else  // Node removal
 8        while min_{z′∈C(z)} q̂^B_ϕ(z′) < η and N > 1 do
 9            Reduce N = N − 1 and remove the least discriminable policy from C(z).
10            POLICYLEARNING(T, C(z)).
11    Add the η-discriminable policies C(z) to Q, and to T as nodes rooted at z.
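For concreteness, a minimal sketch of the empirical discriminability estimate used in Equation 3 is given below; the discriminator interface and the toy buffers are assumptions, not the authors' code.

```python
import numpy as np

def empirical_discriminability(q_probs_fn, state_buffers):
    """q_hat^B_phi(z) = (1/|B_z|) * sum_{s in B_z} q_phi(z | s), for every policy z.

    q_probs_fn(states) is assumed to return an array of shape (len(states), N) with the
    discriminator probabilities; state_buffers[z] is the state buffer B_z filled by the
    diffusing part of policy z (|B_z| = 10 * H in the paper's experiments).
    """
    return np.array([q_probs_fn(np.asarray(buf))[:, z].mean()
                     for z, buf in enumerate(state_buffers)])

def constraint_satisfied(q_hat, eta=0.8):
    """Constraint of Equation 3: every policy must be at least eta-discriminable."""
    return q_hat.min() >= eta

# Toy check with a 2-policy discriminator over 1-D states and well-separated diffusing regions.
def toy_q(states):
    p0 = 1.0 / (1.0 + np.exp(-states[:, 0]))     # q(z = 0 | s)
    return np.stack([p0, 1.0 - p0], axis=1)

buffers = [np.full((20, 1), 3.0), np.full((20, 1), -3.0)]
print(constraint_satisfied(empirical_discriminability(toy_q, buffers)))   # -> True
```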
|
We then integrate the optimization of Equation 3 into an adaptive tree expansion strategy that incrementally composes skills (Sect. 3.3). The tree is initialized with a root node corresponding to a policy only composed of the diffusing part around s0. Then UPSIDE iteratively proceeds through the following phases: (Expansion) While policies/nodes can be expanded according to different ordering rules (e.g., a FIFO strategy), we rank them in descending order by their discriminability (i.e., q̂^B_ϕ(z)), thus favoring the expansion of policies that reach regions of the state space that are not too saturated. Given a candidate leaf z to expand from the tree, we introduce new policies by adding a set C(z) of N = N_start nodes rooted at node z (line 2, see also steps (A) and (D) in Fig. 1). (Policy learning) The new policies are optimized in three steps (see App. B for details on the POLICYLEARNING subroutine): i) sample states from the diffusing parts of the new policies sampled uniformly from C(z) (state buffers of consolidated policies in T are kept in memory), ii) update the discriminator and compute the discriminability q̂^B_ϕ(z′) of the new policies z′ ∈ C(z), iii) update the skills to optimize the reward (Sect. 3.1) computed using the discriminator (see step (B) in Fig. 1). (Node adaptation) Once the policies are trained, UPSIDE proceeds with optimizing N in a greedy fashion. If all the policies in C(z) have an (estimated) discriminability larger than η (lines 3-5), a new policy is tentatively added to C(z), the policy counter N is incremented, the policy learning step is restarted, and the algorithm keeps adding policies until the constraint is not met anymore or a maximum number of policies is attained. Conversely, if some policy in C(z) does not meet the discriminability constraint (lines 7-9), the one with the lowest discriminability is removed from C(z), the policy learning step is restarted, and the algorithm keeps removing policies until all policies satisfy the constraint or no policy is left (see step (C) in Fig. 1). The resulting C(z) is added to the set of consolidated policies (line 11) and UPSIDE iteratively proceeds by selecting another node to expand until no node can be expanded (i.e., the node adaptation step terminates with N = 0 for all nodes) or a maximum number of environment interactions is met.

3See e.g., the discussion in Mutti et al. (2021) on the "importance of properly choosing the training horizon in accordance with the downstream-task horizon the policy will eventually face."
4While Gregor et al. (2016); Eysenbach et al. (2019) employ rewards in the log domain, we find that mapping rewards into [0, 1] works better in practice, as observed in Warde-Farley et al. (2019); Baumli et al. (2021).
5Ideally, we would set η = η† from Lem. 1; however, η† is non-trivial to compute. A strategy may be to solve (Pη′) for different values of η′ and select the one that maximizes the MI lower bound E[log qϕ(z|sdiff) − log ρ(z)]. In our experiments we rather use the same predefined parameter of η = 0.8, which avoids computational overhead and performs well across all environments.
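The control flow of Algorithm 1 can be summarized in a few lines of Python. The sketch below is a simplified rendition, not the released implementation: make_policies, policy_learning and discriminabilities stand in for policy creation, the POLICYLEARNING subroutine and the q̂ estimates; nodes are expanded FIFO for brevity (UPSIDE ranks them by discriminability); the interaction budget is omitted; and n_start, n_max are placeholder values (η = 0.8 follows footnote 5).

```python
from collections import deque

def upside(root, make_policies, policy_learning, discriminabilities,
           eta=0.8, n_start=4, n_max=10):
    """Skeleton of Algorithm 1 (UPSIDE): expansion, policy learning, node adaptation.

    make_policies(parent, n) creates n candidate child policies rooted at `parent`;
    policy_learning(tree, candidates) trains their skills and the discriminator;
    discriminabilities(candidates) returns the list of q-hat estimates of Equation 3.
    """
    tree, queue = [root], deque([root])
    while queue:                                          # tree expansion
        parent = queue.popleft()
        candidates = make_policies(parent, n_start)
        policy_learning(tree, candidates)
        if min(discriminabilities(candidates)) > eta:     # node addition
            while min(discriminabilities(candidates)) > eta and len(candidates) < n_max:
                candidates += make_policies(parent, 1)
                policy_learning(tree, candidates)
        else:                                             # node removal
            while min(discriminabilities(candidates)) < eta and len(candidates) > 1:
                q = discriminabilities(candidates)
                candidates.pop(q.index(min(q)))           # drop the least discriminable policy
                policy_learning(tree, candidates)
        consolidated = [c for c, q in zip(candidates, discriminabilities(candidates)) if q >= eta]
        tree += consolidated                              # new nodes rooted at `parent`
        queue.extend(consolidated)                        # they become expansion candidates in turn
    return tree
```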
|
4 RELATED WORK
|
URL methods can be broadly categorized depending on how the experience of the unsupervised |
|
phase is summarized to solve downstream tasks in a zero- or few-shot manner. This includes model- |
|
free (e.g., Pong et al., 2020), model-based (e.g., Sekar et al., 2020) and representation learning (e.g., |
|
Yarats et al., 2021) methods that build a representative replay buffer to learn accurate estimates or |
|
low-dimensional representations. An alternative line of work focuses on discovering a set of skills |
|
in an unsupervised way. Our approach falls in this category, on which we now focus this section. |
|
Skill discovery based on MI maximization was first proposed in VIC (Gregor et al., 2016), where |
|
the trajectories’ final states are considered in the reverse form of Equation 1 and the policy parame- |
|
ters π(z) and sampling distribution ρ are simultaneously learned (with a fixed number of skills NZ). |
|
DIAYN (Eysenbach et al., 2019) fixes a uniform ρ and weights skills with an action-entropy co- |
|
efficient (i.e., it additionally minimizes the MI between actions and skills given the state) to push |
|
the skills away from each other. DADS (Sharma et al., 2020) learns skills that are not only diverse |
|
but also predictable by learned dynamics models, using a generative model over observations and |
|
optimizing a forward form of MI I(s′; z|s) between the next state s′ and current skill z (with con- |
|
tinuous latent) conditioned on the current state s. EDL (Campos et al., 2020) shows that existing |
|
skill discovery approaches can provide insufficient coverage and relies on a fixed distribution over |
|
states that is either provided by an oracle or learned. SMM (Lee et al., 2019) uses the MI formalism |
|
to learn a policy whose state marginal distribution matches a target state distribution (e.g., uniform). |
|
Other MI-based skill discovery methods include Florensa et al. (2017); Hansen et al. (2019); Modhe |
|
et al. (2020); Baumli et al. (2021); Xie et al. (2021); Liu & Abbeel (2021); Strouse et al. (2021), and |
|
extensions in non-episodic settings (Xu et al., 2020; Lu et al., 2020). |
|
While most skill discovery approaches consider a fixed number of policies, a curriculum with in- |
|
creasing NZ is studied in Achiam et al. (2018); Aubret et al. (2020). We consider a similar dis- |
|
criminability criterion in the constraint in (Pη) and show that it makes it possible to maintain skills that can
|
be readily composed along a tree structure, which can either increase or decrease the support of |
|
available skills depending on the region of the state space. Recently, Zhang et al. (2021) propose a |
|
hierarchical RL method that discovers abstract skills while jointly learning a higher-level policy to |
|
maximize extrinsic reward. Our approach builds on a similar promise of composing skills instead |
|
of resetting to s0 after each execution, yet we articulate the composition differently, by exploiting |
|
the direct-then-diffuse structure to ground skills to the state space instead of being abstract. Har- |
|
tikainen et al. (2020) perform unsupervised skill discovery by fitting a distance function; while their |
|
approach also includes a directed part and a diffusive part for exploration, it learns only a single |
|
directed policy and does not aim to cover the entire state space. Approaches such as DISCERN |
|
(Warde-Farley et al., 2019) and Skew-Fit (Pong et al., 2020) learn a goal-conditioned policy in |
|
an unsupervised way with an MI objective. As explained by Campos et al. (2020), this can be in- |
|
terpreted as a skill discovery approach with latent Z = S, i.e., where each goal state can define |
|
a different skill. Conditioning on either goal states or abstract latent skills forms two extremes of |
|
the spectrum of unsupervised RL. As argued in Sect. 3.1, we target an intermediate approach of |
|
learning “cluster-conditioned” policies. Finally, an alternative approach to skill discovery builds on |
|
“spectral” properties of the dynamics of the MDP. This includes eigenoptions (Machado et al., 2017; |
|
2018) and covering options (Jinnai et al., 2019; 2020), and the algorithm of Bagaria et al. (2021) |
|
that builds a discrete graph representation which learns and composes spectral skills. |
|
|
                   Bottleneck Maze      U-Maze
RANDOM             29.17 (±0.57)        23.33 (±0.57)
DIAYN-10           17.67 (±0.57)        14.67 (±0.42)
DIAYN-20           23.00 (±1.09)        16.67 (±1.10)
DIAYN-50           30.00 (±0.72)        25.33 (±1.03)
DIAYN-curr         18.00 (±0.82)        15.67 (±0.87)
DIAYN-hier         38.33 (±0.68)        49.67 (±0.57)
EDL-10             27.00 (±1.41)        32.00 (±1.19)
EDL-20             31.00 (±0.47)        46.00 (±0.82)
EDL-50             33.33 (±0.42)        52.33 (±1.23)
SMM-10             19.00 (±0.47)        14.00 (±0.54)
SMM-20             23.67 (±1.29)        14.00 (±0.27)
SMM-50             28.00 (±0.82)        25.00 (±1.52)
Flat UPSIDE-10     40.67 (±1.50)        43.33 (±2.57)
Flat UPSIDE-20     47.67 (±0.31)        55.67 (±1.03)
Flat UPSIDE-50     51.33 (±1.64)        57.33 (±0.31)
UPSIDE             85.67 (±1.93)        71.33 (±0.42)

Table 2: Coverage on Bottleneck Maze and U-Maze: UPSIDE covers significantly more regions of the discretized state space than the other methods. The values represent the number of buckets that are reached, where the 50 × 50 space is discretized into 10 buckets per axis. To compare the global coverage of methods (and to be fair w.r.t. the amount of injected noise that may vary across methods), we roll-out for each model its associated deterministic policies.

[Figure 2: maze plots of the policies learned by UPSIDE, DIAYN-50 and EDL-50.]
Figure 2: Policies learned on the Bottleneck Maze (see Fig. 14 in App. C for the other methods): contrary to the baselines, UPSIDE successfully escapes the bottleneck region.
|
5 EXPERIMENTS
|
Our experiments investigate the following questions: i) Can UPSIDE incrementally cover an un- |
|
known environment while preserving the directedness of its skills? ii) Following the unsupervised |
|
phase, how can UPSIDE be leveraged to solve sparse-reward goal-reaching downstream tasks? |
|
iii) What is the impact of the different components of UPSIDE on its performance? |
|
We report results on navigation problems in continuous 2D mazes6 and on continuous control prob- |
|
lems (Brockman et al., 2016; Todorov et al., 2012): Ant, Half-Cheetah and Walker2d. We evaluate |
|
performance with the following tasks: 1) “coverage” which evaluates the extent to which the state |
|
space has been covered during the unsupervised phase, and 2) “unknown goal-reaching” whose ob- |
|
jective is to find and reliably reach an unknown goal location through fine-tuning of the policy. We |
|
perform our experiments based on the SaLinA framework (Denoyer et al., 2021). |
|
We compare UPSIDE to different baselines. First we consider DIAYN-NZ (Eysenbach et al., 2019), |
|
where NZ denotes a fixed number of skills. We introduce two new baselines derived from DIAYN: |
|
a) DIAYN-curr is a curriculum variant where the number of skills is automatically tuned following |
|
the same procedure as in UPSIDE, similar to Achiam et al. (2018), to ensure sufficient discriminabil- |
|
ity, and b) DIAYN-hier is a hierarchical extension of DIAYN where the skills are composed in a |
|
tree as in UPSIDE but without the diffusing part. We also compare to SMM (Lee et al., 2019), which |
|
is similar to DIAYN but includes an exploration bonus encouraging the policies to visit rarely en- |
|
countered states. In addition, we consider EDL (Campos et al., 2020) under the assumption that the state distribution oracle is available (since replacing it by SMM does not lead to satisfying results in the presence of bottleneck states, as shown in Campos et al., 2020). Finally, we consider the RANDOM
|
6The agent observes its current position and its actions (in [−1, +1]) control its shift in x and y coordinates. |
|
We consider two topologies of mazes illustrated in Fig. 2 with size 50 × 50 such that exploration is non-trivial. |
|
The Bottleneck maze is a harder version of the one in Campos et al. (2020, Fig. 1) whose size is only 10 × 10. |
|
|
[Figure 3: coverage curves on panels (a) Ant, (b) Half-Cheetah and (c) Walker2d; x-axis: Env interactions (×10^5), y-axis: Coverage; curves: UPSIDE, DIAYN-5, DIAYN-10, DIAYN-20, RANDOM.]
Figure 3: Coverage on control environments: UPSIDE covers the state space significantly more than DIAYN and RANDOM. The curve represents the number of buckets reached by the policies extracted from the unsupervised phase of UPSIDE and DIAYN as a function of the number of environment interactions. DIAYN and UPSIDE have the same amount of injected noise. Each axis is discretized into 50 buckets.
|
[Figure 4: panels (a) UPSIDE policies on Ant, (b) DIAYN policies on Ant, (c) Fine-tuning on Ant (x-axis: Env interactions (×10^7), y-axis: Average success rate; curves: UPSIDE, TD3, DIAYN).]
Figure 4: (a) & (b) Unsupervised phase on Ant: visualization of the policies learned by UPSIDE and DIAYN-20. We display only the final skill and the diffusing part of the UPSIDE policies. (c) Downstream tasks on Ant: we plot the average success rate over 48 unknown goals (with sparse reward) that are sampled uniformly in the [−8, 8]^2 square (using stochastic roll-outs) during the fine-tuning phase. UPSIDE achieves a higher success rate than DIAYN-20 and TD3.
|
policy, which samples actions uniformly in the action space. We use TD3 (Fujimoto et al., 2018) as the policy optimizer; we also tried SAC (Haarnoja et al., 2018), which gave results equivalent to TD3 but was harder to tune. Similar to e.g., Eysenbach et al. (2019); Bagaria & Konidaris (2020), we restrict the observation space of the discriminator to the Cartesian coordinates (x, y) for Ant and x for Half-Cheetah and Walker2d. All algorithms were run for Tmax = 1e7 unsupervised environment interactions in episodes of length Hmax = 200 (resp. 250) for mazes (resp. for control). For baselines, models are selected according to the cumulative intrinsic reward (as done in e.g., Strouse
|
et al., 2021), while UPSIDE, DIAYN-hier and DIAYN-curr are selected according to the high- |
|
est number of η-discriminable policies. On the downstream tasks, we consider ICM (Pathak et al., |
|
2017) as an additional baseline. See App. C for the full experimental details. |
|
Coverage. We analyze the coverage achieved by the various methods following an unsupervised |
|
phase of at most Tmax = 1e7 environment interactions. For UPSIDE, we report coverage for the |
|
skill and diffusing part lengths T = H = 10 in the continuous mazes (see App. D.4 for an ablation |
|
on the values of T, H) and T = H = 50 in control environments. Fig. 2 shows that UPSIDE man- |
|
ages to cover the near-entirety of the state space of the bottleneck maze (including the top-left room) |
|
by creating a tree of directed skills, while the other methods struggle to escape from the bottleneck |
|
region. This translates quantitatively in the coverage measure of Table 2 where UPSIDE achieves |
|
the best results. As shown in Fig. 3 and 4, UPSIDE clearly outperforms DIAYN and RANDOM in |
|
state-coverage of control environments, for the same number of environment interactions. In the |
|
Ant domain, traces from DIAYN (Fig. 4b) and discriminator curves in App. D.3 demonstrate that |
|
even though DIAYN successfully fits 20 policies by learning to take a few steps then hover, it fails |
|
to explore the environment. In Half-Cheetah and Walker2d, while DIAYN policies learn to fall on |
|
the agent’s back, UPSIDE learns to move forward/backward on its back through skill composition. |
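For reference, the coverage numbers of Table 2 and Fig. 3 amount to counting the visited cells of a uniform discretization. A minimal sketch of such a bucket-count metric is given below; the space bounds and the visited positions are illustrative assumptions, while the 50 × 50 extent and 10 buckets per axis follow the Table 2 caption.

```python
import numpy as np

def bucket_coverage(visited_xy, low, high, buckets_per_axis):
    """Number of distinct buckets reached when each axis of [low, high] is split uniformly."""
    xy = np.asarray(visited_xy, dtype=float)
    cells = np.floor((xy - low) / (high - low) * buckets_per_axis).astype(int)
    cells = np.clip(cells, 0, buckets_per_axis - 1)        # points exactly on the upper bound
    return len({tuple(c) for c in cells})

# Maze setting of Table 2: a 50 x 50 space discretized into 10 buckets per axis (100 buckets total).
rng = np.random.default_rng(0)
visited = rng.uniform(0.0, 50.0, size=(10_000, 2))         # placeholder visited (x, y) positions
print(bucket_coverage(visited, low=0.0, high=50.0, buckets_per_axis=10))
```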
|
Unknown goal-reaching tasks. |
|
We investigate how the tree of policies learned by UPSIDE |
|
in the unsupervised phase can be used to tackle goal-reaching downstream tasks. All unsupervised methods follow the same protocol: given an unknown7 goal g, i) we sample rollouts over the different learned policies, ii) then we select the best policy based on the maximum discounted cumulative reward collected, and iii) we fine-tune this policy (i.e., its sequence of directed skills and its final diffusing part) to maximize the sparse reward r(s) = I[∥s − g∥2 ≤ 1].

7Notice that if the goal was known, the learned discriminator could be directly used to identify the most promising skill to fine-tune.

[Figure 5: panels (a) UPSIDE: before and after fine-tuning, (b) DIAYN, (c) EDL, (d) ICM.]
Figure 5: Downstream task performance on Bottleneck Maze: UPSIDE achieves higher discounted cumulative reward on various unknown goals (see Fig. 15 in App. C for SMM and TD3 performance). From each of the 16 discretized regions, we randomly sample 3 unknown goals. For every method and goal seed, we roll-out each policy (learned in the unsupervised phase) during 10 episodes and select the one with largest cumulative reward to fine-tune (with sparse reward r(s) = I[∥s − g∥2 ≤ 1]). Formally, for a given goal g the reported value is γ^τ I[τ ≤ Hmax] with τ := inf{t ≥ 1 : ∥st − g∥2 ≤ 1}, γ = 0.99 and horizon Hmax = 200.

Figure 6: For an unknown goal location, UPSIDE identifies a promising policy in its tree and fine-tunes it.
|
Fig. 5 reports the discounted cumulative reward on various goals af- |
|
ter the fine-tuning phase. We see that UPSIDE accumulates more |
|
reward than the other methods, in particular in regions far from s0, |
|
where performing fine-tuning over the entire skill path is especially |
|
challenging. In Fig. 6 we see that UPSIDE’s fine-tuning can slightly |
|
deviate from the original tree structure to improve the goal-reaching |
|
behavior of its candidate policy. We also perform fine-tuning on the |
|
Ant domain under the same setting. In Fig. 4c, we show that UPSIDE |
|
clearly outperforms DIAYN-20 and TD3 when we evaluate the aver- |
|
age success rate of reaching 48 goals sampled uniformly in [−8, 8]2. |
|
Note that DIAYN particularly fails as its policies learned during the |
|
unsupervised phase all stay close to the origin s0. |
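The evaluation protocol above boils down to a sparse indicator reward and the discounted time to first reach the goal (γ = 0.99, Hmax = 200, radius 1, as in the Fig. 5 caption). A minimal sketch of these quantities and of the policy-selection step follows; the trajectory and rollout formats are assumptions.

```python
import numpy as np

def sparse_goal_reward(s, goal, radius=1.0):
    """r(s) = I[ ||s - g||_2 <= radius ], the sparse reward used for fine-tuning."""
    return float(np.linalg.norm(np.asarray(s) - np.asarray(goal)) <= radius)

def discounted_goal_value(trajectory, goal, gamma=0.99, h_max=200, radius=1.0):
    """Reported value gamma^tau * I[tau <= H_max], with tau the first step entering the goal ball."""
    for t, s in enumerate(trajectory[:h_max], start=1):
        if sparse_goal_reward(s, goal, radius) == 1.0:
            return gamma ** t
    return 0.0

def select_policy_to_finetune(rollouts_per_policy, goal):
    """Pick the learned policy whose rollouts collected the most (discounted) goal reward."""
    scores = [max(discounted_goal_value(traj, goal) for traj in rollouts)
              for rollouts in rollouts_per_policy]
    return int(np.argmax(scores))
```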
|
Ablative study of the UPSIDE components. The main components of UPSIDE that differ from ex- |
|
isting skill discovery approaches such as DIAYN are: the decoupled policy structure, the constrained |
|
optimization problem and the skill chaining via the growing tree. We perform ablations to show that |
|
all components are simultaneously required for good performance. First, we compare UPSIDE to |
|
flat UPSIDE, i.e., UPSIDE with a tree depth of 1 (T = 150, H = 50). Table 2 reveals that the
|
tree structuring is key to improve exploration and escape bottlenecks; it makes the agent learn on |
|
smaller and easier problems (i.e., short-horizon MDPs) and mitigates the optimization issues (e.g., |
|
non-stationary rewards). However, the diffusing part of flat UPSIDE largely improves the coverage |
|
performance over the DIAYN baseline, suggesting that the diffusing part is an interesting structural |
|
bias on the entropy regularization that pushes the policies away from each other. This is particularly |
|
useful on the Ant environment as shown in Fig. 4. A challenging aspect is to make the skill composi- |
|
tion work. As shown in Table 2, DIAYN-hier (a hierarchical version of DIAYN) performs worse than UPSIDE by a clear margin. In fact, UPSIDE's direct-then-diffuse decoupling enables
|
both policy re-usability for the chaining (via the directed skills) and local coverage (via the diffusing |
|
part). Moreover, as shown by the results of DIAYN-hier on the bottleneck maze, the constrained |
|
optimization problem (Pη) combined with the diffusing part is crucial to prevent consolidating too |
|
many policies, thus allowing a sample efficient growth of the tree structure. |
|
6 CONCLUSION AND LIMITATIONS
|
We introduced UPSIDE, a novel algorithm for unsupervised skill discovery designed to trade off |
|
between coverage and directedness and develop a tree of skills that can be used to perform ef- |
|
ficient exploration and solve sparse-reward goal-reaching downstream tasks. Limitations of our |
|
approach that constitute natural venues for future investigation are: 1) The diffusing part of each |
|
policy could be explicitly trained to maximize local coverage around the skill’s terminal state; 2) |
|
UPSIDE assumes a good state representation is provided as input to the discriminator; it would be
|
interesting to pair UPSIDE with effective representation learning techniques to tackle problems with |
|
high-dimensional input; 3) As UPSIDE relies on the ability to reset to establish a root node for its |
|
growing tree, it could be relevant to extend the approach in non-episodic environments. |
|
|
Acknowledgements |
|
We thank both Evrard Garcelon and Jonas Gehring for helpful discussion. |
|
REFERENCES |
|
Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery |
|
algorithms. arXiv preprint arXiv:1807.10299, 2018. |
|
Arthur Aubret, Laëtitia Matignon, and Salima Hassas. Elsim: End-to-end learning of reusable skills
|
through intrinsic motivation. In European Conference on Machine Learning and Principles and |
|
Practice of Knowledge Discovery in Databases (ECML-PKDD), 2020. |
|
Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International |
|
Conference on Learning Representations, 2020. |
|
Akhil Bagaria, Jason K Senthil, and George Konidaris. Skill discovery for exploration and planning |
|
using deep skill graphs. In International Conference on Machine Learning, pp. 521–531. PMLR, |
|
2021. |
|
David Barber and Felix Agakov. The im algorithm: a variational approach to information maxi- |
|
mization. Advances in neural information processing systems, 16(320):201, 2004. |
|
Kate Baumli, David Warde-Farley, Steven Hansen, and Volodymyr Mnih. Relative variational in- |
|
trinsic control. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. |
|
6732–6740, 2021. |
|
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environ- |
|
ment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47: |
|
253–279, 2013. |
|
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and |
|
Wojciech Zaremba. Openai gym, 2016. |
|
Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i Nieto, and Jordi
|
Torres. Explore, discover and learn: Unsupervised discovery of state-covering skills. In Interna- |
|
tional Conference on Machine Learning, 2020. |
|
Ludovic Denoyer, Alfredo de la Fuente, Song Duong, Jean-Baptiste Gaya, Pierre-Alexandre |
|
Kamienny, and Daniel H Thompson. |
|
Salina: Sequential learning of agents. |
|
arXiv preprint |
|
arXiv:2110.07910, 2021. |
|
Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: |
|
Learning skills without a reward function. In International Conference on Learning Representa- |
|
tions, 2019. |
|
Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical rein- |
|
forcement learning. arXiv preprint arXiv:1704.03012, 2017. |
|
Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor- |
|
critic methods. In International Conference on Machine Learning, pp. 1587–1596. PMLR, 2018. |
|
Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv |
|
preprint arXiv:1611.07507, 2016. |
|
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy |
|
maximum entropy deep reinforcement learning with a stochastic actor. In International Confer- |
|
ence on Machine Learning, pp. 1861–1870. PMLR, 2018. |
|
Steven Hansen, Will Dabney, Andre Barreto, David Warde-Farley, Tom Van de Wiele, and |
|
Volodymyr Mnih. Fast task inference with variational intrinsic successor features. In Interna- |
|
tional Conference on Learning Representations, 2019. |
|
Kristian Hartikainen, Xinyang Geng, Tuomas Haarnoja, and Sergey Levine. Dynamical distance |
|
learning for semi-supervised and unsupervised skill discovery. In International Conference on |
|
Learning Representations, 2020. |
|
|
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: |
|
variational information maximizing exploration. In Proceedings of the 30th International Con- |
|
ference on Neural Information Processing Systems, pp. 1117–1125, 2016. |
|
Yuu Jinnai, Jee Won Park, David Abel, and George Konidaris. Discovering options for exploration |
|
by minimizing cover time. In International Conference on Machine Learning, pp. 3130–3139. |
|
PMLR, 2019. |
|
Yuu Jinnai, Jee Won Park, Marlos C Machado, and George Konidaris. Exploration in reinforcement |
|
learning with deep covering options. In International Conference on Learning Representations, |
|
2020. |
|
Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, and Ruslan Salakhutdi- |
|
nov. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019. |
|
Hao Liu and Pieter Abbeel. |
|
Aps: Active pretraining with successor features. |
|
In International |
|
Conference on Machine Learning, pp. 6736–6747. PMLR, 2021. |
|
Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill- |
|
space planning. arXiv preprint arXiv:2012.03548, 2020. |
|
Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for op- |
|
tion discovery in reinforcement learning. In International Conference on Machine Learning, pp. |
|
2295–2304. PMLR, 2017. |
|
Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray |
|
Campbell. Eigenoption discovery through the deep successor representation. In International |
|
Conference on Learning Representations, 2018. |
|
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle- |
|
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level |
|
control through deep reinforcement learning. nature, 518(7540):529–533, 2015. |
|
Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra, |
|
and Ramakrishna Vedantam. Ir-vic: Unsupervised discovery of sub-goals for transfer in rl. In |
|
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI- |
|
20, 2020. |
|
Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsi- |
|
cally motivated reinforcement learning. In Advances in neural information processing systems, |
|
pp. 2125–2133, 2015. |
|
Mirco Mutti, Lorenzo Pratissoli, and Marcello Restelli. Task-agnostic exploration via policy gra- |
|
dient of a non-parametric state entropy estimate. In Proceedings of the AAAI Conference on |
|
Artificial Intelligence, volume 35, pp. 9028–9036, 2021. |
|
Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration |
|
by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and |
|
Pattern Recognition Workshops, pp. 16–17, 2017. |
|
Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew- |
|
fit: State-covering self-supervised reinforcement learning. In International Conference on Ma- |
|
chine Learning, 2020. |
|
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. |
|
Planning to explore via self-supervised world models. In International Conference on Machine |
|
Learning, pp. 8583–8592. PMLR, 2020. |
|
Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware |
|
unsupervised discovery of skills. In International Conference on Learning Representations, 2020. |
|
DJ Strouse, Kate Baumli, David Warde-Farley, Vlad Mnih, and Steven Hansen. Learning more |
|
skills through optimistic exploration. arXiv preprint arXiv:2107.14226, 2021. |
|
|
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. |
|
In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. |
|
IEEE, 2012. |
|
David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and |
|
Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In In- |
|
ternational Conference on Learning Representations, 2019. |
|
Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, and Florian Shkurti. Skill transfer |
|
via partially amortized hierarchical planning. In International Conference on Learning Represen- |
|
tations, 2021. |
|
Kelvin Xu, Siddharth Verma, Chelsea Finn, and Sergey Levine. Continual learning of control prim- |
|
itives: Skill discovery via reset-games. Advances in Neural Information Processing Systems, 33, |
|
2020. |
|
Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Reinforcement learning with proto- |
|
typical representations. In Proceedings of the 38th International Conference on Machine Learn- |
|
ing, pp. 11920–11931. PMLR, 2021. |
|
Jesse Zhang, Haonan Yu, and Wei Xu. Hierarchical reinforcement learning by discovering intrinsic |
|
options. In International Conference on Learning Representations, 2021. |
|
|
Appendix
A Theoretical Details on Section 3
B UPSIDE Algorithm
C Experimental Details
D Additional Experiments
|
A THEORETICAL DETAILS ON SECTION 3
|
A.1 PROOFS OF LEMMAS 1 AND 2
|
Restatement of Lemma 1. There exists a value η† ∈(0, 1) such that solving (Pη†) is equivalent to |
|
maximizing a lower bound on the mutual information objective max NZ,ρ,π,ϕ I(Sdiff; Z). |
|
Proof. We assume that the number of available skills is upper bounded, i.e., 1 ≤ NZ ≤ Nmax. We begin by lower bounding the negative conditional entropy by using the well known lower bound of Barber & Agakov (2004) on the mutual information

\[
-H(Z \mid S_{\mathrm{diff}}) = \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_{\mathrm{diff}}}\big[\log p(z \mid s_{\mathrm{diff}})\big] \ge \sum_{z \in Z} \rho(z)\, \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big].
\]

We now use that any weighted average is lower bounded by its minimum component, which yields

\[
-H(Z \mid S_{\mathrm{diff}}) \ge \min_{z \in Z} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big].
\]

Thus the following objective is a lower bound on the original objective of maximizing I(Sdiff; Z)

\[
\max_{N_Z = N, \rho, \pi, \phi} \Big\{ H(Z) + \min_{z \in [N]} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big] \Big\}. \tag{4}
\]

Interestingly, the second term in Equation 4 no longer depends on the skill distribution ρ, while the first entropy term H(Z) is maximized by setting ρ to the uniform distribution over N skills (i.e., maxρ H(Z) = log(N)). This simplifies the optimization, which now only depends on N. Thus Equation 4 is equivalent to

\[
\max_{N_Z = N} \Big\{ \log(N) + \max_{\pi, \phi} \min_{z \in [N]} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big] \Big\}. \tag{5}
\]

We define the functions

\[
f(N) := \log(N), \qquad g(N) := \max_{\pi, \phi} \min_{z \in [N]} \mathbb{E}_{s_{\mathrm{diff}}}\big[\log q_\phi(z \mid s_{\mathrm{diff}})\big].
\]

Let N† ∈ arg maxN f(N) + g(N) and η† := exp g(N†) ∈ (0, 1). We now show that any solution of (Pη†) is a solution of Equation 5. Indeed, denote by N⋆ the value that optimizes (Pη†). First, by validity of the constraint, it holds that g(N⋆) ≥ log η† = g(N†). Second, since N† meets the constraint and N⋆ is the maximal number of skills that satisfies the constraint of (Pη†), by optimality we have that N⋆ ≥ N† and therefore f(N⋆) ≥ f(N†) since f is non-decreasing. We thus have

\[
g(N^\star) \ge g(N^\dagger), \quad f(N^\star) \ge f(N^\dagger) \;\Longrightarrow\; f(N^\star) + g(N^\star) \ge f(N^\dagger) + g(N^\dagger).
\]

Putting everything together, an optimal solution for (Pη†) is an optimal solution for Equation 5, which is equivalent to Equation 4, which is a lower bound of the MI objective, thus concluding the proof.
|
|
Restatement of Lemma 2. The function g is non-increasing in N. |
|
Proof. We have that g(N) := max_{π,q} min_{z∈[N]} E_{s∼π(z)}[log q(z|s)], where throughout the proof we write s instead of sdiff for ease of notation. Here the optimization variables are π ∈ (Π)^N (i.e., a set of N policies) and q : S → ∆(N), i.e., a classifier of states to N possible classes, where ∆(N) denotes the N-simplex. For z ∈ [N], let

\[
h_N(\pi, q, z) := \mathbb{E}_{s \sim \pi(z)}\big[\log q(z \mid s)\big], \qquad f_N(\pi, q) := \min_{z \in [N]} h_N(\pi, q, z).
\]

Let (π′, q′) ∈ arg max_{π,q} f_{N+1}(π, q). We define π̃ := (π′(1), . . . , π′(N)) ∈ (Π)^N and q̃ : S → ∆(N) such that q̃(i|s) := q′(i|s) for any i ∈ [N − 1] and q̃(N|s) := q′(N|s) + q′(N + 1|s). Intuitively, we are "discarding" policy N + 1 and "merging" class N + 1 with class N. Then it holds that

\[
g(N+1) = f_{N+1}(\pi', q') = \min_{z \in [N+1]} h_{N+1}(\pi', q', z) \le \min_{z \in [N]} h_{N+1}(\pi', q', z).
\]

Now, by construction of π̃, q̃, we have for any i ∈ [N − 1] that h_{N+1}(π′, q′, i) = h_N(π̃, q̃, i). As for class N, since π̃(N) = π′(N), by definition of q̃(N|·) and from monotonicity of the log function, it holds that h_{N+1}(π′, q′, N) = E_{s∼π′(N)}[log q′(N|s)] satisfies

\[
h_{N+1}(\pi', q', N) \le \mathbb{E}_{s \sim \tilde{\pi}(N)}\big[\log \tilde{q}(N \mid s)\big] = h_N(\tilde{\pi}, \tilde{q}, N).
\]

Hence, we get that

\[
\min_{z \in [N]} h_{N+1}(\pi', q', z) \le \min_{z \in [N]} h_N(\tilde{\pi}, \tilde{q}, z) = f_N(\tilde{\pi}, \tilde{q}) \le g(N).
\]

Putting everything together gives g(N + 1) ≤ g(N), which yields the desired result.
|
A.2 SIMPLE ILLUSTRATION OF THE ISSUE WITH UNIFORM-ρ MI MAXIMIZATION
|
Figure 7: The agent must assign (possibly stochastically) N skills to M states: under the prior of uniform skill distribution, can the MI be increased by varying the number of skills N?
|
This section complements Sect. 3.2: we show a simple scenario where 1) considering both a uniform ρ prior and a fixed skill number NZ provably leads to suboptimal MI maximization, and where 2) the UPSIDE strategy of considering a uniform ρ restricted to the η-discriminable skills can provably increase the MI for small enough η.

Consider the simple scenario (illustrated in Fig. 7) where the agent has N skills indexed by n and must assign them to M states indexed by m. We assume that the execution of each skill deterministically brings the agent to the assigned state, yet the agent may assign stochastically (i.e., more than one state per skill). (A non-RL way to interpret this is that we want to allocate N balls into M boxes.) Denote by p_{n,m} ∈ [0, 1] the probability that skill n is assigned to state m. It must hold that ∀n ∈ [N], Σ_m p_{n,m} = 1. Denote by I the MI between the skill variable and the assigned state variable, and by Ī the MI under the prior that the skill sampling distribution ρ is uniform, i.e., ρ(n) = 1/N. It holds that
|
\[
\overline{I}(N, M) = -\sum_{n} \frac{1}{N} \log \frac{1}{N} + \sum_{n,m} \frac{1}{N} p_{n,m} \log \frac{\frac{1}{N} p_{n,m}}{\sum_{n} \frac{1}{N} p_{n,m}}
= \log N + \frac{1}{N} \sum_{n,m} p_{n,m} \log \frac{p_{n,m}}{\sum_{n} p_{n,m}}.
\]
|
Let Ī⋆(N, M) := max_{\{p_{n,m}\}} Ī(N, M) and {p⋆_{n,m}} ∈ arg max_{\{p_{n,m}\}} Ī(N, M). We also define the discriminability of skill n in state m as
\[
q_{n,m} := \frac{p_{n,m}}{\sum_{n} p_{n,m}},
\]
as well as the minimum discriminability of the optimal assignment as
\[
\eta := \min_{n} \max_{m} q^\star_{n,m}.
\]
|
|
Lemma 3. There exist values of N and M such that the uniform-ρ MI can be improved by removing a skill (i.e., by decreasing N).
|
Proof. The following example shows that with M = 2 states, it is beneficial for the uniform-ρ MI maximization to go from N = 3 to N = 2 skills. Indeed, we can numerically compute the optimal solutions and we obtain for N = 3 and M = 2 that
\[
\overline{I}^{\,\star}(N=3, M=2) \approx 0.918,
\qquad
p^\star_{n,m} = \begin{pmatrix} 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
q^\star_{n,m} = \begin{pmatrix} 0 & 0.5 \\ 0 & 0.5 \\ 1 & 0 \end{pmatrix},
\qquad
\eta = 0.5,
\]
whereas for N = 2 and M = 2,
\[
\overline{I}^{\,\star}(N=2, M=2) = 1,
\qquad
p^\star_{n,m} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
q^\star_{n,m} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
\eta = 1.
\]
As a result, Ī⋆(N = 2, M = 2) > Ī⋆(N = 3, M = 2), which concludes the proof. The intuition why Ī⋆ is increased by decreasing N is that for N = 2 there is one skill per state whereas for N = 3 the skills must necessarily overlap. Note that this contrasts with the original MI (that also optimizes ρ), where decreasing N cannot improve the optimal MI.
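To make these numbers easy to check, the following minimal numpy snippet (ours, not part of the original analysis) evaluates Ī in bits for the two assignments reported above; the function and variable names are illustrative only.

```python
import numpy as np

def uniform_mi(p):
    """Uniform-prior MI (in bits) of an assignment matrix p of shape (N, M),
    where p[n, m] is the probability that skill n is assigned to state m."""
    p = np.asarray(p, dtype=float)
    N = p.shape[0]
    col = p.sum(axis=0, keepdims=True)  # sum over skills n of p[n, m], for each state m
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / col), 0.0)  # convention 0 log 0 = 0
    return np.log2(N) + terms.sum() / N

p3 = [[0, 1], [0, 1], [1, 0]]  # reported optimal assignment for N = 3, M = 2
p2 = [[0, 1], [1, 0]]          # reported optimal assignment for N = 2, M = 2
print(round(uniform_mi(p3), 3), round(uniform_mi(p2), 3))  # prints: 0.918 1.0
```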
|
The previous simple example hints at the fact that the value of the minimum discriminability of the optimal assignment η may be a good indicator to determine whether to remove a skill. The following more general lemma indeed shows that a sufficient condition for the uniform-ρ MI to be increased by removing a skill is that η is small enough.
|
Lemma 4. Assume without loss of generality that the skill indexed by N has the minimum discriminability η, i.e., N ∈ arg min_n max_m q⋆_{n,m}. Define
\[
\Delta(N, \eta) := \log N - \frac{N-1}{N} \log(N-1) + \frac{1}{N} \log \eta.
\]
If ∆(N, η) ≤ 0 (which holds for small enough η), then removing skill N results in a larger uniform-ρ optimal MI, i.e., Ī⋆(N, M) < Ī⋆(N − 1, M).
|
Proof. It holds that
\[
\begin{aligned}
\overline{I}^{\,\star}(N, M) &= \log N + \frac{1}{N}\left( \sum_{n\in[N-1]} \sum_{m\in[M]} p^\star_{n,m} \log q^\star_{n,m} + \sum_{m\in[M]} p^\star_{N,m} \log \eta \right) \\
&= \log N - \frac{N-1}{N}\log(N-1) + \frac{N-1}{N}\left( \log(N-1) + \frac{1}{N-1} \sum_{n\in[N-1]} \sum_{m\in[M]} p^\star_{n,m} \log q^\star_{n,m} \right) + \frac{1}{N}\log\eta \\
&= \Delta(N, \eta) + \frac{N-1}{N}\, \overline{I}^{\,\star}(N-1, M).
\end{aligned}
\]
As a result, if ∆(N, η) ≤ 0 then Ī⋆(N, M) < Ī⋆(N − 1, M).
|
|
B  UPSIDE ALGORITHM

B.1  VISUAL ILLUSTRATIONS OF UPSIDE’S DESIGN MENTIONED IN SECTION 3
|
Figure 8: Decoupled structure of an UPSIDE policy: a directed skill followed by a diffusing part.

Figure 9: In the above UPSIDE tree example, executing policy z = 7 means sequentially composing the skills of policies z ∈ {2, 5, 7} and then deploying the diffusing part of policy z = 7.
|
B.2  HIGH-LEVEL APPROACH OF UPSIDE

Figure 10: High-level approach of UPSIDE.
|
B.3  DETAILS OF ALGORITHM 1
|
We give in Alg. 2 a more detailed version of Alg. 1 and we list some additional explanations below. |
|
• When optimizing the discriminator, rather than sampling (state, policy) pairs with equal probability for all nodes of the tree T, we put more weight (e.g., 3×) on already consolidated policies, which seeks to prevent the new policies from invading the territory of the older policies that were previously correctly learned.
• A replay buffer BRL is instantiated at every new expansion (line 2), thus avoiding the need to start collecting data from scratch with the new policies at every POLICYLEARNING call.
• J (line 22) corresponds to the ratio of policy updates w.r.t. discriminator updates, i.e., for how long the discriminator reward is kept fixed, in order to add stationarity to the reward signal.
• Instead of using a number of iterations K to stop the training loop of the POLICYLEARNING function (line 16), we use a maximum number of environment interactions Ksteps for node expansion. Note that this is the same for DIAYN-hier and DIAYN-curr.
|
|
Algorithm 2: Detailed UPSIDE

Parameters: Discriminability threshold η ∈ (0, 1), branching factors Nstart, Nmax
Initialize: Tree T initialized as a root node indexed by 0, policy candidates Q = {0}, state buffers BZ = {0 : [ ]}
   while Q ≠ ∅ do  // tree expansion
1      Dequeue a policy z ∈ Q, create N = Nstart nodes C(z) rooted at z and add new key z to BZ
2      Instantiate new replay buffer BRL
3      POLICYLEARNING(BRL, BZ, T, C(z))
4      if min_{z′∈C(z)} d(z′) > η then  // node addition
5          while min_{z′∈C(z)} d(z′) > η and N < Nmax do
6              Increment N = N + 1 and add one policy to C(z)
7              POLICYLEARNING(BRL, BZ, T, C(z))
8      else  // node removal
9          while min_{z′∈C(z)} d(z′) < η and N > 1 do
10             Reduce N = N − 1 and remove the least discriminable policy from C(z)
11             POLICYLEARNING(BRL, BZ, T, C(z))
12     Enqueue in Q the η-discriminable nodes of C(z)

13 POLICYLEARNING(Replay buffer BRL, State buffers BZ, Tree T, policies to update ZU)
14 Optimization parameters: patience K, policy-to-discriminator update ratio J, Kdiscr discriminator update epochs, Kpol policy update epochs
15 Initialize: Discriminator qϕ with |T| classes
16 for K iterations do  // training loop
17     For all z′ ∈ ZU, clear BZ[z′], then collect and add B states from the diffusing part of π(z′)
18     Train the discriminator qϕ for Kdiscr steps on the dataset ∪_{z′∈T} BZ[z′]
19     Compute the discriminability d(z′) = q̂ϕ^B(z′) = (1/|BZ[z′]|) Σ_{s∈BZ[z′]} qϕ(z′|s) for all z′ ∈ ZU
20     if min_{z′∈ZU} d(z′) > η then  // early stopping
21         Break
22     for J iterations do
23         For all z′ ∈ ZU, sample a trajectory from π(z′) and add it to the replay buffer BRL
24         For all z′ ∈ ZU, update policy π(z′) for Kpol steps on the replay buffer BRL to optimize the discriminator reward as in Sect. 3.1, keeping the skills of parent policies fixed
25 Compute the discriminability d(z′) for all z′ ∈ ZU
|
• The state buffer size B needs to be sufficiently large compared to H so that the state buffers of each policy represent well the distribution of the states generated by the policy’s diffusing part. In practice we set B = 10H.
• In POLICYLEARNING, we add Kinitial random (uniform) transitions to the replay buffer for each newly instantiated policy.
• Moreover, in POLICYLEARNING, instead of sampling the new policies uniformly we sample them in a round-robin fashion (i.e., one after the other), which can simply be seen as a variance-reduced version of uniform sampling.
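To make the curriculum of Algorithm 2 concrete, the sketch below (ours, not the authors' implementation) mirrors the discriminability estimate of line 19 and the node addition/removal rule of lines 4-12; `train` and `make_policy` are hypothetical callables standing in for POLICYLEARNING and for the instantiation of a fresh child policy.

```python
import numpy as np

def discriminability(skill_probs, z):
    """Empirical d(z) of Algorithm 2 (line 19): average probability q_phi(z|s) assigned
    to skill z over the states collected from its diffusing part (rows of skill_probs)."""
    return float(np.mean(skill_probs[:, z]))

def adapt_num_policies(children, eta, n_max, train, make_policy):
    """Node addition/removal of Algorithm 2 (lines 4-12). train(children) re-runs
    POLICYLEARNING and returns {z: d(z)}; make_policy() instantiates a fresh child."""
    d = train(children)
    if min(d.values()) > eta:                      # node addition
        while min(d.values()) > eta and len(children) < n_max:
            children.append(make_policy())
            d = train(children)
    else:                                          # node removal
        while min(d.values()) < eta and len(children) > 1:
            children.remove(min(d, key=d.get))     # drop the least discriminable policy
            d = train(children)
    return [z for z in children if d[z] >= eta]    # eta-discriminable nodes to enqueue
```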
|
B.4  ILLUSTRATION OF THE EVOLUTION OF UPSIDE’S TREE ON A WALL-FREE MAZE

See Fig. 11.

B.5  ILLUSTRATION OF EVOLUTION OF UPSIDE’S TREE ON THE BOTTLENECK MAZE

See Fig. 12.
|
|
[Plot: average discriminator accuracy (y-axis) vs. environment interactions (x-axis, up to 1.5e5), shown alongside tree snapshots A-I.]
|
Figure 11: Fine-grained evolution of the tree structure on a wall-free maze with Nstart = 4 and Nmax = 8. The environment is a wall-free continuous maze with initial state s0 located at the center of the maze. Image A represents the diffusing part around s0. In image B, Nstart = 4 policies are trained, yet one of them (in lime yellow) is not sufficiently discriminable, thus it is pruned, resulting in image C. A small number of interactions is enough to ensure that the three policies are η-discriminable (image C). In image D, a fourth policy (in green) is able to become η-discriminable. New policies are added, trained and η-discriminated, from 5 policies (image E) to Nmax = 8 policies (image F). Then a policy (in yellow) is expanded with Nstart = 4 policies (image G). They are all η-discriminable so additional policies are added (images H, I, . . . ). The process continues until convergence or until time-out (as done here). On the left, we plot the number of active policies (which represents the number of policies that are being trained at the current level of the tree) as well as the average discriminator accuracy over the active policies.
|
Figure 12: Incremental expansion of the tree learned by UPSIDE towards unexplored regions of the state space in the Bottleneck Maze.
|
C  EXPERIMENTAL DETAILS

C.1  BASELINES
|
DIAYN-NZ. This corresponds to the original DIAYN algorithm (Eysenbach et al., 2019), where NZ is the number of skills to be learned. In order to make the architecture more similar to UPSIDE, we use distinct policies for each skill, i.e., they do not share weights, as opposed to Eysenbach et al. (2019). While this may come at the price of sample efficiency, it may also put fewer constraints on the model (e.g., by avoiding gradient interference).
|
DIAYN-curr. We augment DIAYN with a curriculum that makes it less dependent on an adequate tuning of the number of skills of DIAYN. We consider the curriculum of UPSIDE, where we start learning with Nstart policies during a period of time/number of interactions. If the configuration satisfies the discriminability threshold η, a skill is added, otherwise a skill is removed or learning is stopped (as in Alg. 2, lines 5-12). Note that the increasing version of this curriculum is similar to the one proposed in VALOR (Achiam et al., 2018, Sect. 3.3). In our experiments, we use Nstart = 1.
|
|
DIAYN-hier. We extend DIAYN through the use of a hierarchy of directed skills, built following the UPSIDE principles. The difference between DIAYN-hier and UPSIDE is that the discriminator reward is computed over the entire directed skill trajectory, while it is guided by the diffusing part for UPSIDE. This introduced baseline can be interpreted as an ablation of UPSIDE without the decoupled structure of policies.
|
SMM. We consider SMM (Lee et al., 2019) as it is state-of-the-art in terms of coverage, at least on long-horizon control problems, although Campos et al. (2020) reported its poor performance in hard-to-explore bottleneck mazes. We tested the regular SMM version, i.e., learning a state density model with a VAE, yet we failed to make it work on the maze domains that we consider. As we use the cartesian (x, y) positions in maze domains, learning the identity function on two-dimensional input data is too easy for a VAE, which prevents the density model from providing a useful exploration signal. In our implementation, the exploration bonus is instead obtained by maintaining a multinomial distribution over “buckets of states” obtained by discretization (as in our coverage computation), resulting in a computationally efficient implementation that is more stable than the original VAE-based method. Note that the state distribution is computed using states from past-but-recent policies, as suggested in the original paper.
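A minimal sketch of this count-based variant (our own illustration; the Laplace smoothing and the negative-log-probability form of the bonus are assumptions about details not spelled out above):

```python
import numpy as np
from collections import Counter

class CountBasedStateDensity:
    """Multinomial density over discretized (x, y) buckets; the exploration bonus is
    -log of the estimated probability of the bucket containing the current state."""

    def __init__(self, low, high, grid_size=10):
        self.low, self.high, self.grid_size = np.asarray(low), np.asarray(high), grid_size
        self.counts, self.total = Counter(), 0

    def _bucket(self, state):
        rel = (np.asarray(state) - self.low) / (self.high - self.low)
        return tuple(np.clip((rel * self.grid_size).astype(int), 0, self.grid_size - 1))

    def update(self, states):
        for s in states:
            self.counts[self._bucket(s)] += 1
            self.total += 1

    def bonus(self, state):
        # Laplace-smoothed bucket probability: rarely visited buckets yield a larger bonus.
        p = (self.counts[self._bucket(state)] + 1) / (self.total + self.grid_size ** 2)
        return -np.log(p)
```

In use, `update` is called with the states of past-but-recent policies and `bonus` provides the state-entropy part of SMM's intrinsic reward.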
|
EDL. We consider EDL (Campos et al., 2020) with the strong assumption of an available state distribution oracle (since replacing it by SMM does not lead to satisfying results in the presence of bottleneck states, as shown in Campos et al., 2020, page 7: “We were unable to explore this type of maze effectively with SMM”). In our implementation, the oracle samples states uniformly in the mazes, avoiding the need to handle complex exploration; note that this setting is not realistic when facing unknown environments.
|
C.2  ARCHITECTURE AND HYPERPARAMETERS
|
The architecture of the different methods remains the same in all our experiments, except that the number of hidden units changes across the considered environments. For UPSIDE, flat UPSIDE (i.e., UPSIDE with a tree depth of 1), DIAYN, DIAYN-curr, DIAYN-hier and SMM, the multiple policies do not share weights, whereas EDL policies all share the same network because of the constraint that the policy embedding z is learnt in a supervised fashion with the VQ-VAE rather than with the unsupervised RL objective. We consider a decoupled actor and critic optimized with the TD3 algorithm (Fujimoto et al., 2018), though we also tried SAC (Haarnoja et al., 2018), which showed results equivalent to TD3 but with harder tuning.8 The actor and the critic have the same architecture: a neural network with two hidden layers (of size 64 for maze environments and 256 for control environments) that processes observations. The discriminator is a model with two hidden layers (of size 64) whose output size is the number of skills in the tree.
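As an illustration of the discriminator described above, here is a minimal sketch (ours; the deep-learning framework and the ReLU activation are assumptions not stated in the text):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Two-hidden-layer classifier q_phi(z|s): maps a state to a (log-)distribution
    over the skills currently in the tree."""

    def __init__(self, state_dim, num_skills, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state):
        return torch.log_softmax(self.net(state), dim=-1)  # log q_phi(.|s)
```

Since the output size equals the number of skills in the tree, the discriminator is re-instantiated with |T| classes at each call to POLICYLEARNING (Algorithm 2, line 15) and trained with a cross-entropy loss on (state, policy) pairs drawn from the state buffers.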
|
Common (for all methods and environments) optimization hyper-parameters: |
|
• Discount factor: γ = 0.99 |
|
• σTD3 = {0.1, 0.15, 0.2} |
|
• Q-functions soft updates temperature τ = 0.005 |
|
• Policy Adam optimizer with learning rate lrpol = {1e−3, 1e−4} |
|
• policy inner epochs Kpol = {10, 100} |
|
• policy batch size Bpol = {64} |
|
• Discriminator delay: J = {1, 10} |
|
• Replay buffer maximum size: 1e6 |
|
• Kinitial = 1e3 |
|
We consider the same range of hyper-parameters in the downstream tasks. |
|
8 For completeness, we report here the performance of DIAYN-SAC in the continuous mazes: DIAYN-SAC with NZ = 10 on the Bottleneck maze: 21.0 (± 0.50); on the U-maze: 17.5 (± 0.75), to compare with DIAYN-TD3 with NZ = 10 on the Bottleneck maze: 17.67 (± 0.57); on the U-maze: 14.67 (± 0.42). We thus see that DIAYN-SAC fails to cover the state space, performing similarly to DIAYN-TD3 (albeit over a larger range of hyperparameter search, possibly explaining the slight improvement).
|
|
UPSIDE, DIAYN and SMM variants (common for all environments) optimization hyper-parameters:
|
• Discriminator batch size Bdiscr = 64 |
|
• Discriminator Adam optimizer with learning rate lrdiscr = {1e−3, 1e−4} |
|
• discriminator inner epochs Kdiscr = {10, 100} |
|
• Discriminator delay: J = {1, 10} |
|
• State buffer size B = 10H where the diffusing part length H is environment-specific. |
|
EDL optimization hyper-parameters: We kept the same hyper-parameters as Campos et al. (2020). The VQ-VAE's architecture consists of an encoder that takes states as input and maps them to a code with 2 hidden layers of 128 hidden units and a final linear layer; the decoder takes the code and maps it back to states, also with 2 hidden layers of 128 hidden units. It is trained on the oracle state distribution, then kept fixed during policy learning. Contrary to the UPSIDE, DIAYN and SMM variants, the reward is stationary.
|
• βcommitment = {0.25, 0.5} |
|
• VQ-VAE’s code size 16 |
|
• VQ-VAE batch size Bvq-vae = {64, 256} |
|
• total number of epochs: 5000 (trained until convergence) |
|
• VQ-VAE Adam optimizer with learning rate lrvq-vae = {1e−3, 1e−4} |
|
Maze-specific hyper-parameters:
• Ksteps = 5e4 (or, in wall-clock time, 10 minutes)
|
• T = H = 10 |
|
• Max episode length Hmax = 200 |
|
• Max number of interactions Tmax = 1e7 during unsupervised pre-training and downstream tasks. |
|
Control-specific optimization hyper-parameters:
• Ksteps = 1e5 (or, in wall-clock time, 1 hour)
|
• T = H = 50 |
|
• Max episode length Hmax = 250 |
|
• Max number of interactions Tmax = 1e7 during unsupervised pre-training and downstream tasks. |
|
Note that hyperparameters are kept fixed for the downstream tasks too. |
|
C.3  EXPERIMENTAL PROTOCOL
|
We now detail the experimental protocol that we followed, which is common to UPSIDE and the baselines, on all environments. It consists of the following three stages:
|
Unsupervised pre-training phase. Given an environment, each algorithm is trained without any extrinsic reward on Nunsup = 3 seeds, which we call unsupervised seeds (to account for the randomness in the model weights' initialization and environment stochasticity if present). Each training lasts for a maximum number of Tmax environment steps (split in episodes of length Hmax). This protocol actually favors the baselines since, by design, UPSIDE may decide to use fewer environment interactions than Tmax thanks to its termination criterion (triggered if it cannot fit any more policies); for instance, all baselines were allowed Tmax = 1e7 on the maze environments, but UPSIDE finished in at most 1e6 environment steps, fitting on average 57 and 51 policies respectively for the Bottleneck Maze and U-Maze.
|
Model selection. For each unsupervised seed, we tune the hyper-parameters of each algorithm according to a certain performance metric. For the baselines, we consider the cumulated intrinsic reward (as done in e.g., Strouse et al., 2021) averaged over stochastic roll-outs. For UPSIDE, DIAYN-hier and DIAYN-curr, the model selection criterion is the number of consolidated policies, i.e., how many policies were η-discriminated during their training stage. For each method, we thus have as many models as seeds, i.e. Nunsup.
|
|
Downstream tasks. For each algorithm, we evaluate the Nunsup selected models on a set of tasks. All results on downstream tasks will show a performance averaged over the Nunsup seeds.
|
• Coverage. We evaluate to what extent the state space has been covered by discretizing it into buckets (10 per axis on the continuous maze domains) and counting how many buckets have been reached. To compare the global coverage of the methods (and also to be fair w.r.t. the amount of injected noise, which may vary across methods), we roll out for each model its associated deterministic policies (see the sketch after this list).
• Fine-tuning on goal-reaching task. We consider goal-oriented tasks in the discounted episodic setting where the agent needs to reach some unknown goal position within a certain radius (i.e., the goal location is unknown until it is reached once) and with a sparse reward signal (i.e., reward of 1 in the goal location, 0 otherwise). The environment terminates when the goal is reached or when the number of timesteps exceeds Hmax. The combination of unknown goal location and sparse reward makes the exploration problem very challenging, and calls upon the ability to first cover (for goal finding) and then navigate (for reliable goal reaching) the environment efficiently. To evaluate performance in an exhaustive manner, we discretize the state space into Bgoal = 14 buckets and we randomly sample Ngoal = 3 goals from each of these buckets according to what we call goal seeds (thus there are Bgoal × Ngoal different goals in total). For every goal seed, we initialize each algorithm with the set of policies learned during the unsupervised pre-training. We then roll out each policy for Nexplo episodes to compute the cumulative reward of the policy, and select the best one to fine-tune. For UPSIDE, we complete the selected policy (of length denoted by L) by replacing the diffusing part with a skill whose length is the remaining number of interactions left, i.e., Hmax − L. The ability to select a good policy is intrinsically linked to the coverage performance of the model, but also to few-shot adaptation. Learning curves and performance are averaged over unsupervised seeds, goal seeds, and over roll-outs of the stochastic policy. Since we are in the discounted episodic setting, fine-tuning makes sense in order to reach the goal as fast as possible. This is particularly important as UPSIDE, because of its tree policy structure, can reach the goal sub-optimally w.r.t. the discount. On the maze environments, we consider all unsupervised pre-training baselines as well as “vanilla” baselines trained from scratch during the downstream tasks: TD3 (Fujimoto et al., 2018) and ICM (Pathak et al., 2017). In the Ant environment, we also consider Ngoal = 3 and Bgoal = 14 in the [−8, 8]² square.
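The coverage metric of the first bullet can be computed as in the following minimal sketch (ours; the bucket boundaries low/high are assumptions):

```python
import numpy as np

def coverage(visited_states, low, high, buckets_per_axis=10):
    """Number of distinct buckets of the discretized state space reached by the given
    (x, y) states, with buckets_per_axis buckets per axis."""
    states = np.asarray(visited_states, dtype=float)
    rel = (states - np.asarray(low)) / (np.asarray(high) - np.asarray(low))
    idx = np.clip((rel * buckets_per_axis).astype(int), 0, buckets_per_axis - 1)
    return len({tuple(row) for row in idx})
```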
|
|
(a) UPSIDE discriminator   (b) DIAYN discriminator   (c) EDL VQ-VAE

Figure 13: Environment divided into colors according to the most likely latent variable Z, as given by (from left to right) the discriminator learned by UPSIDE, the discriminator learned by DIAYN and the VQ-VAE learned by EDL. Contrary to DIAYN, UPSIDE's optimization enables the discriminator training and the policy training to catch up to each other, thus nicely clustering the discriminator predictions across the state space. EDL's VQ-VAE also manages to output good predictions (recall that we consider the EDL version with the strong assumption of the available state distribution oracle, see Campos et al., 2020), yet the skill learning is unable to cover the entire state space due to exploration issues and sparse rewards.
|
(a) SMM   (b) DIAYN-curr   (c) DIAYN-hier   (d) Flat UPSIDE

Figure 14: Complement to Fig. 2: visualization of the policies learned on the Bottleneck Maze for the remaining methods.
|
(a) SMM   (b) TD3

Figure 15: Complement to Fig. 5: heatmaps of downstream task performance after fine-tuning for the remaining methods.
|
D  ADDITIONAL EXPERIMENTS

D.1  ADDITIONAL RESULTS ON BOTTLENECK MAZE
|
Here we include 1) Fig. 13 for an analysis of the predictions of the discriminator (see caption for details); 2) Fig. 14 for the policy visualizations for the remaining methods (i.e., those not reported in Fig. 2); 3) Fig. 15 for the downstream task performance for the remaining methods (i.e., those not reported in Fig. 5).
|
|
(a) UPSIDE   (b) DIAYN   (c) EDL   (d) SMM   (e) DIAYN-curr   (f) DIAYN-hier   (g) Flat UPSIDE

Figure 16: Visualization of the policies learned on U-Maze. This is the equivalent of Fig. 2 for U-Maze.
|
(a) UPSIDE before fine-tuning   (b) UPSIDE   (c) DIAYN   (d) EDL   (e) ICM   (f) SMM   (g) TD3

Figure 17: Heat maps of downstream task performance on U-Maze. This is the equivalent of Fig. 5 for U-Maze.
|
D.2  ADDITIONAL RESULTS ON U-MAZE
|
Fig. 16 visualizes the policies learned during the unsupervised phase (i.e., the equivalent of Fig. 2 for the U-Maze), and Fig. 17 reports the heatmaps of downstream task performance (i.e., the equivalent of Fig. 5 for the U-Maze). The conclusion is the same as on the Bottleneck Maze described in Sect. 5: UPSIDE clearly outperforms all the baselines, both in coverage (Fig. 16) and in unknown goal-reaching performance (Fig. 17) in the downstream task phase.
|
|
[Four plots of average discriminator accuracy vs. environment interactions (up to 1e7): (a) Bottleneck Maze with DIAYN-10, DIAYN-20, DIAYN-50; (b) Ant, (c) Half-Cheetah and (d) Walker2d with DIAYN-5, DIAYN-10, DIAYN-20.]

Figure 18: Average discriminability of the DIAYN-NZ policies. The smaller NZ is, the easier it is to obtain a close-to-perfect discriminability. However, even for quite large NZ (50 for mazes and 20 in control environments), DIAYN is able to achieve a good discriminator accuracy, most often because policies learn how to “stop” in some state.
|
D.3  ANALYSIS OF THE DISCRIMINABILITY

In Fig. 18 (see caption) we investigate the average discriminability of DIAYN depending on the number of policies NZ.
|
|
[Panels (a)-(e): policies learned on the Bottleneck Maze for T = H = 10, 20, 30, 40, 50; panels (f)-(j): policies learned on the U-Maze for T = H = 10, 20, 30, 40, 50.]

Figure 19: Ablation on the length of UPSIDE policies (T, H): visualization of the policies learned on the Bottleneck Maze (top) and the U-Maze (bottom) for different values of T, H. (Right table) Coverage values (according to the same procedure as in Table 2). Recall that T and H denote respectively the lengths of the directed skill and of the diffusing part of an UPSIDE policy.

UPSIDE        Bottleneck Maze    U-Maze
T = H = 10    85.67 (±1.93)      71.33 (±0.42)
T = H = 20    87.33 (±0.42)      67.67 (±1.50)
T = H = 30    77.33 (±3.06)      68.33 (±0.83)
T = H = 40    59.67 (±1.81)      57.33 (±0.96)
T = H = 50    51.67 (±0.63)      58.67 (±1.26)
|
D.4  ABLATION ON THE LENGTHS T AND H OF THE UPSIDE POLICIES
|
Our ablation on the mazes in Fig. 19 investigates the sensitivity of UPSIDE w.r.t. T and H, the lengths of the directed skills and diffusing parts of the policies. For the sake of simplicity, we kept T = H. It shows that the method is quite robust to reasonable choices of T and H, i.e., equal to 10 (as done in all other experiments) but also 20 or 30. Naturally, the performance degrades if T, H are chosen too large w.r.t. the environment size, in particular in the Bottleneck Maze which requires “narrow” exploration; composing disproportionately long skills thus hinders the coverage. For T = H = 50, we recover the performance of flat UPSIDE.
|
D.5  FINE-TUNING RESULTS ON HALF-CHEETAH AND WALKER2D
|
In Sect. 5, we reported the fine-tuning results on Ant. We now focus on Half-Cheetah and Walker2d, and similarly observe that UPSIDE largely outperforms the baselines:
|
               UPSIDE            TD3               DIAYN
Half-Cheetah   174.93 (±1.45)    108.67 (±25.61)   0.0 (±0.0)
Walker2d       46.29 (±3.09)     14.33 (±1.00)     2.13 (±0.74)
|
We note that the fine-tuning experiment on Half-Cheetah exactly corresponds to the standard Sparse-Half-Cheetah task, which rewards states whose x-coordinate is larger than 15. Meanwhile, Sparse-Walker2d provides a reward as long as the x-coordinate is larger than 1 and the agent is standing up. Our fine-tuning task on Walker2d is more challenging as we require the x-coordinate to be larger than 4. Yet the agent can be rewarded even if it is not standing up, since our downstream task is purely goal-reaching. Interestingly, we indeed noticed that UPSIDE may reach the desired goal location while not standing up (e.g., by crawling), which may occur as there is no incentive in UPSIDE to remain standing up, only to be discriminable.
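For concreteness, the two sparse signals discussed above reduce to threshold functions on the x-coordinate (a sketch; the function names are ours):

```python
def sparse_half_cheetah_reward(x_position):
    """Standard Sparse-Half-Cheetah: reward 1 once the x-coordinate exceeds 15."""
    return 1.0 if x_position > 15.0 else 0.0

def walker2d_goal_reward(x_position):
    """Our goal-reaching downstream task on Walker2d: reward 1 once the x-coordinate
    exceeds 4, with no standing-up requirement (the agent may crawl to the goal)."""
    return 1.0 if x_position > 4.0 else 0.0
```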
|
|
|