Published as a conference paper at ICLR 2022 DIRECT THEN DIFFUSE: INCREMENTAL UNSUPERVISED SKILL DISCOVERY FOR STATE COVERING AND GOAL REACHING Pierre-Alexandre Kamienny∗1,2, Jean Tarbouriech∗1,3, Sylvain Lamprier2, Alessandro Lazaric1, Ludovic Denoyer†1 1 Meta AI 2 Sorbonne University, LIP6/ISIR 3 Inria Scool ABSTRACT Learning meaningful behaviors in the absence of reward is a difficult problem in reinforcement learning. A desirable and challenging unsupervised objective is to learn a set of diverse skills that provide a thorough coverage of the state space while being directed, i.e., reliably reaching distinct regions of the environment. In this paper, we build on the mutual information framework for skill discovery and introduce UPSIDE, which addresses the coverage-directedness trade-off in the following ways: 1) We design policies with a decoupled structure of a directed skill, trained to reach a specific region, followed by a diffusing part that induces a local coverage. 2) We optimize policies by maximizing their number under the constraint that each of them reaches distinct regions of the environment (i.e., they are sufficiently discriminable) and prove that this serves as a lower bound to the original mutual information objective. 3) Finally, we compose the learned directed skills into a growing tree that adaptively covers the environment. We illustrate in several navigation and control environments how the skills learned by UPSIDE solve sparse-reward downstream tasks better than existing baselines. 1 INTRODUCTION Deep reinforcement learning (RL) algorithms have been shown to effectively solve a wide variety of complex problems (e.g., Mnih et al., 2015; Bellemare et al., 2013). However, they are often designed to solve one single task at a time and they need to restart the learning process from scratch for any new problem, even when it is defined on the very same environment (e.g., a robot navigating to different locations in the same apartment). Recently, Unsupervised RL (URL) has been proposed as an approach to address this limitation. In URL, the agent first interacts with the environment without any extrinsic reward signal. Afterward, the agent leverages the experience accumulated during the unsupervised learning phase to efficiently solve a variety of downstream tasks defined on the same environment. This approach is particularly effective in problems such as navigation (see e.g., Bagaria et al., 2021) and robotics (see e.g., Pong et al., 2020) where the agent is often required to readily solve a wide range of tasks while the dynamics of environment remains fixed. In this paper, we focus on the unsupervised objective of discovering a set of skills that can be used to efficiently solve sparse-reward downstream tasks. In particular, we build on the insight that mutual information (MI) between the skills’ latent variables and the states reached by them can formalize the dual objective of learning policies that both cover and navigate the environment efficiently. Indeed, maximizing MI has been shown to be a powerful approach for encouraging exploration in RL (Houthooft et al., 2016; Mohamed & Rezende, 2015) and for unsupervised skill discovery (e.g., Gregor et al., 2016; Eysenbach et al., 2019; Achiam et al., 2018; Sharma et al., 2020; Campos et al., 2020). Nonetheless, learning policies that maximize MI is a challenging optimization problem. 
Several approximations have been proposed to simplify it at the cost of possibly deviating from the original objective of coverage and directedness (see Sect. 4 for a review of related work). ∗equal contribution †Now at Ubisoft La Forge {pakamienny,jtarbouriech,lazaric}@fb.com, sylvain.lamprier@isir.upmc.fr, ludovic.den@gmail.com 1 Published as a conference paper at ICLR 2022 Figure 1: Overview of UPSIDE. The black dot corresponds to the initial state. (A) A set of random policies is initialized, each policy being composed of a directed part called skill (illustrated as a black arrow) and a dif- fusing part (red arrows) which induces a local coverage (colored circles). (B) The skills are then updated to maximize the discriminability of the states reached by their corresponding diffusing part (Sect. 3.1). (C) The least discriminable policies are iteratively removed while the remaining policies are re-optimized. This is ex- ecuted until the discriminability of each policy satisfies a given constraint (Sect. 3.2). In this example two policies are consolidated. (D) One of these policies is used as basis to add new policies, which are then opti- mized following the same procedure. For the “red” and “purple” policy, UPSIDE is not able to find sub-policies of sufficient quality and thus they are not expanded any further. (E) At the end of the process, UPSIDE has created a tree of policies covering the state space, with skills as edges and diffusing parts as nodes (Sect. 3.3). In this paper, we propose UPSIDE (UnsuPervised Skills that dIrect then DiffusE) to learn a set of policies that can be effectively used to cover the environment and solve goal-reaching downstream tasks. Our solution builds on the following components (Fig. 1): • Policy structure (Sect. 3.1, see Fig. 1 (A)). We consider policies composed of two parts: 1) a directed part, referred to as the skill, that is trained to reach a specific region of the environment, and 2) a diffusing part that induces a local coverage around the region attained by the first part. This structure favors coverage and directedness at the level of a single policy. • New constrained objective (Sect. 3.2, see Fig. 1 (B) & (C)). We then introduce a constrained opti- mization problem designed to maximize the number of policies under the constraint that the states reached by each of the diffusing parts are distinct enough (i.e., they satisfy a minimum level of discriminability). We prove that this problem can be cast as a lower bound to the original MI objective, thus preserving its coverage-directedness trade-off. UPSIDE solves it by adaptively adding or removing policies to a given initial set, without requiring any prior knowledge on a sensible number of policies. • Tree structure (Sect. 3.3, see Fig. 1 (D) & (E)). Leveraging the directed nature of the skills, UPSIDE effectively composes them to build longer and longer policies organized in a tree struc- ture. This overcomes the need of defining a suitable policy length in advance. Thus in UPSIDE we can consider short policies to make the optimization easier, while composing their skills along a growing tree structure to ensure an adaptive and thorough coverage of the environment. The combination of these components allows UPSIDE to effectively adapt the number and the length of policies to the specific structure of the environment, while learning policies that ensure coverage and directedness. 
We study the effectiveness of UPSIDE and the impact of its components in hard-to-explore continuous navigation and control environments, where UPSIDE improves over existing baselines both in terms of exploration and learning performance.

2 SETTING

We consider the URL setting where the agent interacts with a Markov decision process (MDP) M with state space S, action space A, dynamics p(s′|s, a), and no reward. The agent starts each episode from a designated initial state s0 ∈ S.¹ Upon termination of the chosen policy, the agent is then reset to s0. This setting is particularly challenging from an exploration point of view since the agent cannot rely on the initial distribution to cover the state space.

¹More generally, s0 could be drawn from any distribution supported over a compact region.

We recall the MI-based unsupervised skill discovery approach (see e.g., Gregor et al., 2016). Denote by Z some (latent) variables on which the policies of length T are conditioned (we assume that Z is categorical for simplicity and because it is the most common case in practice). There are three optimization variables: (i) the cardinality of Z denoted by N_Z, i.e., the number of policies (we write Z = {1, . . . , N_Z} = [N_Z]), (ii) the parameters π(z) of the policy indexed by z ∈ Z, and (iii) the policy sampling distribution ρ (i.e., ρ(z) is the probability of sampling policy z at the beginning of the episode). Denote policy z's action distribution in state s by π(·|z, s) and the entropy function by H. Let the variable S_T be the random (final) state induced by sampling a policy z from ρ and executing π(z) from s0 for an episode. Denote by p_{π(z)}(s_T) the distribution over (final) states induced by executing policy z, by p(z|s_T) the probability of z being the policy to induce (final) state s_T, and let p(s_T) = \sum_{z \in Z} ρ(z) p_{π(z)}(s_T). Maximizing the MI between Z and S_T can be written as \max_{N_Z, ρ, π} I(S_T; Z), where

I(S_T; Z) = H(S_T) - H(S_T | Z) = -\sum_{s_T} p(s_T) \log p(s_T) + \sum_{z \in Z} \rho(z) \, \mathbb{E}_{s_T | z}\big[\log p_{\pi(z)}(s_T)\big]
          = H(Z) - H(Z | S_T) = -\sum_{z \in Z} \rho(z) \log \rho(z) + \sum_{z \in Z} \rho(z) \, \mathbb{E}_{s_T | z}\big[\log p(z | s_T)\big],   (1)

where in the expectations s_T | z ∼ p_{π(z)}(s_T). In the first formulation, the entropy term over states captures the requirement that policies thoroughly cover the state space, while the second term measures the entropy over the states reached by each policy and thus promotes policies that have a directed behavior. Learning the optimal N_Z, ρ, and π to maximize Equation 1 is a challenging problem and several approximations have been proposed (see e.g., Gregor et al., 2016; Eysenbach et al., 2019; Achiam et al., 2018; Campos et al., 2020). Many approaches focus on the so-called reverse formulation of the MI (second line of Equation 1). In this case, the conditional distribution p(z|s_T) is usually replaced with a parametric model q_ϕ(z|s_T) called the discriminator, which is trained via a negative log-likelihood loss simultaneously with all other variables. Then one can maximize the lower bound (Barber & Agakov, 2004): I(S_T; Z) ≥ E_{z∼ρ(z), τ∼π(z)}[log q_ϕ(z|s_T) − log ρ(z)], where we denote by τ ∼ π(z) trajectories sampled from the policy indexed by z. As a result, each policy π(z) can be trained with RL to maximize the intrinsic reward r_z(s_T) := log q_ϕ(z|s_T) − log ρ(z).

3 THE UPSIDE ALGORITHM

In this section we detail the three main components of UPSIDE, which is summarized in Sect. 3.4.
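Before doing so, the discriminator-based reward of Sect. 2, which UPSIDE reuses for its directed skills (Sect. 3.1), can be made concrete with a minimal sketch: a categorical discriminator trained by negative log-likelihood on (state, policy-index) pairs, and the reward r_z(s) = log q_ϕ(z|s) − log ρ(z) under a uniform ρ. The following is an illustrative numpy implementation with a simple softmax-linear discriminator, not the paper's actual code.

```python
import numpy as np

class Discriminator:
    """Softmax-linear q_phi(z|s): a toy stand-in for the neural discriminator."""
    def __init__(self, state_dim, n_policies, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((state_dim, n_policies))
        self.b = np.zeros(n_policies)
        self.lr = lr

    def log_probs(self, s):
        logits = s @ self.W + self.b
        logits = logits - logits.max()          # numerical stability
        return logits - np.log(np.exp(logits).sum())

    def update(self, s, z):
        """One negative log-likelihood step on a single (state, policy index) pair."""
        p = np.exp(self.log_probs(s))
        grad_logits = p.copy()
        grad_logits[z] -= 1.0                   # d(-log q(z|s)) / d logits
        self.W -= self.lr * np.outer(s, grad_logits)
        self.b -= self.lr * grad_logits

def intrinsic_reward(disc, s, z, n_policies):
    """r_z(s) = log q_phi(z|s) - log rho(z), with rho uniform over the policies."""
    return disc.log_probs(s)[z] - np.log(1.0 / n_policies)

# Usage: states visited under policy z both train the discriminator and
# produce the reward maximized by the RL optimizer for that policy.
disc = Discriminator(state_dim=2, n_policies=4)
s, z = np.array([0.5, -1.0]), 2
disc.update(s, z)
print(intrinsic_reward(disc, s, z, n_policies=4))
```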
3.1 DECOUPLED POLICY STRUCTURE OF DIRECT-THEN-DIFFUSE

While the trade-off between coverage and directedness is determined by the MI objective, the amount of stochasticity of each policy (e.g., injected via a regularization on the entropy over the actions) also has a major impact on the effectiveness of the overall algorithm (Eysenbach et al., 2019). In fact, while randomness can promote broader coverage, a highly stochastic policy tends to induce a distribution p_{π(z)}(s_T) over final states with high entropy, thus increasing H(S_T|Z) and losing directedness. In UPSIDE, we define policies with a decoupled structure (see Fig. 1 (A)) composed of a) a directed part (of length T), that we refer to as skill, with low stochasticity and trained to reach a specific region of the environment, and b) a diffusing part (of length H) with high stochasticity to promote local coverage of the states around the region reached by the skill.

Coherently with this structure, the state variable in the conditional entropy in Equation 1 becomes any state reached during the diffusing part (denote by S_diff the random variable) and not just the skill's terminal state. Following Sect. 2, we define an intrinsic reward r_z(s) = log q_ϕ(z|s) − log ρ(z), and the skill of policy z maximizes the cumulative reward over the states traversed by the diffusing part. Formally, we can conveniently define the objective function:

\max_{\pi(z)} \; \mathbb{E}_{\tau \sim \pi(z)} \Big[ \sum_{t \in J} \alpha \cdot r_z(s_t) + \beta \cdot H(\pi(\cdot | z, s_t)) \Big],   (2)

where J = {T, . . . , T + H} and α = 1, β = 0 (resp. α = 0, β = 1) when optimizing for the skill (resp. diffusing part). In words, the skill is incentivized to bring the diffusing part to a discriminable region of the state space, while the diffusing part is optimized by a simple random walk policy (i.e., a stochastic policy with uniform distribution over actions) to promote local coverage around s_T.

Table 1: Instantiation of Equation 2 for each part of an UPSIDE policy, and for VIC (Gregor et al., 2016) and DIAYN (Eysenbach et al., 2019) policies.

                 | UPSIDE directed skill | UPSIDE diffusing part | VIC policy | DIAYN policy
state variable   | S_diff                | S_diff                | S_T        | S
J                | {T, . . . , T + H}    | {T, . . . , T + H}    | {T}        | {1, . . . , T}
(α, β)           | (1, 0)                | (0, 1)                | (1, 0)     | (1, 1)

Table 1 illustrates how UPSIDE's policies compare to other methods. Unlike VIC and similar to DIAYN, the diffusing parts of the policies tend to "push" the skills away so as to reach diverse regions of the environment. The combination of the directedness of the skills and local coverage of the diffusing parts thus ensures that the whole environment can be properly visited with N_Z ≪ |S| policies.²

²Equation 1 is maximized by setting N_Z = |S| (i.e., max_Y I(X, Y) = I(X, X) = H(X)), where each z represents a goal-conditioned policy reaching a different state, which implies having as many policies as states, thus making the learning particularly challenging.

Furthermore, the diffusing part can be seen as defining a cluster of states that represents the goal region of the directed skill. This is in contrast with DIAYN policies, whose stochasticity may be spread over the whole trajectory. This allows us to "ground" the latent variable representations of the policies Z to specific regions of the environment (i.e., the clusters). As a result, maximizing the MI I(S_diff; Z) can be seen as learning a set of "cluster-conditioned" policies.
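To make the decoupled structure concrete, the following is a minimal rollout sketch (not the authors' implementation): T skill steps, treated as deterministic here for simplicity, followed by H uniform random-walk steps; only the diffusing-part states are stored for the discriminator, and only they contribute to the skill's reward (α = 1, β = 0 in Equation 2). The gym-like environment, the `skill_action` callable and the `reward_fn` are illustrative assumptions.

```python
import numpy as np

class PointEnv:
    """Toy 2D continuous environment (illustrative stand-in, not the paper's mazes)."""
    def reset(self):
        self.s = np.zeros(2)
        return self.s.copy()
    def step(self, a):
        self.s = np.clip(self.s + np.clip(a, -1, 1), -25, 25)   # 50x50 arena
        return self.s.copy(), 0.0, False, {}

def direct_then_diffuse_rollout(env, skill_action, reward_fn, T=10, H=10, rng=np.random):
    """One rollout of an UPSIDE policy: T directed skill steps, then H random-walk steps.

    skill_action(s) -> action of the directed skill.
    reward_fn(s)    -> discriminator-based reward r_z(s) (Sect. 2), credited to the skill
                       for every state visited by the diffusing part.
    Returns the skill's cumulative intrinsic reward and the diffusing-part states
    that would feed the discriminator's state buffer B_z.
    """
    s = env.reset()
    for _ in range(T):                       # directed part: follow the skill
        s, _, _, _ = env.step(skill_action(s))
    diffuse_states, skill_return = [], 0.0
    for _ in range(H):                       # diffusing part: uniform random walk
        s, _, _, _ = env.step(rng.uniform(-1, 1, size=2))
        diffuse_states.append(s.copy())
        skill_return += reward_fn(s)
    return skill_return, diffuse_states

# Example usage with a fixed "go north-east" skill and a dummy reward function.
env = PointEnv()
ret, states = direct_then_diffuse_rollout(env, lambda s: np.ones(2), lambda s: 0.0)
```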
3.2 A CONSTRAINED OPTIMIZATION PROBLEM

In this section, we focus on how to optimize the number of policies N_Z and the policy sampling distribution ρ(z). The standard practice for Equation 1 is to preset a fixed number of policies N_Z and to fix the distribution ρ to be uniform (see e.g., Eysenbach et al., 2019; Baumli et al., 2021; Strouse et al., 2021). However, using a uniform ρ over a fixed number of policies may be highly suboptimal, in particular when N_Z is not carefully tuned. In App. A.2 we give a simple example and a theoretical argument on how the MI can be ensured to increase by removing skills with low discriminability when ρ is uniform.

Motivated by this observation, in UPSIDE we focus on maximizing the number of policies that are sufficiently discriminable. We fix the sampling distribution ρ to be uniform over N policies and define the following constrained optimization problem

\max_{N \geq 1} \; N \quad \text{s.t.} \quad g(N) \geq \log \eta, \qquad \text{where } g(N) := \max_{\pi, \phi} \min_{z \in [N]} \mathbb{E}_{s_{\mathrm{diff}}} \big[ \log q_\phi(z | s_{\mathrm{diff}}) \big],   (Pη)

where q_ϕ(z|s_diff) denotes the probability of z being the policy traversing s_diff during its diffusing part according to the discriminator, and η ∈ (0, 1) defines a minimum discriminability threshold. By optimizing for (Pη), UPSIDE automatically adapts the number of policies to promote coverage, while still guaranteeing that each policy reaches a distinct region of the environment. Alternatively, we can interpret (Pη) under the lens of clustering: the aim is to find the largest number of clusters (i.e., the region reached by the directed skill and covered by its associated diffusing part) with a sufficient level of inter-cluster distance (i.e., discriminability) (see Fig. 1). The following lemma (proof in App. A.1) formally links the constrained problem (Pη) back to the original MI objective.

Lemma 1. There exists a value η† ∈ (0, 1) such that solving (Pη†) is equivalent to maximizing a lower bound on the mutual information objective max_{N_Z, ρ, π, ϕ} I(S_diff; Z).

Since (Pη†) is a lower bound to the MI, optimizing it ensures that the algorithm does not deviate too much from the dual covering and directed behavior targeted by MI maximization. Interestingly, Lem. 1 provides a rigorous justification for using a uniform sampling distribution restricted to the η-discriminable policies, which is in striking contrast with most of the MI-based literature, where a uniform sampling distribution ρ is defined over the predefined number of policies. In addition, our alternative objective (Pη) has the benefit of providing a simple greedy strategy to optimize the number of policies N. Indeed, the following lemma (proof in App. A.1) ensures that starting with N = 1 (where g(1) = 0) and increasing it until the constraint g(N) ≥ log η is violated is guaranteed to terminate with the optimal number of policies.

Lemma 2. The function g is non-increasing in N.
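Lemma 2 is what makes an incremental search over N sound: since g can only decrease as policies are added, one can grow N until the constraint first fails (or shrink it until the constraint first holds). The sketch below is schematic and assumes an oracle `g` is available; in the actual algorithm each evaluation of g corresponds to (re)training the policies and the discriminator (Sect. 3.4).

```python
import math

def greedy_num_policies(g, eta, n_start=1, n_max=20):
    """Largest N with g(N) >= log(eta), found incrementally (valid since g is non-increasing).

    `g` stands for the inner maximization of (P_eta); here it is assumed given, whereas
    UPSIDE estimates it by training policies and a discriminator (Alg. 1).
    """
    n = n_start
    if g(n) < math.log(eta):
        # remove policies until the constraint holds (or a single policy remains)
        while n > 1 and g(n) < math.log(eta):
            n -= 1
        return n
    # otherwise add policies while the constraint still holds
    while n < n_max and g(n + 1) >= math.log(eta):
        n += 1
    return n

# Toy example: pretend discriminability decays as 1/N, i.e. g(N) = log(1/N).
print(greedy_num_policies(lambda n: math.log(1.0 / n), eta=0.3))   # -> 3
```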
3.3 COMPOSING SKILLS IN A GROWING TREE STRUCTURE

Both the original MI objective and our constrained formulation (Pη) depend on the initial state s0 and on the length of each policy. Although these quantities are usually predefined and only appear implicitly in the equations, they have a crucial impact on the obtained behavior. In fact, resetting after each policy execution unavoidably restricts the coverage to a radius of at most T + H steps around s0. This may suggest setting T and H to large values. However, increasing T makes training the skills more challenging, while increasing H may not be sufficient to improve coverage.

Instead, we propose to "extend" the length of the policies through composition. We rely on the key insight that the constraint in (Pη) guarantees that the directed skill of each η-discriminable policy reliably reaches a specific (and distinct) region of the environment and is thus re-usable and amenable to composition. We thus propose to chain the skills so as to reach further and further parts of the state space. Specifically, we build a growing tree, where the root node is a diffusing part around s0, the edges represent the skills, and the nodes represent the diffusing parts. When a policy z is selected, the directed skills of its predecessor policies in the tree are executed first (see Fig. 9 in App. B for an illustration). Interestingly, this growing tree structure builds a curriculum on the episode lengths, which grow as the sequence (iT + H)_{i≥1}, thus avoiding the need for prior knowledge on an adequate horizon of the downstream tasks.³ Here this knowledge is replaced by T and H, which are more environment-agnostic and task-agnostic choices as they rather have an impact on the size and shape of the learned tree (e.g., the smaller T and H, the bigger the tree).

³See e.g., the discussion in Mutti et al. (2021) on the "importance of properly choosing the training horizon in accordance with the downstream-task horizon the policy will eventually face."

3.4 IMPLEMENTATION

We are now ready to introduce the UPSIDE algorithm, which provides a specific implementation of the components described before (see Fig. 1 for an illustration, Alg. 1 for a short pseudo-code and Alg. 2 in App. B for the detailed version). We first make approximations so that the constraint in (Pη) is easier to estimate. We remove the logarithm from the constraint to have an estimation range of [0, 1] and thus lower variance.⁴ We also replace the expectation over s_diff with an empirical estimate \hat{q}^B_ϕ(z) = (1/|B_z|) \sum_{s \in B_z} q_ϕ(z|s), where B_z denotes a small replay buffer, which we call state buffer, that contains states collected during a few rollouts by the diffusing part of π_z. In our experiments, we take B = |B_z| = 10H. Integrating this in (Pη) leads to

\max_{N \geq 1} \; N \quad \text{s.t.} \quad \max_{\pi, \phi} \min_{z \in [N]} \hat{q}^B_\phi(z) \geq \eta,   (3)

where η is a hyper-parameter.⁵ From Lem. 2, this optimization problem in N can be solved using the incremental policy addition or removal in Alg. 1 (lines 5 & 9), independently from the number of initial policies N.

⁴While Gregor et al. (2016); Eysenbach et al. (2019) employ rewards in the log domain, we find that mapping rewards into [0, 1] works better in practice, as observed in Warde-Farley et al. (2019); Baumli et al. (2021).
⁵Ideally, we would set η = η† from Lem. 1, however η† is non-trivial to compute. A strategy may be to solve (Pη′) for different values of η′ and select the one that maximizes the MI lower bound E[log q_ϕ(z|s_diff) − log ρ(z)]. In our experiments we rather use the same predefined parameter of η = 0.8, which avoids computational overhead and performs well across all environments.

Algorithm 1: UPSIDE
Parameters: Discriminability threshold η ∈ (0, 1), branching factor N^start, N^max.
Initialize: Tree T initialized as a root node 0, policy candidates Q = {0}.
while Q ≠ ∅ do   // tree expansion
 1   Dequeue a policy z ∈ Q and create N = N^start policies C(z).
 2   POLICYLEARNING(T, C(z)).
 3   if min_{z′∈C(z)} \hat{q}^B_ϕ(z′) > η then   // Node addition
 4       while min_{z′∈C(z)} \hat{q}^B_ϕ(z′) > η and N < N^max do
 5           Increment N = N + 1 and add one policy to C(z).
 6           POLICYLEARNING(T, C(z)).
 7   else   // Node removal
 8       while min_{z′∈C(z)} \hat{q}^B_ϕ(z′) < η and N > 1 do
 9           Reduce N = N − 1 and remove least discriminable policy from C(z).
10           POLICYLEARNING(T, C(z)).
11   Add η-discriminable policies C(z) to Q, and to T as nodes rooted at z.
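Two pieces of Alg. 1 are easy to make concrete: (i) how a policy of the tree is executed by first replaying the directed skills of its ancestors (Sect. 3.3), and (ii) how the empirical discriminability \hat{q}^B_ϕ(z) of Equation 3 is estimated from the state buffer of its diffusing part. The sketch below is an illustrative reading with hypothetical data structures (`Node`, `state_buffer`) and a gym-like environment, not the authors' code.

```python
import numpy as np

class Node:
    """A policy of the UPSIDE tree: an incoming directed skill plus its diffusing part."""
    def __init__(self, parent=None, skill=None):
        self.parent, self.skill = parent, skill   # skill: callable state -> action (None at the root)
        self.state_buffer = []                    # B_z: states collected by the diffusing part

def skill_chain(node):
    """Directed skills on the root-to-node path (the root itself has no skill)."""
    chain = []
    while node is not None and node.skill is not None:
        chain.append(node.skill)
        node = node.parent
    return list(reversed(chain))

def execute(env, node, T=10, H=10, rng=np.random):
    """Run the composed policy: each skill on the path for T steps, then the
    diffusing random walk for H steps (episode length = depth * T + H)."""
    s = env.reset()
    for skill in skill_chain(node):
        for _ in range(T):
            s, _, _, _ = env.step(skill(s))
    for _ in range(H):
        s, _, _, _ = env.step(rng.uniform(-1, 1, size=np.shape(s)))
        node.state_buffer.append(np.copy(s))
    return s

def empirical_discriminability(node, z, discriminator):
    """hat{q}^B_phi(z): average probability assigned to z over the state buffer (Equation 3)."""
    probs = [discriminator(s)[z] for s in node.state_buffer]
    return float(np.mean(probs)) if probs else 0.0
```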
We then integrate the optimization of Equation 3 into an adaptive tree expansion strategy that incrementally composes skills (Sect. 3.3). The tree is initialized with a root node corresponding to a policy only composed of the diffusing part around s0. Then UPSIDE iteratively proceeds through the following phases:

(Expansion) While policies/nodes can be expanded according to different ordering rules (e.g., a FIFO strategy), we rank them in descending order by their discriminability (i.e., \hat{q}^B_ϕ(z)), thus favoring the expansion of policies that reach regions of the state space that are not too saturated. Given a candidate leaf z to expand from the tree, we introduce new policies by adding a set C(z) of N = N^start nodes rooted at node z (line 2, see also steps (A) and (D) in Fig. 1).

(Policy learning) The new policies are optimized in three steps (see App. B for details on the POLICYLEARNING subroutine): i) sample states from the diffusing parts of the new policies sampled uniformly from C(z) (state buffers of consolidated policies in T are kept in memory), ii) update the discriminator and compute the discriminability \hat{q}^B_ϕ(z′) of the new policies z′ ∈ C(z), iii) update the skills to optimize the reward (Sect. 3.1) computed using the discriminator (see step (B) in Fig. 1).

(Node adaptation) Once the policies are trained, UPSIDE proceeds with optimizing N in a greedy fashion. If all the policies in C(z) have an (estimated) discriminability larger than η (lines 3-5), a new policy is tentatively added to C(z), the policy counter N is incremented, the policy learning step is restarted, and the algorithm keeps adding policies until the constraint is not met anymore or a maximum number of policies is attained. Conversely, if some policy in C(z) does not meet the discriminability constraint (lines 7-9), the one with lowest discriminability is removed from C(z), the policy learning step is restarted, and the algorithm keeps removing policies until all policies satisfy the constraint or no policy is left (see step (C) in Fig. 1). The resulting C(z) is added to the set of consolidated policies (line 11), and UPSIDE iteratively proceeds by selecting another node to expand until no node can be expanded (i.e., the node adaptation step terminates with N = 0 for all nodes) or a maximum number of environment iterations is met.

4 RELATED WORK

URL methods can be broadly categorized depending on how the experience of the unsupervised phase is summarized to solve downstream tasks in a zero- or few-shot manner. This includes model-free (e.g., Pong et al., 2020), model-based (e.g., Sekar et al., 2020) and representation learning (e.g., Yarats et al., 2021) methods that build a representative replay buffer to learn accurate estimates or low-dimensional representations. An alternative line of work focuses on discovering a set of skills in an unsupervised way. Our approach falls in this category, which is the focus of the rest of this section.

Skill discovery based on MI maximization was first proposed in VIC (Gregor et al., 2016), where the trajectories' final states are considered in the reverse form of Equation 1 and the policy parameters π(z) and sampling distribution ρ are simultaneously learned (with a fixed number of skills N_Z).
DIAYN (Eysenbach et al., 2019) fixes a uniform ρ and weights skills with an action-entropy co- efficient (i.e., it additionally minimizes the MI between actions and skills given the state) to push the skills away from each other. DADS (Sharma et al., 2020) learns skills that are not only diverse but also predictable by learned dynamics models, using a generative model over observations and optimizing a forward form of MI I(s′; z|s) between the next state s′ and current skill z (with con- tinuous latent) conditioned on the current state s. EDL (Campos et al., 2020) shows that existing skill discovery approaches can provide insufficient coverage and relies on a fixed distribution over states that is either provided by an oracle or learned. SMM (Lee et al., 2019) uses the MI formalism to learn a policy whose state marginal distribution matches a target state distribution (e.g., uniform). Other MI-based skill discovery methods include Florensa et al. (2017); Hansen et al. (2019); Modhe et al. (2020); Baumli et al. (2021); Xie et al. (2021); Liu & Abbeel (2021); Strouse et al. (2021), and extensions in non-episodic settings (Xu et al., 2020; Lu et al., 2020). While most skill discovery approaches consider a fixed number of policies, a curriculum with in- creasing NZ is studied in Achiam et al. (2018); Aubret et al. (2020). We consider a similar dis- criminability criterion in the constraint in (Pη) and show that it enables to maintain skills that can be readily composed along a tree structure, which can either increase or decrease the support of available skills depending on the region of the state space. Recently, Zhang et al. (2021) propose a hierarchical RL method that discovers abstract skills while jointly learning a higher-level policy to maximize extrinsic reward. Our approach builds on a similar promise of composing skills instead of resetting to s0 after each execution, yet we articulate the composition differently, by exploiting the direct-then-diffuse structure to ground skills to the state space instead of being abstract. Har- tikainen et al. (2020) perform unsupervised skill discovery by fitting a distance function; while their approach also includes a directed part and a diffusive part for exploration, it learns only a single directed policy and does not aim to cover the entire state space. Approaches such as DISCERN (Warde-Farley et al., 2019) and Skew-Fit (Pong et al., 2020) learn a goal-conditioned policy in an unsupervised way with an MI objective. As explained by Campos et al. (2020), this can be in- terpreted as a skill discovery approach with latent Z = S, i.e., where each goal state can define a different skill. Conditioning on either goal states or abstract latent skills forms two extremes of the spectrum of unsupervised RL. As argued in Sect. 3.1, we target an intermediate approach of learning “cluster-conditioned” policies. Finally, an alternative approach to skill discovery builds on “spectral” properties of the dynamics of the MDP. This includes eigenoptions (Machado et al., 2017; 2018) and covering options (Jinnai et al., 2019; 2020), and the algorithm of Bagaria et al. (2021) that builds a discrete graph representation which learns and composes spectral skills. 
Table 2: Coverage on Bottleneck Maze and U-Maze: UPSIDE covers significantly more regions of the discretized state space than the other methods. The values represent the number of buckets that are reached, where the 50 × 50 space is discretized into 10 buckets per axis. To compare the global coverage of methods (and to be fair w.r.t. the amount of injected noise that may vary across methods), we roll out for each model its associated deterministic policies.

                 | Bottleneck Maze | U-Maze
RANDOM           | 29.17 (±0.57)   | 23.33 (±0.57)
DIAYN-10         | 17.67 (±0.57)   | 14.67 (±0.42)
DIAYN-20         | 23.00 (±1.09)   | 16.67 (±1.10)
DIAYN-50         | 30.00 (±0.72)   | 25.33 (±1.03)
DIAYN-curr       | 18.00 (±0.82)   | 15.67 (±0.87)
DIAYN-hier       | 38.33 (±0.68)   | 49.67 (±0.57)
EDL-10           | 27.00 (±1.41)   | 32.00 (±1.19)
EDL-20           | 31.00 (±0.47)   | 46.00 (±0.82)
EDL-50           | 33.33 (±0.42)   | 52.33 (±1.23)
SMM-10           | 19.00 (±0.47)   | 14.00 (±0.54)
SMM-20           | 23.67 (±1.29)   | 14.00 (±0.27)
SMM-50           | 28.00 (±0.82)   | 25.00 (±1.52)
Flat UPSIDE-10   | 40.67 (±1.50)   | 43.33 (±2.57)
Flat UPSIDE-20   | 47.67 (±0.31)   | 55.67 (±1.03)
Flat UPSIDE-50   | 51.33 (±1.64)   | 57.33 (±0.31)
UPSIDE           | 85.67 (±1.93)   | 71.33 (±0.42)

Figure 2: Policies learned on the Bottleneck Maze by UPSIDE, DIAYN-50 and EDL-50 (see Fig. 14 in App. C for the other methods): contrary to the baselines, UPSIDE successfully escapes the bottleneck region.

5 EXPERIMENTS

Our experiments investigate the following questions: i) Can UPSIDE incrementally cover an unknown environment while preserving the directedness of its skills? ii) Following the unsupervised phase, how can UPSIDE be leveraged to solve sparse-reward goal-reaching downstream tasks? iii) What is the impact of the different components of UPSIDE on its performance?

We report results on navigation problems in continuous 2D mazes⁶ and on continuous control problems (Brockman et al., 2016; Todorov et al., 2012): Ant, Half-Cheetah and Walker2d. We evaluate performance with the following tasks: 1) "coverage", which evaluates the extent to which the state space has been covered during the unsupervised phase, and 2) "unknown goal-reaching", whose objective is to find and reliably reach an unknown goal location through fine-tuning of the policy. We perform our experiments based on the SaLinA framework (Denoyer et al., 2021).

⁶The agent observes its current position and its actions (in [−1, +1]) control its shift in x and y coordinates. We consider two topologies of mazes illustrated in Fig. 2 with size 50 × 50 such that exploration is non-trivial. The Bottleneck maze is a harder version of the one in Campos et al. (2020, Fig. 1), whose size is only 10 × 10.

We compare UPSIDE to different baselines. First we consider DIAYN-N_Z (Eysenbach et al., 2019), where N_Z denotes a fixed number of skills. We introduce two new baselines derived from DIAYN: a) DIAYN-curr is a curriculum variant where the number of skills is automatically tuned following the same procedure as in UPSIDE, similar to Achiam et al. (2018), to ensure sufficient discriminability, and b) DIAYN-hier is a hierarchical extension of DIAYN where the skills are composed in a tree as in UPSIDE but without the diffusing part. We also compare to SMM (Lee et al., 2019), which is similar to DIAYN but includes an exploration bonus encouraging the policies to visit rarely encountered states. In addition, we consider EDL (Campos et al., 2020) with the assumption of an available state distribution oracle (since replacing it by SMM does not lead to satisfying results in presence of bottleneck states, as shown in Campos et al., 2020). Finally, we consider the RANDOM policy, which samples actions uniformly in the action space. We use TD3 as the policy optimizer (Fujimoto et al., 2018), though we also tried SAC (Haarnoja et al., 2018), which showed results equivalent to TD3 but with harder tuning. Similar to e.g., Eysenbach et al. (2019); Bagaria & Konidaris (2020), we restrict the observation space of the discriminator to the cartesian coordinates (x, y) for Ant and x for Half-Cheetah and Walker2d. All algorithms were run for Tmax = 1e7 unsupervised environment interactions in episodes of size Hmax = 200 (resp. 250) for mazes (resp. for control). For baselines, models are selected according to the cumulated intrinsic reward (as done in e.g., Strouse et al., 2021), while UPSIDE, DIAYN-hier and DIAYN-curr are selected according to the highest number of η-discriminable policies. On the downstream tasks, we consider ICM (Pathak et al., 2017) as an additional baseline. See App. C for the full experimental details.
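As a concrete reading of the coverage metric reported in Table 2 and Fig. 3, a bucket count over a discretized state space can be computed as below. This is an illustrative sketch consistent with the captions (10 buckets per axis for the mazes, 50 for control), not the exact evaluation code.

```python
import numpy as np

def bucket_coverage(states, low, high, buckets_per_axis=10):
    """Number of distinct buckets visited when each axis of the box [low, high]
    is discretized into `buckets_per_axis` bins (the measure of Table 2 / Fig. 3)."""
    states = np.asarray(states, dtype=float)
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    # map each state to a per-axis bucket index, clipping to stay inside the grid
    idx = np.floor((states - low) / (high - low) * buckets_per_axis).astype(int)
    idx = np.clip(idx, 0, buckets_per_axis - 1)
    return len({tuple(i) for i in idx})

# Example on a 50x50 maze: states visited by some (deterministic) policy roll-outs.
visited = np.random.uniform(-25, 25, size=(1000, 2))         # placeholder trajectories
print(bucket_coverage(visited, low=[-25, -25], high=[25, 25], buckets_per_axis=10))
```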
Figure 3: Coverage on control environments ((a) Ant, (b) Half-Cheetah, (c) Walker2d): UPSIDE covers the state space significantly more than DIAYN and RANDOM. The curves represent the number of buckets reached by the policies extracted from the unsupervised phase of UPSIDE and DIAYN as a function of the number of environment interactions. DIAYN and UPSIDE have the same amount of injected noise. Each axis is discretized into 50 buckets.

Figure 4: (a) & (b) Unsupervised phase on Ant: visualization of the policies learned by UPSIDE and DIAYN-20. We display only the final skill and the diffusing part of the UPSIDE policies. (c) Downstream tasks on Ant: we plot the average success rate over 48 unknown goals (with sparse reward) that are sampled uniformly in the [−8, 8]² square (using stochastic roll-outs) during the fine-tuning phase. UPSIDE achieves a higher success rate than DIAYN-20 and TD3.

Coverage. We analyze the coverage achieved by the various methods following an unsupervised phase of at most Tmax = 1e7 environment interactions. For UPSIDE, we report coverage for the skill and diffusing part lengths T = H = 10 in the continuous mazes (see App. D.4 for an ablation on the values of T, H) and T = H = 50 in control environments. Fig. 2 shows that UPSIDE manages to cover the near-entirety of the state space of the bottleneck maze (including the top-left room) by creating a tree of directed skills, while the other methods struggle to escape from the bottleneck region. This translates quantitatively in the coverage measure of Table 2, where UPSIDE achieves the best results. As shown in Fig. 3 and 4, UPSIDE clearly outperforms DIAYN and RANDOM in state-coverage of control environments, for the same number of environment interactions. In the Ant domain, traces from DIAYN (Fig. 4b) and discriminator curves in App. D.3 demonstrate that even though DIAYN successfully fits 20 policies by learning to take a few steps then hover, it fails to explore the environment.
In Half-Cheetah and Walker2d, while DIAYN policies learn to fall on the agent's back, UPSIDE learns to move forward/backward on its back through skill composition.

Unknown goal-reaching tasks. We investigate how the tree of policies learned by UPSIDE in the unsupervised phase can be used to tackle goal-reaching downstream tasks. All unsupervised methods follow the same protocol: given an unknown⁷ goal g, i) we sample rollouts over the different learned policies, ii) then we select the best policy based on the maximum discounted cumulative reward collected, and iii) we fine-tune this policy (i.e., its sequence of directed skills and its final diffusing part) to maximize the sparse reward r(s) = I[∥s − g∥_2 ≤ 1] (a schematic sketch of this protocol is given below). Fig. 5 reports the discounted cumulative reward on various goals after the fine-tuning phase. We see that UPSIDE accumulates more reward than the other methods, in particular in regions far from s0, where performing fine-tuning over the entire skill path is especially challenging. In Fig. 6 we see that UPSIDE's fine-tuning can slightly deviate from the original tree structure to improve the goal-reaching behavior of its candidate policy. We also perform fine-tuning on the Ant domain under the same setting. In Fig. 4c, we show that UPSIDE clearly outperforms DIAYN-20 and TD3 when we evaluate the average success rate of reaching 48 goals sampled uniformly in [−8, 8]². Note that DIAYN particularly fails as its policies learned during the unsupervised phase all stay close to the origin s0.

⁷Notice that if the goal was known, the learned discriminator could be directly used to identify the most promising skill to fine-tune.

Figure 5: Downstream task performance on Bottleneck Maze for (a) UPSIDE (before and after fine-tuning), (b) DIAYN, (c) EDL and (d) ICM: UPSIDE achieves higher discounted cumulative reward on various unknown goals (see Fig. 15 in App. C for SMM and TD3 performance). From each of the 16 discretized regions, we randomly sample 3 unknown goals. For every method and goal seed, we roll out each policy (learned in the unsupervised phase) during 10 episodes and select the one with largest cumulative reward to fine-tune (with sparse reward r(s) = I[∥s − g∥_2 ≤ 1]). Formally, for a given goal g the reported value is γ^τ I[τ ≤ Hmax] with τ := inf{t ≥ 1 : ∥s_t − g∥_2 ≤ 1}, γ = 0.99 and horizon Hmax = 200.

Figure 6: For an unknown goal location, UPSIDE identifies a promising policy in its tree and fine-tunes it.

Ablative study of the UPSIDE components. The main components of UPSIDE that differ from existing skill discovery approaches such as DIAYN are: the decoupled policy structure, the constrained optimization problem and the skill chaining via the growing tree. We perform ablations to show that all components are simultaneously required for good performance. First, we compare UPSIDE to flat UPSIDE, i.e., UPSIDE with a tree depth of 1 (T = 150, H = 50). Table 2 reveals that the tree structuring is key to improve exploration and escape bottlenecks; it makes the agent learn on smaller and easier problems (i.e., short-horizon MDPs) and mitigates the optimization issues (e.g., non-stationary rewards). However, the diffusing part of flat UPSIDE largely improves the coverage performance over the DIAYN baseline, suggesting that the diffusing part is an interesting structural bias on the entropy regularization that pushes the policies away from each other. This is particularly useful on the Ant environment as shown in Fig. 4.
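As referenced above, the goal-reaching protocol (select the best pre-trained policy by discounted return, then fine-tune it on the sparse reward) can be sketched as follows; the `rollout_fn` interface and function names are hypothetical stand-ins for the actual evaluation code.

```python
import numpy as np

def sparse_reward(s, g, radius=1.0):
    """r(s) = 1 if the goal g is reached within `radius` (Euclidean norm), else 0."""
    return float(np.linalg.norm(np.asarray(s) - np.asarray(g)) <= radius)

def discounted_return(states, g, gamma=0.99):
    """Discounted cumulative sparse reward of one trajectory (list/array of states)."""
    return sum(gamma ** t * sparse_reward(s, g) for t, s in enumerate(states))

def select_policy_for_goal(policies, rollout_fn, g, n_episodes=10):
    """Steps (i)-(ii) of the protocol: roll out every learned policy a few times and keep
    the one with the largest average discounted return; it is then fine-tuned (step iii).
    `rollout_fn(policy)` is assumed to return the list of states visited in one episode."""
    scores = []
    for policy in policies:
        returns = [discounted_return(rollout_fn(policy), g) for _ in range(n_episodes)]
        scores.append(np.mean(returns))
    return policies[int(np.argmax(scores))]
```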
A challenging aspect is to make the skill composi- tion work. As shown in Table 1, DIAYN-hier (a hierarchical version of DIAYN) does not perform as well as UPSIDE by a clear margin. In fact, UPSIDE’s direct-then-diffuse decoupling enables both policy re-usability for the chaining (via the directed skills) and local coverage (via the diffusing part). Moreover, as shown by the results of DIAYN-hier on the bottleneck maze, the constrained optimization problem (Pη) combined with the diffusing part is crucial to prevent consolidating too many policies, thus allowing a sample efficient growth of the tree structure. 6 CONCLUSION AND LIMITATIONS We introduced UPSIDE, a novel algorithm for unsupervised skill discovery designed to trade off between coverage and directedness and develop a tree of skills that can be used to perform ef- ficient exploration and solve sparse-reward goal-reaching downstream tasks. Limitations of our approach that constitute natural venues for future investigation are: 1) The diffusing part of each policy could be explicitly trained to maximize local coverage around the skill’s terminal state; 2) UPSIDE assumes a good state representation is provided as input to the discriminator, it would be interesting to pair UPSIDE with effective representation learning techniques to tackle problems with high-dimensional input; 3) As UPSIDE relies on the ability to reset to establish a root node for its growing tree, it could be relevant to extend the approach in non-episodic environments. 9 Published as a conference paper at ICLR 2022 Acknowledgements We thank both Evrard Garcelon and Jonas Gehring for helpful discussion. REFERENCES Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018. Arthur Aubret, La¨ etitia Matignon, and Salima Hassas. Elsim: End-to-end learning of reusable skills through intrinsic motivation. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2020. Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International Conference on Learning Representations, 2020. Akhil Bagaria, Jason K Senthil, and George Konidaris. Skill discovery for exploration and planning using deep skill graphs. In International Conference on Machine Learning, pp. 521–531. PMLR, 2021. David Barber and Felix Agakov. The im algorithm: a variational approach to information maxi- mization. Advances in neural information processing systems, 16(320):201, 2004. Kate Baumli, David Warde-Farley, Steven Hansen, and Volodymyr Mnih. Relative variational in- trinsic control. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 6732–6740, 2021. Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environ- ment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47: 253–279, 2013. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016. V´ ıctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i Nieto, and Jordi Torres. Explore, discover and learn: Unsupervised discovery of state-covering skills. In Interna- tional Conference on Machine Learning, 2020. Ludovic Denoyer, Alfredo de la Fuente, Song Duong, Jean-Baptiste Gaya, Pierre-Alexandre Kamienny, and Daniel H Thompson. Salina: Sequential learning of agents. 
arXiv preprint arXiv:2110.07910, 2021. Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representa- tions, 2019. Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical rein- forcement learning. arXiv preprint arXiv:1704.03012, 2017. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor- critic methods. In International Conference on Machine Learning, pp. 1587–1596. PMLR, 2018. Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Confer- ence on Machine Learning, pp. 1861–1870. PMLR, 2018. Steven Hansen, Will Dabney, Andre Barreto, David Warde-Farley, Tom Van de Wiele, and Volodymyr Mnih. Fast task inference with variational intrinsic successor features. In Interna- tional Conference on Learning Representations, 2019. Kristian Hartikainen, Xinyang Geng, Tuomas Haarnoja, and Sergey Levine. Dynamical distance learning for semi-supervised and unsupervised skill discovery. In International Conference on Learning Representations, 2020. 10 Published as a conference paper at ICLR 2022 Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: variational information maximizing exploration. In Proceedings of the 30th International Con- ference on Neural Information Processing Systems, pp. 1117–1125, 2016. Yuu Jinnai, Jee Won Park, David Abel, and George Konidaris. Discovering options for exploration by minimizing cover time. In International Conference on Machine Learning, pp. 3130–3139. PMLR, 2019. Yuu Jinnai, Jee Won Park, Marlos C Machado, and George Konidaris. Exploration in reinforcement learning with deep covering options. In International Conference on Learning Representations, 2020. Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, and Ruslan Salakhutdi- nov. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019. Hao Liu and Pieter Abbeel. Aps: Active pretraining with successor features. In International Conference on Machine Learning, pp. 6736–6747. PMLR, 2021. Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill- space planning. arXiv preprint arXiv:2012.03548, 2020. Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for op- tion discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295–2304. PMLR, 2017. Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations, 2018. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle- mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015. Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra, and Ramakrishna Vedantam. Ir-vic: Unsupervised discovery of sub-goals for transfer in rl. 
In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI- 20, 2020. Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsi- cally motivated reinforcement learning. In Advances in neural information processing systems, pp. 2125–2133, 2015. Mirco Mutti, Lorenzo Pratissoli, and Marcello Restelli. Task-agnostic exploration via policy gra- dient of a non-parametric state entropy estimate. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9028–9036, 2021. Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017. Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew- fit: State-covering self-supervised reinforcement learning. In International Conference on Ma- chine Learning, 2020. Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning, pp. 8583–8592. PMLR, 2020. Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020. DJ Strouse, Kate Baumli, David Warde-Farley, Vlad Mnih, and Steven Hansen. Learning more skills through optimistic exploration. arXiv preprint arXiv:2107.14226, 2021. 11 Published as a conference paper at ICLR 2022 Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012. David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In In- ternational Conference on Learning Representations, 2019. Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, and Florian Shkurti. Skill transfer via partially amortized hierarchical planning. In International Conference on Learning Represen- tations, 2021. Kelvin Xu, Siddharth Verma, Chelsea Finn, and Sergey Levine. Continual learning of control prim- itives: Skill discovery via reset-games. Advances in Neural Information Processing Systems, 33, 2020. Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Reinforcement learning with proto- typical representations. In Proceedings of the 38th International Conference on Machine Learn- ing, pp. 11920–11931. PMLR, 2021. Jesse Zhang, Haonan Yu, and Wei Xu. Hierarchical reinforcement learning by discovering intrinsic options. In International Conference on Learning Representations, 2021. 12 Published as a conference paper at ICLR 2022 Appendix A Theoretical Details on Section 3 13 B UPSIDE Algorithm 16 C Experimental Details 18 D Additional Experiments 22 A THEORETICAL DETAILS ON SECTION 3 A.1 PROOFS OF LEMMAS 1 AND 2 Restatement of Lemma 1. There exists a value η† ∈(0, 1) such that solving (Pη†) is equivalent to maximizing a lower bound on the mutual information objective max NZ,ρ,π,ϕ I(Sdiff; Z). Proof. We assume that the number of available skills is upper bounded, i.e., 1 ≤NZ ≤Nmax. 
We begin by lower bounding the negative conditional entropy by using the well known lower bound of Barber & Agakov (2004) on the mutual information −H(Z|Sdiff) = X z∈Z ρ(z)Esdiff [log p(z|sdiff)] ≥ X z∈Z ρ(z)Esdiff [log qϕ(z|sdiff)] . We now use that any weighted average is lower bounded by its minimum component, which yields −H(Z|Sdiff) ≥min z∈Z Esdiff [log qϕ(z|sdiff)] . Thus the following objective is a lower bound on the original objective of maximizing I(Sdiff; Z) max NZ=N,ρ,π,ϕ n H(Z) + min z∈[N] Esdiff [log qϕ(z|sdiff)] o . (4) Interestingly, the second term in Equation 4 no longer depends on the skill distribution ρ, while the first entropy term H(Z) is maximized by setting ρ to the uniform distribution over N skills (i.e., maxρ H(Z) = log(N)). This enables to simplify the optimization which now only depends on N. Thus Equation 4 is equivalent to max NZ=N n log(N) + max π,ϕ min z∈[N] Esdiff [log qϕ(z|sdiff)] o . (5) We define the functions f(N) := log(N), g(N) := max π,ϕ min z∈[N] Esdiff [log qϕ(z|sdiff)] . Let N † ∈arg maxN f(N) + g(N) and η† := exp g(N †) ∈(0, 1). We now show that any solution of (Pη†) is a solution of Equation 5. Indeed, denote by N ⋆the value that optimizes (Pη†). First, by validity of the constraint, it holds that g(N ⋆) ≥log η† = g(N †). Second, since N † meets the constraint and N ⋆is the maximal number of skills that satisfies the constraint of (Pη†), by optimality we have that N ⋆≥N † and therefore f(N ⋆) ≥f(N †) since f is non-decreasing. We thus have g(N ⋆) ≥g(N †) f(N ⋆) ≥f(N †) = ⇒f(N ⋆) + g(N ⋆) ≥f(N †) + g(N †). Putting everything together, an optimal solution for (Pη†) is an optimal solution for Equation 5, which is equivalent to Equation 4, which is a lower bound of the MI objective, thus concluding the proof. 13 Published as a conference paper at ICLR 2022 Restatement of Lemma 2. The function g is non-increasing in N. Proof. We have that g(N) := maxπ,q minz∈[N] Es∼π(z)[log(q(z|s)], where throughout the proof we write s instead of sdiff for ease of notation. Here the optimization variables are π ∈(Π)N (i.e., a set of N policies) and q : S →∆(N), i.e., a classifier of states to N possible classes, where ∆(N) denotes the N-simplex. For z ∈[N], let hN(π, q, z) := Es∼π(z)[log(q(z|s)], fN(π, q) := min z∈[N] hN(π, q, z). Let (π′, q′) ∈arg maxπ,q fN+1(π, q). We define e π := (π′(1), . . . , π′(N)) ∈(Π)N and e q : S → ∆(N) such that e q(i|s) := q′(i|s) for any i ∈[N −1] and e q(N|s) := q′(N|s) + q′(N + 1|s). Intuitively, we are “discarding” policy N + 1 and “merging” class N + 1 with class N. Then it holds that g(N + 1) = fN+1(π′, q′) = min z∈[N+1] hN+1(π′, q′, z) ≤min z∈[N] hN+1(π′, q′, z). Now, by construction of e π, e q, we have for any i ∈[N −1] that hN+1(π′, q′, i) = hN(e π, e q, i). As for class N, since e π(N) = π′(N), by definition of e q(N|·) and from monotonicity of the log function, it holds that hN+1(π′, q′, N) = Es∼π′(N)[log(q′(N|s)] satisfies hN+1(π′, q′, N) ≤Es∼˜ π(N)[log(e q(N|s)] = hN(e π, e q, N). Hence, we get that min z∈[N] hN+1(π′, q′, z) ≤min z∈[N] hN(e π, e q, z) = fN(e π, e q) ≤g(N). Putting everything together gives g(N + 1) ≤g(N), which yields the desired result. A.2 SIMPLE ILLUSTRATION OF THE ISSUE WITH UNIFORM-ρ MI MAXIMIZATION Figure 7: The agent must assign (possibly stochastically) N skills to M states: under the prior of uniform skill distribution, can the MI with be increased by varying the number of skills N? This section complements Sect. 
3.2: we show a sim- ple scenario where 1) considering both a uniform ρ prior and a fixed skill number NZ provably leads to suboptimal MI maximization, and where 2) the UPSIDE strategy of considering a uniform ρ re- stricted to the η-discriminable skills can provably in- crease the MI for small enough η. Consider the simple scenario (illustrated on Fig. 7) where the agent has N skills indexed by n and must assign them to M states indexed by m. We as- sume that the execution of each skill deterministi- cally brings it to the assigned state, yet the agent may assign stochastically (i.e., more than one state per skill). (A non-RL way to interpret this is that we want to allocate N balls into M boxes.) Denote by pn,m ∈[0, 1] the probability that skill n is assigned to state m. It must hold that ∀n ∈[N], P m pn,m = 1. Denote by I the MI between the skill variable and the assigned state variable, and by I the MI under the prior that the skill sampling distribution ρ is uniform, i.e., ρ(n) = 1/N. It holds that I(N, M) = − X n 1 N log 1 N + X n,m 1 N pn,m log 1 N pn,m P n 1 N pn,m = log N + 1 N X n,m pn,m log pn,m P n pn,m . Let I ⋆(N, M) := max{pn,m} I(N, M) and {p⋆ n,m} ∈arg max{pn,m} I(N, M). We also define the discriminability of skill n in state m as qn,m := pn,m P n pn,m , as well as the minimum discriminability of the optimal assignment as η := min n max m q⋆ n,m. 14 Published as a conference paper at ICLR 2022 Lemma 3. There exist values of N and M such that the uniform-ρ MI be improved by removing a skill (i.e., by decreasing N). Proof. The following example shows that with M = 2 states, it is beneficial for the uniform-ρ MI maximization to go from N = 3 to N = 2 skills. Indeed, we can numerically compute the optimal solutions and we obtain for N = 3 and M = 2 that I ⋆(N = 3, M = 2) ≈0.918, p⋆ n,m = 0 1 0 1 1 0 ! , q⋆ n,m = 0 0.5 0 0.5 1 0 ! , η = 0.5, whereas for N = 2 and M = 2, I ⋆(N = 2, M = 2) = 1, p⋆ n,m =  0 1 1 0  , q⋆ n,m =  0 1 1 0  , η = 1. As a result, I ⋆(N = 2, M = 2) > I ⋆(N = 3, M = 2), which concludes the proof. The intuition why I ⋆is increased by decreasing N is that for N = 2 there is one skill per state whereas for N = 3 the skills must necessarily overlap. Note that this contrasts with the original MI (that also optimizes ρ) where decreasing N cannot improve the optimal MI. The previous simple example hints to the fact that the value of the minimum discriminability of the optimal assignment η may be a good indicator to determine whether to remove a skill. The following more general lemma indeed shows that a sufficient condition for the uniform-ρ MI to be increased by removing a skill is that η is small enough. Lemma 4. Assume without loss of generality that the skill indexed by N has the minimum discrim- inability η, i.e., N ∈arg minn maxm q⋆ n,m. Define ∆(N, η) := log N −N −1 N log(N −1) + 1 N log η. If ∆(N, η) ≤0 — which holds for small enough η — then removing skill N results in a larger uniform-ρ optimal MI, i.e., I ⋆(N, M) < I ⋆(N −1, M). Proof. It holds that I ⋆(N, M) = log N + 1 N   X n∈[N−1] X m∈[M] p⋆ n,m log q⋆ n,m + X m∈[M] p⋆ n,m log η   = log N −N −1 N log(N −1) + N −1 N  log(N −1) + 1 N −1 X n∈[N−1] X m∈[M] p⋆ n,m log q⋆ n,m  + 1 N log η = ∆(N, η) + N −1 N I ⋆(N −1, M). As a result, if ∆(N, η) ≤0 then I ⋆(N, M) < I ⋆(N −1, M). 
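The numbers in the proof of Lemma 3 can be checked directly; the reported values (≈0.918 and 1) are consistent with logarithms taken in base 2. The short computation of the uniform-ρ MI below is only a verification aid, not part of the original material.

```python
import numpy as np

def uniform_rho_mi(p):
    """Uniform-rho MI: log2(N) + (1/N) * sum_{n,m} p[n,m] * log2(p[n,m] / sum_n p[n,m])."""
    p = np.asarray(p, dtype=float)
    n_skills = p.shape[0]
    col = p.sum(axis=0, keepdims=True)               # marginal over skills for each state
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / col), 0.0)
    return np.log2(n_skills) + terms.sum() / n_skills

print(uniform_rho_mi([[0, 1], [0, 1], [1, 0]]))      # N=3, M=2 -> ~0.918
print(uniform_rho_mi([[0, 1], [1, 0]]))              # N=2, M=2 -> 1.0
```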
15 Published as a conference paper at ICLR 2022 B UPSIDE ALGORITHM B.1 VISUAL ILLUSTRATIONS OF UPSIDE’S DESIGN MENTIONED IN SECTION 3 Figure 8: Decoupled structure of an UPSIDE policy: a directed skill fol- lowed by a diffusing part. Figure 9: In the above UPSIDE tree example, executing policy z = 7 means sequentially composing the skills of policies z ∈{2, 5, 7} and then deploying the diffusing part of policy z = 7. B.2 HIGH-LEVEL APPROACH OF UPSIDE Figure 10: High-level approach of UPSIDE. B.3 DETAILS OF ALGORITHM 1 We give in Alg. 2 a more detailed version of Alg. 1 and we list some additional explanations below. • When optimizing the discriminator, rather than sampling (state, policy) pairs with equal probabil- ity for all nodes from the tree T , we put more weight (e.g. 3×) on already consolidated policies, which seeks to avoid the new policies from invading the territory of the older policies that were previously correctly learned. • A replay buffer BRL is instantiated at every new expansion (line 2), thus avoiding the need to start collecting data from scratch with the new policies at every POLICYLEARNING call. • J (line 22) corresponds to the number of policy updates ratio w.r.t. discriminator updates, i.e. for how long the discriminator reward is kept fixed, in order to add stationarity to the reward signal. • Instead of using a number of iterations K to stop the training loop of the POLICYLEARNING func- tion (line 16), we use a maximum number of environment interactions Ksteps for node expansion. Note that this is the same for DIAYN-hier and DIAYN-curr. 16 Published as a conference paper at ICLR 2022 Algorithm 2: Detailed UPSIDE Parameters: Discriminability threshold η ∈(0, 1), branching factor N start, N max Initialize: Tree T initialized as a root node index by 0, policy candidates Q = {0}, state buffers BZ = {0 : [ ]} while Q ̸= ∅do // tree expansion 1 Dequeue a policy z ∈Q and create N = N start nodes C(z) rooted at z and add new key z to BZ 2 Instantiate new replay buffer BRL 3 POLICYLEARNING(BRL, BZ, T , C(z)) 4 if minz′∈C(z) d(z′) > η then // Node addition 5 while minz′∈C(z) d(z′) > η and N < N max do 6 Increment N = N + 1 and add one policy to C(z) 7 POLICYLEARNING(BRL, BZ, T , C(z)) 8 else // Node removal 9 while minz′∈C(z) d(z′) < η and N > 1 do 10 Reduce N = N −1 and remove least discriminable policy from C(z) 11 POLICYLEARNING(BRL, BZ, T , C(z)) 12 Enqueue in Q the η-discriminable nodes C(z) 13 POLICYLEARNING(Replay buffer BRL, State buffers BZ, Tree T , policies to update ZU) 14 Optimization parameters: patience K, policy-to-discriminator update ratio J, Kdiscr discriminator update epochs, Kpol policy update epochs 15 Initialize: Discriminator qϕ with |T | classes 16 for K iterations do // Training loop 17 For all z′ ∈ZU, clear BZ[z′] then collect and add B states from the diffusing part of π(z′) to it 18 Train the discriminator qϕ for Kdiscr steps with dataset S z′∈T BZ[z′]. 19 Compute discriminability d(z′) = b q B ϕ (z′) = 1 |Bz′ | P s∈Bz′ qϕ(z′|s) for all z′ ∈ZU 20 if minz′∈ZU d(z′) > η then // Early stopping 21 Break 22 for J iterations do 23 For all z′ ∈ZU, sample a trajectory from π(z′) and add to replay buffer BRL 24 For all z′ ∈ZU, update policy π′ z for Kpol steps on replay buffer BRL to optimize the discriminator reward as in Sect. 
25:           Compute the discriminability d(z′) for all z′ ∈ Z_U

• The state buffer size B needs to be sufficiently large compared to H so that the state buffer of each policy represents well the distribution of the states generated by the policy's diffusing part. In practice we set B = 10H.
• In POLICYLEARNING, we add K_initial random (uniform) transitions to the replay buffer for each newly instantiated policy.
• Moreover, in POLICYLEARNING, instead of sampling the new policies uniformly, we sample them in a round-robin fashion (i.e., one after the other), which can simply be seen as a variance-reduced version of uniform sampling.

B.4 ILLUSTRATION OF THE EVOLUTION OF UPSIDE'S TREE ON A WALL-FREE MAZE

See Fig. 11.

B.5 ILLUSTRATION OF THE EVOLUTION OF UPSIDE'S TREE ON THE BOTTLENECK MAZE

See Fig. 12.

Figure 11: Fine-grained evolution of the tree structure on a wall-free maze with N^start = 4 and N^max = 8. The environment is a wall-free continuous maze with initial state s_0 located at the center of the maze. Image A represents the diffusing part around s_0. In image B, N^start = 4 policies are trained, yet one of them (in lime yellow) is not sufficiently discriminable, thus it is pruned, resulting in image C. A small number of interactions is enough to ensure that the three remaining policies are η-discriminable (image C). In image D, a fourth policy (in green) is able to become η-discriminable. New policies are added, trained and η-discriminated, going from 5 policies (image E) to N^max = 8 policies (image F). Then a policy (in yellow) is expanded with N^start = 4 policies (image G). They are all η-discriminable, so additional policies are added (images H, I, . . . ). The process continues until convergence or until time-out (as done here). On the left, we plot the number of active policies (i.e., the number of policies that are being trained at the current level of the tree) as well as the average discriminator accuracy over the active policies, as a function of the number of environment interactions.

Figure 12: Incremental expansion of the tree learned by UPSIDE towards unexplored regions of the state space in the Bottleneck Maze.

C EXPERIMENTAL DETAILS

C.1 BASELINES

DIAYN-NZ. This corresponds to the original DIAYN algorithm (Eysenbach et al., 2019), where N_Z is the number of skills to be learned. In order to make the architecture more similar to UPSIDE, we use distinct policies for each skill, i.e., they do not share weights, as opposed to Eysenbach et al. (2019). While this may come at the price of sample efficiency, it may also put fewer constraints on the model (e.g., less gradient interference).

DIAYN-curr. We augment DIAYN with a curriculum that makes it less dependent on an adequate tuning of the number of skills. We consider the curriculum of UPSIDE, where we start learning with N^start policies for a given period of time / number of interactions. If the configuration satisfies the discriminability threshold η, a skill is added; otherwise a skill is removed or learning is stopped (as in Alg. 2, lines 5-12). Note that the increasing version of this curriculum is similar to the one proposed in VALOR (Achiam et al., 2018, Sect. 3.3). In our experiments, we use N^start = 1.
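To make the add/remove mechanism of Alg. 2 (lines 4-12) concrete, here is a minimal sketch of a single node expansion. The `policy_learning` and `discriminability` callables are hypothetical placeholders standing in for POLICYLEARNING and for the estimate d(z′); the skeleton only mirrors the control flow described above, not the released implementation.

```python
from typing import Callable, List

def expand_node(
    n_start: int,
    n_max: int,
    eta: float,
    policy_learning: Callable[[List[int]], None],   # stands in for POLICYLEARNING
    discriminability: Callable[[int], float],        # stands in for d(z') after training
) -> List[int]:
    """Skeleton of the node addition/removal loop of Alg. 2 (lines 4-12).

    Starts with n_start candidate policies, trains them, then greedily adds
    policies while all of them remain eta-discriminable (up to n_max), or
    removes the least discriminable ones until the constraint holds.
    Returns the indices of the consolidated (eta-discriminable) policies.
    """
    candidates = list(range(n_start))
    policy_learning(candidates)

    if min(discriminability(z) for z in candidates) > eta:
        # Node addition: keep adding policies as long as they all stay discriminable.
        while min(discriminability(z) for z in candidates) > eta and len(candidates) < n_max:
            candidates.append(len(candidates))
            policy_learning(candidates)
    else:
        # Node removal: drop the least discriminable policy until the constraint holds.
        while min(discriminability(z) for z in candidates) < eta and len(candidates) > 1:
            worst = min(candidates, key=discriminability)
            candidates.remove(worst)
            policy_learning(candidates)

    # Only eta-discriminable policies are consolidated and later enqueued for expansion.
    return [z for z in candidates if discriminability(z) > eta]
```

DIAYN-curr applies the same add/remove curriculum to a flat set of skills, starting from N^start = 1.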
DIAYN-hier. We extend DIAYN with a hierarchy of directed skills, built following the UPSIDE principles. The difference between DIAYN-hier and UPSIDE is that the discriminator reward is computed over the entire directed-skill trajectory, whereas it is driven by the diffusing part in UPSIDE. This baseline can be interpreted as an ablation of UPSIDE without the decoupled policy structure.

SMM. We consider SMM (Lee et al., 2019) as it is state-of-the-art in terms of coverage, at least on long-horizon control problems, although Campos et al. (2020) reported its poor performance in hard-to-explore bottleneck mazes. We tested the regular SMM version, i.e., learning a state density model with a VAE, yet we failed to make it work on the maze domains that we consider. As we use the Cartesian (x, y) positions in the maze domains, learning the identity function on two-dimensional input data is too easy for a VAE, which negates the benefit of using a density model to drive exploration. In our implementation, the exploration bonus is instead obtained by maintaining a multinomial distribution over "buckets of states" obtained by discretization (as in our coverage computation), resulting in a computationally efficient implementation that is more stable than the original VAE-based method. Note that the state distribution is computed using states from past-but-recent policies, as suggested in the original paper.

EDL. We consider EDL (Campos et al., 2020) with the strong assumption of an available state distribution oracle, since replacing it with SMM does not lead to satisfying results in the presence of bottleneck states, as shown in Campos et al. (2020, page 7): "We were unable to explore this type of maze effectively with SMM". In our implementation, the oracle samples states uniformly in the mazes, avoiding the need to handle a complex exploration problem; however, this setting is not realistic when facing unknown environments.

C.2 ARCHITECTURE AND HYPERPARAMETERS

The architecture of the different methods remains the same in all our experiments, except that the number of hidden units changes across the considered environments. For UPSIDE, flat UPSIDE (i.e., UPSIDE with a tree depth of 1), DIAYN, DIAYN-curr, DIAYN-hier and SMM, the multiple policies do not share weights; EDL policies, however, all share the same network because of the constraint that the policy embedding z is learned in a supervised fashion with the VQ-VAE rather than with the unsupervised RL objective. We consider a decoupled actor and critic optimized with the TD3 algorithm (Fujimoto et al., 2018); we also tried SAC (Haarnoja et al., 2018), which showed results equivalent to TD3 but with harder tuning.⁸ The actor and the critic have the same architecture: a neural network with two hidden layers (of size 64 for maze environments and 256 for control environments) that processes observations. The discriminator is a model with two hidden layers (of size 64) whose output size is the number of skills in the tree.

Common (for all methods and environments) optimization hyper-parameters:
• Discount factor: γ = 0.99
• σ_TD3 ∈ {0.1, 0.15, 0.2}
• Q-functions soft-update temperature: τ = 0.005
• Policy Adam optimizer with learning rate lr_pol ∈ {1e−3, 1e−4}
• Policy inner epochs: K_pol ∈ {10, 100}
• Policy batch size: B_pol = 64
• Discriminator delay: J ∈ {1, 10}
• Replay buffer maximum size: 1e6
• K_initial = 1e3
We consider the same range of hyper-parameters in the downstream tasks.
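As a rough illustration of the architecture described above (a sketch under our own assumptions about input and output sizes, not the authors' code), the actor, critic and discriminator can all be instantiated as two-hidden-layer MLPs; the hidden size of 64 corresponds to the maze environments.

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 64) -> nn.Sequential:
    """Two-hidden-layer MLP (64 units for mazes, 256 for control environments)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim, n_skills = 2, 2, 8        # hypothetical sizes for a 2D maze and a small tree

actor = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())   # one such actor per policy (no weight sharing)
critic = mlp(obs_dim + act_dim, 1)                        # Q(s, a); TD3 maintains two such critics
discriminator = mlp(obs_dim, n_skills)                    # logits over the |T| skills of the tree

# The discriminator defines q_phi(z | s) via a softmax over the skills currently in the tree;
# its log-probabilities are used to form the discriminator reward of Sect. 3.1 (not reproduced here).
states = torch.randn(16, obs_dim)
log_q = torch.log_softmax(discriminator(states), dim=-1)
```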
⁸ For completeness, we report here the performance of DIAYN-SAC in the continuous mazes: DIAYN-SAC with N_Z = 10 reaches 21.0 (±0.50) on the Bottleneck Maze and 17.5 (±0.75) on the U-Maze, to be compared with DIAYN-TD3 with N_Z = 10, which reaches 17.67 (±0.57) on the Bottleneck Maze and 14.67 (±0.42) on the U-Maze. We thus see that DIAYN-SAC also fails to cover the state space, performing similarly to DIAYN-TD3 (albeit with a larger hyperparameter search range, which possibly explains the slight improvement).

UPSIDE, DIAYN and SMM variants (common for all environments) optimization hyper-parameters:
• Discriminator batch size: B_discr = 64
• Discriminator Adam optimizer with learning rate lr_discr ∈ {1e−3, 1e−4}
• Discriminator inner epochs: K_discr ∈ {10, 100}
• Discriminator delay: J ∈ {1, 10}
• State buffer size B = 10H, where the diffusing-part length H is environment-specific.

EDL optimization hyper-parameters: We kept the same hyper-parameters as Campos et al. (2020). The VQ-VAE's encoder takes states as input and maps them to a code through 2 hidden layers of 128 units followed by a final linear layer; the decoder maps the code back to states, also through 2 hidden layers of 128 units. The VQ-VAE is trained on the oracle state distribution and then kept fixed during policy learning. Contrary to the UPSIDE, DIAYN and SMM variants, the reward is therefore stationary.
• β_commitment ∈ {0.25, 0.5}
• VQ-VAE code size: 16
• VQ-VAE batch size: B_vq-vae ∈ {64, 256}
• Total number of epochs: 5000 (trained until convergence)
• VQ-VAE Adam optimizer with learning rate lr_vq-vae ∈ {1e−3, 1e−4}

Maze-specific hyper-parameters:
• K_steps = 5e4 (with a time budget of 10 minutes)
• T = H = 10
• Max episode length: H_max = 200
• Max number of interactions: T_max = 1e7, during both unsupervised pre-training and downstream tasks.

Control-specific optimization hyper-parameters:
• K_steps = 1e5 (with a time budget of 1 hour)
• T = H = 50
• Max episode length: H_max = 250
• Max number of interactions: T_max = 1e7, during both unsupervised pre-training and downstream tasks.

Note that the hyperparameters are also kept fixed for the downstream tasks.

C.3 EXPERIMENTAL PROTOCOL

We now detail the experimental protocol that we followed, which is common to UPSIDE and the baselines, on all environments. It consists of the following three stages:

Unsupervised pre-training phase. Given an environment, each algorithm is trained without any extrinsic reward on N_unsup = 3 seeds, which we call unsupervised seeds (to account for the randomness in the initialization of the model weights and for environment stochasticity, if present). Each training lasts for a maximum number of T_max environment steps (split into episodes of length H_max). This protocol actually favors the baselines since, by design, UPSIDE may use fewer environment interactions than T_max thanks to its termination criterion (triggered when it cannot fit any more policies); for instance, all baselines were allowed T_max = 1e7 steps on the maze environments, whereas UPSIDE finished in at most 1e6 environment steps, fitting on average 57 and 51 policies for the Bottleneck Maze and the U-Maze, respectively.

Model selection. For each unsupervised seed, we tune the hyper-parameters of each algorithm according to a certain performance metric. For the baselines, we consider the cumulative intrinsic reward (as done in, e.g., Strouse et al., 2021) averaged over stochastic roll-outs.
For UPSIDE, DIAYN-hier and DIAYN-curr, the model selection criterion is the number of consolidated policies, i.e., how many policies were η-discriminated during their training stage. For each method, we thus obtain as many selected models as seeds, i.e., N_unsup.

Downstream tasks. For each algorithm, we evaluate the N_unsup selected models on a set of tasks. All results on downstream tasks show a performance averaged over the N_unsup seeds.

• Coverage. We evaluate to which extent the state space has been covered by discretizing the state space into buckets (10 per axis on the continuous maze domains) and counting how many buckets have been reached. To compare the global coverage of the methods (and also to be fair w.r.t. the amount of injected noise, which may vary across methods), we roll out for each model its associated deterministic policies.

• Fine-tuning on a goal-reaching task. We consider goal-oriented tasks in the discounted episodic setting where the agent needs to reach some unknown goal position within a certain radius (i.e., the goal location is unknown until it is reached once) under a sparse reward signal (i.e., reward of 1 at the goal location, 0 otherwise). The episode terminates when the goal is reached or when the number of timesteps exceeds H_max. The combination of unknown goal location and sparse reward makes the exploration problem very challenging, and calls upon the ability to first cover (to find the goal) and then navigate (to reliably reach the goal) the environment efficiently. To evaluate performance in an exhaustive manner, we discretize the state space into B_goal = 14 buckets and randomly sample N_goal = 3 goals from each of these buckets according to what we call goal seeds (thus there are B_goal × N_goal different goals in total). For every goal seed, we initialize each algorithm with the set of policies learned during the unsupervised pre-training. We then roll out each policy for N_explo episodes to compute its cumulative reward, and select the best one to fine-tune; a sketch of this selection step is given below. For UPSIDE, we complete the selected policy (of length denoted by L) by replacing its diffusing part with a skill whose length is the remaining number of interactions, i.e., H_max − L. The ability to select a good policy is intrinsically linked to the coverage performance of the model, but also to few-shot adaptation. Learning curves and performance are averaged over unsupervised seeds, goal seeds, and roll-outs of the stochastic policy. Since we are in the discounted episodic setting, fine-tuning is meaningful: it pushes the agent to reach the goal as fast as possible. This is particularly important for UPSIDE, which, because of its tree policy structure, may reach the goal sub-optimally w.r.t. the discount.

On the maze environments, we consider all unsupervised pre-training baselines as well as "vanilla" baselines trained from scratch during the downstream tasks: TD3 (Fujimoto et al., 2018) and ICM (Pathak et al., 2017). In the Ant environment, we also consider N_goal = 3 and B_goal = 14 in the [−8, 8]² square.
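The policy-selection step referenced above can be summarized by the following sketch; `rollout_return` is a hypothetical helper that runs one episode of a given policy on the downstream task and returns its cumulative (sparse) reward.

```python
from typing import Callable, Sequence

def select_policy_for_finetuning(
    policies: Sequence[object],
    rollout_return: Callable[[object], float],  # hypothetical: runs one episode, returns its return
    n_explo: int,
) -> object:
    """Roll out each pre-trained policy for n_explo episodes and return the one with the
    highest average cumulative reward; this policy is then fine-tuned on the downstream task."""
    def average_return(policy: object) -> float:
        return sum(rollout_return(policy) for _ in range(n_explo)) / n_explo

    return max(policies, key=average_return)
```

For UPSIDE, the diffusing part of the selected policy would then be replaced by a skill of length H_max − L before fine-tuning, as described above.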
(a) UPSIDE discriminator (b) DIAYN discriminator (c) EDL VQ-VAE
Figure 13: Environment divided into colors according to the most likely latent variable Z, according to (from left to right) the discriminator learned by UPSIDE, the discriminator learned by DIAYN, and the VQ-VAE learned by EDL. Contrary to DIAYN, UPSIDE's optimization enables the discriminator training and the policy training to catch up to each other, thus nicely clustering the discriminator predictions across the state space. EDL's VQ-VAE also manages to output good predictions (recall that we consider the EDL version with the strong assumption of an available state distribution oracle, see Campos et al., 2020), yet the skill learning is unable to cover the entire state space due to exploration issues and sparse rewards.

(a) SMM (b) DIAYN-curr (c) DIAYN-hier (d) Flat UPSIDE
Figure 14: Complement to Fig. 2: visualization of the policies learned on the Bottleneck Maze for the remaining methods.

(a) SMM (b) TD3
Figure 15: Complement to Fig. 5: heatmaps of downstream-task performance after fine-tuning for the remaining methods.

D ADDITIONAL EXPERIMENTS

D.1 ADDITIONAL RESULTS ON BOTTLENECK MAZE

Here we include 1) Fig. 13 for an analysis of the predictions of the discriminator (see caption for details); 2) Fig. 14 for the policy visualizations of the remaining methods (i.e., those not reported in Fig. 2); and 3) Fig. 15 for the downstream-task performance of the remaining methods (i.e., those not reported in Fig. 5).

(a) UPSIDE (b) DIAYN (c) EDL (d) SMM (e) DIAYN-curr (f) DIAYN-hier (g) Flat UPSIDE
Figure 16: Visualization of the policies learned on the U-Maze. This is the equivalent of Fig. 2 for the U-Maze.

(a) UPSIDE before fine-tuning (b) UPSIDE (c) DIAYN (d) EDL (e) ICM (f) SMM (g) TD3
Figure 17: Heatmaps of downstream-task performance on the U-Maze. This is the equivalent of Fig. 5 for the U-Maze.

D.2 ADDITIONAL RESULTS ON U-MAZE

Fig. 16 visualizes the policies learned during the unsupervised phase (i.e., the equivalent of Fig. 2 for the U-Maze), and Fig. 17 reports the heatmaps of downstream-task performance (i.e., the equivalent of Fig. 5 for the U-Maze). The conclusion is the same as on the Bottleneck Maze described in Sect. 5: UPSIDE clearly outperforms all the baselines, both in coverage (Fig. 16) and in unknown-goal-reaching performance in the downstream-task phase (Fig. 17).

(a) Bottleneck Maze (b) Ant (c) Half-Cheetah (d) Walker2d
Figure 18: Average discriminability of the DIAYN-N_Z policies (average discriminator accuracy as a function of environment interactions; curves shown for N_Z ∈ {10, 20, 50} on the Bottleneck Maze and N_Z ∈ {5, 10, 20} on the control environments). The smaller N_Z is, the easier it is to obtain close-to-perfect discriminability. However, even for quite large N_Z (50 for the mazes and 20 in the control environments), DIAYN is able to achieve a good discriminator accuracy, most often because policies learn how to "stop" in some state.

D.3 ANALYSIS OF THE DISCRIMINABILITY

In Fig. 18 (see caption) we investigate the average discriminability of DIAYN depending on the number of policies N_Z.
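For reference, the average discriminability d(z′) used both in Alg. 2 (line 19) and as the accuracy metric of Fig. 18 can be estimated as in the sketch below, where `disc_probs` is a placeholder for the discriminator's softmax output q_ϕ(·|s).

```python
import numpy as np

def average_discriminability(
    states: np.ndarray,   # states collected from the diffusing part of policy z, shape (B, state_dim)
    skill_index: int,     # index z of the policy whose discriminability we estimate
    disc_probs,           # placeholder: maps a batch of states to q_phi(. | s), shape (B, n_skills)
) -> float:
    """Estimate d(z) = (1/|B_Z[z]|) * sum_{s in B_Z[z]} q_phi(z | s), as in Alg. 2, line 19."""
    probs = disc_probs(states)                  # rows sum to 1 over the skills in the tree
    return float(probs[:, skill_index].mean())  # average probability assigned to the correct skill

# A policy is consolidated (and later expanded) only if its estimate d(z) exceeds the threshold eta.
```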
(a)-(e) Bottleneck Maze, T = H ∈ {10, 20, 30, 40, 50}; (f)-(j) U-Maze, T = H ∈ {10, 20, 30, 40, 50}
Figure 19: Ablation on the lengths (T, H) of the UPSIDE policies: visualization of the policies learned on the Bottleneck Maze (top) and the U-Maze (bottom) for different values of T, H, along with the corresponding coverage values (computed with the same procedure as in Table 2). Recall that T and H denote respectively the lengths of the directed skill and of the diffusing part of an UPSIDE policy.

UPSIDE        Bottleneck Maze     U-Maze
T = H = 10    85.67 (±1.93)       71.33 (±0.42)
T = H = 20    87.33 (±0.42)       67.67 (±1.50)
T = H = 30    77.33 (±3.06)       68.33 (±0.83)
T = H = 40    59.67 (±1.81)       57.33 (±0.96)
T = H = 50    51.67 (±0.63)       58.67 (±1.26)

D.4 ABLATION ON THE LENGTHS T AND H OF THE UPSIDE POLICIES

Our ablation on the mazes in Fig. 19 investigates the sensitivity of UPSIDE w.r.t. T and H, the lengths of the directed skills and of the diffusing parts of the policies. For the sake of simplicity, we keep T = H. It shows that the method is quite robust to reasonable choices of T and H, i.e., equal to 10 (as done in all other experiments) but also 20 or 30. Naturally, the performance degrades if T and H are chosen too large w.r.t. the environment size, in particular in the Bottleneck Maze, which requires "narrow" exploration; composing disproportionately long skills thus hinders coverage. For T = H = 50, we recover the performance of flat UPSIDE.

D.5 FINE-TUNING RESULTS ON HALF-CHEETAH AND WALKER2D

In Sect. 5, we reported the fine-tuning results on Ant. We now focus on Half-Cheetah and Walker2d, and similarly observe that UPSIDE largely outperforms the baselines:

              UPSIDE            TD3               DIAYN
Half-Cheetah  174.93 (±1.45)    108.67 (±25.61)   0.0 (±0.0)
Walker2d      46.29 (±3.09)     14.33 (±1.00)     2.13 (±0.74)

We note that our fine-tuning experiment on Half-Cheetah exactly corresponds to the standard Sparse-Half-Cheetah task, which rewards states whose x-coordinate is larger than 15. Meanwhile, Sparse-Walker2d provides a reward as long as the x-coordinate is larger than 1 and the agent is standing up. Our fine-tuning task on Walker2d is more challenging, as we require the x-coordinate to be larger than 4; on the other hand, the agent can be rewarded even if it is not standing up, since our downstream task is purely goal-reaching. Interestingly, we noticed that UPSIDE may reach the desired goal location without standing up (e.g., by crawling), which may occur as there is no incentive in UPSIDE to remain standing up, only to be discriminable.
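To make the two downstream tasks above concrete, here is a minimal sketch of their sparse reward functions, written directly in terms of the agent's x-coordinate (which in practice must be extracted from the environment state by a wrapper; that extraction is not shown).

```python
def sparse_half_cheetah_reward(x_position: float) -> float:
    """Reward 1 once the x-coordinate exceeds 15, 0 otherwise (standard Sparse-Half-Cheetah)."""
    return 1.0 if x_position > 15.0 else 0.0

def sparse_walker2d_reward(x_position: float) -> float:
    """Reward 1 once the x-coordinate exceeds 4, 0 otherwise; unlike Sparse-Walker2d,
    no standing-up condition is required since the task is purely goal-reaching."""
    return 1.0 if x_position > 4.0 else 0.0
```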