|
# IMITATION LEARNING BY REINFORCEMENT LEARNING
|
|
|
**Kamil Ciosek** |
|
Spotify |
|
kamilc@spotify.com |
|
|
|
ABSTRACT |
|
|
|
Imitation learning algorithms learn a policy from demonstrations of expert behavior. |
|
We show that, for deterministic experts, imitation learning can be done by reduction |
|
to reinforcement learning with a stationary reward. Our theoretical analysis both |
|
certifies the recovery of expert reward and bounds the total variation distance |
|
between the expert and the imitation learner, showing a link to adversarial imitation |
|
learning. We conduct experiments which confirm that our reduction works well in |
|
practice for continuous control tasks. |
|
|
|
1 INTRODUCTION |
|
|
|
Typically, reinforcement learning (RL) assumes access to a pre-specified reward and then learns |
|
a policy maximizing the expected average of this reward along a trajectory. However, specifying |
|
rewards is difficult for many practical tasks (Atkeson & Schaal, 1997; Zhang et al., 2018; Ibarz et al., |
|
2018). In such cases, it is convenient to instead perform Imitation Learning (IL), learning a policy |
|
from expert demonstrations. |
|
|
|
There are two major categories of Imitation Learning algorithms: behavioral cloning and inverse |
|
reinforcement learning. Behavioral cloning learns the policy by supervised learning on expert data, |
|
but is not robust to training errors, failing in settings where expert data is limited (Ross & Bagnell, |
|
2010). Inverse reinforcement learning (IRL) achieves improved performance on limited data by |
|
constructing reward signals and calling an RL oracle to maximize these rewards (Ng et al., 2000). |
|
|
|
The most versatile IRL method is adversarial IL (Ho & Ermon, 2016; Li et al., 2017; Ghasemipour |
|
et al., 2020), which minimizes a divergence between the distribution of data produced by the agent |
|
and provided by the expert. Adversarial IL learns a representation and a policy simultaneously by |
|
using a non-stationary reward obtained from a discriminator network. However, training adversarial |
|
IL combines two components which are hard to stabilize: a discriminator network, akin to the one |
|
used in GANs, as well as a policy, typically learned with an RL algorithm with actor and critic |
|
networks. This complexity makes the training process very brittle. |
|
|
|
There is a clear need for imitation learning algorithms that are simpler and easier to deploy. To |
|
address this need, Wang et al. (2019) proposed to reduce imitation learning to a single instance of |
|
a reinforcement learning problem, where the reward is defined to be one for state-action pairs from the
|
expert trajectory and zero for other state-action pairs. A closely related, but not identical, algorithm |
|
has been proposed by Reddy et al. (2020) (we describe the differences in Section 5). However, while |
|
empirical performance of these approaches has been good, they enjoy no performance guarantees at |
|
all, even in the asymptotic setting where expert data is infinite. |
|
|
|
**Contributions** We fill in the missing justification for this algorithm, providing the needed theoretical analysis. Specifically, in Sections 3 and 4, we show a total variation bound between the expert
|
policy and the imitation policy, providing a high-probability performance guarantee for a finite dataset |
|
of expert data and linking the reduction to adversarial imitation learning algorithms. For stochastic |
|
experts, we describe how the reduction fails, completing the analysis. Moreover, in Section 6, we |
|
empirically evaluate the performance of the reduction as the amount of available expert data varies. |
|
|
|
|
|
|
|
|
2 PRELIMINARIES |
|
|
|
**Markov Decision Process** An average-reward Markov Decision Process (Puterman, 2014; Feinberg & Shwartz, 2012) is a tuple (S, A, T, R, s1), where S is a state space, A is the action space, |
|
$T : S \times A \to P(S)$ is the transition model, $R : S \times A \to [0, 1]$ is a bounded reward function and $s_1$
|
is the initial state. Here, we write P (X) to denote probability distributions over a set X. A stationary |
|
policy π : S → _P_ (A) maps environment states to distributions over actions. A policy π induces a |
|
Markov chain Pπ over the states. In the theoretical part of the paper, we treat MDPs with finite state |
|
and action spaces. Given the starting state s1 of the MDP, and a policy π, the limiting distribution |
|
over states is defined as |
|
|
|
|
|
$$\rho^{\pi}_S = \mathbf{1}(s_1)^{\top} \lim_{N \to \infty} \frac{1}{N}\sum_{i=0}^{N-1} P_{\pi}^{i}, \qquad (1)$$
|
|
|
|
|
where 1(s1) denotes the indicator vector. We adopt the convention that the subscript S indicates a |
|
distribution over states and no subscript indicates a distribution over state-action pairs. We denote |
|
$\rho^{\pi}(s, a) = \rho^{\pi}_S(s)\,\pi(a|s)$. While the limit in equation 1 is guaranteed to exist for all finite MDPs (Puterman, 2014), without requiring ergodicity, in this paper we consider policies that induce ergodic chains. Correspondingly, our notation for $\rho^{\pi}_S$ and $\rho^{\pi}$ does not include the dependence on the initial
|
state. The expected per-step reward of π is defined as |
|
|
|
$$V^{\pi} = \lim_{N \to \infty} \mathbb{E}_{\pi}\left[J_N / N \mid s_1\right] = \mathbb{E}_{\rho^{\pi}}[R(s, a)], \qquad (2)$$

where $J_N = \sum_{t=1}^{N} R(s_t, a_t)$ is the total return. In this paper we consider undiscounted MDPs because they are enough to model the essential properties of the imitation learning problem while being conceptually simpler than discounted MDPs. However, we believe that similar results could be obtained for discounted MDPs.
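For illustration only (this sketch is not part of the paper), the limiting distribution of equation 1 and the per-step value of equation 2 can be approximated for a small finite MDP by Cesàro-averaging powers of the policy-induced chain. Here `P_pi`, `pi` and `R` are assumed to be NumPy arrays of shapes $|S|\times|S|$, $|S|\times|A|$ and $|S|\times|A|$.

```python
import numpy as np

def stationary_distribution(P_pi, s1, n_iters=10_000):
    """Approximate the limiting state distribution rho^pi_S of equation 1
    by Cesaro-averaging the state distribution along powers of P_pi."""
    n_states = P_pi.shape[0]
    d = np.zeros(n_states)
    d[s1] = 1.0                      # indicator vector 1(s1)
    avg = np.zeros(n_states)
    for _ in range(n_iters):
        avg += d
        d = d @ P_pi                 # advance one step of the chain
    return avg / n_iters

def per_step_reward(P_pi, pi, R, s1):
    """Expected per-step reward V^pi of equation 2, computed as
    E_{rho^pi}[R(s, a)] with rho^pi(s, a) = rho^pi_S(s) * pi(a|s)."""
    rho_S = stationary_distribution(P_pi, s1)
    rho_sa = rho_S[:, None] * pi     # state-action occupancy
    return float(np.sum(rho_sa * R))
```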
|
|
|
**Expert Dataset** Assume that the expert follows an ergodic policy πE. Denote the corresponding |
|
distribution over expert state-action pairs as $\rho^E(s, a) = \rho^{\pi_E}(s, a)$, which does not depend on the initial state by ergodicity. Consider a finite trajectory $s_1, a_1, s_2, a_2, \dots, s_N, a_N$. Denote by $\hat\rho(s, a) = \frac{1}{N}\sum_{t=1}^{N} \mathbf{1}\{(s, a) = (s_t, a_t)\}$ the histogram (empirical distribution) of data seen along this trajectory and denote by $\hat\rho_S(s) = \frac{1}{N}\sum_{t=1}^{N} \mathbf{1}\{s = s_t\}$ the corresponding histogram over the states. Denote by $D$ the support of $\hat\rho$, i.e. the set of state-action pairs visited by the expert.
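These empirical objects are straightforward to compute. The hypothetical helper below (not from the paper) builds $\hat\rho$ and the support $D$ from a list of state-action pairs.

```python
from collections import Counter

def expert_histogram(trajectory):
    """Empirical distribution rho_hat over state-action pairs and its
    support D, given a trajectory [(s_1, a_1), ..., (s_N, a_N)]."""
    counts = Counter(trajectory)
    n = len(trajectory)
    rho_hat = {sa: c / n for sa, c in counts.items()}
    support_D = set(rho_hat)    # state-action pairs visited by the expert
    return rho_hat, support_D
```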
|
|
|
**Total Variation** Consider probability distributions defined on a discrete set X. The total variation |
|
distance between distributions is defined as |
|
|
|
$$\|\rho_1 - \rho_2\|_{\mathrm{TV}} = \sup_{M \subseteq X} |\rho_1(M) - \rho_2(M)|. \qquad (3)$$
|
|
|
In our application, the set $X$ is either the set of MDP states $S$ or the set of state-action pairs
|
_S × A. Below, we restate in our notation standard results about the total variation distance. See_ |
|
Appendix D for proofs. |
|
|
|
**Lemma 1.** *If $\mathbb{E}_{\rho_1}[v] - \mathbb{E}_{\rho_2}[v] \le \epsilon$ for any vector $v \in [0, 1]^{|X|}$, then $\|\rho_1 - \rho_2\|_{\mathrm{TV}} \le \epsilon$.*
|
|
|
**Lemma 2.** *If $\|\rho_1 - \rho_2\|_{\mathrm{TV}} \le \epsilon$, then $\mathbb{E}_{\rho_1}[v] - \mathbb{E}_{\rho_2}[v] \le 2\epsilon$ for any vector $v \in [0, 1]^{|X|}$.*
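As a sanity check of these definitions (illustrative only, not part of the paper), the snippet below computes the total variation distance of equation 3 for discrete distributions via the equivalent $\tfrac{1}{2}\ell_1$ identity and verifies that the indicator of the set where $\rho_1$ exceeds $\rho_2$ attains exactly that gap, which is the worst-case vector in Lemma 1.

```python
import numpy as np

def tv_distance(rho1, rho2):
    """Total variation distance between discrete distributions, using
    TV = 0.5 * sum_x |rho1(x) - rho2(x)| (equivalent to equation 3)."""
    return 0.5 * np.abs(rho1 - rho2).sum()

# The supremum in equation 3 is attained by the set where rho1 > rho2; its
# indicator is a vector in [0, 1]^|X| realizing the gap referenced in Lemma 1.
rho1 = np.array([0.5, 0.3, 0.2])
rho2 = np.array([0.4, 0.4, 0.2])
v = (rho1 > rho2).astype(float)
assert np.isclose(rho1 @ v - rho2 @ v, tv_distance(rho1, rho2))
```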
|
|
|
**Imitation Learning** Learning to imitate means obtaining a policy that mimics expert behavior. |
|
This can be formalized in two ways. First, we can seek an imitation policy $\pi_I$ which obtains an
|
expected per-step reward that is ϵ-close to the expert on any reward signal bounded in the interval |
|
|
|
[0, 1]. Denote the limiting distribution of state-action tuples generated by the imitation learner with |
|
$\rho^I(s, a)$. Formally, we want to satisfy
|
|
|
$$\forall R : S \times A \to [0, 1], \quad \mathbb{E}_{\rho^E}[R] - \mathbb{E}_{\rho^I}[R] \le \epsilon, \qquad (4)$$
|
|
|
where ϵ is a small constant. Second, we can seek to ensure that the distributions of state-action pairs |
|
generated by the expert and the imitation learner are similar (Ho & Ermon, 2016; Finn et al., 2016). |
|
In particular, if we measure the divergence between distributions with total variation, we want to |
|
ensure |
|
|
|
$$\|\rho^E - \rho^I\|_{\mathrm{TV}} \le \epsilon. \qquad (5)$$
|
|
|
|
|
|
|
|
**Algorithm 1 Imitation Learning by Reinforcement Learning (ILR)** |
|
|
|
**Require: expert dataset D, ENVIRONMENT (without access to extrinsic reward)** |
|
|
|
_Rint(s, a) ←_ 1 {(s, a) ∈ _D}_ |
|
_πI ←_ RL-SOLVER(ENVIRONMENT, Rint) |
|
|
|
Lemmas 1 and 2 show that equations 4 and 5 are closely related. In other words, attaining expected
|
per-step reward close to the expert’s is the same (up to a multiplicative constant) as closeness in total |
|
variation between the state-action distribution of the expert and the imitation learner. We provide |
|
further background on adversarial imitation learning and other divergences in Section 5. |
|
|
|
**Markov Chains** When coupled with the dynamics of the MDP, a policy induces a Markov chain.
|
In this paper, we focus our attention on chains which are irreducible and aperiodic. Crucially, such |
|
chains approach the stationary distribution at an exponential rate. In the following Lemma, we |
|
formalize this insight. The proof is given in Appendix D. |
|
**Lemma 3.** *Denote with $P_E$ the Markov chain induced by an irreducible and aperiodic expert policy. The mixing time of the chain $\tau_{\mathrm{mix}}$, defined as the smallest $t$ such that $\max_{s'} \|\mathbf{1}(s')^{\top} P_E^{t} - \rho^E_S\|_{\mathrm{TV}} \le \frac{1}{4}$, is bounded. Moreover, for any probability distribution $p_1$, we have $\sum_{t=0}^{\infty} \|p_1^{\top} P_E^{t} - \rho^E_S\|_{\mathrm{TV}} \le 2\tau_{\mathrm{mix}}$.*
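For a small finite chain, the mixing time used in Lemma 3 can be computed directly from its definition. The sketch below is an illustrative implementation (not from the paper), assuming `P_E` is the $|S|\times|S|$ transition matrix of the expert chain and `rho_E_S` its stationary distribution.

```python
import numpy as np

def mixing_time(P_E, rho_E_S, tol=0.25, t_max=10_000):
    """Smallest t with max_{s'} ||1(s')^T P_E^t - rho^E_S||_TV <= 1/4,
    i.e. the mixing time tau_mix of Lemma 3 for a finite, ergodic chain."""
    P_t = np.eye(P_E.shape[0])               # P_E^0
    for t in range(t_max + 1):
        # row s' of P_t is the distribution 1(s')^T P_E^t
        worst_tv = 0.5 * np.abs(P_t - rho_E_S).sum(axis=1).max()
        if worst_tv <= tol:
            return t
        P_t = P_t @ P_E
    raise RuntimeError("chain did not mix within t_max steps")
```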
|
|
|
3 IMITATION LEARNING BY REINFORCEMENT LEARNING
|
|
|
**Algorithm** To perform imitation learning, we first obtain an expert dataset D and construct an |
|
intrinsic reward |
|
|
|
$$R_{\mathrm{int}}(s, a) = \mathbf{1}\{(s, a) \in D\}. \qquad (6)$$
|
|
|
We can then use any RL algorithm to solve for this reward. We note that for finite MDPs equation 6 |
|
exactly matches the algorithm proposed by Wang et al. (2019) as a heuristic. The reward defined in |
|
equation 6 is intrinsic, meaning we construct it artificially in order to elicit a certain behavior from |
|
the agent. This is in contrast to an extrinsic reward, which is what the agent gathers in the real world |
|
and what we want to recover. We will see in Proposition 1 that the two are in fact closely related: solving for the intrinsic reward guarantees recovery of the extrinsic reward.
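In code, the reduction amounts to a thin wrapper around an off-the-shelf RL solver. The sketch below is a minimal illustration, assuming a hypothetical `rl_solver(environment, reward_fn)` that maximizes the expected per-step value of the supplied reward; it is not the implementation used in the experiments.

```python
def ilr(expert_dataset, environment, rl_solver):
    """Imitation Learning by Reinforcement Learning (Algorithm 1):
    reward 1 on expert state-action pairs, 0 elsewhere, then run RL once."""
    D = set(expert_dataset)                    # support of the expert histogram

    def r_int(state, action):                  # intrinsic reward of equation 6
        return 1.0 if (state, action) in D else 0.0

    return rl_solver(environment, r_int)       # the imitation policy pi_I
```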
|
|
|
**Guarantee on Imitation Policy** The main focus of the paper lies in showing theoretical properties
|
of the imitation policy obtained when solving for the reward signal defined in equation 6. Our formal |
|
guarantees hold under three assumptions. |
|
**Assumption 1. The expert policy is deterministic.** |
|
**Assumption 2. The expert policy induces an irreducible and aperiodic Markov chain.** |
|
**Assumption 3. The imitation learner policy induces an irreducible and aperiodic Markov chain.** |
|
|
|
Assumption 1 is critical for our proof to go through and cannot be relaxed. It is also essential to |
|
rationalize the approach of Wang et al. (2019). In fact, no reduction involving a single call to an RL |
|
solver with stationary reward can exist for stochastic experts since for any MDP there is always an |
|
optimal policy which is deterministic. Assumptions 2 and 3 could in principle be relaxed, to allow |
|
for periodic chains, but it would complicate the reasoning, which we wanted to avoid. We note that |
|
we only interact with the expert policy via a finite dataset of N samples. |
|
|
|
Our main contribution is Proposition 1, in which we quantify the performance of our imitation learner. |
|
We state it below and prove it in Section 4. |
|
**Proposition 1.** *Consider an imitation learner trained on a dataset of size $N$, which attains the limiting distribution of state-action pairs $\rho^I$. Under Assumptions 1, 2 and 3, given that we have $N \ge \max\left\{800|S|,\ 450\log\frac{2}{\delta}\right\}\tau_{\mathrm{mix}}^{3}\,\eta^{-2}$ expert demonstrations, with probability at least $1 - \delta$, the imitation learner attains total variation distance from the expert of at most*

$$\|\rho^E - \rho^I\|_{\mathrm{TV}} \le \eta. \qquad (7)$$
|
|
|
_The constant τmix is the mixing time of the expert policy and has been defined in Section 2. Moreover,_ |
|
_with the same probability, for any bounded extrinsic reward function R, the imitation learner achieves_ |
|
|
|
|
|
|
|
|
_expected per-step extrinsic reward of at least_ |
|
$$\mathbb{E}_{\rho^I}[R] \ge \mathbb{E}_{\rho^E}[R] - \eta. \qquad (8)$$
|
|
|
Proposition 1 shows that the policy learned by the imitation learner satisfies two important desiderata: we
|
can guarantee both the closeness of the generated state-action distribution to the expert distribution |
|
as well as recovery of expert reward. Moreover, the total variation bound in equation 7 links the |
|
obtained policy to adversarial imitation learning algorithms. We describe this link in more detail in |
|
Section 5. |
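To get a sense of the scale of the bound, the sample-size requirement of Proposition 1 can be evaluated directly; the constants plugged in below are illustrative placeholders, not values taken from the paper's experiments.

```python
import math

def required_samples(n_states, tau_mix, eta, delta):
    """Dataset size N >= max(800|S|, 450 log(2/delta)) * tau_mix^3 / eta^2
    sufficient for the guarantee of Proposition 1."""
    return math.ceil(max(800 * n_states, 450 * math.log(2 / delta))
                     * tau_mix ** 3 / eta ** 2)

# Illustrative placeholders: |S| = 100, tau_mix = 5, eta = 0.1, delta = 0.05.
print(required_samples(100, 5, 0.1, 0.05))     # 1000000000
```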
|
|
|
4 PROOF |
|
|
|
**Structure of Results** We begin by showing a result about mixing in Markov chains in Section 4.1. |
|
We then give our proof, which has two main parts. In the first part, in Section 4.2, we show that it |
|
is possible to achieve a high expected per-step intrinsic reward defined in equation 6. In the second |
|
part, in Section 4.3, we show that achieving a large expected per-step intrinsic reward guarantees a
|
large expected per-step extrinsic reward for any extrinsic reward bounded in [0, 1]. In Section 4.4, we |
|
combine these results and prove Proposition 1. |
|
|
|
4.1 MIXING IN MARKOV CHAINS |
|
|
|
We now show a lemma that quantifies how fast the histogram approaches the stationary distribution. The proof of the lemma relies on standard results about the total variation distance between
|
probability distributions as well as generic results about mixing in MDPs, developed by Paulin (2015). |
|
**Lemma 4.** *The total variation distance between the expert distribution and the expert histogram $\hat\rho$ based on $N$ samples can be bounded as*

$$\|\rho^E - \hat\rho\|_{\mathrm{TV}} \le \epsilon + \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}} \qquad (9)$$

*with probability at least $1 - 2\exp\left(-\frac{\epsilon^2 N}{4.5\,\tau_{\mathrm{mix}}}\right)$ (where the probability space is defined by the process generating the expert histogram).*
|
|
|
*Proof.* We first prove the statement for histograms over the states. Recall the notation $\hat\rho_S(s) = \frac{1}{N}\sum_{t=1}^{N} \mathbf{1}\{s_t = s\}$. Recall that we denote the stationary distribution (over states) of the Markov chain induced by the expert with $\rho^E_S$.
|
|
|
First, instantiating Proposition 3.16 of Paulin (2015), we obtain |
|
|
|
$$\mathbb{E}_{\hat\rho_S}\!\left[\|\rho^E_S - \hat\rho_S\|_{\mathrm{TV}}\right] \le \sum_s \min\left\{\sqrt{\frac{4\rho^E_S(s)}{N\gamma_{\mathrm{ps}}}},\ \rho^E_S(s)\right\},$$

where $\gamma_{\mathrm{ps}}$ is the pseudo spectral gap of the chain. We rewrite the right-hand side as

$$\sum_s \min\left\{\sqrt{\frac{4\rho^E_S(s)}{N\gamma_{\mathrm{ps}}}},\ \rho^E_S(s)\right\} \le \sum_s \sqrt{\frac{4\rho^E_S(s)}{N\gamma_{\mathrm{ps}}}} \le \sqrt{\frac{4|S|}{N\gamma_{\mathrm{ps}}}},$$

where the last inequality follows from the Cauchy-Schwarz inequality. Using equation 3.9 of Paulin (2015), we have $\gamma_{\mathrm{ps}} \ge \frac{1}{2\tau_{\mathrm{mix}}}$. Taken together, this gives

$$\mathbb{E}_{\hat\rho_S}\!\left[\|\rho^E_S - \hat\rho_S\|_{\mathrm{TV}}\right] \le \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}}. \qquad (10)$$
|
|
|
Instantiating Proposition 2.18 of Paulin (2015), we obtain |
|
|
|
$$P\left(\left|\,\|\rho^E_S - \hat\rho_S\|_{\mathrm{TV}} - \mathbb{E}_{\hat\rho_S}\!\left[\|\rho^E_S - \hat\rho_S\|_{\mathrm{TV}}\right]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{\epsilon^2 N}{4.5\,\tau_{\mathrm{mix}}}\right). \qquad (11)$$
|
|
|
|
|
Combining equations 10 and 11, we obtain |
|
|
|
|
|
$$P\left(\|\rho^E_S - \hat\rho_S\|_{\mathrm{TV}} \ge \epsilon + \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}}\right) \le 2\exp\left(-\frac{\epsilon^2 N}{4.5\,\tau_{\mathrm{mix}}}\right).$$

Since the expert policy is deterministic, the total variation distance between the distributions of state-action tuples and states is the same, i.e. $\|\rho^E_S - \hat\rho_S\|_{\mathrm{TV}} = \|\rho^E - \hat\rho\|_{\mathrm{TV}}$.
|
|
|
|
|
|
|
|
4.2 HIGH EXPECTED INTRINSIC PER-STEP REWARD IS ACHIEVABLE |
|
|
|
We now want to show that it is possible for the imitation learner to attain a large intrinsic reward. |
|
Informally, the proof asks what intrinsic reward the expert would have achieved. We then conclude |
|
that a learner specifically optimizing the intrinsic reward will obtain at least as much. |
|
**Lemma 5.** *Generating an expert histogram with $N$ points, with probability at least $1 - 2\exp\left(-\frac{\epsilon^2 N}{4.5\,\tau_{\mathrm{mix}}}\right)$, we obtain a dataset such that a policy $\pi_I$ maximizing the intrinsic reward $R_{\mathrm{int}}(s, a) = \mathbf{1}\{(s, a) \in D\}$ satisfies $\mathbb{E}_{\rho^I}[R_{\mathrm{int}}] \ge 1 - \epsilon - \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}}$, where we used the shorthand notation $\rho^I = \rho^{\pi_I}$ to denote the state-action distribution of $\pi_I$.*
|
|
|
|
|
*Proof.* We invoke Lemma 4, obtaining $\|\rho^E - \hat\rho\|_{\mathrm{TV}} \le \epsilon + \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}}$ with probability as in the statement of the lemma. First, we prove

$$\sum_{s,a} \rho^E(s, a)\,\mathbf{1}\{\hat\rho(s, a) = 0\} \le \epsilon + \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}}, \qquad (12)$$

by setting $\rho_1 = \rho^E$, $\rho_2 = \hat\rho$ and $M = \{(s, a) : \hat\rho(s, a) = 0\}$ in equation 3.
|
|
|
Combining |
|
|
|
$$1 = \sum_{s,a} \rho^E(s, a) = \sum_{s,a} \rho^E(s, a)\,\mathbf{1}\{\hat\rho(s, a) = 0\} + \sum_{s,a} \rho^E(s, a)\,\mathbf{1}\{\hat\rho(s, a) > 0\} \qquad (13)$$

and equation 12, we obtain

$$\sum_{s,a} \rho^E(s, a)\,\mathbf{1}\{\hat\rho(s, a) > 0\} \ge 1 - \epsilon - \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}}, \qquad (14)$$

which means that the expert policy achieves expected per-step intrinsic reward of at least $1 - \epsilon - \sqrt{\frac{8|S|\,\tau_{\mathrm{mix}}}{N}}$. This lower bounds the expected per-step reward obtained by the optimal policy.
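The quantity bounded in equation 14 is easy to compute for a concrete chain: it is simply the expert's stationary mass covered by the dataset support. The hypothetical helper below (not from the paper) evaluates it, assuming $\rho^E$ is given as a dictionary over state-action pairs.

```python
def expert_intrinsic_reward(rho_E, support_D):
    """Expected per-step intrinsic reward of the expert itself,
    sum over (s, a) in D of rho^E(s, a), which lower-bounds the value
    attained by any policy maximizing R_int (cf. equation 14)."""
    return sum(p for sa, p in rho_E.items() if sa in support_D)
```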
|
|
|
4.3 MAXIMIZING INTRINSIC REWARD LEADS TO HIGH PER-STEP EXTRINSIC REWARD |
|
|
|
|
|
We now aim to prove that the intrinsic and extrinsic rewards are connected, i.e. maximizing the |
|
intrinsic reward leads to a large expected per-step extrinsic reward. The proofs in this section are |
|
based on the insight that, by construction of intrinsic reward in equation 6, attaining intrinsic reward |
|
of one in a given state implies agreement with the expert. In the following Lemma, we quantify the |
|
outcome of achieving such agreement for ℓ consecutive steps. |
|
**Lemma 6.** *Assume an agent traverses a sequence of $\ell$ state-action pairs, in a way consistent with the expert policy. Denote the expert's expected extrinsic per-step reward with $\mathbb{E}_{\rho^E}[R]$. Denote by $\tilde\rho^{p_1}_t(s, a) = (p_1^{\top} P_E^{t})(s)\,\pi_E(a|s)$ the expected state-action occupancy at time $t$, starting in state distribution $p_1$. The per-step extrinsic reward $\tilde V^{\ell}_{p_1}$ of the agent in expectation over realizations of the sequence satisfies*

$$\tilde V^{\ell}_{p_1} = \sum_{s,a} \left(\sum_{t=0}^{\ell-1} \tfrac{1}{\ell}\,\tilde\rho^{p_1}_t(s, a)\right) R(s, a) \ \ge\ \mathbb{E}_{\rho^E}[R] - \frac{4\tau_{\mathrm{mix}}}{\ell}. \qquad (15)$$
|
|
|
*Proof.* Invoking Lemma 2, we have that $\mathbb{E}_{\rho^E}[R] - \mathbb{E}_{\tilde\rho^{p_1}_t}[R] \le 2\|\tilde\rho^{p_1}_t - \rho^E\|_{\mathrm{TV}}$ for any timestep $t$. In other words, the expected per-step reward obtained in step $t$ of the sequence is at least $\mathbb{E}_{\rho^E}[R] - 2\|\tilde\rho^{p_1}_t - \rho^E\|_{\mathrm{TV}}$. The per-step reward in the sequence is at least

$$\tilde V^{\ell}_{p_1} \ge \frac{1}{\ell}\sum_{i=0}^{\ell-1}\left(\mathbb{E}_{\rho^E}[R] - 2\|\tilde\rho^{p_1}_i - \rho^E\|_{\mathrm{TV}}\right) = \mathbb{E}_{\rho^E}[R] - \frac{2}{\ell}\sum_{i=0}^{\ell-1}\|\tilde\rho^{p_1}_i - \rho^E\|_{\mathrm{TV}} \ \ge\ \mathbb{E}_{\rho^E}[R] - \frac{2}{\ell}\sum_{i=0}^{\infty}\|\tilde\rho^{p_1}_i - \rho^E\|_{\mathrm{TV}}.$$

Invoking Lemma 3, we have that $\sum_{t=0}^{\infty}\|p_1^{\top} P_E^{t} - \rho^E_S\|_{\mathrm{TV}} \le 2\tau_{\mathrm{mix}}$. Since the expert policy is deterministic, distances between distributions of states and state-action pairs are the same. We thus get $\tilde V^{\ell}_{p_1} \ge \mathbb{E}_{\rho^E}[R] - \frac{4\tau_{\mathrm{mix}}}{\ell}$.
|
|
|
|
|
|
|
|
In Lemma 6, we have shown that, on average, agreeing with the expert for a number of steps |
|
guarantees a certain level of extrinsic reward. We will now use this result to guarantee extrinsic |
|
reward obtained over a long trajectory. |
|
|
|
**Lemma 7.** *For any extrinsic reward signal $R$ bounded in $[0, 1]$, an imitation learner which attains expected per-step intrinsic reward of $\mathbb{E}_{\rho^I}[R_{\mathrm{int}}] = 1 - \kappa$ also attains extrinsic per-step reward of at least*

$$\mathbb{E}_{\rho^I}[R] \ge (1 - \kappa)\,\mathbb{E}_{\rho^E}[R] - 4\tau_{\mathrm{mix}}\kappa,$$

*with probability one, where $\mathbb{E}_{\rho^E}[R]$ is the expected per-step extrinsic reward of the expert.*
|
|
|
We provide the proof in Appendix D. |
|
|
|
4.4 BRINGING THE PIECES TOGETHER |
|
|
|
We now show how Lemma 5 and Lemma 7 can be combined to obtain Proposition 1 (stated in Section |
|
3). |
|
|
|
*Proof of Proposition 1.* We use the assumption that $N \ge \max\left\{800|S|,\ 450\log\frac{2}{\delta}\right\}\tau_{\mathrm{mix}}^{3}\,\eta^{-2}$. Since $\tau_{\mathrm{mix}}$ is either zero or $\tau_{\mathrm{mix}} \ge 1$, this implies
|
|
|
$$N \ge \max\left\{32\,|S|\,\tau_{\mathrm{mix}}\,(1 + 4\tau_{\mathrm{mix}})^{2}\,\eta^{-2},\ \log\tfrac{2}{\delta}\cdot 18\,\tau_{\mathrm{mix}}\,(1 + 4\tau_{\mathrm{mix}})^{2}\,\eta^{-2}\right\}. \qquad (16)$$
|
|
|
|
|
We will instantiate Lemma 5 with |
|
|
|
|
|
$$\epsilon = \frac{\eta}{2 + 8\tau_{\mathrm{mix}}}. \qquad (17)$$
|
|
|
|
|
Equations 16 and 17 imply that $N \ge \log\tfrac{2}{\delta}\cdot 18\,\tau_{\mathrm{mix}}\,(1 + 4\tau_{\mathrm{mix}})^{2}\,\eta^{-2} = \log\tfrac{2}{\delta}\cdot 4.5\,\tau_{\mathrm{mix}}\,\epsilon^{-2}$. We can rewrite this as $\log\tfrac{2}{\delta} \le \frac{N\epsilon^{2}}{4.5\,\tau_{\mathrm{mix}}}$, which implies $-\frac{\epsilon^{2} N}{4.5\,\tau_{\mathrm{mix}}} \le \log\tfrac{\delta}{2}$. This implies that
|
|