|
# MULTI-AGENT PERFORMATIVE PREDICTION: FROM GLOBAL STABILITY AND OPTIMALITY TO CHAOS |
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
The recent framework of performative prediction (Perdomo et al., 2020) is aimed at capturing settings where predictions influence the target/outcome they aim to predict. In this paper, we introduce a natural multi-agent version of this framework, where multiple decision makers try to predict the same outcome. We showcase that such competition can result in interesting phenomena by proving the possibility of phase transitions from stability to instability and eventually chaos. Specifically, we present settings of multi-agent performative prediction where, under suitable conditions, the learning dynamics lead to global stability and optimality. In the opposite direction, when the agents are not sufficiently cautious in their learning/update rates, we show that instability and in fact formal chaos is possible. We complement our theoretical predictions with simulations showcasing the predictive power of our results.
|
|
|
1 INTRODUCTION |
|
|
|
Performative prediction (Perdomo et al., 2020) is a recently introduced framework that focuses on a |
|
natural but largely unexplored element of supervised learning. In many practical cases the predictive |
|
model can affect the very outcome that it is trying to predict. For example, predictions about which |
|
part of a road network will have high congestion trigger responses from the drivers which affect the |
|
resulting traffic realization leading to a shift of the target distribution. In such settings, (Perdomo |
|
et al., 2020) explored conditions for the existence and approximate optimality of stable equilibria of |
|
such processes. |
|
|
|
One possible way to interpret the performative prediction setting is a single agent “game”, where the |
|
predictive agent is playing a game against himself. An agent chooses the parameters of his model |
|
as his actions but the predictive accuracy/cost of the model depends on his own past actions. Fixed |
|
points of this process do not allow for profitable deviations. Once cast in this light, it becomes |
|
self-evident that the restriction to a single predictive agent/model is arguably only the first step in |
|
capturing more general phenomena where there is a closed loop between predictive models and their |
|
environment. This motivates our central question: _What is the interplay between stability and optimality in cases where multiple predictive models operate in parallel to each other? Can the competition between multiple models lead to novel phenomena such as phase transitions from stability to instability and chaos?_
|
|
|
A natural setting to consider, for example, is market competition, where multiple hedge funds try to simultaneously predict future prices and volatility of financial instruments. Of course, as they act upon their predictions, they also move the prices of these instruments in a highly correlated way. As more agents enter the market and the competition becomes increasingly fierce, is it possible that at some point the market flips from quickly discovering accurate stable predictions reflecting the underlying market fundamentals to self-induced unpredictability and chaos? When it comes to performative prediction, can too many cooks spoil the soup?
|
|
|
**Our model** Standard supervised learning consists of three components: a set of predictive models, a loss function, and a data distribution. The learner (agent) observes samples of the distribution, and then chooses a predictive model. When predictions are performative, the agent's choice of a predictive model influences the data distribution. Thus, instead of a fixed data distribution, a performative prediction problem also specifies a distribution map: a mapping from the agent's predictive models to
|
|
|
|
|
|
|
|
data distributions. We further propose multi-agent performative prediction, which models the influence of multiple agents' decisions on the data distribution, and ask whether these influences lead to convergent or chaotic systems.
|
|
|
To understand the framework of multi-agent performative prediction, we study a natural regression problem with multi-agent location-scale distribution maps (definition 2.2) where the agents' influence is linear in the agents' models. Specifically, the data consist of features and outcomes. For each agent i ∈ [n], his influence on the distribution of the outcome is his model weighted by a scalar λ_i > 0. When n = 1, our multi-agent location-scale distribution map is a special case of the location-scale family in Miller et al. (2021). In the market example, each hedge fund tries to predict the price (outcome) based on macroeconomic data or other information (features). The influence parameters λ_1, . . ., λ_n can be seen as each hedge fund's capital, which determines its influence on the distribution of the future price.
|
|
|
Finally, we assume the agents do not know the dependency between their models and the data distribution, and can only improve their predictive models myopically and iteratively. In this paper, the agents myopically use a reinforcement learning algorithm, exponentiated gradient for linear regression (Kivinen & Warmuth, 1997), to improve their predictive models in rounds. We study the long-term behavior, and ask whether the system converges to the performative stable and optimal point or behaves chaotically.
|
|
|
**Our results** We first study basic properties of our multi-agent performative prediction with the multi-agent location-scale distribution map. We show 1) the existence of a performative stable point in our setting (proposition 3.1) and 2) that performative stability and performative optimality are equivalent (proposition 3.2). This equivalence allows us to focus on the dynamical behavior of the system.
|
|
|
In section 4, we introduce learning dynamics to the multi-agent performative prediction, and study their long-term behavior. We provide a threshold result which depends on the learning rates of exponentiated gradient descent and the collective influence. Theorem 4.1 shows the dynamics of exponentiated gradient descent converge to the performative stable and optimal point when the learning rate is small enough. Our convergence result in theorem 4.1 holds when the feature space is multi-dimensional, and every learning agent can use a different learning rate starting at an arbitrary interior state. Our exact convergence result also applies in the single-agent performative prediction setting. In contrast, previous convergence results in Perdomo et al. (2020); Miller et al. (2021); Mendler-Dünner et al. (2020) only show that their dynamics converge to a small neighborhood of the performative stable point.
|
|
|
On the other hand, section 4.2 shows the dynamics can have chaotic behavior if the collective influence L_n is large enough. Specifically, theorem 4.6 shows that even when the feature space is in R^2 these systems provably exhibit Li-Yorke chaos when the collective influence L_n = Σ_i λ_i is large enough. This implies that there exists an uncountable "scrambled" set so that given any two initial conditions in the set, the lim inf of the distance between the two trajectories is zero, but the lim sup of their distance is positive (definition 2.4). The chaotic result in theorem 4.6 also holds for the original single-agent performative prediction so long as the agent's influence is large enough, and thus complements previous performative prediction works on convergence behavior, which primarily assume that the agents' influence on the data distribution is sufficiently small. Moreover, no matter how small the agents' learning rates, theorem 4.6 shows that chaos is inevitable in some performative prediction settings once the number of agents exceeds a carrying capacity. After that, the system becomes totally unpredictable, with small perturbations exploding exponentially fast.
|
|
|
Finally, section 5 provides numerical examples of our dynamics, showing both convergent and chaotic behavior. Additionally, through simulation, we demonstrate that our convergence and chaos results also hold when the agents can only access noisy estimates of the gradient and run stochastic exponentiated gradient descent.
|
|
|
**Related work** Data distribution shift is not a new topic in ML, but earlier works focused primarily on exogenous changes to the data generating distribution. Performativity in machine learning was introduced by Perdomo et al. (2020). The original work and several follow-ups study the discrepancy between performative stability and performative optimality and prove approximate convergence of learning dynamics, e.g., stochastic gradient descent or iterative empirical risk minimization (Mendler-Dünner et al., 2020; Drusvyatskiy & Xiao, 2020; Izzo et al., 2021). Performativity of prediction is also related to several applications: strategic classification (Hardt et al., 2016) and retraining (Bartlett, 1992; Kuh et al., 1990).
|
|
|
Inspired by the instability of training algorithms in ML applications such as Generative Adversarial |
|
Networks (GANs), there has been a lot of recent interest in understanding conditions (particularly |
|
in multi-agent systems) where learning behavior may be non-equilibrating/unstable (Cheung & Tao, |
|
2020; Balduzzi et al., 2020; Flokas et al., 2020; Andrade et al., 2021; Letcher, 2021; Giannou et al., |
|
2021). The (in)stability and performance of exponentiated gradient descent in particular (also referred to as Multiplicative Weights Updates) and other closely related dynamics has attracted a lot |
|
of attention (Cohen et al., 2017; Bailey & Piliouras, 2018; Cheung, 2018; Panageas et al., 2019; |
|
Cheung & Piliouras, 2020; Vadori et al., 2021). The technique of Li-Yorke chaos has recently found |
|
applications across several different domains such as routing games, Cournot games and blockchain |
|
protocols (Palaiopanos et al., 2017; Chotibut et al., 2020; Bielawski et al., 2021; Cheung et al., 2021; |
|
Leonardos et al., 2021). To our knowledge, this is the first time where such formal chaotic results |
|
are established in settings related to performative prediction and supervised learning more generally. |
|
|
|
Another line of related work is learning in games. Specifically, our dynamics can be seen as special cases of the multiplicative weights update (Hedge algorithm) on congestion games. Previous works on Hedge algorithms only show exact convergence when the learning rate is decreasing (Kleinberg et al., 2009; Cohen et al., 2017), and, to the best of our knowledge, our results are the first to show exact convergence of Hedge algorithms with small constant learning rates. Our results also complement the exact convergence result for the linear variant of the multiplicative weights update by Palaiopanos et al. (2017).
|
|
|
2 PRELIMINARIES
|
|
|
2.1 MULTI-AGENT PERFORMATIVE PREDICTION |
|
|
|
A multi-agent performative prediction comprises n agents deploying predictive models f_{θ^1}, . . ., f_{θ^n} with parameters θ^1, . . ., θ^n ∈ Θ that collectively influence the future data distribution. We formalize this dependency via a distribution map D(·) which outputs a distribution on the data, D(θ⃗), given a model profile θ⃗ := (θ^1, . . ., θ^n) ∈ Θ^n. A loss function ℓ(z, θ′) measures the loss of a model f_{θ′} on a data point z ∈ Z, and the expected loss on a distribution D is E_{z∼D}[ℓ(z, θ′)]. For performative prediction, we further define the decoupled performative loss on a distribution map D as

ℓ(θ⃗, θ′) := E_{z∼D(θ⃗)}[ℓ(z, θ′)],

where θ′ ∈ Θ denotes a predictive model, while θ⃗ ∈ Θ^n denotes a deployed model profile.
|
|
|
Given the set of models Θ, the loss function ℓ, and the distribution map D, each agent in a multi-agent performative prediction (Θ, ℓ, D) pursues minimal loss on the distribution that the agents collectively induce. We consider two solution concepts, performative optimality and performative _stability_, which generalize the original ones in Perdomo et al. (2020).

**Definition 2.1.** Given (Θ, ℓ, D), a model profile θ⃗* ∈ Θ^n is performatively optimal if the total loss is minimized,

θ⃗* ∈ arg min_{θ⃗∈Θ^n} Σ_i ℓ(θ⃗, θ^i).

Another desirable property of a model profile is that, given that all agents deploy their models, each model is also simultaneously optimal for the distribution that the profile induces. Formally, θ⃗* is _performatively stable_ if for all i ∈ [n]

(θ*)^i ∈ arg min_{θ^i∈Θ} ℓ(θ⃗*, θ^i).

Performative optimality does not imply performative stability. For a performatively optimal point, the minimization variable θ⃗ affects both the first and the second argument of ℓ, but it only affects the second argument for a performatively stable point. Now we introduce the model studied in this paper.
|
|
|
|
|
|
|
|
**Location-scale distribution map** In this paper, we study the family of multi-agent location-scale maps for regression problems where a data point consists of a d-dimensional feature and a scalar outcome, z = (x, y). These are natural classes of distribution maps in which performative effects enter through an additive or multiplicative factor that is linear in θ⃗.

**Definition 2.2.** Given d ∈ N with d ≥ 2, and Θ ⊆ R^d, a distribution map D : Θ^n → R^d × R is a multi-agent location-scale distribution map on n parties if there exist a static distribution D_X on R^{d+1}, θ^0 ∈ R^d, and n linear functions Λ_1, . . ., Λ_n from R^d to R^d so that D(θ⃗) has the following form: the feature x ∈ R^d and noise x_0 ∈ R are jointly sampled, (x, x_0) ∼ D_X, and the outcome is

y = ⟨θ^0 − Σ_{i=1}^n Λ_i(θ^i), x⟩ + x_0.

In this paper, we consider the scaling maps Λ_i(θ) = λ_i θ for all θ ∈ Θ with scalars λ_i > 0 for all i ∈ [n]. We call λ := (λ_1, . . ., λ_n) the influence parameters, and L_n := Σ_{i=1}^n λ_i the collective influence. Furthermore, we let A := E[xx^⊤] ∈ R^{d×d} be the covariance matrix of the feature, and b := Aθ^0 + E[x_0 x] ∈ R^d. We will specify a multi-agent location-scale distribution map with the parameters n, d, λ, A, and b.

When n = 1, our multi-agent location-scale distribution map is a special case of the location-scale family in Miller et al. (2021), where the model θ may affect both the outcome y and the feature x.
|
|
|
**Predictive Models and Loss Function** We consider constrained linear predictive models where f_{θ′}(x) = ⟨θ′, x⟩, and the collection of parameters is the d-simplex, Θ = {θ : Σ_{k=1}^d θ_k = 1, θ_k ≥ 0}. We use the mean squared error to measure a predictive model θ′ ∈ Θ on a distribution map with a deployed model profile θ⃗ ∈ Θ^n, ℓ(θ⃗, θ′) = E_{(x,y)∼D(θ⃗)}[(y − f_{θ′}(x))^2] = E_{(x,y)∼D(θ⃗)}[(y − θ′ · x)^2].

Given a deployed model profile θ⃗ and a predictive model θ′, the gradient of the decoupled loss is g(θ⃗, θ′) := ∇_{θ′} ℓ(θ⃗, θ′), and if D is a location-scale distribution map, g(θ⃗, θ′) = E_{(x,y)∼D(θ⃗)}[2(θ′ · x − y)x]. Furthermore, with A and b defined in definition 2.2, the gradient can be written as

g(θ⃗, θ′) = ∇_{θ′} ℓ(θ⃗, θ′) = 2A(θ′ + Σ_i λ_i θ^i) − 2b.    (1)

Additionally, given a deployed model profile θ⃗ and a predictive model profile θ⃗′, we define the gradient of agent i's decoupled loss as g^i(θ⃗, θ⃗′) := g(θ⃗, (θ′)^i), and g^i(θ⃗) := g^i(θ⃗, θ⃗) when the deployed model profile is identical to the predictive model profile. We denote the gradient of agent i's average loss as ḡ^i(θ⃗) := Σ_l θ^i_l g^i_l(θ⃗) for all θ⃗ ∈ Θ^n. Finally, we define ξ⃗(θ⃗) = (ξ^1, . . ., ξ^n) with ξ^i ∈ R^d so that for all i ∈ [n] and k ∈ [d], ξ^i_k(θ⃗) := θ^i_k(Σ_l θ^i_l g^i_l − g^i_k). For brevity, we omit θ⃗ and write g⃗ := g⃗(θ⃗) and ξ⃗ := ξ⃗(θ⃗) when there is no ambiguity.
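As a sanity check, eq. (1) can be compared against a finite-difference gradient of the closed-form decoupled loss. Below is a minimal numpy sketch; the matrix A, the vector b, and the influence parameters are illustrative choices rather than values from the paper:

```python
import numpy as np

d, n = 3, 2
rng = np.random.default_rng(0)
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)            # positive definite feature covariance E[x x^T]
theta0 = np.full(d, 1.0 / d)       # base parameter theta^0 of the distribution map
c = rng.normal(size=d)             # stands in for E[x0 x]
b = A @ theta0 + c                 # b := A theta^0 + E[x0 x]
lam = np.array([0.5, 1.5])         # influence parameters lambda_i
thetas = rng.dirichlet(np.ones(d), size=n)   # deployed profile, rows on the simplex
theta_p = rng.dirichlet(np.ones(d))          # predictive model theta'

def loss(theta_p):
    # closed-form decoupled loss E[(y - theta'.x)^2], up to the constant E[x0^2]
    mu = theta0 - lam @ thetas - theta_p
    return mu @ A @ mu + 2 * mu @ c

def grad(theta_p):
    # analytic gradient of eq. (1): 2A(theta' + sum_i lambda_i theta^i) - 2b
    return 2 * A @ (theta_p + lam @ thetas) - 2 * b

eps = 1e-6
fd = np.array([(loss(theta_p + eps * e) - loss(theta_p - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(fd - grad(theta_p))))  # near zero: eq. (1) matches finite differences
```

Because the decoupled loss is quadratic in θ′, the central difference agrees with eq. (1) up to floating-point error.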
|
|
|
2.2 LEARNING DYNAMICS |
|
|
|
We consider agents that myopically use a reinforcement learning algorithm, exponentiated gradient _for linear regression_, to improve their predictive models in rounds. We first define the original single-agent exponentiated gradient for linear regression on a sequence of data points here, and will state our dynamics on multi-agent performative prediction in section 4.
|
**Definition 2.3** (Kivinen & Warmuth (1997)). Given a learning rate η > 0, an initial parameter θ_0 ∈ Θ, and a sequence of data points (x_t, y_t)_{t≥1}, the exponentiated gradient descent for linear regression iteratively updates the model as follows: at round t + 1, with previous parameter θ_t ∈ Θ, the exponentiated gradient algorithm updates it to θ_{t+1} so that

θ_{t+1,k} = θ_{t,k} exp(−2η(θ_t · x_t − y_t)x_{t,k}) / Σ_l θ_{t,l} exp(−2η(θ_t · x_t − y_t)x_{t,l})   for all k ∈ [d].

Note that each exponent 2(θ_t · x_t − y_t)x_{t,k} is the k-th coordinate of the gradient of the squared error ℓ((x_t, y_t), θ_t) at θ_t, which yields the name exponentiated gradient descent.
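The update of definition 2.3 takes only a few lines. The sketch below runs it on a synthetic, noiseless data stream with a hypothetical ground-truth weight vector on the simplex; the step size and horizon are illustrative:

```python
import numpy as np

def eg_step(theta, x, y, eta):
    """One exponentiated-gradient step of definition 2.3; theta stays on the simplex."""
    g = 2 * (theta @ x - y) * x          # gradient of the squared error at theta
    w = theta * np.exp(-eta * g)
    return w / w.sum()                   # normalization keeps theta a distribution

rng = np.random.default_rng(1)
d = 4
target = rng.dirichlet(np.ones(d))       # hypothetical ground-truth weights
theta = np.full(d, 1.0 / d)              # uniform initialization
for _ in range(2000):
    x = rng.normal(size=d)
    theta = eg_step(theta, x, target @ x, eta=0.05)
print(np.abs(theta - target).max())      # small: EG approaches the target weights
```

On this realizable stream the iterates approach the target, and every iterate remains a valid point of the simplex by construction.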
|
|
|
|
|
|
|
|
2.3 DYNAMICAL SYSTEM |
|
|
|
**Definition 2.4.** Let (X, f) be a dynamical system where X is a metric space and f is a mapping on X. We say (x, x′) ∈ X × X is a Li-Yorke pair if

lim inf_t dist(f^t(x), f^t(x′)) = 0 and lim sup_t dist(f^t(x), f^t(x′)) > 0.

(X, f) is Li-Yorke chaotic if there is an uncountable set S ⊂ X (scrambled set) such that any pair of points x ≠ x′ ∈ S is a Li-Yorke pair.
|
|
|
3 MULTI-AGENT PERFORMATIVE LEARNING: STABILITY AND OPTIMALITY |
|
|
|
Having introduced multi-agent performative prediction, we show some basic properties of our location-scale distribution map with mean squared error as a warm-up. Using the first-order condition and a potential function argument, we show 1) the existence of a performative stable point in our model (proposition 3.1) and 2) that performative stability and performative optimality are equivalent (proposition 3.2). The proofs of both propositions are in appendix A.

We first show the existence and uniqueness of the performative stable point through a potential function argument. The first part is due to the first-order condition of convex optimization. The second part also uses the first-order condition, together with ∂Φ(θ⃗)/∂θ^i_k = λ_i g^i_k(θ⃗) for all i ∈ [n] and k ∈ [d].
|
|
|
**Proposition 3.1** (existence). _Given the family of linear predictive models with constraints Θ, mean squared error ℓ, and a multi-agent location-scale distribution map with parameters n, d, λ, A, b in definition 2.2, θ⃗* on (Θ, ℓ, D) is performative stable if and only if the gradient (defined in eq. (1)) satisfies g^i_k(θ⃗*) ≤ g^i_l(θ⃗*) for all i ∈ [n] and k, l ∈ [d] with (θ*)^i_k > 0._

_Furthermore, if A is positive definite, there exists a unique performative stable point θ⃗*, and θ⃗* is also a global minimizer of the following convex function over θ⃗ ∈ Θ^n:_

Φ(θ⃗) := (Σ_i λ_i θ^i)^⊤ A (Σ_i λ_i θ^i) + Σ_i λ_i (θ^i)^⊤ A θ^i − 2b^⊤ Σ_i λ_i θ^i.    (2)
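Since A is positive definite, Φ in eq. (2) is convex on the product of simplices, which can be spot-checked numerically via the midpoint inequality. A small sketch with illustrative parameters:

```python
import numpy as np

def Phi(thetas, A, b, lam):
    # potential of eq. (2): (sum_i lam_i th^i)^T A (sum_i lam_i th^i)
    #   + sum_i lam_i (th^i)^T A th^i - 2 b^T sum_i lam_i th^i
    s = lam @ thetas                     # weighted aggregate sum_i lambda_i theta^i
    return s @ A @ s + sum(l * (t @ A @ t) for l, t in zip(lam, thetas)) - 2 * b @ s

rng = np.random.default_rng(2)
n, d = 3, 4
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                  # positive definite => Phi is convex
b = rng.normal(size=d)
lam = np.array([0.5, 1.0, 2.0])          # illustrative influence parameters

u = rng.dirichlet(np.ones(d), size=n)    # two random model profiles
v = rng.dirichlet(np.ones(d), size=n)
mid = (u + v) / 2
print(Phi(mid, A, b, lam) <= (Phi(u, A, b, lam) + Phi(v, A, b, lam)) / 2)  # True
```

The midpoint inequality holds for any pair of profiles, consistent with Φ being convex whenever A is positive definite.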
|
|
|
While a performatively stable model profile does not necessarily minimize the loss, the following proposition shows that optimality and stability are equivalent in our setting.

**Proposition 3.2.** _Given (Θ, ℓ, D) defined in proposition 3.1, θ⃗* is performatively stable if and only if θ⃗* is performatively optimal as defined in definition 2.1._
|
|
|
4 MULTI-AGENT PERFORMATIVE LEARNING: CONVERGENCE AND CHAOS |
|
|
|
In section 3, we studied the stability and optimality of the location-scale distribution map with mean squared error. Now we ask: when each agent myopically improves his predictive model through reinforcement learning but their predictions are performative, what is the long-term behavior of the system? Does the system converge to the performative stable and optimal point, or does it behave chaotically?

We provide a threshold result depending on the learning rates and the collective influence L_n. Section 4.1 shows the dynamics converge to the performative stable and optimal point when the learning rates are small enough. On the other hand, section 4.2 shows the dynamics can have chaotic behavior if the collective influence L_n is large enough.
|
|
|
Now we define our dynamics (θ⃗_t)_{t≥0} formally. At each round, each agent accesses the data distribution influenced by the previous models, and updates his model through exponentiated gradient descent. Specifically, given an initial model profile θ⃗_0 ∈ Θ^n and a learning rate profile η := (η_1, . . ., η_n), each agent i applies the exponentiated gradient descent with initial parameter θ^i_0 and learning rate η_i > 0: at round t + 1 > 1, each agent i uses D(θ⃗_t), estimates the gradient of the (expected) loss g^i(θ⃗_t) defined in eq. (1), and updates his model according to definition 2.3,

θ^i_{t+1,k} = θ^i_{t,k} exp(−η_i g^i_k(θ⃗_t)) / Σ_{l=1}^d θ^i_{t,l} exp(−η_i g^i_l(θ⃗_t))   for all t ≥ 0, i ∈ [n], and k ∈ [d].    (3)
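The dynamics of eq. (3) are straightforward to simulate by combining the gradient formula of eq. (1) with the exponentiated-gradient update. The sketch below uses illustrative parameters and a small common learning rate, the regime in which theorem 4.1 predicts convergence:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 3, 4
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                      # positive definite E[x x^T]
b = rng.normal(size=d)
lam = np.array([0.5, 1.0, 1.5])              # influence parameters lambda_i
eta = np.array([0.01, 0.01, 0.01])           # small learning rates: convergent regime

thetas = rng.dirichlet(np.ones(d), size=n)   # interior initial profile theta_0
for _ in range(20000):
    prev = thetas.copy()
    agg = lam @ prev                         # sum_i lambda_i theta^i at round t
    for i in range(n):
        g = 2 * A @ (prev[i] + agg) - 2 * b  # gradient of eq. (1)
        w = prev[i] * np.exp(-eta[i] * g)
        thetas[i] = w / w.sum()              # update of eq. (3)
print(np.abs(thetas - prev).max())           # per-round movement; tiny at a fixed point
```

All agents update simultaneously from the round-t profile, as in eq. (3), and the per-round movement shrinks toward zero as the profile approaches the stable point.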
|
|
|
|
|
|
|
|
We will use superscripts to denote agents, i ∈ [n], and subscripts for time, t ≥ 0, and feature indices, k, l ∈ [d]. Recall that the gradient of agent i's average loss is ḡ^i(θ⃗) := Σ_l θ^i_l g^i_l(θ⃗), and eq. (3) can be rewritten as

θ^i_{t+1,k} = θ^i_{t,k} exp(−η_i g^i_k(θ⃗_t)) / Σ_l θ^i_{t,l} exp(−η_i g^i_l(θ⃗_t)) = θ^i_{t,k} exp(η_i(ḡ^i(θ⃗_t) − g^i_k(θ⃗_t))) / Σ_l θ^i_{t,l} exp(η_i(ḡ^i(θ⃗_t) − g^i_l(θ⃗_t))).

Finally, given θ⃗* ∈ Θ^n, we define the support of θ⃗* as S_i := {k : (θ*)^i_k > 0} ⊆ [d], and S̄_i := [d] \ S_i. Then θ⃗* is an _equilibrium (or fixed point)_ of eq. (3) if

g^i_k(θ⃗*) = ḡ^i(θ⃗*), for all i ∈ [n], k ∈ S_i.    (4)

We say a fixed point of eq. (3) is isolated if there exists an open set around it so that no other fixed point is in the set. Additionally, the performative stable condition in proposition 3.1 is equivalent to

g^i_k(θ⃗*) = ḡ^i(θ⃗*) and g^i_l(θ⃗*) ≥ ḡ^i(θ⃗*), for all i ∈ [n], k ∈ S_i, l ∈ S̄_i.    (5)

Therefore, the set of fixed points of eq. (3) contains the performative stable point.
|
|
|
4.1 CONVERGING WITH SMALL LEARNING RATE |
|
|
|
In this section, we show that when the learning rate of each agent is small enough, the dynamics in eq. (3) converge to the performative stable point. Specifically, if the parameters of the multi-agent performative learning setting (n, d, A, λ) are fixed, the dynamics in eq. (3) converge to the performative stable point when η is "small" enough.

By eq. (5), a fixed point θ⃗* is performative stable if the gradient of every non-support coordinate is no less than the average gradient. We call a performative stable point θ⃗* _proper_ if the gradient of every non-support coordinate is strictly greater than the average gradient: for all i ∈ [n] and l ∈ S̄_i, g^i_l(θ⃗*) > ḡ^i(θ⃗*). The theorem below shows that if the performative stable point is proper and the equilibria satisfying eq. (4) are isolated, eq. (3) converges when max_i η_i is small enough and the ratio η_i/η_j is bounded for all i and j.

**Theorem 4.1** (Convergence). _Consider a constant R_η > 0 and a multi-agent performative learning setting with parameters n, d, A, b, λ such that A is positive definite, the performative stable point θ⃗* is proper, and the equilibria (defined in eq. (4)) are isolated. There exists η_* > 0 so that the dynamic in eq. (3) with learning rate profile η converges to the performative stable point, lim_{t→∞} θ⃗_t = θ⃗*, if the initial state is an interior point and η satisfies max_i η_i ≤ η_* and max_i η_i / min_i η_i ≤ R_η._
|
|
|
Informally, when the learning rates η are small, θ⃗_t in eq. (3) can be approximated by a solution of an ordinary differential equation, ϑ⃗(t∥η∥_1), where the initial condition is ϑ⃗(0) = θ⃗_0 and

d/dt ϑ^i_k(t) = (η_i / ∥η∥_1) ϑ^i_k(t) (ḡ^i(ϑ⃗(t)) − g^i_k(ϑ⃗(t)))    (6)

for all t ≥ 0, i ∈ [n], and k ∈ [d]. We formalize this intuition in the proof of lemma 4.4. Note that the set of fixed points of eq. (6) is identical to that of eq. (3) and satisfies eq. (4).
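A simple Euler discretization of eq. (6) also lets one watch the potential Φ of eq. (2) decrease along the trajectory, which is the mechanism behind the convergence argument. The parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 2, 3
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                  # positive definite
b = rng.normal(size=d)
lam = np.array([1.0, 2.0])
eta = np.array([0.3, 0.7])               # only the normalized rates eta_i/||eta||_1 enter eq. (6)

def Phi(th):
    # potential function of eq. (2)
    s = lam @ th
    return s @ A @ s + sum(l * (t @ A @ t) for l, t in zip(lam, th)) - 2 * b @ s

th = rng.dirichlet(np.ones(d), size=n)   # interior initial condition
vals = [Phi(th)]
dt = 1e-3                                # Euler step for the ODE of eq. (6)
for _ in range(5000):
    prev = th.copy()
    agg = lam @ prev
    for i in range(n):
        g = 2 * A @ (prev[i] + agg) - 2 * b
        gbar = prev[i] @ g               # average gradient of agent i
        th[i] = prev[i] + dt * (eta[i] / eta.sum()) * prev[i] * (gbar - g)
    vals.append(Phi(th))
print(all(y <= x + 1e-6 for x, y in zip(vals, vals[1:])))  # Phi never increases
```

The replicator-style right-hand side preserves each simplex exactly (the coordinate changes sum to zero), and Φ is monotonically non-increasing along the numerical trajectory, matching lemma 4.3.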
|
|
|
The proof of theorem 4.1 has two parts. First we show the continuous approximation eq. (6) converges to the performative stable point. Then we study the relationship between eq. (3) and eq. (6), and prove that eq. (3) also converges to the performative stable point.

**From Potential Function to Convergence of Equation (6)** In this section, theorem 4.2 shows the dynamics of eq. (6) converge to the performative stable point, which will be useful for showing the convergence of eq. (3).
|
|
|
**Theorem 4.2.** _If all points satisfying eq. (4) are isolated, A is positive definite, and ϑ⃗(0) is in the interior of Θ^n, then the limit lim_{t→∞} ϑ⃗(t) = θ⃗* is the performative stable point._

Theorem 4.2 can be seen as the continuous version of theorem 4.1. Compared to theorem 4.1, theorem 4.2 does not require that the performative stable point is proper, but it implicitly requires that the ratio of learning rates between any two agents is bounded.

To prove theorem 4.2, we prove lemma 4.3, which shows that Φ in eq. (2) is a potential function for eq. (6): the time derivative of Φ is negative at all non-fixed points. Then, in the proof of theorem 4.2, we further prove that the dynamics in eq. (6) converge to the performative stable point when the initial condition ϑ⃗(0) is an interior point. The proof is similar to a proof in Kleinberg et al. (2009), and is presented in appendix B.
|
|
|
|
|
|
|
|
**Lemma 4.3.** _Given a solution of eq. (6), the time derivative of Φ in eq. (2) is 0 at fixed points of eq. (6), and negative at all other points. Furthermore,_

d/dt Φ(ϑ⃗(t)) ≤ −(1 / (2 Σ_i η_i Σ_i λ_i η_i)) (Σ_{i,k} λ_i η_i ϑ^i_k |Σ_l ϑ^i_l g^i_l − g^i_k|)^2 ≤ −(min_i λ_i η_i / (2 Σ_i η_i Σ_i λ_i η_i)) ∥ξ⃗(ϑ⃗)∥_1^2.
|
|
|
**From Approximation to Convergence of Equation (3)** Given the convergence of eq. (6), we show that the dynamics of eq. (3) also converge to the performative stable point. The argument has two stages. We first show that, given any neighborhood D of the performative stable point, eq. (3) hits D and stays in D after a finite number of steps (lemma 4.4). Then lemma 4.5 shows that eq. (3) converges to the performative stable point, which completes the proof of theorem 4.1.
|
|
|
**Lemma 4.4.** _Given any open set D ⊆ Θ^n that contains the performative stable point, there exists η_* small enough so that for all θ⃗_0 ∈ Θ^n, there exists τ so that θ⃗_t ∈ D for all t ≥ τ._

The proof is based on a standard approximation result for numerical solutions of ordinary differential equations, and is presented in appendix C.
|
|
|
**Lemma 4.5.** _Let θ⃗* be the performative stable point and D a small enough neighborhood of θ⃗*. Under the conditions of theorem 4.1, if the initial condition of eq. (3), θ⃗_0, is in D and θ⃗_t ∈ D for all t ≥ 0, the limit of eq. (3) is the performative stable point, lim_{t→∞} θ⃗_t = θ⃗*._

The proof uses the fact that the step induced by the right-hand side of eq. (3) shrinks as the dynamic approaches the fixed point. Therefore, even though the learning rate profile is fixed, the error between eq. (3) and eq. (6) vanishes as they converge to the fixed point. We present the proof in appendix D.
|
|
|
4.2 CHAOS WITH CONSTANT LEARNING RATE |
|
|
|
Now we ask the converse of the question in section 4.1. Do the dynamics in eq. (3) still converge, if the learning rate is fixed, as the number of agents increases? Alternatively, do the dynamics converge when the agents have overwhelming influence on the data distribution?

Theorem 4.6 shows that the dynamics are Li-Yorke chaotic when L_n = Σ_i λ_i is large, even with d = 2. Note that a large L_n may be due to a fixed number of agents having overwhelming influence, or to the number of agents being large. The latter case implies that for any small η, no matter how cautious the agents are, chaos inevitably occurs as the number of agents increases.
|
|
|
**Theorem 4.6.** _Given a multi-agent location-scale distribution map with parameters A, b, d = 2, and a common learning rate η > 0, if all n agents use the exponentiated gradient with learning rate η in eq. (3) and A is diagonally dominant, there exists a carrying capacity L* such that if L_n = Σ_{i=1}^n λ_i ≥ L* the dynamics (θ⃗_t)_{t∈N} are Li-Yorke chaotic and thus have periodic orbits of all possible periods._
|
|
|
To prove theorem 4.6, we show there exists an uncountable scrambled set as defined in definition 2.4. First we consider all agents starting from the same initial model, and show the dynamics of eq. (3) can be captured by the following function,

f_{u,v}(x) = x / (x + (1 − x) exp(u(x − v)))   for all x ∈ [0, 1],    (7)

with a proper choice of u ∈ R and v ∈ R. Note that v is an equilibrium, and u controls the steepness. First, when all agents start from the same initial model, θ^i_0 = θ_0 for all i ∈ [n], and use the same learning rate η, their models are identical for all t ≥ 0: for all t there exists θ_t so that θ^i_t = θ_t for all i. Additionally, since d = 2, we can use a single parameter to encode a constrained linear predictive model. Given θ ∈ Θ = ∆_2, we define p(θ) = θ_1 ∈ [0, 1] and omit the input θ when it is clear. Thus, we can rewrite the dynamics as p_t = p(θ_t), which are one-dimensional dynamics on [0, 1], with p_{t+1} = p_t / (p_t + (1 − p_t) e^{η(g^i_1(θ⃗_t) − g^i_2(θ⃗_t))}). We set

α(L) = 2η(1 + L)(A_{1,1} − A_{1,2} − A_{2,1} + A_{2,2})  and  β(L) = ((1 + L)(A_{2,2} − A_{2,1}) + (b_1 − b_2)) / ((1 + L)(A_{1,1} − A_{1,2} − A_{2,1} + A_{2,2})),
|
|
|
|
|
|
and write α_n = α(L_n) and β_n = β(L_n). By direct computation, the dynamic of p_t is

p_{t+1} = f_{α_n,β_n}(p_t),    (8)
|
|
|
where α_n encodes the update pace and β_n is an equilibrium of the dynamic. We further set β_∞ := (A_{2,2} − A_{2,1}) / (A_{1,1} − A_{1,2} − A_{2,1} + A_{2,2}) and δ(L) := β(L) − β_∞, which converges to zero as L → ∞. If A is diagonally dominant, A_{1,1} − A_{1,2} and A_{2,2} − A_{2,1} are positive, so α(L) is positive and increasing for all L ≥ 0; moreover, we can permute the coordinates so that β_∞ ∈ (0, 1/2]. With Li et al. (1982), the lemma below implies the existence of period-three points; the proof is in appendix E.
|
|
|
**Lemma 4.7.** *If $\beta_\infty\in(0,1/2)$, there exists a constant $L^* > 0$ so that if $L\ge L^*$, there exists a trajectory $x_0, x_1, x_2$, and $x_3$ with $f_{\alpha(L),\beta(L)}(x_i) = x_{i+1}$ such that $x_3 < x_0 < x_1$.*
|
|
|
*Proof of theorem 4.6.* First, by lemma 4.7, for all $L\ge L^*$ there exists a trajectory $x_0, x_1, x_2, x_3$ so that $x_3 < x_0 < x_1$. The existence of such a trajectory implies the existence of period-three points by Li et al. (1982). Finally, by the seminal work of Li & Yorke (1975), this implies that the map is Li-Yorke chaotic.
|
|
|
5 SIMULATION |
|
|
|
Figure 1: The temporal behavior of $p_t$ starting at $p_0 = 0.2$ for 100 rounds under various $L$ and $\eta$, and under noisy estimation of the gradient of the mean squared error from $m$ samples. The top row, without noise, presents different combinations of $(L, \eta)$: (a) $(14, 0.001)$, (b) $(1.4, 0.05)$, (c) $(14, 0.05)$. The bottom row considers the same combinations with the gradient estimated from $m = 10$ and $m = 100$ samples: (d) $(14, 0.001)$, (e) $(1.4, 0.05)$, (f) $(14, 0.05)$.
|
|
|
Now we simulate the one-dimensional dynamics in eq. (8) of one agent with different learning rates $\eta$ and collective influences $L$.
|
|
|
We first define a location-scale distribution map. Let the features $x_1, x_2$ and the noise $x_0$ be mutually independent zero-mean Gaussian random variables with variances $\mathbb{E}[x_1^2] = 3$, $\mathbb{E}[x_2^2] = 7$, and $\mathbb{E}[x_0^2] = 1$. Finally, $\theta^0 = 0$. Therefore,

$$A = \mathbb{E}[xx^\top] = \begin{pmatrix}3 & 0\\ 0 & 7\end{pmatrix}, \quad\text{and}\quad b = A\theta^0 + \mathbb{E}[x_0x] = \begin{pmatrix}0\\0\end{pmatrix}.$$
|
|
|
|
|
|
|
|
Given a learning rate $\eta > 0$ and collective influence $L$, we have $\alpha(L) = 20\eta(1+L)$ and the performative stable point $\beta(L) = 0.7$.
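Under these parameters, eq. (8) reduces to $p_{t+1} = p_t/(p_t + (1-p_t)\exp(\alpha(L)(p_t - 0.7)))$, so the stability transition can be probed directly (a minimal sketch reproducing the qualitative contrast; parameter choices follow the figure):

```python
import numpy as np

def step(p, alpha, beta=0.7):
    # Eq. (8): p_{t+1} = f_{alpha, beta}(p_t) with f from eq. (7).
    return p / (p + (1.0 - p) * np.exp(alpha * (p - beta)))

def orbit(eta, L, p0=0.2, T=2000):
    alpha = 20.0 * eta * (1.0 + L)  # alpha(L) = 20*eta*(1+L) for this distribution map
    ps = [p0]
    for _ in range(T):
        ps.append(step(ps[-1], alpha))
    return np.array(ps)

# Small learning rate (eta = 0.001, L = 14): monotone convergence to beta(L) = 0.7.
converged = orbit(eta=0.001, L=14)
assert abs(converged[-1] - 0.7) < 1e-3

# Large learning rate (eta = 0.05, L = 14, so alpha = 15): wild oscillations, no convergence.
chaotic = orbit(eta=0.05, L=14, T=200)
assert chaotic.max() > 0.99 and chaotic.min() < 0.5
```

With $\alpha(L) = 15$ the fixed point has multiplier $1 - 15\cdot 0.7\cdot 0.3 = -2.15$, so it is unstable, matching the transition between the two regimes in fig. 1.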
|
|
|
The top row of fig. 1 shows the temporal behavior of $p_t$ under different $(L, \eta)$. The bottom row of fig. 1 demonstrates that our (convergent and chaotic) results are robust even when the value of the gradient is noisy. Specifically, we consider that at each round $t$, instead of $g_1^i(\vec\theta_t) - g_2^i(\vec\theta_t) = 2(1+L)(1, -1)A\theta_t - 2(1, -1)b$, the agent replaces $A$ and $b$ with empirical moments estimated from $m$ samples. This process can be seen as a stochastic exponentiated gradient descent which uses a stochastic estimation of the gradient.
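This noisy variant can be sketched as follows: at each round the agent re-estimates $A$ and $b$ from $m$ fresh samples of $(x_1, x_2, x_0)$ and plugs the empirical moments into the gradient difference (a minimal sketch; the per-round sampling scheme is our reading of the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_step(p, eta, L, m):
    # Draw m fresh samples of the features and the noise.
    x = np.stack([rng.normal(0, np.sqrt(3), m),   # x1 with E[x1^2] = 3
                  rng.normal(0, np.sqrt(7), m)])  # x2 with E[x2^2] = 7
    x0 = rng.normal(0, 1, m)                      # noise with E[x0^2] = 1
    A_hat = x @ x.T / m                           # empirical E[x x^T]
    b_hat = x @ x0 / m                            # empirical E[x0 x] (theta^0 = 0)
    theta = np.array([p, 1.0 - p])
    # Empirical gradient difference g_1 - g_2 = 2(1+L)(1,-1) A theta - 2(1,-1) b.
    diff = 2 * (1 + L) * (A_hat[0] - A_hat[1]) @ theta - 2 * (b_hat[0] - b_hat[1])
    return p / (p + (1.0 - p) * np.exp(eta * diff))

p = 0.2
for _ in range(2000):
    p = noisy_step(p, eta=0.001, L=14, m=100)
# With a small learning rate the stochastic dynamic still settles near 0.7.
assert 0.5 < p < 0.9
```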
|
|
|
We can see that chaotic behavior happens in fig. 1c with $L = 14$ and $\eta = 0.05$, and that such behavior persists in fig. 1f when the evaluations of the gradient are noisy. On the other hand, when the learning rate is small enough (figs. 1a and 1b), the dynamics converge to the performative stable point 0.7. Furthermore, in figs. 1d and 1e the dynamics converge even with noisy estimation of the gradient.
|
|
|
Finally, fig. 2 shows the temporal behavior of the loss. Under the same setting as fig. 1c, fig. 2b shows that not only the temporal behavior of $p_t$ is chaotic, but also that of the mean squared loss. On the other hand, under the same setting as fig. 1a, fig. 2a shows convergent behavior of the loss.
|
|
|
Figure 2: The temporal behavior of the total cost for 100 rounds under various $\eta$, without noise: (a) $(14, 0.001)$, (b) $(14, 0.05)$.
|
|
|
6 CONCLUSION |
|
|
|
We introduce a framework of multi-agent performative prediction and investigate whether classical reinforcement learning algorithms converge or behave chaotically depending on the collective influence of the agents' models and on the learning rates. However, we view our example as only scratching the surface of multi-agent performative prediction.
|
|
|
Our framework leads to several new theoretical problems. In particular, it would be interesting to understand whether this threshold is generic and whether our results still hold for other reinforcement learning algorithms or for more general multi-agent distribution maps. Our framework also provides a new viewpoint on several applications. One natural application is strategic classification (Hardt et al., 2016). In this context, our framework can be seen as modeling strategic learners in a strategic classification problem where both the data and the classifiers are strategic. However, in conventional strategic classification the features also respond to the deployed models, which is not captured by our multi-agent location-scale distribution map. It would be interesting to investigate our dynamics in the context of strategic prediction. Another application is pricing strategy/prediction, where multiple companies predict the demand function and set their prices.
|
|
|
Both our digital and real-world environments are increasingly under the influence of ever more
|
powerful and numerous ML systems. As these algorithms trigger actions that change the state of |
|
the system that produces their joint input data (e.g., AI-triggered ads shown to users affecting their |
|
behavior, or automated trading systems affecting stock prices, etc), they are effectively forming |
|
closed loop systems and can no longer be understood fully in isolation. As we show in this paper, |
|
even fairly simple and innocuous such settings can exhibit phase transitions going from optimal |
|
behavior to chaos. We believe such phenomena are worthy of a careful investigation and we hope |
|
that this paper sets out some useful building blocks by bringing together optimization/performance |
|
analysis with Lyapunov theory and chaos theory. |
|
|
|
|
|
|
|
|
ETHICS STATEMENT |
|
|
|
Our work showcases the possibility of competing prediction algorithms leading to instability and |
|
chaos. Extra steps should be taken, whenever possible, so that related systems operate within the |
|
range of parameters leading to stability. We hope that our work helps both by identifying a potential |
|
threat for ML systems as well as providing some guidance on how to counter or minimize it. |
|
|
|
REPRODUCIBILITY |
|
|
|
All of our theoretical results are stated fully, including previous definitions and results on which they |
|
are based. They are accompanied by proofs, either in the main paper or in the appendix. The settings
|
of our simulation results are stated fully in the main paper. |
|
|
|
REFERENCES |
|
|
|
Gabriel P Andrade, Rafael Frongillo, and Georgios Piliouras. Learning in matrix games can be |
|
arbitrarily complex. In Conference on Learning Theory (COLT), 2021. |
|
|
|
James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM |
|
_Conference on Economics and Computation, pp. 321–338, 2018._
|
|
|
David Balduzzi, Wojciech M Czarnecki, Thomas W Anthony, Ian M Gemp, Edward Hughes, Joel Z |
|
Leibo, Georgios Piliouras, and Thore Graepel. Smooth markets: A basic mechanism for organizing gradient-based learners. 2020. |
|
|
|
Peter L Bartlett. Learning with a slowly changing distribution. In Proceedings of the fifth annual |
|
_workshop on Computational learning theory, pp. 243–252, 1992._ |
|
|
|
Jakub Bielawski, Thiparat Chotibut, Fryderyk Falniowski, Grzegorz Kosiorowski, Michał Misiurewicz, and Georgios Piliouras. Follow-the-regularized-leader routes to chaos in routing games. |
|
_ICML, 2021._ |
|
|
|
Yun Kuen Cheung. Multiplicative weights updates with constant step-size in graphical constant-sum |
|
games. In NeurIPS 2018, pp. 3532–3542, 2018. |
|
|
|
Yun Kuen Cheung and Georgios Piliouras. Chaos, extremism and optimism: Volume analysis of |
|
learning in games. In NeurIPS 2020, 2020. |
|
|
|
Yun Kuen Cheung and Yixin Tao. Chaos of learning beyond zero-sum and coordination via game |
|
decompositions. In International Conference on Learning Representations, 2020. |
|
|
|
Yun Kuen Cheung, Stefanos Leonardos, and Georgios Piliouras. Learning in markets: Greed leads |
|
to chaos but following the price is right. In Zhi-Hua Zhou (ed.), Proceedings of the Thirtieth |
|
_International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal,_ |
|
_Canada, 19-27 August 2021, pp. 111–117. ijcai.org, 2021. doi: 10.24963/ijcai.2021/16._ |
|
|
|
Thiparat Chotibut, Fryderyk Falniowski, Michał Misiurewicz, and Georgios Piliouras. The route to |
|
chaos in routing games: When is price of anarchy too optimistic? NeurIPS, 2020. |
|
|
|
Johanne Cohen, Amélie Héliou, and Panayotis Mertikopoulos. Learning with bandit feedback in
|
potential games. In Proceedings of the 31st International Conference on Neural Information |
|
_Processing Systems, pp. 6372–6381, 2017._ |
|
|
|
Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions, |
|
2020. |
|
|
|
Lampros Flokas, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Thanasis Lianeas, Panayotis Mertikopoulos, and Georgios Piliouras. No-regret learning and mixed nash equilibria: They do not |
|
mix. In Conference on Neural Information Processing Systems (NeurIPS), 2020. |
|
|
|
|
|
|
|
|
Angeliki Giannou, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Panayotis Mertikopoulos. |
|
Survival of the strictest: Stable and unstable equilibria under regularized learning with partial |
|
information. COLT, 2021. |
|
|
|
Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, |
|
pp. 111–122, 2016. |
|
|
|
Zachary Izzo, Lexing Ying, and James Zou. How to learn when data reacts to your model: performative gradient descent. arXiv preprint arXiv:2102.07698, 2021. |
|
|
|
Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear |
|
predictors. _Information and Computation_, 132(1):1–63, 1997.
|
|
|
Robert Kleinberg, Georgios Piliouras, and Éva Tardos. Multiplicative updates outperform generic
|
no-regret learning in congestion games. In Proceedings of the forty-first annual ACM symposium |
|
_on Theory of computing, pp. 533–542, 2009._ |
|
|
|
Anthony Kuh, Thomas Petsche, and Ronald L Rivest. Learning time-varying concepts. In NIPS, pp. |
|
183–189, 1990. |
|
|
|
Stefanos Leonardos, Barnabé Monnot, Daniël Reijsbergen, Stratis Skoulakis, and Georgios Piliouras. Dynamical analysis of the EIP-1559 Ethereum fee market, 2021.
|
|
|
Alistair Letcher. On the impossibility of global convergence in multi-loss optimization. In _International Conference on Learning Representations_, 2021.
|
|
|
Tien-Yien Li and James A. Yorke. Period three implies chaos. _The American Mathematical Monthly_, 82(10):985–992, 1975. ISSN 00029890, 19300972. URL http://www.jstor.org/stable/2318254.
|
|
|
Tien-Yien Li, Michał Misiurewicz, Giulio Pianigiani, and James A. Yorke. Odd chaos. _Physics Letters A_, 87(6):271–273, 1982. ISSN 0375-9601. doi: https://doi.org/10.1016/0375-9601(82)90692-2. URL https://www.sciencedirect.com/science/article/pii/0375960182906922.
|
|
|
Celestine Mendler-Dünner, Juan C Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. arXiv preprint arXiv:2006.06887, 2020.
|
|
|
John Miller, Juan C Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. arXiv preprint arXiv:2102.08570, 2021. |
|
|
|
Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. _CoRR_, abs/1703.01138, 2017. URL http://arxiv.org/abs/1703.01138.
|
|
|
Ioannis Panageas, Georgios Piliouras, and Xiao Wang. Multiplicative weights updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always. In International Conference on Machine Learning, pp. 4961–4969. PMLR, 2019. |
|
|
|
Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction.
|
In International Conference on Machine Learning, pp. 7599–7609. PMLR, 2020. |
|
|
|
Nelson Vadori, Rahul Savani, Thomas Spooner, and Sumitra Ganesh. Consensus multiplicative weights update: Learning to learn using projector-based game signatures. arXiv preprint |
|
_arXiv:2106.02615, 2021._ |
|
|
|
|
|
|
|
|
A PROOFS AND DETAILS IN SECTION 3 |
|
|
|
*Proof of proposition 3.1.* First, under the mean squared loss function, each agent $i$'s decoupled performative loss function $\ell(\vec\theta', \theta^i)$ is convex in $\theta^i$. Thus, for a linear predictive model, we can apply the KKT conditions to definition 2.1: a collection of predictive models $\vec\theta\in\Theta^n$ is performatively stable if and only if, for all $i\in[n]$ and $k\in[d]$ with $\theta_k^i > 0$, we have $\frac{\partial}{\partial\theta_k^i}\ell(\vec\theta', \theta^i) \le \frac{\partial}{\partial\theta_l^i}\ell(\vec\theta', \theta^i)$ for all $l\in[d]$, where $\vec\theta' = \vec\theta$. Therefore, with eq. (1), $\vec\theta^*$ is performative stable if

$$g_k^i(\vec\theta^*) \le g_l^i(\vec\theta^*) \tag{9}$$

holds for all $i\in[n]$, $l\in[d]$, and $k\in[d]$ with $\theta_k^i > 0$.
|
|
|
With eq. (9), we now show that there exists a unique performative stable point $\vec\theta^*$ by proving that 1) $\Phi$ in eq. (2) is strictly convex, and 2) $\vec\theta^*$ is a minimizer of $\Phi$. First, to show $\Phi$ is strictly convex, it is sufficient to show the Hessian of $\Phi$ is positive definite. Because $\Phi$ is a quadratic function on $\Theta^n$, the Hessian of $\Phi$ is a constant matrix in $\mathbb{R}^{nd\times nd}$. By the partial derivative of eq. (2), for all $i, j\in[n]$ and $k, l\in[d]$, we have $(\nabla^2\Phi)_{ik,jl} = \frac{\partial^2}{\partial\theta_k^i\partial\theta_l^j}\Phi = 2\lambda_i\lambda_jA_{k,l}$ if $i\neq j$, and $(\nabla^2\Phi)_{ik,il} = 2\lambda_i(\lambda_i + 1)A_{k,l}$.

Let $L\in\mathbb{R}^{n\times n}$ with $L_{ij} = 2\lambda_i\lambda_j$ if $i\neq j$ and $L_{ii} = 2\lambda_i(1+\lambda_i)$, which is positive definite because $\lambda_i > 0$ for all $i\in[n]$. Then the Hessian of $\Phi$ is the Kronecker product of $L$ and $A$,
|
|
|
$$\nabla^2\Phi = \begin{pmatrix} 2\lambda_1(1+\lambda_1)A & \dots & 2\lambda_1\lambda_nA \\ \vdots & \ddots & \vdots \\ 2\lambda_n\lambda_1A & \dots & 2\lambda_n(1+\lambda_n)A \end{pmatrix} = L\otimes A.$$
|
|
|
|
|
|
|
|
|
Because $L$ and $A$ are both positive definite, the Hessian is also positive definite. Therefore, $\Phi$ is strictly convex, and there exists a unique minimum of $\Phi$ in the compact set $\Theta^n$. Now we show that $\vec\theta^*$ is performative stable if and only if $\vec\theta^*$ is a minimizer of $\Phi$. By the first order condition and the partial derivative of eq. (2), the minimum of eq. (2) is at $\vec\theta^*$ if and only if $g_k^i(\vec\theta^*) \le g_l^i(\vec\theta^*)$ for all $i\in[n]$ and $k, l\in[d]$ with $(\theta^*)_k^i > 0$. Therefore, $\vec\theta^*$ is the minimum of $\Phi$ if and only if $\vec\theta^*$ is a performative stable point.
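The Kronecker structure of the Hessian, and hence the strict convexity of $\Phi$, is easy to check numerically for any positive definite $A$ (a minimal sketch; the values of $\lambda$ and $A$ are illustrative):

```python
import numpy as np

n, d = 3, 2
lam = np.array([0.5, 1.0, 2.0])          # lambda_i > 0 (illustrative)
A = np.array([[3.0, 1.0], [1.0, 7.0]])   # any positive definite A

# L_{ij} = 2 lambda_i lambda_j for i != j and L_{ii} = 2 lambda_i (1 + lambda_i),
# i.e., L = 2 (lambda lambda^T + diag(lambda)).
L = 2 * (np.outer(lam, lam) + np.diag(lam))

H = np.kron(L, A)  # the Hessian of Phi, an nd x nd matrix

# Entry check: (H)_{ik,jl} = 2 lambda_i lambda_j A_{k,l} off the block diagonal.
i, j, k, l = 0, 2, 1, 0
assert np.isclose(H[i * d + k, j * d + l], 2 * lam[i] * lam[j] * A[k, l])

# L is positive definite (a PSD rank-one part plus a positive diagonal), and A is
# positive definite, so their Kronecker product is positive definite as well.
assert np.linalg.eigvalsh(L).min() > 0
assert np.linalg.eigvalsh(H).min() > 0
```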
|
|
|
|
|
*Proof of proposition 3.2.* Given a profile of models $\vec\theta$, the total cost is

$$\sum_i \ell(\vec\theta, \theta^i) = \sum_i \mathbb{E}_{(x,y)\sim\mathcal{D}(\vec\theta)}\big[(y - \theta^i\cdot x)^2\big] = \sum_i \mathbb{E}\Big[\Big(\Big\langle x,\ \theta^0 - \sum_j\lambda_j\theta^j - \theta^i\Big\rangle + x_0\Big)^2\Big],$$

which is a convex function on $\vec\theta\in\Theta^n$. Additionally,
|
|
|
|
|
$$\frac{\partial}{\partial\theta_k^i}\sum_{\imath}\ell(\vec\theta, \theta^\imath) = (1+\lambda_i)\,g_k^i(\vec\theta, \vec\theta) + \sum_{\imath\in[n]:\imath\neq i}\lambda_i\, g_k^i(\vec\theta, \vec\theta) = (1+n\lambda_i)\,g_k^i(\vec\theta, \vec\theta),$$

which is the gradient of the decoupled performative loss scaled by $(1 + n\lambda_i)$. Thus, we can apply the KKT conditions, and the minimum happens if and only if eq. (9) holds.
|
|
|
B PROOF AND DETAILS FOR THEOREM 4.2 |
|
|
|
*Proof of lemma 4.3.* By the chain rule,

$$\frac{d}{dt}\Phi(\vec\vartheta(t)) = \sum_{i\in[n],k\in[d]} \frac{\partial \Phi}{\partial \theta_k^i}(\vec\vartheta(t)) \cdot \frac{d}{dt}\vartheta_k^i(t), \tag{10}$$

so the time derivative of the potential function is zero if the dynamics is at a fixed point of eq. (6).
|
|
|
|
|
|
|
|
Now we compute a closed form of the time derivative and show the derivative is zero only if the |
|
dynamics is at a fixed point of eq. (6). Here we omit the input t to simplify the notation. |
|
|
|
|
|
$$\begin{aligned}
\frac{d}{dt}\Phi(\vec\vartheta(t)) &= \sum_{i,k} \frac{\partial \Phi}{\partial \theta_k^i}(\vec\vartheta(t)) \cdot \frac{\eta_i}{\|\eta\|_1}\,\vartheta_k^i(t)\Big(\sum_l \vartheta_l^i(t)\,g_l^i(\vec\vartheta(t)) - g_k^i(\vec\vartheta(t))\Big) && \text{(by eqs. (6) and (10))}\\
&= \sum_i \frac{\lambda_i\eta_i}{\|\eta\|_1} \sum_k g_k^i\cdot\vartheta_k^i\Big(\sum_l \vartheta_l^i g_l^i - g_k^i\Big) && \text{(by the partial derivative of eq. (2))}\\
&= \frac{1}{2\|\eta\|_1}\sum_i \lambda_i\eta_i \sum_{k,l\in[d]} \vartheta_k^i\vartheta_l^i\big(2g_k^ig_l^i - 2(g_k^i)^2\big) && \text{(because } \textstyle\sum_l \vartheta_l^i = 1\text{)}\\
&= \frac{-1}{2\|\eta\|_1}\sum_i \lambda_i\eta_i \sum_{k,l\in[d]} \vartheta_k^i\vartheta_l^i\big(g_k^i - g_l^i\big)^2.
\end{aligned}$$

Now we bound the time derivative.

$$\begin{aligned}
-2\|\eta\|_1\,\frac{d}{dt}\Phi(\vec\vartheta(t)) &= \sum_{i\in[n],k,l\in[d]} \lambda_i\eta_i\vartheta_k^i\vartheta_l^i(g_k^i - g_l^i)^2\\
&\ge \frac{\Big(\sum_{i\in[n],k,l\in[d]} \lambda_i\eta_i\vartheta_k^i\vartheta_l^i\,|g_k^i - g_l^i|\Big)^2}{\sum_{i\in[n],k,l\in[d]} \lambda_i\eta_i\vartheta_k^i\vartheta_l^i} && \text{(by Cauchy inequality)}\\
&= \frac{1}{\sum_i \lambda_i\eta_i}\Big(\sum_{i,k,l} \lambda_i\eta_i\vartheta_k^i\vartheta_l^i\,|g_k^i - g_l^i|\Big)^2 && (\textstyle\sum_k \vartheta_k^i = \sum_l \vartheta_l^i = 1)\\
&\ge \frac{1}{\sum_i \lambda_i\eta_i}\Big(\sum_{i\in[n],k\in[d]} \lambda_i\eta_i\,\vartheta_k^i\Big|\sum_{l\in[d]} \vartheta_l^i(g_k^i - g_l^i)\Big|\Big)^2. && \text{(by triangle inequality)}
\end{aligned}$$

Therefore, $\frac{d}{dt}\Phi(\vec\vartheta(t)) = 0$ only if $\vartheta_k^i\big(g_k^i - \sum_{l\in[d]}\vartheta_l^i g_l^i\big) = 0$ for all $i\in[n]$ and $k\in[d]$, which is a fixed point of eq. (6).

We can further simplify the bound. Because $\sum_{i,k}\lambda_i\eta_i\,\big|\vartheta_k^i\big(\sum_l \vartheta_l^i g_l^i - g_k^i\big)\big| \ge (\min_i\lambda_i\eta_i)\sum_{i,k}|\xi_k^i(\vec\vartheta)| = (\min_i\lambda_i\eta_i)\,\|\vec\xi(\vec\vartheta)\|_1$, we have $\frac{d}{dt}\Phi(\vec\vartheta(t)) \le -\frac{\min_i\lambda_i^2\eta_i^2}{2\|\eta\|_1\sum_i\lambda_i\eta_i}\,\|\vec\xi(\vec\vartheta)\|_1^2$.
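The key algebraic step in lemma 4.3, rewriting each agent's term $\sum_k g_k^i\vartheta_k^i(\sum_l\vartheta_l^ig_l^i - g_k^i)$ as minus a variance, can be verified numerically (a minimal sketch with a random point of the simplex and random gradient values):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
theta = rng.random(d); theta /= theta.sum()   # a point of the simplex, sum_k theta_k = 1
g = rng.normal(size=d)                        # arbitrary gradient values g_k

# Left-hand side: sum_k g_k * theta_k * (sum_l theta_l g_l - g_k)
lhs = sum(g[k] * theta[k] * (theta @ g - g[k]) for k in range(d))

# Right-hand side: -(1/2) sum_{k,l} theta_k theta_l (g_k - g_l)^2
rhs = -0.5 * sum(theta[k] * theta[l] * (g[k] - g[l]) ** 2
                 for k in range(d) for l in range(d))

assert np.isclose(lhs, rhs)
# Both equal minus the variance of g under theta, hence dPhi/dt <= 0 along eq. (6).
assert np.isclose(lhs, -(theta @ g**2 - (theta @ g) ** 2))
```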
|
|
|
*Proof of theorem 4.2.* When the fixed points are isolated, satisfying eq. (4), by lemma 4.3 the dynamics in eq. (6) converges to a fixed point $\vec\theta^*\in\Theta^n$ of eq. (6). Because $\vec\theta^*$ is a fixed point, for all $i\in[n]$ and $k\in[d]$,

$$(\theta^*)_k^i\big(\bar g^i(\vec\theta^*) - g_k^i(\vec\theta^*)\big) = (\theta^*)_k^i\Big(\sum_l(\theta^*)_l^i g_l^i(\vec\theta^*) - g_k^i(\vec\theta^*)\Big) = 0.$$
|
|
|
|
|
By proposition 3.1, the fixed point condition implies that the coordinates of the gradient of the loss in the support are identical, $g_k^i(\vec\theta^*) = \bar g^i(\vec\theta^*)$ for all $i$ and $k\in S_i$. Thus, if the fixed point $\vec\theta^*$ is not performative stable, there exist $\imath$ and $\kappa$ with $(\theta^*)_\kappa^\imath = 0$ so that $g_\kappa^\imath(\vec\theta^*) < \bar g^\imath(\vec\theta^*)$. Furthermore, $\exp(-\eta_\imath g_\kappa^\imath(\vec\theta^*)) > \sum_l(\theta^*)_l^\imath\exp(-\eta_\imath g_l^\imath(\vec\theta^*))$. We can pick a small enough $\epsilon > 0$ and define

$$U_\epsilon := \Big\{\vec\theta : \exp(-\eta_\imath g_\kappa^\imath) > \sum_l\theta_l^\imath\exp(-\eta_\imath g_l^\imath) + \epsilon\Big\}, \tag{11}$$
|
|
|
which contains $\vec\theta^*$ and is an open set because $\vec g$ and the exponential function are continuous. Since $\vec\vartheta(t)$ converges to $\vec\theta^*$ as $t\to\infty$, there exists a time $t_\epsilon$ so that for all $t\ge t_\epsilon$, $\vec\vartheta(t)\in U_\epsilon$. However, if $\vec\vartheta(t)\in U_\epsilon$ and $\vec\vartheta(t)$ is an interior point, we get $\frac{d}{dt}\vartheta_\kappa^\imath(t) > 0$ by eqs. (6) and (11). Therefore, $\vartheta_\kappa^\imath(t)$ is positive and increasing for $t\ge t_\epsilon$. We reach a contradiction because $\vartheta_\kappa^\imath(t)\to(\theta^*)_\kappa^\imath = 0$. Therefore, $\vec\theta^*$ is a performative stable point.
|
|
|
|
|
|
|
|
C PROOF AND DETAILS FOR LEMMA 4.4
|
|
|
To prove lemma 4.4, we show the dynamic eq. (3) can be approximated by eq. (6), with an error that vanishes as $\eta_*$ decreases. Thus, we can show the dynamic hits an arbitrary neighborhood of the performative stable point. We further use $\Phi$ to show the dynamic stays in that neighborhood. Below we state two ancillary claims that control the error of our approximation.
|
|
|
We define an error vector $\vec e(\vec\theta)\in\mathbb{R}^{n\times d}$ between eqs. (3) and (6) so that

$$\eta_i^2 e_k^i(\vec\theta) := \frac{\theta_k^i\exp(\eta_i(\bar g^i - g_k^i))}{\sum_l \theta_l^i\exp(\eta_i(\bar g^i - g_l^i))} - \theta_k^i - \eta_i\xi_k^i.$$
|
|
|
|
|
We omit the input $\vec\theta$ of $\vec e$ and write $\vec e_t := \vec e(\vec\theta_t)$ when there is no ambiguity. We show that if the error vector is small, $\Phi$ is decreasing along the dynamics eq. (3).
|
|
|
**Claim C.1 (Approximated potential).** *There exists $C_4$ so that for all $t$ and $\vec\theta_0\in\Theta^n$, the difference of eq. (2) along eq. (3) satisfies*

$$\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\,\|\vec\xi_t\|_1^2 + \max_i\lambda_i\eta_i^2\,\max_{i,k,\vec\theta}|g_k^i(\vec\theta)|\cdot\|\vec e_t\|_1 + d^2C_4\,\|\vec\theta_{t+1} - \vec\theta_t\|_1^2.$$
|
**Claim C.2.** *If $\max_i\eta_i$ is small enough and satisfies eq. (12), there exists a constant $C$ so that $|e_k^i(\vec\theta)| \le C$ for all $i\in[n]$, $k\in[d]$, and $\vec\theta\in\Theta^n$.*
|
|
|
The proofs of these two claims are based on first order approximation. |
|
|
|
|
|
|
|
|
|
|
*Proof of claim C.1.* By the partial derivative of eq. (2), we have $\frac{\partial^2}{\partial\theta_k^i\partial\theta_l^j}\Phi(\vec\theta) = 2\lambda_i\lambda_jA_{l,k}$ if $i\neq j$ and $\frac{\partial^2}{\partial\theta_k^i\partial\theta_l^i}\Phi(\vec\theta) = 2\lambda_i(1+\lambda_i)A_{l,k}$. Thus, the second order partial derivatives of $\Phi$ can be bounded by twice $C_4 := \max_\imath \lambda_\imath(\lambda_\imath + 1)\max_{k,l}A_{k,l}$. By Taylor's expansion, we have
|
|
|
|
|
$$\begin{aligned}
\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) &\le \nabla\Phi(\vec\theta_t)\cdot(\vec\theta_{t+1} - \vec\theta_t) + \frac{d^2C_4}{2}\,\|\vec\theta_{t+1} - \vec\theta_t\|_1^2\\
&\le \sum_{i\in[n],k\in[d]} \lambda_i g_k^i(\vec\theta_t)\big(\eta_i\xi_k^i(\vec\theta_t) + \eta_i^2 e_k^i(\vec\theta_t)\big) + d^2C_4\,\|\vec\theta_{t+1} - \vec\theta_t\|_1^2 && \text{(by the partial derivative of eq. (2))}\\
&= \sum_{i\in[n],k\in[d]} \lambda_i\eta_i g_{t,k}^i\xi_{t,k}^i + \lambda_i\eta_i^2 g_{t,k}^i e_{t,k}^i + d^2C_4\,\|\vec\theta_{t+1} - \vec\theta_t\|_1^2\\
&\le -\frac{1}{2\sum_i\lambda_i\eta_i}\Big(\sum_{i,k}\lambda_i\eta_i|\xi_{t,k}^i|\Big)^2 + \sum_{i\in[n],k\in[d]}\lambda_i\eta_i^2 g_{t,k}^i e_{t,k}^i + d^2C_4\,\|\vec\theta_{t+1} - \vec\theta_t\|_1^2. && \text{(by lemma 4.3)}
\end{aligned}$$

Therefore, we have

$$\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\,\|\vec\xi_t\|_1^2 + \max_i\lambda_i\eta_i^2\,\max_{i,k,\vec\theta}|g_k^i(\vec\theta)|\cdot\|\vec e_t\|_1 + d^2C_4\,\|\vec\theta_{t+1} - \vec\theta_t\|_1^2,$$

which completes the proof.
|
|
|
*Proof of claim C.2.* If $\eta_*$ is small enough and satisfies eq. (12), we have $\eta_i(\bar g^i - g_k^i) \le 1$, and

$$\theta_k^i + \eta_i\theta_k^i(\bar g^i - g_k^i) \le \theta_k^i e^{\eta_i(\bar g^i - g_k^i)} \le \theta_k^i + \eta_i\theta_k^i(\bar g^i - g_k^i) + \frac{e}{2}\eta_i^2\theta_k^i(\bar g^i - g_k^i)^2$$

for all $i, k$ and $\vec\theta\in\Theta^n$. Additionally, with $\xi_k^i = \theta_k^i(\bar g^i - g_k^i)$, we can rewrite this as

$$\theta_k^i + \eta_i\xi_k^i \le \theta_k^i e^{\eta_i(\bar g^i - g_k^i)} \le \theta_k^i + \eta_i\xi_k^i + \frac{e}{2}\eta_i^2\theta_k^i(\bar g^i - g_k^i)^2.$$

Therefore, the error can be upper bounded as

$$\eta_i^2 e_k^i \le \frac{\theta_k^i + \eta_i\xi_k^i + \frac{e}{2}\eta_i^2\theta_k^i(\bar g^i - g_k^i)^2}{\sum_l \big(\theta_l^i + \eta_i\xi_l^i\big)} - (\theta_k^i + \eta_i\xi_k^i) = \frac{e}{2}\eta_i^2\theta_k^i(\bar g^i - g_k^i)^2,$$

because $\sum_l\theta_l^i = 1$ and $\sum_l\xi_l^i = 0$. For the lower bound, we have

$$\eta_i^2 e_k^i \ge \frac{\theta_k^i + \eta_i\xi_k^i}{\sum_l \big(\theta_l^i + \eta_i\xi_l^i + \frac{e}{2}\eta_i^2\theta_l^i(\bar g^i - g_l^i)^2\big)} - (\theta_k^i + \eta_i\xi_k^i) \ge -(\theta_k^i + \eta_i\xi_k^i)\,\frac{e\eta_i^2}{2}\sum_l\theta_l^i(\bar g^i - g_l^i)^2.$$

If we set $C := e\max_{\vec\theta}\sum_l\theta_l^i(\bar g^i - g_l^i)^2$, because $|\eta_i\xi_k^i| \le \eta_i\,2\max_k|g_k^i| \le 1$ by eq. (12) and $\theta_k^i \le 1$, we have $|e_k^i(\vec\theta)| \le C$.
|
|
|
*Proof of lemma 4.4.* Because $D$ is open, we can take $D'$ so that $\vec\theta^*\in D'$, $\bar D'\subset D$, and $\sup_{\vec\theta\in D'}\Phi(\vec\theta) < \frac{1}{2}\inf_{\vec\theta\notin D}\Phi(\vec\theta)$. We will pick $\eta_*$ small enough so that $\vec\theta_{t+1}\in D$ for all $\vec\theta_t\in D'\subset D$. The proof has two parts: we first show the dynamics eq. (3) hits the set $D$. Then we prove eq. (3) stays in $D$ afterward.

By theorem 4.2, there exists $\tau$ so that $\vec\vartheta(\tau)\in D'$. Now, by Gronwall's inequality, we can set $\eta_*$ small enough so that $\vec\theta_{\lfloor\tau/\|\eta\|_1\rfloor}\in D$. Formally, given the dynamics in eq. (3), we define right-continuous step functions $\vec\theta(t) = \vec\theta_{\lfloor t\rfloor}$ and $\vec e(t) := \vec e(\vec\theta_{\lfloor t\rfloor})$ for all $t\ge 0$. Then the dynamics in eq. (3) can be written as
|
|
|
|
|
$$\theta_k^i(t) - \theta_k^i(0) = \sum_{s=0}^{t-1}\big(\theta_{s+1,k}^i - \theta_{s,k}^i\big) = \sum_{s=0}^{t-1}\Big(\eta_i\xi_k^i(\vec\theta_s) + \eta_i^2 e_k^i(\vec\theta_s)\Big) = \eta_i\int_0^t \xi_k^i(\vec\theta(s)) + \eta_i e_k^i(s)\, ds.$$

On the other hand, the solution of eq. (6) can be written as

$$\vartheta_k^i(\|\eta\|_1 t) - \vartheta_k^i(0) = \frac{\eta_i}{\|\eta\|_1}\int_0^{\|\eta\|_1 t} \xi_k^i(\vec\vartheta(s))\, ds = \eta_i\int_0^t \xi_k^i(\vec\vartheta(\|\eta\|_1 s))\, ds.$$
|
Because $\xi_k^i$ are continuous and $\Theta^n$ is compact, there exists $L > 0$ so that $\xi_k^i$ is $L$-Lipschitz with respect to the one norm for all $i$ and $k$. Thus, the difference between the above equations is
|
|
|
$$\begin{aligned}
\big|\theta_k^i(t) - \vartheta_k^i(\|\eta\|_1 t)\big| &\le \eta_i\int_0^t \big|\xi_k^i(\vec\theta(s)) - \xi_k^i(\vec\vartheta(\|\eta\|_1 s))\big|\, ds + \eta_i^2\int_0^t |e_k^i(s)|\, ds\\
&\le \eta_i L\int_0^t \big\|\vec\theta(s) - \vec\vartheta(\|\eta\|_1 s)\big\|_1\, ds + \eta_i^2 Ct. && \text{($L$-Lipschitz and claim C.2)}
\end{aligned}$$

Therefore, by Gronwall's inequality and $\eta_i \le \eta_*$, we have

$$\big\|\vec\theta(t) - \vec\vartheta(\|\eta\|_1 t)\big\|_1 \le \eta_* ndL\int_0^t \big\|\vec\theta(s) - \vec\vartheta(\|\eta\|_1 s)\big\|_1\, ds + (\eta_*)^2 ndCt \le (\eta_*)^2 ndCt\, e^{\eta_* ndLt}.$$

Because $\vec\vartheta(\tau)\in D'$, we can pick $\eta_*$ small enough so that $\vec\theta(\tau/\|\eta\|_1) = \vec\theta_{\lfloor\tau/\|\eta\|_1\rfloor}\in D$ and $\Phi(\vec\theta_{\lfloor\tau/\|\eta\|_1\rfloor}) < \inf_{\vec\theta\notin D}\Phi(\vec\theta)$.
|
|
|
Now we show the second part: $\vec\theta_t\in D$ for all $t\ge\lfloor\tau/\|\eta\|_1\rfloor$. First, with claim C.1, we prove that the potential function is decreasing for all $\vec\theta_t\notin D'$ when $\eta_*$ is small enough. We estimate the three terms in claim C.1 separately. First, because $\vec\theta^*\in D'$ and $\Theta^n\setminus D'$ is compact, $\min_{\vec\theta\notin D'}\|\vec\xi(\vec\theta)\|_1 > 0$ exists, so $\frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\|\vec\xi_t\|_1^2 = \Omega(\max_i\eta_i)$. By claim C.2, $\max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g_k^i(\vec\theta)|\cdot\|\vec e_t\|_1 = O(\max_i\eta_i^2)$. Finally, $d^2C_4\|\vec\theta_{t+1} - \vec\theta_t\|_1^2 = O(\max_i\eta_i^2)$. Therefore, there exists $\eta_*$ small enough so that the potential function satisfies $\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) < 0$ for all $\vec\theta_t\notin D'$ and all such $\eta$. Therefore $\Phi(\vec\theta_t) < \inf_{\vec\theta\notin D}\Phi(\vec\theta)$ for all $t\ge\lfloor\tau/\|\eta\|_1\rfloor$, which completes the proof.
|
|
|
|
|
|
|
|
D PROOFS AND DETAILS FOR LEMMA 4.5 |
|
|
|
Lemma 4.3 shows that the value of $\Phi$ is decreasing along eq. (6), and that the decrease rate is lower bounded by the one norm of $\vec\xi$. Thus, if we can show the error between eq. (3) and eq. (6) is bounded by $\|\vec\xi\|_1$, the value of $\Phi$ is also decreasing along eq. (3). The main challenge is that, because $\vec\xi(\vec\theta^*) = 0$, we need to control the error as $\vec\xi$ converges to zero while $\eta_*$ is fixed.
|
|
|
We first show three ancillary claims, claims D.1 to D.3. Claim D.1 shows the vanishing components of $\vec\theta_t$ decrease rapidly once the dynamic is in $D$. Claims D.2 and D.3 show the supporting components also decrease once the vanishing components are small enough.
|
|
|
Given $(n, d, A, \lambda)$, we define the following constants: $C_1 := \frac{1}{4}\max_{i,l,\vec\theta\in\Theta^n}|g_l^i|$, $C_2 := \frac{2}{\min_{i,k}(\theta^*)_k^i}$, $C_3 := ed\max(C_1, C_2)$, and $C_4 := \max_\imath \lambda_\imath(\lambda_\imath + 1)\max_{k,l}A_{k,l}$. We require the maximum learning rate to be bounded by $\eta_*$, which satisfies the following conditions:

$$\eta_*\max_{i,k,\vec\theta\in\Theta^n}|g_k^i(\vec\theta)| \le \tfrac{1}{2} \tag{12}$$

$$\eta_*\, ndC_3\max_{\vec\theta}\|\vec\xi(\vec\theta)\|_1 \le 1 \tag{13}$$

Additionally, given the bound $\frac{\max_i \eta_i}{\min_i \eta_i} \le R_\eta$ on the learning rate ratio, we require

$$R_\eta^2\eta_* < \frac{\min_i \lambda_i^2}{16\sum_i \lambda_i}\min\left(\frac{4}{C_3\max_i\lambda_i\max_{\vec\theta}\|\vec g\|_1},\ \frac{1}{d^2C_4}\right). \tag{14}$$

Note that $\frac{\max_i\eta_i^3}{\min_i\eta_i^2}$ is less than the right hand side of eq. (14). On the other hand, by lemma 4.4, we can pick $D$ small enough so that the following condition holds:

$$\frac{1}{2}\min_{i\in[n],k\in S_i}(\theta^*)_k^i \le \min_{i\in[n],k\in S_i,\vec\theta\in D}\theta_k^i. \tag{15}$$
|
|
|
|
|
We first show that for all $i$ and $k\in\bar S_i$, $\theta_{t,k}^i$ is decreasing and converges to zero exponentially fast as $t$ increases. Because $\vec\theta^*$ is a proper performative stable point, $\sum_l(\theta^*)_l^i e^{-\eta_i g_l^i(\vec\theta^*)} = e^{-\eta_i\bar g^i(\vec\theta^*)} > e^{-\eta_i g_k^i(\vec\theta^*)}$ for all $i$ and $k\in\bar S_i$. We can take $\eta_*$, $\epsilon_1$, and $D$ small enough so that for all $\vec\theta\in D$ and all learning rate profiles $\eta$ with $\max_i \eta_i \le \eta_*$,

$$e^{-\eta_i g_k^i} < (1 - \epsilon_1)\sum_l \theta_l^i e^{-\eta_i g_l^i}, \quad \forall i\in[n],\ k\in\bar S_i. \tag{16}$$
|
**Claim D.1 (Vanishing components).** *Given $\eta_*$, $\epsilon_1$, and $D$ in eq. (16), if $\vec\theta_t\in D$ for all $t\ge 0$, then for all $i\in[n]$ and $k\in\bar S_i$, $\theta_k^i(t)$ is decreasing in $t$, and for all $t\ge 0$,*

$$0 \le \theta_{t,k}^i \le \theta_{0,k}^i\, e^{-\epsilon_1 t}.$$
|
|
|
*Proof of claim D.1.* By eq. (16), for all $t\ge 0$, $\theta_{t+1,k}^i = \theta_{t,k}^i e^{-\eta_i g_{t,k}^i}\big/\sum_l\theta_{t,l}^i e^{-\eta_i g_{t,l}^i} \le (1-\epsilon_1)\theta_{t,k}^i$. Therefore, $\theta_{t,k}^i$ is decreasing, and $\theta_{t,k}^i \le \theta_{0,k}^i e^{-\epsilon_1 t}$.
|
|
|
**Claim D.2.** *There exists a constant $C_3$ such that for all $\vec\theta\in\Theta^n$ and $\delta_1 > 0$ with $\|\vec\xi(\vec\theta)\|_1 \ge 2\sqrt{\delta_1}$ and $|\xi_k^i(\vec\theta)| \le \delta_1$ for all $i\in[n]$ and $k\in\bar S_i$,*

$$|e_k^i(\vec\theta)| \le C_3\,\|\vec\xi(\vec\theta)\|_1^2, \tag{17}$$

*for all $i\in[n]$ and $k\in[d]$.*
|
|
|
|
|
Claim D.2 shows that if the one norm of $\vec\xi$ is much bigger than the vanishing components, the error term $\vec e$ can be bounded by $\|\vec\xi\|_1^2$. Moreover, claim D.1 ensures that the vanishing components decrease rapidly, so the condition of claim D.2 readily holds.
|
|
|
|
|
|
|
|
_Proof of claim D.2._ Given $\vec\theta$ and $\delta_1 > 0$ that satisfy the condition, we first show two inequalities bounding $\theta^i_k(\bar g^i - g^i_k)^2$ for vanishing and supporting components respectively. For a vanishing component $k \in \bar S_i$,

$$\theta^i_k(\bar g^i - g^i_k)^2 \le \max_{i,l,\vec\theta\in\Theta^n}|\bar g^i - g^i_l| \cdot |\theta^i_k(\bar g^i - g^i_k)| \le C_1\cdot\delta_1 \le \frac{C_1}{4}\,\|\vec\xi\|_1^2, \qquad (18)$$

where we use $|\theta^i_k(\bar g^i - g^i_k)| = |\xi^i_k| \le \delta_1$ and $\delta_1 \le \|\vec\xi\|_1^2/4$.
|
|
|
Then for a supporting component $k \in S_i$, with eq. (15) we have

$$\theta^i_k(\bar g^i - g^i_k)^2 \le \frac{1}{\min_{i,l,\vec\theta\in D}\theta^i_l}\,\big(\theta^i_k(\bar g^i - g^i_k)\big)^2 \le \frac{2}{\min_{i,k}(\theta^*)^i_k}\,|\xi^i_k|^2 \le C_2\,\|\vec\xi\|_1^2. \qquad (19)$$
|
|
|
_∈_ |
|
|
|
Now we use the above two inequalities to approximate eq. (3). For the numerator, because $1 + x \le \exp(x) \le 1 + x + \frac{e}{2}x^2$ for all $x \le 1$, we have $\theta^i_k\exp(\eta_i(\bar g^i - g^i_k)) \ge \theta^i_k + \eta_i\theta^i_k(\bar g^i - g^i_k) = \theta^i_k + \eta_i\xi^i_k$. On the other hand, because $\eta_i(\bar g^i - g^i_k) \le 1$ by eq. (12), $\theta^i_k\exp(\eta_i(\bar g^i - g^i_k)) \le \theta^i_k + \eta_i\xi^i_k + \frac{e}{2}\theta^i_k\eta_i^2(\bar g^i - g^i_k)^2$.
|
|
|
By eqs. (18) and (19), we have

$$0 \le \theta^i_k\, e^{\eta_i(\bar g^i - g^i_k)} - \theta^i_k - \eta_i\xi^i_k \le \frac{e}{2}\max(C_1, C_2)\,\eta_i^2\,\|\vec\xi\|_1^2. \qquad (20)$$

For the denominator, we sum eq. (20) over $k$. Because $\sum_l \theta^i_l = 1$ and $\sum_l \theta^i_l(\bar g^i - g^i_l) = 0$, we have

$$0 \le \sum_l \theta^i_l\, e^{\eta_i(\bar g^i - g^i_l)} - 1 \le \frac{ed}{2}\max(C_1, C_2)\,\eta_i^2\,\|\vec\xi\|_1^2. \qquad (21)$$
|
|
|
Given $i \in [n]$ and $k \in [d]$, we apply the above equations to eq. (3), where all quantities are evaluated at $\vec\theta_t$. For the upper bound, we have

$$\eta_i^2\, e^i_k(\vec\theta_t) = \frac{\theta^i_{t,k}\exp(\eta_i(\bar g^i - g^i_k))}{\sum_l \theta^i_{t,l}\exp(\eta_i(\bar g^i - g^i_l))} - \theta^i_{t,k} - \eta_i\xi^i_{t,k}$$
$$\le \theta^i_{t,k}\, e^{\eta_i(\bar g^i - g^i_k)} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \qquad \text{(by eq. (21), the denominator is at least } 1\text{)}$$
$$\le \frac{e\max(C_1, C_2)}{2}\,\eta_i^2\,\|\vec\xi\|_1^2. \qquad \text{(by eq. (20))}$$
|
|
|
For the lower bound,

$$\eta_i^2\, e^i_k(\vec\theta_t) = \frac{\theta^i_{t,k}\exp(\eta_i(\bar g^i - g^i_k))}{\sum_l \theta^i_{t,l}\exp(\eta_i(\bar g^i - g^i_l))} - \theta^i_{t,k} - \eta_i\xi^i_{t,k}$$
$$\ge \frac{\theta^i_{t,k} + \eta_i\xi^i_{t,k}}{1 + \frac{ed\max(C_1,C_2)}{2}\eta_i^2\|\vec\xi\|_1^2} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \qquad \text{(by eqs. (20) and (21))}$$
$$\ge -(\theta^i_{t,k} + \eta_i\xi^i_{t,k})\,\frac{ed\max(C_1,C_2)}{2}\,\eta_i^2\,\|\vec\xi\|_1^2 \qquad (1/(1+x) \ge 1-x)$$
$$\ge -ed\max(C_1,C_2)\,\eta_i^2\,\|\vec\xi\|_1^2. \qquad (\theta^i_{t,k} \le 1 \text{ and } \eta_i\xi^i_{t,k} \le 1 \text{ by eq. (12)})$$

Therefore, the claim holds with $C_3 := ed\max(C_1, C_2)$, which finishes the proof.
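The decomposition behind claim D.2 — the update equals $\theta^i_{t,k} + \eta_i\xi^i_{t,k}$ up to a second-order residual — can be sanity-checked numerically. The following single-agent sketch uses randomly drawn strategies and hypothetical losses in $[0,1]$; the constant in the final assertion is a generous stand-in, not the proof's $C_3$.

```python
import numpy as np

def mwu_step(theta, g, eta):
    # Multiplicative-weights update with losses g and step size eta.
    w = theta * np.exp(-eta * g)
    return w / w.sum()

rng = np.random.default_rng(0)
eta = 0.05
worst = 0.0
for _ in range(1000):
    theta = rng.dirichlet(np.ones(4))
    g = rng.uniform(0.0, 1.0, size=4)
    gbar = theta @ g
    xi = theta * (gbar - g)              # first-order drift term of the proof
    err = mwu_step(theta, g, eta) - theta - eta * xi
    worst = max(worst, float(np.max(np.abs(err))))

# The residual is second order in eta; a generous constant suffices here.
assert worst <= np.e * (2 * eta) ** 2
```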
|
|
|
**Claim D.3.** _There exists $\epsilon_2 > 0$ such that for all $\vec\theta_t \in \Theta^n$ and $\delta_1 > 0$ with $\|\vec\xi(\vec\theta_t)\|_1 \ge 2\sqrt{\delta_1}$ and $|\xi^i_k(\vec\theta_t)| \le \delta_1$ for all $i \in [n]$ and $k \in \bar S_i$, we have $\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\epsilon_2\,\|\vec\xi(\vec\theta_t)\|_1^2$._
|
|
|
_Proof of claim D.3._ By claim C.1,

$$\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\,\|\vec\xi_t\|_1^2 + \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)|\cdot\|\vec e_t\|_1 + d^2C_4\,\|\vec\theta_{t+1} - \vec\theta_t\|_1^2.$$

We can bound the latter two terms by claim D.2. For the second term, $\|\vec e_t\|_1 \le ndC_3\,\|\vec\xi_t\|_1^2$. For the third term,

$$\|\vec\theta_{t+1} - \vec\theta_t\|_1 \le \sum_{i,k}\left(\eta_i|\xi^i_{t,k}| + C_3\eta_i^2\|\vec\xi\|_1^2\right) \le \max_i\eta_i\,\|\vec\xi\|_1 + ndC_3\max_i\eta_i^2\,\|\vec\xi\|_1^2 \le 2\max_i\eta_i\,\|\vec\xi\|_1. \qquad \text{(by eq. (13))}$$
|
|
|
|
|
|
|
|
With the above inequalities, by eq. (14) we have $\frac{\min_i\lambda_i^2\eta_i^2}{4\sum_i\lambda_i\eta_i} > \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)|\,ndC_3$ and $\frac{\min_i\lambda_i^2\eta_i^2}{4\sum_i\lambda_i\eta_i} > 4d^2C_4\max_i\eta_i^2$. Therefore there exists a constant

$$\epsilon_2 := \frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i} - \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)|\,ndC_3 - 4d^2C_4\max_i\eta_i^2 > 0$$

so that $\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\epsilon_2\,\|\vec\xi_t\|_1^2$.
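Claim D.3 is a discrete-time Lyapunov argument: away from the fixed point, each update strictly decreases the potential. As a loose analogue (our illustration only, not the paper's $\Phi$), under multiplicative weights with fixed losses the expected loss $\sum_k \theta_{t,k}\, g_k$ is non-increasing along the trajectory and approaches $\min_k g_k$:

```python
import numpy as np

def mwu_step(theta, g, eta):
    w = theta * np.exp(-eta * g)
    return w / w.sum()

g = np.array([0.3, 0.7, 0.5, 0.9])   # hypothetical fixed losses
theta = np.full(4, 0.25)
eta = 0.2

losses = []
for _ in range(100):
    losses.append(float(theta @ g))
    theta = mwu_step(theta, g, eta)
losses.append(float(theta @ g))

# Expected loss acts as a Lyapunov function for this toy instance:
# non-increasing in t, converging toward the minimum loss min_k g_k.
assert all(b <= a + 1e-12 for a, b in zip(losses, losses[1:]))
assert abs(losses[-1] - float(g.min())) < 0.05
```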
|
|
|
_Proof of lemma 4.5._ To prove $\lim_{t\to\infty}\vec\theta_t = \vec\theta^*$, there are two equivalent ways to measure progress. First, because $\vec\theta^*$ is an isolated fixed point of eq. (6), $\vec\theta_t$ converges to $\vec\theta^*$ if and only if $\lim_{t\to\infty}\vec\xi(\vec\theta_t) = \vec 0$. On the other hand, because $\Phi$ is strictly convex and $\vec\theta^*$ is its minimum point, $\vec\theta_t$ converges to $\vec\theta^*$ if and only if $\lim_{t\to\infty}\Phi(\vec\theta_t) = \min_{\vec\theta}\Phi(\vec\theta)$. With these two equivalent conditions, given any $\epsilon > 0$, there exists $\delta > 0$ such that $\|\vec\theta - \vec\theta^*\|_1 < \epsilon$ whenever

$$\vec\theta \in V_\delta := \left\{\vec\theta : \Phi(\vec\theta) \le \max_{\vec\theta':\,\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')\right\}.$$

Therefore, it is sufficient to show that for every $\delta > 0$ there is $t_\delta$ so that $\vec\theta_t \in V_\delta$ for all $t \ge t_\delta$. With the technical claims D.1 to D.3, our proof has two parts. First we show that the dynamic hits $V_\delta$; then we show that the dynamic stays in $V_\delta$.
|
|
|
For the first part, given $\delta > 0$ and $0 < C < 1$, by claim D.1 there exists $T_1$ such that each vanishing component satisfies $|\xi^i_{t,k}| \le C^2\delta^2/4$ for all $t \ge T_1$. Then by claim D.3, there exists $T_2 \ge T_1$ such that $\|\vec\xi_{T_2}\|_1 \le C\delta$. Otherwise, the value of $\Phi$ would decrease by $\epsilon_2\|\vec\xi_t\|_1^2 \ge \epsilon_2 C^2\delta^2 > 0$ at each round, contradicting that $\Phi$ is bounded from below.
|
|
|
For the second part, if $\|\vec\xi_t\|_1 \le C\delta \le \delta$ for all $t \ge T_2$, then we are done. Otherwise, there exists $\tau \ge T_2$ such that $\|\vec\xi_\tau\|_1 \le C\delta < \|\vec\xi_{\tau+1}\|_1$. Now we prove that $\Phi(\vec\theta_{\tau+1}) \le \max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$. Because the difference between $\vec\theta_{\tau+1}$ and $\vec\theta_\tau$ in one-norm is $\|\vec\theta_{\tau+1} - \vec\theta_\tau\|_1 \le \max_i\eta_i\|\vec\xi_\tau\|_1 + \max_i\eta_i^2\|\vec e_\tau\|_1 \le 2\max_i\eta_i\|\vec\xi_\tau\|_1$ when $\eta_*$ is small enough, and $\vec\xi$ is $L_\xi$-Lipschitz in one-norm for some constant $L_\xi$, we have

$$C\delta < \|\vec\xi_{\tau+1}\|_1 \le \|\vec\xi_\tau\|_1 + \|\vec\xi_{\tau+1} - \vec\xi_\tau\|_1 \le \|\vec\xi_\tau\|_1 + 2L_\xi\max_i\eta_i\,\|\vec\xi_\tau\|_1 \le \delta, \qquad (22)$$

when $C$ is small enough so that $C(1 + 2L_\xi\max_i\eta_i) \le 1$. Therefore, $\|\vec\xi_{\tau+1}\|_1 \le \delta$ and $\Phi(\vec\theta_{\tau+1}) \le \max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$. Finally, because $|\xi^i_{t,k}| \le C^2\delta^2/4$ for all $t \ge \tau+1$, by claim D.3 the potential function is decreasing for all $t \ge \tau+1$ unless $\|\vec\xi_t\|_1 \le C\delta$. In both cases the potential stays below $\max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$, which completes the proof.
|
|
|
E PROOF AND DETAILS FOR THEOREM 4.6
|
|
|
_Proof of lemma 4.7._ Define $P_L(x) := \alpha(L)(x - \beta(L))$, which is increasing in $x$ for all $L$ because $\alpha(L) > 0$. With $P_L$,

$$f_{\alpha(L),\beta(L)}(x) = \frac{x}{x + (1-x)\exp(P_L(x))}.$$

Take $x_1(L) = x_1 := 1 - 1/\alpha(L)$, $x_2(L) = x_2 := f_{\alpha(L),\beta(L)}(x_1(L))$, and $x_3(L) = x_3 := f_{\alpha(L),\beta(L)}(x_2(L))$. We will define $x_0(L)$ later. Below we omit the dependence on $L$ and simply write $x_1$, $x_2$, and $x_3$.
|
|
|
To define $x_0(L)$, we set $y(L) := \beta(L)/2$ and want to show

$$f_{\alpha(L),\beta(L)}(y) > x_1, \qquad (23)$$

which is equivalent to $(\alpha(L) - 1)(1 - y(L))\exp(P_L(y(L))) < y(L)$. When $L$ is large enough, we have $\beta(L) \in \left(\frac{2\beta_\infty}{3}, \frac{4\beta_\infty}{3}\right)$, and thus $y(L) > \beta_\infty/3$. Additionally, because $(\alpha(L)-1)(1-y(L)) < \alpha(L)$ and $e^{P_L(y(L))} = e^{-\alpha(L)\beta(L)/2} \le e^{-\alpha(L)\beta_\infty/3}$, we have $(\alpha(L)-1)(1-y(L))e^{P_L(y(L))} \le \alpha(L)e^{-\alpha(L)\beta_\infty/3} < \beta_\infty/3 < y(L)$ when $L$ is large enough, which proves eq. (23). On the other hand, because $\beta(L) < \frac{4\beta_\infty}{3}$ and
|
|
|
|
|
|
|
|
$P_L(\beta(L)) = 0$, we have $f_{\alpha(L),\beta(L)}(\beta(L)) = \beta(L) < \frac{1}{2}(\beta_\infty + 1)$. Moreover, $\frac{1}{2}(\beta_\infty + 1) < 1 - \frac{1}{\alpha(L)}$ holds when $L$ is large enough. These two imply

$$f_{\alpha(L),\beta(L)}(\beta(L)) = \beta(L) < x_1(L). \qquad (24)$$

Combining eqs. (23) and (24), by the intermediate value theorem there exists $x_0$ such that

$$f_{\alpha(L),\beta(L)}(x_0) = x_1 \quad\text{with}\quad y < x_0 < \beta(L) < x_1. \qquad (25)$$
|
|
|
Now we show $x_3 < x_0$. With $1 - x_1(L) = 1/\alpha(L)$ and $0 < x_1 < 1$, we have

$$x_2 = f_{\alpha(L),\beta(L)}(x_1) = \frac{x_1}{x_1 + \alpha(L)^{-1}\exp(P_L(x_1))} \le \alpha(L)\exp(-P_L(x_1)).$$

Because $\beta_\infty < 1/2$, when $L$ is large enough, $x_1(L) = 1 - \frac{1}{\alpha(L)} > \frac{1}{4}(2\beta(L) + 3)$ and, thus, $P_L(x_1) > \alpha(L)\left(\frac{3}{4} - \frac{\beta(L)}{2}\right)$. Therefore, we have

$$x_2 \le \alpha(L)\exp\left(-\alpha(L)\left(\frac{3}{4} - \frac{\beta(L)}{2}\right)\right). \qquad (26)$$
|
_[−]_ _[β][(][L][)]_ |
|
|
|
Finally, because $P_L(x_2) \ge P_L(0) = -\alpha(L)\beta(L)$, we get

$$x_3 = f_{\alpha(L),\beta(L)}(x_2) \le \frac{x_2}{x_2 + (1 - x_2)\exp(-\alpha(L)\beta(L))} = \frac{x_2\exp(\alpha(L)\beta(L))}{1 + x_2(\exp(\alpha(L)\beta(L)) - 1)}$$
$$\le x_2\exp(\alpha(L)\beta(L)) \qquad (\text{since } \exp(\alpha(L)\beta(L)) > 1)$$
$$\le \alpha(L)\exp\left(-\alpha(L)\left(\frac{3}{4} - \frac{\beta(L)}{2}\right) + \alpha(L)\beta(L)\right) \qquad \text{(by eq. (26))}$$
$$= \alpha(L)\exp\left(\frac{3\alpha(L)}{2}\left(\beta(L) - \frac{1}{2}\right)\right).$$
|
|
|
|
|
|
|
Finally, because $\beta_\infty < 1/2$, $x_3$ converges to $0$ as $L$ grows, so for large enough $L$

$$x_3 < y < x_0. \qquad (27)$$

By eqs. (25) and (27), we have $x_3 < x_0 < x_1$, which completes the proof.
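The ordering $x_3 < x_0 < x_1$ established above can be verified numerically for concrete parameters. The sketch below uses hypothetical values $\alpha = 40$, $\beta = 0.4$ standing in for a large $L$ (the paper's actual $\alpha(L)$, $\beta(L)$ are not reproduced here), and locates $x_0 \in (\beta/2, \beta)$ by bisection, which is valid here because $f$ is decreasing on that interval for these parameters:

```python
import math

def f(x, alpha, beta):
    """The map f_{alpha,beta}(x) = x / (x + (1-x) * exp(alpha * (x - beta)))."""
    return x / (x + (1 - x) * math.exp(alpha * (x - beta)))

alpha, beta = 40.0, 0.4          # hypothetical large-L parameter values
x1 = 1 - 1 / alpha               # x1 = 1 - 1/alpha
x2 = f(x1, alpha, beta)          # x2 = f(x1)
x3 = f(x2, alpha, beta)          # x3 = f(x2)

# Bisect for x0 in (beta/2, beta) with f(x0) = x1; eq. (23) gives f(beta/2) > x1
# and eq. (24) gives f(beta) = beta < x1, so a crossing exists.
lo, hi = beta / 2, beta
for _ in range(80):
    mid = (lo + hi) / 2
    if f(mid, alpha, beta) > x1:
        lo = mid
    else:
        hi = mid
x0 = (lo + hi) / 2

# The ordering used in the proof, matching eqs. (25) and (27).
assert x3 < x0 < x1
assert x3 < beta / 2 < x0
```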
|
|
|
|
|
|
|
|
|