# MULTI-AGENT PERFORMATIVE PREDICTION: FROM GLOBAL STABILITY AND OPTIMALITY TO CHAOS
**Anonymous authors**
Paper under double-blind review
ABSTRACT
The recent framework of performative prediction (Perdomo et al., 2020) aims to capture settings where predictions influence the very target/outcome they seek to predict. In this paper, we introduce a natural multi-agent version of this framework, where multiple decision makers try to predict the same outcome. We showcase that such competition can result in interesting phenomena by proving the possibility of phase transitions from stability to instability and eventually chaos. Specifically, we present settings of multi-agent performative prediction where, under sufficient conditions, the dynamics lead to global stability and optimality. In the opposite direction, when the agents are not sufficiently cautious in their learning/update rates, we show that instability and in fact formal chaos are possible. We complement our theoretical predictions with simulations showcasing the predictive power of our results.
1 INTRODUCTION
Performative prediction (Perdomo et al., 2020) is a recently introduced framework that focuses on a
natural but largely unexplored element of supervised learning. In many practical cases the predictive
model can affect the very outcome that it is trying to predict. For example, predictions about which
part of a road network will have high congestion trigger responses from the drivers which affect the
resulting traffic realization, leading to a shift of the target distribution. In such settings, Perdomo et al. (2020) explored conditions for the existence and approximate optimality of stable equilibria of such processes.
One possible way to interpret the performative prediction setting is as a single-agent "game", where the predictive agent plays a game against himself: he chooses the parameters of his model as his actions, but the predictive accuracy/cost of the model depends on his own past actions. Fixed
points of this process do not allow for profitable deviations. Once cast in this light, it becomes
self-evident that the restriction to a single predictive agent/model is arguably only the first step in
capturing more general phenomena where there is a closed loop between predictive models and their
environment. This motivates our central question: _What is the interplay between stability and optimality in cases where multiple predictive models operate in parallel to each other? Can the competition between multiple models lead to novel phenomena such as phase transitions from stability to instability and chaos?_
A natural setting to consider, for example, is market competition, where multiple hedge funds try to simultaneously predict the future prices and volatility of financial instruments. Of course, as they act upon their predictions, they also move the prices of these instruments in a highly correlated way. As more agents enter the market and the competition becomes increasingly fierce, is it possible that at some point the market flips from quickly discovering accurate, stable predictions reflecting the underlying market fundamentals to self-induced unpredictability and chaos? When it comes to performative prediction, can too many cooks spoil the soup?
**Our model** Standard supervised learning consists of three components: a set of predictive models, a loss function, and a data distribution. The learner (agent) observes samples of the distribution, and then decides on a predictive model. When predictions are performative, the agent's choice of a predictive model influences the data distribution. Thus, instead of a fixed data distribution, a performative prediction problem also has a distribution map: a mapping from the agent's predictive models to data distributions. We further propose multi-agent performative prediction, which models the influence of multiple agents' decisions on the data distribution, and ask whether these influences lead to convergent or chaotic systems.
To understand the framework of multi-agent performative prediction, we study a natural regression
problem with multi-agent location-scale distribution maps (definition 2.2) where the agents’ influence is linear in the agents’ models. Specifically, the data consist of features and outcomes. For
each agent i ∈ [n], his influence on the distribution of the outcome is his model weighted by a scalar
_λi > 0. When n = 1, our multi-agent location-scale distribution map is a special case of location-_
scale family in Miller et al. (2021). In the market example, each hedge fund company tries to predict
the price (outcome) based on macroeconomic data or other information (features). These influence
parameters λ1, . . ., λn can be seen as each hedge fund’s capital that can influence the distribution of
the future price.
Finally, we consider agents who do not know the dependency between their models and the data distribution, and who can only improve their predictive models myopically and iteratively. In this paper, the agents myopically use a reinforcement learning algorithm, exponentiated gradient for linear regression (Kivinen & Warmuth, 1997), to improve their predictive models in rounds. We study the long-term behavior, and ask whether the system converges to the performative stable and optimal point or behaves chaotically.
**Our results** We first study basic properties of our multi-agent performative prediction with the multi-agent location-scale distribution map. We show 1) the existence of a performative stable point in our setting (proposition 3.1) and 2) that performative stability and performative optimality are equivalent (proposition 3.2). This equivalence allows us to focus on the dynamical behavior of the system.
In section 4, we introduce learning dynamics to the multi-agent performative prediction and study their long-term behavior. We provide a threshold result which depends on the learning rates of exponentiated gradient descent and on the collective influence. Theorem 4.1 shows the dynamics of exponentiated gradient descent converge to the performative stable and optimal point when the learning rates are small enough. Our convergence result in theorem 4.1 holds when the feature space is multi-dimensional, and every learning agent can use a different learning rate starting from an arbitrary interior state. Our exact convergence result also applies in the single-agent performative prediction setting. In contrast, previous convergence results in Perdomo et al. (2020); Miller et al. (2021); Mendler-Dünner et al. (2020) only show that their dynamics converge to a small neighborhood of the performative stable point.
On the other hand, section 4.2 shows the dynamics can exhibit chaotic behavior if the collective influence L_n is large enough. Specifically, theorem 4.6 shows that even when the feature space is in R^2, these systems provably exhibit Li-Yorke chaos when the collective influence L_n = Σ_i λ_i is large enough. This implies that there exists an uncountable "scrambled" set so that, given any two initial conditions in the set, the lim inf of the distance between the two trajectories is zero, but the lim sup of their distance is positive (definition 2.4). The chaotic result in theorem 4.6 also holds for the original single-agent performative prediction so long as the agent's influence is large enough, and thus complements previous performative prediction works on convergence behavior, which primarily consider the case where the agents' influence on the data distribution is sufficiently small. Moreover, no matter how small the agents' learning rates, theorem 4.6 shows that chaos is inevitable in some performative prediction settings when the number of agents exceeds a carrying capacity. Past that point the system becomes totally unpredictable, with small perturbations exploding exponentially fast.
Finally, section 5 provides numerical examples of our dynamics and shows convergent and chaotic behavior. Additionally, through simulation, we demonstrate that our convergence and chaos results also hold when the agents can only access noisy estimates of the gradient and conduct stochastic exponentiated gradient descent.
**Related work** Data distribution shift is not a new topic in ML, but earlier works focused primarily on exogenous changes to the data-generating distribution. Performativity in machine learning was introduced by Perdomo et al. (2020). The original work and several follow-ups study the discrepancy between performative stability and performative optimality and prove approximate convergence of learning dynamics, e.g., stochastic gradient descent or iterative empirical risk minimization (Mendler-Dünner et al., 2020; Drusvyatskiy & Xiao, 2020; Izzo et al., 2021). Performativity of prediction is also related to several applications: strategic classification (Hardt et al., 2016) and retraining (Bartlett, 1992; Kuh et al., 1990).
Inspired by the instability of training algorithms in ML applications such as Generative Adversarial
Networks (GANs), there has been a lot of recent interest in understanding conditions (particularly
in multi-agent systems) where learning behavior may be non-equilibrating/unstable (Cheung & Tao,
2020; Balduzzi et al., 2020; Flokas et al., 2020; Andrade et al., 2021; Letcher, 2021; Giannou et al.,
2021). The (in)stability and performance of exponentiated gradient descent in particular (also referred to as Multiplicative Weights Updates) and other closely related dynamics has attracted a lot
of attention (Cohen et al., 2017; Bailey & Piliouras, 2018; Cheung, 2018; Panageas et al., 2019;
Cheung & Piliouras, 2020; Vadori et al., 2021). The technique of Li-Yorke chaos has recently found
applications across several different domains such as routing games, Cournot games and blockchain
protocols (Palaiopanos et al., 2017; Chotibut et al., 2020; Bielawski et al., 2021; Cheung et al., 2021;
Leonardos et al., 2021). To our knowledge, this is the first time where such formal chaotic results
are established in settings related to performative prediction and supervised learning more generally.
Another line of related work is learning in games. Specifically, our dynamics can be seen as special cases of the multiplicative weights update (Hedge algorithm) on congestion games. Previous works on Hedge algorithms only show exact convergence when the learning rate is decreasing (Kleinberg et al., 2009; Cohen et al., 2017), and, to the best of our knowledge, our results are the first that show exact convergence of Hedge algorithms with small constant learning rates. Our results also complement the exact convergence result for the linear variant of the multiplicative weights update by Palaiopanos et al. (2017).
2 PRELIMINARIES
2.1 MULTI-AGENT PERFORMATIVE PREDICTION
A multi-agent performative prediction comprises n agents deploying predictive models f_{θ^1}, ..., f_{θ^n} with parameters θ^1, ..., θ^n ∈ Θ that collectively influence the future data distribution. We formalize this dependency via a distribution map D(·) which outputs a distribution on the data, D(θ⃗), given a model profile θ⃗ := (θ^1, ..., θ^n) ∈ Θ^n. A loss function ℓ(z, θ′) measures the loss of a model f_{θ′} on a data point z ∈ Z, and the expected loss on a distribution D is E_{z∼D}[ℓ(z, θ′)]. For performative prediction, we further define the decoupled performative loss on a distribution map D as

$$ \ell(\vec{\theta}, \theta') := \mathbb{E}_{z \sim \mathcal{D}(\vec{\theta})}[\ell(z, \theta')] $$

where θ′ ∈ Θ denotes a predictive model, while θ⃗ ∈ Θ^n denotes a deployed model profile.
Given the set of models Θ, the loss function ℓ, and the distribution map D, each agent in a multi-agent performative prediction (Θ, ℓ, D) pursues minimal loss on the distribution that the agents collectively induce. We consider two solution concepts, performative optimality and performative stability, which generalize the original ones in Perdomo et al. (2020).
**Definition 2.1.** Given (Θ, ℓ, D), a model profile θ⃗* ∈ Θ^n is performatively optimal if the total loss is minimized,

$$ \vec{\theta}^* \in \arg\min_{\vec{\theta} \in \Theta^n} \sum_i \ell(\vec{\theta}, \theta^i). $$

Another desirable property of a model profile is that, given that all agents deploy their models, the models are also simultaneously optimal for the distribution they induce. Formally, θ⃗* is performatively stable if for all i ∈ [n],

$$ (\theta^*)^i \in \arg\min_{\theta^i \in \Theta} \ell(\vec{\theta}^*, \theta^i). $$

Performative optimality does not imply performative stability. For a performatively optimal point, the minimization variable θ⃗ affects both the first and the second argument, while it only affects the second one for a performatively stable point. We now introduce the model studied in this paper.
**Location-scale distribution map** In this paper, we study the family of multi-agent location-scale maps for regression problems where a data point consists of a d-dimensional feature and a scalar outcome, z = (x, y). These are natural classes of distribution maps in which performative effects enter through an additive or multiplicative factor that is linear in θ⃗.
**Definition 2.2.** Given d ∈ N with d ≥ 2, and Θ^n ⊆ R^d, a distribution map D : Θ^n → R^d × R is a multi-agent location-scale distribution map on n parties if there exists a static distribution D_X on R^{d+1}, θ^0 ∈ R^d, and n linear functions Λ_1, ..., Λ_n from R^d to R^d so that the distribution D(θ⃗) has the following form: the feature x ∈ R^d and the noise x_0 ∈ R are jointly sampled from D_X, and the outcome is

$$ y = \Big\langle \theta^0 - \sum_{i=1}^n \Lambda_i(\theta^i),\; x \Big\rangle + x_0. $$

In this paper, we consider the scaling maps Λ_i(θ) = λ_i θ for all θ ∈ Θ with scalars λ_i > 0 for all i ∈ [n]. We call λ := (λ_1, ..., λ_n) the influence parameters, and L_n := Σ_{i=1}^n λ_i the collective influence. Furthermore, we let A := E[xx^⊤] ∈ R^{d×d} be the covariance matrix of the feature, and b := Aθ^0 + E[x_0 x] ∈ R^d. We will specify a multi-agent location-scale distribution map with the parameters n, d, λ, A, and b.
When n = 1, our multi-agent location-scale distribution map is a special case of the location-scale family in Miller et al. (2021), where the model θ may shift both the outcome y and the feature x.
**Predictive Models and Loss Function** We consider constrained linear predictive models where f_{θ′}(x) = ⟨θ′, x⟩, and the set of parameters is the d-simplex, Θ = {θ : Σ_{k=1}^d θ_k = 1, θ_k ≥ 0}. We use the mean squared error to measure a predictive model θ′ ∈ Θ on a distribution map with a deployed model profile θ⃗ ∈ Θ^n:

$$ \ell(\vec{\theta}, \theta') = \mathbb{E}_{(x,y)\sim\mathcal{D}(\vec{\theta})}\big[(y - f_{\theta'}(x))^2\big] = \mathbb{E}_{(x,y)\sim\mathcal{D}(\vec{\theta})}\big[(y - \theta'\cdot x)^2\big]. $$
Given a deployed model profile θ⃗ and a predictive model θ′, the gradient of the decoupled loss is g(θ⃗, θ′) := ∇_{θ′} ℓ(θ⃗, θ′), and if D is a location-scale distribution map, g(θ⃗, θ′) = E_{(x,y)∼D(θ⃗)}[2(θ′·x − y)x]. Furthermore, with A and b defined in definition 2.2, the gradient can be written as

$$ g(\vec{\theta}, \theta') = \nabla_{\theta'}\,\ell(\vec{\theta}, \theta') = 2A\Big(\theta' + \sum_i \lambda_i\theta^i\Big) - 2b. \quad (1) $$
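To make eq. (1) concrete, the following minimal Python sketch evaluates this closed-form gradient; the function name and array layout are our own choices, not from the paper's code.

```python
import numpy as np

def decoupled_gradient(A, b, lambdas, thetas, theta_prime):
    """Closed-form gradient of the decoupled loss from eq. (1):
    g(theta_vec, theta') = 2A(theta' + sum_i lambda_i theta^i) - 2b.
    A: (d, d) feature covariance; b: (d,); lambdas: (n,) influence
    parameters; thetas: (n, d) deployed profile; theta_prime: (d,)."""
    aggregate = theta_prime + np.sum(lambdas[:, None] * thetas, axis=0)
    return 2 * A @ aggregate - 2 * b
```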
Additionally, given a deployed model profile θ⃗ and a predictive model profile θ⃗′, we define the gradient of agent i's decoupled loss as g^i(θ⃗, θ⃗′) := g(θ⃗, (θ′)^i), and g^i(θ⃗) := g^i(θ⃗, θ⃗) when the deployed model profile is identical to the predictive model profile. We denote the gradient of agent i's average loss as ḡ^i(θ⃗) := Σ_l θ^i_l g^i_l(θ⃗) for all θ⃗ ∈ Θ^n. Finally, we define ξ⃗(θ⃗) = (ξ^1, ..., ξ^n) with ξ^i ∈ R^d so that for all i ∈ [n] and k ∈ [d], ξ^i_k(θ⃗) := θ^i_k (Σ_l θ^i_l g^i_l − g^i_k). For brevity, we omit θ⃗ and write g⃗ := g⃗(θ⃗) and ξ⃗ := ξ⃗(θ⃗) when there is no ambiguity.
2.2 LEARNING DYNAMICS
We consider agents who myopically use a reinforcement learning algorithm, exponentiated gradient for linear regression, to improve their predictive models in rounds. We first define the original single-agent exponentiated gradient for linear regression on a sequence of data points here, and state our dynamics on multi-agent performative prediction in section 4.
**Definition 2.3** (Kivinen & Warmuth (1997)). Given a learning rate η > 0, an initial parameter θ_0 ∈ Θ, and a sequence of data points (x_t, y_t)_{t≥1}, the exponentiated gradient descent for linear regression iteratively updates the model as follows: at round t + 1, with previous parameter θ_t ∈ Θ, the algorithm updates it to θ_{t+1} so that

$$ \theta_{t+1,k} = \frac{\theta_{t,k}\,\exp(-2\eta(\theta_t\cdot x_t - y_t)\,x_{t,k})}{\sum_l \theta_{t,l}\,\exp(-2\eta(\theta_t\cdot x_t - y_t)\,x_{t,l})} \quad \text{for all } k \in [d]. $$

Note that each exponent 2(θ_t · x_t − y_t)x_{t,k} is the k-th coordinate of the gradient of the squared error ℓ((x_t, y_t), θ_t) at θ_t, which yields the name exponentiated gradient descent.
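As an illustration, here is a minimal sketch of one update of definition 2.3 in Python; the helper name is ours and the data are assumed to be given as numpy arrays.

```python
import numpy as np

def eg_step(theta, x, y, eta):
    """One exponentiated-gradient step for linear regression
    (definition 2.3): multiply each coordinate by the exponentiated
    negative gradient of the squared error and renormalize."""
    grad = 2 * (theta @ x - y) * x      # gradient of (theta . x - y)^2
    w = theta * np.exp(-eta * grad)     # multiplicative update
    return w / w.sum()                  # renormalize onto the simplex
```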
2.3 DYNAMICAL SYSTEM
**Definition 2.4.** Let (X, f) be a dynamical system where X is a metric space and f is a mapping on X. We say (x, x′) ∈ X × X is a Li-Yorke pair if

$$ \liminf_t \mathrm{dist}(f^t(x), f^t(x')) = 0 \quad \text{and} \quad \limsup_t \mathrm{dist}(f^t(x), f^t(x')) > 0. $$

(X, f) is Li-Yorke chaotic if there is an uncountable set S ⊂ X (a scrambled set) such that every pair of points x ≠ x′ ∈ S is a Li-Yorke pair.
3 MULTI-AGENT PERFORMATIVE LEARNING: STABILITY AND OPTIMALITY
Having introduced multi-agent performative prediction, we show some basic properties of our location-scale distribution map with mean squared error as a warm-up. Using the first order condition and a potential function argument, we show 1) the existence of a performative stable point in our model (proposition 3.1) and 2) that performative stability and performative optimality are equivalent (proposition 3.2). The proofs of both propositions are in appendix A.

We first show the existence and uniqueness of the performative stable point through a potential function argument. The first part is due to the first order condition of convex optimization. The second part also uses the first order condition, together with ∂Φ(θ⃗)/∂θ^i_k = λ_i g^i_k(θ⃗) for all i ∈ [n] and k ∈ [d].
**Proposition 3.1** (existence). Given the family of constrained linear predictive models Θ, mean squared error ℓ, and a multi-agent location-scale distribution map with parameters n, d, λ, A, b in definition 2.2, θ⃗* on (Θ, ℓ, D) is performative stable if and only if the gradient (defined in eq. (1)) satisfies g^i_k(θ⃗*) ≤ g^i_l(θ⃗*) for all i ∈ [n], l ∈ [d], and k ∈ [d] with (θ*)^i_k > 0.

Furthermore, if A is positive definite, there exists a unique performative stable point θ⃗*, and θ⃗* is also the global minimizer over θ⃗ ∈ Θ^n of the following convex function:

$$ \Phi(\vec{\theta}) := \Big(\sum_i \lambda_i\theta^i\Big)^{\!\top} A \Big(\sum_i \lambda_i\theta^i\Big) + \sum_i \lambda_i (\theta^i)^{\top} A\,\theta^i - 2\,b^{\top}\sum_i \lambda_i\theta^i. \quad (2) $$
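For intuition, the potential function of eq. (2) can be evaluated directly; the sketch below (with hypothetical names, reusing the array layout from the earlier sketches) is one way to do so.

```python
import numpy as np

def potential(A, b, lambdas, thetas):
    """The convex potential Phi of eq. (2):
    (sum_i lam_i th^i)^T A (sum_i lam_i th^i)
    + sum_i lam_i (th^i)^T A th^i - 2 b^T sum_i lam_i th^i."""
    s = np.sum(lambdas[:, None] * thetas, axis=0)   # sum_i lambda_i theta^i
    return s @ A @ s + sum(l * (th @ A @ th)
                           for l, th in zip(lambdas, thetas)) - 2 * b @ s
```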
While a performative stable model profile does not necessarily minimize the loss, the following proposition shows that optimality and stability are equivalent in our setting.

**Proposition 3.2.** Given (Θ, ℓ, D) defined in proposition 3.1, θ⃗* is performative stable if and only if θ⃗* is performatively optimal as defined in definition 2.1.
4 MULTI-AGENT PERFORMATIVE LEARNING: CONVERGENCE AND CHAOS
In section 3, we studied the stability and optimality of the location-scale distribution map with mean squared error. Now we ask: when each agent myopically improves his predictive model through reinforcement learning while their predictions are performative, what is the long-term behavior of the system? Does the system converge to the performative stable and optimal point, or does it behave chaotically? We provide a threshold result depending on the learning rates and the collective influence L_n. Section 4.1 shows the dynamics converge to the performative stable and optimal point when the learning rates are small enough. On the other hand, section 4.2 shows the dynamics can exhibit chaotic behavior if the collective influence L_n is large enough.
Now, we define our dynamics (θ⃗_t)_{t≥0} formally. At each round, each agent accesses the data distribution influenced by their previous models, and updates his model through exponentiated gradient descent. Specifically, given an initial model profile θ⃗_0 ∈ Θ^n and a learning rate profile η := (η_1, ..., η_n), each agent i applies the exponentiated gradient descent with initial parameter θ^i_0 and learning rate η_i > 0: at round t + 1 > 1, each agent i uses D(θ⃗_t) to estimate the gradient of the (expected) loss, g^i(θ⃗_t) defined in eq. (1), and updates his model according to definition 2.3,

$$ \theta^i_{t+1,k} = \frac{\theta^i_{t,k}\,\exp(-\eta_i g^i_k(\vec{\theta}_t))}{\sum_{l=1}^d \theta^i_{t,l}\,\exp(-\eta_i g^i_l(\vec{\theta}_t))} \quad \text{for all } t \ge 0,\ i \in [n],\ k \in [d]. \quad (3) $$
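A minimal simulation of eq. (3) might look as follows, reusing the hypothetical decoupled_gradient helper sketched after eq. (1); this is an illustrative sketch of the update rule, not the authors' code.

```python
import numpy as np

def run_dynamics(A, b, lambdas, thetas0, etas, T):
    """Iterate the multi-agent exponentiated-gradient dynamics of
    eq. (3) for T rounds. thetas0: (n, d), each row on the simplex."""
    thetas = thetas0.copy()
    for _ in range(T):
        # g^i(theta_vec_t): each agent's gradient at the deployed profile.
        grads = np.stack([decoupled_gradient(A, b, lambdas, thetas, th)
                          for th in thetas])
        w = thetas * np.exp(-etas[:, None] * grads)
        thetas = w / w.sum(axis=1, keepdims=True)
    return thetas
```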
We will use superscripts to denote agents, i ∈ [n], and subscripts for time, t ≥ 0, and feature indices, k, l ∈ [d]. Recall that the gradient of agent i's average loss is ḡ^i(θ⃗) := Σ_l θ^i_l g^i_l(θ⃗), so eq. (3) can be rewritten as

$$ \frac{\theta^i_k\,\exp(-\eta_i g^i_k(\vec{\theta}_t))}{\sum_l \theta^i_l\,\exp(-\eta_i g^i_l(\vec{\theta}_t))} = \frac{\theta^i_k\,\exp(\eta_i(\bar{g}^i(\vec{\theta}_t) - g^i_k(\vec{\theta}_t)))}{\sum_l \theta^i_l\,\exp(\eta_i(\bar{g}^i(\vec{\theta}_t) - g^i_l(\vec{\theta}_t)))}. $$

Finally, given θ⃗* and i ∈ [n], we define the support of θ⃗* as S_i := {k : (θ*)^i_k > 0} ⊆ [d], and S̄_i := [d] \ S_i. Then θ⃗* is an equilibrium (or fixed point) of eq. (3) if

$$ g^i_k(\vec{\theta}^*) = \bar{g}^i(\vec{\theta}^*), \quad \text{for all } i \in [n],\ k \in S_i. \quad (4) $$

We say a fixed point of eq. (3) is isolated if there exists an open set around it containing no other fixed point. Additionally, the performative stability condition in proposition 3.1 is equivalent to

$$ g^i_k(\vec{\theta}^*) = \bar{g}^i(\vec{\theta}^*) \ \text{ and } \ g^i_l(\vec{\theta}^*) \ge \bar{g}^i(\vec{\theta}^*), \quad \text{for all } i \in [n],\ k \in S_i,\ l \in \bar{S}_i. \quad (5) $$

Therefore, the set of fixed points of eq. (3) contains the performative stable point.
4.1 CONVERGING WITH SMALL LEARNING RATE
In this section, we show that when the learning rate of each agent is small enough, the dynamics in eq. (3) converge to the performative stable point. Specifically, if the parameters of the multi-agent performative learning (n, d, A, λ) are fixed, the dynamics in eq. (3) converge to the performative stable point when η is "small" enough.
By eq. (5), θ⃗* is performative stable if the gradients outside the support are no less than the average gradient. We call a performative stable point θ⃗* proper if the gradient of every non-support coordinate is strictly greater than the average gradient: for all i ∈ [n] and l ∈ S̄_i, g^i_l(θ⃗*) > ḡ^i(θ⃗*). The theorem below shows that if the performative stable point is proper and the equilibria satisfying eq. (4) are isolated, eq. (3) converges when max_i η_i is small enough and the ratio η_i/η_j is bounded for all i and j.
**Theorem 4.1** (Convergence). Consider a constant R_η > 0, and a multi-agent performative learning setting with parameters n, d, A, b, λ such that A is positive definite, the performative stable point θ⃗* is proper, and the equilibria (defined in eq. (4)) are isolated. There exists η* > 0 so that the dynamic in eq. (3) with learning rate profile η converges to the performative stable point, lim_{t→∞} θ⃗_t = θ⃗*, if the initial state is an interior point and η satisfies max_i η_i ≤ η* and max_i η_i / min_i η_i ≤ R_η.
Informally, when the learning rates η are small, θ⃗_t in eq. (3) can be approximated by a solution of an ordinary differential equation, ϑ⃗(t∥η∥_1), where the initial condition is ϑ⃗(0) = θ⃗_0 and

$$ \frac{d}{dt}\vartheta^i_k(t) = \frac{\eta_i}{\|\eta\|_1}\,\vartheta^i_k(t)\Big(\bar{g}^i(\vec{\vartheta}(t)) - g^i_k(\vec{\vartheta}(t))\Big) \quad (6) $$

for all t ≥ 0, i ∈ [n], and k ∈ [d]. We formalize this intuition in the proof of lemma 4.4. Note that the set of fixed points of eq. (6) is identical to that of eq. (3) and satisfies eq. (4).
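For readers who want to compare eq. (3) against its continuous-time limit, a simple forward-Euler integration of eq. (6) is sketched below, under the same assumed helpers as before; note that the replicator-style vector field is tangent to the simplex, so each row stays on it up to numerical error.

```python
import numpy as np

def integrate_ode(A, b, lambdas, thetas0, etas, T, dt=1e-3):
    """Forward-Euler integration of the ODE in eq. (6):
    d/dt theta^i_k = (eta_i/||eta||_1) theta^i_k (gbar^i - g^i_k)."""
    thetas = thetas0.copy()
    weights = etas / etas.sum()                  # eta_i / ||eta||_1
    for _ in range(int(T / dt)):
        grads = np.stack([decoupled_gradient(A, b, lambdas, thetas, th)
                          for th in thetas])
        gbar = np.sum(thetas * grads, axis=1, keepdims=True)  # gbar^i
        thetas = thetas + dt * weights[:, None] * thetas * (gbar - grads)
    return thetas
```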
The proof of theorem 4.1 has two parts. First we show that the continuous approximation eq. (6) converges to the performative stable point. Then we study the relationship between eq. (3) and eq. (6), and prove that eq. (3) also converges to a performative stable point.

**From Potential Function to Convergence of Equation (6)** In this section, theorem 4.2 shows the dynamics of eq. (6) converge to the performative stable point, which will be useful to show the convergence of eq. (3).
**Theorem 4.2.** If all points satisfying eq. (4) are isolated, A is positive definite, and ϑ⃗(0) is in the interior of Θ^n, then the limit lim_{t→∞} ϑ⃗(t) = θ⃗* is the performative stable point.

Theorem 4.2 can be seen as the continuous version of theorem 4.1. Compared to theorem 4.1, theorem 4.2 does not require that the performative stable point be proper, but it also implicitly requires that the ratio of learning rates between any two agents is bounded.
To prove theorem 4.2, we prove lemma 4.3, which shows that Φ in eq. (2) is a potential function for eq. (6): its time derivative is negative at all non-fixed points. Then, in the proof of theorem 4.2, we further prove that the dynamics in eq. (6) converge to performative stable points when the initial condition ϑ⃗(0) is an interior point. The proof is similar to a proof in Kleinberg et al. (2009), and is presented in appendix B.
**Lemma 4.3.** Given a solution of eq. (6), the time derivative of Φ in eq. (2) is 0 at fixed points of eq. (6), and negative at all other points. Furthermore,

$$ \frac{d}{dt}\Phi(\vec{\vartheta}(t)) \;\le\; \frac{-1}{2\sum_i\eta_i\sum_i\lambda_i\eta_i}\Bigg(\sum_{i,k}\lambda_i\eta_i\,\vartheta^i_k\Big|\sum_l \vartheta^i_l g^i_l - g^i_k\Big|\Bigg)^{\!2} \;\le\; \frac{-\min_i\lambda_i^2\eta_i^2}{2\sum_i\eta_i\sum_i\lambda_i\eta_i}\,\big\|\vec{\xi}(\vec{\vartheta})\big\|_1^2. $$
**From Approximation to Convergence of Equation (3)** Given the convergence of eq. (6), we show the dynamics of eq. (3) also converge to a performative stable point. The argument has two stages. We first show that, given any neighborhood D of the performative stable point, eq. (3) hits D and stays in D after a finite number of steps (lemma 4.4). Then lemma 4.5 shows that eq. (3) converges to a performative stable point, which completes the proof of theorem 4.1.
**Lemma 4.4.** Given any open set D ⊆ Θ^n that contains the performative stable point, there exists η* small enough so that for all θ⃗_0 ∈ Θ^n, there exists τ so that θ⃗_t ∈ D for all t ≥ τ.
The proof is based on standard approximation results for numerical solutions of ordinary differential equations, and is presented in appendix C.
**Lemma 4.5.** Let θ⃗* be the performative stable point and D a small enough neighborhood of θ⃗*. Besides the conditions in theorem 4.1, if the initial condition of eq. (3), θ⃗_0, is in D and θ⃗_t ∈ D for all t ≥ 0, then the limit of eq. (3) is the performative stable point, lim_{t→∞} θ⃗_t = θ⃗*.
The proof uses the fact that the right-hand side of eq. (3) decreases as the dynamic converges to the
fixed point. Therefore, even though the learning rate profile is fixed, the error between eq. (3) and
eq. (6) vanishes as they converge to the fixed point. We present the proof in appendix D.
4.2 CHAOS WITH CONSTANT LEARNING RATE
Now we ask the converse of the question in section 4.1. Do the dynamics in eq. (3) still converge, if the learning rate is fixed, as the number of agents increases? Alternatively, do the dynamics converge when the agents have an overwhelming influence on the data distribution?

Theorem 4.6 shows that the dynamics are Li-Yorke chaotic when L_n = Σ_i λ_i is large, even with d = 2. Note that a large L_n may be due to a fixed number of agents with overwhelming influence, or to a large number of agents. The latter case implies that for any small η, no matter how cautious the agents are, chaos inevitably occurs as the number of agents increases.
**Theorem 4.6.** Given a multi-agent location-scale distribution map with A, b, d = 2, and a common learning rate η > 0, if all n agents use the exponentiated gradient in eq. (3) with the common learning rate η and A is diagonally dominant, there exists a carrying capacity L* such that if L_n = Σ_{i=1}^n λ_i ≥ L*, the dynamics (θ⃗_t)_{t∈N} are Li-Yorke chaotic and thus have periodic orbits of all possible periods.
To prove theorem 4.6, we show there exists an uncountable scrambled set as defined in definition 2.4. First we consider all agents starting from the same initial model, and show the dynamics of eq. (3) can be captured by the following function,

$$ f_{u,v}(x) = \frac{x}{x + (1-x)\exp(u(x - v))} \quad \forall x \in [0,1] \quad (7) $$

with proper choices of u ∈ R and v ∈ R. Note that v is an equilibrium, and u controls the steepness.
First, when all agents start from the same initial model, θ^i_0 = θ_0 for all i ∈ [n], and use the same learning rate η, their models are identical for all t ≥ 0: for all t there exists θ_t so that θ^i_t = θ_t for all i. Additionally, since d = 2, we can use a single parameter to encode a constrained linear predictive model. Given θ ∈ Θ = ∆_2, we define p(θ) = θ_1 ∈ [0, 1] and omit the input θ when it is clear. Thus, we can rewrite the dynamics as p_t = p(θ_t), which are one-dimensional dynamics on [0, 1], with

$$ p_{t+1} = \frac{p_t}{p_t + (1-p_t)\,e^{\eta(g^i_1(\vec{\theta}_t) - g^i_2(\vec{\theta}_t))}}. $$

We set

$$ \alpha(L) = 2\eta(1+L)(A_{1,1} - A_{1,2} - A_{2,1} + A_{2,2}) \ \text{ and } \ \beta(L) = \frac{(1+L)(A_{2,2} - A_{2,1}) + (b_1 - b_2)}{(1+L)(A_{1,1} - A_{1,2} - A_{2,1} + A_{2,2})}, $$
and write α_n = α(L_n) and β_n = β(L_n). By direct computation, the dynamic of p_t is

$$ p_{t+1} = f_{\alpha_n,\beta_n}(p_t) \quad (8) $$

where α_n encodes the update pace and β_n is an equilibrium of the dynamic. We further set β_∞ := (A_{2,2} − A_{2,1}) / (A_{1,1} − A_{1,2} − A_{2,1} + A_{2,2}) and δ(L) := β(L) − β_∞, which converges to zero as L → ∞. If A is diagonally dominant, A_{1,1} − A_{1,2} and A_{2,2} − A_{2,1} are positive, and we can permute the coordinates so that β_∞ ∈ (0, 1/2]. Additionally, α(L) is positive and increasing for all L ≥ 0. With Li et al. (1982), the lemma below implies the existence of period-three points; the proof is in appendix E.

**Lemma 4.7.** If β_∞ ∈ (0, 1/2), there exists a constant L* > 0 so that if L ≥ L*, there exists a trajectory x_0, x_1, x_2, and x_3 with f_{α(L),β(L)}(x_i) = x_{i+1} such that x_3 < x_0 < x_1.
*Proof of theorem 4.6.* First, by lemma 4.7, for all L ≥ L*, there exists a trajectory x_0, x_1, x_2, x_3 so that x_3 < x_0 < x_1. The existence of such a trajectory implies the existence of period-three points by Li et al. (1982). Finally, by the seminal work of Li & Yorke (1975), this implies that the map is Li-Yorke chaotic.
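The witness trajectory in lemma 4.7 can also be located numerically. The sketch below iterates the map f_{u,v} of eq. (7) and scans a grid of initial points for the pattern x_3 < x_0 < x_1; it is a heuristic check with names of our choosing, not part of the formal proof.

```python
import numpy as np

def f(x, u, v):
    """The one-dimensional map f_{u,v} from eq. (7)."""
    return x / (x + (1 - x) * np.exp(u * (x - v)))

def find_period3_witness(u, v, grid=10000):
    """Scan [0, 1] for a point x0 with f^3(x0) < x0 < f(x0), the
    trajectory pattern that lemma 4.7 uses (via Li et al. (1982))
    to establish period-three points and hence Li-Yorke chaos."""
    for x0 in np.linspace(1e-4, 1 - 1e-4, grid):
        x1 = f(x0, u, v)
        x3 = f(f(x1, u, v), u, v)
        if x3 < x0 < x1:
            return x0
    return None
```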
5 SIMULATION
[Figure 1 panels: (a) (14, 0.001) without noise; (b) (1.4, 0.05) without noise; (c) (14, 0.05) without noise; (d) (14, 0.001) with noise; (e) (1.4, 0.05) with noise; (f) (14, 0.05) with noise.]

Figure 1: The temporal behavior of p_t starting at p_0 = 0.2 for 100 rounds under various L and η, and with noisy estimation of the gradient of the mean squared error from m samples. The top row (figs. 1a to 1c) presents different combinations of (L, η), and the bottom row (figs. 1d to 1f) considers the gradient estimated from m = 10 and m = 100 samples.
Now we simulate the one-dimensional dynamics in eq. (8) of one agent with different learning rates η and collective influences L.

We first define a location-scale distribution map. Let the features x_1, x_2 and the noise x_0 be mutually independent zero-mean Gaussians with variances E[x_1²] = 3, E[x_2²] = 7, and E[x_0²] = 1. Finally, θ^0 = 0. Therefore,

$$ A = \mathbb{E}[xx^\top] = \begin{pmatrix} 3 & 0 \\ 0 & 7 \end{pmatrix}, \quad \text{and} \quad b = A\theta^0 + \mathbb{E}[x_0x] = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. $$
Given learning rate η > 0, and collective influence L, we have α(L) = 20η(1 + L), and the
performative stable point β(L) = 0.7.
The top row of fig. 1 shows the temporal behavior of p_t under different (L, η). The bottom row of fig. 1 demonstrates that our (convergence and chaos) results are robust even when the value of the gradient is noisy. Specifically, at each round t, instead of g^i_1(θ⃗_t) − g^i_2(θ⃗_t) = 2(1 + L)(1, −1)Aθ_t − 2(1, −1)b, the agent replaces A and b with empirical moments computed from m samples. This process can be seen as a stochastic exponentiated gradient descent which uses a stochastic estimate of the gradient.
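To illustrate the stochastic variant, here is a minimal sketch of one noisy update of the one-dimensional dynamics, using the simulation's parameters (A = diag(3, 7), b = 0, θ^0 = 0); the sampling code is our reconstruction, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_step(p, L, eta, m):
    """One stochastic exponentiated-gradient step of eq. (8): A and b
    are replaced by empirical moments estimated from m fresh samples."""
    x = rng.normal(0.0, np.sqrt([3.0, 7.0]), size=(m, 2))  # features
    x0 = rng.normal(0.0, 1.0, size=m)                      # outcome noise
    A_hat = x.T @ x / m                                    # estimate of A
    b_hat = (x0[:, None] * x).mean(axis=0)                 # estimate of b
    theta = np.array([p, 1.0 - p])
    diff = (2 * (1 + L) * np.array([1.0, -1.0]) @ A_hat @ theta
            - 2 * np.array([1.0, -1.0]) @ b_hat)           # g1 - g2 estimate
    return p / (p + (1 - p) * np.exp(eta * diff))
```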
We can see that chaotic behavior occurs in fig. 1c with L = 14 and η = 0.05, and that such behavior persists in fig. 1f when the evaluations of the gradient are noisy. On the other hand, when the learning rate is small enough (figs. 1a and 1b), the dynamics converge to the performative stable point 0.7. Furthermore, in figs. 1d and 1e the dynamics converge even with noisy estimates of the gradient.
Finally, fig. 2 shows the temporal behavior of the loss. Under the same setting as fig. 1c, fig. 2b shows that not only the temporal behavior of p_t but also the mean squared loss is chaotic. On the other hand, under the same setting as fig. 1a, fig. 2a shows convergent behavior of the loss.
[Figure 2 panels: (a) (14, 0.001) without noise; (b) (14, 0.05) without noise.]

Figure 2: The temporal behavior of the total cost for 100 rounds under various η.
6 CONCLUSION
We introduce a framework of multi-agent performative prediction and investigate whether classical reinforcement learning algorithms converge or behave chaotically depending on the collective influence of the agents' models and their learning rates. We view our example as only scratching the surface of work on multi-agent performative prediction.
Our framework leads to several new theoretical problems. In particular, it would be interesting to understand whether this threshold is generic and whether our results still hold for other reinforcement learning algorithms or more general multi-agent distribution maps. Our framework also provides a new viewpoint on several applications. One natural application is strategic classification (Hardt et al., 2016). In this context, our framework can be seen as modeling strategic learners in a strategic classification problem where both data and classifiers are strategic. However, in conventional strategic classification the features also respond to the deployed models, which is not captured by our multi-agent location-scale distribution map. It would be interesting to investigate our dynamics in the context of strategic prediction. Another application is pricing strategy/prediction, where multiple companies
predict the demand function and set their prices.
Both our digital and real-world environments are increasingly under the influence of ever more powerful and numerous ML systems. As these algorithms trigger actions that change the state of
the system that produces their joint input data (e.g., AI-triggered ads shown to users affecting their
behavior, or automated trading systems affecting stock prices, etc), they are effectively forming
closed loop systems and can no longer be understood fully in isolation. As we show in this paper,
even fairly simple and innocuous such settings can exhibit phase transitions going from optimal
behavior to chaos. We believe such phenomena are worthy of a careful investigation and we hope
that this paper sets out some useful building blocks by bringing together optimization/performance
analysis with Lyapunov theory and chaos theory.
ETHICS STATEMENT
Our work showcases the possibility of competing prediction algorithms leading to instability and
chaos. Extra steps should be taken, whenever possible, so that related systems operate within the
range of parameters leading to stability. We hope that our work helps both by identifying a potential
threat for ML systems as well as providing some guidance on how to counter or minimize it.
REPRODUCIBILITY
All of our theoretical results are stated fully, including previous definitions and results on which they
are based. They are accompanied by proofs, either in the main paper or in the appendix. The settings of our simulations are stated fully in the main paper.
REFERENCES
Gabriel P Andrade, Rafael Frongillo, and Georgios Piliouras. Learning in matrix games can be
arbitrarily complex. In Conference on Learning Theory (COLT), 2021.
James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM
_Confernce on Economics and Computation, pp. 321–338, 2018._
David Balduzzi, Wojciech M Czarnecki, Thomas W Anthony, Ian M Gemp, Edward Hughes, Joel Z
Leibo, Georgios Piliouras, and Thore Graepel. Smooth markets: A basic mechanism for organizing gradient-based learners. 2020.
Peter L Bartlett. Learning with a slowly changing distribution. In Proceedings of the fifth annual
_workshop on Computational learning theory, pp. 243–252, 1992._
Jakub Bielawski, Thiparat Chotibut, Fryderyk Falniowski, Grzegorz Kosiorowski, Michał Misiurewicz, and Georgios Piliouras. Follow-the-regularized-leader routes to chaos in routing games.
_ICML, 2021._
Yun Kuen Cheung. Multiplicative weights updates with constant step-size in graphical constant-sum
games. In NeurIPS 2018, pp. 3532–3542, 2018.
Yun Kuen Cheung and Georgios Piliouras. Chaos, extremism and optimism: Volume analysis of
learning in games. In NeurIPS 2020, 2020.
Yun Kuen Cheung and Yixin Tao. Chaos of learning beyond zero-sum and coordination via game
decompositions. In International Conference on Learning Representations, 2020.
Yun Kuen Cheung, Stefanos Leonardos, and Georgios Piliouras. Learning in markets: Greed leads
to chaos but following the price is right. In Zhi-Hua Zhou (ed.), Proceedings of the Thirtieth
_International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal,_
_Canada, 19-27 August 2021, pp. 111–117. ijcai.org, 2021. doi: 10.24963/ijcai.2021/16._
Thiparat Chotibut, Fryderyk Falniowski, Michał Misiurewicz, and Georgios Piliouras. The route to
chaos in routing games: When is price of anarchy too optimistic? NeurIPS, 2020.
Johanne Cohen, Amélie Héliou, and Panayotis Mertikopoulos. Learning with bandit feedback in
potential games. In Proceedings of the 31st International Conference on Neural Information
_Processing Systems, pp. 6372–6381, 2017._
Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions,
2020.
Lampros Flokas, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Thanasis Lianeas, Panayotis Mertikopoulos, and Georgios Piliouras. No-regret learning and mixed nash equilibria: They do not
mix. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
Angeliki Giannou, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Panayotis Mertikopoulos.
Survival of the strictest: Stable and unstable equilibria under regularized learning with partial
information. COLT, 2021.
Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science,
pp. 111–122, 2016.
Zachary Izzo, Lexing Ying, and James Zou. How to learn when data reacts to your model: performative gradient descent. arXiv preprint arXiv:2102.07698, 2021.
Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear
predictors. information and computation, 132(1):1–63, 1997.
Robert Kleinberg, Georgios Piliouras, and Éva Tardos. Multiplicative updates outperform generic
no-regret learning in congestion games. In Proceedings of the forty-first annual ACM symposium
_on Theory of computing, pp. 533–542, 2009._
Anthony Kuh, Thomas Petsche, and Ronald L Rivest. Learning time-varying concepts. In NIPS, pp.
183–189, 1990.
Stefanos Leonardos, Barnabé Monnot, Daniël Reijsbergen, Stratis Skoulakis, and Georgios Piliouras. Dynamical analysis of the eip-1559 ethereum fee market, 2021.
Alistair Letcher. On the impossibility of global convergence in multi-loss optimization. In Interna_tional Conference on Learning Representations, 2021._
Tien-Yien Li and James A. Yorke. Period three implies chaos. _The American Mathematical_
_[Monthly, 82(10):985–992, 1975. ISSN 00029890, 19300972. URL http://www.jstor.](http://www.jstor.org/stable/2318254)_
[org/stable/2318254.](http://www.jstor.org/stable/2318254)
Tien-Yien Li, Michał Misiurewicz, Giulio Pianigiani, and James A. Yorke. Odd chaos.
_Physics Letters A, 87(6):271–273, 1982._ ISSN 0375-9601. doi: https://doi.org/10.
1016/0375-9601(82)90692-2. [URL https://www.sciencedirect.com/science/](https://www.sciencedirect.com/science/article/pii/0375960182906922)
[article/pii/0375960182906922.](https://www.sciencedirect.com/science/article/pii/0375960182906922)
Celestine Mendler-Dünner, Juan C Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. arXiv preprint arXiv:2006.06887, 2020.
John Miller, Juan C Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. arXiv preprint arXiv:2102.08570, 2021.
Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. CoRR,
[abs/1703.01138, 2017. URL http://arxiv.org/abs/1703.01138.](http://arxiv.org/abs/1703.01138)
Ioannis Panageas, Georgios Piliouras, and Xiao Wang. Multiplicative weights updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always. In International Conference on Machine Learning, pp. 4961–4969. PMLR, 2019.
Juan Perdomo, Tijana Zrnic, Celestine Mendler-D¨unner, and Moritz Hardt. Performative prediction.
In International Conference on Machine Learning, pp. 7599–7609. PMLR, 2020.
Nelson Vadori, Rahul Savani, Thomas Spooner, and Sumitra Ganesh. Consensus multiplicative weights update: Learning to learn using projector-based game signatures. arXiv preprint
_arXiv:2106.02615, 2021._
A PROOFS AND DETAILS IN SECTION 3
*Proof of proposition 3.1.* First, under the mean squared loss function, each agent i's decoupled performative loss ℓ(θ⃗′, θ^i) is convex in θ^i. Thus, for linear predictive models, we can apply the KKT conditions to definition 2.1, so that a collection of predictive models θ⃗ ∈ Θ^n is performatively stable if and only if for all i ∈ [n] and k ∈ [d] with θ^i_k > 0, ∂ℓ(θ⃗′, θ^i)/∂θ^i_k ≤ ∂ℓ(θ⃗′, θ^i)/∂θ^i_l for all l ∈ [d], where θ⃗′ = θ⃗. Therefore, with eq. (1), θ⃗* is performative stable if

$$ g^i_k(\vec{\theta}^*) \le g^i_l(\vec{\theta}^*) \quad (9) $$

holds for all i ∈ [n], l ∈ [d], and k ∈ [d] with θ^i_k > 0.
With eq. (9), we now show that there exists a unique performative stable point θ⃗* by proving that 1) Φ in eq. (2) is strictly convex, and 2) θ⃗* is a minimizer of Φ. First, to show Φ is strictly convex, it is sufficient to show the Hessian of Φ is positive definite. Because Φ is a quadratic function on Θ^n, the Hessian of Φ is a constant matrix in R^{nd×nd}. By the partial derivatives of eq. (2), for all i, j ∈ [n] and k, l ∈ [d], we have (∇²Φ)_{ik,jl} = ∂²Φ/∂θ^i_k∂θ^j_l = 2λ_iλ_j A_{k,l} if i ≠ j, and (∇²Φ)_{ik,il} = 2λ_i(λ_i + 1)A_{k,l}. Let L ∈ R^{n×n} with L_{ij} = 2λ_iλ_j if i ≠ j and L_{ii} = 2λ_i(1 + λ_i), which is positive definite because λ_i > 0 for all i ∈ [n]. Then the Hessian of Φ is the Kronecker product of L and A,

$$ \nabla^2\Phi = \begin{pmatrix} 2\lambda_1(1+\lambda_1)A & \dots & 2\lambda_1\lambda_n A \\ \vdots & \ddots & \vdots \\ 2\lambda_n\lambda_1 A & \dots & 2\lambda_n(1+\lambda_n)A \end{pmatrix} = L \otimes A. $$

Because L and A are both positive definite, the Hessian is also positive definite. Therefore, Φ is strictly convex, and there exists a unique minimum of Φ in the compact set Θ^n. Now we show that θ⃗* is performative stable if and only if θ⃗* is a minimizer of Φ. By the first order condition and the partial derivatives of eq. (2), Φ attains its minimum at θ⃗* if and only if g^i_k(θ⃗*) ≤ g^i_l(θ⃗*) for all i ∈ [n] and k, l ∈ [d] with (θ*)^i_k > 0. Therefore, θ⃗* is the minimum of Φ if and only if θ⃗* is a performative stable point.
*Proof of proposition 3.2.* Given a profile of models θ⃗, the total cost is

$$ \sum_i \ell(\vec{\theta}, \theta^i) = \sum_i \mathbb{E}_{(x,y)\sim\mathcal{D}(\vec{\theta})}\big[(y - \theta^i\cdot x)^2\big] = \sum_i \mathbb{E}\Bigg[\bigg(\Big\langle x,\ \theta^0 - \sum_j \lambda_j\theta^j - \theta^i\Big\rangle + x_0\bigg)^{\!2}\,\Bigg] $$

which is a convex function of θ⃗ ∈ Θ^n. Additionally,

$$ \frac{\partial}{\partial\theta^i_k}\sum_{\imath} \ell(\vec{\theta}, \theta^{\imath}) = (1 + \lambda_i)\,g^i_k(\vec{\theta}, \vec{\theta}) + \sum_{\imath\in[n]:\imath\ne i} \lambda_i\,g^i_k(\vec{\theta}, \vec{\theta}) = (1 + n\lambda_i)\,g^i_k(\vec{\theta}, \vec{\theta}) $$

which is the gradient of the decoupled performative loss scaled by (1 + nλ_i). Thus, we can apply the KKT conditions, and the minimum is attained if and only if eq. (9) holds.
B PROOF AND DETAILS FOR THEOREM 4.2
*Proof of lemma 4.3.* By the chain rule,

$$ \frac{d}{dt}\Phi(\vec{\vartheta}(t)) = \sum_{i\in[n],k\in[d]} \frac{\partial\Phi}{\partial\theta^i_k}(\vec{\vartheta}(t)) \cdot \frac{d}{dt}\vartheta^i_k(t), \quad (10) $$

so the time derivative of the potential function is zero if the dynamics are at a fixed point of eq. (6).
Now we compute a closed form of the time derivative and show the derivative is zero only if the dynamics are at a fixed point of eq. (6). Here we omit the input t to simplify the notation.

$$
\begin{aligned}
\frac{d}{dt}\Phi(\vec{\vartheta}(t))
&= \sum_{i,k} \lambda_i g^i_k \cdot \frac{\eta_i}{\|\eta\|_1}\,\vartheta^i_k\Big(\sum_l \vartheta^i_l g^i_l - g^i_k\Big) && \text{(by eqs. (6) and (10), and the partial derivatives of eq. (2))} \\
&= \frac{-1}{2\|\eta\|_1}\sum_i \lambda_i\eta_i \sum_{k,l\in[d]} \vartheta^i_k\vartheta^i_l\big(2(g^i_k)^2 - 2g^i_kg^i_l\big) && \text{(because } \textstyle\sum_l \vartheta^i_l = 1\text{)} \\
&= \frac{-1}{2\|\eta\|_1}\sum_i \lambda_i\eta_i \sum_{k,l\in[d]} \vartheta^i_k\vartheta^i_l\,(g^i_k - g^i_l)^2.
\end{aligned}
$$

Now we bound the time derivative. Because $\sum_k \vartheta^i_k = \sum_l \vartheta^i_l = 1$, we have $\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l = \sum_i\lambda_i\eta_i$, and

$$
\begin{aligned}
\Big(\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l\Big)\Big(\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l(g^i_k - g^i_l)^2\Big)
&\ge \Big(\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l\,|g^i_k - g^i_l|\Big)^2 && \text{(by the Cauchy-Schwarz inequality)} \\
&\ge \Bigg(\sum_{i,k}\lambda_i\eta_i\,\vartheta^i_k\Big|\sum_l \vartheta^i_l(g^i_l - g^i_k)\Big|\Bigg)^2 && \text{(by the triangle inequality),}
\end{aligned}
$$

so that

$$ \frac{d}{dt}\Phi(\vec{\vartheta}(t)) \le \frac{-1}{2\|\eta\|_1\sum_i\lambda_i\eta_i}\Bigg(\sum_{i,k}\lambda_i\eta_i\,\vartheta^i_k\Big|\sum_l \vartheta^i_l g^i_l - g^i_k\Big|\Bigg)^2. $$

Therefore, dΦ(ϑ⃗(t))/dt = 0 only if ϑ^i_k (g^i_k − Σ_{l∈[d]} ϑ^i_l g^i_l) = 0 for all i ∈ [n] and k ∈ [d], which is a fixed point of eq. (6).

We can further simplify the bound: because Σ_{i,k} λ_iη_i ϑ^i_k |Σ_l ϑ^i_l g^i_l − g^i_k| ≥ (min_i λ_iη_i) Σ_{i,k} |ξ^i_k(ϑ⃗)| = (min_i λ_iη_i) ∥ξ⃗(ϑ⃗)∥_1, we get dΦ(ϑ⃗(t))/dt ≤ −min_i λ_i²η_i² / (2∥η∥_1 Σ_i λ_iη_i) · ∥ξ⃗(ϑ⃗)∥_1².
*Proof of theorem 4.2.* When the fixed points satisfying eq. (4) are isolated, by lemma 4.3, the dynamics in eq. (6) converge to a fixed point of eq. (6), θ⃗* ∈ Θ^n. Because θ⃗* is a fixed point, for all i ∈ [n], k ∈ [d],

$$ (\theta^*)^i_k\big(\bar{g}^i(\vec{\theta}^*) - g^i_k(\vec{\theta}^*)\big) = (\theta^*)^i_k\Big(\sum_l (\theta^*)^i_l\,g^i_l(\vec{\theta}^*) - g^i_k(\vec{\theta}^*)\Big) = 0. $$

By proposition 3.1, the fixed point condition implies the coordinates of the gradient of the loss in the support are identical, g^i_k(θ⃗*) = ḡ^i(θ⃗*) for all i and k ∈ S_i. Thus, if the fixed point θ⃗* is not performative stable, there exist ı and κ with (θ*)^ı_κ = 0 so that g^ı_κ(θ⃗*) < ḡ^ı(θ⃗*). Furthermore, exp(−η_ı g^ı_κ(θ⃗*)) > Σ_l (θ*)^ı_l exp(−η_ı g^ı_l(θ⃗*)). We can pick a small enough ϵ > 0 and define

$$ U_\epsilon := \Big\{\vec{\theta} : \exp(-\eta_\imath g^\imath_\kappa) > \sum_l \theta^\imath_l \exp(-\eta_\imath g^\imath_l) + \epsilon\Big\} \quad (11) $$

which contains θ⃗* and is an open set because g⃗ and the exponential function are continuous. Since ϑ⃗(t) converges to θ⃗* as t → ∞, there exists a time t_ϵ so that for all t ≥ t_ϵ, ϑ⃗(t) ∈ U_ϵ. However, if ϑ⃗(t) ∈ U_ϵ and ϑ⃗(t) is an interior point, we get (d/dt)ϑ^ı_κ(t) > 0 by eqs. (6) and (11). Therefore, ϑ^ı_κ(t) is positive and increasing for t ≥ t_ϵ. We have reached a contradiction because ϑ^ı_κ(t) → (θ*)^ı_κ = 0. Therefore, θ⃗* is a performative stable point.
C PROOF AND DETAILS FOR LEMMA 4.4
To prove lemma 4.4, we show that the dynamic eq. (3) can be approximated by eq. (6) and that the error vanishes as η* decreases. Thus, we can show the dynamic hits an arbitrary neighborhood of the performative stable point. We further use Φ to show the dynamic stays in the neighborhood. Below we state two ancillary claims to control the error of our approximation.
We define an error vector e⃗(θ⃗) ∈ R^{n×d} between eqs. (3) and (6) so that

$$ \eta_i^2\,e^i_k(\vec{\theta}) := \frac{\theta^i_k\,\exp(\eta_i(\bar{g}^i - g^i_k))}{\sum_l \theta^i_l\,\exp(\eta_i(\bar{g}^i - g^i_l))} - \theta^i_k - \eta_i\xi^i_k. $$
We omit the input θ⃗ of e⃗, and write e⃗_t := e⃗(θ⃗_t), when there is no ambiguity. We show that if the error vector is small, Φ is decreasing along the dynamics in eq. (3).
**Claim C.1** (Approximated potential). There exists C_4 so that for all t and θ⃗_0 ∈ Θ^n, the difference of the potential in eq. (2) along eq. (3) satisfies

$$ \Phi(\vec{\theta}_{t+1}) - \Phi(\vec{\theta}_t) \le \frac{-\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\,\|\vec{\xi}_t\|_1^2 + \max_i\lambda_i\eta_i^2\,\max_{i,k,\vec{\theta}}|g^i_k(\vec{\theta})|\cdot\|\vec{e}_t\|_1 + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2. $$
**Claim C.2.** If max_i η_i is small enough and satisfies eq. (12), there exists a constant C so that |e^i_k(θ⃗)| ≤ C for all i ∈ [n], k ∈ [d], and θ⃗ ∈ Θ^n.
The proofs of these two claims are based on first order approximation.
*Proof of claim C.1.* By the partial derivatives of eq. (2), we have ∂²Φ(θ⃗)/∂θ^i_k∂θ^j_l = 2λ_iλ_j A_{l,k} if i ≠ j and ∂²Φ(θ⃗)/∂θ^i_k∂θ^i_l = 2λ_i(1 + λ_i)A_{l,k}. Thus, the second order partial derivatives of Φ are bounded by twice C_4 := max_ı λ_ı(λ_ı + 1) max_{k,l} A_{k,l}. By Taylor's expansion, we have

$$
\begin{aligned}
\Phi(\vec{\theta}_{t+1}) - \Phi(\vec{\theta}_t)
&\le \nabla\Phi(\vec{\theta}_t)\cdot\big(\vec{\theta}_{t+1} - \vec{\theta}_t\big) + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 \\
&= \sum_{i\in[n],k\in[d]} \lambda_i g^i_k(\vec{\theta}_t)\big(\eta_i\xi^i_k(\vec{\theta}_t) + \eta_i^2 e^i_k(\vec{\theta}_t)\big) + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 && \text{(by the partial derivatives of eq. (2))} \\
&= \sum_{i\in[n],k\in[d]} \big(\lambda_i\eta_i\,g^i_{t,k}\xi^i_{t,k} + \lambda_i\eta_i^2\,g^i_{t,k}e^i_{t,k}\big) + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 \\
&\le \frac{-1}{2\sum_i\lambda_i\eta_i}\Big(\sum_{i,k}\lambda_i\eta_i\,|\xi^i_{t,k}|\Big)^2 + \sum_{i\in[n],k\in[d]}\lambda_i\eta_i^2\,g^i_{t,k}e^i_{t,k} + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 && \text{(by lemma 4.3).}
\end{aligned}
$$

Therefore, we have

$$ \Phi(\vec{\theta}_{t+1}) - \Phi(\vec{\theta}_t) \le \frac{-\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\,\|\vec{\xi}_t\|_1^2 + \max_i\lambda_i\eta_i^2\,\max_{i,k,\vec{\theta}}|g^i_k(\vec{\theta})|\cdot\|\vec{e}_t\|_1 + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2, $$

which completes the proof.
*Proof of claim C.2.* If η* is small enough and satisfies eq. (12), we have η_i(ḡ^i − g^i_k) ≤ 1, and

$$ \theta^i_k + \eta_i\theta^i_k(\bar{g}^i - g^i_k) \;\le\; \theta^i_k\,e^{\eta_i(\bar{g}^i - g^i_k)} \;\le\; \theta^i_k + \eta_i\theta^i_k(\bar{g}^i - g^i_k) + \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2 $$

for all i, k and θ⃗ ∈ Θ^n. Additionally, with ξ^i_k = θ^i_k(ḡ^i − g^i_k), we can rewrite this as

$$ \theta^i_k + \eta_i\xi^i_k \;\le\; \theta^i_k\,e^{\eta_i(\bar{g}^i - g^i_k)} \;\le\; \theta^i_k + \eta_i\xi^i_k + \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2. $$

Therefore, the error can be upper bounded as

$$ \eta_i^2 e^i_k \;\le\; \frac{\theta^i_k + \eta_i\xi^i_k + \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2}{\sum_l \big(\theta^i_l + \eta_i\xi^i_l\big)} - \big(\theta^i_k + \eta_i\xi^i_k\big) \;=\; \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2, $$

because Σ_l θ^i_l = 1 and Σ_l ξ^i_l = 0. For the lower bound, we have

$$ \eta_i^2 e^i_k \;\ge\; \frac{\theta^i_k + \eta_i\xi^i_k}{\sum_l \big(\theta^i_l + \eta_i\xi^i_l + \frac{e}{2}\eta_i^2\theta^i_l(\bar{g}^i - g^i_l)^2\big)} - \big(\theta^i_k + \eta_i\xi^i_k\big) \;\ge\; -\big(\theta^i_k + \eta_i\xi^i_k\big)\Big(\frac{e}{2}\eta_i^2\sum_l \theta^i_l(\bar{g}^i - g^i_l)^2\Big). $$

If we set C := e · max_{θ⃗} Σ_l θ^i_l(ḡ^i − g^i_l)², because |η_iξ^i_k| ≤ η_i · 2max_k|g^i_k| ≤ 1 by eq. (12) and θ^i_k ≤ 1, we have |e^i_k(θ⃗)| ≤ C.
*Proof of lemma 4.4.* Because D is open, we can set D′ so that θ⃗* ∈ D′, D′ ⊂ D, and sup_{θ⃗∈D′} Φ(θ⃗) < (1/2) inf_{θ⃗∉D} Φ(θ⃗). We will pick η* small enough so that θ⃗_{t+1} ∈ D for all θ⃗_t ∈ D′ ⊂ D. The proof has two parts: we first show the dynamics in eq. (3) hit the set D′; then we prove eq. (3) stays in D afterward.

By theorem 4.2, there exists τ so that ϑ⃗(τ) ∈ D′. Now by Gronwall's inequality we can set η* small enough so that θ⃗_{⌊τ/∥η∥₁⌋} ∈ D. Formally, given the dynamics in eq. (3), we define right-continuous step functions θ⃗(t) = θ⃗_{⌊t⌋} and e⃗(t) := e⃗(θ⃗_{⌊t⌋}) for all t ≥ 0. Then the dynamics in eq. (3) can be written as

$$ \theta^i_k(t) - \theta^i_k(0) = \sum_{s=0}^{t-1}\big(\theta^i_{s+1,k} - \theta^i_{s,k}\big) = \sum_{s=0}^{t-1}\big(\eta_i\xi^i_k(\vec{\theta}_s) + \eta_i^2 e^i_k(\vec{\theta}_s)\big) = \eta_i\int_0^t \Big(\xi^i_k(\vec{\theta}(s)) + \eta_i e^i_k(s)\Big)\,ds. $$

On the other hand, the solution of eq. (6) can be written as

$$ \vartheta^i_k(\|\eta\|_1 t) - \vartheta^i_k(0) = \frac{\eta_i}{\|\eta\|_1}\int_0^{\|\eta\|_1 t} \xi^i_k(\vec{\vartheta}(s))\,ds = \eta_i\int_0^t \xi^i_k(\vec{\vartheta}(\|\eta\|_1 s))\,ds. $$

Because the ξ^i_k are continuous and Θ^n is compact, there exists L > 0 so that ξ^i_k is L-Lipschitz with respect to the one norm for all i and k. Thus, the difference between the above equations is

$$ \big|\theta^i_k(t) - \vartheta^i_k(\|\eta\|_1 t)\big| \le \eta_i\int_0^t \big|\xi^i_k(\vec{\theta}(s)) - \xi^i_k(\vec{\vartheta}(\|\eta\|_1 s))\big|\,ds + \eta_i^2\int_0^t |e^i_k(s)|\,ds \le \eta_i L\int_0^t \big\|\vec{\theta}(s) - \vec{\vartheta}(\|\eta\|_1 s)\big\|_1\,ds + \eta_i^2 Ct $$

by L-Lipschitzness and claim C.2. Therefore, by Gronwall's inequality and η_i ≤ η*, we have

$$ \big\|\vec{\theta}(t) - \vec{\vartheta}(\|\eta\|_1 t)\big\|_1 \le \eta^* ndL\int_0^t \big\|\vec{\theta}(s) - \vec{\vartheta}(\|\eta\|_1 s)\big\|_1\,ds + (\eta^*)^2 ndCt \le (\eta^*)^2 ndCt\,e^{\eta^* ndLt}. $$

Because ϑ⃗(τ) ∈ D′, we can pick η* small enough so that θ⃗(τ/∥η∥₁) = θ⃗_{⌊τ/∥η∥₁⌋} ∈ D and Φ(θ⃗_{⌊τ/∥η∥₁⌋}) < inf_{θ⃗∉D} Φ(θ⃗).

Now we show the second part, that θ⃗_t ∈ D for all t ≥ ⌊τ/∥η∥₁⌋. First, with claim C.1, we prove that the potential function is decreasing for all θ⃗_t ∉ D′ when η* is small enough. We estimate the three terms in claim C.1 separately. First, because θ⃗* ∈ D′ and Θ^n \ D′ is compact, min_{θ⃗∉D′} ∥ξ⃗(θ⃗)∥₁ > 0 exists. Then (min_i λ_i²η_i²)/(2Σ_i λ_iη_i) ∥ξ⃗_t∥₁² = Ω(max_i η_i). By claim C.2, max_i λ_iη_i² max_{i,k,θ⃗} |g^i_k(θ⃗)| · ∥e⃗_t∥₁ = O(max_i η_i²). Finally, d²C₄∥θ⃗_{t+1} − θ⃗_t∥₁² = O(max_i η_i²). Therefore, there exists η* small enough so that Φ(θ⃗_{t+1}) − Φ(θ⃗_t) < 0 for all θ⃗_t ∉ D′ and all such η. Therefore Φ(θ⃗_t) < inf_{θ⃗∉D} Φ(θ⃗) for all t ≥ ⌊τ/∥η∥₁⌋, which completes the proof.
D PROOFS AND DETAILS FOR LEMMA 4.5
Lemma 4.3 shows that the value of Φ is decreasing along eq. (6), and that the decrease rate is lower bounded in terms of the one norm of ξ⃗. Thus, if we can show the error between eq. (3) and eq. (6) is bounded in terms of ∥ξ⃗∥₁, then the value of Φ is also decreasing along eq. (3). The main challenge is that, because ξ⃗(θ⃗*) = 0, we need to control the error as ξ⃗ converges to zero while η* is fixed.

We first show three ancillary claims, claims D.1 to D.3. Claim D.1 shows the vanishing components of θ⃗_t decrease rapidly once the dynamic is in D. Claims D.2 and D.3 show the supporting components also decrease once the vanishing components are small enough.
Given (n, d, A, λ), we define the following constants: C₁ := (1/4) max_{i,l,θ∈Θ^n} |g^i_l|, C₂ := 2 / min_{i,k} (θ*)^i_k, C₃ := ed · max(C₁, C₂), and C₄ := max_ı λ_ı(λ_ı + 1) max_{k,l} A_{k,l}. We require the maximum learning rate to be bounded by η*, which satisfies the following conditions:

$$ \eta^* \max_{i,k,\vec{\theta}\in\Theta^n} |g^i_k(\vec{\theta})| \le \frac{1}{2} \quad (12) $$

$$ \eta^*\, ndC_3 \max_{\vec{\theta}} \|\vec{\xi}(\vec{\theta})\|_1 \le 1 \quad (13) $$
Additionally, given the bound of learning rate ratio, [max]mini[i] η[ η]i[i]
_[≤]_ _[R][η][, we requires]_
_Rη[2][η][3]_ _i_ min 4 _,_ 1 (14)
_∗_ _[<][ min]16_ _[i][ λ]i_ _[λ][2][i]_ _C3 maxi(λi) maxθ⃗_ **_g_** 1 _d[2]C4_
 _[∥][⃗]∥_ 
_i_
Note that $\frac{\max_i \eta_i^3}{\min_i \eta_i^2} \le R_\eta^2 \eta_*$, so it is also less than the right-hand side of eq. (14). On the other hand, by lemma 4.4, we can pick $D$ small enough so that the following condition holds:

$$\frac{1}{2}\min_{i\in[n],k\in S_i} (\theta^*)^i_k \le \min_{i\in[n],k\in S_i,\vec\theta\in D} \theta^i_k \tag{15}$$
We first show that, for all $i$ and $k \in \bar S_i$, $\theta^i_{t,k}$ is decreasing and converges to zero exponentially fast as $t$ increases. Because $\vec\theta^*$ is a proper performative stable point, $\sum_l (\theta^*)^i_l e^{-\eta_i g^i_l(\vec\theta^*)} = e^{-\eta_i \bar g^i(\vec\theta^*)} > e^{-\eta_i g^i_k(\vec\theta^*)}$ for all $i$ and $k \in \bar S_i$. We can take $\eta_*$, $\epsilon_1$, and $D$ small enough so that, for all $\vec\theta \in D$ and all learning rate profiles $\eta$ with $\max_i \eta_i \le \eta_*$,

$$e^{-\eta_i g^i_k} < (1 - \epsilon_1)\sum_l \theta^i_l e^{-\eta_i g^i_l}, \quad \forall i \in [n],\ k \in \bar S_i \tag{16}$$
**Claim D.1 (Vanishing components).** Given $\eta_*$, $\epsilon_1$, and $D$ in eq. (16), if $\vec\theta_t \in D$ for all $t \ge 0$, then for all $i \in [n]$ and $k \in \bar S_i$, $\theta^i_{t,k}$ is decreasing in $t$ and, for all $t \ge 0$,

$$0 \le \theta^i_{t,k} \le \theta^i_{0,k} e^{-\epsilon_1 t}.$$

*Proof of claim D.1.* By eq. (16), for all $t \ge 0$,

$$\theta^i_{t+1,k} = \theta^i_{t,k}\frac{\exp(-\eta_i g^i_{t,k})}{\sum_l \theta^i_{t,l}\exp(-\eta_i g^i_{t,l})} \le (1 - \epsilon_1)\theta^i_{t,k}.$$

Therefore, $\theta^i_{t,k}$ is decreasing, and since $(1 - \epsilon_1)^t \le e^{-\epsilon_1 t}$, we get $\theta^i_{t,k} \le \theta^i_{0,k} e^{-\epsilon_1 t}$.
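A minimal numerical illustration of claim D.1, in an instance we assume purely for transparency (one agent, two actions, constant losses, so the second action is a vanishing component): its probability shrinks geometrically, matching the $e^{-\epsilon_1 t}$ envelope.

```python
import numpy as np

# Claim D.1 in the simplest setting (illustrative assumption): with constant
# losses, the weight of the strictly worse action decays geometrically,
# i.e. theta_{t,k} <= theta_{0,k} * exp(-eps1 * t) for some eps1 > 0.
g = np.array([0.0, 1.0])
eta = 0.1
theta = np.array([0.5, 0.5])
for t in range(101):
    if t % 25 == 0:
        print(t, theta[1])
    w = theta * np.exp(-eta * g)  # one step of eq. (3)
    theta = w / w.sum()
```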
**Claim D.2.** There exists a constant $C_3$ such that, for all $\vec\theta \in \Theta^n$ and $\delta_1 > 0$ with $\|\vec\xi(\vec\theta)\|_1 \ge 2\sqrt{\delta_1}$ and $|\xi^i_k(\vec\theta)| \le \delta_1$ for all $i \in [n]$ and $k \in \bar S_i$, we have

$$|e^i_k(\vec\theta)| \le C_3\|\vec\xi(\vec\theta)\|_1^2, \tag{17}$$

for all $i \in [n]$ and $k \in [d]$.
Claim D.2 shows that if the one norm of $\vec\xi$ is much bigger than the vanishing components, then the error term $\vec e$ can be bounded by $\|\vec\xi\|_1^2$. Moreover, claim D.1 ensures that the vanishing components decrease rapidly, so the condition of claim D.2 readily holds.
*Proof of claim D.2.* Given $\vec\theta$ and $\delta_1 > 0$ that satisfy the condition, we first show two inequalities that bound $\theta^i_k(\bar g^i - g^i_k)^2$ for vanishing and supporting components respectively. For a vanishing component $k \in \bar S_i$,

$$\theta^i_k(\bar g^i - g^i_k)^2 \le \max_{i,l,\vec\theta\in\Theta^n}|g^i_l| \cdot |\theta^i_k(\bar g^i - g^i_k)| \le 4C_1\delta_1 \le C_1\|\vec\xi\|_1^2. \tag{18}$$

Then, for a supporting component $k \in S_i$, with eq. (15) we have

$$\theta^i_k(\bar g^i - g^i_k)^2 = \frac{(\theta^i_k(\bar g^i - g^i_k))^2}{\theta^i_k} \le \frac{|\xi^i_k|^2}{\min_{i,l,\vec\theta\in D}\theta^i_l} \le \frac{2}{\min_{i,k}(\theta^*)^i_k}|\xi^i_k|^2 \le C_2\|\vec\xi\|_1^2. \tag{19}$$
Now we use the above two inequalities to approximate eq. (3). For the numerator, because $1 + x \le \exp(x) \le 1 + x + \frac{e}{2}x^2$ for all $x \le 1$, we have $\theta^i_k\exp(\eta_i(\bar g^i - g^i_k)) \ge \theta^i_k + \eta_i\theta^i_k(\bar g^i - g^i_k) = \theta^i_k + \eta_i\xi^i_k$. On the other hand, because $\eta_i(\bar g^i - g^i_k) \le 1$ by eq. (12), $\theta^i_k\exp(\eta_i(\bar g^i - g^i_k)) \le \theta^i_k + \eta_i\xi^i_k + \frac{e}{2}\theta^i_k\eta_i^2(\bar g^i - g^i_k)^2$. By eqs. (18) and (19), we have

$$0 \le \theta^i_k e^{\eta_i(\bar g^i - g^i_k)} - \theta^i_k - \eta_i\xi^i_k \le \frac{e}{2}\max(C_1, C_2)\eta_i^2\|\vec\xi\|_1^2 \tag{20}$$

For the denominator, we sum eq. (20) over $k$. Because $\sum_l \theta^i_l = 1$ and $\sum_l \theta^i_l(\bar g^i - g^i_l) = 0$, we have

$$0 \le \sum_l \theta^i_l e^{\eta_i(\bar g^i - g^i_l)} - 1 \le \frac{ed}{2}\max(C_1, C_2)\eta_i^2\|\vec\xi\|_1^2 \tag{21}$$
Given $i \in [n]$ and $k \in [d]$, we apply the above equations to eq. (3). For the upper bound, we have

$$\eta_i^2 e^i_k(\vec\theta_t) = \frac{\theta^i_{t,k}\exp(\eta_i(\bar g^i - g^i_k))}{\sum_l \theta^i_{t,l}\exp(\eta_i(\bar g^i - g^i_l))} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \le \theta^i_{t,k} e^{\eta_i(\bar g^i - g^i_{t,k})} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \le \frac{e\max(C_1, C_2)}{2}\eta_i^2\|\vec\xi\|_1^2$$

where the first inequality is by eq. (21) and the second by eq. (20). For the lower bound,

$$\eta_i^2 e^i_k(\vec\theta_t) = \frac{\theta^i_{t,k}\exp(\eta_i(\bar g^i - g^i_k))}{\sum_l \theta^i_{t,l}\exp(\eta_i(\bar g^i - g^i_l))} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \ge \frac{\theta^i_{t,k} + \eta_i\xi^i_{t,k}}{1 + \frac{ed\max(C_1,C_2)}{2}\eta_i^2\|\vec\xi\|_1^2} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \qquad\text{(by eqs. (20) and (21))}$$
$$\ge -\left(\theta^i_{t,k} + \eta_i\xi^i_{t,k}\right)\frac{ed\max(C_1,C_2)}{2}\eta_i^2\|\vec\xi\|_1^2 \qquad (1/(1+x) \ge 1 - x)$$
$$\ge -ed\max(C_1,C_2)\eta_i^2\|\vec\xi\|_1^2 \qquad (\theta^i_{t,k} \le 1 \text{ and } \eta_i\xi^i_{t,k} \le 1 \text{ by eq. (12)})$$

Therefore, with $C_3 := ed\max(C_1, C_2)$, we finish the proof.
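The bound in eq. (17) can be checked numerically. The sketch below uses a single-agent version of the toy losses from the previous sketch (again an assumption, not the paper's setup), extracts $e^i_k$ from one step of eq. (3) via $\theta_{t+1} = \theta_t + \eta\xi + \eta^2 e$, and shows that $|e^i_k|$ stays bounded as $\eta$ shrinks, consistent with $|e^i_k| \le C_3\|\vec\xi\|_1^2$.

```python
import numpy as np

# Numeric sanity check of eq. (17) (illustrative, single agent): the
# second-order error e extracted from one MWU step stays bounded by a
# constant times ||xi||_1^2 as eta decreases.
rng = np.random.default_rng(1)
d = 4
A = rng.uniform(0.0, 1.0, size=(d, d))

def g(theta):
    return A @ theta

def xi(theta):
    gv = g(theta)
    return theta * (theta @ gv - gv)

def err(theta, eta):              # solve eq. (3) for the error term e
    w = theta * np.exp(-eta * g(theta))
    return (w / w.sum() - theta - eta * xi(theta)) / eta**2

theta = rng.dirichlet(np.ones(d))
print("||xi||_1^2 =", np.linalg.norm(xi(theta), 1) ** 2)
for eta in [0.2, 0.1, 0.05, 0.025]:
    print(eta, np.abs(err(theta, eta)).max())
```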
**Claim D.3.** There exists $\epsilon_2 > 0$ so that, for all $\vec\theta_t \in \Theta^n$ and $\delta_1 > 0$ with $\|\vec\xi(\vec\theta_t)\|_1 \ge 2\sqrt{\delta_1}$ and $|\xi^i_k(\vec\theta_t)| \le \delta_1$ for all $i \in [n]$ and $k \in \bar S_i$, we have $\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\epsilon_2\|\vec\xi(\vec\theta_t)\|_1^2$.
*Proof of claim D.3.* By claim C.1,

$$\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\|\vec\xi_t\|_1^2 + \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)| \cdot \|\vec e_t\|_1 + d^2 C_4\|\vec\theta_{t+1} - \vec\theta_t\|_1^2.$$

We can bound the latter two terms by claim D.2. For the second term, $\|\vec e_t\|_1 \le ndC_3\|\vec\xi_t\|_1^2$. For the third term,

$$\|\vec\theta_{t+1} - \vec\theta_t\|_1 \le \sum_{i,k}\left(\eta_i|\xi^i_{t,k}| + C_3\eta_i^2\|\vec\xi\|_1^2\right) \le \max_i\eta_i\|\vec\xi\|_1 + ndC_3\max_i\eta_i^2\|\vec\xi\|_1^2 \le 2\max_i\eta_i\|\vec\xi\|_1,$$

where the last step is by eq. (13).
With the above inequalities and eq. (14), we have $\frac{\min_i\lambda_i^2\eta_i^2}{4\sum_i\lambda_i\eta_i} > \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)|\, ndC_3$ and $\frac{\min_i\lambda_i^2\eta_i^2}{4\sum_i\lambda_i\eta_i} > 4d^2C_4\max_i\eta_i^2$. Therefore, the constant

$$\epsilon_2 := \frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i} - \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)|\, ndC_3 - 4d^2C_4\max_i\eta_i^2$$

is positive and satisfies $\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\epsilon_2\|\vec\xi_t\|_1^2$.
*Proof of lemma 4.5.* To prove $\lim_{t\to\infty}\vec\theta_t = \vec\theta^*$, there are two equivalent ways to measure the progress besides $\|\vec\theta_t - \vec\theta^*\|_1$. First, because $\vec\theta^*$ is an isolated fixed point of eq. (6), $\vec\theta_t$ converges to $\vec\theta^*$ if and only if $\lim_{t\to\infty}\vec\xi(\vec\theta_t) = \vec 0$. On the other hand, because $\Phi$ is strictly convex and $\vec\theta^*$ is its minimum point, $\vec\theta_t$ converges to $\vec\theta^*$ if and only if $\lim_{t\to\infty}\Phi(\vec\theta_t) = \min_{\vec\theta}\Phi(\vec\theta)$. With these two equivalent conditions, given any $\epsilon > 0$, there exists $\delta > 0$ so that $\|\vec\theta - \vec\theta^*\|_1 < \epsilon$ whenever

$$\vec\theta \in V_\delta := \left\{\vec\theta : \Phi(\vec\theta) \le \max_{\vec\theta' : \|\vec\xi(\vec\theta')\|_1 \le \delta}\Phi(\vec\theta')\right\}.$$

Therefore, it is sufficient to show that, for all $\delta > 0$, there is $t_\delta$ so that $\vec\theta_t \in V_\delta$ for all $t \ge t_\delta$. With the technical claims D.1 to D.3, our proof has two parts: first we show that the dynamic hits $V_\delta$; then we show that the dynamic stays in $V_\delta$.
For the first part, given $\delta > 0$ and $0 < C < 1$, by claim D.1 there exists $T_1$ such that each vanishing component satisfies $|\xi^i_{t,k}| \le C^2\delta^2/4$ for all $t \ge T_1$. Then, by claim D.3, there exists $T_2 \ge T_1$ such that $\|\vec\xi_{T_2}\|_1 \le C\delta$. Otherwise, the value of $\Phi$ would decrease by a nonzero constant $\epsilon_2\|\vec\xi_t\|_1^2 \ge \epsilon_2 C^2\delta^2 > 0$ at each round, which contradicts the fact that $\Phi$ is bounded from below.
For the second part, if $\|\vec\xi_t\|_1 \le C\delta \le \delta$ for all $t \ge T_2$, then we are done. Otherwise, there exists $\tau \ge T_2$ so that $\|\vec\xi_\tau\|_1 \le C\delta < \|\vec\xi_{\tau+1}\|_1$. Now we prove that $\Phi(\vec\theta_{\tau+1}) \le \max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$. Because the difference between $\vec\theta_{\tau+1}$ and $\vec\theta_\tau$ in one norm is $\|\vec\theta_{\tau+1} - \vec\theta_\tau\|_1 \le \max_i\eta_i\|\vec\xi_\tau\|_1 + \max_i\eta_i^2\|\vec e_\tau\|_1 \le 2\max_i\eta_i\|\vec\xi_\tau\|_1$ when $\eta_*$ is small enough, and $\vec\xi$ is $L_\xi$-Lipschitz in one norm for some constant $L_\xi$, we have

$$C\delta < \|\vec\xi_{\tau+1}\|_1 \le \|\vec\xi_\tau\|_1 + \|\vec\xi_{\tau+1} - \vec\xi_\tau\|_1 \le \|\vec\xi_\tau\|_1 + 2L_\xi\max_i\eta_i\|\vec\xi_\tau\|_1 \le \delta, \tag{22}$$

when $C$ is small enough so that $C(1 + 2L_\xi\max_i\eta_i) \le 1$. Therefore, $\|\vec\xi_{\tau+1}\|_1 \le \delta$ and $\Phi(\vec\theta_{\tau+1}) \le \max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$. Finally, because $|\xi^i_{t,k}| \le C^2\delta^2/4$ for all $t \ge \tau + 1$, by claim D.3 the potential function is decreasing for all $t \ge \tau + 1$ unless $\|\vec\xi_t\|_1 \le C\delta$. In both cases the potential stays less than $\max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$, which completes the proof.
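To illustrate lemma 4.5, the sketch below runs eq. (3) with small heterogeneous learning rates on a toy instance with symmetric costs (a potential-like case we assume only for illustration; it does not verify the lemma's hypotheses) and prints $\|\vec\xi(\vec\theta_t)\|_1$, which decays toward zero as the iterates approach a fixed point.

```python
import numpy as np

# Illustration of lemma 4.5 on an assumed toy instance: with small bounded
# learning rates, ||xi(theta_t)||_1 -> 0 along the discrete dynamics eq. (3).
rng = np.random.default_rng(2)
n, d = 3, 4
B = rng.normal(size=(d, d))
A = B.T @ B                       # symmetric costs give a potential structure
lam = np.array([0.5, 0.3, 0.2])

def g(theta):
    return np.tile(A @ (lam @ theta), (n, 1))

def xi(theta):
    gv = g(theta)
    return theta * ((theta * gv).sum(axis=1, keepdims=True) - gv)

eta = np.array([0.05, 0.04, 0.03])  # small heterogeneous learning rates
theta = rng.dirichlet(np.ones(d), size=n)
for t in range(2001):
    if t % 400 == 0:
        print(t, np.abs(xi(theta)).sum())
    w = theta * np.exp(-eta[:, None] * g(theta))
    theta = w / w.sum(axis=1, keepdims=True)
```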
E PROOF AND DETAIL FOR THEOREM 4.6
*Proof of lemma 4.7.* Define $P_L(x) := \alpha(L)(x - \beta(L))$, which is increasing in $x$ for all $L$ because $\alpha(L) > 0$. With $P_L$,

$$f_{\alpha(L),\beta(L)}(x) = \frac{x}{x + (1 - x)\exp(P_L(x))}.$$

Take $x_1(L) = x_1 := 1 - 1/\alpha(L)$, $x_2(L) = x_2 := f_{\alpha(L),\beta(L)}(x_1(L))$, and $x_3(L) = x_3 := f_{\alpha(L),\beta(L)}(x_2(L))$. We will define $x_0(L)$ later. Below we omit $L$ and write $x_1$, $x_2$, and $x_3$.
To define $x_0(L)$, we set $y(L) := \beta(L)/2$ and want to show

$$f_{\alpha(L),\beta(L)}(y) > x_1, \tag{23}$$

which is equivalent to $(\alpha(L) - 1)(1 - y(L))\exp(P_L(y(L))) < y(L)$. When $L$ is large enough, we have $\beta(L) \in \left(\frac{2\beta_\infty}{3}, \frac{4\beta_\infty}{3}\right)$. Additionally, because $P_L(y(L)) = -\alpha(L)\beta(L)/2 < -\alpha(L)\beta_\infty/3$, we have

$$(\alpha(L) - 1)(1 - y(L))e^{P_L(y(L))} \le \alpha(L)e^{-\alpha(L)\beta_\infty/3} < y(L)$$

when $L$ is large enough, which proves eq. (23). On the other hand, because $\beta(L) < \frac{4\beta_\infty}{3}$ and $P_L(\beta(L)) = 0$, we have $f_{\alpha(L),\beta(L)}(\beta(L)) = \beta(L) < \frac{1}{2}(\beta_\infty + 1)$. Moreover, $\frac{1}{2}(\beta_\infty + 1) < 1 - \frac{1}{\alpha(L)}$ holds when $L$ is large enough. These two imply

$$f_{\alpha(L),\beta(L)}(\beta(L)) = \beta(L) < x_1(L). \tag{24}$$

Combining eqs. (23) and (24), by the intermediate value theorem, there exists $x_0$ such that

$$f_{\alpha(L),\beta(L)}(x_0) = x_1 \quad\text{with}\quad y < x_0 < \beta(L) < x_1. \tag{25}$$
Now we show $x_3 < x_0$. With $1 - x_1(L) = 1/\alpha(L)$ and $0 < x_1 < 1$, we have

$$x_2 = f_{\alpha(L),\beta(L)}(x_1) = \frac{x_1}{x_1 + \alpha(L)^{-1}\exp(P_L(x_1))} \le \alpha(L)\exp(-P_L(x_1)).$$

Because $\beta_\infty < 1/2$, when $L$ is large enough, $x_1(L) = 1 - \frac{1}{\alpha(L)} > \frac{1}{4}(2\beta(L) + 3)$ and, thus, $P_L(x_1) > \alpha(L)\left(\frac{3}{4} - \frac{\beta(L)}{2}\right)$. Therefore, we have

$$x_2 \le \alpha(L)\exp\left(-\alpha(L)\left(\frac{3}{4} - \frac{\beta(L)}{2}\right)\right) \tag{26}$$
Finally, because $P_L(x_2) \ge P_L(0) = -\alpha(L)\beta(L)$, we get

$$x_3 = f_{\alpha(L),\beta(L)}(x_2) \le \frac{x_2}{x_2 + (1 - x_2)\exp(-\alpha(L)\beta(L))} = \frac{x_2\exp(\alpha(L)\beta(L))}{1 + x_2(\exp(\alpha(L)\beta(L)) - 1)}$$
$$\le x_2\exp(\alpha(L)\beta(L)) \qquad\text{(since $\exp(\alpha(L)\beta(L)) > 1$ when $L$ is large)}$$
$$\le \alpha(L)\exp\left(-\alpha(L)\left(\tfrac{3}{4} - \tfrac{\beta(L)}{2}\right)\right)\exp(\alpha(L)\beta(L)) \qquad\text{(eq. (26))}$$
$$= \alpha(L)\exp\left(\frac{3\alpha(L)}{2}\left(\beta(L) - \frac{1}{2}\right)\right)$$

Finally, because $\beta_\infty < 1/2$, $x_3$ converges to $0$ as $L$ grows. Therefore, when $L$ is large enough,

$$x_3 < y < x_0. \tag{27}$$

By eqs. (25) and (27), we have $x_3 < x_0 < x_1$, which completes the proof.
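The construction above is easy to verify numerically. The script below instantiates $f_{\alpha,\beta}$ with an assumed parameterization ($\alpha(L) = L$ and $\beta(L) = \beta_\infty + 1/L$ with $\beta_\infty = 0.4$; any choice with $\alpha(L) \to \infty$ and $\beta(L) \to \beta_\infty < 1/2$ behaves similarly), computes $x_1, x_2, x_3$, recovers $x_0$ by bisection on $(\beta/2, \beta)$, and confirms the ordering $x_3 < x_0 < x_1$.

```python
import numpy as np

# Numeric check of lemma 4.7. The parameterization alpha(L) = L and
# beta(L) = beta_inf + 1/L is an assumption made for illustration.
beta_inf = 0.4

def f(x, alpha, beta):
    return x / (x + (1.0 - x) * np.exp(alpha * (x - beta)))

for L in [50, 100, 200]:
    alpha, beta = float(L), beta_inf + 1.0 / L
    x1 = 1.0 - 1.0 / alpha
    x2 = f(x1, alpha, beta)
    x3 = f(x2, alpha, beta)
    # f(beta/2) > x1 while f(beta) = beta < x1, so bisect for f(x0) = x1
    lo, hi = beta / 2.0, beta
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid, alpha, beta) > x1:
            lo = mid
        else:
            hi = mid
    x0 = 0.5 * (lo + hi)
    print(L, "x3 < x0 < x1:", x3 < x0 < x1, (x3, x0, x1))
```

For these parameters one can also check that $x_3 < y(L) = \beta(L)/2$, in line with eq. (27).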