# MULTI-AGENT PERFORMATIVE PREDICTION: FROM GLOBAL STABILITY AND OPTIMALITY TO CHAOS
**Anonymous authors**
Paper under double-blind review
ABSTRACT
The recent framework of performative prediction (Perdomo et al., 2020) aims to capture settings where predictions influence the very target/outcome they seek to predict. In this paper, we introduce a natural multi-agent version of this framework, where multiple decision makers try to predict the same outcome. We showcase that such competition can result in interesting phenomena by proving the possibility of phase transitions from stability to instability and eventually chaos. Specifically, we present settings of multi-agent performative prediction where, under sufficient conditions, the dynamics lead to global stability and optimality. In the opposite direction, when the agents are not sufficiently cautious in their learning/update rates, we show that instability and in fact formal chaos are possible. We complement our theoretical predictions with simulations showcasing the predictive power of our results.
1 INTRODUCTION
Performative prediction (Perdomo et al., 2020) is a recently introduced framework that focuses on a
natural but largely unexplored element of supervised learning. In many practical cases the predictive
model can affect the very outcome that it is trying to predict. For example, predictions about which
part of a road network will have high congestion trigger responses from the drivers which affect the
resulting traffic realization, leading to a shift of the target distribution. In such settings, Perdomo et al. (2020) explored conditions for the existence and approximate optimality of stable equilibria of such processes.
One possible way to interpret the performative prediction setting is as a single-agent "game", where the predictive agent plays a game against himself: he chooses the parameters of his model as his actions, but the predictive accuracy/cost of the model depends on his own past actions. Fixed
points of this process do not allow for profitable deviations. Once cast in this light, it becomes
self-evident that the restriction to a single predictive agent/model is arguably only the first step in
capturing more general phenomena where there is a closed loop between predictive models and their
environment. This motivates our central question: _What is the interplay between stability and optimality in cases where multiple predictive models operate in parallel to each other? Can the competition between multiple models lead to novel phenomena such as phase transitions from stability to instability and chaos?_
A natural setting to consider, for example, is market competition, where multiple hedge funds try to simultaneously predict the future prices and volatility of financial instruments. Of course, as they act upon their predictions, they also move the prices of these instruments in a highly correlated way. As more agents enter the market and the competition becomes increasingly fierce, is it possible that at some point the market flips from quickly discovering accurate, stable predictions reflecting the underlying market fundamentals to self-induced unpredictability and chaos? When it comes to performative prediction, can too many cooks spoil the soup?
**Our model** Standard supervised learning consists of three components: a set of predictive models, a loss function, and a data distribution. The learner (agent) observes samples of the distribution, and then decides on a predictive model. When predictions are performative, the agent's choice of a predictive model influences the data distribution. Thus, instead of a fixed data distribution, a performative prediction problem also has a distribution map: a mapping from the agent's predictive models to data distributions. We further propose multi-agent performative prediction, which models the influence of multiple agents' decisions on the data distribution, and ask whether these influences lead to convergent or chaotic systems.
To understand the framework of multi-agent performative prediction, we study a natural regression
problem with multi-agent location-scale distribution maps (definition 2.2) where the agents’ influence is linear in the agents’ models. Specifically, the data consist of features and outcomes. For
each agent i ∈ [n], his influence on the distribution of the outcome is his model weighted by a scalar
_λi > 0. When n = 1, our multi-agent location-scale distribution map is a special case of location-_
scale family in Miller et al. (2021). In the market example, each hedge fund company tries to predict
the price (outcome) based on macroeconomic data or other information (features). These influence
parameters λ1, . . ., λn can be seen as each hedge fund’s capital that can influence the distribution of
the future price.
Finally, we consider agents who do not know the dependency between their models and the data distribution, and who can only improve their predictive models myopically and iteratively. In this paper, the agents myopically use a reinforcement learning algorithm, exponentiated gradient for linear regression (Kivinen & Warmuth, 1997), to improve their predictive models in rounds. We study the long-term behavior, and ask whether the system converges to the performative stable and optimal point or behaves chaotically.
**Our results** We first study basic properties of our multi-agent performative prediction with the multi-agent location-scale distribution map. We show 1) the existence of a performative stable point in our setting (proposition 3.1) and 2) that performative stability and performative optimality are equivalent (proposition 3.2). This equivalence allows us to focus on the dynamical behavior of the system.
In section 4, we introduce learning dynamics to the multi-agent performative prediction and study their long-term behavior. We provide a threshold result which depends on the learning rates of exponentiated gradient descent and on the collective influence. Theorem 4.1 shows the dynamics of exponentiated gradient descent converge to the performative stable and optimal point when the learning rates are small enough. Our convergence result in theorem 4.1 holds when the feature space is multi-dimensional, and every learning agent can use a different learning rate starting from an arbitrary interior state. Our exact convergence result also applies in the single-agent performative prediction setting. In contrast, previous convergence results in Perdomo et al. (2020); Miller et al. (2021); Mendler-Dünner et al. (2020) only show that their dynamics converge to a small neighborhood of the performative stable point.
On the other hand, section 4.2 shows the dynamics can exhibit chaotic behavior if the collective influence L_n is large enough. Specifically, theorem 4.6 shows that even when the feature space is in R^2, these systems provably exhibit Li-Yorke chaos when the collective influence L_n = Σ_i λ_i is large enough. This implies that there exists an uncountable "scrambled" set so that, given any two initial conditions in the set, the lim inf of the distance between the two trajectories is zero, but the lim sup of their distance is positive (definition 2.4). The chaotic result in theorem 4.6 also holds for the original single-agent performative prediction so long as the agent's influence is large enough, and thus complements previous performative prediction works on convergence behavior, which primarily consider the case where the agents' influence on the data distribution is sufficiently small. Moreover, no matter how small the agents' learning rates, theorem 4.6 shows that chaos is inevitable in some performative prediction settings when the number of agents exceeds a carrying capacity. Past that point the system becomes totally unpredictable, with small perturbations exploding exponentially fast.
Finally, section 5 provides numerical examples of our dynamics and shows convergent and chaotic behavior. Additionally, through simulation, we demonstrate that our convergence and chaos results also hold when the agents can only access noisy estimates of the gradient and conduct stochastic exponentiated gradient descent.
**Related work** Data distribution shift is not a new topic in ML, but earlier works focused primarily on exogenous changes to the data-generating distribution. Performativity in machine learning was introduced by Perdomo et al. (2020). The original work and several follow-ups study the discrepancy between performative stability and performative optimality and prove approximate convergence of learning dynamics, e.g., stochastic gradient descent or iterative empirical risk minimization (Mendler-Dünner et al., 2020; Drusvyatskiy & Xiao, 2020; Izzo et al., 2021). Performativity of prediction is also related to several applications: strategic classification (Hardt et al., 2016) and retraining (Bartlett, 1992; Kuh et al., 1990).
Inspired by the instability of training algorithms in ML applications such as Generative Adversarial
Networks (GANs), there has been a lot of recent interest in understanding conditions (particularly
in multi-agent systems) where learning behavior may be non-equilibrating/unstable (Cheung & Tao,
2020; Balduzzi et al., 2020; Flokas et al., 2020; Andrade et al., 2021; Letcher, 2021; Giannou et al.,
2021). The (in)stability and performance of exponentiated gradient descent in particular (also referred to as Multiplicative Weights Updates) and other closely related dynamics has attracted a lot
of attention (Cohen et al., 2017; Bailey & Piliouras, 2018; Cheung, 2018; Panageas et al., 2019;
Cheung & Piliouras, 2020; Vadori et al., 2021). The technique of Li-Yorke chaos has recently found
applications across several different domains such as routing games, Cournot games and blockchain
protocols (Palaiopanos et al., 2017; Chotibut et al., 2020; Bielawski et al., 2021; Cheung et al., 2021;
Leonardos et al., 2021). To our knowledge, this is the first time where such formal chaotic results
are established in settings related to performative prediction and supervised learning more generally.
Another line of related work is learning in games. Specifically, our dynamics can be seen as special cases of the multiplicative weights update (Hedge algorithm) on congestion games. Previous works on Hedge algorithms only show exact convergence when the learning rate is decreasing (Kleinberg et al., 2009; Cohen et al., 2017), and, to the best of our knowledge, our results are the first that show exact convergence of Hedge algorithms with small constant learning rates. Our results also complement the exact convergence result for the linear variant of the multiplicative weights update by Palaiopanos et al. (2017).
2 PRELIMINARIES
2.1 MULTI-AGENT PERFORMATIVE PREDICTION
A multi-agent performative prediction comprises n agents deploying predictive models f_{θ^1}, ..., f_{θ^n} with parameters θ^1, ..., θ^n ∈ Θ that collectively influence the future data distribution. We formalize this dependency via a distribution map D(·) which outputs a distribution on the data, D(θ⃗), given a model profile θ⃗ := (θ^1, ..., θ^n) ∈ Θ^n. A loss function ℓ(z, θ′) measures the loss of a model f_{θ′} on a data point z ∈ Z, and the expected loss on a distribution D is E_{z∼D}[ℓ(z, θ′)]. For performative prediction, we further define the decoupled performative loss on a distribution map D as

$$ \ell(\vec{\theta}, \theta') := \mathbb{E}_{z \sim \mathcal{D}(\vec{\theta})}[\ell(z, \theta')] $$

where θ′ ∈ Θ denotes a predictive model, while θ⃗ ∈ Θ^n denotes a deployed model profile.
Given the set of models Θ, the loss function ℓ, and the distribution map D, each agent in a multi-agent performative prediction (Θ, ℓ, D) pursues minimal loss on the distribution that the agents collectively induce. We consider two solution concepts, performative optimality and performative stability, which generalize the original ones in Perdomo et al. (2020).
**Definition 2.1.** Given (Θ, ℓ, D), a model profile θ⃗* ∈ Θ^n is performatively optimal if the total loss is minimized,

$$ \vec{\theta}^* \in \arg\min_{\vec{\theta} \in \Theta^n} \sum_i \ell(\vec{\theta}, \theta^i). $$

Another desirable property of a model profile is that, given that all agents deploy their models, the models are also simultaneously optimal for the distribution they induce. Formally, θ⃗* is performatively stable if for all i ∈ [n],

$$ (\theta^*)^i \in \arg\min_{\theta^i \in \Theta} \ell(\vec{\theta}^*, \theta^i). $$

Performative optimality does not imply performative stability. For a performatively optimal point, the minimization variable θ⃗ affects both the first and the second argument, while it only affects the second one for a performatively stable point. We now introduce the model studied in this paper.
**Location-scale distribution map** In this paper, we study the family of multi-agent location-scale maps for regression problems where a data point consists of a d-dimensional feature and a scalar outcome, z = (x, y). These are natural classes of distribution maps in which performative effects enter through an additive or multiplicative factor that is linear in θ⃗.
**Definition 2.2.** Given d ∈ N with d ≥ 2, and Θ^n ⊆ R^d, a distribution map D : Θ^n → R^d × R is a multi-agent location-scale distribution map on n parties if there exists a static distribution D_X on R^{d+1}, θ^0 ∈ R^d, and n linear functions Λ_1, ..., Λ_n from R^d to R^d so that the distribution D(θ⃗) has the following form: the feature x ∈ R^d and the noise x_0 ∈ R are jointly sampled from D_X, and the outcome is

$$ y = \Big\langle \theta^0 - \sum_{i=1}^n \Lambda_i(\theta^i),\; x \Big\rangle + x_0. $$

In this paper, we consider the scaling maps Λ_i(θ) = λ_i θ for all θ ∈ Θ with scalars λ_i > 0 for all i ∈ [n]. We call λ := (λ_1, ..., λ_n) the influence parameters, and L_n := Σ_{i=1}^n λ_i the collective influence. Furthermore, we let A := E[xx^⊤] ∈ R^{d×d} be the covariance matrix of the feature, and b := Aθ^0 + E[x_0 x] ∈ R^d. We will specify a multi-agent location-scale distribution map with the parameters n, d, λ, A, and b.
When n = 1, our multi-agent location-scale distribution map is a special case of the location-scale family in Miller et al. (2021), where the model θ may shift both the outcome y and the feature x.
**Predictive Models and Loss Function** We consider constrained linear predictive models where f_{θ′}(x) = ⟨θ′, x⟩, and the set of parameters is the d-simplex, Θ = {θ : Σ_{k=1}^d θ_k = 1, θ_k ≥ 0}. We use the mean squared error to measure a predictive model θ′ ∈ Θ on a distribution map with a deployed model profile θ⃗ ∈ Θ^n:

$$ \ell(\vec{\theta}, \theta') = \mathbb{E}_{(x,y)\sim\mathcal{D}(\vec{\theta})}\big[(y - f_{\theta'}(x))^2\big] = \mathbb{E}_{(x,y)\sim\mathcal{D}(\vec{\theta})}\big[(y - \theta'\cdot x)^2\big]. $$
Given a deployed model profile θ⃗ and a predictive model θ′, the gradient of the decoupled loss is g(θ⃗, θ′) := ∇_{θ′} ℓ(θ⃗, θ′), and if D is a location-scale distribution map, g(θ⃗, θ′) = E_{(x,y)∼D(θ⃗)}[2(θ′·x − y)x]. Furthermore, with A and b defined in definition 2.2, the gradient can be written as

$$ g(\vec{\theta}, \theta') = \nabla_{\theta'}\,\ell(\vec{\theta}, \theta') = 2A\Big(\theta' + \sum_i \lambda_i\theta^i\Big) - 2b. \quad (1) $$
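To make eq. (1) concrete, the following minimal Python sketch evaluates this closed-form gradient; the function name and array layout are our own choices, not from the paper's code.

```python
import numpy as np

def decoupled_gradient(A, b, lambdas, thetas, theta_prime):
    """Closed-form gradient of the decoupled loss from eq. (1):
    g(theta_vec, theta') = 2A(theta' + sum_i lambda_i theta^i) - 2b.
    A: (d, d) feature covariance; b: (d,); lambdas: (n,) influence
    parameters; thetas: (n, d) deployed profile; theta_prime: (d,)."""
    aggregate = theta_prime + np.sum(lambdas[:, None] * thetas, axis=0)
    return 2 * A @ aggregate - 2 * b
```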
Additionally, given a deployed model profile θ⃗ and a predictive model profile θ⃗′, we define the gradient of agent i's decoupled loss as g^i(θ⃗, θ⃗′) := g(θ⃗, (θ′)^i), and g^i(θ⃗) := g^i(θ⃗, θ⃗) when the deployed model profile is identical to the predictive model profile. We denote the gradient of agent i's average loss as ḡ^i(θ⃗) := Σ_l θ^i_l g^i_l(θ⃗) for all θ⃗ ∈ Θ^n. Finally, we define ξ⃗(θ⃗) = (ξ^1, ..., ξ^n) with ξ^i ∈ R^d so that for all i ∈ [n] and k ∈ [d], ξ^i_k(θ⃗) := θ^i_k (Σ_l θ^i_l g^i_l − g^i_k). For brevity, we omit θ⃗ and write g⃗ := g⃗(θ⃗) and ξ⃗ := ξ⃗(θ⃗) when there is no ambiguity.
2.2 LEARNING DYNAMICS
We consider agents who myopically use a reinforcement learning algorithm, exponentiated gradient for linear regression, to improve their predictive models in rounds. We first define the original single-agent exponentiated gradient for linear regression on a sequence of data points here, and state our dynamics on multi-agent performative prediction in section 4.
**Definition 2.3** (Kivinen & Warmuth (1997)). Given a learning rate η > 0, an initial parameter θ_0 ∈ Θ, and a sequence of data points (x_t, y_t)_{t≥1}, the exponentiated gradient descent for linear regression iteratively updates the model as follows: at round t + 1, with previous parameter θ_t ∈ Θ, the algorithm updates it to θ_{t+1} so that

$$ \theta_{t+1,k} = \frac{\theta_{t,k}\,\exp(-2\eta(\theta_t\cdot x_t - y_t)\,x_{t,k})}{\sum_l \theta_{t,l}\,\exp(-2\eta(\theta_t\cdot x_t - y_t)\,x_{t,l})} \quad \text{for all } k \in [d]. $$

Note that each exponent 2(θ_t · x_t − y_t)x_{t,k} is the k-th coordinate of the gradient of the squared error ℓ((x_t, y_t), θ_t) at θ_t, which yields the name exponentiated gradient descent.
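As an illustration, here is a minimal sketch of one update of definition 2.3 in Python; the helper name is ours and the data are assumed to be given as numpy arrays.

```python
import numpy as np

def eg_step(theta, x, y, eta):
    """One exponentiated-gradient step for linear regression
    (definition 2.3): multiply each coordinate by the exponentiated
    negative gradient of the squared error and renormalize."""
    grad = 2 * (theta @ x - y) * x      # gradient of (theta . x - y)^2
    w = theta * np.exp(-eta * grad)     # multiplicative update
    return w / w.sum()                  # renormalize onto the simplex
```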
2.3 DYNAMICAL SYSTEM
**Definition 2.4.** Let (X, f) be a dynamical system where X is a metric space and f is a mapping on X. We say (x, x′) ∈ X × X is a Li-Yorke pair if

$$ \liminf_t \mathrm{dist}(f^t(x), f^t(x')) = 0 \quad \text{and} \quad \limsup_t \mathrm{dist}(f^t(x), f^t(x')) > 0. $$

(X, f) is Li-Yorke chaotic if there is an uncountable set S ⊂ X (a scrambled set) such that every pair of points x ≠ x′ ∈ S is a Li-Yorke pair.
3 MULTI-AGENT PERFORMATIVE LEARNING: STABILITY AND OPTIMALITY
Having introduced multi-agent performative prediction, we show some basic properties of our location-scale distribution map with mean squared error as a warm-up. Using the first order condition and a potential function argument, we show 1) the existence of a performative stable point in our model (proposition 3.1) and 2) that performative stability and performative optimality are equivalent (proposition 3.2). The proofs of both propositions are in appendix A.

We first show the existence and uniqueness of the performative stable point through a potential function argument. The first part is due to the first order condition of convex optimization. The second part also uses the first order condition, together with ∂Φ(θ⃗)/∂θ^i_k = λ_i g^i_k(θ⃗) for all i ∈ [n] and k ∈ [d].
**Proposition 3.1** (existence). Given the family of constrained linear predictive models Θ, mean squared error ℓ, and a multi-agent location-scale distribution map with parameters n, d, λ, A, b in definition 2.2, θ⃗* on (Θ, ℓ, D) is performative stable if and only if the gradient (defined in eq. (1)) satisfies g^i_k(θ⃗*) ≤ g^i_l(θ⃗*) for all i ∈ [n], l ∈ [d], and k ∈ [d] with (θ*)^i_k > 0.

Furthermore, if A is positive definite, there exists a unique performative stable point θ⃗*, and θ⃗* is also the global minimizer over θ⃗ ∈ Θ^n of the following convex function:

$$ \Phi(\vec{\theta}) := \Big(\sum_i \lambda_i\theta^i\Big)^{\!\top} A \Big(\sum_i \lambda_i\theta^i\Big) + \sum_i \lambda_i (\theta^i)^{\top} A\,\theta^i - 2\,b^{\top}\sum_i \lambda_i\theta^i. \quad (2) $$
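For intuition, the potential function of eq. (2) can be evaluated directly; the sketch below (with hypothetical names, reusing the array layout from the earlier sketches) is one way to do so.

```python
import numpy as np

def potential(A, b, lambdas, thetas):
    """The convex potential Phi of eq. (2):
    (sum_i lam_i th^i)^T A (sum_i lam_i th^i)
    + sum_i lam_i (th^i)^T A th^i - 2 b^T sum_i lam_i th^i."""
    s = np.sum(lambdas[:, None] * thetas, axis=0)   # sum_i lambda_i theta^i
    return s @ A @ s + sum(l * (th @ A @ th)
                           for l, th in zip(lambdas, thetas)) - 2 * b @ s
```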
While a performative stable model profile does not necessarily minimize the loss, the following proposition shows that optimality and stability are equivalent in our setting.

**Proposition 3.2.** Given (Θ, ℓ, D) defined in proposition 3.1, θ⃗* is performative stable if and only if θ⃗* is performatively optimal as defined in definition 2.1.
4 MULTI-AGENT PERFORMATIVE LEARNING: CONVERGENCE AND CHAOS
In section 3, we studied the stability and optimality of the location-scale distribution map with mean squared error. Now we ask: when each agent myopically improves his predictive model through reinforcement learning while their predictions are performative, what is the long-term behavior of the system? Does the system converge to the performative stable and optimal point, or does it behave chaotically? We provide a threshold result depending on the learning rates and the collective influence L_n. Section 4.1 shows the dynamics converge to the performative stable and optimal point when the learning rates are small enough. On the other hand, section 4.2 shows the dynamics can exhibit chaotic behavior if the collective influence L_n is large enough.
Now, we define our dynamics (θ⃗_t)_{t≥0} formally. At each round, each agent accesses the data distribution influenced by their previous models, and updates his model through exponentiated gradient descent. Specifically, given an initial model profile θ⃗_0 ∈ Θ^n and a learning rate profile η := (η_1, ..., η_n), each agent i applies the exponentiated gradient descent with initial parameter θ^i_0 and learning rate η_i > 0: at round t + 1 > 1, each agent i uses D(θ⃗_t) to estimate the gradient of the (expected) loss, g^i(θ⃗_t) defined in eq. (1), and updates his model according to definition 2.3,

$$ \theta^i_{t+1,k} = \frac{\theta^i_{t,k}\,\exp(-\eta_i g^i_k(\vec{\theta}_t))}{\sum_{l=1}^d \theta^i_{t,l}\,\exp(-\eta_i g^i_l(\vec{\theta}_t))} \quad \text{for all } t \ge 0,\ i \in [n],\ k \in [d]. \quad (3) $$
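A minimal simulation of eq. (3) might look as follows, reusing the hypothetical decoupled_gradient helper sketched after eq. (1); this is an illustrative sketch of the update rule, not the authors' code.

```python
import numpy as np

def run_dynamics(A, b, lambdas, thetas0, etas, T):
    """Iterate the multi-agent exponentiated-gradient dynamics of
    eq. (3) for T rounds. thetas0: (n, d), each row on the simplex."""
    thetas = thetas0.copy()
    for _ in range(T):
        # g^i(theta_vec_t): each agent's gradient at the deployed profile.
        grads = np.stack([decoupled_gradient(A, b, lambdas, thetas, th)
                          for th in thetas])
        w = thetas * np.exp(-etas[:, None] * grads)
        thetas = w / w.sum(axis=1, keepdims=True)
    return thetas
```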
We will use superscripts to denote agents, i ∈ [n], and subscripts for time, t ≥ 0, and feature indices, k, l ∈ [d]. Recall that the gradient of agent i's average loss is ḡ^i(θ⃗) := Σ_l θ^i_l g^i_l(θ⃗), so eq. (3) can be rewritten as

$$ \frac{\theta^i_k\,\exp(-\eta_i g^i_k(\vec{\theta}_t))}{\sum_l \theta^i_l\,\exp(-\eta_i g^i_l(\vec{\theta}_t))} = \frac{\theta^i_k\,\exp(\eta_i(\bar{g}^i(\vec{\theta}_t) - g^i_k(\vec{\theta}_t)))}{\sum_l \theta^i_l\,\exp(\eta_i(\bar{g}^i(\vec{\theta}_t) - g^i_l(\vec{\theta}_t)))}. $$

Finally, given θ⃗* and i ∈ [n], we define the support of θ⃗* as S_i := {k : (θ*)^i_k > 0} ⊆ [d], and S̄_i := [d] \ S_i. Then θ⃗* is an equilibrium (or fixed point) of eq. (3) if

$$ g^i_k(\vec{\theta}^*) = \bar{g}^i(\vec{\theta}^*), \quad \text{for all } i \in [n],\ k \in S_i. \quad (4) $$

We say a fixed point of eq. (3) is isolated if there exists an open set around it containing no other fixed point. Additionally, the performative stability condition in proposition 3.1 is equivalent to

$$ g^i_k(\vec{\theta}^*) = \bar{g}^i(\vec{\theta}^*) \ \text{ and } \ g^i_l(\vec{\theta}^*) \ge \bar{g}^i(\vec{\theta}^*), \quad \text{for all } i \in [n],\ k \in S_i,\ l \in \bar{S}_i. \quad (5) $$

Therefore, the set of fixed points of eq. (3) contains the performative stable point.
4.1 CONVERGING WITH SMALL LEARNING RATE
In this section, we show that when the learning rate of each agent is small enough, the dynamics in eq. (3) converge to the performative stable point. Specifically, if the parameters of the multi-agent performative learning (n, d, A, λ) are fixed, the dynamics in eq. (3) converge to the performative stable point when η is "small" enough.
By eq. (5), θ⃗* is performative stable if the gradients outside the support are no less than the average gradient. We call a performative stable point θ⃗* proper if the gradient of every non-support coordinate is strictly greater than the average gradient: for all i ∈ [n] and l ∈ S̄_i, g^i_l(θ⃗*) > ḡ^i(θ⃗*). The theorem below shows that if the performative stable point is proper and the equilibria satisfying eq. (4) are isolated, eq. (3) converges when max_i η_i is small enough and the ratio η_i/η_j is bounded for all i and j.
**Theorem 4.1** (Convergence). Consider a constant R_η > 0, and a multi-agent performative learning setting with parameters n, d, A, b, λ such that A is positive definite, the performative stable point θ⃗* is proper, and the equilibria (defined in eq. (4)) are isolated. There exists η* > 0 so that the dynamic in eq. (3) with learning rate profile η converges to the performative stable point, lim_{t→∞} θ⃗_t = θ⃗*, if the initial state is an interior point and η satisfies max_i η_i ≤ η* and max_i η_i / min_i η_i ≤ R_η.
Informally, when the learning rates η are small, θ⃗_t in eq. (3) can be approximated by a solution of an ordinary differential equation, ϑ⃗(t∥η∥_1), where the initial condition is ϑ⃗(0) = θ⃗_0 and

$$ \frac{d}{dt}\vartheta^i_k(t) = \frac{\eta_i}{\|\eta\|_1}\,\vartheta^i_k(t)\Big(\bar{g}^i(\vec{\vartheta}(t)) - g^i_k(\vec{\vartheta}(t))\Big) \quad (6) $$

for all t ≥ 0, i ∈ [n], and k ∈ [d]. We formalize this intuition in the proof of lemma 4.4. Note that the set of fixed points of eq. (6) is identical to that of eq. (3) and satisfies eq. (4).
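For readers who want to compare eq. (3) against its continuous-time limit, a simple forward-Euler integration of eq. (6) is sketched below, under the same assumed helpers as before; note that the replicator-style vector field is tangent to the simplex, so each row stays on it up to numerical error.

```python
import numpy as np

def integrate_ode(A, b, lambdas, thetas0, etas, T, dt=1e-3):
    """Forward-Euler integration of the ODE in eq. (6):
    d/dt theta^i_k = (eta_i/||eta||_1) theta^i_k (gbar^i - g^i_k)."""
    thetas = thetas0.copy()
    weights = etas / etas.sum()                  # eta_i / ||eta||_1
    for _ in range(int(T / dt)):
        grads = np.stack([decoupled_gradient(A, b, lambdas, thetas, th)
                          for th in thetas])
        gbar = np.sum(thetas * grads, axis=1, keepdims=True)  # gbar^i
        thetas = thetas + dt * weights[:, None] * thetas * (gbar - grads)
    return thetas
```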
The proof of theorem 4.1 has two parts. First we show that the continuous approximation eq. (6) converges to the performative stable point. Then we study the relationship between eq. (3) and eq. (6), and prove that eq. (3) also converges to a performative stable point.

**From Potential Function to Convergence of Equation (6)** In this section, theorem 4.2 shows the dynamics of eq. (6) converge to the performative stable point, which will be useful to show the convergence of eq. (3).
**Theorem 4.2.** If all points satisfying eq. (4) are isolated, A is positive definite, and ϑ⃗(0) is in the interior of Θ^n, then the limit lim_{t→∞} ϑ⃗(t) = θ⃗* is the performative stable point.

Theorem 4.2 can be seen as the continuous version of theorem 4.1. Compared to theorem 4.1, theorem 4.2 does not require that the performative stable point be proper, but it also implicitly requires that the ratio of learning rates between any two agents is bounded.
To prove theorem 4.2, we prove lemma 4.3, which shows that Φ in eq. (2) is a potential function for eq. (6): its time derivative is negative at all non-fixed points. Then, in the proof of theorem 4.2, we further prove that the dynamics in eq. (6) converge to performative stable points when the initial condition ϑ⃗(0) is an interior point. The proof is similar to a proof in Kleinberg et al. (2009), and is presented in appendix B.
**Lemma 4.3.** Given a solution of eq. (6), the time derivative of Φ in eq. (2) is 0 at fixed points of eq. (6), and negative at all other points. Furthermore,

$$ \frac{d}{dt}\Phi(\vec{\vartheta}(t)) \;\le\; \frac{-1}{2\sum_i\eta_i\sum_i\lambda_i\eta_i}\Bigg(\sum_{i,k}\lambda_i\eta_i\,\vartheta^i_k\Big|\sum_l \vartheta^i_l g^i_l - g^i_k\Big|\Bigg)^{\!2} \;\le\; \frac{-\min_i\lambda_i^2\eta_i^2}{2\sum_i\eta_i\sum_i\lambda_i\eta_i}\,\big\|\vec{\xi}(\vec{\vartheta})\big\|_1^2. $$
**From Approximation to Convergence of Equation (3)** Given the convergence of eq. (6), we show the dynamics of eq. (3) also converge to a performative stable point. The argument has two stages. We first show that, given any neighborhood D of the performative stable point, eq. (3) hits D and stays in D after a finite number of steps (lemma 4.4). Then lemma 4.5 shows that eq. (3) converges to a performative stable point, which completes the proof of theorem 4.1.
**Lemma 4.4.** Given any open set D ⊆ Θ^n that contains the performative stable point, there exists η* small enough so that for all θ⃗_0 ∈ Θ^n, there exists τ so that θ⃗_t ∈ D for all t ≥ τ.
The proof is based on standard approximation results for numerical solutions of ordinary differential equations, and is presented in appendix C.
**Lemma 4.5.** Let θ⃗* be the performative stable point and D a small enough neighborhood of θ⃗*. Besides the conditions in theorem 4.1, if the initial condition of eq. (3), θ⃗_0, is in D and θ⃗_t ∈ D for all t ≥ 0, then the limit of eq. (3) is the performative stable point, lim_{t→∞} θ⃗_t = θ⃗*.
The proof uses the fact that the right-hand side of eq. (3) decreases as the dynamic converges to the
fixed point. Therefore, even though the learning rate profile is fixed, the error between eq. (3) and
eq. (6) vanishes as they converge to the fixed point. We present the proof in appendix D.
4.2 CHAOS WITH CONSTANT LEARNING RATE
Now we ask the converse of the question in section 4.1. Do the dynamics in eq. (3) still converge, if the learning rate is fixed, as the number of agents increases? Alternatively, do the dynamics converge when the agents have an overwhelming influence on the data distribution?

Theorem 4.6 shows that the dynamics are Li-Yorke chaotic when L_n = Σ_i λ_i is large, even with d = 2. Note that a large L_n may be due to a fixed number of agents with overwhelming influence, or to a large number of agents. The latter case implies that for any small η, no matter how cautious the agents are, chaos inevitably occurs as the number of agents increases.
**Theorem 4.6.** Given a multi-agent location-scale distribution map with A, b, d = 2, and a common learning rate η > 0, if all n agents use the exponentiated gradient in eq. (3) with the common learning rate η and A is diagonally dominant, there exists a carrying capacity L* such that if L_n = Σ_{i=1}^n λ_i ≥ L*, the dynamics (θ⃗_t)_{t∈N} are Li-Yorke chaotic and thus have periodic orbits of all possible periods.
To prove theorem 4.6, we show there exists an uncountable scrambled set as defined in definition 2.4. First we consider all agents starting from the same initial model, and show the dynamics of eq. (3) can be captured by the following function,

$$ f_{u,v}(x) = \frac{x}{x + (1-x)\exp(u(x - v))} \quad \forall x \in [0,1] \quad (7) $$

with proper choices of u ∈ R and v ∈ R. Note that v is an equilibrium, and u controls the steepness.
First, when all agents start from the same initial model, θ^i_0 = θ_0 for all i ∈ [n], and use the same learning rate η, their models are identical for all t ≥ 0: for all t there exists θ_t so that θ^i_t = θ_t for all i. Additionally, since d = 2, we can use a single parameter to encode a constrained linear predictive model. Given θ ∈ Θ = ∆_2, we define p(θ) = θ_1 ∈ [0, 1] and omit the input θ when it is clear. Thus, we can rewrite the dynamics as p_t = p(θ_t), which are one-dimensional dynamics on [0, 1], with

$$ p_{t+1} = \frac{p_t}{p_t + (1-p_t)\,e^{\eta(g^i_1(\vec{\theta}_t) - g^i_2(\vec{\theta}_t))}}. $$

We set

$$ \alpha(L) = 2\eta(1+L)(A_{1,1} - A_{1,2} - A_{2,1} + A_{2,2}) \ \text{ and } \ \beta(L) = \frac{(1+L)(A_{2,2} - A_{2,1}) + (b_1 - b_2)}{(1+L)(A_{1,1} - A_{1,2} - A_{2,1} + A_{2,2})}, $$
and write α_n = α(L_n) and β_n = β(L_n). By direct computation, the dynamic of p_t is

$$ p_{t+1} = f_{\alpha_n,\beta_n}(p_t) \quad (8) $$

where α_n encodes the update pace and β_n is an equilibrium of the dynamic. We further set β_∞ := (A_{2,2} − A_{2,1}) / (A_{1,1} − A_{1,2} − A_{2,1} + A_{2,2}) and δ(L) := β(L) − β_∞, which converges to zero as L → ∞. If A is diagonally dominant, A_{1,1} − A_{1,2} and A_{2,2} − A_{2,1} are positive, and we can permute the coordinates so that β_∞ ∈ (0, 1/2]. Additionally, α(L) is positive and increasing for all L ≥ 0. With Li et al. (1982), the lemma below implies the existence of period-three points; the proof is in appendix E.

**Lemma 4.7.** If β_∞ ∈ (0, 1/2), there exists a constant L* > 0 so that if L ≥ L*, there exists a trajectory x_0, x_1, x_2, and x_3 with f_{α(L),β(L)}(x_i) = x_{i+1} such that x_3 < x_0 < x_1.
*Proof of theorem 4.6.* First, by lemma 4.7, for all L ≥ L*, there exists a trajectory x_0, x_1, x_2, x_3 so that x_3 < x_0 < x_1. The existence of such a trajectory implies the existence of period-three points by Li et al. (1982). Finally, by the seminal work of Li & Yorke (1975), this implies that the map is Li-Yorke chaotic.
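The witness trajectory in lemma 4.7 can also be located numerically. The sketch below iterates the map f_{u,v} of eq. (7) and scans a grid of initial points for the pattern x_3 < x_0 < x_1; it is a heuristic check with names of our choosing, not part of the formal proof.

```python
import numpy as np

def f(x, u, v):
    """The one-dimensional map f_{u,v} from eq. (7)."""
    return x / (x + (1 - x) * np.exp(u * (x - v)))

def find_period3_witness(u, v, grid=10000):
    """Scan [0, 1] for a point x0 with f^3(x0) < x0 < f(x0), the
    trajectory pattern that lemma 4.7 uses (via Li et al. (1982))
    to establish period-three points and hence Li-Yorke chaos."""
    for x0 in np.linspace(1e-4, 1 - 1e-4, grid):
        x1 = f(x0, u, v)
        x3 = f(f(x1, u, v), u, v)
        if x3 < x0 < x1:
            return x0
    return None
```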
5 SIMULATION
[Figure 1 panels: (a) (14, 0.001) without noise; (b) (1.4, 0.05) without noise; (c) (14, 0.05) without noise; (d) (14, 0.001) with noise; (e) (1.4, 0.05) with noise; (f) (14, 0.05) with noise.]

Figure 1: The temporal behavior of p_t starting at p_0 = 0.2 for 100 rounds under various L and η, and with noisy estimation of the gradient of the mean squared error from m samples. The top row (figs. 1a to 1c) presents different combinations of (L, η), and the bottom row (figs. 1d to 1f) considers the gradient estimated from m = 10 and m = 100 samples.
Now we simulate the one-dimensional dynamics in eq. (8) of one agent with different learning rates η and collective influences L.

We first define a location-scale distribution map. Let the features x_1, x_2 and the noise x_0 be mutually independent zero-mean Gaussians with variances E[x_1²] = 3, E[x_2²] = 7, and E[x_0²] = 1. Finally, θ^0 = 0. Therefore,

$$ A = \mathbb{E}[xx^\top] = \begin{pmatrix} 3 & 0 \\ 0 & 7 \end{pmatrix}, \quad \text{and} \quad b = A\theta^0 + \mathbb{E}[x_0x] = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. $$
Given learning rate η > 0, and collective influence L, we have α(L) = 20η(1 + L), and the
performative stable point β(L) = 0.7.
The top row of fig. 1 shows the temporal behavior of p_t under different (L, η). The bottom row of fig. 1 demonstrates that our (convergence and chaos) results are robust even when the value of the gradient is noisy. Specifically, at each round t, instead of g^i_1(θ⃗_t) − g^i_2(θ⃗_t) = 2(1 + L)(1, −1)Aθ_t − 2(1, −1)b, the agent replaces A and b with empirical moments computed from m samples. This process can be seen as a stochastic exponentiated gradient descent which uses a stochastic estimate of the gradient.
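To illustrate the stochastic variant, here is a minimal sketch of one noisy update of the one-dimensional dynamics, using the simulation's parameters (A = diag(3, 7), b = 0, θ^0 = 0); the sampling code is our reconstruction, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_step(p, L, eta, m):
    """One stochastic exponentiated-gradient step of eq. (8): A and b
    are replaced by empirical moments estimated from m fresh samples."""
    x = rng.normal(0.0, np.sqrt([3.0, 7.0]), size=(m, 2))  # features
    x0 = rng.normal(0.0, 1.0, size=m)                      # outcome noise
    A_hat = x.T @ x / m                                    # estimate of A
    b_hat = (x0[:, None] * x).mean(axis=0)                 # estimate of b
    theta = np.array([p, 1.0 - p])
    diff = (2 * (1 + L) * np.array([1.0, -1.0]) @ A_hat @ theta
            - 2 * np.array([1.0, -1.0]) @ b_hat)           # g1 - g2 estimate
    return p / (p + (1 - p) * np.exp(eta * diff))
```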
We can see that chaotic behavior occurs in fig. 1c with L = 14 and η = 0.05, and that such behavior persists in fig. 1f when the evaluations of the gradient are noisy. On the other hand, when the learning rate is small enough (figs. 1a and 1b), the dynamics converge to the performative stable point 0.7. Furthermore, in figs. 1d and 1e the dynamics converge even with noisy estimates of the gradient.
Finally, fig. 2 shows the temporal behavior of the loss. Under the same setting as fig. 1c, fig. 2b shows that not only the temporal behavior of p_t but also the mean squared loss is chaotic. On the other hand, under the same setting as fig. 1a, fig. 2a shows convergent behavior of the loss.
[Figure 2 panels: (a) (14, 0.001) without noise; (b) (14, 0.05) without noise.]

Figure 2: The temporal behavior of the total cost for 100 rounds under various η.
6 CONCLUSION
We introduce a framework of multi-agent performative prediction and investigate whether classical reinforcement learning algorithms converge or behave chaotically depending on the collective influence of the agents' models and their learning rates. We view our example as only scratching the surface of work on multi-agent performative prediction.
Our framework leads to several new theoretical problems. In particular, it would be interesting to understand whether this threshold is generic and whether our results still hold for other reinforcement learning algorithms or more general multi-agent distribution maps. Our framework also provides a new viewpoint on several applications. One natural application is strategic classification (Hardt et al., 2016). In this context, our framework can be seen as modeling strategic learners in a strategic classification problem where both data and classifiers are strategic. However, in conventional strategic classification the features also respond to the deployed models, which is not captured by our multi-agent location-scale distribution map. It would be interesting to investigate our dynamics in the context of strategic prediction. Another application is pricing strategy/prediction, where multiple companies
predict the demand function and set their prices.
Both our digital and real-world environments are increasingly under the influence of ever more powerful and numerous ML systems. As these algorithms trigger actions that change the state of
the system that produces their joint input data (e.g., AI-triggered ads shown to users affecting their
behavior, or automated trading systems affecting stock prices, etc), they are effectively forming
closed loop systems and can no longer be understood fully in isolation. As we show in this paper,
even fairly simple and innocuous such settings can exhibit phase transitions going from optimal
behavior to chaos. We believe such phenomena are worthy of a careful investigation and we hope
that this paper sets out some useful building blocks by bringing together optimization/performance
analysis with Lyapunov theory and chaos theory.
ETHICS STATEMENT
Our work showcases the possibility of competing prediction algorithms leading to instability and
chaos. Extra steps should be taken, whenever possible, so that related systems operate within the
range of parameters leading to stability. We hope that our work helps both by identifying a potential
threat for ML systems as well as providing some guidance on how to counter or minimize it.
REPRODUCIBILITY
All of our theoretical results are stated fully, including previous definitions and results on which they
are based. They are accompanied by proofs, either in the main paper or in the appendix. The settings of our simulations are stated fully in the main paper.
REFERENCES
Gabriel P Andrade, Rafael Frongillo, and Georgios Piliouras. Learning in matrix games can be
arbitrarily complex. In Conference on Learning Theory (COLT), 2021.
James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM
_Confernce on Economics and Computation, pp. 321–338, 2018._
David Balduzzi, Wojciech M Czarnecki, Thomas W Anthony, Ian M Gemp, Edward Hughes, Joel Z
Leibo, Georgios Piliouras, and Thore Graepel. Smooth markets: A basic mechanism for organizing gradient-based learners. 2020.
Peter L Bartlett. Learning with a slowly changing distribution. In Proceedings of the fifth annual
_workshop on Computational learning theory, pp. 243–252, 1992._
Jakub Bielawski, Thiparat Chotibut, Fryderyk Falniowski, Grzegorz Kosiorowski, Michał Misiurewicz, and Georgios Piliouras. Follow-the-regularized-leader routes to chaos in routing games.
_ICML, 2021._
Yun Kuen Cheung. Multiplicative weights updates with constant step-size in graphical constant-sum
games. In NeurIPS 2018, pp. 3532–3542, 2018.
Yun Kuen Cheung and Georgios Piliouras. Chaos, extremism and optimism: Volume analysis of
learning in games. In NeurIPS 2020, 2020.
Yun Kuen Cheung and Yixin Tao. Chaos of learning beyond zero-sum and coordination via game
decompositions. In International Conference on Learning Representations, 2020.
Yun Kuen Cheung, Stefanos Leonardos, and Georgios Piliouras. Learning in markets: Greed leads
to chaos but following the price is right. In Zhi-Hua Zhou (ed.), Proceedings of the Thirtieth
_International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal,_
_Canada, 19-27 August 2021, pp. 111–117. ijcai.org, 2021. doi: 10.24963/ijcai.2021/16._
Thiparat Chotibut, Fryderyk Falniowski, Michał Misiurewicz, and Georgios Piliouras. The route to
chaos in routing games: When is price of anarchy too optimistic? NeurIPS, 2020.
Johanne Cohen, Amélie Héliou, and Panayotis Mertikopoulos. Learning with bandit feedback in
potential games. In Proceedings of the 31st International Conference on Neural Information
_Processing Systems, pp. 6372–6381, 2017._
Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions,
2020.
Lampros Flokas, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Thanasis Lianeas, Panayotis Mertikopoulos, and Georgios Piliouras. No-regret learning and mixed nash equilibria: They do not
mix. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
Angeliki Giannou, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Panayotis Mertikopoulos.
Survival of the strictest: Stable and unstable equilibria under regularized learning with partial
information. COLT, 2021.
Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science,
pp. 111–122, 2016.
Zachary Izzo, Lexing Ying, and James Zou. How to learn when data reacts to your model: performative gradient descent. arXiv preprint arXiv:2102.07698, 2021.
Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear
predictors. information and computation, 132(1):1–63, 1997.
Robert Kleinberg, Georgios Piliouras, and Éva Tardos. Multiplicative updates outperform generic
no-regret learning in congestion games. In Proceedings of the forty-first annual ACM symposium
_on Theory of computing, pp. 533–542, 2009._
Anthony Kuh, Thomas Petsche, and Ronald L Rivest. Learning time-varying concepts. In NIPS, pp.
183–189, 1990.
Stefanos Leonardos, Barnabé Monnot, Daniël Reijsbergen, Stratis Skoulakis, and Georgios Piliouras. Dynamical analysis of the eip-1559 ethereum fee market, 2021.
Alistair Letcher. On the impossibility of global convergence in multi-loss optimization. In Interna_tional Conference on Learning Representations, 2021._
Tien-Yien Li and James A. Yorke. Period three implies chaos. _The American Mathematical_
_[Monthly, 82(10):985–992, 1975. ISSN 00029890, 19300972. URL http://www.jstor.](http://www.jstor.org/stable/2318254)_
[org/stable/2318254.](http://www.jstor.org/stable/2318254)
Tien-Yien Li, Michał Misiurewicz, Giulio Pianigiani, and James A. Yorke. Odd chaos.
_Physics Letters A, 87(6):271–273, 1982._ ISSN 0375-9601. doi: https://doi.org/10.
1016/0375-9601(82)90692-2. [URL https://www.sciencedirect.com/science/](https://www.sciencedirect.com/science/article/pii/0375960182906922)
[article/pii/0375960182906922.](https://www.sciencedirect.com/science/article/pii/0375960182906922)
Celestine Mendler-Dünner, Juan C Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. arXiv preprint arXiv:2006.06887, 2020.
John Miller, Juan C Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. arXiv preprint arXiv:2102.08570, 2021.
Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. CoRR,
[abs/1703.01138, 2017. URL http://arxiv.org/abs/1703.01138.](http://arxiv.org/abs/1703.01138)
Ioannis Panageas, Georgios Piliouras, and Xiao Wang. Multiplicative weights updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always. In International Conference on Machine Learning, pp. 4961–4969. PMLR, 2019.
Juan Perdomo, Tijana Zrnic, Celestine Mendler-D¨unner, and Moritz Hardt. Performative prediction.
In International Conference on Machine Learning, pp. 7599–7609. PMLR, 2020.
Nelson Vadori, Rahul Savani, Thomas Spooner, and Sumitra Ganesh. Consensus multiplicative weights update: Learning to learn using projector-based game signatures. arXiv preprint
_arXiv:2106.02615, 2021._
A PROOFS AND DETAILS IN SECTION 3
*Proof of proposition 3.1.* First, under the mean squared loss function, each agent i's decoupled performative loss ℓ(θ⃗′, θ^i) is convex in θ^i. Thus, for linear predictive models, we can apply the KKT conditions to definition 2.1, so that a collection of predictive models θ⃗ ∈ Θ^n is performatively stable if and only if for all i ∈ [n] and k ∈ [d] with θ^i_k > 0, ∂ℓ(θ⃗′, θ^i)/∂θ^i_k ≤ ∂ℓ(θ⃗′, θ^i)/∂θ^i_l for all l ∈ [d], where θ⃗′ = θ⃗. Therefore, with eq. (1), θ⃗* is performative stable if

$$ g^i_k(\vec{\theta}^*) \le g^i_l(\vec{\theta}^*) \quad (9) $$

holds for all i ∈ [n], l ∈ [d], and k ∈ [d] with θ^i_k > 0.
With eq. (9), we now show that there exists a unique performative stable point θ⃗* by proving that 1) Φ in eq. (2) is strictly convex, and 2) θ⃗* is a minimizer of Φ. First, to show Φ is strictly convex, it is sufficient to show the Hessian of Φ is positive definite. Because Φ is a quadratic function on Θ^n, the Hessian of Φ is a constant matrix in R^{nd×nd}. By the partial derivatives of eq. (2), for all i, j ∈ [n] and k, l ∈ [d], we have (∇²Φ)_{ik,jl} = ∂²Φ/∂θ^i_k∂θ^j_l = 2λ_iλ_j A_{k,l} if i ≠ j, and (∇²Φ)_{ik,il} = 2λ_i(λ_i + 1)A_{k,l}. Let L ∈ R^{n×n} with L_{ij} = 2λ_iλ_j if i ≠ j and L_{ii} = 2λ_i(1 + λ_i), which is positive definite because λ_i > 0 for all i ∈ [n]. Then the Hessian of Φ is the Kronecker product of L and A,

$$ \nabla^2\Phi = \begin{pmatrix} 2\lambda_1(1+\lambda_1)A & \dots & 2\lambda_1\lambda_n A \\ \vdots & \ddots & \vdots \\ 2\lambda_n\lambda_1 A & \dots & 2\lambda_n(1+\lambda_n)A \end{pmatrix} = L \otimes A. $$

Because L and A are both positive definite, the Hessian is also positive definite. Therefore, Φ is strictly convex, and there exists a unique minimum of Φ in the compact set Θ^n. Now we show that θ⃗* is performative stable if and only if θ⃗* is a minimizer of Φ. By the first order condition and the partial derivatives of eq. (2), Φ attains its minimum at θ⃗* if and only if g^i_k(θ⃗*) ≤ g^i_l(θ⃗*) for all i ∈ [n] and k, l ∈ [d] with (θ*)^i_k > 0. Therefore, θ⃗* is the minimum of Φ if and only if θ⃗* is a performative stable point.
*Proof of proposition 3.2.* Given a profile of models θ⃗, the total cost is

$$ \sum_i \ell(\vec{\theta}, \theta^i) = \sum_i \mathbb{E}_{(x,y)\sim\mathcal{D}(\vec{\theta})}\big[(y - \theta^i\cdot x)^2\big] = \sum_i \mathbb{E}\Bigg[\bigg(\Big\langle x,\ \theta^0 - \sum_j \lambda_j\theta^j - \theta^i\Big\rangle + x_0\bigg)^{\!2}\,\Bigg] $$

which is a convex function of θ⃗ ∈ Θ^n. Additionally,

$$ \frac{\partial}{\partial\theta^i_k}\sum_{\imath} \ell(\vec{\theta}, \theta^{\imath}) = (1 + \lambda_i)\,g^i_k(\vec{\theta}, \vec{\theta}) + \sum_{\imath\in[n]:\imath\ne i} \lambda_i\,g^i_k(\vec{\theta}, \vec{\theta}) = (1 + n\lambda_i)\,g^i_k(\vec{\theta}, \vec{\theta}) $$

which is the gradient of the decoupled performative loss scaled by (1 + nλ_i). Thus, we can apply the KKT conditions, and the minimum is attained if and only if eq. (9) holds.
B PROOF AND DETAILS FOR THEOREM 4.2
*Proof of lemma 4.3.* By the chain rule,

$$ \frac{d}{dt}\Phi(\vec{\vartheta}(t)) = \sum_{i\in[n],k\in[d]} \frac{\partial\Phi}{\partial\theta^i_k}(\vec{\vartheta}(t)) \cdot \frac{d}{dt}\vartheta^i_k(t), \quad (10) $$

so the time derivative of the potential function is zero if the dynamics are at a fixed point of eq. (6).
Now we compute a closed form of the time derivative and show the derivative is zero only if the dynamics are at a fixed point of eq. (6). Here we omit the input t to simplify the notation.

$$
\begin{aligned}
\frac{d}{dt}\Phi(\vec{\vartheta}(t))
&= \sum_{i,k} \lambda_i g^i_k \cdot \frac{\eta_i}{\|\eta\|_1}\,\vartheta^i_k\Big(\sum_l \vartheta^i_l g^i_l - g^i_k\Big) && \text{(by eqs. (6) and (10), and the partial derivatives of eq. (2))} \\
&= \frac{-1}{2\|\eta\|_1}\sum_i \lambda_i\eta_i \sum_{k,l\in[d]} \vartheta^i_k\vartheta^i_l\big(2(g^i_k)^2 - 2g^i_kg^i_l\big) && \text{(because } \textstyle\sum_l \vartheta^i_l = 1\text{)} \\
&= \frac{-1}{2\|\eta\|_1}\sum_i \lambda_i\eta_i \sum_{k,l\in[d]} \vartheta^i_k\vartheta^i_l\,(g^i_k - g^i_l)^2.
\end{aligned}
$$

Now we bound the time derivative. Because $\sum_k \vartheta^i_k = \sum_l \vartheta^i_l = 1$, we have $\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l = \sum_i\lambda_i\eta_i$, and

$$
\begin{aligned}
\Big(\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l\Big)\Big(\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l(g^i_k - g^i_l)^2\Big)
&\ge \Big(\sum_{i,k,l}\lambda_i\eta_i\vartheta^i_k\vartheta^i_l\,|g^i_k - g^i_l|\Big)^2 && \text{(by the Cauchy-Schwarz inequality)} \\
&\ge \Bigg(\sum_{i,k}\lambda_i\eta_i\,\vartheta^i_k\Big|\sum_l \vartheta^i_l(g^i_l - g^i_k)\Big|\Bigg)^2 && \text{(by the triangle inequality),}
\end{aligned}
$$

so that

$$ \frac{d}{dt}\Phi(\vec{\vartheta}(t)) \le \frac{-1}{2\|\eta\|_1\sum_i\lambda_i\eta_i}\Bigg(\sum_{i,k}\lambda_i\eta_i\,\vartheta^i_k\Big|\sum_l \vartheta^i_l g^i_l - g^i_k\Big|\Bigg)^2. $$

Therefore, dΦ(ϑ⃗(t))/dt = 0 only if ϑ^i_k (g^i_k − Σ_{l∈[d]} ϑ^i_l g^i_l) = 0 for all i ∈ [n] and k ∈ [d], which is a fixed point of eq. (6).

We can further simplify the bound: because Σ_{i,k} λ_iη_i ϑ^i_k |Σ_l ϑ^i_l g^i_l − g^i_k| ≥ (min_i λ_iη_i) Σ_{i,k} |ξ^i_k(ϑ⃗)| = (min_i λ_iη_i) ∥ξ⃗(ϑ⃗)∥_1, we get dΦ(ϑ⃗(t))/dt ≤ −min_i λ_i²η_i² / (2∥η∥_1 Σ_i λ_iη_i) · ∥ξ⃗(ϑ⃗)∥_1².
*Proof of theorem 4.2.* When the fixed points satisfying eq. (4) are isolated, by lemma 4.3, the dynamics in eq. (6) converge to a fixed point of eq. (6), θ⃗* ∈ Θ^n. Because θ⃗* is a fixed point, for all i ∈ [n], k ∈ [d],

$$ (\theta^*)^i_k\big(\bar{g}^i(\vec{\theta}^*) - g^i_k(\vec{\theta}^*)\big) = (\theta^*)^i_k\Big(\sum_l (\theta^*)^i_l\,g^i_l(\vec{\theta}^*) - g^i_k(\vec{\theta}^*)\Big) = 0. $$

By proposition 3.1, the fixed point condition implies the coordinates of the gradient of the loss in the support are identical, g^i_k(θ⃗*) = ḡ^i(θ⃗*) for all i and k ∈ S_i. Thus, if the fixed point θ⃗* is not performative stable, there exist ı and κ with (θ*)^ı_κ = 0 so that g^ı_κ(θ⃗*) < ḡ^ı(θ⃗*). Furthermore, exp(−η_ı g^ı_κ(θ⃗*)) > Σ_l (θ*)^ı_l exp(−η_ı g^ı_l(θ⃗*)). We can pick a small enough ϵ > 0 and define

$$ U_\epsilon := \Big\{\vec{\theta} : \exp(-\eta_\imath g^\imath_\kappa) > \sum_l \theta^\imath_l \exp(-\eta_\imath g^\imath_l) + \epsilon\Big\} \quad (11) $$

which contains θ⃗* and is an open set because g⃗ and the exponential function are continuous. Since ϑ⃗(t) converges to θ⃗* as t → ∞, there exists a time t_ϵ so that for all t ≥ t_ϵ, ϑ⃗(t) ∈ U_ϵ. However, if ϑ⃗(t) ∈ U_ϵ and ϑ⃗(t) is an interior point, we get (d/dt)ϑ^ı_κ(t) > 0 by eqs. (6) and (11). Therefore, ϑ^ı_κ(t) is positive and increasing for t ≥ t_ϵ. We have reached a contradiction because ϑ^ı_κ(t) → (θ*)^ı_κ = 0. Therefore, θ⃗* is a performative stable point.
C PROOF AND DETAILS FOR LEMMA 4.4
To prove lemma 4.4, we show that the dynamic eq. (3) can be approximated by eq. (6) and that the error vanishes as η* decreases. Thus, we can show the dynamic hits an arbitrary neighborhood of the performative stable point. We further use Φ to show the dynamic stays in the neighborhood. Below we state two ancillary claims to control the error of our approximation.
We define an error vector e⃗(θ⃗) ∈ R^{n×d} between eqs. (3) and (6) so that

$$ \eta_i^2\,e^i_k(\vec{\theta}) := \frac{\theta^i_k\,\exp(\eta_i(\bar{g}^i - g^i_k))}{\sum_l \theta^i_l\,\exp(\eta_i(\bar{g}^i - g^i_l))} - \theta^i_k - \eta_i\xi^i_k. $$
We omit the input θ⃗ of e⃗, and write e⃗_t := e⃗(θ⃗_t), when there is no ambiguity. We show that if the error vector is small, Φ is decreasing along the dynamics in eq. (3).
**Claim C.1** (Approximated potential). There exists C_4 so that for all t and θ⃗_0 ∈ Θ^n, the difference of the potential in eq. (2) along eq. (3) satisfies

$$ \Phi(\vec{\theta}_{t+1}) - \Phi(\vec{\theta}_t) \le \frac{-\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\,\|\vec{\xi}_t\|_1^2 + \max_i\lambda_i\eta_i^2\,\max_{i,k,\vec{\theta}}|g^i_k(\vec{\theta})|\cdot\|\vec{e}_t\|_1 + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2. $$
**Claim C.2.** If max_i η_i is small enough and satisfies eq. (12), there exists a constant C so that |e^i_k(θ⃗)| ≤ C for all i ∈ [n], k ∈ [d], and θ⃗ ∈ Θ^n.
The proofs of these two claims are based on first order approximation.
*Proof of claim C.1.* By the partial derivatives of eq. (2), we have ∂²Φ(θ⃗)/∂θ^i_k∂θ^j_l = 2λ_iλ_j A_{l,k} if i ≠ j and ∂²Φ(θ⃗)/∂θ^i_k∂θ^i_l = 2λ_i(1 + λ_i)A_{l,k}. Thus, the second order partial derivatives of Φ are bounded by twice C_4 := max_ı λ_ı(λ_ı + 1) max_{k,l} A_{k,l}. By Taylor's expansion, we have

$$
\begin{aligned}
\Phi(\vec{\theta}_{t+1}) - \Phi(\vec{\theta}_t)
&\le \nabla\Phi(\vec{\theta}_t)\cdot\big(\vec{\theta}_{t+1} - \vec{\theta}_t\big) + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 \\
&= \sum_{i\in[n],k\in[d]} \lambda_i g^i_k(\vec{\theta}_t)\big(\eta_i\xi^i_k(\vec{\theta}_t) + \eta_i^2 e^i_k(\vec{\theta}_t)\big) + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 && \text{(by the partial derivatives of eq. (2))} \\
&= \sum_{i\in[n],k\in[d]} \big(\lambda_i\eta_i\,g^i_{t,k}\xi^i_{t,k} + \lambda_i\eta_i^2\,g^i_{t,k}e^i_{t,k}\big) + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 \\
&\le \frac{-1}{2\sum_i\lambda_i\eta_i}\Big(\sum_{i,k}\lambda_i\eta_i\,|\xi^i_{t,k}|\Big)^2 + \sum_{i\in[n],k\in[d]}\lambda_i\eta_i^2\,g^i_{t,k}e^i_{t,k} + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2 && \text{(by lemma 4.3).}
\end{aligned}
$$

Therefore, we have

$$ \Phi(\vec{\theta}_{t+1}) - \Phi(\vec{\theta}_t) \le \frac{-\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\,\|\vec{\xi}_t\|_1^2 + \max_i\lambda_i\eta_i^2\,\max_{i,k,\vec{\theta}}|g^i_k(\vec{\theta})|\cdot\|\vec{e}_t\|_1 + d^2C_4\,\|\vec{\theta}_{t+1} - \vec{\theta}_t\|_1^2, $$

which completes the proof.
*Proof of claim C.2.* If η* is small enough and satisfies eq. (12), we have η_i(ḡ^i − g^i_k) ≤ 1, and

$$ \theta^i_k + \eta_i\theta^i_k(\bar{g}^i - g^i_k) \;\le\; \theta^i_k\,e^{\eta_i(\bar{g}^i - g^i_k)} \;\le\; \theta^i_k + \eta_i\theta^i_k(\bar{g}^i - g^i_k) + \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2 $$

for all i, k and θ⃗ ∈ Θ^n. Additionally, with ξ^i_k = θ^i_k(ḡ^i − g^i_k), we can rewrite this as

$$ \theta^i_k + \eta_i\xi^i_k \;\le\; \theta^i_k\,e^{\eta_i(\bar{g}^i - g^i_k)} \;\le\; \theta^i_k + \eta_i\xi^i_k + \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2. $$

Therefore, the error can be upper bounded as

$$ \eta_i^2 e^i_k \;\le\; \frac{\theta^i_k + \eta_i\xi^i_k + \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2}{\sum_l \big(\theta^i_l + \eta_i\xi^i_l\big)} - \big(\theta^i_k + \eta_i\xi^i_k\big) \;=\; \frac{e}{2}\eta_i^2\theta^i_k(\bar{g}^i - g^i_k)^2, $$

because Σ_l θ^i_l = 1 and Σ_l ξ^i_l = 0. For the lower bound, we have

$$ \eta_i^2 e^i_k \;\ge\; \frac{\theta^i_k + \eta_i\xi^i_k}{\sum_l \big(\theta^i_l + \eta_i\xi^i_l + \frac{e}{2}\eta_i^2\theta^i_l(\bar{g}^i - g^i_l)^2\big)} - \big(\theta^i_k + \eta_i\xi^i_k\big) \;\ge\; -\big(\theta^i_k + \eta_i\xi^i_k\big)\Big(\frac{e}{2}\eta_i^2\sum_l \theta^i_l(\bar{g}^i - g^i_l)^2\Big). $$

If we set C := e · max_{θ⃗} Σ_l θ^i_l(ḡ^i − g^i_l)², because |η_iξ^i_k| ≤ η_i · 2max_k|g^i_k| ≤ 1 by eq. (12) and θ^i_k ≤ 1, we have |e^i_k(θ⃗)| ≤ C.
*Proof of lemma 4.4.* Because D is open, we can set D′ so that θ⃗* ∈ D′, D′ ⊂ D, and sup_{θ⃗∈D′} Φ(θ⃗) < (1/2) inf_{θ⃗∉D} Φ(θ⃗). We will pick η* small enough so that θ⃗_{t+1} ∈ D for all θ⃗_t ∈ D′ ⊂ D. The proof has two parts: we first show the dynamics in eq. (3) hit the set D′; then we prove eq. (3) stays in D afterward.

By theorem 4.2, there exists τ so that ϑ⃗(τ) ∈ D′. Now by Gronwall's inequality we can set η* small enough so that θ⃗_{⌊τ/∥η∥₁⌋} ∈ D. Formally, given the dynamics in eq. (3), we define right-continuous step functions θ⃗(t) = θ⃗_{⌊t⌋} and e⃗(t) := e⃗(θ⃗_{⌊t⌋}) for all t ≥ 0. Then the dynamics in eq. (3) can be written as

$$ \theta^i_k(t) - \theta^i_k(0) = \sum_{s=0}^{t-1}\big(\theta^i_{s+1,k} - \theta^i_{s,k}\big) = \sum_{s=0}^{t-1}\big(\eta_i\xi^i_k(\vec{\theta}_s) + \eta_i^2 e^i_k(\vec{\theta}_s)\big) = \eta_i\int_0^t \Big(\xi^i_k(\vec{\theta}(s)) + \eta_i e^i_k(s)\Big)\,ds. $$

On the other hand, the solution of eq. (6) can be written as

$$ \vartheta^i_k(\|\eta\|_1 t) - \vartheta^i_k(0) = \frac{\eta_i}{\|\eta\|_1}\int_0^{\|\eta\|_1 t} \xi^i_k(\vec{\vartheta}(s))\,ds = \eta_i\int_0^t \xi^i_k(\vec{\vartheta}(\|\eta\|_1 s))\,ds. $$

Because the ξ^i_k are continuous and Θ^n is compact, there exists L > 0 so that ξ^i_k is L-Lipschitz with respect to the one norm for all i and k. Thus, the difference between the above equations is

$$ \big|\theta^i_k(t) - \vartheta^i_k(\|\eta\|_1 t)\big| \le \eta_i\int_0^t \big|\xi^i_k(\vec{\theta}(s)) - \xi^i_k(\vec{\vartheta}(\|\eta\|_1 s))\big|\,ds + \eta_i^2\int_0^t |e^i_k(s)|\,ds \le \eta_i L\int_0^t \big\|\vec{\theta}(s) - \vec{\vartheta}(\|\eta\|_1 s)\big\|_1\,ds + \eta_i^2 Ct $$

by L-Lipschitzness and claim C.2. Therefore, by Gronwall's inequality and η_i ≤ η*, we have

$$ \big\|\vec{\theta}(t) - \vec{\vartheta}(\|\eta\|_1 t)\big\|_1 \le \eta^* ndL\int_0^t \big\|\vec{\theta}(s) - \vec{\vartheta}(\|\eta\|_1 s)\big\|_1\,ds + (\eta^*)^2 ndCt \le (\eta^*)^2 ndCt\,e^{\eta^* ndLt}. $$

Because ϑ⃗(τ) ∈ D′, we can pick η* small enough so that θ⃗(τ/∥η∥₁) = θ⃗_{⌊τ/∥η∥₁⌋} ∈ D and Φ(θ⃗_{⌊τ/∥η∥₁⌋}) < inf_{θ⃗∉D} Φ(θ⃗).

Now we show the second part, that θ⃗_t ∈ D for all t ≥ ⌊τ/∥η∥₁⌋. First, with claim C.1, we prove that the potential function is decreasing for all θ⃗_t ∉ D′ when η* is small enough. We estimate the three terms in claim C.1 separately. First, because θ⃗* ∈ D′ and Θ^n \ D′ is compact, min_{θ⃗∉D′} ∥ξ⃗(θ⃗)∥₁ > 0 exists. Then (min_i λ_i²η_i²)/(2Σ_i λ_iη_i) ∥ξ⃗_t∥₁² = Ω(max_i η_i). By claim C.2, max_i λ_iη_i² max_{i,k,θ⃗} |g^i_k(θ⃗)| · ∥e⃗_t∥₁ = O(max_i η_i²). Finally, d²C₄∥θ⃗_{t+1} − θ⃗_t∥₁² = O(max_i η_i²). Therefore, there exists η* small enough so that Φ(θ⃗_{t+1}) − Φ(θ⃗_t) < 0 for all θ⃗_t ∉ D′ and all such η. Therefore Φ(θ⃗_t) < inf_{θ⃗∉D} Φ(θ⃗) for all t ≥ ⌊τ/∥η∥₁⌋, which completes the proof.
D PROOFS AND DETAILS FOR LEMMA 4.5
Lemma 4.3 shows that the value of Φ is decreasing along eq. (6), and that the decrease rate is lower bounded in terms of the one norm of ξ⃗. Thus, if we can show the error between eq. (3) and eq. (6) is bounded in terms of ∥ξ⃗∥₁, then the value of Φ is also decreasing along eq. (3). The main challenge is that, because ξ⃗(θ⃗*) = 0, we need to control the error as ξ⃗ converges to zero while η* is fixed.

We first show three ancillary claims, claims D.1 to D.3. Claim D.1 shows the vanishing components of θ⃗_t decrease rapidly once the dynamic is in D. Claims D.2 and D.3 show the supporting components also decrease once the vanishing components are small enough.
Given (n, d, A, λ), we define the following constants: C₁ := (1/4) max_{i,l,θ∈Θ^n} |g^i_l|, C₂ := 2 / min_{i,k} (θ*)^i_k, C₃ := ed · max(C₁, C₂), and C₄ := max_ı λ_ı(λ_ı + 1) max_{k,l} A_{k,l}. We require the maximum learning rate to be bounded by η*, which satisfies the following conditions:

$$ \eta^* \max_{i,k,\vec{\theta}\in\Theta^n} |g^i_k(\vec{\theta})| \le \frac{1}{2} \quad (12) $$

$$ \eta^*\, ndC_3 \max_{\vec{\theta}} \|\vec{\xi}(\vec{\theta})\|_1 \le 1 \quad (13) $$
Additionally, given the bound of learning rate ratio, [max]mini[i] η[ η]i[i]
_[≤]_ _[R][η][, we requires]_
_Rη[2][η][3]_ _i_ min 4 _,_ 1 (14)
_∗_ _[<][ min]16_ _[i][ λ]i_ _[λ][2][i]_ _C3 maxi(λi) maxθ⃗_ **_g_** 1 _d[2]C4_
 _[∥][⃗]∥_ 
_i_
Note that $\frac{\max_i \eta_i^3}{\min_i \eta_i^2} \le R_\eta^2 \eta_*$, so it is also less than the right-hand side of eq. (14). On the other hand, by lemma 4.4, we can pick $D$ small enough so that the following condition holds:

$$\frac{1}{2}\min_{i\in[n],k\in S_i} (\theta^*)^i_k \le \min_{i\in[n],k\in S_i,\vec\theta\in D} \theta^i_k \tag{15}$$
We first show that, for all $i$ and $k \in \bar S_i$, $\theta^i_{t,k}$ is decreasing and converges to zero exponentially fast as $t$ increases. Because $\vec\theta^*$ is a proper performative stable point, $\sum_l (\theta^*)^i_l e^{-\eta_i g^i_l(\vec\theta^*)} = e^{-\eta_i \bar g^i(\vec\theta^*)} > e^{-\eta_i g^i_k(\vec\theta^*)}$ for all $i$ and $k \in \bar S_i$. We can take $\eta_*$, $\epsilon_1$, and $D$ small enough so that, for all $\vec\theta \in D$ and all learning rate profiles $\eta$ with $\max_i \eta_i \le \eta_*$,

$$e^{-\eta_i g^i_k} < (1 - \epsilon_1)\sum_l \theta^i_l e^{-\eta_i g^i_l}, \quad \forall i \in [n],\ k \in \bar S_i \tag{16}$$
**Claim D.1 (Vanishing components).** Given $\eta_*$, $\epsilon_1$, and $D$ in eq. (16), if $\vec\theta_t \in D$ for all $t \ge 0$, then for all $i \in [n]$ and $k \in \bar S_i$, $\theta^i_{t,k}$ is decreasing in $t$ and, for all $t \ge 0$,

$$0 \le \theta^i_{t,k} \le \theta^i_{0,k} e^{-\epsilon_1 t}.$$

*Proof of claim D.1.* By eq. (16), for all $t \ge 0$,

$$\theta^i_{t+1,k} = \theta^i_{t,k}\frac{\exp(-\eta_i g^i_{t,k})}{\sum_l \theta^i_{t,l}\exp(-\eta_i g^i_{t,l})} \le (1 - \epsilon_1)\theta^i_{t,k}.$$

Therefore, $\theta^i_{t,k}$ is decreasing, and since $(1 - \epsilon_1)^t \le e^{-\epsilon_1 t}$, we get $\theta^i_{t,k} \le \theta^i_{0,k} e^{-\epsilon_1 t}$.
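A minimal numerical illustration of claim D.1, in an instance we assume purely for transparency (one agent, two actions, constant losses, so the second action is a vanishing component): its probability shrinks geometrically, matching the $e^{-\epsilon_1 t}$ envelope.

```python
import numpy as np

# Claim D.1 in the simplest setting (illustrative assumption): with constant
# losses, the weight of the strictly worse action decays geometrically,
# i.e. theta_{t,k} <= theta_{0,k} * exp(-eps1 * t) for some eps1 > 0.
g = np.array([0.0, 1.0])
eta = 0.1
theta = np.array([0.5, 0.5])
for t in range(101):
    if t % 25 == 0:
        print(t, theta[1])
    w = theta * np.exp(-eta * g)  # one step of eq. (3)
    theta = w / w.sum()
```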
**Claim D.2.** There exists a constant $C_3$ such that, for all $\vec\theta \in \Theta^n$ and $\delta_1 > 0$ with $\|\vec\xi(\vec\theta)\|_1 \ge 2\sqrt{\delta_1}$ and $|\xi^i_k(\vec\theta)| \le \delta_1$ for all $i \in [n]$ and $k \in \bar S_i$, we have

$$|e^i_k(\vec\theta)| \le C_3\|\vec\xi(\vec\theta)\|_1^2, \tag{17}$$

for all $i \in [n]$ and $k \in [d]$.
Claim D.2 shows that if the one norm of $\vec\xi$ is much bigger than the vanishing components, then the error term $\vec e$ can be bounded by $\|\vec\xi\|_1^2$. Moreover, claim D.1 ensures that the vanishing components decrease rapidly, so the condition of claim D.2 readily holds.
*Proof of claim D.2.* Given $\vec\theta$ and $\delta_1 > 0$ that satisfy the condition, we first show two inequalities that bound $\theta^i_k(\bar g^i - g^i_k)^2$ for vanishing and supporting components respectively. For a vanishing component $k \in \bar S_i$,

$$\theta^i_k(\bar g^i - g^i_k)^2 \le \max_{i,l,\vec\theta\in\Theta^n}|g^i_l| \cdot |\theta^i_k(\bar g^i - g^i_k)| \le 4C_1\delta_1 \le C_1\|\vec\xi\|_1^2. \tag{18}$$

Then, for a supporting component $k \in S_i$, with eq. (15) we have

$$\theta^i_k(\bar g^i - g^i_k)^2 = \frac{(\theta^i_k(\bar g^i - g^i_k))^2}{\theta^i_k} \le \frac{|\xi^i_k|^2}{\min_{i,l,\vec\theta\in D}\theta^i_l} \le \frac{2}{\min_{i,k}(\theta^*)^i_k}|\xi^i_k|^2 \le C_2\|\vec\xi\|_1^2. \tag{19}$$
Now we use the above two inequalities to approximate eq. (3). For the numerator, because $1 + x \le \exp(x) \le 1 + x + \frac{e}{2}x^2$ for all $x \le 1$, we have $\theta^i_k\exp(\eta_i(\bar g^i - g^i_k)) \ge \theta^i_k + \eta_i\theta^i_k(\bar g^i - g^i_k) = \theta^i_k + \eta_i\xi^i_k$. On the other hand, because $\eta_i(\bar g^i - g^i_k) \le 1$ by eq. (12), $\theta^i_k\exp(\eta_i(\bar g^i - g^i_k)) \le \theta^i_k + \eta_i\xi^i_k + \frac{e}{2}\theta^i_k\eta_i^2(\bar g^i - g^i_k)^2$. By eqs. (18) and (19), we have

$$0 \le \theta^i_k e^{\eta_i(\bar g^i - g^i_k)} - \theta^i_k - \eta_i\xi^i_k \le \frac{e}{2}\max(C_1, C_2)\eta_i^2\|\vec\xi\|_1^2 \tag{20}$$

For the denominator, we sum eq. (20) over $k$. Because $\sum_l \theta^i_l = 1$ and $\sum_l \theta^i_l(\bar g^i - g^i_l) = 0$, we have

$$0 \le \sum_l \theta^i_l e^{\eta_i(\bar g^i - g^i_l)} - 1 \le \frac{ed}{2}\max(C_1, C_2)\eta_i^2\|\vec\xi\|_1^2 \tag{21}$$
Given $i \in [n]$ and $k \in [d]$, we apply the above equations to eq. (3). For the upper bound, we have

$$\eta_i^2 e^i_k(\vec\theta_t) = \frac{\theta^i_{t,k}\exp(\eta_i(\bar g^i - g^i_k))}{\sum_l \theta^i_{t,l}\exp(\eta_i(\bar g^i - g^i_l))} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \le \theta^i_{t,k} e^{\eta_i(\bar g^i - g^i_{t,k})} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \le \frac{e\max(C_1, C_2)}{2}\eta_i^2\|\vec\xi\|_1^2$$

where the first inequality is by eq. (21) and the second by eq. (20). For the lower bound,

$$\eta_i^2 e^i_k(\vec\theta_t) = \frac{\theta^i_{t,k}\exp(\eta_i(\bar g^i - g^i_k))}{\sum_l \theta^i_{t,l}\exp(\eta_i(\bar g^i - g^i_l))} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \ge \frac{\theta^i_{t,k} + \eta_i\xi^i_{t,k}}{1 + \frac{ed\max(C_1,C_2)}{2}\eta_i^2\|\vec\xi\|_1^2} - \theta^i_{t,k} - \eta_i\xi^i_{t,k} \qquad\text{(by eqs. (20) and (21))}$$
$$\ge -\left(\theta^i_{t,k} + \eta_i\xi^i_{t,k}\right)\frac{ed\max(C_1,C_2)}{2}\eta_i^2\|\vec\xi\|_1^2 \qquad (1/(1+x) \ge 1 - x)$$
$$\ge -ed\max(C_1,C_2)\eta_i^2\|\vec\xi\|_1^2 \qquad (\theta^i_{t,k} \le 1 \text{ and } \eta_i\xi^i_{t,k} \le 1 \text{ by eq. (12)})$$

Therefore, with $C_3 := ed\max(C_1, C_2)$, we finish the proof.
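The bound in eq. (17) can be checked numerically. The sketch below uses a single-agent version of the toy losses from the previous sketch (again an assumption, not the paper's setup), extracts $e^i_k$ from one step of eq. (3) via $\theta_{t+1} = \theta_t + \eta\xi + \eta^2 e$, and shows that $|e^i_k|$ stays bounded as $\eta$ shrinks, consistent with $|e^i_k| \le C_3\|\vec\xi\|_1^2$.

```python
import numpy as np

# Numeric sanity check of eq. (17) (illustrative, single agent): the
# second-order error e extracted from one MWU step stays bounded by a
# constant times ||xi||_1^2 as eta decreases.
rng = np.random.default_rng(1)
d = 4
A = rng.uniform(0.0, 1.0, size=(d, d))

def g(theta):
    return A @ theta

def xi(theta):
    gv = g(theta)
    return theta * (theta @ gv - gv)

def err(theta, eta):              # solve eq. (3) for the error term e
    w = theta * np.exp(-eta * g(theta))
    return (w / w.sum() - theta - eta * xi(theta)) / eta**2

theta = rng.dirichlet(np.ones(d))
print("||xi||_1^2 =", np.linalg.norm(xi(theta), 1) ** 2)
for eta in [0.2, 0.1, 0.05, 0.025]:
    print(eta, np.abs(err(theta, eta)).max())
```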
**Claim D.3.** There exists $\epsilon_2 > 0$ so that, for all $\vec\theta_t \in \Theta^n$ and $\delta_1 > 0$ with $\|\vec\xi(\vec\theta_t)\|_1 \ge 2\sqrt{\delta_1}$ and $|\xi^i_k(\vec\theta_t)| \le \delta_1$ for all $i \in [n]$ and $k \in \bar S_i$, we have $\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\epsilon_2\|\vec\xi(\vec\theta_t)\|_1^2$.
*Proof of claim D.3.* By claim C.1,

$$\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i}\|\vec\xi_t\|_1^2 + \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)| \cdot \|\vec e_t\|_1 + d^2 C_4\|\vec\theta_{t+1} - \vec\theta_t\|_1^2.$$

We can bound the latter two terms by claim D.2. For the second term, $\|\vec e_t\|_1 \le ndC_3\|\vec\xi_t\|_1^2$. For the third term,

$$\|\vec\theta_{t+1} - \vec\theta_t\|_1 \le \sum_{i,k}\left(\eta_i|\xi^i_{t,k}| + C_3\eta_i^2\|\vec\xi\|_1^2\right) \le \max_i\eta_i\|\vec\xi\|_1 + ndC_3\max_i\eta_i^2\|\vec\xi\|_1^2 \le 2\max_i\eta_i\|\vec\xi\|_1,$$

where the last step is by eq. (13).
With the above inequalities and eq. (14), we have $\frac{\min_i\lambda_i^2\eta_i^2}{4\sum_i\lambda_i\eta_i} > \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)|\, ndC_3$ and $\frac{\min_i\lambda_i^2\eta_i^2}{4\sum_i\lambda_i\eta_i} > 4d^2C_4\max_i\eta_i^2$. Therefore, the constant

$$\epsilon_2 := \frac{\min_i\lambda_i^2\eta_i^2}{2\sum_i\lambda_i\eta_i} - \max_i\lambda_i\eta_i^2\max_{i,k,\vec\theta}|g^i_k(\vec\theta)|\, ndC_3 - 4d^2C_4\max_i\eta_i^2$$

is positive and satisfies $\Phi(\vec\theta_{t+1}) - \Phi(\vec\theta_t) \le -\epsilon_2\|\vec\xi_t\|_1^2$.
*Proof of lemma 4.5.* To prove $\lim_{t\to\infty}\vec\theta_t = \vec\theta^*$, there are two equivalent ways to measure the progress besides $\|\vec\theta_t - \vec\theta^*\|_1$. First, because $\vec\theta^*$ is an isolated fixed point of eq. (6), $\vec\theta_t$ converges to $\vec\theta^*$ if and only if $\lim_{t\to\infty}\vec\xi(\vec\theta_t) = \vec 0$. On the other hand, because $\Phi$ is strictly convex and $\vec\theta^*$ is its minimum point, $\vec\theta_t$ converges to $\vec\theta^*$ if and only if $\lim_{t\to\infty}\Phi(\vec\theta_t) = \min_{\vec\theta}\Phi(\vec\theta)$. With these two equivalent conditions, given any $\epsilon > 0$, there exists $\delta > 0$ so that $\|\vec\theta - \vec\theta^*\|_1 < \epsilon$ whenever

$$\vec\theta \in V_\delta := \left\{\vec\theta : \Phi(\vec\theta) \le \max_{\vec\theta' : \|\vec\xi(\vec\theta')\|_1 \le \delta}\Phi(\vec\theta')\right\}.$$

Therefore, it is sufficient to show that, for all $\delta > 0$, there is $t_\delta$ so that $\vec\theta_t \in V_\delta$ for all $t \ge t_\delta$. With the technical claims D.1 to D.3, our proof has two parts: first we show that the dynamic hits $V_\delta$; then we show that the dynamic stays in $V_\delta$.
For the first part, given $\delta > 0$ and $0 < C < 1$, by claim D.1 there exists $T_1$ such that each vanishing component satisfies $|\xi^i_{t,k}| \le C^2\delta^2/4$ for all $t \ge T_1$. Then, by claim D.3, there exists $T_2 \ge T_1$ such that $\|\vec\xi_{T_2}\|_1 \le C\delta$. Otherwise, the value of $\Phi$ would decrease by a nonzero constant $\epsilon_2\|\vec\xi_t\|_1^2 \ge \epsilon_2 C^2\delta^2 > 0$ at each round, which contradicts the fact that $\Phi$ is bounded from below.
For the second part, if $\|\vec\xi_t\|_1 \le C\delta \le \delta$ for all $t \ge T_2$, then we are done. Otherwise, there exists $\tau \ge T_2$ so that $\|\vec\xi_\tau\|_1 \le C\delta < \|\vec\xi_{\tau+1}\|_1$. Now we prove that $\Phi(\vec\theta_{\tau+1}) \le \max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$. Because the difference between $\vec\theta_{\tau+1}$ and $\vec\theta_\tau$ in one norm is $\|\vec\theta_{\tau+1} - \vec\theta_\tau\|_1 \le \max_i\eta_i\|\vec\xi_\tau\|_1 + \max_i\eta_i^2\|\vec e_\tau\|_1 \le 2\max_i\eta_i\|\vec\xi_\tau\|_1$ when $\eta_*$ is small enough, and $\vec\xi$ is $L_\xi$-Lipschitz in one norm for some constant $L_\xi$, we have

$$C\delta < \|\vec\xi_{\tau+1}\|_1 \le \|\vec\xi_\tau\|_1 + \|\vec\xi_{\tau+1} - \vec\xi_\tau\|_1 \le \|\vec\xi_\tau\|_1 + 2L_\xi\max_i\eta_i\|\vec\xi_\tau\|_1 \le \delta, \tag{22}$$

when $C$ is small enough so that $C(1 + 2L_\xi\max_i\eta_i) \le 1$. Therefore, $\|\vec\xi_{\tau+1}\|_1 \le \delta$ and $\Phi(\vec\theta_{\tau+1}) \le \max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$. Finally, because $|\xi^i_{t,k}| \le C^2\delta^2/4$ for all $t \ge \tau + 1$, by claim D.3 the potential function is decreasing for all $t \ge \tau + 1$ unless $\|\vec\xi_t\|_1 \le C\delta$. In both cases the potential stays less than $\max_{\vec\theta':\|\vec\xi(\vec\theta')\|_1\le\delta}\Phi(\vec\theta')$, which completes the proof.
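To illustrate lemma 4.5, the sketch below runs eq. (3) with small heterogeneous learning rates on a toy instance with symmetric costs (a potential-like case we assume only for illustration; it does not verify the lemma's hypotheses) and prints $\|\vec\xi(\vec\theta_t)\|_1$, which decays toward zero as the iterates approach a fixed point.

```python
import numpy as np

# Illustration of lemma 4.5 on an assumed toy instance: with small bounded
# learning rates, ||xi(theta_t)||_1 -> 0 along the discrete dynamics eq. (3).
rng = np.random.default_rng(2)
n, d = 3, 4
B = rng.normal(size=(d, d))
A = B.T @ B                       # symmetric costs give a potential structure
lam = np.array([0.5, 0.3, 0.2])

def g(theta):
    return np.tile(A @ (lam @ theta), (n, 1))

def xi(theta):
    gv = g(theta)
    return theta * ((theta * gv).sum(axis=1, keepdims=True) - gv)

eta = np.array([0.05, 0.04, 0.03])  # small heterogeneous learning rates
theta = rng.dirichlet(np.ones(d), size=n)
for t in range(2001):
    if t % 400 == 0:
        print(t, np.abs(xi(theta)).sum())
    w = theta * np.exp(-eta[:, None] * g(theta))
    theta = w / w.sum(axis=1, keepdims=True)
```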
E PROOF AND DETAIL FOR THEOREM 4.6
*Proof of lemma 4.7.* Define $P_L(x) := \alpha(L)(x - \beta(L))$, which is increasing in $x$ for all $L$ because $\alpha(L) > 0$. With $P_L$,

$$f_{\alpha(L),\beta(L)}(x) = \frac{x}{x + (1 - x)\exp(P_L(x))}.$$

Take $x_1(L) = x_1 := 1 - 1/\alpha(L)$, $x_2(L) = x_2 := f_{\alpha(L),\beta(L)}(x_1(L))$, and $x_3(L) = x_3 := f_{\alpha(L),\beta(L)}(x_2(L))$. We will define $x_0(L)$ later. Below we omit $L$ and write $x_1$, $x_2$, and $x_3$.
To define $x_0(L)$, we set $y(L) := \beta(L)/2$ and want to show

$$f_{\alpha(L),\beta(L)}(y) > x_1, \tag{23}$$

which is equivalent to $(\alpha(L) - 1)(1 - y(L))\exp(P_L(y(L))) < y(L)$. When $L$ is large enough, we have $\beta(L) \in \left(\frac{2\beta_\infty}{3}, \frac{4\beta_\infty}{3}\right)$. Additionally, because $P_L(y(L)) = -\alpha(L)\beta(L)/2 < -\alpha(L)\beta_\infty/3$, we have

$$(\alpha(L) - 1)(1 - y(L))e^{P_L(y(L))} \le \alpha(L)e^{-\alpha(L)\beta_\infty/3} < y(L)$$

when $L$ is large enough, which proves eq. (23). On the other hand, because $\beta(L) < \frac{4\beta_\infty}{3}$ and $P_L(\beta(L)) = 0$, we have $f_{\alpha(L),\beta(L)}(\beta(L)) = \beta(L) < \frac{1}{2}(\beta_\infty + 1)$. Moreover, $\frac{1}{2}(\beta_\infty + 1) < 1 - \frac{1}{\alpha(L)}$ holds when $L$ is large enough. These two imply

$$f_{\alpha(L),\beta(L)}(\beta(L)) = \beta(L) < x_1(L). \tag{24}$$

Combining eqs. (23) and (24), by the intermediate value theorem, there exists $x_0$ such that

$$f_{\alpha(L),\beta(L)}(x_0) = x_1 \quad\text{with}\quad y < x_0 < \beta(L) < x_1. \tag{25}$$
Now we show $x_3 < x_0$. With $1 - x_1(L) = 1/\alpha(L)$ and $0 < x_1 < 1$, we have

$$x_2 = f_{\alpha(L),\beta(L)}(x_1) = \frac{x_1}{x_1 + \alpha(L)^{-1}\exp(P_L(x_1))} \le \alpha(L)\exp(-P_L(x_1)).$$

Because $\beta_\infty < 1/2$, when $L$ is large enough, $x_1(L) = 1 - \frac{1}{\alpha(L)} > \frac{1}{4}(2\beta(L) + 3)$ and, thus, $P_L(x_1) > \alpha(L)\left(\frac{3}{4} - \frac{\beta(L)}{2}\right)$. Therefore, we have

$$x_2 \le \alpha(L)\exp\left(-\alpha(L)\left(\frac{3}{4} - \frac{\beta(L)}{2}\right)\right) \tag{26}$$
Finally, because $P_L(x_2) \ge P_L(0) = -\alpha(L)\beta(L)$, we get

$$x_3 = f_{\alpha(L),\beta(L)}(x_2) \le \frac{x_2}{x_2 + (1 - x_2)\exp(-\alpha(L)\beta(L))} = \frac{x_2\exp(\alpha(L)\beta(L))}{1 + x_2(\exp(\alpha(L)\beta(L)) - 1)}$$
$$\le x_2\exp(\alpha(L)\beta(L)) \qquad\text{(since $\exp(\alpha(L)\beta(L)) > 1$ when $L$ is large)}$$
$$\le \alpha(L)\exp\left(-\alpha(L)\left(\tfrac{3}{4} - \tfrac{\beta(L)}{2}\right)\right)\exp(\alpha(L)\beta(L)) \qquad\text{(eq. (26))}$$
$$= \alpha(L)\exp\left(\frac{3\alpha(L)}{2}\left(\beta(L) - \frac{1}{2}\right)\right)$$

Finally, because $\beta_\infty < 1/2$, $x_3$ converges to $0$ as $L$ grows. Therefore, when $L$ is large enough,

$$x_3 < y < x_0. \tag{27}$$

By eqs. (25) and (27), we have $x_3 < x_0 < x_1$, which completes the proof.
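The construction above is easy to verify numerically. The script below instantiates $f_{\alpha,\beta}$ with an assumed parameterization ($\alpha(L) = L$ and $\beta(L) = \beta_\infty + 1/L$ with $\beta_\infty = 0.4$; any choice with $\alpha(L) \to \infty$ and $\beta(L) \to \beta_\infty < 1/2$ behaves similarly), computes $x_1, x_2, x_3$, recovers $x_0$ by bisection on $(\beta/2, \beta)$, and confirms the ordering $x_3 < x_0 < x_1$.

```python
import numpy as np

# Numeric check of lemma 4.7. The parameterization alpha(L) = L and
# beta(L) = beta_inf + 1/L is an assumption made for illustration.
beta_inf = 0.4

def f(x, alpha, beta):
    return x / (x + (1.0 - x) * np.exp(alpha * (x - beta)))

for L in [50, 100, 200]:
    alpha, beta = float(L), beta_inf + 1.0 / L
    x1 = 1.0 - 1.0 / alpha
    x2 = f(x1, alpha, beta)
    x3 = f(x2, alpha, beta)
    # f(beta/2) > x1 while f(beta) = beta < x1, so bisect for f(x0) = x1
    lo, hi = beta / 2.0, beta
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid, alpha, beta) > x1:
            lo = mid
        else:
            hi = mid
    x0 = 0.5 * (lo + hi)
    print(L, "x3 < x0 < x1:", x3 < x0 < x1, (x3, x0, x1))
```

For these parameters one can also check that $x_3 < y(L) = \beta(L)/2$, in line with eq. (27).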