## HOW TO IMPROVE SAMPLE COMPLEXITY OF SGD OVER HIGHLY DEPENDENT DATA?
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Conventional machine learning applications typically assume that data samples
are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data
samples, which are known to heavily bias the stochastic optimization process and
slow down the convergence of learning. In this paper, we conduct a fundamental
study on how different structures of stochastic update schemes affect the sample complexity of stochastic gradient descent (SGD) over highly dependent data.
Specifically, with a φ-mixing model of data dependence, we show that SGD with
proper periodic data-subsampling achieves an improved sample complexity over
the standard SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of data samples can accelerate the convergence
of SGD over highly dependent data. Moreover, we show that mini-batch SGD
can further substantially improve the sample complexity over SGD with periodic
data-subsampling over highly dependent data. We also conduct some numerical
experiments to validate our theoretical results.
1 INTRODUCTION
Stochastic optimization algorithms have attracted great attention in the past decade due to their successful applications to a broad range of research areas, including deep learning (Goodfellow et al., 2016), reinforcement learning (Sutton & Barto, 2018), online learning (Bottou, 2010; Hazan, 2017), control
(Marti, 2017), etc. In the conventional analysis of stochastic optimization algorithms, it is usually
assumed that all data samples are queried in an independently and identically distributed (i.i.d.) manner. For
example, data samples in the traditional empirical risk minimization framework are assumed to be
queried independently from the underlying data distribution, while data samples in reinforcement
learning are assumed to be queried from the stationary distribution of the underlying Markov chain.
Although the i.i.d. data assumption leads to a comprehensive understanding of the statistical limit
and computation complexity of SGD, it violates the nature of many practical data-generating
stochastic processes, which generate highly correlated samples that depend on the history. In fact,
dependent data can be found almost everywhere, e.g., daily stock price, weather/climate data, state
transitions in Markov chains, etc. To understand the impact of data dependence on the convergence
and complexity of stochastic algorithms, there is a growing number of recent works that introduce
various definitions to quantify data dependence. Specifically, to analyze the finite-time convergence
of various stochastic reinforcement learning algorithms, recent studies assume that the dependent
samples queried from the Markov decision process satisfy a geometric mixing property (Dalal et al.,
2018; Zou et al., 2019; Xu & Gu, 2020; Qu & Wierman, 2020), which requires the underlying
Markov chain to be uniformly ergodic or to have a finite mixing time (Even-Dar et al., 2003). On the
other hand, to analyze the convergence of stochastic optimization algorithms over dependent data,
Karimi et al. (2019) assumed the existence of a solution to the Poisson equation associated with the
underlying Markov chain, which is a weaker condition than the uniform ergodic condition (Glynn
& Meyn, 1996). Moreover, Agarwal & Duchi (2012) introduced a φ-mixing property of the data-generating process that quantifies how fast the distribution of future data samples (conditioned on a
fixed filtration) converges to the underlying stationary data distribution. In particular, the φ-mixing
property is more general than the previous two notions of data dependence (Douc et al., 2018).
While the aforementioned works leveraged the above notions of data dependence to characterize the
sample complexity of various standard stochastic algorithms over dependent data, there is still a lack of
theoretical understanding of how the algorithm structure affects the sample complexity of stochastic
algorithms under different levels of data dependence. In particular, a key algorithm structure is the
stochastic update scheme, which critically affects the bias and variance of the stochastic optimization process. In fact, under i.i.d. data and convex geometry, it is well known that SGD achieves
the sample complexity lower bound under various stochastic update schemes (Bottou, 2010), e.g.,
single-sample update and mini-batch update. However, these stochastic update schemes may lead
to substantially different convergence behaviors over highly dependent data, as they are no longer
unbiased. Therefore, it is of vital importance to understand the interplay among the data dependence, the structure of the stochastic update, and the convergence rate of stochastic algorithms, and we ask the following fundamental question.
- Q: How does the structure of stochastic updates affect the convergence rate and sample
complexity of stochastic algorithms over dependent data?
In this paper, we provide comprehensive answers to the above fundamental question. Specifically,
we conduct a comprehensive study of the convergence rate and sample complexity of the SGD
algorithm over a wide spectrum of data dependence levels under various types of stochastic updates,
including periodic subsampling and mini-batch sampling. Our results show that SGD with both
stochastic updates achieves a substantially improved sample complexity over the standard SGD
under highly dependent data. We summarize our contributions as follows.
1.1 OUR CONTRIBUTIONS
We consider the following standard stochastic optimization problem.
$$\min_{w\in\mathcal{W}} f(w) := \mathbb{E}_{\xi\sim\mu}\big[F(w;\xi)\big], \tag{P}$$
where the objective function f is convex and Lipschitz continuous, and the expectation is taken over
the stationary distribution µ of the underlying data-generating process P. To perform stochastic
optimization, we query a stream of dependent data samples from the underlying data-generating
process. Specifically, we adopt the φ-mixing model to quantify the data dependence via a decaying
mixing coefficient function φξ(k) (see Definition 2.2) (Agarwal & Duchi, 2012). We study the convergence of the stochastic gradient descent (SGD) algorithm over φ-mixing dependent data samples
under various stochastic update schemes, including data subsampling and mini-batch sampling.
We first study the convergence of SGD over φ-mixing dependent data samples under the data-subsampling update scheme. In particular, the data-subsampling update scheme utilizes only one data sample per r consecutive data samples by periodically skipping r − 1 samples. With this scheme, the subsampled data samples are less dependent for a larger subsampling period r. Consequently, we show that SGD with a proper data-subsampling period achieves an improved sample complexity over that of the standard SGD in the full spectrum of the convergence rate of the φ-mixing coefficient. In particular, the improvement is substantial when the data is highly dependent with an algebraically decaying φ-mixing coefficient.
Moreover, we study the sample complexity of SGD over φ-mixing dependent data samples under the mini-batch update scheme. Compared to the data-subsampling update, the mini-batch update can substantially reduce the mini-batch data dependence without skipping data samples: it leverages the sample average over a mini-batch of data samples to reduce both the bias (caused by the data dependence) and the optimization variance. Specifically, we show that SGD with the mini-batch update achieves an order-wise lower sample complexity than both the standard SGD and SGD with data subsampling in the full spectrum of the convergence rate of the φ-mixing coefficient. We summarize and compare the sample complexities of these stochastic algorithms under different φ-mixing data dependence models in Table 1.
1.2 RELATED WORK
**Stochastic Algorithms over Dependent Data** Steinwart & Christmann (2009) and Modha & Masry (1996) established the convergence analysis of online stochastic algorithms for streaming data with geometric ergodicity.
Table 1: Comparison of sample complexities of SGD, SGD with data subsampling, and mini-batch SGD under different levels of data dependence for achieving f(w) − f(w*) ≤ ϵ. Note that θ is a parameter of the convergence rate of the φ-mixing coefficient.

| Data dependence level | φ_ξ(k) | SGD | SGD w/ subsampling | Mini-batch SGD |
| --- | --- | --- | --- | --- |
| Geometric φ-mixing (weakly dependent) | exp(−k^θ), θ > 0 | O(ϵ^{−2}(log ϵ^{−1})^{2/θ}) | O(ϵ^{−2}(log ϵ^{−1})^{1/θ}) | O(ϵ^{−2}) |
| Fast algebraic φ-mixing (medium dependent) | k^{−θ}, θ ≥ 1 | O(ϵ^{−2−2/θ}) | O(ϵ^{−2−1/θ}) | Õ(ϵ^{−2}) |
| Slow algebraic φ-mixing (highly dependent) | k^{−θ}, 0 < θ < 1 | O(ϵ^{−2−2/θ}) | O(ϵ^{−2−1/θ}) | O(ϵ^{−1−1/θ}) |
Duchi et al. (2011) proved that the stochastic subgradient method has strong convergence guarantees if the mixing time is uniformly bounded. Agarwal & Duchi (2012) studied the convex/strongly convex stochastic optimization problem and proved high-probability convergence bounds for general stochastic algorithms under general stationary mixing processes. Godichon-Baggioni et al. (2021) provided a non-asymptotic analysis of stochastic algorithms with strongly convex objective functions over streaming mini-batch data. In a more general setting, the stochastic approximation (SA) problem was studied in (Karimi et al., 2019) by assuming the existence of a solution to a Poisson equation. Recently, Debavelaere et al. (2021) developed an asymptotic convergence analysis of the SA problem under sub-geometric Markov dynamic noise.
**Finite-time convergence of reinforcement learning** Recently, a series of works studied the finite-time convergence of many stochastic reinforcement learning algorithms over Markovian dependent samples, including TD learning (Dalal et al., 2018; Xu et al., 2019; Kaledin et al., 2020), Q-learning (Qu & Wierman, 2020; Li et al., 2021; Melo et al., 2008; Chen et al., 2019; Xu & Gu, 2020), fitted Q-iteration (Mnih et al., 2013; 2015; Agarwal et al., 2021), actor-critic algorithms (Wang et al., 2019; Yang et al., 2019; Kumar et al., 2019; Qiu et al., 2019; Wu et al., 2020; Xu et al., 2020), etc. In these studies, the dependent Markovian samples are assumed to satisfy the geometric φ-mixing property, which is satisfied when the underlying Markov chain is uniformly ergodic or time-homogeneous with finite states.
**Regret of Stochastic Convex Optimization** There have been many known regret bounds for the online convex optimization problem. Hazan (2017) established the standard O(√T) regret bound for the online SGD algorithm under a bounded-gradient assumption. Xiao (2009) introduced the regret bound of the online dual averaging method. To the best of our knowledge, there is no high-probability regret bound for the mini-batch SGD algorithm that accounts for data dependence.
2 PROBLEM FORMULATION AND ASSUMPTIONS
In this section, we introduce the problem formulation and some basic assumptions. Consider a
model with parameters w. For any data sample ξ, denote F (w; ξ) ∈ R as the sample loss of this data
sample under the model w. In this paper, we consider the following standard stochastic optimization
problem that has broad applications in machine learning.
$$\min_{w\in\mathcal{W}} f(w) := \mathbb{E}_{\xi\sim\mu}\big[F(w;\xi)\big]. \tag{P}$$

Here, the expectation is taken over the randomness of the data sample ξ, which is drawn from an
underlying distribution µ. In particular, we make the following standard assumptions regarding the
problem (P) (Agarwal & Duchi, 2012).
**Assumption 2.1.** _The stochastic optimization problem (P) satisfies:_

_1. For every ξ, the function F(·;ξ) is G-Lipschitz continuous over W, i.e., for all w, v ∈ W, |F(w;ξ) − F(v;ξ)| ≤ G∥w − v∥._

_2. The function f(·) is convex and bounded below, i.e., f(w*) := inf_{w∈W} f(w) > −∞._

_3. W is a convex and compact set with bounded diameter R, i.e., sup_{w,v∈W} ∥w − v∥ ≤ R._
To solve this stochastic optimization problem, one often needs to query a set of data samples from the distribution µ to perform optimization. Unlike traditional stochastic optimization, which usually assumes that the data samples are i.i.d., we consider a more general and practical dependent data-generating process, as we elaborate below.
**Dependent data-generating process:** We consider a stochastic process P that generates a stream of data samples {ξ₁, ξ₂, ...}, which are not necessarily independent. In particular, the stochastic process P has an underlying stationary distribution µ. To quantify the dependence of the data-generating process, we introduce the following standard φ-mixing model (Agarwal & Duchi, 2012), where we denote {F_t}_t as the canonical filtration generated by {ξ_t}_t.
**Definition 2.2** (φ-mixing process)**.** _Consider a stochastic process {ξ_t}_t with a stationary distribution µ. Let P(ξ_{t+k} ∈ ·|F_t) be the distribution of the (t+k)-th sample conditioned on F_t, and denote d_TV as the total variation distance. Then, the process {ξ_t}_t is called φ-mixing if the following mixing coefficient φ_ξ(·) converges to 0 as k tends to infinity:_

$$\varphi_\xi(k) := \sup_{t\in\mathbb{N},\,A\in\mathcal{F}_t} 2\,d_{\mathrm{TV}}\big(\mathbb{P}(\xi_{t+k}\in\cdot\,|\,A),\,\mu\big).$$
Intuitively, the φ-mixing coefficient describes how fast the distribution of the sample ξ_{t+k} converges to the stationary distribution µ when conditioned on the filtration F_t, as the time gap k → ∞. The φ-mixing process can be found in many applications, which involve mixing coefficients that converge to zero at different rates. Below we mention some popular examples, followed by a small numerical illustration.
- Geometric φ-mixing process. Such a type of process has a geometrically diminishing mixing coefficient, i.e., φ_ξ(k) ≤ φ₀ exp(−ck^θ) for some φ₀, c, θ > 0. Examples include finite-state ergodic Markov chains and some aperiodic Harris-recurrent Markov processes (Modha & Masry, 1996; Agarwal & Duchi, 2012; Meyn & Tweedie, 2012);
- Algebraic φ-mixing process. Such a type of process has a polynomially diminishing mixing coefficient, i.e., φ_ξ(k) ≤ φ₀k^{−θ} for some φ₀, θ > 0. Examples include a large class of Metropolis-Hastings samplers (Jarner & Roberts, 2002) and some queuing systems (Agarwal & Duchi, 2012).
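To make this definition concrete, the following minimal sketch (our own illustration, not part of the paper) numerically evaluates φ_ξ(k) for a small finite-state ergodic Markov chain, the canonical geometric φ-mixing example above; the transition matrix `P` is an arbitrary choice of ours, and by the Markov property the supremum over events A ∈ F_t reduces to a maximum over the current state.

```python
import numpy as np

# A small ergodic Markov chain (arbitrary illustrative transition matrix).
P = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.25, 0.25, 0.50]])

# Stationary distribution mu: the left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

def phi(k: int) -> float:
    """phi_xi(k) = sup_{t, A} 2 d_TV(P(xi_{t+k} in . | A), mu); for a Markov
    chain this is the worst case over the current state s, and
    2 d_TV(P^k(s, .), mu) equals the L1 distance between the two."""
    Pk = np.linalg.matrix_power(P, k)
    return max(np.abs(Pk[s] - mu).sum() for s in range(P.shape[0]))

for k in (1, 2, 4, 8, 16):
    print(k, phi(k))  # decays geometrically in k
```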
3 CONVERGENCE OF SGD WITH SUBSAMPLING OVER DEPENDENT DATA
In this section, we study the convergence rate and sample complexity of SGD with data subsampling
update over φ-mixing dependent data. In Section 3.1, we recap the convergence results of the standard SGD over dependent data established in (Agarwal & Duchi, 2012). In Section 3.2, we establish
convergence results of SGD with the data subsampling update.
Throughout, we define the sample complexity as the total number of samples required for the algorithm to output a model w that achieves an ϵ convergence error, i.e., f(w) − f(w*) ≤ ϵ. Also, the standard regret of a stochastic algorithm is defined as

$$\text{(Regret):}\quad R_n := \sum_{t=1}^{n}\big(F(w(t);\xi_t) - F(w^*;\xi_t)\big),$$

where the models {w(1), w(2), ..., w(n)} are generated using the data samples {ξ₁, ξ₂, ..., ξ_n}, respectively, and w* is the minimizer of f(w). For this sequence of models, we make the following mild assumption, which is satisfied by many SGD-type algorithms.
**Assumption 3.1.** _There is a non-increasing sequence {κ(t)}_t such that ∥w(t+1) − w(t)∥ ≤ κ(t)._
3.1 STOCHASTIC GRADIENT DESCENT
Stochastic gradient descent (SGD) is a popular and classical algorithm for stochastic optimization.
In every iteration t, SGD queries a sample ξt from the data-generating process and performs the
following update.
$$\text{(SGD):}\quad w(t+1) = w(t) - \eta_t\nabla F(w(t);\xi_t), \tag{1}$$
where η_t is the learning rate. In Theorem 2 of (Agarwal & Duchi, 2012), the authors established a high-probability convergence error bound for a generic class of stochastic algorithms. Specifically, under Assumptions 2.1 and 3.1, they showed that for any τ ∈ N, with probability at least 1 − δ, the averaged predictor ŵ_n := (1/n) Σ_{t=1}^n w(t) satisfies

$$f(\hat w_n) - f(w^*) \le \frac{R_n}{n} + \frac{(\tau-1)G}{n}\sum_{t=1}^{n-\tau+1}\kappa(t) + \frac{2(\tau-1)GR}{n} + 2GR\sqrt{\frac{2\tau}{n}\log\frac{\tau}{\delta}} + \varphi_\xi(\tau)GR. \tag{2}$$
X
Here, b Rn is the regret of the algorithm of interest, and τ ∈ N is an auxiliary parameter that is introduced to decouple the dependence of the data samples. From the above bound, one can see that
the optimal choice of τ depends on the convergence rate of the mixing coefficient φξ(τ ). Specifically, consider the SGD algorithm in (1). It can be shown that it achieves the regret Rn = O([√]n)
and satisfies κ(t) = O(1/√t) with a proper diminishing learning rate. Consequently, the above
high-probability convergence bound for SGD reduces to
1 _τ_ 1 _τ_
_f_ (wn) _f_ (w[∗]) _−_ _._ (3)
_−_ _≤O_ _√n + infτ_ _∈N_ _√n +_ r _n_ [log][ τ]δ [+][ φ][ξ][(][τ] [)]
 n o
Such a bound further implies the following sample complexity results of SGD under different con
b
vergence rates of the mixing coefficient φξ.
**Corollary 3.2.** _The sample complexity of SGD in (1) for achieving an ϵ convergence error over φ-mixing dependent data is given as follows._

- _If the data is geometric φ-mixing with parameter θ > 0, then we choose τ = O((log 1/ϵ)^{1/θ}). The resulting sample complexity is of the order n = O(ϵ^{−2}(log 1/ϵ)^{2/θ})._
- _If the data is algebraic φ-mixing with parameter θ > 0, then we choose τ = O(ϵ^{−1/θ}). The resulting sample complexity is of the order n = O(ϵ^{−2−2/θ})._
It can be seen that if the data-generating process has a fast, geometrically diminishing mixing coefficient, i.e., the data samples are close to being independent from each other, then the resulting sample complexity is almost the same as that of SGD with i.i.d. samples. On the other hand, if the data-generating process mixes slowly with an algebraically diminishing mixing coefficient, i.e., the data samples are highly dependent, then the data dependence increases the sample complexity by a non-negligible factor of ϵ^{−2/θ}. In particular, such a factor is substantially large if the mixing rate parameter θ is close to zero.
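To close this subsection, here is a minimal sketch of update (1) with averaging (our own illustration, assuming a generic gradient oracle `grad_F` and the ball domain W = {w : ∥w∥ ≤ R}); the diminishing learning rate η_t = η₀/√t is the choice under which R_n = O(√n) and κ(t) = O(1/√t).

```python
import numpy as np

def sgd(data_stream, grad_F, w0, R, eta0=0.1):
    """Update (1): w(t+1) = Proj_W( w(t) - eta_t * grad F(w(t); xi_t) ),
    run over a (possibly dependent) finite stream; returns the averaged
    predictor w_hat_n = (1/n) * sum_t w(t)."""
    w, iterates = np.asarray(w0, dtype=float).copy(), []
    for t, xi in enumerate(data_stream, start=1):
        w = w - (eta0 / np.sqrt(t)) * grad_F(w, xi)
        norm = np.linalg.norm(w)
        if norm > R:                 # projection onto W = {||w|| <= R}
            w *= R / norm
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)
```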
3.2 SGD WITH SUBSAMPLING
When applying SGD to solve stochastic optimization problems over dependent data, the key challenge is that the data dependence introduces a non-negligible bias that slows down the convergence of the algorithm. Hence, a straightforward solution is to reduce the data dependence before performing stochastic optimization. In the existing literature, a simple and useful approach is data subsampling (Nagaraj et al., 2020; Kotsalis et al., 2020). Next, we show that such an approach leads to an improved convergence bound and sample complexity of SGD over highly dependent data.
Specifically, consider a stream of φ-mixing data samples {ξ₁, ξ₂, ξ₃, ...}. Instead of utilizing the entire stream of data, we subsample a subset of this data stream with period r ∈ N and obtain the following subsampled data stream:

$$\{\xi_1, \xi_{r+1}, \xi_{2r+1}, \ldots\}.$$
In particular, let {F_t}_t be the canonical filtration generated by {ξ_{tr+1}}_t. Since consecutive subsampled samples are r time steps away from each other, it is easy to verify that the subsampled data stream {ξ_{tr+1}}_t is also a φ-mixing process with mixing coefficient given by φ_ξ^r(t) = φ_ξ(rt), where φ_ξ^r denotes the mixing coefficient of the subsampled data stream {ξ_{tr+1}}_t. Therefore, by periodically subsampling the data stream, the resulting subsampled process has a faster-converging mixing coefficient. Then, we apply SGD over such subsampled data, i.e.,

$$\text{(SGD with subsampling):}\quad w(t+1) = w(t) - \eta_t\nabla F(w(t);\xi_{tr+1}). \tag{4}$$
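Operationally, the subsampling update (4) only changes which samples reach the SGD step; a minimal sketch (our own illustration, reusing the hypothetical `sgd` helper from the sketch at the end of Section 3.1):

```python
import itertools

def subsample(data_stream, r):
    """Keep xi_1, xi_{r+1}, xi_{2r+1}, ...: every r-th sample of the stream.
    The thinned stream is phi-mixing with coefficient phi_xi^r(t) = phi_xi(r t)."""
    return itertools.islice(data_stream, 0, None, r)

# SGD with subsampling, update (4): plain SGD on the thinned stream.
# w_hat = sgd(subsample(stream, r), grad_F, w0, R)
```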
In particular, the convergence error bound in eq. (2) still holds by replacing φ_ξ(τ) with φ_ξ(rτ), and we obtain the following bound for SGD with subsampling:

$$f(\hat w_n) - f(w^*) \le O\bigg(\frac{1}{\sqrt n} + \inf_{\tau\in\mathbb N}\Big\{\frac{\tau-1}{\sqrt n} + \sqrt{\frac{\tau}{n}\log\frac{\tau}{\delta}} + \varphi_\xi(r\tau)\Big\}\bigg). \tag{5}$$

Such a bound further implies the following sample complexity results of SGD with subsampling under different convergence rates of the mixing coefficient φ_ξ.
**Corollary 3.3.** _The sample complexity of SGD with subsampling in (4) for achieving an ϵ convergence error over φ-mixing dependent data is given as follows._

- _If the data is geometric φ-mixing with parameter θ > 0, then we choose r = O((log 1/ϵ)^{1/θ}) and τ = O(1). The resulting sample complexity is of the order rn = O(ϵ^{−2}(log 1/ϵ)^{1/θ})._
- _If the data is algebraic φ-mixing with parameter θ > 0, then we choose r = O(ϵ^{−1/θ}) and τ = O(1). The resulting sample complexity is of the order rn = O(ϵ^{−2−1/θ})._
Comparing the above sample complexity results with those of the standard SGD in Corollary 3.2, we conclude that data subsampling can improve the sample complexity by a factor of (log 1/ϵ)^{1/θ} and ϵ^{−1/θ} for geometric φ-mixing and algebraic φ-mixing data, respectively. Intuitively, this is because with data subsampling, we can choose a sufficiently large subsampling period r to decouple the data dependence in the term φ_ξ(rτ), as opposed to choosing a large τ in Corollary 3.2. In this way, the order of the dominant term √((τ/n) log(τ/δ)) is reduced. Therefore, when the data is highly dependent, it is beneficial to subsample the dependent data before performing SGD. We also note another advantage of data subsampling: it only requires computing the stochastic gradients of the subsampled data, and therefore can substantially reduce the computation load.
4 CONVERGENCE OF MINI-BATCH SGD OVER DEPENDENT DATA
Although the data-subsampling update scheme studied in the previous section helps improve the
sample complexity of SGD, it does not leverage the full information of all the queried data. In
particular, when the data is highly dependent, we need to choose a large period r to reduce data
dependence, and this will throw away a huge amount of valuable samples. In this section, we study
SGD with another popular update scheme that leverages the full information of all the sampled
data, i.e., the mini-batch update scheme. We show that this simple and popular scheme can effectively reduce data dependence without skipping data samples, and can achieve an improved sample
complexity over SGD with subsampling.
Specifically, we consider a data stream {ξt}t with φ-mixing dependent samples. We rearrange the
data samples into a stream of mini-batches {xt}t, where each mini-batch xt contains B samples,
i.e., x_t = {ξ_{(t−1)B+1}, ξ_{(t−1)B+2}, ..., ξ_{tB}}. Then, we perform the mini-batch SGD update as follows:

$$\text{(Mini-batch SGD):}\quad w(t+1) = w(t) - \frac{\eta_t}{B}\sum_{\xi\in x_t}\nabla F(w(t);\xi). \tag{6}$$
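A minimal sketch of update (6) (our own illustration, with the same hypothetical `grad_F` oracle and ball domain as the earlier sketches; the constant learning rate `eta` is a placeholder, and Theorem C.3, used in Section 5, suggests scaling it like √(B/Σ_j φ_ξ(j))).

```python
import numpy as np

def minibatch_sgd(data_stream, grad_F, w0, R, B, eta=0.01):
    """Update (6): w(t+1) = w(t) - (eta/B) * sum_{xi in x_t} grad F(w(t); xi),
    where x_t is the t-th block of B consecutive (dependent) samples."""
    w, iterates, batch = np.asarray(w0, dtype=float).copy(), [], []
    for xi in data_stream:
        batch.append(xi)
        if len(batch) == B:          # mini-batch x_t is complete
            g = np.mean([grad_F(w, s) for s in batch], axis=0)
            w = w - eta * g
            norm = np.linalg.norm(w)
            if norm > R:             # projection onto W = {||w|| <= R}
                w *= R / norm
            iterates.append(w.copy())
            batch = []
    return np.mean(iterates, axis=0)  # averaged predictor
```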
Performing SGD updates with mini-batch data has several advantages. First, it substantially reduces the optimization variance and allows the use of a large learning rate to facilitate the convergence of the algorithm. As a comparison, SGD with subsampling still suffers from a large optimization variance. Second, unlike SGD with subsampling, mini-batch SGD utilizes the information of all the data samples to improve the performance of the model. Moreover, as we show in the following lemma, the mini-batch update substantially reduces the stochastic bias caused by the data dependence. In the sequel, we denote F(w; x) := (1/B) Σ_{ξ∈x} F(w; ξ) as the average loss on a mini-batch of samples x. With a slight abuse of notation, we also define {F_t}_t as the canonical filtration generated by the mini-batch samples {x_t}_t.
**Lemma 4.1.** _Let Assumption 2.1 hold and consider the mini-batch data stream {x_t}_t. Then, for any w, v ∈ W measurable with respect to F_t and any τ ∈ N, it holds that_

$$\Big|\mathbb{E}\big[F(w;x_{t+\tau}) - F(v;x_{t+\tau})\,\big|\,\mathcal{F}_t\big] - \big(f(w) - f(v)\big)\Big| \le \frac{GR}{B}\sum_{i=1}^{B}\varphi_\xi(\tau B+i). \tag{7}$$
With dependent data, the above lemma shows that we can approximate the population risk f(w) by the conditional expectation E[F(w;x_{t+τ})|F_t], which involves the mini-batch x_{t+τ} that is τ steps ahead of the filtration F_t. Intuitively, by the φ-mixing property, as τ gets larger, the distribution of x_{t+τ} conditioned on F_t gets closer to the stationary distribution µ. In general, the estimation bias (GR/B) Σ_{i=1}^B φ_ξ(τB + i) depends on both the batch size and the accumulated mixing coefficient over the corresponding batch of samples. To provide a concrete understanding, below we calculate the estimation bias for several different mixing models; a numerical check follows the list.
- Geometric φ-mixing: In this case, Σ_{i=1}^B φ_ξ(τB+i) ≤ Σ_{i=1}^∞ φ_ξ(i) = O(1). Hence, the estimation bias is of the order O(GR/B).
- Fast algebraic φ-mixing (θ ≥ 1): In this case, Σ_{i=1}^B φ_ξ(τB+i) ≤ Σ_{i=1}^∞ φ_ξ(i) = Õ(1). Hence, the estimation bias is of the order Õ(GR/B), where Õ hides all logarithm factors.
- Slow algebraic φ-mixing (0 < θ < 1): In this case, Σ_{i=1}^B φ_ξ(τB+i) ≤ O((τB)^{1−θ}). Hence, the estimation bias is of the order O(GRτ^{1−θ}B^{−θ}).
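As a quick numerical check of the three regimes (our own illustration, taking φ_ξ(k) = k^{−θ} exactly rather than as an upper bound), the factor (1/B) Σ_{i=1}^B φ_ξ(τB+i) scales as follows:

```python
import numpy as np

def batch_bias_factor(theta, B, tau=1):
    """(1/B) * sum_{i=1}^{B} phi(tau*B + i) with phi(k) = k**(-theta)."""
    i = np.arange(1, B + 1)
    return np.sum((tau * B + i) ** (-theta)) / B

for theta in (2.0, 1.0, 0.5):        # fast algebraic, boundary, slow algebraic
    print(theta, [batch_bias_factor(theta, B) for B in (10, 100, 1000)])
# theta >= 1: the factor is O(1/B) (up to a log factor at theta = 1);
# theta = 0.5: it only decays like B**(-theta), matching O(tau^{1-theta} B^{-theta}).
```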
It can be seen that if the mixing coefficient converges fast, i.e., either geometrically or fast algebraically, then the data dependence has a negligible impact on the estimation error. Consequently, choosing a large batch size can substantially reduce the estimation bias. On the other hand, when the mixing coefficient converges slowly (algebraically with 0 < θ < 1), the data dependence substantially increases the estimation bias, but it is still beneficial to use a large batch size. This result shows that the mini-batch update can effectively reduce the statistical bias of stochastic approximation for a wide spectrum of dependent data-generating processes.

We now prove the following convergence error bound for mini-batch SGD over dependent data.
**Theorem 4.2.** _Let Assumptions 2.1 and 3.1 hold. Apply mini-batch SGD to solve the stochastic optimization problem (P) over φ-mixing dependent data and assume that it achieves regret R_n. Then, for any τ ∈ N and any minimizer w*, with probability at least 1 − δ, the averaged predictor ŵ_n := (1/n) Σ_{t=1}^n w(t) satisfies_

$$f(\hat w_n) - f(w^*) \le \frac{R_n}{n} + \frac{G(\tau-1)}{n}\sum_{t=1}^{n-\tau+1}\kappa(t) + \frac{GR(\tau-1)}{n} + O\Bigg(\frac{1}{B}\sum_{i=1}^{B}\varphi(\tau B+i) + \sqrt{\frac{\tau}{nB}\log\frac{\tau}{\delta}\log\frac{n}{\delta}}\bigg[B^{-\frac{1}{4}} + \Big(\sum_{i=1}^{B}\varphi(i)\Big)^{\frac{1}{4}}\bigg]\Bigg). \tag{8}$$
To further understand the order of the above bound, a standard regret analysis shows that mini-batch SGD achieves a regret of the order R_n/n = Õ(√(Σ_{j=1}^n φ(j)/(nB))) and satisfies κ(t) ≡ O(√(B/n)) (see Theorem C.3 for the proof). Consequently, the above convergence error bound reduces to the following bound, where we hide all logarithm factors for simplicity of presentation:

$$f(\hat w_n) - f(w^*) \le \widetilde O\Bigg(\sqrt{\frac{\sum_{j=1}^{n}\varphi(j)}{nB}} + \frac{GR(\tau-1)}{n} + \frac{1}{B}\sum_{i=1}^{B}\varphi(\tau B+i) + \sqrt{\frac{\tau}{nB}}\bigg[B^{-\frac{1}{4}} + \Big(\sum_{i=1}^{B}\varphi(i)\Big)^{\frac{1}{4}}\bigg]\Bigg). \tag{9-10}$$
Such a bound further implies the following sample complexity results of mini-batch SGD under
different convergence rates of the mixing coefficient φξ.
**Corollary 4.3. The sample complexity of mini-batch SGD in (6) for achieving an ϵ convergence**
_error over φ-mixing dependent data is given as follows._
- _If the data is geometric φ-mixing with parameter θ > 0, then we choose τ = 1, B = O(ϵ^{−1}), n = O(ϵ^{−1}). The overall sample complexity is nB = O(ϵ^{−2})._
- _If the data is fast algebraic φ-mixing with parameter θ ≥ 1, then we choose τ = 1, B = O(ϵ^{−1}), n = O(ϵ^{−1}). The overall sample complexity is nB = Õ(ϵ^{−2})._
- _If the data is slow algebraic φ-mixing with parameter 0 < θ < 1, then we choose τ = 1, B = O(ϵ^{−1/θ}), n = O(ϵ^{−1}). The overall sample complexity is nB = O(ϵ^{−1−1/θ})._
It can be seen that mini-batch SGD achieves an order-wise lower sample complexity than SGD with subsampling in the full spectrum of φ-mixing convergence rates. Specifically, mini-batch SGD improves the sample complexity over that of SGD with subsampling by factors of O((log 1/ϵ)^{1/θ}), O(ϵ^{−1/θ}), and O(ϵ^{−1}) for geometric φ-mixing, fast algebraic φ-mixing, and slow algebraic φ-mixing data samples, respectively. This shows that the mini-batch update can effectively reduce the bias caused by data dependence and leverage the full information of all the data samples to improve the learning performance.
Intuitively, this is because with mini-batch updates, we can choose a sufficiently large batch size B to reduce the bias caused by the data dependence and choose a small auxiliary parameter τ = 1. As a comparison, to control the bias caused by data dependence, the standard SGD needs to choose a very large τ, and SGD with subsampling needs to choose a large subsampling period r that skips a huge amount of valuable data samples, especially when the mixing coefficient converges slowly. Therefore, our result proves that it is beneficial to use mini-batch stochastic gradient updates when the data samples are highly dependent.
We note that our proof of the tight high-probability bound in Theorem 4.2 for mini-batch SGD
involves substantial new developments compared with the proof of (Agarwal & Duchi, 2012). Next,
we elaborate on our technical novelty.
- In (Agarwal & Duchi, 2012), the authors defined the following random variable:
  $$X_t^i := f\big(w((t-1)\tau+i)\big) - f(w^*) + F\big(w^*;\xi_{t+\tau-1}\big) - F\big(w((t-1)\tau+i);\xi_{t+\tau-1}\big).$$
  As this random variable involves only one sample ξ_{t+τ−1}, they bound the bias term X_t^i − E[X_t^i|F_{t−1}^i] by a universal constant. As a comparison, the random variable X_t^i involves a mini-batch of samples x_{t+τ−1} in our analysis. With the mini-batch structure, the bias X_t^i − E[X_t^i|F_{t−1}^i] can be written as an average of B zero-mean dependent random variables, which is close to zero with high probability due to the concentration phenomenon. Consequently, we are able to apply a Bernstein-type inequality developed in (Delyon et al., 2009) for dependent stochastic processes to obtain an improved bias bound, from O(1) to Õ(1/√B). This is critical for obtaining the improved bound.
- Second, with the improved high-probability bias bound mentioned above, the remaining proof of (Agarwal & Duchi, 2012) no longer holds. Specifically, we can no longer apply Azuma's inequality to bound the accumulated bias Σ_t(X_t^i − E[X_t^i|F_{t−1}^i]), as each bias term is no longer bounded with probability one. To address this issue, we develop a generalized Azuma's inequality for martingale differences in Lemma B.3, based on Proposition 34 of (Tao et al., 2015) for independent zero-mean random variables.
- Third, we develop a high-probability regret bound for mini-batch SGD over dependent data so that it can be integrated with the high-probability convergence bound in Theorem 4.2. To the best of our knowledge, the regret of SGD over dependent data has not been studied before.
5 NUMERICAL EXAMPLE
We examine our theory via a basic convex quadratic optimization problem, which is written as

$$\min_{w\in\mathbb{R}^d} f(w) := \mathbb{E}_{\xi\sim\mu}\big[(w-\xi)^\top A(w-\xi)\big],$$

where A ⪰ 0 is a fixed positive semi-definite matrix and µ is the uniform distribution on [0,1]^d. Then, following the construction in (Jarner & Roberts, 2002), we generate an algebraic φ-mixing Markov chain that has the stationary distribution µ. In particular, its mixing coefficient φ_ξ(k) converges at a sublinear rate k^{−1/r}, where r > 0 is a parameter that controls the speed of convergence. Please refer to Appendix D for more details of the experiment setup.
We first estimate the following stochastic bias at the fixed origin point w = 0_d:

$$\text{(Bias):}\quad \Big|\mathbb{E}\big[F(w;x_\tau)\,\big|\,x_0 = 0_d\big] - f(w)\Big|,$$
where the expectation is taken over the randomness of the mini-batch of samples queried at time
_τ ∈_ N. Such a bias is affected by several factors, including the time gap τ, the batch size B and the
convergence rate parameter r of the mixing coefficient.
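The exact chain of Appendix D is not reproduced here, so the sketch below (our own illustration) uses a hypothetical lazy chain `step` as a stand-in for the dependent data-generating process and shows how such a Monte Carlo estimate of the bias at w = 0_d can be formed; `A`, `d`, and the refresh probability are placeholder choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.eye(d)                        # placeholder for the fixed PSD matrix A

def F(w, xi):                        # sample loss F(w; xi) = (w - xi)^T A (w - xi)
    z = w - xi
    return z @ A @ z

# Closed form of f(0) for mu = Uniform[0,1]^d:
# E[xi_i xi_j] = 1/4 (i != j) and 1/3 (i == j), so f(0) = A.sum()/4 + trace(A)/12.
f0 = A.sum() / 4 + np.trace(A) / 12

def step(state):
    """Hypothetical stand-in for one transition of a slowly mixing chain with
    stationary distribution Uniform[0,1]^d: refresh from mu with small probability."""
    return rng.random(d) if rng.random() < 0.1 else state

def bias(tau, B, n_mc=2000):
    """Monte Carlo estimate of |E[F(0; x_tau) | x_0 = 0_d] - f(0)|."""
    w, total = np.zeros(d), 0.0
    for _ in range(n_mc):
        s = np.zeros(d)              # condition on the chain sitting at 0_d
        for _ in range((tau - 1) * B):
            s = step(s)              # advance to the start of mini-batch x_tau
        vals = []
        for _ in range(B):           # the B samples forming x_tau
            s = step(s)
            vals.append(F(w, s))
        total += np.mean(vals)
    return abs(total / n_mc - f0)

print(bias(tau=1, B=1), bias(tau=1, B=100))  # larger B => smaller bias
```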
In Figure 1, we investigate the impact of these factors on the stochastic bias, and we use 10k Monte
Carlo samples to estimate the stochastic bias. The left two figures fix the batch size, and it can be
seen that the bias decreases as τ increases, which matches the definition of the φ-mixing property.
Also, a faster-mixing Markov chain (i.e., smaller r) leads to a smaller bias. In particular, with batch
size B = 1 and a slow-mixing chain r = 2, it takes an unacceptably large τ to achieve a relatively
small bias. This provides an empirical justification for Corollary 3.2 and explains why the standard
SGD suffers from a high sample complexity over highly dependent data. Moreover, as the batch
size gets larger, one can achieve a numerically smaller bias, which matches our Lemma 4.1. The
right two figures fix the convergence rate parameter of the mixing coefficient, and it can be seen
that increasing the batch size significantly reduces the bias. Consequently, instead of choosing a
large τ to reduce the bias, one can simply choose a large batch size B = 100 and set τ = 1. This
observation matches and justifies our theoretical results in Corollary 4.3.
Figure 1: Impact of τ, batch size B and convergence rate of mixing coefficient on the bias.
We further compare the convergence of SGD, SGD with subsampling, and mini-batch SGD. Here, we set r = 2 to generate highly dependent data samples. We set the learning rate η = 0.01 for both SGD and SGD with subsampling, and set the learning rate η = 0.01 × √(B/Σ_{j=1}^B φ_ξ(j)) = 0.01 × 100^{1/4} for mini-batch SGD with batch size B = 100, as suggested by Theorem C.3 in the appendix. The results are plotted in Figure 2, where each curve corresponds to the mean of 100 independent trials. It can be seen that SGD with subsampling asymptotically achieves a lower loss than the standard SGD, due to the use of less dependent data. Moreover, mini-batch SGD achieves the smallest asymptotic loss. All these observations are consistent with our theoretical results.
Figure 2: Comparison of the sample complexity of different SGD algorithms.
6 CONCLUSION
In this study, we investigate the convergence property of SGD under various popular stochastic
update schemes over highly dependent data. Unlike the conventional i.i.d. data setting in which the
stochastic update schemes do not affect the sample complexity of SGD, the convergence of SGD
in the data-dependent setting critically depends on the structure of the stochastic update scheme. In
particular, we show that both data subsampling and mini-batch sampling can substantially improve
the sample complexity of SGD over highly dependent data. Our study takes one step forward toward
understanding the theoretical limits of stochastic optimization over dependent data, and it opens
many directions for future study. For example, it is interesting to further explore the impact of
algorithm structure on the sample complexity of stochastic reinforcement learning algorithms. Also,
it is important to develop advanced algorithm update schemes that can facilitate the convergence of
learning over highly dependent data.
REFERENCES
Alekh Agarwal and John C Duchi. The generalization ability of online algorithms for dependent
data. IEEE Transactions on Information Theory, 59(1):573–587, 2012.
Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and
algorithms. https://rltheorybook.github.io/, 2021.
Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta (eds.), Proc. COMPSTAT, pp. 177–186, 2010.
Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. arXiv:1905.11425, 2019.
Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD(0) with function approximation. In Proc. AAAI Conference on Artificial Intelligence, 2018.
Vianney Debavelaere, Stanley Durrleman, and Stéphanie Allassonnière. On the convergence of stochastic approximations under a subgeometric ergodic Markov dynamic. Electronic Journal of Statistics, 15(1):1583–1609, 2021.
Bernard Delyon et al. Exponential inequalities for sums of weakly dependent variables. Electronic Journal of Probability, 14:752–779, 2009.
Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier. Markov chains. Springer, 2018.
John Duchi, Alekh Agarwal, Mikael Johansson, and Michael Jordan. Ergodic subgradient descent.
In Allerton Conference, 2011.
Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003.
Peter W. Glynn and Sean P. Meyn. A Lyapunov bound for solutions of the Poisson equation. The Annals of Probability, 24(2):916–931, 1996.
Antoine Godichon-Baggioni, Nicklas Werge, and Olivier Wintenberger. Non-asymptotic analysis of stochastic approximation algorithms for streaming data. arXiv:2109.07117, 2021.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Elad Hazan. Introduction to Online Convex Optimization. 2017. ISBN 1521133301.
Søren F Jarner and Gareth O Roberts. Polynomial convergence rates of Markov chains. The Annals of Applied Probability, 12(1):224–247, 2002.
Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. In Conference on Learning Theory, pp. 2144–2203. PMLR, 2020.
Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of
biased stochastic approximation scheme. In Proc. Conference on Learning Theory, pp. 1944–
1974, 2019.
Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning. arXiv:2011.02987, 2020.
Harshat Kumar, Alec Koppel, and Alejandro Ribeiro. On the sample complexity of actor-critic method for reinforcement learning with function approximation. arXiv:1910.08412, 2019.
Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Tightening the dependence on horizon in the sample complexity of Q-learning. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 6296–6306. PMLR, 2021.
Kurt Marti. Stochastic optimization of regulators. Computers and Structures, 180:40–51, February
2017.
Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with
function approximation. In Proc. International Conference on Machine Learning, pp. 664–671,
2008.
Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science &
Business Media, 2012.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Dharmendra S Modha and Elias Masry. Minimum complexity regression estimation with weakly
dependent observations. IEEE Transactions on Information Theory, 42(6):2133–2145, 1996.
Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms. In Proc. Advances in Neural Information Processing Systems, volume 33, 2020.
Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On the finite-time convergence of actor-critic algorithm. In NeurIPS Optimization Foundations for Reinforcement Learning Workshop, 2019.
Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Proc. Conference on Learning Theory, pp. 3185–3205, 2020.
Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. Proc. Advances in Neural Information Processing Systems, 22:1768–1776, 2009.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Bradford, 2018.
Terence Tao, Van Vu, et al. Random matrices: universality of local spectral statistics of non-Hermitian matrices. Annals of Probability, 43(2):782–874, 2015.
Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global
optimality and rates of convergence. In International Conference on Learning Representations,
2019.
Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two
time-scale actor-critic methods. In Proc. Advances in Neural Information Processing Systems
(NeurIPS), volume 33, pp. 17617–17628, 2020.
Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. Proc.
Advances in Neural Information Processing Systems, 22:2116–2124, 2009.
Pan Xu and Quanquan Gu. A finite-time analysis of Q-learning with neural network function approximation. In Proc. International Conference on Machine Learning, pp. 10555–10565, 2020.
Tengyu Xu, Shaofeng Zou, and Yingbin Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.
Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural)
actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems (NeurIPS),
volume 33, 2020.
Zhuoran Yang, Yongxin Chen, Mingyi Hong, and Zhaoran Wang. Provably global convergence of
actor-critic: A case for linear quadratic regulator with ergodic cost. In Proc. Advances in Neural
Information Processing Systems (NeurIPS), 2019.
Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear
function approximation. In Proc. Advances in Neural Information Processing Systems, pp. 8665–
8675, 2019.
# Appendix
### Table of Contents
**A Proof of Corollary 3.3**

**B Proof of Theorem 4.2**
B.1 Key Lemmas
B.2 Proof of the Main Result

**C Regret Analysis of Mini-Batch SGD**

**D Experiment Setup**
**Notation:** To simplify the notation, throughout the appendix we denote ξ_t^{(i)} := ξ_{(t−1)B+i}, which corresponds to the i-th data sample of the t-th mini-batch x_t. With this notation, we have x_t = {ξ_t^{(1)}, ξ_t^{(2)}, ..., ξ_t^{(B)}}.
A PROOF OF COROLLARY 3.3
In this section, we analyze the convergence error bound of SGD with data subsampling in (4). Given a φ_ξ-mixing data stream {ξ₁, ξ₂, ξ₃, ...}, we consider the following subsampled data stream:

$$\{\xi_1, \xi_{r+1}, \xi_{2r+1}, \ldots\}.$$

Let F be the canonical filtration generated by the subsampled stream. Then the subsampled data stream {ξ_{tr+1}}_t is φ_ξ^r-mixing with the mixing coefficient given by

$$\varphi_\xi^r(t) = \varphi_\xi(rt).$$

With this mixing coefficient, we can apply Theorem 2 of (Agarwal & Duchi, 2012) and obtain the following convergence error bound for any τ ∈ N:

$$f(\hat w_n) - f(w^*) \le O\bigg(\frac{R_n}{n} + \frac{\tau-1}{n}\sum_{t=1}^{n}\kappa(t) + \frac{\tau}{n} + \sqrt{\frac{\tau}{n}\log\frac{\tau}{\delta}} + \varphi_\xi(r\tau)\bigg).$$

Considering the standard SGD with a diminishing learning rate, we have κ(t) = O(1/√t) and R_n = O(√n). Then, the convergence error bound becomes

$$f(\hat w_n) - f(w^*) \le O\bigg(\frac{1}{\sqrt n} + \inf_{\tau\in\mathbb N}\Big\{\frac{\tau-1}{\sqrt n} + \sqrt{\frac{\tau}{n}\log\frac{\tau}{\delta}} + \varphi_\xi(r\tau)\Big\}\bigg).$$

The above result further implies the following sample complexity results for different convergence rates of the mixing coefficient.

- Geometric φ-mixing: In this case, φ_ξ(k) ≤ O(exp(−k^θ)) for some θ > 0. Setting the last term φ_ξ(rτ) = O(ϵ), we obtain rτ = O((log 1/ϵ)^{1/θ}). Further setting the second term (τ−1)/√n = O(ϵ), we obtain nτ^{−2} = O(ϵ^{−2}). By choosing τ = O(1), the sample complexity is of the order
  $$\epsilon\text{-complexity} = r\cdot n = O\Big(\big(\log\tfrac{1}{\epsilon}\big)^{\frac{1}{\theta}}\tau^2\epsilon^{-2}\Big) = O\Big(\epsilon^{-2}\big(\log\epsilon^{-1}\big)^{\frac{1}{\theta}}\Big).$$
- Algebraic φ-mixing: In this case, φ_ξ(k) ≤ O(k^{−θ}) for some θ > 0. Setting the last term φ_ξ(rτ) = O(ϵ), we obtain rτ = O(ϵ^{−1/θ}). Setting the second term (τ−1)/√n = O(ϵ), we obtain nτ^{−2} = O(ϵ^{−2}). By setting τ = O(1), the sample complexity is of the order
  $$\epsilon\text{-complexity} = r\cdot n = O\big(\epsilon^{-\frac{1}{\theta}}\tau^2\epsilon^{-2}\big) = O\big(\epsilon^{-2-\frac{1}{\theta}}\big).$$
B PROOF OF THEOREM 4.2
B.1 KEY LEMMAS
In this subsection, we present several useful preliminary results for proving Theorem 4.2. Define N := {1, 2, 3, ...}. Throughout this subsection, we assume that Assumption 2.1 holds. The following lemma is a generalization of Lemma 1 in (Agarwal & Duchi, 2012).

**Lemma B.1.** _Let w, v be measurable with respect to F_t. Then for any τ ∈ N,_

$$\mathbb{E}\big[F(w;x_{t+\tau}) - F(v;x_{t+\tau})\,\big|\,\mathcal{F}_t\big] \le \frac{GR}{B}\sum_{i=1}^{B}\varphi_\xi(\tau B+i) + f(w) - f(v).$$

_Proof._ For any τ ∈ N, we consider the following decomposition:

$$\mathbb{E}[F(w;x_{t+\tau}) - F(v;x_{t+\tau})\,|\,\mathcal{F}_t] = \mathbb{E}[F(w;x_{t+\tau}) - f(w) + f(v) - F(v;x_{t+\tau})\,|\,\mathcal{F}_t] + f(w) - f(v)$$

$$= \underbrace{\frac{1}{B}\sum_{i=1}^{B}\Big[\int F(w;\xi)\,\mathrm{d}\mathbb{P}(\xi_{t+\tau}^{(i)}\in\cdot\,|\,\mathcal{F}_t) - \int F(w;\xi)\,\mathrm{d}\mu\Big] - \frac{1}{B}\sum_{i=1}^{B}\Big[\int F(v;\xi)\,\mathrm{d}\mathbb{P}(\xi_{t+\tau}^{(i)}\in\cdot\,|\,\mathcal{F}_t) - \int F(v;\xi)\,\mathrm{d}\mu\Big]}_{(A)} + f(w) - f(v).$$

We can further bound the term (A) using the mixing property of the dependent data stream:

$$(A) = \frac{1}{B}\sum_{i=1}^{B}\int\big(F(w;\xi) - F(v;\xi)\big)\,\mathrm{d}\big(\mathbb{P}(\xi_{t+\tau}^{(i)}\in\mathrm{d}\xi\,|\,\mathcal{F}_t) - \mu(\mathrm{d}\xi)\big) \le \frac{1}{B}\sum_{i=1}^{B}GR\int\Big|\mathrm{d}\big(\mathbb{P}(\xi_{t+\tau}^{(i)}\in\mathrm{d}\xi\,|\,\mathcal{F}_t) - \mu(\mathrm{d}\xi)\big)\Big| \le \frac{GR}{B}\sum_{i=1}^{B}\varphi_\xi(\tau B+i),$$

where in the first inequality we use the facts that F(·;ξ) is G-Lipschitz and the domain is bounded by R, and the second inequality is implied by the φ-mixing property. Substituting the above upper bound on (A) into the previous equation yields

$$\mathbb{E}[F(w;x_{t+\tau}) - F(v;x_{t+\tau})\,|\,\mathcal{F}_t] \le \frac{GR}{B}\sum_{i=1}^{B}\varphi_\xi(\tau B+i) + f(w) - f(v).$$

This completes the proof.
**Proposition B.2.** _Let {w(t)}_{t∈N} be the model parameter sequence generated by (6). Also suppose that Assumption 3.1 holds. Then for any τ ∈ N, we have_

$$\sum_{t=1}^{n}[f(w(t)) - f(w^*)] \le \sum_{t=1}^{n}[f(w(t)) - F(w(t);x_{t+\tau-1}) + F(w^*;x_{t+\tau-1}) - f(w^*)] + R_n + G(\tau-1)\sum_{t=1}^{n-\tau+1}\kappa(t) + GR(\tau-1).$$
_Proof._ For any τ ∈ N, we consider the following decomposition:

$$\sum_{t=1}^{n}[f(w(t)) - f(w^*)] = \sum_{t=1}^{n}[f(w(t)) - F(w(t);x_{t+\tau-1}) + F(w^*;x_{t+\tau-1}) - f(w^*)] + \underbrace{\sum_{t=1}^{n}\big[F(w(t);x_{t+\tau-1}) - F(w^*;x_{t+\tau-1})\big]}_{(B)}. \tag{11}$$

We will keep the first term and bound the term (B):

$$(B) = \underbrace{\sum_{t=1}^{n}[F(w(t);x_t) - F(w^*;x_t)]}_{(B1)} + \underbrace{\sum_{t=1}^{n-\tau+1}[F(w(t);x_{t+\tau-1}) - F(w(t+\tau-1);x_{t+\tau-1})]}_{(B2)}$$
$$\qquad + \underbrace{\sum_{t=n-\tau+2}^{n}F(w(t);x_{t+\tau-1}) - \sum_{t=1}^{\tau-1}F(w(t);x_t) + \sum_{t=1}^{\tau-1}F(w^*;x_t) - \sum_{t=n+1}^{n+\tau-1}F(w^*;x_t)}_{(B3)}.$$

Recall that the term (B1) is the regret R_n. We can bound the term (B2) by noting that

$$F(w(t);x_{t+\tau-1}) - F(w(t+\tau-1);x_{t+\tau-1}) \le G\|w(t+\tau-1) - w(t)\| \le G\sum_{i=0}^{\tau-2}\|w(t+i+1) - w(t+i)\| \le G\sum_{i=0}^{\tau-2}\kappa(t+i) \le G(\tau-1)\kappa(t).$$

For the term (B3), we can bound it using the G-Lipschitzness of F(·;ξ) and the R-bounded domain. Since Σ_{t=n+1}^{n+τ−1} F(w*;x_t) = Σ_{t=n−τ+2}^{n} F(w*;x_{t+τ−1}), we have

$$(B3) = \Big[\sum_{t=n-\tau+2}^{n}\big(F(w(t);x_{t+\tau-1}) - F(w^*;x_{t+\tau-1})\big)\Big] - \Big[\sum_{t=1}^{\tau-1}\big(F(w(t);x_t) - F(w^*;x_t)\big)\Big] \le G\sum_{t=n-\tau+2}^{n}\|w(t) - w^*\| + G\sum_{t=1}^{\tau-1}\|w(t) - w^*\| \le GR(\tau-1).$$

Combining the above bounds on (B1), (B2), and (B3), we obtain the upper bound

$$(B) \le R_n + G(\tau-1)\sum_{t=1}^{n-\tau+1}\kappa(t) + GR(\tau-1).$$

The proof is then completed by substituting this upper bound on (B) into (11).
The following generalized Azuma's inequality extends Proposition 34 of (Tao et al., 2015). The inequality can be used to bound sums of martingale difference random variables.

**Lemma B.3** (Generalized Azuma's Inequality)**.** _Let {X_t} be a martingale difference sequence with respect to its canonical filtration F. Define Y := Σ_{i=1}^T X_i and assume E|Y| < ∞. Then for any {α_t}_t > 0,_

$$\mathbb{P}\bigg(|Y - \mathbb{E}Y| \ge \lambda\sqrt{\sum_{t=1}^{T}\alpha_t^2}\bigg) \le 2\exp\Big(-\frac{\lambda^2}{2}\Big) + \sum_{t=1}^{T}\mathbb{P}(|X_t| \ge \alpha_t).$$

_Proof._ Let T₀ := min{t : |X_t| > α_t} (we write T₀ to distinguish the stopping time from the number of summands T). Then the sets B_t := {ω : T₀(ω) = t} are disjoint. Construct

$$Y'(\omega) := \begin{cases} Y(\omega) & \text{if } \omega \in \big(\bigcup_{t=1}^{T}B_t\big)^C, \\ \mathbb{E}[Y\,|\,B_t] & \text{if } \omega \in B_t \text{ for some } t \in \{1, 2, \ldots, T\}. \end{cases}$$

By the above construction, the associated Doob martingale of Y′ with respect to F is {Z_t := Σ_{i=1}^{t∧T₀} X_i}. It satisfies the conditions of Azuma's inequality, i.e.,

- {Z_t} forms a martingale with respect to F (because the stopped martingale is still a martingale);
- |Z_t − Z_{t−1}| ≤ α_t.

Then we can apply Azuma's inequality to Y′:

$$\mathbb{P}\bigg(|Y' - \mathbb{E}Y'| \ge \lambda\sqrt{\sum_{t=1}^{T}\alpha_t^2}\bigg) \le 2\exp\Big(-\frac{\lambda^2}{2}\Big).$$

Now we can bound P(|Y − EY| ≥ λ√(Σ_{t=1}^T α_t²)) as follows:

$$\mathbb{P}\bigg(|Y - \mathbb{E}Y| \ge \lambda\sqrt{\sum_{t=1}^{T}\alpha_t^2}\bigg) = \mathbb{P}\bigg(|Y - \mathbb{E}Y| \ge \lambda\sqrt{\sum_{t=1}^{T}\alpha_t^2},\; Y = Y'\bigg) + \mathbb{P}\bigg(|Y - \mathbb{E}Y| \ge \lambda\sqrt{\sum_{t=1}^{T}\alpha_t^2},\; Y \neq Y'\bigg)$$
$$\le \mathbb{P}\bigg(|Y' - \mathbb{E}Y'| \ge \lambda\sqrt{\sum_{t=1}^{T}\alpha_t^2}\bigg) + \mathbb{P}(Y \neq Y') \le 2\exp\Big(-\frac{\lambda^2}{2}\Big) + \sum_{t=1}^{T}\mathbb{P}(|X_t| \ge \alpha_t).$$

This completes the proof; here we use the fact that EY′ = EY by our construction.
The following lemma is taken from Eq. (22) in Theorem 4 of (Delyon et al., 2009).

**Lemma B.4** (Bernstein's Inequality for Dependent Processes)**.** _Let {Z_t} be a centered adaptive process with respect to F. Define the following quantities:_

$$q := \sum_{k=1}^{n}\sum_{i=1}^{k-1}\|Z_i\|_\infty\cdot\|\mathbb{E}[Z_k\,|\,\mathcal{F}_i]\|_\infty, \qquad v := \sum_{k=1}^{n}\|\mathbb{E}[Z_k^2\,|\,Z_{k-1},\ldots,Z_1]\|_\infty, \qquad m := \sup_{1\le i\le n}\|Z_i\|_\infty.$$

_Then, it holds that_

$$\mathbb{P}\bigg(\sum_{i=1}^{n}Z_i \ge t\bigg) \le \exp\Big(-\frac{t^2}{2(v+2q) + 2tm/3}\Big).$$
**Application of Lemma B.4 to our proof.** Here we make some comments about how to apply this inequality in our main proof. Throughout, we use the batch-level filtration F and the intra-batch filtration F̂; the formal definitions are given in Section B.2. We define the following random variable in our proof:

$$X_t^i = f\big(w((t-1)\tau+1)\big) - f(w^*) + F(w^*;x_{t\tau+i-1}) - F\big(w((t-1)\tau+1);x_{t\tau+i-1}\big).$$

We also define the filtration F_t^i := F_{tτ+i−1} for simplicity. Then, we have

$$\mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i] = f\big(w((t-1)\tau+1)\big) - f(w^*) + \mathbb{E}\big[F(w^*;x_{t\tau+i-1}) - F\big(w((t-1)\tau+1);x_{t\tau+i-1}\big)\,\big|\,\mathcal{F}_{t-1}^i\big].$$

Then, the bias can be rewritten as

$$X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i] = F(w^*;x_{t\tau+i-1}) - F\big(w((t-1)\tau+1);x_{t\tau+i-1}\big) - \mathbb{E}\big[F(w^*;x_{t\tau+i-1}) - F\big(w((t-1)\tau+1);x_{t\tau+i-1}\big)\,\big|\,\mathcal{F}_{t-1}^i\big] = \frac{1}{B}\sum_{\xi\in x_{t\tau+i-1}} Y_t^i(\xi),$$

where Y_t^i is defined as

$$Y_t^i(\xi) = F(w^*;\xi) - F\big(w((t-1)\tau+1);\xi\big) - \mathbb{E}\big[F(w^*;\xi) - F\big(w((t-1)\tau+1);\xi\big)\,\big|\,\mathcal{F}_{t-1}^i\big].$$

More specifically, we have

$$X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i] = \frac{1}{B}\sum_{\xi\in x_{t\tau+i-1}} Y_t^i(\xi) = \frac{1}{B}\sum_{j=1}^{B} Y_t^i\big(\xi_{t\tau+i-1}^{(j)}\big).$$

Recall that F̂ is the canonical filtration generated from the data stream (12). Moreover, {Y_t^i(ξ_{tτ+i−1}^{(j)})}_{j=1,2,...,B} is centered and adaptive with respect to this filtration. Then we can evaluate the quantities q, v, and m in Lemma B.4 (whose number of summands is here B) as follows.

- Bounding m is simple. By Assumption 2.1, we have ∥Y_t^i(ξ_{tτ+i−1}^{(j)})∥ ≤ 2GR.
- The above bound on m leads to a simple bound for v, i.e., v ≤ 2BG²R².
- The quantity q can be bounded as follows:

$$q := \sum_{k=1}^{B}\sum_{j=1}^{k-1}\big\|Y_t^i\big(\xi_{t\tau+i-1}^{(j)}\big)\big\|_\infty\,\big\|\mathbb{E}\big[Y_t^i\big(\xi_{t\tau+i-1}^{(k)}\big)\,\big|\,\widehat{\mathcal{F}}_{t\tau+i-1}^{(j)}\big]\big\|_\infty \le 2GR\sum_{k=1}^{B}\sum_{j=1}^{k-1}\big\|\mathbb{E}\big[Y_t^i\big(\xi_{t\tau+i-1}^{(k)}\big)\,\big|\,\widehat{\mathcal{F}}_{t\tau+i-1}^{(j)}\big]\big\|_\infty$$
$$= 2GR\sum_{k=1}^{B}\sum_{j=1}^{k-1}\big\|\mathbb{E}\big[Y_t^i\big(\xi_{t\tau+i-1}^{(k)}\big)\,\big|\,\widehat{\mathcal{F}}_{t\tau+i-1}^{(j)}\big] - \mathbb{E}_{\xi\sim\mu}Y_t^i\big(\xi_{t\tau+i-1}^{(k)}\big)\big\|_\infty \le 4G^2R^2\sum_{k=1}^{B}\sum_{j=1}^{k-1}\varphi_\xi(k-j) \le 4G^2R^2 B\sum_{i=1}^{B}\varphi_\xi(i).$$

Then, by applying Lemma B.4, we obtain the following high-probability bound:

$$\mathbb{P}\big(|X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]| \ge t\big) \le 2\exp\Big(-\frac{B^2t^2}{2(v+2q) + 2Btm/3}\Big) \le 2\exp\bigg(-\frac{B^2t^2}{2\big(2G^2R^2B + 8G^2R^2B\sum_{i=1}^{B}\varphi_\xi(i)\big) + 4GRBt/3}\bigg) = 2\exp\bigg(-\frac{Bt^2}{2\big(2G^2R^2 + 8G^2R^2\sum_{i=1}^{B}\varphi_\xi(i)\big) + 4GRt/3}\bigg).$$

Simplifying yields

$$\mathbb{P}\big(|X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]| \ge t\big) \le 2\exp\bigg(-\frac{Bt^2}{C + \frac{4}{3}GRt + 16G^2R^2\sum_{i=1}^{B}\varphi_\xi(i)}\bigg),$$

where C := 4G²R².
B.2 PROOF OF THE MAIN RESULT
Recall that we are considering a data stream divided into small mini-batches. For convenience, we
re-label the data stream _ξ1, ξ2, ξ3, . . ._ as follows to explicitly indicate its mini-batch index.
_{_ _}_
_{ξ1[(1)][, ξ]1[(2)][, . . ., ξ]1[(][B][)], ξ2[(1)][, ξ]2[(2)][, . . ., ξ]2[(][B][)], . . . }._ (12)
The canonical filtration generated by the re-labeled data stream is denoted by _F. Also, when the_
batch size is clear in the context, we denote the data in the specified mini-batch as x. For example,
[b]
-----
we use xt to represent the t-th mini-batch {ξt[(1)], ξt[(2)], . . ., ξt[(][B][)]}. Then we can re-writhe the above
data stream as
_x1, x2, x3. . . ._ _._
_{_ _}_
We denote the canonical filtration generated by the above sequence as F. Note that we have the
following relation:
_t =_ _t_ _._
_F_ _F_ [(][B][)]
In summary, when we analyze the mini-batch SGD dynamics, we use the filtration[b] _F, and when we_
need to consider intra-batch samples, we use the filtration _F._
**Theorem B.5.** _Let {w(t)}_{t∈N} be the model parameter sequence generated by (6). Suppose Assumptions 2.1 and 3.1 hold. Then, for any τ ∈ N, with probability at least 1 − δ, we have_

$$\sum_{t=1}^{n}[f(w(t)) - f(w^*)] \le \frac{GRn}{B}\sum_{i=1}^{B}\varphi_\xi(\tau B+i) + \sqrt{\frac{2\tau n}{B}\log\frac{4\tau}{\delta}}\,\lambda + R_n + G(\tau-1)\sum_{t=1}^{n-\tau+1}\kappa(t) + GR(\tau-1),$$

_where_

$$\lambda := \frac{2}{3}\sqrt{\frac{G^2R^2}{B}}\log\frac{4n}{\delta} + \sqrt{\frac{4}{9}\cdot\frac{G^2R^2}{B}\Big(\log\frac{4n}{\delta}\Big)^2 + \Big(4G^2R^2 + 16G^2R^2\sum_{i=1}^{B}\varphi_\xi(i)\Big)\log\frac{4n}{\delta}}.$$

_In particular, if τ = 1, then_

$$\sum_{t=1}^{n}[f(w(t)) - f(w^*)] \le R_n + \frac{GRn}{B}\sum_{i=1}^{B}\varphi_\xi(B+i) + \sqrt{\frac{2n}{B}\log\frac{4}{\delta}}\,\lambda.$$
_Proof._ From Proposition B.2, we obtain the following bound:
$$\sum_{t=1}^{n}[f(w(t)) - f(w^*)] \le \sum_{t=1}^{n}[f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*)] + Rn + G(\tau-1)\sum_{t=1}^{n-\tau+1}\kappa(t) + GR(\tau-1).$$
To complete the proof, it suffices to bound the first term, which we define as
$$Z_n := \sum_{t=1}^{n}[f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*)].$$
We apply the same decomposition as in (13) of (Agarwal & Duchi, 2012). Define the index set $\mathcal{I}(i)$ as $\{1, \ldots, \lfloor\frac{n}{\tau}\rfloor + 1\}$ for $i \le n - \tau\lfloor\frac{n}{\tau}\rfloor$, and as $\{1, \ldots, \lfloor\frac{n}{\tau}\rfloor\}$ otherwise. Then we have
$$Z_n = \sum_{i=1}^{\tau}\bigg[\sum_{t\in\mathcal{I}(i)}\big[X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]\big] + \sum_{t\in\mathcal{I}(i)}\mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]\bigg],$$
where
$$X_t^i = f\big(w((t-1)\tau + 1)\big) - f(w^*) + F(w^*; x_{t\tau+i-1}) - F\big(w((t-1)\tau + 1); x_{t\tau+i-1}\big).$$
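The index sets $\mathcal{I}(i)$ simply deal the $n$ time indices into $\tau$ interleaved subsequences, with the first $n - \tau\lfloor n/\tau\rfloor$ subsequences receiving one extra element; a small illustrative helper (our own sketch):

```python
def index_sets(n, tau):
    """I(i) from the proof: the first n - tau*floor(n/tau) sets get one extra
    element, so the blocks {(t - 1)*tau + i : t in I(i)}, i = 1, ..., tau,
    partition {1, ..., n}."""
    base = n // tau
    return {i: list(range(1, base + 2)) if i <= n - tau * base
            else list(range(1, base + 1))
            for i in range(1, tau + 1)}

sets = index_sets(10, 3)
assert sum(len(v) for v in sets.values()) == 10  # sizes 4 + 3 + 3 cover n = 10
```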
Note that by Lemma 4.1, we have $\mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i] \le \frac{GR}{B}\sum_{j=1}^{B}\phi_\xi(\tau B + j)$. Then, we have
$$\mathbb{P}\bigg(Z_n > \frac{nGR}{B}\sum_{i=1}^{B}\phi_\xi(\tau B + i) + \gamma\bigg) \le \mathbb{P}\bigg(\sum_{i=1}^{\tau}\sum_{t\in\mathcal{I}(i)}\big[X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]\big] > \gamma\bigg) \le \mathbb{P}\bigg(\bigcup_{i=1}^{\tau}\bigg\{\sum_{t\in\mathcal{I}(i)}\big[X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]\big] > \frac{\gamma}{\tau}\bigg\}\bigg) \le \sum_{i=1}^{\tau}\mathbb{P}\bigg(\sum_{t\in\mathcal{I}(i)}\big[X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]\big] > \frac{\gamma}{\tau}\bigg).$$
Define $Y := \sum_{t\in\mathcal{I}(i)}\big[X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]\big]$ and $\alpha := \frac{\lambda}{\sqrt{B}}$. Notice that $X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]$ is a centered random variable, that is, $\mathbb{E}\big[X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]\big] = 0$. Then, by the generalized Azuma's inequality (Lemma B.3), we conclude that
$$\mathbb{P}\Big(Y \ge \frac{\gamma}{\tau}\Big) \le 2\exp\bigg(-\frac{\gamma^2}{2\tau^2\frac{n}{\tau}\alpha^2}\bigg) + \sum_{t=1}^{n/\tau}\mathbb{P}\big(|X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]| \ge \alpha\big).$$
The second term can be bounded by using the generalized Bernstein's inequality; the detailed calculation can be found in the discussion after Lemma B.4. We obtain that
$$\mathbb{P}\big(|X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]| \ge \alpha\big) \le 2\exp\bigg(-\frac{\lambda^2}{C + \frac{4}{3}GR\frac{\lambda}{\sqrt{B}} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)}\bigg),$$
where $C = 4G^2R^2$. In summary, the concentration bound for $Z_n$ is
$$\mathbb{P}\bigg(Z_n > \frac{nGR}{B}\sum_{i=1}^{B}\phi_\xi(\tau B + i) + \gamma\bigg) \le 2\tau\exp\bigg(-\frac{\gamma^2}{2\tau^2\frac{n}{\tau}\alpha^2}\bigg) + \tau\sum_{t=1}^{n/\tau}\mathbb{P}\big(|X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]| \ge \alpha\big) \le 2\tau\exp\bigg(-\frac{\gamma^2}{2\tau n\frac{\lambda^2}{B}}\bigg) + 2n\exp\bigg(-\frac{\lambda^2}{C + \frac{4}{3}GR\frac{\lambda}{\sqrt{B}} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)}\bigg).$$
Then, let $\frac{\delta}{2} = 2n\exp\Big(-\frac{\lambda^2}{C + \frac{4}{3}GR\frac{\lambda}{\sqrt{B}} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)}\Big)$, and we obtain that
$$\lambda^2 = \Big(C + \frac{4}{3}GR\frac{\lambda}{\sqrt{B}} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)\Big)\log\frac{4n}{\delta}.$$
This is a quadratic equation in $\lambda$. Solving it yields that
$$\lambda = \frac{2}{3}\frac{GR}{\sqrt{B}}\log\frac{4n}{\delta} + \sqrt{\frac{4}{9}\frac{G^2R^2}{B}\Big(\log\frac{4n}{\delta}\Big)^2 + \Big(C + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)\Big)\log\frac{4n}{\delta}}. \tag{13}$$
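This root can be verified symbolically; a minimal sympy check (our own, with `S` standing in for $16G^2R^2\sum_{i=1}^B\phi_\xi(i)$ and `L` for $\log\frac{4n}{\delta}$):

```python
import sympy as sp

G, R, B, L, S, lam = sp.symbols('G R B L S lam', positive=True)
C = 4 * G**2 * R**2

# lambda^2 = (C + (4/3) G R lambda / sqrt(B) + S) * L, rearranged to = 0.
quadratic = lam**2 - sp.Rational(4, 3) * G * R * L * lam / sp.sqrt(B) - (C + S) * L

# The claimed positive root (13).
root = (sp.Rational(2, 3) * G * R * L / sp.sqrt(B)
        + sp.sqrt(sp.Rational(4, 9) * G**2 * R**2 * L**2 / B + (C + S) * L))

assert sp.simplify(sp.expand(quadratic.subs(lam, root))) == 0  # (13) solves it
```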
Also, let $\frac{\delta}{2} = 2\tau\exp\Big(-\frac{\gamma^2}{2\tau n\lambda^2/B}\Big)$; we then have that
$$\gamma^2 = \frac{2\tau n\lambda^2}{B}\log\frac{4\tau}{\delta}.$$
Substituting (13) into the above equation, we obtain that
$$\gamma = \sqrt{\frac{2\tau n}{B}\bigg(\frac{2}{3}\frac{GR}{\sqrt{B}}\log\frac{4n}{\delta} + \sqrt{\frac{4}{9}\frac{G^2R^2}{B}\Big(\log\frac{4n}{\delta}\Big)^2 + \Big(C + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)\Big)\log\frac{4n}{\delta}}\,\bigg)^2\log\frac{4\tau}{\delta}}.$$
Then, we conclude that with probability at least $1 - \delta$,
$$\sum_{t=1}^{n}[f(w(t)) - f(w^*)] \le \frac{nGR}{B}\sum_{i=1}^{B}\phi_\xi(\tau B + i) + \sqrt{\frac{2\tau n}{B}\bigg(\frac{2}{3}\frac{GR}{\sqrt{B}}\log\frac{4n}{\delta} + \sqrt{\frac{4}{9}\frac{G^2R^2}{B}\Big(\log\frac{4n}{\delta}\Big)^2 + \Big(4G^2R^2 + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)\Big)\log\frac{4n}{\delta}}\,\bigg)^2\log\frac{4\tau}{\delta}} + Rn + G(\tau-1)\sum_{t=1}^{n-\tau+1}\kappa(t) + GR(\tau-1). \tag{14}$$
The desired result follows by noting that $\sum_{t=1}^{n} f(w(t)) \ge n f(\widehat{w}_n)$.
C REGRET ANALYSIS OF MINI-BATCH SGD

In this section, we derive the regret bound of the mini-batch SGD algorithm. Throughout, for each sample loss $F(w; \xi)$, recall that its gradient norm $\|\nabla F(w; \xi)\|$ is uniformly bounded by $G$ (see Assumption 2.1). In particular, we assume that the $k$-th coordinate of $\nabla F(w; \xi)$ is uniformly bounded by $G_k$, so that we may take $G = \sqrt{\sum_k G_k^2}$.
**1. Gradient Variance Bound under Dependent Data**
In the i.i.d. setting, the variance of the stochastic gradient decreases as the batch size increases. Specifically, we have
$$\mathbb{E}\bigg\|\frac{1}{B}\sum_{i=1}^{B}\nabla F(w; \xi_i) - \nabla f(w)\bigg\|^2 = \frac{1}{B^2}\sum_{i=1}^{B}\mathbb{E}\|\nabla F(w; \xi_i) - \nabla f(w)\|^2 \le \frac{2G^2}{B}.$$
Therefore, $\mathbb{E}\|\frac{1}{B}\sum_{i=1}^{B}\nabla F(w; \xi_i) - \nabla f(w)\|^2 = \mathcal{O}(\frac{1}{B})$. However, this bound no longer holds if the data samples are dependent. In the following lemma, we develop a similar result when the data is collected from a dependent stochastic process. Recall that $\nabla F(w(t); x_t)$ denotes the averaged gradient over the mini-batch $x_t$, i.e.,
$$\nabla F(w(t); x_t) = \frac{1}{B}\sum_{i=1}^{B}\nabla F(w(t); \xi_t^{(i)}).$$
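To see numerically why the $\mathcal{O}(1/B)$ rate degrades under dependence, the following numpy sketch (an illustration only: an AR(1) stream serves as a stand-in for a $\phi$-mixing process, and the sample loss is $F(w; \xi) = (w - \xi)^2/2$, whose gradient noise equals the batch-mean error) compares the i.i.d. and dependent cases:

```python
import numpy as np

def batch_mean_sq_error(B, rho, n_trials=1000, burn_in=200, seed=0):
    """Mean squared error of a size-B batch mean over a stationary AR(1) stream.

    rho = 0 is the i.i.d. case; rho close to 1 is strong dependence. The
    stationary mean is 0, so the batch-mean error is exactly the averaged
    gradient noise of F(w; xi) = (w - xi)^2 / 2 at any fixed w.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)      # one AR(1) chain per trial, run in parallel
    total = np.zeros(n_trials)
    for t in range(burn_in + B):
        x = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_trials)
        if t >= burn_in:
            total += x
    return np.mean((total / B) ** 2)

for B in (10, 100, 1000):
    print(B, batch_mean_sq_error(B, rho=0.0), batch_mean_sq_error(B, rho=0.95))
# With rho = 0 the error decays like 1/B; with rho = 0.95 it is inflated by
# roughly (1 + rho)/(1 - rho), echoing the mixing-coefficient sum in Lemma C.1.
```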
**Lemma C.1.** _Let $\{w(t)\}_{t\in\mathbb{N}}$ be the model parameter sequence generated by the mini-batch SGD in (6). Let Assumptions 2.1 and 3.1 hold. Then, with probability at least $1 - \delta$,_
$$\|\nabla F(w(t); x_t) - \nabla f(w(t))\|^2 \le \Big[\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big]\cdot\frac{\log\frac{2d}{\delta}}{B} + 2G^2\bigg(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\bigg)^2.$$
_Proof._ Let $x_t = \{\xi_t^{(i)}\}_{i=1}^{B}$ be the $t$-th mini-batch of samples. We consider the filtration within $x_t$ and denote it as $\{\widehat{\mathcal{F}}_t^{(i)}\}$. Then, by the definition of the canonical filtration,
$$X_i := \nabla F(w(t); \xi_t^{(i)})$$
is measurable with respect to $\widehat{\mathcal{F}}_t^{(i)}$. Define
$$Y_{i,k} := \big(X_i - \mathbb{E}[X_i \,|\, \mathcal{F}_{t-1}]\big)_k,$$
where $(\cdot)_k$ denotes the $k$-th entry of the specified vector. It is easy to see that $\{Y_{i,k}\}_i$ is a centered process for any $k \in \{1, 2, \ldots, d\}$. With these constructions, we start from the following decomposition:
$$\|\nabla F(w(t); x_t) - \nabla f(w(t))\|^2 = \big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}] + \mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}] - \nabla f(w(t))\big\|^2 \le 2\underbrace{\big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}]\big\|^2}_{\text{(A)}} + 2\underbrace{\big\|\mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}] - \nabla f(w(t))\big\|^2}_{\text{(B)}}.$$
Then we bound the terms (A) and (B), respectively.
- Bounding (A): Note that
$$\big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}]\big\|^2 = \frac{1}{B^2}\bigg\|\sum_{i=1}^{B}\big[X_i - \mathbb{E}[X_i \,|\, \mathcal{F}_{t-1}]\big]\bigg\|^2 = \frac{1}{B^2}\sum_{k=1}^{d}\bigg[\sum_{i=1}^{B}\big(X_i - \mathbb{E}[X_i \,|\, \mathcal{F}_{t-1}]\big)_k\bigg]^2 = \frac{1}{B^2}\sum_{k=1}^{d}\bigg[\sum_{i=1}^{B}Y_{i,k}\bigg]^2.$$
Then, we show that the process $\{Y_{i,k}\}_i$ satisfies the conditions of Lemma B.4.
  ◦ Since $\mathbb{E}[Y_{i,k} \,|\, \mathcal{F}_{t-1}] = 0$, we conclude that $\{Y_{i,k}\}_i$ is a centered process.
  ◦ Denote the $k$-th entry of $X_i$ as $X_{i,k}$. We know that $|X_{i,k}| \le G_k$. Hence, we conclude that $0 \le |Y_{i,k}| \le 2G_k$. Then, we can set $b_i = 2G_k$ for all $i$.
  ◦ Lastly, we can bound the quantity $q$ defined in Lemma B.4 as follows:
$$q \le \sum_{j=1}^{B}\sum_{i=1}^{j-1} 2G_k\big\|\mathbb{E}[Y_{j,k} \,|\, \widehat{\mathcal{F}}_t^{(i)}]\big\| + \frac{4}{3}G_k^2 B \le \sum_{j=1}^{B}\sum_{i=1}^{j-1} 4G_k^2\,\phi_\xi(j-i) + \frac{4}{3}G_k^2 B \le 4G_k^2 B\sum_{j=1}^{B}\phi_\xi(j) + \frac{4}{3}G_k^2 B.$$
Now, we can apply Lemma B.4 and obtain that
$$\mathbb{P}\bigg(\sum_{i}Y_{i,k} > \lambda\bigg) \le \exp\bigg(-\frac{\lambda^2}{\frac{134}{3}G_k^2 B + 128G_k^2 B\sum_{j=1}^{B}\phi_\xi(j)}\bigg).$$
With a union bound, we obtain that
$$\mathbb{P}\bigg(\bigg|\sum_{i}Y_{i,k}\bigg| > \lambda\bigg) \le 2\exp\bigg(-\frac{\lambda^2}{\frac{134}{3}G_k^2 B + 128G_k^2 B\sum_{j=1}^{B}\phi_\xi(j)}\bigg).$$
Further applying the union bound over $k = 1, 2, \ldots, d$, we obtain that
$$\mathbb{P}\bigg(\bigcup_{k=1}^{d}\bigg\{\Big[\sum_{i}Y_{i,k}\Big]^2 > \lambda_k^2\bigg\}\bigg) \le \sum_{k=1}^{d}2\exp\bigg(-\frac{\lambda_k^2}{\frac{134}{3}G_k^2 B + 128G_k^2 B\sum_{j=1}^{B}\phi_\xi(j)}\bigg).$$
Let $\frac{\delta}{d} = 2\exp\Big(-\frac{\lambda_k^2}{\frac{134}{3}G_k^2 B + 128G_k^2 B\sum_{j=1}^{B}\phi_\xi(j)}\Big)$; we obtain that
$$\lambda_k^2 = \Big[\frac{134}{3}G_k^2 B + 128G_k^2 B\sum_{j=1}^{B}\phi_\xi(j)\Big]\log\frac{2d}{\delta}.$$
Then we conclude that
$$\mathbb{P}\bigg(\bigcap_{k=1}^{d}\bigg\{\Big[\sum_{i}Y_{i,k}\Big]^2 \le \Big[\frac{134}{3}G_k^2 B + 128G_k^2 B\sum_{j=1}^{B}\phi_\xi(j)\Big]\log\frac{2d}{\delta}\bigg\}\bigg) \ge 1 - \delta.$$
It implies that with probability at least $1 - \delta$,
$$\sum_{k}\Big[\sum_{i}Y_{i,k}\Big]^2 \le \Big[\frac{134}{3}B\Big(\sum_{k}G_k^2\Big) + 128B\Big(\sum_{j=1}^{B}\phi_\xi(j)\Big)\Big(\sum_{k}G_k^2\Big)\Big]\cdot\log\frac{2d}{\delta}.$$
By definition, $G = \sqrt{\sum_k G_k^2}$. Finally, we have the following bound for term (A): with probability at least $1 - \delta$,
$$\big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}]\big\|^2 \le \Big[\frac{134}{3}G^2 + 128G^2\sum_{j=1}^{B}\phi_\xi(j)\Big]\cdot\frac{\log\frac{2d}{\delta}}{B}.$$
- Bounding (B): Note that
$$\big\|\mathbb{E}[\nabla F(w(t); \xi_t^{(i)}) \,|\, \mathcal{F}_{t-1}] - \nabla f(w(t))\big\| = \bigg\|\int \nabla F(w(t); \xi)\,\mathrm{d}\mathbb{P}(\xi_t^{(i)} \in \cdot \,|\, \mathcal{F}_{t-1}) - \int \nabla F(w(t); \xi)\,\mathrm{d}\mu(\xi)\bigg\| \le \int \|\nabla F(w(t); \xi)\|\,\big|\mathrm{d}\mathbb{P}(\xi_t^{(i)} \in \cdot \,|\, \mathcal{F}_{t-1}) - \mathrm{d}\mu\big| \le G\cdot\phi_\xi(i).$$
Then we bound the norm by the triangle inequality:
$$\big\|\mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}] - \nabla f(w(t))\big\| \le \frac{1}{B}\sum_{i=1}^{B}\big\|\mathbb{E}[\nabla F(w(t); \xi_t^{(i)}) \,|\, \mathcal{F}_{t-1}] - \nabla f(w(t))\big\| \le \frac{G}{B}\sum_{i=1}^{B}\phi_\xi(i).$$
Finally, we obtain the bound for the term (B) as
$$\big\|\mathbb{E}[\nabla F(w(t); x_t) \,|\, \mathcal{F}_{t-1}] - \nabla f(w(t))\big\|^2 \le G^2\bigg(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\bigg)^2.$$
Combining the bounds of (A) and (B) yields that, with probability at least $1 - \delta$,
$$\|\nabla F(w(t); x_t) - \nabla f(w(t))\|^2 \le \Big[\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big]\cdot\frac{\log\frac{2d}{\delta}}{B} + 2G^2\bigg(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\bigg)^2.$$
**2. High-Probability Regret Bound**
To derive the regret bound for the mini-batch SGD algorithm, we make the following additional
mild assumption.
**Assumption C.2.** _The stochastic optimization problem (P) satisfies:_
_1. Each sample loss $F(\cdot; \xi) : \mathcal{W} \to \mathbb{R}$ is convex._
_2. The objective function $f : \mathcal{W} \to \mathbb{R}$ is $L$-smooth._
**Theorem C.3 (High-probability regret bound).** _Let $\{w(t)\}_{t\in\mathbb{N}}$ be the model parameter sequence generated by the mini-batch SGD in (6). Suppose Assumptions C.2, 3.1 and 2.1 hold. Then, with probability at least $1 - \delta$,_
$$R_T \le \frac{\|w(1) - w^*\|^2}{2\eta} + \eta T\bigg[\Big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big)\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\bigg(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\bigg)^2\bigg] + 2\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$
_Moreover, let $\eta = \mathcal{O}\Big(\sqrt{\frac{B}{T\sum_{j=1}^{B}\phi_\xi(j)}}\Big)$; the optimized upper bound is in the order of_
$$R_T = \widetilde{\mathcal{O}}\bigg(\sqrt{\frac{T\sum_{j=1}^{B}\phi_\xi(j)}{B}}\bigg) + 2\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$
_Proof._ For convenience, we define $g_t = \frac{1}{B}\sum_{i=1}^{B}\nabla F(w(t); \xi_t^{(i)})$. By the algorithm update (6), we obtain that
$$2\langle g_t, w(t) - w^*\rangle \le \frac{\|w(t) - w^*\|^2 - \|w(t+1) - w^*\|^2}{\eta} + \eta\|g_t\|^2 \le \frac{\|w(t) - w^*\|^2 - \|w(t+1) - w^*\|^2}{\eta} + 2\eta\|g_t - \nabla f(w(t))\|^2 + 2\eta\|\nabla f(w(t))\|^2.$$
Summing the above inequality over $t$ and using the $L$-smoothness of $f$, i.e., $\|\nabla f(w(t))\|^2 \le 2L(f(w(t)) - f(w^*))$, yields that
$$\sum_{t=1}^{T}2\langle g_t, w(t) - w^*\rangle \le \frac{\|w(1) - w^*\|^2 - \|w(T+1) - w^*\|^2}{\eta} + 2\eta\sum_{t=1}^{T}\|g_t - \nabla f(w(t))\|^2 + 4\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$
By convexity of the sample loss, we further obtain that
$$\sum_{t=1}^{T}2\big(F(w(t); x_t) - F(w^*; x_t)\big) \le \frac{\|w(1) - w^*\|^2}{\eta} + 2\eta\sum_{t=1}^{T}\|g_t - \nabla f(w(t))\|^2 + 4\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$
Then, we apply Lemma C.1 to bound the second term $\sum_{t=1}^{T}\|g_t - \nabla f(w(t))\|^2$ and then apply a union bound over $t$. We conclude that, with probability at least $1 - \delta$,
$$\sum_{t=1}^{T}\big(F(w(t); x_t) - F(w^*; x_t)\big) \le \frac{\|w(1) - w^*\|^2}{2\eta} + \eta T\bigg[\Big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big)\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\bigg(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\bigg)^2\bigg] + 2\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$
This establishes the stated regret bound. Lastly, we set the learning rate $\eta$. To minimize the obtained upper bound, it suffices to minimize the first two terms, as the last term can be combined with the left-hand side of (14) when we apply this regret bound. The optimized learning rate is achieved when
$$\frac{\|w(1) - w^*\|^2}{2\eta} = \eta T\bigg[\Big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big)\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\bigg(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\bigg)^2\bigg].$$
Then, $\eta$ is chosen as
$$\eta = \sqrt{\frac{\|w(1) - w^*\|^2/2}{T\Big[\big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\big)\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\big)^2\Big]}} = \mathcal{O}\bigg(\sqrt{\frac{B}{T\sum_{j=1}^{B}\phi_\xi(j)}}\bigg).$$
This completes the proof.
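For concreteness, here is a minimal Python sketch of projected mini-batch SGD with this step-size scaling; the toy quadratic loss, the projection radius, and the mixing-sum placeholder are our own illustrative assumptions rather than the paper's experimental code:

```python
import numpy as np

def minibatch_sgd(batches, w0, eta, grad_F, radius):
    """Projected mini-batch SGD: w <- Proj_W(w - eta * g_t), where g_t is the
    gradient averaged over the current mini-batch."""
    w = np.array(w0, dtype=float)
    for batch in batches:
        g = np.mean([grad_F(w, xi) for xi in batch], axis=0)
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > radius:           # project back onto the ball W of radius R
            w *= radius / norm
    return w

# Toy instantiation: F(w; xi) = ||w - xi||^2 / 2, so grad_F(w, xi) = w - xi.
T, B, d = 500, 20, 3
phi_sum = 1.0                       # placeholder for sum_j phi_xi(j)
eta = np.sqrt(B / (T * phi_sum))    # step-size scaling suggested above
stream = np.random.default_rng(1).random((T, B, d))
w_out = minibatch_sgd(stream, w0=np.zeros(d), eta=eta,
                      grad_F=lambda w, xi: w - xi, radius=5.0)
```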
D EXPERIMENT SETUP
Recall that we consider the following convex quadratic optimization problem:
$$\min_{w\in\mathbb{R}^d} \mathbb{E}_{\xi\sim\mu}\,(w - \xi)^\top A(w - \xi),$$
where $A$ is a fixed positive semi-definite matrix and $\mu$ is the uniform distribution on $[0, 1]^d$. A data stream admitting such a stationary distribution $\mu$ can be generated by a certain Metropolis-Hastings sampler provided in (Jarner & Roberts, 2002). Specifically, it is described as follows.
**Step 1:** Let the "proposal" distribution $q(x)$ have the density of $\mathrm{Beta}(r + 1, 1)$; that is,
$$q(x) = \begin{cases} (r+1)x^r & x \in [0, 1], \\ 0 & x \notin [0, 1]. \end{cases}$$
Define the acceptance probability $\alpha(x, y) = \min\big\{\frac{q(x)}{q(y)}, 1\big\}$.

**Step 2:** If the current state is $\xi_t$, then we sample $\zeta \sim q$ and define the next state $\xi_{t+1}$ as
$$\xi_{t+1} = \begin{cases} \xi_t & \text{w.p. } 1 - \alpha(\xi_t, \zeta), \\ \zeta & \text{w.p. } \alpha(\xi_t, \zeta). \end{cases}$$

**Step 3:** Go back to Step 2 to generate the next state.
We repeatedly generate $d$ independent sequences, each starting from the same initial state $0$, to obtain a $d$-dimensional Markov chain $\{x_k\}$ with $x_0 = 0_d$. It has been shown that the generated Markov chain converges to $\mu$ in distribution with an algebraic convergence rate $\phi_\xi(k) \le \mathcal{O}(k^{-1/r})$ (Proposition 5.2 of Jarner & Roberts, 2002).
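A minimal Python implementation of Steps 1-3 (our own sketch with illustrative names) is given below. The one liberty taken is at the degenerate start $\xi_0 = 0$, where $q(0) = 0$ makes the acceptance probability vanish, so we accept the first proposal to unstick the chain:

```python
import numpy as np

def mh_chain(r, length, rng):
    """One coordinate of the Metropolis-Hastings chain targeting Unif[0, 1]
    with independence proposal density q(x) = (r + 1) x^r on [0, 1]."""
    xi = 0.0
    chain = np.empty(length)
    for t in range(length):
        zeta = rng.beta(r + 1.0, 1.0)           # proposal zeta ~ q
        if xi == 0.0:
            alpha = 1.0                         # q(0) = 0: accept to move off 0
        else:
            alpha = min((xi / zeta) ** r, 1.0)  # alpha(x, y) = min{q(x)/q(y), 1}
        if rng.random() < alpha:
            xi = zeta                           # accept the proposal ...
        chain[t] = xi                           # ... otherwise stay at xi
    return chain

def mh_stream(r, length, d, rng):
    """d independent chains stacked into a d-dimensional Markov chain."""
    return np.stack([mh_chain(r, length, rng) for _ in range(d)], axis=1)

xs = mh_stream(r=2.0, length=10_000, d=5, rng=np.random.default_rng(0))
```

Small $r$ makes the proposal nearly uniform and the chain fast-mixing, while large $r$ concentrates proposals near $1$ and makes excursions near $0$ sticky, which is the source of the algebraic mixing rate $\mathcal{O}(k^{-1/r})$.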
We consider the following bias term at the fixed point $w = 0_d$:
$$\text{(Bias):}\qquad \big|\mathbb{E}[F(w; x_\tau) \,|\, x_0 = 0_d] - f(w)\big|.$$
It can be used to approximate the left-hand side of Lemma 4.1. Since $\mathbb{E}[F(w; x_\tau) \,|\, x_0 = 0_d]$ cannot be obtained explicitly, we use the Monte Carlo method to estimate this conditional expectation. That is, we generate $n = 10{,}000$ independent trajectories starting from $x_0 = 0_d$. At step $\tau$, we estimate the expected value as $\frac{1}{n}\sum_{i=1}^{n}F(w; x_\tau^{(i)})$, where the superscript $(i)$ of $x_\tau^{(i)}$ indicates that it is sampled from the $i$-th trajectory. Then we investigate the relation between the step $\tau$ and the mixing parameter $r$, as well as the relation between the step $\tau$ and the batch size $B$. All the results are presented in Section 5.
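A sketch of this Monte Carlo estimate, reusing the illustrative `mh_stream` sampler above (with an arbitrary placeholder matrix $A$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, tau, n_traj = 5, 2.0, 50, 10_000
A = np.eye(d)                  # placeholder PSD matrix; the paper fixes some A

def F(w, xi):                  # sample loss (w - xi)^T A (w - xi)
    v = w - xi
    return v @ A @ v

w = np.zeros(d)
# f(w) = E_{xi ~ Unif[0,1]^d} F(w; xi), estimated here with i.i.d. uniform draws.
f_w = np.mean([F(w, rng.random(d)) for _ in range(100_000)])

# Average F(w; x_tau) over independent trajectories started from x_0 = 0_d.
vals = [F(w, mh_stream(r, tau, d, rng)[-1]) for _ in range(n_traj)]
bias = abs(np.mean(vals) - f_w)  # Monte Carlo estimate of the bias term
```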