## HOW TO IMPROVE SAMPLE COMPLEXITY OF SGD OVER HIGHLY DEPENDENT DATA?

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Conventional machine learning applications typically assume that data samples are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data samples, which are known to heavily bias the stochastic optimization process and slow down the convergence of learning. In this paper, we conduct a fundamental study of how different structures of stochastic update schemes affect the sample complexity of stochastic gradient descent (SGD) over highly dependent data. Specifically, under a φ-mixing model of data dependence, we show that SGD with proper periodic data subsampling achieves an improved sample complexity over the standard SGD in the full spectrum of the data dependence level. Interestingly, even subsampling a subset of the data samples can accelerate the convergence of SGD over highly dependent data. Moreover, we show that mini-batch SGD can further substantially improve the sample complexity over SGD with periodic data subsampling over highly dependent data. We also conduct numerical experiments to validate our theoretical results.

1 INTRODUCTION

Stochastic optimization algorithms have attracted great attention in the past decade due to their successful applications to a broad range of research areas, including deep learning (Goodfellow et al., 2016), reinforcement learning (Sutton & Barto, 2018), online learning (Bottou, 2010; Hazan, 2017), control (Marti, 2017), etc. In the conventional analysis of stochastic optimization algorithms, it is usually assumed that all data samples are queried in an independently and identically distributed (i.i.d.) manner. For example, data samples in the traditional empirical risk minimization framework are assumed to be queried independently from the underlying data distribution, while data samples in reinforcement learning are assumed to be queried from the stationary distribution of the underlying Markov chain. Although the i.i.d. data assumption leads to a comprehensive understanding of the statistical limit and computation complexity of SGD, it violates the nature of many practical data-generating stochastic processes, which generate highly correlated samples that depend on the history. In fact, dependent data can be found almost everywhere, e.g., daily stock prices, weather/climate data, state transitions in Markov chains, etc.

To understand the impact of data dependence on the convergence and complexity of stochastic algorithms, a growing number of recent works have introduced various definitions to quantify data dependence. Specifically, to analyze the finite-time convergence of various stochastic reinforcement learning algorithms, recent studies assume that the dependent samples queried from the Markov decision process satisfy a geometric mixing property (Dalal et al., 2018; Zou et al., 2019; Xu & Gu, 2020; Qu & Wierman, 2020), which requires the underlying Markov chain to be uniformly ergodic or to have a finite mixing time (Even-Dar et al., 2003). On the other hand, to analyze the convergence of stochastic optimization algorithms over dependent data, Karimi et al. (2019) assumed the existence of a solution to the Poisson equation associated with the underlying Markov chain, which is a weaker condition than the uniform ergodicity condition (Glynn & Meyn, 1996).
Moreover, Agarwal & Duchi (2012) introduced a φ-mixing property of the data-generating process that quantifies how fast the distribution of future data samples (conditioned on a fixed filtration) converges to the underlying stationary data distribution. In particular, the φ-mixing property is more general than the previous two notions of data dependence (Douc et al., 2018).

While the aforementioned works leveraged the above notions of data dependence to characterize the sample complexity of various standard stochastic algorithms over dependent data, there is still a lack of theoretical understanding of how the algorithm structure affects the sample complexity of stochastic algorithms under different levels of data dependence. In particular, a key algorithm structure is the stochastic update scheme, which critically affects the bias and variance of the stochastic optimization process. In fact, under i.i.d. data and convex geometry, it is well known that SGD achieves the sample complexity lower bound under various stochastic update schemes (Bottou, 2010), e.g., single-sample update and mini-batch update. However, these stochastic update schemes may lead to substantially different convergence behaviors over highly dependent data, as they are no longer unbiased. Therefore, it is of vital importance to understand the interplay among the data dependence, the structure of the stochastic update, and the convergence rate of stochastic algorithms, and we ask the following fundamental question.

- Q: How does the structure of stochastic updates affect the convergence rate and sample complexity of stochastic algorithms over dependent data?

In this paper, we provide comprehensive answers to the above fundamental question. Specifically, we conduct a comprehensive study of the convergence rate and sample complexity of the SGD algorithm over a wide spectrum of data dependence levels under various types of stochastic updates, including periodic subsampling and mini-batch sampling. Our results show that SGD with both stochastic updates achieves a substantially improved sample complexity over the standard SGD under highly dependent data. We summarize our contributions as follows.

1.1 OUR CONTRIBUTIONS

We consider the following standard stochastic optimization problem:

$$\min_{w \in \mathcal{W}} \; f(w) := \mathbb{E}_{\xi \sim \mu} \big[ F(w; \xi) \big], \qquad \text{(P)}$$

where the objective function $f$ is convex and Lipschitz continuous, and the expectation is taken over the stationary distribution $\mu$ of the underlying data-generating process $P$. To perform stochastic optimization, we query a stream of dependent data samples from the underlying data-generating process. Specifically, we adopt the φ-mixing model to quantify the data dependence via a decaying mixing coefficient function $\varphi_\xi(k)$ (see Definition 2.2) (Agarwal & Duchi, 2012). We study the convergence of the stochastic gradient descent (SGD) algorithm over φ-mixing dependent data samples under various stochastic update schemes, including data subsampling and mini-batch sampling.

We first study the convergence of SGD over φ-mixing dependent data samples under the data subsampling update scheme. In particular, the data subsampling update scheme utilizes only one data sample per $r$ consecutive data samples by periodically skipping $r - 1$ samples. With this data subsampling scheme, the subsampled data samples are less dependent for a larger subsampling period $r$.
Consequently, we show that SGD with a proper data subsampling period achieves an improved sample complexity over that of the standard SGD in the full spectrum of the convergence rate of the φ-mixing coefficient. In particular, the improvement is substantial when the data is highly dependent with an algebraically decaying φ-mixing coefficient.

Moreover, we study the sample complexity of SGD over φ-mixing dependent data samples under the mini-batch update scheme. Compared to the data subsampling update, the mini-batch update can substantially reduce the mini-batch data dependence without skipping data samples. Consequently, the mini-batch update leverages the sample average over a mini-batch of data samples to reduce both the bias (caused by the data dependence) and the optimization variance. Specifically, we show that SGD with mini-batch update achieves an order-wise lower sample complexity than both the standard SGD and the SGD with data subsampling in the full spectrum of the convergence rate of the φ-mixing coefficient. We summarize and compare the sample complexities of these stochastic algorithms under different φ-mixing data dependence models in Table 1.

Table 1: Comparison of sample complexities of SGD, SGD with data subsampling and mini-batch SGD under different levels of data dependence for achieving $f(w) - f(w^*) \le \epsilon$. Note that $\theta$ is a parameter of the convergence rate of the φ-mixing coefficient.

| Data dependence level | $\varphi_\xi(k)$ | SGD | SGD w/ subsampling | Mini-batch SGD |
| --- | --- | --- | --- | --- |
| Geometric φ-mixing (weakly dependent) | $\exp(-k^{\theta})$, $\theta > 0$ | $O\big(\epsilon^{-2}(\log \epsilon^{-1})^{\frac{2}{\theta}}\big)$ | $O\big(\epsilon^{-2}(\log \epsilon^{-1})^{\frac{1}{\theta}}\big)$ | $O(\epsilon^{-2})$ |
| Fast algebraic φ-mixing (medium dependent) | $k^{-\theta}$, $\theta \ge 1$ | $O\big(\epsilon^{-2-\frac{2}{\theta}}\big)$ | $O\big(\epsilon^{-2-\frac{1}{\theta}}\big)$ | $\widetilde{O}(\epsilon^{-2})$ |
| Slow algebraic φ-mixing (highly dependent) | $k^{-\theta}$, $0 < \theta < 1$ | $O\big(\epsilon^{-2-\frac{2}{\theta}}\big)$ | $O\big(\epsilon^{-2-\frac{1}{\theta}}\big)$ | $\widetilde{O}\big(\epsilon^{-1-\frac{1}{\theta}}\big)$ |

1.2 RELATED WORK

**Stochastic algorithms over dependent data** Steinwart & Christmann (2009) and Modha & Masry (1996) established the convergence analysis of online stochastic algorithms for streaming data with geometric ergodicity. Duchi et al. (2011) proved that the stochastic subgradient method has a strong convergence guarantee if the mixing time is uniformly bounded. Agarwal & Duchi (2012) studied the convex/strongly convex stochastic optimization problem and proved high-probability convergence bounds for general stochastic algorithms under general stationary mixing processes. Godichon-Baggioni et al. (2021) provided a non-asymptotic analysis of stochastic algorithms with strongly convex objective functions over streaming mini-batch data. In a more general setting, the stochastic approximation (SA) problem was studied in (Karimi et al., 2019) by assuming the existence of a solution to a Poisson equation. Recently, Debavelaere et al. (2021) developed an asymptotic convergence analysis of the SA problem under sub-geometric Markovian dynamic noise.
**Finite-time convergence of reinforcement learning** Recently, a series of works studied the finite-time convergence of many stochastic reinforcement learning algorithms over Markovian dependent samples, including TD learning (Dalal et al., 2018; Xu et al., 2019; Kaledin et al., 2020), Q-learning (Qu & Wierman, 2020; Li et al., 2021; Melo et al., 2008; Chen et al., 2019; Xu & Gu, 2020), fitted Q-iteration (Mnih et al., 2013; 2015; Agarwal et al., 2021), actor-critic algorithms (Wang et al., 2019; Yang et al., 2019; Kumar et al., 2019; Qiu et al., 2019; Wu et al., 2020; Xu et al., 2020), etc. In these studies, the dependent Markovian samples are assumed to satisfy the geometric φ-mixing property, which holds when the underlying Markov chain is uniformly ergodic or time-homogeneous with finite states.

**Regret of stochastic convex optimization** There are many known regret bounds for the online convex optimization problem. Hazan (2017) established the standard $O(\sqrt{T})$ regret bound for the online SGD algorithm under a bounded-gradient assumption. Xiao (2009) introduced the regret bound of the online dual averaging method. To the best of our knowledge, there is no high-probability regret bound for the mini-batch SGD algorithm that takes the data dependence into account.

2 PROBLEM FORMULATION AND ASSUMPTIONS

In this section, we introduce the problem formulation and some basic assumptions. Consider a model with parameters $w$. For any data sample $\xi$, denote $F(w; \xi) \in \mathbb{R}$ as the loss of this data sample under the model $w$. In this paper, we consider the following standard stochastic optimization problem that has broad applications in machine learning:

$$\min_{w \in \mathcal{W}} \; f(w) := \mathbb{E}_{\xi \sim \mu} \big[ F(w; \xi) \big]. \qquad \text{(P)}$$

Here, the expectation is taken over the randomness of the data sample $\xi$, which is drawn from an underlying distribution $\mu$. In particular, we make the following standard assumptions regarding the problem (P) (Agarwal & Duchi, 2012).

**Assumption 2.1.** The stochastic optimization problem (P) satisfies:

1. For every $\xi$, the function $F(\cdot\,; \xi)$ is $G$-Lipschitz continuous over $\mathcal{W}$, i.e., for all $w, v \in \mathcal{W}$, $|F(w; \xi) - F(v; \xi)| \le G \|w - v\|$.
2. The function $f(\cdot)$ is convex and bounded below, i.e., $f(w^*) := \inf_{w \in \mathcal{W}} f(w) > -\infty$.
3. $\mathcal{W}$ is a convex and compact set with bounded diameter $R$, i.e., $\sup_{w, v \in \mathcal{W}} \|w - v\| \le R$.

To solve this stochastic optimization problem, one often needs to query a set of data samples from the distribution $\mu$ to perform optimization. Unlike traditional stochastic optimization, which usually assumes that the data samples are i.i.d., we consider a more general and practical dependent data-generating process, as we elaborate below.

**Dependent data-generating process:** We consider a stochastic process $P$ that generates a stream of data samples $\{\xi_1, \xi_2, \ldots\}$, which are not necessarily independent. In particular, the stochastic process $P$ has an underlying stationary distribution $\mu$.
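Before formalizing the dependence, it may help to see a minimal example of such a process. The Python sketch below is our own illustration and is not part of the paper's setup: `markov_stream` and the two-state chain are hypothetical names and choices, standing in for any data-generating process with a stationary distribution.

```python
import numpy as np

def markov_stream(P_mat, pi0, n, rng):
    """Generate n states from a finite-state Markov chain with transition
    matrix P_mat, started from the initial distribution pi0.  Each sample
    is drawn conditionally on the previous one, so the stream is dependent
    rather than i.i.d., yet its marginal converges to the stationary
    distribution of the chain."""
    state = rng.choice(len(pi0), p=pi0)
    out = np.empty(n, dtype=int)
    for t in range(n):
        out[t] = state
        state = rng.choice(P_mat.shape[1], p=P_mat[state])
    return out

# A "lazy" two-state chain: it keeps its current state with probability
# 0.95, so consecutive samples are strongly correlated even though the
# stationary distribution is uniform on {0, 1}.
P_mat = np.array([[0.95, 0.05],
                  [0.05, 0.95]])
rng = np.random.default_rng(0)
samples = markov_stream(P_mat, pi0=[0.5, 0.5], n=10_000, rng=rng)
print(samples.mean())                        # close to 0.5 (stationary mean)
print((samples[:-1] == samples[1:]).mean())  # close to 0.95: strong dependence
```

Finite-state ergodic chains of this kind are geometrically φ-mixing, as noted in the examples following Definition 2.2 below.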
To quantify the dependence of the data-generating process, we introduce the following standard φ-mixing model (Agarwal & Duchi, 2012), where we denote $\{\mathcal{F}_t\}_t$ as the canonical filtration generated by $\{\xi_t\}_t$.

**Definition 2.2 (φ-mixing process).** Consider a stochastic process $\{\xi_t\}_t$ with a stationary distribution $\mu$. Let $\mathbb{P}(\xi_{t+k} \in \cdot \,|\, \mathcal{F}_t)$ be the distribution of the $(t+k)$-th sample conditioned on $\mathcal{F}_t$, and denote $d_{\mathrm{TV}}$ as the total variation distance. Then, the process $\{\xi_t\}_t$ is called φ-mixing if the following mixing coefficient $\varphi_\xi(\cdot)$ converges to $0$ as $k$ tends to infinity:

$$\varphi_\xi(k) := \sup_{t \in \mathbb{N},\, A \in \mathcal{F}_t} 2\, d_{\mathrm{TV}}\big( \mathbb{P}(\xi_{t+k} \in \cdot \,|\, A),\ \mu \big).$$

Intuitively, the φ-mixing coefficient describes how fast the distribution of the sample $\xi_{t+k}$ converges to the stationary distribution $\mu$ when conditioned on the filtration $\mathcal{F}_t$, as the time gap $k \to \infty$. φ-mixing processes can be found in many applications, with mixing coefficients that converge to zero at different rates. Below we mention some popular examples.

- Geometric φ-mixing process. Such a process has a geometrically diminishing mixing coefficient, i.e., $\varphi_\xi(k) \le \varphi_0 \exp(-c k^{\theta})$ for some $\varphi_0, c, \theta > 0$. Examples include finite-state ergodic Markov chains and some aperiodic Harris-recurrent Markov processes (Modha & Masry, 1996; Agarwal & Duchi, 2012; Meyn & Tweedie, 2012);
- Algebraic φ-mixing process. Such a process has a polynomially diminishing mixing coefficient, i.e., $\varphi_\xi(k) \le \varphi_0 k^{-\theta}$ for some $\varphi_0, \theta > 0$. Examples include a large class of Metropolis-Hastings samplers (Jarner & Roberts, 2002) and some queuing systems (Agarwal & Duchi, 2012).

3 CONVERGENCE OF SGD WITH SUBSAMPLING OVER DEPENDENT DATA

In this section, we study the convergence rate and sample complexity of SGD with the data subsampling update over φ-mixing dependent data. In Section 3.1, we recap the convergence results of the standard SGD over dependent data established in (Agarwal & Duchi, 2012). In Section 3.2, we establish convergence results of SGD with the data subsampling update.

Throughout, we define the sample complexity as the total number of samples required for the algorithm to output a model $w$ that achieves an $\epsilon$ convergence error, i.e., $f(w) - f(w^*) \le \epsilon$. Also, the standard regret of a stochastic algorithm is defined as

$$\text{(Regret):} \quad R_n := \sum_{t=1}^{n} \big[ F(w(t); \xi_t) - F(w^*; \xi_t) \big],$$

where the models $\{w(1), w(2), \ldots, w(n)\}$ are generated using the data samples $\{\xi_1, \xi_2, \ldots, \xi_n\}$, respectively, and $w^*$ is the minimizer of $f(w)$. For this sequence of models, we make the following mild assumption, which is satisfied by many SGD-type algorithms.

**Assumption 3.1.** There is a non-increasing sequence $\{\kappa(t)\}_t$ such that $\|w(t+1) - w(t)\| \le \kappa(t)$.

3.1 STOCHASTIC GRADIENT DESCENT

Stochastic gradient descent (SGD) is a popular and classical algorithm for stochastic optimization. In every iteration $t$, SGD queries a sample $\xi_t$ from the data-generating process and performs the following update:

$$\text{(SGD):} \quad w(t+1) = w(t) - \eta_t \nabla F(w(t); \xi_t), \qquad (1)$$

where $\eta_t$ is the learning rate.
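As a running illustration, here is a minimal Python sketch of update (1) over a dependent stream. The diminishing step size $\eta_t = c/\sqrt{t}$ and the Euclidean projection onto a ball (standing in for the compact domain $\mathcal{W}$ of Assumption 2.1) are our own concrete choices, not prescribed by the paper, which only requires Assumptions 2.1 and 3.1.

```python
import numpy as np

def sgd(grad_F, stream, w0, radius, c=0.1):
    """Update (1): one stochastic gradient step per queried sample.
    grad_F(w, xi) returns the sample gradient of F(.; xi) at w.  The
    iterate is projected back onto the Euclidean ball of the given
    radius so that it stays in a compact domain.  Returns the running
    average, matching the averaged predictor used in the analysis."""
    w = np.array(w0, dtype=float)
    w_avg = np.zeros_like(w)
    for t, xi in enumerate(stream, start=1):
        w = w - (c / np.sqrt(t)) * grad_F(w, xi)  # eta_t = c / sqrt(t)
        nrm = np.linalg.norm(w)
        if nrm > radius:                           # projection onto W
            w *= radius / nrm
        w_avg += (w - w_avg) / t                   # averaged predictor
    return w_avg
```

For instance, feeding it the dependent Markov-chain stream sketched in Section 2 (mapped to vector-valued samples) runs the standard SGD analyzed next.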
In Theorem 2 of (Agarwal & Duchi, 2012), the authors established a high-probability convergence error bound for a generic class of stochastic algorithms. Specifically, under Assumptions 2.1 and 3.1, they showed that for any $\tau \in \mathbb{N}$, with probability at least $1 - \delta$, the averaged predictor $\hat{w}_n := \frac{1}{n} \sum_{t=1}^{n} w(t)$ satisfies

$$f(\hat{w}_n) - f(w^*) \le \frac{R_n}{n} + \frac{(\tau - 1)G}{n} \sum_{t=1}^{n} \kappa(t) + \frac{2(\tau - 1)GR}{n} + 2GR\sqrt{\frac{2\tau}{n} \log \frac{\tau}{\delta}} + \varphi_\xi(\tau) GR. \qquad (2)$$

Here, $R_n$ is the regret of the algorithm of interest, and $\tau \in \mathbb{N}$ is an auxiliary parameter that is introduced to decouple the dependence of the data samples. From the above bound, one can see that the optimal choice of $\tau$ depends on the convergence rate of the mixing coefficient $\varphi_\xi(\tau)$. Specifically, consider the SGD algorithm in (1). It can be shown that it achieves the regret $R_n = O(\sqrt{n})$ and satisfies $\kappa(t) = O(1/\sqrt{t})$ with a proper diminishing learning rate. Consequently, the above high-probability convergence bound for SGD reduces to

$$f(\hat{w}_n) - f(w^*) \le O\bigg( \frac{1}{\sqrt{n}} + \inf_{\tau \in \mathbb{N}} \Big\{ \frac{\tau - 1}{\sqrt{n}} + \sqrt{\frac{\tau}{n} \log \frac{\tau}{\delta}} + \varphi_\xi(\tau) \Big\} \bigg). \qquad (3)$$

Such a bound further implies the following sample complexity results of SGD under different convergence rates of the mixing coefficient $\varphi_\xi$.

**Corollary 3.2.** The sample complexity of SGD in (1) for achieving an $\epsilon$ convergence error over φ-mixing dependent data is given as follows.

- If the data is geometric φ-mixing with parameter $\theta > 0$, then we choose $\tau = O\big( (\log \frac{1}{\epsilon})^{\frac{1}{\theta}} \big)$. The resulting sample complexity is in the order of $n = O\big( \epsilon^{-2} (\log \frac{1}{\epsilon})^{\frac{2}{\theta}} \big)$.
- If the data is algebraic φ-mixing with parameter $\theta > 0$, then we choose $\tau = O\big( \epsilon^{-\frac{1}{\theta}} \big)$. The resulting sample complexity is in the order of $n = O\big( \epsilon^{-2 - \frac{2}{\theta}} \big)$.

It can be seen that if the data-generating process has a fast, geometrically diminishing mixing coefficient, i.e., the data samples are close to being independent of each other, then the resulting sample complexity is almost the same as that of SGD with i.i.d. samples. On the other hand, if the data-generating process mixes slowly with an algebraically diminishing mixing coefficient, i.e., the data samples are highly dependent, then the data dependence increases the sample complexity by a non-negligible factor of $\epsilon^{-\frac{2}{\theta}}$. In particular, such a factor is substantially large if the mixing rate parameter $\theta$ is close to zero.

3.2 SGD WITH SUBSAMPLING

When applying SGD to solve stochastic optimization problems over dependent data, the key challenge is that the data dependence introduces a non-negligible bias that slows down the convergence of the algorithm. Hence, a straightforward solution is to reduce the data dependence before performing stochastic optimization. In the existing literature, a simple and useful approach is data subsampling (Nagaraj et al., 2020; Kotsalis et al., 2020). Next, we show that such an approach leads to an improved convergence bound and sample complexity of SGD over highly dependent data.

Specifically, consider a stream of φ-mixing data samples $\{\xi_1, \xi_2, \xi_3, \ldots\}$. Instead of utilizing the entire stream of data, we subsample a subset of this data stream with period $r \in \mathbb{N}$ and obtain the following subsampled data stream:

$$\{\xi_1, \xi_{r+1}, \xi_{2r+1}, \ldots\}.$$

In particular, let $\{\mathcal{F}_t\}_t$ be the canonical filtration generated by $\{\xi_{tr+1}\}_t$. Since consecutive subsampled samples are $r$ time steps away from each other, it is easy to verify that the subsampled data stream $\{\xi_{tr+1}\}_t$ is also a φ-mixing process with mixing coefficient given by $\varphi_\xi^r(t) = \varphi_\xi(rt)$, where $\varphi_\xi^r$ denotes the mixing coefficient of the subsampled data stream $\{\xi_{tr+1}\}_t$. Therefore, by periodically subsampling the data stream, the resulting subsampled process has a faster-converging mixing coefficient. Then, we apply SGD over such subsampled data, i.e.,

$$\text{(SGD with subsampling):} \quad w(t+1) = w(t) - \eta_t \nabla F(w(t); \xi_{tr+1}). \qquad (4)$$
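Operationally, SGD with subsampling is the same loop run on a thinned stream. The sketch below is again our own illustration, reusing the hypothetical `sgd` helper from Section 3.1; it keeps only every $r$-th sample, i.e., $\xi_1, \xi_{r+1}, \xi_{2r+1}, \ldots$

```python
from itertools import islice

def subsample(stream, r):
    """Keep one sample out of every r consecutive samples, producing the
    subsampled stream xi_1, xi_{r+1}, xi_{2r+1}, ... used in update (4)."""
    return islice(stream, 0, None, r)

# Same algorithm as before, thinner (and less dependent) data:
# w_hat = sgd(grad_F, subsample(stream, r=50), w0, radius)
```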
In particular, the convergence error bound in eq. (2) still holds with $\varphi_\xi(\tau)$ replaced by $\varphi_\xi(r\tau)$, and we obtain the following bound for SGD with subsampling:

$$f(\hat{w}_n) - f(w^*) \le O\bigg( \frac{1}{\sqrt{n}} + \inf_{\tau \in \mathbb{N}} \Big\{ \frac{\tau - 1}{\sqrt{n}} + \sqrt{\frac{\tau}{n} \log \frac{\tau}{\delta}} + \varphi_\xi(r\tau) \Big\} \bigg). \qquad (5)$$

Such a bound further implies the following sample complexity results of SGD with subsampling under different convergence rates of the mixing coefficient $\varphi_\xi$.

**Corollary 3.3.** The sample complexity of SGD with subsampling in (4) for achieving an $\epsilon$ convergence error over φ-mixing dependent data is given as follows.

- If the data is geometric φ-mixing with parameter $\theta > 0$, then we choose $r = O\big( (\log \frac{1}{\epsilon})^{\frac{1}{\theta}} \big)$ and $\tau = O(1)$. The resulting sample complexity is in the order of $rn = O\big( \epsilon^{-2} (\log \frac{1}{\epsilon})^{\frac{1}{\theta}} \big)$.
- If the data is algebraic φ-mixing with parameter $\theta > 0$, then we choose $r = O\big( \epsilon^{-\frac{1}{\theta}} \big)$ and $\tau = O(1)$. The resulting sample complexity is in the order of $rn = O\big( \epsilon^{-2 - \frac{1}{\theta}} \big)$.

Comparing the above sample complexity results with those of the standard SGD in Corollary 3.2, we conclude that data subsampling improves the sample complexity by a factor of $(\log \frac{1}{\epsilon})^{\frac{1}{\theta}}$ and $\epsilon^{-\frac{1}{\theta}}$ for geometric φ-mixing and algebraic φ-mixing data, respectively. Intuitively, this is because with data subsampling, we can choose a sufficiently large subsampling period $r$ to decouple the data dependence in the term $\varphi_\xi(r\tau)$, as opposed to choosing a large $\tau$ in Corollary 3.2. In this way, the order of the dominant term $\sqrt{\frac{\tau}{n} \log \frac{\tau}{\delta}}$ is reduced. Therefore, when the data is highly dependent, it is beneficial to subsample the dependent data before performing SGD. We also note another advantage of data subsampling: it only requires computing the stochastic gradients of the subsampled data, and therefore it can substantially reduce the computation load.

4 CONVERGENCE OF MINI-BATCH SGD OVER DEPENDENT DATA

Although the data subsampling update scheme studied in the previous section helps improve the sample complexity of SGD, it does not leverage the full information of all the queried data. In particular, when the data is highly dependent, we need to choose a large period $r$ to reduce the data dependence, and this throws away a huge number of valuable samples. In this section, we study SGD with another popular update scheme that leverages the full information of all the sampled data, i.e., the mini-batch update scheme. We show that this simple and popular scheme can effectively reduce the data dependence without skipping data samples, and can achieve an improved sample complexity over SGD with subsampling.

Specifically, we consider a data stream $\{\xi_t\}_t$ with φ-mixing dependent samples. We rearrange the data samples into a stream of mini-batches $\{x_t\}_t$, where each mini-batch $x_t$ contains $B$ samples, i.e., $x_t = \{\xi_{(t-1)B+1}, \xi_{(t-1)B+2}, \ldots, \xi_{tB}\}$. Then, we perform the mini-batch SGD update as follows:

$$\text{(Mini-batch SGD):} \quad w(t+1) = w(t) - \frac{\eta_t}{B} \sum_{\xi \in x_t} \nabla F(w(t); \xi). \qquad (6)$$

Performing SGD updates with mini-batch data has several advantages. First, it substantially reduces the optimization variance and allows for a large learning rate that facilitates the convergence of the algorithm. As a comparison, SGD with subsampling still suffers from a large optimization variance. Second, unlike SGD with subsampling, mini-batch SGD utilizes the information of all the data samples to improve the performance of the model.
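A minimal sketch of update (6) follows, once more as our own illustration. The constant learning rate is a simplifying assumption; Section 5 scales it as $\eta \propto \sqrt{B / \sum_{j=1}^{B} \varphi_\xi(j)}$ following Theorem C.3.

```python
import numpy as np
from itertools import islice

def minibatch_sgd(grad_F, stream, w0, radius, B, eta):
    """Update (6): consume the stream (an iterator, e.g. a generator)
    in consecutive blocks x_t of B samples, average the B sample
    gradients, and take one step per block.  Unlike subsampling, no
    sample is skipped."""
    w = np.array(w0, dtype=float)
    w_avg, t = np.zeros_like(w), 0
    while True:
        batch = list(islice(stream, B))   # x_t: the next B samples
        if len(batch) < B:                # stream exhausted
            return w_avg
        g = np.mean([grad_F(w, xi) for xi in batch], axis=0)
        w = w - eta * g
        nrm = np.linalg.norm(w)
        if nrm > radius:                  # projection onto W, as before
            w *= radius / nrm
        t += 1
        w_avg += (w - w_avg) / t          # averaged predictor
```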
Moreover, as we show in the following lemma, the mini-batch update substantially reduces the stochastic bias caused by the data dependence. In the sequel, we denote $F(w; x) := \frac{1}{B} \sum_{\xi \in x} F(w; \xi)$ as the average loss on a mini-batch of samples. With a slight abuse of notation, we also define $\{\mathcal{F}_t\}_t$ as the canonical filtration generated by the mini-batch samples $\{x_t\}_t$.

**Lemma 4.1.** Let Assumption 2.1 hold and consider the mini-batch data stream $\{x_t\}_t$. Then, for any $w, v \in \mathcal{W}$ measurable with respect to $\mathcal{F}_t$ and any $\tau \in \mathbb{N}$, it holds that

$$\Big| \mathbb{E}\big[ F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \,\big|\, \mathcal{F}_t \big] - \big( f(w) - f(v) \big) \Big| \le \frac{GR}{B} \sum_{i=1}^{B} \varphi_\xi(\tau B + i). \qquad (7)$$

With dependent data, the above lemma shows that we can approximate the population risk $f(w)$ by the conditional expectation $\mathbb{E}[F(w; x_{t+\tau}) \,|\, \mathcal{F}_t]$, which involves the mini-batch $x_{t+\tau}$ that is $\tau$ steps ahead of the filtration $\mathcal{F}_t$. Intuitively, by the φ-mixing property, as $\tau$ gets larger, the distribution of $x_{t+\tau}$ conditioned on $\mathcal{F}_t$ gets closer to the stationary distribution $\mu$. In general, the estimation bias $\frac{GR}{B} \sum_{i=1}^{B} \varphi_\xi(\tau B + i)$ depends on both the batch size and the accumulated mixing coefficient over the corresponding batch of samples. To provide a concrete understanding, below we calculate the estimation bias for several different mixing models.

- Geometric φ-mixing: In this case, $\sum_{i=1}^{B} \varphi_\xi(\tau B + i) \le \sum_{i=1}^{\infty} \varphi_\xi(i) = O(1)$. Hence, the estimation bias is in the order of $O(\frac{GR}{B})$.
- Fast algebraic φ-mixing ($\theta \ge 1$): In this case, $\sum_{i=1}^{B} \varphi_\xi(\tau B + i) = \widetilde{O}(1)$. Hence, the estimation bias is in the order of $\widetilde{O}(\frac{GR}{B})$, where $\widetilde{O}$ hides all logarithm factors.
- Slow algebraic φ-mixing ($0 < \theta < 1$): In this case, $\sum_{i=1}^{B} \varphi_\xi(\tau B + i) \le O\big( (\tau B)^{1-\theta} \big)$. Hence, the estimation bias is in the order of $O\big( \frac{GR \tau^{1-\theta}}{B^{\theta}} \big)$.

It can be seen that if the mixing coefficient converges fast, i.e., either geometrically or fast algebraically, then the data dependence has a negligible impact on the estimation error. Consequently, choosing a large batch size can substantially reduce the estimation bias. On the other hand, when the mixing coefficient converges slowly (algebraically), it substantially increases the estimation bias, but it is still beneficial to use a large batch size. This result shows that the mini-batch update can effectively reduce the statistical bias of stochastic approximation for a wide spectrum of dependent data-generating processes.

We prove the following convergence error bound for mini-batch SGD over dependent data.

**Theorem 4.2.** Let Assumptions 2.1 and 3.1 hold. Apply mini-batch SGD to solve the stochastic optimization problem (P) over φ-mixing dependent data and assume that it achieves regret $R_n$. Then, for any $\tau \in \mathbb{N}$ and any minimizer $w^*$, with probability at least $1 - \delta$, the averaged predictor $\hat{w}_n := \frac{1}{n} \sum_{t=1}^{n} w(t)$ satisfies

$$f(\hat{w}_n) - f(w^*) \le \frac{R_n}{n} + \frac{G(\tau - 1)}{n} \sum_{t=1}^{n-\tau+1} \kappa(t) + \frac{GR(\tau - 1)}{n} + O\Bigg( \frac{GR}{B} \sum_{i=1}^{B} \varphi(\tau B + i) + \sqrt{\frac{\tau}{nB} \log\frac{\tau}{\delta} \log\frac{n}{\delta}} \bigg[ B^{-\frac{1}{4}} + \Big( \sum_{i=1}^{B} \varphi(i) \Big)^{\frac{1}{4}} \bigg] \Bigg). \qquad (8)$$

To further understand the order of the above bound, a standard regret analysis shows that mini-batch SGD achieves a regret in the order of $\frac{R_n}{n} = \widetilde{O}\Big( \sqrt{\frac{\sum_{j=1}^{n} \varphi(j)}{nB}} \Big)$ and satisfies $\kappa(t) \equiv O\big( \sqrt{\frac{B}{n}} \big)$ (see Theorem C.3 for the proof). Consequently, the above convergence error bound reduces to the following bound, where we hide all logarithm factors for simplicity of presentation:
$$f(\hat{w}_n) - f(w^*) \le \widetilde{O}\Bigg( \sqrt{\frac{\sum_{j=1}^{n} \varphi(j)}{nB}} + \frac{GR(\tau - 1)}{n} \qquad\qquad (9)$$
$$\qquad\qquad + \frac{1}{B} \sum_{i=1}^{B} \varphi(\tau B + i) + \sqrt{\frac{\tau}{nB}} \bigg[ B^{-\frac{1}{4}} + \Big( \sum_{i=1}^{B} \varphi(i) \Big)^{\frac{1}{4}} \bigg] \Bigg). \qquad (10)$$

Such a bound further implies the following sample complexity results of mini-batch SGD under different convergence rates of the mixing coefficient $\varphi_\xi$.

**Corollary 4.3.** The sample complexity of mini-batch SGD in (6) for achieving an $\epsilon$ convergence error over φ-mixing dependent data is given as follows.

- If the data is geometric φ-mixing with parameter $\theta > 0$, then we choose $\tau = 1$, $B = O(\epsilon^{-1})$, $n = O(\epsilon^{-1})$. The overall sample complexity is $nB = O(\epsilon^{-2})$.
- If the data is fast algebraic φ-mixing with parameter $\theta \ge 1$, then we choose $\tau = 1$, $B = O(\epsilon^{-1})$, $n = O(\epsilon^{-1})$. The overall sample complexity is $nB = \widetilde{O}(\epsilon^{-2})$.
- If the data is slow algebraic φ-mixing with parameter $0 < \theta < 1$, then we choose $\tau = 1$, $B = O(\epsilon^{-\frac{1}{\theta}})$, $n = O(\epsilon^{-1})$. The overall sample complexity is $nB = O(\epsilon^{-1-\frac{1}{\theta}})$; a quick sanity check of this choice is sketched right after this list.
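To see where the slow algebraic choices come from, here is a back-of-the-envelope check (our own verification, up to constants and logarithmic factors). Assume $\varphi(k) = k^{-\theta}$ with $0 < \theta < 1$, and plug $\tau = 1$, $B = \epsilon^{-\frac{1}{\theta}}$, $n = \epsilon^{-1}$ into the terms of (9)-(10):

$$\frac{1}{B} \sum_{i=1}^{B} \varphi(B+i) \lesssim \frac{(2B)^{1-\theta}}{B} = O(B^{-\theta}) = O(\epsilon), \qquad \sqrt{\frac{\sum_{j=1}^{n} \varphi(j)}{nB}} \lesssim \sqrt{\frac{n^{-\theta}}{B}} = \epsilon^{\frac{\theta}{2} + \frac{1}{2\theta}} \le \epsilon,$$

$$\sqrt{\frac{1}{nB}} \Big( \sum_{i=1}^{B} \varphi(i) \Big)^{\frac{1}{4}} \lesssim \epsilon^{\frac{1}{2}} B^{-\frac{1+\theta}{4}} = \epsilon^{\frac{1}{2} + \frac{1}{4} + \frac{1}{4\theta}} \le \epsilon,$$

where the last two inequalities use $\frac{\theta}{2} + \frac{1}{2\theta} \ge 1$ (AM-GM) and $\frac{1}{4\theta} \ge \frac{1}{4}$ for $\theta \le 1$; the remaining $\sqrt{1/(nB)}\, B^{-1/4}$ term is smaller still, and the $GR(\tau-1)/n$ term vanishes at $\tau = 1$. Hence every term of (9)-(10) is at most $O(\epsilon)$ while consuming $nB = \epsilon^{-1-\frac{1}{\theta}}$ samples.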
It can be seen that mini-batch SGD achieves an order-wise lower sample complexity than SGD with subsampling in the full spectrum of φ-mixing convergence rates. Specifically, mini-batch SGD improves the sample complexity over that of SGD with subsampling by a factor of $O\big( (\log \frac{1}{\epsilon})^{\frac{1}{\theta}} \big)$, $\widetilde{O}\big( \epsilon^{-\frac{1}{\theta}} \big)$ and $O(\epsilon^{-1})$ for geometric φ-mixing, fast algebraic φ-mixing and slow algebraic φ-mixing data samples, respectively. This shows that the mini-batch update can effectively reduce the bias caused by the data dependence and leverage the full information of all the data samples to improve the learning performance. To explain intuitively, this is because with mini-batch updates, we can choose a sufficiently large batch size $B$ to reduce the bias caused by the data dependence and choose a small auxiliary parameter $\tau = 1$. As a comparison, to control the bias caused by the data dependence, the standard SGD needs to choose a very large $\tau$, and SGD with subsampling needs to choose a large subsampling period $r$ that skips a huge number of valuable data samples, especially when the mixing coefficient converges slowly. Therefore, our result proves that it is beneficial to use mini-batch stochastic gradient updates when the data samples are highly dependent.

We note that our proof of the tight high-probability bound in Theorem 4.2 for mini-batch SGD involves substantial new developments compared with the proof of (Agarwal & Duchi, 2012). Next, we elaborate on our technical novelty.

- In (Agarwal & Duchi, 2012), the authors defined the following random variable:
  $$X_t^i := f\big( w((t-1)\tau + i) \big) - f(w^*) + F\big( w((t-1)\tau + i); \xi_{t+\tau-1} \big) - F\big( w^*; \xi_{t+\tau-1} \big).$$
  As this random variable involves only one sample $\xi_{t+\tau-1}$, they bound the bias term $X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]$ by a universal constant. As a comparison, the random variable $X_t^i$ involves a mini-batch of samples $x_{t+\tau-1}$ in our analysis. With the mini-batch structure, the bias $X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]$ can be written as an average of $B$ zero-mean dependent random variables, which is close to zero with high probability due to the concentration phenomenon. Consequently, we are able to apply a Bernstein-type inequality developed in (Delyon et al., 2009) for dependent stochastic processes to obtain an improved bias bound, from $O(1)$ to $\widetilde{O}(1/\sqrt{B})$. This is critical for obtaining the improved bound.
- Second, with the improved high-probability bias bound mentioned above, the remaining proof of (Agarwal & Duchi, 2012) no longer holds. Specifically, we can no longer apply Azuma's inequality to bound the accumulated bias $\sum_t \big( X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i] \big)$, as each bias term is no longer bounded with probability one. To address this issue, we develop a generalized Azuma's inequality for martingale differences in Lemma B.3, based on Proposition 34 of (Tao et al., 2015) for independent zero-mean random variables.
- Third, we develop a high-probability regret bound for mini-batch SGD over dependent data, so that it can be integrated with the high-probability convergence bound in Theorem 4.2. To the best of our knowledge, the regret of SGD over dependent data has not been studied before.

5 NUMERICAL EXAMPLE

We examine our theory via a basic convex quadratic optimization problem, which is written as

$$\min_{w \in \mathbb{R}^d} \; f(w) := \mathbb{E}_{\xi \sim \mu} \big[ (w - \xi)^\top A (w - \xi) \big],$$

where $A \succeq 0$ is a fixed positive semi-definite matrix and $\mu$ is the uniform distribution on $[0, 1]^d$. Then, following the construction in (Jarner & Roberts, 2002), we generate an algebraic φ-mixing Markov chain that has the stationary distribution $\mu$. In particular, its mixing coefficient $\varphi_\xi(k)$ converges at a sublinear rate $k^{-\frac{1}{r}}$, where $r > 0$ is a parameter that controls the speed of convergence. Please refer to Appendix D for more details of the experiment setup.

We first estimate the following stochastic bias at the fixed origin point $w = 0_d$:

$$\text{(Bias):} \quad \mathbb{E}\big[ F(w; x_\tau) \,\big|\, x_0 = 0_d \big] - f(w),$$

where the expectation is taken over the randomness of the mini-batch of samples queried at time $\tau \in \mathbb{N}$. Such a bias is affected by several factors, including the time gap $\tau$, the batch size $B$ and the convergence rate parameter $r$ of the mixing coefficient. In Figure 1, we investigate the impact of these factors on the stochastic bias, using 10k Monte Carlo samples for the estimate. The left two figures fix the batch size, and it can be seen that the bias decreases as $\tau$ increases, which matches the definition of the φ-mixing property. Also, a faster-mixing Markov chain (i.e., smaller $r$) leads to a smaller bias. In particular, with batch size $B = 1$ and a slow-mixing chain ($r = 2$), it takes an unacceptably large $\tau$ to achieve a relatively small bias. This provides an empirical justification of Corollary 3.2 and explains why the standard SGD suffers from a high sample complexity over highly dependent data. Moreover, as the batch size gets larger, one can achieve a numerically smaller bias, which matches our Lemma 4.1. The right two figures fix the convergence rate parameter of the mixing coefficient, and it can be seen that increasing the batch size significantly reduces the bias. Consequently, instead of choosing a large $\tau$ to reduce the bias, one can simply choose a large batch size $B = 100$ and set $\tau = 1$. This observation matches and justifies our theoretical results in Corollary 4.3.

Figure 1: Impact of $\tau$, batch size $B$ and convergence rate of the mixing coefficient on the bias.
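Bias estimates of this kind can be reproduced with a short Monte Carlo routine. The sketch below is our own illustration: the paper's actual chain is specified in Appendix D (not reproduced here), so we substitute a hypothetical independence Metropolis-Hastings chain targeting the uniform distribution whose proposal is thin near the origin, in the spirit of the polynomially ergodic samplers of (Jarner & Roberts, 2002). All function names and constants are assumptions.

```python
import numpy as np

def imh_step(x, kappa, rng):
    """One independence Metropolis-Hastings step targeting Uniform[0,1]^d.
    The proposal density is proportional to prod_i y_i^kappa, which puts
    little mass near 0, so the chain lingers near the origin and mixes
    only polynomially fast -- a hypothetical stand-in for Appendix D."""
    y = rng.random(x.shape) ** (1.0 / (1.0 + kappa))      # inverse-CDF draw
    ratio = np.prod((x / np.maximum(y, 1e-12)) ** kappa)  # q(x) / q(y)
    return y if rng.random() < min(1.0, ratio) else x

def estimate_bias(A, tau, B, kappa, n_mc=10_000, d=2, seed=0):
    """Monte Carlo estimate of E[F(0; x_tau) | chain started near 0_d]
    - f(0), where F(w; xi) = (w - xi)^T A (w - xi): run the chain tau*B
    steps ahead of the start, then average the loss over a batch of B
    consecutive samples."""
    rng = np.random.default_rng(seed)
    S = np.full((d, d), 0.25)        # E[xi_i xi_j] = 1/4 for i != j
    np.fill_diagonal(S, 1.0 / 3.0)   # E[xi_i^2]    = 1/3
    f0 = float(np.sum(A * S))        # f(0) = E[xi^T A xi], xi ~ U[0,1]^d
    total = 0.0
    for _ in range(n_mc):
        x = np.full(d, 1e-3)         # start next to the sticky origin
        for _ in range(tau * B):     # advance tau batches
            x = imh_step(x, kappa, rng)
        losses = []
        for _ in range(B):           # the mini-batch x_tau
            losses.append(x @ A @ x)
            x = imh_step(x, kappa, rng)
        total += np.mean(losses)
    return total / n_mc - f0
```

Sweeping `tau`, `B`, and the stickiness `kappa` then reproduces the qualitative trends above: the bias falls as either $\tau$ or $B$ grows, and faster-mixing chains incur smaller bias.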
We further compare the convergence of SGD, SGD with subsampling and mini-batch SGD. Here, we set $r = 2$ to generate highly dependent data samples. We set the learning rate $\eta = 0.01$ for both SGD and SGD with subsampling, and set the learning rate $\eta = 0.01 \times \sqrt{B / \sum_{j=1}^{B} \varphi_\xi(j)} = 0.01 \times 100^{1/4}$ for mini-batch SGD with batch size $B = 100$, as suggested by Theorem C.3 in the appendix. The results are plotted in Figure 2, where each curve corresponds to the mean of 100 independent trials. It can be seen that SGD with subsampling achieves a lower loss than the standard SGD asymptotically, due to the use of less dependent data. Moreover, mini-batch SGD achieves the smallest asymptotic loss. All these observations are consistent with our theoretical results.

Figure 2: Comparison of sample complexity of different SGD algorithms.

6 CONCLUSION

In this study, we investigate the convergence properties of SGD under various popular stochastic update schemes over highly dependent data. Unlike the conventional i.i.d. data setting, in which the stochastic update scheme does not affect the sample complexity of SGD, the convergence of SGD in the dependent-data setting critically depends on the structure of the stochastic update scheme. In particular, we show that both data subsampling and mini-batch sampling can substantially improve the sample complexity of SGD over highly dependent data. Our study takes one step forward toward understanding the theoretical limits of stochastic optimization over dependent data, and it opens many directions for future study. For example, it is interesting to further explore the impact of algorithm structure on the sample complexity of stochastic reinforcement learning algorithms. Also, it is important to develop advanced algorithm update schemes that can facilitate the convergence of learning over highly dependent data.

REFERENCES

Alekh Agarwal and John C Duchi. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1):573–587, 2012.

Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. https://rltheorybook.github.io/, 2021.
Non-asymptotic analysis of stochastic approximation algorithms for streaming data. arXiv:2109.07117, 2021. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Elad Hazan. Introduction to Online Convex Optimization. Elad Hazan, Erscheinungsort nicht ermittelbar, 2017. ISBN 1521133301. Søren F Jarner and Gareth O Roberts. Polynomial convergence rates of markov chains. The Annals of Applied Probability, 12(1):224–247, 2002. Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with markovian noise. In Conference on Learning Theory, pp. 2144–2203. PMLR, 2020. Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of biased stochastic approximation scheme. In Proc. Conference on Learning Theory, pp. 1944– 1974, 2019. Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, ii: Markovian noise and policy evaluation in reinforcement learning. arXiv:2011.02987, 11 2020. Harshat Kumar, Alec Koppel, and Alejandro Ribeiro. On the sample complexity of actor-critic method for reinforcement learning with function approximation. ArXiv:1910.08412, 2019. Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Tightening the dependence on horizon in the sample complexity of q-learning. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 6296–6306. PMLR, 2021. Kurt Marti. Stochastic optimization of regulators. Computers and Structures, 180:40–51, February 2017. Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. In Proc. International Conference on Machine Learning, pp. 664–671, 2008. ----- Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv:1312.5602, 2013. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015. Dharmendra S Modha and Elias Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Transactions on Information Theory, 42(6):2133–2145, 1996. Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. In Proc. Advances in Neural Information Processing Systems, volume 33, 2020. Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On the finite-time convergence of actorcritic algorithm. In NeurIPS Optimization Foundations for Reinforcement Learning Workshop, 2019. Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and q-learning. In Proc. Conference on Learning Theory, pp. 3185–3205, 2020. Ingo Steinwart and Andreas Christmann. Fast learning from non-iid observations. Advances in neural information processing systems, 22:1768–1776, 2009. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Bradford, 2018. Terence Tao, Van Vu, et al. Random matrices: universality of local spectral statistics of nonhermitian matrices. 
Annals of Probability, 43(2):782–874, 2015.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2019.

Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods. In Proc. Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp. 17617–17628, 2020.

Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. Proc. Advances in Neural Information Processing Systems, 22:2116–2124, 2009.

Pan Xu and Quanquan Gu. A finite-time analysis of Q-learning with neural network function approximation. In Proc. International Conference on Machine Learning, pp. 10555–10565, 2020.

Tengyu Xu, Shaofeng Zou, and Yingbin Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.

Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural) actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020.

Zhuoran Yang, Yongxin Chen, Mingyi Hong, and Zhaoran Wang. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.

Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear function approximation. In Proc. Advances in Neural Information Processing Systems, pp. 8665–8675, 2019.

# Appendix

### Table of Contents

**A Proof of Corollary 3.3**
**B Proof of Theorem 4.2**
  B.1 Key Lemmas
  B.2 Proof of the Main Result
**C Regret Analysis of Mini-Batch SGD**
**D Experiment Setup**

**Notation:** To simplify the notation, throughout the appendix we denote $\xi_t^{(i)} := \xi_{(t-1)B+i}$, which corresponds to the $i$-th data sample of the $t$-th mini-batch $x_t$. With this notation, we have $x_t = \{\xi_t^{(1)}, \xi_t^{(2)}, \ldots, \xi_t^{(B)}\}$.

A PROOF OF COROLLARY 3.3

In this section, we analyze the convergence error bound of SGD with data subsampling in (4). Given a $\varphi_\xi$-mixing data stream $\{\xi_1, \xi_2, \xi_3, \ldots\}$, we consider the following subsampled data stream:

$$\{\xi_1, \xi_{r+1}, \xi_{2r+1}, \ldots\}.$$

Let $\mathcal{F}$ be the canonical filtration generated by $\{\xi_{tr+1}\}_t$. Then the subsampled data stream $\{\xi_{tr+1}\}_t$ is $\varphi_\xi^r$-mixing with the mixing coefficient given by

$$\varphi_\xi^r(t) = \varphi_\xi(rt).$$

With this mixing coefficient, we can apply Theorem 2 of (Agarwal & Duchi, 2012) and obtain the following convergence error bound for any $\tau \in \mathbb{N}$:

$$f(\hat{w}_n) - f(w^*) \le O\bigg( \frac{R_n}{n} + \frac{\tau - 1}{n} \sum_{t=1}^{n} \kappa(t) + \frac{\tau}{n} + \sqrt{\frac{\tau}{n} \log \frac{\tau}{\delta}} + \varphi_\xi(r\tau) \bigg).$$

Considering the standard SGD with a diminishing learning rate, we have $\kappa(t) = O(\frac{1}{\sqrt{t}})$ and $R_n = O(\sqrt{n})$. Then, the convergence error bound becomes

$$f(\hat{w}_n) - f(w^*) \le O\bigg( \frac{1}{\sqrt{n}} + \inf_{\tau \in \mathbb{N}} \Big\{ \frac{\tau - 1}{\sqrt{n}} + \sqrt{\frac{\tau}{n} \log \frac{\tau}{\delta}} + \varphi_\xi(r\tau) \Big\} \bigg).$$

The above result further implies the following sample complexity results for different convergence rates of the mixing coefficient.

- Geometric φ-mixing: In this case, $\varphi_\xi(k) \le O(\exp(-k^{\theta}))$ for some $\theta > 0$.
Set the last term $\varphi_\xi(r\tau) = O(\epsilon)$. We obtain that $r\tau = O\big( (\log \frac{1}{\epsilon})^{\frac{1}{\theta}} \big)$. Further set the second term $\frac{\tau - 1}{\sqrt{n}} = O(\epsilon)$. We obtain that $n = O(\tau^2 \epsilon^{-2})$. By choosing $\tau = O(1)$, the sample complexity is in the order of
$$\epsilon\text{-complexity} = r \cdot n = O\Big( \big( \log \tfrac{1}{\epsilon} \big)^{\frac{1}{\theta}} \tau^2 \epsilon^{-2} \Big) = O\Big( \epsilon^{-2} \big( \log \epsilon^{-1} \big)^{\frac{1}{\theta}} \Big).$$

- Algebraic φ-mixing: In this case, $\varphi_\xi(k) \le O(k^{-\theta})$ for some $\theta > 0$. Set the last term $\varphi_\xi(r\tau) = O(\epsilon)$. We obtain that $r\tau = O(\epsilon^{-\frac{1}{\theta}})$. Set the second term $\frac{\tau - 1}{\sqrt{n}} = O(\epsilon)$. We obtain that $n = O(\tau^2 \epsilon^{-2})$. By setting $\tau = O(1)$, the sample complexity is in the order of
$$\epsilon\text{-complexity} = r \cdot n = O\big( \epsilon^{-\frac{1}{\theta}} \tau^2 \epsilon^{-2} \big) = O\big( \epsilon^{-2-\frac{1}{\theta}} \big).$$

B PROOF OF THEOREM 4.2

B.1 KEY LEMMAS

In this subsection, we present several useful preliminary results for proving Theorem 4.2. Define $\mathbb{N} := \{1, 2, 3, \ldots\}$. Throughout this subsection, we assume that Assumption 2.1 holds. The following lemma is a generalization of Lemma 1 in (Agarwal & Duchi, 2012).

**Lemma B.1.** Let $w, v$ be measurable with respect to $\mathcal{F}_t$. Then for any $\tau \in \mathbb{N}$,

$$\mathbb{E}\big[ F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \,\big|\, \mathcal{F}_t \big] \le \frac{GR}{B} \sum_{i=1}^{B} \varphi_\xi(\tau B + i) + f(w) - f(v).$$

*Proof.* For any $\tau \in \mathbb{N}$, we consider the following decomposition:

$$\mathbb{E}[F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \,|\, \mathcal{F}_t] = \mathbb{E}[F(w; x_{t+\tau}) - f(w) + f(v) - F(v; x_{t+\tau}) \,|\, \mathcal{F}_t] + f(w) - f(v)$$
$$= \underbrace{ \frac{1}{B} \sum_{i=1}^{B} \bigg[ \int F(w; \xi) \, \mathrm{d}\mathbb{P}(\xi_{t+\tau}^{(i)} \in \mathrm{d}\xi \,|\, \mathcal{F}_t) - \int F(w; \xi) \, \mathrm{d}\mu \bigg] - \frac{1}{B} \sum_{i=1}^{B} \bigg[ \int F(v; \xi) \, \mathrm{d}\mathbb{P}(\xi_{t+\tau}^{(i)} \in \mathrm{d}\xi \,|\, \mathcal{F}_t) - \int F(v; \xi) \, \mathrm{d}\mu \bigg] }_{\text{(A)}} + f(w) - f(v).$$

We can further bound the term (A) using the mixing property of the dependent data stream:

$$\text{(A)} = \frac{1}{B} \sum_{i=1}^{B} \int \big( F(w; \xi) - F(v; \xi) \big) \, \mathrm{d}\big( \mathbb{P}(\xi_{t+\tau}^{(i)} \in \mathrm{d}\xi \,|\, \mathcal{F}_t) - \mu(\mathrm{d}\xi) \big) \le \frac{1}{B} \sum_{i=1}^{B} \int GR \, \Big| \mathrm{d}\big( \mathbb{P}(\xi_{t+\tau}^{(i)} \in \mathrm{d}\xi \,|\, \mathcal{F}_t) - \mu(\mathrm{d}\xi) \big) \Big| \le \frac{GR}{B} \sum_{i=1}^{B} \varphi_\xi(\tau B + i),$$

where the first inequality uses the facts that $F(\cdot\,; \xi)$ is $G$-Lipschitz and the domain is bounded by $R$, and the second inequality is implied by the φ-mixing property. Substituting the above upper bound on (A) into the previous equation yields that

$$\mathbb{E}[F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \,|\, \mathcal{F}_t] \le \frac{GR}{B} \sum_{i=1}^{B} \varphi_\xi(\tau B + i) + f(w) - f(v).$$

This completes the proof. $\square$

**Proposition B.2.** Let $\{w(t)\}_{t \in \mathbb{N}}$ be the model parameter sequence generated by (6). Also suppose that Assumption 3.1 holds. Then for any $\tau \in \mathbb{N}$, we have

$$\sum_{t=1}^{n} \big[ f(w(t)) - f(w^*) \big] \le \sum_{t=1}^{n} \big[ f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*) \big] + R_n + G(\tau - 1) \sum_{t=1}^{n-\tau+1} \kappa(t) + GR(\tau - 1).$$

*Proof.* For any $\tau \in \mathbb{N}$, we consider the following decomposition:

$$\sum_{t=1}^{n} \big[ f(w(t)) - f(w^*) \big] = \sum_{t=1}^{n} \big[ f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*) \big] + \underbrace{ \sum_{t=1}^{n} \big[ F(w(t); x_{t+\tau-1}) - F(w^*; x_{t+\tau-1}) \big] }_{\text{(B)}}. \qquad (11)$$

We will keep the first term and bound the term (B).
$$\text{(B)} = \underbrace{ \sum_{t=1}^{n} \big[ F(w(t); x_t) - F(w^*; x_t) \big] }_{\text{(B1)}} + \underbrace{ \sum_{t=1}^{n-\tau+1} \big[ F(w(t); x_{t+\tau-1}) - F(w(t+\tau-1); x_{t+\tau-1}) \big] }_{\text{(B2)}} + \underbrace{ \sum_{t=n-\tau+2}^{n} F(w(t); x_{t+\tau-1}) - \sum_{t=1}^{\tau-1} F(w(t); x_t) + \sum_{t=1}^{\tau-1} F(w^*; x_t) - \sum_{t=n+1}^{n+\tau-1} F(w^*; x_t) }_{\text{(B3)}}.$$

Recall that the term (B1) is the regret $R_n$. We can bound the term (B2) by noting that

$$F(w(t); x_{t+\tau-1}) - F(w(t+\tau-1); x_{t+\tau-1}) \le G \|w(t+\tau-1) - w(t)\| \le G \sum_{i=0}^{\tau-2} \|w(t+i+1) - w(t+i)\| \le G \sum_{i=0}^{\tau-2} \kappa(t+i) \le G(\tau-1) \kappa(t).$$

For the term (B3), we can bound it using the $G$-Lipschitzness of $F(\cdot\,; \xi)$ and the $R$-bounded domain:

$$\text{(B3)} = \bigg[ \sum_{t=n-\tau+2}^{n} F(w(t); x_{t+\tau-1}) - \sum_{t=n+1}^{n+\tau-1} F(w^*; x_t) \bigg] - \bigg[ \sum_{t=1}^{\tau-1} F(w(t); x_t) - \sum_{t=1}^{\tau-1} F(w^*; x_t) \bigg]$$
$$\le G \sum_{t=n-\tau+2}^{n} \|w(t) - w^*\| + G \sum_{t=1}^{\tau-1} \|w(t) - w^*\| \le GR(\tau - 1).$$

Combining the above bounds on (B1), (B2), and (B3), we obtain the upper bound on (B):

$$\text{(B)} \le R_n + G(\tau - 1) \sum_{t=1}^{n-\tau+1} \kappa(t) + GR(\tau - 1).$$

Then the proof is completed by substituting the upper bound on (B) into (11). $\square$

The following generalized Azuma's inequality extends Proposition 34 of (Tao et al., 2015). It can be used to bound sums of martingale difference random variables.

**Lemma B.3 (Generalized Azuma's Inequality).** Let $\{X_t\}$ be a martingale difference sequence with respect to its canonical filtration $\mathcal{F}$. Define $Y = \sum_{i=1}^{T} X_i$ and assume $\mathbb{E}|Y| < \infty$. Then for any $\{\alpha_t\}_t > 0$,

$$\mathbb{P}\bigg( |Y - \mathbb{E}Y| \ge \lambda \sqrt{\sum_{t=1}^{T} \alpha_t^2} \bigg) \le 2 \exp\Big( -\frac{\lambda^2}{2} \Big) + \sum_{t=1}^{T} \mathbb{P}\big( |X_t| \ge \alpha_t \big).$$

*Proof.* Let $T' := \min\{t : |X_t| > \alpha_t\}$. Then the sets $B_t := \{\omega : T'(\omega) = t\}$ are disjoint. Construct

$$Y'(\omega) := \begin{cases} Y(\omega) & \text{if } \omega \in \big( \bigcup_{t=1}^{T} B_t \big)^{c}, \\ \mathbb{E}[Y \,|\, B_t] & \text{if } \omega \in B_t \text{ for some } t \in \{1, 2, \ldots, T\}. \end{cases}$$

By the above construction, the associated Doob martingale of $Y'$ with respect to $\mathcal{F}$ is $\{Z_t := \sum_{i=1}^{t \wedge T'} X_i\}$. It satisfies the conditions of Azuma's inequality, i.e.,

- $\{Z_t\}$ forms a martingale with respect to $\mathcal{F}$ (because the stopped martingale is still a martingale);
- $|Z_t - Z_{t-1}| \le \alpha_t$.

Then we can apply Azuma's inequality to $Y'$:

$$\mathbb{P}\bigg( |Y' - \mathbb{E}Y'| \ge \lambda \sqrt{\sum_{t=1}^{T} \alpha_t^2} \bigg) \le 2 \exp\Big( -\frac{\lambda^2}{2} \Big).$$

Now we can bound $\mathbb{P}\big( |Y - \mathbb{E}Y| \ge \lambda \sqrt{\sum_{t=1}^{T} \alpha_t^2} \big)$ as follows:

$$\mathbb{P}\bigg( |Y - \mathbb{E}Y| \ge \lambda \sqrt{\sum_{t=1}^{T} \alpha_t^2} \bigg) = \mathbb{P}\bigg( |Y - \mathbb{E}Y| \ge \lambda \sqrt{\sum_{t=1}^{T} \alpha_t^2},\ Y = Y' \bigg) + \mathbb{P}\bigg( |Y - \mathbb{E}Y| \ge \lambda \sqrt{\sum_{t=1}^{T} \alpha_t^2},\ Y \ne Y' \bigg)$$
$$\le \mathbb{P}\bigg( |Y' - \mathbb{E}Y'| \ge \lambda \sqrt{\sum_{t=1}^{T} \alpha_t^2} \bigg) + \mathbb{P}(Y \ne Y') \le 2 \exp\Big( -\frac{\lambda^2}{2} \Big) + \sum_{t=1}^{T} \mathbb{P}\big( |X_t| \ge \alpha_t \big).$$

Then the proof is completed. Here we use the fact that $\mathbb{E}Y' = \mathbb{E}Y$ by our construction. $\square$

The following lemma is taken from equation (22), Theorem 4 of (Delyon et al., 2009).
**Lemma B.4 (Bernstein's Inequality for Dependent Processes).** Let $\{Z_t\}$ be a centered adaptive process with respect to $\mathcal{F}$. Define the following quantities:

$$q = \sum_{k=1}^{n} \sum_{i=1}^{k-1} \|Z_i\|_\infty \cdot \big\| \mathbb{E}[Z_k \,|\, \mathcal{F}_i] \big\|_\infty, \qquad v = \sum_{k=1}^{n} \big\| \mathbb{E}[Z_k^2 \,|\, Z_{k-1}, \ldots, Z_1] \big\|_\infty, \qquad m = \sup_{1 \le i \le n} \|Z_i\|_\infty.$$

Then, it holds that

$$\mathbb{P}\bigg( \sum_{i=1}^{n} Z_i \ge t \bigg) \le \exp\bigg( -\frac{t^2}{2(v + 2q) + 2tm/3} \bigg).$$

**Application of Lemma B.4 in our proof.** Here we make some comments about how to apply this inequality in our main proof. Throughout, we use the batch-level filtration $\mathcal{F}$ and the intra-batch-level filtration $\widehat{\mathcal{F}}$; the formal definitions are given in Section B.2. We define the following random variable in our proof:

$$X_t^i = f\big( w((t-1)\tau + 1) \big) - f(w^*) + F(w^*; x_{t\tau+i-1}) - F\big( w((t-1)\tau + 1); x_{t\tau+i-1} \big).$$

We also define the filtration $\mathcal{F}_t^i := \mathcal{F}_{t\tau+i-1}$ for simplicity. Then, we have

$$\mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i] = f\big( w((t-1)\tau + 1) \big) - f(w^*) + \mathbb{E}\big[ F(w^*; x_{t\tau+i-1}) - F\big( w((t-1)\tau + 1); x_{t\tau+i-1} \big) \,\big|\, \mathcal{F}_{t-1}^i \big].$$

Then, the bias can be rewritten as

$$X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i] = F(w^*; x_{t\tau+i-1}) - F\big( w((t-1)\tau+1); x_{t\tau+i-1} \big) - \mathbb{E}\big[ F(w^*; x_{t\tau+i-1}) - F\big( w((t-1)\tau+1); x_{t\tau+i-1} \big) \,\big|\, \mathcal{F}_{t-1}^i \big] = \frac{1}{B} \sum_{\xi \in x_{t\tau+i-1}} Y_t^i(\xi),$$

where $Y_t^i$ is defined as

$$Y_t^i(\xi) = F(w^*; \xi) - F\big( w((t-1)\tau+1); \xi \big) - \mathbb{E}\big[ F(w^*; \xi) - F\big( w((t-1)\tau+1); \xi \big) \,\big|\, \mathcal{F}_{t-1}^i \big].$$

More specifically, we have

$$X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i] = \frac{1}{B} \sum_{\xi \in x_{t\tau+i-1}} Y_t^i(\xi) = \frac{1}{B} \sum_{j=1}^{B} Y_t^i\big( \xi_{t\tau+i-1}^{(j)} \big).$$

Recall that $\widehat{\mathcal{F}}$ is the canonical filtration generated from the data stream (12). Moreover, $\{Y_t^i(\xi_{t\tau+i-1}^{(j)})\}_{j=1,2,\ldots,B}$ is centered and adaptive with respect to this filtration. Then we can evaluate the quantities $q$, $v$, and $m$ in Lemma B.4 as follows.

- Bounding $m$ is simple. By Assumption 2.1 we have $\|Y_t^i(\xi_{t\tau+i-1}^{(j)})\| \le 2GR$.
- The above bound on $m$ leads to a simple bound for $v$, i.e., $v \le 2nG^2R^2$ (with $n = B$ terms in our application).
- The quantity $q$ can be bounded as follows:

$$q := \sum_{k=1}^{n} \sum_{j=1}^{k-1} \big\| Y_t^i(\xi_{t\tau+i-1}^{(j)}) \big\|_\infty \big\| \mathbb{E}\big[ Y_t^i(\xi_{t\tau+i-1}^{(k)}) \,\big|\, \widehat{\mathcal{F}}_{t\tau+i-1}^{(j)} \big] \big\|_\infty \le 2GR \sum_{k=1}^{n} \sum_{j=1}^{k-1} \big\| \mathbb{E}\big[ Y_t^i(\xi_{t\tau+i-1}^{(k)}) \,\big|\, \widehat{\mathcal{F}}_{t\tau+i-1}^{(j)} \big] \big\|_\infty$$
$$= 2GR \sum_{k=1}^{n} \sum_{j=1}^{k-1} \big\| \mathbb{E}\big[ Y_t^i(\xi_{t\tau+i-1}^{(k)}) \,\big|\, \widehat{\mathcal{F}}_{t\tau+i-1}^{(j)} \big] - \mathbb{E}_{\xi \sim \mu} Y_t^i\big( \xi_{t\tau+i-1}^{(k)} \big) \big\|_\infty \le 4G^2R^2 \sum_{k=1}^{n} \sum_{j=1}^{k-1} \varphi_\xi(k - j) \le 4G^2R^2 n \sum_{i=1}^{n} \varphi_\xi(i).$$

Then, by applying Lemma B.4 (with $n = B$), we obtain the following high-probability bound:

$$\mathbb{P}\big( |X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]| \ge t \big) \le 2 \exp\bigg( -\frac{B^2 t^2}{2\big( 2G^2R^2 B + 8G^2R^2 B \sum_{i=1}^{B} \varphi_\xi(i) \big) + 4GRBt/3} \bigg) = 2 \exp\bigg( -\frac{B t^2}{2\big( 2G^2R^2 + 8G^2R^2 \sum_{i=1}^{B} \varphi_\xi(i) \big) + 4GRt/3} \bigg).$$

Simplifying yields that

$$\mathbb{P}\big( |X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i]| \ge t \big) \le 2 \exp\bigg( -\frac{B t^2}{C + \frac{4}{3} GRt + 16G^2R^2 \sum_{i=1}^{B} \varphi_\xi(i)} \bigg),$$

where $C := 4G^2R^2$.

B.2 PROOF OF THE MAIN RESULT

Recall that we are considering a data stream divided into small mini-batches. For convenience, we re-label the data stream $\{\xi_1, \xi_2, \xi_3, \ldots\}$ as follows to explicitly indicate its mini-batch index:
$$\{\xi_1^{(1)}, \xi_1^{(2)}, \ldots, \xi_1^{(B)}, \xi_2^{(1)}, \xi_2^{(2)}, \ldots, \xi_2^{(B)}, \ldots\}. \qquad (12)$$

The canonical filtration generated by the re-labeled data stream is denoted by $\widehat{\mathcal{F}}$. Also, when the batch size is clear from the context, we denote the data in the specified mini-batch as $x$. For example, we use $x_t$ to represent the $t$-th mini-batch $\{\xi_t^{(1)}, \xi_t^{(2)}, \ldots, \xi_t^{(B)}\}$. Then we can re-write the above data stream as

$$\{x_1, x_2, x_3, \ldots\}.$$

We denote the canonical filtration generated by the above sequence as $\mathcal{F}$. Note that we have the following relation: $\mathcal{F}_t = \widehat{\mathcal{F}}_t^{(B)}$. In summary, when we analyze the mini-batch SGD dynamics, we use the filtration $\mathcal{F}$, and when we need to consider intra-batch samples, we use the filtration $\widehat{\mathcal{F}}$.

**Theorem B.5.** Let $\{w(t)\}_{t \in \mathbb{N}}$ be the model parameter sequence generated by (6). Suppose Assumptions 2.1 and 3.1 hold. Then, for any $\tau \in \mathbb{N}$, with probability at least $1 - \delta$, we have

$$\sum_{t=1}^{n} \big[ f(w(t)) - f(w^*) \big] \le \frac{GRn}{B} \sum_{i=1}^{B} \varphi_\xi(\tau B + i) + \gamma + R_n + G(\tau - 1) \sum_{t=1}^{n-\tau+1} \kappa(t) + GR(\tau - 1),$$

where

$$\gamma = \sqrt{\frac{2\tau n}{B} \log \frac{4\tau}{\delta}} \Bigg( \frac{2}{3} \frac{GR}{\sqrt{B}} \log \frac{4n}{\delta} + \sqrt{ \frac{4}{9} \frac{G^2R^2}{B} \Big( \log \frac{4n}{\delta} \Big)^2 + \Big( 4G^2R^2 + 16G^2R^2 \sum_{i=1}^{B} \varphi_\xi(i) \Big) \log \frac{4n}{\delta} } \Bigg).$$

In particular, if $\tau = 1$, then

$$\sum_{t=1}^{n} \big[ f(w(t)) - f(w^*) \big] \le R_n + \frac{GRn}{B} \sum_{i=1}^{B} \varphi_\xi(B + i) + \sqrt{\frac{2n}{B} \log \frac{4}{\delta}} \Bigg( \frac{2}{3} \frac{GR}{\sqrt{B}} \log \frac{4n}{\delta} + \sqrt{ \frac{4}{9} \frac{G^2R^2}{B} \Big( \log \frac{4n}{\delta} \Big)^2 + \Big( 4G^2R^2 + 16G^2R^2 \sum_{i=1}^{B} \varphi_\xi(i) \Big) \log \frac{4n}{\delta} } \Bigg).$$

*Proof.* From Proposition B.2, we obtain the following bound:

$$\sum_{t=1}^{n} \big[ f(w(t)) - f(w^*) \big] \le \sum_{t=1}^{n} \big[ f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*) \big] + R_n + G(\tau - 1) \sum_{t=1}^{n-\tau+1} \kappa(t) + GR(\tau - 1).$$

To complete the proof, it suffices to bound the first term; we define this term as

$$Z_n := \sum_{t=1}^{n} \big[ f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*) \big].$$

We apply the same decomposition as in (13) of (Agarwal & Duchi, 2012). Define the index set $\mathcal{I}(i)$ as $\{1, \ldots, \lfloor \frac{n}{\tau} \rfloor + 1\}$ for $i \le n - \tau \lfloor \frac{n}{\tau} \rfloor$ and $\{1, \ldots, \lfloor \frac{n}{\tau} \rfloor\}$ otherwise. Then we have

$$Z_n = \sum_{i=1}^{\tau} \sum_{t \in \mathcal{I}(i)} \big( X_t^i - \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i] \big) + \sum_{i=1}^{\tau} \sum_{t \in \mathcal{I}(i)} \mathbb{E}[X_t^i \,|\, \mathcal{F}_{t-1}^i],$$
*Proof.* From Proposition B.2, we obtain the following bound:

$$\sum_{t=1}^{n}\big[f(w(t)) - f(w^*)\big] \le \sum_{t=1}^{n}\big[f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*)\big] + R_n + G(\tau-1)\sum_{t=1}^{n-\tau+1}\kappa(t) + GR(\tau-1).$$

To complete the proof, it suffices to bound the first term, which we denote by

$$Z_n := \sum_{t=1}^{n}\big[f(w(t)) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) - f(w^*)\big].$$

We apply the same decomposition as in (13) of (Agarwal & Duchi, 2012). Define the index set $I(i)$ as $\{1,\dots,\lfloor n/\tau\rfloor + 1\}$ for $i \le n - \tau\lfloor n/\tau\rfloor$, and as $\{1,\dots,\lfloor n/\tau\rfloor\}$ otherwise. Then we have

$$Z_n = \sum_{i=1}^{\tau}\sum_{t\in I(i)}\big[X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]\big] + \sum_{i=1}^{\tau}\sum_{t\in I(i)}\mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i],$$

where

$$X_t^i = f\big(w((t-1)\tau+1)\big) - f(w^*) + F(w^*; x_{t\tau+i-1}) - F\big(w((t-1)\tau+1); x_{t\tau+i-1}\big).$$

Note that by Lemma 4.1 we have $\mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i] \le \frac{GR}{B}\sum_{j=1}^{B}\phi_\xi(\tau B + j)$. Then, we have

$$\mathbb{P}\Big(Z_n > \frac{nGR}{B}\sum_{j=1}^{B}\phi_\xi(\tau B + j) + \gamma\Big) \le \mathbb{P}\Big(\bigcup_{i=1}^{\tau}\Big\{\sum_{t\in I(i)}\big[X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]\big] > \frac{\gamma}{\tau}\Big\}\Big) \le \sum_{i=1}^{\tau}\mathbb{P}\Big(\sum_{t\in I(i)}\big[X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]\big] > \frac{\gamma}{\tau}\Big).$$

Define $Y := \sum_{t\in I(i)}\big[X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]\big]$ and $\alpha := \frac{\lambda}{\sqrt{B}}$. Notice that $X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]$ is a centered random variable, that is, $\mathbb{E}\big[X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]\big] = 0$. Then, by the generalized Azuma's inequality (Lemma B.3), we conclude that

$$\mathbb{P}\Big(Y \ge \frac{\gamma}{\tau}\Big) \le 2\exp\Big(-\frac{\gamma^2}{2\tau^2(n/\tau)\alpha^2}\Big) + \sum_{t\in I(i)}\mathbb{P}\big(|X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]| \ge \alpha\big).$$

The second term can be bounded using the generalized Bernstein inequality; the detailed calculation can be found in the discussion after Lemma B.4. We obtain

$$\mathbb{P}\big(|X_t^i - \mathbb{E}[X_t^i\,|\,\mathcal{F}_{t-1}^i]| \ge \alpha\big) \le 2\exp\Big(-\frac{\lambda^2}{C + \frac{4}{3}GR\frac{\lambda}{\sqrt{B}} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)}\Big),$$

where $C = 4G^2R^2$. In summary, the concentration bound for $Z_n$ is

$$\mathbb{P}\Big(Z_n > \frac{nGR}{B}\sum_{j=1}^{B}\phi_\xi(\tau B + j) + \gamma\Big) \le 2\tau\exp\Big(-\frac{\gamma^2}{2\tau n\lambda^2/B}\Big) + 2n\exp\Big(-\frac{\lambda^2}{C + \frac{4}{3}GR\frac{\lambda}{\sqrt{B}} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)}\Big).$$

Then, setting $\frac{\delta}{2} = 2n\exp\big(-\frac{\lambda^2}{C + \frac{4}{3}GR\lambda/\sqrt{B} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)}\big)$, we obtain

$$\lambda^2 = \Big(C + \frac{4}{3}GR\frac{\lambda}{\sqrt{B}} + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)\Big)\cdot\log\frac{4n}{\delta}.$$

This is a quadratic equation in $\lambda$. Solving it yields

$$\lambda = \frac{2GR}{3\sqrt{B}}\log\frac{4n}{\delta} + \sqrt{\frac{4G^2R^2}{9B}\Big(\log\frac{4n}{\delta}\Big)^2 + \Big(C + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)\Big)\log\frac{4n}{\delta}}. \tag{13}$$

Also, setting $\frac{\delta}{2} = 2\tau\exp\big(-\frac{\gamma^2}{2\tau n\lambda^2/B}\big)$, we have

$$\gamma^2 = \frac{2\tau n\lambda^2}{B}\cdot\log\frac{4\tau}{\delta}.$$

Substituting (13) into the above equation, we obtain

$$\gamma = \sqrt{\frac{2\tau n}{B}\log\frac{4\tau}{\delta}}\left[\frac{2GR}{3\sqrt{B}}\log\frac{4n}{\delta} + \sqrt{\frac{4G^2R^2}{9B}\Big(\log\frac{4n}{\delta}\Big)^2 + \Big(C + 16G^2R^2\sum_{i=1}^{B}\phi_\xi(i)\Big)\log\frac{4n}{\delta}}\,\right].$$

Then, we conclude that with probability at least $1-\delta$,

$$\sum_{t=1}^{n}\big[f(w(t)) - f(w^*)\big] \le \frac{nGR}{B}\sum_{i=1}^{B}\phi_\xi(\tau B + i) + \gamma + R_n + G(\tau-1)\sum_{t=1}^{n-\tau+1}\kappa(t) + GR(\tau-1). \tag{14}$$

The desired result follows by noting that $\sum_{t=1}^{n} f(w(t)) \ge n f(\widehat{w}_n)$.

C REGRET ANALYSIS OF MINI-BATCH SGD

In this section, we derive the regret bound of the mini-batch SGD algorithm. Throughout, for each sample loss $F(w;\xi)$, recall that the gradient norm $\|\nabla F(w;\xi)\|$ is uniformly bounded by $G$ (see Assumption 2.1). In particular, we assume that the $k$-th coordinate of $\nabla F(w;\xi)$ is uniformly bounded by $G_k$, so that $G = \sqrt{\sum_k G_k^2}$.

**1. Gradient Variance Bound under Dependent Data**

In the i.i.d. setting, the variance of the stochastic gradient decreases as the batch size increases. Specifically, we have

$$\mathbb{E}\Big\|\frac{1}{B}\sum_{i=1}^{B}\nabla F(w;\xi_i) - \nabla f(w)\Big\|^2 = \frac{1}{B^2}\sum_{i=1}^{B}\mathbb{E}\|\nabla F(w;\xi_i) - \nabla f(w)\|^2 \le \frac{2G^2}{B},$$

so that $\mathbb{E}\|\frac{1}{B}\sum_{i=1}^{B}\nabla F(w;\xi_i) - \nabla f(w)\|^2 = \mathcal{O}(\frac{1}{B})$. However, this bound no longer holds if the data is collected from a dependent stochastic process, as the sketch below illustrates.
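As a quick illustration of this failure, the following Monte Carlo sketch compares the squared error of a batch mean under i.i.d. sampling and under consecutive draws from a stationary AR(1) chain with matched marginal variance. The AR(1) chain is only an illustrative stand-in for a dependent process (it is not the $\phi$-mixing model used in our analysis), and all constants are hypothetical.

```python
# A small Monte Carlo sketch of how dependence degrades the O(1/B) rate:
# compare E||batch mean - mean||^2 for B i.i.d. draws versus B consecutive
# draws of a stationary AR(1) chain with the same marginal variance.
import numpy as np

rng = np.random.default_rng(0)
rho, n_trials = 0.9, 2000
sigma = 1.0 / np.sqrt(1.0 - rho**2)  # stationary std of the AR(1) chain

def mean_sq_error(B, dependent):
    errs = []
    for _ in range(n_trials):
        if dependent:
            x = np.empty(B)
            x[0] = rng.normal(scale=sigma)       # start from stationarity
            for i in range(1, B):
                x[i] = rho * x[i - 1] + rng.normal()
        else:
            x = rng.normal(scale=sigma, size=B)  # i.i.d. with same marginal
        errs.append(np.mean(x) ** 2)             # stationary mean is 0
    return float(np.mean(errs))

print("  B   iid       dependent")
for B in [1, 4, 16, 64, 256]:
    print(f"{B:3d}  {mean_sq_error(B, False):.4f}    {mean_sq_error(B, True):.4f}")
```

For this geometrically mixing chain the error still decays in $B$, but it is inflated by a dependence factor of roughly $(1+\rho)/(1-\rho)$; for slowly (algebraically) mixing data the degradation is more severe, which is what the lemma below quantifies.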
In the following lemma, we develop an analogous bound when the data is collected from a dependent stochastic process. Recall that $\nabla F(w(t); x_t)$ denotes the averaged gradient over the mini-batch $x_t$, i.e.,

$$\nabla F(w(t); x_t) = \frac{1}{B}\sum_{i=1}^{B}\nabla F(w(t); \xi_t^{(i)}).$$

**Lemma C.1.** *Let $\{w(t)\}_{t\in\mathbb{N}}$ be the model parameter sequence generated by the mini-batch SGD in (6), and let Assumptions 2.1 and 3.1 hold. Then, with probability at least $1-\delta$,*

$$\|\nabla F(w(t); x_t) - \nabla f(w(t))\|^2 \le \Big[\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big]\cdot\frac{\log\frac{2d}{\delta}}{B} + 2G^2\Big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\Big)^2.$$

*Proof.* Let $x_t = \{\xi_t^{(i)}\}_{i=1}^{B}$ be the $t$-th mini-batch of samples. We consider the filtration within $x_t$ and denote it as $\{\widehat{\mathcal{F}}_t^{(i)}\}$. Then, by the definition of the canonical filtration,

$$X_i := \nabla F(w(t); \xi_t^{(i)})$$

is measurable with respect to $\widehat{\mathcal{F}}_t^{(i)}$. Define

$$Y_{i,k} := \big(X_i - \mathbb{E}[X_i\,|\,\mathcal{F}_{t-1}]\big)_k,$$

where $(\cdot)_k$ denotes the $k$-th entry of the specified vector. It is easy to see that $\{Y_{i,k}\}_i$ is a centered process for any $k\in\{1,2,\dots,d\}$. With these constructions, we start from the following decomposition:

$$\|\nabla F(w(t); x_t) - \nabla f(w(t))\|^2 = \big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}] + \mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}] - \nabla f(w(t))\big\|^2 \le \underbrace{2\big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}]\big\|^2}_{(A)} + \underbrace{2\big\|\mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}] - \nabla f(w(t))\big\|^2}_{(B)}.$$

We bound the terms (A) and (B), respectively.

- Bounding (A): Note that

$$\big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}]\big\|^2 = \frac{1}{B^2}\Big\|\sum_{i=1}^{B}\big[X_i - \mathbb{E}[X_i\,|\,\mathcal{F}_{t-1}]\big]\Big\|^2 = \frac{1}{B^2}\sum_{k=1}^{d}\Big(\sum_{i=1}^{B}\big(X_i - \mathbb{E}[X_i\,|\,\mathcal{F}_{t-1}]\big)_k\Big)^2 = \frac{1}{B^2}\sum_{k=1}^{d}\Big(\sum_{i=1}^{B}Y_{i,k}\Big)^2.$$

Next, we show that the process $\{Y_{i,k}\}_i$ satisfies the conditions of Lemma B.4.

  ◦ Since $\mathbb{E}[Y_{i,k}\,|\,\mathcal{F}_{t-1}] = 0$, we conclude that $\{Y_{i,k}\}_i$ is a centered process.
  ◦ Denote the $k$-th entry of $X_i$ as $X_{i,k}$. We know that $|X_{i,k}| \le G_k$; hence $0 \le |Y_{i,k}| \le 2G_k$, and we can set $b_i = 2G_k$ for all $i$.
  ◦ Lastly, we can bound the quantity $q$ defined in Lemma B.4 as follows:

$$q \le \sum_{j=1}^{B}\sum_{i=1}^{j-1} 2G_k\,\big\|\mathbb{E}[Y_{j,k}\,|\,\widehat{\mathcal{F}}_t^{(i)}]\big\|_{\infty} + \frac{4}{3}G_k^2B \le \sum_{j=1}^{B}\sum_{i=1}^{j-1} 4G_k^2\,\phi_\xi(j-i) + \frac{4}{3}G_k^2B \le 4G_k^2B\sum_{j=1}^{B}\phi_\xi(j) + \frac{4}{3}G_k^2B.$$

Now, we can apply Lemma B.4 and obtain

$$\mathbb{P}\Big(\sum_i Y_{i,k} > \lambda\Big) \le \exp\Big(-\frac{\lambda^2}{\frac{134}{3}G_k^2B + 128G_k^2B\sum_{j=1}^{B}\phi_\xi(j)}\Big).$$

With a union bound over $\pm Y_{i,k}$, we obtain

$$\mathbb{P}\Big(\Big|\sum_i Y_{i,k}\Big| > \lambda\Big) \le 2\exp\Big(-\frac{\lambda^2}{\frac{134}{3}G_k^2B + 128G_k^2B\sum_{j=1}^{B}\phi_\xi(j)}\Big).$$

Further applying the union bound over $k = 1, 2, \dots, d$, we obtain

$$\mathbb{P}\Big(\bigcup_{k=1}^{d}\Big\{\Big(\sum_i Y_{i,k}\Big)^2 > \lambda_k^2\Big\}\Big) \le \sum_{k=1}^{d} 2\exp\Big(-\frac{\lambda_k^2}{\frac{134}{3}G_k^2B + 128G_k^2B\sum_{j=1}^{B}\phi_\xi(j)}\Big).$$

Setting $\frac{\delta}{d} = 2\exp\big(-\frac{\lambda_k^2}{\frac{134}{3}G_k^2B + 128G_k^2B\sum_{j=1}^{B}\phi_\xi(j)}\big)$, we obtain

$$\lambda_k^2 = \Big[\frac{134}{3}G_k^2B + 128G_k^2B\sum_{j=1}^{B}\phi_\xi(j)\Big]\cdot\log\frac{2d}{\delta}.$$

Then we conclude that

$$\mathbb{P}\Big(\bigcap_{k=1}^{d}\Big\{\Big(\sum_i Y_{i,k}\Big)^2 \le \Big[\frac{134}{3}G_k^2B + 128G_k^2B\sum_{j=1}^{B}\phi_\xi(j)\Big]\log\frac{2d}{\delta}\Big\}\Big) \ge 1-\delta.$$

This implies that, with probability at least $1-\delta$,

$$\sum_{k=1}^{d}\Big(\sum_{i=1}^{B}Y_{i,k}\Big)^2 \le \Big[\frac{134}{3}B\sum_k G_k^2 + 128B\Big(\sum_{j=1}^{B}\phi_\xi(j)\Big)\Big(\sum_k G_k^2\Big)\Big]\cdot\log\frac{2d}{\delta}.$$

By definition, $G = \sqrt{\sum_k G_k^2}$. Finally, we have the following bound for term (A): with probability at least $1-\delta$,

$$\big\|\nabla F(w(t); x_t) - \mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}]\big\|^2 \le \Big[\frac{134}{3}G^2 + 128G^2\sum_{j=1}^{B}\phi_\xi(j)\Big]\cdot\frac{\log\frac{2d}{\delta}}{B}.$$
- Bounding (B): Note that

$$\big\|\mathbb{E}[\nabla F(w(t); \xi_t^{(i)})\,|\,\mathcal{F}_{t-1}] - \nabla f(w(t))\big\| = \Big\|\int\nabla F(w(t);\xi)\,\mathrm{d}\mathbb{P}\big(\xi_t^{(i)}\in\cdot\,|\,\mathcal{F}_{t-1}\big) - \int\nabla F(w(t);\xi)\,\mathrm{d}\mu(\xi)\Big\| \le \int\|\nabla F(w(t);\xi)\|\,\big|\mathrm{d}\mathbb{P}\big(\xi_t^{(i)}\in\cdot\,|\,\mathcal{F}_{t-1}\big) - \mathrm{d}\mu\big| \le G\cdot\phi_\xi(i).$$

Then we bound the norm by the triangle inequality:

$$\big\|\mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}] - \nabla f(w(t))\big\| \le \frac{1}{B}\sum_{i=1}^{B}\big\|\mathbb{E}[\nabla F(w(t); \xi_t^{(i)})\,|\,\mathcal{F}_{t-1}] - \nabla f(w(t))\big\| \le \frac{G}{B}\sum_{i=1}^{B}\phi_\xi(i).$$

Finally, we obtain the bound for term (B):

$$\big\|\mathbb{E}[\nabla F(w(t); x_t)\,|\,\mathcal{F}_{t-1}] - \nabla f(w(t))\big\|^2 \le G^2\Big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\Big)^2.$$

Combining the bounds on (A) and (B) yields that, with probability at least $1-\delta$,

$$\|\nabla F(w(t); x_t) - \nabla f(w(t))\|^2 \le \Big[\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big]\cdot\frac{\log\frac{2d}{\delta}}{B} + 2G^2\Big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\Big)^2.$$

**2. High-Probability Regret Bound**

To derive the regret bound for the mini-batch SGD algorithm, we make the following additional mild assumption.

**Assumption C.2.** *The stochastic optimization problem (P) satisfies:*
*1. Each sample loss $F(\cdot\,;\xi): \mathcal{W}\to\mathbb{R}$ is convex.*
*2. The objective function $f: \mathcal{W}\to\mathbb{R}$ is $L$-smooth.*

**Theorem C.3 (High-probability regret bound).** *Let $\{w(t)\}_{t\in\mathbb{N}}$ be the model parameter sequence generated by the mini-batch SGD in (6). Suppose Assumptions C.2, 3.1 and 2.1 hold. Then, with probability at least $1-\delta$,*

$$R_T \le \frac{\|w(1)-w^*\|^2}{2\eta} + \eta T\Big[\Big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big)\cdot\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\Big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\Big)^2\Big] + 2\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$

*Moreover, choosing $\eta = \mathcal{O}\Big(\sqrt{\frac{B}{T\sum_{j=1}^{B}\phi_\xi(j)}}\Big)$, the optimized upper bound is of the order*

$$R_T = \widetilde{\mathcal{O}}\Big(\sqrt{\frac{T\sum_{j=1}^{B}\phi_\xi(j)}{B}}\Big) + 2\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$

*Proof.* For convenience, define $g_t = \frac{1}{B}\sum_{i=1}^{B}\nabla F(w(t); \xi_t^{(i)})$. By the algorithm update (6), we obtain

$$2\langle g_t, w(t)-w^*\rangle \le \frac{\|w(t)-w^*\|^2 - \|w(t+1)-w^*\|^2}{\eta} + \eta\|g_t\|^2 \le \frac{\|w(t)-w^*\|^2 - \|w(t+1)-w^*\|^2}{\eta} + 2\eta\|g_t - \nabla f(w(t))\|^2 + 2\eta\|\nabla f(w(t))\|^2.$$

Summing this inequality over $t$ and using the $L$-smoothness bound $\|\nabla f(w)\|^2 \le 2L\big(f(w) - f(w^*)\big)$ yields

$$\sum_{t=1}^{T} 2\langle g_t, w(t)-w^*\rangle \le \frac{\|w(1)-w^*\|^2 - \|w(T+1)-w^*\|^2}{\eta} + 2\eta\sum_{t=1}^{T}\|g_t - \nabla f(w(t))\|^2 + 4\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$

By convexity of the sample loss, we further obtain

$$\sum_{t=1}^{T} 2\big(F(w(t); x_t) - F(w^*; x_t)\big) \le \frac{\|w(1)-w^*\|^2}{\eta} + 2\eta\sum_{t=1}^{T}\|g_t - \nabla f(w(t))\|^2 + 4\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$

Then, we apply Lemma C.1 to bound the second term $\sum_{t=1}^{T}\|g_t - \nabla f(w(t))\|^2$ and apply a union bound over $t$. We conclude that, with probability at least $1-\delta$,

$$\sum_{t=1}^{T}\big(F(w(t); x_t) - F(w^*; x_t)\big) \le \frac{\|w(1)-w^*\|^2}{2\eta} + \eta T\Big[\Big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big)\cdot\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\Big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\Big)^2\Big] + 2\eta L\sum_{t=1}^{T}\big(f(w(t)) - f(w^*)\big).$$

This completes the proof.

Lastly, we set the learning rate $\eta$. To minimize the obtained upper bound, it suffices to minimize the first two terms, as the last term can be combined with the left-hand side of (14) when we apply this regret bound. The optimized learning rate is achieved when

$$\frac{\|w(1)-w^*\|^2}{2\eta} = \eta T\Big[\Big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big)\cdot\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\Big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\Big)^2\Big].$$

Hence, $\eta$ is chosen as

$$\eta = \sqrt{\frac{\|w(1)-w^*\|^2/2}{T\Big[\Big(\frac{268}{3}G^2 + 256G^2\sum_{j=1}^{B}\phi_\xi(j)\Big)\frac{\log\frac{2dT}{\delta}}{B} + 2G^2\Big(\frac{\sum_{i=1}^{B}\phi_\xi(i)}{B}\Big)^2\Big]}} = \mathcal{O}\Bigg(\sqrt{\frac{B}{T\sum_{j=1}^{B}\phi_\xi(j)}}\Bigg).$$
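For concreteness, here is a minimal runnable sketch of the mini-batch SGD recursion analyzed above, assuming (6) is the standard projected update $w(t+1) = \Pi_{\mathcal{W}}(w(t) - \eta g_t)$ with $g_t$ the averaged mini-batch gradient. The toy quadratic loss, the i.i.d. placeholder data stream, and all constants are illustrative assumptions; $\eta$ follows the $\mathcal{O}\big(\sqrt{B/(T\sum_j\phi_\xi(j))}\big)$ rule derived above.

```python
# A minimal sketch of mini-batch SGD, assuming (6) is the standard projected
# update w(t+1) = Pi_W(w(t) - eta * g_t); loss, stream and constants are toys.
import numpy as np

def minibatch_sgd(grad_F, stream, w0, T, B, eta, project):
    """grad_F(w, xi): per-sample gradient; stream: iterator over samples."""
    w = np.array(w0, dtype=float)
    for _ in range(T):
        batch = [next(stream) for _ in range(B)]              # mini-batch x_t
        g = np.mean([grad_F(w, xi) for xi in batch], axis=0)  # averaged gradient g_t
        w = project(w - eta * g)                              # projected SGD step
    return w

# Toy instance: F(w; xi) = 0.5 * ||w - xi||^2 over W = [0, 1]^d, so w* = E[xi].
d, T, B, r = 5, 500, 16, 2.0
phi_sum = sum(k ** (-1.0 / r) for k in range(1, B + 1))  # hypothetical phi(k) = k^(-1/r)
eta = np.sqrt(B / (T * phi_sum))                         # optimized step-size order
stream = iter(np.random.default_rng(1).uniform(size=(T * B, d)))
w_hat = minibatch_sgd(lambda w, xi: w - xi, stream, np.zeros(d), T, B, eta,
                      lambda w: np.clip(w, 0.0, 1.0))
print(np.round(w_hat, 3))  # should land close to 0.5 * ones(d)
```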
D EXPERIMENT SETUP

Recall that we consider the following convex quadratic optimization problem:

$$\min_{w\in\mathbb{R}^d}\;\mathbb{E}_{\xi\sim\mu}\,(w-\xi)^{\top}A(w-\xi),$$

where $A$ is a fixed positive semi-definite matrix and $\mu$ is the uniform distribution on $[0,1]^d$. A data stream admitting such a stationary distribution $\mu$ can be generated by a certain Metropolis-Hastings sampler provided in (Jarner & Roberts, 2002), described as follows.

**Step 1:** Let the "proposal" distribution $q$ have the density of $\mathrm{Beta}(r+1, 1)$; that is,

$$q(x) = \begin{cases}(r+1)x^r & x\in[0,1],\\ 0 & x\notin[0,1].\end{cases}$$

Define the acceptance probability $\alpha(x,y) = \min\big\{\frac{q(x)}{q(y)}, 1\big\}$.

**Step 2:** If the current state is $\xi_t$, sample $\zeta\sim q$ and define the next state $\xi_{t+1}$:

$$\xi_{t+1} = \begin{cases}\xi_t & \text{w.p. } 1-\alpha(\xi_t,\zeta),\\ \zeta & \text{w.p. } \alpha(\xi_t,\zeta).\end{cases}$$

**Step 3:** Go back to Step 2 to generate the next state.

We generate $d$ independent sequences starting from the same initial state $s_0 = 0$ to obtain a $d$-dimensional Markov chain. It has been shown that the generated Markov chain converges to $\mu$ in distribution at an algebraic rate $\phi_\xi(k) \le \mathcal{O}(k^{-1/r})$ (Proposition 5.2, Jarner & Roberts, 2002).

We consider the following bias term at the fixed point $w = 0_d$:

$$\text{(Bias):}\quad \big|\mathbb{E}\big[F(w; x_\tau)\,\big|\,x_0 = 0_d\big] - f(w)\big|.$$

It can be used to approximate the left-hand side of Lemma 4.1. Since $\mathbb{E}[F(w; x_\tau)\,|\,x_0 = 0_d]$ cannot be obtained in closed form, we use the Monte Carlo method to estimate this conditional expectation. That is, we generate $n = 10{,}000$ independent trajectories starting from $x_0 = 0_d$, and at step $\tau$ we estimate the expected value as $\frac{1}{n}\sum_{i=1}^{n}F(w; x_\tau^{(i)})$, where the superscript $(i)$ indicates that $x_\tau^{(i)}$ is sampled from the $i$-th trajectory. We then investigate the relation between the step $\tau$ and the mixing parameter $r$, and the relation between $\tau$ and the batch size $B$. All results are presented in Section 5.
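The following Python sketch implements the sampler and the Monte Carlo bias estimate described above. One caveat: with the $\mathrm{Beta}(r+1,1)$ proposal, $q(0) = 0$, so a chain sitting exactly at $0$ never moves under the acceptance ratio $q(x)/q(\zeta)$; the sketch therefore starts at a small positive state instead of exactly $0$. The initial state $x_0 = 0.1$, the placeholder $A = I_d$, the trajectory count, and the $\tau$ grid are all illustrative assumptions.

```python
# Sketch of the experiment pipeline: Metropolis-Hastings sampler with
# Beta(r+1, 1) proposal (Jarner & Roberts, 2002), run as d independent scalar
# chains, plus a Monte Carlo estimate of the bias term at w = 0_d.
import numpy as np

rng = np.random.default_rng(0)

def mh_state_at(r, tau, x0=0.1):
    """Run one scalar chain for tau steps and return the state at step tau."""
    state = x0
    for _ in range(tau):
        zeta = rng.beta(r + 1, 1)              # proposal zeta ~ q
        alpha = min((state / zeta) ** r, 1.0)  # q(x)/q(zeta) = (x/zeta)^r
        if rng.uniform() < alpha:
            state = zeta
    return state

def estimate_bias(r, d, tau, n_traj, A):
    """Monte Carlo estimate of |E[F(0; x_tau) | x_0 small] - f(0)|."""
    F_vals = []
    for _ in range(n_traj):
        xi = np.array([mh_state_at(r, tau) for _ in range(d)])
        F_vals.append(xi @ A @ xi)  # F(0; xi) = (0 - xi)^T A (0 - xi)
    # f(0) = E_{xi ~ U[0,1]^d} xi^T A xi = tr(A)/3 + (sum(A) - tr(A))/4
    f0 = np.trace(A) / 3 + (A.sum() - np.trace(A)) / 4
    return abs(np.mean(F_vals) - f0)

d = 3
A = np.eye(d)  # placeholder positive semi-definite matrix
for tau in [1, 10, 100, 1000]:
    print(f"tau = {tau:4d}: estimated bias = {estimate_bias(2.0, d, tau, 500, A):.4f}")
```

Because the chain escapes small states only slowly (at a polynomial rate governed by $r$), the printed bias decays slowly in $\tau$, mirroring the algebraic mixing rate $\phi_\xi(k) \le \mathcal{O}(k^{-1/r})$ studied in Section 5.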