|
## HOW TO IMPROVE SAMPLE COMPLEXITY OF SGD OVER HIGHLY DEPENDENT DATA?
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
Conventional machine learning applications typically assume that data samples are independently and identically distributed (i.i.d.). However, practical scenarios often involve a data-generating process that produces highly dependent data samples, which are known to heavily bias the stochastic optimization process and slow down the convergence of learning. In this paper, we conduct a fundamental study on how different structures of stochastic update schemes affect the sample complexity of stochastic gradient descent (SGD) over highly dependent data. Specifically, with a φ-mixing model of data dependence, we show that SGD with proper periodic data-subsampling achieves an improved sample complexity over the standard SGD in the full spectrum of the data dependence level. Interestingly, this acceleration holds even though subsampling discards a fraction of the data samples. Moreover, we show that mini-batch SGD can further substantially improve the sample complexity beyond SGD with periodic data-subsampling when the data is highly dependent. We also conduct numerical experiments to validate our theoretical results.
|
|
|
1 INTRODUCTION |
|
|
|
Stochastic optimization algorithms have attracted great attention in the past decade due to their successful applications to a broad range of research areas, including deep learning (Goodfellow et al., 2016), reinforcement learning (Sutton & Barto, 2018), online learning (Bottou, 2010; Hazan, 2017), control (Marti, 2017), etc. In the conventional analysis of stochastic optimization algorithms, it is usually assumed that all data samples are queried in an independently and identically distributed (i.i.d.) manner. For
|
example, data samples in the traditional empirical risk minimization framework are assumed to be |
|
queried independently from the underlying data distribution, while data samples in reinforcement |
|
learning are assumed to be queried from the stationary distribution of the underlying Markov chain. |
|
|
|
Although the i.i.d. data assumption leads to a comprehensive understanding of the statistical limit |
|
and computation complexity of SGD, it violates the nature of many practical data-generating |
|
stochastic processes, which generate highly correlated samples that depend on the history. In fact, |
|
dependent data can be found almost everywhere, e.g., daily stock price, weather/climate data, state |
|
transitions in Markov chains, etc. To understand the impact of data dependence on the convergence |
|
and complexity of stochastic algorithms, there is a growing number of recent works that introduce |
|
various definitions to quantify data dependence. Specifically, to analyze the finite-time convergence |
|
of various stochastic reinforcement learning algorithms, recent studies assume that the dependent |
|
samples queried from the Markov decision process satisfy a geometric mixing property (Dalal et al., |
|
2018; Zou et al., 2019; Xu & Gu, 2020; Qu & Wierman, 2020), which requires the underlying |
|
Markov chain to be uniformly ergodic or to have a finite mixing time (Even-Dar et al., 2003). On the
|
other hand, to analyze the convergence of stochastic optimization algorithms over dependent data, |
|
Karimi et al. (2019) assumed the existence of a solution to the Poisson equation associated with the |
|
underlying Markov chain, which is a weaker condition than the uniform ergodic condition (Glynn |
|
& Meyn, 1996). Moreover, Agarwal & Duchi (2012) introduced a φ-mixing property of the data-generating process that quantifies how fast the distribution of future data samples (conditioned on a
|
fixed filtration) converges to the underlying stationary data distribution. In particular, the φ-mixing |
|
property is more general than the previous two notions of data dependence (Douc et al., 2018). |
|
|
|
|
|
|
|
|
While the aforementioned works leveraged the above notions of data dependence to characterize the |
|
sample complexity of various standard stochastic algorithms over dependent data, there is still a lack of
|
theoretical understanding of how algorithm structure affects the sample complexity of stochastic |
|
algorithms under different levels of data dependence. In particular, a key algorithm structure is the |
|
stochastic update scheme, which critically affects the bias and variance of the stochastic optimization process. In fact, under i.i.d. data and convex geometry, it is well known that SGD achieves |
|
the sample complexity lower bound under various stochastic update schemes (Bottou, 2010), e.g., |
|
single-sample update and mini-batch update. However, these stochastic update schemes may lead |
|
to substantially different convergence behaviors over highly dependent data, as they are no longer |
|
unbiased. Therefore, it is of vital importance to understand the interplay among data dependence, |
|
the structure of the stochastic update, and the convergence rate of stochastic algorithms, which leads us to ask the
|
following fundamental question. |
|
|
|
- **Q:** How does the structure of stochastic updates affect the convergence rate and sample complexity of stochastic algorithms over dependent data?
|
|
|
In this paper, we provide comprehensive answers to the above fundamental question. Specifically, |
|
we conduct a comprehensive study of the convergence rate and sample complexity of the SGD |
|
algorithm over a wide spectrum of data dependence levels under various types of stochastic updates, |
|
including periodic subsampling and mini-batch sampling. Our results show that SGD with both |
|
stochastic updates achieves a substantially improved sample complexity over the standard SGD |
|
under highly dependent data. We summarize our contributions as follows. |
|
|
|
1.1 OUR CONTRIBUTIONS |
|
|
|
We consider the following standard stochastic optimization problem. |
|
|
|
min_{w∈W} f(w) := E_{ξ∼µ}[F(w; ξ)],        (P)
|
|
|
|
|
where the objective function f is convex and Lipschitz continuous, and the expectation is taken over |
|
the stationary distribution µ of the underlying data-generating process P. To perform stochastic |
|
optimization, we query a stream of dependent data samples from the underlying data-generating |
|
process. Specifically, we adopt the φ-mixing model to quantify the data dependence via a decaying |
|
mixing coefficient function φξ(k) (see Definition 2.2) (Agarwal & Duchi, 2012). We study the convergence of the stochastic gradient descent (SGD) algorithm over φ-mixing dependent data samples |
|
under various stochastic update schemes, including data subsampling and mini-batch sampling. |
|
|
|
We first study the convergence of SGD over φ-mixing dependent data samples under the data subsampling update scheme. In particular, the data subsampling update scheme utilizes only one data |
|
sample per r consecutive data samples by periodically skipping r − 1 samples. With this data subsampling scheme, the subsampled data samples are less dependent for a larger subsampling period |
|
r. Consequently, we show that SGD with a proper data subsampling period achieves an improved sample complexity over that of the standard SGD in the full spectrum of the convergence rate of the φ-mixing coefficient. In particular, the improvement is substantial when the data is highly dependent with an algebraically decaying φ-mixing coefficient.
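As a concrete illustration, the periodic data-subsampling scheme can be sketched as follows. This is a minimal hypothetical implementation: the AR(1) data stream, the quadratic loss mentioned below, and all function and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def ar1_stream(rho=0.9, sigma=0.5, seed=0):
    """A simple dependent data source: an AR(1) process whose consecutive
    samples are strongly correlated when rho is close to 1 (illustrative)."""
    rng = np.random.default_rng(seed)
    x = 0.0
    while True:
        x = rho * x + sigma * rng.standard_normal()
        yield x

def subsampled_sgd(grad, stream, w0, lr, r, num_updates):
    """SGD with periodic data subsampling: use one sample, then skip the
    next r - 1 (highly dependent) samples before the next update."""
    w = float(w0)
    for _ in range(num_updates):
        xi = next(stream)          # the sample used for this update
        for _ in range(r - 1):     # periodically discarded samples
            next(stream)
        w -= lr * grad(w, xi)
    return w
```

For instance, with the scalar loss F(w; ξ) = (w − ξ)²/2, whose gradient is w − ξ, the iterate drifts toward the stationary mean of the stream; a larger period r decorrelates the samples actually used, at the cost of discarding data.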
|
|
|
Moreover, we study the sample complexity of SGD over φ-mixing dependent data samples under |
|
the mini-batch update scheme. Compared to the data subsampling update, the mini-batch update can substantially reduce the data dependence within each mini-batch without skipping data samples. Specifically, the mini-batch update leverages the sample average over a mini-batch of data samples to reduce both the
|
bias (caused by the data dependence) and the optimization variance. Specifically, we show that SGD |
|
with mini-batch update achieves an orderwise lower sample complexity than both the standard SGD |
|
and the SGD with data subsampling in the full spectrum of the convergence rate of the φ-mixing |
|
coefficient. We summarize and compare the sample complexities of these stochastic algorithms |
|
under different φ-mixing data dependence models in Table 1. |
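The mini-batch update over a dependent stream can be sketched as below; a hypothetical minimal version (function and parameter names are ours, not the paper's), in which each update averages the gradients of a block of consecutive samples instead of skipping any:

```python
def minibatch_sgd(grad, stream, w0, lr, batch_size, num_updates):
    """Mini-batch SGD over a (possibly dependent) data stream: each update
    averages gradients over batch_size consecutive samples, which reduces
    both the dependence-induced bias and the variance without discarding
    any data."""
    w = float(w0)
    for _ in range(num_updates):
        g = sum(grad(w, next(stream)) for _ in range(batch_size))
        w -= lr * (g / batch_size)
    return w
```

The contrast with subsampling is the data budget: over T queried samples, subsampling performs T/r updates with one sample each, while mini-batch SGD performs T/B updates that each use all B samples in the block.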
|
|
|
1.2 RELATED WORK |
|
|
|
**Stochastic Algorithms over Dependent Data** Steinwart & Christmann (2009) and Modha & |
|
Masry (1996) established the convergence analysis of online stochastic algorithms for streaming |
|
|
|
|
|
|
|
|
Table 1: Comparison of sample complexities of SGD, SGD with data subsampling and Mini-batch |
|
SGD under different levels of data dependence for achieving f (w) − _f_ (w[∗]) ≤ _ϵ. Note that θ is a_ |
|
parameter of the convergence rate of the φ-mixing coefficient. |
|
|
|
| Data dependence level | φξ(k) | SGD | SGD w/ subsampling | Mini-batch SGD |
|---|---|---|---|---|
| Geometric φ-mixing (weakly dependent) | exp(−k^θ), θ > 0 | O(ϵ^{−2}(log ϵ^{−1})^{2/θ}) | O(ϵ^{−2}(log ϵ^{−1})^{1/θ}) | O(ϵ^{−2}) |
| Fast algebraic φ-mixing (medium dependent) | k^{−θ}, θ ≥ 1 | O(ϵ^{−2−2/θ}) | O(ϵ^{−2−1/θ}) | O(ϵ^{−2}) |
| Slow algebraic φ-mixing (highly dependent) | k^{−θ}, 0 < θ < 1 | O(ϵ^{−2−2/θ}) | O(ϵ^{−2−1/θ}) | Õ(ϵ^{−1−1/θ}) |
|
|
|
|
|
data with geometric ergodicity. Duchi et al. (2011) proved that the stochastic subgradient method |
|
has a strong convergence guarantee if the mixing time is uniformly bounded. Agarwal & Duchi (2012)
|
studied the convex/strongly convex stochastic optimization problem and proved high-probability |
|
convergence bounds for general stochastic algorithms under general stationary mixing processes. |
|
Godichon-Baggioni et al. (2021) provided the non-asymptotic analysis of stochastic algorithms with |
|
strongly convex objective function over streaming mini-batch data. In a more general setting, the |
|
stochastic approximation (SA) problem was studied in (Karimi et al., 2019) by assuming the existence of a solution to a Poisson equation. Recently, Debavelaere et al. (2021) developed an asymptotic convergence analysis of the SA problem under sub-geometrically mixing Markovian noise.
|
|
|
**Finite-time convergence of reinforcement learning** Recently, a series of works studied the finite-time convergence of many stochastic reinforcement learning algorithms over Markovian dependent
|
samples, including TD learning (Dalal et al., 2018; Xu et al., 2019; Kaledin et al., 2020), Q-learning |
|
(Qu & Wierman, 2020; Li et al., 2021; Melo et al., 2008; Chen et al., 2019; Xu & Gu, 2020), fitted Q-iteration (Mnih et al., 2013; 2015; Agarwal et al., 2021), actor-critic algorithms (Wang et al., 2019;
|
Yang et al., 2019; Kumar et al., 2019; Qiu et al., 2019; Wu et al., 2020; Xu et al., 2020), etc. In these |
|
studies, the dependent Markovian samples are assumed to satisfy the geometric φ-mixing property, |
|
which is satisfied when the underlying Markov chain is uniformly ergodic or time-homogeneous |
|
with finite states.
|
|
|
**Regret of Stochastic Convex Optimization** There are many known regret bounds for the online convex optimization problem. Hazan (2017) established the standard O(√T) regret bound for the online SGD algorithm under a bounded-gradient assumption. Xiao (2009) introduced the regret bound of the online dual averaging method. To the best of our knowledge, there is no high-probability regret bound for the mini-batch SGD algorithm that accounts for data dependence.
|
|
|
2 PROBLEM FORMULATION AND ASSUMPTIONS |
|
|
|
In this section, we introduce the problem formulation and some basic assumptions. Consider a |
|
model with parameters w. For any data sample ξ, denote F (w; ξ) ∈ R as the sample loss of this data |
|
sample under the model w. In this paper, we consider the following standard stochastic optimization |
|
problem that has broad applications in machine learning. |
|
|
|
min_{w∈W} f(w) := E_{ξ∼µ}[F(w; ξ)].        (P)
|
|
|
Here, the expectation is taken over the randomness of the data sample ξ, which is drawn from an
|
underlying distribution µ. In particular, we make the following standard assumptions regarding the |
|
problem (P) (Agarwal & Duchi, 2012). |
|
**Assumption 2.1. The stochastic optimization problem (P) satisfies** |
|
|
|
1. For every ξ, the function F(·; ξ) is G-Lipschitz continuous over W, i.e., for all w, v ∈ W, |F(w; ξ) − F(v; ξ)| ≤ G∥w − v∥.

2. The function f(·) is convex and bounded below, i.e., f(w*) := inf_{w∈W} f(w) > −∞.
|
|
|
|
|
|
|
|
3. W is a convex and compact set with bounded diameter R, i.e., sup_{w,v∈W} ∥w − v∥ ≤ R.
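Since W is assumed convex and compact, SGD-type updates for problem (P) typically include a Euclidean projection back onto W after each gradient step. A minimal sketch, taking W to be a Euclidean ball of diameter R (an illustrative choice; any convex compact set with a computable projection works):

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto the ball {v : ||v|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_sgd_step(w, g, lr, radius):
    """One projected-SGD step: gradient step, then projection back onto
    the compact feasible set W (here a ball of diameter R = 2 * radius)."""
    return project_ball(w - lr * g, radius)
```

The projection keeps every iterate feasible, so the bounded-diameter condition in Assumption 2.1 applies along the whole optimization trajectory.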
|
|
|
To solve this stochastic optimization problem, one often needs to query a set of data samples from |
|
the distribution µ to perform optimization. Unlike traditional stochastic optimization that usually |
|
assumes that the data samples are i.i.d., we consider a more general and practical dependent data-generating process, as we elaborate below.
|
|
|
**Dependent data-generating process:** We consider a stochastic process P that generates a stream of data samples {ξ1, ξ2, ...}, which are not necessarily independent. In particular, the stochastic process P has an underlying stationary distribution µ. To quantify the dependence of the data-generating process, we introduce the following standard φ-mixing model (Agarwal & Duchi, 2012), where we denote {Ft}t as the canonical filtration generated by {ξt}t.
|
**Definition 2.2 (φ-mixing process).** Consider a stochastic process {ξt}t with a stationary distribution µ. Let P(ξt+k ∈ ·|Ft) denote the distribution of the (t + k)-th sample conditioned on the filtration Ft, and let dTV denote the total variation distance. Then, the process {ξt}t is called φ-mixing if the following mixing coefficient φξ(·) converges to 0 as k tends to infinity:

φξ(k) := sup_{t∈N, A∈Ft} 2 dTV(P(ξt+k ∈ ·|A), µ).
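To make Definition 2.2 concrete, the φ-mixing coefficient can be computed exactly for a finite-state Markov chain, since the supremum over conditioning events is attained at single-state events. Below is a small illustrative example with a hypothetical two-state transition matrix, showing the geometric decay of φξ(k):

```python
import numpy as np

# Hypothetical two-state Markov chain: P[i, j] = Pr(next = j | current = i).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution mu: the normalized left eigenvector of P
# associated with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()

def phi_mixing(k):
    """phi(k) = max over starting states of 2 * dTV between the k-step
    conditional distribution P(xi_{t+k} in . | xi_t = i) and mu."""
    Pk = np.linalg.matrix_power(P, k)
    dtv = 0.5 * np.abs(Pk - mu).sum(axis=1)  # dTV from each starting state
    return 2.0 * dtv.max()

for k in (1, 2, 5, 10, 20):
    print(k, phi_mixing(k))  # decays geometrically (rate: 2nd eigenvalue of P)
```

This chain is geometrically φ-mixing in the sense of the "weakly dependent" row of Table 1; slowing the decay of φξ(k) to an algebraic rate is what the "highly dependent" regime studied in this paper models.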
|
|