|
# ON THE GLOBAL CONVERGENCE OF GRADIENT DESCENT FOR MULTI-LAYER RESNETS IN THE MEAN-FIELD REGIME
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
Finding the optimal configuration of parameters in ResNet is a nonconvex minimization problem, but first-order methods nevertheless find the global optimum in the overparameterized regime. We study this phenomenon with mean-field analysis, by translating the training process of ResNet to a gradient-flow partial differential equation (PDE) and examining the convergence properties of this limiting process. The activation function is assumed to be 2-homogeneous or partially 1-homogeneous; the regularized ReLU satisfies the latter condition. We show that if the ResNet is sufficiently large, with depth and width depending algebraically on the accuracy and confidence levels, first-order optimization methods can find global minimizers that fit the training data.
|
|
|
1 INTRODUCTION |
|
|
|
Training of multi-layer neural networks (NN) requires us to find weights in the network such that its outputs perfectly match the prescribed outputs for a given set of training data. The usual approach is to formulate this problem as a nonconvex minimization problem and solve it with a first-order optimization method based on gradient descent (GD). Extensive computational experience shows that in the overparametrized regime (where the total number of parameters in the NN far exceeds the minimum number required to fit the training data), GD methods, when run for sufficiently many iterations, consistently find a global minimum achieving the zero-loss property, that is, a perfect fit to the training data.
|
|
|
What is the mechanism that allows GD to perform so well on this large-scale nonconvex problem? |
|
|
|
Part of the explanation is that in the overparametrized case, the parameter space contains many global minima, and some evidence suggests that they are distributed throughout the space, making it easier for the optimization process to find one such solution. Many approaches have been taken to characterize this phenomenon more rigorously, including landscape analysis, the neural tangent kernel approach, and mean-field analysis. All such viewpoints aim to give an idea of the structure and size of the NN required to ensure global convergence.
|
|
|
Our approach in this paper is based on mean-field analysis and gradient-flow analysis, the latter being the continuous and mean-field limit of GD. We will examine residual neural networks (ResNets), and study how deep and wide a ResNet needs to be to match the data with high accuracy and high confidence. To relax the assumptions on the activation function as far as possible, we follow the setup in (Chizat & Bach, 2018), which requires this function to be either 2-homogeneous or partially 1-homogeneous. We show that both depth and width of the NN depend algebraically on ϵ and η, which are the accuracy and confidence levels, respectively.
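
For intuition only, and under the assumption that the residual block has the ReLU form used in Section 2 below (the precise homogeneity assumptions are stated in Section 4), the scaling behind these two terms can be sketched as follows:

```latex
% Illustration under an assumed ReLU block f(z,\theta) = u\,\sigma(w^{\top}z + b),
% with \theta = (u, w, b); this is not the general class treated in the paper.
% Scaling the whole parameter by \lambda > 0 exhibits 2-homogeneity in \theta:
\begin{align*}
  f(z, \lambda\theta)
    = \lambda u\,\sigma\bigl(\lambda w^{\top}z + \lambda b\bigr)
    = \lambda^{2}\, u\,\sigma\bigl(w^{\top}z + b\bigr)
    = \lambda^{2} f(z, \theta).
\end{align*}
% Scaling only u gives f(z, (\lambda u, w, b)) = \lambda f(z, \theta); this
% degree-1 scaling in part of the parameter survives when \sigma is replaced by
% a regularized (smoothed) ReLU, which is the flavor of partial 1-homogeneity
% referred to here.
```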
|
|
|
Mean-field analysis translates the training process of the ResNet to a gradient-flow partial differential equation (PDE). The training process evolves weights on connections between neurons. When dealing with wide neural networks, instead of tracing the evolution of each weight individually, one can record the evolution of the full distribution of the weight configuration. This perspective translates the coupled ordinary differential equation system (ODE) that characterizes the evolution of individual weights into a PDE (the gradient-flow equation) characterizing the evolution of the distribution. The parameters in the PDE naturally depend on the properties of the activation functions. Gradient-flow analysis is used to show that the PDE drives the solution to a point where the loss function becomes zero. We obtain our results on zero-loss training of ResNet with GD by translating the zero-loss property of the gradient-flow PDE back to the discrete-step setting.
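
As a schematic sketch of this translation (generic mean-field notation only: ρ_t below denotes a distribution over parameters at training time t and E(ρ) the loss viewed as a functional of that distribution, while the PDE actually derived later also carries a layer variable):

```latex
% Sketch only; see Sections 2 and 5 for the precise system used in the paper.
\begin{align*}
  \rho^{M}_{t} &= \frac{1}{M}\sum_{m=1}^{M}\delta_{\theta_{m}(t)}
     && \text{empirical distribution of $M$ weights evolving under GD,}\\[2pt]
  \partial_{t}\rho_{t} &= \nabla_{\theta}\cdot
     \Bigl(\rho_{t}\,\nabla_{\theta}\frac{\delta E}{\delta\rho}(\rho_{t})\Bigr)
     && \text{limiting gradient-flow (continuity) equation as $M \to \infty$.}
\end{align*}
```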
|
|
|
This proof strategy was used in an earlier paper (Ding et al., 2021), where multi-layer ResNets were also analyzed. The main difference in the current paper is that the assumptions on the activation function and on the initial training state required for global convergence are both much weaker. This paper adopts the setup from (Chizat & Bach, 2018), which imposes only minimal Lipschitz continuity requirements on the activation function. Furthermore, the paper (Ding et al., 2021) required a dense support condition to be satisfied by the final parameter configuration. This condition is hard to justify in any realistic setting, and is discarded in the current paper. Further details on these issues appear in Section 3.
|
|
|
We discuss the setup of the problem and formally derive the continuous and mean-field limits in Section 2. In Section 3, we discuss related work, identify our contribution, and present the main theorem in its general terms. After precise definitions and assumptions are specified in Section 4, we present the two main ingredients in the proof strategy. The mean-field limit is obtained by connecting the training process of the ResNet to a gradient-flow PDE in Section 5, and the zero-loss property of the limiting PDE is verified in Section 6. The main theorem is a direct corollary of Theorem 5.1 and Theorem 6.1 (or Theorem 6.2).
|
|
|
2 RESNET AND GRADIENT DESCENT |
|
|
|
The ResNet can be specified as follows: |
|
|
|
|
|
$$
z_{l+1}(x) = z_l(x) + \frac{1}{ML}\sum_{m=1}^{M} f\bigl(z_l(x),\, \theta_{l,m}\bigr), \qquad l = 0, 1, \ldots, L-1, \tag{1}
$$

where $M$ and $L$ are the width and depth, respectively; $z_0(x) = x \in \mathbb{R}^d$ is the input data; and $z_L(x)$ is the output from the last layer. The configuration of the NN is encoded in parameters $\Theta_{L,M} = \{\theta_{l,m}\}_{l=0,m=1}^{L-1,M}$, where each parameter $\theta_{l,m}$ is a vector in $\mathbb{R}^k$ and $f : \mathbb{R}^d \times \mathbb{R}^k \to \mathbb{R}^d$ is the activation function. The formulation (1) covers "conventional" ResNets, which have the specific form

$$
z_{l+1}(x) = z_l(x) + \frac{1}{ML}\sum_{m=1}^{M} U_{l,m}\, \sigma\bigl(W_{l,m}^{\top} z_l(x) + b_{l,m}\bigr), \qquad l = 0, 1, \ldots, L-1,
$$

where $W_{l,m}, U_{l,m} \in \mathbb{R}^d$, $b_{l,m} \in \mathbb{R}$, and $\sigma$ is the ReLU activation function. In this example, we have $\theta_{l,m} = (W_{l,m}, U_{l,m}, b_{l,m}) \in \mathbb{R}^k$, with $k = 2d + 1$.
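
As a concrete illustration of the update (1) in its conventional ReLU form, here is a minimal NumPy sketch of the forward pass; the function name `resnet_forward`, the array layout, and the random parameters are illustrative choices, not part of the paper.

```python
import numpy as np

def resnet_forward(x, W, U, b):
    """Forward pass of the conventional ResNet update above.

    x : (d,) input vector;
    W, U : (L, M, d) weight arrays and b : (L, M) biases, so that
    theta_{l,m} = (W[l, m], U[l, m], b[l, m]) lies in R^{2d+1}.
    """
    L, M, d = W.shape
    z = x.copy()
    for l in range(L):
        # z_{l+1} = z_l + (1/(M L)) * sum_m U_{l,m} * sigma(W_{l,m}^T z_l + b_{l,m})
        pre = W[l] @ z + b[l]             # (M,) pre-activations
        act = np.maximum(pre, 0.0)        # ReLU sigma
        z = z + (U[l].T @ act) / (M * L)  # residual update, sum over m scaled by 1/(M L)
    return z

# Illustrative usage with arbitrary sizes d = 3, L = 4, M = 8.
rng = np.random.default_rng(0)
d, L, M = 3, 4, 8
W = rng.normal(size=(L, M, d))
U = rng.normal(size=(L, M, d))
b = rng.normal(size=(L, M))
x0 = rng.normal(size=d)
z_out = resnet_forward(x0, W, U, b)
```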
|
|
|
Denote by $Z_{\Theta_{L,M}}(L; x)$ the output of the ResNet defined by (1). (This quantity is the same as $z_L(x)$ defined above, but we use this alternative notation to emphasize the dependence on the parameters $\Theta_{L,M}$.) The goal of training the ResNet is to seek parameters $\Theta_{L,M}$ that minimize the following mismatch or _loss function_:
|
|
|
$$
E(\Theta_{L,M}) = \mathbb{E}_{x \sim \mu}\, \frac{1}{2}\,\bigl| g\bigl(Z_{\Theta_{L,M}}(L; x)\bigr) - y(x) \bigr|^2, \tag{2}
$$
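
A corresponding sketch of how (2) could be estimated in practice, replacing the population expectation over $x \sim \mu$ by an average over training samples and reusing `resnet_forward` (and the arrays `W`, `U`, `b`, `rng`, `d`) from the sketch above; the output map `g` is only a placeholder here, since $g$ and the labels $y$ are specified later in the paper.

```python
def empirical_loss(X, Y, W, U, b, g=lambda z: z.sum()):
    """Monte-Carlo estimate of the loss (2): the population expectation over
    x ~ mu is replaced by the average of 0.5 * |g(Z_Theta(L; x_i)) - y_i|^2
    over N training samples. The map g (here a coordinate sum) is a placeholder.
    """
    total = 0.0
    for x, y in zip(X, Y):
        residual = g(resnet_forward(x, W, U, b)) - y
        total += 0.5 * float(np.sum(residual ** 2))
    return total / len(X)

# Example with the parameters from the previous sketch and scalar labels y(x).
X = rng.normal(size=(5, d))
Y = rng.normal(size=5)
print(empirical_loss(X, Y, W, U, b))
```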
|
|
|
|
|
|
|
|