|
# NEURAL NETWORKS AS KERNEL LEARNERS: THE SILENT ALIGNMENT EFFECT |
|
|
|
**Alexander Atanasov∗, Blake Bordelon∗ & Cengiz Pehlevan**
|
Harvard University |
|
Cambridge, MA 02138, USA |
|
{atanasov,blake_bordelon,cpehlevan}@g.harvard.edu
|
|
|
ABSTRACT |
|
|
|
Neural networks in the lazy training regime converge to kernel machines. Can |
|
neural networks in the rich feature learning regime learn a kernel machine with |
|
a data-dependent kernel? We demonstrate that this can indeed happen due to a |
|
phenomenon we term silent alignment, which requires that the tangent kernel of |
|
a network evolves in eigenstructure while small and before the loss appreciably |
|
decreases, and grows only in overall scale afterwards. We empirically show that |
|
such an effect takes place in homogeneous neural networks with small initialization
|
and whitened data. We provide an analytical treatment of this effect in the fully |
|
connected linear network case. In general, we find that the kernel develops a |
|
low-rank contribution in the early phase of training, and then evolves in overall |
|
scale, yielding a function equivalent to a kernel regression solution with the final |
|
network’s tangent kernel. The early spectral learning of the kernel depends on |
|
the depth. We also demonstrate that non-whitened data can weaken the silent |
|
alignment effect. |
|
|
|
1 INTRODUCTION |
|
|
|
Despite the numerous empirical successes of deep learning, much of the underlying theory remains |
|
poorly understood. One promising direction toward an interpretable account of deep learning is the study of the relationship between deep neural networks and kernel machines. Several
|
studies in recent years have shown that gradient flow on infinitely wide neural networks with a |
|
certain parameterization gives rise to linearized dynamics in parameter space (Lee et al., 2019; Liu |
|
et al., 2020) and consequently a kernel regression solution with a kernel known as the neural tangent |
|
kernel (NTK) in function space (Jacot et al., 2018; Arora et al., 2019). Kernel machines enjoy firmer |
|
theoretical footing than deep neural networks, which allows one to accurately study their training |
|
and generalization (Rasmussen & Williams, 2006; Schölkopf & Smola, 2002). Moreover, they share
|
many of the phenomena that overparameterized neural networks exhibit, such as interpolating the |
|
training data (Zhang et al., 2017; Liang & Rakhlin, 2018; Belkin et al., 2018). However, the exact |
|
equivalence between neural networks and kernel machines breaks for finite width networks. Further, |
|
the regime with approximately static kernel, also referred to as the lazy training regime (Chizat et al., |
|
2019), cannot account for the ability of deep networks to adapt their internal representations to the |
|
structure of the data, a phenomenon widely believed to be crucial to their success. |
|
|
|
In the present study, we pursue an alternative perspective on the NTK, and ask whether a neural network with an NTK that changes significantly during training can ever be a kernel machine for a *data-dependent kernel*: i.e., does there exist a kernel function $K$ for which the final neural network function $f$ satisfies $f(x) \approx \sum_{\mu=1}^{P} \alpha^\mu K(x, x^\mu)$ with coefficients $\alpha^\mu$ that depend only on the training data? We answer in the affirmative: a large class of neural networks at small initialization
|
trained on approximately whitened data are accurately approximated as kernel regression solutions |
|
with their final, data-dependent NTKs up to an error dependent on initialization scale. Hence, our |
|
results provide a further concrete link between kernel machines and deep learning which, unlike the |
|
infinite width limit, allows for the kernel to be shaped by the data. |
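As a point of reference for what such a kernel machine looks like operationally, here is a minimal sketch (NumPy; the linear stand-in kernel and all names are illustrative choices of ours, not the NTK construction used later) of prediction with coefficients $\alpha^\mu$ that depend only on the training data:

```python
import numpy as np

def kernel_regression_predict(K_train, y_train, k_test):
    """f(x) = sum_mu alpha^mu K(x, x^mu), with alpha depending only on the training data.

    K_train : (P, P) kernel matrix on the training set
    y_train : (P,)   training targets
    k_test  : (M, P) kernel between M test points and the P training points
    """
    alpha = np.linalg.pinv(K_train) @ y_train   # min-norm coefficients, robust to rank deficiency
    return k_test @ alpha

# Toy usage with a linear kernel K(x, x') = x . x' as a stand-in
rng = np.random.default_rng(0)
X, beta = rng.normal(size=(20, 5)), rng.normal(size=5)
y = X @ beta
X_test = rng.normal(size=(3, 5))
print(kernel_regression_predict(X @ X.T, y, X_test @ X.T))  # equals X_test @ beta here
```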
|
|
|
_∗These authors contributed equally._ |
|
|
|
|
|
|
|
|
The phenomenon we study consists of two training phases. In the first phase, the kernel starts off |
|
small in overall scale and quickly aligns its eigenvectors toward task-relevant directions. In the |
|
second phase, the kernel increases in overall scale, causing the network to learn a kernel regression |
|
solution with the final NTK. We call this phenomenon the silent alignment effect because the feature |
|
learning happens before the loss appreciably decreases. Our contributions are the following:
|
|
|
1. In Section 2, we demonstrate the silent alignment effect by considering a simplified model where |
|
the kernel evolves while small and then subsequently increases only in scale. We theoretically |
|
show that if these conditions are met, the final neural network is a kernel machine that uses the |
|
final, data-dependent NTK. A proof is provided in Appendix B. |
|
|
|
2. In Section 3, we provide an analysis of the NTK evolution of two-layer linear MLPs with a scalar target function and small initialization. If the input training data is whitened, the kernel aligns its
|
eigenvectors towards the direction of the optimal linear function early on during training while |
|
the loss does not decrease appreciably. After this, the kernel changes in scale only, showing this |
|
setup satisfies the requirements for silent alignment discussed in Section 2. |
|
|
|
3. In Section 4, we extend our analysis to deep MLPs by showing that the time required for alignment scales with the initialization scale in the same way as the time for the loss to decrease appreciably. Still,
|
these time scales can be sufficiently separated to lead to the silent alignment effect for which we |
|
provide empirical evidence. We further present an explicit formula for the final kernel in linear |
|
networks of any depth and width when trained from small initialization, showing that the final |
|
NTK aligns to task-relevant directions. |
|
|
|
4. In Section 5, we show empirically that the silent alignment phenomenon carries over to nonlinear |
|
networks trained with ReLU and Tanh activations on isotropic data, as well as linear and nonlinear networks with multiple output classes. For anisotropic data, we show that the NTK must |
|
necessarily change its eigenvectors when the loss is significantly decreasing, destroying the silent |
|
alignment phenomenon. In these cases, the final neural network output deviates from a kernel |
|
machine that uses the final NTK. |
|
|
|
1.1 RELATED WORKS |
|
|
|
Jacot et al. (2018) demonstrated that infinitely wide neural networks with an appropriate parameterization trained on mean square error loss evolve their predictions as a linear dynamical system |
|
with the NTK at initialization. A limitation of this kernel regime is that the neural network internal representations and the kernel function do not evolve during training. Conditions under which such lazy training can happen are studied further in (Chizat et al., 2019; Liu et al., 2020). Domingos
|
(2020) recently showed that every model, including neural networks, trained with gradient descent |
|
leads to a kernel model with a path kernel and coefficients $\alpha^\mu$ that depend on the test point $x$. This
|
dependence on x makes the construction not a kernel method in the traditional sense that we pursue |
|
here (see Remark 1 in (Domingos, 2020)). |
|
|
|
Phenomenological studies and models of kernel evolution have been recently invoked to gain insight |
|
into the difference between lazy and feature learning regimes of neural networks. These include |
|
analysis of NTK dynamics which revealed that the NTK in the feature learning regime aligns its |
|
eigenvectors to the labels throughout training, causing non-linear prediction dynamics (Fort et al., |
|
2020; Baratin et al., 2021; Shan & Bordelon, 2021; Woodworth et al., 2020; Chen et al., 2020; Geiger |
|
et al., 2021; Bai et al., 2020). Experiments have shown that lazy learning can be faster but less robust |
|
than feature learning (Flesch et al., 2021) and that the generalization advantage that feature learning |
|
provides to the final predictor is heavily task and architecture dependent (Lee et al., 2020). Fort et al. |
|
(2020) found that networks can undergo a rapid change of kernel early on in training after which |
|
the network’s output function is well-approximated by a kernel method with a data-dependent NTK. |
|
Our findings are consistent with these results. |
|
|
|
Stöger & Soltanolkotabi (2021) recently obtained similar multiple-phase training dynamics involving an early alignment phase followed by spectral learning and refinement phases in the setting of
|
low-rank matrix recovery. Their results share qualitative similarities with our analysis of deep linear |
|
networks. The second phase after alignment, where the kernel’s eigenspectrum grows, was studied |
|
in linear networks in (Jacot et al., 2021), where it is referred to as the saddle-to-saddle regime. |
|
|
|
|
|
|
|
|
Unlike prior works (Dyer & Gur-Ari, 2020; Aitken & Gur-Ari, 2020; Andreassen & Dyer, 2020), |
|
our results do not rely on perturbative expansions in network width. Also unlike the work of Saxe |
|
et al. (2014), our solutions for the evolution of the kernel do not depend on choosing a specific set of |
|
initial conditions, but rather follow only from assumptions of small initialization and whitened data. |
|
|
|
2 THE SILENT ALIGNMENT EFFECT AND APPROXIMATE KERNEL SOLUTION |
|
|
|
Neural networks in the overparameterized regime can find many interpolators: the precise function |
|
that the network converges to is controlled by the time evolution of the NTK. As a concrete example, |
|
we will consider learning a scalar target function with mean square error loss through gradient flow. |
|
Let $x \in \mathbb{R}^D$ represent an arbitrary input to the network $f(x)$ and let $\{x^\mu, y^\mu\}_{\mu=1}^P$ be a supervised learning training set. Under gradient flow the parameters $\theta$ of the neural network will evolve, so the output function is time-dependent and we write it as $f(x, t)$. The evolution of the predictions of the network on a test point can be written in terms of the NTK $K(x, x', t) = \frac{\partial f(x,t)}{\partial \theta} \cdot \frac{\partial f(x',t)}{\partial \theta}$ as

$$\frac{d}{dt} f(x, t) = \eta \sum_{\mu=1}^{P} K(x, x^\mu, t)\,\big(y^\mu - f(x^\mu, t)\big), \qquad (1)$$

where $\eta$ is the learning rate. If one had access to the dynamics of $K(x, x^\mu, t)$ throughout all $t$, one could solve for the final learned function $f^*$ with integrating factors, under conditions discussed in Appendix A:
|
|
|
|
|
**_Kt[′] dt[′]_** (y[ν] _f0(x[ν])) ._ (2) |
|
|
|
_µν_ _−_ |
|
|
|
|
|
|
|
|
|
_∞_ |
|
|
|
0 |
|
|
|
Z |
|
|
|
|
|
_t_ |
|
_dt kt(x)[µ]_ exp _η_ |
|
_−_ 0 |
|
Z |
|
|
|
|
|
_f_ (x) = f0(x) + |
|
|
|
_[∗]_ |
|
|
|
|
|
_µν_ |
|
|
|
|
|
Here, $k_t(x)^\mu = K(x, x^\mu, t)$, $[K_t]_{\mu\nu} = K(x^\mu, x^\nu, t)$, and $y^\mu - f_0(x^\mu)$ is the initial error on point $x^\mu$. We see that the final function has contributions throughout the full training interval $t \in (0, \infty)$. The seminal work by Jacot et al. (2018) considers an infinite-width limit of neural networks, where the kernel function $K_t(x, x')$ stays constant throughout training time. In this setting, where the kernel is constant and $f_0(x^\mu) \approx 0$, we obtain a true kernel regression solution $f(x) = \sum_{\mu,\nu} k(x)^\mu K^{-1}_{\mu\nu} y^\nu$ for a kernel $K(x, x')$ which does not depend on the training data.
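As a quick numerical illustration of this constant-kernel special case (a sketch of our own with synthetic data; the Euler discretization, step size, and kernel choice are illustrative assumptions, not taken from the paper), iterating the discretized form of Equation 1 on the training set drives the predictions to the interpolating kernel regression solution:

```python
import numpy as np

rng = np.random.default_rng(0)
P, D, eta = 10, 30, 0.05
X = rng.normal(size=(P, D))
y = X @ rng.normal(size=D)

K = X @ X.T / D            # a fixed (lazy) kernel evaluated on the training set
f = np.zeros(P)            # network predictions on the training points, f_0 ~ 0

# Euler-discretized Eq. (1) restricted to the training set: d/dt f = eta * K (y - f)
for _ in range(20000):
    f += eta * K @ (y - f)

# With a constant kernel and f_0 ~ 0, training converges to the kernel
# regression (interpolating) solution on the training set, i.e. f -> y.
print(np.max(np.abs(f - y)))   # effectively zero
```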
|
Much less is known about what happens in the rich, feature learning regime of neural networks, where the kernel evolves significantly over time in a data-dependent manner. In this paper, we consider a setting where the initial kernel is small in scale, aligns its eigenfunctions early on during gradient descent, and then increases only in scale monotonically. As a concrete phenomenological model, consider depth-$L$ networks with homogeneous activation functions and weights initialized with variance $\sigma^2$. At initialization, $K_0(x, x') \sim O(\sigma^{2L-2})$ and $f_0(x) \sim O(\sigma^L)$ (see Appendix B). We further assume that after time $\tau$, the kernel only evolves in scale in a constant direction:
|
|
|
$$K(x, x', t) = \begin{cases} \sigma^{2L-2}\, \tilde{K}(x, x', t) & t \le \tau \\ g(t)\, K_\infty(x, x') & t > \tau, \end{cases} \qquad (3)$$

where $\tilde{K}(x, x', t)$ evolves from an initial kernel at time $t = 0$ to $K_\infty(x, x')$ by $t = \tau$, and $g(t)$ increases monotonically from $\sigma^{2L-2}$ to $1$. In this model, one also obtains a kernel regression solution in the limit $\sigma \to 0$, with the final rather than the initial kernel: $f(x) = k_\infty(x) \cdot K_\infty^{-1} y + O(\sigma^L)$. We provide a proof of this in Appendix B.
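To see why scale-only kernel growth leaves the final function in this kernel-regression form, the following sketch (entirely our own construction: the quadratic stand-in kernel, the sigmoidal $g(t)$, and all constants are illustrative) integrates Equation 1 with a kernel $g(t)K_\infty$ and compares the result on held-out points against regression with $K_\infty$:

```python
import numpy as np

rng = np.random.default_rng(1)
P, M, D, eta, dt = 20, 5, 20, 0.1, 0.05
X, X_test = rng.normal(size=(P, D)), rng.normal(size=(M, D))
y = X @ rng.normal(size=D)

def k_inf(A, B):
    # A fixed "final" kernel; only its overall scale is modulated by g(t).
    return (A @ B.T / D + 1.0) ** 2

K, k_te = k_inf(X, X), k_inf(X_test, X)
g = lambda t: 1.0 / (1.0 + np.exp(-(t - 5.0)))   # scale factor that grows after "alignment"

f_tr, f_te = np.zeros(P), np.zeros(M)
for step in range(40000):
    err = y - f_tr
    scale = eta * dt * g(step * dt)
    f_tr += scale * K @ err        # Eq. (1) on the training set
    f_te += scale * k_te @ err     # Eq. (1) on held-out points

f_reg = k_te @ np.linalg.solve(K, y)   # kernel regression with the final kernel
print(np.max(np.abs(f_te - f_reg)))    # ~0: any monotone scale profile gives the same function
```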
|
|
|
The assumption that the kernel evolves early on in gradient descent before increasing only in scale |
|
may seem overly strict as a model of kernel evolution. However, we analytically show in Sections 3 |
|
and 4 that this can happen in deep linear networks initialized with small weights, and consequently |
|
that the final learned function is a kernel regression with the final NTK. Moreover, we show that for |
|
a linear network with small weight initialization, the final NTK depends on the training data in a |
|
universal and predictable way. |
|
|
|
We show empirically that our results carry over to nonlinear networks with ReLU and tanh activations under the condition that the data is whitened. For example, see Figure 1, where we show the |
|
silent alignment effect on ReLU networks with whitened MNIST and CIFAR-10 images. We define |
|
alignment as the overlap between the kernel and the target function, $\frac{y^\top K y}{\|K\|_F\, |y|^2}$, where $y \in \mathbb{R}^P$ is a vector of the target values, quantifying the projection of the labels onto the kernel, as discussed in (Cortes et al., 2012).
|
|
|
|
|
[Figure 1 panels: (a) Whitened Data MLP Dynamics, (b) Prediction MNIST, (c) Prediction CIFAR-10, (d) Wide Res-Net Dynamics, (e) Prediction Res-Net. See caption below.]
|
|
|
|
|
Figure 1: A demonstration of the Silent Alignment effect. (a) We trained a 2-layer ReLU MLP on $P = 1000$ MNIST images of handwritten 0's and 1's which were whitened. Early in training, around $t \approx 50$, the NTK aligns to the target function and stays fixed (green). The kernel's overall scale (orange) and the loss (blue) begin to move at around $t = 300$. The analytic solution for the maximal final alignment value in linear networks is overlaid (dashed green); see Appendix E.2. (b) We compare the predictions of the NTK and the trained network on MNIST test points. Due to silent alignment, the final learned function is well described as a kernel regression solution with the final NTK $K_\infty$. However, regression with the initial NTK is not a good model of the network's predictions. (c) The same experiment on $P = 1000$ whitened CIFAR-10 images from the first two classes. Here we use MSE loss on a width-100 network with initialization scale $\sigma = 0.1$. (d) Wide-ResNet with width multiplier $k = 4$ and block size $b = 1$ trained with $P = 100$ training points from the first two classes of CIFAR-10. The dashed orange line marks when the kernel starts growing significantly, by which point the alignment has already finished. (e) Predictions of the final NTK are strongly correlated with the final NN function.
|
|
|
This quantity increases early in training but quickly stabilizes around its asymptotic value before the loss decreases. Though Equation 2 was derived under the assumption of gradient flow with a constant learning rate, the underlying conclusions can hold in more realistic settings as well. In Figure 1 (d) and (e) we show learning dynamics and network predictions for a Wide-ResNet (Zagoruyko & Komodakis, 2017) on whitened CIFAR-10 trained with the Adam optimizer (Kingma & Ba, 2014) with learning rate $10^{-5}$, which exhibits silent alignment and strong
|
correlation with the final NTK predictor. In the unwhitened setting, this effect is partially degraded, |
|
as we discuss in Section 5 and Appendix J. Our results suggest that the final NTK may be useful for |
|
analyzing generalization and transfer as we discuss for the linear case in Appendix F. |
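For reference, the alignment measure used throughout these figures can be computed directly from a kernel matrix and a label vector; this is our own minimal implementation of the quantity defined above (all names are illustrative):

```python
import numpy as np

def kernel_target_alignment(K, y):
    """y^T K y / (||K||_F |y|^2): overlap between a kernel matrix and the labels."""
    return float(y @ K @ y) / (np.linalg.norm(K, 'fro') * float(y @ y))

# Example: alignment of an isotropic linear kernel with linear labels
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10)
print(kernel_target_alignment(X @ X.T, y))
```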
|
|
|
3 KERNEL EVOLUTION IN 2 LAYER LINEAR NETWORKS |
|
|
|
We will first study shallow linear networks trained with small initialization before providing analysis |
|
for deeper networks in Section 4. We will focus our discussion in this section on the scalar output |
|
case but we will provide similar analysis in the multiple output channel case in a subsequent section. |
|
We demonstrate that our analytic solutions match empirical simulations in Appendix C.5. |
|
|
|
We assume the $P$ data points $x^\mu \in \mathbb{R}^D$, $\mu = 1, \ldots, P$, have zero mean with correlation matrix $\Sigma = \frac{1}{P}\sum_{\mu=1}^{P} x^\mu x^{\mu\top}$. Further, we assume that the target values are generated by a linear teacher function $y^\mu = s\,\beta_T \cdot x^\mu$ for a unit vector $\beta_T$. The scalar $s$ merely quantifies the size of the supervised learning signal: the variance of the targets is $|y|^2 = s^2 \beta_T^\top \Sigma \beta_T$.
|
|
|
|
|
[Figure 2 panels: (a) Initialization, (b) Phase 1, (c) Phase 2. See caption below.]
|
|
|
|
|
Figure 2: The evolution of the kernel's eigenfunctions happens during the early alignment phase for $t_1 \approx \frac{1}{s}$, but significant evolution in the network predictions happens for $t > t_2 = \frac{1}{2}\log(s\sigma^{-2})$. (a) Contour plot of the kernel's norm for linear functions $f(x) = \beta \cdot x$. The black line represents the space of weights which interpolate the training set, i.e. $X^\top \beta = y$. At initialization, the kernel is isotropic, resulting in spherically symmetric level sets of RKHS norm. The network function is represented as a blue dot. (b) During Phase I, the kernel's eigenfunctions have evolved, enhancing power in the direction of the min-norm interpolator, but the network function has not moved far from the origin. (c) In Phase II, the network function $W^\top a$ moves from the origin to the final solution.
|
|
|
We define the two-layer linear neural network with $N$ hidden units as $f(x) = a^\top W x$. Concretely, we initialize the weights with the standard parameterization $a_i \sim \mathcal{N}(0, \sigma^2/N)$, $W_{ij} \sim \mathcal{N}(0, \sigma^2/D)$. Understanding the role of $\sigma$ in the dynamics will be crucial to our study. We analyze gradient flow dynamics on the MSE cost $\mathcal{L} = \frac{1}{2P}\sum_\mu \left(f(x^\mu) - y^\mu\right)^2$.

Under gradient flow with learning rate $\eta = 1$, the weight matrices in each layer evolve as

$$\frac{d}{dt} a = -\frac{\partial \mathcal{L}}{\partial a} = W \Sigma \left(s\beta_T - W^\top a\right), \qquad \frac{d}{dt} W = -\frac{\partial \mathcal{L}}{\partial W} = a \left(s\beta_T - W^\top a\right)^\top \Sigma. \qquad (4)$$

The NTK takes the following form throughout training:

$$K(x, x'; t) = x^\top W^\top W x' + |a|^2\, x^\top x'. \qquad (5)$$

Note that while the second term, a simple isotropic linear kernel, does not reflect the nature of the learning task, the first term $x^\top W^\top W x'$ can evolve to yield an anisotropic kernel that has learned a representation from the data.
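The two training phases analyzed next can be reproduced by Euler-discretizing Equation 4 and tracking the kernel of Equation 5; the sketch below is our own minimal simulation (synthetic whitened data, small $\sigma$; step size and constants are arbitrary choices), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, P, s, sigma, dt, steps = 20, 50, 20, 1.0, 1e-3, 0.01, 3000

# Whitened training data (Sigma = (1/P) X^T X = I) and a linear teacher y = s X beta_T.
O, _ = np.linalg.qr(rng.normal(size=(D, D)))
X = np.sqrt(P) * O
beta_T = rng.normal(size=D); beta_T /= np.linalg.norm(beta_T)
y = s * X @ beta_T
Sigma = X.T @ X / P

a = sigma * rng.normal(size=N) / np.sqrt(N)
W = sigma * rng.normal(size=(N, D)) / np.sqrt(D)

def ntk(X):   # Eq. (5) evaluated on the training set
    return X @ W.T @ W @ X.T + (a @ a) * X @ X.T

def alignment(K, y):
    return y @ K @ y / (np.linalg.norm(K) * (y @ y))

for step in range(steps):
    err = s * beta_T - W.T @ a                 # residual of the effective weights
    if step % 200 == 0:
        print(f"t={step*dt:5.1f}  loss={0.5 * err @ Sigma @ err:.4f}  "
              f"alignment={alignment(ntk(X), y):.3f}")
    a, W = a + dt * (W @ Sigma @ err), W + dt * np.outer(a, err) @ Sigma   # Eq. (4)

# The alignment saturates by t ~ a few / s, while the loss only drops around
# t ~ (1/s) log(s / sigma^2) ~ 14: the silent alignment effect.
```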
|
|
|
|
|
3.1 PHASES OF TRAINING IN TWO LAYER LINEAR NETWORK |
|
|
|
We next show that there are essentially two phases when training a two-layer linear network from small initialization on whitened input data.

- Phase I: An alignment phase which occurs for $t \sim \frac{1}{s}$. In this phase the weights align to their low rank structure and the kernel picks up a rank-one term of the form $x^\top \beta\beta^\top x'$. In this setting, since the network is initialized near $W, a = 0$, which is a saddle point of the loss function, the gradient of the loss is small. Consequently, the magnitudes of the weights and kernel evolve slowly.

- Phase II: A data-fitting phase which begins around $t \sim \frac{1}{s}\log(s\sigma^{-2})$. In this phase, the system escapes the initial saddle point $W, a = 0$ and the loss decreases to zero. During this phase both the kernel's overall scale and the scale of the function $f(x, t)$ increase substantially.
|
If Phase I and Phase II are well separated in time, which can be guaranteed by making σ small, |
|
then the final function solves a kernel interpolation problem for the NTK which is only sensitive |
|
to the geometry of gradients in the final basin of attraction. In fact, in the linear case, the kernel |
|
interpolation at every point along the gradient descent trajectory would give the final solution as we |
|
show in Appendix G. A visual summary of these phases is provided in Figure 2. |
|
|
|
3.1.1 PHASE I: EARLY ALIGNMENT FOR SMALL INITIALIZATION |
|
|
|
In this section we show how the kernel aligns to the correct eigenspace early in training. We focus |
|
on the whitened setting, where the data matrix X has all of its nonzero singular values equal. We let |
|
|
|
|
|
|
|
|
$\beta$ represent the normalized component of $\beta_T$ in the span of the training data $\{x^\mu\}$. We will discuss general $\Sigma$ in Section 3.2. We approximate the dynamics early in training by recognizing that the network output is small due to the small initialization. Early on, the dynamics are given by:

$$\frac{d}{dt} a = s\, W\beta + O(\sigma^3), \qquad \frac{d}{dt} W = s\, a\beta^\top + O(\sigma^3). \qquad (6)$$

Truncating terms of order $\sigma^3$ and higher, we can solve for the kernel's dynamics early in training:

$$K(x, x'; t) = q_0 \cosh(2\eta s t)\; x^\top \left[\beta\beta^\top + I\right] x' + O(\sigma^2), \qquad t \ll s^{-1}\log(s/\sigma^2), \qquad (7)$$

where $q_0$ is an initialization-dependent quantity; see Appendix C.1. The bound on the error is obtained in Appendix C.2. We see that the kernel picks up a rank-one correction $\beta\beta^\top$ which points in the direction of the task vector $\beta$, indicating that the kernel evolves in a direction sensitive to the target function $y = s\beta_T \cdot x$. This term grows exponentially during the early stages of training, and overwhelms the original kernel $K_0$ with timescale $1/s$. Though the neural network has not yet achieved low loss in this phase, the alignment of the kernel and learned representation has consequences for the transfer ability of the network on correlated tasks, as we show in Appendix F.
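For completeness, the linearized Phase I calculation behind Equations 6 and 7 can be summarized as follows (a compressed sketch assuming $\Sigma = I$, $\eta = 1$, $|\beta| = 1$; the precise initialization-dependent constant $q_0$ is worked out in Appendix C.1). Since $\frac{d}{dt}(W\beta) = s\,a$ under Equation 6, the pair $(a, W\beta)$ obeys a linear system with solution

$$\begin{pmatrix} a(t) \\ W(t)\beta \end{pmatrix} = \begin{pmatrix} \cosh(st) & \sinh(st) \\ \sinh(st) & \cosh(st) \end{pmatrix}\begin{pmatrix} a_0 \\ W_0\beta \end{pmatrix}, \qquad |a(t)|^2 + |W(t)\beta|^2 = \left(|a_0|^2 + |W_0\beta|^2\right)\cosh(2st) + 2\,(a_0 \cdot W_0\beta)\,\sinh(2st),$$

so both $|a(t)|^2$ (the coefficient of the isotropic piece of Equation 5) and $|W(t)\beta|^2$ (which controls the growing $\beta\beta^\top$ piece) grow $\propto \cosh(2st)$ to leading order with an initialization-dependent prefactor, while the components of $W$ orthogonal to $\beta$ remain at their $O(\sigma)$ initial values; this is the content of Equation 7.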
|
|
|
3.1.2 PHASE II: SPECTRAL LEARNING |
|
|
|
We now assume that the weights have approached their low rank structure, as predicted from the |
|
previous analysis of Phase I dynamics, and study the subsequent NTK evolution. We will show that, |
|
under the assumption of whitening, the kernel only evolves in overall scale. |
|
|
|
First, following (Fukumizu, 1998; Arora et al., 2018; Du et al., 2018), we note the following conservation law, which holds for all time: $\frac{d}{dt}\left[a(t)a(t)^\top - W(t)W(t)^\top\right] = 0$. If we assume a small initial weight variance $\sigma^2$, then $aa^\top - WW^\top = O(\sigma^2) \approx 0$ at initialization, and it stays that way during training due to the conservation law. This condition is surprisingly informative, since it indicates that $W$ is rank-one up to $O(\sigma)$ corrections. From the analysis of the alignment phase, we also have that $W^\top W \propto \beta\beta^\top$. These two observations uniquely determine the rank-one structure of $W$ to be $a\beta^\top + O(\sigma)$. Thus, from Equation 5 it follows that in Phase II the kernel evolution takes the form

$$K(x, x'; t) = u(t)^2\, x^\top \left[\beta\beta^\top + I\right] x' + O(\sigma), \qquad (8)$$

where $u(t)^2 = |a|^2$. This demonstrates that the kernel only changes in overall scale during Phase II.

Once the weights are aligned with this scheme, we can obtain an expression for the evolution of $u(t)^2$ analytically, $u(t)^2 = s e^{2st}\left(e^{2st} - 1 + s/u_0^2\right)^{-1}$, using the results of (Fukumizu, 1998; Saxe et al., 2014), as we discuss in Appendix C.4. This is a sigmoidal curve which starts at $u_0^2$ and approaches $s$. The transition time where active learning begins occurs when $e^{st} \approx s/u_0^2 \;\Rightarrow\; t \approx s^{-1}\log(s/\sigma^2)$. This analysis demonstrates that the kernel only evolves in scale during this second phase of training, from the small initial value $u_0^2 \sim O(\sigma^2)$ to its asymptote.
|
|
|
Hence, kernel evolution in this scenario is equivalent to the assumptions discussed in Section 2, with $g(t) = u(t)^2$, showing that the final solution is well approximated by kernel regression with the final NTK. We stress that the timescale for the first phase, $t_1 \sim 1/s$, where eigenvectors evolve, is independent of the scale of the initialization $\sigma^2$, whereas the second phase occurs around $t_2 \approx t_1 \log(s/\sigma^2)$. This separation of timescales $t_1 \ll t_2$ for small $\sigma$ guarantees the silent alignment effect. We illustrate these learning curves for varying $\sigma$ in Figure C.2.
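As a quick consistency check on the sigmoidal form above (our own numerical check; the ODE $\frac{d}{dt}u^2 = 2u^2(s - u^2)$ follows from Equation 4 once $W = a\beta^\top$ and $\Sigma = I$):

```python
import numpy as np

s, sigma, dt, T = 1.0, 1e-2, 1e-3, 20.0
ts = np.arange(0.0, T, dt)

u2_numeric = np.empty_like(ts)
u2 = sigma ** 2                               # u_0^2 ~ O(sigma^2)
for i in range(len(ts)):
    u2_numeric[i] = u2
    u2 += dt * 2.0 * u2 * (s - u2)            # d(u^2)/dt = 2 u^2 (s - u^2)

u2_closed = s * np.exp(2 * s * ts) / (np.exp(2 * s * ts) - 1.0 + s / sigma ** 2)
print(np.max(np.abs(u2_numeric - u2_closed)))  # Euler discretization error, ~O(dt)
print(np.log(s / sigma ** 2) / s)              # transition time t ~ s^-1 log(s/sigma^2) ~ 9.2
```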
|
|
|
3.2 UNWHITENED DATA |
|
|
|
|
|
When data is unwhitened, the right singular vector of $W$ aligns with $\Sigma\beta$ early in training, as we show in Appendix C.3. This happens since, early on, the dynamics for the first layer are $\frac{d}{dt} W \sim a(t)\beta^\top\Sigma$. Thus the early-time kernel will have a rank-one spike in the $\Sigma\beta$ direction. However, this configuration is not stable as the network outputs grow. In fact, at late times $W$ must realign to converge to $W \propto a\beta^\top$, since the network function converges to the optimum $f = a^\top W x = s\beta \cdot x$, which is the minimum $\ell_2$-norm solution (Appendix G.1). Thus, the final kernel will always look like $K_\infty(x, x') = s\, x^\top\left[\beta\beta^\top + I\right] x'$. However, since the realignment of $W$'s singular vectors happens during the Phase II spectral learning, the kernel is not constant up to overall scale, violating the conditions for silent alignment. We note that the learned function is still a kernel regression solution of the final NTK, which is a peculiarity of the linear network case, but this is not achieved through the silent alignment phenomenon, as we explain in Appendix C.3.
|
|
|
|
|
|
|
|
4 EXTENSION TO DEEP LINEAR NETWORKS |
|
|
|
We next consider scalar target functions approximated by deep linear neural networks and show |
|
that many of the insights from the two layer network carry over. The neural network function |
|
$f : \mathbb{R}^D \to \mathbb{R}$ takes the form $f(x) = w^{L\top} W^{L-1} \cdots W^1 x$. The gradient flow dynamics under mean squared error (MSE) loss become

$$\frac{d}{dt} W^\ell = -\eta \frac{\partial \mathcal{L}}{\partial W^\ell} = \eta \left(\prod_{\ell' > \ell} W^{\ell'}\right)^{\!\top} \left(s\beta - \tilde{w}\right)^\top \Sigma \left(\prod_{\ell' < \ell} W^{\ell'}\right)^{\!\top}, \qquad (9)$$

where $\tilde{w} = W^{1\top} W^{2\top} \cdots w^L \in \mathbb{R}^D$ is shorthand for the effective one-layer linear network weights. Inspired by observations made in prior works (Fukumizu, 1998; Arora et al., 2018; Du et al., 2018), we again note that the following set of conservation laws holds during the dynamics of gradient descent: $\frac{d}{dt}\left[W^\ell W^{\ell\top} - W^{\ell+1\top} W^{\ell+1}\right] = 0$. This condition indicates a balance in the size of weight updates in adjacent layers and simplifies the analysis of linear networks. This balancing condition between weights of adjacent layers is not specific to MSE loss, but will also hold for any loss function; see Appendix D. We will use this condition to characterize the NTK's evolution.
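This balancing condition is easy to check numerically. The sketch below (a depth-3 scalar-output linear network on synthetic data; all sizes, step sizes, and the random seed are our own illustrative choices) takes Euler gradient steps on the MSE loss and verifies that $W^\ell W^{\ell\top} - W^{\ell+1\top} W^{\ell+1}$ remains at its $O(\sigma^2)$ initial size while the loss decreases:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, P, sigma, dt, steps = 10, 15, 40, 0.3, 3e-4, 70000
X = rng.normal(size=(P, D))
y = X @ rng.normal(size=D)

W1 = sigma * rng.normal(size=(N, D)) / np.sqrt(D)
W2 = sigma * rng.normal(size=(N, N)) / np.sqrt(N)
w3 = sigma * rng.normal(size=N) / np.sqrt(N)

def balance_gap():
    g1 = np.abs(W1 @ W1.T - W2.T @ W2).max()
    g2 = np.abs(W2 @ W2.T - np.outer(w3, w3)).max()
    return max(g1, g2)

def loss():
    return 0.5 * np.mean((X @ W1.T @ W2.T @ w3 - y) ** 2)

print("initial:  loss =", loss(), " gap =", balance_gap())
for _ in range(steps):
    err = (X @ W1.T @ W2.T @ w3 - y) / P       # averaged residuals
    g_w3 = W2 @ W1 @ X.T @ err                 # MSE gradients for f(x) = w3^T W2 W1 x
    g_W2 = np.outer(w3, W1 @ X.T @ err)
    g_W1 = np.outer(W2.T @ w3, X.T @ err)
    w3, W2, W1 = w3 - dt * g_w3, W2 - dt * g_W2, W1 - dt * g_W1
print("trained:  loss =", loss(), " gap =", balance_gap())  # loss drops; gap stays ~sigma^2
```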
|
|
|
4.1 NTK UNDER SMALL INITIALIZATION |
|
|
|
We now consider the effects of small initialization. When the initial weight variance $\sigma^2$ is sufficiently small, $W^\ell W^{\ell\top} - W^{\ell+1\top} W^{\ell+1} = O(\sigma^2) \approx 0$ at initialization.[1] This conservation law implies that these matrices remain approximately equal throughout training. Performing an SVD on each matrix and inductively using the above formula from the last layer to the first, we find that all matrices will be approximately rank-one: $w^L = u(t)\, r_L(t)$, $W^\ell = u(t)\, r_{\ell+1}(t) r_\ell(t)^\top$, where the $r_\ell(t)$ are unit vectors. Using only this balancing condition and expanding to leading order in $\sigma$, we find that the NTK's dynamics look like

$$K(x, x', t) = u(t)^{2(L-1)}\, x^\top \left[(L-1)\, r_1(t) r_1(t)^\top + I\right] x' + O(\sigma). \qquad (10)$$

We derive this formula in Appendix E. We observe that the NTK consists of a rank-one correction to the isotropic linear kernel $x \cdot x'$, with the rank-one spike pointing along the $r_1(t)$ direction. This is true dynamically throughout training under the assumption of small $\sigma$. At convergence $r_1(t) \to \beta$, which is the unique fixed point reachable through gradient descent. We discuss the evolution of $u(t)$ below. The alignment of the NTK with the direction $\beta$ increases with depth $L$.
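To make the data dependence of this final kernel concrete, the sketch below builds the converged form of Equation 10 (taking $u(\infty) = s^{1/L}$, which follows from the rank-one form together with $f = u^L\,\beta \cdot x = s\,\beta \cdot x$; the data sizes and depth are our own illustrative choices) and compares kernel regression with it against the isotropic, data-independent kernel when $P < D$:

```python
import numpy as np

rng = np.random.default_rng(5)
D, P, M, L, s = 50, 20, 5, 4, 2.0
beta = rng.normal(size=D); beta /= np.linalg.norm(beta)
X, X_test = rng.normal(size=(P, D)), rng.normal(size=(M, D))
y, y_test = s * X @ beta, s * X_test @ beta

def gram(A, B, M_feat):
    return A @ M_feat @ B.T

# Final NTK of Eq. (10) at convergence, and the isotropic (lazy) linear kernel.
M_final = s ** (2 - 2 / L) * ((L - 1) * np.outer(beta, beta) + np.eye(D))
M_lazy = np.eye(D)

for name, Mk in [("final NTK ", M_final), ("lazy kernel", M_lazy)]:
    alpha = np.linalg.solve(gram(X, X, Mk), y)
    preds = gram(X_test, X, Mk) @ alpha
    print(name, "test RMSE:", np.sqrt(np.mean((preds - y_test) ** 2)))

# The aligned, data-dependent kernel typically attains lower test error
# than the data-independent isotropic one in this underdetermined (P < D) setting.
```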
|
|
|
4.1.1 WHITENED DATA VS ANISOTROPIC DATA |
|
|
|
We now argue that in the case where the input data is whitened, the trained network function is again a kernel machine that uses the final NTK. The unit vector $r_1(t)$ quickly aligns to $\beta$, since the first layer weight matrix evolves in the rank-one direction $\frac{d}{dt} W^1 = v(t)\beta^\top$ throughout training for a time-dependent vector function $v(t)$. As a consequence, early in training the top eigenvector of the NTK aligns to $\beta$. Due to gradient descent dynamics, $W^{1\top} W^1$ grows only in the $\beta\beta^\top$ direction. Since $r_1$ quickly aligns to $\beta$ due to $W^1$ growing only along the $\beta$ direction, the global scalar function $c(t) = u(t)^L$ satisfies the dynamics $\dot{c}(t) = c(t)^{2-2/L}\left[s - c(t)\right]$ in the whitened data case, which is consistent with the dynamics obtained when starting from the orthogonal initialization scheme of Saxe et al. (2014). We show in Appendix E.1 that spectral learning occurs over a timescale on the order of $t_{1/2} \approx \frac{L}{s(L-2)}\,\sigma^{-L+2}$, where $t_{1/2}$ is the time required to reach half the value of the initial loss. We discuss this scaling in detail in Figure 3, showing that although the timescale of alignment shares the same scaling with $\sigma$ for $L > 2$, empirically alignment in deep networks occurs faster than spectral learning. Hence, the silent alignment conditions of Section 2 are satisfied. In the case where the data is unwhitened, the $r_1(t)$ vector aligns with $\Sigma\beta$ early in training. This happens since, early on, the dynamics for the first layer are $\frac{d}{dt} W^1 \sim v(t)\beta^\top\Sigma$ for a time-dependent vector $v(t)$. However, for the same reasons discussed in Section 3.2, the kernel must realign at late times, violating the conditions for silent alignment.
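The $\sigma^{-L+2}$ scaling of the spectral-learning time is easy to reproduce by integrating the scalar dynamics directly (our own quick check; $c_0 = \sigma^L$ since $c = u^L$ and $u_0 \sim \sigma$):

```python
import numpy as np

def time_to_half_loss(L, sigma, s=1.0, dt=1e-3, t_max=1e4):
    """Integrate c_dot = c^{2 - 2/L} (s - c) from c_0 = sigma^L and return the time at
    which the loss (s - c)^2 / 2 first reaches half its initial value, i.e. c = s (1 - 1/sqrt(2))."""
    c, t, target = sigma ** L, 0.0, s * (1.0 - 1.0 / np.sqrt(2.0))
    while c < target and t < t_max:
        c += dt * c ** (2.0 - 2.0 / L) * (s - c)
        t += dt
    return t

for sigma in (0.3, 0.1, 0.03):
    print(sigma, time_to_half_loss(L=3, sigma=sigma))

# For L = 3 the half-loss time grows ~ sigma^{-1}, i.e. t_{1/2} ~ sigma^{-L+2},
# in line with Figure 3(a); for L = 2 the growth is only logarithmic in sigma.
```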
|
|
|
¹Though we focus on neglecting the $O(\sigma^2)$ initial weight matrices in the main text, an approximate analysis for wide networks at finite $\sigma^2$ is provided in Appendix H.2, which reveals additional dependence on relative layer widths.
|
|
|
|
|
[Figure 3 panels: (a) ODE Time to Learn, (b) L = 3 Dynamics (loss and alignment vs. t for σ² = 10⁻⁵ to 10⁻¹), (c) Time To Learn L = 3. See caption below.]
|
|
|
|
|
Figure 3: (a) Time to half loss scales as a power law in $\sigma$ for networks with $L \ge 3$: $t_{1/2} \sim \frac{L}{L-2}\,\sigma^{-L+2}$ (black dashed) is compared with numerically integrating the dynamics $\dot{c}(t) = c^{2-2/L}(s - c)$ (solid). The power-law scaling of $t_{1/2}$ with $\sigma$ is qualitatively different from what happens for $L = 2$, where we identified logarithmic scaling $t_{1/2} \sim \log(\sigma^{-2})$. (b) Linear networks with $D = 30$ inputs and $N = 50$ hidden units trained on synthetic whitened data with $|\beta| = 1$. We show, for an $L = 3$ linear network, the cosine similarity of $W^{1\top}W^1$ with $\beta\beta^\top$ (dashed) and the loss (solid) for different initialization scales. (c) The time to reach $1/2$ the initial loss and the time for the cosine similarity of $W^{1\top}W^1$ with $\beta\beta^\top$ to reach $1/2$ both scale as $\sigma^{-L+2}$; however, one can see that alignment occurs before half loss is achieved.
|
|
|
4.2 MULTIPLE OUTPUT CHANNELS |
|
|
|
|
|
We next discuss the case where the network has multiple ($C$) output channels. We denote each network output as $f_c(x)$, resulting in $C^2$ kernel sub-blocks $K_{c,c'}(x, x') = \nabla f_c(x) \cdot \nabla f_{c'}(x')$. In this context, the balanced condition $W^\ell W^{\ell\top} \approx W^{\ell+1\top} W^{\ell+1}$ implies that each of the weight matrices is rank-$C$, implying a rank-$C$ kernel. We give an explicit formula for this kernel in Appendix H. For concreteness, consider whitened input data $\Sigma = I$ and a teacher with weights $\beta \in \mathbb{R}^{C \times D}$. The singular value decomposition of the teacher weights, $\beta = \sum_\alpha s_\alpha z_\alpha v_\alpha^\top$, determines the evolution of each mode (Saxe et al., 2014). Each singular mode begins to be learned at $t_\alpha = \frac{1}{s_\alpha}\log\left(s_\alpha u_0^{-2}\right)$. To guarantee silent alignment, we need all of the Phase I time constants to be smaller than all of