such saddle points are non-degenerate, it suffices to show $g''(0) < 0$. It is easy to check that the second directional derivative at the origin is given by
\[
g''(0) = -4\sigma_j^2 \left( w_{j(t)}^\top M w_{j(t)} - w_j^\top M w_j \right) < 0,
\]
which completes the proof.
9.4 Role of Parametrization
For least squares linear regression (i.e., for $k = 1$ and $u = W_1^\top \in \mathbb{R}^{d_0}$ in Problem 9.8), we can show that using dropout amounts to solving the following regularized problem:
\[
\min_{u \in \mathbb{R}^{d_0}} \; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - u^\top x_i \right)^2 + \lambda\, u^\top \widehat{C} u.
\]
All the minimizers of the above problem are solutions to the system of linear equations $(1 + \lambda) X^\top X u = X^\top y$, where $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d_0}$ and $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^{n \times 1}$ are the design matrix and the response vector, respectively. Unlike Tikhonov regularization, which yields solutions to the system of linear equations $(X^\top X + \lambda I) u = X^\top y$ (a useful prior that discounts directions accounting for small variance in the data, even when they exhibit good discriminability), the dropout regularizer manifests merely as a scaling of the parameters. This suggests that parametrization plays an important role in determining the nature of the resulting regularizer.
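As a quick numerical sanity check (a minimal sketch, assuming $\widehat{C}$ is the empirical second-moment matrix $\frac{1}{n} X^\top X$, which is what the stated normal equations suggest), the NumPy snippet below verifies that the dropout-regularized solution is exactly a rescaling of the ordinary least-squares solution, whereas the Tikhonov (ridge) solution is not:

```python
import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 0.5
XtX, Xty = X.T @ X, X.T @ y

u_ols = np.linalg.solve(XtX, Xty)                       # ordinary least squares
u_dropout = np.linalg.solve((1 + lam) * XtX, Xty)       # (1 + lam) X^T X u = X^T y
u_ridge = np.linalg.solve(XtX + lam * np.eye(d), Xty)   # (X^T X + lam I) u = X^T y

# The dropout solution is exactly u_ols / (1 + lam): a pure rescaling.
print(np.allclose(u_dropout, u_ols / (1 + lam)))  # True
# The ridge solution is not such a rescaling (in general).
print(np.allclose(u_ridge, u_ols / (1 + lam)))    # False (in general)
```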
However, a similar result was shown for deep linear networks [?]: the data-dependent regularization due to dropout again amounts merely to a scaling of the parameters. In the case of matrix sensing, by contrast, we see a richer class of regularizers. One potential explanation is that, for linear networks, a convolutional structure in the network is required to yield rich inductive biases. For instance, matrix sensing can be written as a two-layer network in the following convolutional form:
\[
\langle UV^\top, A \rangle = \langle U^\top, V^\top A^\top \rangle = \langle U^\top, (I \otimes V^\top) A^\top \rangle.
\]
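The identity can be checked numerically; a minimal NumPy sketch is given below, where the last expression is read with $U^\top$ and $A^\top$ vectorized column-wise (an assumption about the intended $\mathrm{vec}$ convention):

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 4, 3, 2
U = rng.normal(size=(d1, r))
V = rng.normal(size=(d2, r))
A = rng.normal(size=(d1, d2))

# Frobenius inner product <M, N> = trace(M^T N).
inner = lambda M, N: np.trace(M.T @ N)

lhs = inner(U @ V.T, A)        # <U V^T, A>
mid = inner(U.T, V.T @ A.T)    # <U^T, V^T A^T>

# Kronecker form: vec(V^T A^T) = (I ⊗ V^T) vec(A^T), with column-major vec.
vec = lambda M: M.reshape(-1, order="F")
rhs = vec(U.T) @ (np.kron(np.eye(d1), V.T) @ vec(A.T))

print(np.allclose(lhs, mid), np.allclose(mid, rhs))  # True True
```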