arg min_{w∈G} ‖w − w_0‖.
[Figure 6.1 plot: steepest descent trajectories from w_init = [0, 0, 0] for step sizes η = 0.01, 0.1, 0.25, 0.3, together with the minimum-norm solution w^*_{‖·‖} and the infinitesimal-step-size limit w^{η→0}_∞.]
Figure 6.1: Steepest descent w.r.t. ‖·‖_{4/3}: the global minimum to which steepest descent converges depends on η. Here w_0 = [0, 0, 0], w^*_{‖·‖} = arg min_{w∈G} ‖w‖_{4/3} denotes the minimum-norm global minimum, and w^{η→0}_∞ denotes the solution of infinitesimal steepest descent with η → 0. Note that even as η → 0, the expected characterization does not hold, i.e., w^{η→0}_∞ ≠ w^*_{‖·‖}.
In summary, for the squared loss, we characterized the implicit bias of the generic mirror descent algorithm in terms of the potential function and the initialization. However, even in simple linear regression, for steepest descent with general norms, we were unable to obtain a useful characterization. In contrast, in Section 6.3.2, we study logistic-like, strictly monotonic losses used in classification, where we can obtain a characterization for steepest descent.
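As a concrete illustration of this failure, the following is a minimal numpy sketch (not the book's code) of steepest descent with respect to the ‖·‖_{4/3} norm on a small underdetermined least-squares problem, in the spirit of Figure 6.1. The data matrix, step sizes, and iteration counts are arbitrary choices made for illustration; the point is only that the interpolating solution reached can vary with η and need not coincide with the minimum-‖·‖_{4/3}-norm global minimum.

import numpy as np
from scipy.optimize import minimize

# Underdetermined problem: 2 equations, 3 unknowns, so the set G of
# global minimizers (interpolating solutions) is a line in R^3.
X = np.array([[0.5, 1.0, 0.5],
              [1.0, 0.5, 1.5]])
y = np.array([1.0, 2.0])

def grad(w):
    # Gradient of the squared loss 0.5 * ||Xw - y||^2
    return X.T @ (X @ w - y)

def steepest_descent(eta, p=4/3, steps=100000):
    # Steepest descent w.r.t. ||.||_p: each step minimizes <g, v> + 0.5*||v||_p^2,
    # whose solution is v_i = -sign(g_i)|g_i|^(q-1) * ||g||_q^(2-q), with 1/p + 1/q = 1.
    q = p / (p - 1.0)                 # dual exponent; q = 4 for p = 4/3
    w = np.zeros(X.shape[1])          # w_0 = [0, 0, 0], as in Figure 6.1
    for _ in range(steps):
        g = grad(w)
        gq = np.sum(np.abs(g) ** q) ** (1.0 / q)
        if gq < 1e-12:                # numerically at a global minimum
            break
        w = w - eta * np.sign(g) * np.abs(g) ** (q - 1.0) * gq ** (2.0 - q)
    return w

# Minimum-||.||_{4/3} interpolator, found with a generic constrained solver
# (minimizing sum_i |w_i|^{4/3} is equivalent to minimizing the norm).
w_min = minimize(lambda w: np.sum(np.abs(w) ** (4 / 3)), x0=np.ones(3),
                 constraints=[{"type": "eq", "fun": lambda w: X @ w - y}],
                 method="SLSQP").x

for eta in [0.3, 0.1, 0.01]:
    w_inf = steepest_descent(eta)
    print(f"eta = {eta:<4}: limit = {np.round(w_inf, 4)}, "
          f"residual = {np.linalg.norm(X @ w_inf - y):.1e}")
print("min-norm solution:", np.round(w_min, 4))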
6.1.2 Geometry induced by parameterization of model class
In many learning problems, the same model class can be parameterized in multiple ways. For example, the set of linear functions in R^d can be parameterized in a canonical way as w ∈ R^d with f_w(x) = w^⊤x, but also equivalently by u, v ∈ R^d with f_{u,v}(x) = (u ⊙ v)^⊤x or f_{u,v}(x) = (u² − v²)^⊤x (with the products and squares taken entry-wise). All such equivalent parameterizations lead to equivalent training objectives; however, in overparameterized models, using gradient descent on different parameterizations leads to different induced biases in the function space. For example, [?, ?] demonstrated this phenomenon in matrix factorization and linear convolutional networks, where these parameterizations were shown to introduce interesting and unusual biases towards minimizing the nuclear norm and the ℓ_p norm (for p = 2/depth) in the Fourier domain, respectively. In general, these results are suggestive of the role of architecture choice in different neural network models, and they show how, even while using the same gradient descent algorithm, different parameterizations can induce different geometries in the function space.
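To make this concrete, here is a minimal numpy sketch (not taken from the referenced papers) comparing gradient descent on two parameterizations of the same linear model class on an underdetermined regression problem: directly on w, and on (u, v) with w = u² − v² and a small initialization. The data, initialization scale, step size, and iteration counts are arbitrary choices for illustration. Both runs fit the training data, but the direct parameterization converges to the minimum-ℓ2-norm interpolator, while the factored parameterization in the small-initialization regime tends to reach a sparser solution with smaller ℓ1 norm.

import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                          # fewer examples than parameters
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:2] = [1.0, -0.5]              # planted sparse predictor
y = X @ w_true

def grad(w):
    # Gradient of the squared loss 0.5/n * ||Xw - y||^2 with respect to w
    return X.T @ (X @ w - y) / n

# (a) Gradient descent directly on w, initialized at zero.
w = np.zeros(d)
for _ in range(50000):
    w -= 0.01 * grad(w)

# (b) Gradient descent on (u, v) with w = u**2 - v**2, small initialization.
u = 1e-3 * np.ones(d)
v = 1e-3 * np.ones(d)
for _ in range(50000):
    g = grad(u**2 - v**2)             # chain rule: dL/du = 2u*g, dL/dv = -2v*g
    u, v = u - 0.01 * 2 * u * g, v + 0.01 * 2 * v * g
w_uv = u**2 - v**2

for name, sol in [("direct w", w), ("u^2 - v^2", w_uv)]:
    print(f"{name:>10}: train residual {np.linalg.norm(X @ sol - y):.1e}, "
          f"l1 norm {np.linalg.norm(sol, 1):.2f}, l2 norm {np.linalg.norm(sol):.2f}")

The two runs implement the same function class and reach essentially the same (zero) training loss, yet the learned coefficient vectors differ, which is exactly the sense in which the parameterization, rather than the objective, determines the induced bias.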