
$$\arg\min_{w \in \mathcal{G}} \|w - w_0\|.$$

[Figure 6.1: plot of steepest-descent trajectories for step sizes $\eta = 0.01, 0.1, 0.25, 0.3$, starting from $w_{\text{init}} = [0, 0, 0]$, with the minimum-norm global minimum $w^*_{\|\cdot\|}$ and the limit point $w^{\eta \to 0}_\infty$ marked.]
Figure 6.1: Steepest descent w.r.t. $\|\cdot\|_{4/3}$: the global minimum to which steepest descent converges depends on $\eta$. Here $w_0 = [0, 0, 0]$, $w^*_{\|\cdot\|} = \arg\min_{w \in \mathcal{G}} \|w\|_{4/3}$ denotes the minimum-norm global minimum, and $w^{\eta \to 0}_\infty$ denotes the solution of infinitesimal steepest descent with $\eta \to 0$. Note that even as $\eta \to 0$, the expected characterization does not hold, i.e., $w^{\eta \to 0}_\infty \neq w^*_{\|\cdot\|}$.
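To make the $\eta$-dependence in Figure 6.1 concrete, here is a minimal NumPy sketch of steepest descent w.r.t. $\|\cdot\|_{4/3}$ on a toy underdetermined least-squares problem. It is not the instance behind the figure: the data $(X, y)$, the step counts, and the helper names lp_steepest_step and run_sd are illustrative assumptions, and the update uses one common definition of the steepest-descent step, $\Delta w_t = \arg\min_v \langle \nabla L(w_t), v \rangle + \tfrac{1}{2}\|v\|_{4/3}^2$, computed in closed form via the dual norm.

import numpy as np

# Toy underdetermined least-squares problem (an assumption, not the instance
# behind Figure 6.1); the set of global minima {w : Xw = y} is an affine set.
rng = np.random.default_rng(0)
n, d = 2, 3
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, 2)            # normalize the spectral norm so the step sizes below are stable
y = rng.standard_normal(n)

p = 4 / 3                            # norm defining the steepest-descent geometry
q = p / (p - 1)                      # dual exponent (q = 4 for p = 4/3)

def lp_steepest_step(g):
    # argmin_v <g, v> + 0.5 * ||v||_p^2, in closed form via the dual norm
    gq = np.linalg.norm(g, ord=q)
    if gq == 0.0:
        return np.zeros_like(g)
    return -np.sign(g) * np.abs(g) ** (q - 1) * gq ** (2 - q)

def run_sd(eta, steps=50_000):
    w = np.zeros(d)                  # w_init = [0, 0, 0], as in the figure
    for _ in range(steps):
        grad = X.T @ (X @ w - y)     # gradient of 0.5 * ||Xw - y||^2
        w = w + eta * lp_steepest_step(grad)
    return w

for eta in [0.01, 0.1, 0.25, 0.3]:
    w_inf = run_sd(eta)
    print(f"eta={eta:<4}  loss={0.5 * np.sum((X @ w_inf - y) ** 2):.2e}  "
          f"||w||_4/3={np.linalg.norm(w_inf, ord=p):.4f}")

On a random instance, the limits for the different step sizes all fit the data, but their $\|\cdot\|_{4/3}$ norms generally differ (possibly only slightly), mirroring the figure's point that the global minimum reached depends on $\eta$.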

In summary, for the squared loss, we characterized the implicit bias of generic mirror descent algorithms in terms of the potential function and the initialization. However, even in simple linear regression, for steepest descent with general norms, we were unable to get a useful characterization. In contrast, in Section 6.3.2, we study logistic-like strictly monotonic losses used in classification, where we can get a characterization for steepest descent.
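To spell out the first claim, the characterization referred to here (for linear regression with the squared loss) is that when mirror descent with potential $\psi$ converges to a global minimum, it converges to the Bregman projection of the initialization onto the solution set, $\arg\min_{w \in \mathcal{G}} D_\psi(w, w_0)$. The NumPy sketch below uses an illustrative potential $\psi(w) = \tfrac{1}{4}\sum_i w_i^4$; the data, initialization, step size, and iteration count are likewise assumptions. Instead of solving the Bregman projection directly, it checks the equivalent KKT condition that $\nabla\psi(w_\infty) - \nabla\psi(w_0)$ lies in the row span of $X$.

import numpy as np

rng = np.random.default_rng(1)
n, d = 2, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, 2)                      # normalize for stable steps
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0])   # a consistent linear system (assumption)

# Mirror map psi(w) = (1/4) * sum_i w_i^4, so grad psi(w) = w^3 (entrywise, signed).
def grad_psi(w):
    return np.sign(w) * np.abs(w) ** 3

def grad_psi_inv(z):
    return np.sign(z) * np.abs(z) ** (1 / 3)

w0 = 0.5 * np.ones(d)                          # initialization (assumption)
z, w, eta = grad_psi(w0), w0.copy(), 0.01
for _ in range(200_000):
    g = X.T @ (X @ w - y)                      # gradient of 0.5 * ||Xw - y||^2
    z = z - eta * g                            # mirror-descent step in the dual space
    w = grad_psi_inv(z)

# KKT check for argmin_{Xw = y} D_psi(w, w0): at (approximate) convergence,
# grad_psi(w) - grad_psi(w0) should lie in the row span of X.
delta = grad_psi(w) - grad_psi(w0)
coeffs, *_ = np.linalg.lstsq(X.T, delta, rcond=None)
print("loss:", 0.5 * np.sum((X @ w - y) ** 2))
print("component of grad_psi(w) - grad_psi(w0) outside row span(X):",
      np.linalg.norm(delta - X.T @ coeffs))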

6.1.2 Geometry induced by parameterization of model class

In many learning problems, the same model class can be parameterized in multiple ways. For example, the set of linear functions in $\mathbb{R}^d$ can be parameterized in the canonical way as $w \in \mathbb{R}^d$ with $f_w(x) = w^\top x$, but also equivalently by $u, v \in \mathbb{R}^d$ with $f_{u,v}(x) = (u \cdot v)^\top x$ or $f_{u,v}(x) = (u^2 - v^2)^\top x$, where the products and squares are taken entrywise. All such equivalent parameterizations lead to equivalent training objectives; however, in overparameterized models, using gradient descent on different parameterizations leads to different induced biases in function space. For example, this phenomenon was demonstrated for matrix factorization and linear convolutional networks [?, ?], where these parameterizations were shown to introduce interesting and unusual biases towards minimizing the nuclear norm and the $\ell_p$ (for $p = 2/\text{depth}$) norm in the Fourier domain, respectively. In general, these results are suggestive of the role of architecture choice in different neural network models, and show how, even while using the same gradient descent algorithm, different geometries in function space can be induced by different parameterizations.
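As a concrete illustration of this point, the sketch below compares gradient descent on the canonical parameterization $f_w(x) = w^\top x$ with gradient descent on the factored parameterization $f_{u,v}(x) = (u^2 - v^2)^\top x$ on a toy sparse regression problem. The instance, the initialization scale $\alpha$, the step sizes, and the iteration counts are illustrative assumptions; the qualitative outcome reflects the known behavior that gradient descent on the canonical parameterization (from zero initialization) returns the minimum $\ell_2$-norm interpolator, while the factored parameterization with small initialization is biased towards interpolators with much smaller $\ell_1$ norm.

import numpy as np

rng = np.random.default_rng(2)
n, d = 10, 40
X = rng.standard_normal((n, d)) / np.sqrt(n)
w_teacher = np.zeros(d)
w_teacher[:3] = [3.0, -2.0, 1.5]                # sparse ground truth (assumption)
y = X @ w_teacher

def grad(w):
    return X.T @ (X @ w - y)                    # gradient of 0.5 * ||Xw - y||^2

# (a) Gradient descent on the canonical parameterization f_w(x) = w^T x, from w = 0.
w = np.zeros(d)
for _ in range(10_000):
    w -= 0.05 * grad(w)

# (b) Gradient descent on f_{u,v}(x) = (u^2 - v^2)^T x, from the small
#     initialization u = v = alpha * 1 (squares taken entrywise).
alpha, eta = 1e-3, 0.01
u = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(100_000):
    g = grad(u ** 2 - v ** 2)                   # chain rule through w = u^2 - v^2
    u, v = u - eta * (2 * u * g), v - eta * (-2 * v * g)
w_uv = u ** 2 - v ** 2

for name, sol in [("w", w), ("u^2 - v^2", w_uv)]:
    print(f"{name:>10}: loss={0.5 * np.sum((X @ sol - y) ** 2):.1e}  "
          f"||.||_1={np.linalg.norm(sol, 1):.3f}  ||.||_2={np.linalg.norm(sol, 2):.3f}")

Both parameterizations drive the training loss to (near) zero, but they land at different global minima of the same objective, which is exactly the sense in which the parameterization, rather than the objective, determines the induced bias.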
