minimizes the loss L(w), the algorithm returns a special global minimizer that implicitly also minimizes the Euclidean distance to the initialization:

w_t → argmin_{w ∈ G} ‖w − w_0‖_2 .

Proof. The key idea is to note that the gradients of the loss function have a special structure. For the linear regression loss in eq. (6.1), ∀w, ∇L(w) = ∑_i (w^⊤ x^(i) − y^(i)) x^(i) ∈ span({x^(i)}); that is, the gradients are restricted to an n-dimensional subspace that is independent of w. Thus, the gradient descent updates from the initialization, w_t − w_0 = −η ∑_{t′<t} ∇L(w_{t′}), which linearly accumulate the gradients, are again constrained to the same n-dimensional subspace. It is now easy to check that there is a unique global minimizer that both fits the data (w ∈ G) and is reachable by gradient descent (w ∈ w_0 + span({x^(i)})): assuming the x^(i) are linearly independent, the n linear constraints defining G pin down a unique point of this n-dimensional affine subspace. By checking the KKT conditions, it can be verified that this unique minimizer is given by argmin_{w∈G} ‖w − w_0‖_2^2 .
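The claim is easy to check numerically. The following sketch (synthetic data, not from the text) takes the loss to be 0.5 ∑_i (w^⊤ x^(i) − y^(i))^2, so that its gradient matches the expression above, runs gradient descent on an overparameterized least-squares instance, and compares the result with the closed-form Euclidean projection of w_0 onto G:

# A minimal numerical check of the claim above (a sketch with synthetic data):
# gradient descent on an overparameterized least-squares problem converges to
# the interpolating solution closest, in Euclidean norm, to its initialization.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # n samples, d parameters (d > n)
X = rng.standard_normal((n, d))     # rows are the inputs x^(i)
y = rng.standard_normal(n)
w0 = rng.standard_normal(d)         # arbitrary initialization

# Gradient descent on L(w) = 0.5 * sum_i (w^T x^(i) - y^(i))^2,
# whose gradient is X^T (X w - y), matching the expression in the proof.
w, eta = w0.copy(), 1e-3
for _ in range(50_000):
    w -= eta * X.T @ (X @ w - y)

# Closed-form Euclidean projection of w0 onto G = {w : Xw = y}:
# w_star = w0 + X^T (X X^T)^{-1} (y - X w0)
w_star = w0 + X.T @ np.linalg.solve(X @ X.T, y - X @ w0)

print("training error    :", np.linalg.norm(X @ w - y))   # should be ~0
print("distance to argmin:", np.linalg.norm(w - w_star))  # should be ~0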

In general overparameterized optimization problems, the characterization of the implicit bias or algorithmic regularization is often not this elegant or easy. For the same model class, changing the algorithm, changing the associated hyperparameters (like the step size and initialization), or even changing the specific parameterization of the model class can change the implicit bias. For example, [?] showed that for some standard deep learning architectures, variants of the SGD algorithm with different choices of momentum and adaptive gradient updates (AdaGrad and Adam) exhibit different biases and thus have different generalization performance; [?], [?] and [?] study how the size of the mini-batches used in SGD influences generalization; and [?] compare the bias of path-SGD (steepest descent with respect to a scale-invariant path-norm) to standard SGD.
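As a small, concrete illustration of this sensitivity (a sketch with synthetic data, not taken from the cited works), even keeping plain gradient descent and only changing the initialization on the least-squares example above already selects a different global minimizer, and hence a different learned predictor:

# A sketch (synthetic data): plain gradient descent on the same kind of
# overparameterized least-squares problem, started from two different
# initializations, fits the data equally well but returns two different
# global minimizers, so even this choice changes the learned model.
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def run_gd(w0, eta=1e-3, steps=50_000):
    """Full-batch gradient descent on 0.5 * ||Xw - y||^2."""
    w = w0.copy()
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y)
    return w

w_from_zero = run_gd(np.zeros(d))             # reaches the minimum-norm interpolator
w_from_rand = run_gd(rng.standard_normal(d))  # reaches a different interpolator

print("both interpolate:", np.linalg.norm(X @ w_from_zero - y),
      np.linalg.norm(X @ w_from_rand - y))
print("but they differ :", np.linalg.norm(w_from_zero - w_from_rand))

Both runs drive the training error to (numerically) zero, yet the two solutions differ by the null-space component of the random initialization, which is exactly the part of the parameter vector that the data never constrains.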

A comprehensive understanding of how all the algorithmic choices affect the implicit bias is beyond the scope of this chapter (and also beyond the current state of research). However, in the context of this chapter, we specifically want to highlight the role of the geometry induced by the optimization algorithm and by the specific parameterization, which are discussed briefly below.

6.1.1 Geometry induced by updates of local search algorithms

The connection between gradient descent and its implicit bias towards minimizing the Euclidean distance to the initialization is suggestive of a more general link between algorithmic regularization and the geometry of the updates in local search methods. In particular, the gradient descent iterations can alternatively be specified by the following equation, where the (t + 1)-th iterate is obtained by minimizing a local (first-order Taylor) approximation of the loss regularized by the squared Euclidean distance to the current iterate:

w_{t+1} = argmin_w ⟨w, ∇L(w_t)⟩ + (1/(2η)) ‖w − w_t‖_2^2 .
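To confirm that this local problem reproduces the familiar update rule (a short check, not spelled out in the excerpt), one can set the gradient of the strongly convex objective to zero at its minimizer w_{t+1}:

\[
\nabla L(w_t) + \frac{1}{\eta}\,(w_{t+1} - w_t) = 0
\quad\Longrightarrow\quad
w_{t+1} = w_t - \eta\,\nabla L(w_t),
\]

which is exactly one step of gradient descent with step size η.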
