minimizes the loss $L(w)$, the algorithm returns a special global minimizer
that implicitly also minimizes the Euclidean distance to the initialization:
$$w_t \to \operatorname*{argmin}_{w \in G} \|w - w_0\|_2.$$
Proof. The key idea is noting that the gradients of the loss
function have a special structure. For the linear regression loss in
eq. (6.1), $\forall w$, $\nabla L(w) = \sum_i (w^\top x^{(i)} - y^{(i)})\, x^{(i)} \in \operatorname{span}(\{x^{(i)}\})$ - that is,
the gradients are restricted to an $n$-dimensional subspace that is independent
of $w$. Thus, the gradient descent iterates, which linearly accumulate
gradients from the initialization, $w_t - w_0 = -\eta \sum_{t' < t} \nabla L(w_{t'})$, are again
constrained to this $n$-dimensional subspace. It is now easy to check
that there is a unique global minimizer that both fits the data ($w \in G$)
and is reachable by gradient descent ($w \in w_0 + \operatorname{span}(\{x^{(i)}\})$).
By checking the KKT conditions, it can be verified that this unique
minimizer is given by $\operatorname*{argmin}_{w \in G} \|w - w_0\|_2^2$.
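The theorem is easy to verify numerically. The following is a minimal sketch (the dimensions, step size, and iteration count are our own illustrative choices, not from the text): gradient descent from an arbitrary $w_0$ is compared against the closed-form minimizer of $\|w - w_0\|_2$ over $G = \{w : Xw = y\}$, which the KKT conditions give as $w^* = w_0 + X^\top (XX^\top)^{-1}(y - Xw_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                      # overparameterized: d > n
X = rng.standard_normal((n, d))    # rows are the data points x^(i)
y = rng.standard_normal(n)
w0 = rng.standard_normal(d)        # arbitrary initialization

# Gradient descent with the gradient from the proof: X^T (X w - y)
w = w0.copy()
eta = 0.01
for _ in range(10000):
    w -= eta * X.T @ (X @ w - y)

# Closed-form minimizer of ||w - w0||_2 over G (from the KKT conditions)
w_star = w0 + X.T @ np.linalg.solve(X @ X.T, y - X @ w0)

print(np.allclose(X @ w, y, atol=1e-6))   # GD fits the data (w in G)
print(np.allclose(w, w_star, atol=1e-6))  # and matches the min-distance solution
```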
In general overparameterized optimization problems, the characterization
of the implicit bias or algorithmic regularization is often
not this elegant or easy. For the same model class, changing the algorithm,
changing the associated hyperparameters (like step size and
initialization), or even changing the specific parameterization of the
model class can change the implicit bias. For example, [?] showed
that for some standard deep learning architectures, variants of the SGD
algorithm with different choices of momentum and adaptive gradient
updates (AdaGrad and Adam) exhibit different biases and thus have
different generalization performance; [?], [?], and [?] study how the
size of the mini-batches used in SGD influences generalization; and
[?] compare the bias of path-SGD (steepest descent with respect to a
scale-invariant path-norm) to standard SGD.
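The algorithm-dependence of the bias is visible even in the linear regression setting above. The following toy sketch (our own illustration, not the experiments from the works cited above) runs plain gradient descent and Adam, with its standard update rule and common default hyperparameters, on the same overparameterized least-squares problem: both reach low training loss, but at different global minimizers with different distances to the initialization.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w0 = np.zeros(d)

def grad(w):
    return X.T @ (X @ w - y)

# Plain gradient descent: converges to the minimizer closest to w0
w_gd = w0.copy()
for _ in range(20000):
    w_gd -= 0.005 * grad(w_gd)

# Adam (standard update; beta1, beta2, eps, lr are the usual defaults)
w_ad, m, v = w0.copy(), np.zeros(d), np.zeros(d)
b1, b2, eps, lr = 0.9, 0.999, 1e-8, 0.001
for t in range(1, 50001):
    g = grad(w_ad)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    mh, vh = m / (1 - b1**t), v / (1 - b2**t)
    w_ad -= lr * mh / (np.sqrt(vh) + eps)

# Both residuals are small, but the distances to initialization differ:
# GD attains the minimum ||w - w0||_2 over G; Adam generically does not.
print(np.linalg.norm(X @ w_gd - y), np.linalg.norm(X @ w_ad - y))
print(np.linalg.norm(w_gd - w0), np.linalg.norm(w_ad - w0))
```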
A comprehensive understanding of how all the algorithmic choices
affect the implicit bias is beyond the scope of this chapter (and also
the current state of research). However, in the context of this chapter,
we specifically want to highlight the role of the geometry induced by
the optimization algorithm and the specific parameterization, which are
discussed briefly below.
6.1.1 Geometry induced by updates of local search algorithms
The implicit bias of gradient descent towards minimizing the
Euclidean distance to the initialization suggests a connection
between algorithmic regularization and the geometry of the updates
in local search methods. In particular, gradient descent iterations
can alternatively be specified by the following equation, where the
$(t+1)$th iterate is derived by minimizing a local (first order Taylor)
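approximation of the loss, regularized by a squared Euclidean proximity term. A standard way to write this update (stated here explicitly for reference, with the step size $\eta$ as above) is
$$w_{t+1} = \operatorname*{argmin}_{w} \; \langle w, \nabla L(w_t) \rangle + \frac{1}{2\eta} \|w - w_t\|_2^2;$$
setting the gradient of the right-hand side to zero recovers exactly $w_{t+1} = w_t - \eta \nabla L(w_t)$.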