Theory of Deep Learning, 2022
algorithmic regularization 53
to the linear case in Section ??, there is a related maximum margin
problem. Define the optimal margin as

γ = max_{‖w‖_2 = 1} min_i y_i f_i(w).

The associated non-linear margin maximization is given by the
following non-convex constrained optimization:

min ‖w‖_2  s.t.  y_i f_i(w) ≥ γ.    (Max-Margin)
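As a concrete illustration (not from the text), in the linear case f_i(w) = ⟨x_i, w⟩ the optimal margin γ can be estimated by brute force. The sketch below uses made-up 2D data with the labels folded into the features (z_i = y_i x_i, so every constraint reads ⟨z_i, w⟩ ≥ γ) and scans unit vectors:

```python
import numpy as np

# Illustrative toy data for the linear case f_i(w) = <z_i, w>,
# with labels folded into the features (z_i = y_i x_i).
Z = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 3.0]])

# gamma = max_{||w||_2 = 1} min_i <z_i, w>, estimated by scanning unit vectors.
thetas = np.linspace(0.0, 2.0 * np.pi, 200_000, endpoint=False)
W = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # unit vectors
margins = (W @ Z.T).min(axis=1)                         # min_i <z_i, w>
gamma = margins.max()
w_star = W[margins.argmax()]                            # a maximizing direction
```

For this dataset the maximum is attained at w = (1, 1)/√2 with γ = 1.5√2, and the two binding points z_1, z_2 are the support vectors.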
Analogous to Section ??, we expect that gradient descent on Equation
(6.17) converges to the optimum of the max-margin problem
(Max-Margin). However, the max-margin problem itself is a constrained
non-convex problem, so we cannot expect to attain a global
optimum. Instead, we show that the gradient descent iterates converge to
first-order stationary points of the max-margin problem.
Definition 6.4.1 (First-order Stationary Point). The first-order optimality
conditions of (Max-Margin) are:

1. ∀i, y_i f_i(w) ≥ γ;

2. there exist Lagrange multipliers λ ∈ R^N_+ such that w = ∑_n λ_n ∇f_n(w)
and λ_n = 0 for n ∉ S_m(w) := {i : y_i f_i(w) = γ}, where S_m(w) is the
set of support vectors.

We denote by W⋆ the set of first-order stationary points.
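These conditions can be checked numerically on a toy linear instance f_i(w) = ⟨z_i, w⟩ (labels folded in, so ∇f_i(w) = z_i; the data, the max-margin point, and γ below are illustrative, not from the text):

```python
import numpy as np

# Toy linear instance: f_i(w) = <z_i, w>, grad f_i(w) = z_i (illustrative data).
Z = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 3.0]])
w = np.array([1.0, 1.0]) / np.sqrt(2.0)   # max-margin direction for this data
gamma = 1.5 * np.sqrt(2.0)                # optimal margin for this data

margins = Z @ w
support = np.isclose(margins, gamma)      # S_m(w): constraints met with equality

# Solve w = sum_{n in S_m(w)} lambda_n * grad f_n(w) for the multipliers,
# with lambda_n = 0 off the support set.
lam_support, *_ = np.linalg.lstsq(Z[support].T, w, rcond=None)
lam = np.zeros(len(Z))
lam[support] = lam_support
```

Here the first two points are support vectors and the recovered multipliers are nonnegative, so w is a first-order stationary point in the sense of Definition 6.4.1.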
Let w_t be the iterates of gradient flow (gradient descent with
step size tending to zero). Define l_{i,t} = exp(−f_i(w_t)) and let l_t be the
vector with entries l_{i,t}. The following two assumptions posit
that the limiting direction of w_t / ‖w_t‖ exists and that the limiting
direction of the losses l_t / ‖l_t‖_1 exists. Such assumptions are natural
in the context of max-margin problems, since we want to argue that w_t converges to
a max-margin direction, and also that the losses l_t / ‖l_t‖_1 converge to an
indicator vector of the support vectors. We will directly assume these
limits exist, though this is proved in the literature.²
Assumption 6.4.2 (Smoothness). We assume f_i(w) is a C² function.
Assumption 6.4.3 (Asymptotic Formulas). Assume that L(w_t) → 0, that
is, we converge to a global minimizer. Further assume that
lim_{t→∞} w_t / ‖w_t‖_2 and lim_{t→∞} l_t / ‖l_t‖_1 exist. Equivalently,

l_{n,t} = h_t a_n + h_t ε_{n,t},    (6.18)

w_t = g_t w̄ + g_t δ_t,    (6.19)

with ‖a‖_1 = 1, ‖w̄‖_2 = 1, lim_{t→∞} h_t = 0, lim_{t→∞} ε_{n,t} = 0, and
lim_{t→∞} δ_t = 0.
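These asymptotics can be illustrated numerically (a sketch under toy assumptions, not the book's code): run gradient descent with a small step size as a stand-in for gradient flow on L(w) = ∑_n exp(−⟨z_n, w⟩), a linear instance with labels folded into the features. This particular dataset is symmetric about the max-margin direction, so the iterates stay exactly on it, and the normalized losses concentrate on the support vectors:

```python
import numpy as np

# Gradient descent with a small step as a proxy for gradient flow on
# L(w) = sum_n exp(-<z_n, w>), linear f_n(w) = <z_n, w> (illustrative data).
Z = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 3.0]])
w = np.zeros(2)
eta = 0.1
for _ in range(100_000):
    losses = np.exp(-Z @ w)      # l_{n,t} = exp(-f_n(w_t))
    w += eta * Z.T @ losses      # w_{t+1} = w_t - eta * grad L(w_t)

w_dir = w / np.linalg.norm(w)    # candidate limit of w_t / ||w_t||_2
losses = np.exp(-Z @ w)
loss_dir = losses / losses.sum() # l_t / ||l_t||_1
```

After many steps, w_dir sits at the max-margin direction (1, 1)/√2 and loss_dir is nearly the uniform indicator on the two support vectors, matching the role of w̄ and a in Equations (6.18)–(6.19).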
Assumption 6.4.4 (Linear Independence Constraint Qualification).
Let w be a unit vector. LICQ holds at w if the vectors {∇f_i(w)}_{i∈S_m(w)} are
linearly independent.
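In a toy linear instance f_i(w) = ⟨z_i, w⟩ (so ∇f_i(w) = z_i; the data below are illustrative), LICQ reduces to a rank condition on the matrix whose rows are the active-constraint gradients:

```python
import numpy as np

# LICQ check on illustrative data: the gradients of the active
# constraints {grad f_i(w)}_{i in S_m(w)} are the support vectors
# themselves, and LICQ asks that they be linearly independent.
S = np.array([[2.0, 1.0], [1.0, 2.0]])   # rows: active-constraint gradients
licq_holds = np.linalg.matrix_rank(S) == S.shape[0]
```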