Theory of Deep Learning, 2022


algorithmic regularization 53

Analogous to the linear case in Section ??, there is a related maximum margin problem. Define the optimal margin as γ = max_{‖w‖₂=1} min_i y_i f_i(w). The associated non-linear margin maximization is given by the following non-convex constrained optimization:

min ‖w‖₂  s.t.  y_i f_i(w) ≥ γ.  (Max-Margin)
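As a concrete illustration, the optimal margin γ of a simple linear model f_i(w) = ⟨x_i, w⟩ can be approximated by searching over unit directions. The data below is a hypothetical toy example, not from the text, and the grid search is only a sketch of the definition, not a practical solver:

```python
import numpy as np

# Hypothetical toy data for a linear model f_i(w) = <x_i, w>.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0])

def margin(w):
    """Smallest signed margin min_i y_i f_i(w) in direction w."""
    return np.min(y * (X @ w))

# Approximate gamma = max_{||w||_2 = 1} min_i y_i f_i(w) by a grid of
# unit directions in 2D (feasible only in very low dimension).
angles = np.linspace(0.0, 2 * np.pi, 20000, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit vectors
gammas = np.array([margin(w) for w in dirs])
best = np.argmax(gammas)
gamma_opt, w_opt = gammas[best], dirs[best]
print(gamma_opt, w_opt)  # gamma_opt > 0 since this toy data is separable
```

A positive γ certifies that the data is separable by the model class in some direction, which is the regime the rest of the section works in.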

Analogous to Section ??, we expect that gradient descent on Equation (6.17) converges to the optimum of the max-margin problem (Max-Margin). However, the max-margin problem itself is a constrained non-convex problem, so we cannot expect to attain a global optimum. Instead, we show that gradient descent iterates converge to first-order stationary points of the max-margin problem.

Definition 6.4.1 (First-order Stationary Point). The first-order optimality conditions of (Max-Margin) are:

1. ∀i, y_i f_i(w) ≥ γ;

2. there exist Lagrange multipliers λ ∈ ℝ^N_+ such that w = ∑_n λ_n ∇f_n(w) and λ_n = 0 for n ∉ S_m(w) := {i : y_i f_i(w) = γ}, where S_m(w) is the set of support vectors.

We denote by W⋆ the set of first-order stationary points.
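These conditions can be checked numerically. The sketch below uses a hypothetical linear example f_i(w) = ⟨x_i, w⟩ with all labels +1 (so the label factor plays no role in the gradients): it identifies the support set S_m(w) and solves for nonnegative multipliers λ expressing w as a combination of support-vector gradients:

```python
import numpy as np

# Hypothetical linear example: f_i(w) = <x_i, w>, all y_i = +1.
X = np.array([[1.0, 1.0], [1.0, -1.0], [3.0, 0.0]])
w = np.array([1.0, 0.0])                     # candidate unit-norm direction

# Condition 1: margins y_i f_i(w); the smallest one is gamma at this w.
margins = X @ w
gamma = margins.min()
support = np.where(np.isclose(margins, gamma))[0]   # S_m(w)

# Condition 2: solve w = sum_{n in S_m} lambda_n grad f_n(w); for a linear
# model, grad f_n(w) = x_n, so this is a small linear system.
lam, *_ = np.linalg.lstsq(X[support].T, w, rcond=None)
residual = np.linalg.norm(X[support].T @ lam - w)
print(support, lam, residual)
```

Here the support set is {0, 1}, the multipliers come out to λ = (1/2, 1/2) ≥ 0, and the residual is zero, so this w satisfies both conditions of Definition 6.4.1.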

Let w_t be the iterates of gradient flow (gradient descent with step size tending to zero). Define l_it = exp(−f_i(w_t)) and let l_t be the vector with entries l_it. The following two assumptions require that the limiting direction of the iterates, w_t/‖w_t‖, and the limiting direction of the losses, l_t/‖l_t‖₁, exist. Such assumptions are natural in the context of max-margin problems, since we want to argue that w_t converges to a max-margin direction and that the normalized losses l_t/‖l_t‖₁ converge to an indicator vector of the support vectors. We will directly assume these limits exist, though this is proved in².
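The claimed behavior of the normalized losses is easy to observe numerically. The sketch below runs gradient descent with a small step size (a crude stand-in for gradient flow) on the exponential loss of a hypothetical linear model; the example data and step size are assumptions for illustration only:

```python
import numpy as np

# Hypothetical linear model f_i(w) = <x_i, w> with all labels +1, trained by
# gradient descent with a small step on L(w) = sum_i exp(-y_i f_i(w)).
X = np.array([[1.0, 1.0], [1.0, -1.0], [3.0, 0.0]])
y = np.array([1.0, 1.0, 1.0])
w = np.zeros(2)
eta = 0.01
for _ in range(50000):
    l = np.exp(-y * (X @ w))          # per-example losses l_it
    grad = -(y * l) @ X               # gradient of L(w)
    w -= eta * grad

l = np.exp(-y * (X @ w))
print(l / l.sum())                    # mass concentrates on examples 0 and 1
print(w / np.linalg.norm(w))          # direction stabilizes at (1, 0)
```

Examples 0 and 1 attain the smallest margin in the limiting direction, so their losses dominate the sum while the loss of example 2 vanishes exponentially faster: l_t/‖l_t‖₁ approaches the (normalized) indicator of the support vectors, exactly as the assumptions anticipate.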

Assumption 6.4.2 (Smoothness). We assume f_i(w) is a C² function.

Assumption 6.4.3 (Asymptotic Formulas). Assume that L(w_t) → 0, that is, we converge to a global minimizer. Further assume that lim_{t→∞} w_t/‖w_t‖₂ and lim_{t→∞} l_t/‖l_t‖₁ exist. Equivalently,

l_nt = h_t a_n + h_t ε_nt,  (6.18)

w_t = g_t w̄ + g_t δ_t,  (6.19)

with ‖a‖₁ = 1, ‖w̄‖₂ = 1, lim_{t→∞} h_t = 0, lim_{t→∞} ε_nt = 0, and lim_{t→∞} δ_t = 0.

Assumption 6.4.4 (Linear Independence Constraint Qualification). Let w be a unit vector. LICQ holds at w if the vectors {∇f_i(w)}_{i∈S_m(w)} are linearly independent.
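For a linear model, LICQ reduces to a rank check on the support-vector inputs, since ∇f_i(w) = x_i. The snippet below is a minimal sketch with an assumed support set, reusing the hypothetical example from earlier in the section:

```python
import numpy as np

# Hypothetical support-vector inputs; for a linear model f_i(w) = <x_i, w>
# the gradients grad f_i(w) are the inputs x_i themselves.
X_support = np.array([[1.0, 1.0], [1.0, -1.0]])
grads = X_support

# LICQ holds iff the support-vector gradients are linearly independent,
# i.e. the gradient matrix has full row rank.
licq = np.linalg.matrix_rank(grads) == grads.shape[0]
print(licq)
```

When LICQ fails (e.g. duplicated or collinear support vectors), the Lagrange multipliers in Definition 6.4.1 need not be unique, which is why the qualification is assumed.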
