
direction to the maximum margin separator with unit ℓ2 norm, i.e., the hard-margin support vector machine classifier.

This characterization of the implicit bias is independent of both the step size and the initialization. We already see a fundamental difference from the implicit bias of gradient descent for losses with a unique finite root (Section ??), where the characterization depended on the initialization. The above result is rigorously proved as part of a more general result in Theorem 6.3.2. Below is a simpler statement with a heuristic proof sketch intended to convey the intuition for such results.

Theorem 6.3.1. For almost all linearly separable datasets, consider gradient descent updates with any initialization w_0 and any step size such that the exponential loss in eq. (6.6) is minimized, i.e., L(w_t) → 0. The gradient descent iterates then converge in direction to the ℓ2 max-margin vector, i.e.,

\[
\lim_{t \to \infty} \frac{w_t}{\|w_t\|_2} = \frac{\hat{w}}{\|\hat{w}\|_2},
\qquad \text{where} \qquad
\hat{w} = \operatorname*{argmin}_{w} \|w\|_2 \ \text{ s.t. } \ \forall i,\ w^\top x^{(i)} y^{(i)} \geq 1.
\tag{6.7}
\]

Without loss of generality, assume that ∀i, y^(i) = 1, as the sign for linear models can be absorbed into x^(i).
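
To make the statement concrete, here is a minimal numerical sketch; the dataset, initialization, step size, and iteration budget are illustrative choices, not part of the text. For the three points below, with labels already absorbed, the max-margin vector of eq. (6.7) works out to ŵ = (1, 0), and gradient descent on the exponential loss slowly aligns its direction with ŵ regardless of where it starts; the alignment improves only logarithmically in t, so many iterations are needed to see it clearly.

```python
import numpy as np

# Toy separable dataset with labels absorbed (y^(i) = 1 for all i).
# The constraints w^T x^(i) >= 1 read: w1 + w2 >= 1, w1 - w2 >= 1, 2*w1 + 2*w2 >= 1.
# Adding the first two gives w1 >= 1, so the minimum-norm solution of eq. (6.7)
# is w_hat = (1, 0); the first two points are the support vectors.
X = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [2.0, 2.0]])
w_hat = np.array([1.0, 0.0])

def grad_exp_loss(w):
    """Gradient of L(w) = sum_i exp(-w^T x^(i))."""
    coeffs = np.exp(-X @ w)                      # per-sample weights exp(-w^T x^(i))
    return -(coeffs[:, None] * X).sum(axis=0)

w = np.array([-1.0, 2.0])                        # arbitrary initialization (illustrative)
lr = 0.1                                         # fixed step size (illustrative)
for t in range(1, 100_001):
    w = w - lr * grad_exp_loss(w)
    if t in (10, 100, 1_000, 10_000, 100_000):
        cos = (w @ w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
        print(f"t={t:7d}  ||w_t||={np.linalg.norm(w):7.3f}  cos(w_t, w_hat)={cos:.6f}")
```

The norm ‖w_t‖ keeps growing (the exponential loss has no finite minimizer), while the direction w_t/‖w_t‖_2 drifts toward ŵ/‖ŵ‖_2, which is exactly the claim of Theorem 6.3.1.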

Proof Sketch. We first give an intuition for why an exponential tail of the loss entails asymptotic convergence to the max-margin vector. Examine the asymptotic regime of gradient descent when the exponential loss is minimized; as we argued earlier, this requires that ∀i: w_t^⊤ x^(i) → ∞. Suppose w_t/‖w_t‖_2 converges to some limit w_∞, so we can write w_t = g(t) w_∞ + ρ(t) such that g(t) → ∞, ∀i, w_∞^⊤ x^(i) > 0, and lim_{t→∞} ρ(t)/g(t) = 0. The gradients at w_t are then given by:

\[
-\nabla L(w_t)
= \sum_{i=1}^{n} \exp\!\left(-w_t^\top x^{(i)}\right) x^{(i)}
= \sum_{i=1}^{n} \exp\!\left(-g(t)\, w_\infty^\top x^{(i)}\right) \exp\!\left(-\rho(t)^\top x^{(i)}\right) x^{(i)}.
\tag{6.8}
\]

As g(t) → ∞ and the exponents become more negative, only those samples with the largest (i.e., least negative) exponents contribute to the gradient. These are precisely the samples with the smallest margin, argmin_i w_∞^⊤ x^(i), a.k.a. the "support vectors". The accumulation of the negative gradients, and hence w_t, would then asymptotically be dominated by a non-negative linear combination of support vectors. Together with the margin constraints, this is precisely the KKT condition for the SVM problem in eq. (6.7). Making these intuitions rigorous constitutes the bulk of the proof in [?], which uses a proof technique very different from that in the following section (Section 6.3.2).
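
The decisive step of the sketch, that the negative gradient (and hence the accumulated iterate) is asymptotically a non-negative combination of the support vectors, can also be checked numerically. The snippet below is a sketch under the same illustrative dataset used above, with the limit direction taken to be w_∞ = ŵ/‖ŵ‖_2 = (1, 0); it evaluates eq. (6.8) along g · w_∞ for growing g and shows the normalized negative gradient lining up with the support-vector direction.

```python
import numpy as np

# Same illustrative dataset; under w_inf = (1, 0) the margins w_inf^T x^(i) are
# [1, 1, 2], so the first two points are the support vectors.
X = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [2.0, 2.0]])
w_inf = np.array([1.0, 0.0])                     # assumed limit direction w_hat / ||w_hat||_2

margins = X @ w_inf                              # w_inf^T x^(i) for each sample
support = np.isclose(margins, margins.min())     # samples attaining the smallest margin
sv_dir = X[support].sum(axis=0)
sv_dir = sv_dir / np.linalg.norm(sv_dir)         # direction of the support-vector combination

for g in (1.0, 2.0, 5.0, 20.0, 100.0):
    # Negative gradient of the exponential loss at g * w_inf, as in eq. (6.8) with rho = 0.
    neg_grad = (np.exp(-g * margins)[:, None] * X).sum(axis=0)
    neg_grad = neg_grad / np.linalg.norm(neg_grad)
    print(f"g={g:6.1f}  cos(-grad L, SV direction)={neg_grad @ sv_dir:.6f}")
```

The non-support sample is suppressed by a factor exp(−g · (2 − 1)) relative to the support vectors, so its contribution vanishes as g grows; this is the sense in which the gradient, and therefore w_t, is eventually built from support vectors alone, matching the KKT conditions of eq. (6.7).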
