direction to the maximum margin separator with unit $\ell_2$ norm, i.e., the hard margin support vector machine classifier.
This characterization of the implicit bias is independent of both the step size and the initialization. We already see a fundamental difference from the implicit bias of gradient descent for losses with a unique finite root (Section ??), where the characterization depended on the initialization. The above result is rigorously proved as part of a more general result in Theorem 6.3.2. Below is a simpler statement with a heuristic proof sketch intended to convey the intuition for such results.
Theorem 6.3.1. For almost every linearly separable dataset, consider gradient descent updates with any initialization $w_0$ and any step size that minimizes the exponential loss in eq. (6.6), i.e., $L(w_t) \to 0$. The gradient descent iterates then converge in direction to the $\ell_2$ max-margin vector, i.e., $\lim_{t\to\infty} \frac{w_t}{\|w_t\|_2} = \frac{\hat{w}}{\|\hat{w}\|_2}$, where

$$\hat{w} = \operatorname*{argmin}_{w} \|w\|_2 \quad \text{s.t.} \quad \forall i,\; w^\top x^{(i)} y^{(i)} \geq 1. \tag{6.7}$$
Without loss of generality, assume that $\forall i,\, y^{(i)} = 1$, as for linear models the sign can be absorbed into $x^{(i)}$.
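To make the theorem concrete, here is a minimal numerical sketch (not from the original text; the toy dataset, initialization, step size, and iteration count are illustrative choices). It runs gradient descent on the exponential loss for a two-sample dataset whose $\ell_2$ max-margin vector is $(1, 0)$ by inspection, and tracks how the normalized iterate aligns with that direction while the norm of the iterate diverges.

```python
import numpy as np

# Toy linearly separable data with the labels already absorbed, i.e. each row
# is z_i = y_i x_i. The constraints w^T z_i >= 1 read w1 + w2 >= 1 and
# w1 - w2 >= 1, so the l2 max-margin vector of eq. (6.7) is w_hat = (1, 0).
Z = np.array([[1.0, 1.0],
              [1.0, -1.0]])
w_hat_dir = np.array([1.0, 0.0])      # known max-margin direction

w = np.array([0.3, 0.8])              # arbitrary initialization w_0
lr = 0.1                              # arbitrary step size

for t in range(1, 200001):
    # L(w) = sum_i exp(-w^T z_i); its negative gradient is
    # sum_i exp(-w^T z_i) z_i, cf. eq. (6.8).
    coeffs = np.exp(-Z @ w)
    w += lr * (coeffs[:, None] * Z).sum(axis=0)
    if t % 50000 == 0:
        direction = w / np.linalg.norm(w)
        print(t, round(np.linalg.norm(w), 2), direction, direction @ w_hat_dir)

# ||w_t|| diverges (roughly like log t), while the direction w_t / ||w_t||
# approaches (1, 0), independently of the initialization and step size.
```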
Proof Sketch We first understand intuitively why an exponential tail of the loss entails asymptotic convergence to the max-margin vector: consider the asymptotic regime of gradient descent when the exponential loss is minimized. As we argued earlier, this requires that $\forall i: w_t^\top x^{(i)} \to \infty$. Suppose $w_t / \|w_t\|_2$ converges to some limit $w_\infty$, so we can write $w_t = g(t)\, w_\infty + \rho(t)$ such that $g(t) \to \infty$, $\forall i,\, w_\infty^\top x^{(i)} > 0$, and $\lim_{t\to\infty} \rho(t)/g(t) = 0$. The gradients at $w_t$ are given by:
$$-\nabla L(w_t) = \sum_{i=1}^{n} \exp\left(-w_t^\top x^{(i)}\right) x^{(i)} = \sum_{i=1}^{n} \exp\left(-g(t)\, w_\infty^\top x^{(i)}\right) \exp\left(-\rho(t)^\top x^{(i)}\right) x^{(i)}. \tag{6.8}$$
As $g(t) \to \infty$ and the exponents become more negative, only those samples with the largest (i.e., least negative) exponents will contribute to the gradient. These are precisely the samples with the smallest margin $\operatorname*{argmin}_i w_\infty^\top x^{(i)}$, aka the “support vectors”. The accumulated negative gradient, and hence $w_t$, would then asymptotically be dominated by a non-negative linear combination of support vectors. These are precisely the KKT conditions for the SVM problem in eq. (6.7).
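To spell out that connection, note that eq. (6.7) with the labels absorbed (so the constraints read $w^\top x^{(i)} \geq 1$) is equivalent to $\min_w \frac{1}{2}\|w\|_2^2$ subject to the same constraints, and its KKT conditions are

$$\hat{w} = \sum_{i=1}^{n} \alpha_i x^{(i)}, \qquad \alpha_i \geq 0, \qquad \alpha_i\left(\hat{w}^\top x^{(i)} - 1\right) = 0 \quad \forall i,$$

i.e., $\hat{w}$ is a non-negative combination of exactly those samples that attain the minimum margin, matching the asymptotic form of $w_t$ described above.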
Making these intuitions rigorous constitutes the bulk of the proof in [?], which uses a proof technique very different from that in the following section (Section 6.3.2).
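The dominance of the support vectors in eq. (6.8) can also be checked numerically. The following sketch (again illustrative, with an assumed toy dataset that adds two large-margin samples to the previous one) prints the per-sample coefficients $\exp(-w_t^\top x^{(i)})$ after many gradient descent steps: essentially all of the gradient mass sits on the two minimum-margin samples, and the iterate's direction approaches the max-margin vector.

```python
import numpy as np

# Labels absorbed as before (each row is z_i = y_i x_i). Under the max-margin
# vector w_hat = (1, 0), the first two samples have margin 1 while the last
# two have margins 3 and 2, so only the first two are support vectors.
Z = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [3.0, 0.0],
              [2.0, 2.0]])

w = np.zeros(2)
lr = 0.1
for _ in range(200000):
    coeffs = np.exp(-Z @ w)           # per-sample coefficients from eq. (6.8)
    w += lr * (coeffs[:, None] * Z).sum(axis=0)

coeffs = np.exp(-Z @ w)
print("relative gradient weights:", coeffs / coeffs.sum())
print("direction:", w / np.linalg.norm(w))
# The weight mass concentrates on the two minimum-margin samples, so the
# accumulated negative gradient (and hence w) is asymptotically a
# non-negative combination of the support vectors, pointing towards (1, 0).
```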