Theory of Deep Learning (2022)
approximation of the loss while constraining the step length in
Euclidean norm:

    w_{t+1} = arg min_w ⟨w, ∇L(w_t)⟩ + (1/2η) ‖w − w_t‖₂².    (6.2)
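Minimizing the quadratic objective in eq. (6.2) by setting its gradient with respect to w to zero makes the equivalence explicit, recovering the standard gradient descent step:

    ∇L(w_t) + (1/η)(w_{t+1} − w_t) = 0  ⟹  w_{t+1} = w_t − η ∇L(w_t).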
Motivated by the above connection, we can study other families of
algorithms that work under different and non-Euclidean geometries.
Two convenient families are mirror descent w.r.t. potential ψ [? ? ]
and steepest descent w.r.t. general norms [? ].
Mirror descent w.r.t. potential ψ  Mirror descent updates are defined
for any strongly convex and differentiable potential ψ as

    w_{t+1} = arg min_w η ⟨w, ∇L(w_t)⟩ + D_ψ(w, w_t)
    ⟹ ∇ψ(w_{t+1}) = ∇ψ(w_t) − η ∇L(w_t),    (6.3)
where D_ψ(w, w′) = ψ(w) − ψ(w′) − ⟨∇ψ(w′), w − w′⟩ is the Bregman
divergence [? ] w.r.t. ψ. This family captures updates where the
geometry is specified by the Bregman divergence D_ψ. Examples of
potentials ψ for mirror descent include: the squared ℓ₂ norm
ψ(w) = ½‖w‖₂², which leads to gradient descent; the entropy potential
ψ(w) = ∑_i w[i] log w[i] − w[i]; the spectral entropy for matrix-valued
w, where ψ(w) is the entropy potential on the singular values of w;
general quadratic potentials ψ(w) = ½‖w‖²_D = ½ w⊤Dw for any positive
definite matrix D; and the squared ℓ_p norms for p ∈ (1, 2].
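As a concrete illustration of the dual update in eq. (6.3), the entropy potential gives ∇ψ(w) = log w elementwise, so the primal update is multiplicative (the exponentiated-gradient rule). The following is a minimal sketch; the squared loss, toy data, step size, and function names are assumptions made for the example, not part of the text:

```python
import numpy as np

def mirror_descent_entropy(grad, w0, eta=0.02, steps=2000):
    """Mirror descent (6.3) with the entropy potential
    psi(w) = sum_i w[i] log w[i] - w[i], so that grad psi(w) = log w.
    The dual update log w_{t+1} = log w_t - eta * grad L(w_t)
    becomes a multiplicative step on the primal iterates."""
    w = w0.copy()
    for _ in range(steps):
        w = w * np.exp(-eta * grad(w))  # exponentiated-gradient step
    return w

# Toy problem (assumed for illustration): squared loss on one linear equation.
x, y = np.array([1.0, 2.0, 3.0]), 3.0
grad = lambda w: (w @ x - y) * x        # gradient of 1/2 (w.x - y)^2
w_inf = mirror_descent_entropy(grad, w0=np.ones(3))
print(w_inf, w_inf @ x)                 # iterates stay positive; w.x -> y
```

Note that the iterates remain entrywise positive for any positive w_0, since each step multiplies by a positive factor; the dual iterates log w_t move only along span({x}), in line with the discussion below eq. (6.3).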
From eq. (6.3), we see that rather than the w_t (called primal iterates),
it is the ∇ψ(w_t) (called dual iterates) that are constrained to the
low-dimensional data manifold ∇ψ(w_0) + span({x^{(i)}}). The arguments for
gradient descent can now be generalized to get the following result.
Theorem 6.1.2. For any realizable dataset {x^{(i)}, y^{(i)}}_{i=1}^n and any
strongly convex potential ψ, consider the mirror descent iterates w_t from
eq. (6.3) for minimizing the empirical loss L(w) in eq. (6.1). For all
initializations w_0, if the step-size schedule minimizes L(w), i.e.,
L(w_t) → 0, then the asymptotic solution of the algorithm is given by

    w_t → arg min_{w : ∀i, w⊤x^{(i)} = y^{(i)}} D_ψ(w, w_0).    (6.4)
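For the special case ψ(w) = ½‖w‖₂² (plain gradient descent), eq. (6.4) can be checked numerically: on an underdetermined linear regression problem, the gradient descent limit coincides with the Euclidean projection of w_0 onto the solution set. This is a minimal sketch; the Gaussian data, dimensions, step size, and iteration count are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))   # n=3 equations, d=8 unknowns: realizable
y = rng.standard_normal(3)
w0 = rng.standard_normal(8)

# Gradient descent on L(w) = 1/(2n) * ||X w - y||^2, started from w0.
w = w0.copy()
for _ in range(20000):
    w -= 0.05 * X.T @ (X @ w - y) / X.shape[0]

# Closed form for arg min_{w : X w = y} ||w - w0||_2:
# the Euclidean projection of w0 onto the affine solution set.
w_star = w0 + X.T @ np.linalg.solve(X @ X.T, y - X @ w0)

print(np.abs(X @ w - y).max(), np.abs(w - w_star).max())  # both near 0
```

The agreement reflects the mechanism in the theorem: the iterates never leave w_0 + span({x^{(i)}}), and within that affine set there is a unique interpolating solution, which is exactly the projection w_star.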
In particular, if we start at w_0 = arg min_w ψ(w) (so that ∇ψ(w_0) = 0),
then we get to arg min_{w∈G} ψ(w).¹

¹ The analysis of Theorem 6.1.2 and Proposition 6.1.1 also holds when
instancewise stochastic gradients are used in place of ∇L(w_t).

Steepest descent w.r.t. general norms  Gradient descent is also a special
case of steepest descent (SD) w.r.t. a generic norm ‖·‖ [? ] with