
approximation of the loss while constraining the step length in Euclidean norm:

$$w_{t+1} = \arg\min_{w}\; \langle w, \nabla L(w_t)\rangle + \frac{1}{2\eta}\,\|w - w_t\|_2^2. \qquad (6.2)$$
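This equivalence is easy to verify numerically. Below is a minimal sketch (ours, not from the text; the toy quadratic loss and all names are illustrative) confirming that the argmin in eq. (6.2) is exactly the gradient step $w_t - \eta\,\nabla L(w_t)$:

```python
import numpy as np
from scipy.optimize import minimize

# Check that the proximal problem in eq. (6.2) recovers the usual
# gradient step w_{t+1} = w_t - eta * grad L(w_t).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
grad_L = lambda w: A.T @ (A @ w - b)          # gradient of 0.5*||Aw - b||^2

w_t, eta = rng.normal(size=3), 0.1
obj = lambda w: w @ grad_L(w_t) + np.sum((w - w_t) ** 2) / (2 * eta)
w_next = minimize(obj, w_t).x                 # solve the argmin in eq. (6.2)

print(np.allclose(w_next, w_t - eta * grad_L(w_t), atol=1e-5))  # True
```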

Motivated by the above connection, we can study other families of algorithms that work under different, non-Euclidean geometries. Two convenient families are mirror descent w.r.t. a potential ψ [? ?] and steepest descent w.r.t. general norms [?].

Mirror descent w.r.t. potential ψ Mirror descent updates are defined for any strongly convex and differentiable potential ψ as

$$w_{t+1} = \arg\min_{w}\; \eta\,\langle w, \nabla L(w_t)\rangle + D_\psi(w, w_t) \;\Longrightarrow\; \nabla\psi(w_{t+1}) = \nabla\psi(w_t) - \eta\,\nabla L(w_t), \qquad (6.3)$$

where $D_\psi(w, w') = \psi(w) - \psi(w') - \langle \nabla\psi(w'), w - w'\rangle$ is the Bregman divergence [?] w.r.t. ψ. This family captures updates whose geometry is specified by the Bregman divergence $D_\psi$. Examples of potentials ψ for mirror descent include: the squared $\ell_2$ norm $\psi(w) = \frac{1}{2}\|w\|_2^2$, which leads to gradient descent; the entropy potential $\psi(w) = \sum_i w[i] \log w[i] - w[i]$; the spectral entropy for matrix-valued w, where ψ(w) is the entropy potential on the singular values of w; general quadratic potentials $\psi(w) = \frac{1}{2}\|w\|_D^2 = \frac{1}{2} w^\top D w$ for any positive definite matrix D; and the squared $\ell_p$ norms for $p \in (1, 2]$.
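For instance, under the entropy potential we have $\nabla\psi(w) = \log w$, so the dual update in eq. (6.3) becomes a coordinatewise multiplicative update, classically known as exponentiated gradient. A minimal sketch (ours, not from the text; the function name and interface are assumptions):

```python
import numpy as np

def mirror_descent_entropy(grad_L, w0, eta, steps):
    """Mirror descent (6.3) under the entropy potential
    psi(w) = sum_i w[i]*log(w[i]) - w[i], for which grad psi(w) = log(w).
    The dual update log w_{t+1} = log w_t - eta * grad L(w_t) is the
    multiplicative 'exponentiated gradient' update. Requires w0 > 0
    coordinatewise (the domain of the entropy potential)."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w = w * np.exp(-eta * grad_L(w))  # stays positive automatically
    return w
```

Note that the iterates remain positive by construction, reflecting that the entropy potential is only defined on the positive orthant.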

From eq. (6.3), we see that rather than the $w_t$ (called primal iterates), it is the $\nabla\psi(w_t)$ (called dual iterates) that are constrained to the low-dimensional data manifold $\nabla\psi(w_0) + \operatorname{span}(\{x^{(i)}\})$.
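This claim is easy to check numerically for the entropy potential. The sketch below is ours; it assumes the squared loss $L(w) = \frac{1}{2}\sum_i (w^\top x^{(i)} - y^{(i)})^2$ for eq. (6.1), under which every gradient $X^\top(Xw - y)$ is a combination of the data points, so the dual displacement never leaves their span:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(3, 8)))           # N = 3 examples, d = 8
y = X @ np.abs(rng.normal(size=8))            # realizable with positive w

w0 = np.ones(8)                               # grad psi(w0) = log(1) = 0
w = w0.copy()
for _ in range(2000):
    w = w * np.exp(-0.01 * X.T @ (X @ w - y))  # entropy mirror descent step

dual_disp = np.log(w) - np.log(w0)            # grad psi(w_t) - grad psi(w_0)
coef, *_ = np.linalg.lstsq(X.T, dual_disp, rcond=None)
print(np.linalg.norm(X.T @ coef - dual_disp))  # ~0: in span of the x^(i)
```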

The arguments for gradient descent can now be generalized to get the following result.

Theorem 6.1.2. For any realizable dataset $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$ and any strongly convex potential ψ, consider the mirror descent iterates $w_t$ from eq. (6.3) for minimizing the empirical loss L(w) in eq. (6.1). For all initializations $w_0$, if the step-size schedule minimizes L(w), i.e., $L(w_t) \to 0$, then the asymptotic solution of the algorithm is given by

$$w_t \to \arg\min_{w:\,\forall i,\; w^\top x^{(i)} = y^{(i)}} D_\psi(w, w_0). \qquad (6.4)$$

In particular, if we start at $w_0 = \arg\min_w \psi(w)$ (so that $\nabla\psi(w_0) = 0$, and hence $D_\psi(w, w_0) = \psi(w) - \psi(w_0)$ differs from $\psi(w)$ only by a constant), then we get to $\arg\min_{w \in G} \psi(w)$.¹

¹ The analysis of Theorem 6.1.2 and Proposition 6.1.1 also holds when instancewise stochastic gradients are used in place of $\nabla L(w_t)$.
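As a quick sanity check of the theorem (our sketch, not from the text): for $\psi(w) = \frac{1}{2}\|w\|_2^2$, mirror descent reduces to gradient descent, $w_0 = 0$ is $\arg\min_w \psi(w)$, and eq. (6.4) then predicts the minimum-$\ell_2$-norm interpolator, which has the closed form $X^{+}y$ via the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 10))                  # underdetermined: many interpolators
y = rng.normal(size=4)

w = np.zeros(10)                              # w0 = argmin_w (1/2)||w||_2^2
for _ in range(20000):
    w -= 0.01 * X.T @ (X @ w - y)             # plain gradient descent

# Theorem 6.1.2 predicts the minimum-norm interpolator pinv(X) @ y.
print(np.allclose(w, np.linalg.pinv(X) @ y, atol=1e-6))  # True
```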

Steepest descent w.r.t. general norms Gradient descent is also a special case of steepest descent (SD) w.r.t. a generic norm $\|\cdot\|$ [?], with updates

$$w_{t+1} = w_t + \eta\,\Delta w_t, \quad \text{where } \Delta w_t = \arg\min_{v}\; \langle \nabla L(w_t), v\rangle + \frac{1}{2}\|v\|^2.$$
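For concreteness, this argmin has simple closed forms for common norms: the $\ell_2$ norm recovers the negative gradient, the $\ell_\infty$ norm gives a scaled sign-gradient step, and the $\ell_1$ norm moves only the coordinate with the largest gradient magnitude. A sketch (ours, derived from the definition above; the function name is illustrative):

```python
import numpy as np

def sd_direction(g, norm):
    """Steepest descent step Delta w = argmin_v <g, v> + 0.5*||v||^2
    for a few norms, via the standard closed forms."""
    if norm == "l2":       # recovers plain gradient descent
        return -g
    if norm == "linf":     # sign-gradient step, scaled by ||g||_1
        return -np.sum(np.abs(g)) * np.sign(g)
    if norm == "l1":       # a coordinate step on the largest |g_j|
        d = np.zeros_like(g)
        j = np.argmax(np.abs(g))
        d[j] = -np.sign(g[j]) * np.max(np.abs(g))
        return d
    raise ValueError(f"unknown norm: {norm}")
```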
