
By applying $-\log$,
\[
\min_{n\in[N]} \langle w_{t+1}, x_n\rangle \;\ge\; \sum_{u\le t} \frac{\eta\gamma_u^2}{L(w_u)} \;-\; \sum_{u\le t} \frac{\eta^2 B^2 \gamma_u^2}{2} \;-\; \log L(w_0). \tag{6.13}
\]
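The reason the minimum margin appears on the left of Equation (6.13) is that the worst example dominates the loss; assuming, as elsewhere in this argument, the exponential loss $L(w) = \sum_{n\in[N]} \exp(-\langle w, x_n\rangle)$ with labels absorbed into the $x_n$,
\[
\exp\Big(-\min_{n\in[N]}\langle w_{t+1}, x_n\rangle\Big) \;\le\; \sum_{n\in[N]} \exp(-\langle w_{t+1}, x_n\rangle) \;=\; L(w_{t+1}),
\qquad\text{so}\qquad
\min_{n\in[N]}\langle w_{t+1}, x_n\rangle \;\ge\; -\log L(w_{t+1}).
\]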

Step 2: Upper bound $\|w_{t+1}\|$. Using $\|\Delta w_u\| = \|\nabla L(w_u)\|_\star = \gamma_u$, we have
\[
\|w_{t+1}\| \;\le\; \|w_0\| + \sum_{u\le t} \eta\,\|\Delta w_u\| \;\le\; \|w_0\| + \sum_{u\le t} \eta\gamma_u. \tag{6.14}
\]
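This is the triangle inequality applied to the telescoped iterates; a one-line expansion, assuming updates of the form $w_{u+1} = w_u + \eta\,\Delta w_u$ as above:
\[
w_{t+1} \;=\; w_0 + \sum_{u\le t} \eta\,\Delta w_u
\quad\Longrightarrow\quad
\|w_{t+1}\| \;\le\; \|w_0\| + \sum_{u\le t} \eta\,\|\Delta w_u\|.
\]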

To complete the proof, we simply combine Equations (6.13) and

(6.14) to lower bound the normalized margin.

\[
\min_{n\in[N]} \frac{\langle w_{t+1}, x_n\rangle}{\|w_{t+1}\|}
\;\ge\;
\frac{\sum_{u\le t}\frac{\eta\gamma_u^2}{L(w_u)}}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;-\;
\frac{\sum_{u\le t}\frac{\eta^2 B^2\gamma_u^2}{2} + \log L(w_0)}{\|w_{t+1}\|}
\;=:\; (\mathrm{I}) + (\mathrm{II}). \tag{6.15}
\]

For term (I), from Proposition 6.3.3, we have $\gamma_u = \|\nabla L(w_u)\|_\star \ge \gamma L(w_u)$. Hence the numerator is lower bounded by $\sum_{u\le t}\frac{\eta\gamma_u^2}{L(w_u)} \ge \gamma \sum_{u\le t}\eta\gamma_u$, and we have
\[
\frac{\sum_{u\le t}\frac{\eta\gamma_u^2}{L(w_u)}}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;\ge\;
\frac{\gamma\sum_{u\le t}\eta\gamma_u}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;\longrightarrow\; \gamma, \tag{6.16}
\]
using $\sum_{u\le t}\eta\gamma_u \to \infty$ and $\|w_0\| < \infty$ from Proposition 6.3.4.
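For completeness, the limit in (6.16) is elementary: writing $a_t := \sum_{u\le t}\eta\gamma_u$ (a shorthand introduced here, not in the text),
\[
\frac{\gamma\, a_t}{a_t + \|w_0\|} \;=\; \frac{\gamma}{1 + \|w_0\|/a_t} \;\longrightarrow\; \gamma
\qquad\text{as } t\to\infty,\ \text{since } a_t \to \infty \text{ while } \|w_0\| \text{ stays fixed.}
\]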

For term (II), $\log L(w_0) < \infty$ and $\sum_{u\le t}\frac{\eta^2 B^2\gamma_u^2}{2} < \infty$ using Proposition 6.3.3, so the numerator of (II) stays bounded; since the lower bound on the unnormalized margin in Equation (6.13) grows like $\gamma\sum_{u\le t}\eta\gamma_u \to \infty$, we also have $\|w_{t+1}\| \to \infty$. Thus $(\mathrm{II}) \to 0$.

Using the above in Equation (6.15), we obtain
\[
\lim_{t\to\infty}\; \min_i \frac{w_{t+1}^\top x_i}{\|w_{t+1}\|} \;\ge\; \gamma \;:=\; \max_{\|w\|\le 1}\,\min_i \frac{w^\top x_i}{\|w\|}.
\]
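As a sanity check on this limiting behavior, the following minimal numerical sketch (not part of the text; the toy data, step size, and function names are made up for illustration) runs plain gradient descent, i.e. steepest descent in the Euclidean norm, on the exponential loss for a small separable dataset and tracks the normalized margin $\min_n \langle w_t, x_n\rangle / \|w_t\|$:

import numpy as np

# Toy separable data with labels absorbed into the examples
# (each row plays the role of y_n * x_n); made up purely for illustration.
X = np.array([[2.0, 1.0],
              [1.5, 2.0],
              [1.0, 3.0],
              [3.0, 0.5]])

def loss_and_grad(w):
    """Exponential loss L(w) = sum_n exp(-<w, x_n>) and its gradient."""
    e = np.exp(-(X @ w))
    return e.sum(), -(X * e[:, None]).sum(axis=0)

def normalized_margin(w):
    """min_n <w, x_n> / ||w||, the quantity bounded in the theorem."""
    return np.min(X @ w) / (np.linalg.norm(w) + 1e-12)

w = np.zeros(2)
eta = 0.1
for t in range(200_000):
    L, g = loss_and_grad(w)
    w -= eta * g  # gradient descent step
    if t % 50_000 == 0 or t == 199_999:
        print(f"t={t:6d}  loss={L:.3e}  normalized margin={normalized_margin(w):.4f}")

The printed margin increases toward the maximum Euclidean margin of the dataset, but only logarithmically slowly in $t$, mirroring the $\sum_{u\le t}\eta\gamma_u \to \infty$ rate appearing in the proof.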

6.4 Homogeneous Models with Exponential Tailed Loss

≪Suriya notes: Jason: I think we should give Kaifeng's proof here. It's more general and concurrent work.≫ In this section, we consider the asymptotic behavior of

gradient descent when the prediction function is homogeneous in the

parameters. Consider the loss

\[
L(w) \;=\; \sum_{i=1}^{n} \exp\big(-y_i f_i(w)\big), \tag{6.17}
\]
where $f_i(cw) = c^{\alpha} f_i(w)$ is $\alpha$-homogeneous in the parameters. Typically, $f_i(w)$ is the output of the prediction function, such as a deep network; for instance, the output of an $L$-layer ReLU network without bias terms is $L$-homogeneous in its parameters. Similar
