By applying $-\log$,
\[
\min_{n\in[N]} \langle w_{t+1}, x_n\rangle \;\ge\; \sum_{u\le t}\frac{\eta\gamma_u^2}{L(w_u)} \;-\; \sum_{u\le t}\frac{\eta^2 B^2\gamma_u^2}{2} \;-\; \log L(w_0). \tag{6.13}
\]
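To see where (6.13) comes from without the earlier steps of the argument, one convenient reading (a reconstruction consistent with the statement; the exact per-step inequality is derived before this excerpt) is a per-step multiplicative decrease of the loss,
\[
L(w_{u+1}) \;\le\; L(w_u)\,\exp\!\Big(-\frac{\eta\gamma_u^2}{L(w_u)} + \frac{\eta^2 B^2\gamma_u^2}{2}\Big),
\]
which telescopes to $L(w_{t+1}) \le L(w_0)\exp\big(-\sum_{u\le t}\eta\gamma_u^2/L(w_u) + \sum_{u\le t}\eta^2 B^2\gamma_u^2/2\big)$. Combining this with $\exp(-\min_{n\in[N]}\langle w_{t+1}, x_n\rangle) \le L(w_{t+1})$ and applying $-\log$ gives (6.13).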
Step 2: Upper bound $\|w_{t+1}\|$. Using $\|\Delta w_u\| = \|\nabla L(w_u)\|_\star = \gamma_u$ and the triangle inequality applied to the telescoped updates $w_{t+1} = w_0 + \sum_{u\le t}\eta\,\Delta w_u$, we have
\[
\|w_{t+1}\| \;\le\; \|w_0\| + \sum_{u\le t}\eta\,\|\Delta w_u\| \;\le\; \|w_0\| + \sum_{u\le t}\eta\gamma_u. \tag{6.14}
\]
To complete the proof, we simply combine Equations (6.13) and
(6.14) to lower bound the normalized margin.
\[
\frac{\min_{n\in[N]}\langle w_{t+1}, x_n\rangle}{\|w_{t+1}\|}
\;\ge\;
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;-\;
\frac{\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 + \log L(w_0)}{\|w_{t+1}\|}
\;=:\; (\mathrm{I}) + (\mathrm{II}). \tag{6.15}
\]
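Spelling out how (6.14) enters (an expansion of the step above, not an additional claim): divide (6.13) by $\|w_{t+1}\|$ and note that the first term is nonnegative, so replacing $\|w_{t+1}\|$ in its denominator by the upper bound (6.14) can only decrease it,
\[
\frac{\min_{n\in[N]}\langle w_{t+1}, x_n\rangle}{\|w_{t+1}\|}
\;\ge\;
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\|w_{t+1}\|}
\;-\;
\frac{\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 + \log L(w_0)}{\|w_{t+1}\|}
\;\ge\;
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;-\;
\frac{\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 + \log L(w_0)}{\|w_{t+1}\|}.
\]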
For term (I), from Proposition 6.3.3, we have $\gamma_u = \|\nabla L(w_u)\|_\star \ge \gamma L(w_u)$. Hence the numerator is lower bounded as $\sum_{u\le t}\eta\gamma_u^2/L(w_u) \ge \gamma\sum_{u\le t}\eta\gamma_u$. We have
\[
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;\ge\;
\frac{\gamma\sum_{u\le t}\eta\gamma_u}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;\to\; \gamma, \tag{6.16}
\]
using $\sum_{u\le t}\eta\gamma_u \to \infty$ and $\|w_0\| < \infty$ from Proposition 6.3.4.
For term (II), $\log L(w_0) < \infty$ and $\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 < \infty$ using Proposition 6.3.3. Thus $(\mathrm{II}) \to 0$.
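Implicit in this step is that $\|w_{t+1}\| \to \infty$, since the numerator of (II) stays bounded. A short justification (assuming, as in the smoothness argument, that $B$ bounds $\|x_n\|_\star$): by (6.13) and Proposition 6.3.3,
\[
B\,\|w_{t+1}\| \;\ge\; \min_{n\in[N]}\langle w_{t+1}, x_n\rangle \;\ge\; \gamma\sum_{u\le t}\eta\gamma_u \;-\; \sum_{u\le t}\frac{\eta^2 B^2\gamma_u^2}{2} \;-\; \log L(w_0) \;\to\; \infty,
\]
because $\sum_{u\le t}\eta\gamma_u \to \infty$ while the subtracted terms remain finite.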
Using the above in Equation (6.15), we obtain
\[
\lim_{t\to\infty} \min_i \frac{w_{t+1}^\top x_i}{\|w_{t+1}\|} \;\ge\; \gamma := \max_{\|w\|\le 1}\min_i \frac{w^\top x_i}{\|w\|}.
\]
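Before moving on, here is a small numerical illustration of this result (not from the text): plain gradient descent, i.e., steepest descent for the $\ell_2$ norm, on the exponential loss over a toy separable dataset, with the labels absorbed into the $x_n$ as in this section. The dataset, step size, and iteration counts are illustrative choices; the normalized margin should climb toward the $\ell_2$ max margin $\gamma$, which the script estimates by scanning unit directions.

    import numpy as np

    # Toy separable data with labels absorbed, so separability means <w, x_n> > 0
    # for some w. These points are an arbitrary illustrative choice.
    X = np.array([[1.0, 3.0],
                  [2.0, -1.0],
                  [3.0, 1.0]])

    # Estimate gamma = max_{||w|| <= 1} min_n <w, x_n> by scanning unit
    # directions (adequate in two dimensions for an illustration).
    thetas = np.linspace(0.0, 2.0 * np.pi, 100_000, endpoint=False)
    dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
    gamma_hat = (dirs @ X.T).min(axis=1).max()

    eta = 0.1            # constant step size, small enough to be stable here
    w = np.zeros(2)      # w_0 = 0

    for t in range(1, 200_001):
        losses = np.exp(-X @ w)                       # exp(-<w, x_n>) per example
        grad = -(losses[:, None] * X).sum(axis=0)     # gradient of L(w)
        w = w - eta * grad
        if t in (10, 100, 1_000, 10_000, 100_000, 200_000):
            margin = (X @ w).min() / np.linalg.norm(w)
            print(f"t={t:>7d}  normalized margin={margin:.4f}  gamma_hat={gamma_hat:.4f}")

Convergence of the direction is logarithmically slow, so even after $2\times 10^5$ steps the printed margin is typically close to, but not equal to, the estimated $\gamma$.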
6.4 Homogeneous Models with Exponential Tailed Loss
≪Suriya notes: Jason: I think we should give Kaifeng's proof here. It's more general and concurrent work.≫ In this section, we consider the asymptotic behavior of
gradient descent when the prediction function is homogeneous in the
parameters. Consider the loss
\[
L(w) = \sum_{i=1}^{n} \exp(-y_i f_i(w)), \tag{6.17}
\]
where $f_i(cw) = c^\alpha f_i(w)$, i.e., $f_i$ is $\alpha$-homogeneous. Typically, $f_i(w)$ is the output of the prediction function, such as a deep network, on the $i$-th example. Similar