When $x = \pm\sqrt{\lambda_i}\, v_i$ for some $i \geq 2$ and $\delta = v_1$, we have
$$\delta^\top [\nabla^2 f(x)]\, \delta = v_1^\top \left[ \big\|\sqrt{\lambda_i}\, v_i\big\|_2^2\, I + 2\lambda_i v_i v_i^\top - M \right] v_1 = \lambda_i - \lambda_1 < 0.$$
Here the last step uses the fact that the $v_i$'s are orthonormal vectors and $v_1^\top M v_1 = \lambda_1$. The proof for $x = 0$ is very similar. Combining all the steps above, we have proved the following claim:
Claim 7.4.1 (Properties of critical points). The only critical points of $f(x)$ are of the form $x = \pm\sqrt{\lambda_i}\, v_i$ or $x = 0$. For all critical points except $x = \pm\sqrt{\lambda_1}\, v_1$, $\nabla^2 f(x)$ has a negative eigenvalue.
This claim directly implies that the only second order stationary points are $x = \pm\sqrt{\lambda_1}\, v_1$, so all second order stationary points are also global minima.
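To see the claim concretely, here is a quick numerical sanity check. This is a minimal sketch that assumes the objective studied in this section is $f(x) = \frac{1}{4}\|M - xx^\top\|_F^2$, whose Hessian $\|x\|_2^2 I + 2xx^\top - M$ matches the expression used above; the random matrix and variable names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d))
    M = A @ A.T                                # random symmetric PSD matrix
    lam, V = np.linalg.eigh(M)
    lam, V = lam[::-1], V[:, ::-1]             # eigenpairs sorted so lam[0] is largest

    def hessian(x):
        # Hessian of f(x) = (1/4) ||M - x x^T||_F^2
        return (x @ x) * np.eye(d) + 2.0 * np.outer(x, x) - M

    # At a saddle x = sqrt(lam_2) v_2, delta = v_1 certifies negative curvature:
    # delta^T [Hessian] delta = lam_2 - lam_1 < 0, as computed above.
    x_saddle = np.sqrt(lam[1]) * V[:, 1]
    print(V[:, 0] @ hessian(x_saddle) @ V[:, 0], lam[1] - lam[0])  # these agree

    # At the global minimum x = sqrt(lam_1) v_1 the Hessian is positive semidefinite.
    x_opt = np.sqrt(lam[0]) * V[:, 0]
    print(np.linalg.eigvalsh(hessian(x_opt)).min())  # nonnegative, up to roundoff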
7.4.2 Finding directions of improvement
The approach in Section 7.4.1 is straightforward. However, in more complicated problems it is often infeasible to enumerate all the solutions of $\nabla f(x) = 0$. What we proved in Section 7.4.1 is also not strong enough to show that $f(x)$ is locally optimizable, because we only proved that every exact SOSP is a global minimum, while a locally optimizable function requires every approximate SOSP to be close to a global minimum. We will now give an alternative approach that is often more flexible and robust.
For every point $x$ that is not a global minimum, we define its direction of improvement as follows:
Definition 7.4.2 (Direction of improvement). For an objective function $f$ and a point $x$, we say $\delta$ is a direction of improvement (of $f$ at $x$) if $|\langle \nabla f(x), \delta \rangle| > 0$ or $\delta^\top [\nabla^2 f(x)]\, \delta < 0$. We say $\delta$ is an $(\epsilon, \gamma)$-direction of improvement (of $f$ at $x$) if $|\langle \nabla f(x), \delta \rangle| > \epsilon \|\delta\|_2$ or $\delta^\top [\nabla^2 f(x)]\, \delta < -\gamma \|\delta\|_2^2$.
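To make the definition concrete, the following sketch tests whether a given $\delta$ is an $(\epsilon, \gamma)$-direction of improvement, given the gradient and Hessian of $f$ at $x$; the function name and signature are illustrative, not from the text.

    import numpy as np

    def is_improvement_direction(grad, hess, delta, eps, gamma):
        # The two conditions of Definition 7.4.2: large correlation with the
        # gradient, or sufficiently negative curvature along delta.
        norm = np.linalg.norm(delta)
        large_gradient = abs(grad @ delta) > eps * norm
        negative_curvature = delta @ hess @ delta < -gamma * norm**2
        return large_gradient or negative_curvature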
Intuitively, if $\delta$ is a direction of improvement for $f$ at $x$, then moving along one of $\delta$ or $-\delta$ for a small enough step can decrease the objective function. In fact, if a point $x$ has a direction of improvement, it cannot be a second order stationary point; if a point $x$ has an $(\epsilon, \gamma)$-direction of improvement, then it cannot be an $(\epsilon, \gamma)$-SOSP.
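This intuition follows from a second-order Taylor expansion (a standard argument, spelled out here for completeness):
$$f(x + t\delta) = f(x) + t\,\langle \nabla f(x), \delta \rangle + \frac{t^2}{2}\,\delta^\top [\nabla^2 f(x)]\, \delta + o(t^2).$$
If $|\langle \nabla f(x), \delta \rangle| > 0$, choosing the sign of $t$ so that the first-order term is negative decreases $f$ for small enough $|t|$; if instead $\delta^\top [\nabla^2 f(x)]\, \delta < 0$, the quadratic term is negative and dominates near a point where the gradient term vanishes. Either way, $f(x + t\delta) < f(x)$ for some small step $t$.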
Now we can look at the contrapositive of what we were trying to prove in the definition of locally optimizable functions: if every point $x$ with $f(x) > f(x^*) + \tau$ has an $(\epsilon, \gamma)$-direction of improvement, then every $(\epsilon, \gamma)$-second order stationary point must satisfy $f(x) \le f(x^*) + \tau$. Therefore, our goal in this part is to find a direction of improvement for every point that is not globally optimal.
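Two generic candidates are always worth checking for this purpose: the gradient itself, and the bottom eigenvector of the Hessian. The sketch below is an illustrative helper (not the problem-specific construction used in the analysis) that returns whichever candidate certifies an $(\epsilon, \gamma)$-direction of improvement.

    import numpy as np

    def candidate_direction(grad, hess, eps, gamma):
        g = np.linalg.norm(grad)
        if g > eps:
            # For delta = grad / ||grad||, |<grad, delta>| = ||grad|| > eps * ||delta||_2.
            return grad / g
        w, U = np.linalg.eigh(hess)        # eigenvalues in ascending order
        if w[0] < -gamma:
            # Unit eigenvector u satisfies u^T [hess] u = w[0] < -gamma * ||u||_2^2.
            return U[:, 0]
        return None                        # x is an (eps, gamma)-SOSP

By definition, an $(\epsilon, \gamma)$-SOSP admits neither candidate, which matches the contrapositive argument above.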