
When $x = \pm\sqrt{\lambda_i}\, v_i$ for $i \neq 1$ and $\delta = v_1$, we have
$$\delta^\top [\nabla^2 f(x)]\delta = v_1^\top \left[ \|\sqrt{\lambda_i}\, v_i\|_2^2\, I + 2\lambda_i v_i v_i^\top - M \right] v_1 = \lambda_i - \lambda_1 < 0.$$
Here the last step uses the fact that the $v_i$'s are orthonormal vectors and $v_1^\top M v_1 = \lambda_1$. The proof for $x = 0$ is very similar. Combining all the steps above, we have proved the following claim:

Claim 7.4.1 (Properties of critical points). The only critical points of $f(x)$ are of the form $x = \pm\sqrt{\lambda_i}\, v_i$ or $x = 0$. For all critical points except $x = \pm\sqrt{\lambda_1}\, v_1$, $\nabla^2 f(x)$ has a negative eigenvalue.

This claim directly implies that the only second order stationary points are $x = \pm\sqrt{\lambda_1}\, v_1$, so all second order stationary points are also global minima.
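The claim is easy to verify numerically. Below is a minimal sketch, assuming (consistently with the Hessian expression above) that $f(x) = \frac{1}{4}\|M - xx^\top\|_F^2$ with $M = \sum_i \lambda_i v_i v_i^\top$, so that $\nabla f(x) = \|x\|_2^2\, x - Mx$ and $\nabla^2 f(x) = \|x\|_2^2\, I + 2xx^\top - M$; the test matrix and variable names are ours.

```python
import numpy as np

# Numerical check of Claim 7.4.1 (a sketch; the test matrix is ours).
# Consistent with the Hessian above, we take f(x) = (1/4)||M - x x^T||_F^2,
# so grad f(x) = ||x||^2 x - M x and Hess f(x) = ||x||^2 I + 2 x x^T - M.
rng = np.random.default_rng(0)
d = 5
lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])      # lambda_1 > lambda_2 > ... > lambda_d
V, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthonormal columns v_1, ..., v_d
M = V @ np.diag(lam) @ V.T

def hess(x):
    return np.dot(x, x) * np.eye(d) + 2.0 * np.outer(x, x) - M

for i in range(d):
    x = np.sqrt(lam[i]) * V[:, i]              # candidate critical point sqrt(lambda_i) v_i
    grad = np.dot(x, x) * x - M @ x
    assert np.allclose(grad, 0.0, atol=1e-8)   # gradient vanishes, as claimed
    min_eig = np.linalg.eigvalsh(hess(x)).min()
    # For i >= 2 the smallest eigenvalue is lambda_i - lambda_1 < 0;
    # only x = +/- sqrt(lambda_1) v_1 has a positive semidefinite Hessian.
    print(f"i = {i + 1}: min Hessian eigenvalue = {min_eig:.3f}")
```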

7.4.2 Finding directions of improvement

The approach in Section 7.4.1 is straightforward. However, in more complicated problems it is often infeasible to enumerate all the solutions of $\nabla f(x) = 0$. What we proved in Section 7.4.1 is also not strong enough to show that $f(x)$ is locally optimizable, because we only proved that every exact SOSP is a global minimum, while a locally optimizable function requires every approximate SOSP to be close to a global minimum. We will now give an alternative approach that is often more flexible and robust.

For every point $x$ that is not a global minimum, we define its direction of improvement as below:

Definition 7.4.2 (Direction of improvement). For an objective function $f$ and a point $x$, we say $\delta$ is a direction of improvement (of $f$ at $x$) if $|\langle \nabla f(x), \delta \rangle| > 0$ or $\delta^\top [\nabla^2 f(x)]\delta < 0$. We say $\delta$ is an $(\epsilon, \gamma)$ direction of improvement (of $f$ at $x$) if $|\langle \nabla f(x), \delta \rangle| > \epsilon \|\delta\|_2$ or $\delta^\top [\nabla^2 f(x)]\delta < -\gamma \|\delta\|_2^2$.
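This definition translates directly into a few lines of NumPy; the following is a minimal sketch, where the function name and interface are ours:

```python
import numpy as np

def is_direction_of_improvement(grad, hess, delta, eps=0.0, gamma=0.0):
    """Return True if delta is an (eps, gamma) direction of improvement
    at a point with gradient `grad` and Hessian `hess` (Definition 7.4.2).
    With eps = gamma = 0 this tests the exact (non-approximate) notion."""
    norm = np.linalg.norm(delta)
    first_order = abs(grad @ delta) > eps * norm            # |<grad f(x), delta>| > eps * ||delta||_2
    second_order = delta @ hess @ delta < -gamma * norm**2  # delta^T [Hess f(x)] delta < -gamma * ||delta||_2^2
    return first_order or second_order
```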

Intuitively, if $\delta$ is a direction of improvement for $f$ at $x$, then moving along one of $\delta$ or $-\delta$ for a small enough step can decrease the objective function. In fact, if a point $x$ has a direction of improvement, it cannot be a second order stationary point; if a point $x$ has an $(\epsilon, \gamma)$ direction of improvement, then it cannot be an $(\epsilon, \gamma)$-SOSP.
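To make this intuition precise, one can use the second-order Taylor expansion (a standard step, filled in here for completeness):
$$f(x + \eta\delta) = f(x) + \eta \langle \nabla f(x), \delta \rangle + \frac{\eta^2}{2}\, \delta^\top [\nabla^2 f(x)]\, \delta + O(\eta^3).$$
If $|\langle \nabla f(x), \delta \rangle| > 0$, choosing the sign of $\eta$ makes the first-order term negative, and it dominates for small enough $|\eta|$; if instead $\delta^\top [\nabla^2 f(x)]\delta < 0$, the quadratic term is negative for either sign of $\eta$ and dominates whenever the first-order term vanishes, e.g., at a critical point.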

Now we can look at the contrapositive of what we were trying to prove in the definition of locally optimizable functions: if every point $x$ with $f(x) > f(x^*) + \tau$ has an $(\epsilon, \gamma)$ direction of improvement, then every $(\epsilon, \gamma)$-second order stationary point must satisfy $f(x) \leq f(x^*) + \tau$. Therefore, our goal in this part is to find a direction of improvement for every point that is not globally optimal.
