
Claim 7.1.4. For an objective function f(w): R^d → R and a critical point w (that is, ∇f(w) = 0), we know

• If ∇²f(w) ≻ 0, then w is a local minimum.

• If ∇²f(w) ≺ 0, then w is a local maximum.

• If ∇²f(w) has both a positive and a negative eigenvalue, then w is a saddle point.

These criteria are known as second order sufficient conditions in optimization. Intuitively, one can prove this claim by looking at the second-order Taylor expansion. The three cases in the claim do not cover all possible Hessian matrices. The remaining cases are considered degenerate, and such a critical point can be either a local minimum, a local maximum, or a saddle point².
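
As a concrete illustration, one can check Claim 7.1.4 numerically: the short Python sketch below estimates the Hessian by central finite differences and inspects its eigenvalues. The test function f and the tolerance eps are illustrative choices, not part of the claim.

import numpy as np

def hessian(f, w, h=1e-5):
    """Estimate the Hessian of f at w by central finite differences."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(w + h*e_i + h*e_j) - f(w + h*e_i - h*e_j)
                       - f(w - h*e_i + h*e_j) + f(w - h*e_i - h*e_j)) / (4*h*h)
    return H

def classify_critical_point(f, w, eps=1e-6):
    """Apply the second order sufficient conditions of Claim 7.1.4 at w."""
    eigvals = np.linalg.eigvalsh(hessian(f, w))
    if np.all(eigvals > eps):
        return "local minimum"
    if np.all(eigvals < -eps):
        return "local maximum"
    if np.any(eigvals > eps) and np.any(eigvals < -eps):
        return "saddle point"
    return "degenerate (the test is inconclusive)"

# f(w) = w_0^2 - w_1^2 has a critical point at the origin.
f = lambda w: w[0]**2 - w[1]**2
print(classify_critical_point(f, np.zeros(2)))   # saddle point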

Flat regions Even if a function does not have any spurious local minima or saddle points, it can still be nonconvex; see Figure 7.1. In high dimensions such functions can still be very hard to optimize. The main difficulty here is that even if the norm ‖∇f(w)‖₂ is small, unlike for convex functions one cannot conclude that f(w) is close to f(w*). However, in such cases one can often hope that the function f(w) satisfies some relaxed notion of convexity, and design efficient algorithms accordingly. We discuss one such case in Section 7.2.
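
To make this difficulty concrete, consider the illustrative one-dimensional function f(w) = 1 − exp(−w²) (an example chosen here for illustration, not one from this chapter): far from the global minimizer w* = 0 the gradient is essentially zero, yet f(w) is nowhere near f(w*). A short Python sketch:

import numpy as np

f = lambda w: 1.0 - np.exp(-w**2)           # global minimum f(0) = 0
grad_f = lambda w: 2.0 * w * np.exp(-w**2)  # derivative of f

w = 10.0
print(grad_f(w))      # about 7e-43: the gradient is essentially zero
print(f(w) - f(0.0))  # about 1.0: yet f(w) is far from the optimal value

# Contrast with a convex function such as g(w) = w**2, where a small
# gradient at w does imply that g(w) is close to the minimum value.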

² Consider the point w = 0 for the functions w⁴, −w⁴, and w³; it is a local minimum, a local maximum, and a saddle point, respectively.

Figure 7.1: Obstacles for nonconvex optimization. From left to right: local minimum, saddle point, and flat region.

7.2 Cases with a unique global minimum

We first consider the case that is most similar to convex objectives. In this section, the objective functions we look at have no spurious local minima or saddle points. In fact, in our example the objective has a unique global minimum. The only obstacle in optimizing these functions is that points with small gradients may not be near-optimal.

The main idea here is to identify properties of the objective and also a potential function, such that the potential function keeps decreasing.