7.4 Case study: top eigenvector of a matrix

In this section we look at a simple example of a locally optimizable function. Given a symmetric PSD matrix M ∈ R^{d×d}, our goal is to find its top eigenvector (the eigenvector corresponding to the largest eigenvalue). More precisely, using the SVD we can write M as

M = ∑_{i=1}^{d} λ_i v_i v_i^⊤.

Here the v_i's are orthonormal vectors that are eigenvectors of M, and the λ_i's are the eigenvalues. For simplicity we assume λ_1 > λ_2 ≥ λ_3 ≥ ··· ≥ λ_d ≥ 0. (Note that the only real assumption here is λ_1 > λ_2, so that the top eigenvector is unique; the other inequalities are without loss of generality.)
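To make this setup concrete, here is a minimal NumPy sketch (the dimension d, the spectrum, and the random seed are arbitrary illustrative choices, not from the text) that builds such an M from orthonormal eigenvectors and checks that its top eigenvector matches the v_1 used to construct it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                        # dimension (arbitrary, for illustration)

# Orthonormal eigenvectors v_1, ..., v_d from a QR decomposition.
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.array([5.0, 3.0, 2.0, 1.0, 0.5])    # lambda_1 > lambda_2 >= ... >= lambda_d >= 0

# M = sum_i lambda_i v_i v_i^T  (symmetric PSD by construction).
M = (V * lam) @ V.T

# Sanity check: the top eigenvector recovered by eigh matches v_1 (up to sign).
eigvals, eigvecs = np.linalg.eigh(M)
v1_recovered = eigvecs[:, -1]                # eigh returns eigenvalues in ascending order
assert np.isclose(abs(v1_recovered @ V[:, 0]), 1.0)
```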

There are many objective functions whose global optima give the top eigenvector. For example, using the basic definition of the spectral norm, we know that for a PSD matrix M the global optima of

max_{‖x‖_2 = 1} x^⊤ M x

are exactly the top eigenvector of M (up to sign). However, this formulation requires a constraint. We instead work with an unconstrained version whose correctness follows from the Eckart-Young theorem:

min_{x ∈ R^d} f(x) := (1/4) ‖M − x x^⊤‖_F^2.    (7.6)
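For concreteness, here is a minimal NumPy sketch of both formulations; the matrix M, the dimension, and the random spot-check are arbitrary illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.standard_normal((d, d))
M = A @ A.T                          # some symmetric PSD matrix (illustrative)

def constrained_obj(x):
    """x^T M x, to be maximized subject to ||x||_2 = 1."""
    return x @ M @ x

def f(x):
    """Unconstrained objective (7.6): (1/4) * ||M - x x^T||_F^2."""
    return 0.25 * np.linalg.norm(M - np.outer(x, x), 'fro') ** 2

# Consistent with Eckart-Young: x = sqrt(lambda_1) v_1 does at least as well as any other x.
eigvals, eigvecs = np.linalg.eigh(M)
x_star = np.sqrt(eigvals[-1]) * eigvecs[:, -1]
assert all(f(x_star) <= f(rng.standard_normal(d)) + 1e-12 for _ in range(100))
```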

Note that this function does have a symmetry, in the sense that f(x) = f(−x). Under our assumptions, the only global minima of this function are x = ±√λ_1 v_1. We are going to show that these are also the only second-order stationary points. We will give two proof strategies that are commonly used to prove the locally optimizable property.
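To see the locally optimizable behavior empirically, the sketch below runs plain gradient descent on f from a random initialization; it assumes the gradient formula ∇f(x) = (x x^⊤ − M) x, which follows from a direct calculation on (7.6), and the step size and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
A = rng.standard_normal((d, d))
M = A @ A.T / d                      # symmetric PSD with modest eigenvalues (illustrative)

def grad_f(x):
    # Gradient of (7.6): (x x^T - M) x = ||x||^2 x - M x  (direct calculation).
    return (x @ x) * x - M @ x

x = rng.standard_normal(d)           # random initialization
eta = 0.05                           # step size (arbitrary)
for _ in range(5_000):
    x = x - eta * grad_f(x)

# Compare with the two symmetric global minima +/- sqrt(lambda_1) v_1.
eigvals, eigvecs = np.linalg.eigh(M)
x_star = np.sqrt(eigvals[-1]) * eigvecs[:, -1]
print(min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star)))  # should be ~0
```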

7.4.1 Characterizing all critical points

The first idea is simple: we will just try to solve the equation ∇f(x) = 0 to get the positions of all critical points; then, for the critical points that are not the desired global minima, try to prove that they are local maxima or saddle points.
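As a purely numerical analogue of this strategy (an illustration, not the text's proof), one can run a root finder on ∇f from many random starts and classify each critical point found by the eigenvalues of the Hessian. The expressions ∇f(x) = (x x^⊤ − M) x and ∇²f(x) = ‖x‖² I + 2 x x^⊤ − M used below follow from a direct calculation on (7.6):

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
d = 4
A = rng.standard_normal((d, d))
M = A @ A.T / d                        # symmetric PSD (illustrative)

def grad_f(x):
    return (x @ x) * x - M @ x         # (x x^T - M) x

def hess_f(x):
    return (x @ x) * np.eye(d) + 2.0 * np.outer(x, x) - M

# Collect distinct critical points from many random starts.
found = []
for _ in range(200):
    sol = root(grad_f, rng.standard_normal(d), jac=hess_f)
    if sol.success and not any(np.allclose(sol.x, y, atol=1e-6) for y in found):
        found.append(sol.x)

# Classify each critical point by the sign pattern of its Hessian eigenvalues.
for x in found:
    eigs = np.linalg.eigvalsh(hess_f(x))
    kind = "local min" if eigs.min() > -1e-8 else "saddle / local max"
    print(np.round(x, 3), kind)
```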

Computing gradient and Hessian. Before we solve the equation ∇f(x) = 0 for the objective function f(x) defined in Equation (7.6), we first give a simple way of computing the gradient and Hessian. We will first expand f(x + δ) (where δ should be thought of as a small perturbation).
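Although the expansion itself is not reproduced here, the idea can be sanity-checked numerically: the first-order term of f(x + δ) should match ⟨∇f(x), δ⟩. The sketch below assumes the gradient formula ∇f(x) = (x x^⊤ − M) x that such an expansion yields, and compares it against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
A = rng.standard_normal((d, d))
M = A @ A.T                                  # symmetric PSD (illustrative)

def f(x):
    return 0.25 * np.linalg.norm(M - np.outer(x, x), 'fro') ** 2

def grad_f(x):
    return (x @ x) * x - M @ x               # candidate gradient (x x^T - M) x

# First-order check: f(x + delta) - f(x) should match <grad_f(x), delta> for small delta.
x = rng.standard_normal(d)
delta = 1e-6 * rng.standard_normal(d)
lhs = f(x + delta) - f(x)
rhs = grad_f(x) @ delta
print(lhs, rhs)                              # the two values should agree to leading order
```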
