7.4 Case study: top eigenvector of a matrix
In this section we look at a simple example of a locally optimizable
function. Given a symmetric PSD matrix M ∈ ℝ^{d×d}, our goal is to
find its top eigenvector (the eigenvector corresponding to the largest
eigenvalue). More precisely, using SVD we can write M as
M = \sum_{i=1}^{d} \lambda_i v_i v_i^\top .
Here the v_i's are orthonormal vectors that are eigenvectors of M, and the λ_i's
are the eigenvalues. For simplicity we assume λ_1 > λ_2 ≥ λ_3 ≥ ⋯ ≥ λ_d ≥ 0.
(Note that the only real assumption here is λ_1 > λ_2, so that the top eigenvector
is unique; the other inequalities are without loss of generality.)
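To make the setup concrete, here is a small numpy sketch (not from the text; the dimension, eigenvalues, and random seed are arbitrary choices) that builds such a matrix M from an orthonormal basis and recovers its top eigenvector with a standard eigensolver.

```python
import numpy as np

# Build a symmetric PSD matrix M = sum_i lambda_i v_i v_i^T with lambda_1 > lambda_2.
rng = np.random.default_rng(0)
d = 5
lambdas = np.array([4.0, 2.0, 1.0, 0.5, 0.1])     # lambda_1 > lambda_2 >= ... >= lambda_d >= 0
V, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns v_1, ..., v_d are orthonormal
M = (V * lambdas) @ V.T                           # equals V @ diag(lambdas) @ V.T

# Recover the decomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(M)
top_vec = eigvecs[:, -1]

# The recovered top eigenvector matches v_1 up to sign.
print(np.allclose(abs(top_vec @ V[:, 0]), 1.0))
```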
There are many objective functions whose global optima give the
top eigenvector. For example, using the basic definition of the spectral norm,
we know that for a PSD matrix M the global optimum of

\max_{\|x\|_2 = 1} x^\top M x

is the top eigenvector of M. However, this formulation requires a
constraint. We instead work with an unconstrained version whose
correctness follows from the Eckart-Young theorem:
\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{4} \| M - x x^\top \|_F^2 . \qquad (7.6)
Note that this function does have a symmetry in the sense that
f(x) = f(−x). Under our assumptions, the only global minima of
this function are x = ±√λ_1 v_1. We are going to show that these are
also the only second-order stationary points. We will give two proof
strategies that are commonly used to prove the locally optimizable
property.
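Continuing the sketch above (again just an illustration; the perturbation size 1e-3 is an arbitrary choice), one can evaluate the objective of Equation (7.6) directly and check numerically that x = ±√λ_1 v_1 give the same value and that small random perturbations around this point do not decrease it.

```python
def f(x, M):
    """Objective from Equation (7.6): f(x) = (1/4) * ||M - x x^T||_F^2."""
    R = M - np.outer(x, x)
    return 0.25 * np.sum(R * R)

x_star = np.sqrt(lambdas[0]) * V[:, 0]   # candidate global minimum sqrt(lambda_1) * v_1

# Symmetry f(x) = f(-x): both signs give the same objective value.
print(np.isclose(f(x_star, M), f(-x_star, M)))

# Small random perturbations around x_star should not decrease the objective.
vals = [f(x_star + 1e-3 * rng.standard_normal(d), M) for _ in range(1000)]
print(min(vals) >= f(x_star, M))
```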
7.4.1 Characterizing all critical points
The first idea is simple: we will just try to solve the equation
∇f(x) = 0 to get the positions of all critical points; then, for the
critical points that are not the desired global minima, we try to prove
that they are local maxima or saddle points.
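As a numerical preview of this strategy (a sketch only, reusing f, M, V, and lambdas from the snippets above, and anticipating the characterization worked out below, namely that the critical points are x = 0 and x = ±√λ_i v_i), a finite-difference gradient and Hessian suffice to check that the gradient vanishes at each of these candidates and that the Hessian has a strictly negative eigenvalue at every candidate except x = ±√λ_1 v_1, so the other candidates are saddle points or local maxima.

```python
def grad_f(x, M, eps=1e-5):
    """Finite-difference gradient of f (no closed-form formula assumed yet)."""
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e, M) - f(x - e, M)) / (2 * eps)
    return g

def hess_f(x, M, eps=1e-4):
    """Finite-difference Hessian of f, symmetrized to suppress numerical noise."""
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        H[:, j] = (grad_f(x + e, M) - grad_f(x - e, M)) / (2 * eps)
    return 0.5 * (H + H.T)

# Candidate critical points: x = 0 and x = sqrt(lambda_i) * v_i for each i.
candidates = [np.zeros(d)] + [np.sqrt(lambdas[i]) * V[:, i] for i in range(d)]
for x in candidates:
    grad_norm = np.linalg.norm(grad_f(x, M))
    min_hess_eig = np.linalg.eigvalsh(hess_f(x, M)).min()
    print(grad_norm < 1e-4, min_hess_eig)
# Expected: the gradient is ~0 at every candidate, but the smallest Hessian
# eigenvalue is positive only at x = sqrt(lambda_1) * v_1.
```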
Computing gradient and Hessian. Before we solve the equation
∇f(x) = 0 for the objective function f(x) defined in Equation (7.6),
we first give a simple way of computing the gradient and Hessian.
We will first expand f(x + δ) (where δ should be thought of as a small