Theory of Deep Learning, 2022
“gradient” for generalized linear model:

∇g(w) = E_{x,y}[(σ(w⊤x) − y)x] = E_x[(σ(w⊤x) − σ(w∗⊤x))x].    (7.5)
The first equation gives a way to estimate this “gradient”. The main
difference here is that on the RHS we no longer have a factor σ′(w⊤x)
as in ∇L(w). Of course, it is not obvious why this formula should be the
gradient of some function g, but we can construct such a function g in the
following way:
Let τ(z) be the integral of σ: τ(z) = ∫₀^z σ(z′)dz′. Define
g(w) := E_x[τ(w⊤x) − σ(w∗⊤x)w⊤x]. One can check that ∇g(w) is indeed
the expression in Equation (7.5). What is very surprising is that g(w) is
actually a convex function with ∇g(w ∗ ) = 0! This means that w ∗ is
a global minimum of g and we only need to follow ∇g(w) to find it.
Nonconvex optimization is unnecessary here.
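As a sketch of this check (assuming we may exchange gradient and expectation, and that σ is nondecreasing): since τ′ = σ,

```latex
\begin{align*}
\nabla g(w) &= \mathbb{E}_x\!\left[\sigma(w^\top x)\,x - \sigma(w_*^\top x)\,x\right]
             = \mathbb{E}_x\!\left[\big(\sigma(w^\top x) - \sigma(w_*^\top x)\big)\,x\right],\\
\nabla^2 g(w) &= \mathbb{E}_x\!\left[\sigma'(w^\top x)\,x x^\top\right] \succeq 0.
\end{align*}
```

The first line recovers Equation (7.5) and vanishes at w = w∗; the Hessian is positive semidefinite whenever σ′ ≥ 0, which gives the convexity of g.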
Of course, this technique is quite special and uses a lot of structure
in the generalized linear model. However, similar ideas were also used
in [5] to learn a single neuron. In general, when an objective is hard to
analyze, it may be easier to look for an alternative objective that has
the same global minimum but is easier to optimize.
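The claim that ∇g(w∗) = 0 can also be sanity-checked numerically from samples; a minimal sketch, assuming the logistic σ and Gaussian inputs (both choices are illustrative, not fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 100_000
w_star = rng.normal(size=d)          # ground-truth parameter w*
X = rng.normal(size=(n, d))          # samples x ~ N(0, I)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))  # logistic activation

def grad_g(w):
    # Sample average of (sigma(w^T x) - sigma(w*^T x)) x, as in Eq. (7.5)
    r = sigma(X @ w) - sigma(X @ w_star)
    return (X * r[:, None]).mean(axis=0)

# At w = w* the residual r is identically zero, so the estimate vanishes:
print(np.linalg.norm(grad_g(w_star)))        # 0.0 by construction
# Away from w* the estimated gradient is nonzero:
print(np.linalg.norm(grad_g(np.zeros(d))))
```

Following this estimated gradient with plain gradient descent would then converge to w∗, since g is convex.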
7.3 Symmetry, saddle points and locally optimizable functions
In the previous section, we saw some conditions that allow nonconvex
objectives to be optimized efficiently. However, such conditions
often do not apply to neural networks, or more generally any function
that has some symmetry properties.
More concretely, consider a two-layer neural network h_θ(x) : ℝ^d →
ℝ. The parameters are θ = (w_1, w_2, . . . , w_k), where w_i ∈ ℝ^d represents the
weight vector of the i-th neuron. The function is evaluated as
h_θ(x) = ∑_{i=1}^k σ(⟨w_i, x⟩), where σ is a nonlinear activation function.
Given a dataset (x^(1), y^(1)), . . . , (x^(n), y^(n)) i.i.d. ∼ D, one can define the
training loss and expected loss as in Chapter 1. Now the objective
for this neural network, f(θ) = L(h_θ) = E_{(x,y)∼D}[ℓ((x, y), h_θ)],
has permutation symmetry. That is, for any permutation π(θ) that
permutes the weights of the neurons, we have f(θ) = f(π(θ)).
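This symmetry is easy to see numerically: permuting the rows of the weight matrix leaves the network output, and hence any loss built on it, unchanged. A minimal sketch (the choice σ = tanh and all names here are illustrative, not from the text):

```python
import numpy as np

def h(theta, x):
    """Two-layer network h_theta(x) = sum_i sigma(<w_i, x>), with sigma = tanh."""
    return np.sum(np.tanh(theta @ x))  # theta: (k, d) matrix of neuron weights

rng = np.random.default_rng(0)
k, d = 4, 3
theta = rng.normal(size=(k, d))
x = rng.normal(size=d)

perm = rng.permutation(k)      # a permutation pi of the neuron indices
theta_perm = theta[perm]       # pi(theta): same neurons, reordered

# The output is unchanged, so f(theta) = f(pi(theta)) for any loss f:
assert np.allclose(h(theta, x), h(theta_perm, x))
```

The sum over neurons is what makes the order irrelevant; the same argument applies to any loss computed from h_θ(x).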
The symmetry has many implications. First, if the global minimum
θ∗ is a point where not all neurons have the same weight vector
(which is very likely to be true), then there must be an equivalent global
minimum π(θ∗) for every permutation π. An objective with this
symmetry must also be nonconvex: if it were convex, the
point ¯θ = (1/k!) ∑_π π(θ∗) (where the sum ranges over all permutations) would be
a convex combination of global minima, so it would also be a global
minimum. However, for ¯θ the weight vectors of the neurons are all