
“gradient” for generalized linear model:

$$\nabla g(w) = \mathbb{E}_{x,y}\left[\left(\sigma(w^\top x) - y\right)x\right] = \mathbb{E}_{x}\left[\left(\sigma(w^\top x) - \sigma(w_*^\top x)\right)x\right]. \tag{7.5}$$

The first equation gives a way to estimate this “gradient”. The main difference here is that in the RHS we no longer have a factor $\sigma'(w^\top x)$ as in $\nabla L(w)$. Of course, it is unclear why this formula is the gradient of some function $g$, but we can construct the function $g$ in the following way:

Let $\tau(x)$ be the integral of $\sigma(x)$: $\tau(x) = \int_0^x \sigma(x')\,dx'$. Define

$$g(w) := \mathbb{E}_x\left[\tau(w^\top x) - \sigma(w_*^\top x)\,w^\top x\right].$$

One can check that $\nabla g(w)$ is indeed the function in Equation (7.5). What is very surprising is that $g(w)$ is actually a convex function with $\nabla g(w_*) = 0$! This means that $w_*$ is a global minimum of $g$ and we only need to follow $\nabla g(w)$ to find it.
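To see why, differentiate under the expectation: since $\tau' = \sigma$,

$$\nabla_w\,\tau(w^\top x) = \sigma(w^\top x)\,x, \qquad \nabla_w\left[\sigma(w_*^\top x)\,w^\top x\right] = \sigma(w_*^\top x)\,x,$$

so $\nabla g(w) = \mathbb{E}_x\left[(\sigma(w^\top x) - \sigma(w_*^\top x))x\right]$, which matches Equation (7.5) because $\mathbb{E}[y \mid x] = \sigma(w_*^\top x)$. Convexity follows from the Hessian $\nabla^2 g(w) = \mathbb{E}_x\left[\sigma'(w^\top x)\,x x^\top\right] \succeq 0$, which holds whenever $\sigma$ is nondecreasing.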

Nonconvex optimization is unnecessary here.

Of course, this technique is quite special and uses a lot of structure in the generalized linear model. However, similar ideas were also used in [5] to learn a single neuron. In general, when one objective is hard to analyze, it might be easier to look for an alternative objective that has the same global minimum but is easier to optimize.
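As a minimal illustration of this idea, here is a sketch of stochastic gradient descent that follows the estimate of $\nabla g(w)$ from Equation (7.5) for a sigmoid GLM. The data-generating setup and all names are hypothetical, chosen only to make the example self-contained:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, n = 5, 5000
w_star = rng.normal(size=d)        # unknown ground-truth weights (hypothetical setup)
X = rng.normal(size=(n, d))
y = sigmoid(X @ w_star)            # E[y | x] = sigmoid(w_star^T x); the conditional mean is
                                   # used as the label for simplicity (Bernoulli labels
                                   # would give the same expected gradient)

# Follow the "gradient" of g from Equation (7.5): the per-sample estimate is
# (sigmoid(w^T x) - y) x -- note there is no sigma'(w^T x) factor as in grad L(w).
w = np.zeros(d)
lr = 0.5
for t in range(20000):
    i = rng.integers(n)
    w -= lr * (sigmoid(X[i] @ w) - y[i]) * X[i]

print("distance to w_star:", np.linalg.norm(w - w_star))
```

Because $g$ is convex with its global minimum at $w_*$, plain SGD on this surrogate converges without any nonconvex machinery.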

7.3 Symmetry, saddle points and locally optimizable functions

In the previous section, we saw some conditions that allow nonconvex

objectives to be optimized efficiently. However, such conditions

often do not apply to neural networks, or more generally any function

that has some symmetry properties.

More concretely, consider a two-layer neural network $h_\theta(x): \mathbb{R}^d \to \mathbb{R}$. The parameters are $\theta = (w_1, w_2, \ldots, w_k)$, where $w_i \in \mathbb{R}^d$ represents the weight vector of the $i$-th neuron. The function can be evaluated as $h_\theta(x) = \sum_{i=1}^k \sigma(\langle w_i, x\rangle)$, where $\sigma$ is a nonlinear activation function.

Given a dataset $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \overset{\text{i.i.d.}}{\sim} \mathcal{D}$, one can define the training loss and expected loss as in Chapter 1. Now the objective for this neural network, $f(\theta) = L(h_\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell((x,y), h_\theta)\right]$, has permutation symmetry. That is, for any permutation $\pi(\theta)$ that permutes the weights of the neurons, we know $f(\theta) = f(\pi(\theta))$.
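As a quick numerical check of this symmetry, here is a sketch with tanh standing in for the activation $\sigma$; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 3
W = rng.normal(size=(k, d))    # row i is the weight vector w_i of the i-th neuron
x = rng.normal(size=d)

def h(W, x):
    # h_theta(x) = sum_{i=1}^k sigma(<w_i, x>), here with sigma = tanh
    return np.tanh(W @ x).sum()

perm = rng.permutation(k)      # pi: reorder the neurons
print(h(W, x), h(W[perm], x))  # identical outputs, so f(theta) = f(pi(theta))
```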

The symmetry has many implications. First, if the global minimum $\theta_*$ is a point where not all neurons have the same weight vector (which is very likely to be true), then $\pi(\theta_*)$ must be an equivalent global minimum for every permutation $\pi$. An objective with this symmetry must also be nonconvex, because if it were convex, the point $\bar\theta = \frac{1}{k!}\sum_\pi \pi(\theta_*)$ (where the sum ranges over all permutations $\pi$) would be a convex combination of global minima, so it must also be a global minimum. However, for $\bar\theta$ the weight vectors of the neurons are all the same, so $h_{\bar\theta}$ can only compute functions of the form $k\,\sigma(\langle \bar w, x\rangle)$, which in general cannot achieve the loss of $\theta_*$, a contradiction.
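To see the collapse explicitly: each $w_j^*$ appears in position $i$ in exactly $(k-1)!$ of the $k!$ permutations, so the $i$-th weight vector of $\bar\theta$ is

$$\bar{w}_i = \frac{1}{k!}\sum_{\pi} w^*_{\pi(i)} = \frac{(k-1)!}{k!}\sum_{j=1}^{k} w^*_j = \frac{1}{k}\sum_{j=1}^{k} w^*_j \quad \text{for every } i,$$

independent of $i$.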
