
Of course, in practice one can only access the training loss, which is an average over the observed $\{x^{(i)}, y^{(i)}\}$ pairs. For simplicity we work with the expected loss here. The difference between the two losses can be bounded using techniques in Chapter ??.

A generalized linear model can be viewed as learning a single neuron whose nonlinearity is $\sigma$.
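To make the setting concrete, here is a minimal Python sketch that samples data from this model, $y = \sigma(w_*^\top x) + \epsilon$ with $\mathbb{E}[\epsilon \mid x] = 0$ (Equation (7.3)). The sigmoid nonlinearity, Gaussian inputs, noise level, and all variable names are illustrative assumptions, not choices made in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # A monotone nonlinearity; an illustrative choice for sigma.
    return 1.0 / (1.0 + np.exp(-z))

d = 5                                  # input dimension (arbitrary choice)
w_star = rng.normal(size=d)            # ground-truth parameter w*
noise_std = 0.1                        # std of the zero-mean label noise (assumed)

def sample(n):
    """Draw n samples from y = sigma(w*^T x) + eps, with E[eps | x] = 0."""
    X = rng.normal(size=(n, d))        # x ~ N(0, I), an assumption for illustration
    eps = noise_std * rng.normal(size=n)
    y = sigmoid(X @ w_star) + eps
    return X, y
```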

We now give high-level ideas on how to prove properties such as weak quasi-convexity or RSI for the generalized linear model. First we rewrite the objective as:

\begin{align*}
L(w) &= \frac{1}{2} \mathbb{E}_{(x,y)\sim D}\left[ \left( y - \sigma(w^\top x) \right)^2 \right] \\
     &= \frac{1}{2} \mathbb{E}\left[ \left( \epsilon + \sigma(w_*^\top x) - \sigma(w^\top x) \right)^2 \right] \\
     &= \frac{1}{2} \mathbb{E}_{\epsilon}\left[ \epsilon^2 \right] + \frac{1}{2} \mathbb{E}_{x}\left[ \left( \sigma(w_*^\top x) - \sigma(w^\top x) \right)^2 \right].
\end{align*}

Here the second equality uses the definition of the model (Equation (7.3)), and the third equality uses the fact that $\mathbb{E}[\epsilon \mid x] = 0$ (so there are no cross terms). This decomposition is helpful as the first term $\frac{1}{2}\mathbb{E}_{\epsilon}[\epsilon^2]$ is now just a constant.
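Continuing the sketch above, one can sanity-check this decomposition numerically: a Monte Carlo estimate of $L(w)$ should match the sum of the (constant) noise term and the mismatch term, up to sampling error.

```python
def loss(w, X, y):
    """Monte Carlo estimate of L(w) = (1/2) E[(y - sigma(w^T x))^2]."""
    return 0.5 * np.mean((y - sigmoid(X @ w)) ** 2)

X, y = sample(200_000)
w = rng.normal(size=d)                     # an arbitrary candidate w

noise_term = 0.5 * noise_std ** 2          # (1/2) E[eps^2], known here by construction
mismatch = 0.5 * np.mean((sigmoid(X @ w_star) - sigmoid(X @ w)) ** 2)

print(loss(w, X, y), noise_term + mismatch)  # should agree up to sampling error
```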

Now we can take the derivative of the objective:

\[
\nabla L(w) = \mathbb{E}\left[ \left( \sigma(w^\top x) - \sigma(w_*^\top x) \right) \sigma'(w^\top x)\, x \right].
\]
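Continuing the sketch, this gradient formula can be checked against finite differences of the noiseless part of the loss, whose gradient coincides with $\nabla L(w)$ since the noise term is constant in $w$.

```python
def grad_L(w, X):
    """Monte Carlo estimate of E[(sigma(w^T x) - sigma(w*^T x)) sigma'(w^T x) x]."""
    s = sigmoid(X @ w)
    s_star = sigmoid(X @ w_star)
    # For the sigmoid, sigma'(z) = sigma(z) (1 - sigma(z)).
    return np.mean(((s - s_star) * s * (1 - s))[:, None] * X, axis=0)

def expected_loss(w, X):
    # The w-dependent part of L(w): (1/2) E[(sigma(w*^T x) - sigma(w^T x))^2].
    return 0.5 * np.mean((sigmoid(X @ w_star) - sigmoid(X @ w)) ** 2)

h = 1e-5
fd = np.array([(expected_loss(w + h * e, X) - expected_loss(w - h * e, X)) / (2 * h)
               for e in np.eye(d)])
print(np.max(np.abs(fd - grad_L(w, X))))   # should be tiny
```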

Notice that both weak quasi-convexity and RSI require the gradient of the objective to be correlated with $w - w_*$, so we compute

\[
\langle \nabla L(w), w - w_* \rangle = \mathbb{E}\left[ \left( \sigma(w^\top x) - \sigma(w_*^\top x) \right) \sigma'(w^\top x) \left( w^\top x - w_*^\top x \right) \right].
\]

The goal here is to show that the RHS is bigger than 0. A simple way to see this is to use the mean value theorem: $\sigma(w^\top x) - \sigma(w_*^\top x) = \sigma'(\xi)(w^\top x - w_*^\top x)$, where $\xi$ is a value between $w^\top x$ and $w_*^\top x$. Then we have

\[
\langle \nabla L(w), w - w_* \rangle = \mathbb{E}_{x}\left[ \sigma'(\xi)\, \sigma'(w^\top x) \left( w^\top x - w_*^\top x \right)^2 \right].
\]

In the expectation on the RHS, both derivatives $\sigma'(\xi)$ and $\sigma'(w^\top x)$ are positive as $\sigma$ is monotone, and $(w^\top x - w_*^\top x)^2$ is clearly nonnegative.
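Continuing the sketch, this positivity is easy to observe numerically: since $\sigma$ is monotone, each sample's contribution $(\sigma(w^\top x) - \sigma(w_*^\top x))\,\sigma'(w^\top x)\,(w^\top x - w_*^\top x)$ is nonnegative, so the Monte Carlo estimate of the inner product is positive whenever $w \neq w_*$.

```python
def correlation(w, X):
    """Monte Carlo estimate of <grad L(w), w - w*>."""
    return grad_L(w, X) @ (w - w_star)

for _ in range(5):
    w_rand = rng.normal(size=d, scale=3.0)  # random points, including far from w*
    print(correlation(w_rand, X) > 0)       # positive for every draw
```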

By making more assumptions on $\sigma$ and the distribution of $x$, it is possible to lower-bound the RHS in the form required by either weak quasi-convexity or RSI. We leave this as an exercise.
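As a numerical hint toward that exercise (not a proof), one can estimate the RSI ratio $\langle \nabla L(w), w - w_* \rangle / \|w - w_*\|^2$ at random points near $w_*$ in the sketch above; RSI would require this ratio to be at least some $\mu > 0$ on the region of interest.

```python
# Empirical RSI ratio <grad L(w), w - w*> / ||w - w*||^2 at random points.
ratios = []
for _ in range(100):
    w_rand = w_star + rng.normal(size=d)    # perturbations around w* (assumed region)
    diff = w_rand - w_star
    ratios.append(correlation(w_rand, X) / (diff @ diff))
print(min(ratios))  # an empirical value for mu on this region, not a proven bound
```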

7.2.2 Alternative objective for generalized linear model

There is another way to find $w_*$ for the generalized linear model that is more specific to this setting. In this method, one estimates a different
