Theory of Deep Learning, 2022
tractable landscapes for nonconvex optimization
Of course, in practice one can only access the training loss, which is
an average over the observed $\{x^{(i)}, y^{(i)}\}$ pairs. For simplicity we work
with the expected loss here. The difference between the two losses
can be bounded using techniques in Chapter ??.
A generalized linear model can be viewed as learning a single
neuron whose nonlinearity is $\sigma$.
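As a concrete illustration (not from the text), here is a minimal sketch of sampling data from this model. The logistic choice of $\sigma$, the Gaussian input distribution, and the names `w_star`, `X`, `eps` are our assumptions for the example; the section only requires $\sigma$ to be monotone and $\mathbb{E}[\epsilon \mid x] = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    # logistic nonlinearity (an assumed choice; any monotone sigma fits this section)
    return 1.0 / (1.0 + np.exp(-z))

d, n = 5, 10_000
w_star = rng.normal(size=d)          # ground-truth parameter w_*
X = rng.normal(size=(n, d))          # inputs x ~ N(0, I), an assumed distribution
eps = 0.1 * rng.normal(size=n)       # zero-mean noise, so E[eps | x] = 0
y = sigma(X @ w_star) + eps          # the model: y = sigma(w_*^T x) + eps
```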
We will give high-level ideas on how to prove properties such as
weak quasi-convexity or RSI for the generalized linear model. First we
rewrite the objective as:
\[
\begin{aligned}
L(w) &= \frac{1}{2}\,\mathop{\mathbb{E}}_{(x,y)\sim D}\left[(y - \sigma(w^\top x))^2\right] \\
&= \frac{1}{2}\,\mathbb{E}\left[(\epsilon + \sigma(w_*^\top x) - \sigma(w^\top x))^2\right] \\
&= \frac{1}{2}\,\mathop{\mathbb{E}}_{(x,\epsilon)}\left[\epsilon^2\right] + \frac{1}{2}\,\mathop{\mathbb{E}}_{x}\left[(\sigma(w_*^\top x) - \sigma(w^\top x))^2\right].
\end{aligned}
\]
Here the second equality uses the definition of the model (Equation
(7.3)), and the third equality uses the fact that $\mathbb{E}[\epsilon \mid x] = 0$ (so there
are no cross terms). This decomposition is helpful as the first term
$\frac{1}{2}\mathbb{E}_\epsilon[\epsilon^2]$ is now just a constant.
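The decomposition can be checked numerically by Monte Carlo. The sketch below (our construction, again assuming a logistic $\sigma$ and Gaussian inputs) compares $L(w)$ computed directly against the sum of the noise term and the approximation term; the cross term vanishes in expectation because $\mathbb{E}[\epsilon \mid x] = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))  # assumed logistic nonlinearity

d, n = 5, 200_000
w_star, w = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(n, d))
eps = 0.1 * rng.normal(size=n)
y = sigma(X @ w_star) + eps

# L(w) computed directly from (x, y) samples
loss = 0.5 * np.mean((y - sigma(X @ w)) ** 2)

# The two terms of the decomposition: constant noise part + approximation part
decomposed = (0.5 * np.mean(eps ** 2)
              + 0.5 * np.mean((sigma(X @ w_star) - sigma(X @ w)) ** 2))

# The empirical cross term 2 * E[eps * (sigma(w_*^T x) - sigma(w^T x))] / 2
# has zero mean, so loss and decomposed agree up to Monte Carlo error.
```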
Now we can take the derivative of the objective:
\[
\nabla L(w) = \mathop{\mathbb{E}}_{x}\left[(\sigma(w^\top x) - \sigma(w_*^\top x))\,\sigma'(w^\top x)\,x\right].
\]
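One can sanity-check this gradient formula against finite differences of the loss on a fixed sample. The sketch below is our own verification under the same assumed logistic $\sigma$ (whose derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$), using a noiseless sample so that the empirical gradient matches the displayed formula exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # sigma'(z) for the logistic

d, n = 4, 1_000
w_star, w = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(n, d))
y = sigma(X @ w_star)            # noiseless labels, for a clean comparison

def loss(v):
    return 0.5 * np.mean((y - sigma(X @ v)) ** 2)

# Analytic gradient: sample average of (sigma(w^T x) - sigma(w_*^T x)) sigma'(w^T x) x
grad = np.mean(((sigma(X @ w) - y) * dsigma(X @ w))[:, None] * X, axis=0)

# Central finite-difference gradient, coordinate by coordinate
h = 1e-6
fd = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h)
               for e in np.eye(d)])
```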
Notice that both weak quasi-convexity and RSI require the gradient
to be correlated with $w - w_*$, so we compute
\[
\langle \nabla L(w), w - w_* \rangle = \mathop{\mathbb{E}}_{x}\left[(\sigma(w^\top x) - \sigma(w_*^\top x))\,\sigma'(w^\top x)\,(w^\top x - w_*^\top x)\right].
\]
The goal here is to show that the RHS is positive. A simple
way to see this is to use the mean value theorem: $\sigma(w^\top x) -
\sigma(w_*^\top x) = \sigma'(\xi)(w^\top x - w_*^\top x)$, where $\xi$ is a value between $w^\top x$ and
$w_*^\top x$. Then we have
\[
\langle \nabla L(w), w - w_* \rangle = \mathop{\mathbb{E}}_{x}\left[\sigma'(\xi)\,\sigma'(w^\top x)\,(w^\top x - w_*^\top x)^2\right].
\]
In the expectation on the RHS, both derivatives $\sigma'(\xi)$ and $\sigma'(w^\top x)$ are
positive as $\sigma$ is monotonically increasing, and $(w^\top x - w_*^\top x)^2$ is clearly nonnegative.
By making more assumptions on $\sigma$ and the distribution of $x$, it is
possible to lower bound the RHS in the form required by either
weak quasi-convexity or RSI. We leave this as an exercise.
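The positivity of the correlation can also be observed empirically. The sketch below (our construction, with the same assumed logistic $\sigma$ and Gaussian inputs) evaluates the sample average of $(\sigma(w^\top x) - \sigma(w_*^\top x))\,\sigma'(w^\top x)\,(w^\top x - w_*^\top x)$ at several random points $w$; by the mean value theorem each summand is nonnegative, so the average is positive whenever $w \neq w_*$.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # assumed monotone nonlinearity
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

d, n = 4, 100_000
X = rng.normal(size=(n, d))                  # x ~ N(0, I), an assumed distribution
w_star = rng.normal(size=d)

def correlation(w):
    # Sample average of
    # (sigma(w^T x) - sigma(w_*^T x)) * sigma'(w^T x) * (w^T x - w_*^T x),
    # the empirical version of <grad L(w), w - w_*>
    return float(np.mean((sigma(X @ w) - sigma(X @ w_star))
                         * dsigma(X @ w) * (X @ (w - w_star))))

# Every summand is sigma'(xi) * sigma'(w^T x) * (w^T x - w_*^T x)^2 >= 0,
# so the correlation is positive at any random w != w_*.
corrs = [correlation(rng.normal(size=d)) for _ in range(5)]
```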
7.2.2 Alternative objective for generalized linear model
There is another way to find $w_*$ for the generalized linear model that is
more specific to this setting. In this method, one estimates a different