decreasing as we run simple optimization algorithms such as gradient descent. Many such properties have been used in the previous literature, including the following.
Definition 7.2.1. Let $f(w)$ be an objective function with a unique global minimum $w^*$, then

Polyak-Lojasiewicz $f$ satisfies Polyak-Lojasiewicz if there exists a value $\mu > 0$ such that for every $w$, $\|\nabla f(w)\|_2^2 \geq \mu(f(w) - f(w^*))$.

weakly-quasi-convex $f$ is weakly-quasi-convex if there exists a value $\tau > 0$ such that for every $w$, $\langle \nabla f(w), w - w^* \rangle \geq \tau(f(w) - f(w^*))$.

Restricted Secant Inequality (RSI) $f$ satisfies RSI if there exists a value $\mu > 0$ such that for every $w$, $\langle \nabla f(w), w - w^* \rangle \geq \mu \|w - w^*\|_2^2$.
Together with some smoothness of $f$, any one of these three properties implies fast convergence.
Claim 7.2.2. If an objective function $f$ satisfies one of Polyak-Lojasiewicz, weakly-quasi-convex, or RSI, and $f$ is smooth, then gradient descent converges to the global minimum at a geometric rate. Here Polyak-Lojasiewicz and RSI require the standard smoothness definition as in Equation (2.1), while weakly-quasi-convex requires a special smoothness property detailed in [? ]. In the proofs, the potential function for Polyak-Lojasiewicz and weakly-quasi-convex is the function value $f$; the potential function for RSI is the squared distance $\|w - w^*\|_2^2$.
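To see where the geometric rate comes from in the Polyak-Lojasiewicz case (a standard one-step sketch, assuming the $L$-smoothness of Equation (2.1) and step size $1/L$), one step of gradient descent satisfies
$$f(w_{t+1}) \leq f(w_t) - \frac{1}{2L}\|\nabla f(w_t)\|_2^2 \leq f(w_t) - \frac{\mu}{2L}\big(f(w_t) - f(w^*)\big),$$
and subtracting $f(w^*)$ from both sides and unrolling over $t$ steps gives
$$f(w_t) - f(w^*) \leq \Big(1 - \frac{\mu}{2L}\Big)^t \big(f(w_0) - f(w^*)\big).$$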
Intuitively, the Polyak-Lojasiewicz condition requires the gradient to be nonzero at any point that is not a global minimum; therefore one can always follow the gradient and further decrease the function value. This condition can also work in some settings where the global minimum is not unique. Weakly-quasi-convex and RSI are similar in the sense that they both require the (negative) gradient to be correlated with the correct direction: the direction from the current point $w$ to the global minimum $w^*$.
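For concreteness, here is a minimal numerical sanity check (an illustrative sketch, not from the text; the least-squares instance and all constants are our own choices): for $f(w) = \frac{1}{2}\|Aw - b\|_2^2$ with full-column-rank $A$ and $b = Aw^*$, all three conditions hold with constants governed by the smallest singular value of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares objective f(w) = 1/2 ||A w - b||^2 with unique minimizer w*.
# (Hypothetical instance; A is full column rank almost surely.)
d = 5
A = rng.standard_normal((20, d))
w_star = rng.standard_normal(d)
b = A @ w_star                     # noiseless, so f(w*) = 0

f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)

sigma_min_sq = np.linalg.svd(A, compute_uv=False).min() ** 2

for _ in range(1000):
    w = rng.standard_normal(d)
    gap = f(w) - f(w_star)         # f(w) - f(w*)
    g, diff = grad(w), w - w_star
    # Polyak-Lojasiewicz with mu = 2 sigma_min(A)^2
    assert g @ g >= 2 * sigma_min_sq * gap - 1e-8
    # weakly-quasi-convex with tau = 2 (holds with equality here)
    assert g @ diff >= 2 * gap - 1e-8
    # RSI with mu = sigma_min(A)^2
    assert g @ diff >= sigma_min_sq * (diff @ diff) - 1e-8
```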
In this section we use the generalized linear model as an example to show how some of these properties can be applied.
7.2.1 Generalized linear model
In the generalized linear model (also known as isotonic regression) [? ? ], the input consists of samples $\{x^{(i)}, y^{(i)}\}$ drawn from a distribution $\mathcal{D}$, where $(x, y) \sim \mathcal{D}$ satisfies
$$y = \sigma(w_*^\top x) + \epsilon. \qquad (7.3)$$
Here $\sigma: \mathbb{R} \to \mathbb{R}$ is a known monotone function, $\epsilon$ is a noise term satisfying $\mathbb{E}[\epsilon \mid x] = 0$, and $w_*$ is the unknown parameter that we are trying to learn.
In this case, it is natural to consider the following expected loss
$$L(w) = \frac{1}{2} \mathop{\mathbb{E}}_{(x,y) \sim \mathcal{D}}\left[\big(y - \sigma(w^\top x)\big)^2\right]. \qquad (7.4)$$
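As a quick illustration (a minimal sketch under assumptions not in the text: $\sigma$ is taken to be the logistic sigmoid, and the dimensions, noise level, and step size are arbitrary choices), one can sample data from the model (7.3) and run gradient descent on the empirical version of the loss (7.4):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))       # known monotone link (assumed sigmoid)
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # its derivative

# Draw samples {x^(i), y^(i)} from y = sigma(w_*^T x) + eps with E[eps | x] = 0.
n, d = 2000, 5
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = sigma(X @ w_star) + 0.1 * rng.standard_normal(n)

def loss(w):
    # Empirical version of L(w) = 1/2 E[(y - sigma(w^T x))^2]
    return 0.5 * np.mean((y - sigma(X @ w)) ** 2)

def grad(w):
    z = X @ w
    return -np.mean(((y - sigma(z)) * dsigma(z))[:, None] * X, axis=0)

w = np.zeros(d)
for t in range(500):
    w -= 2.0 * grad(w)   # constant step size, chosen for illustration

print(loss(w), np.linalg.norm(w - w_star))
```

In this well-conditioned regime the empirical loss decreases to roughly the noise floor and $w$ approaches $w_*$, even though $L(w)$ is not convex in general.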