
…decreasing as we run simple optimization algorithms such as gradient descent. Many properties were used in previous literature, including:

Definition 7.2.1. Let f(w) be an objective function with a unique global minimum w∗, then

Polyak-Lojasiewicz: f satisfies the Polyak-Lojasiewicz condition if there exists a value µ > 0 such that for every w, ‖∇f(w)‖₂² ≥ µ(f(w) − f(w∗)).

Weakly-quasi-convex: f is weakly-quasi-convex if there exists a value τ > 0 such that for every w, ⟨∇f(w), w − w∗⟩ ≥ τ(f(w) − f(w∗)).

Restricted Secant Inequality (RSI): f satisfies RSI if there exists a value τ > 0 such that for every w, ⟨∇f(w), w − w∗⟩ ≥ τ‖w − w∗‖₂².
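As a concrete check (an illustration added here, not part of the text), the least-squares objective f(w) = ½‖Aw − b‖₂² with A of full column rank satisfies all three conditions, with constants governed by the smallest eigenvalue of A⊤A; the sketch below verifies the three inequalities numerically at random points.

```python
import numpy as np

# Illustration (not from the text): for f(w) = 1/2 * ||A w - b||_2^2 with A of
# full column rank, Polyak-Lojasiewicz holds with mu = 2*lambda_min(A^T A),
# weak quasi-convexity with tau = 2, and RSI with tau = lambda_min(A^T A).
rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

M = A.T @ A
lam_min = np.linalg.eigvalsh(M)[0]            # smallest eigenvalue of A^T A
w_star = np.linalg.solve(M, A.T @ b)          # the unique global minimum

def f(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

def grad(w):
    return A.T @ (A @ w - b)

for _ in range(1000):
    w = w_star + rng.standard_normal(n)       # random point around w_star
    gap = f(w) - f(w_star)
    g = grad(w)
    v = w - w_star
    assert g @ g >= 2 * lam_min * gap - 1e-8      # Polyak-Lojasiewicz
    assert g @ v >= 2 * gap - 1e-8                # weakly-quasi-convex
    assert g @ v >= lam_min * (v @ v) - 1e-8      # RSI
print("All three conditions hold at the sampled points.")
```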

Any one of these three properties, together with some smoothness of f, implies fast convergence.

Claim 7.2.2. If an objective function f satisfies one of Polyak-Lojasiewicz, weakly-quasi-convex, or RSI, and f is smooth³, then gradient descent converges to the global minimum at a geometric rate⁴.
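For the Polyak-Lojasiewicz case, the claim can be made concrete with a standard one-step calculation (a sketch added here; it assumes L-smoothness as in Equation (2.1) and step size 1/L):

```latex
\begin{align*}
% one gradient step w_{t+1} = w_t - (1/L)\nabla f(w_t); descent lemma from L-smoothness
f(w_{t+1}) &\le f(w_t) - \tfrac{1}{2L}\,\|\nabla f(w_t)\|_2^2 \\
% apply the Polyak-Lojasiewicz inequality \|\nabla f(w_t)\|_2^2 \ge \mu\,(f(w_t)-f(w^*))
           &\le f(w_t) - \tfrac{\mu}{2L}\,\big(f(w_t) - f(w^*)\big),
\end{align*}
```

so the suboptimality gap f(w_t) − f(w∗) shrinks by the factor 1 − µ/(2L) at every iteration, which is the geometric rate in the claim.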

Intuitively, the Polyak-Lojasiewicz condition requires the gradient to be nonzero at any point that is not a global minimum, so one can always follow the gradient and further decrease the function value. This condition can also work in some settings where the global minimum is not unique. Weakly-quasi-convex and RSI are similar in the sense that they both require the (negative) gradient to be correlated with the correct direction, namely the direction from the current point w to the global minimum w∗.

In this section we use the generalized linear model as an example to show how some of these properties can be used.

³ Polyak-Lojasiewicz and RSI require the standard smoothness definition as in Equation (2.1); weakly-quasi-convex requires a special smoothness property detailed in [? ].

⁴ The potential function for Polyak-Lojasiewicz and weakly-quasi-convex is the function value f; the potential function for RSI is the squared distance ‖w − w∗‖₂².
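The calculation behind footnote 4 for the RSI case can be sketched in a few lines of algebra (added here for illustration; it assumes L-smoothness, so that ‖∇f(w)‖₂ ≤ L‖w − w∗‖₂, and step size η = τ/L²):

```latex
\begin{align*}
\|w_{t+1} - w^*\|_2^2
  &= \|w_t - \eta\,\nabla f(w_t) - w^*\|_2^2 \\
  &= \|w_t - w^*\|_2^2 - 2\eta\,\langle \nabla f(w_t),\, w_t - w^* \rangle
     + \eta^2\,\|\nabla f(w_t)\|_2^2 \\
% RSI lower-bounds the inner product by \tau\|w_t - w^*\|_2^2;
% smoothness upper-bounds \|\nabla f(w_t)\|_2 by L\|w_t - w^*\|_2
  &\le \big(1 - 2\eta\tau + \eta^2 L^2\big)\,\|w_t - w^*\|_2^2,
\end{align*}
```

so with η = τ/L² the squared distance to w∗ contracts by the factor 1 − τ²/L² at every step.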

7.2.1 Generalized linear model

In the generalized linear model (also known as isotonic regression) [? ? ], the input consists of samples {(x⁽ⁱ⁾, y⁽ⁱ⁾)} drawn from a distribution D, where (x, y) ∼ D satisfies

y = σ(w∗⊤x) + ε. (7.3)

Here σ : ℝ → ℝ is a known monotone function, ε is noise that satisfies E[ε | x] = 0, and w∗ is the unknown parameter that we are trying to learn.

In this case, it is natural to consider the following expected loss

L(w) = ½ E_{(x,y)∼D}[(y − σ(w⊤x))²]. (7.4)
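As an illustration (not from the text), the sketch below fits this model by running plain gradient descent on the empirical counterpart of the loss in (7.4), with the expectation replaced by a sample average; the choice σ = sigmoid, the noise level, and the dimensions are all assumptions made only for this example.

```python
import numpy as np

# Illustrative sketch: gradient descent on the empirical version of (7.4),
# with sigma taken to be the sigmoid function (an assumption for this example).
rng = np.random.default_rng(1)

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))          # assumed monotone link function

d, n = 5, 2000
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = sigma(X @ w_star) + 0.1 * rng.standard_normal(n)   # y = sigma(w_*^T x) + eps

def loss(w):
    return 0.5 * np.mean((y - sigma(X @ w)) ** 2)

def grad(w):
    p = sigma(X @ w)
    # d/dw of 1/(2n) * sum (y_i - sigma(x_i^T w))^2; sigmoid derivative is p*(1-p)
    return -np.mean(((y - p) * p * (1 - p))[:, None] * X, axis=0)

w = np.zeros(d)
eta = 2.0
for t in range(5000):
    w = w - eta * grad(w)

print("final empirical loss:", loss(w))
print("distance to w_*:", np.linalg.norm(w - w_star))
```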
