Basics of generalization theory

Theorem 4.1.1. If the hypothesis class contains $N$ hypotheses and $h$ is the hypothesis returned by ERM on an i.i.d. training sample $S$ of size $m$, then with probability at least $1 - \delta$,
\[
\Delta_S(h) \;\le\; 2\sqrt{\frac{\log N + \log \frac{1}{\delta}}{m}}\,.
\]

Proof. For any fixed hypothesis $g$, imagine drawing a training sample of size $m$. Then $\widehat{L}_S(g)$ is an average of i.i.d. random variables, and its expectation is $L_{\mathcal{D}}(g)$. Concentration bounds imply that $L_{\mathcal{D}}(g) - \widehat{L}_S(g)$ has a concentration property at least as strong as a univariate Gaussian $\mathcal{N}(0, 1/m)$. The previous statement is true for every hypothesis $g$ in the class, so the union bound implies that the probability that this quantity exceeds $\epsilon$ for some hypothesis in the class is at most $N \exp(-\epsilon^2 m/4)$. Since $h$, the ERM solution, is one of these hypotheses, we conclude that whenever $N \exp(-\epsilon^2 m/4) \le \delta$, with probability at least $1 - \delta$ we have $\Delta_S(h) \le \epsilon$. Simplifying and eliminating $\epsilon$, we obtain the theorem.
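To spell out the last step: setting the failure probability $N\exp(-\epsilon^2 m/4)$ equal to $\delta$ and solving for $\epsilon$ gives
\[
N e^{-\epsilon^2 m/4} = \delta
\;\Longleftrightarrow\;
\frac{\epsilon^2 m}{4} = \log N + \log\frac{1}{\delta}
\;\Longleftrightarrow\;
\epsilon = 2\sqrt{\frac{\log N + \log\frac{1}{\delta}}{m}},
\]
which is exactly the bound claimed in the theorem.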

Of course, the union bound does not apply to deep nets per se, because the set of hypotheses (even after we have fixed the architecture) consists of all vectors in $\mathbb{R}^k$, where $k$ is the number of real-valued parameters. This is an uncountable set! However, we show it is possible to reason about the set of all nets as a finite set after suitable discretization. Suppose we assume that the $\ell_2$ norm of the parameter vectors is at most $1$, meaning the set of all deep nets has been identified with $\mathrm{Ball}(0,1)$. (Here $\mathrm{Ball}(w,r)$ refers to the set of all points in $\mathbb{R}^k$ within distance $r$ of $w$.) We assume there is a $\rho > 0$ such that if $w_1, w_2 \in \mathbb{R}^k$ satisfy $\|w_1 - w_2\|_2 \le \rho$, then the nets with these two parameter vectors have essentially the same loss on every input, meaning the losses differ by at most $\gamma$ for some $\gamma > 0$.\footnotemark[2] (It makes intuitive sense that such a $\rho$ must exist for every $\gamma > 0$, since as we let $\rho \to 0$ the two nets become equal.)
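As a rough numerical illustration of this assumption (not part of the book's argument), the sketch below builds a tiny two-layer ReLU net in NumPy, perturbs its parameter vector by at most $\rho$ in $\ell_2$ norm, and records the largest resulting change in per-example loss, which plays the role of $\gamma$. The names (\texttt{net\_loss}, \texttt{perturb}) and the architecture are hypothetical choices made only for this sketch.

\begin{verbatim}
import numpy as np

# Illustrative sketch: how much can the loss of a tiny two-layer net
# change when its parameter vector moves by at most rho in l2 norm?
# The observed maximum change plays the role of gamma.

rng = np.random.default_rng(0)
d, hdim, n = 10, 16, 200            # input dim, hidden width, sample size
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

k = d * hdim + hdim                  # number of real-valued parameters
w0 = rng.normal(size=k)
w0 /= np.linalg.norm(w0)             # parameters restricted to Ball(0, 1)

def net_loss(w):
    """Per-example squared loss of a two-layer ReLU net with parameters w."""
    W1 = w[:d * hdim].reshape(d, hdim)
    w2 = w[d * hdim:]
    preds = np.maximum(X @ W1, 0.0) @ w2
    return (preds - y) ** 2

def perturb(w, rho):
    """Return w moved in a random direction by distance at most rho."""
    u = rng.normal(size=w.shape)
    u *= rho * rng.uniform() / np.linalg.norm(u)
    return w + u

rho = 1e-3
gamma_hat = max(np.max(np.abs(net_loss(perturb(w0, rho)) - net_loss(w0)))
                for _ in range(100))
print(f"rho = {rho}: max observed per-example loss change ~ {gamma_hat:.2e}")
\end{verbatim}

For small $\rho$ the observed change shrinks roughly in proportion to $\rho$, in line with the Lipschitz phrasing of the assumption in the footnote.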

Definition 4.1.2 ($\rho$-cover). A set of points $w_1, w_2, \ldots \in \mathbb{R}^k$ is a $\rho$-cover in $\mathbb{R}^k$ if for every $w \in \mathrm{Ball}(0,1)$ there is some $w_i$ such that $w \in \mathrm{Ball}(w_i, \rho)$.

\footnotetext[2]{Another way to phrase this assumption, in a somewhat stronger form, is that the loss on any datapoint is a Lipschitz function of the parameter vector, with Lipschitz constant at most $\gamma/\rho$.}

Lemma 4.1.3 (Existence of $\rho$-cover). There exists a $\rho$-cover of size at most $(2/\rho)^k$.

Proof. The proof is simple but ingenious. Let us pick $w_1$ arbitrarily in $\mathrm{Ball}(0,1)$. For $i = 1, 2, 3, \ldots$ do the following: arbitrarily pick any point in $\mathrm{Ball}(0,1)$ outside $\cup_{j \le i} \mathrm{Ball}(w_j, \rho)$ and designate it as $w_{i+1}$.

A priori it is unclear whether this process will ever terminate. We now show it does, after at most $(2/\rho)^k$ steps. To see this, it suffices to note that $\mathrm{Ball}(w_i, \rho/2) \cap \mathrm{Ball}(w_j, \rho/2) = \emptyset$ for all $i < j$. (Because if not, then $\|w_i - w_j\|_2 \le \rho$, so $w_j \in \mathrm{Ball}(w_i, \rho)$, which means that $w_j$ could not have been picked during the above process.) Thus we conclude that the process must have stopped after at most
\[
\frac{\mathrm{volume}(\mathrm{Ball}(0,1))}{\mathrm{volume}(\mathrm{Ball}(0,\rho/2))} = \left(\frac{2}{\rho}\right)^{k}
\]
steps. When the process stops, every point of $\mathrm{Ball}(0,1)$ lies within distance $\rho$ of some $w_i$, so the picked points form a $\rho$-cover.
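The greedy process in this proof can be mimicked numerically. The sketch below (with hypothetical helper names \texttt{sample\_ball} and \texttt{greedy\_cover}; it is not the book's construction, only an illustration of it) approximates "pick any point outside the union of balls" by rejection sampling in low dimension and compares the number of centers found with the $(2/\rho)^k$ bound of Lemma 4.1.3.

\begin{verbatim}
import numpy as np

# Rough sketch of the greedy construction from the proof of Lemma 4.1.3,
# in low dimension. "Pick any point outside the union of balls" is
# approximated by rejection sampling: random candidates are kept only if
# they lie more than rho away from every center picked so far.

rng = np.random.default_rng(0)

def sample_ball(k):
    """Sample a point uniformly from Ball(0, 1) in R^k."""
    x = rng.normal(size=k)
    return x / np.linalg.norm(x) * rng.uniform() ** (1.0 / k)

def greedy_cover(k, rho, attempts=20000):
    """Greedily collect centers that are pairwise more than rho apart."""
    centers = []
    for _ in range(attempts):
        w = sample_ball(k)
        # Keep w only if it lies outside every Ball(c, rho) picked so far.
        if all(np.linalg.norm(w - c) > rho for c in centers):
            centers.append(w)
    return np.array(centers)

k, rho = 2, 0.25
centers = greedy_cover(k, rho)
print(f"k={k}, rho={rho}: found {len(centers)} centers "
      f"(Lemma 4.1.3 bound: {(2 / rho) ** k:.0f})")
\end{verbatim}

With many attempts the sampling stops adding points only because random candidates keep landing inside existing balls, so the final configuration is (approximately) one where every point of $\mathrm{Ball}(0,1)$ is within $\rho$ of some center, i.e. an approximate $\rho$-cover.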
