
iterations, which is at most (2/ρ)^k since ball volume in R^k scales as the kth power of the radius. Finally, the sequence of w_i's at the end must be a ρ-cover, because the process stops only when no point can be found outside ∪_j Ball(w_j, ρ).
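
This stopping rule doubles as a greedy algorithm for building a ρ-cover: repeatedly add any point that is farther than ρ from every center chosen so far. The sketch below is our own illustration (the function greedy_rho_cover and the random-unit-vector setup are invented here, not from the text); it builds such a cover for a finite set of unit vectors in R^k and compares its size to the (2/ρ)^k volume bound.

import numpy as np

def greedy_rho_cover(points, rho):
    # Greedy process from the proof: keep any point that lies outside the
    # union of balls Ball(w_j, rho) around the centers chosen so far.
    centers = []
    for x in points:
        if all(np.linalg.norm(x - w) > rho for w in centers):
            centers.append(x)
    return centers

# Toy run on random unit vectors in R^k (each hypothesis is a unit vector).
rng = np.random.default_rng(0)
k, n, rho = 5, 2000, 0.5
pts = rng.normal(size=(n, k))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
cover = greedy_rho_cover(pts, rho)
print(len(cover), "centers; volume bound (2/rho)^k =", (2 / rho) ** k)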

Theorem 4.1.4 (Generalization bound for normed spaces).³ If (i) hypotheses are unit vectors in R^k and (ii) every two hypotheses h_1, h_2 with ‖h_1 − h_2‖_2 ≤ ρ differ in terms of loss on every datapoint by at most γ, then

∆_S(h) ≤ γ + 2 √( k log(2/ρ) / m ).

Proof. Apply the union bound over the nets in the ρ-cover, whose size is at most (2/ρ)^k; this gives the 2√(k log(2/ρ)/m) term, just as in the bound for a finite hypothesis class. Every other net has loss at most γ higher on every datapoint than the nearest net in the ρ-cover, which contributes the extra γ.

³ As you might imagine, this generalization bound via γ-cover is too loose, and gives very pessimistic estimates of what m needs to be.
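
To make the footnote concrete: requiring the square-root term to be at most ε means m ≥ 4 k log(2/ρ)/ε². The short sketch below is our own back-of-the-envelope illustration (the function names and the plugged-in values of k, ρ, and ε are hypothetical, not from the text).

import math

def cover_bound(m, k, rho, gamma):
    # Right-hand side of Theorem 4.1.4: gamma + 2*sqrt(k*log(2/rho)/m).
    return gamma + 2 * math.sqrt(k * math.log(2 / rho) / m)

def samples_needed(k, rho, eps):
    # Smallest m making the square-root term at most eps: m >= 4*k*log(2/rho)/eps**2.
    return math.ceil(4 * k * math.log(2 / rho) / eps ** 2)

# A net with a million parameters, rho = 0.1, eps = 0.1: roughly 1.2 billion samples.
print(samples_needed(k=10**6, rho=0.1, eps=0.1))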

4.2 Data dependent complexity measures

A complexity measure for hypothesis classes is a way to quantify their "complicatedness." It is defined to let us prove an upper bound on the number of training samples needed to keep down the generalization error. Above we implicitly defined two complexity measures: the size of the hypothesis class (assuming it is finite) and the size of a γ-cover in it. Of course, the resulting bounds on sample complexity were still loose.

Theorists then realized that the above simple bounds hold for every data distribution D. In practice, it seems clear that deep nets, or any learning method, work by being able to exploit properties of the input distribution (e.g., convolutional structure exploits the fact that all subpatches of images can be processed very similarly). Thus one should try to prove some measure of complicatedness that depends on the data distribution.

4.2.1 Rademacher Complexity

Rademacher complexity is a complexity measure that depends on the data distribution. For simplicity we will assume the loss function takes values in [0, 1].

The definition concerns the following thought experiment. Recall that the distribution D is on labeled datapoints (x, y). For simplicity we denote the labeled datapoint as z.

Now the Rademacher complexity⁴ of hypothesis class H on a distribution D is defined as follows, where l(z, h) is the loss of hypothesis h on labeled datapoint z, and S_1, S_2 denote two samples of m datapoints each drawn i.i.d. from D:

R_{m,D}(H) = E_{S_1,S_2} [ (1/2m) sup_{h∈H} | ∑_{z∈S_1} l(z, h) − ∑_{z∈S_2} l(z, h) | ].    (4.2)

⁴ Standard accounts of this often confuse students, or falsely impress them with a complicated proof of Thm 4.2.1. In the standard definition, loss terms are weighted by iid ±1 random variables. Its value is within ±O(1/√m) of the one in our definition.
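
For a small finite hypothesis class, the quantity in (4.2) can be estimated directly by Monte Carlo. The sketch below is our own illustration (the function rademacher_complexity and the toy threshold-classifier setup are invented here, and an empirical dataset stands in for D), not code from the text: it repeatedly draws pairs of samples S_1, S_2, takes the sup over the hypotheses, and averages.

import numpy as np

def rademacher_complexity(hypotheses, loss, data, m, trials=200, rng=None):
    # Monte-Carlo estimate of (4.2): E_{S1,S2}[ (1/2m) sup_h | sum_{z in S1} l(z,h)
    # - sum_{z in S2} l(z,h) | ], with the sup over a finite list of hypotheses
    # and S1, S2 resampled (with replacement) from the given data.
    if rng is None:
        rng = np.random.default_rng(0)
    vals = []
    for _ in range(trials):
        S1 = [data[i] for i in rng.integers(len(data), size=m)]
        S2 = [data[i] for i in rng.integers(len(data), size=m)]
        gap = max(abs(sum(loss(z, h) for z in S1) - sum(loss(z, h) for z in S2))
                  for h in hypotheses)
        vals.append(gap / (2 * m))
    return float(np.mean(vals))

# Toy setup: 1-d threshold classifiers under 0-1 loss on labeled points z = (x, y).
hypotheses = np.linspace(-1, 1, 21)               # thresholds t; predict 1 iff x > t
loss = lambda z, t: float((z[0] > t) != z[1])     # 0-1 loss, values in [0, 1]
rng = np.random.default_rng(1)
data = [(x, int(x > 0)) for x in rng.uniform(-1, 1, 1000)]
print(rademacher_complexity(hypotheses, loss, data, m=100, rng=rng))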
