iterations, which is at most $(2/\rho)^k$ since ball volume in $\mathbb{R}^k$ scales as the $k$th power of the radius.
Finally, the sequence of $w_i$'s at the end must be a $\rho$-cover because the process stops only when no point can be found outside $\cup_j \text{Ball}(w_j, \rho)$.
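The greedy construction is short enough to run. The following Python sketch (the helper name `greedy_rho_cover` and the sample sizes are our own, for illustration) builds a cover of a finite set of unit vectors exactly as in the proof:

```python
import numpy as np

def greedy_rho_cover(points, rho):
    # Greedy process from the proof: repeatedly pick any point lying
    # outside the union of Ball(w_j, rho) over the centers chosen so far,
    # and make it a new center.
    centers = []
    for x in points:
        if not any(np.linalg.norm(x - w) <= rho for w in centers):
            centers.append(x)
    return centers

# Unit vectors in R^k, as in Theorem 4.1.4; the volume argument caps the
# number of centers at (2/rho)^k.
rng = np.random.default_rng(0)
k, rho = 3, 0.5
pts = rng.normal(size=(2000, k))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print(len(greedy_rho_cover(pts, rho)), "centers; volume bound:", (2 / rho) ** k)
```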
Theorem 4.1.4 (Generalization bound for normed spaces).³ If (i) hypotheses are unit vectors in $\mathbb{R}^k$ and (ii) every two hypotheses $h_1, h_2$ with $\|h_1 - h_2\|_2 \le \rho$ differ in terms of loss on every datapoint by at most $\gamma$, then
$$\Delta_S(h) \le \gamma + 2\sqrt{\frac{k \log(2/\rho)}{m}}.$$
Proof. Apply the union bound over the at most $(2/\rho)^k$ hypotheses in the $\rho$-cover: by concentration, each of them has $\Delta_S(h) \le 2\sqrt{k \log(2/\rho)/m}$ with high probability. Every other net has loss within $\gamma$ of some net in the $\rho$-cover, by assumption (ii).
³ As you might imagine, this generalization bound via γ-covers is too loose, and gives very pessimistic estimates of what $m$ needs to be.
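To see just how pessimistic, one can plug illustrative numbers into the bound. The following Python sketch (the parameter values are made up for illustration, not from the text) solves for the $m$ that makes the concentration term small:

```python
import math

def bound(k, rho, gamma, m):
    # Right-hand side of Theorem 4.1.4.
    return gamma + 2 * math.sqrt(k * math.log(2 / rho) / m)

def samples_needed(k, rho, eps):
    # Smallest m for which the sqrt term is at most eps:
    # 2*sqrt(k*log(2/rho)/m) <= eps  <=>  m >= 4*k*log(2/rho)/eps**2.
    return math.ceil(4 * k * math.log(2 / rho) / eps ** 2)

# Hypothetical numbers: a net with a million parameters, rho = 0.1.
print(samples_needed(k=10 ** 6, rho=0.1, eps=0.1))  # about 1.2 billion samples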
4.2 Data-dependent complexity measures
A complexity measure for hypothesis classes is a way to quantify their
“complicatedness.” It is defined to let us prove an upper bound on
the number of training samples needed to keep down the generalization
error. Above we implicitly defined two complexity measures:
the size of the hypothesis class (assuming it is finite) and the size of
a γ-cover in it. Of course, the resulting bounds on sample complexity
were still loose.
Theorists then realized that the above simple bounds hold for every data distribution $D$. In practice, it seems clear that deep nets (or any learning method) work by exploiting properties of the input distribution; e.g., convolutional structure exploits the fact that all subpatches of images can be processed very similarly. Thus one should try to define a measure of complicatedness that depends on the data distribution.
4.2.1 Rademacher Complexity
Rademacher complexity is a complexity measure that depends on the data distribution. For simplicity we will assume the loss function takes values in $[0, 1]$.
The definition concerns the following thought experiment. Recall
that the distribution D is on labeled datapoints (x, y). For simplicity
we denote the labeled datapoint as z.
Now the Rademacher complexity⁴ of hypothesis class $H$ on a distribution $D$ is defined as follows, where $l(z, h)$ is the loss of hypothesis $h$ on labeled datapoint $z$, and $S_1, S_2$ are two samples of $m$ datapoints each drawn i.i.d. from $D$:
$$R_{m,D}(H) = \mathop{\mathbb{E}}_{S_1, S_2}\left[\frac{1}{2m} \sup_{h \in H} \left|\sum_{z \in S_1} l(z, h) - \sum_{z \in S_2} l(z, h)\right|\right]. \qquad (4.2)$$
⁴ Standard accounts of this often confuse students, or falsely impress them with a complicated proof of Thm 4.2.1. In the standard definition, loss terms are weighted by i.i.d. $\pm 1$ random variables. Its value is within $\pm O(1/\sqrt{m})$ of the one in our definition.
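The thought experiment in Eq. (4.2) is easy to simulate for a toy setup. The following Python sketch is ours, not the book's: it assumes a made-up distribution $D$ (Gaussian inputs labeled by one coordinate) and a class of unit-vector linear-threshold hypotheses with 0-1 loss, and it approximates the sup over $H$ by a max over finitely many sampled hypotheses, so it only lower-bounds the true value:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m, k=5):
    # Toy distribution D (an assumption for illustration): Gaussian inputs
    # labeled by the sign of their first coordinate.
    X = rng.normal(size=(m, k))
    return X, np.sign(X[:, 0])

def empirical_complexity(m, k=5, n_trials=200, n_hyp=500):
    # Monte Carlo estimate of Eq. (4.2): draw S1, S2 of size m, and replace
    # the sup over H by a max over a finite random set of unit-vector
    # linear-threshold hypotheses.
    H = rng.normal(size=(n_hyp, k))
    H /= np.linalg.norm(H, axis=1, keepdims=True)
    vals = []
    for _ in range(n_trials):
        X1, y1 = sample(m, k)
        X2, y2 = sample(m, k)
        L1 = (np.sign(X1 @ H.T) != y1[:, None]).sum(axis=0)  # summed 0-1 loss on S1, per h
        L2 = (np.sign(X2 @ H.T) != y2[:, None]).sum(axis=0)  # summed 0-1 loss on S2, per h
        vals.append(np.abs(L1 - L2).max() / (2 * m))
    return float(np.mean(vals))

print(empirical_complexity(m=100))  # shrinks roughly like 1/sqrt(m) as m grows
```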