probability at least $1 - \delta$,
$$\Delta_S(h) \;\le\; 2\sqrt{\frac{\log N + \log\frac{1}{\delta}}{m}}.$$
Proof. For any fixed hypothesis $g$, imagine drawing a training sample of size $m$. Then $\widehat{L}_S(g)$ is an average of i.i.d. variables and its expectation is $L_{\mathcal{D}}(g)$. Concentration bounds imply that $L_{\mathcal{D}}(g) - \widehat{L}_S(g)$ has a concentration property at least as strong as a univariate Gaussian $\mathcal{N}(0, 1/m)$. The previous statement is true for every hypothesis $g$ in the class, so the union bound implies that the probability is at most $N \exp(-\epsilon^2 m/4)$ that this quantity exceeds $\epsilon$ for some hypothesis in the class. Since $h$ is the solution to ERM, we conclude that whenever $N \exp(-\epsilon^2 m/4) \le \delta$, with probability at least $1 - \delta$ we have $\Delta_S(h) \le \epsilon$. Simplifying and eliminating $\epsilon$, we obtain the theorem.
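The last step, eliminating $\epsilon$, is the following routine rearrangement, spelled out here for completeness:
$$N \exp(-\epsilon^2 m/4) \le \delta \;\iff\; \epsilon^2 \ge \frac{4\left(\log N + \log\frac{1}{\delta}\right)}{m} \;\iff\; \epsilon \ge 2\sqrt{\frac{\log N + \log\frac{1}{\delta}}{m}},$$
so taking $\epsilon$ equal to the right-hand side yields exactly the bound in the theorem.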
Of course, the union bound doesn't apply to deep nets per se, because the set of hypotheses, even after we have fixed the architecture, consists of all vectors in $\mathbb{R}^k$, where $k$ is the number of real-valued parameters. This is an uncountable set! However, we show it is possible to reason about the set of all nets as a finite set after suitable discretization. Suppose we assume that the $\ell_2$ norm of the parameter vectors is at most $1$, meaning the set of all deep nets has been identified with $\mathrm{Ball}(0, 1)$. (Here $\mathrm{Ball}(w, r)$ refers to the set of all points in $\mathbb{R}^k$ within distance $r$ of $w$.) We assume there is a $\rho > 0$ such that if $w_1, w_2 \in \mathbb{R}^k$ satisfy $\|w_1 - w_2\|_2 \le \rho$, then the nets with these two parameter vectors have essentially the same loss on every input, meaning the losses differ by at most $\gamma$ for some $\gamma > 0$. (Another way to phrase this assumption, in a somewhat stronger form, is that the loss on any datapoint is a Lipschitz function of the parameter vector, with Lipschitz constant at most $\gamma/\rho$.) It makes intuitive sense that such a $\rho$ must exist for every $\gamma > 0$, since as we let $\rho \to 0$ the two nets become equal.
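Under the Lipschitz phrasing, the closeness of losses is immediate; writing $\ell(w, x)$ for the loss of the net with parameters $w$ on datapoint $x$ (notation introduced here for illustration, not in the text above):
$$\|w_1 - w_2\|_2 \le \rho \;\implies\; |\ell(w_1, x) - \ell(w_2, x)| \le \frac{\gamma}{\rho}\,\|w_1 - w_2\|_2 \le \gamma.$$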
Definition 4.1.2 ($\rho$-cover). A set of points $w_1, w_2, \ldots \in \mathbb{R}^k$ is a $\rho$-cover in $\mathbb{R}^k$ if for every $w \in \mathrm{Ball}(0, 1)$ there is some $w_i$ such that $w \in \mathrm{Ball}(w_i, \rho)$.
Lemma 4.1.3 (Existence of $\rho$-cover). There exists a $\rho$-cover of size at most $(2/\rho)^k$.
Proof. The proof is simple but ingenious. Let us pick $w_1$ arbitrarily in $\mathrm{Ball}(0, 1)$. For $i = 1, 2, 3, \ldots$ do the following: arbitrarily pick any point in $\mathrm{Ball}(0, 1)$ outside $\cup_{j \le i} \mathrm{Ball}(w_j, \rho)$ and designate it as $w_{i+1}$.
A priori it is unclear if this process will ever terminate. We now show it does after at most $(2/\rho)^k$ steps. To see this, it suffices to note that $\mathrm{Ball}(w_i, \rho/2) \cap \mathrm{Ball}(w_j, \rho/2) = \emptyset$ for all $i < j$. (Because if not, then $w_j \in \mathrm{Ball}(w_i, \rho)$, which means that $w_j$ could not have been picked during the above process.) Thus we conclude that the process must have stopped after at most
$$\frac{\mathrm{volume}(\mathrm{Ball}(0, 1))}{\mathrm{volume}(\mathrm{Ball}(0, \rho/2))} = \left(\frac{2}{\rho}\right)^k$$
steps.
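The process in the proof is effectively a greedy packing algorithm, and it is easy to simulate. Below is a minimal numerical sketch; the function name `greedy_cover` and the finite candidate-sampling approximation of "pick any point outside the union" are choices made here for illustration, not part of the text:

```python
import numpy as np

def greedy_cover(k, rho, n_candidates=20000, seed=0):
    """Greedy construction from the proof: repeatedly pick a point of
    Ball(0, 1) lying outside the union of the current Ball(w_j, rho),
    approximated here by scanning a finite random sample of candidates."""
    rng = np.random.default_rng(seed)
    # Sample candidates uniformly from Ball(0, 1): uniform direction
    # (normalized Gaussian), radius distributed as U^(1/k).
    pts = rng.normal(size=(n_candidates, k))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    pts *= rng.uniform(size=(n_candidates, 1)) ** (1.0 / k)

    centers = []
    for p in pts:
        # Keep p as a new center only if it avoids every Ball(w_j, rho);
        # kept centers are then pairwise more than rho apart, so the
        # balls of radius rho/2 around them are disjoint, as in the proof.
        if all(np.linalg.norm(p - c) > rho for c in centers):
            centers.append(p)
    return centers

centers = greedy_cover(k=3, rho=0.5)
print(len(centers), "centers; lemma's bound (2/rho)^k =", (2 / 0.5) ** 3)
```

On termination every sampled point lies within distance $\rho$ of some kept center, and the printed count typically lands well below the lemma's $(2/\rho)^k$ bound (here $64$ for $k = 3$, $\rho = 1/2$).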