Theory of Deep Learning, 2022
• Training loss (also known as empirical risk):

  L̂(h) = (1/n) ∑_{i=1}^{n} ℓ(x^{(i)}, y^{(i)}, h),

where (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(n)}, y^{(n)}) are n training examples drawn i.i.d. from D.
• Empirical risk minimizer (ERM): ĥ ∈ arg min_{h∈H} L̂(h).
• Regularization: Suppose we have a regularizer R(h); then the regularized loss is

  L̂_λ(h) = L̂(h) + λR(h).
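To make these definitions concrete, here is a small numerical sketch. The linear predictor, squared loss, and L2 regularizer are illustrative assumptions chosen for the demo; the definitions above apply to any hypothesis class, loss, and regularizer.

```python
import numpy as np

def empirical_risk(h, X, Y, loss):
    """Training loss L̂(h): the average of loss(x, y, h) over the n examples."""
    return np.mean([loss(x, y, h) for x, y in zip(X, Y)])

def regularized_risk(h, X, Y, loss, R, lam):
    """Regularized loss L̂_λ(h) = L̂(h) + λ·R(h)."""
    return empirical_risk(h, X, Y, loss) + lam * R(h)

# Illustrative instantiation (assumptions, not fixed by the definitions):
# h is a weight vector, loss is squared error, R is the squared L2 norm.
sq_loss = lambda x, y, h: (h @ x - y) ** 2
l2 = lambda h: float(np.sum(h ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # n = 100 examples, 3 features
h_star = np.array([1.0, -2.0, 0.5])
Y = X @ h_star                       # noiseless labels: h_star has zero training loss
```

With noiseless labels, h_star attains L̂(h_star) = 0, so its regularized loss reduces to λR(h_star).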
≪Suriya notes: Misc notations: gradient, hessian, norms≫
1.1 List of useful math facts
Now we list some useful math facts.
1.1.1 Probability tools
In this section we introduce the probability tools used in the proofs.
Lemmas 1.1.3, 1.1.4, and 1.1.5 are tail bounds for scalar random
variables. Lemma 1.1.6 concerns the CDF of the Gaussian distribution. Finally,
Lemma 1.1.7 is a concentration result for random matrices.
Lemma 1.1.1 (Markov’s inequality). If x is a nonnegative random variable
and t > 0, then the probability that x is at least t is at most the expectation
of x divided by t:
Pr[x ≥ t] ≤ E[x]/t.
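The bound is easy to check by Monte Carlo simulation; the exponential distribution below is an arbitrary choice of nonnegative random variable, made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200_000)  # nonnegative, E[x] = 1

t = 3.0
empirical_tail = np.mean(x >= t)  # estimate of Pr[x >= t], about e^{-3} ≈ 0.05
markov_bound = np.mean(x) / t     # E[x]/t, estimated from the same samples
```

The true tail e^{-3} ≈ 0.05 is well below the Markov bound of roughly 1/3, illustrating that the inequality is valid but often loose.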
Lemma 1.1.2 (Chebyshev’s inequality). Let x denote a random variable
with finite variance and let t > 0; then

  Pr[|x − E[x]| ≥ t] ≤ Var[x]/t².
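A quick simulation check; the Gaussian here is one convenient choice of distribution with finite variance, assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=200_000)  # Var[x] = 4

t = 4.0  # two standard deviations
empirical_tail = np.mean(np.abs(x - x.mean()) >= t)  # Pr[|x - E[x]| >= t]
chebyshev_bound = x.var() / t**2                     # Var[x]/t^2 = 0.25
```

For a Gaussian, the two-standard-deviation tail is about 0.046, comfortably below the Chebyshev bound of 0.25.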
Lemma 1.1.3 (Chernoff bound [? ]). Let X = ∑_{i=1}^{n} X_i, where X_i = 1 with
probability p_i and X_i = 0 with probability 1 − p_i, and all X_i are independent.
Let µ = E[X] = ∑_{i=1}^{n} p_i. Then

1. Pr[X ≥ (1 + δ)µ] ≤ exp(−δ²µ/3), ∀δ > 0;
2. Pr[X ≤ (1 − δ)µ] ≤ exp(−δ²µ/2), ∀ 0 < δ < 1.
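Both tails can be checked by simulation. Taking identical probabilities p_i = p (an assumption made only to simplify the sketch) makes X a Binomial(n, p) variable with µ = np.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, delta = 200, 0.3, 0.5
mu = n * p  # = 60

trials = rng.binomial(n, p, size=200_000)  # 200k independent draws of X

upper_emp = np.mean(trials >= (1 + delta) * mu)  # Pr[X >= 90]
upper_bound = np.exp(-delta**2 * mu / 3)         # exp(-5) ≈ 0.0067
lower_emp = np.mean(trials <= (1 - delta) * mu)  # Pr[X <= 30]
lower_bound = np.exp(-delta**2 * mu / 2)         # exp(-7.5) ≈ 0.00055
```

Deviating by δµ = 30 from the mean is about 4.6 standard deviations here, so both empirical tails are far below their Chernoff bounds.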
Lemma 1.1.4 (Hoeffding bound [? ]). Let X_1, · · · , X_n denote n independent
bounded variables with X_i ∈ [a_i, b_i]. Let X = ∑_{i=1}^{n} X_i. Then

  Pr[|X − E[X]| ≥ t] ≤ 2 exp( −2t² / ∑_{i=1}^{n} (b_i − a_i)² ).
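As a final sanity check, the bound can be simulated with all variables uniform on [a_i, b_i] = [0, 1] (an illustrative assumption), so that ∑_i (b_i − a_i)² = n.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
samples = rng.uniform(0.0, 1.0, size=(100_000, n))
X = samples.sum(axis=1)  # X = sum_i X_i with E[X] = n/2 exactly

t = 5.0
empirical_tail = np.mean(np.abs(X - n / 2) >= t)  # Pr[|X - E[X]| >= t]
hoeffding_bound = 2 * np.exp(-2 * t**2 / n)       # 2 e^{-1} ≈ 0.736
```

Here the standard deviation of X is √(n/12) ≈ 2.04, so a deviation of t = 5 is about 2.4 standard deviations; the empirical tail (roughly 0.014) sits well below the Hoeffding bound.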