
• Training loss (also known as empirical risk):
$$\hat{L}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\left( (x^{(i)}, y^{(i)}), h \right),$$
where $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$ are $n$ training examples drawn i.i.d. from $\mathcal{D}$.

• Empirical risk minimizer (ERM): $\hat{h} \in \arg\min_{h \in \mathcal{H}} \hat{L}(h)$.

• Regularization: Suppose we have a regularizer $R(h)$; then the regularized loss is
$$\hat{L}_{\lambda}(h) = \hat{L}(h) + \lambda R(h).$$
(A short code sketch of these definitions follows.)
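The sketch below is a minimal illustration in Python, assuming a squared loss, a linear hypothesis class, and a ridge regularizer $R(h) = \|w\|^2$; the function names and data are illustrative choices, not notation from the text.

```python
import numpy as np

def empirical_risk(w, X, y):
    """Training loss L_hat(h) for h(x) = <w, x>: the average squared
    loss over the n training examples (rows of X, entries of y)."""
    return np.mean((X @ w - y) ** 2)

def regularized_risk(w, X, y, lam):
    """Regularized loss L_hat_lambda(h) = L_hat(h) + lam * R(h),
    here with the (assumed) ridge regularizer R(h) = ||w||^2."""
    return empirical_risk(w, X, y) + lam * np.sum(w ** 2)

# For this particular loss and regularizer, the minimizers have closed
# forms (least squares and ridge regression, respectively):
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                       # n i.i.d. examples
y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
w_erm = np.linalg.lstsq(X, y, rcond=None)[0]      # ERM
lam = 0.1
w_reg = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```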

≪Suriya notes: Misc notations: gradient, hessian, norms≫

1.1 List of useful math facts

Now we list some useful math facts.

1.1.1 Probability tools

In this section we introduce the probability tools used in the proofs. Lemmas 1.1.3, 1.1.4, and 1.1.5 give tail bounds for scalar random variables, Lemma 1.1.6 concerns the CDF of the Gaussian distribution, and Lemma 1.1.7 is a concentration result for random matrices.

Lemma 1.1.1 (Markov's inequality). If $x$ is a nonnegative random variable and $t > 0$, then the probability that $x$ is at least $t$ is at most the expectation of $x$ divided by $t$:
$$\Pr[x \ge t] \le \mathbb{E}[x]/t.$$
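Markov's inequality is easy to check empirically; here is a quick Monte Carlo sketch (the exponential distribution is an arbitrary nonnegative example, not one from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # nonnegative, E[x] = 2
for t in [1.0, 4.0, 8.0]:
    tail = np.mean(x >= t)     # empirical Pr[x >= t]
    bound = x.mean() / t       # Markov bound E[x] / t
    print(f"t = {t}: Pr[x >= t] ~ {tail:.4f} <= {bound:.4f}")
```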

Lemma 1.1.2 (Chebyshev's inequality). Let $x$ be a random variable with finite variance and let $t > 0$; then
$$\Pr[|x - \mathbb{E}[x]| \ge t] \le \mathrm{Var}[x]/t^2.$$
(This follows from Markov's inequality applied to the nonnegative random variable $(x - \mathbb{E}[x])^2$; in particular, $x$ itself need not be nonnegative.)
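The same kind of Monte Carlo check works here; a sketch with an arbitrary (and notably not nonnegative) Gaussian variable:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=1_000_000)  # Var[x] = 4
for t in [2.0, 4.0, 6.0]:
    tail = np.mean(np.abs(x - x.mean()) >= t)  # empirical Pr[|x - E[x]| >= t]
    bound = x.var() / t**2                     # Chebyshev bound Var[x] / t^2
    print(f"t = {t}: {tail:.4f} <= {bound:.4f}")
```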

Lemma 1.1.3 (Chernoff bound [?]). Let $X = \sum_{i=1}^{n} X_i$, where $X_i = 1$ with probability $p_i$ and $X_i = 0$ with probability $1 - p_i$, and all $X_i$ are independent. Let $\mu = \mathbb{E}[X] = \sum_{i=1}^{n} p_i$. Then
1. $\Pr[X \ge (1+\delta)\mu] \le \exp(-\delta^2 \mu / 3)$ for all $0 < \delta \le 1$;
2. $\Pr[X \le (1-\delta)\mu] \le \exp(-\delta^2 \mu / 2)$ for all $0 < \delta < 1$.
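As a numerical illustration, take $n$ fair coin flips, so $p_i = 1/2$ and $\mu = n/2$; a sketch comparing the empirical upper tail to the first bound (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, trials = 1000, 0.5, 200_000
mu = n * p
X = rng.binomial(n, p, size=trials)   # each draw: a sum of n Bernoulli(p)
delta = 0.1
tail = np.mean(X >= (1 + delta) * mu)   # empirical Pr[X >= (1+delta)mu]
bound = np.exp(-delta**2 * mu / 3)      # Chernoff upper bound
print(f"Pr[X >= {(1 + delta) * mu:.0f}] ~ {tail:.5f} <= {bound:.5f}")
```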

Lemma 1.1.4 (Hoeffding bound [?]). Let $X_1, \ldots, X_n$ denote $n$ independent bounded variables with $X_i \in [a_i, b_i]$, and let $X = \sum_{i=1}^{n} X_i$. Then
$$\Pr[|X - \mathbb{E}[X]| \ge t] \le 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).$$
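A standard application is a sample-size calculation. Suppose the $X_i$ are i.i.d. with values in $[0, 1]$ (so each $b_i - a_i = 1$) and we want the empirical mean within $\epsilon$ of its expectation with probability at least $1 - \delta$; the symbols $\epsilon$ and $\delta$ here are illustrative, not notation from the text. Setting $t = n\epsilon$ gives
$$\Pr\left[ \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mathbb{E}[X_1] \right| \ge \epsilon \right] \le 2 \exp(-2n\epsilon^2) \le \delta \quad \Longleftrightarrow \quad n \ge \frac{\ln(2/\delta)}{2\epsilon^2}.$$
For example, $\epsilon = 0.01$ and $\delta = 0.05$ require $n \ge \ln(40)/0.0002 \approx 18{,}445$ samples.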
