
Theorem 6.2 (Hoeffding bounds) Let $x_1, x_2, \ldots, x_n$ be independent $\{0,1\}$-valued random variables with probability $p$ that $x_i$ equals one. Let $s = \sum_i x_i$ (equivalently, flip $n$ coins of bias $p$ and let $s$ be the total number of heads). For any $0 \leq \alpha \leq 1$,
\[
\mathrm{Prob}(s/n > p + \alpha) \leq e^{-2n\alpha^2},
\]
\[
\mathrm{Prob}(s/n < p - \alpha) \leq e^{-2n\alpha^2}.
\]
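As a concrete illustration (not part of the original text), here is a minimal Python sketch that estimates $\mathrm{Prob}(s/n > p + \alpha)$ by simulation and compares it with the bound $e^{-2n\alpha^2}$; the function name hoeffding_check and the particular values of n, p, alpha, and trials are arbitrary choices for this example.

    import math
    import random

    def hoeffding_check(n=200, p=0.3, alpha=0.1, trials=20000, seed=0):
        # Illustrative sketch: estimate Prob(s/n > p + alpha) where s is the
        # number of heads in n independent coin flips of bias p, and compare
        # it with the Hoeffding bound exp(-2 * n * alpha^2).
        rng = random.Random(seed)
        exceed = 0
        for _ in range(trials):
            s = sum(1 for _ in range(n) if rng.random() < p)
            if s / n > p + alpha:
                exceed += 1
        return exceed / trials, math.exp(-2 * n * alpha * alpha)

    # The empirical frequency should come out well below the bound
    # (which is exp(-4), roughly 0.018, for these parameter values).
    print(hoeffding_check())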

Theorem 6.2 implies the following uniform convergence analog of Theorem 6.1.

Theorem 6.3 (Uniform convergence) Let $H$ be a hypothesis class and let $\epsilon$ and $\delta$ be greater than zero. If a training set $S$ of size
\[
n \geq \frac{1}{2\epsilon^2}\Bigl(\ln|H| + \ln(2/\delta)\Bigr)
\]
is drawn from distribution $D$, then with probability greater than or equal to $1 - \delta$, every $h$ in $H$ satisfies $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \leq \epsilon$.
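To see what this bound asks for numerically, the following sketch (again an illustration added here, with an arbitrary function name and example values of $|H|$, $\epsilon$, and $\delta$) computes the smallest integer $n$ satisfying the inequality in Theorem 6.3.

    import math

    def uniform_convergence_sample_size(h_size, eps, delta):
        # Smallest integer n with n >= (1 / (2 * eps^2)) * (ln|H| + ln(2 / delta)),
        # the sample-size requirement of Theorem 6.3.
        return math.ceil((math.log(h_size) + math.log(2.0 / delta)) / (2.0 * eps * eps))

    # Example: |H| = 2**20 hypotheses, eps = 0.05, delta = 0.01 gives n = 3833,
    # so a few thousand examples suffice for uniform convergence over this H.
    print(uniform_convergence_sample_size(2 ** 20, 0.05, 0.01))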

Proof: First, fix some $h \in H$ and let $x_j$ be the indicator random variable for the event that $h$ makes a mistake on the $j$th example in $S$. The $x_j$ are independent $\{0,1\}$ random variables, the probability that $x_j$ equals 1 is the true error of $h$, and the fraction of the $x_j$'s equal to 1 is exactly the training error of $h$. Therefore, Hoeffding bounds guarantee that the probability of the event $A_h$ that $|\mathrm{err}_D(h) - \mathrm{err}_S(h)| > \epsilon$ is less than or equal to $2e^{-2n\epsilon^2}$. Applying the union bound to the events $A_h$ over all $h \in H$, the probability that there exists an $h \in H$ with the difference between true error and empirical error greater than $\epsilon$ is less than or equal to $2|H|e^{-2n\epsilon^2}$. Using the value of $n$ from the theorem statement, the right-hand side of the above inequality is at most $\delta$, as desired.

Theorem 6.3 justifies the approach of optimizing over our training sample $S$ even if we are not able to find a rule of zero training error. If our training set $S$ is sufficiently large, then with high probability, good performance on $S$ will translate to good performance on $D$.

Note that Theorems 6.1 and 6.3 require $|H|$ to be finite in order to be meaningful. The notions of growth functions and VC-dimension in Section 6.9 extend Theorem 6.3 to certain infinite hypothesis classes.

6.3 Illustrative Examples and Occam’s Razor<br />

We now present some examples to illustrate the use of Theorems 6.1 and 6.3, and also use these theorems to give a formal connection to the notion of Occam's razor.

6.3.1 Learning disjunctions<br />

Consider the instance space $X = \{0,1\}^d$ and suppose we believe that the target concept can be represented by a disjunction (an OR) over features, such as $c^* = \{x \mid x_1 = 1 \vee x_4 =$
