Foundations of Data Science

[Figure 6.1 shows sixteen emails $x_1, \ldots, x_{16}$, the labels assigned to them by the target concept (0 = not spam, 1 = spam), and the labels assigned by a hypothesis $h_i$.]

Figure 6.1: The hypothesis $h_i$ disagrees with the truth in one quarter of the emails. Thus with a training set $S$, the probability that the hypothesis will survive is $(1-0.25)^{|S|}$.
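
For a concrete sense of scale (the training-set size of 20 below is an arbitrary illustrative choice, not from the text): a hypothesis that disagrees with the truth on one quarter of the emails is consistent with 20 independently drawn training examples with probability only

$$(1 - 0.25)^{20} = 0.75^{20} \approx 0.003,$$

so such a hypothesis is very unlikely to survive even a modest training set.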

Consider drawing a sample $S$ of size $n$ and let $A_i$ be the event that $h_i$ is consistent with $S$. Since every $h_i$ has true error greater than or equal to $\epsilon$,
$$\mathrm{Prob}(A_i) \le (1-\epsilon)^n.$$
In other words, if we fix $h_i$ and draw a sample $S$ of size $n$, the chance that $h_i$ makes no mistakes on $S$ is at most the probability that a coin of bias $\epsilon$ comes up tails $n$ times in a row, which is $(1-\epsilon)^n$. By the union bound over all $i$ we have
$$\mathrm{Prob}(\cup_i A_i) \le |H|(1-\epsilon)^n.$$
Using the fact that $(1-\epsilon) \le e^{-\epsilon}$, the probability that any hypothesis in $H$ with true error greater than or equal to $\epsilon$ has training error zero is at most $|H|e^{-\epsilon n}$. Replacing $n$ by the sample size bound from the theorem statement, this is at most $|H|e^{-\ln|H| - \ln(1/\delta)} = |H| \cdot \frac{1}{|H|} \cdot \delta = \delta$, as desired.
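
To make the last step concrete, here is a minimal Python sketch; the helper name pac_sample_size and the values chosen for $|H|$, $\epsilon$, and $\delta$ are illustrative, not from the text. It computes the sample size $n \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ that the last step of the proof relies on and checks that $|H|e^{-\epsilon n}$ then falls below $\delta$.

    import math

    def pac_sample_size(H_size, eps, delta):
        # Smallest integer n with n >= (1/eps) * (ln|H| + ln(1/delta)),
        # which makes the union bound |H| * exp(-eps * n) at most delta.
        return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

    H_size, eps, delta = 1000, 0.1, 0.05   # illustrative values only
    n = pac_sample_size(H_size, eps, delta)
    union_bound = H_size * math.exp(-eps * n)
    print(n, union_bound, union_bound <= delta)   # 100  ~0.045  True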

The conclusion of Theorem 6.1 is sometimes called a "PAC-learning guarantee" since it states that if we can find an $h \in H$ consistent with the sample, then this $h$ is Probably Approximately Correct.

Theorem 6.1 addressed the case where there exists a hypothesis in $H$ with zero training error. What if the best $h_i$ in $H$ has 5% error on $S$? Can we still be confident that its true error is low, say at most 10%? For this, we want an analog of Theorem 6.1 that says for a sufficiently large training set $S$, every $h_i \in H$ has training error within $\pm\epsilon$ of the true error with high probability. Such a statement is called uniform convergence because we are asking that the training set errors converge to their true errors uniformly over all sets in $H$. To see intuitively why such a statement should be true for sufficiently large $S$ and a single hypothesis $h_i$, consider two strings that differ in 10% of the positions and randomly select a large sample of positions. The number of positions that differ in the sample will be close to 10%.
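
As a rough numerical check of this intuition, the following minimal Python sketch (string length, sample size, and trial count are arbitrary illustrative choices) represents two strings by the set of positions in which they disagree, set to exactly 10% of all positions, then repeatedly samples positions at random and records the sampled disagreement rate.

    import random

    random.seed(0)
    length, sample_size, trials = 100_000, 1_000, 1_000

    # Positions where the two strings differ: exactly 10% of all positions.
    disagree = set(random.sample(range(length), length // 10))

    rates = []
    for _ in range(trials):
        # Randomly sample positions and measure the fraction that differ.
        positions = random.sample(range(length), sample_size)
        rates.append(sum(p in disagree for p in positions) / sample_size)

    # Every sampled rate is typically within a few percentage points of 0.10.
    print(min(rates), max(rates))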

To prove uniform convergence bounds, we use a tail inequality for sums of independent Bernoulli random variables (i.e., coin tosses). The following is particularly convenient and is a variation on the Chernoff bounds in Section 12.4.11 of the appendix.
