Foundations of Data Science

we argue that if S is large enough compared to some property of H, then with high probability all h ∈ H have their training error close to their true error, so that if we find a hypothesis whose training error is low, we can be confident its true error will be low as well.

Before giving our first result of this form, we note that it will often be convenient to associate each hypothesis h with its {−1, 1}-valued indicator function

$$
h(x) = \begin{cases} \;\;\,1 & x \in h \\ -1 & x \notin h \end{cases}
$$

In this notation the true error of h is err_D(h) = Prob_{x∼D}[h(x) ≠ c*(x)] and the training error is err_S(h) = Prob_{x∼S}[h(x) ≠ c*(x)].
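As a concrete illustration of these definitions, here is a minimal Python sketch. It assumes hypotheses and the target concept c* are represented as sets over a small finite domain with D uniform; the set representation and all names (indicator, err_S, err_D) are illustrative choices, not notation from the text.

```python
import random

# Indicator-function view of a hypothesis: h(x) = 1 if x lies in the
# concept h, and -1 otherwise.
def indicator(h, x):
    return 1 if x in h else -1

# Training error err_S(h): fraction of sample points where h disagrees
# with the target concept c*.
def err_S(h, c_star, S):
    return sum(indicator(h, x) != indicator(c_star, x) for x in S) / len(S)

# True error err_D(h), computed exactly here because D is uniform over a
# finite domain; for a general D one would estimate it by sampling.
def err_D(h, c_star, domain):
    return sum(indicator(h, x) != indicator(c_star, x) for x in domain) / len(domain)

domain = range(100)
c_star = {x for x in domain if x % 2 == 0}  # target concept: even numbers
h = {x for x in domain if x % 4 == 0}       # hypothesis: multiples of 4
S = random.choices(domain, k=20)            # i.i.d. sample drawn from D
print(err_S(h, c_star, S), err_D(h, c_star, domain))
```

Here h errs exactly on the numbers that are even but not divisible by 4, a quarter of the domain, so err_D(h) = 0.25, while err_S(h) fluctuates around that value from sample to sample.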

6.2 Overfitting and Uniform Convergence

We now present two results that explain how one can guard against overfitting. Given a class of hypotheses H, the first result states that for any given ε greater than zero, so long as the training data set is large compared to (1/ε) ln(|H|), it is unlikely any hypothesis h ∈ H will have zero training error but have true error greater than ε. This means that with high probability, any hypothesis that our algorithm finds that agrees with the target hypothesis on the training data will have low true error. The second result states that if the training data set is large compared to (1/ε²) ln(|H|), then it is unlikely that the training error and true error will differ by more than ε for any hypothesis in H. This means that if we find a hypothesis in H whose training error is low, we can be confident its true error will be low as well, even if its training error is not zero.
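To make the gap between the two requirements concrete, consider an assumed accuracy parameter ε = 0.1 (chosen purely for illustration):

$$
\frac{1}{\epsilon}\ln(|H|) = 10\ln(|H|) \qquad\text{versus}\qquad \frac{1}{\epsilon^2}\ln(|H|) = 100\ln(|H|),
$$

so the uniform-convergence guarantee of the second result asks for a factor of 1/ε more data than the first result's guarantee about zero-training-error hypotheses.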

The basic idea is the following. If we consider some h with large true error, and we select an element x ∈ X at random according to D, there is a reasonable chance that x will belong to the symmetric difference h△c*. If we select a large enough training sample S with each point drawn independently from X according to D, the chance that S is completely disjoint from h△c* will be incredibly small. This is just for a single hypothesis h, but we can now apply the union bound over all h ∈ H of large true error, when H is finite. We formalize this below.
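The "incredibly small" chance above can be computed exactly: if err_D(h) = ε, each independent draw misses h△c* with probability 1 − ε, so a sample of size n misses it entirely with probability (1 − ε)^n ≤ e^{−εn}. A short sketch, with ε = 0.1 assumed for illustration, shows how fast this decays:

```python
import math

eps = 0.1  # assumed true error of a single bad hypothesis h
for n in [10, 50, 100, 200]:
    exact = (1 - eps) ** n      # Prob[S is disjoint from h symmetric-difference c*]
    bound = math.exp(-eps * n)  # upper bound via (1 - x) <= e^(-x)
    print(f"n={n:3d}  (1-eps)^n={exact:.2e}  e^(-eps*n)={bound:.2e}")
# At n = 200, (1-eps)^n is already about 7e-10.
```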

Theorem 6.1 Let H be a hypothesis class and let ε and δ be greater than zero. If a training set S of size

$$
n \ge \frac{1}{\epsilon}\bigl(\ln|H| + \ln(1/\delta)\bigr),
$$

is drawn from distribution D, then with probability greater than or equal to 1 − δ every h in H with true error err_D(h) ≥ ε has training error err_S(h) > 0. Equivalently, with probability greater than or equal to 1 − δ, every h ∈ H with training error zero has true error less than ε.
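The bound in Theorem 6.1 is easy to evaluate numerically; the sketch below does so, with the specific values of |H|, ε, and δ chosen purely for illustration:

```python
import math

# Sample size sufficient by Theorem 6.1: n >= (1/eps) * (ln|H| + ln(1/delta)).
def sample_size(num_hypotheses, eps, delta):
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / eps)

# Example: |H| = 1000, eps = 0.1, delta = 0.05.
# ln(1000) ~ 6.91 and ln(20) ~ 3.00, so n >= 10 * 9.91, i.e. n = 100 suffices.
print(sample_size(1000, 0.1, 0.05))  # -> 100
```

Note the logarithmic dependence on |H|: doubling the number of hypotheses adds only (ln 2)/ε more samples to the requirement.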

Proof: Let h_1, h_2, . . . be the hypotheses in H with true error greater than or equal to ε. These are the hypotheses that we don't want to output. Consider drawing the sample S of size n. Since each point of S is drawn independently from D, the probability that any one h_i is consistent with S, i.e., has zero training error, is at most (1 − ε)^n ≤ e^{−εn}, which by our choice of n is at most e^{−(ln|H| + ln(1/δ))} = δ/|H|. Applying the union bound over the at most |H| such hypotheses, the probability that any of them has zero training error is at most δ, as desired.

