Foundations of Data Science

will allow us to make mathematical statements of this form. These statements can be viewed as formalizing the intuitive philosophical notion of Occam’s razor.

Formalizing the problem. To formalize the learning problem, assume there is some probability distribution $D$ over the instance space $X$, such that (a) our training set $S$ consists of points drawn independently at random from $D$, and (b) our objective is to predict well on new points that are also drawn from $D$. This is the sense in which we assume that our training data is representative of future data. Let $c^*$, called the target concept, denote the subset of $X$ corresponding to the positive class for the binary classification we are aiming to make. For example, $c^*$ would correspond to the set of all patients who respond well to the treatment in the medical example, or the set of all spam emails in the spam-detection setting. So, each point in our training set $S$ is labeled according to whether or not it belongs to $c^*$, and our goal is to produce a set $h \subseteq X$, called our hypothesis, which is close to $c^*$ with respect to distribution $D$. The true error of $h$ is $\mathrm{err}_D(h) = \mathrm{Prob}(h \triangle c^*)$, where $\triangle$ denotes symmetric difference and probability mass is measured according to $D$. In other words, the true error of $h$ is the probability that it incorrectly classifies a data point drawn at random from $D$. Our goal is to produce an $h$ of low true error.
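Since $D$ is unknown in practice, the true error cannot be computed exactly, but it can be estimated by sampling from $D$. The following is a minimal Python sketch of such an estimate; the particular distribution (uniform on $[0,1]$), the interval concepts, and the names `sample_d`, `c_star`, and `h` are illustrative assumptions, not part of the text.

```python
import random

def estimate_true_error(h, c_star, sample_d, n=100_000):
    """Monte Carlo estimate of err_D(h) = Prob(h triangle c*):
    the chance that h and c* disagree on a fresh draw from D."""
    return sum(h(x) != c_star(x) for x in (sample_d() for _ in range(n))) / n

# Illustrative setup: D uniform on [0, 1], target c* = [0.2, 0.7],
# hypothesis h = [0.3, 0.8]. Their symmetric difference h triangle c*
# is [0.2, 0.3) together with (0.7, 0.8], which has mass 0.2 under D.
sample_d = lambda: random.random()
c_star = lambda x: 0.2 <= x <= 0.7
h = lambda x: 0.3 <= x <= 0.8

print(estimate_true_error(h, c_star, sample_d))  # about 0.2
```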

The training error of $h$, denoted $\mathrm{err}_S(h)$, is the fraction of points in $S$ on which $h$ and $c^*$ disagree. That is, $\mathrm{err}_S(h) = |S \cap (h \triangle c^*)|/|S|$. Training error is also called empirical error. Note that even though $S$ is assumed to consist of points randomly drawn from $D$, it is possible for a hypothesis $h$ to have low training error, or even to completely agree with $c^*$ over the training sample, and yet have high true error. This is called overfitting the training data. For instance, a hypothesis $h$ that simply consists of listing the positive examples in $S$, which is equivalent to a rule that memorizes the training sample and predicts positive on an example if and only if it already appeared positively in the training sample, would have zero training error. However, this hypothesis would likely have high true error and therefore would be highly overfitting the training data. More generally, overfitting is a concern because algorithms will typically be optimizing over the training sample. To design and analyze algorithms for learning, we will have to address the issue of overfitting.
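To make the memorizing hypothesis concrete, here is a small Python sketch that computes $\mathrm{err}_S(h)$ and exhibits the gap between training error and true error. The setup ($D$ uniform on $[0,1]$, target $c^* = [0.25, 0.75]$) and all names are hypothetical choices for illustration.

```python
import random

def training_error(h, S):
    """err_S(h): the fraction of labeled points (x, y) in S with h(x) != y."""
    return sum(h(x) != y for x, y in S) / len(S)

# Illustrative setup: D uniform on [0, 1], target concept c* = [0.25, 0.75].
c_star = lambda x: 0.25 <= x <= 0.75
S = [(x, c_star(x)) for x in (random.random() for _ in range(50))]

# The memorizing hypothesis: predict positive iff the point appeared
# positively in the training sample.
positives = {x for x, y in S if y}
memorizer = lambda x: x in positives

print(training_error(memorizer, S))  # 0.0: zero training error by construction

# A fresh draw from D repeats a training point with probability zero, so the
# memorizer predicts negative everywhere off the sample; its true error is
# the probability mass of c*, which is 0.5 here -- severe overfitting.
fresh = [(x, c_star(x)) for x in (random.random() for _ in range(100_000))]
print(training_error(memorizer, fresh))  # about 0.5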

To be able to formally analyze overfitting, we introduce the notion of a hypothesis class, also called a concept class or set system. A hypothesis class $H$ over $X$ is a collection of subsets of $X$, called hypotheses. For instance, the class of intervals over $X = \mathbb{R}$ is the collection $\{[a,b] \mid a \le b\}$. The class of linear separators over $X = \mathbb{R}^d$ is the collection
$$\{\{x \in \mathbb{R}^d \mid w \cdot x \ge w_0\} \mid w \in \mathbb{R}^d,\ w_0 \in \mathbb{R}\};$$
that is, it is the collection of all sets in $\mathbb{R}^d$ that are linearly separable from their complement.
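Computationally, a hypothesis can be represented by its membership predicate. The short Python sketch below encodes the two classes just defined; the function names are ours, not the text's.

```python
def interval(a, b):
    """A hypothesis [a, b] from the class of intervals over X = R."""
    return lambda x: a <= x <= b

def linear_separator(w, w0):
    """A hypothesis {x in R^d : w . x >= w0} from the class of linear separators."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) >= w0

h1 = interval(0.0, 1.0)
h2 = linear_separator([1.0, -2.0], 0.5)
print(h1(0.5), h1(2.0))                # True False
print(h2([2.0, 0.0]), h2([0.0, 1.0]))  # True False
```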

In the case that $X$ is the set of four points in the plane $\{(-1,-1), (-1,1), (1,-1), (1,1)\}$, the class of linear separators contains 14 of the $2^4 = 16$ possible subsets of $X$.¹⁷

Given a hypothesis class $H$ and training set $S$, what we typically aim to do algorithmically is to find the hypothesis in $H$ that most closely agrees with $c^*$ over $S$. To address overfitting,

¹⁷The only two subsets that are not in the class are the sets $\{(-1,-1),(1,1)\}$ and $\{(-1,1),(1,-1)\}$.

