
discard separators in advance that slice through dense regions and instead focus attention on just those that indeed separate most of the distribution by a large margin. This is the high-level idea behind a technique known as Semi-Supervised SVMs. Alternatively, suppose data objects can be described by two different “kinds” of features (e.g., a webpage could be described using words on the page itself or using words on links pointing to the page), and one believes that each kind should be sufficient to produce an accurate classifier. Then one might want to train a pair of classifiers (one on each type of feature) and use unlabeled data for which one is confident but the other is not to bootstrap, labeling such examples with the confident classifier and then feeding them as training data to the less-confident one. This is the high-level idea behind a technique known as Co-Training; a sketch of the bootstrapping loop appears after this paragraph. Or, if one believes “similar examples should generally have the same label”, one might construct a graph with an edge between examples that are sufficiently similar, and aim for a classifier that is correct on the labeled data and has a small cut value on the unlabeled data; this is the high-level idea behind graph-based methods.
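To make the Co-Training loop concrete, here is a minimal sketch, assuming binary labels and scikit-learn-style classifiers with fit/predict_proba; the function name co_train, the confidence threshold, and the round count are illustrative choices, not part of the text's formal development.

```python
# Minimal co-training sketch (illustrative; threshold and rounds are assumptions).
# Two classifiers, one per "view", label confident unlabeled examples for each other.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_lab, X2_lab, y_lab, X1_unl, X2_unl, rounds=10, conf=0.95):
    X1, X2, y = X1_lab.copy(), X2_lab.copy(), y_lab.copy()
    unl = np.ones(len(X1_unl), dtype=bool)        # mask of still-unlabeled examples
    h1, h2 = LogisticRegression(), LogisticRegression()
    for _ in range(rounds):
        h1.fit(X1, y)
        h2.fit(X2, y)
        for clf, Xv in ((h1, X1_unl), (h2, X2_unl)):
            if not unl.any():
                break
            proba = clf.predict_proba(Xv[unl])
            sure = proba.max(axis=1) >= conf      # examples this view is confident on
            idx = np.flatnonzero(unl)[sure]
            if idx.size == 0:
                continue
            labels = clf.classes_[proba[sure].argmax(axis=1)]
            # the confident view's labels become training data for both views
            X1 = np.vstack([X1, X1_unl[idx]])
            X2 = np.vstack([X2, X2_unl[idx]])
            y = np.concatenate([y, labels])
            unl[idx] = False
    h1.fit(X1, y)
    h2.fit(X2, y)
    return h1, h2
```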

A formal model: The batch learning model introduced in Sections 6.1 and 6.3 in essence assumes that one's prior beliefs about the target function are described in terms of a class of functions $H$. In order to capture the reasoning used in semi-supervised learning, we need to also describe beliefs about the relation between the target function and the data distribution. A clean way to do this is via a notion of compatibility $\chi$ between a hypothesis $h$ and a distribution $D$. Formally, $\chi$ maps pairs $(h, D)$ to $[0, 1]$, with $\chi(h, D) = 1$ meaning that $h$ is highly compatible with $D$ and $\chi(h, D) = 0$ meaning that $h$ is very incompatible with $D$. The quantity $1 - \chi(h, D)$ is called the unlabeled error rate of $h$, and denoted $err_{unl}(h)$. Note that for $\chi$ to be useful, it must be estimable from a finite sample; to this end, let us further require that $\chi$ be an expectation over individual examples. That is, overloading notation for convenience, we require $\chi(h, D) = \mathbb{E}_{x \sim D}[\chi(h, x)]$, where $\chi : H \times X \rightarrow [0, 1]$.
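Because $\chi(h, D)$ is an expectation over individual examples, the unlabeled error rate can be estimated from a finite unlabeled sample by a simple average. A minimal sketch, assuming a user-supplied per-example compatibility function chi (the names are illustrative):

```python
# Empirical unlabeled error rate: err_unl(h) is estimated as 1 minus the average
# of chi(h, x) over an unlabeled sample S, valid because chi(h, D) is defined
# as an expectation over individual examples.
def err_unl_hat(h, S, chi):
    return 1.0 - sum(chi(h, x) for x in S) / len(S)
```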

For instance, suppose we believe the target should separate most data by margin $\gamma$. We can represent this belief by defining $\chi(h, x) = 0$ if $x$ is within distance $\gamma$ of the decision boundary of $h$, and $\chi(h, x) = 1$ otherwise. In this case, $err_{unl}(h)$ will denote the probability mass of $D$ within distance $\gamma$ of $h$'s decision boundary. As a different example, in co-training, we assume each example can be described using two “views” that are each sufficient for classification; that is, there exist $c_1^*, c_2^*$ such that for each example $x = \langle x_1, x_2 \rangle$ we have $c_1^*(x_1) = c_2^*(x_2)$. We can represent this belief by defining a hypothesis $h = \langle h_1, h_2 \rangle$ to be compatible with an example $\langle x_1, x_2 \rangle$ if $h_1(x_1) = h_2(x_2)$ and incompatible otherwise; $err_{unl}(h)$ is then the probability mass of examples on which $h_1$ and $h_2$ disagree.
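These two examples correspond to concrete per-example compatibility functions that can be plugged into the estimator sketched above. A minimal sketch, assuming a linear separator represented as a pair $(w, b)$ for the margin case and a pair of callable per-view hypotheses for the two-view case (all names are illustrative):

```python
import numpy as np

# Margin compatibility: chi(h, x) = 1 iff x lies at distance >= gamma from the
# decision boundary of the linear separator h = (w, b), i.e. w . x + b = 0.
def chi_margin(h, x, gamma):
    w, b = h
    return 1.0 if abs(np.dot(w, x) + b) / np.linalg.norm(w) >= gamma else 0.0

# Two-view compatibility: chi(h, x) = 1 iff the two per-view hypotheses agree
# on the example x = (x1, x2), for h = (h1, h2).
def chi_two_view(h, x):
    (h1, h2), (x1, x2) = h, x
    return 1.0 if h1(x1) == h2(x2) else 0.0
```

Plugging chi_margin into err_unl_hat estimates the probability mass within distance $\gamma$ of the boundary; plugging in chi_two_view estimates the disagreement rate between the two views.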

As with the class $H$, one can either assume that the target is fully compatible (i.e., $err_{unl}(c^*) = 0$) or instead aim to do well as a function of how compatible the target is. The case that we assume $c^* \in H$ and $err_{unl}(c^*) = 0$ is termed the “doubly realizable case”. The concept class $H$ and compatibility notion $\chi$ are both viewed as known.
