
Intuition: In this framework, the way that unlabeled data helps in learning can be intuitively described as follows. Suppose one is given a concept class H (such as linear separators) and a compatibility notion χ (such as penalizing h for points within distance γ of the decision boundary). Suppose also that one believes c∗ ∈ H (or at least is close) and that err_unl(c∗) = 0 (or at least is small). Then, unlabeled data can help by allowing one to estimate the unlabeled error rate of all h ∈ H, thereby in principle reducing the search space from H (all linear separators) down to just the subset of H that is highly compatible with D. The key challenge is how this can be done efficiently (in theory, in practice, or both) for natural notions of compatibility, as well as identifying types of compatibility that data in important problems can be expected to satisfy.
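As a concrete illustration, here is a minimal sketch (not from the text) of the margin-based compatibility notion mentioned above for linear separators: a hypothesis h(x) = sign(w·x) loses compatibility on every unlabeled point within distance γ of its decision boundary, and the search space is then pruned to hypotheses whose estimated unlabeled error is small. The function names and the threshold alpha are illustrative assumptions.

```python
import numpy as np

def chi_margin(w, x, gamma):
    """chi(h, x) for h(x) = sign(w.x): 1 if x lies at distance >= gamma
    from the decision boundary of h, else 0 (the point is penalized)."""
    return 1.0 if abs(np.dot(w, x)) / np.linalg.norm(w) >= gamma else 0.0

def estimated_unlabeled_error(w, U, gamma):
    """Estimate of err_unl(h) from an unlabeled sample U: the fraction of
    points falling within distance gamma of the boundary."""
    return 1.0 - np.mean([chi_margin(w, x, gamma) for x in U])

def highly_compatible(H, U, gamma, alpha):
    """Prune the search space: keep hypotheses whose estimated unlabeled
    error is at most alpha."""
    return [w for w in H if estimated_unlabeled_error(w, U, gamma) <= alpha]
```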

A theorem: The following is a semi-supervised analog of our basic sample complexity theorem, Theorem 6.1. First, fix some set of functions H and compatibility notion χ. Given a labeled sample L, define êrr(h) to be the fraction of mistakes of h on L. Given an unlabeled sample U, define χ(h, U) = E_{x∼U}[χ(h, x)] and define êrr_unl(h) = 1 − χ(h, U). That is, êrr(h) and êrr_unl(h) are the empirical error rate and unlabeled error rate of h, respectively. Finally, given α > 0, define H_{D,χ}(α) to be the set of functions f ∈ H such that err_unl(f) ≤ α.
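The two empirical quantities just defined translate directly into code. The sketch below assumes a finite hypothesis class with hypotheses represented as callables and a generic compatibility function chi(h, x) taking values in [0, 1]; the function names are illustrative, not notation from the text.

```python
def empirical_error(h, L):
    """err-hat(h): fraction of labeled examples (x, y) in L that h misclassifies."""
    return sum(1 for x, y in L if h(x) != y) / len(L)

def empirical_unlabeled_error(h, U, chi):
    """err-hat_unl(h) = 1 - chi(h, U), where chi(h, U) averages chi(h, x) over U."""
    return 1.0 - sum(chi(h, x) for x in U) / len(U)
```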

Theorem 6.22 If c∗ ∈ H then with probability at least 1 − δ, for labeled set L and unlabeled set U drawn from D, the h ∈ H that optimizes êrr_unl(h) subject to êrr(h) = 0 will have err_D(h) ≤ ε for

|U| ≥ (2/ε²) [ln |H| + ln(4/δ)],   and   |L| ≥ (1/ε) [ln |H_{D,χ}(err_unl(c∗) + 2ε)| + ln(2/δ)].

Equivalently, for |U| satisfying this bound, for any |L|, whp the h ∈ H that minimizes êrr_unl(h) subject to êrr(h) = 0 has

err_D(h) ≤ (1/|L|) [ln |H_{D,χ}(err_unl(c∗) + 2ε)| + ln(2/δ)].

Proof: By Hoeffding bounds, |U| is sufficiently large so that with probability at least 1 − δ/2, all h ∈ H have |êrr_unl(h) − err_unl(h)| ≤ ε. Thus we have

{f ∈ H : êrr_unl(f) ≤ err_unl(c∗) + ε} ⊆ H_{D,χ}(err_unl(c∗) + 2ε).

The given bound on |L| is sufficient so that with probability at least 1 − δ/2, all h ∈ H with êrr(h) = 0 and êrr_unl(h) ≤ err_unl(c∗) + ε have err_D(h) ≤ ε; furthermore, êrr_unl(c∗) ≤ err_unl(c∗) + ε, so such a function h exists (c∗ itself is one). Therefore, by a union bound over the two failure events, with probability at least 1 − δ, the h ∈ H that optimizes êrr_unl(h) subject to êrr(h) = 0 has err_D(h) ≤ ε, as desired.
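The learning rule analyzed in Theorem 6.22 can itself be sketched directly for a finite hypothesis class: among the hypotheses that make no mistakes on L, return one minimizing the empirical unlabeled error on U. The sketch below is a minimal, self-contained illustration under the same assumptions as before (hypotheses as callables, a generic chi in [0, 1]); it is not an efficient algorithm for large or infinite H.

```python
def semi_supervised_select(H, L, U, chi):
    """Return an h in H minimizing err-hat_unl(h) subject to err-hat(h) = 0."""
    def err_hat(h):
        return sum(1 for x, y in L if h(x) != y) / len(L)

    def err_hat_unl(h):
        return 1.0 - sum(chi(h, x) for x in U) / len(U)

    consistent = [h for h in H if err_hat(h) == 0.0]   # zero labeled error on L
    return min(consistent, key=err_hat_unl)            # most compatible with U
```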

One can view Theorem 6.22 as bounding the number of labeled examples needed to learn well as a function of the "helpfulness" of the distribution D with respect to χ. Namely, a helpful distribution is one in which H_{D,χ}(α) is small for α slightly larger than the compatibility of the true target function, so we do not need much labeled data to identify a good function among those in H_{D,χ}(α). For more information on semi-supervised learning, see [?, ?, ?, ?, ?].
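To get a feel for the bounds, the following is a hypothetical numeric instantiation (the values |H| = 10^6, a compatible subset of size 100, ε = 0.1, and δ = 0.05 are chosen only for illustration): the unlabeled sample pays for ln |H|, while the labeled sample pays only for the logarithm of the much smaller compatible subset.

```python
import math

eps, delta = 0.1, 0.05
H_size = 10**6          # |H|, the full (finite) hypothesis class
H_compat_size = 100     # |H_{D,chi}(err_unl(c*) + 2*eps)|, assumed small for a helpful D

# Bounds from Theorem 6.22
U_needed = (2 / eps**2) * (math.log(H_size) + math.log(4 / delta))
L_needed = (1 / eps) * (math.log(H_compat_size) + math.log(2 / delta))

print(math.ceil(U_needed))  # ~3640 unlabeled examples
print(math.ceil(L_needed))  # ~83 labeled examples; ~175 if ln|H| replaced ln|H_compat|
```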
