
imply that if it does and if $S$ is sufficiently large, we can be confident that its true error will be low as well. Suppose that stochastic gradient descent is run on a machine where each weight is a 64-bit floating point number. This means that its hypotheses can each be described using $64n$ bits. If $S$ has size at least $\frac{1}{\epsilon}\left[64n \ln(2) + \ln(1/\delta)\right]$, by Theorem 6.5 it is unlikely any such hypothesis of true error greater than $\epsilon$ will be consistent with the sample, and so if it finds a hypothesis consistent with $S$, we can be confident its true error is at most $\epsilon$. Or, by Theorem 6.3, if

$$|S| \geq \frac{1}{2\epsilon^2}\left(64n \ln(2) + \ln(2/\delta)\right)$$

then almost surely the final hypothesis $h$ produced by stochastic gradient descent satisfies true error less than or equal to training error plus $\epsilon$.
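To make these bounds concrete, here is a small Python calculation of the two sample sizes. This is only an illustrative sketch: the choices $n = 100$, $\epsilon = 0.05$, $\delta = 0.01$ are assumptions, not values from the text.

```python
import math

def consistent_bound(n, eps, delta):
    """Theorem 6.5 bound: sample size beyond which it is unlikely that
    any hypothesis describable in 64n bits with true error > eps is
    still consistent with the sample."""
    return (64 * n * math.log(2) + math.log(1 / delta)) / eps

def uniform_bound(n, eps, delta):
    """Theorem 6.3 bound: sample size beyond which the final hypothesis
    almost surely has true error at most training error plus eps."""
    return (64 * n * math.log(2) + math.log(2 / delta)) / (2 * eps ** 2)

n, eps, delta = 100, 0.05, 0.01  # illustrative choices
print(math.ceil(consistent_bound(n, eps, delta)))  # 88815  (~8.9e4)
print(math.ceil(uniform_bound(n, eps, delta)))     # 888289 (~8.9e5)
```

Note the cost of the stronger, agnostic-style guarantee: moving from the consistency bound to the uniform bound replaces a factor $1/\epsilon$ by $1/(2\epsilon^2)$, an order of magnitude more samples at $\epsilon = 0.05$.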

6.12 Combining (Sleeping) Expert Advice<br />

Imagine you have access to a large collection of rules-of-thumb that specify what to predict in different situations. For example, in classifying news articles, you might have one that says “if the article has the word ‘football’, then classify it as sports” and another that says “if the article contains a dollar figure, then classify it as business”. In predicting the stock market, these could be different economic indicators. These predictors might at times contradict each other, e.g., a news article that has both the word “football” and a dollar figure, or a day on which two economic indicators point in different directions. It also may be that no predictor is perfectly accurate, and some may be much better than others. We present here an algorithm for combining a large number of such predictors with the guarantee that if any of them are good, the algorithm will perform nearly as well as each good predictor on the examples on which that predictor fires.

Formally, define a “sleeping expert” to be a predictor $h$ that on any given example $x$ either makes a prediction on its label or chooses to stay silent (asleep). We will think of them as black boxes. Now, suppose we have access to $n$ such sleeping experts $h_1, \ldots, h_n$, and let $S_i$ denote the subset of examples on which $h_i$ makes a prediction (e.g., this could be articles with the word “football” in them). We consider the online learning model, and let $\text{mistakes}(A, S)$ denote the number of mistakes of an algorithm $A$ on a sequence of examples $S$. Then the guarantee of our algorithm $A$ will be that for all $i$

$$E\bigl(\text{mistakes}(A, S_i)\bigr) \leq (1 + \epsilon) \cdot \text{mistakes}(h_i, S_i) + O\left(\frac{\log n}{\epsilon}\right)$$

where $\epsilon$ is a parameter of the algorithm and the expectation is over internal randomness in the randomized algorithm $A$.
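One standard way to achieve a bound of this shape is multiplicative weights restricted to the awake experts. The following Python sketch is written under that assumption and is not necessarily the exact algorithm analyzed in the text; the expert interface (a callable returning a label, or `None` when asleep) is hypothetical.

```python
import random

def combine_sleeping_experts(experts, examples, eps=0.1):
    """Randomized combiner: multiplicative weights over awake experts.

    experts:  list of callables; expert(x) returns a predicted label,
              or None if the expert is asleep on x (illustrative interface).
    examples: iterable of (x, true_label) pairs, presented online.
    Returns the number of mistakes made by the combiner.
    """
    w = [1.0] * len(experts)
    mistakes = 0
    for x, y in examples:
        # Collect the experts that make a prediction on this example.
        awake = [(i, p) for i, p in
                 ((i, experts[i](x)) for i in range(len(experts)))
                 if p is not None]
        if not awake:
            continue  # no expert fires; skip (or predict a default)
        total = sum(w[i] for i, _ in awake)
        # Sample an awake expert with probability proportional to its weight.
        r = random.random() * total
        prediction = awake[-1][1]  # fallback guards float round-off
        for i, p in awake:
            r -= w[i]
            if r <= 0:
                prediction = p
                break
        if prediction != y:
            mistakes += 1
        # Penalize awake experts that erred, then rescale so the total
        # weight of the awake set is unchanged; asleep experts are
        # untouched, which is the key twist for sleeping experts.
        for i, p in awake:
            if p != y:
                w[i] *= 1 - eps
        scale = total / sum(w[i] for i, _ in awake)
        for i, _ in awake:
            w[i] *= scale
    return mistakes
```

The rescaling step is the design choice that makes the sleeping case work: an expert's weight changes only relative to the other experts awake on the same examples, so its weight reflects its performance on $S_i$ alone.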

As a special case, if $h_1, \ldots, h_n$ are concepts from a concept class $H$, and so they all make predictions on every example, then $A$ performs nearly as well as the best concept in $H$. This can be viewed as a noise-tolerant version of the Halving Algorithm of Section 6.5.2 for the case that no concept in $H$ is perfect. The case of predictors that make predictions on every example is called the problem of combining expert advice, and the more general case of predictors that sometimes fire and sometimes are silent is called the sleeping-experts problem.
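As a quick illustration, the two rules-of-thumb from the opening paragraph can be wrapped as sleeping experts and fed to the sketch above; the toy articles are invented for illustration.

```python
# Each rule predicts only when it fires and sleeps (returns None)
# otherwise -- exactly the sleeping-expert behavior defined above.
football_rule = lambda x: "sports" if "football" in x else None
dollar_rule = lambda x: "business" if "$" in x else None

articles = [({"football", "game"}, "sports"),
            ({"$", "market"}, "business"),
            ({"football", "$"}, "sports")]  # the contradictory case

total = combine_sleeping_experts([football_rule, dollar_rule], articles)
print(total)  # number of mistakes on this tiny sequence
```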
