Foundations of Data Science

Theorem 6.11 (Online to Batch via Random Stopping) If an online algorithm A with mistake-bound M is run on a sample S of size $M/\epsilon$ and stopped at a random time between 1 and $|S|$, the expected error of the hypothesis h produced satisfies $E[\mathrm{err}_D(h)] \le \epsilon$.
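As a sketch of this conversion, here is a simulation using a toy mistake-bounded online learner: an elimination algorithm for "dictator" targets $f(x) = x_{j^*}$ over $\{0,1\}^d$, which has mistake bound at most $d$. The learner, the function names, and all parameters below are illustrative assumptions, not from the text.

```python
import random

class EliminationLearner:
    """Toy mistake-bounded online learner (an assumption of this sketch,
    not from the text) for 'dictator' targets f(x) = x[j*] over {0,1}^d.
    Each mistake eliminates the current predicting coordinate, so the
    mistake bound M is at most d."""
    def __init__(self, d):
        self.candidates = list(range(d))

    def predict(self, x):
        return x[self.candidates[0]]

    def update(self, x, y):
        # Called on a mistake: keep only coordinates consistent with (x, y).
        self.candidates = [j for j in self.candidates if x[j] == y]

def random_stop_conversion(learner, sample, rng):
    """Run the learner online, snapshotting the hypothesis held before each
    example, and return the snapshot at a uniformly random stopping time."""
    snapshots = []
    for x, y in sample:
        snapshots.append(learner.candidates[0])  # hypothesis = a coordinate
        if learner.predict(x) != y:
            learner.update(x, y)
    j = rng.choice(snapshots)
    return lambda z: z[j]

# Demo: target f(x) = x[3] in d = 10 dimensions with eps = 0.25, so the
# sample has size M/eps = d/eps = 40. Averaging the error of the returned
# hypothesis over many trials illustrates E[err(h)] <= eps.
rng = random.Random(0)
d, target, n = 10, 3, 40
errs = []
for _ in range(200):
    learner = EliminationLearner(d)
    sample = []
    for _ in range(n):
        x = tuple(rng.randrange(2) for _ in range(d))
        sample.append((x, x[target]))
    h = random_stop_conversion(learner, sample, rng)
    test_pts = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(200)]
    errs.append(sum(h(x) != x[target] for x in test_pts) / len(test_pts))
avg_err = sum(errs) / len(errs)
```

The averaged error comes out well below $\epsilon = 0.25$ here, since early (bad) snapshots are held only briefly before a mistake replaces them.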

Conversion procedure 2: Controlled Testing. A second natural approach to using an online learning algorithm A in the distributional setting is to just run a series of controlled tests. Specifically, suppose that the initial hypothesis produced by algorithm A is $h_1$. Define $\delta_i = \delta/(i+2)^2$ so we have $\sum_{i=0}^{\infty} \delta_i = \left(\frac{\pi^2}{6} - 1\right)\delta \le \delta$. We draw a set of $n_1 = \frac{1}{\epsilon}\log\left(\frac{1}{\delta_1}\right)$ random examples and test to see whether $h_1$ gets all of them correct. Note that if $\mathrm{err}_D(h_1) \ge \epsilon$ then the chance $h_1$ would get them all correct is at most $(1-\epsilon)^{n_1} \le \delta_1$. So, if $h_1$ indeed gets them all correct, we output $h_1$ as our hypothesis and halt. If not, we choose some example $x_1$ in the sample on which $h_1$ made a mistake and give it to algorithm A. Algorithm A then produces some new hypothesis $h_2$ and we again repeat, testing $h_2$ on a fresh set of $n_2 = \frac{1}{\epsilon}\log\left(\frac{1}{\delta_2}\right)$ random examples, and so on.

In general, given $h_t$ we draw a fresh set of $n_t = \frac{1}{\epsilon}\log\left(\frac{1}{\delta_t}\right)$ random examples and test to see whether $h_t$ gets all of them correct. If so, we output $h_t$ and halt; if not, we choose some $x_t$ on which $h_t(x_t)$ was incorrect and give it to algorithm A. By choice of $n_t$, if $h_t$ had error rate $\epsilon$ or larger, the chance we would mistakenly output it is at most $\delta_t$. By choice of the values $\delta_t$, the chance we ever halt with a hypothesis of error $\epsilon$ or larger is at most $\delta_1 + \delta_2 + \cdots \le \delta$. Thus, we have the following theorem.

Theorem 6.12 (Online to Batch via Controlled Testing) Let A be an online learning algorithm with mistake-bound M. Then this procedure will halt after $O\left(\frac{M}{\epsilon}\log\left(\frac{M}{\delta}\right)\right)$ examples and with probability at least $1 - \delta$ will produce a hypothesis of error at most $\epsilon$.

Note that in this conversion we cannot re-use our samples: since the hypothesis $h_t$ depends on the previous data, we need to draw a fresh set of $n_t$ examples to use for testing it.
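A minimal sketch of the controlled-testing loop, again using a toy elimination learner for "dictator" targets $f(x) = x_{j^*}$ as the online algorithm A; the function names and parameter values are illustrative assumptions, not from the text.

```python
import math
import random

def controlled_testing(predict, update, draw_example, eps, delta, max_rounds=10000):
    """Online-to-batch via controlled testing (a sketch of the procedure in
    the text). Hypothesis h_t is tested on n_t = ceil((1/eps) * log(1/delta_t))
    fresh examples, with delta_t = delta / (t + 2)^2. If h_t gets them all
    correct we halt and output it; otherwise one mistake is fed back to the
    online learner to obtain h_{t+1}."""
    for t in range(max_rounds):
        delta_t = delta / (t + 2) ** 2
        n_t = math.ceil(math.log(1 / delta_t) / eps)
        batch = [draw_example() for _ in range(n_t)]
        mistakes = [(x, y) for x, y in batch if predict(x) != y]
        if not mistakes:
            return predict            # h_t passed its test: output and halt
        update(*mistakes[0])          # give one mistake to A -> h_{t+1}
    raise RuntimeError("exceeded max_rounds")

# As the online algorithm A: a toy elimination learner for 'dictator'
# targets f(x) = x[j*] over {0,1}^d (mistake bound at most d).
rng = random.Random(1)
d, target = 10, 3
candidates = list(range(d))

def predict(x):
    return x[candidates[0]]

def update(x, y):
    # On a mistake, keep only coordinates consistent with (x, y).
    candidates[:] = [j for j in candidates if x[j] == y]

def draw_example():
    x = tuple(rng.randrange(2) for _ in range(d))
    return x, x[target]

h = controlled_testing(predict, update, draw_example, eps=0.1, delta=0.05)
test_err = sum(h(x) != y for x, y in (draw_example() for _ in range(1000))) / 1000
```

Each round uses a fresh batch, matching the no-reuse caveat above: a hypothesis chosen based on old test data could look deceptively good on that same data.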

6.8 Support-Vector Machines

In a batch setting, rather than running the Perceptron algorithm and adapting it via one of the methods above, another natural idea would be just to solve for the vector w that minimizes the right-hand-side in Theorem 6.9 on the given dataset S. This turns out to have good guarantees as well, though they are beyond the scope of this book. In fact, this is the Support Vector Machine (SVM) algorithm. Specifically, SVMs solve the following convex optimization problem over a sample $S = \{x_1, x_2, \ldots, x_n\}$, where c is a constant that is determined empirically.

minimize $\quad c|w|^2 + \sum_i s_i$

subject to
$\quad w \cdot x_i \ge 1 - s_i$ for all positive examples $x_i$
$\quad w \cdot x_i \le -1 + s_i$ for all negative examples $x_i$
$\quad s_i \ge 0$ for all $i$.
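With labels $y_i \in \{-1, +1\}$, the two margin constraints combine into $y_i(w \cdot x_i) \ge 1 - s_i$, and at the optimum each slack is $s_i = \max(0, 1 - y_i(w \cdot x_i))$. A minimal sketch solving this unconstrained form by subgradient descent (an illustrative choice, not the usual QP solver; the step size, epoch count, and value of c below are arbitrary assumptions):

```python
import numpy as np

def svm_subgradient(X, y, c=0.1, lr=0.01, epochs=500):
    """Minimize c|w|^2 + sum_i s_i via the equivalent unconstrained form
    c|w|^2 + sum_i max(0, 1 - y_i (w . x_i)), since at the optimum each
    slack s_i equals this hinge term. Labels y_i are +1 / -1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                      # examples with s_i > 0
        # Subgradient: 2*c*w from the regularizer, minus y_i x_i for each
        # example whose margin constraint is violated or tight.
        grad = 2 * c * w - (y[active][:, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

# Toy linearly separable data in two dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w = svm_subgradient(X, y)
train_err = float(np.mean(np.sign(X @ w) != y))
```

On separable data like this, the regularizer $c|w|^2$ pulls toward the large-margin separator while the hinge terms drive the training error down.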
