Foundations of Data Science


6.5 Online learning and the Perceptron algorithm

So far we have been considering what is often called the batch learning scenario. You are given a "batch" of data (the training sample S) and your goal is to use it to produce a hypothesis h that will have low error on new data, under the assumption that both S and the new data are sampled from some fixed distribution D. We now switch to the more challenging online learning scenario, where we remove the assumption that data is sampled from a fixed probability distribution, or from any probabilistic process at all.

Specifically, the online learning scenario proceeds as follows. At each time t = 1, 2, . . .:

1. The algorithm is presented with an arbitrary example x_t ∈ X and is asked to make a prediction ℓ_t of its label.

2. The algorithm is told the true label c^*(x_t) of the example and is charged for a mistake if c^*(x_t) ≠ ℓ_t.
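To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. The `learner` object with `predict` and `update` methods and the `stream` of labeled examples are illustrative assumptions, not part of the text; the stream may be generated in an arbitrary, even adversarial, order.

```python
def run_online(learner, stream):
    """Run the online protocol: predict first, then see the true label.

    `learner` must expose predict(x) and update(x, y); `stream` yields
    (x_t, c*(x_t)) pairs in an arbitrary order.
    """
    mistakes = 0
    for x_t, y_t in stream:
        prediction = learner.predict(x_t)   # step 1: commit to a label ℓ_t
        if prediction != y_t:               # step 2: true label revealed,
            mistakes += 1                   #         charged if wrong
        learner.update(x_t, y_t)
    return mistakes
```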

The goal of the learning algorithm is to make as few mistakes as possible in total. For example, consider an email classifier that must label each arriving message as "important" or "it can wait". The user then looks at the email and informs the algorithm if it was incorrect. We might not want to model email messages as independent random objects from a fixed probability distribution, because they are often replies to previous emails and build on each other. Thus, the online learning model would be more appropriate than the batch model for this setting.

Intuitively, the online learning model is harder than the batch model because we have removed the requirement that our data consists of independent draws from a fixed probability distribution. Indeed, we will see shortly that any algorithm with good performance in the online model can be converted to an algorithm with good performance in the batch model. Nonetheless, the online model can sometimes be a cleaner model for the design and analysis of algorithms.

6.5.1 An example: learning disjunctions<br />

As a simple example, let's revisit the problem of learning disjunctions in the online model. We can solve this problem by starting with the hypothesis h = x_1 ∨ x_2 ∨ . . . ∨ x_d and using it for prediction. We will maintain the invariant that every variable in the target disjunction is also in our hypothesis, which is clearly true at the start. This ensures that the only mistakes possible are on examples x for which h(x) is positive but c^*(x) is negative. When such a mistake occurs, we simply remove from h every variable set to 1 in x. Since such variables cannot be in the target function (because x was negative), we maintain our invariant and remove at least one variable from h. This implies that the algorithm makes at most d mistakes in total on any sequence of examples consistent with a disjunction.
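As a concrete illustration of this mistake-bound argument, here is a sketch of the disjunction learner in Python. It assumes examples are 0/1 vectors of length d and keeps the hypothesis as the set of variable indices still in the disjunction; the class and method names are hypothetical, chosen to match the loop sketched earlier.

```python
class OnlineDisjunctionLearner:
    """Mistake-bound learner for disjunctions over {0,1}^d.

    The hypothesis starts as x_1 ∨ ... ∨ x_d and only ever shrinks, so it
    makes at most d mistakes on any sequence consistent with a disjunction.
    """

    def __init__(self, d):
        self.relevant = set(range(d))  # indices still in the hypothesis

    def predict(self, x):
        # h(x) = 1 iff some variable still in the hypothesis is set to 1
        return int(any(x[i] == 1 for i in self.relevant))

    def update(self, x, label):
        # The invariant allows only one kind of mistake: h(x) = 1 but the
        # true label is 0. Remove every variable set to 1 in x; none of
        # them can appear in the target disjunction, since x was negative.
        if label == 0 and self.predict(x) == 1:
            self.relevant -= {i for i in self.relevant if x[i] == 1}
```

Plugged into the earlier interaction loop, run_online(OnlineDisjunctionLearner(d), stream) returns the total number of mistakes, which the argument above bounds by d on any stream consistent with some disjunction.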

