
by definition of $w^*$. Similarly, if $x_t$ is a negative example, then
$$(w - x_t)^T w^* = w^T w^* - x_t^T w^* \ge w^T w^* + 1.$$

Next, on each mistake, we claim that $|w|^2$ increases by at most $R^2$. Let us first consider mistakes on positive examples. If we make a mistake on a positive example $x_t$, then we have

$$(w + x_t)^T (w + x_t) = |w|^2 + 2x_t^T w + |x_t|^2 \le |w|^2 + |x_t|^2 \le |w|^2 + R^2,$$

where the middle inequality comes from the fact that we made a mistake, which means that $x_t^T w \le 0$. Similarly, if we make a mistake on a negative example $x_t$, then we have

$$(w - x_t)^T (w - x_t) = |w|^2 - 2x_t^T w + |x_t|^2 \le |w|^2 + |x_t|^2 \le |w|^2 + R^2.$$

Note that it is important here that we only update on a mistake.

So, if we make $M$ mistakes, then $w^T w^* \ge M$ and $|w|^2 \le MR^2$, or equivalently, $|w| \le R\sqrt{M}$. Finally, we use the fact that $w^T w^* / |w^*| \le |w|$, which is just saying that the projection of $w$ in the direction of $w^*$ cannot be larger than the length of $w$. This gives us
$$M/|w^*| \le R\sqrt{M},$$
$$\sqrt{M} \le R|w^*|,$$
$$M \le R^2 |w^*|^2,$$
as desired.
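The proof translates directly into code. Below is a minimal sketch of the update rule analyzed above (the function name `perceptron` and the `(x, y)` pair format for examples are our own conventions, not the book's): predict $\mathrm{sign}(x_t^T w)$, and add or subtract $x_t$ only on a mistake. On linearly separable data, the total mistake count stays within the $R^2 |w^*|^2$ bound just derived.

```python
import numpy as np

def perceptron(examples, max_passes=100):
    """Run the Perceptron update rule from the proof above.

    examples: list of (x, y) pairs, x a numpy array, y in {+1, -1}.
    Returns (w, total number of mistakes made).
    """
    w = np.zeros(len(examples[0][0]))
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in examples:
            # A mistake means sign(x^T w) disagrees with y, i.e.
            # y * (x^T w) <= 0 (ties count as mistakes).
            if y * x.dot(w) <= 0:
                w = w + y * x      # add x_t on positive, subtract on negative
                mistakes += 1
                clean_pass = False
        if clean_pass:             # a full pass with no mistakes: done
            break
    return w, mistakes
```

Since the algorithm updates only on mistakes, cycling repeatedly through separable data must terminate: each pass either makes no mistakes or consumes part of the finite mistake budget $R^2 |w^*|^2$.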

6.5.4 Extensions: inseparable data and hinge-loss<br />

We assumed above that there existed a perfect $w^*$ that correctly classified all the examples, e.g., correctly classified all the emails into important versus non-important. This is rarely the case in real-life data. What if even the best $w^*$ isn't quite perfect? We can see what this does to the above proof: if there is an example that $w^*$ doesn't correctly classify, then while the second part of the proof still holds, the first part (the dot product of $w$ with $w^*$ increasing) breaks down. However, if this doesn't happen too often, and also $x_t^T w^*$ is just a "little bit wrong," then we will only make a few more mistakes.

To make this formal, define the hinge-loss of $w^*$ on a positive example $x_t$ as $\max(0, 1 - x_t^T w^*)$. In other words, if $x_t^T w^* \ge 1$ as desired, then the hinge-loss is zero; else, the hinge-loss is the amount the LHS is less than the RHS.$^{21}$ Similarly, the hinge-loss of $w^*$ on a negative example $x_t$ is $\max(0, 1 + x_t^T w^*)$. Given a sequence of labeled examples $S$, define the total hinge-loss $L_{\text{hinge}}(w^*, S)$ as the sum of the hinge-losses of $w^*$ on all examples in $S$.
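As a sanity check on this definition, here is a small sketch (the helper name `total_hinge_loss` is our own, using the same `(x, y)` example format as above). For labels $y \in \{+1, -1\}$, the two cases collapse to the single expression $\max(0, 1 - y\, x_t^T w^*)$.

```python
def total_hinge_loss(w_star, examples):
    """Total hinge-loss L_hinge(w*, S).

    On a positive example (y = +1) the loss is max(0, 1 - x^T w*);
    on a negative example (y = -1) it is max(0, 1 + x^T w*).
    Both cases equal max(0, 1 - y * x^T w*).
    """
    return sum(max(0.0, 1.0 - y * x.dot(w_star)) for x, y in examples)
```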

We now get the following extended theorem.<br />

$^{21}$This is called "hinge-loss" because, as a function of $x_t^T w^*$, it looks like a hinge.

