
Theorem 6.9 On any sequence of examples $S = x_1, x_2, \ldots$, the Perceptron algorithm makes at most
$$\min_{w^*} \left( R^2 |w^*|^2 + 2 L_{\mathrm{hinge}}(w^*, S) \right)$$
mistakes, where $R = \max_t |x_t|$.

Proof: As before, each update of the Perceptron algorithm increases $|w|^2$ by at most $R^2$, so if the algorithm makes $M$ mistakes, we have $|w|^2 \leq M R^2$.

What we can no longer say is that each update of the algorithm increases $w^T w^*$ by at least 1. Instead, on a positive example we are "increasing" $w^T w^*$ by $x_t^T w^*$ (it could be negative), which is at least $1 - L_{\mathrm{hinge}}(w^*, x_t)$. Similarly, on a negative example we "increase" $w^T w^*$ by $-x_t^T w^*$, which is also at least $1 - L_{\mathrm{hinge}}(w^*, x_t)$. Summing this over all mistakes, at the end we have $w^T w^* \geq M - L_{\mathrm{hinge}}(w^*, S)$, where we use the fact that hinge-loss is never negative, so summing over all of $S$ is only larger than summing over the mistakes that $w$ made.

Finally, we just do some algebra. Let $L = L_{\mathrm{hinge}}(w^*, S)$. So we have:
$$w^T w^* / |w^*| \leq |w|$$
$$(w^T w^*)^2 \leq |w|^2 |w^*|^2$$
$$(M - L)^2 \leq M R^2 |w^*|^2$$
$$M^2 - 2ML + L^2 \leq M R^2 |w^*|^2$$
$$M - 2L + L^2/M \leq R^2 |w^*|^2$$
$$M \leq R^2 |w^*|^2 + 2L - L^2/M \leq R^2 |w^*|^2 + 2L$$
as desired.
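The guarantee above is easy to check numerically. The following is a minimal Python sketch (not from the text): it runs the Perceptron on a slightly noisy data set, counts its mistakes $M$, and compares against $R^2|w^*|^2 + 2L_{\mathrm{hinge}}(w^*, S)$ for one hand-picked comparator $w^*$. The function names, the toy data, and the particular choice of $w^*$ are illustrative assumptions; the bound holds for every $w^*$, so any choice gives a valid (if loose) upper bound on $M$.

```python
import numpy as np

def perceptron_mistakes(X, y):
    """Run the Perceptron on the sequence (x_1, y_1), ..., updating
    w <- w + y_t * x_t whenever y_t * (w . x_t) <= 0.
    Returns the final w and the number of mistakes M."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        if y_t * np.dot(w, x_t) <= 0:   # mistake (or zero margin): update
            w = w + y_t * x_t
            mistakes += 1
    return w, mistakes

def total_hinge_loss(w_star, X, y):
    """L_hinge(w*, S) = sum_t max(0, 1 - y_t * (w* . x_t))."""
    margins = y * (X @ w_star)
    return np.sum(np.maximum(0.0, 1.0 - margins))

# Toy data: nearly linearly separable, with a few labels flipped so that
# no w* achieves zero hinge loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
y[:3] = -y[:3]

w, M = perceptron_mistakes(X, y)
R = np.max(np.linalg.norm(X, axis=1))
w_star = 5.0 * np.array([1.0, 0.5])     # hand-picked comparator (not optimized)
bound = R**2 * np.dot(w_star, w_star) + 2 * total_hinge_loss(w_star, X, y)
print(f"mistakes M = {M}, bound = {bound:.1f}")   # expect M <= bound
```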

6.6 Kernel functions<br />

What if even the best $w^*$ has high hinge-loss? E.g., perhaps instead of a linear separator decision boundary, the boundary between important emails and unimportant emails looks more like a circle, for example as in Figure 6.4.

A powerful idea for addressing situations like this is to use what are called kernel functions, or sometimes the "kernel trick". Here is the idea. Suppose you have a function $K$, called a "kernel", over pairs of data points such that for some function $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^N$, where perhaps $N \gg d$, we have $K(x, x') = \phi(x)^T \phi(x')$. In that case, if we can write the Perceptron algorithm so that it only interacts with the data via dot-products, and then replace every dot-product with an invocation of $K$, then we can act as if we had performed the function $\phi$ explicitly without having to actually compute $\phi$.
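To make the "only interacts with the data via dot-products" point concrete, here is a minimal sketch of a kernelized Perceptron; it is one possible implementation, assumed for illustration, not code from the text. Since the weight vector in $\phi$-space is a sum of $\pm\phi(x_i)$ over the examples updated on so far, the quantity $w^T \phi(x)$ can be evaluated as a sum of kernel values $K(x_i, x)$ without ever forming $\phi$.

```python
import numpy as np

def poly_kernel(x, xp, k=2):
    """K(x, x') = (1 + x . x')^k, the polynomial kernel discussed below."""
    return (1.0 + np.dot(x, xp)) ** k

def kernel_perceptron(X, y, kernel=poly_kernel):
    """Perceptron in phi-space without computing phi.
    After updating on examples x_i with labels y_i, w = sum_i y_i * phi(x_i),
    so w . phi(x) = sum_i y_i * K(x_i, x)."""
    mistake_idx = []                       # indices of examples we updated on
    for t, (x_t, y_t) in enumerate(zip(X, y)):
        score = sum(y[i] * kernel(X[i], x_t) for i in mistake_idx)
        if y_t * score <= 0:               # mistake (or zero): "add" y_t * phi(x_t)
            mistake_idx.append(t)
    return mistake_idx

# Toy example in the spirit of Figure 6.4: labels determined by a circle,
# so no linear separator exists in R^2, but the degree-2 polynomial
# kernel corresponds to a space where the classes are separable.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where(np.sum(X**2, axis=1) < 0.5, 1.0, -1.0)
print("mistakes:", len(kernel_perceptron(X, y)))
```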

For example, consider $K(x, x') = (1 + x^T x')^k$ for some integer $k \geq 1$. It turns out this corresponds to a mapping $\phi$ into a space of dimension $N \approx d^k$. For example, in the case
