
Foundations of Data Science


Figure 6.4: Data that is not linearly separable in the input space ℝ² but that is linearly separable in the "φ-space," φ(x) = (1, √2x₁, √2x₂, x₁², √2x₁x₂, x₂²), corresponding to the kernel function K(x, y) = (1 + x₁y₁ + x₂y₂)².

For d = 2, k = 2 we have (using xᵢ to denote the ith coordinate of x):

K(x, x′) = (1 + x₁x₁′ + x₂x₂′)²
         = 1 + 2x₁x₁′ + 2x₂x₂′ + x₁²(x₁′)² + 2x₁x₂x₁′x₂′ + x₂²(x₂′)²
         = φ(x)ᵀφ(x′)

for φ(x) = (1, √2x₁, √2x₂, x₁², √2x₁x₂, x₂²). Notice also that a linear separator in this space could correspond to a more complicated decision boundary such as an ellipse in the original space. For instance, the hyperplane φ(x)ᵀw* = 0 for w* = (−4, 0, 0, 1, 0, 1) corresponds to the circle x₁² + x₂² = 4 in the original space, as in Figure 6.4.
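As a quick sanity check (not from the text), both the identity K(x, x′) = φ(x)ᵀφ(x′) and the circle interpretation of w* can be verified numerically. The sketch below assumes NumPy; the names phi and K are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: phi(x) = (1, √2·x1, √2·x2, x1², √2·x1·x2, x2²)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, np.sqrt(2)*x1*x2, x2**2])

def K(x, y):
    """Degree-2 polynomial kernel K(x, y) = (1 + x·y)²."""
    return (1.0 + np.dot(x, y))**2

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

# The kernel equals the dot product in phi-space, with no explicit mapping needed.
assert np.isclose(K(x, xp), phi(x) @ phi(xp))

# The hyperplane phi(x)·w* = 0 with w* = (-4, 0, 0, 1, 0, 1) is the circle x1² + x2² = 4.
w_star = np.array([-4.0, 0.0, 0.0, 1.0, 0.0, 1.0])
on_circle = np.array([2.0, 0.0])   # satisfies x1² + x2² = 4
assert np.isclose(phi(on_circle) @ w_star, 0.0)
```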

The point of this is that if in the higher-dimensional "φ-space" there is a w* such that the bound of Theorem 6.9 is small, then the algorithm will perform well and make few mistakes. The nice thing is that we did not have to computationally perform the mapping φ!

So, how can we view the Perceptron algorithm as only interacting with the data via dot-products? Notice that w is always a linear combination of data points. For example, if we made mistakes on the first, second, and fifth examples, and these examples were positive, positive, and negative respectively, we would have w = x₁ + x₂ − x₅. So, if we keep track of w this way, then to predict on a new example xₜ we can write xₜᵀw = xₜᵀx₁ + xₜᵀx₂ − xₜᵀx₅. If we just replace each of these dot-products with "K", we are running the algorithm as if we had explicitly performed the φ mapping. This is called "kernelizing" the algorithm, as illustrated in the sketch below.
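To make this concrete, here is a minimal sketch (not from the text) of a kernelized Perceptron in Python, assuming NumPy; the function and variable names are illustrative. Instead of storing w, it stores the indices and signs of the examples on which mistakes were made, and predicts using only kernel evaluations, exactly as described above.

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    """K(x, y) = (1 + x·y)^degree; degree = 2 gives the kernel of Figure 6.4."""
    return (1.0 + np.dot(x, y))**degree

def kernel_perceptron(X, labels, K=poly_kernel, epochs=10):
    """Kernelized Perceptron: w is kept implicitly as a signed sum of the
    mistake examples, so the data is touched only through K."""
    mistakes, signs = [], []          # indices building w, and their +/-1 signs
    for _ in range(epochs):
        clean_pass = True
        for t, (x_t, y_t) in enumerate(zip(X, labels)):
            # x_t^T w  =  sum over stored mistakes of  sign_i * K(x_t, x_i)
            score = sum(s * K(x_t, X[i]) for i, s in zip(mistakes, signs))
            if y_t * score <= 0:      # mistake (score 0 counts as a mistake)
                mistakes.append(t)
                signs.append(y_t)
                clean_pass = False
        if clean_pass:
            break
    return mistakes, signs

# Toy data matching Figure 6.4: +1 inside the circle x1² + x2² = 4, -1 outside.
# Not linearly separable in R², but separable in the phi-space of the kernel.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 2))
labels = np.where((X**2).sum(axis=1) < 4, 1, -1)
mistakes, signs = kernel_perceptron(X, labels)
preds = [np.sign(sum(s * poly_kernel(x, X[i]) for i, s in zip(mistakes, signs)))
         for x in X]
print("training accuracy:", np.mean(np.array(preds) == labels))
```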

Many different pairwise functions on examples are legal kernel functions. One easy way to create a kernel function is by combining other kernel functions together, via the following theorem.

Theorem 6.10 Suppose K₁ and K₂ are kernel functions. Then
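As a hedged illustration of such combinations, the sketch below assumes the standard closure facts that a nonnegative-weighted sum of kernel functions, and a product of kernel functions, are again kernel functions; the kernel choices and helper names are not from the text.

```python
import numpy as np

def linear_kernel(x, y):
    """K1(x, y) = x·y."""
    return np.dot(x, y)

def quadratic_kernel(x, y):
    """K2(x, y) = (1 + x·y)²."""
    return (1.0 + np.dot(x, y))**2

# Assumed standard closure properties (not quoted from the truncated theorem):
# nonnegative combinations and products of kernels are again kernels.
def sum_kernel(x, y):
    return 3.0 * linear_kernel(x, y) + quadratic_kernel(x, y)

def product_kernel(x, y):
    return linear_kernel(x, y) * quadratic_kernel(x, y)

# Sanity check: a kernel's Gram matrix must be symmetric positive semidefinite.
rng = np.random.default_rng(2)
pts = rng.standard_normal((6, 2))
for K in (sum_kernel, product_kernel):
    G = np.array([[K(a, b) for b in pts] for a in pts])
    assert np.allclose(G, G.T)
    assert np.min(np.linalg.eigvalsh(G)) > -1e-9   # PSD up to numerical error
```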
