
1. For any constant $c \geq 0$, $cK_1$ is a legal kernel. In fact, for any scalar function $f$, the function $K_3(x, x') = f(x)f(x')K_1(x, x')$ is a legal kernel.

2. The sum $K_1 + K_2$ is a legal kernel.

3. The product $K_1 K_2$ is a legal kernel.

You will prove Theorem 6.10 in Exercise 6.9. Notice that this immediately implies that the function $K(x, x') = (1 + x^T x')^k$ is a legal kernel, by using the fact that $K_1(x, x') = 1$ is a legal kernel and $K_2(x, x') = x^T x'$ is a legal kernel, then adding them, and then multiplying that sum by itself $k$ times. Another popular kernel is the Gaussian kernel, defined as:
$$K(x, x') = e^{-c|x - x'|^2}.$$

If we think of a kernel as a measure of similarity, then this kernel defines the similarity between two data objects as a quantity that decreases exponentially with the squared distance between them. The Gaussian kernel can be shown to be a true kernel function by first writing it as $f(x)f(x')e^{2c\,x^T x'}$ for $f(x) = e^{-c|x|^2}$, and then taking the Taylor expansion $e^{2c\,x^T x'} = \sum_{j \geq 0} \frac{(2c\,x^T x')^j}{j!}$ and applying the rules in Theorem 6.10. Technically, this last step requires considering countably infinitely many applications of the rules and allowing for infinite-dimensional vector spaces.
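As a concrete illustration, the following Python sketch builds the polynomial kernel $(1 + x^T x')^k$ from the constant and linear kernels as described above, defines the Gaussian kernel directly, and numerically checks that the resulting Gram matrices are positive semidefinite. The function names, parameter values, and the eigenvalue check are illustrative assumptions, not part of the text.

```python
# Minimal sketch of the kernel-composition rules of Theorem 6.10 using NumPy.
import numpy as np

def polynomial_kernel(x, xp, k=3):
    """(1 + x^T x')^k: the constant kernel plus the linear kernel, multiplied by itself k times."""
    return (1.0 + np.dot(x, xp)) ** k

def gaussian_kernel(x, xp, c=0.5):
    """e^{-c |x - x'|^2}: similarity decays exponentially with squared distance."""
    return np.exp(-c * np.sum((x - xp) ** 2))

def gram_matrix(kernel, points):
    """Kernel (Gram) matrix K[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(5, 3))            # five random points in R^3
    for kern in (polynomial_kernel, gaussian_kernel):
        K = gram_matrix(kern, pts)
        # A legal kernel yields a positive semidefinite Gram matrix on any point set.
        print(kern.__name__, "min eigenvalue:", np.linalg.eigvalsh(K).min())
```

On any finite point set, a legal kernel must produce a positive semidefinite Gram matrix, which is what the eigenvalue check verifies up to floating-point error.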

6.7 Online to Batch Conversion

Suppose we have an online algorithm with a good mistake bound, such as the Perceptron algorithm. Can we use it to get a guarantee in the distributional (batch) learning setting? Intuitively, the answer should be yes, since the online setting is only harder. Indeed, this intuition is correct. We present here two natural approaches for such online to batch conversion.

Conversion procedure 1: Random Stopping. Suppose we have an online algorithm $A$ with mistake-bound $M$. Say we run the algorithm in a single pass on a sample $S$ of size $M/\varepsilon$. Let $X_t$ be the indicator random variable for the event that $A$ makes a mistake on the $t$th example. Since $\sum_{t=1}^{|S|} X_t \leq M$ for any set $S$, we certainly have that $E\big[\sum_{t=1}^{|S|} X_t\big] \leq M$, where the expectation is taken over the random draw of $S$ from $D^{|S|}$. By linearity of expectation, and dividing both sides by $|S|$, we therefore have:
$$\frac{1}{|S|} \sum_{t=1}^{|S|} E[X_t] \leq M/|S| = \varepsilon. \tag{6.1}$$

Let $h_t$ denote the hypothesis used by algorithm $A$ to predict on the $t$th example. Since the $t$th example was randomly drawn from $D$, we have $E[\mathrm{err}_D(h_t)] = E[X_t]$. This means that if we choose $t$ at random from $1$ to $|S|$, i.e., stop the algorithm at a random time, the expected error of the resulting prediction rule, taken over the randomness in the draw of $S$ and the choice of $t$, is at most $\varepsilon$ as given by equation (6.1). Thus we have:
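A minimal sketch of this random-stopping conversion, using a simple Perceptron as the online algorithm $A$ on synthetic linearly separable data, is given below. The sample sizes, labels, and helper names are illustrative assumptions only; the point is just to show "run one pass, stop at a random time $t$, and output the hypothesis $h_t$ in use at that moment."

```python
# Minimal sketch of online-to-batch conversion 1 (random stopping) with a Perceptron.
import numpy as np

def perceptron_pass(samples, labels):
    """One online pass; returns the hypotheses h_1, ..., h_{|S|} used to predict."""
    w = np.zeros(samples.shape[1])
    hypotheses = []
    for x, y in zip(samples, labels):
        hypotheses.append(w.copy())        # h_t: the hypothesis used on the t-th example
        if y * np.dot(w, x) <= 0:          # mistake on (x, y)
            w = w + y * x                  # Perceptron update
    return hypotheses

def random_stopping(samples, labels, rng):
    """Pick t uniformly at random and output h_t."""
    hypotheses = perceptron_pass(samples, labels)
    t = rng.integers(len(hypotheses))
    return hypotheses[t]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # labels from a linear target
    w = random_stopping(X, y, rng)
    X_test = rng.normal(size=(1000, 2))
    y_test = np.where(X_test[:, 0] + X_test[:, 1] > 0, 1, -1)
    err = np.mean(np.sign(X_test @ w) != y_test)      # estimate of err_D(h_t)
    print("test error of randomly stopped hypothesis:", err)
```

In expectation over the draw of $S$ and the random choice of the stopping time $t$, the error of the returned hypothesis is bounded as in equation (6.1).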

