Foundations of Data Science

6.16 Exercises

Exercise 6.1 (Sections 6.2 and 6.3) Consider the instance space X = {0,1}^d and let H be the class of 3-CNF formulas. That is, H is the set of concepts that can be described as a conjunction of clauses where each clause is an OR of up to 3 literals. (These are also called 3-SAT formulas.) For example, c* might be (x_1 ∨ x̄_2 ∨ x_3)(x_2 ∨ x_4)(x̄_1 ∨ x_3)(x_2 ∨ x_3 ∨ x_4). Assume we are in the PAC learning setting, so examples are drawn from some underlying distribution D and labeled by some 3-CNF formula c*.

1. Give a number of samples m that would be sufficient to ensure that with probability ≥ 1 − δ, all 3-CNF formulas consistent with the sample have error at most ɛ with respect to D.

2. Give a polynomial-time algorithm for PAC-learning the class of 3-CNF formulas.
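Not part of the original exercise: a minimal Python sketch of how one might plug numbers into a consistent-learner (Occam-style) bound of the form m ≥ (1/ɛ)(ln|H| + ln(1/δ)), in the spirit of Section 6.2, for part 1. The clause-count bound (2d+1)^3 and the example parameters are deliberately crude illustrative assumptions, not the book's answer.

```python
import math

def occam_sample_bound(log_H: float, eps: float, delta: float) -> int:
    """Sufficient sample size m >= (1/eps) * (ln|H| + ln(1/delta)) for a
    learner that outputs any hypothesis consistent with the sample."""
    return math.ceil((log_H + math.log(1.0 / delta)) / eps)

def threecnf_log_size_upper_bound(d: int) -> float:
    """A crude upper bound on ln|H| for 3-CNF over d variables: each of the
    at most (2d + 1)**3 possible clauses (each of three slots is one of the
    2d literals or empty) is either present or absent, so
    |H| <= 2**((2d + 1)**3)."""
    return ((2 * d + 1) ** 3) * math.log(2)

# Illustrative numbers only: d = 20 variables, eps = 0.1, delta = 0.05.
print(occam_sample_bound(threecnf_log_size_upper_bound(20), eps=0.1, delta=0.05))
```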

Exercise 6.2 (Section 6.2) Consider the instance space X = R, and the class of functions H = {f_a : f_a(x) = 1 iff x ≥ a} for a ∈ R. That is, H is the set of all threshold functions on the line. Prove that for any distribution D, a sample S of size O((1/ɛ) log(1/δ)) is sufficient to ensure that with probability ≥ 1 − δ, any f_{a′} such that err_S(f_{a′}) = 0 has err_D(f_{a′}) ≤ ɛ. Note that you can answer this question from first principles, without using the concept of VC-dimension.
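Not from the text: a minimal sketch of one consistent learner for threshold functions, useful for experimenting with the setup (it is not a proof, and the data shown are hypothetical). It assumes the labels come from some unknown true threshold a*.

```python
import math

def learn_threshold(samples):
    """Consistent learner for threshold functions on the line: given labeled
    examples (x, y) with y = 1 iff x >= a* for an unknown a*, return the
    smallest positive example as the learned threshold (or +inf if the
    sample contains no positive examples).  The returned f_a makes no
    mistakes on the sample."""
    positives = [x for x, y in samples if y == 1]
    return min(positives) if positives else math.inf

# Illustrative use with a hypothetical true threshold a* = 0.3.
data = [(0.1, 0), (0.25, 0), (0.32, 1), (0.7, 1)]
print(learn_threshold(data))   # -> 0.32, consistent with the sample
```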

Exercise 6.3 (Perceptron; Section 6.5.3) Consider running the Perceptron algorithm in the online model on some sequence of examples S. Let S′ be the same set of examples as S but presented in a different order. Does the Perceptron algorithm necessarily make the same number of mistakes on S as it does on S′? If so, why? If not, show such an S and S′ (consisting of the same set of examples in a different order) where the Perceptron algorithm makes a different number of mistakes on S′ than it does on S.
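Not from the book: a minimal sketch of the online Perceptron with a mistake counter, which can be used to experiment with different orderings of the same example set. The labels in {−1, +1}, the tie-breaking rule at w · x = 0, and the sample data are assumptions made for illustration.

```python
import numpy as np

def perceptron_mistakes(sequence):
    """Run the online Perceptron (w starts at 0; on a mistake, w += label * x)
    over a sequence of (x, label) pairs with labels in {-1, +1}, and return
    the number of mistakes.  An example with label * (w . x) <= 0 counts as
    a mistake (so ties at 0 trigger an update)."""
    w = np.zeros(len(sequence[0][0]))
    mistakes = 0
    for x, label in sequence:
        x = np.asarray(x, dtype=float)
        if label * np.dot(w, x) <= 0:
            w += label * x
            mistakes += 1
    return mistakes

# Hypothetical experiment: same multiset of examples, two different orders.
S  = [([1.0, 0.2], +1), ([0.1, 1.0], -1), ([0.9, 0.9], +1)]
S2 = [S[2], S[0], S[1]]
print(perceptron_mistakes(S), perceptron_mistakes(S2))
```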

Exercise 6.4 (representation and linear separators) Show that any disjunction (see Section 6.3.1) over {0,1}^d can be represented as a linear separator. Show, moreover, that the margin of separation is Ω(1/√d).
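Not from the book: a small Python sketch illustrating the first claim for monotone disjunctions (a simplifying assumption; negated literals need a small extra shift of the threshold). The choice of d and of relevant variables is hypothetical, and the brute-force check is only feasible for small d.

```python
import itertools
import numpy as np

def disjunction_as_separator(relevant, d):
    """Map the monotone disjunction OR of x_i for i in `relevant`, over
    {0,1}^d, to a linear separator: w_i = 1 on the relevant variables,
    0 elsewhere, with threshold 1/2, so the disjunction is 1 exactly
    when w . x >= 1/2."""
    w = np.zeros(d)
    w[list(relevant)] = 1.0
    return w, 0.5

# Brute-force agreement check for a hypothetical example: x_0 OR x_2 over {0,1}^4.
d, relevant = 4, [0, 2]
w, theta = disjunction_as_separator(relevant, d)
for x in itertools.product([0, 1], repeat=d):
    assert any(x[i] for i in relevant) == (np.dot(w, np.array(x)) >= theta)
print("separator agrees with the disjunction on all of {0,1}^%d" % d)
```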

Exercise 6.5 (Linear separators; easy) Show that the parity function on d ≥ 2 Boolean variables cannot be represented by a linear threshold function. The parity function is 1 if and only if an odd number of inputs is 1.
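Not from the book: as an illustration of the flavor of argument, the d = 2 case (where parity is XOR) written out; the exercise asks for the general case d ≥ 2.

```latex
% Suppose a linear threshold function with weights w_1, w_2 and threshold
% \theta represented parity on two variables, i.e.
% parity(x) = 1 \iff w_1 x_1 + w_2 x_2 \ge \theta.  Then:
\begin{align*}
\text{parity}(1,0) = 1 &\;\Rightarrow\; w_1 \ge \theta, \\
\text{parity}(0,1) = 1 &\;\Rightarrow\; w_2 \ge \theta, \\
\text{parity}(0,0) = 0 &\;\Rightarrow\; 0 < \theta, \\
\text{parity}(1,1) = 0 &\;\Rightarrow\; w_1 + w_2 < \theta.
\end{align*}
% Adding the first two lines gives w_1 + w_2 \ge 2\theta > \theta,
% contradicting the last line, so no such w_1, w_2, \theta exist.
```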

Exercise 6.6 (Perceptron; Section 6.5.3) We know the Perceptron algorithm makes at most 1/γ² mistakes on any sequence of examples that is separable by margin γ (we assume all examples are normalized to have length 1). However, it need not find a separator of large margin. If we also want to find a separator of large margin, a natural alternative is to update on any example x such that f*(x)(w · x) < 1; this is called the margin perceptron algorithm.

1. Argue why margin perceptron is equivalent to running stochastic gradient descent on the class of linear predictors (f_w(x) = w · x) using hinge loss as the loss function and using λ_t = 1. (A sketch of the update rule appears below.)
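Not from the book: a minimal Python sketch of the margin perceptron update described above, making the link to the hinge loss max(0, 1 − f*(x)(w · x)) concrete. The multi-pass loop, the labels in {−1, +1}, and the sample data are illustrative assumptions, not the book's solution.

```python
import numpy as np

def margin_perceptron(sequence, passes=1):
    """Margin perceptron as described above: starting from w = 0, update
    w += label * x on any example with label * (w . x) < 1.  Each update is
    one gradient step of size 1 on the hinge loss max(0, 1 - label * (w . x)),
    whose (sub)gradient in that region is -label * x; where the loss is zero,
    no update is made."""
    w = np.zeros(len(sequence[0][0]))
    for _ in range(passes):
        for x, label in sequence:
            x = np.asarray(x, dtype=float)
            if label * np.dot(w, x) < 1:
                w += label * x
    return w

# Hypothetical linearly separable data with labels in {-1, +1}.
data = [([1.0, 0.1], +1), ([0.9, -0.2], +1),
        ([-1.0, 0.3], -1), ([-0.8, -0.1], -1)]
print(margin_perceptron(data, passes=5))
```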

