The error rate of learning halfspaces using kernel-SVM

It is shown in (Birnbaum and Shalev-Shwartz, 2012) that solving kernel SVM with some specific kernel (i.e., some specific $\psi$) yields an approximation ratio of $O\left(\frac{1}{\gamma\sqrt{\log(1/\gamma)}}\right)$. It follows that our lower bound in Theorem 2.6 is essentially tight. Also, this theorem can be viewed as a substantial generalization of (Ben-David et al., 2012; Long and Servedio, 2011), who give an approximation ratio of $\Omega\left(\frac{1}{\gamma}\right)$ with no embedding (i.e., $\psi$ is the identity map). Also relevant is (Shalev-Shwartz et al., 2011), which shows that for a certain $\psi$, and $m_A(\gamma) = \mathrm{poly}\left(\exp\left((1/\gamma)\cdot\log(1/\gamma)\right)\right)$, kernel SVM has an approximation ratio of 1. Theorem 2.6 shows that for a kernel-based learner to achieve a constant approximation ratio, $m_A$ must be exponential in $1/\gamma$.

Next we give lower bounds on the performance of finite dimensional learners.

Theorem 2.7 Let $\ell$ be a Lipschitz surrogate loss and let $A$ be a finite dimensional learner w.r.t. $\ell$. Assume that $m_A(\gamma) = \exp(o(\gamma^{-1/8}))$. Then, for every $\gamma > 0$, there exists a distribution $D$ on $S^{d-1}\times\{\pm 1\}$ with $d = O(\log(m_A(\gamma)/\gamma))$ such that, w.p. $\geq 1 - \exp(-1/\gamma)$,

$$\frac{\mathrm{Err}_{D,0-1}(A(\gamma))}{\mathrm{Err}_\gamma(D)} \geq \Omega\left(\frac{1}{\sqrt{\gamma}\,\mathrm{poly}(\log(m_A(\gamma)/\gamma))}\right).$$

Corollary 2.8 Let $\ell$ be a Lipschitz surrogate loss and let $A$ be an efficient finite dimensional learner w.r.t. $\ell$. Then, for every $\gamma > 0$, there exists a distribution $D$ on $S^{d-1}\times\{\pm 1\}$ with $d = O(\log(1/\gamma))$ such that, w.p. $\geq 1 - \exp(-1/\gamma)$,

$$\frac{\mathrm{Err}_{D,0-1}(A(\gamma))}{\mathrm{Err}_\gamma(D)} \geq \Omega\left(\frac{1}{\sqrt{\gamma}\,\mathrm{poly}(\log(1/\gamma))}\right).$$

Corollary 2.9 Let $\ell$ be a Lipschitz surrogate loss, let $\epsilon > 0$ and let $A$ be a finite dimensional learner w.r.t. $\ell$ such that for every $\gamma > 0$ and every distribution $D$ on $B^d$ with $d = \omega(\log(1/\gamma))$ it holds that, w.p. $\geq 1/2$,

$$\frac{\mathrm{Err}_{D,0-1}(A(\gamma))}{\mathrm{Err}_\gamma(D)} \leq \left(\frac{1}{\gamma}\right)^{\frac{1}{2}-\epsilon}.$$

Then, for some $a = a(\epsilon) > 0$, $m_A(\gamma) = \Omega\left(\exp\left((1/\gamma)^a\right)\right)$.

2.2 Review of the proofs' main ideas

To give the reader some idea of our arguments, we sketch some of the main ingredients of the proof of Theorem 2.6. At the end of this section we sketch the idea of the proof of Theorem 2.7. We note, however, that the actual proofs are organized somewhat differently.

We will construct a distribution $D$ over $S^{d-1}\times\{\pm 1\}$ (recall that $\mathbb{R}^d$ is viewed as standardly embedded in $H = \ell_2$). Thus, we can assume that the program is formulated in terms of the unit sphere, $S^\infty \subset \ell_2$, and not the unit ball.

Fix an embedding $\psi$ and $C > 0$. Denote by $k : S^\infty \times S^\infty \to \mathbb{R}$ the corresponding kernel $k(x, y) = \langle\psi(x), \psi(y)\rangle_H$, and consider the following set of functions over $S^\infty$:

$$\mathcal{H}_k = \{\Lambda_{v,0} \circ \psi : v \in H_1\}.$$
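As a purely illustrative aside (not part of the paper), the following Python sketch shows the objects the construction works with: an explicit embedding $\psi$, the induced kernel $k(x,y) = \langle\psi(x),\psi(y)\rangle$, a hypothesis of the form $x \mapsto \langle v, \psi(x)\rangle$ with $\|v\| \le 1$ taken from $\mathcal{H}_k$, and empirical versions of the two error measures whose ratio the theorems bound. The toy feature map, the crude choice of $v$ (no SVM training is performed), and the reading of $\Lambda_{v,0}$ as the linear functional $z \mapsto \langle v, z\rangle$ are assumptions made for the example; the paper's own definitions of $\Lambda_{v,0}$, $\mathrm{Err}_{D,0-1}$, and $\mathrm{Err}_\gamma(D)$ appear in earlier sections.

```python
# Illustrative sketch only: a kernel induced by an explicit embedding psi,
# a hypothesis from H_k, and empirical stand-ins for the two error measures.
# Assumptions (not from the paper): psi is a toy degree-2 feature map,
# Lambda_{v,0} is read as z -> <v, z>, and v is chosen crudely, not by SVM.
import numpy as np

rng = np.random.default_rng(0)

def psi(x):
    """Toy embedding of S^{d-1}: the coordinates plus normalized degree-2 monomials."""
    feats = np.concatenate([x, np.outer(x, x).ravel() / np.sqrt(2)])
    return feats / np.linalg.norm(feats)

def kernel(x, y):
    """k(x, y) = <psi(x), psi(y)>, as in the construction above."""
    return psi(x) @ psi(y)

def err_01(preds, labels):
    """Zero-one error of a predictor's signs."""
    return np.mean(np.sign(preds) != labels)

def err_margin(X, labels, w, gamma):
    """Fraction of points the reference halfspace w fails to classify with
    margin gamma (an empirical stand-in for Err_gamma(D))."""
    return np.mean(labels * (X @ w) < gamma)

# A toy sample on S^{d-1}, labeled by a halfspace through the origin.
d, n, gamma = 5, 2000, 0.05
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w = np.zeros(d)
w[0] = 1.0                                  # reference halfspace
labels = np.sign(X @ w)

# A hypothesis from H_k: x -> <v, psi(x)> with ||v|| <= 1, here a normalized
# label-weighted average of feature vectors rather than a trained SVM solution.
Phi = np.stack([psi(x) for x in X])
v = Phi.T @ labels
v /= np.linalg.norm(v)

ratio = err_01(Phi @ v, labels) / max(err_margin(X, labels, w, gamma), 1e-12)
print(f"empirical Err_0-1 / Err_gamma ratio: {ratio:.2f}")
```

On this easy toy distribution the empirical ratio is small; the theorems above concern adversarially chosen distributions, for which the ratio must be large unless $m_A(\gamma)$ grows rapidly.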
