The error rate of learning halfspaces using kernel-SVM
5.5.3 The integrality gap – Theorem 3.3

Our first step is a reduction to the hinge loss. Let $a = \partial^+ l(0)$ and define
$$l^*(x) = \begin{cases} ax + 1 & x \le \frac{1}{-a} \\ 0 & \text{otherwise.} \end{cases}$$
It is not hard to see that $l^*$ is a convex surrogate satisfying $l^*(x) \le l(x)$ for all $x$, and $\partial^+ l^*(0) = \partial^+ l(0)$. Thus, substituting $l^*$ for $l$ can only decrease the integrality gap, so we may assume that $l = l^*$. Now, we note that the integrality gap of program (11) with $l = l^*$ coincides with the gap obtained by replacing $C$ with $|a| \cdot C$ and $l^*$ with the hinge loss. To see this, observe that for every $f \in H_k$ and $b \in \mathbb{R}$ we have $\mathrm{Err}_{D,l^*}(f + b) = \mathrm{Err}_{D,\mathrm{hinge}}(|a| \cdot f + |a| \cdot b)$; hence, minimizing $\mathrm{Err}_{D,l^*}$ over all functions $f \in H_k$ with $\|f\|_{H_k} \le C$ is equivalent to minimizing $\mathrm{Err}_{D,\mathrm{hinge}}$ over all functions $f \in H_k$ with $\|f\|_{H_k} \le |a| \cdot C$. It is therefore enough to prove the theorem for $l = l_{\mathrm{hinge}}$.
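Since the identity behind this rescaling is used without proof, the following short worked derivation may help; it assumes only that $a = \partial^+ l(0) < 0$, which the definition of $l^*$ implicitly requires (otherwise the threshold $\frac{1}{-a}$ is not positive). Writing $a = -|a|$,
\begin{align*}
l^*(x) = \max(0,\ ax + 1) = \max\big(0,\ 1 - |a|\,x\big) = l_{\mathrm{hinge}}(|a|\, x),
\end{align*}
so for every $f \in H_k$ and $b \in \mathbb{R}$,
\begin{align*}
\mathrm{Err}_{D, l^*}(f + b)
= \mathbb{E}_{(x,y)\sim D}\, l^*\big(y\,(f(x) + b)\big)
= \mathbb{E}_{(x,y)\sim D}\, l_{\mathrm{hinge}}\big(y\,(|a| f(x) + |a| b)\big)
= \mathrm{Err}_{D,\mathrm{hinge}}(|a| f + |a| b).
\end{align*}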
Next, we show that we may assume the embedding is symmetric (i.e., corresponds to a symmetric kernel). As the integrality gap is at least as large as the approximation ratio, applying Theorem 3.2 will then complete the argument. (The reduction to the hinge loss yields bounds with universal constants in the asymptotic terms.)

Let $\gamma > 0$ and let $D$ be a distribution on $S^{d-1} \times \{\pm 1\}$. It is enough to find a (possibly different) distribution $D_1$ with the same $\gamma$-margin error as $D$, for which the optimum of program (11) (with $l = l_{\mathrm{hinge}}$) is not smaller than the optimum of the program
$$\min\ \mathrm{Err}_{D,\mathrm{hinge}}(f + b) \quad \text{s.t.}\quad f \in H_{k_s},\ b \in \mathbb{R},\ \|f\|_{H_{k_s}} \le C. \tag{17}$$
Denote the optimal value of program (17) by $\alpha$ and assume, towards a contradiction, that whenever $\mathrm{Err}_\gamma(D_1) = \mathrm{Err}_\gamma(D)$, the optimum of program (11) is strictly less than $\alpha$.

For every $A \in O(d)$, let $D_A$ be the distribution of the random variable $(Ax, y) \in S^{d-1} \times \{\pm 1\}$, where $(x, y) \sim D$. Since clearly $\mathrm{Err}_\gamma(D_A) = \mathrm{Err}_\gamma(D)$, there exist $f_A \in H_k$ and $b_A \in \mathbb{R}$ such that $\|f_A\|_{H_k} \le C$ and $\mathrm{Err}_{D_A,\mathrm{hinge}}(g_A) < \alpha$, where $g_A := f_A + b_A$. Define $f \in H_{k_s}$ by $f(x) = \int_{O(d)} f_A(Ax)\, dA$, let $b = \int_{O(d)} b_A\, dA$, and set $g = f + b$. By Theorem 5.18, $\|f\|_{H_{k_s}} \le C$. Finally, for $l = l_{\mathrm{hinge}}$,
\begin{align*}
\mathrm{Err}_{D,\mathrm{hinge}}(g) &= \mathbb{E}_{(x,y)\sim D}\, l(y\, g(x)) \\
&= \mathbb{E}_{(x,y)\sim D}\, l\big(y\, \mathbb{E}_{A\sim O(d)}\, g_A(Ax)\big) \\
&\le \mathbb{E}_{(x,y)\sim D}\, \mathbb{E}_{A\sim O(d)}\, l\big(y\, g_A(Ax)\big) && \text{(Jensen's inequality, as $l$ is convex)} \\
&= \mathbb{E}_{A\sim O(d)}\, \mathbb{E}_{(x,y)\sim D}\, l\big(y\, g_A(Ax)\big) \\
&= \mathbb{E}_{A\sim O(d)}\, \mathbb{E}_{(x,y)\sim D_A}\, l\big(y\, g_A(x)\big) < \alpha,
\end{align*}
contrary to the assumption that $\alpha$ is the optimum of program (17).
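The inequality in the chain above deserves one spelled-out step: the $O(d)$-average passes inside the argument of $l$ by linearity, and convexity of $l$ then gives Jensen's inequality pointwise. Concretely, for every fixed $(x, y)$,
\begin{align*}
y\, g(x) = y\big(f(x) + b\big)
&= \mathbb{E}_{A \sim O(d)}\big[\, y\,(f_A(Ax) + b_A)\,\big]
 = \mathbb{E}_{A \sim O(d)}\big[\, y\, g_A(Ax)\,\big], \\
\text{hence}\quad l\big(y\, g(x)\big)
&= l\Big(\mathbb{E}_{A \sim O(d)}\, y\, g_A(Ax)\Big)
 \le \mathbb{E}_{A \sim O(d)}\, l\big(y\, g_A(Ax)\big).
\end{align*}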
5.5.4 Finite dimension – Theorems 2.7 and 3.4

Let $V \subseteq C(S^{d-1})$ be the linear space $\{\Lambda_{v,b} \circ \psi : v \in \mathbb{R}^m,\ b \in \mathbb{R}\}$ and denote $\bar{W} = \{\Lambda_{v,b} \circ \psi : v \in W,\ b \in \mathbb{R}\}$. We note that $\dim(V) \le m + 1$. Instead of program (4) we consider the