The error rate of learning halfspaces using kernel-SVM

5.5.3 The integrality gap – Theorem 3.3

Our first step is a reduction to the hinge loss. Let $a = \partial_+ l(0)$. Define

$$l^*(x) = \begin{cases} ax + 1 & x \le \frac{1}{-a} \\ 0 & \text{otherwise.} \end{cases}$$

It is not hard to see that $l^*$ is a convex surrogate satisfying $\forall x,\ l^*(x) \le l(x)$ and $\partial_+ l^*(0) = \partial_+ l(0)$. Thus, if we substitute $l$ with $l^*$, we just decrease the integrality gap, hence we can assume that $l = l^*$. Now, we note that if we consider program (11) with $l = l^*$, the integrality gap coincides with what we get by replacing $C$ with $|a| \cdot C$ and $l^*$ with the hinge loss. To see that, note that $l^*(x) = \max\{0, 1 + ax\} = l_{\mathrm{hinge}}(|a| x)$ when $a < 0$, so for every $f \in H_k$, $b \in \mathbb{R}$,

$$\mathrm{Err}_{D,l^*}(f + b) = \mathrm{Err}_{D,\mathrm{hinge}}(|a| \cdot f + |a| \cdot b).$$

Thus, minimizing $\mathrm{Err}_{D,l^*}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le C$ is equivalent to minimizing $\mathrm{Err}_{D,\mathrm{hinge}}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le |a| \cdot C$. Thus, it is enough to prove the Theorem for $l = l_{\mathrm{hinge}}$.

Next, we show that we can assume that the embedding is symmetric (i.e., corresponds to a symmetric kernel). As the integrality gap is at least as large as the approximation ratio, using Theorem 3.2 this will complete our argument. (The reduction to the hinge loss yields bounds with universal constants in the asymptotic terms.)

Let $\gamma > 0$ and let $D$ be a distribution on $S^{d-1} \times \{\pm 1\}$. It is enough to find a (possibly different) distribution $D_1$ with the same $\gamma$-margin error as $D$, for which the optimum of program (11) (with $l = l_{\mathrm{hinge}}$) is not smaller than the optimum of the program

$$\begin{aligned} \min\ & \mathrm{Err}_{D,\mathrm{hinge}}(f + b) \\ \text{s.t.}\ & f \in H_{k_s},\ b \in \mathbb{R} \\ & \|f\|_{H_{k_s}} \le C \end{aligned} \qquad (17)$$

Denote the optimal value of program (17) by $\alpha$ and assume, towards contradiction, that whenever $\mathrm{Err}_\gamma(D_1) = \mathrm{Err}_\gamma(D)$, the optimum of program (11) is strictly less than $\alpha$.

For every $A \in O(d)$, let $D_A$ be the distribution of the r.v. $(Ax, y) \in S^{d-1} \times \{\pm 1\}$, where $(x, y) \sim D$. Since clearly $\mathrm{Err}_\gamma(D_A) = \mathrm{Err}_\gamma(D)$, there exist $f_A \in H_k$ and $b_A \in \mathbb{R}$ such that $\|f_A\|_{H_k} \le C$ and $\mathrm{Err}_{D_A,\mathrm{hinge}}(g_A) < \alpha$, where $g_A := f_A + b_A$. Define $f \in H_{k_s}$ by $f(x) = \int_{O(d)} f_A(Ax)\,dA$ and let $b = \int_{O(d)} b_A\,dA$ and $g = f + b$.
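The averaged predictor $g$ defined above satisfies $g(x) = E_{A \sim O(d)}\, g_A(Ax)$, and the argument compares the hinge loss of this average with the average of the hinge losses. The following is a minimal numerical sketch of that comparison, not part of the proof. It rests on several assumptions made purely for illustration: the Haar integral over $O(d)$ is replaced by a finite Monte Carlo average, $D$ is replaced by a random sample with random labels, and each $g_A$ is a hypothetical affine function $w_A \cdot x + b_A$ rather than an element of $H_k$.

```python
import numpy as np

def hinge(z):
    """Hinge loss l(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def random_orthogonal(d, rng):
    """Sample an orthogonal matrix via QR of a Gaussian matrix.

    Multiplying the columns of Q by the signs of diag(R) is the standard
    correction that makes the sample Haar-distributed on O(d).
    """
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def averaged_vs_individual(d=3, n_points=200, n_rotations=50, seed=0):
    rng = np.random.default_rng(seed)

    # Illustrative stand-in for D: points on the sphere with random labels.
    x = rng.standard_normal((n_points, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y = rng.choice([-1.0, 1.0], size=n_points)

    # Hypothetical g_A(x) = w_A . x + b_A, one per sampled rotation A.
    rotations = [random_orthogonal(d, rng) for _ in range(n_rotations)]
    w = rng.standard_normal((n_rotations, d))
    b = rng.standard_normal(n_rotations)

    # values[i, k] = g_{A_k}(A_k x_i); rows of (x @ A.T) are the vectors A x_i.
    values = np.stack(
        [(x @ A.T) @ w_k + b_k for A, w_k, b_k in zip(rotations, w, b)],
        axis=1,
    )

    g = values.mean(axis=1)                      # g(x_i) ~ E_A g_A(A x_i)
    loss_of_average = hinge(y * g).mean()        # Err_{D,hinge}(g)
    average_of_losses = hinge(y[:, None] * values).mean()
    return loss_of_average, average_of_losses

if __name__ == "__main__":
    lhs, rhs = averaged_vs_individual()
    print(f"hinge loss of averaged g : {lhs:.4f}")
    print(f"average of hinge losses  : {rhs:.4f}")
```

Because the hinge loss is convex, the first printed value never exceeds the second (Jensen's inequality), whatever the choice of the $g_A$; this holds exactly even for the finite average used here.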
By Theorem 5.18, $\|f\|_{H_{k_s}} \le C$. Finally, for $l = l_{\mathrm{hinge}}$,

$$\begin{aligned}
\mathrm{Err}_{D,\mathrm{hinge}}(g) &= E_{(x,y)\sim D}\, l(y\, g(x)) \\
&= E_{(x,y)\sim D}\, l\big(y\, E_{A\sim O(d)}\, g_A(Ax)\big) \\
&\le E_{(x,y)\sim D}\, E_{A\sim O(d)}\, l(y\, g_A(Ax)) \\
&= E_{A\sim O(d)}\, E_{(x,y)\sim D}\, l(y\, g_A(Ax)) \\
&= E_{A\sim O(d)}\, E_{(x,y)\sim D_A}\, l(y\, g_A(x)) < \alpha,
\end{aligned}$$

contrary to the assumption that $\alpha$ is the optimum of program (17).

5.5.4 Finite dimension - Theorems 2.7 and 3.4

Let $V \subseteq C(S^{d-1})$ be the linear space $\{\Lambda_{v,b} \circ \psi : v \in \mathbb{R}^m,\ b \in \mathbb{R}\}$ and denote $\bar{W} = \{\Lambda_{v,b} \circ \psi : v \in W,\ b \in \mathbb{R}\}$. We note that $\dim(V) \le m + 1$. Instead of program (4) we consider the equivalent formulation

$$\begin{aligned} \min\ & \mathrm{Err}_{D,l}(f) \\ \text{s.t.}\ & f \in \bar{W} \end{aligned} \qquad (18)$$

The following lemma is very similar to Lemma 5.12, but with better dependency on $m$ ($m^{1.5}$ instead of $m^2$).

Lemma 5.25 Let $l$ be a convex surrogate and let $V \subset C(S^{d-1})$ be an $m$-dimensional vector space. There exists a continuous kernel $k : S^{d-1} \times S^{d-1} \to \mathbb{R}$ with $\sup_{x \in S^{d-1}} k(x, x) \le 1$ such that $H_k = V$ as a vector space and there exists a probability measure $\mu_N$ such that

$$\forall f \in V, \quad \|f\|_{H_k} \le \frac{2 m^{1.5}}{|\partial_+ l(0)|}\, \mathrm{Err}_{\mu_N, l}(f).$$

Proof Let $\psi : S^{d-1} \to V^*$ be the evaluation operator. It maps each $x \in S^{d-1}$ to the linear functional $f \in V \mapsto f(x)$. We claim that

1. $\psi$ is continuous,
2. $\mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1})) = V^*$,
3. $V = \{v^{**} \circ \psi : v^{**} \in V^{**}\}$.

Proof of 1: We need to show that $\psi(x_n) \to \psi(x)$ if $x_n \to x$. Since $V^*$ is finite dimensional, it suffices to show that $\psi(x_n)(f) \to \psi(x)(f)$ for every $f \in V$, which follows from the continuity of $f$.

Proof of 2: Note that $0 \in U = \mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$, so $U$ is a linear space. Now, define $T : U^* \to V$ via $T(u^*) = u^* \circ \psi$. We claim that $T$ is onto, whence $\dim(U) = \dim(U^*) = \dim(V) = \dim(V^*)$, so that $U = V^*$. Indeed, for $f \in V$, let $u^*_f \in U^*$ be the functional $u^*_f(u) = u(f)$. Now, $T(u^*_f)(x) = u^*_f(\psi(x)) = \psi(x)(f) = f(x)$, thus $T(u^*_f) = f$.

Proof of 3: From $U = V^*$ it follows that $U^* = V^{**}$, so that the mapping $T : V^{**} \to V$ is onto, showing that $V = \{v^{**} \circ \psi : v^{**} \in V^{**}\}$.

Let us apply John's Lemma to $K = \mathrm{conv}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$. It yields an inner product on $V^*$ with $K$ contained in the unit ball and containing the ball around $0$ with radius $\frac{1}{\sqrt{m}}$. Let $k$ be the kernel $k(x, y) = \langle \psi(x), \psi(y) \rangle$. Since $\psi$ is continuous, $k$ is continuous as well. By Theorem 5.1.1 and since $T$ is onto, it follows that, as a vector space, $V = H_k$. Since $K$ is contained in the unit ball, it follows that $\sup_{x \in S^{d-1}} k(x, x) \le 1$. It remains to prove the existence of the measure $\mu_N$.

Let $e_1, \ldots, e_m \in V^*$ be an orthonormal basis. For every $i \in [m]$, choose $(x_i^1, y_i), \ldots, (x_i^{m+1}, y_i) \in S^{d-1} \times \{\pm 1\}$ and $\lambda_i^1, \ldots, \lambda_i^{m+1} \ge 0$ such that $\sum_{j=1}^{m+1} \lambda_i^j = 1$ and

$$\frac{1}{\sqrt{m}}\, e_i = \sum_{j=1}^{m+1} \lambda_i^j\, y_i\, \psi(x_i^j).$$

Define $\mu_N(x_i^j, 1) = \mu_N(x_i^j, -1) = \frac{\lambda_i^j}{2m}$.

Let $f \in V$. By Theorem 5.1.1 there exists $v \in V^*$ such that $f = \Lambda_{v,0} \circ \psi$ and $\|f\|_{H_k} = \|v\|_{V^*}$. It follows that, for $a = \partial_+ l(0)$,
