The error rate of learning halfspaces using kernel-SVM


Note that the error rate of the threshold 0 on $D$ is $\lambda/2$. We next show that each polynomial $f$ of degree $K = \log(C)$ that satisfies $\mathrm{Err}_{D,\mathrm{hinge}}(f) \le 1$ must have $f(\gamma) \approx f(-\gamma)$. Indeed, if

$$1 \ge \mathrm{Err}_{D,\mathrm{hinge}}(f) = (1-\lambda)\,\mathrm{Err}_{D_1,\mathrm{hinge}}(f) + \lambda\,\mathrm{Err}_{D_2,\mathrm{hinge}}(f)$$

then $\mathrm{Err}_{D_2,\mathrm{hinge}}(f) \le \frac{1}{\lambda}$. But,

$$\mathrm{Err}_{D_2,\mathrm{hinge}}(f) = \frac{1}{2}\int_{-1}^{1} \ell_{\mathrm{hinge}}(f(x))\,\rho(x)\,dx + \frac{1}{2}\int_{-1}^{1} \ell_{\mathrm{hinge}}(-f(x))\,\rho(x)\,dx \ge \frac{1}{2}\int_{-1}^{1} \ell_{\mathrm{hinge}}(-|f(x)|)\,\rho(x)\,dx ,$$

and using the convexity of $\ell_{\mathrm{hinge}}$ we obtain from Jensen's inequality that

$$\mathrm{Err}_{D_2,\mathrm{hinge}}(f) \ge \frac{1}{2}\,\ell_{\mathrm{hinge}}\left(-\int_{-1}^{1} |f(x)|\,\rho(x)\,dx\right) = \frac{1}{2}\left(1 + \int_{-1}^{1} |f(x)|\,\rho(x)\,dx\right) \ge \frac{1}{2}\int_{-1}^{1} |f(x)|\,\rho(x)\,dx =: \frac{1}{2}\,\|f\|_{1,d\rho} .$$

This shows that $\|f\|_{1,d\rho} \le \frac{2}{\lambda}$. We next write $f = \sum_{i=1}^{K} \alpha_i \tilde{T}_i$, where $\{\tilde{T}_i\}$ are the orthonormal polynomials corresponding to the measure $d\rho$. Since the $\tilde{T}_i$ are related to Chebyshev polynomials, we can uniformly bound their $\ell_\infty$ norm, and hence obtain that

$$\sqrt{\sum_i \alpha_i^2} = \|f\|_{2,d\rho} \le O\left(\sqrt{K}\right)\|f\|_{1,d\rho} \le O\left(\frac{\sqrt{K}}{\lambda}\right) .$$

Based on the above, and using a bound on the derivatives of Chebyshev polynomials, we can bound the derivative of the polynomial $f$:

$$|f'(x)| \le \sum_i |\alpha_i|\,|\tilde{T}_i'(x)| \le O\left(\frac{K^3}{\lambda}\right) .$$

Hence, by choosing $\lambda = \omega(\gamma K^3) = \omega(\gamma \log^3(C))$ we obtain

$$|f(\gamma) - f(-\gamma)| \le 2\gamma \max_x |f'(x)| = O\left(\frac{\gamma K^3}{\lambda}\right) = o(1) ,$$

as required.

Pulling back to the d − 1 dimensional sphere

Given the distribution $D$ over $[-1,1] \times \{\pm 1\}$ described before, and some $e \in S^{d-1}$, we now define a distribution $D_e$ on $S^{d-1} \times \{\pm 1\}$. To sample from $D_e$, we first sample $(\alpha, \beta)$ from $D$ and (uniformly and independently) a vector $z$ from the 1-codimensional sphere of $S^{d-1}$ that is orthogonal to $e$. The constructed point is $\left(\alpha e + \sqrt{1-\alpha^2}\,z,\ \beta\right)$.

For any $f \in H_k$ and $a \in [-1,1]$ define $\bar{f}(a)$ to be the expectation of $f$ over the 1-codimensional sphere $\{x \in S^{d-1} : \langle x, e\rangle = a\}$. We will show that for any $f \in H_k$ such that $\|f\|_{H_k} \le C$ and $\mathrm{Err}_{D_e,\mathrm{hinge}}(f) \le 1$, we have that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$.

To do so, let us first assume that $f$ is symmetric with respect to $e$, and hence can be written as

$$f(x) = \sum_{n=0}^{\infty} \alpha_n P_{d,n}(\langle x, e\rangle) ,$$

where $\alpha_n \in \mathbb{R}$ and $P_{d,n}$ is the $d$-dimensional Legendre polynomial of degree $n$. Furthermore, by a characterization of Hilbert spaces corresponding to symmetric kernels, it follows that $\sum_n \alpha_n^2 \le C^2$. Since $f$ is symmetric w.r.t. $e$ we have

$$\bar{f}(a) = \sum_{n=0}^{\infty} \alpha_n P_{d,n}(a) .$$

For $|a| \le 1/8$, we have that $|P_{d,n}(a)|$ tends to zero exponentially fast with both $d$ and $n$. Hence, if $d$ is large enough then

$$\bar{f}(a) \approx \sum_{n=0}^{\log(C)} \alpha_n P_{d,n}(a) =: \tilde{f}(a) .$$

Note that $\tilde{f}$ is a polynomial of degree bounded by $\log(C)$. In addition, by construction, $\mathrm{Err}_{D_e,\mathrm{hinge}}(f) = \mathrm{Err}_{D,\mathrm{hinge}}(\bar{f}) \approx \mathrm{Err}_{D,\mathrm{hinge}}(\tilde{f})$. Hence, if $1 \ge \mathrm{Err}_{D_e,\mathrm{hinge}}(f)$ then using the previous subsection we conclude that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$.

Symmetrization of f

In the above, we assumed both that the kernel function is symmetric and that $f$ is symmetric w.r.t. $e$. Our next step is to relax the latter assumption, while still assuming that the kernel function is symmetric.

Let $O(e)$ be the group of linear isometries that fix $e$, namely, $O(e) = \{A \in O(d) : Ae = e\}$. By assuming that $k$ is a symmetric kernel, we have that for all $A \in O(e)$, the function $g(x) = f(Ax)$ is also in $H_k$. Furthermore, $\|g\|_{H_k} = \|f\|_{H_k}$ and by the construction of $D_e$ we also have $\mathrm{Err}_{D_e,\mathrm{hinge}}(g) = \mathrm{Err}_{D_e,\mathrm{hinge}}(f)$. Let

$$P_e f(x) = \int_{O(e)} f(Ax)\,dA$$

be the symmetrization of $f$ w.r.t. $e$. On one hand, $P_e f \in H_k$, $\|P_e f\|_{H_k} \le \|f\|_{H_k}$, and $\mathrm{Err}_{D_e,\mathrm{hinge}}(P_e f) \le \mathrm{Err}_{D_e,\mathrm{hinge}}(f)$. On the other hand, $\bar{f} = \overline{P_e f}$. Since for $P_e f$ we have already shown that $|\overline{P_e f}(\gamma) - \overline{P_e f}(-\gamma)| = o(1)$, it follows that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$ as well.

Symmetrization of the kernel

Our final step is to remove the assumption that the kernel is symmetric. To do so, we first symmetrize the kernel as follows. Recall that $O(d)$ is the group of linear isometries of $\mathbb{R}^d$.
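The sampling procedure for $D_e$ described in the subsection on pulling back to the sphere can be illustrated with a small sketch. It is only an illustration under stated assumptions: the base distribution $D$ is not specified in this excerpt, so the pair $(\alpha, \beta)$ is taken as an input, and the function names `sample_orthogonal_unit_vector` and `pull_back` are hypothetical. The vector $z$ is drawn by projecting an isotropic Gaussian onto the subspace orthogonal to $e$ and normalizing, which is one standard way to obtain a uniform point on that sub-sphere.

```python
import math
import random


def sample_orthogonal_unit_vector(e, rng):
    """Draw z uniformly from the unit vectors orthogonal to the unit vector e."""
    while True:
        # An isotropic Gaussian is rotation invariant, so removing its
        # e-component and normalizing gives a uniform point on the sub-sphere.
        g = [rng.gauss(0.0, 1.0) for _ in e]
        dot = sum(gi * ei for gi, ei in zip(g, e))
        z = [gi - dot * ei for gi, ei in zip(g, e)]
        norm = math.sqrt(sum(zi * zi for zi in z))
        if norm > 1e-12:  # retry in the degenerate (probability zero) case
            return [zi / norm for zi in z]


def pull_back(alpha, beta, e, rng):
    """Map a labelled point (alpha, beta) drawn from D to a labelled point of D_e.

    Assumes e is a unit vector and alpha lies in [-1, 1].
    """
    z = sample_orthogonal_unit_vector(e, rng)
    scale = math.sqrt(1.0 - alpha * alpha)
    x = [alpha * ei + scale * zi for ei, zi in zip(e, z)]
    return x, beta
```

For example, with `e = [1.0, 0.0, 0.0, 0.0]` and `rng = random.Random(0)`, the call `pull_back(0.3, -1, e, rng)` returns a unit vector `x` whose inner product with `e` is 0.3 (up to floating-point error), together with the unchanged label. This is the property that makes $\mathrm{Err}_{D_e,\mathrm{hinge}}(f)$ depend on a symmetric $f$ only through $\langle x, e\rangle$.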
