Note that if ‖w‖ = 1 then |Λ_{w,b}(x)| is the distance of x from the separating hyperplane. Therefore, the γ-margin error rate is the probability that x either lies on the wrong side of the hyperplane or is at distance at most γ from it. The least γ-margin error rate of a halfspace classifier is denoted Err_γ(D) = min_{w∈B, b∈R} Err_{D,γ}(Λ_{w,b}).

A learning algorithm receives γ, ε and access to i.i.d. samples from D. The algorithm should return a classifier (which need not be an affine function). We say that the algorithm has approximation ratio α(γ) if for every γ, ε and for every distribution, it outputs (w.h.p. over the i.i.d. D-samples) a classifier with error rate ≤ α(γ)·Err_γ(D) + ε. An efficient algorithm uses poly(1/γ, 1/ε) samples, runs in time polynomial in the size of the sample¹, and outputs a classifier f such that f(x) can be evaluated in time polynomial in the sample size.

1.2 Kernel-SVM and kernel-based learners

The SVM paradigm, introduced by Vapnik, is inspired by the idea of separation with margin. For the reader's convenience we first describe the basic (kernel-free) variant of SVM. It is well known (e.g., Anthony and Bartlett (1999)) that the affine function that minimizes the empirical γ-margin error rate over an i.i.d. sample of size poly(1/γ, 1/ε) has error rate ≤ Err_γ(D) + ε.
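For concreteness, the empirical version of the γ-margin error rate just defined is easy to compute on a finite sample; the sketch below is our own illustration (function name and toy data are not from the paper):

```python
import numpy as np

def empirical_margin_error(w, b, X, y, gamma):
    """Empirical gamma-margin error rate of Lambda_{w,b}(x) = <w, x> + b.

    With ||w|| = 1, a labeled point (x, y) is a margin error if it lies on
    the wrong side of the hyperplane or within distance gamma of it,
    i.e. if y * Lambda_{w,b}(x) <= gamma.
    """
    signed_margins = y * (X @ w + b)
    return float(np.mean(signed_margins <= gamma))

# Toy sample: two clusters at distance >= 0.5 from the hyperplane x_1 = 0.
X = np.array([[0.5, 0.0], [0.6, 0.1], [-0.5, 0.0], [-0.6, -0.1]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])  # unit-norm normal vector
b = 0.0

print(empirical_margin_error(w, b, X, y, gamma=0.1))   # 0.0: no margin errors
print(empirical_margin_error(w, b, X, y, gamma=0.55))  # 0.5: two points inside the margin
```

Minimizing this quantity over w and b is exactly the empirical minimization discussed above.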
However, this minimization problem is NP-hard, and even NP-hard to approximate (Guruswami and Raghavendra, 2006; Feldman et al., 2006).

SVM deals with this hardness by replacing the margin loss with a convex surrogate loss, in particular the hinge loss² ℓ_hinge(x) = (1 − x)_+. Note that for x ∈ [−2, 2],

    ℓ_{0−1}(x) ≤ ℓ_hinge(x/γ) ≤ (1 + 2/γ)·ℓ_γ(x),

from which it easily follows that by solving

    min_{w,b} Err_{D,hinge}((1/γ)·Λ_{w,b})   s.t. w ∈ H, b ∈ R, ‖w‖_H ≤ 1

we obtain an approximation ratio of α(γ) = 1 + 2/γ. It is more convenient to consider the problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b})   s.t. w ∈ H, b ∈ R, ‖w‖_H ≤ C,   (1)

which is equivalent for C = 1/γ. The basic (kernel-free) variant of SVM essentially solves Problem (1), which can be approximated, up to an additive error of ε, by an efficient algorithm running on a sample of size poly(1/γ, 1/ε).

Kernel-free SVM minimizes the hinge loss over the space of affine functionals of bounded norm. The family of kernel-SVM algorithms is obtained by replacing the space of affine functionals with other, possibly much larger, spaces (e.g., a polynomial kernel of degree t extends the repertoire of possible output functions from affine functionals to all polynomials of degree at most t). This is accomplished by embedding B into the unit ball of another Hilbert space, on which we apply basic-SVM. Concretely, let ψ : B → B₁, where B₁ is the unit ball of a Hilbert space H₁. The embedding ψ need not be computed directly.
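Before turning to the kernel trick, Problem (1) itself can be sketched with a simple projected subgradient method. This is our own illustrative stand-in, not the efficient algorithm the text refers to:

```python
import numpy as np

def hinge_svm(X, y, C, steps=2000, lr=0.1):
    """Projected subgradient descent for Problem (1):
        min_{w,b} (1/n) sum_i (1 - y_i(<w, x_i> + b))_+   s.t. ||w|| <= C.
    An illustrative sketch, not an optimized solver."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1.0                          # points with positive hinge loss
        gw = -(y[active][:, None] * X[active]).sum(axis=0) / n
        gb = -float(y[active].sum()) / n
        w -= lr * gw
        b -= lr * gb
        norm = np.linalg.norm(w)
        if norm > C:                                    # project onto the ball ||w|| <= C
            w *= C / norm
    return w, b

# Margin-0.5 separable toy data; with C = 1/gamma = 2 the constrained hinge
# minimizer classifies the sample correctly.
X = np.array([[0.5, 0.0], [0.6, 0.1], [-0.5, 0.0], [-0.6, -0.1]])
y = np.array([1, 1, -1, -1])
w, b = hinge_svm(X, y, C=2.0)
print(np.sign(X @ w + b))  # agrees with y
```

The projection step enforces the norm constraint of Problem (1); replacing it by the explicit feature map ψ of the next paragraphs yields the kernelized variant.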
¹ The size of a vector x ∈ H is taken to be the largest index j for which x_j ≠ 0.
² As usual, z_+ := max(z, 0).

Rather, it
is enough that we can efficiently compute the corresponding kernel, k(x, y) := ⟨ψ(x), ψ(y)⟩_{H₁} (this property, sometimes crucial, is called the kernel trick). It remains to solve the following problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b} ∘ ψ)   s.t. w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C.   (2)

This problem can be approximated, up to an additive error of ε, using poly(C/ε) samples and time. We prove lower bounds for all approximate solutions of Problem (2). In fact, our results hold for arbitrary (not just hinge) convex surrogate losses and arbitrary (not just efficiently computable) kernels.

Although we formulate our results for Problem (2), they apply as well to the following commonly used formulation of the kernel-SVM problem, in which the constraint ‖w‖_{H₁} ≤ C is replaced by a regularization term. Namely,

    min_{w∈H₁, b∈R} (1/C²)·‖w‖²_{H₁} + Err_{D,hinge}(Λ_{w,b} ∘ ψ).   (3)

The optimum of Problem (3) is ≤ 1, as witnessed by the zero solution w = 0, b = 0, whose objective value is ℓ_hinge(0) = 1. Thus, if (w, b) is an approximately optimal solution with objective value ≤ 2, then ‖w‖²_{H₁}/C² ≤ 2, and hence ‖w‖_{H₁} ≤ √2·C ≤ 2C. This observation makes it easy to modify our results on Problem (2) to apply to Problem (3).

1.3 Finite dimensional learners

The SVM algorithms embed the data in a (possibly infinite-dimensional) Hilbert space, and minimize the hinge loss over all affine functionals of bounded norm. The kernel trick sometimes allows us to work in infinite-dimensional Hilbert spaces. Even without it, we can still embed the data in R^m for some m, and minimize a convex loss over a collection of affine functionals.
For example, some algorithms do not constrain the affine functional at all, while in the Lasso method (Tibshirani, 1996) the affine functional (represented as a vector in R^m) must have small L₁-norm.

Without the kernel trick, such algorithms work directly in R^m. Thus, every such algorithm has time complexity Ω(m), so m is a lower bound on the complexity of the algorithm. In this work we lower bound the performance of any algorithm with m ≤ poly(1/γ). Concretely, we prove lower bounds for any approximate solution to a problem of the form

    min_{w,b} Err_{D,ℓ}(Λ_{w,b} ∘ ψ)   s.t. w ∈ W ⊂ R^m, b ∈ R,   (4)

where ℓ is some surrogate loss function (see the formal definition in the next section) and ψ : B → R^m.

It is not hard to see that for any m-dimensional space V of functions over the ball, there exists an embedding ψ : B → R^m such that

    {f + b : f ∈ V, b ∈ R} = {Λ_{w,b} ∘ ψ : w ∈ R^m, b ∈ R}.

Hence, our lower bounds hold for any method that optimizes a surrogate loss over a subset of a finite-dimensional space of functions and returns the threshold function corresponding to the optimum.
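The observation above can be made concrete: taking V to be the quadratic polynomials on the plane, the monomial embedding ψ : R² → R⁵ realizes every quadratic (plus bias) as Λ_{w,b} ∘ ψ. The embedding, dataset, and names below are our own illustration:

```python
import numpy as np

def psi(X):
    """Degree-2 monomial embedding of the plane into R^5.  The functions
    Lambda_{w,b}(psi(x)) = <w, psi(x)> + b then range over exactly the
    quadratic polynomials in x, an instance of Problem (4) with m = 5."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# XOR-style labels are not linearly separable in R^2, but the w that picks
# out the x1*x2 coordinate of psi separates them.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # the polynomial f(x) = x1 * x2
b = 0.0
print(np.sign(psi(X) @ w + b))  # [ 1.  1. -1. -1.], agreeing with y
```

Any minimization of a surrogate loss over such a ψ-image is an instance of Problem (4), so the lower bounds of this section apply to it.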