Note that if ‖w‖ = 1 then |Λ_{w,b}(x)| is the distance of x from the separating hyperplane. Therefore, the γ-margin error rate is the probability that x either lies on the wrong side of the hyperplane or is at distance at most γ from it. The least γ-margin error rate of a halfspace classifier is denoted Err_γ(D) = min_{w∈B, b∈R} Err_{D,γ}(Λ_{w,b}).

A learning algorithm receives γ, ε and access to i.i.d. samples from D. The algorithm should return a classifier (which need not be an affine function). We say that the algorithm has approximation ratio α(γ) if for every γ, ε and for every distribution, it outputs (w.h.p. over the i.i.d. D-samples) a classifier with error rate ≤ α(γ)·Err_γ(D) + ε. An efficient algorithm uses poly(1/γ, 1/ε) samples, runs in time polynomial in the size of the sample¹, and outputs a classifier f such that f(x) can be evaluated in time polynomial in the sample size.

1.2 Kernel-SVM and kernel-based learners

The SVM paradigm, introduced by Vapnik, is inspired by the idea of separation with margin. For the reader's convenience we first describe the basic (kernel-free) variant of SVM. It is well known (e.g., Anthony and Bartlett (1999)) that the affine function that minimizes the empirical γ-margin error rate over an i.i.d. sample of size poly(1/γ, 1/ε) has error rate ≤ Err_γ(D) + ε.
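For concreteness, the empirical version of the γ-margin error rate just defined is easy to compute on a finite sample; the sketch below is our own illustration (function name and toy data are not from the paper):

```python
import numpy as np

def empirical_margin_error(w, b, X, y, gamma):
    """Empirical gamma-margin error rate of Lambda_{w,b}(x) = <w, x> + b.

    With ||w|| = 1, a labeled point (x, y) is a margin error if it lies on
    the wrong side of the hyperplane or within distance gamma of it,
    i.e. if y * Lambda_{w,b}(x) <= gamma.
    """
    signed_margins = y * (X @ w + b)
    return float(np.mean(signed_margins <= gamma))

# Toy sample: two clusters at distance >= 0.5 from the hyperplane x_1 = 0.
X = np.array([[0.5, 0.0], [0.6, 0.1], [-0.5, 0.0], [-0.6, -0.1]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])  # unit-norm normal vector
b = 0.0

print(empirical_margin_error(w, b, X, y, gamma=0.1))   # 0.0: no margin errors
print(empirical_margin_error(w, b, X, y, gamma=0.55))  # 0.5: two points inside the margin
```

Minimizing this quantity over w and b is exactly the empirical minimization discussed above.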
However, this minimization problem is NP-hard, and even NP-hard to approximate (Guruswami and Raghavendra, 2006; Feldman et al., 2006).

SVM deals with this hardness by replacing the margin loss with a convex surrogate loss, in particular the hinge loss² ℓ_hinge(x) = (1 − x)_+. Note that for x ∈ [−2, 2],

    ℓ_{0−1}(x) ≤ ℓ_hinge(x/γ) ≤ (1 + 2/γ)·ℓ_γ(x),

from which it easily follows that by solving

    min_{w,b} Err_{D,hinge}((1/γ)·Λ_{w,b})   s.t. w ∈ H, b ∈ R, ‖w‖_H ≤ 1

we obtain an approximation ratio of α(γ) = 1 + 2/γ. It is more convenient to consider the problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b})   s.t. w ∈ H, b ∈ R, ‖w‖_H ≤ C,   (1)

which is equivalent for C = 1/γ. The basic (kernel-free) variant of SVM essentially solves Problem (1), which can be approximated, up to an additive error of ε, by an efficient algorithm running on a sample of size poly(1/γ, 1/ε).

Kernel-free SVM minimizes the hinge loss over the space of affine functionals of bounded norm. The family of kernel-SVM algorithms is obtained by replacing the space of affine functionals with other, possibly much larger, spaces (e.g., a polynomial kernel of degree t extends the repertoire of possible output functions from affine functionals to all polynomials of degree at most t). This is accomplished by embedding B into the unit ball of another Hilbert space, on which we apply basic-SVM. Concretely, let ψ : B → B₁, where B₁ is the unit ball of a Hilbert space H₁. The embedding ψ need not be computed directly.
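Before turning to the kernel trick, Problem (1) itself can be sketched with a simple projected subgradient method. This is our own illustrative stand-in, not the efficient algorithm the text refers to:

```python
import numpy as np

def hinge_svm(X, y, C, steps=2000, lr=0.1):
    """Projected subgradient descent for Problem (1):
        min_{w,b} (1/n) sum_i (1 - y_i(<w, x_i> + b))_+   s.t. ||w|| <= C.
    An illustrative sketch, not an optimized solver."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1.0                          # points with positive hinge loss
        gw = -(y[active][:, None] * X[active]).sum(axis=0) / n
        gb = -float(y[active].sum()) / n
        w -= lr * gw
        b -= lr * gb
        norm = np.linalg.norm(w)
        if norm > C:                                    # project onto the ball ||w|| <= C
            w *= C / norm
    return w, b

# Margin-0.5 separable toy data; with C = 1/gamma = 2 the constrained hinge
# minimizer classifies the sample correctly.
X = np.array([[0.5, 0.0], [0.6, 0.1], [-0.5, 0.0], [-0.6, -0.1]])
y = np.array([1, 1, -1, -1])
w, b = hinge_svm(X, y, C=2.0)
print(np.sign(X @ w + b))  # agrees with y
```

The projection step enforces the norm constraint of Problem (1); replacing it by the explicit feature map ψ of the next paragraphs yields the kernelized variant.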
¹ The size of a vector x ∈ H is taken to be the largest index j for which x_j ≠ 0.
² As usual, z_+ := max(z, 0).

Rather, it
is enough that we can efficiently compute the corresponding kernel, k(x, y) := ⟨ψ(x), ψ(y)⟩_{H₁} (this property, sometimes crucial, is called the kernel trick). It remains to solve the following problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b} ∘ ψ)   s.t. w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C.   (2)

This problem can be approximated, up to an additive error of ε, using poly(C/ε) samples and time. We prove lower bounds for all approximate solutions of Problem (2). In fact, our results hold for arbitrary (not just hinge) convex surrogate losses and arbitrary (not just efficiently computable) kernels.

Although we formulate our results for Problem (2), they apply as well to the following commonly used formulation of the kernel-SVM problem, in which the constraint ‖w‖_{H₁} ≤ C is replaced by a regularization term. Namely,

    min_{w∈H₁, b∈R} (1/C²)·‖w‖²_{H₁} + Err_{D,hinge}(Λ_{w,b} ∘ ψ).   (3)

The optimum of Problem (3) is ≤ 1, as witnessed by the zero solution w = 0, b = 0, whose objective value is ℓ_hinge(0) = 1. Thus, if (w, b) is an approximately optimal solution with objective value ≤ 2, then ‖w‖²_{H₁}/C² ≤ 2, and hence ‖w‖_{H₁} ≤ √2·C ≤ 2C. This observation makes it easy to modify our results on Problem (2) to apply to Problem (3).

1.3 Finite dimensional learners

The SVM algorithms embed the data in a (possibly infinite-dimensional) Hilbert space, and minimize the hinge loss over all affine functionals of bounded norm. The kernel trick sometimes allows us to work in infinite-dimensional Hilbert spaces. Even without it, we can still embed the data in R^m for some m, and minimize a convex loss over a collection of affine functionals.
For example, some algorithms do not constrain the affine functional at all, while in the Lasso method (Tibshirani, 1996) the affine functional (represented as a vector in R^m) must have small L₁-norm.

Without the kernel trick, such algorithms work directly in R^m. Thus, every such algorithm has time complexity Ω(m), so m is a lower bound on the complexity of the algorithm. In this work we lower bound the performance of any algorithm with m ≤ poly(1/γ). Concretely, we prove lower bounds for any approximate solution to a problem of the form

    min_{w,b} Err_{D,ℓ}(Λ_{w,b} ∘ ψ)   s.t. w ∈ W ⊂ R^m, b ∈ R,   (4)

where ℓ is some surrogate loss function (see the formal definition in the next section) and ψ : B → R^m.

It is not hard to see that for any m-dimensional space V of functions over the ball, there exists an embedding ψ : B → R^m such that

    {f + b : f ∈ V, b ∈ R} = {Λ_{w,b} ∘ ψ : w ∈ R^m, b ∈ R}.

Hence, our lower bounds hold for any method that optimizes a surrogate loss over a subset of a finite-dimensional space of functions and returns the threshold function corresponding to the optimum.
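The observation above can be made concrete: taking V to be the quadratic polynomials on the plane, the monomial embedding ψ : R² → R⁵ realizes every quadratic (plus bias) as Λ_{w,b} ∘ ψ. The embedding, dataset, and names below are our own illustration:

```python
import numpy as np

def psi(X):
    """Degree-2 monomial embedding of the plane into R^5.  The functions
    Lambda_{w,b}(psi(x)) = <w, psi(x)> + b then range over exactly the
    quadratic polynomials in x, an instance of Problem (4) with m = 5."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# XOR-style labels are not linearly separable in R^2, but the w that picks
# out the x1*x2 coordinate of psi separates them.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # the polynomial f(x) = x1 * x2
b = 0.0
print(np.sign(psi(X) @ w + b))  # [ 1.  1. -1. -1.], agreeing with y
```

Any minimization of a surrogate loss over such a ψ-image is an instance of Problem (4), so the lower bounds of this section apply to it.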