The complexity of learning halfspaces using generalized linear methods

Amit Daniely*   Nati Linial†   Shai Shalev-Shwartz‡

May 10, 2014

Abstract

Many popular learning algorithms (e.g. regression, Fourier-transform based algorithms, kernel SVM and kernel ridge regression) operate by reducing the problem to a convex optimization problem over a set of functions. These methods offer the currently best approach to several central problems such as learning halfspaces and learning DNF's. In addition they are widely used in numerous application domains. Despite their importance, there are still very few proof techniques to show limits on the power of these algorithms.

We study the performance of this approach in the problem of (agnostically and improperly) learning halfspaces with margin γ. Let D be a distribution over labeled examples. The γ-margin error of a hyperplane h is the probability of an example to fall on the wrong side of h or at a distance ≤ γ from it. The γ-margin error of the best h is denoted Err_γ(D). An α(γ)-approximation algorithm receives γ, ɛ as input and, using i.i.d. samples of D, outputs a classifier with error rate ≤ α(γ)·Err_γ(D) + ɛ. Such an algorithm is efficient if it uses poly(1/γ, 1/ɛ) samples and runs in time polynomial in the sample size.

The best approximation ratio achievable by an efficient algorithm is O(1/(γ·√log(1/γ))) and is achieved using an algorithm from the above class. Our main result shows that the approximation ratio of every efficient algorithm from this family must be ≥ Ω(1/(γ·poly(log(1/γ)))), essentially matching the best known upper bound.

* Department of Mathematics, Hebrew University, Jerusalem 91904, Israel. amit.daniely@mail.huji.ac.il
† School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel. nati@cs.huji.ac.il
‡ School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel. shais@cs.huji.ac.il


1 Introduction

Let X be some set and let D be a distribution on X × {±1}. The basic learning task is, based on an i.i.d. sample, to find a function f : X → {±1} whose error, Err_{D,0-1}(f) := Pr_{(X,Y)∼D}(f(X) ≠ Y), is as small as possible. A learning problem is defined by specifying a class H of competitors (e.g. H is a class of functions from X to {±1}). Given such a class, the corresponding learning problem is to find f : X → {±1} whose error is small relative to the error of the best competitor in H. Ignoring computational aspects, the celebrated PAC/VC theory essentially tells us that the best algorithm for every learning problem is an Empirical Risk Minimizer (ERM) – namely, one that returns the competitor in H of least empirical error. Unfortunately, for many learning problems, implementing the ERM paradigm is NP-hard and even NP-hard to approximate.

We consider here a very popular family of algorithms to cope with this hardness, which we collectively call "the generalized linear family". It proceeds as follows: fix some set W ⊂ R^X and return a function of the form f(x) = sign(g(x) − b), where the pair (g, b) ∈ W × R empirically minimizes some convex loss. In order for such a method to be useful, the set W should be "small" (to prevent overfitting) and "nicely behaved" (to make the optimization problem computationally feasible). The two main choices for such a set W are:

• A (usually convex) subset of a finite dimensional space of functions (e.g. if X ⊂ R^d then W can be the space of all polynomials of degree ≤ 17 with coefficients bounded by d³).
We refer to such algorithms as finite dimensional learners.

• A ball in a reproducing kernel Hilbert space. We refer to such algorithms as kernel based learners.

The generalized linear family has been applied extensively to tackle learning problems (e.g. Linial et al. (1989), Kushilevitz and Mansour (1991), Klivans and Servedio (2001), Kalai et al. (2005), Blais et al. (2008), Shalev-Shwartz et al. (2011) – see Section 1.4). Their statistical characteristics have been thoroughly studied as well (Vapnik, 1998, Anthony and Bartlett, 1999, Schölkopf et al., 1998, Cristianini and Shawe-Taylor, 2000, Steinwart and Christmann, 2008). Moreover, the significance of this approach is by no means only theoretical – algorithms from this family are widely used by practitioners.

In spite of all that, very few lower bounds are known on the performance of this family of algorithms (i.e., theorems of the form "For every kernel-based/finite-dimensional algorithm for the learning problem X, there exists a distribution under which the algorithm performs poorly"). Such a lower bound must quantify over all possible choices of "small and nicely behaved" sets W. In order to address this difficulty we employ a variety of mathematical methods, some of which are new in this domain.
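As a toy illustration of the generalized linear recipe just described (fix a set W, minimize a convex loss over it, threshold the result), the following sketch uses least squares over low-degree univariate polynomials on a synthetic one-dimensional dataset. All concrete choices here (the data, the degree, the squared loss) are illustrative and mine, not the paper's:

```python
import numpy as np

# Toy data: x in [-1, 1], labels in {+1, -1} given by a noisy threshold at 0.3
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.sign(X - 0.3 + 0.05 * rng.standard_normal(200))
y[y == 0] = 1

# W = polynomials of degree <= 3, represented by coefficient vectors.
# Minimize the (convex) squared loss over (g, b) via least squares; the
# bias b is absorbed into the constant coefficient of g.
degree = 3
Phi = np.vander(X, degree + 1)          # monomial features: x^3, x^2, x, 1
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The returned classifier thresholds the learned polynomial: f(x) = sign(g(x))
def f(x):
    return np.sign(np.vander(np.atleast_1d(x), degree + 1) @ coef)

print("training accuracy:", np.mean(f(X) == y))
```

The same template covers the other members of the family: swap the monomial features for another function space W and the squared loss for another convex loss.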
In particular, we make intensive use of harmonic analysis on the sphere, reproducing kernel Hilbert spaces, orthogonal polynomials, John's Lemma, as well as a new symmetrization technique.

We also prove a new result, which is of independent interest. A fundamental fact that stands behind the theoretical analysis of kernel based learners is that for every subset X of a unit ball in a Hilbert space H, it is possible to learn affine functionals of norm ≤ C over X w.r.t. the hinge loss using C²/ɛ² examples. We show a (weak) inverse of this fact. Namely, we show that for every X, if affine functionals can be learnt using m examples, then there exists an equivalent inner product on H under which X is contained in a unit ball, and the affine functional returned by any learning algorithm must have norm ≤ O(m³).

Our lower bounds are established for the basic problem of learning large margin halfspaces (to be defined precisely in Section 1.1). The best known efficient (in 1/γ) algorithm for this problem (Birnbaum and Shalev-Shwartz, 2012) is a kernel based learner that achieves an approximation ratio of (1/γ)/√log(1/γ). (We note, however, that this approximation ratio was first obtained by (Long and Servedio, 2011) using a "boosting based" algorithm that does not belong to the generalized linear family.) The best known exact algorithm (that is, α(γ) = 1) is also a kernel based learner and runs in time exp(Θ((1/γ)·log(1/γ))) (Shalev-Shwartz et al., 2011).

Our main results show that efficient kernel based learners cannot achieve a better approximation ratio than Ω(1/(γ·poly(log(1/γ)))), essentially matching the best known upper bound. Also, we show that efficient finite dimensional learners cannot achieve a better approximation ratio than Ω(1/(√γ·poly(log(1/γ)))). In addition, we show that the running time of kernel based learners with approximation ratio (1/γ)^{1−ɛ}, as well as of finite dimensional learners with approximation ratio (1/γ)^{1/2−ɛ}, must be exponential in 1/γ.

Next, we formulate the problem of learning large margin halfspaces and survey some relevant background to motivate our definitions of kernel-based and finite dimensional learners given in Section 2.

1.1 Learning large margin halfspaces

We view R^d as a subspace of the Hilbert space H = ℓ₂ corresponding to the first d coordinates. Since the notion of margin is defined relative to a suitable scaling of the examples, we consider throughout only distributions that are supported in the unit ball, B, of H. Also, all the distributions we consider are supported in R^d for some d < ∞. We denote by S^{d−1} the unit sphere of R^d.

It will be convenient to use loss functions. A loss function is any function l : R → [0, ∞). Given a loss function l and f : B → R, we denote Err_{D,l}(f) = E_{(x,y)∼D}(l(yf(x))). Two loss functions of particular relevance are the 0-1 loss function, l_{0-1}(x) = 1 if x ≤ 0 and 0 if x > 0, and the γ-margin loss function, l_γ(x) = 1 if x ≤ γ and 0 if x > γ. We use shorthands such as Err_{D,0-1} instead of Err_{D,l_{0-1}}.

A halfspace, parameterized by w ∈ B and b ∈ R, is the classifier f(x) = sign(Λ_{w,b}(x)), where Λ_{w,b}(x) := 〈w, x〉 + b.
Given a distribution D over B × {±1}, the error rate of Λ_{w,b} is

    Err_{D,0-1}(Λ_{w,b}) = Pr_{(x,y)∼D}(sign(Λ_{w,b}(x)) ≠ y) = Pr_{(x,y)∼D}(yΛ_{w,b}(x) ≤ 0).

The γ-margin error rate of Λ_{w,b} is

    Err_{D,γ}(Λ_{w,b}) = Pr_{(x,y)∼D}(yΛ_{w,b}(x) ≤ γ).
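On a finite sample, the two error rates above amount to counting margin violations. A minimal sketch, with an illustrative sample and hyperplane:

```python
import numpy as np

def err_01(w, b, X, y):
    """Empirical 0-1 error of the halfspace sign(<w,x> + b)."""
    margins = y * (X @ w + b)
    return np.mean(margins <= 0)

def err_margin(w, b, X, y, gamma):
    """Empirical gamma-margin error: wrong side OR within distance gamma.
    Assumes ||w|| = 1, so |<w,x> + b| is the distance to the hyperplane."""
    margins = y * (X @ w + b)
    return np.mean(margins <= gamma)

# Illustrative sample in the unit ball of R^2, separated by the hyperplane x1 = 0
X = np.array([[0.5, 0.0], [-0.5, 0.0], [0.05, 0.0], [-0.05, 0.0]])
y = np.array([1, -1, 1, -1])
w, b = np.array([1.0, 0.0]), 0.0

print(err_01(w, b, X, y))           # 0.0: every point is correctly classified
print(err_margin(w, b, X, y, 0.1))  # 0.5: the two points at distance 0.05 violate the margin
```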


Note that if ‖w‖ = 1 then |Λ_{w,b}(x)| is the distance of x from the separating hyperplane. Therefore, the γ-margin error rate is the probability of x to either be on the wrong side of the hyperplane or at a distance of at most γ from it. The least γ-margin error rate of a halfspace classifier is denoted Err_γ(D) = min_{w∈B, b∈R} Err_{D,γ}(Λ_{w,b}).

A learning algorithm receives γ, ɛ and access to i.i.d. samples from D. The algorithm should return a classifier (which need not be an affine function). We say that the algorithm has approximation ratio α(γ) if for every γ, ɛ and for every distribution, it outputs (w.h.p. over the i.i.d. D-samples) a classifier with error rate ≤ α(γ)·Err_γ(D) + ɛ. An efficient algorithm uses poly(1/γ, 1/ɛ) samples, runs in time polynomial in the size of the sample¹, and outputs a classifier f such that f(x) can be evaluated in time polynomial in the sample size.

1.2 Kernel-SVM and kernel-based learners

The SVM paradigm, introduced by Vapnik, is inspired by the idea of separation with margin. For the reader's convenience we first describe the basic (kernel-free) variant of SVM. It is well known (e.g. Anthony and Bartlett (1999)) that the affine function that minimizes the empirical γ-margin error rate over an i.i.d. sample of size poly(1/γ, 1/ɛ) has error rate ≤ Err_γ(D) + ɛ.
However, this minimization problem is NP-hard and even NP-hard to approximate (Guruswami and Raghavendra, 2006, Feldman et al., 2006).

SVM deals with this hardness by replacing the margin loss with a convex surrogate loss, in particular the hinge loss² l_hinge(x) = (1 − x)₊. Note that for x ∈ [−2, 2],

    l_{0-1}(x) ≤ l_hinge(x/γ) ≤ (1 + 2/γ)·l_γ(x),

from which it easily follows that by solving

    min_{w,b} Err_{D,hinge}((1/γ)·Λ_{w,b})  s.t.  w ∈ H, b ∈ R, ‖w‖_H ≤ 1

we obtain an approximation ratio of α(γ) = 1 + 2/γ. It is more convenient to consider the problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b})  s.t.  w ∈ H, b ∈ R, ‖w‖_H ≤ C,   (1)

which is equivalent for C = 1/γ. The basic (kernel-free) variant of SVM essentially solves Problem (1), which can be approximated, up to an additive error of ɛ, by an efficient algorithm running on a sample of size poly(1/γ, 1/ɛ).

Kernel-free SVM minimizes the hinge loss over the space of affine functionals of bounded norm. The family of kernel-SVM algorithms is obtained by replacing the space of affine functionals with other, possibly much larger, spaces (e.g., a polynomial kernel of degree t extends the repertoire of possible output functions from affine functionals to all polynomials of degree at most t). This is accomplished by embedding B into the unit ball of another Hilbert space on which we apply basic SVM. Concretely, let ψ : B → B₁, where B₁ is the unit ball of a Hilbert space H₁. The embedding ψ need not be computed directly. Rather, it is enough that we can efficiently compute the corresponding kernel, k(x, y) := 〈ψ(x), ψ(y)〉_{H₁} (this property, sometimes crucial, is called the kernel trick). It remains to solve the following problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b} ◦ ψ)  s.t.  w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C.   (2)

This problem can be approximated, up to an additive error of ɛ, using poly(C/ɛ) samples and time. We prove lower bounds for all approximate solutions of program (2). In fact, our results work with arbitrary (not just hinge loss) convex surrogate losses and arbitrary (not just efficiently computable) kernels.

Although we formulate our results for Problem (2), they apply as well to the following commonly used formulation of the kernel SVM problem, where the constraint ‖w‖_{H₁} ≤ C is replaced by a regularization term. Namely,

    min_{w∈H₁, b∈R} (1/C²)·‖w‖²_{H₁} + Err_{D,hinge}(Λ_{w,b} ◦ ψ).   (3)

The optimum of program (3) is ≤ 1, as shown by the zero solution w = 0, b = 0. Thus, if (w, b) is an approximately optimal solution, then ‖w‖²_{H₁}/C² ≤ 2, and hence ‖w‖_{H₁} ≤ 2C. This observation makes it easy to modify our results on program (2) to apply to program (3).

¹ The size of a vector x ∈ H is taken to be the largest index j for which x_j ≠ 0.
² As usual, z₊ := max(z, 0).
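The kernel trick mentioned above can be made concrete with a toy example: for the degree-2 polynomial kernel on R², there is an explicit feature map ψ whose inner products the kernel reproduces. This is a standard textbook construction for illustration only, not the embedding analyzed in this paper:

```python
import numpy as np

def kernel(x, z):
    """Degree-2 polynomial kernel on R^2, computed without any embedding."""
    return (x @ z) ** 2

def psi(x):
    """Explicit feature map with <psi(x), psi(z)> = (<x, z>)^2 for x, z in R^2."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -0.4])
z = np.array([0.1, 0.5])
# The kernel evaluates the inner product in feature space without forming psi:
print(np.isclose(kernel(x, z), psi(x) @ psi(z)))  # True
```

For higher degrees and dimensions, the feature space grows quickly (or is infinite dimensional, e.g. for the Gaussian kernel), which is exactly why evaluating k directly matters.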
For example, some algorithms do not constrain the affine functional, while in the Lasso method (Tibshirani, 1996) the affine functional (represented as a vector in R^m) must have small L₁-norm.

Without the kernel trick, such algorithms work directly in R^m. Thus, every algorithm must have time complexity Ω(m), and therefore m is a lower bound on the complexity of the algorithm. In this work we lower bound the performance of any algorithm with m ≤ poly(1/γ). Concretely, we prove lower bounds for any approximate solution to a problem of the form

    min_{w,b} Err_{D,l}(Λ_{w,b} ◦ ψ)  s.t.  w ∈ W ⊂ R^m, b ∈ R,   (4)

where l is some surrogate loss function (see formal definition in the next section) and ψ : B → R^m.

It is not hard to see that for any m-dimensional space V of functions over the ball, there exists an embedding ψ : B → R^m such that

    {f + b : f ∈ V, b ∈ R} = {Λ_{w,b} ◦ ψ : w ∈ R^m, b ∈ R}.

Hence, our lower bounds hold for any method that optimizes a surrogate loss over a subset of a finite dimensional space of functions and returns the threshold function corresponding to the optimum.
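The identity above, that any function in V plus a bias equals an affine functional composed with the embedding ψ(x) = (g₁(x), …, g_m(x)), can be sanity-checked numerically. The basis functions, weights, and sample points in this sketch are all illustrative:

```python
import numpy as np

# V = span{g_1, g_2, g_3}; psi maps x to the vector of basis evaluations.
basis = [lambda x: x, lambda x: x**2, lambda x: np.sin(x)]

def psi(x):
    return np.array([g(x) for g in basis])

# Any f = sum_j w_j * g_j together with a bias b is exactly Lambda_{w,b} o psi:
w, b = np.array([0.5, -1.0, 2.0]), 0.3
f = lambda x: 0.5 * x - 1.0 * x**2 + 2.0 * np.sin(x) + 0.3

for x in [-0.7, 0.1, 0.9]:
    assert np.isclose(f(x), w @ psi(x) + b)
print("f + b and Lambda_{w,b} o psi agree on the sample points")
```

This is why optimizing a convex loss over V reduces, after the one-time embedding, to optimizing over affine functionals in R^m.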


1.4 Previous Results and Related Work

The problem of learning halfspaces, and in particular large margin halfspaces, is as old as the field of machine learning, starting with the perceptron algorithm (Rosenblatt, 1958). Since then it has been a fundamental challenge in machine learning and has inspired much of the existing theory as well as many popular algorithms.

The generalized linear method has its roots in the work of Gauss and Legendre, who used the least squares method for astronomical computations. This method has played a key role in modern statistics. Its first application in computational learning theory is in (Linial et al., 1989), where it is shown that AC⁰ functions are learnable in quasi-polynomial time w.r.t. the uniform distribution. Subsequently, many authors have used the method to tackle various learning problems. For example, Klivans and Servedio (2001) derived the fastest algorithm for learning DNF, and Kushilevitz and Mansour (1991) used it to develop an algorithm for decision trees. The main uses of the linear method in the problem of learning halfspaces appear in the next paragraph. Needless to say, we are unable here to offer a comprehensive survey of its uses in computational learning theory in general.

The best currently known approximation ratios in the problem of learning large margin halfspaces are due to (Birnbaum and Shalev-Shwartz, 2012) and (Long and Servedio, 2011) and achieve an approximation ratio of 1/(γ·√log(1/γ)). The algorithm of (Birnbaum and Shalev-Shwartz, 2012) is a kernel based learner, while (Long and Servedio, 2011) used a "boosting based" approach (that does not belong to the generalized linear method). The fastest exact algorithm is due to Shalev-Shwartz et al. (2011), runs in time exp(Θ((1/γ)·log(1/(ɛγ)))), and is also a kernel based learner. Better running times can be achieved under distributional assumptions. For data which is separable with margin γ, i.e. Err_γ(D) = 0, the perceptron algorithm (as well as SVM with a linear kernel) can find a classifier with error ≤ ɛ with time and sample complexity ≤ poly(1/γ, 1/ɛ). Kalai et al. (2005) gave a finite dimensional learner which is the fastest known algorithm for learning halfspaces w.r.t. the uniform distribution over S^{d−1} and the d-dimensional boolean cube (running in time d^{O(1/ɛ⁴)}). They also designed a finite dimensional learner of halfspaces w.r.t. log-concave distributions. Blais et al. (2008) extended these results from uniform to product distributions. In this work, we focus on algorithms which work for any distribution and whose runtime is polynomial in both 1/γ and 1/ɛ.

The problem of proper³ learning of halfspaces in the non-separable case was shown to be hard to approximate within any constant approximation factor (Feldman et al., 2006, Guruswami and Raghavendra, 2006). It has been recently shown (Shalev-Shwartz et al., 2011) that improper learning under the margin assumption is also hard (under some cryptographic assumptions). Namely, no polynomial time algorithm can achieve an approximation ratio of α(γ) = 1. In another recent result, Daniely et al. (2013) have shown that under a certain complexity assumption, for every constant α, no polynomial time algorithm can achieve an approximation ratio of α.

Ben-David et al. (2012) (see also Long and Servedio (2011)) addressed the performance of methods that minimize a convex loss over the class of affine functionals of bounded norm (in our terminology, they considered the narrow class of finite dimensional learners that optimize over the space of linear functionals). They showed that the best approximation ratio of such methods is Θ(1/γ). Our results can be seen as a substantial generalization of their results.

The learning theory literature contains consistency results for learning with the so-called universal kernels and well-calibrated surrogate loss functions. This includes the study of asymptotic relations between surrogate convex loss functions and the 0-1 loss function (Zhang, 2004, Bartlett et al., 2006, Steinwart and Christmann, 2008). It is shown that the approximation ratio of SVM with a universal kernel tends to 1 as the sample size grows. Our result implies that this convergence is very slow; e.g., an exponentially large (in 1/γ) sample is needed to make the error < 2·Err_γ(D).

Also related are Ben-David et al. (2003) and Warmuth and Vishwanathan (2005). These papers show the existence of learning problems with limitations on the ability to learn them using linear methods.

³ A proper learner must output a halfspace classifier. Here we consider improper learning, where the learner can output any classifier.

2 Results

We first define the two families of algorithms to which our lower bounds apply. We start with the class of surrogate loss functions.
This class includes the most popular choices, such as the absolute loss |1 − x|, the squared loss (1 − x)², the logistic loss log₂(1 + e^{−x}), and the hinge loss (1 − x)₊.

Definition 2.1 (Surrogate loss function) A function l : R → R is called a surrogate loss function if l is convex and is bounded below by the 0-1 loss.

The first family of algorithms contains kernel based algorithms, such as kernel SVM. In the definitions below we set the accuracy parameter ɛ to be √γ. Since our goal is to prove lower bounds, this choice is without loss of generality and is intended to simplify the theorem statements.

Definition 2.2 (Kernel based learner) Let l : R → R be a surrogate loss function. A kernel based learning algorithm, A, receives as input γ ∈ (0, 1). It then selects C = C_A(γ) and an absolutely continuous feature mapping, ψ = ψ_A(γ), which maps the original space H into the unit ball of a new space H₁ (see Section 1.2). The algorithm returns a function

    A(γ) ∈ {Λ_{w,b} ◦ ψ : w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C}

such that, with probability ≥ 1 − exp(−1/γ),

    Err_{D,l}(A(γ)) ≤ inf{Err_{D,l}(Λ_{w,b} ◦ ψ) : w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C} + √γ.

We denote by m_A(γ) the maximal number of examples A uses. We say that A is efficient if m_A(γ) ≤ poly(1/γ).

Note that the definition of kernel based learner allows for any predefined convex surrogate loss, not just the hinge loss. Namely, we consider the program

    min_{w,b} Err_{D,l}(Λ_{w,b} ◦ ψ)  s.t.  w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C.   (5)
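A quick numerical sanity check, not part of the paper's argument, that the four losses listed above satisfy Definition 2.1, i.e. each is convex and dominates the 0-1 loss on a grid:

```python
import numpy as np

l01 = lambda x: (x <= 0).astype(float)
losses = {
    "absolute": lambda x: np.abs(1 - x),
    "squared":  lambda x: (1 - x) ** 2,
    "logistic": lambda x: np.log2(1 + np.exp(-x)),
    "hinge":    lambda x: np.maximum(1 - x, 0),
}

xs = np.linspace(-5, 5, 1001)
for name, l in losses.items():
    # dominates the 0-1 loss pointwise
    assert np.all(l(xs) >= l01(xs) - 1e-12), name
    # midpoint convexity on the grid
    mids = (xs[:-1] + xs[1:]) / 2
    assert np.all(l(mids) <= (l(xs[:-1]) + l(xs[1:])) / 2 + 1e-12), name
print("all four losses are convex upper bounds on the 0-1 loss (on the grid)")
```

Note that all four are calibrated so that they equal 1 at x = 0, where the 0-1 loss jumps.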


We note that our results hold even if the kernel corresponding to ψ is hard to compute.

The second family of learning algorithms involves an arbitrary feature mapping and a domain constraint on the vector w, as in program (4).

Definition 2.3 (Finite dimensional learner) Let l : R → R be some surrogate loss function. A finite dimensional learning algorithm, A, receives as input γ ∈ (0, 1). It then selects a continuous embedding ψ = ψ_A(γ) : B → R^m and a constraint set W = W_A(γ) ⊆ R^m. The algorithm returns, with probability ≥ 1 − exp(−1/γ), a function

    A(γ) ∈ {Λ_{w,b} ◦ ψ : w ∈ W, b ∈ R}

such that

    Err_{D,l}(A(γ)) ≤ inf{Err_{D,l}(Λ_{w,b} ◦ ψ) : w ∈ W, b ∈ R} + √γ.

We say that A is efficient if m = m_A(γ) ≤ poly(1/γ).

2.1 Main Results

We begin with a lower bound on the performance of efficient kernel-based algorithms.

Theorem 2.4 Let l be an arbitrary surrogate loss and let A be an efficient kernel-based learner w.r.t. l. Then, for every γ > 0, there exists a distribution D on B such that, w.p. ≥ 1 − 10·exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(γ·poly(log(1/γ)))).

Next we show that kernel-based learners that achieve an approximation ratio of (1/γ)^{1−ɛ} for some constant ɛ > 0 must suffer exponential complexity.

Theorem 2.5 Let l be an arbitrary surrogate loss, let ɛ > 0, and let A be a kernel-based learner w.r.t. l such that for every γ > 0 and every distribution D on B, w.p. ≥ 1/2,

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≤ (1/γ)^{1−ɛ}.

Then, for some a = a(ɛ) > 0, m_A(γ) = Ω(exp((1/γ)^a)).

These two theorems follow from the following result.

Theorem 2.6 Let l be an arbitrary surrogate loss and let A be a kernel-based learner w.r.t. l for which m_A(γ) = exp(o(γ^{−2/7})).
Then, for every γ > 0, there exists a distribution D on B such that, w.p. ≥ 1 − 10·exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(γ·poly(log(m_A(γ))))).


It is shown in (Birnbaum and Shalev-Shwartz, 2012) that solving kernel SVM with some specific kernel (i.e. some specific ψ) yields an approximation ratio of O(1/(γ·√log(1/γ))). It follows that our lower bound in Theorem 2.6 is essentially tight. Also, this theorem can be viewed as a substantial generalization of (Ben-David et al., 2012, Long and Servedio, 2011), who give an approximation ratio of Ω(1/γ) with no embedding (i.e., ψ is the identity map). Also relevant is (Shalev-Shwartz et al., 2011), which shows that for a certain ψ, and m_A(γ) = poly(exp((1/γ)·log(1/γ))), kernel SVM has approximation ratio 1. Theorem 2.6 shows that for a kernel-based learner to achieve a constant approximation ratio, m_A must be exponential in 1/γ.

Next we give lower bounds on the performance of finite dimensional learners.

Theorem 2.7 Let l be a Lipschitz surrogate loss and let A be a finite dimensional learner w.r.t. l. Assume that m_A(γ) = exp(o(γ^{−1/8})). Then, for every γ > 0, there exists a distribution D on S^{d−1} × {±1} with d = O(log(m_A(γ)/γ)) such that, w.p. ≥ 1 − exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(√γ·poly(log(m_A(γ)/γ)))).

Corollary 2.8 Let l be a Lipschitz surrogate loss and let A be an efficient finite dimensional learner w.r.t. l. Then, for every γ > 0, there exists a distribution D on S^{d−1} × {±1} with d = O(log(1/γ)) such that, w.p. ≥ 1 − exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(√γ·poly(log(1/γ)))).

Corollary 2.9 Let l be a Lipschitz surrogate loss, let ɛ > 0, and let A be a finite dimensional learner w.r.t. l such that for every γ > 0 and every distribution D on B_d with d = ω(log(1/γ)) it holds that, w.p.
≥ 1/2,

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≤ (1/γ)^{1/2−ɛ}.

Then, for some a = a(ɛ) > 0, m_A(γ) = Ω(exp((1/γ)^a)).

2.2 Review of the proofs' main ideas

To give the reader some idea of our arguments, we sketch some of the main ingredients of the proof of Theorem 2.6. At the end of this section we sketch the idea of the proof of Theorem 2.7. We note, however, that the actual proofs are organized somewhat differently.

We will construct a distribution D over S^{d−1} × {±1} (recall that R^d is viewed as standardly embedded in H = ℓ₂). Thus, we can assume that the program is formulated in terms of the unit sphere, S^∞ ⊂ ℓ₂, and not the unit ball.

Fix an embedding ψ and C > 0. Denote by k : S^∞ × S^∞ → R the corresponding kernel, k(x, y) = 〈ψ(x), ψ(y)〉_{H₁}, and consider the following set of functions over S^∞:

    H_k = {Λ_{v,0} ◦ ψ : v ∈ H₁}.


H_k is a Hilbert space with norm ‖f‖_{H_k} = inf{‖v‖_{H₁} : Λ_{v,0} ◦ ψ = f}. The subscript k indicates that H_k is uniquely determined (as a Hilbert space) given the kernel k. With this interpretation, program (5) is equivalent to the program

    min_{f∈H_k, b∈R} Err_{D,l}(f + b)  s.t.  ‖f‖_{H_k} ≤ C.   (6)

For simplicity we focus on l being the hinge loss (the generalization to other surrogate loss functions is rather technical).

The proof may be split into four steps:

1. Our first step is to show that we can restrict to the case C = poly(1/γ). We show that for every subset X of a Hilbert space H, if affine functionals on X can be learnt using m examples w.r.t. the hinge loss, then there exists an equivalent inner product on H under which X is contained in a unit ball, and the affine functional returned by any learning algorithm must have norm ≤ O(m³). Since we consider algorithms with polynomial sample complexity, this allows us to argue as if C = poly(1/γ).

2. We consider the one-dimensional problem of improperly learning halfspaces (i.e. thresholds on the line) by optimizing the hinge loss over the space of univariate polynomials of degree bounded by log(C). We construct a distribution D over [−1, 1] × {±1} that is a convex combination of two distributions: one that is separable by a γ-margin halfspace, and another representing a tiny amount of noise. We show that each solution of the problem of minimizing the hinge loss w.r.t. D over the space of such polynomials has the property that f(γ) ≈ f(−γ).

3. We pull back the distribution D w.r.t. a direction e ∈ S^{d−1} to a distribution over S^{d−1} × {±1}. Let f be an approximate solution of program (6).
We show that f takesalmost the same value on instances for which 〈x, e〉 = γ and 〈x, e〉 = −γ. This stepcan be further broken into three substeps –(a) First, we assume that the <strong>kernel</strong> is symmetric and f(x) depends only on 〈x, e〉.This substep uses a characterization <strong>of</strong> Hilbert spaces corresponding to symmetric<strong>kernel</strong>s, from which it follows that f has the form∞∑f(x) = α n P d,n (〈x, e〉) .n=1Here P d,n are the d-dimensional Legendre polynomials and ∑ ∞n=0 α2 n < C 2 . Thisallows us to rely on the results for the one-dimensional case from step (1).(b) By symmetrizing f, we relax the assumption that f depends only on 〈x, e〉.(c) By averaging the <strong>kernel</strong> over the group <strong>of</strong> linear isometries on R d , we relax theassumption that the <strong>kernel</strong> is symmetric.4. Finally, we show that for the distribution from the previous step, if f is an approximatesolution to program (6) then f predicts the same value, 1, on instances for which〈x, e〉 = γ and 〈x, e〉 = −γ. This establishes our claim, as the constructed distributionassigns the value −1 to instances for which 〈x, e〉 = −γ.91γ).
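The pull-back in step 3 can be sketched numerically as follows (the dimension, direction, and margin below are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 10, 0.1

# Sketch of the pull-back: sample the margin coordinate alpha (here fixed to
# gamma), draw a uniform unit vector z orthogonal to e, and output the point
# x = alpha * e + sqrt(1 - alpha^2) * z on the sphere S^{d-1}.
e = np.zeros(d)
e[0] = 1.0

def sample_pullback(alpha):
    z = rng.standard_normal(d)
    z -= (z @ e) * e                   # project out the e-direction
    z /= np.linalg.norm(z)             # uniform on the sphere orthogonal to e
    return alpha * e + np.sqrt(1.0 - alpha ** 2) * z

x = sample_pullback(gamma)
assert abs(np.linalg.norm(x) - 1.0) < 1e-9   # lands on the unit sphere
assert abs(x @ e - gamma) < 1e-9             # margin coordinate is exactly alpha
```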


We now expand on this brief description of the main steps.

Polynomial sample implies small C

Let X be a subset of the unit ball of some Hilbert space H and let C > 0. Assume that affine functionals over X with norm ≤ C can be learnt using m examples with error ɛ and confidence δ. That is, assume that there is an algorithm such that

• Its input is a sample of m points in X × {±1} and its output is an affine functional Λ_{w,b} with ‖w‖ ≤ C.

• For every distribution D on X × {±1}, it returns, with probability 1 − δ, w, b with Err_{D,hinge}(Λ_{w,b}) ≤ inf_{‖w′‖≤C, b′∈R} Err_{D,hinge}(Λ_{w′,b′}) + ɛ.

We will show that there is an equivalent inner product 〈·, ·〉′ on H under which X is contained in a unit ball (not necessarily around 0), and the affine functional returned by any learning algorithm as above must have norm ≤ O(m³) w.r.t. the new norm.

The construction of the norm is done as follows. We first find an affine subspace M ⊂ H of dimension d ≤ m that is very close to X, in the sense that the distance of every point in X from M is ≤ m/C. To find such an M, we assume toward a contradiction that there is no such M, and use this to show that there is a subset A ⊂ X such that every function f : A → [−1, 1] can be realized by some affine functional with norm ≤ C. This contradicts the assumption that affine functionals with norm ≤ C can be learnt using m examples.

Having the subspace M at hand, we construct, using John's lemma (e.g. Matousek (2002)), an inner product 〈·, ·〉′′ on M and a distribution µ_N on X with the property that the projection of X on M is contained in a ball of radius 1/2 w.r.t. 〈·, ·〉′′, and the hinge error of every affine functional w.r.t. µ_N is lower bounded by the norm of the affine functional divided by m². We show that this entails that any affine functional returned by the algorithm must have norm ≤ O(m³) w.r.t. the inner product 〈·, ·〉′′.

Finally, we construct an inner product 〈·, ·〉′ on H by putting the norm 〈·, ·〉′′ on M and multiplying the original inner product by C²/(2m²) on M^⊥.

The one dimensional distribution

We define a distribution D on [−1, 1] × {±1} as follows. Start with the distribution D₁ that takes the values (γ, 1) and (−γ, −1), where D₁(γ, 1) = 0.7 and D₁(−γ, −1) = 0.3. Clearly, for this distribution, the threshold 0 has zero error rate. To construct D, we perturb D₁ with "noise" as follows. Let D = (1 − λ)D₁ + λD₂, where D₂ is defined as follows. The probability of the labels is uniform and independent of the instance, and the marginal probability over the instances is defined by the density function

ρ(x) = 0 if |x| > 1/8,  ρ(x) = 8 / (π √(1 − (8x)²)) if |x| ≤ 1/8 .

This choice of ρ simplifies our calculations due to its relation to Chebyshev polynomials. However, other choices of ρ which are supported on a small interval around zero can also work.
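As a quick numeric sanity check (an illustration, not part of the paper's argument), the noise density ρ is the arcsine (Chebyshev) weight rescaled from [−1, 1] to [−1/8, 1/8], so it integrates to 1:

```python
import numpy as np

# rho(x) = 8 / (pi * sqrt(1 - (8x)^2)) on |x| <= 1/8 is the arcsine weight
# dx / (pi * sqrt(1 - x^2)) pushed forward under x |-> x / 8, hence a
# probability density.  We integrate it numerically (cutting off the
# integrable endpoint singularities).
xs = np.linspace(-1/8 + 1e-6, 1/8 - 1e-6, 2_000_001)
rho = 8.0 / (np.pi * np.sqrt(1.0 - (8.0 * xs) ** 2))
mass = np.sum(0.5 * (rho[1:] + rho[:-1]) * np.diff(xs))  # trapezoid rule
assert abs(mass - 1.0) < 1e-2
```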


Note that the error rate of the threshold 0 on D is λ/2. We next show that each polynomial f of degree K = log(C) that satisfies Err_{D,hinge}(f) ≤ 1 must have f(γ) ≈ f(−γ). Indeed, if

1 ≥ Err_{D,hinge}(f) = (1 − λ) Err_{D₁,hinge}(f) + λ Err_{D₂,hinge}(f)

then Err_{D₂,hinge}(f) ≤ 1/λ. But,

Err_{D₂,hinge}(f) = (1/2) ∫_{−1}^{1} l_hinge(f(x)) ρ(x) dx + (1/2) ∫_{−1}^{1} l_hinge(−f(x)) ρ(x) dx
  ≥ (1/2) ∫_{−1}^{1} l_hinge(−|f(x)|) ρ(x) dx ,

and using the convexity of l_hinge we obtain from Jensen's inequality that this is

  ≥ (1/2) l_hinge( −∫_{−1}^{1} |f(x)| ρ(x) dx )
  = (1/2) ( 1 + ∫_{−1}^{1} |f(x)| ρ(x) dx )
  ≥ (1/2) ∫_{−1}^{1} |f(x)| ρ(x) dx =: (1/2) ‖f‖_{1,dρ} .

This shows that ‖f‖_{1,dρ} ≤ 2/λ. We next write f = Σ_{i=1}^{K} α_i T̃_i, where the T̃_i are the orthonormal polynomials corresponding to the measure dρ. Since the T̃_i are related to Chebyshev polynomials, we can uniformly bound their l_∞ norm, and hence obtain that

√(Σ_i α_i²) = ‖f‖_{2,dρ} ≤ O(√K) ‖f‖_{1,dρ} ≤ O(√K / λ) .

Based on the above, and using a bound on the derivatives of Chebyshev polynomials, we can bound the derivative of the polynomial f:

|f′(x)| ≤ Σ_i |α_i| |T̃_i′(x)| ≤ O(K³/λ) .

Hence, by choosing λ = ω(γK³) = ω(γ log³(C)) we obtain

|f(γ) − f(−γ)| ≤ 2γ max_x |f′(x)| = O(γK³/λ) = o(1) ,

as required.

Pulling back to the d − 1 dimensional sphere

Given the distribution D over [−1, 1] × {±1} described before, and some e ∈ S^{d−1}, we now define a distribution D_e on S^{d−1} × {±1}. To sample from D_e, we first sample (α, β) from D


and (uniformly and independently) a vector z from the 1-codimensional sphere of S^{d−1} that is orthogonal to e. The constructed point is (αe + √(1 − α²) z, β).

For any f ∈ H_k and a ∈ [−1, 1] define f̄(a) to be the expectation of f over the 1-codimensional sphere {x ∈ S^{d−1} : 〈x, e〉 = a}. We will show that for any f ∈ H_k such that ‖f‖_{H_k} ≤ C and Err_{D_e,hinge}(f) ≤ 1, we have | f̄(γ) − f̄(−γ)| = o(1).

To do so, let us first assume that f is symmetric with respect to e, and hence can be written as

f(x) = Σ_{n=0}^∞ α_n P_{d,n}(〈x, e〉) ,

where α_n ∈ R and P_{d,n} is the d-dimensional Legendre polynomial of degree n. Furthermore, by a characterization of Hilbert spaces corresponding to symmetric kernels, it follows that Σ_n α_n² ≤ C².

Since f is symmetric w.r.t. e we have

f̄(a) = Σ_{n=0}^∞ α_n P_{d,n}(a) .

For |a| ≤ 1/8, |P_{d,n}(a)| tends to zero exponentially fast with both d and n. Hence, if d is large enough then

f̄(a) ≈ Σ_{n=0}^{log(C)} α_n P_{d,n}(a) =: f̃(a) .

Note that f̃ is a polynomial of degree bounded by log(C). In addition, by construction, Err_{D_e,hinge}(f) = Err_{D,hinge}(f̄) ≈ Err_{D,hinge}(f̃). Hence, if 1 ≥ Err_{D_e,hinge}(f) then using the previous subsection we conclude that | f̄(γ) − f̄(−γ)| = o(1).

Symmetrization of f

In the above, we assumed both that the kernel function is symmetric and that f is symmetric w.r.t. e. Our next step is to relax the latter assumption, while still assuming that the kernel function is symmetric.

Let O(e) be the group of linear isometries that fix e, namely, O(e) = {A ∈ O(d) : Ae = e}. Since k is a symmetric kernel, for every A ∈ O(e) the function g(x) = f(Ax) is also in H_k. Furthermore, ‖g‖_{H_k} = ‖f‖_{H_k}, and by the construction of D_e we also have Err_{D_e,hinge}(g) = Err_{D_e,hinge}(f).
Let P_e f(x) = ∫_{O(e)} f(Ax) dA be the symmetrization of f w.r.t. e. On one hand, P_e f ∈ H_k, ‖P_e f‖_{H_k} ≤ ‖f‖_{H_k}, and Err_{D_e,hinge}(P_e f) ≤ Err_{D_e,hinge}(f). On the other hand, f̄ = P_e f. Since for P_e f we have already shown that |P_e f(γ) − P_e f(−γ)| = o(1), it follows that | f̄(γ) − f̄(−γ)| = o(1) as well.

Symmetrization of the kernel

Our final step is to remove the assumption that the kernel is symmetric. To do so, we first symmetrize the kernel as follows. Recall that O(d) is the group of linear isometries of R^d.


Define the following symmetric kernel:

k_s(x, y) = ∫_{O(d)} k(Ax, Ay) dA .

We show that the corresponding Hilbert space consists of functions of the form

f(x) = ∫_{O(d)} f_A(Ax) dA ,

where f_A ∈ H_k for every A. Moreover,

‖f‖²_{H_{k_s}} ≤ ∫_{O(d)} ‖f_A‖²_{H_k} dA .  (7)

Let α be the maximal number such that

∀e ∈ S^{d−1} ∃f_e ∈ H_k s.t. ‖f_e‖_{H_k} ≤ C, Err_{D_e,hinge}(f_e) ≤ 1, | f̄_e(γ) − f̄_e(−γ)| > α .

Since H_k is closed under negation, it follows that α satisfies

∀e ∈ S^{d−1} ∃f_e ∈ H_k s.t. ‖f_e‖_{H_k} ≤ C, Err_{D_e,hinge}(f_e) ≤ 1, f̄_e(γ) − f̄_e(−γ) > α .

Fix some v ∈ S^{d−1} and define f ∈ H_{k_s} to be

f(x) = ∫_{O(d)} f_{Av}(Ax) dA .

By Equation (7) we have ‖f‖_{H_{k_s}} ≤ C. It is also possible to show that for all A, Err_{D_v,hinge}(f_{Av} ◦ A) = Err_{D_{Av},hinge}(f_{Av}) ≤ 1. Therefore, by the convexity of the loss, Err_{D_v,hinge}(f) ≤ 1. It follows, by the previous sections, that | f̄(γ) − f̄(−γ)| = o(1). But we show that f̄(γ) − f̄(−γ) > α. It therefore follows that α = o(1), as required.

Concluding the proof

We have shown that for every kernel there exists some direction e such that for all f ∈ H_k satisfying ‖f‖_{H_k} ≤ C and Err_{D_e,hinge}(f) ≤ 1 we have | f̄(γ) − f̄(−γ)| = o(1).

Next, consider an f which is also an (approximate) optimal solution of program (2) with respect to D_e. Since Err_{D_e,hinge}(0) = 1, we clearly have Err_{D_e,hinge}(f) ≤ 1, hence | f̄(γ) − f̄(−γ)| = o(1). Next we show that f̄(−γ) > 1/2, which implies that f predicts the label 1 for most instances on the 1-codimensional sphere such that 〈x, e〉 = −γ. Hence, its 0-1 error is close to 0.3(1 − λ) ≥ 0.2, while Err_γ(D_e) = λ/2. By choosing λ = O(γ log^{3.1}(C)) we obtain that the approximation ratio is Ω(1 / (γ log^{3.1}(C))).

It therefore remains to show that f̄(−γ) > 1/2. Let a = f̄(γ) ≈ f̄(−γ). On a (1 − λ) fraction of the distribution, the hinge-loss would be (on average, and roughly) 0.3[1 + a]₊ + 0.7[1 − a]₊. This function is minimized at a = 1, which concludes our proof since λ = o(1).
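The final minimization claim is easy to confirm on a grid (an illustration of the last step, with the paper's weights 0.3 and 0.7):

```python
import numpy as np

# On the clean part of the distribution, a hypothesis taking the common
# value a on both spheres <x, e> = gamma and <x, e> = -gamma pays average
# hinge loss 0.3 * [1 + a]_+ + 0.7 * [1 - a]_+ ; this is minimized at a = 1.
a = np.linspace(-2.0, 2.0, 4001)
loss = 0.3 * np.maximum(0.0, 1.0 + a) + 0.7 * np.maximum(0.0, 1.0 - a)
best = a[np.argmin(loss)]
assert abs(best - 1.0) < 1e-3
```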


The proof of Theorem 2.7

To prove Theorem 2.7, we prove, using John's Lemma (Matousek, 2002), that for every embedding ψ : S^{d−1} → B_1, we can construct a kernel k : S^{d−1} × S^{d−1} → R and a probability measure µ_N over S^{d−1} with the following properties: if f is an approximate solution of program (4), where a γ fraction of the distribution D is perturbed by µ_N, then ‖f‖_k ≤ O(m^{1.5}/γ). Using this, we adapt the proof as sketched above to prove Theorem 2.7.

3 Additional Results

Low dimensional distributions. It is of interest to examine Theorem 2.6 when D is supported on B^d for small d. We show that for d = O(log(1/γ)), the approximation ratio is Ω(1 / (√γ · poly(log(1/γ)))). Most commonly used kernels (e.g., the polynomial, RBF, and hyperbolic tangent kernels, as well as the kernel used in Shalev-Shwartz et al. (2011)) are symmetric. Namely, for all unit vectors x, y ∈ B, k(x, y) := 〈ψ(x), ψ(y)〉_{H_1} depends only on 〈x, y〉_H. For symmetric kernels, we show that even with the restriction d = O(log(1/γ)), the approximation ratio is still Ω(1 / (γ · poly(log(1/γ)))). However, the result for symmetric kernels is only proved for (idealized) algorithms that return the exact solution of program (5).

Theorem 3.1 Let A be a kernel-based learner corresponding to a Lipschitz surrogate. Assume that m_A(γ) = exp(o(γ^{−1/8})). Then, for every γ > 0, there exists a distribution D on B^d, for d = O(log(m_A(γ)/γ)), such that, w.p. ≥ 1 − exp(−1/γ),

Err_{D,0−1}(A(γ)) / Err_γ(D) ≥ Ω( 1 / (√γ · poly(log(m_A(γ)))) ) .

Theorem 3.2 Assume that m_A(γ) = exp(o(γ^{−2/7})) and ψ is continuous and symmetric. For every γ > 0, there exists a distribution D on B^d, for d = O(log(m_A(γ))), and a solution to program (5) whose 0-1-error is Ω( 1 / (γ · poly(log(m_A(γ)))) ) · Err_γ(D).

The integrality gap. In bounding the approximation ratio, we considered a predefined loss l. We believe that similar bounds hold as well for algorithms that can choose l according to γ. However, at the moment, we only know how to lower bound the integrality gap, as defined below.

If we let l depend on γ, we should redefine the complexity of Program (5) to be C · L, where L is the Lipschitz constant of l (see the discussion following Program (5)). The (γ-)integrality gap of programs (5) and (4) is defined as the worst case, over all possible choices of D, of the ratio between the optimum of the program, running on the input γ, and Err_γ(D). We note that Err_{D,0−1}(f) ≤ Err_{D,l}(f) for every convex surrogate l. Thus, the integrality gap always upper bounds the approximation ratio. Moreover, this fact establishes most (if not all) guarantees for algorithms that solve Program (5) or Program (4).

We denote by ∂₊f the right derivative of the real function f. Note that ∂₊f always exists for convex f. Also, |∂₊f(x)| ≤ L for all x ∈ R if f is L-Lipschitz. We prove:


Theorem 3.3 Assume that C = exp(o(γ^{−2/7})) and ψ is continuous. For every γ > 0, there exists a distribution D on B^d, for d = O(log(C)), such that the optimum of Program (5) is Ω( 1 / (γ · poly(log(C · |∂₊l(0)|))) ) · Err_γ(D).

Thus Program (5) has integrality gap Ω( 1 / (γ · poly(log(C · L))) ). For Program (4) we prove a similar lower bound:

Theorem 3.4 Let m, d, γ be such that d = ω(log(m/γ)) and m = exp(o(γ^{−2/7})). There exists a distribution D on S^{d−1} × {±1} such that the optimum of Program (4) is Ω( 1 / (γ · poly(log(m/γ))) ) · Err_γ(D).

4 Conclusion

We prove impossibility results for the family of generalized linear methods in the task of learning large margin halfspaces. Some of our lower bounds nearly match the best known upper bounds, and we conjecture that the rest of our bounds can be improved as well to match the best known upper bounds. As we describe next, our work leaves much for future research.

First, regarding the task of learning large margin halfspaces, our analysis suggests that if better approximation ratios are at all possible, then they would require methods other than optimizing a convex surrogate over a regularized linear class of classifiers.

Second, similarly to the problem of learning large margin halfspaces, for many learning problems the best known algorithms belong to the generalized linear family. Understanding the limits of the generalized linear method for these problems is therefore of particular interest, and might indicate where the line lies between feasibility and infeasibility for these problems. We believe that our techniques will prove useful in proving lower bounds on the performance of generalized linear methods for these and other learning problems. E.g., our techniques yield lower bounds on the performance of generalized linear algorithms that learn halfspaces over the boolean cube {±1}^n: it can be shown that these methods cannot achieve approximation ratios better than Ω̃(√n), even if the algorithm competes only with halfspaces defined by vectors in {±1}^n. These ideas will be elaborated on in a long version of this manuscript.

Third, while our results indicate the limitations of generalized linear methods, it is an empirical fact that these methods perform very well in practice. Therefore, it is of great interest to find conditions on distributions that hold in practice, under which these methods are guaranteed to perform well. We note that learning halfspaces under distributional assumptions has already been addressed to a certain degree. For example, (Kalai et al., 2005, Blais et al., 2008) show positive results under several assumptions on the marginal distribution (namely, they assume that the distribution is either uniform, log-concave, or a product distribution). There is still much to do here, specifically in search of better runtimes. Currently these results yield a runtime which is exponential in poly(1/ɛ), where ɛ is the excess error of the learnt hypothesis.

Fourth, as part of our proof, we have shown a (weak) inverse (Lemma 5.15) of the well-known fact that affine functionals of norm ≤ C can be learnt using poly(C) samples. We made no


attempts to prove a quantitatively optimal result in this vein, and we strongly believe that much sharper versions can be proved. This interesting direction is largely left as an open problem.

There are several limitations of our analysis that deserve further work. In our work the surrogate loss is fixed in advance. We believe that similar results hold even if the loss depends on γ. This belief is supported by our results about the integrality gap. As explained in Section 6, this is a subtle issue related to questions about sample complexity. Finally, in view of Theorems 3.3 and 3.4, we believe that, as in Theorem 2.6, the lower bounds in Theorems 2.7 and 3.1 can be improved to depend on 1/γ rather than on 1/√γ.

5 Proofs

5.1 Background and Notation

Here we introduce some notation and terminology to be used throughout. The L_p norm corresponding to a measure µ is denoted ‖·‖_{p,µ}. Also, N = {1, 2, …} and N₀ = {0, 1, 2, …}. For a collection of functions F ⊂ Y^X and A ⊂ X we denote F|_A = {f|_A : f ∈ F}. Let H be a Hilbert space. We denote the projection onto a closed convex subset C of H by P_C. We denote the norm of an affine functional Λ on H by ‖Λ‖_H = sup_{‖x‖_H = 1} |Λ(x) − Λ(0)|.

5.1.1 Reproducing Kernel Hilbert Spaces

All the theorems we quote here are standard and can be found, e.g., in Chapter 2 of (Saitoh, 1988). Let H be a Hilbert space of functions from a set S to C. Note that H consists of functions and not of equivalence classes of functions.
We say that H is a reproducing kernel Hilbert space (RKHS for short) if, for every x ∈ S, the linear functional f ↦ f(x) is bounded.

A function k : S × S → C is a reproducing kernel (or just a kernel) if, for every x₁, …, xₙ ∈ S, the matrix {k(x_i, x_j)}_{1≤i,j≤n} is positive semi-definite.

Kernels and RKHSs are essentially synonymous:

Theorem 5.1
1. For every kernel k there exists a unique RKHS H_k such that for every y ∈ S, k(·, y) ∈ H_k and ∀f ∈ H_k, f(y) = 〈f(·), k(·, y)〉_{H_k}.
2. A Hilbert space H ⊆ C^S is an RKHS if and only if there exists a kernel k : S × S → R such that H = H_k.
3. For every kernel k, span{k(·, y)}_{y∈S} is dense in H_k. Moreover,

〈 Σ_{i=1}^n α_i k(·, x_i), Σ_{i=1}^n β_i k(·, y_i) 〉_{H_k} = Σ_{1≤i,j≤n} α_i β̄_j k(y_j, x_i) .

4. If the kernel k : S × S → R takes only real values, then H_k^R := {Re(f) : f ∈ H_k} ⊂ H_k. Moreover, H_k^R is a real Hilbert space with the inner product induced from H_k.
5. For every kernel k, convergence in H_k implies point-wise convergence. If sup_{x∈S} k(x, x) < ∞ then this convergence is uniform.
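The positive semi-definiteness in the definition of a kernel above can be checked numerically for any concrete kernel; here is a minimal sketch with a Gaussian RBF kernel (an illustrative choice, not one of the kernels studied in the paper):

```python
import numpy as np

# For a kernel k, every Gram matrix {k(x_i, x_j)} must be positive
# semi-definite.  We verify this for the Gaussian RBF kernel on random points.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2))

G = np.array([[k(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(G)
assert np.all(eigvals > -1e-8)   # PSD up to numerical round-off
```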


There is also a tight connection between embeddings of S into a Hilbert space and RKHSs.

Theorem 5.2 A function k : S × S → R is a kernel iff there exists a mapping φ : S → H to some real Hilbert space for which k(x, y) = 〈φ(y), φ(x)〉_H. Also,

H_k = {f_v : v ∈ H} ,

where f_v(x) = 〈v, φ(x)〉_H. The mapping v ↦ f_v, restricted to span(φ(S)), is a Hilbert space isomorphism.

A kernel k : S × S → R is called normalized if sup_{x∈S} k(x, x) = 1. Also,

Theorem 5.3 Let k : S × S → R be a kernel and let {f_n}_{n=1}^∞ be an orthonormal basis of H_k. Then k(x, y) = Σ_{n=1}^∞ f_n(x) f_n(y).

5.1.2 Unitary Representations of Compact Groups

Proofs of the results stated here can be found in (Folland, 1994), Chapter 5. Let G be a compact group. A unitary representation (or just a representation) of G is a group homomorphism ρ : G → U(H), where U(H) is the class of unitary operators over a Hilbert space H, such that for every v ∈ H the mapping a ↦ ρ(a)v is continuous.

We say that a closed subspace M ⊂ H is invariant (to ρ) if for every a ∈ G and v ∈ M, ρ(a)v ∈ M. We note that if M is invariant then so is M^⊥. We denote by ρ|_M : G → U(M) the restriction of ρ to M; that is, ∀a ∈ G, ρ|_M(a) = ρ(a)|_M. We say that ρ : G → U(H) is reducible if H = M ⊕ M^⊥ where M and M^⊥ are both non-zero closed invariant subspaces of H. A basic result is that every representation of a compact group is a sum of irreducible representations.

Theorem 5.4 Let ρ : G → U(H) be a representation of a compact group G. Then H = ⊕_{n∈I} H_n, where every H_n is invariant to ρ and ρ|_{H_n} is irreducible.

We shall also use the following lemma.

Lemma 5.5 Let G be a compact group, V a finite dimensional vector space, and let ρ : G → GL(V) be a continuous homomorphism of groups (here, GL(V) is the group of invertible linear operators over V). Then:
1. There exists an inner product on V making ρ a unitary representation.
2. Moreover, if V has no non-trivial invariant subspaces (here a subspace U ⊂ V is called invariant if ∀a ∈ G, f ∈ U, ρ(a)f ∈ U), then this inner product is unique up to a scalar multiple.
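For a finite group, the standard proof of Lemma 5.5(1) averages the pulled-back inner products over the group. The following sketch illustrates this with a hypothetical non-orthogonal representation of Z₂ on R² (an example chosen for illustration, not a construction from the paper):

```python
import numpy as np

# rho maps the nontrivial element of Z_2 to an invertible but non-orthogonal
# matrix M with M @ M = I.  Averaging rho(a)^T rho(a) over the group yields a
# Gram matrix G, and <u, v>' = u^T G v is an inner product making rho unitary.
M = np.array([[1.0, 1.0],
              [0.0, -1.0]])
assert np.allclose(M @ M, np.eye(2))      # M has order 2, so rho is a representation

G = 0.5 * (np.eye(2) + M.T @ M)           # average of rho(a)^T rho(a) over Z_2
# rho preserves the averaged inner product: rho(a)^T G rho(a) = G for all a.
assert np.allclose(M.T @ G @ M, G)
# G is a genuine inner product (symmetric positive definite).
assert np.all(np.linalg.eigvalsh(G) > 0)
```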


5.1.3 Harmonic Analysis on the Sphere

All the results stated here can be found in (Atkinson and Han, 2012), Chapters 1 and 2. Denote by O(d) the group of unitary operators over R^d and by dA the uniform probability measure over O(d) (that is, dA is the unique probability measure satisfying ∫_{O(d)} f(A) dA = ∫_{O(d)} f(AB) dA = ∫_{O(d)} f(BA) dA for every B ∈ O(d) and every integrable function f : O(d) → C). Denote by dx = dx_{d−1} the Lebesgue (area) measure over S^{d−1} and let L²(S^{d−1}) := L²(S^{d−1}, dx). Given a measurable set Z ⊆ S^{d−1}, we sometimes denote its Lebesgue measure by |Z|. Also, denote by dm = dx / |S^{d−1}| the Lebesgue measure, normalized to be a probability measure.

For every n ∈ N₀, we denote by Y_n^d the linear space of harmonic (i.e., satisfying ∆p = 0) homogeneous polynomials of degree n in d variables. It holds that

N_{d,n} = dim(Y_n^d) = (d+n−1 choose d−1) − (d+n−3 choose d−1) = (2n + d − 2)(n + d − 3)! / (n! (d − 2)!) .

Denote by P_{d,n} : L²(S^{d−1}) → Y_n^d the orthogonal projection onto Y_n^d.

We denote by ρ : O(d) → U(L²(S^{d−1})) the unitary representation defined by

ρ(A)f = f ◦ A^{−1} .

We say that a closed subspace M ⊂ L²(S^{d−1}) is invariant if it is invariant w.r.t. ρ (that is, ∀f ∈ M, A ∈ O(d), f ◦ A ∈ M). We say that an invariant space M is primitive if ρ|_M is irreducible.

Theorem 5.6
1. L²(S^{d−1}) = ⊕_{n=0}^∞ Y_n^d.
2. The primitive finite dimensional subspaces of L²(S^{d−1}) are exactly {Y_n^d}_{n=0}^∞.

Lemma 5.7 Fix an orthonormal basis Y_{n,j}^d, 1 ≤ j ≤ N_{d,n}, of Y_n^d. For every x ∈ S^{d−1} it holds that

Σ_{j=1}^{N_{d,n}} |Y_{n,j}^d(x)|² = N_{d,n} / |S^{d−1}| .

5.1.4 Legendre and Chebyshev Polynomials

The results stated here can be found in (Atkinson and Han, 2012). Fix d ≥ 2.
The d-dimensional Legendre polynomials are the sequence of polynomials over [−1, 1] defined by the recursion formula

P_{d,n}(x) = ((2n + d − 4)/(n + d − 3)) x P_{d,n−1}(x) − ((n − 1)/(n + d − 3)) P_{d,n−2}(x) ,  P_{d,0} ≡ 1, P_{d,1}(x) = x .  (8)

We shall make use of the following properties of the Legendre polynomials.

Proposition 5.8
1. For every d ≥ 2, the sequence {P_{d,n}} is an orthogonal basis of the Hilbert space L²([−1, 1], (1 − x²)^{(d−3)/2} dx).


2. For every n, d, ‖P_{d,n}‖_∞ = 1.

The Chebyshev polynomials of the first kind are defined as T_n := P_{2,n}. The Chebyshev polynomials of the second kind are the polynomials over [−1, 1] defined by the recursion formula

U_n(x) = 2x U_{n−1}(x) − U_{n−2}(x) ,  U_0 ≡ 1, U_1(x) = 2x .

We shall make use of the following properties of the Chebyshev polynomials.

Proposition 5.9
1. For every n ≥ 1, T_n′ = n U_{n−1}.
2. ‖U_n‖_∞ = n + 1.

Given a measure µ over [−1, 1], the orthogonal polynomials corresponding to µ are the sequence of polynomials obtained upon applying the Gram-Schmidt procedure to 1, x, x², x³, …. We note that 1, √2 T_1, √2 T_2, √2 T_3, … are the orthogonal polynomials corresponding to the probability measure dµ = dx / (π √(1 − x²)).

5.1.5 Bochner Integral and Bochner Spaces

Proofs and elaborations on the material appearing in this section can be found in (Yosida, 1963). Let (X, m, µ) be a measure space and let H be a Hilbert space.
A function f : X → H is (Bochner) measurable if there exists a sequence of functions f_n : X → H such that

• For almost every x ∈ X, f(x) = lim_{n→∞} f_n(x).
• The range of every f_n is countable and, for every v ∈ H, f_n^{−1}(v) is measurable.

A measurable function f : X → H is (Bochner) integrable if there exists a sequence of simple measurable functions (in the usual sense) s_n such that lim_{n→∞} ∫_X ‖f(x) − s_n(x)‖_H dµ(x) = 0. We define the integral of f to be ∫_X f dµ = lim_{n→∞} ∫ s_n dµ, where the integral of a simple function s = Σ_{i=1}^n 1_{A_i} v_i, A_i ∈ m, v_i ∈ H, is ∫_X s dµ = Σ_{i=1}^n µ(A_i) v_i.

Denote by L²(X, H) the Kolmogorov quotient (by equality almost everywhere) of the space of all measurable functions f : X → H such that ∫_X ‖f‖²_H dµ < ∞.

Theorem 5.10 L²(X, H) is a Hilbert space w.r.t. the inner product 〈f, g〉_{L²(X,H)} = ∫_X 〈f(x), g(x)〉_H dµ(x).

5.2 Learnability implies small radius

The purpose of this section is to show that if X is a subset of some Hilbert space H such that it is possible to learn affine functionals over X w.r.t. some loss, then we can essentially assume that X is contained in a unit ball and the returned affine functional is of norm O(m³), where m is the number of examples.


Lemma 5.11 (John's Lemma; Matousek (2002)) Let V be an m-dimensional real vector space and let K ⊂ V be a full-dimensional compact convex set. There exists an inner product on V so that K is contained in a unit ball and contains a ball of radius 1/m, both centered at the same x ∈ K. Moreover, if K is 0-symmetric, it is possible to take x = 0, and the ratio between the radii can be improved to √m.

Lemma 5.12 Let l be a convex surrogate, let V be an m-dimensional vector space, and let X ⊂ V be a bounded subset that spans V as an affine space. There exists an inner product 〈·, ·〉 on V and a probability measure µ_N such that

• For every w ∈ V, b ∈ R, ‖w‖ ≤ 4m² Err_{µ_N,hinge}(Λ_{w,b}).
• X is contained in a unit ball.

Proof Let us apply John's Lemma to K = conv(X). It yields an inner product on V with K contained in a unit ball and containing a ball of radius 1/m, both centered at the same x ∈ V. It remains to prove the existence of the measure µ_N. W.l.o.g., we assume that x = 0. Let e_1, …, e_m ∈ V be an orthonormal basis. For every i ∈ [m], represent both (1/m) e_i and −(1/m) e_i as a convex combination of m + 1 elements from X:

(1/m) e_i = Σ_{j=1}^{m+1} λ_i^j x_i^j ,  −(1/m) e_i = Σ_{j=1}^{m+1} ρ_i^j z_i^j .

Now, define

µ_N(x_i^j, 1) = µ_N(x_i^j, −1) = λ_i^j / (4m) ,  µ_N(z_i^j, 1) = µ_N(z_i^j, −1) = ρ_i^j / (4m) .


Finally, let w ∈ V, b ∈ R. We have

Err_{µ_N,hinge}(Λ_{w,b})
 = Σ_{i=1}^m Σ_{j=1}^{m+1} (λ_i^j/(4m)) [ l_hinge(Λ_{w,b}(x_i^j)) + l_hinge(−Λ_{w,b}(x_i^j)) ] + (ρ_i^j/(4m)) [ l_hinge(Λ_{w,b}(z_i^j)) + l_hinge(−Λ_{w,b}(z_i^j)) ]
 ≥ (1/(4m)) Σ_{i=1}^m [ l_hinge( Σ_{j=1}^{m+1} λ_i^j Λ_{w,b}(x_i^j) ) + l_hinge( −Σ_{j=1}^{m+1} λ_i^j Λ_{w,b}(x_i^j) ) + l_hinge( Σ_{j=1}^{m+1} ρ_i^j Λ_{w,b}(z_i^j) ) + l_hinge( −Σ_{j=1}^{m+1} ρ_i^j Λ_{w,b}(z_i^j) ) ]
 = (1/(4m)) Σ_{i=1}^m [ l_hinge(〈w, e_i〉/m + b) + l_hinge(−〈w, e_i〉/m − b) + l_hinge(−〈w, e_i〉/m + b) + l_hinge(〈w, e_i〉/m − b) ]
 ≥ (1/(4m)) Σ_{i=1}^m l_hinge(−|〈w, e_i〉|/m)
 ≥ (1/(4m²)) Σ_{i=1}^m |〈w, e_i〉|
 ≥ (1/(4m²)) ‖w‖ .  ✷

Let X be a bounded subset of some Hilbert space H and let C > 0. Denote

F_H(X, C) = {Λ_{w,b}|_X : ‖w‖_H ≤ C, b ∈ R} .

Denote by M the collection of all affine subspaces of H that are spanned by points from X. Let t = t_H(X, C) be the maximal number such that for every affine subspace M ∈ M of dimension less than t there is x ∈ X such that d(x, M) > t/C.

Lemma 5.13 Let X be a bounded subset of some Hilbert space H and let C > 0. There is A ⊂ X with |A| = t_H(X, C) such that [−1, 1]^A ⊂ F_H(X, C)|_A.

Proof Denote t = t_H(X, C). Let x_0, …, x_t ∈ X be points such that the (t-dimensional) volume of the parallelogram Q defined by the vectors

x_1 − x_0, …, x_t − x_0

is maximal (if the supremum is not attained, the argument can be carried out with a parallelogram whose volume is sufficiently close to the supremum). Let A = {x_1, …, x_t}. We claim that for every 1 ≤ i ≤ t, the distance of x_i from the affine span M_i of A ∪ {x_0} \ {x_i} is ≥ t/C. Indeed, the volume of Q is the (t − 1 dimensional) volume of the parallelogram Q ∩ (M_i − x_0) times d(x_i, M_i). By the maximality of x_0, …, x_t and the definition of t, d(x_i, M_i) ≥ t/C.


For 1 ≤ i ≤ t Let v i = x i − P Mi x i . Note that ‖v i ‖ H = d(x i , M i ) ≥ t Now, given aCfunction f : A → [−1, 1], we will show that f ∈ F H (X , C). Consider the affine functional〈t∑t∑〉f(x i )f(x i )v it∑〈 〉f(xi )v iΛ(x) = 〈v‖vi=1 i ‖ 2 i , x − P Mi (x)〉 H =, x −, PH‖vi=1 i ‖ 2 H‖vH i=1 i ‖ 2 Mi (x) .HHNote that since v i is perpendicular to M i , b := − ∑ 〈 〉t f(xi )v ii=1, P Mi (x) does not dependon x. Let w := ∑ ti=1f(x i )v i‖v i. We have‖ 2 H‖v i ‖ 2 HH‖w‖ H ≤t∑ 1|f(x i )|‖v i ‖ ≤ tC t = C.i=1<strong>The</strong>refore, Λ| X ∈ F H (X , C). Finally, for every 1 ≤ j ≤ t we haveΛ(x j ) =t∑i=1f(x i )〈v‖v i ‖ 2 i , v j − P i (v j )〉 = f(x i ).HHere, the last inequality follows form that fact that for i ≠ j, since v i is perpendicular toM i , 〈v i , v j − P i (v j )〉 = 0. <strong>The</strong>refore, f = Λ| A .Let l : R → R be a surrogate loss function. We say that an algorithm (ɛ, δ)-learns F ⊂ R X<strong>using</strong> m examples w.r.t. l if:• Its input is a sample <strong>of</strong> m points in X × {±1} and its output is a hypothesis in F.• For every distribution D on X × {±1}, it returns, with probability 1 − δ, ˆf ∈ F withErr D,l ( ˆf) ≤ inf f∈F Err D,l (f) + ɛLemma 5.14 Suppose that an algorithm A (ɛ, δ)-learns F <strong>using</strong> m examples w.r.t. a surrogateloss l. <strong>The</strong>n, for every pair <strong>of</strong> distributions D and D ′ on X × {±1}, if ˆf ∈ F is thehypothesis returned by A running on D, then Err D ′ ,l(f) ≤ m(l(0) + ɛ) w.p. ≥ 1 − 2eδ.Pro<strong>of</strong> Suppose toward a contradiction that w.p. ≥ 2eδ we have Err D ′ ,l(f) > a for a >m(l(0) + ɛ). Consider the following distribution, ˜D: w.p.1we sample from m D′ and withprobability 1 − 1 we sample from D. Suppose now that we run the algorithm A on ˜D.mConditioning on the event that all the samples are from D, we have, w.p. ≥ 2eδ,Err D ′ ,l( ˆf) > a and therefore, Err ˜D,l( ˆf) > a . 
The probability that indeed all the $m$ samples are from $D$ is $(1-\frac{1}{m})^m > \frac{1}{2e}$. Hence, with probability $> \delta$, we have $\mathrm{Err}_{\tilde{D},l}(\hat{f}) > \frac{a}{m}$. On the other hand, with probability $\ge 1-\delta$ we have $\mathrm{Err}_{\tilde{D},l}(\hat{f}) \le \inf_{f\in F}\mathrm{Err}_{\tilde{D},l}(f) + \epsilon \le \mathrm{Err}_{\tilde{D},l}(0) + \epsilon = l(0) + \epsilon$. Hence, with positive probability,
$$\frac{a}{m} \le l(0) + \epsilon.$$
It follows that $a \le m(l(0)+\epsilon)$, a contradiction. ✷
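The proof's final step uses the elementary bound $(1-\frac{1}{m})^m > \frac{1}{2e}$ for every $m \ge 2$; the sequence increases toward its limit $\frac{1}{e}$. A quick numeric sanity check, as a sketch (the loop range is an arbitrary choice):

```python
import math

# The proof of Lemma 5.14 conditions on the event that all m samples of the
# mixture come from D; this event has probability (1 - 1/m)^m, which the
# argument lower-bounds by 1/(2e).
def all_from_D_probability(m: int) -> float:
    return (1.0 - 1.0 / m) ** m

for m in range(2, 10_001):
    assert all_from_D_probability(m) > 1.0 / (2.0 * math.e)

# The sequence increases toward its limit 1/e.
assert all_from_D_probability(2) < all_from_D_probability(1000) < 1.0 / math.e
print("bound (1 - 1/m)^m > 1/(2e) verified for 2 <= m <= 10000")
```

For $m = 1$ the probability is $0$, which is why the argument implicitly assumes $m \ge 2$.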


Lemma 5.15 For every surrogate loss $l$ with $\partial_+ l(0) < 0$ there is a constant $c > 0$ such that the following holds. Let $\mathcal{X}$ be a bounded subset of a Hilbert space $H$. If $F_H(\mathcal{X}, C)$ is $(\epsilon,\delta)$-learnable using $m$ examples w.r.t. $l$, then there is an inner product $\langle\cdot,\cdot\rangle_s$ on $H$ such that:
• $\mathcal{X}$ is contained in a unit ball w.r.t. $\|\cdot\|_s$.
• If $\mathcal{A}$ $(\epsilon,\delta)$-learns $F_H(\mathcal{X}, C)$, then for every distribution $D$ on $\mathcal{X}\times\{\pm 1\}$, the hypothesis $\Lambda_{w,b}$ returned by $\mathcal{A}$ satisfies $\|\Lambda_{w,b}\|_s \le c\cdot m^3$ w.p. $\ge 1 - 2e\delta$.
• The norm $\|\cdot\|_s$ is equivalent⁴ to $\|\cdot\|_H$.

Remark 5.16 If $\mathcal{X}$ is the image of some mapping $\psi : Z \to H$, then it follows from the lemma that there is a normalized kernel $k$ on $Z$ such that the hypothesis returned by the learning algorithm (interpreted as a function from $Z$ to $\mathbb{R}$) is of the form $f + b$ with $\|f\|_k \le c\cdot m^3$. Also, if $\psi$ is continuous/absolutely continuous, then so is $k$.

Proof Let $t = t_H(\mathcal{X}, C)$. By Lemma 5.13 there is some $A \subset \mathcal{X}$ such that $[-1,1]^A \subset F_H(\mathcal{X}, C)|_A$. Since $F_H(\mathcal{X}, C)$ is $(\epsilon,\delta)$-learnable using $m$ examples, it is not hard to see that we must have $t \le c'\cdot m$ for some $c' > 0$ that depends only on $l$. Therefore, there exists an affine subspace $M \subset H$ of dimension $d \le c'm$ such that for every $x \in \mathcal{X}$, $d(x, M) < \frac{d}{C} \le \frac{c'm}{C}$. Moreover, we can assume that $M$ is spanned by some subset of $\mathcal{X}$. Denote by $\tilde{M}$ the linear space corresponding to $M$ (i.e., $\tilde{M}$ is the translation of $M$ that contains $0$).
By Lemma 5.12, there is an inner product $\langle\cdot,\cdot\rangle_1$ on $\tilde{M}$ and a probability measure $\mu_N$ on $P_{\tilde{M}}\mathcal{X}$ such that:
• For every $w \in H$, $b \in \mathbb{R}$ we have $\|\Lambda_{P_{\tilde{M}}w,b}\|_1 \le 4d^2\,\mathrm{Err}_{\mu_N,\mathrm{hinge}}(\Lambda_{w,b})$.
• For all $x \in \mathcal{X}$, $\|P_{\tilde{M}}x\|_1 \le 1$.
Finally, define
$$\langle x, y\rangle_s = \frac{1}{2}\langle P_{\tilde{M}}(x), P_{\tilde{M}}(y)\rangle_1 + \frac{C^2}{2d^2}\langle P_{\tilde{M}^\perp}(x), P_{\tilde{M}^\perp}(y)\rangle_H.$$
The first and last assertions of the lemma are easy to verify, so we proceed to the second. Let $\Lambda_{w,b}$ be the hypothesis returned by $\mathcal{A}$ after running on $m$ examples sampled from some distribution $D$. Let $D_N$ be a probability measure on $\mathcal{X}$ whose projection on $\tilde{M}$ is $\mu_N$. By Lemma 5.14 we have, with probability $\ge 1 - 2e\delta$, $\mathrm{Err}_{D_N,l}(\Lambda_{w,b}) \le (l(0)+1)m$. We claim

⁴ Two norms $\|\cdot\|$ and $\|\cdot\|'$ on a vector space $X$ are equivalent if for some $c_1, c_2 > 0$, $\forall x \in X$, $c_1\|x\| \le \|x\|' \le c_2\|x\|$.


that in this case $\mathrm{Err}_{\mu_N,\mathrm{hinge}}(\Lambda_{w,b}) \le \big(\frac{l(0)+1}{|\partial_+ l(0)|} + 2\big)m$. Indeed,
$$\mathrm{Err}_{\mu_N,\mathrm{hinge}}(\Lambda_{w,b}) = \mathbb{E}_{(x,y)\sim\mu_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(x)) = \mathbb{E}_{(x,y)\sim D_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(P_{\tilde{M}}x))$$
$$= \mathbb{E}_{(x,y)\sim D_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(x - (x - P_{\tilde{M}}x))) \le \mathbb{E}_{(x,y)\sim D_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(x)) + \|w\|_H\cdot d(x, P_{\tilde{M}}x)$$
$$\le \mathrm{Err}_{D_N,\mathrm{hinge}}(\Lambda_{w,b}) + C\cdot\frac{m}{C} \le \frac{1}{|\partial_+ l(0)|}\mathrm{Err}_{D_N,l}(\Lambda_{w,b}) + 1 + m \le \frac{l(0)+1}{|\partial_+ l(0)|}m + 2m.$$
By the properties of $\mu_N$, it follows that $\|\Lambda_{P_{\tilde{M}}w,b}\|_1 = O(m^3)$ (here, the constant in the big-O notation depends only on $l$). Finally, we have
$$\|\Lambda_{w,b}\|_s^2 = \|\Lambda_{P_{\tilde{M}}w}\|_s^2 + \|\Lambda_{P_{\tilde{M}^\perp}w}\|_s^2 = 2\|\Lambda_{P_{\tilde{M}}w}\|_1^2 + \frac{2m^2}{C^2}\|\Lambda_{P_{\tilde{M}^\perp}w}\|_H^2 \le O(m^6) + 2m^2 = O(m^6). \qquad ✷$$

5.3 Symmetric Kernels and Symmetrization

In this section we consider symmetric kernels. Fix $d \ge 2$ and let $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ be a continuous positive definite kernel. We say that $k$ is symmetric if
$$\forall A \in O(d),\ x, y \in S^{d-1},\quad k(Ax, Ay) = k(x, y).$$
In other words, $k(x,y)$ depends only on $\langle x, y\rangle_{\mathbb{R}^d}$. An RKHS is called symmetric if its reproducing kernel is symmetric. The next theorem characterizes symmetric RKHSs. We note that theorems of the same spirit have already been proved (e.g. (Schoenberg, 1942)).

Theorem 5.17 Let $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ be a normalized, symmetric and continuous kernel. Then,
1. The group $O(d)$ acts on $H_k$. That is, for every $A \in O(d)$ and every $f \in H_k$ it holds that $f\circ A \in H_k$ and $\|f\|_{H_k} = \|f\circ A\|_{H_k}$.
2. The mapping $\rho : O(d) \to U(H_k)$ defined by $\rho(A)f = f\circ A^{-1}$ is a unitary representation.
3. The space $H_k$ consists of continuous functions.


4. The decomposition of $\rho$ into a sum of irreducible representations is $H_k = \bigoplus_{n\in I} Y_n^d$ for some set $I \subset \mathbb{N}_0$. Moreover,
$$\forall f, g \in H_k,\quad \langle f, g\rangle_{H_k} = \sum_{n\in I} a_n^2\langle P_{d,n}f, P_{d,n}g\rangle_{L^2(S^{d-1})},$$
where $\{a_n\}_{n\in I}$ are positive numbers.
5. It holds that $\sum_{n\in I}\frac{N_{d,n}}{|S^{d-1}|}a_n^{-2} = 1$.

Proof Let $f \in H_k$, $A \in O(d)$. To prove part 1, assume first that
$$\forall x \in S^{d-1},\quad f(x) = \sum_{i=1}^{n}\alpha_i k(x, y_i) \qquad (9)$$
for some $y_1, \ldots, y_n \in S^{d-1}$ and $\alpha_1, \ldots, \alpha_n \in \mathbb{C}$. Since $k$ is symmetric, we have
$$f\circ A(x) = \sum_{i=1}^{n}\alpha_i k(Ax, y_i) = \sum_{i=1}^{n}\alpha_i k(A^{-1}Ax, A^{-1}y_i) = \sum_{i=1}^{n}\alpha_i k(x, A^{-1}y_i).$$
Thus, by Theorem 5.1, $f\circ A \in H_k$. Moreover, it holds that
$$\|f\circ A\|_{H_k}^2 = \sum_{1\le i,j\le n}\alpha_i\bar{\alpha}_j k(A^{-1}y_j, A^{-1}y_i) = \sum_{1\le i,j\le n}\alpha_i\bar{\alpha}_j k(y_j, y_i) = \|f\|_{H_k}^2.$$
Thus, part 1 holds for functions $f \in H_k$ of the form (9). For general $f \in H_k$, by Theorem 5.1, there is a sequence $f_n \in H_k$ of functions of the form (9) that converges to $f$ in $H_k$. From what we have shown for functions of the form (9) it follows that $\|f_n - f_m\|_{H_k} = \|f_n\circ A - f_m\circ A\|_{H_k}$; thus $f_n\circ A$ is a Cauchy sequence and hence has a limit $g \in H_k$. By Theorem 5.1, convergence in $H_k$ entails pointwise convergence, thus $g = f\circ A$. Finally,
$$\|f\|_{H_k} = \lim_{n\to\infty}\|f_n\|_{H_k} = \lim_{n\to\infty}\|f_n\circ A\|_{H_k} = \|f\circ A\|_{H_k}.$$
We proceed to part 2. It is not hard to check that $\rho$ is a group homomorphism, so it only remains to validate that for every $f \in H_k$ the mapping $A \mapsto \rho(A)f$ is continuous. Let $\epsilon > 0$ and let $A \in O(d)$. We must show that there exists a neighbourhood $U$ of $A$ such that $\forall B \in U$, $\|f\circ A^{-1} - f\circ B^{-1}\|_{H_k} < \epsilon$. Choose $g(\cdot) = \sum_{i=1}^{n}\alpha_i k(\cdot, y_i)$ such that $\|g - f\|_{H_k} < \frac{\epsilon}{3}$.
By part 1, it holds that
$$\|f\circ A^{-1} - f\circ B^{-1}\|_{H_k} \le \|f\circ A^{-1} - g\circ A^{-1}\|_{H_k} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \|g\circ B^{-1} - f\circ B^{-1}\|_{H_k}$$
$$= \|f - g\|_{H_k} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \|g - f\|_{H_k} < \frac{\epsilon}{3} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \frac{\epsilon}{3}.$$


Thus, it is enough to find a neighbourhood $U$ of $A$ such that $\forall B \in U$, $\|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} < \frac{\epsilon}{3}$.


Proof Part 1 follows readily from the definition. We proceed to part 2. Define $\phi : S^{d-1} \to L^2(O(d), H_k)$ by
$$\phi(x)(A)(\cdot) = k(Ax, \cdot).$$
Note that
$$\langle\phi(x), \phi(y)\rangle_{L^2(O(d),H_k)} = \int_{O(d)}\langle\phi(x)(A), \phi(y)(A)\rangle\,dA = \int_{O(d)} k(Ax, Ay)\,dA = k_s(x, y).$$
Thus, the theorem follows from Theorem 5.1.1. ✷

5.4 Lemma 5.22 and its proof

Lemma 5.19 For every $n > 0$, $d \ge 5$ and $t \in [-1,1]$ it holds that
$$|P_{d,n}(t)| \le \min\left\{\frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\left[\frac{4}{n(1-t^2)}\right]^{\frac{d-2}{2}},\ \left(\frac{n}{n+d-2} + 2|t|\right)^{\frac{n}{2}}\right\}.$$
Moreover, if $\frac{n}{n+d-2} + 2|t| \le 1$ we also have
$$|P_{d,n}(t)| \le \sqrt{\prod_{i=1}^{n}\left(\frac{i}{i+d-2} + 2|t|\right)}.$$
Finally, there exist constants $E > 0$ and $0 < r, s < 1$ such that for every $K > 0$, $d \ge 5$ and $t \in [-\frac{1}{8}, \frac{1}{8}]$ we have
$$\sum_{n=K}^{\infty}|P_{d,n}(t)| \le Er^K + Es^d.$$

Proof In (Atkinson and Han, 2012) it is shown that $|P_{d,n}(t)| \le \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\big[\frac{4}{n(1-t^2)}\big]^{\frac{d-2}{2}}$. We shall prove, by induction on $n$, that
$$|P_{d,n}(t)| \le \sqrt{\prod_{i=1}^{n}\left(\frac{i}{i+d-2} + 2|t|\right)}$$
whenever $\frac{n}{n+d-2} + 2|t| \le 1$. For $n = 0, 1$ it follows from the fact that $P_{d,0} \equiv 1$ and $P_{d,1}(t) = t$. Let $n > 1$. By the induction hypothesis and the recursion formula for the


Legendre polynomials we have
$$|P_{d,n}(t)| \le \frac{2n+d-4}{n+d-3}|t|\,|P_{d,n-1}(t)| + \frac{n-1}{n+d-3}|P_{d,n-2}(t)| \le 2|t|\,|P_{d,n-1}(t)| + \frac{n-1}{n+d-3}|P_{d,n-2}(t)|$$
$$\le 2|t|\sqrt{\prod_{i=1}^{n-1}\Big(\frac{i}{i+d-2}+2|t|\Big)} + \frac{n-1}{n+d-3}\sqrt{\prod_{i=1}^{n-2}\Big(\frac{i}{i+d-2}+2|t|\Big)}$$
$$\le \Big(2|t| + \frac{n-1}{n+d-3}\Big)\sqrt{\prod_{i=1}^{n-2}\Big(\frac{i}{i+d-2}+2|t|\Big)} \le \sqrt{\Big(2|t|+\frac{n-1}{n+d-3}\Big)\Big(2|t|+\frac{n}{n+d-2}\Big)}\sqrt{\prod_{i=1}^{n-2}\Big(\frac{i}{i+d-2}+2|t|\Big)}$$
$$= \sqrt{\prod_{i=1}^{n}\Big(\frac{i}{i+d-2}+2|t|\Big)}.$$
(The third inequality uses the assumption, under which each factor is at most $1$.) Now, for every $K, \bar{K} \ge 0$ such that $\frac{\bar{K}}{\bar{K}+d-2} + 2|t| < 1$, we have
$$\sum_{n=K}^{\infty}|P_{d,n}(t)| \le \sum_{n=K}^{\bar{K}}\Big(\frac{n}{n+d-2}+2|t|\Big)^{\frac{n}{2}} + \sum_{n=\bar{K}+1}^{\infty}\frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{n(1-t^2)}\Big]^{\frac{d-2}{2}}$$
$$\le \sum_{n=K}^{\bar{K}}\Big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\Big)^{\frac{n}{2}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{1-t^2}\Big]^{\frac{d-2}{2}}\sum_{n=\bar{K}+1}^{\infty}n^{-\frac{d-2}{2}}$$
$$\le \frac{\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{K}{2}}}{1-\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{1}{2}}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{1-t^2}\Big]^{\frac{d-2}{2}}\int_{\bar{K}}^{\infty}x^{-\frac{d-2}{2}}\,dx$$
$$= \frac{\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{K}{2}}}{1-\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{1}{2}}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{1-t^2}\Big]^{\frac{d-2}{2}}\cdot\frac{2}{d-4}\bar{K}^{-\frac{d-4}{2}}.$$


(We limit ourselves to $d \ge 5$ to guarantee the convergence of $\sum n^{-\frac{d-2}{2}}$.) In particular, if $|t| \le \frac{1}{8}$ and $\bar{K} = d-2$, we have $\frac{\bar{K}}{\bar{K}+d-2}+2|t| \le \frac{3}{4}$, and hence
$$\sum_{n=K}^{\infty}|P_{d,n}(t)| \le \frac{1}{1-(\frac{3}{4})^{\frac{1}{2}}}\Big(\frac{3}{4}\Big)^{\frac{K}{2}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\cdot\frac{2}{d-4}\cdot\frac{4.07^{\frac{d-2}{2}}}{(d-2)^{\frac{d-4}{2}}} \le \frac{1}{1-(\frac{3}{4})^{\frac{1}{2}}}\Big(\frac{3}{4}\Big)^{\frac{K}{2}} + 12\Big[\frac{4.07}{2e}\Big]^{\frac{d-2}{2}},$$
where the last step bounds $\Gamma(\frac{d-1}{2})$ via Stirling's formula; this is of the required form $Er^K + Es^d$. ✷

Lemma 5.20 Let $\mu$ be a probability measure on $[-1,1]$ and let $p_0, p_1, \ldots$ be the corresponding orthogonal polynomials. Then, for every $f \in \mathrm{span}\{p_0, \ldots, p_{K-1}\}$ we have
$$\|f\|_2 \le \sqrt{K}\,\|f\|_1\cdot\max_{0\le i\le K-1}\|p_i\|_\infty.$$
Here, all $L_p$ norms are w.r.t. $\mu$.

Proof Write $f = \sum_{i=0}^{K-1}\alpha_i p_i$ and denote $M = \max_{0\le i\le K-1}\|p_i\|_\infty$. We have
$$\|f\|_2^2 \le \|f\|_1\cdot\|f\|_\infty \le \|f\|_1\cdot M\sum_{i=0}^{K-1}|\alpha_i| \le \|f\|_1\cdot M\sqrt{\sum_{i=0}^{K-1}\alpha_i^2}\cdot\sqrt{K} = \|f\|_1\cdot M\cdot\|f\|_2\sqrt{K}. \qquad ✷$$

Lemma 5.21 Let $d \ge 5$ and let $f : [-1,1] \to \mathbb{R}$ be a continuous function whose expansion in the basis of $d$-dimensional Legendre polynomials is
$$f = \sum_{n=0}^{\infty}\alpha_n P_{d,n}.$$
Denote $C = \sup_n|\alpha_n|$. Let $\mu$ be the probability measure on $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$


Then, for every $K \in \mathbb{N}$ and $0 < \gamma < \frac{1}{8}$,
$$|f(\gamma) - f(-\gamma)| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu} + (32\gamma K^{3.5}+2)\cdot C\cdot E\cdot(r^K+s^d).$$
Here, $E$, $r$ and $s$ are the constants from Lemma 5.19.

Proof Let $\bar{f} = \sum_{n=0}^{K-1}\alpha_n P_{d,n}$. We have $\|\bar{f}\|_{1,\mu} \le \|f\|_{1,\mu} + \|\bar{f}-f\|_{\infty,\mu}$. Define $g : [-1,1] \to \mathbb{R}$ by $g(x) = \bar{f}(\frac{x}{8})$ and denote $d\lambda = \frac{dx}{\pi\sqrt{1-x^2}}$. Write
$$g = \sum_{n=0}^{K-1}\beta_n T_n,$$
where the $T_n$ are the Chebyshev polynomials. By Lemma 5.20 it holds that, for every $0 \le n \le K-1$,
$$|\beta_n| \le \sqrt{2}\,\|g\|_{2,\lambda} \le 2\sqrt{K}\,\|g\|_{1,\lambda} = 2\sqrt{K}\,\|\bar{f}\|_{1,\mu}.$$
Now,
$$g' = \sum_{n=1}^{K-1}\beta_n\, n\, U_{n-1},$$
where the $U_n$ are the Chebyshev polynomials of the second kind. Thus,
$$\|g'\|_\infty \le \sum_{n=1}^{K-1}|\beta_n|\cdot n\cdot\|U_{n-1}\|_\infty = \sum_{n=1}^{K-1}|\beta_n|\cdot n^2 \le 2\sqrt{K}\,\|\bar{f}\|_{1,\mu}\cdot K^3.$$
Finally, by Lemma 5.19,
$$|f(\gamma) - f(-\gamma)| \le |g(8\gamma) - g(-8\gamma)| + 2\|f-\bar{f}\|_{\infty,\mu} \le 32\gamma K^{3.5}\cdot\|\bar{f}\|_{1,\mu} + 2\|f-\bar{f}\|_{\infty,\mu}$$
$$\le 32\gamma K^{3.5}\cdot\big(\|f\|_{1,\mu} + \|f-\bar{f}\|_{\infty,\mu}\big) + 2\|f-\bar{f}\|_{\infty,\mu} \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d). \qquad ✷$$

For $e \in S^{d-1}$ we define the group $O(e) := \{A \in O(d) : Ae = e\}$. If $H_k$ is a symmetric RKHS and $e \in S^{d-1}$, we define the symmetrization around $e$ as the operator $P_e : H_k \to H_k$ which is the projection on the subspace $\{f \in H_k : \forall A \in O(e),\ f\circ A = f\}$. It is not hard to see that $(P_e f)(x) = \int_{\{x':\langle x',e\rangle=\langle x,e\rangle\}} f(x')\,dx' = \int_{O(e)} f\circ A(x)\,dA$. Since $P_e f$ is a convex combination of the functions $\{f\circ A\}_{A\in O(e)}$, it follows that if $R : H_k \to \mathbb{R}$ is a convex functional then $R(P_e f) \le \int_{O(e)} R(f\circ A)\,dA$.

Lemma 5.22 (main) There exists a probability measure $\mu$ on $[-1,1]$ with the following properties. For every continuous and normalized kernel $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ and $C > 0$, there exists $e \in S^{d-1}$ such that, for every $f \in H_k$ with $\|f\|_{H_k} \le C$, $K \in \mathbb{N}$ and $0 < \gamma < \frac{1}{8}$,
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} f - \int_{\{x:\langle x,e\rangle=-\gamma\}} f\Big| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d) \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + 10\cdot E\cdot K^{3.5}\cdot C\cdot(r^K+s^d).$$


The integrals are w.r.t. the uniform probability over $\{x : \langle x,e\rangle = \gamma\}$ and $\{x : \langle x,e\rangle = -\gamma\}$, and $E, r, s$ are the constants from Lemma 5.19.

Proof Suppose first that $k$ is symmetric. Let $\mu$ be the distribution over $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
We can assume that $f$ is $O(e)$-invariant. Otherwise, we can replace $f$ with $P_e f$, which does not change the l.h.s. and does not increase the r.h.s. This assumption yields (see (Atkinson and Han, 2012), pages 17–18)
$$f(x) = \sum_{n=0}^{\infty}\alpha_n P_{d,n}(\langle e, x\rangle).$$
The $L^2(S^{d-1})$-norm of the map $x \mapsto P_{d,n}(\langle x,e\rangle)$ is $\sqrt{\frac{|S^{d-1}|}{N_{d,n}}}$ (e.g. (Atkinson and Han, 2012), page 71). Therefore,
$$\|f\|_k^2 = \sum_{n\in I} a_n^2\frac{|S^{d-1}|}{N_{d,n}}\alpha_n^2,$$
where $\{a_n\}_{n\in I}$ are the numbers corresponding to $H_k$ from Theorem 5.17. In particular (since also for $n \notin I$, $\alpha_n = 0$),
$$|\alpha_n|^2 \le \frac{N_{d,n}}{|S^{d-1}|}a_n^{-2}\|f\|_k^2 \le \|f\|_k^2.$$
Write $g(t) = f(te)$, $t \in [-1,1]$. By Lemma 5.21,
$$|g(\gamma) - g(-\gamma)| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d).$$
Finally, $\int_{\{x:\langle x,e\rangle=\gamma\}} f = g(\gamma)$ and $\int_{\{x:\langle x,e\rangle=-\gamma\}} f = g(-\gamma)$ since $f$ is $O(e)$-invariant. The lemma follows.

We proceed to the general case where $k$ is not necessarily symmetric. Assume by way of contradiction that for every $e \in S^{d-1}$ there exists a function $f_e$ with $\|f_e\|_{H_k} \le C$ such that
$$\int_{\{x:\langle x,e\rangle=\gamma\}} f_e - \int_{\{x:\langle x,e\rangle=-\gamma\}} f_e > 32\gamma K^{3.5}\cdot\|f_e\|_{1,\mu_e} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d). \qquad (10)$$
For convenience we normalize so that the l.h.s. equals $1$. Fix a vector $e_0 \in S^{d-1}$. Define $\Phi \in L^2(O(d), H_k)$ by
$$\Phi(A) = f_{Ae_0}$$
and let $f \in H_{k_s}$ be the function
$$f(x) = \int_{O(d)}\Phi(A)(Ax)\,dA = \int_{O(d)} f_{Ae_0}(Ax)\,dA.$$


Now, it holds that
$$\int_{\{x:\langle x,e_0\rangle=\gamma\}} f - \int_{\{x:\langle x,e_0\rangle=-\gamma\}} f = \int_{\{x:\langle x,e_0\rangle=\gamma\}}\int_{O(d)} f_{Ae_0}(Ax)\,dA\,dx - \int_{\{x:\langle x,e_0\rangle=-\gamma\}}\int_{O(d)} f_{Ae_0}(Ax)\,dA\,dx$$
$$= \int_{O(d)}\bigg[\int_{\{x:\langle x,e_0\rangle=\gamma\}} f_{Ae_0}(Ax)\,dx - \int_{\{x:\langle x,e_0\rangle=-\gamma\}} f_{Ae_0}(Ax)\,dx\bigg]dA$$
$$= \int_{O(d)}\bigg[\int_{\{x:\langle x,Ae_0\rangle=\gamma\}} f_{Ae_0}(x)\,dx - \int_{\{x:\langle x,Ae_0\rangle=-\gamma\}} f_{Ae_0}(x)\,dx\bigg]dA = 1.$$
On the other hand,
$$\|f\|_{1,\mu_{e_0}} = \int_{S^{d-1}}\bigg|\int_{O(d)} f_{Ae_0}(Ax)\,dA\bigg|\,d\mu_{e_0}(x) \le \int_{O(d)}\int_{S^{d-1}}|f_{Ae_0}(Ax)|\,d\mu_{e_0}(x)\,dA$$
$$= \int_{O(d)}\int_{S^{d-1}}|f_{Ae_0}(x)|\,d\mu_{Ae_0}(x)\,dA = \int_{O(d)}\|f_{Ae_0}\|_{1,\mu_{Ae_0}}\,dA.$$
Moreover, by Theorem 5.18,
$$\|f\|_{H_{k_s}}^2 \le \|\Phi\|_{L^2(O(d),H_k)}^2 = \int_{O(d)}\|f_{Ae_0}\|_{H_k}^2\,dA \le C^2.$$
Since the lemma is already proved for symmetric kernels, it follows that
$$1 \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_{e_0}} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d) \le \int_{O(d)}\Big[32\gamma K^{3.5}\cdot\|f_{Ae_0}\|_{1,\mu_{Ae_0}} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d)\Big]dA.$$
Thus, for some $A \in O(d)$,
$$1 \le 32\gamma K^{3.5}\cdot\|f_{Ae_0}\|_{1,\mu_{Ae_0}} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d),$$
contradicting Equation (10). ✷

5.5 Proofs of the main Theorems

We are now ready to prove Theorems 2.6 and 3.2. We only consider distributions that are supported on the unit sphere, and we can therefore assume that the problem is formulated


in terms of the unit sphere and not the unit ball. Also, we reformulate program (5) as follows: given a convex surrogate $l : \mathbb{R} \to \mathbb{R}$, a constant $C > 0$ and a continuous kernel $k : S^\infty\times S^\infty \to \mathbb{R}$ with $\sup_{x\in S^\infty} k(x,x) \le 1$, we want to solve
$$\min\ \mathrm{Err}_{D,l}(f+b)\quad\text{s.t.}\quad f \in H_k,\ b \in \mathbb{R},\ \|f\|_{H_k} \le C. \qquad (11)$$
We can assume that $\partial_+ l(0) < 0$, for otherwise the approximation ratio is $\infty$. To see that, let the distribution $D$ be concentrated on a single point on the sphere and always return the label $1$. Of course, $\mathrm{Err}_\gamma(D) = 0$. However, if $\partial_+ l(0) \ge 0$, it is not hard to see that if $f, b$ is the solution of program (11), then $f(x)+b \le 0$, so that $\mathrm{Err}_{0-1}(f+b) = 1$.

Lemma 5.23 Let $l$ be a surrogate loss, $\mu$ a probability measure on $S^{d-1}$ and $f \in C(S^{d-1})$. Let $\bar\mu$ be the probability measure on $S^{d-1}\times\{\pm 1\}$ which is the product measure of $\mu$ and the uniform distribution on $\{\pm 1\}$. Then
$$\|f\|_{1,\mu} \le \frac{2}{|\partial_+ l(0)|}\,\mathrm{Err}_{\bar\mu,l}(f).$$

Proof By Jensen's inequality, it holds that
$$\mathrm{Err}_{\bar\mu,l}(f) = \mathbb{E}_{(x,y)\sim\bar\mu}\, l(y\cdot f(x)) = \tfrac{1}{2}\mathbb{E}_{x\sim\mu}\big[l(f(x)) + l(-f(x))\big] \ge \tfrac{1}{2}\mathbb{E}_{x\sim\mu}\, l(-|f(x)|) \ge \tfrac{1}{2}\, l\big(-\mathbb{E}_{x\sim\mu}|f(x)|\big).$$
It follows that $l(-\|f\|_{1,\mu}) \le 2\,\mathrm{Err}_{\bar\mu,l}(f)$. By the convexity of $l$, for every $x \in \mathbb{R}$, $l(x) \ge l(0) + x\cdot\partial_+ l(0) \ge x\cdot\partial_+ l(0)$; applying this with $x = -\|f\|_{1,\mu}$ gives $l(-\|f\|_{1,\mu}) \ge \|f\|_{1,\mu}\cdot|\partial_+ l(0)|$. Thus,
$$\|f\|_{1,\mu} \le \frac{2}{|\partial_+ l(0)|}\,\mathrm{Err}_{\bar\mu,l}(f). \qquad ✷$$

5.5.1 Theorems 2.6 and 3.2

We will need Levy's concentration of measure lemma (e.g., (Milman and Schechtman, 2002)). Let $f : X \to Y$ be an absolutely continuous map between metric spaces. We define its modulus of continuity as
$$\forall\epsilon > 0,\quad \omega_f(\epsilon) = \sup\{d(f(x), f(y)) : x, y \in X,\ d(x,y) \le \epsilon\}.$$

Theorem 5.24 (Levy's Lemma) There exists a constant $\eta > 0$ such that for every continuous function $f : S^{d-1} \to \mathbb{R}$,
$$\Pr(|f - \mathbb{E}f| > \omega_f(\epsilon)) \le \exp(-\eta d\epsilon^2).$$
Here, both probability and expectation are w.r.t. the uniform distribution.
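For the hinge loss, $|\partial_+ l(0)| = 1$ and Lemma 5.23 reads $\|f\|_{1,\mu} \le 2\,\mathrm{Err}_{\bar\mu,\mathrm{hinge}}(f)$: with uniform random labels, each point contributes $\frac{1}{2}(l_{\mathrm{hinge}}(f(x)) + l_{\mathrm{hinge}}(-f(x))) \ge \frac{1}{2}|f(x)|$. A numeric sketch of this special case (the sample values standing in for $\mu$ are arbitrary):

```python
import random

def hinge(t: float) -> float:
    """l_hinge(t) = max(0, 1 - t); note |d+ hinge(0)| = 1."""
    return max(0.0, 1.0 - t)

random.seed(0)
# f evaluated on a finite sample standing in for the measure mu.
values = [random.uniform(-3.0, 3.0) for _ in range(1000)]

# Err_{mu-bar, hinge}(f): labels uniform on {+1, -1}, independent of x,
# so each point contributes the average of hinge(f(x)) and hinge(-f(x)).
err = sum(0.5 * (hinge(v) + hinge(-v)) for v in values) / len(values)
l1_norm = sum(abs(v) for v in values) / len(values)

# Lemma 5.23 for the hinge loss: ||f||_1 <= 2 * Err.
assert l1_norm <= 2.0 * err
print(f"||f||_1 = {l1_norm:.3f} <= 2 * Err = {2 * err:.3f}")
```

The inequality holds pointwise here, since $\frac{1}{2}(\mathrm{hinge}(t)+\mathrm{hinge}(-t))$ equals $1$ for $|t| \le 1$ and $\frac{1}{2}(1+|t|)$ otherwise, both of which dominate $\frac{1}{2}|t|$.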


We note that $\omega_{f\circ g} \le \omega_f\circ\omega_g$ and that $\omega_{\Lambda_v}(\epsilon) = \|v\|\cdot\epsilon$. Thus, if $\psi : S^\infty \to H_1$ is an absolutely continuous embedding such that $k(x,y) = \langle\psi(x),\psi(y)\rangle_{H_1}$, then for every $v \in H_1$ it holds that $\omega_{\Lambda_{v,0}\circ\psi} \le \|v\|_{H_1}\cdot\omega_\psi$. Suppose now that $f \in H_k$ with $\|f\|_{H_k} \le C$. Let $v \in H_1$ be such that $f = \Lambda_{v,0}\circ\psi$ and $\|v\|_{H_1} = \|f\|_{H_k} \le C$. It follows from Levy's Lemma that
$$\Pr(|f - \mathbb{E}f| > C\cdot\omega_\psi(\epsilon)) \le \Pr(|f - \mathbb{E}f| > \omega_f(\epsilon)) \le \exp(-\eta d\epsilon^2), \qquad (12)$$
where, again, both probability and expectation are w.r.t. the uniform distribution over $S^{d-1}$.

Proof (of Theorems 2.6 and 3.2) Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. Define probability measures $\mu^1, \mu^2, \mu$ over $[-1,1]\times\{\pm 1\}$ as follows:
$$\mu^1((-\gamma,-1)) = 1-\theta,\quad \mu^1((\gamma,1)) = \theta.$$
The measure $\mu^2$ is the product of the uniform distribution on $\{\pm 1\}$ and the measure on $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
Finally, $\mu = (1-\lambda)\mu^1 + \lambda\mu^2$ for $\lambda > 0$, which will be chosen later.

By Lemma 5.15 (see Remark 5.16), there is a continuous normalized kernel $k'$ such that w.p. $\ge 1 - 2e\exp(-\frac{1}{\gamma})$ the function returned by the algorithm is of the form $f + b$ with $\|f\|_{H_{k'}} \le c\cdot m_{\mathcal{A}}(\gamma)^3$ for some $c > 0$ (depending only on $l$). Let $e \in S^{d-1}$ be the vector from Lemma 5.22, corresponding to the kernel $k'$. The distribution $D$ is the pullback of $\mu$ w.r.t. $e$. By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_\gamma(D) \le \lambda$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (11) with $C = C_{\mathcal{A}}(\gamma)$ and with an additive error $\le \sqrt{\gamma}$. Since the value of the zero solution for program (11) is $l(0)$, it follows that
$$l(0) + \sqrt{\gamma} \ge \mathrm{Err}_{\mu,l}(g) = (1-\lambda)\,\mathrm{Err}_{\mu^1_e,l}(g) + \lambda\,\mathrm{Err}_{\mu^2_e,l}(g).$$
Thus, $\mathrm{Err}_{\mu^2_e,l}(g) \le \frac{l(0)+\sqrt{\gamma}}{\lambda} \le \frac{2l(0)}{\lambda}$.
Combining Lemma 5.23, Lemma 5.15, and Lemma 5.22, it follows that w.p. $\ge 1 - (1+2e)\exp(-\frac{1}{\gamma}) \ge 1 - 10\exp(-\frac{1}{\gamma})$, for $m = m_{\mathcal{A}}(\gamma)$,
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} g - \int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big| \le \frac{128\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda} + 10\cdot c\cdot K^{3.5}\cdot E\cdot m^3\cdot(r^K+s^d).$$
By choosing $K = \Theta(\log m)$, $\lambda = \Theta(\gamma K^{3.5}) = \Theta(\gamma\log^{3.5} m)$ and $d = \Theta(\log m)$, we can make the last bound $\le \frac{\alpha}{2}$. We claim that $\int_{\{x:\langle x,e\rangle=-\gamma\}} g > \frac{\alpha}{2}$. To see that, note that otherwise $\int_{\{x:\langle x,e\rangle=\gamma\}} g \le \alpha$; thus,
$$\mathbb{E}_{(x,y)\sim D}\, l((f(x)+b)y) = \mathbb{E}_{(x,y)\sim D}\, l(g(x)y) \ge \theta(1-\lambda)\int_{\{x:\langle x,e\rangle=\gamma\}} l(g(x))\,dx$$
$$\ge \theta(1-\lambda)\, l\Big(\int_{\{x:\langle x,e\rangle=\gamma\}} g(x)\,dx\Big) \ge \theta\cdot l(\alpha)\cdot(1-\lambda) = \theta\cdot l(\alpha) + o(1).$$


This contradicts the optimality of $f, b$, since for $f' = 0$, $b' = \beta$ it holds that
$$\mathbb{E}_{(x,y)\sim D}\, l((f'(x)+b')y) \le \lambda l(-\beta) + (1-\lambda)\big((1-\theta)l(-\beta) + \theta l(\beta)\big) = (1-\theta)l(-\beta) + \theta l(\beta) + o(1).$$
We can conclude now the proof of Theorem 2.6. By choosing $d$ large enough and using Equation (12), we can guarantee that $g|_{\{x:\langle x,e\rangle=-\gamma\}}$ is very concentrated around its expectation. In particular, if $(x,y)$ is sampled according to $D$, then w.p. $> 0.5\cdot(1-\theta)\cdot(1-\lambda) = \Omega(1)$ it holds that $yg(x) < 0$. Thus, $\mathrm{Err}_{D,0-1}(g) = \Omega(1)$, while $\mathrm{Err}_\gamma(D) \le \lambda = O(\gamma\,\mathrm{poly}(\log m))$.

To conclude the proof of Theorem 3.2, we note that we can assume that $g$ is $O(e)$-invariant. Otherwise, we can replace it with $P_e f + b$. This does not increase $\|f\|_{H_k}$ nor $\mathrm{Err}_{D,l}(f+b)$; thus the solution $P_e f + b$ is optimal as well. Now it follows that $g|_{\{x:\langle x,e\rangle=-\gamma\}}$ is constant and we finish as before. ✷

5.5.2 Theorem 3.1

Let $L$ be the Lipschitz constant of $l$. Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. First, define probability measures $\mu^1, \mu^2, \mu^3$ and $\mu$ over $[-1,1]\times\{\pm 1\}$ as follows:
$$\mu^1((\gamma,1)) = \theta,\quad \mu^1((-\gamma,-1)) = 1-\theta,\quad \mu^2((-\gamma,1)) = 1.$$
The measure $\mu^3$ is the product of the uniform distribution on $\{\pm 1\}$ and the measure over $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
Finally, $\mu = (1-\lambda_2-\lambda_3)\mu^1 + \lambda_2\mu^2 + \lambda_3\mu^3$ with $\lambda_2, \lambda_3 > 0$ to be chosen later.

By Lemma 5.15 (see Remark 5.16), there is a continuous normalized kernel $k'$ such that w.p. $\ge 1 - 2e\exp(-\frac{1}{\gamma})$ the function returned by the algorithm is of the form $f + b$ with $\|f\|_{H_{k'}} \le c\cdot m_{\mathcal{A}}(\gamma)^3$ for some $c > 0$ (depending only on $l$). Now, let $e \in S^{d-1}$ be the vector from Lemma 5.22, corresponding to the kernel $k'$.
The distribution $D$ is the pullback of $\mu$ w.r.t. $e$. By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_\gamma(D) \le \lambda_3 + \lambda_2$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (11) with $C = C_{\mathcal{A}}(\gamma)$ and with an additive error $\le \sqrt{\gamma}$. As in the proof of Theorem 2.6, it holds that, w.p. $\ge 1 - 10\exp(-\frac{1}{\gamma})$, for $m = m_{\mathcal{A}}(\gamma)$,
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} g - \int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big| \le \frac{128\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_3} + 10\cdot c\cdot K^{3.5}\cdot E\cdot m^3\cdot(r^K+s^d). \qquad (13)$$
Denote the last bound by $\epsilon$. It holds that
$$\mathrm{Err}_{D,l}(g) = (1-\lambda_2-\lambda_3)\,\mathbb{E}_{\mu^1_e}\, l(yg(x)) + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)) + \lambda_3\,\mathbb{E}_{\mu^3_e}\, l(yg(x)). \qquad (14)$$


Now, denote $\delta = \int_{\{x:\langle x,e\rangle=-\gamma\}} g$. It holds that
$$\mathbb{E}_{\mu^1_e}\, l(yg(x)) = \theta\int_{\{x:\langle x,e\rangle=\gamma\}} l(g(x)) + (1-\theta)\int_{\{x:\langle x,e\rangle=-\gamma\}} l(-g(x))$$
$$\ge \theta\cdot l\Big(\int_{\{x:\langle x,e\rangle=\gamma\}} g\Big) + (1-\theta)\cdot l\Big(-\int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big) \ge \theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta) - L\epsilon.$$
Thus,
$$\mathrm{Err}_{D,l}(g) \ge (1-\lambda_2-\lambda_3)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) - L\epsilon + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)). \qquad (15)$$
However, by considering the constant solution $\delta$, it follows that
$$\mathrm{Err}_{D,l}(g) \le (1-\lambda_2-\lambda_3)\big(\theta l(\delta) + (1-\theta)l(-\delta)\big) + \lambda_2\cdot l(\delta) + \lambda_3\cdot\tfrac{1}{2}(l(\delta)+l(-\delta)) + \sqrt{\gamma}$$
$$\le (1-\lambda_2-\lambda_3)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) + \lambda_2\cdot l(\delta) + \lambda_3\cdot l(-|\delta|) + \sqrt{\gamma}. \qquad (16)$$
Thus,
$$\mathrm{Err}_{\mu^2_e,l}(g) \le \frac{L\epsilon}{\lambda_2} + l(\delta) + \frac{\lambda_3}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2} = \frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\cdot c\cdot L\cdot K^{3.5}}{\lambda_2}\cdot E\cdot m^3\cdot(r^K+s^d) + l(\delta) + \frac{\lambda_3}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2}.$$
Now, relying on the assumption that $\gamma\cdot\log^8(m) = o(1)$, it is possible to choose $\lambda_2 = \Theta(\sqrt{\gamma}K^4) = \Theta(\sqrt{\gamma}\log^4 m)$, $\lambda_3 = \sqrt{\gamma}$, $K = \Theta(\log(m/\gamma))$, and $d = \Theta(\log(m/\gamma))$ such that the bound in Equation (13), $\frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\cdot c\cdot K^{3.5}}{\lambda_2}\cdot E\cdot m^3\cdot(r^K+s^d)$, $\lambda_2$, $\lambda_3$ and $\frac{\lambda_3}{\lambda_2}$ are all $o(1)$.

Since the bound in Equation (13) is $o(1)$, it follows, as in the proof of Theorem 2.6, that $l(\delta) \le l(\frac{\alpha}{2})$ and consequently $\delta > 0$. It follows that, w.p. $\ge \frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)$ over $\mu^2_e$, $l(g(x)) < l(0)$; and $l(g(x)) < l(0)$ implies $g(x) > 0$. Since the marginal distributions of $\mu^1_e$ and $\mu^2_e$ are the same, it follows that, if $(x,y)$ is chosen according to $D$, then w.p. $\ge \big(\frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)\big)\cdot(1-\lambda_2-\lambda_3)\cdot(1-\theta) = \Omega(1)$, $yg(x) < 0$. Thus, $\mathrm{Err}_{D,0-1}(g) = \Omega(1)$ while $\mathrm{Err}_\gamma(D) \le \lambda_2 + \lambda_3 = O(\sqrt{\gamma}\,\mathrm{poly}(\log m))$. ✷
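The hard distributions above are pullbacks: a margin value $t$ is drawn from a measure on $[-1,1]$ and then a uniform point on the slice $\{x : \langle x,e\rangle = t\}$. A minimal sketch of such a slice sampler, assuming the standard decomposition $x = t\,e + \sqrt{1-t^2}\,u$ with $u$ uniform on the sphere of the orthogonal complement of $e$ (dimension and margin below are illustrative choices):

```python
import numpy as np

def sample_slice(e: np.ndarray, t: float, rng: np.random.Generator) -> np.ndarray:
    """Uniform point on {x in S^{d-1} : <x, e> = t}: write x = t*e + sqrt(1-t^2)*u
    with u uniform on the unit sphere of the subspace orthogonal to e."""
    d = e.shape[0]
    u = rng.standard_normal(d)
    u -= np.dot(u, e) * e            # project onto the orthogonal complement of e
    u /= np.linalg.norm(u)           # normalize; u is now uniform on that sphere
    return t * e + np.sqrt(1.0 - t * t) * u

rng = np.random.default_rng(0)
d, gamma = 20, 0.05
e = np.zeros(d); e[0] = 1.0

x = sample_slice(e, gamma, rng)
assert abs(np.dot(x, e) - gamma) < 1e-12      # lands on the prescribed slice
assert abs(np.linalg.norm(x) - 1.0) < 1e-12   # stays on the unit sphere
```

Drawing $t$ from $w$ (the arcsine-type density supported on $[-\frac{1}{8},\frac{1}{8}]$) or from the atoms at $\pm\gamma$ then reproduces the measures $\mu^1_e, \mu^2_e, \mu^3_e$ used in the proofs.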


5.5.3 The integrality gap – Theorem 3.3

Our first step is a reduction to the hinge loss. Let $a = \partial_+ l(0)$. Define
$$l^*(x) = \begin{cases} ax + 1 & x \le \frac{1}{-a} \\ 0 & \text{otherwise} \end{cases}$$
It is not hard to see that $l^*$ is a convex surrogate satisfying $\forall x,\ l^*(x) \le l(x)$ and $\partial_+ l^*(0) = \partial_+ l(0)$. Thus, if we substitute $l^*$ for $l$, we only decrease the integrality gap, hence we can assume that $l = l^*$. Now, we note that if we consider program (11) with $l = l^*$, its integrality gap coincides with what we get by replacing $C$ with $|a|\cdot C$ and $l^*$ with the hinge loss. To see that, note that for every $f \in H_k$, $b \in \mathbb{R}$, $\mathrm{Err}_{D,l^*}(f+b) = \mathrm{Err}_{D,\mathrm{hinge}}(|a|\cdot f + |a|\cdot b)$; thus, minimizing $\mathrm{Err}_{D,l^*}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le C$ is equivalent to minimizing $\mathrm{Err}_{D,\mathrm{hinge}}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le |a|\cdot C$. Thus, it is enough to prove the theorem for $l = l_{\mathrm{hinge}}$.

Next, we show that we can assume that the embedding is symmetric (i.e., corresponds to a symmetric kernel). As the integrality gap is at least as large as the approximation ratio, using Theorem 3.2 this will complete our argument. (The reduction to the hinge loss yields bounds with universal constants in the asymptotic terms.)

Let $\gamma > 0$ and let $D$ be a distribution on $S^{d-1}\times\{\pm 1\}$. It is enough to find a (possibly different) distribution $D_1$ with the same $\gamma$-margin error as $D$, for which the optimum of program (11) (with $l = l_{\mathrm{hinge}}$) is not smaller than the optimum of the program
$$\min\ \mathrm{Err}_{D,\mathrm{hinge}}(f+b)\quad\text{s.t.}\quad f \in H_{k_s},\ b \in \mathbb{R},\ \|f\|_{H_{k_s}} \le C. \qquad (17)$$
Denote the optimal value of program (17) by $\alpha$ and assume, toward a contradiction, that whenever $\mathrm{Err}_\gamma(D_1) = \mathrm{Err}_\gamma(D)$, the optimum of program (11) is strictly less than $\alpha$. For every $A \in O(d)$, let $D_A$ be the distribution of the r.v.
$(Ax, y) \in S^{d-1}\times\{\pm 1\}$, where $(x,y) \sim D$. Since clearly $\mathrm{Err}_\gamma(D_A) = \mathrm{Err}_\gamma(D)$, there exist $f_A \in H_k$ and $b_A \in \mathbb{R}$ such that $\|f_A\|_{H_k} \le C$ and $\mathrm{Err}_{D_A,\mathrm{hinge}}(g_A) < \alpha$, where $g_A := f_A + b_A$. Define $f \in H_{k_s}$ by $f(x) = \int_{O(d)} f_A(Ax)\,dA$, let $b = \int_{O(d)} b_A\,dA$ and $g = f + b$. By Theorem 5.18, $\|f\|_{H_{k_s}} \le C$. Finally, for $l = l_{\mathrm{hinge}}$,
$$\mathrm{Err}_{D,\mathrm{hinge}}(g) = \mathbb{E}_{(x,y)\sim D}\, l(yg(x)) = \mathbb{E}_{(x,y)\sim D}\, l\big(y\,\mathbb{E}_{A\sim O(d)}\, g_A(Ax)\big) \le \mathbb{E}_{(x,y)\sim D}\,\mathbb{E}_{A\sim O(d)}\, l(y g_A(Ax))$$
$$= \mathbb{E}_{A\sim O(d)}\,\mathbb{E}_{(x,y)\sim D}\, l(y g_A(Ax)) = \mathbb{E}_{A\sim O(d)}\,\mathbb{E}_{(x,y)\sim D_A}\, l(y g_A(x)) < \alpha,$$
contrary to the assumption that $\alpha$ is the optimum of program (17). ✷

5.5.4 Finite dimension – Theorems 2.7 and 3.4

Let $V \subseteq C(S^{d-1})$ be the linear space $\{\Lambda_{v,b}\circ\psi : v \in \mathbb{R}^m,\ b \in \mathbb{R}\}$ and denote $\bar{W} = \{\Lambda_{v,b}\circ\psi : v \in W,\ b \in \mathbb{R}\}$. We note that $\dim(V) \le m+1$. Instead of program (4) we consider the


equivalent formulation
$$\min\ \mathrm{Err}_{D,l}(f)\quad\text{s.t.}\quad f \in \bar{W}. \qquad (18)$$
The following lemma is very similar to Lemma 5.12, but with a better dependency on $m$ ($m^{1.5}$ instead of $m^2$).

Lemma 5.25 Let $l$ be a convex surrogate and let $V \subset C(S^{d-1})$ be an $m$-dimensional vector space. There exists a continuous kernel $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ with $\sup_{x\in S^{d-1}} k(x,x) \le 1$ such that $H_k = V$ as a vector space, and there exists a probability measure $\mu_N$ such that
$$\forall f \in V,\quad \|f\|_{H_k} \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\,\mathrm{Err}_{\mu_N,l}(f).$$

Proof Let $\psi : S^{d-1} \to V^*$ be the evaluation operator: it maps each $x \in S^{d-1}$ to the linear functional $f \in V \mapsto f(x)$. We claim that:
1. $\psi$ is continuous,
2. $\mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1})) = V^*$,
3. $V = \{v^{**}\circ\psi : v^{**} \in V^{**}\}$.

Proof of 1: We need to show that $\psi(x_n) \to \psi(x)$ whenever $x_n \to x$. Since $V^*$ is finite dimensional, it suffices to show that $\psi(x_n)(f) \to \psi(x)(f)$ for every $f \in V$, which follows from the continuity of $f$.

Proof of 2: Note that $0 \in U := \mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$, so $U$ is a linear space. Now, define $T : U^* \to V$ via $T(u^*) = u^*\circ\psi$. We claim that $T$ is onto, whence $\dim(U) = \dim(U^*) = \dim(V) = \dim(V^*)$, so that $U = V^*$. Indeed, for $f \in V$, let $u^*_f \in U^*$ be the functional $u^*_f(u) = u(f)$. Now, $T(u^*_f)(x) = u^*_f(\psi(x)) = \psi(x)(f) = f(x)$, thus $T(u^*_f) = f$.

Proof of 3: From $U = V^*$ it follows that $U^* = V^{**}$, so that the mapping $T : V^{**} \to V$ is onto, showing that $V = \{v^{**}\circ\psi : v^{**} \in V^{**}\}$.

Let us apply John's Lemma to $K = \mathrm{conv}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$. It yields an inner product on $V^*$ with $K$ contained in the unit ball and containing the ball around $0$ of radius $\frac{1}{\sqrt{m}}$. Let $k$ be the kernel $k(x,y) = \langle\psi(x),\psi(y)\rangle$. Since $\psi$ is continuous, $k$ is continuous as well. By Theorem 5.1.1 and since $T$ is onto, it follows that, as a vector space, $V = H_k$.
Since $K$ is contained in the unit ball, it follows that $\sup_{x\in S^{d-1}} k(x,x) \le 1$. It remains to prove the existence of the measure $\mu_N$.

Let $e_1, \ldots, e_m \in V^*$ be an orthonormal basis. For every $i \in [m]$, choose $(x_i^1, y_i), \ldots, (x_i^{m+1}, y_i) \in S^{d-1}\times\{\pm 1\}$ and $\lambda_i^1, \ldots, \lambda_i^{m+1} \ge 0$ such that $\sum_{j=1}^{m+1}\lambda_i^j = 1$ and $\frac{1}{\sqrt{m}}e_i = \sum_{j=1}^{m+1}\lambda_i^j y_i\psi(x_i^j)$. Define $\mu_N(x_i^j, 1) = \mu_N(x_i^j, -1) = \frac{\lambda_i^j}{2m}$.

Let $f \in V$. By Theorem 5.1.1 there exists $v \in V^*$ such that $f = \Lambda_{v,0}\circ\psi$ and $\|f\|_{H_k} = \|v\|_{V^*}$. It follows that, for $a = \partial_+ l(0)$,


$$\mathrm{Err}_{\mu_N,l}(f) = \sum_{i=1}^{m}\sum_{j=1}^{m+1}\frac{\lambda_i^j}{2m}\big[l(y_i f(x_i^j)) + l(-y_i f(x_i^j))\big]$$
$$\ge \frac{1}{2m}\sum_{i=1}^{m}\Big[l\Big(\sum_{j=1}^{m+1}\lambda_i^j y_i f(x_i^j)\Big) + l\Big(-\sum_{j=1}^{m+1}\lambda_i^j y_i f(x_i^j)\Big)\Big]$$
$$= \frac{1}{2m}\sum_{i=1}^{m}\Big[l\Big(\sum_{j=1}^{m+1}\lambda_i^j y_i\langle v,\psi(x_i^j)\rangle\Big) + l\Big(-\sum_{j=1}^{m+1}\lambda_i^j y_i\langle v,\psi(x_i^j)\rangle\Big)\Big]$$
$$= \frac{1}{2m}\sum_{i=1}^{m}\Big[l\Big(\Big\langle v,\frac{e_i}{\sqrt{m}}\Big\rangle\Big) + l\Big(-\Big\langle v,\frac{e_i}{\sqrt{m}}\Big\rangle\Big)\Big] \ge \frac{1}{2m}\sum_{i=1}^{m} l\Big(-\frac{|\langle v,e_i\rangle|}{\sqrt{m}}\Big)$$
$$\ge \frac{|a|}{2m^{1.5}}\sum_{i=1}^{m}|\langle v,e_i\rangle| \ge \frac{|a|}{2m^{1.5}}\|v\|_{V^*} = \frac{|a|}{2m^{1.5}}\|f\|_{H_k}. \qquad ✷$$

Proof (of Theorem 2.7) Let $L$ be the Lipschitz constant of $l$. Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. First, define probability measures $\mu^1, \mu^2, \mu^3$ over $[-1,1]\times\{\pm 1\}$ as follows:
$$\mu^1((\gamma,1)) = \theta,\quad \mu^1((-\gamma,-1)) = 1-\theta,\quad \mu^2((-\gamma,1)) = 1.$$
The measure $\mu^3$ is the product of the uniform distribution on $\{\pm 1\}$ and the measure over $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
Let $k, \mu_N$ be the kernel and the measure from Lemma 5.25, and let $e \in S^{d-1}$ be the vector from Lemma 5.22. We define the distribution $D$ corresponding to the measure
$$\mu = (1-\lambda_2-\lambda_3-\lambda_N)\mu^1_e + \lambda_2\mu^2_e + \lambda_3\mu^3_e + \lambda_N\mu_N.$$
By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_\gamma(D) \le \lambda_3 + \lambda_2 + \lambda_N$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (18) with an additive error $\le \sqrt{\gamma}$.


Denote $\|g\|_{H_k} = C$. By Lemma 5.25, it holds that
$$C \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\,\mathrm{Err}_{\mu_N,l}(g) \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\cdot\frac{\mathrm{Err}_{\mu,l}(g)}{\lambda_N} \le \frac{2m^{1.5}\,l(0)}{|\partial_+ l(0)|\lambda_N}.$$
As in the proof of Theorem 2.6, it holds that
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} g - \int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big| \le \frac{128\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_3} + 10\cdot K^{3.5}\cdot E\cdot C\cdot(r^K+s^d). \qquad (19)$$
Denote the last bound by $\epsilon$. It holds that
$$\mathrm{Err}_{D,l}(g) = (1-\lambda_2-\lambda_3-\lambda_N)\,\mathbb{E}_{\mu^1_e}\, l(yg(x)) + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)) + \lambda_3\,\mathbb{E}_{\mu^3_e}\, l(yg(x)) + \lambda_N\,\mathbb{E}_{\mu_N}\, l(yg(x)). \qquad (20)$$
Now, denote $\delta = \int_{\{x:\langle x,e\rangle=-\gamma\}} g$. It holds that
$$\mathbb{E}_{\mu^1_e}\, l(yg(x)) = \theta\int_{\{x:\langle x,e\rangle=\gamma\}} l(g(x)) + (1-\theta)\int_{\{x:\langle x,e\rangle=-\gamma\}} l(-g(x))$$
$$\ge \theta\cdot l\Big(\int_{\{x:\langle x,e\rangle=\gamma\}} g\Big) + (1-\theta)\cdot l\Big(-\int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big) \ge \theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta) - L\epsilon.$$
Thus,
$$\mathrm{Err}_{D,l}(g) \ge (1-\lambda_2-\lambda_3-\lambda_N)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) - L\epsilon + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)). \qquad (21)$$
However, by considering the constant solution $\delta$, it follows that
$$\mathrm{Err}_{D,l}(g) \le (1-\lambda_2-\lambda_3-\lambda_N)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) + \lambda_2\cdot l(\delta) + (\lambda_3+\lambda_N)\tfrac{1}{2}(l(\delta)+l(-\delta)) + \sqrt{\gamma}$$
$$\le (1-\lambda_2-\lambda_3-\lambda_N)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) + \lambda_2\cdot l(\delta) + (\lambda_3+\lambda_N)\cdot l(-|\delta|) + \sqrt{\gamma}.$$
Thus,
$$\mathrm{Err}_{\mu^2_e,l}(g) \le \frac{L\epsilon}{\lambda_2} + l(\delta) + \frac{\lambda_3+\lambda_N}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2} \qquad (22)$$
$$\le \frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\,L\,K^{3.5}}{\lambda_2}\cdot E\cdot C\cdot(r^K+s^d) + l(\delta) + \frac{\lambda_3+\lambda_N}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2}.$$
Now, relying on the assumption that $\gamma\cdot\log^8(C) = o(1)$, it is possible to choose $\lambda_2 = \Theta(\sqrt{\gamma}K^4) = \Theta(\sqrt{\gamma}\log^4 C)$, $\lambda_3 = \sqrt{\gamma}$, $K = \Theta(\log(C/\gamma))$, $\lambda_N = \gamma$ and $d = \Theta(\log(C/\gamma))$ such that the bound in Equation (19), $\frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\,K^{3.5}}{\lambda_2}\cdot E\cdot C\cdot(r^K+s^d)$, $\lambda_2$, $\lambda_3$, $\lambda_N$ and $\frac{\lambda_3+\lambda_N+\sqrt{\gamma}}{\lambda_2}$ are all $o(1)$.


Since the bound in Equation (19) is $o(1)$, it follows, as in the proof of Theorem 2.6, that $l(\delta) \le l\left(\frac{\alpha}{2}\right)$ and consequently, with probability $\ge \frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)$ over $x$ drawn from the marginal of $\mu_2^e$, $l(g(x)) < l(0)$, and hence $g(x) > 0$. Since the marginal distributions of $\mu_1^e$ and $\mu_2^e$ are the same, it follows that, if $(x,y)$ are chosen according to $D$, then with probability $> \left(\frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)\right)\cdot(1-\lambda_2-\lambda_3-\lambda_N)\cdot(1-\theta) = \Omega(1)$, $yg(x) < 0$. Thus, $\mathrm{Err}_{D,0\text{-}1}(g) = \Omega(1)$ while $\mathrm{Err}_\gamma(D) \le \lambda_2+\lambda_3+\lambda_N = O\left(\sqrt{\gamma}\,\mathrm{poly}(\log(C))\right) = O\left(\sqrt{\gamma}\,\mathrm{poly}(\log(m/\gamma))\right)$. ✷

Proof (of Theorem 3.4) As in the proof of Theorem 3.3, we can assume w.l.o.g. that $l = l_{\mathrm{hinge}}$. Let $k, \mu_N$ be the kernel and the measure from Lemma 5.25. Let $C = 2m^{1.5}/\gamma$. By (the proof of) Theorem 3.3, there exists a probability measure $\bar\mu$ over $S^{d-1}\times\{\pm 1\}$ such that for every $f \in H_k$ with $\|f\|_{H_k} \le C$ it holds that $\mathrm{Err}_{\bar\mu,l}(f) = \Omega(1)$ but $\mathrm{Err}_\gamma(\bar\mu) = O(\gamma\cdot\mathrm{poly}(\log(C)))$. Consider the distribution $\mu = (1-\gamma)\bar\mu + \gamma\mu_N$. It still holds that $\mathrm{Err}_\gamma(\mu) = O(\gamma\cdot\mathrm{poly}(\log(C))) = O(\gamma\cdot\mathrm{poly}(\log(m/\gamma)))$. Let $f$ be an optimal solution of program (18). We have that $1 \ge \mathrm{Err}_{\mu,l}(f) \ge \gamma\cdot\mathrm{Err}_{\mu_N,l}(f)$. By Lemma 5.25 (using $|\partial^+ l_{\mathrm{hinge}}(0)| = 1$), $\|f\|_{H_k} \le 2m^{1.5}\cdot\mathrm{Err}_{\mu_N,l}(f) \le 2m^{1.5}/\gamma = C$. Thus, $\mathrm{Err}_{\mu,l}(f) \ge (1-\gamma)\,\mathrm{Err}_{\bar\mu,l}(f) = \Omega(1)$. ✷

6 Choosing a surrogate according to the margin

The purpose of this section is to demonstrate the subtleties relating to the possibility of choosing a convex surrogate $l$ according to the margin $\gamma$. Let $k : B\times B \to \mathbb{R}$ be the kernel
$$k(x,y) = \frac{1}{1-\frac12\langle x,y\rangle_H}$$
and let $\psi : B \to H_1$ be a corresponding embedding (i.e., $k(x,y) = \langle\psi(x),\psi(y)\rangle_{H_1}$).
In (Shalev-Shwartz et al., 2011) it has been shown that the solution $f, b$ to Program (2), with $C = C(\gamma) = \mathrm{poly}(\exp(1/\gamma\cdot\log(1/\gamma)))$ and the embedding $\psi$, satisfies
$$\mathrm{Err}_{\mathrm{hinge}}(f+b) \le \mathrm{Err}_\gamma(D) + \gamma\ .$$


Consequently, every approximated solution to the Program with an additive error of at most $\gamma$ will have a 0-1 loss bounded by $\mathrm{Err}_\gamma(D) + 2\gamma$.

For every $\gamma$, define a 1-Lipschitz convex surrogate by
$$l_\gamma(x) = \begin{cases} 1-x & x \le 1/C(\gamma)\\ 1-1/C(\gamma) & x \ge 1/C(\gamma)\end{cases}$$

Claim 1 A function $g : B \to \mathbb{R}$ is a solution to Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$, if and only if $C(\gamma)\cdot g$ is a solution to Program (2) with $C = C(\gamma)$ and the embedding $\psi$.

We postpone the proof to the end of the section. We note that Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$ has a complexity of 1, according to our conventions. Moreover, by Claim 1, the optimal solution to it has a 0-1 error of at most $\mathrm{Err}_\gamma(D) + \gamma$. Thus, if $A$ is an algorithm that is only obligated to return an approximated solution to Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$, we cannot lower bound its approximation ratio. In particular, our theorems regarding the approximation ratio are no longer true, as currently stated, if the algorithms are allowed to choose the surrogate according to $\gamma$. One might be tempted to think that by the above construction (i.e., taking $\psi$ as our embedding, choosing $C = 1$ and $l = l_\gamma$, and approximating the program upon a sample of size $\mathrm{poly}(1/\gamma)$), we have actually given a 1-approximation algorithm. The crux of the matter is that algorithms that approximate the program according to a finite sample of size $\mathrm{poly}(1/\gamma)$ are only guaranteed to find a solution with an additive error of $\mathrm{poly}(\gamma)$. For the loss $l_\gamma$, such an additive error is meaningless: since for every function $f$, $\mathrm{Err}_{D,l_\gamma}(f) \ge 1 - 1/C(\gamma)$, the 0 solution has an additive error of $\mathrm{poly}(\gamma)$.
Therefore, we cannot argue that the solution returned by the algorithm will have a small 0-1 error. Indeed, we anticipate that the algorithm we have described will suffer from serious over-fitting.

To summarize, we note that the lower bounds we have proved rely on the fact that the optimal solutions of the programs we considered are very bad. For the algorithm we sketched above, the optimal solution is very good. However, guarantees on approximated solutions obtained from a polynomial sample are meaningless. We conclude that lower bounds for such algorithms will have to involve over-fitting arguments, which are beyond the scope of this paper.

Proof (of Claim 1) Define
$$l^*_\gamma(x) = \begin{cases} 1-C(\gamma)\,x & x \le \frac{1}{C(\gamma)}\\ 0 & x \ge \frac{1}{C(\gamma)}\end{cases}$$
Since $l^*_\gamma(x) = C(\gamma)\cdot\left(l_\gamma(x) - \left(1-\frac{1}{C(\gamma)}\right)\right)$, it follows that the solutions to Program (5) with $l = l^*_\gamma$, $C = 1$ and $\psi$ coincide with the solutions with $l = l_\gamma$, $C = 1$ and $\psi$. Now, we note that, for every function $f : B \to \mathbb{R}$,
$$\mathrm{Err}_{D,l^*_\gamma}(f) = \mathrm{Err}_{D,\mathrm{hinge}}(C(\gamma)\cdot f)$$
Thus, $w, b$ minimizes $\mathrm{Err}_{D,l^*_\gamma}(\Lambda_{w,b}\circ\psi)$ under the restriction that $\|w\| \le 1$ if and only if $C(\gamma)\cdot w, C(\gamma)\cdot b$ minimizes $\mathrm{Err}_{D,\mathrm{hinge}}(\Lambda_{w,b}\circ\psi)$ under the restriction that $\|w\| \le C(\gamma)$. ✷
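The affine relation at the heart of this proof can be checked pointwise: combining the two displayed identities gives $l_\gamma(x) = \frac{1}{C(\gamma)}\,l_{\mathrm{hinge}}(C(\gamma)\,x) + \left(1-\frac{1}{C(\gamma)}\right)$. A minimal numerical sketch (the value $C = 50$ stands in for $C(\gamma)$ and is an illustrative assumption):

```python
# Numerical sketch of the mechanism behind Claim 1: l_gamma is, up to a
# positive scale and an additive constant, the hinge loss of the rescaled
# argument C*x, so the two programs share minimizers. C = 50 is illustrative.

def hinge(t):
    return max(0.0, 1.0 - t)

def l_gamma(x, C):
    """The 1-Lipschitz convex surrogate defined in this section."""
    return 1.0 - x if x <= 1.0 / C else 1.0 - 1.0 / C

C = 50.0  # stands in for C(gamma)

# Check l_gamma(x) = hinge(C*x)/C + (1 - 1/C) on a grid of points.
for i in range(-200, 201):
    x = i / 100.0
    assert abs(l_gamma(x, C) - (hinge(C * x) / C + 1.0 - 1.0 / C)) < 1e-12
```

In particular $l^*_\gamma(x) = \mathrm{hinge}(C(\gamma)\,x)$ pointwise, which is exactly the identity $\mathrm{Err}_{D,l^*_\gamma}(f) = \mathrm{Err}_{D,\mathrm{hinge}}(C(\gamma)\cdot f)$ used above.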


Acknowledgements: Amit Daniely is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. Nati Linial is supported by grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation grant number 590-10.

References

Martin Anthony and Peter Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

K. Atkinson and W. Han. Spherical Harmonics and Approximations on the Unit Sphere: An Introduction, volume 2044. Springer, 2012.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

S. Ben-David, D. Loker, N. Srebro, and K. Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, 2012.

Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. Limitations of learning via embeddings in euclidean half spaces. The Journal of Machine Learning Research, 3:441–461, 2003.

A. Birnbaum and S. Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. In NIPS, 2012.

E. Blais, R. O'Donnell, and K. Wimmer. Polynomial regression under arbitrary product distributions. In COLT, 2008.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. arXiv preprint arXiv:1311.2272, 2013.

V. Feldman, P. Gopalan, S. Khot, and A. K. Ponnuswami. New results for learning noisy parities and halfspaces. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.

G. B. Folland. A course in abstract harmonic analysis. CRC, 1994.

V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proceedings of the 47th Foundations of Computer Science (FOCS), 2006.

A. Kalai, A. R. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Foundations of Computer Science (FOCS), 2005.

A. R. Klivans and R. Servedio. Learning DNF in time $2^{\tilde{O}(n^{1/3})}$. In STOC, pages 258–265. ACM, 2001.


Kosaku Yosida. Functional Analysis. Springer-Verlag, Heidelberg, 1963.

Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. In STOC, pages 455–464, May 1991.

Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. In FOCS, pages 574–579, October 1989.

P. M. Long and R. A. Servedio. Learning large-margin halfspaces with more malicious noise. In NIPS, 2011.

J. Matousek. Lectures on discrete geometry, volume 212. Springer, 2002.

V. D. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces: Isoperimetric Inequalities in Riemannian Manifolds, volume 1200. Springer, 2002.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing (MIT Press, 1988).)

S. Saitoh. Theory of reproducing kernels and its applications. Longman Scientific & Technical, England, 1988.

I. J. Schoenberg. Positive definite functions on spheres. Duke Math. J., 1942.

B. Schölkopf, C. Burges, and A. Smola, editors. Advances in Kernel Methods – Support Vector Learning. MIT Press, 1998.

S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40:1623–1646, 2011.

I. Steinwart and A. Christmann. Support vector machines. Springer, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267–288, 1996.

V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

Manfred K. Warmuth and S. V. N. Vishwanathan. Leaving the span. In Learning Theory, pages 366–381. Springer, 2005.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.
