The complexity of learning halfspaces using generalized linear methods

Amit Daniely*   Nati Linial†   Shai Shalev-Shwartz‡

May 10, 2014

Abstract

Many popular learning algorithms (e.g. regression, Fourier-transform based algorithms, kernel SVM and kernel ridge regression) operate by reducing the problem to a convex optimization problem over a set of functions. These methods offer the currently best approach to several central problems such as learning halfspaces and learning DNF's. In addition they are widely used in numerous application domains. Despite their importance, there are still very few proof techniques to show limits on the power of these algorithms.

We study the performance of this approach in the problem of (agnostically and improperly) learning halfspaces with margin γ. Let D be a distribution over labeled examples. The γ-margin error of a hyperplane h is the probability of an example to fall on the wrong side of h or at a distance ≤ γ from it. The γ-margin error of the best h is denoted Err_γ(D). An α(γ)-approximation algorithm receives γ, ɛ as input and, using i.i.d. samples of D, outputs a classifier with error rate ≤ α(γ)·Err_γ(D) + ɛ. Such an algorithm is efficient if it uses poly(1/γ, 1/ɛ) samples and runs in time polynomial in the sample size.

The best approximation ratio achievable by an efficient algorithm is O(1/(γ·√log(1/γ))) and is achieved using an algorithm from the above class. Our main result shows that the approximation ratio of every efficient algorithm from this family must be ≥ Ω(1/(γ·poly(log(1/γ)))), essentially matching the best known upper bound.

* Department of Mathematics, Hebrew University, Jerusalem 91904, Israel. amit.daniely@mail.huji.ac.il
† School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel. nati@cs.huji.ac.il
‡ School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel. shais@cs.huji.ac.il


1 Introduction

Let X be some set and let D be a distribution on X × {±1}. The basic learning task is, based on an i.i.d. sample, to find a function f : X → {±1} whose error, Err_{D,0-1}(f) := Pr_{(X,Y)∼D}(f(X) ≠ Y), is as small as possible. A learning problem is defined by specifying a class H of competitors (e.g. H is a class of functions from X to {±1}). Given such a class, the corresponding learning problem is to find f : X → {±1} whose error is small relative to the error of the best competitor in H. Ignoring computational aspects, the celebrated PAC/VC theory essentially tells us that the best algorithm for every learning problem is an Empirical Risk Minimizer (ERM) – namely, one that returns the competitor in H of least empirical error. Unfortunately, for many learning problems, implementing the ERM paradigm is NP-hard and even NP-hard to approximate.

We consider here a very popular family of algorithms to cope with this hardness, which we collectively call "the generalized linear family". It proceeds as follows: fix some set W ⊂ R^X and return a function of the form f(x) = sign(g(x) − b), where the pair (g, b) ∈ W × R empirically minimizes some convex loss. In order for such a method to be useful, the set W should be "small" (to prevent overfitting) and "nicely behaved" (to make the optimization problem computationally feasible). The two main choices for such a set W are:

• A (usually convex) subset of a finite dimensional space of functions (e.g. if X ⊂ R^d then W can be the space of all polynomials of degree ≤ 17 with coefficients bounded by d³).
We refer to such algorithms as finite dimensional learners.

• A ball in a reproducing kernel Hilbert space. We refer to such algorithms as kernel based learners.

The generalized linear family has been applied extensively to tackle learning problems (e.g. Linial et al. (1989), Kushilevitz and Mansour (1991), Klivans and Servedio (2001), Kalai et al. (2005), Blais et al. (2008), Shalev-Shwartz et al. (2011) – see Section 1.4). Their statistical characteristics have been thoroughly studied as well (Vapnik, 1998, Anthony and Bartlett, 1999, Schölkopf et al., 1998, Cristianini and Shawe-Taylor, 2000, Steinwart and Christmann, 2008). Moreover, the significance of this approach is by no means only theoretical – algorithms from this family are widely used by practitioners.

In spite of all that, very few lower bounds are known on the performance of this family of algorithms (i.e., theorems of the form "For every kernel-based/finite-dimensional algorithm for the learning problem X, there exists a distribution under which the algorithm performs poorly"). Such a lower bound must quantify over all possible choices of "small and nicely behaved" sets W. In order to address this difficulty we employ a variety of mathematical methods, some of which are new in this domain.
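As a toy illustration of the generalized linear recipe just described (fix a set W, minimize a convex loss over it, threshold the result), the following sketch uses least squares over low-degree univariate polynomials on a synthetic one-dimensional dataset. All concrete choices here (the data, the degree, the squared loss) are illustrative and mine, not the paper's:

```python
import numpy as np

# Toy data: x in [-1, 1], labels in {+1, -1} given by a noisy threshold at 0.3
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.sign(X - 0.3 + 0.05 * rng.standard_normal(200))
y[y == 0] = 1

# W = polynomials of degree <= 3, represented by coefficient vectors.
# Minimize the (convex) squared loss over (g, b) via least squares; the
# bias b is absorbed into the constant coefficient of g.
degree = 3
Phi = np.vander(X, degree + 1)          # monomial features: x^3, x^2, x, 1
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The returned classifier thresholds the learned polynomial: f(x) = sign(g(x))
def f(x):
    return np.sign(np.vander(np.atleast_1d(x), degree + 1) @ coef)

print("training accuracy:", np.mean(f(X) == y))
```

The same template covers the other members of the family: swap the monomial features for another function space W and the squared loss for another convex loss.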
In particular, we make intensive use of harmonic analysis on the sphere, reproducing kernel Hilbert spaces, orthogonal polynomials, John's Lemma, as well as a new symmetrization technique.

We also prove a new result, which is of independent interest. A fundamental fact that stands behind the theoretical analysis of kernel based learners is that for every subset X of a unit ball in a Hilbert space H, it is possible to learn affine functionals of norm ≤ C over X w.r.t. the hinge loss using C²/ɛ² examples. We show a (weak) inverse of this fact. Namely, we show that for every X, if affine functionals can be learnt using m examples, then there exists an equivalent inner product on H under which X is contained in a unit ball, and the affine functional returned by any learning algorithm must have norm ≤ O(m³).

Our lower bounds are established for the basic problem of learning large margin halfspaces (to be defined precisely in Section 1.1). The best known efficient (in 1/γ) algorithm for this problem (Birnbaum and Shalev-Shwartz, 2012) is a kernel based learner that achieves an approximation ratio of (1/γ)/√log(1/γ). (We note, however, that this approximation ratio was first obtained by (Long and Servedio, 2011) using a "boosting based" algorithm that does not belong to the generalized linear family.) The best known exact algorithm (that is, α(γ) = 1) is also a kernel based learner and runs in time exp(Θ((1/γ)·log(1/γ))) (Shalev-Shwartz et al., 2011).

Our main results show that efficient kernel based learners cannot achieve a better approximation ratio than Ω(1/(γ·poly(log(1/γ)))), essentially matching the best known upper bound. Also, we show that efficient finite dimensional learners cannot achieve a better approximation ratio than Ω(1/(√γ·poly(log(1/γ)))). In addition, we show that the running time of kernel based learners with approximation ratio (1/γ)^{1−ɛ}, as well as of finite dimensional learners with approximation ratio (1/γ)^{1/2−ɛ}, must be exponential in 1/γ.

Next, we formulate the problem of learning large margin halfspaces and survey some relevant background to motivate our definitions of kernel-based and finite dimensional learners given in Section 2.

1.1 Learning large margin halfspaces

We view R^d as a subspace of the Hilbert space H = ℓ₂ corresponding to the first d coordinates. Since the notion of margin is defined relative to a suitable scaling of the examples, we consider throughout only distributions that are supported in the unit ball, B, of H. Also, all the distributions we consider are supported in R^d for some d < ∞. We denote by S^{d−1} the unit sphere of R^d.

It will be convenient to use loss functions. A loss function is any function l : R → [0, ∞). Given a loss function l and f : B → R, we denote Err_{D,l}(f) = E_{(x,y)∼D}(l(yf(x))). Two loss functions of particular relevance are the 0-1 loss function, l_{0-1}(x) = 1 if x ≤ 0 and 0 if x > 0, and the γ-margin loss function, l_γ(x) = 1 if x ≤ γ and 0 if x > γ. We use shorthands such as Err_{D,0-1} instead of Err_{D,l_{0-1}}.

A halfspace, parameterized by w ∈ B and b ∈ R, is the classifier f(x) = sign(Λ_{w,b}(x)), where Λ_{w,b}(x) := 〈w, x〉 + b.
Given a distribution D over B × {±1}, the error rate of Λ_{w,b} is

    Err_{D,0-1}(Λ_{w,b}) = Pr_{(x,y)∼D}(sign(Λ_{w,b}(x)) ≠ y) = Pr_{(x,y)∼D}(yΛ_{w,b}(x) ≤ 0).

The γ-margin error rate of Λ_{w,b} is

    Err_{D,γ}(Λ_{w,b}) = Pr_{(x,y)∼D}(yΛ_{w,b}(x) ≤ γ).
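On a finite sample, the two error rates above amount to counting margin violations. A minimal sketch, with an illustrative sample and hyperplane:

```python
import numpy as np

def err_01(w, b, X, y):
    """Empirical 0-1 error of the halfspace sign(<w,x> + b)."""
    margins = y * (X @ w + b)
    return np.mean(margins <= 0)

def err_margin(w, b, X, y, gamma):
    """Empirical gamma-margin error: wrong side OR within distance gamma.
    Assumes ||w|| = 1, so |<w,x> + b| is the distance to the hyperplane."""
    margins = y * (X @ w + b)
    return np.mean(margins <= gamma)

# Illustrative sample in the unit ball of R^2, separated by the hyperplane x1 = 0
X = np.array([[0.5, 0.0], [-0.5, 0.0], [0.05, 0.0], [-0.05, 0.0]])
y = np.array([1, -1, 1, -1])
w, b = np.array([1.0, 0.0]), 0.0

print(err_01(w, b, X, y))           # 0.0: every point is correctly classified
print(err_margin(w, b, X, y, 0.1))  # 0.5: the two points at distance 0.05 violate the margin
```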


Note that if ‖w‖ = 1 then |Λ_{w,b}(x)| is the distance of x from the separating hyperplane. Therefore, the γ-margin error rate is the probability of x to either be on the wrong side of the hyperplane or at a distance of at most γ from it. The least γ-margin error rate of a halfspace classifier is denoted Err_γ(D) = min_{w∈B, b∈R} Err_{D,γ}(Λ_{w,b}).

A learning algorithm receives γ, ɛ and access to i.i.d. samples from D. The algorithm should return a classifier (which need not be an affine function). We say that the algorithm has approximation ratio α(γ) if for every γ, ɛ and for every distribution, it outputs (w.h.p. over the i.i.d. D-samples) a classifier with error rate ≤ α(γ)·Err_γ(D) + ɛ. An efficient algorithm uses poly(1/γ, 1/ɛ) samples, runs in time polynomial in the size of the sample¹, and outputs a classifier f such that f(x) can be evaluated in time polynomial in the sample size.

1.2 Kernel-SVM and kernel-based learners

The SVM paradigm, introduced by Vapnik, is inspired by the idea of separation with margin. For the reader's convenience we first describe the basic (kernel-free) variant of SVM. It is well known (e.g. Anthony and Bartlett (1999)) that the affine function that minimizes the empirical γ-margin error rate over an i.i.d. sample of size poly(1/γ, 1/ɛ) has error rate ≤ Err_γ(D) + ɛ.
However, this minimization problem is NP-hard and even NP-hard to approximate (Guruswami and Raghavendra, 2006, Feldman et al., 2006).

SVM deals with this hardness by replacing the margin loss with a convex surrogate loss, in particular the hinge loss² l_hinge(x) = (1 − x)₊. Note that for x ∈ [−2, 2],

    l_{0-1}(x) ≤ l_hinge(x/γ) ≤ (1 + 2/γ)·l_γ(x),

from which it easily follows that by solving

    min_{w,b} Err_{D,hinge}((1/γ)·Λ_{w,b})  s.t.  w ∈ H, b ∈ R, ‖w‖_H ≤ 1

we obtain an approximation ratio of α(γ) = 1 + 2/γ. It is more convenient to consider the problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b})  s.t.  w ∈ H, b ∈ R, ‖w‖_H ≤ C,   (1)

which is equivalent for C = 1/γ. The basic (kernel-free) variant of SVM essentially solves Problem (1), which can be approximated, up to an additive error of ɛ, by an efficient algorithm running on a sample of size poly(1/γ, 1/ɛ).

Kernel-free SVM minimizes the hinge loss over the space of affine functionals of bounded norm. The family of kernel-SVM algorithms is obtained by replacing the space of affine functionals with other, possibly much larger, spaces (e.g., a polynomial kernel of degree t extends the repertoire of possible output functions from affine functionals to all polynomials of degree at most t). This is accomplished by embedding B into the unit ball of another Hilbert space on which we apply basic SVM. Concretely, let ψ : B → B₁, where B₁ is the unit ball of a Hilbert space H₁. The embedding ψ need not be computed directly. Rather, it is enough that we can efficiently compute the corresponding kernel, k(x, y) := 〈ψ(x), ψ(y)〉_{H₁} (this property, sometimes crucial, is called the kernel trick). It remains to solve the following problem

    min_{w,b} Err_{D,hinge}(Λ_{w,b} ◦ ψ)  s.t.  w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C.   (2)

This problem can be approximated, up to an additive error of ɛ, using poly(C/ɛ) samples and time. We prove lower bounds for all approximate solutions of program (2). In fact, our results work with arbitrary (not just hinge loss) convex surrogate losses and arbitrary (not just efficiently computable) kernels.

Although we formulate our results for Problem (2), they apply as well to the following commonly used formulation of the kernel SVM problem, where the constraint ‖w‖_{H₁} ≤ C is replaced by a regularization term. Namely,

    min_{w∈H₁, b∈R} (1/C²)·‖w‖²_{H₁} + Err_{D,hinge}(Λ_{w,b} ◦ ψ).   (3)

The optimum of program (3) is ≤ 1, as shown by the zero solution w = 0, b = 0. Thus, if (w, b) is an approximately optimal solution, then ‖w‖²_{H₁}/C² ≤ 2, and hence ‖w‖_{H₁} ≤ 2C. This observation makes it easy to modify our results on program (2) to apply to program (3).

¹ The size of a vector x ∈ H is taken to be the largest index j for which x_j ≠ 0.
² As usual, z₊ := max(z, 0).
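The kernel trick mentioned above can be made concrete with a toy example: for the degree-2 polynomial kernel on R², there is an explicit feature map ψ whose inner products the kernel reproduces. This is a standard textbook construction for illustration only, not the embedding analyzed in this paper:

```python
import numpy as np

def kernel(x, z):
    """Degree-2 polynomial kernel on R^2, computed without any embedding."""
    return (x @ z) ** 2

def psi(x):
    """Explicit feature map with <psi(x), psi(z)> = (<x, z>)^2 for x, z in R^2."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -0.4])
z = np.array([0.1, 0.5])
# The kernel evaluates the inner product in feature space without forming psi:
print(np.isclose(kernel(x, z), psi(x) @ psi(z)))  # True
```

For higher degrees and dimensions, the feature space grows quickly (or is infinite dimensional, e.g. for the Gaussian kernel), which is exactly why evaluating k directly matters.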
For example, some algorithms do not constrain the affine functional, while in the Lasso method (Tibshirani, 1996) the affine functional (represented as a vector in R^m) must have small L₁-norm.

Without the kernel trick, such algorithms work directly in R^m. Thus, every algorithm must have time complexity Ω(m), and therefore m is a lower bound on the complexity of the algorithm. In this work we lower bound the performance of any algorithm with m ≤ poly(1/γ). Concretely, we prove lower bounds for any approximate solution to a problem of the form

    min_{w,b} Err_{D,l}(Λ_{w,b} ◦ ψ)  s.t.  w ∈ W ⊂ R^m, b ∈ R,   (4)

where l is some surrogate loss function (see formal definition in the next section) and ψ : B → R^m.

It is not hard to see that for any m-dimensional space V of functions over the ball, there exists an embedding ψ : B → R^m such that

    {f + b : f ∈ V, b ∈ R} = {Λ_{w,b} ◦ ψ : w ∈ R^m, b ∈ R}.

Hence, our lower bounds hold for any method that optimizes a surrogate loss over a subset of a finite dimensional space of functions and returns the threshold function corresponding to the optimum.
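The identity above, that any function in V plus a bias equals an affine functional composed with the embedding ψ(x) = (g₁(x), …, g_m(x)), can be sanity-checked numerically. The basis functions, weights, and sample points in this sketch are all illustrative:

```python
import numpy as np

# V = span{g_1, g_2, g_3}; psi maps x to the vector of basis evaluations.
basis = [lambda x: x, lambda x: x**2, lambda x: np.sin(x)]

def psi(x):
    return np.array([g(x) for g in basis])

# Any f = sum_j w_j * g_j together with a bias b is exactly Lambda_{w,b} o psi:
w, b = np.array([0.5, -1.0, 2.0]), 0.3
f = lambda x: 0.5 * x - 1.0 * x**2 + 2.0 * np.sin(x) + 0.3

for x in [-0.7, 0.1, 0.9]:
    assert np.isclose(f(x), w @ psi(x) + b)
print("f + b and Lambda_{w,b} o psi agree on the sample points")
```

This is why optimizing a convex loss over V reduces, after the one-time embedding, to optimizing over affine functionals in R^m.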


1.4 Previous Results and Related Work

The problem of learning halfspaces, and in particular large margin halfspaces, is as old as the field of machine learning, starting with the perceptron algorithm (Rosenblatt, 1958). Since then it has been a fundamental challenge in machine learning and has inspired much of the existing theory as well as many popular algorithms.

The generalized linear method has its roots in the work of Gauss and Legendre, who used the least squares method for astronomical computations. This method has played a key role in modern statistics. Its first application in computational learning theory is in (Linial et al., 1989), where it is shown that AC⁰ functions are learnable in quasi-polynomial time w.r.t. the uniform distribution. Subsequently, many authors have used the method to tackle various learning problems. For example, Klivans and Servedio (2001) derived the fastest algorithm for learning DNF, and Kushilevitz and Mansour (1991) used it to develop an algorithm for decision trees. The main uses of the linear method in the problem of learning halfspaces appear in the next paragraph. Needless to say, we are unable here to offer a comprehensive survey of its uses in computational learning theory in general.

The best currently known approximation ratios in the problem of learning large margin halfspaces are due to (Birnbaum and Shalev-Shwartz, 2012) and (Long and Servedio, 2011) and achieve an approximation ratio of 1/(γ·√log(1/γ)). The algorithm of (Birnbaum and Shalev-Shwartz, 2012) is a kernel based learner, while (Long and Servedio, 2011) used a "boosting based" approach (that does not belong to the generalized linear method). The fastest exact algorithm is due to Shalev-Shwartz et al. (2011), runs in time exp(Θ((1/γ)·log(1/(ɛγ)))), and is also a kernel based learner. Better running times can be achieved under distributional assumptions. For data which is separable with margin γ, i.e. Err_γ(D) = 0, the perceptron algorithm (as well as SVM with a linear kernel) can find a classifier with error ≤ ɛ with time and sample complexity ≤ poly(1/γ, 1/ɛ). Kalai et al. (2005) gave a finite dimensional learner which is the fastest known algorithm for learning halfspaces w.r.t. the uniform distribution over S^{d−1} and the d-dimensional boolean cube (running in time d^{O(1/ɛ⁴)}). They also designed a finite dimensional learner of halfspaces w.r.t. log-concave distributions. Blais et al. (2008) extended these results from uniform to product distributions. In this work, we focus on algorithms which work for any distribution and whose runtime is polynomial in both 1/γ and 1/ɛ.

The problem of proper³ learning of halfspaces in the non-separable case was shown to be hard to approximate within any constant approximation factor (Feldman et al., 2006, Guruswami and Raghavendra, 2006). It has been recently shown (Shalev-Shwartz et al., 2011) that improper learning under the margin assumption is also hard (under some cryptographic assumptions). Namely, no polynomial time algorithm can achieve an approximation ratio of α(γ) = 1. In another recent result, Daniely et al. (2013) have shown that under a certain complexity assumption, for every constant α, no polynomial time algorithm can achieve an approximation ratio of α.

Ben-David et al. (2012) (see also Long and Servedio (2011)) addressed the performance of methods that minimize a convex loss over the class of affine functionals of bounded norm (in our terminology, they considered the narrow class of finite dimensional learners that optimize over the space of linear functionals). They showed that the best approximation ratio of such methods is Θ(1/γ). Our results can be seen as a substantial generalization of their results.

The learning theory literature contains consistency results for learning with the so-called universal kernels and well-calibrated surrogate loss functions. This includes the study of asymptotic relations between surrogate convex loss functions and the 0-1 loss function (Zhang, 2004, Bartlett et al., 2006, Steinwart and Christmann, 2008). It is shown that the approximation ratio of SVM with a universal kernel tends to 1 as the sample size grows. Our result implies that this convergence is very slow; e.g., an exponentially large (in 1/γ) sample is needed to make the error < 2·Err_γ(D).

Also related are Ben-David et al. (2003) and Warmuth and Vishwanathan (2005). These papers show the existence of learning problems with limitations on the ability to learn them using linear methods.

³ A proper learner must output a halfspace classifier. Here we consider improper learning, where the learner can output any classifier.

2 Results

We first define the two families of algorithms to which our lower bounds apply. We start with the class of surrogate loss functions.
This class includes the most popular choices, such as the absolute loss |1 − x|, the squared loss (1 − x)², the logistic loss log₂(1 + e^{−x}), and the hinge loss (1 − x)₊.

Definition 2.1 (Surrogate loss function) A function l : R → R is called a surrogate loss function if l is convex and is bounded below by the 0-1 loss.

The first family of algorithms contains kernel based algorithms, such as kernel SVM. In the definitions below we set the accuracy parameter ɛ to be √γ. Since our goal is to prove lower bounds, this choice is without loss of generality and is intended to simplify the theorem statements.

Definition 2.2 (Kernel based learner) Let l : R → R be a surrogate loss function. A kernel based learning algorithm, A, receives as input γ ∈ (0, 1). It then selects C = C_A(γ) and an absolutely continuous feature mapping, ψ = ψ_A(γ), which maps the original space H into the unit ball of a new space H₁ (see Section 1.2). The algorithm returns a function

    A(γ) ∈ {Λ_{w,b} ◦ ψ : w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C}

such that, with probability ≥ 1 − exp(−1/γ),

    Err_{D,l}(A(γ)) ≤ inf{Err_{D,l}(Λ_{w,b} ◦ ψ) : w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C} + √γ.

We denote by m_A(γ) the maximal number of examples A uses. We say that A is efficient if m_A(γ) ≤ poly(1/γ).

Note that the definition of kernel based learner allows for any predefined convex surrogate loss, not just the hinge loss. Namely, we consider the program

    min_{w,b} Err_{D,l}(Λ_{w,b} ◦ ψ)  s.t.  w ∈ H₁, b ∈ R, ‖w‖_{H₁} ≤ C.   (5)
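A quick numerical sanity check, not part of the paper's argument, that the four losses listed above satisfy Definition 2.1, i.e. each is convex and dominates the 0-1 loss on a grid:

```python
import numpy as np

l01 = lambda x: (x <= 0).astype(float)
losses = {
    "absolute": lambda x: np.abs(1 - x),
    "squared":  lambda x: (1 - x) ** 2,
    "logistic": lambda x: np.log2(1 + np.exp(-x)),
    "hinge":    lambda x: np.maximum(1 - x, 0),
}

xs = np.linspace(-5, 5, 1001)
for name, l in losses.items():
    # dominates the 0-1 loss pointwise
    assert np.all(l(xs) >= l01(xs) - 1e-12), name
    # midpoint convexity on the grid
    mids = (xs[:-1] + xs[1:]) / 2
    assert np.all(l(mids) <= (l(xs[:-1]) + l(xs[1:])) / 2 + 1e-12), name
print("all four losses are convex upper bounds on the 0-1 loss (on the grid)")
```

Note that all four are calibrated so that they equal 1 at x = 0, where the 0-1 loss jumps.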


We note that our results hold even if the kernel corresponding to ψ is hard to compute.

The second family of learning algorithms involves an arbitrary feature mapping and a domain constraint on the vector w, as in program (4).

Definition 2.3 (Finite dimensional learner) Let l : R → R be some surrogate loss function. A finite dimensional learning algorithm, A, receives as input γ ∈ (0, 1). It then selects a continuous embedding ψ = ψ_A(γ) : B → R^m and a constraint set W = W_A(γ) ⊆ R^m. The algorithm returns, with probability ≥ 1 − exp(−1/γ), a function

    A(γ) ∈ {Λ_{w,b} ◦ ψ : w ∈ W, b ∈ R}

such that

    Err_{D,l}(A(γ)) ≤ inf{Err_{D,l}(Λ_{w,b} ◦ ψ) : w ∈ W, b ∈ R} + √γ.

We say that A is efficient if m = m_A(γ) ≤ poly(1/γ).

2.1 Main Results

We begin with a lower bound on the performance of efficient kernel-based algorithms.

Theorem 2.4 Let l be an arbitrary surrogate loss and let A be an efficient kernel-based learner w.r.t. l. Then, for every γ > 0, there exists a distribution D on B such that, w.p. ≥ 1 − 10·exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(γ·poly(log(1/γ)))).

Next we show that kernel-based learners that achieve an approximation ratio of (1/γ)^{1−ɛ} for some constant ɛ > 0 must suffer exponential complexity.

Theorem 2.5 Let l be an arbitrary surrogate loss, let ɛ > 0, and let A be a kernel-based learner w.r.t. l such that for every γ > 0 and every distribution D on B, w.p. ≥ 1/2,

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≤ (1/γ)^{1−ɛ}.

Then, for some a = a(ɛ) > 0, m_A(γ) = Ω(exp((1/γ)^a)).

These two theorems follow from the following result.

Theorem 2.6 Let l be an arbitrary surrogate loss and let A be a kernel-based learner w.r.t. l for which m_A(γ) = exp(o(γ^{−2/7})).
Then, for every γ > 0, there exists a distribution D on B such that, w.p. ≥ 1 − 10·exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(γ·poly(log(m_A(γ))))).


It is shown in (Birnbaum and Shalev-Shwartz, 2012) that solving kernel SVM with some specific kernel (i.e. some specific ψ) yields an approximation ratio of O(1/(γ·√log(1/γ))). It follows that our lower bound in Theorem 2.6 is essentially tight. Also, this theorem can be viewed as a substantial generalization of (Ben-David et al., 2012, Long and Servedio, 2011), who give an approximation ratio of Ω(1/γ) with no embedding (i.e., ψ is the identity map). Also relevant is (Shalev-Shwartz et al., 2011), which shows that for a certain ψ, and m_A(γ) = poly(exp((1/γ)·log(1/γ))), kernel SVM has approximation ratio 1. Theorem 2.6 shows that for a kernel-based learner to achieve a constant approximation ratio, m_A must be exponential in 1/γ.

Next we give lower bounds on the performance of finite dimensional learners.

Theorem 2.7 Let l be a Lipschitz surrogate loss and let A be a finite dimensional learner w.r.t. l. Assume that m_A(γ) = exp(o(γ^{−1/8})). Then, for every γ > 0, there exists a distribution D on S^{d−1} × {±1} with d = O(log(m_A(γ)/γ)) such that, w.p. ≥ 1 − exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(√γ·poly(log(m_A(γ)/γ)))).

Corollary 2.8 Let l be a Lipschitz surrogate loss and let A be an efficient finite dimensional learner w.r.t. l. Then, for every γ > 0, there exists a distribution D on S^{d−1} × {±1} with d = O(log(1/γ)) such that, w.p. ≥ 1 − exp(−1/γ),

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≥ Ω(1/(√γ·poly(log(1/γ)))).

Corollary 2.9 Let l be a Lipschitz surrogate loss, let ɛ > 0, and let A be a finite dimensional learner w.r.t. l such that for every γ > 0 and every distribution D on B_d with d = ω(log(1/γ)) it holds that, w.p.
≥ 1/2,

    Err_{D,0-1}(A(γ)) / Err_γ(D) ≤ (1/γ)^{1/2−ɛ}.

Then, for some a = a(ɛ) > 0, m_A(γ) = Ω(exp((1/γ)^a)).

2.2 Review of the proofs' main ideas

To give the reader some idea of our arguments, we sketch some of the main ingredients of the proof of Theorem 2.6. At the end of this section we sketch the idea of the proof of Theorem 2.7. We note, however, that the actual proofs are organized somewhat differently.

We will construct a distribution D over S^{d−1} × {±1} (recall that R^d is viewed as standardly embedded in H = ℓ₂). Thus, we can assume that the program is formulated in terms of the unit sphere, S^∞ ⊂ ℓ₂, and not the unit ball.

Fix an embedding ψ and C > 0. Denote by k : S^∞ × S^∞ → R the corresponding kernel, k(x, y) = 〈ψ(x), ψ(y)〉_{H₁}, and consider the following set of functions over S^∞:

    H_k = {Λ_{v,0} ◦ ψ : v ∈ H₁}.


H_k is a Hilbert space with norm ‖f‖_{H_k} = inf{‖v‖_{H₁} : Λ_{v,0} ◦ ψ = f}. The subscript k indicates that H_k is uniquely determined (as a Hilbert space) given the kernel k. With this interpretation, program (5) is equivalent to the program

    min_{f∈H_k, b∈R} Err_{D,l}(f + b)  s.t.  ‖f‖_{H_k} ≤ C.   (6)

For simplicity we focus on l being the hinge loss (the generalization to other surrogate loss functions is rather technical).

The proof may be split into four steps:

1. Our first step is to show that we can restrict to the case C = poly(1/γ). We show that for every subset X of a Hilbert space H, if affine functionals on X can be learnt using m examples w.r.t. the hinge loss, then there exists an equivalent inner product on H under which X is contained in a unit ball, and the affine functional returned by any learning algorithm must have norm ≤ O(m³). Since we consider algorithms with polynomial sample complexity, this allows us to argue as if C = poly(1/γ).

2. We consider the one-dimensional problem of improperly learning halfspaces (i.e. thresholds on the line) by optimizing the hinge loss over the space of univariate polynomials of degree bounded by log(C). We construct a distribution D over [−1, 1] × {±1} that is a convex combination of two distributions: one that is separable by a γ-margin halfspace, and another representing a tiny amount of noise. We show that each solution of the problem of minimizing the hinge loss w.r.t. D over the space of such polynomials has the property that f(γ) ≈ f(−γ).

3. We pull back the distribution D w.r.t. a direction e ∈ S^{d−1} to a distribution over S^{d−1} × {±1}. Let f be an approximate solution of program (6).
We show that f takesalmost the same value on instances for which 〈x, e〉 = γ and 〈x, e〉 = −γ. This stepcan be further broken into three substeps –(a) First, we assume that the <strong>kernel</strong> is symmetric and f(x) depends only on 〈x, e〉.This substep uses a characterization <strong>of</strong> Hilbert spaces corresponding to symmetric<strong>kernel</strong>s, from which it follows that f has the form∞∑f(x) = α n P d,n (〈x, e〉) .n=1Here P d,n are the d-dimensional Legendre polynomials and ∑ ∞n=0 α2 n < C 2 . Thisallows us to rely on the results for the one-dimensional case from step (1).(b) By symmetrizing f, we relax the assumption that f depends only on 〈x, e〉.(c) By averaging the <strong>kernel</strong> over the group <strong>of</strong> linear isometries on R d , we relax theassumption that the <strong>kernel</strong> is symmetric.4. Finally, we show that for the distribution from the previous step, if f is an approximatesolution to program (6) then f predicts the same value, 1, on instances for which〈x, e〉 = γ and 〈x, e〉 = −γ. This establishes our claim, as the constructed distributionassigns the value −1 to instances for which 〈x, e〉 = −γ.91γ).
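The pull-back in step 3 can be sketched numerically as follows (the dimension, direction, and margin below are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 10, 0.1

# Sketch of the pull-back: sample the margin coordinate alpha (here fixed to
# gamma), draw a uniform unit vector z orthogonal to e, and output the point
# x = alpha * e + sqrt(1 - alpha^2) * z on the sphere S^{d-1}.
e = np.zeros(d)
e[0] = 1.0

def sample_pullback(alpha):
    z = rng.standard_normal(d)
    z -= (z @ e) * e                   # project out the e-direction
    z /= np.linalg.norm(z)             # uniform on the sphere orthogonal to e
    return alpha * e + np.sqrt(1.0 - alpha ** 2) * z

x = sample_pullback(gamma)
assert abs(np.linalg.norm(x) - 1.0) < 1e-9   # lands on the unit sphere
assert abs(x @ e - gamma) < 1e-9             # margin coordinate is exactly alpha
```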


We now expand on this brief description of the main steps.

Polynomial sample implies small C

Let X be a subset of the unit ball of some Hilbert space H and let C > 0. Assume that affine functionals over X with norm ≤ C can be learnt using m examples with error ɛ and confidence δ. That is, assume that there is an algorithm such that

• Its input is a sample of m points in X × {±1} and its output is an affine functional Λ_{w,b} with ‖w‖ ≤ C.

• For every distribution D on X × {±1}, it returns, with probability 1 − δ, w, b with Err_{D,hinge}(Λ_{w,b}) ≤ inf_{‖w′‖≤C, b′∈R} Err_{D,hinge}(Λ_{w′,b′}) + ɛ.

We will show that there is an equivalent inner product 〈·, ·〉′ on H under which X is contained in a unit ball (not necessarily around 0), and the affine functional returned by any learning algorithm as above must have norm ≤ O(m³) w.r.t. the new norm.

The construction of the norm is done as follows. We first find an affine subspace M ⊂ H of dimension d ≤ m that is very close to X, in the sense that the distance of every point in X from M is ≤ m/C. To find such an M, we assume toward a contradiction that there is no such M, and use this to show that there is a subset A ⊂ X such that every function f : A → [−1, 1] can be realized by some affine functional with norm ≤ C. This contradicts the assumption that affine functionals with norm ≤ C can be learnt using m examples.

Having the subspace M at hand, we construct, using John's lemma (e.g. Matousek (2002)), an inner product 〈·, ·〉′′ on M and a distribution µ_N on X with the property that the projection of X on M is contained in a ball of radius 1/2 w.r.t. 〈·, ·〉′′, and the hinge error of every affine functional w.r.t. µ_N is lower bounded by the norm of the affine functional divided by m². We show that this entails that any affine functional returned by the algorithm must have norm ≤ O(m³) w.r.t. the inner product 〈·, ·〉′′.

Finally, we construct an inner product 〈·, ·〉′ on H by putting the norm 〈·, ·〉′′ on M and multiplying the original inner product by C²/(2m²) on M^⊥.

The one dimensional distribution

We define a distribution D on [−1, 1] × {±1} as follows. Start with the distribution D₁ that takes the values (γ, 1) and (−γ, −1), where D₁(γ, 1) = 0.7 and D₁(−γ, −1) = 0.3. Clearly, for this distribution, the threshold 0 has zero error rate. To construct D, we perturb D₁ with "noise" as follows. Let D = (1 − λ)D₁ + λD₂, where D₂ is defined as follows. The probability of the labels is uniform and independent of the instance, and the marginal probability over the instances is defined by the density function

ρ(x) = 0 if |x| > 1/8,  ρ(x) = 8 / (π √(1 − (8x)²)) if |x| ≤ 1/8 .

This choice of ρ simplifies our calculations due to its relation to Chebyshev polynomials. However, other choices of ρ which are supported on a small interval around zero can also work.
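As a quick numeric sanity check (an illustration, not part of the paper's argument), the noise density ρ is the arcsine (Chebyshev) weight rescaled from [−1, 1] to [−1/8, 1/8], so it integrates to 1:

```python
import numpy as np

# rho(x) = 8 / (pi * sqrt(1 - (8x)^2)) on |x| <= 1/8 is the arcsine weight
# dx / (pi * sqrt(1 - x^2)) pushed forward under x |-> x / 8, hence a
# probability density.  We integrate it numerically (cutting off the
# integrable endpoint singularities).
xs = np.linspace(-1/8 + 1e-6, 1/8 - 1e-6, 2_000_001)
rho = 8.0 / (np.pi * np.sqrt(1.0 - (8.0 * xs) ** 2))
mass = np.sum(0.5 * (rho[1:] + rho[:-1]) * np.diff(xs))  # trapezoid rule
assert abs(mass - 1.0) < 1e-2
```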


Note that the error rate of the threshold 0 on D is λ/2. We next show that each polynomial f of degree K = log(C) that satisfies Err_{D,hinge}(f) ≤ 1 must have f(γ) ≈ f(−γ). Indeed, if

1 ≥ Err_{D,hinge}(f) = (1 − λ) Err_{D₁,hinge}(f) + λ Err_{D₂,hinge}(f)

then Err_{D₂,hinge}(f) ≤ 1/λ. But,

Err_{D₂,hinge}(f) = (1/2) ∫_{−1}^{1} l_hinge(f(x)) ρ(x) dx + (1/2) ∫_{−1}^{1} l_hinge(−f(x)) ρ(x) dx
  ≥ (1/2) ∫_{−1}^{1} l_hinge(−|f(x)|) ρ(x) dx ,

and using the convexity of l_hinge we obtain from Jensen's inequality that this is

  ≥ (1/2) l_hinge( −∫_{−1}^{1} |f(x)| ρ(x) dx )
  = (1/2) ( 1 + ∫_{−1}^{1} |f(x)| ρ(x) dx )
  ≥ (1/2) ∫_{−1}^{1} |f(x)| ρ(x) dx =: (1/2) ‖f‖_{1,dρ} .

This shows that ‖f‖_{1,dρ} ≤ 2/λ. We next write f = Σ_{i=1}^{K} α_i T̃_i, where the T̃_i are the orthonormal polynomials corresponding to the measure dρ. Since the T̃_i are related to Chebyshev polynomials, we can uniformly bound their l_∞ norm, and hence obtain that

√(Σ_i α_i²) = ‖f‖_{2,dρ} ≤ O(√K) ‖f‖_{1,dρ} ≤ O(√K / λ) .

Based on the above, and using a bound on the derivatives of Chebyshev polynomials, we can bound the derivative of the polynomial f:

|f′(x)| ≤ Σ_i |α_i| |T̃_i′(x)| ≤ O(K³/λ) .

Hence, by choosing λ = ω(γK³) = ω(γ log³(C)) we obtain

|f(γ) − f(−γ)| ≤ 2γ max_x |f′(x)| = O(γK³/λ) = o(1) ,

as required.

Pulling back to the d − 1 dimensional sphere

Given the distribution D over [−1, 1] × {±1} described before, and some e ∈ S^{d−1}, we now define a distribution D_e on S^{d−1} × {±1}. To sample from D_e, we first sample (α, β) from D


and (uniformly and independently) a vector z from the 1-codimensional sphere of S^{d−1} that is orthogonal to e. The constructed point is (αe + √(1 − α²) z, β).

For any f ∈ H_k and a ∈ [−1, 1] define f̄(a) to be the expectation of f over the 1-codimensional sphere {x ∈ S^{d−1} : 〈x, e〉 = a}. We will show that for any f ∈ H_k such that ‖f‖_{H_k} ≤ C and Err_{D_e,hinge}(f) ≤ 1, we have | f̄(γ) − f̄(−γ)| = o(1).

To do so, let us first assume that f is symmetric with respect to e, and hence can be written as

f(x) = Σ_{n=0}^∞ α_n P_{d,n}(〈x, e〉) ,

where α_n ∈ R and P_{d,n} is the d-dimensional Legendre polynomial of degree n. Furthermore, by a characterization of Hilbert spaces corresponding to symmetric kernels, it follows that Σ_n α_n² ≤ C².

Since f is symmetric w.r.t. e we have

f̄(a) = Σ_{n=0}^∞ α_n P_{d,n}(a) .

For |a| ≤ 1/8, |P_{d,n}(a)| tends to zero exponentially fast with both d and n. Hence, if d is large enough then

f̄(a) ≈ Σ_{n=0}^{log(C)} α_n P_{d,n}(a) =: f̃(a) .

Note that f̃ is a polynomial of degree bounded by log(C). In addition, by construction, Err_{D_e,hinge}(f) = Err_{D,hinge}(f̄) ≈ Err_{D,hinge}(f̃). Hence, if 1 ≥ Err_{D_e,hinge}(f) then using the previous subsection we conclude that | f̄(γ) − f̄(−γ)| = o(1).

Symmetrization of f

In the above, we assumed both that the kernel function is symmetric and that f is symmetric w.r.t. e. Our next step is to relax the latter assumption, while still assuming that the kernel function is symmetric.

Let O(e) be the group of linear isometries that fix e, namely, O(e) = {A ∈ O(d) : Ae = e}. Since k is a symmetric kernel, for every A ∈ O(e) the function g(x) = f(Ax) is also in H_k. Furthermore, ‖g‖_{H_k} = ‖f‖_{H_k}, and by the construction of D_e we also have Err_{D_e,hinge}(g) = Err_{D_e,hinge}(f).
Let P_e f(x) = ∫_{O(e)} f(Ax) dA be the symmetrization of f w.r.t. e. On one hand, P_e f ∈ H_k, ‖P_e f‖_{H_k} ≤ ‖f‖_{H_k}, and Err_{D_e,hinge}(P_e f) ≤ Err_{D_e,hinge}(f). On the other hand, f̄ = P_e f. Since for P_e f we have already shown that |P_e f(γ) − P_e f(−γ)| = o(1), it follows that | f̄(γ) − f̄(−γ)| = o(1) as well.

Symmetrization of the kernel

Our final step is to remove the assumption that the kernel is symmetric. To do so, we first symmetrize the kernel as follows. Recall that O(d) is the group of linear isometries of R^d.


Define the following symmetric kernel:

k_s(x, y) = ∫_{O(d)} k(Ax, Ay) dA .

We show that the corresponding Hilbert space consists of functions of the form

f(x) = ∫_{O(d)} f_A(Ax) dA ,

where f_A ∈ H_k for every A. Moreover,

‖f‖²_{H_{k_s}} ≤ ∫_{O(d)} ‖f_A‖²_{H_k} dA .  (7)

Let α be the maximal number such that

∀e ∈ S^{d−1} ∃f_e ∈ H_k s.t. ‖f_e‖_{H_k} ≤ C, Err_{D_e,hinge}(f_e) ≤ 1, | f̄_e(γ) − f̄_e(−γ)| > α .

Since H_k is closed under negation, it follows that α satisfies

∀e ∈ S^{d−1} ∃f_e ∈ H_k s.t. ‖f_e‖_{H_k} ≤ C, Err_{D_e,hinge}(f_e) ≤ 1, f̄_e(γ) − f̄_e(−γ) > α .

Fix some v ∈ S^{d−1} and define f ∈ H_{k_s} to be

f(x) = ∫_{O(d)} f_{Av}(Ax) dA .

By Equation (7) we have ‖f‖_{H_{k_s}} ≤ C. It is also possible to show that for all A, Err_{D_v,hinge}(f_{Av} ◦ A) = Err_{D_{Av},hinge}(f_{Av}) ≤ 1. Therefore, by the convexity of the loss, Err_{D_v,hinge}(f) ≤ 1. It follows, by the previous sections, that | f̄(γ) − f̄(−γ)| = o(1). But we show that f̄(γ) − f̄(−γ) > α. It therefore follows that α = o(1), as required.

Concluding the proof

We have shown that for every kernel there exists some direction e such that for all f ∈ H_k satisfying ‖f‖_{H_k} ≤ C and Err_{D_e,hinge}(f) ≤ 1 we have | f̄(γ) − f̄(−γ)| = o(1).

Next, consider an f which is also an (approximate) optimal solution of program (2) with respect to D_e. Since Err_{D_e,hinge}(0) = 1, we clearly have Err_{D_e,hinge}(f) ≤ 1, hence | f̄(γ) − f̄(−γ)| = o(1). Next we show that f̄(−γ) > 1/2, which implies that f predicts the label 1 for most instances on the 1-codimensional sphere such that 〈x, e〉 = −γ. Hence, its 0-1 error is close to 0.3(1 − λ) ≥ 0.2, while Err_γ(D_e) = λ/2. By choosing λ = O(γ log^{3.1}(C)) we obtain that the approximation ratio is Ω(1 / (γ log^{3.1}(C))).

It therefore remains to show that f̄(−γ) > 1/2. Let a = f̄(γ) ≈ f̄(−γ). On a (1 − λ) fraction of the distribution, the hinge-loss would be (on average, and roughly) 0.3[1 + a]₊ + 0.7[1 − a]₊. This function is minimized at a = 1, which concludes our proof since λ = o(1).
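The final minimization claim is easy to confirm on a grid (an illustration of the last step, with the paper's weights 0.3 and 0.7):

```python
import numpy as np

# On the clean part of the distribution, a hypothesis taking the common
# value a on both spheres <x, e> = gamma and <x, e> = -gamma pays average
# hinge loss 0.3 * [1 + a]_+ + 0.7 * [1 - a]_+ ; this is minimized at a = 1.
a = np.linspace(-2.0, 2.0, 4001)
loss = 0.3 * np.maximum(0.0, 1.0 + a) + 0.7 * np.maximum(0.0, 1.0 - a)
best = a[np.argmin(loss)]
assert abs(best - 1.0) < 1e-3
```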


The proof of Theorem 2.7

To prove Theorem 2.7, we prove, using John's Lemma (Matousek, 2002), that for every embedding ψ : S^{d−1} → B_1, we can construct a kernel k : S^{d−1} × S^{d−1} → R and a probability measure µ_N over S^{d−1} with the following properties: if f is an approximate solution of program (4), where a γ fraction of the distribution D is perturbed by µ_N, then ‖f‖_k ≤ O(m^{1.5}/γ). Using this, we adapt the proof as sketched above to prove Theorem 2.7.

3 Additional Results

Low dimensional distributions. It is of interest to examine Theorem 2.6 when D is supported on B^d for small d. We show that for d = O(log(1/γ)), the approximation ratio is Ω(1 / (√γ · poly(log(1/γ)))). Most commonly used kernels (e.g., the polynomial, RBF, and hyperbolic tangent kernels, as well as the kernel used in Shalev-Shwartz et al. (2011)) are symmetric. Namely, for all unit vectors x, y ∈ B, k(x, y) := 〈ψ(x), ψ(y)〉_{H_1} depends only on 〈x, y〉_H. For symmetric kernels, we show that even with the restriction d = O(log(1/γ)), the approximation ratio is still Ω(1 / (γ · poly(log(1/γ)))). However, the result for symmetric kernels is only proved for (idealized) algorithms that return the exact solution of program (5).

Theorem 3.1 Let A be a kernel-based learner corresponding to a Lipschitz surrogate. Assume that m_A(γ) = exp(o(γ^{−1/8})). Then, for every γ > 0, there exists a distribution D on B^d, for d = O(log(m_A(γ)/γ)), such that, w.p. ≥ 1 − exp(−1/γ),

Err_{D,0−1}(A(γ)) / Err_γ(D) ≥ Ω( 1 / (√γ · poly(log(m_A(γ)))) ) .

Theorem 3.2 Assume that m_A(γ) = exp(o(γ^{−2/7})) and ψ is continuous and symmetric. For every γ > 0, there exists a distribution D on B^d, for d = O(log(m_A(γ))), and a solution to program (5) whose 0-1-error is Ω( 1 / (γ · poly(log(m_A(γ)))) ) · Err_γ(D).

The integrality gap. In bounding the approximation ratio, we considered a predefined loss l. We believe that similar bounds hold as well for algorithms that can choose l according to γ. However, at the moment, we only know how to lower bound the integrality gap, as defined below.

If we let l depend on γ, we should redefine the complexity of Program (5) to be C · L, where L is the Lipschitz constant of l (see the discussion following Program (5)). The (γ-)integrality gap of programs (5) and (4) is defined as the worst case, over all possible choices of D, of the ratio between the optimum of the program, running on the input γ, and Err_γ(D). We note that Err_{D,0−1}(f) ≤ Err_{D,l}(f) for every convex surrogate l. Thus, the integrality gap always upper bounds the approximation ratio. Moreover, this fact establishes most (if not all) guarantees for algorithms that solve Program (5) or Program (4).

We denote by ∂₊f the right derivative of the real function f. Note that ∂₊f always exists for convex f. Also, |∂₊f(x)| ≤ L for all x ∈ R if f is L-Lipschitz. We prove:


Theorem 3.3 Assume that C = exp(o(γ^{−2/7})) and ψ is continuous. For every γ > 0, there exists a distribution D on B^d, for d = O(log(C)), such that the optimum of Program (5) is Ω( 1 / (γ · poly(log(C · |∂₊l(0)|))) ) · Err_γ(D).

Thus Program (5) has integrality gap Ω( 1 / (γ · poly(log(C · L))) ). For Program (4) we prove a similar lower bound:

Theorem 3.4 Let m, d, γ be such that d = ω(log(m/γ)) and m = exp(o(γ^{−2/7})). There exists a distribution D on S^{d−1} × {±1} such that the optimum of Program (4) is Ω( 1 / (γ · poly(log(m/γ))) ) · Err_γ(D).

4 Conclusion

We prove impossibility results for the family of generalized linear methods in the task of learning large margin halfspaces. Some of our lower bounds nearly match the best known upper bounds, and we conjecture that the rest of our bounds can be improved as well to match the best known upper bounds. As we describe next, our work leaves much for future research.

First, regarding the task of learning large margin halfspaces, our analysis suggests that if better approximation ratios are at all possible, then they would require methods other than optimizing a convex surrogate over a regularized linear class of classifiers.

Second, similarly to the problem of learning large margin halfspaces, for many learning problems the best known algorithms belong to the generalized linear family. Understanding the limits of the generalized linear method for these problems is therefore of particular interest, and might indicate where the line lies between feasibility and infeasibility for these problems. We believe that our techniques will prove useful in proving lower bounds on the performance of generalized linear methods for these and other learning problems. E.g., our techniques yield lower bounds on the performance of generalized linear algorithms that learn halfspaces over the boolean cube {±1}^n: it can be shown that these methods cannot achieve approximation ratios better than Ω̃(√n), even if the algorithm competes only with halfspaces defined by vectors in {±1}^n. These ideas will be elaborated on in a long version of this manuscript.

Third, while our results indicate the limitations of generalized linear methods, it is an empirical fact that these methods perform very well in practice. Therefore, it is of great interest to find conditions on distributions that hold in practice, under which these methods are guaranteed to perform well. We note that learning halfspaces under distributional assumptions has already been addressed to a certain degree. For example, (Kalai et al., 2005, Blais et al., 2008) show positive results under several assumptions on the marginal distribution (namely, they assume that the distribution is either uniform, log-concave, or a product distribution). There is still much to do here, specifically in search of better runtimes. Currently these results yield a runtime which is exponential in poly(1/ɛ), where ɛ is the excess error of the learnt hypothesis.

Fourth, as part of our proof, we have shown a (weak) inverse (Lemma 5.15) of the well-known fact that affine functionals of norm ≤ C can be learnt using poly(C) samples. We made no


attempts to prove a quantitatively optimal result in this vein, and we strongly believe that much sharper versions can be proved. This interesting direction is largely left as an open problem.

There are several limitations of our analysis that deserve further work. In our work the surrogate loss is fixed in advance. We believe that similar results hold even if the loss depends on γ. This belief is supported by our results about the integrality gap. As explained in Section 6, this is a subtle issue related to questions about sample complexity. Finally, in view of Theorems 3.3 and 3.4, we believe that, as in Theorem 2.6, the lower bounds in Theorems 2.7 and 3.1 can be improved to depend on 1/γ rather than on 1/√γ.

5 Proofs

5.1 Background and Notation

Here we introduce some notation and terminology to be used throughout. The L_p norm corresponding to a measure µ is denoted ‖·‖_{p,µ}. Also, N = {1, 2, …} and N₀ = {0, 1, 2, …}. For a collection of functions F ⊂ Y^X and A ⊂ X we denote F|_A = {f|_A : f ∈ F}. Let H be a Hilbert space. We denote the projection onto a closed convex subset C of H by P_C. We denote the norm of an affine functional Λ on H by ‖Λ‖_H = sup_{‖x‖_H = 1} |Λ(x) − Λ(0)|.

5.1.1 Reproducing Kernel Hilbert Spaces

All the theorems we quote here are standard and can be found, e.g., in Chapter 2 of (Saitoh, 1988). Let H be a Hilbert space of functions from a set S to C. Note that H consists of functions and not of equivalence classes of functions.
We say that H is a reproducing kernel Hilbert space (RKHS for short) if, for every x ∈ S, the linear functional f ↦ f(x) is bounded.

A function k : S × S → C is a reproducing kernel (or just a kernel) if, for every x₁, …, xₙ ∈ S, the matrix {k(x_i, x_j)}_{1≤i,j≤n} is positive semi-definite.

Kernels and RKHSs are essentially synonymous:

Theorem 5.1
1. For every kernel k there exists a unique RKHS H_k such that for every y ∈ S, k(·, y) ∈ H_k and ∀f ∈ H_k, f(y) = 〈f(·), k(·, y)〉_{H_k}.
2. A Hilbert space H ⊆ C^S is an RKHS if and only if there exists a kernel k : S × S → R such that H = H_k.
3. For every kernel k, span{k(·, y)}_{y∈S} is dense in H_k. Moreover,

〈 Σ_{i=1}^n α_i k(·, x_i), Σ_{i=1}^n β_i k(·, y_i) 〉_{H_k} = Σ_{1≤i,j≤n} α_i β̄_j k(y_j, x_i) .

4. If the kernel k : S × S → R takes only real values, then H_k^R := {Re(f) : f ∈ H_k} ⊂ H_k. Moreover, H_k^R is a real Hilbert space with the inner product induced from H_k.
5. For every kernel k, convergence in H_k implies point-wise convergence. If sup_{x∈S} k(x, x) < ∞ then this convergence is uniform.
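The positive semi-definiteness in the definition of a kernel above can be checked numerically for any concrete kernel; here is a minimal sketch with a Gaussian RBF kernel (an illustrative choice, not one of the kernels studied in the paper):

```python
import numpy as np

# For a kernel k, every Gram matrix {k(x_i, x_j)} must be positive
# semi-definite.  We verify this for the Gaussian RBF kernel on random points.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2))

G = np.array([[k(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(G)
assert np.all(eigvals > -1e-8)   # PSD up to numerical round-off
```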


There is also a tight connection between embeddings of S into a Hilbert space and RKHSs.

Theorem 5.2 A function k : S × S → R is a kernel iff there exists a mapping φ : S → H to some real Hilbert space for which k(x, y) = 〈φ(y), φ(x)〉_H. Also,

H_k = {f_v : v ∈ H} ,

where f_v(x) = 〈v, φ(x)〉_H. The mapping v ↦ f_v, restricted to span(φ(S)), is a Hilbert space isomorphism.

A kernel k : S × S → R is called normalized if sup_{x∈S} k(x, x) = 1. Also,

Theorem 5.3 Let k : S × S → R be a kernel and let {f_n}_{n=1}^∞ be an orthonormal basis of H_k. Then k(x, y) = Σ_{n=1}^∞ f_n(x) f_n(y).

5.1.2 Unitary Representations of Compact Groups

Proofs of the results stated here can be found in (Folland, 1994), Chapter 5. Let G be a compact group. A unitary representation (or just a representation) of G is a group homomorphism ρ : G → U(H), where U(H) is the class of unitary operators over a Hilbert space H, such that for every v ∈ H the mapping a ↦ ρ(a)v is continuous.

We say that a closed subspace M ⊂ H is invariant (to ρ) if for every a ∈ G and v ∈ M, ρ(a)v ∈ M. We note that if M is invariant then so is M^⊥. We denote by ρ|_M : G → U(M) the restriction of ρ to M; that is, ∀a ∈ G, ρ|_M(a) = ρ(a)|_M. We say that ρ : G → U(H) is reducible if H = M ⊕ M^⊥ where M and M^⊥ are both non-zero closed invariant subspaces of H. A basic result is that every representation of a compact group is a sum of irreducible representations.

Theorem 5.4 Let ρ : G → U(H) be a representation of a compact group G. Then H = ⊕_{n∈I} H_n, where every H_n is invariant to ρ and ρ|_{H_n} is irreducible.

We shall also use the following lemma.

Lemma 5.5 Let G be a compact group, V a finite dimensional vector space, and let ρ : G → GL(V) be a continuous homomorphism of groups (here, GL(V) is the group of invertible linear operators over V). Then:
1. There exists an inner product on V making ρ a unitary representation.
2. Moreover, if V has no non-trivial invariant subspaces (here a subspace U ⊂ V is called invariant if ∀a ∈ G, f ∈ U, ρ(a)f ∈ U), then this inner product is unique up to a scalar multiple.
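For a finite group, the standard proof of Lemma 5.5(1) averages the pulled-back inner products over the group. The following sketch illustrates this with a hypothetical non-orthogonal representation of Z₂ on R² (an example chosen for illustration, not a construction from the paper):

```python
import numpy as np

# rho maps the nontrivial element of Z_2 to an invertible but non-orthogonal
# matrix M with M @ M = I.  Averaging rho(a)^T rho(a) over the group yields a
# Gram matrix G, and <u, v>' = u^T G v is an inner product making rho unitary.
M = np.array([[1.0, 1.0],
              [0.0, -1.0]])
assert np.allclose(M @ M, np.eye(2))      # M has order 2, so rho is a representation

G = 0.5 * (np.eye(2) + M.T @ M)           # average of rho(a)^T rho(a) over Z_2
# rho preserves the averaged inner product: rho(a)^T G rho(a) = G for all a.
assert np.allclose(M.T @ G @ M, G)
# G is a genuine inner product (symmetric positive definite).
assert np.all(np.linalg.eigvalsh(G) > 0)
```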


5.1.3 Harmonic Analysis on the Sphere

All the results stated here can be found in (Atkinson and Han, 2012), Chapters 1 and 2. Denote by O(d) the group of unitary operators over R^d and by dA the uniform probability measure over O(d) (that is, dA is the unique probability measure satisfying ∫_{O(d)} f(A) dA = ∫_{O(d)} f(AB) dA = ∫_{O(d)} f(BA) dA for every B ∈ O(d) and every integrable function f : O(d) → C). Denote by dx = dx_{d−1} the Lebesgue (area) measure over S^{d−1} and let L²(S^{d−1}) := L²(S^{d−1}, dx). Given a measurable set Z ⊆ S^{d−1}, we sometimes denote its Lebesgue measure by |Z|. Also, denote by dm = dx / |S^{d−1}| the Lebesgue measure, normalized to be a probability measure.

For every n ∈ N₀, we denote by Y_n^d the linear space of harmonic (i.e., satisfying ∆p = 0) homogeneous polynomials of degree n in d variables. It holds that

N_{d,n} = dim(Y_n^d) = (d+n−1 choose d−1) − (d+n−3 choose d−1) = (2n + d − 2)(n + d − 3)! / (n! (d − 2)!) .

Denote by P_{d,n} : L²(S^{d−1}) → Y_n^d the orthogonal projection onto Y_n^d.

We denote by ρ : O(d) → U(L²(S^{d−1})) the unitary representation defined by

ρ(A)f = f ◦ A^{−1} .

We say that a closed subspace M ⊂ L²(S^{d−1}) is invariant if it is invariant w.r.t. ρ (that is, ∀f ∈ M, A ∈ O(d), f ◦ A ∈ M). We say that an invariant space M is primitive if ρ|_M is irreducible.

Theorem 5.6
1. L²(S^{d−1}) = ⊕_{n=0}^∞ Y_n^d.
2. The primitive finite dimensional subspaces of L²(S^{d−1}) are exactly {Y_n^d}_{n=0}^∞.

Lemma 5.7 Fix an orthonormal basis Y_{n,j}^d, 1 ≤ j ≤ N_{d,n}, of Y_n^d. For every x ∈ S^{d−1} it holds that

Σ_{j=1}^{N_{d,n}} |Y_{n,j}^d(x)|² = N_{d,n} / |S^{d−1}| .

5.1.4 Legendre and Chebyshev Polynomials

The results stated here can be found in (Atkinson and Han, 2012). Fix d ≥ 2.
The d-dimensional Legendre polynomials are the sequence of polynomials over [−1, 1] defined by the recursion formula

P_{d,n}(x) = ((2n + d − 4)/(n + d − 3)) x P_{d,n−1}(x) − ((n − 1)/(n + d − 3)) P_{d,n−2}(x) ,  P_{d,0} ≡ 1, P_{d,1}(x) = x .  (8)

We shall make use of the following properties of the Legendre polynomials.

Proposition 5.8
1. For every d ≥ 2, the sequence {P_{d,n}} is an orthogonal basis of the Hilbert space L²([−1, 1], (1 − x²)^{(d−3)/2} dx).


2. For every n, d, ‖P_{d,n}‖_∞ = 1.

The Chebyshev polynomials of the first kind are defined as T_n := P_{2,n}. The Chebyshev polynomials of the second kind are the polynomials over [−1, 1] defined by the recursion formula

U_n(x) = 2x U_{n−1}(x) − U_{n−2}(x) ,  U_0 ≡ 1, U_1(x) = 2x .

We shall make use of the following properties of the Chebyshev polynomials.

Proposition 5.9
1. For every n ≥ 1, T_n′ = n U_{n−1}.
2. ‖U_n‖_∞ = n + 1.

Given a measure µ over [−1, 1], the orthogonal polynomials corresponding to µ are the sequence of polynomials obtained upon applying the Gram-Schmidt procedure to 1, x, x², x³, …. We note that 1, √2 T_1, √2 T_2, √2 T_3, … are the orthogonal polynomials corresponding to the probability measure dµ = dx / (π √(1 − x²)).

5.1.5 Bochner Integral and Bochner Spaces

Proofs and elaborations on the material appearing in this section can be found in (Yosida, 1963). Let (X, m, µ) be a measure space and let H be a Hilbert space.
A function f : X → H is (Bochner) measurable if there exists a sequence of functions f_n : X → H such that

• For almost every x ∈ X, f(x) = lim_{n→∞} f_n(x).
• The range of every f_n is countable and, for every v ∈ H, f_n^{−1}(v) is measurable.

A measurable function f : X → H is (Bochner) integrable if there exists a sequence of simple measurable functions (in the usual sense) s_n such that lim_{n→∞} ∫_X ‖f(x) − s_n(x)‖_H dµ(x) = 0. We define the integral of f to be ∫_X f dµ = lim_{n→∞} ∫ s_n dµ, where the integral of a simple function s = Σ_{i=1}^n 1_{A_i} v_i, A_i ∈ m, v_i ∈ H, is ∫_X s dµ = Σ_{i=1}^n µ(A_i) v_i.

Denote by L²(X, H) the Kolmogorov quotient (by equality almost everywhere) of the space of all measurable functions f : X → H such that ∫_X ‖f‖²_H dµ < ∞.

Theorem 5.10 L²(X, H) is a Hilbert space w.r.t. the inner product 〈f, g〉_{L²(X,H)} = ∫_X 〈f(x), g(x)〉_H dµ(x).

5.2 Learnability implies small radius

The purpose of this section is to show that if X is a subset of some Hilbert space H such that it is possible to learn affine functionals over X w.r.t. some loss, then we can essentially assume that X is contained in a unit ball and the returned affine functional is of norm O(m³), where m is the number of examples.


Lemma 5.11 (John's Lemma; Matousek (2002)) Let V be an m-dimensional real vector space and let K ⊂ V be a full-dimensional compact convex set. There exists an inner product on V so that K is contained in a unit ball and contains a ball of radius 1/m, both centered at the same x ∈ K. Moreover, if K is 0-symmetric, it is possible to take x = 0, and the ratio between the radii can be improved to √m.

Lemma 5.12 Let l be a convex surrogate, let V be an m-dimensional vector space, and let X ⊂ V be a bounded subset that spans V as an affine space. There exists an inner product 〈·, ·〉 on V and a probability measure µ_N such that

• For every w ∈ V, b ∈ R, ‖w‖ ≤ 4m² Err_{µ_N,hinge}(Λ_{w,b}).
• X is contained in a unit ball.

Proof Let us apply John's Lemma to K = conv(X). It yields an inner product on V with K contained in a unit ball and containing a ball of radius 1/m, both centered at the same x ∈ V. It remains to prove the existence of the measure µ_N. W.l.o.g., we assume that x = 0. Let e_1, …, e_m ∈ V be an orthonormal basis. For every i ∈ [m], represent both (1/m) e_i and −(1/m) e_i as a convex combination of m + 1 elements from X:

(1/m) e_i = Σ_{j=1}^{m+1} λ_i^j x_i^j ,  −(1/m) e_i = Σ_{j=1}^{m+1} ρ_i^j z_i^j .

Now, define

µ_N(x_i^j, 1) = µ_N(x_i^j, −1) = λ_i^j / (4m) ,  µ_N(z_i^j, 1) = µ_N(z_i^j, −1) = ρ_i^j / (4m) .


Finally, let w ∈ V, b ∈ R. We have

Err_{µ_N,hinge}(Λ_{w,b})
 = Σ_{i=1}^m Σ_{j=1}^{m+1} (λ_i^j/(4m)) [ l_hinge(Λ_{w,b}(x_i^j)) + l_hinge(−Λ_{w,b}(x_i^j)) ] + (ρ_i^j/(4m)) [ l_hinge(Λ_{w,b}(z_i^j)) + l_hinge(−Λ_{w,b}(z_i^j)) ]
 ≥ (1/(4m)) Σ_{i=1}^m [ l_hinge( Σ_{j=1}^{m+1} λ_i^j Λ_{w,b}(x_i^j) ) + l_hinge( −Σ_{j=1}^{m+1} λ_i^j Λ_{w,b}(x_i^j) ) + l_hinge( Σ_{j=1}^{m+1} ρ_i^j Λ_{w,b}(z_i^j) ) + l_hinge( −Σ_{j=1}^{m+1} ρ_i^j Λ_{w,b}(z_i^j) ) ]
 = (1/(4m)) Σ_{i=1}^m [ l_hinge(〈w, e_i〉/m + b) + l_hinge(−〈w, e_i〉/m − b) + l_hinge(−〈w, e_i〉/m + b) + l_hinge(〈w, e_i〉/m − b) ]
 ≥ (1/(4m)) Σ_{i=1}^m l_hinge(−|〈w, e_i〉|/m)
 ≥ (1/(4m²)) Σ_{i=1}^m |〈w, e_i〉|
 ≥ (1/(4m²)) ‖w‖ .  ✷

Let X be a bounded subset of some Hilbert space H and let C > 0. Denote

F_H(X, C) = {Λ_{w,b}|_X : ‖w‖_H ≤ C, b ∈ R} .

Denote by M the collection of all affine subspaces of H that are spanned by points from X. Let t = t_H(X, C) be the maximal number such that for every affine subspace M ∈ M of dimension less than t there is x ∈ X such that d(x, M) > t/C.

Lemma 5.13 Let X be a bounded subset of some Hilbert space H and let C > 0. There is A ⊂ X with |A| = t_H(X, C) such that [−1, 1]^A ⊂ F_H(X, C)|_A.

Proof Denote t = t_H(X, C). Let x_0, …, x_t ∈ X be points such that the (t-dimensional) volume of the parallelogram Q defined by the vectors

x_1 − x_0, …, x_t − x_0

is maximal (if the supremum is not attained, the argument can be carried out with a parallelogram whose volume is sufficiently close to the supremum). Let A = {x_1, …, x_t}. We claim that for every 1 ≤ i ≤ t, the distance of x_i from the affine span M_i of A ∪ {x_0} \ {x_i} is ≥ t/C. Indeed, the volume of Q is the (t − 1 dimensional) volume of the parallelogram Q ∩ (M_i − x_0) times d(x_i, M_i). By the maximality of x_0, …, x_t and the definition of t, d(x_i, M_i) ≥ t/C.


For 1 ≤ i ≤ t Let v i = x i − P Mi x i . Note that ‖v i ‖ H = d(x i , M i ) ≥ t Now, given aCfunction f : A → [−1, 1], we will show that f ∈ F H (X , C). Consider the affine functional〈t∑t∑〉f(x i )f(x i )v it∑〈 〉f(xi )v iΛ(x) = 〈v‖vi=1 i ‖ 2 i , x − P Mi (x)〉 H =, x −, PH‖vi=1 i ‖ 2 H‖vH i=1 i ‖ 2 Mi (x) .HHNote that since v i is perpendicular to M i , b := − ∑ 〈 〉t f(xi )v ii=1, P Mi (x) does not dependon x. Let w := ∑ ti=1f(x i )v i‖v i. We have‖ 2 H‖v i ‖ 2 HH‖w‖ H ≤t∑ 1|f(x i )|‖v i ‖ ≤ tC t = C.i=1<strong>The</strong>refore, Λ| X ∈ F H (X , C). Finally, for every 1 ≤ j ≤ t we haveΛ(x j ) =t∑i=1f(x i )〈v‖v i ‖ 2 i , v j − P i (v j )〉 = f(x i ).HHere, the last inequality follows form that fact that for i ≠ j, since v i is perpendicular toM i , 〈v i , v j − P i (v j )〉 = 0. <strong>The</strong>refore, f = Λ| A .Let l : R → R be a surrogate loss function. We say that an algorithm (ɛ, δ)-learns F ⊂ R X<strong>using</strong> m examples w.r.t. l if:• Its input is a sample <strong>of</strong> m points in X × {±1} and its output is a hypothesis in F.• For every distribution D on X × {±1}, it returns, with probability 1 − δ, ˆf ∈ F withErr D,l ( ˆf) ≤ inf f∈F Err D,l (f) + ɛLemma 5.14 Suppose that an algorithm A (ɛ, δ)-learns F <strong>using</strong> m examples w.r.t. a surrogateloss l. <strong>The</strong>n, for every pair <strong>of</strong> distributions D and D ′ on X × {±1}, if ˆf ∈ F is thehypothesis returned by A running on D, then Err D ′ ,l(f) ≤ m(l(0) + ɛ) w.p. ≥ 1 − 2eδ.Pro<strong>of</strong> Suppose toward a contradiction that w.p. ≥ 2eδ we have Err D ′ ,l(f) > a for a >m(l(0) + ɛ). Consider the following distribution, ˜D: w.p.1we sample from m D′ and withprobability 1 − 1 we sample from D. Suppose now that we run the algorithm A on ˜D.mConditioning on the event that all the samples are from D, we have, w.p. ≥ 2eδ,Err D ′ ,l( ˆf) > a and therefore, Err ˜D,l( ˆf) > a . 
The probability that indeed all the $m$ samples are from $D$ is $(1-\frac{1}{m})^m > \frac{1}{2e}$. Hence, with probability $> \delta$, we have $\mathrm{Err}_{\tilde{D},l}(\hat{f}) > \frac{a}{m}$. On the other hand, with probability $\ge 1-\delta$ we have $\mathrm{Err}_{\tilde{D},l}(\hat{f}) \le \inf_{f\in F}\mathrm{Err}_{\tilde{D},l}(f) + \epsilon \le \mathrm{Err}_{\tilde{D},l}(0) + \epsilon = l(0) + \epsilon$. Hence, with positive probability,
$$\frac{a}{m} \le l(0) + \epsilon.$$
It follows that $a \le m(l(0)+\epsilon)$, a contradiction. ✷
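The proof's final step uses the elementary bound $(1-\frac{1}{m})^m > \frac{1}{2e}$ for every $m \ge 2$; the sequence increases toward its limit $\frac{1}{e}$. A quick numeric sanity check, as a sketch (the loop range is an arbitrary choice):

```python
import math

# The proof of Lemma 5.14 conditions on the event that all m samples of the
# mixture come from D; this event has probability (1 - 1/m)^m, which the
# argument lower-bounds by 1/(2e).
def all_from_D_probability(m: int) -> float:
    return (1.0 - 1.0 / m) ** m

for m in range(2, 10_001):
    assert all_from_D_probability(m) > 1.0 / (2.0 * math.e)

# The sequence increases toward its limit 1/e.
assert all_from_D_probability(2) < all_from_D_probability(1000) < 1.0 / math.e
print("bound (1 - 1/m)^m > 1/(2e) verified for 2 <= m <= 10000")
```

For $m = 1$ the probability is $0$, which is why the argument implicitly assumes $m \ge 2$.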


Lemma 5.15 For every surrogate loss $l$ with $\partial_+ l(0) < 0$ there is a constant $c > 0$ such that the following holds. Let $\mathcal{X}$ be a bounded subset of a Hilbert space $H$. If $F_H(\mathcal{X}, C)$ is $(\epsilon,\delta)$-learnable using $m$ examples w.r.t. $l$, then there is an inner product $\langle\cdot,\cdot\rangle_s$ on $H$ such that:
• $\mathcal{X}$ is contained in a unit ball w.r.t. $\|\cdot\|_s$.
• If $\mathcal{A}$ $(\epsilon,\delta)$-learns $F_H(\mathcal{X}, C)$, then for every distribution $D$ on $\mathcal{X}\times\{\pm 1\}$, the hypothesis $\Lambda_{w,b}$ returned by $\mathcal{A}$ satisfies $\|\Lambda_{w,b}\|_s \le c\cdot m^3$ w.p. $\ge 1 - 2e\delta$.
• The norm $\|\cdot\|_s$ is equivalent⁴ to $\|\cdot\|_H$.

Remark 5.16 If $\mathcal{X}$ is the image of some mapping $\psi : Z \to H$, then it follows from the lemma that there is a normalized kernel $k$ on $Z$ such that the hypothesis returned by the learning algorithm (interpreted as a function from $Z$ to $\mathbb{R}$) is of the form $f + b$ with $\|f\|_k \le c\cdot m^3$. Also, if $\psi$ is continuous/absolutely continuous, then so is $k$.

Proof Let $t = t_H(\mathcal{X}, C)$. By Lemma 5.13 there is some $A \subset \mathcal{X}$ such that $[-1,1]^A \subset F_H(\mathcal{X}, C)|_A$. Since $F_H(\mathcal{X}, C)$ is $(\epsilon,\delta)$-learnable using $m$ examples, it is not hard to see that we must have $t \le c'\cdot m$ for some $c' > 0$ that depends only on $l$. Therefore, there exists an affine subspace $M \subset H$ of dimension $d \le c'm$ such that for every $x \in \mathcal{X}$, $d(x, M) < \frac{d}{C} \le \frac{c'm}{C}$. Moreover, we can assume that $M$ is spanned by some subset of $\mathcal{X}$. Denote by $\tilde{M}$ the linear space corresponding to $M$ (i.e., $\tilde{M}$ is the translation of $M$ that contains $0$).
By Lemma 5.12, there is an inner product $\langle\cdot,\cdot\rangle_1$ on $\tilde{M}$ and a probability measure $\mu_N$ on $P_{\tilde{M}}\mathcal{X}$ such that:
• For every $w \in H$, $b \in \mathbb{R}$ we have $\|\Lambda_{P_{\tilde{M}}w,b}\|_1 \le 4d^2\,\mathrm{Err}_{\mu_N,\mathrm{hinge}}(\Lambda_{w,b})$.
• For all $x \in \mathcal{X}$, $\|P_{\tilde{M}}x\|_1 \le 1$.
Finally, define
$$\langle x, y\rangle_s = \frac{1}{2}\langle P_{\tilde{M}}(x), P_{\tilde{M}}(y)\rangle_1 + \frac{C^2}{2d^2}\langle P_{\tilde{M}^\perp}(x), P_{\tilde{M}^\perp}(y)\rangle_H.$$
The first and last assertions of the lemma are easy to verify, so we proceed to the second. Let $\Lambda_{w,b}$ be the hypothesis returned by $\mathcal{A}$ after running on $m$ examples sampled from some distribution $D$. Let $D_N$ be a probability measure on $\mathcal{X}$ whose projection on $\tilde{M}$ is $\mu_N$. By Lemma 5.14 we have, with probability $\ge 1 - 2e\delta$, $\mathrm{Err}_{D_N,l}(\Lambda_{w,b}) \le (l(0)+1)m$. We claim

⁴ Two norms $\|\cdot\|$ and $\|\cdot\|'$ on a vector space $X$ are equivalent if for some $c_1, c_2 > 0$, $\forall x \in X$, $c_1\|x\| \le \|x\|' \le c_2\|x\|$.


that in this case $\mathrm{Err}_{\mu_N,\mathrm{hinge}}(\Lambda_{w,b}) \le \big(\frac{l(0)+1}{|\partial_+ l(0)|} + 2\big)m$. Indeed,
$$\mathrm{Err}_{\mu_N,\mathrm{hinge}}(\Lambda_{w,b}) = \mathbb{E}_{(x,y)\sim\mu_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(x)) = \mathbb{E}_{(x,y)\sim D_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(P_{\tilde{M}}x))$$
$$= \mathbb{E}_{(x,y)\sim D_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(x - (x - P_{\tilde{M}}x))) \le \mathbb{E}_{(x,y)\sim D_N}\, l_{\mathrm{hinge}}(y\Lambda_{w,b}(x)) + \|w\|_H\cdot d(x, P_{\tilde{M}}x)$$
$$\le \mathrm{Err}_{D_N,\mathrm{hinge}}(\Lambda_{w,b}) + C\cdot\frac{m}{C} \le \frac{1}{|\partial_+ l(0)|}\mathrm{Err}_{D_N,l}(\Lambda_{w,b}) + 1 + m \le \frac{l(0)+1}{|\partial_+ l(0)|}m + 2m.$$
By the properties of $\mu_N$, it follows that $\|\Lambda_{P_{\tilde{M}}w,b}\|_1 = O(m^3)$ (here, the constant in the big-O notation depends only on $l$). Finally, we have
$$\|\Lambda_{w,b}\|_s^2 = \|\Lambda_{P_{\tilde{M}}w}\|_s^2 + \|\Lambda_{P_{\tilde{M}^\perp}w}\|_s^2 = 2\|\Lambda_{P_{\tilde{M}}w}\|_1^2 + \frac{2m^2}{C^2}\|\Lambda_{P_{\tilde{M}^\perp}w}\|_H^2 \le O(m^6) + 2m^2 = O(m^6). \qquad ✷$$

5.3 Symmetric Kernels and Symmetrization

In this section we consider symmetric kernels. Fix $d \ge 2$ and let $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ be a continuous positive definite kernel. We say that $k$ is symmetric if
$$\forall A \in O(d),\ x, y \in S^{d-1},\quad k(Ax, Ay) = k(x, y).$$
In other words, $k(x,y)$ depends only on $\langle x, y\rangle_{\mathbb{R}^d}$. An RKHS is called symmetric if its reproducing kernel is symmetric. The next theorem characterizes symmetric RKHSs. We note that theorems of the same spirit have already been proved (e.g. (Schoenberg, 1942)).

Theorem 5.17 Let $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ be a normalized, symmetric and continuous kernel. Then,
1. The group $O(d)$ acts on $H_k$. That is, for every $A \in O(d)$ and every $f \in H_k$ it holds that $f\circ A \in H_k$ and $\|f\|_{H_k} = \|f\circ A\|_{H_k}$.
2. The mapping $\rho : O(d) \to U(H_k)$ defined by $\rho(A)f = f\circ A^{-1}$ is a unitary representation.
3. The space $H_k$ consists of continuous functions.


4. The decomposition of $\rho$ into a sum of irreducible representations is $H_k = \bigoplus_{n\in I} Y_n^d$ for some set $I \subset \mathbb{N}_0$. Moreover,
$$\forall f, g \in H_k,\quad \langle f, g\rangle_{H_k} = \sum_{n\in I} a_n^2\langle P_{d,n}f, P_{d,n}g\rangle_{L^2(S^{d-1})},$$
where $\{a_n\}_{n\in I}$ are positive numbers.
5. It holds that $\sum_{n\in I}\frac{N_{d,n}}{|S^{d-1}|}a_n^{-2} = 1$.

Proof Let $f \in H_k$, $A \in O(d)$. To prove part 1, assume first that
$$\forall x \in S^{d-1},\quad f(x) = \sum_{i=1}^{n}\alpha_i k(x, y_i) \qquad (9)$$
for some $y_1, \ldots, y_n \in S^{d-1}$ and $\alpha_1, \ldots, \alpha_n \in \mathbb{C}$. Since $k$ is symmetric, we have
$$f\circ A(x) = \sum_{i=1}^{n}\alpha_i k(Ax, y_i) = \sum_{i=1}^{n}\alpha_i k(A^{-1}Ax, A^{-1}y_i) = \sum_{i=1}^{n}\alpha_i k(x, A^{-1}y_i).$$
Thus, by Theorem 5.1, $f\circ A \in H_k$. Moreover, it holds that
$$\|f\circ A\|_{H_k}^2 = \sum_{1\le i,j\le n}\alpha_i\bar{\alpha}_j k(A^{-1}y_j, A^{-1}y_i) = \sum_{1\le i,j\le n}\alpha_i\bar{\alpha}_j k(y_j, y_i) = \|f\|_{H_k}^2.$$
Thus, part 1 holds for functions $f \in H_k$ of the form (9). For general $f \in H_k$, by Theorem 5.1, there is a sequence $f_n \in H_k$ of functions of the form (9) that converges to $f$ in $H_k$. From what we have shown for functions of the form (9) it follows that $\|f_n - f_m\|_{H_k} = \|f_n\circ A - f_m\circ A\|_{H_k}$; thus $f_n\circ A$ is a Cauchy sequence and hence has a limit $g \in H_k$. By Theorem 5.1, convergence in $H_k$ entails pointwise convergence, thus $g = f\circ A$. Finally,
$$\|f\|_{H_k} = \lim_{n\to\infty}\|f_n\|_{H_k} = \lim_{n\to\infty}\|f_n\circ A\|_{H_k} = \|f\circ A\|_{H_k}.$$
We proceed to part 2. It is not hard to check that $\rho$ is a group homomorphism, so it only remains to validate that for every $f \in H_k$ the mapping $A \mapsto \rho(A)f$ is continuous. Let $\epsilon > 0$ and let $A \in O(d)$. We must show that there exists a neighbourhood $U$ of $A$ such that $\forall B \in U$, $\|f\circ A^{-1} - f\circ B^{-1}\|_{H_k} < \epsilon$. Choose $g(\cdot) = \sum_{i=1}^{n}\alpha_i k(\cdot, y_i)$ such that $\|g - f\|_{H_k} < \frac{\epsilon}{3}$.
By part 1, it holds that
$$\|f\circ A^{-1} - f\circ B^{-1}\|_{H_k} \le \|f\circ A^{-1} - g\circ A^{-1}\|_{H_k} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \|g\circ B^{-1} - f\circ B^{-1}\|_{H_k}$$
$$= \|f - g\|_{H_k} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \|g - f\|_{H_k} < \frac{\epsilon}{3} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \frac{\epsilon}{3}.$$


Thus, it is enough to find a neighbourhood $U$ of $A$ such that $\forall B \in U$, $\|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} < \frac{\epsilon}{3}$.


Proof Part 1 follows readily from the definition. We proceed to part 2. Define $\phi : S^{d-1} \to L^2(O(d), H_k)$ by
$$\phi(x)(A)(\cdot) = k(Ax, \cdot).$$
Note that
$$\langle\phi(x), \phi(y)\rangle_{L^2(O(d),H_k)} = \int_{O(d)}\langle\phi(x)(A), \phi(y)(A)\rangle\,dA = \int_{O(d)} k(Ax, Ay)\,dA = k_s(x, y).$$
Thus, the theorem follows from Theorem 5.1.1. ✷

5.4 Lemma 5.22 and its proof

Lemma 5.19 For every $n > 0$, $d \ge 5$ and $t \in [-1,1]$ it holds that
$$|P_{d,n}(t)| \le \min\left\{\frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\left[\frac{4}{n(1-t^2)}\right]^{\frac{d-2}{2}},\ \left(\frac{n}{n+d-2} + 2|t|\right)^{\frac{n}{2}}\right\}.$$
Moreover, if $\frac{n}{n+d-2} + 2|t| \le 1$ we also have
$$|P_{d,n}(t)| \le \sqrt{\prod_{i=1}^{n}\left(\frac{i}{i+d-2} + 2|t|\right)}.$$
Finally, there exist constants $E > 0$ and $0 < r, s < 1$ such that for every $K > 0$, $d \ge 5$ and $t \in [-\frac{1}{8}, \frac{1}{8}]$ we have
$$\sum_{n=K}^{\infty}|P_{d,n}(t)| \le Er^K + Es^d.$$

Proof In (Atkinson and Han, 2012) it is shown that $|P_{d,n}(t)| \le \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\big[\frac{4}{n(1-t^2)}\big]^{\frac{d-2}{2}}$. We shall prove, by induction on $n$, that
$$|P_{d,n}(t)| \le \sqrt{\prod_{i=1}^{n}\left(\frac{i}{i+d-2} + 2|t|\right)}$$
whenever $\frac{n}{n+d-2} + 2|t| \le 1$. For $n = 0, 1$ it follows from the fact that $P_{d,0} \equiv 1$ and $P_{d,1}(t) = t$. Let $n > 1$. By the induction hypothesis and the recursion formula for the


Legendre polynomials we have
$$|P_{d,n}(t)| \le \frac{2n+d-4}{n+d-3}|t|\,|P_{d,n-1}(t)| + \frac{n-1}{n+d-3}|P_{d,n-2}(t)| \le 2|t|\,|P_{d,n-1}(t)| + \frac{n-1}{n+d-3}|P_{d,n-2}(t)|$$
$$\le 2|t|\sqrt{\prod_{i=1}^{n-1}\Big(\frac{i}{i+d-2}+2|t|\Big)} + \frac{n-1}{n+d-3}\sqrt{\prod_{i=1}^{n-2}\Big(\frac{i}{i+d-2}+2|t|\Big)}$$
$$\le \Big(2|t| + \frac{n-1}{n+d-3}\Big)\sqrt{\prod_{i=1}^{n-2}\Big(\frac{i}{i+d-2}+2|t|\Big)} \le \sqrt{\Big(2|t|+\frac{n-1}{n+d-3}\Big)\Big(2|t|+\frac{n}{n+d-2}\Big)}\sqrt{\prod_{i=1}^{n-2}\Big(\frac{i}{i+d-2}+2|t|\Big)}$$
$$= \sqrt{\prod_{i=1}^{n}\Big(\frac{i}{i+d-2}+2|t|\Big)}.$$
(The third inequality uses the assumption, under which each factor is at most $1$.) Now, for every $K, \bar{K} \ge 0$ such that $\frac{\bar{K}}{\bar{K}+d-2} + 2|t| < 1$, we have
$$\sum_{n=K}^{\infty}|P_{d,n}(t)| \le \sum_{n=K}^{\bar{K}}\Big(\frac{n}{n+d-2}+2|t|\Big)^{\frac{n}{2}} + \sum_{n=\bar{K}+1}^{\infty}\frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{n(1-t^2)}\Big]^{\frac{d-2}{2}}$$
$$\le \sum_{n=K}^{\bar{K}}\Big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\Big)^{\frac{n}{2}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{1-t^2}\Big]^{\frac{d-2}{2}}\sum_{n=\bar{K}+1}^{\infty}n^{-\frac{d-2}{2}}$$
$$\le \frac{\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{K}{2}}}{1-\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{1}{2}}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{1-t^2}\Big]^{\frac{d-2}{2}}\int_{\bar{K}}^{\infty}x^{-\frac{d-2}{2}}\,dx$$
$$= \frac{\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{K}{2}}}{1-\big(\frac{\bar{K}}{\bar{K}+d-2}+2|t|\big)^{\frac{1}{2}}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\Big[\frac{4}{1-t^2}\Big]^{\frac{d-2}{2}}\cdot\frac{2}{d-4}\bar{K}^{-\frac{d-4}{2}}.$$


(We limit ourselves to $d \ge 5$ to guarantee the convergence of $\sum n^{-\frac{d-2}{2}}$.) In particular, if $|t| \le \frac{1}{8}$ and $\bar{K} = d-2$, we have $\frac{\bar{K}}{\bar{K}+d-2}+2|t| \le \frac{3}{4}$, and hence
$$\sum_{n=K}^{\infty}|P_{d,n}(t)| \le \frac{1}{1-(\frac{3}{4})^{\frac{1}{2}}}\Big(\frac{3}{4}\Big)^{\frac{K}{2}} + \frac{2\,\Gamma(\frac{d-1}{2})}{\sqrt{\pi}}\cdot\frac{2}{d-4}\cdot\frac{4.07^{\frac{d-2}{2}}}{(d-2)^{\frac{d-4}{2}}} \le \frac{1}{1-(\frac{3}{4})^{\frac{1}{2}}}\Big(\frac{3}{4}\Big)^{\frac{K}{2}} + 12\Big[\frac{4.07}{2e}\Big]^{\frac{d-2}{2}},$$
where the last step bounds $\Gamma(\frac{d-1}{2})$ via Stirling's formula; this is of the required form $Er^K + Es^d$. ✷

Lemma 5.20 Let $\mu$ be a probability measure on $[-1,1]$ and let $p_0, p_1, \ldots$ be the corresponding orthogonal polynomials. Then, for every $f \in \mathrm{span}\{p_0, \ldots, p_{K-1}\}$ we have
$$\|f\|_2 \le \sqrt{K}\,\|f\|_1\cdot\max_{0\le i\le K-1}\|p_i\|_\infty.$$
Here, all $L_p$ norms are w.r.t. $\mu$.

Proof Write $f = \sum_{i=0}^{K-1}\alpha_i p_i$ and denote $M = \max_{0\le i\le K-1}\|p_i\|_\infty$. We have
$$\|f\|_2^2 \le \|f\|_1\cdot\|f\|_\infty \le \|f\|_1\cdot M\sum_{i=0}^{K-1}|\alpha_i| \le \|f\|_1\cdot M\sqrt{\sum_{i=0}^{K-1}\alpha_i^2}\cdot\sqrt{K} = \|f\|_1\cdot M\cdot\|f\|_2\sqrt{K}. \qquad ✷$$

Lemma 5.21 Let $d \ge 5$ and let $f : [-1,1] \to \mathbb{R}$ be a continuous function whose expansion in the basis of $d$-dimensional Legendre polynomials is
$$f = \sum_{n=0}^{\infty}\alpha_n P_{d,n}.$$
Denote $C = \sup_n|\alpha_n|$. Let $\mu$ be the probability measure on $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$


Then, for every $K \in \mathbb{N}$ and $0 < \gamma < \frac{1}{8}$,
$$|f(\gamma) - f(-\gamma)| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu} + (32\gamma K^{3.5}+2)\cdot C\cdot E\cdot(r^K+s^d).$$
Here, $E$, $r$ and $s$ are the constants from Lemma 5.19.

Proof Let $\bar{f} = \sum_{n=0}^{K-1}\alpha_n P_{d,n}$. We have $\|\bar{f}\|_{1,\mu} \le \|f\|_{1,\mu} + \|\bar{f}-f\|_{\infty,\mu}$. Define $g : [-1,1] \to \mathbb{R}$ by $g(x) = \bar{f}(\frac{x}{8})$ and denote $d\lambda = \frac{dx}{\pi\sqrt{1-x^2}}$. Write
$$g = \sum_{n=0}^{K-1}\beta_n T_n,$$
where the $T_n$ are the Chebyshev polynomials. By Lemma 5.20 it holds that, for every $0 \le n \le K-1$,
$$|\beta_n| \le \sqrt{2}\,\|g\|_{2,\lambda} \le 2\sqrt{K}\,\|g\|_{1,\lambda} = 2\sqrt{K}\,\|\bar{f}\|_{1,\mu}.$$
Now,
$$g' = \sum_{n=1}^{K-1}\beta_n\, n\, U_{n-1},$$
where the $U_n$ are the Chebyshev polynomials of the second kind. Thus,
$$\|g'\|_\infty \le \sum_{n=1}^{K-1}|\beta_n|\cdot n\cdot\|U_{n-1}\|_\infty = \sum_{n=1}^{K-1}|\beta_n|\cdot n^2 \le 2\sqrt{K}\,\|\bar{f}\|_{1,\mu}\cdot K^3.$$
Finally, by Lemma 5.19,
$$|f(\gamma) - f(-\gamma)| \le |g(8\gamma) - g(-8\gamma)| + 2\|f-\bar{f}\|_{\infty,\mu} \le 32\gamma K^{3.5}\cdot\|\bar{f}\|_{1,\mu} + 2\|f-\bar{f}\|_{\infty,\mu}$$
$$\le 32\gamma K^{3.5}\cdot\big(\|f\|_{1,\mu} + \|f-\bar{f}\|_{\infty,\mu}\big) + 2\|f-\bar{f}\|_{\infty,\mu} \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d). \qquad ✷$$

For $e \in S^{d-1}$ we define the group $O(e) := \{A \in O(d) : Ae = e\}$. If $H_k$ is a symmetric RKHS and $e \in S^{d-1}$, we define the symmetrization around $e$ as the operator $P_e : H_k \to H_k$ which is the projection on the subspace $\{f \in H_k : \forall A \in O(e),\ f\circ A = f\}$. It is not hard to see that $(P_e f)(x) = \int_{\{x':\langle x',e\rangle=\langle x,e\rangle\}} f(x')\,dx' = \int_{O(e)} f\circ A(x)\,dA$. Since $P_e f$ is a convex combination of the functions $\{f\circ A\}_{A\in O(e)}$, it follows that if $R : H_k \to \mathbb{R}$ is a convex functional then $R(P_e f) \le \int_{O(e)} R(f\circ A)\,dA$.

Lemma 5.22 (main) There exists a probability measure $\mu$ on $[-1,1]$ with the following properties. For every continuous and normalized kernel $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ and $C > 0$, there exists $e \in S^{d-1}$ such that, for every $f \in H_k$ with $\|f\|_{H_k} \le C$, $K \in \mathbb{N}$ and $0 < \gamma < \frac{1}{8}$,
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} f - \int_{\{x:\langle x,e\rangle=-\gamma\}} f\Big| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d) \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + 10\cdot E\cdot K^{3.5}\cdot C\cdot(r^K+s^d).$$


The integrals are w.r.t. the uniform probability over $\{x : \langle x,e\rangle = \gamma\}$ and $\{x : \langle x,e\rangle = -\gamma\}$, and $E, r, s$ are the constants from Lemma 5.19.

Proof Suppose first that $k$ is symmetric. Let $\mu$ be the distribution over $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
We can assume that $f$ is $O(e)$-invariant. Otherwise, we can replace $f$ with $P_e f$, which does not change the l.h.s. and does not increase the r.h.s. This assumption yields (see (Atkinson and Han, 2012), pages 17–18)
$$f(x) = \sum_{n=0}^{\infty}\alpha_n P_{d,n}(\langle e, x\rangle).$$
The $L^2(S^{d-1})$-norm of the map $x \mapsto P_{d,n}(\langle x,e\rangle)$ is $\sqrt{\frac{|S^{d-1}|}{N_{d,n}}}$ (e.g. (Atkinson and Han, 2012), page 71). Therefore,
$$\|f\|_k^2 = \sum_{n\in I} a_n^2\frac{|S^{d-1}|}{N_{d,n}}\alpha_n^2,$$
where $\{a_n\}_{n\in I}$ are the numbers corresponding to $H_k$ from Theorem 5.17. In particular (since also for $n \notin I$, $\alpha_n = 0$),
$$|\alpha_n|^2 \le \frac{N_{d,n}}{|S^{d-1}|}a_n^{-2}\|f\|_k^2 \le \|f\|_k^2.$$
Write $g(t) = f(te)$, $t \in [-1,1]$. By Lemma 5.21,
$$|g(\gamma) - g(-\gamma)| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d).$$
Finally, $\int_{\{x:\langle x,e\rangle=\gamma\}} f = g(\gamma)$ and $\int_{\{x:\langle x,e\rangle=-\gamma\}} f = g(-\gamma)$ since $f$ is $O(e)$-invariant. The lemma follows.

We proceed to the general case where $k$ is not necessarily symmetric. Assume by way of contradiction that for every $e \in S^{d-1}$ there exists a function $f_e$ with $\|f_e\|_{H_k} \le C$ such that
$$\int_{\{x:\langle x,e\rangle=\gamma\}} f_e - \int_{\{x:\langle x,e\rangle=-\gamma\}} f_e > 32\gamma K^{3.5}\cdot\|f_e\|_{1,\mu_e} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d). \qquad (10)$$
For convenience we normalize so that the l.h.s. equals $1$. Fix a vector $e_0 \in S^{d-1}$. Define $\Phi \in L^2(O(d), H_k)$ by
$$\Phi(A) = f_{Ae_0}$$
and let $f \in H_{k_s}$ be the function
$$f(x) = \int_{O(d)}\Phi(A)(Ax)\,dA = \int_{O(d)} f_{Ae_0}(Ax)\,dA.$$


Now, it holds that
$$\int_{\{x:\langle x,e_0\rangle=\gamma\}} f - \int_{\{x:\langle x,e_0\rangle=-\gamma\}} f = \int_{\{x:\langle x,e_0\rangle=\gamma\}}\int_{O(d)} f_{Ae_0}(Ax)\,dA\,dx - \int_{\{x:\langle x,e_0\rangle=-\gamma\}}\int_{O(d)} f_{Ae_0}(Ax)\,dA\,dx$$
$$= \int_{O(d)}\bigg[\int_{\{x:\langle x,e_0\rangle=\gamma\}} f_{Ae_0}(Ax)\,dx - \int_{\{x:\langle x,e_0\rangle=-\gamma\}} f_{Ae_0}(Ax)\,dx\bigg]dA$$
$$= \int_{O(d)}\bigg[\int_{\{x:\langle x,Ae_0\rangle=\gamma\}} f_{Ae_0}(x)\,dx - \int_{\{x:\langle x,Ae_0\rangle=-\gamma\}} f_{Ae_0}(x)\,dx\bigg]dA = 1.$$
On the other hand,
$$\|f\|_{1,\mu_{e_0}} = \int_{S^{d-1}}\bigg|\int_{O(d)} f_{Ae_0}(Ax)\,dA\bigg|\,d\mu_{e_0}(x) \le \int_{O(d)}\int_{S^{d-1}}|f_{Ae_0}(Ax)|\,d\mu_{e_0}(x)\,dA$$
$$= \int_{O(d)}\int_{S^{d-1}}|f_{Ae_0}(x)|\,d\mu_{Ae_0}(x)\,dA = \int_{O(d)}\|f_{Ae_0}\|_{1,\mu_{Ae_0}}\,dA.$$
Moreover, by Theorem 5.18,
$$\|f\|_{H_{k_s}}^2 \le \|\Phi\|_{L^2(O(d),H_k)}^2 = \int_{O(d)}\|f_{Ae_0}\|_{H_k}^2\,dA \le C^2.$$
Since the lemma is already proved for symmetric kernels, it follows that
$$1 \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_{e_0}} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d) \le \int_{O(d)}\Big[32\gamma K^{3.5}\cdot\|f_{Ae_0}\|_{1,\mu_{Ae_0}} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d)\Big]dA.$$
Thus, for some $A \in O(d)$,
$$1 \le 32\gamma K^{3.5}\cdot\|f_{Ae_0}\|_{1,\mu_{Ae_0}} + (32\gamma K^{3.5}+2)\cdot E\cdot C\cdot(r^K+s^d),$$
contradicting Equation (10). ✷

5.5 Proofs of the main Theorems

We are now ready to prove Theorems 2.6 and 3.2. We only consider distributions that are supported on the unit sphere, and we can therefore assume that the problem is formulated


in terms of the unit sphere and not the unit ball. Also, we reformulate program (5) as follows: given a convex surrogate $l : \mathbb{R} \to \mathbb{R}$, a constant $C > 0$ and a continuous kernel $k : S^\infty\times S^\infty \to \mathbb{R}$ with $\sup_{x\in S^\infty} k(x,x) \le 1$, we want to solve
$$\min\ \mathrm{Err}_{D,l}(f+b)\quad\text{s.t.}\quad f \in H_k,\ b \in \mathbb{R},\ \|f\|_{H_k} \le C. \qquad (11)$$
We can assume that $\partial_+ l(0) < 0$, for otherwise the approximation ratio is $\infty$. To see that, let the distribution $D$ be concentrated on a single point on the sphere and always return the label $1$. Of course, $\mathrm{Err}_\gamma(D) = 0$. However, if $\partial_+ l(0) \ge 0$, it is not hard to see that if $f, b$ is the solution of program (11), then $f(x)+b \le 0$, so that $\mathrm{Err}_{0-1}(f+b) = 1$.

Lemma 5.23 Let $l$ be a surrogate loss, $\mu$ a probability measure on $S^{d-1}$ and $f \in C(S^{d-1})$. Let $\bar\mu$ be the probability measure on $S^{d-1}\times\{\pm 1\}$ which is the product measure of $\mu$ and the uniform distribution on $\{\pm 1\}$. Then
$$\|f\|_{1,\mu} \le \frac{2}{|\partial_+ l(0)|}\,\mathrm{Err}_{\bar\mu,l}(f).$$

Proof By Jensen's inequality, it holds that
$$\mathrm{Err}_{\bar\mu,l}(f) = \mathbb{E}_{(x,y)\sim\bar\mu}\, l(y\cdot f(x)) = \tfrac{1}{2}\mathbb{E}_{x\sim\mu}\big[l(f(x)) + l(-f(x))\big] \ge \tfrac{1}{2}\mathbb{E}_{x\sim\mu}\, l(-|f(x)|) \ge \tfrac{1}{2}\, l\big(-\mathbb{E}_{x\sim\mu}|f(x)|\big).$$
It follows that $l(-\|f\|_{1,\mu}) \le 2\,\mathrm{Err}_{\bar\mu,l}(f)$. By the convexity of $l$, for every $x \in \mathbb{R}$, $l(x) \ge l(0) + x\cdot\partial_+ l(0) \ge x\cdot\partial_+ l(0)$; applying this with $x = -\|f\|_{1,\mu}$ gives $l(-\|f\|_{1,\mu}) \ge \|f\|_{1,\mu}\cdot|\partial_+ l(0)|$. Thus,
$$\|f\|_{1,\mu} \le \frac{2}{|\partial_+ l(0)|}\,\mathrm{Err}_{\bar\mu,l}(f). \qquad ✷$$

5.5.1 Theorems 2.6 and 3.2

We will need Levy's concentration of measure lemma (e.g., (Milman and Schechtman, 2002)). Let $f : X \to Y$ be an absolutely continuous map between metric spaces. We define its modulus of continuity as
$$\forall\epsilon > 0,\quad \omega_f(\epsilon) = \sup\{d(f(x), f(y)) : x, y \in X,\ d(x,y) \le \epsilon\}.$$

Theorem 5.24 (Levy's Lemma) There exists a constant $\eta > 0$ such that for every continuous function $f : S^{d-1} \to \mathbb{R}$,
$$\Pr(|f - \mathbb{E}f| > \omega_f(\epsilon)) \le \exp(-\eta d\epsilon^2).$$
Here, both probability and expectation are w.r.t. the uniform distribution.
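For the hinge loss, $|\partial_+ l(0)| = 1$ and Lemma 5.23 reads $\|f\|_{1,\mu} \le 2\,\mathrm{Err}_{\bar\mu,\mathrm{hinge}}(f)$: with uniform random labels, each point contributes $\frac{1}{2}(l_{\mathrm{hinge}}(f(x)) + l_{\mathrm{hinge}}(-f(x))) \ge \frac{1}{2}|f(x)|$. A numeric sketch of this special case (the sample values standing in for $\mu$ are arbitrary):

```python
import random

def hinge(t: float) -> float:
    """l_hinge(t) = max(0, 1 - t); note |d+ hinge(0)| = 1."""
    return max(0.0, 1.0 - t)

random.seed(0)
# f evaluated on a finite sample standing in for the measure mu.
values = [random.uniform(-3.0, 3.0) for _ in range(1000)]

# Err_{mu-bar, hinge}(f): labels uniform on {+1, -1}, independent of x,
# so each point contributes the average of hinge(f(x)) and hinge(-f(x)).
err = sum(0.5 * (hinge(v) + hinge(-v)) for v in values) / len(values)
l1_norm = sum(abs(v) for v in values) / len(values)

# Lemma 5.23 for the hinge loss: ||f||_1 <= 2 * Err.
assert l1_norm <= 2.0 * err
print(f"||f||_1 = {l1_norm:.3f} <= 2 * Err = {2 * err:.3f}")
```

The inequality holds pointwise here, since $\frac{1}{2}(\mathrm{hinge}(t)+\mathrm{hinge}(-t))$ equals $1$ for $|t| \le 1$ and $\frac{1}{2}(1+|t|)$ otherwise, both of which dominate $\frac{1}{2}|t|$.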


We note that $\omega_{f\circ g} \le \omega_f\circ\omega_g$ and that $\omega_{\Lambda_v}(\epsilon) = \|v\|\cdot\epsilon$. Thus, if $\psi : S^\infty \to H_1$ is an absolutely continuous embedding such that $k(x,y) = \langle\psi(x),\psi(y)\rangle_{H_1}$, then for every $v \in H_1$ it holds that $\omega_{\Lambda_{v,0}\circ\psi} \le \|v\|_{H_1}\cdot\omega_\psi$. Suppose now that $f \in H_k$ with $\|f\|_{H_k} \le C$. Let $v \in H_1$ be such that $f = \Lambda_{v,0}\circ\psi$ and $\|v\|_{H_1} = \|f\|_{H_k} \le C$. It follows from Levy's Lemma that
$$\Pr(|f - \mathbb{E}f| > C\cdot\omega_\psi(\epsilon)) \le \Pr(|f - \mathbb{E}f| > \omega_f(\epsilon)) \le \exp(-\eta d\epsilon^2), \qquad (12)$$
where, again, both probability and expectation are w.r.t. the uniform distribution over $S^{d-1}$.

Proof (of Theorems 2.6 and 3.2) Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. Define probability measures $\mu^1, \mu^2, \mu$ over $[-1,1]\times\{\pm 1\}$ as follows:
$$\mu^1((-\gamma,-1)) = 1-\theta,\quad \mu^1((\gamma,1)) = \theta.$$
The measure $\mu^2$ is the product of the uniform distribution on $\{\pm 1\}$ and the measure on $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
Finally, $\mu = (1-\lambda)\mu^1 + \lambda\mu^2$ for $\lambda > 0$, which will be chosen later.

By Lemma 5.15 (see Remark 5.16), there is a continuous normalized kernel $k'$ such that w.p. $\ge 1 - 2e\exp(-\frac{1}{\gamma})$ the function returned by the algorithm is of the form $f + b$ with $\|f\|_{H_{k'}} \le c\cdot m_{\mathcal{A}}(\gamma)^3$ for some $c > 0$ (depending only on $l$). Let $e \in S^{d-1}$ be the vector from Lemma 5.22, corresponding to the kernel $k'$. The distribution $D$ is the pullback of $\mu$ w.r.t. $e$. By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_\gamma(D) \le \lambda$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (11) with $C = C_{\mathcal{A}}(\gamma)$ and with an additive error $\le \sqrt{\gamma}$. Since the value of the zero solution for program (11) is $l(0)$, it follows that
$$l(0) + \sqrt{\gamma} \ge \mathrm{Err}_{\mu,l}(g) = (1-\lambda)\,\mathrm{Err}_{\mu^1_e,l}(g) + \lambda\,\mathrm{Err}_{\mu^2_e,l}(g).$$
Thus, $\mathrm{Err}_{\mu^2_e,l}(g) \le \frac{l(0)+\sqrt{\gamma}}{\lambda} \le \frac{2l(0)}{\lambda}$.
Combining Lemma 5.23, Lemma 5.15, and Lemma 5.22, it follows that w.p. $\ge 1 - (1+2e)\exp(-\frac{1}{\gamma}) \ge 1 - 10\exp(-\frac{1}{\gamma})$, for $m = m_{\mathcal{A}}(\gamma)$,
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} g - \int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big| \le \frac{128\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda} + 10\cdot c\cdot K^{3.5}\cdot E\cdot m^3\cdot(r^K+s^d).$$
By choosing $K = \Theta(\log m)$, $\lambda = \Theta(\gamma K^{3.5}) = \Theta(\gamma\log^{3.5} m)$ and $d = \Theta(\log m)$, we can make the last bound $\le \frac{\alpha}{2}$. We claim that $\int_{\{x:\langle x,e\rangle=-\gamma\}} g > \frac{\alpha}{2}$. To see that, note that otherwise $\int_{\{x:\langle x,e\rangle=\gamma\}} g \le \alpha$; thus,
$$\mathbb{E}_{(x,y)\sim D}\, l((f(x)+b)y) = \mathbb{E}_{(x,y)\sim D}\, l(g(x)y) \ge \theta(1-\lambda)\int_{\{x:\langle x,e\rangle=\gamma\}} l(g(x))\,dx$$
$$\ge \theta(1-\lambda)\, l\Big(\int_{\{x:\langle x,e\rangle=\gamma\}} g(x)\,dx\Big) \ge \theta\cdot l(\alpha)\cdot(1-\lambda) = \theta\cdot l(\alpha) + o(1).$$


This contradicts the optimality of $f, b$, since for $f' = 0$, $b' = \beta$ it holds that
$$\mathbb{E}_{(x,y)\sim D}\, l((f'(x)+b')y) \le \lambda l(-\beta) + (1-\lambda)\big((1-\theta)l(-\beta) + \theta l(\beta)\big) = (1-\theta)l(-\beta) + \theta l(\beta) + o(1).$$
We can conclude now the proof of Theorem 2.6. By choosing $d$ large enough and using Equation (12), we can guarantee that $g|_{\{x:\langle x,e\rangle=-\gamma\}}$ is very concentrated around its expectation. In particular, if $(x,y)$ is sampled according to $D$, then w.p. $> 0.5\cdot(1-\theta)\cdot(1-\lambda) = \Omega(1)$ it holds that $yg(x) < 0$. Thus, $\mathrm{Err}_{D,0-1}(g) = \Omega(1)$, while $\mathrm{Err}_\gamma(D) \le \lambda = O(\gamma\,\mathrm{poly}(\log m))$.

To conclude the proof of Theorem 3.2, we note that we can assume that $g$ is $O(e)$-invariant. Otherwise, we can replace it with $P_e f + b$. This does not increase $\|f\|_{H_k}$ nor $\mathrm{Err}_{D,l}(f+b)$; thus the solution $P_e f + b$ is optimal as well. Now it follows that $g|_{\{x:\langle x,e\rangle=-\gamma\}}$ is constant and we finish as before. ✷

5.5.2 Theorem 3.1

Let $L$ be the Lipschitz constant of $l$. Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. First, define probability measures $\mu^1, \mu^2, \mu^3$ and $\mu$ over $[-1,1]\times\{\pm 1\}$ as follows:
$$\mu^1((\gamma,1)) = \theta,\quad \mu^1((-\gamma,-1)) = 1-\theta,\quad \mu^2((-\gamma,1)) = 1.$$
The measure $\mu^3$ is the product of the uniform distribution on $\{\pm 1\}$ and the measure over $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
Finally, $\mu = (1-\lambda_2-\lambda_3)\mu^1 + \lambda_2\mu^2 + \lambda_3\mu^3$ with $\lambda_2, \lambda_3 > 0$ to be chosen later.

By Lemma 5.15 (see Remark 5.16), there is a continuous normalized kernel $k'$ such that w.p. $\ge 1 - 2e\exp(-\frac{1}{\gamma})$ the function returned by the algorithm is of the form $f + b$ with $\|f\|_{H_{k'}} \le c\cdot m_{\mathcal{A}}(\gamma)^3$ for some $c > 0$ (depending only on $l$). Now, let $e \in S^{d-1}$ be the vector from Lemma 5.22, corresponding to the kernel $k'$.
The distribution $D$ is the pullback of $\mu$ w.r.t. $e$. By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_\gamma(D) \le \lambda_3 + \lambda_2$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (11) with $C = C_{\mathcal{A}}(\gamma)$ and with an additive error $\le \sqrt{\gamma}$. As in the proof of Theorem 2.6, it holds that, w.p. $\ge 1 - 10\exp(-\frac{1}{\gamma})$, for $m = m_{\mathcal{A}}(\gamma)$,
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} g - \int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big| \le \frac{128\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_3} + 10\cdot c\cdot K^{3.5}\cdot E\cdot m^3\cdot(r^K+s^d). \qquad (13)$$
Denote the last bound by $\epsilon$. It holds that
$$\mathrm{Err}_{D,l}(g) = (1-\lambda_2-\lambda_3)\,\mathbb{E}_{\mu^1_e}\, l(yg(x)) + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)) + \lambda_3\,\mathbb{E}_{\mu^3_e}\, l(yg(x)). \qquad (14)$$


Now, denote $\delta = \int_{\{x:\langle x,e\rangle=-\gamma\}} g$. It holds that
$$\mathbb{E}_{\mu^1_e}\, l(yg(x)) = \theta\int_{\{x:\langle x,e\rangle=\gamma\}} l(g(x)) + (1-\theta)\int_{\{x:\langle x,e\rangle=-\gamma\}} l(-g(x))$$
$$\ge \theta\cdot l\Big(\int_{\{x:\langle x,e\rangle=\gamma\}} g\Big) + (1-\theta)\cdot l\Big(-\int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big) \ge \theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta) - L\epsilon.$$
Thus,
$$\mathrm{Err}_{D,l}(g) \ge (1-\lambda_2-\lambda_3)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) - L\epsilon + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)). \qquad (15)$$
However, by considering the constant solution $\delta$, it follows that
$$\mathrm{Err}_{D,l}(g) \le (1-\lambda_2-\lambda_3)\big(\theta l(\delta) + (1-\theta)l(-\delta)\big) + \lambda_2\cdot l(\delta) + \lambda_3\cdot\tfrac{1}{2}(l(\delta)+l(-\delta)) + \sqrt{\gamma}$$
$$\le (1-\lambda_2-\lambda_3)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) + \lambda_2\cdot l(\delta) + \lambda_3\cdot l(-|\delta|) + \sqrt{\gamma}. \qquad (16)$$
Thus,
$$\mathrm{Err}_{\mu^2_e,l}(g) \le \frac{L\epsilon}{\lambda_2} + l(\delta) + \frac{\lambda_3}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2} = \frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\cdot c\cdot L\cdot K^{3.5}}{\lambda_2}\cdot E\cdot m^3\cdot(r^K+s^d) + l(\delta) + \frac{\lambda_3}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2}.$$
Now, relying on the assumption that $\gamma\cdot\log^8(m) = o(1)$, it is possible to choose $\lambda_2 = \Theta(\sqrt{\gamma}K^4) = \Theta(\sqrt{\gamma}\log^4 m)$, $\lambda_3 = \sqrt{\gamma}$, $K = \Theta(\log(m/\gamma))$, and $d = \Theta(\log(m/\gamma))$ such that the bound in Equation (13), $\frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\cdot c\cdot K^{3.5}}{\lambda_2}\cdot E\cdot m^3\cdot(r^K+s^d)$, $\lambda_2$, $\lambda_3$ and $\frac{\lambda_3}{\lambda_2}$ are all $o(1)$.

Since the bound in Equation (13) is $o(1)$, it follows, as in the proof of Theorem 2.6, that $l(\delta) \le l(\frac{\alpha}{2})$ and consequently $\delta > 0$. It follows that, w.p. $\ge \frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)$ over $\mu^2_e$, $l(g(x)) < l(0)$; and $l(g(x)) < l(0)$ implies $g(x) > 0$. Since the marginal distributions of $\mu^1_e$ and $\mu^2_e$ are the same, it follows that, if $(x,y)$ is chosen according to $D$, then w.p. $\ge \big(\frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)\big)\cdot(1-\lambda_2-\lambda_3)\cdot(1-\theta) = \Omega(1)$, $yg(x) < 0$. Thus, $\mathrm{Err}_{D,0-1}(g) = \Omega(1)$ while $\mathrm{Err}_\gamma(D) \le \lambda_2 + \lambda_3 = O(\sqrt{\gamma}\,\mathrm{poly}(\log m))$. ✷
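The hard distributions above are pullbacks: a margin value $t$ is drawn from a measure on $[-1,1]$ and then a uniform point on the slice $\{x : \langle x,e\rangle = t\}$. A minimal sketch of such a slice sampler, assuming the standard decomposition $x = t\,e + \sqrt{1-t^2}\,u$ with $u$ uniform on the sphere of the orthogonal complement of $e$ (dimension and margin below are illustrative choices):

```python
import numpy as np

def sample_slice(e: np.ndarray, t: float, rng: np.random.Generator) -> np.ndarray:
    """Uniform point on {x in S^{d-1} : <x, e> = t}: write x = t*e + sqrt(1-t^2)*u
    with u uniform on the unit sphere of the subspace orthogonal to e."""
    d = e.shape[0]
    u = rng.standard_normal(d)
    u -= np.dot(u, e) * e            # project onto the orthogonal complement of e
    u /= np.linalg.norm(u)           # normalize; u is now uniform on that sphere
    return t * e + np.sqrt(1.0 - t * t) * u

rng = np.random.default_rng(0)
d, gamma = 20, 0.05
e = np.zeros(d); e[0] = 1.0

x = sample_slice(e, gamma, rng)
assert abs(np.dot(x, e) - gamma) < 1e-12      # lands on the prescribed slice
assert abs(np.linalg.norm(x) - 1.0) < 1e-12   # stays on the unit sphere
```

Drawing $t$ from $w$ (the arcsine-type density supported on $[-\frac{1}{8},\frac{1}{8}]$) or from the atoms at $\pm\gamma$ then reproduces the measures $\mu^1_e, \mu^2_e, \mu^3_e$ used in the proofs.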


5.5.3 The integrality gap – Theorem 3.3

Our first step is a reduction to the hinge loss. Let $a = \partial_+ l(0)$. Define
$$l^*(x) = \begin{cases} ax + 1 & x \le \frac{1}{-a} \\ 0 & \text{otherwise} \end{cases}$$
It is not hard to see that $l^*$ is a convex surrogate satisfying $\forall x,\ l^*(x) \le l(x)$ and $\partial_+ l^*(0) = \partial_+ l(0)$. Thus, if we substitute $l^*$ for $l$, we only decrease the integrality gap, hence we can assume that $l = l^*$. Now, we note that if we consider program (11) with $l = l^*$, its integrality gap coincides with what we get by replacing $C$ with $|a|\cdot C$ and $l^*$ with the hinge loss. To see that, note that for every $f \in H_k$, $b \in \mathbb{R}$, $\mathrm{Err}_{D,l^*}(f+b) = \mathrm{Err}_{D,\mathrm{hinge}}(|a|\cdot f + |a|\cdot b)$; thus, minimizing $\mathrm{Err}_{D,l^*}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le C$ is equivalent to minimizing $\mathrm{Err}_{D,\mathrm{hinge}}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le |a|\cdot C$. Thus, it is enough to prove the theorem for $l = l_{\mathrm{hinge}}$.

Next, we show that we can assume that the embedding is symmetric (i.e., corresponds to a symmetric kernel). As the integrality gap is at least as large as the approximation ratio, using Theorem 3.2 this will complete our argument. (The reduction to the hinge loss yields bounds with universal constants in the asymptotic terms.)

Let $\gamma > 0$ and let $D$ be a distribution on $S^{d-1}\times\{\pm 1\}$. It is enough to find a (possibly different) distribution $D_1$ with the same $\gamma$-margin error as $D$, for which the optimum of program (11) (with $l = l_{\mathrm{hinge}}$) is not smaller than the optimum of the program
$$\min\ \mathrm{Err}_{D,\mathrm{hinge}}(f+b)\quad\text{s.t.}\quad f \in H_{k_s},\ b \in \mathbb{R},\ \|f\|_{H_{k_s}} \le C. \qquad (17)$$
Denote the optimal value of program (17) by $\alpha$ and assume, toward a contradiction, that whenever $\mathrm{Err}_\gamma(D_1) = \mathrm{Err}_\gamma(D)$, the optimum of program (11) is strictly less than $\alpha$. For every $A \in O(d)$, let $D_A$ be the distribution of the r.v.
$(Ax, y) \in S^{d-1}\times\{\pm 1\}$, where $(x,y) \sim D$. Since clearly $\mathrm{Err}_\gamma(D_A) = \mathrm{Err}_\gamma(D)$, there exist $f_A \in H_k$ and $b_A \in \mathbb{R}$ such that $\|f_A\|_{H_k} \le C$ and $\mathrm{Err}_{D_A,\mathrm{hinge}}(g_A) < \alpha$, where $g_A := f_A + b_A$. Define $f \in H_{k_s}$ by $f(x) = \int_{O(d)} f_A(Ax)\,dA$, let $b = \int_{O(d)} b_A\,dA$ and $g = f + b$. By Theorem 5.18, $\|f\|_{H_{k_s}} \le C$. Finally, for $l = l_{\mathrm{hinge}}$,
$$\mathrm{Err}_{D,\mathrm{hinge}}(g) = \mathbb{E}_{(x,y)\sim D}\, l(yg(x)) = \mathbb{E}_{(x,y)\sim D}\, l\big(y\,\mathbb{E}_{A\sim O(d)}\, g_A(Ax)\big) \le \mathbb{E}_{(x,y)\sim D}\,\mathbb{E}_{A\sim O(d)}\, l(y g_A(Ax))$$
$$= \mathbb{E}_{A\sim O(d)}\,\mathbb{E}_{(x,y)\sim D}\, l(y g_A(Ax)) = \mathbb{E}_{A\sim O(d)}\,\mathbb{E}_{(x,y)\sim D_A}\, l(y g_A(x)) < \alpha,$$
contrary to the assumption that $\alpha$ is the optimum of program (17). ✷

5.5.4 Finite dimension – Theorems 2.7 and 3.4

Let $V \subseteq C(S^{d-1})$ be the linear space $\{\Lambda_{v,b}\circ\psi : v \in \mathbb{R}^m,\ b \in \mathbb{R}\}$ and denote $\bar{W} = \{\Lambda_{v,b}\circ\psi : v \in W,\ b \in \mathbb{R}\}$. We note that $\dim(V) \le m+1$. Instead of program (4) we consider the


equivalent formulation
$$\min\ \mathrm{Err}_{D,l}(f)\quad\text{s.t.}\quad f \in \bar{W}. \qquad (18)$$
The following lemma is very similar to Lemma 5.12, but with a better dependency on $m$ ($m^{1.5}$ instead of $m^2$).

Lemma 5.25 Let $l$ be a convex surrogate and let $V \subset C(S^{d-1})$ be an $m$-dimensional vector space. There exists a continuous kernel $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ with $\sup_{x\in S^{d-1}} k(x,x) \le 1$ such that $H_k = V$ as a vector space, and there exists a probability measure $\mu_N$ such that
$$\forall f \in V,\quad \|f\|_{H_k} \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\,\mathrm{Err}_{\mu_N,l}(f).$$

Proof Let $\psi : S^{d-1} \to V^*$ be the evaluation operator: it maps each $x \in S^{d-1}$ to the linear functional $f \in V \mapsto f(x)$. We claim that:
1. $\psi$ is continuous,
2. $\mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1})) = V^*$,
3. $V = \{v^{**}\circ\psi : v^{**} \in V^{**}\}$.

Proof of 1: We need to show that $\psi(x_n) \to \psi(x)$ whenever $x_n \to x$. Since $V^*$ is finite dimensional, it suffices to show that $\psi(x_n)(f) \to \psi(x)(f)$ for every $f \in V$, which follows from the continuity of $f$.

Proof of 2: Note that $0 \in U := \mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$, so $U$ is a linear space. Now, define $T : U^* \to V$ via $T(u^*) = u^*\circ\psi$. We claim that $T$ is onto, whence $\dim(U) = \dim(U^*) = \dim(V) = \dim(V^*)$, so that $U = V^*$. Indeed, for $f \in V$, let $u^*_f \in U^*$ be the functional $u^*_f(u) = u(f)$. Now, $T(u^*_f)(x) = u^*_f(\psi(x)) = \psi(x)(f) = f(x)$, thus $T(u^*_f) = f$.

Proof of 3: From $U = V^*$ it follows that $U^* = V^{**}$, so that the mapping $T : V^{**} \to V$ is onto, showing that $V = \{v^{**}\circ\psi : v^{**} \in V^{**}\}$.

Let us apply John's Lemma to $K = \mathrm{conv}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$. It yields an inner product on $V^*$ with $K$ contained in the unit ball and containing the ball around $0$ of radius $\frac{1}{\sqrt{m}}$. Let $k$ be the kernel $k(x,y) = \langle\psi(x),\psi(y)\rangle$. Since $\psi$ is continuous, $k$ is continuous as well. By Theorem 5.1.1 and since $T$ is onto, it follows that, as a vector space, $V = H_k$.
Since $K$ is contained in the unit ball, it follows that $\sup_{x\in S^{d-1}} k(x,x) \le 1$. It remains to prove the existence of the measure $\mu_N$.

Let $e_1, \ldots, e_m \in V^*$ be an orthonormal basis. For every $i \in [m]$, choose $(x_i^1, y_i), \ldots, (x_i^{m+1}, y_i) \in S^{d-1}\times\{\pm 1\}$ and $\lambda_i^1, \ldots, \lambda_i^{m+1} \ge 0$ such that $\sum_{j=1}^{m+1}\lambda_i^j = 1$ and $\frac{1}{\sqrt{m}}e_i = \sum_{j=1}^{m+1}\lambda_i^j y_i\psi(x_i^j)$. Define $\mu_N(x_i^j, 1) = \mu_N(x_i^j, -1) = \frac{\lambda_i^j}{2m}$.

Let $f \in V$. By Theorem 5.1.1 there exists $v \in V^*$ such that $f = \Lambda_{v,0}\circ\psi$ and $\|f\|_{H_k} = \|v\|_{V^*}$. It follows that, for $a = \partial_+ l(0)$,


$$\mathrm{Err}_{\mu_N,l}(f) = \sum_{i=1}^{m}\sum_{j=1}^{m+1}\frac{\lambda_i^j}{2m}\big[l(y_i f(x_i^j)) + l(-y_i f(x_i^j))\big]$$
$$\ge \frac{1}{2m}\sum_{i=1}^{m}\Big[l\Big(\sum_{j=1}^{m+1}\lambda_i^j y_i f(x_i^j)\Big) + l\Big(-\sum_{j=1}^{m+1}\lambda_i^j y_i f(x_i^j)\Big)\Big]$$
$$= \frac{1}{2m}\sum_{i=1}^{m}\Big[l\Big(\sum_{j=1}^{m+1}\lambda_i^j y_i\langle v,\psi(x_i^j)\rangle\Big) + l\Big(-\sum_{j=1}^{m+1}\lambda_i^j y_i\langle v,\psi(x_i^j)\rangle\Big)\Big]$$
$$= \frac{1}{2m}\sum_{i=1}^{m}\Big[l\Big(\Big\langle v,\frac{e_i}{\sqrt{m}}\Big\rangle\Big) + l\Big(-\Big\langle v,\frac{e_i}{\sqrt{m}}\Big\rangle\Big)\Big] \ge \frac{1}{2m}\sum_{i=1}^{m} l\Big(-\frac{|\langle v,e_i\rangle|}{\sqrt{m}}\Big)$$
$$\ge \frac{|a|}{2m^{1.5}}\sum_{i=1}^{m}|\langle v,e_i\rangle| \ge \frac{|a|}{2m^{1.5}}\|v\|_{V^*} = \frac{|a|}{2m^{1.5}}\|f\|_{H_k}. \qquad ✷$$

Proof (of Theorem 2.7) Let $L$ be the Lipschitz constant of $l$. Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. First, define probability measures $\mu^1, \mu^2, \mu^3$ over $[-1,1]\times\{\pm 1\}$ as follows:
$$\mu^1((\gamma,1)) = \theta,\quad \mu^1((-\gamma,-1)) = 1-\theta,\quad \mu^2((-\gamma,1)) = 1.$$
The measure $\mu^3$ is the product of the uniform distribution on $\{\pm 1\}$ and the measure over $[-1,1]$ whose density function is
$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8} \\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8} \end{cases}$$
Let $k, \mu_N$ be the kernel and the measure from Lemma 5.25, and let $e \in S^{d-1}$ be the vector from Lemma 5.22. We define the distribution $D$ corresponding to the measure
$$\mu = (1-\lambda_2-\lambda_3-\lambda_N)\mu^1_e + \lambda_2\mu^2_e + \lambda_3\mu^3_e + \lambda_N\mu_N.$$
By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_\gamma(D) \le \lambda_3 + \lambda_2 + \lambda_N$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (18) with an additive error $\le \sqrt{\gamma}$.


Denote $\|g\|_{H_k} = C$. By Lemma 5.25, it holds that
$$C \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\,\mathrm{Err}_{\mu_N,l}(g) \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\cdot\frac{\mathrm{Err}_{\mu,l}(g)}{\lambda_N} \le \frac{2m^{1.5}\,l(0)}{|\partial_+ l(0)|\lambda_N}.$$
As in the proof of Theorem 2.6, it holds that
$$\Big|\int_{\{x:\langle x,e\rangle=\gamma\}} g - \int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big| \le \frac{128\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_3} + 10\cdot K^{3.5}\cdot E\cdot C\cdot(r^K+s^d). \qquad (19)$$
Denote the last bound by $\epsilon$. It holds that
$$\mathrm{Err}_{D,l}(g) = (1-\lambda_2-\lambda_3-\lambda_N)\,\mathbb{E}_{\mu^1_e}\, l(yg(x)) + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)) + \lambda_3\,\mathbb{E}_{\mu^3_e}\, l(yg(x)) + \lambda_N\,\mathbb{E}_{\mu_N}\, l(yg(x)). \qquad (20)$$
Now, denote $\delta = \int_{\{x:\langle x,e\rangle=-\gamma\}} g$. It holds that
$$\mathbb{E}_{\mu^1_e}\, l(yg(x)) = \theta\int_{\{x:\langle x,e\rangle=\gamma\}} l(g(x)) + (1-\theta)\int_{\{x:\langle x,e\rangle=-\gamma\}} l(-g(x))$$
$$\ge \theta\cdot l\Big(\int_{\{x:\langle x,e\rangle=\gamma\}} g\Big) + (1-\theta)\cdot l\Big(-\int_{\{x:\langle x,e\rangle=-\gamma\}} g\Big) \ge \theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta) - L\epsilon.$$
Thus,
$$\mathrm{Err}_{D,l}(g) \ge (1-\lambda_2-\lambda_3-\lambda_N)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) - L\epsilon + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(yg(x)). \qquad (21)$$
However, by considering the constant solution $\delta$, it follows that
$$\mathrm{Err}_{D,l}(g) \le (1-\lambda_2-\lambda_3-\lambda_N)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) + \lambda_2\cdot l(\delta) + (\lambda_3+\lambda_N)\tfrac{1}{2}(l(\delta)+l(-\delta)) + \sqrt{\gamma}$$
$$\le (1-\lambda_2-\lambda_3-\lambda_N)\big(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)\big) + \lambda_2\cdot l(\delta) + (\lambda_3+\lambda_N)\cdot l(-|\delta|) + \sqrt{\gamma}.$$
Thus,
$$\mathrm{Err}_{\mu^2_e,l}(g) \le \frac{L\epsilon}{\lambda_2} + l(\delta) + \frac{\lambda_3+\lambda_N}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2} \qquad (22)$$
$$\le \frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\,L\,K^{3.5}}{\lambda_2}\cdot E\cdot C\cdot(r^K+s^d) + l(\delta) + \frac{\lambda_3+\lambda_N}{\lambda_2}l(-|\delta|) + \frac{\sqrt{\gamma}}{\lambda_2}.$$
Now, relying on the assumption that $\gamma\cdot\log^8(C) = o(1)$, it is possible to choose $\lambda_2 = \Theta(\sqrt{\gamma}K^4) = \Theta(\sqrt{\gamma}\log^4 C)$, $\lambda_3 = \sqrt{\gamma}$, $K = \Theta(\log(C/\gamma))$, $\lambda_N = \gamma$ and $d = \Theta(\log(C/\gamma))$ such that the bound in Equation (19), $\frac{128\,L\,l(0)\gamma K^{3.5}}{|\partial_+ l(0)|\lambda_2\lambda_3} + \frac{10\,K^{3.5}}{\lambda_2}\cdot E\cdot C\cdot(r^K+s^d)$, $\lambda_2$, $\lambda_3$, $\lambda_N$ and $\frac{\lambda_3+\lambda_N+\sqrt{\gamma}}{\lambda_2}$ are all $o(1)$.


Since the bound in Equation (19) is $o(1)$, it follows, as in the proof of Theorem 2.6, that $l(\delta) \le l\left(\frac{\alpha}{2}\right)$ and consequently, with probability $\ge \frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)$ over $x$ drawn from the marginal of $\mu_2^e$, $l(g(x)) < l(0)$, and hence $g(x) > 0$. Since the marginal distributions of $\mu_1^e$ and $\mu_2^e$ are the same, it follows that, if $(x,y)$ are chosen according to $D$, then with probability $> \left(\frac{l(0)-l(\frac{\alpha}{2})}{l(0)} - o(1)\right)\cdot(1-\lambda_2-\lambda_3-\lambda_N)\cdot(1-\theta) = \Omega(1)$, $yg(x) < 0$. Thus, $\mathrm{Err}_{D,0\text{-}1}(g) = \Omega(1)$ while $\mathrm{Err}_\gamma(D) \le \lambda_2+\lambda_3+\lambda_N = O\left(\sqrt{\gamma}\,\mathrm{poly}(\log(C))\right) = O\left(\sqrt{\gamma}\,\mathrm{poly}(\log(m/\gamma))\right)$. ✷

Proof (of Theorem 3.4) As in the proof of Theorem 3.3, we can assume w.l.o.g. that $l = l_{\mathrm{hinge}}$. Let $k, \mu_N$ be the kernel and the measure from Lemma 5.25. Let $C = 2m^{1.5}/\gamma$. By (the proof of) Theorem 3.3, there exists a probability measure $\bar\mu$ over $S^{d-1}\times\{\pm 1\}$ such that for every $f \in H_k$ with $\|f\|_{H_k} \le C$ it holds that $\mathrm{Err}_{\bar\mu,l}(f) = \Omega(1)$ but $\mathrm{Err}_\gamma(\bar\mu) = O(\gamma\cdot\mathrm{poly}(\log(C)))$. Consider the distribution $\mu = (1-\gamma)\bar\mu + \gamma\mu_N$. It still holds that $\mathrm{Err}_\gamma(\mu) = O(\gamma\cdot\mathrm{poly}(\log(C))) = O(\gamma\cdot\mathrm{poly}(\log(m/\gamma)))$. Let $f$ be an optimal solution of program (18). We have that $1 \ge \mathrm{Err}_{\mu,l}(f) \ge \gamma\cdot\mathrm{Err}_{\mu_N,l}(f)$. By Lemma 5.25 (using $|\partial^+ l_{\mathrm{hinge}}(0)| = 1$), $\|f\|_{H_k} \le 2m^{1.5}\cdot\mathrm{Err}_{\mu_N,l}(f) \le 2m^{1.5}/\gamma = C$. Thus, $\mathrm{Err}_{\mu,l}(f) \ge (1-\gamma)\,\mathrm{Err}_{\bar\mu,l}(f) = \Omega(1)$. ✷

6 Choosing a surrogate according to the margin

The purpose of this section is to demonstrate the subtleties relating to the possibility of choosing a convex surrogate $l$ according to the margin $\gamma$. Let $k : B\times B \to \mathbb{R}$ be the kernel
$$k(x,y) = \frac{1}{1-\frac12\langle x,y\rangle_H}$$
and let $\psi : B \to H_1$ be a corresponding embedding (i.e., $k(x,y) = \langle\psi(x),\psi(y)\rangle_{H_1}$).
In (Shalev-Shwartz et al., 2011) it has been shown that the solution $f, b$ to Program (2), with $C = C(\gamma) = \mathrm{poly}(\exp(1/\gamma\cdot\log(1/\gamma)))$ and the embedding $\psi$, satisfies
$$\mathrm{Err}_{\mathrm{hinge}}(f+b) \le \mathrm{Err}_\gamma(D) + \gamma\ .$$


Consequently, every approximated solution to the Program with an additive error of at most $\gamma$ will have a 0-1 loss bounded by $\mathrm{Err}_\gamma(D) + 2\gamma$.

For every $\gamma$, define a 1-Lipschitz convex surrogate by
$$l_\gamma(x) = \begin{cases} 1-x & x \le 1/C(\gamma)\\ 1-1/C(\gamma) & x \ge 1/C(\gamma)\end{cases}$$

Claim 1 A function $g : B \to \mathbb{R}$ is a solution to Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$, if and only if $C(\gamma)\cdot g$ is a solution to Program (2) with $C = C(\gamma)$ and the embedding $\psi$.

We postpone the proof to the end of the section. We note that Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$ has a complexity of 1, according to our conventions. Moreover, by Claim 1, the optimal solution to it has a 0-1 error of at most $\mathrm{Err}_\gamma(D) + \gamma$. Thus, if $A$ is an algorithm that is only obligated to return an approximated solution to Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$, we cannot lower bound its approximation ratio. In particular, our theorems regarding the approximation ratio are no longer true, as currently stated, if the algorithms are allowed to choose the surrogate according to $\gamma$. One might be tempted to think that by the above construction (i.e., taking $\psi$ as our embedding, choosing $C = 1$ and $l = l_\gamma$, and approximating the program upon a sample of size $\mathrm{poly}(1/\gamma)$), we have actually given a 1-approximation algorithm. The crux of the matter is that algorithms that approximate the program according to a finite sample of size $\mathrm{poly}(1/\gamma)$ are only guaranteed to find a solution with an additive error of $\mathrm{poly}(\gamma)$. For the loss $l_\gamma$, such an additive error is meaningless: since for every function $f$, $\mathrm{Err}_{D,l_\gamma}(f) \ge 1 - 1/C(\gamma)$, the 0 solution has an additive error of $\mathrm{poly}(\gamma)$.
Therefore, we cannot argue that the solution returned by the algorithm will have a small 0-1 error. Indeed, we anticipate that the algorithm we have described will suffer from serious over-fitting.

To summarize, we note that the lower bounds we have proved rely on the fact that the optimal solutions of the programs we considered are very bad. For the algorithm we sketched above, the optimal solution is very good. However, guarantees on approximated solutions obtained from a polynomial sample are meaningless. We conclude that lower bounds for such algorithms will have to involve over-fitting arguments, which are beyond the scope of this paper.

Proof (of Claim 1) Define
$$l^*_\gamma(x) = \begin{cases} 1-C(\gamma)\,x & x \le \frac{1}{C(\gamma)}\\ 0 & x \ge \frac{1}{C(\gamma)}\end{cases}$$
Since $l^*_\gamma(x) = C(\gamma)\cdot\left(l_\gamma(x) - \left(1-\frac{1}{C(\gamma)}\right)\right)$, it follows that the solutions to Program (5) with $l = l^*_\gamma$, $C = 1$ and $\psi$ coincide with the solutions with $l = l_\gamma$, $C = 1$ and $\psi$. Now, we note that, for every function $f : B \to \mathbb{R}$,
$$\mathrm{Err}_{D,l^*_\gamma}(f) = \mathrm{Err}_{D,\mathrm{hinge}}(C(\gamma)\cdot f)$$
Thus, $w, b$ minimizes $\mathrm{Err}_{D,l^*_\gamma}(\Lambda_{w,b}\circ\psi)$ under the restriction that $\|w\| \le 1$ if and only if $C(\gamma)\cdot w, C(\gamma)\cdot b$ minimizes $\mathrm{Err}_{D,\mathrm{hinge}}(\Lambda_{w,b}\circ\psi)$ under the restriction that $\|w\| \le C(\gamma)$. ✷
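The affine relation at the heart of this proof can be checked pointwise: combining the two displayed identities gives $l_\gamma(x) = \frac{1}{C(\gamma)}\,l_{\mathrm{hinge}}(C(\gamma)\,x) + \left(1-\frac{1}{C(\gamma)}\right)$. A minimal numerical sketch (the value $C = 50$ stands in for $C(\gamma)$ and is an illustrative assumption):

```python
# Numerical sketch of the mechanism behind Claim 1: l_gamma is, up to a
# positive scale and an additive constant, the hinge loss of the rescaled
# argument C*x, so the two programs share minimizers. C = 50 is illustrative.

def hinge(t):
    return max(0.0, 1.0 - t)

def l_gamma(x, C):
    """The 1-Lipschitz convex surrogate defined in this section."""
    return 1.0 - x if x <= 1.0 / C else 1.0 - 1.0 / C

C = 50.0  # stands in for C(gamma)

# Check l_gamma(x) = hinge(C*x)/C + (1 - 1/C) on a grid of points.
for i in range(-200, 201):
    x = i / 100.0
    assert abs(l_gamma(x, C) - (hinge(C * x) / C + 1.0 - 1.0 / C)) < 1e-12
```

In particular $l^*_\gamma(x) = \mathrm{hinge}(C(\gamma)\,x)$ pointwise, which is exactly the identity $\mathrm{Err}_{D,l^*_\gamma}(f) = \mathrm{Err}_{D,\mathrm{hinge}}(C(\gamma)\cdot f)$ used above.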


Acknowledgements: Amit Daniely is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. Nati Linial is supported by grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation grant number 590-10.

References

Martin Anthony and Peter Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

K. Atkinson and W. Han. Spherical Harmonics and Approximations on the Unit Sphere: An Introduction, volume 2044. Springer, 2012.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

S. Ben-David, D. Loker, N. Srebro, and K. Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, 2012.

Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. Limitations of learning via embeddings in euclidean half spaces. The Journal of Machine Learning Research, 3:441–461, 2003.

A. Birnbaum and S. Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. In NIPS, 2012.

E. Blais, R. O'Donnell, and K. Wimmer. Polynomial regression under arbitrary product distributions. In COLT, 2008.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. arXiv preprint arXiv:1311.2272, 2013.

V. Feldman, P. Gopalan, S. Khot, and A. K. Ponnuswami. New results for learning noisy parities and halfspaces. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.

G. B. Folland. A course in abstract harmonic analysis. CRC, 1994.

V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proceedings of the 47th Foundations of Computer Science (FOCS), 2006.

A. Kalai, A. R. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Foundations of Computer Science (FOCS), 2005.

A. R. Klivans and R. Servedio. Learning DNF in time $2^{\tilde{O}(n^{1/3})}$. In STOC, pages 258–265. ACM, 2001.


Kosaku Yosida. Functional Analysis. Springer-Verlag, Heidelberg, 1963.

Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. In STOC, pages 455–464, May 1991.

Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. In FOCS, pages 574–579, October 1989.

P. M. Long and R. A. Servedio. Learning large-margin halfspaces with more malicious noise. In NIPS, 2011.

J. Matousek. Lectures on discrete geometry, volume 212. Springer, 2002.

V. D. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces: Isoperimetric Inequalities in Riemannian Manifolds, volume 1200. Springer, 2002.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing (MIT Press, 1988).)

S. Saitoh. Theory of reproducing kernels and its applications. Longman Scientific & Technical, England, 1988.

I. J. Schoenberg. Positive definite functions on spheres. Duke Math. J., 1942.

B. Schölkopf, C. Burges, and A. Smola, editors. Advances in Kernel Methods – Support Vector Learning. MIT Press, 1998.

S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40:1623–1646, 2011.

I. Steinwart and A. Christmann. Support vector machines. Springer, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267–288, 1996.

V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

Manfred K. Warmuth and S. V. N. Vishwanathan. Leaving the span. In Learning Theory, pages 366–381. Springer, 2005.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.
