A Tutorial on Support Vector Machines for Pattern Recognition

and Smola, 1996). In most of these cases, SVM generalization performance (i.e. error rates on test sets) either matches or is significantly better than that of competing methods. The use of SVMs for density estimation (Weston et al., 1997) and ANOVA decomposition (Stitson et al., 1997) has also been studied. Regarding extensions, the basic SVMs contain no prior knowledge of the problem (for example, a large class of SVMs for the image recognition problem would give the same results if the pixels were first permuted randomly (with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped) and much work has been done on incorporating prior knowledge into SVMs (Scholkopf, Burges and Vapnik, 1996; Scholkopf et al., 1998a; Burges, 1998). Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998). Recent work has generalized the basic ideas (Smola, Scholkopf and Muller, 1998a; Smola and Scholkopf, 1998), shown connections to regularization theory (Smola, Scholkopf and Muller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Scholkopf, Smola and Muller, 1998b; Scholkopf et al., 1998c). The reader may also find the thesis of (Scholkopf, 1997) helpful.

The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman and Bienenstock, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the "capacity" of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979).

In the following, bold typeface will indicate vector or matrix quantities; normal typeface will be used for vector and matrix components and for scalars. We will label components of vectors and matrices with Greek indices, and label vectors and matrices themselves with Roman indices.
Familiarity with the use of Lagrange multipliers to solve problems with equality or inequality constraints is assumed².

2. A Bound on the Generalization Performance of a Pattern Recognition Learning Machine

There is a remarkable family of bounds governing the relation between the capacity of a learning machine and its performance³. The theory grew out of considerations of under what circumstances, and how quickly, the mean of some empirical quantity converges uniformly, as the number of data points increases, to the true mean (that which would be calculated from an infinite amount of data) (Vapnik, 1979). Let us start with one of these bounds.

The notation here will largely follow that of (Vapnik, 1995). Suppose we are given $l$ observations. Each observation consists of a pair: a vector $\mathbf{x}_i \in \mathbf{R}^n$, $i = 1, \dots, l$, and the associated "truth" $y_i$, given to us by a trusted source. In the tree recognition problem, $\mathbf{x}_i$ might be a vector of pixel values (e.g. $n = 256$ for a 16x16 image), and $y_i$ would be 1 if the image contains a tree, and -1 otherwise (we use -1 here rather than 0 to simplify subsequent formulae). Now it is assumed that there exists some unknown probability distribution


Figure 2. Four points that cannot be shattered by $\theta(\sin(\alpha x))$, despite infinite VC dimension.

Figure 3. VC confidence is monotonic in $h$ (horizontal axis: $h/l$ = VC Dimension / Sample Size).

2.4. Minimizing The Bound by Minimizing h

Figure 3 shows how the second term on the right hand side of Eq. (3) varies with $h$, given a choice of 95% confidence level ($\eta = 0.05$) and assuming a training sample of size 10,000. The VC confidence is a monotonic increasing function of $h$. This will be true for any value of $l$.

Thus, given some selection of learning machines whose empirical risk is zero, one wants to choose that learning machine whose associated set of functions has minimal VC dimension. This will lead to a better upper bound on the actual error. In general, for non-zero empirical risk, one wants to choose that learning machine which minimizes the right hand side of Eq. (3).

Note that in adopting this strategy, we are only using Eq. (3) as a guide. Eq. (3) gives (with some chosen probability) an upper bound on the actual risk. This does not prevent a particular machine with the same value for empirical risk, and whose function set has higher VC dimension, from having better performance. In fact an example of a system that gives good performance despite having infinite VC dimension is given in the next Section. Note also that the graph shows that for $h/l > 0.37$ (and for $\eta = 0.05$ and $l = 10{,}000$), the VC confidence exceeds unity, and so for higher values the bound is guaranteed not tight.

2.5. Two Examples

Consider the $k$'th nearest neighbour classifier, with $k = 1$. This set of functions has infinite VC dimension and zero empirical risk, since any number of points, labeled arbitrarily, will be successfully learned by the algorithm (provided no two points of opposite class lie right on top of each other). Thus the bound provides no information. In fact, for any classifier with infinite VC dimension, the bound is not even valid⁷. However, even though the bound


is not valid, nearest neighbour classifiers can still perform well. Thus this first example is a cautionary tale: infinite "capacity" does not guarantee poor performance.

Let's follow the time honoured tradition of understanding things by trying to break them, and see if we can come up with a classifier for which the bound is supposed to hold, but which violates the bound. We want the left hand side of Eq. (3) to be as large as possible, and the right hand side to be as small as possible. So we want a family of classifiers which gives the worst possible actual risk of 0.5, zero empirical risk up to some number of training observations, and whose VC dimension is easy to compute and is less than $l$ (so that the bound is non-trivial). An example is the following, which I call the "notebook classifier." This classifier consists of a notebook with enough room to write down the classes of $m$ training observations, where $m \le l$. For all subsequent patterns, the classifier simply says that all patterns have the same class. Suppose also that the data have as many positive ($y = +1$) as negative ($y = -1$) examples, and that the samples are chosen randomly. The notebook classifier will have zero empirical risk for up to $m$ observations; 0.5 training error for all subsequent observations; 0.5 actual error, and VC dimension $h = m$. Substituting these values in Eq. (3), the bound becomes:

$$\frac{m}{4l} \le \ln(2l/m) + 1 - \frac{1}{m}\ln(\eta/4) \qquad (8)$$

which is certainly met for all $\eta$ if

$$f(z) = \frac{z}{2}\,e^{z/4 - 1} \le 1, \qquad z \equiv (m/l),\quad 0 \le z \le 1 \qquad (9)$$

which is true, since $f(z)$ is monotonic increasing, and $f(z = 1) = 0.236$.

2.6. Structural Risk Minimization

We can now summarize the principle of structural risk minimization (SRM) (Vapnik, 1979). Note that the VC confidence term in Eq. (3) depends on the chosen class of functions, whereas the empirical risk and actual risk depend on the one particular function chosen by the training procedure. We would like to find that subset of the chosen set of functions, such that the risk bound for that subset is minimized. Clearly we cannot arrange things so that the VC dimension $h$ varies smoothly, since it is an integer. Instead, introduce a "structure" by dividing the entire class of functions into nested subsets (Figure 4). For each subset, we must be able either to compute $h$, or to get a bound on $h$ itself. SRM then consists of finding that subset of functions which minimizes the bound on the actual risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal.

Figure 4. Nested subsets of functions, ordered by VC dimension ($h_1 < h_2 < h_3 < \dots$).
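As a quick numerical illustration of the quantities discussed above, the short Python sketch below (not code from the paper) tabulates the VC confidence term and evaluates the function $f(z)$ of Eq. (9). It assumes that Eq. (3), which is not reproduced in this excerpt, has the standard form $R \le R_{emp} + \sqrt{\left(h(\ln(2l/h) + 1) - \ln(\eta/4)\right)/l}$; with that assumption it reproduces the numbers quoted in Sections 2.4 and 2.5: the VC confidence passes unity near $h/l = 0.37$ for $\eta = 0.05$, $l = 10{,}000$, and $f(1) = 0.236$.

```python
# Sketch: evaluate the (assumed) VC confidence term and f(z) of Eq. (9).
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """Second term on the right hand side of Eq. (3) (assumed standard form)."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = 10000
for ratio in (0.1, 0.2, 0.3, 0.37, 0.4, 1.0):
    print(f"h/l = {ratio:4.2f}   VC confidence = {vc_confidence(ratio * l, l):.3f}")

# Eq. (9): f(z) = (z/2) exp(z/4 - 1), monotonic increasing on [0, 1]
f = lambda z: (z / 2) * np.exp(z / 4 - 1)
print("f(1) =", round(f(1.0), 3))   # 0.236, so the bound (8) is always met
```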


We have now laid the groundwork necessary to begin our exploration of support vector machines.

3. Linear Support Vector Machines

3.1. The Separable Case

We will start with the simplest case: linear machines trained on separable data (as we shall see, the analysis for the general case - nonlinear machines trained on non-separable data - results in a very similar quadratic programming problem). Again label the training data $\{\mathbf{x}_i, y_i\}$, $i = 1, \dots, l$, $y_i \in \{-1, 1\}$, $\mathbf{x}_i \in \mathbf{R}^d$. Suppose we have some hyperplane which separates the positive from the negative examples (a "separating hyperplane"). The points $\mathbf{x}$ which lie on the hyperplane satisfy $\mathbf{w} \cdot \mathbf{x} + b = 0$, where $\mathbf{w}$ is normal to the hyperplane, $|b|/\|\mathbf{w}\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\mathbf{w}\|$ is the Euclidean norm of $\mathbf{w}$. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the "margin" of a separating hyperplane to be $d_+ + d_-$. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

$$\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 \quad \text{for } y_i = +1 \qquad (10)$$

$$\mathbf{x}_i \cdot \mathbf{w} + b \le -1 \quad \text{for } y_i = -1 \qquad (11)$$

These can be combined into one set of inequalities:

$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \quad \forall i \qquad (12)$$

Now consider the points for which the equality in Eq. (10) holds (requiring that there exists such a point is equivalent to choosing a scale for $\mathbf{w}$ and $b$). These points lie on the hyperplane $H_1$: $\mathbf{x}_i \cdot \mathbf{w} + b = 1$ with normal $\mathbf{w}$ and perpendicular distance from the origin $|1 - b|/\|\mathbf{w}\|$. Similarly, the points for which the equality in Eq. (11) holds lie on the hyperplane $H_2$: $\mathbf{x}_i \cdot \mathbf{w} + b = -1$, with normal again $\mathbf{w}$, and perpendicular distance from the origin $|-1 - b|/\|\mathbf{w}\|$. Hence $d_+ = d_- = 1/\|\mathbf{w}\|$ and the margin is simply $2/\|\mathbf{w}\|$. Note that $H_1$ and $H_2$ are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing $\|\mathbf{w}\|^2$, subject to constraints (12).

Thus we expect the solution for a typical two dimensional case to have the form shown in Figure 5. Those training points for which the equality in Eq. (12) holds (i.e. those which wind up lying on one of the hyperplanes $H_1$, $H_2$), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 5 by the extra circles.

We will now switch to a Lagrangian formulation of the problem. There are two reasons for doing this. The first is that the constraints (12) will be replaced by constraints on the Lagrange multipliers themselves, which will be much easier to handle.
The second is that in this reformulation of the problem, the training data will only appear (in the actual training and test algorithms) in the form of dot products between vectors. This is a crucial property which will allow us to generalize the procedure to the nonlinear case (Section 4).

Thus, we introduce positive Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$, one for each of the inequality constraints (12). Recall that the rule is that for constraints of the form $c_i \ge 0$,


Figure 5. Linear separating hyperplanes for the separable case. The support vectors are circled.

the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function, to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives Lagrangian:

$$L_P \equiv \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l}\alpha_i y_i(\mathbf{x}_i \cdot \mathbf{w} + b) + \sum_{i=1}^{l}\alpha_i \qquad (13)$$

We must now minimize $L_P$ with respect to $\mathbf{w}$, $b$, and simultaneously require that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, all subject to the constraints $\alpha_i \ge 0$ (let's call this particular set of constraints $C_1$). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of $N$ simultaneous linear constraints defines the intersection of $N$ convex sets, which is also a convex set). This means that we can equivalently solve the following "dual" problem: maximize $L_P$, subject to the constraints that the gradient of $L_P$ with respect to $\mathbf{w}$ and $b$ vanish, and subject also to the constraints that the $\alpha_i \ge 0$ (let's call that particular set of constraints $C_2$). This particular dual formulation of the problem is called the Wolfe dual (Fletcher, 1987). It has the property that the maximum of $L_P$, subject to constraints $C_2$, occurs at the same values of the $\mathbf{w}$, $b$ and $\alpha$, as the minimum of $L_P$, subject to constraints $C_1$⁸.

Requiring that the gradient of $L_P$ with respect to $\mathbf{w}$ and $b$ vanish gives the conditions:

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \qquad (14)$$

$$\sum_i \alpha_i y_i = 0. \qquad (15)$$

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (13) to give

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j\,\mathbf{x}_i \cdot \mathbf{x}_j \qquad (16)$$
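To make the dual formulation concrete, here is a minimal numerical sketch (using Python with NumPy and SciPy, which the paper of course does not assume) that maximizes $L_D$ of Eq. (16) for a small linearly separable data set, subject to $\alpha_i \ge 0$ and Eq. (15), and then recovers $\mathbf{w}$ from Eq. (14) and $b$ from the support vectors. It is an illustration of the optimization problem itself, not of the solvers discussed later in Section 5.

```python
# Sketch: solve the Wolfe dual (16) for a toy separable data set.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
l = len(y)

H = (y[:, None] * y[None, :]) * (X @ X.T)        # H_ij = y_i y_j x_i . x_j

def neg_dual(alpha):                             # -L_D, Eq. (16)
    return 0.5 * alpha @ H @ alpha - alpha.sum()

cons = {'type': 'eq', 'fun': lambda a: a @ y}    # Eq. (15)
res = minimize(neg_dual, np.zeros(l), bounds=[(0, None)] * l,
               constraints=cons, method='SLSQP')
alpha = res.x

w = (alpha * y) @ X                              # Eq. (14)
sv = alpha > 1e-6                                # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                   # from y_i (w . x_i + b) = 1

print("alpha =", np.round(alpha, 3))
print("w =", w, " b =", b)
print("margins y_i(w.x_i+b):", np.round(y * (X @ w + b), 3))
```

For this toy data set only three of the six points end up with non-zero $\alpha_i$; these are the support vectors, and the printed margins $y_i(\mathbf{w}\cdot\mathbf{x}_i + b)$ equal 1 for exactly those points.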


Note that we have now given the Lagrangian different labels ($P$ for primal, $D$ for dual) to emphasize that the two formulations are different: $L_P$ and $L_D$ arise from the same objective function but with different constraints; and the solution is found by minimizing $L_P$ or by maximizing $L_D$. Note also that if we formulate the problem with $b = 0$, which amounts to requiring that all hyperplanes contain the origin, the constraint (15) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.

Support vector training (for the separable, linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to constraints (15) and positivity of the $\alpha_i$, with solution given by (14). Notice that there is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, those points for which $\alpha_i > 0$ are called "support vectors", and lie on one of the hyperplanes $H_1$, $H_2$. All other training points have $\alpha_i = 0$ and lie either on $H_1$ or $H_2$ (such that the equality in Eq. (12) holds), or on that side of $H_1$ or $H_2$ such that the strict inequality in Eq. (12) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross $H_1$ or $H_2$), and training was repeated, the same separating hyperplane would be found.

3.2. The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem above, the KKT conditions may be stated (Fletcher, 1987):

$$\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0 \qquad \nu = 1, \dots, d \qquad (17)$$

$$\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0 \qquad (18)$$

$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \qquad i = 1, \dots, l \qquad (19)$$

$$\alpha_i \ge 0 \quad \forall i \qquad (20)$$

$$\alpha_i\left(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1\right) = 0 \quad \forall i \qquad (21)$$

The KKT conditions are satisfied at the solution of any constrained optimization problem (convex or not), with any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions (see Fletcher, 1987; McCormick, 1983). This rather technical regularity assumption holds for all support vector machines, since the constraints are always linear.
Furthermore, the problem for SVMs is convex (a convex objective function, with constraints which give a convex feasible region), and for convex problems (if the regularity condition holds), the KKT conditions are necessary and sufficient for $\mathbf{w}$, $b$, $\alpha$ to be a solution (Fletcher, 1987). Thus solving the SVM problem is equivalent to finding a solution to the KKT conditions. This fact results in several approaches to finding the solution (for example, the primal-dual path following method mentioned in Section 5).

As an immediate application, note that, while $\mathbf{w}$ is explicitly determined by the training procedure, the threshold $b$ is not, although it is implicitly determined. However $b$ is easily found by using the KKT "complementarity" condition, Eq. (21), by choosing any $i$ for


which $\alpha_i \ne 0$ and computing $b$ (note that it is numerically safer to take the mean value of $b$ resulting from all such equations).

Notice that all we've done so far is to cast the problem into an optimization problem where the constraints are rather more manageable than those in Eqs. (10), (11). Finding the solution for real world problems will usually require numerical methods. We will have more to say on this later. However, let's first work out a rare case where the problem is nontrivial (the number of dimensions is arbitrary, and the solution certainly not obvious), but where the solution can be found analytically.

3.3. Optimal Hyperplanes: An Example

While the main aim of this Section is to explore a non-trivial pattern recognition problem where the support vector solution can be found analytically, the results derived here will also be useful in a later proof. For the problem considered, every training point will turn out to be a support vector, which is one reason we can find the solution analytically.

Consider $n + 1$ symmetrically placed points lying on a sphere $S^{n-1}$ of radius $R$: more precisely, the points form the vertices of an $n$-dimensional symmetric simplex. It is convenient to embed the points in $\mathbf{R}^{n+1}$ in such a way that they all lie in the hyperplane which passes through the origin and which is perpendicular to the $(n+1)$-vector $(1, 1, \dots, 1)$ (in this formulation, the points lie on $S^{n-1}$, they span $\mathbf{R}^n$, and are embedded in $\mathbf{R}^{n+1}$). Explicitly, recalling that vectors themselves are labeled by Roman indices and their coordinates by Greek, the coordinates are given by:

$$x_{i\mu} = -(1 - \delta_{i\mu})\sqrt{\frac{R^2}{n(n+1)}} + \delta_{i\mu}\sqrt{\frac{R^2 n}{n+1}} \qquad (22)$$

where the Kronecker delta, $\delta_{i\mu}$, is defined by $\delta_{i\mu} = 1$ if $\mu = i$, 0 otherwise. Thus, for example, the vectors for three equidistant points on the unit circle (see Figure 12) are:

$$\mathbf{x}_1 = \left(\sqrt{\tfrac{2}{3}},\ \tfrac{-1}{\sqrt{6}},\ \tfrac{-1}{\sqrt{6}}\right), \quad
\mathbf{x}_2 = \left(\tfrac{-1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}},\ \tfrac{-1}{\sqrt{6}}\right), \quad
\mathbf{x}_3 = \left(\tfrac{-1}{\sqrt{6}},\ \tfrac{-1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}}\right) \qquad (23)$$

One consequence of the symmetry is that the angle between any pair of vectors is the same (and is equal to $\arccos(-1/n)$):

$$\|\mathbf{x}_i\|^2 = R^2 \qquad (24)$$

$$\mathbf{x}_i \cdot \mathbf{x}_j = -R^2/n \qquad (25)$$

or, more succinctly,

$$\frac{\mathbf{x}_i \cdot \mathbf{x}_j}{R^2} = \delta_{ij} - (1 - \delta_{ij})\frac{1}{n}. \qquad (26)$$

Assigning a class label $C \in \{+1, -1\}$ arbitrarily to each point, we wish to find that hyperplane which separates the two classes with widest margin. Thus we must maximize


$L_D$ in Eq. (16), subject to $\alpha_i \ge 0$ and also subject to the equality constraint, Eq. (15). Our strategy is to simply solve the problem as though there were no inequality constraints. If the resulting solution does in fact satisfy $\alpha_i \ge 0\ \forall i$, then we will have found the general solution, since the actual maximum of $L_D$ will then lie in the feasible region, provided the equality constraint, Eq. (15), is also met. In order to impose the equality constraint we introduce an additional Lagrange multiplier $\lambda$. Thus we seek to maximize

$$L_D \equiv \sum_{i=1}^{n+1}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n+1}\alpha_i H_{ij}\alpha_j - \lambda\sum_{i=1}^{n+1}\alpha_i y_i, \qquad (27)$$

where we have introduced the Hessian

$$H_{ij} \equiv y_i y_j\,\mathbf{x}_i \cdot \mathbf{x}_j. \qquad (28)$$

Setting $\partial L_D/\partial \alpha_i = 0$ gives

$$(H\alpha)_i + \lambda y_i = 1 \quad \forall i \qquad (29)$$

Now $H$ has a very simple structure: the off-diagonal elements are $-y_i y_j R^2/n$, and the diagonal elements are $R^2$. The fact that all the off-diagonal elements differ only by factors of $y_i$ suggests looking for a solution which has the form:

$$\alpha_i = \left(\frac{1 + y_i}{2}\right)a + \left(\frac{1 - y_i}{2}\right)b \qquad (30)$$

where $a$ and $b$ are unknowns. Plugging this form in Eq. (29) gives:

$$\left(\frac{n+1}{n}\right)\left(\frac{a+b}{2}\right) - \frac{y_i p}{n}\left(\frac{a+b}{2}\right) = \frac{1 - \lambda y_i}{R^2} \qquad (31)$$

where $p$ is defined by

$$p \equiv \sum_{i=1}^{n+1} y_i. \qquad (32)$$

Thus

$$a + b = \frac{2n}{R^2(n+1)} \qquad (33)$$

and substituting this into the equality constraint Eq. (15) to find $a$, $b$ gives

$$a = \frac{n}{R^2(n+1)}\left(1 - \frac{p}{n+1}\right), \qquad b = \frac{n}{R^2(n+1)}\left(1 + \frac{p}{n+1}\right) \qquad (34)$$

which gives for the solution

$$\alpha_i = \frac{n}{R^2(n+1)}\left(1 - \frac{y_i p}{n+1}\right) \qquad (35)$$

Also,

$$(H\alpha)_i = 1 - \frac{y_i p}{n+1}. \qquad (36)$$


Hence

$$\|\mathbf{w}\|^2 = \sum_{i,j=1}^{n+1}\alpha_i \alpha_j y_i y_j\,\mathbf{x}_i \cdot \mathbf{x}_j = \alpha^T H \alpha = \sum_{i=1}^{n+1}\alpha_i\left(1 - \frac{y_i p}{n+1}\right) = \sum_{i=1}^{n+1}\alpha_i = \frac{n}{R^2}\left(1 - \left(\frac{p}{n+1}\right)^2\right) \qquad (37)$$

Note that this is one of those cases where the Lagrange multiplier $\lambda$ can remain undetermined (although determining it is trivial). We have now solved the problem, since all the $\alpha_i$ are clearly positive or zero (in fact the $\alpha_i$ will only be zero if all training points have the same class). Note that $\|\mathbf{w}\|$ depends only on the number of positive (negative) polarity points, and not on how the class labels are assigned to the points in Eq. (22). This is clearly not true of $\mathbf{w}$ itself, which is given by

$$\mathbf{w} = \frac{n}{R^2(n+1)}\sum_{i=1}^{n+1}\left(y_i - \frac{p}{n+1}\right)\mathbf{x}_i \qquad (38)$$

The margin, $M = 2/\|\mathbf{w}\|$, is thus given by

$$M = \frac{2R}{\sqrt{n\left(1 - (p/(n+1))^2\right)}}. \qquad (39)$$

Thus when the number of points $n + 1$ is even, the minimum margin occurs when $p = 0$ (equal numbers of positive and negative examples), in which case the margin is $M_{min} = 2R/\sqrt{n}$. If $n + 1$ is odd, the minimum margin occurs when $p = 1$, in which case $M_{min} = 2R(n+1)/(n\sqrt{n+2})$. In both cases, the maximum margin is given by $M_{max} = R(n+1)/n$. Thus, for example, for the two dimensional simplex consisting of three points lying on $S^1$ (and spanning $\mathbf{R}^2$), and with labeling such that not all three points have the same polarity, the maximum and minimum margin are both $3R/2$ (see Figure (12)).

Note that the results of this Section amount to an alternative, constructive proof that the VC dimension of oriented separating hyperplanes in $\mathbf{R}^n$ is at least $n + 1$.

3.4. Test Phase

Once we have trained a Support Vector Machine, how can we use it? We simply determine on which side of the decision boundary (that hyperplane lying half way between $H_1$ and $H_2$ and parallel to them) a given test pattern $\mathbf{x}$ lies and assign the corresponding class label, i.e. we take the class of $\mathbf{x}$ to be $\mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$.

3.5. The Non-Separable Case

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (10) and (11), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables $\xi_i$, $i = 1, \dots, l$ in the constraints (Cortes and Vapnik, 1995), which then become:


$$\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 - \xi_i \quad \text{for } y_i = +1 \qquad (40)$$

$$\mathbf{x}_i \cdot \mathbf{w} + b \le -1 + \xi_i \quad \text{for } y_i = -1 \qquad (41)$$

$$\xi_i \ge 0 \quad \forall i. \qquad (42)$$

Thus, for an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $\|\mathbf{w}\|^2/2$ to $\|\mathbf{w}\|^2/2 + C(\sum_i \xi_i)^k$, where $C$ is a parameter to be chosen by the user, a larger $C$ corresponding to assigning a higher penalty to errors. As it stands, this is a convex programming problem for any positive integer $k$; for $k = 2$ and $k = 1$ it is also a quadratic programming problem, and the choice $k = 1$ has the further advantage that neither the $\xi_i$, nor their Lagrange multipliers, appear in the Wolfe dual problem, which becomes:

Maximize:

$$L_D \equiv \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j\,\mathbf{x}_i \cdot \mathbf{x}_j \qquad (43)$$

subject to:

$$0 \le \alpha_i \le C, \qquad (44)$$

$$\sum_i \alpha_i y_i = 0. \qquad (45)$$

The solution is again given by

$$\mathbf{w} = \sum_{i=1}^{N_S}\alpha_i y_i \mathbf{x}_i, \qquad (46)$$

where $N_S$ is the number of support vectors. Thus the only difference from the optimal hyperplane case is that the $\alpha_i$ now have an upper bound of $C$. The situation is summarized schematically in Figure 6.

We will need the Karush-Kuhn-Tucker conditions for the primal problem. The primal Lagrangian is

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\{y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i\} - \sum_i \mu_i \xi_i \qquad (47)$$

where the $\mu_i$ are the Lagrange multipliers introduced to enforce positivity of the $\xi_i$. The KKT conditions for the primal problem are therefore (note $i$ runs from 1 to the number of training points, and $\nu$ from 1 to the dimension of the data)

$$\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0 \qquad (48)$$

$$\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0 \qquad (49)$$


$$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \qquad (50)$$

$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \ge 0 \qquad (51)$$

$$\xi_i \ge 0 \qquad (52)$$

$$\alpha_i \ge 0 \qquad (53)$$

$$\mu_i \ge 0 \qquad (54)$$

$$\alpha_i\{y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i\} = 0 \qquad (55)$$

$$\mu_i \xi_i = 0 \qquad (56)$$

As before, we can use the KKT complementarity conditions, Eqs. (55) and (56), to determine the threshold $b$. Note that Eq. (50) combined with Eq. (56) shows that $\xi_i = 0$ if $\alpha_i < C$; thus any training point for which $0 < \alpha_i < C$ can be used in Eq. (55) (with $\xi_i = 0$) to compute $b$ (as before, it is numerically wiser to take the mean over all such points).

Figure 6. Linear separating hyperplanes for the non-separable case.

3.6. A Mechanical Analogy

Consider the case in which the data are in $\mathbf{R}^2$. Suppose that the $i$'th support vector exerts a force $\mathbf{F}_i = \alpha_i y_i \hat{\mathbf{w}}$ on a stiff sheet lying along the decision surface (the "decision sheet"), where $\hat{\mathbf{w}}$ denotes the unit vector in the direction of $\mathbf{w}$. Then the solution (46) satisfies the conditions of mechanical equilibrium:

$$\sum \text{Forces} = \sum_i \alpha_i y_i \hat{\mathbf{w}} = 0 \qquad (57)$$

$$\sum \text{Torques} = \sum_i \mathbf{s}_i \wedge (\alpha_i y_i \hat{\mathbf{w}}) = \mathbf{w} \wedge \hat{\mathbf{w}} = 0 \qquad (58)$$

(Here the $\mathbf{s}_i$ are the support vectors, and $\wedge$ denotes the vector product.) This analogy holds for the general case


(i.e., also for the nonlinear case described below). The analogy emphasizes the interesting point that the "most important" data points are the support vectors with highest values of $\alpha$, since they exert the highest forces on the decision sheet. For the non-separable case, the upper bound $\alpha_i \le C$ corresponds to an upper bound on the force any given point is allowed to exert on the sheet. This analogy also provides a reason (as good as any other) to call these particular vectors "support vectors"¹⁰.

3.7. Examples by Pictures

Figure 7 shows two examples of a two-class pattern recognition problem, one separable and one not. The two classes are denoted by circles and disks respectively. Support vectors are identified with an extra circle. The error in the non-separable case is identified with a cross. The reader is invited to use Lucent's SVM Applet (Burges, Knirsch and Haratsch, 1996) to experiment and create pictures like these (if possible, try using 16 or 24 bit color).

Figure 7. The linear case, separable (left) and not (right). The background colour shows the shape of the decision surface.

4. Nonlinear Support Vector Machines

How can the above methods be generalized to the case where the decision function¹¹ is not a linear function of the data? (Boser, Guyon and Vapnik, 1992) showed that a rather old trick (Aizerman, 1964) can be used to accomplish this in an astonishingly straightforward way. First notice that the only way in which the data appears in the training problem, Eqs. (43) - (45), is in the form of dot products, $\mathbf{x}_i \cdot \mathbf{x}_j$. Now suppose we first mapped the data to some other (possibly infinite dimensional) Euclidean space $\mathcal{H}$, using a mapping which we will call $\Phi$:

$$\Phi: \mathbf{R}^d \mapsto \mathcal{H}. \qquad (59)$$

Then of course the training algorithm would only depend on the data through dot products in $\mathcal{H}$, i.e. on functions of the form $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$. Now if there were a "kernel function" $K$ such that $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, we would only need to use $K$ in the training algorithm, and would never need to explicitly even know what $\Phi$ is. One example is


$$K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2/2\sigma^2}. \qquad (60)$$

In this particular example, $\mathcal{H}$ is infinite dimensional, so it would not be very easy to work with $\Phi$ explicitly. However, if one replaces $\mathbf{x}_i \cdot \mathbf{x}_j$ by $K(\mathbf{x}_i, \mathbf{x}_j)$ everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space.

But how can we use this machine? After all, we need $\mathbf{w}$, and that will live in $\mathcal{H}$ also (see Eq. (46)). But in test phase an SVM is used by computing dot products of a given test point $\mathbf{x}$ with $\mathbf{w}$, or more specifically by computing the sign of

$$f(\mathbf{x}) = \sum_{i=1}^{N_S}\alpha_i y_i \Phi(\mathbf{s}_i) \cdot \Phi(\mathbf{x}) + b = \sum_{i=1}^{N_S}\alpha_i y_i K(\mathbf{s}_i, \mathbf{x}) + b \qquad (61)$$

where the $\mathbf{s}_i$ are the support vectors. So again we can avoid computing $\Phi(\mathbf{x})$ explicitly and use the $K(\mathbf{s}_i, \mathbf{x}) = \Phi(\mathbf{s}_i) \cdot \Phi(\mathbf{x})$ instead.

Let us call the space in which the data live, $\mathcal{L}$. (Here and below we use $\mathcal{L}$ as a mnemonic for "low dimensional", and $\mathcal{H}$ for "high dimensional": it is usually the case that the range of $\Phi$ is of much higher dimension than its domain). Note that, in addition to the fact that $\mathbf{w}$ lives in $\mathcal{H}$, there will in general be no vector in $\mathcal{L}$ which maps, via the map $\Phi$, to $\mathbf{w}$. If there were, $f(\mathbf{x})$ in Eq. (61) could be computed in one step, avoiding the sum (and making the corresponding SVM $N_S$ times faster, where $N_S$ is the number of support vectors). Despite this, ideas along these lines can be used to significantly speed up the test phase of SVMs (Burges, 1996). Note also that it is easy to find kernels (for example, kernels which are functions of the dot products of the $\mathbf{x}_i$ in $\mathcal{L}$) such that the training algorithm and solution found are independent of the dimension of both $\mathcal{L}$ and $\mathcal{H}$.

In the next Section we will discuss which functions $K$ are allowable and which are not. Let us end this Section with a very simple example of an allowed kernel, for which we can construct the mapping $\Phi$.

Suppose that your data are vectors in $\mathbf{R}^2$, and you choose $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^2$. Then it's easy to find a space $\mathcal{H}$, and mapping $\Phi$ from $\mathbf{R}^2$ to $\mathcal{H}$, such that $(\mathbf{x} \cdot \mathbf{y})^2 = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$: we choose $\mathcal{H} = \mathbf{R}^3$ and

$$\Phi(\mathbf{x}) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{pmatrix} \qquad (62)$$

(note that here the subscripts refer to vector components). For data in $\mathcal{L}$ defined on the square $[-1, 1] \times [-1, 1] \in \mathbf{R}^2$ (a typical situation, for grey level image data), the (entire) image of $\Phi$ is shown in Figure 8. This Figure also illustrates how to think of this mapping: the image of $\Phi$ may live in a space of very high dimension, but it is just a (possibly very contorted) surface whose intrinsic dimension¹² is just that of $\mathcal{L}$.

Note that neither the mapping $\Phi$ nor the space $\mathcal{H}$ are unique for a given kernel. We could equally well have chosen $\mathcal{H}$ to again be $\mathbf{R}^3$ and

$$\Phi(\mathbf{x}) = \frac{1}{\sqrt{2}}\begin{pmatrix} x_1^2 - x_2^2 \\ 2 x_1 x_2 \\ x_1^2 + x_2^2 \end{pmatrix} \qquad (63)$$


Figure 8. Image, in $\mathcal{H}$, of the square $[-1, 1] \times [-1, 1] \in \mathbf{R}^2$ under the mapping $\Phi$.

or $\mathcal{H}$ to be $\mathbf{R}^4$ and

$$\Phi(\mathbf{x}) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix}. \qquad (64)$$

The literature on SVMs usually refers to the space $\mathcal{H}$ as a Hilbert space, so let's end this Section with a few notes on this point. You can think of a Hilbert space as a generalization of Euclidean space that behaves in a gentlemanly fashion. Specifically, it is any linear space, with an inner product defined, which is also complete with respect to the corresponding norm (that is, any Cauchy sequence of points converges to a point in the space). Some authors (e.g. (Kolmogorov, 1970)) also require that it be separable (that is, it must have a countable subset whose closure is the space itself), and some (e.g. Halmos, 1967) don't. It's a generalization mainly because its inner product can be any inner product, not just the scalar ("dot") product used here (and in Euclidean spaces in general). It's interesting that the older mathematical literature (e.g. Kolmogorov, 1970) also required that Hilbert spaces be infinite dimensional, and that mathematicians are quite happy defining infinite dimensional Euclidean spaces. Research on Hilbert spaces centers on operators in those spaces, since the basic properties have long since been worked out. Since some people understandably blanch at the mention of Hilbert spaces, I decided to use the term Euclidean throughout this tutorial.

4.1. Mercer's Condition

For which kernels does there exist a pair $\{\mathcal{H}, \Phi\}$, with the properties described above, and for which does there not? The answer is given by Mercer's condition (Vapnik, 1995; Courant and Hilbert, 1953): There exists a mapping $\Phi$ and an expansion

$$K(\mathbf{x}, \mathbf{y}) = \sum_i \Phi(\mathbf{x})_i \Phi(\mathbf{y})_i \qquad (65)$$


if and only if, for any $g(\mathbf{x})$ such that

$$\int g(\mathbf{x})^2\,d\mathbf{x} \ \text{is finite} \qquad (66)$$

then

$$\int K(\mathbf{x}, \mathbf{y})g(\mathbf{x})g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \ge 0. \qquad (67)$$

Note that for specific cases, it may not be easy to check whether Mercer's condition is satisfied. Eq. (67) must hold for every $g$ with finite $L_2$ norm (i.e. which satisfies Eq. (66)). However, we can easily prove that the condition is satisfied for positive integral powers of the dot product: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^p$. We must show that

$$\int \left(\sum_{i=1}^{d} x_i y_i\right)^p g(\mathbf{x})g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \ge 0. \qquad (68)$$

The typical term in the multinomial expansion of $(\sum_{i=1}^{d} x_i y_i)^p$ contributes a term of the form

$$\frac{p!}{r_1!\,r_2!\cdots(p - r_1 - r_2 - \cdots)!}\int x_1^{r_1} x_2^{r_2}\cdots y_1^{r_1} y_2^{r_2}\cdots g(\mathbf{x})g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \qquad (69)$$

to the left hand side of Eq. (67), which factorizes:

$$= \frac{p!}{r_1!\,r_2!\cdots(p - r_1 - r_2 - \cdots)!}\left(\int x_1^{r_1} x_2^{r_2}\cdots g(\mathbf{x})\,d\mathbf{x}\right)^2 \ge 0. \qquad (70)$$

One simple consequence is that any kernel which can be expressed as $K(\mathbf{x}, \mathbf{y}) = \sum_{p=0}^{\infty} c_p(\mathbf{x} \cdot \mathbf{y})^p$, where the $c_p$ are positive real coefficients and the series is uniformly convergent, satisfies Mercer's condition, a fact also noted in (Smola, Scholkopf and Muller, 1998b).

Finally, what happens if one uses a kernel which does not satisfy Mercer's condition? In general, there may exist data such that the Hessian is indefinite, and for which the quadratic programming problem will have no solution (the dual objective function can become arbitrarily large). However, even for kernels that do not satisfy Mercer's condition, one might still find that a given training set results in a positive semidefinite Hessian, in which case the training will converge perfectly well. In this case, however, the geometrical interpretation described above is lacking.

4.2. Some Notes on $\Phi$ and $\mathcal{H}$

Mercer's condition tells us whether or not a prospective kernel is actually a dot product in some space, but it does not tell us how to construct $\Phi$ or even what $\mathcal{H}$ is. However, as with the homogeneous (that is, homogeneous in the dot product in $\mathcal{L}$) quadratic polynomial kernel discussed above, we can explicitly construct the mapping for some kernels. In Section 6.1 we show how Eq. (62) can be extended to arbitrary homogeneous polynomial kernels, and that the corresponding space $\mathcal{H}$ is a Euclidean space of dimension $\binom{d+p-1}{p}$. Thus for example, for a degree $p = 4$ polynomial, and for data consisting of 16 by 16 images ($d = 256$), $\dim(\mathcal{H})$ is 183,181,376.

Usually, mapping your data to a "feature space" with an enormous number of dimensions would bode ill for the generalization performance of the resulting machine. After all, the


set of all hyperplanes $\{\mathbf{w}, b\}$ are parameterized by $\dim(\mathcal{H}) + 1$ numbers. Most pattern recognition systems with billions, or even an infinite, number of parameters would not make it past the start gate. How come SVMs do so well? One might argue that, given the form of solution, there are at most $l + 1$ adjustable parameters (where $l$ is the number of training samples), but this seems to be begging the question¹³. It must be something to do with our requirement of maximum margin hyperplanes that is saving the day. As we shall see below, a strong case can be made for this claim.

Since the mapped surface is of intrinsic dimension $\dim(\mathcal{L})$, unless $\dim(\mathcal{L}) = \dim(\mathcal{H})$, it is obvious that the mapping cannot be onto (surjective). It also need not be one to one (bijective): consider $x_1 \to -x_1$, $x_2 \to -x_2$ in Eq. (62). The image of $\Phi$ need not itself be a vector space: again, considering the above simple quadratic example, the vector $-\Phi(\mathbf{x})$ is not in the image of $\Phi$ unless $\mathbf{x} = 0$. Further, a little playing with the inhomogeneous kernel

$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2 \qquad (71)$$

will convince you that the corresponding $\Phi$ can map two vectors that are linearly dependent in $\mathcal{L}$ onto two vectors that are linearly independent in $\mathcal{H}$.

So far we have considered cases where $\Phi$ is done implicitly. One can equally well turn things around and start with $\Phi$, and then construct the corresponding kernel. For example (Vapnik, 1996), if $\mathcal{L} = \mathbf{R}^1$, then a Fourier expansion in the data $x$, cut off after $N$ terms, has the form

$$f(x) = \frac{a_0}{2} + \sum_{r=1}^{N}(a_{1r}\cos(rx) + a_{2r}\sin(rx)) \qquad (72)$$

and this can be viewed as a dot product between two vectors in $\mathbf{R}^{2N+1}$: $\mathbf{a} = (\frac{a_0}{\sqrt{2}}, a_{11}, \dots, a_{21}, \dots)$, and the mapped $\Phi(x) = (\frac{1}{\sqrt{2}}, \cos(x), \cos(2x), \dots, \sin(x), \sin(2x), \dots)$. Then the corresponding (Dirichlet) kernel can be computed in closed form:

$$\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j) = \frac{\sin((N + 1/2)(x_i - x_j))}{2\sin((x_i - x_j)/2)}. \qquad (73)$$

This is easily seen as follows: letting $\delta \equiv x_i - x_j$,

$$\Phi(x_i) \cdot \Phi(x_j) = \frac{1}{2} + \sum_{r=1}^{N}\left[\cos(r x_i)\cos(r x_j) + \sin(r x_i)\sin(r x_j)\right]
= -\frac{1}{2} + \sum_{r=0}^{N}\cos(r\delta) = -\frac{1}{2} + \mathrm{Re}\left\{\sum_{r=0}^{N} e^{ir\delta}\right\}$$
$$= -\frac{1}{2} + \mathrm{Re}\left\{\frac{1 - e^{i(N+1)\delta}}{1 - e^{i\delta}}\right\} = \frac{\sin((N + 1/2)\delta)}{2\sin(\delta/2)}.$$

Finally, it is clear that the above implicit mapping trick will work for any algorithm in which the data only appear as dot products (for example, the nearest neighbor algorithm). This fact has been used to derive a nonlinear version of principal component analysis by (Scholkopf, Smola and Muller, 1998b); it seems likely that this trick will continue to find uses elsewhere.
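Since the closed form (73) follows from this short calculation, it is also easy to check numerically. The sketch below (Python/NumPy; an illustration added here, not part of the original text) builds the truncated Fourier map $\Phi$ explicitly and compares $\Phi(x_i) \cdot \Phi(x_j)$ with the Dirichlet kernel for a few random pairs of points.

```python
# Sketch: numerical check of the Dirichlet kernel identity, Eq. (73).
import numpy as np

def phi(x, N):
    """Explicit map Phi(x) = (1/sqrt(2), cos x, ..., cos Nx, sin x, ..., sin Nx)."""
    r = np.arange(1, N + 1)
    return np.concatenate(([1 / np.sqrt(2)], np.cos(r * x), np.sin(r * x)))

def dirichlet_kernel(xi, xj, N):
    """Closed form of Eq. (73)."""
    delta = xi - xj
    return np.sin((N + 0.5) * delta) / (2 * np.sin(delta / 2))

N = 7
rng = np.random.default_rng(0)
for _ in range(3):
    xi, xj = rng.uniform(-np.pi, np.pi, size=2)
    lhs = phi(xi, N) @ phi(xj, N)
    rhs = dirichlet_kernel(xi, xj, N)
    print(f"Phi(xi).Phi(xj) = {lhs:.6f}   closed form = {rhs:.6f}")
```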


4.3. Some Examples of Nonlinear SVMs

The first kernels investigated for the pattern recognition problem were the following:

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p \qquad (74)$$

$$K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2/2\sigma^2} \qquad (75)$$

$$K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\,\mathbf{x} \cdot \mathbf{y} - \delta) \qquad (76)$$

Eq. (74) results in a classifier that is a polynomial of degree $p$ in the data; Eq. (75) gives a Gaussian radial basis function classifier, and Eq. (76) gives a particular kind of two-layer sigmoidal neural network. For the RBF case, the number of centers ($N_S$ in Eq. (61)), the centers themselves (the $\mathbf{s}_i$), the weights ($\alpha_i$), and the threshold ($b$) are all produced automatically by the SVM training and give excellent results compared to classical RBFs, for the case of Gaussian RBFs (Scholkopf et al., 1997). For the neural network case, the first layer consists of $N_S$ sets of weights, each set consisting of $d_L$ (the dimension of the data) weights, and the second layer consists of $N_S$ weights (the $\alpha_i$), so that an evaluation simply requires taking a weighted sum of sigmoids, themselves evaluated on dot products of the test data with the support vectors. Thus for the neural network case, the architecture (number of weights) is determined by SVM training.

Note, however, that the hyperbolic tangent kernel only satisfies Mercer's condition for certain values of the parameters $\kappa$ and $\delta$ (and of the data $\|\mathbf{x}\|^2$). This was first noticed experimentally (Vapnik, 1995); however some necessary conditions on these parameters for positivity are now known¹⁴.

Figure 9 shows results for the same pattern recognition problem as that shown in Figure 7, but where the kernel was chosen to be a cubic polynomial. Notice that, even though the number of degrees of freedom is higher, for the linearly separable case (left panel), the solution is roughly linear, indicating that the capacity is being controlled; and that the linearly non-separable case (right panel) has become separable.

Figure 9. Degree 3 polynomial kernel. The background colour shows the shape of the decision surface.

Finally, note that although the SVM classifiers described above are binary classifiers, they are easily combined to handle the multiclass case. A simple, effective combination trains


$N$ one-versus-rest classifiers (say, "one" positive, "rest" negative) for the $N$-class case and takes the class for a test point to be that corresponding to the largest positive distance (Boser, Guyon and Vapnik, 1992).

4.4. Global Solutions and Uniqueness

When is the solution to the support vector training problem global, and when is it unique? By "global", we mean that there exists no other point in the feasible region at which the objective function takes a lower value. We will address two kinds of ways in which uniqueness may not hold: solutions for which $\{\mathbf{w}, b\}$ are themselves unique, but for which the expansion of $\mathbf{w}$ in Eq. (46) is not; and solutions whose $\{\mathbf{w}, b\}$ differ. Both are of interest: even if the pair $\{\mathbf{w}, b\}$ is unique, if the $\alpha_i$ are not, there may be equivalent expansions of $\mathbf{w}$ which require fewer support vectors (a trivial example of this is given below), and which therefore require fewer instructions during test phase.

It turns out that every local solution is also global. This is a property of any convex programming problem (Fletcher, 1987). Furthermore, the solution is guaranteed to be unique if the objective function (Eq. (43)) is strictly convex, which in our case means that the Hessian must be positive definite (note that for quadratic objective functions $F$, the Hessian is positive definite if and only if $F$ is strictly convex; this is not true for non-quadratic $F$: there, a positive definite Hessian implies a strictly convex objective function, but not vice versa (consider $F = x^4$) (Fletcher, 1987)). However, even if the Hessian is positive semidefinite, the solution can still be unique: consider two points along the real line with coordinates $x_1 = 1$ and $x_2 = 2$, and with polarities $+$ and $-$. Here the Hessian is positive semidefinite, but the solution ($w = -2$, $b = 3$, $\xi_i = 0$ in Eqs. (40), (41), (42)) is unique. It is also easy to find solutions which are not unique in the sense that the $\alpha_i$ in the expansion of $\mathbf{w}$ are not unique: for example, consider the problem of four separable points on a square in $\mathbf{R}^2$: $\mathbf{x}_1 = [1, 1]$, $\mathbf{x}_2 = [-1, 1]$, $\mathbf{x}_3 = [-1, -1]$ and $\mathbf{x}_4 = [1, -1]$, with polarities $[+, -, -, +]$ respectively. One solution is $\mathbf{w} = [1, 0]$, $b = 0$, $\alpha = [0.25, 0.25, 0.25, 0.25]$; another has the same $\mathbf{w}$ and $b$, but $\alpha = [0.5, 0.5, 0, 0]$ (note that both solutions satisfy the constraints $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$). When can this occur in general? Given some solution $\alpha$, choose an $\alpha'$ which is in the null space of the Hessian $H_{ij} = y_i y_j\,\mathbf{x}_i \cdot \mathbf{x}_j$, and require that $\alpha'$ be orthogonal to the vector all of whose components are 1. Then adding $\alpha'$ to $\alpha$ in Eq. (43) will leave $L_D$ unchanged. If $0 \le \alpha_i + \alpha'_i \le C$ and $\alpha + \alpha'$ satisfies Eq. (45), then $\alpha + \alpha'$ is also a solution¹⁵.

How about solutions where the $\{\mathbf{w}, b\}$ are themselves not unique?
(We emphasize that this can only happen in principle if the Hessian is not positive definite, and even then, the solutions are necessarily global). The following very simple theorem shows that if non-unique solutions occur, then the solution at one optimal point is continuously deformable into the solution at the other optimal point, in such a way that all intermediate points are also solutions.

Theorem 2. Let the variable $X$ stand for the pair of variables $\{\mathbf{w}, b\}$. Let the Hessian for the problem be positive semidefinite, so that the objective function is convex. Let $X_0$ and $X_1$ be two points at which the objective function attains its minimal value. Then there exists a path $X = X(\tau) = (1 - \tau)X_0 + \tau X_1$, $\tau \in [0, 1]$, such that $X(\tau)$ is a solution for all $\tau$.

Proof: Let the minimum value of the objective function be $F_{min}$. Then by assumption, $F(X_0) = F(X_1) = F_{min}$. By convexity of $F$, $F(X(\tau)) \le (1 - \tau)F(X_0) + \tau F(X_1) = F_{min}$. Furthermore, by linearity, the $X(\tau)$ satisfy the constraints Eq. (40), (41): explicitly (again combining both constraints into one):


$$y_i(\mathbf{w}_\tau \cdot \mathbf{x}_i + b_\tau) = y_i\left((1 - \tau)(\mathbf{w}_0 \cdot \mathbf{x}_i + b_0) + \tau(\mathbf{w}_1 \cdot \mathbf{x}_i + b_1)\right) \ge (1 - \tau)(1 - \xi_i) + \tau(1 - \xi_i) = 1 - \xi_i \qquad (77)$$

Although simple, this theorem is quite instructive. For example, one might think that the problems depicted in Figure 10 have several different optimal solutions (for the case of linear support vector machines). However, since one cannot smoothly move the hyperplane from one proposed solution to another without generating hyperplanes which are not solutions, we know that these proposed solutions are in fact not solutions at all. In fact, for each of these cases, the optimal unique solution is at $\mathbf{w} = 0$, with a suitable choice of $b$ (which has the effect of assigning the same label to all the points). Note that this is a perfectly acceptable solution to the classification problem: any proposed hyperplane (with $\mathbf{w} \ne 0$) will cause the primal objective function to take a higher value.

Figure 10. Two problems, with proposed (incorrect) non-unique solutions.

Finally, note that the fact that SVM training always finds a global solution is in contrast to the case of neural networks, where many local minima usually exist.

5. Methods of Solution

The support vector optimization problem can be solved analytically only when the number of training data is very small, or for the separable case when it is known beforehand which of the training data become support vectors (as in Sections 3.3 and 6.2). Note that this can happen when the problem has some symmetry (Section 3.3), but that it can also happen when it does not (Section 6.2). For the general analytic case, the worst case computational complexity is of order $N_S^3$ (inversion of the Hessian), where $N_S$ is the number of support vectors, although the two examples given both have complexity of $O(1)$.

However, in most real world cases, Equations (43) (with dot products replaced by kernels), (44), and (45) must be solved numerically. For small problems, any general purpose optimization package that solves linearly constrained convex quadratic programs will do. A good survey of the available solvers, and where to get them, can be found¹⁶ in (More and Wright, 1993).

For larger problems, a range of existing techniques can be brought to bear. A full exploration of the relative merits of these methods would fill another tutorial. Here we just describe the general issues, and for concreteness, give a brief explanation of the technique we currently use. Below, a "face" means a set of points lying on the boundary of the feasible region, and an "active constraint" is a constraint for which the equality holds. For more


For more on nonlinear programming techniques see (Fletcher, 1987; Mangasarian, 1969; McCormick, 1983).

The basic recipe is to (1) note the optimality (KKT) conditions which the solution must satisfy, (2) define a strategy for approaching optimality by uniformly increasing the dual objective function subject to the constraints, and (3) decide on a decomposition algorithm so that only portions of the training data need be handled at a given time (Boser, Guyon and Vapnik, 1992; Osuna, Freund and Girosi, 1997a). We give a brief description of some of the issues involved. One can view the problem as requiring the solution of a sequence of equality constrained problems. A given equality constrained problem can be solved in one step by using the Newton method (although this requires storage for a factorization of the projected Hessian), or in at most $l$ steps using conjugate gradient ascent (Press et al., 1992) (where $l$ is the number of data points for the problem currently being solved: no extra storage is required). Some algorithms move within a given face until a new constraint is encountered, in which case the algorithm is restarted with the new constraint added to the list of equality constraints. This method has the disadvantage that only one new constraint is made active at a time. "Projection methods" have also been considered (Moré, 1991), where a point outside the feasible region is computed, and then line searches and projections are done so that the actual move remains inside the feasible region. This approach can add several new constraints at once. Note that in both approaches, several active constraints can become inactive in one step. In all algorithms, only the essential part of the Hessian (the columns corresponding to $\alpha_i \ne 0$) need be computed (although some algorithms do compute the whole Hessian). For the Newton approach, one can also take advantage of the fact that the Hessian is positive semidefinite by diagonalizing it with the Bunch-Kaufman algorithm (Bunch and Kaufman, 1977; Bunch and Kaufman, 1980) (if the Hessian were indefinite, it could still be easily reduced to 2x2 block diagonal form with this algorithm). In this algorithm, when a new constraint is made active or inactive, the factorization of the projected Hessian is easily updated (as opposed to recomputing the factorization from scratch). Finally, in interior point methods, the variables are essentially rescaled so as to always remain inside the feasible region. An example is the "LOQO" algorithm of (Vanderbei, 1994a; Vanderbei, 1994b), which is a primal-dual path following algorithm.
This last method is likely to be useful for problems where the number of support vectors as a fraction of training sample size is expected to be large.

We briefly describe the core optimization method we currently use.[17] It is an active set method combining gradient and conjugate gradient ascent. Whenever the objective function is computed, so is the gradient, at very little extra cost. In phase 1, the search directions $\mathbf{s}$ are along the gradient. The nearest face along the search direction is found. If the dot product of the gradient there with $\mathbf{s}$ indicates that the maximum along $\mathbf{s}$ lies between the current point and the nearest face, the optimal point along the search direction is computed analytically (note that this does not require a line search), and phase 2 is entered. Otherwise, we jump to the new face and repeat phase 1. In phase 2, Polak-Ribiere conjugate gradient ascent (Press et al., 1992) is done, until a new face is encountered (in which case phase 1 is re-entered) or the stopping criterion is met. Note the following:

Search directions are always projected so that the $\alpha_i$ continue to satisfy the equality constraint Eq. (45). Note that the conjugate gradient algorithm will still work; we are simply searching in a subspace. However, it is important that this projection is implemented in such a way that not only is Eq. (45) met (easy), but also so that the angle between the resulting search direction, and the search direction prior to projection, is minimized (not quite so easy).


We also use a "sticky faces" algorithm: whenever a given face is hit more than once, the search directions are adjusted so that all subsequent searches are done within that face. All "sticky faces" are reset (made "non-sticky") when the rate of increase of the objective function falls below a threshold.

The algorithm stops when the fractional rate of increase of the objective function $F$ falls below a tolerance (typically 1e-10, for double precision). Note that one can also use as stopping criterion the condition that the size of the projected search direction falls below a threshold. However, this criterion does not handle scaling well.

In my opinion the hardest thing to get right is handling precision problems correctly everywhere. If this is not done, the algorithm may not converge, or may be much slower than it needs to be.

A good way to check that your algorithm is working is to check that the solution satisfies all the Karush-Kuhn-Tucker conditions for the primal problem, since these are necessary and sufficient conditions that the solution be optimal. The KKT conditions are Eqs. (48) through (56), with dot products between data vectors replaced by kernels wherever they appear (note $\mathbf{w}$ must be expanded as in Eq. (48) first, since $\mathbf{w}$ is not in general the mapping of a point in $\mathcal{L}$). Thus to check the KKT conditions, it is sufficient to check that the $\alpha_i$ satisfy $0 \le \alpha_i \le C$, that the equality constraint (49) holds, that all points for which $0 \le \alpha_i < C$ satisfy Eq. (51) with $\xi_i = 0$, and that all points for which $\alpha_i = C$ satisfy Eq. (51) for some $\xi_i \ge 0$.

5.1.1. Training

For very large training sets, the optimization is carried out on a sequence of manageable subproblems. The simplest such approach, "chunking", starts with a small, arbitrary


subset of the data and trains on that. The rest of the training data is tested on the resulting classifier, and a list of the errors is constructed, sorted by how far on the wrong side of the margin they lie (i.e. how egregiously the KKT conditions are violated). The next chunk is constructed from the first $N$ of these, combined with the $N_S$ support vectors already found, where $N + N_S$ is decided heuristically (a chunk size that is allowed to grow too quickly or too slowly will result in slow overall convergence). Note that vectors can be dropped from a chunk, and that support vectors in one chunk may not appear in the final solution. This process is continued until all data points are found to satisfy the KKT conditions.

The above method requires that the number of support vectors $N_S$ be small enough so that a Hessian of size $N_S$ by $N_S$ will fit in memory. An alternative decomposition algorithm has been proposed which overcomes this limitation (Osuna, Freund and Girosi, 1997b). Again, in this algorithm, only a small portion of the training data is trained on at a given time, and furthermore, only a subset of the support vectors need be in the "working set" (i.e. that set of points whose $\alpha$'s are allowed to vary). This method has been shown to be able to easily handle a problem with 110,000 training points and 100,000 support vectors. However, it must be noted that the speed of this approach relies on many of the support vectors having corresponding Lagrange multipliers $\alpha_i$ at the upper bound, $\alpha_i = C$.

These training algorithms may take advantage of parallel processing in several ways. First, all elements of the Hessian itself can be computed simultaneously. Second, each element often requires the computation of dot products of training data, which could also be parallelized. Third, the computation of the objective function, or gradient, which is a speed bottleneck, can be parallelized (it requires a matrix multiplication). Finally, one can envision parallelizing at a higher level, for example by training on different chunks simultaneously. Schemes such as these, combined with the decomposition algorithm of (Osuna, Freund and Girosi, 1997b), will be needed to make very large problems (i.e. >> 100,000 support vectors, with many not at bound) tractable.

5.1.2. Testing

In test phase, one must simply evaluate Eq. (61), which will require $O(MN_S)$ operations, where $M$ is the number of operations required to evaluate the kernel. For dot product and RBF kernels, $M$ is $O(d_L)$, the dimension of the data vectors. Again, both the evaluation of the kernel and of the sum are highly parallelizable procedures. In the absence of parallel hardware, one can still speed up test phase by a large factor, as described in Section 9.
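As an illustration of the test-phase cost just described, a sketch of the kernel expansion of Eq. (61) might look as follows; the argument names are assumptions, and the support-vector arrays are taken to come from some previously trained machine.

```python
# Sketch: evaluate the SVM decision function at a test point.  The cost is
# O(M * N_S): one kernel evaluation (M operations) per support vector.
import numpy as np

def decision_function(x, sv_x, sv_y, alpha, b, kernel):
    # f(x) = sum_i alpha_i y_i K(s_i, x) + b, summed over support vectors only
    return sum(a * y * kernel(s, x) for a, y, s in zip(alpha, sv_y, sv_x)) + b

def classify(x, sv_x, sv_y, alpha, b, kernel):
    return np.sign(decision_function(x, sv_x, sv_y, alpha, b, kernel))
```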
6. The VC Dimension of Support Vector Machines

We now show that the VC dimension of SVMs can be very large (even infinite). We will then explore several arguments as to why, in spite of this, SVMs usually exhibit good generalization performance. However it should be emphasized that these are essentially plausibility arguments. Currently there exists no theory which guarantees that a given family of SVMs will have high accuracy on a given problem.

We will call any kernel that satisfies Mercer's condition a positive kernel, and the corresponding space $\mathcal{H}$ the embedding space. We will also call any embedding space with minimal dimension for a given kernel a "minimal embedding space". We have the following

Theorem 3 Let $K$ be a positive kernel which corresponds to a minimal embedding space $\mathcal{H}$. Then the VC dimension of the corresponding support vector machine (where the error penalty $C$ in Eq. (44) is allowed to take all values) is $\dim(\mathcal{H}) + 1$.


Proof: If the minimal embedding space has dimension $d_H$, then $d_H$ points in the image of $\mathcal{L}$ under the mapping $\Phi$ can be found whose position vectors in $\mathcal{H}$ are linearly independent. From Theorem 1, these vectors can be shattered by hyperplanes in $\mathcal{H}$. Thus by either restricting ourselves to SVMs for the separable case (Section 3.1), or for which the error penalty $C$ is allowed to take all values (so that, if the points are linearly separable, a $C$ can be found such that the solution does indeed separate them), the family of support vector machines with kernel $K$ can also shatter these points, and hence has VC dimension $d_H + 1$.

Let's look at two examples.

6.1. The VC Dimension for Polynomial Kernels

Consider an SVM with homogeneous polynomial kernel, acting on data in $R^{d_L}$:

K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^p, \quad \mathbf{x}_1, \mathbf{x}_2 \in R^{d_L} \qquad (78)

As in the case when $d_L = 2$ and the kernel is quadratic (Section 4), one can explicitly construct the map $\Phi$. Letting $z_i = x_{1i} x_{2i}$, so that $K(\mathbf{x}_1, \mathbf{x}_2) = (z_1 + \cdots + z_{d_L})^p$, we see that each dimension of $\mathcal{H}$ corresponds to a term with given powers of the $z_i$ in the expansion of $K$. In fact if we choose to label the components of $\Phi(\mathbf{x})$ in this manner, we can explicitly write the mapping for any $p$ and $d_L$:

\Phi_{r_1 r_2 \cdots r_{d_L}}(\mathbf{x}) = \sqrt{\frac{p!}{r_1!\, r_2! \cdots r_{d_L}!}}\; x_1^{r_1} x_2^{r_2} \cdots x_{d_L}^{r_{d_L}}, \quad \sum_{i=1}^{d_L} r_i = p, \; r_i \ge 0 \qquad (79)

This leads to

Theorem 4 If the space in which the data live has dimension $d_L$ (i.e. $\mathcal{L} = R^{d_L}$), the dimension of the minimal embedding space, for homogeneous polynomial kernels of degree $p$ ($K(\mathbf{x}_1,\mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^p$, $\mathbf{x}_1, \mathbf{x}_2 \in R^{d_L}$), is $\binom{d_L + p - 1}{p}$.

(The proof is in the Appendix). Thus the VC dimension of SVMs with these kernels is $\binom{d_L + p - 1}{p} + 1$. As noted above, this gets very large very quickly.

6.2. The VC Dimension for Radial Basis Function Kernels

Theorem 5 Consider the class of Mercer kernels for which $K(\mathbf{x}_1, \mathbf{x}_2) \to 0$ as $\|\mathbf{x}_1 - \mathbf{x}_2\| \to \infty$, and for which $K(\mathbf{x}, \mathbf{x})$ is O(1), and assume that the data can be chosen arbitrarily from $R^d$. Then the family of classifiers consisting of support vector machines using these kernels, and for which the error penalty is allowed to take all values, has infinite VC dimension.

Proof: The kernel matrix, $K_{ij} \equiv K(\mathbf{x}_i, \mathbf{x}_j)$, is a Gram matrix (a matrix of dot products: see (Horn, 1985)) in $\mathcal{H}$. Clearly we can choose training data such that all off-diagonal elements $K_{i \ne j}$ can be made arbitrarily small, and by assumption all diagonal elements $K_{i = j}$ are of O(1). The matrix $K$ is then of full rank; hence the set of vectors, whose dot products in $\mathcal{H}$ form $K$, are linearly independent (Horn, 1985); hence, by Theorem 1, the points can be shattered by hyperplanes in $\mathcal{H}$, and hence also by support vector machines with sufficiently large error penalty. Since this is true for any finite number of points, the VC dimension of these classifiers is infinite.
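The argument in the proof is easy to check numerically. The following sketch, with arbitrarily chosen spacing and width, builds a Gaussian RBF kernel matrix for widely separated points and confirms that it is numerically of full rank, so that the mapped points are linearly independent in $\mathcal{H}$.

```python
# Sketch: widely separated points plus a small RBF width give a kernel matrix
# whose off-diagonal entries are negligible, hence (numerically) full rank.
import numpy as np

l, sigma = 20, 0.1
X = np.arange(l, dtype=float).reshape(-1, 1) * 100.0   # points far apart
d2 = (X - X.T) ** 2                                     # squared distances
K = np.exp(-d2 / (2.0 * sigma ** 2))                    # off-diagonal terms ~ 0

print(np.linalg.matrix_rank(K))   # 20: full rank, so by Theorem 1 the points
                                  # can be shattered by hyperplanes in H
```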


Note that the assumptions in the theorem are stronger than necessary (they were chosen to make the connection to radial basis functions clear). In fact it is only necessary that $l$ training points can be chosen such that the rank of the matrix $K_{ij}$ increases without limit as $l$ increases. For example, for Gaussian RBF kernels, this can also be accomplished (even for training data restricted to lie in a bounded subset of $R^{d_L}$) by choosing small enough RBF widths. However in general the VC dimension of SVM RBF classifiers can certainly be finite, and indeed, for data restricted to lie in a bounded subset of $R^{d_L}$, choosing restrictions on the RBF widths is a good way to control the VC dimension.

This case gives us a second opportunity to present a situation where the SVM solution can be computed analytically, which also amounts to a second, constructive proof of the Theorem. For concreteness we will take the case for Gaussian RBF kernels of the form $K(\mathbf{x}_1, \mathbf{x}_2) = e^{-\|\mathbf{x}_1 - \mathbf{x}_2\|^2 / 2\sigma^2}$. Let us choose training points such that the smallest distance between any pair of points is much larger than the width $\sigma$. Consider the decision function evaluated on the support vector $\mathbf{s}_j$:

f(\mathbf{s}_j) = \sum_i \alpha_i y_i e^{-\|\mathbf{s}_i - \mathbf{s}_j\|^2 / 2\sigma^2} + b. \qquad (80)

The sum on the right hand side will then be largely dominated by the term $i = j$; in fact the ratio of that term to the contribution from the rest of the sum can be made arbitrarily large by choosing the training points to be arbitrarily far apart. In order to find the SVM solution, we again assume for the moment that every training point becomes a support vector, and we work with SVMs for the separable case (Section 3.1) (the same argument will hold for SVMs for the non-separable case if $C$ in Eq. (44) is allowed to take large enough values). Since all points are support vectors, the equalities in Eqs. (10), (11) will hold for them. Let there be $N_+$ ($N_-$) positive (negative) polarity points. We further assume that all positive (negative) polarity points have the same value $\alpha_+$ ($\alpha_-$) for their Lagrange multiplier. (We will know that this assumption is correct if it delivers a solution which satisfies all the KKT conditions and constraints). Then Eqs. (19), applied to all the training data, and the equality constraint Eq. (18), become

\alpha_+ + b = +1, \qquad -\alpha_- + b = -1, \qquad N_+ \alpha_+ - N_- \alpha_- = 0 \qquad (81)

which are satisfied by

\alpha_+ = \frac{2N_-}{N_- + N_+}, \qquad \alpha_- = \frac{2N_+}{N_- + N_+}, \qquad b = \frac{N_+ - N_-}{N_- + N_+} \qquad (82)

Thus, since the resulting $\alpha_i$ are also positive, all the KKT conditions and constraints are satisfied, and we must have found the global solution (with zero training errors).
Since the number of training points, and their labeling, is arbitrary, and they are separated without error, the VC dimension is infinite.

The situation is summarized schematically in Figure 11.
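The analytic solution above is easy to verify numerically. The following sketch implements Eqs. (80)-(82) for an arbitrarily chosen, widely separated one-dimensional training set; the data layout and the width are illustrative assumptions.

```python
# Sketch: check that the analytic solution of Eq. (82) classifies every
# training point correctly when the points are far apart relative to sigma.
import numpy as np

sigma = 0.01
X = np.arange(7, dtype=float).reshape(-1, 1)       # pairwise distances >> sigma
y = np.array([1, 1, 1, 1, -1, -1, -1], dtype=float)
n_plus, n_minus = (y > 0).sum(), (y < 0).sum()

alpha_plus  = 2.0 * n_minus / (n_plus + n_minus)   # Eq. (82)
alpha_minus = 2.0 * n_plus  / (n_plus + n_minus)
b = (n_plus - n_minus) / (n_plus + n_minus)
alpha = np.where(y > 0, alpha_plus, alpha_minus)

K = np.exp(-(X - X.T) ** 2 / (2.0 * sigma ** 2))   # Gaussian RBF kernel matrix
f = K @ (alpha * y) + b                            # Eq. (80) at each point
print(np.allclose(f, y, atol=1e-6))                # True: zero training errors
```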


Figure 11. Gaussian RBF SVMs of sufficiently small width can classify an arbitrarily large number of training points correctly, and thus have infinite VC dimension.

Now we are left with a striking conundrum. Even though their VC dimension is infinite (if the data is allowed to take all values in $R^{d_L}$), SVM RBFs can have excellent performance (Schölkopf et al, 1997). A similar story holds for polynomial SVMs. How come?

7. The Generalization Performance of SVMs

In this Section we collect various arguments and bounds relating to the generalization performance of SVMs. We start by presenting a family of SVM-like classifiers for which structural risk minimization can be rigorously implemented, and which will give us some insight as to why maximizing the margin is so important.

7.1. VC Dimension of Gap Tolerant Classifiers

Consider a family of classifiers (i.e. a set of functions $\Phi$ on $R^d$) which we will call "gap tolerant classifiers." A particular classifier $\Phi$ is specified by the location and diameter of a ball in $R^d$, and by two hyperplanes, with parallel normals, also in $R^d$. Call the set of points lying between, but not on, the hyperplanes the "margin set." The decision functions $\Phi$ are defined as follows: points that lie inside the ball, but not in the margin set, are assigned class $\pm 1$, depending on which side of the margin set they fall. All other points are simply defined to be "correct", that is, they are not assigned a class by the classifier, and do not contribute to any risk. The situation is summarized, for $d = 2$, in Figure 12. This rather odd family of classifiers, together with a condition we will impose on how they are trained, will result in systems very similar to SVMs, and for which structural risk minimization can be demonstrated. A rigorous discussion is given in the Appendix.

Label the diameter of the ball $D$ and the perpendicular distance between the two hyperplanes $M$. The VC dimension is defined as before to be the maximum number of points that can be shattered by the family, but by "shattered" we mean that the points can occur as errors in all possible ways (see the Appendix for further discussion). Clearly we can control the VC dimension of a family of these classifiers by controlling the minimum margin $M$ and maximum diameter $D$ that members of the family are allowed to assume. For example, consider the family of gap tolerant classifiers in $R^2$ with diameter $D = 2$, shown in Figure 12. Those with margin satisfying $M \le 3/2$ can shatter three points; those with $3/2 < M < 2$ can shatter two; and those with $M \ge 2$ can shatter only one. Each of these three families of


classifiers corresponds to one of the sets of classifiers in Figure 4, with just three nested subsets of functions, and with $h_1 = 1$, $h_2 = 2$, and $h_3 = 3$.

Figure 12. A gap tolerant classifier on data in $R^2$ (the figure shows the case $D = 2$, $M = 3/2$, with $\Phi = \pm 1$ inside the ball and off the margin set, and $\Phi = 0$ elsewhere).

These ideas can be used to show how gap tolerant classifiers implement structural risk minimization. The extension of the above example to spaces of arbitrary dimension is encapsulated in a (modified) theorem of (Vapnik, 1995):

Theorem 6 For data in $R^d$, the VC dimension $h$ of gap tolerant classifiers of minimum margin $M_{\min}$ and maximum diameter $D_{\max}$ is bounded above[19] by $\min\{\lceil D_{\max}^2 / M_{\min}^2 \rceil, d\} + 1$.

For the proof we assume the following lemma, which in (Vapnik, 1979) is held to follow from symmetry arguments[20]:

Lemma: Consider $n \le d + 1$ points lying in a ball $B \subset R^d$. Let the points be shatterable by gap tolerant classifiers with margin $M$. Then in order for $M$ to be maximized, the points must lie on the vertices of an $(n-1)$-dimensional symmetric simplex, and must also lie on the surface of the ball.

Proof: We need only consider the case where the number of points $n$ satisfies $n \le d + 1$. ($n > d + 1$ points will not be shatterable, since the VC dimension of oriented hyperplanes in $R^d$ is $d + 1$, and any distribution of points which can be shattered by a gap tolerant classifier can also be shattered by an oriented hyperplane; this also shows that $h \le d + 1$). Again we consider points on a sphere of diameter $D$, where the sphere itself is of dimension $d - 2$. We will need two results from Section 3.3, namely (1) if $n$ is even, we can find a distribution of $n$ points (the vertices of the $(n-1)$-dimensional symmetric simplex) which can be shattered by gap tolerant classifiers if $D_{\max}^2 / M_{\min}^2 = n - 1$, and (2) if $n$ is odd, we can find a distribution of $n$ points which can be so shattered if $D_{\max}^2 / M_{\min}^2 = (n-1)^2 (n+1) / n^2$.

If $n$ is even, at most $n$ points can be shattered whenever $n - 1 \le D_{\max}^2 / M_{\min}^2 < n$.


Thus for $n$ even the maximum number of points that can be shattered may be written $\lfloor D_{\max}^2 / M_{\min}^2 \rfloor + 1$.

If $n$ is odd, at most $n$ points can be shattered when $D_{\max}^2 / M_{\min}^2 = (n-1)^2 (n+1) / n^2$. However, the quantity on the right hand side satisfies $n - 2 < (n-1)^2 (n+1) / n^2 < n - 1$ for all integer $n > 1$. Thus for $n$ odd the largest number of points that can be shattered is certainly bounded above by $\lceil D_{\max}^2 / M_{\min}^2 \rceil + 1$, and from the above this bound is also satisfied when $n$ is even. Hence in general the VC dimension $h$ of gap tolerant classifiers must satisfy

h \le \left\lceil \frac{D_{\max}^2}{M_{\min}^2} \right\rceil + 1. \qquad (85)

This result, together with $h \le d + 1$, concludes the proof.

7.2. Gap Tolerant Classifiers, Structural Risk Minimization, and SVMs

Let's see how we can do structural risk minimization with gap tolerant classifiers. We need only consider that subset of the classifiers, call it $\Phi_S$, for which training "succeeds", where by success we mean that all training data are assigned a label $\in \{\pm 1\}$ (note that these labels do not have to coincide with the actual labels, i.e. training errors are allowed). Within $\Phi_S$, find the subset which gives the fewest training errors; call this number of errors $N_{\min}$. Within that subset, find the function which gives maximum margin (and hence the lowest bound on the VC dimension). Note the value of the resulting risk bound (the right hand side of Eq. (3), using the bound on the VC dimension in place of the VC dimension). Next, within $\Phi_S$, find that subset which gives $N_{\min} + 1$ training errors. Again, within that subset, find the $\Phi$ which gives the maximum margin, and note the corresponding risk bound. Iterate, and take that classifier which gives the overall minimum risk bound.

An alternative approach is to divide the functions into nested subsets $\Phi_i$, $i \in Z$, $i \ge 1$, as follows: all $\Phi \in \Phi_i$ have $\{D, M\}$ satisfying $\lceil D^2 / M^2 \rceil \le i$. Thus the family of functions in $\Phi_i$ has VC dimension bounded above by $\min(i, d) + 1$. Note also that $\Phi_i \subseteq \Phi_{i+1}$. SRM then proceeds by taking that $\Phi$ for which training succeeds in each subset and for which the empirical risk is minimized in that subset, and again, choosing that $\Phi$ which gives the lowest overall risk bound.

Note that it is essential to these arguments that the bound (3) holds for any chosen decision function, not just the one that minimizes the empirical risk (otherwise eliminating solutions for which some training point $\mathbf{x}$ satisfies $\Phi(\mathbf{x}) = 0$ would invalidate the argument).

The resulting gap tolerant classifier is in fact a special kind of support vector machine which simply does not count data falling outside the sphere containing all the training data, or inside the separating margin, as an error. It seems very reasonable to conclude that support vector machines, which are trained with very similar objectives, also gain a similar kind of capacity control from their training. However, a gap tolerant classifier is not an SVM, and so the argument does not constitute a rigorous demonstration of structural risk minimization for SVMs.
The original argument for structural risk minimization for SVMs is known to be flawed, since the structure there is determined by the data (see (Vapnik, 1995), Section 5.11). I believe that there is a further subtle problem with the original argument. The structure is defined so that no training points are members of the margin set. However, one must still specify how test points that fall into the margin are to be labeled. If one simply


assigns the same, fixed class to them (say +1), then the VC dimension will be higher[21] than the bound derived in Theorem 6. However, the same is true if one labels them all as errors (see the Appendix). If one labels them all as "correct", one arrives at gap tolerant classifiers.

On the other hand, it is known how to do structural risk minimization for systems where the structure does depend on the data (Shawe-Taylor et al., 1996a; Shawe-Taylor et al., 1996b). Unfortunately the resulting bounds are much looser than the VC bounds above, which are already very loose (we will examine a typical case below where the VC bound is a factor of 100 higher than the measured test error). Thus at the moment structural risk minimization alone does not provide a rigorous explanation as to why SVMs often have good generalization performance. However, the above arguments strongly suggest that algorithms that minimize $D^2 / M^2$ can be expected to give better generalization performance. Further evidence for this is found in the following theorem of (Vapnik, 1998), which we quote without proof[22]:

Theorem 7 For optimal hyperplanes passing through the origin, we have

E[P(\mathrm{error})] \le \frac{E[D^2 / M^2]}{l} \qquad (86)

where $P(\mathrm{error})$ is the probability of error on the test set, the expectation on the left is over all training sets of size $l - 1$, and the expectation on the right is over all training sets of size $l$.

However, in order for these observations to be useful for real problems, we need a way to compute the diameter of the minimal enclosing sphere described above, for any number of training points and for any kernel mapping.

7.3. How to Compute the Minimal Enclosing Sphere

Again let $\Phi$ be the mapping to the embedding space $\mathcal{H}$. We wish to compute the radius of the smallest sphere in $\mathcal{H}$ which encloses the mapped training data: that is, we wish to minimize $R^2$ subject to

\|\Phi(\mathbf{x}_i) - \mathbf{C}\|^2 \le R^2 \quad \forall i \qquad (87)

where $\mathbf{C} \in \mathcal{H}$ is the (unknown) center of the sphere. Thus introducing positive Lagrange multipliers $\lambda_i$, the primal Lagrangian is

L_P = R^2 - \sum_i \lambda_i \left(R^2 - \|\Phi(\mathbf{x}_i) - \mathbf{C}\|^2\right). \qquad (88)

This is again a convex quadratic programming problem, so we can instead maximize the Wolfe dual

L_D = \sum_i \lambda_i K(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i,j} \lambda_i \lambda_j K(\mathbf{x}_i, \mathbf{x}_j) \qquad (89)

(where we have again replaced $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ by $K(\mathbf{x}_i, \mathbf{x}_j)$) subject to:

\sum_i \lambda_i = 1 \qquad (90)

\lambda_i \ge 0 \qquad (91)


with solution given by

\mathbf{C} = \sum_i \lambda_i \Phi(\mathbf{x}_i). \qquad (92)

Thus the problem is very similar to that of support vector training, and in fact the code for the latter is easily modified to solve the above problem. Note that we were in a sense "lucky", because the above analysis shows us that there exists an expansion (92) for the center; there is no a priori reason why we should expect that the center of the sphere in $\mathcal{H}$ should be expressible in terms of the mapped training data in this way. The same can be said of the solution for the support vector problem, Eq. (46). (Had we chosen some other geometrical construction, we might not have been so fortunate. Consider the smallest area equilateral triangle containing two given points in $R^2$. If the points' position vectors are linearly dependent, the center of the triangle cannot be expressed in terms of them.)

7.4. A Bound from Leave-One-Out

(Vapnik, 1995) gives an alternative bound on the actual risk of support vector machines:

E[P(\mathrm{error})] \le \frac{E[\text{Number of support vectors}]}{\text{Number of training samples}} \qquad (93)

where $P(\mathrm{error})$ is the actual risk for a machine trained on $l - 1$ examples, $E[P(\mathrm{error})]$ is the expectation of the actual risk over all choices of training set of size $l - 1$, and $E[\text{Number of support vectors}]$ is the expectation of the number of support vectors over all choices of training sets of size $l$. It's easy to see how this bound arises: consider the typical situation after training on a given training set, shown in Figure 13.

Figure 13. Support vectors (circles) can become errors (cross) after removal and re-training (the dotted line denotes the new decision surface).

We can get an estimate of the test error by removing one of the training points, re-training, and then testing on the removed point; and then repeating this, for all training points. From the support vector solution we know that removing any training points that are not support vectors (the latter include the errors) will have no effect on the hyperplane found. Thus the worst that can happen is that every support vector will become an error. Taking the expectation over all such training sets therefore gives an upper bound on the actual risk, for training sets of size $l - 1$.
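In practice the right hand side of Eq. (93) is usually estimated for a single training set by simply counting the support vectors of the trained machine and dividing by the number of training samples (strictly, both sides of (93) are expectations over training sets). A sketch of that estimate, assuming trained multipliers `alpha` are available as in the earlier sketches, is:

```python
# Sketch: the support-vector-count estimate of the leave-one-out bound, Eq. (93).
import numpy as np

def sv_count_bound(alpha, l, tol=1e-6):
    # Support vectors are the points with nonzero multipliers (including
    # margin errors, whose multipliers sit at the upper bound C).
    n_sv = int(np.sum(alpha > tol))
    return n_sv / float(l)
```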


Although elegant, I have yet to find a use for this bound. There seem to be many situations where the actual error increases even though the number of support vectors decreases, so the intuitive conclusion (systems that give fewer support vectors give better performance) often seems to fail. Furthermore, although the bound can be tighter than that found using the estimate of the VC dimension combined with Eq. (3), it can at the same time be less predictive, as we shall see in the next Section.

7.5. VC, SV Bounds and the Actual Risk

Let us put these observations to some use. As mentioned above, training an SVM RBF classifier will automatically give values for the RBF weights, number of centers, center positions, and threshold. For Gaussian RBFs, there is only one parameter left: the RBF width ($\sigma$ in Eq. (80)) (we assume here only one RBF width for the problem). Can we find the optimal value for that too, by choosing that $\sigma$ which minimizes $D^2 / M^2$? Figure 14 shows a series of experiments done on 28x28 NIST digit data, with 10,000 training points and 60,000 test points. The top curve in the left hand panel shows the VC bound (i.e. the bound resulting from approximating the VC dimension in Eq. (3)[23] by Eq. (85)), the middle curve shows the bound from leave-one-out (Eq. (93)), and the bottom curve shows the measured test error. Clearly, in this case, the bounds are very loose. The right hand panel shows just the VC bound (the top curve, for $\sigma^2 > 200$), together with the test error, with the latter scaled up by a factor of 100 (note that the two curves cross). It is striking that the two curves have minima in the same place: thus in this case, the VC bound, although loose, seems to be nevertheless predictive. Experiments on digits 2 through 9 showed that the VC bound gave a minimum for which $\sigma^2$ was within a factor of two of that which minimized the test error (digit 1 was inconclusive). Interestingly, in those cases the VC bound consistently gave a lower prediction for $\sigma^2$ than that which minimized the test error. On the other hand, the leave-one-out bound, although tighter, does not seem to be predictive, since it had no minimum for the values of $\sigma^2$ tested.

Figure 14. The VC bound can be predictive even when loose. (Left panel: actual risk, SV bound and VC bound, plotted against $\sigma^2$; right panel: VC bound and actual risk scaled by 100, plotted against $\sigma^2$.)
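A sketch of how one might compute the quantity $D^2 / M^2$ used for model selection above is given below: the margin follows from $\|\mathbf{w}\|^2 = \sum_{ij} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$ (so that $M = 2/\|\mathbf{w}\|$), and the diameter from the enclosing-sphere dual of Eqs. (89)-(91). This is not the code used for the experiments of Figure 14; the solver choice, starting point and tolerances are assumptions.

```python
# Sketch: estimate D^2/M^2 for a trained kernel machine, given the kernel
# matrix K on the training set and the trained multipliers alpha and labels y.
import numpy as np
from scipy.optimize import minimize

def enclosing_sphere_diameter(K):
    # Wolfe dual, Eqs. (89)-(91): maximize sum_i lam_i K_ii - lam^T K lam
    # subject to sum_i lam_i = 1, lam_i >= 0.
    l = K.shape[0]
    fun = lambda lam: -(lam @ np.diag(K) - lam @ K @ lam)
    cons = {'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0}
    res = minimize(fun, np.full(l, 1.0 / l), method='SLSQP',
                   bounds=[(0.0, 1.0)] * l, constraints=[cons])
    lam = res.x
    # Radius: largest distance in H from a mapped point to the center, Eq. (92).
    r2 = np.max(np.diag(K) - 2.0 * K @ lam + lam @ K @ lam)
    return 2.0 * np.sqrt(r2)

def d2_over_m2(K, alpha, y):
    w_norm2 = (alpha * y) @ K @ (alpha * y)   # ||w||^2, so M = 2 / ||w||
    D = enclosing_sphere_diameter(K)
    return 0.25 * D ** 2 * w_norm2            # = D^2 / M^2
```

One could then evaluate this quantity for a range of $\sigma$ values and pick the minimizer, which is the kind of comparison Figure 14 makes against the measured test error.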


8. Limitations

Perhaps the biggest limitation of the support vector approach lies in choice of the kernel. Once the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error penalty), but the kernel is a very big rug under which to sweep parameters. Some work has been done on limiting kernels using prior knowledge (Schölkopf et al., 1998a; Burges, 1998), but the best choice of kernel for a given problem is still a research issue.

A second limitation is speed and size, both in training and testing. While the speed problem in test phase is largely solved in (Burges, 1996), this still requires two training passes. Training for very large datasets (millions of support vectors) is an unsolved problem. Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained (Joachims, 1997). Finally, although some work has been done on training a multiclass SVM in one step[24], the optimal design for multiclass SVM classifiers is a further area for research.

9. Extensions

We very briefly describe two of the simplest, and most effective, methods for improving the performance of SVMs.

The virtual support vector method (Schölkopf, Burges and Vapnik, 1996; Burges and Schölkopf, 1997) attempts to incorporate known invariances of the problem (for example, translation invariance for the image recognition problem) by first training a system, then creating new data by distorting the resulting support vectors (translating them, in the case mentioned), and finally training a new system on the distorted (and the undistorted) data. The idea is easy to implement and seems to work better than other methods for incorporating invariances proposed so far.

The reduced set method (Burges, 1996; Burges and Schölkopf, 1997) was introduced to address the speed of support vector machines in test phase, and also starts with a trained SVM. The idea is to replace the sum in Eq. (46) by a similar sum, where instead of support vectors, computed vectors (which are not elements of the training set) are used, and instead of the $\alpha_i$, a different set of weights are computed. The number of parameters is chosen beforehand to give the speedup desired. The resulting vector is still a vector in $\mathcal{H}$, and the parameters are found by minimizing the Euclidean norm of the difference between the original vector $\mathbf{w}$ and the approximation to it. The same technique could be used for SVM regression to find much more efficient function representations (which could be used, for example, in data compression).

Combining these two methods gave a factor of 50 speedup (while the error rate increased from 1.0% to 1.1%) on the NIST digits (Burges and Schölkopf, 1997).
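For concreteness, a schematic of the virtual support vector method might look as follows. The one-pixel shifts, the cyclic-shift stand-in for a true image translation, and the train_svm helper (assumed to return the support vector indices together with the trained model) are all illustrative assumptions, not the procedure of (Burges and Schölkopf, 1997).

```python
# Sketch: virtual support vectors for (approximate) translation invariance.
import numpy as np

def shift_image(img, dx, dy):
    # Cyclically shift a 2-D image by (dx, dy) pixels; a crude stand-in for a
    # proper translation with zero padding.
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def virtual_sv_training(X_img, y, train_svm):
    # First pass: train on the original data and keep its support vectors.
    sv_idx, model = train_svm(X_img, y)
    sv_x, sv_y = X_img[sv_idx], y[sv_idx]

    # Generate "virtual" support vectors by applying the known invariance.
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    virt_x = np.concatenate([np.stack([shift_image(s, dx, dy) for s in sv_x])
                             for dx, dy in shifts])
    virt_y = np.tile(sv_y, len(shifts))

    # Second pass: retrain on the undistorted support vectors plus the
    # distorted copies.
    X2 = np.concatenate([sv_x, virt_x])
    y2 = np.concatenate([sv_y, virt_y])
    return train_svm(X2, y2)
```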
10. Conclusions

SVMs provide a new approach to the problem of pattern recognition (together with regression estimation and linear operator inversion) with clear connections to the underlying statistical learning theory. They differ radically from comparable approaches such as neural networks: SVM training always finds a global minimum, and their simple geometric interpretation provides fertile ground for further investigation. An SVM is largely characterized by the choice of its kernel, and SVMs thus link the problems they are designed for with a large body of existing work on kernel based methods. I hope that this tutorial will encourage some to explore SVMs for themselves.


Acknowledgments

I'm very grateful to P. Knirsch, C. Nohl, E. Osuna, E. Rietman, B. Schölkopf, Y. Singer, A. Smola, C. Stenard, and V. Vapnik for their comments on the manuscript. Thanks also to the reviewers, and to the Editor, U. Fayyad, for extensive, useful comments. Special thanks are due to V. Vapnik, under whose patient guidance I learned the ropes; to A. Smola and B. Schölkopf, for many interesting and fruitful discussions; and to J. Shawe-Taylor and D. Schuurmans, for valuable discussions on structural risk minimization.

Appendix

A.1. Proofs of Theorems

We collect here the theorems stated in the text, together with their proofs. The Lemma has a shorter proof using a "Theorem of the Alternative" (Mangasarian, 1969), but we wished to keep the proofs as self-contained as possible.

Lemma 1 Two sets of points in $R^n$ may be separated by a hyperplane if and only if the intersection of their convex hulls is empty.

Proof: We allow the notions of points in $R^n$, and position vectors of those points, to be used interchangeably in this proof. Let $C_A$, $C_B$ be the convex hulls of two sets of points $A$, $B$ in $R^n$. Let $A - B$ denote the set of points whose position vectors are given by $\mathbf{a} - \mathbf{b}$, $\mathbf{a} \in A$, $\mathbf{b} \in B$ (note that $A - B$ does not contain the origin), and let $C_A - C_B$ have the corresponding meaning for the convex hulls. Then showing that $A$ and $B$ are linearly separable (separable by a hyperplane) is equivalent to showing that the set $A - B$ is linearly separable from the origin $O$. For suppose the latter: then $\exists\, \mathbf{w} \in R^n$, $b \in R$, with $b < 0$, such that $\mathbf{x} \cdot \mathbf{w} + b > 0$ $\forall \mathbf{x} \in A - B$. Now pick some $\mathbf{y} \in B$, and denote the set of all points $\mathbf{a} - \mathbf{b} + \mathbf{y}$, $\mathbf{a} \in A$, $\mathbf{b} \in B$ by $A - B + \mathbf{y}$. Then $\mathbf{x} \cdot \mathbf{w} + b > \mathbf{y} \cdot \mathbf{w}$ $\forall \mathbf{x} \in A - B + \mathbf{y}$, and clearly $\mathbf{y} \cdot \mathbf{w} + b < \mathbf{y} \cdot \mathbf{w}$. Since $A \subset A - B + \mathbf{y}$, and since $\mathbf{y} \in B$ was arbitrary, every $\mathbf{a} \in A$ and every $\mathbf{y} \in B$ satisfy $\mathbf{w} \cdot \mathbf{a} - \mathbf{w} \cdot \mathbf{y} > -b > 0$; hence a hyperplane with normal $\mathbf{w}$, and threshold chosen between $\sup_{\mathbf{y} \in B} \mathbf{w} \cdot \mathbf{y}$ and $\inf_{\mathbf{a} \in A} \mathbf{w} \cdot \mathbf{a}$, separates $A$ from $B$.

Now suppose that the convex hulls $C_A$ and $C_B$ do not intersect, and let $S \equiv C_A - C_B$; $S$ is convex and, by assumption, does not contain the origin $O$. Let $\mathbf{x}_{\min}$ be a point of $S$ closest to $O$. Suppose $\exists\, \mathbf{x} \in S$ such that $\mathbf{x} \cdot \mathbf{x}_{\min} \le 0$. Let $L$ be the line segment joining $\mathbf{x}_{\min}$ and $\mathbf{x}$. Then convexity implies that $L \subset S$. Thus $O \notin L$, since by assumption $O \notin S$. Hence the three points $O$, $\mathbf{x}$ and $\mathbf{x}_{\min}$ form an obtuse (or right) triangle, with obtuse (or right) angle occurring at the point $O$. Define $\hat{\mathbf{n}} \equiv (\mathbf{x} - \mathbf{x}_{\min}) / \|\mathbf{x} - \mathbf{x}_{\min}\|$. Then the square of the distance from the closest point in $L$ to $O$ is $\|\mathbf{x}_{\min}\|^2 - (\mathbf{x}_{\min} \cdot \hat{\mathbf{n}})^2$, which is less than $\|\mathbf{x}_{\min}\|^2$, contradicting the definition of $\mathbf{x}_{\min}$. Hence $\mathbf{x} \cdot \mathbf{x}_{\min} > 0$ for all $\mathbf{x} \in S$, and $S$ is linearly separable from $O$. Thus $C_A - C_B$ is linearly separable from $O$, and a fortiori $A - B$ is linearly separable from $O$, and thus $A$ is linearly separable from $B$.

It remains to show that, if the two sets of points $A$, $B$ are linearly separable, the intersection of their convex hulls is empty. By assumption there exists a pair $\mathbf{w} \in R^n$, $b \in R$, such that $\forall \mathbf{a}_i \in A$, $\mathbf{w} \cdot \mathbf{a}_i + b > 0$ and $\forall \mathbf{b}_i \in B$, $\mathbf{w} \cdot \mathbf{b}_i + b < 0$.


Consider a point $\mathbf{x} \in C_A$: it may be written $\mathbf{x} = \sum_i \lambda_i \mathbf{a}_i$, with $\sum_i \lambda_i = 1$ and $0 \le \lambda_i \le 1$. Then $\mathbf{w} \cdot \mathbf{x} + b = \sum_i \lambda_i \{\mathbf{w} \cdot \mathbf{a}_i + b\} > 0$. Similarly, for points $\mathbf{y} \in C_B$, $\mathbf{w} \cdot \mathbf{y} + b < 0$. Hence $C_A \cap C_B = \emptyset$, since otherwise we would be able to find a point $\mathbf{x} = \mathbf{y}$ which simultaneously satisfies both inequalities.

Theorem 1 Consider some set of $m$ points in $R^n$. Choose any one of the points as origin. Then the $m$ points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.

Proof: Label the origin $O$, and assume that the $m - 1$ position vectors of the remaining points are linearly independent. Consider any partition of the $m$ points into two subsets, $S_1$ and $S_2$, of order $m_1$ and $m_2$ respectively, so that $m_1 + m_2 = m$. Let $S_1$ be the subset containing $O$. Then the convex hull $C_1$ of $S_1$ is that set of points whose position vectors $\mathbf{x}$ satisfy

\mathbf{x} = \sum_{i=1}^{m_1} \alpha_i \mathbf{s}_{1i}, \quad \sum_{i=1}^{m_1} \alpha_i = 1, \quad \alpha_i \ge 0 \qquad (A.1)

where the $\mathbf{s}_{1i}$ are the position vectors of the $m_1$ points in $S_1$ (including the null position vector of the origin). Similarly, the convex hull $C_2$ of $S_2$ is that set of points whose position vectors $\mathbf{x}$ satisfy

\mathbf{x} = \sum_{i=1}^{m_2} \beta_i \mathbf{s}_{2i}, \quad \sum_{i=1}^{m_2} \beta_i = 1, \quad \beta_i \ge 0 \qquad (A.2)

where the $\mathbf{s}_{2i}$ are the position vectors of the $m_2$ points in $S_2$. Now suppose that $C_1$ and $C_2$ intersect. Then there exists an $\mathbf{x} \in R^n$ which simultaneously satisfies Eq. (A.1) and Eq. (A.2). Subtracting these equations gives a linear combination of the $m - 1$ non-null position vectors which vanishes, which contradicts the assumption of linear independence. By the lemma, since $C_1$ and $C_2$ do not intersect, there exists a hyperplane separating $S_1$ and $S_2$. Since this is true for any choice of partition, the $m$ points can be shattered.

It remains to show that if the $m - 1$ non-null position vectors are not linearly independent, then the $m$ points cannot be shattered by oriented hyperplanes. If the $m - 1$ position vectors are not linearly independent, then there exist $m - 1$ numbers, $\gamma_i$, such that

\sum_{i=1}^{m-1} \gamma_i \mathbf{s}_i = 0 \qquad (A.3)

If all the $\gamma_i$ are of the same sign, then we can scale them so that $\gamma_i \in [0,1]$ and $\sum_i \gamma_i = 1$. Eq. (A.3) then states that the origin lies in the convex hull of the remaining points; hence, by the lemma, the origin cannot be separated from the remaining points by a hyperplane, and the points cannot be shattered.

If the $\gamma_i$ are not all of the same sign, place all the terms with negative $\gamma_i$ on the right:

\sum_{j \in I_1} |\gamma_j| \mathbf{s}_j = \sum_{k \in I_2} |\gamma_k| \mathbf{s}_k \qquad (A.4)

where $I_1$, $I_2$ are the indices of the corresponding partition of $S \setminus O$ (i.e. of the set $S$ with the origin removed). Now scale this equation so that either $\sum_{j \in I_1} |\gamma_j| = 1$ and $\sum_{k \in I_2} |\gamma_k| \le 1$, or $\sum_{j \in I_1} |\gamma_j| \le 1$ and $\sum_{k \in I_2} |\gamma_k| = 1$. Suppose without loss of generality that the latter holds. Then the left hand side of Eq. (A.4) is the position vector of a point lying in the


convex hull of the points $\{\cup_{j \in I_1} \mathbf{s}_j\} \cup O$ (or, if the equality holds, of the points $\{\cup_{j \in I_1} \mathbf{s}_j\}$), and the right hand side is the position vector of a point lying in the convex hull of the points $\cup_{k \in I_2} \mathbf{s}_k$, so the convex hulls overlap, and by the lemma, the two sets of points cannot be separated by a hyperplane. Thus the $m$ points cannot be shattered.

Theorem 4 If the data is $d$-dimensional (i.e. $\mathcal{L} = R^d$), the dimension of the minimal embedding space, for homogeneous polynomial kernels of degree $p$ ($K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^p$, $\mathbf{x}_1, \mathbf{x}_2 \in R^d$), is $\binom{d + p - 1}{p}$.

Proof: First we show that the number of components of $\Phi(\mathbf{x})$ is $\binom{p + d - 1}{p}$. Label the components of $\Phi$ as in Eq. (79). Then a component is uniquely identified by the choice of the $d$ integers $r_i \ge 0$, $\sum_{i=1}^d r_i = p$. Now consider $p$ objects distributed amongst $d - 1$ partitions (numbered 1 through $d - 1$), such that objects are allowed to be to the left of all partitions, or to the right of all partitions. Suppose $m$ objects fall between partitions $q$ and $q + 1$. Let this correspond to a term $x_{q+1}^m$ in the product in Eq. (79). Similarly, $m$ objects falling to the left of all partitions corresponds to a term $x_1^m$, and $m$ objects falling to the right of all partitions corresponds to a term $x_d^m$. Thus the number of distinct terms of the form $x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}$, $\sum_{i=1}^d r_i = p$, $r_i \ge 0$, is the number of ways of distributing the objects and partitions amongst themselves, modulo permutations of the partitions and permutations of the objects, which is $\binom{p + d - 1}{p}$.

Next we must show that the set of vectors with components $\Phi_{r_1 r_2 \cdots r_d}(\mathbf{x})$ span the space $\mathcal{H}$. This follows from the fact that the components of $\Phi(\mathbf{x})$ are linearly independent functions. For suppose instead that the image of $\Phi$ acting on $\mathbf{x} \in \mathcal{L}$ lies in a proper subspace of $\mathcal{H}$. Then there exists a fixed nonzero vector $\mathbf{V} \in \mathcal{H}$ such that

\sum_{i=1}^{\dim(\mathcal{H})} V_i \Phi_i(\mathbf{x}) = 0 \quad \forall \mathbf{x} \in \mathcal{L}. \qquad (A.5)

Using the labeling introduced above, consider a particular component of $\Phi$:

\Phi_{r_1 r_2 \cdots r_d}(\mathbf{x}), \quad \sum_{i=1}^d r_i = p. \qquad (A.6)

Since Eq. (A.5) holds for all $\mathbf{x}$, and since the mapping $\Phi$ in Eq. (79) certainly has all derivatives defined, we can apply the operator

\left(\frac{\partial}{\partial x_1}\right)^{r_1} \cdots \left(\frac{\partial}{\partial x_d}\right)^{r_d} \qquad (A.7)

to Eq. (A.5), which will pick that one term with corresponding powers of the $x_i$ in Eq. (79), giving

V_{r_1 r_2 \cdots r_d} = 0. \qquad (A.8)

Since this is true for all choices of $r_1, \ldots, r_d$ such that $\sum_{i=1}^d r_i = p$, every component of $\mathbf{V}$ must vanish. Hence the image of $\Phi$ acting on $\mathbf{x} \in \mathcal{L}$ spans $\mathcal{H}$.

A.2. Gap Tolerant Classifiers and VC Bounds

The following point is central to the argument. One normally thinks of a collection of points as being "shattered" by a set of functions, if for any choice of labels for the points, a function


from the set can be found which assigns those labels to the points. The VC dimension of that set of functions is then defined as the maximum number of points that can be so shattered. However, consider a slightly different definition. Let a set of points be shattered by a set of functions if for any choice of labels for the points, a function from the set can be found which assigns the incorrect labels to all the points. Again let the VC dimension of that set of functions be defined as the maximum number of points that can be so shattered.

It is in fact this second definition (which we adopt from here on) that enters the VC bound proofs (Vapnik, 1979; Devroye, Györfi and Lugosi, 1996). Of course for functions whose range is $\{\pm 1\}$ (i.e. all data will be assigned either positive or negative class), the two definitions are the same. However, if all points falling in some region are simply deemed to be "errors", or "correct", the two definitions are different. As a concrete example, suppose we define "gap intolerant classifiers", which are like gap tolerant classifiers, but which label all points lying in the margin or outside the sphere as errors. Consider again the situation in Figure 12, but assign positive class to all three points. Then a gap intolerant classifier with margin width greater than the ball diameter cannot shatter the points if we use the first definition of "shatter", but can shatter the points if we use the second (correct) definition.

With this caveat in mind, we now outline how the VC bounds can apply to functions with range $\{\pm 1, 0\}$, where the label 0 means that the point is labeled "correct." (The bounds will also apply to functions where 0 is defined to mean "error", but the corresponding VC dimension will be higher, weakening the bound, and in our case, making it useless). We will follow the notation of (Devroye, Györfi and Lugosi, 1996).

Consider points $\mathbf{x} \in R^d$, and let $p(\mathbf{x})$ denote a density on $R^d$. Let $\Phi$ be a function on $R^d$ with range $\{\pm 1, 0\}$, and consider a set of such functions. Let each $\mathbf{x}$ have an associated label $y_{\mathbf{x}} \in \{\pm 1\}$. Let $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be any finite number of points in $R^d$: then we require the set of functions to have the property that it contains at least one $\Phi$ such that $\Phi(\mathbf{x}_i) \in \{\pm 1\}$ $\forall \mathbf{x}_i$. For given $\Phi$, define the set of points $A$ by

A = \{\mathbf{x} : y_{\mathbf{x}} = 1, \Phi(\mathbf{x}) = -1\} \cup \{\mathbf{x} : y_{\mathbf{x}} = -1, \Phi(\mathbf{x}) = 1\} \qquad (A.9)

We require that the $\Phi$ be such that all sets $A$ are measurable. Let $\mathcal{A}$ denote the set of all $A$.

Definition: Let $\mathbf{x}_i$, $i = 1, \ldots, n$ be $n$ points. We define the empirical risk for the set $\{\mathbf{x}_i\}$ to be

\nu_n(\{\mathbf{x}_i\}) = \frac{1}{n} \sum_{i=1}^n I_{\mathbf{x}_i \in A} \qquad (A.10)

where $I$ is the indicator function.
Note that the empirical risk is zero if $\Phi(\mathbf{x}_i) = 0$ $\forall \mathbf{x}_i$.

Definition: We define the actual risk for the function $\Phi$ to be

\nu(\Phi) = P(\mathbf{x} \in A). \qquad (A.11)

Note also that those points $\mathbf{x}$ for which $\Phi(\mathbf{x}) = 0$ do not contribute to the actual risk.

Definition: For fixed $(\mathbf{x}_1, \ldots, \mathbf{x}_n) \in R^d$, let $N_{\mathcal{A}}$ be the number of different sets in

\{\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \cap A : A \in \mathcal{A}\} \qquad (A.12)


where the sets $A$ are defined above. The $n$-th shatter coefficient of $\mathcal{A}$ is defined

s(\mathcal{A}, n) = \max_{\mathbf{x}_1, \ldots, \mathbf{x}_n \in (R^d)^n} N_{\mathcal{A}}(\mathbf{x}_1, \ldots, \mathbf{x}_n). \qquad (A.13)

We also define the VC dimension for the class $\mathcal{A}$ to be the maximum integer $k \ge 1$ for which $s(\mathcal{A}, k) = 2^k$.

Theorem 8 (adapted from Devroye, Györfi and Lugosi, 1996, Theorem 12.6): Given $\nu_n(\{\mathbf{x}_i\})$, $\nu(\Phi)$ and $s(\mathcal{A}, n)$ defined above, and given $n$ points $(\mathbf{x}_1, \ldots, \mathbf{x}_n) \in R^d$, consider only those functions $\Phi$ which satisfy $\Phi(\mathbf{x}_i) \in \{\pm 1\}$ $\forall \mathbf{x}_i$. (This restriction may be viewed as part of the training algorithm). Then for any such $\Phi$,

P(|\nu_n(\{\mathbf{x}_i\}) - \nu(\Phi)| > \epsilon) \le 8\, s(\mathcal{A}, n)\, e^{-n\epsilon^2/32} \qquad (A.14)

The proof is exactly that of (Devroye, Györfi and Lugosi, 1996), Sections 12.3, 12.4 and 12.5, Theorems 12.5 and 12.6. We have dropped the "sup" to emphasize that this holds for any of the functions $\Phi$. In particular, it holds for those $\Phi$ which minimize the empirical error and for which all training data take the values $\{\pm 1\}$. Note however that the proof only holds for the second definition of shattering given above. Finally, note that the usual form of the VC bounds is easily derived from Eq. (A.14) by using $s(\mathcal{A}, n) \le (en/h)^h$ (where $h$ is the VC dimension) (Vapnik, 1995), setting $\delta = 8\, s(\mathcal{A}, n)\, e^{-n\epsilon^2/32}$, and solving for $\epsilon$.

Clearly these results apply to our gap tolerant classifiers of Section 7.1. For them, a particular classifier $\Phi$ is specified by a set of parameters $\{B, H, M\}$, where $B$ is a ball in $R^d$, $D \in R$ is the diameter of $B$, $H$ is a $d - 1$ dimensional oriented hyperplane in $R^d$, and $M \in R$ is a scalar which we have called the margin. $H$ itself is specified by its unit normal (whose direction specifies which points $H_+$ ($H_-$) are labeled positive (negative) by the function), and by the minimal distance from $H$ to the origin. For a given $\Phi$, the margin set $S_M$ is defined as the set consisting of those points whose minimal distance to $H$ is less than $M/2$. Define $Z \equiv B \setminus S_M$ (the points of the ball not lying in the margin set), $Z_+ \equiv Z \cap H_+$, and $Z_- \equiv Z \cap H_-$. The function $\Phi$ is then defined as follows:

\Phi(\mathbf{x}) = 1 \;\; \forall \mathbf{x} \in Z_+, \qquad \Phi(\mathbf{x}) = -1 \;\; \forall \mathbf{x} \in Z_-, \qquad \Phi(\mathbf{x}) = 0 \;\; \text{otherwise} \qquad (A.15)

and the corresponding sets $A$ as in Eq. (A.9).

Notes

1. K. Müller, Private Communication.

2. The reader in whom this elicits a sinking feeling is urged to study (Strang, 1986; Fletcher, 1987; Bishop, 1995). There is a simple geometrical interpretation of Lagrange multipliers: at a boundary corresponding to a single constraint, the gradient of the function being extremized must be parallel to the gradient of the function whose contours specify the boundary. At a boundary corresponding to the intersection of constraints, the gradient must be parallel to a linear combination (non-negative in the case of inequality constraints) of the gradients of the functions whose contours specify the boundary.
Notes

1. K. Müller, Private Communication.

2. The reader in whom this elicits a sinking feeling is urged to study (Strang, 1986; Fletcher, 1987; Bishop, 1995). There is a simple geometrical interpretation of Lagrange multipliers: at a boundary corresponding to a single constraint, the gradient of the function being extremized must be parallel to the gradient of the function whose contours specify the boundary. At a boundary corresponding to the intersection of constraints, the gradient must be parallel to a linear combination (non-negative in the case of inequality constraints) of the gradients of the functions whose contours specify the boundary.

3. In this paper, the phrase "learning machine" will be used for any function estimation algorithm, "training" for the parameter estimation procedure, "testing" for the computation of the function value, and "performance" for the generalization accuracy (i.e. error rate as test set size tends to infinity), unless otherwise stated.


4. Given the name "test set," perhaps we should also use "train set"; but the hobbyists got there first.

5. We use the term "oriented hyperplane" to emphasize that the mathematical object considered is the pair $\{H, \mathbf{n}\}$, where $H$ is the set of points which lie in the hyperplane and $\mathbf{n}$ is a particular choice for the unit normal. Thus $\{H, \mathbf{n}\}$ and $\{H, -\mathbf{n}\}$ are different oriented hyperplanes.

6. Such a set of $m$ points (which span an $(m-1)$-dimensional subspace of a linear space) are said to be "in general position" (Kolmogorov, 1970). The convex hull of a set of $m$ points in general position defines an $(m-1)$-dimensional simplex, the vertices of which are the points themselves.

7. The derivation of the bound assumes that the empirical risk converges uniformly to the actual risk as the number of training observations increases (Vapnik, 1979). A necessary and sufficient condition for this is that $\lim_{l \to \infty} H(l)/l = 0$, where $l$ is the number of training samples and $H(l)$ is the VC entropy of the set of decision functions (Vapnik, 1979; Vapnik, 1995). For any set of functions with infinite VC dimension, the VC entropy is $l \log 2$: hence for these classifiers, the required uniform convergence does not hold, and so neither does the bound.

8. There is a nice geometric interpretation of the dual problem: it is basically finding the two closest points of the convex hulls of the two sets. See (Bennett and Bredensteiner, 1998).

9. One can define the torque to be

$\tau_{\mu_1 \cdots \mu_{n-2}} = \epsilon_{\mu_1 \cdots \mu_n}\, x_{\mu_{n-1}} F_{\mu_n}$ (A.16)

where repeated indices are summed over on the right hand side, and where $\epsilon$ is the totally antisymmetric tensor with $\epsilon_{1 \cdots n} = 1$. (Recall that Greek indices are used to denote tensor components.) The sum of torques on the decision sheet is then:

$\sum_i \epsilon_{\mu_1 \cdots \mu_n}\, s_{i\,\mu_{n-1}} F_{i\,\mu_n} = \sum_i \epsilon_{\mu_1 \cdots \mu_n}\, s_{i\,\mu_{n-1}}\, \alpha_i y_i \hat{w}_{\mu_n} = \epsilon_{\mu_1 \cdots \mu_n}\, w_{\mu_{n-1}} \hat{w}_{\mu_n} = 0$ (A.17)

10. In the original formulation (Vapnik, 1979) they were called "extreme vectors."

11. By "decision function" we mean a function $f(x)$ whose sign represents the class assigned to data point $x$.

12. By "intrinsic dimension" we mean the number of parameters required to specify a point on the manifold.

13. Alternatively one can argue that, given the form of the solution, the possible $\mathbf{w}$ must lie in a subspace of dimension $l$.

14. Work in preparation.

15. Thanks to A. Smola for pointing this out.

16. Many thanks to one of the reviewers for pointing this out.

17. The core quadratic optimizer is about 700 lines of C++. The higher level code (to handle caching of dot products, chunking, IO, etc.) is quite complex and considerably larger.

18. Thanks to L. Kaufman for providing me with these results.

19. Recall that the "ceiling" sign $\lceil \cdot \rceil$ means "smallest integer greater than or equal to." Also, there is a typo in the actual formula given in (Vapnik, 1995), which I have corrected here.
20. Note, for example, that the distance between every pair of vertices of the symmetric simplex is the same: see Eq. (26). However, a rigorous proof is needed, and as far as I know is lacking.

21. Thanks to J. Shawe-Taylor for pointing this out.

22. V. Vapnik, Private Communication.

23. There is an alternative bound one might use, namely that corresponding to the set of totally bounded non-negative functions (Equation (3.28) in (Vapnik, 1995)). However, for loss functions taking the value zero or one, and if the empirical risk is zero, this bound is looser than that in Eq. (3) whenever $\left(h(\log(2l/h)+1) - \log(\eta/4)\right)/l > 1/16$, which is the case here. (A small numerical check of this condition is sketched after these notes.)

24. V. Blanz, Private Communication.
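As a quick numerical illustration of the condition in Note 23, the following sketch (illustrative only; the function name and the example values of $h$, $l$ and $\eta$ are assumptions, not figures taken from the paper) evaluates the left hand side and compares it to $1/16$:

```python
import math

def alternative_bound_is_looser(h, l, eta):
    """Evaluate the condition from Note 23:
    (h * (log(2l/h) + 1) - log(eta/4)) / l  >  1/16.
    When True, the bound for totally bounded non-negative functions
    (Eq. (3.28) in Vapnik, 1995) is looser than the bound of Eq. (3)."""
    lhs = (h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l
    return lhs, lhs > 1.0 / 16.0

# Example with illustrative values: VC dimension 100, 10,000 samples, eta = 0.05.
print(alternative_bound_is_looser(h=100, l=10_000, eta=0.05))
```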


References

M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

M. Anthony and N. Biggs. PAC learning and neural networks. In The Handbook of Brain Theory and Neural Networks, pages 694–697, 1995.

K.P. Bennett and E. Bredensteiner. Geometry in learning. In Geometry at Work, Washington, D.C., 1998 (to appear). Mathematical Association of America.

C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J.C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 251–256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.

B.E. Boser, I.M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 1992. ACM.

James R. Bunch and Linda Kaufman. Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31(137):163–179, 1977.

James R. Bunch and Linda Kaufman. A computational method for the indefinite quadratic programming problem. Linear Algebra and its Applications, 34:341–370, 1980.

C.J.C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375–381, Cambridge, MA, 1997. MIT Press.

C.J.C. Burges. Simplified support vector decision rules. In Lorenza Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 71–77, Bari, Italy, 1996. Morgan Kaufmann.

C.J.C. Burges. Building locally invariant kernels. In Proceedings of the 1997 NIPS Workshop on Support Vector Machines (to appear), 1998.

C.J.C. Burges, P. Knirsch, and R. Haratsch. Support vector web page: http://svm.research.bell-labs.com. Technical report, Lucent Technologies, 1996.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, 1953.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, Applications of Mathematics Vol. 31, 1996.

H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155–161, 1997.

R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.

Stuart Geman and Elie Bienenstock. Neural networks and the bias/variance dilemma. Neural Computation, 4:1–58, 1992.
F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation (to appear); CBCL AI Memo 1606, MIT, 1998.

I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S.A. Solla. Structural risk minimization for character recognition. Advances in Neural Information Processing Systems, 4:471–479, 1992.

P.R. Halmos. A Hilbert Space Problem Book. D. Van Nostrand Company, Inc., 1967.

Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

T. Joachims. Text categorization with support vector machines. Technical report, LS VIII Number 23, University of Dortmund, 1997. ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/report23.ps.Z

L. Kaufman. Solving the QP problem for support vector training. In Proceedings of the 1997 NIPS Workshop on Support Vector Machines (to appear), 1998.

A.N. Kolmogorov and S.V. Fomin. Introductory Real Analysis. Prentice-Hall, Inc., 1970.

O.L. Mangasarian. Nonlinear Programming. McGraw-Hill, New York, 1969.

Garth P. McCormick. Nonlinear Programming: Theory, Algorithms and Applications. John Wiley and Sons, Inc., 1983.

D.C. Montgomery and E.A. Peck. Introduction to Linear Regression Analysis. John Wiley and Sons, Inc., 2nd edition, 1992.

Moré and Wright. Optimization Guide. SIAM, 1993.

Jorge J. Moré and Gerardo Toraldo. On the solution of large quadratic programming problems with bound constraints. SIAM J. Optimization, 1(1):93–113, 1991.

S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using a support vector machine. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing 7, pages 511–519, Amelia Island, FL, 1997.

K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Proceedings, International Conference on Artificial Neural Networks, page 999. Springer Lecture Notes in Computer Science, 1997.

Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing, Eds. J. Principe, L. Giles, N. Morgan, E. Wilson, pages 276–285, Amelia Island, FL, 1997.

Edgar Osuna, Robert Freund, and Federico Girosi. Training support vector machines: an application to face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 130–136, 1997.


Edgar Osuna and Federico Girosi. Reducing the run-time complexity of support vector machines. In International Conference on Pattern Recognition (submitted), 1998.

William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.

M. Schmidt. Identifying speaker with support vector networks. In Interface '96 Proceedings, Sydney, 1996.

B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997.

B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U.M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995.

B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J.C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 47–52, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.

B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, Cambridge, MA, 1998. MIT Press. In press.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998. In press.

B. Schölkopf, A. Smola, K.-R. Müller, C.J.C. Burges, and V. Vapnik. Support vector methods in learning and feature extraction. In Ninth Australian Congress on Neural Networks (to appear), 1998.

B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45:2758–2765, 1997.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. A framework for structural risk minimization. In Proceedings, 9th Annual Conference on Computational Learning Theory, pages 68–76, 1996.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. Technical report, NeuroCOLT Technical Report NC-TR-96-053, 1996.

A. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica (to appear), 1998.

A. Smola, B. Schölkopf, and K.-R. Müller. General cost functions for support vector regression. In Ninth Australian Congress on Neural Networks (to appear), 1998.

Alex J. Smola, Bernhard Schölkopf, and Klaus-Robert Müller. The connection between regularization operators and support vector kernels. Neural Networks (to appear), 1998.
M.O. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support vector ANOVA decomposition. Technical report, Royal Holloway College, Report number CSD-TR-97-22, 1997.

Gilbert Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, 1986.

R.J. Vanderbei. Interior point methods: Algorithms and formulations. ORSA J. Computing, 6(1):32–34, 1994.

R.J. Vanderbei. LOQO: An interior point code for quadratic programming. Technical report, Program in Statistics & Operations Research, Princeton University, 1994.

V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer-Verlag, New York, 1982.)

V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, in preparation.

V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems, 9:281–287, 1996.

Grace Wahba. Support vector machines, reproducing kernel Hilbert spaces and the GACV. In Proceedings of the 1997 NIPS Workshop on Support Vector Machines (to appear), 1998.

J. Weston, A. Gammerman, M.O. Stitson, V. Vapnik, V. Vovk, and C. Watkins. Density estimation using support vector machines. Technical report, Royal Holloway College, Report number CSD-TR-97-23, 1997.
