Note that we have now given the Lagrangian different labels ($P$ for primal, $D$ for dual) to emphasize that the two formulations are different: $L_P$ and $L_D$ arise from the same objective function but with different constraints, and the solution is found by minimizing $L_P$ or by maximizing $L_D$. Note also that if we formulate the problem with $b = 0$, which amounts to requiring that all hyperplanes contain the origin, the constraint (15) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.

Support vector training (for the separable, linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to constraints (15) and positivity of the $\alpha_i$, with solution given by (14). Notice that there is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, those points for which $\alpha_i > 0$ are called ``support vectors,'' and lie on one of the hyperplanes $H_1$, $H_2$. All other training points have $\alpha_i = 0$ and lie either on $H_1$ or $H_2$ (such that the equality in Eq. (12) holds), or on that side of $H_1$ or $H_2$ such that the strict inequality in Eq. (12) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross $H_1$ or $H_2$), and training was repeated, the same separating hyperplane would be found.
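The dual maximization can be sketched numerically on a minimal example. The two-point dataset below is hypothetical (one point per class, symmetric about the origin), and the equality constraint $\sum_i \alpha_i y_i = 0$ from (15) is used to reduce $L_D$ to a function of a single scalar, which is then maximized by a simple grid search; this is only an illustration, not a practical training procedure:

```python
import numpy as np

# Hypothetical toy data: one training point per class, linearly separable.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

# Dual objective L_D(alpha) = sum_i alpha_i - 0.5 * alpha^T H alpha,
# where H_ij = y_i y_j x_i . x_j.
H = (y[:, None] * y[None, :]) * (X @ X.T)

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here,
# so L_D reduces to the concave 1-D function L_D(a) = 2a - 0.5 * a^2 * sum(H).
a_grid = np.linspace(0.0, 2.0, 20001)
L_D = 2 * a_grid - 0.5 * a_grid**2 * H.sum()
alpha = np.full(2, a_grid[np.argmax(L_D)])

# Recover w from the solution formula (14): w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
print(alpha)  # ~ [0.5, 0.5] -- both points are support vectors
print(w)      # ~ [1.0, 0.0]
```

Both multipliers come out positive, so both training points are support vectors, as expected when every point lies on the margin.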
3.2. The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem above, the KKT conditions may be stated (Fletcher, 1987):

$$\frac{\partial}{\partial w_\nu} L_P = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0, \qquad \nu = 1, \dots, d \tag{17}$$

$$\frac{\partial}{\partial b} L_P = -\sum_i \alpha_i y_i = 0 \tag{18}$$

$$y_i (\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0, \qquad i = 1, \dots, l \tag{19}$$

$$\alpha_i \ge 0 \quad \forall i \tag{20}$$

$$\alpha_i \left( y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right) = 0 \quad \forall i \tag{21}$$

The KKT conditions are satisfied at the solution of any constrained optimization problem (convex or not), with any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions (see Fletcher, 1987; McCormick, 1983). This rather technical regularity assumption holds for all support vector machines, since the constraints are always linear.
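Conditions (17)-(21) are straightforward to check numerically for a candidate solution. The sketch below uses hypothetical illustrative values (a two-point toy problem with $\alpha_i = 1/2$, not data from the text); complementarity, Eq. (21), is also used to recover $b$ by averaging over the support vectors:

```python
import numpy as np

# Hypothetical candidate solution for a two-point toy problem.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

# Eq. (17): stationarity in w gives w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# Eq. (21): for any alpha_i > 0, y_i (w . x_i + b) = 1, so b = y_i - w . x_i;
# averaging over all support vectors is numerically safer.
sv = alpha > 1e-8
b = np.mean(y[sv] - X[sv] @ w)

margins = y * (X @ w + b) - 1.0
assert np.isclose((alpha * y).sum(), 0.0)   # Eq. (18)
assert np.all(margins >= -1e-12)            # Eq. (19): primal feasibility
assert np.all(alpha >= 0)                   # Eq. (20): dual feasibility
assert np.allclose(alpha * margins, 0.0)    # Eq. (21): complementarity
print(w, b)
```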
Furthermore, the problem for SVMs is convex (a convex objective function, with constraints which give a convex feasible region), and for convex problems (if the regularity condition holds), the KKT conditions are necessary and sufficient for $\mathbf{w}$, $b$, $\alpha$ to be a solution (Fletcher, 1987). Thus solving the SVM problem is equivalent to finding a solution to the KKT conditions. This fact results in several approaches to finding the solution (for example, the primal-dual path following method mentioned in Section 5).

As an immediate application, note that, while $\mathbf{w}$ is explicitly determined by the training procedure, the threshold $b$ is not, although it is implicitly determined. However $b$ is easily found by using the KKT ``complementarity'' condition, Eq. (21), by choosing any $i$ for
which $\alpha_i \neq 0$ and computing $b$ (note that it is numerically safer to take the mean value of $b$ resulting from all such equations).

Notice that all we've done so far is to cast the problem into an optimization problem where the constraints are rather more manageable than those in Eqs. (10), (11). Finding the solution for real world problems will usually require numerical methods. We will have more to say on this later. However, let's first work out a rare case where the problem is nontrivial (the number of dimensions is arbitrary, and the solution certainly not obvious), but where the solution can be found analytically.

3.3. Optimal Hyperplanes: An Example

While the main aim of this Section is to explore a non-trivial pattern recognition problem where the support vector solution can be found analytically, the results derived here will also be useful in a later proof. For the problem considered, every training point will turn out to be a support vector, which is one reason we can find the solution analytically.

Consider $n+1$ symmetrically placed points lying on a sphere $S^{n-1}$ of radius $R$: more precisely, the points form the vertices of an $n$-dimensional symmetric simplex. It is convenient to embed the points in $\mathbf{R}^{n+1}$ in such a way that they all lie in the hyperplane which passes through the origin and which is perpendicular to the $(n+1)$-vector $(1, 1, \dots, 1)$ (in this formulation, the points lie on $S^{n-1}$, they span $\mathbf{R}^n$, and are embedded in $\mathbf{R}^{n+1}$).
Explicitly, recalling that vectors themselves are labeled by Roman indices and their coordinates by Greek, the coordinates are given by:

$$x_{i\mu} = -(1 - \delta_{i\mu}) \sqrt{\frac{R^2}{n(n+1)}} + \delta_{i\mu} \sqrt{\frac{R^2 n}{n+1}} \tag{22}$$

where the Kronecker delta, $\delta_{i\mu}$, is defined by $\delta_{i\mu} = 1$ if $\mu = i$, $0$ otherwise. Thus, for example, the vectors for three equidistant points on the unit circle (see Figure 12) are:

$$x_1 = \left( \sqrt{\tfrac{2}{3}},\; \tfrac{-1}{\sqrt{6}},\; \tfrac{-1}{\sqrt{6}} \right)$$
$$x_2 = \left( \tfrac{-1}{\sqrt{6}},\; \sqrt{\tfrac{2}{3}},\; \tfrac{-1}{\sqrt{6}} \right)$$
$$x_3 = \left( \tfrac{-1}{\sqrt{6}},\; \tfrac{-1}{\sqrt{6}},\; \sqrt{\tfrac{2}{3}} \right) \tag{23}$$

One consequence of the symmetry is that the angle between any pair of vectors is the same (and is equal to $\arccos(-1/n)$):

$$\|x_i\|^2 = R^2 \tag{24}$$

$$x_i \cdot x_j = -R^2/n, \qquad i \neq j \tag{25}$$

or, more succinctly,

$$\frac{x_i \cdot x_j}{R^2} = \delta_{ij} - (1 - \delta_{ij}) \frac{1}{n}. \tag{26}$$

Assigning a class label $C \in \{+1, -1\}$ arbitrarily to each point, we wish to find that hyperplane which separates the two classes with widest margin. Thus we must maximize
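The coordinates in Eq. (22) and the identities (24)-(26) are easy to verify numerically. The helper below is a sketch (the function name `simplex_vertices` is ours, not from the text), assuming NumPy:

```python
import numpy as np

def simplex_vertices(n, R=1.0):
    """Vertices of an n-dimensional symmetric simplex embedded in R^{n+1}:
    x_{i,mu} = -(1 - delta_{i,mu}) * R/sqrt(n(n+1))
               + delta_{i,mu} * R * sqrt(n/(n+1))."""
    off = -R / np.sqrt(n * (n + 1))        # off-diagonal coordinate
    diag = R * np.sqrt(n / (n + 1))        # coordinate where mu = i
    X = np.full((n + 1, n + 1), off)
    np.fill_diagonal(X, diag)
    return X

X = simplex_vertices(2)  # three equidistant points on the unit circle, Eq. (23)
G = X @ X.T              # Gram matrix of pairwise dot products

print(np.diag(G))        # ||x_i||^2 = R^2 = 1 for every point, Eq. (24)
print(G[0, 1])           # x_i . x_j = -R^2/n = -0.5 for i != j, Eq. (25)
print(X.sum(axis=1))     # each point lies in the hyperplane perp. to (1,...,1)
```

The last check confirms the embedding: every vertex is orthogonal to the $(n+1)$-vector $(1, 1, \dots, 1)$, so the points span only the $n$-dimensional subspace, as stated above.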