N one-versus-rest classifiers (say, "one" positive, "rest" negative) for the N-class case and takes the class for a test point to be that corresponding to the largest positive distance (Boser, Guyon and Vapnik, 1992).

4.4. Global Solutions and Uniqueness

When is the solution to the support vector training problem global, and when is it unique? By "global", we mean that there exists no other point in the feasible region at which the objective function takes a lower value. We will address two kinds of ways in which uniqueness may not hold: solutions for which $\{w, b\}$ are themselves unique, but for which the expansion of $w$ in Eq. (46) is not; and solutions whose $\{w, b\}$ differ. Both are of interest: even if the pair $\{w, b\}$ is unique, if the $\alpha_i$ are not, there may be equivalent expansions of $w$ which require fewer support vectors (a trivial example of this is given below), and which therefore require fewer instructions during the test phase.

It turns out that every local solution is also global. This is a property of any convex programming problem (Fletcher, 1987). Furthermore, the solution is guaranteed to be unique if the objective function (Eq. (43)) is strictly convex, which in our case means that the Hessian must be positive definite (note that for quadratic objective functions $F$, the Hessian is positive definite if and only if $F$ is strictly convex; this is not true for non-quadratic $F$: there, a positive definite Hessian implies a strictly convex objective function, but not vice versa (consider $F = x^4$) (Fletcher, 1987)). However, even if the Hessian is positive semidefinite, the solution can still be unique: consider two points along the real line with coordinates $x_1 = 1$ and $x_2 = 2$, and with polarities $+$ and $-$. Here the Hessian is positive semidefinite, but the solution ($w = -2$, $b = 3$, $\xi_i = 0$ in Eqs. (40), (41), (42)) is unique.

It is also easy to find solutions which are not unique in the sense that the $\alpha_i$ in the expansion of $w$ are not unique: for example, consider the problem of four separable points on a square in $\mathbf{R}^2$: $x_1 = [1, 1]$, $x_2 = [-1, 1]$, $x_3 = [-1, -1]$ and $x_4 = [1, -1]$, with polarities $[+, -, -, +]$ respectively. One solution is $w = [1, 0]$, $b = 0$, $\alpha = [0.25, 0.25, 0.25, 0.25]$; another has the same $w$ and $b$, but $\alpha = [0.5, 0.5, 0, 0]$ (note that both solutions satisfy the constraints $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$). When can this occur in general? Given some solution $\alpha$, choose an $\alpha'$ which is in the null space of the Hessian $H_{ij} = y_i y_j \, x_i \cdot x_j$, and require that $\alpha'$ be orthogonal to the vector all of whose components are 1. Then adding $\alpha'$ to $\alpha$ in Eq. (43) will leave $L_D$ unchanged. If $0 \le \alpha_i + \alpha'_i \le C$ and $\alpha'$ satisfies Eq. (45), then $\alpha + \alpha'$ is also a solution.¹⁵
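The four-point example is easy to verify numerically. The following sketch (not part of the original text; it assumes numpy) checks that both expansions are feasible, give the same $w$, and attain the same value of $L_D$, and that their difference lies in the null space of the Hessian and is orthogonal to the all-ones vector, exactly as the argument above requires.

```python
import numpy as np

# Four separable points on a square in R^2, with polarities [+, -, -, +].
X = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])

# Dual Hessian H_ij = y_i y_j (x_i . x_j); here it has rank 2, so it is
# positive semidefinite but not positive definite.
H = np.outer(y, y) * (X @ X.T)

alpha0 = np.array([0.25, 0.25, 0.25, 0.25])
alpha1 = np.array([0.50, 0.50, 0.00, 0.00])

for alpha in (alpha0, alpha1):
    assert np.all(alpha >= 0) and np.isclose(alpha @ y, 0.0)  # Eqs. (44), (45)
    w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
    L_D = alpha.sum() - 0.5 * alpha @ H @ alpha  # dual objective, Eq. (43)
    print(w, L_D)                                # both print [1. 0.] 0.5

# The difference of the two expansions lies in the null space of H and is
# orthogonal to the all-ones vector, as required above.
d = alpha1 - alpha0
assert np.allclose(H @ d, 0.0) and np.isclose(d.sum(), 0.0)
```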
How about solutions where the $\{w, b\}$ are themselves not unique? (We emphasize that this can only happen in principle if the Hessian is not positive definite, and even then, the solutions are necessarily global.) The following very simple theorem shows that if non-unique solutions occur, then the solution at one optimal point is continuously deformable into the solution at the other optimal point, in such a way that all intermediate points are also solutions.

Theorem 2: Let the variable $X$ stand for the pair of variables $\{w, b\}$. Let the Hessian for the problem be positive semidefinite, so that the objective function is convex. Let $X_0$ and $X_1$ be two points at which the objective function attains its minimal value. Then there exists a path $X = X(\tau) = (1 - \tau)X_0 + \tau X_1$, $\tau \in [0, 1]$, such that $X(\tau)$ is a solution for all $\tau$.

Proof: Let the minimum value of the objective function be $F_{min}$. Then by assumption, $F(X_0) = F(X_1) = F_{min}$. By convexity of $F$, $F(X(\tau)) \le (1 - \tau)F(X_0) + \tau F(X_1) = F_{min}$. Furthermore, by linearity, the $X(\tau)$ satisfy the constraints Eqs. (40), (41): explicitly (again combining both constraints into one),

$$y_i(w_\tau \cdot x_i + b_\tau) = y_i\left((1 - \tau)(w_0 \cdot x_i + b_0) + \tau(w_1 \cdot x_i + b_1)\right) \ge (1 - \tau)(1 - \xi_i) + \tau(1 - \xi_i) = 1 - \xi_i \qquad (77)$$
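Theorem 2 is stated for the primal pair $\{w, b\}$, but the same convexity argument applies to the dual variables, since $L_D$ is a concave quadratic maximized over a set defined by linear constraints. As an illustrative sketch (not part of the original text, again assuming numpy), we can check numerically that every point on the segment between the two optimal $\alpha$ vectors of the four-point example is itself optimal:

```python
import numpy as np

# Two optimal dual points for the four-point square example above.
X = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
H = np.outer(y, y) * (X @ X.T)
alpha0 = np.array([0.25, 0.25, 0.25, 0.25])
alpha1 = np.array([0.50, 0.50, 0.00, 0.00])

L_D = lambda a: a.sum() - 0.5 * a @ H @ a
f_max = L_D(alpha0)  # optimal dual value, here 0.5

# Every intermediate point (1 - tau) * alpha0 + tau * alpha1 is feasible
# and attains the same objective value, mirroring the path of Theorem 2.
for tau in np.linspace(0.0, 1.0, 11):
    a = (1 - tau) * alpha0 + tau * alpha1
    assert np.all(a >= 0) and np.isclose(a @ y, 0.0)
    assert np.isclose(L_D(a), f_max)
```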
Although simple, this theorem is quite instructive. For example, one might think that the problems depicted in Figure 10 have several different optimal solutions (for the case of linear support vector machines). However, since one cannot smoothly move the hyperplane from one proposed solution to another without generating hyperplanes which are not solutions, we know that these proposed solutions are in fact not solutions at all. In fact, for each of these cases, the optimal unique solution is at $w = 0$, with a suitable choice of $b$ (which has the effect of assigning the same label to all the points). Note that this is a perfectly acceptable solution to the classification problem: any proposed hyperplane (with $w \ne 0$) will cause the primal objective function to take a higher value.

Figure 10. Two problems, with proposed (incorrect) non-unique solutions.

Finally, note that the fact that SVM training always finds a global solution is in contrast to the case of neural networks, where many local minima usually exist.

5. Methods of Solution

The support vector optimization problem can be solved analytically only when the number of training data is very small, or for the separable case when it is known beforehand which of the training data become support vectors (as in Sections 3.3 and 6.2). Note that this can happen when the problem has some symmetry (Section 3.3), but that it can also happen when it does not (Section 6.2). For the general analytic case, the worst case computational complexity is of order $N_S^3$ (inversion of the Hessian), where $N_S$ is the number of support vectors, although the two examples given both have complexity of $O(1)$.

However, in most real world cases, Equations (43) (with dot products replaced by kernels), (44), and (45) must be solved numerically. For small problems, any general purpose optimization package that solves linearly constrained convex quadratic programs will do (a minimal example is sketched at the end of this section). A good survey of the available solvers, and where to get them, can be found¹⁶ in (Moré and Wright, 1993).

For larger problems, a range of existing techniques can be brought to bear. A full exploration of the relative merits of these methods would fill another tutorial. Here we just describe the general issues, and for concreteness, give a brief explanation of the technique we currently use. Below, a "face" means a set of points lying on the boundary of the feasible region, and an "active constraint" is a constraint for which the equality holds. For more on nonlinear programming techniques in general, see Fletcher (1987).
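To make the small-problem route concrete, here is a minimal sketch (not from the original text) that hands the dual problem of Eqs. (43)-(45) to a general-purpose constrained optimizer. It uses scipy's SLSQP method purely as a stand-in for any of the surveyed QP packages, and tests it on the two-point example from Section 4.4, whose known solution is $w = -2$, $b = 3$.

```python
import numpy as np
from scipy.optimize import minimize

def fit_svm_dual(X, y, C=10.0):
    """Maximize L_D = sum_i alpha_i - 1/2 alpha^T H alpha subject to
    0 <= alpha_i <= C and sum_i alpha_i y_i = 0 (Eqs. (43)-(45)), by
    minimizing the negated objective with a general-purpose solver."""
    n = len(y)
    # For the nonlinear case, replace X @ X.T with a kernel matrix.
    H = np.outer(y, y) * (X @ X.T)
    objective = lambda a: -(a.sum() - 0.5 * a @ H @ a)
    constraints = [{"type": "eq", "fun": lambda a: a @ y}]
    res = minimize(objective, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=constraints)
    alpha = res.x
    w = (alpha * y) @ X
    # Recover b from the support vectors with 0 < alpha_i < C.
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b, alpha

# The two-point example from Section 4.4: expect w = -2, b = 3.
w, b, _ = fit_svm_dual(np.array([[1.0], [2.0]]), np.array([1.0, -1.0]))
print(w, b)  # approximately [-2.] 3.0
```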