on nonlinear programming techniques see (Fletcher, 1987; Mangasarian, 1969; McCormick, 1983). The basic recipe is to (1) note the optimality (KKT) conditions which the solution must satisfy, (2) define a strategy for approaching optimality by uniformly increasing the dual objective function subject to the constraints, and (3) decide on a decomposition algorithm so that only portions of the training data need be handled at a given time (Boser, Guyon and Vapnik, 1992; Osuna, Freund and Girosi, 1997a). We give a brief description of some of the issues involved. One can view the problem as requiring the solution of a sequence of equality constrained problems. A given equality constrained problem can be solved in one step by using the Newton method (although this requires storage for a factorization of the projected Hessian), or in at most l steps using conjugate gradient ascent (Press et al., 1992) (where l is the number of data points for the problem currently being solved: no extra storage is required). Some algorithms move within a given face until a new constraint is encountered, in which case the algorithm is restarted with the new constraint added to the list of equality constraints. This method has the disadvantage that only one new constraint is made active at a time.
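The one-step Newton solve of an equality constrained subproblem amounts to solving a single KKT linear system. A minimal sketch on a hypothetical two-variable problem (the matrices H, A and vectors c, b below are illustrative, not taken from the text):

```python
import numpy as np

# Hypothetical equality-constrained QP: maximize c^T a - 0.5 a^T H a
# subject to A a = b.  Stationarity gives H a + A^T lam = c, so one
# Newton step is one solve of the block KKT system.
H = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite Hessian
c = np.array([1.0, 1.0])
A = np.array([[1.0, -1.0]])              # single equality constraint
b = np.array([0.0])

# Assemble [[H, A^T], [A, 0]] [a; lam] = [c; b] and solve it in one step.
K = np.block([[H, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([c, b]))
a, lam = sol[:2], sol[2:]
```

The conjugate-gradient alternative mentioned in the text trades this direct solve (and its storage for a factorization) for at most l matrix-vector products.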
"Projection methods" have also been considered (Moré, 1991), where a point outside the feasible region is computed, and then line searches and projections are done so that the actual move remains inside the feasible region. This approach can add several new constraints at once. Note that in both approaches, several active constraints can become inactive in one step. In all algorithms, only the essential part of the Hessian (the columns corresponding to α_i ≠ 0) need be computed (although some algorithms do compute the whole Hessian). For the Newton approach, one can also take advantage of the fact that the Hessian is positive semidefinite by diagonalizing it with the Bunch-Kaufman algorithm (Bunch and Kaufman, 1977; Bunch and Kaufman, 1980) (if the Hessian were indefinite, it could still be easily reduced to 2×2 block diagonal form with this algorithm). In this algorithm, when a new constraint is made active or inactive, the factorization of the projected Hessian is easily updated (as opposed to recomputing the factorization from scratch). Finally, in interior point methods, the variables are essentially rescaled so as to always remain inside the feasible region. An example is the "LOQO" algorithm of (Vanderbei, 1994a; Vanderbei, 1994b), which is a primal-dual path following algorithm. This last method is likely to be useful for problems where the number of support vectors as a fraction of training sample size is expected to be large.

We briefly describe the core optimization method we currently use.¹⁷
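As an illustration of the kind of factorization mentioned above, SciPy's `scipy.linalg.ldl` computes an LDLᵀ factorization with Bunch-Kaufman-style pivoting, yielding a D with 1×1 and 2×2 blocks even when the symmetric matrix is indefinite (the matrix here is a made-up example, not one of the Hessians from the text):

```python
import numpy as np
from scipy.linalg import ldl

# Symmetric indefinite matrix: ldl still factorizes it, producing a
# block-diagonal d with 1x1 and 2x2 blocks (Bunch-Kaufman pivoting).
A = np.array([[ 2.0,  1.0,  0.0],
              [ 1.0, -1.0,  3.0],
              [ 0.0,  3.0,  1.0]])
lu, d, perm = ldl(A)
# The factors reconstruct the original matrix: lu @ d @ lu.T == A.
```

Updating such a factorization when a single constraint enters or leaves the active set, rather than refactorizing, is what makes the active-set approach economical.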
It is an active set method combining gradient and conjugate gradient ascent. Whenever the objective function is computed, so is the gradient, at very little extra cost. In phase 1, the search directions s are along the gradient. The nearest face along the search direction is found. If the dot product of the gradient with s indicates that the maximum along s lies between the current point and the nearest face, the optimal point along the search direction is computed analytically (note that this does not require a line search), and phase 2 is entered. Otherwise, we jump to the new face and repeat phase 1. In phase 2, Polak-Ribière conjugate gradient ascent (Press et al., 1992) is done, until a new face is encountered (in which case phase 1 is re-entered) or the stopping criterion is met. Note the following:

Search directions are always projected so that the α_i continue to satisfy the equality constraint Eq. (45). Note that the conjugate gradient algorithm will still work; we are simply searching in a subspace. However, it is important that this projection is implemented in such a way that not only is Eq. (45) met (easy), but also so that the angle between the resulting search direction and the search direction prior to projection is minimized (not quite so easy).
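For a single linear equality constraint of the form Σ_i α_i y_i = 0, the projection that both keeps the constraint satisfied and minimizes the angle to the unprojected direction is the orthogonal projection onto the constraint's null space. A minimal sketch (function name and data are illustrative):

```python
import numpy as np

def project_direction(d, y):
    """Orthogonally project search direction d onto the subspace y^T d = 0,
    so that moving along d preserves the constraint sum_i alpha_i y_i = 0.
    Among all feasible projections, the orthogonal one minimizes the angle
    between d and its projection."""
    return d - (y @ d) / (y @ y) * y

# Hypothetical labels and gradient direction.
y = np.array([1.0, 1.0, -1.0, -1.0])
g = np.array([0.3, -0.1, 0.7, 0.2])
d = project_direction(g, y)
```

The projected direction is unchanged by a second projection, which is one easy sanity check that the implementation is a true projection.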
We also use a "sticky faces" algorithm: whenever a given face is hit more than once, the search directions are adjusted so that all subsequent searches are done within that face. All "sticky faces" are reset (made "non-sticky") when the rate of increase of the objective function falls below a threshold.

The algorithm stops when the fractional rate of increase of the objective function F falls below a tolerance (typically 1e-10, for double precision). Note that one can also use as stopping criterion the condition that the size of the projected search direction falls below a threshold. However, this criterion does not handle scaling well.

In my opinion the hardest thing to get right is handling precision problems correctly everywhere. If this is not done, the algorithm may not converge, or may be much slower than it needs to be.

A good way to check that your algorithm is working is to check that the solution satisfies all the Karush-Kuhn-Tucker conditions for the primal problem, since these are necessary and sufficient conditions that the solution be optimal. The KKT conditions are Eqs. (48) through (56), with dot products between data vectors replaced by kernels wherever they appear (note w must be expanded as in Eq. (48) first, since w is not in general the mapping of a point in L). Thus to check the KKT conditions, it is sufficient to check that the α_i satisfy 0 ≤ α_i ≤ C, that the equality constraint (49) holds, that all points for which 0 ≤ α_i
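The KKT check described above can be sketched in code. This is a hedged sketch, not the author's implementation: `kkt_violations` is a hypothetical helper that, given the kernel Gram matrix, tests the box constraints, the equality constraint, and the complementarity conditions (α_i = 0 ⇒ y_i f(x_i) ≥ 1; 0 < α_i < C ⇒ y_i f(x_i) = 1; α_i = C ⇒ y_i f(x_i) ≤ 1):

```python
import numpy as np

def kkt_violations(alpha, y, K, b, C, tol=1e-6):
    """Check the dual-SVM KKT conditions to numerical tolerance.
    K is the kernel Gram matrix; f_i = sum_j alpha_j y_j K_ij + b.
    Returns a list of human-readable violation messages (empty if optimal)."""
    msgs = []
    if np.any(alpha < -tol) or np.any(alpha > C + tol):
        msgs.append("box constraint 0 <= alpha_i <= C violated")
    if abs(alpha @ y) > tol:
        msgs.append("equality constraint sum_i alpha_i y_i = 0 violated")
    f = K @ (alpha * y) + b
    m = y * f  # margins y_i f(x_i)
    for i in range(len(alpha)):
        if alpha[i] <= tol and m[i] < 1 - tol:
            msgs.append(f"point {i}: alpha_i = 0 but y_i f(x_i) < 1")
        elif tol < alpha[i] < C - tol and abs(m[i] - 1) > tol:
            msgs.append(f"point {i}: 0 < alpha_i < C but y_i f(x_i) != 1")
        elif alpha[i] >= C - tol and m[i] > 1 + tol:
            msgs.append(f"point {i}: alpha_i = C but y_i f(x_i) > 1")
    return msgs
```

For the trivial two-point problem x = (-1, 1), y = (-1, 1) with a linear kernel, the known optimum α = (0.5, 0.5), b = 0 passes every check, while perturbed multipliers do not.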