subset of the data and trains on that. The rest of the training data is tested on the resulting classifier, and a list of the errors is constructed, sorted by how far on the wrong side of the margin they lie (i.e. how egregiously the KKT conditions are violated). The next chunk is constructed from the first $N$ of these, combined with the $N_S$ support vectors already found, where $N + N_S$ is decided heuristically (a chunk size that is allowed to grow too quickly or too slowly will result in slow overall convergence). Note that vectors can be dropped from a chunk, and that support vectors in one chunk may not appear in the final solution. This process is continued until all data points are found to satisfy the KKT conditions.

The above method requires that the number of support vectors $N_S$ be small enough so that a Hessian of size $N_S$ by $N_S$ will fit in memory. An alternative decomposition algorithm has been proposed which overcomes this limitation (Osuna, Freund and Girosi, 1997b). Again, in this algorithm, only a small portion of the training data is trained on at a given time, and furthermore, only a subset of the support vectors need be in the "working set" (i.e. that set of points whose $\alpha$'s are allowed to vary). This method has been shown to be able to easily handle a problem with 110,000 training points and 100,000 support vectors. However, it must be noted that the speed of this approach relies on many of the support vectors having corresponding Lagrange multipliers $\alpha_i$ at the upper bound, $\alpha_i = C$.

These training algorithms may take advantage of parallel processing in several ways. First, all elements of the Hessian itself can be computed simultaneously. Second, each element often requires the computation of dot products of training data, which could also be parallelized. Third, the computation of the objective function, or gradient, which is a speed bottleneck, can be parallelized (it requires a matrix multiplication). Finally, one can envision parallelizing at a higher level, for example by training on different chunks simultaneously. Schemes such as these, combined with the decomposition algorithm of (Osuna, Freund and Girosi, 1997b), will be needed to make very large problems (i.e. $\gg$ 100,000 support vectors, with many not at bound) tractable.

5.1.2. Testing

In test phase, one must simply evaluate Eq. (61), which will require $O(M N_S)$ operations, where $M$ is the number of operations required to evaluate the kernel. For dot product and RBF kernels, $M$ is $O(d_L)$, the dimension of the data vectors. Again, both the evaluation of the kernel and of the sum are highly parallelizable procedures. In the absence of parallel hardware, one can still speed up test phase by a large factor, as described in Section 9.
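To make the $O(M N_S)$ cost concrete, here is a minimal Python sketch of such a test-phase evaluation: the decision value is a weighted sum of kernel evaluations over the $N_S$ support vectors plus a bias. The function and parameter names, and the choice of a Gaussian RBF kernel with a width parameter `gamma`, are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

def rbf_kernel(s, x, gamma=1.0):
    # One kernel evaluation costs M = O(d_L): a d_L-dimensional difference,
    # a squared norm, and an exponential.
    return np.exp(-gamma * np.dot(s - x, s - x))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # Sum over the N_S support vectors: total cost O(M * N_S).
    total = 0.0
    for s_i, a_i, y_i in zip(support_vectors, alphas, labels):
        total += a_i * y_i * kernel(s_i, x)
    # The sign of the returned value gives the predicted class.
    return total + b

if __name__ == "__main__":
    # Tiny usage example with made-up numbers.
    sv = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(svm_decision(np.array([0.5, 0.5]), sv,
                       alphas=[0.7, 0.7], labels=[+1, -1], b=0.0))
```

Both the individual kernel evaluations and the accumulation of the sum are embarrassingly parallel, which is the parallelism referred to above.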
6. The VC Dimension of Support Vector Machines

We now show that the VC dimension of SVMs can be very large (even infinite). We will then explore several arguments as to why, in spite of this, SVMs usually exhibit good generalization performance. However, it should be emphasized that these are essentially plausibility arguments. Currently there exists no theory which guarantees that a given family of SVMs will have high accuracy on a given problem.

We will call any kernel that satisfies Mercer's condition a positive kernel, and the corresponding space $H$ the embedding space. We will also call any embedding space with minimal dimension for a given kernel a "minimal embedding space". We have the following:

Theorem 3. Let $K$ be a positive kernel which corresponds to a minimal embedding space $H$. Then the VC dimension of the corresponding support vector machine (where the error penalty $C$ in Eq. (44) is allowed to take all values) is $\dim(H) + 1$.
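As a small illustrative check (using the quadratic map recalled from Section 4): for the homogeneous quadratic kernel acting on data in $\mathbf{R}^2$, a minimal embedding space is three-dimensional, since

\[
\Phi(\mathbf{x}) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)
\quad\text{satisfies}\quad
\Phi(\mathbf{x}_1) \cdot \Phi(\mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^2 ,
\]

so Theorem 3 gives a VC dimension of $\dim(H) + 1 = 4$ for this family of machines.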
Proof: If the minimal embedding space has dimension $d_H$, then $d_H$ points in the image of $L$ under the mapping $\Phi$ can be found whose position vectors in $H$ are linearly independent. From Theorem 1, these vectors can be shattered by hyperplanes in $H$. Thus by either restricting ourselves to SVMs for the separable case (Section 3.1), or to those for which the error penalty $C$ is allowed to take all values (so that, if the points are linearly separable, a $C$ can be found such that the solution does indeed separate them), the family of support vector machines with kernel $K$ can also shatter these points, and hence has VC dimension $d_H + 1$.

Let's look at two examples.

6.1. The VC Dimension for Polynomial Kernels

Consider an SVM with homogeneous polynomial kernel, acting on data in $\mathbf{R}^{d_L}$:

\[
K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^p, \qquad \mathbf{x}_1, \mathbf{x}_2 \in \mathbf{R}^{d_L} \tag{78}
\]

As in the case when $d_L = 2$ and the kernel is quadratic (Section 4), one can explicitly construct the map $\Phi$. Letting $z_i = x_{1i} x_{2i}$, so that $K(\mathbf{x}_1, \mathbf{x}_2) = (z_1 + \cdots + z_{d_L})^p$, we see that each dimension of $H$ corresponds to a term with given powers of the $z_i$ in the expansion of $K$. In fact, if we choose to label the components of $\Phi(\mathbf{x})$ in this manner, we can explicitly write the mapping for any $p$ and $d_L$:

\[
\Phi_{r_1 r_2 \cdots r_{d_L}}(\mathbf{x}) = \sqrt{\frac{p!}{r_1!\, r_2! \cdots r_{d_L}!}}\; x_1^{r_1} x_2^{r_2} \cdots x_{d_L}^{r_{d_L}}, \qquad \sum_{i=1}^{d_L} r_i = p, \quad r_i \geq 0 \tag{79}
\]

This leads to

Theorem 4. If the space in which the data live has dimension $d_L$ (i.e. $L = \mathbf{R}^{d_L}$), the dimension of the minimal embedding space, for homogeneous polynomial kernels of degree $p$ ($K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^p$, $\mathbf{x}_1, \mathbf{x}_2 \in \mathbf{R}^{d_L}$), is $\binom{d_L + p - 1}{p}$.

(The proof is in the Appendix.) Thus the VC dimension of SVMs with these kernels is $\binom{d_L + p - 1}{p} + 1$. As noted above, this gets very large very quickly.

6.2. The VC Dimension for Radial Basis Function Kernels

Theorem 5. Consider the class of Mercer kernels for which $K(\mathbf{x}_1, \mathbf{x}_2) \to 0$ as $\|\mathbf{x}_1 - \mathbf{x}_2\| \to \infty$, and for which $K(\mathbf{x}, \mathbf{x})$ is $O(1)$, and assume that the data can be chosen arbitrarily from $\mathbf{R}^d$. Then the family of classifiers consisting of support vector machines using these kernels, and for which the error penalty is allowed to take all values, has infinite VC dimension.

Proof: The kernel matrix, $K_{ij} \equiv K(\mathbf{x}_i, \mathbf{x}_j)$, is a Gram matrix (a matrix of dot products: see (Horn, 1985)) in $H$. Clearly we can choose training data such that all off-diagonal elements $K_{i \neq j}$ can be made arbitrarily small, while by assumption all diagonal elements $K_{i = j}$ are of $O(1)$. The matrix $K$ is then of full rank; hence the set of vectors whose dot products in $H$ form $K$ are linearly independent (Horn, 1985); hence, by Theorem 1, the points can be shattered by hyperplanes in $H$, and hence also by support vector machines with sufficiently large error penalty. Since this is true for any finite number of points, the VC dimension of these classifiers is infinite.
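The argument in this proof can be checked numerically. The snippet below is an illustrative sketch (the Gaussian RBF kernel, the helper name `gram_matrix`, and the spacing of the points are our own choices, not taken from the text): widely separated points yield a kernel matrix that is close to the identity, and therefore of full rank.

```python
import numpy as np

def gram_matrix(points, gamma=1.0):
    # Gaussian RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
    # This kernel satisfies the hypotheses of Theorem 5: K -> 0 as the
    # points separate, and K(x, x) = 1 = O(1).
    diffs = points[:, None, :] - points[None, :, :]
    return np.exp(-gamma * np.sum(diffs**2, axis=-1))

# Place n points far apart along one axis; off-diagonal entries become tiny.
n = 6
points = np.arange(n, dtype=float)[:, None] * 100.0
K = gram_matrix(points)
print(np.linalg.matrix_rank(K))  # -> 6: K is full rank
```

A full-rank Gram matrix means the mapped points are linearly independent in $H$, which is exactly the condition used above to conclude that the points can be shattered.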