
A Tutorial on Support Vector Machines for Pattern Recognition

and Smola, 1996). In most of these cases, SVM generalization performance (i.e. error rates on test sets) either matches or is significantly better than that of competing methods. The use of SVMs for density estimation (Weston et al., 1997) and ANOVA decomposition (Stitson et al., 1997) has also been studied. Regarding extensions, the basic SVMs contain no prior knowledge of the problem (for example, a large class of SVMs for the image recognition problem would give the same results if the pixels were first permuted randomly (with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped), and much work has been done on incorporating prior knowledge into SVMs (Scholkopf, Burges and Vapnik, 1996; Scholkopf et al., 1998a; Burges, 1998). Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998). Recent work has generalized the basic ideas (Smola, Scholkopf and Muller, 1998a; Smola and Scholkopf, 1998), shown connections to regularization theory (Smola, Scholkopf and Muller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Scholkopf, Smola and Muller, 1998b; Scholkopf et al., 1998c). The reader may also find the thesis of (Scholkopf, 1997) helpful.

The problem which drove the initial development of SVMs occurs in several guises - the bias-variance tradeoff (Geman and Bienenstock, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the "capacity" of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979).

In the following, bold typeface will indicate vector or matrix quantities; normal typeface will be used for vector and matrix components and for scalars. We will label components of vectors and matrices with Greek indices, and label vectors and matrices themselves with Roman indices.
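To make the capacity trade-off above concrete, the following is a minimal sketch, not from the tutorial, that illustrates the effect with polynomial regression rather than classification (simply because the effect is easy to display); all sample sizes, the noise level and the degrees are arbitrary illustrative choices. The high-degree fit reproduces the small training set almost exactly but generalizes poorly, while the degree-1 fit has too little capacity to follow the underlying trend.

```python
# Minimal sketch (not from the tutorial) of the capacity trade-off:
# on a small noisy sample, a high-capacity model memorizes the noise,
# a low-capacity model cannot follow the underlying trend, and an
# intermediate model generalizes best.
import numpy as np

rng = np.random.default_rng(0)

# Small, noisy training sample drawn around a simple underlying function.
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(10)

# Larger, independent test sample from the same distribution.
x_test = np.linspace(0.0, 1.0, 200)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(200)

for degree in (1, 3, 9):  # too little, moderate, too much capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Typically the degree-9 fit achieves near-zero training error (the botanist with the photographic memory) while its test error is the worst of the three.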
Familiarity with the use of Lagrange multipliers to solve problems with equality or inequality constraints is assumed.

2. A Bound on the Generalization Performance of a Pattern Recognition Learning Machine

There is a remarkable family of bounds governing the relation between the capacity of a learning machine and its performance. The theory grew out of considerations of under what circumstances, and how quickly, the mean of some empirical quantity converges uniformly, as the number of data points increases, to the true mean (that which would be calculated from an infinite amount of data) (Vapnik, 1979). Let us start with one of these bounds.

The notation here will largely follow that of (Vapnik, 1995). Suppose we are given $l$ observations. Each observation consists of a pair: a vector $\mathbf{x}_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, and the associated "truth" $y_i$, given to us by a trusted source. In the tree recognition problem, $\mathbf{x}_i$ might be a vector of pixel values (e.g. $n = 256$ for a 16x16 image), and $y_i$ would be 1 if the image contains a tree, and -1 otherwise (we use -1 here rather than 0 to simplify subsequent formulae).
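As a concrete illustration of this setup (an assumed representation in code, not part of the tutorial itself), the $l$ labelled observations could be held as an $l \times n$ array of pixel vectors together with a vector of $\pm 1$ labels; the array contents below are random placeholders.

```python
# Minimal sketch (assumed representation, not from the tutorial) of the
# training data: l observations, each a pixel vector x_i in R^n together
# with a label y_i in {-1, +1} ("tree" / "not a tree").
import numpy as np

l, n = 100, 256                   # number of observations; n = 256 for a 16x16 image

rng = np.random.default_rng(0)
X = rng.random((l, n))            # placeholder pixel vectors x_1, ..., x_l (one per row)
y = rng.choice([-1, 1], size=l)   # placeholder labels y_1, ..., y_l

assert X.shape == (l, n)
assert set(np.unique(y)) <= {-1, 1}
```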

Now it is assumed that there exists some unknown probability distribution