A Tutorial on Support Vector Machines for Pattern Recognition

is not valid, nearest neighbour classifiers can still perform well. Thus this first example is a cautionary tale: infinite "capacity" does not guarantee poor performance.

Let's follow the time honoured tradition of understanding things by trying to break them, and see if we can come up with a classifier for which the bound is supposed to hold, but which violates the bound. We want the left hand side of Eq. (3) to be as large as possible, and the right hand side to be as small as possible. So we want a family of classifiers which gives the worst possible actual risk of 0.5, zero empirical risk up to some number of training observations, and whose VC dimension is easy to compute and is less than l (so that the bound is non trivial). An example is the following, which I call the "notebook classifier." This classifier consists of a notebook with enough room to write down the classes of m training observations, where m ≤ l. For all subsequent patterns, the classifier simply says that all patterns have the same class. Suppose also that the data have as many positive (y = +1) as negative (y = -1) examples, and that the samples are chosen randomly. The notebook classifier will have zero empirical risk for up to m observations; 0.5 training error for all subsequent observations; 0.5 actual error; and VC dimension h = m. Substituting these values in Eq. (3), the bound becomes:

\frac{m}{4l} \le \ln(2l/m) + 1 - \frac{1}{m}\ln(\eta/4)    (8)

which is certainly met for all η if

f(z) = \left(\frac{z}{2}\right) e^{(z/4 - 1)} \le 1, \qquad z \equiv m/l, \quad 0 \le z \le 1    (9)

which is true, since f(z) is monotonic increasing, and f(z = 1) = 0.236 (a numerical check of this claim is sketched below, after the discussion of SRM).

2.6. Structural Risk Minimization

We can now summarize the principle of structural risk minimization (SRM) (Vapnik, 1979). Note that the VC confidence term in Eq. (3) depends on the chosen class of functions, whereas the empirical risk and actual risk depend on the one particular function chosen by the training procedure. We would like to find that subset of the chosen set of functions such that the risk bound for that subset is minimized. Clearly we cannot arrange things so that the VC dimension h varies smoothly, since it is an integer. Instead, we introduce a "structure" by dividing the entire class of functions into nested subsets (Figure 4). For each subset, we must be able either to compute h, or to get a bound on h itself. SRM then consists of finding that subset of functions which minimizes the bound on the actual risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal.

[Figure 4. Nested subsets of functions, ordered by VC dimension (h1 < h2 < h3 < ...).]
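As a quick numerical check of Eqs. (8) and (9), the short sketch below (not part of the original text; the helper names and the sample values of m, l and η are illustrative) evaluates f(z) = (z/2) exp(z/4 - 1) on a grid, confirms that it is monotonic increasing with f(1) ≈ 0.236, and tests inequality (8) directly for a few choices of m and l:

```python
import math

def f(z):
    # f(z) = (z/2) * exp(z/4 - 1), as in Eq. (9)
    return (z / 2.0) * math.exp(z / 4.0 - 1.0)

def bound_holds(m, l, eta):
    # Eq. (8): m/(4l) <= ln(2l/m) + 1 - (1/m) ln(eta/4)
    return m / (4.0 * l) <= math.log(2.0 * l / m) + 1.0 - math.log(eta / 4.0) / m

# f is claimed monotonic increasing on (0, 1], so its maximum over 0 <= z <= 1 is f(1).
zs = [k / 1000.0 for k in range(1, 1001)]
vals = [f(z) for z in zs]
assert all(a < b for a, b in zip(vals, vals[1:]))   # monotonic increasing
print(f"f(1) = {f(1.0):.3f}")                       # ~0.236, comfortably below 1

# Spot-check Eq. (8) for a few illustrative (m, l) pairs and eta = 0.05.
for m, l in [(10, 100), (50, 1000), (999, 1000)]:
    print(m, l, bound_holds(m, l, eta=0.05))
```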
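To make the SRM procedure concrete, here is a minimal sketch under assumed conditions: a toy one-dimensional data set, with sign-of-polynomial classifiers of increasing degree standing in for the nested subsets of Figure 4 (the data, the classifier family, and the least-squares training step are illustrative choices, not the paper's construction; the VC dimension of the sign of a degree-d polynomial on the line is d + 1). It trains one machine per subset and keeps the one with the smallest sum of empirical risk and VC confidence:

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    # VC confidence term of Eq. (3): sqrt((h (ln(2l/h) + 1) - ln(eta/4)) / l)
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

# Toy 1-D data set (illustrative only): the label flips once along the x axis, plus noise.
rng = np.random.default_rng(0)
l = 200
x = rng.uniform(-1.0, 1.0, size=l)
y = np.where(x + 0.1 * rng.standard_normal(l) > 0.0, 1.0, -1.0)

results = []
for d in range(1, 8):                    # nested subsets: polynomial degree d = 1, 2, ...
    coeffs = np.polyfit(x, y, deg=d)     # least-squares fit as a surrogate for empirical risk minimization
    pred = np.where(np.polyval(coeffs, x) >= 0.0, 1.0, -1.0)
    emp_risk = float(np.mean(pred != y))
    h = d + 1                            # VC dimension of sign(degree-d polynomial) on the line
    results.append((emp_risk + vc_confidence(h, l), d, emp_risk))

bound, d, emp_risk = min(results)        # SRM: keep the machine with the smallest risk bound
print(f"chosen degree {d}: empirical risk {emp_risk:.3f}, bound on actual risk {bound:.3f}")
```

The least-squares fit only approximates empirical risk minimization; the point of the sketch is the selection rule, i.e. picking the subset whose trained machine minimizes empirical risk plus VC confidence.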
