A Tutorial on Support Vector Machines for Pattern Recognition

is not valid, nearest neighbour classifiers can still perform well. Thus this first example is a cautionary tale: infinite "capacity" does not guarantee poor performance.

Let's follow the time honoured tradition of understanding things by trying to break them, and see if we can come up with a classifier for which the bound is supposed to hold, but which violates the bound. We want the left hand side of Eq. (3) to be as large as possible, and the right hand side to be as small as possible. So we want a family of classifiers which gives the worst possible actual risk of 0.5, zero empirical risk up to some number of training observations, and whose VC dimension is easy to compute and is less than l (so that the bound is non trivial). An example is the following, which I call the "notebook classifier." This classifier consists of a notebook with enough room to write down the classes of m training observations, where m ≤ l. For all subsequent patterns, the classifier simply says that all patterns have the same class. Suppose also that the data have as many positive (y = +1) as negative (y = -1) examples, and that the samples are chosen randomly. The notebook classifier will have zero empirical risk for up to m observations; 0.5 training error for all subsequent observations; 0.5 actual error; and VC dimension h = m. Substituting these values in Eq. (3), the bound becomes:

\frac{m}{4l} \le \ln(2l/m) + 1 - \frac{1}{m}\ln(\eta/4)    (8)

which is certainly met for all η if

f(z) = \left(\frac{z}{2}\right) e^{(z/4 - 1)} \le 1, \qquad z \equiv m/l, \quad 0 \le z \le 1    (9)

which is true, since f(z) is monotonic increasing, and f(z = 1) = 0.236 (a numerical check of this claim is sketched below, after the discussion of SRM).

2.6. Structural Risk Minimization

We can now summarize the principle of structural risk minimization (SRM) (Vapnik, 1979). Note that the VC confidence term in Eq. (3) depends on the chosen class of functions, whereas the empirical risk and actual risk depend on the one particular function chosen by the training procedure. We would like to find that subset of the chosen set of functions such that the risk bound for that subset is minimized. Clearly we cannot arrange things so that the VC dimension h varies smoothly, since it is an integer. Instead, we introduce a "structure" by dividing the entire class of functions into nested subsets (Figure 4). For each subset, we must be able either to compute h, or to get a bound on h itself. SRM then consists of finding that subset of functions which minimizes the bound on the actual risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal.

[Figure 4. Nested subsets of functions, ordered by VC dimension (h1 < h2 < h3 < ...).]
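As a quick numerical check of Eqs. (8) and (9), the short sketch below (not part of the original text; the helper names and the sample values of m, l and η are illustrative) evaluates f(z) = (z/2) exp(z/4 - 1) on a grid, confirms that it is monotonic increasing with f(1) ≈ 0.236, and tests inequality (8) directly for a few choices of m and l:

```python
import math

def f(z):
    # f(z) = (z/2) * exp(z/4 - 1), as in Eq. (9)
    return (z / 2.0) * math.exp(z / 4.0 - 1.0)

def bound_holds(m, l, eta):
    # Eq. (8): m/(4l) <= ln(2l/m) + 1 - (1/m) ln(eta/4)
    return m / (4.0 * l) <= math.log(2.0 * l / m) + 1.0 - math.log(eta / 4.0) / m

# f is claimed monotonic increasing on (0, 1], so its maximum over 0 <= z <= 1 is f(1).
zs = [k / 1000.0 for k in range(1, 1001)]
vals = [f(z) for z in zs]
assert all(a < b for a, b in zip(vals, vals[1:]))   # monotonic increasing
print(f"f(1) = {f(1.0):.3f}")                       # ~0.236, comfortably below 1

# Spot-check Eq. (8) for a few illustrative (m, l) pairs and eta = 0.05.
for m, l in [(10, 100), (50, 1000), (999, 1000)]:
    print(m, l, bound_holds(m, l, eta=0.05))
```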
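To make the SRM procedure concrete, here is a minimal sketch under assumed conditions: a toy one-dimensional data set, with sign-of-polynomial classifiers of increasing degree standing in for the nested subsets of Figure 4 (the data, the classifier family, and the least-squares training step are illustrative choices, not the paper's construction; the VC dimension of the sign of a degree-d polynomial on the line is d + 1). It trains one machine per subset and keeps the one with the smallest sum of empirical risk and VC confidence:

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    # VC confidence term of Eq. (3): sqrt((h (ln(2l/h) + 1) - ln(eta/4)) / l)
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

# Toy 1-D data set (illustrative only): the label flips once along the x axis, plus noise.
rng = np.random.default_rng(0)
l = 200
x = rng.uniform(-1.0, 1.0, size=l)
y = np.where(x + 0.1 * rng.standard_normal(l) > 0.0, 1.0, -1.0)

results = []
for d in range(1, 8):                    # nested subsets: polynomial degree d = 1, 2, ...
    coeffs = np.polyfit(x, y, deg=d)     # least-squares fit as a surrogate for empirical risk minimization
    pred = np.where(np.polyval(coeffs, x) >= 0.0, 1.0, -1.0)
    emp_risk = float(np.mean(pred != y))
    h = d + 1                            # VC dimension of sign(degree-d polynomial) on the line
    results.append((emp_risk + vc_confidence(h, l), d, emp_risk))

bound, d, emp_risk = min(results)        # SRM: keep the machine with the smallest risk bound
print(f"chosen degree {d}: empirical risk {emp_risk:.3f}, bound on actual risk {bound:.3f}")
```

The least-squares fit only approximates empirical risk minimization; the point of the sketch is the selection rule, i.e. picking the subset whose trained machine minimizes empirical risk plus VC confidence.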
