The maximum margin hyperplane is relatively stable: it only moves if training instances that are support vectors are added or deleted, and this is true even in the high-dimensional space spanned by the nonlinear transformation. Overfitting is caused by too much flexibility in the decision boundary. The support vectors are global representatives of the whole set of training points, and there are usually few of them, which gives little flexibility. Thus overfitting is unlikely to occur.

What about computational complexity? This is still a problem. Suppose that the transformed space is a high-dimensional one, so that the transformed support vectors and test instance have many components. According to the preceding equation, every time an instance is classified its dot product with all support vectors must be calculated. In the high-dimensional space produced by the nonlinear mapping this is rather expensive. Obtaining the dot product involves one multiplication and one addition for each attribute, and the number of attributes in the new space can be huge. This problem occurs not only during classification but also during training, because the optimization algorithms have to calculate the same dot products very frequently.

Fortunately, it turns out that it is possible to calculate the dot product before the nonlinear mapping is performed, on the original attribute set. A high-dimensional version of the preceding equation is simply

x = b + \sum_{i \text{ is supp. vector}} \alpha_i y_i \, (\mathbf{a}(i) \cdot \mathbf{a})^n,

where n is chosen as the number of factors in the transformation (three in the example we used earlier). If you expand the term (a(i) · a)^n, you will find that it contains all the high-dimensional terms that would have been involved if the test and training vectors were first transformed by including all products of n factors and the dot product was taken of the result. (If you actually do the calculation, you will notice that some constant factors, binomial coefficients, are introduced. However, these do not matter: it is the dimensionality of the space that concerns us; the constants merely scale the axes.) Because of this mathematical equivalence, the dot products can be computed in the original low-dimensional space, and the problem becomes feasible. In implementation terms, you take a software package for constrained quadratic optimization and, every time a(i) · a is evaluated, you evaluate (a(i) · a)^n instead. It's as simple as that, because in both the optimization and the classification algorithms these vectors are only ever used in this dot product form. The training vectors, including the support vectors, and the test instance all remain in the original low-dimensional space throughout the calculations.

The function (x · y)^n, which computes the dot product of two vectors x and y and raises the result to the power n, is called a polynomial kernel.
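As a concrete illustration of this substitution, the sketch below classifies a test instance by evaluating the preceding sum with the polynomial kernel computed directly in the original attribute space. It is a minimal example rather than code from the book or from any particular package: the class and member names (PolynomialKernelClassifier, supportVectors, coef, bias, degree) are invented for the illustration, and it assumes the coefficients alpha_i * y_i, the bias b, and the support vectors have already been obtained from a constrained quadratic optimizer.

public class PolynomialKernelClassifier {

    private final double[][] supportVectors; // each row is one support vector a(i)
    private final double[] coef;             // alpha_i * y_i for each support vector
    private final double bias;               // the constant b
    private final int degree;                // the exponent n of the polynomial kernel

    public PolynomialKernelClassifier(double[][] supportVectors, double[] coef,
                                      double bias, int degree) {
        this.supportVectors = supportVectors;
        this.coef = coef;
        this.bias = bias;
        this.degree = degree;
    }

    // Ordinary dot product in the original low-dimensional attribute space.
    private static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int j = 0; j < x.length; j++) {
            sum += x[j] * y[j];
        }
        return sum;
    }

    // Polynomial kernel: the dot product raised to the power n. This stands in
    // for the dot product of the explicitly transformed vectors.
    private double kernel(double[] x, double[] y) {
        return Math.pow(dot(x, y), degree);
    }

    // Evaluates x = b + sum_i alpha_i y_i (a(i) . a)^n for a test instance.
    public double decisionValue(double[] instance) {
        double x = bias;
        for (int i = 0; i < supportVectors.length; i++) {
            x += coef[i] * kernel(supportVectors[i], instance);
        }
        return x;
    }

    // The sign of the decision value gives the predicted class.
    public int classify(double[] instance) {
        return decisionValue(instance) >= 0 ? +1 : -1;
    }
}

Because the attribute vectors are only ever touched inside kernel(), swapping in a different kernel function, or the explicit high-dimensional transformation, would change nothing else in either training or classification.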
