13.07.2015 Views

Data Mining: Practical Machine Learning Tools and ... - LIDeCC


6.4 INSTANCE-BASED LEARNING

third region is where the boundary meets the lower border of the larger rectangle when projected upward and the left border of the smaller one when projected to the right. The boundary is linear in this region, because it is equidistant from these two borders. The fourth is where the boundary lies to the right of the larger rectangle but below the bottom of that rectangle. In this case the boundary is parabolic because it is the locus of points equidistant from the lower right corner of the larger rectangle and the left side of the smaller one. The fifth region lies between the two rectangles: here the boundary is vertical. The pattern is repeated in the upper right part of the diagram: first parabolic, then linear, then parabolic (although this particular parabola is almost indistinguishable from a straight line), and finally linear as the boundary finally escapes from the scope of both rectangles.

This simple situation certainly defines a complex boundary! Of course, it is not necessary to represent the boundary explicitly; it is generated implicitly by the nearest-neighbor calculation. Nevertheless, the solution is still not a very good one. Whereas taking the distance from the nearest instance within a hyperrectangle is overly dependent on the position of that particular instance, taking the distance to the nearest point of the hyperrectangle is overly dependent on that corner of the rectangle: the nearest example might be a long way from the corner.

A final problem concerns measuring the distance to hyperrectangles that overlap or are nested. This complicates the situation because an instance may fall within more than one hyperrectangle.
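
To make the "distance to the nearest point of the hyperrectangle" concrete, here is a minimal sketch in Python. It assumes axis-parallel hyperrectangles given by per-attribute lower and upper bounds and plain Euclidean distance; the function name `rect_distance` and the example coordinates are illustrative and do not come from the text.

```python
import math


def rect_distance(point, lower, upper):
    """Euclidean distance from `point` to the nearest point of the
    axis-parallel hyperrectangle with corners `lower` and `upper`.
    Returns 0.0 when the point lies inside (or on) the rectangle."""
    total = 0.0
    for x, lo, hi in zip(point, lower, upper):
        # Gap along this attribute: zero if x is within [lo, hi],
        # otherwise the distance to the closer bound.
        gap = max(lo - x, 0.0, x - hi)
        total += gap * gap
    return math.sqrt(total)


# Hypothetical rectangles: a larger one and a smaller one nested inside it.
outer = ((0.0, 0.0), (2.0, 2.0))
inner = ((0.5, 0.5), (1.5, 1.5))

d_side = rect_distance((3.0, 1.0), *outer)    # nearest point is on a side
d_corner = rect_distance((5.0, 6.0), *outer)  # nearest point is a corner
d_outer = rect_distance((1.0, 1.0), *outer)   # inside the larger rectangle
d_inner = rect_distance((1.0, 1.0), *inner)   # also inside the nested one
```

When the nearest point is on a side, only one attribute contributes to the distance; when it is a corner, several do, which is where the parabolic boundary segments come from. The last two calls both return zero, so distance alone cannot separate overlapping or nested rectangles that contain the same instance.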
A suitable heuristic for use in this case is to choose the class of the most specific hyperrectangle containing the instance, that is, the one covering the smallest area of instance space.

Whether or not overlap or nesting is permitted, the distance function should be modified to take account of both the observed prediction accuracy of exemplars and the relative importance of different features, as described in the preceding sections on pruning noisy exemplars and attribute weighting.

Generalized distance functions

There are many different ways of defining a distance function, and it is hard to find rational grounds for any particular choice. An elegant solution is to consider one instance being transformed into another through a sequence of predefined elementary operations and to calculate the probability of such a sequence occurring if operations are chosen randomly. Robustness is improved if all possible transformation paths are considered, weighted by their probabilities, and the scheme generalizes naturally to the problem of calculating the distance between an instance and a set of other instances by considering transformations to all instances in the set. Through such a technique it is possible to consider each instance as exerting a “sphere of influence,” but a sphere with soft
