Data Mining: Practical Machine Learning Tools and ... - LIDeCC
would have turned out to be useful in the learning process by using gradations that are too coarse or by unfortunate choices of boundary that needlessly lump together many instances of different classes.

Equal-interval binning often distributes instances very unevenly: some bins contain many instances, and others contain none. This can seriously impair the ability of the attribute to help to build good decision structures. It is often better to allow the intervals to be of different sizes, choosing them so that the same number of training examples fall into each one. This method, equal-frequency binning, divides the attribute's range into a predetermined number of bins based on the distribution of examples along that axis. It is sometimes called histogram equalization, because if you take a histogram of the contents of the resulting bins it will be completely flat. If you view the number of bins as a resource, this method makes best use of it.

However, equal-frequency binning is still oblivious to the instances' classes, and this can cause bad boundaries. For example, if all instances in a bin have one class, and all instances in the next higher bin have another except for the first, which has the original class, surely it makes sense to respect the class divisions and include that first instance in the previous bin, sacrificing the equal-frequency property for the sake of homogeneity. Supervised discretization, which takes classes into account during the process, certainly has advantages. Nevertheless, it has been found that equal-frequency binning can yield excellent results, at least in conjunction with the Naïve Bayes learning scheme, when the number of bins is chosen in a data-dependent fashion by setting it to the square root of the number of instances. This method is called proportional k-interval discretization.

Entropy-based discretization

Because the criterion used for splitting a numeric attribute during the formation of a decision tree works well in practice, it seems a good idea to extend it to more general discretization by recursively splitting intervals until it is time to stop. In Chapter 6 we saw how to sort the instances by the attribute's value and consider, for each possible splitting point, the information gain of the resulting split. To discretize the attribute, once the first split is determined the splitting process can be repeated in the upper and lower parts of the range, and so on, recursively.

To see this working in practice, we revisit the example on page 189 for discretizing the temperature attribute of the weather data, whose values are

64    65    68    69    70    71    72    75    80    81    83    85
yes   no    yes   yes   yes   no    no    yes   no    yes   yes   no
                                    yes   yes
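To make the recursive procedure concrete, the following Python sketch (not from the book; the names entropy and best_split are illustrative only) lists the 14 sorted (temperature, class) pairs from the table above, evaluates the information gain of every candidate cut point that lies between distinct values, and reports the best one. Repeating the same search within the lower and upper parts of the range, together with a suitable stopping criterion, would complete the discretization; both are omitted here for brevity.

from math import log2

# Sorted (temperature, class) pairs of the weather data, matching the table above
# (the repeated values 72 and 75 are written out explicitly).
data = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
        (71, "no"), (72, "no"), (72, "yes"), (75, "yes"), (75, "yes"),
        (80, "no"), (81, "yes"), (83, "yes"), (85, "no")]

def entropy(instances):
    # Information value, in bits, of the class distribution of a set of instances.
    n = len(instances)
    if n == 0:
        return 0.0
    counts = {}
    for _, cls in instances:
        counts[cls] = counts.get(cls, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_split(instances):
    # Try every cut point between distinct adjacent values and return the
    # (cut_point, information_gain) pair with the largest gain.
    instances = sorted(instances)
    n = len(instances)
    base = entropy(instances)
    best = None
    for i in range(1, n):
        lo, hi = instances[i - 1][0], instances[i][0]
        if lo == hi:                     # cannot cut between identical values
            continue
        cut = (lo + hi) / 2
        left, right = instances[:i], instances[i:]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if best is None or gain > best[1]:
            best = (cut, gain)
    return best

cut, gain = best_split(data)
print(f"first cut point: {cut}, information gain: {gain:.3f} bits")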
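For comparison with this supervised approach, the unsupervised equal-frequency binning discussed earlier can be sketched in a few lines; setting the number of bins to the square root of the number of instances gives the proportional k-interval variant. This is only an illustrative sketch, not any particular implementation: in practice a boundary that falls between two identical values (such as the repeated 72 here) would be moved so that equal values never end up in different bins.

from math import sqrt

def equal_frequency_bins(values, k):
    # Divide the sorted values into k bins holding, as nearly as possible,
    # the same number of training examples.
    values = sorted(values)
    n = len(values)
    bins, start = [], 0
    for b in range(1, k + 1):
        end = round(b * n / k)           # cumulative target keeps bin sizes balanced
        bins.append(values[start:end])
        start = end
    return bins

temperatures = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]

# Proportional k-interval discretization: the number of bins is the square
# root of the number of instances (sqrt(14), rounded, gives 4 bins).
k = round(sqrt(len(temperatures)))
for i, b in enumerate(equal_frequency_bins(temperatures, k), start=1):
    print(f"bin {i}: {b}")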
