Data Mining: Practical Machine Learning Tools and Techniques
…of most promising candidates. Genetic algorithm search procedures are loosely based on the principle of natural selection: they "evolve" good feature subsets by using random perturbations of a current list of candidate subsets.

Scheme-specific selection

The performance of an attribute subset with scheme-specific selection is measured in terms of the learning scheme's classification performance using just those attributes. Given a subset of attributes, accuracy is estimated using the normal procedure of cross-validation described in Section 5.3. Of course, other evaluation methods, such as performance on a holdout set (Section 5.3) or the bootstrap estimator (Section 5.4), could equally well be used. (A sketch of this wrapper-style evaluation appears at the end of this section.)

The entire attribute selection process is computation intensive. If each evaluation involves a 10-fold cross-validation, the learning procedure must be executed 10 times. With k attributes, heuristic forward selection or backward elimination multiplies the evaluation time by a factor of up to k^2; for more sophisticated searches the penalty is far greater, up to 2^k for an exhaustive algorithm that examines each of the 2^k possible subsets.

Good results have been demonstrated on many datasets. In general terms, backward elimination produces larger attribute sets, and better classification accuracy, than forward selection. The reason is that the performance measure is only an estimate, and a single optimistic estimate will cause both of these search procedures to halt prematurely: backward elimination with too many attributes and forward selection with not enough. But forward selection is useful if the focus is on understanding the decision structures involved, because it often reduces the number of attributes with only a very small effect on classification accuracy. Experience seems to show that more sophisticated search techniques are not generally justified, although they can produce much better results in certain cases.

One way to accelerate the search process is to stop evaluating a subset of attributes as soon as it becomes apparent that it is unlikely to lead to higher accuracy than another candidate subset. This is a job for a paired statistical significance test, performed between the classifier based on this subset and all the other candidate classifiers based on other subsets. The performance difference between two classifiers on a particular test instance can be taken to be -1, 0, or 1, depending on whether the first classifier is worse than, the same as, or better than the second on that instance. A paired t-test (described in Section 5.5) can be applied to these figures over the entire test set, effectively treating the results for each instance as an independent estimate of the difference in performance; a sketch of this test also follows below. Then the cross-validation for a classifier can be terminated prematurely as soon as it turns out to be significantly worse than another, which, of course, may never happen. We might want to discard classifiers more aggressively by modifying…
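To make the wrapper evaluation concrete, here is a minimal sketch of scheme-specific forward selection, assuming scikit-learn is available. The helper names estimate_accuracy and forward_selection are illustrative, not part of any particular toolkit; the book's own software (Weka) implements the same idea differently.

    # Sketch of scheme-specific (wrapper) forward selection; assumes scikit-learn.
    from sklearn.model_selection import cross_val_score

    def estimate_accuracy(learner, X, y, subset):
        # Accuracy of the learner restricted to the given attribute subset,
        # estimated with the usual 10-fold cross-validation (Section 5.3).
        return cross_val_score(learner, X[:, list(subset)], y, cv=10).mean()

    def forward_selection(learner, X, y):
        # Greedily add the single attribute that most improves estimated
        # accuracy; stop when no addition helps. With k attributes this
        # costs up to roughly k^2/2 subset evaluations, each a full CV run.
        remaining = set(range(X.shape[1]))
        chosen, best_acc = [], 0.0
        while remaining:
            acc, attr = max((estimate_accuracy(learner, X, y, chosen + [a]), a)
                            for a in remaining)
            if acc <= best_acc:
                break  # a single optimistic estimate can halt the search here
            chosen.append(attr)
            remaining.remove(attr)
            best_acc = acc
        return chosen, best_acc

Backward elimination is the mirror image: start from the full attribute set and repeatedly drop the attribute whose removal costs least, which, as noted above, tends to retain more attributes.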
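The race-style shortcut of the final paragraph can be sketched too. The per-instance scores of -1, 0, or 1 are fed to a paired t-test, which on these scores reduces to a one-sample test of whether their mean departs from zero. This is a sketch only, assuming SciPy; the 0.05 significance level, the helper names, and the zero-variance handling are illustrative choices, not prescribed by the text.

    # Sketch of racing two candidate classifiers with a paired t-test; assumes SciPy.
    from scipy.stats import ttest_1samp

    def instance_differences(pred_a, pred_b, y_true):
        # Per-instance score: -1 if the first classifier is worse on the
        # instance, 0 if both are equally right or wrong, +1 if it is better.
        return [int(a == t) - int(b == t)
                for a, b, t in zip(pred_a, pred_b, y_true)]

    def significantly_worse(pred_a, pred_b, y_true, alpha=0.05):
        # Paired t-test on the per-instance scores, treating each instance
        # as an independent estimate of the performance difference.
        diffs = instance_differences(pred_a, pred_b, y_true)
        if len(set(diffs)) < 2:
            # Zero variance makes the t-statistic undefined; if the first
            # classifier lost on every instance, treat that as decisive.
            return bool(diffs) and diffs[0] < 0
        t_stat, p_value = ttest_1samp(diffs, 0.0)
        return t_stat < 0 and p_value < alpha

During a race, each completed fold extends the prediction lists, and a candidate subset's cross-validation is abandoned the first time it tests significantly worse than the current best, which, as the text notes, may never happen.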
