6.2 CLASSIFICATION RULES

Initialize E to the instance set
Split E into Grow and Prune in the ratio 2:1
For each class C for which Grow and Prune both contain an instance
    Use the basic covering algorithm to create the best perfect rule for class C
    Calculate the worth w(R) for the rule on Prune, and of the rule with the
        final condition omitted w(R-)
    While w(R-) > w(R), remove the final condition from the rule and repeat the
        previous step
From the rules generated, select the one with the largest w(R)
Print the rule
Remove the instances covered by the rule from E
Continue

Figure 6.3 Algorithm for forming rules by incremental reduced-error pruning.

Using global optimization

In general, rules generated using incremental reduced-error pruning in this manner seem to perform quite well, particularly on large datasets. However, it has been found that a worthwhile performance advantage can be obtained by performing a global optimization step on the set of rules induced. The motivation is to increase the accuracy of the rule set by revising or replacing individual rules. Experiments show that both the size and the performance of rule sets are significantly improved by postinduction optimization. On the other hand, the process itself is rather complex.

To give an idea of how elaborate, and how heuristic, industrial-strength rule learners become, Figure 6.4 shows an algorithm called RIPPER, an acronym for repeated incremental pruning to produce error reduction. Classes are examined in order of increasing size, and an initial set of rules for each class is generated using incremental reduced-error pruning. An extra stopping condition is introduced that depends on the description length of the examples and rule set. The description length DL is a complex formula that takes into account the number of bits needed to send a set of examples with respect to a set of rules, the number of bits required to send a rule with k conditions, and the number of bits needed to send the integer k, times an arbitrary factor of 50% to compensate for possible redundancy in the attributes. Having produced a rule set for the class, each rule is reconsidered and two variants produced, again using reduced-error pruning; at this stage, instances covered by other rules for the class are removed from the pruning set, and success rate on the remaining instances is used as the pruning criterion. If one of the two variants yields a better description length, it replaces the rule.
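To make the procedure of Figure 6.3 concrete, here is a minimal Python sketch. It assumes instances represented as (attribute-dictionary, class-label) pairs, and it must commit to a worth measure w(R), which the figure leaves open: the overall success rate on the pruning set, (p + N - n)/(P + N), is used as one plausible choice. Re-splitting E on every iteration is one reading of the final Continue step, and the names (irep, grow_perfect_rule, covers, worth) are illustrative rather than taken from any particular implementation.

import random

def covers(conds, x):
    # A rule is a list of (attribute, value) tests; x is (attributes, class)
    return all(x[0].get(a) == v for a, v in conds)

def grow_perfect_rule(grow, c):
    # Basic covering algorithm: greedily add the test with the highest
    # accuracy on the instances still covered, until the rule is perfect
    conds, covered = [], list(grow)
    while any(cls != c for _, cls in covered):
        tests = {(a, v) for atts, _ in covered for a, v in atts.items()
                 if (a, v) not in conds}
        if not tests:
            break                      # no perfect rule exists; keep best so far
        def accuracy(t):
            sub = [x for x in covered if covers([t], x)]
            return sum(cls == c for _, cls in sub) / len(sub)
        best = max(tests, key=accuracy)
        conds.append(best)
        covered = [x for x in covered if covers([best], x)]
    return conds

def worth(conds, c, prune):
    # Assumed worth measure: overall success rate (p + N - n) / (P + N)
    P = sum(cls == c for _, cls in prune)
    N = len(prune) - P
    cov = [x for x in prune if covers(conds, x)]
    p = sum(cls == c for _, cls in cov)
    n = len(cov) - p
    return (p + N - n) / max(P + N, 1)

def irep(E, classes):
    rules, E = [], list(E)
    while E:
        random.shuffle(E)
        cut = 2 * len(E) // 3                       # Grow : Prune = 2 : 1
        grow, prune = E[:cut], E[cut:]
        candidates = []
        for c in classes:
            if any(cls == c for _, cls in grow) and any(cls == c for _, cls in prune):
                conds = grow_perfect_rule(grow, c)
                # While w(R-) > w(R), remove the final condition
                while len(conds) > 1 and worth(conds[:-1], c, prune) > worth(conds, c, prune):
                    conds = conds[:-1]
                candidates.append((conds, c))
        if not candidates:
            break
        conds, c = max(candidates, key=lambda rc: worth(rc[0], rc[1], prune))
        print(conds, "=>", c)                       # print the rule
        rules.append((conds, c))
        E = [x for x in E if not covers(conds, x)]  # remove covered instances
    return rules

Note the division of labor: growing consults only the growing set, using the basic covering algorithm described earlier in this section, while pruning decisions consult only the held-out pruning set, which is what keeps the error estimate honest.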

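The excerpt characterizes the description length DL only qualitatively, so the following is a rough sketch of its theory half under stated assumptions: the integer k is sent with a crude universal code, the rule's k conditions are identified as a subset of the n possible conditions, and the result is multiplied by the arbitrary 50% factor mentioned above; the bits needed to send the examples with respect to the rule set are omitted. The 64-bit slack in should_stop is the value used in Cohen's RIPPER implementation rather than something given in this excerpt.

from math import log2, comb

def rule_bits(k, n):
    # Theory cost of one rule with k conditions drawn from n possible ones:
    # bits for the integer k (crude universal code) plus bits to say which
    # k conditions were chosen, halved by the arbitrary 50% factor that
    # compensates for redundancy among the attributes.
    # (An assumption; the exact RIPPER formula is more involved.)
    int_bits = log2(k) + 2 * log2(log2(k + 1) + 1)
    subset_bits = log2(comb(n, k))
    return 0.5 * (int_bits + subset_bits)

def should_stop(dl_now, dl_best, slack=64):
    # Extra stopping condition: give up adding rules for the class once the
    # description length exceeds the best seen so far by `slack` bits.
    return dl_now > dl_best + slack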
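Finally, a sketch of the per-rule reconsideration step, reusing covers and grow_perfect_rule from the first sketch. The excerpt fixes the pruning setup: instances covered by the other rules for the class are removed from the pruning set, and success rate on what remains is the criterion. How the two variants are grown is not specified here; following Cohen's description of RIPPER, one is taken to be a replacement regrown from scratch and the other a revision that extends the existing rule, but both constructions are assumptions.

def optimize_rule(conds, c, other_rules, grow, prune):
    # Pruning data for this step: instances not covered by the other rules
    remaining = [x for x in prune
                 if not any(covers(r, x) for r, _ in other_rules)]

    def success_rate(cs):
        cov = [x for x in remaining if covers(cs, x)]
        return sum(cls == c for _, cls in cov) / len(cov) if cov else 0.0

    def prune_rule(cs):
        # Reduced-error pruning with success rate as the criterion
        while len(cs) > 1 and success_rate(cs[:-1]) >= success_rate(cs):
            cs = cs[:-1]
        return cs

    replacement = prune_rule(grow_perfect_rule(grow, c))   # regrown from scratch
    revision = prune_rule(conds + grow_perfect_rule(       # the rule extended
        [x for x in grow if covers(conds, x)], c))         # (assumed construction)
    # Whichever of the original rule and these two variants yields the
    # smaller overall description length is the one that is kept.
    return replacement, revision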