deferred until most of the other instances have been taken care of, at which time tests will probably emerge that involve other attributes. Covering algorithms for decision lists have a decided advantage over decision tree algorithms in this respect: tricky examples can be left until late in the process, at which time they will appear less tricky because most of the other examples have already been classified and removed from the instance set.

Numeric attributes can be dealt with in exactly the same way as they are for trees. For each numeric attribute, instances are sorted according to the attribute's value and, for each possible threshold, a binary less-than/greater-than test is considered and evaluated in exactly the same way that a binary attribute would be.

Generating good rules

Suppose you don't want to generate perfect rules that guarantee to give the correct classification on all instances in the training set, but would rather generate "sensible" ones that avoid overfitting the training set and thereby stand a better chance of performing well on new test instances. How do you decide which rules are worthwhile? How do you tell when it becomes counterproductive to continue adding terms to a rule to exclude a few pesky instances of the wrong type, all the while excluding more and more instances of the right type, too?

Let's look at a few examples of possible rules, some good and some bad, for the contact lens problem in Table 1.1. Consider first the rule

    If astigmatism = yes and tear production rate = normal
    then recommendation = hard

This gives a correct result for four of the six cases that it covers; thus its success fraction is 4/6. Suppose we add a further term to make the rule a "perfect" one:

    If astigmatism = yes and tear production rate = normal
    and age = young then recommendation = hard

This improves accuracy to 2/2. Which rule is better? The second one is more accurate on the training data but covers only two cases, whereas the first one covers six. It may be that the second version is just overfitting the training data. For a practical rule learner we need a principled way of choosing the appropriate version of a rule, preferably one that maximizes accuracy on future test data.

Suppose we split the training data into two parts that we will call a growing set and a pruning set. The growing set is used to form a rule using the basic covering algorithm. Then a test is deleted from the rule, and the effect is evaluated by trying out the truncated rule on the pruning set and seeing whether it
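To make the success-fraction arithmetic and the pruning-set check concrete, here is a minimal Python sketch. It is not code from the book or from Weka: instances are assumed to be dictionaries keyed by the attribute names of Table 1.1, and the helper names covers, success_fraction, and prefer_truncated are illustrative inventions.

    # Minimal sketch, assuming instances are dicts keyed by attribute names
    # and the class attribute is called "recommendation" (an assumption).

    def covers(rule, instance):
        """A rule is a list of (attribute, value) tests; it covers an
        instance only if every test matches."""
        return all(instance.get(attr) == val for attr, val in rule)

    def success_fraction(rule, predicted_class, instances):
        """Of the instances the rule covers, how many actually have the
        predicted class? Returns (correct, covered)."""
        covered = [x for x in instances if covers(rule, x)]
        correct = sum(1 for x in covered
                      if x["recommendation"] == predicted_class)
        return correct, len(covered)

    # The two candidate rules for "recommendation = hard" from the text:
    rule_a = [("astigmatism", "yes"), ("tear production rate", "normal")]
    rule_b = rule_a + [("age", "young")]   # the "perfect" but narrower rule

    def prefer_truncated(rule, predicted_class, pruning_set):
        """Reduced-error style check: drop the last test and keep the
        shorter rule if it does at least as well on the pruning set."""
        truncated = rule[:-1]
        full_correct, full_n = success_fraction(rule, predicted_class, pruning_set)
        trunc_correct, trunc_n = success_fraction(truncated, predicted_class, pruning_set)
        full_acc = full_correct / full_n if full_n else 0.0
        trunc_acc = trunc_correct / trunc_n if trunc_n else 0.0
        return truncated if trunc_acc >= full_acc else rule

On the training data the first rule scores 4/6 and the extended rule 2/2, as described above; the point of setting aside a pruning set is that the comparison made in prefer_truncated uses instances the rule was not grown on.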

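The earlier remark that numeric attributes are handled just as in trees, by sorting on the attribute's value and evaluating a binary less-than/greater-than test at each possible threshold, can be sketched in the same style. Again this is only an illustration: best_threshold_test is a made-up name, and placing candidate thresholds midway between successive values is one common convention rather than anything the text prescribes.

    # Sketch of turning a numeric attribute into candidate binary tests:
    # sort the distinct values and try a < / >= split at each midpoint.

    def best_threshold_test(attribute, target_class, class_attr, instances):
        """Return (threshold, direction, success_fraction) for the binary
        split on `attribute` that best predicts `target_class`."""
        values = sorted({x[attribute] for x in instances})
        best = None
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            for direction in ("<", ">="):
                if direction == "<":
                    covered = [x for x in instances if x[attribute] < threshold]
                else:
                    covered = [x for x in instances if x[attribute] >= threshold]
                if not covered:
                    continue
                correct = sum(1 for x in covered
                              if x[class_attr] == target_class)
                frac = correct / len(covered)
                if best is None or frac > best[2]:
                    best = (threshold, direction, frac)
        return best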