Table 1.6  The labor negotiations data.

Attribute                         Type                           1      2      3      ...   40
duration                          years                          1      2      3      ...   2
wage increase 1st year            percentage                     2%     4%     4.3%   ...   4.5
wage increase 2nd year            percentage                     ?      5%     4.4%   ...   4.0
wage increase 3rd year            percentage                     ?      ?      ?      ...   ?
cost of living adjustment         {none, tcf, tc}                none   tcf    ?      ...   none
working hours per week            hours                          28     35     38     ...   40
pension                           {none, ret-allw, empl-cntr}    none   ?      ?      ...   ?
standby pay                       percentage                     ?      13%    ?      ...   ?
shift-work supplement             percentage                     ?      5%     4%     ...   4
education allowance               {yes, no}                      yes    ?      ?      ...   ?
statutory holidays                days                           11     15     12     ...   12
vacation                          {below-avg, avg, gen}          avg    gen    gen    ...   avg
long-term disability assistance   {yes, no}                      no     ?      ?      ...   yes
dental plan contribution          {none, half, full}             none   ?      full   ...   full
bereavement assistance            {yes, no}                      no     ?      ?      ...   yes
health plan contribution          {none, half, full}             none   ?      full   ...   half
acceptability of contract         {good, bad}                    bad    good   good   ...   good

Figure 1.3(b) is a more complex decision tree that represents the same dataset. In fact, this is a more accurate representation of the actual dataset that was used to create the tree. But it is not necessarily a more accurate representation of the underlying concept of good versus bad contracts. Look down the left branch. It doesn't seem to make sense intuitively that, if the working hours exceed 36, a contract is bad if there is no health-plan contribution or a full health-plan contribution but is good if there is a half health-plan contribution. It is certainly reasonable that the health-plan contribution plays a role in the decision, but not if half is good and both full and none are bad. It seems likely that this is an artifact of the particular values used to create the decision tree rather than a genuine feature of the good-versus-bad distinction.

The tree in Figure 1.3(b) is more accurate on the data that was used to train the classifier but will probably perform less well on an independent set of test data. It is "overfitted" to the training data: it follows it too slavishly. The tree in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of pruning, which we will learn more about in Chapter 6.
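To make the contrast concrete, here is a minimal sketch of an overfitted tree versus a pruned one, assuming Python with scikit-learn. Note that scikit-learn prunes by cost-complexity pruning (its ccp_alpha parameter), which is not the pruning method developed in Chapter 6, and the labor negotiations data is not bundled with the library, so a built-in dataset and an illustrative alpha value stand in:

```python
# A minimal sketch of overfitting and pruning, assuming scikit-learn is
# installed. scikit-learn uses cost-complexity pruning (ccp_alpha), not
# the pruning method described in Chapter 6, and the labor negotiations
# data is not bundled, so a built-in dataset stands in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree, grown until every leaf is pure: like Figure 1.3(b),
# it reproduces the training data exactly.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A pruned tree: like Figure 1.3(a), it trades some training accuracy
# for a simpler structure. The ccp_alpha value here is illustrative.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
pruned.fit(X_train, y_train)

for name, tree in [("unpruned", unpruned), ("pruned", pruned)]:
    print(f"{name}: {tree.get_n_leaves()} leaves, "
          f"train accuracy {tree.score(X_train, y_train):.3f}, "
          f"test accuracy {tree.score(X_test, y_test):.3f}")
```

Typically the unpruned tree classifies the training set perfectly but does no better, and often worse, on the held-out test set, while the pruned tree gives up a little training accuracy in exchange for a much simpler structure.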

Soybean classification: A classic machine learning success

An often-quoted early success story in the application of machine learning to practical problems is the identification of rules for diagnosing soybean diseases. The data is taken from questionnaires describing plant diseases. There are about