Data Mining: Practical Machine Learning Tools and ... - LIDeCC

through the dataset for each different size of item set. Sometimes the dataset is too large to read in to main memory and must be kept on disk; then it may be worth reducing the number of passes by checking item sets of two consecutive sizes in one go. For example, once sets with two items have been generated, all sets of three items could be generated from them before going through the instance set to count the actual number of items in the sets. More three-item sets than necessary would be considered, but the number of passes through the entire dataset would be reduced.

In practice, the amount of computation needed to generate association rules depends critically on the minimum coverage specified. The accuracy has less influence because it does not affect the number of passes that we must make through the dataset. In many situations we will want to obtain a certain number of rules—say 50—with the greatest possible coverage at a prespecified minimum accuracy level. One way to do this is to begin by specifying the coverage to be rather high and then successively reduce it, reexecuting the entire rule-finding algorithm for each coverage value and repeating this until the desired number of rules has been generated.

The tabular input format that we use throughout this book, and in particular a standard ARFF file based on it, is very inefficient for many association-rule problems. Association rules are often used when attributes are binary—either present or absent—and most of the attribute values associated with a given instance are absent. This is a case for the sparse data representation described in Section 2.4; the same algorithm for finding association rules applies.

4.6 Linear models

The methods we have been looking at for decision trees and rules work most naturally with nominal attributes.
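The coverage-reduction strategy just described can be sketched in a few lines. This is a toy illustration only, not Weka's implementation: the find_rules function (restricted to one-item rules) and the transaction data are invented for the example. Here coverage is the number of instances a rule predicts correctly, and accuracy is coverage divided by the number of instances the antecedent applies to.

```python
# Toy sketch: start with a high minimum coverage and rerun the rule
# finder with successively lower values until enough rules are found.
from itertools import permutations

def find_rules(transactions, min_coverage, min_accuracy):
    # One-item => one-item rules only, to keep the sketch short.
    # coverage = instances predicted correctly; accuracy = coverage /
    # instances the antecedent applies to.
    items = sorted({i for t in transactions for i in t})
    rules = []
    for a, b in permutations(items, 2):
        applies = sum(1 for t in transactions if a in t)
        correct = sum(1 for t in transactions if a in t and b in t)
        if applies and correct >= min_coverage and correct / applies >= min_accuracy:
            rules.append((a, b, correct, correct / applies))
    return rules

def rules_with_target_count(transactions, n_rules, min_accuracy, start_coverage):
    # Reexecute the entire rule-finding algorithm for each coverage value.
    rules = []
    for coverage in range(start_coverage, 0, -1):
        rules = find_rules(transactions, coverage, min_accuracy)
        if len(rules) >= n_rules:
            return rules, coverage
    return rules, 1

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"milk", "eggs"}, {"bread"}]
rules, used_coverage = rules_with_target_count(transactions, 3, 0.6, 3)
```

With this data no rule reaches coverage 3, so the loop lowers the threshold to 2 and finds four qualifying rules, more than the three requested.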
They can be extended to numeric attributes either by incorporating numeric-value tests directly into the decision tree or rule induction scheme, or by prediscretizing numeric attributes into nominal ones. We will see how in Chapters 6 and 7, respectively. However, there are methods that work most naturally with numeric attributes. We look at simple ones here, ones that form components of more complex learning methods, which we will examine later.

Numeric prediction: Linear regression

When the outcome, or class, is numeric, and all the attributes are numeric, linear regression is a natural technique to consider. This is a staple method in statistics. The idea is to express the class as a linear combination of the attributes, with predetermined weights:

x = w_0 + w_1 a_1 + w_2 a_2 + ... + w_k a_k
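As a concrete illustration of this equation, the weights can be estimated from data by least squares. The following NumPy sketch uses invented data whose class is an exact linear function of two attributes, so the fit recovers the weights exactly; it is a minimal example, not Weka's linear regression implementation.

```python
# Minimal least-squares fit of x = w0 + w1*a1 + ... + wk*ak (data invented).
import numpy as np

def fit_linear(A, x):
    # Prepend a column of 1s so the intercept w0 is learned like any
    # other weight, then solve the least-squares problem.
    A1 = np.column_stack([np.ones(len(A)), A])
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)
    return w

def predict(w, a):
    return w[0] + np.dot(w[1:], a)

# The class is exactly 2 + 3*a1 - a2, so the fit returns w = [2, 3, -1].
A = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 3.]])
x = 2 + 3 * A[:, 0] - A[:, 1]
w = fit_linear(A, x)
```

Real data is rarely an exact linear function of the attributes; least squares then returns the weights that minimize the total squared prediction error over the training instances.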
