The expected error for test data at a node is calculated as described previously, using the linear model for prediction. Because of the compensation factor (n + ν)/(n − ν), where n is the number of training instances that reach the node and ν is the number of parameters in the model, it may be that the linear model can be further simplified by dropping terms to minimize the estimated error. Dropping a term decreases the multiplication factor, which may be enough to offset the inevitable increase in average error over the training instances. Terms are dropped one by one, greedily, as long as the error estimate decreases (this deletion procedure is sketched in code at the end of the section).

Finally, once a linear model is in place for each interior node, the tree is pruned back from the leaves as long as the expected estimated error decreases. The expected error for the linear model at that node is compared with the expected error from the subtree below. To calculate the latter, the error from each branch is combined into a single, overall value for the node by weighting each branch by the proportion of the training instances that go down it and combining the error estimates linearly using those weights.

Nominal attributes

Before constructing a model tree, all nominal attributes are transformed into binary variables that are then treated as numeric. For each nominal attribute, the average class value corresponding to each possible value in the enumeration is calculated from the training instances, and the values in the enumeration are sorted according to these averages. Then, if the nominal attribute has k possible values, it is replaced by k − 1 synthetic binary attributes, the ith being 0 if the value is one of the first i in the ordering and 1 otherwise (the second sketch at the end of the section illustrates this conversion). Thus all splits are binary: they involve either a numeric attribute or a synthetic binary one, treated as a numeric attribute.

It is possible to prove analytically that the best split at a node for a nominal variable with k values is one of the k − 1 positions obtained by ordering the average class values for each value of the attribute. This sorting operation should really be repeated at each node; however, there is an inevitable increase in noise because of small numbers of instances at lower nodes in the tree (and in some cases nodes may not represent all values for some attributes), and not much is lost by performing the sorting just once, before starting to build a model tree.

Missing values

To take account of missing values, a modification is made to the SDR formula. The final formula, including the missing-value compensation, is

$$\mathrm{SDR} = \frac{m}{|T|} \times \left[ \mathrm{sd}(T) - \sum_{j \in \{L,R\}} \frac{|T_j|}{|T|} \times \mathrm{sd}(T_j) \right],$$

where m is the number of instances without missing values for the attribute, T is the set of instances that reach the node, and T_L and T_R are the subsets that result from the split.
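To make the greedy deletion procedure concrete, here is a minimal sketch in Python. It assumes the underlying error measure is the average absolute error of a least-squares fit on the instances at the node, inflated by the (n + ν)/(n − ν) factor; the function names (`compensated_error`, `drop_terms_greedily`) and the NumPy-based fitting are illustrative choices, not the implementation actually used in M5.

```python
import numpy as np

def compensated_error(X, y, terms):
    """Fit a least-squares model on the given attribute columns and
    return the average absolute training error inflated by the
    (n + v)/(n - v) compensation factor."""
    n = len(y)
    v = len(terms) + 1                       # parameters: one weight per term plus intercept
    if n <= v:                               # factor undefined; treat model as unusable
        return np.inf
    A = np.column_stack([X[:, terms], np.ones(n)])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    avg_abs_err = np.mean(np.abs(A @ w - y))
    return avg_abs_err * (n + v) / (n - v)

def drop_terms_greedily(X, y, terms):
    """Drop terms one by one, accepting a deletion whenever it lowers
    the compensated error estimate, until no deletion helps."""
    best = compensated_error(X, y, terms)
    improved = True
    while improved and terms:
        improved = False
        for t in list(terms):
            reduced = [u for u in terms if u != t]
            err = compensated_error(X, y, reduced)
            if err < best:                   # dropping t shrinks the factor enough
                best, terms, improved = err, reduced, True
                break
    return terms, best
```

Dropping a term raises the raw training error but lowers v, and with it the multiplication factor; the sketch simply keeps deleting while the product of the two decreases, exactly as the text describes.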
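The conversion of nominal attributes transcribes almost directly from the description above. This hypothetical helper (`binarize_nominal` is not Weka's code) computes the per-value class averages, sorts the enumeration by them, and emits the k − 1 synthetic binary indicators:

```python
from collections import defaultdict

def binarize_nominal(values, targets):
    """Replace a nominal attribute with k distinct values by k - 1
    synthetic binary attributes, ordered by average class value.
    Returns one 0/1 indicator list per synthetic attribute."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    # Sort the enumeration by the average class value of each nominal value.
    ordering = sorted(counts, key=lambda v: sums[v] / counts[v])
    rank = {v: i for i, v in enumerate(ordering)}
    k = len(ordering)
    # The ith synthetic attribute is 0 if the value is among the first i
    # values in the ordering, and 1 otherwise.
    return [[0 if rank[v] < i else 1 for v in values] for i in range(1, k)]

# Example: averages a=1.0, b=5.0, c=3.0 give the ordering a, c, b, so the
# two synthetic attributes encode the splits {a | c, b} and {a, c | b}.
print(binarize_nominal(["b", "a", "b", "c"], [6.0, 1.0, 4.0, 3.0]))
# -> [[1, 0, 1, 1], [1, 0, 1, 0]]
```

Because each synthetic attribute is a 0/1 value treated as numeric, every candidate split on it is one of the k − 1 ordered positions that the analytical result above shows is sufficient.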
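Finally, the missing-value form of SDR can be written out as a short sketch. It assumes sd is the population standard deviation of the class values, and that `left` and `right` hold the classes of the m instances whose value for the splitting attribute is known, partitioned by the candidate split; how the instances with missing values are actually routed down the tree is a separate matter not covered by this formula. The function name `sdr_with_missing` is illustrative.

```python
import math

def sd(xs):
    """Population standard deviation of a list of class values."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def sdr_with_missing(all_classes, left, right):
    """SDR = m/|T| * [sd(T) - sum over j in {L,R} of |T_j|/|T| * sd(T_j)].
    all_classes: class values of every instance in T reaching the node.
    left, right: class values of the instances with a known value for the
    splitting attribute, partitioned by the candidate split."""
    T = len(all_classes)
    m = len(left) + len(right)          # instances without missing values
    weighted = sum(len(s) / T * sd(s) for s in (left, right) if s)
    return (m / T) * (sd(all_classes) - weighted)

# Example: ten instances reach the node, eight have a known attribute value.
print(sdr_with_missing([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                       left=[1, 2, 3, 4], right=[7, 8, 9, 10]))
```

The leading factor m/|T| shrinks the reduction in proportion to how many instances have a missing value for the attribute, so splits on mostly-missing attributes are penalized.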
