Data Mining: Practical Machine Learning Tools and Techniques

6.5 NUMERIC PREDICTION

…by split, and pruning it from the leaves upward, performed by prune. The node data structure contains a type flag indicating whether it is an internal node or a leaf, pointers to the left and right child, the set of instances that reach that node, the attribute that is used for splitting at that node, and a structure representing the linear model for the node.

The sd function, called at the beginning of the main program and again at the beginning of split, calculates the standard deviation of the class values of a set of instances. Then follows the procedure for obtaining synthetic binary attributes that was described previously. Standard procedures for creating new nodes and printing the final tree are not shown. In split, sizeof returns the number of elements in a set. Missing attribute values are dealt with as described earlier. The SDR is calculated according to the equation at the beginning of the previous subsection. Although not shown in the code, the SDR is set so as to rule out any split that would create a leaf with fewer than two instances. In prune, the linearRegression routine recursively descends the subtree collecting attributes, performs a linear regression on the instances at that node as a function of those attributes, and then greedily drops terms if doing so improves the error estimate, as described earlier. Finally, the error function returns

\[
\frac{n+\nu}{n-\nu} \times \frac{\sum_{\text{instances}} \left| \text{deviation from predicted class value} \right|}{n},
\]

where n is the number of instances at the node and ν is the number of parameters in the node's linear model. (A code sketch of these routines appears at the end of this section.)

Figure 6.16 gives an example of a model tree formed by this algorithm for a problem with two numeric and two nominal attributes. What is to be predicted is the rise time of a simulated servo system involving a servo amplifier, motor, lead screw, and sliding carriage. The nominal attributes play important roles. Four synthetic binary attributes have been created for each of the five-valued nominal attributes motor and screw, and they are shown in Table 6.1 in terms of the two sets of values to which they correspond. The ordering of these values (D, E, C, B, A for motor and, coincidentally, D, E, C, B, A for screw as well) is determined from the training data: the rise time averaged over all examples for which motor = D is less than that averaged over examples for which motor = E, which in turn is less than when motor = C, and so on. It is apparent from the magnitude of the coefficients in Table 6.1 that motor = D versus E, C, B, A plays a leading role in the LM2 model, and motor = D, E versus C, B, A plays a leading role in LM1. Both motor and screw also play minor roles in several of the models. The decision tree shows a three-way split on a numeric attribute. First a binary-splitting tree was generated in the usual way. It turned out that the root and one of its descendants tested the same attribute, pgain, and a simple algorithm was…
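To make the machinery concrete, here is a minimal sketch in Python of the node data structure and the routines the text describes. It is illustrative rather than a transcription of the book's pseudocode: the names Node, sd, sdr, and error follow the text, but the instance representation (a list of attribute-vector/class-value pairs) and the predict helper are assumptions made only so that the example is self-contained.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Optional, Tuple

Instance = Tuple[List[float], float]  # (attribute vector, class value)

@dataclass
class Node:
    """Model-tree node as described in the text: a type flag, child
    pointers, the instances reaching the node, the splitting attribute,
    and the node's linear model (here, a coefficient vector)."""
    is_leaf: bool
    instances: List[Instance]
    attribute: Optional[int] = None       # splitting attribute (internal nodes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    model: Optional[List[float]] = None   # linear-model coefficients (leaves)

def sd(instances: List[Instance]) -> float:
    """Standard deviation of the class values of a set of instances."""
    values = [cls for _, cls in instances]
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(instances: List[Instance],
        subsets: List[List[Instance]]) -> float:
    """Standard deviation reduction for a candidate split. A split that
    would create a leaf with fewer than two instances is ruled out."""
    if any(len(s) < 2 for s in subsets):
        return float("-inf")              # never chosen when maximizing SDR
    return sd(instances) - sum(
        len(s) / len(instances) * sd(s) for s in subsets)

def error(node: Node,
          predict: Callable[[List[float], List[float]], float]) -> float:
    """Pruning error estimate: the average absolute deviation of the
    class value from the model's prediction, inflated by the factor
    (n + v)/(n - v), where n is the number of instances at the node and
    v is the number of parameters in the node's linear model."""
    n = len(node.instances)
    v = len(node.model)
    avg_dev = sum(abs(cls - predict(node.model, x))
                  for x, cls in node.instances) / n
    return (n + v) / (n - v) * avg_dev
```

Under these assumptions, split would choose the attribute and split point that maximize sdr, and prune would replace a subtree by a leaf whenever the leaf's error estimate is no worse than the subtree's, exactly the leaves-upward pruning the text describes.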
