6.5 NUMERIC PREDICTION

where m is the number of instances without missing values for that attribute, and T is the set of instances that reach this node. T_L and T_R are the sets that result from splitting on this attribute, because all tests on attributes are now binary.

When processing both training and test instances, once an attribute is selected for splitting it is necessary to divide the instances into subsets according to their value for this attribute. An obvious problem arises when the value is missing. An interesting technique called surrogate splitting has been developed to handle this situation. It involves finding another attribute to split on in place of the original one and using it instead. The attribute is chosen as the one most highly correlated with the original attribute. However, this technique is both complex to implement and time consuming to execute.

A simpler heuristic is to use the class value as the surrogate attribute, in the belief that, a priori, this is the attribute most likely to be correlated with the one being used for splitting. Of course, this is only possible when processing the training set, because for test examples the class is unknown. A simple solution for test examples is to replace the unknown attribute value with the average value of that attribute for the training examples that reach the node, which has the effect, for a binary attribute, of choosing the most populous subnode. This simple approach seems to work well in practice.

Let's consider in more detail how to use the class value as a surrogate attribute during the training process. We first deal with all instances for which the value of the splitting attribute is known. We determine a threshold for splitting in the usual way, by sorting the instances according to its value and, for each possible split point, calculating the SDR according to the preceding formula, choosing the split point that yields the greatest reduction in error. Only the instances for which the value of the splitting attribute is known are used to determine the split point.

Then we divide these instances into the two sets L and R according to the test. We determine whether the instances in L or R have the greater average class value, and we calculate the average of these two averages. Then, an instance for which the attribute value is unknown is placed into L or R according to whether its class value exceeds this overall average or not: if it does, it goes into whichever of L and R has the greater average class value; otherwise, it goes into the one with the smaller average class value. When the splitting stops, all the missing values are replaced by the average values of the corresponding attributes of the training instances reaching the leaves.

Pseudocode for model tree induction

Figure 6.15 gives pseudocode for the model tree algorithm we have described. The two main parts are creating a tree by successively splitting nodes, performed by split, and pruning it from the leaves upward, performed by prune.
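Before looking at the pseudocode, the missing-value treatment just described can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm of Figure 6.15: the function names, the dictionary representation of instances, and the exact weighting inside the SDR computation (taken here to be the factor m/|T| from the formula on the preceding page) are assumptions of the sketch.

from statistics import pstdev, mean

def best_split(instances, attr):
    # Choose a threshold for 'attr' using only instances whose value is known,
    # picking the split point with the greatest standard deviation reduction.
    known = sorted((x for x in instances if x[attr] is not None),
                   key=lambda x: x[attr])
    m, n = len(known), len(instances)
    if m < 2:
        return None
    all_classes = [x["class"] for x in instances]
    best_sdr, best_threshold = float("-inf"), None
    for i in range(1, m):
        if known[i - 1][attr] == known[i][attr]:
            continue                      # identical values give no split point
        left = [x["class"] for x in known[:i]]
        right = [x["class"] for x in known[i:]]
        # Assumed reading of the preceding formula: the reduction is weighted
        # by m/n, the fraction of instances at this node with a known value.
        sdr = (m / n) * (pstdev(all_classes)
                         - len(left) / n * pstdev(left)
                         - len(right) / n * pstdev(right))
        if sdr > best_sdr:
            best_sdr = sdr
            best_threshold = (known[i - 1][attr] + known[i][attr]) / 2
    return best_threshold

def split_with_surrogate(instances, attr, threshold):
    # Divide the training instances into L and R. Instances whose value for
    # 'attr' is missing are routed by the class-value surrogate: compare the
    # class value with the average of the two subsets' average class values.
    L = [x for x in instances if x[attr] is not None and x[attr] <= threshold]
    R = [x for x in instances if x[attr] is not None and x[attr] > threshold]
    avg_l = mean(x["class"] for x in L)
    avg_r = mean(x["class"] for x in R)
    overall = (avg_l + avg_r) / 2
    high, low = (L, R) if avg_l >= avg_r else (R, L)
    for x in instances:
        if x[attr] is None:
            (high if x["class"] > overall else low).append(x)
    return L, R

def route_test_instance(value, threshold, training_mean):
    # At prediction time the class is unknown, so a missing value is replaced
    # by the mean value of 'attr' over the training instances that reached the
    # node; for a binary attribute this picks the more populous branch.
    v = training_mean if value is None else value
    return "L" if v <= threshold else "R"

# Illustrative call with made-up data: one instance has a missing value and is
# routed to L or R according to its class value.
data = [{"temperature": 64, "class": 12.0},
        {"temperature": 71, "class": 30.5},
        {"temperature": None, "class": 29.0},
        {"temperature": 80, "class": 33.0}]
t = best_split(data, "temperature")
L, R = split_with_surrogate(data, "temperature", t)

The point of the surrogate heuristic is visible in split_with_surrogate: the class value stands in for the missing attribute during training, whereas route_test_instance falls back on the node's training mean because no class value is available at prediction time.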
