Data Mining: Practical Machine Learning Tools and ... - LIDeCC
196 CHAPTER 6 | IMPLEMENTATIONS: REAL MACHINE LEARNING SCHEMES

Complexity of decision tree induction

Now that we have learned how to accomplish the pruning operations, we have finally covered all the central aspects of decision tree induction. Let's take stock and consider the computational complexity of inducing decision trees. We will use the standard order notation: O(n) stands for a quantity that grows at most linearly with n, O(n^2) grows at most quadratically with n, and so on.

Suppose that the training data contains n instances and m attributes. We need to make some assumption about the size of the tree, and we will assume that its depth is on the order of log n, that is, O(log n). This is the standard rate of growth of a tree with n leaves, provided that it remains "bushy" and doesn't degenerate into a few very long, stringy branches. Note that we are tacitly assuming…

[Figure 6.2 Pruning the labor negotiations decision tree: the tree splits on wage increase first year (≤2.5 / >2.5), working hours per week (≤36 / >36), and health plan contribution (none / half / full), with counts of bad and good instances at the leaves.]

…the error estimate for the working hours node, so the subtree is pruned away and replaced by a leaf node.

The estimated error figures obtained in these examples should be taken with a grain of salt because the estimate is only a heuristic one and is based on a number of shaky assumptions: the use of the upper confidence limit; the assumption of a normal distribution; and the fact that statistics from the training set are used. However, the qualitative behavior of the error formula is correct and the method seems to work reasonably well in practice. If necessary, the underlying confidence level, which we have taken to be 25%, can be tweaked to produce more satisfactory results.
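The pruning decision described above can be sketched in code. The following is an illustrative reimplementation, not Weka's actual source: it computes the pessimistic error estimate as the upper confidence limit of a normal approximation to the binomial, using z ≈ 0.69 for the default 25% confidence level, and prunes a subtree when a single leaf's estimated error is no worse than the weighted average over the subtree's leaves. The function names and the (errors, instances) input format are assumptions made for this sketch.

```python
import math

def pessimistic_error(E, N, z=0.69):
    """Upper confidence limit on the error rate of a node.

    E: errors observed on the N training instances reaching the node.
    z: normal deviate for the chosen confidence level (0.69 ~ 25%).
    Deliberately pessimistic: even E = 0 yields a positive estimate.
    """
    f = E / N  # observed (training-set) error rate
    return (f + z * z / (2 * N)
            + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
            ) / (1 + z * z / N)

def should_prune(leaf_counts, z=0.69):
    """Decide whether to replace a subtree by a single leaf.

    leaf_counts: list of (errors, instances) pairs, one per child leaf.
    Prune when the combined leaf's estimated error does not exceed the
    instance-weighted average of the children's estimated errors.
    """
    N = sum(n for _, n in leaf_counts)
    E = sum(e for e, _ in leaf_counts)
    subtree_err = sum(n * pessimistic_error(e, n, z)
                      for e, n in leaf_counts) / N
    leaf_err = pessimistic_error(E, N, z)
    return leaf_err <= subtree_err
```

Raising z (i.e., lowering the confidence level below 25%) makes the estimate more pessimistic and so prunes more aggressively, which matches the text's remark that the confidence level can be tweaked if the default gives unsatisfactory results.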
