A Comparison of Decision Tree Ensemble Creation Techniques Ç


174 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 29, NO. 1, JANUARY 2007

TABLE 1
Selected Aspects of This Work Compared with Previous Works

available features is randomly selected and the best split available within those features is selected for that node. Also, bagging is used to create the training set of data items for each individual tree. The number of features randomly chosen (from n total) at each node is a parameter of this approach. Following [6], we considered versions of random forests created with random subsets of size 1, 2, and ⌊log2(n)⌋ + 1. Breiman reported on experiments with 20 data sets, in which each data set was randomly split 100 times into 90 percent for training and 10 percent for testing. Ensembles of size 50 were created for AdaBoost and ensembles of size 100 were created for random forests, except for the zip code data set, for which ensembles of size 200 were created. Accuracy results were averaged over the 100 train-test splits. The random forest with a single attribute randomly chosen at each node was better than AdaBoost on 11 of the 20 data sets. The random forest with ⌊log2(n)⌋ + 1 attributes was better than AdaBoost on 14 of the 20 data sets. The results were not evaluated for statistical significance.

Dietterich introduced an approach that he termed randomized C4.5 [7]. We will refer to this more generally as random trees. In this approach, at each node in the decision tree, the 20 best tests are determined and one of them is randomly selected for use at that node. With continuous attributes, it is possible that multiple tests from the same attribute will be in the top 20. Dietterich reported on experiments with 33 data sets from the UC Irvine repository.
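The node-level randomization used by random forests, as described above, can be sketched as follows. This is an illustrative sketch only, not code from any of the systems compared here; the `gain_of` callback and function names are hypothetical.

```python
import math
import random

def random_subset_split(n_features, gain_of, subset_size=None, rng=None):
    """Pick a random subset of feature indices at a node and return the
    feature with the highest gain among them, in the style of Breiman's
    random forests. `gain_of` is a caller-supplied scoring function."""
    rng = rng or random.Random()
    if subset_size is None:
        # One of the subset sizes considered above: floor(log2(n)) + 1.
        subset_size = int(math.log2(n_features)) + 1
    # The split is chosen only among the sampled features, not all n.
    candidates = rng.sample(range(n_features), subset_size)
    return max(candidates, key=gain_of)
```

With `subset_size` equal to `n_features`, this degenerates to an ordinary greedy split over all features; the randomization comes entirely from restricting the candidate set at each node.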
For all but three of the data sets, a 10-fold cross-validation approach was followed. The other three used a train/test split as included in the distribution of the data set. Random tree ensembles were created using both unpruned and pruned (with certainty factor 10) trees, and the better of the two was manually selected for comparison against bagging. Differences in accuracy were tested for statistical significance at the 95 percent level. With this approach, it was found that randomized C4.5 resulted in better accuracy than bagging six times, worse performance three times, and was not statistically significantly different 24 times.

Freund and Schapire introduced a boosting algorithm [3] for incremental refinement of an ensemble by emphasizing hard-to-classify data examples. This algorithm, referred to as AdaBoost.M1, creates classifiers using a training set with weights assigned to every example. Examples that are incorrectly classified by a classifier are given an increased weight for the next iteration. Freund and Schapire showed that boosting was often more accurate than bagging when using a nearest neighbor algorithm as the base classifier, though this margin was significantly diminished when using C4.5. Results were reported for 27 data sets, comparing the performance of boosting with that of bagging using C4.5 as the base classifier. The same ensemble size of 100 was used for boosting and bagging. In general, 10-fold cross-validation was done, repeated for 10 trials, and the average error rate reported. For data sets with a defined test set, an average of 20 trials was used with this test set. Boosting resulted in higher accuracy than bagging on 13 of the 27 data sets, bagging resulted in higher accuracy than boosting on 10 data sets, and there were 4 ties.
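The AdaBoost.M1 weight-update step described above can be sketched as a single round, where correctly classified examples have their weights shrunk so that misclassified examples gain relative weight. This is a minimal illustration of the standard formulation, not an excerpt from any implementation compared here.

```python
import math

def adaboost_weight_update(weights, correct):
    """One AdaBoost.M1 round. `weights` is the current (normalized)
    example-weight distribution; `correct` flags which examples the
    newly trained classifier got right. Returns the updated weights
    and the classifier's vote weight."""
    # Weighted error: total weight of the misclassified examples.
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    assert 0.0 < eps < 0.5, "M1 assumes a weak learner better than chance"
    beta = eps / (1.0 - eps)
    # Shrink the weights of correctly classified examples, then renormalize.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    updated = [w / total for w in updated]
    alpha = math.log(1.0 / beta)  # this classifier's weight in the final vote
    return updated, alpha
```

After normalization, the misclassified examples carry half of the total weight, which is what forces the next classifier to concentrate on them.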
The differences in accuracy were not evaluated for statistical significance.

Table 1 shows a comparative summary of experiments and results of this work with the previously discussed work.

3 EXPERIMENTAL DESIGN

In this work, we used the free open source software package “OpenDT” [13] for learning decision trees in parallel. This program has the ability to output trees very similar to C4.5 release 8 [14], but has added functionality for ensemble creation. In OpenDT, like C4.5, a penalty is assessed to the information gain of a continuous attribute with many potential splits. In the event that the randomly chosen attribute set provides a “negative” information gain, our approach is to randomly rechoose attributes until a positive information gain is obtained, or no further split is possible. This enables each test to improve the purity of the resultant leaves. This approach was also used in the WEKA system [15].

As AdaBoost.M1 was designed for binary classes, we use a simple extension to this algorithm called AdaBoost.M1W [2], which modifies the stopping criteria and weight update mechanism to deal with multiple classes and weak learning algorithms. Our boosting algorithm uses weighted random sampling with replacement from the initial training set, which is different from a boosting-by-weighting approach, where the information gain is adjusted according to the weight of the examples. Freund and Schapire used boosting-by-resampling in [3]. There appears to be no accuracy advantage for boosting-by-resampling over boosting-by-reweighting [16], [17], [18], though Breiman reports increased accuracy for boosting-by-resampling when using unpruned trees [19]. We use unpruned trees because of this and, in general, for increased ensemble diversity [20].
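The weighted random sampling with replacement used in boosting-by-resampling, as described above, can be sketched briefly. The function name is illustrative and this is not OpenDT code; it simply shows how heavily weighted (hard) examples come to appear more often in the next training set.

```python
import random

def resample(examples, weights, rng=None):
    """Draw a training set of the same size by weighted sampling with
    replacement, so examples with larger weights are duplicated more
    often, in place of reweighting the gain computation directly."""
    rng = rng or random.Random()
    return rng.choices(examples, weights=weights, k=len(examples))
```

The base learner then trains on the resampled set with uniform weights, which is what allows an unmodified tree inducer to be boosted.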
Boosting-by-resampling may take longer to converge than boosting-by-reweighting, though.

We have made a modification to the randomized C4.5 ensemble creation method in which only the best test from each attribute is
