Murphy and Pazzani (1994) reported a set of experiments in which all the possible consistent decision trees were produced and showed that, for several tasks, the smallest consistent decision tree had higher predictive error than slightly larger trees. However, when the authors compared the likelihood of better generalization for smaller vs. more complex trees, they concluded that simpler hypotheses should be preferred when no further knowledge of the target concept is available. The small number of training examples relative to the size of the tree that perfectly describes the concept might explain why, in these cases, the smallest tree did not generalize best. Another reason could be that only small spaces of decision trees were explored. To verify these explanations, we use a similar experimental setup where the datasets have larger training sets and attribute vectors of higher dimensionality. Because the number of all possible consistent trees is huge, we use a Random Tree Generator (RTG) to sample the space of trees obtainable by TDIDT algorithms. RTG builds a tree top-down and chooses the splitting attribute at random.
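A minimal sketch of such a generator in Python follows, assuming nominal attributes and a consistent training set; the names and representation (rtg, examples as (features, label) pairs) are illustrative, not taken from the thesis.

```python
import random

def rtg(examples, attributes):
    """Random Tree Generator sketch: grow a consistent decision tree
    top-down, choosing each splitting attribute uniformly at random.
    `examples` is a list of (features, label) pairs, where `features`
    is a dict; `attributes` is the set of attribute names still unused.
    Returns a nested dict for an internal node, or a label at a leaf."""
    labels = {y for _, y in examples}
    if len(labels) == 1:                  # pure node: make it a leaf
        return labels.pop()
    if not attributes:                    # out of attributes: majority leaf
        return max(labels, key=lambda l: sum(y == l for _, y in examples))
    a = random.choice(sorted(attributes))  # the random choice is the only
                                           # difference from greedy TDIDT
    node = {"split": a, "children": {}}
    for v in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == v]
        node["children"][v] = rtg(subset, attributes - {a})
    return node
```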
We report the results for three datasets: XOR-5, Tic-tac-toe, and Zoo (see Appendix A for detailed descriptions of these datasets). For each dataset, the examples were partitioned into a training set (90%) and a testing set (10%), and RTG was used to generate a sample of 10 million consistent decision trees. The behavior in the three datasets is similar: the accuracy decreases monotonically with the size of the trees (number of leaves), confirming the utility of Occam's Razor. In Appendix B we report further experiments with a variety of datasets, which indicate that the inverse correlation between size and accuracy is statistically significant (Esmeir & Markovitch, 2007c).
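To make the measurement concrete, the sampling loop might look as follows (building on the hypothetical rtg above); it assumes every attribute value encountered at test time also appeared on the corresponding training path, so classification never falls off the tree.

```python
def num_leaves(tree):
    """Tree size in leaves; a leaf is any node that is not a split dict."""
    if not isinstance(tree, dict):
        return 1
    return sum(num_leaves(c) for c in tree["children"].values())

def classify(tree, features):
    """Route an example down the tree to its leaf label."""
    while isinstance(tree, dict):
        tree = tree["children"][features[tree["split"]]]
    return tree

def size_vs_accuracy(train, test, attributes, n_trees):
    """Sample n_trees random consistent trees; pair each tree's size
    with its test-set accuracy to study the size/accuracy correlation."""
    points = []
    for _ in range(n_trees):
        tree = rtg(train, attributes)
        acc = sum(classify(tree, x) == y for x, y in test) / len(test)
        points.append((num_leaves(tree), acc))
    return points
```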
It is important to note that smaller trees have several advantages aside from their probable greater accuracy, such as greater statistical evidence at the leaves, better comprehensibility, lower storage costs, and faster classification (in terms of total attribute evaluations).
Motivated by the above discussion, our goal is to find the smallest consistent tree. In TDIDT, the learner has to choose a split at each node, given a set of examples $E$ that reach the node and a set of unused attributes $A$. For each attribute $a$, let $T_{\min}(a, A, E)$ be the smallest tree rooted at $a$ that is consistent with $E$ and uses attributes from $A - \{a\}$ for the internal splits. Given two candidate attributes $a_1$ and $a_2$, we obviously prefer the one whose associated $T_{\min}$ is smaller. For a set of attributes $A$, we define $\hat{A}$ to be the set of attributes whose associated $T_{\min}$ is smallest. Formally, $\hat{A} = \operatorname{argmin}_{a \in A} |T_{\min}(a, A, E)|$. We say that a splitting attribute $a$ is optimal if $a \in \hat{A}$. Observe that if the learner makes an optimal decision at each node, then the final tree is necessarily globally optimal.
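As a concrete reading of these definitions, the exhaustive computation of $|T_{\min}(a, A, E)|$ and of $\hat{A}$ can be sketched as follows; since the recursion enumerates essentially the entire tree space, it is feasible only for tiny attribute sets, which is precisely why the thesis pursues approximations. The function names are again illustrative.

```python
def t_min_size(examples, attributes):
    """|T_min|: number of leaves of the smallest tree consistent with
    `examples` using only `attributes`.  Exhaustive search."""
    if len({y for _, y in examples}) <= 1:
        return 1                      # pure (or empty) node: a single leaf
    best = float("inf")               # stays inf if no consistent tree exists
    for a in attributes:
        best = min(best, split_size(a, examples, attributes))
    return best

def split_size(a, examples, attributes):
    """Size of the smallest consistent tree rooted at attribute a."""
    return sum(
        t_min_size([(x, y) for x, y in examples if x[a] == v],
                   attributes - {a})
        for v in {x[a] for x, _ in examples})

def optimal_attributes(examples, attributes):
    """A-hat: the set of attributes whose rooted T_min is smallest."""
    sizes = {a: split_size(a, examples, attributes) for a in attributes}
    smallest = min(sizes.values())
    return {a for a, s in sizes.items() if s == smallest}
```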