
Murphy and Pazzani (1994) reported a set of experiments in which all the possible consistent decision trees were produced and showed that, for several tasks, the smallest consistent decision tree had higher predictive error than slightly larger trees. However, when the authors compared the likelihood of better generalization for smaller vs. more complex trees, they concluded that simpler hypotheses should be preferred when no further knowledge of the target concept is available. The small number of training examples relative to the size of the tree that perfectly describes the concept might explain why, in these cases, the smallest tree did not generalize best. Another reason could be that only small spaces of decision trees were explored. To verify these explanations, we use a similar experimental setup where the datasets have larger training sets and attribute vectors of higher dimensionality. Because the number of all possible consistent trees is huge, we use a Random Tree Generator (RTG) to sample the space of trees obtainable by TDIDT algorithms. RTG builds a tree top-down and chooses the splitting attribute at random.
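As an illustration of such a sampler, the sketch below grows one consistent tree top-down, picking the splitting attribute uniformly at random at every node. The function name rtg and the data representation (attribute-value dictionaries paired with labels) are our own illustrative assumptions; the thesis does not specify RTG at this level of detail.

```python
import random

def rtg(examples, attributes):
    """Sample one consistent decision tree top-down with random splits.

    `examples` is a list of (attribute_dict, label) pairs; `attributes` is a list
    of attribute names still available for splitting. Returns a nested-dict tree
    or a leaf label. Illustrative sketch only.
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # pure node: stop with a leaf
        return labels.pop()
    if not attributes:                        # attributes exhausted: majority leaf
        return max(labels, key=lambda l: sum(1 for _, y in examples if y == l))
    a = random.choice(attributes)             # random choice, no gain heuristic
    rest = [b for b in attributes if b != a]
    tree = {"split": a, "children": {}}
    for value in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == value]
        tree["children"][value] = rtg(subset, rest)
    return tree
```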

We report the results for three datasets: XOR-5, Tic-tac-toe, and Zoo (see Appendix A for detailed descriptions of these datasets). For each dataset, the examples were partitioned into a training set (90%) and a testing set (10%), and RTG was used to generate a sample of 10 million consistent decision trees. The behavior in the three datasets is similar: the accuracy monotonically decreases with the increase in the size of the trees (number of leaves), confirming the utility of Occam's Razor. In Appendix B we report further experiments with a variety of datasets, which indicate that the inverse correlation between size and accuracy is statistically significant (Esmeir & Markovitch, 2007c).
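In outline, the size-versus-accuracy measurement could be reproduced as follows, reusing the rtg sketch above. The helper names, the nested-dict tree format, and the reduced sample count are illustrative assumptions; the reported experiments sampled 10 million trees per dataset.

```python
from collections import defaultdict

def tree_size(tree):
    """Number of leaves of a nested-dict tree (a leaf is anything that is not a dict)."""
    if not isinstance(tree, dict):
        return 1
    return sum(tree_size(child) for child in tree["children"].values())

def classify(tree, x, default=None):
    """Follow the splits dictated by example `x`; fall back to `default` on unseen values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["split"]], default)
    return tree

def accuracy_by_size(train, test, attributes, n_samples=10_000):
    """Sample consistent trees with rtg() and average test accuracy per tree size."""
    buckets = defaultdict(list)
    for _ in range(n_samples):
        tree = rtg(train, attributes)
        acc = sum(classify(tree, x) == y for x, y in test) / len(test)
        buckets[tree_size(tree)].append(acc)
    return {size: sum(accs) / len(accs) for size, accs in sorted(buckets.items())}
```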

It is important to note that smaller trees have several advantages aside from their probable greater accuracy, such as greater statistical evidence at the leaves, better comprehensibility, lower storage costs, and faster classification (in terms of total attribute evaluations).

Motivated by the above discussion, our goal is to find the smallest consistent tree. In TDIDT, the learner has to choose a split at each node, given the set of examples E that reach the node and the set of unused attributes A. For each attribute a, let Tmin(a, A, E) be the smallest tree rooted at a that is consistent with E and uses attributes from A − {a} for the internal splits. Given two candidate attributes a1 and a2, we obviously prefer the one whose associated Tmin is smaller. For a set of attributes A, we define Â to be the set of attributes whose associated Tmin is of minimal size; formally, Â = argmin_{a∈A} |Tmin(a, A, E)|. We say that a splitting attribute a is optimal if a ∈ Â. Observe that if the learner makes an optimal decision at each node, then the final tree is necessarily globally optimal.
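The definitions of Tmin and Â can be transcribed literally into an exponential-time, brute-force recursion, which may help make the notation concrete. This is only a sketch under the same illustrative data representation as above, assuming the training data are consistent; it is not a practical procedure and not the algorithm developed in this thesis.

```python
def rooted_size(a, examples, attributes):
    """|Tmin(a, A, E)|: leaves of the smallest consistent tree rooted at attribute a."""
    rest = [b for b in attributes if b != a]
    return sum(min_tree_size([(x, y) for x, y in examples if x[a] == v], rest)
               for v in {x[a] for x, _ in examples})

def min_tree_size(examples, attributes):
    """Leaves of the smallest tree over `attributes` consistent with `examples`."""
    labels = {label for _, label in examples}
    if len(labels) <= 1 or not attributes:
        return 1                      # pure (or attribute-exhausted) node: one leaf
    return min(rooted_size(a, examples, attributes) for a in attributes)

def optimal_attributes(examples, attributes):
    """The set Â = argmin over a in A of |Tmin(a, A, E)| of optimal splitting attributes."""
    sizes = {a: rooted_size(a, examples, attributes) for a in attributes}
    best = min(sizes.values())
    return {a for a, s in sizes.items() if s == best}
```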

