01.03.2013 Views

Applied Statistics Using SPSS, STATISTICA, MATLAB and R


264 6 Statistical Classification

$$i(t_1) = i(t_2) = 1 \times 1 = 1;$$

$$i(t_{11}) = i(t_{12}) = \frac{2}{3} \times \frac{1}{3} = \frac{2}{9};$$

$$i(t_{21}) = i(t_{22}) = 1 \times 0 = 0.$$

In the automatic generation of binary trees, the tree starts at the root node, which corresponds to the whole training set. It then progresses by searching, for each variable, for the threshold level that achieves the maximum decrease of the impurity at each node. The generation of splits stops when no significant decrease of the impurity is achieved. It is common practice to use the individual feature values of the training set cases as candidate threshold values. After a tree has been generated automatically, some sort of tree pruning is sometimes needed in order to remove branches of no interest.
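The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS or STATISTICA implementation: it assumes the common form of the Gini impurity, 1 − Σ p_k², which may differ by a constant factor from the form used in the worked values above, and the names `gini`, `best_univariate_split`, `grow_tree` and the stopping parameter `min_decrease` are hypothetical. Candidate thresholds are the feature values of the training cases, and no pruning is performed.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels (assumed form)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_univariate_split(X, y):
    """Exhaustive search over all features and candidate thresholds.

    Returns (impurity decrease, feature index, threshold) for the rule
    x[feature] <= threshold, or (0.0, None, None) if no split helps.
    """
    n = len(y)
    parent = gini(y)
    best = (0.0, None, None)
    for j in range(len(X[0])):
        # Candidate thresholds: the feature values of the training cases.
        for thr in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= thr]
            right = [y[i] for i in range(n) if X[i][j] > thr]
            if not left or not right:
                continue
            decrease = (parent
                        - (len(left) / n) * gini(left)
                        - (len(right) / n) * gini(right))
            if decrease > best[0]:
                best = (decrease, j, thr)
    return best


def grow_tree(X, y, min_decrease=1e-3):
    """Grow a binary tree until no split decreases impurity by min_decrease."""
    decrease, j, thr = best_univariate_split(X, y)
    if j is None or decrease < min_decrease:
        # Leaf node: label it with the majority class.
        return {"leaf": Counter(y).most_common(1)[0][0]}
    left_idx = [i for i in range(len(y)) if X[i][j] <= thr]
    right_idx = [i for i in range(len(y)) if X[i][j] > thr]
    return {
        "feature": j,
        "threshold": thr,
        "left": grow_tree([X[i] for i in left_idx],
                          [y[i] for i in left_idx], min_decrease),
        "right": grow_tree([X[i] for i in right_idx],
                           [y[i] for i in right_idx], min_decrease),
    }


# Toy two-feature training set (made up for illustration only).
X_toy = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [4.0, 3.0]]
y_toy = ["a", "a", "b", "b"]
tree = grow_tree(X_toy, y_toy)
print(tree)
```

On the toy data the first feature separates the two classes perfectly, so the search stops after a single split; on real data the value chosen for `min_decrease` determines what counts as a "significant" decrease.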

SPSS and STATISTICA have specific commands for designing tree classifiers based on univariate splits. The method of exhaustive search for the best univariate splits is usually called the CRT (also CART or C&RT) method, pioneered by Breiman, Friedman, Olshen and Stone (see Breiman et al., 1993).

Example 6.17<br />

Q: Use the CRT approach with univariate splits and the Gini index as splitting criterion in order to derive a decision tree for the Breast Tissue dataset. Assume equal priors of the classes.

A: Applying the commands for CRT univariate split with the Gini index, described in Commands 6.3, yields the tree presented in Figure 6.28, found with SPSS (STATISTICA gives the same solution). The tree shows the split thresholds at each node as well as the improvement achieved in the Gini index. For instance, the first split variable PERIM was selected with a threshold level of 1563.84.

Table 6.13. Training set classification matrix, obtained with SPSS, corresponding to the tree shown in Figure 6.28.

Observed                  Predicted                   Percent
            car   fad   mas   gla   con   adi         Correct
car          20     0     1     0     0     0           95.2%
fad           0     0    12     3     0     0            0.0%
mas           2     0    15     1     0     0           83.3%
gla           1     0     4    11     0     0           68.8%
con           0     0     0     0    14     0          100.0%
adi           0     0     0     0     1    21           95.5%
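As a quick check on the table, the "Percent Correct" column can be recomputed from the counts: each entry is the diagonal count divided by the row total. The sketch below copies the counts from Table 6.13; the function names `percent_correct` and `overall_accuracy` are hypothetical, and the overall training set figure it prints is derived here from the counts, not quoted from the text.

```python
# Counts copied from Table 6.13 (rows: observed class, columns: predicted class).
classes = ["car", "fad", "mas", "gla", "con", "adi"]
confusion = [
    [20, 0, 1, 0, 0, 0],
    [0, 0, 12, 3, 0, 0],
    [2, 0, 15, 1, 0, 0],
    [1, 0, 4, 11, 0, 0],
    [0, 0, 0, 0, 14, 0],
    [0, 0, 0, 0, 1, 21],
]


def percent_correct(matrix):
    """Per-class percentage of cases on the diagonal of each row."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Percentage of all cases that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


per_class = percent_correct(confusion)
for name, pct in zip(classes, per_class):
    print(f"{name}: {pct:.1f}%")
print(f"overall: {overall_accuracy(confusion):.1f}%")
```

The row for fad makes the weakness of this tree plain: none of the 15 fad cases is classified correctly, with 12 assigned to mas and 3 to gla.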
