Applied Statistics Using SPSS, STATISTICA, MATLAB and R

6 Statistical Classification

At each stage of the tree classifier, a simpler problem with a smaller number of features is solved. This is an additional benefit in practical multi-class problems, where it is rather difficult to guarantee normal or even symmetric distributions with similar covariance matrices for all classes; with the multistage approach, those conditions may be approximately met at each stage, affording optimal classifiers at each node.
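The multistage idea can be sketched as follows. This is an illustrative skeleton, not code from the book: the node rules, feature names and thresholds are hypothetical placeholders standing in for the per-stage classifiers the text describes.

```python
# Illustrative sketch of a two-stage tree classifier: each node solves a
# simpler sub-problem using only one or two features.
# All feature names ("f1", "f2", "f3") and thresholds are hypothetical.
def tree_classify(features: dict) -> str:
    # Stage 1: a single feature splits the classes into two merged groups
    if features["f1"] > 10.0:          # hypothetical threshold
        # Stage 2a: a second feature resolves the first merged group
        return "class_A" if features["f2"] > 0.5 else "class_B"
    # Stage 2b: a third feature resolves the second merged group
    return "class_C" if features["f3"] > 1.0 else "class_D"

print(tree_classify({"f1": 12.0, "f2": 0.7, "f3": 0.0}))  # class_A
print(tree_classify({"f1": 5.0, "f2": 0.7, "f3": 2.0}))   # class_C
```

Each branch sees fewer classes and fewer features than the original six-class problem, which is what makes the per-stage distributional assumptions easier to satisfy.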

Example 6.16

Q: Consider the Breast Tissue dataset (electric impedance measurements of freshly excised breast tissue) with 6 classes denoted CAR (carcinoma), FAD (fibro-adenoma), GLA (glandular), MAS (mastopathy), CON (connective) and ADI (adipose). Derive a decision tree solution for this classification problem.

A: Performing a Kruskal-Wallis analysis, it is readily seen that all the features have discriminative capability, namely I0 and PA500, and that it is practically impossible to discriminate between classes GLA, FAD and MAS. The low dimensionality ratio of this dataset for the individual classes (e.g. only 14 cases for class CON) strongly recommends a decision tree approach, with the use of merged classes and a greatly reduced number of features at each node.
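A Kruskal-Wallis screening of a single feature can be reproduced in a few lines. This is a hedged sketch using synthetic stand-in samples (the group means and spreads below are invented for illustration, not taken from the Breast Tissue data); only the group size of 14 for CON comes from the text.

```python
# Sketch: Kruskal-Wallis test used to screen a feature (here a stand-in
# for I0) for discriminative capability across class groups.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Hypothetical I0-like samples for three groups (values are invented)
i0_car = rng.normal(300, 80, 20)     # assumed carcinoma-side values
i0_con = rng.normal(1500, 300, 14)   # connective: only 14 cases, as noted
i0_adi = rng.normal(2200, 400, 22)   # assumed adipose-side values

h, p = kruskal(i0_car, i0_con, i0_adi)
print(f"H = {h:.2f}, p = {p:.4g}")
# A small p-value indicates the feature separates at least one group,
# i.e. it has discriminative capability worth keeping for a tree node.
```

Being rank-based, the test needs no normality assumption, which is precisely why it is a suitable screening tool before the distributional conditions are checked at each node.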

As I0 and PA500 are promising features, it is worthwhile to look at the respective scatter diagram shown in Figure 6.23. Two case clusters are visually identified: one corresponding to {CON, ADI}, the other to {MAS, GLA, FAD, CAR}. At the first stage of the tree we then use I0 alone, with a threshold of I0 = 600, achieving zero errors.
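The first stage is a single univariate split. A minimal sketch, assuming (per the scatter plot in Figure 6.23) that {CON, ADI} lie above the I0 = 600 threshold; the sample values passed in below are invented:

```python
# First tree stage from the text: one threshold on I0 separating the two
# visually identified clusters. Only the threshold value comes from the text.
THRESHOLD_I0 = 600

def stage_one(i0: float) -> str:
    """Route a case to one of the two merged super-classes."""
    return "{CON, ADI}" if i0 > THRESHOLD_I0 else "{MAS, GLA, FAD, CAR}"

print(stage_one(250))    # low-impedance case  -> "{MAS, GLA, FAD, CAR}"
print(stage_one(1800))   # high-impedance case -> "{CON, ADI}"
```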

At stage two, we attempt the most useful discrimination from the medical point of view: class CAR (carcinoma) vs. {FAD, MAS, GLA}. Using discriminant analysis, this can be performed with an overall training set error of about 8%, using features AREA_DA and IPMAX, whose distributions are well modelled by the normal distribution.
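The stage-two linear discriminant can be sketched directly from its definition: project onto the Fisher direction w = Σ⁻¹(μ₁ − μ₂) given by the pooled covariance, and threshold at the midpoint of the projected means. The two synthetic features below are invented stand-ins for AREA_DA and IPMAX; none of the numbers are from the dataset.

```python
# Hedged sketch of two-class linear discriminant analysis (CAR vs. the
# merged class), implemented from the equal-covariance Gaussian model.
import numpy as np

rng = np.random.default_rng(1)
n = 60
# Synthetic normally distributed features (parameters invented)
X_car = rng.normal([30.0, 80.0], [8.0, 15.0], size=(n, 2))
X_rest = rng.normal([12.0, 45.0], [8.0, 15.0], size=(n, 2))

mu1, mu2 = X_car.mean(axis=0), X_rest.mean(axis=0)
# Pooled covariance: the equal-covariance assumption behind linear DA
cov = 0.5 * (np.cov(X_car.T) + np.cov(X_rest.T))
w = np.linalg.solve(cov, mu1 - mu2)   # Fisher discriminant direction
c = w @ (mu1 + mu2) / 2               # decision threshold at the midpoint

# Training-set error: CAR cases should project above c, the rest below
errors = np.sum(X_car @ w <= c) + np.sum(X_rest @ w > c)
train_error = errors / (2 * n)
print(f"training-set error: {train_error:.1%}")
```

With class overlap tuned to the text's figure, such a discriminant lands near the reported ~8% training-set error; the exact value here depends on the invented parameters.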

[Figure 6.23. Scatter plot of the six classes of breast tissue (car, fad, mas, gla, con, adi) with features I0 and PA500; I0 spans roughly -200 to 2800 and PA500 roughly -0.05 to 0.40.]
