

A: The ROC curves for ALTV and ASTV are shown in Figure 6.20. The areas under the ROC curve, computed by SPSS with a 95% confidence interval, are 0.709 ± 0.11 and 0.781 ± 0.10 for ALTV and ASTV, respectively. We, therefore, select the ASTV parameter as the best diagnostic feature.
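The areas above were obtained with SPSS; as a rough sketch (not from the text), the area under the ROC curve of a single feature can also be computed in base R from the Mann-Whitney U statistic, since AUC = U/(n1·n2). The feature vector and class labels below are placeholders, as the data behind Figure 6.20 is not reproduced here:

```r
# Sketch: AUC of a single feature via the Mann-Whitney U statistic.
# x: feature values (e.g. ASTV); y: binary class labels (1 = pathologic, 0 = normal).
auc <- function(x, y) {
  u <- wilcox.test(x[y == 1], x[y == 0])$statistic   # U = number of concordant pairs
  as.numeric(u) / (sum(y == 1) * sum(y == 0))        # AUC = P(x_pathologic > x_normal)
}
```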

6.5 Feature Selection

As already discussed in section 6.3.3, great care must be exercised in reducing the number of features used by a classifier, in order to maintain a high dimensionality ratio and, therefore, reproducible performance, with error estimates sufficiently near the theoretical value. For this purpose, one may use the hypothesis test methods described in chapters 4 and 5 with the aim of discarding clearly non-useful features at an initial stage of the classifier design. This feature assessment task, while assuring that an information-carrying feature set is indeed used in the classifier, does not guarantee that the whole set is needed. Consider, for instance, that we are presented with a classification problem described by four features, x1, x2, x3 and x4, with x1 and x2 perfectly discriminating the classes, and x3 and x4 being linearly dependent on x1 and x2. The hypothesis tests will then find that all features contribute to class discrimination. However, this discrimination could be performed equally well using the alternative sets {x1, x2} or {x3, x4}. Briefly, discarding features with no aptitude for class discrimination is no guarantee against redundant features.
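A minimal sketch of this redundancy effect, using synthetic Gaussian data and MASS::lda (both assumptions, not from the text; the classes here overlap rather than being perfectly separated):

```r
library(MASS)  # for lda()

set.seed(1)
n   <- 100
cls <- factor(rep(0:1, each = n))
x1  <- rnorm(2 * n, mean = as.numeric(cls) - 1)   # class-dependent mean
x2  <- rnorm(2 * n, mean = 1 - as.numeric(cls))
x3  <- x1 + x2                                    # linearly dependent on x1, x2
x4  <- x1 - x2
d   <- data.frame(cls, x1, x2, x3, x4)

# Both subsets span the same feature space, hence identical training errors:
mean(predict(lda(cls ~ x1 + x2, data = d))$class != d$cls)
mean(predict(lda(cls ~ x3 + x4, data = d))$class != d$cls)
```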

There is abundant literature on the topic of feature selection (see References). Feature selection uses a search procedure to find a feature subset (model) obeying a stipulated merit criterion. A possible choice for this criterion is minimising Pe, with the disadvantage that the search process then depends on the classifier type. More often, a class separability criterion such as the Bhattacharyya distance or the ANOVA F statistic is used. Wilks' lambda, defined as the ratio of the determinant of the pooled covariance matrix over the determinant of the total covariance matrix, is also a popular criterion. Physically, it can be interpreted as the ratio between the average class volume and the total volume of all cases. Its value ranges from 0 (complete class separation) to 1 (complete class fusion).
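As a hedged sketch (not from the text), Wilks' lambda for a given feature subset can be computed in R from the within-class and total scatter (SSCP) matrices, the usual convention for this statistic; the iris data at the end is only a stand-in example:

```r
# Sketch: Wilks' lambda = det(W) / det(T), where W is the pooled within-class
# scatter matrix and T the total scatter matrix; values near 0 indicate
# well-separated classes, values near 1 indicate class fusion.
wilks.lambda <- function(X, g) {
  X  <- as.matrix(X)
  Tm <- crossprod(scale(X, center = TRUE, scale = FALSE))       # total scatter
  Wm <- Reduce(`+`, lapply(split(as.data.frame(X), g), function(Xi) {
    Xi <- as.matrix(Xi)
    crossprod(scale(Xi, center = TRUE, scale = FALSE))          # within-class scatter
  }))
  det(Wm) / det(Tm)
}

# Example with the iris data (available in base R):
wilks.lambda(iris[, 1:4], iris$Species)
```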

As for the search method, the following are popular ones and available in STATISTICA and SPSS:

1. Sequential search (direct)

The direct sequential search corresponds to performing successive feature additions or eliminations to the target set, based on a separability criterion.

In a forward search, one starts with the feature of most merit and, at each step, all the features not yet included in the subset are evaluated with the merit criterion; the one that contributes the most to class discrimination is then added to the subset and the procedure advances to the next search step. The process goes on until the merit criterion for any candidate feature is below a specified threshold.
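A minimal sketch of such a forward search (an assumption on my part, not the STATISTICA or SPSS implementation), reusing the wilks.lambda helper above; as a simplification of the stopping rule in the text, it stops when no candidate improves the criterion by more than a chosen threshold:

```r
# Sketch: forward sequential feature selection with a generic merit function.
# By default the merit is the negative Wilks' lambda, so larger is better.
forward.search <- function(X, g,
                           merit = function(S) -wilks.lambda(X[, S, drop = FALSE], g),
                           threshold = 0.01) {
  selected   <- integer(0)
  remaining  <- seq_len(ncol(X))
  best.merit <- -Inf
  repeat {
    if (length(remaining) == 0) break
    scores <- sapply(remaining, function(j) merit(c(selected, j)))
    if (max(scores) - best.merit < threshold) break      # no sufficient improvement
    j          <- remaining[which.max(scores)]           # best candidate this step
    selected   <- c(selected, j)
    remaining  <- setdiff(remaining, j)
    best.merit <- max(scores)
  }
  colnames(X)[selected]
}

# Example (assumes wilks.lambda from the previous sketch and the iris data):
forward.search(iris[, 1:4], iris$Species)
```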
