16.11.2012 Views

Data Mining Methods and Models



CHAPTER 1 DIMENSION REDUCTION METHODS

among the variables and contributes less to the PCA solution. Communalities that are very low for a particular variable should be an indication to the analyst that the particular variable might not participate in the PCA solution (i.e., might not be a member of any of the principal components). Overall, large communality values indicate that the principal components have successfully extracted a large proportion of the variability in the original variables; small communality values show that there is still much variation in the data set that has not been accounted for by the principal components.

Communality values are calculated as the sum of squared component weights for a given variable. We are trying to determine whether to retain component 4, the “housing age” component. Thus, we calculate the communality value for the variable housing median age, using the component weights for this variable (hage-z) from Table 1.2. Two communality values for housing median age are calculated, one for retaining three components and the other for retaining four components.

• Communality (housing median age, three components):

$(-0.429)^2 + (0.025)^2 + (-0.407)^2 = 0.350315$

• Communality (housing median age, four components):

$(-0.429)^2 + (0.025)^2 + (-0.407)^2 + (0.806)^2 = 0.999951$
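The two sums above can be checked with a few lines of Python. This is only a minimal sketch: the four weights are the hage-z values quoted in the text, while the names `hage_weights` and `communality` are illustrative and do not come from the book.

```python
# Component weights for hage-z on components 1 to 4, as quoted in the text.
hage_weights = [-0.429, 0.025, -0.407, 0.806]


def communality(weights, k):
    """Return the sum of squared component weights over the first k components."""
    return sum(w ** 2 for w in weights[:k])


three = communality(hage_weights, 3)
four = communality(hage_weights, 4)

print(f"three components: {three:.6f}")
print(f"four components:  {four:.6f}")
```

Adding the fourth weight contributes (0.806)^2 = 0.649636, which accounts for almost all of the increase in communality for this variable.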

Communalities less than 0.5 can be considered to be too low, since this would mean that the variable shares less than half of its variability in common with the other variables. Now, suppose that for some reason we wanted or needed to keep the variable housing median age as an active part of the analysis. Then, extracting only three components would not be adequate, since housing median age shares only 35% of its variance with the other variables. If we wanted to keep this variable in the analysis, we would need to extract the fourth component, which lifts the communality for housing median age over the 50% threshold. This leads us to the statement of the minimum communality criterion for component selection, which we alluded to earlier.

Minimum Communality Criterion

Suppose that it is required to keep a certain set of variables in the analysis. Then enough components should be extracted so that the communality for each of these variables exceeds a certain threshold (e.g., 50%).
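One way to read this criterion as a procedure is sketched below, under stated assumptions. The function name `min_components` and the dictionary layout are hypothetical, and only the hage-z weights quoted in the text are used; a full analysis would supply the weights of every variable that must be kept.

```python
def min_components(weights_by_variable, required, threshold=0.5):
    """Return the smallest number of components k for which every required
    variable has communality above the threshold, or None if no k suffices.

    weights_by_variable maps a variable name to its component weights,
    ordered from component 1 onward (an assumed layout, for illustration).
    """
    max_k = min(len(weights_by_variable[v]) for v in required)
    for k in range(1, max_k + 1):
        if all(
            sum(w ** 2 for w in weights_by_variable[v][:k]) > threshold
            for v in required
        ):
            return k
    return None


# Weights for hage-z as quoted in the text; other variables are omitted here.
weights = {"hage-z": [-0.429, 0.025, -0.407, 0.806]}
k = min_components(weights, ["hage-z"])
print(k)
```

With the 50% threshold, the search stops at four components for housing median age, which matches the reasoning in the text.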

Hence, we are finally ready to decide how many components to retain. We have decided to retain four components, for the following reasons:

• The eigenvalue criterion recommended three components but did not absolutely reject the fourth component. Also, for small numbers of variables, this criterion can underestimate the best number of components to extract.

• The proportion of variance explained criterion stated that we needed to use four components if we wanted to account for that superb 96% of the variability.

Since our ultimate goal is to substitute these components for the original data
