16.11.2012 Views

Data Mining Methods and Models


PRINCIPAL COMPONENTS ANALYSIS

component, which equals the sum of the eigenvalues, which equals the number of variables. That is,

\[ \sum_{i=1}^{m} \operatorname{Var}(Y_i) = \sum_{i=1}^{m} \operatorname{Var}(Z_i) = \sum_{i=1}^{m} \lambda_i = m \]

• Result 2. The partial correlation between a given component and a given variable is a function of an eigenvector and an eigenvalue. Specifically, $\operatorname{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i}$, $i, j = 1, 2, \ldots, m$, where $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_m, e_m)$ are the eigenvalue–eigenvector pairs for the correlation matrix $\rho$, and we note that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$. A partial correlation coefficient is a correlation coefficient that takes into account the effect of all the other variables.

• Result 3. The proportion of the total variability in Z that is explained by the ith principal component is the ratio of the ith eigenvalue to the number of variables, that is, the ratio λi/m.
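The results above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using small synthetic data rather than the houses data; the variable names (`Z`, `R`, `eigvals`, `eigvecs`, `Y`) and the way the synthetic data are generated are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (an assumption for illustration): n records, m variables
# that share a common factor so that they are correlated.
n, m = 500, 4
common = rng.normal(size=(n, 1))
X = rng.normal(size=(n, m)) + common * np.array([1.0, 0.8, 0.6, 0.4])

# Standardize each column to obtain Z (mean 0, sample standard deviation 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Sample correlation matrix and its eigenvalue-eigenvector pairs.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder so lambda_1 >= ... >= lambda_m
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components: column i of Y is Y_i = e_i' Z.
Y = Z @ eigvecs

# Total variability of the components and of the standardized variables;
# both should equal the sum of the eigenvalues, which equals m.
total_var_Y = Y.var(axis=0, ddof=1).sum()
total_var_Z = Z.var(axis=0, ddof=1).sum()

# Result 2: Corr(Y_i, Z_j) = e_ij * sqrt(lambda_i).
corr_YZ = np.corrcoef(Y, Z, rowvar=False)[:m, m:]  # rows: components, columns: variables
predicted = (eigvecs * np.sqrt(eigvals)).T         # entry [i, j] is e_ij * sqrt(lambda_i)

# Result 3: proportion of variability explained by component i is lambda_i / m.
explained = eigvals / m

print(np.isclose(total_var_Y, m), np.isclose(total_var_Z, m))
print(np.allclose(corr_YZ, predicted))
print(np.isclose(explained.sum(), 1.0))
```

Because the components are built from the sample correlation matrix of the same standardized data, the identities hold up to floating-point error, not only approximately.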

Next, to illustrate how to apply principal components analysis to real data, we turn to an example.

Applying Principal Components Analysis to the Houses Data Set

We turn to the houses data set [3], which provides census information from all the block groups from the 1990 California census. For this data set, a block group has an average of 1425.5 people living in an area that is geographically compact. Block groups that contained zero entries for any of the variables were excluded. Median house value is the response variable; the predictor variables are:

• Median income
• Housing median age
• Total rooms
• Total bedrooms
• Population
• Households
• Latitude
• Longitude

The original data set had 20,640 records, of which 18,540 were selected randomly for a training data set, and 2100 held out for a test data set. A quick look at the variables is provided in Figure 1.1. (“Range” is Clementine’s type label for continuous variables.) Median house value appears to be in dollars, but median income has been scaled to a continuous scale from 0 to 15. Note that longitude is expressed in negative terms, meaning west of Greenwich. Larger absolute values for longitude indicate geographic locations farther west.
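The random partition described above can be sketched as follows. This is only an illustration, assuming NumPy and a simple shuffle of record indices; the text does not say how the original split was produced, and the seed and the names `train_idx` and `test_idx` are hypothetical.

```python
import numpy as np

n_total, n_train = 20_640, 18_540   # record counts stated in the text
rng = np.random.default_rng(42)     # arbitrary seed, chosen only for reproducibility

# Shuffle the record indices once, then cut the permutation in two.
shuffled = rng.permutation(n_total)
train_idx = shuffled[:n_train]      # indices of the training records
test_idx = shuffled[n_train:]       # remaining indices form the held-out test set

print(len(train_idx), len(test_idx))  # prints: 18540 2100
```

Splitting a single permutation guarantees that the two sets are disjoint and together cover every record.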

Relating this data set to our earlier notation, we have X1 = median income, X2 = housing median age, ..., X8 = longitude, so that m = 8 and n = 18,540. A glimpse of the first 20 records in the data set looks like Figure 1.2. So, for example, for the first block group, the median house value is $452,600, the median income is 8.325 (on the census scale), the housing median age is 41, the total rooms is 880, the total bedrooms is 129, the population is 322, the number of households is 126, the latitude is 37.88 North and the longitude is 122.23 West. Clearly, this is a smallish block group with very high median house value. A map search reveals that this block group
