Modeling and Multivariate Methods - SAS

Chapter 18 Clustering Data


Select Data is distance matrix if you have a data table of distances instead of raw data. If your raw data consists of n observations, the distance table should have n rows and n columns, with the values being the distances between the observations. The distance table also needs an additional column giving a unique identifier (such as the row number) that matches the column names of the other n columns. The diagonal elements of the table should be zero, since the distance between a point and itself is zero. The table can be square (both upper and lower elements filled in), or it can be upper or lower triangular. If you use a square table, the platform gives a warning if the table is not symmetric. For an example of what the distance table should look like, see the option “Save Distance Matrix” on page 467.
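The required shape of such a distance table (n × n, symmetric, zero diagonal) can be sketched outside JMP. This is a minimal illustration using SciPy, with made-up raw data; it is not JMP code:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# n = 3 hypothetical observations of two-column raw data
raw = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [6.0, 1.0]])

# n x n square distance table: symmetric, with a zero diagonal
dist = squareform(pdist(raw))

# The properties the platform expects of a square distance table:
assert np.allclose(np.diag(dist), 0)   # a point is at distance zero from itself
assert np.allclose(dist, dist.T)       # square tables must be symmetric
```

In JMP itself, the identifier column matching the n column names plays the role that row/column indices play in the array above.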

To sort clusters by their mean value, specify an Ordering column. One way to use this feature is to complete a Principal Components analysis (using Multivariate) and save the first principal component to use as an Ordering column. The clusters are then sorted by these values.
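The idea behind such an Ordering column can be sketched as follows. This is an illustrative NumPy computation with invented data and hypothetical cluster labels, not the JMP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # made-up data: 10 rows, 3 variables

# First principal component of the standardized data
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
pc1 = Z @ eigvecs[:, -1]              # scores on the largest-eigenvalue direction

# Hypothetical cluster labels; clusters are then ordered by mean PC1 score
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2, 1])
order = sorted(set(labels), key=lambda k: pc1[labels == k].mean())
```

Saving `pc1` as a column and supplying it as the Ordering column would sort the clusters the way `order` does here.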

Use the Missing value imputation option to impute missing values in the data. Missing value imputation assumes that there are no clusters, that the data come from a single multivariate normal distribution, and that the values are missing completely at random. These assumptions are usually not reasonable in practice, so use this feature with caution; still, it can produce more informative results than discarding most of your data.

With the Pairwise method, a single covariance matrix is formed for all the data. Each missing value is then imputed by a method that is equivalent to regression prediction using all the non-missing variables as predictors. For categorical variables, the category indices are used as dummy variables. If regression prediction fails because the covariance matrix for the non-missing values is not positive definite, JMP uses univariate means instead.
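The equivalence between regression prediction and imputation under a single multivariate normal can be made concrete: the imputed value is the conditional mean of the missing entries given the observed ones. This sketch (not JMP's code) shows the idea, including the fallback to univariate means when the observed-variable covariance block cannot be inverted:

```python
import numpy as np

def impute_row(row, mean, cov):
    """Fill NaNs in `row` with the conditional mean given the observed entries."""
    miss = np.isnan(row)
    if not miss.any():
        return row
    obs = ~miss
    # E[x_miss | x_obs] = mu_miss + S_mo @ S_oo^{-1} @ (x_obs - mu_obs)
    s_oo = cov[np.ix_(obs, obs)]
    s_mo = cov[np.ix_(miss, obs)]
    filled = row.copy()
    try:
        beta = s_mo @ np.linalg.inv(s_oo)   # the regression coefficients
        filled[miss] = mean[miss] + beta @ (row[obs] - mean[obs])
    except np.linalg.LinAlgError:
        filled[miss] = mean[miss]           # fall back to univariate means
    return filled
```

For example, with mean zero, unit variances, and correlation 0.8, an observed value of 1.0 in the first variable imputes 0.8 for a missing second variable, exactly as simple regression would predict.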

Hierarchical clustering supports character columns as follows (K-Means clustering supports only numeric columns):

• For Ordinal columns, the data value used for clustering is the index of the ordered category, treated as if it were continuous data. These values are standardized like continuous columns.

• For Nominal columns, two values contribute a distance of zero if the categories match, and a standardized distance of 1 otherwise.
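The two rules above can be sketched in a few lines. The function names and level labels here are invented for illustration; this is not the platform's internal code:

```python
import numpy as np

def ordinal_scores(values, ordered_levels):
    """Ordinal rule: use the index of the ordered category, standardized."""
    idx = np.array([ordered_levels.index(v) for v in values], dtype=float)
    return (idx - idx.mean()) / idx.std()   # standardized like a continuous column

def nominal_distance(a, b):
    """Nominal rule: matching categories contribute 0, otherwise 1."""
    return 0.0 if a == b else 1.0
```

So an ordinal column with levels low < mid < high becomes the standardized sequence of indices 0, 1, 2, while a nominal comparison is all-or-nothing.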

Hierarchical Clustering

The Hierarchical option groups the points (rows) of a JMP table into clusters whose values are close to each other relative to those of other clusters. Hierarchical clustering starts with each point in its own cluster. At each step, the two clusters that are closest together are combined into a single cluster. This process continues until there is only one cluster containing all the points. This type of clustering works well for smaller data sets (a few hundred observations).

To see a simple example of hierarchical clustering, open the Birth Death Subset.jmp data table. The data are the 1976 crude birth and death rates per 100,000 people. When you select the Cluster command, the Cluster Launch dialog (shown previously in Figure 18.1) appears. In this example, the birth and death columns are used as cluster variables with the default method for hierarchical clustering, Ward's minimum variance.
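The same merge-closest-pair process can be reproduced outside JMP. This sketch uses SciPy's implementation of Ward's minimum variance method on made-up (birth, death)-style rates; the numbers are not the actual Birth Death Subset values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (birth rate, death rate) pairs for five countries
rates = np.array([[30.0, 10.0],
                  [31.0, 11.0],
                  [15.0,  9.0],
                  [14.0,  8.0],
                  [45.0, 20.0]])

# Each row of Z records one merge of the two closest clusters, exactly
# the step-by-step agglomeration described above
Z = linkage(rates, method="ward")

# Cut the tree into (at most) three clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```

With this data, the two high-rate rows pair up, the two low-rate rows pair up, and the outlying row stays in its own cluster, mirroring what the dendrogram in JMP would show.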
