14.03.2014 Views

Modeling and Multivariate Methods - SAS


Chapter 18 Clustering Data

Hierarchical Clustering

Save Cluster Hierarchy saves the information needed to build a custom dendrogram with scripting. For each clustering, it outputs three rows (the joiner, the leader, and the result) with the cluster centers, size, and other information.

Save Distance Matrix makes a new data table containing the distances between the observations.

Parallel Coord Plots creates a parallel coordinate plot for each cluster. For details about the plots, see Basic Analysis and Graphing.

Script contains options that are available to all platforms. See Using JMP.

Technical Details for Hierarchical Clustering

The following description of hierarchical clustering methods gives distance formulas that use the following notation. Lowercase symbols generally pertain to observations and uppercase symbols to clusters.

$n$ is the number of observations

$v$ is the number of variables

$x_i$ is the $i$th observation

$C_K$ is the $K$th cluster, subset of $\{1, 2, \ldots, n\}$

$N_K$ is the number of observations in $C_K$

$\bar{x}$ is the sample mean vector

$\bar{x}_K$ is the mean vector for cluster $C_K$

$\lVert x \rVert$ is the square root of the sum of the squares of the elements of $x$ (the Euclidean length of the vector $x$)

$d(x_i, x_j)$ is $\lVert x \rVert^2$ with $x = x_i - x_j$, that is, the squared Euclidean distance between the two observations

Average Linkage In average linkage, the distance between two clusters is the average distance between pairs of observations, one in each cluster. Average linkage tends to join clusters with small variances and is slightly biased toward producing clusters with the same variance. See Sokal and Michener (1958).

Distance for the average linkage cluster method is

$$D_{KL} = \frac{\sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j)}{N_K N_L}$$
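As a rough illustration of the average linkage distance, the following Python sketch averages a pairwise distance over every pair that has one observation in each cluster. It is not JMP code. The function names and the sample data are hypothetical, and the pairwise distance is assumed to be the squared Euclidean distance, following the notation in this section.

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance between two equal-length observations.

    Assumption: this plays the role of d(x_i, x_j) in the formula.
    """
    return sum((a - b) ** 2 for a, b in zip(x, y))


def average_linkage_distance(cluster_k, cluster_l, d=sq_euclidean):
    """Average of d over all pairs with one observation in each cluster."""
    total = sum(d(xi, xj) for xi in cluster_k for xj in cluster_l)
    return total / (len(cluster_k) * len(cluster_l))


# Hypothetical two-variable observations, chosen only for illustration.
cluster_k = [(0.0, 0.0), (2.0, 0.0)]
cluster_l = [(0.0, 3.0), (2.0, 3.0)]

d_kl = average_linkage_distance(cluster_k, cluster_l)
print(d_kl)  # the four pairwise distances 9, 13, 13, 9 average to 11.0
```

Passing a different function as `d` would change the pairwise distance without changing the averaging step.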

Centroid Method In the centroid method, the distance between two clusters is defined as the squared Euclidean distance between their means. The centroid method is more robust to outliers than most other hierarchical methods but in other respects might not perform as well as Ward’s method or average linkage. See Milligan (1980).

Distance for the centroid method of clustering is

$$D_{KL} = \lVert \bar{x}_K - \bar{x}_L \rVert^2$$
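The centroid distance can be sketched in a few lines of Python: compute the mean vector of each cluster, then take the squared Euclidean length of their difference. This is an illustrative sketch only, not JMP code, and the function names and sample data are hypothetical.

```python
def mean_vector(cluster):
    """Mean of each variable over the observations in a cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def centroid_distance(cluster_k, cluster_l):
    """Squared Euclidean distance between the two cluster mean vectors."""
    mean_k = mean_vector(cluster_k)
    mean_l = mean_vector(cluster_l)
    return sum((a - b) ** 2 for a, b in zip(mean_k, mean_l))


# Hypothetical two-variable observations, chosen only for illustration.
cluster_k = [(0.0, 0.0), (2.0, 0.0)]
cluster_l = [(0.0, 3.0), (2.0, 3.0)]

d_kl = centroid_distance(cluster_k, cluster_l)
print(d_kl)  # the means are (1, 0) and (1, 3), so the distance is 9.0
```

Because only the two mean vectors enter the calculation, the spread of observations within each cluster does not affect the result.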
