14.03.2014 Views

Modeling and Multivariate Methods - SAS


Chapter 18: Clustering Data

Hierarchical Clustering

Ward’s: In Ward’s minimum variance method, the distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous generation. The sums of squares are easier to interpret when they are divided by the total sum of squares to give the proportions of variance (squared semipartial correlations).

Ward’s method joins clusters to maximize the likelihood at each level of the hierarchy under the assumptions of multivariate normal mixtures, spherical covariance matrices, and equal sampling probabilities.

Ward’s method tends to join clusters with a small number of observations and is strongly biased toward producing clusters with approximately the same number of observations. It is also very sensitive to outliers. See Milligan (1980).

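The distance for Ward’s method is

$$D_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^2}{\dfrac{1}{N_K} + \dfrac{1}{N_L}}$$

where $\bar{x}_K$ and $\bar{x}_L$ are the mean vectors of clusters $C_K$ and $C_L$, and $N_K$ and $N_L$ are the numbers of observations in those clusters.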
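The Ward distance can be checked numerically against the verbal definition given earlier (the between-cluster sum of squares). The following is a minimal sketch in Python with NumPy, not the software's own implementation; the function names `ward_distance` and `within_ss` and the sample data are hypothetical and chosen only for illustration.

```python
import numpy as np

def ward_distance(cluster_k, cluster_l):
    """Ward distance between two clusters, each an (n_obs, n_vars) array."""
    cluster_k = np.asarray(cluster_k, dtype=float)
    cluster_l = np.asarray(cluster_l, dtype=float)
    # Difference between the two cluster mean vectors.
    mean_diff = cluster_k.mean(axis=0) - cluster_l.mean(axis=0)
    n_k, n_l = len(cluster_k), len(cluster_l)
    # Squared Euclidean length divided by (1/N_K + 1/N_L).
    return float(mean_diff @ mean_diff) / (1.0 / n_k + 1.0 / n_l)

def within_ss(cluster):
    """Within-cluster sum of squares, added up over all variables."""
    cluster = np.asarray(cluster, dtype=float)
    centered = cluster - cluster.mean(axis=0)
    return float((centered ** 2).sum())

# Illustrative data (made up for this sketch): two small clusters in 2-D.
c_k = np.array([[0.0, 0.0], [2.0, 0.0]])
c_l = np.array([[4.0, 3.0], [6.0, 3.0], [5.0, 6.0]])

d_kl = ward_distance(c_k, c_l)
# Increase in within-cluster sum of squares caused by merging the clusters.
merge_cost = within_ss(np.vstack([c_k, c_l])) - within_ss(c_k) - within_ss(c_l)
```

For these data, `d_kl` and `merge_cost` agree up to floating-point rounding, which illustrates why choosing the pair with the smallest Ward distance minimizes the within-cluster sum of squares at each generation.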

Single Linkage: In single linkage, the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. Single linkage has many desirable theoretical properties. See Jardine and Sibson (1976), Fisher and Van Ness (1971), and Hartigan (1981). Single linkage has, however, fared poorly in Monte Carlo studies. See Milligan (1980). By imposing no constraints on the shape of clusters, single linkage sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters. Single linkage tends to chop off the tails of distributions before separating the main clusters. See Hartigan (1981). Single linkage was originated by Florek et al. (1951a, 1951b) and later reinvented by McQuitty (1957) and Sneath (1957).

The distance for the single linkage cluster method is

$$D_{KL} = \min_{i \in C_K} \, \min_{j \in C_L} \, d(x_i, x_j)$$
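As an illustration of the single linkage formula, the sketch below computes all pairwise distances between two clusters and takes the minimum. It assumes that $d$ is the Euclidean distance, which the text does not specify; the function names and data are hypothetical.

```python
import numpy as np

def pairwise_distances(cluster_k, cluster_l):
    """Matrix of Euclidean distances d(x_i, x_j), i in C_K, j in C_L."""
    k = np.asarray(cluster_k, dtype=float)
    l = np.asarray(cluster_l, dtype=float)
    # Broadcasting yields an (N_K, N_L, n_vars) array of differences.
    diff = k[:, None, :] - l[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def single_linkage_distance(cluster_k, cluster_l):
    """Minimum distance between any observation in C_K and any in C_L."""
    return float(pairwise_distances(cluster_k, cluster_l).min())

# Illustrative data (made up for this sketch): points along the x axis.
c_k = np.array([[0.0, 0.0], [1.0, 0.0]])
c_l = np.array([[4.0, 0.0], [9.0, 0.0]])

d_single = single_linkage_distance(c_k, c_l)  # closest pair: (1, 0) and (4, 0)
```

Because only the closest pair matters, a chain of nearby points can link two otherwise distant groups, which is consistent with the method's ability to follow elongated clusters.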

Complete Linkage: In complete linkage, the distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. Complete linkage is strongly biased toward producing clusters with approximately equal diameters and can be severely distorted by moderate outliers. See Milligan (1980).

The distance for the complete linkage cluster method is

$$D_{KL} = \max_{i \in C_K} \, \max_{j \in C_L} \, d(x_i, x_j)$$
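The complete linkage formula can be sketched the same way, replacing the minimum with a maximum. As before, this assumes a Euclidean $d$, and the function name and data are hypothetical.

```python
import numpy as np

def complete_linkage_distance(cluster_k, cluster_l):
    """Maximum Euclidean distance between an observation in C_K and one in C_L."""
    k = np.asarray(cluster_k, dtype=float)
    l = np.asarray(cluster_l, dtype=float)
    # All pairwise differences, shape (N_K, N_L, n_vars).
    diff = k[:, None, :] - l[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))
    return float(distances.max())

# Illustrative data (made up for this sketch): points along the x axis.
c_k = np.array([[0.0, 0.0], [1.0, 0.0]])
c_l = np.array([[4.0, 0.0], [9.0, 0.0]])

d_complete = complete_linkage_distance(c_k, c_l)  # farthest pair: (0, 0) and (9, 0)
```

Here the single point at (9, 0) determines the result on its own, which gives a sense of how one moderately distant observation can distort complete linkage.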

Fast Ward is a way of applying Ward’s method more quickly for large numbers of rows. It is used automatically whenever there are more than 2000 rows.
