
Chapter 18 Clustering Data

Introduction to Clustering Methods

Clustering is a multivariate technique for grouping together rows that share similar values. It can use any number of variables. The variables must be numeric variables for which numerical differences make sense. The common situation is that data are not scattered evenly through n-dimensional space, but rather they form clumps, locally dense areas, modes, or clusters. The identification of these clusters goes a long way toward characterizing the distribution of values.

JMP provides two approaches to clustering:

• hierarchical clustering for small tables, up to several thousand rows

• k-means and normal mixtures clustering for large tables, up to hundreds of thousands of rows.

Hierarchical clustering is also called agglomerative clustering because it is a combining process. The method starts with each point (row) as its own cluster. At each step the clustering process calculates the distance between each pair of clusters and combines the two clusters that are closest together. This combining continues until all the points are in one final cluster. The combining record is portrayed as a tree, called a dendrogram: the single points are leaves, the final single cluster of all points is the trunk, and the intermediate cluster combinations are branches. The user then chooses the number of clusters that seems right and cuts the clustering tree at that point. Since the process starts with n(n − 1)/2 pairwise distances for n points, this method becomes too expensive in memory and time when n is large.
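
As an illustration only (SciPy, not JMP itself; the toy data and the choice of three clusters are arbitrary), the agglomerative process and the tree cut can be sketched like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# toy data: three clumps in 2-dimensional space
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2)),
               rng.normal((0, 5), 1, (20, 2))])

Z = linkage(X, method="ward")                    # the combining record, one merge per step
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters

dendrogram(Z)                                    # leaves = rows, trunk = the final cluster
plt.show()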

Hierarchical clustering also supports character columns. If the column is ordinal, then the data value used for clustering is just the index of the ordered category, treated as if it were continuous data. If the column is nominal, then the categories must match to contribute a distance of zero; they contribute a distance of 1 otherwise.
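
A hypothetical helper (my own sketch of these per-column rules, not JMP's internal code) makes the ordinal and nominal cases concrete:

def column_distance(a, b, modeling_type, ordered_levels=None):
    """Distance contribution of one column, following the rules above."""
    if modeling_type == "ordinal":
        # the value used is the index of the ordered category,
        # treated as if it were continuous data
        return abs(ordered_levels.index(a) - ordered_levels.index(b))
    if modeling_type == "nominal":
        # matching categories contribute 0; mismatches contribute 1
        return 0.0 if a == b else 1.0
    return abs(a - b)                     # continuous: plain numeric difference

print(column_distance("low", "high", "ordinal", ["low", "medium", "high"]))  # 2
print(column_distance("red", "blue", "nominal"))                             # 1.0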

JMP offers five rules for defining distances between clusters: Average, Centroid, Ward, Single, and Complete. Each rule can generate a different sequence of clusters.
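
SciPy's linkage function happens to implement methods under the same five names, which makes it easy to see that the rules can split the same rows differently (the data here are arbitrary):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))       # any numeric table will do

for method in ["average", "centroid", "ward", "single", "complete"]:
    Z = linkage(X, method=method)
    sizes = np.bincount(fcluster(Z, t=3, criterion="maxclust"))[1:]
    print(method, sizes)           # cluster sizes can differ by rule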

K-means clustering is an iterative follow-the-leader strategy. First, the user must specify the number of clusters, k. Then a search algorithm finds k points in the data, called seeds, that are not close to each other. Each seed is then treated as a cluster center. The routine goes through the points (rows) and assigns each point to the closest cluster. For each cluster, a new cluster center is formed as the mean (centroid) of the points currently in the cluster. This process continues as an alternation between assigning points to clusters and recalculating cluster centers until the clusters become stable.
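
A minimal NumPy sketch of that alternation follows; for brevity it seeds from random rows rather than searching for well-separated seeds as JMP does:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # seed points
    for _ in range(n_iter):
        # assignment step: each row joins the closest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center becomes the centroid of its rows
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):     # clusters are stable; stop
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centers = kmeans(X, k=2)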

Normal mixtures clustering, like k-means clustering, begins with a user-defined number of clusters and then selects distance seeds. JMP uses the cluster centers chosen by k-means as seeds. However, each point, rather than being classified into one group, is assigned a probability of being in each group.
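
The same idea is available in scikit-learn, shown here purely as an analogue to JMP's platform: a Gaussian mixture initialized from k-means centers, with soft cluster membership for each row:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# init_params="kmeans" seeds the mixture from k-means cluster centers
gm = GaussianMixture(n_components=2, init_params="kmeans", random_state=0).fit(X)
probs = gm.predict_proba(X)    # per-row probability of belonging to each cluster
print(probs[0].round(3))       # e.g. [1. 0.] for a point deep inside one clump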

Self-organizing maps (SOMs) are a variation on k-means where the cluster centers are laid out on a grid. Clusters and points close together on the grid are meant to be close together in the multivariate space. See "Self Organizing Maps" on page 477.
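
A rough sketch of the SOM idea (online updates on a 5 x 5 grid; every constant here is an arbitrary choice for illustration, not JMP's settings):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # toy table: 500 rows, 4 variables

rows, cols = 5, 5                              # cluster centers laid out on a grid
W = rng.normal(size=(rows, cols, X.shape[1]))

ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
for t in range(2000):
    x = X[rng.integers(len(X))]
    # winner: the grid cell whose center is closest to this row
    d = np.linalg.norm(W - x, axis=2)
    bi, bj = np.unravel_index(d.argmin(), d.shape)
    # pull the winner and its grid neighbors toward the row, so that
    # centers close on the grid stay close in the multivariate space
    lr = 0.5 * np.exp(-t / 1000)               # decaying learning rate
    sigma = 2.0 * np.exp(-t / 1000)            # decaying neighborhood radius
    h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * sigma ** 2))
    W += lr * h[..., None] * (x - W)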

K-means, normal mixtures, and SOM clustering are doubly iterative processes. The clustering process iterates between two steps in a particular implementation of the EM algorithm:

• The expectation step of mixture clustering assigns each observation a probability of belonging to each cluster.

• The maximization step then re-estimates each cluster's parameters using those probabilities as weights.
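
To make the alternation concrete, here is a compact sketch of the E-step/M-step loop for a normal mixture, assuming numeric data and omitting the safeguards against degenerate covariances that a real implementation needs:

import numpy as np
from scipy.stats import multivariate_normal

def em_normal_mixture(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]      # seed with random rows
    covs = np.array([np.cov(X.T) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: probability of each row belonging to each cluster
        dens = np.column_stack([
            weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from the probabilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j]
    return resp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
resp = em_normal_mixture(X, k=2)
print(resp[0].round(3))    # soft membership of the first row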
