14.03.2014 Views

Modeling and Multivariate Methods - SAS



Chapter 18 Clustering Data

K-Means Clustering


The k-means approach to clustering performs an iterative alternating fitting process to form the specified number of clusters. The k-means method first selects a set of n points, one per requested cluster, called cluster seeds, as a first guess of the means of the clusters. Each observation is assigned to the nearest seed to form a set of temporary clusters. The seeds are then replaced by the cluster means, the points are reassigned, and the process continues until no further changes occur in the clusters. When the clustering process is finished, you see tables showing brief summaries of the clusters. The k-means approach is a special case of a general approach called the EM algorithm; E stands for expectation (the cluster means in this case), and M stands for maximization, which means assigning points to the closest clusters in this case.
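
The alternating process described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the JMP implementation: the function name `kmeans`, the random choice of seeds, the Euclidean distance, and the `max_iter` safeguard are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # First guess: pick k distinct observations as cluster seeds (an assumption;
    # the manual does not say how the seeds are chosen).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each observation to the nearest current center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Stop when no observation changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Replace each center by the mean of the observations assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Two well-separated groups of three observations each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(data, k=2)
```

The loop stops as soon as one pass leaves every assignment unchanged, which mirrors "until no further changes occur in the clusters".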

The k-means method is intended for use with larger data tables, from approximately 200 to 100,000 observations. With smaller data tables, the results can be highly sensitive to the order of the observations in the data table.

K-Means clustering supports only numeric columns. It ignores model types (nominal and ordinal) and treats all numeric columns as continuous columns.

To see the KMeans cluster launch dialog (see Figure 18.4), select KMeans from the Options menu on the platform launch dialog. The figure uses the Cytometry.jmp data table.

Figure 18.4 KMeans Launch Dialog<br />

The dialog has the following options:

Columns Scaled Individually is used when variables do not share a common measurement scale and you do not want one variable to dominate the clustering process. For example, one variable might have values between 0 and 1000, and another variable might have values between 0 and 10. In this situation, you can use the option so that the clustering process is not dominated by the first variable.
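
To illustrate why per-column scaling matters, the sketch below standardizes each column separately before any distances are computed. The manual does not say which scaling JMP applies, so centering on the column mean and dividing by the column standard deviation is an assumption here, and `scale_columns` is a hypothetical helper name.

```python
import statistics


def scale_columns(rows):
    """Center each column and divide by its own standard deviation.

    Assumed z-score scaling, for illustration only.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [
        # A constant column (sd == 0) carries no information; map it to 0.0.
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds))
        for row in rows
    ]


# One column ranges over 0 to 1000, the other over 0 to 10.
raw = [(0, 0), (500, 5), (1000, 10)]
scaled = scale_columns(raw)
```

After scaling, both columns span the same range, so neither dominates a Euclidean distance between observations.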

Johnson Transform balances highly skewed variables or brings outliers closer to the center of the rest of the values.
