Foundations of Data Science

[Figure 8.2: A locally-optimal but globally-suboptimal k-means clustering. The figure shows the three points (0,1), (3,0), and (0,-1).]

strategy is called “farthest traversal”. Here, we begin by choosing one data point as the initial center $c_1$ (say, at random), then pick the data point farthest from $c_1$ to use as $c_2$, then pick the data point farthest from $\{c_1, c_2\}$ (that is, the point maximizing the minimum distance to the centers chosen so far) to use as $c_3$, and so on. These are then used as the initial centers. Notice that this will produce the correct solution in the example in Figure 8.2.
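To make the procedure concrete, here is a minimal sketch of farthest traversal, not from the text: it assumes the data are rows of a NumPy array and uses Euclidean distance, and the function name farthest_traversal_init is ours for illustration.

```python
import numpy as np

def farthest_traversal_init(data, k, seed=None):
    """Pick k initial centers: the first at random, each subsequent one
    as the point whose minimum distance to the chosen centers is largest."""
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Each point's distance to its nearest chosen center.
        min_dists = np.min(
            [np.linalg.norm(data - c, axis=1) for c in centers], axis=0
        )
        centers.append(data[np.argmax(min_dists)])
    return np.array(centers)

# On the three points of Figure 8.2 with k = 2, any starting point yields
# centers on opposite sides of the gap, from which Lloyd's algorithm
# recovers the optimal clustering.
pts = np.array([[0.0, 1.0], [3.0, 0.0], [0.0, -1.0]])
print(farthest_traversal_init(pts, k=2))
```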

Farthest traversal can unfortunately get fooled by a small number of outliers. To address this, a smoother, probabilistic variation known as k-means++ instead weights data points based on their distance from the previously chosen centers, specifically, proportional to distance squared. Then it selects the next center probabilistically according to these weights. This approach has the nice property that a small number of outliers will not overly influence the algorithm so long as they are not too far away, in which case perhaps they should be their own clusters anyway.
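The change from farthest traversal to k-means++ is small: replace the deterministic argmax with sampling proportional to squared distance. A sketch under the same assumptions as above (NumPy rows, Euclidean distance; the name kmeans_pp_init is ours):

```python
import numpy as np

def kmeans_pp_init(data, k, seed=None):
    """Pick k initial centers: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        sq_dists = np.min(
            [np.sum((data - c) ** 2, axis=1) for c in centers], axis=0
        )
        probs = sq_dists / sq_dists.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers)
```

Because the weights are squared distances rather than a hard maximum, a lone moderate outlier gets only a proportional share of the probability mass instead of being picked outright.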

An alternative SVD-based method for initialization is described and analyzed in Section 8.6. Another approach is to run some other approximation algorithm for the k-means problem, and then use its output as the starting point for Lloyd’s algorithm. Note that applying Lloyd’s algorithm to the output of any other algorithm can only improve its score, since each assignment step and each re-centering step of Lloyd’s algorithm weakly decreases the k-means cost.
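For completeness, here is a minimal sketch of Lloyd’s algorithm seeded with externally supplied centers, under the same assumptions as the earlier sketches (the name lloyds_refine is ours). It can be run on the output of any of the initializations above.

```python
import numpy as np

def lloyds_refine(data, centers, max_iters=100):
    """Refine given starting centers with Lloyd's algorithm. Both the
    assignment step and the re-centering step never increase the cost."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iters):
        # Assign each point to its nearest center.
        labels = np.argmin(
            [np.sum((data - c) ** 2, axis=1) for c in centers], axis=0
        )
        # Replace each center by the centroid of its assigned points.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break  # converged to a local optimum
        centers = new_centers
    return centers
```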

8.2.4 Ward’s algorithm

Another popular heuristic for k-means clustering is Ward’s algorithm. Ward’s algorithm begins with each data point in its own cluster, and then repeatedly merges pairs of clusters until only k clusters remain. Specifically, Ward’s algorithm merges the two clusters that minimize the immediate increase in k-means cost. That is, for a cluster $C$, define $\mathrm{cost}(C) = \sum_{a_i \in C} d^2(a_i, c)$, where $c$ is the centroid of $C$. Then Ward’s algorithm merges the pair $(C, C')$ minimizing $\mathrm{cost}(C \cup C') - \mathrm{cost}(C) - \mathrm{cost}(C')$. Thus, Ward’s algorithm can be viewed as a greedy k-means algorithm.
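A direct, if inefficient, rendering of this greedy merging, under the same assumptions as the earlier sketches (the name wards_algorithm is ours; a production version would cache merge costs rather than recompute them each round):

```python
import numpy as np

def wards_algorithm(data, k):
    """Start with singleton clusters; repeatedly merge the pair whose
    union gives the smallest increase in k-means cost."""
    def cost(idx):
        # k-means cost of one cluster: sum of squared distances to centroid.
        pts = data[idx]
        return np.sum((pts - pts.mean(axis=0)) ** 2)

    clusters = [[i] for i in range(len(data))]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: cost(clusters[ab[0]] + clusters[ab[1]])
                           - cost(clusters[ab[0]]) - cost(clusters[ab[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

A convenient identity here is that $\mathrm{cost}(C \cup C') - \mathrm{cost}(C) - \mathrm{cost}(C') = \frac{|C|\,|C'|}{|C| + |C'|}\, d^2(c, c')$, where $c$ and $c'$ are the two centroids, so the merge cost depends only on the cluster sizes and the distance between centroids.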

