Foundations of Data Science


Lloyd’s algorithm:

Start with k centers.

Cluster each point with the center nearest to it.

Find the centroid of each cluster and replace the set of old centers with the centroids.

Repeat the above two steps until the centers converge (according to some criterion, such as the k-means score no longer improving).
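The steps above can be sketched in Python as follows. This is a minimal illustrative implementation, not the book's code; it stops when the centers no longer move, which is one instance of the convergence criterion mentioned above.

```python
def squared_dist(p, q):
    # squared Euclidean distance between two points given as tuples
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers):
    """Sketch of Lloyd's algorithm. `points` and `centers` are lists of
    equal-dimension tuples; the caller supplies the k initial centers."""
    centers = list(centers)
    while True:
        # Step 1: cluster each point with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda j: squared_dist(p, centers[j]))
            clusters[i].append(p)
        # Step 2: replace each center with the centroid of its cluster.
        # (An empty cluster would break this step; the text discusses
        # that failure mode.)
        new_centers = [tuple(sum(coords) / len(c) for coords in zip(*c))
                       for c in clusters]
        if new_centers == centers:  # converged: centers stopped moving
            return centers, clusters
        centers = new_centers
```

For example, running `lloyd` on the four points (0, 0), (0, 2), (10, 0), (10, 2) with initial centers (0, 0) and (10, 0) converges in two iterations to the centers (0, 1) and (10, 1).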

This algorithm always converges to a local minimum of the objective. To show convergence, we argue that the sum of the squared distances of each point to its cluster center never increases. Each iteration consists of two steps. First, consider the step that finds the centroid of each cluster and replaces the old centers with the centroids. By Corollary 8.2, this step cannot increase the sum of internal cluster distances squared. The second step reclusters by assigning each point to its nearest cluster center, which likewise cannot increase that sum. Since the objective is bounded below by zero and never increases, the algorithm converges.

A problem that arises with some implementations of the k-means clustering algorithm is that one or more of the clusters becomes empty, leaving no center from which to measure distance. A simple case where this occurs is illustrated in the following example. You might think about how you would modify the code to resolve this issue.

Example: Consider running the k-means clustering algorithm to find three clusters on the following 1-dimensional data set: {2, 3, 7, 8}, starting with centers {0, 5, 10}.

[Figure: number lines from 0 to 10 showing the data points and the positions of the centers at successive iterations.]

The center at 5 ends up with no items, and there are only two clusters instead of the desired three.
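The failure can be reproduced by tracing the two steps of one round by hand. The snippet below is illustrative, not from the text; the `nearest` helper is a hypothetical name:

```python
def nearest(x, centers):
    # nearest center to a 1-D point (ties go to the earlier center)
    return min(centers, key=lambda c: abs(x - c))

data = [2, 3, 7, 8]
centers = [0, 5, 10]

# First clustering step: 2 -> 0; 3 and 7 -> 5; 8 -> 10.
clusters = {c: [x for x in data if nearest(x, centers) == c] for c in centers}

# Centroid step: the new centers are 2, 5, and 8.
centers = [sum(pts) / len(pts) for pts in clusters.values()]

# Second clustering step: 2 and 3 -> 2; 7 and 8 -> 8.
# The center at 5 is left with no points at all.
clusters = {c: [x for x in data if nearest(x, centers) == c] for c in centers}
```

After the second clustering step, the cluster belonging to the center at 5 is empty, exactly as in the example, and an implementation that divides by the cluster size to compute the next centroid would fail here.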

As noted above, Lloyd’s algorithm only finds a local optimum of the k-means objective, which might not be globally optimal. Consider, for example, Figure 8.2. Here the data lies in three dense clusters in R²: one centered at (0, 1), one centered at (0, −1), and one centered at (3, 0). If we initialize with, say, one center at (0, 1) and two centers near (3, 0), then the center at (0, 1) will move to near (0, 0) and capture the points near (0, 1) and (0, −1), whereas the centers near (3, 0) will just stay there, splitting that cluster.

Because the initial centers can substantially influence the quality of the result, there has been significant work on initialization strategies for Lloyd’s algorithm. One popular
