
But for l ≠ l′, by hypothesis the cluster centers are at least 15kσ(C)/ε apart. This implies that for i, k ∉ M with i and k in different clusters of C,

|v_i − v_k| ≥ |c_i − c_k| − |c_i − v_i| − |v_k − c_k| ≥ 9kσ(C)/(2ε).    (8.4)

We will show by induction on the number of iterations of Step 3 the invariant that at the end of t iterations of Step 3, S consists of the union of k − t of the k sets C_l \ M plus a subset of M. In other words, but for elements of M, S is precisely the union of k − t clusters of C. Clearly, this holds for t = 0. Suppose it holds for some t, and suppose that in iteration t + 1 of Step 3 of the algorithm we choose an i_0 ∉ M, say with i_0 in cluster l of C. Then by (8.3) and (8.4), T will contain all points of C_l \ M and will contain no points of C_{l′} \ M for any l′ ≠ l. This proves that the invariant still holds after the iteration. Also, provided i_0 ∉ M, the cluster returned by the algorithm will agree with C_l except possibly on M.

Now by (8.2), |M| ≤ ε²n, and since each |C_l| ≥ εn, we have |C_l \ M| ≥ (ε − ε²)n. If we have done fewer than k iterations of Step 3 and have not yet peeled off C_l, then there are still at least (ε − ε²)n points of C_l \ M left. So the probability that the next pick i_0 lands in M is at most |M|/((ε − ε²)n) ≤ ε/k by (8.2). So with probability at least 1 − ε, all k of the i_0's we pick lie outside M, and the theorem follows.
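Spelled out, the final step is a union bound over the k random picks made in Step 3: each pick falls in M with probability at most ε/k, so

Pr[some pick i_0 lies in M] ≤ Σ_{t=1}^{k} ε/k = ε,

and hence all k picks avoid M simultaneously with probability at least 1 − ε, as claimed.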

8.7 High-Density Clusters

We now turn from the assumption that clusters are center-based to the assumption that clusters consist of high-density regions, separated by low-density moats such as in Figure 8.1.

8.7.1 Single-linkage

One natural algorithm for clustering under the high-density assumption is called single linkage. This algorithm begins with each point in its own cluster and then repeatedly merges the two “closest” clusters into one, where the distance between two clusters is defined as the minimum distance between points in each cluster. That is, d_min(C, C′) = min_{x∈C, y∈C′} d(x, y), and the algorithm merges the two clusters C and C′ whose d_min value is smallest over all pairs of clusters, breaking ties arbitrarily. It then continues until there are only k clusters. This is called an agglomerative clustering algorithm because it begins with many clusters and then starts merging, or agglomerating, them together.³³ Single linkage is equivalent to running Kruskal's minimum-spanning-tree algorithm, but halting when there are k trees remaining. The following theorem is fairly immediate.

³³Other agglomerative algorithms include complete linkage, which merges the two clusters whose maximum distance between points is smallest, and Ward's algorithm, described earlier, that merges the two clusters that cause the k-means cost to increase by the least.
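As a concrete illustration of the equivalence between single linkage and Kruskal's algorithm noted above, here is a minimal Python sketch (not from the text; the function name single_linkage and the toy points are only illustrative). It sorts all pairwise edges by distance and merges clusters with a union-find forest, halting when k trees remain.

# A minimal sketch of single-linkage clustering, run as Kruskal's
# minimum-spanning-tree algorithm halted at k trees (clusters).
from itertools import combinations
import math


def single_linkage(points, k):
    """Cluster `points` (a list of coordinate tuples) into k clusters by
    repeatedly merging the two clusters with the smallest d_min."""
    n = len(points)
    parent = list(range(n))          # union-find forest: one tree per point

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # All pairwise edges sorted by distance, as in Kruskal's algorithm;
    # ties are broken arbitrarily by the sort order.
    edges = sorted(
        combinations(range(n), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )

    clusters = n
    for i, j in edges:
        if clusters == k:            # halt when k trees remain
            break
        ri, rj = find(i), find(j)
        if ri != rj:                 # merge the two closest distinct clusters
            parent[ri] = rj
            clusters -= 1

    # Group point indices by the root of their tree.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Example: two well-separated groups on a line yield the expected 2 clusters.
print(single_linkage([(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)], k=2))

Processing the globally sorted edge list is what makes this identical to repeatedly merging the pair of clusters with the smallest d_min: the first edge joining two different trees is always a closest inter-cluster pair.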

