
a clustering that is $\epsilon$-close to $C_T$. That is, we can perform as well as if we had a generic 1.1-factor approximation algorithm, even though achieving a 1.1-factor approximation is NP-hard in general. Results known for the k-means objective are somewhat weaker, finding a clustering that is $O(\epsilon)$-close to $C_T$. Here, we show this for the k-median objective, where the analysis is cleanest. We make a few additional assumptions in order to focus on the main idea. In the following, one should think of $\epsilon$ as $o\!\left(\frac{c-1}{k}\right)$. The results described here apply to data in any metric space; it need not be $\mathbb{R}^d$.³¹

For simplicity and ease of notation, assume that $C_T = C^*$; that is, the target clustering is also the optimum for the objective. For a given data point $a_i$, define its weight $w(a_i)$ to be its distance to the center of its cluster in $C^*$. Notice that for the k-median objective, we have $\Phi(C^*) = \sum_{i=1}^{n} w(a_i)$. Define $w_{\mathrm{avg}} = \Phi(C^*)/n$ to be the average weight of the points in $A$. Finally, define $w_2(a_i)$ to be the distance of $a_i$ to its second-closest center in $C^*$. We now begin with a useful lemma.
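To make these definitions concrete, here is a minimal Python sketch (not from the text; the Euclidean metric, the toy points and centers, and the helper name `weights` are illustrative assumptions) that computes $w(a_i)$, $w_2(a_i)$, $\Phi(C^*)$, and $w_{\mathrm{avg}}$ from a set of points and the optimal centers.

```python
from math import dist  # Euclidean distance; any metric d(., .) could be substituted


def weights(points, centers):
    """For each point, return its distance to the closest center (w) and to the
    second-closest center (w2) in C*, matching the definitions above."""
    w, w2 = [], []
    for a in points:
        d_sorted = sorted(dist(a, c) for c in centers)
        w.append(d_sorted[0])   # w(a_i): distance to own cluster center
        w2.append(d_sorted[1])  # w_2(a_i): distance to second-closest center
    return w, w2


# Toy data (illustrative only): two clusters in the plane, k = 2 centers.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = [(0.0, 0.1), (5.1, 5.0)]

w, w2 = weights(points, centers)
phi = sum(w)               # k-median cost: Phi(C*) = sum_i w(a_i)
w_avg = phi / len(points)  # average weight of the points in A
```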

Lemma 8.4 Assume dataset $A$ satisfies $(c, \epsilon)$ approximation-stability with respect to the k-median objective, each cluster in $C_T$ has size at least $2\epsilon n$, and $C_T = C^*$. Then,

1. Fewer than $\epsilon n$ points $a_i$ have $w_2(a_i) - w(a_i) \leq (c-1)w_{\mathrm{avg}}/\epsilon$.

2. At most $5\epsilon n/(c-1)$ points $a_i$ have $w(a_i) \geq (c-1)w_{\mathrm{avg}}/(5\epsilon)$.

Proof: For part (1), suppose that $\epsilon n$ points $a_i$ have $w_2(a_i) - w(a_i) \leq (c-1)w_{\mathrm{avg}}/\epsilon$. Consider modifying $C_T$ to a new clustering $C'$ by moving each of these points $a_i$ into the cluster containing its second-closest center. By assumption, the k-median cost of the clustering has increased by at most $\epsilon n \cdot (c-1)w_{\mathrm{avg}}/\epsilon = (c-1)\Phi(C^*)$. This means that $\Phi(C') \leq c \cdot \Phi(C^*)$. However, $\mathrm{dist}(C', C_T) = \epsilon$ because (a) we moved $\epsilon n$ points to different clusters, and (b) each cluster in $C_T$ has size at least $2\epsilon n$, so the optimal permutation $\sigma$ in the definition of dist remains the identity. So, this contradicts approximation stability. Part (2) follows from the definition of "average"; if it did not hold, then $\sum_{i=1}^{n} w(a_i) > n w_{\mathrm{avg}}$, a contradiction.

A datapoint $a_i$ is bad if it satisfies either item (1) or (2) of Lemma 8.4 and good if it satisfies neither one. So, there are at most $b = \epsilon n + \frac{5\epsilon n}{c-1}$ bad points and the rest are good. Define the "critical distance" $d_{\mathrm{crit}} = \frac{(c-1)w_{\mathrm{avg}}}{5\epsilon}$. So, Lemma 8.4 implies that the good points have distance at most $d_{\mathrm{crit}}$ to the center of their own cluster in $C^*$ and distance at least $5d_{\mathrm{crit}}$ to the center of any other cluster in $C^*$.
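The good/bad classification can be read straight off the lemma. Below is a hedged Python sketch (the function name `split_good_bad` and the use of strict inequalities on the "good" side are our choices, not the book's); it reuses the weight lists `w` and `w2` from the earlier sketch.

```python
def split_good_bad(w, w2, c, eps):
    """Classify points per Lemma 8.4: a point is bad if it satisfies
    condition (1) or (2) of the lemma, and good otherwise."""
    n = len(w)
    w_avg = sum(w) / n                         # Phi(C*) / n
    d_crit = (c - 1) * w_avg / (5 * eps)       # the critical distance
    good = [i for i in range(n)
            if w2[i] - w[i] > 5 * d_crit       # fails condition (1)
            and w[i] < d_crit]                 # fails condition (2)
    good_set = set(good)
    bad = [i for i in range(n) if i not in good_set]
    b_bound = eps * n + 5 * eps * n / (c - 1)  # at most this many bad points
    return good, bad, d_crit, b_bound
```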

This suggests the following algorithm. Suppose we create a graph $G$ with the points $a_i$ as vertices, and edges between any two points $a_i$ and $a_j$ with $d(a_i, a_j) < 2d_{\mathrm{crit}}$. Notice
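A minimal sketch of this graph construction, assuming the same toy setup as above (pure-Python adjacency lists; `threshold_graph` and the default Euclidean metric are illustrative choices):

```python
from math import dist  # any metric d(., .) could be substituted


def threshold_graph(points, d_crit, d=dist):
    """Build the graph G described above: one vertex per point, with an
    edge {i, j} whenever d(a_i, a_j) < 2 * d_crit."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if d(points[i], points[j]) < 2 * d_crit:
                adj[i].append(j)
                adj[j].append(i)
    return adj
```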

³¹An example of a data set satisfying $(2, 0.2/k)$ approximation stability would be $k$ clusters of $n/k$ points each, where in each cluster, 90% of the points are within distance 1 of the cluster center (call this the "core" of the cluster), with the other 10% arbitrary, and all cluster cores are at distance at least $10k$ apart.

