Foundations of Data Science

8.4 Finding Low-Error Clusterings

In the previous sections we saw algorithms for finding a local optimum to the k-means clustering objective, for finding a global optimum to the k-means objective on the line, and for finding a factor 2 approximation to the k-center objective. But what about finding a clustering that is close to the correct answer, such as the true clustering of proteins by function or a correct clustering of news articles by topic? For this we need some assumption about the data and what the correct answer looks like. In the next two sections we will see two different natural assumptions, and algorithms with guarantees based on them.

8.5 Approximation Stability

Implicit in considering objectives like k-means, k-median, or k-center is the hope that the optimal solution to the objective is a desirable clustering. Implicit in considering algorithms that find near-optimal solutions is the hope that near-optimal solutions are also desirable clusterings. Let's now make this idea formal.

Let C = {C_1, . . . , C_k} and C′ = {C′_1, . . . , C′_k} be two different k-clusterings of some data set A. A natural notion of the distance between these two clusterings is the fraction of points that would have to be moved between clusters in C to make it match C′, where by "match" we allow the indices to be permuted. Since C and C′ are both partitions of the set A, this is the same as the fraction of points that would have to be moved among clusters in C′ to make it match C. We can write this distance mathematically as:

$$\mathrm{dist}(C, C') \;=\; \min_{\sigma} \frac{1}{n} \sum_{i=1}^{k} \left| C_i \setminus C'_{\sigma(i)} \right|,$$

where the minimum is over all permutations σ of {1, . . . , k}.
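To make the definition concrete, here is a minimal sketch (not from the text) that computes this distance by brute force over all k! permutations σ; the function name and the set-of-point-ids representation of clusters are illustrative choices.

```python
from itertools import permutations

def clustering_distance(C, Cprime):
    """Fraction of points that must move between clusters of C to match Cprime,
    minimized over all matchings sigma of cluster indices (brute force over k!)."""
    assert len(C) == len(Cprime), "both clusterings must use the same k"
    n = sum(len(cluster) for cluster in C)   # total number of points in the data set A
    k = len(C)
    best = n                                 # worst case: every point must move
    for sigma in permutations(range(k)):     # try every way of matching indices
        moved = sum(len(C[i] - Cprime[sigma[i]]) for i in range(k))
        best = min(best, moved)
    return best / n

# Two 2-clusterings of the points {0,...,5} that disagree on a single point:
C  = [{0, 1, 2}, {3, 4, 5}]
Cp = [{0, 1},    {2, 3, 4, 5}]
print(clustering_distance(C, Cp))            # 0.1666... = 1/6
```

For large k the brute-force search over permutations is expensive; the same minimum can be computed in polynomial time by solving a maximum-weight bipartite matching between the clusters of C and C′.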

Given an objective Φ (such as k-means, k-median, etc.), define C^* to be the clustering that minimizes Φ. Define C_T to be the "target" clustering we are aiming for, such as correctly clustering documents by topic or correctly clustering protein sequences by their function. For c ≥ 1 and ε > 0 we say that a data set satisfies (c, ε) approximation-stability with respect to objective Φ if every clustering C with Φ(C) ≤ c · Φ(C^*) satisfies dist(C, C_T) < ε. That is, it is sufficient to be within a factor c of optimal to the objective Φ in order for the fraction of points clustered incorrectly to be less than ε.
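As an illustration (again a sketch with assumed names, not the book's code), one can check whether a particular clustering C witnesses a violation of (c, ε) approximation-stability: C violates it if its objective value is within a factor c of optimal yet it is still ε-far from the target. The k-median cost below evaluates a fixed partition by choosing the best center among each cluster's own points, and the check reuses clustering_distance from the sketch above.

```python
import math

def k_median_cost(C, points):
    """k-median objective of a fixed partition: for each cluster, pick the data
    point in it that minimizes the sum of distances to the rest, and add up."""
    cost = 0.0
    for cluster in C:
        pts = [points[i] for i in cluster]
        cost += min(sum(math.dist(p, q) for p in pts) for q in pts)
    return cost

def violates_approx_stability(C, C_target, C_opt, phi, points, c, eps):
    """True if C is within a factor c of optimal for phi yet more than an eps
    fraction of points are clustered differently from the target clustering."""
    near_optimal = phi(C, points) <= c * phi(C_opt, points)
    far_from_target = clustering_distance(C, C_target) >= eps
    return near_optimal and far_from_target
```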

What is interesting about approximation-stability is the following. The current best polynomial-time approximation guarantee known for the k-means objective is roughly a factor of nine, and for k-median it is roughly a factor 2.7; beating a factor 1 + 3/e for k-means and 1 + 1/e for k-median are both NP-hard. Nonetheless, given data that satisfies (1.1, ε) approximation-stability for the k-median objective, it turns out that so long as εn is sufficiently small compared to the smallest cluster in C_T, we can efficiently find a clustering that is ε-close to C_T.

