08.10.2016 Views

Foundations of Data Science


We will show that the following algorithm satisfies these three axioms.

Balanced $k$-means algorithm

Among all partitions of the input set of $n$ points into $k$ sets, each of size $n/k$, return the one that minimizes the sum of squared distances between all pairs of points in the same cluster.
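As an illustration of the definition above, here is a minimal brute-force sketch in Python. It assumes points are given as tuples of coordinates and that $n$ is divisible by $k$. The function names (`cluster_cost`, `balanced_partitions`, `balanced_k_means`) are hypothetical and chosen only for this example. Because it enumerates every balanced partition, it is practical only for very small inputs.

```python
from itertools import combinations


def cluster_cost(points, block):
    """Sum of squared Euclidean distances over all unordered pairs in one cluster.

    Counting ordered pairs instead would double every cost, which does not
    change which partition is the minimizer.
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(block, 2)
    )


def balanced_partitions(indices, size):
    """Yield every partition of `indices` into blocks of exactly `size` elements."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    # Always put the smallest remaining index in the next block, so that
    # each partition is generated exactly once.
    for mates in combinations(rest, size - 1):
        block = (first,) + mates
        remaining = [i for i in rest if i not in mates]
        for tail in balanced_partitions(remaining, size):
            yield [block] + tail


def balanced_k_means(points, k):
    """Return a minimum-cost partition (as tuples of indices) into k equal-size clusters."""
    n = len(points)
    if n % k != 0:
        raise ValueError("n must be divisible by k")
    return min(
        balanced_partitions(list(range(n)), n // k),
        key=lambda part: sum(cluster_cost(points, b) for b in part),
    )


# Six points on a line forming two well-separated groups of three.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
best = balanced_k_means(points, 2)
```

Here `best` holds index tuples rather than the points themselves, which keeps coincident points distinguishable.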

Theorem 8.15 The balanced $k$-means algorithm satisfies the consistency condition, scale invariance, and the richness property.

Proof: Scale invariance is obvious. Richness is also easy to see: just place $n/k$ points of $S$ to coincide with each point of $K$. To prove consistency, define the cost of a cluster $T$ to be the sum of squared distances of all pairs of points in $T$.
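The scale-invariance claim can be spelled out in one line. The sketch below assumes a scaling factor $\alpha > 0$ and writes $\alpha T$ for the cluster $T$ with every point multiplied by $\alpha$. Both symbols are introduced here for illustration only.

```latex
\[
\mathrm{cost}(\alpha T)
  = \sum_{\{u,v\} \subseteq T} |\alpha u - \alpha v|^2
  = \alpha^2 \sum_{\{u,v\} \subseteq T} |u - v|^2
  = \alpha^2 \, \mathrm{cost}(T).
\]
```

Every balanced partition therefore has its total cost multiplied by the same factor $\alpha^2$, so the set of minimizing partitions does not change.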

Suppose $S_1, S_2, \ldots, S_k$ is an optimal clustering of $S$ according to the balanced $k$-means algorithm. Move a point $x \in S_1$ to $z$ so that its distance to each point in $S_1$ is nonincreasing and its distance to each point in $S_2, S_3, \ldots, S_k$ is nondecreasing. Suppose $T_1, T_2, \ldots, T_k$ is an optimal clustering after the move. Without loss of generality assume $z \in T_1$. Define $\tilde{T}_1 = (T_1 \setminus \{z\}) \cup \{x\}$ and $\tilde{S}_1 = (S_1 \setminus \{x\}) \cup \{z\}$. Note that $\tilde{T}_1, T_2, \ldots, T_k$ is a clustering before the move, although not necessarily an optimal clustering. Thus

\[ \mathrm{cost}(\tilde{T}_1) + \mathrm{cost}(T_2) + \cdots + \mathrm{cost}(T_k) \ge \mathrm{cost}(S_1) + \mathrm{cost}(S_2) + \cdots + \mathrm{cost}(S_k). \]

If $\mathrm{cost}(T_1) - \mathrm{cost}(\tilde{T}_1) \ge \mathrm{cost}(\tilde{S}_1) - \mathrm{cost}(S_1)$, then adding this inequality to the preceding one gives

\[ \mathrm{cost}(T_1) + \mathrm{cost}(T_2) + \cdots + \mathrm{cost}(T_k) \ge \mathrm{cost}(\tilde{S}_1) + \mathrm{cost}(S_2) + \cdots + \mathrm{cost}(S_k). \]

Since $T_1, T_2, \ldots, T_k$ is an optimal clustering after the move, so also must be $\tilde{S}_1, S_2, \ldots, S_k$, proving the theorem.

It remains to show that $\mathrm{cost}(T_1) - \mathrm{cost}(\tilde{T}_1) \ge \mathrm{cost}(\tilde{S}_1) - \mathrm{cost}(S_1)$. Let $u$ and $v$ stand for elements other than $x$ and $z$ in $S_1$ and $T_1$. The terms $|u - v|^2$ are common to $T_1$ and $\tilde{T}_1$ on the left-hand side and cancel out. So too on the right-hand side. So we need only prove

\[ \sum_{u \in T_1} \left( |z - u|^2 - |x - u|^2 \right) \ge \sum_{u \in S_1} \left( |z - u|^2 - |x - u|^2 \right). \]

For $u \in S_1 \cap T_1$, the terms appear on both sides, and we may cancel them, so we are left to prove

\[ \sum_{u \in T_1 \setminus S_1} \left( |z - u|^2 - |x - u|^2 \right) \ge \sum_{u \in S_1 \setminus T_1} \left( |z - u|^2 - |x - u|^2 \right), \]

which is true because by the movement of $x$ to $z$, each term on the left-hand side is nonnegative and each term on the right-hand side is nonpositive.
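The consistency argument can be sanity-checked numerically on a tiny instance. The sketch below is only an illustration under stated assumptions, not part of the proof. It uses six points on a line with $k = 2$, moves one point in a way that satisfies the hypothesis, and compares the cost of the original clustering after the move with the brute-force minimum. All function names and the example data are hypothetical.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def total_cost(points, partition):
    """Sum over clusters of squared distances between all unordered pairs."""
    return sum(
        sq_dist(points[i], points[j])
        for block in partition
        for i, j in combinations(block, 2)
    )


def balanced_partitions(indices, size):
    """Yield every partition of `indices` into blocks of exactly `size` elements."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for mates in combinations(rest, size - 1):
        remaining = [i for i in rest if i not in mates]
        for tail in balanced_partitions(remaining, size):
            yield [(first,) + mates] + tail


def min_cost(points, k):
    """Brute-force minimum total cost over all balanced partitions."""
    n = len(points)
    return min(
        total_cost(points, part)
        for part in balanced_partitions(list(range(n)), n // k)
    )


def is_consistent_move(points, partition, i, z):
    """Check the hypothesis of the proof for moving points[i] to z:
    distances within its own cluster do not increase, and distances
    to points in other clusters do not decrease."""
    own = next(block for block in partition if i in block)
    for j in range(len(points)):
        if j == i:
            continue
        before, after = sq_dist(points[i], points[j]), sq_dist(z, points[j])
        if j in own and after > before:
            return False
        if j not in own and after < before:
            return False
    return True


# Assumed example: two groups of three points on a line.
before = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
clustering = [(0, 1, 2), (3, 4, 5)]  # optimal before the move
i, z = 2, (1.5,)                     # move the point at 2.0 to 1.5
after = before[:i] + [z] + before[i + 1:]

hypothesis_holds = is_consistent_move(before, clustering, i, z)
still_optimal = abs(total_cost(after, clustering) - min_cost(after, 2)) < 1e-9
```

The comparison is made on costs rather than on partitions because an optimal clustering need not be unique. The theorem only guarantees that the clustering with $x$ replaced by $z$ attains the minimum cost.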
