K-means clustering algorithm - ISCAS 2007

K-<strong>means</strong> <strong>clustering</strong> <strong>algorithm</strong><br />

Once the number of clusters is set, the <strong>algorithm</strong> to find the clusters is straightforward:<br />

Let the set of samples be x^(i), i = 0, …, N - 1, and the set of cluster centers be m_j, j = 0, …, K - 1.<br />

For each x^(i), i = 0, …, N - 1, assign x^(i) to Cluster j if j = argmin_k || x^(i) - m_k ||.<br />

For each Cluster j, j = 0, …, K - 1, update m_j to the mean of the samples x^(i) in Cluster j.<br />

Stop if no cluster center is updated.
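The steps above can be sketched in Python (a minimal sketch using NumPy; the function name and the random initialization scheme are illustrative, not from the lecture):

```python
import numpy as np

def k_means(x, K, max_iter=100, seed=0):
    """Minimal k-means: x is an (N, p) array of samples."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # Initialize the K cluster centers with K distinct samples.
    m = x[rng.choice(N, size=K, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: put x^(i) in Cluster j, j = argmin_k ||x^(i) - m_k||.
        dist = np.linalg.norm(x[:, None, :] - m[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each m_j to the mean of its assigned samples
        # (an empty cluster keeps its previous center).
        new_m = np.array([x[labels == j].mean(axis=0) if np.any(labels == j) else m[j]
                          for j in range(K)])
        # Stop if no cluster center is updated.
        if np.allclose(new_m, m):
            break
        m = new_m
    return labels, m
```

The stopping test compares centers with a tolerance rather than exact equality, which is the practical equivalent of "no cluster center is updated" in floating point.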


How to set k in k-<strong>means</strong> <strong>clustering</strong><br />

For K = 1, 2, 3, …, run the k-<strong>means</strong> <strong>clustering</strong> <strong>algorithm</strong>.<br />

After the k-<strong>means</strong> <strong>algorithm</strong> has converged, we have cluster assignments for each sample as well as the locations of the cluster centers.<br />

Let d_K be the distortion of the <strong>clustering</strong> result: the mean squared distance of a sample from its corresponding cluster center. Compute it as d_K = (1/N) sum_{j=0}^{K-1} sum_{x^(i) in Cluster j} || x^(i) - m_j ||^2. Note that d_K decreases as K increases.<br />

Let the transformed distortion be d_K^(-p/2), where p is the dimension of the data samples.<br />

The jump value of the transformed distortion is J_K = d_K^(-p/2) - d_{K-1}^(-p/2). (Assume d_0^(-p/2) = 0 when computing J_1.)<br />

The peak of the jump values corresponds to the K that provides the best description of the original samples.
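The jump computation can be sketched as follows (a minimal sketch; `jump_values` and `best_k` are illustrative names, and `distortions` is assumed to hold d_1, …, d_Kmax produced by the k-means runs above):

```python
import numpy as np

def jump_values(distortions, p):
    """Given d_K for K = 1..Kmax, return J_K = d_K^(-p/2) - d_{K-1}^(-p/2).

    The transformed distortion for K = 0 is taken as 0, so J_1 = d_1^(-p/2).
    """
    d = np.asarray(distortions, dtype=float)
    transformed = d ** (-p / 2)
    # Prepend d_0^(-p/2) = 0, then difference consecutive entries.
    return np.diff(np.concatenate([[0.0], transformed]))

def best_k(distortions, p):
    # K at the peak jump value (K is 1-indexed).
    return int(np.argmax(jump_values(distortions, p))) + 1
```

For example, with p = 2 the transformed distortion is simply 1/d_K, so distortions [4.0, 1.0, 0.9] give jumps [0.25, 0.75, 0.111…] and the peak selects K = 2.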


How to set k in k-<strong>means</strong> <strong>clustering</strong><br />

• Another example<br />

– Higher-dimensional data set<br />

In class, I often talk about a training set of 4 billion vectors, each having 4000 features…<br />

And yet, all the examples we have seen are 2-d or, as in the case of the “bonus” examples in the Lecture 20 notes, 3-d.<br />

Let us look at the results of processing a high-dimensional data set!


How to set k in k-<strong>means</strong> <strong>clustering</strong><br />

• 17-dimensional data set; i.e., p = 17<br />

• 12,000 vectors<br />

• The data set has 21 groups<br />

– Group 0 has prior probability 0.20<br />

– The remaining 20 groups have equal prior probability (0.04)<br />

• Each group has the same Gaussian density, differing only in the group mean<br />

• The group covariance matrix is the identity matrix; i.e., the features are pairwise uncorrelated
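A data set matching this description can be generated along these lines (a sketch: the group means below are arbitrary placeholders, since the lecture does not specify them, and `make_data` is an illustrative name):

```python
import numpy as np

def make_data(n=12000, p=17, n_groups=21, seed=0):
    rng = np.random.default_rng(seed)
    # Priors: group 0 has probability 0.20, the other 20 groups 0.04 each.
    priors = np.array([0.20] + [0.04] * (n_groups - 1))
    # Arbitrary, well-separated group means (placeholders, not from the lecture).
    means = rng.uniform(-10.0, 10.0, size=(n_groups, p))
    # Draw a group label for each sample according to the priors.
    groups = rng.choice(n_groups, size=n, p=priors)
    # Identity covariance: unit-variance, pairwise-uncorrelated noise per feature.
    x = means[groups] + rng.normal(0.0, 1.0, size=(n, p))
    return x, groups
```

Because every group shares the identity covariance, the groups differ only in their means, exactly as stated above.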


Results<br />

• Run k-<strong>means</strong> for K = 1, 2, …, 25<br />

• After each run,<br />

– Compute the mean squared distance to the corresponding cluster center as the total distortion<br />

– Compute the transformed distortion<br />

– Compute the jump of the transformed distortion<br />

– Compute the inverse of the distortion (for comparison)<br />

– Compute the jump of the inverted distortion


[Figure: Distortion vs. number of clusters, K = 1 to 25.]


[Figure: Transformed distortion vs. number of clusters, K = 1 to 25.]<br />

[Figure: Inverted distortion vs. number of clusters, K = 1 to 25.]


[Figure: Inverted distortion and jump of inverted distortion vs. number of clusters, K = 1 to 25. Using the inverted distortion, the best choice is K = 1.]


[Figure: Transformed distortion and jump of transformed distortion vs. number of clusters, K = 1 to 25. Using the transformed distortion, the best choice is K = 25!]


Results<br />

• I expected the jump value of the transformed distortion to peak at K = 21<br />

• I did not get what I expected to see<br />

• Although not shown here, two other examples show similar results


Discussion<br />

• We note that the results are subject to sampling errors (also see the “bonus” 4-group 2-d example in the Lecture 20 notes)<br />

• The jump value of the transformed distortion does get us to the neighborhood of the correct K (the high values are at K = 17, 20, 22, 25)<br />

• Because the peak occurs at K = 25, we really should have done a few more runs at larger values of K<br />

• By comparison, the jump value of the inverted distortion selected a one-cluster result as the best description
