20 — An Example Inference Task: Clustering

Figure 20.5. K-means algorithm for a case with two dissimilar clusters. (a) The "little 'n' large" data. (b) A stable set of assignments and means. Note that four points belonging to the broad cluster have been incorrectly assigned to the narrower cluster. (Points assigned to the right-hand cluster are shown by plus signs.)

Figure 20.6. Two elongated clusters, and the stable solution found by the K-means algorithm.

… function is provided as part of the problem definition; but I'm assuming we are interested in data-modelling rather than vector quantization.] How do we choose K? Having found multiple alternative clusterings for a given K, how can we choose among them?

Cases where K-means might be viewed as failing.

Further questions arise when we look for cases where the algorithm behaves badly (compared with what the man in the street would call 'clustering'). Figure 20.5a shows a set of 75 data points generated from a mixture of two Gaussians. The right-hand Gaussian has less weight (only one fifth of the data points), and it is a less broad cluster. Figure 20.5b shows the outcome of using K-means clustering with K = 2 means. Four of the big cluster's data points have been assigned to the small cluster, and both means end up displaced to the left of the true centres of the clusters. The K-means algorithm takes account only of the distance between the means and the data points; it has no representation of the weight or breadth of each cluster. Consequently, data points that actually belong to the broad cluster are incorrectly assigned to the narrow cluster.

Figure 20.6 shows another case of K-means behaving badly. The data evidently fall into two elongated clusters. But the only stable state of the K-means algorithm is that shown in figure 20.6b: the two clusters have been sliced in half! These two examples show that there is something wrong with the distance d in the K-means algorithm. The K-means algorithm has no way of representing the size or shape of a cluster.

A final criticism of K-means is that it is a 'hard' rather than a 'soft' algorithm: points are assigned to exactly one cluster and all points assigned to a cluster are equals in that cluster. Points located near the border between two or more clusters should, arguably, play a partial role in determining the locations of all the clusters that they could plausibly be assigned to. But in the K-means algorithm, each borderline point is dumped in one cluster, and …
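To make the "little 'n' large" failure concrete, here is a minimal sketch in Python (assuming NumPy) that draws points from a mixture like the one in figure 20.5 and runs plain hard K-means with K = 2. The centres, widths and mixing weights below are illustrative assumptions for the sketch, not the exact parameters behind the figure.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative "little 'n' large" data (centres, widths and weights are
# assumptions for this sketch, not the exact parameters behind figure 20.5):
# a broad cluster with four fifths of the points, a narrow one with one fifth.
broad  = rng.normal(loc=[3.0, 5.0], scale=1.5, size=(60, 2))
narrow = rng.normal(loc=[7.0, 5.0], scale=0.4, size=(15, 2))
x = np.vstack([broad, narrow])

def kmeans(x, k, n_iters=50):
    # Plain (hard) K-means: assign each point to its nearest mean, then
    # move each mean to the centroid of the points assigned to it.
    means = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iters):
        d = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)  # point-to-mean distances
        assign = d.argmin(axis=1)                                      # hard assignments
        for j in range(k):
            if np.any(assign == j):
                means[j] = x[assign == j].mean(axis=0)
    return means, assign

means, assign = kmeans(x, k=2)

# The assignment rule looks only at distance to the means, so points on the
# inner edge of the broad cluster are captured by the narrow cluster's mean.
right = means[:, 0].argmax()                  # index of the right-hand (narrow) mean
stolen = int(np.sum(assign[:60] == right))    # broad-cluster points handed to it
print("final means:\n", means)
print("broad-cluster points assigned to the narrow cluster:", stolen)

Running the sketch will typically report a handful of broad-cluster points captured by the narrow cluster's mean, with both means pulled away from the true centres — exactly the misassignment illustrated in figure 20.5b, since the assignment step has no notion of each cluster's weight or breadth.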
