Data Mining: Practical Machine Learning Tools and Techniques
CHAPTER 6 | IMPLEMENTATIONS: REAL MACHINE LEARNING SCHEMES

weighted learning, primarily in the context of regression problems. Frank et al. (2003) evaluated the use of locally weighted learning in conjunction with Naïve Bayes.

6.6 Clustering

In Section 4.8 we examined the k-means clustering algorithm, in which k initial points are chosen to represent initial cluster centers, all data points are assigned to the nearest one, the mean value of the points in each cluster is computed to form its new cluster center, and iteration continues until there are no changes in the clusters. This procedure only works when the number of clusters is known in advance, and this section begins by describing what you can do if it is not.

Next we examine two techniques that do not partition instances into disjoint clusters as k-means does. The first is an incremental clustering method that was developed in the late 1980s and embodied in a pair of systems called Cobweb (for nominal attributes) and Classit (for numeric attributes). Both come up with a hierarchical grouping of instances and use a measure of cluster "quality" called category utility. The second is a statistical clustering method based on a mixture model of different probability distributions, one for each cluster. It assigns instances to classes probabilistically, not deterministically. We explain the basic technique and sketch the working of a comprehensive clustering scheme called AutoClass.

Choosing the number of clusters

Suppose you are using k-means but do not know the number of clusters in advance. One solution is to try out different possibilities and see which is best, that is, which one minimizes the total squared distance of all points to their cluster center. A simple strategy is to start from a given minimum, perhaps k = 1, and work up to a small fixed maximum, using cross-validation to find the best value. Because k-means is slow, and cross-validation makes it even slower, it will probably not be feasible to try many possible values for k. Note that on the training data the "best" clustering according to the total squared distance criterion will always be to choose as many clusters as there are data points! To penalize solutions with many clusters you have to apply something like the MDL criterion of Section 5.10, or use cross-validation.

Another possibility is to begin by finding a few clusters and determining whether it is worth splitting them. You could choose k = 2, perform k-means clustering until it terminates, and then consider splitting each cluster. Computation time will be reduced considerably if the initial two-way clustering is considered irrevocable and splitting is investigated for each component
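To make the cross-validation strategy concrete, here is a minimal sketch in Python (using NumPy and scikit-learn's KMeans); the function names, fold count, and candidate range are illustrative choices, not taken from the book. For each candidate k it fits k-means on the training folds, scores the fit by the total squared distance of the held-out points to their nearest cluster center, and keeps the k with the lowest score.

```python
import numpy as np
from sklearn.cluster import KMeans

def cv_score_for_k(X, k, n_folds=5, seed=0):
    """Cross-validated total squared distance to the nearest center for a given k."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    total = 0.0
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[train_idx])
        # squared distance of each held-out point to its nearest cluster center
        held_out = X[test_idx]
        d2 = ((held_out[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
        total += d2.min(axis=1).sum()
    return total

def choose_k(X, k_min=1, k_max=10):
    """Try each candidate k and keep the one with the lowest held-out score."""
    scores = {k: cv_score_for_k(X, k) for k in range(k_min, k_max + 1)}
    return min(scores, key=scores.get)
```

Because the score is computed on held-out data, it does not automatically improve as k grows, which avoids the degenerate "one cluster per data point" solution mentioned above.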
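The splitting idea can likewise be sketched as a recursive two-way procedure: cluster the data into two, treat that split as irrevocable, and investigate splitting each component in turn. In the sketch below the decision of whether a split pays off uses a simple gain threshold on the within-cluster squared distance; that threshold is a hypothetical placeholder for the kind of criterion, such as MDL or cross-validation, that the text recommends.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisect(X, min_gain=0.25, min_size=10, seed=0):
    """Recursively split X in two while the split reduces the total squared
    distance by at least a fraction `min_gain` (an illustrative threshold)."""
    # cost of keeping the points as a single cluster around their centroid
    parent_cost = ((X - X.mean(axis=0)) ** 2).sum()
    if len(X) < min_size or parent_cost == 0.0:
        return [X]
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X)
    if km.inertia_ > (1.0 - min_gain) * parent_cost:
        return [X]  # the two-way split does not pay off; keep the cluster as it is
    # the two-way split is treated as irrevocable: recurse into each component
    clusters = []
    for label in (0, 1):
        clusters.extend(bisect(X[km.labels_ == label], min_gain, min_size, seed))
    return clusters
```

Calling bisect(X) returns a flat list of point sets, one per final cluster, without ever revisiting an earlier two-way split.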
