Data Mining: Practical Machine Learning Tools and ... - LIDeCC

6.6 CLUSTERING

the first five instances, there is no such host: it is better, in terms of category utility, to form a new leaf for each instance. With the sixth it finally becomes beneficial to form a cluster, joining the new instance f with the old one, e, which becomes the host. If you look back at Table 4.6 (page 103) you will see that the fifth and sixth instances are indeed very similar, differing only in the windy attribute (and play, which is being ignored here). The next example, g, is placed in the same cluster (it differs from e only in outlook). This involves another call to the clustering procedure. First, g is evaluated to see which of the five children of the root makes the best host; it turns out to be the rightmost, the one that is already a cluster. Then the clustering algorithm is invoked with this as the root, and its two children are evaluated to see which would make the better host. In this case it proves best, according to the category utility measure, to add the new instance as a subcluster in its own right.

If we were to continue in this vein, there would be no possibility of any radical restructuring of the tree, and the final clustering would be excessively dependent on the ordering of examples. To avoid this, there is provision for restructuring, and you can see it come into play when instance h is added in the next step shown in Figure 6.17. In this case two existing nodes are merged into a single cluster: nodes a and d are merged before the new instance h is added. One way of accomplishing this would be to consider all pairs of nodes for merging and evaluate the category utility of each pair.
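The category utility measure that drives these host decisions can be computed directly for nominal attributes. The following is a minimal illustrative sketch in Python; the function name and the instances-as-tuples data layout are assumptions for this example, not Weka's implementation:

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a partition of instances with nominal attributes.

    `clusters` is a list of clusters, each a list of instances, where each
    instance is a tuple of attribute values.  Higher is better: the measure
    rewards partitions whose clusters make attribute values predictable.
    """
    instances = [inst for cluster in clusters for inst in cluster]
    n = len(instances)
    n_attrs = len(instances[0])

    # Baseline: sum over attributes of P(a_i = v_ij)^2, ignoring clusters
    base = 0.0
    for i in range(n_attrs):
        counts = Counter(inst[i] for inst in instances)
        base += sum((c / n) ** 2 for c in counts.values())

    # Weighted within-cluster predictability, minus the baseline,
    # averaged over the number of clusters
    total = 0.0
    for cluster in clusters:
        inner = 0.0
        for i in range(n_attrs):
            counts = Counter(inst[i] for inst in cluster)
            inner += sum((c / len(cluster)) ** 2 for c in counts.values())
        total += (len(cluster) / n) * (inner - base)
    return total / len(clusters)
```

On this measure, splitting two homogeneous groups apart scores higher than lumping them into one cluster, which is exactly the comparison the algorithm makes when deciding whether a new instance should join a host or form a leaf of its own.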
However, that would be computationally expensive and would involve a lot of repeated work if it were undertaken whenever a new instance was added. Instead, whenever the nodes at a particular level are scanned for a suitable host, both the best-matching node (the one that produces the greatest category utility for the split at that level) and the runner-up are noted. The best one will form the host for the new instance (unless that new instance is better off in a cluster of its own). However, before setting to work on putting the new instance in with the host, consideration is given to merging the host and the runner-up. In this case, a is the preferred host and d is the runner-up. When a merge of a and d is evaluated, it turns out that it would improve the category utility measure. Consequently, these two nodes are merged, yielding a version of the fifth hierarchy of Figure 6.17 before h is added. Then, consideration is given to the placement of h in the new, merged node; and it turns out to be best to make it a subcluster in its own right, as shown.

An operation converse to merging is also implemented, called splitting, although it does not take place in this particular example. Whenever the best host is identified, and merging has not proved beneficial, consideration is given to splitting the host node. Splitting has exactly the opposite effect of merging, taking a node and replacing it with its children. For example, splitting the rightmost node in the fourth hierarchy of Figure 6.17 would raise the e, f, and g leaves up a level, making them siblings of a, b, c, and d. Merging and splitting provide
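The host/runner-up bookkeeping and the merge check described above can be sketched at one level of the tree as follows. This is again an illustrative Python sketch, not Weka's code: the function names are invented, the compact `_cu` helper recomputes category utility for nominal attributes, and the splitting check is omitted for brevity:

```python
from collections import Counter

def _cu(clusters):
    # Compact category utility for nominal attributes (same measure
    # discussed in the text; illustrative, not Weka's implementation).
    insts = [x for c in clusters for x in c]
    n, m = len(insts), len(insts[0])

    def sq(group):
        # Sum over attributes of the squared value frequencies in `group`.
        return sum(sum((v / len(group)) ** 2
                       for v in Counter(x[i] for x in group).values())
                   for i in range(m))

    base = sq(insts)
    return sum(len(c) / n * (sq(c) - base) for c in clusters) / len(clusters)

def choose_action(children, new_instance):
    """Score each child as a potential host, note the runner-up, and also
    consider (a) a new singleton leaf and (b) merging host and runner-up
    before adding.  Returns which option scores best.  Requires at least
    two children at this level."""
    scored = []
    for idx in range(len(children)):
        partition = [c + [new_instance] if i == idx else c
                     for i, c in enumerate(children)]
        scored.append((_cu(partition), idx))
    scored.sort(reverse=True)
    (best_cu, best), (_, runner_up) = scored[0], scored[1]

    # Alternative 1: give the new instance a leaf of its own
    new_leaf_cu = _cu(children + [[new_instance]])

    # Alternative 2: merge host and runner-up, then add the instance there
    merged = [c for i, c in enumerate(children) if i not in (best, runner_up)]
    merged.append(children[best] + children[runner_up] + [new_instance])
    merge_cu = _cu(merged)

    options = {"host": best_cu, "new_leaf": new_leaf_cu, "merge": merge_cu}
    return max(options, key=options.get)
```

Note the efficiency point from the text: only the single merge of host and runner-up is evaluated, not all pairs of nodes, which keeps each insertion cheap while still allowing the tree to recover from an unlucky example ordering.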
