Reviews in Computational Chemistry Volume 18
Progress in Clustering Methodology
has time requirements of O(MN²) for M clusters and N compounds, making the method particularly suitable for finding a small number of clusters. Wang, Yan, and Sriskandarajah55 updated the single-criterion minimum-diameter method with a multiple-criteria algorithm that considers both maximum split (intercluster separation) and minimum diameter in deciding the best bipartition. Their algorithm reduces the dissection effect (similar items forced into different clusters because doing so reduces the diameter) associated with the minimum-diameter criterion and the chaining effect associated with the maximum-split criterion. More recently, Steinbach, Karypis, and Kumar56 reported an interesting variant of k-means that is actually a hierarchical polythetic divisive method. At each point where a cluster is to be split into two clusters, the split is determined by using k-means, hence the name "bisecting k-means." The results for document clustering, using keywords as descriptors, are shown to be better than those of standard k-means, with cluster sizes being more uniform, and better than those of the agglomerative group-average method.
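To make the bisecting scheme concrete, a minimal sketch follows. This is not Steinbach et al.'s implementation: the choice of which cluster to bisect next is a free parameter of the method, and splitting the largest remaining cluster, as done here, is only one common selection rule.

```python
import numpy as np

def kmeans2(points, n_iter=10, seed=0):
    """Plain 2-means; returns a boolean mask splitting points into two clusters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign every point to its nearest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        mask = d[:, 0] <= d[:, 1]
        for k, members in enumerate((mask, ~mask)):
            if members.any():
                centers[k] = points[members].mean(axis=0)
    return mask

def bisecting_kmeans(points, n_clusters):
    """Hierarchical polythetic divisive clustering: repeatedly split one
    cluster in two with k-means until n_clusters clusters remain."""
    clusters = [np.arange(len(points))]
    while len(clusters) < n_clusters:
        # which cluster to bisect is a free choice; here, the largest
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        mask = kmeans2(points[idx])
        if mask.all() or not mask.any():
            # degenerate 2-means result: fall back to a median cut
            mask = points[idx, 0] <= np.median(points[idx, 0])
        clusters.append(idx[mask])
        clusters.append(idx[~mask])
    return clusters
```

Because every split is made by k-means over all descriptors at once, each division is polythetic, while the sequence of splits yields the hierarchy.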
Monothetic divisive clustering has largely been ignored, although a closely related classification method has seen continued application and development: recursive partitioning, a type of decision tree method.57–60
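The connection is that recursive partitioning, like a monothetic method, divides the data on one descriptor at a time. As an illustration only (the splitting criterion below, maximal between-group sum of squares, is an assumption for the sketch, not one prescribed by the cited work), a single monothetic split step might look like:

```python
import numpy as np

def best_monothetic_split(X):
    """One monothetic split: scan every (descriptor, threshold) pair and
    return the pair whose bipartition maximizes the between-group sum of
    squares.  The criterion is illustrative, not from the cited methods."""
    n, p = X.shape
    mu = X.mean(axis=0)
    best_score, best_j, best_t = -np.inf, None, None
    for j in range(p):                        # candidate descriptor
        for t in np.unique(X[:, j])[:-1]:     # candidate threshold
            left = X[:, j] <= t
            nl = int(left.sum())
            nr = n - nl
            # weighted squared distances of the group means from the grand mean
            score = (nl * np.sum((X[left].mean(axis=0) - mu) ** 2)
                     + nr * np.sum((X[~left].mean(axis=0) - mu) ** 2))
            if score > best_score:
                best_score, best_j, best_t = score, j, t
    return best_j, best_t
```

Applying this rule recursively to each resulting group gives the divisive hierarchy, with every division interpretable as a single-descriptor decision.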
Nonhierarchical algorithms that cluster the data set in a single pass, such as the leader algorithm, have had little development, except to identify appropriate ways of preordering the data set so as to get around the problem of dependency on processing order (work on this is discussed in the Chemical Applications section). For multipass algorithms, however, efforts have been made to minimize the number of passes required, in some cases reducing them to single-pass algorithms. In the area of data mining, this work has resulted in a method that does not fit neatly into the categorization used in this review. Zhang, Ramakrishnan, and Livny61 developed a program called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), an O(N²) method that performs a single scan of the data set to sort items into a clustering feature (CF) tree. This operation has some similarity with the leader algorithm; the nodes of the tree store summary information about clusters of dense points in the data, so that the original data need not be accessed again during the clustering process. Clustering then proceeds on the in-memory summaries of the data. However, the initial CF tree building requires the maximum cluster diameter to be specified beforehand, and the subsequent tree building is thus sensitive to the value chosen. Overall, the idea of BIRCH is to bring together items that should always be grouped together, with the maximum cluster diameter ensuring that the cluster summaries will all fit into available memory. Ganti et al.62 outlined a variant of BIRCH called BUBBLE. It does not rely on vector operations but builds up the cluster summaries on the basis of a distance function that obeys the triangle inequality, an operation that is more CPU demanding than operations in coordinate space.
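The cluster summaries BIRCH stores can be illustrated with a small sketch. This is not the CF tree itself (BIRCH organizes these summaries hierarchically under a branching factor); it shows only the additive statistics and a flat, leader-like insertion governed by the maximum-diameter threshold described above.

```python
import copy
import numpy as np

class ClusteringFeature:
    """BIRCH-style cluster summary: point count n, linear sum ls, and sum of
    squared norms ss.  These suffice to compute centroids and diameters
    without revisiting the raw data."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), float(p @ p)

    def absorb(self, other):
        # CFs are additive: merging two clusters just adds the statistics
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def diameter(self):
        # average pairwise distance, recovered from n, ls, and ss alone
        if self.n < 2:
            return 0.0
        num = 2.0 * self.n * self.ss - 2.0 * float(self.ls @ self.ls)
        return float(np.sqrt(max(num, 0.0) / (self.n * (self.n - 1))))

def cf_insert(cfs, point, max_diameter):
    """Leader-like single-pass insertion: absorb the point into the nearest
    summary unless the merged diameter would exceed max_diameter; otherwise
    start a new summary."""
    new = ClusteringFeature(point)
    if cfs:
        nearest = min(cfs, key=lambda cf: float(
            np.linalg.norm(cf.centroid() - new.centroid())))
        trial = copy.deepcopy(nearest)
        trial.absorb(new)
        if trial.diameter() <= max_diameter:
            nearest.absorb(new)
            return cfs
    cfs.append(new)
    return cfs
```

Because the statistics are additive, the summaries can be merged and tested against the diameter threshold without any access to the original points, which is what lets the clustering proceed entirely in memory after the single scan.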