Progress in Clustering Methodology

has time requirements of O(MN²) for M clusters and N compounds, making the method particularly suitable for finding a small number of clusters. Wang, Yan, and Sriskandarajah 55 updated the single-criterion minimum-diameter method with a multiple-criteria algorithm that considers both maximum split (intercluster separation) and minimum diameter in deciding the best bipartition. Their algorithm reduces the dissection effect (similar items forced into different clusters because doing so reduces the diameter) associated with the minimum-diameter criterion and the chaining effect associated with the maximum-split criterion.
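
To make the two criteria concrete, the sketch below, a hypothetical illustration rather than Wang, Yan, and Sriskandarajah's published algorithm, evaluates the diameter and the split of one candidate bipartition from a distance matrix; the function names and the random data are invented for the example.

```python
# Hypothetical illustration of the two bipartition criteria discussed above:
# minimum diameter (within-cluster) and maximum split (between-cluster).
import numpy as np

def diameter(dist, members):
    """Largest pairwise distance within one cluster (to be minimized)."""
    if len(members) < 2:
        return 0.0
    return dist[np.ix_(members, members)].max()

def split(dist, members_a, members_b):
    """Smallest distance between the two clusters (to be maximized)."""
    return dist[np.ix_(members_a, members_b)].min()

# Example: five items described by a symmetric Euclidean distance matrix.
pts = np.random.default_rng(0).random((5, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

a, b = [0, 1, 2], [3, 4]                      # one candidate bipartition
max_diam = max(diameter(dist, a), diameter(dist, b))
print(f"diameter = {max_diam:.3f}, split = {split(dist, a, b):.3f}")
```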

More recently, Steinbach, Karypis, and Kumar 56 reported an interesting variant of k-means that is actually a hierarchical polythetic divisive method. At each point where a cluster is to be split into two clusters, the split is determined by using k-means, hence the name “bisecting k-means.” The results for document clustering, using keywords as descriptors, are shown to be better than those of standard k-means, with cluster sizes being more uniform, and better than those of the agglomerative group-average method.
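
As a rough sketch of the bisecting strategy (assuming scikit-learn's KMeans is available; always splitting the largest cluster is only one of several selection heuristics Steinbach, Karypis, and Kumar consider):

```python
# A minimal sketch of bisecting k-means: a hierarchical divisive method in
# which each binary split is produced by an ordinary 2-means run.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    clusters = [np.arange(len(X))]        # start with one cluster of all items
    while len(clusters) < n_clusters:
        clusters.sort(key=len)
        members = clusters.pop()          # split the largest cluster
        km = KMeans(n_clusters=2, n_init=10).fit(X[members])
        for side in (0, 1):
            clusters.append(members[km.labels_ == side])
    return clusters

X = np.random.default_rng(1).random((100, 8))
for i, c in enumerate(bisecting_kmeans(X, 4)):
    print(f"cluster {i}: {len(c)} items")
```

Recent scikit-learn releases also ship a built-in BisectingKMeans estimator.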

Monothetic divisive clustering has largely been ignored, although there have been applications and development of a classification method closely related to it: recursive partitioning, a type of decision tree method. 57–60
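
The defining feature of a monothetic method is that each division is made on a single descriptor. The toy sketch below picks the most evenly splitting binary descriptor; this balance criterion is only a placeholder, since real monothetic methods (and recursive-partitioning trees, which additionally use a class label to choose the split) rely on stronger association or purity measures.

```python
# Toy illustration of a monothetic division: one binary descriptor per split.
# The "most balanced descriptor" rule here is a placeholder criterion.
import numpy as np

def monothetic_split(X_binary):
    """Pick the single column whose 0/1 split is most balanced."""
    balance = np.abs(X_binary.mean(axis=0) - 0.5)   # 0 = perfectly balanced
    j = int(balance.argmin())
    present = np.flatnonzero(X_binary[:, j] == 1)
    absent = np.flatnonzero(X_binary[:, j] == 0)
    return j, present, absent

X = np.random.default_rng(2).integers(0, 2, size=(20, 6))
feature, left, right = monothetic_split(X)
print(f"split on descriptor {feature}: {len(left)} vs {len(right)} items")
```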

Nonhierarchical algorithms that cluster the data set in a single pass, such as the leader algorithm, have had little development, except to identify appropriate ways of preordering the data set so as to get around the problem of dependency on processing order (work on this is discussed in the Chemical Applications section).
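
For reference, a minimal sketch of the leader algorithm, whose order dependence the preordering work addresses; the distance threshold of 0.35 is arbitrary:

```python
# A minimal sketch of the single-pass leader algorithm. The result depends on
# the processing order of the items, which is the weakness noted above.
import numpy as np

def leader(X, threshold):
    leaders, assignment = [], []
    for x in X:                              # one pass over the data
        for i, l in enumerate(leaders):
            if np.linalg.norm(x - l) <= threshold:
                assignment.append(i)         # join the first close-enough leader
                break
        else:
            leaders.append(x)                # otherwise become a new leader
            assignment.append(len(leaders) - 1)
    return leaders, assignment

X = np.random.default_rng(3).random((50, 2))
leaders, labels = leader(X, threshold=0.35)
print(f"{len(leaders)} clusters from one pass")
```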

For multipass algorithms, however, efforts have been made to minimize the number of passes required, in some cases reducing them to single-pass algorithms. In the area of data mining, this work has resulted in a method that does not fit neatly into the categorization used in this review. Zhang, Ramakrishnan, and Livny 61 developed a program called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), an O(N²) method that performs a single scan of the data set to sort items into a cluster features (CF) tree. This operation has some similarity with the leader algorithm; the nodes of the tree store summary information about clusters of dense points in the data so that the original data need not be accessed again during the clustering process. Clustering then proceeds on the in-memory summaries of the data. However, the initial CF-tree building requires the maximum cluster diameter to be specified beforehand, and the subsequent tree building is thus sensitive to the value chosen. Overall, the idea of BIRCH is to bring together items that should always be grouped together, with the maximum cluster diameter ensuring that the cluster summaries will all fit into available memory.
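
The summaries BIRCH stores are additive cluster feature triples (N, LS, SS), from which a centroid and radius can be recovered without revisiting the raw data. A minimal sketch of just this summary arithmetic (the threshold-driven CF-tree insertion itself is omitted):

```python
# A minimal sketch of BIRCH's cluster feature (CF) summary: the triple
# (N, LS, SS) is additive, so clusters can be merged and their centroid and
# radius computed without touching the original data points again.
import numpy as np

class CF:
    def __init__(self, point):
        self.n = 1                       # number of points summarized
        self.ls = point.copy()           # linear sum of the points
        self.ss = float(point @ point)   # sum of squared norms

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # root-mean-square distance of the summarized points from the centroid
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - c @ c, 0.0))

pts = np.random.default_rng(4).random((10, 3))
cf = CF(pts[0])
for p in pts[1:]:
    cf.merge(CF(p))
print(cf.n, cf.centroid(), cf.radius())
```

scikit-learn's Birch estimator provides a full implementation of the CF-tree stage.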

Ganti et al. 62 outlined a variant of BIRCH called BUBBLE. It does not rely on vector operations but builds up the cluster summaries on the basis of a distance function that obeys the triangle inequality, an operation that is more CPU-demanding than operations in coordinate space.
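
The metric-space trick underlying such distance-function-only methods can be illustrated with the triangle inequality: precomputed distances to a pivot object give the lower bound |d(x, p) − d(y, p)| ≤ d(x, y), which can prune many expensive distance evaluations. A hypothetical sketch (BUBBLE's actual summary structures are more elaborate):

```python
# Hypothetical illustration of triangle-inequality pruning in a metric space
# where items are accessible only through a (notionally expensive) distance call.
import numpy as np

def dist(a, b):
    # stands in for an expensive metric known only through a distance function
    return float(np.linalg.norm(a - b))

items = np.random.default_rng(5).random((200, 4))
pivot = items[0]
to_pivot = np.array([dist(x, pivot) for x in items])   # precomputed once

query, radius = items[1], 0.3
q_to_pivot = dist(query, pivot)
evaluated, neighbours = 0, []
for i, x in enumerate(items):
    # lower bound on d(query, x); if it already exceeds radius, skip the call
    if abs(to_pivot[i] - q_to_pivot) > radius:
        continue
    evaluated += 1
    if dist(query, x) <= radius:
        neighbours.append(i)
print(f"{len(neighbours)} neighbours found with {evaluated} of {len(items)} distance calls")
```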
