19.02.2013 Views

Reviews in Computational Chemistry Volume 18



Progress in Clustering Methodology

The use of a fixed model in a clustering method favors retrieval of clusters of certain shapes (as exemplified by the hyperspherical clusters retrieved by centroid-based methods). An alternative is to use a density-based approach, in which a cluster is formed from a region of higher density than its surrounding area. The clustering is then based on local criteria, and it can pick out clusters of any shape and internal distribution. Such approaches are typically not applicable directly to high dimensions, but progress is being made in that direction within the data mining community. An example is the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) method of Ester et al. 87 that was subsequently extended by Ankerst et al. 88 to give the OPTICS (Ordering Points To Identify the Clustering Structure) method. These two methods work on the principle that each point of a cluster must have at least a given number of other points within a specified radius. Points fulfilling these conditions are clustered; any remaining points are considered to be outliers, that is, noise. The OPTICS method has been enhanced by Breunig et al. 89 to identify outliers, and by Breunig, Kriegel, and Sander, 90 who combined it with BIRCH 61 to increase speed.
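
The density principle described above can be illustrated with a minimal sketch. It is not the published DBSCAN implementation: the function name `dbscan`, the parameters `eps` and `min_pts`, and the brute-force neighbor search are assumptions made for illustration. Following the wording above, `min_pts` counts other points within the radius (some implementations count the point itself as well). Points that meet the density condition grow a cluster; points that merely lie within the radius of such a point are attached to it, and everything else is labeled as noise.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    Illustrative assumption: a point is dense ("core") when at least
    `min_pts` OTHER points lie within distance `eps` of it.
    """
    n = len(points)
    # Brute-force neighbor lists, excluding the point itself.
    neighbors = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n  # None means "not yet assigned"
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if is_core[q]:
                        # Only core points extend the cluster further.
                        frontier.append(q)
        cluster += 1

    # Anything never reached from a core point is treated as noise.
    return [-1 if lab is None else lab for lab in labels]
```

Because membership depends only on local neighborhoods, nothing in this sketch assumes a particular cluster shape, which is the contrast with centroid-based methods drawn in the text.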

Other density-based approaches designed for high dimensions include CLIQUE (Clustering In QUEst) by Agrawal et al., 91 and PROCLUS (PROjected CLUSters) by Aggarwal et al. 92 These two methods recognize that high dimensional spaces are typically sparse, so that the similarity between two points is determined by a few dimensions, with the other dimensions being irrelevant. Clusters are thus formed by similarity with respect to subspaces rather than the full dimensional space. In the CLIQUE algorithm, dense regions of data space are identified by cell-based partitioning and are then used as initial bases for forming the clusters. The algorithm works from lower to higher dimensional subspaces by starting from cells identified as dense in (k − 1)-dimensional subspace and extending them into k-dimensional subspace. The result is a set of overlapping dense regions that are extracted as the clusters. Research into improving grid-based methods is continuing, as demonstrated by the variable grid method of Nagesh. 93 In contrast, the PROCLUS program generates nonoverlapping clusters by identifying potential cluster centers (medoids) using a MaxMin subset selection procedure. The best medoids are selected from the initial set by an iterative procedure in which data items within the locality of a medoid (i.e., within the distance from that medoid to its nearest other medoid) are assigned to that cluster.
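
As a rough illustration of the MaxMin idea mentioned above, the following sketch uses a common greedy reading of it: each new candidate is the point whose distance to its nearest already-chosen point is largest. The function name `maxmin_select`, the `start` parameter, and the use of Euclidean distance are assumptions for illustration; the text does not specify how PROCLUS seeds or scores the selection.

```python
from math import dist


def maxmin_select(points, k, start=0):
    """Greedily pick up to k well-separated point indices.

    Hypothetical helper: `start` is an arbitrary seed index, and
    Euclidean distance is assumed for simplicity.
    """
    n = len(points)
    chosen = [start]
    # min_d[i] = distance from point i to its nearest chosen point.
    min_d = [dist(p, points[start]) for p in points]
    while len(chosen) < min(k, n):
        remaining = [i for i in range(n) if i not in chosen]
        # Take the point that is farthest from everything chosen so far.
        nxt = max(remaining, key=lambda i: min_d[i])
        chosen.append(nxt)
        # Update nearest-chosen distances with the new pick.
        min_d = [min(d, dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return chosen
```

The selected indices would serve only as candidate medoids; the iterative refinement described in the text then decides which of them to keep.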

Rather than using all dimensions, the dimensions associated with each cluster are used in the Manhattan segmental distance 92 to calculate the distance of an item from the cluster. The Manhattan segmental distance is a normalized form of the Manhattan distance (the sum is divided by the number of dimensions used) that enables comparison of different clusters with varying numbers of dimensions. (The Manhattan, or city-block, distance is the sum of absolute differences between descriptor values, and it coincides with the Hamming distance for binary descriptors; in contrast, the Euclidean distance is the square root of the sum of squared differences between descriptor values.) Once the best medoids have been
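
The Manhattan segmental distance discussed in this passage can be written in a few lines. This is a minimal sketch under the usual definition, in which the Manhattan distance over a chosen set of dimensions is divided by the number of those dimensions; the function name `manhattan_segmental` and the argument `dims` (the indices of the dimensions associated with a cluster) are illustrative names, not taken from the source.

```python
def manhattan_segmental(x, y, dims):
    """Mean absolute difference between x and y over the dimensions in `dims`.

    `dims` is an assumed representation: a collection of the dimension
    indices associated with one cluster.
    """
    if not dims:
        raise ValueError("dims must contain at least one dimension index")
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)
```

Dividing by `len(dims)` is what makes distances comparable between a cluster defined on two dimensions and one defined on, say, five; without it, clusters with more dimensions would tend to report larger distances.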
