19.02.2013 Views

Reviews in Computational Chemistry Volume 18


Introduction

is, it is a top-down approach. If, at each split, only one descriptor is used to determine how the cluster is split, the method is monothetic; otherwise, more descriptors (typically all available) are used, and the method is polythetic.
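To make the monothetic/polythetic distinction concrete, the following is a minimal Python sketch rather than the chapter's own algorithm. It assumes binary descriptors stored as tuples. The function names, the choice of Hamming distance, and the use of two seed compounds to drive the polythetic split are all illustrative assumptions that do not come from the text.

```python
from typing import Dict, Tuple

# Assumption: each compound is described by a tuple of binary descriptors.
Fingerprint = Tuple[int, ...]
Cluster = Dict[str, Fingerprint]


def monothetic_split(cluster: Cluster, bit: int) -> Tuple[Cluster, Cluster]:
    """Split on ONE descriptor: compounds with the bit set vs. not set."""
    on = {name: fp for name, fp in cluster.items() if fp[bit] == 1}
    off = {name: fp for name, fp in cluster.items() if fp[bit] == 0}
    return on, off


def hamming(a: Fingerprint, b: Fingerprint) -> int:
    """Number of descriptor positions in which two compounds differ."""
    return sum(x != y for x, y in zip(a, b))


def polythetic_split(cluster: Cluster, seed_a: str, seed_b: str) -> Tuple[Cluster, Cluster]:
    """Split using ALL descriptors: each compound joins the nearer seed.

    The two seeds are a hypothetical device for this sketch; ties go to seed_a.
    """
    near_a: Cluster = {}
    near_b: Cluster = {}
    for name, fp in cluster.items():
        d_a = hamming(fp, cluster[seed_a])
        d_b = hamming(fp, cluster[seed_b])
        (near_a if d_a <= d_b else near_b)[name] = fp
    return near_a, near_b


if __name__ == "__main__":
    compounds = {
        "c1": (1, 1, 0, 0),
        "c2": (1, 1, 1, 0),
        "c3": (0, 0, 1, 1),
        "c4": (1, 0, 1, 1),
    }
    print(monothetic_split(compounds, bit=0))
    print(polythetic_split(compounds, "c1", "c3"))
```

With this toy data the two splits disagree about c4: the single descriptor at position 0 groups it with c1 and c2, whereas the comparison over all four descriptors places it with c3.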

Nonhierarchical methods encompass a wide range of different techniques to build clusters. A single-pass method is one in which the partition is created by a single pass through the data set or, if randomly accessed, in which each compound is examined only once to decide which cluster it should be assigned to. A relocation method is one in which compounds are moved from one cluster to another to try to improve on the initial estimation of the clusters. The relocating is typically accomplished based on improving a cost function describing the “goodness” of each resultant cluster. The nearest-neighbor approach is more compound centered than are the other nonhierarchical methods. In it, the environment around each compound is examined in terms of its most similar neighboring compounds, with commonality between nearest neighbors being used as a criterion for cluster formation. In mixture model clustering the data are assumed to exist as a mixture of densities that are usually assumed to be Gaussian (normal) distributions, since their densities are not known in advance. Solutions to the mixture model are derived iteratively in a manner similar to the relocation methods. Topographic methods, such as use of Kohonen maps, typically apply a variable cost function with the added restriction that topographic relationships are preserved so that neighboring clusters are close in descriptor space. Other nonhierarchical methods include density-based and probabilistic methods. Density-based, or mode-seeking, methods regard the distribution of descriptors across the data set as generating patterns of high and low density that, when identified, can be used to separate the compounds into clusters. Probabilistic clustering generates nonoverlapping clusters in which a compound is assigned a probability, in the range 0 to 1, that it belongs to the chosen cluster (in contrast to fuzzy clustering in which the clusters are overlapping and the degree of membership is not a probability).
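The single-pass and relocation ideas described above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the chapter's implementation: the single pass is written as a leader-style assignment with a hypothetical distance `threshold`, the relocation step is a k-means-style reassignment to the nearest centroid, and the cost function is taken to be the sum of squared Euclidean distances to cluster centroids. None of these specific choices is prescribed by the text.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroids(points: List[Point], labels: List[int]) -> Dict[int, List[float]]:
    """Mean descriptor vector of every non-empty cluster."""
    result = {}
    for c in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == c]
        result[c] = [sum(col) / len(members) for col in zip(*members)]
    return result


def single_pass(points: List[Point], threshold: float) -> List[int]:
    """Leader-style single pass: every compound is examined exactly once.

    A compound joins the first cluster whose leader lies within `threshold`
    (a hypothetical parameter); otherwise it becomes a new leader.
    """
    leaders: List[Point] = []
    labels: List[int] = []
    for p in points:
        for idx, leader in enumerate(leaders):
            if distance(p, leader) <= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(p)
            labels.append(len(leaders) - 1)
    return labels


def cost(points: List[Point], labels: List[int]) -> float:
    """Assumed 'goodness' measure: sum of squared distances to centroids."""
    centers = centroids(points, labels)
    return sum(distance(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def relocate(points: List[Point], labels: List[int], max_iter: int = 100) -> List[int]:
    """Move compounds to the nearest centroid until no assignment changes."""
    labels = list(labels)
    for _ in range(max_iter):
        centers = centroids(points, labels)
        new_labels = [
            min(centers, key=lambda c: distance(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    data = [(0.0,), (4.0,), (1.0,), (5.0,), (3.0,)]
    initial = single_pass(data, threshold=3.5)
    refined = relocate(data, initial)
    print(initial, cost(data, initial))
    print(refined, cost(data, refined))
```

In this toy run the single pass places the last compound with the first leader it matches, even though another leader is closer. The relocation step then moves it, which lowers the cost. This shows why a single-pass partition depends on the order of the data and why relocation is described as improving on an initial estimate.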

Having now provided a broad overview of clustering methodology, we next focus on the “classical” methods, which include hierarchical and single-pass, relocation, and nearest-neighbor nonhierarchical techniques. The classification we have described in Figure 2 is one that is commonly used by many scientists; however, it is just one of many possible classifications. Another way to differentiate between clustering techniques is to consider parametric and nonparametric methods. Parametric methods require that distance-based comparisons be made. Here access to the descriptors is required (typically given as Euclidean vectors), rather than just a proximity matrix derived from the descriptors. Parametric methods can be further organized into generative and reconstructive methods. Generative methods, including mixture model, density-based, and probabilistic techniques, try to match parameters (e.g., cluster centers, variances within and between clusters, and mixing coefficients for the descriptor distributions) to the distribution of descriptors within the data set. Reconstructive methods, such as relocation and topographic, are
