Reviews in Computational Chemistry Volume 18
Introduction
is, it is a top-down approach. If, at each split, only one descriptor is used to
determine how the cluster is split, the method is monothetic; otherwise, more
descriptors (typically all available) are used, and the method is polythetic.
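
The distinction between monothetic and polythetic splitting can be illustrated with a minimal Python sketch. The function names, the choice of the highest-variance descriptor with a median cut for the monothetic case, and the use of the two most distant compounds as seeds for the polythetic case are all illustrative assumptions, not methods prescribed by the text.

```python
from statistics import median, pvariance


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def monothetic_split(data):
    """Split one cluster using a single descriptor.

    Assumption: the descriptor with the largest variance is chosen and the
    cluster is cut at its median value. Returns (descriptor index, left, right).
    """
    n_desc = len(data[0])
    best = max(range(n_desc), key=lambda j: pvariance([row[j] for row in data]))
    cut = median([row[best] for row in data])
    left = [i for i, row in enumerate(data) if row[best] <= cut]
    right = [i for i, row in enumerate(data) if row[best] > cut]
    return best, left, right


def polythetic_split(data):
    """Split one cluster using all descriptors at once.

    Assumption: the two most distant compounds act as seeds, and every
    compound joins the nearer seed (distance computed over all descriptors).
    """
    n = len(data)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    s1, s2 = max(pairs, key=lambda p: sq_dist(data[p[0]], data[p[1]]))
    left, right = [], []
    for i in range(n):
        nearer_s1 = sq_dist(data[i], data[s1]) <= sq_dist(data[i], data[s2])
        (left if nearer_s1 else right).append(i)
    return left, right
```

Both functions return indices into the input list, so repeated application to the resulting subsets would build a divisive hierarchy. The monothetic version can return an empty right-hand subset when the chosen descriptor is constant, which a fuller implementation would need to handle.
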
Nonhierarchical methods encompass a wide range of different techniques
to build clusters. A single-pass method is one in which the partition
is created by a single pass through the data set or, if randomly accessed, in
which each compound is examined only once to decide which cluster it should
be assigned to. A relocation method is one in which compounds are moved
from one cluster to another to try to improve on the initial estimation of the
clusters. The relocating is typically accomplished based on improving a cost
function describing the “goodness” of each resultant cluster. The nearest-neighbor
approach is more compound centered than are the other nonhierarchical
methods. In it, the environment around each compound is examined
in terms of its most similar neighboring compounds, with commonality
between nearest neighbors being used as a criterion for cluster formation. In
mixture model clustering the data are assumed to exist as a mixture of densities
that are usually assumed to be Gaussian (normal) distributions, since their
densities are not known in advance. Solutions to the mixture model are
derived iteratively in a manner similar to the relocation methods. Topographic
methods, such as use of Kohonen maps, typically apply a variable cost function
with the added restriction that topographic relationships are preserved so
that neighboring clusters are close in descriptor space. Other nonhierarchical
methods include density-based and probabilistic methods. Density-based, or
mode-seeking, methods regard the distribution of descriptors across the data
set as generating patterns of high and low density that, when identified, can be
used to separate the compounds into clusters. Probabilistic clustering generates
nonoverlapping clusters in which a compound is assigned a probability, in the
range 0 to 1, that it belongs to the chosen cluster (in contrast to fuzzy clustering
in which the clusters are overlapping and the degree of membership is not
a probability).
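
To make the contrast between single-pass and relocation methods concrete, the following Python sketch may help. It assumes a leader-style single pass with a fixed distance threshold and a k-means-style relocation whose cost function is the sum of squared distances to cluster centroids; the function names, the threshold, and the choice of cost function are illustrative assumptions rather than the specific algorithms discussed in this chapter.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_pass(data, threshold):
    """Leader-style single pass: each compound is examined exactly once.

    A compound joins the first cluster whose leader lies within `threshold`;
    otherwise it becomes the leader of a new cluster.
    """
    leaders, labels = [], []
    for row in data:
        for k, leader in enumerate(leaders):
            if sq_dist(row, leader) <= threshold ** 2:
                labels.append(k)
                break
        else:
            leaders.append(row)
            labels.append(len(leaders) - 1)
    return labels


def centroids(data, labels, n_clusters):
    """Mean descriptor vector of each cluster (None for an empty cluster)."""
    centers = []
    for k in range(n_clusters):
        members = [row for row, lab in zip(data, labels) if lab == k]
        if members:
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centers.append(None)
    return centers


def cost(data, labels):
    """Sum of squared distances from each compound to its cluster centroid."""
    centers = centroids(data, labels, max(labels) + 1)
    return sum(sq_dist(row, centers[lab]) for row, lab in zip(data, labels))


def relocate(data, labels, max_iter=100):
    """Move compounds to the nearest centroid until no assignment changes."""
    labels = list(labels)
    n_clusters = max(labels) + 1
    for _ in range(max_iter):
        centers = centroids(data, labels, n_clusters)
        live = [k for k in range(n_clusters) if centers[k] is not None]
        new_labels = [
            min(live, key=lambda k: sq_dist(row, centers[k])) for row in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

In this sketch the single-pass result depends on the order in which compounds are presented, whereas the relocation step takes that partition as its initial estimate and reassigns compounds only while doing so changes the partition.
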

Having now provided a broad overview of clustering methodology, we
next focus on the “classical” methods, which include hierarchical and single-pass,
relocation, and nearest-neighbor nonhierarchical techniques. The classification
we have described in Figure 2 is one that is commonly used by many
scientists; however, it is just one of many possible classifications. Another way
to differentiate between clustering techniques is to consider parametric and
nonparametric methods. Parametric methods require that distance-based comparisons
be made. Here access to the descriptors is required (typically given as
Euclidean vectors), rather than just a proximity matrix derived from the
descriptors. Parametric methods can be further organized into generative
and reconstructive methods. Generative methods, including mixture model,
density-based, and probabilistic techniques, try to match parameters (e.g.,
cluster centers, variances within and between clusters, and mixing coefficients
for the descriptor distributions) to the distribution of descriptors within the
data set. Reconstructive methods, such as relocation and topographic, are