Reviews in Computational Chemistry Volume 18
Reviews in Computational Chemistry Volume 18
Reviews in Computational Chemistry Volume 18
- No tags were found...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
The term<strong>in</strong>ology associated with cluster<strong>in</strong>g is extensive, with many terms<br />
used to describe the same th<strong>in</strong>g (reflect<strong>in</strong>g the separate development of cluster<strong>in</strong>g<br />
methods with<strong>in</strong> a multitude of discipl<strong>in</strong>es). Clusters can be overlapp<strong>in</strong>g or<br />
nonoverlapp<strong>in</strong>g; if a compound occurs <strong>in</strong> more than one cluster, the clusters<br />
are overlapp<strong>in</strong>g. At one extreme, each compound is a member of all clusters to<br />
a certa<strong>in</strong> degree. An example of this is fuzzy cluster<strong>in</strong>g <strong>in</strong> which the degree of<br />
membership of an <strong>in</strong>dividual compound is <strong>in</strong> the range 0 to 1, and the total<br />
membership summed across all clusters is normally required to be 1. This<br />
scheme contrasts with crisp cluster<strong>in</strong>g <strong>in</strong> which each compound’s degree of<br />
membership <strong>in</strong> any cluster is either 0 or 1. At the other extreme, is the situation<br />
where<strong>in</strong> each compound is a member of exactly one cluster, <strong>in</strong> which case the<br />
clusters are said to be nonoverlapp<strong>in</strong>g. Intermediate situations sometimes<br />
occur, where compounds can be members of several, though not of all, clusters.<br />
The majority of cluster<strong>in</strong>g methods used on chemical data sets generate<br />
crisp, nonoverlapp<strong>in</strong>g clusters, because analysis of such clusters is relatively<br />
simple.<br />
If a data set is analyzed <strong>in</strong> an iterative way, such that at each step a pair<br />
of clusters is merged or a s<strong>in</strong>gle cluster is divided, the result is hierarchical,<br />
with a parent–child relationship be<strong>in</strong>g established between clusters at each<br />
successive level of the iteration. The successive levels can be visualized us<strong>in</strong>g<br />
a dendrogram, as shown <strong>in</strong> Figure 1. Each level of the hierarchy represents a<br />
partition<strong>in</strong>g of the data set <strong>in</strong>to a set of clusters. In contrast, if the data set is<br />
analyzed to produce a s<strong>in</strong>gle partition of the compounds result<strong>in</strong>g <strong>in</strong> a set of<br />
clusters, the result is then nonhierarchical. Note that the term partition<strong>in</strong>g<br />
................................................................................................................................................<br />
8<br />
3 1 2 4 5 6 7<br />
Introduction 3<br />
Figure 1 An example of a hierarchy (dendrogram) generated from the cluster<strong>in</strong>g of eight<br />
items (shown numbered 1–8 across the bottom). The top (root) is a s<strong>in</strong>gle cluster<br />
conta<strong>in</strong><strong>in</strong>g all eight items. The vertical positions of the horizontal l<strong>in</strong>es jo<strong>in</strong><strong>in</strong>g pairs of<br />
items or cluster <strong>in</strong>dicate the relative similarities of those pairs. Items 1 and 2 are the most<br />
similar and clusters [8,3,1,2] and [4,5,6,7] are the least similar. The dotted horizontal<br />
l<strong>in</strong>e represents a s<strong>in</strong>gle partition conta<strong>in</strong><strong>in</strong>g the four clusters [8], [3,1,2], [4,5], and [6,7].