19.02.2013 Views

Reviews in Computational Chemistry Volume 18


The terminology associated with clustering is extensive, with many terms used to describe the same thing (reflecting the separate development of clustering methods within a multitude of disciplines). Clusters can be overlapping or nonoverlapping; if a compound occurs in more than one cluster, the clusters are overlapping. At one extreme, each compound is a member of all clusters to a certain degree. An example of this is fuzzy clustering, in which the degree of membership of an individual compound is in the range 0 to 1, and the total membership summed across all clusters is normally required to be 1. This scheme contrasts with crisp clustering, in which each compound's degree of membership in any cluster is either 0 or 1. At the other extreme is the situation in which each compound is a member of exactly one cluster, in which case the clusters are said to be nonoverlapping. Intermediate situations sometimes occur, where compounds can be members of several, though not of all, clusters. The majority of clustering methods used on chemical data sets generate crisp, nonoverlapping clusters, because analysis of such clusters is relatively simple.
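The distinction between fuzzy, crisp, and nonoverlapping memberships can be sketched with a small membership matrix. This is a minimal illustration, not a method from the text: the membership values, the function names, and the "highest membership wins" rule used to convert fuzzy memberships to crisp ones are all assumptions made for the example.

```python
# Rows are compounds, columns are clusters. All values are hypothetical.
fuzzy = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]


def is_valid_fuzzy(memberships, tol=1e-9):
    """Check that every degree lies in [0, 1] and every row sums to 1."""
    return all(
        all(0.0 <= m <= 1.0 for m in row) and abs(sum(row) - 1.0) <= tol
        for row in memberships
    )


def is_crisp(memberships):
    """Crisp: every degree of membership is exactly 0 or 1."""
    return all(m in (0, 1) for row in memberships for m in row)


def is_nonoverlapping(memberships):
    """Crisp and nonoverlapping: each compound is in exactly one cluster."""
    return is_crisp(memberships) and all(sum(row) == 1 for row in memberships)


def harden(memberships):
    """Assign each compound to its highest-membership cluster.

    This conversion rule is an assumption chosen for illustration.
    """
    crisp = []
    for row in memberships:
        best = max(range(len(row)), key=row.__getitem__)
        crisp.append([1 if j == best else 0 for j in range(len(row))])
    return crisp


crisp = harden(fuzzy)

# A crisp but overlapping assignment: the first compound is in two clusters.
overlapping = [[1, 1, 0], [0, 0, 1]]
```

Here `fuzzy` satisfies the sum-to-1 requirement, `crisp` is both crisp and nonoverlapping, and `overlapping` is crisp without being nonoverlapping, which corresponds to the intermediate situation described above.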

If a data set is analyzed in an iterative way, such that at each step a pair of clusters is merged or a single cluster is divided, the result is hierarchical, with a parent–child relationship being established between clusters at each successive level of the iteration. The successive levels can be visualized using a dendrogram, as shown in Figure 1. Each level of the hierarchy represents a partitioning of the data set into a set of clusters. In contrast, if the data set is analyzed to produce a single partition of the compounds resulting in a set of clusters, the result is then nonhierarchical. Note that the term partitioning

[Figure 1: dendrogram image not reproduced; leaf labels as extracted, left to right: 8, 3, 1, 2, 4, 5, 6, 7]

Figure 1 An example of a hierarchy (dendrogram) generated from the clustering of eight items (shown numbered 1–8 across the bottom). The top (root) is a single cluster containing all eight items. The vertical positions of the horizontal lines joining pairs of items or clusters indicate the relative similarities of those pairs. Items 1 and 2 are the most similar, and clusters [8,3,1,2] and [4,5,6,7] are the least similar. The dotted horizontal line represents a single partition containing the four clusters [8], [3,1,2], [4,5], and [6,7].
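The relationship between a hierarchy and a single partition taken from it can be sketched in code. The tree below mirrors the grouping described in the Figure 1 caption, but the merge heights, the cut threshold, and the function names are hypothetical: the figure gives only relative positions, so the numbers were chosen simply to be consistent with the caption (items 1 and 2 merge lowest, the root merges highest).

```python
# A leaf is an int (item label). An internal node is a tuple
# (height, left, right), where a greater height means the two
# children are less similar. All heights are assumed values.
tree = (
    6.0,
    (5.0, 8, (3.0, 3, (1.0, 1, 2))),
    (4.0, (2.0, 4, 5), (2.5, 6, 7)),
)


def leaves(node):
    """Return the item labels under a node, left to right."""
    if isinstance(node, int):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut(node, threshold):
    """Cut the hierarchy with a horizontal line at `threshold`.

    Any subtree whose merge height is at or below the line stays
    together as one cluster; anything merged above it is split.
    """
    if isinstance(node, int) or node[0] <= threshold:
        return [leaves(node)]
    _, left, right = node
    return cut(left, threshold) + cut(right, threshold)


# Analogue of the dotted line in Figure 1 (threshold is an assumption).
partition = cut(tree, 3.5)
```

With these assumed heights, `partition` contains the four clusters named in the caption. Raising the threshold to the root height returns a single cluster of all eight items, and lowering it below the lowest merge returns eight singletons, so each threshold corresponds to one level of the hierarchy.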
