19.02.2013 Views

Reviews in Computational Chemistry Volume 18

Reviews in Computational Chemistry Volume 18

Reviews in Computational Chemistry Volume 18

SHOW MORE
SHOW LESS
  • No tags were found...

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

14 Cluster<strong>in</strong>g Methods and Their Uses <strong>in</strong> <strong>Computational</strong> <strong>Chemistry</strong><br />

PROGRESS IN CLUSTERING METHODOLOGY<br />

The representations used for chemical compounds are typically ‘‘datapr<strong>in</strong>ts’’<br />

(tens or hundreds of real number descriptors, such as topological<br />

<strong>in</strong>dexes and physicochemical properties) or f<strong>in</strong>gerpr<strong>in</strong>ts (thousands of b<strong>in</strong>ary<br />

descriptors <strong>in</strong>dicat<strong>in</strong>g the presence or absence of 2D structural fragments or<br />

3D pharmacophores). These numbers can be compared to the tens or hundreds<br />

of descriptors typically encountered <strong>in</strong> data m<strong>in</strong><strong>in</strong>g and the thousands of<br />

descriptors encountered <strong>in</strong> <strong>in</strong>formation retrieval. We now outl<strong>in</strong>e the development<br />

of cluster<strong>in</strong>g methods that are suited to handl<strong>in</strong>g these representations<br />

and that have been, or <strong>in</strong> the near future are likely to be, used for chemical<br />

applications. Specific examples of chemical applications are given later <strong>in</strong><br />

the section entitled Chemical Applications.<br />

Algorithm Developments<br />

Hav<strong>in</strong>g briefly outl<strong>in</strong>ed the basic algorithms that are implemented <strong>in</strong><br />

many of the standard cluster<strong>in</strong>g methods, we now set the algorithms <strong>in</strong> context<br />

by review<strong>in</strong>g their historical development, discuss the characteristics of each<br />

method, and then highlight some of the variants that have been developed<br />

for overcom<strong>in</strong>g certa<strong>in</strong> limitations. Cluster<strong>in</strong>g is now such a large area of<br />

research and everyday use that this chapter must be selective rather than comprehensive<br />

<strong>in</strong> scope. The <strong>in</strong>terested reader can access further details from<br />

the references cited throughout this chapter and from the recent review by<br />

Murtagh. 39<br />

Most of the development of hierarchical cluster<strong>in</strong>g methods occurred<br />

from the 1960s through the mid-1980s, after which there was a period of consolidation,<br />

with little new development until recently. From this developmental<br />

period, two key publications were those of Lance and Williams 29 <strong>in</strong> 1967<br />

and the review of hierarchical cluster<strong>in</strong>g methods by Murtagh 27 <strong>in</strong> 1985.<br />

Follow<strong>in</strong>g this developmental period, a few variations have been proposed.<br />

Matula 40 developed algorithms that implemented both divisive and agglomerative<br />

average-l<strong>in</strong>kage methods, but with high computational costs for process<strong>in</strong>g<br />

large data sets. That same year, Ja<strong>in</strong>, Indrayan, and Goel 41 compared<br />

s<strong>in</strong>gle and complete l<strong>in</strong>kage, group and weighted average, centroid, and median<br />

agglomerative methods and concluded that complete l<strong>in</strong>kage performed<br />

best <strong>in</strong> fail<strong>in</strong>g to f<strong>in</strong>d clusters from random data. Podani 42 produced a useful<br />

classification of agglomerative methods, <strong>in</strong> which the standard Lance–Williams<br />

update recurrence formula 29 is split <strong>in</strong>to two formulas. He also <strong>in</strong>troduced<br />

three new parameter variations, that is, three new agglomerative<br />

methods were def<strong>in</strong>ed, but these seem to represent more of an <strong>in</strong>clusion for<br />

the sake of completeness than a significant alternative to previously def<strong>in</strong>ed<br />

parameter variations. Roux 43 recognized the complexity problems <strong>in</strong> Matula’s<br />

implementations 40 and mentioned restrictions that could be applied to

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!