Reviews in Computational Chemistry Volume 18
CHAPTER 1
Clustering Methods and Their Uses in Computational Chemistry
Geoff M. Downs and John M. Barnard
Barnard Chemical Information Ltd., 46 Uppergate Road,
Stannington, Sheffield S6 6BX, United Kingdom
Reviews in Computational Chemistry, Volume 18
Edited by Kenny B. Lipkowitz and Donald B. Boyd
Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-21576-7
INTRODUCTION
Clustering is a data analysis technique that, when applied to a set of
heterogeneous items, identifies homogeneous subgroups as defined by a given
model or measure of similarity. Of the many uses of clustering, a prime motivation
for the increasing interest in clustering methods is their use in the selection
and design of combinatorial libraries of chemical structures pertinent to
pharmaceutical discovery.
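The idea of subgroups defined by a measure of similarity can be made concrete with a minimal sketch. It assumes each item is described by a set of features (the feature names here are invented for illustration), uses the Jaccard (Tanimoto) coefficient as the similarity measure, and groups items with a deliberately simple threshold rule. The function names `jaccard` and `threshold_groups` are hypothetical, and the grouping rule is only an illustration, not a method attributed to this chapter.

```python
from typing import FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard (Tanimoto) similarity: shared features divided by all features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def threshold_groups(items: List[FrozenSet[str]], threshold: float) -> List[List[int]]:
    """Put each item in the first group whose first member is at least
    `threshold` similar to it; otherwise start a new group."""
    groups: List[List[int]] = []
    for i, item in enumerate(items):
        for group in groups:
            if jaccard(item, items[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough, so this item starts one.
            groups.append([i])
    return groups


# Hypothetical feature sets standing in for four chemical structures.
items = [
    frozenset({"ring", "OH", "Cl"}),
    frozenset({"ring", "OH"}),
    frozenset({"chain", "NH2"}),
    frozenset({"chain", "NH2", "COOH"}),
]

groups = threshold_groups(items, threshold=0.5)
print(groups)  # [[0, 1], [2, 3]]
```

No group labels are supplied: the two subgroups emerge only from the chosen similarity measure and threshold, so a different measure or threshold could give a different grouping.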
One feature of clustering is that the process is unsupervised, that is, there
is no predefined grouping that the clustering seeks to reproduce. In contrast to
supervised learning, where the task is to establish relationships between given
inputs and outputs to enable prediction of the output from new inputs, in
unsupervised learning only the inputs are available and the task is to reveal
aspects of the underlying distribution of the input data. Clustering is thus complemented
by the related supervised process of classification, in which items
are assigned labels applied to predefined groups: examples include recursive
partitioning, naïve Bayesian analysis, and K nearest-neighbor selection. Clustering
is a technique for exploratory data analysis and is used increasingly in
preliminary analyses of large data sets of medium and high dimensionality as a
method of selection, diversity analysis, and data reduction. This chapter
reviews the main clustering methods that are used for analyzing chemical