
Reviews in Computational Chemistry Volume 18


CHAPTER 1

Clustering Methods and Their Uses in Computational Chemistry

Geoff M. Downs and John M. Barnard

Barnard Chemical Information Ltd., 46 Uppergate Road, Stannington, Sheffield S6 6BX, United Kingdom

Reviews in Computational Chemistry, Volume 18

Edited by Kenny B. Lipkowitz and Donald B. Boyd

Copyright © 2002 John Wiley & Sons, Inc.

ISBN: 0-471-21576-7

INTRODUCTION

Clustering is a data analysis technique that, when applied to a set of heterogeneous items, identifies homogeneous subgroups as defined by a given model or measure of similarity. Of the many uses of clustering, a prime motivation for the increasing interest in clustering methods is their use in the selection and design of combinatorial libraries of chemical structures pertinent to pharmaceutical discovery.
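As a rough illustration of what "homogeneous subgroups as defined by a measure of similarity" can mean in practice, the following minimal sketch groups items by a similarity threshold. Everything in it is an assumption made for illustration rather than something taken from this passage: the Tanimoto coefficient is used as the similarity measure, a simple leader-style assignment is used as the grouping rule, and the `fingerprints` data are hypothetical feature sets standing in for structural descriptors.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' feature indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(items, threshold):
    """Assign each item to the first cluster whose leader is similar enough.

    Returns a list of clusters, each a list of item indices; the first
    index in each cluster is that cluster's leader.
    """
    clusters = []
    for i, item in enumerate(items):
        for cluster in clusters:
            leader = items[cluster[0]]
            if tanimoto(leader, item) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing leader is similar enough: start a new cluster.
            clusters.append([i])
    return clusters


# Hypothetical feature sets (illustrative only, not real fingerprints).
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 13},
    {1, 2, 4, 5},
]

clusters = leader_cluster(fingerprints, threshold=0.5)
print(clusters)  # [[0, 1, 4], [2, 3]]
```

Note that no group labels are supplied: the grouping emerges only from the chosen similarity measure and threshold, and a different measure or threshold would generally give different subgroups.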

One feature of clustering is that the process is unsupervised, that is, there is no predefined grouping that the clustering seeks to reproduce. In contrast to supervised learning, where the task is to establish relationships between given inputs and outputs to enable prediction of the output from new inputs, in unsupervised learning only the inputs are available and the task is to reveal aspects of the underlying distribution of the input data. Clustering is thus complemented by the related supervised process of classification, in which items are assigned labels applied to predefined groups: examples include recursive partitioning, naïve Bayesian analysis, and K nearest-neighbor selection. Clustering is a technique for exploratory data analysis and is used increasingly in preliminary analyses of large data sets of medium and high dimensionality as a method of selection, diversity analysis, and data reduction. This chapter reviews the main clustering methods that are used for analyzing chemical
