Reviews in Computational Chemistry Volume 18


Clustering Methods and Their Uses in Computational Chemistry

data sets and gives examples of their application in pharmaceutical companies. Compared to the other costs of drug discovery, clustering can add significant value at minimal cost. First, we provide an outline of clustering as a discipline and define some of the terminology. Then, we give a brief tutorial on clustering algorithms, review progress in developing the methods, and offer some example applications.

Clustering methodology has been developed and used in a variety of areas including archaeology, astronomy, biology, computer science, electronics, engineering, information science, and medicine. Good, general introductory texts on the topic of clustering include those by Sneath and Sokal, 1 Kaufman and Rousseeuw, 2 Everitt, 3 and Gordon. 4 The main text that is devoted to clustering of chemical data sets is by Willett, 5 with review articles by Bratchell, 6 Barnard and Downs, 7 and Downs and Willett. 8 The present chapter is a complement and update to the latter article. In a previous volume of this series, Lewis, Pickett, and Clark 9 reviewed the use of diversity analysis techniques in combinatorial library design.

As will be shown in the section on Chemical Applications, the current main uses of clustering for chemical data sets are to find representative subsets from high throughput screening (HTS) and combinatorial chemistry, and to increase the diversity of in-house data sets through selection of additional compounds from other data sets. Methods suitable for compound selection are the main focus of this chapter. The methods must be able to handle large data sets of high-dimensional data. For small, low-dimensional data sets, most clustering methods are applicable, and descriptions in the standard texts and implementations available in standard statistical software packages 10,11 suffice. Implementations designed for use on chemical data sets are available from most of the specialist software vendors, 12–17 the majority of which were reviewed by Warr. 18

The overall process of clustering involves the following steps:

1. Generate appropriate descriptors for each compound in the data set.
2. Select an appropriate similarity measure.
3. Use an appropriate clustering method to cluster the data set.
4. Analyze the results.

This chapter focuses on step 3. For step 1, descriptors may include property values, biological properties, topological indexes, and structural fragments. The performance of these descriptors and forms of representation has been analyzed by Brown 19 and Brown and Martin. 20,21 Similarity searching for step 2 has been discussed by Downs and Willett; 22 characteristics of various similarity measures have been discussed by Barnard, Downs, and Willett. 23,24 For step 4, little has been published specifically about visualization and analysis of results for chemical data sets. However, most publications that focus on implementing systems that utilize clustering do provide details of how the results were displayed or analyzed.
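
To make the four steps concrete, the following is a minimal sketch in Python, not a method taken from this chapter. It assumes that each compound is described by a set of "on" bits in a structural-fragment fingerprint, uses the Tanimoto coefficient as the similarity measure, and groups compounds by single-linkage clustering at a fixed similarity threshold. The compound names, the bit positions, the threshold of 0.5, and the helper names `tanimoto`, `single_link_clusters`, and `medoid` are all invented for illustration.

```python
from itertools import combinations

# Step 1 (assumed toy data): each compound is the set of "on" bit
# positions in a hypothetical structural-fragment fingerprint.
fingerprints = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 5},
    "cmpd_C": {1, 2, 3, 4, 5},
    "cmpd_D": {10, 11, 12},
    "cmpd_E": {10, 11, 13},
    "cmpd_F": {20, 21},
}


# Step 2: similarity measure. Tanimoto = |A & B| / |A | B|.
def tanimoto(a, b):
    union = len(a | b)
    # Two empty fingerprints are treated as identical by convention here.
    return len(a & b) / union if union else 1.0


# Step 3: single-linkage clustering at a similarity threshold.
# Any pair at or above the threshold is merged (union-find).
def single_link_clusters(fps, threshold):
    parent = {name: name for name in fps}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(fps, 2):
        if tanimoto(fps[a], fps[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in fps:
        groups.setdefault(find(name), []).append(name)
    # Largest clusters first, members in alphabetical order.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Step 4: a simple analysis. Pick the member most similar, in total,
# to the other members of its cluster as the cluster representative.
def medoid(members, fps):
    return max(members,
               key=lambda m: sum(tanimoto(fps[m], fps[o]) for o in members))


clusters = single_link_clusters(fingerprints, threshold=0.5)
representatives = [medoid(c, fingerprints) for c in clusters]

for members, rep in zip(clusters, representatives):
    print(f"{len(members)} member(s): {members}, representative: {rep}")
```

With this toy data and a threshold of 0.5, compounds A, B, and C fall into one cluster, D and E into a second, and F remains a singleton. The pairwise comparison in step 3 scales quadratically with the number of compounds, so a sketch of this kind is only suitable for small data sets; the large, high-dimensional data sets that are the focus of this chapter call for methods chosen with that cost in mind.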