Reviews in Computational Chemistry Volume 18
Clustering Methods and Their Uses in Computational Chemistry
data sets and gives examples of their application in pharmaceutical companies.
Compared to the other costs of drug discovery, clustering can add significant
value at minimal cost. First, we provide an outline of clustering as a discipline
and define some of the terminology. Then, we give a brief tutorial on clustering
algorithms, review progress in developing the methods, and offer some
example applications.
Clustering methodology has been developed and used in a variety of
areas including archaeology, astronomy, biology, computer science, electronics,
engineering, information science, and medicine. Good, general introductory
texts on the topic of clustering include those by Sneath and Sokal, 1
Kaufmann and Rousseeuw, 2 Everitt, 3 and Gordon. 4 The main text that is
devoted to clustering of chemical data sets is by Willett, 5 with review articles
by Bratchell, 6 Barnard and Downs, 7 and Downs and Willett. 8 The present
chapter is a complement and update to the latter article. In a previous volume
of this series, Lewis, Pickett, and Clark 9 reviewed the use of diversity analysis
techniques in combinatorial library design.
As will be shown in the section on Chemical Applications, the current
main uses of clustering for chemical data sets are to find representative subsets
from high throughput screening (HTS) and combinatorial chemistry, and to
increase the diversity of in-house data sets through selection of additional
compounds from other data sets. Methods suitable for compound selection
are the main focus of this chapter. The methods must be able to handle large
data sets of high-dimensional data. For small, low-dimensional data sets, most
clustering methods are applicable, and descriptions in the standard texts and
implementations available in standard statistical software packages 10,11
suffice. Implementations designed for use on chemical data sets are available
from most of the specialist software vendors, 12–17 the majority of which were
reviewed by Warr. 18
The overall process of clustering involves the following steps:
1. Generate appropriate descriptors for each compound in the data set.
2. Select an appropriate similarity measure.
3. Use an appropriate clustering method to cluster the data set.
4. Analyze the results.
This chapter focuses on step 3. For step 1, descriptors may include property
values, biological properties, topological indexes, and structural fragments.
The performance of these descriptors and forms of representation have been
analyzed by Brown 19 and Brown and Martin. 20,21 Similarity searching for
step 2 has been discussed by Downs and Willett; 22 characteristics of various
similarity measures have been discussed by Barnard, Downs, and Willett. 23,24
For step 4, little has been published specifically about visualization and analysis
of results for chemical data sets. However, most publications that focus on
implementing systems that utilize clustering do provide details of how the
results were displayed or analyzed.
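
The four steps listed above can be illustrated with a minimal sketch. It is not taken from the chapter: the compound names and fragment codes are hypothetical, the Tanimoto coefficient is assumed as the similarity measure for step 2, and a simple single-pass "leader" algorithm stands in for the clustering method of step 3 purely because it is short. The 0.5 threshold is likewise an arbitrary choice for illustration.

```python
from typing import Dict, FrozenSet, List

# Step 1: descriptors. Each compound is described by a set of
# structural-fragment codes (hypothetical names and fragments).
compounds: Dict[str, FrozenSet[str]] = {
    "cpd_A": frozenset({"C=O", "c1ccccc1", "N-H"}),
    "cpd_B": frozenset({"C=O", "c1ccccc1", "O-H"}),
    "cpd_C": frozenset({"C=O", "c1ccccc1", "N-H", "Cl"}),
    "cpd_D": frozenset({"S=O", "C#N"}),
    "cpd_E": frozenset({"S=O", "C#N", "F"}),
}


# Step 2: similarity measure (Tanimoto coefficient on fragment sets,
# an assumed choice): shared fragments / fragments in either compound.
def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 1.0


# Step 3: clustering method. A single-pass leader algorithm, used here
# only as a compact stand-in: each compound joins the first cluster
# whose leader (first member) is similar enough, else starts a new one.
def leader_cluster(
    data: Dict[str, FrozenSet[str]], threshold: float
) -> List[List[str]]:
    clusters: List[List[str]] = []
    for name, fragments in data.items():
        for cluster in clusters:
            if tanimoto(fragments, data[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Step 4: analyze the results. Here we only list each cluster, using
# its leader as a representative compound.
clusters = leader_cluster(compounds, threshold=0.5)
for index, members in enumerate(clusters, start=1):
    print(f"cluster {index}: representative={members[0]}, members={members}")
```

Taking one representative per cluster, as in the last step, is one simple way to obtain the kind of representative subset discussed earlier. Note that the leader algorithm depends on the order in which compounds are presented, which is one reason to treat this only as a sketch.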