Reviews in Computational Chemistry Volume 18
Progress in Clustering Methodology 23
or k-means method). The nature of the rough-distance measure used can guarantee
that the canopies will be sufficiently broad to encompass all candidates
for the ensuing full-distance measure. These ideas to speed up nearest-neighbor
searches are similar to the earlier use of bounds on the distance
measure, as discussed by Murtagh. 27
Comparative Studies on Chemical Data Sets
Much of the use of clustering for chemical applications is based on the
similar property principle. 107 This principle, which holds in many, but certainly
not all, structure–property relationships, states that compounds with
similar structure are likely to exhibit similar properties. Clustering on the basis
of structural descriptors is thus likely to group compounds having similar
properties. However, there exist many different clustering methods, each
having its own particular characteristics that are likely to affect the composition
of the resultant clusters. Consequently, there have been several comparative
studies on the performance of different clustering methods when
applied to chemical data sets. The first such studies were conducted by
Willett and Rubin 5,108–110 in the early 1980s. These studies were highly
influential in the subsequent implementation of clustering methods in
commercial and in-house software systems used by the pharmaceutical
industry. Over 30 hierarchical and nonhierarchical methods were tested on 10
small data sets for which certain properties were known. Clustering was conducted
using 2D fingerprints as compound representations. The leave-one-out
approach (based on the similar property principle) was used to compare the
results of different clustering methods by predicting the property of each
compound (as the average of the property of the other members of the cluster)
and correlating it with the actual property. High correlations indicate that
compounds with similar properties have been clustered together. The results
indicated that the Ward hierarchical method gave the best overall performance.
However, this method was not well suited to processing large data sets because
it requires random access to the fingerprints. The Jarvis–Patrick
nonhierarchical method gave results that were almost as good and, because it does not
require the fingerprints to be in memory, it became the recommended method.
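
The leave-one-out comparison described above can be illustrated with a minimal Python sketch. It is not the code used in the original studies: the function names (`leave_one_out_predictions`, `pearson`, `loo_score`) and the toy data are hypothetical, the Pearson coefficient is assumed as the correlation measure, and compounds in singleton clusters are skipped because they have no other cluster members to average over (the source does not say how such cases were handled).

```python
from collections import defaultdict
from math import sqrt


def leave_one_out_predictions(labels, values):
    """Predict each compound's property as the mean over the OTHER members
    of its cluster. Returns (index, predicted, actual) tuples.

    Assumption of this sketch: singleton clusters are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        totals[label] += value
        counts[label] += 1

    results = []
    for i, (label, value) in enumerate(zip(labels, values)):
        n_others = counts[label] - 1
        if n_others == 0:
            continue  # no other members to average over
        predicted = (totals[label] - value) / n_others
        results.append((i, predicted, value))
    return results


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def loo_score(labels, values):
    """Correlate leave-one-out predictions with the actual property values."""
    results = leave_one_out_predictions(labels, values)
    predicted = [p for _, p, _ in results]
    actual = [a for _, _, a in results]
    return pearson(predicted, actual)


# Toy data (hypothetical): six compounds, one measured property each.
values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
good_labels = [0, 0, 0, 1, 1, 1]  # similar property values share a cluster
poor_labels = [0, 1, 0, 1, 0, 1]  # clusters mix low and high values

good_score = loo_score(good_labels, values)
poor_score = loo_score(poor_labels, values)
```

On this toy data, the clustering that groups compounds with similar property values scores higher than the one that mixes them, which is the sense in which a high correlation indicates a better clustering.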
In the early 1990s, a subsequent study by Downs, Willett, and Fisanick 46
compared the performance of the Ward and group-average agglomerative
methods, the minimum-diameter divisive hierarchical method, and the
Jarvis–Patrick nonhierarchical method when using dataprints of calculated
physicochemical properties. In this assessment, a data set was used that was
considerably larger than those used in the original studies. 108–110 The results
highlighted the poor performance of the Jarvis–Patrick method in comparison
with the hierarchical methods. The hierarchical methods all had similar levels
of performance with the minimum-diameter method being slightly better for
small numbers of clusters. Brown and Martin 20 then investigated the same