19.02.2013 Views

Reviews in Computational Chemistry Volume 18


Progress in Clustering Methodology

or k-means method). The nature of the rough-distance measure used can guarantee that the canopies will be sufficiently broad to encompass all candidates for the ensuing full-distance measure. These ideas to speed up nearest-neighbor searches are similar to the earlier use of bounds on the distance measure, as discussed by Murtagh [27].
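
The bounding idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: fingerprints are modeled as Python sets of "on" bit positions, the full measure is taken to be the Tanimoto coefficient, and the cheap bound is the ratio of bit counts, T(a, b) <= min(|a|, |b|) / max(|a|, |b|). None of these choices comes from the text above, and the function names are hypothetical. It shows bound-based pruning of a nearest-neighbor search, not the canopy method itself.

```python
def tanimoto(a, b):
    """Full similarity: shared bits divided by bits set in either fingerprint."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 1.0


def tanimoto_upper_bound(size_a, size_b):
    """Cheap bound that needs only the two bit counts."""
    if size_a == 0 and size_b == 0:
        return 1.0
    return min(size_a, size_b) / max(size_a, size_b)


def nearest_neighbor(query, fingerprints):
    """Return (index, similarity, number of full comparisons performed)."""
    best_index, best_sim, evaluated = -1, -1.0, 0
    for index, fp in enumerate(fingerprints):
        # The bound never underestimates the true similarity, so a candidate
        # whose bound cannot beat the current best is safely skipped.
        if tanimoto_upper_bound(len(query), len(fp)) <= best_sim:
            continue
        evaluated += 1
        sim = tanimoto(query, fp)
        if sim > best_sim:
            best_index, best_sim = index, sim
    return best_index, best_sim, evaluated
```

Because the bound is never smaller than the true similarity, the pruned search returns the same nearest neighbor as an exhaustive one while evaluating the full measure less often.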

Comparative Studies on Chemical Data Sets

Much of the use of clustering for chemical applications is based on the similar property principle [107]. This principle, which holds in many, but certainly not all, structure–property relationships, states that compounds with similar structure are likely to exhibit similar properties. Clustering on the basis of structural descriptors is thus likely to group compounds having similar properties. However, there exist many different clustering methods, each having its own particular characteristics that are likely to affect the composition of the resultant clusters. Consequently, there have been several comparative studies on the performance of different clustering methods when applied to chemical data sets. The first such studies were conducted by Willett and Rubin [5,108–110] in the early 1980s. These studies were highly influential in the subsequent implementation of clustering methods in commercial and in-house software systems used by the pharmaceutical industry. Over 30 hierarchical and nonhierarchical methods were tested on 10 small data sets for which certain properties were known. Clustering was conducted using 2D fingerprints as compound representations. The leave-one-out approach (based on the similar property principle) was used to compare the results of different clustering methods by predicting the property of each compound (as the average of the property of the other members of the cluster) and correlating it with the actual property. High correlations indicate that compounds with similar properties have been clustered together. The results indicated that the Ward hierarchical method gave the best overall performance. However, this method was not well suited to processing large data sets because it requires random access to the fingerprints. The Jarvis–Patrick nonhierarchical method gave results that were almost as good and, because it does not require the fingerprints to be held in memory, it became the recommended method.
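
The leave-one-out comparison described above can be sketched as follows. This is a minimal illustration, not the procedure as implemented in the cited studies: cluster assignments are assumed to arrive as a list of labels, the correlation is assumed to be a Pearson coefficient, and singleton clusters are skipped because they have no other members to average. The function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def leave_one_out_predictions(labels, values):
    """Predict each compound's property as the mean over the other members of its cluster."""
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    predicted, observed = [], []
    for indices in members.values():
        if len(indices) < 2:
            continue  # assumption: a singleton has no neighbors, so it is skipped
        total = sum(values[i] for i in indices)
        for i in indices:
            # Mean of the cluster with compound i left out.
            predicted.append((total - values[i]) / (len(indices) - 1))
            observed.append(values[i])
    return predicted, observed


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def clustering_score(labels, values):
    """Correlation between leave-one-out predictions and the actual property values."""
    predicted, observed = leave_one_out_predictions(labels, values)
    return pearson(predicted, observed)
```

Calling `clustering_score` with the label lists produced by two clustering methods on the same property values gives one number per method; the method with the higher correlation has grouped compounds with similar properties more consistently.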

In the early 1990s, a subsequent study by Downs, Willett, and Fisanick [46] compared the performance of the Ward and group-average agglomerative methods, the minimum-diameter divisive hierarchical method, and the Jarvis–Patrick nonhierarchical method when using dataprints of calculated physicochemical properties. This assessment used a data set considerably larger than those used in the original studies [108–110]. The results highlighted the poor performance of the Jarvis–Patrick method in comparison with the hierarchical methods. The hierarchical methods all had similar levels of performance, with the minimum-diameter method being slightly better for small numbers of clusters. Brown and Martin [20] then investigated the same
