19.02.2013 Views

Reviews in Computational Chemistry Volume 18


overlap between the libraries. Two very dissimilar libraries formed two distinct clusters with little overlap, whereas two very similar libraries showed no distinction.

The use of mixture-model or density-based clustering has not yet been reported for processing chemical data sets. An interesting application of these techniques is their use to group the compound descriptors so as to obtain a set of orthogonal descriptors. Up to this point, the clustering that we have discussed has been applied to the patterns (fingerprints or dataprints) characterizing each compound; this is the “Q-mode clustering” referred to by Sneath and Sokal.¹ One can also cluster the features (the descriptors used in the fingerprints or dataprints) to highlight groups of similar descriptors. Sneath and Sokal call this “R-mode clustering.” The similar property principle, upon which structure–property relationships depend, assumes that the compound descriptors are independent of each other. Reducing the number of descriptors can thus help in subsequent Q-mode clustering by reducing the dimensionality. Clustering the descriptors, so that a subset of orthogonal descriptors can be extracted, is an alternative to factor analysis and principal components analysis. Using an orthogonal subset of descriptors has the benefit that the result is a set of individual descriptors rather than composite descriptors. Taraviras, Ivanciuc, and Cabrol-Bass⁶⁵ applied the single-link, group-average, complete-link, and Ward hierarchical methods, along with Jarvis–Patrick, variable-length Jarvis–Patrick, and k-means nonhierarchical methods to a set of 240 topological indices in an attempt to reveal any “natural” clusters of the descriptors. Descriptors that were found to exist in the same clusters across all seven methods were regarded as being strongly clustered. Reducing the number of methods that needed to be in agreement revealed progressively weaker clusters. Overall, it was found that the strategy of using multiple clustering methods for R-mode clustering could be used to provide representative sets of orthogonal descriptors for use in QSAR analysis.
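The two ideas in this paragraph, clustering the descriptors rather than the compounds and keeping only the groupings on which several methods agree, can be illustrated with a minimal sketch. It is not the procedure of Taraviras, Ivanciuc, and Cabrol-Bass. The function names (`pearson`, `r_mode_single_link`, `consensus_groups`), the use of absolute Pearson correlation as the descriptor similarity, and the toy data are all assumptions made for illustration, and the descriptors are assumed to be non-constant.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def _components(n, linked_pairs):
    """Connected components over items 0..n-1, found with union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in linked_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def r_mode_single_link(columns, min_abs_r):
    """R-mode clustering: group descriptors (columns), not compounds.

    Two descriptors are linked when |r| >= min_abs_r. The connected
    components of that graph are the single-link clusters at that level.
    """
    n = len(columns)
    links = [(i, j) for i, j in combinations(range(n), 2)
             if abs(pearson(columns[i], columns[j])) >= min_abs_r]
    return _components(n, links)


def consensus_groups(partitions, min_agree):
    """Group descriptors placed together by at least min_agree methods.

    Each partition is a list of cluster labels, one label per descriptor.
    When min_agree equals the number of partitions the groups are exact;
    for smaller values the pairwise links are chained transitively, which
    is a simplifying assumption of this sketch.
    """
    n = len(partitions[0])
    links = [(i, j) for i, j in combinations(range(n), 2)
             if sum(p[i] == p[j] for p in partitions) >= min_agree]
    return _components(n, links)


# Toy data: three descriptors measured on five compounds (hypothetical values).
descriptors = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10.5],   # nearly collinear with the first descriptor
    [5, 1, 4, 2, 3],      # weakly correlated with the other two
]
clusters = r_mode_single_link(descriptors, min_abs_r=0.9)

# Toy labels for four descriptors from three hypothetical clustering methods.
labels = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 2]]
strong = consensus_groups(labels, min_agree=3)
weak = consensus_groups(labels, min_agree=1)
```

Taking one descriptor from each group then gives a candidate subset of individual, roughly orthogonal descriptors. Lowering `min_agree` merges groups, which corresponds to the progressively weaker clusters described above.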

CONCLUSIONS

Clustering methodology has been developed over many decades. The application of clustering to chemical data sets began in the 1980s, coinciding with the increasing size of in-house compound collections having their information contained in structural databases and with advances made by the information retrieval community to analyze large document collections. In the 1990s the advent of high-throughput screening, combinatorial libraries, and commercially available external chemical inventories placed a greater emphasis on rational compound selection. The demands of clustering data sets of several million compounds with high-dimensional representations led to the widespread adoption of a few inherently efficient and optimally implemented methods, namely, the Jarvis–Patrick, Ward, and k-means methods.
