Reviews in Computational Chemistry Volume 18
Clustering Methods and Their Uses in Computational Chemistry
found useful in the detection of false positives, especially from combinatorial
libraries. In these cases, the structural similarity between the hits was low and
their biological activity was subsequently attributed to a common side product.
Clustering was performed by Stanton 127 using BCUT (Burden–CAS–University of Texas)
descriptors, 128 with the optimum hierarchy level determined
visually from the dendrogram. Visual selection was possible because
the hit sets typically contained only a few hundred compounds.
The most significant application of a nonhierarchical single-pass method
was for screening antitumor activity at the National Cancer Institute. A variant
of the leader algorithm was developed 129 in which the descriptors were
weighted by occurrence in each compound, size of the fragment, and
frequency of occurrence in the data set. Because these descriptors were
weighted, an asymmetric coefficient 129 was used to determine similarity,
rather than the more usual Tanimoto coefficient. The data set was then
ordered by increasing sum of fragment weights to remove the order dependency
associated with the leader algorithm (or at least, to have a reasonable
basis for choosing a particular order) and to enable the use of heuristics to
reduce the number of similarity calculations. Each compound was then assigned
to every existing cluster for which it exceeded the given similarity threshold,
thus creating overlapping clusters. The algorithm was implemented on parallel
hardware, 105 and the results from clustering several data sets were presented
with a discussion of the large number of singleton clusters produced. 130
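The overlapping-cluster behavior of such a leader variant can be illustrated with a minimal sketch. It is not the National Cancer Institute implementation: the asymmetric coefficient and the fragment weighting scheme are not specified here, so the sketch assumes unit fragment weights (the sum of weights is then just the fragment count) and substitutes the ordinary Tanimoto coefficient. The names `tanimoto`, `leader_overlapping`, and the toy fragment sets are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fragment identifiers."""
    if not a and not b:
        return 1.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def leader_overlapping(fps, threshold):
    """Single-pass leader clustering that allows overlapping clusters.

    fps: dict mapping compound id -> set of fragment ids.
    Returns a dict mapping each leader id -> list of member ids (leader first).
    """
    # Assumption: unit fragment weights, so ordering by "sum of fragment
    # weights" reduces to ordering by fragment count (ties broken by id).
    order = sorted(fps, key=lambda cid: (len(fps[cid]), cid))
    clusters = {}
    for cid in order:
        # Every existing leader that is similar enough claims the compound.
        hits = [lead for lead in clusters
                if tanimoto(fps[cid], fps[lead]) >= threshold]
        if hits:
            for lead in hits:
                clusters[lead].append(cid)
        else:
            # No leader is close enough: the compound starts a new cluster.
            clusters[cid] = [cid]
    return clusters


# Toy data: "c" is equally similar to leaders "a" and "b".
example = {
    "a": {1, 2},
    "b": {3, 4},
    "c": {1, 2, 3, 4},
    "d": {7, 8, 9},
}
result = leader_overlapping(example, threshold=0.5)
```

In this toy run, compound `c` lands in both the `a` and `b` clusters, while `d` remains a singleton, which mirrors the overlapping and singleton clusters discussed above.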
Another variant on the leader algorithm was proposed by Butina. 131 In his
approach, the compounds are first sorted by decreasing number of near neighbors
(within a specified threshold similarity), thus again removing the order
dependence of the basic algorithm. Of course, identifying the number of
near neighbors for each compound introduces an O(N^2) step, which in turn
negates the single-pass algorithm's primary advantage of linear speed.
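A rough sketch of this neighbor-count ordering follows. It is written from the description above rather than from the published procedure, so details such as tie-breaking and the rule that a centroid absorbs all of its still-unassigned neighbors are assumptions, as are the names `butina_like` and `tanimoto` and the toy fragment sets. The pairwise loop makes the O(N^2) cost explicit.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fragment identifiers."""
    if not a and not b:
        return 1.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def butina_like(fps, threshold):
    """Cluster by visiting compounds in order of decreasing neighbor count.

    fps: dict mapping compound id -> set of fragment ids.
    Returns a list of clusters; each cluster lists its centroid first.
    """
    ids = sorted(fps)
    neighbors = {cid: set() for cid in ids}
    # O(N^2) step: every pair is compared once to build the neighbor lists.
    for i, j in combinations(ids, 2):
        if tanimoto(fps[i], fps[j]) >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    # Most-connected compounds are considered first (ties broken by id).
    order = sorted(ids, key=lambda cid: (-len(neighbors[cid]), cid))
    assigned = set()
    clusters = []
    for cid in order:
        if cid in assigned:
            continue
        # Assumption: the centroid takes all of its unassigned neighbors.
        members = [cid] + sorted(neighbors[cid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


example = {
    "a": {1, 2, 3},
    "b": {1, 2, 3, 4},
    "c": {1, 2, 4},
    "d": {7, 8},
    "e": {7, 8, 9},
}
result = butina_like(example, threshold=0.6)
```

Because the visiting order depends only on neighbor counts, shuffling the input no longer changes the clusters, at the price of the quadratic neighbor search.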
At Rohm and Haas Company, Reynolds, Drucker, and Pfahler 132 developed
a two-pass method similar to the initial assignment stage of k-means. In
the first pass, a similarity threshold is specified, and then the sphere exclusion
diverse subset selection method 80 is used to select the cluster seeds (referred to
as probes). In the second pass, all other compounds are assigned to the most
similar probe (the published version unnecessarily performs this in two stages).
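The two passes can be sketched as follows. This is an illustrative outline under stated assumptions, not the published Rohm and Haas procedure: the probe at each step is simply the first remaining compound in sorted order, similarity is the Tanimoto coefficient on fragment sets, and ties in the second pass go to the earliest-selected probe. The names `sphere_exclusion_cluster` and `tanimoto` and the toy data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fragment identifiers."""
    if not a and not b:
        return 1.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def sphere_exclusion_cluster(fps, threshold):
    """Two-pass clustering: sphere exclusion seeds, then nearest-probe assignment.

    fps: dict mapping compound id -> set of fragment ids.
    Returns a dict mapping each probe id -> list of member ids (probe first).
    """
    # Pass 1: pick a probe, then exclude everything inside its "sphere"
    # (similarity >= threshold) from further consideration as a probe.
    candidates = sorted(fps)
    probes = []
    while candidates:
        probe = candidates[0]  # assumption: first remaining compound
        probes.append(probe)
        candidates = [c for c in candidates[1:]
                      if tanimoto(fps[c], fps[probe]) < threshold]

    # Pass 2: assign every non-probe compound to its most similar probe.
    clusters = {p: [p] for p in probes}
    for cid in sorted(fps):
        if cid in clusters:
            continue
        # max() keeps the first probe on ties, i.e. the earliest-selected one.
        best = max(probes, key=lambda p: tanimoto(fps[cid], fps[p]))
        clusters[best].append(cid)
    return clusters


example = {
    "a": {1, 2, 3},
    "b": {1, 2, 3, 4},
    "c": {1, 2, 4},
    "d": {7, 8},
    "e": {7, 8, 9},
}
result = sphere_exclusion_cluster(example, threshold=0.6)
```

As in the initial assignment stage of k-means, no refinement follows: the probes are never updated after the compounds have been assigned.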
Clark and Langton 133 adopted a similar methodology in the Tripos OptiSim fast
clustering system for selecting diverse yet representative subsets. OptiSim
works by selecting an initial seed at random, drawing a random sample of
size K, choosing from that sample the member most dissimilar
to the existing seeds, and adding it to the seed set if it satisfies the
minimum similarity threshold, R, with respect to all existing seeds. This operation
continues until the specified number of seeds, M, has been selected or no
more candidates remain. All other compounds are then assigned to their nearest
seed (which is equivalent to the initial assignment stage of k-means, with
no refinement). OptiSim is an obvious amalgam of the MaxMin and sphere