19.02.2013 Views

Reviews in Computational Chemistry Volume 18


Clustering Methods and Their Uses in Computational Chemistry

found useful in the detection of false positives, especially from combinatorial libraries. In these cases, the structural similarity between the hits was low and their biological activity was subsequently attributed to a common side product. Clustering was performed by Stanton¹²⁷ using BCUT (Burden–CAS–University of Texas) descriptors,¹²⁸ with the optimum hierarchy level determined visually from the dendrogram. Visual selection was possible because the hit sets were typically a few hundred compounds.

The most significant application of a nonhierarchical single-pass method was for screening antitumor activity at the National Cancer Institute. A variant of the leader algorithm was developed¹²⁹ in which the descriptors were weighted by occurrence in each compound, size of the fragment, and frequency of occurrence in the data set. Because of the use of these weighted descriptors, an asymmetric coefficient¹²⁹ was used to determine similarity, rather than the more usual Tanimoto coefficient. The data set was then ordered by the increasing sum of fragment weights to remove the order dependency associated with the leader algorithm (or at least, to have a reasonable basis for choosing a particular order) and to enable the use of heuristics to reduce the number of similarity calculations. Compounds were then assigned to any existing cluster for which they exceeded the given similarity threshold, thus creating overlapping clusters. The algorithm was implemented on parallel hardware,¹⁰⁵ and the results from clustering several data sets were presented with a discussion on the large number of singleton clusters produced.¹³⁰
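
The overlapping leader-style assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published implementation: the weighted descriptors and the asymmetric coefficient are not specified here, so the sketch substitutes unweighted fragment sets and the ordinary Tanimoto coefficient, and it orders compounds by fragment count as a stand-in for the sum of fragment weights. The names `tanimoto` and `leader_overlapping` are hypothetical.

```python
from typing import AbstractSet, Dict, List


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient on fragment sets (a stand-in for the
    asymmetric coefficient used in the original work)."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_overlapping(
    fps: Dict[str, AbstractSet[int]], threshold: float
) -> Dict[str, List[str]]:
    """Single pass over the compounds; returns {leader: members}."""
    # Assumption: fragment count replaces the sum of fragment weights
    # as the ordering key (ties broken by name for determinism).
    order = sorted(fps, key=lambda name: (len(fps[name]), name))
    clusters: Dict[str, List[str]] = {}
    for name in order:
        hits = [
            leader
            for leader in clusters
            if tanimoto(fps[name], fps[leader]) >= threshold
        ]
        if hits:
            # Join every matching cluster, so clusters may overlap.
            for leader in hits:
                clusters[leader].append(name)
        else:
            # No leader is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


fps = {
    "a": {1, 2, 3},
    "b": {4, 5, 6},
    "c": {1, 2, 3, 4, 5, 6},
    "d": {7, 8, 9, 10},
}
clusters = leader_overlapping(fps, threshold=0.5)
```

In this toy data set, compound `c` meets the threshold against both leaders `a` and `b` and so appears in two clusters, while `d` remains a singleton, which mirrors the singleton behavior discussed in the text.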

Another variant on the leader algorithm was proposed by Butina.¹³¹ In his approach, the compounds are first sorted by decreasing number of near neighbors (within a specified threshold similarity), thus again removing the order dependence of the basic algorithm. Of course, identifying the number of near neighbors for each compound introduces an O(N²) step, which in turn negates the single-pass algorithm’s primary advantage of linear speed.
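
A minimal sketch of this neighbor-count ordering follows, mainly to make the quadratic step visible. The text only describes the sort; the cluster-forming step shown here (each unassigned compound, taken in sorted order, claims its still-unassigned neighbors) is an assumption about how the sorted list is consumed. Fingerprints are modeled as plain fragment sets compared with the Tanimoto coefficient, and the name `butina_style` is hypothetical.

```python
from itertools import combinations
from typing import AbstractSet, Dict, List, Set


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient on fragment sets."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def butina_style(
    fps: Dict[str, AbstractSet[int]], threshold: float
) -> List[List[str]]:
    """Return clusters as lists whose first element is the centroid."""
    # O(N^2) step: every pair is compared to build the neighbor lists.
    neighbors: Dict[str, Set[str]] = {name: set() for name in fps}
    for x, y in combinations(fps, 2):
        if tanimoto(fps[x], fps[y]) >= threshold:
            neighbors[x].add(y)
            neighbors[y].add(x)

    # Sort by decreasing neighbor count (ties broken by name).
    order = sorted(fps, key=lambda name: (-len(neighbors[name]), name))

    assigned: Set[str] = set()
    clusters: List[List[str]] = []
    for name in order:
        if name in assigned:
            continue
        # Assumed step: the centroid claims its unassigned neighbors.
        members = [name] + sorted(neighbors[name] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


fps = {
    "h": {1, 2, 3, 4},
    "x": {1, 2, 3, 4, 5},
    "y": {1, 2, 3, 4, 6},
    "z": {7, 8},
}
clusters = butina_style(fps, threshold=0.7)
```

Here `h` has two near neighbors and is processed first, so `x` and `y` join its cluster even though they are not near neighbors of each other at this threshold; `z` has no neighbors and ends up alone.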

At Rohm and Haas Company, Reynolds, Drucker, and Pfahler¹³² developed a two-pass method similar to the initial assignment stage of k-means. In the first pass, a similarity threshold is specified, and then the sphere exclusion diverse subset selection method⁸⁰ is used to select the cluster seeds (referred to as probes). In the second pass, all other compounds are assigned to the most similar probe (the published version unnecessarily performs this in two stages).
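
The two passes can be sketched as below. This is an illustrative reading rather than the published procedure: it assumes fragment-set fingerprints with Tanimoto similarity, and it assumes that sphere exclusion considers candidates in input order (the selection order is not specified here). The function names `select_probes` and `two_pass_cluster` are hypothetical.

```python
from typing import AbstractSet, Dict, List


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient on fragment sets."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_probes(
    fps: Dict[str, AbstractSet[int]], threshold: float
) -> List[str]:
    """Pass 1: sphere exclusion. A compound becomes a probe only if it
    lies outside the exclusion sphere of every probe chosen so far."""
    probes: List[str] = []
    for name in fps:  # assumption: candidates are taken in input order
        if all(tanimoto(fps[name], fps[p]) < threshold for p in probes):
            probes.append(name)
    return probes


def two_pass_cluster(
    fps: Dict[str, AbstractSet[int]], threshold: float
) -> Dict[str, List[str]]:
    """Pass 2: assign every compound to its most similar probe."""
    probes = select_probes(fps, threshold)
    clusters: Dict[str, List[str]] = {p: [] for p in probes}
    for name in fps:
        best = max(probes, key=lambda p: tanimoto(fps[name], fps[p]))
        clusters[best].append(name)
    return clusters


fps = {
    "p1": {1, 2, 3, 4},
    "m": {2, 3, 4, 5},
    "p2": {2, 3, 4, 5, 6},
}
clusters = two_pass_cluster(fps, threshold=0.55)
```

In this example `m` is excluded from probe selection by `p1` (similarity 0.6), yet in the second pass it is assigned to `p2` (similarity 0.8), which shows why the assignment pass is kept separate from seed selection.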

Clark and Langton¹³³ adopted a similar methodology in the Tripos OptiSim fast clustering system for selecting diverse yet representative subsets. OptiSim works by selecting an initial seed at random, selecting a random sample of size K, analyzing the random sample by choosing the most dissimilar member of the sample from existing seeds, and, if the minimum similarity threshold, R, to all existing seeds is exceeded, adding it to the seed set. This operation continues until the specified number of seeds, M, has been selected or no more candidates remain. All other compounds are then assigned to their nearest seed (which is equivalent to the initial assignment stage of k-means, with no refinement). OptiSim is an obvious amalgam of the MaxMin and sphere
