19.02.2013 Views

Reviews in Computational Chemistry Volume 18


Clustering Methods and Their Uses in Computational Chemistry

Nearest-neighbor nonhierarchical methods have received much attention in the chemical community because of their fast processing speeds and ease of implementation. The comparative studies outlined in the next section (Comparative Studies on Chemical Data Sets) led to the widespread adoption of the Jarvis–Patrick nearest-neighbor method for clustering large chemical data sets. To improve results obtained by the standard Jarvis–Patrick implementation, several extensions have been developed. The standard implementation tends to produce a few large heterogeneous clusters and an abundance of singletons, which is hardly surprising because the method was originally designed to be space distorting [34], that is, to contract sparsely populated areas and split densely populated areas. Attempts to overcome these tendencies include the use of variable-length nearest-neighbor lists [12,20], reclustering of singletons [63], and the use of fuzzy clustering [64]. For variable-length nearest-neighbor lists, the user specifies a proximity threshold so that the lists contain all neighbors that pass the threshold test rather than a fixed number of nearest neighbors. During clustering, the comparison between nearest-neighbor lists is made on the basis of a specified minimum percentage of the neighbors in the shorter list being in common. These modifications help prevent true outliers from being forced to join a cluster while preventing the arbitrary splitting of large clusters arising from the limitations imposed by fixed-length lists. When using fingerprints for clustering chemical data sets, Brown and Martin [20] showed improved results compared with the standard implementation, whereas Taraviras, Ivanciuc, and Cabrol-Bass [65] showed contrary results when clustering descriptors.
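The variable-length scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: proximity is taken to be Euclidean distance, each point is counted as a member of its own neighbor list, and two points are linked only if they appear in each other's lists (a condition carried over from the standard method). The names `variable_length_jarvis_patrick`, `radius`, and `min_overlap` are hypothetical.

```python
from itertools import combinations
from math import dist


def variable_length_jarvis_patrick(points, radius, min_overlap):
    """Cluster points using threshold-based (variable-length) neighbor lists."""
    n = len(points)
    # Neighbor list of i: every point within `radius` of i.
    # Assumption: a point is counted as its own neighbor.
    neighbors = [
        {j for j in range(n) if dist(points[i], points[j]) <= radius}
        for i in range(n)
    ]

    # Union-find over point indices, used to merge linked points.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if j not in neighbors[i]:
            continue  # only link points that appear in each other's lists
        shared = len(neighbors[i] & neighbors[j])
        shorter = min(len(neighbors[i]), len(neighbors[j]))
        # Compare against the minimum fraction of the shorter list in common.
        if shared / shorter >= min_overlap:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
clusters = variable_length_jarvis_patrick(points, radius=1.5, min_overlap=0.5)
```

In this toy data set the isolated point at (50, 50) has no neighbor within the threshold, so it stays a singleton instead of being forced into a cluster.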

The reclustering of singletons is used in the “cascaded clustering” method of Menard, Lewis, and Mason [63]. This method applies the standard Jarvis–Patrick clustering iteratively, removes the singletons, and reclusters them using less strict parameters until fewer than a specified percentage of singletons remain. The fuzzy Jarvis–Patrick method outlined by Doman et al. [64] is the most radical Jarvis–Patrick variant. In the fuzzy method, clusters in dense regions are extracted using a similarity threshold and the standard crisp method. The compounds are then assigned probabilities of belonging to each of the crisp clusters. Any previously unclustered compounds not exceeding a specified threshold probability of belonging to any of the crisp clusters are regarded as outliers and remain as singletons.
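The cascaded idea can be sketched as a loop around a standard fixed-length Jarvis–Patrick pass. The following is a rough illustration only: the text does not say how the parameters are relaxed, so the sketch assumes a user-supplied `schedule` of `(k, k_min)` pairs, Euclidean distance, and neighbor lists that exclude the point itself. The function and parameter names are hypothetical.

```python
from itertools import combinations
from math import dist


def jarvis_patrick(points, k, k_min):
    """Standard fixed-length Jarvis-Patrick; returns lists of indices."""
    n = len(points)
    lists = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        lists.append(set(others[:k]))  # the k nearest neighbors of i
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Link i and j if each is in the other's list and they share
        # at least k_min neighbors.
        mutual = j in lists[i] and i in lists[j]
        if mutual and len(lists[i] & lists[j]) >= k_min:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def cascaded_clustering(points, schedule, max_singleton_fraction):
    """Recluster the remaining singletons with each (k, k_min) in turn."""
    clusters = []
    pending = list(range(len(points)))  # indices not yet in any cluster
    for k, k_min in schedule:
        if len(pending) < max_singleton_fraction * len(points):
            break  # few enough singletons remain
        subset = [points[i] for i in pending]
        still_single = []
        for group in jarvis_patrick(subset, k, k_min):
            members = [pending[g] for g in group]  # map back to original indices
            if len(members) > 1:
                clusters.append(members)
            else:
                still_single.extend(members)
        pending = still_single
    return sorted(clusters), sorted(pending)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 0), (0, 5), (60, 60)]
clusters, singletons = cascaded_clustering(
    points, schedule=[(3, 2), (1, 0)], max_singleton_fraction=0.2
)
```

Here the first, stricter pass finds the four-point square and leaves three singletons; the second, more permissive pass joins two of them, and the distant point remains a singleton.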

Other nearest-neighbor methods include the agglomerative hierarchical method of Gowda and Krishna [66], which uses the position of nearest neighbors, rather than just the number, in a measure called the mutual neighborhood value (MNV). Given points i and j, if i is the pth neighbor of j and j is the qth neighbor of i, then the MNV is (p + q). Smaller values of MNV indicate greater similarity, and a specified threshold MNV is used to determine whether points should be merged. Dugad and Ahuja [67] extended the MNV concept to include the density of two clusters that are being considered for merger. In addition to the threshold MNV, if there exists a point k with
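The MNV definition given above (p + q) can be illustrated with a short Python sketch. It covers only the pairwise value, not the full hierarchical method or the density extension, and it assumes Euclidean distance with ties broken by index order; the function names are hypothetical.

```python
from math import dist


def neighbor_rank(points, of, relative_to):
    """Return r such that points[of] is the r-th nearest neighbor of points[relative_to]."""
    others = sorted(
        (k for k in range(len(points)) if k != relative_to),
        key=lambda k: dist(points[relative_to], points[k]),
    )
    return others.index(of) + 1  # ranks start at 1


def mutual_neighborhood_value(points, i, j):
    p = neighbor_rank(points, of=i, relative_to=j)  # i is the p-th neighbor of j
    q = neighbor_rank(points, of=j, relative_to=i)  # j is the q-th neighbor of i
    return p + q


points = [(0, 0), (1, 0), (3, 0), (7, 0)]
mnv_01 = mutual_neighborhood_value(points, 0, 1)  # mutual nearest neighbors
mnv_12 = mutual_neighborhood_value(points, 1, 2)
mnv_03 = mutual_neighborhood_value(points, 0, 3)  # the two end points
```

Two points that are each other's nearest neighbor get the smallest possible value, 2, and the value grows as the points rank lower in each other's neighbor lists.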
