Reviews in Computational Chemistry Volume 18
Clustering Methods and Their Uses in Computational Chemistry

selected, a final single pass over the data set assigns each item to its nearest medoid.
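That final pass can be sketched in a few lines: for each item, compute its distance to every medoid and record the index of the closest one. Euclidean distance and the function name here are illustrative assumptions; the actual descriptor space and metric depend on the application.

```python
import math

def assign_to_medoids(items, medoids):
    """Single pass over the data set: label each item with the index
    of its nearest medoid. Euclidean distance is an assumed metric."""
    return [
        min(range(len(medoids)),
            key=lambda m: math.dist(item, medoids[m]))
        for item in items
    ]

items = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.8, 5.0)]
medoids = [(0.1, 0.0), (5.0, 5.0)]
print(assign_to_medoids(items, medoids))  # -> [0, 0, 1, 1]
```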
Graph-theoretic algorithms have seen little use in chemical applications. The basis of these methods is some form of graph in which the vertices are the items in the data set and the edges are the proximities between them. Early methods created clusters by removing edges from a minimum spanning tree or by constructing a Gabriel graph, a relative neighborhood graph, or a Delaunay triangulation, but none of these graph-theoretic methods is suitable for high dimensions. Reviews of these methods are given by Jain and Dubes28 and
Matula.94 Recent advances in computational biology have spurred development of novel graph-theoretic algorithms based on isolating areas called cliques or ''almost cliques'' (i.e., highly connected subgraphs) from the graph of all pairwise similarities. Examples include the algorithms by Ben-Dor, Shamir, and Yakhini,95 Hartuv et al.,96 and Sharan and Shamir97 that find clusters in gene expression data. Jonyer, Holder, and Cook98 developed a hierarchical graph-theoretic method that begins with the graph of all pairwise similarities and then iteratively finds subgraphs that maximally compress the graph. The time consumption of these graph-theoretic methods is currently too great to apply to very large data sets.
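The minimum-spanning-tree approach mentioned above can be sketched as follows: build an MST over the complete proximity graph, then cut the k - 1 longest edges so that the surviving connected components form k clusters. Euclidean distance, Prim's algorithm, and the function names are illustrative choices here, not any specific published implementation.

```python
import math
from heapq import heappush, heappop

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph: returns the
    minimum-spanning-tree edges as (weight, i, j) tuples."""
    n = len(points)
    in_tree, edges, heap = {0}, [], []
    for j in range(1, n):
        heappush(heap, (math.dist(points[0], points[j]), 0, j))
    while len(in_tree) < n:
        w, i, j = heappop(heap)
        if j in in_tree:
            continue                      # stale entry; j already connected
        in_tree.add(j)
        edges.append((w, i, j))
        for k in range(n):
            if k not in in_tree:
                heappush(heap, (math.dist(points[j], points[k]), j, k))
    return edges

def mst_clusters(points, k):
    """Cut the k - 1 longest MST edges; the remaining connected
    components (found by union-find) are the k clusters."""
    kept = sorted(mst_edges(points))[:len(points) - k]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(points))]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]     # labels 0..k-1 in order seen

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(mst_clusters(points, 2))  # -> [0, 0, 1, 1]
```

Cutting the longest MST edges yields the same partition as single-link hierarchical clustering, which is one reason these early graph-theoretic methods share single-link's weaknesses.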
One way to speed up the clustering process is to implement algorithms on parallel hardware. In the 1980s Murtagh27,99 outlined a parallel version of the RNN algorithm for hierarchical agglomerative clustering. Also in that decade, Rasmussen, Downs, and Willett45,100 published research on parallel implementations of Jarvis–Patrick, single-link, and Ward clustering for both document and chemical data sets, and Li and Fang101 developed parallel algorithms for k-means and single-link clustering. In 1990, Li102 published a review of parallel algorithms for hierarchical clustering. This in turn elicited a classic riposte from Murtagh103 to the effect that the parallel algorithms were no better than the more recent O(N²) serial algorithms. Olson104 presented O(N) and O(N log N) algorithms for hierarchical methods using N processors. For chemical applications, in-house parallel implementations include the leader algorithm at the National Cancer Institute105 and k-means at Eli Lilly79 (both discussed in the section on Chemical Applications), and commercially available parallel implementations include the highly optimized implementation of Jarvis–Patrick by Daylight14 and the multiprocessor version of the Ward and group-average methods by Barnard Chemical Information.12
Another way of speeding up clustering calculations is to use a quick and rough calculation of distance to assess an initial separation of items and then to apply the more CPU-expensive, full-distance calculation on only those items that were found to cluster using the rough calculation. McCallum, Nigam, and Ungar106 exploited this idea by using the rough calculation to divide the data into canopies (roughly overlapping clusters). Only items within the same canopy, or canopies, were used in the subsequent full-distance calculations to determine nonoverlapping clusters (using, e.g., a hierarchical agglomerative, EM,