Reviews in Computational Chemistry Volume 18

Progress in Clustering Methodology

or k-means method). The nature of the rough-distance measure used can guarantee that the canopies will be sufficiently broad to encompass all candidates for the ensuing full-distance measure. These ideas to speed up nearest-neighbor searches are similar to the earlier use of bounds on the distance measure, as discussed by Murtagh [27].

Comparative Studies on Chemical Data Sets

Much of the use of clustering for chemical applications is based on the similar property principle [107]. This principle, which holds in many, but certainly not all, structure–property relationships, states that compounds with similar structure are likely to exhibit similar properties. Clustering on the basis of structural descriptors is thus likely to group compounds having similar properties. However, there exist many different clustering methods, each having its own particular characteristics that are likely to affect the composition of the resultant clusters. Consequently, there have been several comparative studies on the performance of different clustering methods when applied to chemical data sets.

The first such studies were conducted by Willett and Rubin [5,108–110] in the early 1980s. These studies were highly influential in the subsequent implementation of clustering methods in commercial and in-house software systems used by the pharmaceutical industry. Over 30 hierarchical and nonhierarchical methods were tested on 10 small data sets for which certain properties were known. Clustering was conducted using 2D fingerprints as compound representations. The leave-one-out approach (based on the similar property principle) was used to compare the results of different clustering methods by predicting the property of each compound (as the average of the property of the other members of the cluster) and correlating it with the actual property. High correlations indicate that compounds with similar properties have been clustered together.
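The leave-one-out assessment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure as implemented in the original studies: the function names, the toy property values, and the decision to skip compounds in singleton clusters (which have no other members to average over) are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def leave_one_out_predictions(labels, values):
    """Pair each compound's actual property with the mean property of the
    other members of its cluster.

    Assumption: compounds in singleton clusters are skipped, because there
    are no other members from which to form a prediction.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        totals[label] += value
        counts[label] += 1

    pairs = []
    for label, value in zip(labels, values):
        if counts[label] < 2:
            continue  # singleton cluster: no prediction possible
        predicted = (totals[label] - value) / (counts[label] - 1)
        pairs.append((value, predicted))
    return pairs


def pearson(pairs):
    """Pearson correlation between actual and predicted values."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Toy data (hypothetical): seven compounds with one measured property.
values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.0]
good_labels = [0, 0, 0, 1, 1, 1, 2]   # groups compounds with similar values
poor_labels = [0, 1, 0, 1, 0, 1, 2]   # mixes low and high values

r_good = pearson(leave_one_out_predictions(good_labels, values))
r_poor = pearson(leave_one_out_predictions(poor_labels, values))
print(f"good clustering r = {r_good:.3f}, poor clustering r = {r_poor:.3f}")
```

A clustering that groups compounds with similar property values yields a correlation close to 1, whereas one that mixes dissimilar compounds yields a much lower (here negative) correlation, which is the basis on which the methods were ranked.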
The results indicated that the Ward hierarchical method gave the best overall performance. However, this method was not well suited to processing large data sets because it requires random access to the fingerprints. The Jarvis–Patrick nonhierarchical method gave results that were almost as good and, because it does not require the fingerprints to be in memory, it became the recommended method.

In the early 1990s, a subsequent study by Downs, Willett, and Fisanick [46] compared the performance of the Ward and group-average agglomerative methods, the minimum-diameter divisive hierarchical method, and the Jarvis–Patrick nonhierarchical method when using dataprints of calculated physicochemical properties. The data set used in this assessment was considerably larger than those used in the original studies [108–110]. The results highlighted the poor performance of the Jarvis–Patrick method in comparison with the hierarchical methods. The hierarchical methods all had similar levels of performance, with the minimum-diameter method being slightly better for small numbers of clusters.

Brown and Martin [20] then investigated the same clustering methods to compare their performance for compound selection, using various 2D and 3D fingerprints. Active/inactive data was available for the compounds in the data sets used, so assessment was based on the degree to which clustering separated active from inactive compounds (into nonsingleton clusters). Although the Jarvis–Patrick method was the fastest of the methods, it again gave the poorest results. The results were improved slightly by using a variant of the Jarvis–Patrick method that uses variable-length rather than fixed-length nearest-neighbor lists [12]. Overall, the Ward method gave the best and most consistent results. The group-average and minimum-diameter methods were broadly similar and only slightly worse in performance than the Ward method.

The influence of the studies summarized above can be seen in the methods subsequently implemented by many other researchers for their applications (see the section on Chemical Applications).

One method that was included in the original assessment studies, but not in the later assessments, is k-means. This method did not perform particularly well on the small data sets of the original studies, and the resultant clusters were found to be very dependent on the choice of initial seeds; hence it was not included in the subsequent studies. However, k-means is computationally efficient enough to be of use for very large data sets. Indeed, over the last decade k-means and its variants have been studied extensively and developed for use in other disciplines. Because it is being increasingly used for chemical applications, any future comparisons of clustering methods should include k-means.

How Many Clusters?

A problem associated with the k-means, expectation maximization, and hierarchical methods involves deciding how many “natural” (intuitively obvious) clusters exist in a given data set.
Determining the number of “natural” clusters is one of the most difficult problems in clustering and to date no general solution has been identified. An early contribution from Jain and Dubes [28] discussed the issue of clustering tendency, whereby the data set is analyzed first to determine whether it is distributed uniformly. Note that randomly distributed data is not generally uniform, and, because of this, most clustering methods will isolate clusters in random data. To avoid this problem, Lawson and Jurs [111] devised a variation of the Hopkins’ statistic that indicates the degree to which a data set contains clusters. McFarland and Gans [112] proposed a method for evaluating the statistical significance of individual clusters by comparing the within-cluster variance with the within-group variance of every other possible subset of the data set with the same number of members. However, for large heterogeneous chemical data sets it can be assumed that the data is not uniformly or randomly distributed, and so the issue becomes one of identifying the most natural clusters.

Nonhierarchical methods such as k-means and EM need to be initialized with k seeds. This presupposes that k is a reasonable estimation of the number
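As a rough illustration of the clustering-tendency test mentioned above, the following Python sketch computes a commonly used textbook form of the Hopkins statistic. It is not the variation devised by Lawson and Jurs, whose details are not given in this section. The function names, the use of the data's bounding box as the sampling window, and the choice of raising distances to the power of the dimensionality are all assumptions made for the example.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its nearest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def hopkins_statistic(data, n_probes, rng):
    """Textbook-style Hopkins statistic (assumed form, see lead-in).

    Values near 0.5 suggest spatially random data; values approaching 1
    suggest that the data contain clusters.
    """
    dim = len(data[0])
    lows = [min(row[j] for row in data) for j in range(dim)]
    highs = [max(row[j] for row in data) for j in range(dim)]

    # w: nearest-neighbour distances for a random sample of real data points.
    w = []
    for i in rng.sample(range(len(data)), n_probes):
        others = data[:i] + data[i + 1:]
        w.append(nearest_distance(data[i], others) ** dim)

    # u: distances from uniformly placed probe points to the nearest data point.
    u = []
    for _ in range(n_probes):
        probe = [rng.uniform(lows[j], highs[j]) for j in range(dim)]
        u.append(nearest_distance(probe, data) ** dim)

    return sum(u) / (sum(u) + sum(w))


rng = random.Random(0)
# Hypothetical 2D test sets: one spatially random, one with two tight clusters.
random_data = [(rng.random(), rng.random()) for _ in range(400)]
clustered_data = [
    (rng.gauss(center, 0.01), rng.gauss(center, 0.01))
    for center in (0.0, 1.0)
    for _ in range(200)
]

print("random data:   ", round(hopkins_statistic(random_data, 40, rng), 3))
print("clustered data:", round(hopkins_statistic(clustered_data, 40, rng), 3))
```

For spatially random data the statistic should fall near 0.5, while for strongly clustered data it should approach 1, because uniformly placed probe points lie much farther from the data than the data points lie from one another.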