Reviews in Computational Chemistry Volume 18
Progress in Clustering Methodology

MNV(i,k) less than MNV(i,j) but distance(i,k) greater than or equal to distance(i,j), then i is not a valid neighbor of j, and j is not a valid neighbor of i. The neighbor validity check can result in many small clusters, but these clusters can be merged afterward by relaxing the reciprocal nature of the check.
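
The reciprocal validity check can be sketched as follows. This is a minimal illustration, assuming MNV(i,j) is a mutual neighbor value computed as the sum of each point's rank in the other's distance-sorted neighbor list; the function names are ours, not from the original reference:

```python
import numpy as np

def mutual_neighbor_value(dist, i, j):
    """Assumed MNV(i, j): rank of j among i's neighbors plus rank of i
    among j's neighbors (each point is rank 0 in its own list)."""
    rank_j_for_i = np.argsort(dist[i], kind="stable").tolist().index(j)
    rank_i_for_j = np.argsort(dist[j], kind="stable").tolist().index(i)
    return rank_j_for_i + rank_i_for_j

def are_valid_neighbors(dist, i, j):
    """i and j remain valid neighbors only if no third point k has a
    smaller MNV with i while lying at least as far from i as j does."""
    mnv_ij = mutual_neighbor_value(dist, i, j)
    for k in range(dist.shape[0]):
        if k in (i, j):
            continue
        if (mutual_neighbor_value(dist, i, k) < mnv_ij
                and dist[i, k] >= dist[i, j]):
            return False
    return True
```

Relaxing the reciprocal nature of the check, as the text suggests, would amount to accepting i and j as neighbors when the condition holds in either direction rather than both.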

Relocation algorithms are widely used outside of chemical applications, largely because of their simplicity and speed. The original k-means noncombinatorial methods, such as that by Forgy,68 and the combinatorial methods, such as that by MacQueen,69 have been modified into different versions for use in many disciplines, a few of which are mentioned here. Efficient implementations of k-means include those by Hartigan and Wong70 and Spaeth.71 A variation of the k-means algorithm, referred to as the moving method, looks ahead to see whether moving an item from one cluster to another will result in an overall decrease in the square error (Eq. [2]); if it does, the move is carried out. Duda and Hart72 and Ismail and Kamel73 originally outlined this variant, while Zhang, Wang, and Boyle74 further developed the idea and obtained better results than a standard noncombinatorial implementation of k-means. Because the method relies on the concept of a centroid, it is usually used with numerical data. However, Huang75 reported k-modes and k-prototypes variants that are suitable for use with categorical data and with mixed numerical and categorical data, respectively.
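
One look-ahead pass of the moving method might be sketched as below; this is an illustrative implementation of the general idea, not the cited authors' code, and the square error follows the usual within-cluster sum-of-squares form of Eq. [2]:

```python
import numpy as np

def square_error(X, labels, k):
    """Square error (Eq. [2]): total within-cluster sum of squared
    distances from each item to its cluster centroid."""
    err = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            err += ((members - members.mean(axis=0)) ** 2).sum()
    return err

def moving_method_pass(X, labels, k):
    """One pass of the look-ahead 'moving method': tentatively move each
    item to every other cluster and keep the move only if the overall
    square error decreases."""
    for i in range(len(X)):
        current = labels[i]
        for c in range(k):
            if c == current:
                continue
            before = square_error(X, labels, k)
            labels[i] = c                # tentative move
            if square_error(X, labels, k) < before:
                current = c              # keep the beneficial move
            else:
                labels[i] = current      # revert
    return labels
```

In practice the pass would be repeated until no move lowers the square error; recomputing the full error for every candidate move is the naive O-expensive form, and real implementations update only the two affected clusters.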

The main problems with k-means are (1) the tendency to find hyperspherical clusters, (2) the danger of falling into local minima, (3) the sensitivity to noise, and (4) the variability in results depending on the choice of the initial seed points. Because k-means (and its fuzzy equivalent, c-means) is a centroid-based method, not much can be done about the tendency to produce hyperspherical clusters, although the CURE methodology mentioned above might alleviate this tendency somewhat. Falling into local minima cannot be avoided, but rerunning k-means with different seeds is a standard way of producing alternative solutions. After a given number of reruns, the solution that has produced the lowest square error across the partition is chosen. An alternative is to perturb an existing solution rather than starting again. Zhang and Boyle76 examined the effects of four types of perturbation on the moving method and found little difference between them. Estivill-Castro and Yang77 suggested that the problem of sensitivity to noise is due to the use of means (and centroids) rather than medians (and medoids), and proposed a variant of k-means based on the use of medoids to represent each cluster. However, calculation of a point to represent the medoid is more CPU-expensive [O(n log n) for each cluster of size n] than that required for the centroid, resulting in a method that is slightly slower than k-means (but faster than EM algorithms36). A similar medoid-based variant is the PAM (Partitioning Around Medoids) method developed by Kaufman and Rousseeuw.2 This method is very time consuming, and so the authors developed CLARA (Clustering LARge Applications), which takes a sample of a data
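
The standard multiple-restart strategy for coping with local minima can be sketched as follows; the k-means routine here is a minimal Forgy-style illustration written for this purpose, not any of the published implementations cited above:

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Minimal Forgy-style k-means: random initial seed points, then
    alternating assignment and centroid-update steps."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == c].mean(axis=0)
                              if (labels == c).any() else centroids[c]
                              for c in range(k)])
    sse = ((X - centroids[labels]) ** 2).sum()  # square error of the partition
    return labels, sse

def kmeans_with_restarts(X, k, n_restarts=10, seed=0):
    """Rerun with different random seeds and keep the partition with the
    lowest square error, as described in the text."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(n_restarts)),
               key=lambda result: result[1])
```

A perturbation-based alternative, as studied by Zhang and Boyle, would instead modify the best partition found so far (for example by reassigning a few items at random) and re-optimize, rather than reseeding from scratch.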
