Data Mining Methods and Models

CHAPTER 1 DIMENSION REDUCTION METHODS

Then the correlation matrix is denoted as ρ (rho, the Greek letter for r):

$$
\rho =
\begin{bmatrix}
\dfrac{\sigma_{11}^{2}}{\sigma_{11}\sigma_{11}} & \dfrac{\sigma_{12}^{2}}{\sigma_{11}\sigma_{22}} & \cdots & \dfrac{\sigma_{1m}^{2}}{\sigma_{11}\sigma_{mm}} \\
\dfrac{\sigma_{12}^{2}}{\sigma_{11}\sigma_{22}} & \dfrac{\sigma_{22}^{2}}{\sigma_{22}\sigma_{22}} & \cdots & \dfrac{\sigma_{2m}^{2}}{\sigma_{22}\sigma_{mm}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\sigma_{1m}^{2}}{\sigma_{11}\sigma_{mm}} & \dfrac{\sigma_{2m}^{2}}{\sigma_{22}\sigma_{mm}} & \cdots & \dfrac{\sigma_{mm}^{2}}{\sigma_{mm}\sigma_{mm}}
\end{bmatrix}
$$

Consider again the standardized data matrix $Z = (V^{1/2})^{-1}(X - \mu)$. Then since each variable has been standardized, we have $E(Z) = 0$, where 0 denotes an n × 1 vector of zeros, and Z has covariance matrix

$$
\operatorname{Cov}(Z) = (V^{1/2})^{-1}\,\Sigma\,(V^{1/2})^{-1} = \rho
$$

Thus, for the standardized data set, the covariance matrix and the correlation matrix are the same.

The ith principal component of the standardized data matrix $Z = [Z_1, Z_2, \ldots, Z_m]$ is given by $Y_i = e_i' Z$, where $e_i$ refers to the ith eigenvector (discussed below) and $e_i'$ refers to the transpose of $e_i$. The principal components are linear combinations $Y_1, Y_2, \ldots, Y_k$ of the standardized variables in Z such that (1) the variances of the $Y_i$ are as large as possible, and (2) the $Y_i$ are uncorrelated. The first principal component is the linear combination

$$
Y_1 = e_1' Z = e_{11} Z_1 + e_{12} Z_2 + \cdots + e_{1m} Z_m
$$

which has greater variability than any other possible linear combination of the Z variables whose coefficient vector has unit length ($e_1' e_1 = 1$). Thus:

• The first principal component is the linear combination $Y_1 = e_1' Z$, which maximizes $\operatorname{Var}(Y_1) = e_1' \rho\, e_1$.
• The second principal component is the linear combination $Y_2 = e_2' Z$, which is uncorrelated with $Y_1$ and maximizes $\operatorname{Var}(Y_2) = e_2' \rho\, e_2$.
• The ith principal component is the linear combination $Y_i = e_i' Z$, which is uncorrelated with all the other principal components $Y_j$, $j < i$, and maximizes $\operatorname{Var}(Y_i) = e_i' \rho\, e_i$.

We have the following definitions:

• Eigenvalues. Let B be an m × m matrix, and let I be the m × m identity matrix (diagonal matrix with 1’s on the diagonal). Then the scalars (numbers of dimension 1 × 1) $\lambda_1, \lambda_2, \ldots, \lambda_m$ are said to be the eigenvalues of B if they satisfy $|B - \lambda I| = 0$.
• Eigenvectors. Let B be an m × m matrix, and let λ be an eigenvalue of B.
Then a nonzero m × 1 vector e is said to be an eigenvector of B if $Be = \lambda e$.

The following results are very important for our principal components analysis.

• Result 1. The total variability in the standardized data set equals the sum of the variances for each Z-vector, which equals the sum of the variances for each component, which equals the sum of the eigenvalues, which equals the number of variables. That is,

$$
\sum_{i=1}^{m} \operatorname{Var}(Y_i) = \sum_{i=1}^{m} \operatorname{Var}(Z_i) = \sum_{i=1}^{m} \lambda_i = m
$$

• Result 2. The partial correlation between a given component and a given variable is a function of an eigenvector and an eigenvalue. Specifically, $\operatorname{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i}$, $i, j = 1, 2, \ldots, m$, where $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_m, e_m)$ are the eigenvalue-eigenvector pairs for the correlation matrix ρ, and we note that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$. A partial correlation coefficient is a correlation coefficient that takes into account the effect of all the other variables.
• Result 3. The proportion of the total variability in Z that is explained by the ith principal component is the ratio of the ith eigenvalue to the number of variables, that is, the ratio $\lambda_i / m$.

Next, to illustrate how to apply principal components analysis to real data, we turn to an example.

Applying Principal Components Analysis to the Houses Data Set

We turn to the houses data set [3], which provides census information from all the block groups from the 1990 California census. For this data set, a block group has an average of 1425.5 people living in an area that is geographically compact. Block groups that contained zero entries for any of the variables were excluded. Median house value is the response variable; the predictor variables are:

• Median income
• Housing median age
• Total rooms
• Total bedrooms
• Population
• Households
• Latitude
• Longitude

The original data set had 20,640 records, of which 18,540 were selected randomly for a training data set and 2,100 were held out for a test data set. A quick look at the variables is provided in Figure 1.1. (“Range” is Clementine’s type label for continuous variables.) Median house value appears to be in dollars, but median income has been scaled to a continuous scale from 0 to 15. Note that longitude is expressed in negative terms, meaning west of Greenwich.
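Before working with the houses data, Results 1 to 3 above can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a synthetic stand-in data set with m = 8 correlated variables (it is not the houses data). The names `X`, `Z`, `rho`, `eigvals`, `eigvecs`, `Y`, and the sample size `n = 1000` are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, NOT the houses data):
# n records on m correlated variables.
n, m = 1000, 8
mixing = rng.normal(size=(m, m))
X = rng.normal(size=(n, m)) @ mixing

# Standardize each variable: subtract its mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# For standardized data, the covariance matrix equals the correlation matrix rho.
rho = np.cov(Z, rowvar=False)
print("Cov(Z) equals Corr(X):", np.allclose(rho, np.corrcoef(X, rowvar=False)))

# Eigenvalue-eigenvector pairs of rho, sorted so that lambda_1 >= ... >= lambda_m.
# Column i of eigvecs is the eigenvector e_i.
eigvals, eigvecs = np.linalg.eigh(rho)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components: column i of Y holds Y_i = e_i' Z for every record.
Y = Z @ eigvecs

# Result 1: sum Var(Y_i) = sum Var(Z_i) = sum lambda_i = m.
var_Y = Y.var(axis=0, ddof=1)
total_var_Y = var_Y.sum()
total_var_Z = Z.var(axis=0, ddof=1).sum()
print("Result 1:", np.allclose([total_var_Y, total_var_Z, eigvals.sum()], m))

# Result 2: Corr(Y_i, Z_j) = e_ij * sqrt(lambda_i), where e_ij is the
# j-th entry of the i-th eigenvector.
corr_YZ = np.corrcoef(Y, Z, rowvar=False)[:m, m:]
predicted = eigvecs.T * np.sqrt(eigvals)[:, None]
print("Result 2:", np.allclose(corr_YZ, predicted))

# Result 3: proportion of total variability explained by component i is lambda_i / m.
proportion = eigvals / m
print("Result 3 proportions:", np.round(proportion, 3))
```

The comparisons use `np.allclose` because the equalities hold only up to floating-point rounding. The sketch uses `np.linalg.eigh` rather than a general eigensolver because a correlation matrix is symmetric; that is a design choice of this example, not something the text prescribes.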
Larger absolute values for longitude indicate geographic locations farther west. Relating this data set to our earlier notation, we have X1 = median income, X2 = housing median age, ..., X8 = longitude, so that m = 8 and n = 18,540. A glimpse of the first 20 records in the data set is shown in Figure 1.2. For example, for the first block group, the median house value is $452,600, the median income is 8.325 (on the census scale), the housing median age is 41, the total number of rooms is 880, the total number of bedrooms is 129, the population is 322, the number of households is 126, the latitude is 37.88 North, and the longitude is 122.23 West. Clearly, this is a smallish block group with a very high median house value. A map search reveals that this block group