Data Mining Methods and Models
CHAPTER 5 NAIVE BAYES ESTIMATION AND BAYESIAN NETWORKS
The posterior distribution is found as follows:

p(θ|X) = p(X|θ) p(θ) / p(X)
where p(X|θ) represents the likelihood function, p(θ) the prior distribution, and p(X) a normalizing factor called the marginal distribution of the data. Since the posterior is a distribution rather than a single value, we can conceivably examine any possible statistic of this distribution that we are interested in, such as the first quartile or the mean absolute deviation. However, it is common to choose the posterior mode, the value of θ that maximizes p(θ|X), for an estimate, in which case we call this estimation method the maximum a posteriori (MAP) method. For noninformative priors, the MAP estimate and the frequentist maximum likelihood estimate often coincide, since the data dominate the prior. The likelihood function p(X|θ) derives from the assumption that the observations are independently and identically distributed according to a particular distribution f(X|θ), so that

p(X|θ) = ∏_{i=1}^{n} f(X_i|θ)
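To make the i.i.d. product concrete, here is a minimal sketch (not from the text) that computes a Bernoulli likelihood as the product of the individual densities f(x_i|θ); the data and θ values are illustrative:

```python
import math

def bernoulli_likelihood(data, theta):
    """Likelihood p(X|theta) of i.i.d. Bernoulli observations:
    the product of f(x_i | theta) over all observations."""
    return math.prod(theta if x == 1 else (1.0 - theta) for x in data)

# Illustrative data: 7 successes out of 10 trials
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(bernoulli_likelihood(data, 0.7))  # = 0.7^7 * 0.3^3
print(bernoulli_likelihood(data, 0.3))  # a poorly fitting theta is far less likely
```

Note how quickly such products shrink as n grows; in practice one usually works with the sum of log densities instead.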
The normalizing factor p(X) is essentially a constant, for a given data set and model, so that we may express the posterior distribution like this: p(θ|X) ∝ p(X|θ)p(θ). That is, given the data, the posterior distribution of θ is proportional to the product of the likelihood and the prior. Thus, when we have a great deal of information coming from the likelihood, as we do in most data mining applications, the likelihood will overwhelm the prior.
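The proportionality p(θ|X) ∝ p(X|θ)p(θ) can be sketched with a simple grid approximation; this is an illustrative example (Bernoulli data, flat prior), not a method from the text, and it also shows the MAP estimate coinciding with the maximum likelihood estimate under a noninformative prior:

```python
def posterior_on_grid(data, prior, grid):
    """Return p(theta|X) on a grid: likelihood * prior, then normalize.
    The normalizing factor p(X) is just the sum over the grid points."""
    def likelihood(theta):
        k = sum(data)
        return theta**k * (1 - theta)**(len(data) - k)
    unnorm = [likelihood(t) * p for t, p in zip(grid, prior)]
    z = sum(unnorm)                    # plays the role of the constant p(X)
    return [u / z for u in unnorm]

grid = [i / 100 for i in range(1, 100)]
flat_prior = [1.0] * len(grid)          # noninformative (flat) prior
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # illustrative: 7 successes in 10 trials
post = posterior_on_grid(data, flat_prior, grid)
theta_map = grid[max(range(len(grid)), key=post.__getitem__)]
print(theta_map)  # with a flat prior, the MAP estimate equals the MLE, 0.7
```

Replacing `flat_prior` with a sharply peaked prior shifts the posterior mode; with enough data the likelihood dominates and the shift becomes negligible.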
Criticism of the Bayesian framework has focused primarily on two potential drawbacks. First, elicitation of a prior distribution may be subjective. That is, two different subject matter experts may provide two different prior distributions, which will presumably percolate through to result in two different posterior distributions. The solution to this problem is (1) to select noninformative priors if the choice of priors is controversial, and (2) to apply lots of data so that the relative importance of the prior is diminished. Failing this, model selection can be performed on the two different posterior distributions, using model adequacy and efficacy criteria, resulting in the choice of the better model. Is reporting more than one model a bad thing?
The second criticism has been that Bayesian computation has been intractable, in data mining terms, for most interesting problems, so that the approach suffered from scalability issues. The curse of dimensionality hits Bayesian analysis rather hard, since the normalizing factor requires integrating (or summing) over all possible values of the parameter vector, which may be computationally infeasible when applied directly. However, the introduction of Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling and the Metropolis algorithm has greatly expanded the range of problems and dimensions that Bayesian analysis can handle.
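The key point about MCMC is that it needs the posterior only up to the intractable normalizing constant. A minimal Metropolis sampler sketch (the target here is an assumed unnormalized standard normal, chosen purely for illustration):

```python
import math
import random

def metropolis(log_post, start, proposal_sd, n_steps, seed=0):
    """Minimal Metropolis sampler: draws from a distribution known only up
    to its normalizing constant, so p(X) never needs to be computed."""
    rng = random.Random(seed)
    theta = start
    samples = []
    for _ in range(n_steps):
        cand = theta + rng.gauss(0, proposal_sd)   # symmetric proposal
        log_ratio = log_post(cand) - log_post(theta)
        # accept with probability min(1, p(cand)/p(theta))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta = cand
        samples.append(theta)
    return samples

# Unnormalized standard-normal log density: the constant term is dropped,
# mirroring the situation where the marginal p(X) is intractable.
samples = metropolis(lambda t: -0.5 * t * t, start=0.0,
                     proposal_sd=1.0, n_steps=20000)
mean = sum(samples) / len(samples)   # should be near the true mean, 0
```

Gibbs sampling plays the same role when the full conditional distributions of each parameter are available in closed form.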
MAXIMUM A POSTERIORI CLASSIFICATION<br />
How do we find the MAP estimate of θ? Well, we need the value of θ that will maximize p(θ|X); this value is expressed as θ_MAP = arg max_θ p(θ|X), since it is the argument (value) that maximizes p(θ|X) over all θ. Then, using the formula for the