Data Mining Methods and Models
CHAPTER 5 NAIVE BAYES ESTIMATION AND BAYESIAN NETWORKS
The posterior distribution is found as follows:

p(θ|X) = p(X|θ) p(θ) / p(X)
where p(X|θ) represents the likelihood function, p(θ) the prior distribution, and p(X) a normalizing factor called the marginal distribution of the data. Since the posterior is a distribution rather than a single value, we can conceivably examine any possible statistic of this distribution that we are interested in, such as the first quartile or the mean absolute deviation. However, it is common to choose the posterior mode, the value of θ that maximizes p(θ|X), for an estimate, in which case we call this estimation method the maximum a posteriori (MAP) method. For noninformative priors, the MAP estimate and the frequentist maximum likelihood estimate often coincide, since the data dominate the prior. The likelihood function p(X|θ) derives from the assumption that the observations are independently and identically distributed according to a particular distribution f(X|θ), so that

p(X|θ) = ∏_{i=1}^{n} f(X_i|θ)
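To make the i.i.d. product concrete, here is a minimal sketch (not from the text) that computes a Bernoulli likelihood as the product of the individual densities f(x_i|θ); the data and θ values are illustrative:

```python
import math

def bernoulli_likelihood(data, theta):
    """Likelihood p(X|theta) of i.i.d. Bernoulli observations:
    the product of f(x_i | theta) over all observations."""
    return math.prod(theta if x == 1 else (1.0 - theta) for x in data)

# Illustrative data: 7 successes out of 10 trials
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(bernoulli_likelihood(data, 0.7))  # = 0.7^7 * 0.3^3
print(bernoulli_likelihood(data, 0.3))  # a poorly fitting theta is far less likely
```

Note how quickly such products shrink as n grows; in practice one usually works with the sum of log densities instead.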
The normalizing factor p(X) is essentially a constant, for a given data set and model, so that we may express the posterior distribution like this: p(θ|X) ∝ p(X|θ)p(θ). That is, given the data, the posterior distribution of θ is proportional to the product of the likelihood and the prior. Thus, when we have a great deal of information coming from the likelihood, as we do in most data mining applications, the likelihood will overwhelm the prior.
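The proportionality p(θ|X) ∝ p(X|θ)p(θ) can be sketched with a simple grid approximation; this is an illustrative example (Bernoulli data, flat prior), not a method from the text, and it also shows the MAP estimate coinciding with the maximum likelihood estimate under a noninformative prior:

```python
def posterior_on_grid(data, prior, grid):
    """Return p(theta|X) on a grid: likelihood * prior, then normalize.
    The normalizing factor p(X) is just the sum over the grid points."""
    def likelihood(theta):
        k = sum(data)
        return theta**k * (1 - theta)**(len(data) - k)
    unnorm = [likelihood(t) * p for t, p in zip(grid, prior)]
    z = sum(unnorm)                    # plays the role of the constant p(X)
    return [u / z for u in unnorm]

grid = [i / 100 for i in range(1, 100)]
flat_prior = [1.0] * len(grid)          # noninformative (flat) prior
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # illustrative: 7 successes in 10 trials
post = posterior_on_grid(data, flat_prior, grid)
theta_map = grid[max(range(len(grid)), key=post.__getitem__)]
print(theta_map)  # with a flat prior, the MAP estimate equals the MLE, 0.7
```

Replacing `flat_prior` with a sharply peaked prior shifts the posterior mode; with enough data the likelihood dominates and the shift becomes negligible.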
Criticism of the Bayesian framework has focused primarily on two potential drawbacks. First, elicitation of a prior distribution may be subjective. That is, two different subject matter experts may provide two different prior distributions, which will presumably percolate through to result in two different posterior distributions. The solution to this problem is (1) to select noninformative priors if the choice of priors is controversial, and (2) to apply lots of data so that the relative importance of the prior is diminished. Failing this, model selection can be performed on the two different posterior distributions, using model adequacy and efficacy criteria, resulting in the choice of the better model. Is reporting more than one model a bad thing?
The second criticism has been that Bayesian computation has been intractable, in data mining terms, for most interesting problems, so that the approach suffered from scalability issues. The curse of dimensionality hits Bayesian analysis rather hard, since the normalizing factor requires integrating (or summing) over all possible values of the parameter vector, which may be computationally infeasible when applied directly. However, the introduction of Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling and the Metropolis algorithm has greatly expanded the range of problems and dimensions that Bayesian analysis can handle.
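The key point about MCMC is that it needs the posterior only up to the intractable normalizing constant. A minimal Metropolis sampler sketch (the target here is an assumed unnormalized standard normal, chosen purely for illustration):

```python
import math
import random

def metropolis(log_post, start, proposal_sd, n_steps, seed=0):
    """Minimal Metropolis sampler: draws from a distribution known only up
    to its normalizing constant, so p(X) never needs to be computed."""
    rng = random.Random(seed)
    theta = start
    samples = []
    for _ in range(n_steps):
        cand = theta + rng.gauss(0, proposal_sd)   # symmetric proposal
        log_ratio = log_post(cand) - log_post(theta)
        # accept with probability min(1, p(cand)/p(theta))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta = cand
        samples.append(theta)
    return samples

# Unnormalized standard-normal log density: the constant term is dropped,
# mirroring the situation where the marginal p(X) is intractable.
samples = metropolis(lambda t: -0.5 * t * t, start=0.0,
                     proposal_sd=1.0, n_steps=20000)
mean = sum(samples) / len(samples)   # should be near the true mean, 0
```

Gibbs sampling plays the same role when the full conditional distributions of each parameter are available in closed form.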
MAXIMUM A POSTERIORI CLASSIFICATION<br />
How do we find the MAP estimate of θ? Well, we need the value of θ that will maximize p(θ|X); this value is expressed as θ_MAP = arg max_θ p(θ|X), since it is the argument (value) that maximizes p(θ|X) over all θ. Then, using the formula for the