
CHAPTER 5 NAIVE BAYES ESTIMATION AND BAYESIAN NETWORKS

The posterior distribution is found as follows:

p(θ|X) = p(X|θ)p(θ) / p(X)

where p(X|θ) represents the likelihood function, p(θ) the prior distribution, and p(X) a normalizing factor called the marginal distribution of the data. Since the posterior is a distribution rather than a single value, we can conceivably examine any statistic of this distribution that we are interested in, such as the first quartile or the mean absolute deviation. However, it is common to choose the posterior mode, the value of θ that maximizes p(θ|X), for an estimate, in which case we call this estimation method the maximum a posteriori (MAP) method. For noninformative priors, the MAP estimate and the frequentist maximum likelihood estimate often coincide, since the data dominate the prior. The likelihood function p(X|θ) derives from the assumption that the observations are independently and identically distributed according to a particular distribution f(X|θ), so that

p(X|θ) = ∏_{i=1}^{n} f(X_i|θ)

The normalizing factor p(X) is essentially a constant for a given data set and model, so that we may express the posterior distribution like this: p(θ|X) ∝ p(X|θ)p(θ). That is, given the data, the posterior distribution of θ is proportional to the product of the likelihood and the prior. Thus, when we have a great deal of information coming from the likelihood, as we do in most data mining applications, the likelihood will overwhelm the prior.
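To make these relationships concrete, the following is a minimal sketch (not from the text) for a hypothetical coin-flip example: a Bernoulli likelihood, a Beta prior, and a grid of candidate θ values. The data, the Beta(2, 2) prior, and the grid resolution are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical observations: 10 Bernoulli trials (7 successes, 3 failures).
X = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])
theta_grid = np.linspace(0.001, 0.999, 999)      # candidate values of theta

prior = stats.beta.pdf(theta_grid, a=2, b=2)     # assumed weakly informative Beta(2, 2) prior
# Likelihood p(X|theta) = prod_i f(X_i|theta), evaluated for every theta on the grid.
likelihood = np.prod(stats.bernoulli.pmf(X[:, None], theta_grid), axis=0)

unnormalized = likelihood * prior                # p(theta|X) is proportional to p(X|theta) p(theta)
posterior = unnormalized / np.trapz(unnormalized, theta_grid)  # divide by the constant p(X)

theta_map = theta_grid[np.argmax(posterior)]     # posterior mode: the MAP estimate
theta_mle = theta_grid[np.argmax(likelihood)]    # frequentist maximum likelihood estimate
print(f"MAP estimate: {theta_map:.3f}   MLE: {theta_mle:.3f}")
```

Replacing the prior with the flat Beta(1, 1), a noninformative choice, makes the two estimates coincide, as noted above.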

Criticism of the Bayesian framework has focused primarily on two potential drawbacks. First, elicitation of a prior distribution may be subjective. That is, two different subject matter experts may provide two different prior distributions, which will presumably percolate through to result in two different posterior distributions. The solution to this problem is (1) to select noninformative priors if the choice of priors is controversial, and (2) to apply lots of data so that the relative importance of the prior is diminished. Failing this, model selection can be performed on the two different posterior distributions, using model adequacy and efficacy criteria, resulting in the choice of the better model. Is reporting more than one model a bad thing?

The second criticism has been that Bayesian computation has been intractable, in data mining terms, for most interesting problems, so that the approach suffered from scalability issues. The curse of dimensionality hits Bayesian analysis rather hard, since the normalizing factor requires integrating (or summing) over all possible values of the parameter vector, which may be computationally infeasible when applied directly. However, the introduction of Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling and the Metropolis algorithm has greatly expanded the range of problems and dimensions that Bayesian analysis can handle.
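As an illustration of the idea, here is a minimal sketch (not from the text) of a random-walk Metropolis sampler applied to the same hypothetical coin-flip posterior used above. It evaluates only the unnormalized product p(X|θ)p(θ), so the troublesome normalizing factor p(X) never has to be computed; the uniform prior, proposal scale, chain length, and burn-in period are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])     # hypothetical Bernoulli data

def log_unnormalized_posterior(theta):
    """log p(X|theta) + log p(theta), with an assumed uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -np.inf                           # zero prior probability outside (0, 1)
    return np.sum(X) * np.log(theta) + np.sum(1 - X) * np.log(1 - theta)

theta, draws = 0.5, []                           # start the chain at theta = 0.5
for _ in range(20000):
    proposal = theta + rng.normal(scale=0.1)     # random-walk proposal step
    log_alpha = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta)
    if np.log(rng.uniform()) < log_alpha:        # accept with probability min(1, alpha)
        theta = proposal
    draws.append(theta)

posterior_sample = np.array(draws[5000:])        # discard burn-in draws
print(f"posterior mean of theta is approximately {posterior_sample.mean():.3f}")
```

Gibbs sampling proceeds in a similar spirit, but draws each parameter in turn from its full conditional distribution rather than using an accept/reject step.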

MAXIMUM A POSTERIORI CLASSIFICATION<br />

How do we find the MAP estimate of θ? Well, we need the value of θ that will maximize p(θ|X); this value is expressed as θ_MAP = arg max_θ p(θ|X), since it is the argument (value) that maximizes p(θ|X) over all θ. Then, using the formula for the posterior distribution, θ_MAP = arg max_θ p(X|θ)p(θ)/p(X), and since p(X) does not depend on θ, this amounts to maximizing the product p(X|θ)p(θ).
