6.6 CLUSTERING

... mixture model, problems will occur whenever any of the normal distributions becomes so narrow that it is centered on just one data point. Consequently, implementations generally insist that clusters contain at least two different data values.

Whenever there are a large number of parameters, the problem of overfitting arises. If you were unsure of which attributes were covariant, you might try out different possibilities and choose the one that maximized the overall probability of the data given the clustering that was found. Unfortunately, the more parameters there are, the larger the overall data probability will tend to be, not necessarily because of better clustering but because of overfitting. The more parameters there are to play with, the easier it is to find a clustering that seems good.

It would be nice if somehow you could penalize the model for introducing new parameters. One principled way of doing this is to adopt a fully Bayesian approach in which every parameter has a prior probability distribution. Then, whenever a new parameter is introduced, its prior probability must be incorporated into the overall likelihood figure. Because this will involve multiplying the overall likelihood by a number less than one (the prior probability), it will automatically penalize the addition of new parameters. To improve the overall likelihood, the new parameters will have to yield a benefit that outweighs the penalty.

In a sense, the Laplace estimator that we met in Section 4.2, and whose use we advocated earlier to counter the problem of zero probability estimates for nominal values, is just such a device. Whenever observed probabilities are small, the Laplace estimator exacts a penalty because it makes probabilities that are zero, or close to zero, greater, and this will decrease the overall likelihood of the data. Making two nominal attributes covariant will exacerbate the problem. Instead of v1 + v2 parameters, where v1 and v2 are the number of possible values, there are now v1 × v2, greatly increasing the chance of a large number of small estimated probabilities. In fact, the Laplace estimator is tantamount to using a particular prior distribution for the introduction of new parameters.

The same technique can be used to penalize the introduction of large numbers of clusters, just by using a prespecified prior distribution that decays sharply as the number of clusters increases.

AutoClass is a comprehensive Bayesian clustering scheme that uses the finite mixture model with prior distributions on all the parameters. It allows both numeric and nominal attributes and uses the EM algorithm to estimate the parameters of the probability distributions to best fit the data. Because there is no guarantee that the EM algorithm converges to the global optimum, the procedure is repeated for several different sets of initial values. But that is not all. AutoClass considers different numbers of clusters and can consider different amounts of covariance and different underlying probability distribution types ...
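
AutoClass itself is not reproduced here, but the two ideas just mentioned, namely restarting EM from several sets of initial values because it only reaches a local optimum, and comparing different numbers of clusters with a score that trades likelihood against the number of parameters, can be sketched in a few lines. The Python fragment below is only an illustration under assumptions of its own: it uses scikit-learn's GaussianMixture on invented two-dimensional data, and the Bayesian information criterion stands in for AutoClass's prior-based score.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Invented two-dimensional data: three well-separated spherical clusters.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ((0, 0), (4, 0), (2, 3))])

best_k, best_score = None, np.inf
for k in range(1, 7):
    # n_init restarts EM from several random initial values, since EM is
    # only guaranteed to converge to a local optimum of the likelihood.
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    score = gm.bic(X)  # log-likelihood penalized by the number of parameters
    print(f"k={k}  BIC={score:.1f}")
    if score < best_score:
        best_k, best_score = k, score

print("chosen number of clusters:", best_k)

The penalty term in the score plays the same role as the sharply decaying prior over the number of clusters described above: adding a cluster must buy enough extra likelihood to overcome it.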
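
The effect of the Laplace estimator on covariant nominal attributes can also be illustrated numerically. The following sketch, with invented data and attribute sizes, compares the training log-likelihood of two independently generated nominal attributes when they are modeled separately (v1 + v2 estimates) and when they are made covariant (v1 × v2 estimates), with and without Laplace smoothing. Because the covariant model has many small counts, smoothing typically reduces its likelihood far more, which is the penalizing effect described earlier in this section.

import numpy as np

rng = np.random.default_rng(42)

# Invented example: two nominal attributes with v1 = v2 = 5 values,
# generated independently of each other, in a sample of n = 50 instances.
v1, v2, n = 5, 5, 50
a = rng.integers(0, v1, size=n)
b = rng.integers(0, v2, size=n)

def loglik_separate(laplace):
    # Model each attribute on its own: v1 + v2 probability estimates.
    k = 1 if laplace else 0
    pa = (np.bincount(a, minlength=v1) + k) / (n + k * v1)
    pb = (np.bincount(b, minlength=v2) + k) / (n + k * v2)
    return np.sum(np.log(pa[a]) + np.log(pb[b]))

def loglik_covariant(laplace):
    # Model the attributes jointly: v1 * v2 estimates, most of them small.
    k = 1 if laplace else 0
    counts = np.zeros((v1, v2))
    np.add.at(counts, (a, b), 1)
    pj = (counts + k) / (n + k * v1 * v2)
    return np.sum(np.log(pj[a, b]))

for name, fn in (("separate", loglik_separate), ("covariant", loglik_covariant)):
    raw, smoothed = fn(False), fn(True)
    print(f"{name:9s}  max-likelihood: {raw:8.2f}   "
          f"Laplace-smoothed: {smoothed:8.2f}   penalty: {raw - smoothed:.2f}")
# The drop in log-likelihood caused by smoothing (the "penalty" column) is
# typically much larger for the covariant model, whose many small counts are
# all pulled toward the uniform prior.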
