The second step above is valid under the assumption that the two data points are independent and identically distributed. This equation shows that the Mixture Density Kernel measures the ratio of the probability that two points arise from the same mode to the unconditional joint distribution. If we simplify the equation further by assuming that the class distributions are uniform, the kernel tells us, on average across the ensemble, the amount of information gained by knowing that two points are drawn from the same mode of a mixture density.

It is not possible to obtain this similarity by computing the average density across an ensemble, $P(x) = \sum_{m=1}^{M} \sum_{k=1}^{K} P_m(k|x) P_m(k)$ (which would be the traditional approach to bagging), because in that case the assignment of a data point to a given mixture component is arbitrary. Thus, for two given probabilistic models $P_m$ and $P_n$, cluster $c_m$ may not be the same as cluster $c_n$. This lack of correspondence goes beyond the mere fact that the mixture assignments are arbitrary: the geometry of the mixture components may also differ from run to run. For example, in the case of a Gaussian mixture model, the positions of the mean vectors $\mu_m$ and $\mu_n$ and their associated covariance matrices may be different.

The dimension of the feature space defined by the Bagged Probabilistic Kernel is large but finite. In the case where the number of components varies across ensemble members, the dimensionality is $\dim(\mathcal{F}) = \sum_{m=1}^{M} C_m$. Once normalized to unit length (through the factor $Z(x_i, x_j)$), the $\Phi(x_i)$ vectors represent points on a high-dimensional hypersphere. The angle between two such vectors determines the similarity between the corresponding data points, and the $(i, j)$ element of the kernel matrix is the cosine of this angle.

Building the Mixture Density Kernel requires building an ensemble of mixture density models, each representing a sample from the larger model space $\mathcal{M}$. The greater the heterogeneity of the models used in generating the kernel, the more effective the procedure. In our implementation, the training data is sampled $M$ times with replacement. These overlapping data sets, combined with random initial conditions for the EM algorithm, help generate a heterogeneous ensemble. Another way of introducing heterogeneity is to encode domain knowledge in the model, which can be accomplished through the use of Bayesian Mixture Densities, the subject of the next section.
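As an illustration of the construction described above, the following sketch fits an ensemble of Gaussian mixture models to bootstrap resamples with random EM initializations and accumulates, for every pair of points, the posterior probability that they fall in the same mixture component, finally normalizing the matrix to unit diagonal so that each entry is the cosine between the corresponding $\Phi(x)$ vectors. This is a minimal sketch rather than the reference implementation: the ensemble size, the component count, and the use of scikit-learn's GaussianMixture are choices made for the example.

import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_density_kernel(X, n_models=10, n_components=5, seed=0):
    """Bagged mixture-density kernel: K[i, j] aggregates, over an ensemble of
    GMMs fit to bootstrap resamples, the probability that x_i and x_j are
    assigned to the same mixture component."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for m in range(n_models):
        # Bootstrap resample + random EM initialization -> heterogeneous ensemble.
        idx = rng.randint(0, n, size=n)
        gmm = GaussianMixture(n_components=n_components, init_params="random",
                              random_state=rng.randint(1 << 30)).fit(X[idx])
        # R[i, k] = P_m(k | x_i); same-component co-membership is R @ R.T.
        R = gmm.predict_proba(X)
        K += R @ R.T
    # Normalize to unit diagonal: entries become cosines between the Phi(x) vectors.
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# Example usage on synthetic two-cluster data:
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
K = mixture_density_kernel(X)
print(K.shape, K[0, 0])  # (100, 100), diagonal entries equal to 1.0

The resulting matrix is symmetric and positive semidefinite by construction, since it is a diagonally normalized sum of Gram matrices $R R^T$.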
6 Review of Bayesian Mixture Density Estimation

A Bayesian formulation of the density estimation problem requires that we maximize the posterior distribution $P(\Theta|\mathcal{X})$, which arises as follows:

(6.4)    $P(\Theta|\mathcal{X}) = \frac{P(\mathcal{X}|\Theta)\,P(\Theta)}{P(\mathcal{X})}$

Cheeseman and Stutz (1995), among others, showed that prior knowledge about the data-generating process can be encoded in the prior $P(\Theta)$ in order to guide the optimization algorithm toward a model $\Theta'$ that takes the domain knowledge into account. This prior assumes that a generative model $\Lambda$ has been chosen (such as a Gaussian), and determines the prior over the model parameters.

A Bayesian formulation of the mixture density problem requires that we specify the model ($\Lambda$) and then a prior distribution over the model parameters. In the case of a Gaussian mixture density model for $x \in \mathbb{R}^d$, we take the likelihood function to be:

$P(x|\theta_c) = P(x|\mu_c, \Sigma_c, c) = (2\pi)^{-\frac{d}{2}} |\Sigma_c|^{-\frac{1}{2}} \exp\left[-\tfrac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\right]$

In the event that domain information is to be encoded, it is convenient to represent it in terms of a conjugate prior for the Gaussian distribution. A conjugate family is defined as follows:

Definition 6.1. A family $F$ of probability density functions is said to be conjugate if for every $f \in F$, the posterior $f(\Theta|x)$ also belongs to $F$.

For a mixture-of-Gaussians model, priors can be set as follows [6]:

• For priors on the means, either a uniform distribution or a Gaussian distribution can be used.

• For priors on the covariance matrices, the Wishart density can be used: $P(\Sigma_i|\alpha, \beta, J) \propto |\Sigma_i^{-1}|^{\frac{\beta}{2}} \exp\left(-\alpha\,\mathrm{tr}(\Sigma_i^{-1} J)/2\right)$.

• For priors on the mixture weights, a Dirichlet distribution can be used: $P(p_i|\gamma) \propto \prod_{c=1}^{C} p_c^{\gamma_c - 1}$, where $p_i \equiv P(c = i)$.

These priors can be viewed as regularizers for the mixture network, as in [14]. Maximum a posteriori estimation is performed by taking the log of the posterior likelihood of each data point $x_i$ given the model $\Theta$. The following function is thus optimized using the Expectation Maximization algorithm [5]:

(6.5)    $l(\Theta) = \log\left[\prod_{i=1}^{N} P(x_i|\Theta)\,P(\Theta)\right]$
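To make the maximum a posteriori objective concrete, the sketch below evaluates a log-posterior of the form in (6.5) for a fitted Gaussian mixture as the data log-likelihood plus log-prior terms, using a Dirichlet prior on the mixture weights and a Wishart-style term on the inverse covariance matrices as listed above. The hyperparameter values ($\gamma$, $\alpha$, $\beta$, $J$) and the helper function name are assumptions made for this example; in an actual implementation these prior terms would be folded into the M-step updates of EM rather than evaluated after the fact.

import numpy as np
from scipy.stats import multivariate_normal

def log_posterior(X, weights, means, covs, gamma=2.0, alpha=1.0, J=None):
    """MAP objective in the spirit of (6.5): data log-likelihood plus log-priors.
    Dirichlet(gamma) prior on the mixture weights and a Wishart-style term
    (beta/2) log|Sigma_c^{-1}| - alpha tr(Sigma_c^{-1} J)/2 on each covariance."""
    d = X.shape[1]
    if J is None:
        J = np.eye(d)              # assumed prior scale matrix
    beta = d + 1.0                 # assumed Wishart degrees-of-freedom setting
    # Data log-likelihood: sum_i log sum_c p_c N(x_i | mu_c, Sigma_c)
    dens = np.column_stack([w * multivariate_normal.pdf(X, m, S)
                            for w, m, S in zip(weights, means, covs)])
    loglik = np.sum(np.log(dens.sum(axis=1)))
    # Dirichlet prior on the weights (up to an additive constant)
    log_prior = np.sum((gamma - 1.0) * np.log(weights))
    # Wishart-style prior on each covariance (up to an additive constant)
    for S in covs:
        S_inv = np.linalg.inv(S)
        log_prior += 0.5 * beta * np.log(np.linalg.det(S_inv))
        log_prior -= 0.5 * alpha * np.trace(S_inv @ J)
    return loglik + log_prior

# Example: score a mixture fitted by plain (non-Bayesian) EM
from sklearn.mixture import GaussianMixture
X = np.random.randn(200, 2)
g = GaussianMixture(n_components=3, random_state=0).fit(X)
print(log_posterior(X, g.weights_, g.means_, g.covariances_))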
