2 Notation

• p is the dimension of the data.
• M is the space of models from which the mixture density models are drawn.
• F is the feature space, which may be a high-dimensional (but finite) space or, more generally, an infinite-dimensional Hilbert space.
• N is the number of data points x_i drawn from a p-dimensional space.
• M is the number of probabilistic models used in generating the kernel function.
• C is the number of mixture components in each probabilistic model. In principle one can use a different number of mixture components in each model; here we choose a fixed number for simplicity.
• x_i is a p × 1 real column vector that represents a data point sampled from a data set X.
• Φ(x) : R^p ↦ F is generally a nonlinear mapping to a high-dimensional, possibly infinite-dimensional, feature space F. This mapping operator may be defined explicitly or implicitly via a kernel function.
• K(x_i, x_j) = Φ(x_i)Φ^T(x_j) ∈ R is the kernel function that measures the similarity between data points x_i and x_j. If K is a Mercer kernel, it can be written as the outer product of the map Φ. As i and j sweep through the N data points, this generates an N × N kernel matrix.
• Θ is the entire set of parameters that specify a mixture model.

3 Mixture Models: A Sample from a Model Space

In this section, we briefly motivate the development of Mixture Density Kernels by showing that the combined effect of model misspecification and a finite data set can lead to high uncertainty in the estimate of a mixture model. We closely follow the arguments given in [15].

Suppose that a data set X is generated by drawing a finite sample from a mixture density function f(Λ^*, Θ^*), where Λ^* defines the true density function (say, a Gaussian mixture density) and is a sample from a large but finite class of models M, and Θ^* defines the true set of parameters of that density function.
In the case of the Gaussian mixture density, these parameters would be the means and covariance matrices of each Gaussian component, and the number of components that comprise the model. We can compute the probability of obtaining the correct model given the data as follows (see Smyth and Wolpert, 1998 for a detailed discussion). The posterior probability of the true density f^* ≡ f(Λ^*, Θ^*) given the data X is:

    P(f(Λ^*, Θ^*) | X) = ∫_M ∫_{R(M)} P(Λ, Θ | X) δ(f^* − P(Λ, Θ)) dΛ dΘ

where the first integral is taken over the model space M, the second integral is taken over the region R(M) of the parameter space appropriate for the model Λ, and δ is the Dirac delta function. Using Bayes' rule, it is possible to expand the posterior into a product of the posterior of the model uncertainty and the posterior of the parameter uncertainty. Thus, we have:

    P(f^* | X) = ∫_M ∫_{R(M)} P(Λ | X) P(Θ | Λ, X) δ(f^* − P(Λ, Θ)) dΛ dΘ
               = (1 / P(X)) ∫_M ∫_{R(M)} P(Θ | Λ, X) P(Λ, X) δ(f^* − P(Λ, Θ)) dΛ dΘ

The first equation above shows that there are two sources of variation in the estimation of the density function: the first is due to model misspecification, and the second is due to parameter uncertainty. The second equation shows that if prior information is available, it can be used to modify the likelihood of the data in order to obtain a better estimate of the true density function. The goal of this paper is to seek a representation of the posterior P(x_i | Θ) that reduces these errors by embedding x_i in a high-dimensional feature space that defines a kernel function.

4 Review of Kernel Functions

Mercer kernel functions can be viewed as a measure of the similarity between two data points embedded in a high-dimensional, possibly infinite-dimensional, feature space. For a finite sample of data X, the kernel function yields a symmetric N × N positive definite matrix whose (i, j) entry is the similarity between (x_i, x_j) as measured by the kernel function. Because of the positive definite property, such a Mercer kernel can be written as the outer product of the data in the feature space. Thus, if Φ(x_i) : R^p ↦ F is the (perhaps implicitly) defined embedding function, we have K(x_i, x_j) = Φ(x_i)Φ^T(x_j). Typical kernel functions include the Gaussian kernel, for which K(x_i, x_j) = Φ(x_i)Φ^T(x_j) = exp(−(1/(2σ^2)) ||x_i − x_j||^2), and the polynomial kernel K(x_i, x_j) = Φ(x_i)Φ^T(x_j) = ⟨x_i, x_j⟩^p. For supervised learning tasks, linear algorithms are used to define relationships between the target variable and the embedded features [4]. Work has also been done on using kernel methods for unsupervised learning tasks, such as clustering [7, 16] and density estimation [11].

5 Mixture Density Kernels

The idea of using probabilistic kernels was discussed by Haussler in 1999 [9], where he observes that if K(x_i, x_j) ≥ 0 ∀ (x_i, x_j) ∈ X × X, and Σ_{x_i} Σ_{x_j} K(x_i, x_j) = 1, then K is a probability distribution and is called a P-kernel. He further observed that the Gibbs kernel K(x_i, x_j) = P(x_i)P(x_j) is also an admissible kernel function.

Although these kernel functions either represent or are derived from probabilistic models, they may not measure similarity in a consistent way. For example, suppose that x_i is a low-probability point, so that P(x_i) ≈ 0. In the case of the Gibbs kernel, K(x_i, x_i) ≈ 0, even though the input vectors are identical. The Gram matrix generated by this kernel function would only show as similar those points that have very high probabilities. While this feature may be of value in some applications, it needs modification to work as a similarity measure.

Our idea is to use an ensemble of probabilistic mixture models as a similarity measure.
Two data points will have a larger similarity if multiple models agree that they should be placed in the same cluster or mode of the distribution. Points on which there is disagreement will be given intermediate similarity measures. The shapes of the underlying mixture distributions can significantly affect the similarity measurement of the two points. Experimental results uphold this intuition and show that in regions where there is "no question" about the membership of two points, the Mixture Density Kernel behaves identically to a standard mixture model. However, in regions of the input space where there is disagreement about the membership of two points, the behavior may be quite different from that of the standard model. Since each mixture density model in the ensemble can be encoded with domain knowledge by constructing informative priors, the Bagged Probabilistic Kernel will also encode domain knowledge. The Bagged Probabilistic Kernel is defined as follows:

    K(x_i, x_j) = Φ(x_i)Φ^T(x_j)
                = (1 / Z(x_i, x_j)) Σ_{m=1}^{M} Σ_{c_m=1}^{C_m} P_m(c_m | x_i) P_m(c_m | x_j)

The feature space is thus defined explicitly as follows:

    Φ(x_i) = [P_1(c = 1 | x_i), P_1(c = 2 | x_i), . . . , P_1(c = C | x_i), P_2(c = 1 | x_i), . . . , P_M(c = C | x_i)]

The first sum in the defining equation above sweeps through the M models in the ensemble, where each mixture model is a Maximum A Posteriori estimator of the underlying density trained by sampling (with replacement) the original data. We will discuss how to design these estimators in the next section. C_m is the number of mixture components in the m-th model, and c_m is the cluster (or mode) label assigned by that model. The quantity Z(x_i, x_j) is a normalization such that K(x_i, x_i) = 1 for all i. The fact that the Mixture Density Kernel is a valid kernel function arises directly from the definition.
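A minimal sketch of this construction, using scikit-learn's GaussianMixture as the base density estimator (the ensemble size, component count, toy data, and the specific form of the normalization are illustrative assumptions, not prescribed by the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_density_kernel(X, M=5, C=2, seed=0):
    """Mixture Density Kernel: (1/Z) * sum_m sum_c P_m(c|x_i) P_m(c|x_j),
    with each of the M mixture models fit to a bootstrap resample of X."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Explicit feature map: Phi[i] = [P_1(c=1|x_i), ..., P_M(c=C|x_i)]
    Phi = np.zeros((N, M * C))
    for m in range(M):
        boot = rng.integers(0, N, size=N)  # sample with replacement
        gmm = GaussianMixture(n_components=C, random_state=m).fit(X[boot])
        Phi[:, m * C:(m + 1) * C] = gmm.predict_proba(X)  # posteriors P_m(c|x)
    K = Phi @ Phi.T
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)  # normalization Z(x_i, x_j) so that K(x_i, x_i) = 1

# Two well-separated clusters: pairs within a cluster should receive high similarity.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
K = mixture_density_kernel(X)
```

Dividing by sqrt(K_ii · K_jj) is one concrete realization of Z(x_i, x_j): it guarantees unit self-similarity regardless of how sharply each model's posterior concentrates on a single component.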
In order to prove that K is a valid kernel function, we need to show that it is symmetric and that the kernel matrix is positive semi-definite [4]. The kernel is clearly symmetric, since K(x_i, x_j) = K(x_j, x_i) for all values of i and j. The proof that K is positive semi-definite is straightforward and arises from the fact that Φ can be explicitly known. For nonzero α, writing K for the N × N kernel matrix and β = Φ^T α:

(5.2)    α^T K α = α^T Φ Φ^T α = β^T β ≥ 0

The Mixture Density Kernel function can be interpreted as follows. Suppose that we have a hard classification strategy, where each data point is assigned to the most likely posterior class distribution. In this case the kernel function counts the number of times the M mixtures agree that two points should be placed in the same cluster mode. In the soft classification strategy, two data points are given an intermediate level of similarity, which will be less than or equal to the case where all models agree on their membership. Further interpretation of the kernel function is possible by applying Bayes' rule to the defining equation of the Mixture Density Kernel (dropping the normalization Z(x_i, x_j) for clarity). Thus, we have:

(5.3)    K(x_i, x_j) = Σ_{m=1}^{M} Σ_{c_m=1}^{C_m} [P_m(x_i | c_m) P_m(c_m) / P_m(x_i)] × [P_m(x_j | c_m) P_m(c_m) / P_m(x_j)]
                     = Σ_{m=1}^{M} Σ_{c_m=1}^{C_m} P_m(x_i, x_j | c_m) P_m^2(c_m) / P_m(x_i, x_j)

where the second equality treats x_i and x_j as conditionally independent given the component c_m, and as independent under model m.
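The hard-classification reading can be checked directly: replacing each model's posterior with a one-hot cluster assignment makes K(x_i, x_j) the fraction of the M models that agree on the pair's membership, and the resulting matrix remains symmetric and positive semi-definite. A small numerical sketch (the labels are synthetic, for illustration only):

```python
import numpy as np

# Synthetic hard assignments: cluster label of each of N = 6 points under M = 4 models.
M, C, N = 4, 3, 6
rng = np.random.default_rng(0)
labels = rng.integers(0, C, size=(M, N))

# One-hot feature map: Phi[i] stacks the M indicator vectors for point i,
# scaled by 1/sqrt(M) so that K(x_i, x_i) = 1 without a separate normalization.
Phi = np.zeros((N, M * C))
for m in range(M):
    Phi[np.arange(N), m * C + labels[m]] = 1.0 / np.sqrt(M)

K = Phi @ Phi.T  # K[i, j] = fraction of the M models agreeing on (x_i, x_j)

assert np.allclose(np.diag(K), 1.0)          # every model agrees with itself
assert np.allclose(K, K.T)                   # symmetric, as required
assert np.linalg.eigvalsh(K).min() > -1e-10  # positive semi-definite (Eq. 5.2)
```

Because K factors as ΦΦ^T with an explicit finite-dimensional Φ, the positive semi-definiteness asserted in (5.2) holds by construction, for hard and soft assignments alike.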