Mixture Density Mercer Kernels - Intelligent Systems Division - NASA


tion yields a symmetric N × N positive definite matrix, where the (i, j) entry corresponds to the similarity between (x_i, x_j) as measured by the kernel function. Because of the positive definite property, such a Mercer Kernel can be written as the outer product of the data in the feature space. Thus, if Φ(x_i) : R^d → F is the (perhaps implicitly) defined embedding function, we have K(x_i, x_j) = Φ(x_i)Φ^T(x_j). Typical kernel functions include the Gaussian kernel, for which K(x_i, x_j) = Φ(x_i)Φ^T(x_j) = exp(−||x_i − x_j||² / 2σ²), and the polynomial kernel K(x_i, x_j) = Φ(x_i)Φ^T(x_j) = ⟨x_i, x_j⟩^p. For supervised learning tasks, linear algorithms are used to define relationships between the target variable and the embedded features [4]. Work has also been done on using kernel methods for unsupervised learning tasks, such as clustering [7, 16] and density estimation [11].

5 Mixture Density Kernels

The idea of using probabilistic kernels was discussed by Haussler in 1999 [9], where he observes that if K(x_i, x_j) ≥ 0 for all (x_i, x_j) ∈ X × X and Σ_{x_i} Σ_{x_j} K(x_i, x_j) = 1, then K is a probability distribution and is called a P-Kernel. He further observed that the Gibbs kernel K(x_i, x_j) = P(x_i)P(x_j) is also an admissible kernel function.

Although these kernel functions either represent or are derived from probabilistic models, they may not measure similarity in a consistent way. For example, suppose that x_i is a low-probability point, so that P(x_i) ≈ 0. In the case of the Gibbs kernel, K(x_i, x_i) ≈ 0, although the input vectors are identical. The Gram matrix generated by this kernel function would only show those points as being similar which have very high probabilities. While this feature may be of value in some applications, it needs modification to work as a similarity measure.

Our idea is to use an ensemble of probabilistic mixture models as a similarity measure. Two data points will have a larger similarity if multiple models agree that they should be placed in the same cluster or mode of the distribution. Those points where there is disagreement will be given intermediate similarity measures. The shapes of the underlying mixture distributions can significantly affect the similarity measurement of the two points. Experimental results uphold this intuition and show that in regions where there is "no question" about the membership of two points, the Mixture Density Kernel behaves identically to a standard mixture model. However, in regions of the input space where there is disagreement about the membership of two points, the behavior may be quite different from that of the standard model. Since each mixture density model in the ensemble can be encoded with domain knowledge by constructing informative priors, the Bagged Probabilistic Kernel will also encode domain knowledge. The Bagged Probabilistic Kernel is defined as follows:

K(x_i, x_j) = \Phi(x_i)\Phi^T(x_j) = \frac{1}{Z(x_i, x_j)} \sum_{m=1}^{M} \sum_{c_m=1}^{C_m} P_m(c_m \mid x_i) \, P_m(c_m \mid x_j)

The feature space is thus defined explicitly as follows:

\Phi(x_i) = [P_1(c=1 \mid x_i), P_1(c=2 \mid x_i), \ldots, P_1(c=C \mid x_i), P_2(c=1 \mid x_i), \ldots, P_M(c=C \mid x_i)]

The first sum in the defining equation above sweeps through the M models in the ensemble, where each mixture model is a Maximum A Posteriori estimator of the underlying density trained by sampling (with replacement) from the original data. We will discuss how to design these estimators in the next section.
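As a concrete illustration of this explicit feature map, the short Python sketch below stacks the posterior membership probabilities of each bootstrap-trained mixture model into Φ and forms the unnormalized kernel as Φ Φ^T. It is only a sketch: scikit-learn's GaussianMixture is used as a stand-in for the MAP mixture estimators described in the next section, and the function names, component counts, and toy data are illustrative assumptions rather than part of the original method.

import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_density_features(X, n_models=10, n_components=3, seed=0):
    # Phi(x) = [P_1(c=1|x), ..., P_1(c=C|x), P_2(c=1|x), ..., P_M(c=C|x)].
    # Each model is fit to a bootstrap resample (sampling with replacement)
    # of the data; GaussianMixture is a stand-in for the MAP estimators.
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    blocks = []
    for m in range(n_models):
        idx = rng.choice(n, size=n, replace=True)          # resample with replacement
        gmm = GaussianMixture(n_components=n_components,
                              random_state=m).fit(X[idx])
        blocks.append(gmm.predict_proba(X))                # P_m(c_m | x) for every x
    return np.hstack(blocks)                               # shape (N, M * C)

# Unnormalized kernel: K_ij = sum_m sum_c P_m(c|x_i) P_m(c|x_j)
X = np.random.randn(200, 2)                                # toy data for illustration
Phi = mixture_density_features(X)
K_unnormalized = Phi @ Phi.T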
C_m defines the number of mixtures in the m-th model of the ensemble, and c_m is the cluster (or mode) label assigned by that model. The quantity Z(x_i, x_j) is a normalization such that K(x_i, x_i) = 1 for all i. The fact that the Mixture Density Kernel is a valid kernel function arises directly from the definition. In order to prove that K is a valid kernel function, we need to show that it is symmetric and that the kernel matrix is positive semi-definite [4]. The kernel is clearly symmetric, since K(x_i, x_j) = K(x_j, x_i) for all values of i and j. The proof that K is positive semi-definite is straightforward and arises from the fact that Φ can be explicitly known. For nonzero α, writing β = Φ^T α:

(5.2)  \alpha^T K(x_i, x_j) \alpha = \alpha^T \Phi(x_i) \Phi^T(x_j) \alpha = \beta^T \beta \geq 0

The Mixture Density Kernel function can be interpreted as follows. Suppose that we have a hard classification strategy, where each data point is assigned to the most likely posterior class distribution. In this case the kernel function counts the number of times the M mixtures agree that two points should be placed in the same cluster mode. In the soft classification strategy, two data points are given an intermediate level of similarity, which will be less than or equal to the case where all models agree on their membership. Further interpretation of the kernel function is possible by applying Bayes' rule to the defining equation of the Mixture Density Kernel. Thus, we have:

(5.3)  K(x_i, x_j) = \sum_{m=1}^{M} \sum_{c_m=1}^{C_m} \frac{P_m(x_i \mid c_m) P_m(c_m)}{P_m(x_i)} \times \frac{P_m(x_j \mid c_m) P_m(c_m)}{P_m(x_j)} = \sum_{m=1}^{M} \sum_{c_m=1}^{C_m} \frac{P_m(x_i, x_j \mid c_m) \, P_m^2(c_m)}{P_m(x_i, x_j)}
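Continuing the illustrative sketch above (same assumptions: Python, with scikit-learn's GaussianMixture standing in for the MAP estimators), the fragment below applies one natural choice of Z(x_i, x_j), dividing by the square root of the product of the unnormalized diagonal entries so that K(x_i, x_i) = 1; the text only requires the unit diagonal, so this particular form is an assumption. It also computes the hard-classification reading of the kernel, namely the fraction of the M models that place two points in the same cluster.

import numpy as np
from sklearn.mixture import GaussianMixture

def normalize_kernel(K_unnormalized):
    # One way to realize Z(x_i, x_j): divide by sqrt(K_ii * K_jj), which
    # enforces K(x_i, x_i) = 1 for all i (an assumed form; the text only
    # states the unit-diagonal requirement).
    d = np.sqrt(np.diag(K_unnormalized))
    return K_unnormalized / np.outer(d, d)

def hard_agreement_kernel(X, n_models=10, n_components=3, seed=0):
    # Hard-classification interpretation: each model assigns every point to
    # its most likely cluster, and K(x_i, x_j) is the fraction of the M
    # models that place x_i and x_j in the same cluster (mode).
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    agree = np.zeros((n, n))
    for m in range(n_models):
        idx = rng.choice(n, size=n, replace=True)
        labels = GaussianMixture(n_components=n_components,
                                 random_state=m).fit(X[idx]).predict(X)
        agree += labels[:, None] == labels[None, :]        # 1 where this model agrees
    return agree / n_models

# Example usage with Phi from the previous sketch:
#   K = normalize_kernel(Phi @ Phi.T)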
