Algorithms for Gaussian Bandwidth Selection in Kernel Density ...
evaluated on the point left out. The model evaluated on each training sample has the form:

$$\hat{p}_\theta(x_i) = \frac{1}{N-1}\sum_{\substack{j=1 \\ j\neq i}}^{N} G(x_i - x_j\,|\,\theta) \qquad (2)$$

where we make explicit the use of a Gaussian kernel. This framework was first proposed in [4] and later studied by other authors [2], [5]. However, these studies lack a closed optimization procedure, so that the bandwidth $\sigma^2$ is obtained by a greedy tuning along its possible values. Besides, the multivariate case is only considered in these previous works under a spherical kernel assumption. In this paper, we propose two algorithms that overcome these difficulties.

In a multidimensional Gaussian kernel, the set of parameters consists of the covariance matrix of the Gaussian. In the following, we consider two different degrees of complexity assumed for this matrix: a spherical shape, so that $C = \sigma^2 I_D$ (only one parameter to adjust), and an unconstrained kernel, in which a general form is considered for $C$, with $D(D+1)/2$ parameters.

Sections 2 and 3 describe the bandwidth optimization for both cases, as presented in [6], and establish their convergence conditions. Some classification experiments are presented in Section 4 to measure the accuracy of the models. Section 5 closes the paper with the most important conclusions.

2 The spherical case

The expression for the kernel function is, for the spherical case:

$$G_{ij}(\sigma^2) = G(x_i - x_j\,|\,\sigma^2) = (2\pi)^{-D/2}\sigma^{-D}\exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$$

We want to find the $\sigma$ that maximizes the log-likelihood $\log L(X|\sigma^2) = \sum_i \log \hat{p}_\theta(x_i)$. The derivative of this likelihood is:

$$\nabla_\sigma \log L(X|\sigma^2) = \frac{1}{N-1}\sum_i \frac{1}{\hat{p}(x_i)}\sum_{j\neq i}\left(\frac{\|x_i - x_j\|^2}{\sigma^3} - \frac{D}{\sigma}\right)G_{ij}(\sigma^2)$$

We now search for the point that makes the derivative null:

$$\sum_i \frac{1}{\hat{p}(x_i)}\sum_{j\neq i}\frac{\|x_i - x_j\|^2}{\sigma^3}\,G_{ij}(\sigma^2) = \sum_i \frac{1}{\hat{p}(x_i)}\,\frac{D}{\sigma}\sum_{j\neq i}G_{ij}(\sigma^2) = \frac{N(N-1)D}{\sigma}$$

The second equality follows from the fact that, by definition, $\sum_{j\neq i} G_{ij} = (N-1)\,\hat{p}(x_i)$. Then we obtain the following fixed-point algorithm:

$$\sigma^2_{t+1} = \frac{1}{N(N-1)D}\sum_i \frac{1}{\hat{p}_t(x_i)}\sum_{j\neq i}\|x_i - x_j\|^2\,G_{ij}(\sigma^2_t) \qquad (3)$$

where $\hat{p}_t$ denotes the KDE obtained in iteration $t$, i.e. the one that makes use of the width $\sigma^2_t$.
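To make the iteration concrete, here is a minimal NumPy sketch of (3) (ours, not code from the paper; the function name, the initialization and the stopping rule are assumptions). The normalizing constant $(2\pi\sigma^2)^{-D/2}$ cancels between the kernel sums and the leave-one-out density $\hat{p}_t(x_i)$, so only the exponential part of the kernel is needed.

```python
import numpy as np

def spherical_bandwidth(X, n_iter=100, tol=1e-8):
    """Fixed-point iteration (3) for the spherical kernel width sigma^2.

    X is an (N, D) data matrix.  A minimal sketch: the initialization and
    the stopping rule are our own choices, not taken from the paper.
    """
    N, D = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    sigma2 = d2[d2 > 0].mean() / D                         # heuristic starting point (assumption)
    for _ in range(n_iter):
        K = np.exp(-d2 / (2.0 * sigma2))                   # kernel, up to a constant factor
        np.fill_diagonal(K, 0.0)                           # leave-one-out: drop j = i
        p_hat = K.sum(axis=1)                              # proportional to (N-1) * p_t(x_i)
        sigma2_new = ((d2 * K).sum(axis=1) / p_hat).sum() / (N * D)
        if abs(sigma2_new - sigma2) <= tol * sigma2:
            return sigma2_new
        sigma2 = sigma2_new
    return sigma2
```

Each iteration costs a single N x N kernel evaluation; the update is exactly the right-hand side of (3) with the constant factors cancelled.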


We prove the convergence of the algorithm in (3) by means of the following convergence theorem.

Theorem 1 There is a fixed point in the interval $\left(\frac{d^2_{NN}}{D}, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$, being $d^2_{NN}$ the mean quadratic distance to the nearest neighbor and $\Sigma_x$ the covariance matrix of $x$. Besides, the fixed point is unique and the algorithm converges to it in the mentioned interval if the following condition holds:

$$\frac{1}{2\sigma^4 N(N-1)^2 D}\sum_i \frac{1}{\hat{p}(x_i)^2}\sum_{j\neq i}\sum_{k\neq i,j}\left(d^2_{ij} - d^2_{ik}\right)^2 \exp\left(-\frac{d^2_{ij} + d^2_{ik}}{2\sigma^2}\right) < 1 \qquad (4)$$

where $d_{ij} = \|x_i - x_j\|$.

Proof 1 Let $\sigma^2 = g(\sigma^2)$ be the fixed-point equation given by (3), whose fixed point is to be obtained. The proof of the fixed-point existence is based on the search of an interval $(a, b)$ such that $a < g(\sigma^2) < b$ if $\sigma^2 \in (a, b)$. In order to demonstrate that the interval $\left(\frac{d^2_{NN}}{D}, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$ holds that condition, we need to prove these three facts:

1. $g\!\left(\frac{d^2_{NN}}{D}\right) > \frac{d^2_{NN}}{D}$
2. $g\!\left(\frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right) < \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}$
3. $g(\sigma^2)$ is monotonically increasing.
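As a practical aside (ours, not part of the original proof), the two endpoints of the interval in Theorem 1 are cheap to compute for a given sample and give a bracket in which to initialize or monitor iteration (3); the sketch below assumes the same NumPy setting as the earlier one.

```python
import numpy as np

def convergence_interval(X):
    """Endpoints (d2_NN / D, 2 tr(Sigma_x) / D) of the interval in Theorem 1.

    d2_NN is the mean squared distance to the nearest neighbour and Sigma_x
    the sample covariance of X; for illustration only.
    """
    D = X.shape[1]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # ignore the zero self-distances
    d2_nn = d2.min(axis=1).mean()            # mean squared nearest-neighbour distance
    tr_sigma = np.trace(np.cov(X, rowvar=False))
    return d2_nn / D, 2.0 * tr_sigma / D
```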


We obta<strong>in</strong>:2E{x T x} − 2µ x T µ x = 2 tr{Σ x }limσ 2 →∞ g(σ2 ) = 2 tr{Σ x}DThe second po<strong>in</strong>t is then proved s<strong>in</strong>ce the maximum value of g(σ 2 ) is reached at the<strong>in</strong>f<strong>in</strong>ite.To demonstrate the last po<strong>in</strong>t, we compute the derivative of g ′ (σ 2 ) and check outthat it is positive:dg(σ 2 ) 1 ∑ ∑dσ 2 =2σ 4 ND=ij≠i12σ 4 N(N − 1) 2 Dd 2 ij∑k (d2 ij − d2 ik ) exp(− d2 ij +d2 ik2σ 2 )∑i( ∑ l≠i exp(− d2 il2σ)) 2 21 ∑ ∑ˆp l (x i ) 2 (d 2 ij − d 2 ik) 2 exp(− d2 ij + d2 ik2σ 2 ) ≥ 0j≠i k≠i,jThe existence of a unique fixed po<strong>in</strong>t is then proved. To demonstrate the convergenceof the algorithm <strong>in</strong> such <strong>in</strong>terval, we need to check out the condition |g ′ (σ 2 )| < 1[7]. In that case, we are guaranteed that only a cross<strong>in</strong>g po<strong>in</strong>t between g(σ 2 ) andthe l<strong>in</strong>e g(σ 2 ) = σ 2 exists. The convergence condition (4) means that the value of(6) is lesser than 1.3 The unconstra<strong>in</strong>ed caseThe general expression <strong>for</strong> a <strong>Gaussian</strong> kernel is:(G ij (C) = |2πC| −1/2 exp − 1 )2 (x i − x j ) T C −1 (x i − x j )(6)and its derivative w.r.t. C:∇ C G ij (C) = 1 2(C −1 (x i − x j )(x i − x j ) T − I ) C −1 G ij (C)As <strong>in</strong> the previous cases, we take the derivative of the log-likelihood and make itequal to zero:∑ 1 1 ∑ 1ˆp(xi i ) N − 1 2 C−1 (x i −x j )(x i −x j ) T C −1 G ij = ∑ ij≠i1 1 ∑ 1ˆp(x i ) N − 1 2 C−1 G ijj≠iBy multiply<strong>in</strong>g both members by C, both at the right and the left, we obta<strong>in</strong>:∑ 1 ∑i − x j )(x i − x j )ˆp(xi i )j≠i(x T G ij = C ∑ 1 ∑G ijˆp(xi i )j≠iAfter some simplifications as <strong>in</strong> the spherical case, we reach the follow<strong>in</strong>g fixed-po<strong>in</strong>talgorithm:1 ∑ 1 ∑C t+1 =(x i − x j )(x i − x j ) T G ij (C t ) (7)N(N − 1) ˆpi t (x i )j≠i


The expression in (7) suggests a relationship with the Expectation-Maximization (EM) result for Gaussian Mixture Models (GMM). A GMM is a PDF estimator given by the expression $\hat{p}(x) = \sum_{k=1}^{K}\alpha_k\, G(x|\mu_k, C_k)$. The weights of the $K$ components of the mixture are given by the $\alpha_k$, and each Gaussian is characterized by its mean vector $\mu_k$ and its covariance matrix $C_k$. The solution provided by the EM algorithm consists of an iterative procedure where the parameters at step $t$ are obtained from the ones at step $t-1$. To do so, a matrix of auxiliary variables is used, $r_{ki} = p(k|x_i)$, expressing the probability that the sample belongs to the $k$-th component of the mixture. These probabilities must hold $\sum_k r_{ki} = 1$. The EM solution establishes the following updating rule for the covariance matrix at step $t$:

$$C^t = \frac{\sum_k\sum_i r^t_{ki}\,(x_i - \mu^t_k)(x_i - \mu^t_k)^T}{N} \qquad (8)$$

where the $r^t_{ki}$ and $\mu^t_k$ are also iteratively updated. Note that our KDE model can be considered as a special case of GMM where i) there are as many components as samples ($K = N$), with the same weights $\alpha_k = 1/N$; ii) the mean vectors are fixed: $\mu_k = x_k$; iii) the covariance matrix is the same for each of the components; and iv) $r_{ki} = 0$ if $k = i$ and $r_{ki} = 1/(N-1)$ if $k \neq i$. With these particularizations, the updating rule in (8) becomes equal to the one given by the iteration in (7).

The EM algorithm guarantees the monotonic increase of the likelihood and thus its convergence to a local maximum, as proved in the literature [8]. The algorithm given in (7) is subject to the same conditions, so that its convergence is also proved. However, in situations in which $N \approx D$, empirical covariance matrices are close to singularity, so that numerical problems may arise, as in GMM design.

4 Application to Parzen classification

We have tested the performance of the obtained models on a set of public classification problems from [9]. For doing so, we apply the Parzen classifier, which performs the simple Bayes criterion:

$$\hat{y} = \arg\max_l\ \hat{p}_{\theta_l}(x|c_l)$$

with per-class spherical (S-KDE) and unconstrained (U-KDE) models $\hat{p}_{\theta_l}(x|c_l)$ optimized according to the proposed method, being $c_l$ each of the $L$ classes considered. We have compared these results with the ones obtained by other classification methods such as K-Nearest-Neighbors (KNN, with K = 1) and the one-versus-the-rest Support Vector Machine (SVM) with RBF kernel. The results are shown in Table 1.
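For illustration, the decision rule above might be coded as follows (a sketch of ours, not the experimental code used for Table 1; it reuses the spherical_bandwidth helper from the earlier sketch and assumes uniform class priors).

```python
import numpy as np

def parzen_classify(X_train, y_train, X_test):
    """Parzen/Bayes rule: assign each test point to the class whose KDE is largest.

    Uses a spherical kernel per class, with sigma^2 chosen by the fixed-point
    iteration sketched earlier.  A minimal illustration only; class priors are
    taken as uniform, which may differ from the exact experimental setup.
    """
    classes = np.unique(y_train)
    log_densities = []
    for c in classes:
        Xc = X_train[y_train == c]
        Nc, D = Xc.shape
        sigma2 = spherical_bandwidth(Xc)                    # per-class bandwidth
        d2 = ((X_test[:, None, :] - Xc[None, :, :]) ** 2).sum(-1)
        dens = np.exp(-d2 / (2.0 * sigma2)).sum(axis=1)
        dens *= (2.0 * np.pi * sigma2) ** (-D / 2) / Nc     # normalized class-conditional KDE
        log_densities.append(np.log(dens + 1e-300))         # guard against zero density
    return classes[np.argmax(np.column_stack(log_densities), axis=1)]
```

The U-KDE variant follows the same pattern, with the unconstrained covariance and the corresponding full-covariance kernel in place of the spherical one.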


Data        Train   Test   L    D    S-KDE   U-KDE   KNN     SVM
Pima          738      -   2    8    71.22   75.13   73.18   76.47
Wine          178      -   3   13    75.84   99.44   76.97   100
Landsat      4435   2000   6   36    89.45   86.10   90.60   90.90
Optdigits    3823   1797  10   64    97.89   93.54   94.38   98.22
Letter      16000   4000  26   16    95.23   92.77   95.20   97.55

Table 1: Classification accuracy (%) on some public datasets. Leave-one-out accuracy is provided when there are no test data.

The most remarkable conclusion from these results is that either S-KDE or U-KDE provides, in each case, a classification performance close to the SVM's. The comparison between S-KDE and U-KDE performance is closely related, in each case, to the dimension of the data relative to the number of samples. In the datasets with higher dimensionality, the performance of S-KDE is higher due to its lower risk of overfitting. Parzen classifiers have not enjoyed the popularity of other methods, mainly due to the difficulty of obtaining a reliable bandwidth for the kernel. However, in this experiment we have shown that the bandwidth chosen by our algorithm provides a classification performance close to that of a state-of-the-art classifier such as the SVM.

5 Conclusions

We have presented two algorithms for the optimization of the likelihood in the bandwidth selection problem for KDE models. Unlike previous results in the literature, the methods tackle the multivariate case in a natural way, for which we provide solutions based on both spherical and complete (unconstrained) Gaussian kernels. The convergence conditions have been described for both algorithms. By a set of experiments, we have shown that the models obtained are accurate enough to provide good classification results. This demonstrates that the model does not overfit the data, even in problems involving a high number of variables.

References

[1] M. C. Jones, J. S. Marron, and S. J. Sheather, "A brief survey of bandwidth selection for density estimation," Journal of the American Statistical Association, vol. 91, no. 433, pp. 401-407, 1996.

[2] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, 1986.

[3] A. Bowman, "An alternative method of cross-validation for the smoothing of density estimates," Biometrika, vol. 71, pp. 353-360, 1984.

[4] R. Duin, "On the choice of smoothing parameters for Parzen estimators of probability density functions," IEEE Trans. on Computers, vol. 25, no. 11, 1976.

[5] P. Hall, "Cross-validation in density estimation," Biometrika, vol. 69, no. 2, pp. 383-390, 1982.

[6] J. M. Leiva-Murillo and A. Artés-Rodríguez, "A fixed-point algorithm for finding the optimal covariance matrix in kernel density modeling," in International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 2006.

[7] R. Fletcher, Practical Methods of Optimization (2nd Edition), John Wiley & Sons, New York, 1995.

[8] G. McLachlan, The EM Algorithm and Extensions, John Wiley & Sons, New York, 1997.

[9] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Tech. Rep., Univ. of California, Dept. ICS, 1998.
