Algorithms for Gaussian Bandwidth Selection in Kernel Density ...
evaluated on the point left out. The model evaluated on each training sample has the form:

$$\hat{p}_\theta(x_i) = \frac{1}{N-1}\sum_{\substack{j=1 \\ j\neq i}}^{N} G(x_i - x_j\,|\,\theta) \qquad (2)$$

where we make explicit the use of a Gaussian kernel. This framework was first proposed in [4] and later studied by other authors [2], [5]. However, these studies lack a closed optimization procedure, so that the bandwidth $\sigma^2$ is obtained by a greedy tuning along its possible values. Besides, the multivariate case is only considered in these previous works under a spherical kernel assumption. In this paper, we propose two algorithms that overcome these difficulties.

In a multidimensional Gaussian kernel, the set of parameters consists of the covariance matrix of the Gaussian. In the following, we consider two different degrees of complexity assumed for this matrix: a spherical shape, so that $C = \sigma^2 I_D$ (only one parameter to adjust), and an unconstrained kernel, in which a general form is considered for $C$, with $D(D+1)/2$ parameters.

Sections 2 and 3 describe the bandwidth optimization for both cases, as presented in [6], and establish their convergence conditions. Some classification experiments are presented in Section 4 to measure the accuracy of the models. Section 5 closes the paper with the most important conclusions.

2 The spherical case

The expression for the kernel function is, for the spherical case:

$$G_{ij}(\sigma^2) = G(x_i - x_j\,|\,\sigma^2) = (2\pi)^{-D/2}\sigma^{-D}\exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$$

We want to find the $\sigma$ that maximizes the log-likelihood $\log L(X|\sigma^2) = \sum_i \log \hat{p}_\theta(x_i)$. The derivative of this likelihood is:

$$\nabla_\sigma \log L(X|\sigma^2) = \frac{1}{N-1}\sum_i \frac{1}{\hat{p}(x_i)}\sum_{j\neq i}\left(\frac{\|x_i - x_j\|^2}{\sigma^3} - \frac{D}{\sigma}\right)G_{ij}(\sigma^2)$$

We now search for the point that makes the derivative null:

$$\sum_i \frac{1}{\hat{p}(x_i)}\sum_{j\neq i}\frac{\|x_i - x_j\|^2}{\sigma^3}\,G_{ij}(\sigma^2) = \sum_i \frac{1}{\hat{p}(x_i)}\,\frac{D}{\sigma}\sum_{j\neq i}G_{ij}(\sigma^2) = \frac{N(N-1)D}{\sigma}$$

The second equality follows from the fact that, by definition, $\sum_{j\neq i} G_{ij} = (N-1)\,\hat{p}(x_i)$. Then we obtain the following fixed-point algorithm:

$$\sigma^2_{t+1} = \frac{1}{N(N-1)D}\sum_i \frac{1}{\hat{p}_t(x_i)}\sum_{j\neq i}\|x_i - x_j\|^2\,G_{ij}(\sigma^2_t) \qquad (3)$$

where $\hat{p}_t$ denotes the KDE obtained in iteration $t$, i.e. the one that makes use of the width $\sigma^2_t$.
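To make the iteration concrete, here is a minimal NumPy sketch of (3) (ours, not code from the paper; the function name, the initialization and the stopping rule are assumptions). The normalizing constant $(2\pi\sigma^2)^{-D/2}$ cancels between the kernel sums and the leave-one-out density $\hat{p}_t(x_i)$, so only the exponential part of the kernel is needed.

```python
import numpy as np

def spherical_bandwidth(X, n_iter=100, tol=1e-8):
    """Fixed-point iteration (3) for the spherical kernel width sigma^2.

    X is an (N, D) data matrix.  A minimal sketch: the initialization and
    the stopping rule are our own choices, not taken from the paper.
    """
    N, D = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    sigma2 = d2[d2 > 0].mean() / D                         # heuristic starting point (assumption)
    for _ in range(n_iter):
        K = np.exp(-d2 / (2.0 * sigma2))                   # kernel, up to a constant factor
        np.fill_diagonal(K, 0.0)                           # leave-one-out: drop j = i
        p_hat = K.sum(axis=1)                              # proportional to (N-1) * p_t(x_i)
        sigma2_new = ((d2 * K).sum(axis=1) / p_hat).sum() / (N * D)
        if abs(sigma2_new - sigma2) <= tol * sigma2:
            return sigma2_new
        sigma2 = sigma2_new
    return sigma2
```

Each iteration costs a single N x N kernel evaluation; the update is exactly the right-hand side of (3) with the constant factors cancelled.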


We prove the convergence of the algorithm in (3) by means of the following convergence theorem.

Theorem 1 There is a fixed point in the interval $\left(\frac{d^2_{NN}}{D}, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$, being $d^2_{NN}$ the mean quadratic distance to the nearest neighbor and $\Sigma_x$ the covariance matrix of $x$. Besides, the fixed point is unique and the algorithm converges to it in the mentioned interval if the following condition holds:

$$\frac{1}{2\sigma^4 N(N-1)^2 D}\sum_i \frac{1}{\hat{p}(x_i)^2}\sum_{j\neq i}\sum_{k\neq i,j}\left(d^2_{ij} - d^2_{ik}\right)^2 \exp\left(-\frac{d^2_{ij} + d^2_{ik}}{2\sigma^2}\right) < 1 \qquad (4)$$

where $d_{ij} = \|x_i - x_j\|$.

Proof 1 Let $\sigma^2 = g(\sigma^2)$ be the fixed-point equation given by (3), whose fixed point is to be obtained. The proof of the fixed-point existence is based on the search of an interval $(a, b)$ such that $a < g(\sigma^2) < b$ if $\sigma^2 \in (a, b)$. In order to demonstrate that the interval $\left(\frac{d^2_{NN}}{D}, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$ holds that condition, we need to prove these three facts:

1. $g\!\left(\frac{d^2_{NN}}{D}\right) > \frac{d^2_{NN}}{D}$
2. $g\!\left(\frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right) < \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}$
3. $g(\sigma^2)$ is monotonically increasing.
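As a practical aside (ours, not part of the original proof), the two endpoints of the interval in Theorem 1 are cheap to compute for a given sample and give a bracket in which to initialize or monitor iteration (3); the sketch below assumes the same NumPy setting as the earlier one.

```python
import numpy as np

def convergence_interval(X):
    """Endpoints (d2_NN / D, 2 tr(Sigma_x) / D) of the interval in Theorem 1.

    d2_NN is the mean squared distance to the nearest neighbour and Sigma_x
    the sample covariance of X; for illustration only.
    """
    D = X.shape[1]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # ignore the zero self-distances
    d2_nn = d2.min(axis=1).mean()            # mean squared nearest-neighbour distance
    tr_sigma = np.trace(np.cov(X, rowvar=False))
    return d2_nn / D, 2.0 * tr_sigma / D
```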


We obta<strong>in</strong>:2E{x T x} − 2µ x T µ x = 2 tr{Σ x }limσ 2 →∞ g(σ2 ) = 2 tr{Σ x}DThe second po<strong>in</strong>t is then proved s<strong>in</strong>ce the maximum value of g(σ 2 ) is reached at the<strong>in</strong>f<strong>in</strong>ite.To demonstrate the last po<strong>in</strong>t, we compute the derivative of g ′ (σ 2 ) and check outthat it is positive:dg(σ 2 ) 1 ∑ ∑dσ 2 =2σ 4 ND=ij≠i12σ 4 N(N − 1) 2 Dd 2 ij∑k (d2 ij − d2 ik ) exp(− d2 ij +d2 ik2σ 2 )∑i( ∑ l≠i exp(− d2 il2σ)) 2 21 ∑ ∑ˆp l (x i ) 2 (d 2 ij − d 2 ik) 2 exp(− d2 ij + d2 ik2σ 2 ) ≥ 0j≠i k≠i,jThe existence of a unique fixed po<strong>in</strong>t is then proved. To demonstrate the convergenceof the algorithm <strong>in</strong> such <strong>in</strong>terval, we need to check out the condition |g ′ (σ 2 )| < 1[7]. In that case, we are guaranteed that only a cross<strong>in</strong>g po<strong>in</strong>t between g(σ 2 ) andthe l<strong>in</strong>e g(σ 2 ) = σ 2 exists. The convergence condition (4) means that the value of(6) is lesser than 1.3 The unconstra<strong>in</strong>ed caseThe general expression <strong>for</strong> a <strong>Gaussian</strong> kernel is:(G ij (C) = |2πC| −1/2 exp − 1 )2 (x i − x j ) T C −1 (x i − x j )(6)and its derivative w.r.t. C:∇ C G ij (C) = 1 2(C −1 (x i − x j )(x i − x j ) T − I ) C −1 G ij (C)As <strong>in</strong> the previous cases, we take the derivative of the log-likelihood and make itequal to zero:∑ 1 1 ∑ 1ˆp(xi i ) N − 1 2 C−1 (x i −x j )(x i −x j ) T C −1 G ij = ∑ ij≠i1 1 ∑ 1ˆp(x i ) N − 1 2 C−1 G ijj≠iBy multiply<strong>in</strong>g both members by C, both at the right and the left, we obta<strong>in</strong>:∑ 1 ∑i − x j )(x i − x j )ˆp(xi i )j≠i(x T G ij = C ∑ 1 ∑G ijˆp(xi i )j≠iAfter some simplifications as <strong>in</strong> the spherical case, we reach the follow<strong>in</strong>g fixed-po<strong>in</strong>talgorithm:1 ∑ 1 ∑C t+1 =(x i − x j )(x i − x j ) T G ij (C t ) (7)N(N − 1) ˆpi t (x i )j≠i


The expression in (7) suggests a relationship with the Expectation-Maximization (EM) result for Gaussian Mixture Models (GMM). A GMM is a PDF estimator given by the expression $\hat{p}(x) = \sum_{k=1}^{K}\alpha_k\, G(x|\mu_k, C_k)$. The weights of the $K$ components of the mixture are given by the $\alpha_k$, and each Gaussian is characterized by its mean vector $\mu_k$ and its covariance matrix $C_k$. The solution provided by the EM algorithm consists of an iterative procedure where the parameters at step $t$ are obtained from the ones at step $t-1$. To do so, a matrix of auxiliary variables is used, $r_{ki} = p(k|x_i)$, expressing the probability that the sample belongs to the $k$-th component of the mixture. These probabilities must hold $\sum_k r_{ki} = 1$. The EM solution establishes the following updating rule for the covariance matrix at step $t$:

$$C^t = \frac{\sum_k\sum_i r^t_{ki}\,(x_i - \mu^t_k)(x_i - \mu^t_k)^T}{N} \qquad (8)$$

where the $r^t_{ki}$ and $\mu^t_k$ are also iteratively updated. Note that our KDE model can be considered as a special case of GMM where i) there are as many components as samples ($K = N$), with the same weights $\alpha_k = 1/N$; ii) the mean vectors are fixed: $\mu_k = x_k$; iii) the covariance matrix is the same for each of the components; and iv) $r_{ki} = 0$ if $k = i$ and $r_{ki} = 1/(N-1)$ if $k \neq i$. With these particularizations, the updating rule in (8) becomes equal to the one given by the iteration in (7).

The EM algorithm guarantees the monotonic increase of the likelihood and thus its convergence to a local maximum, as proved in the literature [8]. The algorithm given in (7) is subject to the same conditions, so that its convergence is also proved. However, in situations in which $N \approx D$, empirical covariance matrices are close to singularity, so that numerical problems may arise, as in GMM design.

4 Application to Parzen classification

We have tested the performance of the obtained models on a set of public classification problems from [9]. For doing so, we apply the Parzen classifier, which performs the simple Bayes criterion:

$$\hat{y} = \arg\max_l\ \hat{p}_{\theta_l}(x|c_l)$$

with per-class spherical (S-KDE) and unconstrained (U-KDE) models $\hat{p}_{\theta_l}(x|c_l)$ optimized according to the proposed method, being $c_l$ each of the $L$ classes considered. We have compared these results with the ones obtained by other classification methods such as K-Nearest-Neighbors (KNN, with K = 1) and the one-versus-the-rest Support Vector Machine (SVM) with RBF kernel. The results are shown in Table 1.
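For illustration, the decision rule above might be coded as follows (a sketch of ours, not the experimental code used for Table 1; it reuses the spherical_bandwidth helper from the earlier sketch and assumes uniform class priors).

```python
import numpy as np

def parzen_classify(X_train, y_train, X_test):
    """Parzen/Bayes rule: assign each test point to the class whose KDE is largest.

    Uses a spherical kernel per class, with sigma^2 chosen by the fixed-point
    iteration sketched earlier.  A minimal illustration only; class priors are
    taken as uniform, which may differ from the exact experimental setup.
    """
    classes = np.unique(y_train)
    log_densities = []
    for c in classes:
        Xc = X_train[y_train == c]
        Nc, D = Xc.shape
        sigma2 = spherical_bandwidth(Xc)                    # per-class bandwidth
        d2 = ((X_test[:, None, :] - Xc[None, :, :]) ** 2).sum(-1)
        dens = np.exp(-d2 / (2.0 * sigma2)).sum(axis=1)
        dens *= (2.0 * np.pi * sigma2) ** (-D / 2) / Nc     # normalized class-conditional KDE
        log_densities.append(np.log(dens + 1e-300))         # guard against zero density
    return classes[np.argmax(np.column_stack(log_densities), axis=1)]
```

The U-KDE variant follows the same pattern, with the unconstrained covariance and the corresponding full-covariance kernel in place of the spherical one.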


Data        Train   Test   L    D    S-KDE   U-KDE   KNN     SVM
Pima          738      -   2    8    71.22   75.13   73.18   76.47
Wine          178      -   3   13    75.84   99.44   76.97   100
Landsat      4435   2000   6   36    89.45   86.10   90.60   90.90
Optdigits    3823   1797  10   64    97.89   93.54   94.38   98.22
Letter      16000   4000  26   16    95.23   92.77   95.20   97.55

Table 1: Classification accuracy (%) on some public datasets. Leave-one-out accuracy is provided when there are no test data.

The most remarkable conclusion from these results is that either S-KDE or U-KDE provides, in each case, a classification performance close to the SVM's. The comparison between S-KDE and U-KDE performance is closely related, in each case, to the dimension of the data relative to the number of samples. In the datasets with higher dimensionality, the performance of S-KDE is higher due to its lower risk of overfitting. Parzen classifiers have not enjoyed the popularity of other methods, mainly due to the difficulty of obtaining a reliable bandwidth for the kernel. However, in this experiment we have shown that the bandwidth chosen by our algorithm provides a classification performance close to that of a state-of-the-art classifier such as the SVM.

5 Conclusions

We have presented two algorithms for the optimization of the likelihood in the bandwidth selection problem for KDE models. Unlike previous results in the literature, the methods tackle the multivariate case in a natural way, for which we provide solutions based on both spherical and complete (unconstrained) Gaussian kernels. The convergence conditions have been described for both algorithms. By a set of experiments, we have shown that the models obtained are accurate enough to provide good classification results. This demonstrates that the model does not overfit the data, even in problems involving a high number of variables.

References

[1] M. C. Jones, J. S. Marron, and S. J. Sheather, "A brief survey of bandwidth selection for density estimation," Journal of the American Statistical Association, vol. 91, no. 433, pp. 401-407, 1996.

[2] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, 1986.

[3] A. Bowman, "An alternative method of cross-validation for the smoothing of density estimates," Biometrika, vol. 71, pp. 353-360, 1984.

[4] R. Duin, "On the choice of smoothing parameters for Parzen estimators of probability density functions," IEEE Trans. on Computers, vol. 25, no. 11, 1976.

[5] P. Hall, "Cross-validation in density estimation," Biometrika, vol. 69, no. 2, pp. 383-390, 1982.

[6] J. M. Leiva-Murillo and A. Artés-Rodríguez, "A fixed-point algorithm for finding the optimal covariance matrix in kernel density modeling," in International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 2006.

[7] R. Fletcher, Practical Methods of Optimization (2nd Edition), John Wiley & Sons, New York, 1995.

[8] G. McLachlan, The EM Algorithm and Extensions, John Wiley & Sons, New York, 1997.

[9] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Tech. Rep., Univ. of California, Dept. ICS, 1998.
