
Soft Comput
DOI 10.1007/s00500-007-0247-y

FOCUS

A particular Gaussian mixture model for clustering and its application to image retrieval

Hichem Sahbi

© Springer-Verlag 2007

Abstract  We introduce a new method for data clustering based on a particular Gaussian mixture model (GMM). Each cluster of data, modeled as a GMM in an input space, is interpreted as a hyperplane in a high dimensional mapping space where the underlying coefficients are found by solving a quadratic programming (QP) problem. The main contributions of this work are (1) an original probabilistic framework for GMM estimation based on QP which only requires finding the mixture parameters, (2) this QP is interpreted as the minimization of the pairwise correlations between cluster hyperplanes in a high dimensional space and (3) it is solved easily using a new decomposition algorithm involving trivial linear programming sub-problems. The validity of the method is demonstrated for clustering 2D toy examples as well as image databases.

Keywords  Gaussian mixture models · Clustering · Kernel methods and image retrieval

H. Sahbi (B)
Machine Intelligence Laboratory, Department of Engineering, Cambridge University, Cambridge, UK
e-mail: hs385@cam.ac.uk

H. Sahbi
Certis Laboratory, Ecole Nationale des Ponts et Chaussees, Paris, France
e-mail: sahbi@certis.enpc.fr

1 Introduction

Consider a training set S_N = {x_1, ..., x_N} generated i.i.d. according to an unknown but fixed probability distribution P(X). Here X is a random variable standing for training data in an input space denoted X. Our goal is to define a partition of this training set, i.e., to assign each training example to its actual cluster y_k, k = 1,...,C. Existing clustering methods can be categorized depending on the "crispness" of data memberships, i.e., whether each training example belongs to only one cluster or not. The first category includes hierarchical approaches (Jain and Dubes 1988; Posse 2001), EM (expectation maximization) (Dempster et al. 1977), C- (or K-) means (MacQueen 1965), self organizing maps (SOMs) (Tsao et al. 1994) and, more recently, original approaches based on kernel methods (Ben-Hur et al. 2000; Bach and Jordan 2003; Wu and Schoelkopf 2006). In the second family, the membership of a training example to one or another cluster is fuzzy, and a family of algorithms dealing with fuzziness exists in the literature, for instance (Yang 1993; Bezdek 1981; Frigui and Krishnapuram 1999; Dave 1991; Ichihashai et al. 2000; Sahbi and Boujemaa 2005). All these methods have been used in different domains including gene expression (Orengo et al. 2003), image segmentation (Carson et al. 2002) and database categorization (Le-Saux and Boujemaa 2002). One of the main issues in existing clustering methods remains setting the appropriate number of classes for a given problem.
Many methods, for instance hierarchical clustering (Sneath and Sokal 1973; Fraley 1998; Posse 2001), have proven to perform well when the application allows us to know a priori the number of clusters or when the user sets it manually. Of course, the estimation of this number is application dependent; for instance, in image segmentation it can be set a priori to the number of targeted regions. Unfortunately, for some applications such as database categorization, it is not always possible to predict automatically, or even manually, the appropriate number of classes.

Given a training set S_N in the input space X, in our method each cluster in X is modeled as a GMM and interpreted as a hyperplane in a high dimensional space referred to as the mapping space, denoted H. Clustering consists in maximizing a likelihood function of data memberships and is equivalent to finding the parameters of the cluster hyperplanes by solving a QP problem. We will show that this QP can be tackled efficiently by solving trivial linear programming sub-problems. Notice that when solving this QP, the number of clusters (denoted C), fixed initially, might be overestimated, which leads to several overlapping clusters; therefore the actual clusters are found as constellations of highly correlated hyperplanes in the mapping space H.

In the remainder of this paper, we refer to a cluster as a set of data gathered using an algorithm, while a class (or a category) is the actual membership of this data according to a well defined ground truth. Among notations, i and j stand for data indices while k, c and l stand for cluster indices. Other notations will be introduced as we go along. The paper is organized as follows: in Sect. 2 we provide a short reminder on GMMs, followed by our clustering minimization problem in Sect. 3. In Sect. 4 we show that this minimization problem can be solved as a succession of simple linear programming subproblems which are handled efficiently. We present in Sect. 5 experiments on simple as well as challenging problems in content based image retrieval. Finally, we conclude in Sect. 6 and provide some directions for future work.

2 A short reminder on GMMs

Given a training set S_N of size N, we are interested in a particular Gaussian mixture model (denoted M_k) (Bishop 1995) where the number of its components is equal to the size of the training set. The parameters of M_k are denoted Θ_k = {Θ_{i|k}, i = 1,...,N}. Here Θ_{i|k} stands for the ith component parameter of M_k, i.e., the mean and the covariance matrix of a Gaussian density function. In this case, the output of the GMM likelihood function related to a cluster y_k is a weighted sum of N component densities:

    p(x | y_k) = \sum_{i=1}^{N} P(Θ_{i|k} | y_k) \, p(x | y_k, Θ_{i|k}),    (1)

here P(Θ_{i|k} | y_k) (also denoted µ_{ik}) is the prior probability of the ith component parameter Θ_{i|k} given the cluster y_k, and p(x | y_k, Θ_{i|k}) = N_k(x; Θ_{i|k}) is the normal density function of the ith component. In order to guarantee that p(x | y_k) is a density function, the mixture parameters {µ_{ik}} are chosen such that \sum_i µ_{ik} = 1. Usually the training of GMMs is formulated as a maximum likelihood problem where the parameters Θ_k and the mixture parameters {µ_{ik}} are estimated using expectation maximization (Dempster et al. 1977; Bishop 1995).

We consider in our work N_k(x; Θ_{i|k}) with a fixed mean x_i ∈ S_N and covariance matrix σ Σ_i (Σ_i ∈ R^{p×p}, σ ∈ R^+). Now, each cluster y_k is modeled as a GMM where the only free parameters are the mixture coefficients {µ_{ik}}; the means and the covariances are assumed constant but of course dependent on a priori knowledge of the training set (see Fig. 1 and Sect. 5).

Fig. 1  This figure shows a particular GMM model where the centers and the variances are fixed. The only free parameters are the GMM mixture coefficients and the scale σ.

3 Clustering

The goal of a clustering algorithm is to make clusters, containing data from different classes, as different as possible while keeping data from the same classes as close as possible to their actual clusters. Usually this implies the optimization of an objective function (see for instance Frigui and Krishnapuram 1997) involving a fidelity term, which measures the fitness of each training sample to its model, and a regularizer, which reduces the number of clusters, i.e., the complexity of the model.

3.1 Our formulation

Following the notations and definitions in Sect. 2, a given x ∈ R^p is assigned to a cluster y_k if:

    y_k = \arg\max_{y_l} p(x | y_l),    (2)

here the weights µ = {µ_{il}} of the GMM functions p(x | y_l), l = 1,...,C, are found by solving the following constrained minimization problem (see motivation below):

    \min_{µ} \sum_{k,i} µ_{ik} \Big( \sum_{l ≠ k} p(x_i | y_l) \Big)
    s.t. \sum_{i} µ_{ic} = 1,  µ_{ic} ∈ [0, 1],  i = 1,...,N,  c = 1,...,C    (3)
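To fix ideas, the short numpy sketch below evaluates this particular GMM and the assignment rule (2), under the simplifying assumption of isotropic components (Σ_i = I) and with the memberships stored as an N × C matrix whose columns sum to one. It is an illustration of the formulas above, not the author's implementation; the Gaussian normalization constant is dropped since it is common to all clusters and does not change the argmax.

import numpy as np

def kernel(x, centers, sigma):
    # K_sigma(||x - x_i||) for all fixed centers x_i; isotropic case Sigma_i = I (assumption).
    # Normalization constant omitted: it is shared by all clusters, so the argmax is unchanged.
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma))

def likelihoods(x, centers, mu, sigma):
    # p(x | y_l) of Eq. (1) for every cluster l; mu has shape (N, C), each column sums to one.
    return mu.T @ kernel(x, centers, sigma)

def assign(x, centers, mu, sigma):
    # Assignment rule of Eq. (2): pick the cluster with the largest likelihood.
    return int(np.argmax(likelihoods(x, centers, mu, sigma)))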


In the above objective function, the training data belonging to a cluster y_l are assumed drawn from a GMM with likelihood function p(x | y_l). Each mixture parameter µ_{il} stands for the degree of membership (or the contribution) of the training example x_i to the cluster y_l. The overall objective is to maximize the membership of each training example to its actual cluster while keeping the memberships to the remaining clusters relatively low. Usually, existing clustering methods (see for instance Bezdek 1981) find the parameters {µ_{il}} as those which maximize the membership of each training example to its actual cluster. In contrast to these methods, our formulation proceeds using a dual principle; the purpose is to minimize the memberships of training examples to their non-actual clusters.

Using (1), we can expand the objective function (3) as:

    \min_{µ} \sum_{k ≠ l} \sum_{i,j} µ_{ik} µ_{jl} N_l(x_i; Θ_{j|l}),    (4)

here N_l(x_i; Θ_{j|l}) is the response of a Gaussian density function (also referred to as a kernel) with fixed parameters Θ_{j|l} = {x_j, σ Σ_j}; x_j is the mean, Σ_j is the covariance and σ is the scale. We will denote this kernel simply as K_σ(‖x_i − x_j‖). It is known that the Gaussian kernel is positive definite (Cristianini and Shawe-Taylor 2000), so this function corresponds to a scalar product in the mapping space H, i.e., there is a mapping Φ_σ from the input space into an infinite dimensional space such that K_σ(‖x_i − x_j‖) = 〈Φ_σ(x_i), Φ_σ(x_j)〉, where 〈·,·〉 stands for the inner product in H. At this stage, the response of a GMM function p(x | y_k) is equal to the scalar product 〈ω_k, Φ_σ(x)〉, where ω_k = \sum_i µ_{ik} Φ_σ(x_i) is the normal of a hyperplane in H (see Fig. 2). Now, the objective function (4) can be rewritten:

    \min_{µ} \sum_{k ≠ l} 〈ω_k, ω_l〉    (5)

The above objective function minimizes the sum of hyperplane correlations taken pairwise among all different clusters. Now, we can derive the new form of the constrained minimization problem (3):

    \min_{µ} \sum_{k,i} \sum_{l ≠ k, j} µ_{ik} µ_{jl} K_σ(‖x_i − x_j‖)
    s.t. \sum_{i} µ_{ic} = 1,  µ_{ic} ∈ [0, 1],  i = 1,...,N,  c = 1,...,C    (6)

This defines a constrained QP which can be solved using standard QP libraries (see for instance Vanderbei 1999). When solving this problem, the training examples {x_i} for which the mixing parameters {µ_{ik}} are positive will be referred to as the GMM vectors of the cluster y_k (see Fig. 2).

Fig. 2  This figure shows the mapping of training samples into a high dimensional space, where for instance p(x | y_1) = \sum_i µ_{i1} K_σ(‖x − x_i‖) = 〈ω_1, Φ_σ(x)〉. Data in the original space characterize Gaussian blobs while in the mapping space they correspond to hyperplanes. The GMM vectors are surrounded with circles and correspond to the centers of the Gaussian kernels for which the mixture parameters do not vanish.
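Since 〈ω_k, ω_l〉 = \sum_{i,j} µ_{ik} µ_{jl} K_σ(‖x_i − x_j‖), the objective in (5)-(6) can be evaluated directly from the Gram matrix, without ever forming Φ_σ explicitly. The following is a minimal numpy sketch of that evaluation, under the same isotropic-kernel assumption as the previous sketch; it is illustrative only, not the solver used in the paper.

import numpy as np

def gram_matrix(X, sigma):
    # K[i, j] = K_sigma(||x_i - x_j||); isotropic Gaussian kernel (assumption).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma))

def objective(mu, K):
    # Objective (6): sum over k != l of <w_k, w_l>, with <w_k, w_l> = mu[:, k]^T K mu[:, l].
    W = mu.T @ K @ mu            # W[k, l] = <w_k, w_l>
    return W.sum() - np.trace(W) # drop the k == l terms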
4 Training

The number of parameters intervening in (6) is N × C, so for clustering problems of reasonable size, for instance N = 1,000 and C = 20, solving this QP using standard packages can quickly get out of hand. Chunking methods have been successfully used to solve QPs for large scale training problems such as SVMs (Osuna et al. 1997). The idea consists in solving the QP problem over an active subset of parameters, referred to as a chunk, which is updated iteratively. When the QP is convex, which can be checked by verifying that the Gram matrix is positive definite, the process is guaranteed to converge to the global optimum after a sufficient number of iterations (Osuna et al. 1997).

Using the same principle as Osuna et al. (1997) and Platt (1999), we will show in this section that for a particular choice of the active chunk, the QP (6) can be decomposed into linear programming subproblems, each of which can be solved trivially.

4.1 Decomposition

Let us fix one cluster index p ∈ {1,...,C} and rewrite the objective function (6) as:

    \min_{µ} \; 2 \sum_{i} µ_{ip} c_{ip} + \sum_{i, k ≠ p} \sum_{j, l ≠ k, p} µ_{ik} µ_{jl} K_σ(‖x_i − x_j‖),    (7)

here:

    c_{ip} = \sum_{j, l ≠ p} µ_{jl} K_σ(‖x_i − x_j‖)    (8)


Consider a chunk {µ_{ip}}_{i=1}^N and fix {µ_{jk}, k ≠ p}_{j=1}^N; the objective function (7) is then linear in the chunk {µ_{ip}} and can be minimized, with respect to this chunk, using linear programming. Only one equality constraint, \sum_i µ_{ip} = 1, is taken into account in the linear programming problem, as the other equality constraints in (6) are independent of the chunk {µ_{ip}}.

We can further reduce the size of the chunk {µ_{ip}}_{i=1}^N to only two parameters. Given i_1, i_2 ∈ {1,...,N}, we can rewrite (7) as:

    \min_{µ_{i_1 p}, µ_{i_2 p}} \; 2 (µ_{i_1 p} c_{i_1 p} + µ_{i_2 p} c_{i_2 p}) + b
    s.t.  µ_{i_1 p} + µ_{i_2 p} = 1 − \sum_{i ≠ i_1, i_2} µ_{ip}
          0 ≤ µ_{i_1 p} ≤ 1
          0 ≤ µ_{i_2 p} ≤ 1,    (9)

here b is a constant independent of the chunk {µ_{i_1 p}, µ_{i_2 p}} (see 7). When c_{i_1 p} = c_{i_2 p}, the above objective function and the equality constraint are linearly dependent, so we have an infinite set of optimal solutions; any one which satisfies the constraints in (9) can be considered. Let us assume c_{i_1 p} ≠ c_{i_2 p} and denote d = 1 − \sum_{i ≠ i_1, i_2} µ_{ip}; since µ_{i_2 p} = d − µ_{i_1 p}, the above linear programming problem can be written as:

    \min_{µ_{i_1 p}} \; µ_{i_1 p} (c_{i_1 p} − c_{i_2 p})
    s.t.  \max(d − 1, 0) ≤ µ_{i_1 p} ≤ \min(d, 1)    (10)

Depending on the sign of c_{i_1 p} − c_{i_2 p}, the solution of (10) is simply one of the bounds of the inequality constraint in (10). At this stage, the parameters {µ_{1p},...,µ_{Np}} which solve (7) are found by iteratively solving the trivial minimization problem (10) for different chunks, as shown in Algorithm 1.

Algorithm 1 (p, µ)
  Do the following steps ITERMAX1 iterations
    Select randomly (i_1, i_2) ∈ {1,...,N}^2
    (c_{i_1 p}, c_{i_2 p}) ←− (8)
    if (c_{i_1 p} − c_{i_2 p} < 0)  µ_{i_1 p} ←− min(d, 1)   (see 10)
    else  µ_{i_1 p} ←− max(d − 1, 0)
    µ_{i_2 p} ←− d − µ_{i_1 p}   (see 9)
  endDo
  return (µ_{1p},...,µ_{Np})

The whole QP (6) is then solved by looping several times through different cluster indices, as shown in Algorithm 2. For a convex form of the QP (6) and for large values of ITERMAX1 and ITERMAX2, each subproblem will be convex and the convergence of the decomposition algorithms (1, 2) is guaranteed, as also discussed for support vector machine training (Platt 1999). In practice, ITERMAX1 and ITERMAX2 are set to C and N^2 respectively.

Algorithm 2
  Set {µ} to random values
  Do the following steps ITERMAX2 iterations
    Fix p in {1,...,C} and update µ_{1p},...,µ_{Np} using:
    (µ_{1p},...,µ_{Np}) ←− Algorithm 1 (p, µ)
  endDo

4.2 Agglomeration

Initially, the algorithm described above considers an overestimated number of clusters (denoted C). This results in several overlapping cluster GMMs, which can be detected using a similarity measure, for instance the Kullback–Leibler divergence (Bishop 1995). Given two cluster GMMs p(x | y_k), p(x | y_l) and a fixed ε > 0, we declare p(x | y_k) and p(x | y_l) as similar if:

    \int_X ‖ p(x | y_k) − p(x | y_l) ‖^2 \, dx < ε    (11)

As p(x | y_k) = 〈ω_k, Φ_σ(x)〉 and p(x | y_l) = 〈ω_l, Φ_σ(x)〉, we can simply rewrite (11) as:

    \int_X 〈ω_k − ω_l, Φ_σ(x)〉^2 \, dx < ε    (12)

We can rescale the memberships in ω_k, ω_l such that ‖ω_k‖ = ‖ω_l‖ = 1.
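Because c_{ip} collects every term of (6) that involves cluster p, and (10) is solved by jumping to one of its bounds, the whole decomposition fits in a few lines. The sketch below is a plain numpy rendering of Algorithms 1 and 2 under the isotropic-kernel assumption used in the earlier sketches; the cyclic choice of p, the random seeds and the initialization are implementation choices of this sketch, not prescriptions of the paper.

import numpy as np

def algorithm1(p, mu, K, itermax1, rng):
    # Pairwise updates of Algorithm 1 for a fixed cluster index p.
    N = mu.shape[0]
    # c_{ip} = sum_{j} sum_{l != p} mu_{jl} K(x_i, x_j)   -- Eq. (8); independent of column p,
    # so it stays valid while column p is being updated.
    c = K @ (mu.sum(axis=1) - mu[:, p])
    for _ in range(itermax1):
        i1, i2 = rng.choice(N, size=2, replace=False)
        d = mu[i1, p] + mu[i2, p]                 # preserves sum_i mu_{ip} = 1
        # Eq. (10): the optimum sits at a bound, depending on sign(c_{i1 p} - c_{i2 p}).
        mu[i1, p] = min(d, 1.0) if c[i1] - c[i2] < 0 else max(d - 1.0, 0.0)
        mu[i2, p] = d - mu[i1, p]
    return mu

def algorithm2(X, C, sigma, itermax1, itermax2, seed=0):
    # Algorithm 2: sweep over cluster indices, starting from random memberships.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma))               # isotropic Gaussian Gram matrix (assumption)
    mu = rng.random((N, C))
    mu /= mu.sum(axis=0, keepdims=True)           # each column (one cluster) sums to one
    for t in range(itermax2):
        mu = algorithm1(t % C, mu, K, itermax1, rng)  # cyclic choice of p (a sketch choice)
    return mu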
Now, from (12) it is clear that two overlapping GMMs correspond to two highly correlated hyperplanes in the mapping space, i.e., 〈ω_k, ω_l〉 ≥ τ for a fixed threshold τ ∈ [0, 1].

Regularization: notice that

    ∂〈ω_k, ω_l〉 / ∂σ ≥ 0,    (13)

so the correlation is an increasing function of the scale σ. Assuming ‖ω_k‖ = ‖ω_l‖ = 1, it follows that for all τ ∈ [0, 1] there exists σ such that 〈ω_k, ω_l〉 ≥ τ.

Let us define the following adjacency matrix:

    A_{k,l} = 1 if 〈ω_k, ω_l〉 ≥ τ, and 0 otherwise.    (14)

This matrix characterizes a graph where a node is a hyperplane and an edge connects two highly correlated hyperplanes (i.e., 〈ω_k, ω_l〉 ≥ τ). Now, the actual number of clusters is found as the number of connected components in this graph. For a fixed threshold τ, the scale parameter σ acts as a regularizer. When σ is overestimated, the graph A = {A_{k,l}} is connected, resulting in only one big cluster, whereas a small σ results in many disconnected subgraphs and therefore a large total number of clusters (see Fig. 3).
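The agglomeration step only needs the C × C matrix of correlations, again computable from the Gram matrix since 〈ω_k, ω_l〉 = µ_{·k}^T K µ_{·l}, followed by a connected-components pass over the thresholded graph (14). A minimal sketch follows (plain breadth-first search, numpy only); normalizing by the hyperplane norms mirrors the unit-norm convention above and is the only assumption beyond the earlier sketches.

import numpy as np
from collections import deque

def merge_clusters(mu, K, tau):
    # Group hyperplanes whose normalized correlation exceeds tau (Eq. 14).
    W = mu.T @ K @ mu                           # W[k, l] = <w_k, w_l>
    norms = np.sqrt(np.diag(W))
    A = (W / np.outer(norms, norms)) >= tau     # adjacency matrix of Eq. (14)
    C = A.shape[0]
    labels = -np.ones(C, dtype=int)             # component index per hyperplane
    n_components = 0
    for start in range(C):                      # breadth-first search per component
        if labels[start] >= 0:
            continue
        labels[start] = n_components
        queue = deque([start])
        while queue:
            k = queue.popleft()
            for l in np.flatnonzero(A[k]):
                if labels[l] < 0:
                    labels[l] = n_components
                    queue.append(l)
        n_components += 1
    return n_components, labels                 # n_components = actual number of clusters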


Fig. 3  This figure shows the decrease of the number of clusters with respect to the scale parameter σ (N = 100, C = 20). These experiments are performed on the training samples shown in Fig. 4.

4.3 Mixing different scales

We can extend the formulation presented in (3) to handle clusters with large variations in scale. Let us denote by σ(y_l) the scale of the GMM related to the cluster y_l. We can rewrite the objective function (6) as:

    \min_{µ} \sum_{k,i} \sum_{l ≠ k, j} µ_{ik} µ_{jl} K_{σ(y_l)}(‖x_i − x_j‖)
    s.t. \sum_{i} µ_{ic} = 1,  µ_{ic} ∈ [0, 1],  i = 1,...,N,  c = 1,...,C    (15)

Following the same steps (see Sect. 4.1), we can solve (15). However, the above objective function can no longer be interpreted as the minimization of pairwise correlations between hyperplanes, since GMMs with different scale parameters correspond to hyperplanes living in different mapping spaces. It follows that the criterion (14) used to detect and merge overlapping GMMs cannot be used; instead, other criteria, for instance the Kullback–Leibler divergence, might be used.

4.4 Toy examples

These experiments are targeted to show the performance brought by this decomposition algorithm, both in terms of precision and speed. We consider a two dimensional training set of size N, generated using four Gaussian densities centered at four different locations (see Fig. 4, top-left). We cluster these data using Algorithm 2 for different values of N (40, 80, 100, 500, 1000). For N = 100, Table 1 shows all the pairwise correlations between the hyperplanes found (i) when running Algorithm 2 and (ii) when solving the whole QP (6) using the standard package LOQO (Vanderbei 1999). We can see that each hyperplane found using Algorithm 2 is highly correlated with only one hyperplane found using LOQO. Table 2 shows a comparison of the underlying run-time performances using a 1 GHz Pentium III.

Fig. 4  (Top-left) Data randomly generated from four Gaussian densities centered at four different locations (N = 1,000). When running Algorithm 2, C is fixed to 20, so the total number of parameters in the training problem is 20 × 1,000. (Other figures) Different clusters are shown with different colors and the underlying GMM vectors are shown in gray. In these experiments σ is set, respectively, from top to bottom-right to 10, 4 and 50.
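For a toy run in the spirit of Sect. 4.4, the snippet below generates four 2D Gaussian blobs and chains the illustrative algorithm2 and merge_clusters sketches given earlier. The blob centers and spreads, the threshold τ and the iteration counts are arbitrary choices of this sketch, not the exact settings behind Fig. 4 and Tables 1-2.

import numpy as np

def four_blobs(n_per_blob, seed=0):
    # 2D toy data: four Gaussian densities centered at four different locations (arbitrary values).
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 30.0], [30.0, 30.0]])
    return np.vstack([rng.normal(c, 3.0, size=(n_per_blob, 2)) for c in centers])

X = four_blobs(25)                                                  # N = 100
mu = algorithm2(X, C=20, sigma=10.0, itermax1=20, itermax2=400)     # sketch from Sect. 4.1
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2 / (2.0 * 10.0))                                      # same Gram matrix as inside algorithm2
n_clusters, labels = merge_clusters(mu, K, tau=0.9)                 # agglomeration step of Sect. 4.2
print(n_clusters)                                                   # ideally close to 4 for this toy data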


Table 1  This table shows the correlations (normalized inner products) between the hyperplanes found when (i) running Algorithm 2 and (ii) solving the QP (6) using the LOQO package, on the 2D toy data of Fig. 4 (σ = 10)

LOQO pack vs. Algorithm 2   Class 1   Class 2   Class 3   Class 4
Cluster 1                   0.994     0.043     0.331     0.187
Cluster 2                   0.036     0.999     0.238     0.133
Cluster 3                   0.284     0.313     0.985     0.058
Cluster 4                   0.152     0.147     0.057     0.998

Rows characterize the clusters found after running our algorithm while columns characterize the classes (i.e., ground truth).

Table 2  Comparison of the run-time performances on the training samples in Fig. 4 for different values of N (C = 20, σ = 10)

#S (N)                              40         80         100        500
# of parameters in the QP (N × C)   280        480        700        3,500
LOQO pack                           24.6 (s)   87.9 (s)   317.8 (s)  > 1 (h)
Algorithm 2                         0.09 (s)   0.32 (s)   0.56 (s)   24.99 (s)

5 Database categorization

Experiments have been conducted on both the Olivetti and Columbia databases in order to show the good performance of our clustering method. The Olivetti subset contains 20 persons, each one represented by ten faces. The Columbia subset contains 15 categories of objects, each one represented by 72 images. Each image from both the Olivetti and Columbia datasets is processed using histogram equalization and encoded using 20 coefficients of projection onto a subspace learned using ISOMAP (Tanenbaum et al. 2000). Notice that ISOMAP embeds a training database into a subspace such that non-linearly separable classes become more separable, homogeneous in terms of scale, and easier to cluster.

5.1 Scale

It might be easier, for database categorization, to predict the variance of the data rather than the number of clusters, mainly for large databases living in high dimensional spaces. As the number of clusters is initially set to an overestimated value (see Sect. 4.2), our method relies mainly on one parameter, the scale σ, which makes it possible to find the actual number of clusters automatically. In our experiments, we predict σ by manually sampling a few images from some categories, estimating the scales of the underlying classes, and then setting σ to the expectation of the scale over these classes. While this setting is not automatic, it has at least the advantage of introducing a priori knowledge about the variance of the data at the expense of a few interactions and reasonable effort from the user.

Fig. 5  These diagrams show the variation of the scale with respect to the number of samples per class on (left) the Olivetti and (right) the Columbia sets. The scale is given as the expectation of the distance between pairs of images taken from 3, 5, 7 and 9 classes.

Fig. 6  (Top) Decrease of the number of clusters with respect to the scale parameter. (Bottom) Probability of error with respect to σ. These observations are given using (left) the Olivetti and (right) the Columbia sets.
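One plausible reading of this heuristic, and of the quantity plotted in Fig. 5, is the expected within-class pairwise distance over a few manually sampled classes. The sketch below implements that reading; it is an assumption about the protocol, not the author's exact procedure.

import numpy as np
from itertools import combinations

def estimate_sigma(samples_per_class):
    # samples_per_class: list of arrays, each holding a few feature vectors from one sampled class.
    # Returns the expectation, over classes, of the mean within-class pairwise distance,
    # used here as a guess for the scale sigma (an interpretation of Sect. 5.1 / Fig. 5).
    class_means = []
    for S in samples_per_class:
        dists = [np.linalg.norm(a - b) for a, b in combinations(S, 2)]
        class_means.append(np.mean(dists))
    return float(np.mean(class_means))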


Figure 5 shows that this heuristic provides a "good guess" of the scale on the Olivetti and Columbia sets, as it is close to the optimal scale shown in Fig. 6. Indeed, the estimated scale in Fig. 5 is approximately 24 and 0.4 for, respectively, the Olivetti and Columbia sets. For these scales, the number of clusters found by our algorithm on Olivetti and Columbia (resp. 19 and 15, see Fig. 6, top) is close to the actual numbers, resp. 20 and 15. Furthermore, the probability of error (17) is close to its minimum for both Olivetti and Columbia (see Fig. 6, bottom).

Table 3  This table shows the distribution of the categories (ground truth) through the clusters on the Olivetti set using our clustering algorithm (a dot stands for zero)

Clusters \ Categories  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  Total
1                      9   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   9
2                      .   10  .   .   1   3   .   .   .   1   .   .   .   .   .   .   .   .   3   .   18
3                      .   .   10  .   .   .   .   .   .   .   .   .   .   .   .   .   .   5   .   .   15
4                      .   .   .   10  .   .   .   .   3   .   .   .   .   .   .   .   2   .   .   .   15
5                      .   .   .   .   9   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   10
6                      .   .   .   .   .   .   5   .   .   .   .   .   .   .   .   .   .   .   .   .   5
7                      .   .   .   .   .   .   .   .   5   .   2   .   .   .   .   .   .   .   .   .   7
8                      .   .   .   .   .   1   .   .   .   9   .   .   .   .   .   .   .   .   .   2   12
9                      .   .   .   .   .   .   .   .   .   .   3   .   .   .   .   .   .   .   .   .   3
10                     1   .   .   .   .   .   .   .   .   .   .   4   5   .   2   .   .   .   .   .   12
11                     .   .   .   .   .   .   .   .   .   .   5   .   .   .   .   .   .   .   .   .   5
12                     .   .   .   .   .   .   .   .   .   .   .   6   3   .   .   .   .   .   .   .   9
13                     .   .   .   .   .   .   .   .   .   .   .   .   2   10  .   1   .   .   .   .   13
14                     .   .   .   .   .   1   .   .   .   .   .   .   .   .   8   .   .   .   .   .   9
15                     .   .   .   .   .   1   .   .   .   .   .   .   .   .   .   9   .   .   .   .   10
16                     .   .   .   .   .   4   1   2   2   .   .   .   .   .   .   .   8   2   .   .   19
17                     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3   .   5   8
18                     .   .   .   .   .   .   4   .   .   .   .   .   .   .   .   .   .   .   7   .   11
19                     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2   2
Total                  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10

We can see that this distribution is concentrated near the diagonal. The scale parameter σ is set to 24 and C = 30. Errors in the cardinality of the clusters are mentioned in bold.

Table 4  This table shows the distribution of the categories through the clusters on the Columbia set using our clustering algorithm (a dot stands for zero)

Clusters \ Categories  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  Total
1                      72  .   .   .   .   .   .   .   .   .   .   .   .   .   .   72
2                      .   72  .   .   .   .   .   .   .   .   .   .   .   .   .   72
3                      .   .   72  .   .   .   .   .   .   .   .   .   .   .   .   72
4                      .   .   .   72  .   .   .   .   .   .   .   .   .   .   .   72
5                      .   .   .   .   72  .   .   .   .   .   .   .   .   .   .   72
6                      .   .   .   .   .   37  .   34  .   .   .   .   .   .   .   71
7                      .   .   .   .   .   35  .   38  .   .   .   .   .   .   .   73
8                      .   .   .   .   .   .   72  .   .   .   .   .   .   .   .   72
9                      .   .   .   .   .   .   .   .   72  .   .   .   .   .   .   72
10                     .   .   .   .   .   .   .   .   .   72  .   .   .   .   .   72
11                     .   .   .   .   .   .   .   .   .   .   72  .   .   .   .   72
12                     .   .   .   .   .   .   .   .   .   .   .   72  .   .   .   72
13                     .   .   .   .   .   .   .   .   .   .   .   .   30  .   .   30
14                     .   .   .   .   .   .   .   .   .   .   .   .   42  72  72  186
Total                  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72

We can see that this distribution is concentrated near the diagonal. The scale σ is set to 0.4 and C = 20. Errors in the cardinality of the clusters are mentioned in bold.
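For completeness, a table like Tables 3 and 4 is simply a cluster-by-category contingency matrix; it can be computed from cluster assignments and ground-truth labels as in the following sketch (the argument names cluster_labels and class_labels are illustrative, not from the paper).

import numpy as np

def contingency_table(cluster_labels, class_labels):
    # Rows are clusters, columns are ground-truth categories (as in Tables 3 and 4).
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    table = np.zeros((len(clusters), len(classes)), dtype=int)
    for r, k in enumerate(clusters):
        for c, y in enumerate(classes):
            table[r, c] = np.sum((cluster_labels == k) & (class_labels == y))
    return table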


Fig. 7  This figure shows 18 face prototypes, from different clusters, found after the application of our method. Each prototype corresponds to a GMM vector.

Fig. 8  This figure shows five clusters from the Olivetti database found after the application of our algorithm.


5.2 Precision

A clustering method can be objectively evaluated when the ground truth is available; otherwise the meaning of clustering can differ from one intelligent observer to another. Validity criteria are introduced in order to measure the quality of a clustering algorithm, i.e., its capacity to assign data to their actual classes. For a survey on these methods, see for example Halkidi et al. (2002).

In the presence of a ground truth, we consider in our work a simple validity criterion based on the probability of misclassification. The latter occurs either when two examples belonging to two different classes are assigned to the same cluster, or when two examples belonging to the same class are assigned to different clusters. We denote by X and Y two random variables standing respectively for the training examples and their possible classes {y_1,...,y_C}, and X', Y' are independent copies of X, Y. We denote by f(X) the index of the cluster of X in {y_1,...,y_C}. Formally, we define the misclassification error as:

    P( 1_{\{f(X) = f(X')\}} ≠ 1_{\{Y = Y'\}} ) = (∗),    (16)

here:

    (∗) = P(f(X) ≠ f(X') | Y = Y') P(Y = Y') + P(f(X) = f(X') | Y ≠ Y') P(Y ≠ Y')    (17)

Again, Fig. 6 (bottom) shows the misclassification error (17) with respect to the scale parameter σ. Tables 3 and 4 are the confusion matrices which show the distribution of the 20 and 15 categories from, respectively, the Olivetti and the Columbia sets through the clusters after the application of our clustering method. This distribution is concentrated on the diagonal, which clearly shows that most of the training data are assigned to their actual categories (see Figs. 7, 8).

5.3 Comparison

Figure 9 shows a comparison of our kernel clustering method with existing state of the art approaches including fuzzy clustering (Sahbi and Boujemaa 2005), K-means (MacQueen 1965) and hierarchical clustering (Posse 2001). These results are shown for the Olivetti database. The error (17) is measured with respect to the number of clusters. Notice that the latter is fixed for hierarchical clustering and K-means, while it corresponds to the resulting number of clusters after setting σ (see Sect. 4.2) for fuzzy clustering and our method. We clearly see that the generalization error takes its smallest value when the number of clusters is close to the actual one (i.e., 20).

Fig. 9  This figure shows a comparison of the generalization error on the Olivetti set of our method and other existing state of the art methods including fuzzy, k-means and hierarchical clustering.
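The error (17), which is the quantity reported in Figs. 6 and 9, can be estimated empirically by comparing the "same cluster" and "same class" indicators over all pairs of examples. A small vectorized sketch follows; excluding the trivial pairs X = X' is a choice of this sketch, not part of (16).

import numpy as np

def misclassification_error(cluster_labels, class_labels):
    # Empirical estimate of Eq. (16): probability that the "same cluster" and
    # "same class" indicators disagree for a random pair of distinct examples.
    # Both arguments are expected to be 1-D numpy arrays of labels.
    same_cluster = cluster_labels[:, None] == cluster_labels[None, :]
    same_class = class_labels[:, None] == class_labels[None, :]
    disagree = same_cluster != same_class
    N = len(class_labels)
    off_diag = ~np.eye(N, dtype=bool)            # drop the (X, X) pairs
    return disagree[off_diag].mean()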
6 Conclusion and future work

We introduced in this work an original approach to clustering based on a particular Gaussian mixture model (GMM). The method considers an objective function which acts as a regularizer and minimizes the overlap between the cluster GMMs. The GMM parameters are found by solving a quadratic programming problem using a new decomposition algorithm which involves trivial linear programming subproblems. The actual number of clusters is found by controlling the scale parameter of these GMMs; in practice, it turns out that predicting this parameter is easier than predicting the actual number of clusters, mainly for large databases living in high dimensional spaces.

The concept presented in this paper is different from kernel regression, which might be assimilated to density estimation, whereas our approach performs this estimation for each cluster; the proposed approach thus performs clustering and density estimation at the same time.

The validity of the method is demonstrated on toy data as well as database categorization problems. As future work, we will investigate the application of the method to handle noisy data.

References

Bach FR, Jordan MI (2003) Learning spectral clustering. Neural information processing systems
Ben-Hur A, Horn D, Siegelmann HT, Vapnik V (2000) Support vector clustering. Neural information processing systems, pp 367–373
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford
Carson C, Belongie S, Greenspan H, Malik J (2002) Blobworld: image segmentation using expectation-maximization and its application to image querying. IEEE Trans Pattern Anal Mach Intell 24(8):1026–1038
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
Dave RN (1991) Characterization and detection of noise in clustering. Pattern Recognit 12(11):657–664
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38
Fraley C (1998) Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput 20(1):270–281
Frigui H, Krishnapuram R (1997) Clustering by competitive agglomeration. Pattern Recognit 30(7)
Frigui H, Krishnapuram R (1999) A robust competitive clustering algorithm with applications in computer vision. IEEE Trans Pattern Anal Mach Intell 21(5):450–465
Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: part I and II. SIGMOD Record
Ichihashai H, Honda K, Tani N (2000) Gaussian mixture pdf approximation and fuzzy c-means clustering with entropy regularization. In: Proceedings of the 4th Asian fuzzy system symposium, pp 217–221
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, Englewood Cliffs
Le-Saux B, Boujemaa N (2002) Unsupervised robust clustering for image database categorization. In: IEEE-IAPR international conference on pattern recognition, pp 259–262
MacQueen J (1965) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1
Orengo CA, Jones DT, Thornton JM (2003) Bioinformatics—genes, proteins and computers. BIOS. ISBN 1-85996-054-5
Osuna E, Freund R, Girosi F (1997) Training support vector machines: an application to face detection. In: Proceedings of the international conference on computer vision and pattern recognition, pp 130–136
Platt J (1999) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola AJ (eds) Advances in kernel methods—support vector learning. MIT Press, Cambridge, pp 185–208
Posse C (2001) Hierarchical model-based clustering for large datasets. J Comput Graph Stat 10(3):464–486
Sahbi H, Boujemaa N (2005) Validity of fuzzy clustering using entropy regularization. In: Proceedings of the IEEE conference on fuzzy systems
Sneath PH, Sokal RR (1973) Numerical taxonomy—the principles and practice of numerical classification. W. H. Freeman, San Francisco
Tanenbaum J, de Silva V, Langford J (2000) A global geometric framework for non-linear dimensionality reduction. Science 290(5500):2319–2323
Tsao EK, Bezdek JC, Pal NR (1994) Fuzzy Kohonen clustering networks. Pattern Recognit 27(5):757–764
Vanderbei RJ (1999) LOQO: an interior point code for quadratic programming. Optim Methods Softw 11:451–484
Wu M, Schoelkopf B (2006) A local learning approach for clustering. Advances in neural information processing systems
Yang MS (1993) A survey of fuzzy clustering. Math Comput Model 18:1–16
