
Soft Comput
DOI 10.1007/s00500-007-0247-y

FOCUS

A particular Gaussian mixture model for clustering and its application to image retrieval

Hichem Sahbi

© Springer-Verlag 2007

Abstract  We introduce a new method for data clustering based on a particular Gaussian mixture model (GMM). Each cluster of data, modeled as a GMM in an input space, is interpreted as a hyperplane in a high dimensional mapping space where the underlying coefficients are found by solving a quadratic programming (QP) problem. The main contributions of this work are (1) an original probabilistic framework for GMM estimation based on QP which only requires finding the mixture parameters, (2) this QP is interpreted as the minimization of the pairwise correlations between cluster hyperplanes in a high dimensional space and (3) it is solved easily using a new decomposition algorithm involving trivial linear programming sub-problems. The validity of the method is demonstrated for clustering 2D toy examples as well as image databases.

Keywords  Gaussian mixture models · Clustering · Kernel methods and image retrieval

H. Sahbi (B)
Machine Intelligence Laboratory, Department of Engineering, Cambridge University, Cambridge, UK
e-mail: hs385@cam.ac.uk

H. Sahbi
Certis Laboratory, Ecole Nationale des Ponts et Chaussees, Paris, France
e-mail: sahbi@certis.enpc.fr

1 Introduction

Consider a training set S_N = {x_1, ..., x_N} generated i.i.d. according to an unknown but fixed probability distribution P(X). Here X is a random variable standing for training data in an input space denoted X. Our goal is to define a partition of this training set, i.e., to assign each training example to its actual cluster y_k, k = 1,...,C. Existing clustering methods can be categorized depending on the "crispness" of data memberships, i.e., whether each training example belongs to only one cluster or not. The first category includes hierarchical approaches (Jain and Dubes 1988; Posse 2001), EM (expectation maximization) (Dempster et al. 1977), C- (or K-) means (MacQueen 1965), self organizing maps (SOMs) (Tsao et al. 1994) and, more recently, original approaches based on kernel methods (Ben-Hur et al. 2000; Bach and Jordan 2003; Wu and Schoelkopf 2006). In the second family, the membership of a training example to one or another cluster is fuzzy, and a family of algorithms dealing with fuzziness exists in the literature, for instance (Yang 1993; Bezdek 1981; Frigui and Krishnapuram 1999; Dave 1991; Ichihashai et al. 2000; Sahbi and Boujemaa 2005). All these methods have been used in different domains including gene expression (Orengo et al. 2003), image segmentation (Carson et al. 2002) and database categorization (Le-Saux and Boujemaa 2002). One of the main issues in existing clustering methods remains setting the appropriate number of classes for a given problem.
Many methods, for instance hierarchical clustering (Sneath and Sokal 1973; Fraley 1998; Posse 2001), have proven to perform well when the application allows us to know a priori the number of clusters or when the user sets it manually. Of course, the estimation of this number is application dependent; for instance, in image segmentation it can be set a priori to the number of targeted regions. Unfortunately, for some applications such as database categorization, it is not always possible to predict automatically, or even manually, the appropriate number of classes.

Given a training set S_N in the input space X, in our method each cluster in X is modeled as a GMM and interpreted as a hyperplane in a high dimensional space referred to as the mapping space, denoted H. Clustering consists in maximizing a likelihood function of data memberships and is equivalent to finding the parameters of the cluster hyperplanes by solving a QP problem. We will show that this QP can be tackled efficiently by solving trivial linear programming sub-problems. Notice that when solving this QP, the number of clusters (denoted C), fixed initially, might be overestimated, which leads to several overlapping clusters; therefore the actual clusters are found as constellations of highly correlated hyperplanes in the mapping space H.

In the remainder of this paper, we refer to a cluster as a set of data gathered using an algorithm, while a class (or a category) is the actual membership of this data according to a well defined ground truth. Among notations, i and j stand for data indices while k, c and l stand for cluster indices. Other notations will be introduced as we go along. The paper is organized as follows: in Sect. 2 we provide a short reminder on GMMs, followed by our clustering minimization problem in Sect. 3. In Sect. 4 we show that this minimization problem can be solved as a succession of simple linear programming subproblems which are handled efficiently. We present in Sect. 5 experiments on simple as well as challenging problems in content based image retrieval. Finally, we conclude in Sect. 6 and provide some directions for future work.

2 A short reminder on GMMs

Given a training set S_N of size N, we are interested in a particular Gaussian mixture model (denoted M_k) (Bishop 1995) where the number of its components is equal to the size of the training set. The parameters of M_k are denoted Θ_k = {Θ_{i|k}, i = 1,...,N}. Here Θ_{i|k} stands for the ith component parameter of M_k, i.e., the mean and the covariance matrix of a Gaussian density function. In this case, the output of the GMM likelihood function related to a cluster y_k is a weighted sum of N component densities:

    p(x | y_k) = \sum_{i=1}^{N} P(Θ_{i|k} | y_k) \, p(x | y_k, Θ_{i|k}),    (1)

here P(Θ_{i|k} | y_k) (also denoted µ_{ik}) is the prior probability of the ith component parameter Θ_{i|k} given the cluster y_k, and p(x | y_k, Θ_{i|k}) = N_k(x; Θ_{i|k}) is the normal density function of the ith component. In order to guarantee that p(x | y_k) is a density function, the mixture parameters {µ_{ik}} are chosen such that \sum_i µ_{ik} = 1. Usually the training of GMMs is formulated as a maximum likelihood problem where the parameters Θ_k and the mixture parameters {µ_{ik}} are estimated using expectation maximization (Dempster et al. 1977; Bishop 1995).

We consider in our work N_k(x; Θ_{i|k}) with a fixed mean x_i ∈ S_N and covariance matrix σ Σ_i (Σ_i ∈ R^{p×p}, σ ∈ R^+). Now, each cluster y_k is modeled as a GMM where the only free parameters are the mixture coefficients {µ_{ik}}; the means and the covariances are assumed constant but of course dependent on a priori knowledge of the training set (see Fig. 1 and Sect. 5).

Fig. 1  This figure shows a particular GMM model where the centers and the variances are fixed. The only free parameters are the GMM mixture coefficients and the scale σ.

3 Clustering

The goal of a clustering algorithm is to make clusters, containing data from different classes, as different as possible while keeping data from the same classes as close as possible to their actual clusters. Usually this implies the optimization of an objective function (see for instance Frigui and Krishnapuram 1997) involving a fidelity term, which measures the fitness of each training sample to its model, and a regularizer, which reduces the number of clusters, i.e., the complexity of the model.

3.1 Our formulation

Following the notations and definitions in Sect. 2, a given x ∈ R^p is assigned to a cluster y_k if:

    y_k = \arg\max_{y_l} p(x | y_l),    (2)

here the weights µ = {µ_{il}} of the GMM functions p(x | y_l), l = 1,...,C, are found by solving the following constrained minimization problem (see motivation below):

    \min_{µ} \sum_{k,i} µ_{ik} \Big( \sum_{l ≠ k} p(x_i | y_l) \Big)
    s.t. \sum_{i} µ_{ic} = 1,  µ_{ic} ∈ [0, 1],  i = 1,...,N,  c = 1,...,C    (3)
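To fix ideas, the short numpy sketch below evaluates this particular GMM and the assignment rule (2), under the simplifying assumption of isotropic components (Σ_i = I) and with the memberships stored as an N × C matrix whose columns sum to one. It is an illustration of the formulas above, not the author's implementation; the Gaussian normalization constant is dropped since it is common to all clusters and does not change the argmax.

import numpy as np

def kernel(x, centers, sigma):
    # K_sigma(||x - x_i||) for all fixed centers x_i; isotropic case Sigma_i = I (assumption).
    # Normalization constant omitted: it is shared by all clusters, so the argmax is unchanged.
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma))

def likelihoods(x, centers, mu, sigma):
    # p(x | y_l) of Eq. (1) for every cluster l; mu has shape (N, C), each column sums to one.
    return mu.T @ kernel(x, centers, sigma)

def assign(x, centers, mu, sigma):
    # Assignment rule of Eq. (2): pick the cluster with the largest likelihood.
    return int(np.argmax(likelihoods(x, centers, mu, sigma)))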


In the above objective function, the training data belonging to a cluster y_l are assumed drawn from a GMM with likelihood function p(x | y_l). Each mixture parameter µ_{il} stands for the degree of membership (or the contribution) of the training example x_i to the cluster y_l. The overall objective is to maximize the membership of each training example to its actual cluster while keeping the memberships to the remaining clusters relatively low. Usually, existing clustering methods (see for instance Bezdek 1981) find the parameters {µ_{il}} as those which maximize the membership of each training example to its actual cluster. In contrast to these methods, our formulation proceeds using a dual principle; the purpose is to minimize the memberships of training examples to their non-actual clusters.

Using (1), we can expand the objective function (3) as:

    \min_{µ} \sum_{k ≠ l} \sum_{i,j} µ_{ik} µ_{jl} N_l(x_i; Θ_{j|l}),    (4)

here N_l(x_i; Θ_{j|l}) is the response of a Gaussian density function (also referred to as a kernel) with fixed parameters Θ_{j|l} = {x_j, σ Σ_j}; x_j is the mean, Σ_j is the covariance and σ is the scale. We will denote this kernel simply as K_σ(‖x_i − x_j‖). It is known that the Gaussian kernel is positive definite (Cristianini and Shawe-Taylor 2000), so this function corresponds to a scalar product in the mapping space H, i.e., there is a mapping Φ_σ from the input space into an infinite dimensional space such that K_σ(‖x_i − x_j‖) = 〈Φ_σ(x_i), Φ_σ(x_j)〉, where 〈·,·〉 stands for the inner product in H. At this stage, the response of a GMM function p(x | y_k) is equal to the scalar product 〈ω_k, Φ_σ(x)〉, where ω_k = \sum_i µ_{ik} Φ_σ(x_i) is the normal of a hyperplane in H (see Fig. 2). Now, the objective function (4) can be rewritten:

    \min_{µ} \sum_{k ≠ l} 〈ω_k, ω_l〉    (5)

The above objective function minimizes the sum of hyperplane correlations taken pairwise among all different clusters. Now, we can derive the new form of the constrained minimization problem (3):

    \min_{µ} \sum_{k,i} \sum_{l ≠ k, j} µ_{ik} µ_{jl} K_σ(‖x_i − x_j‖)
    s.t. \sum_{i} µ_{ic} = 1,  µ_{ic} ∈ [0, 1],  i = 1,...,N,  c = 1,...,C    (6)

This defines a constrained QP which can be solved using standard QP libraries (see for instance Vanderbei 1999). When solving this problem, the training examples {x_i} for which the mixing parameters {µ_{ik}} are positive will be referred to as the GMM vectors of the cluster y_k (see Fig. 2).

Fig. 2  This figure shows the mapping of training samples into a high dimensional space, where for instance p(x | y_1) = \sum_i µ_{i1} K_σ(‖x − x_i‖) = 〈ω_1, Φ_σ(x)〉. Data in the original space characterize Gaussian blobs while in the mapping space they correspond to hyperplanes. The GMM vectors are surrounded with circles and correspond to the centers of the Gaussian kernels for which the mixture parameters do not vanish.
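Since 〈ω_k, ω_l〉 = \sum_{i,j} µ_{ik} µ_{jl} K_σ(‖x_i − x_j‖), the objective in (5)-(6) can be evaluated directly from the Gram matrix, without ever forming Φ_σ explicitly. The following is a minimal numpy sketch of that evaluation, under the same isotropic-kernel assumption as the previous sketch; it is illustrative only, not the solver used in the paper.

import numpy as np

def gram_matrix(X, sigma):
    # K[i, j] = K_sigma(||x_i - x_j||); isotropic Gaussian kernel (assumption).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma))

def objective(mu, K):
    # Objective (6): sum over k != l of <w_k, w_l>, with <w_k, w_l> = mu[:, k]^T K mu[:, l].
    W = mu.T @ K @ mu            # W[k, l] = <w_k, w_l>
    return W.sum() - np.trace(W) # drop the k == l terms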
4 Training

The number of parameters intervening in (6) is N × C, so for clustering problems of reasonable size, for instance N = 1,000 and C = 20, solving this QP using standard packages can quickly get out of hand. Chunking methods have been successfully used to solve QPs for large scale training problems such as SVMs (Osuna et al. 1997). The idea consists in solving the QP problem over an active subset of parameters, referred to as a chunk, which is updated iteratively. When the QP is convex, which can be checked by verifying that the Gram matrix is positive definite, the process is guaranteed to converge to the global optimum after a sufficient number of iterations (Osuna et al. 1997).

Using the same principle as Osuna et al. (1997) and Platt (1999), we will show in this section that for a particular choice of the active chunk, the QP (6) can be decomposed into linear programming subproblems, each of which can be solved trivially.

4.1 Decomposition

Let us fix one cluster index p ∈ {1,...,C} and rewrite the objective function (6) as:

    \min_{µ} \; 2 \sum_{i} µ_{ip} c_{ip} + \sum_{i, k ≠ p} \sum_{j, l ≠ k, p} µ_{ik} µ_{jl} K_σ(‖x_i − x_j‖),    (7)

here:

    c_{ip} = \sum_{j, l ≠ p} µ_{jl} K_σ(‖x_i − x_j‖)    (8)


Consider a chunk {µ_{ip}}_{i=1}^N and fix {µ_{jk}, k ≠ p}_{j=1}^N; the objective function (7) is then linear in the chunk {µ_{ip}} and can be minimized, with respect to this chunk, using linear programming. Only one equality constraint, \sum_i µ_{ip} = 1, is taken into account in the linear programming problem, as the other equality constraints in (6) are independent of the chunk {µ_{ip}}.

We can further reduce the size of the chunk {µ_{ip}}_{i=1}^N to only two parameters. Given i_1, i_2 ∈ {1,...,N}, we can rewrite (7) as:

    \min_{µ_{i_1 p}, µ_{i_2 p}} \; 2 (µ_{i_1 p} c_{i_1 p} + µ_{i_2 p} c_{i_2 p}) + b
    s.t.  µ_{i_1 p} + µ_{i_2 p} = 1 − \sum_{i ≠ i_1, i_2} µ_{ip}
          0 ≤ µ_{i_1 p} ≤ 1
          0 ≤ µ_{i_2 p} ≤ 1,    (9)

here b is a constant independent of the chunk {µ_{i_1 p}, µ_{i_2 p}} (see 7). When c_{i_1 p} = c_{i_2 p}, the above objective function and the equality constraint are linearly dependent, so we have an infinite set of optimal solutions; any one which satisfies the constraints in (9) can be considered. Let us assume c_{i_1 p} ≠ c_{i_2 p} and denote d = 1 − \sum_{i ≠ i_1, i_2} µ_{ip}; since µ_{i_2 p} = d − µ_{i_1 p}, the above linear programming problem can be written as:

    \min_{µ_{i_1 p}} \; µ_{i_1 p} (c_{i_1 p} − c_{i_2 p})
    s.t.  \max(d − 1, 0) ≤ µ_{i_1 p} ≤ \min(d, 1)    (10)

Depending on the sign of c_{i_1 p} − c_{i_2 p}, the solution of (10) is simply one of the bounds of the inequality constraint in (10). At this stage, the parameters {µ_{1p},...,µ_{Np}} which solve (7) are found by iteratively solving the trivial minimization problem (10) for different chunks, as shown in Algorithm 1.

Algorithm 1 (p, µ)
  Do the following steps ITERMAX1 iterations
    Select randomly (i_1, i_2) ∈ {1,...,N}^2
    (c_{i_1 p}, c_{i_2 p}) ←− (8)
    if (c_{i_1 p} − c_{i_2 p} < 0)  µ_{i_1 p} ←− min(d, 1)   (see 10)
    else  µ_{i_1 p} ←− max(d − 1, 0)
    µ_{i_2 p} ←− d − µ_{i_1 p}   (see 9)
  endDo
  return (µ_{1p},...,µ_{Np})

The whole QP (6) is then solved by looping several times through different cluster indices, as shown in Algorithm 2. For a convex form of the QP (6) and for large values of ITERMAX1 and ITERMAX2, each subproblem will be convex and the convergence of the decomposition algorithms (1, 2) is guaranteed, as also discussed for support vector machine training (Platt 1999). In practice, ITERMAX1 and ITERMAX2 are set to C and N^2 respectively.

Algorithm 2
  Set {µ} to random values
  Do the following steps ITERMAX2 iterations
    Fix p in {1,...,C} and update µ_{1p},...,µ_{Np} using:
    (µ_{1p},...,µ_{Np}) ←− Algorithm 1 (p, µ)
  endDo

4.2 Agglomeration

Initially, the algorithm described above considers an overestimated number of clusters (denoted C). This results in several overlapping cluster GMMs, which can be detected using a similarity measure, for instance the Kullback–Leibler divergence (Bishop 1995). Given two cluster GMMs p(x | y_k), p(x | y_l) and a fixed ε > 0, we declare p(x | y_k) and p(x | y_l) as similar if:

    \int_X ‖ p(x | y_k) − p(x | y_l) ‖^2 \, dx < ε    (11)

As p(x | y_k) = 〈ω_k, Φ_σ(x)〉 and p(x | y_l) = 〈ω_l, Φ_σ(x)〉, we can simply rewrite (11) as:

    \int_X 〈ω_k − ω_l, Φ_σ(x)〉^2 \, dx < ε    (12)

We can rescale the memberships in ω_k, ω_l such that ‖ω_k‖ = ‖ω_l‖ = 1.
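Because c_{ip} collects every term of (6) that involves cluster p, and (10) is solved by jumping to one of its bounds, the whole decomposition fits in a few lines. The sketch below is a plain numpy rendering of Algorithms 1 and 2 under the isotropic-kernel assumption used in the earlier sketches; the cyclic choice of p, the random seeds and the initialization are implementation choices of this sketch, not prescriptions of the paper.

import numpy as np

def algorithm1(p, mu, K, itermax1, rng):
    # Pairwise updates of Algorithm 1 for a fixed cluster index p.
    N = mu.shape[0]
    # c_{ip} = sum_{j} sum_{l != p} mu_{jl} K(x_i, x_j)   -- Eq. (8); independent of column p,
    # so it stays valid while column p is being updated.
    c = K @ (mu.sum(axis=1) - mu[:, p])
    for _ in range(itermax1):
        i1, i2 = rng.choice(N, size=2, replace=False)
        d = mu[i1, p] + mu[i2, p]                 # preserves sum_i mu_{ip} = 1
        # Eq. (10): the optimum sits at a bound, depending on sign(c_{i1 p} - c_{i2 p}).
        mu[i1, p] = min(d, 1.0) if c[i1] - c[i2] < 0 else max(d - 1.0, 0.0)
        mu[i2, p] = d - mu[i1, p]
    return mu

def algorithm2(X, C, sigma, itermax1, itermax2, seed=0):
    # Algorithm 2: sweep over cluster indices, starting from random memberships.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma))               # isotropic Gaussian Gram matrix (assumption)
    mu = rng.random((N, C))
    mu /= mu.sum(axis=0, keepdims=True)           # each column (one cluster) sums to one
    for t in range(itermax2):
        mu = algorithm1(t % C, mu, K, itermax1, rng)  # cyclic choice of p (a sketch choice)
    return mu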
Now, from (12) it is clear that two overlapping GMMs correspond to two highly correlated hyperplanes in the mapping space, i.e., 〈ω_k, ω_l〉 ≥ τ for a fixed threshold τ ∈ [0, 1].

Regularization: notice that

    ∂〈ω_k, ω_l〉 / ∂σ ≥ 0,    (13)

so the correlation is an increasing function of the scale σ. Assuming ‖ω_k‖ = ‖ω_l‖ = 1, it follows that for all τ ∈ [0, 1] there exists σ such that 〈ω_k, ω_l〉 ≥ τ.

Let us define the following adjacency matrix:

    A_{k,l} = 1 if 〈ω_k, ω_l〉 ≥ τ, and 0 otherwise.    (14)

This matrix characterizes a graph where a node is a hyperplane and an edge connects two highly correlated hyperplanes (i.e., 〈ω_k, ω_l〉 ≥ τ). Now, the actual number of clusters is found as the number of connected components in this graph. For a fixed threshold τ, the scale parameter σ acts as a regularizer. When σ is overestimated, the graph A = {A_{k,l}} is connected, resulting in only one big cluster, whereas a small σ results in many disconnected subgraphs and therefore a large total number of clusters (see Fig. 3).
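The agglomeration step only needs the C × C matrix of correlations, again computable from the Gram matrix since 〈ω_k, ω_l〉 = µ_{·k}^T K µ_{·l}, followed by a connected-components pass over the thresholded graph (14). A minimal sketch follows (plain breadth-first search, numpy only); normalizing by the hyperplane norms mirrors the unit-norm convention above and is the only assumption beyond the earlier sketches.

import numpy as np
from collections import deque

def merge_clusters(mu, K, tau):
    # Group hyperplanes whose normalized correlation exceeds tau (Eq. 14).
    W = mu.T @ K @ mu                           # W[k, l] = <w_k, w_l>
    norms = np.sqrt(np.diag(W))
    A = (W / np.outer(norms, norms)) >= tau     # adjacency matrix of Eq. (14)
    C = A.shape[0]
    labels = -np.ones(C, dtype=int)             # component index per hyperplane
    n_components = 0
    for start in range(C):                      # breadth-first search per component
        if labels[start] >= 0:
            continue
        labels[start] = n_components
        queue = deque([start])
        while queue:
            k = queue.popleft()
            for l in np.flatnonzero(A[k]):
                if labels[l] < 0:
                    labels[l] = n_components
                    queue.append(l)
        n_components += 1
    return n_components, labels                 # n_components = actual number of clusters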


Fig. 3  This figure shows the decrease of the number of clusters with respect to the scale parameter σ (N = 100, C = 20). These experiments are performed on the training samples shown in Fig. 4.

4.3 Mixing different scales

We can extend the formulation presented in (3) to handle clusters with large variations in scale. Let us denote by σ(y_l) the scale of the GMM related to the cluster y_l. We can rewrite the objective function (6) as:

    \min_{µ} \sum_{k,i} \sum_{l ≠ k, j} µ_{ik} µ_{jl} K_{σ(y_l)}(‖x_i − x_j‖)
    s.t. \sum_{i} µ_{ic} = 1,  µ_{ic} ∈ [0, 1],  i = 1,...,N,  c = 1,...,C    (15)

Following the same steps (see Sect. 4.1), we can solve (15). However, the above objective function can no longer be interpreted as the minimization of pairwise correlations between hyperplanes, since GMMs with different scale parameters correspond to hyperplanes living in different mapping spaces. It follows that the criterion (14) used to detect and merge overlapping GMMs cannot be used; instead, other criteria, for instance the Kullback–Leibler divergence, might be used.

4.4 Toy examples

These experiments are targeted to show the performance brought by this decomposition algorithm, both in terms of precision and speed. We consider a two dimensional training set of size N, generated using four Gaussian densities centered at four different locations (see Fig. 4, top-left). We cluster these data using Algorithm 2 for different values of N (40, 80, 100, 500, 1000). For N = 100, Table 1 shows all the pairwise correlations between the hyperplanes found (i) when running Algorithm 2 and (ii) when solving the whole QP (6) using the standard package LOQO (Vanderbei 1999). We can see that each hyperplane found using Algorithm 2 is highly correlated with only one hyperplane found using LOQO. Table 2 shows a comparison of the underlying run-time performances using a 1 GHz Pentium III.

Fig. 4  (Top-left) Data randomly generated from four Gaussian densities centered at four different locations (N = 1,000). When running Algorithm 2, C is fixed to 20, so the total number of parameters in the training problem is 20 × 1,000. (Other figures) Different clusters are shown with different colors and the underlying GMM vectors are shown in gray. In these experiments σ is set, respectively, from top to bottom-right to 10, 4 and 50.
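For a toy run in the spirit of Sect. 4.4, the snippet below generates four 2D Gaussian blobs and chains the illustrative algorithm2 and merge_clusters sketches given earlier. The blob centers and spreads, the threshold τ and the iteration counts are arbitrary choices of this sketch, not the exact settings behind Fig. 4 and Tables 1-2.

import numpy as np

def four_blobs(n_per_blob, seed=0):
    # 2D toy data: four Gaussian densities centered at four different locations (arbitrary values).
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 30.0], [30.0, 30.0]])
    return np.vstack([rng.normal(c, 3.0, size=(n_per_blob, 2)) for c in centers])

X = four_blobs(25)                                                  # N = 100
mu = algorithm2(X, C=20, sigma=10.0, itermax1=20, itermax2=400)     # sketch from Sect. 4.1
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2 / (2.0 * 10.0))                                      # same Gram matrix as inside algorithm2
n_clusters, labels = merge_clusters(mu, K, tau=0.9)                 # agglomeration step of Sect. 4.2
print(n_clusters)                                                   # ideally close to 4 for this toy data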


Table 1  This table shows the correlations (normalized inner products) between the hyperplanes found when (i) running Algorithm 2 and (ii) solving the QP (6) using the LOQO package, on the 2D toy data of Fig. 4 (σ = 10)

LOQO pack vs. Algorithm 2   Class 1   Class 2   Class 3   Class 4
Cluster 1                   0.994     0.043     0.331     0.187
Cluster 2                   0.036     0.999     0.238     0.133
Cluster 3                   0.284     0.313     0.985     0.058
Cluster 4                   0.152     0.147     0.057     0.998

Rows characterize the clusters found after running our algorithm while columns characterize the classes (i.e., ground truth).

Table 2  Comparison of the run-time performances on the training samples in Fig. 4 for different values of N (C = 20, σ = 10)

#S (N)                              40         80         100        500
# of parameters in the QP (N × C)   280        480        700        3,500
LOQO pack                           24.6 (s)   87.9 (s)   317.8 (s)  > 1 (h)
Algorithm 2                         0.09 (s)   0.32 (s)   0.56 (s)   24.99 (s)

5 Database categorization

Experiments have been conducted on both the Olivetti and Columbia databases in order to show the good performance of our clustering method. The Olivetti subset contains 20 persons, each one represented by ten faces. The Columbia subset contains 15 categories of objects, each one represented by 72 images. Each image from both the Olivetti and Columbia datasets is processed using histogram equalization and encoded using 20 coefficients of projection onto a subspace learned using ISOMAP (Tanenbaum et al. 2000). Notice that ISOMAP embeds a training database into a subspace such that non-linearly separable classes become more separable, homogeneous in terms of scale, and easier to cluster.

5.1 Scale

It might be easier, for database categorization, to predict the variance of the data rather than the number of clusters, mainly for large databases living in high dimensional spaces. As the number of clusters is initially set to an overestimated value (see Sect. 4.2), our method relies mainly on one parameter, the scale σ, which makes it possible to find the actual number of clusters automatically. In our experiments, we predict σ by manually sampling a few images from some categories, estimating the scales of the underlying classes, and then setting σ to the expectation of the scale over these classes. While this setting is not automatic, it has at least the advantage of introducing a priori knowledge about the variance of the data at the expense of a few interactions and reasonable effort from the user.

Fig. 5  These diagrams show the variation of the scale with respect to the number of samples per class on (left) the Olivetti and (right) the Columbia sets. The scale is given as the expectation of the distance between pairs of images taken from 3, 5, 7 and 9 classes.

Fig. 6  (Top) Decrease of the number of clusters with respect to the scale parameter. (Bottom) Probability of error with respect to σ. These observations are given using (left) the Olivetti and (right) the Columbia sets.
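One plausible reading of this heuristic, and of the quantity plotted in Fig. 5, is the expected within-class pairwise distance over a few manually sampled classes. The sketch below implements that reading; it is an assumption about the protocol, not the author's exact procedure.

import numpy as np
from itertools import combinations

def estimate_sigma(samples_per_class):
    # samples_per_class: list of arrays, each holding a few feature vectors from one sampled class.
    # Returns the expectation, over classes, of the mean within-class pairwise distance,
    # used here as a guess for the scale sigma (an interpretation of Sect. 5.1 / Fig. 5).
    class_means = []
    for S in samples_per_class:
        dists = [np.linalg.norm(a - b) for a, b in combinations(S, 2)]
        class_means.append(np.mean(dists))
    return float(np.mean(class_means))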


Figure 5 shows that this heuristic provides a "good guess" of the scale on the Olivetti and Columbia sets, as it is close to the optimal scale shown in Fig. 6. Indeed, the estimated scale in Fig. 5 is approximately 24 and 0.4 for, respectively, the Olivetti and Columbia sets. For these scales, the number of clusters found by our algorithm on Olivetti and Columbia (resp. 19 and 15, see Fig. 6, top) is close to the actual numbers, resp. 20 and 15. Furthermore, the probability of error (17) is close to its minimum for both Olivetti and Columbia (see Fig. 6, bottom).

Table 3  This table shows the distribution of the categories (ground truth) through the clusters on the Olivetti set using our clustering algorithm (a dot stands for zero)

Clusters \ Categories  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  Total
1                      9   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   9
2                      .   10  .   .   1   3   .   .   .   1   .   .   .   .   .   .   .   .   3   .   18
3                      .   .   10  .   .   .   .   .   .   .   .   .   .   .   .   .   .   5   .   .   15
4                      .   .   .   10  .   .   .   .   3   .   .   .   .   .   .   .   2   .   .   .   15
5                      .   .   .   .   9   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   10
6                      .   .   .   .   .   .   5   .   .   .   .   .   .   .   .   .   .   .   .   .   5
7                      .   .   .   .   .   .   .   .   5   .   2   .   .   .   .   .   .   .   .   .   7
8                      .   .   .   .   .   1   .   .   .   9   .   .   .   .   .   .   .   .   .   2   12
9                      .   .   .   .   .   .   .   .   .   .   3   .   .   .   .   .   .   .   .   .   3
10                     1   .   .   .   .   .   .   .   .   .   .   4   5   .   2   .   .   .   .   .   12
11                     .   .   .   .   .   .   .   .   .   .   5   .   .   .   .   .   .   .   .   .   5
12                     .   .   .   .   .   .   .   .   .   .   .   6   3   .   .   .   .   .   .   .   9
13                     .   .   .   .   .   .   .   .   .   .   .   .   2   10  .   1   .   .   .   .   13
14                     .   .   .   .   .   1   .   .   .   .   .   .   .   .   8   .   .   .   .   .   9
15                     .   .   .   .   .   1   .   .   .   .   .   .   .   .   .   9   .   .   .   .   10
16                     .   .   .   .   .   4   1   2   2   .   .   .   .   .   .   .   8   2   .   .   19
17                     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   3   .   5   8
18                     .   .   .   .   .   .   4   .   .   .   .   .   .   .   .   .   .   .   7   .   11
19                     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2   2
Total                  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10

We can see that this distribution is concentrated near the diagonal. The scale parameter σ is set to 24 and C = 30. Errors in the cardinality of the clusters are mentioned in bold.

Table 4  This table shows the distribution of the categories through the clusters on the Columbia set using our clustering algorithm (a dot stands for zero)

Clusters \ Categories  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  Total
1                      72  .   .   .   .   .   .   .   .   .   .   .   .   .   .   72
2                      .   72  .   .   .   .   .   .   .   .   .   .   .   .   .   72
3                      .   .   72  .   .   .   .   .   .   .   .   .   .   .   .   72
4                      .   .   .   72  .   .   .   .   .   .   .   .   .   .   .   72
5                      .   .   .   .   72  .   .   .   .   .   .   .   .   .   .   72
6                      .   .   .   .   .   37  .   34  .   .   .   .   .   .   .   71
7                      .   .   .   .   .   35  .   38  .   .   .   .   .   .   .   73
8                      .   .   .   .   .   .   72  .   .   .   .   .   .   .   .   72
9                      .   .   .   .   .   .   .   .   72  .   .   .   .   .   .   72
10                     .   .   .   .   .   .   .   .   .   72  .   .   .   .   .   72
11                     .   .   .   .   .   .   .   .   .   .   72  .   .   .   .   72
12                     .   .   .   .   .   .   .   .   .   .   .   72  .   .   .   72
13                     .   .   .   .   .   .   .   .   .   .   .   .   30  .   .   30
14                     .   .   .   .   .   .   .   .   .   .   .   .   42  72  72  186
Total                  72  72  72  72  72  72  72  72  72  72  72  72  72  72  72

We can see that this distribution is concentrated near the diagonal. The scale σ is set to 0.4 and C = 20. Errors in the cardinality of the clusters are mentioned in bold.
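For completeness, a table like Tables 3 and 4 is simply a cluster-by-category contingency matrix; it can be computed from cluster assignments and ground-truth labels as in the following sketch (the argument names cluster_labels and class_labels are illustrative, not from the paper).

import numpy as np

def contingency_table(cluster_labels, class_labels):
    # Rows are clusters, columns are ground-truth categories (as in Tables 3 and 4).
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    table = np.zeros((len(clusters), len(classes)), dtype=int)
    for r, k in enumerate(clusters):
        for c, y in enumerate(classes):
            table[r, c] = np.sum((cluster_labels == k) & (class_labels == y))
    return table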


Fig. 7  This figure shows 18 face prototypes, from different clusters, found after the application of our method. Each prototype corresponds to a GMM vector.

Fig. 8  This figure shows five clusters from the Olivetti database found after the application of our algorithm.


5.2 Precision

A clustering method can be objectively evaluated when the ground truth is available; otherwise the meaning of clustering can differ from one intelligent observer to another. Validity criteria are introduced in order to measure the quality of a clustering algorithm, i.e., its capacity to assign data to their actual classes. For a survey on these methods, see for example Halkidi et al. (2002).

In the presence of a ground truth, we consider in our work a simple validity criterion based on the probability of misclassification. The latter occurs either when two examples belonging to two different classes are assigned to the same cluster, or when two examples belonging to the same class are assigned to different clusters. We denote by X and Y two random variables standing respectively for the training examples and their possible classes {y_1,...,y_C}, and X', Y' are independent copies of X, Y. We denote by f(X) the index of the cluster of X in {y_1,...,y_C}. Formally, we define the misclassification error as:

    P( 1_{\{f(X) = f(X')\}} ≠ 1_{\{Y = Y'\}} ) = (∗),    (16)

here:

    (∗) = P(f(X) ≠ f(X') | Y = Y') P(Y = Y') + P(f(X) = f(X') | Y ≠ Y') P(Y ≠ Y')    (17)

Again, Fig. 6 (bottom) shows the misclassification error (17) with respect to the scale parameter σ. Tables 3 and 4 are the confusion matrices which show the distribution of the 20 and 15 categories from, respectively, the Olivetti and the Columbia sets through the clusters after the application of our clustering method. This distribution is concentrated on the diagonal, which clearly shows that most of the training data are assigned to their actual categories (see Figs. 7, 8).

5.3 Comparison

Figure 9 shows a comparison of our kernel clustering method with existing state of the art approaches including fuzzy clustering (Sahbi and Boujemaa 2005), K-means (MacQueen 1965) and hierarchical clustering (Posse 2001). These results are shown for the Olivetti database. The error (17) is measured with respect to the number of clusters. Notice that the latter is fixed for hierarchical clustering and K-means, while it corresponds to the resulting number of clusters after setting σ (see Sect. 4.2) for fuzzy clustering and our method. We clearly see that the generalization error takes its smallest value when the number of clusters is close to the actual one (i.e., 20).

Fig. 9  This figure shows a comparison of the generalization error on the Olivetti set of our method and other existing state of the art methods including fuzzy, k-means and hierarchical clustering.
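The error (17), which is the quantity reported in Figs. 6 and 9, can be estimated empirically by comparing the "same cluster" and "same class" indicators over all pairs of examples. A small vectorized sketch follows; excluding the trivial pairs X = X' is a choice of this sketch, not part of (16).

import numpy as np

def misclassification_error(cluster_labels, class_labels):
    # Empirical estimate of Eq. (16): probability that the "same cluster" and
    # "same class" indicators disagree for a random pair of distinct examples.
    # Both arguments are expected to be 1-D numpy arrays of labels.
    same_cluster = cluster_labels[:, None] == cluster_labels[None, :]
    same_class = class_labels[:, None] == class_labels[None, :]
    disagree = same_cluster != same_class
    N = len(class_labels)
    off_diag = ~np.eye(N, dtype=bool)            # drop the (X, X) pairs
    return disagree[off_diag].mean()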
6 Conclusion and future work

We introduced in this work an original approach to clustering based on a particular Gaussian mixture model (GMM). The method considers an objective function which acts as a regularizer and minimizes the overlap between the cluster GMMs. The GMM parameters are found by solving a quadratic programming problem using a new decomposition algorithm which involves trivial linear programming subproblems. The actual number of clusters is found by controlling the scale parameter of these GMMs; in practice, it turns out that predicting this parameter is easier than predicting the actual number of clusters, mainly for large databases living in high dimensional spaces.

The concept presented in this paper is different from kernel regression, which might be assimilated to density estimation, whereas our approach performs this estimation for each cluster; the proposed approach thus performs clustering and density estimation at the same time.

The validity of the method is demonstrated on toy data as well as database categorization problems. As future work, we will investigate the application of the method to handle noisy data.

References

Bach FR, Jordan MI (2003) Learning spectral clustering. Neural information processing systems
Ben-Hur A, Horn D, Siegelmann HT, Vapnik V (2000) Support vector clustering. Neural information processing systems, pp 367–373
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford
Carson C, Belongie S, Greenspan H, Malik J (2002) Blobworld: image segmentation using expectation-maximization and its application to image querying. IEEE Trans Pattern Anal Mach Intell 24(8):1026–1038
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
Dave RN (1991) Characterization and detection of noise in clustering. Pattern Recognit 12(11):657–664
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38
Fraley C (1998) Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput 20(1):270–281
Frigui H, Krishnapuram R (1997) Clustering by competitive agglomeration. Pattern Recognit 30(7)
Frigui H, Krishnapuram R (1999) A robust competitive clustering algorithm with applications in computer vision. IEEE Trans Pattern Anal Mach Intell 21(5):450–465
Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: part I and II. SIGMOD Record
Ichihashai H, Honda K, Tani N (2000) Gaussian mixture pdf approximation and fuzzy c-means clustering with entropy regularization. In: Proceedings of the 4th Asian fuzzy system symposium, pp 217–221
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, Englewood Cliffs
Le-Saux B, Boujemaa N (2002) Unsupervised robust clustering for image database categorization. In: IEEE-IAPR international conference on pattern recognition, pp 259–262
MacQueen J (1965) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1
Orengo CA, Jones DT, Thornton JM (2003) Bioinformatics—genes, proteins and computers. BIOS. ISBN 1-85996-054-5
Osuna E, Freund R, Girosi F (1997) Training support vector machines: an application to face detection. In: Proceedings of the international conference on computer vision and pattern recognition, pp 130–136
Platt J (1999) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola AJ (eds) Advances in kernel methods—support vector learning. MIT Press, Cambridge, pp 185–208
Posse C (2001) Hierarchical model-based clustering for large datasets. J Comput Graph Stat 10(3):464–486
Sahbi H, Boujemaa N (2005) Validity of fuzzy clustering using entropy regularization. In: Proceedings of the IEEE conference on fuzzy systems
Sneath PH, Sokal RR (1973) Numerical taxonomy—the principles and practice of numerical classification. W. H. Freeman, San Francisco
Tanenbaum J, de Silva V, Langford J (2000) A global geometric framework for non-linear dimensionality reduction. Science 290(5500):2319–2323
Tsao EK, Bezdek JC, Pal NR (1994) Fuzzy Kohonen clustering networks. Pattern Recognit 27(5):757–764
Vanderbei RJ (1999) LOQO: an interior point code for quadratic programming. Optim Methods Softw 11:451–484
Wu M, Schoelkopf B (2006) A local learning approach for clustering. Advances in neural information processing systems
Yang MS (1993) A survey of fuzzy clustering. Math Comput Model 18:1–16
