
Dimension Reduction for Model-based Clustering via Mixtures of Multivariate t-Distributions

by

Katherine Morris

A Thesis presented to The University of Guelph
In partial fulfilment of requirements for the degree of
Master of Science in Mathematics and Statistics

Guelph, Ontario, Canada

© Katherine Morris, July, 2012


Acknowledgements

First and foremost, I would like to extend my sincerest gratitude to my advisor, Prof. Paul McNicholas, for his guidance and support. His knowledge and enthusiasm inspired me to undertake the research presented here.

I would like to thank Prof. Ryan Browne for his contributions on my advisory committee, and also Prof. Zeny Feng and Prof. Edward Carter for taking the time to examine this work.

Also, I would like to thank Jeffrey Andrews, the author of the R package tEIGEN, for his continued assistance while using the software.

Finally, thanks go to my family and Phillip for their years of love and patience.


List of Tables

2.1 Nomenclature for models in the MCLUST family
2.2 Nomenclature for models in the tEIGEN family
4.1 Average ARI (with standard errors in brackets) based on 1000 simulations
4.2 Variable description for the coffee data
4.3 Model-based clustering results for the coffee data
4.4 A classification table for the best tMMDR model fitted to the coffee data
4.5 Variable description for the wines data
4.6 Model-based clustering results for the wine data
4.7 A classification table for the best tMMDR model fitted to the wine data
4.8 Variable description for the crabs data
4.9 Model-based clustering results for the crabs data
4.10 A classification table for the best tMMDR model fitted to the crabs data
4.11 Variable description for the banknotes data
4.12 Model-based clustering results for the banknotes data
4.13 A classification table for the best tMMDR model fitted to the banknotes data
4.14 Variable description for the diabetes data
4.15 Model-based clustering results for the diabetes data
4.16 A classification table for the best tMMDR model fitted to the diabetes data
4.17 Summary of the results for the real data


List of Figures

2.1 Contour and perspective plots for a bivariate Normal distribution with $\mu = [0, 0]^\top$ and $\Sigma = \mathrm{diag}(2)$
2.2 Comparison of a Normal and t-distribution
4.1 Pairs plot for the three overlapping clusters in Model 1 (n = 300 data points)
4.2 Pairs plot for the three overlapping clusters in Model 2 (n = 300 data points)
4.3 Pairs plot for the three overlapping clusters in Model 3 (n = 300 data points)
4.4 Plots of tMMDR directions for the coffee data
4.5 Plots of tMMDR directions for the wine data
4.6 Plots of tMMDR directions for the crabs data
4.7 Plots of tMMDR directions for the banknotes data
4.8 Plots of tMMDR directions for the diabetes data


Chapter 1

Introduction

Clustering algorithms based on probability models are a popular choice for exploring structures in modern data sets, which continue to grow in size and complexity. In particular, the model-based approach assumes that the data are generated by a finite mixture of probability distributions such as the multivariate Gaussian distribution. These models have been shown to be a powerful tool for applications in bioinformatics, finance, medicine and survey research (McLachlan and Basford, 1988; Banfield and Raftery, 1993; Celeux and Govaert, 1995, for example). In the probability framework, the issues of selecting the 'best' clustering method and determining the appropriate number of clusters are reduced to model selection problems.

Another choice for model-based clustering is the t-distribution which, although less frequently used, has the potential to outperform its Gaussian analogue. In many situations, the tails of the normal distribution are thinner than required and the estimates can be affected by outliers. McLachlan and Peel (1998) introduced the idea of fitting mixtures of multivariate t-distributions to the data, which provided a more robust approach than fitting normal mixture models, as outliers were given reduced weight in the calculation of parameters.

Reducing dimensionality within the model-based clustering paradigm will be the focus of this thesis. The work of Scrucca (2010) on dimension reduction for model-based clustering in the Gaussian framework, called GMMDR, is applied to mixtures of multivariate t-distributions. A new method, tMMDR, is proposed and its steps are summarized as follows.

1. Fit a t-distribution mixture model to the data using the tEIGEN family (Andrews and McNicholas, 2012a).

2. Find the smallest subspace which captures the clustering information contained in the data.


Chapter 2

Background

Finite mixtures of distributions provide a mathematical approach for fitting statistical models to a wide variety of random phenomena. McLachlan and Basford (1988) and McLachlan and Peel (2000) give an extensive description of finite mixture models, which have become increasingly popular due to their flexibility. The most popular application of these models occurs for scenarios where data exhibit group structure or where the data can be investigated for such structure.

A p-dimensional random vector X is said to arise from a parametric finite mixture distribution if, for all $x \subset X$, one can write
$$p(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g\, p_g(x \mid \theta_g),$$
where G is the number of components, the $\pi_g$ are mixing proportions such that
$$\sum_{g=1}^{G} \pi_g = 1 \quad \text{and} \quad \pi_g > 0,$$
and $\vartheta = (\pi_1, \ldots, \pi_G, \theta_1, \ldots, \theta_G)$ is the parameter vector. The $p_g(x \mid \theta_g)$ are called component densities.
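For concreteness, a minimal R sketch of evaluating such a finite mixture density is shown below. It assumes the mvtnorm package for the multivariate Gaussian component densities used in the next section; the two-component parameter values are made-up illustrative numbers, not anything from the thesis.

```r
library(mvtnorm)  # for dmvnorm(); assumed available

# Illustrative two-component mixture in p = 2 dimensions
pi_g  <- c(0.6, 0.4)                                  # mixing proportions, sum to 1
mu    <- list(c(0, 0), c(3, 3))                       # component means
Sigma <- list(diag(2), matrix(c(1, 0.5, 0.5, 1), 2))  # component covariances

# Mixture density p(x | theta) = sum_g pi_g * f(x | mu_g, Sigma_g)
mixture_density <- function(x) {
  sum(sapply(seq_along(pi_g), function(g) {
    pi_g[g] * dmvnorm(x, mean = mu[[g]], sigma = Sigma[[g]])
  }))
}

mixture_density(c(1, 1))  # density of the mixture at the point (1, 1)
```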


2.1 Gaussian Mixture Models

From McLachlan and Peel (2000), we know that the density of a multivariate Gaussian mixture model is given by
$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_N(x \mid \mu_g, \Sigma_g),$$
where $f_N(x \mid \mu_g, \Sigma_g)$ is the density of a multivariate Gaussian distribution with mean $\mu_g$ and covariance matrix $\Sigma_g$, i.e.
$$f_N(x \mid \mu_g, \Sigma_g) = \frac{\exp\{-\tfrac{1}{2}(x - \mu_g)^\top \Sigma_g^{-1} (x - \mu_g)\}}{(2\pi)^{p/2} |\Sigma_g|^{1/2}}. \qquad (2.1)$$

Figure 2.1 illustrates the contour and perspective plots for a bivariate normal distribution with $\mu = [0, 0]^\top$ and $\Sigma = \mathrm{diag}(2)$. By varying $\Sigma$, one can obtain different orientations, shapes and volumes for the distribution.

[Figure 2.1: Contour and perspective plots for a bivariate Normal distribution with $\mu = [0, 0]^\top$ and $\Sigma = \mathrm{diag}(2)$.]

2.2 Model-based Clustering with MCLUST

Model-based clustering is a method which assumes that the data are generated by a mixture of underlying probability distributions in which each component represents a different group or cluster. The mixture likelihood is given by
$$L_M(\theta_1, \ldots, \theta_G; \pi_1, \ldots, \pi_G \mid x) = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g f_N(x_i \mid \theta_g), \qquad (2.2)$$
where $x = \{x_1, \ldots, x_n\}$ represents the data.

For the Gaussian density in (2.1), clusters are ellipsoidal and centred at the means $\mu_g$. Other geometric properties are determined by the covariances $\Sigma_g$. Banfield and Raftery (1993) developed a model-based framework for clustering by using the following eigenvalue decomposition of the covariance matrix
$$\Sigma_g = \lambda_g D_g A_g D_g^\top, \qquad (2.3)$$


where

• $D_g$ is the orthogonal matrix of eigenvectors which determines the orientation of the principal components of $\Sigma_g$;

• $A_g$ is a diagonal matrix with elements proportional to the eigenvalues of $\Sigma_g$ which determines the shape of the density contours;


• $\lambda_g$ is a scalar which specifies the volume of the corresponding ellipsoid.

The orientation, volume and shape of the distributions are estimated from the data and can either vary between clusters or be constrained across clusters. There are four commonly used assumptions for the covariance matrices:

1. $\Sigma_g = \Sigma = \lambda I$, where clusters are spherical and have equal volumes;

2. $\Sigma_g = \Sigma = \lambda D A D^\top$, where all clusters have the same shape, volume and orientation;

3. $\Sigma_g = \lambda_g D_g A_g D_g^\top$, where the shape, volume and orientation are allowed to vary;

4. $\Sigma_g = \lambda D_g A D_g^\top$, where only the orientations of the clusters may differ.

By imposing constraints on the elements of (2.3), a large range of models is obtained, as discussed in Celeux and Govaert (1995). Several of these models, called the MCLUST family (Fraley and Raftery, 1999), are available through the R package mclust by Fraley and Raftery (2006). A brief description of the types of models included in this particular family appears in Table 2.1. A short sketch of recovering the volume, shape and orientation terms of (2.3) from a covariance matrix is given after the EM steps below.

Clustering via mixture models is done through the expectation-maximization (EM) algorithm (Dempster et al., 1977), which is an iterative procedure for finding maximum likelihood estimates when data are incomplete or treated as being incomplete. The complete-data log-likelihood for model-based clustering with MCLUST is
$$l(\vartheta) = \sum_{i=1}^{n} \sum_{g=1}^{G} z_{ig} \log[\pi_g f_N(x_i \mid \theta_g)]. \qquad (2.4)$$

The EM algorithm involves the iteration of two steps until convergence is reached, namely

1. the expectation step (E-step), which computes the expected value of the complete-data log-likelihood, and

2. the maximization step (M-step), which maximizes the expected value of the complete-data log-likelihood with respect to the model parameters.
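As a concrete illustration of the decomposition (2.3), the R sketch below recovers the volume, shape and orientation terms from a given covariance matrix. The normalization det(A) = 1, so that λ alone carries the volume, is the convention assumed here.

```r
# Decompose a covariance matrix as Sigma = lambda * D %*% A %*% t(D),
# following (2.3); A is normalized so that det(A) = 1 (assumed convention).
decompose_covariance <- function(Sigma) {
  p <- nrow(Sigma)
  e <- eigen(Sigma, symmetric = TRUE)
  lambda <- prod(e$values)^(1 / p)   # volume: det(Sigma)^(1/p)
  A <- diag(e$values / lambda)       # shape: diagonal matrix with det(A) = 1
  D <- e$vectors                     # orientation: orthogonal eigenvectors
  list(lambda = lambda, A = A, D = D)
}

# Example: an ellipsoidal covariance matrix
Sigma <- matrix(c(2, 0.7, 0.7, 1), nrow = 2)
dec <- decompose_covariance(Sigma)
# Reconstruction check: should equal Sigma (up to numerical error)
dec$lambda * dec$D %*% dec$A %*% t(dec$D)
```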


Table 2.1: Nomenclature for models in the MCLUST family (Fraley and Raftery, 1999)

Model  Distribution  Volume    Shape     Orientation  Covariance          Free covariance parameters
EII    Spherical     Equal     Equal     NA           λI                  1
VII    Spherical     Variable  Equal     NA           λ_g I               G
EEI    Diagonal      Equal     Equal     Coord. axes  λA                  p
VEI    Diagonal      Variable  Equal     Coord. axes  λ_g A               p + G − 1
EVI    Diagonal      Equal     Variable  Coord. axes  λA_g                Gp − G + 1
VVI    Diagonal      Variable  Variable  Coord. axes  λ_g A_g             Gp
EEE    Ellipsoidal   Equal     Equal     Equal        λDAD^⊤              p(p + 1)/2
EEV    Ellipsoidal   Equal     Equal     Variable     λD_g AD_g^⊤         Gp(p + 1)/2 − (G − 1)p
VEV    Ellipsoidal   Variable  Equal     Variable     λ_g D_g AD_g^⊤      Gp(p + 1)/2 − (G − 1)(p − 1)
VVV    Ellipsoidal   Variable  Variable  Variable     λ_g D_g A_g D_g^⊤   Gp(p + 1)/2

Traditionally, the missing data are represented by component indicator variables $z_{ig}$ defined as
$$z_{ig} = \begin{cases} 1 & \text{if observation } x_i \text{ belongs to component } g, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.5)$$

The main objective of model-based clustering is to estimate these $z_{ig}$ under the assumptions that

• the density of an observation $x_i$ given $z_i$ is $\prod_{g=1}^{G} f_N(x_i \mid \theta_g)^{z_{ig}}$;

• each $z_i$ is independent and identically distributed according to a multinomial distribution of one draw on G categories with probabilities $\pi_1, \ldots, \pi_G$.

At each E-step, the $z_{ig}$ are updated by their conditional expected values
$$\hat{z}_{ig} = \frac{\pi_g f_N(x \mid \mu_g, \Sigma_g)}{\sum_{j=1}^{G} \pi_j f_N(x \mid \mu_j, \Sigma_j)}, \qquad (2.6)$$
where $f_N$ has the form given in (2.1).
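A minimal R sketch of the E-step update (2.6) follows. It assumes the mvtnorm package for the Gaussian component densities; the parameter values in the example call are made up purely for illustration.

```r
library(mvtnorm)  # for dmvnorm(); assumed available

# E-step of (2.6): responsibilities z_hat[i, g] for an n x p data matrix X,
# given current mixing proportions pi_g, means mu (list) and covariances Sigma (list).
e_step_responsibilities <- function(X, pi_g, mu, Sigma) {
  G <- length(pi_g)
  # n x G matrix with entries pi_g * f_N(x_i | mu_g, Sigma_g)
  dens <- sapply(seq_len(G), function(g) {
    pi_g[g] * dmvnorm(X, mean = mu[[g]], sigma = Sigma[[g]])
  })
  dens / rowSums(dens)  # normalize each row so the responsibilities sum to 1
}

# Illustrative call with made-up parameters
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)
z_hat <- e_step_responsibilities(X,
                                 pi_g  = c(0.5, 0.5),
                                 mu    = list(c(0, 0), c(2, 2)),
                                 Sigma = list(diag(2), diag(2)))
```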


In the M-step of the algorithm, the means, mixing proportions and covariance structure get updated. The details appear in several papers, for example Celeux and Govaert (1995) or Fraley and Raftery (1998).

After the EM algorithm converges, cluster memberships are estimated via the maximum a posteriori (MAP) classification given by
$$\mathrm{MAP}\{\hat{z}_{ig}\} = \begin{cases} 1 & \text{if } \max_g\{\hat{z}_{ig}\} \text{ occurs at component } g, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.7)$$

Clustering via mixture modelling allows for the use of Bayes factors (Kass and Raftery, 1995) to compare models. This gives a systematic method of selecting the parametrization of the model and the number of mixture components. When the EM algorithm is used to find the maximum mixture likelihood, an approximation to twice the log Bayes factor, called the Bayesian information criterion (BIC; Schwarz, 1978), is more suitable:
$$\mathrm{BIC} = 2\, l(x, \hat{\vartheta}) - p \log(n), \qquad (2.8)$$
where $l(x, \hat{\vartheta})$ is the maximized log-likelihood, $\hat{\vartheta}$ is the maximum likelihood estimate of $\vartheta$, p is the number of parameters and n is the number of observations. The use of BIC for model selection in model-based clustering is well established in its application, as can be seen in Leroux (1992), Kass and Wasserman (1995) and Keribin (2000), for example.

2.3 Mixtures of Multivariate t-Distributions

From McLachlan and Peel (1998) and Peel and McLachlan (2000), we know that the density of a multivariate t-distribution mixture model is given by
$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_t(x \mid \mu_g, \Sigma_g, \nu_g), \qquad (2.9)$$
where the $\pi_g$ are the mixing proportions and
$$f_t(x \mid \mu_g, \Sigma_g, \nu_g) = \frac{\Gamma\!\left(\frac{\nu_g + p}{2}\right) |\Sigma_g|^{-1/2}}{(\pi \nu_g)^{p/2}\, \Gamma\!\left(\frac{\nu_g}{2}\right) \left[1 + \frac{\delta(x, \mu_g \mid \Sigma_g)}{\nu_g}\right]^{(\nu_g + p)/2}} \qquad (2.10)$$
is the density of a multivariate t-distribution with mean $\mu_g$, covariance matrix $\Sigma_g$ and degrees of freedom $\nu_g$.


Also, $\delta(x, \mu_g \mid \Sigma_g) = (x - \mu_g)^\top \Sigma_g^{-1} (x - \mu_g)$ is the squared Mahalanobis distance between x and $\mu_g$.

As in the univariate analogue, the multivariate t-distribution becomes asymptotically normal as the number of degrees of freedom tends to infinity.

[Figure 2.2: Normal distribution with $\mu = 0$ and $\Sigma = 2$ superimposed on a t-distribution with the same $\mu$ and $\Sigma$ and three degrees of freedom.]

2.4 Model-based Clustering with tEIGEN

When clustering continuous multivariate data, attention has focussed on the use of multivariate normal components because of their computational convenience (they can be easily modelled via the EM algorithm as outlined in Section 2.2). However, in many applications, the tails of the normal distribution are often thinner than required and the estimates of the component means and covariance matrices can be affected by outliers. Figure 2.2 illustrates the difference in the tails of the Normal and t-distributions via a simple example.


McLachlan and Peel (1998) introduced the idea of fitting mixtures of multivariate t-distributions to the data. This provided a more robust approach than fitting normal mixture models, as outliers were given reduced weight in the calculation of parameters, i.e., via the Mahalanobis distance term in the density (2.10).

Recent work on model-based clustering using the multivariate t-distribution has been contributed by McLachlan et al. (2007), Greselin and Ingrassia (2010a,b), Andrews and McNicholas (2011a,b) and Andrews et al. (2011).

Andrews and McNicholas (2012a) used the eigen-decomposition of the multivariate t-distribution covariance matrix to build a family of 20 mixtures which are called the tEIGEN family (as shown in Table 2.2). These take into account the same constraints as the MCLUST family of models but also include constraints on the degrees of freedom.

With the tEIGEN family, clustering is done via the expectation-conditional maximization (ECM) algorithm of Meng and Rubin (1993). In the ECM algorithm, the M-step is replaced by a number of conditional maximization steps which are more efficient from a computational standpoint. The missing data are represented by the same indicator variables $z_{ig}$ (2.5) as in the MCLUST case, but with the addition of characteristic weights $u_{ig}$.

Under the model-based clustering framework, the observations do not have known group memberships. Thus the $z_{ig}$ must be initialized in order to run the ECM algorithm, and there are several options for this. For MCLUST, Fraley and Raftery (2002) use a model-based agglomerative hierarchical clustering procedure to obtain starting values. For parsimonious Gaussian mixture models (PGMM), McNicholas and Murphy (2008) use a number of random starting values on their most constrained model
$$\Sigma_g = \Lambda \Lambda^\top + \Psi,$$
where $\Lambda$ is a $(p \times q)$ matrix of parameters (factor loadings), typically with $q \ll p$, and $\Psi$ is a diagonal noise matrix. The results are then picked from the iteration with the largest BIC value to initialize for each number of clusters G. Andrews and McNicholas (2012a) suggest using the agglomerative hierarchical initialization method of MCLUST in order to avoid algorithm failure when clustering with mixtures of multivariate t-distributions.
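To illustrate the initialization choice just described, the hedged sketch below obtains model-based agglomerative hierarchical starting partitions with the mclust functions hc() and hclass(), in the spirit of the MCLUST initialization. How the resulting labels are handed to the ECM routine is left abstract, since that detail depends on the fitting software's interface; the iris data are used purely as a stand-in for a numeric data matrix.

```r
library(mclust)  # provides hc() and hclass() for model-based hierarchical agglomeration

set.seed(123)
X <- scale(iris[, 1:4])          # any numeric data matrix; iris used only as an example

# Model-based agglomerative hierarchical clustering (as used by MCLUST for starting values)
hc_tree <- hc(data = X, modelName = "VVV")

# Hard initial classifications for G = 1, ..., 4 clusters
init_classes <- hclass(hc_tree, G = 1:4)
head(init_classes)               # one column of starting labels per value of G

# These starting labels would then be converted to initial z_ig indicator matrices
# and passed to the ECM algorithm for the chosen t-mixture model.
```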


Table 2.2: Nomenclature for models in the tEIGEN family: 'C' indicates that a constraint is imposed, 'U' indicates that a constraint is not imposed, 'I' indicates the identity matrix of suitable dimension (Andrews and McNicholas, 2012a)

Model  λ_g = λ  D_g = D  A_g = A  ν_g = ν  Free covariance parameters
CIIC   C        I        I        C        1 + 1
CIIU   C        I        I        U        1 + G
UIIC   U        I        I        C        (G − 1) + 1
UIIU   U        I        I        U        (G − 1) + G
CICC   C        I        C        C        p + 1
CICU   C        I        C        U        p + G
UICC   U        I        C        C        p + (G − 1) + 1
UICU   U        I        C        U        p + (G − 1) + G
CIUC   C        I        U        C        Gp − (G − 1) + 1
CIUU   C        I        U        U        Gp − (G − 1) + G
UIUC   U        I        U        C        Gp + 1
UIUU   U        I        U        U        Gp + G
CCCC   C        C        C        C        [p(p + 1)/2] + 1
CCCU   C        C        C        U        [p(p + 1)/2] + G
UCCC   U        C        C        C        [p(p + 1)/2] + (G − 1) + 1
UCCU   U        C        C        U        [p(p + 1)/2] + (G − 1) + G
CUCC   C        U        C        C        G[p(p + 1)/2] − (G − 1)p + 1
CUCU   C        U        C        U        G[p(p + 1)/2] − (G − 1)p + G
UUCC   U        U        C        C        G[p(p + 1)/2] − (G − 1)(p − 1) + 1
UUCU   U        U        C        U        G[p(p + 1)/2] − (G − 1)(p − 1) + G
CUUC   C        U        U        C        G[p(p + 1)/2] − (G − 1) + 1
CUUU   C        U        U        U        G[p(p + 1)/2] − (G − 1) + G
UUUC   U        U        U        C        G[p(p + 1)/2] + 1
UUUU   U        U        U        U        G[p(p + 1)/2] + G


For the tEIGEN models, parameter estimation is comparable to the Gaussian case but it includes two additional steps:

• incorporation of the weights $u_{ig}$;

• estimation of the degrees of freedom.

Andrews and McNicholas (2012a) define the complete-data log-likelihood for model-based clustering with tEIGEN for n p-dimensional observations as
$$l(\vartheta) = \sum_{g=1}^{G} \sum_{i=1}^{n} z_{ig} \log\left[\pi_g\, \gamma\!\left(u_{ig} \,\Big|\, \frac{\nu_g}{2}, \frac{\nu_g}{2}\right) f_N\!\left(x_i \,\Big|\, \mu_g, \frac{\Sigma_g}{u_{ig}}\right)\right], \qquad (2.11)$$
where $\gamma$ is the gamma density given by
$$\gamma(y \mid \alpha, \beta) = \frac{\beta^\alpha y^{\alpha-1} \exp(-\beta y)}{\Gamma(\alpha)}\, I_{\{y>0\}},$$
for $\alpha > 0$, $\beta > 0$, and indicator function
$$I_{\{y>0\}} = \begin{cases} 1 & \text{for } y > 0, \\ 0 & \text{otherwise.} \end{cases}$$

At each E-step, the $z_{ig}$ are updated (cf. 2.6) by their conditional expected values
$$\hat{z}_{ig} = \frac{\pi_g f_t(x \mid \mu_g, \Sigma_g, \nu_g)}{\sum_{j=1}^{G} \pi_j f_t(x \mid \mu_j, \Sigma_j, \nu_j)}, \qquad (2.12)$$
and the characteristic weights are updated as follows:
$$\hat{u}_{ig} = \frac{\nu_g + p}{\nu_g + \delta(x_i, \mu_g \mid \Sigma_g)},$$
where $\delta(x_i, \mu_g \mid \Sigma_g)$ is the squared Mahalanobis distance between $x_i$ and $\mu_g$.

The degrees of freedom for the case where $\nu_g = \nu$ are updated by solving the equation
$$-\varphi\!\left(\frac{\hat{\nu}_{\text{new}}}{2}\right) + \log\!\left(\frac{\hat{\nu}_{\text{new}}}{2}\right) + 1 + \frac{1}{n} \sum_{g=1}^{G} \sum_{i=1}^{n} \hat{z}_{ig}\left(\log \hat{u}_{ig} - \hat{u}_{ig}\right) + \varphi\!\left(\frac{\hat{\nu}_{\text{old}} + p}{2}\right) - \log\!\left(\frac{\hat{\nu}_{\text{old}} + p}{2}\right) = 0,$$
where $\varphi(\cdot)$ denotes the digamma function.
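The E-step quantities and the constrained degrees-of-freedom update can be sketched in R as follows. This is a simplified illustration, not the tEIGEN implementation: the component density is evaluated with mvtnorm::dmvt(), treating each $\Sigma_g$ as the scale matrix of (2.10), and the root-finding interval passed to uniroot() is an assumption rather than a guaranteed bracket.

```r
library(mvtnorm)  # for dmvt(); assumed available

# One E-step for a mixture of multivariate t-distributions (illustrative sketch).
# X: n x p data matrix; pi_g, mu, Sigma, nu: current parameter values.
t_e_step <- function(X, pi_g, mu, Sigma, nu) {
  p <- ncol(X); G <- length(pi_g)

  # Responsibilities, as in (2.12)
  dens <- sapply(seq_len(G), function(g) {
    pi_g[g] * dmvt(X, delta = mu[[g]], sigma = Sigma[[g]], df = nu[g], log = FALSE)
  })
  z_hat <- dens / rowSums(dens)

  # Characteristic weights u_hat[i, g] = (nu_g + p) / (nu_g + delta(x_i, mu_g | Sigma_g))
  u_hat <- sapply(seq_len(G), function(g) {
    (nu[g] + p) / (nu[g] + mahalanobis(X, center = mu[[g]], cov = Sigma[[g]]))
  })

  list(z = z_hat, u = u_hat)
}

# Constrained degrees-of-freedom update (nu_g = nu): solve the digamma equation
# displayed above for nu_new by one-dimensional root finding.
update_nu <- function(z_hat, u_hat, nu_old, p) {
  n <- nrow(z_hat)
  const <- 1 + sum(z_hat * (log(u_hat) - u_hat)) / n +
    digamma((nu_old + p) / 2) - log((nu_old + p) / 2)
  f <- function(nu) -digamma(nu / 2) + log(nu / 2) + const
  uniroot(f, lower = 2.001, upper = 200)$root  # search interval is an assumption
}
```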


Chapter 3

Methodology

There are several approaches to using mixture models in a clustering context, which were discussed in Chapter 2. McNicholas (2011) presents a review of work in model-based clustering with a particular focus on two families of Gaussian mixture models, namely MCLUST (Fraley and Raftery, 2002) and PGMM (McNicholas and Murphy, 2008).

3.1 Dimension Reduction for Mixtures of Multivariate Gaussian Distributions (GMMDR)

Scrucca (2010) proposed a novel approach to model-based clustering, namely dimension reduction for mixtures of multivariate Gaussian distributions (GMMDR). The main idea is to find a reduced subspace which captures most of the clustering structure in the data. By following the work of Li (1991, 2000) on sliced inverse regression (SIR), one can obtain information on the dimension reduction subspace from two sources:

• the variation on group means;

• the variation on group covariances (depending on the estimated mixture model).

Classical procedures for reducing the dimensions in the data are principal components analysis and factor analysis. These techniques lower dimensionality by forming linear combinations of the variables. In terms of visualizing any potential clustering structure, neither method is particularly useful.

The proposed method reduces dimensionality by identifying a set of linear combinations of the original variables, ordered by importance via their associated eigenvalues, which capture most of the cluster structure in the data. Observations are then projected onto the reduced subspace and their plots help visualize the clustering structure.


The new GMMDR variables contain most of the clustering information of the original data and they can be reduced further to improve performance.

3.2 Dimension Reduction for Mixtures of Multivariate t-Distributions (tMMDR)

In this section, we follow the work of Scrucca (2010) and apply it to mixtures of multivariate t-distributions. Suppose a G-component t-distribution mixture model (tMM) is imposed on a set of data with n observations and p variables. Recall from (2.9) in the previous chapter that this model takes the form
$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_t(x \mid \mu_g, \Sigma_g, \nu_g). \qquad (3.1)$$

3.2.1 Clustering on a Dimension Reduced Subspace

Consider a $(p \times 1)$ vector X of random variables and a discrete random variable Y taking G distinct values to indicate the G clusters. Let $\beta$ denote a fixed $(p \times d)$ matrix with $d \le p$ such that
$$Y \perp X \mid \beta^\top X. \qquad (3.2)$$
The conditional independence in (3.2) tells us that the distribution of $Y \mid X$ is the same as the distribution of $Y \mid \beta^\top X$ for all values of X in its marginal sample space. As a consequence, we can replace the $(p \times 1)$ vector X with the $(d \times 1)$ vector $\beta^\top X$ without loss of clustering information. If $d < p$, then we have reduced the dimension of the predictor vector.

Li (1991) defines the basis for the subspace $S(\beta)$ given by $\beta$ as a dimension-reduction subspace (DRS) for the regression of Y on X. A minimum DRS may not necessarily be unique, but if several such subspaces exist, then they all have the same dimension.

The assumption in (3.2) implies that $P(Y = g \mid X) = P(Y = g \mid \beta^\top X)$, so the density for the g-th component of the mixture model in (3.1) can be written as


$$f_g(x) = \frac{P(Y = g \mid x)\, f_X(x)}{P(Y = g)} = \frac{P(Y = g \mid \beta^\top x)\, f_X(x)}{P(Y = g)} = f_g(\beta^\top x)\, \frac{f_X(x)}{f_{\beta^\top X}(\beta^\top x)}. \qquad (3.3)$$

Also, for any two groups, i and j say, the ratio of the tMM densities is
$$\frac{f_i(x)}{f_j(x)} = \frac{f_i(\beta^\top x)}{f_j(\beta^\top x)}.$$
Given (3.2), the ratio of the conditional densities is the same whether it is computed on the original variable space or on the DRS. Thus the clustering information is contained completely in $S(\beta)$.

We know that an observation is assigned to a cluster g by the MAP (2.7), i.e., to the cluster for which the conditional probability given the data is a maximum, such that
$$\arg\max_g\, P(Y = g \mid x) = \frac{\pi_g f_g(x)}{\sum_{j=1}^{G} \pi_j f_j(x)},$$
which is equivalent to
$$\arg\max_g\, P(Y = g \mid \beta^\top x) = \frac{\pi_g f_g(\beta^\top x)}{\sum_{j=1}^{G} \pi_j f_j(\beta^\top x)},$$
by (3.3). Hence, the assignment of an observation to a cluster is unchanged when performed on the DRS.

3.2.2 Estimation of the tMMDR Directions

Given a tMM (3.1), we wish to find the smallest subspace which captures the clustering information contained in the data. Thus, we need to identify those directions where the cluster means $\mu_g$ vary as much as possible, provided that each direction is $\Sigma$-orthogonal to the others.


Specifically, we need to maximize
$$\arg\max_\beta\ \beta^\top \Sigma_B \beta \quad \text{subject to} \quad \beta^\top \Sigma \beta = I_d, \qquad (3.4)$$
where

• $\mu = \sum_{g=1}^{G} \pi_g \mu_g$ is the global mean,

• $\Sigma_B = \sum_{g=1}^{G} \pi_g (\mu_g - \mu)(\mu_g - \mu)^\top$ is the between-cluster covariance matrix,

• $\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$ is the covariance matrix,

• $\beta \in \mathbb{R}^{p \times d}$ is the spanning matrix,

• $I_d$ is the $(d \times d)$ identity matrix.

The solution to the constrained optimization problem in (3.4) is given by the eigen-decomposition of the kernel matrix
$$M_I \equiv \Sigma_B \quad \text{with respect to } \Sigma. \qquad (3.5)$$
The eigenvectors corresponding to the first d largest eigenvalues, $[v_1, \ldots, v_d] \equiv \beta$, provide a basis for the subspace $S(\beta)$ which shows the maximal variation among cluster means. The number of directions which span this subspace is

• $d = \min(p, G - 1)$ for models which assume equal within-cluster covariance matrices, or

• $d \le p$ for models which do not assume equal within-cluster covariance matrices.

From Scrucca (2010), we know that this is similar to the sliced inverse regression algorithms SIR (Li, 1991) and SIR II (Li, 2000). These are dimension reduction procedures that use information from the inverse mean function and differences in class covariance matrices. In order to find directions which account for variation in both cluster means and cluster covariances, Scrucca (2010) uses these algorithms to devise the following kernel matrix:


$$M = M_I \Sigma^{-1} M_I + M_{II}, \qquad (3.6)$$
where $M_I \equiv \Sigma_B$ as before,
$$M_{II} = \sum_{g=1}^{G} \pi_g (\Sigma_g - \bar{\Sigma}) \Sigma^{-1} (\Sigma_g - \bar{\Sigma})^\top, \quad \text{and} \quad \bar{\Sigma} = \sum_{g=1}^{G} \pi_g \Sigma_g$$
is the pooled within-cluster covariance matrix.

Thus the kernel matrix in (3.6) contains information on how both cluster means and cluster covariances vary. Now the optimization problem in (3.4) is given by
$$\arg\max_\beta\ \beta^\top M \beta \quad \text{subject to} \quad \beta^\top \Sigma \beta = I_d, \qquad (3.7)$$
and it is solved using the generalized eigen-decomposition
$$M v_i = l_i \Sigma v_i, \qquad (3.8)$$
where
$$v_i^\top \Sigma v_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}$$
and $l_1 \ge l_2 \ge \ldots \ge l_d > 0$.

Definition 3.1 The tMMDR directions are the eigenvectors $[v_1, \ldots, v_d] \equiv \beta$ which form the basis of the dimension reduction subspace $S(\beta)$ and constitute the solution to (3.7).

Suppose $S(\beta)$ is the subspace spanned by the tMMDR directions from (3.8), and $\mu_g$, $\Sigma_g$ are the mean and covariance for cluster g. Then the projections of the parameters onto $S(\beta)$ are given by $\beta^\top \mu_g$ and $\beta^\top \Sigma_g \beta$.


Definition 3.2 The tMMDR variables, Z, are the projections of the $(n \times p)$ data matrix X onto the subspace $S(\beta)$ and can be computed as $Z = X\beta$.

The raw coefficients of the estimated directions are uniquely determined only up to multiplication by a scalar, while the associated directions from the origin are unique. Hence, they can be normalized to have unit length:
$$\beta_j \equiv \frac{v_j}{\|v_j\|} \quad \text{for } j = 1, \ldots, d.$$
If we set $D = \mathrm{diag}(V^\top V)$, where V is the matrix of eigenvectors from (3.8), then $\beta \equiv V D^{-1/2}$. For the tMMDR variables we have
$$\mathrm{Cov}(Z) = \beta^\top \Sigma \beta = D^{-1/2} V^\top \Sigma V D^{-1/2} = D^{-1} = \mathrm{diag}\left(\frac{1}{\|v_j\|^2}\right).$$
We can see that the tMMDR variables are uncorrelated, while the tMMDR directions are orthogonal with respect to the $\Sigma$-inner product.

For an $(n \times p)$ sample data matrix X, the sample version $\hat{M}$ of the kernel in (3.6) is obtained using the corresponding estimates from the fit of a t-distribution mixture model via the EM algorithm. Then the tMMDR directions are calculated from the generalized eigen-decomposition of $\hat{M}$ with respect to $\hat{\Sigma}$. The tMMDR directions are ordered based on their eigenvalues, which means that directions associated with approximately zero eigenvalues can be discarded in practice, since clusters will overlap a lot along these directions. Also, their contribution to the overall position of the sample points in an eigenvector expansion is approximately zero, so they provide little positional information, and their associated eigenvectors are liable to be unstable in any case.
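The estimation just described can be sketched directly in R. The function below takes fitted mixture quantities (mixing proportions, component means and covariances) together with the data, builds the sample kernel matrix of (3.6), solves the generalized eigenproblem (3.8) through a Cholesky reduction, and returns the normalized directions and the tMMDR variables Z = Xβ. It is a minimal sketch of the computation, not the tEIGEN or Scrucca implementations.

```r
# Sketch of tMMDR direction estimation from fitted mixture quantities.
# X: n x p data matrix; pi_g: mixing proportions; mu: list of component means;
# Sigma_g: list of component covariance matrices; d: number of directions kept.
tmmdr_directions <- function(X, pi_g, mu, Sigma_g, d) {
  n <- nrow(X); G <- length(pi_g)
  mu_bar <- Reduce(`+`, Map(`*`, pi_g, mu))                 # global mean
  Sigma  <- crossprod(sweep(X, 2, mu_bar)) / n              # overall covariance

  # Between-cluster covariance (M_I) and covariance-variation term (M_II)
  M_I <- Reduce(`+`, lapply(seq_len(G), function(g) {
    pi_g[g] * tcrossprod(mu[[g]] - mu_bar)
  }))
  Sigma_pooled <- Reduce(`+`, Map(`*`, pi_g, Sigma_g))
  Sigma_inv <- solve(Sigma)
  M_II <- Reduce(`+`, lapply(seq_len(G), function(g) {
    diff <- Sigma_g[[g]] - Sigma_pooled
    pi_g[g] * diff %*% Sigma_inv %*% t(diff)
  }))
  M <- M_I %*% Sigma_inv %*% M_I + M_II                     # kernel matrix (3.6)

  # Generalized eigenproblem M v = l Sigma v via Cholesky reduction
  R <- chol(Sigma)                                          # Sigma = t(R) %*% R
  A <- solve(t(R)) %*% M %*% solve(R)
  ev <- eigen((A + t(A)) / 2, symmetric = TRUE)
  V <- solve(R, ev$vectors[, seq_len(d), drop = FALSE])     # Sigma-orthonormal v_j

  beta <- sweep(V, 2, sqrt(colSums(V^2)), `/`)              # unit-length directions
  list(directions = beta, Z = X %*% beta, eigenvalues = ev$values[seq_len(d)])
}
```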


tMMDR variables which provide no clustering in<strong>for</strong>mation but require parameter estimation.Thus, the next step in the process <strong>of</strong> model-<strong>based</strong> clustering is to detect andremove these unnecessary variables.Scrucca (2010) used the subset selection method <strong>of</strong> Raftery and Dean (2006) toprune the subset <strong>of</strong> GMMDR variables. We will also use this approach to select themost appropriate tMMDR variables.Let s be a subset <strong>of</strong> q features from the original tMMDR variables Z, with dim(s) = qand q ≤ d. Let s ′ = {s \ i} ⊂ s be the set <strong>of</strong> dim = q − 1 which is obtained by excludingthe i-th feature from s. The comparison <strong>of</strong> the two subsets can be viewed as a modelcomparison problem and addressed by using the BIC difference which is given in Rafteryand Dean (2006) asBIC diff (Z i∈s ) = BIC clust (Z s ) − BIC not clust (Z s ) , (3.9)where BIC clust (Z s ) is the BIC value <strong>for</strong> the best model fitted using features in s andBIC not clust (Z s ) is the BIC value <strong>for</strong> no clustering. We can write the latter asBIC not clust (Z s ) = BIC clust (Z s ′) + BIC reg (Z i |Z s ′) ,where BIC clust (Z s ′) is the BIC value <strong>for</strong> the best model fitted using features in s ′ andBIC reg (Z i |Z s ′) is the BIC value <strong>for</strong> the regression <strong>of</strong> the i-th feature on the remaining(q − 1) features in s ′ . Since the tMMDR variables are orthogonal, the <strong>for</strong>mula <strong>for</strong> BIC reg(Raftery and Dean, 2006) reduces to( ) RSSBIC reg (Z i |Z s ′) = −n log(2π) − n log − n − (q + 1) log(n) ,nwhere RSS is the residual sum <strong>of</strong> squares in the regression <strong>of</strong> Z i on Z s ′.Now, the space <strong>of</strong> all possible subsets has size dim(s) = q, where q = 1, . . . , d has 2 d −1elements and an exhaustive search would be unfeasible. To bypass this issue, Rafteryand Dean (2006) proposed a greedy search algorithm which finds a local optimum in themodel space. The algorithm comprises the following steps:20


• the forward step evaluates the inclusion of a proposed variable;

• the backward step evaluates the exclusion of one of the currently included variables;

• termination occurs when consecutive inclusion and exclusion steps are rejected.

Since the tMMDR variables are $\Sigma$-orthogonal, we do not need to perform the backward step. With the BIC (2.8) as the method of model comparison, the greedy search algorithm proceeds as follows.

1. Let $s = \{1, 2, \ldots, d\}$ be the set of tMMDR variables $Z_s$ computed for the mixture model as described in Section 3.2.2.

2. Select the first variable to be the one which maximizes the BIC difference in (3.9).

   • Let $s_1 = \{i\}$ be the candidate set and $s_1' = \emptyset$ be the set of variables which are already included.

   • Choose the variable $Z_{i_1}$ such that
   $$i_1 = \arg\max_{i \in s}\ \mathrm{BIC}_{\mathrm{diff}}(Z_{s_1}) = \arg\max_{i \in s}\ \left(\mathrm{BIC}_{\mathrm{clust}}(Z_{s_1}) - \mathrm{BIC}_{\mathrm{reg}}(Z_i)\right).$$
   At the first iteration $\dim(s_1) = 1$, thus maximization occurs over univariate models with $G = 1, 2, \ldots, \max(G)$, where $\max(G)$ is the maximum number of clusters considered for the data.

   • Then, let $s_1 = \{i_1\}$, set $j = 2$ and proceed to the next step.

3. Select a variable to include, among those not already included, to be the one which maximizes the BIC difference in (3.9).

   • Let $s_j = s_{j-1} \cup \{i\}$ be the candidate set and $s_j' = s_{j-1}$ be the set of variables already included.

   • Choose the variable $Z_{i_j}$ to be included such that
   $$i_j = \arg\max_{i \in s \setminus s_j'}\ \mathrm{BIC}_{\mathrm{diff}}(Z_{s_j}).$$
   Here the best model is identified with respect to the number of mixture components up to $\max(G)$ and the model parametrization.


   • Then update the set of currently included variables such that $s_j = s_j' \cup \{i_j\}$.

4. Set $j = j + 1$ and iterate the previous step until a stopping rule is met. Our algorithm terminates when the BIC difference for the inclusion of a variable becomes negative.

At each step, the search over the model space is performed with respect to the model parametrization and the number of clusters.

As mentioned before, the greedy search will likely discard the tMMDR variables which are associated with small eigenvalues, as they do not carry any clustering information. Then a tMM can be fitted on the selected set of tMMDR variables and the corresponding tMMDR directions will be estimated. The feature selection step is then repeated until no directions can be dropped. This entire process is summarized in the algorithm below.

Algorithm: tMMDR estimation and feature selection

1. Fit a t-distribution mixture model (tMM) to the data.

2. Estimate the tMMDR directions using the method in Section 3.2.2.

3. Perform feature selection using the greedy algorithm in Section 3.2.3.

4. Fit a tMM on the selected tMMDR variables and return to step 2.

5. Repeat steps 2-4 until none of the features can be dropped.
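The greedy forward search can be sketched schematically as follows. This is an outline, not the thesis implementation: fit_best_bic() is a hypothetical helper standing in for "fit every candidate mixture model up to max(G) on the supplied variables and return the best BIC", and the regression BIC follows the reduced Raftery and Dean (2006) formula given above.

```r
# BIC for regressing feature z_i on the already selected features,
# using the reduced formula for orthogonal tMMDR variables.
bic_reg <- function(z_i, Z_selected) {
  n <- length(z_i)
  q <- ncol(Z_selected) + 1  # dimension of the candidate set s
  fit <- if (ncol(Z_selected) == 0) lm(z_i ~ 1) else lm(z_i ~ Z_selected)
  rss <- sum(residuals(fit)^2)
  -n * log(2 * pi) - n * log(rss / n) - n - (q + 1) * log(n)
}

# Greedy forward selection over the columns of the tMMDR variables Z.
# fit_best_bic(Z_subset): hypothetical helper returning the best BIC over all
# candidate models and numbers of clusters for Z_subset.
greedy_select <- function(Z, fit_best_bic) {
  selected <- integer(0)
  repeat {
    candidates <- setdiff(seq_len(ncol(Z)), selected)
    if (length(candidates) == 0) break
    bic_diff <- sapply(candidates, function(i) {
      trial <- c(selected, i)
      bic_clust_s      <- fit_best_bic(Z[, trial, drop = FALSE])
      bic_clust_sprime <- if (length(selected) == 0) 0   # first step: nothing selected yet
                          else fit_best_bic(Z[, selected, drop = FALSE])
      bic_not_clust <- bic_clust_sprime + bic_reg(Z[, i], Z[, selected, drop = FALSE])
      bic_clust_s - bic_not_clust                         # BIC_diff of (3.9)
    })
    if (max(bic_diff) < 0) break                          # stop when no inclusion helps
    selected <- c(selected, candidates[which.max(bic_diff)])
  }
  selected
}
```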


Chapter 4

Applications

We now focus our attention on applying the model-based clustering methods GMMDR (Scrucca, 2010) and tMMDR to simulated and real data. We used the R software (R Development Core Team, 2012) to achieve this. For GMMDR, the clustering algorithms are run via the package mclust (Fraley and Raftery, 2006). For tMMDR, the clustering is done through the package tEIGEN (Andrews and McNicholas, 2012b).

4.1 Simulated Data Examples

In this section, we follow the data simulation schemes outlined in Scrucca (2010). The tMMDR algorithm is implemented and applied to these data, and its performance is compared to the GMMDR procedure.

Model 1: we consider three overlapping clusters with common covariance, corresponding to an EEE Gaussian mixture model and to a CCCC t-distribution mixture model. The data sets are generated from three overlapping clusters with equal mixing probabilities on three variables generated from a multivariate normal distribution with means
$$\mu_1 = [0, 0, 0]^\top, \quad \mu_2 = [0, 2, 2]^\top, \quad \mu_3 = [2, -2, -2]^\top,$$
and common covariance matrix
$$\Sigma = \begin{bmatrix} 2 & 0.7 & 0.8 \\ 0.7 & 0.5 & 0.3 \\ 0.8 & 0.3 & 1 \end{bmatrix}.$$
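A minimal R sketch of generating one data set under Model 1 (equal mixing proportions, common covariance) might look as follows; drawing the cluster labels with sample() is one of several reasonable choices for equal mixing probabilities and is an assumption of this sketch.

```r
library(MASS)  # for mvrnorm()

set.seed(42)
n <- 300
mu <- list(c(0, 0, 0), c(0, 2, 2), c(2, -2, -2))
Sigma <- matrix(c(2,   0.7, 0.8,
                  0.7, 0.5, 0.3,
                  0.8, 0.3, 1),
                nrow = 3, byrow = TRUE)

labels <- sample(1:3, size = n, replace = TRUE)  # equal mixing probabilities
X <- t(sapply(labels, function(g) mvrnorm(1, mu = mu[[g]], Sigma = Sigma)))
# X is an n x 3 matrix of Model 1 observations; 'labels' holds the true clusters.
```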


[Figure 4.1: Pairs plot for the three overlapping clusters in Model 1 (n = 300 data points).]

Model 2: we consider three overlapping clusters with common shape, corresponding to a VEV Gaussian mixture model and to a UUCU t-distribution mixture model. The data sets are generated from three overlapping clusters with equal mixing probabilities on three variables generated from a multivariate normal distribution with means
$$\mu_1 = [0, 0, 0]^\top, \quad \mu_2 = [4, -2, 6]^\top, \quad \mu_3 = [-2, -4, 2]^\top.$$
For the covariance matrices $\Sigma_g = \lambda_g D_g A D_g^\top$, where $g = 1, 2, 3$, the scale, shape and orientation parameters are, respectively, $\lambda = [0.2, 0.5, 0.8]^\top$, $A = \mathrm{diag}(1, 2, 3)$ and
$$D_1 = \begin{bmatrix} 1 & 0.6 & 0.6 \\ 0.6 & 1 & 0.6 \\ 0.6 & 0.6 & 1 \end{bmatrix}, \quad D_2 = \begin{bmatrix} 2 & -1.2 & 1.2 \\ -1.2 & 2 & -1.2 \\ 1.2 & -1.2 & 2 \end{bmatrix}, \quad D_3 = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}.$$


[Figure 4.2: Pairs plot for the three overlapping clusters in Model 2 (n = 300 data points).]

Model 3: we consider three overlapping clusters with unconstrained covariance, corresponding to a VVV Gaussian mixture model and to a UUUU t-distribution mixture model. The data sets are generated from three overlapping clusters with equal mixing probabilities on three variables generated from a multivariate normal distribution with means
$$\mu_1 = [0, 0, 0]^\top, \quad \mu_2 = [4, -2, 6]^\top, \quad \mu_3 = [-2, -4, 2]^\top,$$
and within-group covariances
$$\Sigma_1 = \begin{bmatrix} 1 & 0.9 & 0.9 \\ 0.9 & 1 & 0.9 \\ 0.9 & 0.9 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 2 & -1.8 & 1.8 \\ -1.8 & 2 & -1.8 \\ 1.8 & -1.8 & 2 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}.$$

For each model we ran three scenarios, namely

• scenario one (no noise variables): generated three variables from a multivariate normal distribution;

• scenario two (noise variables): started with scenario one and added seven noise variables generated from independent standard normal variables;


• scenario three (redundant and noise variables): started with scenario one and added three variables correlated with each clustering variable (with correlation coefficients equal to 0.9, 0.7 and 0.5, respectively) as well as four independent standard normal variables.

[Figure 4.3: Pairs plot for the three overlapping clusters in Model 3 (n = 300 data points).]

In order to ascertain the performance of the clustering methods under varying data dimensions, each scenario was run for three data sets consisting of 100, 300 and 1000 data points generated according to the schemes described earlier. Every run comprised 1000 simulations. Table 4.1 shows the comparison between our results for tMMDR and the results for GMMDR from Scrucca (2010).

For the scenarios which include noise and redundant variables, tMMDR exhibits ARI values which are higher than those for GMMDR. This occurs consistently for all models and varying data dimensions.
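Clustering accuracy throughout is measured with the adjusted Rand index (ARI). A minimal sketch of how the average ARI and its standard error over repeated simulations could be computed is given below; adjustedRandIndex() is from the mclust package, while simulate_model1() and cluster_labels() are hypothetical stand-ins for the data-generation scheme above and for whichever fitting routine is being evaluated.

```r
library(mclust)  # provides adjustedRandIndex()

# Hypothetical helpers (not defined here):
#   simulate_model1(n)  -> list(X = data matrix, labels = true cluster labels)
#   cluster_labels(X)   -> estimated cluster labels from the fitted mixture model
evaluate_ari <- function(n_points, n_reps, simulate_model1, cluster_labels) {
  ari <- replicate(n_reps, {
    sim <- simulate_model1(n_points)
    adjustedRandIndex(cluster_labels(sim$X), sim$labels)
  })
  c(mean = mean(ari), se = sd(ari) / sqrt(n_reps))  # average ARI and its standard error
}
```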


Table 4.1: Average ARI (with standard errors in brackets) based on 1000 simulations.

                 No noise (3 variables)            Noise variables (10 variables)     Noise and redundant variables (10 variables)
                 n=100     n=300     n=1000        n=100     n=300     n=1000         n=100     n=300     n=1000
Model 1
GMMDR (EEE)      0.9716    0.9742    0.9753        0.8612    0.9674    0.9742         0.8832    0.9699    0.9742
                 (0.0347)  (0.0172)  (0.0088)      (0.1426)  (0.0234)  (0.009)        (0.1613)  (0.0197)  (0.0088)
tMMDR (CCCC)     0.9705    0.9722    0.9747        0.9210    0.9709    0.9737         0.9987    0.9994    0.9995
                 (0.0314)  (0.0163)  (0.0085)      (0.1763)  (0.0172)  (0.0087)       (0.0116)  (0.0028)  (0.0012)
Model 2
GMMDR (VEV)      0.9709    0.9802    0.9819        0.9201    0.9747    0.9806         0.9231    0.9727    0.9806
                 (0.0351)  (0.0141)  (0.0073)      (0.0799)  (0.0165)  (0.0082)       (0.0811)  (0.0169)  (0.0077)
tMMDR (UUCU)     0.9768    0.9796    0.9815        0.9609    0.9781    0.9818         0.9898    0.9995    0.9997
                 (0.0283)  (0.0142)  (0.0074)      (0.1571)  (0.0101)  (0.0079)       (0.0197)  (0.0014)  (0.0008)
Model 3
GMMDR (VVV)      0.9952    0.9981    0.9983        0.8643    0.9674    0.9751         0.8799    0.9712    0.9740
                 (0.0146)  (0.0042)  (0.0024)      (0.1384)  (0.021)   (0.0081)       (0.1609)  (0.0185)  (0.0099)
tMMDR (UUUU)     0.9972    0.9982    0.9981        0.9865    0.9973    0.9981         0.9612    0.9991    0.9994
                 (0.0092)  (0.0042)  (0.0023)      (0.0801)  (0.0052)  (0.0025)       (0.1241)  (0.0168)  (0.0073)


4.2 Real Data Examples

In this section we apply the GMMDR and tMMDR methods to real data sets and use the ARI (2.13) to assess clustering performance. We chose data which appear quite often in the literature for model-based clustering and provide a well-rounded framework for testing the algorithms.

4.2.1 Coffee Data

Streuli (1973) recorded twelve chemical compositions for two types of coffee, namely the Arabica and Robusta varieties. The coffee samples, totalling 43 observations, were collected from around the world and classified according to variety and country of provenience, as shown in Table 4.2. These data are available through the R package pgmm (McNicholas et al., 2011).

Table 4.2: Variable description for the coffee data.

Variable              Range of values
Variety               1 (Arabica) or 2 (Robusta)
Country               43 different countries
Water                 [5, 12]
Bean weight           [110.8, 191.2]
Extract yield         [29, 36.2]
pH value              [5.21, 6.13]
Free acid             [28.4, 43.4]
Mineral content       [3.48, 4.49]
Fat                   [7.2, 17]
Caffeine              [0.9, 2.16]
Trigonelline          [0.32, 1.38]
Chlorogenic acid      [4.8, 6.41]
Neochlorogenic acid   [0.12, 0.75]
Isochlorogenic acid   [0.31, 1.64]


The known classification of the coffee data by variety has two groups: Arabica with 36 observations and Robusta with 7 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.3.

Table 4.3: Model-based clustering results for the coffee data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    E       2              -                    1          1
tMMDR    CIUU    2              {48.1, 46.8}         5          1

Both methods accomplish perfect clustering, as evidenced by an ARI equal to 1. However, tMMDR needs five features to achieve this, whereas GMMDR takes only one feature. The resulting classification is presented in Table 4.4.

Table 4.4: A classification table for the best tMMDR model fitted to the coffee data.

            Cluster
Variety     1     2
Arabica    36     0
Robusta     0     7

Figure 4.4 illustrates a scatterplot of the tMMDR directions corresponding to the two clusters found by the procedure. The separation between the varieties of coffee is very clear, as there is no overlap between the clusters.
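A schematic of the workflow used for these real-data comparisons is sketched below. The coffee data are taken from the pgmm package; Mclust() and adjustedRandIndex() are from mclust. The call to teigen() assumes the interface of the tEIGEN/teigen software (data matrix, candidate numbers of groups, candidate models), and the output component names and column layout of the coffee data are assumptions; the dimension-reduction and feature-selection steps of Chapter 3 are not shown.

```r
library(pgmm)    # coffee data
library(mclust)  # Mclust(), adjustedRandIndex()
library(teigen)  # teigen(); interface assumed as in the tEIGEN family software

data(coffee)
X <- scale(coffee[, -(1:2)])   # drop Variety and Country (column layout assumed), scale the rest
truth <- coffee$Variety

# Gaussian mixture (MCLUST family); model and G chosen by BIC
fit_g <- Mclust(X, G = 1:4)
adjustedRandIndex(fit_g$classification, truth)

# t-distribution mixture (tEIGEN family)
fit_t <- teigen(X, Gs = 1:4, models = "all")
adjustedRandIndex(fit_t$classification, truth)
```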


[Figure 4.4: Plots of tMMDR directions for the coffee data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]

4.2.2 Italian Wines Data

Forina et al. (1986) recorded twenty-eight chemical and physical properties for three types of Italian wines, namely Barolo, Grignolino and Barbera. For our analysis, we used a subset of 13 of these variables, as shown in Table 4.5. These data, comprising 178 observations, are available through the R package gclus (Hurley, 2010).


Table 4.5: Variable description for the wines data.

Variable                        Range of values
Type                            1 (Barolo), 2 (Grignolino) or 3 (Barbera)
Alcohol                         [11.03, 14.83]
Malic acid                      [0.74, 5.8]
Ash                             [1.36, 3.23]
Alcalinity of ash               [10.6, 30]
Magnesium                       [70, 162]
Total phenols                   [0.98, 3.88]
Flavanoids                      [0.34, 5.08]
Nonflavanoid phenols            [0.13, 0.66]
Proanthocyanins                 [0.41, 3.58]
Color intensity                 [1.28, 13]
Hue                             [0.48, 1.71]
OD280/OD315 of diluted wines    [1.27, 4]
Proline                         [278, 1680]

The known classification of the wine data by type has three groups: Barolo with 59 observations, Grignolino with 71 observations and Barbera with 48 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.6.

Table 4.6: Model-based clustering results for the wine data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    VEV     3              -                    5          0.85
tMMDR    CUCC    3              57.8                 4          0.9309

The tMMDR method (ARI = 0.93) produces a better clustering than the GMMDR method (ARI = 0.85) on the wine data and uses fewer features in the process. The resulting classification is presented in Table 4.7.


Table 4.7: A classification table for the best tMMDR model fitted to the wine data.

              Cluster
Wines         1     2     3
Barolo       59     2     0
Grignolino    0    67     0
Barbera       0     2    48

Figure 4.5 illustrates a scatterplot of the tMMDR directions corresponding to the three clusters found by the procedure. The separation between the varieties of wine is clear in the plots of direction one against directions two and three, as there is very little overlap between the clusters.

[Figure 4.5: Plots of tMMDR directions for the wine data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]


4.2.3 Australian Crabs Data

Campbell and Mahon (1974) recorded five measurements for specimens of Leptograpsus crabs found in Australia. Crabs were classified according to their colour (blue or orange) and their gender. This data set, which consists of 200 observations, is described in Table 4.8 and is available through the R package MASS (Venables and Ripley, 2002).

Table 4.8: Variable description for the crabs data.

Variable            Range of values
Species             B (blue) or O (orange)
Sex                 M (male) or F (female)
Frontal lobe size   [7.2, 23.1]
Rear width          [6.5, 20.20]
Carapace length     [14.7, 47.2]
Carapace width      [17.1, 54.6]
Body depth          [6.1, 21.6]

The known classification of the crabs data by colour and gender has four groups: 50 blue males, 50 orange males, 50 blue females and 50 orange females. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.9.

Table 4.9: Model-based clustering results for the crabs data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    EEV     4              -                    3          0.8195
tMMDR    CUCC    4              80.9                 4          0.8617

The tMMDR method (ARI = 0.86) produces a better clustering than the GMMDR method (ARI = 0.82) on the crabs data, but it requires one more feature than GMMDR to achieve this value. The resulting classification is presented in Table 4.10.


Table 4.10: A classification table for the best tMMDR model fitted to the crabs data.

            Cluster
Species     1     2     3     4
BF         50     8     0     0
OF          0    42     0     0
BM          0     0    47     0
OM          0     0     3    50

Figure 4.6 illustrates a scatterplot of the tMMDR directions corresponding to the four clusters found by the procedure. The separation between the crab species is most clear in the plots of direction one against direction three. As the number of clusters increases, it becomes more difficult to visualize their separation, as evidenced in the plots of direction two against directions three and four.


[Figure 4.6: Plots of tMMDR directions for the crabs data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]

4.2.4 Swiss Banknotes Data

Flury and Riedwyl (1988) presented six measurements taken from Swiss banknotes. The 200 observations are classified as either genuine or counterfeit, as shown in Table 4.11. These data are available through the R package gclus.

The known classification of the banknotes data by status has two groups: genuine with 100 observations and counterfeit with 100 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.12.


Table 4.11: Variable description for the banknotes data.

Variable   Range of values
Status     0 (genuine) or 1 (counterfeit)
Length     [213.8, 216.3]
Left       [129, 131]
Right      [129, 131.1]
Bottom     [7.2, 12.7]
Top        [7.7, 12.3]
Diagonal   [137.8, 142.4]

Table 4.12: Model-based clustering results for the banknotes data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    EEI     4              -                    3          0.6739
tMMDR    CICU    3              {13, 10.1, 63.6}     6          0.8603

The tMMDR method (ARI = 0.86) produces a better clustering than the GMMDR method (ARI = 0.67) on the banknotes data, although neither is able to classify the data correctly into its two known clusters. The resulting classification is presented in Table 4.13.

Table 4.13: A classification table for the best tMMDR model fitted to the banknotes data.

           Status
Cluster    1     2
1         99     0
2          1    15
3          0    85

While we expect genuine bank notes to present as one identifiable cluster, counterfeit notes may well appear in two clusters depending on the number of different sources of those notes.


The results may indicate that there were two different kinds of counterfeit notes.

Figure 4.7 illustrates a scatterplot of the tMMDR directions corresponding to the three clusters found by the procedure. The separation between the types of banknote is most clear in the plots of direction one against directions two and three, as they do not overlap.

[Figure 4.7: Plots of tMMDR directions for the banknotes data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]

4.2.5 Diabetes Data

Reaven and Miller (1979) examined the relationship between measures of blood plasma glucose and insulin in order to classify people as normal, overt diabetic or chemical diabetic. This data set consists of observations from 145 adult patients at the Stanford Clinical Research Centre, as described in Table 4.14, and is available through the R package locfit (Loader, 2012).


Table 4.14: Variable description for the diabetes data.

Variable                              Range of values
Classifier (cc)                       normal, chemical or overt
Relative weight (rw)                  [0.7, 1.2]
Fasting plasma glucose (fpg)          [70, 353]
Glucose area (ga)                     [269, 1568]
Insulin area (ina)                    [29, 480]
Steady state plasma glucose (sspg)    [10, 748]

The known classification for the diabetes data has three groups: normal with 76 observations, chemical with 36 observations and overt with 33 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.15.

Table 4.15: Model-based clustering results for the diabetes data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    VEV     3              -                    3          0.6536
tMMDR    UUUC    3              65.8                 4          0.702

The tMMDR method (ARI = 0.70) produces a better clustering than the GMMDR method (ARI = 0.65) on the diabetes data. However, the classification of the data into their three known clusters is somewhat poor. The resulting classification is presented in Table 4.16.

Figure 4.8 illustrates a scatterplot of the tMMDR directions corresponding to the three clusters found by the procedure. The separation between the types of diabetes is most clear in the plots of direction one against direction three, where they overlap the least.


Table 4.16: A classification table for the best tMMDR model fitted to the diabetes data.

              Cluster
Classifier    1     2     3
Overt        26     0     0
Chemical      7    27     2
Normal        0     9    74

[Figure 4.8: Plots of tMMDR directions for the diabetes data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]


A summary of our clustering results for the real data appears in Table 4.17. In general, the tMMDR algorithm gives higher ARI values but requires more features than its GMMDR analogue. This is to be expected, since there are more parameters to be estimated in a t-mixture model than in a Gaussian mixture model, hence more information is required.

Table 4.17: Summary of the results for the real data.

                              tMMDR                              GMMDR
Data        Clust.  Vars.   Model  Clust.  Feat.  ARI         Model  Clust.  Feat.  ARI
Coffee        2      13     CIUU     2       5    1            E       2       1    1
Wine          3      13     CUCC     3       4    0.9309       VEV     3       5    0.85
Crabs         4       5     CUCC     4       4    0.8617       EEV     4       3    0.8195
Banknotes     2       6     CICU     3       6    0.8603       EEI     4       3    0.6739
Diabetes      3       5     UUUC     3       4    0.7020       VEV     3       3    0.6536


Chapter 5

Conclusion

This thesis introduces the idea of dimension reduction for model-based clustering within the multivariate t-distribution framework (tMMDR). This approach is based on existing work for reducing dimensionality in the case of finite Gaussian mixtures (GMMDR).

Work focused on identifying the smallest subspace of the data which captured the inherent cluster structure. This information was gathered by looking at how the group means and group covariances varied, using the eigen-decomposition of the kernel matrix. The elements of the subspace consisted of linear combinations of the original data, which were ordered by importance via the associated eigenvalues. Observations were then projected onto the subspace and the resulting set of variables captured most of the clustering structure available in the data.

The tMMDR approach was illustrated using simulated and real data and its performance compared to that of the GMMDR method. The evaluation was done using the ARI. In the case of synthetic data, tMMDR produced better results generally, and specifically for the scenarios with higher levels of complexity (i.e., more data points and variables). For the real data, the tMMDR algorithm showed higher ARI values than its GMMDR analogue, but the former selected a slightly bigger set of features (usually one more feature than GMMDR).

Overall, using dimension reduction via mixtures of t-distributions shows great potential for model-based clustering.

In terms of future work, the application of tMMDR and GMMDR to mixtures of factor analyzers could be investigated. More generally, the method of dimension reduction could be extended to other model-based paradigms such as classification.


Bibliography

Andrews, J. L. and P. D. McNicholas (2011a). Extending mixtures of multivariate t-factor analyzers. Statistics and Computing 21(3), 361–373.

Andrews, J. L. and P. D. McNicholas (2011b). Mixtures of modified t-factor analyzers for model-based clustering, classification, and discriminant analysis. Journal of Statistical Planning and Inference 141(4), 1479–1486.

Andrews, J. L. and P. D. McNicholas (2012a). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions: The tEIGEN family. Statistics and Computing 22.

Andrews, J. L. and P. D. McNicholas (2012b). tEIGEN for R: Model-based clustering and classification with the multivariate t-distribution. R package version 1.0.

Andrews, J. L., P. D. McNicholas, and S. Subedi (2011). Model-based classification via mixtures of multivariate t-distributions. Computational Statistics and Data Analysis 55(1), 520–529.

Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.

Campbell, N. A. and R. J. Mahon (1974). A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22, 417–425.

Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.

Flury, B. and H. Riedwyl (1988). Multivariate Statistics: A Practical Approach. Cambridge University Press.


Forina, M., C. Armanino, M. Castino, and M. Ubigli (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.

Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41(8), 578–588.

Fraley, C. and A. E. Raftery (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification 16, 297–306.

Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.

Fraley, C. and A. E. Raftery (2006). MCLUST Version 3 for R: Normal Mixture Modeling and Model-based Clustering. Department of Statistics, University of Washington (revised in 2012).

Greselin, F. and S. Ingrassia (2010a). Constrained monotone EM algorithms for mixtures of multivariate t-distributions. Statistics and Computing 20(1), 9–22.

Greselin, F. and S. Ingrassia (2010b). Weakly homoscedastic constraints for mixtures of t-distributions. In A. Fink, B. Lausen, W. Seidel, and A. Ultsch (Eds.), Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Berlin/Heidelberg: Springer.

Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.

Hurley, C. (2010). gclus: Clustering Graphics. R package version 1.3.

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.

Kass, R. E. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90(431), 928–934.

Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A 62(1), 49–66.

Leroux, B. G. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics 20, 1350–1360.


Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association 86, 316–342.

Li, K. C. (2000). High dimensional data analysis via the SIR/PHD approach. Unpublished manuscript.

Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 5. Hayward, California: Institute of Mathematical Statistics.

Loader, C. (2012). locfit: Local Regression, Likelihood and Density Estimation. R package version 1.5-8.

McLachlan, G. J. and K. E. Basford (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker Inc.

McLachlan, G. J., R. W. Bean, and L.-T. Jones (2007). Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. Computational Statistics & Data Analysis 51(11), 5327–5338.

McLachlan, G. J. and D. Peel (1998). Robust cluster analysis via mixtures of multivariate t-distributions. In Lecture Notes in Computer Science, Volume 1451, pp. 658–666. Berlin: Springer-Verlag.

McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. New York: John Wiley & Sons.

McNicholas, P. D. (2011). On model-based clustering, classification, and discriminant analysis. Journal of the Iranian Statistical Society 10(2), 181–199.

McNicholas, P. D., K. R. Jampani, A. F. McDaid, T. B. Murphy, and L. Banks (2011). pgmm: Parsimonious Gaussian Mixture Models. R package version 1.0.

McNicholas, P. D. and T. B. Murphy (2008). Parsimonious Gaussian mixture models. Statistics and Computing 18, 285–296.

McNicholas, P. D., T. B. Murphy, A. F. McDaid, and D. Frost (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics & Data Analysis 54(3), 711–723.


Meng, X. L. and D. B. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267–278.

Peel, D. and G. J. McLachlan (2000). Robust mixture modelling using the t-distribution. Statistics and Computing 10, 339–348.

R Development Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Raftery, A. E. and N. Dean (2006). Variable selection for model-based clustering. Journal of the American Statistical Association 101(473), 168–178.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.

Reaven, G. M. and R. G. Miller (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16, 17–24.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.

Scrucca, L. (2010). Dimension reduction for model-based clustering. Statistics and Computing 20(4), 471–484.

Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In Association Scientifique Internationale du Café, 6th International Colloquium on Coffee Chemistry, Bogotá, Colombia, pp. 61–72.

Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (Fourth ed.). New York: Springer.
