
Dimension Reduction for Model-based Clustering via Mixtures of Multivariate t-Distributions

by

Katherine Morris

A Thesis presented to The University of Guelph
In partial fulfilment of requirements for the degree of
Master of Science in Mathematics and Statistics

Guelph, Ontario, Canada

© Katherine Morris, July, 2012


Acknowledgements

First and foremost, I would like to extend my sincerest gratitude to my advisor, Prof. Paul McNicholas, for his guidance and support. His knowledge and enthusiasm inspired me to undertake the research presented here.

I would like to thank Prof. Ryan Browne for his contributions on my advisory committee, and also Prof. Zeny Feng and Prof. Edward Carter for taking the time to examine this work.

Also, I would like to thank Jeffrey Andrews, the author of the R package tEIGEN, for his continued assistance while using the software.

Finally, thanks go to my family and Phillip for their years of love and patience.


List of Tables

2.1 Nomenclature for models in the MCLUST family
2.2 Nomenclature for models in the tEIGEN family
4.1 Average ARI (with standard errors in brackets) based on 1000 simulations
4.2 Variable description for the coffee data
4.3 Model-based clustering results for the coffee data
4.4 A classification table for the best tMMDR model fitted to the coffee data
4.5 Variable description for the wines data
4.6 Model-based clustering results for the wine data
4.7 A classification table for the best tMMDR model fitted to the wine data
4.8 Variable description for the crabs data
4.9 Model-based clustering results for the crabs data
4.10 A classification table for the best tMMDR model fitted to the crabs data
4.11 Variable description for the banknotes data
4.12 Model-based clustering results for the banknotes data
4.13 A classification table for the best tMMDR model fitted to the banknotes data
4.14 Variable description for the diabetes data
4.15 Model-based clustering results for the diabetes data
4.16 A classification table for the best tMMDR model fitted to the diabetes data
4.17 Summary of the results for the real data


List of Figures

2.1 Contour and perspective plots for a bivariate Normal distribution with $\mu = [0, 0]^\top$ and $\Sigma = \mathrm{diag}(2)$
2.2 Comparison of a Normal and t-distribution
4.1 Pairs plot for the three overlapping clusters in Model 1 (n = 300 data points)
4.2 Pairs plot for the three overlapping clusters in Model 2 (n = 300 data points)
4.3 Pairs plot for the three overlapping clusters in Model 3 (n = 300 data points)
4.4 Plots of tMMDR directions for the coffee data
4.5 Plots of tMMDR directions for the wine data
4.6 Plots of tMMDR directions for the crabs data
4.7 Plots of tMMDR directions for the banknotes data
4.8 Plots of tMMDR directions for the diabetes data


Chapter 1

Introduction

Clustering algorithms based on probability models are a popular choice for exploring structures in modern data sets, which continue to grow in size and complexity. In particular, the model-based approach assumes that the data are generated by a finite mixture of probability distributions such as the multivariate Gaussian distribution. These models have been shown to be a powerful tool for applications in bioinformatics, finance, medicine and survey research (McLachlan and Basford, 1988; Banfield and Raftery, 1993; Celeux and Govaert, 1995, for example). In the probability framework, the issues of selecting the 'best' clustering method and determining the appropriate number of clusters are reduced to model selection problems.

Another choice for model-based clustering is the t-distribution which, although less frequently used, has the potential to outperform its Gaussian analogue. In many situations, the tails of the normal distribution are thinner than required and the estimates can be affected by outliers. McLachlan and Peel (1998) introduced the idea of fitting mixtures of multivariate t-distributions to the data, which provided a more robust approach than fitting normal mixture models, as outliers were given reduced weight in the calculation of parameters.

Reducing dimensionality within the model-based clustering paradigm will be the focus of this thesis. The work of Scrucca (2010) on dimension reduction for model-based clustering in the Gaussian framework, called GMMDR, is applied to mixtures of multivariate t-distributions. A new method, tMMDR, is proposed and its steps are summarized as follows.

1. Fit a t-distribution mixture model to the data using the tEIGEN family (Andrews and McNicholas, 2012a).

2. Find the smallest subspace which captures the clustering information contained in the data.


Chapter 2

Background

Finite mixtures of distributions provide a mathematical approach for fitting statistical models to a wide variety of random phenomena. McLachlan and Basford (1988) and McLachlan and Peel (2000) give an extensive description of finite mixture models, which have become increasingly popular due to their flexibility. The most popular application of these models occurs for scenarios where data exhibit group structure or where the data can be investigated for such structure.

A p-dimensional random vector X is said to arise from a parametric finite mixture distribution if, for all $x \subset X$, one can write
$$p(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g\, p_g(x \mid \theta_g),$$
where G is the number of components, the $\pi_g$ are mixing proportions such that
$$\sum_{g=1}^{G} \pi_g = 1 \quad \text{and} \quad \pi_g > 0,$$
and $\vartheta = (\pi_1, \ldots, \pi_G, \theta_1, \ldots, \theta_G)$ is the parameter vector. The $p_g(x \mid \theta_g)$ are called component densities.
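For concreteness, a minimal R sketch of evaluating such a finite mixture density is shown below. It assumes the mvtnorm package for the multivariate Gaussian component densities used in the next section; the two-component parameter values are made-up illustrative numbers, not anything from the thesis.

```r
library(mvtnorm)  # for dmvnorm(); assumed available

# Illustrative two-component mixture in p = 2 dimensions
pi_g  <- c(0.6, 0.4)                                  # mixing proportions, sum to 1
mu    <- list(c(0, 0), c(3, 3))                       # component means
Sigma <- list(diag(2), matrix(c(1, 0.5, 0.5, 1), 2))  # component covariances

# Mixture density p(x | theta) = sum_g pi_g * f(x | mu_g, Sigma_g)
mixture_density <- function(x) {
  sum(sapply(seq_along(pi_g), function(g) {
    pi_g[g] * dmvnorm(x, mean = mu[[g]], sigma = Sigma[[g]])
  }))
}

mixture_density(c(1, 1))  # density of the mixture at the point (1, 1)
```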


2.1 Gaussian Mixture Models

From McLachlan and Peel (2000), we know that the density of a multivariate Gaussian mixture model is given by
$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_N(x \mid \mu_g, \Sigma_g),$$
where $f_N(x \mid \mu_g, \Sigma_g)$ is the density of a multivariate Gaussian distribution with mean $\mu_g$ and covariance matrix $\Sigma_g$, i.e.
$$f_N(x \mid \mu_g, \Sigma_g) = \frac{\exp\{-\tfrac{1}{2}(x - \mu_g)^\top \Sigma_g^{-1} (x - \mu_g)\}}{(2\pi)^{p/2} |\Sigma_g|^{1/2}}. \qquad (2.1)$$

Figure 2.1 illustrates the contour and perspective plots for a bivariate normal distribution with $\mu = [0, 0]^\top$ and $\Sigma = \mathrm{diag}(2)$. By varying $\Sigma$, one can obtain different orientations, shapes and volumes for the distribution.

[Figure 2.1: Contour and perspective plots for a bivariate Normal distribution with $\mu = [0, 0]^\top$ and $\Sigma = \mathrm{diag}(2)$.]

2.2 Model-based Clustering with MCLUST

Model-based clustering is a method which assumes that the data are generated by a mixture of underlying probability distributions in which each component represents a different group or cluster. The mixture likelihood is given by
$$L_M(\theta_1, \ldots, \theta_G; \pi_1, \ldots, \pi_G \mid x) = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g f_N(x_i \mid \theta_g), \qquad (2.2)$$
where $x = \{x_1, \ldots, x_n\}$ represents the data.

For the Gaussian density in (2.1), clusters are ellipsoidal and centred at the means $\mu_g$. Other geometric properties are determined by the covariances $\Sigma_g$. Banfield and Raftery (1993) developed a model-based framework for clustering by using the following eigenvalue decomposition of the covariance matrix
$$\Sigma_g = \lambda_g D_g A_g D_g^\top, \qquad (2.3)$$


where

• $D_g$ is the orthogonal matrix of eigenvectors which determines the orientation of the principal components of $\Sigma_g$;

• $A_g$ is a diagonal matrix with elements proportional to the eigenvalues of $\Sigma_g$ which determines the shape of the density contours;


• $\lambda_g$ is a scalar which specifies the volume of the corresponding ellipsoid.

The orientation, volume and shape of the distributions are estimated from the data and can either vary between clusters or be constrained across clusters. There are four commonly used assumptions for the covariance matrices:

1. $\Sigma_g = \Sigma = \lambda I$, where clusters are spherical and have equal volumes;

2. $\Sigma_g = \Sigma = \lambda D A D^\top$, where all clusters have the same shape, volume and orientation;

3. $\Sigma_g = \lambda_g D_g A_g D_g^\top$, where the shape, volume and orientation are allowed to vary;

4. $\Sigma_g = \lambda D_g A D_g^\top$, where only the orientations of the clusters may differ.

By imposing constraints on the elements of (2.3), a large range of models is obtained, as discussed in Celeux and Govaert (1995). Several of these models, called the MCLUST family (Fraley and Raftery, 1999), are available through the R package mclust by Fraley and Raftery (2006). A brief description of the types of models included in this particular family appears in Table 2.1. A short sketch of recovering the volume, shape and orientation terms of (2.3) from a covariance matrix is given after the EM steps below.

Clustering via mixture models is done through the expectation-maximization (EM) algorithm (Dempster et al., 1977), which is an iterative procedure for finding maximum likelihood estimates when data are incomplete or treated as being incomplete. The complete-data log-likelihood for model-based clustering with MCLUST is
$$l(\vartheta) = \sum_{i=1}^{n} \sum_{g=1}^{G} z_{ig} \log[\pi_g f_N(x_i \mid \theta_g)]. \qquad (2.4)$$

The EM algorithm involves the iteration of two steps until convergence is reached, namely

1. the expectation step (E-step), which computes the expected value of the complete-data log-likelihood, and

2. the maximization step (M-step), which maximizes the expected value of the complete-data log-likelihood with respect to the model parameters.
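As a concrete illustration of the decomposition (2.3), the R sketch below recovers the volume, shape and orientation terms from a given covariance matrix. The normalization det(A) = 1, so that λ alone carries the volume, is the convention assumed here.

```r
# Decompose a covariance matrix as Sigma = lambda * D %*% A %*% t(D),
# following (2.3); A is normalized so that det(A) = 1 (assumed convention).
decompose_covariance <- function(Sigma) {
  p <- nrow(Sigma)
  e <- eigen(Sigma, symmetric = TRUE)
  lambda <- prod(e$values)^(1 / p)   # volume: det(Sigma)^(1/p)
  A <- diag(e$values / lambda)       # shape: diagonal matrix with det(A) = 1
  D <- e$vectors                     # orientation: orthogonal eigenvectors
  list(lambda = lambda, A = A, D = D)
}

# Example: an ellipsoidal covariance matrix
Sigma <- matrix(c(2, 0.7, 0.7, 1), nrow = 2)
dec <- decompose_covariance(Sigma)
# Reconstruction check: should equal Sigma (up to numerical error)
dec$lambda * dec$D %*% dec$A %*% t(dec$D)
```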


Table 2.1: Nomenclature for models in the MCLUST family (Fraley and Raftery, 1999)

Model  Distribution  Volume    Shape     Orientation  Covariance          Free covariance parameters
EII    Spherical     Equal     Equal     NA           λI                  1
VII    Spherical     Variable  Equal     NA           λ_g I               G
EEI    Diagonal      Equal     Equal     Coord. axes  λA                  p
VEI    Diagonal      Variable  Equal     Coord. axes  λ_g A               p + G − 1
EVI    Diagonal      Equal     Variable  Coord. axes  λA_g                Gp − G + 1
VVI    Diagonal      Variable  Variable  Coord. axes  λ_g A_g             Gp
EEE    Ellipsoidal   Equal     Equal     Equal        λDAD^⊤              p(p + 1)/2
EEV    Ellipsoidal   Equal     Equal     Variable     λD_g AD_g^⊤         Gp(p + 1)/2 − (G − 1)p
VEV    Ellipsoidal   Variable  Equal     Variable     λ_g D_g AD_g^⊤      Gp(p + 1)/2 − (G − 1)(p − 1)
VVV    Ellipsoidal   Variable  Variable  Variable     λ_g D_g A_g D_g^⊤   Gp(p + 1)/2

Traditionally, the missing data are represented by component indicator variables $z_{ig}$ defined as
$$z_{ig} = \begin{cases} 1 & \text{if observation } x_i \text{ belongs to component } g, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.5)$$

The main objective of model-based clustering is to estimate these $z_{ig}$ under the assumptions that

• the density of an observation $x_i$ given $z_i$ is $\prod_{g=1}^{G} f_N(x_i \mid \theta_g)^{z_{ig}}$;

• each $z_i$ is independent and identically distributed according to a multinomial distribution of one draw on G categories with probabilities $\pi_1, \ldots, \pi_G$.

At each E-step, the $z_{ig}$ are updated by their conditional expected values
$$\hat{z}_{ig} = \frac{\pi_g f_N(x \mid \mu_g, \Sigma_g)}{\sum_{j=1}^{G} \pi_j f_N(x \mid \mu_j, \Sigma_j)}, \qquad (2.6)$$
where $f_N$ has the form given in (2.1).
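A minimal R sketch of the E-step update (2.6) follows. It assumes the mvtnorm package for the Gaussian component densities; the parameter values in the example call are made up purely for illustration.

```r
library(mvtnorm)  # for dmvnorm(); assumed available

# E-step of (2.6): responsibilities z_hat[i, g] for an n x p data matrix X,
# given current mixing proportions pi_g, means mu (list) and covariances Sigma (list).
e_step_responsibilities <- function(X, pi_g, mu, Sigma) {
  G <- length(pi_g)
  # n x G matrix with entries pi_g * f_N(x_i | mu_g, Sigma_g)
  dens <- sapply(seq_len(G), function(g) {
    pi_g[g] * dmvnorm(X, mean = mu[[g]], sigma = Sigma[[g]])
  })
  dens / rowSums(dens)  # normalize each row so the responsibilities sum to 1
}

# Illustrative call with made-up parameters
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)
z_hat <- e_step_responsibilities(X,
                                 pi_g  = c(0.5, 0.5),
                                 mu    = list(c(0, 0), c(2, 2)),
                                 Sigma = list(diag(2), diag(2)))
```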


In the M-step of the algorithm, the means, mixing proportions and covariance structure get updated. The details appear in several papers, for example Celeux and Govaert (1995) or Fraley and Raftery (1998).

After the EM algorithm converges, cluster memberships are estimated via the maximum a posteriori (MAP) classification given by
$$\mathrm{MAP}\{\hat{z}_{ig}\} = \begin{cases} 1 & \text{if } \max_g\{\hat{z}_{ig}\} \text{ occurs at component } g, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.7)$$

Clustering via mixture modelling allows for the use of Bayes factors (Kass and Raftery, 1995) to compare models. This gives a systematic method of selecting the parametrization of the model and the number of mixture components. When the EM algorithm is used to find the maximum mixture likelihood, an approximation to twice the log Bayes factor, called the Bayesian information criterion (BIC; Schwarz, 1978), is more suitable:
$$\mathrm{BIC} = 2\, l(x, \hat{\vartheta}) - p \log(n), \qquad (2.8)$$
where $l(x, \hat{\vartheta})$ is the maximized log-likelihood, $\hat{\vartheta}$ is the maximum likelihood estimate of $\vartheta$, p is the number of parameters and n is the number of observations. The use of BIC for model selection in model-based clustering is well established in its application, as can be seen in Leroux (1992), Kass and Wasserman (1995) and Keribin (2000), for example.

2.3 Mixtures of Multivariate t-Distributions

From McLachlan and Peel (1998) and Peel and McLachlan (2000), we know that the density of a multivariate t-distribution mixture model is given by
$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_t(x \mid \mu_g, \Sigma_g, \nu_g), \qquad (2.9)$$
where the $\pi_g$ are the mixing proportions and
$$f_t(x \mid \mu_g, \Sigma_g, \nu_g) = \frac{\Gamma\!\left(\frac{\nu_g + p}{2}\right) |\Sigma_g|^{-1/2}}{(\pi \nu_g)^{p/2}\, \Gamma\!\left(\frac{\nu_g}{2}\right) \left[1 + \frac{\delta(x, \mu_g \mid \Sigma_g)}{\nu_g}\right]^{(\nu_g + p)/2}} \qquad (2.10)$$
is the density of a multivariate t-distribution with mean $\mu_g$, covariance matrix $\Sigma_g$ and degrees of freedom $\nu_g$.


Also, $\delta(x, \mu_g \mid \Sigma_g) = (x - \mu_g)^\top \Sigma_g^{-1} (x - \mu_g)$ is the squared Mahalanobis distance between x and $\mu_g$.

As in the univariate analogue, the multivariate t-distribution becomes asymptotically normal as the number of degrees of freedom tends to infinity.

[Figure 2.2: Normal distribution with $\mu = 0$ and $\Sigma = 2$ superimposed on a t-distribution with the same $\mu$ and $\Sigma$ and three degrees of freedom.]

2.4 Model-based Clustering with tEIGEN

When clustering continuous multivariate data, attention has focussed on the use of multivariate normal components because of their computational convenience (they can be easily modelled via the EM algorithm as outlined in Section 2.2). However, in many applications, the tails of the normal distribution are often thinner than required and the estimates of the component means and covariance matrices can be affected by outliers. Figure 2.2 illustrates the difference in the tails of the Normal and t-distributions via a simple example.


McLachlan and Peel (1998) introduced the idea of fitting mixtures of multivariate t-distributions to the data. This provided a more robust approach than fitting normal mixture models, as outliers were given reduced weight in the calculation of parameters, i.e., via the Mahalanobis distance term in the density (2.10).

Recent work on model-based clustering using the multivariate t-distribution has been contributed by McLachlan et al. (2007), Greselin and Ingrassia (2010a,b), Andrews and McNicholas (2011a,b) and Andrews et al. (2011).

Andrews and McNicholas (2012a) used the eigen-decomposition of the multivariate t-distribution covariance matrix to build a family of 20 mixtures which are called the tEIGEN family (as shown in Table 2.2). These take into account the same constraints as the MCLUST family of models but also include constraints on the degrees of freedom.

With the tEIGEN family, clustering is done via the expectation-conditional maximization (ECM) algorithm of Meng and Rubin (1993). In the ECM algorithm, the M-step is replaced by a number of conditional maximization steps which are more efficient from a computational standpoint. The missing data are represented by the same indicator variables $z_{ig}$ (2.5) as in the MCLUST case, but with the addition of characteristic weights $u_{ig}$.

Under the model-based clustering framework, the observations do not have known group memberships. Thus the $z_{ig}$ must be initialized in order to run the ECM algorithm, and there are several options for this. For MCLUST, Fraley and Raftery (2002) use a model-based agglomerative hierarchical clustering procedure to obtain starting values. For parsimonious Gaussian mixture models (PGMM), McNicholas and Murphy (2008) use a number of random starting values on their most constrained model
$$\Sigma_g = \Lambda \Lambda^\top + \Psi,$$
where $\Lambda$ is a $(p \times q)$ matrix of parameters (factor loadings), typically with $q \ll p$, and $\Psi$ is a diagonal noise matrix. The results are then picked from the iteration with the largest BIC value to initialize for each number of clusters G. Andrews and McNicholas (2012a) suggest using the agglomerative hierarchical initialization method of MCLUST in order to avoid algorithm failure when clustering with mixtures of multivariate t-distributions.
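To illustrate the initialization choice just described, the hedged sketch below obtains model-based agglomerative hierarchical starting partitions with the mclust functions hc() and hclass(), in the spirit of the MCLUST initialization. How the resulting labels are handed to the ECM routine is left abstract, since that detail depends on the fitting software's interface; the iris data are used purely as a stand-in for a numeric data matrix.

```r
library(mclust)  # provides hc() and hclass() for model-based hierarchical agglomeration

set.seed(123)
X <- scale(iris[, 1:4])          # any numeric data matrix; iris used only as an example

# Model-based agglomerative hierarchical clustering (as used by MCLUST for starting values)
hc_tree <- hc(data = X, modelName = "VVV")

# Hard initial classifications for G = 1, ..., 4 clusters
init_classes <- hclass(hc_tree, G = 1:4)
head(init_classes)               # one column of starting labels per value of G

# These starting labels would then be converted to initial z_ig indicator matrices
# and passed to the ECM algorithm for the chosen t-mixture model.
```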


Table 2.2: Nomenclature for models in the tEIGEN family: 'C' indicates that a constraint is imposed, 'U' indicates that a constraint is not imposed, 'I' indicates the identity matrix of suitable dimension (Andrews and McNicholas, 2012a)

Model  λ_g = λ  D_g = D  A_g = A  ν_g = ν  Free covariance parameters
CIIC   C        I        I        C        1 + 1
CIIU   C        I        I        U        1 + G
UIIC   U        I        I        C        (G − 1) + 1
UIIU   U        I        I        U        (G − 1) + G
CICC   C        I        C        C        p + 1
CICU   C        I        C        U        p + G
UICC   U        I        C        C        p + (G − 1) + 1
UICU   U        I        C        U        p + (G − 1) + G
CIUC   C        I        U        C        Gp − (G − 1) + 1
CIUU   C        I        U        U        Gp − (G − 1) + G
UIUC   U        I        U        C        Gp + 1
UIUU   U        I        U        U        Gp + G
CCCC   C        C        C        C        [p(p + 1)/2] + 1
CCCU   C        C        C        U        [p(p + 1)/2] + G
UCCC   U        C        C        C        [p(p + 1)/2] + (G − 1) + 1
UCCU   U        C        C        U        [p(p + 1)/2] + (G − 1) + G
CUCC   C        U        C        C        G[p(p + 1)/2] − (G − 1)p + 1
CUCU   C        U        C        U        G[p(p + 1)/2] − (G − 1)p + G
UUCC   U        U        C        C        G[p(p + 1)/2] − (G − 1)(p − 1) + 1
UUCU   U        U        C        U        G[p(p + 1)/2] − (G − 1)(p − 1) + G
CUUC   C        U        U        C        G[p(p + 1)/2] − (G − 1) + 1
CUUU   C        U        U        U        G[p(p + 1)/2] − (G − 1) + G
UUUC   U        U        U        C        G[p(p + 1)/2] + 1
UUUU   U        U        U        U        G[p(p + 1)/2] + G


For the tEIGEN models, parameter estimation is comparable to the Gaussian case but it includes two additional steps:

• incorporation of the weights $u_{ig}$;

• estimation of the degrees of freedom.

Andrews and McNicholas (2012a) define the complete-data log-likelihood for model-based clustering with tEIGEN for n p-dimensional observations as
$$l(\vartheta) = \sum_{g=1}^{G} \sum_{i=1}^{n} z_{ig} \log\left[\pi_g\, \gamma\!\left(u_{ig} \,\Big|\, \frac{\nu_g}{2}, \frac{\nu_g}{2}\right) f_N\!\left(x_i \,\Big|\, \mu_g, \frac{\Sigma_g}{u_{ig}}\right)\right], \qquad (2.11)$$
where $\gamma$ is the gamma density given by
$$\gamma(y \mid \alpha, \beta) = \frac{\beta^\alpha y^{\alpha-1} \exp(-\beta y)}{\Gamma(\alpha)}\, I_{\{y>0\}},$$
for $\alpha > 0$, $\beta > 0$, and indicator function
$$I_{\{y>0\}} = \begin{cases} 1 & \text{for } y > 0, \\ 0 & \text{otherwise.} \end{cases}$$

At each E-step, the $z_{ig}$ are updated (cf. 2.6) by their conditional expected values
$$\hat{z}_{ig} = \frac{\pi_g f_t(x \mid \mu_g, \Sigma_g, \nu_g)}{\sum_{j=1}^{G} \pi_j f_t(x \mid \mu_j, \Sigma_j, \nu_j)}, \qquad (2.12)$$
and the characteristic weights are updated as follows:
$$\hat{u}_{ig} = \frac{\nu_g + p}{\nu_g + \delta(x_i, \mu_g \mid \Sigma_g)},$$
where $\delta(x_i, \mu_g \mid \Sigma_g)$ is the squared Mahalanobis distance between $x_i$ and $\mu_g$.

The degrees of freedom for the case where $\nu_g = \nu$ are updated by solving the equation
$$-\varphi\!\left(\frac{\hat{\nu}_{\text{new}}}{2}\right) + \log\!\left(\frac{\hat{\nu}_{\text{new}}}{2}\right) + 1 + \frac{1}{n} \sum_{g=1}^{G} \sum_{i=1}^{n} \hat{z}_{ig}\left(\log \hat{u}_{ig} - \hat{u}_{ig}\right) + \varphi\!\left(\frac{\hat{\nu}_{\text{old}} + p}{2}\right) - \log\!\left(\frac{\hat{\nu}_{\text{old}} + p}{2}\right) = 0,$$
where $\varphi(\cdot)$ denotes the digamma function.
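The E-step quantities and the constrained degrees-of-freedom update can be sketched in R as follows. This is a simplified illustration, not the tEIGEN implementation: the component density is evaluated with mvtnorm::dmvt(), treating each $\Sigma_g$ as the scale matrix of (2.10), and the root-finding interval passed to uniroot() is an assumption rather than a guaranteed bracket.

```r
library(mvtnorm)  # for dmvt(); assumed available

# One E-step for a mixture of multivariate t-distributions (illustrative sketch).
# X: n x p data matrix; pi_g, mu, Sigma, nu: current parameter values.
t_e_step <- function(X, pi_g, mu, Sigma, nu) {
  p <- ncol(X); G <- length(pi_g)

  # Responsibilities, as in (2.12)
  dens <- sapply(seq_len(G), function(g) {
    pi_g[g] * dmvt(X, delta = mu[[g]], sigma = Sigma[[g]], df = nu[g], log = FALSE)
  })
  z_hat <- dens / rowSums(dens)

  # Characteristic weights u_hat[i, g] = (nu_g + p) / (nu_g + delta(x_i, mu_g | Sigma_g))
  u_hat <- sapply(seq_len(G), function(g) {
    (nu[g] + p) / (nu[g] + mahalanobis(X, center = mu[[g]], cov = Sigma[[g]]))
  })

  list(z = z_hat, u = u_hat)
}

# Constrained degrees-of-freedom update (nu_g = nu): solve the digamma equation
# displayed above for nu_new by one-dimensional root finding.
update_nu <- function(z_hat, u_hat, nu_old, p) {
  n <- nrow(z_hat)
  const <- 1 + sum(z_hat * (log(u_hat) - u_hat)) / n +
    digamma((nu_old + p) / 2) - log((nu_old + p) / 2)
  f <- function(nu) -digamma(nu / 2) + log(nu / 2) + const
  uniroot(f, lower = 2.001, upper = 200)$root  # search interval is an assumption
}
```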


Chapter 3

Methodology

There are several approaches to using mixture models in a clustering context, which were discussed in Chapter 2. McNicholas (2011) presents a review of work in model-based clustering with a particular focus on two families of Gaussian mixture models, namely MCLUST (Fraley and Raftery, 2002) and PGMM (McNicholas and Murphy, 2008).

3.1 Dimension Reduction for Mixtures of Multivariate Gaussian Distributions (GMMDR)

Scrucca (2010) proposed a novel approach to model-based clustering, namely dimension reduction for mixtures of multivariate Gaussian distributions (GMMDR). The main idea is to find a reduced subspace which captures most of the clustering structure in the data. By following the work of Li (1991, 2000) on sliced inverse regression (SIR), one can obtain information on the dimension reduction subspace from two sources:

• the variation on group means;

• the variation on group covariances (depending on the estimated mixture model).

Classical procedures for reducing the dimensions in the data are principal components analysis and factor analysis. These techniques lower dimensionality by forming linear combinations of the variables. In terms of visualizing any potential clustering structure, neither method is particularly useful.

The proposed method reduces dimensionality by identifying a set of linear combinations of the original variables, ordered by importance via their associated eigenvalues, which capture most of the cluster structure in the data. Observations are then projected onto the reduced subspace and their plots help visualize the clustering structure.


The new GMMDR variables contain most of the clustering information of the original data and they can be reduced further to improve performance.

3.2 Dimension Reduction for Mixtures of Multivariate t-Distributions (tMMDR)

In this section, we follow the work of Scrucca (2010) and apply it to mixtures of multivariate t-distributions. Suppose a G-component t-distribution mixture model (tMM) is imposed on a set of data with n observations and p variables. Recall from (2.9) in the previous chapter that this model takes the form
$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_t(x \mid \mu_g, \Sigma_g, \nu_g). \qquad (3.1)$$

3.2.1 Clustering on a Dimension Reduced Subspace

Consider a $(p \times 1)$ vector X of random variables and a discrete random variable Y taking G distinct values to indicate the G clusters. Let $\beta$ denote a fixed $(p \times d)$ matrix with $d \le p$ such that
$$Y \perp X \mid \beta^\top X. \qquad (3.2)$$
The conditional independence in (3.2) tells us that the distribution of $Y \mid X$ is the same as the distribution of $Y \mid \beta^\top X$ for all values of X in its marginal sample space. As a consequence, we can replace the $(p \times 1)$ vector X with the $(d \times 1)$ vector $\beta^\top X$ without loss of clustering information. If $d < p$, then we have reduced the dimension of the predictor vector.

Li (1991) defines the basis for the subspace $S(\beta)$ given by $\beta$ as a dimension-reduction subspace (DRS) for the regression of Y on X. A minimum DRS may not necessarily be unique, but if several such subspaces exist, then they all have the same dimension.

The assumption in (3.2) implies that $P(Y = g \mid X) = P(Y = g \mid \beta^\top X)$, so the density for the g-th component of the mixture model in (3.1) can be written as


$$f_g(x) = \frac{P(Y = g \mid x)\, f_X(x)}{P(Y = g)} = \frac{P(Y = g \mid \beta^\top x)\, f_X(x)}{P(Y = g)} = f_g(\beta^\top x)\, \frac{f_X(x)}{f_{\beta^\top X}(\beta^\top x)}. \qquad (3.3)$$

Also, for any two groups, i and j say, the ratio of the tMM densities is
$$\frac{f_i(x)}{f_j(x)} = \frac{f_i(\beta^\top x)}{f_j(\beta^\top x)}.$$
Given (3.2), the ratio of the conditional densities is the same whether it is computed on the original variable space or on the DRS. Thus the clustering information is contained completely in $S(\beta)$.

We know that an observation is assigned to a cluster g by the MAP (2.7), i.e., to the cluster for which the conditional probability given the data is a maximum, such that
$$\arg\max_g\, P(Y = g \mid x) = \frac{\pi_g f_g(x)}{\sum_{j=1}^{G} \pi_j f_j(x)},$$
which is equivalent to
$$\arg\max_g\, P(Y = g \mid \beta^\top x) = \frac{\pi_g f_g(\beta^\top x)}{\sum_{j=1}^{G} \pi_j f_j(\beta^\top x)},$$
by (3.3). Hence, the assignment of an observation to a cluster is unchanged when performed on the DRS.

3.2.2 Estimation of the tMMDR Directions

Given a tMM (3.1), we wish to find the smallest subspace which captures the clustering information contained in the data. Thus, we need to identify those directions where the cluster means $\mu_g$ vary as much as possible, provided that each direction is $\Sigma$-orthogonal to the others.


Specifically, we need to maximize
$$\arg\max_\beta\ \beta^\top \Sigma_B \beta \quad \text{subject to} \quad \beta^\top \Sigma \beta = I_d, \qquad (3.4)$$
where

• $\mu = \sum_{g=1}^{G} \pi_g \mu_g$ is the global mean,

• $\Sigma_B = \sum_{g=1}^{G} \pi_g (\mu_g - \mu)(\mu_g - \mu)^\top$ is the between-cluster covariance matrix,

• $\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$ is the covariance matrix,

• $\beta \in \mathbb{R}^{p \times d}$ is the spanning matrix,

• $I_d$ is the $(d \times d)$ identity matrix.

The solution to the constrained optimization problem in (3.4) is given by the eigen-decomposition of the kernel matrix
$$M_I \equiv \Sigma_B \quad \text{with respect to } \Sigma. \qquad (3.5)$$
The eigenvectors corresponding to the first d largest eigenvalues, $[v_1, \ldots, v_d] \equiv \beta$, provide a basis for the subspace $S(\beta)$ which shows the maximal variation among cluster means. The number of directions which span this subspace is

• $d = \min(p, G - 1)$ for models which assume equal within-cluster covariance matrices, or

• $d \le p$ for models which do not assume equal within-cluster covariance matrices.

From Scrucca (2010), we know that this is similar to the sliced inverse regression algorithms SIR (Li, 1991) and SIR II (Li, 2000). These are dimension reduction procedures that use information from the inverse mean function and differences in class covariance matrices. In order to find directions which account for variation in both cluster means and cluster covariances, Scrucca (2010) uses these algorithms to devise the following kernel matrix:


$$M = M_I \Sigma^{-1} M_I + M_{II}, \qquad (3.6)$$
where $M_I \equiv \Sigma_B$ as before,
$$M_{II} = \sum_{g=1}^{G} \pi_g (\Sigma_g - \bar{\Sigma}) \Sigma^{-1} (\Sigma_g - \bar{\Sigma})^\top, \quad \text{and} \quad \bar{\Sigma} = \sum_{g=1}^{G} \pi_g \Sigma_g$$
is the pooled within-cluster covariance matrix.

Thus the kernel matrix in (3.6) contains information on how both cluster means and cluster covariances vary. Now the optimization problem in (3.4) is given by
$$\arg\max_\beta\ \beta^\top M \beta \quad \text{subject to} \quad \beta^\top \Sigma \beta = I_d, \qquad (3.7)$$
and it is solved using the generalized eigen-decomposition
$$M v_i = l_i \Sigma v_i, \qquad (3.8)$$
where
$$v_i^\top \Sigma v_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}$$
and $l_1 \ge l_2 \ge \ldots \ge l_d > 0$.

Definition 3.1 The tMMDR directions are the eigenvectors $[v_1, \ldots, v_d] \equiv \beta$ which form the basis of the dimension reduction subspace $S(\beta)$ and constitute the solution to (3.7).

Suppose $S(\beta)$ is the subspace spanned by the tMMDR directions from (3.8), and $\mu_g$, $\Sigma_g$ are the mean and covariance for cluster g. Then the projections of the parameters onto $S(\beta)$ are given by $\beta^\top \mu_g$ and $\beta^\top \Sigma_g \beta$.


Definition 3.2 The tMMDR variables, Z, are the projections of the $(n \times p)$ data matrix X onto the subspace $S(\beta)$ and can be computed as $Z = X\beta$.

The raw coefficients of the estimated directions are uniquely determined only up to multiplication by a scalar, while the associated directions from the origin are unique. Hence, they can be normalized to have unit length:
$$\beta_j \equiv \frac{v_j}{\|v_j\|} \quad \text{for } j = 1, \ldots, d.$$
If we set $D = \mathrm{diag}(V^\top V)$, where V is the matrix of eigenvectors from (3.8), then $\beta \equiv V D^{-1/2}$. For the tMMDR variables we have
$$\mathrm{Cov}(Z) = \beta^\top \Sigma \beta = D^{-1/2} V^\top \Sigma V D^{-1/2} = D^{-1} = \mathrm{diag}\left(\frac{1}{\|v_j\|^2}\right).$$
We can see that the tMMDR variables are uncorrelated, while the tMMDR directions are orthogonal with respect to the $\Sigma$-inner product.

For an $(n \times p)$ sample data matrix X, the sample version $\hat{M}$ of the kernel in (3.6) is obtained using the corresponding estimates from the fit of a t-distribution mixture model via the EM algorithm. Then the tMMDR directions are calculated from the generalized eigen-decomposition of $\hat{M}$ with respect to $\hat{\Sigma}$. The tMMDR directions are ordered based on their eigenvalues, which means that directions associated with approximately zero eigenvalues can be discarded in practice, since clusters will overlap a lot along these directions. Also, their contribution to the overall position of the sample points in an eigenvector expansion is approximately zero, so they provide little positional information, and their associated eigenvectors are liable to be unstable in any case.
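The estimation just described can be sketched directly in R. The function below takes fitted mixture quantities (mixing proportions, component means and covariances) together with the data, builds the sample kernel matrix of (3.6), solves the generalized eigenproblem (3.8) through a Cholesky reduction, and returns the normalized directions and the tMMDR variables Z = Xβ. It is a minimal sketch of the computation, not the tEIGEN or Scrucca implementations.

```r
# Sketch of tMMDR direction estimation from fitted mixture quantities.
# X: n x p data matrix; pi_g: mixing proportions; mu: list of component means;
# Sigma_g: list of component covariance matrices; d: number of directions kept.
tmmdr_directions <- function(X, pi_g, mu, Sigma_g, d) {
  n <- nrow(X); G <- length(pi_g)
  mu_bar <- Reduce(`+`, Map(`*`, pi_g, mu))                 # global mean
  Sigma  <- crossprod(sweep(X, 2, mu_bar)) / n              # overall covariance

  # Between-cluster covariance (M_I) and covariance-variation term (M_II)
  M_I <- Reduce(`+`, lapply(seq_len(G), function(g) {
    pi_g[g] * tcrossprod(mu[[g]] - mu_bar)
  }))
  Sigma_pooled <- Reduce(`+`, Map(`*`, pi_g, Sigma_g))
  Sigma_inv <- solve(Sigma)
  M_II <- Reduce(`+`, lapply(seq_len(G), function(g) {
    diff <- Sigma_g[[g]] - Sigma_pooled
    pi_g[g] * diff %*% Sigma_inv %*% t(diff)
  }))
  M <- M_I %*% Sigma_inv %*% M_I + M_II                     # kernel matrix (3.6)

  # Generalized eigenproblem M v = l Sigma v via Cholesky reduction
  R <- chol(Sigma)                                          # Sigma = t(R) %*% R
  A <- solve(t(R)) %*% M %*% solve(R)
  ev <- eigen((A + t(A)) / 2, symmetric = TRUE)
  V <- solve(R, ev$vectors[, seq_len(d), drop = FALSE])     # Sigma-orthonormal v_j

  beta <- sweep(V, 2, sqrt(colSums(V^2)), `/`)              # unit-length directions
  list(directions = beta, Z = X %*% beta, eigenvalues = ev$values[seq_len(d)])
}
```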


tMMDR variables which provide no clustering in<strong>for</strong>mation but require parameter estimation.Thus, the next step in the process <strong>of</strong> model-<strong>based</strong> clustering is to detect andremove these unnecessary variables.Scrucca (2010) used the subset selection method <strong>of</strong> Raftery and Dean (2006) toprune the subset <strong>of</strong> GMMDR variables. We will also use this approach to select themost appropriate tMMDR variables.Let s be a subset <strong>of</strong> q features from the original tMMDR variables Z, with dim(s) = qand q ≤ d. Let s ′ = {s \ i} ⊂ s be the set <strong>of</strong> dim = q − 1 which is obtained by excludingthe i-th feature from s. The comparison <strong>of</strong> the two subsets can be viewed as a modelcomparison problem and addressed by using the BIC difference which is given in Rafteryand Dean (2006) asBIC diff (Z i∈s ) = BIC clust (Z s ) − BIC not clust (Z s ) , (3.9)where BIC clust (Z s ) is the BIC value <strong>for</strong> the best model fitted using features in s andBIC not clust (Z s ) is the BIC value <strong>for</strong> no clustering. We can write the latter asBIC not clust (Z s ) = BIC clust (Z s ′) + BIC reg (Z i |Z s ′) ,where BIC clust (Z s ′) is the BIC value <strong>for</strong> the best model fitted using features in s ′ andBIC reg (Z i |Z s ′) is the BIC value <strong>for</strong> the regression <strong>of</strong> the i-th feature on the remaining(q − 1) features in s ′ . Since the tMMDR variables are orthogonal, the <strong>for</strong>mula <strong>for</strong> BIC reg(Raftery and Dean, 2006) reduces to( ) RSSBIC reg (Z i |Z s ′) = −n log(2π) − n log − n − (q + 1) log(n) ,nwhere RSS is the residual sum <strong>of</strong> squares in the regression <strong>of</strong> Z i on Z s ′.Now, the space <strong>of</strong> all possible subsets has size dim(s) = q, where q = 1, . . . , d has 2 d −1elements and an exhaustive search would be unfeasible. To bypass this issue, Rafteryand Dean (2006) proposed a greedy search algorithm which finds a local optimum in themodel space. The algorithm comprises the following steps:20


• the forward step evaluates the inclusion of a proposed variable;

• the backward step evaluates the exclusion of one of the currently included variables;

• termination occurs when consecutive inclusion and exclusion steps are rejected.

Since the tMMDR variables are $\Sigma$-orthogonal, we do not need to perform the backward step. With the BIC (2.8) as the method of model comparison, the greedy search algorithm proceeds as follows.

1. Let $s = \{1, 2, \ldots, d\}$ be the set of tMMDR variables $Z_s$ computed for the mixture model as described in Section 3.2.2.

2. Select the first variable to be the one which maximizes the BIC difference in (3.9).

   • Let $s_1 = \{i\}$ be the candidate set and $s_1' = \emptyset$ be the set of variables which are already included.

   • Choose the variable $Z_{i_1}$ such that
   $$i_1 = \arg\max_{i \in s}\ \mathrm{BIC}_{\mathrm{diff}}(Z_{s_1}) = \arg\max_{i \in s}\ \left(\mathrm{BIC}_{\mathrm{clust}}(Z_{s_1}) - \mathrm{BIC}_{\mathrm{reg}}(Z_i)\right).$$
   At the first iteration $\dim(s_1) = 1$, thus maximization occurs over univariate models with $G = 1, 2, \ldots, \max(G)$, where $\max(G)$ is the maximum number of clusters considered for the data.

   • Then, let $s_1 = \{i_1\}$, set $j = 2$ and proceed to the next step.

3. Select a variable to include, among those not already included, to be the one which maximizes the BIC difference in (3.9).

   • Let $s_j = s_{j-1} \cup \{i\}$ be the candidate set and $s_j' = s_{j-1}$ be the set of variables already included.

   • Choose the variable $Z_{i_j}$ to be included such that
   $$i_j = \arg\max_{i \in s \setminus s_j'}\ \mathrm{BIC}_{\mathrm{diff}}(Z_{s_j}).$$
   Here the best model is identified with respect to the number of mixture components up to $\max(G)$ and the model parametrization.


   • Then update the set of currently included variables such that $s_j = s_j' \cup \{i_j\}$.

4. Set $j = j + 1$ and iterate the previous step until a stopping rule is met. Our algorithm terminates when the BIC difference for the inclusion of a variable becomes negative.

At each step, the search over the model space is performed with respect to the model parametrization and the number of clusters.

As mentioned before, the greedy search will likely discard the tMMDR variables which are associated with small eigenvalues, as they do not carry any clustering information. Then a tMM can be fitted on the selected set of tMMDR variables and the corresponding tMMDR directions will be estimated. The feature selection step is then repeated until no directions can be dropped. This entire process is summarized in the algorithm below.

Algorithm: tMMDR estimation and feature selection

1. Fit a t-distribution mixture model (tMM) to the data.

2. Estimate the tMMDR directions using the method in Section 3.2.2.

3. Perform feature selection using the greedy algorithm in Section 3.2.3.

4. Fit a tMM on the selected tMMDR variables and return to step 2.

5. Repeat steps 2-4 until none of the features can be dropped.
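The greedy forward search can be sketched schematically as follows. This is an outline, not the thesis implementation: fit_best_bic() is a hypothetical helper standing in for "fit every candidate mixture model up to max(G) on the supplied variables and return the best BIC", and the regression BIC follows the reduced Raftery and Dean (2006) formula given above.

```r
# BIC for regressing feature z_i on the already selected features,
# using the reduced formula for orthogonal tMMDR variables.
bic_reg <- function(z_i, Z_selected) {
  n <- length(z_i)
  q <- ncol(Z_selected) + 1  # dimension of the candidate set s
  fit <- if (ncol(Z_selected) == 0) lm(z_i ~ 1) else lm(z_i ~ Z_selected)
  rss <- sum(residuals(fit)^2)
  -n * log(2 * pi) - n * log(rss / n) - n - (q + 1) * log(n)
}

# Greedy forward selection over the columns of the tMMDR variables Z.
# fit_best_bic(Z_subset): hypothetical helper returning the best BIC over all
# candidate models and numbers of clusters for Z_subset.
greedy_select <- function(Z, fit_best_bic) {
  selected <- integer(0)
  repeat {
    candidates <- setdiff(seq_len(ncol(Z)), selected)
    if (length(candidates) == 0) break
    bic_diff <- sapply(candidates, function(i) {
      trial <- c(selected, i)
      bic_clust_s      <- fit_best_bic(Z[, trial, drop = FALSE])
      bic_clust_sprime <- if (length(selected) == 0) 0   # first step: nothing selected yet
                          else fit_best_bic(Z[, selected, drop = FALSE])
      bic_not_clust <- bic_clust_sprime + bic_reg(Z[, i], Z[, selected, drop = FALSE])
      bic_clust_s - bic_not_clust                         # BIC_diff of (3.9)
    })
    if (max(bic_diff) < 0) break                          # stop when no inclusion helps
    selected <- c(selected, candidates[which.max(bic_diff)])
  }
  selected
}
```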


Chapter 4

Applications

We now focus our attention on applying the model-based clustering methods GMMDR (Scrucca, 2010) and tMMDR to simulated and real data. We used the R software (R Development Core Team, 2012) to achieve this. For GMMDR, the clustering algorithms are run via the package mclust (Fraley and Raftery, 2006). For tMMDR, the clustering is done through the package tEIGEN (Andrews and McNicholas, 2012b).

4.1 Simulated Data Examples

In this section, we follow the data simulation schemes outlined in Scrucca (2010). The tMMDR algorithm is implemented and applied to these data, and its performance is compared to the GMMDR procedure.

Model 1: we consider three overlapping clusters with common covariance, corresponding to an EEE Gaussian mixture model and to a CCCC t-distribution mixture model. The data sets are generated from three overlapping clusters with equal mixing probabilities on three variables generated from a multivariate normal distribution with means
$$\mu_1 = [0, 0, 0]^\top, \quad \mu_2 = [0, 2, 2]^\top, \quad \mu_3 = [2, -2, -2]^\top,$$
and common covariance matrix
$$\Sigma = \begin{bmatrix} 2 & 0.7 & 0.8 \\ 0.7 & 0.5 & 0.3 \\ 0.8 & 0.3 & 1 \end{bmatrix}.$$
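A minimal R sketch of generating one data set under Model 1 (equal mixing proportions, common covariance) might look as follows; drawing the cluster labels with sample() is one of several reasonable choices for equal mixing probabilities and is an assumption of this sketch.

```r
library(MASS)  # for mvrnorm()

set.seed(42)
n <- 300
mu <- list(c(0, 0, 0), c(0, 2, 2), c(2, -2, -2))
Sigma <- matrix(c(2,   0.7, 0.8,
                  0.7, 0.5, 0.3,
                  0.8, 0.3, 1),
                nrow = 3, byrow = TRUE)

labels <- sample(1:3, size = n, replace = TRUE)  # equal mixing probabilities
X <- t(sapply(labels, function(g) mvrnorm(1, mu = mu[[g]], Sigma = Sigma)))
# X is an n x 3 matrix of Model 1 observations; 'labels' holds the true clusters.
```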


[Figure 4.1: Pairs plot for the three overlapping clusters in Model 1 (n = 300 data points).]

Model 2: we consider three overlapping clusters with common shape, corresponding to a VEV Gaussian mixture model and to a UUCU t-distribution mixture model. The data sets are generated from three overlapping clusters with equal mixing probabilities on three variables generated from a multivariate normal distribution with means
$$\mu_1 = [0, 0, 0]^\top, \quad \mu_2 = [4, -2, 6]^\top, \quad \mu_3 = [-2, -4, 2]^\top.$$
For the covariance matrices $\Sigma_g = \lambda_g D_g A D_g^\top$, where $g = 1, 2, 3$, the scale, shape and orientation parameters are, respectively, $\lambda = [0.2, 0.5, 0.8]^\top$, $A = \mathrm{diag}(1, 2, 3)$ and
$$D_1 = \begin{bmatrix} 1 & 0.6 & 0.6 \\ 0.6 & 1 & 0.6 \\ 0.6 & 0.6 & 1 \end{bmatrix}, \quad D_2 = \begin{bmatrix} 2 & -1.2 & 1.2 \\ -1.2 & 2 & -1.2 \\ 1.2 & -1.2 & 2 \end{bmatrix}, \quad D_3 = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}.$$


[Figure 4.2: Pairs plot for the three overlapping clusters in Model 2 (n = 300 data points).]

Model 3: we consider three overlapping clusters with unconstrained covariance, corresponding to a VVV Gaussian mixture model and to a UUUU t-distribution mixture model. The data sets are generated from three overlapping clusters with equal mixing probabilities on three variables generated from a multivariate normal distribution with means
$$\mu_1 = [0, 0, 0]^\top, \quad \mu_2 = [4, -2, 6]^\top, \quad \mu_3 = [-2, -4, 2]^\top,$$
and within-group covariances
$$\Sigma_1 = \begin{bmatrix} 1 & 0.9 & 0.9 \\ 0.9 & 1 & 0.9 \\ 0.9 & 0.9 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 2 & -1.8 & 1.8 \\ -1.8 & 2 & -1.8 \\ 1.8 & -1.8 & 2 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}.$$

For each model we ran three scenarios, namely

• scenario one (no noise variables): generated three variables from a multivariate normal distribution;

• scenario two (noise variables): started with scenario one and added seven noise variables generated from independent standard normal variables;


• scenario three (redundant and noise variables): started with scenario one and added three variables correlated with each clustering variable (with correlation coefficients equal to 0.9, 0.7 and 0.5, respectively) as well as four independent standard normal variables.

[Figure 4.3: Pairs plot for the three overlapping clusters in Model 3 (n = 300 data points).]

In order to ascertain the performance of the clustering methods under varying data dimensions, each scenario was run for three data sets consisting of 100, 300 and 1000 data points generated according to the schemes described earlier. Every run comprised 1000 simulations. Table 4.1 shows the comparison between our results for tMMDR and the results for GMMDR from Scrucca (2010).

For the scenarios which include noise and redundant variables, tMMDR exhibits ARI values which are higher than those for GMMDR. This occurs consistently for all models and varying data dimensions.
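Clustering accuracy throughout is measured with the adjusted Rand index (ARI). A minimal sketch of how the average ARI and its standard error over repeated simulations could be computed is given below; adjustedRandIndex() is from the mclust package, while simulate_model1() and cluster_labels() are hypothetical stand-ins for the data-generation scheme above and for whichever fitting routine is being evaluated.

```r
library(mclust)  # provides adjustedRandIndex()

# Hypothetical helpers (not defined here):
#   simulate_model1(n)  -> list(X = data matrix, labels = true cluster labels)
#   cluster_labels(X)   -> estimated cluster labels from the fitted mixture model
evaluate_ari <- function(n_points, n_reps, simulate_model1, cluster_labels) {
  ari <- replicate(n_reps, {
    sim <- simulate_model1(n_points)
    adjustedRandIndex(cluster_labels(sim$X), sim$labels)
  })
  c(mean = mean(ari), se = sd(ari) / sqrt(n_reps))  # average ARI and its standard error
}
```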


Table 4.1: Average ARI (with standard errors in brackets) based on 1000 simulations.

                 No noise (3 variables)            Noise variables (10 variables)     Noise and redundant variables (10 variables)
                 n=100     n=300     n=1000        n=100     n=300     n=1000         n=100     n=300     n=1000
Model 1
GMMDR (EEE)      0.9716    0.9742    0.9753        0.8612    0.9674    0.9742         0.8832    0.9699    0.9742
                 (0.0347)  (0.0172)  (0.0088)      (0.1426)  (0.0234)  (0.009)        (0.1613)  (0.0197)  (0.0088)
tMMDR (CCCC)     0.9705    0.9722    0.9747        0.9210    0.9709    0.9737         0.9987    0.9994    0.9995
                 (0.0314)  (0.0163)  (0.0085)      (0.1763)  (0.0172)  (0.0087)       (0.0116)  (0.0028)  (0.0012)
Model 2
GMMDR (VEV)      0.9709    0.9802    0.9819        0.9201    0.9747    0.9806         0.9231    0.9727    0.9806
                 (0.0351)  (0.0141)  (0.0073)      (0.0799)  (0.0165)  (0.0082)       (0.0811)  (0.0169)  (0.0077)
tMMDR (UUCU)     0.9768    0.9796    0.9815        0.9609    0.9781    0.9818         0.9898    0.9995    0.9997
                 (0.0283)  (0.0142)  (0.0074)      (0.1571)  (0.0101)  (0.0079)       (0.0197)  (0.0014)  (0.0008)
Model 3
GMMDR (VVV)      0.9952    0.9981    0.9983        0.8643    0.9674    0.9751         0.8799    0.9712    0.9740
                 (0.0146)  (0.0042)  (0.0024)      (0.1384)  (0.021)   (0.0081)       (0.1609)  (0.0185)  (0.0099)
tMMDR (UUUU)     0.9972    0.9982    0.9981        0.9865    0.9973    0.9981         0.9612    0.9991    0.9994
                 (0.0092)  (0.0042)  (0.0023)      (0.0801)  (0.0052)  (0.0025)       (0.1241)  (0.0168)  (0.0073)


4.2 Real Data Examples

In this section we apply the GMMDR and tMMDR methods to real data sets and use the ARI (2.13) to assess clustering performance. We chose data which appear quite often in the literature for model-based clustering and provide a well-rounded framework for testing the algorithms.

4.2.1 Coffee Data

Streuli (1973) recorded twelve chemical compositions for two types of coffee, namely the Arabica and Robusta varieties. The coffee samples, totalling 43 observations, were collected from around the world and classified according to variety and country of provenience, as shown in Table 4.2. These data are available through the R package pgmm (McNicholas et al., 2011).

Table 4.2: Variable description for the coffee data.

Variable              Range of values
Variety               1 (Arabica) or 2 (Robusta)
Country               43 different countries
Water                 [5, 12]
Bean weight           [110.8, 191.2]
Extract yield         [29, 36.2]
pH value              [5.21, 6.13]
Free acid             [28.4, 43.4]
Mineral content       [3.48, 4.49]
Fat                   [7.2, 17]
Caffeine              [0.9, 2.16]
Trigonelline          [0.32, 1.38]
Chlorogenic acid      [4.8, 6.41]
Neochlorogenic acid   [0.12, 0.75]
Isochlorogenic acid   [0.31, 1.64]


The known classification of the coffee data by variety has two groups: Arabica with 36 observations and Robusta with 7 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.3.

Table 4.3: Model-based clustering results for the coffee data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    E       2              -                    1          1
tMMDR    CIUU    2              {48.1, 46.8}         5          1

Both methods accomplish perfect clustering, as evidenced by an ARI equal to 1. However, tMMDR needs five features to achieve this, whereas GMMDR takes only one feature. The resulting classification is presented in Table 4.4.

Table 4.4: A classification table for the best tMMDR model fitted to the coffee data.

            Cluster
Variety     1     2
Arabica    36     0
Robusta     0     7

Figure 4.4 illustrates a scatterplot of the tMMDR directions corresponding to the two clusters found by the procedure. The separation between the varieties of coffee is very clear, as there is no overlap between the clusters.
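A schematic of the workflow used for these real-data comparisons is sketched below. The coffee data are taken from the pgmm package; Mclust() and adjustedRandIndex() are from mclust. The call to teigen() assumes the interface of the tEIGEN/teigen software (data matrix, candidate numbers of groups, candidate models), and the output component names and column layout of the coffee data are assumptions; the dimension-reduction and feature-selection steps of Chapter 3 are not shown.

```r
library(pgmm)    # coffee data
library(mclust)  # Mclust(), adjustedRandIndex()
library(teigen)  # teigen(); interface assumed as in the tEIGEN family software

data(coffee)
X <- scale(coffee[, -(1:2)])   # drop Variety and Country (column layout assumed), scale the rest
truth <- coffee$Variety

# Gaussian mixture (MCLUST family); model and G chosen by BIC
fit_g <- Mclust(X, G = 1:4)
adjustedRandIndex(fit_g$classification, truth)

# t-distribution mixture (tEIGEN family)
fit_t <- teigen(X, Gs = 1:4, models = "all")
adjustedRandIndex(fit_t$classification, truth)
```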


[Figure 4.4: Plots of tMMDR directions for the coffee data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]

4.2.2 Italian Wines Data

Forina et al. (1986) recorded twenty-eight chemical and physical properties for three types of Italian wines, namely Barolo, Grignolino and Barbera. For our analysis, we used a subset of 13 of these variables, as shown in Table 4.5. These data, comprising 178 observations, are available through the R package gclus (Hurley, 2010).


Table 4.5: Variable description for the wines data.

Variable                        Range of values
Type                            1 (Barolo), 2 (Grignolino) or 3 (Barbera)
Alcohol                         [11.03, 14.83]
Malic acid                      [0.74, 5.8]
Ash                             [1.36, 3.23]
Alcalinity of ash               [10.6, 30]
Magnesium                       [70, 162]
Total phenols                   [0.98, 3.88]
Flavanoids                      [0.34, 5.08]
Nonflavanoid phenols            [0.13, 0.66]
Proanthocyanins                 [0.41, 3.58]
Color intensity                 [1.28, 13]
Hue                             [0.48, 1.71]
OD280/OD315 of diluted wines    [1.27, 4]
Proline                         [278, 1680]

The known classification of the wine data by type has three groups: Barolo with 59 observations, Grignolino with 71 observations and Barbera with 48 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.6.

Table 4.6: Model-based clustering results for the wine data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    VEV     3              -                    5          0.85
tMMDR    CUCC    3              57.8                 4          0.9309

The tMMDR method (ARI = 0.93) produces a better clustering than the GMMDR method (ARI = 0.85) on the wine data and uses fewer features in the process. The resulting classification is presented in Table 4.7.


Table 4.7: A classification table for the best tMMDR model fitted to the wine data.

              Cluster
Wines         1     2     3
Barolo       59     2     0
Grignolino    0    67     0
Barbera       0     2    48

Figure 4.5 illustrates a scatterplot of the tMMDR directions corresponding to the three clusters found by the procedure. The separation between the varieties of wine is clear in the plots of direction one against directions two and three, as there is very little overlap between the clusters.

[Figure 4.5: Plots of tMMDR directions for the wine data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]


4.2.3 Australian Crabs Data

Campbell and Mahon (1974) recorded five measurements for specimens of Leptograpsus crabs found in Australia. Crabs were classified according to their colour (blue or orange) and their gender. This data set, which consists of 200 observations, is described in Table 4.8 and is available through the R package MASS (Venables and Ripley, 2002).

Table 4.8: Variable description for the crabs data.

Variable            Range of values
Species             B (blue) or O (orange)
Sex                 M (male) or F (female)
Frontal lobe size   [7.2, 23.1]
Rear width          [6.5, 20.20]
Carapace length     [14.7, 47.2]
Carapace width      [17.1, 54.6]
Body depth          [6.1, 21.6]

The known classification of the crabs data by colour and gender has four groups: 50 blue males, 50 orange males, 50 blue females and 50 orange females. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.9.

Table 4.9: Model-based clustering results for the crabs data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    EEV     4              -                    3          0.8195
tMMDR    CUCC    4              80.9                 4          0.8617

The tMMDR method (ARI = 0.86) produces a better clustering than the GMMDR method (ARI = 0.82) on the crabs data, but it requires one more feature than GMMDR to achieve this value. The resulting classification is presented in Table 4.10.


Table 4.10: A classification table for the best tMMDR model fitted to the crabs data.

            Cluster
Species     1     2     3     4
BF         50     8     0     0
OF          0    42     0     0
BM          0     0    47     0
OM          0     0     3    50

Figure 4.6 illustrates a scatterplot of the tMMDR directions corresponding to the four clusters found by the procedure. The separation between the crab species is most clear in the plots of direction one against direction three. As the number of clusters increases, it becomes more difficult to visualize their separation, as evidenced in the plots of direction two against directions three and four.


[Figure 4.6: Plots of tMMDR directions for the crabs data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]

4.2.4 Swiss Banknotes Data

Flury and Riedwyl (1988) presented six measurements taken from Swiss banknotes. The 200 observations are classified as either genuine or counterfeit, as shown in Table 4.11. These data are available through the R package gclus.

The known classification of the banknotes data by status has two groups: genuine with 100 observations and counterfeit with 100 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.12.


Table 4.11: Variable description for the banknotes data.

Variable   Range of values
Status     0 (genuine) or 1 (counterfeit)
Length     [213.8, 216.3]
Left       [129, 131]
Right      [129, 131.1]
Bottom     [7.2, 12.7]
Top        [7.7, 12.3]
Diagonal   [137.8, 142.4]

Table 4.12: Model-based clustering results for the banknotes data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    EEI     4              -                    3          0.6739
tMMDR    CICU    3              {13, 10.1, 63.6}     6          0.8603

The tMMDR method (ARI = 0.86) produces a better clustering than the GMMDR method (ARI = 0.67) on the banknotes data, although neither is able to classify the data correctly into its two known clusters. The resulting classification is presented in Table 4.13.

Table 4.13: A classification table for the best tMMDR model fitted to the banknotes data.

           Status
Cluster    1     2
1         99     0
2          1    15
3          0    85

While we expect genuine bank notes to present as one identifiable cluster, counterfeit notes may well appear in two clusters depending on the number of different sources of those notes.


The results may indicate that there were two different kinds of counterfeit notes.

Figure 4.7 illustrates a scatterplot of the tMMDR directions corresponding to the three clusters found by the procedure. The separation between the types of banknote is most clear in the plots of direction one against directions two and three, as they do not overlap.

[Figure 4.7: Plots of tMMDR directions for the banknotes data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]

4.2.5 Diabetes Data

Reaven and Miller (1979) examined the relationship between measures of blood plasma glucose and insulin in order to classify people as normal, overt diabetic or chemical diabetic. This data set consists of observations from 145 adult patients at the Stanford Clinical Research Centre, as described in Table 4.14, and is available through the R package locfit (Loader, 2012).


Table 4.14: Variable description for the diabetes data.

Variable                              Range of values
Classifier (cc)                       normal, chemical or overt
Relative weight (rw)                  [0.7, 1.2]
Fasting plasma glucose (fpg)          [70, 353]
Glucose area (ga)                     [269, 1568]
Insulin area (ina)                    [29, 480]
Steady state plasma glucose (sspg)    [10, 748]

The known classification for the diabetes data has three groups: normal with 76 observations, chemical with 36 observations and overt with 33 observations. We ran the GMMDR and tMMDR algorithms on the scaled version of the data using the MCLUST hierarchical agglomerative procedure for initialization. The results of the model-based clustering are shown in Table 4.15.

Table 4.15: Model-based clustering results for the diabetes data.

Method   Model   Clusters (G)   Degrees of freedom   Features   ARI
GMMDR    VEV     3              -                    3          0.6536
tMMDR    UUUC    3              65.8                 4          0.702

The tMMDR method (ARI = 0.70) produces a better clustering than the GMMDR method (ARI = 0.65) on the diabetes data. However, the classification of the data into their three known clusters is somewhat poor. The resulting classification is presented in Table 4.16.

Figure 4.8 illustrates a scatterplot of the tMMDR directions corresponding to the three clusters found by the procedure. The separation between the types of diabetes is most clear in the plots of direction one against direction three, where they overlap the least.


Table 4.16: A classification table for the best tMMDR model fitted to the diabetes data.

              Cluster
Classifier    1     2     3
Overt        26     0     0
Chemical      7    27     2
Normal        0     9    74

[Figure 4.8: Plots of tMMDR directions for the diabetes data. The shape of the observations indicates their true cluster classification and the colour gives their tMMDR cluster allocation.]


A summary of our clustering results for the real data appears in Table 4.17. In general, the tMMDR algorithm gives higher ARI values but requires more features than its GMMDR analogue. This is to be expected, since there are more parameters to be estimated in a t-mixture model than in a Gaussian mixture model, hence more information is required.

Table 4.17: Summary of the results for the real data.

                              tMMDR                              GMMDR
Data        Clust.  Vars.   Model  Clust.  Feat.  ARI         Model  Clust.  Feat.  ARI
Coffee        2      13     CIUU     2       5    1            E       2       1    1
Wine          3      13     CUCC     3       4    0.9309       VEV     3       5    0.85
Crabs         4       5     CUCC     4       4    0.8617       EEV     4       3    0.8195
Banknotes     2       6     CICU     3       6    0.8603       EEI     4       3    0.6739
Diabetes      3       5     UUUC     3       4    0.7020       VEV     3       3    0.6536


Chapter 5

Conclusion

This thesis introduces the idea of dimension reduction for model-based clustering within the multivariate t-distribution framework (tMMDR). This approach is based on existing work for reducing dimensionality in the case of finite Gaussian mixtures (GMMDR).

Work focused on identifying the smallest subspace of the data which captured the inherent cluster structure. This information was gathered by looking at how the group means and group covariances varied, using the eigen-decomposition of the kernel matrix. The elements of the subspace consisted of linear combinations of the original data, which were ordered by importance via the associated eigenvalues. Observations were then projected onto the subspace and the resulting set of variables captured most of the clustering structure available in the data.

The tMMDR approach was illustrated using simulated and real data and its performance compared to that of the GMMDR method. The evaluation was done using the ARI. In the case of synthetic data, tMMDR produced better results generally, and specifically for the scenarios with higher levels of complexity (i.e., more data points and variables). For the real data, the tMMDR algorithm showed higher ARI values than its GMMDR analogue, but the former selected a slightly bigger set of features (usually one more feature than GMMDR).

Overall, using dimension reduction via mixtures of t-distributions shows great potential for model-based clustering.

In terms of future work, the application of tMMDR and GMMDR to mixtures of factor analyzers could be investigated. More generally, the method of dimension reduction could be extended to other model-based paradigms such as classification.


Bibliography

Andrews, J. L. and P. D. McNicholas (2011a). Extending mixtures of multivariate t-factor analyzers. Statistics and Computing 21(3), 361–373.

Andrews, J. L. and P. D. McNicholas (2011b). Mixtures of modified t-factor analyzers for model-based clustering, classification, and discriminant analysis. Journal of Statistical Planning and Inference 141(4), 1479–1486.

Andrews, J. L. and P. D. McNicholas (2012a). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions: The tEIGEN family. Statistics and Computing 22.

Andrews, J. L. and P. D. McNicholas (2012b). tEIGEN for R: Model-based clustering and classification with the multivariate t-distribution. R package version 1.0.

Andrews, J. L., P. D. McNicholas, and S. Subedi (2011). Model-based classification via mixtures of multivariate t-distributions. Computational Statistics and Data Analysis 55(1), 520–529.

Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.

Campbell, N. A. and R. J. Mahon (1974). A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22, 417–425.

Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.

Flury, B. and H. Riedwyl (1988). Multivariate Statistics: A Practical Approach. Cambridge University Press.


Forina, M., C. Armanino, M. Castino, and M. Ubigli (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.

Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41(8), 578–588.

Fraley, C. and A. E. Raftery (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification 16, 297–306.

Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.

Fraley, C. and A. E. Raftery (2006). MCLUST Version 3 for R: Normal Mixture Modeling and Model-based Clustering. Department of Statistics, University of Washington (revised in 2012).

Greselin, F. and S. Ingrassia (2010a). Constrained monotone EM algorithms for mixtures of multivariate t-distributions. Statistics and Computing 20(1), 9–22.

Greselin, F. and S. Ingrassia (2010b). Weakly homoscedastic constraints for mixtures of t-distributions. In A. Fink, B. Lausen, W. Seidel, and A. Ultsch (Eds.), Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Berlin/Heidelberg: Springer.

Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.

Hurley, C. (2010). gclus: Clustering Graphics. R package version 1.3.

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.

Kass, R. E. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90(431), 928–934.

Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A 62(1), 49–66.

Leroux, B. G. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics 20, 1350–1360.


Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association 86, 316–342.

Li, K. C. (2000). High dimensional data analysis via the SIR/PHD approach. Unpublished manuscript.

Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 5. Hayward, California: Institute of Mathematical Statistics.

Loader, C. (2012). locfit: Local Regression, Likelihood and Density Estimation. R package version 1.5-8.

McLachlan, G. J. and K. E. Basford (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker Inc.

McLachlan, G. J., R. W. Bean, and L.-T. Jones (2007). Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. Computational Statistics & Data Analysis 51(11), 5327–5338.

McLachlan, G. J. and D. Peel (1998). Robust cluster analysis via mixtures of multivariate t-distributions. In Lecture Notes in Computer Science, Volume 1451, pp. 658–666. Berlin: Springer-Verlag.

McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. New York: John Wiley & Sons.

McNicholas, P. D. (2011). On model-based clustering, classification, and discriminant analysis. Journal of the Iranian Statistical Society 10(2), 181–199.

McNicholas, P. D., K. R. Jampani, A. F. McDaid, T. B. Murphy, and L. Banks (2011). pgmm: Parsimonious Gaussian Mixture Models. R package version 1.0.

McNicholas, P. D. and T. B. Murphy (2008). Parsimonious Gaussian mixture models. Statistics and Computing 18, 285–296.

McNicholas, P. D., T. B. Murphy, A. F. McDaid, and D. Frost (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics & Data Analysis 54(3), 711–723.


Meng, X. L. and D. B. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267–278.

Peel, D. and G. J. McLachlan (2000). Robust mixture modelling using the t-distribution. Statistics and Computing 10, 339–348.

R Development Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Raftery, A. E. and N. Dean (2006). Variable selection for model-based clustering. Journal of the American Statistical Association 101(473), 168–178.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.

Reaven, G. M. and R. G. Miller (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16, 17–24.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.

Scrucca, L. (2010). Dimension reduction for model-based clustering. Statistics and Computing 20(4), 471–484.

Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In Association Scientifique Internationale du Café, 6th International Colloquium on Coffee Chemistry, Bogotá, Colombia, pp. 61–72.

Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (Fourth ed.). New York: Springer.
