Chapter 3

Methodology

There are several approaches to using mixture models in a clustering context, which were discussed in Chapter 2. McNicholas (2011) presents a review of work in model-based clustering with a particular focus on two families of Gaussian mixture models, namely MCLUST (Fraley and Raftery, 2002) and PGMM (McNicholas and Murphy, 2008).

3.1 Dimension Reduction for Mixtures of Multivariate Gaussian Distributions (GMMDR)

Scrucca (2010) proposed a novel approach to model-based clustering, namely dimension reduction for mixtures of multivariate Gaussian distributions (GMMDR). The main idea is to find a reduced subspace that captures most of the clustering structure in the data. Following the work of Li (1991, 2000) on sliced inverse regression (SIR), information on the dimension-reduction subspace can be obtained from two sources:

• the variation in the group means;
• the variation in the group covariances (depending on the estimated mixture model).

Classical procedures for reducing the dimensionality of data are principal component analysis and factor analysis. These techniques lower dimensionality by forming linear combinations of the variables, but neither method is particularly useful for visualizing any potential clustering structure.

The proposed method reduces dimensionality by identifying a set of linear combinations of the original variables, ordered by importance via their associated eigenvalues, which capture most of the cluster structure in the data. Observations are then projected onto this reduced subspace.
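The construction can be sketched concretely. The Python snippet below is a minimal illustration of the idea rather than Scrucca's reference implementation (the latter is available as MclustDR in the R package mclust): a Gaussian mixture is fitted, a kernel matrix is assembled from the two sources of variation above, and the data are projected onto the leading generalized eigenvectors of that kernel relative to the marginal covariance. The function name gmmdr_directions and the simulated data are illustrative only, and the exact kernel used by GMMDR differs in some details.

# Illustrative sketch only (not Scrucca's reference implementation): fit a
# Gaussian mixture, build a kernel from the variation in component means and
# component covariances, and project onto its leading generalized eigenvectors
# relative to the marginal covariance of the data.
import numpy as np
from numpy.linalg import inv
from scipy.linalg import eigh
from sklearn.mixture import GaussianMixture

def gmmdr_directions(X, n_components=3, d=2, random_state=0):
    """Project X onto d linear combinations carrying most of the cluster structure."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=random_state).fit(X)
    pi, mu, S = gmm.weights_, gmm.means_, gmm.covariances_

    mean_all = pi @ mu                       # overall (weighted) mean
    Sigma = np.cov(X, rowvar=False)          # marginal covariance of the data
    Sigma_inv = inv(Sigma)

    # Source 1: variation in the component means (between-group scatter).
    dev = mu - mean_all
    M_means = (dev * pi[:, None]).T @ dev

    # Source 2: variation in the component covariances about their pooled average.
    S_bar = np.einsum("k,kij->ij", pi, S)
    M_cov = sum(p * (Sk - S_bar) @ Sigma_inv @ (Sk - S_bar) for p, Sk in zip(pi, S))

    # Kernel combining both sources; generalized eigenproblem M v = lambda * Sigma v.
    M = M_means @ Sigma_inv @ M_means + M_cov
    evals, evecs = eigh(M, Sigma)
    order = np.argsort(evals)[::-1]          # directions ordered by eigenvalue
    basis = evecs[:, order[:d]]
    return X @ basis, evals[order]

# Example on simulated data: three well-separated groups in five dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(100, 5)) for m in (0.0, 3.0, 6.0)])
Z, eigenvalues = gmmdr_directions(X, n_components=3, d=2)
print(Z.shape, eigenvalues[:2])

The eigenvalues indicate how much cluster structure each direction carries, so plotting the first two projected coordinates will often reveal the grouping more clearly than the leading principal components do.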
