
Variational Inference for Nonparametric Multiple Clustering
clustering. In Qi et al. [20], they find an alternative projection of the original data that minimizes the Kullback-Leibler divergence between the distribution of the original space and that of the projection, subject to the constraint that the sum-squared error between samples in the projected space and the means of the clusters they belong to is smaller than a pre-specified threshold. Their method approximates the clusters by mixtures of Gaussians with components sharing the same covariance matrix. These three methods [10, 1, 20] only address finding one alternative clustering. However, for complex data there may be more than one alternative clustering interpretation. In Cui et al. [7], their method finds multiple alternative views by clustering in the subspace orthogonal to the clustering solutions found in previous iterations. This approach discovers several alternative clustering solutions by iteratively finding one alternative solution given the previously found clustering solutions. All these methods find alternative clustering solutions sequentially (or iteratively). Another general way of discovering multiple solutions is to find them simultaneously. Meta clustering [4] generates a diverse set of clustering solutions by either random initialization or random feature weighting. These solutions are then meta-clustered using agglomerative clustering, with a Rand index measuring the similarity between pairs of clustering solutions. Jain et al. [13] also learn the non-redundant views together. Their method learns two disparate clusterings by minimizing two k-means-type sum-squared error objectives for the two clustering solutions while at the same time minimizing the correlation between these two clusterings. Like [7], [4] and [13], the approach we propose discovers multiple clustering solutions. Furthermore, like [4] and [13], our approach finds these solutions simultaneously. However, unlike all these methods, we provide a probabilistic generative nonparametric model which can learn the features and the clustering solution in each view simultaneously.

Recently, several nonparametric Bayesian models have been introduced for unsupervised learning [19, 3, 11, 22]. The Chinese Restaurant Process (CRP) [19] and the stick-breaking Dirichlet Process model [3] only assume one underlying partitioning of the data samples. The Indian Buffet Process (IBP) assumes that each sample is generated from an underlying latent set of features sampled from an infinite menu of features. There are also nonparametric Bayesian models for co-clustering [22]. None of these models multiple clustering solutions. There is, however, concurrent and independently developed work that provides a nonparametric Bayesian model for finding multiple partitionings, called cross-categorization [17]. Their model utilizes the CRP construction and Gibbs sampling for inference. Here, we propose an approach that utilizes a multiple-clustering stick-breaking construction and provide a variational inference approach to learn the model parameters and latent variables. Unlike Markov chain Monte Carlo samplers [18], including Gibbs sampling, variational methods provide a fast deterministic approximation to marginal probabilities.

3. NONPARAMETRIC MULTIPLE CLUSTERING MODEL

Given data X ∈ R^{n×d}, where n is the number of samples and d is the number of features, we write X = [x_1, x_2, . . . , x_n]^T = [f_1, f_2, . . . , f_d], where the rows x_i ∈ R^d are the samples, the columns f_j ∈ R^n are the features, and (·)^T denotes the transpose of a matrix.
Our goal is to discover multiple clustering solutions and the feature views in which they reside. We formulate this problem as finding a partitioning of the data into a tile structure as shown in Figure 1. We want to find a partitioning of the features into different views together with a partitioning of the samples in each view.

Figure 1: Tile structure partitioning of the data for multiple clustering in different feature views.

In the tile figure, the columns are permuted such that features that belong together in the same view are next to each other, and the borders show the partitions for the different views. In Figure 1, the samples in each view are clustered into groups, where members of the same group are indicated by the same color. Note that the samples in different views are independently partitioned given the view; moreover, samples belonging to the same cluster in one view can belong to different clusters in other views. Here, we assume that the features in each view are disjoint. In future work, we will explore models that allow sharing of features between views.

In this paper, we design a nonparametric prior model that can generate such a latent tile structure for data X. Let Y be a matrix of latent variables representing the partitioning of features into different views, where each element y_{j,v} = 1 if feature f_j belongs to view v and y_{j,v} = 0 otherwise. And let Z be a matrix of latent variables representing the partitioning of samples for all views, where each element z_{v,i,k} = 1 if sample x_i belongs to cluster k in view v and z_{v,i,k} = 0 otherwise. We model both latent variables Y and Z with Dirichlet processes and utilize the stick-breaking construction [21] to generate these variables as follows (a small simulation sketch is given after the list):

1. Generate an infinite sequence w_v from a beta distribution: w_v ∼ Beta(1, α), v = 1, 2, . . ., where α is the concentration parameter of the beta distribution.

2. Generate the mixture weights π_v for each feature partition v from a stick-breaking construction: π_v = w_v ∏_{j=1}^{v-1} (1 − w_j).

3. Generate the view indicator Y for each feature f_j from a multinomial distribution with weights π_v: p(y_{j,v} = 1 | π) = π_v, denoted y_j ∼ Mult(π) = ∏_{v=1}^{q} π_v^{y_{j,v}}, where q is the number of views, which can be infinite.

4. For each view v = 1, 2, . . ., generate an infinite sequence u_{v,k} from a beta distribution: u_{v,k} ∼ Beta(1, β), k = 1, 2, . . .. Here, β is the concentration parameter of the beta distribution.

5. Generate the mixture weights η_{v,k} for each cluster partition k in each view v from a stick-breaking construction: η_{v,k} = u_{v,k} ∏_{l=1}^{k-1} (1 − u_{v,l}).

6. Generate the cluster indicator Z for each view v from a multinomial distribution with weights η_{v,k}: p(z_{v,i,k} = 1 | η) = η_{v,k}, or z_{v,i} ∼ Mult(η) = ∏_{k=1}^{k_v} η_{v,k}^{z_{v,i,k}}, where k_v is the number of clusters in view v, which can be infinite.
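To make the generative process above concrete, the following is a minimal simulation sketch in Python/NumPy. The truncation levels V and K and the unit-variance Gaussian observation model with one mean per (view, cluster) pair are illustrative assumptions made only for this sketch; the model itself places no finite bound on the number of views or clusters and works with any appropriate likelihood p(x | θ).

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 8            # samples, features
V, K = 10, 10            # truncation levels for views and clusters (assumption)
alpha, beta = 1.0, 1.0   # concentration parameters

# Steps 1-2: stick-breaking weights over views.
w = rng.beta(1.0, alpha, size=V)
pi = w * np.concatenate(([1.0], np.cumprod(1.0 - w)[:-1]))

# Step 3: assign each feature to a view (weights renormalized because of the truncation).
y = rng.choice(V, size=d, p=pi / pi.sum())

# Steps 4-5: per-view stick-breaking weights over clusters.
u = rng.beta(1.0, beta, size=(V, K))
eta = u * np.concatenate((np.ones((V, 1)), np.cumprod(1.0 - u, axis=1)[:, :-1]), axis=1)

# Step 6: assign each sample to a cluster independently in every view.
z = np.stack([rng.choice(K, size=n, p=eta[v] / eta[v].sum()) for v in range(V)])

# Observations: theta ~ p(theta | lambda), here an illustrative Gaussian prior on cluster means.
theta = rng.normal(0.0, 3.0, size=(V, K))
X = np.empty((n, d))
for j in range(d):
    v = y[j]                                   # view that feature j belongs to
    X[:, j] = rng.normal(theta[v, z[v]], 1.0)  # x_{i,j} drawn from sample i's cluster in view v

Each row of the resulting X is clustered differently in different views, which is exactly the tile structure of Figure 1.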

We are now ready to generate our observation variables X given the latent variables Y and Z. For each cluster in each view, we draw cluster parameters θ from an appropriate prior distribution with hyperparameter λ: θ ∼ p(θ | λ). Then, in each view v, we draw the value of the features in view v for sample i: x_{v,i} ∼ p(x_{v,i} | θ_{v,z_{v,i}}), where z_{v,i} is equal to the cluster k that sample i belongs to in view v, and x_{v,i} = (x_{i,j} : y_{j,v} = 1) is the vector of features in view v. Figure 2 shows a graphical model of our nonparametric multiple clustering model.

Figure 2: Graphical model for our nonparametric multiple clustering model.

Our joint model p(X, Y, Z, W, U, θ) is:

p(X | Y, Z, θ) p(Y | W) p(W | α) p(Z | U) p(U | β) p(θ | λ)
  = [∏_{v=1}^{q} ∏_{i=1}^{n} p(x_{v,i} | θ_{v,z_{v,i}}) p(z_{v,i} | η_v)] [∏_{v=1}^{q} ∏_{k=1}^{k_v} p(θ_{v,k} | λ) p(u_{v,k} | β)] [∏_{j=1}^{d} p(y_j | π)] [∏_{v=1}^{q} p(w_v | α)]   (1)

where q is the number of views, k_v is the number of clusters in view v, n is the number of data points, and d is the number of features. Here, in our nonparametric model, q and k_v can be infinite. Note that previous models for multiple clustering explicitly enforce non-redundancy, orthogonality, or disparity among clustering solutions [10, 1, 7, 20, 13]; whereas our model handles redundancy implicitly, since redundant clusterings offer no probabilistic modelling advantage and are penalized under the prior, which assumes that each view is clustered independently.

3.1 Variational Inference

It is computationally intractable to evaluate the marginal likelihood, p(X) = ∫ p(X, φ) dφ, where φ represents the set of all parameters θ and latent variables Y, Z, U, and W. Variational methods allow us to approximate the marginal likelihood by maximizing a lower bound, L(Q), on the true log marginal likelihood [15]:

ln p(X) = ln ∫ p(X, φ) dφ
        = ln ∫ Q(φ) [p(X, φ) / Q(φ)] dφ
        ≥ ∫ Q(φ) ln [p(X, φ) / Q(φ)] dφ = L(Q(φ)),

using Jensen's inequality [5]. The difference between the log marginal ln p(X) and the lower bound L(Q) is the Kullback-Leibler divergence between the approximating distribution Q(φ) and the true posterior p(φ | X). The idea is to choose a distribution Q(φ) that is simple enough that the lower bound can be tractably evaluated, yet flexible enough to give a tight bound. Here, we assume a distribution Q(φ) that factorizes over all the parameters, Q(φ) = ∏_i Q_i(φ_i). For our model, this is

Q(Y, Z, W, U, θ) = Q(Y) Q(Z) Q(W) Q(U) Q(θ)   (2)

The Q_i(·) that minimizes the KL divergence over all factorial distributions is

Q_i(φ_i) = exp⟨ln p(X, φ)⟩_{k≠i} / ∫ exp⟨ln p(X, φ)⟩_{k≠i} dφ_i   (3)

where ⟨·⟩_{k≠i} denotes averaging with respect to ∏_{k≠i} Q_k(φ_k). Applying Equation 3, we obtain our factorial distributions Q_i(φ_i) as:

Q(W) = ∏_{v=1}^{q} Beta(γ_{v,1}, γ_{v,2})   (4)
Q(Y) = ∏_{j=1}^{d} Mult(π_j)   (5)
Q(U) = ∏_{v=1}^{q} ∏_{k=1}^{k_v} Beta(γ_{v,k,1}, γ_{v,k,2})   (6)
Q(Z) = ∏_{v=1}^{q} ∏_{i=1}^{n} Mult(η_{v,i})   (7)

where γ_{v,·} are the beta parameters for the distributions on W, π_j are the multinomial parameters for the distributions on Y, γ_{v,k,·} are the beta parameters for the distributions on U, and η_{v,i} are the multinomial parameters for the distributions on Z.
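In an implementation, the factors in Equations 4-7 can be represented by a few arrays of variational parameters. The sketch below is a minimal illustration, not the authors' code: the truncation to V views and K clusters per view, the random initialization, and the function name init_variational_params are all assumptions made for the sketch.

import numpy as np

def init_variational_params(n, d, V=10, K=10, seed=0):
    # Variational parameters for the factors Q(W), Q(U), Q(Y), Q(Z) in Eqs. 4-7,
    # with V and K as truncation levels (assumption) on the number of views and clusters.
    rng = np.random.default_rng(seed)
    gamma_v = np.ones((V, 2))                          # Beta params (gamma_{v,1}, gamma_{v,2}) for Q(W)
    gamma_vk = np.ones((V, K, 2))                      # Beta params (gamma_{v,k,1}, gamma_{v,k,2}) for Q(U)
    pi_jv = rng.dirichlet(np.ones(V), size=d)          # (d, V) multinomial params pi_{j,v} for Q(Y)
    eta_vik = rng.dirichlet(np.ones(K), size=(V, n))   # (V, n, K) multinomial params eta_{v,i,k} for Q(Z)
    return gamma_v, gamma_vk, pi_jv, eta_vik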
Note that these variational parameters for the approximate posterior Q are not the same as the model parameters in Equation 1, although we have used similar notation. The update equations we obtain using variational inference are provided below:

γ_{v,1} = 1 + ∑_{j=1}^{d} π_{j,v}   (8)
γ_{v,2} = α + ∑_{j=1}^{d} ∑_{l=v+1}^{q} π_{j,l}   (9)
γ_{v,k,1} = 1 + ∑_{i=1}^{n} η_{v,i,k}   (10)
γ_{v,k,2} = β + ∑_{i=1}^{n} ∑_{l=k+1}^{k_v} η_{v,i,l}   (11)
π_{j,v} ∝ [∏_{i=1}^{n} p(x_{v,i} | θ_{v,z_{v,i}})] Mult(π_{j,v} | γ_v)   (12)
η_{v,i,k} ∝ p(x_{v,i} | θ_{v,z_{v,i}}) Mult(η_{v,i,k} | γ_{v,k})   (13)

Note that all parameters on the right-hand side of the equations above are based on the parameter estimates from the previous step, and those on the left-hand side are the parameter updates at the current step. The variational parameter η_{v,i,k} can be interpreted as the posterior probability that view v of data point i is assigned to cluster k. The parameter π_{j,v} can be interpreted as the posterior probability that feature j belongs to view v. We iterate these update steps until convergence.
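As a rough illustration of how these coupled updates can be organized in code, the sketch below carries out one sweep of Equations 8-13 using the array layout from the initialization sketch above. It makes several simplifying assumptions that are not part of the model description here: a finite truncation to V views and K clusters, a unit-variance Gaussian likelihood with plug-in cluster means theta in place of a full Q(θ) factor, and the standard digamma-based stick-breaking expectations for the Beta factors.

import numpy as np
from scipy.special import digamma

def stick_expectations(g1, g2):
    # E[log of the k-th stick weight] under independent Beta(g1_k, g2_k) factors:
    # E[log w_k] + sum_{l<k} E[log(1 - w_l)], taken along the last axis.
    elog_w = digamma(g1) - digamma(g1 + g2)
    elog_1mw = digamma(g2) - digamma(g1 + g2)
    shifted = np.concatenate((np.zeros_like(g1[..., :1]),
                              np.cumsum(elog_1mw, axis=-1)[..., :-1]), axis=-1)
    return elog_w + shifted

def update_round(X, pi_jv, eta_vik, theta, alpha, beta):
    # One sweep of Eqs. 8-13 under the simplifying assumptions stated above.
    n, d = X.shape
    V, _, K = eta_vik.shape

    # Eqs. 8-9: Beta parameters for the view sticks W.
    s = pi_jv.sum(axis=0)                                        # (V,) expected number of features per view
    gamma_v = np.stack([1.0 + s,
                        alpha + (np.cumsum(s[::-1])[::-1] - s)], axis=-1)

    # Eqs. 10-11: Beta parameters for the cluster sticks U in each view.
    c = eta_vik.sum(axis=1)                                      # (V, K) expected number of samples per cluster
    gamma_vk = np.stack([1.0 + c,
                         beta + (np.cumsum(c[:, ::-1], axis=1)[:, ::-1] - c)], axis=-1)

    elog_pi = stick_expectations(gamma_v[:, 0], gamma_v[:, 1])          # (V,)
    elog_eta = stick_expectations(gamma_vk[..., 0], gamma_vk[..., 1])   # (V, K)

    # Per-feature Gaussian log-likelihoods for every (view, cluster) pair (assumed observation model).
    ll = -0.5 * (X[None, None, :, :] - theta[:, :, None, None]) ** 2    # (V, K, n, d)

    # Eq. 13: cluster responsibilities per view, weighting features by their soft view membership.
    log_eta = np.einsum('vknd,dv->vnk', ll, pi_jv) + elog_eta[:, None, :]
    eta_new = np.exp(log_eta - log_eta.max(axis=2, keepdims=True))
    eta_new /= eta_new.sum(axis=2, keepdims=True)

    # Eq. 12: view responsibilities per feature, weighting samples by their soft cluster membership.
    log_pi = np.einsum('vknd,vnk->dv', ll, eta_new) + elog_pi[None, :]
    pi_new = np.exp(log_pi - log_pi.max(axis=1, keepdims=True))
    pi_new /= pi_new.sum(axis=1, keepdims=True)

    return pi_new, eta_new, gamma_v, gamma_vk

Repeating update_round until the changes in pi_jv and eta_vik fall below a tolerance gives the iterate-until-convergence loop described above; in the full model one would also update the Q(θ) factor rather than holding theta fixed.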
