
Variational Inference for Nonparametric Multiple Clustering

clustering. In Qi et al. [20], they find an alternative projection of the original data that minimizes the Kullback-Leibler divergence between the distribution of the original space and that of the projection, subject to the constraint that the sum-squared error between samples in the projected space and the means of the clusters they belong to is smaller than a pre-specified threshold. Their method approximates the clusters with mixtures of Gaussians whose components share the same covariance matrix. These three methods [10, 1, 20] only address finding one alternative clustering. However, for complex data there may be more than one alternative clustering interpretation. In Cui et al. [7], their method finds multiple alternative views by clustering in the subspace orthogonal to the clustering solutions found in previous iterations. This approach discovers several alternative clustering solutions by iteratively finding one alternative solution given the previously found solutions. All these methods find alternative clustering solutions sequentially (or iteratively). Another general way of discovering multiple solutions is to find them simultaneously. Meta clustering [4] generates a diverse set of clustering solutions by either random initialization or random feature weighting. These solutions are then meta-clustered using agglomerative clustering, with a Rand index measuring the similarity between pairs of clustering solutions. Jain et al. [13] also learn non-redundant views together. Their method learns two disparate clusterings by minimizing two k-means-type sum-squared-error objectives for the two clustering solutions while at the same time minimizing the correlation between them. Like [7], [4] and [13], the approach we propose discovers multiple clustering solutions. Furthermore, like [4] and [13], our approach finds these solutions simultaneously.
However, unlike all these methods, we provide a probabilistic generative nonparametric model which can learn the features and the clustering solutions in each view simultaneously.

Recently, several nonparametric Bayesian models have been introduced for unsupervised learning [19, 3, 11, 22]. The Chinese Restaurant Process (CRP) [19] and the stick-breaking Dirichlet Process model [3] assume only one underlying partitioning of the data samples. The Indian Buffet Process (IBP) assumes that each sample is generated from an underlying latent set of features sampled from an infinite menu of features. There are also nonparametric Bayesian models for co-clustering [22]. None of these model multiple clustering solutions. There is, however, independently developed concurrent work that provides a nonparametric Bayesian model for finding multiple partitionings, called cross-categorization [17]. Their model utilizes the CRP construction and Gibbs sampling for inference. Here, we propose an approach that utilizes a multiple clustering stick-breaking construction and provide a variational inference approach to learn the model parameters and latent variables. Unlike Markov chain Monte Carlo samplers [18], including Gibbs sampling, variational methods provide a fast deterministic approximation to marginal probabilities.

3. NONPARAMETRIC MULTIPLE CLUSTERING MODEL

We are given data $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of samples and $d$ the number of features. $X = [x_1, x_2, \ldots, x_n]^T = [f_1, f_2, \ldots, f_d]$, where $x_i \in \mathbb{R}^d$ are the samples, the columns $f_j \in \mathbb{R}^n$ are the features, and $(\cdot)^T$ denotes the transpose of a matrix. Our goal is to discover multiple clustering solutions and the feature views in which they reside. We formulate this problem as finding a partitioning of the data into a tile structure, as shown in Figure 1: a partitioning of the features into different views, together with a partitioning of the samples within each view.
In the tile figure, the columns are permuted such that features that belong together in the same view are adjacent, and the borders show the partitions for the different views. In Figure 1, the samples in each view are clustered into groups, where members of the same group are indicated by the same color.

[Figure 1: Tile structure partitioning of the data for multiple clustering in different feature views.]

Note that the samples in different views are independently partitioned given the view; moreover, samples belonging to the same cluster in one view can belong to different clusters in other views. Here, we assume that the features in each view are disjoint. In future work, we will explore models that allow sharing of features between views.

In this paper, we design a nonparametric prior model that can generate such a latent tile structure for data $X$. Let $Y$ be a matrix of latent variables representing the partitioning of features into different views, where each element $y_{j,v} = 1$ if feature $f_j$ belongs to view $v$ and $y_{j,v} = 0$ otherwise. And, let $Z$ be a matrix of latent variables representing the partitioning of samples for all views, where each element $z_{v,i,k} = 1$ if sample $x_i$ belongs to cluster $k$ in view $v$ and $z_{v,i,k} = 0$ otherwise. We model both latent variables $Y$ and $Z$ with Dirichlet processes and utilize the stick-breaking construction [21] to generate these variables as follows.

1. Generate an infinite sequence $w_v$ from a beta distribution: $w_v \sim \mathrm{Beta}(1, \alpha)$, $v = 1, 2, \ldots$, where $\alpha$ is the concentration parameter of the beta distribution.
2. Generate the mixture weights $\pi_v$ for each feature partition $v$ from a stick-breaking construction: $\pi_v = w_v \prod_{j=1}^{v-1} (1 - w_j)$.
3. Generate the view indicator $Y$ for each feature $f_j$ from a multinomial distribution with weights $\pi_v$: $p(y_{j,v} = 1 \mid \pi) = \pi_v$, denoted $y_j \sim \mathrm{Mult}(\pi) = \prod_{v=1}^{q} \pi_v^{y_{j,v}}$, where $q$ is the number of views, which can be infinite.
4. For each view $v = 1, 2, \ldots$, generate an infinite sequence $u_{v,k}$ from a beta distribution: $u_{v,k} \sim \mathrm{Beta}(1, \beta)$, $k = 1, 2, \ldots$. Here, $\beta$ is the concentration parameter of the beta distribution.
5. Generate the mixture weights $\eta_{v,k}$ for each cluster partition $k$ in each view $v$ from a stick-breaking construction: $\eta_{v,k} = u_{v,k} \prod_{l=1}^{k-1} (1 - u_{v,l})$.
6. Generate the cluster indicator $Z$ for each view $v$ from a multinomial distribution with weights $\eta_{v,k}$: $p(z_{v,i,k} = 1 \mid \eta) = \eta_{v,k}$, i.e., $z_{v,i} \sim \mathrm{Mult}(\eta) = \prod_{k=1}^{k_v} \eta_{v,k}^{z_{v,i,k}}$, where $k_v$ is the number of clusters in view $v$, which can be infinite.
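The generative steps above can be sketched with a truncated stick-breaking construction. This is a minimal illustration, not the paper's implementation; the truncation levels `Q_MAX` and `K_MAX`, the sizes, and the random seed are hypothetical choices made only so the snippet runs.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights: pi_v = w_v * prod_{j<v}(1 - w_j)."""
    w = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - w)[:-1]))
    weights = w * remaining
    weights[-1] = 1.0 - weights[:-1].sum()  # fold leftover stick mass into the last weight
    return weights

# Hypothetical sizes: d features, n samples; truncate at Q_MAX views, K_MAX clusters per view.
d, n, Q_MAX, K_MAX = 8, 20, 5, 4
alpha, beta = 1.0, 1.0

pi = stick_breaking(alpha, Q_MAX, rng)   # steps 1-2: view weights pi_v
y = rng.choice(Q_MAX, size=d, p=pi)      # step 3: view indicator for each feature f_j

z = np.empty((Q_MAX, n), dtype=int)
for v in range(Q_MAX):                   # steps 4-6, independently per view
    eta_v = stick_breaking(beta, K_MAX, rng)  # cluster weights eta_{v,k} in view v
    z[v] = rng.choice(K_MAX, size=n, p=eta_v) # cluster assignment of each sample in view v
```

Truncation turns the infinite sequences into finite ones while keeping the weights a valid probability distribution, which is the standard device for working with stick-breaking priors in practice.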

We are now ready to generate our observation variables $X$ given the latent variables $Y$ and $Z$. For each cluster in each view, we draw a cluster parameter $\theta$ from an appropriate prior distribution with hyperparameter $\lambda$: $\theta \sim p(\theta \mid \lambda)$. Then, in each view $v$, we draw the values of the features in view $v$ for sample $i$: $x_{v,i} \sim p(x_{v,i} \mid \theta_{v, z_{v,i}})$, where $z_{v,i}$ is equal to the cluster $k$ that sample $i$ belongs to in view $v$, and $x_{v,i} = (x_{i,j} : y_{j,v} = 1)$ is the vector of features in view $v$. Figure 2 shows a graphical model of our nonparametric multiple clustering model.

[Figure 2: Graphical model for our nonparametric multiple clustering model.]

Our joint model $p(X, Y, Z, W, U, \theta)$ is:

$$p(X \mid Y, Z, \theta)\, p(Y \mid W)\, p(W \mid \alpha)\, p(Z \mid U)\, p(U \mid \beta)\, p(\theta \mid \lambda) \qquad (1)$$
$$= \left[ \prod_{v=1}^{q} \prod_{i=1}^{n} p(x_{v,i} \mid \theta_{v, z_{v,i}})\, p(z_{v,i} \mid \eta_v) \right] \left[ \prod_{v=1}^{q} \prod_{k=1}^{k_v} p(\theta_{v,k} \mid \lambda)\, p(u_{v,k} \mid \beta) \right] \left[ \prod_{j=1}^{d} p(y_j \mid \pi) \right] \left[ \prod_{v=1}^{q} p(w_v \mid \alpha) \right]$$

where $q$ is the number of views, $k_v$ is the number of clusters in view $v$, $n$ is the number of data points, and $d$ is the number of features. In our nonparametric model, $q$ and $k_v$ can be infinite. Note that previous models for multiple clustering explicitly enforce non-redundancy, orthogonality, or disparity among clustering solutions [10, 1, 7, 20, 13]; our model handles redundancy implicitly, since redundant clusterings offer no probabilistic modelling advantage and are penalized under the prior, which assumes that each view is clustered independently.

3.1 Variational Inference

It is computationally intractable to evaluate the marginal likelihood $p(X) = \int p(X, \phi)\, d\phi$, where $\phi$ represents the set of all parameters $\theta$ and latent variables $Y$, $Z$, $U$, and $W$. Variational methods allow us to approximate the marginal likelihood by maximizing a lower bound $\mathcal{L}(Q)$ on the true log marginal likelihood [15]:

$$\ln p(X) = \ln \int p(X, \phi)\, d\phi = \ln \int Q(\phi)\, \frac{p(X, \phi)}{Q(\phi)}\, d\phi \;\ge\; \int Q(\phi) \ln \frac{p(X, \phi)}{Q(\phi)}\, d\phi = \mathcal{L}(Q(\phi)),$$

using Jensen's inequality [5].
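The Jensen lower bound can be verified directly on a toy model with a single discrete latent variable, where $\ln p(X)$ is tractable by summation. The numbers below are hypothetical and chosen only to illustrate the bound; they are not from the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint p(X, phi) with one discrete latent phi in {0, 1, 2}.
p_phi = np.array([0.5, 0.3, 0.2])          # prior p(phi) (hypothetical)
p_x_given_phi = np.array([0.9, 0.4, 0.1])  # likelihood p(X | phi) for one observation

p_joint = p_phi * p_x_given_phi            # p(X, phi)
log_evidence = np.log(p_joint.sum())       # ln p(X), tractable for this toy model

def elbo(q):
    """L(Q) = sum_phi Q(phi) * ln[p(X, phi) / Q(phi)]."""
    return np.sum(q * (np.log(p_joint) - np.log(q)))

# Any distribution Q gives a lower bound; the true posterior makes it tight.
q_random = rng.dirichlet(np.ones(3))
q_posterior = p_joint / p_joint.sum()

assert elbo(q_random) <= log_evidence + 1e-12   # Jensen's inequality
assert np.isclose(elbo(q_posterior), log_evidence)  # bound is tight at the posterior
```

The gap between `log_evidence` and `elbo(q_random)` is exactly the KL divergence between `q_random` and the true posterior, which is the quantity variational inference drives toward zero.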
The difference between the log marginal $\ln p(X)$ and the lower bound $\mathcal{L}(Q)$ is the Kullback-Leibler divergence between the approximating distribution $Q(\phi)$ and the true posterior $p(\phi \mid X)$. The idea is to choose a $Q(\phi)$ that is simple enough for the lower bound to be tractably evaluated, yet flexible enough to give a tight bound. Here, we assume a distribution $Q(\phi)$ that factorizes over all the parameters, $Q(\phi) = \prod_i Q_i(\phi_i)$. For our model, this is

$$Q(Y, Z, W, U, \theta) = Q(Y)\, Q(Z)\, Q(W)\, Q(U)\, Q(\theta) \qquad (2)$$

The $Q_i(\cdot)$ that minimizes the KL divergence over all factorial distributions is

$$Q_i(\phi_i) = \frac{\exp \langle \ln P(X, \phi) \rangle_{k \neq i}}{\int \exp \langle \ln P(X, \phi) \rangle_{k \neq i}\, d\phi_i} \qquad (3)$$

where $\langle \cdot \rangle_{k \neq i}$ denotes averaging with respect to $\prod_{k \neq i} Q_k(\phi_k)$. Applying Equation 3, we obtain our factorial distributions $Q_i(\phi_i)$ as:

$$Q(W) = \prod_{v=1}^{q} \mathrm{Beta}(\gamma_{v,1}, \gamma_{v,2}) \qquad (4)$$
$$Q(Y) = \prod_{j=1}^{d} \mathrm{Mult}(\pi_j) \qquad (5)$$
$$Q(U) = \prod_{v=1}^{q} \prod_{k=1}^{k_v} \mathrm{Beta}(\gamma_{v,k,1}, \gamma_{v,k,2}) \qquad (6)$$
$$Q(Z) = \prod_{v=1}^{q} \prod_{i=1}^{n} \mathrm{Mult}(\eta_{v,i}) \qquad (7)$$

where $\gamma_{v,\cdot}$ are the beta parameters for the distributions on $W$, $\pi_j$ are the multinomial parameters for the distributions on $Y$, $\gamma_{v,k,\cdot}$ are the beta parameters for the distributions on $U$, and $\eta_{v,i}$ are the multinomial parameters for the distributions on $Z$. Note that these variational parameters for the approximate posterior $Q$ are not the same as the model parameters in Equation 1, although we have used similar notation.

The update equations we obtain using variational inference are provided below:

$$\gamma_{v,1} = 1 + \sum_{j=1}^{d} \pi_{j,v} \qquad (8)$$
$$\gamma_{v,2} = \alpha + \sum_{j=1}^{d} \sum_{l=v+1}^{q} \pi_{j,l} \qquad (9)$$
$$\gamma_{v,k,1} = 1 + \sum_{i=1}^{n} \eta_{v,i,k} \qquad (10)$$
$$\gamma_{v,k,2} = \beta + \sum_{i=1}^{n} \sum_{l=k+1}^{k_v} \eta_{v,i,l} \qquad (11)$$
$$\pi_{j,v} \propto \left[ \prod_{i=1}^{n} p(x_{v,i} \mid \theta_{v, z_{v,i}}) \right] \mathrm{Mult}(\pi_{j,v} \mid \gamma_v) \qquad (12)$$
$$\eta_{v,i,k} \propto p(x_{v,i} \mid \theta_{v, z_{v,i}})\, \mathrm{Mult}(\eta_{v,i,k} \mid \gamma_{v,k}) \qquad (13)$$

Note that all parameters on the right-hand side of the equations above are based on the parameter estimates at the previous time step, and those on the left-hand side are the parameter updates at the current time step.
The variational parameter $\eta_{v,i,k}$ can be interpreted as the posterior probability that view $v$ of data point $i$ is assigned to cluster $k$. The parameter $\pi_{j,v}$ can be interpreted as the posterior probability that feature $j$ belongs to view $v$. We iterate these update steps until convergence.
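The closed-form updates in Equations 8-11 can be written as a few vectorized operations. The sketch below is a minimal illustration under the assumption that the current responsibilities $\pi_{j,v}$ and $\eta_{v,i,k}$ are available as arrays; the likelihood-dependent updates (12)-(13) are model-specific and omitted, and all sizes and seeds here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical current responsibilities: pi[j, v] over q views for d features,
# eta[v, i, k] over k_v clusters for n samples in each view; rows sum to 1.
d, n, q, k_v = 6, 10, 3, 4
alpha, beta = 1.0, 1.0
pi = rng.dirichlet(np.ones(q), size=d)            # pi[j, v]
eta = rng.dirichlet(np.ones(k_v), size=(q, n))    # eta[v, i, k]

# Eq. (8): gamma_{v,1} = 1 + sum_j pi_{j,v}
gamma_v1 = 1.0 + pi.sum(axis=0)

# Eq. (9): gamma_{v,2} = alpha + sum_j sum_{l>v} pi_{j,l}
tail_pi = np.cumsum(pi[:, ::-1], axis=1)[:, ::-1]  # sum over l >= v
gamma_v2 = alpha + (tail_pi - pi).sum(axis=0)      # drop the l = v term

# Eq. (10): gamma_{v,k,1} = 1 + sum_i eta_{v,i,k}
gamma_vk1 = 1.0 + eta.sum(axis=1)

# Eq. (11): gamma_{v,k,2} = beta + sum_i sum_{l>k} eta_{v,i,l}
tail_eta = np.cumsum(eta[:, :, ::-1], axis=2)[:, :, ::-1]
gamma_vk2 = beta + (tail_eta - eta).sum(axis=1)
```

A full coordinate-ascent loop would alternate these updates with the responsibility updates (12)-(13) until the lower bound stops improving, which is the convergence criterion described above.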
