
2.2. Related work on consensus functions

crisp partitions with different numbers of clusters, although the desired number of clusters k in the consensus clustering solution is a necessary parameter for the execution of their consensus functions. The proposals presented in this work are based on the computation of a probabilistic mapping for solving the cluster correspondence problem, instead of the classic one-to-one cluster mapping, which allows combining partitions with different numbers of clusters while avoiding the addition of dummy clusters as in (Dimitriadou, Weingessel, and Hornik, 2002). In particular, three ways of deriving such a probabilistic mapping, based on the idea of cumulative voting, are presented. The construction of the consensus clustering solution is a two-stage procedure: first, based on the cumulative vote mapping, a tentative consensus is derived as a summary of the cluster ensemble that maximizes the information content in terms of entropy; second, the final consensus clustering solution is extracted by applying an agglomerative clustering algorithm that minimizes the average generalized Jensen-Shannon divergence within each cluster.
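As an illustration of the divergence underlying the second stage, the following sketch computes the weighted generalized Jensen-Shannon divergence of a set of discrete distributions, JSD_π(P_1, ..., P_n) = H(Σ_i π_i P_i) − Σ_i π_i H(P_i). It is a minimal illustrative implementation (not the code of the cited proposals), and uniform weights are assumed by default.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution; 0*log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def generalized_js_divergence(distributions, weights=None):
    """Weighted generalized Jensen-Shannon divergence:
    JSD_pi(P_1..P_n) = H(sum_i pi_i P_i) - sum_i pi_i H(P_i).
    `distributions` is an (n, d) array whose rows are probability vectors."""
    P = np.asarray(distributions, dtype=float)
    n = P.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    mixture = w @ P  # the pi-weighted mixture distribution
    return entropy(mixture) - np.sum(w * np.array([entropy(p) for p in P]))

# Example: three probabilistic cluster assignments for the same object
P = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.8, 0.1]])
print(generalized_js_divergence(P))  # low values indicate agreement among the distributions
```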

As already mentioned, solving the cluster correspondence problem paves the way for the application of voting strategies for combining the outcomes of multiple clustering processes. This issue is the central focus of (Boulis and Ostendorf, 2004), which presented several methods for finding the correspondence between the clusters of the individual partitions in the cluster ensemble. Two of these proposals are based on linear optimization techniques, applied to an objective function that measures the degree of agreement among clusters. In contrast, the third cluster correspondence method is based on Singular Value Decomposition, and it sets cluster correspondences based on cluster correlation. All these methods operate on a common space where the clusters of the distinct partitions (which can be either crisp or fuzzy) are represented by means of cluster co-association matrices.
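The sketch below illustrates the general idea behind such an SVD-based correspondence; it is a hypothetical reconstruction rather than Boulis and Ostendorf's exact formulation, assuming crisp labelings and using an orthogonal (Procrustes-style) alignment derived from the cluster co-association counts.

```python
import numpy as np

def membership_matrix(labels, n_clusters):
    """N x k binary cluster membership matrix for one crisp partition."""
    M = np.zeros((len(labels), n_clusters))
    M[np.arange(len(labels)), labels] = 1.0
    return M

def svd_correspondence(labels_a, ka, labels_b, kb):
    """Soft correspondence between the clusters of two partitions.
    Illustrative only: correlate cluster membership vectors, then use the SVD
    of the resulting matrix to obtain an orthogonal alignment of cluster spaces."""
    A = membership_matrix(labels_a, ka)  # N x ka
    B = membership_matrix(labels_b, kb)  # N x kb
    C = A.T @ B                          # ka x kb cluster co-association counts
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt                        # orthogonal mapping between the two cluster spaces

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([1, 1, 0, 0, 2, 2])  # same structure, clusters relabelled
print(np.round(svd_correspondence(labels_a, 3, labels_b, 3), 2))  # recovers the relabelling
```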

2.2.2 Consensus functions based on graph partitioning

The work by Strehl and Ghosh on consensus clustering based on graph partitioning is probably one of the most classic references in the field of cluster ensembles (Strehl and Ghosh, 2002). To our knowledge, they were the first to formulate the consensus clustering problem in an information-theoretic framework (i.e. the consensus clustering solution should be the one maximizing the mutual information with respect to all the individual partitions in the cluster ensemble), a path followed by other authors in subsequent works (Fred and Jain, 2003). In view of its prohibitive cost when formulated as a combinatorial optimization problem in terms of shared mutual information, the authors propose three clustering combination heuristics based on deriving a hypergraph representation of the cluster ensemble, all of which require the desired number of clusters k as one of their parameters. The first consensus function (called Cluster-based Similarity Partitioning Algorithm, or CSPA) induces a pairwise object similarity measure from the cluster ensemble (as in (Fred and Jain, 2002a)), obtaining the consensus partition by reclustering the objects with the METIS graph partitioning algorithm (Karypis and Kumar, 1998). For this reason, we have included the CSPA consensus function in both the graph partitioning and object co-association categories in table 2.1. In the second clustering combiner proposed in (Strehl and Ghosh, 2002), named HGPA for HyperGraph Partitioning Algorithm, the cluster ensemble problem is posed as the partitioning of a hypergraph, whose hyperedges represent clusters, into k unconnected components of approximately the same size while cutting a minimum number of hyperedges. The third consensus function (Meta-CLustering Algorithm, or MCLA) views the clustering integration process as a cluster correspondence
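To make the CSPA similarity induction concrete, the sketch below builds the pairwise co-association similarity from an ensemble of crisp labelings and reclusters the objects on the induced similarity graph. It is a simplified illustration in which scikit-learn's spectral clustering stands in for the METIS graph partitioner used in the original algorithm.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def co_association(ensemble):
    """Pairwise similarity: fraction of partitions placing objects i and j
    in the same cluster (the similarity induced by CSPA-style evidence accumulation)."""
    ensemble = np.asarray(ensemble)  # shape: (n_partitions, n_objects)
    r, n = ensemble.shape
    S = np.zeros((n, n))
    for labels in ensemble:
        S += (labels[:, None] == labels[None, :]).astype(float)
    return S / r

def cspa_like_consensus(ensemble, k):
    """Recluster objects from the induced similarity graph.
    Illustrative stand-in: spectral clustering instead of the METIS partitioner."""
    S = co_association(ensemble)
    model = SpectralClustering(n_clusters=k, affinity='precomputed', random_state=0)
    return model.fit_predict(S)

ensemble = [[0, 0, 1, 1, 2, 2],
            [1, 1, 0, 0, 2, 2],
            [0, 0, 1, 1, 1, 2]]
print(cspa_like_consensus(ensemble, k=3))
```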

