29.04.2013 Views

TESI DOCTORAL - La Salle


Chapter 2. Cluster ensembles and consensus clustering

problem that is solved by identifying and consolidating groups of clusters (meta-clusters), which is done by applying graph-based clustering to the hyperedges in the hypergraph representation of the cluster ensemble. In (Strehl and Ghosh, 2002), the authors apply the proposed consensus functions to hard cluster ensembles, suggesting that they could be extended to a fuzzy clustering integration scenario. Such extensions (in particular, sCSPA and sMCLA, the soft versions of the CSPA and MCLA consensus functions, respectively) were introduced in (Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).

The clustering combination problem was also formulated as a graph partitioning problem in (Fern and Brodley, 2004). In particular, a bipartite graph is built from a hard cluster ensemble, although the authors suggest that the same method can be applied to combine soft clustering solutions after introducing minor modifications to the proposed consensus function, which is called HBGF (Hybrid Bipartite Graph Formulation). As in previous graph partitioning approaches to clustering combination, the desired number of clusters must be set a priori and passed as a parameter of the consensus function. Unlike them, however, HBGF considers object and cluster similarity simultaneously when creating the consensus clustering solution, an issue not considered by other graph-partitioning-based consensus functions such as CSPA and MCLA (Strehl and Ghosh, 2002).
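To make the bipartite construction more concrete, the following minimal sketch builds the object-cluster graph from a hard cluster ensemble: every object and every cluster becomes a vertex, and an edge links an object to each cluster that contains it. The function name `build_bipartite_graph`, the edge-list representation, and the toy ensemble are illustrative assumptions rather than part of the original formulation, and the graph partitioning step that yields the consensus clustering is deliberately left out.

```python
def build_bipartite_graph(ensemble):
    """Build the object-cluster bipartite graph of a hard cluster ensemble.

    ensemble: list of label lists, one per clustering, all of length n.
    Returns (n_objects, n_clusters, edges), where each edge (i, n + c)
    links object vertex i to the vertex of a cluster that contains it.
    (Hypothetical helper, written for illustration only.)
    """
    n = len(ensemble[0])
    cluster_vertex = {}  # (clustering index, label) -> cluster index
    edges = []
    for r, labels in enumerate(ensemble):
        if len(labels) != n:
            raise ValueError("every clustering must label all n objects")
        for i, label in enumerate(labels):
            # Clusters are numbered in order of first appearance.
            c = cluster_vertex.setdefault((r, label), len(cluster_vertex))
            edges.append((i, n + c))  # object vertices first, then clusters
    return n, len(cluster_vertex), edges


# Toy ensemble (assumed data): three clusterings of five objects.
ensemble = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
n_objects, n_clusters, edges = build_bipartite_graph(ensemble)
```

Since there are no object-object or cluster-cluster edges, any partition of this graph groups objects and clusters at the same time, which is the property that sets HBGF apart from CSPA and MCLA in the discussion above.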

More recently, the BALLS consensus function was proposed (Gionis, Mannila, and Tsaparas, 2007). It operates on a graph representation of the pairwise object co-dissociation matrix, in which the vertices are the objects in the data set and the edges are weighted by the pairwise object distances. The rationale of the consensus clustering creation process is the iterative construction of consensus clusters from compact and relatively isolated sets of close vertices, which are then removed from the graph.
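The iterative idea described above can be sketched as a simplified greedy procedure. This is not claimed to be the exact BALLS algorithm: the vertex ordering, the `radius` and `alpha` thresholds, and both function names are assumptions chosen for illustration. The co-dissociation between two objects is taken here as the fraction of clusterings that place them in different clusters.

```python
def co_dissociation(ensemble):
    """Fraction of clusterings that separate each pair of objects."""
    n, r = len(ensemble[0]), len(ensemble)
    return [[sum(labels[i] != labels[j] for labels in ensemble) / r
             for j in range(n)] for i in range(n)]


def greedy_ball_consensus(dist, radius=0.5, alpha=0.4):
    """Simplified ball-growing consensus (illustrative, not the published method).

    Vertices are visited in increasing order of total distance to all others.
    The unassigned vertices within `radius` of the current vertex form a
    cluster with it if their average distance to it is at most `alpha`;
    otherwise the vertex becomes a singleton. Assigned vertices are removed.
    """
    n = len(dist)
    order = sorted(range(n), key=lambda u: sum(dist[u]))
    unassigned = set(range(n))
    clusters = []
    for u in order:
        if u not in unassigned:
            continue
        ball = [v for v in sorted(unassigned)
                if v != u and dist[u][v] <= radius]
        if ball and sum(dist[u][v] for v in ball) / len(ball) <= alpha:
            cluster = sorted([u] + ball)
        else:
            cluster = [u]
        unassigned.difference_update(cluster)  # remove from the graph
        clusters.append(cluster)
    return clusters


# Toy ensemble (assumed data): three clusterings of five objects.
ensemble = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
clusters = greedy_ball_consensus(co_dissociation(ensemble))
```

Note that the number of consensus clusters is not passed as a parameter in this sketch; it emerges from the thresholds.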

2.2.3 Consensus functions based on object co-association measures<br />

The approach to consensus clustering based on object co-association measures rests on the assumption that objects belonging to a natural cluster are likely to be co-located in the same cluster by the different clusterings of the cluster ensemble. Therefore, pairwise object co-occurrences are regarded as votes, which are aggregated in an n × n object co-association matrix (where n is the number of instances contained in the data collection). A great advantage of methods of this type is that they avoid cluster disambiguation processes, as the cluster ensemble is inspected object-wise rather than cluster-wise. However, a downside of consensus functions based on object co-association is that their time and space complexities scale quadratically with n, which makes their application to large data sets highly costly or even infeasible.
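The vote aggregation described above can be illustrated with a minimal sketch. The function name `co_association` and the toy ensemble are assumptions, and dividing the vote counts by the number of clusterings is a common normalization adopted here for convenience rather than something stated in the text.

```python
def co_association(ensemble):
    """Aggregate pairwise co-occurrence votes into an n x n matrix.

    ensemble: list of label lists, one per clustering, all of length n.
    Entry (i, j) is the fraction of clusterings that place objects i and j
    in the same cluster (normalization by ensemble size is an assumption).
    """
    n, r = len(ensemble[0]), len(ensemble)
    votes = [[0] * n for _ in range(n)]
    for labels in ensemble:
        # Labels are only compared within one clustering, so clusters of
        # different clusterings never need to be matched to each other.
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1
    return [[v / r for v in row] for row in votes]


# Toy ensemble (assumed data): three clusterings of five objects.
ensemble = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
C = co_association(ensemble)
```

The nested loops over all object pairs and the n × n result make the quadratic time and space cost mentioned above directly visible.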

One of the pioneering works on the combination of hard clustering solutions based on object co-association metrics is the evidence accumulation (EAC) approach. In the original form of the evidence accumulation consensus function, the consensus clustering solution is obtained by applying a simple majority voting scheme to the co-association matrix (Fred, 2001). In subsequent versions, consensus is derived by applying the single-link hierarchical clustering algorithm to the object co-association matrix, regarding it as a measure of the similarity between objects (Fred and Jain, 2002a); a virtually identical proposal is found in a contemporary work (Zeng et al., 2002). In (Fred and Jain, 2003), the evidence accumulation approach is formulated in an information-theoretic framework, defining the optimal consensus clustering solution as the one maximizing the sum of normalized mutual infor-
