TESI DOCTORAL - La Salle


Chapter 2. Cluster ensembles and consensus clustering

2.2.12 Other interesting works on consensus clustering

Several works in the literature are devoted to the experimental comparison of consensus functions, their main interest being the evaluation of the quality of the consensus clusterings obtained.

Examples include the work by Minaei-Bidgoli, Topchy, and Punch (2004), where a data resampling scheme was presented as a means of improving the robustness and stability of the consensus clustering solution. In that work, the EAC, BagClust2, QMI, CSPA, HGPA and MCLA consensus functions are compared experimentally on hard cluster ensembles created from bootstrap partitions of several artificial and real data collections. The main conclusion drawn is that, as expected, no consensus function is universally superior: each one explores the data set in a different way, so its efficiency depends greatly on the structure present in the data set. Another extensive performance comparison between several consensus functions operating on small hard cluster ensembles is presented in (Kuncheva, Hadjitodorov, and Todorova, 2006). More recently, (Gonzàlez and Turmo, 2008a) and (Gonzàlez and Turmo, 2008b) applied consensus clustering as a means of avoiding suboptimal clustering solutions when running non-parametric clustering algorithms on text document collections. These works compared i) the performance of several non-parametric clustering algorithms across six text corpora, and ii) the quality of the consensus clustering solution when it is built, using some of the consensus functions presented in (Gionis, Mannila, and Tsaparas, 2007), upon homogeneous and heterogeneous cluster ensembles.
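The idea of building a hard cluster ensemble from bootstrap partitions, as in the resampling scheme mentioned above, can be illustrated with a minimal sketch. It is not the procedure of Minaei-Bidgoli, Topchy, and Punch (2004): the toy 1-D k-means, the co-association threshold of 0.5 and all function names (`kmeans_1d`, `coassociation_from_bootstrap`, `consensus_labels`) are assumptions made here for illustration, and the thresholded linking is only a simplified stand-in for a co-association-based consensus function.

```python
import random
from itertools import combinations


def kmeans_1d(values, k, rng, iters=20):
    """Tiny 1-D k-means (illustrative only); returns one label per value."""
    centroids = rng.sample(sorted(set(values)), k)  # distinct starting points
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = sum(members) / len(members)
    return labels


def coassociation_from_bootstrap(data, k, n_partitions, seed=0):
    """Fraction of bootstrap partitions in which each co-sampled pair shares a cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    cosampled = [[0] * n for _ in range(n)]
    for _ in range(n_partitions):
        idx = [rng.randrange(n) for _ in range(n)]  # sample objects with replacement
        sample = [data[i] for i in idx]
        if len(set(sample)) < k:
            continue  # too few distinct values to form k clusters
        labels = kmeans_1d(sample, k, rng)
        label_of = dict(zip(idx, labels))  # object index -> label in this partition
        for i, j in combinations(sorted(label_of), 2):
            cosampled[i][j] += 1
            together[i][j] += label_of[i] == label_of[j]
    # Only the upper triangle (i < j) is filled; other entries stay 0.0.
    return [[together[i][j] / cosampled[i][j] if cosampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def consensus_labels(coassoc, threshold=0.5):
    """Link pairs whose co-association exceeds the threshold; return component labels."""
    n = len(coassoc)
    labels = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if coassoc[i][j] > threshold:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]  # two well-separated toy groups
coassoc = coassociation_from_bootstrap(data, k=2, n_partitions=100)
labels = consensus_labels(coassoc)
print(labels)
```

Because each bootstrap partition only covers the objects that were actually drawn, the co-association of a pair is normalized by the number of partitions in which both objects appear, not by the total number of partitions.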

In most of these comparisons between consensus functions, the evaluation of computational complexity is given marginal importance, although it becomes critical in practice, especially when dealing with large data collections or cluster ensembles containing many partitions. This is the main motivation behind the data sampling strategies proposed in (Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007). The latter work proposes the SAMPLING algorithm, which consists of i) subsampling a sufficient number of objects from the data set, so that the consensus clustering solution is built on a reduced cluster ensemble, and ii) subsequently extending the combined clustering solution to the objects left out of the subsampling process. The time complexity of these two processes is linear in the data set size, which can yield substantial time savings on large data collections.
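The two-stage structure described above (consensus on a subsample, then extension to the remaining objects) can be sketched as follows. This is a simplified illustration and not the SAMPLING algorithm as specified by Gionis, Mannila, and Tsaparas (2007): the pairwise disagreement distance, the 0.5 threshold, the rule that turns a poorly matching object into a singleton, and all function names are assumptions introduced here.

```python
import random


def disagreement(ensemble, i, j):
    """Fraction of partitions that place objects i and j in different clusters."""
    return sum(p[i] != p[j] for p in ensemble) / len(ensemble)


def consensus_on_sample(ensemble, sample):
    """Stand-in consensus: link sampled pairs separated by fewer than half the partitions."""
    clusters = []
    for i in sample:
        linked = [c for c in clusters
                  if any(disagreement(ensemble, i, j) < 0.5 for j in c)]
        merged = [i] + [j for c in linked for j in c]
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


def extend_to_rest(ensemble, clusters, rest):
    """Assign each left-out object to its closest sampled cluster, or make it a singleton."""
    assignment = {i: c for c, members in enumerate(clusters) for i in members}
    next_label = len(clusters)
    for i in rest:
        # Each left-out object is compared with the sampled objects only,
        # so this stage costs one pass over the data set for a fixed sample size.
        mean_dist = [sum(disagreement(ensemble, i, j) for j in members) / len(members)
                     for members in clusters]
        best = min(range(len(clusters)), key=mean_dist.__getitem__)
        if mean_dist[best] < 0.5:
            assignment[i] = best
        else:
            assignment[i] = next_label
            next_label += 1
    return assignment


def sampling_consensus(ensemble, sample_size, seed=0):
    """Consensus on a random subsample, extended to the objects left out of it."""
    n = len(ensemble[0])
    sample = random.Random(seed).sample(range(n), sample_size)
    in_sample = set(sample)
    rest = [i for i in range(n) if i not in in_sample]
    clusters = consensus_on_sample(ensemble, sample)
    assignment = extend_to_rest(ensemble, clusters, rest)
    return [assignment[i] for i in range(n)]


# Toy hard cluster ensemble: three partitions of eight objects.
ensemble = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 1],  # object 3 is placed differently here
]
labels = sampling_consensus(ensemble, sample_size=5)
print(labels)
```

In this toy ensemble a sample of five objects necessarily contains members of both groups, so the extension step can place every left-out object in an existing cluster. With a sample that misses a whole group, the sketch would leave that group's objects as singletons, which is why the subsample has to be large enough.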

Another variant of the consensus clustering problem is the weighted combination of clusterings, which is the central topic of (Gonzàlez and Turmo, 2006). The idea behind weighted consensus clustering is to give more relevance to some components of the cluster ensemble, as they may describe the structure of the data set better than others. It therefore makes sense to combine clusterings in a weighted manner, emphasizing the contribution of the components deemed the best in the ensemble. Besides designing consensus functions capable of combining weighted partitions, it is necessary to devise strategies for setting the proper weight of each individual clustering, which is not trivial in an unsupervised scenario. In that work, hypergraph-based (Strehl and Ghosh, 2002) and probabilistic (Topchy, Jain, and Punch, 2004) consensus functions are modified so as to handle weighted partitions. The best weighting scheme is determined by creating differently weighted cluster ensembles and then selecting the best option in an unsupervised manner through the maximization of a scoring function. Moreover, this con-
