TESI DOCTORAL - La Salle

Chapter 7. Conclusions

of-the-art consensus functions employed, both in terms of the quality of the consensus clusterings they yield and their computational complexity, and the most relevant conclusions drawn are enumerated next. As regards the consensus functions for hard cluster ensembles, we have observed that, in general terms, the EAC and HGPA consensus functions deliver, by far, the poorest quality consensus clusterings. We believe that the low performance of EAC is due to the fact that it was originally devised for consolidating clusterings with a very high number of clusters into a consensus partition with a smaller number of clusters (Fred and Jain, 2005). However, in our experiments, both the cluster ensemble components and the consensus clustering have the same number of clusters, which probably has a detrimental effect on the quality of the consensus partitions obtained by the evidence accumulation approach.
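The evidence accumulation approach discussed above can be sketched minimally as follows. This is an illustrative simplification assuming the standard co-association formulation, with hypothetical function and variable names, not the thesis's own implementation:

```python
# Minimal sketch of evidence accumulation (EAC): build a co-association
# matrix whose (i, j) entry is the fraction of ensemble partitions that
# place objects i and j in the same cluster. All names are illustrative.

def coassociation_matrix(partitions):
    """Return the n x n co-association matrix of a list of hard partitions,
    each given as a list of n cluster labels."""
    n = len(partitions[0])
    l = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[counts[i][j] / l for j in range(n)] for i in range(n)]

# Toy ensemble: three hard partitions of five objects.
ensemble = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 2],
]
C = coassociation_matrix(ensemble)
print(C[0][1])  # objects 0 and 1 co-clustered in every partition -> 1.0
print(C[3][4])  # objects 3 and 4 co-clustered in one partition of three
```

In the full EAC pipeline, this co-association matrix is typically fed to a hierarchical clustering step (e.g. single-link) to extract the consensus partition.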

From a computational standpoint, the HGPA and MCLA consensus functions are applicable to larger data collections than the rest, as their complexity is linear with the number of objects n. However, the execution time of MCLA is penalized when it is run on large cluster ensembles, as its time complexity is quadratic with l (the number of cluster ensemble components). As regards the soft consensus functions, VMA constitutes the most attractive alternative, as it yields high quality consensus clusterings while being fast to execute, thanks to the simultaneous execution of the cluster disambiguation and voting procedures. A rather opposite behaviour is shown by the soft versions of the EAC and HGPA consensus functions: the former is notably time consuming, while the latter outputs consensus clusterings of very poor quality.
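The disambiguation-plus-voting idea that VMA builds on can be illustrated with a simplified two-stage sketch: align each partition's cluster labels to a reference partition (here via greedy overlap matching, a simplification of the usual Hungarian-style assignment; VMA itself interleaves disambiguation and voting rather than staging them), then take a per-object majority vote. All names below are illustrative:

```python
from collections import Counter

def align_labels(reference, labels, k):
    # Greedy relabelling: map each cluster in `labels` to the reference
    # cluster it overlaps most, taking the largest overlaps first.
    overlaps = []
    for c in range(k):
        for r in range(k):
            ov = sum(1 for a, b in zip(labels, reference) if a == c and b == r)
            overlaps.append((ov, c, r))
    mapping, used = {}, set()
    for ov, c, r in sorted(overlaps, reverse=True):
        if c not in mapping and r not in used:
            mapping[c] = r
            used.add(r)
    return [mapping[x] for x in labels]

def voting_consensus(partitions, k):
    # Align every partition to the first one, then majority-vote per object.
    reference = partitions[0]
    aligned = [reference] + [align_labels(reference, p, k) for p in partitions[1:]]
    n = len(reference)
    return [Counter(p[i] for p in aligned).most_common(1)[0][0] for i in range(n)]

ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same grouping, permuted labels
    [0, 0, 1, 0],   # one disagreement on the last object
]
print(voting_consensus(ensemble, 2))  # -> [0, 0, 1, 1]
```

The voting stage only makes sense after disambiguation: without the relabelling step, the permuted second partition would cancel the votes of the first instead of reinforcing them.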

Let us adopt a critical stance for a moment: possibly one of the major sources of criticism of this work is the rather unrealistic assumption (though not uncommon in the literature) that the number of clusters the objects must be grouped into (referred to as k) is a known parameter. In practice, however, the user seldom knows how many clusters should be found, so it becomes a further indeterminacy to deal with.

In this work, all the clusterings involved in any process (i.e. the cluster ensemble components and the consensus clusterings) have the same number of clusters, which coincides with the number of true groups in the data, defined by the ground truth that constitutes the gold standard for ultimately evaluating the quality of our results. By doing so, we have also disregarded a common diversity factor employed in the creation of cluster ensembles, which often contain clusterings with different numbers of clusters (often chosen randomly). However, we would like to highlight at this point that not many of the consensus functions found in the literature are capable of estimating the correct number of clusters in the data set, thus making it necessary to specify the desired value of k as one of their parameters. Quite obviously, two of the clearest future research directions of this work are i) estimating the number of clusters of the consensus clustering solution, and ii) adapting the proposed consensus functions to deal with cluster ensemble components with distinct numbers of clusters. The achievement of these goals would constitute the ultimate step towards a fully generic approach to robust clustering based on cluster ensembles.

7.1 Hierarchical consensus architectures<br />

As regards the computational efficiency of consensus processes, the fact that their space and time complexities usually scale linearly or quadratically with the cluster ensemble size can
