
Chapter 7. Conclusions

Though put forward within a robust clustering via cluster ensembles framework, hierarchical consensus architectures can be of interest in any consensus clustering-related problem involving cluster ensembles with a large number of components. Furthermore, HCAs are directly portable to a fuzzy clustering scenario without modification.

In our opinion, the main weakness of this proposal lies in the rather simplistic approach taken in the running time estimation methodology, which employs the execution time of a single consensus process run to estimate the time complexity of the whole HCA. Although experiments have demonstrated that its performance is fairly good, we conjecture that a possible means of improving it (especially in the parallel case, where lower prediction accuracies are obtained) would be to model statistically the running times of the consensus processes the estimation is based on.

7.2 Consensus self-refining procedures

Besides the computational difficulties that have motivated the development of hierarchical consensus architectures, the use of large cluster ensembles also poses a challenge as far as the quality of the resulting consensus clustering is concerned. Indeed, the somewhat indiscriminate generation of clusterings encouraged by our robust clustering via cluster ensembles proposal may lead to the creation of low-quality cluster ensemble components, which negatively affects the quality of the consensus clustering. In order to mitigate the undesired influence of these components, we have devised an unsupervised strategy for excluding them from the consensus process.

The rationale of this strategy is the following: starting from a reference clustering, we measure its similarity (in terms of average normalized mutual information, or φ(ANMI)) with respect to the l cluster ensemble components. Subsequently, a percentage p of these components is selected, after ranking them according to their similarity with respect to the aforementioned reference clustering. Last, the self-refined consensus clustering is obtained by combining the clusterings included in this reduced cluster ensemble, according to either a flat or a hierarchical architecture, a decision that can be reliably made using the running time estimation methodology mentioned earlier.
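To make the procedure concrete, the following Python sketch illustrates this generic self-refining step for hard clusterings encoded as integer label arrays. The helper names and the consensus_function argument are hypothetical placeholders rather than the actual implementation described in this thesis; the per-component similarity is computed with scikit-learn's normalized mutual information.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score


def self_refine(reference, ensemble, p, consensus_function):
    """Build a self-refined consensus clustering from the top-p fraction of components.

    reference          : integer label array acting as the reference clustering
    ensemble           : list of integer label arrays (the l ensemble components)
    p                  : fraction in (0, 1] of components to retain
    consensus_function : callable mapping a list of clusterings to a single clustering
    """
    # Similarity of each component to the reference, measured by NMI.
    similarities = np.array(
        [normalized_mutual_info_score(reference, component) for component in ensemble]
    )
    # Rank the components by similarity and keep the top-p fraction of them.
    n_keep = max(1, int(round(p * len(ensemble))))
    keep_idx = np.argsort(similarities)[::-1][:n_keep]
    reduced_ensemble = [ensemble[i] for i in keep_idx]
    # Combine the retained clusterings (flat or hierarchical consensus).
    return consensus_function(reduced_ensemble)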

Following this generic approach, two self-refining strategies have been proposed. They differ solely in the origin of the clustering used as the reference of the self-refining procedure. In the first version (denominated consensus-based self-refining), the reference clustering is the result of a previous consensus process conducted upon the cluster ensemble at hand. In contrast, the second self-refining procedure (referred to as selection-based self-refining) employs one of the cluster ensemble components as the reference clustering, selected by means of a φ(ANMI) maximization criterion.
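Under the same assumptions, a possible reading of the selection-based criterion is sketched below: the reference is the ensemble component with maximum average NMI against all l components (the self-comparison adds the same constant to every score, so it does not alter the ranking). The function name is illustrative and reuses the imports of the previous sketch.

def select_reference_by_anmi(ensemble):
    """Return the ensemble component maximizing the average NMI (ANMI)
    computed against all l components of the cluster ensemble."""
    anmi_scores = [
        np.mean([normalized_mutual_info_score(candidate, other) for other in ensemble])
        for candidate in ensemble
    ]
    return ensemble[int(np.argmax(anmi_scores))]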

We would like to highlight the fact that the self-refining procedure is almost fully unsupervised. The only user intervention is the selection of the percentage p of the l cluster ensemble components included in the selected cluster ensemble from which the self-refined consensus clustering is derived. In order to minimize the risk of negatively biasing the results of the self-refining procedure through a suboptimal selection of p, the user is prompted to select not a single value of p, but a set of values. The self-refining procedure will produce a self-refined consensus clustering for each distinct value of p, selecting a posteriori, in a fully unsupervised manner, the one with maximum average normalized mutual information with respect to the cluster ensemble.
