29.04.2013 Views

TESI DOCTORAL - La Salle


Chapter 2. Cluster ensembles and consensus clustering

ensemble components, which are generated by random feature extraction and selection, weak
clusterers, or a random number of clusters, among others. Unfortunately, the fitness function
driving the genetic algorithm used to select the best cluster ensemble is the classification
accuracy of the respective ensemble with respect to the correct object labels (ground truth).
Since such labels are not available in a truly unsupervised setting, this strategy is of
limited use in practice.

Another interesting work that addresses the heuristics of the cluster ensemble construction
process is (Kuncheva, Hadjitodorov, and Todorova, 2006), whose authors recommend creating
individual partitions with a variable number of clusters as a means of obtaining good
quality consensus labelings.

As mentioned earlier, an alternative way of obtaining a good quality consensus clustering
solution is based not on designing the cluster ensemble components according to certain
heuristics, but on refining the contents of the ensemble. The rationale of such strategies
is that the quality of the consensus clustering solution is penalized by the inclusion of
poor individual clustering solutions in the ensemble. For this reason, it seems logical to
develop techniques capable of discarding such cluster ensemble components prior to
conducting the consensus clustering process.

Along these lines, the application of quality and/or diversity criteria for selecting a small
subset of a large cluster ensemble was evaluated in (Fern and Lin, 2008) as a means of
obtaining a consensus clustering solution that matches or improves on the one that would be
obtained using the whole ensemble. Pursuing the same goal, the authors of (Goder and Filkov,
2008) propose creating smaller subsets of the cluster ensemble that yield a better consensus.
These mini-cluster ensembles are generated by clustering the individual partitions of the
cluster ensemble using a hierarchical agglomerative average-link clustering algorithm.
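
The grouping of partitions into mini-cluster ensembles can be illustrated with a minimal sketch. It is not the procedure of (Goder and Filkov, 2008) itself: the distance between partitions (one minus the Rand index), the function names `rand_distance` and `average_link_groups`, and the stopping rule (a fixed number of groups, `n_groups`) are all assumptions made for illustration. Each partition is represented as a list of cluster labels over the same objects, and at least two objects are assumed.

```python
from itertools import combinations


def rand_distance(a, b):
    """Fraction of object pairs on which two labelings disagree (1 - Rand index).

    Assumed distance between partitions; the source does not specify one.
    """
    pairs = list(combinations(range(len(a)), 2))
    disagreements = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagreements / len(pairs)


def average_link_groups(ensemble, n_groups):
    """Group partition indices by agglomerative average-link clustering.

    Starts with one group per partition and repeatedly merges the two groups
    whose members have the smallest mean pairwise distance.
    """
    indices = range(len(ensemble))
    dist = {
        (p, q): rand_distance(ensemble[p], ensemble[q])
        for p, q in combinations(indices, 2)
    }

    def linkage(g, h):
        # Average-link criterion: mean distance over all cross-group pairs.
        total = sum(dist[min(p, q), max(p, q)] for p in g for q in h)
        return total / (len(g) * len(h))

    groups = [[idx] for idx in indices]
    while len(groups) > n_groups:
        x, y = min(
            combinations(range(len(groups)), 2),
            key=lambda xy: linkage(groups[xy[0]], groups[xy[1]]),
        )
        groups[x] = groups[x] + groups[y]  # x < y, so deleting y is safe
        del groups[y]
    return groups


# Hypothetical toy ensemble: partitions 0 and 1 are the same clustering with
# permuted labels, and so are partitions 2 and 3.
toy_ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
mini_ensembles = average_link_groups(toy_ensemble, n_groups=2)
```

Each resulting group of indices would then act as one mini-cluster ensemble on which a consensus function is run separately. Comparing partitions through co-clustered pairs makes the distance insensitive to how cluster labels happen to be numbered.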

2.2 Related work on consensus functions<br />

The goal of this section is to review the state of the art in the area of consensus clustering.
Although a considerable corpus of theoretical work on combining classifications was
developed in the 80s and earlier (e.g. (Mirkin, 1975; Barthelemy, Leclerc, and Monjardet,
1986; Neumann and Norton, 1986)), it was not until the start of the present decade that
this field experienced a significant surge of activity.

Despite this relatively recent awakening, multiple approaches to the combination of
several clusterings can be found in the literature. In general terms, consensus clustering
can be posed as an optimization problem whose goal is to minimize a cost function
measuring the dissimilarity between the consensus clustering solution and the partitions in
the cluster ensemble. Often, this cost function is expressed in terms of the number of pairwise
co-clustering disagreements between the individual partitions in the cluster ensemble and
the consensus clustering solution. Unfortunately, finding the partition that minimizes this
symmetric difference distance metric (i.e. the so-called median partition) is an
NP-hard problem (Goder and Filkov, 2008), which is why it is necessary to
resort to distinct heuristics in order to conduct clustering combination.
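
To make the cost function concrete, the following minimal sketch counts pairwise co-clustering disagreements and finds the median partition by exhaustive search. The function names and the toy ensemble are hypothetical, and the exhaustive search is only an illustration of the objective: the number of candidate partitions grows as the Bell numbers, which is why it is usable only for a handful of objects.

```python
from itertools import combinations


def pairwise_disagreements(labels_a, labels_b):
    """Symmetric difference distance between two partitions.

    Counts the object pairs that one partition places in the same cluster
    and the other places in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    count = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        if together_a != together_b:
            count += 1
    return count


def consensus_cost(candidate, ensemble):
    """Total disagreement between a candidate and every ensemble partition."""
    return sum(pairwise_disagreements(candidate, p) for p in ensemble)


def all_partitions(n):
    """Yield every partition of n objects as a list of cluster labels.

    Labels follow the restricted-growth convention (the first object gets
    label 0, and a new label is at most one larger than the largest used),
    so each partition is produced exactly once.
    """
    def extend(prefix, n_used):
        if len(prefix) == n:
            yield list(prefix)
            return
        for label in range(n_used + 1):
            prefix.append(label)
            yield from extend(prefix, max(n_used, label + 1))
            prefix.pop()

    yield from extend([], 0)


def median_partition_brute_force(ensemble):
    """Exhaustively search for a partition of minimum consensus cost."""
    n_objects = len(ensemble[0])
    return min(all_partitions(n_objects), key=lambda c: consensus_cost(c, ensemble))


# Hypothetical toy ensemble over four objects: the first two partitions are
# the same clustering with permuted labels, the third one differs.
toy_ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
median = median_partition_brute_force(toy_ensemble)
cost = consensus_cost(median, toy_ensemble)
```

In this toy case the search returns the clustering shared by the first two partitions, since any other candidate disagrees with both of them at once.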

Aiming to provide the reader with a global perspective on the distinct existing approaches
in this field, Table 2.1 presents a taxonomy of some of the most relevant consensus
functions according to the theoretical foundations that guide the consensus process. Notice
that some consensus functions appear under more than one theoretical approach, as some-
