
Chapter 2. Cluster ensembles and consensus clustering

disambiguation is conducted by matching those clusters sharing the highest percentage of objects, iterating this cluster matching process across all the partitions in the cluster ensemble. As a result of the voting step, a fuzzy partition of the data set is obtained. Subsequently, a merging procedure is conducted on this soft partition, fusing those clusters which are closest to each other. By imposing a stopping criterion based on the certainty of the obtained clusters, this merging process is capable of finding the natural number of clusters in the data set. In (Dimitriadou, Weingessel, and Hornik, 2002), the authors define the consensus clustering solution as the one minimizing the average square distance with respect to all the partitions in the cluster ensemble. They demonstrate that obtaining such a consensus clustering boils down to finding the optimal re-labeling of the clusters of all the individual clusterings, which becomes an infeasible problem if approached directly, since it requires an enumeration of all possible permutations. Therefore, they resort to the VMA consensus function of (Dimitriadou, Weingessel, and Hornik, 2001) to find an approximate solution to the problem, extending its application to soft cluster ensembles.
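To make the voting step above more concrete, the following is a minimal sketch in Python (assuming hard labelings encoded as integer vectors over 0..k-1 and the same number of clusters k in every partition) of plurality voting preceded by cluster disambiguation through overlap matching; the function names are illustrative and this is not the exact VMA procedure of the cited works.

    import numpy as np

    def relabel_by_overlap(reference, labels, k):
        # Greedily match each cluster of `labels` to the cluster of
        # `reference` with which it shares the most objects (a
        # simplification of the percentage-based matching described above).
        overlap = np.zeros((k, k))
        for r in range(k):
            for c in range(k):
                overlap[r, c] = np.sum((reference == r) & (labels == c))
        mapping = {}
        for _ in range(k):
            r, c = np.unravel_index(np.argmax(overlap), overlap.shape)
            mapping[c] = r
            overlap[r, :] = -1   # eliminate the matched row and column
            overlap[:, c] = -1
        return np.array([mapping[c] for c in labels])

    def plurality_voting(ensemble, k):
        # Accumulate one vote per partition and object; the normalized
        # vote counts constitute a fuzzy consensus partition.
        n = len(ensemble[0])
        votes = np.zeros((n, k))
        reference = ensemble[0]
        for labels in ensemble:
            aligned = relabel_by_overlap(reference, labels, k)
            votes[np.arange(n), aligned] += 1
        return votes / len(ensemble)

The resulting soft partition is the input of the subsequent merging procedure, which could, for instance, fuse the two closest columns of the vote matrix until a certainty-based stopping criterion is met.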

One of the two consensus functions proposed in (Dudoit and Fridlyand, 2003), called BagClust1, is based on applying plurality voting to the labelings in the cluster ensemble after a label disambiguation process based on measuring the overlap between clusters. The generation of the cluster ensemble components follows a resampling strategy similar to bagging, aiming to reduce the variability of the partitioning results via consensus clustering. A related proposal is the one presented in (Fischer and Buhmann, 2003). In that work, consensus clustering is viewed as a means of improving the quality and reliability of the results of path-based clustering, applying bagging to create the hard cluster ensemble. The consensus clustering solution is obtained through a maximum likelihood mapping in which the label permutation problem is solved by means of the Hungarian method (Kuhn, 1955), which in some respects resembles the application of plurality voting on the disambiguated individual partitions in the cluster ensemble (Ayad and Kamel, 2008). Moreover, a related reliability measure chooses the number of clusters with the highest stability as the preferred consensus solution.
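As an illustration of how the label permutation problem can be solved optimally, the sketch below uses the Hungarian method through SciPy's linear_sum_assignment to align each partition with a reference one before voting; the overlap-based cost matrix and the function names are assumptions made for illustration rather than the exact formulation of the cited works.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def hungarian_relabel(reference, labels, k):
        # Overlap between every pair of clusters of the two partitions.
        overlap = np.zeros((k, k))
        for r in range(k):
            for c in range(k):
                overlap[r, c] = np.sum((reference == r) & (labels == c))
        # The assignment solver minimizes cost, so the overlap is negated
        # to obtain the permutation that maximizes the total agreement.
        row_ind, col_ind = linear_sum_assignment(-overlap)
        mapping = {c: r for r, c in zip(row_ind, col_ind)}
        return np.array([mapping[c] for c in labels])

    def plurality_consensus(ensemble, k):
        # Hard consensus: align every partition to the first one and
        # assign each object to its most frequently voted cluster.
        reference = ensemble[0]
        votes = np.zeros((len(reference), k))
        for labels in ensemble:
            aligned = hungarian_relabel(reference, labels, k)
            votes[np.arange(len(labels)), aligned] += 1
        return votes.argmax(axis=1)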

In (Greene and Cunningham, 2006), a majority voting strategy was applied to generate the consensus clustering solution, after disambiguating the clusters using the Hungarian algorithm. A further point of interest is that this was one of the first research efforts to consider the problem of creating and combining a large number of clustering solutions in the context of high dimensional data sets (such as text document collections). Indeed, the authors point out that using large ensembles increases the computational cost, while small ensembles tend to produce unstable consensus clustering solutions. In this context, the authors propose basing the cluster ensemble construction and the consensus clustering tasks on a prototype reduction technique that allows the whole data set to be represented by a minimal set of objects, ensuring that the clustering results will approximate those that would be obtained on the original data set. By doing so, the final clustering solution can be extended to those objects that have been left out of the reduced data set representation, while alleviating the overall computational cost of the whole process. In particular, the reduced version of the cluster ensemble is obtained by projecting the pairwise object similarity matrix by means of a kernel matrix.
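A simplified illustration of extending a consensus solution computed on a reduced set of prototypes to the remaining objects is given below: each object simply receives the consensus label of its nearest prototype. Both this nearest-prototype rule and the names used are assumptions made for the sake of the example; the cited work instead relies on projecting the pairwise object similarity matrix through a kernel matrix.

    import numpy as np

    def extend_consensus(X, prototype_idx, prototype_labels):
        # Label every object in X with the consensus label of its
        # nearest prototype (squared Euclidean distance); illustrative only.
        prototypes = X[prototype_idx]
        d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        return np.asarray(prototype_labels)[d2.argmin(axis=1)]

For instance, prototype_idx could index a small subsample of the objects and prototype_labels could hold the consensus labels obtained by combining partitions of that subsample only.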

The recent work of (Ayad and Kamel, 2008) presented a set of consensus functions based on cumulative voting (named URCV, RCV and ACV) whose time complexity scales linearly with the size of the data set. Another interesting feature is their capability of combining

