29.04.2013 Views

TESI DOCTORAL - La Salle


Chapter 2. Cluster ensembles and consensus clustering

strength between the ith and the jth objects. The authors base these consensus functions on the proven suitability of similarity measures as object features in classification problems (Kalska, 2005). Thus, the consensus clustering solution is obtained by applying standard clustering algorithms to the pairwise object co-association matrix. In particular, the hierarchical average-link and single-link algorithms and the k-means algorithm are applied, giving rise to the ALSAD, SLSAD and KMSAD consensus functions. Notice that these consensus functions, despite being based on partitioning the object co-association matrix (just like EAC, for instance), differ from such approaches in that the matrix contents are not interpreted as measures of similarity between objects, but rather as attributes in a new feature space.
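The distinction between reading the matrix entries as similarities and reading its rows as feature vectors can be made concrete with a small sketch. It assumes the common definition of the co-association matrix (the fraction of ensemble components that place two objects in the same cluster), which may differ from the exact weighting used by the authors; the names `co_association` and `row_distance` are hypothetical and introduced only for illustration.

```python
from math import sqrt

def co_association(ensemble):
    """Return the n x n matrix whose (i, j) entry is the fraction of
    clusterings that place objects i and j in the same cluster.
    (Assumed definition; the source's exact weighting is not shown.)"""
    n = len(ensemble[0])
    m = len(ensemble)
    return [
        [sum(labels[i] == labels[j] for labels in ensemble) / m for j in range(n)]
        for i in range(n)
    ]

def row_distance(matrix, i, j):
    """Euclidean distance between rows i and j, each row being read as the
    feature vector of one object rather than as a list of similarities."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(matrix[i], matrix[j])))

# Toy ensemble: three clusterings of five objects.
# Label values are arbitrary within each clustering.
ensemble = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
C = co_association(ensemble)
```

Under this reading, any standard algorithm that works on feature vectors (k-means, or hierarchical clustering on `row_distance`) can be run on the rows of `C`, which is the general idea behind the three consensus functions named above.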

2.2.8 Consensus functions based on cluster centroids

The time and memory scalability problems that arise when combining clusterings of large data sets are the principal motivation of several works that tackle the consensus clustering problem following a centroid-based approach (Hore, Hall, and Goldgof, 2006; Nguyen and Caruana, 2007). The underlying rationale consists of representing the cluster ensemble components in terms of the centroids of their clusters, instead of label vectors. By doing so, storage requirements are reduced, as the number of clusters k is usually much lower than the number of objects n. In (Hore, Hall, and Goldgof, 2006), moreover, the authors highlight that the parallelization of the cluster ensemble construction process would dramatically decrease its time complexity. As regards the creation of the consensus clustering solution, it is based on computing the average centroid of each cluster, after disambiguating the clusters (that is, matching corresponding clusters across ensemble components) by means of the Hungarian algorithm. Furthermore, that work introduced the possibility of discarding bad clusters from the cluster ensemble at consensus clustering creation time.
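The align-then-average step described above can be sketched as follows. This is a minimal illustration under several assumptions, not the procedure of Hore, Hall, and Goldgof: every ensemble component is assumed to have the same number of centroids, the first component is arbitrarily taken as the reference, and the optimal matching is found by exhaustive search over permutations, a simple stand-in that returns the same minimum-cost assignment the Hungarian algorithm would compute efficiently. The function names are hypothetical.

```python
from itertools import permutations

def sq_dist(a, b):
    """Squared Euclidean distance between two centroids."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def align(reference, centroids):
    """Reorder `centroids` so that position i matches reference[i],
    minimising the total squared distance. Exhaustive search over
    permutations: only viable for small k (stand-in for the Hungarian
    algorithm)."""
    k = len(reference)
    best = min(
        permutations(range(k)),
        key=lambda p: sum(sq_dist(reference[i], centroids[p[i]]) for i in range(k)),
    )
    return [centroids[j] for j in best]

def consensus_centroids(ensemble):
    """Average the aligned centroids of all ensemble components.
    Assumption: the first component serves as the alignment reference."""
    reference = ensemble[0]
    aligned = [reference] + [align(reference, c) for c in ensemble[1:]]
    k, d, m = len(reference), len(reference[0]), len(aligned)
    return [
        [sum(part[i][f] for part in aligned) / m for f in range(d)]
        for i in range(k)
    ]

# Toy ensemble: three components, each stored as k = 2 centroids in 2-D.
ensemble = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[9.0, 11.0], [1.0, -1.0]],   # same clusters, listed in the opposite order
    [[-1.0, 1.0], [11.0, 9.0]],
]
consensus = consensus_centroids(ensemble)
```

Note that each component is stored as k centroids rather than n labels, which is where the storage saving mentioned in the text comes from.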

In (Nguyen and Caruana, 2007), three iterative consensus functions were presented and empirically compared with eleven other clustering combiners in a comprehensive experimental study. The proposal in that work derived the consensus clustering solution in terms of the centroids of its clusters. Although the proposed consensus functions are capable of combining clusterings with a variable number of clusters, all the individual partitions contained in the hard cluster ensembles used in the experiments have the same number of clusters for simplicity. The first consensus function proposed in that work, called Iterative Voting Consensus (IVC), is based on iteratively computing the centroids of the consensus solution clusters and assigning each object to the nearest cluster, as determined by the Hamming distance. This procedure is repeated until the centroids of the consensus clustering solution reach a stable state. The second proposal, named Iterative Probabilistic Voting Consensus (IPVC), is a variant of IVC in which objects are iteratively assigned to consensus clusters in terms of their distance with respect to the objects that have been previously assigned to them. In the third proposed consensus function, Iterative Pairwise Consensus (IPC), objects are iteratively reassigned to consensus clusters according to their similarity as measured by the pairwise object co-association matrix.
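The IVC loop described above (recompute centroids, reassign by Hamming distance, stop when stable) can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: each object is represented by the tuple of labels it receives across the ensemble, a cluster centroid is the per-position majority label of its members, and the initial assignment is a deterministic round-robin. The function names are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two label tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def ivc(ensemble, k, max_iter=100):
    """Sketch of an IVC-style loop. `ensemble` is a list of label vectors;
    returns one consensus label per object."""
    n = len(ensemble[0])
    # Assumed representation: object i is described by its labels
    # across all ensemble components.
    features = [tuple(labels[i] for labels in ensemble) for i in range(n)]
    # Assumed initialisation: round-robin assignment to k clusters.
    assign = [i % k for i in range(n)]
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [features[i] for i in range(n) if assign[i] == c]
            if not members:
                centers.append(None)  # empty cluster: skipped when assigning
                continue
            # Assumed centroid: majority label at each position.
            centers.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        new_assign = [
            min(
                (c for c in range(k) if centers[c] is not None),
                key=lambda c: hamming(features[i], centers[c]),
            )
            for i in range(n)
        ]
        if new_assign == assign:  # stable state reached
            break
        assign = new_assign
    return assign

# Toy hard cluster ensemble: three clusterings of six objects.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
labels = ivc(ensemble, k=2)
```

On this toy ensemble the third clustering disagrees with the other two about the third object, and the majority vote encoded in the centroids resolves the disagreement.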

2.2.9 Consensus functions based on correlation clustering

Recently, the connection between the newly emerged problem of correlation clustering and consensus clustering was exploited to derive novel consensus functions capable of determining the most natural number of clusters (Gionis, Mannila, and Tsaparas, 2007). In that work, the cluster ensemble is modeled as a graph resembling a pairwise object co-
