
Chapter 3. Hierarchical consensus architectures

Consensus                       Consensus function
architecture    CSPA    EAC    HGPA    MCLA    ALSAD    KMSAD    SLSAD
flat            0       0      0       0       0        0.2      0
RHCA            0.4     0      0       0       1.2      0.1      0
DHCA            0       0      0       0       0        0.3      0

Table 3.14: Percentage of experiments in which the consensus clustering solution is better than the best cluster ensemble component.

Consensus                       Consensus function
architecture    CSPA    EAC    HGPA    MCLA    ALSAD    KMSAD    SLSAD
flat            –       –      –       –       –        0.07     –
RHCA            0.85    –      –       –       0.8      0.07     –
DHCA            –       –      –       –       –        1.1      –

Table 3.15: Relative percentage φ(NMI) gain between the consensus clustering solution and the best cluster ensemble component.
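Read as a relative gain with respect to the best individual component, the figures in Table 3.15 correspond to the following expression (an assumed reading; the precise definition of φ(NMI) is given elsewhere in the thesis):

```latex
% Relative percentage phi(NMI) gain of the consensus clustering solution
% with respect to the best cluster ensemble component (assumed standard
% definition of a relative percentage gain).
\Delta\phi_{(\mathrm{NMI})} = 100 \cdot
  \frac{\phi_{(\mathrm{NMI})}^{\,\mathrm{consensus}}
        - \phi_{(\mathrm{NMI})}^{\,\mathrm{best\ component}}}
       {\phi_{(\mathrm{NMI})}^{\,\mathrm{best\ component}}}
```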

large cluster ensembles if their complexity increases quadratically with the number of components in the ensemble.

To our knowledge, most previous proposals oriented towards this aim rely on subsampling strategies as a means of reducing the computational complexity of consensus processes. That is, since the clustering combination task becomes more costly as the number of objects in the data set and/or the number of cluster ensemble components grows, a natural solution consists in applying the consensus clustering process to a reduced version of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007) and/or the cluster ensemble (Greene and Cunningham, 2006), created by means of a suitable subsampling procedure. Once the consensus process is completed on the reduced data set (or cluster ensemble), it is extended to those entities (objects or cluster ensemble components) that were left out of the subsample. While reducing the size of the data collection and/or the cluster ensemble subject to the consensus clustering process automatically lowers the time complexity, one should take into account the cost associated with the subsampling and extension processes, which is often linear in the size of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007).
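By way of illustration, the following is a minimal sketch of the subsample-then-extend scheme just described. The `run_consensus` and `extend_label` callables are hypothetical placeholders standing for a generic consensus function and extension rule, not the implementations of the cited works:

```python
import numpy as np

def subsampled_consensus(data, cluster_ensemble, sample_ratio,
                         run_consensus, extend_label, rng=None):
    """Run a consensus process on a random subsample of the objects, then
    extend the resulting labeling to the objects left out of the subsample."""
    rng = rng if rng is not None else np.random.default_rng()
    n = data.shape[0]
    sample_size = max(1, int(sample_ratio * n))
    sampled = rng.choice(n, size=sample_size, replace=False)
    left_out = np.setdiff1d(np.arange(n), sampled)

    # Consensus on the reduced problem: each base labeling is restricted to
    # the sampled objects, so the combination step handles a smaller input.
    reduced_ensemble = [labels[sampled] for labels in cluster_ensemble]
    consensus_labels = run_consensus(reduced_ensemble)

    # Extension step (typically linear in the data set size): each left-out
    # object receives a label via the caller-supplied rule, e.g. the label
    # of its nearest sampled object.
    full_labels = np.empty(n, dtype=int)
    full_labels[sampled] = consensus_labels
    for i in left_out:
        full_labels[i] = extend_label(data[i], data[sampled], consensus_labels)
    return full_labels
```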

In contrast, our hierarchical consensus architecture proposals are based on reducing the time complexity of consensus processes without discarding any of the objects in the data set or any of the cluster ensemble components. By means of a divide and conquer approach (Dasgupta, Papadimitriou, and Vazirani, 2006), we break a single consensus clustering problem into multiple smaller problems, giving rise to hierarchical consensus architectures that achieve important computational time savings, especially in high diversity scenarios, i.e. those that arise when the strategy of using multiple mutually crossed diversity factors for creating large cluster ensembles is followed.
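A minimal sketch of this divide and conquer idea follows, assuming a generic `consensus_function` that combines a small list of labelings into one; the fixed `branch_factor` grouping is an illustrative simplification, not the exact RHCA or DHCA construction:

```python
def hierarchical_consensus(cluster_ensemble, consensus_function,
                           branch_factor=5):
    """Combine a large cluster ensemble by repeatedly consolidating small
    groups of labelings, so that no single call to the consensus function
    ever combines more than `branch_factor` components."""
    assert branch_factor >= 2, "grouping must shrink the ensemble each stage"
    components = list(cluster_ensemble)
    # Each stage of the hierarchy replaces groups of up to `branch_factor`
    # labelings with their intermediate consensus until a single one remains.
    while len(components) > 1:
        components = [
            consensus_function(components[i:i + branch_factor])
            for i in range(0, len(components), branch_factor)
        ]
    return components[0]
```

With b = branch_factor, each stage shrinks the ensemble by a factor of b, so a consensus function whose cost grows quadratically with the ensemble size is only ever applied to groups of at most b labelings.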

As far as we know, the use of divide and conquer approaches to the consensus clustering problem has only been reported in (He et al., 2005), as a means for clustering data sets that contain both numeric and categorical attributes. This proposal consists of dividing the original data collection into two subsets, one purely numeric and the other purely categorical, conducting
