

TESI DOCTORAL

Title: Hierarchical self-refining consensus architectures and soft consensus functions for robust multimedia clustering

Presented by: Xavier Sevillano Domínguez

Centre: Enginyeria i Arquitectura La Salle

Department: Comunicacions i Teoria del Senyal

Supervised by: Dr. Francesc Alías Pujol and Dr. Joan Claudi Socoró Carrié



Acknowledgements

This thesis is the fruit of many hours of personal work. Nevertheless, there are many people I am grateful to for their support over these years.

First of all, I want to mention my fantastic team of co-supervisors: Joan Claudi Socoró, whom I thank for having been a magnificent advisor/supervisor since the now distant days of the TFC (adaptive equalization, phew!), besides giving me the freedom to write the thesis I wanted and always offering his help at critical moments. And Francesc Alías (Xuti), for the push that ultimately became the start of this thesis, for his constant spirit of improvement and, above all, for his friendship, which goes back even further.

To the direct managers I have had over these years, David Badia, Elisa Martínez and Gabriel Fernández, I am grateful for having granted me a space sheltered, many a time, from the usual shower of thankless tasks.

I am also very grateful to my good friend and disciple Germán Cobo, who took with me the first steps of what has ended up becoming this thesis, and with whom I hope to keep working in the future, even if at a distance (that is how things are at the UOC).

The long hours of simulations would have been even longer had it not been for the Maintenance staff (Chus, Héctor, Gerard, Oscar, Raúl), who helped me and made it easy to use (and almost monopolize) a good handful of PCs. I am also very grateful to my section colleagues who let me occupy their computers, often to their own detriment: Germán, Joan Claudi, José Antonio Montero and Ester Cierco. Thanks to Lluís Formiga for opening the doors of the Multimodal to me, which greatly sped up the experimentation process.

Thanks to Berta Martínez, Àngel Calzada and Lluís Cortés for making the summer of 2008 more bearable, and to Germán (again!) for introducing me to Deezer, which has provided the soundtrack to this thesis. Thanks, in general, to all the colleagues of the former Tractament section, of the Departament de Comunicacions, and of the current DTM.

Thanks to my mother for her love and support throughout my whole life. Thanks to my father for instilling in me a passion for studying, and thanks to both of them for their efforts so that I could study under the best conditions. And thanks to all my family and friends in general: although you may not know it, it was very comforting for me whenever you asked how the thesis was going.

And thanks to Susana, for her patience throughout this whole process, encouraging me and always believing in me, and for still being there so that we can enjoy together what is to come ... once the thesis is finished.


Resum

En segmentar de forma no supervisada una col·lecció de dades, l'usuari ha de prendre múltiples decisions (quin algorisme aplicar, com representar els objectes, en quants grups agrupar aquests, entre d'altres) que condicionen, en gran mesura, la qualitat de la partició resultant. Malauradament, la naturalesa no supervisada del problema fa difícil (quan no impossible) prendre aquestes decisions de manera fonamentada, a no ser que es disposi de cert coneixement del domini.

En un intent per combatre aquestes incerteses, aquesta tesi proposa una aproximació al problema que minimitza, intencionadament, la presa de decisions per part de l'usuari. Ans al contrari, s'encoratja l'ús de tants sistemes de classificació no supervisada com sigui possible, combinant-los per tal d'obtenir la partició final de les dades (o partició de consens). Com més semblant sigui aquesta a la partició de màxima qualitat oferida pels sistemes de classificació subjectes a combinació, major serà el grau de robustesa assolit respecte les indeterminacions inherents a la classificació no supervisada.

Nogensmenys, la combinació indiscriminada de classificadors no supervisats planteja dues dificultats principals, que són i) l'increment de la complexitat computacional del procés de combinació, fins al punt que la seva execució pot esdevenir inviable si el nombre de sistemes a combinar és excessiu, i ii) l'obtenció de particions de consens de baixa qualitat deguda a la inclusió de sistemes de classificació pobres. Amb l'objectiu de lluitar contra aquests problemes, aquesta tesi introdueix les arquitectures de consens jeràrquiques autorefinables com a via per a l'obtenció de particions de consens de bona qualitat amb baix cost computacional, tal i com confirmen els nombrosos experiments realitzats.

Amb la intenció d'exportar aquesta estratègia de classificació no supervisada robusta a un marc generalista, es proposa un conjunt de funcions de consens basades en votació per a la combinació de classificadors difusos. Diversos experiments demostren que les seves prestacions són comparables o superiors a bona part de l'estat de l'art.

Les nostres propostes són aplicables de forma natural a la classificació robusta de dades multimodals (un problema d'interès ben actual donada la ubiqüitat de la multimèdia), ja que l'existència de múltiples modalitats planteja indeterminacions addicionals que dificulten l'obtenció de particions robustes. La base de la nostra proposta és la creació de conjunts de particions multimodals, el que permet l'ús natural i simultani de tècniques de fusió de modalitats avançada i retardada, donant peu a una aproximació genèrica i eficient a la classificació multimèdia, els resultats de la qual s'analitzen al llarg de múltiples experiments.


Resumen

Al segmentar de forma no supervisada una colección de datos, el usuario debe tomar múltiples decisiones (qué algoritmo aplicar, cómo representar los objetos, en cuántos grupos agrupar éstos, entre otras) que condicionan, en gran medida, la calidad de la partición resultante. Desgraciadamente, la naturaleza no supervisada del problema hace difícil (cuando no imposible) tomar estas decisiones de manera fundamentada, a no ser que se disponga de cierto conocimiento del dominio.

En un intento por combatir estas incertidumbres, esta tesis propone una aproximación al problema que minimiza, intencionadamente, la toma de decisiones por parte del usuario. Al contrario, se alienta el uso de tantos sistemas de clasificación no supervisada como sea posible, combinándolos con el fin de obtener la partición final de los datos (o partición de consenso). Cuanto más similar sea ésta a la partición de máxima calidad ofrecida por los sistemas de clasificación sujetos a combinación, mayor será el grado de robustez respecto a las indeterminaciones inherentes a la clasificación no supervisada.

No obstante, la combinación indiscriminada de clasificadores no supervisados plantea dos dificultades principales, que son i) el incremento de la complejidad computacional del proceso de combinación, hasta el punto de que su ejecución puede ser inviable si el número de sistemas a combinar es excesivo, y ii) la obtención de particiones de consenso de baja calidad debida a la inclusión de sistemas de clasificación pobres. Con el objetivo de luchar contra estos problemas, esta tesis introduce las arquitecturas de consenso jerárquicas autorefinables como vía para la obtención de particiones de consenso de buena calidad con bajo coste computacional, tal como confirman los numerosos experimentos realizados.

Con la intención de exportar esta estrategia de clasificación no supervisada robusta a un marco generalista, se propone un conjunto de funciones de consenso basadas en votación para la combinación de clasificadores difusos. Diversos experimentos demuestran que sus prestaciones son comparables o superiores a buena parte del estado del arte.

Nuestras propuestas son aplicables de forma natural a la clasificación robusta de datos multimodales (un problema de interés actual dada la ubicuidad de la multimedia), ya que la existencia de múltiples modalidades plantea indeterminaciones adicionales que dificultan la obtención de particiones robustas. La base de nuestra propuesta es la creación de conjuntos de particiones multimodales, lo que permite el uso natural y simultáneo de técnicas de fusión de modalidades temprana y tardía, dando pie a una aproximación genérica y eficiente a la clasificación multimedia, cuyos resultados se analizan a lo largo de múltiples experimentos.


Abstract

When facing the task of partitioning a data collection in an unsupervised fashion, the clustering practitioner must make several crucial decisions (which clustering algorithm to apply, how the objects in the data set are represented, how many clusters are to be found, among others) that condition, to a large extent, the quality of the resulting partition. However, the unsupervised nature of the clustering problem makes it difficult (if not impossible) to make well-founded decisions unless domain knowledge is available.

In an attempt to fight these indeterminacies, we propose an approach to the clustering problem that intentionally reduces user decision making as much as possible. Quite the contrary, the clustering practitioner is encouraged to simultaneously employ as many clustering systems as possible (compiling their outcomes into a cluster ensemble) and to combine them in order to obtain the final partition (or consensus clustering). The greater the similarity between the highest quality cluster ensemble component and the consensus clustering, the greater the degree of robustness achieved against the inherent indeterminacies of clustering.

However, the indiscriminate creation of cluster ensemble components poses two main challenges to the clustering combination process, namely i) an increase in its computational complexity, to the point that the creation of the consensus clustering can become unfeasible if the number of clustering systems combined is too large, and ii) the risk of obtaining a low quality consensus partition due to the inclusion of poor clustering systems in the cluster ensemble. To overcome these drawbacks, this thesis introduces hierarchical self-refining consensus architectures as a means for obtaining good quality partitions at a reduced computational cost, as confirmed by extensive experimental evaluation.
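To make the hierarchical idea more concrete, the following Python sketch shows one minimal way of combining a cluster ensemble stage by stage: the ensemble is split into mini-ensembles of size b, each mini-ensemble is merged by a simple co-association consensus function, and the resulting partitions feed the next stage until a single consensus clustering remains. This is only an illustrative sketch under our own assumptions (the co-association consensus function and the helper names coassociation_consensus and hierarchical_consensus are ours); it is not the exact RHCA/DHCA formulation evaluated in the thesis.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def coassociation_consensus(ensemble, k):
    """Consensus of a list of label vectors via the object co-association matrix."""
    ensemble = np.asarray(ensemble)            # shape (l, n): l clusterings of n objects
    l, n = ensemble.shape
    coassoc = np.zeros((n, n))
    for labels in ensemble:
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= l                               # fraction of clusterings co-clustering each pair
    dist = 1.0 - coassoc                       # turn the similarity into a distance
    condensed = dist[np.triu_indices(n, 1)]    # condensed form expected by linkage
    Z = linkage(condensed, method='average')
    return fcluster(Z, t=k, criterion='maxclust') - 1   # labels 0..k-1

def hierarchical_consensus(ensemble, k, b=5):
    """Merge the ensemble in mini-ensembles of size b, stage by stage."""
    assert b >= 2, "mini-ensembles must contain at least two components"
    partitions = list(ensemble)
    while len(partitions) > 1:
        partitions = [coassociation_consensus(partitions[i:i + b], k)
                      for i in range(0, len(partitions), b)]
    return partitions[0]

With an ensemble of, say, 30 base clusterings and b = 5, the first stage runs six small consensus processes and the second stage a single one, instead of one large consensus over all 30 components at once.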

Aiming to port this robust clustering strategy to a more generic framework, a set of voting based consensus functions for the combination of fuzzy clustering systems is proposed. Several experiments demonstrate that the quality of the consensus clusterings they yield is comparable to or better than that of multiple state-of-the-art soft consensus functions.
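As a rough illustration of the voting principle (not the specific consensus functions proposed in this thesis), the sketch below aligns the cluster labels of each soft clustering with those of a reference component using the Hungarian algorithm, accumulates the aligned membership matrices as votes, and assigns each object to the cluster receiving the largest total vote. It assumes every component has the same number of clusters; the function name voting_soft_consensus is ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def voting_soft_consensus(soft_ensemble):
    """soft_ensemble: list of (n, k) membership matrices (rows sum to 1).
    Returns a hard consensus labelling of the n objects (values 0..k-1)."""
    reference = soft_ensemble[0]
    votes = np.zeros_like(reference, dtype=float)
    for membership in soft_ensemble:
        # Cluster label correspondence: maximize the overlap with the reference
        # clustering's columns (solved as an assignment problem).
        overlap = reference.T @ membership                 # (k, k) overlap matrix
        ref_idx, comp_idx = linear_sum_assignment(-overlap)
        aligned = np.zeros_like(membership, dtype=float)
        aligned[:, ref_idx] = membership[:, comp_idx]      # relabel this component
        votes += aligned                                   # accumulate the soft votes
    return votes.argmax(axis=1)                            # plurality voting per object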

Our proposals find a natural field of application in the robust clustering of multimodal data (a problem of current interest due to the growing ubiquity of multimedia), as the existence of multiple data modalities poses additional indeterminacies that make robust clustering results harder to obtain. The basis of our proposal is the creation of multimodal cluster ensembles, which naturally allows the simultaneous use of early and late modality fusion techniques, thus providing a highly generic and efficient approach to multimedia clustering, the performance of which is analyzed in multiple experiments.
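The multimodal cluster ensemble construction can be pictured as follows: late fusion clusters each modality separately, early fusion clusters the concatenation of all modality features, and both kinds of partitions are pooled into a single ensemble that any consensus function (such as the sketches above) can then combine. The snippet below is only meant to convey that construction; the base clusterer, the number of runs and the helper name multimodal_cluster_ensemble are illustrative choices of ours, not the thesis's experimental setup.

import numpy as np
from sklearn.cluster import KMeans

def multimodal_cluster_ensemble(modalities, k, runs_per_view=3, seed=0):
    """modalities: list of (n, d_m) feature matrices describing the same n objects.
    Pools late-fusion partitions (one modality at a time) and early-fusion
    partitions (concatenated feature space) into a single cluster ensemble."""
    rng = np.random.RandomState(seed)
    ensemble = []
    # Late fusion: cluster each modality on its own.
    for X in modalities:
        for _ in range(runs_per_view):
            km = KMeans(n_clusters=k, n_init=10, random_state=rng.randint(10**6))
            ensemble.append(km.fit_predict(X))
    # Early fusion: cluster the concatenation of all modality features.
    X_early = np.hstack(modalities)
    for _ in range(runs_per_view):
        km = KMeans(n_clusters=k, n_init=10, random_state=rng.randint(10**6))
        ensemble.append(km.fit_predict(X_early))
    return ensemble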



Contents

Resum
Resumen
Abstract
List of tables
List of figures
List of algorithms
List of symbols

1 Framework of the thesis
1.1 Knowledge discovery and data mining
1.2 Clustering in knowledge discovery and data mining
1.2.1 Overview of clustering methods
1.2.2 Evaluation of clustering processes
1.3 Multimodal clustering
1.4 Clustering indeterminacies
1.5 Motivation and contributions of the thesis

2 Cluster ensembles and consensus clustering
2.1 Related work on cluster ensembles
2.2 Related work on consensus functions
2.2.1 Consensus functions based on voting
2.2.2 Consensus functions based on graph partitioning
2.2.3 Consensus functions based on object co-association measures
2.2.4 Consensus functions based on categorical clustering
2.2.5 Consensus functions based on probabilistic approaches
2.2.6 Consensus functions based on reinforcement learning
2.2.7 Consensus functions based on interpreting object similarity as data
2.2.8 Consensus functions based on cluster centroids
2.2.9 Consensus functions based on correlation clustering
2.2.10 Consensus functions based on search techniques
2.2.11 Consensus functions based on cluster ensemble component selection
2.2.12 Other interesting works on consensus clustering

3 Hierarchical consensus architectures
3.1 Motivation
3.2 Random hierarchical consensus architectures
3.2.1 Rationale and definition
3.2.2 Computational complexity
3.2.3 Running time minimization
3.2.4 Experiments
3.3 Deterministic hierarchical consensus architectures
3.3.1 Rationale and definition
3.3.2 Computational complexity
3.3.3 Running time minimization
3.3.4 Experiments
3.4 Flat vs. hierarchical consensus
3.4.1 Running time comparison
3.4.2 Consensus quality comparison
3.5 Discussion
3.6 Related publications

4 Self-refining consensus architectures
4.1 Description of the consensus self-refining procedure
4.2 Flat vs. hierarchical self-refining
4.2.1 Evaluation of the consensus-based self-refining process
4.2.2 Evaluation of the supraconsensus process
4.3 Selection-based self-refining
4.3.1 Evaluation of the selection-based self-refining process
4.3.2 Evaluation of the supraconsensus process
4.4 Discussion
4.5 Related publications

5 Multimedia clustering based on cluster ensembles
5.1 Generation of multimodal cluster ensembles
5.2 Self-refining multimodal consensus architecture
5.3 Multimodal consensus clustering results
5.3.1 Consensus clustering per modality and across modalities
5.3.2 Self-refined consensus clustering across modalities
5.4 Discussion
5.5 Related publications

6 Voting based consensus functions for soft cluster ensembles
6.1 Soft cluster ensembles
6.2 Adapting consensus functions to soft cluster ensembles
6.3 Voting based consensus functions
6.3.1 Cluster disambiguation
6.3.2 Voting strategies
6.4 Experiments
6.5 Discussion
6.6 Related publications

7 Conclusions
7.1 Hierarchical consensus architectures
7.2 Consensus self-refining procedures
7.3 Multimedia clustering based on cluster ensembles
7.4 Voting based soft consensus functions

References

Appendices

A Experimental setup
A.1 The CLUTO clustering package
A.2 Data sets
A.2.1 Unimodal data sets
A.2.2 Multimodal data sets
A.3 Data representations
A.3.1 Unimodal data representations
A.3.2 Multimodal data representations
A.4 Cluster ensembles
A.5 Consensus functions
A.6 Computational resources

B Experiments on clustering indeterminacies
B.1 Clustering indeterminacies in unimodal data sets
B.1.1 Zoo data set
B.1.2 Iris data set
B.1.3 Wine data set
B.1.4 Glass data set
B.1.5 Ionosphere data set
B.1.6 WDBC data set
B.1.7 Balance data set
B.1.8 MFeat data set
B.1.9 miniNG data set
B.1.10 Segmentation data set
B.1.11 BBC data set
B.1.12 PenDigits data set
B.1.13 Summary
B.2 Clustering indeterminacies in multimodal data sets
B.2.1 CAL500 data set
B.2.2 Corel data set
B.2.3 InternetAds data set
B.2.4 IsoLetters data set
B.2.5 Summary

C Experiments on hierarchical consensus architectures
C.1 Configuration of a random hierarchical consensus architecture
C.2 Estimation of the computationally optimal RHCA
C.2.1 Iris data set
C.2.2 Wine data set
C.2.3 Glass data set
C.2.4 Ionosphere data set
C.2.5 WDBC data set
C.2.6 Balance data set
C.2.7 MFeat data set
C.3 Estimation of the computationally optimal DHCA
C.3.1 Iris data set
C.3.2 Wine data set
C.3.3 Glass data set
C.3.4 Ionosphere data set
C.3.5 WDBC data set
C.3.6 Balance data set
C.3.7 MFeat data set
C.4 Computationally optimal RHCA, DHCA and flat consensus comparison
C.4.1 Iris data set
C.4.2 Wine data set
C.4.3 Glass data set
C.4.4 Ionosphere data set
C.4.5 WDBC data set
C.4.6 Balance data set
C.4.7 MFeat data set
C.4.8 miniNG data set
C.4.9 Segmentation data set
C.4.10 BBC data set
C.4.11 PenDigits data set

D Experiments on self-refining consensus architectures
D.1 Experiments on consensus-based self-refining
D.1.1 Iris data set
D.1.2 Wine data set
D.1.3 Glass data set
D.1.4 Ionosphere data set
D.1.5 WDBC data set
D.1.6 Balance data set
D.1.7 MFeat data set
D.1.8 miniNG data set
D.1.9 Segmentation data set
D.1.10 BBC data set
D.1.11 PenDigits data set
D.2 Experiments on selection-based self-refining
D.2.1 Iris data set
D.2.2 Wine data set
D.2.3 Glass data set
D.2.4 Ionosphere data set
D.2.5 WDBC data set
D.2.6 Balance data set
D.2.7 MFeat data set
D.2.8 miniNG data set
D.2.9 Segmentation data set
D.2.10 BBC data set
D.2.11 PenDigits data set

E Experiments on multimodal consensus clustering
E.1 CAL500 data set
E.1.1 Consensus clustering per modality and across modalities
E.1.2 Self-refined consensus clustering across modalities
E.2 InternetAds data set
E.2.1 Consensus clustering per modality and across modalities
E.2.2 Self-refined consensus clustering across modalities
E.3 Corel data set
E.3.1 Consensus clustering per modality and across modalities
E.3.2 Self-refined consensus clustering across modalities

F Experiments on soft consensus clustering
F.1 Iris data set
F.2 Wine data set
F.3 Glass data set
F.4 Ionosphere data set
F.5 WDBC data set
F.6 Balance data set
F.7 MFeat data set
F.8 miniNG data set
F.9 Segmentation data set
F.10 BBC data set
F.11 PenDigits data set


List of Tables

1.1 Illustration of the clustering algorithm indeterminacy on the BBC and PenDigits data sets clustered by the direct-cos-i2 and graph-cos-i2 algorithms
1.2 Illustration of the clustering indeterminacies on the CAL500, Corel, InternetAds and IsoLetters multimodal data sets

2.1 Taxonomy of consensus functions according to their theoretical basis

3.1 Number of inner loop iterations as a function of the outer loop's index i
3.2 Methodology for estimating the running time of multiple RHCA variants
3.3 Evaluation of the minimum complexity RHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions
3.4 Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully serial implementation
3.5 Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully parallel implementation
3.6 Methodology for estimating the running time of multiple DHCA variants
3.7 Evaluation of the minimum complexity DHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions
3.8 Evaluation of the minimum complexity serial DHCA variant prediction based on decreasing diversity factor ordering in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions
3.9 Running time differences between the most and least computationally efficient DHCA variants in both the serial and parallel implementations
3.10 Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully serial implementation
3.11 Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully parallel implementation
3.12 Percentage of experiments in which the consensus clustering solution is better than the median cluster ensemble component
3.13 Relative percentage φ(NMI) gain between the consensus clustering solution and the median cluster ensemble component
3.14 Percentage of experiments in which the consensus clustering solution is better than the best cluster ensemble component
3.15 Relative percentage φ(NMI) gain between the consensus clustering solution and the best cluster ensemble component

4.1 Methodology of the consensus self-refining procedure
4.2 Percentage of self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart
4.3 Relative φ(NMI) gain percentage between the top quality self-refined consensus clustering solutions with respect to its non-refined counterpart
4.4 Percentage of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component
4.5 Relative percentage φ(NMI) gain between the best (non-refined or self-refined) consensus clustering solution and the best cluster ensemble component
4.6 Percentage of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component
4.7 Relative percentage φ(NMI) gain between the best (non-refined or self-refined) consensus clustering solution and the median cluster ensemble component
4.8 φ(NMI) variance of the non-refined and the best non/self-refined consensus clustering solutions across the flat, RHCA and DHCA consensus architectures
4.9 Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution
4.10 Relative percentage φ(NMI) losses due to suboptimal self-refined consensus clustering solution selection by supraconsensus
4.11 Methodology of the cluster ensemble component selection-based consensus self-refining procedure
4.12 Percentage of self-refining experiments in which one of the self-refined consensus clustering solutions is better than the selected cluster ensemble component reference λref
4.13 Relative φ(NMI) gain percentage between the top quality self-refined consensus clustering solutions with respect to the maximum φ(ANMI) cluster ensemble component
4.14 Percentage of experiments where either the top quality self-refined consensus clustering solution or λref better the best cluster ensemble component, and relative φ(NMI) gain percentage with respect to it
4.15 Percentage of experiments where either the top quality self-refined consensus clustering solution or λref better the median cluster ensemble component, and relative φ(NMI) gain percentage with respect to it
4.16 Percentage of experiments in which the supraconsensus function selects the top quality clustering solution, and relative percentage φ(NMI) losses between the top quality clustering solution and the one selected by supraconsensus, averaged across the twelve data collections

5.1 Range and cardinality of the dimensional diversity factor dfD per modality for each one of the four multimedia data sets
5.2 Percentage of cluster ensemble components that attain a higher φ(NMI) than the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions
5.3 Evaluation of the unimodal and multimodal consensus clusterings with respect to the best cluster ensemble component, across the four multimedia data collections and the seven consensus functions
5.4 Evaluation of the unimodal and multimodal consensus clusterings with respect to the median cluster ensemble component, across the four multimedia data collections and the seven consensus functions
5.5 Evaluation of the multimodal consensus clusterings with respect to their unimodal counterparts, across the four multimedia data collections and the seven consensus functions
5.6 Evaluation of the intermodal consensus clustering with respect to the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions
5.7 Percentage of multimodal self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart
5.8 Relative φ(NMI) gain percentage between the top quality self-refined consensus clustering solutions with respect to its non-refined counterpart
5.9 Percentage of the cluster ensemble components that attain a higher φ(NMI) score than the top quality self-refined consensus clustering solution
5.10 Percentage of experiments in which the best (either non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component
5.11 Relative φ(NMI) percentage difference between the top quality (either non-refined or self-refined) consensus clustering solution with respect to the best ensemble component
5.12 Percentage of experiments in which the top quality (either non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component
5.13 Relative φ(NMI) percentage difference between the top quality (either non-refined or self-refined) consensus clustering solution with respect to the median ensemble component
5.14 Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution
5.15 Relative φ(NMI) percentage differences between the best and median components of the cluster ensemble and the consensus clustering λ_c^final selected by supraconsensus, across the four multimedia data collections

6.1 Soft cluster ensemble sizes of the unimodal data sets
6.2 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Zoo data set
6.3 Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) yield better/equivalent/worse consensus clustering solutions than the four proposed consensus functions (BC, CC, PC and SC)
6.4 Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) are executed faster/equivalent/slower than the four proposed consensus functions (BC, CC, PC and SC)

A.1 Cross-option table indicating the clustering strategy-criterion function-similarity measure combinations available in CLUTO
A.2 Summary of the unimodal data sets employed in the experiments
A.3 Summary of the multimodal data sets employed in the experiments
A.4 Cluster ensemble sizes corresponding to distinct algorithmic diversity configurations for the unimodal data sets
A.5 Cluster ensemble sizes corresponding to distinct algorithmic diversity configurations for the multimodal data sets

B.1 Number of individual clusterings per data representation on each unimodal data set
B.2 Top clustering results of each clustering algorithm family sorted from highest to lowest φ(NMI) on the unimodal collections
B.3 Number of individual clusterings per data representation on each multimodal data set
B.4 Top clustering results of each clustering algorithm family sorted from highest to lowest φ(NMI) on the multimodal collections

C.1 Examples of computation of the number of stages s of a RHCA with l = 7, 8 and 9 and b = 2
C.2 Examples of computation of the number of consensus per stage (Ki) of a RHCA with l = 7, 8 and 9 and b = 2
C.3 Examples of computation of the mini-ensembles size of a RHCA with l = 7, 8 and 9 and b = 2
C.4 Configuration of RHCA topologies on a cluster ensemble of size l = 30 with varying mini-ensembles sizes

F.1 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Iris data set
F.2 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Wine data set
F.3 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Glass data set
F.4 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Ionosphere data set
F.5 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the WDBC data set
F.6 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Balance data set
F.7 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the MFeat data set
F.8 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the miniNG data set
F.9 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Segmentation data set
F.10 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the BBC data set
F.11 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the PenDigits data set


List of Figures<br />

1.1 Evolution of the total number of websites across all Internet domains, from<br />

November 1995 to February 2009 . . . . . . . . . . . . . . . . . . . . . . . . 2<br />

1.2 Schematic diagram of the steps involved in a knowledge discovery process . 4<br />

1.3 Taxonomy of data mining methods . . . . . . . . . . . . . . . . . . . . . . . 6<br />

1.4 Toy example of a hierarchical clustering dendrogram . . . . . . . . . . . . . 10<br />

1.5 Illustration of the data representation indeterminacy on the Wine and miniNG<br />

data sets clustered by the rbr-corr-e1 algorithm. . . . . . . . . . . . . . 21<br />

1.6 Block diagram of the robust multimodal clustering system based on selfrefining<br />

hierarchical consensus architectures . . . . . . . . . . . . . . . . . . 26<br />

2.1 Scatterplot of an artificially generated two-dimensional toy data set containing<br />

n = 9 objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />

2.2 Schematic representation of a consensus clustering process on a hard cluster<br />

ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />

3.1 Flat vs hierarchical construction of a consensus clustering solution on a hard<br />

cluster ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />

3.2 Three examples of topologies of random hierarchical consensus architectures 52<br />

3.3 Evolution of RHCA parameters as a function of the mini-ensembles size b . 54<br />

3.4 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 1 diversity scenario . . . . 58<br />

3.5 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 10 diversity scenario . . . . 59<br />

3.6 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 10 diversity scenario . . . . 61<br />

3.7 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 28 diversity scenario . . . . 62<br />

3.8 Evolution of the accuracy of RHCA running time estimation as a function of<br />

the number of consensus processes . . . . . . . . . . . . . . . . . . . . . . . 65<br />

3.9 An example of a deterministic hierarchical consensus architecture . . . . . . 71<br />

3.10 Estimated and real running times of the serial and parallel dHCA implementations<br />

on the Zoo data collection in the |dfA| = 1 diversity scenario . . . . 77<br />

xxiii


List of Figures<br />

3.11 Estimated and real running times of the serial and parallel dHCA implementations<br />

on the Zoo data collection in the |dfA| = 10 diversity scenario . . . .<br />

3.12 Estimated and real running times of the serial and parallel dHCA implemen-<br />

78<br />

tations on the Zoo data collection in the |dfA| = 19 diversity scenario . . . .<br />

3.13 Estimated and real running times of the serial and parallel dHCA implemen-<br />

79<br />

tations on the Zoo data collection in the |dfA| = 28 diversity scenario . . . .<br />

3.14 Evolution of the accuracy of DHCA running time estimation as a function of<br />

80<br />

the number of consensus processes . . . . . . . . . . . . . . . . . . . . . . . 82<br />

3.15 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

corresponding to a cluster ensemble of size l = 57 . . . . . . . . . . . . . . .<br />

3.16 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

89<br />

corresponding to a cluster ensemble of size l = 570 . . . . . . . . . . . . . . 91<br />

3.17 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

corresponding to a cluster ensemble of size l = 1083 . . . . . . . . . . . . . 92<br />

3.18 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

corresponding to a cluster ensemble of size l = 1596 . . . . . . . . . . . . . 93<br />

3.19 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures across all data collections for all the diversity scenarios 95<br />

3.20 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures across all data collections for all the diversity<br />

scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .<br />

3.21 φ<br />

97<br />

(NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l =57<br />

3.22 φ<br />

99<br />

(NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l = 570<br />

3.23 φ<br />

100<br />

(NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l = 1083 100<br />

3.24 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l = 1596 101<br />

3.25 φ (NMI) of the consensus solutions obtained by the computationally optimal<br />

parallel RHCA, DHCA and flat consensus architectures across all data collections<br />

for all the diversity scenarios . . . . . . . . . . . . . . . . . . . . . . 103<br />

4.1 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Zoo<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />

xxiv


List of Figures<br />

4.2 Decreasingly ordered φ (NMI) (wrt ground truth) values of the 300 clusterings<br />

included in the toy cluster ensemble (left), and their corresponding φ (ANMI)<br />

values (wrt the toy cluster ensemble) (right) . . . . . . . . . . . . . . . . . . 121<br />

4.3 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Zoo data collection . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />

5.1 Block diagram of the proposed multimodal consensus clustering system . . 134<br />

5.2 An example of a deterministic hierarchical consensus architecture DRM variant139<br />

5.3 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the IsoLetters data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142<br />

5.4 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the IsoLetters data set 143<br />

5.5 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the IsoLetters data set 143<br />

5.6 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the IsoLetters data set . . 144<br />

5.7 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the IsoLetters data set . . . . . . . 153<br />

5.8 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . 153<br />

5.9 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . 154<br />

5.10 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . . . 155<br />

6.1 Scatterplot of an artificially generated two-dimensional data set containing<br />

n = 9 objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164<br />

6.2 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Zoo data collection . . . . . . . . . . . . . . . . . . . . 186<br />

B.1 φ (NMI) histograms on the Zoo data set . . . . . . . . . . . . . . . . . . . . . 234<br />

B.2 φ (NMI) histograms on the Iris data set . . . . . . . . . . . . . . . . . . . . . 235<br />

B.3 φ (NMI) histograms on the Wine data set . . . . . . . . . . . . . . . . . . . . 236<br />

B.4 φ (NMI) histograms on the Glass data set . . . . . . . . . . . . . . . . . . . . 236<br />

B.5 φ (NMI) histograms on the Ionosphere data set . . . . . . . . . . . . . . . . . 237<br />

B.6 φ (NMI) histograms on the WDBC data set . . . . . . . . . . . . . . . . . . . 237<br />

B.7 φ (NMI) histograms on the Balance data set . . . . . . . . . . . . . . . . . . . 238<br />

B.8 φ (NMI) histograms on the MFeat data set . . . . . . . . . . . . . . . . . . . 239<br />

B.9 φ (NMI) histograms on the miniNG data set . . . . . . . . . . . . . . . . . . . 239<br />

B.10 φ (NMI) histograms on the Segmentation data set . . . . . . . . . . . . . . . . 240<br />

xxv


List of Figures<br />

B.11 φ (NMI) histograms on the BBC data set . . . . . . . . . . . . . . . . . . . . 240<br />

B.12 φ (NMI) histograms on the PenDigits data set . . . . . . . . . . . . . . . . . . 240<br />

B.13 φ (NMI) histograms on the CAL500 data set . . . . . . . . . . . . . . . . . . 244<br />

B.14 φ (NMI) histograms on the Corel data set . . . . . . . . . . . . . . . . . . . . 245<br />

B.15 φ (NMI) histograms on the InternetAds data set . . . . . . . . . . . . . . . . 246<br />

B.16 φ (NMI) histograms on the IsoLetters data set . . . . . . . . . . . . . . . . . . 247<br />

C.1 Estimated and real running times of the serial RHCA implementation on the<br />

Iris data collection in the four diversity scenarios . . . . . . . . . . . . . . . 254<br />

C.2 Estimated and real running times of the parallel RHCA implementation on<br />

the Iris data collection in the four diversity scenarios . . . . . . . . . . . . . 255<br />

C.3 Estimated and real running times of the serial RHCA implementation on the<br />

Wine data collection in the four diversity scenarios . . . . . . . . . . . . . . 257<br />

C.4 Estimated and real running times of the parallel RHCA implementation on<br />

the Wine data collection in the four diversity scenarios . . . . . . . . . . . . 258<br />

C.5 Estimated and real running times of the serial RHCA implementation on the<br />

Glass data collection in the four diversity scenarios . . . . . . . . . . . . . . 259<br />

C.6 Estimated and real running times of the parallel RHCA implementation on<br />

the Glass data collection in the four diversity scenarios . . . . . . . . . . . . 260<br />

C.7 Estimated and real running times of the serial RHCA implementation on the<br />

Ionosphere data collection in the four diversity scenarios . . . . . . . . . . . 262<br />

C.8 Estimated and real running times of the parallel RHCA implementation on<br />

the Ionosphere data collection in the four diversity scenarios . . . . . . . . . 263<br />

C.9 Estimated and real running times of the serial RHCA implementation on the<br />

WDBC data collection in the four diversity scenarios . . . . . . . . . . . . . 264<br />

C.10 Estimated and real running times of the parallel RHCA implementation on<br />

the WDBC data collection in the four diversity scenarios . . . . . . . . . . . 265<br />

C.11 Estimated and real running times of the serial RHCA implementation on the<br />

Balance data collection in the four diversity scenarios . . . . . . . . . . . . 267<br />

C.12 Estimated and real running times of the parallel RHCA implementation on<br />

the Balance data collection in the four diversity scenarios . . . . . . . . . . 268<br />

C.13 Estimated and real running times of the serial RHCA implementation on the<br />

Mfeat data collection in the four diversity scenarios . . . . . . . . . . . . . . 269<br />

C.14 Estimated and real running times of the parallel RHCA implementation on<br />

the Mfeat data collection in the four diversity scenarios . . . . . . . . . . . 270<br />

C.15 Estimated and real running times of the serial DHCA implementation on the<br />

Iris data collection in the four diversity scenarios . . . . . . . . . . . . . . . 274<br />

C.16 Estimated and real running times of the parallel DHCA implementation on<br />

the Iris data collection in the four diversity scenarios . . . . . . . . . . . . . 275<br />

C.17 Estimated and real running times of the serial DHCA implementation on the<br />

Wine data collection in the four diversity scenarios . . . . . . . . . . . . . . 276<br />


C.18 Estimated and real running times of the parallel DHCA implementation on<br />

the Wine data collection in the four diversity scenarios . . . . . . . . . . . . 277<br />

C.19 Estimated and real running times of the serial DHCA implementation on the<br />

Glass data collection in the four diversity scenarios . . . . . . . . . . . . . . 279<br />

C.20 Estimated and real running times of the parallel DHCA implementation on<br />

the Glass data collection in the four diversity scenarios . . . . . . . . . . . . 280<br />

C.21 Estimated and real running times of the serial DHCA implementation on the<br />

Ionosphere data collection in the four diversity scenarios . . . . . . . . . . . 281<br />

C.22 Estimated and real running times of the parallel DHCA implementation on<br />

the Ionosphere data collection in the four diversity scenarios . . . . . . . . . 282<br />

C.23 Estimated and real running times of the serial DHCA implementation on the<br />

WDBC data collection in the four diversity scenarios . . . . . . . . . . . . . 284<br />

C.24 Estimated and real running times of the parallel DHCA implementation on<br />

the WDBC data collection in the four diversity scenarios . . . . . . . . . . . 285<br />

C.25 Estimated and real running times of the serial DHCA implementation on the<br />

Balance data collection in the four diversity scenarios . . . . . . . . . . . . 286<br />

C.26 Estimated and real running times of the parallel DHCA implementation on<br />

the Balance data collection in the four diversity scenarios . . . . . . . . . . 287<br />

C.27 Estimated and real running times of the serial DHCA implementation on the<br />

Mfeat data collection in the four diversity scenarios . . . . . . . . . . . . . . 288<br />

C.28 Estimated and real running times of the parallel DHCA implementation on<br />

the Mfeat data collection in the four diversity scenarios . . . . . . . . . . . 289<br />

C.29 Running times of the computationally optimal serial RHCA, DHCA and<br />

flat consensus architectures on the Iris data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 292<br />

C.30 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Iris data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 293<br />

C.31 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Iris data collection in<br />

the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . 294<br />

C.32 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Wine data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 296<br />

C.33 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Wine data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 297<br />

C.34 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Wine data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 298<br />


C.35 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Glass data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 299<br />

C.36 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Glass data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 300<br />

C.37 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Glass data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 302<br />

C.38 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Ionosphere data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 303<br />

C.39 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Ionosphere data collection in the four<br />

diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 304<br />

C.40 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Ionosphere data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . 305<br />

C.41 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the WDBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 307<br />

C.42 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the WDBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 308<br />

C.43 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the WDBC data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 309<br />

C.44 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Balance data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 311<br />

C.45 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Balance data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . 312<br />

C.46 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Balance data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . 313<br />

C.47 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the MFeat data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 315<br />

C.48 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the MFeat data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 316<br />


C.49 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the MFeat data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 317<br />

C.50 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the miniNG data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 319<br />

C.51 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the miniNG data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . 320<br />

C.52 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the miniNG data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . 321<br />

C.53 Running times of the computationally optimal serial RHCA, DHCA and<br />

flat consensus architectures on the Segmentation data collection in the four<br />

diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 322<br />

C.54 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Segmentation data collection in the four<br />

diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 323<br />

C.55 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Segmentation data<br />

collection in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . 325<br />

C.56 Running times of the computationally optimal serial RHCA, DHCA and<br />

flat consensus architectures on the BBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 326<br />

C.57 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the BBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 327<br />

C.58 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the BBC data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 328<br />

C.59 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the PenDigits data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 330<br />

C.60 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the PenDigits data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . 331<br />

C.61 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the PenDigits data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . 332<br />

D.1 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Iris<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335<br />


D.2 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Wine<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336<br />

D.3 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Glass<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338<br />

D.4 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Ionosphere<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339<br />

D.5 φ (NMI) boxplots of the self-refined consensus clustering solutions on the WDBC<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340<br />

D.6 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Balance<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341<br />

D.7 φ (NMI) boxplots of the self-refined consensus clustering solutions on the MFeat<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343<br />

D.8 φ (NMI) boxplots of the self-refined consensus clustering solutions on the miniNG<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344<br />

D.9 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Segmentation<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345<br />

D.10 φ (NMI) boxplots of the self-refined consensus clustering solutions on the BBC<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347<br />

D.11 φ (NMI) boxplots of the self-refined consensus clustering solutions on the PenDigits<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348<br />

D.12 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Iris data collection . . . . . . . . . . . . . . . . . . . . . . . . . 349<br />

D.13 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Wine data collection . . . . . . . . . . . . . . . . . . . . . . . . 350<br />

D.14 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Glass data collection . . . . . . . . . . . . . . . . . . . . . . . . 351<br />

D.15 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Ionosphere data collection . . . . . . . . . . . . . . . . . . . . . 352<br />

D.16 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the WDBC data collection . . . . . . . . . . . . . . . . . . . . . . . 353<br />

D.17 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Balance data collection . . . . . . . . . . . . . . . . . . . . . . 354<br />

D.18 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the MFeat data collection . . . . . . . . . . . . . . . . . . . . . . . 355<br />

D.19 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the miniNG data collection . . . . . . . . . . . . . . . . . . . . . . 356<br />

D.20 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Segmentation data collection . . . . . . . . . . . . . . . . . . . 357<br />

D.21 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the BBC data collection . . . . . . . . . . . . . . . . . . . . . . . . 358<br />


D.22 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the PenDigits data collection . . . . . . . . . . . . . . . . . . . . . 358<br />

E.1 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the CAL500 data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360<br />

E.2 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the CAL500 data set . 361<br />

E.3 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the CAL500 data set . 361<br />

E.4 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the CAL500 data set . . . 361<br />

E.5 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the CAL500 data set . . . . . . . . 362<br />

E.6 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . 363<br />

E.7 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . 363<br />

E.8 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . . . 364<br />

E.9 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the InternetAds data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365<br />

E.10 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the InternetAds data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365<br />

E.11 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the InternetAds data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366<br />

E.12 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the InternetAds data set . 366<br />

E.13 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the InternetAds data set . . . . . . 367<br />

E.14 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the InternetAds data set . . . . . . . . 367<br />

E.15 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the InternetAds data set . . . . . . . . 368<br />

E.16 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the InternetAds data set . . . . . . . . . . 368<br />

E.17 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the Corel data set 369<br />


E.18 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the Corel data set . . . 369<br />

E.19 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the Corel data set . . . 369<br />

E.20 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the Corel data set . . . . . 370<br />

E.21 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the Corel data set . . . . . . . . . . 370<br />

E.22 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . 371<br />

E.23 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . 371<br />

E.24 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . . . 372<br />

F.1 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Iris data collection . . . . . . . . . . . . . . . . . . . . 374<br />

F.2 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Wine data collection . . . . . . . . . . . . . . . . . . . 375<br />

F.3 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Glass data collection . . . . . . . . . . . . . . . . . . . 376<br />

F.4 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Ionosphere data collection . . . . . . . . . . . . . . . . 378<br />

F.5 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the WDBC data collection . . . . . . . . . . . . . . . . . . 379<br />

F.6 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Balance data collection . . . . . . . . . . . . . . . . . . 380<br />

F.7 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the MFeat data collection . . . . . . . . . . . . . . . . . . . 381<br />

F.8 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the miniNG data collection . . . . . . . . . . . . . . . . . . 382<br />

F.9 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the Segmentation data collection . . . . . . . . . . . . . . . 383

F.10 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the BBC data collection . . . . . . . . . . . . . . . . . . . . 384

F.11 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the PenDigits data collection . . . . . . . . . . . . . . . . . 386

xxxii


List of Algorithms<br />

6.1 Symbolic description of the soft consensus function SumConsensus . . . . . . 178<br />

6.2 Symbolic description of the soft consensus function ProductConsensus . . . . 180<br />

6.3 Symbolic description of the soft consensus function BordaConsensus . . . . . 182<br />

6.4 Symbolic description of the soft consensus function CondorcetConsensus . . 183<br />



List of symbols<br />

A: set of algorithms used for creating a cluster ensemble<br />

BE: Borda voting score matrix related to a cluster ensemble E<br />

CE: Condorcet voting score matrix related to a cluster ensemble E<br />

b: size of the mini-ensembles of a hierarchical consensus architecture<br />

c: number of executions of a consensus function in the running time estimation process<br />

Cλ: cluster co-association matrix of a clustering λ<br />

d: number of attributes used for representing an object<br />

D: set of object representation dimensionalities used for creating a cluster ensemble<br />

dfi: ith diversity factor employed in the generation of a cluster ensemble<br />

DHCA: deterministic hierarchical consensus architecture

E: (hard or soft) cluster ensemble<br />

f: number of diversity factors employed in the generation of a cluster ensemble<br />

F: consensus function<br />

γ: ground truth vector<br />

HCA: hierarchical consensus architecture

Iλ: incidence matrix of a hard clustering solution λ<br />

k: number of clusters

Ki: number of consensus processes executed in the ith stage of a hierarchical consensus<br />

architecture<br />

l: number of clusterings contained in a cluster ensemble<br />

λ: label vector resulting from a hard clustering process<br />

Λ: clustering matrix resulting from a soft clustering process<br />

λc: label vector resulting from a hard consensus clustering process<br />

Λc: clustering matrix resulting from a soft consensus clustering process<br />

m: number of modalities of a multimodal data set<br />

n: number of objects in a data set<br />

OEλ : object co-association matrix of a hard cluster ensemble<br />

OEΛ : object co-association matrix of a soft cluster ensemble<br />

Oλ: object co-association matrix of a hard clustering solution λ<br />


OΛ: object co-association matrix of a soft clustering solution Λ<br />

πλ1,λ2: cluster correspondence vector between the hard clustering solutions λ1 and λ2

πΛ1,Λ2: cluster correspondence vector between the soft clustering solutions Λ1 and Λ2

Pλ1,λ2: cluster permutation matrix between the hard clustering solutions λ1 and λ2

PΛ1,Λ2: cluster permutation matrix between the soft clustering solutions Λ1 and Λ2

ΠE: product rule voting score matrix related to a cluster ensemble E<br />

φ (ANMI) : average normalized mutual information<br />

φ (NMI) : normalized mutual information<br />

PERTDHCA: estimated running time of a parallel DHCA<br />

PRTDHCA: real running time of a parallel DHCA<br />

PERTRHCA: estimated running time of a parallel RHCA<br />

PRTRHCA: real running time of a parallel RHCA<br />

r: number of attributes of an object after a dimensionality reduction process<br />

R: set of object representations used for creating a cluster ensemble<br />

RHCA: random hierarchical consensus architecture

s: number of stages of a hierarchical consensus architecture

Sλ1,λ2: cluster similarity matrix between the hard clustering solutions λ1 and λ2

SΛ1,Λ2: cluster similarity matrix between the soft clustering solutions Λ1 and Λ2

SERTDHCA: estimated running time of a serial DHCA<br />

SRTDHCA: real running time of a serial DHCA<br />

SERTRHCA: estimated running time of a serial RHCA<br />

SRTRHCA: real running time of a serial RHCA<br />

ΣE: sum rule voting score matrix related to a cluster ensemble E<br />

w: power proportionality factor of the time complexity of a consensus function<br />

xi: d-dimensional column vector denoting the ith object contained in a data set<br />

X: d × n matrix denoting a data set<br />



Chapter 1<br />

Framework of the thesis<br />

The information and communications technologies (ICT) play a key role in the construction<br />

of the so-called global knowledge society. In fact, the increasingly rapid development of the<br />

ICT and the democratization of their use have facilitated information generation and access<br />

–a basic component of knowledge acquisition– to large segments of the population.

Possibly, the most paradigmatic example of this evolution is the World Wide Web (WWW, or “the Web” for short), which offers almost universal access to information to over 1,500 million users worldwide, an increase of 336% since the year 2000 (InternetWorldStats.com, accessed on February 2009). In parallel, this evolution has boosted the number of existing websites, which has grown exponentially to over 215 million (NetCraft.com, accessed on February 2009)—see figure 1.1. Quite obviously, this latter fact affects the quality of the information available on the Web (e.g. webpages with replicated, erroneous or even forged content are commonly found), making it sometimes difficult to separate the wheat from the chaff.

This is an example of how the development of the ICT entails two intrinsically contradictory consequences: while facilitating knowledge acquisition by making information easier to share and access, it has also fostered the generation of ever-growing amounts of

digital information, giving rise to the so-called data overload effect —a problem that started<br />

to attract the attention of researchers more than a decade ago (Fayyad, Piatetsky-Shapiro,<br />

and Smyth, 1996).<br />

Indeed, as computing technologies have become increasingly efficient in the compression,<br />

transmission, storage and manipulation of data, the real bottleneck has moved towards the<br />

user side: that is, human information interpretation capabilities are often exceeded by the<br />

sheer amount of data. This situation is often aggravated by the existence of data

inconsistencies, making knowledge extraction even more difficult.<br />

Although exemplified in the WWW context, the data overload effect is far from being<br />

exclusive to it. At a smaller scale, large amounts of scientific and social data are being<br />

generated and made either freely or commercially available. Examples of these include<br />

experimental or observational data sets in the physics, chemistry, biomedical, marketing,<br />

financial or social sciences fields, whose sizes can even exceed a terabyte (Valencia,

2002; Witten and Frank, 2005).<br />

In addition to boosting the volume of information repositories, the evolution of the<br />

ICT has also brought about an increase of data complexity (Hinneburg and Keim, 1998).<br />


Figure 1.1: Evolution of the total number of websites across all Internet<br />

domains, from November 1995 to February 2009 (extracted from<br />

http://news.netcraft.com/archives/2009/02/index.html).<br />

Resorting to the WWW example again, we have all witnessed, over the last decade, how

web pages have evolved from static plain text to dynamic multimedia contents. That is,<br />

the information available on the Web is, to a large extent, no longer restricted to a single<br />

modality (e.g. news in text format). Rather the contrary, data is increasingly becoming<br />

multimodal, i.e. a combination of several modalities (e.g. text news accompanied with<br />

photos, graphics, audio or video).<br />

This shift towards data multimodality can be regarded as a change of paradigm which<br />

is also found in many other domains (Klosgen, Zytkow, and Zyt, 2002). For instance,<br />

meteorological information often combines satellite and radar imagery with meteorological<br />

data in numerical form (e.g. temperature, humidity, wind speed, rainfall, etc.). In medical<br />

contexts, repositories often contain data obtained from several diagnostic tests (e.g.<br />

blood analysis, radiography, electrocardiography, electroencephalography, functional magnetic

resonance) whose results are represented under distinct modalities (nominal and numerical<br />

data, images, etc.).<br />

To sum up, despite providing enormous quantities of information on a silver platter, the development and expansion of the ICT pose a serious challenge to human analytic and understanding capabilities, not only because of the large volumes of data available, but also because of their

growing complexity. Therefore, it seems logical to highlight the importance of developing<br />

automatic tools that allow knowledge extraction from large multimodal data repositories,<br />

regardless of their domain (Witten and Frank, 2005). The techniques supporting these tools<br />

belong to the fields of knowledge discovery and data mining (Klosgen, Zytkow, and Zyt,<br />

2002), which constitute, in a broad sense, the frame of reference of this thesis.<br />

When it comes to extracting knowledge from a given data collection, one of the primary<br />

tasks one thinks of is organization: clearly, arranging the contents of a data repository<br />

according to some meaningful structure helps to gain some perspective on it –in fact, organizing information is one of the most innate activities involved in human learning (Anderberg,

1973). In general terms, the structures according to which objects 1 are classified are<br />

known as taxonomies, and, although their shape can vary widely (e.g. from parent-child<br />

hierarchical trees to network schemes or simple group structures), they share the common<br />

goal of facilitating knowledge extraction by imposing structure on an unstructured world

of information. Taxonomies have been proposed by experts in almost every scientific field,<br />

such as biology (e.g. the Linnaean taxonomy (Linnaeus, 1758), which settled the basis of<br />

species classification), medicine (for instance, the International Classification of Diseases of<br />

the World Health Organization (www.who.int, accessed on February 2009)) or education<br />

(from classifications of the different learning objectives and skills that educators set for<br />

students (Bloom, 1956; Anderson et al., 2001) to taxonomic models that describe the levels<br />

of increasing complexity in student’s understanding of subjects (Biggs and Collis, 1982)).<br />

When dealing with digital data, the manual creation of a taxonomy can become a very<br />

challenging and burdensome task, as it requires previous domain knowledge (which is not<br />

always available) and/or careful inspection of the whole data collection before designing<br />

the most suitable taxonomic structure. For this reason, it would be very useful to develop<br />

systems capable of organizing data in a fully automatic manner, so that no expert supervision or domain knowledge is required. If this goal were accomplished, the role of expert taxonomists would be minimized —good news given the dramatic pace at which digital

data is generated.<br />

Regardless of the taxonomic scheme’s layout, data organization criteria are typically<br />

based on analyzing the similarities between objects, grouping them according to their degree<br />

of similarity —i.e. the goal is to place dissimilar instances in separate and distant groups (or<br />

clusters), while placing similar objects in the same group (or in different but closely located<br />

clusters). This task, known as unsupervised classification or clustering, is an important

process which underlies many automated knowledge discovery processes (Fayyad, Piatetsky-<br />

Shapiro, and Smyth, 1996; Klosgen, Zytkow, and Zyt, 2002; Witten and Frank, 2005).<br />

The remainder of this chapter provides an insight on the general framework of the<br />

thesis, highlighting the importance of clustering processes as a part of automatic knowledge<br />

discovery systems, and introducing the central focus of this thesis: the robust clustering of<br />

multimodal data.<br />

1.1 Knowledge discovery and data mining<br />

The subject of knowledge discovery from data repositories has come a long way. In fact,<br />

the interest in this field emerged more than a decade ago as a response to the data overload<br />

effect (Fayyad, Piatetsky-Shapiro, and Smyth, 1996; Fayyad, 1996), when the growth of<br />

digital data repositories started to surpass human analytic capabilities. Indeed, while the<br />

analysis and understanding abilities of human analysts remain more or less the same, the<br />

seemingly ever-growing storage capacity of computers is holding back a veritable avalanche of data (Witten and Frank, 2005). Nowadays, many textbooks, journals, workshops and conferences are devoted to this scientific area, showing that it is still a very active research field (Klosgen,

Zytkow, and Zyt, 2002).<br />

1 By object, we refer to anything -animate beings, inanimate objects, places, concepts, events, properties,<br />

or relationships- liable to be classified according to some taxonomic scheme.<br />


Figure 1.2: Schematic diagram of the steps involved in the KD process (extracted from<br />

(Fayyad, 1996)).<br />

It is a commonplace that potentially useful and beneficial information patterns lie in<br />

digital data repositories awaiting analysis (Witten and Frank, 2005). However, the concept<br />

of analysis and its goals are highly dependent on the context in which it is applied (Fayyad, 1996).

Typical application scenarios can be as disparate as i) mining records of buyers’ choices<br />

for creating marketing campaigns adapted to distinct customer profiles (Witten and Frank,<br />

2005), ii) analyzing credit card transactions history of bank customers so as to detect possible<br />

fraudulent operations from unauthorized users (Fayyad, 1996) or iii) locating and<br />

cataloging geologic objects of interest in remotely sensed images of planets or asteroids<br />

(Fayyad, Piatetsky-Shapiro, and Smyth, 1996).<br />

Thus, be it either economic or scientific, there exists a great interest in replacing (or, at<br />

least, augmenting) human analytic capabilities by computer-based means. The field of computer<br />

science devoted to the extraction of useful patterns from data has been given different<br />

names in the literature, such as information discovery, information harvesting or data archaeology<br />

(Fayyad, Piatetsky-Shapiro, and Smyth, 1996), with knowledge discovery2 (KD) and data mining (DM) being the two most common denominations.

However, the use of KD and DM as synonymous concepts has been a matter of dispute in<br />

the research community (Klosgen, Zytkow, and Zyt, 2002): while deemed equivalent by some<br />

authors (Witten and Frank, 2005), others refer to KD as the whole process of extracting<br />

knowledge from data, defining DM as the central constituting step of KD processes (Fayyad,<br />

Piatetsky-Shapiro, and Smyth, 1996), as depicted in figure 1.2.<br />

According to this latter standpoint (to which we adhere in this thesis), KD is defined<br />

as the ‘non-trivial process of identifying valid, novel, useful and ultimately understandable<br />

patterns in data’, whereas DM is ‘the application of specific algorithms for extracting patterns<br />

from data’ (Fayyad, Piatetsky-Shapiro, and Smyth, 1996). By ‘extracting patterns<br />

patterns from data’ we refer to making any high-level description of a set of data, e.g. fitting<br />

a model to data or finding structure from it (Fayyad, Piatetsky-Shapiro, and Smyth,<br />

1996). Thus, according to this point of view, KD and DM constitute what could be called the general and specific frames of reference of this work, respectively. Due to its generic definition, KD is a crossroads of several disciplines, and, as such, it attracts researchers and practitioners from the fields of statistics, machine learning, pattern recognition, information retrieval, or visualization, to name a few (Fayyad, 1996).

2 Although this discipline was originally named KDD —for Knowledge Discovery in Databases (Piatetsky-Shapiro, 1991)— in this work we assume that operations are conducted on a flat file extracted from the database, i.e. we remove the second D in KDD and focus on the knowledge discovery process.

As shown in the flow diagram presented in figure 1.2, the KD process can be regarded<br />

as a succession of five steps, namely: selection, preprocessing, transformation, data mining<br />

and interpretation/evaluation. That is, extracting knowledge from a given data set can be<br />

regarded as a multistage, interactive and iterative process (Brachman and Anand, 1996), as<br />

the user evaluation of the extracted patterns can lead to a re-execution of any of the stages<br />

(as denoted by the dashed arrows in figure 1.2). However, depending on the nature of the<br />

data and the problem at hand, the first two steps may even be skipped. The following

paragraphs present a brief description of each of these five phases.<br />

For starters, the target data set that will be subject to the KD process is created in the<br />

selection phase. This typically implies selecting a subset of the available objects in the<br />

database, although, in some cases, no selection is conducted, and all the data items in the<br />

repository are included in the target data set.<br />

Optionally, this stage can also consider representing the objects upon a subset of the<br />

variables (aka attributes or features) that constitute them. In general terms, these variables<br />

can either be numerical or nominal. In the former case, attributes usually represent the<br />

value of a quantitatively measurable magnitude (e.g. temperature, altitude or population).<br />

In the latter case, features can only take one of a predefined set of categorical values, such<br />

as the outlook feature in the classic weather data repository, i.e. outlook = {overcast,<br />

sunny, rainy} (Witten and Frank, 2005). In this work, all objects will be represented by<br />

means of a set of d numeric attributes gathered into real-valued d-dimensional vectors.<br />

Secondly, a data preprocessing step is carried out, which includes data parameterization,<br />

noise and/or outliers removal or missing data fields handling, among others.<br />

Thirdly, the objects in the data set are subject to a transformation process, which<br />

basically consists of finding useful features for representing the objects according to the<br />

goals of the overall KD process. In general terms, this step aims to decrease the number

of variables used for representing the objects (dimensionality reduction) so as to improve<br />

the results and/or the computational complexity of the data mining step of the KD process.<br />

The reasoning underlying dimensionality reduction is based on the fact that the original<br />

data representation is often redundant, e.g. there may exist high levels of correlation between<br />

several variables, or the values of some features may be so small that they are almost<br />

irrelevant (Carreira-Perpiñán, 1997). Moreover, when the number of variables associated<br />

with each object is too high, the scalability of KD systems is negatively affected, as the<br />

time complexity of the DM stage is usually proportional to the number of attributes (Yang<br />

and Olafsson, 2005). Furthermore, many methods suffer performance breakdowns when the<br />

dimensionality of the feature space is very high (Fodor, 2002).<br />

There exist two main strategies to conduct dimensionality reduction: feature selection<br />

and feature extraction. In feature selection, the reduced feature set is a subset of the original<br />

object attributes, whereas in feature extraction, the original variables are transformed into<br />

a set of new features, typically obtained as combinations of the original ones. For an insight<br />

on feature selection and extraction techniques, the reader is referred to (Molina, Belanche,<br />

and Nebot, 2002; Dy and Brodley, 2004) and (Fodor, 2002), respectively.<br />
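To make the distinction concrete, the following minimal sketch (Python with NumPy and scikit-learn, assumed here purely for illustration and not necessarily the tools used in the experiments of this thesis) contrasts a naive variance-based feature selection with PCA-based feature extraction on a made-up data matrix:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))   # hypothetical data set: 100 objects, d = 20 attributes (rows are objects, library convention)
    r = 5                            # reduced dimensionality

    # Feature selection: keep a subset of the original attributes,
    # here simply the r attributes with the largest variance.
    selected = np.argsort(X.var(axis=0))[-r:]
    X_selected = X[:, selected]      # still expressed in the original attributes

    # Feature extraction: build r new features as linear combinations
    # of the original ones, e.g. via principal component analysis.
    X_extracted = PCA(n_components=r).fit_transform(X)

    print(X_selected.shape, X_extracted.shape)   # both (100, 5)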


[Figure 1.3 depicts a tree of data mining methods: Verification-oriented methods (goodness of fit, hypothesis testing, analysis of variance) and Discovery-oriented methods, the latter divided into Prediction (classification, regression) and Description (clustering, summarization).]

Figure 1.3: Taxonomy of data mining methods (adapted from (Maimon and Rokach, 2005)).

Next, depending on the goals of the KD process, a suitable data mining task must<br />

be chosen. According to the taxonomy presented in figure 1.3, data mining methods can<br />

be classified into two main groups: verification-oriented and discovery-oriented (Fayyad,<br />

Piatetsky-Shapiro, and Smyth, 1996). In verification-oriented DM, the role of the system<br />

is to evaluate a hypothesis proposed by the user, and this task is usually accomplished

by means of traditional statistical methods. In discovery-oriented data mining, the goal of<br />

the system is to discover useful patterns in the data. In this work, we focus on this latter<br />

family of DM tasks.<br />

Among discovery-oriented data mining methods, one can distinguish between prediction<br />

and description DM tasks. In prediction tasks, the goal of the system is to build<br />

a behavioral model upon the data, whereas in description tasks, the system aims to find<br />

human-understandable patterns that facilitate knowledge extraction from data.<br />

According to figure 1.3, there exist two main prediction DM tasks: classification and<br />

regression. In classification problems, the goal is to learn a mapping between the categories<br />

in a known taxonomic scheme and a set of pre-classified objects, so that any unseen object<br />

can be categorized into any of these predefined classes. The aim of regression tasks is to<br />

learn a function that maps unseen data objects to a real-valued prediction variable.<br />

As aforementioned, description-oriented DM methods focus on finding understandable<br />

representations of the underlying structure of the data (Maimon and Rokach, 2005). One of<br />

the most common descriptive DM tasks is clustering, which consists of identifying a finite<br />

set of categories to describe the data with no previous knowledge (i.e. deriving a taxonomy<br />

solely from the data). Another description-oriented data mining task is summarization,<br />

whose aim is to find a compact description for a subset of data. To do so, summarization<br />

techniques often make use of multivariate visualization methods.<br />

Once the data mining task that fits the goals of the KD process is identified, there<br />

comes the time to select the specific data mining algorithm to be applied. This selection<br />

must take into account not only which models and parameters are the most appropriate<br />

from an algorithmic viewpoint, but also the desired level of accuracy, utility, and intelligibility of the descriptions of the structural patterns of the data (Fayyad, 1996). This

latter issue is of paramount importance with a view to the final stage of the KD process —<br />

evaluation/interpretation—, which often involves visualizing the mined patterns and/or<br />

the data according to these. As mentioned earlier, depending on the user’s evaluation of the<br />

extracted patterns, it may be necessary to re-execute any of the previous steps for further<br />

refinement of the KD process (Halkidi, Batistakis, and Vazirgiannis, 2002a).<br />
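Purely as an illustration of how the five phases chain together (and not as a description of the actual systems developed in this thesis), the following Python sketch strings them into a minimal pipeline; the random data, the choice of k-means as the data mining step, and every parameter are hypothetical placeholders:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def select(raw, n_objects=200):
        # Selection: build the target data set (here, a random subsample of the repository).
        idx = np.random.default_rng(0).choice(len(raw), size=n_objects, replace=False)
        return raw[idx]

    def preprocess(X):
        # Preprocessing: e.g. attribute standardization (noise/outlier handling omitted).
        return StandardScaler().fit_transform(X)

    def transform(X, r=5):
        # Transformation: dimensionality reduction down to r features.
        return PCA(n_components=r).fit_transform(X)

    def mine(X, k=3):
        # Data mining: here, a clustering task.
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)

    def evaluate(X, labels):
        # Interpretation/evaluation: a numeric proxy for cluster quality; an
        # unsatisfactory value may trigger re-execution of earlier steps.
        return silhouette_score(X, labels)

    raw = np.random.default_rng(1).normal(size=(1000, 20))   # stand-in for a data repository
    X = transform(preprocess(select(raw)))
    labels = mine(X)
    print("silhouette:", evaluate(X, labels))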

As the reader may have observed, the user must make several critical decisions at every<br />

step of the knowledge discovery process (Fayyad, Piatetsky-Shapiro, and Smyth, 1996).<br />

By critical we mean that a wrong choice in any of the intermediate stages could lead to the extraction of barely meaningful patterns, and, as a consequence, the objectives of the

whole KD process would not be reached. This issue becomes especially tricky when the DM<br />

stage is based on unsupervised learning techniques, as when, for instance, clustering is the<br />

data mining task selected for the DM phase of the KD process. In this situation, the user’s<br />

decisions are usually made blindly, which may result in an unsatisfactory evaluation, thus

requiring a (blind) re-execution of one or several of the phases of the KD process —which,<br />

in the worst-case scenario, can end in a tedious iterative loop while groping for the right<br />

decisions.<br />

This thesis is focused on the discovery and description-oriented data mining task of<br />

clustering, placing special emphasis on its application on multimodal data. In particular,<br />

one of our main goals is the design of clustering systems robust to the indeterminacies<br />

induced by the fact that many decisions surrounding clustering processes must be made<br />

blindly (e.g. which features should be used, which clustering algorithm should be applied).<br />

In the next section, we present a by no means exhaustive overview of several fundamental<br />

aspects of the clustering problem, which will lead to a description of the key problems<br />

addressed in this thesis.<br />

1.2 Clustering in knowledge discovery and data mining<br />

Clustering can be defined as the process of separating a finite unlabeled data set into a<br />

finite and discrete set of natural clusters based on similarity (Xu and Wunsch II, 2005;<br />

Jain, Murty, and Flynn, 1999).<br />

After a successful clustering process, the presumably high number of objects contained in<br />

the data set can be represented by means of a comparatively smaller number of meaningful<br />

clusters, which necessarily implies a loss of certain fine details, but yields a simplified cluster-based

data model (Berkhin, 2002). In other words, clustering is a description-oriented data<br />

mining task, as the obtained clusters should somehow reflect the mechanisms that cause<br />

some objects to be more similar to one another than to the remaining ones (Witten and<br />

Frank, 2005).<br />

It is important to notice that clustering is a non-supervised classification task, as the<br />

objects in the data set are unlabeled (i.e. there is no prior knowledge about how they should<br />

be grouped). This fact marks a clear difference between clustering and the prediction-oriented

task of supervised classification (see figure 1.3). In this latter case, we are provided<br />

with a collection of labeled (i.e. pre-classified) objects so as to learn the descriptions of<br />

classes, which in turn are used to categorize new data items. In clustering, objects are also<br />

assigned labels, but these are data driven —that is, cluster labels are obtained solely from<br />


the data, not provided by an external source (Jain, Murty, and Flynn, 1999).<br />

Being such a generic task, clustering has found applications in multiple research fields,<br />

among which we can name the following few:<br />

– information retrieval, where clustering has been applied for organizing the results<br />

returned by a search engine in response to a users query (i.e. post-retrieval clustering)<br />

(Tombros, Villa, and van Rijsbergen, 2002; Hearst, 2006), for refining ambiguous<br />

queries input to retrieval systems (Käki, 2005), or for improving their performance<br />

(van Rijsbergen, 1979).<br />

– text mining, where browsing through large document collections is simplified if they<br />

are previously clustered (Cutting et al., 1992; Steinbach, Karypis, and Kumar, 2004;<br />

Dhillon, Fan, and Guan, 2001).<br />

– computational genomics, where clustering of gene expression data from DNA microarray<br />

experiments is applied for identifying the functionality of genes, finding out what<br />

genes are co-regulated or distinguishing the important genes between abnormal and<br />

normal tissues (Zhao and Karypis, 2003a; Jiang, Tang, and Zhang, 2004).<br />

– economics, where clustering economic and financial time series can be employed for

identifying i) areas or sectors for policy-making purposes, ii) structural similarities<br />

in economic processes for economic forecasting, iii) stable dependencies for risk management<br />

and investment management (Focardi, 2001), or iv) customer profiles and<br />

customers-products relationships (Liu and Luo, 2005).<br />

– computer vision, where clustering is applied for common procedures such as image<br />

preprocessing (Jain, 1996), segmentation (Mancas-Thillou and Gosselin, 2007) and<br />

matching (Miyajima and Ralescu, 1993).<br />

Regardless of the application, the desired result of any clustering process is a maximally<br />

representative partition of the data set, which usually corresponds to clusters with high<br />

intra-cluster and low inter-cluster object similarities. In the quest for this goal, a myriad<br />

of clustering methods have been proposed. With no claim of being exhaustive, the next<br />

section presents an overview of some of the most relevant clustering methods, highlighting<br />

some important concepts in this context.<br />
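As a small numerical illustration of this intra-/inter-cluster criterion (the two artificial clusters and the plain Euclidean distances below are invented for the example and are not part of the evaluation methodology used later), a good partition should exhibit small within-cluster and large between-cluster distances:

    import numpy as np

    def mean_pairwise_dist(A, B):
        # Mean Euclidean distance between all pairs of objects drawn from A and B.
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).mean()

    rng = np.random.default_rng(0)
    # Two well-separated artificial clusters of two-dimensional objects.
    c1 = rng.normal(loc=0.0, scale=0.1, size=(10, 2))
    c2 = rng.normal(loc=1.0, scale=0.1, size=(10, 2))

    intra = (mean_pairwise_dist(c1, c1) + mean_pairwise_dist(c2, c2)) / 2
    inter = mean_pairwise_dist(c1, c2)
    print(intra < inter)   # True for a good partition: low intra-cluster, high inter-cluster distance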

1.2.1 Overview of clustering methods<br />

Several excellent and extensive surveys on clustering can be found in the literature (Jain,<br />

Murty, and Flynn, 1999; Berkhin, 2002; Kotsiantis and Pintelas, 2004; Xu and Wunsch II,<br />

2005). Providing a detailed description of the existing clustering methods lies beyond the<br />

scope of this work, so the reader interested in their ins and outs is referred to the previously<br />

cited surveys and references therein. However, due to the central role of clustering processes<br />

in this thesis, this section presents a brief description of several key issues in this context,<br />

such as:<br />

1. A categorization of clustering algorithms according to generic criteria.<br />

2. A brief discussion on similarity measures, one of the central notions in clustering.<br />


3. An outline of the foundations of several representative clustering methods.<br />

For starters, let us introduce a few notational conventions which will be employed<br />

throughout this work:<br />

– in general terms, any object in a data set will be represented by means of a d-dimensional column vector x = [x1 x2 ... xd]^T (where T denotes transposition). As mentioned in section 1.1, in this work we consider only object representations based on numerical attributes, so that each feature xi is a real number and, hence, x ∈ R^d.

– so as to refer to a particular data item (e.g. the ith object in the data repository) we will use the notation xi = [xi1 xi2 ... xid]^T.

– a data set is defined as a compilation of n objects, and is mathematically represented by a d × n matrix X = [x1 x2 ... xn].

– the number of clusters into which the objects will be partitioned is denoted as k. Thus, each cluster resulting from the clustering process will be assigned an integer-valued label in the range [1, ..., k] (a short code sketch of these conventions is given after this list).
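A minimal sketch of these notational conventions in code (NumPy is assumed here only for illustration; the dimensions and labels are arbitrary):

    import numpy as np

    d, n, k = 2, 9, 3                                # attributes, objects, clusters
    X = np.random.default_rng(0).random((d, n))      # data set: a d x n matrix, one object per column
    x3 = X[:, 2]                                     # the 3rd object, a d-dimensional column vector
    labels = np.array([2, 2, 2, 1, 1, 1, 3, 3, 3])   # a hard label vector with one entry per object
    assert labels.shape == (n,) and labels.min() >= 1 and labels.max() <= k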

Categorization of clustering methods<br />

Without taking into account their theoretical foundations, there exist many ways of classifying<br />

clustering algorithms (Jain, Murty, and Flynn, 1999). However, they can be categorized according to two universal criteria: i) the structure of the

final clustering solution, and ii) the overlap of the obtained clusters.<br />

On one hand, if clustering algorithms are analyzed in terms of the structure of the clustering<br />

solution, one can distinguish between partitional and hierarchical clustering methods.<br />

1. Partitional clustering algorithms: the clustering solution output by this type of algorithms<br />

corresponds to the most intuitive notion of clustering: a single partition of the<br />

objects in the data set into the desired number of clusters k.<br />

2. Hierarchical clustering algorithms: the structure of the clustering solution is a tree<br />

of clusters (Kotsiantis and Pintelas, 2004), which is built sequentially. Depending on<br />

whether this tree is constructed bottom-up or top-down (i.e. from n singleton clusters<br />

to a sole cluster or vice versa), hierarchical algorithms are called agglomerative or divisive.<br />

In either case, the algorithm decides, at each level of the tree, which two clusters<br />

should be merged (if agglomerative) or split (if divisive) depending on their degree of<br />

similarity, which is typically measured according to either the minimum, maximum<br />

or average of the distances between all pairs of objects drawn from both clusters,<br />

giving rise to the single-link, complete-link or average-link criteria (Jain, Murty, and<br />

Flynn, 1999). The clustering solution structure yielded by hierarchical algorithms is<br />

usually represented by means of a binary tree of clusters or dendrogram, an example of

which is depicted in figure 1.4. Although hierarchical clustering algorithms typically<br />

compute the complete hierarchy of clusters 3 , a single partition can be obtained by<br />

3 As a consequence, when compared to their partitional counterparts, hierarchical clustering algorithms<br />

tend to be more computationally demanding (Jain, Murty, and Flynn, 1999).<br />


0.6<br />

0.4<br />

0.2<br />

0<br />

1<br />

2<br />

3<br />

−0.2<br />

−0.2 0 0.2 0.4 0.6<br />

(a) Scatterplot of the data of this<br />

toy two-dimensional data set.<br />

4<br />

7<br />

5<br />

9<br />

6<br />

8<br />

Euclidean distance<br />

0.35<br />

0.3<br />

0.25<br />

0.2<br />

0.15<br />

0.1<br />

7 9 8 4 5 6 1 2 3<br />

object index<br />

(b) Dendrogram obtained by the<br />

single-link hierarchical algorithm.<br />

Figure 1.4: A hierarchical clustering toy example: (a) Scatterplot of an artificially generated<br />

two-dimensional data set containing n = 9 objects, each one of them is identified by a<br />

numerical label. (b) Dendrogram resulting of applying the single-link hierarchical agglomerative<br />

clustering algorithm on this data, using the Euclidean distance as the similarity<br />

measure. The dashed horizontal line performs a cut on the dendrogram, yielding a 4-way<br />

partition with an Euclidean distance between clusters ranging between 0.112 and 0.255.<br />

performing a cut through the dendrogram at the desired level of cluster similarity or<br />

by setting the desired number of clusters k, as shown by the dashed horizontal line in<br />

figure 1.4(b).<br />
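To make the dendrogram cut concrete, the following sketch (assumed, not part of the thesis; data values and parameters are hypothetical) reproduces the idea on a small artificial data set using SciPy.

```python
# Single-link agglomerative clustering on a toy 2-D data set, followed by two ways
# of cutting the resulting dendrogram: by number of clusters and by distance level.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (0.5, 0.0), (0.25, 0.5)]
X = np.vstack([rng.normal(c, 0.05, size=(3, 2)) for c in centers])  # n = 9 objects

Z = linkage(X, method="single", metric="euclidean")    # bottom-up merge tree (dendrogram)
labels_k = fcluster(Z, t=3, criterion="maxclust")       # cut so that k = 3 clusters remain
labels_d = fcluster(Z, t=0.2, criterion="distance")     # cut at a given inter-cluster distance
```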

On the other hand, the overlap of the clusters into which objects are grouped is an additional factor which allows splitting clustering algorithms into two large categories: hard and soft algorithms.

1. Hard (aka crisp) clustering algorithms: this type of algorithm partitions the data set into k disjoint clusters, i.e. each object is assigned to one and only one cluster. In mathematical terms, the result of a hard clustering process on the data set X is an n-dimensional integer-valued row label vector λ = [λ_1 λ_2 ... λ_n], where λ_i ∈ {1, 2, ..., k}, ∀i ∈ [1, n]. That is, the ith component of the label vector (or labeling for short) contains the label of the cluster the ith object in the data set is assigned to. For instance, the label vector obtained after applying a classic hard clustering algorithm such as k-means on the artificial toy data set depicted in figure 1.4(a), setting k = 3, is λ = [2 2 2 1 1 1 3 3 3]. Notice the symbolic nature of the cluster labels, as the same clustering result would be represented by permuted label vectors such as λ = [1 1 1 2 2 2 3 3 3] or λ = [3 3 3 1 1 1 2 2 2].

2. Soft (aka fuzzy) clustering algorithms: they generate a set of k overlapping clusters, i.e. each object is associated to each of the k clusters to a certain degree. Hence, the result of conducting a soft clustering process on the data set X is a k × n real-valued clustering matrix Λ, whose (i,j)th entry indicates the degree of association between the ith cluster and the jth object. This degree of association is typically expressed in terms of the probability of membership of each object to each cluster, as done by the well-known fuzzy c-means (FCM) soft clustering algorithm. Continuing with the toy two-dimensional data set presented in figure 1.4(a), the membership probability matrix yielded by the FCM clustering algorithm (with k = 3) is the following:

Λ = [ 0.0544 0.0418 0.0572 0.0254 0.0192 0.0301 0.9764 0.9285 0.9723
      0.0252 0.0258 0.0375 0.9688 0.9758 0.9586 0.0144 0.0554 0.0173
      0.9205 0.9324 0.9053 0.0059 0.0050 0.0114 0.0092 0.0160 0.0104 ]

It is easy to see that permuting the rows of matrix Λ would not alter the clustering results, as cluster identifiers are symbolic. Moreover, notice that a clustering matrix Λ can always be transformed into a label vector λ by simply assigning each object to the cluster it is most strongly associated with (e.g. the cluster with the highest membership probability).
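The conversion from a clustering matrix Λ to a label vector λ amounts to a column-wise argmax, as the sketch below illustrates (an assumed illustration, not taken from the thesis; the matrix values are arbitrary placeholders rather than actual FCM output).

```python
# Turning a soft clustering matrix (k x n membership degrees) into a hard label
# vector by assigning each object to its most strongly associated cluster.
import numpy as np

Lambda = np.array([[0.05, 0.04, 0.06, 0.98, 0.93],
                   [0.90, 0.92, 0.91, 0.01, 0.02],
                   [0.05, 0.04, 0.03, 0.01, 0.05]])   # hypothetical k = 3, n = 5 matrix

lam = Lambda.argmax(axis=0) + 1   # +1 so that labels lie in {1, ..., k}, as in the text
print(lam)                        # [2 2 2 1 1]
```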

Summarizing, the structure of the clustering solution and the overlap of the resulting clusters define a two-dimensional frame of reference which allows categorizing clustering algorithms in a broad sense, i.e. without resorting to their theoretical foundations.

Distance and similarity measures

According to the definition of clustering presented at the beginning of section 1.2, measuring the resemblance between the objects in the data set is central to clustering processes. For this reason, the next paragraphs are devoted to a brief description of several ways of measuring similarity. More specifically, we focus solely on measures for computing the resemblance between objects under a numeric feature representation, although there exist equivalent measures for comparing objects represented by means of ordinal or nominal attributes (Xu and Wunsch II, 2005; Jain, Murty, and Flynn, 1999).

Hence, let us consider two objects in the data set represented as vectors in an R^d space, namely x_i = [x_i1 x_i2 ... x_id]^T and x_j = [x_j1 x_j2 ... x_jd]^T.

There exist two complementary ways of comparing x_i and x_j: i) measuring their degree of similarity, denoted as S(x_i, x_j), or ii) measuring the distance between them, i.e. D(x_i, x_j). Although when dealing with objects represented by numerical features it is more usual to measure the distance between them than their similarity (Jain, Murty, and Flynn, 1999; Xu and Wunsch II, 2005), both types of measures will be described next—furthermore, there are multiple ways of transforming a similarity measure into a distance and vice versa (Fenty, 2004).

1. Distance measures

– Minkowski distance: properly speaking, it is a family of distances. In general terms, the Minkowski distance of order n is defined according to equation (1.1):

D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^n \right)^{1/n}    (1.1)


– Euclidean distance: possibly the most widely used metric, it is obtained as the particularization of the Minkowski metric for n = 2:

D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^2 \right)^{1/2}    (1.2)

This is the distance measure used in the most classic implementation of the k-means clustering algorithm, and it tends to form hyperspherical clusters (Xu and Wunsch II, 2005).

– Manhattan distance: also known as city block distance, it is defined as a particular case of the Minkowski metric for n = 1 (see equation (1.3)), and it tends to create hyperrectangular clusters (Xu and Wunsch II, 2005).

D(x_i, x_j) = \sum_{l=1}^{d} |x_{il} - x_{jl}|    (1.3)

– Mahalanobis distance: it can be regarded as a modified version of the Euclidean distance that takes into account the covariance among the attributes. It is defined as follows:

D(x_i, x_j) = (x_i - x_j)^T S^{-1} (x_i - x_j)    (1.4)

where S is the sample covariance matrix computed over all the data set (Jain, Murty, and Flynn, 1999). Algorithms using this distance tend to create hyperellipsoidal clusters (Xu and Wunsch II, 2005).

2. Similarity measures

– Cosine similarity: it consists of measuring the angle comprised between the vectors representing the objects and, as such, it does not depend on their length. It is defined as follows:

S(x_i, x_j) = \frac{x_i^T x_j}{||x_i|| \, ||x_j||}    (1.5)

where ||·|| denotes vector norm.

– Pearson correlation coefficient: a classic concept in the probability theory and statistics fields, correlation measures the strength and direction of the linear relationship between vectors x_i and x_j. The most widely used correlation index is the Pearson correlation coefficient, which is defined in equation (1.6):

S(x_i, x_j) = \frac{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)^2} \, \sqrt{\sum_{l=1}^{d} (x_{jl} - \bar{x}_j)^2}}    (1.6)

where x_{il} is the lth component of vector x_i, and \bar{x}_i denotes its sample mean.


– Extended Jaccard coefficient: whereas the cosine and Pearson correlation coefficient similarity measures consider that vectors x_i and x_j are similar if they point in the same direction (i.e. they have roughly the same set of features and in the same proportion), the extended Jaccard coefficient –defined in equation (1.7)– accounts both for the angle and the magnitude of the vectors.

S(x_i, x_j) = \frac{x_i^T x_j}{||x_i||^2 + ||x_j||^2 - x_i^T x_j}    (1.7)

For further insight on these and other distance and similarity measures, their properties and other characteristics, see (Duda, Hart, and Stork, 2001; Xu and Wunsch II, 2005).
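As a compact reference, the sketch below (assumed, not part of the thesis) computes the distance and similarity measures described above for a pair of hypothetical vectors, relying on SciPy for the distances and on direct NumPy expressions for the similarities.

```python
import numpy as np
from scipy.spatial import distance

xi = np.array([0.1, 0.4, 0.3])
xj = np.array([0.2, 0.1, 0.5])
X = np.random.default_rng(0).random((50, 3))        # data set used only to estimate S

# Distance measures, equations (1.1)-(1.4)
d_mink = distance.minkowski(xi, xj, p=3)            # Minkowski distance of order n = 3
d_eucl = distance.euclidean(xi, xj)                 # Euclidean distance
d_manh = distance.cityblock(xi, xj)                 # Manhattan (city block) distance
S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # inverse sample covariance matrix
d_maha = distance.mahalanobis(xi, xj, S_inv) ** 2   # squared, since SciPy returns the square root of eq. (1.4)

# Similarity measures, equations (1.5)-(1.7)
s_cos = xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))
ci, cj = xi - xi.mean(), xj - xj.mean()
s_pears = ci @ cj / np.sqrt((ci @ ci) * (cj @ cj))
s_jacc = xi @ xj / (xi @ xi + xj @ xj - xi @ xj)    # extended Jaccard coefficient
```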

Approaches to clustering

As aforementioned, multiple clustering algorithms with fairly different foundations can be found in the literature. Our interest here is to present a brief description of the most well-known theoretical approaches clustering algorithms are based on, enumerating some specific implementations emanated from them.

1. Square error based clustering: in this approach, the goal is minimizing the sum of squared distances between the objects and the centroid (i.e. the center of gravity) of the cluster they are assigned to. This minimization process is usually executed iteratively, as in the case of the k-means clustering algorithm, which is probably the most representative example of this type of algorithm (Forgy, 1965) (a brief illustrative sketch of this and some of the following approaches is given after this enumeration). A slight variant of the k-means algorithm is k-medoids, where clusters are represented by one of their member objects instead of by their centroids, which makes it more robust to outliers (Kaufman and Rousseeuw, 1990). Moreover, techniques that allow splitting and merging the resulting clusters have also been developed. An example is the ISODATA algorithm (Ball and Hall, 1965), which divides high variance clusters and joins close clusters—quite obviously, setting the thresholds that govern the cluster merging and splitting decisions is a key issue in ISODATA.

2. Mixture densities based clustering: this approach follows a probabilistic perspective, as it assumes that the objects in the data set have been generated according to several probability distributions—typically one per cluster. Finding the clusters boils down to making assumptions on these probability distributions (they are often assumed to be Gaussian) and estimating the parameters of the underlying models, usually following a maximum likelihood approach (Jain, Murty, and Flynn, 1999). The iterative optimization of the maximum likelihood criterion using the expectation-maximization (EM) algorithm has given rise to the most popular clustering algorithm based on mixture densities, EM clustering (McLachlan and Krishnan, 1997). Other algorithms based on this kind of approach are AutoClass (which extends mixture densities to Poisson, Bernoulli and log-normal probability distributions) (Cheeseman and Stutz, 1996), or the SNOB algorithm, which uses a mixture model in conjunction with the minimum message length principle (Wallace and Dowe, 1994).

3. Graph based clustering: this type of algorithm applies the concepts and properties of graph theory to the clustering problem, mapping data objects onto the nodes of a weighted graph whose edges reflect the similarity between each pair of objects. This makes graph based clustering conceptually similar to hierarchical clustering (Jain and Dubes, 1988), a good example of which is the Chameleon hierarchical clustering algorithm, which is based on the k-nearest neighbour graph (Karypis, Han, and Kumar, 1999). Other (non-hierarchical) graph based clustering algorithms are i) Zahn's algorithm (Zahn, 1971), which is based on discarding the edges with the largest lengths on the minimal spanning tree so as to create clusters, ii) CLICK, based on computing the minimum weight cut to form clusters (Sharan and Shamir, 2000), and iii) MajorClust, which is based on the weighted partial connectivity of the graph, a measure whose maximization implicitly determines the optimum number of clusters to be found (Stein and Niggemann, 1999). Lastly, the more recent family of spectral clustering algorithms can be included in the context of graph based clustering. These algorithms are often reported to outperform traditional clustering techniques (von Luxburg, 2006). In addition, spectral clustering is simple to implement and can be solved efficiently by standard linear algebra methods, as, in short, it boils down to computing the eigenvectors of the Laplacian matrix of the graph. Several variants of spectral clustering algorithms have been proposed, differing in the way the similarity graph and the Laplacian matrix are computed (e.g. (Shi and Malik, 2000; Ng, Jordan, and Weiss, 2002)).

4. Clustering based on combinatorial search: this approach is based on considering clustering as a combinatorial optimization problem that can be solved by applying search techniques for finding the global (or approximate global) optimum clustering solution. In this context, two paradigms have been followed in the design of clustering algorithms: stochastic optimization methods and deterministic search techniques. Among the former, some of the most popular approaches are based on evolutionary computation (e.g. hard or soft clustering based on genetic algorithms (Hall, Özyurt, and Bezdek, 1999; Tseng and Yang, 2001)), simulated annealing (e.g. (Selim and Al-Sultan, 1991)), Tabu search (Al-Sultan, 1995) and hybrid solutions (Chu and Roddick, 2000; Scott, Clark, and Pham, 2001), whereas deterministic annealing is the most typical deterministic search technique applied to clustering (Hofmann and Buhmann, 1997).

5. Clustering based on neural networks: the well-known learning and modelling abilities of neural networks have been exploited to solve clustering problems. The two most successful neural network paradigms applied to clustering are i) competitive learning, where Self-Organizing Maps (Kohonen, 1990) and Generalized Learning Vector Quantization (Karayiannis et al., 1996) play a salient role, and ii) adaptive resonance theory (Carpenter and Grossberg, 1987), which encompasses a whole family of neural network architectures that can be used for hierarchical (Wunsch et al., 1993) and soft (Carpenter, Grossberg, and Rosen, 1991) clustering.

6. Kernel based clustering: the rationale of kernel-based learning algorithms is simplifying the task of separating the objects in the data set by nonlinearly transforming them into a higher-dimensional feature space. Through the design of an inner-product kernel, the time-consuming and sometimes even infeasible process of explicitly describing the nonlinear mapping and computing the corresponding data points in the transformed space can be avoided (Xu and Wunsch II, 2005). A recent example of this approach is Support Vector Clustering (SVC) (Ben-Hur et al., 2001), which employs the radial basis function as its kernel and is capable of forming either agglomerative or divisive hierarchical clusters. Moreover, SVC can be further extended to allow for fuzzy membership (Chiang and Hao, 2003). Kernel-based clustering algorithms have many advantages, such as the ability to form arbitrary clustering shapes or to deal with noise and outliers (Xu and Wunsch II, 2005).

7. Density based clustering: this type of clustering algorithm forms clusters based on the density of objects in a region of the feature space, so that the neighborhood of each object in a cluster must contain a minimum number of objects. This principle allows the growth of clusters in any direction, thus being able to discover arbitrarily shaped clusters, besides providing a natural protection against outliers. There are two main approaches to density-based clustering, differing in the way density is computed: the first class of algorithms bases the computation of density on the objects in the data set (such as the DBSCAN algorithm (Ester et al., 1996)). In contrast, the second type of density-based clustering strategies creates an analytical model of the density over all the feature space, using influence functions that describe the impact of each data object within its neighbourhood, thus identifying clusters by determining the maxima of the overall density function—as the DENCLUE algorithm does (Hinneburg and Keim, 1998). More recently, another density-based clustering algorithm called Shared Nearest Neighbors (Ertz, Steinbach, and Kumar, 2003) has been proposed: it measures object similarity based on the number of common nearest neighbours of each pair of objects, thus identifying core points around which clusters are grown.

8. Grid based clustering: in this approach, the clustering space is quantized into a finite number of hyperrectangular cells. Those cells containing a number of objects over a predetermined threshold are connected to form the clusters. Possibly, the three most well-known clustering algorithms based on this approach are STING (Wang, Yang, and Muntz, 1997), WaveCluster (Sheikholeslami, Chatterjee, and Zhang, 1998) and CLIQUE (Agrawal et al., 1998). The main difference between them is the cell generation procedure. In STING, the feature space is divided into several levels, thus forming a hierarchical cell structure. In contrast, CLIQUE follows a recursive process for generating (k+1)-dimensional dense cells by associating dense k-dimensional cells, starting with k = 1. In turn, WaveCluster follows a fairly different approach, as it applies the wavelet transform to map the original feature space into a frequency space where the natural clusters in the data become distinguishable (i.e. cells are somehow defined on the transformed space).
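As announced in the first item of the enumeration above, the following sketch (assumed, not part of the thesis) illustrates how four of these approaches (square error based, mixture density based, graph/spectral and density based clustering) can be run on the same hypothetical data set with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).random((100, 2))   # hypothetical data set: n = 100 objects in R^2
k = 3                                           # desired number of clusters

lam_kmeans = KMeans(n_clusters=k, n_init=10).fit_predict(X)                 # square error based
gmm = GaussianMixture(n_components=k).fit(X)                                # mixture densities fitted with EM
Lambda = gmm.predict_proba(X).T                                             # k x n soft clustering matrix
lam_em = Lambda.argmax(axis=0)                                              # corresponding hard labels
lam_spec = SpectralClustering(n_clusters=k, affinity="rbf").fit_predict(X)  # graph/spectral clustering
lam_dbscan = DBSCAN(eps=0.1, min_samples=4).fit_predict(X)                  # density based; -1 marks noise
```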

1.2.2 Evaluation of clustering processes

According to the flow diagram depicted in figure 1.2, the final stage of knowledge discovery processes involves the user's evaluation of the mined patterns (Fayyad, Piatetsky-Shapiro, and Smyth, 1996). When clustering is the central data mining task of the KD process, this implies validating the obtained clusters.

As far as this issue is concerned, three distinct approaches can be followed depending on the reference used for evaluating the clustering solution:

– the data itself, by determining whether the evaluated cluster structure provides a proper description of the data, which is measured by means of internal cluster validity indices.

– a predefined and allegedly correct cluster structure, measuring its degree of resemblance to the obtained clustering solution by means of external cluster validity indices.

– a clustering solution resulting from another clustering process (e.g. a distinct execution of the same clustering algorithm but using different parameters), measuring their relative merit so as to decide which one may best reveal the characteristics of the objects, using relative cluster validity indices.

All three types of evaluation criteria can be used for validating individual clusters, as well as the output of partitional and hierarchical clustering algorithms (Jain and Dubes, 1988; Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and Vazirgiannis, 2002b).

As regards their applicability, only internal and relative evaluation criteria are applicable to evaluate clustering solutions in real-life scenarios. This is due to the unsupervised nature of the clustering problem, as the class membership of objects is unknown in practice. However, in a research context –where ‘correct’ category labels presumably assigned by an expert to the objects in the data set are usually known, but not available to the clustering algorithm–, it is sometimes more appropriate to use external validity indices, since clusters are ultimately evaluated externally by humans (Strehl, 2002). For this reason, in this work we will make use of external evaluation criteria solely. However, there exist some recent efforts that aim to find correlations between internal and external cluster validity indices, such as (Ingaramo et al., 2008). Nevertheless, for further insight on internal and relative validity indices, see (Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and Vazirgiannis, 2002b; Maulik and Bandyopadhyay, 2002).

Therefore, evaluation will consist in testing whether the clustering solution reflects the true group structure of the data set, captured in a reference clustering or ground truth⁴.

A further advantage of this evaluation approach lies in the fact that external evaluation measures can be used to compare fairly the performance of clustering algorithms regardless of their foundations, as they make no assumption about the mechanisms used for finding the clusters (Strehl, 2002).

There exist multiple ways of comparing a clustering solution to a ground truth. Quite obviously, different approaches must be followed depending on the nature of the clustering solution (i.e. whether it is hard or soft, hierarchical or partitional).

As regards the evaluation of soft clustering solutions, the main difficulty lies not in the comparison process (see (Gopal and Woodcock, 1994; Jäger and Benz, 2000) for some classic approaches), but in the creation of a fuzzy ground truth, which may require applying an averaging scheme that accounts for systematic biases in the answers of the expert labelers (Jäger and Benz, 2000; Jomier, LeDigarcher, and Aylward, 2005).

As far as the validation of hierarchical clustering solutions is concerned, it is necessary to use a hierarchical taxonomy as the ground truth. However, this type of ground truth is not always available, as some domains are more prone to be organized hierarchically than others. Possibly due to this fact, not much research has been done on external hierarchical clustering evaluation (Patrikainen and Meila, 2005). Some examples of the few existing hierarchical clustering comparison methods are simple layer-wise comparison (Fowlkes and Mallows, 1983) and cophenetic matrices (Theodoridis and Koutroumbas, 1999), although they also have their shortcomings (Patrikainen and Meila, 2005). For this reason, the most extended strategy is to compare the clusterings found at a certain level of the dendrogram with a partitional ground truth. Unfortunately, this approach does not take into account the cluster hierarchy in any way, which is clearly not the point if the hierarchical clustering solution is to be validated as a whole.

⁴ The unsupervised nature of clustering means that the performance of clustering algorithms cannot be judged with the same certitude as that of supervised classifiers, as the external categorization (ground truth) might not be optimal. For instance, the way web pages are organized in the Yahoo! taxonomy is possibly not the best structure possible, but achieving a grouping similar to the Yahoo! taxonomy is certainly indicative of successful clustering (Strehl, 2002).

Allowing for all these considerations, and provided that the outputs of soft and hierarchical clustering algorithms can always be converted to hard and partitional clustering solutions, respectively (see section 1.2.1), the most common cluster validation procedure consists in comparing hard partitional clustering solutions (i.e. label vectors) with the same type of ground truths (i.e. comparing cluster labels with class labels) (Strehl, 2002).

The following paragraphs are devoted to a description of some relevant external validity indices for evaluating hard partitional clustering solutions.

The multiple possible ways for comparing the class labels contained in the ground truth label vector γ and the cluster labels in a label vector λ can be categorized into two groups depending on whether they are based on i) object pairwise matching, or ii) cluster matching.

Object pairwise matching cluster validity indices are based on counting how many object pairs (x_i, x_j), ∀ i ≠ j, are clustered together and separately in both γ and λ (the more coincidences, the higher the similarity between the clustering solution and the ground truth). Following this rationale, several validity indices have been proposed, such as the Rand index (Rand, 1971), the Adjusted Rand index (Hubert and Arabie, 1985), the Fowlkes-Mallows index (Fowlkes and Mallows, 1983) or the Jaccard index, among others—see (Halkidi, Batistakis, and Vazirgiannis, 2002a; Denoeud and Guénoche, 2006).
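Several of these pairwise matching indices are readily available in standard libraries; the sketch below (assumed, not part of the thesis) compares a hypothetical label vector against a ground truth with two of them.

```python
# Object pairwise matching indices: count pairs co-clustered in both the ground truth and the labeling.
import numpy as np
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

gamma = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])   # ground truth class labels
lam   = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3])   # cluster labels (one object misplaced)

print(adjusted_rand_score(gamma, lam))           # Adjusted Rand index (Hubert and Arabie, 1985)
print(fowlkes_mallows_score(gamma, lam))         # Fowlkes-Mallows index
```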

Cluster matching cluster validity indices measure the degree of agreement between the assignment of objects to classes (according to γ) and clusters (as designated by λ). Typical examples of this kind of validity indices are the Larsen index (Larsen and Aone, 1999), the Van Dongen index (van Dongen, 2000), variation of information (Meila, 2003), entropy or mutual information (Cover and Thomas, 1991), to name a few.

In all the experimental sections of this work, the cluster validity index used for evaluating clustering results is normalized mutual information, denoted as φ^(NMI). This choice is motivated by the fact that φ^(NMI) is theoretically well-founded, unbiased, symmetric with respect to λ and γ and normalized in the [0, 1] interval —the higher the value of φ^(NMI), the more similar λ and γ are (Strehl, 2002). Mathematically, normalized mutual information is defined as follows:

\phi^{(NMI)}(\gamma, \lambda) = \frac{\sum_{h=1}^{k} \sum_{l=1}^{k} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^{(\gamma)} \, n_l^{(\lambda)}}\right)}{\sqrt{\left(\sum_{h=1}^{k} n_h^{(\gamma)} \log\frac{n_h^{(\gamma)}}{n}\right)\left(\sum_{l=1}^{k} n_l^{(\lambda)} \log\frac{n_l^{(\lambda)}}{n}\right)}}    (1.8)

where n_h^{(γ)} is the number of objects in cluster h according to γ, n_l^{(λ)} is the number of objects in cluster l according to λ, n_{h,l} denotes the number of objects in cluster h according to γ as well as in group l according to λ, n is the number of objects contained in the data set, and k is the number of clusters into which objects are clustered according to λ and γ (Strehl, 2002).

Thus, the more similar the clustering solutions represented by the label vector λ and the ground truth γ, the closer to 1 φ^(NMI)(γ, λ) will be. As the ground truth γ is assumed to represent the true partition of the data, high quality clusterings will attain φ^(NMI)(γ, λ) values close to unity. As a consequence, given two label vectors λ_1 and λ_2, the former will be considered to be better than the latter if φ^(NMI)(γ, λ_1) > φ^(NMI)(γ, λ_2), and vice versa.
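The sketch below (assumed, not part of the thesis) evaluates a label vector against a ground truth with a direct transcription of equation (1.8); for a label vector that is a mere relabeling of the ground truth, it returns 1.

```python
# Normalized mutual information between a ground truth gamma and a labeling lam, eq. (1.8).
import numpy as np

def nmi(gamma, lam):
    gamma, lam = np.asarray(gamma), np.asarray(lam)
    n = len(gamma)
    num = 0.0
    for h in np.unique(gamma):
        for l in np.unique(lam):
            n_h, n_l = np.sum(gamma == h), np.sum(lam == l)
            n_hl = np.sum((gamma == h) & (lam == l))
            if n_hl > 0:
                num += n_hl * np.log(n * n_hl / (n_h * n_l))
    den_g = sum(np.sum(gamma == h) * np.log(np.sum(gamma == h) / n) for h in np.unique(gamma))
    den_l = sum(np.sum(lam == l) * np.log(np.sum(lam == l) / n) for l in np.unique(lam))
    return num / np.sqrt(den_g * den_l)

gamma = [1, 1, 1, 2, 2, 2, 3, 3, 3]     # ground truth class labels
lam   = [2, 2, 2, 1, 1, 1, 3, 3, 3]     # cluster labels: a permutation of the ground truth
print(nmi(gamma, lam))                   # 1.0, since the partitions coincide up to relabeling
```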

1.3 Multimodal clustering

The ubiquity of multimedia data has motivated an increasing interest in clustering techniques capable of dealing with multimodal data. In the following paragraphs we review some of the most relevant works on clustering multimodal data.

Possibly one of the first works that mention the multimedia clustering problem was (Hinneburg and Keim, 1998). The authors place special emphasis on highlighting the two main challenges faced by clustering algorithms in this context: the high dimensionality of the feature vectors and the existence of noise. To tackle these problems, the authors proposed DENCLUE, a density-based clustering algorithm capable of dealing satisfactorily with both issues. However, in that work, multimodality seemed to be more of a pretext to justify the challenges of clustering high dimensional noisy data than an interest in itself.

This was not the case of the browsing and retrieval system for collections of text annotated web images presented in (Chen et al., 1999), which was a multimodal extension of the Scatter-Gather document browser of (Cutting et al., 1992). In this case, multiple clusterings were created upon text and image features independently. In particular, clustering on image features was employed as part of a query refinement process. Therefore, this proposal is multimodal in the sense that the features of the distinct modalities are employed for clustering the image collection, but, still, they were not fully integrated in the clustering process.

In contrast, the true multimodality of the clustering approach proposed in (Barnard and Forsyth, 2001) is guaranteed by modeling the probabilities of word and image feature occurrences and co-occurrences. It consisted of a statistical hierarchical generative model fitted with the EM algorithm, which organizes image databases using both image features and their associated text. In subsequent works, the learnt joint distribution of image regions and words was exploited in several applications, such as the prediction of words associated with whole images or with image regions (Barnard et al., 2003).

Multimodal clustering has also been applied to the discovery of perceptual clusters for disambiguating the semantic meaning of text annotated images (Benitez and Chang, 2002). To do so, the images are clustered based on the visual or the text feature descriptors. Moreover, the system could also conduct multimodal clustering upon any combination of text and visual feature descriptors by conducting an early fusion of these. Principal Component Analysis was used to integrate and to reduce the dimensionality of feature descriptors before clustering. As regards the results obtained by multimodal clustering, the authors highlighted the uncorrelatedness of visual and text feature descriptors, which suggests that they should be integrated in the knowledge extraction process.

In the multimodal clustering context, the field that has motivated the largest amount of research efforts is the clustering of web image search results based not only on visual features, but also using the surrounding text and link information—as organizing the results into different semantic clusters might facilitate users' browsing (Cai et al., 2004). In that work, each image returned by the search engine is represented using three kinds of information: visual information, textual information and link information (text and link data are recovered from the surroundings of the image). The rationale of this approach is based on the fact that the textual and link based representations can reflect the semantic relationship of images better than visual features. The proposed system implements a two level clustering algorithm: in the first level, clustering is conducted using the textual and link representation of images (separately or jointly). In the second level, clustering is conducted on the images assigned to each cluster resulting from the previous stage. In this case, low level visual features are employed to re-organize the images in the first level clusters, so as to group visually similar images and facilitate users' browsing.

A second paper dealing with web image search results clustering was (Gao et al., 2005). In that work, a tripartite graph was used to model the relations among low-level features, images and their surrounding texts. Thus, the method was formulated as a constrained multiobjective optimization problem, which can be efficiently solved by semi-definite programming.

In a similar context, clustering was applied for image sense discrimination for web images retrieved from ambiguous keywords (Loeff, Ovesdotter-Alm, and Forsyth, 2006). Its goal was presenting the image search results in semantically sensible clusters for improved image browsing. To do so, spectral clustering was applied on multimodal features: simple local and global image features, and a bag of words representation of the text in the embedding web page. Multimodal fusion was achieved by combining pairwise object similarities measured on both image and textual features in the graph affinity matrix of the spectral clustering algorithm.

Finally, the notion that each of the multiple modalities in a multimedia collection contributes its own perspective to the collection's organization was the driving force behind the proposal in (Bekkerman and Jeon, 2007). That work presents the Comraf* model, a lightweight version of combinatorial Markov random fields. In Comraf*, multimodal clustering is faced as the problem of simultaneously constructing a partition of each data modality. By clustering modalities simultaneously, the statistical sparseness of the data representation can be overcome, obtaining a dense and smooth joint distribution of the modalities. However, not every modality has to be clustered, as long as the so-called target modality is.

The reader interested in multimedia indexing and retrieval is referred to the recent and complete survey of (Chen, 2006), although it is mainly focused on text plus image modalities.

1.4 Clustering indeterminacies

As mentioned at the end of section 1.1, the accomplishment of a knowledge discovery process requires making several critical decisions at each of its stages, which may have to be re-executed if the user is not satisfied with the evaluation of the mined patterns. In the case that clustering is the data mining task of the knowledge discovery process, these important decisions are often made blindly, due to the unsupervised nature of the clustering problem. Unfortunately, these decisions determine, to a large extent, the effectiveness of the clustering task (Jain, Murty, and Flynn, 1999), so they should not be made unconcernedly.

Thus, obtaining a good quality clustering solution relies heavily on making optimal (or quasi-optimal) decisions at every stage of the KD process. The doubts that seize clustering practitioners at the time of making such decisions are caused by what we call clustering indeterminacies, which mainly concern the selection of i) the way objects are represented, and ii) the clustering algorithm to be applied.

As regards the decision on data representation, ideal features should permit distinguishing objects belonging to different clusters, besides being robust to noise, easy to extract and interpret (Xu and Wunsch II, 2005). In a blind quest for finding such a data representation, the clustering practitioner is struck by the following questions:

– how should the objects be represented? Should we stick to their original representation, select a subset of the original attributes (i.e. feature selection) or transform them into a new feature space (i.e. feature extraction)?

– if the original data representation is subject to a dimensionality reduction process, which should be the dimensionality of the reduced feature space?

– if the original data representation is subject to a feature selection process, which criterion should be followed?

– if feature extraction is applied, which criterion should guide it?

Regrettably, whereas these questions are easy to answer in a supervised classification scenario (e.g. the optimal feature subset can be chosen by maximizing some function of predictive classification performance (Kohavi and John, 1998) or by applying a feature transformation driven by class labels (Torkkola, 2003)), they have no clear or universal answer in an unsupervised context. This is due to the fact that, in clustering, the lack of class labels makes feature selection a necessarily ad hoc and often trial-and-error process (Dy and Brodley, 2004). Moreover, studies comparing the influence of object representations based on feature extraction on clustering performance often come up with contradictory conclusions (Tang et al., 2004; Shafiei et al., 2006; Cobo et al., 2006; Sevillano, Alías, and Socoró, 2007b).

To illustrate the effect and importance of the data representation clustering indeterminacy, in the following paragraphs we present experimental evidence that the selection of a specific object representation can condition the quality of a clustering process to a large extent. In particular, we have represented the objects contained in the Wine and the miniNG data collections using multiple data representations: the original attributes (referred to as baseline) and feature extraction-based representations —obtained by means of Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NMF) and Random Projection (RP)— on a range of distinct dimensionalities. Upon each object representation, we have applied a refined repeated bisecting clustering algorithm based on the correlation similarity measure for obtaining a partition of the data⁵. In all cases, the desired number of clusters k is set to the number of classes defined by the ground truth of each data set.
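A minimal sketch of this representation-generation step is given below (assumed, not the thesis' actual experimental code); it builds several feature extraction-based representations of a hypothetical data matrix with scikit-learn, each of which would then be clustered and evaluated against the ground truth.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF
from sklearn.random_projection import GaussianRandomProjection

X = np.random.default_rng(0).random((178, 13))   # hypothetical objects (rows) with d = 13 features
r = 8                                            # target dimensionality of the reduced feature space

representations = {
    "baseline": X,
    "PCA": PCA(n_components=r).fit_transform(X),
    "ICA": FastICA(n_components=r).fit_transform(X),
    "NMF": NMF(n_components=r, init="nndsvda", max_iter=500).fit_transform(X),  # needs non-negative data
    "RP":  GaussianRandomProjection(n_components=r).fit_transform(X),
}
# Each representation is then fed to the same clustering algorithm, and the resulting
# partitions are compared against the ground truth by means of the normalized mutual information.
```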



Figure 1.5: Illustration of the data representation indeterminacy on the clustering results of the (a) Wine and (b) miniNG data sets clustered by the rbr-corr-e1 algorithm. Each panel plots the φ^(NMI) values attained by the baseline, PCA, ICA, NMF and RP representations as a function of their dimensionality.

The results of these clustering processes are presented in figure 1.5, which displays the normalized mutual information (φ^(NMI)) values between these clustering solutions and the ground truth of each data collection. It can be observed that, in the Wine data set (figure 1.5(a)), the clustering solution obtained operating on the original representation is worse (i.e. it attains a lower φ^(NMI) score) than all but three of the feature-extraction based representations. In particular, the maximum value of φ^(NMI) is attained using an 8-dimensional PCA transformation of the original data. However, these results cannot be generalized. In fact, it is the baseline object representation that yields the best results when the same clustering algorithm is applied on the miniNG data set (see figure 1.5(b)).

If we analyze the data representation that yields the best clustering results across the 12 unimodal data sets described in appendix A.2, we observe a rather even distribution: baseline (23% of the times), LSA (31%), NMF (31%) and RP (15%), which somehow reinforces the notion that no intrinsically superior data representation exists—see appendix B.1 for more experimental results regarding data representation clustering indeterminacies.

Moreover, notice the remarkable influence of the data representation dimensionality on the value of φ^(NMI), i.e. it is not only important to select the right type of representation, but its dimensionality is also a critical factor. Although there exist several approaches for determining the most natural dimensionality of the data (d_nat) –such as the eigenvalue spectrum in PCA (Duda, Hart, and Stork, 2001) or the reconstruction error in NMF (Sevillano et al., 2009)–, it is not trivial to ensure that any clustering algorithm will yield its best performance when operating on the d_nat-dimensional representation of the data.

To sum up, this modest example tries to demonstrate the relevance of the data representation indeterminacy, as an incorrect choice of data representation may ruin the results of a clustering process.

⁵ For a detailed description of the clustering algorithms, the data sets and the data representations employed in the experimental sections of this thesis, see appendices A.1, A.2 and A.3, respectively.

The other major source of indeterminacy is the selection of the particular clustering algorithm to apply. In this sense, there are several critical questions that must be answered:

– what type of algorithm should we apply? Hierarchical or partitional? Hard or soft?

– once the type of clustering algorithm is selected, which specific clustering algorithm should be applied?

– how should the parameters of the clustering algorithm be tuned?

– into how many clusters should the data objects be grouped?

As far as the selection of the type of clustering algorithm is concerned, it depends on the desired shape of the clustering solution. In any case, it is worth noting that soft and hierarchical clustering algorithms can be regarded as a generalization of their hard and partitional counterparts, as the latter can always be obtained from the former.

Moreover, it is a commonplace that no universally superior clustering algorithm exists, as most of the proposals found in the literature have been designed to solve particular problems in specific fields (Xu and Wunsch II, 2005), thus being able to outperform the other existing algorithms in a concrete context, but not in others (Jain, Murty, and Flynn, 1999). This fact has been theoretically analyzed and demonstrated by the impossibility theorem in (Kleinberg, 2002). Thus, unless some domain knowledge enables clustering practitioners to choose a specific algorithm, this selection is often made blindly to a large extent.

Once the algorithm is chosen, its parameters must be set. Again, this is not a trivial choice, as they largely determine its behaviour. Several examples concerning the sensitivity of some popular clustering algorithms to their parameter tuning are easy to find in the literature: for instance, there is no universal method for identifying the initial centroids of k-means. The EM clustering algorithm is highly sensitive to the selection of its initial parameters and the effect of a singular covariance matrix, as it can converge to a local optimum (Xu and Wunsch II, 2005). In combinatorial search based clustering, there exist no theoretic guidelines to select appropriate and effective parameters, while the selection of the graph Laplacians is a major issue that affects the performance of spectral clustering algorithms, just as happens in kernel-based clustering algorithms as regards selecting the width of the Gaussian kernels (Xu and Wunsch II, 2005).

And finally, one has to decide the number of clusters k into which the objects must be grouped, as many clustering algorithms (e.g. partitional) require this value to be passed as one of their parameters. Unfortunately, in most cases the number of classes in a data set is unknown, so it is one more parameter to guess. Moreover, determining the ‘correct’ number of clusters in a data set is an open question: in some cases, equally satisfying (though substantially different) clustering solutions can be obtained with different values of k for the same data, proving that the right value of k often depends on the scale at which the data is inspected (Chakaravathy and Ghosh, 1996).

Notwithstanding, there exist several practical procedures for determining the most suitable value of k. Possibly, the most intuitive approach consists in visualizing the data set on a two-dimensional space, although this strategy is of little use for complex data sets (Xu and Wunsch II, 2005). Additionally, relative cluster validity indices (such as the Davies-Bouldin index (Davies and Bouldin, 1979), Dunn's index (Dunn, 1973), the Calinski-Harabasz index (Calinski and Harabasz, 1974) or the I index (Maulik and Bandyopadhyay, 2002)) can be applied for determining the most appropriate number of clusters by comparing the relative merit of several clustering solutions obtained for distinct values of k (Halkidi, Batistakis, and Vazirgiannis, 2002b)—unfortunately, the performance of these indices is data dependent, which gives rise to a further indeterminacy (Xu and Wunsch II, 2005). And last, in the context of mixture densities based clustering, the number of clusters can be determined through the optimization of criterion functions such as Akaike's Information Criterion (Akaike, 1974) or the Bayesian Inference Criterion (Schwarz, 1978), among others.
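As a concrete illustration of this model selection strategy, the sketch below (assumed, not part of the thesis) sweeps a range of values of k, clusters a hypothetical data set for each of them, and picks the value minimizing the Davies-Bouldin index.

```python
# Selecting k with a relative cluster validity index: lower Davies-Bouldin is better.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.default_rng(0).random((200, 5))      # hypothetical data set
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)                # value of k with the lowest index
print(best_k, scores[best_k])
```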

In this work, the number of clusters is assumed to be known from the start of the clustering process, and k is set to be equal to the number of classes defined by the ground truth of each data set. However, in a real-world scenario, this parameter should be tuned using some of the previously mentioned techniques.

To illustrate the clustering algorithm selection indeterminacy, table 1.1 presents the values of φ^(NMI) (and their percentiles in the global φ^(NMI) distribution of each case) obtained from evaluating the clustering solutions yielded by the graph-cos-i2 and direct-cos-i2 algorithms (graph-based and direct clustering algorithms using the cosine distance, respectively) operating on the baseline data representation of the objects contained in the BBC and PenDigits data sets⁶. As just mentioned, the desired number of clusters k is set to the number of classes in each data set. It can be observed that these two algorithms, despite using the same object similarity measure and optimizing the same criterion function, offer almost opposite performances in these two specific data collections, so no absolute claims on the superiority of either of them can be made—see appendix B.1 for more experimental results regarding the clustering algorithm selection indeterminacy.

φ^(NMI)       direct-cos-i2    graph-cos-i2
BBC           0.807 (P100)     0.603 (P83)
PenDigits     0.658 (P84)      0.829 (P99)

Table 1.1: Illustration of the clustering algorithm indeterminacy on the BBC and PenDigits data sets clustered by the direct-cos-i2 and graph-cos-i2 algorithms.

It is worth noticing that the decision problems caused by the previously described clustering indeterminacies are multiplied in the case of clustering multimodal data. In this context, besides the data representation and clustering algorithm selection indeterminacies, the clustering practitioner must face an additional set of questions with no clear answer, such as:

– should one modality dominate the clustering process? If so, which one, and to which extent?

– should the modalities be fused? If so, how should the fusion process be conducted?

To illustrate the effect of these and the aforementioned indeterminacies in a multimodal clustering scenario, we have conducted several clustering experiments on the multimodal data sets presented in appendix A.2.2.

⁶ Refer to appendices A.1, A.2 and A.3 for a detailed description of the clustering algorithms, the data sets and the data representations employed in the experimental sections of this thesis.


Data set      Best results     Mode #1            Mode #2             Multimodal
CAL500        φ^(NMI)          0.411 (P100)       0.249 (P74)         0.310 (P88)
              algorithm        rbr-cos-i2         graph-cos-i2        graph-jacc-i2
              representation   RP, r=120          PCA, r=40           ICA, r=100
Corel         φ^(NMI)          0.669 (P99)        0.270 (P40)         0.675 (P100)
              algorithm        rbr-corr-i2        rb-cos-e1           rbr-corr-i2
              representation   RP, r=300          NMF, r=400          NMF, r=550
InternetAds   φ^(NMI)          0.430 (P100)       0.258 (P98)         0.319 (P99)
              algorithm        bagglo-cos-slink   agglo-corr-clink    graph-jacc-i2
              representation   RP, r=70           Baseline            NMF, r=150
IsoLetters    φ^(NMI)          0.754 (P93)        0.537 (P67)         0.897 (P100)
              algorithm        rbr-corr-i2        graph-jacc-i2       rbr-corr-i2
              representation   Baseline           ICA, r=12           PCA, r=100

Table 1.2: Illustration of the clustering indeterminacies on the CAL500, Corel, InternetAds and IsoLetters multimodal data sets. Each column presents the top-performing clustering configuration for each separate modality and for the multimodal data representation.

The twenty-eight clustering algorithms employed in this work (see table A.1 for a quick reference) are run on i) each modality of the data set, and ii) the multimodal representation obtained by early feature fusion, as described in appendix A.3.2. In both cases, clustering is run on the baseline and feature extraction based data representations (see section A.3.1 of appendix A).

As a summary of the obtained results and an illustration of the clustering indeterminacies, table 1.2 presents the highest quality clustering results obtained in each case (i.e. when clustering is conducted on either mode –mode #1 and mode #2 columns– and on the multimodal representation), indicating the corresponding value of φ^(NMI), its percentile in the global φ^(NMI) distribution obtained, and the top-performing clustering configuration (i.e. algorithm plus data representation plus dimensionality of the reduced feature space r when needed).

As expected, there is no predominant modality across all the data sets. For the CAL500 collection, the clustering results obtained on mode #1 (text) are clearly superior to the rest. A similar behaviour is observed in the InternetAds data set, where it is also mode #1 (image size and aspect ratio, plus caption and alternate text in this case) the one that yields the best clustering results, although its predominance is not as clear as in CAL500. In contrast, the highest quality clustering solutions are obtained from multimodal representations in the Corel and IsoLetters data sets.

Last but not least, notice the effect of the data representation and clustering algorithm indeterminacies on all the data sets. Again, there is no universally superior clustering algorithm nor data representation that guarantees the best clustering results. For a more detailed description of the experimental results regarding the clustering indeterminacies in multimodal data collections, see appendix B.2.

1.5 Motivation and contributions of the thesis

The main motivation of this thesis is the construction of an efficient multimodal clustering system that performs as autonomously as possible, avoiding the re-execution of the different stages of the knowledge discovery process. As these feedback loops are caused by suboptimal decision-making, our idea is to set clustering practitioners free from the obligation of making such critical decisions in a blind way, obtaining, at the same time, clustering solutions which are robust to the clustering indeterminacies presented in the previous section.

Instead of being forced to blindly select a single clustering configuration, the user is encouraged to use and combine all the data modalities, representations and clustering algorithms at hand, generating as many individual clustering solutions (compiled into a cluster ensemble) as possible. It will be the proposed system which, in a fully unsupervised mode, outputs a consensus clustering solution that will hopefully be comparable to (or even better than) the one achieved using the best clustering configuration among the available ones.

As the informed reader may have guessed, the approach followed in the quest for this goal lies within the consensus clustering framework, which is defined as "the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings" (Strehl and Ghosh, 2002). That is, the data representations, modalities and clustering algorithms employed for generating the individual partitions are not of the system's concern, as it will operate on the individual clustering solutions regardless of the way they were created.
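To give a flavour of what a consensus function does, the sketch below (assumed; it follows the classic evidence accumulation scheme associated with Fred and Jain, not one of the consensus functions proposed in this thesis) combines a toy cluster ensemble through a co-association matrix and re-clusters it.

```python
# Consensus clustering via evidence accumulation: the label vectors of the ensemble are
# summarized in a co-association matrix, which is then clustered hierarchically.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus(ensemble, k):
    E = np.asarray(ensemble)                                   # shape: (number of clusterings, n)
    coassoc = np.mean(E[:, :, None] == E[:, None, :], axis=0)  # fraction of clusterings co-grouping each pair
    dist = 1.0 - coassoc                                       # turn co-association into a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")              # consensus label vector

E = [[1, 1, 2, 2, 3, 3],      # a toy cluster ensemble: three label vectors over n = 6 objects
     [1, 1, 1, 2, 3, 3],
     [2, 2, 2, 1, 3, 3]]
print(consensus(E, k=3))
```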

However, applying consensus clustering on cluster ensembles as a means for obtaining robust clustering solutions is not new—in fact it has been a central or collateral matter in several works (Strehl and Ghosh, 2002; Fred and Jain, 2003; Sevillano et al., 2006a; Fern and Lin, 2008). Nevertheless, this thesis deals with several crucial and, to our knowledge, little addressed issues in this context, such as:

– the computational burden imposed by the use of large cluster ensembles generated by crossing multiple data modalities, representations and clustering algorithms.

– the quality decrease of the consensus clustering solution caused by the wide diversity of the cluster ensemble.

– the application of cluster ensembles to the multimodal clustering problem.

– the definition of methods for building consensus clustering solutions (either crisp or fuzzy) from the outputs of soft clustering algorithms.

As a systematic response to these challenges, this thesis puts forward the following proposals:

– parallelizable hierarchical consensus architectures for creating consensus clustering solutions in a computationally efficient way (see chapter 3).

– fully unsupervised consensus self-refining procedures, so as to drive the quality of the consensus clustering solution near or even above the best available individual clustering configuration —see chapter 4.

– the construction of multimodal cluster ensembles and the application of self-refining hierarchical consensus architectures for robust multimodal clustering (see chapter 5).

– consensus functions based on voting strategies for combining fuzzy partitions contained in soft cluster ensembles —see chapter 6.

Figure 1.6: Block diagram of the robust multimodal clustering system based on self-refining hierarchical consensus architectures.

These contributions can be articulated in a unitary proposal for robust multimodal clustering based on cluster ensembles, a block diagram of which is shown in figure 1.6. The procedure for deriving the partition of a multimodal data collection X according to our proposal goes as follows: firstly, multiple representations of the objects contained in X are created by the application of a set of representational and dimensional diversity factors provided by the user (denoted as dfR and dfD in figure 1.6). Next, a set of either hard or soft clustering algorithms (referred to as the algorithmic diversity factor dfA) is applied on the distinct object representations obtained from the previous step, giving rise to a set of clusterings compiled in the cluster ensemble E. Notice that, up to this point, the only choices made by the user refer to the object representation techniques and clustering algorithms employed for creating the ensemble. As mentioned earlier, the user is encouraged to employ the widest possible range of diversity factors, thus creating maximally diverse clusterings so as to break free from the indeterminacies inherent to clustering. The obviously high computational cost associated with this cluster ensemble generation strategy can be somewhat mitigated considering that it is a highly parallelizable process (Hore, Hall, and Goldgof, 2006).

Subsequently, the process for deriving the partition of the data set X upon the cluster ensemble E starts by applying a consensus clustering procedure. This can be conducted according to either a flat or a hierarchical consensus architecture, a decision that is automatically made by the system based on the characteristics of the data set X, the cluster ensemble E and the consensus function F employed for combining the clusterings in E (which is selected by the user). In case a hierarchical consensus architecture is employed, an additional decision (also made with no user supervision) is the one related to its serial or parallel execution, which ultimately depends on the availability of computational resources. As a result, a consensus clustering solution is obtained, which can be represented either by a consensus label vector λc or by a consensus clustering matrix Λc, depending on whether a crisp or a fuzzy clustering approach is taken. Subsequently, this consensus clustering is subjected to an almost fully autonomous self-refining procedure, which requires the user to specify a percentage threshold (denoted by the symbol '%' in figure 1.6). Finally, the final partition of the data set X is obtained, denoted as λc^final (or Λc^final in the fuzzy case).
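For concreteness, the following minimal Python sketch illustrates this ensemble generation step: a few PCA-reduced representations of different dimensionalities (standing in for the representational and dimensional diversity factors dfR and dfD) are crossed with two clustering algorithms (the algorithmic diversity factor dfA), and every (representation, algorithm) pair contributes one label vector to a hard cluster ensemble. The choice of PCA, k-means and agglomerative clustering, as well as all parameter values, are illustrative assumptions rather than the configuration actually used in this thesis (see appendices A.1 and A.3 for that).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering

def build_hard_ensemble(X, dims=(2, 3, 5), k=3, seed=0):
    """Cross dimensional diversity (PCA to several dimensionalities) with
    algorithmic diversity (two clustering algorithms); each run yields one
    row label vector of the hard cluster ensemble E (shape l x n)."""
    ensemble = []
    for d in dims:
        d = min(d, min(X.shape) - 1)          # keep PCA feasible on tiny data
        Xd = PCA(n_components=d, random_state=seed).fit_transform(X)
        for algo in (KMeans(n_clusters=k, n_init=10, random_state=seed),
                     AgglomerativeClustering(n_clusters=k)):
            ensemble.append(algo.fit_predict(Xd))
    return np.vstack(ensemble)

# Toy usage: n = 9 objects described by 4 features
X = np.random.RandomState(0).rand(9, 4)
E = build_hard_ensemble(X)
print(E.shape)   # (6, 9): l = 6 individual clusterings of the n = 9 objects
```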

Before proceeding with the description of our proposals, the next chapter presents an overview of related work in the area of cluster ensembles.



Chapter 2

Cluster ensembles and consensus clustering

In our quest for overcoming clustering indeterminacies in a multimodal context, the notions of cluster ensembles and consensus clustering play a central role. As mentioned at the end of chapter 1, our strategy for clustering multimodal data in a robust manner is based on the massive creation of multiple partitions of the target data set and the subsequent combination of these into a single consensus clustering solution. Therefore, an appropriate way to start this chapter is by formally defining the two closely related concepts of cluster ensembles and consensus clustering¹.

For starters, a cluster ensemble E is defined as the compilation of the outcomes of l clustering processes. For simplicity, we assume in this work that the l clustering processes group the data into the same number of clusters, namely k, although this is not a strictly necessary constraint². Depending on whether the clustering processes are crisp or fuzzy, E will be a hard or a soft cluster ensemble.

In the former case, E is mathematically defined as an l × n integer-valued matrix compiling l row label vectors λi (∀i ∈ [1, ..., l]) resulting from the respective hard clustering processes (see equation (2.1)).

$$
\mathbf{E} = \begin{pmatrix} \boldsymbol{\lambda}_1 \\ \boldsymbol{\lambda}_2 \\ \vdots \\ \boldsymbol{\lambda}_l \end{pmatrix} =
\begin{pmatrix}
\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1n} \\
\lambda_{21} & \lambda_{22} & \cdots & \lambda_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{l1} & \lambda_{l2} & \cdots & \lambda_{ln}
\end{pmatrix}
\qquad (2.1)
$$

where λij ∈ {1, ..., k} (∀i ∈ [1, ..., l] and ∀j ∈ [1, ..., n]), i.e. each component of each labeling is an integer label identifying to which of the k clusters each of the n objects in the data set is assigned.

¹ In some works, the term 'cluster ensemble' is used to designate the framework for combining multiple partitionings obtained from separate clustering runs into a final consensus clustering (Strehl and Ghosh, 2002; Punera and Ghosh, 2007). In this work, however, we stick to the literal meaning of this expression, and use it to designate the result of gathering several clustering solutions.

² Since our goal is to combine partitions differing only in the way data are represented and clustered, we set the number of clusters k to be equal across the l clustering processes. However, combining clustering solutions with a variable number of clusters is a common practice in the cluster ensembles literature. This can be useful for clustering complex data sets upon simple individual partitions (Fred and Jain, 2005), or for discovering the natural number of clusters in the data set (Strehl and Ghosh, 2002), although these potentialities are not exploited in this work.

Figure 2.1: Scatterplot of an artificially generated two-dimensional toy data set containing n = 9 objects grouped into k = 3 natural clusters. Each object is identified by a numerical label.

For illustration purposes, and resorting to the toy clustering example presented in section 1.2.1, equation (2.2) presents a hard cluster ensemble created by compiling the outcomes of l = 3 independent runs of the k-means clustering algorithm on the two-dimensional data set presented in figure 2.1, which contains n = 9 objects, setting the desired number of clusters k equal to 3.

$$
\mathbf{E} = \begin{pmatrix}
1 & 1 & 1 & 3 & 3 & 3 & 2 & 2 & 2 \\
2 & 2 & 2 & 1 & 1 & 1 & 3 & 3 & 3 \\
2 & 2 & 2 & 3 & 3 & 3 & 1 & 1 & 1
\end{pmatrix}
\qquad (2.2)
$$

In turn, a soft cluster ensemble E is defined as the compilation of the outcomes of l fuzzy clustering processes, and as such, it is mathematically expressed as a kl × n matrix, as presented in equation (2.3) (Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).

$$
\mathbf{E} = \begin{pmatrix} \boldsymbol{\Lambda}_1 \\ \boldsymbol{\Lambda}_2 \\ \vdots \\ \boldsymbol{\Lambda}_l \end{pmatrix}
\qquad (2.3)
$$

where Λi is the k × n real-valued clustering matrix resulting from the ith soft clustering process (∀i ∈ [1, ..., l]).

Continuing with the same toy example, equation (2.4) presents a soft cluster ensemble created by collecting the outcomes of l = 3 independent executions of the fuzzy c-means clustering algorithm on the same data set as before. As k = 3, the first three rows of E correspond to the clustering probability membership matrix output by the first soft clustering process, the next three are the outcome of the second fuzzy clusterer, and so on.

Figure 2.2: Schematic representation of the obtention of a consensus labeling λc by applying a consensus function F on a hard cluster ensemble E containing l = 3 individual label vectors, with n = 9 and k = 3.

$$
\mathbf{E} = \begin{pmatrix}
0.921 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017 \\
0.920 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.920 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017
\end{pmatrix}
\qquad (2.4)
$$

Notice that, just as was observed in section 1.2.1 regarding crisp and fuzzy clustering solutions, soft cluster ensembles can be converted to hard cluster ensembles by assigning each object to the cluster it is most strongly associated to. In fact, by doing so, the soft ensemble in equation (2.4) would be converted to the hard cluster ensemble in equation (2.2). Moreover, notice that the l = 3 components that make up both cluster ensembles are identical, given the symbolic nature of cluster labels.
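A minimal sketch of this hardening step, assuming the kl × n layout of equation (2.3) and a common k across clusterers: within each clusterer's block of k rows, every object is assigned to the row holding its highest membership. The function name and the toy values below are illustrative only.

```python
import numpy as np

def harden_soft_ensemble(E_soft, k):
    """Convert a soft cluster ensemble (kl x n stack of membership matrices)
    into a hard cluster ensemble (l x n) by per-object argmax within each
    clusterer's k-row block. Labels are returned in {1, ..., k} as in (2.2)."""
    kl, n = E_soft.shape
    assert kl % k == 0, "expecting l stacked k x n membership matrices"
    l = kl // k
    hard_rows = [E_soft[i * k:(i + 1) * k].argmax(axis=0) + 1 for i in range(l)]
    return np.vstack(hard_rows)

# Toy usage: l = 2 fuzzy clusterings of n = 4 objects into k = 2 clusters
E_soft = np.array([[0.9, 0.8, 0.2, 0.1],
                   [0.1, 0.2, 0.8, 0.9],
                   [0.3, 0.1, 0.7, 0.6],
                   [0.7, 0.9, 0.3, 0.4]])
print(harden_soft_ensemble(E_soft, k=2))
# [[1 1 2 2]
#  [2 2 1 1]]
```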

As for consensus clustering, it is defined as the process of obtaining a consolidated clustering solution through the application of a consensus function F on a cluster ensemble E (Strehl and Ghosh, 2002). In other words, consensus clustering can be regarded as the problem of combining several clustering solutions without accessing the features representing the clustered objects. Figure 2.2 depicts a schematic representation of a consensus clustering process conducted on the hard cluster ensemble resulting from our toy example. In this case, the result of the consensus clustering process is a consensus label vector λc which, quite obviously, represents the same partition as the individual label vectors that compose the cluster ensemble. However, in a real context, a higher degree of diversity among the clustering solutions embedded in the cluster ensemble can be expected (which, in fact, is desirable), a situation consensus clustering algorithms take advantage of for consolidating richer consensus clustering solutions (Pinto et al., 2007).

Quite obviously, the design of the consensus function F is a central issue as regards consensus clustering. Most works in the consensus clustering literature focus on combining the outcomes of hard clustering processes (as in the example depicted in figure 2.2), although some consensus functions can be applied to either hard or soft cluster ensembles indistinctly, possibly after introducing some minor modifications (Strehl and Ghosh, 2002; Fern and Brodley, 2004; Lange and Buhmann, 2005). However, little effort has been devoted to the design of specific consensus functions for soft cluster ensembles that generate fuzzy consensus clustering solutions (Dimitriadou, Weingessel, and Hornik, 2002; Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).

Regardless of whether the cluster ensemble is hard or soft, combining the results of several clustering processes has multiple applications, a good description of which can be found in (Strehl and Ghosh, 2002). In a nutshell, consensus clustering is useful for:

– knowledge reuse: in some scenarios, one may want to create a partition of a set of objects, but access to the original data may be restricted due to copyright or privacy reasons (customer databases are the most prototypical examples of this type of situation). However, if a set of legacy partitions of the data exists (e.g. segmentations of a customer database based on distinct criteria, such as residence, purchasing patterns, age, etc.), consensus clustering provides a means for reconciling the knowledge contained in those legacy clusterings.
– distributed clustering: due to security or operational reasons, there exist situations in which the data to be clustered are scattered across different locations. In this context, as an alternative to gathering and processing all the data at one site (which can be unfeasible, for instance, due to storage costs), the data available at each location would be subject to a clustering process, and the label vectors obtained would be combined by means of consensus clustering, yielding a consolidated classification of the data.
– robust clustering: in this case, the goal is to obtain a consensus clustering solution that improves the quality of the component clusterings, based on the fact that if the distinct clustering processes disagree, combining their outcomes may offer additional information and discriminatory power, thus obtaining a better combined clustering, closer to a hypothetical true classification (Pinto et al., 2007).

It is in this latter application that consensus clustering can be more clearly regarded as the unsupervised counterpart of classifier committees, as the objective of both strategies is to combine the outcomes of several classification processes aiming to improve the quality of the component classifiers (Dietterich, 2000). However, the purely symbolic nature of the labels returned by unsupervised classifiers makes consensus clustering a more challenging task. Possibly due to this fact, consensus clustering has historically been far less popular than classifier committees, and it has only begun to draw considerable attention from researchers during the last decade.

In the quest for obtaining good quality consensus clustering solutions, the design of both the cluster ensemble and the consensus function is of critical importance. Although having a cluster ensemble is always necessary in order to conduct consensus clustering, some works focus mainly on the design of the consensus function, relegating the construction of the cluster ensemble to a secondary role, and vice versa. Given the importance of both elements, we split the revision of the related work in this field into two separate parts, devoting section 2.1 to the previous work regarding the construction of cluster ensembles and section 2.2 to an overview of the existing approaches to the design of consensus functions.



2.1 Related work on cluster ensembles

Our aim in this section is to review the strategies applied in the literature as regards the construction of cluster ensembles, given its influence on the results of the consensus clustering process. Two alternative approaches have been traditionally followed in this context, differing in the number of distinct clustering algorithms used for generating the individual partitions in the ensemble.

The first cluster ensemble creation strategy consists of compiling the outcomes of multiple runs of a single clustering algorithm, which gives rise to what is known as a homogeneous cluster ensemble (Hadjitodorov, Kuncheva, and Todorova, 2006). In this case, the diversity of the ensemble components can be induced by several means, often in a combined manner:

– application of a stochastic clustering algorithm: this strategy relies on the fact that the outcome of a stochastic clustering algorithm depends on how its parameters are adjusted. For instance, diverse clustering solutions can be obtained by the random initialization of the starting centroids of k-means (Fred, 2001; Fred and Jain, 2002a; Fred and Jain, 2003; Dimitriadou, Weingessel, and Hornik, 2001; Greene et al., 2004; Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Kuncheva, Hadjitodorov, and Todorova, 2006; Li, Ding, and Jordan, 2007; Nguyen and Caruana, 2007; Ayad and Kamel, 2008; Fern and Lin, 2008) or fuzzy c-means (Dimitriadou, Weingessel, and Hornik, 2002), or by varying the initial settings of EM clustering (Punera and Ghosh, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b).
– random number of clusters: in this case, at each run of the clustering algorithm, the number of clusters to be found is set randomly (Fred and Jain, 2002b; Fred and Jain, 2005; Topchy, Jain, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova, 2006; Hadjitodorov and Kuncheva, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b; Ayad and Kamel, 2008). In general terms, this number of clusters is usually set to be much larger than the expected number of categories in the data set (Dimitriadou, Weingessel, and Hornik, 2001; Fred and Jain, 2002a), being often selected at random from a predefined interval (Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006). A minimal sketch combining this mechanism with random initialization is given after this list.

– distinct object representations: another source of diversity lies in the way objects are represented. Indeed, as we showed in section 1.4, running the same clustering algorithm on distinct representations of the same data set often leads to pretty diverse clustering solutions. Allowing for this fact, cluster ensembles have been created by running a single clustering algorithm on different data representations generated by random feature selection (Agogino and Tumer, 2006; Hadjitodorov and Kuncheva, 2007; Fern and Lin, 2008), random feature extraction (Greene et al., 2004; Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Hadjitodorov and Kuncheva, 2007; Fern and Lin, 2008) or deterministic feature extraction (Sevillano et al., 2006a; Sevillano et al., 2006b; Sevillano et al., 2007a; Sevillano, Alías, and Socoró, 2007b).
– data subsampling: the creation of multiple clustering solutions upon distinct random subsamples of the data set has been applied as a means for generating diverse cluster ensembles (Fischer and Buhmann, 2003; Dudoit and Fridlyand, 2003; Minaei-Bidgoli, Topchy, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova, 2006; Punera and Ghosh, 2007).

– weak clustering: another approach to the generation of homogeneous cluster ensembles is the repeated application of computationally cheap and conceptually simple clustering procedures that, although yielding poor clustering solutions by themselves (this is why they are said to be weak), may lead to better data clustering if combined. This type of strategy is of special interest when clustering high dimensional and/or large data collections, as deriving multiple partitions by traditional means may become too costly (Fern and Brodley, 2003). Examples of this include using random hyperplanes for splitting the data (Topchy, Jain, and Punch, 2003) or prematurely halted executions of k-means (Hadjitodorov and Kuncheva, 2007).
– noise injection: the random perturbation of the representation of the objects (Hadjitodorov, Kuncheva, and Todorova, 2006) or of the labels contained in the individual clustering solutions (Hadjitodorov and Kuncheva, 2007) through noise injection has also been applied in a few research works, although these strategies constitute a far less natural way of creating diverse cluster ensembles compared to the previous ones.
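As announced above, the following minimal sketch illustrates the first two diversity-inducing mechanisms of the list (random initialization of k-means and a random number of clusters drawn from a predefined interval). The interval, the algorithm and the parameter values are illustrative assumptions rather than settings taken from the cited works.

```python
import numpy as np
from sklearn.cluster import KMeans

def homogeneous_ensemble(X, l=10, k_range=(2, 8), seed=0):
    """Build a homogeneous hard cluster ensemble: l runs of the same algorithm
    (k-means), each with a random initialization and a random number of
    clusters drawn from k_range."""
    rng = np.random.RandomState(seed)
    rows = []
    for _ in range(l):
        k = rng.randint(k_range[0], k_range[1] + 1)    # random number of clusters
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=rng.randint(10**6))    # random initialization
        rows.append(km.fit_predict(X))
    return np.vstack(rows)   # l x n; rows may use different numbers of labels

X = np.random.RandomState(1).rand(50, 2)
E = homogeneous_ensemble(X)
print(E.shape)   # (10, 50)
```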

The second approach for creating cluster ensembles consists of applying several distinct clustering algorithms for generating the individual components of the ensemble, which gives rise to what are known as heterogeneous cluster ensembles. If clustering algorithms with substantially different biases are employed, cluster ensembles with a high degree of diversity can be obtained. This strategy has been applied in several works, such as (Strehl and Ghosh, 2002; Lange and Buhmann, 2005; Gonzàlez and Turmo, 2006; Gionis, Mannila, and Tsaparas, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b).

Notice that the strategies used for creating homogeneous and heterogeneous cluster ensembles can be combined so as to create even more diverse ensembles, as in (Sevillano et al., 2007c), where a set of clustering algorithms is applied on different representations of the objects in the data set (obtained by means of multiple feature extraction techniques with distinct dimensionalities). In this work, we will follow this approach as regards the generation of cluster ensembles, using the clustering algorithms and object representations described in appendices A.1 and A.3, as our goal is to overcome the indeterminacies resulting from the selection of a particular clustering configuration.

There exist several recent works in the literature dealing with the design of the cluster ensemble. In general terms, they can be divided into two categories: i) those works focused on analyzing which strategies should be followed for creating cluster ensemble components that give rise to good quality consensus clustering solutions, and ii) those centered on obtaining a good quality consensus clustering given a particular cluster ensemble.

Among the first group, we highlight the works by Kuncheva and Hadjitodorov. In (Hadjitodorov, Kuncheva, and Todorova, 2006), the authors analyze the diversity of the individual partitions composing a hard cluster ensemble and its effect on the quality of the consensus clustering. To do so, several measures for evaluating the diversity of a cluster ensemble are proposed and evaluated. Moreover, such measures are employed in the derivation of a procedure for selecting the candidate with the median diversity among a population of cluster ensembles, a criterion that leads to the obtention of equal or better consensus clustering solutions than those obtained on arbitrarily chosen cluster ensembles.

The notion that moderately diverse cluster ensembles lead to good quality consensus is reinforced by the experimental results presented in (Hadjitodorov and Kuncheva, 2007), where the authors apply a standard genetic algorithm for driving the selection of the cluster ensemble components, which are generated by random feature extraction and selection, weak clusterers, or a random number of clusters, among others. Unfortunately, the fitness function driving the genetic algorithm used to select the best cluster ensemble is the classification accuracy of the respective ensemble with respect to the correct object labels (ground truth), which makes this strategy of limited use in practice.

Another interesting work that addresses the heuristics of the cluster ensemble construction process is (Kuncheva, Hadjitodorov, and Todorova, 2006). In this sense, the authors recommend creating individual partitions with a variable number of clusters as a means for obtaining good quality consensus labelings.

As mentioned earlier, an alternative way of obtaining a good quality consensus clustering solution is based not on designing the cluster ensemble components according to certain heuristics, but on refining its contents. The rationale of such strategies is based on the fact that the quality of the consensus clustering solution is penalized by the inclusion of poor individual clustering solutions in the ensemble. For this reason, it seems logical to develop techniques capable of discarding such cluster ensemble components prior to conducting the consensus clustering process.

In this sense, the application of quality and/or diversity criteria for selecting a small subset of a large cluster ensemble was evaluated in (Fern and Lin, 2008) as a means for obtaining a consensus clustering solution that equals or betters the one that would be obtained using the whole ensemble. Pursuing the same goal, the authors of (Goder and Filkov, 2008) propose creating smaller subsets of the cluster ensemble that will yield better consensus. These mini-cluster ensembles are generated by clustering the individual partitions of the cluster ensemble using a hierarchical agglomerative average-link clustering algorithm.

2.2 Related work on consensus functions

The goal of this section is to review the state of the art in the area of consensus clustering. Although a considerable corpus of theoretical work on combining classifications was developed in the 80s and earlier (e.g. (Mirkin, 1975; Barthelemy, Laclerc, and Monjardet, 1986; Neumann and Norton, 1986)), it was not until the start of the present decade that this field experienced a significant flourishing of activity.

Despite this relatively recent awakening, multiple approaches to the combination of several clusterings can be found in the literature. In general terms, consensus clustering can be posed as an optimization problem whose goal is to minimize a cost function measuring the dissimilarity between the consensus clustering solution and the partitions in the cluster ensemble; often, the cost function is expressed in terms of the number of pairwise co-clustering disagreements between the individual partitions in the cluster ensemble and the consensus clustering solution. Unfortunately, finding the partition that minimizes the proposed symmetric difference distance metric (i.e. the so-called median partition) is an NP-hard problem (Goder and Filkov, 2008), and this is the reason why it is necessary to resort to distinct heuristics so as to conduct clustering combination.

Aiming to provide the reader with a global perspective on the distinct existing approaches in this field, table 2.1 presents a taxonomy of some of the most relevant consensus functions according to the theoretical foundations that guide the consensus process. Notice that some consensus functions appear under more than one theoretical approach, as sometimes the limits between them are somewhat vague.


Theoretical approach: Consensus functions
Voting: VMA (Dimitriadou, Weingessel, and Hornik, 2002); BagClust1 (Dudoit and Fridlyand, 2003); URCV, RCV, ACV (Ayad and Kamel, 2008); also in (Fischer and Buhmann, 2003; Greene and Cunningham, 2006)
Graph partitioning: CSPA, HGPA, MCLA (Strehl and Ghosh, 2002); HBGF (Fern and Brodley, 2004); BALLS (Gionis, Mannila, and Tsaparas, 2007)
Object co-association: EAC (Fred and Jain, 2005); CSPA (Strehl and Ghosh, 2002); BagClust2 (Dudoit and Fridlyand, 2003); IPC (Nguyen and Caruana, 2007); BALLS (Gionis, Mannila, and Tsaparas, 2007); Majority Rule, CC Pivot (Goder and Filkov, 2008); also in (Greene et al., 2004)
Categorical clustering: QMI (Topchy, Jain, and Punch, 2003); ITK (Punera and Ghosh, 2007)
Probabilistic: EM (Topchy, Jain, and Punch, 2004); PLA (Lange and Buhmann, 2005); also in (Long, Zhang and Yu, 2005; Li, Ding and Jordan, 2007)
Reinforcement learning: (Agogino and Tumer, 2006)
Similarity as data: ALSAD, KMSAD, SLSAD (Kuncheva, Hadjitodorov, and Todorova, 2006)
Centroid based: IVC, IPVC, IPC (Nguyen and Caruana, 2007); also in (Hore, Hall, and Goldgof, 2006)
Correlation clustering: Agglomerative, Furthest, LocalSearch (Gionis, Mannila, and Tsaparas, 2007)
Search techniques: SA, BOEM (Goder and Filkov, 2008)
Cluster ensemble component selection: BestClustering (Gionis, Mannila, and Tsaparas, 2007); BOK (Goder and Filkov, 2008); also in (Fern and Lin, 2008)

Table 2.1: Taxonomy of consensus functions according to their theoretical basis.

Throughout the following paragraphs, the main features of these consensus functions are described.

2.2.1 Consensus functions based on voting

The main idea underlying consensus functions based on voting strategies is the notion that objects assigned to a particular cluster by many partitions in the ensemble should also be located in that cluster according to the consensus clustering solution. One obvious way to achieve this goal is to consider cluster labels as votes, thus consolidating different clusterings by means of voting procedures. However, due to the symbolic nature of clusters (caused by the unsupervised nature of the clustering problem), it is necessary to disambiguate the clusters across the l components of the cluster ensemble prior to voting.
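As a simplified illustration of this two-step recipe (cluster disambiguation followed by voting), the sketch below aligns every partition of a hard ensemble to a reference partition with the Hungarian method (used for this purpose in some of the works reviewed below) and then applies plurality voting. It assumes all partitions share the same k and is a generic stand-in, not a reimplementation of any specific consensus function discussed in this section.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_to_reference(labels, ref, k):
    """Relabel `labels` so that its clusters best match those of `ref`
    (maximum overlap), solving the correspondence with the Hungarian method."""
    overlap = np.zeros((k, k))
    for c in range(k):
        for r in range(k):
            overlap[c, r] = np.sum((labels == c) & (ref == r))
    rows, cols = linear_sum_assignment(-overlap)       # maximize total overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping[c] for c in labels])

def plurality_voting_consensus(E, k):
    """Voting-based consensus on a hard ensemble E (l x n) with labels 0..k-1:
    align every partition to the first one, then take the per-object mode."""
    ref = E[0]
    aligned = np.vstack([align_to_reference(row, ref, k) for row in E])
    return np.array([np.bincount(aligned[:, j], minlength=k).argmax()
                     for j in range(aligned.shape[1])])

# Toy ensemble from equation (2.2), shifted to 0-based labels
E = np.array([[1, 1, 1, 3, 3, 3, 2, 2, 2],
              [2, 2, 2, 1, 1, 1, 3, 3, 3],
              [2, 2, 2, 3, 3, 3, 1, 1, 1]]) - 1
print(plurality_voting_consensus(E, k=3))   # [0 0 0 2 2 2 1 1 1]
```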

One of the pioneering works in voting-based consensus clustering was the voting-merging algorithm (VMA) of Dimitriadou, Weingessel, and Hornik (2001). In that work, cluster disambiguation is conducted by matching those clusters sharing the highest percentage of objects, iterating this cluster matching process across all the partitions in the cluster ensemble. As a result of the voting step, a fuzzy partition of the data set is obtained. Subsequently, a merging procedure is conducted on this soft partition, fusing those clusters which are closest to each other. By imposing a stopping criterion based on the sureness of the obtained clusters, this merging process is capable of finding the natural number of clusters in the data set. In (Dimitriadou, Weingessel, and Hornik, 2002), the authors define the consensus clustering solution as the one minimizing the average square distance with respect to all the partitions in the cluster ensemble. They demonstrate that obtaining such a consensus clustering boils down to finding the optimal re-labeling of the clusters of all the individual clusterings, which becomes an unfeasible problem if approached directly, since it requires an enumeration of all possible permutations. Therefore, they resort to the VMA consensus function of (Dimitriadou, Weingessel, and Hornik, 2001) for finding an approximate solution to the problem, extending its application to soft cluster ensembles.

One of the two consensus functions proposed in (Dudoit and Fridlyand, 2003), called BagClust1, is based on applying plurality voting on the labelings in the cluster ensemble after a label disambiguation process based on measuring the overlap between clusters. The generation of the cluster ensemble components follows a resampling strategy similar to bagging, aiming to reduce the variability in the partitioning results via consensus clustering.

A related proposal was the one presented in (Fischer and Buhmann, 2003). In that work, consensus clustering is viewed as a means for improving the quality and reliability of the results of path-based clustering, applying bagging for creating the hard cluster ensemble. The consensus clustering solution is obtained through a maximum likelihood mapping in which the label permutation problem is solved by means of the Hungarian method (Kuhn, 1955), which somehow resembles the application of plurality voting on the disambiguated individual partitions in the cluster ensemble (Ayad and Kamel, 2008). Moreover, a related reliability measure chooses the number of clusters with the highest stability as the preferable consensus solution.

In (Greene and Cunningham, 2006), a majority voting strategy was applied for generating the consensus clustering solution, after disambiguating the clusters using the Hungarian algorithm. An additional interest of that work is that it was one of the first research efforts that considered the problem of creating and combining a large number of clustering solutions in the context of high dimensional data sets (such as text document collections). Indeed, the authors point out that using large ensembles boosts computational cost, while small ensembles tend to produce unstable consensus clustering solutions. In this context, the authors propose basing the cluster ensemble construction and the consensus clustering tasks on a prototype reduction technique that allows representing the whole data set by means of a minimal set of objects, ensuring that the clustering results will approximate those that would be obtained on the original data set. By doing so, the final clustering solution can be extended to those objects that have been left out of the reduced data set representation while alleviating the overall computational cost of the whole process. In particular, the reduced version of the cluster ensemble is obtained by projecting the pairwise object similarity matrix by means of a kernel matrix.

The recent work of (Ayad and Kamel, 2008) presented a set of consensus functions based on cumulative voting (named URCV, RCV and ACV) whose time complexity scales linearly with the size of the data set. Another interesting feature is their capability of combining crisp partitions with different numbers of clusters, although the desired number of clusters k in the consensus clustering solution is a necessary parameter for the execution of these consensus functions. The proposals presented in that work are based on the computation of a probabilistic mapping for solving the cluster correspondence problem (instead of the classic one-to-one cluster mapping), which allows combining partitions with different numbers of clusters while avoiding the addition of dummy clusters as in (Dimitriadou, Weingessel, and Hornik, 2002). In particular, three ways of deriving such a probabilistic mapping based on the idea of cumulative voting are presented. The construction of the consensus clustering solution is a two-stage procedure: firstly, based on the cumulative vote mapping, a tentative consensus is derived as a summary of the cluster ensemble maximizing the information content in terms of entropy. Secondly, the final consensus clustering solution is extracted by applying an agglomerative clustering algorithm that minimizes the average generalized Jensen-Shannon divergence within each cluster.

As already mentioned, solving the cluster correspondence problem paves the way for the application of voting strategies for combining the outcomes of multiple clustering processes. This issue is the central focus of (Boulis and Ostendorf, 2004), which presented several methods for finding the correspondence between the clusters of the individual partitions in the cluster ensemble. Two of these proposals are based on linear optimization techniques, which are applied on an objective function that measures the degree of agreement among clusters. In contrast, the third cluster correspondence method is based on Singular Value Decomposition, and it sets cluster correspondences based on cluster correlation. All these methods operate on a common space where the clusters of the distinct partitions (which can be either crisp or fuzzy) are represented by means of cluster co-association matrices.

2.2.2 Consensus functions based on graph partitioning

The work by Strehl and Ghosh on consensus clustering based on graph partitioning is probably one of the most classic references in the field of cluster ensembles (Strehl and Ghosh, 2002). To our knowledge, they were the first to formulate the consensus clustering problem in an information theoretic framework (i.e. the consensus clustering solution should be the one maximizing the mutual information with respect to all the individual partitions in the cluster ensemble), a path followed by other authors in subsequent works (Fred and Jain, 2003). In view of its prohibitive cost when formulated as a combinatorial optimization problem in terms of shared mutual information, the authors propose three clustering combination heuristics based on deriving a hypergraph representation of the cluster ensemble, all of which require the desired number of clusters k as one of their parameters. The first consensus function (called Cluster-based Similarity Partitioning Algorithm or CSPA) induces a pairwise object similarity measure from the cluster ensemble (as in (Fred and Jain, 2002a)), obtaining the consensus partition by reclustering the objects with the METIS graph partitioning algorithm (Karypis and Kumar, 1998). For this reason, we have enclosed the CSPA consensus function in both the graph partitioning and object co-association categories in table 2.1. In the second clustering combiner proposed in (Strehl and Ghosh, 2002), named HGPA for HyperGraph Partitioning Algorithm, the cluster ensemble problem is posed as the partitioning of a hypergraph (where hyperedges represent clusters) into k unconnected components of approximately the same size, cutting a minimum number of hyperedges. The third consensus function (Meta-CLustering Algorithm or MCLA) views the clustering integration process as a cluster correspondence problem that is solved by identifying and consolidating groups of clusters (meta-clusters), which is done by applying graph-based clustering to the hyperedges of the hypergraph representation of the cluster ensemble. In (Strehl and Ghosh, 2002), the authors apply the proposed consensus functions on hard cluster ensembles, suggesting that they could be extended to a fuzzy clustering integration scenario. Such extensions (in particular, the soft versions of the CSPA and MCLA consensus functions, sCSPA and sMCLA, respectively) were introduced in (Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).
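To make the hypergraph representation more tangible, the sketch below builds the binary cluster membership indicator matrix of a hard ensemble (one column per cluster of every partition, i.e. one hyperedge each) and derives from it the CSPA-style pairwise similarity, namely the fraction of partitions in which two objects are co-clustered. The subsequent METIS-based partitioning of this similarity graph is not reproduced here, and the function names and toy data are illustrative assumptions.

```python
import numpy as np

def indicator_matrix(E, k):
    """Binary membership matrix H (n x kl): column i*k + c indicates the
    objects assigned to cluster c by the i-th partition (one hyperedge each)."""
    l, n = E.shape
    H = np.zeros((n, l * k), dtype=int)
    for i in range(l):
        for j in range(n):
            H[j, i * k + E[i, j]] = 1
    return H

def cspa_similarity(E, k):
    """CSPA-style pairwise similarity: fraction of the l partitions in which
    each pair of objects is co-clustered, computed as (1/l) * H @ H.T."""
    l = E.shape[0]
    H = indicator_matrix(E, k)
    return H @ H.T / l

E = np.array([[0, 0, 0, 2, 2, 2, 1, 1, 1],
              [1, 1, 1, 0, 0, 0, 2, 2, 2],
              [1, 1, 1, 2, 2, 2, 0, 0, 0]])
S = cspa_similarity(E, k=3)
print(S[0, 1], S[0, 3])   # 1.0 (always together) and 0.0 (never together)
```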

The clustering combination problem was also formulated as a graph partitioning problem in (Fern and Brodley, 2004). In particular, a bipartite graph is built from a hard cluster ensemble, although the authors suggest that the same method can be applied for combining soft clustering solutions after introducing minor modifications to the proposed consensus function, which is called HBGF for Hybrid Bipartite Graph Formulation. As in previous graph partitioning approaches to clustering combination, the desired number of clusters must be set a priori and passed as a parameter of the consensus function. In contrast, HBGF considers object and cluster similarity simultaneously when creating the consensus clustering solution, an issue not considered by other graph partition based consensus functions such as CSPA and MCLA (Strehl and Ghosh, 2002).

More recently, the BALLS consensus function (Gionis, Mannila, and Tsaparas, 2007) operates on a graph representation of the pairwise object co-dissociation matrix, viewing its vertices as the objects in the data set, its edges being weighted by the pairwise object distances. The rationale of the consensus clustering creation process is the iterative construction of consensus clusters from compact and relatively isolated sets of close vertices, which are then removed from the graph.

2.2.3 Consensus functions based on object co-association measures

The approach to consensus clustering based on object co-association measures relies on the assumption that objects belonging to a natural cluster are likely to be co-located in the same cluster by the different clusterings of the cluster ensemble. Therefore, pairwise object co-occurrences are deemed as votes, which are aggregated in an n × n object co-association matrix (where n is the number of instances contained in the data collection). A great advantage of this type of method is that it avoids cluster disambiguation processes, as the cluster ensemble is inspected object-wise, rather than cluster-wise. However, a downside of consensus functions based on object co-association is that their time and space complexities scale quadratically with n, thus making their application on large data sets highly costly or even unfeasible.

One of the pioneering works on the combination of hard clustering solutions based on object co-association metrics is the evidence accumulation (EAC) approach. In the original form of the evidence accumulation consensus function, the consensus clustering solution is obtained by applying a simple majority voting scheme on the co-association matrix (Fred, 2001). In subsequent versions, consensus is derived by applying the single-link hierarchical clustering algorithm on the object co-association matrix, regarding it as a measure of the similarity between objects (Fred and Jain, 2002a); a virtually identical proposal is found in a contemporary work (Zeng et al., 2002). In (Fred and Jain, 2003), the evidence accumulation approach is formulated in an information-theoretic framework, defining the optimal consensus clustering solution as the one maximizing the sum of normalized mutual information (φ(NMI)) with respect to all the partitions in the cluster ensemble. The authors prove that, by maximizing the number of shared objects in the consolidated clusters, the EAC consensus function maximizes the aforementioned information theoretical objective function, although reaching its global optimum is not ensured in all situations. Moreover, cutting the dendrograms resulting from the application of the single-link clustering on the co-association matrix at the highest lifetime level leads to a minimization of the variance of the average φ(NMI), which guarantees the robustness of the clustering solution to small variations in the composition of the cluster ensemble; furthermore, this also avoids making assumptions on the number of clusters, a significant advantage with respect to other consensus functions. A compendium on the evidence accumulation consensus clustering approach is presented in (Fred and Jain, 2005), extending the previous consensus functions through the application of other hierarchical clustering algorithms on the pairwise object co-association matrix.
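A minimal sketch of the evidence accumulation idea under these definitions: pairwise co-occurrence votes are accumulated into an n × n co-association matrix, which is then turned into a dissimilarity and fed to single-link hierarchical clustering. For brevity, the dendrogram is cut into a fixed number of clusters k rather than at the highest lifetime level, so this is a simplification of EAC, not a faithful reimplementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_matrix(E):
    """n x n matrix whose (i, j) entry is the fraction of the l partitions
    in E (l x n hard ensemble) that place objects i and j in the same cluster."""
    l, n = E.shape
    C = np.zeros((n, n))
    for labels in E:
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / l

def eac_consensus(E, k):
    """Evidence-accumulation-style consensus: single-link clustering of the
    co-association matrix, here simply cut into k consensus clusters."""
    C = coassociation_matrix(E)
    D = 1.0 - C                       # turn co-association into a dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="single")
    return fcluster(Z, t=k, criterion="maxclust")

E = np.array([[1, 1, 1, 3, 3, 3, 2, 2, 2],
              [2, 2, 2, 1, 1, 1, 3, 3, 3],
              [2, 2, 2, 3, 3, 3, 1, 1, 1]])
print(eac_consensus(E, k=3))   # three consensus clusters, e.g. [1 1 1 2 2 2 3 3 3]
```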

The clustering of high dimensional data is the main motivation of the work presented in (Fern and Brodley, 2003). In this scenario, Random Projection (RP) is an efficient dimensionality reduction technique, although it often gives rise to highly unstable clustering results. In order to reduce this variability, the authors propose creating cluster ensembles by compiling partitions resulting from distinct RP runs, combining them using a consensus function very similar to EAC, as it applies an agglomerative clustering algorithm on an object similarity matrix.

One of the two consensus functions presented by Dudoit and Fridlyand (2003), named BagClust2, resembles evidence accumulation, as it builds a pairwise object dissimilarity matrix which is subject to a partitioning process for obtaining the consensus clustering. However, BagClust2 and EAC differ in that the former requires that the desired number of clusters be passed as a parameter to the consensus function (the same happens with BagClust1).

In (Greene et al., 2004), consensus clustering was conducted by means of variants of the EAC consensus functions using distinct hierarchical clustering algorithms (i.e. single-link, complete-link and average-link) for partitioning the pairwise object co-association matrix, as proposed in (Fred and Jain, 2005). However, the central matter of study in that work is the analysis of the diversity of the cluster ensemble as a factor determining the quality of the consensus clustering. In this sense, the authors focused on random techniques for introducing diversity in the cluster ensemble, such as random subspacing, random algorithm initialization, random number of clusters or random feature projection.

A related work is the Majority Rule consensus function of (Goder and Filkov, 2008), which is also based on clustering the pairwise object co-dissociation matrix. This can be done by simply setting a dissimilarity threshold, as in the first version of EAC (Fred, 2001), or by applying the average-link hierarchical clustering algorithm, as in the latest versions of EAC (Fred and Jain, 2005).

Moreover, there exist several consensus functions that make indirect use of pairwise object co-association (or co-dissociation) matrices, even though the way the consensus clustering is obtained differs from that of EAC. Examples of this include some graph partition-based consensus functions, such as CSPA (Strehl and Ghosh, 2002) and BALLS (Gionis, Mannila, and Tsaparas, 2007), the Iterative Pairwise Consensus (IPC) (Nguyen and Caruana, 2007) (a consensus function based on cluster centroids in which objects are iteratively reassigned to the clusters of the consensus partition according to their similarity), or the CC Pivot consensus function (Goder and Filkov, 2008), which obtains the consensus partition by conducting an iterative pivoting on the object dissimilarity matrix.

2.2.4 Consensus functions based on categorical clustering

A different approach to consensus clustering is the one related to categorical clustering, which basically consists of transforming the contents of the cluster ensemble into quantitative features that represent the objects, and subsequently clustering them according to this novel representation, thus obtaining the consensus partition. The QMI (Quadratic Mutual Information) consensus function of (Topchy, Jain, and Punch, 2003) posed the problem of combining the partitions contained in a hard cluster ensemble in an information theoretic framework, and consists of applying the k-means clustering algorithm on this new feature space, which forces the user to set the desired number of clusters k in advance.

In (Punera and Ghosh, 2007), a novel fuzzy consensus function based on the Information Theoretic K-means (ITK) algorithm was presented. Its rationale follows a similar approach to that of (Topchy, Jain, and Punch, 2003). In this case, though, consensus clustering is conducted on soft cluster ensembles (i.e. the compilation of the outcomes of multiple fuzzy clustering processes), so each object in the data set is represented by means of the concatenated posterior cluster membership probability distributions corresponding to each one of the l fuzzy partitions in the cluster ensemble. Thus, using the Kullback-Leibler divergence (KLD) between those probability distributions as a measure of the distance between objects, the k-means algorithm is applied so as to obtain the consensus clustering solution. Note that the ITK consensus function is capable of combining fuzzy partitions with a variable number of clusters, while producing a crisp consensus clustering solution. Moreover, this consensus function allows assigning distinct weights to each clustering in the cluster ensemble, which can be useful for the user to express his/her confidence in the quality of some individual clusterings.
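Following this categorical-clustering view, the sketch below represents every object by its concatenated posterior cluster membership probabilities (one column of the soft ensemble) and reclusters those feature vectors. Plain Euclidean k-means is used here as a simplifying assumption; the actual ITK consensus function relies on an information-theoretic k-means driven by the Kullback-Leibler divergence. The function name and toy values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def categorical_consensus(E_soft, k, seed=0):
    """Treat each object's concatenated posteriors (a kl-dimensional column of
    the soft ensemble) as its new feature vector and recluster with k-means,
    yielding a crisp consensus clustering of the n objects."""
    features = E_soft.T                  # n x kl: one row of posteriors per object
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(features)

# Toy soft ensemble: l = 2 fuzzy clusterings, k = 2, n = 4 objects (kl x n)
E_soft = np.array([[0.9, 0.8, 0.2, 0.1],
                   [0.1, 0.2, 0.8, 0.9],
                   [0.3, 0.1, 0.7, 0.6],
                   [0.7, 0.9, 0.3, 0.4]])
print(categorical_consensus(E_soft, k=2))   # e.g. [0 0 1 1]
```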

2.2.5 Consensus functions based on probabilistic approaches

Consensus clustering has also been approached from a probabilistic perspective. One of the pioneering works in this direction was the Expectation-Maximization (EM) consensus function proposed in (Topchy, Jain, and Punch, 2004), where a probabilistic model of the consensus clustering solution is defined in the space of the contributing clusters. Such a model is based on a finite mixture of multinomial distributions, each component of which corresponds to a cluster of the combined clustering, which is obtained as the solution to the maximum likelihood problem solved by means of the EM algorithm. Contrasting with other consensus functions, the authors highlight the low computational complexity of the proposed method and its ability to combine partitions with different numbers of clusters.

Another probabilistic approach to the consensus clustering problem was presented in (Long, Zhang, and Yu, 2005). The central matter in that work was finding a solution to the cluster correspondence problem (which, as mentioned earlier, is due to the symbolic identification of clusters caused by the unsupervised nature of the clustering problem). In particular, the goal was to derive a correspondence matrix that disambiguates the clusters of each individual clustering in the cluster ensemble (represented as a probabilistic or binary membership matrix depending on whether the cluster ensemble is soft or hard) with regard to a hypothetical consensus clustering solution membership matrix. The goal is to find a correspondence matrix that yields the best projection of each individual clustering on the space defined by the consensus clustering solution. From a practical viewpoint, both the correspondence and consensus clustering matrices are derived simultaneously, using an EM-like approach.

The beautiful proposal of (Lange and Buhmann, 2005) introduced a consensus function named Probabilistic Label Aggregation (PLA), which operates on soft cluster ensembles (although it also works on crisp ones). Its rationale is as follows: given a single fuzzy partition, a pairwise object co-association matrix is created by simply multiplying the membership probability matrix by its own transpose. Repeating this process on all the partitions in the soft cluster ensemble and aggregating (and subsequently normalizing) the resulting matrices gives rise to a joint probability matrix of finding two objects in the same cluster. Neatly, the authors propose subjecting this joint probability matrix to a non-negative matrix factorization process that yields estimates for class-likelihoods and class-posteriors, upon which the consensus clustering solution is based. This factorization process is posed as an optimization problem which is solved by applying the EM algorithm. Besides the elegance of the proposed solution, this work also stands out for the fact that it supports an out-of-sample extension that makes it possible to assign previously unseen objects to classes of the consensus clustering solution. Moreover, the proposed method also allows combining weighted partitions, i.e. it gives the user the chance to assign different degrees of relevance to the cluster ensemble partitions.
A closely related proposal is the application of Non-Negative Matrix Factorization<br />

(NMF) for solving the consensus clustering problem presented (Li, Ding, and Jordan, 2007).<br />

In contrast to (<strong>La</strong>nge and Buhmann, 2005), the aim is to combine crisp partitions, which<br />

imposes a series of constraints on the optimization problem that is solved via symmetric<br />

NMF—which, from an algorithmic viewpoint, is implemented by means of multiplicative<br />

rules. Moreover, the same approach is employed for conducting semi-supervised clustering,<br />

a problem that lies beyond the scope of this work.<br />

2.2.6 Consensus functions based on reinforcement learning

Reinforcement learning has also been applied to the construction of consensus clustering solutions (Agogino and Tumer, 2006). In that work, the average φ(NMI) of the consensus clustering solution with respect to the cluster ensemble is regarded as the reward that must be maximized by the actions of the agents. In this case, each agent casts a vote indicating which cluster each object should be assigned to (i.e. it operates on hard cluster ensembles). The application of a majority voting scheme on these votes yields the consensus clustering solution, which is iteratively refined as the agents learn how to vote so as to maximize the average φ(NMI). The authors highlight the ease of their approach for combining clusterings in distributed scenarios, which makes it especially suitable in failure-prone domains.
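The reward described above is straightforward to evaluate for any candidate consensus labeling. Assuming that scikit-learn's normalized mutual information score is an acceptable stand-in for φ(NMI), a minimal sketch is:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def average_nmi(candidate, E):
    """Average normalized mutual information between a candidate consensus
    labeling and every individual partition of the hard ensemble E (l x n);
    this is the quantity used as a reward in (Agogino and Tumer, 2006)."""
    return float(np.mean([normalized_mutual_info_score(candidate, row) for row in E]))

E = np.array([[1, 1, 1, 3, 3, 3, 2, 2, 2],
              [2, 2, 2, 1, 1, 1, 3, 3, 3],
              [2, 2, 2, 3, 3, 3, 1, 1, 1]])
candidate = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
print(average_nmi(candidate, E))   # 1.0: the candidate matches every partition
```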

2.2.7 Consensus functions based on interpeting object similarity as data<br />

The work by Kuncheva, Hadjitodorov, and Todorova (2006) introduced three consensus<br />

functions based on interpreting object similarity as data. That is, each object is represented<br />

by n features, where the jth feature of the ith object corresponds to the co-association<br />

40


Chapter 2. Cluster ensembles and consensus clustering<br />

strength between the ith and the jth objects. The authors base these consensus functions<br />

on the proved suitability of using similarity measures as object features in classification<br />

problems (Kalska, 2005). Thus, the consensus clustering solution is obtained by applying<br />

standard clustering algorithms on the pairwise object co-association matrix. In particular,<br />

the hierarchical average-link, single-link and k-means clustering algorithms are applied,<br />

giving rise to the ALSAD, SLSAD and KMSAD consensus functions. Notice that these<br />

consensus functions, despite being based on partitioning the object co-association matrix (just like EAC, for instance), differ from the latter in that its contents are not interpreted as measures of similarity between objects, but rather as attributes in a new feature space.
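A minimal sketch of this "similarity as data" idea follows (assuming NumPy and scikit-learn are available; the helper names are ours): each row of the pairwise co-association matrix is used as the feature vector of the corresponding object and fed to k-means, roughly in the spirit of KMSAD.

import numpy as np
from sklearn.cluster import KMeans

def coassociation(labels):
    """n x n matrix whose (i, j) entry is the fraction of partitions co-clustering i and j."""
    labels = np.asarray(labels)                                  # (l, n)
    return (labels[:, :, None] == labels[:, None, :]).mean(axis=0)

def kmsad_like(labels, k, seed=0):
    """Cluster the objects using the rows of the co-association matrix as features
    (similarities interpreted as data), in the spirit of KMSAD."""
    features = coassociation(labels)                             # each object -> n features
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)

if __name__ == "__main__":
    ensemble = np.array([[0, 0, 0, 1, 1, 1],
                         [0, 0, 1, 1, 2, 2],
                         [1, 1, 1, 0, 0, 0]])
    print(kmsad_like(ensemble, k=2))

Replacing KMeans with average-link or single-link agglomerative clustering on the same feature matrix would give the ALSAD- and SLSAD-style variants described above.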

2.2.8 Consensus functions based on cluster centroids<br />

The time and memory scalability problems derived from combining clusterings of large data<br />

sets is the principal motivation of several works that tackle the consensus clustering problem<br />

following a centroid-based approach (Hore, Hall, and Goldgof, 2006; Nguyen and Caruana,<br />

2007). The underlying rationale consists of representing the cluster ensemble components<br />

in terms of the centroids of their clusters, instead of label vectors. By doing so, storage<br />

inconveniences are alleviated, as the number of clusters k is usually much lower than the<br />

number of objects n. In (Hore, Hall, and Goldgof, 2006), moreover, the authors highlight<br />

that the parallelization of the cluster ensemble construction process would dramatically<br />

decrease its time complexity. As regards the creation of the consensus clustering solution,<br />

it is based on computing the average centroid of each cluster, after disambiguating them<br />

by means of the Hungarian algorithm. Furthermore, that work introduced the possibility<br />

of discarding bad clusters from the cluster ensemble at consensus clustering creation time.<br />

In (Nguyen and Caruana, 2007), three iterative consensus functions were presented and<br />

empirically compared with eleven other clustering combiners in a complete experimental<br />

study. The proposal in that work derived the consensus clustering solution in terms of the<br />

centroids of its clusters. Although the proposed consensus functions are capable of combining<br />

clusterings with a variable number of clusters, all the individual partitions contained<br />

in the hard cluster ensembles used in the experiments have the same number of clusters<br />

for simplicity. The first consensus function proposed in this work, called Iterative Voting<br />

Consensus (IVC), is based on recursively computing the centroids of the consensus solution<br />

clusters, and assigning each object to the nearest cluster, which is determined in terms<br />

of the Hamming distance. This procedure is iterated until the centroids of the consensus<br />

clustering solution reach a stable state. The second proposal (named Iterative Probabilistic<br />

Voting Consensus or IPVC), is a variant of IVC in which objects are iteratively assigned<br />

to consensus clusters in terms of their distance with respect to the objects that have been<br />

previously assigned to them. And in the third proposed consensus function, Iterative Pairwise<br />

Consensus or IPC, objects are iteratively reassigned to consensus clusters according to<br />

their similarity as measured by the pairwise object co-association matrix.<br />
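The following sketch illustrates the flavour of the first of these proposals (a simplified reading of IVC under our own naming, not the authors' reference implementation): objects are represented by their vectors of ensemble labels, the consensus cluster centroids are the per-partition majority labels, and objects are reassigned to the closest centroid in terms of Hamming distance until the assignment stabilizes.

import numpy as np

def ivc(labels, k, max_iter=100, seed=0):
    """Simplified Iterative Voting Consensus on a hard cluster ensemble.
    labels: (l, n) array, one row per ensemble partition."""
    x = np.asarray(labels).T                          # (n, l): label vector per object
    n, l = x.shape
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=n)               # random initial consensus partition
    for _ in range(max_iter):
        # Centroid of each consensus cluster: per-partition majority label.
        cent = np.zeros((k, l), dtype=int)
        for c in range(k):
            members = x[assign == c]
            if len(members) == 0:
                members = x[rng.integers(0, n, size=1)]   # re-seed an empty cluster
            for j in range(l):
                cent[c, j] = np.bincount(members[:, j]).argmax()
        # Reassign each object to the centroid at minimum Hamming distance.
        dist = (x[:, None, :] != cent[None, :, :]).sum(axis=2)   # (n, k)
        new_assign = dist.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign

if __name__ == "__main__":
    ensemble = [[0, 0, 0, 1, 1, 1],
                [0, 0, 1, 1, 2, 2],
                [1, 1, 1, 0, 0, 0]]
    print(ivc(ensemble, k=2))

IPVC and IPC would replace the centroid step by distances to previously assigned objects and by the pairwise co-association matrix, respectively.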

2.2.9 Consensus functions based on correlation clustering<br />

Recently, the connection between the late-emerging problem of correlation clustering and<br />

consensus clustering was exploited for deriving novel consensus functions capable of determining<br />

the most natural number of clusters (Gionis, Mannila, and Tsaparas, 2007). In<br />

that work, the cluster ensemble is modeled as a graph resembling a pairwise object co-<br />


dissociation matrix, and the consensus clustering solution is defined as the one minimizing<br />

the disagreements with respect to the individual partitions contained in the cluster ensemble.<br />

In particular, three consensus functions based on correlation clustering were presented<br />

in this work, which are briefly described next. Firstly, the AGGLOMERATIVE consensus function<br />

results from applying a standard bottom-up procedure for correlation clustering on<br />

cluster ensembles. Resorting to the graph view of the pairwise object distance matrix, the<br />

AGGLOMERATIVE algorithm follows an iterative merging process that joins objects in clusters<br />

depending on whether their average distance is below a predefined threshold, stopping when<br />

further cluster merging does not reduce the number of disagreements of the consensus solution<br />

with respect to the cluster ensemble. Secondly, the FURTHEST consensus function can<br />

be regarded as the converse of AGGLOMERATIVE , as it consists of a top-down procedure that<br />

iteratively separates maximally distant graph vertices into consensus clusters, assigning<br />

the remaining objects to the cluster that minimizes the overall number of disagreements.<br />

This process is stopped when no disagreement reduction is achieved from additional cluster<br />

splitting. And thirdly, the LOCALSEARCH algorithm is derived from the application of a local<br />

search correlation clustering heuristic, which is based on a greedy procedure that, starting<br />

with a specific (possibly random) partition of the graph, tries to minimize the number of<br />

disagreements resulting from moving objects to different clusters or creating new singleton<br />

clusters, stopping when no move can decrease the disagreements rate. Interestingly, the authors<br />

point out that, despite its high computational cost, the LOCALSEARCH algorithm can be<br />

employed as a post-processing step for refining a previously obtained consensus clustering<br />

solution.<br />

2.2.10 Consensus functions based on search techniques<br />

In (Goder and Filkov, 2008), two consensus functions based on search techniques were<br />

introduced. Their rationale consists of building the consensus clustering solution by means<br />

of a greedy search process aiming to minimize the cost function —the authors implement<br />

such search processes either by means of Simulated Annealing (SA), as in (Filkov and Skiena, 2004), or by means of successive single object movements that guarantee the largest decrease of the

cost function (Best One Element Moves or BOEM).<br />
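As a rough illustration of the second idea (best single-object moves), the sketch below greedily applies the relabelling of one object that most decreases a pairwise co-clustering disagreement cost; it is an illustrative reading of BOEM under these assumptions and naming, not the code of (Goder and Filkov, 2008).

import numpy as np

def disagreements(candidate, labels):
    """Pairwise co-clustering disagreements between a candidate partition and the ensemble."""
    cand_co = candidate[:, None] == candidate[None, :]
    return sum(int(np.sum(cand_co != (lab[:, None] == lab[None, :]))) // 2 for lab in labels)

def boem(labels, k, max_moves=200, seed=0):
    """Greedy Best One Element Moves: repeatedly apply the single-object relabelling
    that yields the largest decrease of the disagreement cost, until no move helps."""
    labels = np.asarray(labels)
    n = labels.shape[1]
    rng = np.random.default_rng(seed)
    cand = rng.integers(0, k, size=n)
    cost = disagreements(cand, labels)
    for _ in range(max_moves):
        best_move, best_cost = None, cost
        for i in range(n):
            for c in range(k):
                if c == cand[i]:
                    continue
                trial = cand.copy()
                trial[i] = c
                trial_cost = disagreements(trial, labels)
                if trial_cost < best_cost:
                    best_move, best_cost = (i, c), trial_cost
        if best_move is None:
            break
        cand[best_move[0]] = best_move[1]
        cost = best_cost
    return cand

if __name__ == "__main__":
    ensemble = np.array([[0, 0, 0, 1, 1, 1],
                        [0, 0, 1, 1, 2, 2],
                        [1, 1, 1, 0, 0, 0]])
    print(boem(ensemble, k=2))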

2.2.11 Consensus functions based on cluster ensemble component selection<br />

Recall that the aim of any consensus clustering process is to obtain a single partition<br />

from a collection of l clustering solutions. As an alternative means for achieving that<br />

goal, cluster ensemble component selection techniques are based on obtaining the consensus<br />

clustering solution by selection, not by combination. For instance, the BESTCLUSTERING<br />

algorithm (Gionis, Mannila, and Tsaparas, 2007) is not a consensus function proper, as it<br />

identifies as the consensus clustering the individual partition that minimizes the number of<br />

disagreements with respect to the remaining clusterings in the cluster ensemble.<br />

Following a very similar approach, the Best of K (BOK) consensus function is based on<br />

selecting that individual clustering from the cluster ensemble that minimizes the number<br />

of pairwise co-clustering disagreements between the individual partitions in the cluster<br />

ensemble (Goder and Filkov, 2008).<br />
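A small sketch of this selection-by-disagreement idea (again under our own naming and a brute-force pairwise cost, purely for illustration) could look as follows:

import numpy as np

def pairwise_disagreement(a, b):
    """Number of object pairs that one partition co-clusters and the other does not."""
    return int(np.sum((a[:, None] == a[None, :]) != (b[:, None] == b[None, :]))) // 2

def best_of_k(labels):
    """BOK-style selection: return the ensemble member minimizing its total
    pairwise co-clustering disagreement with the remaining partitions."""
    labels = np.asarray(labels)
    costs = [sum(pairwise_disagreement(p, q) for q in labels) for p in labels]
    return labels[int(np.argmin(costs))]

if __name__ == "__main__":
    ensemble = np.array([[0, 0, 0, 1, 1, 1],
                         [0, 0, 1, 1, 2, 2],
                         [1, 1, 1, 0, 0, 0]])
    print(best_of_k(ensemble))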


2.2.12 Other interesting works on consensus clustering<br />

There exist several works in the literature devoted to the experimental comparison of the<br />

performance of consensus functions, the main interest of which lies in the evaluation of the<br />

quality of the consensus clusterings obtained.<br />

Examples of this include the work by Minaei-Bidgoli, Topchy, and Punch (2004), where<br />

a data resampling scheme was presented as a means for improving the robustness and stability<br />

of the consensus clustering solution. In that work, the EAC, BagClust2, QMI, CSPA,<br />

HGPA and MCLA consensus functions are experimentally compared when operating on<br />

hard cluster ensembles created from bootstrap partitions on several artificial and real data<br />

collections. The main conclusion drawn is that, as expected, there exists no universally superior<br />

consensus function, as each consensus function explores the data set in different ways,<br />

thus its efficiency greatly depends on the existing structure in the data set. Another extensive<br />

and interesting performance comparison between several consensus functions operating<br />

on small hard cluster ensembles is presented in (Kuncheva, Hadjitodorov, and Todorova,<br />

2006). Recently, the application of consensus clustering as a means for avoiding the obtention<br />

of suboptimal clustering solutions when applying non-parametric clustering algorithms<br />

on text document collections is tackled in both (Gonzàlez and Turmo, 2008a) and (Gonzàlez<br />

and Turmo, 2008b). These works compared i) the performance of several non-parametric<br />

clustering algorithms across six text corpora, and ii) the quality of the consensus clustering<br />

solution when it is built –using some of the consensus functions presented in (Gionis,<br />

Mannila, and Tsaparas, 2007)– upon homogeneous and heterogeneous cluster ensembles.<br />

In most of these inter-consensus functions comparisons, the evaluation of their computational<br />

complexity is often given marginal importance, although it becomes a critical aspect<br />

when it comes to their application in practice, especially when dealing with large data<br />

collections or cluster ensembles containing many partitions. This is the main motivation<br />

behind the data sampling strategy proposed in (Greene and Cunningham, 2006; Gionis,<br />

Mannila, and Tsaparas, 2007). The proposal of the latter work is the SAMPLING algorithm,<br />

which consists of performing a sufficient subsampling of the objects in the data set –thus<br />

constructing the consensus clustering solution on a reduced cluster ensemble–, and the subsequent<br />

extension of the combined clustering solution on those objects that have been left<br />

out of the subsampling process. The time complexity of these two processes is linear with<br />

the data set size, which can lead to relevant time savings when dealing with large data<br />

collections.<br />

Another variant of the consensus clustering problem is the weighted combination of<br />

clusterings, which constitutes the central point of (Gonzàlez and Turmo, 2006). The idea<br />

behind weighted consensus clustering is the possibility of giving more relevance to some<br />

components of the cluster ensemble, as they may better describe the structure of the data<br />

set. Thus, it makes sense to combine clusterings in a weighted manner, emphasizing the<br />

contribution of those components deemed as the best ones in the ensemble. Besides designing<br />

consensus functions capable of combining weighted partitions, it is necessary to devise<br />

strategies for setting the proper weight of each individual clustering, which is not trivial in<br />

an unsupervised scenario. In this work, hypergraph-based (Strehl and Ghosh, 2002) and<br />

probabilistic (Topchy, Jain, and Punch, 2004) consensus functions are modified so as to<br />

handle weighted partitions. Moreover, the best weighting scheme is determined by creating<br />

differently weighted cluster ensembles, and subsequently selecting the best option in an<br />

unsupervised manner through the maximization of a scoring function. Moreover, this con-<br />


sensus function allows assigning distinct weights to each clustering in the cluster ensemble,<br />

which can be useful for the user to express his/her confidence on the quality of some individual<br />

clusterings. The ITK consensus function of (Punera and Ghosh, 2007) also supports this kind of weighted combination of the individual clusterings in the ensemble.

Chapter 3

Hierarchical consensus architectures

As outlined in section 1.5, our proposal for building robust multimedia clustering systems<br />

lies in the creation of consensus clustering solutions upon cluster ensembles. These ensembles<br />

are made up of a large number of individual clusterings resulting from the execution of<br />

multiple clustering algorithms on several unimodal and multimodal representations of the<br />

objects contained in the data set.<br />

Indeed, the massive crossing between clustering algorithms, object representations and<br />

data modalities is a simple and parallelizable manner of generating highly diverse heterogeneous<br />

cluster ensembles, entrusting the obtention of a meaningful combined clustering<br />

solution to the consensus clustering task.<br />

Given the unsupervised nature of the clustering problem, we think this is a pretty<br />

sensible way of proceeding so as to obtain clustering solutions robust to the influence of the<br />

clustering indeterminacies, as sticking to the use of a handful of clustering algorithms or<br />

object representations can lead to an involuntary and undesirable limitation as regards the<br />

quality and diversity of the cluster ensemble components.<br />

However, at the same time that this strategy allows the creation of rich cluster ensembles,<br />

it also introduces several drawbacks that affect the consensus clustering task:<br />

– the large number of individual clustering solutions contained in the cluster ensemble,<br />

resulting from the aforementioned combination of clustering algorithms, object representations<br />

and data modalities, often leads to a notable increase in the computational<br />

cost of the execution of the consensus function, which can even become prohibitive.<br />

– this same fact affects the diversity and quality of the cluster ensemble components,<br />

and, while moderate diversity has been found to be beneficial as far as consensus<br />

clustering is concerned (Hadjitodorov, Kuncheva, and Todorova, 2006; Fern and Lin,<br />

2008), the existence of poor quality clustering solutions in the cluster ensemble may<br />

cause a detrimental effect on the quality of the consensus clustering solution.<br />

Allowing for these considerations, in this thesis we introduce the concept of self-refining<br />

hierarchical consensus architectures (SHCA), defined as a generic means for fighting against:<br />


– the computational complexity of combining a large number of individual partitions, by<br />

means of hierarchical consensus architectures, which consist in the layered construction<br />

of the consensus clustering solution through a hierarchical structure of low complexity<br />

intermediate consensus processes.<br />

– the negative bias induced by poor quality clusterings in the consensus clustering solution,<br />

by means of a self-refining post-processing that, using the obtained consensus<br />

clustering solution as a reference, builds a select and reduced cluster ensemble (i.e. a<br />

subset of the original cluster ensemble), deriving a new and refined consensus clustering<br />

upon it in a fully unsupervised manner.<br />

Although both strategies are complementary (indeed, they can be naturally combined<br />

giving rise to SHCA), their description and study is decoupled in the present and the<br />

next chapters, respectively. Thus, in our description and analysis of hierarchical consensus<br />

architectures (chapter 3), we ultimately aim to design computationally optimal consensus<br />

architectures and, consequently, we will solely focus on aspects regarding their time complexity.<br />

Meanwhile, the study of consensus self-refining procedures, which are presented in<br />

chapter 4, is centered on improving the quality of the consensus solutions yielded by the<br />

most computationally efficient consensus architectures devised in the present chapter.<br />

The introduction, the discussion of their rationale and the theoretical description of<br />

hierarchical consensus architectures are complemented by the presentation of multiple experiments<br />

analyzing multiple aspects of their performance on several real data collections.<br />

Last but not least, it is to note that although all the proposals put forward in this chapter are focused on a hard cluster ensemble scenario, they are also applicable to the combination of fuzzy clusterings.

3.1 Motivation<br />

The construction of consensus clustering solutions is usually tackled as a one-step process,<br />

in the sense that the whole cluster ensemble E is input to the consensus function F at once<br />

—see figure 3.1(a). This is what we call flat consensus clustering. However, as outlined in<br />

chapter 2, the time and space complexities of consensus functions typically scale linearly or<br />

quadratically with the size of the cluster ensemble l –i.e. O (l w ), where w ∈{1, 2}–, which<br />

may lead to a highly costly or even impossible execution of the consensus clustering task if<br />

it is to be conducted on a cluster ensemble containing a large number of partitions 1 .<br />

For this reason, a natural way for avoiding this limitation besides reducing the computational<br />

complexity of the consensus solution creation process consists in applying the<br />

classic divide-and-conquer strategy (Dasgupta, Papadimitriou, and Vazirani, 2006) which<br />

basically:<br />

– breaks the original problem into subproblems which are nothing but smaller instances of the same type of problem

– recursively solves these subproblems

– appropriately combines their outcomes

1 Moreover, the time complexity of consensus functions also depends –linearly or quadratically, see appendix A.5– on the number of objects in the data set n and the number of clusters k of the clusterings in the ensemble. However, as we assume that these two factors are constant for a given cluster ensemble corresponding to a specific data set, the only dependence of concern is that referring to the cluster ensemble size l.

[Figure 3.1: Flat vs hierarchical construction of a consensus clustering solution on a hard cluster ensemble. (a) Flat construction of a consensus clustering solution on a hard cluster ensemble; (b) hierarchical construction of a consensus clustering solution on a hard cluster ensemble.]

Transferring this strategy to the consensus clustering problem is equivalent to segmenting<br />

the cluster ensemble into subsets (referred to as mini-ensembles hereafter), building<br />

an intermediate consensus solution upon each mini-ensemble, and subsequently combining<br />

these halfway consensus clusterings into the final consensus clustering solution λc —see<br />

figure 3.1(b). Due to the fact that successive layers (or stages) of consensus solutions are<br />

created, we give this approach the name of hierarchical consensus architecture (HCA), as<br />

opposed to the traditional flat topology of consensus clustering processes.<br />

The rationale of hierarchical consensus architectures is pretty simple. By reducing the<br />

time and space complexities of each intermediate consensus clustering –which is achieved by<br />

creating it upon a smaller ensemble–, we aim to reduce the overall execution requirements<br />

(i.e. memory and, especially, CPU time), although a larger number of low cost consensus<br />

clustering processes must be run. However, this strategy is capable of yielding computational<br />

gains, as for large enough values of l, the execution of the original problem becomes<br />

slower than the recursive execution of the subproblems into which it is divided (Dasgupta,<br />

Papadimitriou, and Vazirani, 2006).<br />


An additional and very relevant point as regards the computational efficiency of hierarchical<br />

consensus architectures is that they naturally allow the parallel execution of the<br />

consensus clustering processes of every HCA stage —quite obviously, this will ultimately<br />

depend on the availability of computing resources. Thus, the degree of parallelism in executing<br />

the consensus of every HCA stage will set the lower and upper bounds of the time<br />

required for obtaining the final consensus clustering λc.<br />

In the best-case scenario, the HCA running time can be as low as the sum of the<br />

execution times of the longest-lasting consensus task of each stage of the architecture,<br />

provided that the available computational resources allow the parallel computation of all<br />

the intermediate consensus solutions of any given stage.<br />

On the contrary, if the execution of the halfway consensus is serialized, the time required<br />

for running the whole HCA amounts to the sum of the execution times of all the consensus<br />

processes of the stages of the hierarchical consensus architecture, which constitutes the<br />

upper bound of the running time of a hierarchical consensus architecture.<br />

Therefore, depending on the design of the HCA, the simultaneously available computing<br />

resources and the characteristics of the data set, structuring the consensus clustering task<br />

in a hierarchical manner may be more or less computationally beneficial (or not beneficial<br />

at all) as compared to its flat counterpart. From a practical viewpoint, our general idea is<br />

to provide the user with simple tools that, for a given consensus clustering problem, make it possible to decide a priori whether hierarchical consensus architectures are more computationally

efficient than traditional flat consensus and, if so, implement the HCA variant of minimal<br />

complexity.<br />

Moreover, it is important to highlight the fact that, in cases where the flat execution<br />

of the consensus function F becomes impossible due to memory limitations caused by the<br />

large size of the cluster ensemble, a carefully designed HCA will allow obtaining a consensus<br />

clustering solution.<br />

Let us now elaborate briefly on several notational definitions regarding hierarchical<br />

consensus architectures that will be of help when describing our proposals in detail. We<br />

suggest the reader resort to the generic HCA topology depicted in figure 3.1(b) for a better<br />

understanding of the concepts we are about to expose.<br />

Firstly, a hierarchical consensus architecture is structured in s successive stages. The<br />

number of intermediate consensus solutions obtained at the output of the ith stage is denoted<br />

as Ki —notice that Ks = 1 (i.e. the last stage yields the single final consensus<br />

clustering solution λc). The notation used for designating the jth halfway consensus clustering<br />

created at the ith HCA stage is λ^i_{cj}, where i ∈ [1, s−1] and j ∈ [1, Ki].

Another important factor in the definition of HCAs is the size of the mini-ensembles,<br />

which may vary from stage to stage or even within the same stage. For this reason, we<br />

denote as bij the size of the mini-ensemble upon which the jth consensus process of the ith<br />

HCA stage is conducted. Notice that bs1 = Ks−1 (i.e. the last consensus stage combines<br />

all the intermediate clusterings output by the previous stage into the single final consensus<br />

clustering solution λc), while, in the HCA presented in figure 3.1(b), b1j =2∀j ∈ [1,K1].<br />

Moreover, notice that hierarchical architectures naturally allow the use of distinct consensus<br />

functions across the distinct stages (or even within the same stage). However, in<br />

this work we assume that a single consensus function F is applied for conducting all the<br />

consensus processes involved.<br />


In this chapter, we propose two strategies for constructing hierarchical consensus architectures,<br />

which differ in i) the way mini-ensembles are created, and ii) which HCA<br />

parameters are tuned by the user. As a result, two HCA implementation alternatives are<br />

put forward:<br />

– random hierarchical consensus architectures, whose tunable parameter is the size of<br />

the mini-ensembles –the components of which are selected at random–, which eventually<br />

determines the HCA topology (i.e. its number of stages).<br />

– deterministic hierarchical consensus architectures, where the construction of the miniensembles<br />

is driven by the cluster ensemble creation process —in particular, the diversity<br />

factors used in creating the ensemble determine the number of HCA stages<br />

and the mini-ensembles components.<br />

The following sections are devoted to describing the rationale and implementation details<br />

of both HCA variants, specifying how the number of stages, the number of consensus<br />

processes per stage and the size of the mini-ensembles are determined in each case. This<br />

description is completed by an analysis of their computational complexity.<br />

3.2 Random hierarchical consensus architectures<br />

In this section, we introduce random hierarchical consensus architectures (RHCA for short), define their topology from a generic perspective and briefly describe their foundations, followed by an analysis of their computational complexity.

3.2.1 Rationale and definition<br />

The idea behind random hierarchical consensus architectures is to construct a regular pyramidal<br />

structure of intermediate consensus processes that delivers, at its top, the final consensus<br />

clustering solution λc. The term random refers not to the consensus architecture<br />

itself, but to the way mini-ensembles are created. In particular, the randomness of RHCA<br />

lies in the fact that the clusterings input to each stage of the hierarchical architecture are<br />

shuffled randomly.<br />

Besides this fact, the most distinctive feature of RHCA is that the user determines<br />

the size of the mini-ensembles, setting it to b, keeping it constant across the stages of<br />

the consensus architecture. Therefore, given a cluster ensemble containing l component<br />

clusterings and a mini-ensemble size set equal to b by the user, the number of stages s of<br />

the resulting RHCA is computed by equation (3.1).<br />

s = \begin{cases}
\lfloor \log_b(l) \rceil & \text{if } \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \le 1 \ \text{and} \ \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} > 1 \\
\lfloor \log_b(l) \rceil - 1 & \text{if } \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \le 1 \ \text{and} \ \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} = 1 \\
\lfloor \log_b(l) \rceil + 1 & \text{if } \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} > 1
\end{cases} \qquad (3.1)



where ⌊x⌉ denotes the operation of rounding x to the nearest integer (Hastad et al., 1988).<br />

The second option in equation (3.1) reduces the number of stages by one in the case that<br />

the penultimate RHCA stage already yields one consensus clustering, whereas the third one<br />

adds a supplementary stage so as to ensure the obtention of a single consensus solution at<br />

the output of the RHCA.<br />

Furthermore, the number of consensus solutions computed at the ith stage of the RHCA<br />

(where i ∈ [1,s]) is determined by the expression in equation (3.2).

K_i = \max\left( \left\lfloor \frac{l}{b^i} \right\rfloor , 1 \right) \qquad (3.2)

where ⌊x⌋ stands for the greatest integer less than or equal to x (i.e. the result of applying<br />

the floor function on number x).<br />

However, it is important to notice that it will only be possible to keep the mini-ensembles<br />

size constant all along the hierarchy (i.e. bij = b, ∀i ∈ [1,s] and ∀j ∈ [1,Ki]) when l is

an integer power of b. In the likely case that this condition is not met, in the current<br />

implementation of RHCA we choose, for simplicity, integrating the spare clusterings2 of the<br />

ith RHCA stage into its last (i.e. the Kith) mini-ensemble, thus introducing a bounded<br />

increase of its size, as b ≤ biKi < 2b. Moreover, the size of the mini-ensemble input to the<br />

sth stage is set to be equal to the number of halfway consensus output by the penultimate<br />

RHCA stage, as defined in equation (3.3).<br />

b_{ij} = \begin{cases}
b & \text{if } i < s \ \text{and} \ j < K_i \\
K_{i-1} - (K_i - 1)\, b & \text{if } i < s \ \text{and} \ j = K_i \\
K_{s-1} & \text{if } i = s
\end{cases} \qquad (3.3)

where K_0 = l denotes the size of the original cluster ensemble.



the mini-ensembles size, notice that the size of the third mini-ensemble of the first RHCA<br />

stage is increased (b13 =3) so that all the l = 7 components of the cluster ensemble are<br />

involved in one of the K1 = 3 consensus processes of the first RHCA stage. This also<br />

happens in the second stage, where b21 = 3 and K2 = 1, which, as just mentioned, yields a

single consensus at its output.<br />

The interested reader will find a more detailed description of these and other RHCA<br />

configuration examples in appendix C.1.<br />
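For illustration, the following Python sketch (ours, not the thesis implementation) reproduces this bookkeeping: each stage groups its input clusterings into mini-ensembles of size b, folds the spare clusterings into the last mini-ensemble, and adds stages until a single consensus clustering remains. It follows the textual description above rather than evaluating equation (3.1) literally, and it reproduces the toy topologies of figure 3.2.

def rhca_topology(l, b):
    """Mini-ensemble sizes per stage of a RHCA: group the inputs of each stage into
    mini-ensembles of size b, fold the spares into the last one, and iterate until a
    single consensus clustering remains (mirrors the behaviour of eqs. (3.1)-(3.3))."""
    topology, inputs = [], l
    while inputs > 1:
        k = max(inputs // b, 1)                           # consensus processes in this stage
        sizes = [b] * (k - 1) + [inputs - (k - 1) * b]    # last mini-ensemble absorbs spares
        topology.append(sizes)
        inputs = k                                        # next stage combines the k outputs
    return topology

if __name__ == "__main__":
    for l, b in [(8, 2), (9, 2), (7, 2), (57, 7)]:
        topo = rhca_topology(l, b)
        print(f"l={l:2d}, b={b}: s={len(topo)} stages, "
              f"K_i={[len(stage) for stage in topo]}, sizes={topo}")

For instance, l = 7 and b = 2 yields two stages with mini-ensemble sizes [2, 2, 3] and [3], matching the configuration discussed above.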

3.2.2 Computational complexity<br />

In the following paragraphs, we present a study of the asymptotic computational complexity<br />

of RHCA, considering both its fully serial and parallel implementations, which, as<br />

aforementioned, constitute the upper and lower bounds of the RHCA execution time.<br />

Serial RHCA<br />

For starters, the time complexity of the fully serialized implementation is considered. This<br />

means that the intermediate consensus tasks of each RHCA stage must be sequentially executed<br />

on a single computation unit. Recall that the time complexity of consensus functions<br />

typically grows linearly or quadratically with the cluster ensemble size, that is, it can be<br />

expressed as O (l w ), where w ∈{1, 2}. Therefore, the serial time complexity of a RHCA<br />

(STCRHCA) with s stages boils down to systematically adding the time complexities of all

the consensus processes executed across the whole RHCA, as defined in equation (3.4).<br />

\mathrm{STC}_{\mathrm{RHCA}} = \sum_{i=1}^{s} \sum_{j=1}^{K_i} O\!\left(b_{ij}^{\,w}\right) \qquad (3.4)

where Ki refers to the number of consensus processes executed in the ith RHCA stage, bij is<br />

the mini-ensemble size corresponding to the jth consensus process executed at the ith stage<br />

of the hierarchy —the exact value of these parameters is computed according to equations<br />

(3.2) and (3.3), respectively—, and O (bij w ) reflects the complexity of each intermediate<br />

consensus process.<br />

Equation (3.4) can be reformulated so as to obtain a compact expression of an upper<br />

bound of STCRHCA as a function of the user defined mini-ensembles size b. This requires<br />

recalling that, in the current RHCA implementation, the effective mini-ensembles size is<br />

bounded, that is, bij < 2b, ∀i ∈ [1,s] and ∀j ∈ [1,Ki]. Therefore, we can write:

\mathrm{STC}_{\mathrm{RHCA}} < \sum_{i=1}^{s} \sum_{j=1}^{K_i} O\!\left((2b)^w\right) \qquad (3.5)

Notice that, from an algorithmic viewpoint, equation (3.5) can be regarded as two nested<br />

loops where the number of iterations of the inner loop (Ki) depends on the value of the<br />

outer loop’s index (i). The number of iterations of the inner loop as a function of the outer<br />

loop’s index is presented in table 3.1.<br />

i      # inner loop iterations
1      K1
2      K2
...    ...
s      1

Table 3.1: Number of inner loop iterations as a function of the outer loop's index i.

Thus, it can be observed that the total number of times the mini-ensemble consensus of maximum complexity O((2b)^w) is executed is \sum_{i=1}^{s} K_i. The value of this summation can be approximated by considering that the number of consensus per stage K_i is also bounded, as K_i \le \frac{l}{b^i} (see equation (3.2)), yielding:

\sum_{i=1}^{s} K_i \le \sum_{i=1}^{s} \frac{l}{b^i} = l \cdot \frac{\frac{1}{b} - \frac{1}{b^{s+1}}}{1 - \frac{1}{b}} = l \cdot \frac{b^s - 1}{b^s\,(b-1)} \qquad (3.6)

which equals the partial sum of a geometric series whose common ratio is 1/b. Therefore, the upper bound of the time complexity STCRHCA can be rewritten as:

\mathrm{STC}_{\mathrm{RHCA}} < l \cdot \frac{b^s - 1}{b^s\,(b-1)} \cdot O\!\left((2b)^w\right) \qquad (3.7)

[Figure 3.2: Three examples of topologies of random hierarchical consensus architectures with distinct relationships between the cluster ensemble and mini-ensembles sizes, l and b, respectively. (a) Topology of a RHCA with l = 8 and b = 2, with s = 3; (b) topology of a RHCA with l = 9 and b = 2, with s = 3; (c) topology of a RHCA with l = 7 and b = 2, with s = 2.]

Parallel RHCA

Let us now consider the fully parallel implementation of RHCA, which assumes that enough computation units are simultaneously available to execute all the consensus processes of a given stage at once (i.e. K1 units, as K1 > Ki, ∀i ∈ [2,s]). In this case of maximum parallelism, the parallel time complexity of a RHCA (PTCRHCA) of s stages is formulated according to equation (3.9).

\mathrm{PTC}_{\mathrm{RHCA}} = \sum_{i=1}^{s} \max_{j \in [1, K_i]} \left( O\!\left(b_{ij}^{\,w}\right) \right) \qquad (3.9)

That is, the parallel time complexity of a RHCA is equal to the sum of the time complexities<br />

of the most time-consuming consensus process of each RHCA stage. Notice that<br />

it is not difficult to find an upper bound to PTCRHCA, as finding the maximum of O (bij w )<br />

just requires taking into account that bij < 2b, ∀i, j. Thus:<br />

\mathrm{PTC}_{\mathrm{RHCA}} < \sum_{i=1}^{s} O\!\left((2b)^w\right) = s \cdot O\!\left((2b)^w\right) = O\!\left(s \cdot (2b)^w\right) \qquad (3.10)

If the number of RHCA stages is approximated as s ≈ log b (l), and constants are<br />

dropped, equation (3.10) can be rewritten as a function of l and b:<br />

\mathrm{PTC}_{\mathrm{RHCA}} < O\!\left(\log_b(l)\,(2b)^w\right) = O\!\left(b^w \log_b(l)\right) \qquad (3.11)

[Figure 3.3: Evolution of RHCA parameters as a function of the mini-ensembles size b for cluster ensemble sizes ranging from 100 to 10000. (a) Evolution of the number of RHCA stages s as a function of b; (b) evolution of the total number of RHCA consensus processes as a function of b; (c) evolution of the mean mini-ensembles size as a function of b.]

3.2.3 Running time minimization

In light of the expressions of the upper bounds of the serial and parallel time complexities<br />

of RHCA, a naturally arising question is which particular RHCA configuration yields, for<br />

a given cluster ensemble, the minimal running time —notice that the user’s election of the<br />

mini-ensembles size b determines both the number of stages and of consensus computed per<br />

stage, see equations (3.1) and (3.2), which ultimately determines the running time of the<br />

RHCA.<br />

In fact, there exists a trade-off between the value of b and the execution time of the<br />

whole RHCA, as selecting a small value for b simultaneously reduces the time complexity of<br />

each consensus while increasing the total number of stages (s) and of consensus processes<br />

of the RHCA (\sum_{i=1}^{s} K_i), and vice versa.

With the purpose of visualizing the dependence between b and these factors, figure 3.3<br />

depicts their value for different cluster ensembles sizes l ∈ [100, 10000] as a function of the<br />

mini-ensembles size b ∈ [2, ⌊l/2⌋].

Firstly, figure 3.3(a) shows the exponential decay of the number of RHCA stages s as a<br />

function of b, caused by the fact that s is computed as the b-base logarithm of the cluster<br />

ensemble size l. Secondly, the evolution of the total number of consensus processes follows<br />

a fast exponential decay (hard to appreciate in this doubly logarithmic chart), as depicted<br />

in figure 3.3(b). And finally, figure 3.3(c) presents the mean value of the effective mini-ensembles size bij across the whole RHCA (which obviously scales linearly with b) as a rough

indicator of the complexity of each halfway consensus process, which will approximately be<br />

(linearly or quadratically) proportional to this value.<br />

Allowing for the evident dependence between the user defined mini-ensembles size b and<br />


the running of RHCA, it seems necessary to accurately choose the value of this parameter<br />

—regardless of whether the serial or parallel version of RHCA is implemented (in this latter<br />

case, notice that the RHCA running time still depends on b (via s) although it becomes insensitive to the total number of consensus processes, \sum_{i=1}^{s} K_i)—, as RHCA variants with different values of b may have dramatically different running times.

thespecificvalueofb that gives rise to the RHCA variant of minimal running time, making<br />

it also possible to decide aprioriwhether it is more computationally efficient than flat<br />

consensus.<br />

To do so, we propose a simple yet effective methodology for comparing distinct RHCA<br />

variants (i.e. with different b values) on computational grounds, a detailed description of<br />

which is presented in table 3.2. In a nutshell, the proposed strategy is based on estimating<br />

the RHCA variants running time using equations (3.4) or (3.9) (depending on whether the<br />

serial or parallel version is to be implemented), replacing the theoretical time complexity of<br />

intermediate consensus processes (O (bij w )) by the average real execution time of c consensus<br />

processes using a specific consensus function F on a mini-ensemble of size bij —recall that<br />

the values of bij can be computed by means of equation (3.3).<br />

It is to note that, for a specific RHCA variant (that is, for a given mini-ensemble size<br />

b), the parameter bij may take a few different values —for instance, in the three toy RHCA<br />

variants depicted in figure 3.2, bij = {2, 3} although b = 2 in all of them. That is, the<br />

number of consensus processes to be executed for estimating the running time of the RHCA<br />

is usually small, which makes the proposed procedure little computationally demanding.<br />

Futhermore, notice that a more robust estimation can be obtained by averaging the running<br />

times of c>1 executions of the consensus function F on mini-ensembles of size bij, although<br />

this would increase the time required for completing the estimation procedure.<br />

By means of the proposed methodology, the user is provided with an estimation of<br />

which is the most computationally efficient RHCA configuration and which is its running<br />

time for the consensus clustering problem at hand. However, it is necessary to decide<br />

whether this allegedly optimal RHCA variant is faster than flat consensus. The running<br />

time of flat consensus can be estimated by extrapolating from the running times of the<br />

consensus processes executed upon mini-ensembles of size bij. A simpler although less<br />

efficient alternative consists of launching the execution of flat consensus, halting it as soon<br />

as its running time exceeds the estimated execution time of the allegedly optimal RHCA<br />

variant.<br />

The next section is devoted to illustrate the performance of the proposed running time<br />

estimation methodology by means of several experiments, as well as for evaluating the<br />

computational efficiency of RHCA in front of flat consensus.<br />

3.2.4 Experiments<br />

In the following paragraphs, we present a set of experiments oriented to i) evaluate the<br />

performance of the computationally optimal consensus architecture prediction methodology,<br />

and ii) analyze the computational efficiency of random hierarchical consensus architectures.<br />

To do so, we have designed several experiments which are outlined next.<br />

55<br />

i=1


3.2. Random hierarchical consensus architectures<br />

1. Given the cluster ensemble size l, create a set of mini-ensembles sizes b with values<br />

sweeping from 2 to ⌊ l<br />

2 ⌋.<br />

2. For each b value –which corresponds to a RHCA variant– compute the number of<br />

stages of the RHCA s according to equation (3.1). With these results in hand, limit<br />

the sweep of values of b according to two criteria:<br />

i) as there exist multiple values of b that yield RHCA variants with the same number<br />

of stages, consider only the largest and smallest values of b that yield the same<br />

number of RHCA stages s.<br />

ii) keep those values of b that uniquely give rise to RHCA variants with a specific<br />

number of stages.<br />

3. For the reduced set of b values, compute the total number of consensus processes<br />

s<br />

Ki, and the real mini-ensembles sizes bij of the corresponding RHCA variants,<br />

i=1<br />

according to equations (3.2) and (3.3), respectively.<br />

4. Measure the time required for executing the consensus function F on c randomly<br />

picked mini-cluster ensembles of the sizes bij corresponding to each value of b.<br />

5. Employ the computed parameters of each RHCA variant (i.e. number of stages s,<br />

s<br />

total number of consensus processes Ki and the running times of the consensus<br />

i=1<br />

function F) to estimate the running times of the whole hierarchical architecture,<br />

using equations (3.4) or (3.9) depending on whether its fully serial or parallel version<br />

is to be implemented in practice.<br />

Table 3.2: Methodology for estimating the running time of multiple RHCA variants.<br />

Experimental design<br />

– What do we want to measure?<br />

i) The time complexity of random hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology for predicting the computationally optimal<br />

RHCA variant, in both the fully serial and parallel implementations.<br />

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel RHCA variants is<br />

measured in terms of the CPU time required for their execution —serial running<br />

time (SRTRHCA) and parallel running time (PRTRHCA).<br />

ii) The estimated running times of the same RHCA variants –serial estimated running<br />

time (SERTRHCA) and parallel estimated running time (PERTRHCA)– are<br />

computed by means of the proposed running time estimation methodology, which<br />

is based on the measured running time of c = 1 consensus clustering process. Predictions<br />

regarding the computationally optimal RHCA variant will be successful<br />

56


Chapter 3. Hierarchical consensus architectures<br />

in case that both the real and estimated running times are minimized by the<br />

same RHCA variant, and the percentage of experiments in which prediction is<br />

successful is given as a measure of its performance. In order to measure the<br />

impact of incorrect predictions, we also measure the execution time differences<br />

(in both absolute and relative terms) between the truly and the allegedly fastest<br />

RHCA variants in the case prediction fails. This evaluation process is replicated<br />

for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />

on the prediction accuracy of the proposed methodology.<br />

– How are the experiments designed? All the RHCA variants corresponding to<br />

the sweep of values of b resulting from the proposed running time estimation methodology<br />

have been implemented (see table 3.2). In order to test our proposals under a<br />

wide spectrum of experimental situations, consensus processes have been conducted<br />

using the seven consensus functions for hard cluster ensembles presented in appendix<br />

A.5 (i.e. CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), employing<br />

cluster ensembles of the sizes corresponding to the four diversity scenarios described<br />

in appendix A.4 —which basically boils down to compiling the clusterings output by<br />

|dfA| = {1, 10, 19, 28} clustering algorithms. In all cases, the real running times correspond<br />

to an average of 10 independent runs of the whole RHCA, in order to obtain<br />

representative real running time values (recall that the mini-ensemble components<br />

change from run to run, as they are randomly selected). For a description of the<br />

computational resources employed in or experiments, see appendix A.6.<br />

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the RHCA variants are depicted by means of<br />

curves representing their average values.<br />

– Which data sets are employed? For brevity reasons, this section only describes<br />

the results of the experiments conducted on the Zoo data collection. The presentation<br />

of the results of these same experiments on the Iris, Wine, Glass, Ionosphere, WDBC,<br />

Balance and MFeat unimodal data collections is deferred to appendix C.2.<br />

One word before proceeding to present the results obtained. In practice, only serial<br />

RHCA have been implemented in our experiments. The real execution times of their parallel<br />

counterparts are, in fact, an estimation based on retrieving the execution time of the longestlasting<br />

consensus process of each stage of the serial RHCA and plugging them into equation<br />

(3.9).<br />

Diversity scenario |df A| =1<br />

Firstly, figure 3.4 presents the results corresponding to the lowest diversity scenario, i.e.<br />

the one resulting from using a single randomly chosen clustering algorithm for generating<br />

the cluster ensemble —that is, the cardinality of the algorithmic diversity factor is equal<br />

to one, i.e. |dfA| = 1, which, on this data set, gives rise to a cluster ensemble size l = 57.<br />

Following the methodology of table 3.2, the sweep of values of the mini-ensemble size is<br />

b = {2, 3, 4, 6, 7, 28, 57} —recall that each value of b gives rise to a distinct RHCA variant.<br />

Figure 3.4(a) presents the serial RHCA estimated running time (SERTRHCA), while figure<br />

3.4(b) depicts the real serial running time (or SRTRHCA) of the implemented RHCA<br />

57


3.2. Random hierarchical consensus architectures<br />

SERT RHCA (sec.)<br />

PERT RHCA (sec.)<br />

10 0<br />

10 −1<br />

s : number of stages<br />

5 4 3 3 2 2 1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(a) Serial estimated running time<br />

5<br />

10<br />

4 3 3 2 2 1<br />

0<br />

s : number of stages<br />

10 −1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(c) Parallel estimated running time<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

SRT RHCA (sec.)<br />

PRT RHCA (sec.)<br />

10 0<br />

10 −1<br />

s : number of stages<br />

5 4 3 3 2 2 1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(b) Serial real running time<br />

5<br />

10<br />

4 3 3 2 2 1<br />

0<br />

s : number of stages<br />

10 −1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(d) Parallel real running time<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

Figure 3.4: Estimated and real running times of the serial RHCA on the Zoo data collection<br />

in the diversity scenario corresponding to a cluster ensemble of size l = 57.<br />

variants. Figures 3.4(c) and 3.4(d) present their counterparts for the parallel RHCA implementation.<br />

The lower horizontal axis of each chart presents the mini-ensembles size b of<br />

each RHCA variant, and the superior horizontal axis indicates the corresponding number<br />

of stages s of the RHCA. Notice, for instance, that s = 1 for b = 57, which corresponds to

flat consensus.<br />

If the estimated and real execution times of the serial implementation of the RHCA are<br />

analyzed separately (figures 3.4(a) and 3.4(b)), it can be observed that flat consensus is<br />

faster than any RHCA variant regardless of the consensus function employed. This is due<br />

to the small size of the cluster ensemble (l = 57) in this low diversity scenario, which makes<br />

any hierarchical consensus architecture slower than its one-step counterpart.<br />

Moreover, the visual comparison of figures 3.4(a) and 3.4(b) shows that SERTRHCA is<br />

a fairly accurate estimation of SRTRHCA. However, it is to notice that our goal is not<br />

to predict the exact value of SRTRHCA, but to use SERTRHCA to predict where the real<br />

running time will attain its minimum value —which is equivalent to determining the most<br />

computationally efficient RHCA variant, a goal that is perfectly accomplished in this case.<br />

Figures 3.4(c) and 3.4(d) depict the estimated and real running times of the fully<br />

parallel implementation of the same RHCA variants as before. The observation of these<br />

charts reveals that PERTRHCA succeeds notably in predicting the location of the minima<br />

[Figure 3.5: Estimated and real running times of the serial RHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 570. Panels: (a) serial estimated running time; (b) serial real running time; (c) parallel estimated running time; (d) parallel real running time; curves for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

of PRTRHCA —in fact, the only prediction error occurs in the case the SLSAD consensus<br />

function is employed. In this case, according to PERTRHCA (figure 3.4(c)), the most efficient<br />

consensus architecture is the RHCA variant with s = 2 stages using mini-ensembles of<br />

size b = 28. However, the real execution times (figure 3.4(d)) reveal that flat consensus is<br />

the fastest option when this consensus function is employed for combining the clusterings.<br />

Nevertheless, we would like to highlight that the cost (measured in terms of running<br />

time) of selecting this computationally suboptimal RHCA variant based on the PERTRHCA<br />

prediction is almost negligible in absolute terms, as the difference between the running<br />

times of the truly and allegedly optimal parallel RHCA variants is smaller than a tenth of<br />

a second in this case.<br />

Diversity scenario |df A| =10<br />

Figure 3.5 presents the estimated and real execution times of several architectural variants of<br />

both the fully serial and parallel RHCA implementations in the second diversity scenario,<br />

the one resulting from employing |dfA| = 10 randomly chosen clustering algorithms for<br />

generating a cluster ensemble of size l = 570. In this case, the sweep of values of the<br />

mini-ensembles size is b = {2, 3, 4, 5, 7, 8, 19, 20, 285, 570}.<br />

If the estimated and real execution times of the serial implementation of the RHCA are<br />


observed (figures 3.5(a) and 3.5(b)), it can be noticed that i) SERTRHCA is again a pretty<br />

accurate estimation of SRTRHCA, and ii) for several consensus functions (in fact, all but EAC) there exists at least one RHCA variant that is more computationally efficient than

flat consensus. In general terms, the difference between the running times of the fastest<br />

RHCA variant and flat consensus is small, although in the case of the MCLA consensus<br />

function, the execution of flat consensus (i.e. b = l = 570) is six times as costly as the<br />

fastest RHCA variant (the one with b = 20).<br />

Three main conclusions can be drawn at this point: firstly, increasing the size of the cluster ensemble makes hierarchical consensus architectures a computationally competitive alternative to flat consensus. Secondly, it is necessary to predict accurately which is the fastest RHCA variant (i.e. the specific value of the mini-ensemble size b) so as to obtain significant execution time savings. And thirdly, the computational optimality of a particular RHCA variant is local to the consensus function employed.

As regards the estimated and real running times of the fully parallel RHCA implementation, depicted in figures 3.5(c) and 3.5(d), we can conclude that, again, PERTRHCA is a good indicator of the most computationally efficient RHCA variant. Furthermore, notice that the differences between the running times of flat consensus and the optimal RHCA are beyond one order of magnitude for most consensus functions, which highlights the interest of RHCA in computational terms, as well as the need for being able to predict which is the least time consuming consensus architecture.

Diversity scenario |dfA| = 19

The results corresponding to the third diversity scenario (i.e. cluster ensembles of size<br />

l = 1083 using |dfA| = 19 randomly chosen clustering algorithms) are presented in figure<br />

3.6. In this case, the mini-ensembles size sweep is b = {2, 3, 4, 5, 6, 8, 9, 26, 27, 541, 1083}.<br />

As regards the serial implementation of the RHCA –whose estimated and real running<br />

times are presented in figures 3.6(a) and 3.6(b), respectively–, a few observations must be<br />

made. Firstly, notice that the curves in figure 3.6(a) present a high degree of resemblance<br />

to the ones in figure 3.6(b), which indicates that SERTRHCA is a notably accurate predictor<br />

of SRTRHCA. Again, we would like to highlight the fact that our main interest is that<br />

the former is a good predictor of the location of the minima of the latter, a goal which is<br />

pretty successfully achieved in this case. Secondly, notice the influence of the consensus<br />

function employed for conducting the clustering combination on the running time of the<br />

RHCA. Whereas most of them yield a similar running time pattern (i.e. they have a<br />

more or less pronounced minimum around b = 26 or b = 27), two consensus functions

stand out for their particular behaviour: i) when the EAC consensus function is employed,<br />

flat consensus is faster than any serial RHCA variant, and ii) when consensus is created<br />

by means of the MCLA consensus function, the space complexity requirements of MCLA<br />

make flat consensus not executable, as this is the only consensus function (among the ones<br />

employed in this work) whose complexity scales quadratically with the cluster ensemble size<br />

(see appendix A.5).<br />

If the estimated and real parallel RHCA implementation running times are evaluated<br />

(see figures 3.6(c) and 3.6(d)), it can be observed that, whatever the consensus function<br />

employed, there always exists at least one parallel RHCA variant which performs more<br />

efficiently than flat consensus. Moreover, notice that just like in the previous diversity<br />



Figure 3.6: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1083. Panels (a) and (b) show the serial estimated and real running times (SERT RHCA and SRT RHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT RHCA and PRT RHCA, in sec.), as a function of the mini-ensemble size b (and the corresponding number of stages s), for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

scenario, there exists a notable difference between the running times of the most efficient<br />

parallel RHCA and flat consensus, which can be as high as two orders of magnitude.<br />

Diversity scenario |dfA| = 28

Figure 3.7 depicts the estimated and real execution times corresponding to the highest<br />

diversity scenario —i.e. the one resulting from applying the |dfA| = 28 clustering algorithms<br />

from the CLUTO clustering package for generating cluster ensembles of size l = 1596. In<br />

this case, the mini-ensembles size sweep is b = {2, 3, 4, 5, 6, 10, 11, 32, 33, 798, 1596}.<br />

The results obtained are pretty similar to those obtained on the previous diversity<br />

scenario, although the following remarks must be made: firstly, notice that the large size of<br />

the cluster ensemble may not only impede flat consensus, but also the execution of those<br />

RHCA variants using larger mini-ensembles (see the curves corresponding to the MCLA<br />

consensus function). And secondly, it is noteworthy that the larger the cluster ensemble, the greater the running time savings –which can be as high as two orders of magnitude– derived from using the computationally optimal RHCA variant instead of flat consensus (when the latter is executable), regardless of the consensus function employed.



Figure 3.7: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1596. Panels (a) and (b) show the serial estimated and real running times (SERT RHCA and SRT RHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT RHCA and PRT RHCA, in sec.), as a function of the mini-ensemble size b (and the corresponding number of stages s), for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

Conclusions regarding the computational efficiency of RHCA

The observation of the results obtained across the four diversity scenarios (together with the experiments presented in appendix C.2) allows drawing several conclusions as regards the computational efficiency of hierarchical and flat consensus architectures:

– hierarchical consensus architectures can constitute a feasible way to obtain a consensus<br />

clustering solution in cases where one-step consensus is not affordable, which<br />

ultimately depends on the size of the cluster ensemble, the characteristics of the consensus<br />

function and the computational resources at hand.<br />

– as expected, parallel RHCA are highly efficient, being faster than flat consensus even<br />

in low diversity scenarios.<br />

– serial RHCA implementations become computationally competitive in medium to high<br />

diversity scenarios.<br />

– depending on the characteristics of the consensus function(s) employed for conducting<br />

clustering combination, large variations of the overall execution time of consensus<br />

architectures are observed.<br />


Dataset      Serial RHCA                           Parallel RHCA
             % correct    ΔRT      ΔRT             % correct    ΔRT      ΔRT
             predictions  (sec.)   (%)             predictions  (sec.)   (%)
Zoo          72.2         1.11     109.7           54.9         0.10     67.2
Iris         90.4         0.05     26.1            56.7         0.12     102.7
Wine         77.1         0.60     37.8            46.8         0.21     139.5
Glass        74.6         0.49     26.5            25.9         0.26     97.3
Ionosphere   73.1         2.63     16.6            67.8         0.77     110.5
WDBC         63.0         12.11    39.1            38.6         8.17     113.9
Balance      92.4         0.31     29.5            73.2         3.09     87.3
MFeat        83.4         7.02     27.7            76.3         14.41    50.3
average      78.3         3.04     39.1            55.0         3.39     96.1

Table 3.3: Evaluation of the minimum complexity RHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions.

Conclusions regarding the optimal RHCA prediction methodology<br />

As far as the proposed running time estimation methodology is concerned, the following<br />

conclusions are drawn:<br />

– the computation of SERTRHCA and PERTRHCA constitutes a reasonable, simple and fast means for predicting whether flat or hierarchical consensus should be conducted.

– the selection of computationally suboptimal consensus architectures caused by prediction errors of the proposed methodology entails a (usually acceptable) execution time overhead.

In order to provide the reader with a more quantitative analysis of the predictive power of the proposed running time estimation methodology, we have computed the percentage of experiments –considering the eight data sets over which they have been conducted– in which the minimum value of the estimated and real running times is obtained for the same consensus architecture. If, for a given experiment, both functions are simultaneously minimized, then our methodology succeeds in determining a priori which is the fastest consensus architecture. If not, we compute the difference between the real running times of the truly (i.e. the one that minimizes SRTRHCA or PRTRHCA) and the allegedly (that is, the one minimizing SERTRHCA or PERTRHCA) computationally optimal consensus architectures, so as to provide a measure of the impact of choosing a suboptimal consensus configuration, both in absolute and relative terms.
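For illustration purposes, the following Python sketch (hypothetical variable names and toy data; the original experiments were run in Matlab) reproduces this evaluation logic: it checks whether the estimated and real running times are minimized by the same architecture and, when they are not, accumulates the absolute and relative penalty of executing the allegedly optimal one.

```python
def evaluate_predictions(estimated, real):
    """Compare estimated vs. real running times of candidate consensus architectures.

    estimated, real: lists of dicts mapping architecture id -> running time (sec.),
    one dict per experiment. Returns the percentage of correct predictions and the
    average absolute/relative running time penalty of mistaken predictions.
    """
    correct, abs_penalties, rel_penalties = 0, [], []
    for est, rt in zip(estimated, real):
        predicted = min(est, key=est.get)   # architecture minimizing the estimate
        truly_best = min(rt, key=rt.get)    # architecture minimizing the real time
        if predicted == truly_best:
            correct += 1
        else:
            delta = rt[predicted] - rt[truly_best]               # seconds lost
            abs_penalties.append(delta)
            rel_penalties.append(100.0 * delta / rt[truly_best]) # penalty in %
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return 100.0 * correct / len(real), avg(abs_penalties), avg(rel_penalties)

# Toy usage with two hypothetical experiments (times in seconds):
est = [{"flat": 1.2, "s=2": 0.4}, {"flat": 0.3, "s=2": 0.5}]
real = [{"flat": 1.5, "s=2": 0.6}, {"flat": 0.45, "s=2": 0.4}]
print(evaluate_predictions(est, real))   # ≈ (50.0, 0.05, 12.5)
```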

Table 3.3 presents the percentage of experiments where the minima of SERTRHCA<br />

and PERTRHCA predict the most efficient consensus architecture correctly (expressed as<br />

‘% correct predictions’). In this case, SERTRHCA and PERTRHCA have been estimated<br />

upon a single execution (c = 1) of a consensus process on the mini-ensembles of size bij,<br />

i.e. no statistical running time averaging is conducted. Moreover, the difference between<br />

the real running times (ΔRT, measured both in seconds and in relative percentage) of the<br />

truly and the allegedly computationally optimal consensus architectures is also presented,<br />


which corresponds to an average across 50 independent runs of the experiments conducted<br />

on each data collection.<br />

It is worth noting that, as already observed in the experiment described in this section and in those presented in appendix C.2, SERTRHCA is a pretty accurate estimator of SRTRHCA (despite being based on the running time of a single consensus) and, as such, it succeeds notably in determining the most computationally efficient consensus architecture, achieving an average accuracy superior to 78% across the eight data sets employed in these experiments. In spite of this notably high accuracy percentage, notice that the running time increase derived from inaccurate predictions is pretty high when measured in relative percentage —e.g. for the Zoo data set, the average running time of truly optimal consensus architectures is more than doubled when suboptimal ones are selected. However, if this average execution time increase is measured in absolute terms, we can conclude that it is perfectly acceptable from a practical viewpoint in most cases —after all, the large relative deviation observed in the Zoo data collection results only in a one-second running time rise.

As regards the parallel implementation of the RHCA, there are a couple of issues worth noting: firstly, the proposed prediction methodology performs worse than in the serial case. This is due to the fact that, as observed across all the experiments conducted, PERTRHCA is a poorer estimator of PRTRHCA. However, although ΔRT reaches very high values in relative terms, the absolute running time deviations between the truly and allegedly fastest consensus architectures are, again, not of paramount importance (i.e. the modest running time overhead caused by a slightly erroneous estimation of the fastest RHCA variant is clearly preferable to the hypothetical execution of the least efficient consensus architecture).

Recall that the proposed running time estimation methodology is based on capturing the execution times of several (namely c) runs of the consensus process on mini-ensembles of the sizes bij corresponding to each RHCA variant. As aforementioned, the results just reported have been obtained upon a single execution (c = 1). Presumably, the larger the value of c, the more accurate the estimation, but also the more costly it is to compute.
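A minimal sketch of this timing step, assuming a generic consensus_fn callable and the Python standard library (the names and the averaging strategy are illustrative, not the original Matlab implementation), is the following:

```python
import random
import time

def time_consensus(consensus_fn, ensemble, b, c=1):
    """Average wall-clock time of c consensus runs on random mini-ensembles of size b.

    consensus_fn: callable taking a list of b clusterings and returning one clustering.
    ensemble: the full cluster ensemble (a list of clusterings).
    """
    elapsed = []
    for _ in range(c):
        mini = random.sample(ensemble, b)        # randomly picked mini-ensemble
        t0 = time.perf_counter()
        consensus_fn(mini)
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / c

# These per-size averages are then combined, following the estimation methodology
# described in the text, into SERT/PERT figures for each candidate architecture.
```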

So as to evaluate the influence of this factor, figure 3.8 depicts the evolution of the<br />

percentage of correct predictions (for both the serial and parallel implementations, referred<br />

to as %CPS and %CPP, respectively) and of the running time deviations (ΔRTS and ΔRTP<br />

in both absolute and relative terms) as a function of the parameter c, varying its value<br />

between 1 and 20, averaged across the eight data collections employed in this experiment.<br />

It can be observed that, despite the relatively wide sweep of values of c, the variation in the percentage of correct predictions is below 6% for both the serial and parallel RHCA implementations —see figure 3.8(a). In terms of the difference between the running times of the truly and allegedly fastest consensus architectures, this results in a slight reduction of ΔRTS and ΔRTP –figure 3.8(b)–, which is, in any case, lower than 1.7 seconds —a maximum reduction of 22% in relative percentage terms, see figure 3.8(c).

Thus, we can conclude that, despite being a coarse approximation, using the running<br />

time of a single consensus process as the basis for estimating the execution time of the<br />

whole RHCA yields pretty accurate results as far as the prediction of the most efficient<br />

consensus architecture is concerned. Furthermore, when this prediction methodology fails,<br />

the execution time overhead is, in most cases, not dramatic from a practical standpoint.<br />



Figure 3.8: Evolution of the accuracy of the serial and parallel RHCA running time estimation as a function of the number of consensus processes c used in the estimation, measured in terms of (a) the percentage of correct predictions (%CPS and %CPP), and the (b) absolute and (c) relative running time deviations (ΔRTS and ΔRTP) between the truly and allegedly optimal consensus architectures, averaged across the eight data sets employed.

Summary of the most computationally efficient RHCA variants<br />

Following the proposed methodology, we have estimated which are the most computationally<br />

efficient consensus architectures for the twelve unimodal data collections described in<br />

appendix A.2.1. The results corresponding to the fully serial and parallel implementations<br />

are presented in tables 3.4 and 3.5, respectively.<br />

From a notational viewpoint, RHCA variants are expressed in terms of their number of stages s (in case there exist two variants with the same number of stages, we denote whether it corresponds to the implementation using the minimum or maximum mini-ensemble size by the symbols bm and bM, respectively). Moreover, successful predictions of the computationally optimal consensus architectures (i.e. the minima of SERTRHCA –or PERTRHCA– and SRTRHCA –or PRTRHCA– are yielded by the same consensus architecture) are denoted by the dagger symbol (†). Obviously, this applies to the first eight data collections (from Zoo to MFeat), where the true running times of all consensus architectures have been measured after their real execution. For the remaining four data sets (from miniNG to PenDigits), the allegedly optimal consensus architectures are presented —however, we think it is not an outlandish assumption to consider that a rate of correct predictions of the computationally optimal consensus architecture comparable to that presented in table 3.3 can be expected in these cases.

As regards the consensus architecture serial implementation (table 3.4), a few observations<br />

can be made: firstly, the higher the degree of diversity (i.e. the larger the cluster<br />

ensembles), the more efficient RHCA variants become when compared to flat consensus.<br />

As observed earlier, the most notable exception to this rule of thumb occurs when clustering<br />

combination is conducted by means of the EAC consensus function, whereas it can<br />

be observed that the remaining ones show, in general terms, a pretty similar behaviour as<br />

regards the type of consensus architecture (flat or hierarchical) that minimizes the total<br />


running time. Secondly, notice that flat consensus tends to be computationally optimal in those data sets having small cluster ensembles even in high diversity scenarios (e.g. Iris, Balance or MFeat). Thirdly, for data collections containing a large number of objects n (such as PenDigits), only the HGPA and MCLA consensus functions are executable in our experimental conditions (as they are the only ones whose complexity scales linearly with the data set size, see appendix A.5). And last, notice the predominance of RHCA variants with s = 2 and s = 3 stages among the fastest ones, which seems to indicate that, from a computational perspective, RHCA variants with few stages are more efficient, even though their consensus processes are conducted on large mini-cluster ensembles.

Most of these observations can be extrapolated to the case of the fully parallel consensus<br />

implementation (table 3.5), where we can observe a pretty overwhelming prevalence of<br />

RHCA variants over flat consensus, a trend that was already reported earlier in this section<br />

and also in appendix C.2.<br />

In the remainder of this work, experiments concerning random hierarchical consensus architectures have been limited, for the sake of brevity, to those RHCA variants of minimum estimated running time.



Consensus function / diversity scenario; dataset columns in order: Zoo, Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC, PenDigits.

CSPA   |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
CSPA   |dfA| = 10:  s=2,bM† flat† s=2,bM† s=2,bM† s=2,bM† s=2,bM† flat† flat† s=2,bM flat flat –
CSPA   |dfA| = 19:  s=3,bM flat† s=2,bM† s=2,bM† s=2,bM s=2,bM† flat† flat† s=2,bm s=2,bM s=2,bM –
CSPA   |dfA| = 28:  s=2,bm† flat† s=2,bM† s=2,bM s=2,bm s=2,bM flat† flat† s=3,bM s=2,bM s=2,bM –
EAC    |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 10:  flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 19:  flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 28:  flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
HGPA   |dfA| = 1:   flat† flat† flat† flat† flat† flat flat† flat† s=3,bM s=2,bM flat s=2,bM
HGPA   |dfA| = 10:  s=2,bM† flat† s=2,bM† s=2,bM† s=2,bm† s=2,bm† flat† s=2,bM† s=4,bM s=2,bm s=2,bm s=2,bm
HGPA   |dfA| = 19:  s=2,bm† flat† s=2,bM† s=2,bM s=2,bm† s=3,bM† s=2,bM† s=2,bm† s=2,bm s=3,bM s=2,bm s=3,bm
HGPA   |dfA| = 28:  s=2,bm† s=2,bM† s=3,bM s=2,bm s=3,bM† s=3,bM s=2,bM† s=3,bM s=3,bM s=4,bM s=6 s=3,bm
MCLA   |dfA| = 1:   flat† flat† flat† flat† flat† flat flat† flat† s=3,bm s=2,bm s=3,bM s=2,bM
MCLA   |dfA| = 10:  s=3,bM flat† s=2,bm† s=2,bm† s=3,bM† s=2,bm† flat s=2,bm† s=4,bM s=4,bM s=3,bM s=3,bm
MCLA   |dfA| = 19:  s=2,bm† s=2,bM† s=2,bm s=2,bm† s=2,bm s=3,bM s=2,bM† s=2,bm† s=2,bm s=3,bm s=4,bm s=3,bm
MCLA   |dfA| = 28:  s=2,bm† s=2,bM† s=2,bm s=3,bM† s=2,bm† s=3,bM† s=2,bM s=3,bm s=3,bM s=4,bM s=3,bM s=4,bM
ALSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
ALSAD  |dfA| = 10:  s=2,bm† flat† s=2,bM s=2,bM s=2,bM† flat† flat† flat† s=2,bM flat flat –
ALSAD  |dfA| = 19:  s=2,bm† flat† s=2,bm† s=2,bm† s=2,bm† s=2,bM flat† flat† s=2,bm flat flat –
ALSAD  |dfA| = 28:  s=3,bM s=2,bM† s=2,bm† s=3,bM s=2,bm† s=2,bm flat† flat† s=3,bm flat flat –
KMSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
KMSAD  |dfA| = 10:  s=2,bm† flat† s=2,bM† s=2,bM† s=3,bM† s=2,bM† flat† flat† flat flat flat –
KMSAD  |dfA| = 19:  s=2,bm† flat s=2,bm† s=2,bM† s=3,bM† s=2,bm† flat† flat† s=2,bM s=2,bM s=2,bM –
KMSAD  |dfA| = 28:  s=2,bm† s=2,bM s=3,bM s=2,bm† s=2,bm† s=2,bm† flat† flat† s=3,bM s=3,bM s=2,bM –
SLSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
SLSAD  |dfA| = 10:  s=2,bm† flat† s=2,bM s=2,bm† s=2,bM s=2,bM† flat† flat† s=2,bM flat flat –
SLSAD  |dfA| = 19:  s=2,bm† flat† s=2,bm† s=3,bM s=3,bM s=2,bM flat† flat† s=2,bM flat flat –
SLSAD  |dfA| = 28:  s=2,bm† s=2,bM s=3,bM s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bm flat flat –

Table 3.4: Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully serial implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.


Consensus function / diversity scenario; dataset columns in order: Zoo, Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC, PenDigits.

CSPA   |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† s=2,bm s=2,bm flat –
CSPA   |dfA| = 10:  s=2,bm† flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM s=2,bm s=2,bm –
CSPA   |dfA| = 19:  s=3,bm flat s=3,bM s=3,bM s=2,bm† s=3,bM flat† s=3,bM s=3,bM s=2,bm s=2,bm –
CSPA   |dfA| = 28:  s=2,bm† s=3,bM s=3,bm s=3,bM s=2,bm† s=3,bM flat s=3,bm s=3,bM s=2,bm s=2,bm –
EAC    |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 10:  s=2,bm† flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† flat s=2,bm s=3,bM –
EAC    |dfA| = 19:  s=2,bm† s=3,bM s=3,bM s=3,bM s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –
EAC    |dfA| = 28:  s=2,bm† s=3,bM s=2,bm† s=3,bM s=2,bm† s=3,bM s=2,bm flat† s=2,bm s=2,bm s=2,bm –
HGPA   |dfA| = 1:   flat† flat† flat† flat† s=3,bM s=3,bM flat† flat† s=3,bm s=3,bm s=2,bm s=3,bM
HGPA   |dfA| = 10:  s=3,bM flat† s=2,bm s=3,bM s=4,bM s=3,bm s=2,bm† s=3,bM s=6 s=4,bm s=2,bm s=4,bM
HGPA   |dfA| = 19:  s=3,bm† s=3,bM s=3,bM s=4,bM s=2,bm† s=4,bM s=2,bm† s=4,bM s=2,bM s=4,bM s=6 s=3,bm
HGPA   |dfA| = 28:  s=3,bm s=3,bM s=3,bm s=3,bM s=3,bm s=3,bm† s=3,bm s=3,bm† s=3,bM s=5,bm s=6 s=4,bM
MCLA   |dfA| = 1:   s=2,bm† flat† flat† flat† s=3,bM s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm s=3,bM
MCLA   |dfA| = 10:  s=3,bm† s=2,bm† s=2,bm s=3,bm s=4,bM s=2,bm† s=2,bm† s=3,bM s=5 s=4,bm s=4,bm s=4,bM
MCLA   |dfA| = 19:  s=2,bm s=3,bM s=3,bM s=4,bM s=2,bm† s=4,bM s=2,bm† s=4,bM s=2,bM s=4,bM s=4,bm s=3,bm
MCLA   |dfA| = 28:  s=3,bm† s=2,bm† s=3,bm s=3,bm† s=4,bM s=4,bm s=2,bm† s=3,bm† s=3,bM s=5,bm s=3,bM s=4,bM
ALSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
ALSAD  |dfA| = 10:  s=3,bM flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM flat s=3,bM –
ALSAD  |dfA| = 19:  s=2,bm† s=3,bM s=3,bM s=2,bm† s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –
ALSAD  |dfA| = 28:  s=4,bM s=2,bm† s=3,bM s=3,bm s=2,bm† s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm –
KMSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
KMSAD  |dfA| = 10:  s=2,bm† flat† s=2,bm s=3,bM s=4,bM s=2,bm† flat† flat† s=3,bM s=3,bm s=2,bM –
KMSAD  |dfA| = 19:  s=2,bm† s=3,bM s=3,bM s=3,bm s=3,bm s=3,bM s=2,bm flat† s=3,bM s=2,bm s=2,bm –
KMSAD  |dfA| = 28:  s=2,bm† s=2,bm† s=4,bM s=3,bM s=2,bm† s=3,bM s=4,bM flat† s=3,bM s=2,bm s=3,bm –
SLSAD  |dfA| = 1:   s=2,bM flat† flat† flat† flat† flat† flat† flat† flat flat flat –
SLSAD  |dfA| = 10:  s=3,bM flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM s=2,bm s=3,bM –
SLSAD  |dfA| = 19:  s=2,bm† s=2,bM s=3,bM s=4,bM s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –
SLSAD  |dfA| = 28:  s=3,bm s=2,bm† s=3,bM s=3,bm s=2,bm† s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm –

Table 3.5: Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully parallel implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.



3.3 Deterministic hierarchical consensus architectures<br />

This section is devoted to the description of deterministic hierarchical consensus architectures<br />

(or DHCA). As in the previous section, we present a generic definition of this<br />

architectural variant along with a study of its computational complexity.<br />

3.3.1 Rationale and definition<br />

As opposed to random HCA, this proposal drives the creation of the mini-ensembles by<br />

a deterministic criterion. The main idea behind DHCA is to exploit the distinct ways of<br />

introducing diversity in the cluster ensemble as the guiding principle for creating the mini-ensembles

upon which the intermediate consensus clustering solutions are built. That is,<br />

a key differential factor between DHCA and RHCA is that the former type of architecture<br />

is indirectly designed by the user when creating the cluster ensemble, whereas the latter<br />

requires the user to fix an architectural defining factor (i.e. assign a value to the size of the<br />

mini-ensembles b).<br />

Enlarging on the relationship between the creation of the cluster ensemble and the<br />

configuration of the DHCA, it is important to recall the strategies employed for introducing<br />

diversity in cluster ensembles (see section 2.1).<br />

For instance, heterogeneous cluster ensembles –whose components are generated by<br />

the execution of multiple clustering algorithms on the data set– have a single diversity<br />

factor, i.e. the set of distinct clustering algorithms employed. Meanwhile, when creating<br />

homogeneous cluster ensembles (those compiling the outcomes of multiple runs of a single<br />

clustering algorithm), a wider spectrum of diversity factors can be applied, such as the<br />

random starting configuration of a stochastic algorithm, or the use of distinct attributes for<br />

representing the objects in the data set, among others.<br />

As aforementioned, in this work we combine both the homogeneous and heterogeneous<br />

approaches for creating cluster ensembles, aiming not only to obtain highly diverse cluster<br />

ensembles, but also to design a strategy for fighting against clustering indeterminacies. This<br />

means that we employ several mutually crossed diversity factors (e.g. multiple clustering<br />

algorithms are run on several data representations with varying dimensionalities), and this<br />

will be the scenario where DHCA will be defined.<br />

In general terms, let us denote the number of diversity factors employed in the cluster<br />

ensemble creation process as f. Each diversity factor dfi ∀i ∈ [1,f] has a cardinality |dfi|<br />

—e.g. |dfi| denotes the number of clustering algorithms employed for creating the cluster<br />

ensemble in case that the ith diversity factor dfi represents the algorithmic diversity of the<br />

ensemble.<br />

Finally, notice that, if fully mutual crossing between all diversity factors is ensured (e.g.<br />

each cluster ensemble component is the result of running each clustering algorithm on each<br />

document representation of each distinct dimensionality), the cluster ensemble size l can be<br />

expressed as:

l = \prod_{k=1}^{f} |df_k|     (3.12)

Let us see how the design of the cluster ensemble determines the topology of a deterministic<br />

hierarchical consensus architecture. The guiding principle is that the consensus<br />


processes conducted at each stage of a DHCA combine those clusterings stemming from a<br />

single diversity factor (e.g. those ensemble components obtained by applying all the available<br />

algorithms on a particular object representation with a specific dimensionality). Then,<br />

quite obviously, the number of stages of a DHCA is equal to the number of diversity factors<br />

employed in creating the cluster ensemble, i.e. s = f.<br />

Besides selecting the diversity factors (and their cardinality) used in generating the cluster<br />

ensemble, the user must make an additional choice that affects the DHCA configuration:<br />

deciding which diversity factor is subject to consensus at each DHCA stage. This is specified<br />

by means of an ordered list of diversity factors, O = {df1, df2, . . . , dff}, so that dfi will

refer hereafter to the diversity factor which is subject to consensus at the ith stage of the<br />

DHCA.<br />

As regards the number of consensus processes executed at the ith DHCA stage (Ki), it is equal to the product of the cardinalities of the diversity factors that have been subject to consensus neither in the present nor in any previous stage —except for the last stage, where a single consensus is conducted—, as defined in equation (3.13).

K_i = \begin{cases} \prod_{k=i+1}^{f} |df_k| & \text{if } 1 \le i < f \\ 1 & \text{if } i = f \end{cases}     (3.13)


Figure 3.9: An example of a deterministic hierarchical consensus architecture operating on a cluster ensemble created using three diversity factors: three clustering algorithms (|dfA| = 3), two object representations (|dfR| = 2) of three dimensionalities each (|dfD| = 3). The cluster ensemble component obtained by running the ith clustering algorithm on the jth object representation and the kth dimensionality is denoted as λi,j,k. Consensus clusterings are sequentially created across the algorithmic, dimensional and representational diversity factors (dfA, dfD and dfR, respectively).

Therefore, a total of K2 = |dfR| = 2 consensus processes are run, each on a mini-ensemble of size b2j = |dfD| = 3, ∀j ∈ [1, 2]. The halfway consensus clustering solutions obtained after this second stage are designated as λA,j,D.

And finally, the last DHCA stage combines the clusterings output by the previous one, which only differ in their original object representation. Being the final stage of the hierarchy, a single consensus process is executed (K3 = 1), and the size of the mini-ensemble coincides with the cardinality of the representation diversity factor, i.e. b3j = |dfR| = 2.
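To make the topology computation concrete, the following Python sketch (hypothetical helper, not part of the original implementation) derives the number of consensus processes per stage Ki —equation (3.13)— and the mini-ensemble sizes for the ordered cardinalities of the example of figure 3.9; the total ensemble size under full mutual crossing —equation (3.12)— is also computed.

```python
from math import prod

def dhca_topology(cardinalities):
    """Topology of a DHCA given the ordered diversity factor cardinalities |df_1..f|.

    Returns, per stage i, the pair (K_i, b_i): K_i is the number of consensus
    processes run at stage i (equation 3.13) and b_i the mini-ensemble size |df_i|.
    """
    f = len(cardinalities)                       # number of stages s = f
    stages = []
    for i in range(1, f + 1):
        K_i = prod(cardinalities[i:]) if i < f else 1
        stages.append((K_i, cardinalities[i - 1]))
    return stages

# Example of figure 3.9: O = {df_A, df_D, df_R} with |df_A| = 3, |df_D| = 3, |df_R| = 2
print(dhca_topology([3, 3, 2]))   # -> [(6, 3), (2, 3), (1, 2)]
print(prod([3, 3, 2]))            # l = 18 under full mutual crossing (equation 3.12)
```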

3.3.2 Computational complexity<br />

The maximum and minimum time complexities of DHCA –corresponding to the serial and<br />

parallel execution of the consensus processes of each stage, respectively– are estimated in<br />

the following paragraphs. In this case, the goal is to express these complexities in terms of<br />

the cardinality and number of diversity factors employed in the cluster ensemble creation<br />

process. Recall that the time complexity of consensus functions typically grows linearly or quadratically with the cluster ensemble size, i.e. it is O(l^w), where w ∈ {1, 2}.


Serial DHCA<br />

Firstly, let us consider the fully serialized version of the DHCA. In this case, the time<br />

complexity amounts to the sum of all the consensus processes, as defined by equation (3.14).<br />

Notice that STCDHCA can be expressed ultimately in terms of the number and cardinalities<br />

of the diversity factors employed in the generation of the cluster ensemble.<br />

STC_{DHCA} = \sum_{i=1}^{s} \sum_{j=1}^{K_i} O(b_{ij}^{w}) = \sum_{i=1}^{s} K_i \cdot O(b_{ij}^{w}) = \sum_{i=1}^{f} \left( \prod_{k=i+1}^{f} |df_k| \right) \cdot O(|df_i|^{w})     (3.14)

Keeping the higher order terms, serial DHCA time complexity is:

STC_{DHCA} = O\left( \left( \prod_{k=2}^{f} |df_k| \right) |df_1|^{w} \right)     (3.15)

Parallel DHCA

And secondly, the time complexity of the parallelized execution of the DHCA is presented<br />

in equation (3.16). As all the consensus processes in a given DHCA stage are equally costly,<br />

the value of PTCDHCA amounts to the addition of the complexities of one of the consensus<br />

processes run on each of the s stages of the hierarchy.<br />

PTC_{DHCA} = \sum_{i=1}^{s} O(b_{ij}^{w}) = \sum_{i=1}^{f} O(|df_i|^{w})     (3.16)

Notice that the parallel execution of a DHCA can be regarded as a sequence of f instructions of complexity O(|dfi|^w), ∀i ∈ [1, f]. Therefore, applying the sum rule of asymptotic notation, PTCDHCA can be rewritten as:

PTC_{DHCA} = O\left( \max_{i \in [1,f]} |df_i|^{w} \right)     (3.17)

3.3.3 Running time minimization

As in section 3.2, a naturally arising question regarding the practical implementation of<br />

deterministic hierarchical consensus architectures is the following: given a cluster ensemble<br />

of size l created upon a set of diversity factors dfi (for i = {1,...,f}), which is the least<br />

time consuming DHCA variant that can be built?<br />

Indeed, as the topology of a deterministic hierarchical consensus architecture is ultimately<br />

determined by an ordered list O of the f diversity factors indicating upon which<br />

diversity factor consensus is conducted at each DHCA stage, there exist f! distinct DHCA

variants for a given consensus clustering problem —one for each of the possible ways of ordering<br />

the f diversity factors. Then, the question transforms into: how should the diversity<br />

factors be ordered so as to minimize the total running time of the DHCA?<br />


Notice that, as opposed to what was observed in random HCA, the distinct DHCA<br />

variants do not differ in their number of stages (which is, in all cases, equal to the number<br />

of diversity factors, i.e. s = f), but in the time complexity of each stage of the architecture.<br />

Thus, in order to determine which is the computationally optimal DHCA variant, it is<br />

necessary to analyze the dependence between the ordering of the diversity factors and the<br />

total number of consensus processes executed and their complexity. In this section, we tackle<br />

this issue for both the fully serial and parallel implementation of deterministic hierarchical<br />

consensus architectures.<br />

Without loss of generality, let us assume that consensus clustering is to be conducted<br />

on a cluster ensemble of size l generated upon f = 3 mutually crossed diversity factors. By<br />

means of an ordered list O, these three factors are associated to one of the stages of the<br />

DHCA, i.e. O = {df1,df2,df3} —recall that, according to the definition of DHCA, i) the<br />

numerical subindex of each diversity factor identifies the stage it is associated to, and ii)<br />

Ki consensus processes of complexity O(|dfi|^w) (where w = {1, 2}) are conducted in the ith

stage, with i = {1, 2, 3} in this case.<br />

As aforementioned, the total number of consensus processes depends on the cardinality<br />

of the diversity factors, which in this particular case amounts to the expression presented<br />

in equation (3.18):

\sum_{i=1}^{f} K_i = \sum_{i=1}^{3} K_i = \prod_{k=2}^{3} |df_k| + \prod_{k=3}^{3} |df_k| + 1 = |df_2||df_3| + |df_3| + 1     (3.18)

where the number of consensus per stage Ki is computed according to equation (3.13).<br />

Firstly, let us analyze the running time of the fully parallel DHCA implementation. As<br />

in section 3.2, we assume that sufficient computing resources allow the concurrent execution<br />

of all the consensus processes of any of the DHCA stages —notice that this amounts to<br />

having as many as |df2||df3| parallel computation units capable of running simultaneously<br />

all the consensus processes of the first stage, which is the one with the largest number of<br />

consensus.<br />

If this condition is met, the running time of any parallel DHCA variant becomes independent<br />

of the ordering of the diversity factors. This is due to the fact that, assuming<br />

the fully simultaneous execution of all the consensus processes corresponding to the same<br />

DHCA stage, the running time of the whole consensus architecture will be proportional to<br />

O(|df1|^w) + O(|df2|^w) + O(|df3|^w) –as the running time of each DHCA stage is equivalent to

the execution of a single consensus process–, which is independent of which diversity factor<br />

is assigned to each stage.<br />

However, although the diversity factor ordering does not affect the running time of parallel

DHCA variants, this factor has a significant impact on the dimensioning of the necessary<br />

resources for the entirely parallel execution of all the consensus processes involved, as it is<br />

directly related to the total number of consensus that must be executed in the DHCA.

From equation (3.18), it is straightforward to see that \sum_{i=1}^{f} K_i is independent of the cardinality of the diversity factor associated to the first DHCA stage (df1). Thus, notice that the total number of consensus of a DHCA is minimized if the diversity factors are arranged in the ordered list O according to their cardinality and in decreasing order, i.e. |df1| ≥ |df2| ≥ |df3|. By doing so, the number of consensus processes conducted in the

1. Given the cluster ensemble size l generated upon a set of f mutually crossed diversity<br />

factors, create f! ordered lists corresponding to all the possible permutations of the<br />

diversity factors, each giving rise to a DHCA variant.<br />

2. For each one of the ordered lists, compute the total number of consensus processes<br />

per stage Ki, according to equation (3.13).<br />

4. Measure the time required for executing the consensus function F on c randomly<br />

picked mini-cluster ensembles of sizes |dfi| for i ∈{1,...,f}.<br />

5. Employ the computed parameters of each DHCA variant (i.e. the number of stages s = f, the total number of consensus processes \sum_{i=1}^{s} K_i, and the measured running times of the consensus function F) to estimate the running time of the whole hierarchical architecture, using equations (3.14) or (3.16) depending on whether its fully serial or parallel version is to be implemented in practice.

Table 3.6: Methodology for estimating the running time of multiple DHCA variants.<br />

first stage of the DHCA is minimized, which is equivalent to minimizing the necessary<br />

computation units for the fully parallel implementation of the DHCA.<br />

If the running time of the serial implementation of the DHCA is now considered, the<br />

total number of consensus to be executed is not the only factor to take into account. In<br />

fact, arranging the diversity factors in decreasing order, while minimizing the total number<br />

of consensus to be executed across the DHCA, brings about a contradictory collateral effect<br />

if the complexity and number of the consensus processes run at each stage are considered.<br />

Indeed, notice that it is in the first DHCA stage where the largest number of consensus<br />

processes is executed (K1 = |df2||df3|), and they have the highest complexity (O(|df1|^w), where w = {1, 2}), as |df1| ≥ |df2| ≥ |df3|. Moreover, a single minimum complexity (i.e. O(|df3|^w)) consensus process is conducted at the third stage, as K3 = 1.

Thus, as far as the running time of the serial implementation of DHCA is concerned,<br />

there exists an apparent trade-off involving the order of diversity factors, and the number<br />

and complexity of the associated consensus processes. It is important to note that the computationally<br />

optimal solution ultimately depends on the growth rates of the total number<br />

of consensus and of their time complexities with respect to the cardinality of the diversity<br />

factors (|dfi|, for i = {1, . . . , f}). However, while the latter grow according to a linear or

quadratic law, the former follows a usually steeper multiplicative growth rate.<br />

Similarly to what has been discussed in section 3.2, table 3.6 presents a methodology<br />

for estimating the running times of the f! DHCA variants, which allows comparing them<br />

and, as a consequence, predicting which is the most computationally efficient.<br />
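A compact sketch of this estimation loop is given below (hypothetical names; the timing argument stands for the measured per-consensus running time of step 4 of table 3.6, replaced here by an illustrative quadratic cost model):

```python
from itertools import permutations
from math import prod

def estimate_dhca_variants(factors, timing, serial=True):
    """Estimate running times of the f! DHCA variants (methodology of table 3.6).

    factors: dict mapping diversity factor name -> cardinality, e.g. {'A': 28, 'D': 14, 'R': 5}.
    timing:  function returning the measured time (sec.) of one consensus run on a
             mini-ensemble of size b (step 4 of the methodology).
    Returns a dict mapping variant acronym (e.g. 'ADR') -> estimated running time.
    """
    estimates = {}
    for order in permutations(factors):              # one ordered list O per variant
        cards = [factors[name] for name in order]
        f = len(cards)
        total = 0.0
        for i in range(1, f + 1):
            K_i = prod(cards[i:]) if i < f else 1    # consensus processes at stage i (eq. 3.13)
            t_i = timing(cards[i - 1])               # time of one consensus of size |df_i|
            total += K_i * t_i if serial else t_i    # eq. (3.14) vs. eq. (3.16)
        estimates["".join(order)] = total
    return estimates

# Toy usage with a purely illustrative quadratic cost model (w = 2):
est = estimate_dhca_variants({"A": 10, "D": 14, "R": 5}, timing=lambda b: 1e-3 * b**2)
print(min(est, key=est.get))   # acronym of the allegedly fastest serial variant
```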



3.3.4 Experiments<br />


This section presents the results of multiple experiments oriented to illustrate the computational<br />

efficiency of DHCA, as well as to evaluate the predictive power of the running time<br />

estimation methodology of table 3.6. Their design follows the scheme presented next.<br />

Experimental design<br />

– What do we want to measure?<br />

i) The time complexity of deterministic hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology for predicting the computationally optimal<br />

DHCA variant, in both the fully serial and parallel implementations.<br />

iii) The predictive power of the proposed methodology based on running time estimation<br />

vs the computational optimality criterion based on designing the DHCA<br />

according to a decreasing diversity factor cardinality order, in both the fully<br />

serial and parallel implementations.<br />

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel DHCA variants is<br />

measured in terms of the CPU time required for their execution —serial running<br />

time (SRTDHCA) and parallel running time (PRTDHCA).<br />

ii) The estimated running times of the same DHCA variants –serial estimated running<br />

time (SERTDHCA) and parallel estimated running time (PERTDHCA)– are<br />

computed by means of the proposed running time estimation methodology, which<br />

is based on the measured running time of c = 1 consensus clustering process. Predictions<br />

regarding the computationally optimal DHCA variant will be successful<br />

in case that both the real and estimated running times are minimized by the<br />

same DHCA variant, and the percentage of experiments in which prediction is<br />

successful is given as a measure of its performance. In order to measure the<br />

impact of incorrect predictions, we also measure the execution time differences<br />

(in both absolute and relative terms) between the truly and the allegedly fastest<br />

DHCA variants in the case prediction fails. This evaluation process is replicated<br />

for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />

on the prediction accuracy of the proposed methodology.<br />

iii) Both computationally optimal DHCA variants prediction approaches are compared<br />

in terms of the percentage of experiments in which prediction is successful,<br />

and in terms of the execution time overheads (in both absolute and relative terms)<br />

between the truly and the allegedly fastest DHCA variants in the case prediction<br />

fails.<br />

– How are the experiments designed? The f! DHCA variants corresponding to<br />

all the possible permutations of the f diversity factors employed in the generation<br />

of the cluster ensemble have been implemented (see table 3.6). As described in appendix<br />

A.4, cluster ensembles have been created by the mutual crossing of f = 3

diversity factors: clustering algorithms (dfA), object representations (dfR) and data<br />

dimensionalities (dfD). Thus, in all our experiments, the number of DHCA variants is<br />


f! = 3! = 6, which are identified by an acronym describing the order in which diversity<br />

factors are assigned to stages —for instance, ADR describes the DHCA variant<br />

defined by the ordered list O = {df1 = dfA,df2 = dfD,df3 = dfR}. For a given data<br />

collection, the cardinalities of the representational and dimensional diversity factors<br />

(|dfR| and |dfD|, respectively) are constant, while the cardinality of the algorithmic<br />

diversity factor takes four distinct values |dfA| = {1, 10, 19, 28}, giving rise to the four<br />

diversity scenarios where our proposals are analyzed. Moreover, consensus clustering<br />

has been conducted by means of the seven consensus functions for hard cluster<br />

ensembles described in appendix A.5, which allows evaluating the behaviour of our<br />

proposals under distinct consensus paradigms. In all cases, the real running times<br />

correspond to an average of 10 independent runs of the whole DHCA, in order to

obtain representative real running time values. As described in appendix A.6, all the<br />

experiments have been executed under Matlab 7.0.4 on Pentium 4 3GHz/1 GB RAM<br />

computers.<br />

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the DHCA variants are depicted by means of<br />

curves representing their average values.<br />

– Which data sets are employed? For brevity reasons, this section only describes<br />

the results of the experiments conducted on the Zoo data collection. On this data<br />

set, the cardinalities of the representational and dimensional diversity factors are<br />

|dfR| = 5 and |dfD| = 14, respectively. The presentation of the results of these<br />

same experiments on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat<br />

unimodal data collections is deferred to appendix C.3.<br />

Diversity scenario |dfA| = 1

Figure 3.10 presents the estimated and real running times of the serial and parallel DHCA<br />

implementations in the lowest diversity scenario corresponding to the use of |dfA| = 1<br />

randomly chosen clustering algorithm for creating a cluster ensemble of size l = 57. The<br />

DHCA variants are identified in the horizontal axis of each chart. Meanwhile, the values<br />

of SERTDHCA and PERTDHCA correspond to an arbitrarily chosen estimation experiment<br />

based on a single consensus run (i.e. c = 1).

On one hand, figures 3.10(a) and 3.10(b) present the estimated and real running times<br />

of the serial DHCA implementation (SERTDHCA and SRTDHCA). Notice that SERTDHCA<br />

is a fairly good estimator of the real execution time of the DHCA variants. Moreover, it<br />

successfully predicts the fastest consensus architecture —which is what we are ultimately<br />

interested in. Notice that DRA is the DHCA variant minimizing SRTDHCA, which corresponds<br />

to the decreasing ordering of the diversity factors in terms of their cardinality.<br />
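Incidentally, this decreasing-cardinality rule is straightforward to express; a minimal sketch (hypothetical variable names) for the Zoo cardinalities of this scenario (|dfA| = 1, |dfD| = 14, |dfR| = 5):

```python
factors = {"A": 1, "D": 14, "R": 5}   # Zoo, |dfA| = 1 diversity scenario
acronym = "".join(sorted(factors, key=factors.get, reverse=True))
print(acronym)   # -> 'DRA'
```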

On the other hand, figures 3.10(c) and 3.10(d) depict the estimated and real running<br />

times corresponding to the fully parallel implementation of DHCA (PERTDHCA and<br />

PRTDHCA, respectively). There are two issues worth noting in this case: firstly, notice that<br />

the real execution time of the distinct DHCA variants shows a notably lower dispersion than<br />

in the serial case, which somehow corroborates our conjecture regarding the unimportance<br />

of the diversity factors ordering in parallel DHCA variants. And secondly, it is clear that<br />

PERTDHCA does not perform as accurately as regards the determination of the fastest con-<br />



Figure 3.10: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 57 (|dfA| = 1, |dfD| = 14, |dfR| = 5). Panels (a) and (b) show the serial estimated and real running times (SERT DHCA and SRT DHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT DHCA and PRT DHCA, in sec.), for the ADR, ARD, DAR, DRA, RAD, RDA and flat variants, under the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

sensus architecture as in the serial case. Moreover, notice that in this low diversity scenario,<br />

hierarchical consensus architectures are, in general terms, slower than flat consensus.<br />

Diversity scenario |dfA| = 10

Figure 3.11 presents the results obtained in the diversity scenario corresponding to the generation<br />

of the cluster ensemble by the application of |dfA| = 10 clustering algorithms chosen<br />

at random. In this case, there exists at least one serial DHCA variant that performs faster<br />

than flat consensus —the only exception occurs when consensus is created by the EAC

consensus function (recall this also happened with RHCA, see section 3.2.4). As in the previous<br />

diversity scenario, all the parallel DHCA variants yield pretty similar running times,<br />

as opposed to what is observed in the serial case, where there exist significant execution<br />

time differences between variants. Last, as regards the prediction of the computationally

optimal consensus architecture, notice that its performance is pretty accurate in both the<br />

serial and parallel cases.<br />

Diversity scenario |dfA| = 19

As the diversity level of the cluster ensemble increases, a shift in the computationally optimal<br />

serial DHCA variant can be observed —see figure 3.12. Indeed, as figure 3.12(b) shows,<br />

the ADR variant of DHCA becomes the least computationally expensive serial hierarchical<br />



Figure 3.11: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 570 (|dfA| = 10, |dfD| = 14, |dfR| = 5). Panels (a) and (b) show the serial estimated and real running times (SERT DHCA and SRT DHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT DHCA and PRT DHCA, in sec.), for the ADR, ARD, DAR, DRA, RAD, RDA and flat variants, under the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

consensus architecture except when the EAC consensus function is employed —however, this behaviour is not always successfully predicted by SERTDHCA, as depicted in figure 3.12(a). We believe this is due to the fact that our estimation is founded on the execution time of a single consensus process.

As regards the parallel implementation of DHCA, the six DHCA variants yield very<br />

similar real execution times, as shown in figure 3.12(d), a trend that the running time<br />

estimation also captures —see figure 3.12(c). However, notice that this fact makes it difficult<br />

that the absolute minima of PERTDHCA and PRTDHCA coincide, which will probably harm<br />

the predictive accuracy of our proposal in the parallel implementation context. Moreover,<br />

notice that when the MCLA consensus function is employed, flat consensus is not executable<br />

(with the resources available in our experiments, see appendix A.6), while all the DHCA<br />

variants are.<br />

Figure 3.12: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1083. Panels: (a) serial estimated running time, (b) serial real running time, (c) parallel estimated running time and (d) parallel real running time.

Diversity scenario |dfA| = 28

Figure 3.13 presents the estimated and real running times of the serial and parallel DHCA implementations in the highest diversity scenario, i.e. the one corresponding to the creation of the cluster ensemble by means of the |dfA| = 28 clustering algorithms of the CLUTO clustering toolbox —which gives rise to a cluster ensemble containing l = 1596 components. In this context, arranging the diversity factors in decreasing cardinality order for defining their association to the DHCA stages again yields the most computationally efficient serial

DHCA variant (ADR in this case). This fact somehow reinforces the idea that, when compared to the typically linear or quadratic time complexity of consensus functions, the multiplicative growth rate of the total number of consensus imposes a stronger constraint as far as the running time of the DHCA is concerned. As observed in the previous diversity scenarios, the selection of a particular DHCA variant is a less critical matter when the fully parallel implementation of DHCA is considered, as all six variants yield pretty similar real execution times. In the serial case, in contrast, the accuracy of the running time estimation methodology is much more crucial, although SERTDHCA performs as a reasonably successful predictor.

Figure 3.13: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1596. Panels: (a) serial estimated running time, (b) serial real running time, (c) parallel estimated running time and (d) parallel real running time.

As mentioned earlier, these same experiments have been run on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat unimodal data collections, and the corresponding results are presented in appendix C.3. Most of the conclusions drawn regarding the computational efficiency of hierarchical and flat consensus architectures in the analysis of random hierarchical consensus architectures are also applicable in the context of DHCA, such as the high computational efficiency of i) parallel DHCA even in low diversity scenarios, and ii) serial DHCA in medium and high diversity scenarios, or the dependence between the characteristics of the consensus function employed for conducting clustering combination and the execution time of consensus architectures.

Moreover, two extra conclusions regarding the selection of the computationally optimal DHCA variant must be discussed. Firstly, defining the DHCA architecture by means of an ordered list of diversity factors arranged in decreasing cardinality order (i.e. associating the

most numerous diversity factor to the first stage, the second most populated to the second DHCA stage, and so on) seems to give rise to the most computationally efficient serial DHCA variant. And secondly, the execution time of fully parallel DHCA appears to be pretty insensitive to the way diversity factors are associated to the stages of the hierarchical consensus architecture.
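By way of illustration, this decreasing cardinality ordering rule can be expressed in a few lines of code. The following sketch is illustrative only (it is not part of the experimental software employed in this work); it derives the allegedly fastest serial DHCA variant from the diversity factor cardinalities of the Zoo data set in the highest diversity scenario:

```python
def decreasing_cardinality_variant(diversity_factors):
    """Return the DHCA variant obtained by ordering the diversity factors
    by decreasing cardinality (largest factor assigned to the first stage)."""
    ordered = sorted(diversity_factors.items(), key=lambda item: item[1],
                     reverse=True)
    return "".join(name for name, _ in ordered)

# Example: the Zoo data set in the highest diversity scenario
factors = {"A": 28, "D": 14, "R": 5}   # |dfA|, |dfD|, |dfR|
print(decreasing_cardinality_variant(factors))   # -> ADR
```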

In practice, these two latter facts may play down the accuracy of the computationally optimal consensus architecture prediction methodology presented in table 3.6, as it seems possible to make well-grounded a priori decisions as regards the selection of the fastest deterministic hierarchical consensus architecture variant without the need of any running time estimation. For this reason, the next section presents an exhaustive comparative evaluation of these two strategies for predicting the computationally optimal consensus architecture.

Evaluation of the optimal DHCA prediction methodology based on running time estimation

Following a procedure analogous to that of section 3.2.4, we have computed the percentage of experiments in which the estimated and real running times are simultaneously minimized by the same consensus architecture. The impact of the failures of this prediction methodology is measured in terms of the absolute and relative differences between the real execution times –ΔRT– of the truly (i.e. the one minimizing SRTDHCA or PRTDHCA) and the allegedly (the one that minimizes SERTDHCA or PERTDHCA) computationally optimal consensus architectures. The results corresponding to an averaging across 20 independent running time estimation experiments are presented in table 3.7.

Dataset        Serial DHCA                           Parallel DHCA
               % correct   ΔRT (sec.)   ΔRT (%)      % correct   ΔRT (sec.)   ΔRT (%)
Zoo               55.9        1.32       227.3          40.0        0.03        34.0
Iris              93.4        0.56       180.7          51.2        0.10        63.7
Wine              76.1        1.19        48.2          36.8        0.14        48.1
Glass             76.4        0.76        32.1          39.3        0.23        60.7
Ionosphere        83.2       11.9         33.1          47.9        1.38        46.5
WDBC              76.2       77.73        58.1          39.7        4.93        44.5
Balance           90.6        0.52        26.5          71.1        0.93        51.7
MFeat             88.6        5.40        27.4          68.4        5.83        28.9
average           80.0       12.57        78.0          49.3        1.70        47.3

Table 3.7: Evaluation of the minimum complexity DHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions.
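For reference, the two figures of merit reported in table 3.7 could be computed along the following lines. This is a minimal sketch that assumes the estimated and real running times of all candidate consensus architectures are available, for every experiment, as dictionaries; all function and variable names are illustrative and do not correspond to the actual implementation used in our experiments:

```python
def evaluate_prediction(estimated, real):
    """Compare the allegedly optimal architecture (minimum estimated running
    time) with the truly optimal one (minimum real running time).

    estimated, real: dicts mapping an architecture label (e.g. 'ADR', 'flat')
    to its running time in seconds.
    """
    predicted = min(estimated, key=estimated.get)   # allegedly optimal
    truly = min(real, key=real.get)                 # truly optimal
    delta_sec = real[predicted] - real[truly]       # overhead of a mistake
    delta_pct = 100.0 * delta_sec / real[truly]
    return predicted == truly, delta_sec, delta_pct

def summarize(experiments):
    """Aggregate (estimated, real) timing pairs into the three columns of
    table 3.7: % correct predictions and average deltas of mistaken ones."""
    rows = [evaluate_prediction(est, rt) for est, rt in experiments]
    pct_correct = 100.0 * sum(ok for ok, _, _ in rows) / len(rows)
    mistakes = [(s, p) for ok, s, p in rows if not ok]
    avg_sec = sum(s for s, _ in mistakes) / len(mistakes) if mistakes else 0.0
    avg_pct = sum(p for _, p in mistakes) / len(mistakes) if mistakes else 0.0
    return pct_correct, avg_sec, avg_pct
```

Note that, in this sketch, the ΔRT averages are taken over the mistaken predictions only, consistently with the way the penalizations are reported in table 3.7.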

It can be observed that, in the serial case, SERTDHCA is a pretty accurate predictor, achieving correct prediction rates of the computationally optimal consensus architecture above 75% in all but one of the data sets. The running time overheads associated with incorrect predictions are usually negligible in absolute terms (ΔRT (sec.)) —except for the WDBC data collection, where the large real execution times of any of the consensus architectures make any mistake costly.

As regards the performance of the proposed methodology for predicting the most efficient parallel implementation of DHCA, its degree of accuracy is lower than in the serial case –a circumstance already observed in the context of RHCA–, although the penalization caused by this lower level of precision remains below one second of extra execution time, which constitutes an assumable cost from a practical viewpoint —the WDBC and MFeat data collections stand out as the exceptions to this rule, although the corresponding ΔRT (sec.) overheads (around five seconds) are again of little importance in practice.

Finally, we have also evaluated the influence of employing the execution times of c > 1 consensus executions for estimating the running times of the DHCA variants. As expected, the larger c, the more accurate the running time estimation and, consequently, the smaller the running time overheads derived from incorrect predictions. On the flip side, however, this will slow down the prediction process —recall that, in the experiments presented up to now, c = 1.
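The role of c can be sketched as follows. The snippet is purely illustrative: consensus_fn and mini_ensembles are placeholders for whatever consensus function and mini-ensemble sampling are being used, and the averaged time is meant to feed the SERT/PERT extrapolation described earlier rather than to replace it:

```python
import time

def average_consensus_time(consensus_fn, mini_ensembles, c=1):
    """Average the measured running time of c consensus processes, each run
    on one mini-ensemble; this average is the per-consensus building block
    from which the stage-wise running time estimates are extrapolated."""
    sample = list(mini_ensembles)[:c]     # assumes at least c mini-ensembles
    timings = []
    for ensemble in sample:
        start = time.perf_counter()
        consensus_fn(ensemble)            # e.g. one consensus run on the sample
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```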

As in section 3.2.4, a sweep of values of c ∈ [1, 20] has been conducted, computing the percentage of fastest consensus architecture correct predictions and the absolute and relative running time deviations associated to prediction errors at each step of the sweep, averaging the results of twenty independent runs of this experiment on each one of the eight unimodal data collections —see figure 3.14.

Figure 3.14: Evolution of the accuracy of the serial and parallel DHCA running time estimation as a function of the number of consensus processes used in the estimation, measured in terms of (a) the percentage of correct predictions, and (b) the absolute and (c) relative running time deviations between the truly and allegedly optimal consensus architectures.

It can be observed that, despite the gradual increase of correct predictions (figure 3.14(a)), the running time deviations suffer a steep decrease as soon as c = 4 consensus processes are employed for computing SERTDHCA. Moreover, notice that using larger

values of c does not imply significant reductions in ΔRT, which shows a pretty stationary behaviour for c > 5. Last, it is worth observing that, in the parallel case, the correct prediction percentage increase and relative ΔRT decrease obtained for c > 1 result in almost negligible absolute ΔRT reductions, which again reveals the lesser importance of the a priori selection of a particular parallel DHCA variant.

This adds to the fact that, as suggested earlier, it seems unnecessary to conduct any running time estimation process for determining the fastest hierarchical consensus architecture variant, as assigning the diversity factors to DHCA stages in decreasing cardinality order apparently gives rise to the serial DHCA variant of minimum time complexity. With the purpose of evaluating the validity of this latter hypothesis, we have conducted the experiments presented throughout the following paragraphs.

Evaluation of the optimal DHCA prediction methodology based on decreasing cardinality diversity factor ordering

As far as the serial DHCA implementation is concerned, we have computed the percentage of experiments where the minimum real running time is achieved by the variant corresponding to the decreasing cardinality ordering of the diversity factors. Moreover, in case this prediction fails, we have computed the running time overhead resulting from selecting a computationally suboptimal hierarchical consensus architecture as the fastest one. The average results obtained for each one of the eight unimodal data collections are presented in table 3.8.

Dataset        Serial DHCA
               % correct    ΔRT (sec.)   ΔRT (%)
Zoo              100            –           –
Iris              96.4         0.01         1.2
Wine              89.3         0.69        16.2
Glass             92.9         0.09         5.4
Ionosphere        85.7        23.10        35.3
WDBC              92.9         2.03         2.2
Balance           75.0         1.12        17.1
MFeat             71.4       233.6         18.5
average           88.0        32.6         11.9

Table 3.8: Evaluation of the minimum complexity serial DHCA variant prediction based on decreasing diversity factor ordering in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions.

It can be observed, for instance, that the DHCA variant defined by the ordered list of diversity factors in decreasing cardinality order is always the fastest hierarchical consensus architecture in the Zoo data set —which is equivalent to a 100% correct prediction rate. Using this prediction method, the lowest accuracy is obtained in the MFeat data collection (71.4%) —in this case, the average running time deviation derived from the 28.6% of

incorrect predictions amounts to 233.6 seconds (a very high value due to the large absolute real execution times of hierarchical consensus architectures on this data set), which is equivalent to an average deviation of 18.5% in relative percentage terms.

An averaging across data sets yields a prediction accuracy of 88%, i.e. it performs better than the prediction methodology based on running time estimation, which attained 80% of correct predictions (see table 3.7). This result reinforces the notion that the decreasing cardinality diversity factor ordering approach to selecting the computationally optimal serial DHCA variant is an alternative worth considering, as it requires no previous consensus execution besides obtaining higher levels of prediction accuracy.

Aiming to support the conjecture that there is no computationally superior DHCA variant<br />

when its fully parallel implementation is considered, we have conducted an experiment<br />

seeking to quantify the differences between the least and most time consuming DHCA variants.<br />

So as to provide a valid contrast to these results, the same computation has been<br />

conducted regarding the most and least computationally efficient serial DHCA variants,<br />

proving that making an accurate selection is much more important in the serial than in the<br />

parallel case. Table 3.9 presents the results obtained, averaged across all the experiments<br />

conducted on each of the eight unimodal data collections.<br />

It can be observed that the running time differences between the most and least computationally<br />

efficient DHCA variants are very notable in the serial case —in fact, it takes from<br />

5 to 18 times longer to run the slowest DHCA variant than the computationally optimal<br />

one. In contrast, these variations are much smaller when the fully parallel implementation<br />

of DHCA is considered. In this case, as expected, a greater running time uniformity is<br />

observed across DHCA variants, as the least computationally efficient variant is at most 2.5<br />

times slower than the fastest one.<br />

Dataset        max(SRTDHCA) − min(SRTDHCA)      max(PRTDHCA) − min(PRTDHCA)
                  (sec.)        (%)                (sec.)        (%)
Zoo                 7.36       547.6                 0.08        42.3
Iris                5.34       707.4                 0.12        53.2
Wine               12.66       636.8                 0.23        70.1
Glass               8.78       387.1                 0.34        90.8
Ionosphere        487.73      1650.3                 2.82        92.3
WDBC             2095.36      1357.6                15.91        80.5
Balance           187.77       736.5                 8.57        96.2
MFeat           16667.23      1562.4              1104.08       154.4

Table 3.9: Running time differences between the most and least computationally efficient DHCA variants in both the serial and parallel implementations.

To sum up, the decreasing cardinality diversity factor ordering provides the user with a pretty accurate notion of which is the most computationally efficient DHCA configuration without the need of executing a single consensus process. However, this strategy does not allow deciding whether the allegedly fastest DHCA variant is more efficient than flat consensus. To do so, we propose estimating the running time of the computationally optimal DHCA variant (following the strategy presented in table 3.6), and then i) estimating the running time of flat consensus by extrapolating the execution times of the consensus processes conducted upon mini-ensembles of size |dfi| (for i = {1, ..., f}), or ii) launching the execution of flat consensus, halting it as soon as its running time exceeds the estimated execution time of the allegedly optimal DHCA variant —which is a simpler but less efficient alternative.
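Strategy ii) can be sketched as a simple watchdog around the flat consensus execution. The snippet below is a hedged illustration: run_flat_consensus stands for the actual consensus function call, and launching it in a separate process with a timeout is only one possible way of implementing the early halt:

```python
import multiprocessing as mp

def flat_or_dhca(run_flat_consensus, cluster_ensemble, estimated_dhca_time):
    """Strategy ii): run flat consensus, but halt it if it exceeds the
    estimated running time of the allegedly optimal DHCA variant.

    run_flat_consensus must be a picklable, top-level function.
    Returns 'flat' if flat consensus finished within the budget, and
    'DHCA' if it had to be halted (the hierarchical architecture should
    then be executed instead).
    """
    worker = mp.Process(target=run_flat_consensus, args=(cluster_ensemble,))
    worker.start()
    worker.join(timeout=estimated_dhca_time)
    if worker.is_alive():                 # time budget exceeded
        worker.terminate()
        worker.join()
        return "DHCA"
    return "flat"
```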

Summary of the most computationally efficient DHCA variants<br />

To end this section, we have estimated which are the most computationally efficient consensus<br />

architectures for the twelve unimodal data collections described in appendix A.2.1.<br />

The results corresponding to the fully serial and parallel implementations are presented in<br />

tables 3.10 and 3.11, respectively.<br />

As regards the serial consensus architecture implementation (table 3.10), a few notational<br />

observations must be made: successful predictions of the computationally optimal<br />

consensus architecture (i.e. the minima of SERTDHCA and SRTDHCA are yielded by the<br />

same consensus architecture) are denoted with a dagger (†). Moreover, we highlight the<br />

cases where the minimum time complexity consensus architecture is the DHCA variant defined<br />

by the ordered list of diversity factors arranged in decreasing cardinality order using<br />

the double dagger (‡) symbol. Quite obviously, this only applies to the first eight data<br />

collections (Zoo to MFeat), as in these cases both the estimated and real execution times<br />

are available. For the remaining data sets (miniNG to PenDigits), we have only estimated<br />

which are the computationally optimal consensus architectures.<br />

Firstly, notice the large number of † symbols in table 3.10, which indicates the reasonably high accuracy of the proposed optimal consensus architecture prediction methodology. Moreover, notice that, in most of the cases where we correctly predict that the least time consuming consensus architecture is a DHCA variant, its architecture is created by arranging the diversity factors in decreasing order of cardinality (which is denoted by the ‡ symbol).

Secondly, it is important to highlight that the higher the degree of diversity, the more efficient DHCA variants become when compared to flat consensus —as already observed throughout all the experiments reported, the EAC consensus function constitutes an exception to this rule. However, notice that flat consensus tends to be computationally optimal in those data sets having small cluster ensembles even in high diversity scenarios (e.g. Iris, Balance or MFeat).

Thirdly, as the number of objects n contained in the data set increases (such as in the PenDigits collection), only the HGPA and MCLA consensus functions are executable (as they are the only ones whose time complexity scales linearly with the data set size, see appendix A.5), and hierarchical consensus architectures are the most computationally efficient ones. However, if the data set were even larger, neither flat consensus nor DHCA would be affordable from a computational perspective —with the resources employed in our experiments, see appendix A.6.

Most of these observations can be extrapolated to the case of the fully parallel consensus implementation (table 3.11), where a pretty overwhelming prevalence of DHCA variants over flat consensus can be observed, a trend that was already reported earlier in this section and can also be observed in the experiments described in appendix C.3.

For brevity reasons, the experiments presented in the remainder of this work concerning deterministic hierarchical consensus architectures will solely refer to those DHCA variants of minimum estimated running time —i.e. those presented in tables 3.10 and 3.11.

Consensus  Diversity     Zoo    Iris   Wine   Glass  Ionosphere  WDBC   Balance  MFeat  miniNG  Segmentation  BBC   PenDigits
function   scenario
CSPA       |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  flat†  flat†  DAR‡        flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  flat†  ADR‡   DAR‡        DAR‡   flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   flat†    flat†  ADR     flat          flat  –
EAC        |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  ADR     flat          flat  –
HGPA       |dfA| = 1     flat†  flat†  flat†  flat†  flat†       DRA    flat†    RDA    DRA     DRA           DRA   DRA
           |dfA| = 10    DAR‡   flat†  flat†  ADR‡   DAR‡        DRA    flat†    ARD‡   DAR     DAR           DAR   DAR
           |dfA| = 19    ARD‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   ARD‡     ARD‡   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   DAR‡        ADR    ARD†     ARD‡   ADR     ADR           ADR   ADR
MCLA       |dfA| = 1     DRA    flat†  flat†  flat†  flat†       flat†  flat†    flat†  DRA     DRA           DRA   DRA
           |dfA| = 10    DAR‡   flat†  DAR‡   ADR‡   DAR‡        DAR‡   flat†    ARD‡   DAR     DAR           DAR   DAR
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   ARD‡     ARD‡   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR‡   ARD‡   ADR‡   ADR‡   DAR‡        ADR‡   ARD‡     ARD‡   ADR     ADR           ADR   ADR
ALSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR         flat†  flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   ADR‡        flat†  flat†    flat†  ADR     flat          flat  –
KMSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       DRA    flat†    RAD    flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  DAR    flat†  DAR‡        DAR‡   flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   DAR         ADR‡   flat†    flat†  ADR     flat          flat  –
SLSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR‡        flat†  flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR    DAR         flat†  flat†    flat†  ADR     flat          flat  –

Table 3.10: Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully serial implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions. The double dagger (‡) identifies DHCA variants defined by the ordered list of diversity factors in decreasing cardinality order.


Consensus  Diversity     Zoo    Iris   Wine   Glass  Ionosphere  WDBC   Balance  MFeat  miniNG  Segmentation  BBC   PenDigits
function   scenario
CSPA       |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  DRA     flat          flat  –
           |dfA| = 10    DAR†   flat†  DAR†   ADR    DAR†        DAR    flat†    ARD    DAR     DAR           DAR   –
           |dfA| = 19    ADR†   ARD    ADR†   ADR†   DAR         DAR†   flat†    ARD    ADR     ADR           ADR   –
           |dfA| = 28    ADR†   ARD†   ADR†   ADR†   DAR         ADR†   ARD†     ARD    ADR     DAR           ADR   –
EAC        |dfA| = 1     DRA    flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR    flat†  DAR    ADR    DAR         DAR    flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR†   ARD    ADR†   ADR†   DAR         DAR    flat†    flat†  flat    flat          flat  –
           |dfA| = 28    ADR†   ARD    ADR    ADR    DAR         ADR    flat†    flat†  flat    flat          flat  –
HGPA       |dfA| = 1     DRA    flat†  flat†  DRA    DRA†        DRA†   flat†    RAD    DRA     DRA           DRA   DRA
           |dfA| = 10    DAR    ARD    DAR    ADR†   DAR†        DAR    ARD      ARD†   DAR     DAR           DAR   DAR
           |dfA| = 19    ADR    ARD†   ADR    ADR†   DAR†        DAR    ARD      ARD†   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR†   ARD†   ADR†   ADR†   DAR†        ADR    ARD      ARD    ADR     ADR           ADR   ADR
MCLA       |dfA| = 1     DRA†   flat†  flat†  flat†  DRA†        DRA†   flat†    flat†  DRA     DRA           DRA   DRA
           |dfA| = 10    DAR†   ARD    DAR    ADR†   DAR†        DAR    ARD      ARD†   DAR     DAR           DAR   DAR
           |dfA| = 19    ADR†   ARD†   ADR†   ADR    DAR†        DAR    ARD†     ARD†   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR†   ARD†   ADR†   ADR†   DAR         ADR†   ARD      ARD†   ADR     ADR           ADR   ADR
ALSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR    flat†  DAR    ADR†   DAR         DAR    flat†    flat†  DAR     flat          flat  –
           |dfA| = 19    ADR†   flat†  ADR    ADR    DAR         DAR    flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR†   ARD†   ADR    ADR    DAR         ADR†   flat†    flat†  ADR     flat          flat  –
KMSAD      |dfA| = 1     flat†  flat†  DRA    DRA    flat†       DRA†   flat†    RAD    DRA     flat          flat  –
           |dfA| = 10    DAR    ARD    DAR†   ADR    DAR         DAR    ARD      ARD    DAR     DAR           DAR   –
           |dfA| = 19    ADR    ARD†   ADR†   ADR    DAR         DAR    ARD      ARD    ADR     ADR           ADR   –
           |dfA| = 28    ADR†   ARD    ADR†   ADR    DAR         ADR†   ARD†     ARD    ADR     ADR           ADR   –
SLSAD      |dfA| = 1     DRA    flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR    flat†  DAR    ADR    DAR         DAR    flat†    flat†  DAR     flat          flat  –
           |dfA| = 19    ADR    ARD    ADR    ADR    DAR         DAR    flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR    ARD    ADR†   ADR    DAR         ARD    flat†    flat†  ADR     ADR           flat  –

Table 3.11: Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully parallel implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.


3.4 Flat vs. hierarchical consensus

In sections 3.2 and 3.3, two specific implementations of hierarchical consensus architectures have been proposed, alongside a methodology for determining a priori which is the fastest (random or deterministic) HCA variant, and for deciding whether it is computationally advantageous with respect to classic flat consensus. In this section, we present a direct twofold comparison between flat consensus and those DHCA and RHCA variants deemed as the most computationally efficient by the proposed running time estimation methodologies. Firstly, we compare them in terms of computational complexity. In fact, such a comparison could be made upon the results presented in sections 3.2 and 3.3, but we think that a comparison considering only the allegedly best performing variants may simplify the process of drawing meaningful conclusions. And secondly, these least time consuming hierarchical consensus architecture variants will be compared with flat consensus in terms of the quality of the consensus clustering solutions they yield. By doing so, we intend to present a comprehensive picture of our hierarchical consensus architecture proposals in terms of the two main factors that condition robust clustering by consensus: time complexity and quality.

3.4.1 Running time comparison

This section compares the real execution times of the allegedly fastest DHCA and RHCA variants and flat consensus. The experiments conducted follow the design outlined next.

Experimental design

– What do we want to measure? The time complexity of the allegedly fastest DHCA and RHCA variants and flat consensus.

– How do we measure it? We measure the CPU time required for the execution of the aforementioned consensus architectures.

– How are the experiments designed? Such a comparison entails the running times of ten independent runs of each one of the compared consensus architectures. So as to evaluate their computational efficiency under distinct experimental conditions, the consensus processes involved have been conducted by means of the seven consensus functions for hard cluster ensembles employed in this work —see appendix A.5. Moreover, experiments have been replicated on the four diversity scenarios described in appendix A.4 —recall that they differ in the algorithmic diversity factor, as a set of |dfA| = {1, 10, 19, 28} randomly chosen clustering algorithms are employed for creating the cluster ensemble in each diversity scenario.

– How are results presented? In formal terms, the measured execution times are presented by means of boxplot charts, so as to provide the reader with a notion of the degree of dispersion and asymmetry of the running times of each consensus architecture. When comparing boxplots, notice that non-overlapping box notches indicate that the medians of the compared running times differ at the 5% significance level, which allows a quick inference of the statistical significance of the results (an illustrative sketch of this kind of notched boxplot comparison is given right after this list).

– Which data sets are employed? A detailed description of the results of this comparison on the Zoo data collection is presented in the following paragraphs. Recall that the cardinalities of the dimensional and representational diversity factors of this data set are |dfD| = 14 and |dfR| = 5, respectively. For brevity reasons, the results obtained on eleven more unimodal data sets are described in detail in appendix C.4. However, at the end of this section, the running times of the three compared consensus architectures measured across the experiments conducted on the twelve unimodal data collections employed in this work are compiled and compared. The goal of such comparison is to analyze whether any of the consensus architectures is inherently faster than the rest.

Figure 3.15: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 57. Panels: (a) serial implementation running time, (b) parallel implementation running time, with one subplot per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).
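By way of illustration, the kind of notched boxplot comparison shown in figure 3.15 and used throughout this section can be generated with a few lines of matplotlib (our experiments were not produced with this snippet; the timing values below are placeholders rather than measured data):

```python
import matplotlib.pyplot as plt

# Placeholder CPU times (sec.) of ten runs per consensus architecture
rhca_times = [0.31, 0.29, 0.35, 0.30, 0.33, 0.28, 0.32, 0.31, 0.34, 0.30]
dhca_times = [0.26, 0.24, 0.27, 0.25, 0.28, 0.23, 0.26, 0.25, 0.27, 0.24]
flat_times = [0.12, 0.11, 0.13, 0.12, 0.14, 0.11, 0.12, 0.13, 0.12, 0.11]

fig, ax = plt.subplots()
# notch=True draws median confidence notches: non-overlapping notches
# suggest that the medians differ at (roughly) the 5% significance level
ax.boxplot([rhca_times, dhca_times, flat_times],
           notch=True, labels=["RHCA", "DHCA", "flat"])
ax.set_ylabel("CPU time (sec.)")
ax.set_title("CSPA")   # one such subplot per consensus function
plt.show()
```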

Diversity scenario |dfA| = 1

The running times of the estimated computationally optimal serial and parallel DHCA and RHCA implementations and flat consensus architectures in the lowest diversity scenario are presented in figure 3.15.

As regards the fully serial implementation (figure 3.15(a)), it can be observed that flat consensus is 1.2 to 5 times faster than the fastest hierarchical consensus variant regardless of the consensus function employed, and that such differences are, in all cases, statistically significant. As far as the hierarchical consensus architectures are concerned, notice that the fastest DHCA variant (DRA) is more computationally efficient than its RHCA counterpart (which has s = 2 stages and mini-ensembles of size b = 28), except when consensus processes are conducted by means of the EAC and SLSAD consensus functions —in this case, statistically equivalent running times are attained by both HCA. If these results are contrasted with the predicted computationally optimal consensus architectures presented in tables 3.4 and 3.10, a single prediction error is detected (flat consensus turns out to be faster than the DRA DHCA variant when consensus is conducted by MCLA), which reinforces the notion that the proposed computationally optimal consensus architecture prediction methodology performs pretty well.

Figure 3.15(b) presents the running times of the fully parallel optimal hierarchical consensus architectures and flat consensus. As in the serial case, it can be noticed that flat consensus tends to be more efficient than RHCA and DHCA. The only exception occurs when consensus is conducted by means of the MCLA consensus function —which is due to the fact that it is the only combiner whose computational complexity increases quadratically with the size of the cluster ensemble. Last but not least, it is worth noting that, as opposed to what was observed in the serial implementation, the fastest RHCA is less time consuming than the most efficient DHCA variant, and the running time differences between them are statistically significant. The reason for this lies in the fact that this specific RHCA variant has s = 2 stages and consensus is conducted on mini-ensembles of size b = 7, whereas the DHCA variant consists of three consensus stages, in one of which consensus are built on larger mini-ensembles of size |dfD| = 14, which is responsible for the higher computational cost of parallel DHCA in this case.

Diversity scenario |dfA| = 10

The results corresponding to the experiments conducted in the second diversity scenario (i.e. cluster ensembles generated by the compilation of the clusterings output by |dfA| = 10 randomly selected clustering algorithms, giving rise to cluster ensembles of size l = 570) are presented in figure 3.16. In particular, figure 3.16(a) depicts the execution time boxplots of the serial implementation of hierarchical consensus architectures. The first noticeable situation is that, in contrast to what was observed in the lowest diversity scenario, the computationally optimal RHCA variant (s = 2 and b = 20 or b = 285, depending on the consensus function employed) is faster than its DHCA counterpart (DAR). This is due to the fact that the rise of the algorithmic diversity factor (from |dfA| = 1 to |dfA| = 10) entails an increase of the computational cost of one of the DHCA stages that exceeds the increment of the complexity of the RHCA caused by the same factor. Meanwhile, regarding the computational efficiency of flat consensus, two opposed behaviours are observed depending on the consensus function employed: while being faster than any hierarchical architecture when consensus are built using the CSPA and EAC consensus functions, one-step consensus is slower when the remaining clustering combiners are employed. Moreover, the differences between the running times of these consensus architectures are statistically significant at the 5% significance level in all cases.

Figure 3.16(b) presents the results corresponding to the fully parallel implementation of consensus architectures. In this case, flat consensus is at least four times more computationally costly than the DHCA and RHCA variants. Moreover, the optimal RHCA

is faster than its DHCA counterpart (except for the MCLA consensus function), although they both attain very similar execution times, their differences being statistically significant just below the 5% significance level.

Figure 3.16: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 570. Panels: (a) serial implementation running time, (b) parallel implementation running time.

Diversity scenario |dfA| = 19

The running times of the consensus architectures corresponding to the experiments conducted on the third diversity scenario (i.e. cluster ensembles of size l = 1083) are presented in figure 3.17. Figure 3.17(a) depicts the execution time boxplots of the serially implemented consensus architectures. In most cases, hierarchical consensus architectures are faster than their flat counterpart —the only exception occurs when consensus are built using the EAC consensus function, a trend that was already observed in sections 3.2 and 3.3. Notice, moreover, that flat consensus is not executable when MCLA is the consensus function employed for creating the consensus clustering solutions.

When the entirely parallel implementation of hierarchical consensus architectures is evaluated from a computational viewpoint, the optimal RHCA and DHCA variants (see tables

3.5 and 3.11) are at least eight times faster than flat consensus –see figure 3.17(b)– attaining very similar execution times (being statistically equivalent when the HGPA, ALSAD and SLSAD consensus functions are employed). This is quite logical provided that DHCA has 3 stages, building consensus on mini-ensembles of sizes |dfA| = 19, |dfD| = 14 and |dfR| = 5, while the fastest RHCA has two or three stages (depending on the consensus function employed) where consensus is built on mini-ensembles of size b = 26 or b = 27. At the end of the day, the number of stages and the mini-ensemble sizes of DHCA and RHCA are counterbalanced, yielding, as mentioned earlier, pretty similar running times. Notice, however, that when consensus are built using the MCLA consensus function, RHCA is penalized with respect to DHCA, given the larger size of its mini-ensembles and the aforementioned quadratic dependence of this consensus function's running time on this factor.

Figure 3.17: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1083. Panels: (a) serial implementation running time, (b) parallel implementation running time.

Diversity scenario |dfA| = 28

A very similar behaviour to the one just reported is observed when the size of the cluster ensembles is increased. Indeed, in the highest diversity scenario, i.e. the one corresponding to the use of the |dfA| = 28 clustering algorithms of the CLUTO toolbox for creating cluster

ensembles of size l = 1596, almost identical running time boxplot charts are obtained —see figure 3.18.

Figure 3.18: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1596. Panels: (a) serial implementation running time, (b) parallel implementation running time.

As aforementioned, this same experiment has been conducted on the eleven remaining unimodal data collections, and the corresponding execution time boxplot charts are presented in appendix C.4. From their analysis, the following conclusions have been drawn:

– the predictions regarding the computational optimality of flat or hierarchical consensus architectures are made with a high degree of accuracy (in fact, average prediction accuracies of 93.45% and 91.07% are obtained across our experiments in the serial and parallel implementation cases, respectively).

– flat consensus is the most efficient consensus architecture when the EAC consensus function is employed, regardless of the diversity scenario or whether the serial or parallel implementation of hierarchical architectures is employed.

– in large datasets, only HGPA and MCLA are executable, as the time and space complexities of the remaining consensus functions scale quadratically with the number of objects in the collection, as they employ object co-association matrices as a basic element of their consensus processes.

– regarding which type of hierarchical consensus architecture (RHCA or DHCA) is more computationally efficient, little can be said a priori, as it depends on the specific configurations of the hierarchical architectures, i.e. their number of stages and the sizes of the mini-ensembles.

– using the MCLA consensus function penalizes those architectures with large mini-ensembles, as its time complexity depends quadratically on this factor.

Comparison across diversity scenarios and data collections<br />

Aiming to reveal the existence of any global computational superiority pattern between<br />

the two hierarchical consensus architecture variants, besides confirming the hypotheses put<br />

forward earlier, we have compiled the real running times of the assumedly computationally<br />

optimal RHCA and DHCA variants and of flat consensus in each diversity scenario using<br />

each consensus function, across the twelve data collections employed in these experiments.<br />

This process has been replicated for both the fully serial and parallel implementations<br />

of hierarchical consensus architectures, and as a result, the running time boxplot charts<br />

presented in figures 3.19 and 3.20 have been obtained.<br />

For starters, figure 3.19 presents the running times corresponding to the entirely serial<br />

implementation of the computationally optimal RHCA and DHCA variants and flat consensus.<br />

Notice the notable height of the boxes in the boxplots, caused by the fact that they<br />

represent running times of the consensus architectures across data collections with fairly<br />

distinct characteristics (i.e. number of objects and clusters). Nevertheless, the focus of this<br />

analysis should be placed on detecting relative differences between the boxes corresponding<br />

to the three consensus architectures. In general terms, it can be observed how flat consensus<br />

becomes gradually slower than hierarchical consensus as the size of the cluster ensembles<br />

grows (i.e. as the cardinality of the algorithmic diversity factor |dfA| is increased). As reported<br />

earlier, consensus architectures based on the EAC consensus function constitute the<br />

only exception to this rule. As regards the comparison between the running times of RHCA<br />

and DHCA, the most significant differences are observed when the ALSAD and SLSAD<br />

consensus functions are employed —in these cases, the random HCA variants are faster<br />

than their deterministic counterparts, a trend that becomes more apparent as the cluster<br />

ensembles size grows. Finally, notice that, in absolute terms, consensus architectures based<br />

on the HGPA, MCLA and CSPA consensus functions are faster than those employing the<br />

EAC, ALSAD, KMSAD and SLSAD clustering combiners.<br />

And secondly, the execution time boxplots corresponding to flat consensus and the fully<br />

parallel implementation of consensus architectures are depicted in figure 3.20. As in the serial<br />

case, the use of the HGPA, MCLA and CSPA consensus functions gives rise, in general,<br />

to faster consensus architectures than when consensus clustering solutions are generated by<br />

means of the EAC, ALSAD, KMSAD and SLSAD clustering combiners. Moreover, the superiority<br />

of hierarchical architectures in front of flat consensus becomes manifest in diversity<br />

scenarios with |dfA| ≥10, depending on the consensus function employed. As regards the<br />

comparison between RHCA and DHCA, a wide spectrum of behaviours is observed. When<br />

consensus is built upon the CSPA, EAC, HGPA and KMSAD consensus functions, little significant<br />

differences between both hierarchical architectures are detected. Meanwhile, RHCA<br />

94


[Figure omitted from the text extraction: a grid of log-scale CPU time (sec.) boxplots, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), each comparing the RHCA, DHCA and flat architectures; sub-figures (a) to (d) show the serial implementation running time for |dfA| = 1, 10, 19 and 28.]

Figure 3.19: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures across all data collections for the diversity scenarios corresponding to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.


tends to be slightly faster to execute than DHCA in high diversity scenarios, although the fact that the mini-ensemble sizes employed by the fastest random architecture variants are usually larger than those employed in DHCA penalizes the consensus architectures based on the MCLA consensus function. And last, as already observed in the serial case, RHCA tends to be slightly more efficient than DHCA when clustering combination is conducted by means of consensus functions that treat hierarchical clustering similarity measures as data (i.e. ALSAD and SLSAD).

To sum up, we can conclude that there exists a strong dependence between the computationally optimal type of consensus architecture, the size of the cluster ensemble upon which consensus is built and the consensus function employed. From a practical standpoint, when faced with a specific consensus clustering problem (i.e. a cluster ensemble of a given size l and a particular computational resources configuration), the user should take into account how these factors interact when deciding which type of consensus architecture is to be implemented. However, this decision should not be made on computational efficiency grounds only. In fact, it should also take into account the quality of the consensus clustering solution obtained, as quickly obtaining a poor consensus data grouping would be of little use in practice. For this reason, the next section evaluates the quality of the consensus label vectors output by the same consensus architectures that have just been analyzed in computational terms.

3.4.2 Consensus quality comparison<br />

In this section, we evaluate the quality of the consensus clustering solutions yielded by the<br />

fastest DHCA and RHCA variants and flat consensus architectures, which constitutes an<br />

indicator of their suitability for conducting robust clustering. The experiments conducted<br />

to this end follow the design described next.<br />

Experimental design<br />

– What do we want to measure?<br />

i) The suitability of the allegedly fastest DHCA and RHCA variants and flat consensus<br />

for obtaining clustering results robust to the inherent indeterminacies of<br />

clustering.<br />

ii) A further goal of this section is to determine whether certain consensus architectures<br />

tend to outperform others as regards the quality of the consensus clusterings<br />

they obtain.<br />

– How do we measure it?<br />

i) We analyze the quality of the consensus clustering solutions obtained by these consensus architectures, comparing it with that of the individual clusterings contained in the cluster ensemble E upon which consensus is built. The more similar the qualities of the consensus clustering solution and the top quality cluster ensemble components, the higher the robustness to the clustering indeterminacies attained. As mentioned in section 1.2.2, in this work we evaluate clustering solutions by means of an external cluster validity index, i.e. we compare the consensus clustering solution embodied in the labeling vector λc with a predefined



[Figure omitted from the text extraction: a grid of log-scale CPU time (sec.) boxplots, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), each comparing the RHCA, DHCA and flat architectures; sub-figures (a) to (d) show the parallel implementation running time for |dfA| = 1, 10, 19 and 28.]

Figure 3.20: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures across all data collections for the diversity scenarios corresponding to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.


and allegedly correct cluster structure (or ground truth), measuring their degree of resemblance in terms of normalized mutual information, φ (NMI) —recall that φ (NMI) ∈ [0, 1] and the higher its value, the better the quality of the consensus clustering solution. We measure the percentage of experiments in which the consensus clusterings outperform the cluster ensemble components of maximum and median φ (NMI) score, as well as the relative φ (NMI) differences between them.

ii) We compare the φ (NMI) scores of the consensus clusterings obtained by the three<br />

consensus architectures subject to evaluation.<br />

– How are the experiments designed? The experimental methodology followed is<br />

the same as when the computational efficiency of consensus architectures was analyzed<br />

in the previous sections. That is, the consensus quality comparison has been<br />

conducted on the four diversity scenarios described in appendix A.4. In each diversity<br />

scenario, ten independent experiments have been conducted using the seven consensus<br />

functions for hard cluster ensembles employed in this work (CSPA, EAC, HGPA,<br />

MCLA, ALSAD, KMSAD and SLSAD). From a formal viewpoint, the φ (NMI) values<br />

of the consensus clustering solutions obtained in the 10 experiments corresponding to<br />

each consensus function and diversity scenario are presented.<br />

– How are results presented? The measured φ (NMI) values are presented by means of boxplot charts. By doing so, we can see the quality scatter of each consensus function and architecture. Again, non-overlapping box notches indicate that the medians of the compared φ (NMI) values differ at the 5% significance level.

– Which data sets are employed? In this section, we present in detail the results<br />

obtained on the Zoo data collection —for the sake of brevity, the results obtained in<br />

the remaining eleven unimodal data sets are deferred to appendix C.4. However, at<br />

the end of this section, the φ (NMI) scores of the three compared consensus architectures<br />

measured across the experiments conducted on the twelve unimodal data collections<br />

employed in this work are compiled and compared. The goal of such a comparison is to analyze whether any of the consensus architectures tends to yield better consensus clustering solutions than the rest.

One last remark before proceeding: notice that the only differences between serial and parallel hierarchical consensus architectures refer to their time complexity, not to the quality of the consensus clustering solutions they yield. For this reason, the distinction between serial and parallel architectures is not made in this section.
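Since all the evaluations that follow are based on φ (NMI), a minimal Python sketch of how such a score could be computed is given below. It is not part of the original experimental setup: the toy labelings and variable names are placeholders, and scikit-learn's geometric-mean normalization is used because it matches the normalization of Strehl and Ghosh (2002).

```python
# Minimal sketch, assuming scikit-learn is available; the labelings below are
# toy placeholders, not data from the thesis experiments.
from sklearn.metrics import normalized_mutual_info_score

ground_truth     = [0, 0, 0, 1, 1, 1, 2, 2]   # allegedly correct cluster structure
consensus_labels = [0, 0, 1, 1, 1, 1, 2, 2]   # consensus labeling lambda_c

# phi_NMI lies in [0, 1]; the closer to 1, the better the consensus clustering
# resembles the ground truth. Geometric averaging follows Strehl and Ghosh (2002).
phi_nmi = normalized_mutual_info_score(ground_truth, consensus_labels,
                                       average_method="geometric")
print(f"phi_NMI = {phi_nmi:.3f}")
```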

Diversity scenario |dfA| = 1

Firstly, the φ (NMI) values of the estimated optimal serial and parallel DHCA and RHCA implementations and flat consensus architectures in the lowest diversity scenario are presented in figure 3.21. Each chart presents four boxes that, from left to right, represent the φ (NMI) values of the components of the cluster ensemble E, and of the consensus clustering solutions output by the RHCA, DHCA and flat consensus architectures, respectively. It can be observed that the three consensus architectures yield, in general, consensus solutions of fairly similar quality (in fact, the differences between them are statistically non-significant at the 5% level for the CSPA, MCLA, ALSAD and KMSAD consensus functions). The



[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.21: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 57.

largest inter-consensus architecture deviations are found when consensus clustering is based<br />

on the HGPA consensus function, as the notches of the corresponding φ (NMI) boxes are far<br />

from overlapping. If the consensus functions are compared in terms of the quality of the<br />

clustering solutions they yield, it can be observed that the best results are obtained by the<br />

EAC, ALSAD, KMSAD and SLSAD consensus functions. In these cases, the medians of<br />

the consensus solutions output by the consensus architectures are better than 75% of the components of the cluster ensemble E, which denotes a notable level of robustness to

the clustering indeterminacies.<br />

Diversity scenario |dfA| = 10

Secondly, the quality of the consensus clustering solutions output by the consensus architectures in the experiments conducted on the diversity scenario corresponding to cluster ensembles of size l = 570 is presented in figure 3.22. The trends detected in the previous diversity scenario are largely confirmed in this case. That is, those consensus functions based on evidence accumulation (EAC) and object similarity as data (ALSAD, KMSAD and SLSAD) yield the best quality consensus clustering solutions, and they show a high degree of independence with respect to the topology of the consensus architecture. In fact, EAC and SLSAD based consensus architectures give rise to consensus clusterings which are better than 80% of the cluster ensemble components, which again reveals the ability of these consensus functions to attain clustering solutions robust to the uncertainties inherent to clustering. In contrast, consensus labelings obtained by hypergraph-based consensus functions (CSPA, HGPA and MCLA) attain lower φ (NMI) values, while showing a larger quality variability (the latter applies only to the HGPA and MCLA consensus functions).

Diversity scenario |dfA| = 19

The results corresponding to the experiments conducted in the third diversity scenario (i.e.<br />

cluster ensembles generated by the compilation of the clusterings output by |dfA| = 19 randomly selected clustering algorithms, giving rise to cluster ensembles of size l = 1083)

are presented in figure 3.23. The behaviour detected in the previous diversity scenarios<br />

is also found in this case. Again, the largest inter-consensus architecture variations are<br />


[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.22: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 570.

[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.23: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1083.

observed when consensus is built using the HGPA consensus function. In the remaining cases, much smaller deviations are found (perhaps with the exception of ALSAD, though the observed dispersions are smaller than those of HGPA) —in fact, statistically non-significant differences between the three consensus architectures are observed in the EAC, MCLA, KMSAD and SLSAD based architectures.

Diversity scenario |dfA| = 28

And last, the φ (NMI) values of the consensus clustering solutions yielded by the hierarchical<br />

and flat consensus architectures corresponding to the experiments conducted on the<br />

highest diversity scenario (i.e. cluster ensembles of size l = 1596) are presented in figure<br />

3.24. This scenario is ideal for analyzing the variability of the quality of the consensus<br />

clustering solutions output by the distinct consensus functions, as exactly the same cluster<br />

ensemble has been employed in the ten experiments analyzed —in contrast, in the previous<br />

diversity scenarios, the cluster ensemble employed in each one of the ten experiments was<br />

created by compiling the clustering components generated by |dfA| = {1, 10, 19} randomly<br />

picked clustering algorithms (that is, two superimposed randomness factors underlie the<br />

boxplots presented in figures 3.21 to 3.23). In this sense, it can be observed that, for any<br />



[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.24: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1596.

given consensus architecture, the HGPA, MCLA and KMSAD consensus functions have the highest quality variability, due to the existence of some random underlying process in their consensus generation procedures (e.g. the random initialization of k-means in the KMSAD consensus function). In contrast, the qualities of the consensus clusterings output by those consensus architectures based on CSPA, EAC, ALSAD and SLSAD show very small (or even null) variations. As regards the inter-consensus architecture quality divergences, those based on the HGPA and ALSAD consensus functions show the most disparate results, whereas in other cases (e.g. CSPA, EAC, MCLA or KMSAD) statistically equivalent qualities are yielded by the three consensus architectures. And last, as far as the robustness of the consensus clustering solutions is concerned, notice that the EAC, ALSAD, KMSAD and SLSAD based consensus architectures yield the highest quality clustering results, getting fairly close to the top-quality component of the cluster ensemble E and being, in most cases, better than 75% of the clusterings contained in it.

Comparison across diversity scenarios and data collections<br />

So as to provide the reader with a global comparative view of the consensus architectures in<br />

terms of the quality of the consensus clustering solutions they yield, we have compiled the<br />

φ (NMI) values obtained across all the experiments conducted on the twelve unimodal data<br />

collections in each diversity scenario, representing them in the boxplots depicted in figure<br />

3.25. Recall that, when comparing boxplots, non-overlapping box notches indicate that the medians of the compared magnitudes differ at the 5% significance level.
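As a purely illustrative aside (not part of the thesis experiments), the sketch below shows how such notched boxplots could be produced with matplotlib; the three arrays are random placeholders standing in for the per-experiment φ (NMI) scores of each architecture.

```python
# Illustrative sketch only: random placeholder scores, not real results.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
phi_nmi_rhca = rng.uniform(0.55, 0.85, size=120)
phi_nmi_dhca = rng.uniform(0.55, 0.85, size=120)
phi_nmi_flat = rng.uniform(0.50, 0.80, size=120)

# notch=True draws confidence notches around each median; if the notches of two
# boxes do not overlap, their medians differ at roughly the 5% significance level.
plt.boxplot([phi_nmi_rhca, phi_nmi_dhca, phi_nmi_flat], notch=True)
plt.xticks([1, 2, 3], ["RHCA", "DHCA", "flat"])
plt.ylabel("phi_NMI")
plt.show()
```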

A twofold qualitative analysis can be made in view of these results. The first aspect of<br />

study is an intra-consensus function comparison among consensus architectures. A quick<br />

inspection of any of the rows of figure 3.25 reveals that the optimality of consensus architectures<br />

is a property that is local to the consensus function applied. When the clustering<br />

combination process is based on the CSPA consensus function, the three consensus architectures<br />

yield pretty similar quality consensus solutions (as the boxes have a notable overlap),<br />

although DHCA tends to attain slightly higher φ (NMI) values —a similar pattern is observed<br />

in the boxplots presented in the column corresponding to the EAC consensus function. In<br />

contrast, flat consensus architectures yield higher quality consensus than their hierarchical<br />

counterparts when they are based on the HGPA clustering combiner. The analysis of the<br />


results obtained when the MCLA consensus function is employed as the basis of consensus<br />

architectures must be made with care, as flat consensus is not executable when it must<br />

be conducted on large cluster ensembles. For this reason, the boxplots presented in the<br />

MCLA column only reflect the φ (NMI) values corresponding to those experiments where the<br />

three consensus architectures are executable. In these cases, the best consensus clustering<br />

solutions are obtained by the flat and DHCA consensus architectures. A greater degree of<br />

evenness between consensus architectures is observed in the consensus functions that treat<br />

object similarity as object features, i.e. ALSAD, KMSAD and SLSAD —notice the large<br />

overlap between boxes. However, whereas DHCA yields slightly lower quality consensus<br />

clustering solutions than the RHCA and flat consensus architectures when the ALSAD and<br />

KMSAD consensus functions are employed, it is the flat consensus approach that attains<br />

the lowest φ (NMI) values among the SLSAD based consensus architectures.<br />

And secondly, if an inter-consensus function comparison is conducted, we can conclude that the excellent performance of the EAC consensus function on the Zoo data collection apparently constitutes an exception to the rule, as –together with HGPA and SLSAD– it yields the lowest φ (NMI) values (i.e. the poorest consensus clustering solutions) when the results obtained across all the data sets and diversity scenarios are compiled. In contrast, CSPA, MCLA, ALSAD and KMSAD tend to yield comparatively better consensus clustering solutions from a global perspective.

From a more quantitative perspective, we have compared the quality of the consensus clustering solutions yielded by the three consensus architectures with that of the components of the cluster ensemble upon which consensus is conducted. In particular, this comparison has taken into account the cluster ensemble components of median and maximum φ (NMI) with respect to the ground truth (referred to as the median ensemble component, or MEC, and the best ensemble component, or BEC, respectively). This comparison makes sense inasmuch as we regard consensus clustering as a means for becoming robust to the inherent indeterminacies that affect the clustering problem. More specifically, the higher the φ (NMI) of the consensus clustering solution with respect to that of the cluster ensemble components, the higher the robustness achieved. The median and maximum φ (NMI) components are used as a summarized reference of the quality of the cluster ensemble contents.

For this reason, we have evaluated i) the percentage of experiments in which the consensus<br />

clustering solution attains a higher φ (NMI) than the MEC and the BEC, and ii)<br />

the relative percentage φ (NMI) variation between the median and the best cluster ensemble<br />

components and the consensus clustering solution.<br />
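A small sketch of how these two figures of merit could be computed is given below. It assumes that per-experiment φ (NMI) arrays are available for the consensus solution and for the MEC/BEC references, and it averages the relative gain only over the experiments in which the consensus wins; this averaging convention, like all names and values in the snippet, is an assumption for illustration, not a statement of how the reported numbers were obtained.

```python
# Sketch only; names, arrays and the exact averaging convention are assumptions.
import numpy as np

def compare_to_reference(consensus_nmi, reference_nmi):
    """Return (% of experiments where the consensus beats the reference,
    mean relative % phi_NMI gain over the reference in those experiments)."""
    c = np.asarray(consensus_nmi, dtype=float)
    r = np.asarray(reference_nmi, dtype=float)
    wins = c > r
    pct_wins = 100.0 * wins.mean()
    gain = float("nan")
    if wins.any():
        gain = float(np.mean(100.0 * (c[wins] - r[wins]) / r[wins]))
    return pct_wins, gain

# Toy values: one entry per experiment.
consensus_nmi = [0.71, 0.64, 0.80]
mec_nmi       = [0.55, 0.70, 0.62]   # median ensemble component per experiment
bec_nmi       = [0.78, 0.81, 0.79]   # best ensemble component per experiment

print(compare_to_reference(consensus_nmi, mec_nmi))
print(compare_to_reference(consensus_nmi, bec_nmi))
```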

As regards the first issue, table 3.12 presents the percentage of experiments (considering all data sets in the highest diversity scenario) where the consensus clustering solution is better than the median normalized mutual information cluster ensemble component (MEC). It can be observed that the average percentage of such experiments is 53.1%, which indicates that in more than half of the experiments, consensus yields a clustering solution better than the component located halfway up the quality ranking of the cluster ensemble. When the relative percentage φ (NMI) gains between consensus clustering solutions and the MEC are computed, a reasonable average gain of 59% is obtained —see table 3.13 for a detailed presentation of the results per consensus function and consensus architecture.



[Figure omitted from the text extraction: boxplots of φ (NMI) for the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); sub-figures (a) to (d) correspond to |dfA| = 1, 10, 19 and 28.]

Figure 3.25: φ (NMI) of the consensus solutions obtained by the computationally optimal parallel RHCA, DHCA and flat consensus architectures across all data collections for the diversity scenarios corresponding to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.


Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             58.3    25      42.3    33.2    66.7    69.8    41.7
RHCA             69.8    24.7    15.9    74.2    79.1    79.1    50.4
DHCA             83.3    8.3     3.9     77.1    75      77.9    58.3

Table 3.12: Percentage of experiments in which the consensus clustering solution is better than the median cluster ensemble component.

given ground truth (i.e. the BEC), we observe that the consensus clustering solution attains higher φ (NMI) values in only 0.1% of the experiments —see table 3.14. If the degree of improvement of those consensus clustering solutions that attain a higher φ (NMI) than the BEC is measured in terms of relative percentage φ (NMI) increase, a modest 0.6% φ (NMI) gain is obtained on average —see table 3.15 for a detailed view across consensus functions and architectures.

Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             90      12.9    80.2    50.1    107.1   87.7    24.6
RHCA             78.8    16.3    9.2     96.4    94.7    90.6    33.3
DHCA             73.7    11.6    5.6     53.1    83.7    72.2    67.6

Table 3.13: Relative percentage φ (NMI) gain between the consensus clustering solution and the median cluster ensemble component.

As a conclusion, we can see that the application of consensus clustering processes on a collection of partitions of a given data collection provides a means for obtaining a summarized clustering that, although rarely better than the best component available in the cluster ensemble, is fairly often substantially better than the median data partition. However, despite these fairly good results, we aim to obtain clustering solutions more robust to the inherent indeterminacies of clustering (i.e. closer to, or even better than, the maximum quality cluster ensemble component). For this reason, chapter 4 introduces what we call consensus self-refining procedures, which aim to improve the quality of the consensus clustering solutions obtained from either hierarchical or flat consensus architectures.

3.5 Discussion<br />

Our proposal for building clustering systems that behave robustly in the face of the indeterminacies inherent to unsupervised classification problems relies on applying consensus clustering processes on large cluster ensembles created by the use of multiple mutually crossed diversity factors.

In the consensus clustering literature, however, relatively few works face the problem of combining large numbers of clusterings, as most authors tend to employ rather small cluster ensembles for evaluating their proposals. Yet the application of certain consensus clustering approaches in computationally demanding scenarios can be difficult. Typical examples of this include consensus functions based on object co-association measures that become inapplicable on large data collections, or clustering combiners not executable on large cluster ensembles if their complexity increases quadratically with the number of components in the ensemble.


Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             0       0       0       0       0       0.2     0
RHCA             0.4     0       0       0       1.2     0.1     0
DHCA             0       0       0       0       0       0.3     0

Table 3.14: Percentage of experiments in which the consensus clustering solution is better than the best cluster ensemble component.

Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             –       –       –       –       –       0.07    –
RHCA             0.85    –       –       –       0.8     0.07    –
DHCA             –       –       –       –       –       1.1     –

Table 3.15: Relative percentage φ (NMI) gain between the consensus clustering solution and the best cluster ensemble component.

To our knowledge, most previous proposals addressing this problem deal with subsampling strategies as a means for reducing the computational complexity of consensus processes. That is, if the clustering combination task becomes more costly as the number of objects in the data set and/or the number of cluster ensemble components grows, a natural solution consists in applying the consensus clustering process on a reduced version of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007) and/or of the cluster ensemble (Greene and Cunningham, 2006), created by means of a suitable subsampling procedure. Once the consensus process is completed on the reduced data set (or cluster ensemble), it is extended to those entities (objects or cluster ensemble components) that have been left out of the mentioned subset. Whereas reducing the size of the data collection and/or the cluster ensemble subject to the consensus clustering process entails an automatic reduction of the time complexity, one should take into account the cost associated with the subsampling and extension processes, which is often linear with the size of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007).

In contrast, our hierarchical consensus architecture proposals are based on reducing the time complexity of consensus processes without discarding any of the objects in the data set or any of the cluster ensemble components. By means of a divide-and-conquer approach (Dasgupta, Papadimitriou, and Vazirani, 2006), we break a single consensus clustering problem into multiple smaller problems, which gives rise to hierarchical consensus architectures that make it possible to achieve substantial computational time savings, especially in high diversity scenarios —i.e. the ones we might find ourselves in if the strategy of using multiple mutually crossed diversity factors for creating large cluster ensembles is followed.
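The following schematic sketch illustrates the divide-and-conquer idea in its simplest form: mini-ensembles of a fixed size are combined stage by stage until a single consensus labeling remains. It is only a simplified illustration of the principle, not the exact DHCA or RHCA construction described earlier in the chapter, and `consensus_function` stands for any of the consensus functions used in this work (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD or SLSAD).

```python
# Simplified sketch of a multi-stage (hierarchical) consensus pass; it is not
# the exact DHCA/RHCA construction. `consensus_function` maps a list of
# labelings to a single consensus labeling and is assumed to be given.
from typing import Callable, List, Sequence

Labeling = List[int]

def hierarchical_consensus(ensemble: Sequence[Labeling],
                           consensus_function: Callable[[Sequence[Labeling]], Labeling],
                           mini_ensemble_size: int) -> Labeling:
    if mini_ensemble_size < 2:
        raise ValueError("mini-ensembles must contain at least two clusterings")
    b = mini_ensemble_size
    current = list(ensemble)
    # Each stage combines consecutive mini-ensembles of size b into one labeling
    # (a trailing group may contain fewer than b labelings), shrinking the
    # ensemble until a single consensus labeling is left.
    while len(current) > 1:
        current = [consensus_function(current[i:i + b])
                   for i in range(0, len(current), b)]
    return current[0]
```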

As far as we know, the use of divide-and-conquer approaches to the consensus clustering problem has only been reported in (He et al., 2005) as a means for clustering data sets that contain both numeric and categorical attributes. This proposal consists of dividing the original data collection into a purely numeric and a purely categorical subset, conducting


separate clustering processes on each type of features (employing well-established clustering algorithms designed for that purpose), and subsequently combining the resulting clustering solutions by means of consensus functions. Thus, this divide-and-conquer consensus clustering proposal is aimed at dealing with objects composed of multi-type features, rather than at reducing the overall time complexity of consensus processes.

In this chapter, two versions of hierarchical consensus architectures have been proposed.<br />

In each of them, one of the two factors that define the topology of the architecture is fixed in advance, i.e. the number of stages (in deterministic HCA) or the size of the mini-ensembles

(in random hierarchical consensus architectures). Structuring the whole consensus clustering<br />

task as a set of partial consensus processes that take place in successive stages gives<br />

the user the chance to apply different consensus functions across the hierarchy —a possibility<br />

that, to our knowledge, remains unexplored. Moreover, the decomposition of a classic<br />

one-step problem into a set of smaller instances of the same problem naturally allows its parallelization<br />

—provided that sufficient computational resources are available. At this point,<br />

we would like to highlight the fact that, though posed in the context of the robust clustering<br />

problem, hierarchical consensus architectures are applicable to any consensus clustering<br />

task involving large cluster ensembles.<br />

From a practical perspective, we have presented a simple running time estimation<br />

methodology that, for a given consensus clustering problem, allows a fast and fairly accurate prediction of the computationally optimal consensus architecture. However,

the reasonably good performance of the proposed methodology could be further improved<br />

by means of a more complex (probably statistical) modeling of the consensus running times<br />

which constitute the basis of the estimation.<br />

Based on these predictions, we have presented an experimental study in which the flat<br />

and the fastest hierarchical consensus architectures are, firstly, compared in terms of their<br />

execution time. Such a comparison has taken into account the most and least computationally

costly HCA implementations (i.e. fully serial and parallel), so as to provide a notion of the<br />

upper and lower bounds of the time complexity of hierarchical consensus architectures.<br />

One of the most expected conclusions drawn from the conducted experiments is that the

computational optimality of a given consensus architecture is local to the consensus function<br />

F employed for combining the clusterings. In particular, as far as the execution time of<br />

hierarchical consensus architectures is concerned, the main issue to take into account is<br />

the dependence between the time complexity of F and the size of the mini-ensembles upon<br />

which consensus is conducted. For instance, the use of consensus functions whose complexity scales quadratically with the number of clusterings upon which consensus is created (e.g.

MCLA) clearly favours hierarchical consensus architectures. In contrast, flat consensus<br />

is more efficient than the fastest serial hierarchical consensus architectures even in high<br />

diversity scenarios when consensus functions such as EAC are employed.<br />

Besides analyzing their computational aspects, we have also compared hierarchical and<br />

flat consensus architectures in terms of the quality of the consensus clustering solutions<br />

they yield. In this sense, inter-consensus architecture variability is highly dependent on the<br />

characteristics of the cluster ensemble and the consensus function employed. For instance,<br />

hierarchical and flat consensus architectures based on the CSPA, EAC, ALSAD and SLSAD<br />

consensus functions yield consensus clusterings of fairly similar quality, whereas greater variability is observed when the remaining consensus functions are used. Moreover, in general

terms, we have observed that consensus architectures based on EAC and HGPA typically<br />


yield lower quality consensus clustering solutions when compared to the other consensus<br />

functions.<br />

Thus, in some sense, we face a further indeterminacy, this time regarding which consensus function to apply. However, this indeterminacy can be overcome by taking advantage of the capability of creating several consensus clustering solutions by means of multiple consensus functions in computationally optimal time, and subsequently applying a supraconsensus function that selects the highest quality consensus clustering solution in a fully unsupervised manner, as proposed in (Strehl and Ghosh, 2002).

Besides this use, supraconsensus strategies constitute a basic ingredient of the consensus self-refining procedure presented in the next chapter, which is oriented towards improving the quality of consensus clustering solutions as a means for building robust clustering systems upon consensus clustering processes.

3.6 Related publications<br />

Our first approach to hierarchical consensus architectures dealt with deterministic HCA (Sevillano et al., 2007a), although it was solely focused on the analysis of the quality of the consensus clusterings obtained, not on their computational aspects. The details of this publication, presented as a poster at the ECIR 2007 conference held in Rome, are described next.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />

Title: A Hierarchical Consensus Architecture for Robust Document Clustering<br />

In: Proceedings of 29th European Conference on Information Retrieval (ECIR 2007)<br />

Publisher: Springer<br />

Series: Lecture Notes in Computer Science<br />

Volume: 4425<br />

Editors: Giambattista Amati, Claudio Carpineto and Giovanni Romano<br />

Pages: 741-744<br />

Year: 2007<br />

Abstract: A major problem encountered by text clustering practitioners is the difficulty<br />

of determining a priori which is the optimal text representation and clustering

technique for a given clustering problem. As a step towards building robust document<br />

partitioning systems, we present a strategy based on a hierarchical consensus clustering<br />

architecture that operates on a wide diversity of document representations and<br />

partitions. The conducted experiments show that the proposed method is capable of<br />

yielding a consensus clustering that is comparable to the best individual clustering<br />

available even in the presence of a large number of poor individual labelings, outperforming<br />

classic non-hierarchical consensus approaches in terms of performance and<br />

computational cost.<br />



Chapter 4

Self-refining consensus architectures

As described in chapter 3, our proposal for building clustering systems robust to the inherent<br />

indeterminacies that affect the clustering problem consists of i) creating a cluster ensemble<br />

E composed of a large number of individual partitions generated by the use of as many<br />

diversity factors (e.g. clustering algorithms, object representations, etc.) as possible, and ii)

deriving a unique clustering solution λc upon that cluster ensemble through the application<br />

of a consensus clustering process.<br />

As mentioned earlier, the use of such a large cluster ensemble entails two negative consequences.<br />

The first one refers to the fact that the construction of the consensus clustering<br />

solution can become costly or even unfeasible, as the space and time complexity of consensus<br />

functions scales up linearly or even quadratically with the size of the cluster ensemble.<br />

In order to overcome such difficulty, in chapter 3 we put forward the concept of hierarchical<br />

consensus architectures, which are based on applying a divide-and-conquer approach<br />

to consensus clustering. Moreover, by means of a simple running time estimation methodology,<br />

the user is capable of deciding a priori, with a notable degree of accuracy, which

is the most computationally efficient consensus architecture for solving a given consensus<br />

clustering problem.<br />

The other main downside to the use of large cluster ensembles is the negative bias induced<br />

on the quality of the consensus clustering solution λc by the expected presence

of poor1 individual clusterings in E, caused by the somewhat indiscriminate generation of<br />

cluster ensemble components that our proposal indirectly encourages. In order to overcome<br />

this inconvenience, we propose a simple consensus self-refining process that, in a fully unsupervised<br />

manner, makes it possible to improve the quality of the derived consensus clustering solution

λc. Moreover, an additional benefit derived from this automatic consensus refining procedure<br />

is the uniformization of the quality of the consensus clustering solutions yielded by<br />

distinct consensus architectures, which allows selecting the most appropriate one based on<br />

1 By good quality clustering solutions we refer to those partitions that reflect the true group structure<br />

of the data. Provided that we evaluate our clustering results by means of an external cluster validity index<br />

–normalized mutual information (φ (NMI) ) with respect to the ground truth, i.e. an allegedly correct group<br />

structure of the data–, the highest quality clustering results will be those attaining a φ (NMI) close to 1,<br />

whereas the φ (NMI) values associated with poor quality partitions will tend to zero, as φ (NMI) ∈ [0, 1] by

definition (Strehl and Ghosh, 2002).<br />


computational efficiency criteria alone. While put forward in a hard clustering scenario, this proposal could be exported to a fuzzy context by introducing several minor modifications.

This chapter is organized as follows: section 4.1 describes the proposed self-refining consensus procedure. Next, several experiments regarding the application of the self-refining process on the consensus clustering solutions output by the three types of consensus architectures described in the previous chapter are presented in section 4.2. An alternative procedure based on cluster ensemble component selection for obtaining refined consensus clustering solutions upon a given cluster ensemble is described in section 4.3, and finally, section 4.4 closes the chapter with a discussion and conclusions.

4.1 Description of the consensus self-refining procedure<br />

The proposed approach for refining the quality of the consensus clustering solution λc is<br />

pretty straightforward, and it is based on the notion of average normalized mutual information<br />

φ (ANMI) (Strehl and Ghosh, 2002) between a cluster ensemble E and a consensus<br />

clustering solution λc built upon it, as defined by equation (4.1).<br />

\phi^{(\mathrm{ANMI})}(\mathbf{E}, \lambda_c) \;=\; \frac{1}{l} \sum_{i=1}^{l} \phi^{(\mathrm{NMI})}(\lambda_i, \lambda_c) \qquad (4.1)

where l represents the number of partitions (or components) contained in the cluster ensemble<br />

E and λi is the ith of these components.<br />

The higher φ (ANMI)(E, λc), the more information the consensus clustering solution λc shares with all the clusterings in E, and thus the better it can be considered to capture the information contained in the ensemble. In fact, the computation of the φ (ANMI) between a given cluster ensemble E and a set of consensus clustering solutions obtained by means of different consensus functions is proposed in (Strehl and Ghosh, 2002) as a means for choosing among them in an unsupervised fashion, giving rise to what is called a supraconsensus function.
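A minimal sketch of equation (4.1) is given below. It assumes scikit-learn's NMI with geometric-mean normalization as the φ (NMI) implementation, and `ensemble` and `consensus` are placeholder names for the list of l labelings and the consensus labeling λc.

```python
# Minimal sketch of equation (4.1); names are placeholders.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(ensemble, consensus):
    """Average NMI between the components of a cluster ensemble and a consensus labeling."""
    # Geometric-mean normalization, matching Strehl and Ghosh (2002).
    return float(np.mean([
        normalized_mutual_info_score(component, consensus, average_method="geometric")
        for component in ensemble
    ]))
```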

It is important to notice that each term of the summation in equation (4.1) measures<br />

the resemblance between each cluster ensemble component and λc. As a consequence, those<br />

cluster ensemble components more similar to the consensus clustering solution contribute<br />

in a greater proportion to the sum in equation (4.1).<br />

Assuming that the consensus function F applied for obtaining the consensus clustering solution λc delivers a moderately good performance –in the sense that the quality of λc will be reasonably higher than that of the poorest components of the cluster ensemble E–, then the normalized mutual information φ (NMI) between λc and each cluster ensemble component λi (∀i ∈ [1, ..., l]) gives an approximate measure of the quality of the latter (Fern and Lin, 2008).

Taking this fact into account, the proposed consensus self-refining procedure is based on ranking the l components of the cluster ensemble E according to their φ (NMI) with respect to the consensus clustering solution λc. The result of this sorting process is represented by means of an ordered list O_{φ(NMI)} = {λ_{φ(NMI)1}, λ_{φ(NMI)2}, ..., λ_{φ(NMI)l}}, the subindices of which refer to the aforementioned φ (NMI) based ranking, i.e. λ_{φ(NMI)1} denotes the cluster ensemble component that attains the highest normalized mutual information with respect to λc,




λ_{φ(NMI)2} is the component with the second highest φ (NMI) with respect to the consensus clustering solution, and so on.

Subsequently, the highest ranked p percent of the l individual partitions is selected so as to form a select cluster ensemble Ep –see equation (4.2)–, upon which a refined consensus clustering solution λ_{c_p} will be derived through the application of the consensus function F. Notice that the larger the percentage p, the more components are included in the select cluster ensemble Ep —ultimately, Ep = E if p = 100.

\mathbf{E}_p = \left( \lambda_{\phi^{(\mathrm{NMI})}1},\; \lambda_{\phi^{(\mathrm{NMI})}2},\; \ldots,\; \lambda_{\phi^{(\mathrm{NMI})}\lfloor\frac{p}{100}l\rceil} \right)^{\mathrm{T}} \qquad (4.2)

Following the rationale of the proposed self-refining procedure, it can be assumed that, with a high probability, the worst components of the initial cluster ensemble E will have been excluded from Ep. Therefore, the self-refined consensus clustering solution λ_{c_p} obtained through the application of the consensus function F on Ep will probably improve on the initial consensus labeling λc, as we will experimentally demonstrate in later sections.

Finally, three additional remarks so as to conclude this description. Firstly, notice that the consensus process run on the select cluster ensemble Ep can be conducted following either a flat or a hierarchical approach, depending on the consensus function applied, the characteristics of the data set and the value of p, which, as mentioned above, determines the size of Ep. As reported in the previous chapter, the proposed running time estimation methodologies constitute an efficient means for deciding whether the self-refined consensus solution λ_{c_p} should be derived following either a flat or a hierarchical consensus architecture.

Secondly, notice that the proposed consensus self-refining process is entirely automatic<br />

and unsupervised (hence its name), as it is solely based on the cluster ensemble E, the<br />

consensus clustering solution λc and a similarity measure –φ (NMI) – that requires no external<br />

knowledge for its computation. The only user-driven decision refers to the selection of the<br />

value of the percentage p used for creating the select cluster ensemble Ep.<br />

The third remark deals precisely with this latter issue. Quite obviously, the selection of the percentage p is made blindly. So as to avoid the negative consequences of choosing a suboptimal value of p at random, our consensus self-refining proposal is completed by the (possibly parallelized) creation of multiple refined consensus clustering solutions using P distinct percentage values p = {p1, p2, ..., pP}, i.e. λ_{c_{p_i}} for i = 1, 2, ..., P, selecting as the final refined consensus clustering solution λ_c^{final} the one maximizing φ (ANMI) with respect to the cluster ensemble E, as defined by equation (4.3) —in fact, this unsupervised a posteriori clustering selection process is equivalent to the supraconsensus function proposed in (Strehl and Ghosh, 2002).

$$
\lambda_c^{final} = \operatorname*{arg\,max}_{\lambda}\ \phi^{(ANMI)}(\mathrm{E}, \lambda), \qquad \lambda \in \{\lambda_c, \lambda_{c_{p_1}}, \ldots, \lambda_{c_{p_P}}\} \qquad (4.3)
$$
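The supraconsensus selection of equation (4.3) admits an equally small sketch. This is an illustrative fragment under the same assumptions as the previous one (it reuses the hypothetical phi_nmi helper), with φ(ANMI) computed as the average φ(NMI) of a candidate labeling against every ensemble component.

```python
def phi_anmi(ensemble, labels):
    # phi(ANMI): average phi(NMI) of `labels` against every component of the cluster ensemble.
    return sum(phi_nmi(component, labels) for component in ensemble) / len(ensemble)

def supraconsensus(ensemble, candidates):
    """Select, among candidate labelings, the one maximizing phi(ANMI) w.r.t. the ensemble E."""
    return max(candidates, key=lambda labels: phi_anmi(ensemble, labels))
```

Here, candidates would contain the non-refined solution λc together with the refined solutions λcp1, ..., λcpP.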

For summarization purposes, table 4.1 describes the steps that constitute the proposed consensus self-refining procedure.


1. Given a cluster ensemble E containing l components:

$$
\mathrm{E} = \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_l \end{pmatrix}
$$

and a pre-computed consensus clustering solution λc, compute the φ(NMI) between λc and each of the components of the cluster ensemble, that is:

$$
\phi^{(NMI)}(\lambda_c, \lambda_k), \quad \forall k = 1, \ldots, l
$$

2. Generate an ordered list Oφ(NMI) = {λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)l} of cluster ensemble components ranked in decreasing order according to their φ(NMI) with respect to λc.

3. Create a set of P select cluster ensembles Epi by compiling the first ⌊(pi/100) l⌉ components of Oφ(NMI):

$$
\mathrm{E}_{p_i} = \begin{pmatrix} \lambda_{\phi^{(NMI)}1} \\ \lambda_{\phi^{(NMI)}2} \\ \vdots \\ \lambda_{\phi^{(NMI)}\lfloor\frac{p_i}{100}l\rceil} \end{pmatrix}, \quad p_i \in (0, 100), \ \forall i = 1, \ldots, P
$$

4. Run a (flat or hierarchical) consensus architecture based on a consensus function F on Epi, obtaining a self-refined consensus clustering solution λcpi.

5. Apply the supraconsensus function on the non-refined consensus clustering solution λc and the set of self-refined consensus clustering solutions λcpi, i.e. select as the final consensus solution the one maximizing its φ(ANMI) with respect to the cluster ensemble E.

Table 4.1: Methodology of the consensus self-refining procedure.
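Putting the two previous sketches together, the whole methodology of table 4.1 reduces to a short driver loop. The consensus function F is kept abstract: run_consensus below is a hypothetical placeholder for any of the consensus functions of appendix A.5, not an actual implementation.

```python
def self_refine(ensemble, run_consensus,
                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90)):
    """Consensus-based self-refining (table 4.1), returning lambda_c^final."""
    # Non-refined consensus clustering lambda_c on the full ensemble (assumed pre-computable).
    consensus_labels = run_consensus(ensemble)
    candidates = [consensus_labels]
    for p in percentages:
        # Steps 1-3: rank components by phi(NMI) w.r.t. lambda_c and keep the top p%.
        selected = select_cluster_ensemble(ensemble, consensus_labels, p)
        # Step 4: flat (or hierarchical) consensus on the select cluster ensemble E_p.
        candidates.append(run_consensus(selected))
    # Step 5: unsupervised selection of the final solution via supraconsensus.
    return supraconsensus(ensemble, candidates)
```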

4.2 Flat vs. hierarchical self-refining

In this section, we present several experiments regarding the application of the consensus self-refining procedure described in section 4.1 on the consensus clustering solutions output by the three consensus architectures described in chapter 3. In all cases, our main interest is focused on analyzing the quality of the clusterings obtained by the proposed self-refining procedure, not on evaluating the computational aspects of the self-refining process, as the decision regarding whether it is implemented according to a hierarchical or a flat consensus architecture can be efficiently made using the running time estimation methodologies proposed in chapter 3.

The experiments conducted follow the design described next.



– What do we want to measure?

i) The quality of the self-refined consensus clusterings obtained by the proposed methodology when applied on the consensus clusterings output by the flat and allegedly fastest RHCA and DHCA consensus architectures.

ii) The ability of the proposed self-refining procedure to obtain a consensus clustering of higher quality than that of a) its non-refined counterpart, and b) the highest and median quality cluster ensemble components.

iii) The quality of the self-refined consensus clustering of maximum quality compared to its non-refined counterpart.

iv) The ability of the supraconsensus function to select, in a fully unsupervised manner, the highest quality self-refined consensus clustering among the set of self-refined clusterings generated.

v) Whether self-refining constitutes a means for uniformizing the quality of the consensus clustering solutions yielded by the flat and allegedly fastest RHCA and DHCA consensus architectures, thus making it possible to decide, on computational grounds only, which is the most suitable consensus architecture for a given clustering problem.

– How do we measure it?

i) The quality of the self-refined consensus clusterings is measured in terms of the φ(NMI) with respect to the ground truth of each data collection.

ii) The percentage of experiments in which the proposed self-refining procedure gives rise to at least one self-refined consensus clustering of higher quality than that of a) its non-refined counterpart, and b) the highest and median quality cluster ensemble components.

iii) We measure the relative φ(NMI) percentage difference between the self-refined consensus clustering of maximum quality and its non-refined counterpart (a minimal code sketch of this and the next measurement is given right after this list).

iv) The precision of the supraconsensus function is measured in terms of the percentage of experiments in which it manages to select the highest quality self-refined consensus clustering.

v) We compare the average variance between the φ(NMI) scores of the consensus clusterings λc yielded by the three evaluated consensus architectures (i.e. prior to self-refining) with the variance between the φ(NMI) scores of the consensus clusterings selected by the supraconsensus function (λc^final) after the self-refining procedure is conducted.

– How are the experiments designed? We only analyze the results of the consensus self-refining process executed on the highest diversity scenario (i.e. the one where cluster ensembles are created by applying the |dfA| = 28 clustering algorithms from the CLUTO clustering package). The reason for this is twofold: besides brevity, this limitation on our analysis prevents the results of the self-refining process from being masked by the consensus quality variability observed in lower diversity scenarios —recall that, in those cases, the quality of the consensus clustering solutions shows larger variances, as the cluster ensemble changes from experiment to experiment due to the random selection of |dfA| = {1, 10, 19} clustering algorithms, whereas exactly the same cluster ensemble is employed across the ten experiments in the highest diversity scenario. As in all the experimental sections of this thesis, consensus processes have been replicated using the set of seven consensus functions described in appendix A.5, namely: CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD. Results are averaged across ten independent experiments consisting of ten consensus function runs each. With the objective of analyzing the expected dependence between the degree of refinement of the consensus clustering solution and the percentage p of cluster ensemble components included in the select cluster ensemble Ep, the experiments have been replicated for a set of percentage values in the range p ∈ [2, 90]. Subsequently, the final consensus label vector λc^final is selected among all the available (i.e. non-refined and refined) consensus clustering solutions through the application of the supraconsensus function presented in equation (4.3). Lastly, it is important to state that, although it is possible (and, in fact, advisable) to apply either flat or hierarchical consensus on the select cluster ensemble depending on which is the most computationally efficient option, all self-refining consensus processes in our experiments have been conducted, for simplicity, using a flat consensus architecture.

– How are results presented? Results are presented by means of boxplot charts of the φ(NMI) values corresponding to the consensus self-refining process. In particular, each subfigure depicts –from left to right– the φ(NMI) values of: i) the components of the cluster ensemble E, ii) the non-refined consensus clustering solution (i.e. the one resulting from the application of either a hierarchical or a flat consensus architecture, denoted as λc), and iii) the self-refined consensus labelings λcpi obtained upon select cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}. Moreover, the consensus clustering solution deemed as the optimal one (across a majority of experiment runs) by the supraconsensus function is identified by means of a vertical green dashed line. Finally, the quality comparisons between the self-refined consensus clusterings, the non-refined consensus clusterings and the cluster ensemble components are presented by means of tables showing the average values of the measured magnitudes.

– Which data sets are employed? These experiments span the twelve unimodal data collections employed in this work. For brevity reasons, and following the presentation scheme of the previous chapter, this section only describes in detail the results of the self-refining procedure obtained on the Zoo data set, deferring the portrayal of the results obtained on the remaining data collections to appendix D.1. However, the global evaluation of the self-refining and the supraconsensus processes encompasses the results obtained on the twelve unimodal collections employed in this work.
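As announced in the list above, the following minimal sketch illustrates how two of these measurements can be computed for a single experiment (illustrative names only; the φ(NMI) scores with respect to the ground truth are assumed to be available solely for evaluation purposes).

```python
def relative_gain_pct(refined_scores, non_refined_score):
    # Relative phi(NMI) percentage gain of the best self-refined solution over the non-refined one.
    best = max(refined_scores)
    return 100.0 * (best - non_refined_score) / non_refined_score

def supraconsensus_hit(candidate_scores, selected_index):
    # True when the unsupervised supraconsensus choice coincides with the top quality candidate.
    return candidate_scores[selected_index] == max(candidate_scores)
```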

Figure 4.1 presents the boxplot charts of the φ(NMI) values corresponding to the consensus self-refining process applied on the Zoo data set. Notice that figure 4.1 is organized into three columns of subfigures, each one of which corresponds to one of the three consensus architectures, i.e. flat, RHCA and DHCA.

[Figure 4.1 appears here: seven rows of boxplot panels (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), one column per architecture: (a) flat, (b) RHCA, (c) DHCA; vertical axis: φ(NMI).]
Figure 4.1: φ(NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λcpi output by the flat, RHCA, and DHCA consensus architectures on the Zoo data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

Pretty varied results can be observed in figure 4.1, as regards the performance of both the self-refining process itself and the supraconsensus selection function. For instance, when the consensus clustering solution output by the flat consensus architecture using the CSPA consensus function is subject to self-refining (see the leftmost boxplot on the top row of figure 4.1), we can observe that two of the refined solutions yield clearly higher φ(NMI) values than their non-refined counterpart —in particular, the ones obtained using 30% and 60% of the whole cluster ensemble, i.e. λc30 and λc60. Moreover, notice that supraconsensus selects λc30 as the final consensus clustering solution (that is, it performs correctly).

In other cases, supraconsensus fails to select the highest quality consensus clustering solution. See, for instance, that supraconsensus selects the non-refined consensus clustering solution λc yielded by the flat consensus architecture based on the EAC consensus function as the optimal one, whereas the refined clusterings λc40, λc50 and λc60 attain higher φ(NMI) values (leftmost boxplot chart on the second row of figure 4.1). Furthermore, notice that in a minority of cases, the self-refining procedure introduces little or no improvement, as when it is applied on the consensus solution output by the RHCA using the ALSAD consensus function —central column boxplot on the fifth row of figure 4.1.

Lastly, notice that the boxplot corresponding to the refining of the consensus clustering solution output by the flat consensus architecture using MCLA –leftmost boxplot on the fourth row– only presents the φ(NMI) values corresponding to the cluster ensemble E. This is due to the fact that, for this particular consensus function and diversity scenario, flat consensus is not executable with our computational resources —see appendix A.6. Moreover, as all self-refining consensus processes in our experiments have been conducted using a flat consensus architecture, the self-refined consensus clustering solutions output by RHCA and DHCA are not computed from λc40 onwards due to memory limitations when using the MCLA consensus function. However, recall from chapter 3 that hierarchical consensus architectures would allow the computation of consensus clustering solutions in situations where flat consensus is not executable.

A deeper and more quantitative evaluation of the proposed consensus self-refining procedure requires analyzing two of its facets. Firstly, it is necessary to evaluate the self-refining process in itself, answering questions such as: i) how often does the self-refining process yield a higher quality consensus clustering solution than the non-refined one? ii) to what extent are the top quality self-refined consensus clustering solutions better than their non-refined counterpart? iii) how do the best self-refined consensus clustering solutions compare to the cluster ensemble components? and iv) does the self-refining procedure reduce the differences between the quality of the consensus clustering solutions output by distinct consensus architectures? The answers to these questions are presented in section 4.2.1.

And secondly, given a set of self-refined consensus clustering solutions, a supraconsensus function capable of blindly selecting the highest quality self-refined solution is required. Its performance can be evaluated in terms of i) the percentage of occasions in which the supraconsensus function selects the highest quality consensus solution, and ii) the degree of quality loss due to the supraconsensus selection of suboptimal consensus clustering solutions. These aspects are evaluated in section 4.2.2.

4.2.1 Evaluation of the consensus-based self-refining process

As regards the evaluation of the self-refining process, we have firstly analyzed the percentage of self-refining experiments in which at least one of the self-refined consensus clustering solutions attains a φ(NMI) with respect to the ground truth that is higher than the one achieved by the consensus clustering solution available prior to self-refinement. The results presented in table 4.2, which correspond to an average across all the data sets for each consensus architecture and consensus function, reveal that the proposed self-refining procedure performs pretty successfully, giving rise to at least one self-refined consensus clustering solution that improves the consensus clustering available prior to refining in 83% of the experiments conducted on average.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            90.4    63.6    70      77.25   80      76.9    90
RHCA            85.7    83.2    97      94.2    73.4    76.5    82.4
DHCA            91.1    78.5    98      88.5    89.8    88.1    67.8

Table 4.2: Percentages of self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA        MCLA    ALSAD   KMSAD   SLSAD
flat            16.5    273.3   53.3        14.9    9.8     16.1    433.4
RHCA            10.9    157.1   294779.8    200.8   14.1    15.1    205.6
DHCA            24.5    66.4    152450.9    79.9    38.4    30.9    224.9

Table 4.3: Relative φ(NMI) gain percentage of the top quality self-refined consensus clustering solutions with respect to their non-refined counterparts, averaged across the twelve data collections.

Moreover, we have also computed the relative φ(NMI) percentage gain between the non-refined and the top quality self-refined consensus clustering solution —considering only those experiments where self-refining yields a better clustering solution, i.e. 83% of the total. The results presented in table 4.3, which again correspond to an average across all the data sets for each consensus architecture and consensus function, reveal that the proposed self-refining procedure performs in an overwhelmingly successful manner, giving rise to an average relative percentage φ(NMI) gain of 21386% across all the experiments conducted. This exceptionally large figure is due to the fact that, although seldom, extremely poor quality consensus clustering solutions are available prior to self-refining in some cases. In particular, this situation is found when the HGPA consensus function is employed for refining the consensus clustering solutions yielded by hierarchical consensus architectures on the WDBC and BBC data collections (see, for instance, figure D.10 in appendix D). Despite being exceptional, this situation introduces a large bias in the averaged values of φ(NMI) gains. However, if this kind of artifact is ignored, relative φ(NMI) gains between 10% and 430% are consistently obtained in all cases, which gives an idea of the suitability of the proposed self-refining procedure for bettering consensus clustering solutions.

Besides comparing the top quality self-refined consensus clustering solution with its non-refined counterpart, we have also contrasted its quality with that of the highest and median φ(NMI) components of the cluster ensemble E, referred to as BEC (best ensemble component) and MEC (median ensemble component), respectively. Using the quality of these two components as a reference, we have evaluated i) the percentage of experiments where the maximum φ(NMI) (either refined or non-refined) consensus clustering solution attains a higher quality than that of the BEC and MEC, and ii) the relative percentage φ(NMI) variation between them and the top quality consensus clustering solution. Again, all the results presented correspond to an average across all the experiments conducted on the twelve unimodal data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            8.3     0       0       0       25      23.9    8.3
RHCA            8.3     0       0       16.7    28.3    23.8    4
DHCA            16.7    0       0.1     16.6    16.7    18.2    8.3

Table 4.4: Percentages of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            2.7     –       –       –       3.5     1.1     0.1
RHCA            2.5     –       –       1.7     4.2     1       0.1
DHCA            3.3     –       2.2     1.4     1.4     1.2     0.8

Table 4.5: Relative percentage φ(NMI) gains between the best (non-refined or self-refined) consensus clustering solution and the best cluster ensemble component, averaged across the twelve data collections.

As regards the first issue, table 4.4 presents the percentage of experiments where the highest quality consensus clustering solution (either refined or non-refined) is better than the BEC (i.e. it attains a φ(NMI) that is higher than that of the cluster ensemble component that best describes the group structure of the data in terms of normalized mutual information with respect to the given ground truth). On average, this happens in 10.6% of the conducted experiments, which is a frequency of occurrence 100 times higher than the one obtained when non-refined clustering solutions were considered (see table 3.14 in chapter 3). Again, this result reveals the notable consensus improvement introduced by the proposed self-refining procedure. Moreover, notice the poor results obtained with the EAC and the HGPA consensus functions, which were already reported to be the worst performing ones in chapter 3.

In addition, the relative percentage φ(NMI) gains between the top quality consensus clustering solution and the BEC are presented in table 4.5, attaining a modest average increase of 1.8%. However, recall that this figure was as low as 0.6% when the non-refined consensus clustering solution was considered (see table 3.15 in section 3.4), which indicates that the consensus self-refining procedure again introduces notable quality improvements.

If this comparison is now referred to the median ensemble component, it can be observed that, on average, the best (non-refined or self-refined) consensus clustering solution attains a φ(NMI) that is higher than that of the cluster ensemble component that has the median normalized mutual information with respect to the given ground truth in 67.7% of the experiments conducted (see table 4.6). Recall that this percentage was 53.1% when the consensus clustering solution prior to self-refining was compared to the MEC —see table 3.12 in section 3.4.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            67.7    41.7    62.1    25      75      75      66.7
RHCA            72.3    46.4    69.8    81.9    82      83.3    66.6
DHCA            83.3    33.3    69.2    88.2    83.3    83.1    66.7

Table 4.6: Percentage of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            107.4   33.1    82.4    91.1    113.8   109.1   73.3
RHCA            96.6    24      98.7    109.5   118.4   114.3   70.4
DHCA            113.3   25.6    100.2   108.8   118.9   114.7   87.8

Table 4.7: Relative percentage φ(NMI) gain between the best (non-refined or self-refined) consensus clustering solution and the median cluster ensemble component, averaged across the twelve data collections.

If the degree of improvement of the best (non-refined or self-refined) consensus clustering solutions that attain a higher φ(NMI) than the MEC is measured in terms of relative percentage φ(NMI) increase, a notable 91% relative φ(NMI) gain is obtained on average —see table 4.7 for a detailed view across consensus functions and architectures. Again, the beneficial effect of self-refining becomes evident if this result is compared to the one obtained from the analysis of the non-refined consensus clustering solution as, in that case, the observed relative φ(NMI) gain was 59% (see table 3.13).

Furthermore, we have also measured the ability of the self-refining procedure to uniformize the quality of the consensus clustering solutions output by the distinct consensus architectures. So as to evaluate this issue, we have computed, for each individual experiment, the average variance of the φ(NMI) values of the non-refined consensus solutions yielded by the RHCA, DHCA and flat consensus architectures —the smaller the variance, the more similar the φ(NMI) values. This procedure has been repeated for the top quality (either refined or non-refined) consensus clustering solutions obtained at each experiment. The results of this analysis are presented in table 4.8. Except for the EAC consensus function, we can observe a notable reduction in the variance between the φ(NMI) of the consensus solutions output by the three considered consensus architectures, keeping it below 0.01 in most cases. In global average terms, variance is dramatically reduced by an approximate factor of 20, from 0.105 to 0.0056. For this reason, it can be conjectured that, besides bettering the quality of consensus clustering solutions as already reported, the proposed self-refining procedure also helps make the quality of the self-refined consensus clustering solution more independent from the consensus architecture employed —so that it can be selected following computational criteria solely.
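A minimal sketch of how this uniformity measure can be computed is shown below (illustrative code, not the scripts actually used in the experiments): the variance of the φ(NMI) scores attained by the flat, RHCA and DHCA architectures is computed within each experiment, and the per-experiment variances are then averaged.

```python
import numpy as np

def mean_cross_architecture_variance(scores_per_experiment):
    """scores_per_experiment: one [phi_flat, phi_rhca, phi_dhca] triplet per experiment."""
    # Variance across the three architectures within each experiment, then averaged over experiments.
    return float(np.mean([np.var(scores) for scores in scores_per_experiment]))

# The smaller the returned value, the more architecture-independent the consensus quality is,
# e.g. mean_cross_architecture_variance([[0.62, 0.60, 0.61], [0.55, 0.57, 0.54]]).
```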

As a conclusion, it can be asserted that the proposed consensus self-refining procedure is reasonably successful since, in general terms, it introduces a quality increase that brings self-refined consensus clustering solutions closer to the best individual components available in the cluster ensemble, which would ultimately constitute the goal of robust clustering systems based on consensus clustering.

It is of paramount importance to notice that, in the analysis of all the previous results, we have assumed the use of the top quality self-refined consensus clustering solution. Quite obviously, achieving the encouraging results reported would require a supraconsensus function that, in an automatic manner, detects the best self-refined consensus clustering in any given situation. The next section is devoted to the performance analysis of such a supraconsensus function.

Consensus                             Consensus function
solution                CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
non-refined             0.011   0.011   0.026   0.638   0.009   0.011   0.029
best non/self-refined   0.004   0.019   0.005   0.002   0.002   0.001   0.006

Table 4.8: φ(NMI) variance of the non-refined and the best non/self-refined consensus clustering solutions across the flat, RHCA and DHCA consensus architectures, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            30.4    50      53.1    11      25      23.8    37.5
RHCA            25      35.9    38.4    3.9     24.4    24.1    36.6
DHCA            12.5    42      40.1    17      0       29.5    37.5

Table 4.9: Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution, averaged across the twelve data collections.

4.2.2 Evaluation of the supraconsensus process

As regards the performance of the supraconsensus function proposed in (Strehl and Ghosh, 2002), we have firstly evaluated the percentage of experiments in which the supraconsensus function selects the highest quality consensus clustering solution. Table 4.9 presents the results averaged across all the data collections, for each consensus function and architecture. The average accuracy with which the supraconsensus function selects the top quality self-refined consensus clustering solution is 29%, i.e. it manages to select the best solution in less than a third of the experiments conducted.

This somewhat contradicts the optimistic conclusions of (Strehl and Ghosh, 2002), where φ(ANMI)(E, λc) is presented as a suitable surrogate of φ(NMI)(γ, λc) for selecting the best consensus clustering solutions in real scenarios, where a ground truth γ is not available. Such a conclusion was supported by the fact that both φ(ANMI)(E, λc) and φ(NMI)(γ, λc) follow very similar growth patterns (i.e. the higher φ(NMI)(γ, λc), the higher φ(ANMI)(E, λc)). However, such claims were sustained by experiments using synthetic clustering results. In several of our experiments, in contrast, we have witnessed that this behaviour is not always strictly obeyed.

Just for illustration purposes, we have conducted a toy experiment, in which a set of 300 randomly picked cluster ensemble components corresponding to the Zoo data collection have been evaluated in terms of i) their φ(NMI) with respect to the ground truth, and ii) their φ(ANMI) with respect to the 299 remaining clusterings selected. Figure 4.2 depicts both magnitudes, where the horizontal axis of each subfigure corresponds to an index of the clusterings in the ensemble arranged in decreasing order of φ(NMI) with respect to the ground truth.

[Figure 4.2 appears here: (a) decreasingly ordered φ(NMI) (wrt ground truth) and (b) φ(ANMI) values (wrt the toy cluster ensemble), both plotted against the clustering index.]
Figure 4.2: Decreasingly ordered φ(NMI) (wrt ground truth) values of the 300 clusterings included in the toy cluster ensemble (left), and their corresponding φ(ANMI) values (wrt the toy cluster ensemble) (right).

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            8.8     14.5    17.8    12      6.7     8.9     12.1
RHCA            6.3     26.6    24.7    27.3    8.6     8.4     16.9
DHCA            15.9    27      22.9    19.6    9.4     11      8.7

Table 4.10: Relative percentage φ(NMI) losses due to suboptimal self-refined consensus clustering solution selection by supraconsensus, averaged across the twelve data collections.

Notice how the monotonically decreasing behaviour of φ(NMI) –figure 4.2(a)– is not strictly observed in φ(ANMI) (see figure 4.2(b), where a fifth order fitting red dashed curve is overlaid for comparison). In fact, the clustering attaining the maximum φ(ANMI) is the one with the fiftieth largest φ(NMI). Thus, in practice, φ(ANMI) seems to constitute a means for identifying good clustering solutions, but not the best one. For this reason, requiring φ(ANMI)(E, λc) to select the single self-refined consensus clustering solution of highest quality seems a far too restrictive constraint, which leads to the slightly disappointing results presented in table 4.9.
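The toy experiment itself can be reproduced in a few lines. The sketch below is illustrative and assumes the hypothetical phi_nmi and phi_anmi helpers introduced earlier, a list of label vectors as the toy ensemble and the ground truth labeling; it returns the φ(NMI) rank of the component that maximizes φ(ANMI).

```python
def anmi_argmax_rank(ensemble, ground_truth):
    """Rank (1 = highest phi(NMI)) of the component maximizing phi(ANMI) within the ensemble."""
    nmi_scores = [phi_nmi(ground_truth, component) for component in ensemble]
    # phi(ANMI) of each component with respect to the remaining ones (leave-one-out).
    anmi_scores = [phi_anmi(ensemble[:k] + ensemble[k + 1:], component)
                   for k, component in enumerate(ensemble)]
    best_by_anmi = anmi_scores.index(max(anmi_scores))
    # Position of that component once all components are sorted by decreasing phi(NMI).
    order = sorted(range(len(ensemble)), key=lambda i: nmi_scores[i], reverse=True)
    return order.index(best_by_anmi) + 1  # the text above reports a rank of 50 for the Zoo toy ensemble
```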

In order to evaluate the influence of this apparent lack of precision of the supraconsensus function, we have measured the relative percentage φ(NMI) loss derived from a suboptimal consensus solution selection, using the top quality consensus clustering solution as a reference (i.e. the one that should be selected by an ideal supraconsensus function). The results, which are presented in table 4.10, show that the modest selection accuracy of the supraconsensus function leads to an average relative φ(NMI) loss of 14.9%.

To conclude, it can be asserted that, while the proposed consensus self-refining procedure introduces notable gains as regards the quality of consensus clustering solutions, there is still room for taking full advantage of its performance, as the entirely unsupervised selection of the highest quality consensus solution is not a fully solved problem yet.

4.3 Selection-based self-refining

The consensus self-refining procedure proposed in section 4.1 is based on using a consensus clustering solution as a reference for computing the φ(NMI) of the cluster ensemble components, which in turn guides the creation of the select cluster ensemble Ep, upon which the self-refining process is conducted.

In this section, we propose an alternative procedure for obtaining a self-refined consensus clustering solution. The only difference between this proposal and the one put forward in section 4.1 lies in the fact that the computation of the φ(NMI) of the cluster ensemble components –a step prior to the creation of the select cluster ensemble Ep– is not referred to a previously obtained consensus labeling λc, but to one of the components of the cluster ensemble E.

By doing so, we aim to devise an alternative means for obtaining a high quality clustering from a large cluster ensemble that does not require the execution of any consensus process for obtaining the reference clustering against which the cluster ensemble components are compared in terms of normalized mutual information —with the obvious computational savings this conveys.

Given that this proposed method is based on selecting one of the cluster ensemble components for initiating the consensus self-refining process, we have called it selection-based self-refining, and its constituent steps are presented in table 4.11.

In the next paragraphs, we analyze the performance of this second self-refining proposal, following the same experimental scheme employed in section 4.2. That is, we firstly review the results obtained on the Zoo data collection at a qualitative level (the analysis corresponding to the remaining data collections is presented in appendix D.2), followed by a quantitative study of the quality of the self-refined consensus clustering solutions across all the experiments conducted.

With the objective of making the results of selection-based consensus self-refining comparable to those presented in the previous section, we have followed the same experimental procedure, that is: i) the experiments have been replicated for a set of self-refining percentage values in the range p ∈ [2, 90], and ii) the experiments have been executed on the cluster ensembles corresponding to the highest diversity scenario.

For starters, figure 4.3 depicts the boxplot charts of the φ(NMI) values corresponding to the selection-based consensus self-refining process. Each chart depicts –from left to right– the φ(NMI) values of: i) the components of the cluster ensemble E, ii) the cluster ensemble component with maximum φ(ANMI) with respect to the whole ensemble, i.e. λref, and iii) the self-refined consensus clusterings λcpi obtained upon select cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}.

Firstly, we can notice the high quality of the selected cluster ensemble component λref, the φ(NMI) of which is pretty close to that of the highest quality component of the cluster ensemble E. Thus, it seems that the proposed selection procedure constitutes, by itself, a fairly good approach for obtaining clustering solutions that are robust to the inherent indeterminacies of the clustering problem. When the self-refining procedure is applied on the select cluster ensemble created upon λref, distinct performances are observed. Whereas in some cases none of the self-refined clustering solutions λcpi attains higher φ(NMI) values than λref (see, for instance, figure 4.3(a)), the opposite is observed when self-refining based on the EAC and SLSAD consensus functions is applied on λref —see figures 4.3(b) and 4.3(g), respectively.

1. Given a cluster ensemble E containing l components:

$$
\mathrm{E} = \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_l \end{pmatrix}
$$

compute the φ(ANMI) between each one of them and the cluster ensemble, that is:

$$
\phi^{(ANMI)}(\mathrm{E}, \lambda_k) = \frac{1}{l}\sum_{i=1}^{l}\phi^{(NMI)}(\lambda_i, \lambda_k), \quad \forall k = 1, \ldots, l
$$

2. Select the cluster ensemble component that maximizes its φ(ANMI) with respect to the whole ensemble as the reference for the self-refining process:

$$
\lambda_{ref} = \operatorname*{arg\,max}_{\lambda_k}\ \phi^{(ANMI)}(\mathrm{E}, \lambda_k)
$$

3. Compute the φ(NMI) between λref and each of the components of the cluster ensemble, that is:

$$
\phi^{(NMI)}(\lambda_{ref}, \lambda_k), \quad \forall k = 1, \ldots, l
$$

4. Generate an ordered list Oφ(NMI) = {λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)l} of cluster ensemble components ranked in decreasing order according to their φ(NMI) with respect to λref.

5. Create a set of P select cluster ensembles Epi by compiling the first ⌊(pi/100) l⌉ components of Oφ(NMI):

$$
\mathrm{E}_{p_i} = \begin{pmatrix} \lambda_{\phi^{(NMI)}1} \\ \lambda_{\phi^{(NMI)}2} \\ \vdots \\ \lambda_{\phi^{(NMI)}\lfloor\frac{p_i}{100}l\rceil} \end{pmatrix}, \quad p_i \in (0, 100), \ \forall i = 1, \ldots, P
$$

6. Run a (flat or hierarchical) consensus architecture based on a consensus function F on Epi, obtaining a self-refined consensus clustering solution λcpi.

7. Apply the supraconsensus function on the selected cluster ensemble component λref and the set of self-refined consensus clustering solutions λcpi, i.e. select as the final consensus solution the one maximizing its φ(ANMI) with respect to the cluster ensemble E:

$$
\lambda_c^{final} = \operatorname*{arg\,max}_{\lambda}\ \phi^{(ANMI)}(\mathrm{E}, \lambda), \quad \lambda \in \{\lambda_{ref}, \lambda_{c_{p_1}}, \ldots, \lambda_{c_{p_P}}\}
$$

Table 4.11: Methodology of the cluster ensemble component selection-based consensus self-refining procedure.
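Since the selection-based variant only changes the origin of the reference partition, a minimal sketch of steps 1 to 7 of table 4.11 can reuse the hypothetical helpers introduced in the previous sections (phi_anmi, select_cluster_ensemble and supraconsensus); run_consensus again stands for an arbitrary consensus function F and is not an actual implementation.

```python
def selection_based_self_refine(ensemble, run_consensus,
                                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90)):
    """Selection-based self-refining (table 4.11), returning lambda_c^final."""
    # Steps 1-2: the reference is the component maximizing phi(ANMI); no consensus run is needed.
    reference = max(ensemble, key=lambda labels: phi_anmi(ensemble, labels))
    candidates = [reference]
    for p in percentages:
        # Steps 3-5: rank components by phi(NMI) w.r.t. lambda_ref and keep the top p%.
        selected = select_cluster_ensemble(ensemble, reference, p)
        # Step 6: (flat or hierarchical) consensus on the select cluster ensemble E_p_i.
        candidates.append(run_consensus(selected))
    # Step 7: supraconsensus over lambda_ref and the self-refined solutions.
    return supraconsensus(ensemble, candidates)
```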

As in section 4.2, the consensus clustering solution deemed as the optimal one (across a majority of experiment runs) by the supraconsensus function is identified by means of a vertical green dashed line. As regards its performance, we can observe that it manages to select the highest quality clustering solution in all cases except when self-refining is based on the ALSAD and KMSAD consensus functions.

[Figure 4.3 appears here: φ(NMI) boxplot panels, one per consensus function: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD.]
Figure 4.3: φ(NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λcpi on the Zoo data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

So as to provide the reader with a more comprehensive and quantitative analysis of the performance of the proposed selection-based consensus self-refining procedure, the following sections present a separate study of the results yielded by the self-refining procedure itself and by the supraconsensus function that, a posteriori, must select the best self-refined consensus clustering solution.

4.3.1 Evaluation of the selection-based self-refining process

As far as the evaluation of the selection-based consensus self-refining procedure is concerned, four analyses have been conducted. For starters, we have measured the percentage of experiments in which the self-refining procedure yields a better quality clustering than the cluster ensemble component selected as a reference (i.e. the one maximizing its φ(ANMI) with respect to the cluster ensemble E, referred to as λref). The results, averaged across all the unimodal data collections employed in this work, are presented in table 4.12 as a function of the consensus function employed. On average, the self-refining procedure, when conducted on select cluster ensembles created upon the selection of λref, yields better clustering solutions in 56% of the conducted experiments.

                   Consensus function
CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
54.5    28.2    10.3    69.6    81.8    82      65.4

Table 4.12: Percentage of self-refining experiments in which one of the self-refined consensus clustering solutions is better than the selected cluster ensemble component reference λref, averaged across the twelve data collections.

This figure is notably lower than the one obtained when the select cluster ensemble creation uses a previously derived consensus clustering solution (it was 83%). This is due to the fact that the cluster ensemble component selection usually results in using a reference clustering λref of higher quality than the consensus clustering solution λc.

                   Consensus function
CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
26.9    9.1     20.8    15.2    15.5    11.4    7.8

Table 4.13: Relative φ(NMI) gain percentage of the top quality self-refined consensus clustering solutions with respect to the maximum φ(ANMI) cluster ensemble component, averaged across the twelve data collections.

Secondly, in those experiments where self-refined consensus solutions are better than λref, we have measured the relative degree of improvement achieved (quantified in terms of relative percentage φ(NMI) increase). The results, presented in table 4.13, show notable quality improvements, averaging a 15.2% relative φ(NMI) gain across all data sets and consensus functions. These quality gains are nonetheless much smaller than those obtained in the self-refining experiments based on a previously derived consensus clustering solution (see section 4.2), again due to the superior quality of the reference clustering the self-refining procedure is based upon.

Next, the maximum and median φ(NMI) components of the cluster ensemble E –referred to as BEC (best ensemble component) and MEC (median ensemble component), respectively– are compared to either the top quality self-refined consensus clustering solution or λref (depending on which has the largest φ(NMI) with respect to the ground truth). As in the previous section, we have evaluated i) the percentage of experiments where the maximum φ(NMI) consensus clustering solution attains a higher quality than that of the BEC and MEC, and ii) the relative percentage φ(NMI) variation between them and the top quality consensus clustering solution. Once more, all the results presented correspond to an average across all the experiments conducted on the twelve unimodal data collections.

On the one hand, table 4.14 presents the aforementioned magnitudes referred to the best cluster ensemble component. On average, the highest quality clustering (either λref or one of the self-refined consensus solutions) is better than the BEC in 14.1% of the conducted experiments, achieving an average relative percentage φ(NMI) gain of 1.6%. It is important to notice that these results are pretty similar to those obtained when self-refining is based on a previously derived consensus clustering solution (see section 4.2), as these percentages were equal to 10.6% and 1.8%, respectively.

                              Consensus function
                        CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
% of experiments        9.1     9.1     0       16.7    36.4    27.6    0
relative % φ(NMI) gain  2.1     0.2     –       2.6     1.8     1.3     –

Table 4.14: Percentage of experiments where either the top quality self-refined consensus clustering solution or λref improves upon the best cluster ensemble component, and relative φ(NMI) gain percentage with respect to it, averaged across the twelve data collections.

On the other hand, table 4.15 presents the results of the same experiment, but referred to the median ensemble component (or MEC). In this case, the selection and self-refining procedure yields clusterings better than the MEC in 98% of the cases, attaining an average relative φ(NMI) gain of 112.7%. These figures indicate that the selection-based consensus self-refining yields better results –when compared to the MEC– than its consensus-based counterpart, where the two aforementioned figures were 67.7% and 91%, respectively.

                              Consensus function
                        CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
% of experiments        100     95.4    95      100     100     100     95.4
relative % φ(NMI) gain  118.7   100.7   118.3   114.9   116.1   112.5   107.4

Table 4.15: Percentage of experiments where either the top quality self-refined consensus clustering solution or λref improves upon the median cluster ensemble component, and relative φ(NMI) gain percentage with respect to it, averaged across the twelve data collections.

To summarize this analysis of the selection-based consensus self-refining proposal, we can conclude that, firstly, it constitutes a fairly good approach as far as obtaining a high quality partition of the data is concerned. When compared to the consensus-based self-refining procedure put forward in section 4.1, it can be observed that, while the relative quality gains introduced by the self-refining process itself are smaller in selection-based consensus self-refining, the top quality clustering results obtained are superior to those yielded by consensus-based self-refining. We believe that both phenomena are due to the differences in the quality of the clustering solution that constitutes the starting point of the self-refining process —in the case of consensus-based self-refining, this reference is a previously derived consensus clustering λc, which typically is a poorer data partition than the maximum φ(ANMI) cluster ensemble component λref (see the figures in appendices D.1 and D.2 for a quick visual comparison). This fact makes selection-based self-refining an even more attractive alternative, all the more so since no previous consensus process execution is required —with the obvious computational savings this implies.

4.3.2 Evaluation of the supraconsensus process

As regards the performance of the supraconsensus function proposed in (Strehl and Ghosh, 2002), we have conducted a twofold evaluation. On the one hand, we have analyzed the percentage of experiments in which the highest quality clustering solution and the one selected via supraconsensus coincide. On the other hand, we have measured the relative percentage φ(NMI) loss derived from a suboptimal consensus solution selection, using the top quality clustering solution as a reference (i.e. the one that should be selected by an ideal supraconsensus function). The results of these two experiments are presented in table 4.16, averaged across all the data collections and for each consensus function.

                              Consensus function
                        CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
% of experiments        37.5    61.2    84.6    30.6    12.5    23      60
relative % φ(NMI) loss  24.6    10.4    16.8    12.2    10.9    12.7    11.9

Table 4.16: Percentage of experiments in which the supraconsensus function selects the top quality clustering solution, and relative percentage φ(NMI) losses between the top quality clustering solution and the one selected by supraconsensus, averaged across the twelve data collections.

The average accuracy with which the supraconsensus function selects the top quality self-refined consensus clustering solution is 44.2%, i.e. it manages to select the best solution in less than half of the experiments conducted. Moreover, this apparent lack of precision entails an average relative φ(NMI) reduction of 14.2%.

These results reinforce the idea that the supraconsensus function proposed in (Strehl and Ghosh, 2002) is still far from constituting the most appropriate means for selecting, in a completely unsupervised manner, the best consensus clustering solution among a set of them, especially if they have fairly similar qualities. This is the reason why the average selection accuracy attained by the supraconsensus function in the selection-based self-refining scenario is higher than in the consensus-based context (44.2% vs. 29%): the φ(NMI) differences between the top quality clustering solution and the remaining ones are notably larger in the former case than in the latter —in selection-based self-refining, the selected cluster ensemble component λref is often of notably higher quality than the self-refined consensus clustering solutions λcpi (see appendix D.2).

In contrast, similar results are obtained in both the selection-based and consensus-based self-refining scenarios when the efficiency of the supraconsensus function is measured in terms of the φ(NMI) loss caused by erroneous selections (i.e. when a clustering solution other than the highest quality one is selected by the supraconsensus function). In selection-based self-refining, this relative percentage φ(NMI) loss is 14.2%, compared to 14.9% in the consensus-based self-refining context.

4.4 Discussion

In this chapter, we have put forward two proposals oriented to obtaining a high quality clustering solution given a cluster ensemble and a similarity measure between partitions, using consensus clustering and following a fully unsupervised procedure. Together with the computationally efficient consensus architectures presented in chapter 3, these proposals constitute the basis for constructing robust consensus clustering systems.

Our proposals are based on applying consensus clustering on a set of clusterings –compiled in a select cluster ensemble– which are chosen from the cluster ensemble according to their similarity with respect to an initially available clustering solution. By doing so, we have experimentally shown that a refined consensus clustering solution of higher quality than the original one is very likely to be obtained.

The main difference between our two proposals lies in the origin of the clustering employed as a reference for creating the select cluster ensemble. In the first proposal, referred to as consensus-based self-refining, this initial clustering is the consensus clustering solution λc resulting from a previous consensus process run on the whole cluster ensemble. In our second proposal, the starting point of the refining process is one of the components of the cluster ensemble, which is selected using an average normalized mutual information criterion —giving rise to what we call selection-based self-refining.

Unfortunately, the optimal configuration of this self-refining procedure (e.g. the size of the select cluster ensemble, or the consensus function employed for creating the refined clustering solutions) is local to each particular experiment. This drawback, which is by no means new in the consensus clustering literature, can be tackled by means of a supraconsensus function that, in a blind manner, selects the highest quality clustering solution among several candidates created using distinct self-refining configurations. However, the application of one of the most widely used supraconsensus functions (the one proposed in (Strehl and Ghosh, 2002)) in our experiments has yielded somewhat disappointing results, as it is only capable of selecting the highest quality clustering solution in a relatively low percentage of the experiments conducted. Moreover, alternative supraconsensus functions based on average normalized mutual information gave rise to even poorer selection accuracies (not reported here due to the limited interest of the results obtained), which suggests that further research is necessary in order to devise novel supraconsensus functions capable of satisfying such a restrictive constraint as the one imposed here, i.e. selecting, among a set of clustering solutions, the top quality one in a fully unsupervised fashion.

As mentioned above, the concepts of consensus self-refining and supraconsensus functions are closely related. In fact, supraconsensus is originally presented in (Strehl and Ghosh, 2002) as a means for selecting the best consensus clustering solution among several candidates created using different consensus functions. Therefore, it seems logical to consider the application of supraconsensus not on a set of previously derived consensus clustering solutions, but on the cluster ensemble components themselves, so as to select the highest quality ones.
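For reference, the following minimal sketch (ours, under the assumption that the selection criterion is the average normalized mutual information with the ensemble, in the spirit of Strehl and Ghosh (2002)) illustrates this kind of blind selection; the nmi() helper is taken from scikit-learn and the variable names are illustrative only.

from sklearn.metrics import normalized_mutual_info_score as nmi

def anmi(candidate, ensemble):
    """Average NMI between a candidate labeling and all cluster ensemble components."""
    return sum(nmi(candidate, component) for component in ensemble) / len(ensemble)

def supraconsensus(candidates, ensemble):
    """Blindly pick the candidate clustering that maximizes its ANMI with the ensemble."""
    return max(candidates, key=lambda labels: anmi(labels, ensemble))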

Some very recent works have dealt with this issue, such as (Gionis, Mannila, and Tsaparas, 2007), where the BESTCLUSTERING algorithm is defined as a means for identifying the individual partition that minimizes the number of disagreements with respect to the remaining components of the cluster ensemble. Nevertheless, no subsequent consensus clustering based refinement process is applied on this presumably high quality cluster ensemble component, a process which, as we have experimentally proved, may bring about important quality gains.

More recently, the use of clustering solution refinement procedures based on consensus clustering has been studied in (Fern and Lin, 2008), concurrently with the completion of this thesis. That work and ours have multiple points in common, such as i) the primary purpose of avoiding the negative influence of poor clusterings contained in large cluster ensembles on the quality of the consensus clustering solutions built upon them, ii) the use of one of the components of the cluster ensemble as the reference partition for creating the reduced select cluster ensemble, as we propose in selection-based self-refining, iii) the analysis of the quality of several refined consensus clustering solutions generated upon multiple select cluster ensembles of distinct sizes, and iv) the use of normalized mutual information as the guiding principle for comparing clusterings.

However, there also exist several differences between both works, as in (Fern and Lin,<br />

2008) i) refining is not presented as a means for bettering the quality of a previously<br />

derived consensus clustering solution, but as a means for obtaining a good quality one<br />

upon a select cluster ensemble resulting from discarding those poor components that may<br />

induce a quality loss in it, ii) the criteria employed for choosing the components included<br />

in the select cluster ensemble consider both clustering quality and diversity, iii) clustering<br />

refinement results obtained by a single consensus function (CSPA) are reported, and iv) no<br />

supraconsensus methodology for selecting the best refined consensus clustering is studied.<br />

To conclude, we would like to highlight again the significant quality improvements that can be obtained by means of self-refining consensus procedures. However, it is also important to be aware that we cannot take full advantage of these gains unless a well performing supraconsensus methodology allows us to select the top quality self-refined clustering solution with a high level of confidence. For this reason, in our opinion, devising such a technique is a matter of paramount importance as regards the further research to be conducted in this particular field.

In the future, it would be interesting to analyze how consensus self-refining procedures perform if the cluster ensemble selection process were based on clustering similarity measures other than normalized mutual information. Furthermore, we also intend to study the possibility of creating the select cluster ensemble by including in it all those clusterings exceeding a certain φ(ANMI) threshold with respect to the reference clustering, instead of selecting a fixed percentage p of the cluster ensemble components.
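To make the contrast between the two selection strategies concrete, the following sketch (ours; since a single reference clustering is used, the similarity to it reduces to plain NMI) shows a percentage-based and a threshold-based construction of the select cluster ensemble. Function and variable names are illustrative, not the thesis implementation.

from sklearn.metrics import normalized_mutual_info_score as nmi

def select_by_percentage(ensemble, reference, p):
    """Keep the p% of ensemble components most similar to the reference clustering."""
    ranked = sorted(ensemble, key=lambda labels: nmi(labels, reference), reverse=True)
    keep = max(1, round(len(ranked) * p / 100))
    return ranked[:keep]

def select_by_threshold(ensemble, reference, threshold):
    """Alternative: keep every component whose similarity to the reference exceeds a threshold."""
    return [labels for labels in ensemble if nmi(labels, reference) > threshold]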

4.5 Related publications<br />

As mentioned earlier, the aim of the proposed self-refining consensus clustering approach is to obtain partitions which are robust to the indeterminacies inherent to the clustering problem. This has been the main driving force of our research since its early days, and it has been reflected in several publications in international conferences and national journals. The application focus of these works was document clustering, so they were mainly published in Information Retrieval and Natural Language Processing forums. In none of these works, however, are self-refining procedures included as a means for obtaining improved quality clustering results, so our proposals in this specific area remain, for the moment, unpublished.

The first publication regarding robust document clustering based on cluster ensembles was presented as a poster at the SIGIR 2006 conference held in Seattle. The details of this publication follow.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />

Title: Feature Diversity in Cluster Ensembles for Robust Document Clustering<br />

In: Proceedings of the 29th ACM SIGIR Conference<br />

Pages: 697-698<br />


Year: 2006<br />

Abstract: The performance of document clustering systems depends on employing<br />

optimal text representations, which are not only difficult to determine beforehand,<br />

but also may vary from one clustering problem to another. As a first step towards<br />

building robust document clusterers, a strategy based on feature diversity and cluster<br />

ensembles is presented in this work. Experiments conducted on a binary clustering<br />

problem show that our method is robust to near-optimal model order selection and<br />

able to detect constructive interactions between different document representations in<br />

the test bed.<br />

A subsequent extension of this work was published in the Journal of the Spanish Society for Natural Language Processing (Procesamiento del Lenguaje Natural) after its presentation at the SEPLN 2006 conference held in Zaragoza.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />

Title: Robust Document Clustering by Exploiting Feature Diversity in Cluster Ensembles<br />

In: Journal of the Spanish Society for Natural Language Processing (Procesamiento del Lenguaje Natural)

Volume: 37

Pages: 169-176

Year: 2006<br />

Abstract: The performance of document clustering systems is conditioned by the use<br />

of optimal text representations, which are not only difficult to determine beforehand,<br />

but also may vary from one clustering problem to another. This work presents an<br />

approach based on feature diversity and cluster ensembles as a first step towards<br />

building document clustering systems that behave robustly across different clustering<br />

problems. Experiments conducted on three binary clustering problems of increasing<br />

difficulty show that the proposed method is i) robust to near-optimal model order<br />

selection, and ii) able to detect constructive interactions between different document<br />

representations, thus being capable of yielding consensus clusterings superior to any<br />

of the individual clusterings available.<br />

Last, a global analysis regarding clustering indeterminacies and how they can be overcome via cluster ensembles was presented at the ICA 2007 conference as an oral presentation.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró

Title: Text Clustering on Latent Thematic Spaces: Variants, Strengths and Weaknesses

In: Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation



Publisher: Springer

Series: Lecture Notes in Computer Science

Volume: 4666

Editors: Mike E. Davies, Christopher J. James, Samer A. Abdallah and Mark D. Plumbley

Pages: 794-801<br />

Year: 2007<br />

Abstract: Deriving a thematically meaningful partition of an unlabeled document<br />

corpus is a challenging task. In this context, the use of document representations<br />

based on latent thematic generative models can lead to improved clustering. However,<br />

determining a priori the optimal document indexing technique is not straightforward,

as it depends on the clustering problem faced and the partitioning strategy adopted.<br />

So as to overcome this indeterminacy, we propose deriving a single consensus labeling<br />

upon the results of clustering processes executed on several document representations.<br />

Experiments conducted on subsets of two standard text corpora evaluate distinct<br />

clustering strategies based on latent thematic spaces and highlight the usefulness<br />

of consensus clustering to overcome the indeterminacy regarding optimal document<br />

indexing.<br />



Chapter 5

Multimedia clustering based on cluster ensembles

As already outlined in section 1.3, multimodality is an increasingly noticeable trend as<br />

far as the nature of data is concerned. Given the growing ubiquity of multimedia data,<br />

it seems logical to consider that the derivation of robust clustering strategies as a means<br />

for organizing the increasingly larger multimodal repositories available is already a field of<br />

interest in itself.<br />

However, it is important to take into account that the direct application of classic clustering algorithms for partitioning multimedia data collections may turn out to be suboptimal. The reason for this is twofold. Firstly, the usual indeterminacies that condition the performance of clustering algorithms are multiplied due to the existence of several data modalities. This means that the user must not only decide which object representation or clustering algorithm is expected to yield the best partition of the data (i.e. the one best describing its natural group structure), but must also decide on which of the m data modalities clustering is to be conducted, as classic clustering algorithms are designed to operate on unimodal data.

And secondly, notice that clustering multimedia data using a single modality entails ignoring the presumable positive synergies that may exist between the different modalities, which could be of interest for deriving a better partition of the data. The only way a classic clustering algorithm can take advantage of the possible benefits of multimodality consists in creating multimodal representations of the objects, conducting an early fusion of the features corresponding to distinct modalities prior to clustering. That is, clustering is conducted on a single, artificially generated multimodal representation of the objects created by the combination of the m original modalities. However, feature fusion may or may not be beneficial as regards the quality of the clustering results, as reported in appendix B.2, which turns the application of this strategy into a further indeterminacy to deal with.

For these reasons, in this chapter we propose applying consensus clustering as a means<br />

for clustering multimedia data robustly, as it provides a natural way for combining i) the<br />

results of clustering processes run on each one of the m distinct modalities —thus conducting<br />

a late fusion of modalities, and ii) the partitions obtained upon the multimodal data<br />

representation derived by the early fusion of the features of the m modalities. By doing so,<br />


Figure 5.1: Block diagram of the proposed multimodal consensus clustering system. The diagram comprises the multimodal cluster ensemble generation stage, in which each modality X1, ..., Xm of the multimedia data set X, plus the representation obtained by feature fusion, undergoes object representation and clustering so as to build the cluster ensemble E; this is followed by a flat or hierarchical consensus architecture and a consensus self-refining stage that outputs the final clustering λc^final.

we can take advantage of both modality fusion approaches, which can be of help to reveal the group structure of the data.

The proposed multimodal consensus clustering approach follows the schematic block diagram of figure 5.1 and consists of the following steps: the generation of the multimodal cluster ensemble E, followed by the application of a computationally efficient consensus architecture and a consensus-based self-refining procedure, which yields the final partition of the multimodal data collection subject to clustering, λc^final.

In this chapter, the phases that constitute the multimodal consensus clustering process are described and contextualized in the framework of the experiments conducted in this work. First, section 5.1 presents the strategies followed in the creation of the multimodal cluster ensemble. Next, section 5.2 describes the particularities of the consensus architecture and the self-refining procedure that give rise to the multimodal data partition. Last, the results of the multimedia consensus clustering experiments run in this thesis are presented in section 5.3, and with the conclusions discussed in section 5.4 the present chapter comes to an end.

5.1 Generation of multimodal cluster ensembles<br />

The key point for conducting a multimodal consensus clustering process lies in the creation<br />

of a cluster ensemble that contains both clusterings derived on each data modality separately<br />

and on fused modalities. In this section, we describe a general procedure for creating a<br />

multimodal cluster ensemble E upon a multimedia data collection.<br />

Without loss of generality, let us assume that the multimodal data collection subject to<br />

clustering contains n objects represented by numeric attributes. Thus, the whole data set<br />

can be formally represented by means of a d × n real-valued matrix X, where each object<br />

is represented by means of a d-dimensional column vector xi, ∀i ∈ [1,n].<br />


Moreover, suppose that our multimedia data collection is composed of m modalities. That is, each of the n objects is simultaneously represented by m disjoint sets of real-valued attributes (each of which corresponds to one of the m modalities) of sizes d1, d2, ..., dm, so that d1 + d2 + ... + dm = d. Thus, the multimodal data set matrix X can be decomposed into m submatrices X1, X2, ..., Xm. Each of these matrices Xi (of size di × n, ∀i ∈ [1, m]) represents all the objects in the data set according to one of the m modalities it contains (see figure 5.1).
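As a minimal illustration (ours, not part of the original text), the following NumPy sketch shows how such a multimodal data matrix can be split into its per-modality submatrices; the dimensions used are arbitrary toy values.

import numpy as np

# Toy dimensions: n objects described by m = 2 modalities of sizes d1 and d2 (d = d1 + d2).
n, d1, d2 = 100, 40, 25
X = np.random.rand(d1 + d2, n)          # full d x n multimodal data matrix

modality_sizes = [d1, d2]
offsets = np.cumsum([0] + modality_sizes)
# X_i is the d_i x n submatrix holding the representation of all objects in modality i.
submatrices = [X[offsets[i]:offsets[i + 1], :] for i in range(len(modality_sizes))]

assert all(Xi.shape == (di, n) for Xi, di in zip(submatrices, modality_sizes))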

Given this scenario, a first subset of the clusterings that will constitute the multimodal cluster ensemble E is generated¹ through the application, upon each submatrix Xi (∀i ∈ [1, m]), of f mutually crossed diversity factors dfj, ∀j ∈ [1, f]. If the same set of diversity factors is applied on the m modalities, the number of clusterings generated in this first subset is equal to:

l_1mod = m |df1| |df2| ... |dff|     (5.1)

where the |·| operator denotes the cardinality of a set.

¹ As these clusterings are created by running multiple clustering processes separately on each modality, we refer to them using the symbol 1_mod, which stands for "one modality".

Secondly, another subset of clusterings is created by the application of a set of diversity factors (not necessarily equal to the previous one) upon a fused multimodal representation of the data set. This representation can be generated by means of any early feature fusion process, such as the application of a projection-based object representation technique on the d-dimensional vectors resulting from the concatenation of the features corresponding to the m modalities (La Cascia, Sethi, and Sclaroff, 1998; Zhao and Grosky, 2002; Benitez and Chang, 2002; Snoek, Worring, and Smeulders, 2005; Gunes and Piccardi, 2005). This second subset of clusterings will be referred to using the symbol m_mod, as they are obtained upon an object representation that combines the m modalities into a single one.

Assuming, for simplicity, that the same set of diversity factors is employed for creating the subsets of unimodal and multimodal clusterings, the number of multimodal partitions created becomes:

l_mmod = |df1| |df2| ... |dff|     (5.2)

Finally, the mere compilation of the unimodal and multimodal partitions constitutes the multimedia cluster ensemble E, the size of which is equal to:

l = l_1mod + l_mmod = (m + 1) |df1| |df2| ... |dff|     (5.3)
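To make the generation procedure concrete, the following Python sketch (illustrative only; cluster_algorithm and the diversity factor values are placeholders, not the actual CLUTO calls used in this thesis) assembles a multimodal cluster ensemble by clustering every modality, plus the early-fusion representation, under every combination of diversity factor values, so that the resulting ensemble size follows equation (5.3).

import itertools

def build_multimodal_ensemble(submatrices, fused_matrix, diversity_factors, cluster_algorithm):
    """Cluster each modality and the fused representation under every combination
    of diversity factor values, yielding (m + 1) * prod(|df_j|) partitions."""
    ensemble = []
    views = list(submatrices) + [fused_matrix]          # m unimodal views + 1 fused view
    for view in views:
        for combo in itertools.product(*diversity_factors.values()):
            config = dict(zip(diversity_factors.keys(), combo))
            ensemble.append(cluster_algorithm(view, **config))
    return ensemble

# Example: two diversity factors of cardinalities 4 and 10 on m = 2 modalities
# would yield an ensemble of (2 + 1) * 4 * 10 = 120 partitions.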

As regards the creation of the multimodal cluster ensembles E on the four multimedia data collections employed in this work (see appendix A.2.2 for a description), three diversity factors have been applied: clustering algorithms (dfA), object representations (dfR) and object representation dimensionalities (dfD). In the following paragraphs, a detailed description of these diversity factors and their role in the cluster ensemble creation process is presented.

Starting with the original object features, which constitute the baseline representation, additional object representations are derived by means of feature extraction based on Principal Component Analysis, Independent Component Analysis, Random Projection and Non-negative Matrix Factorization (this last representation can only be applied when the original object features are non-negative). Thus, the total number of object representations is either |dfR| = 4 (for the CAL500 and IsoLetters collections) or |dfR| = 5 (for the InternetAds and Corel data sets). It is important to notice that these feature extraction based object representations are derived for each of the m = 2 modalities these data sets contain, plus for the multimodal baseline representation created by concatenating the features of both modalities.

For each feature extraction based object representation and modality, a set of distinct representations is created by conducting a sweep of dimensionalities, which constitutes the second diversity factor, dfD. Quite obviously, its cardinality depends on the data set and modality. The range and cardinality of dfD per modality corresponding to each of the four multimedia data sets employed in the experimental section of this chapter are presented in table 5.1.

Data set       Modality               dfD range      |dfD|
CAL500         audio                  [30, 120]      10
               text                   [30, 70]       5
               audio + text           [30, 200]      18
IsoLetters     speech                 [100, 600]     11
               image                  [3, 16]        14
               speech + image         [100, 600]     11
InternetAds    object                 [60, 120]      7
               collateral             [100, 1000]    19
               object + collateral    [100, 1000]    19
Corel          image                  [50, 350]      7
               text                   [100, 450]     8
               image + text           [100, 800]     15

Table 5.1: Range and cardinality of the dimensional diversity factor dfD per modality for each one of the four multimedia data sets.

Finally, the clusterings that make up the multimodal cluster ensembles E are created by running |dfA| = 28 clustering algorithms from the CLUTO clustering package (see appendix A.1) on each distinct object representation. As a result, a total of l = 2856, 3108, 5124 and 3444 partitions are obtained for the CAL500, IsoLetters, InternetAds and Corel multimodal data collections, respectively. Notice that, in our case, the diversity factors are not mutually crossed, as the baseline object representations lack dimensional diversity. Therefore, the generic expressions of equations (5.1) to (5.3) do not apply in our case.
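As a quick sanity check (our own reconstruction, not part of the original text), the reported ensemble sizes are consistent with counting, for each of the three views (the two modalities plus their fusion), one baseline representation without a dimensionality sweep plus the feature extraction representations swept over the |dfD| values of table 5.1, all multiplied by the |dfA| = 28 clustering algorithms:

# Per data set: (number of feature-extraction representations, i.e. |dfR| - 1,
#                and the |dfD| cardinalities of the three views, taken from table 5.1)
datasets = {
    "CAL500":      (3, [10, 5, 18]),
    "IsoLetters":  (3, [11, 14, 11]),
    "InternetAds": (4, [7, 19, 19]),
    "Corel":       (4, [7, 8, 15]),
}

n_algorithms = 28  # |dfA|
for name, (n_extracted_reps, dims_per_view) in datasets.items():
    # Each view contributes 1 baseline partition per algorithm plus one partition
    # per (extracted representation, dimensionality) pair per algorithm.
    representations = sum(1 + n_extracted_reps * d for d in dims_per_view)
    print(name, representations * n_algorithms)
# Expected output: CAL500 2856, IsoLetters 3108, InternetAds 5124, Corel 3444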

5.2 Self-refining multimodal consensus architecture<br />

Once the multimodal cluster ensemble E is built, the next step consists in deriving a consensus<br />

clustering solution λc upon it. Recall that, according to the conclusions drawn in<br />

chapter 3, it may be more computationally efficient to tackle this task by means of a flat or<br />

a hierarchical consensus architecture depending on the size of the cluster ensemble.<br />


As far as this latter issue is concerned, it is important to notice that, if the cluster ensemble generation process proposed in section 5.1 is followed, multimodality induces an important increase in the cluster ensemble size. Indeed, if a set of f mutually crossed diversity factors of cardinalities |df1|, |df2|, ..., |dff| is applied on a particular unimodal data collection, we obtain a cluster ensemble of size:

l_unimodal = |df1| |df2| ... |dff|     (5.4)

However, if the same data collection were multimodal and contained m modalities, the application of the previously presented multimedia cluster ensemble creation procedure (using exactly the same diversity factors applied on the unimodal version of the data set) would yield an ensemble of size:

l_multimodal = (m + 1) |df1| |df2| ... |dff|     (5.5)

That is, multimodality increases the size of cluster ensembles by a factor of (m + 1); i.e. even in the minimally multimodal case, m = 2, the cluster ensembles obtained are three times larger than those that would be created in a unimodal scenario. For this reason, and allowing for the conclusions of chapter 3, it seems that hierarchical consensus architectures are likely to be the most computationally efficient implementation alternative for deriving consensus clustering solutions upon multimodal cluster ensembles. In any case, the running time estimation process proposed in chapter 3 constitutes a valid tool for selecting a priori, and with a high degree of precision, the computationally optimal consensus architecture for solving a specific consensus clustering problem.

Regardless of the consensus architecture employed, it will output a consensus clustering solution λc. The next and final step consists in applying the consensus self-refining procedure proposed in section 4.1, so as to obtain a presumably higher quality refined consensus clustering solution λc^final. Quite obviously, in a multimedia clustering scenario, the self-refining process is based on selecting a subset of the components of the multimodal cluster ensemble E for creating a select cluster ensemble. Otherwise, it follows exactly the steps presented in table 4.1.

In this work, a three-stage deterministic hierarchical consensus architecture (or DHCA for short) has been applied for deriving the consensus clustering solution λc upon our multimodal cluster ensembles. This is due to the fact that we have deemed multimodality a diversity factor (denoted as dfM) in itself. Moreover, in contrast to the procedure followed in the previous chapters, the multimodal cluster ensemble E considered in each individual experiment conducted in this chapter only contains clusterings created by a single clustering algorithm, although, as mentioned earlier, |dfA| = 28 of them have been employed. In other words, for each data collection, experiments on |dfA| = 28 different cluster ensembles have been conducted. The components of each of these ensembles only differ in the representational (dfR), dimensional (dfD) and multimodal (dfM) diversity factors, while having been created by means of a single clustering algorithm.

According to the conclusions drawn in section 3.3, the DHCA variant that minimizes the<br />

number of executed consensus processes and the running time of its serial implementation<br />

is the one in which consensus processes are sequentially run on the distinct diversity factors<br />

that make up the cluster ensemble E arranged in decreasing cardinality order.<br />


Therefore, it is necessary to determine the cardinality of the three diversity factors so as to devise the computationally optimal DHCA topology. As described in section 5.1, the cardinality of the representational diversity factor is either |dfR| = 4 or |dfR| = 5, depending on the data set. The inspection of table 5.1 reveals that the dimensional diversity factor adopts a wide range of cardinalities, but all of them fall in the [5, 19] interval. Last, the newly introduced modality diversity factor entails not only the m = 2 original data modalities, but also the multimodal one resulting from the feature-level fusion of the former; thus, its cardinality is equal to |dfM| = 3. For this reason, the specific DHCA variant implemented is referred to as DRM (as |dfD| > |dfR| > |dfM|), and consensus will be sequentially conducted across dimensionalities, representations and modalities at each of its three stages.

For illustration purposes, figure 5.2 depicts a toy DHCA DRM variant applied on a 27-component multimodal cluster ensemble created using dimension, representation and modality diversity factors, all of cardinality equal to 3. In its first stage, consensus processes are conducted across dimensionalities, thus yielding a first set of intermediate consensus clusterings denoted as λD,j,k, where j and k index object representations and modalities, respectively. Subsequently, the second consensus stage executes consensus processes across the distinct object representations, giving rise to a second set of partial consensus clustering solutions, denoted as λD,R,k, where k designates modalities. Assuming that two of these modalities are truly original modes, and that the third one is created by feature-level fusion of the former, the clusterings output by the second stage of the DHCA are also denoted as λc^mod_1, λc^mod_2 and λc^mod_1+mod_2, respectively. Finally, the execution of a consensus process on these three intermediate clusterings yields the final consensus clustering solution λc, which is referred to as intermodal hereafter.

Notice that conducting consensus by means of this DHCA variant instead of a flat or a random hierarchical consensus architecture is especially interesting from an analytic viewpoint, as it makes it possible to compare the effect of the consensus process on each of the three modalities by simply evaluating the three intermediate consensus clusterings input to the last consensus stage, i.e. λc^mod_1, λc^mod_2 and λc^mod_1+mod_2 in figure 5.2.
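To illustrate the control flow of such a three-stage architecture, here is a schematic Python sketch (our own illustration; consensus_function stands for any of the consensus functions used in this thesis). Partitions are assumed to be stored in a dictionary indexed by (dimensionality, representation, modality), and consensus is taken first across dimensionalities, then across representations, and finally across modalities.

from collections import defaultdict

def drm_dhca(ensemble, consensus_function):
    """ensemble: dict mapping (dim, rep, mod) -> partition (label list).
    Returns the intermodal consensus clustering of a DRM-ordered DHCA."""
    # Stage 1: consensus across dimensionalities, one per (representation, modality) pair.
    stage1 = defaultdict(list)
    for (dim, rep, mod), labels in ensemble.items():
        stage1[(rep, mod)].append(labels)
    lambda_d = {key: consensus_function(parts) for key, parts in stage1.items()}

    # Stage 2: consensus across representations, one per modality.
    stage2 = defaultdict(list)
    for (rep, mod), labels in lambda_d.items():
        stage2[mod].append(labels)
    lambda_dr = {mod: consensus_function(parts) for mod, parts in stage2.items()}

    # Stage 3: consensus across modalities (the two original ones plus the fused one).
    return consensus_function(list(lambda_dr.values()))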

5.3 Multimodal consensus clustering results<br />


In this section, the results of the proposed multimodal consensus clustering experiments<br />

conducted in this work are described. The design of these experiments has followed the<br />

rationale described next.<br />

– What do we want to measure?<br />


i) In section 5.3.1, we evaluate the quality of the partial consensus clusterings obtained on each separate modality and on the one resulting from multimodal feature-level fusion (i.e. λc^mod_1, λc^mod_2 and λc^mod_1+mod_2, respectively), plus the intermodal clustering λc resulting from applying the consensus process for combining the three aforementioned consensus clustering solutions. Moreover, we also analyze how the unimodal, multimodal and intermodal consensus clusterings compare to each other and to the components of the cluster ensembles they are created upon.



Figure 5.2: Deterministic hierarchical consensus architecture DRM variant operating on a cluster ensemble created using three diversity factors: three dimensionalities (|dfD| = 3) of three object representations (|dfR| = 3) on three modalities (|dfM| = 3). The cluster ensemble component obtained upon the jth object representation with the ith dimensionality on the kth modality is denoted as λi,j,k. Consensus clusterings are sequentially created across the dimension, representation and modality diversity factors (dfD, dfR and dfM, respectively).

ii) In section 5.3.2, we analyze the quality of the self-refined intermodal consensus clusterings λc^pi obtained upon select cluster ensembles containing a percentage pi of the partitions of the original multimodal cluster ensemble E. Moreover, we also evaluate the performance of the supraconsensus function as a means for selecting, in a fully unsupervised manner, the top quality (either refined or non-refined) consensus clustering.

– How do we measure it?<br />

i) The quality of the unimodal, multimodal and intermodal consensus clusterings obtained is evaluated in terms of their φ(NMI) with respect to the ground truth of the data set. Inter-consensus clustering comparisons are conducted in terms of their relative percentage φ(NMI) differences, and of the percentage of experiments in which one of them attains higher φ(NMI) scores than the rest. Moreover, comparisons between the consensus clusterings and their associated cluster ensembles are made in terms of relative percentage φ(NMI) differences, the percentage of cluster ensemble components that attain φ(NMI) scores higher than that of the evaluated consensus clusterings, and the percentage of experiments and relative percentage φ(NMI) differences between them and the cluster ensemble components of maximum and median quality.

ii) The quality of the self-refined consensus clusterings is measured in terms of their<br />

φ (NMI) with respect to the ground truth of the data set. We also compute the percentage<br />

of experiments in which the top quality self-refined consensus clustering<br />

attains a higher φ (NMI) score than its non-refined counterpart, besides the relative<br />

percentage φ (NMI) differences between them and the maximum and median<br />

cluster ensemble components. On the other hand, the ability of the supraconsensus<br />

function is evaluated by the computation of the percentage of experiments in<br />

which it succeeds in selecting the highest quality consensus clustering available<br />

as the final partition of the data, besides the relative percentage φ (NMI) losses<br />

suffered when it does not.<br />

– How are the experiments designed? Just like in all the experimental sections<br />

of this thesis, consensus clusterings have been derived by means of the seven consensus<br />

functions described in appendix A.5, namely CSPA, EAC, HGPA, MCLA,<br />

ALSAD, KMSAD and SLSAD. By doing so, it is possible to compare their performances<br />

across all the consensus clustering problems conducted. It is important to<br />

note that, in this chapter, we are solely interested in analyzing the quality of the consensus<br />

clustering solutions obtained, as the main purpose of the proposed multimodal<br />

consensus clustering approach is achieving robustness to clustering indeterminacies,<br />

which, as aforementioned, are increased due to multimodality. For brevity reasons,<br />

only the results corresponding to cluster ensembles based on four distinct clustering<br />

algorithms are graphically displayed. These four clustering algorithms –namely<br />

agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2 – cover all the clustering<br />

approaches encompassed in the CLUTO clustering package (see appendix A.1 for a<br />

description). However, when global analyses are presented, the results obtained on<br />

the |dfA| = 28 multimodal cluster ensembles are considered.<br />

– How are results presented?<br />

i) The quality of the unimodal, multimodal and intermodal consensus clusterings obtained is presented by means of φ(NMI) score boxplots. Recall that non-overlapping box notches indicate that the medians of the compared distributions differ at the 5% significance level, which allows a quick inference of the statistical significance of the results. Quantitative performance evaluation measures are presented in the shape of numeric tables showing the average values of the magnitudes analyzed (mainly, percentages of experiments and relative percentage φ(NMI) differences).

ii) Results are presented by means of boxplot charts of the φ(NMI) values corresponding to the consensus self-refining process. In particular, each subfigure depicts, from left to right, the φ(NMI) values of: i) the components of the cluster ensemble E, ii) the non-refined consensus clustering solution (i.e. the one resulting from the application of either a hierarchical or a flat consensus architecture, denoted as λc), and iii) the self-refined consensus labelings λc^pi obtained upon select cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 75}.


Moreover, the consensus clustering solution deemed the optimal one (across a majority of experiment runs) by the supraconsensus function is identified by means of a vertical green dashed line. In addition, the quality comparisons between the self-refined consensus clusterings, the non-refined consensus clusterings and the cluster ensemble components are presented by means of tables showing the average values of the measured magnitudes on each one of the four multimodal data collections employed in this work.

– Which data sets are employed? Only the multimodal consensus clustering results<br />

obtained on the IsoLetters data collection are described in detail in this section —a<br />

thorough portrayal of the experiments corresponding to the three other multimedia<br />

data collections employed in this work (CAL500, InternetAds and Corel) can be found<br />

in appendix E. However, the global evaluation of our proposals encompasses the<br />

results obtained on the four multimodal data collections, presenting the average values<br />

as a result.<br />

5.3.1 Consensus clustering per modality and across modalities<br />

This section is devoted to the evaluation of the intermediate (i.e. unimodal and multimodal)<br />

and final (that is, intermodal) consensus clustering solutions yielded by the proposed multimodal<br />

deterministic hierarchical consensus architecture applied on the IsoLetters data set.<br />

Recall that the objects contained in this data collection are instances of the letters of the<br />

English alphabet expressed in two original modalities, speech and image.<br />

We start with a visual evaluation of the quality of the aforementioned consensus clusterings, measured in terms of their normalized mutual information φ(NMI) with respect to the ground truth of the data set (the closer its value to unity, the higher the quality of the corresponding clustering). Moreover, we also contrast them with the components of the multimodal cluster ensemble they are created upon.
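For reference (an illustration of ours, not part of the original text), φ(NMI) between a clustering and the ground truth can be computed, for instance, with scikit-learn; the label vectors below are toy examples, and the exact normalization used in this thesis may differ slightly from the library default.

from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 1, 1, 2, 2]   # reference labels of the data set
clustering   = [1, 1, 0, 0, 2, 2]   # labels produced by some clustering process

# 1.0 means the clustering perfectly matches the ground truth (up to label permutation).
score = normalized_mutual_info_score(ground_truth, clustering)
print(round(score, 3))  # -> 1.0 for this toy example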

For starters, figure 5.3 depicts four boxplot charts corresponding to the φ(NMI) scores of the cluster ensemble components and the unimodal, multimodal and intermodal consensus clusterings, in the case the cluster ensemble compiles partitions output by the agglo-cos-upgma clustering algorithm. In each of the boxplots, the φ(NMI) values of the components of the cluster ensemble E and of the consensus clusterings yielded by each of the seven consensus functions employed in this work across ten experiment runs are shown. In particular, figures 5.3(a) to 5.3(c) depict the quality of the intermediate unimodal and multimodal consensus clustering solutions λc^image, λc^speech and λc^image+speech, respectively. Last, figure 5.3(d) shows the boxplots corresponding to the intermodal consensus clustering λc, resulting from the combination of the previous three.

There are several observations worth making in view of these results. Firstly, it is worth noting that consensus clusterings of widely varying quality are obtained depending on the consensus function employed. Clearly, EAC and HGPA yield the worst results, whereas the five other consensus functions tend to yield better consensus partitions, often being able to compete with the highest quality cluster ensemble components.

Secondly, focusing solely on the unimodal consensus clustering processes (figures 5.3(a) and 5.3(b)), notice the substantial differences between the φ(NMI) values of the cluster ensemble components corresponding to the image and speech modalities. Undoubtedly, this is a determining factor behind the fact that the qualities of the consensus clustering solutions λc^image are notably higher than those of λc^speech, as average relative percentage φ(NMI) differences of 57.4% between both modalities are obtained. That is, although consensus clustering manages to yield reasonable quality results in both modalities, selecting a single modality for clustering a multimedia data collection can be a highly suboptimal option, as it can severely limit the quality of the obtained partition.

Figure 5.3: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

Thirdly, when consensus clustering is conducted on the multimodal representation resulting from feature-level fusion (figure 5.3(c)), even better consensus clustering results are obtained (in average relative φ(NMI) terms, 13.6% better than those obtained on the image modality), which indicates the existence of positive synergies between both modalities on this data collection.

And finally, if the λc^image, λc^speech and λc^image+speech consensus clustering solutions are combined (figure 5.3(d)), fairly good clusterings are obtained, especially when the CSPA, ALSAD and KMSAD consensus functions are employed (in these cases, relative φ(NMI) differences with respect to the image and image+speech modalities below 5% are obtained). For the remaining consensus functions, the intermodal consensus clustering λc attains lower φ(NMI) scores, thus constituting a trade-off between the consensus clusterings of the unimodal and fused modalities.

Figure 5.4 presents the results of the same process, but executed on the multimodal cluster ensemble E created by means of the direct-cos-i2 clustering algorithm. We can observe a very similar behaviour to the one just reported. Here, however, the intermodal consensus clustering solution λc (see figure 5.4(d)) is, in some cases (e.g. when consensus is based on CSPA), better (from 3.1% to 13.9% in relative percentage φ(NMI) differences) than any of its unimodal and multimodal counterparts (figures 5.4(a) to 5.4(c)).

The quality of the unimodal, multimodal and intermodal consensus clustering solutions obtained by applying the multimodal DHCA to the cluster ensemble generated upon the graph-cos-i2 clustering algorithm of the CLUTO toolbox is presented in figure 5.5. In this case, a larger performance uniformity among consensus functions is observed, at least as far as the image and image+speech modalities are concerned (figures 5.5(a) and 5.5(c)). On the other hand, the consensus clusterings obtained upon the multimodal representation



Figure 5.4: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

Figure 5.5: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

of the data (λc^image+speech) attain a higher quality than any of their unimodal counterparts (16.1% better in relative terms). Moreover, intermodal consensus (figure 5.5(d)) gives rise to clusterings that, at best, are comparable to the partitions obtained on the multimodal modality (e.g. ALSAD and KMSAD, where average relative φ(NMI) losses of 3% are observed) and, in the worst cases, constitute a trade-off between the combined modalities.

Last, notice that fairly similar results are obtained when this consensus process is applied on the cluster ensemble created by compiling the partitions output by the rb-cos-i2 CLUTO clustering algorithm (see figure 5.6). Once more, the comparison of the consensus clusterings obtained on the unimodal and multimodal modalities (figures 5.6(a) to 5.6(c)) reveals the superiority of the latter on this data collection. When these three modalities are fused in the final consensus stage of the DHCA, the intermodal consensus clustering solutions λc yielded by the CSPA, ALSAD, KMSAD and SLSAD consensus functions attain φ(NMI) values only 0.9% to 7.1% worse than those attained on the multimodal data representation, as depicted in figure 5.6(d).

Figure 5.6: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

While the boxplots depicted in figures 5.3 to 5.6 provide the reader with a qualitative though partial vision of the results of unimodal, multimodal and intermodal consensus

clustering, it is necessary to conduct a more quantitative and generic analysis across the<br />

experiments conducted on the |dfA| = 28 cluster ensembles created using all the clustering<br />

algorithms from the CLUTO toolbox upon the four multimodal data collections employed<br />

in this work.<br />

Unimodal and multimodal consensus clustering vs their cluster ensembles<br />

Firstly, we have evaluated the quality of the two unimodal (λc^mod_1 and λc^mod_2) and the multimodal (λc^mod_1+mod_2) consensus clusterings with respect to the cluster ensembles they have been created upon.

In order to evaluate how the consensus clusterings compare to their associated cluster ensembles, we have computed the percentage of cluster ensemble components that attain a higher φ(NMI) than the evaluated consensus clustering. Quite obviously, the smaller this percentage, the higher the robustness to clustering indeterminacies achieved. The results of this analysis are presented in table 5.2.
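For clarity (an illustration of ours, not part of the original evaluation code), this percentage can be computed as follows, where φ(NMI) is evaluated against the ground truth labels of the data set.

from sklearn.metrics import normalized_mutual_info_score as nmi

def pct_components_above(consensus_labels, ensemble, ground_truth):
    """Percentage of ensemble components whose phi(NMI) w.r.t. the ground truth
    exceeds that of the evaluated consensus clustering (lower is better)."""
    consensus_score = nmi(ground_truth, consensus_labels)
    better = sum(1 for labels in ensemble if nmi(ground_truth, labels) > consensus_score)
    return 100.0 * better / len(ensemble)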

Care must be taken in analyzing the presented percentages, due to the fact that the two<br />

unimodal and the multimodal consensus clusterings have been created upon different cluster<br />

ensembles, which makes comparisons across columns (i.e. across consensus functions) fair,<br />

but the same does not hold for comparisons across rows (i.e. across consensus clusterings).<br />

If the performance of the seven consensus functions is contrasted, it can be observed that EAC and HGPA yield, in most cases, the worst results, as the consensus clusterings they yield (either unimodal or multimodal) are, on average, worse than 76.9% and 78.8% of the components of the cluster ensemble they are created upon, a percentage that goes below 30% in the case of the best performing consensus functions (CSPA, ALSAD and KMSAD). These results confirm that there may exist great differences between consensus functions as far as the quality of the consensus clusterings is concerned, so care must be taken when choosing which ones are applied.

If averages are taken for summarization purposes, unimodal consensus clusterings are better than 49.2% of their corresponding cluster ensemble components, while this percentage rises to 56.5% when multimodal consensus clusterings are considered.


                                             Consensus function
Data set       Consensus type            CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
IsoLetters     λc^image                  21.1   60.1   63.4   40.9   20.4    26.6    40.2
               λc^speech                 43.9   91.5   79.4   47.4   18.3    34.6    77.4
               λc^image+speech           17.1   51.8   59.9   34.1   16.7    25.0    39.1
CAL500         λc^audio                  16.5   86.3   67.7   29.9    9.9    12.0    70.3
               λc^text                   29.7   83.5   52.6   59.3   32.6    30.5    75.7
               λc^audio+text             21.9   94.9   54.8   48.0   16.9    20.3    61.9
InternetAds    λc^object                 47.8   62.2   99.1   44.7   39.8    40.5    61.6
               λc^collateral             55.3   52.4   99.4   40.9   45.1    37.6    59.6
               λc^object+collateral      41.6   64.5   99.7   29.9   29.2    34.6    37.3
Corel          λc^image                  12.7   93.8   96.2   50.0   19.1    26.7    79.1
               λc^text                   10.4   89.2   88.1   43.7   28.8    23.7    77.8
               λc^image+text              7.5   92.2   85.1   40.2   10.9    20.2    64.1

Table 5.2: Percentage of cluster ensemble components that attain a higher φ(NMI) than the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions.



The second analysis consists of comparing the unimodal and multimodal consensus clusterings with the cluster ensemble components of maximum and median φ(NMI) (which we call best and median ensemble components, or BEC and MEC for short). Taking these two cluster ensemble components as a reference, we have computed i) the percentage of experiments in which the evaluated consensus clustering attains a higher φ(NMI), and ii) the relative percentage φ(NMI) differences between them and the evaluated consensus clustering. The results of this analysis are presented, per data collection and consensus function, in tables 5.3 and 5.4, where evaluation is referred to the BEC and the MEC, respectively.

It can be observed that, as already noticed in previous experiments, the EAC and HGPA consensus functions perform notably worse than the remaining ones. On average, unimodal consensus clusterings are better than their corresponding BEC in 6.5% of the experiments, whereas the multimodal consensus clustering solutions attain a higher φ(NMI) than the BEC in 8.7% of the cases (see table 5.3). If this comparison is made in terms of the relative percentage φ(NMI) differences, we see that, on average, the unimodal consensus clusterings are 33.5% worse than the BEC, while this percentage is reduced to 28.2% when the multimodal consensus clusterings are considered.

If the median ensemble component is taken as a reference (see table 5.4), we observe that unimodal consensus clusterings are better than the MEC in 54% of the experiments conducted. In contrast, when the multimodal consensus clustering solution is considered, superiority with respect to the MEC is obtained in 62.3% of the cases. If the MEC and the consensus clusterings are compared in terms of relative percentage φ(NMI) differences, we see that unimodal consensus yields clusterings which are 21.1% better than the MEC, while this percentage rises to 39.6% in the case of multimodal consensus.

Thus, a conclusion we can draw at this point is that, in view of the results just reported,<br />

the execution of consensus processes on multimodal cluster ensembles yields better quality<br />

consensus clusterings than those obtained upon cluster ensembles based on a single modality,<br />

which somehow constitutes a claim in favour of early fusion techniques. However, this<br />

statement must be made with caution, as it is not supported by evidence in all the data<br />

collections (for instance, the CAL500 collection constitutes an exception to this rule).<br />

Multimodal vs unimodal consensus clustering

Secondly, we have compared the quality of the multimodal consensus clusterings (that is, λ_c^mod1+mod2) with the quality of their unimodal counterparts (λ_c^mod1 and λ_c^mod2). Again, this comparison has been made in terms of the percentage of experiments in which the former attains a higher φ(NMI) than the latter, and the relative percentage φ(NMI) differences between them, taking the unimodal consensus clusterings as a reference. The results are presented in table 5.5.

In average terms, the multimodal consensus clustering λ_c^mod1+mod2 is better than its two unimodal counterparts in 53.3% of the experiments conducted, and worse than both of them in only 13% of the cases. If the φ(NMI) differences between these consensus clusterings are measured, we observe that multimodal consensus yields partitions that, in average relative percentage φ(NMI) terms, are 4802.4% better. Although this large figure is mainly caused by two outliers (on the InternetAds data set using the ALSAD and KMSAD consensus functions), the results presented in table 5.5 show an overwhelming majority of positive Δφ(NMI) values, which reinforces the notion that multimodal consensus processes, when compared to their unimodal counterparts, constitute, in most cases, a better option.

Intermodal vs unimodal and multimodal consensus clustering

Furthermore, we have also investigated whether the combination of unimodal and multimodal consensus clusterings (i.e. the execution of intermodal consensus processes) can lead to better partitions.

For this reason, table 5.6 presents the detailed results corresponding to the comparison of the intermodal consensus clustering solutions with respect to their unimodal and multimodal counterparts, across all the data sets and consensus functions. Once again, such comparison is twofold, as it takes into account the percentage of experiments in which intermodal consensus is better than unimodal and multimodal consensus, and the relative percentage φ(NMI) differences between them (taking the unimodal and multimodal consensus clusterings as a reference).

If averages across data collections and consensus functions are taken, the following results are obtained: when compared to the unimodal consensus clusterings, intermodal consensus clusterings are better in 59.5% of the experiments conducted, attaining an average relative φ(NMI) gain of 2821.7% with respect to them. That is, intermodal consensus clusterings are, in general terms, superior to their unimodal counterparts. Possibly the clearest exceptions to this rule are found in the audio modality of the CAL500 data set and the image modality of the Corel collection.

However, intermodal consensus clusterings are superior to their multimodal counterparts in just 34.7% of the cases, reaching a quality that, measured in average relative φ(NMI) percentage terms, is 65.5% better. Thus, as already suggested by the boxplot charts presented in figures 5.3 to 5.6, intermodal consensus clusterings tend to become a trade-off between the multimodal and unimodal consensus clustering solutions they are based on.

Furthermore, if the quality of the intermodal consensus clustering is contrasted to that of the cluster ensemble it is created upon (that is, the one compiling both unimodal and multimodal clusterings), we obtain that it is better than 52.9% of its components (recall that this percentage was 49.2% and 56.5% when referred to the unimodal and multimodal consensus clusterings), which reinforces the notion that, in general terms, intermodal consensus is a trade-off between its unimodal and multimodal counterparts.

Notice that quite different situations are found among the data sets used in this experiment. For instance, the intermodal consensus clustering is clearly inferior to its multimodal counterpart on the IsoLetters data set, whereas quite the opposite is observed on the InternetAds collection. Therefore, we consider that creating an intermodal consensus clustering is a fairly generic way of proceeding, as sometimes it can be advantageous to combine unimodal and multimodal consensus clusterings. Its occasionally inferior quality (when compared to either its unimodal or multimodal counterparts) can be compensated by the consensus self-refining procedure presented in section 5.2. The results of applying it on the intermodal consensus clustering λ_c are described in the following section.


Data set      Consensus type           Measure    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters    λ_c^image                %exp        3.6     0       0       0      17.9     5       3.6
                                       Δφ(NMI)    -7.3   -42.9   -39.2   -18.8   -10.1    -9.1   -22
              λ_c^speech               %exp        0       0       0       0      10.7     7.1     3.6
                                       Δφ(NMI)   -17.3   -59.8   -37.5   -20.3    -4.8   -10.2   -42.1
              λ_c^image+speech         %exp       10.7     0       0       0       7.1     8.6     0
                                       Δφ(NMI)    -5.8   -33.2   -38.2   -16.1    -8.6    -8.5   -16.2
CAL500        λ_c^audio                %exp        3.5     0       0       1.4    25       7.8     0
                                       Δφ(NMI)    -9.4   -50.5   -37.5   -17      -5.8    -7     -37.3
              λ_c^text                 %exp       28.6     0      13.5     1.4    21.4    26.4     7.1
                                       Δφ(NMI)    -4.1   -24.2   -11.7   -14.9    -5.5    -4.6   -23.1
              λ_c^audio+text           %exp       21.4     0       5.7     0.7    28.6    21.4     3.6
                                       Δφ(NMI)    -4.3   -44.8   -18.1   -19.6    -4.2    -6.1   -23.3
InternetAds   λ_c^object               %exp        0       3.6     0       0       0       0       0
                                       Δφ(NMI)   -45.1   -72.8   -99.9   -45.4   -43.3   -41.7   -61.9
              λ_c^collateral           %exp        0       0       0       0       3.6     4.3     0
                                       Δφ(NMI)   -81.7   -80.2   -99.9   -70.5   -70.3   -64.8   -87.7
              λ_c^object+collateral    %exp        0       0       0      25       3.5     3.6     3.6
                                       Δφ(NMI)   -41     -82.5   -99.9   -38.6   -34.7   -42.3   -42.8
Corel         λ_c^image                %exp       25       0       0       3.6    39.3     9.3     3.5
                                       Δφ(NMI)    -2     -56.7   -43.5   -28.9    -4.8    -7.1   -28.8
              λ_c^text                 %exp       35.7     0       5.7     1.4    14.3    15      10.7
                                       Δφ(NMI)    21.3   -56.3   -34.3   -24.6    -9.5    -8.4   -34.2
              λ_c^image+text           %exp       39.3     0       0      11.4    32.1    10       7.1
                                       Δφ(NMI)     6.6   -60.6   -40.7   -24.4    -4.9    -8.2   -29.4

Table 5.3: Evaluation of the unimodal and multimodal consensus clusterings with respect to the best cluster ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the evaluated consensus clustering is superior to the BEC, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



Data set      Consensus type           Measure    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters    λ_c^image                %exp       92.3    21.4    22.9    85     100      90.7    67.9
                                       Δφ(NMI)    80     -10.7     5.2    45.9    56.3    69.3    35.1
              λ_c^speech               %exp       89.3     3.5     0      75.7   100      98.6     7.2
                                       Δφ(NMI)     5.6   -47.4   -20.1     0.7    23.6    17     -23.4
              λ_c^image+speech         %exp      100      50      42.1    91.4   100      92.9    85.7
                                       Δφ(NMI)    96.3    11.2    13.8    64      70.4    83.9    63
CAL500        λ_c^audio                %exp      100       3.6    12.9    86.4   100      99.3    21.4
                                       Δφ(NMI)    24.4   -31.9   -15.4    12      28.8    27.8   -14.1
              λ_c^text                 %exp       78.6    14.3    45      39.3    75      73.6    28.6
                                       Δφ(NMI)     9.9   -13.3     1.1    -2.4     8.4     9.5   -11.8
              λ_c^audio+text           %exp       96.4     3.5    41.4    55.7    85.7    92.9    35.7
                                       Δφ(NMI)    16.3   -33.2    -0.6    -2.9    16.4    14      -6.8
InternetAds   λ_c^object               %exp       71.4    32.1     0      77.1    67.9    68.6    46.2
                                       Δφ(NMI)    64.5   110.2   -99.9    30.4    69     115.4    93.6
              λ_c^collateral           %exp       50      50       0      61.4    57.1    67.9    21.4
                                       Δφ(NMI)   111.7   173.5   -99.7   109.2   143.8   184     -33.5
              λ_c^object+collateral    %exp       71.4    25       0      78.6    82.1    66.4    67.9
                                       Δφ(NMI)   236.9   -13     -99.4   137.5   186.1   202.9    14.6
Corel         λ_c^image                %exp      100       0       0      52.1    96.4    85      14.3
                                       Δφ(NMI)    23.4   -46.7   -33.1   -17.3    18.7    16.5    -9.4
              λ_c^text                 %exp       96.4    10.7     8.6    62.9    85.7    87.9    21.4
                                       Δφ(NMI)    48.8   -45.7   -19.7    -8.2    13.4    14.9   -16.9
              λ_c^image+text           %exp      100       3.6     0      63.6   100      95.7    17.9
                                       Δφ(NMI)    51     -45.7   -21.5    -3.3    30.7    27.4    -0.2

Table 5.4: Evaluation of the unimodal and multimodal consensus clusterings with respect to the median cluster ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the evaluated consensus clustering is superior to the MEC, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



Data set      Consensus type    Measure    CSPA     EAC      HGPA     MCLA     ALSAD     KMSAD     SLSAD
IsoLetters    λ_c^image         %exp        92.9     92.9     93.6     92.9     92.9      92.9      85.7
                                Δφ(NMI)     10.7     37.4     11.2     14.3     10.7       9.7      20.1
              λ_c^speech        %exp        96.4     92.9     95       95.7     85.7      92.9      92.9
                                Δφ(NMI)     81.1    177.7     56.5     70.7     54.3      65.9     155.4
CAL500        λ_c^audio         %exp         7.1     10.7     30.7      2.1      3.6       5        25
                                Δφ(NMI)    -26      -21.7     -7.3    -31.4    -28.7     -29.4     -11.2
              λ_c^text          %exp       100       17.9     85       79.3     96.4      93.6      82.1
                                Δφ(NMI)     19.9    -12.3     11.8     13.8     22.2      18.7      23.4
InternetAds   λ_c^object        %exp        64.3     35.7     93.6     62.1     64.3      59.3      71.4
                                Δφ(NMI)     80      234     1616      348    53833      1454      1270
              λ_c^collateral    %exp        67.9     50       78.6     73.6     82.1      64.3      78.6
                                Δφ(NMI)   1690      160      -10      910     3750    200660      1030
Corel         λ_c^image         %exp        85.7     28.6     42.9     65.7     60.7      61.4      53.6
                                Δφ(NMI)     -2.5     -0.8     96.3     12.9     -5.9      -5.7      -6.7
              λ_c^text          %exp        89.3     96.4     88.6     91.4     96.4      96.4      96.5
                                Δφ(NMI)    142.9    149.9    129.9    165.8    169.6     151.7     192

Table 5.5: Evaluation of the multimodal consensus clusterings with respect to their unimodal counterparts, across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the multimodal consensus clustering is superior to the unimodal consensus clusterings, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



Data set      Consensus type           Measure    CSPA     EAC      HGPA     MCLA     ALSAD     KMSAD      SLSAD
IsoLetters    λ_c^image                %exp        92.9     42.9      1.4      8.6     89.3      86.4       46.4
                                       Δφ(NMI)      9       -5.2    -24.3    -12.1      8.9      10.1        3
              λ_c^speech               %exp        96.4     89.3     52.9     87.1     82.1      92.9       92.9
                                       Δφ(NMI)     78.2     90.2      6.6     30.5     49.6      59.5      114.2
              λ_c^image+speech         %exp        17.9      3.6      0        0.7     17.9      12.1        7.1
                                       Δφ(NMI)     -1.4    -28.6    -31.8    -22.5     -0.3       1.8      -12.7
CAL500        λ_c^audio                %exp         7.1     17.9      8.6      1.4      7.1       7.2       28.6
                                       Δφ(NMI)    -23.8    -10.1    -19.7    -31.9    -23.7     -25.8       -8.9
              λ_c^text                 %exp        96.4     53.6     32.9     80.7    100        99.3       89.3
                                       Δφ(NMI)     25        0.5     -3.4     13.8     30.7      24.2       28
              λ_c^audio+text           %exp        75       78.6     11.4     40.7     75        72.1       57.1
                                       Δφ(NMI)      4.2     17.3    -13.1      3        7.3       5          6.8
InternetAds   λ_c^object               %exp        57.1     57.2     90       45.7     57.1      53.6       42.9
                                       Δφ(NMI)      2        3      -10      104    44152       465        -38
              λ_c^collateral           %exp        85.7     82.1     70.7     78.6     92.9      71.4       64.3
                                       Δφ(NMI)   1250       60      -30      940     4880    105030         60
              λ_c^object+collateral    %exp        42.9     67.9     75       30       71.4      55.7       32.1
                                       Δφ(NMI)    479.1     41.7    -19.2     59.4     41.1    1322.6       22.2
Corel         λ_c^image                %exp        75       10.7      3.6      7.1     17.9       6.4        3.6
                                       Δφ(NMI)     -0.7    -17.1    -24.4    -22.4    -10.8     -11.6      -18
              λ_c^text                 %exp       100       92.9     91.4     84.3    100       100        100
                                       Δφ(NMI)    145.9    136.8     30.1     83.5    153.8     143.4      157.9
              λ_c^image+text           %exp        14.3     46.4     14.4      5.7     14.3      14.2       17.9
                                       Δφ(NMI)      6.7      1.1    -38.2    -25.2      3.2       7.5       -1.6

Table 5.6: Evaluation of the intermodal consensus clustering with respect to the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the intermodal consensus clustering is superior to the unimodal and multimodal consensus clusterings, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



5.3.2 Self-refined consensus clustering across modalities

In this section, we analyze the results of subjecting the intermodal consensus clustering solution λ_c to a self-refining procedure based on a cluster ensemble E, the components of which correspond to both unimodal and multimodal partitions.

As described in chapter 4, the self-refining process is based on the creation of a select cluster ensemble E_pi (containing a percentage pi of the clusterings in E), followed by the application of a consensus process (either flat or hierarchical) on it, which yields a self-refined consensus clustering λ_c^pi. In this work, the refining consensus processes are conducted by means of a flat consensus architecture, and the set of percentages employed is pi ∈ {2, 5, 10, 15, 20, 30, 40, 50, 75}.
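A minimal sketch of this refining loop is given below. The selection criterion shown, keeping the pi% of ensemble components most similar (in NMI terms) to the non-refined consensus clustering λ_c, is an assumption made here for illustration only (the actual criterion is described in chapter 4), and consensus_function is a placeholder for any of the seven consensus functions used in this work.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def self_refine(lambda_c, ensemble, consensus_function,
                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 75)):
    """Generate one self-refined consensus clustering per percentage pi.

    lambda_c           : non-refined consensus clustering (integer label vector)
    ensemble           : list of label vectors forming the cluster ensemble E
    consensus_function : callable mapping a list of labelings to a consensus labeling
    """
    # Rank ensemble components by similarity to the non-refined consensus
    # clustering (an illustrative selection criterion, not the thesis's own).
    similarity = np.array([nmi(lambda_c, comp) for comp in ensemble])
    order = np.argsort(similarity)[::-1]

    refined = {}
    for p in percentages:
        n_keep = max(1, int(round(len(ensemble) * p / 100.0)))
        select_ensemble = [ensemble[i] for i in order[:n_keep]]   # E_pi
        refined[p] = consensus_function(select_ensemble)          # lambda_c^pi
    return refined
```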

For this reason, the final consensus clustering solution λ_c^final is obtained by applying a supraconsensus selection function to the set of self-refined clusterings λ_c^pi (see section 4.1 for a description of the φ(ANMI)-based supraconsensus function employed in this work). In the following paragraphs, the performances of these two processes (i.e. the self-refining procedure and the supraconsensus function) are evaluated separately.
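Under the same assumptions as above, the φ(ANMI)-based supraconsensus selection reduces to choosing, among the non-refined and self-refined candidates, the clustering that maximizes the average NMI with respect to the components of the cluster ensemble; the sketch below is one minimal reading of the function described in section 4.1.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def anmi(labeling, ensemble):
    """Average normalized mutual information of a labeling w.r.t. an ensemble."""
    return float(np.mean([nmi(labeling, comp) for comp in ensemble]))

def supraconsensus(candidates, ensemble):
    """Select the candidate clustering (non-refined or self-refined) with the
    highest phi(ANMI); candidates maps an identifier to a label vector."""
    scores = {key: anmi(lab, ensemble) for key, lab in candidates.items()}
    best_key = max(scores, key=scores.get)
    return best_key, candidates[best_key]   # lambda_c^final
```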

As in the previous section, the results of the execution of these processes on the IsoLetters data collection are described in detail next. Again, for brevity, the presentation of the results corresponding to the CAL500, InternetAds and Corel data sets is deferred to appendix E.

Evaluation of the multimodal self-refining process

For starters, figure 5.7 depicts the results of the self-refining process using the multimodal cluster ensemble resulting from gathering the partitions output by the agglo-cos-upgma clustering algorithm of the CLUTO package. Each one of the seven boxplots presented (one per consensus function) shows, from left to right, the φ(NMI) values of the multimodal cluster ensemble E components, of the intermodal consensus clustering λ_c, and of the self-refined consensus clustering solutions λ_c^pi. The consensus clustering selected by the supraconsensus function, λ_c^final, is highlighted by a vertical green dashed line.

Notice that, regardless of the consensus function employed, there always exists at least one self-refined consensus clustering solution that attains a φ(NMI) value that is statistically significantly higher than the one achieved by the non-refined version λ_c. In some cases, as when consensus is based on CSPA (figure 5.7(a)), relatively small φ(NMI) gains are obtained, especially if they are compared with the dramatic φ(NMI) increases brought about by self-refining when, for instance, the EAC, HGPA or SLSAD consensus functions are employed (see figures 5.7(b), 5.7(c) and 5.7(g)).

As regards the ability of the supraconsensus function to select the highest quality (either refined or non-refined) consensus clustering as the final partition of the data, note that it performs reasonably well, as it mostly succeeds in selecting the clustering solution of maximum φ(NMI) as λ_c^final. A deeper analysis of the supraconsensus function performance will be presented in the next section.

Notice that the proposed intermodal consensus self-refining procedure shows a very similar behaviour in the experiments conducted on the multimodal cluster ensembles created upon the partitions output by the direct-cos-i2, the graph-cos-i2 and the rb-cos-i2 clustering algorithms from the CLUTO toolbox (see figures 5.8, 5.9 and 5.10, respectively).


[Figure 5.7 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.7: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the IsoLetters data set.

[Figure 5.8 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.8: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the IsoLetters data set.

A comprehensive evaluation of the results of the intermodal consensus self-refining procedure is presented throughout the following paragraphs.



[Figure 5.9 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.9: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the IsoLetters data set.

Data set       CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters     96.4    96.5   100      98.6   100     100     100
CAL500         89.3    92.9    99.3    97.8    67.9    83.6    82.1
InternetAds    75      57.1    46.4    98.5    78.6    90      85.7
Corel         100      89.2   100     100     100     100     100

Table 5.7: Percentage of multimodal self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart, across the four multimedia data collections and the seven consensus functions.

This analysis considers the experiments conducted upon the cluster ensembles created using the |dfA| = 28 clustering algorithms across the four multimedia data collections employed in this work.

For starters, in order to evaluate the ability of the self-refining process to create high quality partitions, we have measured the percentage of experiments in which there exists at least one self-refined consensus clustering λ_c^pi that attains a higher φ(NMI) than its non-refined counterpart λ_c. The results per data set and consensus function, which are presented in table 5.7, reveal that self-refining is capable of yielding a beneficial effect in a large majority (an average of 90.2%) of the experiments conducted. This figure is of the same order of magnitude as the one obtained in the consensus-based unimodal self-refining experiments presented in section 4.2.1, which indicates that multimodality does not constitute an obstacle as far as the performance of the proposed self-refining procedure is concerned.


[Figure 5.10 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.10: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the IsoLetters data set.

Moreover, so as to evaluate the quality improvement that the proposed self-refining procedure is able to introduce, we have computed the relative φ(NMI) gain between the top quality self-refined consensus clustering and its non-refined counterpart, measured in those experiments where there exists a self-refined consensus clustering superior to the non-refined version (i.e. 90.2% of the total). Table 5.8 presents the results corresponding to each data collection and consensus function. It can be observed that φ(NMI) gains over 10% are consistently obtained in most cases; again, these results are of comparable magnitude to those obtained in the unimodal scenario (see section 4.2.1). Notice that the results obtained on the InternetAds data set stand out among the rest, as relative percentage φ(NMI) gains of the order of 10^3 to 10^5 are observed. These extremely large figures are due to the extremely low quality of the consensus clusterings available prior to refining on this collection, which transforms any φ(NMI) increase caused by self-refining into a huge figure when expressed in relative percentage difference terms referred to the non-refined clustering. In the 9.8% of the experiments in which none of the self-refined consensus clusterings (λ_c^pi) attains a higher quality than the non-refined one (λ_c), the difference between the top quality λ_c^pi and λ_c is an average relative percentage φ(NMI) loss of 19%.

Besides comparing the top quality self-refined consensus clustering solution with its non-refined counterpart, we have also contrasted its quality with respect to the clusterings that make up the cluster ensemble.

Firstly, we have computed the percentage of cluster ensemble components that attain a higher φ(NMI) score (with respect to the ground truth) than that of the top quality self-refined consensus clustering, as this figure constitutes a fairly clear indicator of how it compares to the cluster ensemble it is created upon (see table 5.9). In average terms, the top quality self-refined consensus clustering is better than 78.3% of the cluster ensemble components, which is a sign of notable robustness to the indeterminacies of multimodal clustering. Moreover, recall that this percentage was 52.9% prior to self-refining, which again provides evidence of the benefits of the proposed consensus self-refining procedure.


Data set        CSPA      EAC       HGPA      MCLA     ALSAD    KMSAD    SLSAD
IsoLetters       8.7       93        121.1     31.3     23.5     16       35.4
CAL500          14.1       31.1       57.4     22.3     13.8     18.6     19.4
InternetAds  26284      12207     467710     1991.5   1686.8   1788.5   1742.3
Corel           11.1       64.2      177.1     42.2     32       33.6     44.7

Table 5.8: Relative φ(NMI) percentage gain of the top quality self-refined consensus clustering solution with respect to its non-refined counterpart, across the four multimedia data collections and the seven consensus functions.

Data set       CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters      2.2    19.9    16.8     5.4     0.5     1.7     8.7
CAL500         14.1    61.6    17.6    17.1    15.4    14.4    41.2
InternetAds    16.9    35.9    94.1     8.3    18.1    17.6    38.5
Corel           0.9    63.4    29.5     4.9     0.1     1.2    40.9

Table 5.9: Percentage of the cluster ensemble components that attain a higher φ(NMI) score than the top quality self-refined consensus clustering solution, across the four multimedia data collections and the seven consensus functions.

Furthermore, we have compared the top quality self-refined consensus clustering with the highest and median φ(NMI) components of the multimodal cluster ensemble E, referred to as BEC (best ensemble component) and MEC (median ensemble component), respectively. Using the quality of these two components as a reference, we have evaluated i) the percentage of experiments where the maximum φ(NMI) consensus clustering solution attains a higher quality than that of the BEC and MEC, and ii) the relative percentage φ(NMI) variation between them and the top quality consensus clustering solution. Again, all the results presented correspond to an average across all the experiments conducted upon the cluster ensembles generated using the twenty-eight clustering algorithms from the CLUTO toolbox employed in this work.

Table 5.10 presents, for each data set and consensus function, the percentage of experiments in which the top quality consensus clustering (either non-refined or refined) attains a higher quality (i.e. a higher φ(NMI)) than the cluster ensemble component of maximum quality (or BEC). In each cell of the table, the percentage of experiments in which the non-refined consensus clustering reaches a higher φ(NMI) score than the BEC is shown in brackets. By doing so, the effect of the self-refining process can be evaluated at a glance. Notice, then, that the equality of the bracketed and the unbracketed figures shown in any cell indicates that none of the refined consensus clusterings attains a higher quality than its non-refined counterpart.

On average, a self-refined consensus clustering better than the BEC is obtained in 19.1% of the experiments conducted, whereas this figure was as low as 1.6% prior to self-refining. Notice that, depending on the data collection, quite diverse results are obtained (i.e. consensus self-refining seems to achieve a greater level of success when applied on the IsoLetters and Corel collections than on the CAL500 and InternetAds data sets).


Data set       CSPA     EAC    HGPA   MCLA    ALSAD   KMSAD   SLSAD
IsoLetters     46.4      3.6    0     26.4    64.3    51.4    25
               (3.5)    (0)    (0)    (0)     (7.1)   (5)     (0)
CAL500          7.1      0      0      0       3.6     2.9     3.5
               (0)      (0)    (0)    (0)     (3.6)   (2.9)   (3.5)
InternetAds     3.6      0      0      0       0       1.4     0
               (0)      (0)    (0)    (0)     (0)     (0)     (0)
Corel          75        0      0     45      92.9    72.9    10.7
              (14.3)    (0)    (0)    (0)     (7.1)   (0)     (0)

Table 5.10: Percentage of experiments in which the best (either non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The percentages prior to self-refining are shown in brackets.

Furthermore, notice the aforementioned differences between the results offered by the distinct consensus functions.

Data set       CSPA    EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
IsoLetters      4.2    11.2    –      2.4    8.2     6.9     8.3
               (2.1)   (–)    (–)    (–)    (5.7)   (3.2)   (–)
CAL500          5.6     –      –      –     10.4     3.2     3.3
               (–)     (–)    (–)    (–)   (10.4)   (3.2)   (3.3)
InternetAds     6.2     –      –      –      –       2.9     –
               (–)     (–)    (–)    (–)    (–)     (–)     (–)
Corel           6.9     –      –      2.1    6.1     5.4    13.7
               (4.2)   (–)    (–)    (–)    (1.1)   (–)     (–)

Table 5.11: Relative φ(NMI) percentage difference of the top quality (either non-refined or self-refined) consensus clustering solution with respect to the best ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The relative φ(NMI) percentage differences prior to self-refining are shown in brackets.

Moreover, we have computed the relative percentage φ(NMI) gains between the top quality (non-refined or refined) consensus clustering solution and the BEC (limited to those experiments in which the former is superior to the latter), obtaining the results presented in table 5.11. If the φ(NMI) gains are averaged across all the experiments conducted, a 6.3% relative percentage φ(NMI) increase is obtained. Again, the percentages corresponding to the same magnitude measured prior to refining are presented in brackets in each cell of the table, attaining an average φ(NMI) gain of 4.1%. That is, self-refining not only gives rise to a larger number of clusterings better than the BEC, but it also increases the φ(NMI) gains with respect to it. However, in those experiments in which the top quality (non-refined or refined) consensus clustering attains a φ(NMI) score which is lower than that of the BEC (i.e. in 80.9% of the total), its quality is 28.2% lower, measured in averaged relative percentage φ(NMI) terms.


5.3. Multimodal consensus clustering results<br />

Data set       CSPA     EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters    100       92.9   100      99.3   100     100     100
             (96.4)    (21.4)   (2.9)  (82.9) (100)   (100)   (96.4)
CAL500        100       21.4    97.1    99.3   100     100      71.4
             (100)      (7.1)  (15)    (40)   (100)    (93.6)  (32.1)
InternetAds    96.4     60.7     2.9   100      85.7    92.1    60.7
              (75)     (32.1)   (0)    (72.9)  (75)    (66.4)  (17.9)
Corel         100       14.3    91.4   100     100     100      60.8
             (100)     (14.3)  (10)    (15.7)  (89.3)  (85)    (17.9)

Table 5.12: Percentage of experiments in which the top quality (either non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The percentages prior to self-refining are shown in brackets.

As regards the comparison with the median quality cluster ensemble component (or MEC), its quality is surpassed by the top quality consensus clustering in an average of 83.8% of the experiments conducted (see table 5.12 for a detailed view of these results across data sets and consensus functions). It can be observed that, in a vast majority of cases, self-refining guarantees obtaining partitions that are better than the median clustering available in the cluster ensemble. Again, the percentage of experiments in which the non-refined intermodal consensus clustering attains a higher φ(NMI) than the MEC is presented in brackets in each cell of the table, yielding an average of 55.7%; that is, self-refining increases the chances of obtaining a partition better than the median one by almost 30 percentage points.

But better to what extent? So as to answer this question, table 5.13 presents the relative φ(NMI) percentage differences between the top quality consensus clustering and the MEC, considering only those experiments in which the former attains a higher φ(NMI) value than the latter. On average, a relative percentage gain of 142.7% is achieved, which again reinforces the notion that the proposed multimodal consensus self-refining process is, by itself, capable of yielding good quality partitions upon a previously derived consensus clustering solution. Moreover, each cell in table 5.13 shows, in brackets, the relative φ(NMI) percentage differences between the non-refined intermodal consensus clustering λ_c and the MEC. Notice how the self-refining procedure, besides yielding consensus clusterings superior to the MEC in a larger number of experiments, also increases the difference with respect to it, raising it from an average of 103.7% to the aforementioned 142.7%. In the experiments in which the top quality consensus clustering fails to attain a higher φ(NMI) value than that of the MEC (i.e. 16.2% of the total), its quality is 32.1% lower, measured in average relative percentage φ(NMI) terms.

However, notice that, in tables 5.7 to 5.13, the performance of the self-refining procedure has been evaluated taking the highest φ(NMI) self-refined consensus clustering solution as a reference. But, as aforementioned, the self-refining process generates multiple self-refined clusterings λ_c^pi using distinct percentages pi of the original cluster ensemble E. Therefore, the subsequent application of the average normalized mutual information (φ(ANMI)) based supraconsensus function is required so as to obtain the final partition of the multimodal data set, λ_c^final. As already described in chapter 4, the ability of the supraconsensus function to select the top quality consensus clustering solution is a crucial issue as regards the overall performance of the multimodal self-refining consensus clustering system.


Data set       CSPA      EAC      HGPA     MCLA     ALSAD    KMSAD    SLSAD
IsoLetters      86.6      55.6     63.9     81.1    100.3     95.2     78.8
               (76.8)    (15.4)    (2.4)   (28.1)   (65.3)   (67.8)   (34)
CAL500          33.4      29.3     31.6     28.1     28.9     32.6     16
               (18.9)     (4.4)    (7.6)    (–)      (–)      (–)      (–)
InternetAds    382.9     191.2    196.7    412      408.5    446.3    406.9
              (314.2)   (174.1)    (–)    (236.4)  (222.7)  (291.5)  (356.6)
Corel          113.5     258       40.8     35.5    106.1    106      130.9
               (83.8)    (39.7)    (2.6)    (4.6)   (58)     (54.8)  (225.5)

Table 5.13: Relative φ(NMI) percentage difference of the top quality (either non-refined or self-refined) consensus clustering solution with respect to the median ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The relative φ(NMI) percentage differences prior to self-refining are shown in brackets.

Data set       CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters     53.6    67.9    70      50.7    39.3    42.9    39.3
CAL500         64.3    35.7    17.9    10.7    21.4    46.4    21.5
InternetAds     7.1    53.6    81.4    10      21.4    12.1    35.7
Corel          57.1    57.2    70      57.1    39.2    45      50

Table 5.14: Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution, across the four multimedia data collections and the seven consensus functions.

For this reason, the following paragraphs are devoted to its evaluation.

Evaluation of the supraconsensus process

Firstly, we have evaluated the supraconsensus function in terms of the percentage of experiments in which it succeeds, i.e. it selects the highest quality consensus clustering solution as the final partition λ_c^final. The results obtained for each data set and consensus function are presented in table 5.14: the supraconsensus function performs correctly in an average of 42.1% of the experiments conducted. That is, it is able to select the best available clustering in fewer than half of the cases, which reveals (as outlined in chapter 4) that there is still room for improving the performance of supraconsensus functions.

And secondly, we have analyzed how the consensus clustering selected by supraconsensus, λ_c^final, compares to the components of the cluster ensemble it is created upon, taking again the cluster ensemble components of maximum and median φ(NMI) (respectively referred to as best and median ensemble component, or BEC and MEC for short) as a reference. Hence, we have measured the relative percentage φ(NMI) differences between the consensus clustering selected by supraconsensus and these cluster ensemble components, so as to provide the reader with a notion of the impact of the apparent lack of accuracy of the φ(ANMI)-based supraconsensus function.

The results, averaged across all the consensus functions, are presented in table 5.15.



                Relative φ(NMI) difference with respect to
Data set             BEC          MEC
IsoLetters          -17.5         53.2
CAL500              -34.7         13.2
InternetAds         -65.2        138.7
Corel               -28.9        169.1

Table 5.15: Relative φ(NMI) percentage differences between the best and median components of the cluster ensemble and the consensus clustering λ_c^final selected by supraconsensus, for the four multimedia data collections, averaged across the seven consensus functions employed.

On average, λ_c^final is, in relative percentage terms, 36.6% worse than the BEC and 93.5% better than the MEC. As expected, the price to pay for the lack of precision of the supraconsensus function is a reduction in the quality of the final clustering solution.

5.4 Discussion

In this chapter, we have proposed and experimentally explored the use of consensus clustering strategies for partitioning multimedia data collections robustly. From our viewpoint, this application constitutes a natural extension of the computationally efficient consensus architectures presented in chapter 3 and the self-refining procedures proposed in chapter 4. As mentioned earlier, the growing ubiquity of multimedia data makes the proposals put forward in the present chapter even more appealing.

Across the experiments presented in this chapter and in appendix B.2, we have observed that partitioning multimodal data sets in a robust manner is more challenging than doing so in a unimodal scenario, as the existence of multiple modalities in the data increases the already numerous indeterminacies inherent to the clustering problem.

As a means of counteracting this, modality fusion has become a recurrent issue in the multimedia data analysis literature. Indeed, assuming that combining the distinct data modalities can be of interest is quite natural, as it is an obvious way to take advantage of the constructive dependences that can be expected to exist between them. In this sense, and focusing on the clustering problem, there exist two main approaches to modality fusion: early (also known as feature level) fusion and late (classification decision) fusion.

However, our experiments have revealed that neither of these fusion strategies is, by itself, capable of ensuring robust clustering results: in some cases, feature level fusion gives rise to the best clustering results, whereas, in other cases, it simply constitutes a trade-off between modalities. For this reason, our multimodal self-refining consensus clustering architectures constitute a generic approach for partitioning multimedia data collections with a reasonable degree of robustness, with the advantage of encompassing early and late fusion techniques simultaneously.

To our knowledge, most works dealing with multimodal clustering focus on feature level fusion, deriving novel early fusion approaches for combining modalities (see section 1.4 for a brief review). However, they often disregard the fact that, in some data sets, early fusion may not be advantageous (that is, there may exist modalities that do not contribute positively to obtaining a good partition of the data, see appendix B.2). In our opinion, this constitutes one of the main strengths of our proposal, as it allows the user to employ any modality created by feature-level fusion, besides the original modalities of the data, for obtaining the final partition of the data.

This a priori positive openness entails two negative consequences: firstly, it increases the computational complexity of the consensus process, although this inconvenience can be overcome by applying the computationally efficient hierarchical consensus architectures proposed in chapter 3. The second drawback is the inclusion of the poorest modality clustering results in the consensus clustering process, but this can be alleviated by the use of the consensus-based self-refining procedures described in chapter 4, which achieve notable results when applied on the intermodal consensus clustering solutions.

In future work, the implementation of selection-based self-refining processes (see section 4.3) on multimodal cluster ensembles will be investigated, as we expect that it may yield higher quality multimodal partitions than the ones presented in this chapter. Furthermore, as already stated in chapter 4, it will be necessary to devise novel, more precise supraconsensus functions, capable of selecting the top quality consensus clustering solution with a higher degree of accuracy in an unsupervised manner.

5.5 Related publications

None of the work regarding multimedia clustering presented in this chapter has been published yet. Nevertheless, we would like to highlight the following paper, focused on applying early fusion of modalities for jointly conducting multimodal data analysis and synthesis of facial video sequences (Sevillano et al., 2009). The details of this work, published as a book chapter, are presented next.

Authors: Xavier Sevillano, Javier Melenchón, Germán Cobo, Joan Claudi Socoró and Francesc Alías
Title: Audiovisual Analysis and Synthesis for Multimodal Human-Computer Interfaces
In: Engineering the User Interface: From Research to Practice
Publisher: Springer
Editors: Miguel Redondo, Crescencio Bravo and Manuel Ortega
Pages: 179–194
Year: 2009
ISBN: 978-1-84800-135-0
Abstract: Multimodal signal processing techniques are called to play a salient role in the implementation of natural computer-human interfaces. In particular, the development of efficient interface front ends that emulate interpersonal communication would benefit from the use of techniques capable of processing the visual and auditory modes jointly. This work introduces the application of audiovisual analysis and synthesis techniques based on Principal Component Analysis and Non-negative Matrix Factorization on facial audiovisual sequences. Furthermore, the applicability of the extracted audiovisual bases is analyzed throughout several experiments that evaluate the quality of audiovisual resynthesis using both objective and subjective criteria.


Chapter 6

Voting based consensus functions for soft cluster ensembles

As outlined in section 1.2.1, clustering algorithms can be divided into two large categories, depending on the number of clusters every object is assigned to. On the one hand, hard (or crisp) clustering algorithms assign each object to a single cluster. For this reason, the result of applying a hard clustering process on a data set containing n objects is usually represented as an n-dimensional integer-valued row vector of labels (or labeling) λ, each component of which identifies to which of the k clusters each object is assigned, that is:

λ = [λ_1 λ_2 ... λ_n]    (6.1)

where λ_i ∈ [1, k], ∀i ∈ [1, n].

On the other hand, soft (or fuzzy) clustering algorithms allow the objects to belong to all clusters to a certain extent. Thus, the results of their application for partitioning a data set containing n objects into k clusters are usually represented by means of a k×n real-valued clustering matrix Λ (see equation (6.2)), the (i,j)th entry of which indicates the degree of association between the jth object and the ith cluster.

        ( λ_11  λ_12  ...  λ_1n )
    Λ = ( λ_21  λ_22  ...  λ_2n )    (6.2)
        (  ...   ...  ...   ... )
        ( λ_k1  λ_k2  ...  λ_kn )

where λ_ij ∈ R, ∀i ∈ [1, k] and ∀j ∈ [1, n].

For illustration purposes, we resort to the toy clustering example presented in chapter 2, in which clustering is conducted on the two-dimensional artificial data set presented in figure 6.1. This toy data collection contains n = 9 objects, and the desired number of clusters k is set to 3.

If the classic k-means hard clustering algorithm is applied on this data set, the label vector presented in equation (6.3) is obtained. Recall that the labels λ_i contained in λ are purely symbolic (i.e. the labelings λ = [1 1 1 2 2 2 3 3 3] or λ = [3 3 3 2 2 2 1 1 1] represent exactly the same partition of the data).
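This label-symbol invariance can be checked programmatically; for instance, any partition-comparison measure that ignores label identities, such as the NMI used throughout this thesis, scores the two labelings above as identical. A minimal check, using scikit-learn's NMI implementation as a stand-in:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

a = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
b = np.array([3, 3, 3, 2, 2, 2, 1, 1, 1])
print(nmi(a, b))   # 1.0: both labelings describe the same partition
```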

[Figure 6.1 appears here: a two-dimensional scatterplot, both axes spanning −0.2 to 0.6, showing the nine numbered objects and the cluster centroids marked with black stars.]

Figure 6.1: Scatterplot of an artificially generated two-dimensional data set containing n = 9 objects, which are represented by coloured symbols and identified by a number. The black star symbols represent the position of the cluster centroids found by the k-means algorithm.

λ = [2 2 2 1 1 1 3 3 3]    (6.3)

As regards the results of applying a fuzzy clustering algorithm on this data collection, these clearly differ depending on the way the degree of association between objects and clusters is codified. Usually, the scalar values λ_ij contained in the clustering matrix Λ represent cluster membership probabilities (i.e. the higher the value of λ_ij, the more strongly the jth object is associated with the ith cluster). For instance, this is the way the well-known fuzzy c-means (FCM) clustering algorithm codifies its clustering results (Höppner, Klawonn, and Kruse, 1999). In fact, if this algorithm is applied on the previously described artificial data set, the clustering matrix presented in equation (6.4) is obtained.
data set, the clustering matrix presented in equation (6.4) is obtained.<br />

⎛<br />

0.054 0.026 0.057 0.969 0.976 0.959 0.009 0.016<br />

⎞<br />

0.010<br />

Λ = ⎝0.921<br />

0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017⎠<br />

(6.4)<br />

0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972<br />

It can be observed that any row permutation in Λ would yield an equivalent fuzzy partition.<br />

Moreover, notice that Λ can be transformed into a hard clustering solution by simply<br />

assigning each object to the cluster with maximum membership probability.<br />

However, the degree of association between objects and clusters can be described in<br />

terms of other parameters, such as the distance of each object to the cluster centroids (such<br />

as k-means, that despite being a hard clustering algorithm, can output this information). In<br />

fact, the object to centroid1 distance matrix obtained after applying k-means on the same<br />

toy data set as before is presented in equation (6.5).<br />

⎛<br />

0.362 0.325 0.672 0.002 0.002 0.005 0.160 0.092<br />

⎞<br />

0.125<br />

Λ = ⎝0.010<br />

0.009 0.027 0.397 0.490 0.436 0.251 0.320 0.209⎠<br />

(6.5)<br />

0.170 0.202 0.445 0.090 0.125 0.162 0.002 0.005 0.002<br />

In this case, the conversion of Λ into a crisp partition requires assigning every object<br />

to the closest (i.e. minimum distance) cluster. Thus, depending on the nature of the soft<br />

1 The cluster centroids are represented by means of a black star symbol in figure 6.1<br />

164


Chapter 6. Voting based consensus functions for soft cluster ensembles<br />

Thus, depending on the nature of the soft clustering process, the interpretation of the scalar values λ_ij contained in Λ differs (i.e. if λ_ij represent membership probabilities, the larger their value, the stronger the object-cluster association, whereas the opposite interpretation should be made in case they represent object-to-centroid distances).
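Both conversions can be written compactly; the following sketch assumes Λ is stored as a k × n NumPy array. Applied to the matrix of equation (6.4), and to that of equation (6.5) with values_are_distances=True, it recovers the hard labeling of equation (6.3).

```python
import numpy as np

def soft_to_hard(soft_matrix, values_are_distances=False):
    """Convert a k x n soft clustering matrix into a hard label vector.

    If the entries are membership probabilities, each object is assigned to the
    cluster of maximum value; if they are object-to-centroid distances, to the
    cluster of minimum value. Labels are returned in the range 1..k, as in
    equation (6.1).
    """
    soft_matrix = np.asarray(soft_matrix)
    if values_are_distances:
        return soft_matrix.argmin(axis=0) + 1
    return soft_matrix.argmax(axis=0) + 1
```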

Either way, fuzzy clustering can be regarded as a version of hard clustering with relaxed<br />

object membership restrictions. Such relaxation is particularly useful when the clusters are<br />

not well separated, or when their boundaries are ambiguous. Moreover, object to cluster<br />

association information may be of help in discovering more complex relations between a<br />

given object and the clusters (Xu and Wunsch II, 2005). Furthermore, notice that soft<br />

clustering can also be regarded as a generalization of its hard counterpart, as a crisp partition<br />

can always be derived from a fuzzy one, whereas the opposite does not hold.

However, these apparent strengths of soft clustering algorithms have barely been reflected<br />

in the development of consensus functions especially devised for combining the outcomes<br />

of multiple fuzzy clustering processes, as most proposals in this area are oriented<br />

towards their application on hard clustering scenarios. Nevertheless, as described in section<br />

2.2, there exist a few works in the consensus clustering literature devoted to the derivation<br />

of consensus functions for soft cluster ensembles, such as the VMA consensus function of<br />

(Dimitriadou, Weingessel, and Hornik, 2002) or the ITK consensus function of (Punera and<br />

Ghosh, 2007). Moreover, several other consensus functions can be indistinctly applied on<br />

both hard and soft cluster ensembles, such as PLA (Lange and Buhmann, 2005) or HGBF

(Fern and Brodley, 2004), while others, originally devised for hard cluster ensembles, can be<br />

adapted for their use as soft partition combiners with relative ease (e.g. (Strehl and Ghosh,<br />

2002)).<br />

In this chapter, we make several proposals regarding the application of consensus processes<br />

on soft cluster ensembles. For starters, the notion of soft cluster ensembles is reviewed<br />

in section 6.1. Next, in section 6.2, we describe a procedure for adapting the hard consensus<br />

functions employed in this work to soft cluster ensembles. In section 6.3, a family of<br />

novel soft consensus functions based on the application of cluster disambiguation and voting<br />

strategies is proposed. Finally, the results of several experiments regarding the performance<br />

of the proposed soft consensus functions are presented in section 6.4, and the discussion of section 6.5 concludes the chapter.

6.1 Soft cluster ensembles<br />

As described in chapter 2, cluster ensembles are nothing but the compilation of the outputs<br />

of multiple (namely l) clustering processes. Focusing on a fuzzy clustering scenario, and<br />

making the simplifying assumption that the l clustering processes partition the data into k

clusters, a soft cluster ensemble E is represented by means of a kl × n real-valued matrix:<br />

E = \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \\ \vdots \\ \Lambda_l \end{pmatrix}    (6.6)

where Λi is the k × n real-valued clustering matrix resulting from the ith soft clustering<br />

process (∀i ∈ [1,l]).<br />
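As an illustration of this representation, the following Python sketch builds a kl × n soft cluster ensemble by stacking l soft clusterings of a toy data set. Since no fuzzy c-means implementation is assumed here, each soft clustering is emulated by turning k-means object-to-centroid distances into membership-like scores with a softmax; this conversion is an assumption of the sketch, not a procedure prescribed in this work.

    import numpy as np
    from sklearn.cluster import KMeans

    def soft_clustering(data, k, seed):
        """Stand-in for a fuzzy clusterer: run k-means and convert the object-to-centroid
        distances into a k x n matrix of membership-like scores (softmax over -distance)."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        d = km.transform(data).T               # k x n object-to-centroid distances
        scores = np.exp(-d)
        return scores / scores.sum(axis=0)     # each column sums to one

    rng = np.random.default_rng(0)
    data = rng.normal(size=(9, 2))             # toy data set, n = 9 objects
    l, k = 4, 3
    E = np.vstack([soft_clustering(data, k, seed) for seed in range(l)])  # equation (6.6)
    print(E.shape)                              # -> (12, 9), i.e. (k*l, n)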


Recall that the contents of each clustering matrix Λi enclosed in the soft cluster ensemble E result from the execution of a fuzzy clustering process, and that, depending on its nature, the interpretation of the scalar values that ultimately make up E may differ considerably. Thus, for conducting a consensus process on the soft cluster ensemble E, it is necessary that such values hold the same type of proportionality with respect to the degree of association between objects and clusters (i.e. they are all either directly or inversely proportional).

This prerequisite becomes more evident if an analogy between soft clustering and voting<br />

procedures is established. Such analogy is inspired by the parallelism between supervised<br />

classification processes and voting drawn in (van Erp, Vuurpijl, and Schomaker, 2002).<br />

According to this analogy, the process of fuzzily clustering an object can be regarded as an<br />

election, in which the clusterer (regarded as a voter) casts its preference for each one of the<br />

clusters (or candidates). Put that way, it becomes quite obvious that, when the results of<br />

several fuzzy clustering processes are gathered into a soft cluster ensemble with the purpose<br />

of building a consolidated clustering solution upon it, they should be directly comparable (possibly after applying some scale normalization).

Regardless of the characteristics and nature of soft cluster ensembles, it is interesting<br />

to evaluate how classic consensus functions (i.e. those originally designed to combine crisp<br />

partitions) can be applied on the fuzzy consensus clustering problem. The next section<br />

deals with this very issue.<br />

6.2 Adapting consensus functions to soft cluster ensembles<br />

The consensus functions employed so far in the experimental sections of this work (i.e.<br />

CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD) are originally designed to<br />

operate on hard cluster ensembles (see appendix A.5 for a description). Nevertheless, they<br />

can be easily adapted for combining fuzzy partitions. The key point is that all these<br />

consensus functions base their clustering combination processes on object co-association<br />

matrices (i.e. matrices the contents of which estimate the degree of similarity between<br />

objects upon the partitions contained in the cluster ensemble). Fortunately, this type of<br />

co-association matrices can be derived not only from hard cluster ensembles, but also from<br />

their soft counterparts, which makes these consensus functions easily applicable on the<br />

fuzzy consensus clustering problem. In the following paragraphs, we elaborate on this issue resorting again to the previous toy example, continuously drawing parallelisms between the hard and soft clustering scenarios. For brevity, the comparison is restricted to fuzzy clustering processes that codify object to cluster associations in terms of membership probabilities, although an equivalent study could also be formulated in case these were expressed by means of magnitudes inversely proportional to the strength of object to cluster associations, such as object to cluster centroid distances.

Consider the hard clustering solution of equation (6.3) corresponding to our clustering<br />

toy example, that is:<br />

λ = [2 2 2 1 1 1 3 3 3]

Notice that an equivalent representation of this partition can also be given by a k × n

incidence matrix Iλ (called binary membership indicator matrix in (Strehl and Ghosh,<br />

2002)), the (i,j)th entry of which is equal to 1 in the case the jth object is assigned to<br />


the ith cluster according to λ, and 0 otherwise —see equation (6.7), which presents the<br />

incidence matrix corresponding to the label vector λ of equation (6.3).

I_\lambda = \begin{pmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}    (6.7)

Notice that the information contained in Iλ is somehow comparable to the contents of<br />

a soft clustering matrix Λ, in the sense that they both express the degree of association<br />

between objects and clusters. For illustration purposes, equation (6.8) presents the fuzzy<br />

clustering matrix Λ output by the FCM clustering algorithm on the artificial data set of<br />

figure 6.1. In fact, rounding each element of this clustering matrix Λ to the nearest integer<br />

would indeed yield the incidence matrix Iλ of equation (6.7).<br />

\Lambda = \begin{pmatrix}
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010 \\
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972
\end{pmatrix}    (6.8)

The construction of object co-association matrices given the incidence matrix Iλ built<br />

upon a hard clustering solution λ is pretty straightforward, and it only requires computing<br />

some matrix products.<br />

In particular, the object co-association matrix Oλ is computed as the product between<br />

the transpose of Iλ and Iλ. The object co-association matrix corresponding to the hard<br />

clustering solution obtained on our toy clustering example is presented in equation (6.9).<br />

In fact, Oλ is an n × n adjacency matrix, the (i,j)th entry of which equals 1 if the ith

and the jth objects are placed in the same cluster, or 0 otherwise (Kuncheva, Hadjitodorov,<br />

and Todorova, 2006). The name object co-association matrix stems from the fact that the<br />

contents of Oλ indicate, from a clustering viewpoint, the degree of similarity between the<br />

n objects in the data set.<br />


O_\lambda = I_\lambda^T I_\lambda = \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}    (6.9)
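A minimal NumPy sketch of these two steps follows (illustrative only; the helper name incidence is not from the original text). It builds the incidence matrix of a label vector and the object co-association matrix of equations (6.7) and (6.9); the final comment indicates the fuzzy counterpart of equation (6.10).

    import numpy as np

    def incidence(labels, k):
        """k x n binary membership indicator matrix I_lambda of a crisp labeling (labels 1..k)."""
        labels = np.asarray(labels)
        return (np.arange(1, k + 1)[:, None] == labels[None, :]).astype(int)

    labels = [2, 2, 2, 1, 1, 1, 3, 3, 3]        # toy labeling of equation (6.3)
    I = incidence(labels, k=3)                   # equation (6.7)
    O = I.T @ I                                  # object co-association matrix, equation (6.9)
    print(O[:3, :3])                             # block of ones: objects 1-3 share a cluster

    # For a k x n soft clustering matrix Lam, the fuzzy analogue of equation (6.10) is
    #   O_fuzzy = Lam.T @ Lam, with its diagonal subsequently set to one.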

In a fuzzy clustering scenario, the object co-association matrix can easily be derived by<br />

simply multiplying the transpose clustering matrix Λ by itself. Resorting again to our toy<br />

clustering example, the resulting object co-association matrix (denoted as OΛ) is presented

in equation (6.10).<br />

It is easy to see that OΛ is indeed a fuzzy adjacency matrix, as the statistical independence of the probabilities of assigning objects i and j to the same cluster makes it possible to interpret its (i,j)th entry as the joint probability that objects i and j are placed in the same cluster by the clusterer. However, statistical independence does not hold for the elements of the diagonal of OΛ, as each object is always “co-clustered” with itself, which requires setting the elements on the diagonal of OΛ equal to 1.



O_\Lambda = \Lambda^T \Lambda = \begin{pmatrix}
0.852 & 0.861 & 0.838 & 0.076 & 0.070 & 0.080 & 0.038 & 0.075 & 0.040 \\
0.861 & 0.871 & 0.847 & 0.049 & 0.043 & 0.053 & 0.054 & 0.091 & 0.057 \\
0.838 & 0.847 & 0.824 & 0.078 & 0.073 & 0.082 & 0.050 & 0.086 & 0.053 \\
0.076 & 0.049 & 0.078 & 0.940 & 0.946 & 0.930 & 0.015 & 0.022 & 0.016 \\
0.070 & 0.043 & 0.073 & 0.946 & 0.953 & 0.937 & 0.014 & 0.021 & 0.015 \\
0.080 & 0.053 & 0.082 & 0.930 & 0.937 & 0.921 & 0.020 & 0.027 & 0.021 \\
0.038 & 0.054 & 0.050 & 0.015 & 0.014 & 0.020 & 0.953 & 0.908 & 0.949 \\
0.075 & 0.091 & 0.086 & 0.022 & 0.021 & 0.027 & 0.908 & 0.866 & 0.904 \\
0.040 & 0.057 & 0.053 & 0.016 & 0.015 & 0.021 & 0.949 & 0.904 & 0.945
\end{pmatrix}    (6.10)

Quite obviously, consensus functions based on object co-association matrices do not<br />

operate on matrices Oλ or OΛ, as they are derived upon a single clustering solution. However,<br />

the computation of object co-association matrices can easily be extended to a set of<br />

clustering solutions compiled in either a hard or a soft cluster ensemble. We start with the<br />

derivation of the object co-association matrix upon a hard cluster ensemble E containing l<br />

clusterings, represented by means of an l × n integer valued matrix:

E = \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_l \end{pmatrix}    (6.11)

In this case, the incidence matrix of the hard cluster ensemble, denoted as IEλ, is a

kl × n matrix resulting from the concatenation of the incidence matrices of the l clusterings<br />

that make up the ensemble (see equation (6.12)).<br />

I_{E_\lambda} = \begin{pmatrix} I_{\lambda_1} \\ I_{\lambda_2} \\ \vdots \\ I_{\lambda_l} \end{pmatrix}    (6.12)

As in the case of a single clustering, the object co-association matrix of the hard cluster<br />

ensemble, denoted as OEλ , can be derived upon IEλ by simple matrix multiplication, as<br />

presented in equation (6.13).<br />

O_{E_\lambda} = I_{E_\lambda}^T I_{E_\lambda}    (6.13)



Drawing a parallelism with respect to the interpretation of the object co-association<br />

matrix Oλ built upon a single clustering, it can be stated that the (i,j)th entry of OEλ<br />

indicates the proportion of clusterers that put the ith and jth objects in the same cluster.<br />

Porting this same idea to the soft clustering scenario, the soft version of the object co-association matrix OEλ (that we denote as OEΛ) is computed following an analogous procedure to the one just reported, summarized in equation (6.14).

O_{E_\Lambda} = E^T E = \begin{pmatrix} \Lambda_1^T & \Lambda_2^T & \cdots & \Lambda_l^T \end{pmatrix} \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \\ \vdots \\ \Lambda_l \end{pmatrix}    (6.14)

where E is the kl × n matrix representing the soft cluster ensemble made up by the compilation<br />

of the l fuzzy clustering matrices Λi, ∀i ∈ [1,l]. Just like in the case of the fuzzy<br />

adjacency matrix OΛ derived upon a single clustering (see equation (6.10)), it is necessary<br />

to set the elements of the diagonal of OEΛ to unity, as each object is always “co-clustered”<br />

with itself.<br />
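The ensemble-level computation can be sketched in the same way. The helpers below (illustrative names, NumPy only, not the implementation used in this work) build the object co-association matrix of a hard cluster ensemble from stacked incidence matrices, and the fuzzy one of equation (6.14) from the kl × n ensemble matrix E; the hard version is divided by l so that each entry becomes the proportion of clusterers that co-cluster a pair of objects, as described above.

    import numpy as np

    def incidence(labels, k):
        labels = np.asarray(labels)
        return (np.arange(1, k + 1)[:, None] == labels[None, :]).astype(int)

    def hard_coassociation(label_vectors, k):
        """O_E of a hard cluster ensemble: stack the incidence matrices (equation (6.12)),
        multiply as in equation (6.13) and normalise by the number of clusterings l."""
        I = np.vstack([incidence(lv, k) for lv in label_vectors])
        return (I.T @ I) / len(label_vectors)

    def soft_coassociation(E):
        """Fuzzy O_E of equation (6.14); the diagonal is forced to one because every
        object is always co-clustered with itself."""
        O = E.T @ E
        np.fill_diagonal(O, 1.0)
        return O

    # Example with two crisp labelings of the toy data set:
    O_hard = hard_coassociation([[2, 2, 2, 1, 1, 1, 3, 3, 3],
                                 [1, 1, 3, 3, 3, 3, 3, 2, 2]], k=3)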

As aforementioned, all the hard consensus functions used so far in this work employ<br />

object co-association matrices, and most of them explicitly base their consensus processes<br />

on them. This is the case of the CSPA, EAC, ALSAD, KMSAD and SLSAD consensus<br />

functions, which differ among themselves in the way the object co-association matrix OEλ is interpreted. On the one hand, some of them construe OEλ

as an object similarity matrix,<br />

obtaining the consensus partition by applying some similarity-based clustering algorithm,<br />

such as graph partitioning (in CSPA (Strehl and Ghosh, 2002)) or hierarchical clustering<br />

(EAC (Fred and Jain, 2005)).<br />

On the other hand, the so-called similarity-as-data consensus functions interpret the ith row of the object co-association matrix OEλ as a set of n new features representing the ith object, thus applying some standard clustering algorithm on it for obtaining the consensus clustering solution. The application of the single-link and average-link hierarchical clustering algorithms gives rise to the SLSAD and ALSAD consensus functions, whereas the KMSAD consensus function consists in conducting this partitioning by means of k-means (Kuncheva, Hadjitodorov, and Todorova, 2006).
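As a rough illustration of the similarity-as-data idea, the sketch below clusters the rows of an object co-association matrix with k-means, in the spirit of KMSAD; it is a simplified stand-in and not the implementation evaluated in this work.

    from sklearn.cluster import KMeans

    def kmsad_like(O, k, seed=0):
        """Treat the ith row of the object co-association matrix O as a new feature vector
        for the ith object and run k-means on those rows (labels returned in 1..k)."""
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(O) + 1

    # O can be any of the co-association matrices computed in the previous sketch.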

For its part, the HGPA consensus function considers the cluster ensemble incidence

matrix IEλ to be the adjacency matrix of a hypergraph with n vertices and kl hyperedges.<br />

The consensus clustering process is regarded as the partitioning of such hypergraph by<br />

cutting the minimum number of hyperedges, a procedure that takes the object co-association<br />

matrix OEλ as its input (Strehl and Ghosh, 2002).<br />

Finally, the MCLA consensus function tackles the consensus clustering problem as a process of clustering clusters, which also implies interpreting the cluster ensemble incidence matrix IEλ as a hypergraph adjacency matrix. Again, the algorithmic analysis² of this consensus function reveals that such procedure requires the object co-association matrix OEλ as one of its parameters (Strehl and Ghosh, 2002).

Given that all seven consensus functions employ the object co-association matrix OEλ as their main input parameter when operating on hard cluster ensembles, devising

² The Matlab source code of the CSPA, HGPA and MCLA consensus functions is available for download at http://www.strehl.com.


their version for soft cluster ensembles becomes pretty straightforward, as it simply requires<br />

substituting the object co-association matrix derived from a hard cluster ensemble OEλ by<br />

its fuzzy counterpart, OEΛ. Notice that, despite taking object co-association matrices

derived upon a soft cluster ensemble as their input, all these consensus functions output a<br />

hard consensus clustering solution λc.<br />

As mentioned in the introduction of this section, we have established an analogy between hard and soft consensus functions based on the similarities between object co-association matrices, assuming that the results of fuzzy clustering are expressed in terms of membership probabilities. However, we consider that the present study is quite generic, as it could be extended to the case in which fuzzy clustering results are expressed in any other form, by transforming them into membership probabilities prior to computing the corresponding object co-association matrices.

6.3 Voting based consensus functions<br />

In this section, we put forward a set of proposals in the shape of a family of novel consensus<br />

functions especially devised for their application on soft cluster ensembles. These consensus<br />

functions are inspired by voting strategies, which have also been a source of inspiration for

the development of systems for combining supervised classifiers (van Erp, Vuurpijl, and<br />

Schomaker, 2002), search engines (Aslam and Montague, 2001), or word sense disambiguation<br />

systems (Buscaldi and Rosso, 2007). A distinguishing factor of the consensus functions<br />

we propose in this section is that they yield fuzzy consensus clustering solutions, whereas<br />

other soft consensus functions found in the literature output crisp consensus clusterings, despite being applied on soft cluster ensembles (Punera and Ghosh, 2007).

In fact, voting is a formal way of combining the opinions of several voters into a single<br />

decision (e.g. the election of a president). Therefore, it seems quite logical that voting<br />

strategies can be readily applied for combining the outcomes of multiple decision systems,<br />

using the voting strategy that best fits the way these decisions are expressed.<br />

Given that a clusterer is an unsupervised classifier, the most natural parallelism we can<br />

establish is related to voting based supervised classifier combination (aka classifier ensembles).<br />

In this case, each classifier is a voter, the possible categories objects can be filed under<br />

are the candidates, and an election is the classification of an object (van Erp, Vuurpijl, and<br />

Schomaker, 2002). Quite obviously, the voting strategy employed for combining the votes<br />

–and thus obtain the winner of the election (i.e. the resulting classification of the object by<br />

the ensemble of classifiers)– depends on how votes are expressed, be it an assignment to a<br />

single class (i.e. single-label classification (Sebastiani, 2002)), or an either ranked or scored<br />

list of classes. The former case calls for the application of unweighted voting methods such as plurality or majority voting, whereas the latter makes it possible to apply more sophisticated

voting strategies such as confidence and ranking voting methods (van Erp, Vuurpijl,<br />

and Schomaker, 2002).<br />

Nevertheless, prior to the application of any voting strategy on soft cluster ensembles,<br />

there is a crucial problem to be solved. This has to do with the inherent unsupervised<br />

nature of clustering processes, which causes clusters to be ambiguously identified. Therefore,

it is necessary to perform a cluster alignment (or disambiguation) process before voting.<br />

Notice that this is not an issue of concern when applying voting strategies on supervised<br />


classifier ensembles, as categories (i.e. candidates) are univocally defined in that case. We<br />

elaborate on the cluster disambiguation technique employed in this work in section 6.3.1. It<br />

is important to highlight that consensus functions based on object co-association matrices<br />

circumvent this inconvenience (Lange and Buhmann, 2005), although their main drawback

is that the complexity of the object co-association matrix computation is quadratic with<br />

the number of objects in the data set (Long, Zhang, and Yu, 2005).<br />

The problem of combining the outcomes of multiple soft clustering processes by means<br />

of voting strategies implies interpreting the contents of the soft cluster ensemble as the<br />

preference of each voter (clusterer) for each candidate (cluster), as soft clustering algorithms<br />

output the degree of association of each object to all the clusters. For this reason, voting<br />

methods capable of dealing with voters’ preferences (in particular, confidence and ranking<br />

voting strategies) are the basis of our consensus functions, as they lend themselves naturally<br />

to be applied in this context. However, care must be taken as regards how these preferences<br />

are expressed, that is, whether they are directly or inversely proportional to the strength of the

association between objects and clusters (e.g. membership probabilities or distances to<br />

centroids, respectively). In section 6.3.2, we describe four voting strategies that give rise to<br />

the proposed consensus functions.<br />

6.3.1 Cluster disambiguation<br />

In this section, we elaborate on the problem of cluster disambiguation, also known as the<br />

cluster correspondence problem.<br />

As pointed out earlier, a single hard clustering solution can be expressed by multiple<br />

equivalent labeling vectors λ, due to the symbolic nature of the labels clusters are identified<br />

with. This also occurs in soft clustering, as the permutation of the rows of a clustering<br />

matrix Λ also gives rise to equivalent fuzzy partitions. Quite obviously, this cluster identification<br />

ambiguity also arises between the multiple clustering solutions compiled in a cluster

ensemble E, and thus, it becomes an issue of concern when it comes to applying voting<br />

strategies for conducting consensus clustering, given the equivalence between clusters and<br />

candidates defined by the previously described analogy with voting procedures. For this<br />

reason, our voting-based consensus functions for soft cluster ensembles make use of a cluster<br />

disambiguation technique prior to proper voting.<br />

In particular, we require from such a method the ability to solve the cluster re-labeling

problem —an instance of the cluster correspondence problem in which a one to one correspondence<br />

between clusters is considered (recall that, in this work, all the clusterings in the<br />

ensemble and the consensus clustering are assumed to have the same number of clusters,<br />

namely k).<br />

To solve the cluster re-labeling problem we make use of the Hungarian method (also<br />

known as Kuhn-Munkres algorithm or Munkres assignment algorithm) (Kuhn, 1955), a<br />

technique that makes it possible to obtain the most consistent alignment among the different clusterings

(Ayad and Kamel, 2008).<br />

Given a pair of clustering solutions with k clusters each, the Hungarian method is capable<br />

of finding, among the k! possible cluster permutations, the one that maximizes the overlap<br />

between them. In particular, such cluster permutation is applied on one of the two clustering<br />

solutions, while the other is taken as a reference. Put in probabilistic terms, the Hungarian<br />

algorithm selects the cluster permutation that best fits the empirical cluster assignment<br />


probabilities estimated from the reference clustering solution (i.e. it finds the optimal<br />

cluster permutation that yields the largest probability mass over all cluster assignment<br />

probabilities (Fischer and Buhmann, 2003)). Depending on whether the aforementioned<br />

clustering solutions correspond to hard or fuzzy partitions, cluster permutations amount to<br />

label reassignments or to row order rearrangements, respectively.<br />

The Hungarian algorithm poses the cluster correspondence problem as a weighted bipartite<br />

matching problem, solving it in O(k³) time. A beautiful analysis of its error probability

can be found in (Topchy et al., 2004). In this work, we have employed the implementation<br />

of (Buehren, 2008), which bases the cluster disambiguation process upon a measure of

the dissimilarity between the clusters of the two clustering solutions under consideration.<br />

Cluster dissimilarity is usually embodied in a k × k matrix, the (i,j)th entry of which is<br />

proportional to the degree of dissimilarity between the ith cluster of one of the clustering<br />

solutions and the jth cluster of the other one.<br />

Cluster dissimilarity can easily be derived upon the considered pair of clustering solutions,<br />

regardless of whether they are hard or fuzzy partitions, as we show next. In the crisp<br />

case, a cluster similarity matrix S λ1 ,λ 2 can be obtained by simple matrix products between<br />

the incidence matrices of both clusterings, denoted as λ1 and λ2 —see equation (6.15).<br />

S_{\lambda_1,\lambda_2} = I_{\lambda_1} I_{\lambda_2}^T    (6.15)

For illustration purposes, consider the two crisp clustering solutions of equation (6.16):<br />

λ1 = [2 2 2 1 1 1 3 3 3]
λ2 = [1 1 3 3 3 3 3 2 2]    (6.16)

The incidence matrices corresponding to λ1 and λ2 are presented in equation (6.17).<br />

I_{\lambda_1} = \begin{pmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\qquad
I_{\lambda_2} = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0
\end{pmatrix}    (6.17)

The cluster similarity matrix derived upon these two clustering solutions is the one<br />

presented in equation (6.18).<br />


S_{\lambda_1,\lambda_2} = I_{\lambda_1} I_{\lambda_2}^T = \begin{pmatrix}
0 & 0 & 3 \\
2 & 0 & 1 \\
0 & 2 & 1
\end{pmatrix}    (6.18)

Firstly, notice that Sλ1,λ2 is not a symmetric matrix (as object co-association matrices are), due to the fact that its rows and columns correspond to different entities. In fact, the (i,j)th element of Sλ1,λ2 is equal to the number of objects that are assigned to the ith cluster of λ1 and to the jth cluster of λ2, thus clearly indicating the degree of resemblance of these clusters. Deriving a cluster dissimilarity matrix from Sλ1,λ2 is pretty straightforward, given that the implementation of the Hungarian method employed in this work does not require that the cluster dissimilarity measures verify any special property as regards their scale. The result of the cluster disambiguation method applied on this toy example is the cluster correspondence vector πλ1,λ2 presented in equation (6.19).

πλ1,λ2 = [3 1 2]    (6.19)

The interpretation of the cluster correspondence vector πλ1,λ2 is that the cluster identified with the ‘1’ label in λ1 corresponds to the cluster with label ‘3’ of λ2, the cluster labeled as ‘2’ in λ1 must be aligned with the cluster with label ‘1’ of λ2, and the cluster ‘3’ of λ1 is equivalent to the cluster with label ‘2’ of λ2.

The most usual way of representing the information contained in the cluster correspondence vector πλ1,λ2 is by means of a cluster permutation matrix Pλ1,λ2. In general, Pλ1,λ2 is a k × k matrix whose entries are all zero except that the πλ1,λ2(i)-th entry of the ith row is equal to 1. The cluster permutation matrix corresponding to our toy example is presented in equation (6.20). Notice how all of its entries are zero except for the third entry of the first row (as πλ1,λ2(1) = 3), the first entry of the second row (as πλ1,λ2(2) = 1) and the second entry of the third row (as πλ1,λ2(3) = 2).

P_{\lambda_1,\lambda_2} = \begin{pmatrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0
\end{pmatrix}    (6.20)

In order to obtain the cluster permuted version of the clustering λ1, it is necessary to multiply the transpose of the cluster permutation matrix Pλ1,λ2 by the incidence matrix associated to this clustering, Iλ1, which yields the cluster permuted incidence matrix Iλ1^{πλ1,λ2}, as indicated in equation (6.21).

I_{\lambda_1}^{\pi_{\lambda_1,\lambda_2}} = P_{\lambda_1,\lambda_2}^T I_{\lambda_1}    (6.21)

In our example, the cluster permuted incidence matrix is:



I_{\lambda_1}^{\pi_{\lambda_1,\lambda_2}} = \begin{pmatrix}
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
= \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0
\end{pmatrix}    (6.22)

Therefore, assigning each object to the cluster it is most strongly associated to transforms the cluster permuted incidence matrix Iλ1^{πλ1,λ2} into the disambiguated crisp clustering λ1^{πλ1,λ2}. In equations (6.23) and (6.24), this clustering is presented alongside λ2, the clustering that has been taken as the reference of the cluster disambiguation process.

λ1^{πλ1,λ2} = [1 1 1 3 3 3 2 2 2]    (6.23)
λ2 = [1 1 3 3 3 3 3 2 2]    (6.24)

In the context of our voting-based soft consensus functions, though, cluster disambiguation<br />

is conducted on pairs of soft clustering solutions. In order to illustrate how to proceed<br />

in this case, we use a toy example that is the fuzzy version of the one just reported. For<br />

brevity, we will only consider the case in which object-to-cluster associations are expressed<br />

in terms of membership probabilities, although an analogous procedure could be devised in

the case these were expressed by means of other metrics. Therefore, given the two fuzzy<br />

partitions Λ1 and Λ2 of equation (6.25):<br />

\Lambda_1 = \begin{pmatrix}
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010 \\
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972
\end{pmatrix}
\Lambda_2 = \begin{pmatrix}
0.932 & 0.921 & 0.019 & 0.030 & 0.014 & 0.025 & 0.057 & 0.017 & 0.055 \\
0.042 & 0.025 & 0.005 & 0.011 & 0.009 & 0.006 & 0.038 & 0.972 & 0.929 \\
0.026 & 0.054 & 0.976 & 0.959 & 0.976 & 0.969 & 0.905 & 0.010 & 0.016
\end{pmatrix}    (6.25)

The cluster similarity matrix SΛ1,Λ2 is computed upon the soft clustering matrices themselves,

as described by equation (6.26).<br />

S_{\Lambda_1,\Lambda_2} = \Lambda_1 \Lambda_2^T = \begin{pmatrix}
0.143 & 0.054 & 2.878 \\
1.738 & 0.136 & 1.042 \\
0.188 & 1.845 & 0.969
\end{pmatrix}    (6.26)



The interpretation of the contents of SΛ1,Λ2 is the same as in the crisp scenario, i.e. its (i,j)th element is proportional to the similarity between the ith cluster of Λ1 and the jth cluster of Λ2. Again, transforming SΛ1,Λ2 into a cluster dissimilarity matrix is the final step before solving the weighted bipartite matching problem using the Hungarian method implementation of (Buehren, 2008), thus obtaining the cluster correspondence vector πΛ1,Λ2 of equation (6.27) (notice that this is exactly the same permutation vector of equation (6.19), as the present toy example is the fuzzy version of the former).

πΛ1,Λ2 = [3 1 2]    (6.27)

Although the interpretation of the cluster correspondence vector is equivalent in both<br />

the hard and the soft clustering scenarios (i.e. the cluster that is given the number ‘1’<br />

identifier in Λ1 corresponds to the cluster number ‘3’ of Λ2, and so on), recall that, in the<br />

fuzzy case, cluster permutations are equivalent to row order rearrangements.<br />

Consequently, in order to obtain the cluster permuted version of the fuzzy partition Λ1, it is necessary to multiply the transpose of the cluster permutation matrix PΛ1,Λ2 associated to the cluster correspondence vector πΛ1,Λ2 by the fuzzy partition Λ1 itself. As a result, the cluster permuted soft clustering Λ1^{πΛ1,Λ2} is obtained (see equation (6.28) for the pair of cluster aligned fuzzy clustering solutions of our toy example).

\Lambda_1^{\pi_{\Lambda_1,\Lambda_2}} = \begin{pmatrix}
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972 \\
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010
\end{pmatrix}
\Lambda_2 = \begin{pmatrix}
0.932 & 0.921 & 0.019 & 0.030 & 0.014 & 0.025 & 0.057 & 0.017 & 0.055 \\
0.042 & 0.025 & 0.005 & 0.011 & 0.009 & 0.006 & 0.038 & 0.972 & 0.929 \\
0.026 & 0.054 & 0.976 & 0.959 & 0.976 & 0.969 & 0.905 & 0.010 & 0.016
\end{pmatrix}    (6.28)

Given a cluster ensemble E containing a set of l soft clustering solutions, the cluster disambiguation process consists in taking one of them as a reference and applying the Hungarian method sequentially to the remaining l − 1 clustering solutions (Topchy et al., 2004). As a result, a cluster aligned version of the cluster ensemble is obtained, and voting procedures can be readily applied on it.
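A compact Python sketch of this alignment step is given below. It uses SciPy's linear_sum_assignment as the Hungarian solver (the original work relies on the Matlab implementation of (Buehren, 2008)); the cluster similarity matrix of each pair plays the role of equation (6.26), and the solver is fed its negation so that maximising similarity becomes a minimum-cost assignment. Function names are illustrative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def align(reference, clustering):
        """Permute the rows of `clustering` (k x n) so that its clusters best match those of
        `reference`. Returns the permuted matrix and the permutation found."""
        S = reference @ clustering.T              # cluster similarity, as in equation (6.26)
        _, perm = linear_sum_assignment(-S)       # Hungarian method on a dissimilarity matrix
        return clustering[perm, :], perm

    def align_ensemble(clusterings):
        """Sequentially align every soft clustering against the first one and stack them
        into the kl x n cluster-aligned ensemble."""
        reference = clusterings[0]
        aligned = [reference] + [align(reference, c)[0] for c in clusterings[1:]]
        return np.vstack(aligned)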

6.3.2 Voting strategies<br />

Once the correspondence between the k clusters of each one of the l soft clustering solutions<br />

compiled in the cluster ensemble E has been resolved and the corresponding cluster<br />

permutations have been applied, it is time to derive the consensus clustering solution upon<br />

E, a task we tackle by means of voting procedures. In this section, we describe four voting<br />

methods, which give rise to as many consensus functions.<br />

Before proceeding to their description, recall that the scalar elements of a soft cluster<br />

ensemble E are considered, from a voting standpoint, as the expression of the degree of<br />

preference of each voter (i.e. clusterer) for each candidate (cluster) in the present election<br />

(clusterization of an object). The result of the election (i.e. the consolidated clusterization<br />

of the object under consideration based upon the decisions of the l clusterers comprised<br />


in the ensemble) will depend on the voting procedure applied, which, at the same time, is<br />

conditioned by the way the voters’ preferences are expressed.<br />

Given that in fuzzy cluster ensembles voters express their preference for each and every<br />

one of the k candidates, the soft consensus functions proposed in this chapter make use of<br />

confidence and positional voting methods (van Erp, Vuurpijl, and Schomaker, 2002), which<br />

are applicable in voting scenarios in which voters grade candidates according to their degree<br />

of confidence. The former makes direct use of the specific values of the preference scores<br />

the voters emit –thus, they are sensitive to their scaling–, whereas the latter are based on<br />

ranking the candidates according to the degree of confidence expressed by the voters.<br />

As mentioned earlier, the way fuzzy clusterers express their preference for the clusters<br />

can be either directly or inversely proportional to the strength of association between objects<br />

and clusters (e.g. membership probabilities or distances to centroids, respectively). In fact,<br />

it is possible that both types of clusterings are intermingled in E, and, for this reason, the<br />

voting method must somehow be informed of this fact, so that appropriate scale or ranking<br />

transformations are applied —depending on whether a confidence or a positional voting<br />

strategy is employed.<br />

In the following sections, we present four consensus functions for soft cluster ensembles,<br />

each of which is based on a specific voting mechanism. Besides their generic description,<br />

we illustrate them by means of a toy example using the soft cluster ensemble E containing<br />

the l = 2 cluster aligned fuzzy clustering solutions presented in equation (6.28).<br />

E = \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \end{pmatrix} = \begin{pmatrix}
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972 \\
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010 \\
0.932 & 0.921 & 0.019 & 0.030 & 0.014 & 0.025 & 0.057 & 0.017 & 0.055 \\
0.042 & 0.025 & 0.005 & 0.011 & 0.009 & 0.006 & 0.038 & 0.972 & 0.929 \\
0.026 & 0.054 & 0.976 & 0.959 & 0.976 & 0.969 & 0.905 & 0.010 & 0.016
\end{pmatrix}    (6.29)

Notice that, in this toy example, the l = 2 clustering solutions (voters) compiled in<br />

the ensemble (Λ1 and Λ2) express object to cluster associations (their preferences for candidates)<br />

by means of membership probabilities, which makes the scalar elements of both<br />

clusterings directly comparable, thus avoiding the need for applying any scale transformations.<br />

However, this need not be the general case, and we will address how to deal with<br />

cluster ensembles containing unequal clusterings as the proposed consensus functions are<br />

presented throughout the following paragraphs.<br />

Confidence voting<br />

Consensus functions based on confidence voting methods derive the consolidated clustering<br />

solution upon the values of the confidence scores each clusterer assigns to each cluster. For<br />

this reason, a prerequisite for using these voting methods is that these confidence values are<br />

comparable in magnitude (van Erp, Vuurpijl, and Schomaker, 2002). Assuming this is true,<br />

we propose the application of the sum and product confidence voting rules, which gives rise<br />

to the following two consensus functions:<br />

– SumConsensus (SC): this consensus function is based on the application of the<br />

confidence voting sum rule, which simply consists of adding the confidence values<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Sum voting matrix ΣE
Data: k clusters, n objects

Hungarian(E)
ΣE = 0_{k×n}
for i = 1 . . . l do
    if Λi not membership probabilities then
        Probabilize(Λi)
    end
    ΣE = ΣE + Λi
end

Algorithm 6.1: Symbolic description of the soft consensus function SumConsensus. Probabilize and Hungarian are symbolic representations of the conversion of fuzzy clusterings to membership probability matrices and the cluster disambiguation procedures, respectively, while 0_{k×n} represents a k × n zero matrix.

that all the voters cast for each candidate. As a result, a k × n sum matrix ΣE is<br />

obtained, the (i,j)th entry of which equals the sum of the preference scores of assigning<br />

the jth object to the ith cluster across the l cluster ensemble components, as presented<br />

in equation (6.30).<br />

\Sigma_E = \sum_{i=1}^{l} \Lambda_i    (6.30)

where Λi refers to the ith clustering contained in the soft cluster ensemble E, once<br />

the cluster disambiguation process has been conducted.<br />

A schematic and generic description of the SumConsensus consensus function is presented<br />

in algorithm 6.1. As it can be observed, we propose transforming all the fuzzy<br />

clusterings compiled in the cluster ensemble E into membership probability matrices<br />

–which is symbolically represented by the procedure called Probabilize–, thus making<br />

the sum voting method directly applicable on them once the cluster alignment<br />

problem is solved by means of the Hungarian procedure. According to algorithm<br />

6.1, SumConsensus outputs the sum voting matrix ΣE, which can be interpreted<br />

readily as a fuzzy consensus clustering. However, it can easily be converted into a classic membership probability based fuzzy consensus clustering Λc, or a crisp consensus clustering λc, as we show in the following paragraphs (as will be seen, such postprocessing could also be included as the final step of SumConsensus).

The application of SumConsensus on the toy cluster ensemble of equation (6.29) gives<br />

rise to the sum matrix ΣE presented in equation (6.31). Notice that the execution of<br />

the Probabilize and the Hungarian procedures is not required in this case, as the<br />

l = 2 fuzzy clusterings considered express object to cluster associations by means of<br />

membership probabilities and their clusters are aligned.<br />


\Sigma_E = \begin{pmatrix}
1.853 & 1.853 & 0.924 & 0.055 & 0.033 & 0.055 & 0.071 & 0.072 & 0.072 \\
0.067 & 0.067 & 0.043 & 0.017 & 0.014 & 0.017 & 1.014 & 1.901 & 1.901 \\
0.08 & 0.08 & 1.033 & 1.928 & 1.952 & 1.928 & 0.914 & 0.026 & 0.026
\end{pmatrix}    (6.31)

Notice that the higher the value of the (i,j)th entry of ΣE, the more likely it is that the jth object belongs to the ith cluster. Of course, this is due to the fact that the l = 2

fuzzy clusterings contained in the soft cluster ensemble of our toy example express<br />

object to cluster associations by means of membership probabilities, which are directly<br />

proportional to the strength of association between objects and clusters.<br />

Moreover, notice that if each column of ΣE is divided by its L1-norm (i.e. the sum of its elements), its entries become cluster membership probabilities, and, therefore, a classic fuzzy consensus clustering solution Λc can be obtained (see equation

(6.32)).<br />

\Lambda_c = \begin{pmatrix}
0.926 & 0.926 & 0.462 & 0.027 & 0.016 & 0.028 & 0.035 & 0.036 & 0.036 \\
0.033 & 0.033 & 0.021 & 0.008 & 0.007 & 0.008 & 0.507 & 0.951 & 0.951 \\
0.041 & 0.041 & 0.517 & 0.965 & 0.977 & 0.964 & 0.458 & 0.013 & 0.013
\end{pmatrix}    (6.32)

Furthermore, notice that Λc can be transformed into a crisp consensus clustering<br />

λc by simply assigning each object to the cluster it is most strongly associated to,<br />

breaking hypothetical ties at random. Referring once more to our toy example, the<br />

crisp consensus clustering obtained by hardening Λc is the one presented in equation<br />

(6.33).<br />

λc = [1 1 3 3 3 3 2 2 2]    (6.33)

– ProductConsensus (PC): the only difference between this consensus function and<br />

SumConsensus is that the preference values per candidate are multiplied instead of<br />

added. Quite obviously, the product rule is highly sensitive to low values, which could ruin the chances of a candidate of winning the election, no matter what its other confidence

values are (van Erp, Vuurpijl, and Schomaker, 2002). Equation (6.34) presents<br />

the voting process that constitutes the core of the ProductConsensus consensus function.<br />

It is important to notice that Λi correspond to the cluster ensemble components<br />

once cluster alignment has been conducted, and matrix products are computed entrywise.<br />

As a result, the k × n product matrix ΠE is obtained.<br />

\Pi_E = \prod_{i=1}^{l} \Lambda_i    (6.34)

Algorithm 6.2 presents the schematic description of the ProductConsensus consensus<br />

function. As in the previous consensus function, we propose converting the fuzzy<br />

clusterings Λi into membership probability matrices, which makes it possible to apply the product

voting rule on them once the cluster correspondence problem has been solved by<br />

means of the Hungarian algorithm. It can be observed that ProductConsensus yields<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Product voting matrix ΠE
Data: k clusters, n objects

Hungarian(E)
ΠE = 1_{k×n}
for i = 1 . . . l do
    if Λi not membership probabilities then
        Probabilize(Λi)
    end
    ΠE = ΠE ◦ Λi
end

Algorithm 6.2: Symbolic description of the soft consensus function ProductConsensus. Probabilize and Hungarian are symbolic representations of the conversion of fuzzy clusterings to membership probability matrices and the cluster disambiguation procedures, respectively, while 1_{k×n} represents a k × n unit matrix, and ◦ represents the Hadamard (or entrywise) matrix product.

the product voting matrix ΠE as its output. However, as in the case of SumConsensus,<br />

it can be transformed into a fuzzy or a crisp consensus clustering, a process that<br />

can be included as the final step of ProductConsensus.<br />

Applying the product rule on the toy cluster ensemble of equation (6.29) yields the product matrix ΠE presented in equation (6.35).

\Pi_E = \begin{pmatrix}
0.858 & 0.858 & 0.017 & 7.5\cdot10^{-4} & 2.7\cdot10^{-4} & 7.5\cdot10^{-4} & 7.9\cdot10^{-4} & 9.3\cdot10^{-4} & 9.3\cdot10^{-4} \\
0.001 & 0.001 & 1.9\cdot10^{-4} & 6.6\cdot10^{-5} & 4.5\cdot10^{-5} & 6.6\cdot10^{-5} & 0.037 & 0.903 & 0.903 \\
0.001 & 0.001 & 0.056 & 0.929 & 0.953 & 0.929 & 0.008 & 1.6\cdot10^{-4} & 1.6\cdot10^{-4}
\end{pmatrix}    (6.35)

Dividing each column of ΠE by its L1-norm gives rise to the fuzzy consensus clustering<br />

solution Λc based on membership probabilities of equation (6.36), and assigning each<br />

object to the cluster it is most strongly associated to (breaking ties randomly) yields<br />

the crisp consensus clustering λc of equation (6.37).<br />

\Lambda_c = \begin{pmatrix}
0.997 & 0.997 & 0.235 & 8.1\cdot10^{-4} & 2.8\cdot10^{-4} & 8.1\cdot10^{-4} & 0.017 & 0.001 & 0.001 \\
0.001 & 0.001 & 0.003 & 7.1\cdot10^{-5} & 4.7\cdot10^{-5} & 7.1\cdot10^{-5} & 0.806 & 0.998 & 0.998 \\
0.002 & 0.002 & 0.762 & 0.999 & 0.999 & 0.999 & 0.177 & 1.8\cdot10^{-4} & 1.8\cdot10^{-4}
\end{pmatrix}    (6.36)

λc = [1 1 3 3 3 3 2 2 2]    (6.37)

Notice that the differences between the fuzzy consensus clusterings Λc obtained by SumConsensus and ProductConsensus (equations (6.32) and (6.36)) –due to the different way voters' preferences are combined– are lost when they are transformed into crisp consensus clusterings. A minimal code sketch of both confidence voting rules is given right after this list.
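The following NumPy sketch condenses the two confidence voting rules above (function names are illustrative, and the probabilize helper is only a hypothetical instantiation of the symbolic Probabilize step of algorithms 6.1 and 6.2, since the text does not prescribe a particular transformation). It assumes the clusterings have already been cluster aligned, and ties are broken by taking the first maximum rather than at random.

    import numpy as np

    def probabilize(clustering, proportionality="direct"):
        """Map a k x n soft clustering to column-stochastic membership probabilities.
        Inversely proportional scores (e.g. distances) are inverted before normalising."""
        scores = clustering if proportionality == "direct" else 1.0 / (clustering + 1e-12)
        return scores / scores.sum(axis=0)

    def sum_consensus(aligned):
        """SumConsensus: add the voters' preferences per cluster (equation (6.30))."""
        return np.sum([probabilize(c) for c in aligned], axis=0)

    def product_consensus(aligned):
        """ProductConsensus: multiply the voters' preferences entrywise (equation (6.34))."""
        return np.prod([probabilize(c) for c in aligned], axis=0)

    def to_fuzzy_and_crisp(votes):
        """L1-normalise each column into a fuzzy consensus clustering and harden it (labels 1..k)."""
        fuzzy = votes / votes.sum(axis=0)
        return fuzzy, fuzzy.argmax(axis=0) + 1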



Positional voting<br />


Positional (aka ranking) voting methods rank the candidates according to the confidence<br />

scores emitted by the voters. Thus, fine-grain information regarding preference differences<br />

between candidates is ignored, although problems in scaling the voters' confidence scores

are avoided —that is, positional voting is useful in situations where confidence values are<br />

hard to scale correctly (van Erp, Vuurpijl, and Schomaker, 2002).<br />

As an aid for describing the positional voting methods that constitute the core of our<br />

consensus functions, equation (6.38) defines Λi (the ith component of the soft cluster ensemble<br />

E) in terms of its columns, represented by vectors λij (∀j =1,...,n).<br />

E = \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \\ \vdots \\ \Lambda_l \end{pmatrix} \quad \text{where} \quad \Lambda_i = \begin{pmatrix} \lambda_{i1} & \lambda_{i2} & \cdots & \lambda_{in} \end{pmatrix}    (6.38)

In this work, we propose employing two positional voting strategies (namely the Borda<br />

and the Condorcet voting methods) for deriving the consensus clustering solution, which<br />

gives rise to the following consensus functions:<br />

– BordaConsensus (BC): the Borda voting method (Borda, 1781) computes the mean<br />

rank of each candidate over all voters, reranking them according to their mean rank.<br />

This process results in a grading of all the n objects with respect to each of the k<br />

clusters, which is embodied in a k × n Borda voting matrix BE. Such grading process<br />

is conducted as follows: firstly, for each object (election), clusters (candidates) are<br />

ranked according to their degree of association with respect to it (from the most to<br />

the least strongly associated). Then, the top ranked candidate receives k points, the<br />

second ranked cluster receives k − 1 points, and so on. After iterating this procedure<br />

across the l cluster ensemble components, the grading matrix BE is obtained. The<br />

whole process is described in algorithm 6.3. Notice that the Rank procedure orders<br />

the clusters from the most to the least strongly associated to each object, yielding a<br />

ranking vector r which is a list of the k clusters ordered according to their degree of<br />

association with respect to the object under consideration (i.e. its first component (r(1)) identifies the most strongly associated cluster, and so on). Thus, the Rank

procedure must take into account whether the scalar values contained in λab are<br />

directly or inversely proportional to the strength of association between objects and<br />

clusters.<br />

When applied on our toy example, the resulting Borda grading matrix is the one<br />

presented in equation (6.39).<br />

B_E = \begin{pmatrix}
6 & 6 & 5 & 4 & 4 & 4 & 4 & 4 & 4 \\
3 & 3 & 2 & 2 & 2 & 2 & 4 & 6 & 6 \\
3 & 3 & 5 & 6 & 6 & 6 & 4 & 2 & 2
\end{pmatrix}    (6.39)

According to Borda voting, the higher the value of the (i,j)th entry of BE, the more

likely the jth object belongs to the ith cluster. Thus, again, dividing each column of<br />

matrix BE by its L1-norm transforms it into a cluster membership probability matrix,<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Borda voting matrix BE
Data: k clusters, n objects

Hungarian(E)
BE = 0_{k×n}
for a = 1 . . . l do
    for b = 1 . . . n do
        r = Rank(λab)
        for c = 1 . . . k do
            BE(r(c), b) = BE(r(c), b) + (k − c + 1)
        end
    end
end

Algorithm 6.3: Symbolic description of the soft consensus function BordaConsensus. Hungarian and Rank are symbolic representations of the cluster disambiguation and cluster ordering procedures, respectively, while the vector λab represents the bth column of the ath cluster ensemble component Λa, r is a clusters ranking vector and 0_{k×n} represents a k × n zero matrix.

i.e. a soft consensus clustering solution Λc (see equation (6.40)), and assigning each<br />

object to the cluster it is most strongly associated to –breaking ties randomly– yields<br />

the crisp consensus clustering of equation (6.41).<br />

\Lambda_c = \begin{pmatrix}
0.5 & 0.5 & 0.417 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 \\
0.25 & 0.25 & 0.166 & 0.167 & 0.167 & 0.167 & 0.333 & 0.5 & 0.5 \\
0.25 & 0.25 & 0.417 & 0.5 & 0.5 & 0.5 & 0.333 & 0.167 & 0.167
\end{pmatrix}    (6.40)

λc = [1 1 3 3 3 3 3 2 2]    (6.41)

– CondorcetConsensus (CC): just like Borda voting, the Condorcet voting method’s<br />

origin dates from the French revolution period, as an effort to address the shortcomings<br />

of simple majority voting when there are more than two candidates (Condorcet,<br />

1785). Although often considered to be a multi-step unweighted voting algorithm,

the Condorcet voting method can also be regarded as a positional voting strategy, as<br />

it employs the voters’ preference choices between any given pair of candidates (van<br />

Erp, Vuurpijl, and Schomaker, 2002). In particular, this voting method performs an<br />

exhaustive pairwise candidate ranking comparison across voters, and the winner of<br />

each one of these one-to-one confrontations scores a point. The result of this process<br />

is the Condorcet score matrix CE, the (i,j)th element of which indicates how many candidates the ith candidate beats in one-to-one comparisons in the jth election

(where candidates are clusters and an election corresponds to the clusterization of an<br />

object).<br />

Algorithm 6.4 presents a description of the CondorcetConsensus consensus function.<br />

As in BordaConsensus, the Rank procedure must take into account whether the scalar<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Condorcet voting matrix CE
Data: k clusters, n objects

Hungarian(E)
for b = 1 . . . n do
    M = 0_{k×k}
    for a = 1 . . . l do
        r = Rank(λab)
        for c = 1 . . . k do
            M(r(c), r(c+1 : k)) = M(r(c), r(c+1 : k)) + 1
        end
    end
    for c = 1 . . . k do
        CE(c, b) = Count(M(c, 1 : k) ≥ l/2)
    end
end

Algorithm 6.4: Symbolic description of the soft consensus function CondorcetConsensus. Hungarian and Rank are symbolic representations of the cluster disambiguation and cluster ordering procedures, respectively, while the vector λab represents the bth column of the ath cluster ensemble component Λa, r is a clusters ranking vector and 0_{k×k} represents a k × k zero matrix.

values contained in λab are directly or inversely proportional to the strength of association<br />

between objects and clusters. In each election, the (i,j)th entry of the square<br />

matrix M (usually referred to as the Condorcet sum matrix) counts the number of<br />

times the ith cluster is preferred over the jth one. The Count procedure is used for<br />

counting the number of elements of the cth row of matrix M that are greater than or equal to l/2, which means that at least half of the voters preferred one cluster over another.

Resorting again to our toy example, equation (6.42) presents the Condorcet score<br />

matrix resulting from applying CondorcetConsensus on it.<br />

\[
C_E = \begin{pmatrix}
2 & 2 & 2 & 1 & 1 & 1 & 2 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & 2 & 2 & 2 \\
1 & 1 & 2 & 2 & 2 & 2 & 2 & 0 & 0
\end{pmatrix} \tag{6.42}
\]

Again, dividing each column of matrix CE by its L1-norm transforms the Condorcet<br />

score matrix into a soft consensus clustering solution Λc, whose (i,j)th entry represents

the probability that the jth object belongs to the ith cluster (see equation (6.43) for<br />

the fuzzy consensus clustering solution obtained by CondorcetConsensus on our toy<br />

example). And assigning each object to the cluster it is most strongly associated to<br />

–breaking ties randomly– yields the crisp consensus clustering of equation (6.44).<br />

\[
\Lambda_c = \begin{pmatrix}
0.500 & 0.500 & 0.500 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 \\
0.250 & 0.250 & 0 & 0 & 0 & 0 & 0.333 & 0.667 & 0.667 \\
0.250 & 0.250 & 0.500 & 0.667 & 0.667 & 0.667 & 0.333 & 0 & 0
\end{pmatrix} \tag{6.43}
\]


\[
\lambda_c = \begin{pmatrix} 1 & 1 & 1 & 3 & 3 & 3 & 1 & 2 & 2 \end{pmatrix} \tag{6.44}
\]

Notice that the fuzzy consensus clusterings Λc output by BordaConsensus and CondorcetConsensus<br />

(equations (6.40) and (6.43)) differ notably from those obtained by<br />

SumConsensus and ProductConsensus (equations (6.32) and (6.36)) —see the double<br />

and triple ties obtained at the clusterization of the third and seventh objects, which<br />

are due to the intrinsic differences between the distinct voting strategies applied.<br />

Moreover, notice that the two positional voting based consensus functions (BC and CC) yield structurally similar fuzzy consensus clusterings Λc, although their contents differ slightly. However, their hardened versions λc (equations (6.41) and (6.44)) differ to a larger extent, due to the random way ties are broken. A small illustrative sketch of both positional scoring schemes is presented below.
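For illustration purposes only, the following is a minimal sketch of the Borda and Condorcet scoring steps described above, written in Python/NumPy rather than in the Matlab environment used for the thesis experiments. It assumes that the cluster ensemble components have already been disambiguated (e.g. by means of the Hungarian method), that larger membership values denote stronger object-to-cluster association, and that all function and variable names are merely illustrative.

import numpy as np

def borda_scores(ensemble):
    """Borda scoring over an aligned soft ensemble (list of k x n membership matrices).

    For each object (election) and each clustering (voter), clusters are ranked by
    decreasing membership; the top-ranked cluster receives k points, the next k-1, etc."""
    k, n = ensemble[0].shape
    B = np.zeros((k, n))
    for Lam in ensemble:
        for b in range(n):
            ranking = np.argsort(-Lam[:, b])        # clusters ordered by decreasing membership
            for position, cluster in enumerate(ranking):
                B[cluster, b] += k - position       # k, k-1, ..., 1 points
    return B

def condorcet_scores(ensemble):
    """Condorcet scoring: exhaustive pairwise cluster confrontations across voters."""
    k, n = ensemble[0].shape
    l = len(ensemble)
    C = np.zeros((k, n))
    for b in range(n):
        M = np.zeros((k, k))                        # Condorcet sum matrix of this election
        for Lam in ensemble:
            ranking = np.argsort(-Lam[:, b])
            for position, cluster in enumerate(ranking):
                M[cluster, ranking[position + 1:]] += 1   # preferred over lower-ranked clusters
        C[:, b] = (M >= l / 2.0).sum(axis=1)        # point per cluster beaten by at least half the voters
    return C

def harden(scores, rng=None):
    """L1-normalise columns (soft consensus) and pick the strongest cluster per object,
    breaking ties randomly (crisp consensus)."""
    rng = np.random.default_rng(0) if rng is None else rng
    soft = scores / scores.sum(axis=0, keepdims=True)
    crisp = np.array([rng.choice(np.flatnonzero(col == col.max())) for col in soft.T])
    return soft, crisp + 1                          # 1-based cluster labels, as in the toy example

Applying harden to the output of either scoring function reproduces the column-wise L1 normalization and random tie-breaking used to obtain equations (6.40)-(6.41) and (6.43)-(6.44).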

6.4 Experiments<br />

This section presents several consensus clustering experiments evaluating the consensus<br />

functions for soft cluster ensembles proposed in the previous section. These experiments<br />

are conducted according to the following design.<br />

– What do we want to measure? We are interested in comparing both the quality

of the consensus clustering solutions obtained and the time complexity of the proposed<br />

consensus functions.<br />

– How do we measure it? As regards the time complexity aspect, all consensus<br />

processes follow a flat architecture (i.e. one step consensus), and we measure the<br />

CPU time required for their execution, using the computational resources described<br />

in appendix A.6. As far as the evaluation of the quality of the consensus clustering<br />

results is concerned, although the proposed consensus functions output fuzzy consensus

clustering solutions, we have compared their hardened version with respect to the<br />

ground truth of each data set in terms of normalized mutual information φ(NMI). The

reason for this is twofold: firstly, a soft ground truth is not available for these data sets,<br />

so fuzzy consensus clusterings cannot be directly evaluated. And secondly, provided<br />

that the CSPA, HGPA, MCLA and EAC consensus functions output hard consensus<br />

clustering solutions, fair inter-consensus functions comparison requires converting the<br />

soft consensus clustering matrices Λc output by VMA, BC, CC, PC and SC to crisp<br />

consensus labelings λc —recall that this simply boils down to assigning each object<br />

to the cluster it is most strongly associated to.

– How are the experiments designed? In each consensus clustering experiment we<br />

have applied our four voting-based consensus functions –SumConsensus (SC), ProductConsensus<br />

(PC), BordaConsensus (BC) and CondorcetConsensus (CC)–, besides<br />

the fuzzy versions of CSPA, EAC, HGPA and MCLA (see section 6.2) plus one of<br />

the pioneering soft consensus functions, namely VMA (Voting Merging Algorithm)<br />

(Dimitriadou, Weingessel, and Hornik, 2002) —see appendix A.5 for a description.<br />

Experiments have been conducted on the twelve unimodal data collections employed<br />

in this work, which are described in appendix A.2.1. As regards the creation of the<br />

soft cluster ensemble components, we have employed the fuzzy c-means and the k-means

clustering algorithms. Whereas the former is fuzzy by nature, the latter is not.<br />


However, if object-to-centroid distances are inverted and normalized using a softmax normalization, they can be interpreted as membership probabilities (i.e. the k-means clustering solutions are fuzzified) —see the sketch after this list. For the sake of greater

algorithmic diversity, variants of k-means using the Euclidean, city block, cosine and<br />

correlation distance measures have been employed. Thus, the cardinality of the algorithmic<br />

diversity factor is |dfA| = 5. Applying all these clustering algorithms on each<br />

and every one of the distinct object representations created by the mutually crossed<br />

application of the representational and dimensional diversity factors of each data set,<br />

gives rise to soft cluster ensembles of the sizes l presented in table 6.1. In order to obtain<br />

a representative analysis of the aforementioned consensus functions performance,<br />

we have conducted multiple experiments on distinct diversity scenarios. To do so,<br />

besides using the cluster ensemble of size l, we have also generated cluster ensembles<br />

of sizes ⌊l/20⌋, ⌊l/10⌋, ⌊l/5⌋ and ⌊l/2⌋, which are created by randomly picking a subset

of the original cluster ensemble components. For each distinct cluster ensemble, ten<br />

independent runs of each consensus function are executed.<br />

– How are results presented? The performances of the nine soft consensus functions<br />

are summarized by means of a quality (φ (NMI) with respect to the ground truth) versus<br />

time complexity (CPU time measured in seconds) diagram that describes, in a compact manner, the qualities of the consensus functions compared. For each

consensus function, the depicted scatterplot corresponds to the region limited by the<br />

mean ± 2-standard deviation curves corresponding to the two associated magnitudes<br />

(i.e. φ (NMI) and CPU time) computed throughout all the experiments conducted<br />

on each data collection —ten independent experiment runs on each one of the five<br />

cluster ensemble sizes. In order to determine whether the differences between the<br />

compared consensus functions are significant or not, we have conducted a pairwise<br />

comparison (both in CPU time and φ (NMI) terms) among them applying a t-paired<br />

test, measuring the significance level p at which the null hypothesis (equal means with<br />

possibly unequal variances) is rejected. If the typical 95% confidence interval for true<br />

difference in means is taken as a reference, significance level values of p < 0.05 indicate statistically significant differences between the compared consensus functions.
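As a small illustration of the fuzzification step just described, the sketch below (Python/NumPy, illustrative names only) converts a matrix of object-to-centroid distances into column-stochastic memberships by negating the distances and applying a softmax; the exact inversion used in the thesis experiments is not reproduced here.

import numpy as np

def fuzzify_kmeans_distances(distances):
    """Turn a (k x n) matrix of object-to-centroid distances into soft memberships.

    Smaller distances must yield larger memberships, so distances are negated before
    a column-wise softmax; every column then sums to one and can be read as the
    membership probabilities of the corresponding object."""
    scores = -np.asarray(distances, dtype=float)
    scores -= scores.max(axis=0, keepdims=True)     # numerical stability before exponentiation
    expd = np.exp(scores)
    return expd / expd.sum(axis=0, keepdims=True)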



Data set       Soft cluster ensemble size l
Zoo            285
Iris           45
Wine           225
Glass          145
Ionosphere     485
WDBC           565
Balance        35
Mfeat          30
miniNG         365
Segmentation   260
BBC            285
PenDigits      285

Table 6.1: Soft cluster ensemble sizes l corresponding to the unimodal data sets.

[Figure 6.2 — plot omitted. Panel: ZOO; x-axis: CPU time (sec.), 0–4; y-axis: φ(NMI), 0–1; legend: CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC, SC.]

Figure 6.2: φ(NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the Zoo data collection.

The superior computational efficiency of VMA is to be expected, due to the fact that it simultaneously solves the

cluster correspondence problem and voting following an iterative procedure (Dimitriadou,<br />

Weingessel, and Hornik, 2002), whereas in SC, PC, BC and CC, these two processes are<br />

sequentially conducted.<br />

As regards the quality of the consensus clustering solutions, notice that the four consensus<br />

functions proposed achieve φ(NMI) scores almost identical to those of the best performing

state-of-the-art alternative, VMA.<br />

Table 6.2 presents the significance level values obtained from all the t-paired tests conducted<br />

on the Zoo data set. The upper and lower triangular sections of the table correspond<br />

to the comparison between consensus functions in terms of CPU time and φ (NMI) , respectively.<br />

When pairwise comparisons between the ith and the jth consensus functions result<br />

in statistically significant differences, the corresponding significance level value p is presented<br />

in the (i,j)th entry of the table (or in the (j,i)th entry, depending on whether it is a<br />

comparison in terms of CPU time or φ (NMI) ). Otherwise, the lack of statistically significant<br />


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     ×       ×       0.0001  0.0001  0.0001  0.0001  0.0013  0.0012
EAC    0.0001  ———     ×       0.0001  0.0002  0.0001  0.0001  0.002   0.0019
HGPA   0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0017  0.0016
MCLA   ×       0.0001  0.0001  ———     0.0001  0.0009  ×       0.0003  0.0003
VMA    0.0001  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0001  0.0001  0.0001  0.0001  ×       ———     0.0001  ×       ×
CC     0.0001  0.0001  0.0001  0.0001  ×       ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     0.0001  0.0001  0.0001  0.0001  ×       0.0337  0.0419  ×       ———

Table 6.2: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a t-paired test on the Zoo data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

differences is denoted by means of the × symbol.<br />
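The pairwise significance testing behind tables such as table 6.2 can be outlined as follows, here using scipy's paired t-test on the per-run scores of two consensus functions; this is only a sketch of the procedure (with hypothetical variable names), not the Matlab code actually employed in the experiments.

import numpy as np
from scipy.stats import ttest_rel

def paired_comparison(scores_a, scores_b, alpha=0.05):
    """Paired t-test between two consensus functions evaluated on the same runs.

    scores_a and scores_b hold one phi(NMI) value (or CPU time) per experiment run.
    Returns the significance level p and whether the difference is significant at
    the usual 95% confidence level (p < alpha)."""
    _, p_value = ttest_rel(np.asarray(scores_a), np.asarray(scores_b))
    return p_value, bool(p_value < alpha)

# e.g. p, significant = paired_comparison(bc_nmi_runs, mcla_nmi_runs)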

For instance, let us see how BordaConsensus (BC) compares to the eight remaining consensus functions in terms of execution CPU time —for an easier identification, the

contents of the corresponding boxes of table 6.2 are italicized. In fact, they tell us that the<br />

differences observed in figure 6.2 (according to which BC is apparently faster than MCLA<br />

and CC, and slower than CSPA, EAC, HGPA, VMA, PC and SC) are statistically significant<br />

with respect to all but the PC and SC consensus functions.<br />

If this comparison is based on the φ (NMI) of the consensus clustering solutions, figure 6.2<br />

suggests that BC performs better than CSPA, EAC, HGPA and MCLA, which is true from a<br />

statistical significance standpoint, as the corresponding entries of table 6.2 (which are typed<br />

in boldface for ease of identification) confirm. In contrast, the small differences between<br />

the φ (NMI) values of BC, VMA, CC and PC appreciated in figure 6.2 are statistically non<br />

significant, whereas the difference with respect to SC is significant despite its apparent closeness.

In order to provide the reader with a global perspective that illustrates the performance<br />

of the proposed consensus functions compared to their state-of-the-art counterparts across<br />

the twelve unimodal collections employed in this work, we have computed the total percentage<br />

of experiments in which the latter yield better, equivalent or worse results than the<br />

voting-based consensus functions —considering the statistical significance of the differences<br />

between the compared magnitudes (CPU time and φ (NMI) ).<br />

Firstly, table 6.3 presents the results of such a comparative analysis when it refers

to the quality of the consensus clusterings output by the consensus functions in all the<br />

experiments conducted. It can be observed that the four proposed consensus functions<br />

outperform EAC, HGPA and MCLA in a pretty overwhelming percentage of the experiments<br />

(an average 94.4% of the total). When compared to CSPA and VMA, we can appreciate<br />

certain differences between the performance of the consensus functions based on confidence<br />

voting (PC and SC) and the ones based on positional voting (BC and CC). In general terms,<br />

SC and PC perform slightly better than BC and CC. Moreover, notice that BordaConsensus<br />

and CondorcetConsensus attain exactly the same results, whereas the similarity between<br />

the results of SC and PC is also very noticeable. We conjecture that these high degrees<br />

of resemblance are due to the fact that evaluation is conducted upon a hardened version

of the soft consensus clustering output by these consensus functions. Thus, the intrinsic<br />


φ(NMI)                      BC      CC      PC      SC
CSPA   better than ...      27.3%   27.3%   9.1%    9.1%
       equivalent to ...    9.1%    9.1%    45.4%   45.4%
       worse than ...       63.6%   63.6%   45.4%   45.4%
EAC    better than ...      0%      0%      0%      0%
       equivalent to ...    0%      0%      0%      0%
       worse than ...       100%    100%    100%    100%
HGPA   better than ...      0%      0%      0%      0%
       equivalent to ...    0%      0%      0%      0%
       worse than ...       100%    100%    100%    100%
MCLA   better than ...      0%      0%      0%      0%
       equivalent to ...    16.7%   16.7%   16.7%   16.7%
       worse than ...       83.3%   83.3%   83.3%   83.3%
VMA    better than ...      25%     25%     0%      0%
       equivalent to ...    33.3%   33.3%   100%    91.7%
       worse than ...       41.7%   41.7%   0%      8.3%

Table 6.3: Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) yield (statistically significant) better/equivalent/worse consensus clustering solutions than the four proposed consensus functions (BC, CC, PC and SC).

differences between the distinct voting strategies are somehow lost.

In either case, the φ (NMI) scores obtained by the four proposed voting based consensus<br />

functions are statistically significantly lower than those of CSPA and VMA in only 15.3% of

the experiments conducted, which clearly indicates that, from a consensus quality perspective,<br />

our proposals constitute an attractive alternative for conducting consensus clustering<br />

on soft cluster ensembles.<br />

And secondly, the results of the previously described comparison, but referred to execution<br />

CPU time, are presented in table 6.4. In general terms, the state-of-the-art consensus<br />

functions (except for EAC) are faster than the proposed consensus functions based on positional<br />

voting methods (BC and CC). This is due to the candidates ranking step that<br />

precedes the voting process itself (see algorithms 6.3 and 6.4). Moreover, the execution of<br />

CC takes longer than BC, due to the exhaustive pairwise candidate comparison involved in<br />

the Condorcet voting method. In contrast, the confidence voting based consensus functions<br />

(PC and SC) are more computationally efficient, being as fast or faster than CSPA, EAC,<br />

HGPA and MCLA in an average 80.7% of the experiments. However, they are unable to<br />

match the computational efficiency of VMA, which, as mentioned earlier, is caused by its<br />

simultaneous and iterative cluster disambiguation and voting procedure.<br />

As a conclusion, it can be stated that the four voting based consensus functions proposed<br />

in this chapter are indeed worthy of being considered as an alternative when it comes to<br />

creating consensus clustering solutions on soft cluster ensembles, as they are capable of<br />

delivering high quality consensus clustering solutions at an acceptable computational cost<br />

—this is especially true for those consensus functions based on confidence voting methods

(i.e. PC and SC). The higher computational complexity of positional voting based consensus<br />

functions (BC and CC) suggests limiting their application to those cases in which the<br />


CPU time                     BC      CC      PC      SC
CSPA   faster than ...       45.4%   72.7%   18.2%   18.2%
       equivalent to ...     18.2%   0%      18.2%   18.2%
       slower than ...       36.4%   27.3%   63.6%   63.6%
EAC    faster than ...       9.1%    27.3%   9.1%    9.1%
       equivalent to ...     27.3%   9.1%    0%      0%
       slower than ...       63.6%   63.6%   90.9%   90.9%
HGPA   faster than ...       91.7%   91.7%   33.3%   33.3%
       equivalent to ...     8.3%    8.3%    41.7%   41.7%
       slower than ...       0%      0%      25%     25%
MCLA   faster than ...       66.7%   83.3%   16.7%   16.7%
       equivalent to ...     25%     16.7%   41.7%   41.7%
       slower than ...       8.3%    0%      41.7%   41.7%
VMA    faster than ...       100%    100%    91.7%   91.7%
       equivalent to ...     0%      0%      8.3%    8.3%
       slower than ...       0%      0%      0%      0%

Table 6.4: Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) are executed (statistically significantly) faster/equivalent/slower than the four proposed consensus functions (BC, CC, PC and SC).

confidence values contained in the soft cluster ensemble are difficult to scale correctly (van<br />

Erp, Vuurpijl, and Schomaker, 2002).<br />

6.5 Discussion<br />

The main motivation of the proposals put forward in this chapter is the fact that most of the<br />

literature on cluster ensembles is focused on the application of consensus clustering

processes on hard cluster ensembles. In our opinion, however, soft consensus clustering is<br />

an alternative worth considering, inasmuch as crisp clustering is in fact a simplification of<br />

fuzzy clustering —a simplification that may give rise to the loss of valuable information.<br />

The initial source of inspiration for the soft consensus functions just presented was<br />

metasearch (aka information fusion) systems, the main purpose of which is to obtain improved<br />

search results by combining the ranked lists of documents returned by multiple<br />

search engines in response to a given query. Although the resemblance between metasearch<br />

and consensus clustering was already reported in (Gionis, Mannila, and Tsaparas, 2007),<br />

direct inspiration came from the works of Aslam and Montague (Aslam and Montague,<br />

2001; Montague and Aslam, 2002), where metasearch algorithms based on positional voting<br />

were devised —notice that this type of voting technique lends itself to being applied in

this context, as search engines return lists of ranked documents. From that point on, the<br />

analogy between object-to-cluster association scores in a soft cluster ensemble and voters’<br />

preferences for candidates became the key issue for deriving consensus functions based on<br />

positional and confidence voting methods.<br />

Nevertheless, the application of voting methods for combining clustering solutions is not<br />

new. For instance, unweighted voting strategies (van Erp, Vuurpijl, and Schomaker, 2002)

such as plurality and majority voting have been applied for deriving consensus clustering<br />


solutions on hard cluster ensembles (Dudoit and Fridlyand, 2003; Fischer and Buhmann,<br />

2003; Greene and Cunningham, 2006). To our knowledge, the only voting-based consensus<br />

function for soft cluster ensembles is the Voting-Merging Algorithm (VMA) of (Dimitriadou,<br />

Weingessel, and Hornik, 2002), which employs a weighted version of the sum rule for<br />

confidence voting. Moreover, all these works share a common point in that they use the<br />

Hungarian algorithm for solving the cluster correspondence problem.<br />
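As a concrete illustration of this cluster correspondence step, the sketch below poses the disambiguation of two soft clusterings as a linear assignment problem and solves it with scipy's implementation of the Hungarian method; the overlap-based profit matrix is one reasonable choice chosen for illustration, not necessarily the exact criterion used in the works cited here.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_clusters(reference, target):
    """Relabel the clusters of `target` so that they best match those of `reference`.

    Both arguments are (k x n) soft membership matrices over the same n objects.
    The profit of matching reference cluster i with target cluster j is taken as the
    overlap (inner product) of their membership rows; the Hungarian method then finds
    the one-to-one cluster assignment of maximum total profit."""
    profit = reference @ target.T                       # (k x k) cluster-to-cluster overlap
    _, col_ind = linear_sum_assignment(-profit)         # maximise profit = minimise -profit
    return target[col_ind, :]                           # rows reordered to match the reference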

Additional techniques for cluster disambiguation developed in the consensus clustering<br />

literature include the correspondence estimation based on common space cluster representation<br />

by clusters clustering or Singular Value Decomposition (Boulis and Ostendorf, 2004),<br />

the Soft Correspondence Ensemble Clustering algorithm, which is based on establishing a<br />

soft correspondence between clusters (in the sense that a cluster of a given clustering corresponds<br />

to every cluster in another clustering with different weights) (Long, Zhang, and<br />

Yu, 2005), the cumulative voting approach, that, unlike common one-to-one voting schemes<br />

(e.g. Hungarian), computes a probabilistic mapping between clusters (Ayad and Kamel,<br />

2008), or the FullSearch, Greedy and LargeKGreedy cluster alignment algorithms (Jakobsson

and Rosenberg, 2007). The first two approaches coincide in that they can be applied interchangeably for aligning the clusters of crisp and fuzzy partitions. Given the key importance of

the cluster disambiguation process as a prior step to voting, we plan to evaluate these alternatives<br />

to the Hungarian method, so as to investigate their impact on both the quality of<br />

the consensus clusterings obtained and the time complexity of the whole consensus process.<br />

The comparative performance analysis of the four proposed consensus functions has<br />

revealed that they constitute a feasible alternative for conducting consensus clustering processes<br />

on soft cluster ensembles, as they are capable of yielding consensus clustering solutions<br />

of comparable or superior quality to those obtained by state-of-the-art clustering combiners<br />

at a reasonable computational cost. An additional appealing feature of our proposals is<br />

that they naturally deliver fuzzy consensus clustering solutions, which makes perfect sense in a soft clustering scenario —a fact that other recent consensus functions for soft cluster ensembles

–as the one presented in (Punera and Ghosh, 2007)– do not consider. However, the lack<br />

of a fuzzy ground truth has not allowed evaluating the soft consensus clusterings obtained,<br />

which constitutes one of the future directions of research of the work conducted in this<br />

chapter. As mentioned earlier, this would probably make the differences between the proposed<br />

consensus functions more evident, as it would highlight the differences between the<br />

distinct voting methods employed.<br />

As reported earlier, the sequential application of the cluster disambiguation and the<br />

voting processes penalizes the time complexity of our proposals, especially when they are

compared to VMA. Thus, in the future, we plan to adopt the iterative cluster alignment<br />

plus voting strategy employed by this consensus function, which, in our opinion, will surely<br />

reduce the execution time of the proposed voting-based consensus functions without significantly<br />

reducing the quality of the consensus clusterings obtained.

Another significant conclusion is that the EAC and HGPA consensus functions yield<br />

the lowest quality consensus clusterings, as already noticed in the vast majority of the<br />

experiments conducted in the hard clustering scenario.<br />


6.6 Related publications<br />

Our first approach to voting based soft consensus functions was the derivation of BordaConsensus (Sevillano, Alías, and Socoró, 2007b). The details of this publication, presented as a poster at the SIGIR 2007 conference held in Amsterdam, are described next.

Authors: Xavier Sevillano, Francesc Alías and Joan Claudi Socoró<br />

Title: BordaConsensus: a New Consensus Function for Soft Cluster Ensembles<br />

In: Proceedings of the 30th ACM SIGIR Conference<br />

Pages: 743-744<br />

Year: 2007<br />

Abstract: Consensus clustering is the task of deriving a single labeling by applying<br />

a consensus function on a cluster ensemble. This work introduces BordaConsensus, a<br />

new consensus function for soft cluster ensembles based on the Borda voting scheme.<br />

In contrast to classic, hard consensus functions that operate on labelings, our proposal<br />

considers cluster membership information, thus being able to tackle multiclass<br />

clustering problems. Initial small scale experiments reveal that, compared to state-of-the-art

consensus functions, BordaConsensus constitutes a good performance vs.<br />

complexity trade-off.<br />



Chapter 7<br />

Conclusions<br />

The contributions put forward in this thesis constitute a unitary proposal for robust clustering<br />

based on cluster ensembles, with a specific focus on the increasingly interesting application<br />

of multimedia data clustering and a view on its generalization in fuzzy clustering<br />

scenarios. In this chapter, we summarize the main features of our proposals, highlighting<br />

their strengths and weaknesses, and outlining some interesting directions for future research.<br />

As for the robustness of clustering, recall that the unsupervised nature of this problem<br />

makes it difficult (if not impossible) to select a priori the clustering system configuration

that gives rise to the best 1 data partition. Furthermore, given the myriad of options –e.g.<br />

clustering algorithms, data representations, etc.– available to the clustering practitioner,<br />

such important decision making is marked by a high degree of uncertainty. As suboptimal<br />

configuration decisions may give rise to little meaningful partitions of the data, it turns out<br />

that these clustering indeterminacies end up being very relevant in practice, which, in our<br />

opinion, justifies research efforts oriented to overcome them (such as the present one). This<br />

was the main motivation of our first approaches to robust clustering via cluster ensembles<br />

(Sevillano et al., 2006a; Sevillano et al., 2006b; Sevillano et al., 2007c), which have attracted<br />

the attention of several researchers (Tjhi and Chen, 2007; Pinto, 2008; Gonzàlez and Turmo,<br />

2008b; Tjhi and Chen, 2009).<br />

For these reasons, our approach to robust clustering intentionally reduces user decision<br />

making as much as possible, thus following an approach that is nearly opposite to the<br />

procedure usually employed in clustering: instead of using a specific clustering configuration<br />

(which is often selected blindly unless domain knowledge is available), the clustering<br />

practitioner is encouraged to use and combine all the clustering configurations at hand,<br />

compiling the resulting clusterings into a cluster ensemble, upon which a consensus clustering<br />

is derived. The more similar this consensus clustering is to the highest quality clustering<br />

contained in the cluster ensemble, the greater the robustness to clustering indeterminacies.<br />

In this context, it must be noted that our particular approach to robust clustering foments<br />

the creation of large cluster ensembles. This motivates that one of our main issues<br />

of concern is the computationally efficient derivation of a high quality consolidated cluster-<br />

1 The best data partition is an elusive concept in itself, as it basically depends on how the clustered<br />

data is interpreted. However, for any given interpretation criterion, some clustering algorithms may obtain<br />

better clusters than others (Jain, Murty, and Flynn, 1999). In this work, the quality of clusterings has been<br />

evaluated by comparison with a allegedly correct cluster structure of the data, referred to as ground truth,<br />

measuring their degree of resemblance by means of normalized mutual information, or φ (NMI) .<br />


ing upon the aforementioned cluster ensemble, which gives rise to the first two proposals<br />

put forward in this thesis: hierarchical consensus architectures and consensus self-refining<br />

procedures, which are reviewed in sections 7.1 and 7.2, respectively.<br />

Our proposals for robust clustering based on cluster ensembles find a natural field of<br />

application in multimedia data clustering, as the existence of multiple data modalities<br />

poses additional indeterminacies that challenge the obtention of robust clustering results.<br />

Moreover, our strategy naturally allows the simultaneous use of early and late multimodal<br />

fusion techniques, which constitutes a highly generic approach to the problem of multimedia<br />

clustering. In section 7.3, our proposals in this area are reviewed.<br />

The last proposal of this thesis can be regarded as a first step for generalizing our cluster<br />

ensembles based robust clustering approach, as it consists of several voting based consensus<br />

functions for soft cluster ensembles —recall that crisp clustering is in fact a simplification of<br />

its fuzzy counterpart. These consensus functions, which are reviewed in section 7.4, can also<br />

be considered a response to the relatively few efforts devoted to the derivation of consensus<br />

clustering strategies in the context of fuzzy clustering.<br />

We have given great importance to the experimental evaluation of all our proposals. To<br />

that effect, we have employed several state-of-the-art consensus functions for hard cluster<br />

ensembles –hypergraph based (CSPA, HGPA, MCLA) (Strehl and Ghosh, 2002), evidence<br />

accumulation (EAC) (Fred and Jain, 2005) and similarity-as-data based (ALSAD, KMSAD,<br />

SLSAD) (Kuncheva, Hadjitodorov, and Todorova, 2006)– to implement our self-refining<br />

hierarchical consensus architectures. Moreover, the fuzzy versions of CSPA, EAC, HGPA<br />

and MCLA, plus the VMA soft consensus function (Dimitriadou, Weingessel, and Hornik,<br />

2002) have been used as an evaluation benchmark for our voting based consensus functions<br />

for soft cluster ensembles. Our proposals have been tested over a total of sixteen unimodal<br />

and multimodal data collections, which contain a number of objects ranging from hundreds<br />

to several thousands. In particular, the performance of self-refining hierarchical consensus<br />

architectures has been evaluated on both unimodal (chapters 3 and 4) and multimodal data<br />

collections (chapter 5), whereas the experiments concerning soft consensus functions have<br />

been conducted on the 12 unimodal collections —see chapter 6. However, in the near future,<br />

we plan to extend these latter experiments towards multimodal data sets. We expect such an extension to involve little cost, since any consensus function can easily accommodate our early

plus late fusion multimedia clustering proposal. In this sense, we also intend to apply<br />

our multimedia clustering system on well-known multimodal data sets such as VideoClef<br />

(composed of video data along with speech recognition transcripts, metadata and shot-level<br />

keyframes) (VideoCLEF, accessed on May 2009) and ImageClef (still images annotated with<br />

text) (ImageCLEF, accessed on May 2009).<br />

Furthermore, we have conducted our experiments on cluster ensembles of very different<br />

sizes (from 6 to 5124 clusterings), in order to evaluate the influence of this factor on the<br />

computational facet of our proposals. In all the experiments, the statistical significance<br />

of the results at the 5% significance level has been evaluated, either explicitly or by their<br />

presentation by means of boxplot charts.<br />

As mentioned in appendix A.6, the experiments conducted in this thesis have been<br />

run under Matlab 7.0.4 on Dual Pentium 4 3GHz/1 GB RAM computers. A total of<br />

20 computers were employed during approximately 9 months at an almost constant pace,<br />

combining for an estimated total running time of more than 10 years!<br />

This experimental variability has provided us with a comparative view of the state-


of-the-art consensus functions employed, both in terms of the quality of the consensus clusterings

they yield and their computational complexity, and the most relevant conclusions<br />

drawn are enumerated next. As regards the consensus functions for hard cluster ensembles,<br />

we have observed that, in general terms, the EAC and HGPA consensus functions deliver,<br />

by far, the poorest quality consensus clusterings. We believe that the low performance of<br />

EAC is due to the fact that it was originally devised for consolidating clusterings with a very<br />

high number of clusters into a consensus partition with a smaller number of clusters (Fred<br />

and Jain, 2005). However, in our experiments, both the cluster ensemble components and<br />

the consensus clustering have the same number of clusters, which probably has a detrimental<br />

effect on the quality of the consensus partitions obtained by the evidence accumulation<br />

approach.<br />

From a computational standpoint, the HGPA and MCLA consensus functions are applicable<br />

on larger data collections than the rest, as their complexity is linear with the number<br />

of objects n. However, the execution time of MCLA is penalized when it is executed on<br />

large cluster ensembles, as its time complexity is quadratic with l (the number of cluster<br />

ensemble components). As regards the soft consensus functions, VMA constitutes the most<br />

attractive alternative, as it yields pretty high quality consensus clusterings while being fast<br />

to execute, thanks to the simultaneous execution of the cluster disambiguation and voting<br />

procedures. A rather opposite behaviour is shown by the soft versions of the EAC and<br />

HGPA consensus functions: the former is notably time consuming, while the latter outputs<br />

really poor quality consensus clusterings.<br />

Let us get critical for a while: possibly one of the major sources of criticism for this<br />

work refers to the rather unrealistic assumption (though not uncommon in the literature)<br />

that the number of clusters the objects must be grouped into (referred to as k) is a known

parameter. In practice, however, the user seldom knows how many clusters should be found,<br />

so it becomes a further indeterminacy to deal with.<br />

In this work, all the clusterings involved in any process (i.e. the cluster ensemble components<br />

and the consensus clusterings) have the same number of clusters, which coincides<br />

with the number of true groups in the data, defined by the ground truth that constitutes<br />

the gold standard for ultimately evaluating the quality our results. By doing so, we have<br />

also disregarded a common diversity factor employed in the creation of cluster ensembles,<br />

which often contain clusterings with different numbers of clusters (often chosen randomly).<br />

However, we would like to highlight at this point that not many of the consensus functions<br />

found in the literature are capable of estimating the correct number of clusters in<br />

the data set, thus making it necessary to specify the desired value of k as one of their parameters.<br />

Quite obviously, two of the clearest future research directions of this work are i)<br />

estimating the number of clusters of the consensus clustering solution, and ii) adapting the<br />

proposed consensus functions for dealing with cluster ensemble components with distinct<br />

numbers of clusters. The achievement of these goals would constitute the ultimate step<br />

towards a fully generic approach to robust clustering based on cluster ensembles.<br />

7.1 Hierarchical consensus architectures<br />

As regards the computational efficiency of consensus processes, the fact that their space and<br />

time complexities usually scale linearly or quadratically with the cluster ensemble size can<br />


make the execution of traditional one-step (aka flat) consensus processes (in which the whole<br />

cluster ensemble is input to the consensus function at once) very costly or even unfeasible<br />

when conducted on highly populated cluster ensembles. For this reason, the application of<br />

a divide-and-conquer strategy on the cluster ensemble –which gives rise to the hierarchical<br />

consensus architectures (HCA) proposed in chapter 3– constitutes an alternative to classic<br />

flat consensus that, besides leaving out none of the l cluster ensemble components, is also<br />

naturally parallelizable, making it even more computationally appealing.<br />

In particular, two types of hierarchical consensus architectures have been proposed:<br />

random and deterministic HCA. Both architectures differ in the way the user intervenes in<br />

their design. In random HCA, the user selects the size (b) of the mini-ensembles upon which intermediate consensus processes are conducted, which, together with the cluster ensemble size l

determines the number of stages of the consensus architecture. Compared to them, deterministic<br />

HCA provide a more modular approach to consensus clustering, as clusterings of<br />

the same nature are combined at each stage of the hierarchy. In fact, our first approach to<br />

hierarchical consensus architectures dealt with deterministic HCA (Sevillano et al., 2007a),<br />

although it was solely focused on the analysis of the quality of the consensus clusterings<br />

obtained, not on its computational aspect.<br />

Extensive experiments have proven that their computational efficiency is highly dependent<br />

on the characteristics of the consensus function employed for combining the clusterings<br />

(in particular, it depends on how its time complexity scales with the number of clusterings<br />

combined). For instance, flat consensus based on the EAC consensus function is more efficient<br />

than any hierarchical architecture, whereas a rather opposite behaviour is observed<br />

when MCLA is used.<br />

Moreover, we have observed that HCAs become faster than flat consensus when operating<br />

on highly populated cluster ensembles, regardless of whether their fully serial or parallel<br />

implementation is considered (except when the EAC consensus function is employed). As expected,

the fully parallel version of HCAs outperforms flat consensus (often by a large<br />

margin), even when small cluster ensembles are employed. An additional interesting feature<br />

of hierarchical consensus architectures is that they provide a means for obtaining a<br />

consensus clustering solution in scenarios where the complexity of flat consensus makes its<br />

execution impossible (for a given set of computational resources).<br />

Given the fact that multiple specific implementation variants of a HCA exist, and that<br />

their time complexities can differ largely, it seems necessary to provide the user with tools<br />

that allow to predict, for a given consensus clustering problem, which is the most computationally<br />

efficient one. In this sense, a simple methodology for estimating their running time<br />

–and thus, selecting which is the least time consuming– has also been proposed. Despite<br />

its simplicity, the proposed methodology is capable of achieving an accuracy close to 80%<br />

when predicting the fastest serially implemented HCA variant, while this percentage goes

down to nearly 50% in the parallel implementation case. This difference is caused by the<br />

fact that, in the parallel case, the running time estimation is more sensitive to random<br />

deviations of the measured running times the estimation is based upon, as it often ends up<br />

depending on a single execution time sample. However, the impact of incorrect predictions,<br />

when measured in running time overheads with respect to the truly fastest HCA variant,<br />

is well below 10 seconds in a vast majority of the experiments conducted —of course, the<br />

relative importance of such deviations will ultimately depend on the time requirements of<br />

the specific application the HCA is embedded in.<br />


Though put forward in a robust clustering via cluster ensembles framework, hierarchical<br />

consensus architectures can be of interest in any consensus clustering related problem where<br />

cluster ensembles containing a large number of components are involved. Furthermore,<br />

HCAs are directly portable to a fuzzy clustering scenario with no modifications.<br />

In our opinion, the main weakness of this proposal lies in the rather simplistic approach<br />

taken in the running time estimation methodology, which employs the execution time of a<br />

single consensus process run for estimating the time complexity of the whole HCA. Although

experiments have demonstrated that its performance is pretty good, we conjecture that<br />

a possible means for improving it –especially in the parallel case, where lower prediction<br />

accuracies are obtained– would consist in modelling statistically the running times of the<br />

consensus processes the estimation is based on.<br />

7.2 Consensus self-refining procedures<br />

Besides the computational difficulties that have motivated the development of hierarchical<br />

consensus architectures, the use of large cluster ensembles also poses a challenge as far as the<br />

quality of the obtained consensus clustering is concerned. Indeed, the somewhat indiscriminate<br />

generation of clusterings encouraged by our robust clustering via cluster ensembles<br />

proposal may presumably lead to the creation of low quality cluster ensemble components,<br />

which affects the quality of consensus clustering negatively. In order to mitigate the undesired<br />

influence of these components, we have devised an unsupervised strategy for excluding<br />

them from the consensus process.<br />

The rationale of such strategy is the following: starting with a reference clustering,<br />

we measure its similarity (in terms of average normalized mutual information, or φ (ANMI) )<br />

with respect to the l cluster ensemble components. Subsequently, a percentage p of these<br />

components is selected, after ranking them according to their similarity with respect to the<br />

aforementioned reference clustering. Last, the self-refined consensus clustering is obtained

by combining the clusterings included in such reduced cluster ensemble, according to either<br />

a flat or a hierarchical architecture —a decision that can be reliably made using the running<br />

time estimation methodology mentioned earlier.<br />
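The select-and-recombine step just described can be outlined as follows, assuming crisp label vectors, sklearn's normalized_mutual_info_score as the φ(NMI) implementation and a generic consensus_fn placeholder; it is a sketch of the procedure, not the actual implementation.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def self_refine(reference, ensemble, p, consensus_fn):
    """Keep the p% of ensemble components most similar (in NMI terms) to the reference
    clustering and derive the self-refined consensus from that reduced ensemble."""
    similarities = np.array([nmi(reference, labels) for labels in ensemble])
    n_keep = max(1, int(round(len(ensemble) * p / 100.0)))
    keep = np.argsort(-similarities)[:n_keep]             # most similar components first
    return consensus_fn([ensemble[i] for i in keep])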

Following this generic approach, two self-refining strategies have been proposed. They<br />

solely differ in the origin of the clustering used as the reference of the self-refining procedure.<br />

In the first version (denominated consensus based self-refining), the reference clustering is<br />

the result of a previous consensus process conducted upon the cluster ensemble at hand.<br />

In contrast, the second self-refining procedure (referred to as selection based self-refining)<br />

employs one of the cluster ensemble components as the reference clustering, which is selected<br />

by means of a φ(ANMI) maximization criterion.

We would like to highlight the fact that the self-refining procedure is almost fully unsupervised.<br />

The only user intervention is the selection of the percentage p of the l cluster<br />

ensemble components included in the select cluster ensemble the self-refined consensus clustering<br />

is derived upon. In order to minimize the risk of negatively biasing the self-refining procedure results by a suboptimal selection of p, the user is prompted to select not a

single, but a set of values of p. The self-refining procedure will produce a self-refined consensus<br />

clustering for each distinct value of p, selecting a posteriori, in a fully unsupervised<br />

manner, the one with maximum average normalized mutual information with respect to the<br />


cluster ensemble (a selection process that is given the name of supraconsensus (Strehl and<br />

Ghosh, 2002)).<br />

The analysis of the quality (measured in terms of normalized mutual information with<br />

respect to the ground truth) of the set of self-refined consensus clusterings obtained at each<br />

experiment reveals that the proposed self-refining procedure is notably successful, as it is<br />

higher than that of the reference clustering in 83% (for consensus based self-refining) or 56% (for selection based self-refining) of the experiments conducted. Furthermore, we have

also observed that producing multiple self-refined consensus clusterings is a highly beneficial<br />

approach, as the highest quality self-refined clustering is obtained for very disparate values<br />

of p depending on the experiment —from p = 2% to p = 90%—, so it would be quite easy to

select a suboptimal value of p if a single one was chosen. As far as the quality gains induced<br />

by the self-refining procedure are concerned, relative percentage φ (NMI) increases (referred<br />

to the non-refined consensus clustering) higher –and quite often much higher– than 10%<br />

are obtained in a vast majority of the experiments conducted.<br />

A further advantage of the self-refining procedure is its ability to uniformize the quality of<br />

the consensus clustering solutions created by distinct consensus architectures –reducing the<br />

variances between their φ (NMI) scores by a factor of 20–, thus making it easier to decide which<br />

is the most appropriate consensus architecture for a given consensus clustering problem on<br />

computational grounds solely.<br />

However, the good performance of the proposed self-refining procedure is somewhat<br />

tarnished by the limited accuracy of the supraconsensus selection process, which manages<br />

to select the highest quality self-refined consensus clustering in less than half of the

experiments conducted, which causes an average 14% relative φ (NMI) reduction between the<br />

consensus clustering selected by supraconsensus and the top quality one.<br />

For this reason, the main research activities in this area should be directed, in our<br />

opinion, towards the derivation of accurate supraconsensus selection techniques capable of<br />

choosing, in a fully blind manner and as precisely as possible, the highest quality consensus<br />

clustering among a given bunch of them.<br />

Last, we have been pleased to notice that fighting the expected quality decrease suffered

by consensus clusterings created upon large cluster ensembles has also drawn the interest<br />

of other authors. Curiously enough, this issue has also been tackled in (Fern and Lin,<br />

2008) in a very similar fashion to our selection based self-refining procedure, which can be<br />

interpreted as a sign of the good sense of our proposals.<br />

7.3 Multimedia clustering based on cluster ensembles<br />

Undoubtedly, ‘going multimedia’ is a beneficial trend, as it provides a richer vision of<br />

information. However, it poses a challenge when multimodal data is to be processed by<br />

means of unsupervised learning techniques (e.g. clustering), as the existence of multiple<br />

modalities increases the uncertainties about what is the best way to represent, classify or<br />

describe the data. In this sense, intuition tends to suggest that constructive interactions<br />

between the distinct modalities exist, which should lead to a better explanation of the<br />

data. However, it is not clear how this modality fusion should be conducted, either at a<br />

feature level (early fusion) or at a decision level (late fusion). Indeed, our experiments have<br />

demonstrated that early fusion is not always advantageous as regards the quality of the<br />


clustering results —although in other contexts, such as joint multimodal data analysis

and synthesis, it becomes a crucial process (Sevillano et al., 2009).<br />

For this reason, the key point of our approach to robust multimedia clustering consists<br />

in neither prioritizing nor discarding any of the modalities. On the contrary, the user is

encouraged to create clusterings upon each separate modality and on feature level fused<br />

modalities, compiling them all into a multimodal cluster ensemble, upon which a consensus<br />

clustering is created.<br />

Interestingly enough, the application of this strategy –which is nothing but a generalized<br />

version of our approach to robust clustering– naturally calls for the use of hierarchical<br />

consensus architectures, as the existence of multiple (say m) modalities increases cluster<br />

ensemble sizes by a minimum factor of m + 1 (as we consider the m original object representations<br />

plus the one created by their feature level fusion), which poses a computational<br />

challenge to the execution of flat consensus clustering. Furthermore, the hypothetical inclusion<br />

of low quality components in such a large cluster ensemble makes the application of<br />

self-refining procedures an attractive alternative for obtaining good consensus clusterings<br />

upon the aforementioned multimodal cluster ensemble.<br />

In order to evaluate the effect of multimodality in a modular manner, separate consensus<br />

processes have been conducted for each original data modality and for the modality<br />

derived from the early fusion of these. To that effect, a deterministic hierarchical consensus<br />

architecture has been employed in our multimodal consensus clustering experiments, as it<br />

allows a structured construction of consensus clusterings both within and across modalities.<br />

As regards within modality consensus, the results obtained reveal that consensus clusterings<br />

obtained on the multimodal modality (i.e. the one resulting from the early fusion<br />

of the original modalities of the data) attain higher φ (NMI) scores than their unimodal<br />

counterparts in an average 56% of the experiments conducted.<br />

When the unimodal and multimodal consensus clusterings are combined –thus giving<br />

rise to intermodal consensus clusterings– we observe that, in terms of φ (NMI) with respect<br />

to the ground truth, these are better than the unimodal ones in 59.5% of the experiments,

while this percentage is 34.7% when compared to the multimodal consensus clustering.<br />

However, the fairly distinct results obtained depending on the data set and consensus<br />

function employed suggest that creating an intermodal consensus clustering is a pretty<br />

generic way of proceeding, as its occasionally inferior quality can be compensated by means of

a subsequent self-refining procedure followed by an unsupervised supraconsensus selection<br />

of the final consensus clustering.<br />

If the maximum and median quality components of the multimodal cluster ensemble<br />

are taken as reference thresholds for evaluating the robustness of the self-refined consensus<br />

clustering selected by supraconsensus, we observe that it is 36.6% worse than the former and 93.5% better than the latter (measured in relative percentage φ(NMI) variations). As in

the unimodal case, this performance would be improved if a better supraconsensus selection<br />

process was devised —which, as aforementioned, is one of the future work priorities.<br />

As regards the future research lines in the multimodal clustering area, we plan to investigate<br />

early multimodal fusion techniques capable of unveiling constructive interactions<br />

between modalities, besides applying selection based consensus self-refining on the multimodal<br />

cluster ensemble, as we conjecture that will probably yield higher quality clusterings<br />

than those obtained by consensus based self-refining.<br />


7.4 Voting based soft consensus functions<br />

The outcome of a fuzzy clustering process is much more informative than its crisp counterpart,<br />

as it indicates the strength of association of each object to each cluster. Despite<br />

this fact, soft clustering combination strategies are a minority in the consensus clustering<br />

literature. In light of this, we have made several proposals in this area, aiming to extend

all our previous proposals to the more generic framework of soft clustering.<br />

There exists a pretty evident parallelism between the strength of association of each<br />

object to each cluster in a fuzzy clustering solution and the degree of preference of a voter<br />

for a candidate in an election. This fact directly allows the application of certain voting<br />

methods for consolidating soft clusterings, considering the clusters as the candidates, the<br />

cluster ensemble components as voters, and the clusterization of each object as an election.<br />

However, given the ambiguous identification of clusters inherent to clustering, a cluster<br />

alignment between the cluster ensemble components is required prior to voting.<br />

In this work, we have proposed four consensus functions for soft cluster ensembles, which<br />

are the result of applying as many voting strategies for combining the clusterings in the<br />

ensemble. In particular, we have employed two confidence voting methods –the sum and<br />

product rules, which give rise to the SumConsensus (SC) and ProductConsensus (PC) consensus<br />

functions–, and two positional voting techniques —the Borda and Condorcet voting<br />

strategies that constitute the basis of the BordaConsensus (BC) and CondorcetConsensus<br />

(CC) clustering combiners. The main difference between these two families of voting methods<br />

lies in the fact that the former operate directly on the object-to-cluster association<br />

values that make up the cluster ensemble components, whereas the latter operate on the<br />

candidates ranking according to the voters’ preferences. For disambiguating the clusters,<br />

we have employed the classic Hungarian algorithm (Kuhn, 1955).<br />
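For illustration, minimal sketches of the two confidence voting rules over an aligned soft ensemble are given below (Python/NumPy, illustrative only); the small eps term in the product rule is an added numerical safeguard against all-zero columns rather than part of the original formulation.

import numpy as np

def sum_consensus(aligned_ensemble):
    """SumConsensus: add the aligned (k x n) membership matrices and L1-normalise columns."""
    total = np.sum(np.asarray(aligned_ensemble, dtype=float), axis=0)
    return total / total.sum(axis=0, keepdims=True)

def product_consensus(aligned_ensemble, eps=1e-12):
    """ProductConsensus: multiply memberships element-wise and L1-normalise columns."""
    prod = np.prod(np.asarray(aligned_ensemble, dtype=float) + eps, axis=0)
    return prod / prod.sum(axis=0, keepdims=True)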

The experiments conducted have evaluated our four consensus functions (SC, PC, BC<br />

and CC), comparing them with several state-of-the-art soft consensus functions in terms<br />

of their computational complexity and the quality of the consensus clusterings they yield.<br />

In terms of execution time, confidence voting consensus functions are faster than their<br />

positional voting counterparts, as the candidate ranking process penalizes the latter from a<br />

computational standpoint. In this sense, CC is the slowest proposal, due to the exhaustive<br />

pairwise candidate confrontation implicit in the Condorcet voting method. Contrarily, the<br />

more computationally efficient PC and SC consensus functions are as fast or faster than<br />

CSPA, EAC, HGPA and MCLA in a 81% of the experiments conducted —however, they<br />

are slower than VMA in a 92% of the cases.<br />

If the quality of the hardened version of the fuzzy consensus clusterings (measured in terms of φ(NMI) with respect to the ground truth) is used as the comparison factor, we observe that the four proposed consensus functions yield statistically significantly better results than any of the state-of-the-art consensus functions in 72% of the experiments conducted on average, which is a clear indicator of the merit of our proposals. It is important to highlight that it has been impossible to evaluate the fuzzy consensus clusterings output by the four proposed consensus functions directly, due to the unavailability of soft labels in the data sets employed. As a future direction of research, we plan to conduct this fuzzy evaluation, and we conjecture that greater differences between SC, PC, BC and CC will be observed, as the differences between the results of the voting strategies they are based upon are somewhat masked when the fuzzy consensus clusterings they yield are hardened.
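As a point of reference, the hardening and φ(NMI) evaluation step mentioned above can be sketched as follows, assuming scikit-learn's implementation of normalized mutual information with the geometric averaging used by Strehl and Ghosh (2002); the variable names are illustrative only.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def harden(soft_consensus):
    # turn an n x k soft consensus clustering into crisp labels
    return np.argmax(soft_consensus, axis=1)

def phi_nmi(ground_truth, soft_consensus):
    # phi(NMI) between the hardened consensus clustering and the ground truth
    return normalized_mutual_info_score(ground_truth,
                                        harden(soft_consensus),
                                        average_method='geometric')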

Ours is not the first approach to soft consensus clustering based on voting. In fact, the VMA consensus function employs a weighted version of the sum voting rule (Dimitriadou, Weingessel, and Hornik, 2002). However, to our knowledge, BordaConsensus (first introduced in (Sevillano, Alías, and Socoró, 2007b)) and CondorcetConsensus are the first consensus functions based on positional voting.

As mentioned above, our proposals deal with clusterings with a constant number of clusters k, and it would be of paramount interest to adapt them to combine clusterings with different numbers of clusters. A possible way to do so would be to complete those clusterings that have fewer clusters with dummy clusters (Ayad and Kamel, 2008), as suggested in the VMA consensus function (Dimitriadou, Weingessel, and Hornik, 2002).
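For illustration, such a dummy-cluster completion could be sketched as follows; this is a minimal example assuming that the ensemble components are n x k_i soft membership matrices, and pad_with_dummy_clusters is a hypothetical helper rather than part of the cited methods.

import numpy as np

def pad_with_dummy_clusters(membership, k_max):
    # append zero-membership dummy clusters so that every ensemble component
    # has the same number of columns (k_max) before alignment and voting
    n, k = membership.shape
    if k >= k_max:
        return membership
    return np.hstack([membership, np.zeros((n, k_max - k))])

# usage: k_max = max(m.shape[1] for m in ensemble)
#        ensemble = [pad_with_dummy_clusters(m, k_max) for m in ensemble]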

Besides this, possibly the clearest direction for future research in this area is to adapt the simultaneous cluster disambiguation and voting mechanism of VMA, which would probably i) reduce the time complexity of the proposed consensus functions, and ii) require introducing some adjustments to the voting methods employed. Moreover, we are also interested in exploring other existing techniques for solving the cluster disambiguation problem, analyzing their impact both on the quality of the consensus clustering solutions obtained and on the overall computational complexity of the consensus function.



References<br />


Agogino, A. and K. Tumer. 2006. Efficient agent-based cluster ensembles. In Proceedings of<br />

the 5th International Joint Conference on Autonomous Agents and Multi-Agent Systems.<br />

Agrawal, R., J. Gehrke, D. Gunopulos, and P. Raghavan. 1998. Automatic subspace<br />

clustering of high dimensional data for data mining applications. In Proceedings of the<br />

ACM-SIGMOD Conference on the Management of Data, pages 94–105, Seattle, WA,<br />

USA.<br />

Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on<br />

Automatic Control, 19(6):716–722.<br />

Al-Sultan, K. 1995. A Tabu search approach to the clustering problem. Pattern Recognition,<br />

28(9):1443–1451.<br />

Anderberg, M.R. 1973. Cluster Analysis for Applications. Monographs and Textbooks on<br />

Probability and Mathematical Statistics. New York: Academic Press, Inc.<br />

Anderson, L.W., D.R. Krathwohl, P.W. Airasian, K.A. Cruikshank, R.E. Mayer, P.R. Pintrich,<br />

J. Raths, and M.C. Wittrock. 2001. A Taxonomy for Learning, Teaching, and<br />

Assessing – A Revision of Bloom’s Taxonomy of Educational Objectives. Addison Wesley<br />

Longman, Inc.<br />

Aslam, J.-A. and M. Montague. 2001. Models for metasearch. In Proceedings of the 24th<br />

ACM SIGIR Conference, pages 276–284, New Orleans, LA, USA.<br />

Asuncion, A. and D.J. Newman. 1999. UCI Machine Learning Repository.<br />

http://www.ics.uci.edu/∼mlearn/MLRepository.html. University of California, Irvine,<br />

School of Information and Computer Sciences.<br />

Ayad, H.G. and M.S. Kamel. 2008. Cumulative Voting Consensus Method for Partitions<br />

with Variable Number of Clusters. IEEE Transactions on Pattern Analysis and Machine<br />

Intelligence, 30(1):160–173.<br />

Ball, G.H. and D.J. Hall. 1965. ISODATA, a novel method of data analysis and classification.<br />

Technical Report, Stanford University, Stanford, CA, USA.<br />

Barnard, K., P. Duygulu, D.A. Forsyth, N. De Freitas, D.M. Blei, and M.I. Jordan. 2003.<br />

Matching Words and Pictures. Journal on Machine Learning Research, 3:1107–1135.<br />

Barnard, K. and D.A. Forsyth. 2001. Learning the semantics of words and pictures. In<br />

Proceedings of the IEEE International Conference on Computer Vision, volume II, pages<br />

408–415.<br />

Barthelemy, J.P., B. Laclerc, and B. Monjardet. 1986. On the use of ordered sets in

problems of comparison and consensus of classifications. Journal of Classification, 3:225–<br />

256.<br />

Bekkerman, R. and J. Jeon. 2007. Multi-modal clustering for multimedia collections. In<br />

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern<br />

Recognition, pages 1–8.<br />


Ben-Hur, A., D. Horn, H. Siegelmann, and V. Vapnik. 2001. Support vector clustering.<br />

Journal on Machine Learning Research, 2:125–137.<br />

Benitez, A.B and S.F. Chang. 2002. Perceptual knowledge construction from annotated<br />

image collections. In Columbia University ADVENT, pages 26–29.<br />

Berkhin, P. 2002. Survey of clustering data mining techniques. Available<br />

online at http://www.accrue.com/products/rp cluster review.pdf or<br />

http://citeseer.nj.nec.com/berkhin02survey.html.<br />

Biggs, J.B. and K. Collis. 1982. Evaluating the Quality of Learning: the SOLO taxonomy.<br />

New York: Academic Press.<br />

Bingham, E. and H. Mannila. 2001. Random projection in dimensionality reduction: applications<br />

to image and text data. In Proceedings of the 7th ACM SIGKDD International<br />

Conference on Knowledge Discovery and Data Mining, pages 245–250, San Francisco,<br />

CA, USA.<br />

Bloom, B.S. 1956. Taxonomy of Educational Objectives: The Classification of Educational<br />

Goals. Susan Fauer Company, Inc.<br />

Borda, J.C. de. 1781. Memoire sur les Elections au Scrutin. Histoire de l'Académie Royale des Sciences, Paris.

Boulis, C. and M. Ostendorf. 2004. Combining multiple clustering systems. In J.F. Boulicaut,<br />

F. Esposito, F. Giannotti, and D. Pedreschi, editors, Proceedings of the 8th European<br />

Conference on Principles and Practice of Knowledge Discovery in Databases,<br />

LNCS vol. 3202, pages 63–74. Springer.<br />

Brachman, R. and T. Anand. 1996. The process of knowledge discovery in databases: A<br />

human-centered approach. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,<br />

editors, Advances in Knowledge Discovery and Data Mining, pages 37–58, Menlo<br />

Park, CA, USA. AAAI Press.<br />

Buehren, M. 2008. Functions for the rectangular assignment problem.<br />

http://www.mathworks.com/matlabcentral/fileexchange/6543.<br />

Buscaldi, D. and P. Rosso. 2007. Upv-wsd : Combining different wsd methods by means of<br />

fuzzy borda voting. In Proceedings of the International SemEval Workshop, ACL 2007,<br />

pages 434–437, Prague, Czech Republic.<br />

Cai, D., X. He, Z. Li, W.Y. Ma, and J.R. Wen. 2004. Hierarchical clustering of www<br />

image search results using visual, textual and link information. In Proceedings of the<br />

12th Annual ACM International Conference on Multimedia, pages 952–959.<br />

Calinski, R.B. and J. Harabasz. 1974. A Dendrite Method for Cluster Analysis. Communications<br />

in Statistics, 3:1–27.<br />

Carpenter, G. and S. Grossberg. 1987. A massively parallel architecture for a self-organizing<br />

neural pattern recognition machine. Computer Vision, Graphics and Image Processing,<br />

37:54–115.<br />


Carpenter, G., S. Grossberg, and D. Rosen. 1991. Fuzzy ART: Fast stable learning and<br />

categorization of analog patterns by an adaptive resonance system. Neural Networks,<br />

4:759–771.<br />

Carreira-Perpiñán, M.A. 1997. A review of dimension reduction techniques. Technical<br />

Report CS-96-09, Department of Computer Science, University of Sheffield, Sheffield,<br />

UK.<br />

Chakaravathy, S.V. and J. Ghosh. 1996. Scale based clustering using a radial basis function<br />

network. IEEE Transactions on Neural Networks, 2(5):1250–1261.<br />

Cheeseman, P. and J. Stutz. 1996. Bayesian classification (Autoclass): theory and results.<br />

In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in<br />

Knowledge Discovery and Data Mining, pages 153–180, Menlo Park, CA. AAAI Press.<br />

Chen, F., U. Gargi, L. Niles, and H. Schütze. 1999. Multi-modal browsing of images in<br />

web documents. In D. Doermann, editor, Proceedings of the 1999 Symposium on Image<br />

Understanding Technology, pages 265–276, Annapolis, MD, USA. UMD.<br />

Chen, N. 2006. A Survey of Indexing and Retrieval of Multimodal Documents: Text and<br />

Images. Technical Report 2006-505, School of Computing, Queens University, Kingston,<br />

Ontario, Canada.<br />

Chiang, J. and P. Hao. 2003. A new kernel-based fuzzy clustering approach: support vector<br />

clustering with cell growing. IEEE Transactions on Fuzzy Systems, 11(4):518–527.<br />

Chu, S. and J. Roddick. 2000. A clustering algorithm using the Tabu search approach with<br />

simulated annealing. In N. Ebecken and C. Brebbia, editors, Data Mining II–Proceedings<br />

of the 2nd International Conference on Data Mining Methods and Databases, pages 515–<br />

523.<br />

Cobo, G., X. Sevillano, F. Alías, and J.C. Socoró. 2006. Técnicas de representación de textos para clasificación no supervisada de documentos. Journal of the Spanish Society for Natural Language Processing (Procesamiento del Lenguaje Natural), in Spanish, 37:329–336.

Condorcet, M. de. 1785. Essai sur l'application de l'analyse à la probabilité des decisions rendues à la pluralité des voix.

Cover, T.M. and J.A. Thomas. 1991. Elements of information theory. John Wiley and<br />

Sons.<br />

Cutting, D.R., D.R. Karger, J.O. Pedersen, and J.W. Tukey. 1992. Scatter/Gather: a<br />

cluster-based approach to browsing large document collections. In Proceedings of the<br />

15th annual international ACM SIGIR conference on Research and Development in<br />

Information Retrieval, pages 318–329, Copenhagen, Denmark, June.<br />

Dasgupta, S., C. Papadimitriou, and U. Vazirani. 2006. Algorithms. McGraw-Hill.<br />

Davies, D.L and D.W. Bouldin. 1979. A Cluster Separation Measure. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 1:224–227.<br />


Deerwester, S., S.-T. Dumais, G.-W. Furnas, T.-K. Landauer, and R. Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Denoeud, L. and A. Guénoche. 2006. Comparison of distance indices between partitions.<br />

In V. Batagelj, H.H. Bock, A. Ferligoj, and A. Žiberna, editors, Data Science and

Classification, pages 21–28. Springer.<br />

Dhillon, I., J. Fan, and Y. Guan. 2001. Efficient clustering of very large document collections.<br />

In R.L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R.R. Namburu,<br />

editors, Data Mining for Scientific and Engineering Applications. Kluwer Academic<br />

Publishers.<br />

Dietterich, T.G. 2000. Ensemble methods in machine learning. In J. Kittler and F. Roli,<br />

editors, Multiple Classifier Systems, LNCS vol. 1857, pages 1–15. Springer.<br />

Dimitriadou, E., A. Weingessel, and K. Hornik. 2001. Voting-merging: An ensemble<br />

method for clustering. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial<br />

Neural Networks-ICANN 2001, LNCS vol. 2130, pages 217–224. Springer.<br />

Dimitriadou, E., A. Weingessel, and K. Hornik. 2002. A combination scheme for fuzzy<br />

clustering. International Journal of Pattern Recognition and Artificial Intelligence,<br />

16(7):901–912.<br />

Ding, C., X. He, and H.D. Simon. 2005. On the equivalence of nonnegative matrix factorization<br />

and spectral clustering. In Proceedings of the 2005 SIAM Conference on Data<br />

Mining, pages 606–610.<br />

Duda, R.O., P.E. Hart, and D.G. Stork. 2001. Pattern Classification. Wiley Interscience.<br />

Dudoit, S. and J. Fridlyand. 2003. Bagging to Improve the Accuracy of a Clustering<br />

Procedure. Bioinformatics, 19(9):1090–1099.<br />

Dunn, J.C. 1973. A fuzzy relative of the ISODATA process and its use in detecting compact<br />

well-separated clusters. Journal on Cybernetics, 3:32–57.<br />

Duygulu, P., K. Barnard, N. de Freitas, and D. Forsyth. 2002. Object recognition as<br />

machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of<br />

the Seventh European Conference on Computer Vision, volume 4, pages 97–112. Springer<br />

Verlag.<br />

Dy, J.G. and C.E. Brodley. 2004. Feature Selection for Unsupervised Learning. Journal of<br />

Machine Learning Research, 5:845–889.<br />

Ertz, L., M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and<br />

densities in noisy, high dimensional data. In Proceedings of the 2nd SIAM International<br />

Conference on Data Mining, pages 47–58, San Francisco, CA, USA.<br />

Ester, M., H.P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering<br />

clusters in large spatial data sets with noise. In Proceedings of the 2nd International<br />

Conference on Knowledge Discovery and Data Mining, pages 226–231, Portland, OR,<br />

USA.<br />


Fayyad, U. 1996. Data mining and knowledge discovery: making sense out of data. IEEE<br />

Expert: Intelligent Systems and Their Applications, 11(5):20–25.<br />

Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth. 1996. From data mining to knowledge<br />

discovery: an overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,<br />

editors, Advances in Knowledge Discovery and Data Mining, pages 1–30, Menlo<br />

Park, CA, USA. AAAI Press.<br />

Fenty, J. 2004. Analyzing distances. The Stata Journal, 4(1):1–26.<br />

Fern, X.Z. and C.E. Brodley. 2003. Random Projection for High Dimensional Data Clustering:<br />

A Cluster Ensemble Approach. In Proceedings of 20th International Conference<br />

on Machine Learning, Washington DC, VA, USA.<br />

Fern, X.Z. and C.E. Brodley. 2004. Solving cluster ensemble problems by bipartite graph<br />

partitioning. In Proceedings of the 21st International Conference on Machine Learning,<br />

pages 281–288.<br />

Fern, X.Z. and W. Lin. 2008. Cluster ensemble selection. In Proceedings of the 2008 SIAM<br />

International Conference on Data Mining.<br />

Filkov, V. and S. Skiena. 2004. Integrating microarray data by consensus clustering.<br />

International Journal of Artificial Intelligence Tools, 13(4):863–880.<br />

Fischer, B. and J.M. Buhmann. 2003. Bagging for path-based clustering. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 25(11):1411–1415.<br />

Focardi, S.M. 2001. Clustering economic and financial time series: exploring the existence<br />

of stable correlation conditions. Technical Report 2001-04, The Intertek Group, Paris,<br />

France.<br />

Fodor, I.K. 2002. A survey of dimension reduction techniques. Technical Report UCRL-<br />

ID-148494, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory,

Livermore, CA.<br />

Forgy, E. 1965. Cluster analysis of multivariate data: efficiency vs. interpretability of<br />

classifications. Biometrics, 21:768–780.<br />

Fowlkes, E.B. and C.L. Mallows. 1983. A method for comparing two hierarchical clusterings.<br />

Journal of the American Statistical Association, 78(383):553–569.<br />

Fred, A. 2001. Finding consistent clusters in data partitions. In J. Kittler and F. Roli,<br />

editors, Multiple Classifier Systems, LNCS vol. 2096, pages 309–318. Springer.<br />

Fred, A. and A.K. Jain. 2002a. Data clustering using evidence accumulation. In Proceedings<br />

of the 16th International Conference on Pattern Recognition, pages 276–280.<br />

Fred, A. and A.K. Jain. 2002b. Evidence accumulation clustering based on the k-means<br />

algorithm. In T. Caelli, A. Amin, R.P.W. Duin, M. Kamel, and D. de Ridder, editors,<br />

Structural, Syntactic, and Statistical Pattern Recognition, LNCS vol. 2396, pages 442–<br />

451. Springer.<br />


Fred, A. and A.K. Jain. 2003. Robust data clustering. In Proceedings of the 2003 IEEE<br />

Computer Society Conference on Computer Vision and Pattern Recognition, volume 2,

pages 128–133.<br />

Fred, A. and A.K. Jain. 2005. Combining Multiple Clusterings Using Evidence Accumulation.<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850.<br />

Gao, B., T.Y. Liu, T. Qin, X. Zheng, Q.S. Cheng, and W.Y. Ma. 2005. Web image clustering<br />

by consistent utilization of visual features and surrounding texts. In Proceedings of the<br />

13th Annual ACM International Conference on Multimedia, pages 112–121.<br />

Gionis, A., H. Mannila, and P. Tsaparas. 2007. Clustering Aggregation. ACM Transactions<br />

on Knowledge Discovery from Data, 1(1):1–30.<br />

Goder, A. and V. Filkov. 2008. Consensus clustering algorithms: Comparison and refinement.<br />

In Proceedings of the 2008 SIAM Workshop on Algorithm Engineering and<br />

Experiments (ALENEX), pages 109–117.<br />

Gonzàlez, E. and J. Turmo. 2006. Unsupervised Document Clustering by Weighted Combination.<br />

LSI Research Report LSI-06-17-R, Departament de Llenguatges i Sistemes<br />

Informátics, Barcelona.<br />

Gonzàlez, E. and J. Turmo. 2008a. Comparing Non-Parametric Ensemble Methods. In<br />

E. Kapetanios, V. Sugumaran, and M. Spiliopoulou, editors, Proceedings of the 13th<br />

International Conference on Applications of Natural Language to Information Systems,

LNCS vol. 5039, pages 245–256. Springer.<br />

Gonzàlez, E. and J. Turmo. 2008b. Non-Parametric Document Clustering by Ensemble<br />

Methods. Procesamiento del Lenguaje Natural, 40:91–98.<br />

Gopal, S. and C. Woodcock. 1994. Theory and methods for accuracy assessment of thematic<br />

maps using fuzzy sets. Photogrammetric Engineering and Remote Sensing, 60(2):181–<br />

188.<br />

Greene, D. and P. Cunningham. 2006. Efficient ensemble methods for document clustering.<br />

Technical Report CS-2006-48, Trinity College Dublin.<br />

Greene, D., A. Tsymbal, N. Bolshakova, and P. Cunningham. 2004. Ensemble clustering in<br />

medical diagnostics. In Proceedings of the 17th IEEE Symposium on Computer-Based<br />

Medical Systems, pages 576–581. IEEE Computer Society.<br />

Gunes, H. and M. Piccardi. 2005. Affect recognition from face and body: early fusion vs.<br />

late fusion. In Proceedings of 2005 IEEE International Conference on Systems, Man<br />

and Cybernetics, vol. 4, pages 3437–3443.<br />

Hadjitodorov, S.T. and L.I. Kuncheva. 2007. Selecting diversifying heuristics for cluster<br />

ensembles. In M. Haindl, J. Kittler, and F. Roli, editors, Proceedings of the 7th International<br />

Workshop on Multiple Classifier Systems, LNCS vol. 4472, pages 200–209.<br />

Springer.<br />

Hadjitodorov, S.T., L.I. Kuncheva, and L.P. Todorova. 2006. Moderate diversity for better<br />

cluster ensembles. Information Fusion, 7(3):264–275.<br />


Halkidi, M., Y. Batistakis, and M. Vazirgiannis. 2002a. Cluster Validity Methods : Part I.<br />

ACM SIGMOD Record, 31(2):40–45.<br />

Halkidi, M., Y. Batistakis, and M. Vazirgiannis. 2002b. Cluster Validity Methods : Part<br />

II. ACM SIGMOD Record, 31(3):19–27.<br />

Hall, L., I. Özyurt, and J. Bezdek. 1999. Clustering with a genetically optimized approach.<br />

IEEE Transactions on Evolutionary Computation, 3(2):103–112.<br />

Hastad, J., B. Just, J.C. Lagarias, and C.P. Schnorr. 1988. Polynomial Time Algorithms

for Finding Integer Relations among Real Numbers. SIAM Journal of Computing,<br />

18(5):859–881.<br />

He, Z., X. Xu, and S. Deng. 2005. A cluster ensemble method for clustering categorical

data. Information Fusion, 6(2):143–151.<br />

Hearst, M.A. 2006. Clustering versus faceted categories for information exploration. Communications<br />

of the ACM, 49(4):59–61.<br />

Hettich, S. and S.D. Bay. 1999. The UCI KDD Archive. http://kdd.ics.uci.edu. University<br />

of California at Irvine, Dept. of Information and Computer Science.<br />

Hinneburg, A. and D. Keim. 1998. An efficient approach to clustering in large multimedia<br />

data sets with noise. In Proceedings of the 4th International Conference on Knowledge<br />

Discovery and Data Mining, pages 58–65.<br />

Hofmann, T. and J. Buhmann. 1997. Pairwise data clustering by deterministic annealing.<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14.<br />

Höppner, F., F. Klawonn, and R. Kruse. 1999. Fuzzy Cluster Analysis: Methods for<br />

Classification, Data Analysis, and Image Recognition. Wiley.<br />

Hore, P., L. Hall, and D. Goldgof. 2006. A Cluster Ensemble Framework for Large Data

sets. In Proceedings of the 2006 IEEE International Conference on Systems, Man and<br />

Cybernetics, volume 4, pages 3342–3347.<br />

Hoyer, P.O. 2004. Non-Negative Matrix Factorization with Sparseness Constraints. Journal<br />

on Machine Learning Research, 5:1457–1469.<br />

Hubert, L. and P. Arabie. 1985. Comparing Partitions. Journal of Classification, 2:193–<br />

218.<br />

Hyvärinen, A. 1999. Fast and Robust Fixed-Point Algorithms for Independent Component<br />

Analysis. IEEE Trans. on Neural Networks, 10(3):626–634.<br />

Hyvärinen, A., J. Karhunen, and E. Oja. 2001. Independent Component Analysis. John<br />

Wiley and Sons.<br />

ImageCLEF. accessed on May 2009. The CLEF cross language image retrieval track.<br />

http://www.imageclef.org.<br />


Ingaramo, D., D. Pinto, P. Rosso, and M. Errecalde. 2008. Evaluation of internal validity<br />

measures in short-text corpora. In A. Gelbukh, editor, Proceedings of the 9th<br />

International Conference on Intelligent Text Processing and Computational Linguistics,<br />

volume 4919 of Lecture Notes in Computer Science, pages 555–567. Springer Verlag,<br />

Berlin, Heidelberg, New York.<br />

InternetWorldStats.com. accessed on February 2009. Internet usage statistics february<br />

2009. http://www.internetworldstats.com/stats.htm.<br />

Jäger, G. and U. Benz. 2000. Measures of Classification Accuracy Based on Fuzzy Similarity.<br />

IEEE Transactions on Geoscience and Remote Sensing, 38(3):1462–1467.<br />

Jain, A.K. 1996. Image segmentation using clustering. In K. Bowyer and N. Ahuja, editors,<br />

Advances in Image Understanding. IEEE Press.<br />

Jain, A.K. and R.C. Dubes. 1988. Algorithms for clustering data. Prentice Hall.<br />

Jain, A.K., M.N. Murty, and P.J. Flynn. 1999. Data Clustering: a Survey. ACM Computing<br />

Surveys, 31(3):264–323.<br />

Jakobsson, M. and N.A. Rosenberg. 2007. CLUMPP: a cluster matching and permutation<br />

program for dealing with label switching and multimodality in analysis of population<br />

structure. Bioinformatics, 23:1801–1806.<br />

Jiang, D., C. Tang, and A. Zhang. 2004. Cluster analysis for gene expression data: a<br />

survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370–1386.<br />

Jolliffe, I.T. 1986. Principal Component Analysis. Springer.<br />

Jomier, J., V. LeDigarcher, and S.R. Aylward. 2005. Comparison of vessel segmentations<br />

using STAPLE. In J. Duncan, editor, Proceedings of the 8th International Conference on<br />

Medical Image Computing and Computer-Assisted Intervention, pages 523–530, LNCS<br />

3749. Springer.<br />

Kaban, A. and M. Girolami. 2000. Unsupervised Topic Separation and Keyword Identification<br />

in Document Collections: A Projection Approach. Technical Report No. 10, Dept.<br />

of Computing and Information Systems, University of Paisley.<br />

Käki, M. 2005. Findex: search result categories help users when document ranking fails. In<br />

Proc. ACM SIGCHI Int’l Conference on Human Factors in Computing Systems, pages<br />

131–140. ACM Press.<br />

Kalska, E.P. 2005. Dissimilarity Representations in Pattern Recognition. Ph.D. thesis,<br />

Delft University of Technology, The Netherlands.<br />

Karayiannis, N., J. Bezdek, N. Pal, R. Hathaway, and P. Pai. 1996. Repairs to GLVQ: A<br />

new family of competitive learning schemes. IEEE Transactions on Neural Networks,<br />

7(5):1062–1071.<br />

Karypis, G., R. Aggarwal, V. Kumar, and S. Shekhar. 1997. Multilevel hypergraph partitioning:<br />

applications in VLSI domain. In Proceedings of the 34th Design and Automation<br />

Conference, pages 526–529.<br />


Karypis, G., E. Han, and V. Kumar. 1999. Chameleon: Hierarchical clustering using<br />

dynamic modeling. IEEE Computer, 32(8):68–75.<br />

Karypis, G. and V. Kumar. 1998. A fast and high quality multilevel scheme for partitioning<br />

irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392.<br />

Kaski, S. 1998. Dimensionality Reduction by Random Mapping: Fast Similarity Computation<br />

for Clustering. In Proceedings of the International Joint Conference on Neural<br />

Networks, pages 413–418, Anchorage, AK, USA.<br />

Kaufman, L. and P. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster<br />

Analysis. New York, NY: John Wiley and Sons.<br />

Kleinberg, J. 2002. An impossibility theorem for clustering. Proceedings of the 2002<br />

Conference on Advances in Neural Information Processing Systems, 15:463–470.<br />

Klosgen, W., J.M. Zytkow, and J. Zyt. 2002. Handbook of Data Mining and Knowledge<br />

Discovery. USA: Oxford University Press.<br />

Kohavi, R. and G.H. John. 1998. The wrapper approach. In H. Liu and H. Motoda,<br />

editors, Feature Extraction, Construction and Selection: A Data Mining Perspective,<br />

pages 33–50. Springer-Verlag.<br />

Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480.<br />

Kotsiantis, S. and P. Pintelas. 2004. Recent advances in clustering: a brief survey. WSEAS<br />

Transactions on Information Science and Applications, 1(1):73–81.<br />

Kuhn, H. 1955. The Hungarian Method for the Assignment Problem. Naval Research<br />

Logistic Quarterly, 2:83–97.<br />

Kuncheva, L.I., S.T. Hadjitodorov, and L.P. Todorova. 2006. Experimental comparison<br />

of cluster ensemble methods. In Proceedings of the 9th International Conference on<br />

Information Fusion, pages 24–28.<br />

La Cascia, M., S. Sethi, and S. Sclaroff. 1998. Combining textual and visual cues for

content-based image retrieval on the World Wide Web. In Proceedings of the IEEE<br />

Workshop on Content-Based Access of Image and Video Libraries, pages 24–28.<br />

Lange, T. and J.M. Buhmann. 2005. Combining partitions by probabilistic label aggregation.

In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge<br />

Discovery and Data Mining, pages 147–156. ACM Press.<br />

Larsen, B. and C. Aone. 1999. Fast and effective text mining using linear time document

clustering. In Proceedings of the 5th International Conference on Knowledge Discovery<br />

and Data Mining, pages 16–22.<br />

Lee, D.D. and H.S. Seung. 1999. Learning the Parts of Objects by Non-Negative Matrix<br />

Factorization. Nature, 401:788–791.<br />

Lee, D.D. and H.S. Seung. 2001. Algorithms for Non-Negative Matrix Factorization. Advances<br />

in Neural Information Processing Systems, 13.<br />


Li, S.Z. and G. GuoDong. 2000. Content-based Audio Classification and Retrieval using<br />

SVM Learning. In Proceedings of the 1st IEEE Pacific-Rim Conference on Multimedia<br />

(Invited talk).<br />

Li, T., C. Ding, and M.I. Jordan. 2007. Solving Consensus and Semi-supervised Clustering<br />

Problems Using Nonnegative Matrix Factorization. In Proceedings of the 7th IEEE<br />

International Conference on Data Mining, pages 577–582.<br />

Lin, J. and D. Gunopulos. 2003. Dimensionality Reduction by Random Projection and<br />

Latent Semantic Indexing. In Proceedings of the 2003 SIAM Conference on Data Mining.

Linnaeus, C. 1758. Systema Naturae per regna tria naturae, secundum classes, ordines,<br />

genera, species, cum characteribus, differentiis, synonymis, locis. Editio decima, reformata.<br />

Holmiae: Laurentius Salvius.

Liu, W. and Y. Luo. 2005. Applications of clustering data mining in customer analysis<br />

in department store. In Proceedings of the 2005 IEEE International Conference on<br />

Services Systems and Services Management, volume 2, pages 1042–1046.<br />

Loeff, N., C. Ovesdotter-Alm, and D.A. Forsyth. 2006. Discriminating image senses by clustering<br />

with multimodal features. In Proceedings of the COLING/ACL 2006 Conference,<br />

pages 547–554.<br />

Long, B., Z.M. Zhang, and P.S. Yu. 2005. Combining Multiple Clusterings by Soft Correspondence.<br />

In Proceedings of the 5th IEEE International Conference on Data Mining,<br />

pages 282–289.<br />

Maimon, O. and L. Rokach. 2005. Data Mining and Knowledge Discovery Handbook. New<br />

York: Springer.<br />

Mancas-Thillou, C. and B. Gosselin. 2007. Natural scene text understanding. In G. Obinata<br />

and A. Dutta, editors, Vision Systems, Segmentation and Pattern Recognition, pages<br />

307–333, Vienna, Austria. I-Tech Education and Publishing.<br />

Maulik, U. and S. Bandyopadhyay. 2002. Performance Evaluation of Some Clustering<br />

Algorithms and Validity Indices. IEEE Transactions on Pattern Analysis and Machine<br />

Intelligence, 24(12):1650–1654.<br />

McLachlan, G. and T. Krishnan. 1997. The EM Algorithm and Extensions. New York:

Wiley.<br />

Meila, M. 2003. Comparing clusterings by the variation of information. In B. Scholkopf and<br />

M.K. Warmuth, editors, Proceedings of the 16th Annual Conference on Computational<br />

Learning Theory, pages 173–187, LNAI 2777. Springer.<br />

Minaei-Bidgoli, B., A. Topchy, and W.F. Punch. 2004. Ensembles of partitions via data<br />

resampling. In Proceedings of the 2004 International Conference on Information Technology,<br />

volume 2, pages 188–192.<br />

Mirkin, B.G. 1975. On the problem of reconciling partitions. In H.M. Blalock, A. Aganbegian,<br />

F.M. Borodkin, R. Boudon, and V. Capecchi, editors, Quantitative Sociology: International<br />

Perspectives on Mathematical and Statistical Modelling—Quantitative Studies<br />

in Social Relations, pages 441–449, New York. Academic Press.<br />


Miyajima, K. and A. Ralescu. 1993. Modeling of Natural Objects Including Fuzziness and<br />

Application to Image Understanding. In Proceedings of the 2nd IEEE International<br />

Conference on Fuzzy Systems, pages 1049–1054.<br />

Molina, L.C., L. Belanche, and A. Nebot. 2002. Feature selection algorithms: a survey and<br />

experimental evaluation. In Proceedings of the 2002 IEEE International Conference on<br />

Data Mining, pages 306–313.<br />

Montague, M. and J.A. Aslam. 2002. Condorcet Fusion for Improved Retrieval. In Proceedings<br />

of the 2002 ACM Conference on Information and Knowledge Management, pages<br />

538–548.<br />

NetCraft.com. accessed on February 2009. February 2009 web server survey.<br />

http://news.netcraft.com/archives/2009/02/index.html.<br />

Neumann, D.A. and V.T. Norton. 1986. Clustering and isolation in the consensus problem<br />

for partitions. Journal of Classification, 3:281–298.<br />

Ng, A., M.I. Jordan, and Y. Weiss. 2002. On spectral clustering: analysis and an algorithm.<br />

In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information<br />

Processing Systems, volume 14. MIT Press.<br />

Nguyen, N. and R. Caruana. 2007. Consensus Clusterings. In Proceedings of the 7th IEEE<br />

International Conference on Data Mining, pages 607–612.<br />

Oja, E. 1992. Principal components minor components, and linear neural networks. Neural<br />

Networks, 5:927–935.<br />

Patrikainen, A. and M. Meila. 2005. Spectral Clustering for Microsoft Netscan Data. Technical<br />

report, UW-CSE-2005-06-05, Department of Computer Science and Engineering,<br />

University of Washington, Seattle, July.<br />

Piatetsky-Shapiro, G. 1991. Knowledge Discovery in Real Databases: A Report on the<br />

IJCAI-89 Workshop. AI Magazine, 11(5):68–70.<br />

Pinto, D.E. 2008. On Clustering and Evaluation of Narrow Domain Short-Text Corpora.<br />

Ph.D. thesis, Universidad Politécnica de Valencia, July.<br />

Pinto, F.R., J.A. Carriço, M. Ramirez, and J.S. Almeida. 2007. Ranked Adjusted Rand:<br />

integrating distance and partition information in a measure of clustering agreement.<br />

BMC Bioinformatics, 8(44):1–13.<br />

Punera, K. and J. Ghosh. 2007. Soft Consensus Clustering. In J. Oliveira and W. Pedrycz,<br />

editors, Advances in Fuzzy Clustering and its Applications, pages 69–92. Wiley.<br />

Rand, W.M. 1971. Objective criteria for the evaluation of clustering methods. Journal of<br />

the American Statistics Association, 66:846–850.<br />

Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464.<br />

Scott, G., D. Clark, and T. Pham. 2001. A genetic clustering algorithm guided by a descent<br />

algorithm. In Proceedings of the Congress on Evolutionary Computation, volume 2,<br />

pages 734–740, Piscataway, NJ, USA.<br />


Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing<br />

Surveys, 34(1):1–47.<br />

Selim, S. and K. Al-Sultan. 1991. A simulated annealing algorithm for the clustering<br />

problems. Pattern Recognition, 24(10):1003–1008.<br />

Sevillano, X., F. Alías, and J.C. Socoró. 2007b. BordaConsensus: a New Consensus Function<br />

for Soft Cluster Ensembles. In Proceedings of the 30th ACM SIGIR Conference,<br />

pages 743–744, Amsterdam, The Netherlands, July.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2006a. Feature Diversity in Cluster<br />

Ensembles for Robust Document Clustering. In Proceedings of the 29th ACM SIGIR<br />

Conference, pages 697–698, Seattle, WA, USA, August.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2007a. A hierarchical consensus architecture<br />

for robust document clustering. In G. Amati, C. Carpineto, and G. Romano,<br />

editors, Proceedings of 29th European Conference on Information Retrieval, volume 4425<br />

of Lecture Notes in Computer Science, pages 741–744, Rome, Italy. Springer Verlag.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2007c. Text clustering on latent thematic<br />

spaces: Variants, strenghts and weaknesses. In M.E. Davies, C.C. James, S.A. Abdallah,<br />

and M.D. Plumbley, editors, Proceedings of 7th International Conference on Independent<br />

Component Analysis and Signal Separation, volume 4666 of Lecture Notes in Computer<br />

Science, pages 794–801, London, UK. Springer Verlag.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2006b. Robust Document Clustering by<br />

Exploiting Feature Diversity in Cluster Ensembles. Journal of the Spanish Society for<br />

Natural <strong>La</strong>nguage Processing (Procesamiento del Lenguaje Natural), 37:169–176.<br />

Sevillano, X., J. Melenchón, G. Cobo, J.C. Socoró, and F. Alías. 2009. Audiovisual analysis<br />

and synthesis for multimodal human-computer interfaces. In M. Redondo, C. Bravo,<br />

and M. Ortega, editors, Engineering the User Interface: From Research to Practice,<br />

pages 179–194, London. Springer Verlag.<br />

Shafiei, M., S. Wang, R. Zhang, E. Milios, B. Tang, J. Tougas, and R. Spiteri. 2006.<br />

A Systematic Study of Document Representation and Dimension Reduction for Text<br />

Clustering. Technical report, CS-2006-05, Dalhousie University.<br />

Shahnaz, F., M.W. Berry, V.P. Pauca, and R.J. Plemmons. 2004. Document clustering<br />

using nonnegative matrix factorization. Information Processing and Management,<br />

42:373–386.<br />

Sharan, R. and R. Shamir. 2000. CLICK: A clustering algorithm with applications to gene<br />

expression analysis. In Proceedings of the 8th International Conference on Intelligent<br />

Systems for Molecular Biology, pages 307–316.<br />

Sheikholeslami, C., S. Chatterjee, and A. Zhang. 1998. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Data sets. In A. Gupta, O. Shmueli, and J. Widom, editors, Proceedings of the 24th International Conference on Very Large Data Bases, pages 428–439, New York, NY, USA. Morgan Kaufmann.

Shi, J. and J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 22(8):888–905.<br />


Snoek, C.G.M., M. Worring, and A.W.M. Smeulders. 2005. Early versus <strong>La</strong>te Fusion in<br />

Semantic Video Analysis. In Proceedings of the 13th ACM International Conference on<br />

Multimedia, pages 399–402.<br />

Srinivasan, S.H. 2002. Features for Unsupervised Document Classification. In Proceedings<br />

of the 6th Conference on Computational Natural Language Learning, pages 36–42,

Taipei, Taiwan.<br />

Stein, B. and O. Niggemann. 1999. On the nature of structure and its identification. In<br />

P. Widmayer, G. Neyer, and S. Eidenbenz, editors, Proceedings of the 25th International<br />

Workshop on Graph-Theoretic Concepts in Computer Science, volume 1665 of Lecture<br />

Notes in Computer Science, pages 122–134. Springer Verlag, Berlin, Heidelberg, New<br />

York.<br />

Steinbach, M., G. Karypis, and V. Kumar. 2004. A comparison of common document<br />

clustering techniques. In Proceedings of the KDD Workshop on Text Mining, pages<br />

17–26, Boston, MA, USA.<br />

Strehl, A. 2002. Relationship-based Clustering and Cluster Ensembles for High-dimensional<br />

Data Mining. Ph.D. thesis, Faculty of the Graduate School of The University of Texas<br />

at Austin, May.<br />

Strehl, A. and J. Ghosh. 2002. Cluster Ensembles – A Knowledge Reuse Framework for<br />

Combining Multiple Partitions. Journal on Machine Learning Research, 3:583–617.<br />

Tang, B., X. Luo, M.I. Heywood, and M. Shepherd. 2004. A Comparative Study of Dimension<br />

Reduction Techniques for Document Clustering. Technical Report CS-2004-14,<br />

Faculty of Computer Science, Dalhousie University, Halifax, Canada.<br />

Tang, B., M. Shepherd, E. Milios, and M.I. Heywood. 2005. Comparing and Combining<br />

Dimension Reduction Techniques for Efficient Text Clustering. In Proceedings of the<br />

International Workshop on Feature Selection for Data Mining, pages 17–26, Newport<br />

Beach, CA, USA.<br />

Theodoridis, S. and K. Koutroumbas. 1999. Pattern Recognition. Academic Press.<br />

Tjhi, W.C. and L. Chen. 2007. Dual Fuzzy-possibilistic Co-clustering for Document Categorization<br />

and Summarization. In Optimization-based Data Mining Techniques with<br />

Applications Workshop of the IEEE International Conference on Data Mining, pages<br />

259–264.<br />

Tjhi, W.C. and L. Chen. 2009. Dual Fuzzy-possibilistic Co-clustering for Categorization of<br />

Documents. IEEE Transactions on Fuzzy Systems. Accepted for future publication (as<br />

of May 2009).<br />

Tombros, A., R. Villa, and C.J. van Rijsbergen. 2002. The effectiveness of query-specific<br />

hierarchic clustering in information retrieval. International Journal on Information<br />

Processing and Management, 38(4):559–582.<br />

Topchy, A., A.K. Jain, and W. Punch. 2003. Combining Multiple Weak Clusterings. In<br />

Proceedings of the 3rd IEEE International Conference on Data Mining, pages 331–338,<br />

Melbourne, FLA, USA.<br />


Topchy, A., A.K. Jain, and W. Punch. 2004. A Mixture Model for Clustering Ensembles.<br />

In Proceedings of the 2004 SIAM Conference on Data Mining, pages 379–390.<br />

Topchy, A., M. Law, A.K. Jain, and A. Fred. 2004. Analysis of consensus partition in

clustering ensemble. In Proceedings of the 4th International Conference on Data Mining,<br />

pages 225–232, Brighton, UK.<br />

Torkkola, K. 2003. Discriminative features for text document classification. Pattern Analysis<br />

and Applications, 6(4):301–308.<br />

Tseng, L. and S. Yang. 2001. A genetic approach to the automatic clustering problem.<br />

Pattern Recognition, 34:415–424.<br />

Turnbull, D., L. Barrington, D. Torres, and G. Lanckriet. 2007. Towards Musical Query-by-Semantic-Description

using the CAL500 Dataset. In Proceedings of the 30th ACM<br />

SIGIR Conference, pages 439–446, Amsterdam, The Netherlands, July.<br />

Valencia, A. 2002. Search and retrieve. EMBO Reports, 3(5):396–400.<br />

van Dongen, S. 2000. Performance criteria for graph clustering and Markov cluster experiments.<br />

Technical Report INS-R0012, Centrum voor Wiskunde en Informatica.<br />

van Erp, M., L. Vuurpijl, and L. Schomaker. 2002. An overview and comparison of voting<br />

methods for pattern recognition. In Proceedings of the Eighth International Workshop<br />

on Frontiers in Handwriting Recognition, pages 195–200, Ontario, Canada, August.<br />

van Rijsbergen, C.J. 1979. Information Retrieval. Buttersworth-Heinemann.<br />

VideoCLEF. accessed on May 2009. The CLEF cross language video retrieval track.<br />

http://www.cdvp.dcu.ie/VideoCLEF.<br />

von Luxburg, U. 2006. A tutorial on spectral clustering. Technical Report TR-149, Department<br />

for Empirical Inference, Max Planck Institute for Biological Cybernetics.<br />

Wallace, C. and D. Dowe. 1994. Intrinsic classification by MML – the Snob program.<br />

In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pages<br />

37–44, Armidale, Australia.<br />

Wang, W., J. Yang, and R. Muntz. 1997. Sting: A statistical information grid approach<br />

to spatial data mining. In M. Jarke, M.J. Carey, K.R. Dittrich, F.H. Lochovsky,<br />

P. Loucopoulos, and M.A. Jeusfeld, editors, Proceedings of the 23rd International Conference<br />

on Very Large Data Bases, pages 186–195, Athens, Greece. Morgan Kaufmann.

Witten, I.H. and E. Frank. 2005. Data mining: practical machine learning tools and<br />

techniques. Morgan Kauffman Publishers.<br />

Wunsch, D., T. Caudell, C. Capps, R. Marks, and R. Falk. 1993. An optoelectronic<br />

implementation of the adaptive resonance neural network. IEEE Transactions on Neural<br />

Networks, 4(4):673–684.<br />

www.who.int. accessed on February 2009. World Health Organization International Classification<br />

of Diseases (ICD). http://www.who.int/classifications/icd/en/.<br />


Xu, R. and D. Wunsch II. 2005. Survey of Clustering Algorithms. IEEE Transactions on<br />

Neural Networks, 16(2):645–678.<br />

Xu, W., X. Liu, and Y. Gong. 2003. Document Clustering Based on Non-Negative Matrix<br />

Factorization. In Proceedings of the 26th ACM SIGIR Conference, volume 2, pages<br />

267–273, Toronto, Canada.<br />

Yang, J. and S. Olafsson. 2005. Near-optimal feature selection. In Proceedings of the<br />

International Workshop on Feature Selection for Data Mining, pages 27–34, Newport<br />

Beach, CA.<br />

Zahn, C.T. 1971. Graph-theoretical methods for detecting and describing Gestalt clusters.<br />

IEEE Transactions on Computers, 20(1):68–86.<br />

Zeng, Y., J. Tang, J. Garcia-Frias, and G.R. Gao. 2002. An adaptive meta-clustering<br />

approach: combining the information from different clustering results. In Proceedings<br />

of the IEEE Computer Society Conference on Bioinformatics, pages 276–287.<br />

Zhao, R. and W.I. Grosky. 2002. Negotiating the semantic gap: from feature maps to<br />

semantic landscapes. Pattern Recognition, 35:593–600.<br />

Zhao, Y. and G. Karypis. 2001. Criterion functions for document clustering: Experiments<br />

and analysis. Technical Report TR #0140, Department of Computer Science, University<br />

of Minnesota, Minneapolis.<br />

Zhao, Y. and G. Karypis. 2003a. Clustering in life sciences. In A. Khodursky and Brownstein<br />

M., editors, Functional Genomics: Methods and Protocols, pages 183–218. Humana<br />

Press.<br />

Zhao, Y. and G. Karypis. 2003b. Hierarchical Clustering Algorithms for Document<br />

Datasets. Technical Report UMN CS #03-027, Department of Computer Science, University<br />

of Minnesota, Minneapolis.<br />



Appendix A<br />

Experimental setup<br />

A.1 The CLUTO clustering package<br />

All the clustering algorithms employed in the experimental sections of this work have been extracted from the CLUTO clustering toolkit. In its authors' words, "CLUTO is a software package for clustering low- and high-dimensional data sets and for analyzing the characteristics of the obtained clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, geographic information systems, science, and biology". It is available online for download at http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download. We chose CLUTO as our clustering algorithm provider due to its ease of use, robustness, speed and scalability, as CLUTO's algorithms have been optimized for operating on very large data sets, both in terms of the number of objects (up to ∼10^5) and the number of features (up to ∼10^4).

CLUTO provides clustering algorithms based on the partitional, agglomerative, and graph-partitioning paradigms. Most of the algorithms implemented in CLUTO treat clustering as an optimization problem, thus seeking to maximize or minimize a particular clustering criterion function, which can be defined either globally or locally over the entire clustering solution space. As in any clustering process, computing the value of these criterion functions requires measuring the similarity between the objects in the data set. This means that, in order to apply a specific CLUTO clustering algorithm, it is necessary to select the desired:

– clustering strategy (clustering paradigm and specific implementation): CLUTO includes six implementations of partitional, hierarchical agglomerative and graph-based clustering strategies.

– criterion function: CLUTO provides a total of eleven criterion functions for driving its clustering algorithms.

– similarity measure: CLUTO allows measuring the similarity between objects using four distinct alternatives.

The six implementations of partitional, hierarchical agglomerative and graph-based clustering strategies available in the CLUTO clustering toolkit are briefly described in the following paragraphs:

1. direct: this method computes the desired k-way clustering solution by simultaneously finding all k clusters.

2. rb: repeated-bisecting clustering process, in which the desired k-way clustering solution is computed by performing a sequence of k − 1 repeated bisections of the data set. At each bisecting step, one of the obtained clusters is selected and bisected further, so that each partial 2-way clustering solution optimizes the selected clustering criterion function locally.

3. rbr: a refined repeated-bisecting method that performs a global optimization of the clustering solution obtained by the rb algorithm.

4. agglo: agglomerative hierarchical clustering that locally optimizes the selected criterion function, stopping the agglomeration process when k clusters are obtained.

5. bagglo: biased agglomerative clustering, which applies the agglo clustering method on an augmented representation of the objects, created by concatenating the d original attributes of each object and √n new features which are proportional to the similarity between that object and its cluster centroid according to a √n-way partitional clustering solution that is initially computed on the data set by means of the rb algorithm.

6. graph: graph-based clustering, in which the data set is modelled as a nearest-neighbor graph (each object is a vertex connected to the vertices representing its most similar objects) that is partitioned into k clusters according to one of the graph criterion functions.
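As an illustration of how the rb strategy proceeds, the following simplified sketch (not CLUTO code) performs k − 1 successive bisections with a 2-way k-means, always splitting the currently largest cluster; CLUTO itself selects the cluster to bisect and optimizes each split according to the chosen criterion function.

import numpy as np
from sklearn.cluster import KMeans

def repeated_bisection(X, k, random_state=0):
    # simplified repeated-bisecting clustering: start from a single cluster and
    # perform k-1 bisections (assumes the largest cluster always holds >= 2 objects)
    labels = np.zeros(len(X), dtype=int)
    for new_label in range(1, k):
        target = np.argmax(np.bincount(labels))      # largest cluster so far
        idx = np.where(labels == target)[0]
        split = KMeans(n_clusters=2, n_init=10,
                       random_state=random_state).fit_predict(X[idx])
        labels[idx[split == 1]] = new_label          # one half keeps the old label
    return labels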

An enumeration of the eleven criterion functions implemented in the CLUTO software<br />

package follows (Zhao and Karypis, 2001):<br />

a. i1 : internal criterion function that maximizes the sum of the average pairwise similarities between the objects assigned to each cluster, weighted according to the cluster's size. Its maximization is equivalent to minimizing the sum of squared distances between the objects in the same cluster, as in traditional k-means (Zhao and Karypis, 2001).

b. i2 : internal criterion function that maximizes the similarity between each object and<br />

the centroid of the cluster it is assigned to.<br />

c. e1 : external criterion function that minimizes the proximity between each cluster’s<br />

centroid and the common centroid of the rest of the data set.<br />

d. h1 : hybrid criterion function that simultaneously maximizes i1 and minimizes e1.

e. g1 : MinMaxCut criterion function applied on the graph obtained by computing pairwise<br />

object similarities, partitioning the objects into groups by minimizing the edgecut<br />

of each partition (for graph-based clustering only).<br />

f. g1p: normalized Cut criterion function applied on the graph obtained by viewing the<br />

objects and their features as a bipartite graph, simultaneously partitioning the objects<br />

and their features (for graph-based clustering only).<br />


g. slink: traditional single-link criterion function (for agglomerative hierarchical clustering<br />

only).<br />

h. wslink: cluster-weighted single-link criterion function (for agglomerative hierarchical<br />

clustering only).<br />

i. clink: traditional complete-link criterion function (for agglomerative hierarchical clustering<br />

only).<br />

j. wclink: cluster-weighted complete-link criterion function (for agglomerative hierarchical<br />

clustering only).<br />

k. upgma: traditional unweighted pair-group method with arithmetic means criterion<br />

function (for agglomerative hierarchical clustering only).<br />

Finally, as regards the similarity measures that can be employed by the clustering algorithms<br />

implemented in CLUTO, they are described next (Zhao and Karypis, 2001):<br />

i. cos: the similarity between objects is computed using the cosine function.<br />

ii. corr : the similarity between objects is computed using the correlation coefficient.<br />

iii. dist: the similarity between objects is computed to be inversely proportional to the<br />

Euclidean distance (for graph-based clustering only).<br />

iv. jacc: the similarity between objects is computed using the extended Jaccard coefficient<br />

(for graph-based clustering only).<br />
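To make the role of the criterion functions and similarity measures more tangible, the sketch below evaluates the i2 criterion (the sum of cosine similarities between each object and the centroid of its cluster) for a given partition; it is an illustrative reimplementation based on the definitions in (Zhao and Karypis, 2001), not CLUTO code.

import numpy as np

def cosine_sim(a, b):
    # the 'cos' similarity measure between two object vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def i2_criterion(X, labels):
    # i2: sum over all objects of the cosine similarity between each object and
    # the centroid of the cluster it is assigned to (to be maximized)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += sum(cosine_sim(x, centroid) for x in members)
    return total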

For further insight on the distinct implementations of the clustering strategies, formal<br />

definitions of the criterion functions and similarity measures, or the criterion functions<br />

optimization procedure, the interested reader is referred to (Zhao and Karypis, 2001; Zhao<br />

and Karypis, 2003b).<br />

As the reader may infer from the previous enumerations, not all the clustering strategy-criterion function-similarity measure combinations are possible in CLUTO. Table A.1 presents which triplets are allowed (denoted by √), which are not allowed (denoted by ×), and which have been employed in our experiments (denoted by •): 28 out of the 68 combinations allowed by CLUTO.

In the experiments, each specific algorithm is identified by the clustering strategy-similarity measure-criterion function triplet employed, e.g. agglo-cos-slink (agglomerative clustering using the single link criterion and measuring object proximity with the cosine similarity), graph-jacc-i2 (graph-based clustering using the internal criterion function #2 and the extended Jaccard coefficient), etc.

A.2 Data sets<br />

In this work we have applied clustering processes on a total of sixteen data sets, twelve unimodal and four multimodal. In this section, we present their main features, such as their origin, the number of objects they contain (denoted throughout this work by n), the number (d) and meaning of their attributes, and the expected number of categories (k).

Table A.1: Cross-option table indicating which clustering strategy-criterion function-similarity measure combinations are allowed (√), not allowed (×) and employed in our experiments (•). Its rows correspond to the six clustering strategies (rb, rbr, direct, agglo, bagglo and graph), each combined with the four similarity measures (cos, corr, dist and jacc), and its columns correspond to the eleven criterion functions (i1, i2, e1, h1, g1, g1p, slink, wslink, clink, wclink and upgma).



A.2.1 Unimodal data sets<br />


A total of twelve unimodal data sets have been used in the experimental sections of this thesis. Unless noted otherwise, these data sets have been obtained from two classic public data repositories of the data mining and machine learning research communities: the UCI Knowledge Discovery in Databases Archive (Hettich and Bay, 1999) and the UCI Machine Learning Repository (Asuncion and Newman, 1999). In the following paragraphs, we present a brief description of each data set, summarizing their most relevant characteristics in Table A.2 as a quick reference source.

1. Zoo: the goal of this data set is to learn to classify animals into seven classes given 17<br />

binary attributes representing features such as the presence of hair, feathers, backbone<br />

or teeth, or whether it is an aquatic or airborne animal, among others. The number<br />

of objects (i.e. animals) in the data set is 101.<br />

2. Iris: a classic data set in machine learning and pattern recognition. It contains 150<br />

objects (instances of Iris plants) represented by four real-valued features measuring<br />

the width and length of its petals and sepals. The goal is to classify the objects into<br />

one of the three classes of Iris plants, one of which is linearly separable from the<br />

others, while the latter two are not linearly separable from each other.<br />

3. Wine: this data set’s goal is to determine the origin of wines by means of chemical<br />

analysis. It contains 178 samples of wine which must be categorized into three wine<br />

classes based on their contents of 13 constituents such as alcohol, malic acid, or<br />

magnesium, represented as real-valued features.<br />

4. Glass: in this data set, 214 instances of glass are represented by 10 real-valued attributes<br />

corresponding to their contents in chemical elements such as aluminium,<br />

sodium, calcium, etc. The goal is to classify the objects into one of the predefined six<br />

categories (types of glass).<br />

5. Ionosphere: the contents of this data set are 351 radar returns from the ionosphere,<br />

classified either as good or bad depending on whether they show evidence of some<br />

type of structure in the ionosphere or not. Each radar return is described by 34<br />

autocorrelation-based real-valued features.<br />

6. WDBC : its complete name is Wisconsin Diagnostic Breast Cancer data set. It contains<br />

569 objects (breast mass images) represented by 32 real-valued features describing<br />

characteristics of the cell nuclei present in the image (radius, texture, perimeter,

etc.). The goal is to classify these objects into one of the possible cancer diagnostics<br />

(malignant or benign).<br />

7. Balance: this data set was generated to model psychological experimental results.<br />

Each of the 625 objects is classified into three classes (as having the balance scale tip<br />

to the right, tip to the left, or balanced). The integer-valued attributes are the left<br />

weight, the left distance, the right weight, and the right distance.<br />

8. Mfeat: its original name is Multiple Features data set, as it represents the objects<br />

it contains (handwritten numerals from 0 to 9) using different real-valued features<br />

such as Fourier coefficients, profile correlations, Karhunen-Loève coefficients, pixel<br />

averages, Zernike moments and morphological attributes.<br />

221


A.2. Data sets<br />

Data set Number of Number of Number of Class<br />

name objects (n) attributes (d) classes (k) imbalance<br />

Zoo 101 17 7 40.6%–3.9%<br />

Iris 150 4 3 33.3%–33.3%<br />

Wine 178 13 3 39.9%–26.9%<br />

Glass 214 10 6 35.5%–4.2%<br />

Ionosphere 351 34 2 64.1%–35.9%<br />

WDBC 569 32 2 62.7%–37.3%<br />

Balance 625 4 3 46.1%–7.8%<br />

Mfeat 2000 649 10 10%–10%<br />

miniNG 2000 6679 20 5%–5%<br />

Segmentation 2100 19 7 14.3%–14.3%<br />

BBC 2225 6767 5 22.9%–17.3%<br />

PenDigits 7494 16 10 10.4%–9.6%<br />

Table A.2: Summary of the unimodal data sets employed in the experimental sections of<br />

this thesis. The “Class imbalance” column presents the percentage of objects in the data<br />

set belonging to the most and least populated categories, respectively.<br />

9. miniNG: this is a reduced version of the 20 Newsgroups text data set, as it contains only 2000 objects (text articles posted on Usenet) belonging to one of the 20 predefined thematic classes (e.g. sci.electronics, rec.sport.baseball or talk.politics.mideast). Typical text preprocessing steps such as the removal of stop words and of terms appearing in fewer than 4 documents (document frequency thresholding) give rise to a bag-of-words representation of each article in a 6679-dimensional tfidf-weighted (i.e. real-valued) term space (Sebastiani, 2002); a sketch of this kind of preprocessing pipeline is given after this list.

10. Segmentation: known as the Image Segmentation data set, it contains 2100 outdoor image regions represented by 19-dimensional real-valued feature vectors that should

be classified into one of seven texture classes: brickface, sky, foliage, cement, window,<br />

path and grass. We have employed the test subset of the Segmentation collection.<br />

11. BBC : this data set has been obtained from the online repository of the Machine Learning<br />

Group of the University College Dublin (http://mlg.ucd.ie/content/view/21/). It<br />

consists of 2225 documents from the BBC news website corresponding to stories in<br />

five topical areas (business, entertainment, politics, sport, tech). The original documents’<br />

representation used a 9636-dimensional term space which was reduced to 6767<br />

real-valued attributes after removing those terms with a document frequency smaller<br />

or equal to 4 (Sebastiani, 2002).<br />

12. PenDigits: its original name is Pen-Based Recognition of Handwritten Digits data<br />

set, whose training subset contains 7494 digitized handwritten digits (from 0 to 9)<br />

captured using a pressure sensitive tablet. Each object is represented by 16 integer<br />

attributes corresponding to the (x, y) coordinates of the electronic pen on the tablet<br />

sampled every 100 milliseconds.
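The document-frequency thresholding and tfidf weighting mentioned for the text collections (miniNG and BBC) can be summarized with a minimal sketch. The snippet below uses scikit-learn as an illustrative stand-in for the Matlab-based preprocessing actually used in this work; the toy corpus, the English stop-word list and the exact tokenization are assumptions, not the real setup.

```python
# Minimal sketch of the bag-of-words preprocessing described for the text
# collections: stop-word removal, document-frequency thresholding (terms in
# fewer than 4 documents are dropped) and tfidf weighting. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus (one string per article); the shared term "clustering"
# survives the min_df=4 threshold, the remaining terms are pruned.
documents = [
    "clustering of usenet electronics articles",
    "clustering of baseball game reports",
    "clustering of middle east politics threads",
    "clustering of space mission discussions",
    "clustering of medical newsgroup posts",
]

vectorizer = TfidfVectorizer(stop_words="english", min_df=4)
X_termspace = vectorizer.fit_transform(documents)   # n x d sparse tfidf matrix
print(vectorizer.get_feature_names_out(), X_termspace.shape)
```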

A.2.2 Multimodal data sets<br />


A total of four multimodal data sets have been used in the experimental sections of this<br />

thesis. Three of these data sets are multimodal in nature, whereas the remaining one has<br />

been generated artificially by the combination of two unimodal data collections. In the<br />

following paragraphs, we present a brief description of each data set, summarizing their<br />

most relevant features in table A.3 as a quick reference source.<br />

1. CAL500: the Computer Audition Lab 500-song data set is a collection of five hundred

Western popular songs represented by means of two modalities: acoustic features and<br />

textual annotations (Turnbull et al., 2007). As regards the acoustic modality, we have<br />

employed the mean and standard deviations of the original real-valued delta MFCC<br />

(Mel-Frequency Cepstrum Coefficients) features as the acoustic attributes of each song<br />

(Li and GuoDong, 2000). As described in (Turnbull et al., 2007), the text modality<br />

was generated by means of an auditory experience survey, where 55 listeners were<br />

asked to annotate each song with several terms extracted from a musically relevant<br />

174-word vocabulary. As a result, each song was annotated with those terms assigned<br />

by at least three listeners. These annotating terms describe song-related semantic<br />

concepts as instruments (e.g. bass, piano, trumpet), vocals (e.g. falsetto, breathy,<br />

aggressive), usage (e.g. at a party, waking up, driving) or emotion (e.g. cheerful,

calming, tender). In order to evaluate the clustering processes conducted on this<br />

data set, we have used the annotating term “Best Genre” as the class of each object<br />

(it reflects the musical genre which best fits each song 1 ). There exist sixteen “Best<br />

Genre” labels, such as Alternative, Classic Rock, Punk, Country, and so on. Finally,<br />

we selected the subset of 297 songs which are assigned a single “Best Genre” label.<br />

2. IsoLetters: this multimodal data collection is the result of the artificial fusion of two<br />

unimodal data sets of the UCI Machine Learning Repository (Asuncion and Newman,<br />

1999): Isolet and LetterRecognition. Both data sets contain the same type of object:<br />

the 26 letters of the English alphabet. Whereas the Isolet collection is oriented to<br />

the spoken letter recognition problem (each object is the name of a letter uttered<br />

by a speaker, and represented by a total of 617 real-valued acoustic attributes such<br />

as spectral coefficients, contour features, or sonorant features), the LetterRecognition<br />

data set contains 16 visual features (statistical moments and edge counts) extracted<br />

from black and white images of the twenty-six capital letters of the English alphabet.<br />

Thus, IsoLetters, the multimodal data collection we have created as a combination<br />

of both unimodal data sets, pursues the goal of recognizing letters upon acoustic and<br />

visual features. The total number of objects in the IsoLetters data set (1559) is fixed<br />

by the size of the test subset of the Isolet collection.<br />

3. InternetAds: this data set (which is also found in the UCI Machine Learning Repository)<br />

represents a set of images which are possible advertisements on Internet pages.<br />

The goal is to classify them as an advertisement or not. Each object is described by<br />

means of 1558 real-valued attributes, including image geometry features, its textual<br />

caption, the alt text, phrases occurring in the URL, the image's URL, the anchor text, and words occurring near the anchor text. We consider this data set as multimodal

1<br />

So as to avoid biasing the results, all the genre-related annotating terms were eliminated, reducing the<br />

size of the vocabulary to 127 terms.<br />


Data set name (modalities)        Number of objects (n)   Attributes per mode (d1/d2)   Number of classes (k)   Class imbalance
CAL500 (audio/text)               297                     78/127                        16                      14.5%–1.3%
IsoLetters (speech/image)         1559                    617/16                        26                      3.8%–3.8%
InternetAds (object/collateral)   2359                    133/1425                      2                       83.8%–16.2%
Corel (image/text)                3960                    500/371                       44                      2.2%–2.3%

Table A.3: Summary of the multimodal data sets employed in the experimental sections of this thesis. The “Class imbalance” column presents the percentage of objects in the data set belonging to the most and least populated categories, respectively.

in the sense that some attributes are directly related to the object (such as the image geometry features, the caption and the alt text, which total 133 features), while the remaining 1425 features refer to collateral elements such as the anchor text or

the URL. We have removed those objects in the data set with missing features (28%<br />

of the total), obtaining a reduced version of the InternetAds data set containing 2359<br />

objects.<br />

4. Corel: this is a rather classic multimodal data set (Duygulu et al., 2002), consisting of<br />

5000 text-annotated images from 50 Corel Stock Photo CDs. Each CD contains 100<br />

images on the same topic, such as “Sunrises and Sunsets”, “Mountains of America”<br />

or “Wild Animals” (Bekkerman and Jeon, 2007). Experiments have been conducted<br />

on the subset of 3960 images of the training subset of the Corel collection which are<br />

assigned at least one topic. The visual modality is codified as follows (Duygulu et<br />

al., 2002): images, represented using 33 color features, are segmented into regions,<br />

and these regions are clustered into 500 smaller connected regions (aka blobs), which<br />

are deemed visual terms, so that each image can be expressed in terms of these. As<br />

regards the textual modality, every image has a caption (i.e. a brief description of the<br />

scene) and an annotation (a list of objects appearing in the image). The vocabulary<br />

contains 371 words, and the term vectors are parameterized using the tfidf weighting<br />

scheme (van Rijsbergen, 1979).<br />

A.3 Data representations<br />

A.3.1 Unimodal data representations<br />

As mentioned in section 1.1, in this work objects are expressed in terms of d numerical<br />

attributes, so each object is represented as a column vector x ∈ R^d. Therefore, a whole

data set containing n objects is mathematically expressed by means of a d × n matrix X.<br />

This original object representation is referred to as baseline throughout the thesis.<br />

Starting from the baseline representation, four other object representations have been<br />


generated by applying the following well-known feature extraction techniques 2 : Principal<br />

Component Analysis (PCA), Independent Component Analysis (ICA), Non-Negative Matrix<br />

Factorization (NMF) and Random Projection (RP). Besides providing diversity as regards data representation, these techniques are also employed for dimensionality reduction purposes. In this work, we refer to the reduced dimensionality of the resulting feature space by r, which takes a whole range of values in the interval [3, d]. As a result of each feature extraction procedure, a batch of r × n matrices (X^r_PCA, X^r_ICA, X^r_NMF or X^r_RP) is obtained.

The following paragraphs are devoted to a brief description of the main concepts regarding the aforementioned feature extraction techniques; an illustrative code sketch covering the four of them is provided after this list.

– Principal Component Analysis, which is one of the most typical feature extraction<br />

techniques, is based on projecting the data onto a dimensionally reduced feature space<br />

such that i) the newly obtained features are decorrelated, and ii) the variance of the

original data is maximally retained. For these reasons, PCA is said to be capable of<br />

removing data redundancies while keeping the most relevant information contained<br />

in the data. There exist several ways for conducting PCA, from the eigenanalysis of<br />

the covariance matrix of X (Jolliffe, 1986) to neural network approaches (Oja, 1992).<br />

In this work, PCA is implemented by means of Singular Value Decomposition (SVD),<br />

following a similar approach to that of Latent Semantic Analysis (Deerwester et al.,

1990). More specifically, given the d×n data matrix X, its SVD is expressed according<br />

to equation (A.1).<br />

X = U · Σ · V^T    (A.1)

Matrix Σ contains the singular values of X ordered in decreasing order and the<br />

columns of matrices U and V are the left and right singular vectors of X, respectively.<br />

Dimensionality reduction is achieved by retaining the r largest singular values in Σ and the corresponding columns of matrix V, so that the r × n matrix X^r_PCA = Σ_r · V_r^T –where Σ_r and V_r are the reduced versions of the singular values and right singular vectors matrices, respectively– will contain the location of the n objects in the r-dimensional PCA space, where clustering is conducted.

– Independent Component Analysis (ICA) can be regarded as an extension of PCA<br />

for non-Gaussian data (Xu and Wunsch II, 2005), in which the projected data components<br />

are forced to be statistically independent—a stronger condition than PCA’s<br />

decorrelation. Being tightly bound to the blind source separation problem (Hyvärinen,<br />

Karhunen, and Oja, 2001), the application of ICA for feature extraction usually assumes<br />

the existence of a generative model that, in its simplest version, defines the<br />

observed data as the result of an unknown linear noiseless combination of r statistically<br />

independent latent variables (the so-called independent components). The<br />

goal of ICA algorithms is to recover the independent components making no further<br />

2 We chose feature extraction over feature selection given its greater ease of application (Jain, Murty, and Flynn, 1999), as our main goal is creating representational diversity, not elaborating on object representations. In informal experiments not reported here, other object representations based on feature selection plus a change of basis (Srinivasan, 2002) were also tested but finally discarded as, in general terms, they gave rise to lower quality clustering results.


assumption than their statistical independence. The application of ICA in feature extraction<br />

is usually preceded by PCA with dimensionality reduction, as this procedure<br />

is equivalent to the usual whitening step that simplifies ICA algorithms (Hyvärinen,<br />

Karhunen, and Oja, 2001). Applying ICA on the PCA data yields an estimation of<br />

the r independent latent variables which generated the observed data:<br />

X r ICA = WX r PCA<br />

(A.2)<br />

where matrix W is known as the separating matrix. Equation (A.2) can be interpreted<br />

as a linear transformation of the data through its projection on the basis vectors<br />

contained in the rows of the separating matrix W. In this work, a version of the<br />

FastICA algorithm (Hyvärinen, 1999) that maximizes the skewness of the data is<br />

employed for obtaining the ICA representation of the data (Kaban and Girolami,<br />

2000).<br />

– Non-Negative Matrix Factorization (NMF) (Lee and Seung, 1999) is a feature extraction<br />

technique based on linear representations of non-negative data—i.e. NMF can<br />

only be applied when the original representation of the data is non-negative. Intuitively,<br />

NMF can be interpreted as a linear generative model somewhat similar to that<br />

of ICA but subject to non-negativity constraints, as it factorizes the non-negative data<br />

matrix X into the approximate product of two non-negative matrices W and H, as<br />

defined in equation (A.3). Thus, it can be argued that the data set is generated by<br />

the sum of a set of the latent non-negative variables contained in matrix H, while the<br />

elements of W are the weights of their linear combination.<br />

X ≈ W · H, where X^r_NMF = H    (A.3)

From a practical viewpoint, i) NMF is usually implemented by means of iterative<br />

algorithms which try to minimize a cost function proportional to the reconstruction<br />

error ||X−W·H|| (Lee and Seung, 2001), and ii) dimensionality reduction is achieved<br />

by setting the respective sizes of the factorization matrices W and H to d × r and<br />

r × n at the time of computing the approximate factorization of equation (A.3).<br />

In this work, the NMF-based object representation XNMF is obtained by applying a<br />

mean square reconstruction error minimization algorithm from NMFPACK, a software<br />

package for NMF in Matlab (Hoyer, 2004). Besides its use as a feature extraction<br />

technique, the vision of NMF as a means for obtaining a parts-based description of<br />

the data has motivated alternative NMF-based clustering strategies (Xu, Liu, and<br />

Gong, 2003; Shahnaz et al., 2004), alongside studies on the theoretical connections<br />

between NMF and classic clustering approaches (Ding, He, and Simon, 2005).<br />

Compared to PCA and ICA, NMF is advantageous as the non-negativity of its basis<br />

vectors favours their interpretability. On the flip side, the derivation of the NMF<br />

representation is usually more computationally demanding.<br />

– Random Projection (RP) is a computationally efficient dimensionality reduction technique,<br />

proposed as an alternative to those feature extraction techniques that become<br />

too costly when the dimensionality of the original representation (d) is very high

(Kaski, 1998). The rationale behind RP is pretty straightforward: a dimensionality<br />

reduction method is effective as long as the distance between the objects in the<br />

226


Appendix A. Experimental setup<br />

original feature space is approximately preserved in the reduced r-dimensional space.<br />

Allowing for this fact, Kaski proved that this could be achieved using a random linear<br />

mapping embodied in a r × d random projection matrix R, where the columns of R<br />

are realizations of independent and identically distributed (i.i.d.) zero-mean normal<br />

variables, scaled to have unit length (Fodor, 2002).<br />

X^r_RP = R · X    (A.4)

Several experimental studies bear witness to the fact that i) RP takes a fraction of the time required for executing other feature extraction techniques such as PCA or ICA, among others, and ii) clustering results on RP data representations are sometimes comparable to or even better than those obtained using, for instance, PCA —which

somehow reinforces the notion of the data representation indeterminacy outlined in<br />

section 1.4 (Kaski, 1998; Bingham and Mannila, 2001; Lin and Gunopulos, 2003; Tang<br />

et al., 2005).<br />
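As announced above, the following is an illustrative sketch of the four feature extraction steps for a single target dimensionality r. It relies on NumPy and scikit-learn as stand-ins for the Matlab implementations actually used in this work (SVD-based PCA, the skewness-maximizing FastICA variant, NMFPACK and Kaski-style random projections); the synthetic d × n matrix, the fixed r and the library defaults are assumptions made only for the sake of the example.

```python
# Illustrative sketch of the four feature extraction steps (PCA via SVD,
# ICA, NMF and Random Projection) for a single target dimensionality r.
import numpy as np
from sklearn.decomposition import FastICA, NMF

rng = np.random.default_rng(0)
d, n, r = 20, 100, 5                      # attributes, objects, reduced dimensionality
X = np.abs(rng.standard_normal((d, n)))   # d x n data matrix (non-negative, so NMF applies)

# --- PCA via SVD (cf. equation A.1): keep the r largest singular values ---
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = np.diag(S[:r]) @ Vt[:r, :]        # r x n, X^r_PCA = Sigma_r * V_r^T

# --- ICA applied on the PCA data (cf. equation A.2) ---
ica = FastICA(n_components=r, random_state=0)
X_ica = ica.fit_transform(X_pca.T).T      # r x n estimate of the independent components

# --- NMF (cf. equation A.3): X ~ W H, the reduced representation is H ---
nmf = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X.T)                # sklearn factorizes (n x d) ~ (n x r)(r x d)
X_nmf = W.T                               # r x n reduced representation

# --- Random Projection (cf. equation A.4): r x d Gaussian matrix, unit-length columns ---
R = rng.standard_normal((r, d))
R /= np.linalg.norm(R, axis=0, keepdims=True)
X_rp = R @ X                              # r x n
```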

A.3.2 Multimodal data representations<br />

As regards the representation of the objects of the multimodal data sets described in section<br />

A.2.2, two distinct approaches have been followed. Firstly, unimodal representations have<br />

been created for each mode separately, applying the same strategies as in the unimodal<br />

data sets—thus, we will not expand on this point. And secondly, we have generated truly<br />

multimodal data representations by combining both modalities. We elaborate on this latter<br />

issue in the following paragraphs.<br />

The simple concatenation of the baseline feature vectors of both modalities (previously<br />

normalized to unit length³) gives rise to the multimodal baseline representation, represented

on a (d1 + d2)-dimensional attribute space —where d1 and d2 are the dimensionalities of<br />

the baseline representation of each modality.<br />

Subsequently, the feature extraction techniques described in section A.3.1 (i.e. PCA,<br />

ICA, RP and NMF—this latter only when data is non-negative) are applied on the multimodal<br />

baseline representation, yielding representations of dimensionalities r ∈ [3,d1 + d2].<br />

This procedure, known as early fusion or feature-level fusion in the literature, is a common<br />

strategy for creating representations of multimodal data from unimodal representations,<br />

and it has been applied in content-based image retrieval (La Cascia, Sethi, and Sclaroff,

1998; Zhao and Grosky, 2002), semantic video analysis (Snoek, Worring, and Smeulders,<br />

2005), human affect recognition (Gunes and Piccardi, 2005), audiovisual video sequence<br />

analysis (Sevillano et al., 2009), as well as multimodal clustering (Benitez and Chang, 2002).
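A minimal sketch of this feature-level fusion step is given below, assuming two hypothetical d1 × n and d2 × n baseline matrices; the dimensionalities are borrowed from the CAL500 row of table A.3, but the data themselves are random stand-ins.

```python
# Minimal sketch of the early (feature-level) fusion used to build the
# multimodal baseline: each modality's feature vector is normalized to unit
# length and the two vectors are concatenated per object.
import numpy as np

def early_fusion(X1: np.ndarray, X2: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return the (d1 + d2) x n multimodal baseline matrix."""
    X1n = X1 / (np.linalg.norm(X1, axis=0, keepdims=True) + eps)  # unit-norm columns
    X2n = X2 / (np.linalg.norm(X2, axis=0, keepdims=True) + eps)
    return np.vstack([X1n, X2n])

# Example with random stand-in data (n = 50 objects, d1 = 78, d2 = 127 as in CAL500)
rng = np.random.default_rng(0)
X_multimodal = early_fusion(rng.standard_normal((78, 50)), rng.standard_normal((127, 50)))
print(X_multimodal.shape)  # (205, 50)
```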

A.4 Cluster ensembles<br />

In this section, we briefly describe the cluster ensembles employed in the experimental<br />

sections of this thesis. As described in section 2.1, in this work we combine both the homogeneous<br />

and heterogeneous approaches for creating cluster ensembles. This means that we<br />

3 An uneven weighting of the concatenated vectors would give more importance to one of the modalities. As it is not clear how to appropriately weight each modality a priori, we forced each subvector to have unit norm so as to avoid any bias.


Data set name |dfA| =1 |dfA| =10 |dfA| =19 |dfA| =28<br />

Zoo 57 570 1083 1596<br />

Iris 9 90 171 252<br />

Wine 45 450 855 1260<br />

Glass 29 290 551 812<br />

Ionosphere 97 970 1843 2716<br />

WDBC 113 1130 2147 3164<br />

Balance 7 70 133 196<br />

Mfeat 6 60 114 168<br />

miniNG 73 730 1387 2044<br />

Segmentation 52 520 988 1456<br />

BBC 57 570 1083 1596<br />

PenDigits 57 570 1083 1596<br />

Table A.4: Cluster ensemble sizes l corresponding to distinct algorithmic diversity configurations<br />

for the unimodal data sets.<br />

employ several mutually crossed diversity factors (the twenty-eight clustering algorithms of<br />

the CLUTO clustering package presented in section A.1 are run on the data representations<br />

with varying dimensionalities described in section A.3) so as to generate the individual<br />

components of our cluster ensembles.<br />
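The crossing of these diversity factors can be pictured with the following sketch, in which both the clustering algorithms and the data representations are placeholders (two scikit-learn clusterers standing in for the 28 CLUTO configurations, and random matrices standing in for the representations of section A.3); it only illustrates how every algorithm is run on every representation to populate the ensemble.

```python
# Sketch of how the heterogeneous/homogeneous diversity factors are crossed
# to build a cluster ensemble: every clustering algorithm is run on every
# data representation (and every reduced dimensionality).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def build_ensemble(representations, algorithms, k):
    """representations: dict name -> (r x n) matrix; algorithms: dict name -> factory(k).
    Returns an l x n array of label vectors (the cluster ensemble)."""
    ensemble = []
    for rep_name, X_r in representations.items():
        for alg_name, make_alg in algorithms.items():
            labels = make_alg(k).fit_predict(X_r.T)   # clusterers expect objects as rows
            ensemble.append(labels)
    return np.vstack(ensemble)

# Toy usage with stand-in algorithms and representations
rng = np.random.default_rng(0)
reps = {"baseline": rng.standard_normal((10, 60)), "pca-3d": rng.standard_normal((3, 60))}
algs = {"kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
        "agglo":  lambda k: AgglomerativeClustering(n_clusters=k)}
E = build_ensemble(reps, algs, k=4)
print(E.shape)   # (l, n) = (4, 60)
```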

However, several cluster ensemble instances have been generated by limiting the cardinality<br />

of the algorithmic diversity factor |dfA| (i.e. the number of clustering algorithms<br />

considered in creating the cluster ensemble components) to a discrete set of values: |dfA| =<br />

{1, 10, 19, 28}. This strategy is adopted with the objective of experimentally evaluating our<br />

proposals both in terms of i) their sensitivity to the cluster ensemble diversity (as the larger<br />

|dfA|, the more diverse the cluster ensemble), and ii) their computational scalability as regards<br />

the cluster ensemble size l (since this factor is proportional to |dfA|). Notice that the<br />

cluster ensembles with |dfA| = {1, 10, 19} are randomly sampled subsets of the maximally<br />

diverse cluster ensemble (the one corresponding to |dfA| =28).<br />

Tables A.4 and A.5 present the sizes of the cluster ensembles corresponding to the<br />

distinct diversity scenarios (i.e. cardinalities of the algorithmic diversity factor dfA) on the

unimodal and multimodal data collections employed in this work.<br />

Firstly, table A.4 presents the cluster ensemble sizes corresponding to the unimodal<br />

data sets. As expected, the cluster ensemble size l grows linearly with the value of |dfA|.<br />

Depending on the cardinalities of the representational and dimensional diversity factors of<br />

each data collection, fairly distinct cluster ensemble sizes are obtained (from the modest

values of the Iris data set to the highly populated cluster ensembles of the WDBC collection).<br />
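As a quick worked check of this proportionality, reading the Zoo row of table A.4: with |dfA| = 1 the ensemble contains l = 57 components (one per representation/dimensionality configuration), and the remaining columns simply scale this value by the number of algorithms considered, e.g. 10 × 57 = 570 and 28 × 57 = 1596 for the maximally diverse configuration.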

Last, table A.5 presents the cluster ensembles corresponding to the four multimodal data

collections employed in this work for each diversity scenario. It is important to highlight<br />

the fact that the values of l presented in this table encompass the two unimodal and the<br />

multimodal data representations of the objects contained in these data sets.<br />

The reader is referred to appendix B for an analysis of the quality and diversity of the<br />

components of these cluster ensembles.<br />


Data set name |dfA| =1 |dfA| =10 |dfA| =19 |dfA| =28<br />

CAL500 102 1020 1938 2856<br />

IsoLetters 111 1110 2109 3108<br />

InternetAds 183 1830 3477 5124<br />

Corel 123 1230 2337 3444<br />

Table A.5: Cluster ensemble sizes l corresponding to distinct algorithmic diversity configurations<br />

for the multimodal data sets.<br />

A.5 Consensus functions<br />

In this section, we briefly describe the consensus functions employed in the experimental<br />

section of this thesis, placing special emphasis on specific implementation details when<br />

necessary. Moreover, we present the time complexity of each consensus function for a given<br />

cluster ensemble size (l), number of objects in the data set (n) and clusters (k).<br />

The first seven consensus functions are employed in experiments considering both hard and soft cluster ensembles (i.e. chapters 3 to 6). For its part, the last one (VMA) is only applied on soft cluster ensembles, that is, in chapter 6.

The Matlab source code of the first three consensus functions is available for download<br />

at http://www.strehl.com, whereas the remaining ones were implemented ad hoc for this<br />

work. For a more theoretical description of these consensus functions, see section 2.2.<br />

– CSPA (Cluster-based Similarity Partitioning Algorithm): this consensus function<br />

shares a lot of the rationale of the Evidence Accumulation consensus function (see<br />

below), as it is based on deriving a pairwise object similarity measure from the cluster<br />

ensemble and applying a similarity-based clustering algorithm on it—the METIS<br />

graph partitioning algorithm (Karypis and Kumar, 1998) in this case. Its computational<br />

complexity is O(n²kl) (Strehl and Ghosh, 2002).

– HGPA (HyperGraph Partitioning Algorithm): this clustering combiner exploits a hypergraph<br />

representation of the cluster ensemble, re-partitioning the data by finding<br />

a hyperedge separator that cuts a minimal number of hyperedges, yielding k unconnected<br />

components of approximately the same size—which makes HGPA not an<br />

appropriate consensus function when clusters are highly imbalanced. The hypergraph<br />

partition is conducted by means of the HMETIS package (Karypis et al., 1997). Its<br />

time complexity is O(mkl) (Strehl and Ghosh, 2002).

– MCLA (Meta-CLustering Algorithm): as in HGPA, each cluster corresponds to a hyperedge<br />

of the hypergraph representing the cluster ensemble. Subsequently, related<br />

hyperedges are detected by grouping them using the METIS graph-based clustering algorithm<br />

(Karypis and Kumar, 1998). Next, related hyperedges are collapsed and each<br />

object is assigned to the collapsed hyperedge in which it participates most strongly.<br />

Its computational complexity is O(mk²l²) (Strehl and Ghosh, 2002).

– EAC (Evidence Accumulation): this is a pretty direct implementation of the consensus<br />

function presented in (Fred and Jain, 2002a). It consists of the computation of the

pairwise object co-association matrix and the subsequent application of the single-link<br />

229


A.6. Computational resources<br />

hierarchical clustering algorithm on it. The main difference between our implementation<br />

and the original one lies in the fact that we cut the resulting dendrogram at the<br />

desired number of clusters k, whereas Fred proposes performing the cut at the highest<br />

lifetime level, so that the consensus function itself finds the natural number of clusters in the data set. Its computational complexity is O(n²l) (Fred and Jain, 2005). A minimal sketch of this co-association scheme is provided after this list.

– ALSAD (Average-Link on Similarity As Data): this is one of the three consensus<br />

functions presented in (Kuncheva, Hadjitodorov, and Todorova, 2006) based on considering<br />

object similarity measures as object features. Despite the authors do not give<br />

a specific name to this family of consensus functions, we have named them xxSAD<br />

so as to indicate that similarities are deemed as data, replacing xx by the acronym<br />

of the particular clustering algorithm used for obtaining the consensus clustering solution.<br />

In this case, the pairwise object co-association matrix is partitioned using the<br />

average-link (AL) hierarchical clustering algorithm, cutting the resulting dendrogram<br />

at the desired number of clusters. Its computational complexity is O(n²l) for creating the object similarity matrix plus O(n²) for partitioning it with the hierarchical AL clustering algorithm (Xu and Wunsch II, 2005).

– KMSAD (K-Means on Similarity As Data): this consensus function belongs to the<br />

same family as the previous one. This time, the object co-association matrix is clustered<br />

using the classic k-means (KM) partitional algorithm. Its computational complexity<br />

is O(n²l) for creating the object similarity matrix plus O(tkm) for partitioning

it with the k-means clustering algorithm (Xu and Wunsch II, 2005) —where t is the<br />

number of iterations of k-means.<br />

– SLSAD (Single-Link on Similarity As Data): following the same approach as the AL-<br />

SAD and KMSAD consensus functions, the pairwise object co-association matrix is<br />

partitioned by means of the single-link (SL) hierarchical clustering algorithm in this<br />

case. As in the ALSAD consensus function, the consensus clustering solution is obtained<br />

by cutting the dendrogram at the desired number of clusters. Its computational<br />

cost is the same as that of ALSAD.

– VMA (Voting Merging Algorithm): this consensus function is based on sequentially<br />

solving the cluster correspondence problem on pairs of cluster ensemble components,<br />

and, at each iteration, applying a weighted version of the sum rule confidence voting<br />

method. This algorithm scales linearly in the number of objects in the data set and<br />

the number of cluster ensemble components, i.e. its complexity is O(nl) (Dimitriadou,

Weingessel, and Hornik, 2002).<br />
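As a complement to the descriptions above, the following is a minimal sketch of the co-association idea shared by EAC and the xxSAD family: the pairwise co-clustering frequencies are accumulated over the l ensemble components and the resulting matrix is partitioned, here by cutting a single-link dendrogram at k clusters. NumPy and SciPy are used as stand-ins for the Matlab implementation actually employed, and the toy ensemble is an assumption made only for the example.

```python
# Minimal sketch of the co-association scheme behind EAC (and the xxSAD family):
# build the n x n matrix of pairwise co-clustering frequencies over the l
# ensemble components, then cut a single-link dendrogram at k clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def eac_consensus(labelings: np.ndarray, k: int) -> np.ndarray:
    """labelings: l x n array, one hard clustering per row. Returns n consensus labels."""
    l, n = labelings.shape
    coassoc = np.zeros((n, n))
    for labels in labelings:                       # accumulate co-clustering evidence
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= l                                   # co-association values in [0, 1]
    distance = 1.0 - coassoc                       # turn similarity into a distance
    np.fill_diagonal(distance, 0.0)
    Z = linkage(squareform(distance, checks=False), method="single")
    return fcluster(Z, t=k, criterion="maxclust")  # cut the dendrogram at k clusters

# Toy usage: three components of a cluster ensemble over six objects, k = 2
ensemble = np.array([[0, 0, 0, 1, 1, 1],
                     [0, 0, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0, 0]])
print(eac_consensus(ensemble, k=2))
```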

A.6 Computational resources<br />

All the experiments conducted in this thesis have been executed under Matlab 7.0.4 on<br />

Dual Pentium 4 3GHz/1 GB RAM computers. The reason for choosing Matlab as the<br />

programming language for coding our proposals is threefold: besides the fact that we are familiar with it, the existence of multiple built-in functions simplifies the implementation of

many of the processes involved in our proposals (Principal Component Analysis and Random<br />

Projection feature extraction, for instance). Moreover, the availability of the full Matlab<br />

source code of several components of our proposals (e.g. hypergraph consensus functions<br />

230


Appendix A. Experimental setup<br />

or Non-negative Matrix Factorization feature extraction) has been a further incentive for<br />

this decision. However, the downside of this choice is the relatively slow execution of<br />

our proposals implementation, due to the fact that Matlab is an interpreted programming<br />

language.<br />



Appendix B<br />

Experiments on clustering indeterminacies

The goal of this appendix is to present experimental evidence of the indeterminacies affecting

the practical selection of a specific clustering configuration introduced in chapter 1. In<br />

particular, we focus on the indeterminacies regarding the selection of the data representation<br />

and clustering algorithm that yields the best clustering results for both unimodal and<br />

multimodal data collections.<br />

As already noted in chapter 1, the evaluation of the clustering results is based on computing<br />

the normalized mutual information φ (NMI) between a given label vector and the ground truth, which is not available to the clustering process and is only used for evaluation purposes. Recall that φ (NMI) ranges from 0 to 1: the higher its value, the more similar the clustering result is to the ground truth.
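As an illustration of this evaluation step, the short snippet below compares a hypothetical clustering against a hypothetical ground truth using scikit-learn's normalized mutual information; note that NMI admits several normalizations, so the value returned here is a stand-in for the exact φ (NMI) definition given in chapter 1 rather than a reproduction of it.

```python
# Minimal sketch of the evaluation step: compare a clustering label vector
# against the ground truth with normalized mutual information.
from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical class labels
clustering   = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # labels produced by some algorithm

phi_nmi = normalized_mutual_info_score(ground_truth, clustering)
print(f"phi(NMI) = {phi_nmi:.3f}")            # 1.0 would be a perfect match
```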

B.1 Clustering indeterminacies in unimodal data sets<br />

In this section, we analyze which clustering configurations (data representation plus clustering<br />

algorithm) give rise to the best partitioning of the unimodal data sets described in<br />

section A.2.1. We aim to demonstrate the dependence between the quality of the clustering<br />

results and the selection of the way objects are represented and clustered.<br />

As described in section A.3.1, starting with the original data representation (denoted<br />

as baseline), four additional representations have been created by applying several feature<br />

extraction techniques with multiple dimensionalities, namely Principal Component Analysis<br />

(PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization<br />

(NMF) and Random Projection (RP) 1 .<br />

On each distinct object representation, the 28 clustering algorithms from the CLUTO<br />

toolbox presented in section A.1 have been applied, which gives rise to the number of partitions<br />

per data representation presented in table B.1. Notice that, in those data sets not<br />

satisfying non-negativity constraints, the NMF representation was not derived. Moreover,<br />

1 The only exception to this rule is the MFeat data set, where no attribute transformation was applied,<br />

as its original form already presents data representation diversity through the use of several features, see<br />

section A.2.1.<br />


Data set Data representation<br />

name Baseline PCA ICA NMF RP<br />

Zoo 28 392 392 392 392<br />

Iris 28 56 56 56 56<br />

Wine 28 308 308 308 308<br />

Glass 28 196 196 196 196<br />

Ionosphere 28 896 896 - 896<br />

WDBC 28 784 784 784 784<br />

Balance 28 56 - 56 56<br />

Mfeat<br />

28 on each of its 6 representations<br />

(FAC, FOU, KAR, MOR, PIX and ZER)<br />

miniNG 28 504 504 504 504<br />

Segmentation 28 476 476 - 476<br />

BBC 28 392 392 392 392<br />

PenDigits 28 392 392 392 392<br />

Table B.1: Number of individual clusterings per data representation on each unimodal data<br />

set.<br />

the ICA algorithm employed for deriving the homonymous object representation presented<br />

convergence problems when executed on the Balance data collection, so no ICA representation<br />

was created on this data set.<br />

In the next paragraphs, we describe the clustering results obtained on each data set,<br />

emphasizing which clustering configurations lead to the best clustering results in each case.<br />

B.1.1 Zoo data set<br />

Figure B.1 presents the histograms of the φ (NMI) values (ranging in the [0,1] interval) obtained<br />

by all the clustering algorithms on each data representation for the Zoo data collection.<br />

Recall that φ (NMI) = 1 corresponds to a perfect match between the ground truth and<br />

a clustering solution. The analysis of these histograms helps us to interpret the influence of

the clustering indeterminacies on the quality of the clustering results.<br />

[Figure B.1: Histograms of the φ (NMI) values obtained on each data representation in the Zoo data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

Firstly, by inspecting the histogram corresponding to the clustering results obtained by<br />

applying the 28 algorithms on the baseline object representation (figure B.1(a)), we can<br />


see that φ (NMI) values scattered in a range extending approximately from φ (NMI) =0.45 to<br />

φ (NMI) =0.85 are obtained. It is important to notice that such diverse results are solely due<br />

to the clustering algorithm selection indeterminacy, as this histogram presents the results<br />

of running multiple distinct clustering algorithms on a single data representation.<br />

If this analysis is extended to the remaining histograms (figures B.1(b) to B.1(e)), it can<br />

be observed that the φ (NMI) scatter extends across an even wider range for each distinct type<br />

of representation. This somehow gives an idea of the dependence between the quality of the<br />

clustering results and the selection of the clustering algorithm. However, this conclusion<br />

cannot be drawn as directly as in the baseline representation, given that histograms B.1(b)<br />

to B.1(e) present the results of running the 28 algorithms on multiple representations with<br />

distinct dimensionalities derived by each feature extraction technique. In other words,<br />

the diversity observed in these histograms is produced by the joint effect of the clustering<br />

algorithm and dimensionality reduction data representation selection indeterminacies.<br />

However, if figures B.1(a) to B.1(e) are compared among themselves, the different histogram<br />

distributions reveal the effect of the clustering indeterminacy regarding the type of<br />

data representation. For example, clustering results on the NMF representations of this<br />

data set span across a comparatively narrower and higher range of φ (NMI) values than their<br />

PCA, ICA and RP counterparts, indicating that it is more probable to obtain better results<br />

if clustering is run on NMF representations than on the remaining ones.<br />

B.1.2 Iris data set<br />

Compared to other data sets, a pretty small number of clustering solutions have been generated<br />

on the Iris collection. Regardless of this fact, the effect of the clustering indeterminacies<br />

can also be observed in figure B.2.<br />

[Figure B.2: Histograms of the φ (NMI) values obtained on each data representation in the Iris data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

In this case, the wide span of the φ (NMI) histograms of the PCA and ICA representations<br />

(figures B.2(b) and B.2(c)) is the clearest indicator of the representation dimensionality and<br />

algorithm selection indeterminacies.<br />

If the qualities of the clustering solutions obtained for the distinct types of object representation<br />

are compared, we can observe that the highest φ (NMI) values are obtained using<br />

the RP and the baseline representations.<br />


B.1.3 Wine data set<br />

The histograms of the φ (NMI) values obtained by each clustering algorithm across all the<br />

data representations employed in the Wine data set are presented in figure B.3.<br />

[Figure B.3: Histograms of the φ (NMI) values obtained on each data representation in the Wine data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

The clustering indeterminacy regarding the selection of both the clustering algorithm<br />

and the dimensionality of the data representation is clearly observed in figures B.3(b) and<br />

B.3(c). For both the PCA and ICA data representations, a rather even histogram is obtained,<br />

spanning from φ (NMI) =0.04 to φ (NMI) =0.84.<br />

Moreover, notice that it is only with these data representations (PCA and ICA) that<br />

φ (NMI) values above 0.5 are obtained on this data set, which reinforces the importance of<br />

using the optimal type of features for obtaining good clustering results.

B.1.4 Glass data set<br />

The φ (NMI) histograms corresponding to the Glass data set are presented in figure B.4.<br />

[Figure B.4: Histograms of the φ (NMI) values obtained on each data representation in the Glass data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

Notice the distinct histogram distributions obtained for each data representation, which<br />

gives an idea of how the selection of a particular data representation influences the quality of<br />

the clustering results. Additionally, a pretty wide range of values of φ (NMI) are observed in<br />

the histograms corresponding to the feature extraction based data representations (figures<br />

B.4(b) to B.4(e)), thus evidencing the effect of the dimensionality reduction and clustering<br />

algorithm selection indeterminacy.<br />

B.1.5 Ionosphere data set<br />


As regards the Ionosphere data collection, pretty similar φ (NMI) distributions are obtained<br />

for the PCA, ICA and RP representations (see figures B.5(b) to B.5(d)). Thus, in this<br />

case, there apparently exists a lower dependence between the quality of clustering and the<br />

feature extraction technique used for representing the objects. Nevertheless, despite the<br />

notable concentration of clustering results on the leftmost part of the histograms (i.e. poor<br />

clusterings with low values of φ (NMI) ), there exist some clustering solutions reaching φ (NMI)<br />

values above 0.5 using PCA and ICA feature extraction (see figures B.5(b) and B.5(c)).<br />

Moreover, notice that pretty poor quality clusterings are obtained when operating on the<br />

baseline object representation (figure B.5(a)).<br />

[Figure B.5: Histograms of the φ (NMI) values obtained on each data representation in the Ionosphere data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.6 WDBC data set<br />

As regards the WDBC data collection, there exists a notable difference between the profiles<br />

of the histograms of the PCA, ICA and NMF representations when compared to the<br />

baseline and RP histograms. Indeed, the former present a sharp peak located in the lowest<br />

region of the φ (NMI) range, whereas the latter do not—which reflects the data representation<br />

clustering indeterminacy. The notably large differences between the highest and lowest<br />

φ (NMI) values of all the histograms reveal the influence of the clustering algorithm and data<br />

dimensionality selection on the quality of the partition results.<br />

[Figure B.6: Histograms of the φ (NMI) values obtained on each data representation in the WDBC data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]


B.1.7 Balance data set<br />

The approximately even distributions of the φ (NMI) histograms corresponding to the four object representations employed in the Balance data set (with the exception of the peak around φ (NMI) = 0.04 in figure B.7(c)) convey the idea that the chances of randomly selecting a good or a bad clustering configuration are roughly equal in this data collection.

[Figure B.7: Histograms of the φ (NMI) values obtained on each data representation in the Balance data set. Panels: (a) Baseline, (b) PCA, (c) NMF, (d) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.8 MFeat data set<br />

In this data set, six distinct feature types were employed for representing the objects, each<br />

with a single dimensionality. Therefore, the φ (NMI) scatter observed in each of the figures<br />

from B.8(a) to B.8(f) is solely due to the algorithm selection indeterminacy.<br />

Notice that, in all these histograms, a pretty high density of clustering solutions around<br />

φ (NMI) =0.5 can be observed. Nevertheless, notably better clustering results (φ (NMI) ≈ 0.8)<br />

can be obtained using the KAR and PIX object representations (see figures B.8(c) and<br />

B.8(e)), which reveals the data representation indeterminacy effect.<br />

B.1.9 miniNG data set<br />

The wide spread of the φ (NMI) values observed in figure B.9(a) –from φ (NMI) =0.06 to<br />

φ (NMI) =0.64– is clear evidence of how the selection of a particular clustering algorithm

affects the quality of the clustering results.<br />

Moreover, notice that the clustering solutions obtained on the RP representation yield<br />

φ (NMI) values below 0.3, whereas the best results obtained on the remaining representations<br />

reach and even surpass φ (NMI) =0.5 —i.e. distinct object representations can significantly<br />

alter the results of a clustering process.<br />

B.1.10 Segmentation data set<br />

As regards the effect of applying distinct clustering algorithms on the same object representation,<br />

figure B.10(a) shows how, despite the accumulation of clustering solutions around<br />

φ (NMI) =0.35, a maximum quality of φ (NMI) =0.65 can be obtained on the baseline representation

of the objects.<br />

[Figure B.8: Histograms of the φ (NMI) values obtained on each data representation in the MFeat data set. Panels: (a) FAC, (b) FOU, (c) KAR, (d) MOR, (e) PIX, (f) ZER; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

[Figure B.9: Histograms of the φ (NMI) values obtained on each data representation in the miniNG data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

Furthermore, if figures B.10(b) and B.10(c) are compared to figure B.10(d), it is easy to<br />

see that whereas the two former present a wide and sharp peak centered around φ (NMI) =0.7<br />

(thus indicating that clustering solutions this good are likely to be obtained using the PCA<br />

and ICA representations of the objects), the latter has its acme around φ (NMI) =0.35—i.e.<br />

the quality of the RP based clustering solutions tends to be lower in this data set.<br />

B.1.11 BBC data set<br />

The BBC data collection constitutes another example where very diverse clustering solutions<br />

–with qualities ranging from φ (NMI) =0.01 to φ (NMI) =0.81– are obtained when clustering<br />

is conducted on the original representation of the objects (see figure B.11(a)).<br />

As far as the remaining data representations are concerned, the best results seem to<br />

be obtained using the NMF feature extraction technique, as its corresponding histogram is<br />

more scarcely and densely populated at the low and high ranges of φ (NMI) , respectively.<br />


[Figure B.10: Histograms of the φ (NMI) values obtained on each data representation in the Segmentation data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

[Figure B.11: Histograms of the φ (NMI) values obtained on each data representation in the BBC data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.12 PenDigits data set<br />

In this case, the distinct object representations present a reasonably similar behaviour<br />

according to the histograms depicted in figure B.12. Assuming a simplifying viewpoint,<br />

these can be decomposed into a negatively skewed peak with its acme around φ (NMI) =0.6,<br />

and two other narrow peaks, one located near φ (NMI) =0.8 and the other on the low range<br />

of the histogram. Thus, as opposed to what has been observed in other data collections, the<br />

application of the twenty-eight clustering algorithms on the distinct object representations<br />

yields comparable quality results in this data set.

[Figure B.12: Histograms of the φ (NMI) values obtained on each data representation in the PenDigits data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.13 Summary<br />


So as to provide a summarized vision of the data representation and the clustering algorithm<br />

selection indeterminacies across all the analyzed data sets, table B.2 presents the φ (NMI)<br />

corresponding to the best clustering solution achieved by each one of the six families of

clustering algorithms employed in this work, namely agglomerative (agglo), biased agglomerative<br />

(bagglo), direct, graph, repeated-bisecting (rb) and refined repeated-bisecting (rbr),<br />

indicating the type of object representation employed in each case (either baseline, PCA,<br />

ICA, NMF or RP).<br />
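The selection summarized in table B.2 can be expressed compactly in code. The sketch below assumes a hypothetical data layout (a mapping from (family, representation) pairs to the corresponding maximum φ(NMI)) and is not the thesis implementation: for every family it keeps the representation attaining the highest φ(NMI) and then sorts the families by that value.

def best_per_family(results):
    # results: dict mapping (family, representation) -> maximum phi(NMI) of that pair.
    best = {}
    for (family, representation), nmi in results.items():
        if family not in best or nmi > best[family][1]:
            best[family] = (representation, nmi)
    # Sort families from highest to lowest phi(NMI), as in table B.2.
    return sorted(best.items(), key=lambda item: item[1][1], reverse=True)

# Hypothetical excerpt (two families, three representations each):
example = {("agglo", "baseline"): 0.61, ("agglo", "PCA-13d"): 0.74, ("agglo", "NMF-13d"): 0.865,
           ("graph", "baseline"): 0.55, ("graph", "PCA-6d"): 0.730, ("graph", "RP-12d"): 0.70}
print(best_per_family(example))
# -> [('agglo', ('NMF-13d', 0.865)), ('graph', ('PCA-6d', 0.73))]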

Several facts are worth observing as regards the data representation indeterminacy. Notice that, in some data sets (e.g. Zoo or miniNG), there exists a notable diversity as regards the type of representation that yields the top clustering result for each family of clustering algorithms. In contrast, in other data collections there seems to exist a particular object representation that reveals the data set structure regardless of the type of clustering algorithm applied. This behaviour is observed in the Iris and Balance collections and also, to a lesser extent, in the WDBC and Segmentation data sets. Moreover, notice the variability of these optimal object representations across the analyzed data sets, which is a clear indicator of the clustering indeterminacy regarding data representations.

As far as the selection of the optimal clustering algorithm is concerned, it is important to note that each of the families of clustering algorithms employed in this work attains the best absolute performance in at least one of the analyzed data sets, which gives an idea of the algorithm selection indeterminacy. Moreover, notice that choosing the wrong type of clustering algorithm may affect the quality of the clustering solution dramatically (see the Ionosphere and Balance collections) or hardly at all (as in the Segmentation data set).

B.2 Clustering indeterminacies in multimodal data sets<br />

The goal of this section is to evaluate the effect of clustering indeterminacies in the context of multimodal data collections. Besides the data representation and clustering algorithm selection indeterminacies, multimodality introduces a further source of uncertainty, as it is not evident whether the combination of the m modalities will benefit the quality of the obtained clustering solution or not. Moreover, it is important to recall that all these indeterminacies are local to each data collection, so, in general, it is not possible to draw universally valid conclusions.

As done in appendix B.1, we start by presenting the total number of individual clustering solutions obtained by applying the 28 clustering algorithms extracted from the CLUTO toolbox on all the data representations of the objects contained in the employed multimodal data sets2 (see table B.3). Notice that the CAL500 and InternetAds collections lack the NMF representation, as their features do not satisfy the necessary non-negativity constraints.

In the next paragraphs, we describe the clustering results obtained on the four multimodal<br />

data sets, placing special emphasis on which data representations and modalities<br />

lead to the best clustering results in each case.<br />

2 See appendices A.1, A.2.2, and A.3.2 for a description of the clustering algorithms, the multimodal collections and the multimodal object representations employed in this work.


For each data set, the entries below list the top clustering result of every algorithm family, sorted from highest to lowest maximum φ(NMI), in the format family: representation (φ(NMI)).

Zoo: agglo: NMF-13d (0.865); bagglo: PCA-13d (0.858); direct: baseline (0.853); rb: RP-12d (0.848); rbr: RP-12d (0.848); graph: PCA-6d (0.730)
Iris: bagglo: baseline (0.899); direct: baseline (0.899); rb: baseline (0.899); rbr: baseline (0.899); agglo: baseline (0.837); graph: baseline (0.821)
Wine: direct: PCA-10d (0.836); rbr: PCA-10d (0.836); bagglo: ICA-12d (0.802); rb: PCA-10d (0.795); graph: PCA-8d (0.755); agglo: ICA-5d (0.720)
Glass: agglo: RP-8d (0.487); direct: RP-5d (0.442); rb: PCA-3d (0.423); rbr: PCA-6d (0.418); bagglo: PCA-4d (0.417); graph: PCA-3d (0.392)
Ionosphere: graph: PCA-8d (0.656); agglo: PCA-31d (0.314); bagglo: RP-14d (0.309); direct: RP-9d (0.234); rb: RP-9d (0.234); rbr: RP-9d (0.234)
WDBC: graph: NMF-3d (0.637); bagglo: NMF-3d (0.603); direct: NMF-3d (0.563); rb: NMF-3d (0.563); rbr: NMF-3d (0.563); agglo: RP-15d (0.522)
Balance: agglo: PCA-3d (0.703); bagglo: PCA-3d (0.411); rbr: PCA-3d (0.394); rb: PCA-3d (0.388); direct: PCA-3d (0.370); graph: PCA-3d (0.324)
MFeat: graph: PIX (0.811); rbr: PIX (0.676); direct: PIX (0.669); agglo: KAR (0.664); bagglo: PIX (0.606); rb: KAR (0.585)
miniNG: rb: baseline (0.638); rbr: PCA-70d (0.597); bagglo: baseline (0.559); direct: ICA-50d (0.558); graph: PCA-50d (0.412); agglo: NMF-50d (0.377)
Segmentation: graph: PCA-11d (0.786); rbr: PCA-13d (0.741); agglo: ICA-13d (0.733); bagglo: PCA-13d (0.731); rb: PCA-14d (0.728); direct: ICA-17d (0.720)
BBC: rbr: baseline (0.808); direct: baseline (0.808); graph: ICA-5d (0.777); agglo: ICA-5d (0.750); bagglo: ICA-5d (0.744); rb: baseline (0.726)
PenDigits: graph: NMF-13d (0.839); agglo: ICA-9d (0.724); rbr: NMF-9d (0.682); direct: NMF-10d (0.681); bagglo: NMF-11d (0.665); rb: NMF-7d (0.648)

Table B.2: Top clustering results obtained by each clustering algorithm family across all the unimodal data sets, sorted from highest to lowest φ(NMI).


Data representation   Modality   CAL500   Corel   InternetAds   IsoLetters
Baseline              MM           28       28        28            28
                      M1           28       28        28            28
                      M2           28       28        28            28
PCA                   MM          504      420       308           532
                      M1          280      196       308           196
                      M2          140      224       392           532
ICA                   MM          504      420       308           532
                      M1          280      196       308           196
                      M2          140      224       392           532
NMF                   MM           –       420        –            532
                      M1           –       196        –            196
                      M2           –       224        –            532
RP                    MM          504      420       308           532
                      M1          280      196       308           196
                      M2          140      224       392           532

Table B.3: Number of individual clusterings per data representation on each multimodal data set, where MM, M1 and M2 stand for multimodal, mode #1 and mode #2, respectively.

B.2.1 CAL500 data set<br />

The φ(NMI) histograms presented in figure B.13 summarize the clustering results obtained by running the aforementioned twenty-eight algorithms on each type of object representation for each of the two modalities, as well as for the multimodal representations.

If the histograms are compared representation-wise, we observe that all representations yield clustering solutions whose quality spans similarly wide ranges below φ(NMI) = 0.5. For a given modality, there exists no clearly superior object representation.

However, if the histograms are compared across modalities, it can be observed that better results are obtained when clustering is conducted on the audio modality of this data set, regardless of the type of representation employed. Moreover, the multimodal data representation seems to yield clustering results of intermediate quality (i.e. slightly better than clustering on text only, but worse than clustering solely on audio), which reveals that the early fusion of acoustic and textual features is not beneficial in this case.
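For reference, the early fusion scheme referred to above amounts to concatenating the per-object feature vectors of the two modalities before clustering. The sketch below assumes both modalities are available as object-by-feature matrices over the same set of objects; the variable names and matrix sizes are illustrative, not taken from the thesis.

import numpy as np

def early_fusion(features_m1, features_m2):
    # Objects must be aligned row-wise across modalities.
    assert features_m1.shape[0] == features_m2.shape[0]
    # Early fusion simply concatenates the per-object feature vectors,
    # so a single clustering algorithm sees both modalities at once.
    return np.hstack([features_m1, features_m2])

# Illustrative sizes only: 502 songs with hypothetical acoustic and textual features.
audio = np.random.rand(502, 68)
text = np.random.rand(502, 173)
multimodal = early_fusion(audio, text)   # shape (502, 241)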

B.2.2 Corel data set<br />

Figure B.14 presents the φ(NMI) histograms corresponding to the multimodal and unimodal clustering of the captioned images of the Corel data set.

As regards the comparison across object representations, it can be observed that, especially for the image and multimodal modalities, the RP representation offers a large number of good clustering solutions, whereas the quality of the clusterings obtained on the remaining representations is scattered over a wide range of φ(NMI) values.

If the clustering results obtained on the two modalities are compared, we can see that the image modality is the one yielding the best clustering results (up to φ(NMI) = 0.68), which are far better than those obtained on the text modality (always below φ(NMI) = 0.3). Last but not least, the multimodal object representation seems to benefit slightly from the early fusion of the visual and textual features of both modalities, as better clustering results are obtained in this case, although by a very small margin.

Figure B.13: Histograms of the φ(NMI) values obtained on the CAL500 data set for each data representation (Baseline, PCA, ICA, RP) and modality (multimodal, audio, text).

B.2.3 InternetAds data set<br />

The clustering results corresponding to the InternetAds data collection are summarized in figure B.15. Many poor clustering results are obtained on this data set, as revealed by the high peaks located on the leftmost regions of the histograms. The distinct data representations and modalities behave rather erratically, as discussed next.

If the two modalities are compared, the best clustering results are obtained, in general terms, using the collateral information of the Internet advertisements (which are the objects in this data set). However, the multimodal composition of the objects tends to yield clusterings of superior (although still poor) quality, except for the PCA representation.



Figure B.14: Histograms of the φ(NMI) values obtained on the Corel data set for each data representation (Baseline, PCA, ICA, NMF, RP) and modality (multimodal, image, text).

B.2.4 IsoLetters data set<br />

The collection of clustering solutions obtained on the IsoLetters artificial multimodal data collection is presented representation- and modality-wise in figure B.16.

In this case, the quality of the clusterings created on the distinct object representations presents two clearly different histogram patterns depending on the modality. For instance, in the speech modality (figures B.16(e) to B.16(h)), the baseline and RP histograms present a main peak and a secondary minor one, whereas the PCA and ICA representations yield a fairly uniform distribution of clusterings. In contrast, a totally different distribution is found when clustering is run on the visual mode, where a single negatively skewed bell shape is observed (see figures B.16(i) to B.16(l)).

Finally, it is worth noting that, regardless of the object representation employed, the early fusion of the speech and visual features of this data set gives rise to a notable increase in the quality of the clustering results (a 16.2% average relative improvement with respect to the top quality individual clustering solution).


Figure B.15: Histograms of the φ(NMI) values obtained on the InternetAds data set for each data representation (Baseline, PCA, ICA, NMF, RP) and modality (multimodal, object, collateral).

B.2.5 Summary<br />

Following the same format as in appendix B.1, table B.4 presents the φ(NMI) values attained by the top clustering solution achieved by the best representative of each of the families of clustering algorithms employed in this work (i.e. agglo, bagglo, direct, graph, rb and rbr), indicating the type of representation employed (baseline, PCA, ICA, NMF or RP) and the modality (multimodal, MM; mode #1, M1; or mode #2, M2). The idea is to present a condensed view of the influence of the data representation and clustering algorithm selection indeterminacies.

Notice the distinct ordering of the families of clustering algorithms in every data set. A clear indicator of the algorithm selection indeterminacy is the fact that the rbr algorithms yield the top clustering solution in three of the four data sets, while offering the poorest performance in the InternetAds collection.

The indeterminacy regarding the use of multimodal or unimodal data representations also becomes evident: in two of the data sets (Corel and IsoLetters) the multimodal representations dominate the best clustering results across all the families of algorithms, whereas a unimodal representation does so in the CAL500 and InternetAds collections. And finally, notice the diversity of the types of representation appearing in table B.4, which suggests that, for a given data set, it is very difficult to select the data representation and clustering strategy that yield the best clustering results.
Figure B.16: Histograms of the φ(NMI) values obtained on the IsoLetters data set for each data representation (Baseline, PCA, ICA, RP) and modality (multimodal, speech, image).


For each multimodal data set, the entries below list the top clustering result of every algorithm family, sorted from highest to lowest φ(NMI), in the format family: representation-modality, dimensionality (φ(NMI)).

CAL500: rbr: RP-M1, 120d (0.411); direct: RP-M1, 100d (0.404); agglo: RP-M1, 100d (0.401); rb: RP-M1, 120d (0.384); bagglo: RP-M1, 120d (0.381); graph: baseline-M1 (0.364)
Corel: rbr: NMF-MM, 550d (0.675); graph: RP-MM, 400d (0.672); direct: NMF-MM, 450d (0.671); rb: NMF-MM, 300d (0.641); bagglo: baseline-M1 (0.624); agglo: baseline-MM (0.622)
InternetAds: bagglo: RP-M1, 70d (0.430); graph: NMF-MM, 150d (0.319); agglo: baseline-M2 (0.258); direct: NMF-M2, 550d (0.087); rb: NMF-M2, 550d (0.087); rbr: NMF-M2, 550d (0.087)
IsoLetters: rbr: PCA-MM, 100d (0.897); direct: PCA-MM, 100d (0.875); graph: PCA-MM, 100d (0.846); agglo: RP-MM, 600d (0.790); rb: baseline-MM (0.751); bagglo: baseline-MM (0.728)

Table B.4: Top clustering results obtained by each clustering algorithm family across all the multimodal data sets, sorted from highest to lowest φ(NMI).



Appendix C

Experiments on hierarchical consensus architectures

This appendix presents several experiments regarding self-refining hierarchical consensus<br />

architectures.<br />

C.1 Configuration of a random hierarchical consensus architecture<br />

In this section, we present some examples that describe, in detail, the configuration process of random hierarchical consensus architectures (RHCA). The aim is to demonstrate how, given a cluster ensemble size l and a mini-ensemble size b, equations (C.1), (C.2) and (C.3) determine the number of stages s, the number of consensus per stage K_i and the effective size of each mini-ensemble b_ij of the corresponding RHCA (a compact code sketch of these rules is provided after table C.4 below).

To begin with, let us carefully examine the three RHCA examples presented in section 3.2. In these toy examples, the mini-ensemble size is set to b = 2, while the respective cluster ensembles have l = 7, 8 and 9 components. The first step of the RHCA design process consists of determining the number of stages of the hierarchy, s, according to equation (C.1).

$$
s = \begin{cases}
\lfloor \log_b(l) \rceil & \text{if}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \right\rfloor \leq 1\ \text{and}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} \right\rfloor > 1 \\[6pt]
\lfloor \log_b(l) \rceil - 1 & \text{if}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \right\rfloor \leq 1\ \text{and}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} \right\rfloor = 1 \\[6pt]
\lfloor \log_b(l) \rceil + 1 & \text{if}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \right\rfloor > 1
\end{cases}
\qquad \text{(C.1)}
$$

where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer and $\lfloor \cdot \rfloor$ denotes the floor operator.

Table C.1 presents the results of this computation for the three aforementioned examples<br />

(one row per example), specifying the values of the decision factors used for selecting one<br />

of the three options presented in equation (C.1).<br />

Once the number of stages of the RHCA is computed, the next step consists of determining how many consensus processes are to be executed at each RHCA stage.

         l / b^⌊log_b(l)⌉    l / b^(⌊log_b(l)⌉ − 1)    s
l = 7         0.875                  1.75              ⌊log_b(l)⌉ − 1 = 2
l = 8         1                      2                 ⌊log_b(l)⌉ = 3
l = 9         1.125                  2.25              ⌊log_b(l)⌉ = 3

Table C.1: Examples of the computation of the number of stages s of a RHCA on cluster ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.

The number of consensus processes per stage, designated by K_i (where the subindex i denotes the stage number), is computed according to equation (C.2).

$$
K_i = \max\left(\left\lfloor \frac{l}{b^i} \right\rfloor,\ 1\right)
\qquad \text{(C.2)}
$$

The number of consensus processes per stage of each of the three RHCA examples discussed is presented in table C.2.

l = 7: stage 1: K_1 = max(⌊3.5⌋, 1) = 3, as l/b = 3.5; stage 2: K_2 = max(⌊1.75⌋, 1) = 1, as l/b² = 1.75
l = 8: stage 1: K_1 = max(⌊4⌋, 1) = 4, as l/b = 4; stage 2: K_2 = max(⌊2⌋, 1) = 2, as l/b² = 2; stage 3: K_3 = max(⌊1⌋, 1) = 1, as l/b³ = 1
l = 9: stage 1: K_1 = max(⌊4.5⌋, 1) = 4, as l/b = 4.5; stage 2: K_2 = max(⌊2.25⌋, 1) = 2, as l/b² = 2.25; stage 3: K_3 = max(⌊1.125⌋, 1) = 1, as l/b³ = 1.125

Table C.2: Examples of the computation of the number of consensus per stage (K_i) of a RHCA on cluster ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.

And finally, the effective mini-ensemble sizes b_ij, ∀i ∈ [1, s] and ∀j ∈ [1, K_i], must be computed. Recall that the effective size of all the mini-ensembles of the RHCA is equal to the user-defined mini-ensemble size (i.e. b_ij = b ∀i, j) if and only if l is an integer power of b. In practice, the effective mini-ensemble sizes are computed according to equation (C.3), which adjusts this factor so that all the original and intermediate clusterings are subject to a consensus process. The b_ij values corresponding to the three RHCA examples are presented in table C.3, along with the corresponding number of consensus K_i at each RHCA stage in brackets.

$$
b_{ij} = \begin{cases}
b & \text{if}\ i < s\ \text{and}\ j < K_i \\
b + (K_{i-1} \bmod b) & \text{if}\ i < s\ \text{and}\ j = K_i \\
K_{s-1} & \text{if}\ i = s
\end{cases}
\qquad \text{(C.3)}
$$

where, by convention, $K_0 = l$ (the size of the original cluster ensemble).
b if i


l = 7 [K_1 = 3, K_2 = 1]: stage 1: b_11 = b = 2, b_12 = b = 2, b_13 = b + l mod b = 3; stage 2: b_21 = K_1 = 3
l = 8 [K_1 = 4, K_2 = 2, K_3 = 1]: stage 1: b_11 = b_12 = b_13 = b = 2, b_14 = b + l mod b = 2; stage 2: b_21 = b = 2, b_22 = b + K_1 mod b = 2; stage 3: b_31 = K_2 = 2
l = 9 [K_1 = 4, K_2 = 2, K_3 = 1]: stage 1: b_11 = b_12 = b_13 = b = 2, b_14 = b + l mod b = 3; stage 2: b_21 = b = 2, b_22 = b + K_1 mod b = 2; stage 3: b_31 = K_2 = 2

Table C.3: Examples of the computation of the mini-ensemble sizes of a RHCA on cluster ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.
ensembles of size l =7, 8 and 9, being the mini-ensembles size b =2.<br />

Additionally, table C.4 describes the configuration of RHCA variants built upon a cluster ensemble of an arbitrarily chosen size of l = 30, using the following predefined mini-ensemble sizes: b = {2, 3, 5, 15}.

It can be observed that the larger the mini-ensemble size b, the smaller the number of stages s. Moreover, notice that, for any given RHCA, the number of consensus per stage K_i progressively converges to unity (i.e. K_s = 1), giving rise to a regular pyramidal hierarchy of consensus processes. Last, it is worth observing how the effective size of the mini-ensembles b_ij is determined. Notice that b_ij is regularly set to b, except possibly for the last (i.e. the K_i-th) consensus of each stage and/or the single consensus of the final stage, whose sizes may vary so as to accommodate the necessary number of clusterings into the associated consensus process.

Configuration of RHCA topologies for a cluster ensemble of size l = 30:

b = 2: s = 4; K_i = {15, 7, 3, 1}; b_1j = 2 ∀j ∈ [1, 15]; b_2j = 2 ∀j ∈ [1, 6], b_27 = 3; b_3j = 2 ∀j ∈ [1, 2], b_33 = 3; b_41 = 3
b = 3: s = 3; K_i = {10, 3, 1}; b_1j = 3 ∀j ∈ [1, 10]; b_2j = 3 ∀j ∈ [1, 2], b_23 = 4; b_31 = 3
b = 5: s = 2; K_i = {6, 1}; b_1j = 5 ∀j ∈ [1, 6]; b_21 = 6
b = 15: s = 2; K_i = {2, 1}; b_1j = 15 ∀j ∈ [1, 2]; b_21 = 2

Table C.4: Configuration of RHCA topologies on a cluster ensemble of size l = 30 with varying mini-ensemble sizes b = {2, 3, 5, 15}.
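The three configuration rules above can be condensed into a short procedure. The following sketch is one possible reading of equations (C.1), (C.2) and (C.3), written so that it reproduces the worked examples of tables C.1 to C.4; it is not the original implementation used in this thesis.

import math

def rhca_configuration(l, b):
    # Returns (s, K, B): number of stages, consensus per stage, and mini-ensemble sizes.
    r = round(math.log(l, b))                 # nearest-integer rounding of log_b(l)
    # Equation (C.1): number of stages s.
    if l // b ** r > 1:
        s = r + 1
    elif l // b ** (r - 1) > 1:
        s = r
    else:
        s = r - 1
    # Equation (C.2): K_i = max(floor(l / b^i), 1) consensus processes at stage i.
    K = [max(l // b ** (i + 1), 1) for i in range(s)]
    # Equation (C.3): effective mini-ensemble sizes b_ij.
    B = []
    previous = l                              # clusterings entering the current stage (K_0 = l)
    for i in range(s):
        if i == s - 1:
            row = [previous]                  # final stage: a single consensus over all survivors
        else:
            row = [b] * K[i]
            row[-1] = b + previous % b        # the last mini-ensemble absorbs the remainder
        B.append(row)
        previous = K[i]
    return s, K, B

# Reproduces, e.g., the b = 2 column of table C.4: s = 4, K = [15, 7, 3, 1].
print(rhca_configuration(30, 2))

For instance, for l = 7 and b = 2 the sketch returns s = 2, K = [3, 1] and mini-ensemble sizes [[2, 2, 3], [3]], matching table C.3.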


C.2 Estimation of the computationally optimal RHCA<br />

Section 3.2 presents a methodology for selecting the most computationally efficient implementation variant of random hierarchical consensus architectures. In short, this methodology consists of estimating the running time of several RHCA variants differing in the mini-ensemble size b and selecting the one that yields the minimum estimated running time, which is the only variant actually executed (a simplified sketch of this selection logic is given after the experimental design outline below).

So as to validate this procedure, in this section we present the estimated and real running times of several variants of the fully serial and parallel implementations of RHCA on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat unimodal data sets (see appendix A.2.1 for a description of these collections) across the four experimental diversity scenarios employed in this work (see appendix A.4). The objective of this experiment is twofold: firstly, we seek to verify whether the proposed strategy succeeds in predicting the most computationally efficient RHCA variant; and secondly, we intend to analyze the conditions under which random hierarchical consensus architectures are computationally advantageous compared to flat consensus clustering. The experimental design that has been followed is outlined next.

– What do we want to measure?<br />

i) The time complexity of random hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology to predict the computationally optimal RHCA variant, in both the fully serial and parallel implementations.

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel RHCA variants is measured in terms of the CPU time required for their execution: the serial running time (SRT_RHCA) and the parallel running time (PRT_RHCA).

ii) The estimated running times of the same RHCA variants, i.e. the serial estimated running time (SERT_RHCA) and the parallel estimated running time (PERT_RHCA), are computed by means of the proposed running time estimation methodology, which is based on the measured running time of c = 1 consensus clustering process. A prediction regarding the computationally optimal RHCA variant is deemed successful when both the real and the estimated running times are minimized by the same RHCA variant, and the percentage of experiments in which the prediction is successful is given as a measure of its performance. In order to measure the impact of incorrect predictions, we also measure the execution time differences (in both absolute and relative terms) between the truly and the allegedly fastest RHCA variants whenever the prediction fails. This evaluation process is replicated for a range of values of c ∈ [1, 20], so as to measure the influence of this factor on the prediction accuracy of the proposed methodology.

– How are the experiments designed? All the RHCA variants corresponding to the sweep of values of b resulting from the proposed running time estimation methodology have been implemented (see table 3.2). In order to test our proposals under a wide spectrum of experimental situations, consensus processes have been conducted using the seven consensus functions for hard cluster ensembles presented in appendix A.5 (i.e. CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), employing cluster ensembles of the sizes corresponding to the four diversity scenarios described in appendix A.4, which basically boils down to compiling the clusterings output by |dfA| = {1, 10, 19, 28} clustering algorithms. In all cases, the real running times correspond to an average of 10 independent runs of the whole RHCA, in order to obtain representative real running time values (recall that the mini-ensemble components change from run to run, as they are randomly selected). For a description of the computational resources employed in our experiments, see appendix A.6.

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the RHCA variants are depicted by means of<br />

curves representing their average values.<br />
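The selection logic just outlined can be illustrated with a deliberately simplified sketch. It assumes, purely for illustration, that every consensus process in a hierarchy costs roughly the time t_c measured for a single reference consensus run, so the serial estimate sums that time over all consensus processes and the parallel estimate counts one consensus per stage. The actual SERT_RHCA/PERT_RHCA estimators of section 3.2 are more refined; only the final minimum-selection step is common to both.

def estimate_running_times(variants, t_c):
    # variants: dict mapping b -> (s, K_list); t_c: measured time of one consensus (seconds).
    estimates = {}
    for b, (s, K) in variants.items():
        serial = t_c * sum(K)     # a single CPU runs every consensus sequentially
        parallel = t_c * s        # one consensus per stage lies on the critical path
        estimates[b] = (serial, parallel)
    return estimates

def fastest_variant(estimates, parallel=False):
    idx = 1 if parallel else 0
    return min(estimates, key=lambda b: estimates[b][idx])

# Example on a cluster ensemble of size l = 30 (stage counts taken from table C.4):
variants = {2: (4, [15, 7, 3, 1]), 3: (3, [10, 3, 1]), 5: (2, [6, 1]), 15: (2, [2, 1])}
estimates = estimate_running_times(variants, t_c=0.8)
print(fastest_variant(estimates), fastest_variant(estimates, parallel=True))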

C.2.1 Iris data set<br />

To begin with, let us analyze the results corresponding to the Iris data collection. In this case, the four diversity scenarios correspond to cluster ensembles of size l = 9, 90, 171 and 252, respectively. The left and right columns of figure C.1 present the estimated and real running times of several variants of the serial implementation of the RHCA on this data set across the four diversity scenarios. It can be observed that, as the size of the cluster ensemble grows, there appear RHCA variants that are more computationally efficient than flat consensus (especially when the MCLA and KMSAD consensus functions are employed). However, there are no significant differences between the running times of the fastest RHCA variant and flat consensus, probably due to the small size of this data set and of the associated cluster ensembles. For this reason, the inaccuracies of the running time prediction based on SERT_RHCA are of little importance in practice.

Figure C.2 presents the estimated and real running times of the parallel implementation of the RHCA in the four diversity scenarios analyzed. According to PERT_RHCA, the parallel RHCA variants with the s = 2/lowest b and s = 3/highest b configurations yield the maximum computational efficiency, except for the lowest diversity scenario, where flat consensus is correctly designated as the fastest option. If these predictions are compared to the real running times presented on the right column of figure C.2, it can be observed that, as the diversity level grows, they remain accurate as regards the identification of the fastest consensus architecture for most consensus functions. Moreover, even when the prediction strategy fails to identify the fastest RHCA variant, the penalty in absolute running time is perfectly assumable, as the real running times of parallel RHCA are below one second for the particular case of this data set.

C.2.2 Wine data set<br />

In this section, we present the estimated and real running times of the serial and parallel implementations of RHCA on the Wine data collection. As aforementioned, this experiment has been replicated across four diversity scenarios that, in the case of this data set, correspond to cluster ensembles of size l = 45, 450, 855 and 1260. Thus, notice that considerably larger cluster ensembles are obtained in this case, especially if compared to those of the Iris data collection. This is due to the fact that the Wine data set has a much richer dimensional diversity as regards the distinct object representations generated (approximately five times richer), which boosts the size of the cluster ensemble.

Figure C.1: Estimated and real running times (RT) of the serial RHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.2: Estimated and real running times (RT) of the parallel RHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

Firstly, figure C.3 depicts the results corresponding to the fully serial RHCA implementation across the four diversity scenarios (estimated running times on the left column, real running times on the right). The first remarkable fact is that SERT_RHCA is a fairly accurate predictor of SRT_RHCA, which can be easily verified by comparing the pair of subfigures presented on each row of figure C.3. Again, RHCA becomes more computationally attractive as the size of the cluster ensemble increases (except for the EAC consensus function). Moreover, among the distinct RHCA variants executed in each experiment, the greatest efficiency is achieved by the ones with 2 or 3 stages.

The estimated and real execution times of the parallel implementation of RHCA are depicted in figure C.4. As already observed in the Iris data set, PERT_RHCA is a modestly accurate estimator of PRT_RHCA, although it is a fairly good predictor of the most computationally efficient consensus architecture. Notice that, in the most diverse scenario (|dfA| = 28), the least time consuming RHCA variant is nearly two orders of magnitude faster than flat consensus; thus, being able to predict which RHCA configuration requires the least computation time constitutes a significantly advantageous strategy compared to the traditional one-step approach to consensus clustering.

C.2.3 Glass data set<br />

This section presents the results of estimating the execution times of the fully serial and parallel implementations of RHCA in the four diversity scenarios for the Glass data set, which give rise to cluster ensembles of sizes l = 29, 290, 551 and 812, respectively.

Firstly, figure C.5 depicts both the estimated and real running times of several serial RHCA variants. These results are quite comparable to those obtained in the previous data collections. That is, except for the EAC consensus function, the RHCA variants with s = 2 and s = 3 stages become the most computationally efficient as the size of the cluster ensemble increases. Moreover, in the most diverse scenario (|dfA| = 28), flat consensus is not executable if the MCLA consensus function is employed as the clustering combiner, whereas hierarchical consensus does provide a means for obtaining a consolidated clustering solution upon the same cluster ensemble using this consensus function. Furthermore, notice that the proposed methodology for estimating the running time of serial RHCA yields fairly reliable predictions of the real execution time.

And secondly, the results corresponding to the parallel implementation of RHCA are presented in figure C.6. Again, it can be observed that the estimated running time of the parallel RHCA is only a moderately accurate approximation of the real execution time. However, this lack of accuracy is tolerable inasmuch as i) the location of the minima of PERT_RHCA mostly coincides with that of the minima of PRT_RHCA, which means that the fastest consensus architecture is successfully predicted, and ii) the selection of a computationally suboptimal RHCA variant involves only a slight penalty in terms of real execution time.

C.2.4 Ionosphere data set<br />

This section describes the results of the minimum complexity RHCA variant selection based on running time estimation. In the case of the Ionosphere data collection, cluster ensembles of sizes l = 97, 970, 1843 and 2716 correspond to the four diversity scenarios where this experiment is conducted.



Figure C.3: Estimated and real running times (RT) of the serial RHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.4: Estimated and real running times (RT) of the parallel RHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.5: Estimated and real running times (RT) of the serial RHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.6: Estimated and real running times (RT) of the parallel RHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

To begin with, figure C.7 presents the estimated and real execution times of several variants of the fully serial implementation of RHCA. If the estimated and real running times are compared, it can be observed that it is possible to accurately predict the real execution time of the serial RHCA variants, which, at the same time, allows the precise prediction of the most computationally efficient RHCA variant, the ultimate goal of the proposed methodology.

Figure C.8 depicts the estimated and real running times of the parallel RHCA implementation across the sweep of values of b for all four diversity scenarios. In comparison to what is observed in other data sets, PERT_RHCA is a better estimator of PRT_RHCA in this case. Moreover, as the diversity of the cluster ensemble grows, the computational savings derived from employing the fastest RHCA instead of flat consensus are noteworthy (especially for the HGPA, CSPA, ALSAD and SLSAD consensus functions). Last, note that flat consensus is not executable in three of the four diversity scenarios if consensus is obtained by means of MCLA, due to the large size of the mini-ensembles involved.

C.2.5 WDBC data set<br />

In this section, we present the results of estimating the execution times of the serial and<br />

parallel RHCA implementations on the WDBC data collection. According to the four<br />

diversity scenarios generated by employing |dfA| =1, 10, 19 and 28 clustering algorithms<br />

for generating the cluster ensembles, these contain l = 113, 1130, 2147 and 3164 individual<br />

partitions.<br />

The estimated and real running times corresponding to the serial implementation of<br />

RHCA are depicted in figure C.9. As observed in the remaining data collections, SERTRHCA<br />

is a fairly accurate estimator of SRTRHCA, which allows predicting the fastest consensus<br />

architecture with a high precision. Notice that, as already noted in other collections, RHCA<br />

becomes a competitive option as the size of the cluster ensemble grows, except when the EAC<br />

consensus function is employed. Notice that, for the most diverse scenarios, all consensus<br />

architectures are highly costly (in terms of execution time), so being able to predict which<br />

is the fastest can lead to important computation savings.<br />

As regards the fully parallel implementation of RHCA, the estimated and real running<br />

times corresponding to the four aforementioned diversity scenarios are presented in figure<br />

C.10. Although the estimation of the real execution time is not as accurate as in the serial<br />

case, PERTRHCA is a reasonable predictor of the fastest consensus architecture in most<br />

cases.<br />

C.2.6 Balance data set<br />

This section presents the estimated and real execution times of multiple variants of random<br />

hierarchical consensus architectures on the Balance data collection, both in its serial and<br />

parallel versions. The low cardinality of the dimensional diversity factor of this data set<br />

gives rise to relatively small cluster ensembles in the four diversity scenarios, which are<br />

equal to l =7, 70, 133 and 196 in this case.<br />

Firstly, figure C.11 depicts the estimated and real running times of the serial RHCA<br />

implementation in the four diversity scenarios. As already observed in the previous data<br />

[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.7: Estimated and real running times (RT) of the serial RHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.8: Estimated and real running times (RT) of the parallel RHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.9: Estimated and real running times (RT) of the serial RHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.10: Estimated and real running times (RT) of the parallel RHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



sets, our proposed method allows obtaining a pretty accurate estimation of the execution<br />

time of any serial RHCA variant which, at the same time, allows the user to make a reliable<br />

decision regarding the most computationally efficient consensus architecture, regardless of the<br />

diversity scenario and the consensus function employed.<br />

Secondly, figure C.12 presents the corresponding magnitudes when the fully<br />

parallel implementation of RHCA is employed. In this situation, the estimation of the real<br />

execution time is not as accurate as in the serial case, although the running time deviation<br />

suffered when a suboptimal RHCA architecture is selected is, from a practical viewpoint,<br />

perfectly acceptable, a fact that has already been reported for the previous data sets.<br />

C.2.7 MFeat data set<br />

In this section, the results of estimating the execution times of RHCA are compared to their<br />

real counterparts across four diversity scenarios in the context of the MFeat data collection.<br />

The cluster ensemble sizes corresponding to these diversity scenarios are l =6, 60, 114 and 168,<br />

respectively.<br />

Figure C.13 presents the estimated and real running times of multiple variants of the<br />

serial implementation of RHCA on this data set. Besides the notably high accuracy of<br />

the estimation, we would like to highlight that flat consensus turns out to be the most<br />

efficient consensus architecture in the four diversity scenarios for all but two of the consensus<br />

functions employed (MCLA and HGPA), a behaviour that has already been observed in<br />

other data collections with small cluster ensembles (e.g. the Iris data set).<br />

The results corresponding to the parallel implementation of RHCA are depicted in<br />

figure C.14. In this case, the use of the HGPA and MCLA consensus functions as clustering<br />

combiners also makes the RHCA variants with s = 2 and s = 3 stages computationally<br />

optimal. However, for the remaining consensus functions, flat consensus prevails as<br />

the most efficient consensus architecture in most diversity scenarios.<br />

[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.11: Estimated and real running times (RT) of the serial RHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.12: Estimated and real running times (RT) of the parallel RHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.13: Estimated and real running times (RT) of the serial RHCA implementation on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.14: Estimated and real running times (RT) of the parallel RHCA implementation on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



C.3 Estimation of the computationally optimal DHCA<br />

The methodology for selecting the most computationally efficient implementation variant<br />

of deterministic hierarchical consensus architectures presented in section 3.3 –which consists of<br />

estimating the running time of several DHCA variants that differ in the order in which diversity factors<br />

are associated to the stages of the hierarchical consensus architecture, and selecting the one that<br />

yields the minimum estimated running time, which is the variant actually executed– has been applied<br />

on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat unimodal data sets (see<br />

appendix A.2.1 for a description of these collections).<br />
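
In outline, this selection procedure amounts to the following sketch (illustrative Python, not the thesis code; estimate_rt stands for the SERT/PERT estimate built from the timed consensus runs and is an assumed callable):<br />

```python
from itertools import permutations

def select_fastest_dhca(factors, estimate_rt):
    """Enumerate the f! DHCA variants (one per ordering of the diversity factors,
    e.g. ('A', 'D', 'R') -> 'ADR') and return the variant with the smallest
    estimated running time, i.e. the one that is actually executed."""
    best_name, best_rt = None, float('inf')
    for order in permutations(factors):
        rt = estimate_rt(order)
        if rt < best_rt:
            best_name, best_rt = ''.join(order), rt
    return best_name, best_rt

# Example call with the three diversity factors used in this work:
# variant, rt = select_fastest_dhca(('A', 'D', 'R'), estimate_rt)
```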

In these experiments, the fully serial and parallel implementations of DHCA variants<br />

have been considered across the four experimental diversity scenarios employed in this work<br />

—see appendix A.4. The objective of this experiment is twofold: firstly, we seek to verify<br />

whether the proposed strategy succeeds in predicting the most computationally efficient<br />

DHCA variant. And secondly, we intend to analyze the conditions under which deterministic<br />

hierarchical consensus architectures are computationally advantageous compared to flat<br />

consensus clustering. We have followed the experimental design outlined next.<br />

– What do we want to measure?<br />

i) The time complexity of deterministic hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology for predicting the computationally optimal<br />

DHCA variant, in both the fully serial and parallel implementations.<br />

iii) The predictive power of the proposed methodology based on running time estimation<br />

vs the computational optimality criterion based on designing the DHCA<br />

according to a decreasing diversity factor cardinality order, in both the fully<br />

serial and parallel implementations.<br />

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel DHCA variants is<br />

measured in terms of the CPU time required for their execution —serial running<br />

time (SRTDHCA) and parallel running time (PRTDHCA).<br />

ii) The estimated running times of the same DHCA variants –serial estimated running<br />

time (SERTDHCA) and parallel estimated running time (PERTDHCA)– are<br />

computed by means of the proposed running time estimation methodology, which<br />

is based on the measured running time of c = 1 consensus clustering process. A prediction<br />

regarding the computationally optimal DHCA variant is considered successful<br />

if both the real and estimated running times are minimized by the<br />

same DHCA variant, and the percentage of experiments in which prediction<br />

succeeds is given as a measure of its performance. In order to measure the<br />

impact of incorrect predictions, we also measure the execution time differences<br />

(in both absolute and relative terms) between the truly and the allegedly fastest<br />

DHCA variants when prediction fails (a sketch of this evaluation is given at the<br />

end of this experimental design outline). This evaluation process is replicated<br />

for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />

on the prediction accuracy of the proposed methodology.<br />

iii) Both approaches for predicting the computationally optimal DHCA variant are compared<br />

in terms of the percentage of experiments in which prediction is successful,<br />


and in terms of the execution time overheads (in both absolute and relative terms)<br />

between the truly and the allegedly fastest DHCA variants in the case prediction<br />

fails.<br />

– How are the experiments designed? The f! DHCA variants corresponding to<br />

all the possible permutations of the f diversity factors employed in the generation<br />

of the cluster ensemble have been implemented (see table 3.6). As described in appendix<br />

A.4, cluster ensembles have been created by the mutual crossing of f =3<br />

diversity factors: clustering algorithms (dfA), object representations (dfR) and data<br />

dimensionalities (dfD). Thus, in all our experiments, the number of DHCA variants is<br />

f! = 3! = 6, which are identified by an acronym describing the order in which diversity<br />

factors are assigned to stages —for instance, ADR describes the DHCA variant<br />

defined by the ordered list O = {df1 = dfA, df2 = dfD, df3 = dfR}. For a given data<br />

collection, the cardinalities of the representational and dimensional diversity factors<br />

(|dfR| and |dfD|, respectively) are constant, while the cardinality of the algorithmic<br />

diversity factor takes four distinct values |dfA| = {1, 10, 19, 28}, giving rise to the four<br />

diversity scenarios where our proposals are analyzed. Moreover, consensus clustering<br />

has been conducted by means of the seven consensus functions for hard cluster<br />

ensembles described in appendix A.5, which allows evaluating the behaviour of our<br />

proposals under distinct consensus paradigms. In all cases, the real running times<br />

correspond to an average of 10 independent runs of the whole DHCA, in order to<br />

obtain representative real running time values. As described in appendix A.6, all the<br />

experiments have been executed under Matlab 7.0.4 on Pentium 4 3GHz/1 GB RAM<br />

computers.<br />

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the DHCA variants are depicted by means of<br />

curves representing their average values.<br />

– Which data sets are employed? For brevity reasons, the main text (section 3.3) only describes<br />

the results of the experiments conducted on the Zoo data collection, on which<br />

the cardinalities of the representational and dimensional diversity factors are<br />

|dfR| = 5 and |dfD| = 14, respectively. The results of these<br />

same experiments on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat<br />

unimodal data collections are presented in the following subsections of this appendix.<br />
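
The success and overhead bookkeeping described under "How do we measure it?" can be sketched as follows (illustrative Python; the dictionaries map DHCA variant names such as 'ADR' to running times in seconds, and all names are hypothetical):<br />

```python
def prediction_outcome(estimated_rt, real_rt):
    """Compare the variant predicted fastest (minimum estimated RT) with the
    truly fastest one (minimum real RT). On a miss, also report the absolute
    and relative running time overhead incurred by executing the predicted variant."""
    predicted = min(estimated_rt, key=estimated_rt.get)
    fastest = min(real_rt, key=real_rt.get)
    if predicted == fastest:
        return True, 0.0, 0.0
    abs_overhead = real_rt[predicted] - real_rt[fastest]
    return False, abs_overhead, abs_overhead / real_rt[fastest]

def success_percentage(experiments):
    """Percentage of (estimated, real) experiment pairs with a successful prediction."""
    hits = sum(prediction_outcome(est, real)[0] for est, real in experiments)
    return 100.0 * hits / len(experiments)
```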

C.3.1 Iris data set<br />

In this section, we present the estimated and real running times of the serial and parallel implementations<br />

of DHCA on the Iris data collection. As aforementioned, this experiment has<br />

been replicated across four diversity scenarios that, in the case of this data set, correspond<br />

to cluster ensembles of size l =9, 90, 171 and 252.<br />

The left and right columns of figure C.15 present the estimated and real running times<br />

of several variants of the serial implementation of the DHCA and flat consensus on this data<br />

set across the four diversity scenarios. There are a couple of issues worth noting: firstly,<br />

SERTDHCA is a pretty accurate estimator of the real execution time of the serial DHCA<br />

implementation, SRTDHCA. Secondly, notice that flat consensus is faster than even the most<br />

efficient DHCA variant, regardless of the consensus function and the diversity scenario.<br />


Furthermore, we would like to highlight that the computationally optimal DHCA variants<br />

are those defined by an ordered list of diversity factors in decreasing cardinality order, a<br />

trend that is notably well captured by SERTDHCA.<br />
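
As a trivial illustration of this criterion (the cardinalities are those shown in figure C.15 for the |dfA| = 28 scenario; the snippet is hypothetical, not the thesis code), the decreasing cardinality ordering can be read off directly from the factor cardinalities:<br />

```python
# Iris, |dfA| = 28 scenario: |dfA| = 28, |dfR| = 5, |dfD| = 2.
cardinalities = {'A': 28, 'R': 5, 'D': 2}
variant = ''.join(sorted(cardinalities, key=cardinalities.get, reverse=True))
print(variant)  # -> 'ARD', the variant suggested by the decreasing-cardinality criterion
```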

Figure C.16 presents the estimated and real execution times of the fully parallel implementations<br />

of the DHCA variants. Compared to the serial case, the running time<br />

estimation PERTDHCA is not as accurate. Moreover, notice that, as already outlined in<br />

section 3.3, the execution times of the distinct DHCA variants tend to be quite similar.<br />

Last, DHCA variants become faster than flat consensus in the highest diversity scenario.<br />

C.3.2 Wine data set<br />

This section presents the estimated and real execution times of multiple variants of deterministic<br />

hierarchical consensus architectures on the Wine data collection, both in its serial<br />

and parallel versions. The high cardinality of the dimensional diversity factor of this data<br />

set gives rise to relatively large cluster ensembles in the four diversity scenarios, the sizes<br />

of which are equal to l =45, 450, 855 and 1260 in this case.<br />

Figure C.17 depicts the results of this experiment when the fully serial implementation<br />

of DHCA is considered. As already observed in section C.3.1, SERTDHCA is a pretty<br />

good estimator of SRTDHCA. Moreover, as regards the computational efficiency of DHCA<br />

variants, notice that i) those defined by the decreasing cardinality ordered list of diversity<br />

factors are the fastest, and ii) they become faster than flat consensus as the size of the<br />

cluster ensemble is increased.<br />

Meanwhile, figure C.18 presents the results corresponding to the parallel DHCA implementation.<br />

Again, the time complexities of DHCA variants tend to converge, which<br />

reinforces our hypothesis regarding the irrelevance of the way diversity factors are associated<br />

to stages in parallel scenarios. Moreover, notice that DHCA variants are faster than<br />

flat consensus in all but one of the diversity scenarios —except when the EAC consensus<br />

function is employed.<br />

C.3.3 Glass data set<br />

In this section, we present the results of estimating the execution times of the serial and parallel<br />

DHCA implementations on the Glass data collection. In the four diversity<br />

scenarios, generated by employing |dfA| = 1, 10, 19 and 28 clustering algorithms to build<br />

the cluster ensembles, these contain l = 29, 290, 551 and 812 individual partitions.<br />

For starters, the results corresponding to the serial implementation of deterministic<br />

hierarchical consensus architectures are presented in figure C.19. It can be observed that<br />

the estimation of the real execution time is pretty accurate, both in absolute terms (i.e.<br />

SERTDHCA is a good approximation of SRTDHCA) and as regards the determination of the<br />

computationally optimal consensus architecture. Furthermore, notice how the definition of<br />

DHCA variants by diversity factors arranged in decreasing cardinality order gives rise to<br />

the least time consuming configurations, which are even faster than flat consensus as the<br />

cluster ensemble size increases —again, consensus architectures based on the EAC consensus<br />

function constitute the only exception to this rule. Last, notice that when consensus clusterings are<br />

built using the MCLA consensus function, flat consensus is not executable in the highest<br />

diversity scenario.<br />

[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 2 and |dfR| = 5.]<br />

Figure C.15: Estimated and real running times (RT) of the serial DHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 2 and |dfR| = 5.]<br />

Figure C.16: Estimated and real running times (RT) of the parallel DHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 11 and |dfR| = 5.]<br />

Figure C.17: Estimated and real running times (RT) of the serial DHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 11 and |dfR| = 5.]<br />

Figure C.18: Estimated and real running times (RT) of the parallel DHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



Figure C.20 depicts the estimated and real execution times of fully parallel DHCA variants<br />

and flat consensus. The patterns presented in both columns of this figure (estimated<br />

running times on the left column, real execution times on the right) reveal the same behaviour<br />

observed on the previous data sets. That is, all DHCA variants have comparable<br />

running times, which would make running time estimations unnecessary as far as the selection<br />

of the fastest DHCA variant is concerned. However, this estimation is necessary to<br />

decide whether hierarchical consensus is faster than its flat alternative, which occurs in all<br />

the diversity scenarios but the lowest diversity one.<br />

C.3.4 Ionosphere data set<br />

In this section, the results of estimating the execution times of DHCA are compared to their<br />

real counterparts across four diversity scenarios on the Ionosphere data collection. The cluster<br />

ensemble sizes corresponding to these diversity scenarios are l =97, 970, 1843 and 2716,<br />

respectively.<br />

Firstly, figure C.21 depicts the estimated and real execution times considering the fully<br />

serial implementation of deterministic hierarchical consensus architectures. In this case,<br />

SERTDHCA is a fairly good estimator of SRTDHCA, and it constitutes a good basis for<br />

predicting the least time-consuming consensus architecture. Notice that, when consensus<br />

clusterings are built by means of the MCLA consensus function, flat consensus execution<br />

becomes impossible (given the computational resources employed in our experiments, see<br />

appendix A.6), so its hierarchical counterpart becomes a feasible alternative. Moreover, if<br />

hierarchical consensus is implemented by means of the DHCA variant defined by an ordered<br />

list of diversity factors arranged in decreasing cardinality order, notable computation time<br />

savings can be obtained.<br />

And secondly, as far as the fully parallel DHCA implementation is concerned (see figure<br />

C.22), the following observations can be made: i) PERTDHCA is a pretty accurate estimator<br />

of PRTDHCA, ii) there are no significant differences between the running times of the<br />

distinct variants of DHCA, and iii) flat consensus is more computationally costly than its<br />

hierarchical counterpart in all but one of the diversity scenarios considered.<br />

C.3.5 WDBC data set<br />

In this section, let us analyze the results corresponding to the WDBC data collection. In<br />

this case, each diversity scenario corresponds to a cluster ensemble of size l = 113, 1130, 2147<br />

and 3164, respectively. Firstly, the left and right columns of figure C.23 present the<br />

estimated and real running times of the variants of the serial implementation of the DHCA<br />

on this data set across the four diversity scenarios.<br />

It can be observed that the proposed methodology yields a pretty good estimation<br />

of the real running time of DHCA variants. This allows the user to make well-grounded<br />

decisions regarding the most efficient hierarchical consensus architectures. For this data set,<br />

flat consensus is the computationally optimal architecture except in the highest diversity<br />

scenario (unless the EAC consensus function is employed).<br />

Secondly, figure C.24 depicts the results corresponding to the parallel implementation<br />

of DHCA. The same conclusions drawn for the previous data collections are also<br />

applicable to the WDBC data set. That is, running times are almost independent of the<br />

[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 7 and |dfR| = 5.]<br />

Figure C.19: Estimated and real running times (RT) of the serial DHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 7 and |dfR| = 5.]<br />

Figure C.20: Estimated and real running times (RT) of the parallel DHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 32 and |dfR| = 4.]<br />

Figure C.21: Estimated and real running times (RT) of the serial DHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 32 and |dfR| = 4.]<br />

Figure C.22: Estimated and real running times (RT) of the parallel DHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



DHCA variant implemented, at least, no significant differences are observed among them (as opposed to the serial implementation case), while hierarchical consensus is more efficient than flat consensus as soon as the cluster ensemble size grows.

C.3.6 Balance data set

This section presents the results of estimating the execution times of the fully serial and parallel implementations of DHCA in the four diversity scenarios for the Balance data set, which give rise to cluster ensembles of sizes l = 7, 70, 133 and 196, respectively.

Firstly, figure C.25 presents the estimated and real execution times corresponding to the serial implementation context. It is quite apparent that SERTDHCA provides the user with a good estimation of the real running time of consensus architectures (SRTDHCA) and, as such, it allows determining the computationally optimal consensus architecture with a high degree of accuracy. In this case, given the small size of the cluster ensemble in any of the four diversity scenarios, flat consensus is faster than most serial DHCA variants.
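Since the estimates closely track the real timings, they can be used directly to pick the cheapest architecture before any consensus process is actually run. The following Python sketch illustrates the idea; it is not the code used in this thesis, and both the variant names and the timing values are merely illustrative placeholders.

```python
# Minimal sketch: choose the consensus architecture with the smallest estimated
# running time. The estimates dictionary is a hypothetical stand-in for the
# SERT/PERT values computed for each DHCA variant and for flat consensus.

def select_optimal_architecture(estimated_rt):
    """Return the architecture name whose estimated running time is smallest."""
    return min(estimated_rt, key=estimated_rt.get)

# Made-up estimates (in seconds): with a small cluster ensemble, flat consensus wins.
estimates = {"ADR": 12.4, "ARD": 11.9, "DAR": 13.1, "DRA": 12.7,
             "RAD": 12.2, "RDA": 11.5, "flat": 3.8}
print(select_optimal_architecture(estimates))  # -> flat
```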

Secondly, the results corresponding to the parallel implementation of DHCA are depicted in figure C.26. Once more, all DHCA variants have very similar running times. As in the serial case, flat consensus is faster than any of its hierarchical counterparts, except when the HGPA and MCLA consensus functions are employed as clustering combiners.

C.3.7 MFeat data set

This section describes the results of the minimum complexity DHCA variant selection based on running time estimation. In the case of the MFeat data collection, cluster ensembles of sizes l = 6, 60, 114 and 168 correspond to the four diversity scenarios where this experiment is conducted.

Figure C.27 depicts the estimated and real execution times of the fully serial implementation of DHCA. In this case, SERTDHCA is a fairly accurate estimator of SRTDHCA and, as such, it is a good predictor of the most computationally efficient consensus architecture. In most cases, however, due to the relatively small sizes of the cluster ensembles in this data set, flat consensus is faster than any of the DHCA variants, except when the HGPA and MCLA consensus functions are employed in high diversity scenarios.

Last, figure C.28 presents the results corresponding to the parallel DHCA implementation. Again, the time complexities of the DHCA variants reach very similar values. However, notice that DHCA variants are slower than flat consensus in most of the diversity scenarios, except when the HGPA and MCLA consensus functions are used for combining the clusterings.



Figure C.23: Estimated and real running times (RT) of the serial DHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (SERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (SRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 28 and |dfR| = 5.


Figure C.24: Estimated and real running times (RT) of the parallel DHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (PERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (PRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 28 and |dfR| = 5.


Figure C.25: Estimated and real running times (RT) of the serial DHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (SERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (SRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 2 and |dfR| = 4.


Figure C.26: Estimated and real running times (RT) of the parallel DHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (PERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (PRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 2 and |dfR| = 4.


Figure C.27: Estimated and real running times (RT) of the serial DHCA implementation on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (SERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (SRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 1 and |dfR| = 6.


Figure C.28: Estimated and real running times (RT) of the parallel DHCA implementation on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (PERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (PRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 1 and |dfR| = 6.



C.4 Computationally optimal RHCA, DHCA and flat consensus comparison

In this section, we compare those random and deterministic hierarchical consensus architectures deemed to be computationally optimal against classic flat consensus in terms of two factors: i) their execution times, and ii) the quality of the consensus clustering solutions they yield. This twofold comparison is intended to determine under which conditions any of the aforementioned consensus architectures outperforms the others, not only in terms of computational efficiency, but also as far as their performance for the construction of robust clustering systems is concerned.

This comparison has been conducted across the following eleven unimodal data collections: Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC and PenDigits. For each data set, ten independent experiments have been conducted on four diversity scenarios. Each diversity scenario is characterized by the use of cluster ensembles generated by applying a certain number of clustering algorithms, |dfA| = {1, 10, 19, 28}. In each experiment, the CPU time required for executing either the whole hierarchical consensus architecture or flat consensus is measured, and the quality of the consensus clustering solution is evaluated in terms of its normalized mutual information φ (NMI) with respect to the ground truth.
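As a concrete illustration of this quality measure, the following Python sketch scores a consensus labeling against a ground truth partition using scikit-learn's normalized mutual information; the label vectors are invented for the example, and the library's default normalization may differ slightly from the φ (NMI) definition used in this thesis.

```python
# Minimal sketch (assumes scikit-learn): score a consensus clustering against the
# ground truth with normalized mutual information. Both label vectors are invented.
from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # reference class labels
consensus    = [1, 1, 0, 0, 0, 0, 2, 2, 2]   # labels yielded by some consensus process

phi_nmi = normalized_mutual_info_score(ground_truth, consensus)
print(round(phi_nmi, 3))  # 1.0 only when both partitions agree up to a relabeling
```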

From a visualization perspective, both the execution times and the φ (NMI) values are presented by means of their respective boxplots, each of which comprises the ten independent experiments conducted on each diversity scenario for each data collection. When comparing boxplots, notice that non-overlapping box notches indicate that the medians of the compared running times differ at the 5% significance level, which allows a quick inference of the statistical significance of the results.
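The sketch below shows how such notched boxplots can be produced with matplotlib; it is only a minimal illustration of the plotting convention, and the timing values are randomly generated rather than taken from the experiments.

```python
# Minimal sketch (assumes matplotlib and numpy): one notched box per consensus
# architecture, built from ten runs of a single diversity scenario. Values are made up.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
runs = {"RHCA": rng.normal(2.0, 0.2, 10),
        "DHCA": rng.normal(1.8, 0.2, 10),
        "flat": rng.normal(3.5, 0.3, 10)}

fig, ax = plt.subplots()
ax.boxplot(list(runs.values()), notch=True, labels=list(runs.keys()))
ax.set_ylabel("CPU time (sec.)")
plt.show()  # non-overlapping notches suggest the medians differ at the 5% level
```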

C.4.1 Iris data set

In this section, the running time and consensus quality comparison experiments are conducted on the Iris data collection. The four diversity scenarios correspond to cluster ensembles of sizes l = 9, 90, 171 and 252.

Running time comparison

Figure C.29 presents the running times of the allegedly computationally optimal RHCA and DHCA variants (considering their serial implementation) and flat consensus. Due to the relatively small cluster ensembles on this data set, it can be observed that flat consensus is the fastest option in most cases, regardless of the diversity scenario and the consensus function employed. The only exceptions occur when consensus is built using the MCLA consensus function in the two highest diversity scenarios (in these cases, DHCA turns out to be the most efficient consensus architecture), as this is the only consensus function whose time complexity grows quadratically with the size of the cluster ensemble l. Among the hierarchical consensus architectures, DHCA tends to outperform RHCA in computational terms, except when the ALSAD consensus function is employed (although the differences between RHCA and DHCA are in general minor).

The execution times corresponding to flat consensus and the entirely parallel implementation of hierarchical architectures are presented in figure C.30. It can be observed that flat consensus gradually moves from being the optimal consensus architecture in the lowest diversity scenario to being the slowest in the highest diversity one. Compared to the serial implementation, the running time differences between RHCA and DHCA are less significant in this case, except when the ALSAD consensus function is the base of the consensus architecture.

Consensus quality comparison

As regards the quality of the consensus clustering solutions yielded by each consensus architecture as a function of the consensus function employed and the diversity scenario, the results obtained are presented in figure C.31. A few observations can be made. Firstly, if the results obtained by the seven consensus functions are compared, notice that fairly different performances are obtained: for instance, HGPA gives rise to considerably poorer quality consensus than the remaining consensus functions, as none of its boxes exceeds the φ (NMI) = 0.6 level. Moreover, these relative performances are maintained across the different diversity scenarios. Secondly, the variability of the quality of the consensus clustering solutions can be evaluated by observing figure C.31(d), as the depicted boxplots correspond to ten independent runs of the consensus clustering processes on the same cluster ensemble. Notice that the major differences are observed in the HGPA, MCLA and KMSAD consensus functions, which, as aforementioned, contain some random parameters that make their performance vary (largely, as in HGPA, or slightly, as in MCLA) from run to run. Thirdly, the relative comparison of the quality of the consensus solutions yielded by the two HCA and flat consensus is local to the consensus function employed. Whereas DHCA seems to give rise to better consensus clustering solutions when the CSPA, ALSAD and KMSAD consensus functions are employed, it tends to be outperformed by RHCA and flat consensus when clusterings are combined by EAC or HGPA. Last, notice that the highest level of similarity between the top-quality cluster ensemble components and the consensus clustering solutions corresponds to DHCA based on the CSPA consensus function.

C.4.2 Wine data set

This section presents the comparison between flat consensus and the computationally optimal consensus architectures in terms of CPU execution time and normalized mutual information between the ground truth and the consensus clustering solution yielded by each one of them. On this data collection, the cluster ensemble sizes corresponding to the four diversity scenarios are l = 45, 450, 855 and 1260.

Running time comparison

As regards the execution times of the fully serial implementations of the estimated optimal RHCA and DHCA variants and flat consensus, two distinct evolution patterns can be observed (see figure C.32). Firstly, those consensus architectures using the CSPA, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions follow the same evolution pattern observed, for instance, in the Iris data collection (i.e. the larger the cluster ensemble, the more efficient hierarchical architectures become compared to flat consensus). In contrast, the consensus architectures based on the EAC consensus function present a fairly different behaviour, as flat consensus is the fastest option regardless of the diversity scenario.


Figure C.29: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.30: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.31: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the φ (NMI) boxplots of the cluster ensemble components (labelled E) and of the RHCA, DHCA and flat consensus solutions for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.



That is, as already observed in sections C.2.5 and C.3.5, the time complexity behaviour of consensus architectures is local to the consensus function employed for combining the clusterings.

As regards the computational complexity of the parallel implementation of HCA (see figure C.33), it can be observed that hierarchical architectures become much faster than flat consensus as soon as the size of the cluster ensemble is increased. As in the previous data collection, the running times of parallel RHCA and DHCA are quite similar.

Consensus quality comparison

As far as the quality of the consensus clustering solutions obtained by the distinct consensus architectures is concerned, figure C.34 depicts the corresponding φ (NMI) boxplots. Again, performances are highly local to the consensus function employed: in this case, those consensus architectures based on the EAC, HGPA and SLSAD consensus functions give rise to the lowest quality consensus clusterings. If the three consensus architectures are compared, it can be observed that RHCA and flat consensus tend to perform quite similarly, while worse clustering solutions are generally obtained from DHCA. Notice that the highest robustness to clustering indeterminacies (i.e. consensus clustering solutions of comparable quality to the cluster ensemble components of highest φ (NMI)) is obtained from the RHCA and flat consensus architectures based on MCLA, ALSAD and KMSAD.

C.4.3 Glass data set

In this section, we present the running times and quality evaluation (by means of φ (NMI) values) of the consensus clustering processes implemented by means of the serial and parallel RHCA and DHCA implementations and flat consensus on the Glass data collection. The cluster ensemble sizes corresponding to the four diversity scenarios in which our experiments are conducted are l = 29, 290, 551 and 812.

Running time comparison

Figure C.35 presents the boxplot charts that represent the running times of the three implemented consensus architectures, considering the entirely serial implementation of the hierarchical ones. As in the previous data collections, flat consensus is the fastest option in the lowest diversity scenario, whereas hierarchical consensus architectures become more computationally efficient as soon as the size of the cluster ensemble increases (for all but the EAC consensus function). This again highlights the interest of structuring consensus processes in a hierarchical manner as a means for i) reducing their time complexity when they are to be conducted on large cluster ensembles, and ii) obtaining a consensus clustering solution when the execution of flat consensus becomes unfeasible (e.g. when the MCLA consensus function is employed in the highest diversity scenario).

The computational complexity of the consensus architectures presents a very similar behaviour when the parallel implementation of the hierarchical versions is studied (see figure C.36). In this case, though, the differences between the running times of flat and hierarchical consensus architectures are even larger.



Figure C.32: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.33: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.34: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the φ (NMI) boxplots of the cluster ensemble components (labelled E) and of the RHCA, DHCA and flat consensus solutions for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.35: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.36: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


Consensus quality comparison

As regards the quality of the consensus clustering process, figure C.37 presents the boxplots depicting the φ(NMI) values of the components of the cluster ensemble E and of the consensus clustering solutions output by the RHCA, DHCA and flat consensus architectures. On this data collection, the φ(NMI) differences between the clustering solutions output by the three consensus architectures are, in general terms, small, except when the EAC consensus function is employed, in which case flat consensus is clearly superior. Moreover, ALSAD and SLSAD stand out as the top performers among the remaining consensus functions.
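The per-architecture comparison summarized in these boxplots amounts to computing φ(NMI) between each consensus partition and the ground truth and grouping the scores by architecture. The following is a minimal sketch of that evaluation, assuming the partitions are available as integer label arrays; the variable names and the toy data are hypothetical.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi_scores(partitions, ground_truth):
    """phi(NMI) of each partition (array of cluster labels) w.r.t. the ground truth."""
    return np.array([normalized_mutual_info_score(ground_truth, p) for p in partitions])

# Hypothetical example: 10 independent runs per consensus architecture on a 6-object toy problem.
rng = np.random.default_rng(0)
ground_truth = np.array([0, 0, 1, 1, 2, 2])
results = {
    "RHCA": [rng.integers(0, 3, size=6) for _ in range(10)],
    "DHCA": [rng.integers(0, 3, size=6) for _ in range(10)],
    "flat": [rng.integers(0, 3, size=6) for _ in range(10)],
}

for arch, partitions in results.items():
    scores = nmi_scores(partitions, ground_truth)
    # The median and interquartile range are what the phi(NMI) boxplots in this appendix summarize.
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{arch}: median={med:.3f}  IQR=[{q1:.3f}, {q3:.3f}]")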

C.4.4 Ionosphere data set

This section presents the execution times of the computationally optimal RHCA, DHCA and flat consensus architectures and the φ(NMI) values of the consensus clustering solutions they yield on the Ionosphere data collection. The results correspond to experiments conducted across the four diversity scenarios, whose cluster ensemble sizes are l = 97, 970, 1843 and 2716, respectively.
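The quoted ensemble sizes follow a simple product rule: on this data collection each unit of |dfA| contributes 97 partitions, so l = 97·|dfA| for |dfA| = 1, 10, 19 and 28, and the same pattern, with a different per-dataset base, is consistent with the figures quoted for the WDBC and Balance collections described below. The sketch below makes that arithmetic explicit; the rule is an observation that matches the quoted values, not a definition taken from this appendix.

base_partitions = {"Ionosphere": 97, "WDBC": 113, "Balance": 7}   # partitions per unit of |dfA|, from the text
diversity_scenarios = (1, 10, 19, 28)                             # the four values of |dfA|

for dataset, base in base_partitions.items():
    sizes = [base * dfa for dfa in diversity_scenarios]
    print(dataset, sizes)
# Ionosphere -> [97, 970, 1843, 2716]; WDBC -> [113, 1130, 2147, 3164]; Balance -> [7, 70, 133, 196]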

Running time comparison

The execution times of flat consensus and of the serially implemented RHCA and DHCA are depicted in the boxplot charts of figure C.38. The relative behaviour of the three consensus architectures is very similar to the one observed on the previous data collections: i) flat consensus becomes slower than its hierarchical counterparts as the size of the cluster ensemble grows, except when the EAC consensus function is employed, and ii) RHCA tends to be faster than DHCA when the EAC, ALSAD and SLSAD consensus functions are used, whereas the opposite behaviour is observed for the hypergraph-based consensus functions.

The running times obtained when the hierarchical consensus architectures are implemented in an entirely parallel manner are presented in figure C.39. As expected, RHCA and DHCA become noticeably more efficient than flat consensus. Notice, however, that notable differences between the running times of the two hierarchical architectures can be found under certain consensus functions, such as EAC, ALSAD or SLSAD.
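The serial and parallel figures differ only in whether the independent mini-consensus processes of a hierarchical stage are executed one after another or dispatched concurrently. The sketch below illustrates that measurement setup with a deliberately toy combiner (a per-object plurality vote over already aligned labelings); it is not one of the consensus functions evaluated here, and the stage decomposition is assumed for illustration rather than taken from the thesis.

import time
import numpy as np
from multiprocessing import Pool

def toy_combiner(mini_ensemble):
    """Toy stand-in for a consensus function: per-object plurality vote over aligned labelings."""
    stacked = np.stack(mini_ensemble)  # shape: (partitions, objects)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, stacked)

def run_stage(mini_ensembles, parallel):
    """One hierarchical stage: combine each mini-ensemble, serially or with a worker pool."""
    start = time.perf_counter()
    if parallel:
        with Pool() as pool:
            outputs = pool.map(toy_combiner, mini_ensembles)
    else:
        outputs = [toy_combiner(m) for m in mini_ensembles]
    return outputs, time.perf_counter() - start

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical stage: 40 mini-ensembles of 5 partitions each over 2000 objects.
    stage = [[rng.integers(0, 4, size=2000) for _ in range(5)] for _ in range(40)]
    _, t_serial = run_stage(stage, parallel=False)
    _, t_parallel = run_stage(stage, parallel=True)
    print(f"serial: {t_serial:.2f}s   parallel: {t_parallel:.2f}s")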

Consensus quality comparison

As far as the quality of the consensus clustering process is concerned, the φ(NMI) boxplots corresponding to the consensus clustering solutions obtained by the RHCA, DHCA and flat consensus architectures across the four diversity scenarios on the Ionosphere data collection are presented in figure C.40. A notably high variability in the relative optimality of the consensus architectures is found depending on the consensus function employed. For instance, DHCA tends to yield the highest quality results when consensus is conducted by means of SLSAD, flat consensus gives the best consensus clustering solutions when they are derived by HGPA, and when MCLA is chosen as the clustering combiner, RHCA attains higher φ(NMI) values than the remaining consensus architectures. In contrast, only marginal differences are observed between the qualities of the consensus clustering solutions derived by the three consensus architectures when the remaining consensus functions are employed.
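A compact way to read boxplot grids such as figure C.40 is to reduce each architecture's score distribution to its median and select, per consensus function, the architecture with the highest one. A minimal sketch follows; the numbers are hypothetical placeholders, not results from this appendix.

import numpy as np

# Hypothetical phi(NMI) samples per consensus function and architecture.
scores = {
    "SLSAD": {"RHCA": [0.31, 0.35], "DHCA": [0.42, 0.44], "flat": [0.30, 0.33]},
    "HGPA":  {"RHCA": [0.10, 0.12], "DHCA": [0.11, 0.13], "flat": [0.25, 0.27]},
    "MCLA":  {"RHCA": [0.40, 0.43], "DHCA": [0.36, 0.38], "flat": [0.35, 0.37]},
}

for function, by_arch in scores.items():
    best = max(by_arch, key=lambda arch: np.median(by_arch[arch]))
    print(f"{function}: best architecture by median phi(NMI) -> {best}")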



[Boxplot panels omitted; same layout as figure C.34.]

Figure C.37: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.38: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.39: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.34.]

Figure C.40: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


C.4.5 WDBC data set

In this section, we present the running times and the qualities of the consensus clustering solutions obtained by the hierarchical and flat consensus architectures on the WDBC data collection. In this case, the four diversity scenarios correspond to cluster ensembles of sizes l = 113, 1130, 2147 and 3164, respectively.

Running time comparison

Figure C.41 presents the running times of flat consensus and of the serial implementations of the fastest RHCA and DHCA variants across the four diversity scenarios. In this case, the relationship between the execution times of these consensus architectures differs somewhat from what has been observed on the previous data collections. In particular, flat consensus is a more competitive alternative, being faster than or almost as fast as RHCA in all diversity scenarios when consensus is based on the CSPA, EAC, ALSAD and SLSAD clustering combiners. In contrast, DHCA is notably slower than RHCA in most cases. This is due to the large cardinality of the dimensional diversity factor (|dfD| = 28 on this data set), which makes the DHCA stage where consensus is conducted along this diversity factor much more computationally costly than the intermediate consensus processes of RHCA.

In figure C.42, the execution times of the computationally optimal parallel RHCA and DHCA variants and of flat consensus are presented. Two trends are observed in these boxplots: firstly, the hierarchical architectures are faster than flat consensus, especially in the diversity scenarios where large cluster ensembles are employed; and secondly, the parallel DHCA variants are in general slower than their RHCA counterparts, for the reason stated above.
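The penalty attributed to the |dfD| = 28 stage can be illustrated with a back-of-the-envelope cost model: if the cost of a single consensus process grows super-linearly with the number of partitions it merges, combining 28 partitions in one shot is considerably more expensive than consolidating them through several small merges. The sketch below assumes, purely for illustration, a quadratic per-process cost and pairwise RHCA merges; neither assumption is taken from the thesis.

def process_cost(num_partitions, alpha=1.0):
    # Assumed cost model: quadratic in the number of partitions combined in one consensus process.
    return alpha * num_partitions ** 2

def pairwise_tree_cost(num_partitions):
    # Cost of consolidating the same partitions through a binary tree of pairwise merges.
    total, remaining = 0.0, num_partitions
    while remaining > 1:
        merges = remaining // 2
        total += merges * process_cost(2)
        remaining = merges + remaining % 2
    return total

print(process_cost(28))        # single 28-wide stage (DHCA-style): 784.0
print(pairwise_tree_cost(28))  # 27 pairwise merges (RHCA-style):   108.0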

Consensus quality comparison

Figure C.43 presents the φ(NMI) of the consensus clustering solutions yielded by the RHCA, DHCA and flat consensus architectures across the four diversity scenarios on the WDBC data collection. Firstly, notice that the EAC and SLSAD consensus functions give rise to very low-quality consensus clusterings regardless of the consensus architecture employed. In contrast, flat consensus yields reasonably good consensus clusterings when they are derived by means of HGPA, whereas the hierarchical consensus architectures based on this consensus function output poor consensus clustering solutions. Meanwhile, the remaining clustering combiners yield fairly good consensus clusterings, with slightly better results obtained when consensus is derived by means of the RHCA and flat consensus architectures.

C.4.6 Balance data set

This section presents the execution times of the estimated computationally optimal serial and parallel implementations of RHCA and DHCA and of flat consensus in the four diversity scenarios for the Balance data set, which give rise to cluster ensembles of sizes l = 7, 70, 133 and 196, respectively. Moreover, the quality of the consensus clustering solutions output by each consensus architecture is evaluated in terms of the normalized mutual information (φ(NMI)) with respect to the ground truth.



[Boxplot panels omitted; same layout as figure C.35.]

Figure C.41: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.42: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.34.]

Figure C.43: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


Running time comparison

The characteristics of this data set, in particular the low cardinalities of the diversity factors associated with it, make flat consensus the fastest consensus architecture when compared to the serial implementations of RHCA and DHCA, regardless of the diversity scenario (see figure C.44). The only exception is the MCLA consensus function, whose time complexity scales quadratically with the size of the ensemble, which penalizes one-step consensus processes compared to their hierarchical counterparts.
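The MCLA exception follows directly from that quadratic scaling: for a combiner whose cost is roughly linear in the ensemble size l, a single flat run over a small ensemble is hard to beat, but a quadratic combiner can already be cheaper to apply over several small mini-ensembles plus a final stage. The sketch below contrasts the two regimes for the Balance ensemble sizes under assumed cost models; the constants and the two-level decomposition are illustrative, not measurements.

def flat_cost(l, cost):
    return cost(l)

def two_level_cost(l, cost, group_size=7):
    # Two-level hierarchy: split the ensemble into groups, combine each, then combine the outputs.
    groups = -(-l // group_size)  # ceiling division
    return groups * cost(group_size) + cost(groups)

linear = lambda l: l            # e.g. a combiner whose cost grows linearly with l
quadratic = lambda l: l ** 2    # e.g. an MCLA-like combiner

for l in (7, 70, 133, 196):     # Balance ensemble sizes for |dfA| = 1, 10, 19, 28
    print(l,
          flat_cost(l, linear), two_level_cost(l, linear),
          flat_cost(l, quadratic), two_level_cost(l, quadratic))
# Under the linear model flat consensus is never worse, whereas under the quadratic model the
# hierarchical decomposition wins as soon as the ensemble grows beyond a single group.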

A somewhat milder version of this same behaviour is observed in the running time analysis of the parallel implementations of the hierarchical consensus architectures, presented in figure C.45. In this case, though, RHCA and DHCA are as fast as or faster than flat consensus when the HGPA, MCLA and KMSAD consensus functions are employed.

Consensus quality comparison

As regards the quality of the consensus clustering solutions output by the three consensus architectures, figure C.46 shows the results obtained on the Balance data collection. Notice that the EAC and HGPA consensus functions yield, in general, the lowest quality results. For the remaining consensus functions, consensus solutions of fairly similar quality are obtained by means of the three architectures, except for the ALSAD and SLSAD consensus functions, where notable differences are observed between the hierarchical architectures and flat consensus.



[Boxplot panels omitted; same layout as figure C.35.]

Figure C.44: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.45: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.34.]

Figure C.46: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.



C.4.7 MFeat data set

This section describes the performance of the minimum complexity serial and parallel RHCA and DHCA variants on the MFeat data collection, which are compared to classic flat consensus in terms of the time required for their execution and of the quality of the consensus clustering solutions they yield. Cluster ensembles of sizes l = 6, 60, 114 and 168 correspond to the four diversity scenarios in which these experiments are conducted.

Running time comparison

Figure C.47 presents the running times of flat consensus and of the serial implementations of RHCA and DHCA. Notice that, except when the HGPA and MCLA consensus functions are employed, flat consensus is faster than any of its hierarchical counterparts regardless of the size of the cluster ensemble (i.e. it is faster in all the diversity scenarios).

When the parallel implementation of the HCA is considered (see figure C.48), the observed behaviour is very similar to the one just reported. That is, flat consensus is the most computationally efficient consensus architecture, except when consensus functions based on hypergraph partitioning are employed. This is due to the fact that, on the MFeat data collection, the low cardinality of the diversity factors gives rise to relatively small cluster ensembles, which makes flat consensus a competitive alternative to hierarchical consensus architectures.

Consensus quality comparison

Figure C.49 presents the quality of the consensus clustering solutions yielded by the flat and hierarchical consensus architectures in the form of φ (NMI) boxplot diagrams. An inter-consensus function analysis reveals that EAC, HGPA and SLSAD yield, in general terms, the lowest quality results, while CSPA, ALSAD and KMSAD stand out as the best performing consensus functions, as they yield consensus clustering solutions whose quality is comparable to that of the cluster ensemble components that best reveal the true cluster structure of the data set (i.e. those attaining the highest φ (NMI) values). Meanwhile, an intra-consensus function study shows that, whereas the three consensus architectures yield consensus solutions of fairly similar quality when based on CSPA and ALSAD, larger differences between RHCA, DHCA and flat consensus are observed in other cases, such as when consensus clustering is conducted by means of the EAC, HGPA, MCLA or SLSAD consensus functions.

C.4.8 miniNG data set

In this section, we present the running times and quality evaluation (by means of φ (NMI) values) of the consensus clustering processes implemented by the serial and parallel RHCA and DHCA variants and by flat consensus on the miniNG data collection. The cluster ensemble sizes corresponding to the four diversity scenarios in which our experiments are conducted are l = 73, 730, 1387 and 2044.


[Figure C.47: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.48: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.49: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.46.]



Running time comparison

The miniNG data collection is one of those cases where the cardinality of the diversity factors employed for generating the cluster ensemble, together with the number of objects it contains, makes flat consensus non-executable (for all but the EAC consensus function) in those scenarios where the cluster ensemble size is relatively large. In this situation, hierarchical consensus architectures become a means of making consensus clustering feasible.
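To make the idea concrete, the following minimal Python sketch illustrates the general principle only, not the actual RHCA or DHCA procedure used in this thesis: the ensemble is split into small mini-ensembles, a consensus function is applied to each of them, and the resulting partitions form a smaller ensemble that feeds the next stage. The toy_consensus function (majority voting after a naive greedy label alignment) is a hypothetical stand-in for any of the consensus functions employed in the experiments.

    # Illustrative sketch (not the thesis' RHCA/DHCA implementation): combine a
    # large cluster ensemble in stages, s partitions at a time, so that each call
    # to the consensus function only sees a small mini-ensemble.
    import numpy as np

    def align_labels(reference, labels):
        """Greedy relabeling of `labels` so that its clusters match `reference`
        (a crude stand-in for a proper Hungarian assignment)."""
        aligned = np.empty_like(labels)
        for c in np.unique(labels):
            mask = labels == c
            # map cluster c to the reference label it overlaps with the most
            aligned[mask] = np.bincount(reference[mask]).argmax()
        return aligned

    def toy_consensus(mini_ensemble):
        """Majority vote after aligning every partition to the first one; acts
        as a placeholder for any consensus function (CSPA, EAC, MCLA, ...)."""
        ref = mini_ensemble[0]
        aligned = np.vstack([align_labels(ref, p) for p in mini_ensemble])
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, aligned)

    def hierarchical_consensus(ensemble, s=5):
        """Apply toy_consensus to groups of at most s partitions per stage until
        a single consensus partition remains."""
        current = list(ensemble)
        while len(current) > 1:
            current = [toy_consensus(current[i:i + s])
                       for i in range(0, len(current), s)]
        return current[0]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        truth = rng.integers(0, 3, size=200)            # 200 objects, 3 clusters
        # ensemble of 30 noisy copies of the true partition
        ensemble = [np.where(rng.random(200) < 0.2,
                             rng.integers(0, 3, 200), truth) for _ in range(30)]
        print(hierarchical_consensus(ensemble, s=5)[:20])

The memory footprint of each call only depends on the mini-ensemble size s, which is why this kind of staged combination remains executable where a single flat consensus over the whole ensemble is not.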

As regards the serial implementations of RHCA and DHCA –figure C.50–, the former tends to be faster than the latter, except when the HGPA and MCLA consensus functions are employed. The same relative behaviour between the two consensus architectures is also observed in the parallel implementation case, presented in figure C.51.

Consensus quality comparison

The quality of the consensus clustering solutions output by the flat and hierarchical consensus architectures can be analyzed on the basis of the φ (NMI) boxplot charts depicted in figure C.52. A single remark as regards the performance of the distinct consensus functions: notice that the CSPA-, ALSAD- and KMSAD-based consensus solutions are the best ones in quality terms. Lastly, the φ (NMI) values of the consensus clusterings output by the two hierarchical consensus architectures –the only ones able to operate across all the diversity scenarios– are fairly similar in most cases.

C.4.9 Segmentation data set

This section presents the comparison between flat consensus and the computationally optimal consensus architectures in terms of CPU execution time and of the normalized mutual information between the ground truth and the consensus clustering solution yielded by each of them (a minimal sketch of how such a φ (NMI) score can be computed is given below). On the Segmentation data collection, the cluster ensemble sizes corresponding to the four diversity scenarios are l = 52, 520, 988 and 1456.
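As a reference, the sketch below shows one standard way of computing such a normalized mutual information score between a consensus labeling and the ground truth. It assumes the geometric-mean normalization of Strehl and Ghosh's φ (NMI); the implementation actually used in this thesis may differ in minor details (log base, handling of degenerate partitions).

    # Minimal sketch of a phi(NMI) computation between two labelings, assuming
    # normalization by the geometric mean of the two entropies.
    import numpy as np

    def nmi(labels_a, labels_b):
        labels_a = np.asarray(labels_a)
        labels_b = np.asarray(labels_b)
        n = labels_a.size
        # contingency table between the two partitions
        a_vals, a_idx = np.unique(labels_a, return_inverse=True)
        b_vals, b_idx = np.unique(labels_b, return_inverse=True)
        cont = np.zeros((a_vals.size, b_vals.size))
        np.add.at(cont, (a_idx, b_idx), 1)
        pxy = cont / n
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
        hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
        hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
        return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

    # identical groupings (up to cluster relabeling) score ~1, unrelated ones ~0
    print(nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # ~1.0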

Running time comparison

Figure C.53 presents the execution times of the flat consensus architecture and the estimated computationally optimal serial random and deterministic hierarchical consensus architectures. In this case, flat consensus is faster than RHCA and DHCA regardless of the cluster ensemble size (in our range of observation), except when the HGPA and MCLA consensus functions are employed —in fact, MCLA-based flat consensus is unfeasible in the two largest diversity scenarios. Moreover, the relative speed comparison between RHCA and DHCA yields different results depending on the consensus function employed: RHCA is faster than DHCA if consensus is based on CSPA, EAC, ALSAD or SLSAD, while the opposite behaviour is observed when the HGPA, MCLA and KMSAD consensus functions are used.

Fairly similar results are obtained when the running times of the fully parallel implementations of RHCA and DHCA are analyzed, as figure C.54 reveals. The main difference with respect to what has just been reported is the logical speed-up of the HCAs, which makes them faster than flat consensus in the highest diversity scenario.
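This speed-up can be understood with a small back-of-the-envelope estimate. Assuming, as an idealization, that all the intermediate consensus processes of a given stage run concurrently while the stages themselves execute one after another, the serial running time adds up every individual consensus run, whereas the parallel running time only adds the slowest run of each stage. The following hypothetical sketch (invented timings, not measurements from these experiments) illustrates the difference.

    # Hedged sketch: idealized estimate of serial vs. parallel running times of a
    # staged hierarchical consensus architecture from per-run timings.
    # stage_times[s][j] is the CPU time of the j-th consensus run of stage s.

    def serial_time(stage_times):
        # every consensus run executes back to back
        return sum(sum(stage) for stage in stage_times)

    def parallel_time(stage_times):
        # stages are sequential, but all runs within a stage overlap completely
        return sum(max(stage) for stage in stage_times)

    # hypothetical example: 3 stages with 8, 2 and 1 consensus runs respectively
    stage_times = [[0.4] * 8, [0.5] * 2, [0.6]]
    print(round(serial_time(stage_times), 2))    # 4.8
    print(round(parallel_time(stage_times), 2))  # 1.5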


[Figure C.50: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the miniNG data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.51: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the miniNG data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.52: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the miniNG data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.46.]


[Figure C.53: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.54: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]



Consensus quality comparison

The φ (NMI) values of the consensus clustering solutions yielded by the flat, random hierarchical and deterministic hierarchical consensus architectures –see figure C.55– follow a pattern that is quite similar to what has been observed on the previous data collections, at least as far as the performance of the distinct consensus functions is concerned. That is, the lowest quality consensus solutions are obtained by means of the EAC and HGPA consensus functions, whereas ALSAD tends to yield the best results.

C.4.10 BBC data set

In this section, the running time and consensus quality comparison experiments are conducted on the BBC data collection. The four diversity scenarios correspond to cluster ensembles of sizes l = 57, 570, 1083 and 1596.

Running time comparison

As far as the running times of the entirely serial implementation of RHCA and DHCA and of flat consensus are concerned, the boxplots depicted in figure C.56 show that flat consensus constitutes the most computationally competitive consensus architecture in most cases —in fact, the only exceptions occur when the HGPA and MCLA consensus functions are employed.

When the parallel implementations of the hierarchical consensus architectures are considered, they become more competitive (in computational terms), reversing the situation observed in the serial case for the CSPA and KMSAD consensus functions —see figure C.57.

Consensus quality comparison

As regards the quality of the consensus clustering solutions yielded by the three consensus architectures (measured as the φ (NMI) with respect to the ground truth that defines the true group structure of the BBC data collection), we can observe great differences between the performance of the distinct consensus functions –see figure C.58–: while the MCLA, ALSAD and KMSAD clustering combiners tend to yield consensus clusterings of quality comparable to the best components of the cluster ensemble, the clustering solutions output by consensus architectures based on EAC, HGPA and SLSAD are notably poorer.

C.4.11 PenDigits data set

This section presents the execution times of the computationally optimal RHCA, DHCA and flat consensus architectures and the φ (NMI) values of the consensus clustering solutions yielded by them on the PenDigits data collection. The presented results correspond to experiments conducted across four diversity scenarios, whose cluster ensemble sizes are l = 57, 570, 1083 and 1596, respectively. Due to the number of objects contained in this data set, only the HGPA and MCLA consensus functions are executable on it, as they are the only ones the space complexity of which scales linearly with this


[Figure C.55: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.46.]


[Figure C.56 here. Panels (a)-(d): serial implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, one subplot per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]

Figure C.56: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.



[Figure C.57 here. Panels (a)-(d): parallel implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, one subplot per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]

Figure C.57: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.



[Figure C.58 here. Panels (a)-(d): φ (NMI) of the consensus solutions for |dfA| = 1, 10, 19 and 28, comparing the cluster ensemble E, RHCA, DHCA and flat consensus for each consensus function.]

Figure C.58: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.




attribute. Moreover, flat consensus is unfeasible even when based on the aforementioned consensus functions.

Running time comparison

Figure C.59 shows the running times corresponding to the serial implementation of RHCA and DHCA. It can be observed that, as the cluster ensemble size increases, RHCA becomes faster than DHCA. This is because the ensemble growth is driven by a larger cardinality of the algorithmic diversity factor |dfA|, which directly increases the time complexity of a single DHCA stage, whereas its impact is spread across the several stages of a RHCA.

Approximately the same behaviour is observed when the parallel implementation of both hierarchical consensus architectures is analyzed (see figure C.60).

Consensus quality comparison

Figure C.61 presents the φ (NMI) corresponding to the two aforementioned consensus architectures. There are a couple of issues worth noting in this case. Firstly, notice that HGPA yields very poor consensus clustering solutions on this data collection (recall that this fact has also been reported on several of the previous data sets). Secondly, it is important to highlight the notable differences between the quality of the consensus clusterings output by RHCA and DHCA, as the latter consensus architecture tends to yield far better consensus clustering solutions than the former, a trend that becomes more evident in high diversity scenarios.



[Figure C.59 here. Panels (a)-(d): serial implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, for the HGPA and MCLA consensus functions.]

Figure C.59: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Figure C.60 here. Panels (a)-(d): parallel implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, for the HGPA and MCLA consensus functions.]

Figure C.60: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Figure C.61 here. Panels (a)-(d): φ (NMI) of the consensus solutions for |dfA| = 1, 10, 19 and 28, comparing the cluster ensemble E, RHCA, DHCA and flat consensus for the HGPA and MCLA consensus functions.]

Figure C.61: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


Appendix D

Experiments on self-refining consensus architectures

This appendix presents several experiments regarding the self-refining flat and hierarchical consensus architectures described in chapter 4. In appendix D.1, the proposal for automatically refining a previously derived consensus clustering solution (what is called consensus-based self-refining) is experimentally evaluated, whereas appendix D.2 presents the experiments regarding the creation of a refined consensus clustering solution upon the selection of a high quality cluster ensemble component, i.e. selection-based self-refining.

In both cases, the experiments are conducted on eleven unimodal data collections, namely: Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC and PenDigits. The results of the self-refining experiments are displayed by means of boxplot charts showing the normalized mutual information (φ (NMI) ) with respect to the ground truth of each data set, compiled across 100 independent experiment runs, of i) the cluster ensemble E each experiment is conducted upon, ii) the clustering solution employed as the reference of the self-refining procedure, and iii) the self-refined consensus clustering solutions λc_pi obtained upon the select cluster ensembles E_pi created by the selection of a percentage pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90} of the whole ensemble E. As in all the experimental sections of this thesis, consensus processes have been replicated using the set of seven consensus functions described in appendix A.5, namely: CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD.
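For reference, the fragment below shows how the per-run φ (NMI) values that feed a single boxplot entry could be computed. It is a minimal sketch assuming Python and scikit-learn; normalized_mutual_info_score applies an arithmetic normalization that may differ from the NMI variant used in this thesis, and all data in the example are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def phi_nmi(labels_pred, labels_true):
    """phi(NMI) of a clustering with respect to the ground truth labels."""
    return normalized_mutual_info_score(labels_true, labels_pred)

# Toy illustration of compiling the 100 per-run scores behind one boxplot.
rng = np.random.default_rng(0)
ground_truth = rng.integers(0, 3, size=150)                  # synthetic 3-class labelling
runs = [rng.integers(0, 3, size=150) for _ in range(100)]    # 100 independent clusterings
scores = np.array([phi_nmi(labels, ground_truth) for labels in runs])
print(scores.min(), np.median(scores), scores.max())         # summary of the boxplot spread
```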

D.1 Experiments on consensus-based self-refining

In this section, the results of applying the consensus-based self-refining procedure described in section 4.1 on the aforementioned eleven data sets are presented. The self-refining process is intended to improve the quality of a consensus clustering solution λc output by a flat, a random (RHCA) and a deterministic hierarchical consensus architecture (DHCA).
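The fragment below sketches the overall flow of one such refining run. It is only an illustration under the assumption that each select cluster ensemble E_pi groups the pi% of ensemble components that are most similar, in NMI terms, to the reference consensus λc; the exact selection criterion is the one defined in section 4.1, and consensus_fn stands for any of the consensus functions of appendix A.5.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def consensus_based_self_refining(ensemble, lambda_c, consensus_fn,
                                  percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90)):
    """Build the select ensembles E_pi and derive one refined consensus per percentage."""
    # Rank the ensemble components by their similarity to the reference consensus.
    similarities = np.array([nmi(lambda_c, component) for component in ensemble])
    order = np.argsort(similarities)[::-1]        # most similar components first
    refined = {}
    for p in percentages:
        n_selected = max(1, int(round(len(ensemble) * p / 100.0)))
        selected = [ensemble[i] for i in order[:n_selected]]
        refined[p] = consensus_fn(selected)       # lambda_c_pi for this percentage
    return refined
```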

For this reason, the results are displayed as a matrix of boxplot charts with three columns (the leftmost one for flat consensus, the central one presenting the results of RHCA, and DHCA on the right), and as many rows as consensus functions are employed on each particular data collection (seven in all cases except for the PenDigits data set, where only two consensus functions are applicable due to memory limitations given our computational resources; see appendix A.6).
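As a purely illustrative aid, the snippet below shows one way such a matrix of boxplot charts could be assembled with matplotlib; results, consensus_functions and candidate_labels are hypothetical names introduced for this example only.

```python
import matplotlib.pyplot as plt

def plot_boxplot_matrix(results, consensus_functions, candidate_labels):
    """results[f][a] holds, for consensus function f and architecture a, one array of
    per-run phi(NMI) values per candidate (E, lambda_c, lambda_c_p2, ...)."""
    architectures = ("flat", "RHCA", "DHCA")
    fig, axes = plt.subplots(len(consensus_functions), len(architectures),
                             figsize=(12, 2.0 * len(consensus_functions)), squeeze=False)
    for i, function in enumerate(consensus_functions):
        for j, architecture in enumerate(architectures):
            ax = axes[i][j]
            ax.boxplot(results[function][architecture], labels=candidate_labels)
            ax.set_ylim(0.0, 1.0)                  # phi(NMI) lies in [0, 1]
            ax.set_title(f"{function} - {architecture}")
            if j == 0:
                ax.set_ylabel("phi (NMI)")
    fig.tight_layout()
    return fig
```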

Moreover, the clustering solution deemed optimal by the supraconsensus function described in section 4.1 in a majority of experiment runs is highlighted by means of a vertical green dashed line, so that its performance can be qualitatively evaluated at a glance.
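For illustration, the snippet below implements one plausible supraconsensus criterion, namely selecting the candidate clustering that maximizes the average NMI with respect to the ensemble components (an ANMI-style rule in the spirit of Strehl and Ghosh); whether this coincides with the exact function of section 4.1 is an assumption of the example.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def supraconsensus(candidates, ensemble):
    """Return the candidate clustering with the highest average NMI
    against all the components of the cluster ensemble."""
    def average_nmi(labels):
        return float(np.mean([nmi(labels, component) for component in ensemble]))
    scores = [average_nmi(candidate) for candidate in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores
```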

D.1.1 Iris data set

Figure D.1 presents the results of the self-refining consensus procedure applied on the Iris data set. As regards the results obtained using the CSPA consensus function, we can see that self-refining introduces no variations with respect to the quality of the non-refined consensus clustering solution λc in the case of the flat and RHCA consensus architectures. In contrast, slight but noticeable φ (NMI) gains are obtained in the case of DHCA with the refined clustering solutions λc_20 and λc_40. Unfortunately, the supraconsensus function fails to select one of the highest quality clustering solutions in this case. A very similar situation is observed in the self-refining experiments based on the EAC, ALSAD and SLSAD consensus functions.

Examples in which self-refining and supraconsensus perform successfully are the ones regarding both hierarchical consensus architectures using the HGPA consensus function. In these cases, self-refined consensus clustering solutions of higher quality than that of λc are obtained and selected by the supraconsensus function. In contrast, in the experiments based on MCLA, little (if any) φ (NMI) gain is obtained via refining, and supraconsensus tends to select a clustering solution of slightly lower quality than λc.

D.1.2 Wine data set

The results corresponding to the application of the consensus-based self-refining procedure on the Wine data set are depicted in figure D.2. As regards the refining of the consensus solution output by the flat consensus architecture (leftmost column of figure D.2), we can see that quality improvements with respect to the initial consensus clustering λc are obtained in all cases, except when the HGPA consensus function is employed. In some cases, supraconsensus manages to select the highest quality clustering, such as when consensus is based on MCLA and SLSAD, whereas suboptimal solutions are selected in other cases (see, for instance, the CSPA, EAC, HGPA and ALSAD boxplots).

We would like to highlight the especially good results obtained on the self-refining of the consensus output by the DHCA architecture (see the rightmost column of figure D.2). Regardless of the consensus function employed, the self-refining procedure gives rise to higher quality clustering solutions, and the supraconsensus function selects the top quality one in most cases.

D.1.3 Glass data set

Figure D.3 presents the results of the consensus self-refining process when applied on the Glass data collection. In this case, self-refining yields little φ (NMI) gain for most consensus functions. The clearest exception is EAC, where notable quality increases are observed, especially when self-refining is applied on the consensus solutions output by the hierarchical consensus architectures (RHCA and DHCA).



[Figure D.1 here. Panels: (a) flat, (b) RHCA, (c) DHCA; one boxplot row per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), with the candidates E, λc and λc_2 to λc_90 on the horizontal axis and φ (NMI) on the vertical axis.]

Figure D.1: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Iris data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.2 here; same layout as figure D.1.]

Figure D.2: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Wine data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.




As regards the performance of the supraconsensus function, the generally small quality differences between the non-refined and the self-refined consensus clustering solutions make its lack of precision relatively unimportant in most cases. Again, the only exceptions to this behaviour occur in the refining of the consensus clustering solutions output by RHCA and DHCA when EAC is employed. In these cases, the supraconsensus function erroneously selects the non-refined consensus clustering solution λc as the final clustering solution, although higher quality self-refined partitions are available.

D.1.4 Ionosphere data set

The application of the consensus-based self-refining procedure on the Ionosphere data collection yields the φ (NMI) boxplots presented in figure D.4. On this collection, self-refining introduces quality gains in a few cases, such as the refining of the consensus clusterings output by i) RHCA and DHCA using HGPA, or ii) the flat consensus architecture and RHCA based on the SLSAD consensus function. In the remaining cases, the self-refining procedure brings about little (if any) quality gain.

As regards the selection accuracy of the supraconsensus function, it consistently selects a good quality clustering solution, if not the highest quality one.

D.1.5 WDBC data set

Figure D.5 presents the φ (NMI) boxplots corresponding to the application of the consensus-based self-refining procedure on the WDBC data set.

Fairly distinct results are obtained depending on the consensus function employed. For instance, when consensus is based on EAC and SLSAD, self-refining brings about no improvement. In contrast, spectacular quality gains are obtained on the hierarchically derived consensus clusterings that employ HGPA. In the remaining cases, more modest φ (NMI) increases are observed.

Last, notice that the supraconsensus function performs fairly accurately, as it selects good quality clustering solutions in most cases, although it rarely chooses the top quality one.

D.1.6 Balance data set

The application of the self-refining consensus procedure on the Balance data set yields the results summarized by the boxplots presented in figure D.6. It can be observed that, for most consensus functions and consensus architectures, the self-refined consensus clusterings show higher φ (NMI) values than those of their non-refined counterpart, λc. In some cases, these quality gains are notable, as, for instance, when consensus self-refining is based on the SLSAD consensus function on the flat consensus architecture (bottom row of figure D.6). In other cases, as in MCLA-based self-refining, the achieved φ (NMI) increases are more modest.

As regards the ability of the supraconsensus function to select the top quality (non-refined or refined) clustering solution, it can be observed that this rarely occurs, which explains the low selection accuracy percentage reported in section 4.2.2.



[Figure D.3 here; same layout as figure D.1.]

Figure D.3: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Glass data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.4 here; same layout as figure D.1.]

Figure D.4: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Ionosphere data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.5 here; same layout as figure D.1.]

Figure D.5: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the WDBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.6 here; same layout as figure D.1.]

Figure D.6: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Balance data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



D.1. Experiments on consensus-based self-refining<br />

D.1.7 MFeat data set

Figure D.7 presents the boxplots of the clusterings resulting from running the consensus self-refining procedure on the MFeat data collection. In this case, fairly varied behaviours are observed. For instance, when a high quality consensus clustering solution λc is available prior to self-refining, none of the refined consensus clusterings achieves a higher φ (NMI) —see, for instance, the boxplots corresponding to the CSPA, ALSAD and KMSAD consensus functions. In contrast, in cases in which λc has a low φ (NMI), self-refining sometimes brings about notable quality gains, such as the ones observed in the EAC or SLSAD based flat and RHCA consensus architectures. However, the supraconsensus function tends to select the non-refined clustering solution as the final partition of the process in a majority of cases.

D.1.8 miniNG data set

The boxplot charts depicted in figure D.8 summarize the performance of the consensus-based self-refining procedure when applied on the miniNG data set. It is interesting to note that, except when self-refining is based on the EAC consensus function, important quality gains are obtained —in most cases, there exists at least one self-refined consensus clustering with higher φ (NMI) than the non-refined clustering λc. Notice the large quality gains obtained when self-refining is based on MCLA, as we move from a very low φ (NMI) non-refined consensus clustering solution λc to self-refined clusterings that are comparable to the highest quality components in the cluster ensemble E. However, when self-refining is based on EAC, little (if any) φ (NMI) improvement is introduced by the self-refining procedure. Last, notice how, in most cases, the supraconsensus function selects high quality clustering solutions as the final partition.

D.1.9 Segmentation data set

Figure D.9 presents the boxplots of the non-refined and self-refined consensus clustering solutions obtained by the consensus-based self-refining procedure applied on the Segmentation data collection. Notice that, thanks to the proposed refining process, at least one self-refined clustering solution of higher quality than that of the non-refined consensus clustering λc is obtained in most cases. In fact, the only exceptions occur in the refinement of the λc output by the flat and DHCA consensus architectures based on the EAC consensus function.

As regards the performance of the supraconsensus function, we can see that it casts a shadow over the good results of the self-refining process just reported, as it rarely picks the highest quality consensus clustering solution —although it usually selects one of the higher quality ones.
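The supraconsensus function must make this choice blindly, that is, without access to the ground truth labels behind φ (NMI). The sketch below illustrates one such blind selection rule; it is an illustrative assumption (ranking candidates by their average NMI against the cluster ensemble components, computed with scikit-learn), not a verbatim transcription of the implementation used in this thesis.

```python
# Hypothetical sketch of a blind (label-free) supraconsensus selection:
# among candidate consensus partitions, keep the one whose average NMI
# with respect to the cluster ensemble components is highest.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def supraconsensus_select(candidates, ensemble):
    """candidates, ensemble: lists of 1-D integer label arrays."""
    def avg_nmi(partition):
        return np.mean([nmi(partition, component) for component in ensemble])
    scores = [avg_nmi(candidate) for candidate in candidates]
    return candidates[int(np.argmax(scores))]
```

Because such a ranking measures agreement with the (possibly noisy) ensemble rather than φ (NMI) itself, a candidate of lower true quality can be preferred, which is consistent with the behaviour reported above.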

D.1.10 BBC data set

The qualities of the clusterings resulting from applying the consensus-based self-refining procedure on the BBC data set are presented in the boxplots of figure D.10. Notice that, although the quality of the non-refined consensus clustering λc is highly dependent on the consensus function employed (from the high φ (NMI) values in CSPA, MCLA, ALSAD



[Figure D.7 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.7: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the MFeat data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


[Figure D.8 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.8: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the miniNG data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.9 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.9: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the Segmentation data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


or KMSAD to the poorer qualities for EAC, HGPA or SLSAD), the self-refining process manages to yield better clusterings in most cases, although the observed φ (NMI) increases are, in general, modest.

Notice that the supraconsensus function is reasonably successful in blindly selecting the top quality consensus clustering solution.

D.1.11 PenDigits data set

Figure D.11 depicts the φ (NMI) values of the non-refined and self-refined consensus clusterings resulting from the application of the consensus-based self-refining procedure on the PenDigits data set. Remember that, on this collection, only the HGPA and MCLA consensus functions are applicable using the hierarchical consensus architectures. Whereas the quality of the clusterings obtained using HGPA is dramatically poor, the results obtained with MCLA are quite encouraging. The large φ (NMI) gain observed when refining the consensus clustering λc output by RHCA is noteworthy. Moreover, notice that supraconsensus correctly selects the highest quality clustering solution in this case.



[Figure D.10 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.10: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the BBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


[Figure D.11 plot area, panels (a) RHCA and (b) DHCA: φ (NMI) boxplot rows for the HGPA and MCLA consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.11: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the RHCA and DHCA consensus architectures on the PenDigits data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2 Experiments on selection-based self-refining

This section presents the results of the clustering self-refining procedure based on the selection of a cluster ensemble component λref by means of an average normalized mutual information (φ (ANMI)) criterion —see section 4.3.

The results are presented in a very similar fashion to that of the previous section, that is, by means of boxplot charts displaying the φ (NMI) of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus solutions λc pi, with pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}. Moreover, the clustering solution designated as optimal by the supraconsensus function is highlighted by a vertical green dashed line, which provides a simple and fast means for evaluating its performance qualitatively.
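As a concrete illustration, the sketch below outlines how the reference component λref and the percentage-indexed select cluster ensembles behind the λc pi solutions can be obtained. The exact construction (keeping, for each percentage p, the components most similar to the reference partition) is stated here as an illustrative assumption rather than a verbatim transcription of section 4.3, and scikit-learn is used for the pairwise NMI computations.

```python
# Hedged sketch of selection-based self-refining (illustrative assumptions,
# not a verbatim transcription of the procedure of section 4.3):
#  1. lambda_ref = ensemble component with maximum average NMI (phi_ANMI),
#  2. for each percentage p, keep the p% of components closest to lambda_ref;
#     a consensus function is then run on each select cluster ensemble.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

PERCENTAGES = [2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90]

def select_reference(ensemble):
    """ensemble: list of 1-D integer label arrays, one per clustering."""
    anmi = [np.mean([nmi(lam, other) for other in ensemble]) for lam in ensemble]
    return ensemble[int(np.argmax(anmi))]

def select_cluster_ensembles(ensemble, lam_ref, percentages=PERCENTAGES):
    # rank the components by decreasing similarity (NMI) to the reference
    order = np.argsort([-nmi(lam_ref, lam) for lam in ensemble])
    for p in percentages:
        size = max(1, int(round(len(ensemble) * p / 100.0)))
        yield p, [ensemble[i] for i in order[:size]]
```

Each select cluster ensemble would then be fed to one of the seven consensus functions (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD or SLSAD) to obtain the corresponding self-refined solution λc p.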

D.2.1 Iris data set

Figure D.12 presents the results of the selection-based self-refining procedure applied on the Iris data set. It can be observed that the selected cluster ensemble component λref is of notable quality —i.e. well above the median partition in the cluster ensemble E. Notice how the self-refining process brings about relevant φ (NMI) gains depending on the consensus function employed. This is the case of the CSPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions. However, the supraconsensus function selects λref as the optimal partition, thus ignoring the improvements introduced by the self-refining process in the aforementioned cases. This again highlights the need for well performing supraconsensus functions.



[Figure D.12 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.12: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Iris data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2.2 Wine data set

The results obtained by the application of the selection-based self-refining procedure on the Wine data collection are presented in figure D.13. Notice that the cluster ensemble component selected by means of the average normalized mutual information criterion, λref, is nearly the top quality partition contained in the cluster ensemble E. In this case, the creation of select cluster ensembles brings about no quality gains, regardless of the consensus function employed.

As regards the performance of the supraconsensus function, it selects λref as the final clustering solution in a majority of cases, choosing a noticeably worse partition in those experiments based on the KMSAD and CSPA consensus functions.

D.2.3 Glass data set

Figure D.14 presents the φ (NMI) boxplots corresponding to the selection-based self-refining procedure applied on the Glass data set. Notice how the notably high quality clustering


[Figure D.13 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.13: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Wine data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

solution λref is hardly surpassed by any of the refined consensus clustering solutions —in fact, this only happens when self-refining is based on the EAC and SLSAD consensus functions. In most cases, supraconsensus selects the cluster ensemble component λref as the final clustering solution.

D.2.4 Ionosphere data set

The application of the selection-based self-refining process on the Ionosphere data collection gives rise to modest quality increases, as depicted in figure D.15. Notice that clustering solutions of higher φ (NMI) than λref are only obtained when the self-refining procedure is based on the CSPA and HGPA consensus functions.

Furthermore, notice that the supraconsensus function selects, in most cases, the highest quality clustering solutions —unfortunately, the only exceptions occur when self-refining is based on CSPA and HGPA, i.e. the cases in which self-refining introduces some significant quality gains.



[Figure D.14 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.14: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Glass data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2.5 WDBC data set

Figure D.16 presents the φ (NMI) boxplots corresponding to the selection-based self-refining process applied on the WDBC data set. Firstly, notice that the cluster ensemble component selected by means of the φ (ANMI) criterion, λref, is pretty close to the highest quality partition contained in the cluster ensemble. Secondly, the effect of self-refining is highly dependent on the consensus function employed. For instance, no quality gains are achieved when CSPA, EAC or HGPA are used. In contrast, φ (NMI) gains (although modest) are obtained when self-refining is conducted using the MCLA, ALSAD, KMSAD and SLSAD consensus functions. Last, notice that the supraconsensus function selects very high quality clusterings as the final partition.

D.2.6 Balance data set

As regards the performance of the selection-based self-refining process when applied on the Balance data collection, figure D.17 shows that self-refined clustering solutions of higher


[Figure D.15 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.15: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Ionosphere data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

quality than that of the selected cluster ensemble component λref are obtained for most consensus functions —in fact, the clearest exceptions to this behaviour are EAC and HGPA. However, the supraconsensus function is not capable of selecting those better quality clusterings as the final partition in most cases, which again gives an idea of its limited performance.

D.2.7 MFeat data set

Figure D.18 presents the φ (NMI) boxplots of the clusterings obtained after applying the selection-based self-refining process on the MFeat data set. Notice how, for four out of the seven consensus functions (CSPA, MCLA, ALSAD and KMSAD), notable quality gains are obtained (i.e. at least one of the refined clusterings attains a higher φ (NMI) than the selected cluster ensemble component λref). Unfortunately, the supraconsensus function fails to select these high quality partitions as the final clustering solution λc final, as it systematically selects λref as the optimal one, which is only the correct option when self-refining is based on the EAC, HGPA and SLSAD consensus functions.
[Figure D.16 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.16: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the WDBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2.8 miniNG data set

The application of the selection-based self-refining procedure on the miniNG data collection yields the boxplots presented in figure D.19. Notice that the selected cluster ensemble component λref is comparable to the best individual partitions contained in the ensemble E. Despite this, the self-refining process manages to obtain even higher quality consensus clusterings when based on the CSPA, MCLA, ALSAD and KMSAD consensus functions. Unfortunately, the supraconsensus function fails on most occasions to select the maximum φ (NMI) clustering —in fact, it only makes the correct choice when the EAC, HGPA and SLSAD consensus functions are employed.

D.2.9 Segmentation data set

Figure D.20 presents the φ (NMI) boxplots of the selection-based self-refined clustering solutions obtained on the Segmentation data set. Despite the notable quality of the selected cluster ensemble component λref, notice that important quality gains are obtained when
[Figure D.17 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.17: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Balance data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

self-refining is applied, especially when the CSPA, ALSAD and KMSAD consensus functions are employed —more modest improvements are obtained when using MCLA or SLSAD, whereas none is attained when self-refining is based on EAC and HGPA.

As regards the ability of the supraconsensus function to select the top quality clustering solution as the final one, it only succeeds when consensus is based on EAC, HGPA and MCLA. However, the φ (NMI) losses caused by suboptimal supraconsensus selection are, in general, moderate.

D.2.10 BBC data set

The application of the selection-based self-refining procedure on the BBC data set yields the φ (NMI) boxplots depicted in figure D.21. Notice that, in this data collection, the cluster ensemble component λref selected via average φ (NMI) is very close (if not equal) to the maximum quality individual partition contained in the cluster ensemble E. Starting from this high quality reference point, the self-refining procedure manages to yield slightly better clusterings when it is based on the MCLA, ALSAD and KMSAD consensus functions.
[Figure D.18 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.18: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the MFeat data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

Moreover, notice that, regardless of the consensus function employed, the supraconsensus function tends to select pretty high quality clustering solutions as the final ones.

D.2.11 PenDigits data set

Figure D.22 presents the φ (NMI) boxplots corresponding to the application of the selection-based clustering self-refining procedure on the PenDigits data set. Recall that, due to its size, only the HGPA and MCLA consensus functions are executable on this data collection. As regards the results obtained, notice that the selected cluster ensemble component λref has a notably high quality. However, the results obtained when it is self-refined differ dramatically depending on the consensus function applied. In the case of HGPA, self-refining brings about no quality gains, and the supraconsensus function correctly selects λref as the final clustering solution. In contrast, refined clusterings yielded by MCLA are capable of achieving slightly higher φ (NMI) values than the selected cluster ensemble component λref. However, supraconsensus conducts a suboptimal selection, as it does not choose the maximum quality refined clustering as the final partition.
[Figure D.19 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.19: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the miniNG data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.
[Figure D.20 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.20: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Segmentation data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.
[Figure D.21 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.21: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the BBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

Figure D.22: φ(NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc^pi on the PenDigits data collection across all the consensus functions employed (panels: (a) HGPA, (b) MCLA). The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


Appendix E

Experiments on multimodal consensus clustering

This appendix presents several experiments regarding the multimodal self-refining consensus architectures described in chapter 5, applied to the CAL500, InternetAds and Corel data collections. Due to space limitations, the experiments described correspond to the application of the proposed methodology on cluster ensembles derived from four of the twenty-eight clustering algorithms employed in this thesis, namely agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2.

For each one of the data sets, two facets of the experiments are presented separately. Firstly, the consensus clusterings obtained on each modality and across modalities are qualitatively evaluated. To do so, a set of boxplot charts is presented displaying the φ(NMI) values of the components of the corresponding cluster ensemble E, and of the unimodal, multimodal and intermodal consensus clusterings obtained by the seven consensus functions employed in this work across 10 independent runs.

Secondly, the quality of the self-refined consensus clustering solutions output by the proposed consensus self-refining procedure is evaluated with the help of boxplot diagrams displaying the φ(NMI) values of the corresponding cluster ensembles, of the non-refined consensus clustering λc and of the self-refined consensus clustering solutions λc^pi. As regards the latter, a set of refined clusterings is obtained using a range of percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75} of the whole ensemble E. The performance of the φ(ANMI)-based supraconsensus function for selecting one of the λc^pi is also qualitatively evaluated.
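By way of illustration, the self-refining loop and the φ(ANMI)-based supraconsensus selection just outlined could be wired together as in the following minimal sketch. The consensus function is a mere placeholder for any of the seven combiners used here, and ranking the ensemble components by their NMI with respect to the non-refined consensus λc is an assumption of this sketch rather than a statement of the exact selection criterion employed in this thesis.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def anmi(candidate, ensemble):
    # average NMI of a candidate clustering against every ensemble component
    return np.mean([nmi(candidate, comp) for comp in ensemble])

def self_refine(ensemble, consensus,
                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75)):
    lambda_c = consensus(ensemble)            # non-refined consensus clustering
    # assumed criterion: rank components by their NMI against lambda_c
    ranked = sorted(ensemble, key=lambda comp: nmi(lambda_c, comp), reverse=True)
    candidates = {0: lambda_c}                # key 0 stands for the non-refined solution
    for p in percentages:
        top = ranked[:max(1, round(len(ranked) * p / 100))]
        candidates[p] = consensus(top)        # self-refined consensus, one per percentage
    # phi(ANMI)-based supraconsensus: keep the candidate closest to the whole ensemble
    best = max(candidates, key=lambda p: anmi(candidates[p], ensemble))
    return candidates[best], best

In the experiments reported below, consensus would stand for CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD or SLSAD, each applied over 10 independent runs.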

E.1 CAL500 data set

In this section, the results of the multimodal consensus clustering experiments conducted on the CAL500 data collection are described. The modalities contained in this data set are audio and text (see appendix A.2.2 for a description).


Figure E.1: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).

E.1.1 Consensus clustering per modality and across modalities

For starters, the quality of the consensus clustering solutions obtained on i) the two original modalities, ii) the fused audio+text multimodal modality, and iii) across the previous three modalities is evaluated. Figure E.1 presents the results obtained after the application of the proposed multimodal consensus architecture on the cluster ensemble resulting from the compilation of the partitions output by the agglo-cos-upgma clustering algorithm. It can be observed that the quality of the clusterings corresponding to the audio modality is notably better than that obtained on the text modality (except when the EAC consensus function is employed). The early fusion of the auditory and textual features does not introduce any beneficial effect, rather the contrary. The quality of the intermodal consensus clusterings λc, which correspond to the combination of the three modalities, is approximately a trade-off between them.

Figures E.2, E.3 and E.4 depict, respectively, the results obtained on the cluster ensembles created upon the direct-cos-i2, graph-cos-i2 and rb-cos-i2 CLUTO clustering algorithms. Very similar results to the ones just reported are obtained in all cases: the consensus clusterings based on the audio modality attain higher quality than those based on the remaining modalities, while the multimodal and intermodal consensus clustering solutions constitute a trade-off between modalities.
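For readers who prefer pseudocode, the experimental design compared in figures E.1 to E.4 can be summarised as below. This is only an illustrative sketch under our own assumptions (in particular, treating the intermodal consensus as a consensus over the pooled ensembles of the three modalities); it is not the thesis implementation, and consensus is again a placeholder for any of the seven consensus functions.

def multimodal_consensus_experiment(audio_ensemble, text_ensemble, fused_ensemble, consensus):
    # unimodal consensus clusterings, one per original modality (panels (a) and (b))
    audio_consensus = consensus(audio_ensemble)
    text_consensus = consensus(text_ensemble)
    # multimodal consensus on the ensemble built from early-fused audio+text features (panel (c))
    multimodal_consensus = consensus(fused_ensemble)
    # intermodal consensus across the three modalities (panel (d)); pooling the ensembles
    # (assumed to be plain lists of partitions) is an assumption made for this sketch
    intermodal_consensus = consensus(audio_ensemble + text_ensemble + fused_ensemble)
    return audio_consensus, text_consensus, multimodal_consensus, intermodal_consensus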



Figure E.2: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).

Figure E.3: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).

Figure E.4: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).


Figure E.5: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

E.1.2 Self-refined consensus clustering across modalities

In this section, the results of running the self-refining procedure on the intermodal consensus clustering solution λc are evaluated. Firstly, the results of the process applied on the cluster ensemble created by the compilation of the clusterings output by the agglo-cos-upgma clustering algorithm are presented in figure E.5. On each one of the seven boxplot charts displayed (one per consensus function), the clustering solution selected by the supraconsensus function, λc^final, is highlighted by a green dashed vertical line. Notice that in all cases there exists at least one self-refined consensus clustering λc^pi that attains a higher φ(NMI) than the non-refined consensus clustering solution λc. However, the supraconsensus function mostly fails to select the top quality clustering as the final partition of the data; in fact, it only does so in the experiments based on the CSPA, ALSAD and KMSAD consensus functions. This situation clearly illustrates both the advantages of the proposed self-refining procedure and the shortcomings of the φ(ANMI)-based supraconsensus function.

Figures E.6 to E.8 present the self-refining results obtained on the cluster ensembles constructed upon the clusterings generated by the direct-cos-i2, graph-cos-i2 and rb-cos-i2 clustering algorithms. Notice that, in most cases, the self-refining procedure yields at least one clustering of higher quality than the non-refined consensus clustering solution. Exceptions to this behaviour occur, for instance, when the KMSAD and EAC consensus functions are employed for clustering combination on the direct-cos-i2 and graph-cos-i2 cluster ensembles (see figures E.6(f) and E.7(b), respectively). Again, the supraconsensus function performs with modest accuracy, managing to select the top quality clustering solution in some cases (see figure E.8(b)) and failing clamorously in others (as in figure E.7(g)).

Figure E.6: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.7: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).



Figure E.8: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

E.2 InternetAds data set

In the following paragraphs, the results corresponding to the execution of self-refining multimodal consensus clustering on the InternetAds collection are presented. In this case, the modalities are object (i.e. image) attributes and collateral image attributes (see appendix A.2.2 for a description).

E.2.1 Consensus clustering per modality and across modalities

In this section, the quality of the unimodal, multimodal and intermodal consensus clusterings obtained on the cluster ensembles generated upon the agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2 clustering algorithms is evaluated.

Firstly, the results corresponding to the agglo-cos-upgma cluster ensemble are depicted in figure E.9. The first thing to notice is the extremely low quality of the cluster ensemble components, regardless of the modality. This fact conditions the consensus clustering results obtained, which are also of very low quality. Moreover, in contrast to what has been observed in the rest of the experiments, there is very little difference among the performances of the distinct consensus functions employed.

A very similar behaviour is observed in the experiments conducted on the direct-cos-i2 and rb-cos-i2 cluster ensembles (figures E.10 and E.12). However, notably different results are obtained on the graph-cos-i2 cluster ensemble (see figure E.11). In this case, the execution of consensus clustering on the collateral and the multimodal modalities yields better results.

Figure E.9: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the InternetAds data set (panels: (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral), (d) Intermodal).

Figure E.10: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the InternetAds data set (panels: (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral), (d) Intermodal).

E.2.2 Self-refined consensus clustering across modalities

The results of the application of the self-refining procedure on the intermodal consensus clustering λc are presented next. Again, the consensus clustering selected by the supraconsensus function, λc^final, is highlighted by a green vertical dashed line.

Figures E.13, E.14 and E.16, which show the results corresponding to the agglo-cos-upgma, direct-cos-i2 and rb-cos-i2 cluster ensembles, reveal that little is achieved by self-refining in these cases. In contrast, the usual growing φ(NMI) patterns induced by self-refining are observed in figure E.15, especially when the MCLA, ALSAD and KMSAD consensus functions are employed (see figures E.15(d), E.15(e) and E.15(f)). Unfortunately, the supraconsensus function mostly fails to select the top quality clustering in these cases.
cases.<br />

365


E.3. Corel data set<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

object<br />

λ graph−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(a) Modality 1<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

collateral<br />

λ graph−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(b) Modality 2<br />

φ (NMI)<br />

object+collateral<br />

λ graph−cos−i2<br />

c<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(c) Multimodal<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

λ c graph−cos−i2<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(d) Intermodal<br />

Figure E.11: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the InternetAds data set.<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

object<br />

λ rb−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(a) Modality 1<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

collateral<br />

λ rb−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(b) Modality 2<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

object+collateral<br />

λ rb−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(c) Multimodal<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

λ c rb−cos−i2<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(d) Intermodal<br />

Figure E.12: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the InternetAds data set.<br />

E.3 Corel data set

This section is devoted to the presentation of the results of the multimodal consensus clustering experiments executed on the Corel data collection. In this data set, the modalities are image and text features.

E.3.1 Consensus clustering per modality and across modalities

For starters, figure E.17 depicts the unimodal, multimodal and intermodal consensus clusterings obtained on the agglo-cos-upgma cluster ensemble. Notice the notable differences between the two modalities, as clustering this collection using the textual features of the objects leads to better partitions than those obtained on the image modality. The multimodal modality resulting from the early fusion of textual and visual features yields clusterings whose quality is equal to or slightly lower than that of the textual ones. Thus, in this case, multimodality brings about no gains as regards obtaining higher quality partitions. Last, the intermodal consensus clustering λc attains φ(NMI) values comparable to those of the best text-based partitions when the CSPA, ALSAD and KMSAD consensus functions are employed, while it constitutes a trade-off between modalities when the remaining clustering combiners are used.

A very similar performance is observed when the consensus process is applied on the direct-cos-i2, graph-cos-i2 and rb-cos-i2 cluster ensembles, as figures E.18, E.19 and E.20 reveal.


Figure E.13: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.14: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).



Figure E.15: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.16: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).



Figure E.17: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).

Figure E.18: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).

Figure E.19: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).

E.3.2 Self-refined consensus clustering across modalities

In the following paragraphs, the results of applying the consensus self-refining procedure on the intermodal consensus clustering λc are qualitatively described.

For starters, the φ(NMI) values of the non-refined and self-refined consensus clusterings obtained on the agglo-cos-upgma cluster ensemble are presented in figure E.21. We can observe that, in all cases, there exists at least one refined consensus clustering that attains a higher φ(NMI) value than the non-refined clustering λc. Moreover, in this case, the supraconsensus function is quite successful in selecting the top quality consensus clustering as the final partition λc^final, which is again highlighted by means of a green dashed vertical line.

The performance of the self-refining procedure is equally satisfying when conducted on the cluster ensembles created by means of the three remaining clustering algorithms, as figures E.22, E.23 and E.24 depict. However, the selection accuracy of the supraconsensus function is somewhat inconsistent, as already observed on the previous data sets.


Figure E.20: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).


Figure E.21: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).


Figure E.22: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.23: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).


Figure E.24: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).
c


Appendix F

Experiments on soft consensus clustering

This appendix presents the results of the consensus clustering experiments on soft cluster ensembles. The main purpose of these experiments is to compare the four voting-based consensus functions put forward in chapter 6, namely BordaConsensus (BC), CondorcetConsensus (CC), ProductConsensus (PC) and SumConsensus (SC), with five state-of-the-art clustering combiners: the soft versions of the hypergraph-based hard consensus functions CSPA, HGPA and MCLA (Strehl and Ghosh, 2002), and the evidence accumulation approach (EAC) (Fred and Jain, 2005) (see section 6.2), plus the voting-merging soft consensus function (VMA) of Dimitriadou, Weingessel, and Hornik (2002).

Such a comparison entails two aspects: the quality of the consensus clustering solutions obtained (measured in terms of normalized mutual information, φ(NMI), with respect to the ground truth of each data set), and the time complexity of each consensus function (measured in terms of the CPU time required for its execution; see appendix A.6 for a description of the computational resources employed in this work).
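For reference, under the normalization by the geometric mean of the partition entropies used by Strehl and Ghosh (2002), this quality measure can be written as

\[
\phi^{(\mathrm{NMI})}(\lambda_a, \lambda_b) = \frac{I(\lambda_a, \lambda_b)}{\sqrt{H(\lambda_a)\, H(\lambda_b)}},
\]

where I(λa, λb) denotes the mutual information between the two partitions and H(·) their entropies, so that the measure ranges from 0 (independent partitions) to 1 (identical partitions). The exact definition adopted in this work is the one introduced in an earlier chapter; the expression above is included only as a reminder of the usual form.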

From a formal viewpoint, the results of these experiments are presented by means of a φ(NMI) vs. CPU time diagram, in which the performance of each consensus function is depicted by means of a scatterplot covering the mean ± 2-standard-deviation region of the corresponding magnitudes (i.e. φ(NMI) and CPU time). Moreover, the statistical significance of the results is evaluated by means of Student's t-tests that compare all the consensus functions on a pairwise basis, thus analyzing whether the hypothetical superiority of any of them is sustained on firm statistical grounds, using the traditional 95% confidence level as a reference for distinguishing between significant and non-significant differences.
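The following minimal sketch illustrates one such pairwise test. The score arrays are hypothetical placeholders rather than values taken from the experiments, and scipy's independent two-sample t-test is assumed as the concrete implementation.

from scipy import stats

# hypothetical phi(NMI) scores of two consensus functions over repeated runs
phi_nmi_sc = [0.78, 0.81, 0.79, 0.80, 0.82]    # e.g. SumConsensus (SC)
phi_nmi_cspa = [0.74, 0.76, 0.75, 0.77, 0.73]  # e.g. CSPA

t_stat, p_value = stats.ttest_ind(phi_nmi_sc, phi_nmi_cspa)
# 95% confidence level: the difference is deemed significant when p < 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant = {p_value < 0.05}")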

These soft consensus clustering experiments have been conducted on the twelve unimodal data collections employed in this work (see appendix A.2.1 for a description). The results corresponding to the Zoo data collection are presented in chapter 6, and the following paragraphs describe the results obtained on the eleven remaining data sets.




[Figure F.1: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Iris data collection.]

F.1 Iris data set

This section describes the results of the soft consensus clustering experiments run on the Iris data set. Figure F.1 presents the φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the nine consensus functions compared. Quite obviously, the closer the scatterplot of a consensus function lies to the top left corner of the diagram, the better its performance (i.e. it yields high quality consensus clustering solutions with low time complexity).
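For reference, a minimal sketch of how such mean ± 2-standard-deviation regions can be drawn is given below; the per-run measurements are illustrative placeholders, and the plotting details of the actual figures may differ.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    # Per-run (CPU time, phi(NMI)) measurements; placeholder values only.
    results = {
        "VMA": (np.array([0.05, 0.06, 0.05]), np.array([0.74, 0.75, 0.73])),
        "SC":  (np.array([0.07, 0.06, 0.08]), np.array([0.75, 0.74, 0.76])),
    }

    fig, ax = plt.subplots()
    for name, (cpu, nmi) in results.items():
        cx, cy = cpu.mean(), nmi.mean()
        w, h = 4 * cpu.std(), 4 * nmi.std()   # rectangle spans +/- 2 std on each axis
        ax.add_patch(Rectangle((cx - w / 2, cy - h / 2), w, h, alpha=0.3, label=name))
        ax.annotate(name, (cx, cy))
    ax.set_xlabel("CPU time (sec.)")
    ax.set_ylabel("phi(NMI)")
    ax.set_xlim(0, 0.2)
    ax.set_ylim(0, 1)
    ax.legend()
    plt.show()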

In this case, the proposed SC and PC consensus functions match the performance of VMA, both in terms of time complexity and consensus quality. The performance of the other two proposed consensus functions (BC and CC) is fairly comparable as far as the quality of the consensus clustering solutions is concerned, but their computational complexity is higher. As regards the state-of-the-art consensus functions, CSPA seems to yield slightly better quality results, although its CPU time is more than double that of our proposals, making it the most costly option. For its part, MCLA is competitive from a computational viewpoint, but it yields lower quality consensus clusterings. Last, EAC and HGPA are the worst performing consensus functions.

If the statistical significance of the results is evaluated (see table F.1), it can be observed that the φ(NMI) superiority of CSPA is only apparent, as the differences with respect to BC and CC are not statistically significant, and the quality of the consensus clusterings output by SC and PC is significantly better than that of CSPA. Moreover, SC and PC are statistically equivalent to VMA both in terms of quality and execution time.

F.2 Wine data set

The soft consensus clustering results obtained on the Wine data collection are presented next. Figure F.2 displays the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to the nine consensus functions compared. In general terms, it can be observed that VMA is the fastest alternative, while the best quality consensus clustering solutions are the ones output by two of the proposed consensus functions: BC and CC.

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
EAC    0.0001  ———     0.0014  0.0001  0.0001  0.0001  0.001   0.0001  0.0001
HGPA   0.0001  0.0001  ———     ×       0.0001  ×       ×       0.0001  0.0001
MCLA   0.043   0.0001  0.0001  ———     0.0001  ×       ×       0.0001  0.0001
VMA    0.0146  0.0001  0.0001  0.0001  ———     0.0001  0.0001  ×       ×
BC     ×       0.0001  0.0001  0.0001  0.0377  ———     0.0163  0.0001  0.0001
CC     ×       0.0001  0.0001  0.0001  0.0373  ×       ———     0.0001  0.0001
PC     0.0377  0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     0.0289  0.0001  0.0001  0.0001  ×       ×       ×       ×       ———

Table F.1: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Iris data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

[Figure F.2: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Wine data collection.]

The analysis of the statistical significance of these results reinforces these notions (see table F.2). Indeed, the CPU time differences between VMA and the remaining consensus functions are always statistically significant, with very low p-values (around 0.0001). Moreover, in terms of φ(NMI), BC and CC are significantly better than any of the alternatives. For their part, as already suggested by the diagram of figure F.2, SC and PC are not statistically different from VMA as far as the quality of the consensus clustering solutions is concerned.

F.3 Glass data set

This section describes the results of the quality and time complexity comparison experiments between the nine soft consensus functions employed in this work, when applied to the Glass data set.


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.001   0.0001  ×       0.0001  0.0249  0.0005  0.0001  0.0001
EAC    0.0001  ———     0.0001  ×       0.0001  ×       0.0105  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0004  0.0001  0.0001  0.0001  ×       ×
MCLA   ×       0.0001  0.0001  ———     0.0001  ×       0.0199  0.0013  0.0014
VMA    0.0001  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0002  0.0002
BC     0.0001  0.0001  0.0001  0.0001  0.0006  ———     ×       0.0001  0.0001
CC     0.0001  0.0001  0.0001  0.0001  0.0006  ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0001  ×       0.001   0.001   ———     ×
SC     0.0001  0.0001  0.0001  0.0001  ×       0.0129  0.0129  ×       ———

Table F.2: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Wine data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

[Figure F.3: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Glass data collection.]

As figure F.3 suggests, VMA is again the least time consuming consensus function. As mentioned earlier, this is due to the simultaneity of the cluster disambiguation and voting processes in this consensus function. In contrast, the proposed CC consensus function is by far the slowest, probably due to the exhaustive pairwise cluster confrontation implicit in the Condorcet voting method.
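The quadratic cost of that confrontation can be illustrated with a generic Condorcet tally (not the exact CC implementation), in which every pair of candidate clusters is compared across all voters.

    from itertools import combinations

    def condorcet_winner(rankings, candidates):
        # rankings: list of candidate lists, best first (one per voter).
        wins = {c: 0 for c in candidates}
        for a, b in combinations(candidates, 2):        # O(k^2) pairwise duels
            a_over_b = sum(r.index(a) < r.index(b) for r in rankings)
            b_over_a = len(rankings) - a_over_b
            if a_over_b > b_over_a:
                wins[a] += 1
            elif b_over_a > a_over_b:
                wins[b] += 1
        return max(wins, key=wins.get)                  # candidate winning most duels

    rankings = [["c1", "c2", "c3"], ["c2", "c1", "c3"], ["c1", "c3", "c2"]]
    print(condorcet_winner(rankings, ["c1", "c2", "c3"]))   # -> "c1"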

In terms of quality, the VMA, PC and SC consensus functions are apparently tied, attaining the highest φ(NMI) scores. The CSPA, BC, CC and MCLA consensus functions apparently yield lower quality consensus clustering solutions.

When the statistical significance of these results is analyzed (see table F.3), we see that the apparent time complexity superiority of VMA is statistically significant. As regards the quality of the consensus clustering solutions, it can be observed that the performances of VMA, SC and PC are statistically equivalent, whereas the differences between these and BC and CC are indeed significant.


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0002  0.0001  0.0345  0.0001  ×       0.0001  0.0016  0.0017
EAC    0.0001  ———     0.0001  ×       0.0001  ×       0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0002  0.0001  0.0001  0.0001  ×       ×
MCLA   ×       0.0001  0.0001  ———     0.0001  ×       0.0001  0.002   0.002
VMA    0.0001  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0233  0.0001  0.0001  0.0001  0.0025  ———     0.0001  0.0037  0.0038
CC     0.0247  0.0001  0.0001  0.0001  0.0022  ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0001  ×       0.0092  0.0084  ———     ×
SC     0.0001  0.0001  0.0001  0.0001  ×       0.0064  0.0058  ×       ———

Table F.3: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Glass data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.4 Ionosphere data set

In the following paragraphs, the results of the soft consensus clustering experiments conducted on the Ionosphere data collection are described.

Figure F.4 displays the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to the nine consensus functions compared in this experiment. It can be observed that rather low quality consensus clustering solutions (φ(NMI) < 0.1) are yielded by all clustering combiners. The highest φ(NMI) scores are obtained by CSPA, BC and CC, whose performance is significantly better, in statistical terms, than that of the other six consensus functions (see table F.4 for the statistical significance analysis of the results of this experiment).

As regards time complexity, VMA is the most computationally efficient option, closely followed by HGPA. The proposed PC and SC consensus functions are comparable to CSPA and MCLA in computational terms, while the positional voting based BC and CC consensus functions are, together with EAC, the most time consuming alternatives. The differences between these three groups are statistically significant, as can be inferred from the measurements presented in table F.4.

F.5 WDBC data set

This section describes the results of the soft consensus clustering experiments conducted on the WDBC data set.

The φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the consensus functions are depicted in figure F.5. Once again, VMA is the most computationally efficient consensus function (which, as mentioned earlier, is due to the simultaneity of its cluster disambiguation and voting processes), closely followed by HGPA. However, the confidence voting based consensus functions (PC and SC) are quite close to VMA in CPU time terms, being slightly faster than CSPA and MCLA. As already noticed in the previous experiments, positional voting makes the BC and CC consensus functions more computationally costly (in this case, CC is slightly faster than BC, since the low number of clusters in this data set, k = 2, does not penalize CondorcetConsensus), while EAC is the least efficient option.

[Figure F.4: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Ionosphere data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  ×       0.0001  0.0002  0.0002  ×       ×
EAC    0.0001  ———     0.0001  0.0001  0.0001  ×       ×       0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0008  0.0001  0.0001  0.0001  0.0033  0.0031
MCLA   0.012   0.0004  0.0492  ———     0.0001  0.0009  0.0014  ×       ×
VMA    0.0336  0.0001  0.0001  ×       ———     0.0001  0.0001  0.0001  0.0001
BC     ×       0.0001  0.0001  0.0002  0.0001  ———     ×       0.0003  0.0003
CC     ×       0.0001  0.0001  0.0002  0.0001  ×       ———     0.0004  0.0004
PC     0.0312  0.0001  0.0001  ×       ×       0.0001  0.0001  ———     ×
SC     0.0302  0.0001  0.0001  ×       ×       0.0001  0.0001  ×       ———

Table F.4: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Ionosphere data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.


As far as the quality of the consensus clustering solutions is concerned, PC and SC match VMA as the best performing consensus functions, showing smaller dispersion in φ(NMI) terms than BC, CC and MCLA.

If the statistical significance of the CPU time and φ(NMI) differences between consensus functions is evaluated (see table F.5), it can be observed that PC and SC are, in execution time terms, equivalent to CSPA and MCLA. As regards the quality of the consensus clustering solutions, no significant differences are observed between VMA, BC, CC, PC and SC, which, as noted above, turn out to be the best performing consensus functions.



[Figure F.5: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the WDBC data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  ×       0.0001  0.0001  0.0002  ×       ×
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0026  0.0001  0.0001  0.0001  0.0267  0.0249
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0002  0.0004  ×       ×
VMA    0.0001  0.0001  0.0001  0.0057  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0001  0.0001  0.0001  ×       ×       ———     ×       0.0001  0.0001
CC     0.0001  0.0001  0.0001  ×       ×       ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0025  ×       ×       ×       ———     ×
SC     0.0001  0.0001  0.0001  0.0103  ×       ×       ×       ×       ———

Table F.5: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the WDBC data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.6 Balance data set

In this section, the performance of the soft consensus functions is compared through a set of consensus clustering experiments conducted on the Balance data set.

Figure F.6 depicts the diagram that qualitatively compares the nine consensus functions in terms of the CPU time required for their execution and the φ(NMI) of the consensus clustering solutions they yield.

As regards the former aspect, we can observe that VMA, PC and SC are the most efficient consensus functions, and that the differences between them, though small, are statistically significant according to the results of the paired t-tests presented in table F.6. Moreover, we can also observe that the BC and CC consensus functions achieve a mid-range time complexity, being slower than MCLA and HGPA, but faster than CSPA and EAC.

Last, as far as the quality of the consensus clustering solutions is concerned, there is a high degree of parity between consensus functions. In fact, the differences between the top performing consensus functions (CSPA, VMA, PC and SC) are not statistically significant.

[Figure F.6: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Balance data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0003  ———     0.0419  0.0001  0.0004  0.0001  0.0001  0.0001
MCLA   0.0291  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001
VMA    ×       0.0001  0.0001  ×       ———     0.0001  0.0001  0.0001  0.0001
BC     0.0139  0.0001  0.0001  ×       ×       ———     0.0044  0.0001  0.0001
CC     0.0139  0.0001  0.0001  ×       ×       ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  ×       ×       0.0322  0.0322  ———     ×
SC     ×       0.0001  0.0001  ×       ×       ×       ×       ×       ———

Table F.6: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Balance data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.7 MFeat data set

The results of the soft consensus clustering experiments conducted on the MFeat data set are presented in this section. Figure F.7 depicts the diagram displaying the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to the nine soft consensus functions compared, and table F.7 presents the results of the paired t-tests that compare them pairwise for statistical significance.

In time complexity terms, VMA is the best performing consensus function, closely followed by MCLA, HGPA, PC and SC (the two latter being statistically equivalent). Of the two proposed positional voting based consensus functions, BC is clearly more efficient than CC. This is probably due to the larger number of classes (i.e. candidates) in this data set, which makes CC more costly because of the exhaustive pairwise candidate confrontation involved in the Condorcet voting method. However, executing CC takes approximately as long as running CSPA, and much less time than doing so with EAC, which is, by far, the least efficient consensus function.


When the quality of the consensus clustering solutions delivered by these consensus functions is compared, we can see that PC, BC and CC obtain the highest φ(NMI) scores (with no significant differences among them), closely followed by VMA, SC and CSPA.

[Figure F.7: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the MFeat data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0022  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0013  0.0001  0.0015  0.0001  0.0266  0.026
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  ×       ×
VMA    ×       0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0007  0.0001  0.0001  0.0001  0.0034  ———     0.0001  0.0001  0.0001
CC     0.0008  0.0001  0.0001  0.0001  0.0043  ×       ———     0.0001  0.0001
PC     0.0382  0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     ×       0.0001  0.0001  0.0001  ×       0.0024  0.003   ×       ———

Table F.7: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the MFeat data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.8 miniNG data set

In this section we present the results of the soft consensus clustering experiments conducted on the miniNG data collection. The φ(NMI) vs CPU time diagram of figure F.8 reveals that three of the proposed voting based consensus functions (BC, PC and SC) constitute a good trade-off between consensus quality and time complexity.


[Figure F.8: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the miniNG data collection.]

Indeed, the consensus clustering solutions they yield are statistically significantly better than those output by state-of-the-art consensus functions such as VMA (which is the least time consuming), CSPA or MCLA (see table F.8 for further details regarding the statistical significance of the differences between consensus functions). The fourth proposed consensus function (CC) also yields higher quality than VMA, CSPA and MCLA, but its time complexity is notably higher, due to the nature of the Condorcet voting method.

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  ×       0.0001  0.0015  0.0001  0.0114  0.0115
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0041  0.0001  0.0001  0.0001  0.0001  0.0001
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0025  0.0001  0.0185  0.0187
VMA    ×       0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0051  0.0001  0.0001  0.0001  0.0038  ———     0.0001  ×       ×
CC     0.008   0.0001  0.0001  0.0001  0.0061  ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     0.0001  0.0001  0.0001  0.0001  0.0001  ×       ×       0.0126  ———

Table F.8: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the miniNG data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.9 Segmentation data set

The results of applying the nine soft consensus functions to the Segmentation data set are described next. Figure F.9 presents the φ(NMI) vs CPU time mean ± 2-standard-deviation regions employed for comparing them.

Again, VMA is the most computationally efficient consensus function. The proposed confidence voting based clustering combiners (PC and SC), together with MCLA and HGPA, are the immediate followers. Between the two proposed consensus functions based on positional voting, BC is once more the most efficient (being comparable to CSPA in terms of execution CPU time), as Borda voting is less computationally demanding than Condorcet voting.
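The lower cost of positional voting can be illustrated with a generic Borda tally (not the exact BC implementation): each ballot is traversed once, awarding decreasing points by rank, so no pairwise duels between candidates are needed.

    def borda_winner(rankings, candidates):
        # rankings: list of candidate lists, best first (one per voter).
        points = {c: 0 for c in candidates}
        k = len(candidates)
        for ballot in rankings:                      # single pass per voter
            for position, candidate in enumerate(ballot):
                points[candidate] += k - 1 - position
        return max(points, key=points.get)

    rankings = [["c1", "c2", "c3"], ["c2", "c1", "c3"], ["c1", "c3", "c2"]]
    print(borda_winner(rankings, ["c1", "c2", "c3"]))   # -> "c1"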



[Figure F.9: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Segmentation data collection.]


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  ×       0.0001  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     ×       0.0001  0.0002  0.0001  ×       ×
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.006   0.0001  ×       ×
VMA    ×       0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0006  0.0001  0.0001  0.0001  0.0069  ———     0.0001  0.0007  0.0007
CC     0.0006  0.0001  0.0001  0.0001  0.0069  ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  0.0001  ×       0.02    0.02    ———     ×
SC     ×       0.0001  0.0001  0.0001  ×       0.0307  0.0307  ×       ———

Table F.9: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Segmentation data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

However, these two consensus functions (BC and CC) are the ones that obtain the highest quality consensus clustering solutions, and the difference between their φ(NMI) scores and those of the remaining clustering combiners is statistically significant, as the figures shown in table F.9 reveal. The quality of the consensus clusterings output by the other two voting based consensus functions (PC and SC) is, from a statistical standpoint, equivalent to that of the VMA and CSPA consensus functions.


[Figure F.10: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the BBC data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0002  0.0012  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     ×       0.0001  0.0001  0.0001  ×       ×
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  ×       ×
VMA    0.0456  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.004   0.0001  0.0001  0.0001  ×       ———     0.0001  0.0002  0.0002
CC     0.004   0.0001  0.0001  0.0001  ×       ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     ×       0.0001  0.0001  0.0001  ×       0.0279  0.0279  ×       ———

Table F.10: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the BBC data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.10 BBC data set

This section presents the results of the soft consensus clustering experiments conducted on the BBC data set. A qualitative description of them is provided by the φ(NMI) vs CPU time diagram of figure F.10, and the results of the statistical significance study of the differences between consensus functions are presented in table F.10.

It can be observed that VMA is again the fastest consensus function. The confidence voting consensus functions (PC and SC) are, in statistical terms, as fast as MCLA and HGPA. The positional voting consensus functions (BC and CC) are slower than those, with BC being faster than CSPA and CC slower than it.

As regards the quality of the consensus clustering solutions obtained, CSPA, PC and SC yield the highest φ(NMI) scores, being equivalent from a statistical significance viewpoint. The BordaConsensus and CondorcetConsensus clustering combiners also deliver fairly good performances, together with the VMA consensus function, being notably better than MCLA (and far better than EAC and HGPA, which yield extremely poor consensus clustering solutions).



F.11 PenDigits data set

The results of the soft consensus clustering experiments conducted on the PenDigits data set are described in the following paragraphs. Due to the number of objects n contained in this collection, the CSPA and EAC consensus functions were not executable, as their space complexity scales quadratically with n.
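A back-of-the-envelope estimate shows why an n × n co-association or similarity matrix becomes prohibitive for collections of this size; the object counts below are illustrative, not the exact size of PenDigits.

    def coassociation_memory_gib(n_objects, bytes_per_entry=8):
        # Memory needed for a dense n x n matrix of 8-byte floats, in GiB.
        return (n_objects ** 2) * bytes_per_entry / 2 ** 30

    for n in (1_000, 10_000, 100_000):
        print(f"n = {n:>7d}  ->  {coassociation_memory_gib(n):8.2f} GiB")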

Thus, as a means of comparing the seven consensus functions applied to this data set, figure F.11 depicts the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to them. It can be observed that, in this case, the four proposed voting based consensus functions are the most time consuming. However, those based on confidence voting (PC and SC) are fairly comparable to HGPA and MCLA, while BC and CC are the most computationally costly (especially the latter). As in the previous cases, VMA is the most efficient of the consensus functions compared.

When the comparison refers to the φ(NMI) of the consensus clustering solutions yielded by the seven consensus functions, we can observe that the highest quality is obtained by PC and SC, which match VMA in this respect. The other two voting based consensus functions (BC and CC) perform slightly worse, but far better than MCLA and HGPA.


[Figure F.11: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the PenDigits data collection.]

       HGPA    MCLA    VMA     BC      CC      PC      SC
HGPA   ———     ×       0.0001  0.0005  0.0001  ×       ×
MCLA   0.0001  ———     0.0015  0.0001  0.0001  0.0273  0.0269
VMA    0.0001  0.0001  ———     0.0001  0.0001  0.0002  0.0002
BC     0.0001  0.0001  0.0001  ———     0.0001  0.0097  0.0098
CC     0.0001  0.0001  0.0001  ×       ———     0.0001  0.0001
PC     0.0001  0.0001  ×       0.0001  0.0001  ———     ×
SC     0.0001  0.0001  ×       0.0001  0.0001  ×       ———

Table F.11: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the PenDigits data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.




This Doctoral Thesis was defended on the ____ day of __________________, ____,
at the Centre _______________________________________________________________
of Universitat Ramon Llull,
before the Examination Board formed by the undersigned Doctors, having obtained the grade:

President
_______________________________

Board member
_______________________________

Board member
_______________________________

Board member
_______________________________

Secretary
_______________________________

Doctoral candidate

