

TESI DOCTORAL

Title: Hierarchical self-refining consensus architectures and soft consensus functions for robust multimedia clustering

Presented by: Xavier Sevillano Domínguez

Centre: Enginyeria i Arquitectura La Salle

Department: Comunicacions i Teoria del Senyal

Supervised by: Dr. Francesc Alías Pujol and Dr. Joan Claudi Socoró Carrié



Acknowledgements

This thesis is the fruit of many hours of personal work. Nevertheless, there are many people I am grateful to for their support over these years.

First of all, I want to mention my fantastic team of co-supervisors: Joan Claudi Socoró, whom I thank for having been a magnificent advisor/supervisor since the now distant days of the TFC (adaptive equalization, phew!), besides giving me the freedom to write the thesis I wanted and always offering his help at critical moments. And Francesc Alías (Xuti), for the push that ultimately became the start of this thesis, for his constant spirit of improvement and, above all, for his friendship, which goes back even further.

To the direct managers I have had over these years, David Badia, Elisa Martínez and Gabriel Fernández, I am grateful for having granted me a space sheltered, many a time, from the usual shower of thankless tasks.

I am also very grateful to my good friend and disciple Germán Cobo, who took with me the first steps of what has ended up becoming this thesis, and with whom I hope to keep working in the future, even if at a distance (that is how things are at the UOC).

The long hours of simulations would have been even longer had it not been for the Maintenance staff (Chus, Héctor, Gerard, Oscar, Raúl), who helped me and made it easy to use (and almost monopolize) a good handful of PCs. I am also very grateful to my section colleagues who let me occupy their computers, often to their own detriment: Germán, Joan Claudi, José Antonio Montero and Ester Cierco. Thanks to Lluís Formiga for opening the doors of the Multimodal to me, which greatly sped up the experimentation process.

Thanks to Berta Martínez, Àngel Calzada and Lluís Cortés for making the summer of 2008 more bearable, and to Germán (again!) for introducing me to Deezer, which has provided the soundtrack to this thesis. Thanks, in general, to all the colleagues of the former Tractament section, of the Departament de Comunicacions, and of the current DTM.

Thanks to my mother for her love and support throughout my whole life. Thanks to my father for instilling in me a passion for studying, and thanks to both of them for their efforts so that I could study under the best conditions. And thanks to all my family and friends in general: although you may not know it, it was very comforting for me whenever you asked how the thesis was going.

And thanks to Susana, for her patience throughout this whole process, encouraging me and always believing in me, and for still being there so that we can enjoy together what is to come ... once the thesis is finished.


Resum

En segmentar de forma no supervisada una col·lecció de dades, l'usuari ha de prendre múltiples decisions (quin algorisme aplicar, com representar els objectes, en quants grups agrupar aquests, entre d'altres) que condicionen, en gran mesura, la qualitat de la partició resultant. Malauradament, la naturalesa no supervisada del problema fa difícil (quan no impossible) prendre aquestes decisions de manera fonamentada, a no ser que es disposi de cert coneixement del domini.

En un intent per combatre aquestes incerteses, aquesta tesi proposa una aproximació al problema que minimitza, intencionadament, la presa de decisions per part de l'usuari. Ans al contrari, s'encoratja l'ús de tants sistemes de classificació no supervisada com sigui possible, combinant-los per tal d'obtenir la partició final de les dades (o partició de consens). Com més semblant sigui aquesta a la partició de màxima qualitat oferida pels sistemes de classificació subjectes a combinació, major serà el grau de robustesa assolit respecte les indeterminacions inherents a la classificació no supervisada.

Nogensmenys, la combinació indiscriminada de classificadors no supervisats planteja dues dificultats principals, que són i) l'increment de la complexitat computacional del procés de combinació, fins al punt que la seva execució pot esdevenir inviable si el nombre de sistemes a combinar és excessiu, i ii) l'obtenció de particions de consens de baixa qualitat deguda a la inclusió de sistemes de classificació pobres. Amb l'objectiu de lluitar contra aquests problemes, aquesta tesi introdueix les arquitectures de consens jeràrquiques autorefinables com a via per a l'obtenció de particions de consens de bona qualitat amb baix cost computacional, tal i com confirmen els nombrosos experiments realitzats.

Amb la intenció d'exportar aquesta estratègia de classificació no supervisada robusta a un marc generalista, es proposa un conjunt de funcions de consens basades en votació per a la combinació de classificadors difusos. Diversos experiments demostren que les seves prestacions són comparables o superiors a bona part de l'estat de l'art.

Les nostres propostes són aplicables de forma natural a la classificació robusta de dades multimodals (un problema d'interès ben actual donada la ubiqüitat de la multimèdia), ja que l'existència de múltiples modalitats planteja indeterminacions addicionals que dificulten l'obtenció de particions robustes. La base de la nostra proposta és la creació de conjunts de particions multimodals, el que permet l'ús natural i simultani de tècniques de fusió de modalitats avançada i retardada, donant peu a una aproximació genèrica i eficient a la classificació multimèdia, els resultats de la qual s'analitzen al llarg de múltiples experiments.


Resumen

Al segmentar de forma no supervisada una colección de datos, el usuario debe tomar múltiples decisiones (qué algoritmo aplicar, cómo representar los objetos, en cuántos grupos agrupar éstos, entre otras) que condicionan, en gran medida, la calidad de la partición resultante. Desgraciadamente, la naturaleza no supervisada del problema hace difícil (cuando no imposible) tomar estas decisiones de manera fundamentada, a no ser que se disponga de cierto conocimiento del dominio.

En un intento por combatir estas incertidumbres, esta tesis propone una aproximación al problema que minimiza, intencionadamente, la toma de decisiones por parte del usuario. Al contrario, se alienta el uso de tantos sistemas de clasificación no supervisada como sea posible, combinándolos con el fin de obtener la partición final de los datos (o partición de consenso). Cuanto más similar sea ésta a la partición de máxima calidad ofrecida por los sistemas de clasificación sujetos a combinación, mayor será el grado de robustez respecto a las indeterminaciones inherentes a la clasificación no supervisada.

No obstante, la combinación indiscriminada de clasificadores no supervisados plantea dos dificultades principales, que son i) el incremento de la complejidad computacional del proceso de combinación, hasta el punto de que su ejecución puede ser inviable si el número de sistemas a combinar es excesivo, y ii) la obtención de particiones de consenso de baja calidad debida a la inclusión de sistemas de clasificación pobres. Con el objetivo de luchar contra estos problemas, esta tesis introduce las arquitecturas de consenso jerárquicas autorefinables como vía para la obtención de particiones de consenso de buena calidad con bajo coste computacional, tal como confirman los numerosos experimentos realizados.

Con la intención de exportar esta estrategia de clasificación no supervisada robusta a un marco generalista, se propone un conjunto de funciones de consenso basadas en votación para la combinación de clasificadores difusos. Diversos experimentos demuestran que sus prestaciones son comparables o superiores a buena parte del estado del arte.

Nuestras propuestas son aplicables de forma natural a la clasificación robusta de datos multimodales (un problema de interés actual dada la ubicuidad de la multimedia), ya que la existencia de múltiples modalidades plantea indeterminaciones adicionales que dificultan la obtención de particiones robustas. La base de nuestra propuesta es la creación de conjuntos de particiones multimodales, lo que permite el uso natural y simultáneo de técnicas de fusión de modalidades temprana y tardía, dando pie a una aproximación genérica y eficiente a la clasificación multimedia, cuyos resultados se analizan a lo largo de múltiples experimentos.


Abstract

When facing the task of partitioning a data collection in an unsupervised fashion, the clustering practitioner must make several crucial decisions (which clustering algorithm to apply, how the objects in the data set are represented, how many clusters are to be found, among others) that condition, to a large extent, the quality of the resulting partition. However, the unsupervised nature of the clustering problem makes it difficult (if not impossible) to make well-founded decisions unless domain knowledge is available.

In an attempt to fight these indeterminacies, we propose an approach to the clustering problem that intentionally reduces user decision making as much as possible. Quite the contrary, the clustering practitioner is encouraged to simultaneously employ as many clustering systems as possible (compiling their outcomes into a cluster ensemble) and to combine them in order to obtain the final partition (or consensus clustering). The greater the similarity between the highest quality cluster ensemble component and the consensus clustering, the greater the degree of robustness achieved against the inherent indeterminacies of clustering.

However, the indiscriminate creation of cluster ensemble components poses two main challenges to the clustering combination process, namely i) an increase in its computational complexity, to the point that the creation of the consensus clustering can become unfeasible if the number of clustering systems combined is too large, and ii) the risk of obtaining a low quality consensus partition due to the inclusion of poor clustering systems in the cluster ensemble. To overcome these drawbacks, this thesis introduces hierarchical self-refining consensus architectures as a means for obtaining good quality partitions at a reduced computational cost, as confirmed by extensive experimental evaluation.
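To make the hierarchical idea more concrete, the following Python sketch shows one minimal way of combining a cluster ensemble stage by stage: the ensemble is split into mini-ensembles of size b, each mini-ensemble is merged by a simple co-association consensus function, and the resulting partitions feed the next stage until a single consensus clustering remains. This is only an illustrative sketch under our own assumptions (the co-association consensus function and the helper names coassociation_consensus and hierarchical_consensus are ours); it is not the exact RHCA/DHCA formulation evaluated in the thesis.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def coassociation_consensus(ensemble, k):
    """Consensus of a list of label vectors via the object co-association matrix."""
    ensemble = np.asarray(ensemble)            # shape (l, n): l clusterings of n objects
    l, n = ensemble.shape
    coassoc = np.zeros((n, n))
    for labels in ensemble:
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= l                               # fraction of clusterings co-clustering each pair
    dist = 1.0 - coassoc                       # turn the similarity into a distance
    condensed = dist[np.triu_indices(n, 1)]    # condensed form expected by linkage
    Z = linkage(condensed, method='average')
    return fcluster(Z, t=k, criterion='maxclust') - 1   # labels 0..k-1

def hierarchical_consensus(ensemble, k, b=5):
    """Merge the ensemble in mini-ensembles of size b, stage by stage."""
    assert b >= 2, "mini-ensembles must contain at least two components"
    partitions = list(ensemble)
    while len(partitions) > 1:
        partitions = [coassociation_consensus(partitions[i:i + b], k)
                      for i in range(0, len(partitions), b)]
    return partitions[0]

With an ensemble of, say, 30 base clusterings and b = 5, the first stage runs six small consensus processes and the second stage a single one, instead of one large consensus over all 30 components at once.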

Aiming to port this robust clustering strategy to a more generic framework, a set of voting based consensus functions for the combination of fuzzy clustering systems is proposed. Several experiments demonstrate that the quality of the consensus clusterings they yield is comparable to or better than that of multiple state-of-the-art soft consensus functions.
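As a rough illustration of the voting principle (not the specific consensus functions proposed in this thesis), the sketch below aligns the cluster labels of each soft clustering with those of a reference component using the Hungarian algorithm, accumulates the aligned membership matrices as votes, and assigns each object to the cluster receiving the largest total vote. It assumes every component has the same number of clusters; the function name voting_soft_consensus is ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def voting_soft_consensus(soft_ensemble):
    """soft_ensemble: list of (n, k) membership matrices (rows sum to 1).
    Returns a hard consensus labelling of the n objects (values 0..k-1)."""
    reference = soft_ensemble[0]
    votes = np.zeros_like(reference, dtype=float)
    for membership in soft_ensemble:
        # Cluster label correspondence: maximize the overlap with the reference
        # clustering's columns (solved as an assignment problem).
        overlap = reference.T @ membership                 # (k, k) overlap matrix
        ref_idx, comp_idx = linear_sum_assignment(-overlap)
        aligned = np.zeros_like(membership, dtype=float)
        aligned[:, ref_idx] = membership[:, comp_idx]      # relabel this component
        votes += aligned                                   # accumulate the soft votes
    return votes.argmax(axis=1)                            # plurality voting per object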

Our proposals find a natural field of application in the robust clustering of multimodal data (a problem of current interest due to the growing ubiquity of multimedia), as the existence of multiple data modalities poses additional indeterminacies that make robust clustering results harder to obtain. The basis of our proposal is the creation of multimodal cluster ensembles, which naturally allows the simultaneous use of early and late modality fusion techniques, thus providing a highly generic and efficient approach to multimedia clustering, the performance of which is analyzed in multiple experiments.
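The multimodal cluster ensemble construction can be pictured as follows: late fusion clusters each modality separately, early fusion clusters the concatenation of all modality features, and both kinds of partitions are pooled into a single ensemble that any consensus function (such as the sketches above) can then combine. The snippet below is only meant to convey that construction; the base clusterer, the number of runs and the helper name multimodal_cluster_ensemble are illustrative choices of ours, not the thesis's experimental setup.

import numpy as np
from sklearn.cluster import KMeans

def multimodal_cluster_ensemble(modalities, k, runs_per_view=3, seed=0):
    """modalities: list of (n, d_m) feature matrices describing the same n objects.
    Pools late-fusion partitions (one modality at a time) and early-fusion
    partitions (concatenated feature space) into a single cluster ensemble."""
    rng = np.random.RandomState(seed)
    ensemble = []
    # Late fusion: cluster each modality on its own.
    for X in modalities:
        for _ in range(runs_per_view):
            km = KMeans(n_clusters=k, n_init=10, random_state=rng.randint(10**6))
            ensemble.append(km.fit_predict(X))
    # Early fusion: cluster the concatenation of all modality features.
    X_early = np.hstack(modalities)
    for _ in range(runs_per_view):
        km = KMeans(n_clusters=k, n_init=10, random_state=rng.randint(10**6))
        ensemble.append(km.fit_predict(X_early))
    return ensemble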



Contents

Resum
Resumen
Abstract
List of tables
List of figures
List of algorithms
List of symbols

1 Framework of the thesis
1.1 Knowledge discovery and data mining
1.2 Clustering in knowledge discovery and data mining
1.2.1 Overview of clustering methods
1.2.2 Evaluation of clustering processes
1.3 Multimodal clustering
1.4 Clustering indeterminacies
1.5 Motivation and contributions of the thesis

2 Cluster ensembles and consensus clustering
2.1 Related work on cluster ensembles
2.2 Related work on consensus functions
2.2.1 Consensus functions based on voting
2.2.2 Consensus functions based on graph partitioning
2.2.3 Consensus functions based on object co-association measures
2.2.4 Consensus functions based on categorical clustering
2.2.5 Consensus functions based on probabilistic approaches
2.2.6 Consensus functions based on reinforcement learning
2.2.7 Consensus functions based on interpreting object similarity as data
2.2.8 Consensus functions based on cluster centroids
2.2.9 Consensus functions based on correlation clustering
2.2.10 Consensus functions based on search techniques
2.2.11 Consensus functions based on cluster ensemble component selection
2.2.12 Other interesting works on consensus clustering

3 Hierarchical consensus architectures
3.1 Motivation
3.2 Random hierarchical consensus architectures
3.2.1 Rationale and definition
3.2.2 Computational complexity
3.2.3 Running time minimization
3.2.4 Experiments
3.3 Deterministic hierarchical consensus architectures
3.3.1 Rationale and definition
3.3.2 Computational complexity
3.3.3 Running time minimization
3.3.4 Experiments
3.4 Flat vs. hierarchical consensus
3.4.1 Running time comparison
3.4.2 Consensus quality comparison
3.5 Discussion
3.6 Related publications

4 Self-refining consensus architectures
4.1 Description of the consensus self-refining procedure
4.2 Flat vs. hierarchical self-refining
4.2.1 Evaluation of the consensus-based self-refining process
4.2.2 Evaluation of the supraconsensus process
4.3 Selection-based self-refining
4.3.1 Evaluation of the selection-based self-refining process
4.3.2 Evaluation of the supraconsensus process
4.4 Discussion
4.5 Related publications

5 Multimedia clustering based on cluster ensembles
5.1 Generation of multimodal cluster ensembles
5.2 Self-refining multimodal consensus architecture
5.3 Multimodal consensus clustering results
5.3.1 Consensus clustering per modality and across modalities
5.3.2 Self-refined consensus clustering across modalities
5.4 Discussion
5.5 Related publications

6 Voting based consensus functions for soft cluster ensembles
6.1 Soft cluster ensembles
6.2 Adapting consensus functions to soft cluster ensembles
6.3 Voting based consensus functions
6.3.1 Cluster disambiguation
6.3.2 Voting strategies
6.4 Experiments
6.5 Discussion
6.6 Related publications

7 Conclusions
7.1 Hierarchical consensus architectures
7.2 Consensus self-refining procedures
7.3 Multimedia clustering based on cluster ensembles
7.4 Voting based soft consensus functions

References

Appendices

A Experimental setup
A.1 The CLUTO clustering package
A.2 Data sets
A.2.1 Unimodal data sets
A.2.2 Multimodal data sets
A.3 Data representations
A.3.1 Unimodal data representations
A.3.2 Multimodal data representations
A.4 Cluster ensembles
A.5 Consensus functions
A.6 Computational resources

B Experiments on clustering indeterminacies
B.1 Clustering indeterminacies in unimodal data sets
B.1.1 Zoo data set
B.1.2 Iris data set
B.1.3 Wine data set
B.1.4 Glass data set
B.1.5 Ionosphere data set
B.1.6 WDBC data set
B.1.7 Balance data set
B.1.8 MFeat data set
B.1.9 miniNG data set
B.1.10 Segmentation data set
B.1.11 BBC data set
B.1.12 PenDigits data set
B.1.13 Summary
B.2 Clustering indeterminacies in multimodal data sets
B.2.1 CAL500 data set
B.2.2 Corel data set
B.2.3 InternetAds data set
B.2.4 IsoLetters data set
B.2.5 Summary

C Experiments on hierarchical consensus architectures
C.1 Configuration of a random hierarchical consensus architecture
C.2 Estimation of the computationally optimal RHCA
C.2.1 Iris data set
C.2.2 Wine data set
C.2.3 Glass data set
C.2.4 Ionosphere data set
C.2.5 WDBC data set
C.2.6 Balance data set
C.2.7 MFeat data set
C.3 Estimation of the computationally optimal DHCA
C.3.1 Iris data set
C.3.2 Wine data set
C.3.3 Glass data set
C.3.4 Ionosphere data set
C.3.5 WDBC data set
C.3.6 Balance data set
C.3.7 MFeat data set
C.4 Computationally optimal RHCA, DHCA and flat consensus comparison
C.4.1 Iris data set
C.4.2 Wine data set
C.4.3 Glass data set
C.4.4 Ionosphere data set
C.4.5 WDBC data set
C.4.6 Balance data set
C.4.7 MFeat data set
C.4.8 miniNG data set
C.4.9 Segmentation data set
C.4.10 BBC data set
C.4.11 PenDigits data set

D Experiments on self-refining consensus architectures
D.1 Experiments on consensus-based self-refining
D.1.1 Iris data set
D.1.2 Wine data set
D.1.3 Glass data set
D.1.4 Ionosphere data set
D.1.5 WDBC data set
D.1.6 Balance data set
D.1.7 MFeat data set
D.1.8 miniNG data set
D.1.9 Segmentation data set
D.1.10 BBC data set
D.1.11 PenDigits data set
D.2 Experiments on selection-based self-refining
D.2.1 Iris data set
D.2.2 Wine data set
D.2.3 Glass data set
D.2.4 Ionosphere data set
D.2.5 WDBC data set
D.2.6 Balance data set
D.2.7 MFeat data set
D.2.8 miniNG data set
D.2.9 Segmentation data set
D.2.10 BBC data set
D.2.11 PenDigits data set

E Experiments on multimodal consensus clustering
E.1 CAL500 data set
E.1.1 Consensus clustering per modality and across modalities
E.1.2 Self-refined consensus clustering across modalities
E.2 InternetAds data set
E.2.1 Consensus clustering per modality and across modalities
E.2.2 Self-refined consensus clustering across modalities
E.3 Corel data set
E.3.1 Consensus clustering per modality and across modalities
E.3.2 Self-refined consensus clustering across modalities

F Experiments on soft consensus clustering
F.1 Iris data set
F.2 Wine data set
F.3 Glass data set
F.4 Ionosphere data set
F.5 WDBC data set
F.6 Balance data set
F.7 MFeat data set
F.8 miniNG data set
F.9 Segmentation data set
F.10 BBC data set
F.11 PenDigits data set


List of Tables

1.1 Illustration of the clustering algorithm indeterminacy on the BBC and PenDigits data sets clustered by the direct-cos-i2 and graph-cos-i2 algorithms
1.2 Illustration of the clustering indeterminacies on the CAL500, Corel, InternetAds and IsoLetters multimodal data sets

2.1 Taxonomy of consensus functions according to their theoretical basis

3.1 Number of inner loop iterations as a function of the outer loop's index i
3.2 Methodology for estimating the running time of multiple RHCA variants
3.3 Evaluation of the minimum complexity RHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions
3.4 Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully serial implementation
3.5 Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully parallel implementation
3.6 Methodology for estimating the running time of multiple DHCA variants
3.7 Evaluation of the minimum complexity DHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions
3.8 Evaluation of the minimum complexity serial DHCA variant prediction based on decreasing diversity factor ordering in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions
3.9 Running time differences between the most and least computationally efficient DHCA variants in both the serial and parallel implementations
3.10 Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully serial implementation
3.11 Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully parallel implementation
3.12 Percentage of experiments in which the consensus clustering solution is better than the median cluster ensemble component
3.13 Relative percentage φ(NMI) gain between the consensus clustering solution and the median cluster ensemble component
3.14 Percentage of experiments in which the consensus clustering solution is better than the best cluster ensemble component
3.15 Relative percentage φ(NMI) gain between the consensus clustering solution and the best cluster ensemble component

4.1 Methodology of the consensus self-refining procedure
4.2 Percentage of self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart
4.3 Relative φ(NMI) gain percentage between the top quality self-refined consensus clustering solutions with respect to its non-refined counterpart
4.4 Percentage of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component
4.5 Relative percentage φ(NMI) gain between the best (non-refined or self-refined) consensus clustering solution and the best cluster ensemble component
4.6 Percentage of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component
4.7 Relative percentage φ(NMI) gain between the best (non-refined or self-refined) consensus clustering solution and the median cluster ensemble component
4.8 φ(NMI) variance of the non-refined and the best non/self-refined consensus clustering solutions across the flat, RHCA and DHCA consensus architectures
4.9 Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution
4.10 Relative percentage φ(NMI) losses due to suboptimal self-refined consensus clustering solution selection by supraconsensus
4.11 Methodology of the cluster ensemble component selection-based consensus self-refining procedure
4.12 Percentage of self-refining experiments in which one of the self-refined consensus clustering solutions is better than the selected cluster ensemble component reference λref
4.13 Relative φ(NMI) gain percentage between the top quality self-refined consensus clustering solutions with respect to the maximum φ(ANMI) cluster ensemble component
4.14 Percentage of experiments where either the top quality self-refined consensus clustering solution or λref better the best cluster ensemble component, and relative φ(NMI) gain percentage with respect to it
4.15 Percentage of experiments where either the top quality self-refined consensus clustering solution or λref better the median cluster ensemble component, and relative φ(NMI) gain percentage with respect to it
4.16 Percentage of experiments in which the supraconsensus function selects the top quality clustering solution, and relative percentage φ(NMI) losses between the top quality clustering solution and the one selected by supraconsensus, averaged across the twelve data collections

5.1 Range and cardinality of the dimensional diversity factor dfD per modality for each one of the four multimedia data sets
5.2 Percentage of cluster ensemble components that attain a higher φ(NMI) than the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions
5.3 Evaluation of the unimodal and multimodal consensus clusterings with respect to the best cluster ensemble component, across the four multimedia data collections and the seven consensus functions
5.4 Evaluation of the unimodal and multimodal consensus clusterings with respect to the median cluster ensemble component, across the four multimedia data collections and the seven consensus functions
5.5 Evaluation of the multimodal consensus clusterings with respect to their unimodal counterparts, across the four multimedia data collections and the seven consensus functions
5.6 Evaluation of the intermodal consensus clustering with respect to the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions
5.7 Percentage of multimodal self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart
5.8 Relative φ(NMI) gain percentage between the top quality self-refined consensus clustering solutions with respect to its non-refined counterpart
5.9 Percentage of the cluster ensemble components that attain a higher φ(NMI) score than the top quality self-refined consensus clustering solution
5.10 Percentage of experiments in which the best (either non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component
5.11 Relative φ(NMI) percentage difference between the top quality (either non-refined or self-refined) consensus clustering solution with respect to the best ensemble component
5.12 Percentage of experiments in which the top quality (either non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component
5.13 Relative φ(NMI) percentage difference between the top quality (either non-refined or self-refined) consensus clustering solution with respect to the median ensemble component
5.14 Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution
5.15 Relative φ(NMI) percentage differences between the best and median components of the cluster ensemble and the consensus clustering λ_c^final selected by supraconsensus, across the four multimedia data collections

6.1 Soft cluster ensemble sizes of the unimodal data sets
6.2 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Zoo data set
6.3 Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) yield better/equivalent/worse consensus clustering solutions than the four proposed consensus functions (BC, CC, PC and SC)
6.4 Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) are executed faster/equivalent/slower than the four proposed consensus functions (BC, CC, PC and SC)

A.1 Cross-option table indicating the clustering strategy-criterion function-similarity measure combinations available in CLUTO
A.2 Summary of the unimodal data sets employed in the experiments
A.3 Summary of the multimodal data sets employed in the experiments
A.4 Cluster ensemble sizes corresponding to distinct algorithmic diversity configurations for the unimodal data sets
A.5 Cluster ensemble sizes corresponding to distinct algorithmic diversity configurations for the multimodal data sets

B.1 Number of individual clusterings per data representation on each unimodal data set
B.2 Top clustering results of each clustering algorithm family sorted from highest to lowest φ(NMI) on the unimodal collections
B.3 Number of individual clusterings per data representation on each multimodal data set
B.4 Top clustering results of each clustering algorithm family sorted from highest to lowest φ(NMI) on the multimodal collections

C.1 Examples of computation of the number of stages s of a RHCA with l = 7, 8 and 9 and b = 2
C.2 Examples of computation of the number of consensus per stage (Ki) of a RHCA with l = 7, 8 and 9 and b = 2
C.3 Examples of computation of the mini-ensembles size of a RHCA with l = 7, 8 and 9 and b = 2
C.4 Configuration of RHCA topologies on a cluster ensemble of size l = 30 with varying mini-ensembles sizes

F.1 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Iris data set
F.2 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Wine data set
F.3 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Glass data set
F.4 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Ionosphere data set
F.5 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the WDBC data set
F.6 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Balance data set
F.7 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the MFeat data set
F.8 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the miniNG data set
F.9 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Segmentation data set
F.10 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the BBC data set
F.11 Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the PenDigits data set


List of Figures<br />

1.1 Evolution of the total number of websites across all Internet domains, from<br />

November 1995 to February 2009 . . . . . . . . . . . . . . . . . . . . . . . . 2<br />

1.2 Schematic diagram of the steps involved in a knowledge discovery process . 4<br />

1.3 Taxonomy of data mining methods . . . . . . . . . . . . . . . . . . . . . . . 6<br />

1.4 Toy example of a hierarchical clustering dendrogram . . . . . . . . . . . . . 10<br />

1.5 Illustration of the data representation indeterminacy on the Wine and miniNG<br />

data sets clustered by the rbr-corr-e1 algorithm. . . . . . . . . . . . . . 21<br />

1.6 Block diagram of the robust multimodal clustering system based on selfrefining<br />

hierarchical consensus architectures . . . . . . . . . . . . . . . . . . 26<br />

2.1 Scatterplot of an artificially generated two-dimensional toy data set containing<br />

n = 9 objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />

2.2 Schematic representation of a consensus clustering process on a hard cluster<br />

ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />

3.1 Flat vs hierarchical construction of a consensus clustering solution on a hard<br />

cluster ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />

3.2 Three examples of topologies of random hierarchical consensus architectures 52<br />

3.3 Evolution of RHCA parameters as a function of the mini-ensembles size b . 54<br />

3.4 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 1 diversity scenario . . . . 58<br />

3.5 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 10 diversity scenario . . . . 59<br />

3.6 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 10 diversity scenario . . . . 61<br />

3.7 Estimated and real running times of the serial and parallel RHCA implementations<br />

on the Zoo data collection in the |dfA| = 28 diversity scenario . . . . 62<br />

3.8 Evolution of the accuracy of RHCA running time estimation as a function of<br />

the number of consensus processes . . . . . . . . . . . . . . . . . . . . . . . 65<br />

3.9 An example of a deterministic hierarchical consensus architecture . . . . . . 71<br />

3.10 Estimated and real running times of the serial and parallel dHCA implementations<br />

on the Zoo data collection in the |dfA| = 1 diversity scenario . . . . 77<br />

xxiii


List of Figures<br />

3.11 Estimated and real running times of the serial and parallel dHCA implementations<br />

on the Zoo data collection in the |dfA| = 10 diversity scenario . . . .<br />

3.12 Estimated and real running times of the serial and parallel dHCA implemen-<br />

78<br />

tations on the Zoo data collection in the |dfA| = 19 diversity scenario . . . .<br />

3.13 Estimated and real running times of the serial and parallel dHCA implemen-<br />

79<br />

tations on the Zoo data collection in the |dfA| = 28 diversity scenario . . . .<br />

3.14 Evolution of the accuracy of DHCA running time estimation as a function of<br />

80<br />

the number of consensus processes . . . . . . . . . . . . . . . . . . . . . . . 82<br />

3.15 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

corresponding to a cluster ensemble of size l = 57 . . . . . . . . . . . . . . .<br />

3.16 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

89<br />

corresponding to a cluster ensemble of size l = 570 . . . . . . . . . . . . . . 91<br />

3.17 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

corresponding to a cluster ensemble of size l = 1083 . . . . . . . . . . . . . 92<br />

3.18 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />

architectures on the Zoo data collection for the diversity scenario<br />

corresponding to a cluster ensemble of size l = 1596 . . . . . . . . . . . . . 93<br />

3.19 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures across all data collections for all the diversity scenarios 95<br />

3.20 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures across all data collections for all the diversity<br />

scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .<br />

3.21 φ<br />

97<br />

(NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l =57<br />

3.22 φ<br />

99<br />

(NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l = 570<br />

3.23 φ<br />

100<br />

(NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l = 1083 100<br />

3.24 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />

for the diversity scenario corresponding to a cluster ensemble of size l = 1596 101<br />

3.25 φ (NMI) of the consensus solutions obtained by the computationally optimal<br />

parallel RHCA, DHCA and flat consensus architectures across all data collections<br />

for all the diversity scenarios . . . . . . . . . . . . . . . . . . . . . . 103<br />

4.1 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Zoo<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />

xxiv


List of Figures<br />

4.2 Decreasingly ordered φ (NMI) (wrt ground truth) values of the 300 clusterings<br />

included in the toy cluster ensemble (left), and their corresponding φ (ANMI)<br />

values (wrt the toy cluster ensemble) (right) . . . . . . . . . . . . . . . . . . 121<br />

4.3 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Zoo data collection . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />

5.1 Block diagram of the proposed multimodal consensus clustering system . . 134<br />

5.2 An example of a deterministic hierarchical consensus architecture DRM variant139<br />

5.3 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the IsoLetters data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142<br />

5.4 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the IsoLetters data set 143<br />

5.5 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the IsoLetters data set 143<br />

5.6 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the IsoLetters data set . . 144<br />

5.7 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the IsoLetters data set . . . . . . . 153<br />

5.8 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . 153<br />

5.9 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . 154<br />

5.10 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . . . 155<br />

6.1 Scatterplot of an artificially generated two-dimensional data set containing<br />

n = 9 objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164<br />

6.2 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Zoo data collection . . . . . . . . . . . . . . . . . . . . 186<br />

B.1 φ (NMI) histograms on the Zoo data set . . . . . . . . . . . . . . . . . . . . . 234<br />

B.2 φ (NMI) histograms on the Iris data set . . . . . . . . . . . . . . . . . . . . . 235<br />

B.3 φ (NMI) histograms on the Wine data set . . . . . . . . . . . . . . . . . . . . 236<br />

B.4 φ (NMI) histograms on the Glass data set . . . . . . . . . . . . . . . . . . . . 236<br />

B.5 φ (NMI) histograms on the Ionosphere data set . . . . . . . . . . . . . . . . . 237<br />

B.6 φ (NMI) histograms on the WDBC data set . . . . . . . . . . . . . . . . . . . 237<br />

B.7 φ (NMI) histograms on the Balance data set . . . . . . . . . . . . . . . . . . . 238<br />

B.8 φ (NMI) histograms on the MFeat data set . . . . . . . . . . . . . . . . . . . 239<br />

B.9 φ (NMI) histograms on the miniNG data set . . . . . . . . . . . . . . . . . . . 239<br />

B.10 φ (NMI) histograms on the Segmentation data set . . . . . . . . . . . . . . . . 240<br />

xxv


List of Figures<br />

B.11 φ (NMI) histograms on the BBC data set . . . . . . . . . . . . . . . . . . . . 240<br />

B.12 φ (NMI) histograms on the PenDigits data set . . . . . . . . . . . . . . . . . . 240<br />

B.13 φ (NMI) histograms on the CAL500 data set . . . . . . . . . . . . . . . . . . 244<br />

B.14 φ (NMI) histograms on the Corel data set . . . . . . . . . . . . . . . . . . . . 245<br />

B.15 φ (NMI) histograms on the InternetAds data set . . . . . . . . . . . . . . . . 246<br />

B.16 φ (NMI) histograms on the IsoLetters data set . . . . . . . . . . . . . . . . . . 247<br />

C.1 Estimated and real running times of the serial RHCA implementation on the<br />

Iris data collection in the four diversity scenarios . . . . . . . . . . . . . . . 254<br />

C.2 Estimated and real running times of the parallel RHCA implementation on<br />

the Iris data collection in the four diversity scenarios . . . . . . . . . . . . . 255<br />

C.3 Estimated and real running times of the serial RHCA implementation on the<br />

Wine data collection in the four diversity scenarios . . . . . . . . . . . . . . 257<br />

C.4 Estimated and real running times of the parallel RHCA implementation on<br />

the Wine data collection in the four diversity scenarios . . . . . . . . . . . . 258<br />

C.5 Estimated and real running times of the serial RHCA implementation on the<br />

Glass data collection in the four diversity scenarios . . . . . . . . . . . . . . 259<br />

C.6 Estimated and real running times of the parallel RHCA implementation on<br />

the Glass data collection in the four diversity scenarios . . . . . . . . . . . . 260<br />

C.7 Estimated and real running times of the serial RHCA implementation on the<br />

Ionosphere data collection in the four diversity scenarios . . . . . . . . . . . 262<br />

C.8 Estimated and real running times of the parallel RHCA implementation on<br />

the Ionosphere data collection in the four diversity scenarios . . . . . . . . . 263<br />

C.9 Estimated and real running times of the serial RHCA implementation on the<br />

WDBC data collection in the four diversity scenarios . . . . . . . . . . . . . 264<br />

C.10 Estimated and real running times of the parallel RHCA implementation on<br />

the WDBC data collection in the four diversity scenarios . . . . . . . . . . . 265<br />

C.11 Estimated and real running times of the serial RHCA implementation on the<br />

Balance data collection in the four diversity scenarios . . . . . . . . . . . . 267<br />

C.12 Estimated and real running times of the parallel RHCA implementation on<br />

the Balance data collection in the four diversity scenarios . . . . . . . . . . 268<br />

C.13 Estimated and real running times of the serial RHCA implementation on the<br />

Mfeat data collection in the four diversity scenarios . . . . . . . . . . . . . . 269<br />

C.14 Estimated and real running times of the parallel RHCA implementation on<br />

the Mfeat data collection in the four diversity scenarios . . . . . . . . . . . 270<br />

C.15 Estimated and real running times of the serial DHCA implementation on the<br />

Iris data collection in the four diversity scenarios . . . . . . . . . . . . . . . 274<br />

C.16 Estimated and real running times of the parallel DHCA implementation on<br />

the Iris data collection in the four diversity scenarios . . . . . . . . . . . . . 275<br />

C.17 Estimated and real running times of the serial DHCA implementation on the<br />

Wine data collection in the four diversity scenarios . . . . . . . . . . . . . . 276<br />


C.18 Estimated and real running times of the parallel DHCA implementation on<br />

the Wine data collection in the four diversity scenarios . . . . . . . . . . . . 277<br />

C.19 Estimated and real running times of the serial DHCA implementation on the<br />

Glass data collection in the four diversity scenarios . . . . . . . . . . . . . . 279<br />

C.20 Estimated and real running times of the parallel DHCA implementation on<br />

the Glass data collection in the four diversity scenarios . . . . . . . . . . . . 280<br />

C.21 Estimated and real running times of the serial DHCA implementation on the<br />

Ionosphere data collection in the four diversity scenarios . . . . . . . . . . . 281<br />

C.22 Estimated and real running times of the parallel DHCA implementation on<br />

the Ionosphere data collection in the four diversity scenarios . . . . . . . . . 282<br />

C.23 Estimated and real running times of the serial DHCA implementation on the<br />

WDBC data collection in the four diversity scenarios . . . . . . . . . . . . . 284<br />

C.24 Estimated and real running times of the parallel DHCA implementation on<br />

the WDBC data collection in the four diversity scenarios . . . . . . . . . . . 285<br />

C.25 Estimated and real running times of the serial DHCA implementation on the<br />

Balance data collection in the four diversity scenarios . . . . . . . . . . . . 286<br />

C.26 Estimated and real running times of the parallel DHCA implementation on<br />

the Balance data collection in the four diversity scenarios . . . . . . . . . . 287<br />

C.27 Estimated and real running times of the serial DHCA implementation on the<br />

Mfeat data collection in the four diversity scenarios . . . . . . . . . . . . . . 288<br />

C.28 Estimated and real running times of the parallel DHCA implementation on<br />

the Mfeat data collection in the four diversity scenarios . . . . . . . . . . . 289<br />

C.29 Running times of the computationally optimal serial RHCA, DHCA and<br />

flat consensus architectures on the Iris data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 292<br />

C.30 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Iris data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 293<br />

C.31 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Iris data collection in<br />

the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . 294<br />

C.32 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Wine data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 296<br />

C.33 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Wine data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 297<br />

C.34 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Wine data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 298<br />


C.35 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Glass data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 299<br />

C.36 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Glass data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 300<br />

C.37 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Glass data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 302<br />

C.38 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Ionosphere data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 303<br />

C.39 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Ionosphere data collection in the four<br />

diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 304<br />

C.40 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Ionosphere data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . 305<br />

C.41 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the WDBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 307<br />

C.42 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the WDBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 308<br />

C.43 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the WDBC data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 309<br />

C.44 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the Balance data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 311<br />

C.45 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Balance data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . 312<br />

C.46 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Balance data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . 313<br />

C.47 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the MFeat data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 315<br />

C.48 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the MFeat data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 316<br />


C.49 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the MFeat data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 317<br />

C.50 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the miniNG data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 319<br />

C.51 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the miniNG data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . 320<br />

C.52 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the miniNG data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . 321<br />

C.53 Running times of the computationally optimal serial RHCA, DHCA and<br />

flat consensus architectures on the Segmentation data collection in the four<br />

diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 322<br />

C.54 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the Segmentation data collection in the four<br />

diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 323<br />

C.55 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the Segmentation data<br />

collection in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . 325<br />

C.56 Running times of the computationally optimal serial RHCA, DHCA and<br />

flat consensus architectures on the BBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 326<br />

C.57 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the BBC data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 327<br />

C.58 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the BBC data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 328<br />

C.59 Running times of the computationally optimal serial RHCA, DHCA and flat<br />

consensus architectures on the PenDigits data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 330<br />

C.60 Running times of the computationally optimal parallel RHCA, DHCA and<br />

flat consensus architectures on the PenDigits data collection in the four diversity<br />

scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . 331<br />

C.61 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />

RHCA, DHCA and flat consensus architectures on the PenDigits data collection<br />

in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . 332<br />

D.1 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Iris<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335<br />


D.2 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Wine<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336<br />

D.3 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Glass<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338<br />

D.4 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Ionosphere<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339<br />

D.5 φ (NMI) boxplots of the self-refined consensus clustering solutions on the WDBC<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340<br />

D.6 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Balance<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341<br />

D.7 φ (NMI) boxplots of the self-refined consensus clustering solutions on the MFeat<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343<br />

D.8 φ (NMI) boxplots of the self-refined consensus clustering solutions on the miniNG<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344<br />

D.9 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Segmentation<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345<br />

D.10 φ (NMI) boxplots of the self-refined consensus clustering solutions on the BBC<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347<br />

D.11 φ (NMI) boxplots of the self-refined consensus clustering solutions on the PenDigits<br />

data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348<br />

D.12 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Iris data collection . . . . . . . . . . . . . . . . . . . . . . . . . 349<br />

D.13 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Wine data collection . . . . . . . . . . . . . . . . . . . . . . . . 350<br />

D.14 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Glass data collection . . . . . . . . . . . . . . . . . . . . . . . . 351<br />

D.15 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Ionosphere data collection . . . . . . . . . . . . . . . . . . . . . 352<br />

D.16 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the WDBC data collection . . . . . . . . . . . . . . . . . . . . . . . 353<br />

D.17 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Balance data collection . . . . . . . . . . . . . . . . . . . . . . 354<br />

D.18 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the MFeat data collection . . . . . . . . . . . . . . . . . . . . . . . 355<br />

D.19 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the miniNG data collection . . . . . . . . . . . . . . . . . . . . . . 356<br />

D.20 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the Segmentation data collection . . . . . . . . . . . . . . . . . . . 357<br />

D.21 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the BBC data collection . . . . . . . . . . . . . . . . . . . . . . . . 358<br />


D.22 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />

on the PenDigits data collection . . . . . . . . . . . . . . . . . . . . . 358<br />

E.1 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the CAL500 data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360<br />

E.2 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the CAL500 data set . 361<br />

E.3 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the CAL500 data set . 361<br />

E.4 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the CAL500 data set . . . 361<br />

E.5 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the CAL500 data set . . . . . . . . 362<br />

E.6 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . 363<br />

E.7 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . 363<br />

E.8 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . . . 364<br />

E.9 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the InternetAds data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365<br />

E.10 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the InternetAds data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365<br />

E.11 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the InternetAds data<br />

set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366<br />

E.12 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the InternetAds data set . 366<br />

E.13 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the InternetAds data set . . . . . . 367<br />

E.14 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the InternetAds data set . . . . . . . . 367<br />

E.15 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the InternetAds data set . . . . . . . . 368<br />

E.16 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the InternetAds data set . . . . . . . . . . 368<br />

E.17 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the agglo-cos-upgma algorithm on the Corel data set 369<br />


E.18 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the direct-cos-i2 algorithm on the Corel data set . . . 369<br />

E.19 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the Corel data set . . . 369<br />

E.20 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the Corel data set . . . . . 370<br />

E.21 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the agglo-cos-upgma algorithm on the Corel data set . . . . . . . . . . 370<br />

E.22 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the direct-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . 371<br />

E.23 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the graph-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . 371<br />

E.24 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />

using the rb-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . . . 372<br />

F.1 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Iris data collection . . . . . . . . . . . . . . . . . . . . 374<br />

F.2 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Wine data collection . . . . . . . . . . . . . . . . . . . 375<br />

F.3 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Glass data collection . . . . . . . . . . . . . . . . . . . 376<br />

F.4 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Ionosphere data collection . . . . . . . . . . . . . . . . 378<br />

F.5 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the WDBC data collection . . . . . . . . . . . . . . . . . . 379<br />

F.6 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the Balance data collection . . . . . . . . . . . . . . . . . . 380<br />

F.7 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the MFeat data collection . . . . . . . . . . . . . . . . . . . 381<br />

F.8 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />

functions on the miniNG data collection . . . . . . . . . . . . . . . . . . 382<br />

F.9 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the Segmentation data collection . . . . . . . . . . . . . . . 383

F.10 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the BBC data collection . . . . . . . . . . . . . . . . . . . . 384

F.11 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the PenDigits data collection . . . . . . . . . . . . . . . . . 386

xxxii


List of Algorithms<br />

6.1 Symbolic description of the soft consensus function SumConsensus . . . . . . 178<br />

6.2 Symbolic description of the soft consensus function ProductConsensus . . . . 180<br />

6.3 Symbolic description of the soft consensus function BordaConsensus . . . . . 182<br />

6.4 Symbolic description of the soft consensus function CondorcetConsensus . . 183<br />



List of symbols<br />

A: set of algorithms used for creating a cluster ensemble<br />

BE: Borda voting score matrix related to a cluster ensemble E<br />

CE: Condorcet voting score matrix related to a cluster ensemble E<br />

b: size of the mini-ensembles of a hierarchical consensus architecture<br />

c: number of executions of a consensus function in the running time estimation process<br />

Cλ: cluster co-association matrix of a clustering λ<br />

d: number of attributes used for representing an object<br />

D: set of object representation dimensionalities used for creating a cluster ensemble<br />

dfi: ith diversity factor employed in the generation of a cluster ensemble<br />

DHCA: deterministic hierarchical consensus architecture

E: (hard or soft) cluster ensemble<br />

f: number of diversity factors employed in the generation of a cluster ensemble<br />

F: consensus function<br />

γ: ground truth vector<br />

HCA: hierarchical consensus architecture

Iλ: incidence matrix of a hard clustering solution λ<br />

k: number of clusters

Ki: number of consensus processes executed in the ith stage of a hierarchical consensus<br />

architecture<br />

l: number of clusterings contained in a cluster ensemble<br />

λ: label vector resulting from a hard clustering process<br />

Λ: clustering matrix resulting from a soft clustering process<br />

λc: label vector resulting from a hard consensus clustering process<br />

Λc: clustering matrix resulting from a soft consensus clustering process<br />

m: number of modalities of a multimodal data set<br />

n: number of objects in a data set<br />

OEλ : object co-association matrix of a hard cluster ensemble<br />

OEΛ : object co-association matrix of a soft cluster ensemble<br />

Oλ: object co-association matrix of a hard clustering solution λ<br />


OΛ: object co-association matrix of a soft clustering solution Λ<br />

πλ1,λ2: cluster correspondence vector between the hard clustering solutions λ1 and λ2

πΛ1,Λ2: cluster correspondence vector between the soft clustering solutions Λ1 and Λ2

Pλ1,λ2: cluster permutation matrix between the hard clustering solutions λ1 and λ2

PΛ1,Λ2: cluster permutation matrix between the soft clustering solutions Λ1 and Λ2

ΠE: product rule voting score matrix related to a cluster ensemble E<br />

φ (ANMI) : average normalized mutual information<br />

φ (NMI) : normalized mutual information<br />

PERTDHCA: estimated running time of a parallel DHCA<br />

PRTDHCA: real running time of a parallel DHCA<br />

PERTRHCA: estimated running time of a parallel RHCA<br />

PRTRHCA: real running time of a parallel RHCA<br />

r: number of attributes of an object after a dimensionality reduction process<br />

R: set of object representations used for creating a cluster ensemble<br />

RHCA: random hierarchical consensus architecture

s: number of stages of a hierarchical consensus architecture

Sλ1,λ2: cluster similarity matrix between the hard clustering solutions λ1 and λ2

SΛ1,Λ2: cluster similarity matrix between the soft clustering solutions Λ1 and Λ2

SERTDHCA: estimated running time of a serial DHCA<br />

SRTDHCA: real running time of a serial DHCA<br />

SERTRHCA: estimated running time of a serial RHCA<br />

SRTRHCA: real running time of a serial RHCA<br />

ΣE: sum rule voting score matrix related to a cluster ensemble E<br />

w: power proportionality factor of the time complexity of a consensus function<br />

xi: d-dimensional column vector denoting the ith object contained in a data set<br />

X: d × n matrix denoting a data set<br />



Chapter 1<br />

Framework of the thesis<br />

The information and communications technologies (ICT) play a key role in the construction<br />

of the so-called global knowledge society. In fact, the increasingly rapid development of the<br />

ICT and the democratization of their use have facilitated information generation and access<br />

–a basic component of knowledge acquisition– to large segments of the population.

Possibly, the most paradigmatic example of this evolution is the World Wide Web (WWW, or “the Web” for short), which offers almost universal access to information to over 1,500 million users worldwide, an increase of 336% since the year 2000 (InternetWorldStats.com, accessed on February 2009). In parallel, this evolution has boosted the number of existing websites, which has grown exponentially to over 215 million (NetCraft.com, accessed on February 2009)—see figure 1.1. Quite obviously, this latter fact affects the quality of the information available on the Web (e.g. webpages with replicated, erroneous or even forged content are commonly found), making it sometimes difficult to separate the wheat from the chaff.

This is an example of how the development of the ICT entails two intrinsically contradictory consequences: while facilitating knowledge acquisition by making information easier to share and access, it has also fostered the generation of ever-growing amounts of

digital information, giving rise to the so-called data overload effect —a problem that started<br />

to attract the attention of researchers more than a decade ago (Fayyad, Piatetsky-Shapiro,<br />

and Smyth, 1996).<br />

Indeed, as computing technologies have become increasingly efficient in the compression,<br />

transmission, storage and manipulation of data, the real bottleneck has moved towards the<br />

user side: that is, human information interpretation capabilities are often exceeded by the<br />

sheer amount of data. This situation is often aggravated by the existence of data

inconsistencies, making knowledge extraction even more difficult.<br />

Although exemplified in the WWW context, the data overload effect is far from being<br />

exclusive to it. At a smaller scale, large amounts of scientific and social data are being<br />

generated and made either freely or commercially available. Examples of these include<br />

experimental or observational data sets in the physics, chemistry, biomedical, marketing,<br />

financial or social sciences fields, whose sizes can even exceed a terabyte (Valencia,

2002; Witten and Frank, 2005).<br />

In addition to boosting the volume of information repositories, the evolution of the<br />

ICT has also brought about an increase of data complexity (Hinneburg and Keim, 1998).<br />


Figure 1.1: Evolution of the total number of websites across all Internet<br />

domains, from November 1995 to February 2009 (extracted from<br />

http://news.netcraft.com/archives/2009/02/index.html).<br />

Resorting to the WWW example again, we have all witnessed, over the last decade, how

web pages have evolved from static plain text to dynamic multimedia contents. That is,<br />

the information available on the Web is, to a large extent, no longer restricted to a single<br />

modality (e.g. news in text format). Rather the contrary, data is increasingly becoming<br />

multimodal, i.e. a combination of several modalities (e.g. text news accompanied with<br />

photos, graphics, audio or video).<br />

This shift towards data multimodality can be regarded as a change of paradigm which<br />

is also found in many other domains (Klosgen, Zytkow, and Zyt, 2002). For instance,<br />

meteorological information often combines satellite and radar imagery with meteorological<br />

data in numerical form (e.g. temperature, humidity, wind speed, rainfall, etc.). In medical<br />

contexts, repositories often contain data obtained from several diagnostic tests (e.g.<br />

blood analysis, radiography, electrocardiography, electroencephalography, functional magnetic

resonance) whose results are represented under distinct modalities (nominal and numerical<br />

data, images, etc.).<br />

To sum up, despite providing enormous quantities of information on a silver platter, the development and expansion of the ICT pose a serious challenge to human analytic and understanding capabilities, not only because of the large volumes of data available, but also because of their

growing complexity. Therefore, it seems logical to highlight the importance of developing<br />

automatic tools that allow knowledge extraction from large multimodal data repositories,<br />

regardless of their domain (Witten and Frank, 2005). The techniques supporting these tools<br />

belong to the fields of knowledge discovery and data mining (Klosgen, Zytkow, and Zyt,<br />

2002), which constitute, in a broad sense, the frame of reference of this thesis.<br />

When it comes to extracting knowledge from a given data collection, one of the primary<br />

tasks one thinks of is organization: clearly, arranging the contents of a data repository<br />

according to some meaningful structure helps to gain some perspective on it –in fact, organizing information is one of the most innate activities involved in human learning (Anderberg,

1973). In general terms, the structures according to which objects 1 are classified are<br />

known as taxonomies, and, although their shape can vary widely (e.g. from parent-child<br />

hierarchical trees to network schemes or simple group structures), they share the common<br />

goal of facilitating knowledge extraction by imposing structure on an unstructured world

of information. Taxonomies have been proposed by experts in almost every scientific field,<br />

such as biology (e.g. the Linnaean taxonomy (Linnaeus, 1758), which settled the basis of<br />

species classification), medicine (for instance, the International Classification of Diseases of<br />

the World Health Organization (www.who.int, accessed on February 2009)) or education<br />

(from classifications of the different learning objectives and skills that educators set for<br />

students (Bloom, 1956; Anderson et al., 2001) to taxonomic models that describe the levels<br />

of increasing complexity in student’s understanding of subjects (Biggs and Collis, 1982)).<br />

When dealing with digital data, the manual creation of a taxonomy can become a very<br />

challenging and burdensome task, as it requires previous domain knowledge (which is not<br />

always available) and/or careful inspection of the whole data collection before designing<br />

the most suitable taxonomic structure. For this reason, it would be very useful to develop<br />

systems capable of organizing data in a fully automatic manner, so that no expert supervision or domain knowledge is required. If this goal were accomplished, the role of expert taxonomists would be minimized —good news given the dramatic pace at which digital

data is generated.<br />

Regardless of the taxonomic scheme’s layout, data organization criteria are typically<br />

based on analyzing the similarities between objects, grouping them according to their degree<br />

of similarity —i.e. the goal is to place dissimilar instances in separate and distant groups (or<br />

clusters), while placing similar objects in the same group (or in different but closely located<br />

clusters). This task, known as unsupervised classification or clustering, is an important

process which underlies many automated knowledge discovery processes (Fayyad, Piatetsky-<br />

Shapiro, and Smyth, 1996; Klosgen, Zytkow, and Zyt, 2002; Witten and Frank, 2005).<br />

The remainder of this chapter provides an insight on the general framework of the<br />

thesis, highlighting the importance of clustering processes as a part of automatic knowledge<br />

discovery systems, and introducing the central focus of this thesis: the robust clustering of<br />

multimodal data.<br />

1.1 Knowledge discovery and data mining<br />

The subject of knowledge discovery from data repositories has come a long way. In fact,<br />

the interest in this field emerged more than a decade ago as a response to the data overload<br />

effect (Fayyad, Piatetsky-Shapiro, and Smyth, 1996; Fayyad, 1996), when the growth of<br />

digital data repositories started to surpass human analytic capabilities. Indeed, while the<br />

analysis and understanding abilities of human analysts remain more or less the same, the<br />

seemingly ever-growing storage capacity of computers is holding back a veritable avalanche of data (Witten and Frank, 2005). Nowadays, many textbooks, journals, workshops and conferences are devoted to this scientific area, showing that it is still a very active research field (Klosgen,

Zytkow, and Zyt, 2002).<br />

1 By object, we refer to anything -animate beings, inanimate objects, places, concepts, events, properties,<br />

or relationships- liable to be classified according to some taxonomic scheme.<br />


Figure 1.2: Schematic diagram of the steps involved in the KD process (extracted from<br />

(Fayyad, 1996)).<br />

It is a commonplace that potentially useful and beneficial information patterns lie in<br />

digital data repositories awaiting analysis (Witten and Frank, 2005). However, the concept<br />

of analysis and its goals are highly dependent on the context in which it is applied (Fayyad, 1996).

Typical application scenarios can be as disparate as i) mining records of buyers’ choices<br />

for creating marketing campaigns adapted to distinct customer profiles (Witten and Frank,<br />

2005), ii) analyzing credit card transactions history of bank customers so as to detect possible<br />

fraudulent operations from unauthorized users (Fayyad, 1996) or iii) locating and<br />

cataloging geologic objects of interest in remotely sensed images of planets or asteroids<br />

(Fayyad, Piatetsky-Shapiro, and Smyth, 1996).<br />

Thus, be it either economic or scientific, there exists a great interest in replacing (or, at<br />

least, augmenting) human analytic capabilities by computer-based means. The field of computer<br />

science devoted to the extraction of useful patterns from data has been given different<br />

names in the literature, such as information discovery, information harvesting or data archaeology<br />

(Fayyad, Piatetsky-Shapiro, and Smyth, 1996), with knowledge discovery2 (KD) and data mining (DM) being the two most common denominations.

However, the use of KD and DM as synonymous concepts has been a matter of dispute in<br />

the research community (Klosgen, Zytkow, and Zyt, 2002): while deemed equivalent by some<br />

authors (Witten and Frank, 2005), others refer to KD as the whole process of extracting<br />

knowledge from data, defining DM as the central constituting step of KD processes (Fayyad,<br />

Piatetsky-Shapiro, and Smyth, 1996), as depicted in figure 1.2.<br />

According to this latter standpoint (to which we adhere in this thesis), KD is defined<br />

as the ‘non-trivial process of identifying valid, novel, useful and ultimately understandable<br />

patterns in data’, whereas DM is ‘the application of specific algorithms for extracting patterns<br />

from data’ (Fayyad, Piatetsky-Shapiro, and Smyth, 1996). By ‘extracting patterns<br />

patterns from data’ we refer to making any high-level description of a set of data, e.g. fitting<br />

a model to data or finding structure from it (Fayyad, Piatetsky-Shapiro, and Smyth,<br />

1996). Thus, according to this point of view, KD and DM constitute what could be called the general and specific frames of reference of this work, respectively. Due to its generic definition, KD is a crossroads of several disciplines, and, as such, it attracts researchers and practitioners from the fields of statistics, machine learning, pattern recognition, information retrieval, or visualization, to name a few (Fayyad, 1996).

2 Although this discipline was originally named KDD —for Knowledge Discovery in Databases (Piatetsky-Shapiro, 1991)— in this work we assume that operations are conducted on a flat file extracted from the database, i.e. we remove the second D in KDD and focus on the knowledge discovery process.

As shown in the flow diagram presented in figure 1.2, the KD process can be regarded<br />

as a succession of five steps, namely: selection, preprocessing, transformation, data mining<br />

and interpretation/evaluation. That is, extracting knowledge from a given data set can be<br />

regarded as a multistage, interactive and iterative process (Brachman and Anand, 1996), as<br />

the user evaluation of the extracted patterns can lead to a re-execution of any of the stages<br />

(as denoted by the dashed arrows in figure 1.2). However, depending on the nature of the<br />

data and the problem at hand, the first two steps may even be skipped. The following

paragraphs present a brief description of each of these five phases.<br />

For starters, the target data set that will be subject to the KD process is created in the<br />

selection phase. This typically implies selecting a subset of the available objects in the<br />

database, although, in some cases, no selection is conducted, and all the data items in the<br />

repository are included in the target data set.<br />

Optionally, this stage can also consider representing the objects upon a subset of the<br />

variables (aka attributes or features) that constitute them. In general terms, these variables<br />

can either be numerical or nominal. In the former case, attributes usually represent the<br />

value of a quantitatively measurable magnitude (e.g. temperature, altitude or population).<br />

In the latter case, features can only take one of a predefined set of categorical values, such<br />

as the outlook feature in the classic weather data repository, i.e. outlook = {overcast,<br />

sunny, rainy} (Witten and Frank, 2005). In this work, all objects will be represented by<br />

means of a set of d numeric attributes gathered into real-valued d-dimensional vectors.<br />

Secondly, a data preprocessing step is carried out, which includes data parameterization,<br />

noise and/or outliers removal or missing data fields handling, among others.<br />

Thirdly, the objects in the data set are subject to a transformation process, which<br />

basically consists of finding useful features for representing the objects according to the<br />

goals of the overall KD process. In general terms, this step aims to decrease the number

of variables used for representing the objects (dimensionality reduction) so as to improve<br />

the results and/or the computational complexity of the data mining step of the KD process.<br />

The reasoning underlying dimensionality reduction is based on the fact that the original<br />

data representation is often redundant, e.g. there may exist high levels of correlation between<br />

several variables, or the values of some features may be so small that they are almost<br />

irrelevant (Carreira-Perpiñán, 1997). Moreover, when the number of variables associated<br />

with each object is too high, the scalability of KD systems is negatively affected, as the<br />

time complexity of the DM stage is usually proportional to the number of attributes (Yang<br />

and Olafsson, 2005). Furthermore, many methods suffer performance breakdowns when the<br />

dimensionality of the feature space is very high (Fodor, 2002).<br />

There exist two main strategies to conduct dimensionality reduction: feature selection<br />

and feature extraction. In feature selection, the reduced feature set is a subset of the original<br />

object attributes, whereas in feature extraction, the original variables are transformed into<br />

a set of new features, typically obtained as combinations of the original ones. For an insight<br />

on feature selection and extraction techniques, the reader is referred to (Molina, Belanche,<br />

and Nebot, 2002; Dy and Brodley, 2004) and (Fodor, 2002), respectively.<br />
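To make the distinction concrete, the following minimal sketch (Python with NumPy and scikit-learn, assumed here purely for illustration and not necessarily the tools used in the experiments of this thesis) contrasts a naive variance-based feature selection with PCA-based feature extraction on a made-up data matrix:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))   # hypothetical data set: 100 objects, d = 20 attributes (rows are objects, library convention)
    r = 5                            # reduced dimensionality

    # Feature selection: keep a subset of the original attributes,
    # here simply the r attributes with the largest variance.
    selected = np.argsort(X.var(axis=0))[-r:]
    X_selected = X[:, selected]      # still expressed in the original attributes

    # Feature extraction: build r new features as linear combinations
    # of the original ones, e.g. via principal component analysis.
    X_extracted = PCA(n_components=r).fit_transform(X)

    print(X_selected.shape, X_extracted.shape)   # both (100, 5)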


[Figure 1.3 depicts a tree of data mining methods: Verification-oriented methods (goodness of fit, hypothesis testing, analysis of variance) and Discovery-oriented methods, the latter divided into Prediction (classification, regression) and Description (clustering, summarization).]

Figure 1.3: Taxonomy of data mining methods (adapted from (Maimon and Rokach, 2005)).

Next, depending on the goals of the KD process, a suitable data mining task must<br />

be chosen. According to the taxonomy presented in figure 1.3, data mining methods can<br />

be classified into two main groups: verification-oriented and discovery-oriented (Fayyad,<br />

Piatetsky-Shapiro, and Smyth, 1996). In verification-oriented DM, the role of the system<br />

is to evaluate a hypothesis proposed by the user, and this task is usually accomplished

by means of traditional statistical methods. In discovery-oriented data mining, the goal of<br />

the system is to discover useful patterns in the data. In this work, we focus on this latter<br />

family of DM tasks.<br />

Among discovery-oriented data mining methods, one can distinguish between prediction<br />

and description DM tasks. In prediction tasks, the goal of the system is to build<br />

a behavioral model upon the data, whereas in description tasks, the system aims to find<br />

human-understandable patterns that facilitate knowledge extraction from data.<br />

According to figure 1.3, there exist two main prediction DM tasks: classification and<br />

regression. In classification problems, the goal is to learn a mapping between the categories<br />

in a known taxonomic scheme and a set of pre-classified objects, so that any unseen object<br />

can be categorized into any of these predefined classes. The aim of regression tasks is to<br />

learn a function that maps unseen data objects to a real-valued prediction variable.<br />

As aforementioned, description-oriented DM methods focus on finding understandable<br />

representations of the underlying structure of the data (Maimon and Rokach, 2005). One of<br />

the most common descriptive DM tasks is clustering, which consists of identifying a finite<br />

set of categories to describe the data with no previous knowledge (i.e. deriving a taxonomy<br />

solely from the data). Another description-oriented data mining task is summarization,<br />

whose aim is to find a compact description for a subset of data. To do so, summarization<br />

techniques often make use of multivariate visualization methods.<br />

Once the data mining task that fits the goals of the KD process is identified, there<br />

comes the time to select the specific data mining algorithm to be applied. This selection<br />

must take into account not only which models and parameters are the most appropriate<br />

from an algorithmic viewpoint, but also the desired level of accuracy, utility, and intelligibility of the descriptions of the structural patterns of the data (Fayyad, 1996). This

latter issue is of paramount importance with a view to the final stage of the KD process —<br />

evaluation/interpretation—, which often involves visualizing the mined patterns and/or<br />

the data according to these. As mentioned earlier, depending on the user’s evaluation of the<br />

extracted patterns, it may be necessary to re-execute any of the previous steps for further<br />

refinement of the KD process (Halkidi, Batistakis, and Vazirgiannis, 2002a).<br />
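Purely as an illustration of how the five phases chain together (and not as a description of the actual systems developed in this thesis), the following Python sketch strings them into a minimal pipeline; the random data, the choice of k-means as the data mining step, and every parameter are hypothetical placeholders:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def select(raw, n_objects=200):
        # Selection: build the target data set (here, a random subsample of the repository).
        idx = np.random.default_rng(0).choice(len(raw), size=n_objects, replace=False)
        return raw[idx]

    def preprocess(X):
        # Preprocessing: e.g. attribute standardization (noise/outlier handling omitted).
        return StandardScaler().fit_transform(X)

    def transform(X, r=5):
        # Transformation: dimensionality reduction down to r features.
        return PCA(n_components=r).fit_transform(X)

    def mine(X, k=3):
        # Data mining: here, a clustering task.
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)

    def evaluate(X, labels):
        # Interpretation/evaluation: a numeric proxy for cluster quality; an
        # unsatisfactory value may trigger re-execution of earlier steps.
        return silhouette_score(X, labels)

    raw = np.random.default_rng(1).normal(size=(1000, 20))   # stand-in for a data repository
    X = transform(preprocess(select(raw)))
    labels = mine(X)
    print("silhouette:", evaluate(X, labels))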

As the reader may have observed, the user must make several critical decisions at every<br />

step of the knowledge discovery process (Fayyad, Piatetsky-Shapiro, and Smyth, 1996).<br />

By critical we mean that a wrong choice in any of the intermediate stages could lead to the extraction of barely meaningful patterns, and, as a consequence, the objectives of the

whole KD process would not be reached. This issue becomes especially tricky when the DM<br />

stage is based on unsupervised learning techniques, as when, for instance, clustering is the<br />

data mining task selected for the DM phase of the KD process. In this situation, the user’s<br />

decisions are usually made blindly, which may result in an unsatisfactory evaluation, thus

requiring a (blind) re-execution of one or several of the phases of the KD process —which,<br />

in the worst-case scenario, can end in a tedious iterative loop while groping for the right<br />

decisions.<br />

This thesis is focused on the discovery and description-oriented data mining task of<br />

clustering, placing special emphasis on its application on multimodal data. In particular,<br />

one of our main goals is the design of clustering systems robust to the indeterminacies<br />

induced by the fact that many decisions surrounding clustering processes must be made<br />

blindly (e.g. which features should be used, which clustering algorithm should be applied).<br />

In the next section, we present a by no means exhaustive overview of several fundamental<br />

aspects of the clustering problem, which will lead to a description of the key problems<br />

addressed in this thesis.<br />

1.2 Clustering in knowledge discovery and data mining<br />

Clustering can be defined as the process of separating a finite unlabeled data set into a<br />

finite and discrete set of natural clusters based on similarity (Xu and Wunsch II, 2005;<br />

Jain, Murty, and Flynn, 1999).<br />

After a successful clustering process, the presumably high number of objects contained in<br />

the data set can be represented by means of a comparatively smaller number of meaningful<br />

clusters, which necessarily implies a loss of certain fine details, but yields a simplified cluster-based

data model (Berkhin, 2002). In other words, clustering is a description-oriented data<br />

mining task, as the obtained clusters should somehow reflect the mechanisms that cause<br />

some objects to be more similar to one another than to the remaining ones (Witten and<br />

Frank, 2005).<br />

It is important to notice that clustering is a non-supervised classification task, as the<br />

objects in the data set are unlabeled (i.e. there is no prior knowledge about how they should<br />

be grouped). This fact marks a clear difference between clustering and the prediction-oriented

task of supervised classification (see figure 1.3). In this latter case, we are provided<br />

with a collection of labeled (i.e. pre-classified) objects so as to learn the descriptions of<br />

classes, which in turn are used to categorize new data items. In clustering, objects are also<br />

assigned labels, but these are data driven —that is, cluster labels are obtained solely from<br />


the data, not provided by an external source (Jain, Murty, and Flynn, 1999).<br />

Being such a generic task, clustering has found applications in multiple research fields,<br />

among which we can name the following few:<br />

– information retrieval, where clustering has been applied for organizing the results<br />

returned by a search engine in response to a users query (i.e. post-retrieval clustering)<br />

(Tombros, Villa, and van Rijsbergen, 2002; Hearst, 2006), for refining ambiguous<br />

queries input to retrieval systems (Käki, 2005), or for improving their performance<br />

(van Rijsbergen, 1979).<br />

– text mining, where browsing through large document collections is simplified if they<br />

are previously clustered (Cutting et al., 1992; Steinbach, Karypis, and Kumar, 2004;<br />

Dhillon, Fan, and Guan, 2001).<br />

– computational genomics, where clustering of gene expression data from DNA microarray<br />

experiments is applied for identifying the functionality of genes, finding out what<br />

genes are co-regulated or distinguishing the important genes between abnormal and<br />

normal tissues (Zhao and Karypis, 2003a; Jiang, Tang, and Zhang, 2004).<br />

– economics, where clustering economic and financial time series can be employed for

identifying i) areas or sectors for policy-making purposes, ii) structural similarities<br />

in economic processes for economic forecasting, iii) stable dependencies for risk management<br />

and investment management (Focardi, 2001), or iv) customer profiles and<br />

customers-products relationships (Liu and Luo, 2005).<br />

– computer vision, where clustering is applied for common procedures such as image<br />

preprocessing (Jain, 1996), segmentation (Mancas-Thillou and Gosselin, 2007) and<br />

matching (Miyajima and Ralescu, 1993).<br />

Regardless of the application, the desired result of any clustering process is a maximally<br />

representative partition of the data set, which usually corresponds to clusters with high<br />

intra-cluster and low inter-cluster object similarities. In the quest for this goal, a myriad<br />

of clustering methods have been proposed. With no claim of being exhaustive, the next<br />

section presents an overview of some of the most relevant clustering methods, highlighting<br />

some important concepts in this context.<br />
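As a small numerical illustration of this intra-/inter-cluster criterion (the two artificial clusters and the plain Euclidean distances below are invented for the example and are not part of the evaluation methodology used later), a good partition should exhibit small within-cluster and large between-cluster distances:

    import numpy as np

    def mean_pairwise_dist(A, B):
        # Mean Euclidean distance between all pairs of objects drawn from A and B.
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).mean()

    rng = np.random.default_rng(0)
    # Two well-separated artificial clusters of two-dimensional objects.
    c1 = rng.normal(loc=0.0, scale=0.1, size=(10, 2))
    c2 = rng.normal(loc=1.0, scale=0.1, size=(10, 2))

    intra = (mean_pairwise_dist(c1, c1) + mean_pairwise_dist(c2, c2)) / 2
    inter = mean_pairwise_dist(c1, c2)
    print(intra < inter)   # True for a good partition: low intra-cluster, high inter-cluster distance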

1.2.1 Overview of clustering methods<br />

Several excellent and extensive surveys on clustering can be found in the literature (Jain,<br />

Murty, and Flynn, 1999; Berkhin, 2002; Kotsiantis and Pintelas, 2004; Xu and Wunsch II,<br />

2005). Providing a detailed description of the existing clustering methods lies beyond the<br />

scope of this work, so the reader interested in their ins and outs is referred to the previously<br />

cited surveys and references therein. However, due to the central role of clustering processes<br />

in this thesis, this section presents a brief description of several key issues in this context,<br />

such as:<br />

1. A categorization of clustering algorithms according to generic criteria.<br />

2. A brief discussion on similarity measures, one of the central notions in clustering.<br />


3. An outline of the foundations of several representative clustering methods.<br />

For starters, let us introduce a few notational conventions which will be employed<br />

throughout this work:<br />

– in general terms, any object in a data set will be represented by means of a d-dimensional column vector x = [x1 x2 ... xd]^T (where T denotes transposition). As mentioned in section 1.1, in this work we consider only object representations based on numerical attributes, so that each feature xi is a real number and, hence, x ∈ R^d.

– so as to refer to a particular data item (e.g. the ith object in the data repository) we will use the notation xi = [xi1 xi2 ... xid]^T.

– a data set is defined as a compilation of n objects, and is mathematically represented by a d × n matrix X = [x1 x2 ... xn].

– the number of clusters into which the objects will be partitioned is denoted as k. Thus, each cluster resulting from the clustering process will be assigned an integer-valued label in the range [1, ..., k] (a short code sketch of these conventions is given after this list).
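A minimal sketch of these notational conventions in code (NumPy is assumed here only for illustration; the dimensions and labels are arbitrary):

    import numpy as np

    d, n, k = 2, 9, 3                                # attributes, objects, clusters
    X = np.random.default_rng(0).random((d, n))      # data set: a d x n matrix, one object per column
    x3 = X[:, 2]                                     # the 3rd object, a d-dimensional column vector
    labels = np.array([2, 2, 2, 1, 1, 1, 3, 3, 3])   # a hard label vector with one entry per object
    assert labels.shape == (n,) and labels.min() >= 1 and labels.max() <= k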

Categorization of clustering methods<br />

Without taking into account their theoretical foundations, there exist many ways of classifying<br />

clustering algorithms (Jain, Murty, and Flynn, 1999). However, they can be categorized according to two universal criteria: i) the structure of the

final clustering solution, and ii) the overlap of the obtained clusters.<br />

On one hand, if clustering algorithms are analyzed in terms of the structure of the clustering<br />

solution, one can distinguish between partitional and hierarchical clustering methods.<br />

1. Partitional clustering algorithms: the clustering solution output by this type of algorithms<br />

corresponds to the most intuitive notion of clustering: a single partition of the<br />

objects in the data set into the desired number of clusters k.<br />

2. Hierarchical clustering algorithms: the structure of the clustering solution is a tree<br />

of clusters (Kotsiantis and Pintelas, 2004), which is built sequentially. Depending on<br />

whether this tree is constructed bottom-up or top-down (i.e. from n singleton clusters<br />

to a sole cluster or vice versa), hierarchical algorithms are called agglomerative or divisive.<br />

In either case, the algorithm decides, at each level of the tree, which two clusters<br />

should be merged (if agglomerative) or split (if divisive) depending on their degree of<br />

similarity, which is typically measured according to either the minimum, maximum<br />

or average of the distances between all pairs of objects drawn from both clusters,<br />

giving rise to the single-link, complete-link or average-link criteria (Jain, Murty, and<br />

Flynn, 1999). The clustering solution structure yielded by hierarchical algorithms is<br />

usually represented by means of a binary tree of clusters or dendrogram, an example of

which is depicted in figure 1.4. Although hierarchical clustering algorithms typically<br />

compute the complete hierarchy of clusters 3 , a single partition can be obtained by<br />

3 As a consequence, when compared to their partitional counterparts, hierarchical clustering algorithms<br />

tend to be more computationally demanding (Jain, Murty, and Flynn, 1999).<br />


0.6<br />

0.4<br />

0.2<br />

0<br />

1<br />

2<br />

3<br />

−0.2<br />

−0.2 0 0.2 0.4 0.6<br />

(a) Scatterplot of the data of this<br />

toy two-dimensional data set.<br />

4<br />

7<br />

5<br />

9<br />

6<br />

8<br />

Euclidean distance<br />

0.35<br />

0.3<br />

0.25<br />

0.2<br />

0.15<br />

0.1<br />

7 9 8 4 5 6 1 2 3<br />

object index<br />

(b) Dendrogram obtained by the<br />

single-link hierarchical algorithm.<br />

Figure 1.4: A hierarchical clustering toy example: (a) Scatterplot of an artificially generated<br />

two-dimensional data set containing n = 9 objects, each one of them is identified by a<br />

numerical label. (b) Dendrogram resulting of applying the single-link hierarchical agglomerative<br />

clustering algorithm on this data, using the Euclidean distance as the similarity<br />

measure. The dashed horizontal line performs a cut on the dendrogram, yielding a 4-way<br />

partition with an Euclidean distance between clusters ranging between 0.112 and 0.255.<br />

performing a cut through the dendrogram at the desired level of cluster similarity or<br />

by setting the desired number of clusters k, as shown by the dashed horizontal line in<br />

figure 1.4(b).<br />
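To make the dendrogram cut concrete, the following sketch (assumed, not part of the thesis; data values and parameters are hypothetical) reproduces the idea on a small artificial data set using SciPy.

```python
# Single-link agglomerative clustering on a toy 2-D data set, followed by two ways
# of cutting the resulting dendrogram: by number of clusters and by distance level.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (0.5, 0.0), (0.25, 0.5)]
X = np.vstack([rng.normal(c, 0.05, size=(3, 2)) for c in centers])  # n = 9 objects

Z = linkage(X, method="single", metric="euclidean")    # bottom-up merge tree (dendrogram)
labels_k = fcluster(Z, t=3, criterion="maxclust")       # cut so that k = 3 clusters remain
labels_d = fcluster(Z, t=0.2, criterion="distance")     # cut at a given inter-cluster distance
```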

On the other hand, the overlap of the clusters into which objects are grouped is an additional factor which allows splitting clustering algorithms into two large categories: hard and soft algorithms.

1. Hard (aka crisp) clustering algorithms: this type of algorithm partitions the data set into k disjoint clusters, i.e. each object is assigned to one and only one cluster. In mathematical terms, the result of a hard clustering process on the data set X is an n-dimensional integer-valued row label vector λ = [λ_1 λ_2 ... λ_n], where λ_i ∈ {1, 2, ..., k}, ∀i ∈ [1, n]. That is, the ith component of the label vector (or labeling for short) contains the label of the cluster the ith object in the data set is assigned to. For instance, the label vector obtained after applying a classic hard clustering algorithm such as k-means on the artificial toy data set depicted in figure 1.4(a), setting k = 3, is λ = [2 2 2 1 1 1 3 3 3]. Notice the symbolic nature of the cluster labels, as the same clustering result would be represented by permuted label vectors such as λ = [1 1 1 2 2 2 3 3 3] or λ = [3 3 3 1 1 1 2 2 2].

2. Soft (aka fuzzy) clustering algorithms: they generate a set of k overlapping clusters, i.e. each object is associated to each of the k clusters to a certain degree. Hence, the result of conducting a soft clustering process on the data set X is a k × n real-valued clustering matrix Λ, whose (i,j)th entry indicates the degree of association between the ith cluster and the jth object. This degree of association is typically expressed in terms of the probability of membership of each object to each cluster, as done by the well-known fuzzy c-means (FCM) soft clustering algorithm. Continuing with the toy two-dimensional data set presented in figure 1.4(a), the membership probability matrix yielded by the FCM clustering algorithm (with k = 3) is the following:

Λ = [ 0.0544 0.0418 0.0572 0.0254 0.0192 0.0301 0.9764 0.9285 0.9723
      0.0252 0.0258 0.0375 0.9688 0.9758 0.9586 0.0144 0.0554 0.0173
      0.9205 0.9324 0.9053 0.0059 0.0050 0.0114 0.0092 0.0160 0.0104 ]

It is easy to see that permuting the rows of matrix Λ would not alter the clustering results, as cluster identifiers are symbolic. Moreover, notice that a clustering matrix Λ can always be transformed into a label vector λ by simply assigning each object to the cluster it is most strongly associated with (e.g. the cluster with the highest membership probability).
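The conversion from a clustering matrix Λ to a label vector λ amounts to a column-wise argmax, as the sketch below illustrates (an assumed illustration, not taken from the thesis; the matrix values are arbitrary placeholders rather than actual FCM output).

```python
# Turning a soft clustering matrix (k x n membership degrees) into a hard label
# vector by assigning each object to its most strongly associated cluster.
import numpy as np

Lambda = np.array([[0.05, 0.04, 0.06, 0.98, 0.93],
                   [0.90, 0.92, 0.91, 0.01, 0.02],
                   [0.05, 0.04, 0.03, 0.01, 0.05]])   # hypothetical k = 3, n = 5 matrix

lam = Lambda.argmax(axis=0) + 1   # +1 so that labels lie in {1, ..., k}, as in the text
print(lam)                        # [2 2 2 1 1]
```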

Summarizing, the structure of the clustering solution and the overlap of the resulting clusters define a two-dimensional frame of reference which allows categorizing clustering algorithms in a broad sense, i.e. without resorting to their theoretical foundations.

Distance and similarity measures

According to the definition of clustering presented at the beginning of section 1.2, measuring the resemblance between the objects in the data set is central to clustering processes. For this reason, the next paragraphs are devoted to a brief description of several ways of measuring similarity. More specifically, we focus solely on measures for computing the resemblance between objects under a numeric feature representation, although there exist equivalent measures for comparing objects represented by means of ordinal or nominal attributes (Xu and Wunsch II, 2005; Jain, Murty, and Flynn, 1999).

Hence, let us consider two objects in the data set represented as vectors in an R^d space, namely x_i = [x_i1 x_i2 ... x_id]^T and x_j = [x_j1 x_j2 ... x_jd]^T.

There exist two complementary ways of comparing x_i and x_j: i) measuring their degree of similarity, denoted as S(x_i, x_j), or ii) measuring the distance between them, i.e. D(x_i, x_j). Although when dealing with objects represented by numerical features it is more usual to measure the distance between them than their similarity (Jain, Murty, and Flynn, 1999; Xu and Wunsch II, 2005), both types of measures will be described next—furthermore, there are multiple ways of transforming a similarity measure into a distance and vice versa (Fenty, 2004).

1. Distance measures

– Minkowski distance: properly speaking, it is a family of distances. In general terms, the Minkowski distance of order n is defined according to equation (1.1):

D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^n \right)^{1/n}    (1.1)


– Euclidean distance: possibly the most widely used metric, it is obtained as the particularization of the Minkowski metric for n = 2:

D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^2 \right)^{1/2}    (1.2)

This is the distance measure used in the most classic implementation of the k-means clustering algorithm, and it tends to form hyperspherical clusters (Xu and Wunsch II, 2005).

– Manhattan distance: also known as city block distance, it is defined as a particular case of the Minkowski metric for n = 1 (see equation (1.3)), and it tends to create hyperrectangular clusters (Xu and Wunsch II, 2005).

D(x_i, x_j) = \sum_{l=1}^{d} |x_{il} - x_{jl}|    (1.3)

– Mahalanobis distance: it can be regarded as a modified version of the Euclidean distance that takes into account the covariance among the attributes. It is defined as follows:

D(x_i, x_j) = (x_i - x_j)^T S^{-1} (x_i - x_j)    (1.4)

where S is the sample covariance matrix computed over all the data set (Jain, Murty, and Flynn, 1999). Algorithms using this distance tend to create hyperellipsoidal clusters (Xu and Wunsch II, 2005).

2. Similarity measures

– Cosine similarity: it consists of measuring the angle comprised between the vectors representing the objects and, as such, it does not depend on their length. It is defined as follows:

S(x_i, x_j) = \frac{x_i^T x_j}{||x_i|| \, ||x_j||}    (1.5)

where ||·|| denotes vector norm.

– Pearson correlation coefficient: a classic concept in the probability theory and statistics fields, correlation measures the strength and direction of the linear relationship between vectors x_i and x_j. The most widely used correlation index is the Pearson correlation coefficient, which is defined in equation (1.6):

S(x_i, x_j) = \frac{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)^2} \, \sqrt{\sum_{l=1}^{d} (x_{jl} - \bar{x}_j)^2}}    (1.6)

where x_{il} is the lth component of vector x_i, and \bar{x}_i denotes its sample mean.


– Extended Jaccard coefficient: whereas the cosine and Pearson correlation coefficient similarity measures consider that vectors x_i and x_j are similar if they point in the same direction (i.e. they have roughly the same set of features and in the same proportion), the extended Jaccard coefficient –defined in equation (1.7)– accounts both for the angle and the magnitude of the vectors.

S(x_i, x_j) = \frac{x_i^T x_j}{||x_i||^2 + ||x_j||^2 - x_i^T x_j}    (1.7)

For further insight on these and other distance and similarity measures, their properties and other characteristics, see (Duda, Hart, and Stork, 2001; Xu and Wunsch II, 2005).
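As a compact reference, the sketch below (assumed, not part of the thesis) computes the distance and similarity measures described above for a pair of hypothetical vectors, relying on SciPy for the distances and on direct NumPy expressions for the similarities.

```python
import numpy as np
from scipy.spatial import distance

xi = np.array([0.1, 0.4, 0.3])
xj = np.array([0.2, 0.1, 0.5])
X = np.random.default_rng(0).random((50, 3))        # data set used only to estimate S

# Distance measures, equations (1.1)-(1.4)
d_mink = distance.minkowski(xi, xj, p=3)            # Minkowski distance of order n = 3
d_eucl = distance.euclidean(xi, xj)                 # Euclidean distance
d_manh = distance.cityblock(xi, xj)                 # Manhattan (city block) distance
S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # inverse sample covariance matrix
d_maha = distance.mahalanobis(xi, xj, S_inv) ** 2   # squared, since SciPy returns the square root of eq. (1.4)

# Similarity measures, equations (1.5)-(1.7)
s_cos = xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))
ci, cj = xi - xi.mean(), xj - xj.mean()
s_pears = ci @ cj / np.sqrt((ci @ ci) * (cj @ cj))
s_jacc = xi @ xj / (xi @ xi + xj @ xj - xi @ xj)    # extended Jaccard coefficient
```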

Approaches to clustering

As aforementioned, multiple clustering algorithms with fairly different foundations can be found in the literature. Our interest here is to present a brief description of the most well-known theoretical approaches clustering algorithms are based on, enumerating some specific implementations emanated from them.

1. Square error based clustering: in this approach, the goal is minimizing the sum of squared distances between the objects and the centroid (i.e. the center of gravity) of the cluster they are assigned to. This minimization process is usually executed iteratively, as in the case of the k-means clustering algorithm, which is probably the most representative example of this type of algorithm (Forgy, 1965) (a brief illustrative sketch of this and some of the following approaches is given after this enumeration). A slight variant of the k-means algorithm is k-medoids, where clusters are represented by one of their member objects instead of by their centroids, which makes it more robust to outliers (Kaufman and Rousseeuw, 1990). Moreover, techniques that allow splitting and merging the resulting clusters have also been developed. An example is the ISODATA algorithm (Ball and Hall, 1965), which divides high variance clusters and joins close clusters—quite obviously, setting the thresholds that govern the cluster merging and splitting decisions is a key issue in ISODATA.

2. Mixture densities based clustering: this approach follows a probabilistic perspective, as it assumes that the objects in the data set have been generated according to several probability distributions—typically one per cluster. Finding the clusters boils down to making assumptions on these probability distributions (they are often assumed to be Gaussian) and estimating the parameters of the underlying models, usually following a maximum likelihood approach (Jain, Murty, and Flynn, 1999). The iterative optimization of the maximum likelihood criterion using the expectation-maximization (EM) algorithm has given rise to the most popular clustering algorithm based on mixture densities, EM clustering (McLachlan and Krishnan, 1997). Other algorithms based on this kind of approach are AutoClass (which extends mixture densities to Poisson, Bernoulli and log-normal probability distributions) (Cheeseman and Stutz, 1996), or the SNOB algorithm, which uses a mixture model in conjunction with the minimum message length principle (Wallace and Dowe, 1994).

3. Graph based clustering: this type of algorithm applies the concepts and properties of graph theory to the clustering problem, mapping data objects onto the nodes of a weighted graph whose edges reflect the similarity between each pair of objects. This makes graph based clustering conceptually similar to hierarchical clustering (Jain and Dubes, 1988), a good example of which is the Chameleon hierarchical clustering algorithm, which is based on the k-nearest neighbour graph (Karypis, Han, and Kumar, 1999). Other (non-hierarchical) graph based clustering algorithms are i) Zahn's algorithm (Zahn, 1971), which is based on discarding the edges with the largest lengths on the minimal spanning tree so as to create clusters, ii) CLICK, based on computing the minimum weight cut to form clusters (Sharan and Shamir, 2000), and iii) MajorClust, which is based on the weighted partial connectivity of the graph, a measure whose maximization implicitly determines the optimum number of clusters to be found (Stein and Niggemann, 1999). Lastly, the more recent family of spectral clustering algorithms can be included in the context of graph based clustering. These algorithms are often reported to outperform traditional clustering techniques (von Luxburg, 2006). In addition, spectral clustering is simple to implement and can be solved efficiently by standard linear algebra methods, as, in short, it boils down to computing the eigenvectors of the Laplacian matrix of the graph. Several variants of spectral clustering algorithms have been proposed, differing in the way the similarity graph and the Laplacian matrix are computed (e.g. (Shi and Malik, 2000; Ng, Jordan, and Weiss, 2002)).

4. Clustering based on combinatorial search: this approach is based on considering clustering as a combinatorial optimization problem that can be solved by applying search techniques for finding the global (or approximate global) optimum clustering solution. In this context, two paradigms have been followed in the design of clustering algorithms: stochastic optimization methods and deterministic search techniques. Among the former, some of the most popular approaches are based on evolutionary computation (e.g. hard or soft clustering based on genetic algorithms (Hall, Özyurt, and Bezdek, 1999; Tseng and Yang, 2001)), simulated annealing (e.g. (Selim and Al-Sultan, 1991)), Tabu search (Al-Sultan, 1995) and hybrid solutions (Chu and Roddick, 2000; Scott, Clark, and Pham, 2001), whereas deterministic annealing is the most typical deterministic search technique applied to clustering (Hofmann and Buhmann, 1997).

5. Clustering based on neural networks: the well-known learning and modelling abilities of neural networks have been exploited to solve clustering problems. The two most successful neural network paradigms applied to clustering are i) competitive learning, where Self-Organizing Maps (Kohonen, 1990) and Generalized Learning Vector Quantization (Karayiannis et al., 1996) play a salient role, and ii) adaptive resonance theory (Carpenter and Grossberg, 1987), which encompasses a whole family of neural network architectures that can be used for hierarchical (Wunsch et al., 1993) and soft (Carpenter, Grossberg, and Rosen, 1991) clustering.

6. Kernel based clustering: the rationale of kernel-based learning algorithms is simplifying the task of separating the objects in the data set by nonlinearly transforming them into a higher-dimensional feature space. Through the design of an inner-product kernel, the time-consuming and sometimes even infeasible process of explicitly describing the nonlinear mapping and computing the corresponding data points in the transformed space can be avoided (Xu and Wunsch II, 2005). A recent example of this approach is Support Vector Clustering (SVC) (Ben-Hur et al., 2001), which employs the radial basis function as its kernel and is capable of forming either agglomerative or divisive hierarchical clusters. Moreover, SVC can be further extended to allow for fuzzy membership (Chiang and Hao, 2003). Kernel-based clustering algorithms have many advantages, such as the ability to form arbitrary clustering shapes or to deal with noise and outliers (Xu and Wunsch II, 2005).

7. Density based clustering: this type of clustering algorithm forms clusters based on the density of objects in a region of the feature space, so that the neighborhood of each object in a cluster must contain a minimum number of objects. This principle allows the growth of clusters in any direction, thus being able to discover arbitrarily shaped clusters, besides providing a natural protection against outliers. There are two main approaches to density-based clustering, differing in the way density is computed: the first class of algorithms bases the computation of density on the objects in the data set (such as the DBSCAN algorithm (Ester et al., 1996)). In contrast, the second type of density-based clustering strategies creates an analytical model of the density over all the feature space, using influence functions that describe the impact of each data object within its neighbourhood, thus identifying clusters by determining the maxima of the overall density function—as the DENCLUE algorithm does (Hinneburg and Keim, 1998). More recently, another density-based clustering algorithm called Shared Nearest Neighbors (Ertz, Steinbach, and Kumar, 2003) has been proposed: it measures object similarity based on the number of common nearest neighbours of each pair of objects, thus identifying core points around which clusters are grown.

8. Grid based clustering: in this approach, the clustering space is quantized into a finite number of hyperrectangular cells. Those cells containing a number of objects over a predetermined threshold are connected to form the clusters. Possibly, the three most well-known clustering algorithms based on this approach are STING (Wang, Yang, and Muntz, 1997), WaveCluster (Sheikholeslami, Chatterjee, and Zhang, 1998) and CLIQUE (Agrawal et al., 1998). The main difference between them is the cell generation procedure. In STING, the feature space is divided into several levels, thus forming a hierarchical cell structure. In contrast, CLIQUE follows a recursive process for generating (k+1)-dimensional dense cells by associating dense k-dimensional cells, starting with k = 1. In turn, WaveCluster follows a fairly different approach, as it applies the wavelet transform to map the original feature space into a frequency space where the natural clusters in the data become distinguishable (i.e. cells are somehow defined on the transformed space).
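As announced in the first item of the enumeration above, the following sketch (assumed, not part of the thesis) illustrates how four of these approaches (square error based, mixture density based, graph/spectral and density based clustering) can be run on the same hypothetical data set with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).random((100, 2))   # hypothetical data set: n = 100 objects in R^2
k = 3                                           # desired number of clusters

lam_kmeans = KMeans(n_clusters=k, n_init=10).fit_predict(X)                 # square error based
gmm = GaussianMixture(n_components=k).fit(X)                                # mixture densities fitted with EM
Lambda = gmm.predict_proba(X).T                                             # k x n soft clustering matrix
lam_em = Lambda.argmax(axis=0)                                              # corresponding hard labels
lam_spec = SpectralClustering(n_clusters=k, affinity="rbf").fit_predict(X)  # graph/spectral clustering
lam_dbscan = DBSCAN(eps=0.1, min_samples=4).fit_predict(X)                  # density based; -1 marks noise
```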

1.2.2 Evaluation of clustering processes

According to the flow diagram depicted in figure 1.2, the final stage of knowledge discovery processes involves the user's evaluation of the mined patterns (Fayyad, Piatetsky-Shapiro, and Smyth, 1996). When clustering is the central data mining task of the KD process, this implies validating the obtained clusters.

As far as this issue is concerned, three distinct approaches can be followed depending on the reference used for evaluating the clustering solution:

– the data itself, by determining whether the evaluated cluster structure provides a proper description of the data, which is measured by means of internal cluster validity indices.

– a predefined and allegedly correct cluster structure, measuring its degree of resemblance to the obtained clustering solution by means of external cluster validity indices.

– a clustering solution resulting from another clustering process (e.g. a distinct execution of the same clustering algorithm but using different parameters), measuring their relative merit so as to decide which one may best reveal the characteristics of the objects, using relative cluster validity indices.

All three types of evaluation criteria can be used for validating individual clusters, as well as the output of partitional and hierarchical clustering algorithms (Jain and Dubes, 1988; Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and Vazirgiannis, 2002b).

As regards their applicability, only internal and relative evaluation criteria are applicable to evaluate clustering solutions in real-life scenarios. This is due to the unsupervised nature of the clustering problem, as the class membership of objects is unknown in practice. However, in a research context –where ‘correct’ category labels presumably assigned by an expert to the objects in the data set are usually known, but not available to the clustering algorithm–, it is sometimes more appropriate to use external validity indices, since clusters are ultimately evaluated externally by humans (Strehl, 2002). For this reason, in this work we will make use of external evaluation criteria solely. However, there exist some recent efforts that aim to find correlations between internal and external cluster validity indices, such as (Ingaramo et al., 2008). Nevertheless, for further insight on internal and relative validity indices, see (Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and Vazirgiannis, 2002b; Maulik and Bandyopadhyay, 2002).

Therefore, evaluation will consist in testing whether the clustering solution reflects the true group structure of the data set, captured in a reference clustering or ground truth⁴.

A further advantage of this evaluation approach lies in the fact that external evaluation measures can be used to compare fairly the performance of clustering algorithms regardless of their foundations, as they make no assumption about the mechanisms used for finding the clusters (Strehl, 2002).

There exist multiple ways of comparing a clustering solution to a ground truth. Quite obviously, different approaches must be followed depending on the nature of the clustering solution (i.e. whether it is hard or soft, hierarchical or partitional).

As regards the evaluation of soft clustering solutions, the main difficulty lies not in the comparison process (see (Gopal and Woodcock, 1994; Jäger and Benz, 2000) for some classic approaches), but in the creation of a fuzzy ground truth, which may require applying an averaging scheme that accounts for systematic biases in the answers of the expert labelers (Jäger and Benz, 2000; Jomier, LeDigarcher, and Aylward, 2005).

As far as the validation of hierarchical clustering solutions is concerned, it is necessary to use a hierarchical taxonomy as the ground truth. However, this type of ground truth is not always available, as some domains are more prone to be organized hierarchically than others. Possibly due to this fact, not much research has been done on external hierarchical clustering evaluation (Patrikainen and Meila, 2005). Some examples of the few existing hierarchical clustering comparison methods are simple layer-wise comparison (Fowlkes and Mallows, 1983) and cophenetic matrices (Theodoridis and Koutroumbas, 1999), although they also have their shortcomings (Patrikainen and Meila, 2005). For this reason, the most extended strategy is to compare the clusterings found at a certain level of the dendrogram with a partitional ground truth. Unfortunately, this approach does not take into account the cluster hierarchy in any way, which is clearly not the point if the hierarchical clustering solution is to be validated as a whole.

⁴ The unsupervised nature of clustering means that the performance of clustering algorithms cannot be judged with the same certitude as that of supervised classifiers, as the external categorization (ground truth) might not be optimal. For instance, the way web pages are organized in the Yahoo! taxonomy is possibly not the best structure possible, but achieving a grouping similar to the Yahoo! taxonomy is certainly indicative of successful clustering (Strehl, 2002).

Allowing for all these considerations, and provided that the outputs of soft and hierarchical clustering algorithms can always be converted to hard and partitional clustering solutions, respectively (see section 1.2.1), the most common cluster validation procedure consists in comparing hard partitional clustering solutions (i.e. label vectors) with the same type of ground truths (i.e. comparing cluster labels with class labels) (Strehl, 2002).

The following paragraphs are devoted to a description of some relevant external validity indices for evaluating hard partitional clustering solutions.

The multiple possible ways for comparing the class labels contained in the ground truth label vector γ and the cluster labels in a label vector λ can be categorized into two groups depending on whether they are based on i) object pairwise matching, or ii) cluster matching.

Object pairwise matching cluster validity indices are based on counting how many object pairs (x_i, x_j), ∀ i ≠ j, are clustered together and separately in both γ and λ (the more coincidences, the higher the similarity between the clustering solution and the ground truth). Following this rationale, several validity indices have been proposed, such as the Rand index (Rand, 1971), the Adjusted Rand index (Hubert and Arabie, 1985), the Fowlkes-Mallows index (Fowlkes and Mallows, 1983) or the Jaccard index, among others—see (Halkidi, Batistakis, and Vazirgiannis, 2002a; Denoeud and Guénoche, 2006).
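Several of these pairwise matching indices are readily available in standard libraries; the sketch below (assumed, not part of the thesis) compares a hypothetical label vector against a ground truth with two of them.

```python
# Object pairwise matching indices: count pairs co-clustered in both the ground truth and the labeling.
import numpy as np
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

gamma = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])   # ground truth class labels
lam   = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3])   # cluster labels (one object misplaced)

print(adjusted_rand_score(gamma, lam))           # Adjusted Rand index (Hubert and Arabie, 1985)
print(fowlkes_mallows_score(gamma, lam))         # Fowlkes-Mallows index
```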

Cluster matching cluster validity indices measure the degree of agreement between the assignment of objects to classes (according to γ) and clusters (as designated by λ). Typical examples of this kind of validity indices are the Larsen index (Larsen and Aone, 1999), the Van Dongen index (van Dongen, 2000), variation of information (Meila, 2003), entropy or mutual information (Cover and Thomas, 1991), to name a few.

In all the experimental sections of this work, the cluster validity index used for evaluating clustering results is normalized mutual information, denoted as φ^(NMI). This choice is motivated by the fact that φ^(NMI) is theoretically well-founded, unbiased, symmetric with respect to λ and γ and normalized in the [0, 1] interval —the higher the value of φ^(NMI), the more similar λ and γ are (Strehl, 2002). Mathematically, normalized mutual information is defined as follows:

\phi^{(NMI)}(\gamma, \lambda) = \frac{\sum_{h=1}^{k} \sum_{l=1}^{k} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^{(\gamma)} \, n_l^{(\lambda)}}\right)}{\sqrt{\left(\sum_{h=1}^{k} n_h^{(\gamma)} \log\frac{n_h^{(\gamma)}}{n}\right)\left(\sum_{l=1}^{k} n_l^{(\lambda)} \log\frac{n_l^{(\lambda)}}{n}\right)}}    (1.8)

where n_h^{(γ)} is the number of objects in cluster h according to γ, n_l^{(λ)} is the number of objects in cluster l according to λ, n_{h,l} denotes the number of objects in cluster h according to γ as well as in group l according to λ, n is the number of objects contained in the data set, and k is the number of clusters into which objects are clustered according to λ and γ (Strehl, 2002).

Thus, the more similar the clustering solutions represented by the label vector λ and the ground truth γ, the closer to 1 φ^(NMI)(γ, λ) will be. As the ground truth γ is assumed to represent the true partition of the data, high quality clusterings will attain φ^(NMI)(γ, λ) values close to unity. As a consequence, given two label vectors λ_1 and λ_2, the former will be considered to be better than the latter if φ^(NMI)(γ, λ_1) > φ^(NMI)(γ, λ_2), and vice versa.
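The sketch below (assumed, not part of the thesis) evaluates a label vector against a ground truth with a direct transcription of equation (1.8); for a label vector that is a mere relabeling of the ground truth, it returns 1.

```python
# Normalized mutual information between a ground truth gamma and a labeling lam, eq. (1.8).
import numpy as np

def nmi(gamma, lam):
    gamma, lam = np.asarray(gamma), np.asarray(lam)
    n = len(gamma)
    num = 0.0
    for h in np.unique(gamma):
        for l in np.unique(lam):
            n_h, n_l = np.sum(gamma == h), np.sum(lam == l)
            n_hl = np.sum((gamma == h) & (lam == l))
            if n_hl > 0:
                num += n_hl * np.log(n * n_hl / (n_h * n_l))
    den_g = sum(np.sum(gamma == h) * np.log(np.sum(gamma == h) / n) for h in np.unique(gamma))
    den_l = sum(np.sum(lam == l) * np.log(np.sum(lam == l) / n) for l in np.unique(lam))
    return num / np.sqrt(den_g * den_l)

gamma = [1, 1, 1, 2, 2, 2, 3, 3, 3]     # ground truth class labels
lam   = [2, 2, 2, 1, 1, 1, 3, 3, 3]     # cluster labels: a permutation of the ground truth
print(nmi(gamma, lam))                   # 1.0, since the partitions coincide up to relabeling
```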

1.3 Multimodal clustering

The ubiquity of multimedia data has motivated an increasing interest in clustering techniques capable of dealing with multimodal data. In the following paragraphs we review some of the most relevant works on clustering multimodal data.

Possibly one of the first works that mention the multimedia clustering problem was (Hinneburg and Keim, 1998). The authors place special emphasis on highlighting the two main challenges faced by clustering algorithms in this context: the high dimensionality of the feature vectors and the existence of noise. To tackle these problems, the authors proposed DENCLUE, a density-based clustering algorithm capable of dealing satisfactorily with both issues. However, in that work, multimodality seemed to be more of a pretext to justify the challenges of clustering high dimensional noisy data than an interest in itself.

This was not the case of the browsing and retrieval system for collections of text annotated web images presented in (Chen et al., 1999), which was a multimodal extension of the Scatter-Gather document browser of (Cutting et al., 1992). In this case, multiple clusterings were created upon text and image features independently. In particular, clustering on image features was employed as part of a query refinement process. Therefore, this proposal is multimodal in the sense that the features of the distinct modalities are employed for clustering the image collection, but, still, they were not fully integrated in the clustering process.

In contrast, the true multimodality of the clustering approach proposed in (Barnard and Forsyth, 2001) is guaranteed by modeling the probabilities of word and image feature occurrences and co-occurrences. It consisted of a statistical hierarchical generative model fitted with the EM algorithm, which organizes image databases using both image features and their associated text. In subsequent works, the learnt joint distribution of image regions and words was exploited in several applications, such as the prediction of words associated with whole images or with image regions (Barnard et al., 2003).

Multimodal clustering has also been applied to the discovery of perceptual clusters for disambiguating the semantic meaning of text annotated images (Benitez and Chang, 2002). To do so, the images are clustered based on the visual or the text feature descriptors. Moreover, the system could also conduct multimodal clustering upon any combination of text and visual feature descriptors by conducting an early fusion of these. Principal Component Analysis was used to integrate and to reduce the dimensionality of feature descriptors before clustering. As regards the results obtained by multimodal clustering, the authors highlighted the uncorrelatedness of visual and text feature descriptors, which suggests that they should be integrated in the knowledge extraction process.

In the multimodal clustering context, the field that has motivated the largest amount of research efforts is the clustering of web image search results based not only on visual features, but also using the surrounding text and link information—as organizing the results into different semantic clusters might facilitate users' browsing (Cai et al., 2004). In that work, each image returned by the search engine is represented using three kinds of information: visual information, textual information and link information (text and link data are recovered from the surroundings of the image). The rationale of this approach is based on the fact that the textual and link based representations can reflect the semantic relationship of images better than visual features. The proposed system implements a two level clustering algorithm: in the first level, clustering is conducted using the textual and link representation of images (separately or jointly). In the second level, clustering is conducted on the images assigned to each cluster resulting from the previous stage. In this case, low level visual features are employed to re-organize the images in the first level clusters, so as to group visually similar images and facilitate users' browsing.

A second paper dealing with web image search results clustering was (Gao et al., 2005). In that work, a tripartite graph was used to model the relations among low-level features, images and their surrounding texts. Thus, the method was formulated as a constrained multiobjective optimization problem, which can be efficiently solved by semi-definite programming.

In a similar context, clustering was applied for image sense discrimination for web images retrieved from ambiguous keywords (Loeff, Ovesdotter-Alm, and Forsyth, 2006). Its goal was presenting the image search results in semantically sensible clusters for improved image browsing. To do so, spectral clustering was applied on multimodal features: simple local and global image features, and a bag of words representation of the text in the embedding web page. Multimodal fusion was achieved by combining pairwise object similarities measured on both image and textual features in the graph affinity matrix of the spectral clustering algorithm.

Finally, the notion that each of the multiple modalities in a multimedia collection contributes its own perspective to the collection's organization was the driving force behind the proposal in (Bekkerman and Jeon, 2007). That work presents the Comraf* model, a lightweight version of combinatorial Markov random fields. In Comraf*, multimodal clustering is faced as the problem of simultaneously constructing a partition of each data modality. By clustering modalities simultaneously, the statistical sparseness of the data representation can be overcome, obtaining a dense and smooth joint distribution of the modalities. However, not every modality has to be clustered, as long as the so-called target modality is.

The reader interested in multimedia indexing and retrieval is referred to the recent and complete survey of (Chen, 2006), although it is mainly focused on text plus image modalities.

1.4 Clustering indeterminacies

As mentioned at the end of section 1.1, the accomplishment of a knowledge discovery process requires making several critical decisions at each of its stages, which may have to be re-executed if the user is not satisfied with the evaluation of the mined patterns. In the case that clustering is the data mining task of the knowledge discovery process, these important decisions are often made blindly, due to the unsupervised nature of the clustering problem. Unfortunately, these decisions determine, to a large extent, the effectiveness of the clustering task (Jain, Murty, and Flynn, 1999), so they should not be made unconcernedly.

Thus, obtaining a good quality clustering solution relies heavily on making optimal (or quasi-optimal) decisions at every stage of the KD process. The doubts that seize clustering practitioners at the time of making such decisions are caused by what we call clustering indeterminacies, which mainly concern the selection of i) the way objects are represented, and ii) the clustering algorithm to be applied.

As regards the decision on data representation, ideal features should permit distinguishing objects belonging to different clusters, besides being robust to noise, easy to extract and interpret (Xu and Wunsch II, 2005). In a blind quest for finding such a data representation, the clustering practitioner is struck by the following questions:

– how should the objects be represented? Should we stick to their original representation, select a subset of the original attributes (i.e. feature selection) or transform them into a new feature space (i.e. feature extraction)?

– if the original data representation is subject to a dimensionality reduction process, which should be the dimensionality of the reduced feature space?

– if the original data representation is subject to a feature selection process, which criterion should be followed?

– if feature extraction is applied, which criterion should guide it?

Regrettably, whereas these questions are easy to answer in a supervised classification scenario (e.g. the optimal feature subset can be chosen by maximizing some function of predictive classification performance (Kohavi and John, 1998) or by applying a feature transformation driven by class labels (Torkkola, 2003)), they have no clear or universal answer in an unsupervised context. This is due to the fact that, in clustering, the lack of class labels makes feature selection a necessarily ad hoc and often trial-and-error process (Dy and Brodley, 2004). Moreover, studies comparing the influence of object representations based on feature extraction on clustering performance often come up with contradictory conclusions (Tang et al., 2004; Shafiei et al., 2006; Cobo et al., 2006; Sevillano, Alías, and Socoró, 2007b).

To illustrate the effect and importance of the data representation clustering indeterminacy, in the following paragraphs we present experimental evidence that the selection of a specific object representation can condition the quality of a clustering process to a large extent. In particular, we have represented the objects contained in the Wine and the miniNG data collections using multiple data representations: the original attributes (referred to as baseline) and feature extraction-based representations —obtained by means of Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NMF) and Random Projection (RP)— on a range of distinct dimensionalities. Upon each object representation, we have applied a refined repeated bisecting clustering algorithm based on the correlation similarity measure for obtaining a partition of the data⁵. In all cases, the desired number of clusters k is set to the number of classes defined by the ground truth of each data set.
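A minimal sketch of this representation-generation step is given below (assumed, not the thesis' actual experimental code); it builds several feature extraction-based representations of a hypothetical data matrix with scikit-learn, each of which would then be clustered and evaluated against the ground truth.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF
from sklearn.random_projection import GaussianRandomProjection

X = np.random.default_rng(0).random((178, 13))   # hypothetical objects (rows) with d = 13 features
r = 8                                            # target dimensionality of the reduced feature space

representations = {
    "baseline": X,
    "PCA": PCA(n_components=r).fit_transform(X),
    "ICA": FastICA(n_components=r).fit_transform(X),
    "NMF": NMF(n_components=r, init="nndsvda", max_iter=500).fit_transform(X),  # needs non-negative data
    "RP":  GaussianRandomProjection(n_components=r).fit_transform(X),
}
# Each representation is then fed to the same clustering algorithm, and the resulting
# partitions are compared against the ground truth by means of the normalized mutual information.
```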



Figure 1.5: Illustration of the data representation indeterminacy on the clustering results of the (a) Wine and (b) miniNG data sets clustered by the rbr-corr-e1 algorithm. Each panel plots the φ^(NMI) values attained by the baseline, PCA, ICA, NMF and RP representations as a function of their dimensionality.

The results of these clustering processes are presented in figure 1.5, which displays the normalized mutual information (φ^(NMI)) values between these clustering solutions and the ground truth of each data collection. It can be observed that, in the Wine data set (figure 1.5(a)), the clustering solution obtained operating on the original representation is worse (i.e. it attains a lower φ^(NMI) score) than all but three of the feature-extraction based representations. In particular, the maximum value of φ^(NMI) is attained using an 8-dimensional PCA transformation of the original data. However, these results cannot be generalized. In fact, it is the baseline object representation that yields the best results when the same clustering algorithm is applied on the miniNG data set (see figure 1.5(b)).

If we analyze the data representation that yields the best clustering results across the 12 unimodal data sets described in appendix A.2, we observe a rather even distribution: baseline (23% of the times), LSA (31%), NMF (31%) and RP (15%), which somehow reinforces the notion that no intrinsically superior data representation exists—see appendix B.1 for more experimental results regarding data representation clustering indeterminacies.

Moreover, notice the remarkable influence of the data representation dimensionality on the value of φ^(NMI), i.e. it is not only important to select the right type of representation, but its dimensionality is also a critical factor. Although there exist several approaches for determining the most natural dimensionality of the data (d_nat) –such as the eigenvalue spectrum in PCA (Duda, Hart, and Stork, 2001) or the reconstruction error in NMF (Sevillano et al., 2009)–, it is not trivial to ensure that any clustering algorithm will yield its best performance when operating on the d_nat-dimensional representation of the data.

To sum up, this modest example tries to demonstrate the relevance of the data representation indeterminacy, as an incorrect choice of data representation may ruin the results of a clustering process.

⁵ For a detailed description of the clustering algorithms, the data sets and the data representations employed in the experimental sections of this thesis, see appendices A.1, A.2 and A.3, respectively.

The other major source of indeterminacy is the selection of the particular clustering algorithm to apply. In this sense, there are several critical questions that must be answered:

– what type of algorithm should we apply? Hierarchical or partitional? Hard or soft?

– once the type of clustering algorithm is selected, which specific clustering algorithm should be applied?

– how should the parameters of the clustering algorithm be tuned?

– into how many clusters should the data objects be grouped?

As far as the selection of the type of clustering algorithm is concerned, it depends on the desired shape of the clustering solution. In any case, it is worth noting that soft and hierarchical clustering algorithms can be regarded as a generalization of their hard and partitional counterparts, as the latter can always be obtained from the former.

Moreover, it is a commonplace that no universally superior clustering algorithm exists, as most of the proposals found in the literature have been designed to solve particular problems in specific fields (Xu and Wunsch II, 2005), thus being able to outperform the other existing algorithms in a concrete context, but not in others (Jain, Murty, and Flynn, 1999). This fact has been theoretically analyzed and demonstrated by the impossibility theorem in (Kleinberg, 2002). Thus, unless some domain knowledge enables clustering practitioners to choose a specific algorithm, this selection is often made blindly to a large extent.

Once the algorithm is chosen, its parameters must be set. Again, this is not a trivial choice, as they largely determine its behaviour. Several examples concerning the sensitivity of some popular clustering algorithms to their parameter tuning are easy to find in the literature: for instance, there is no universal method for identifying the initial centroids of k-means. The EM clustering algorithm is highly sensitive to the selection of its initial parameters and the effect of a singular covariance matrix, as it can converge to a local optimum (Xu and Wunsch II, 2005). In combinatorial search based clustering, there exist no theoretic guidelines to select appropriate and effective parameters, while the selection of the graph Laplacians is a major issue that affects the performance of spectral clustering algorithms, just as happens in kernel-based clustering algorithms as regards selecting the width of the Gaussian kernels (Xu and Wunsch II, 2005).

And finally, one has to decide the number of clusters k into which the objects must be grouped, as many clustering algorithms (e.g. partitional) require this value to be passed as one of their parameters. Unfortunately, in most cases the number of classes in a data set is unknown, so it is one more parameter to guess. Moreover, determining the ‘correct’ number of clusters in a data set is an open question: in some cases, equally satisfying (though substantially different) clustering solutions can be obtained with different values of k for the same data, proving that the right value of k often depends on the scale at which the data is inspected (Chakaravathy and Ghosh, 1996).

Notwithstanding, there exist several practical procedures for determining the most suitable value of k. Possibly, the most intuitive approach consists in visualizing the data set on a two-dimensional space, although this strategy is of little use for complex data sets (Xu and Wunsch II, 2005). Additionally, relative cluster validity indices (such as the Davies-Bouldin index (Davies and Bouldin, 1979), Dunn's index (Dunn, 1973), the Calinski-Harabasz index (Calinski and Harabasz, 1974) or the I index (Maulik and Bandyopadhyay, 2002)) can be applied for determining the most appropriate number of clusters by comparing the relative merit of several clustering solutions obtained for distinct values of k (Halkidi, Batistakis, and Vazirgiannis, 2002b)—unfortunately, the performance of these indices is data dependent, which gives rise to a further indeterminacy (Xu and Wunsch II, 2005). And last, in the context of mixture densities based clustering, the number of clusters can be determined through the optimization of criterion functions such as Akaike's Information Criterion (Akaike, 1974) or the Bayesian Inference Criterion (Schwarz, 1978), among others.
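As a concrete illustration of this model selection strategy, the sketch below (assumed, not part of the thesis) sweeps a range of values of k, clusters a hypothetical data set for each of them, and picks the value minimizing the Davies-Bouldin index.

```python
# Selecting k with a relative cluster validity index: lower Davies-Bouldin is better.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.default_rng(0).random((200, 5))      # hypothetical data set
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)                # value of k with the lowest index
print(best_k, scores[best_k])
```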

In this work, the number of clusters is assumed to be known from the start of the clustering process, and k is set to be equal to the number of classes defined by the ground truth of each data set. However, in a real-world scenario, this parameter should be tuned using some of the previously mentioned techniques.

To illustrate the clustering algorithm selection indeterminacy, table 1.1 presents the values of φ^(NMI) (and their percentiles in the global φ^(NMI) distribution of each case) obtained from evaluating the clustering solutions yielded by the graph-cos-i2 and direct-cos-i2 algorithms (graph-based and direct clustering algorithms using the cosine distance, respectively) operating on the baseline data representation of the objects contained in the BBC and PenDigits data sets⁶. As just mentioned, the desired number of clusters k is set to the number of classes in each data set. It can be observed that these two algorithms, despite using the same object similarity measure and optimizing the same criterion function, offer almost opposite performances in these two specific data collections, so no absolute claims on the superiority of either of them can be made—see appendix B.1 for more experimental results regarding the clustering algorithm selection indeterminacy.

φ^(NMI)       direct-cos-i2    graph-cos-i2
BBC           0.807 (P100)     0.603 (P83)
PenDigits     0.658 (P84)      0.829 (P99)

Table 1.1: Illustration of the clustering algorithm indeterminacy on the BBC and PenDigits data sets clustered by the direct-cos-i2 and graph-cos-i2 algorithms.

It is worth noticing that the decision problems caused by the previously described clustering indeterminacies are multiplied in the case of clustering multimodal data. In this context, besides the data representation and clustering algorithm selection indeterminacies, the clustering practitioner must face an additional set of questions with no clear answer, such as:

– should one modality dominate the clustering process? If so, which one, and to which extent?

– should the modalities be fused? If so, how should the fusion process be conducted?

To illustrate the effect of these and the aforementioned indeterminacies in a multimodal clustering scenario, we have conducted several clustering experiments on the multimodal data sets presented in appendix A.2.2.

⁶ Refer to appendices A.1, A.2 and A.3 for a detailed description of the clustering algorithms, the data sets and the data representations employed in the experimental sections of this thesis.


Data set      Best results     Mode #1            Mode #2             Multimodal
CAL500        φ^(NMI)          0.411 (P100)       0.249 (P74)         0.310 (P88)
              algorithm        rbr-cos-i2         graph-cos-i2        graph-jacc-i2
              representation   RP, r=120          PCA, r=40           ICA, r=100
Corel         φ^(NMI)          0.669 (P99)        0.270 (P40)         0.675 (P100)
              algorithm        rbr-corr-i2        rb-cos-e1           rbr-corr-i2
              representation   RP, r=300          NMF, r=400          NMF, r=550
InternetAds   φ^(NMI)          0.430 (P100)       0.258 (P98)         0.319 (P99)
              algorithm        bagglo-cos-slink   agglo-corr-clink    graph-jacc-i2
              representation   RP, r=70           Baseline            NMF, r=150
IsoLetters    φ^(NMI)          0.754 (P93)        0.537 (P67)         0.897 (P100)
              algorithm        rbr-corr-i2        graph-jacc-i2       rbr-corr-i2
              representation   Baseline           ICA, r=12           PCA, r=100

Table 1.2: Illustration of the clustering indeterminacies on the CAL500, Corel, InternetAds and IsoLetters multimodal data sets. Each column presents the top-performing clustering configuration for each separate modality and for the multimodal data representation.

The twenty-eight clustering algorithms employed in this work (see table A.1 for a quick reference) are run on i) each modality of the data set, and ii) the multimodal representation obtained by early feature fusion, as described in appendix A.3.2. In both cases, clustering is run on the baseline and feature extraction based data representations (see section A.3.1 of appendix A).

As a summary of the obtained results and an illustration of the clustering indeterminacies, table 1.2 presents the highest quality clustering results obtained in each case (i.e. when clustering is conducted on either mode –mode #1 and mode #2 columns– and on the multimodal representation), indicating the corresponding value of φ^(NMI), its percentile in the global φ^(NMI) distribution obtained, and the top-performing clustering configuration (i.e. algorithm plus data representation plus dimensionality of the reduced feature space r when needed).

As expected, there is no predominant modality across all the data sets. For the CAL500 collection, the clustering results obtained on mode #1 (text) are clearly superior to the rest. A similar behaviour is observed in the InternetAds data set, where it is also mode #1 (image size and aspect ratio, plus caption and alternate text in this case) the one that yields the best clustering results, although its predominance is not as clear as in CAL500. In contrast, the highest quality clustering solutions are obtained from multimodal representations in the Corel and IsoLetters data sets.

Last but not least, notice the effect of the data representation and clustering algorithm indeterminacies on all the data sets. Again, there is no universally superior clustering algorithm nor data representation that guarantees the best clustering results. For a more detailed description of the experimental results regarding the clustering indeterminacies in multimodal data collections, see appendix B.2.

1.5 Motivation and contributions of the thesis

The main motivation of this thesis is the construction of an efficient multimodal clustering system that performs as autonomously as possible, avoiding the re-execution of the different stages of the knowledge discovery process. As these feedback loops are caused by suboptimal decision-making, our idea is to set clustering practitioners free from the obligation of making such critical decisions in a blind way, obtaining, at the same time, clustering solutions which are robust to the clustering indeterminacies presented in the previous section.

Instead of being forced to blindly select a single clustering configuration, the user is encouraged to use and combine all the data modalities, representations and clustering algorithms at hand, generating as many individual clustering solutions (compiled into a cluster ensemble) as possible. It will be the proposed system which, in a fully unsupervised mode, outputs a consensus clustering solution that will hopefully be comparable to (or even better than) the one achieved using the best clustering configuration among the available ones.

As the informed reader may have guessed, the approach followed in the quest for this goal lies within the consensus clustering framework, which is defined as "the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings" (Strehl and Ghosh, 2002). That is, the data representations, modalities and clustering algorithms employed for generating the individual partitions are not of the system's concern, as it will operate on the individual clustering solutions regardless of the way they were created.
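To give a flavour of what a consensus function does, the sketch below (assumed; it follows the classic evidence accumulation scheme associated with Fred and Jain, not one of the consensus functions proposed in this thesis) combines a toy cluster ensemble through a co-association matrix and re-clusters it.

```python
# Consensus clustering via evidence accumulation: the label vectors of the ensemble are
# summarized in a co-association matrix, which is then clustered hierarchically.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus(ensemble, k):
    E = np.asarray(ensemble)                                   # shape: (number of clusterings, n)
    coassoc = np.mean(E[:, :, None] == E[:, None, :], axis=0)  # fraction of clusterings co-grouping each pair
    dist = 1.0 - coassoc                                       # turn co-association into a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")              # consensus label vector

E = [[1, 1, 2, 2, 3, 3],      # a toy cluster ensemble: three label vectors over n = 6 objects
     [1, 1, 1, 2, 3, 3],
     [2, 2, 2, 1, 3, 3]]
print(consensus(E, k=3))
```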

However, applying consensus clustering on cluster ensembles as a means for obtaining robust clustering solutions is not new—in fact it has been a central or collateral matter in several works (Strehl and Ghosh, 2002; Fred and Jain, 2003; Sevillano et al., 2006a; Fern and Lin, 2008). Nevertheless, this thesis deals with several crucial and, to our knowledge, little addressed issues in this context, such as:

– the computational burden imposed by the use of large cluster ensembles generated by crossing multiple data modalities, representations and clustering algorithms.

– the quality decrease of the consensus clustering solution caused by the wide diversity of the cluster ensemble.

– the application of cluster ensembles to the multimodal clustering problem.

– the definition of methods for building consensus clustering solutions (either crisp or fuzzy) from the outputs of soft clustering algorithms.

As a systematic response to these challenges, this thesis puts forward the following proposals:

– parallelizable hierarchical consensus architectures for creating consensus clustering solutions in a computationally efficient way (see chapter 3).

– fully unsupervised consensus self-refining procedures, so as to drive the quality of the consensus clustering solution near or even above the best available individual clustering configuration —see chapter 4.

– the construction of multimodal cluster ensembles and the application of self-refining hierarchical consensus architectures for robust multimodal clustering (see chapter 5).

– consensus functions based on voting strategies for combining fuzzy partitions contained in soft cluster ensembles —see chapter 6.

Figure 1.6: Block diagram of the robust multimodal clustering system based on self-refining hierarchical consensus architectures.

These contributions can be articulated in a unitary proposal for robust multimodal clustering based on cluster ensembles, a block diagram of which is shown in figure 1.6. The procedure for deriving the partition of a multimodal data collection X according to our proposal goes as follows: firstly, multiple representations of the objects contained in X are created by the application of a set of representational and dimensional diversity factors provided by the user (denoted as dfR and dfD in figure 1.6). Next, a set of either hard or soft clustering algorithms (referred to as the algorithmic diversity factor dfA) is applied on the distinct object representations obtained from the previous step, giving rise to a set of clusterings compiled in the cluster ensemble E. Notice that, up to this point, the only choices made by the user refer to the object representation techniques and clustering algorithms employed for creating the ensemble. As mentioned earlier, the user is encouraged to employ the widest possible range of diversity factors, thus creating maximally diverse clusterings so as to break free from the indeterminacies inherent to clustering. The obviously high computational cost associated with this cluster ensemble generation strategy can be somewhat mitigated considering that it is a highly parallelizable process (Hore, Hall, and Goldgof, 2006).

Subsequently, the process for deriving the partition of the data set X upon the cluster ensemble E starts by applying a consensus clustering procedure. This can be conducted according to either a flat or a hierarchical consensus architecture, a decision that is automatically made by the system based on the characteristics of the data set X, the cluster ensemble E and the consensus function F employed for combining the clusterings in E (which is selected by the user). In case a hierarchical consensus architecture is employed, an additional decision (also made with no user supervision) is the one related to its serial or parallel execution, which ultimately depends on the availability of computational resources. As a result, a consensus clustering solution is obtained, which can be represented either by a consensus label vector λc or by a consensus clustering matrix Λc, depending on whether a crisp or a fuzzy clustering approach is taken. Subsequently, this consensus clustering is subjected to an almost fully autonomous self-refining procedure, which requires the user to specify a percentage threshold (denoted by the symbol '%' in figure 1.6). Finally, the final partition of the data set X is obtained, denoted as λc^final (or Λc^final in the fuzzy case).
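For concreteness, the following minimal Python sketch illustrates this ensemble generation step: a few PCA-reduced representations of different dimensionalities (standing in for the representational and dimensional diversity factors dfR and dfD) are crossed with two clustering algorithms (the algorithmic diversity factor dfA), and every (representation, algorithm) pair contributes one label vector to a hard cluster ensemble. The choice of PCA, k-means and agglomerative clustering, as well as all parameter values, are illustrative assumptions rather than the configuration actually used in this thesis (see appendices A.1 and A.3 for that).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering

def build_hard_ensemble(X, dims=(2, 3, 5), k=3, seed=0):
    """Cross dimensional diversity (PCA to several dimensionalities) with
    algorithmic diversity (two clustering algorithms); each run yields one
    row label vector of the hard cluster ensemble E (shape l x n)."""
    ensemble = []
    for d in dims:
        d = min(d, min(X.shape) - 1)          # keep PCA feasible on tiny data
        Xd = PCA(n_components=d, random_state=seed).fit_transform(X)
        for algo in (KMeans(n_clusters=k, n_init=10, random_state=seed),
                     AgglomerativeClustering(n_clusters=k)):
            ensemble.append(algo.fit_predict(Xd))
    return np.vstack(ensemble)

# Toy usage: n = 9 objects described by 4 features
X = np.random.RandomState(0).rand(9, 4)
E = build_hard_ensemble(X)
print(E.shape)   # (6, 9): l = 6 individual clusterings of the n = 9 objects
```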

Before proceeding with the description of our proposals, the next chapter presents an overview of related work in the area of cluster ensembles.



Chapter 2

Cluster ensembles and consensus clustering

In our quest for overcoming clustering indeterminacies in a multimodal context, the notions of cluster ensembles and consensus clustering play a central role. As mentioned at the end of chapter 1, our strategy for clustering multimodal data in a robust manner is based on the massive creation of multiple partitions of the target data set and the subsequent combination of these into a single consensus clustering solution. Therefore, an appropriate way to start this chapter is by formally defining the two closely related concepts of cluster ensembles and consensus clustering¹.

For starters, a cluster ensemble E is defined as the compilation of the outcomes of l clustering processes. For simplicity, we assume in this work that the l clustering processes group the data into the same number of clusters, namely k, although this is not a strictly necessary constraint². Depending on whether the clustering processes are crisp or fuzzy, E will be a hard or a soft cluster ensemble.

In the former case, E is mathematically defined as an l × n integer-valued matrix compiling l row label vectors λi (∀i ∈ [1, ..., l]) resulting from the respective hard clustering processes (see equation (2.1)).

$$
\mathbf{E} = \begin{pmatrix} \boldsymbol{\lambda}_1 \\ \boldsymbol{\lambda}_2 \\ \vdots \\ \boldsymbol{\lambda}_l \end{pmatrix} =
\begin{pmatrix}
\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1n} \\
\lambda_{21} & \lambda_{22} & \cdots & \lambda_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{l1} & \lambda_{l2} & \cdots & \lambda_{ln}
\end{pmatrix}
\qquad (2.1)
$$

where λij ∈ {1, ..., k} (∀i ∈ [1, ..., l] and ∀j ∈ [1, ..., n]), i.e. each component of each labeling is an integer label identifying to which of the k clusters each of the n objects in the data set is assigned.

¹ In some works, the term 'cluster ensemble' is used to designate the framework for combining multiple partitionings obtained from separate clustering runs into a final consensus clustering (Strehl and Ghosh, 2002; Punera and Ghosh, 2007). In this work, however, we stick to the literal meaning of this expression, and use it to designate the result of gathering several clustering solutions.

² Since our goal is to combine partitions differing only in the way data are represented and clustered, we set the number of clusters k to be equal across the l clustering processes. However, combining clustering solutions with a variable number of clusters is a common practice in the cluster ensembles literature. This can be useful for clustering complex data sets upon simple individual partitions (Fred and Jain, 2005), or for discovering the natural number of clusters in the data set (Strehl and Ghosh, 2002), although these potentialities are not exploited in this work.

Figure 2.1: Scatterplot of an artificially generated two-dimensional toy data set containing n = 9 objects grouped into k = 3 natural clusters. Each object is identified by a numerical label.

For illustration purposes, and resorting to the toy clustering example presented in section 1.2.1, equation (2.2) presents a hard cluster ensemble created by compiling the outcomes of l = 3 independent runs of the k-means clustering algorithm on the two-dimensional data set presented in figure 2.1, which contains n = 9 objects, setting the desired number of clusters k equal to 3.

$$
\mathbf{E} = \begin{pmatrix}
1 & 1 & 1 & 3 & 3 & 3 & 2 & 2 & 2 \\
2 & 2 & 2 & 1 & 1 & 1 & 3 & 3 & 3 \\
2 & 2 & 2 & 3 & 3 & 3 & 1 & 1 & 1
\end{pmatrix}
\qquad (2.2)
$$

In turn, a soft cluster ensemble E is defined as the compilation of the outcomes of l fuzzy clustering processes, and as such, it is mathematically expressed as a kl × n matrix, as presented in equation (2.3) (Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).

$$
\mathbf{E} = \begin{pmatrix} \boldsymbol{\Lambda}_1 \\ \boldsymbol{\Lambda}_2 \\ \vdots \\ \boldsymbol{\Lambda}_l \end{pmatrix}
\qquad (2.3)
$$

where Λi is the k × n real-valued clustering matrix resulting from the ith soft clustering process (∀i ∈ [1, ..., l]).

Continuing with the same toy example, equation (2.4) presents a soft cluster ensemble created by collecting the outcomes of l = 3 independent executions of the fuzzy c-means clustering algorithm on the same data set as before. As k = 3, the first three rows of E correspond to the clustering probability membership matrix output by the first soft clustering process, the next three are the outcome of the second fuzzy clusterer, and so on.

Figure 2.2: Schematic representation of the obtention of a consensus labeling λc by applying a consensus function F on a hard cluster ensemble E containing l = 3 individual label vectors, with n = 9 and k = 3.

$$
\mathbf{E} = \begin{pmatrix}
0.921 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017 \\
0.920 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.920 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017
\end{pmatrix}
\qquad (2.4)
$$

Notice that, just as was observed in section 1.2.1 regarding crisp and fuzzy clustering solutions, soft cluster ensembles can be converted to hard cluster ensembles by assigning each object to the cluster it is most strongly associated to. In fact, by doing so, the soft ensemble in equation (2.4) would be converted to the hard cluster ensemble in equation (2.2). Moreover, notice that the l = 3 components that make up both cluster ensembles are identical, given the symbolic nature of cluster labels.
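A minimal sketch of this hardening step, assuming the kl × n layout of equation (2.3) and a common k across clusterers: within each clusterer's block of k rows, every object is assigned to the row holding its highest membership. The function name and the toy values below are illustrative only.

```python
import numpy as np

def harden_soft_ensemble(E_soft, k):
    """Convert a soft cluster ensemble (kl x n stack of membership matrices)
    into a hard cluster ensemble (l x n) by per-object argmax within each
    clusterer's k-row block. Labels are returned in {1, ..., k} as in (2.2)."""
    kl, n = E_soft.shape
    assert kl % k == 0, "expecting l stacked k x n membership matrices"
    l = kl // k
    hard_rows = [E_soft[i * k:(i + 1) * k].argmax(axis=0) + 1 for i in range(l)]
    return np.vstack(hard_rows)

# Toy usage: l = 2 fuzzy clusterings of n = 4 objects into k = 2 clusters
E_soft = np.array([[0.9, 0.8, 0.2, 0.1],
                   [0.1, 0.2, 0.8, 0.9],
                   [0.3, 0.1, 0.7, 0.6],
                   [0.7, 0.9, 0.3, 0.4]])
print(harden_soft_ensemble(E_soft, k=2))
# [[1 1 2 2]
#  [2 2 1 1]]
```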

As for consensus clustering, it is defined as the process of obtaining a consolidated clustering solution through the application of a consensus function F on a cluster ensemble E (Strehl and Ghosh, 2002). In other words, consensus clustering can be regarded as the problem of combining several clustering solutions without accessing the features representing the clustered objects. Figure 2.2 depicts a schematic representation of a consensus clustering process conducted on the hard cluster ensemble resulting from our toy example. In this case, the result of the consensus clustering process is a consensus label vector λc which, quite obviously, represents the same partition as the individual label vectors that compose the cluster ensemble. However, in a real context, a higher degree of diversity among the clustering solutions embedded in the cluster ensemble can be expected (which, in fact, is desirable), a situation consensus clustering algorithms take advantage of for consolidating richer consensus clustering solutions (Pinto et al., 2007).

Quite obviously, the design of the consensus function F is a central issue as regards consensus clustering. Most works in the consensus clustering literature focus on combining the outcomes of hard clustering processes (as in the example depicted in figure 2.2), although some consensus functions can be applied to either hard or soft cluster ensembles indistinctly, possibly after introducing some minor modifications (Strehl and Ghosh, 2002; Fern and Brodley, 2004; Lange and Buhmann, 2005). However, little effort has been devoted to the design of specific consensus functions for soft cluster ensembles that generate fuzzy consensus clustering solutions (Dimitriadou, Weingessel, and Hornik, 2002; Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).

Regardless of whether the cluster ensemble is hard or soft, combining the results of several clustering processes has multiple applications, a good description of which can be found in (Strehl and Ghosh, 2002). In a nutshell, consensus clustering is useful for:

– knowledge reuse: in some scenarios, one may want to create a partition of a set of objects, but access to the original data may be restricted due to copyright or privacy reasons (customer databases are the most prototypical examples of this type of situation). However, if a set of legacy partitions of the data exists (e.g. segmentations of a customer database based on distinct criteria, such as residence, purchasing patterns, age, etc.), consensus clustering provides a means for reconciling the knowledge contained in those legacy clusterings.
– distributed clustering: due to security or operational reasons, there exist situations in which the data to be clustered are scattered across different locations. In this context, as an alternative to gathering and processing all the data at one site (which can be unfeasible, for instance, due to storage costs), the data available at each location would be subject to a clustering process, and the label vectors obtained would be combined by means of consensus clustering, yielding a consolidated classification of the data.
– robust clustering: in this case, the goal is to obtain a consensus clustering solution that improves the quality of the component clusterings, based on the fact that if the distinct clustering processes disagree, combining their outcomes may offer additional information and discriminatory power, thus obtaining a better combined clustering, closer to a hypothetical true classification (Pinto et al., 2007).

It is in this latter application that consensus clustering can be more clearly regarded as the unsupervised counterpart of classifier committees, as the objective of both strategies is to combine the outcomes of several classification processes aiming to improve the quality of the component classifiers (Dietterich, 2000). However, the purely symbolic nature of the labels returned by unsupervised classifiers makes consensus clustering a more challenging task. Possibly due to this fact, consensus clustering has historically been far less popular than classifier committees, and it has only begun to draw considerable attention from researchers during the last decade.

In the quest for obtaining good quality consensus clustering solutions, the design of both the cluster ensemble and the consensus function is of critical importance. Although having a cluster ensemble is always necessary in order to conduct consensus clustering, some works focus mainly on the design of the consensus function, relegating the construction of the cluster ensemble to a secondary role, and vice versa. Given the importance of both elements, we split the revision of the related work in this field into two separate parts, devoting section 2.1 to the previous work regarding the construction of cluster ensembles and section 2.2 to an overview of the existing approaches to the design of consensus functions.



2.1 Related work on cluster ensembles

Our aim in this section is to review the strategies applied in the literature as regards the construction of cluster ensembles, given its influence on the results of the consensus clustering process. Two alternative approaches have been traditionally followed in this context, differing in the number of distinct clustering algorithms used for generating the individual partitions in the ensemble.

The first cluster ensemble creation strategy consists of compiling the outcomes of multiple runs of a single clustering algorithm, which gives rise to what is known as a homogeneous cluster ensemble (Hadjitodorov, Kuncheva, and Todorova, 2006). In this case, the diversity of the ensemble components can be induced by several means, often in a combined manner:

– application of a stochastic clustering algorithm: this strategy relies on the fact that the outcome of a stochastic clustering algorithm depends on how its parameters are adjusted. For instance, diverse clustering solutions can be obtained by the random initialization of the starting centroids of k-means (Fred, 2001; Fred and Jain, 2002a; Fred and Jain, 2003; Dimitriadou, Weingessel, and Hornik, 2001; Greene et al., 2004; Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Kuncheva, Hadjitodorov, and Todorova, 2006; Li, Ding, and Jordan, 2007; Nguyen and Caruana, 2007; Ayad and Kamel, 2008; Fern and Lin, 2008) or fuzzy c-means (Dimitriadou, Weingessel, and Hornik, 2002), or by varying the initial settings of EM clustering (Punera and Ghosh, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b).
– random number of clusters: in this case, at each run of the clustering algorithm, the number of clusters to be found is set randomly (Fred and Jain, 2002b; Fred and Jain, 2005; Topchy, Jain, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova, 2006; Hadjitodorov and Kuncheva, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b; Ayad and Kamel, 2008). In general terms, this number of clusters is usually set to be much larger than the expected number of categories in the data set (Dimitriadou, Weingessel, and Hornik, 2001; Fred and Jain, 2002a), being often selected at random from a predefined interval (Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006). A minimal sketch combining this mechanism with random initialization is given after this list.

– distinct object representations: another source of diversity lies in the way objects are represented. Indeed, as we showed in section 1.4, running the same clustering algorithm on distinct representations of the same data set often leads to pretty diverse clustering solutions. Allowing for this fact, cluster ensembles have been created by running a single clustering algorithm on different data representations generated by random feature selection (Agogino and Tumer, 2006; Hadjitodorov and Kuncheva, 2007; Fern and Lin, 2008), random feature extraction (Greene et al., 2004; Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Hadjitodorov and Kuncheva, 2007; Fern and Lin, 2008) or deterministic feature extraction (Sevillano et al., 2006a; Sevillano et al., 2006b; Sevillano et al., 2007a; Sevillano, Alías, and Socoró, 2007b).
– data subsampling: the creation of multiple clustering solutions upon distinct random subsamples of the data set has been applied as a means for generating diverse cluster ensembles (Fischer and Buhmann, 2003; Dudoit and Fridlyand, 2003; Minaei-Bidgoli, Topchy, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova, 2006; Punera and Ghosh, 2007).

– weak clustering: another approach to the generation of homogeneous cluster ensembles is the repeated application of computationally cheap and conceptually simple clustering procedures that, although yielding poor clustering solutions by themselves (this is why they are said to be weak), may lead to better data clustering if combined. This type of strategy is of special interest when clustering high dimensional and/or large data collections, as deriving multiple partitions by traditional means may become too costly (Fern and Brodley, 2003). Examples of this include using random hyperplanes for splitting the data (Topchy, Jain, and Punch, 2003) or prematurely halted executions of k-means (Hadjitodorov and Kuncheva, 2007).
– noise injection: the random perturbation of the representation of the objects (Hadjitodorov, Kuncheva, and Todorova, 2006) or of the labels contained in the individual clustering solutions (Hadjitodorov and Kuncheva, 2007) through noise injection has also been applied in a few research works, although these strategies constitute a far less natural way of creating diverse cluster ensembles compared to the previous ones.
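As announced above, the following minimal sketch illustrates the first two diversity-inducing mechanisms of the list (random initialization of k-means and a random number of clusters drawn from a predefined interval). The interval, the algorithm and the parameter values are illustrative assumptions rather than settings taken from the cited works.

```python
import numpy as np
from sklearn.cluster import KMeans

def homogeneous_ensemble(X, l=10, k_range=(2, 8), seed=0):
    """Build a homogeneous hard cluster ensemble: l runs of the same algorithm
    (k-means), each with a random initialization and a random number of
    clusters drawn from k_range."""
    rng = np.random.RandomState(seed)
    rows = []
    for _ in range(l):
        k = rng.randint(k_range[0], k_range[1] + 1)    # random number of clusters
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=rng.randint(10**6))    # random initialization
        rows.append(km.fit_predict(X))
    return np.vstack(rows)   # l x n; rows may use different numbers of labels

X = np.random.RandomState(1).rand(50, 2)
E = homogeneous_ensemble(X)
print(E.shape)   # (10, 50)
```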

The second approach for creating cluster ensembles consists of applying several distinct clustering algorithms for generating the individual components of the ensemble, which gives rise to what are known as heterogeneous cluster ensembles. If clustering algorithms with substantially different biases are employed, cluster ensembles with a high degree of diversity can be obtained. This strategy has been applied in several works, such as (Strehl and Ghosh, 2002; Lange and Buhmann, 2005; Gonzàlez and Turmo, 2006; Gionis, Mannila, and Tsaparas, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b).

Notice that the strategies used for creating homogeneous and heterogeneous cluster ensembles can be combined so as to create even more diverse ensembles, as in (Sevillano et al., 2007c), where a set of clustering algorithms is applied on different representations of the objects in the data set (obtained by means of multiple feature extraction techniques with distinct dimensionalities). In this work, we will follow this approach as regards the generation of cluster ensembles, using the clustering algorithms and object representations described in appendices A.1 and A.3, as our goal is to overcome the indeterminacies resulting from the selection of a particular clustering configuration.

There exist several recent works in the literature dealing with the design of the cluster ensemble. In general terms, they can be divided into two categories: i) those works focused on analyzing which strategies should be followed for creating cluster ensemble components that give rise to good quality consensus clustering solutions, and ii) those centered on obtaining a good quality consensus clustering given a particular cluster ensemble.

Among the first group, we highlight the works by Kuncheva and Hadjitodorov. In (Hadjitodorov, Kuncheva, and Todorova, 2006), the authors analyze the diversity of the individual partitions composing a hard cluster ensemble and its effect on the quality of the consensus clustering. To do so, several measures for evaluating the diversity of a cluster ensemble are proposed and evaluated. Moreover, such measures are employed in the derivation of a procedure for selecting the candidate with the median diversity among a population of cluster ensembles, a criterion that leads to the obtention of equal or better consensus clustering solutions than those obtained on arbitrarily chosen cluster ensembles.

The notion that moderately diverse cluster ensembles lead to good quality consensus is reinforced by the experimental results presented in (Hadjitodorov and Kuncheva, 2007), where the authors apply a standard genetic algorithm for driving the selection of the cluster ensemble components, which are generated by random feature extraction and selection, weak clusterers, or a random number of clusters, among others. Unfortunately, the fitness function driving the genetic algorithm used to select the best cluster ensemble is the classification accuracy of the respective ensemble with respect to the correct object labels (ground truth), which makes this strategy of limited use in practice.

Another interesting work that addresses the heuristics of the cluster ensemble construction process is (Kuncheva, Hadjitodorov, and Todorova, 2006). In this sense, the authors recommend creating individual partitions with a variable number of clusters as a means for obtaining good quality consensus labelings.

As mentioned earlier, an alternative way of obtaining a good quality consensus clustering solution is based not on designing the cluster ensemble components according to certain heuristics, but on refining its contents. The rationale of such strategies is based on the fact that the quality of the consensus clustering solution is penalized by the inclusion of poor individual clustering solutions in the ensemble. For this reason, it seems logical to develop techniques capable of discarding such cluster ensemble components prior to conducting the consensus clustering process.

In this sense, the application of quality and/or diversity criteria for selecting a small subset of a large cluster ensemble was evaluated in (Fern and Lin, 2008) as a means for obtaining a consensus clustering solution that equals or betters the one that would be obtained using the whole ensemble. Pursuing the same goal, the authors of (Goder and Filkov, 2008) propose creating smaller subsets of the cluster ensemble that will yield better consensus. These mini-cluster ensembles are generated by clustering the individual partitions of the cluster ensemble using a hierarchical agglomerative average-link clustering algorithm.

2.2 Related work on consensus functions

The goal of this section is to review the state of the art in the area of consensus clustering. Although a considerable corpus of theoretical work on combining classifications was developed in the 80s and earlier (e.g. (Mirkin, 1975; Barthelemy, Laclerc, and Monjardet, 1986; Neumann and Norton, 1986)), it was not until the start of the present decade that this field experienced a significant flourishing of activity.

Despite this relatively recent awakening, multiple approaches to the combination of several clusterings can be found in the literature. In general terms, consensus clustering can be posed as an optimization problem whose goal is to minimize a cost function measuring the dissimilarity between the consensus clustering solution and the partitions in the cluster ensemble; often, the cost function is expressed in terms of the number of pairwise co-clustering disagreements between the individual partitions in the cluster ensemble and the consensus clustering solution. Unfortunately, finding the partition that minimizes the proposed symmetric difference distance metric (i.e. the so-called median partition) is an NP-hard problem (Goder and Filkov, 2008), and this is the reason why it is necessary to resort to distinct heuristics so as to conduct clustering combination.

Aiming to provide the reader with a global perspective on the distinct existing approaches in this field, table 2.1 presents a taxonomy of some of the most relevant consensus functions according to the theoretical foundations that guide the consensus process. Notice that some consensus functions appear under more than one theoretical approach, as sometimes the limits between them are somewhat vague.


Theoretical approach: Consensus functions
Voting: VMA (Dimitriadou, Weingessel, and Hornik, 2002); BagClust1 (Dudoit and Fridlyand, 2003); URCV, RCV, ACV (Ayad and Kamel, 2008); also in (Fischer and Buhmann, 2003; Greene and Cunningham, 2006)
Graph partitioning: CSPA, HGPA, MCLA (Strehl and Ghosh, 2002); HBGF (Fern and Brodley, 2004); BALLS (Gionis, Mannila, and Tsaparas, 2007)
Object co-association: EAC (Fred and Jain, 2005); CSPA (Strehl and Ghosh, 2002); BagClust2 (Dudoit and Fridlyand, 2003); IPC (Nguyen and Caruana, 2007); BALLS (Gionis, Mannila, and Tsaparas, 2007); Majority Rule, CC Pivot (Goder and Filkov, 2008); also in (Greene et al., 2004)
Categorical clustering: QMI (Topchy, Jain, and Punch, 2003); ITK (Punera and Ghosh, 2007)
Probabilistic: EM (Topchy, Jain, and Punch, 2004); PLA (Lange and Buhmann, 2005); also in (Long, Zhang and Yu, 2005; Li, Ding and Jordan, 2007)
Reinforcement learning: (Agogino and Tumer, 2006)
Similarity as data: ALSAD, KMSAD, SLSAD (Kuncheva, Hadjitodorov, and Todorova, 2006)
Centroid based: IVC, IPVC, IPC (Nguyen and Caruana, 2007); also in (Hore, Hall, and Goldgof, 2006)
Correlation clustering: Agglomerative, Furthest, LocalSearch (Gionis, Mannila, and Tsaparas, 2007)
Search techniques: SA, BOEM (Goder and Filkov, 2008)
Cluster ensemble component selection: BestClustering (Gionis, Mannila, and Tsaparas, 2007); BOK (Goder and Filkov, 2008); also in (Fern and Lin, 2008)

Table 2.1: Taxonomy of consensus functions according to their theoretical basis.

Throughout the following paragraphs, the main features of these consensus functions are described.

2.2.1 Consensus functions based on voting

The main idea underlying consensus functions based on voting strategies is the notion that objects assigned to a particular cluster by many partitions in the ensemble should also be located in that cluster according to the consensus clustering solution. One obvious way to achieve this goal is to consider cluster labels as votes, thus consolidating different clusterings by means of voting procedures. However, due to the symbolic nature of clusters (caused by the unsupervised nature of the clustering problem), it is necessary to disambiguate the clusters across the l components of the cluster ensemble prior to voting.
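As a simplified illustration of this two-step recipe (cluster disambiguation followed by voting), the sketch below aligns every partition of a hard ensemble to a reference partition with the Hungarian method (used for this purpose in some of the works reviewed below) and then applies plurality voting. It assumes all partitions share the same k and is a generic stand-in, not a reimplementation of any specific consensus function discussed in this section.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_to_reference(labels, ref, k):
    """Relabel `labels` so that its clusters best match those of `ref`
    (maximum overlap), solving the correspondence with the Hungarian method."""
    overlap = np.zeros((k, k))
    for c in range(k):
        for r in range(k):
            overlap[c, r] = np.sum((labels == c) & (ref == r))
    rows, cols = linear_sum_assignment(-overlap)       # maximize total overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping[c] for c in labels])

def plurality_voting_consensus(E, k):
    """Voting-based consensus on a hard ensemble E (l x n) with labels 0..k-1:
    align every partition to the first one, then take the per-object mode."""
    ref = E[0]
    aligned = np.vstack([align_to_reference(row, ref, k) for row in E])
    return np.array([np.bincount(aligned[:, j], minlength=k).argmax()
                     for j in range(aligned.shape[1])])

# Toy ensemble from equation (2.2), shifted to 0-based labels
E = np.array([[1, 1, 1, 3, 3, 3, 2, 2, 2],
              [2, 2, 2, 1, 1, 1, 3, 3, 3],
              [2, 2, 2, 3, 3, 3, 1, 1, 1]]) - 1
print(plurality_voting_consensus(E, k=3))   # [0 0 0 2 2 2 1 1 1]
```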

One of the pioneering works in voting-based consensus clustering was the voting-merging algorithm (VMA) of Dimitriadou, Weingessel, and Hornik (2001). In that work, cluster disambiguation is conducted by matching those clusters sharing the highest percentage of objects, iterating this cluster matching process across all the partitions in the cluster ensemble. As a result of the voting step, a fuzzy partition of the data set is obtained. Subsequently, a merging procedure is conducted on this soft partition, fusing those clusters which are closest to each other. By imposing a stopping criterion based on the sureness of the obtained clusters, this merging process is capable of finding the natural number of clusters in the data set. In (Dimitriadou, Weingessel, and Hornik, 2002), the authors define the consensus clustering solution as the one minimizing the average square distance with respect to all the partitions in the cluster ensemble. They demonstrate that obtaining such a consensus clustering boils down to finding the optimal re-labeling of the clusters of all the individual clusterings, which becomes an unfeasible problem if approached directly, since it requires an enumeration of all possible permutations. Therefore, they resort to the VMA consensus function of (Dimitriadou, Weingessel, and Hornik, 2001) for finding an approximate solution to the problem, extending its application to soft cluster ensembles.

One of the two consensus functions proposed in (Dudoit and Fridlyand, 2003), called BagClust1, is based on applying plurality voting on the labelings in the cluster ensemble after a label disambiguation process based on measuring the overlap between clusters. The generation of the cluster ensemble components follows a resampling strategy similar to bagging, aiming to reduce the variability in the partitioning results via consensus clustering.

A related proposal was the one presented in (Fischer and Buhmann, 2003). In that work, consensus clustering is viewed as a means for improving the quality and reliability of the results of path-based clustering, applying bagging for creating the hard cluster ensemble. The consensus clustering solution is obtained through a maximum likelihood mapping in which the label permutation problem is solved by means of the Hungarian method (Kuhn, 1955), which somehow resembles the application of plurality voting on the disambiguated individual partitions in the cluster ensemble (Ayad and Kamel, 2008). Moreover, a related reliability measure chooses the number of clusters with the highest stability as the preferable consensus solution.

In (Greene and Cunningham, 2006), a majority voting strategy was applied for generating the consensus clustering solution, after disambiguating the clusters using the Hungarian algorithm. An additional interest of that work is that it was one of the first research efforts that considered the problem of creating and combining a large number of clustering solutions in the context of high dimensional data sets (such as text document collections). Indeed, the authors point out that using large ensembles boosts computational cost, while small ensembles tend to produce unstable consensus clustering solutions. In this context, the authors propose basing the cluster ensemble construction and the consensus clustering tasks on a prototype reduction technique that allows representing the whole data set by means of a minimal set of objects, ensuring that the clustering results will approximate those that would be obtained on the original data set. By doing so, the final clustering solution can be extended to those objects that have been left out of the reduced data set representation while alleviating the overall computational cost of the whole process. In particular, the reduced version of the cluster ensemble is obtained by projecting the pairwise object similarity matrix by means of a kernel matrix.

The recent work of (Ayad and Kamel, 2008) presented a set of consensus functions based on cumulative voting (named URCV, RCV and ACV) whose time complexity scales linearly with the size of the data set. Another interesting feature is their capability of combining crisp partitions with different numbers of clusters, although the desired number of clusters k in the consensus clustering solution is a necessary parameter for the execution of these consensus functions. The proposals presented in that work are based on the computation of a probabilistic mapping for solving the cluster correspondence problem (instead of the classic one-to-one cluster mapping), which allows combining partitions with different numbers of clusters while avoiding the addition of dummy clusters as in (Dimitriadou, Weingessel, and Hornik, 2002). In particular, three ways of deriving such a probabilistic mapping based on the idea of cumulative voting are presented. The construction of the consensus clustering solution is a two-stage procedure: firstly, based on the cumulative vote mapping, a tentative consensus is derived as a summary of the cluster ensemble maximizing the information content in terms of entropy. Secondly, the final consensus clustering solution is extracted by applying an agglomerative clustering algorithm that minimizes the average generalized Jensen-Shannon divergence within each cluster.

As already mentioned, solving the cluster correspondence problem paves the way for the application of voting strategies for combining the outcomes of multiple clustering processes. This issue is the central focus of (Boulis and Ostendorf, 2004), which presented several methods for finding the correspondence between the clusters of the individual partitions in the cluster ensemble. Two of these proposals are based on linear optimization techniques, which are applied on an objective function that measures the degree of agreement among clusters. In contrast, the third cluster correspondence method is based on Singular Value Decomposition, and it sets cluster correspondences based on cluster correlation. All these methods operate on a common space where the clusters of the distinct partitions (which can be either crisp or fuzzy) are represented by means of cluster co-association matrices.

2.2.2 Consensus functions based on graph partitioning

The work by Strehl and Ghosh on consensus clustering based on graph partitioning is probably one of the most classic references in the field of cluster ensembles (Strehl and Ghosh, 2002). To our knowledge, they were the first to formulate the consensus clustering problem in an information theoretic framework (i.e. the consensus clustering solution should be the one maximizing the mutual information with respect to all the individual partitions in the cluster ensemble), a path followed by other authors in subsequent works (Fred and Jain, 2003). In view of its prohibitive cost when formulated as a combinatorial optimization problem in terms of shared mutual information, the authors propose three clustering combination heuristics based on deriving a hypergraph representation of the cluster ensemble, all of which require the desired number of clusters k as one of their parameters. The first consensus function (called Cluster-based Similarity Partitioning Algorithm or CSPA) induces a pairwise object similarity measure from the cluster ensemble (as in (Fred and Jain, 2002a)), obtaining the consensus partition by reclustering the objects with the METIS graph partitioning algorithm (Karypis and Kumar, 1998). For this reason, we have enclosed the CSPA consensus function in both the graph partitioning and object co-association categories in table 2.1. In the second clustering combiner proposed in (Strehl and Ghosh, 2002), named HGPA for HyperGraph Partitioning Algorithm, the cluster ensemble problem is posed as the partitioning of a hypergraph (where hyperedges represent clusters) into k unconnected components of approximately the same size, cutting a minimum number of hyperedges. The third consensus function (Meta-CLustering Algorithm or MCLA) views the clustering integration process as a cluster correspondence problem that is solved by identifying and consolidating groups of clusters (meta-clusters), which is done by applying graph-based clustering to the hyperedges of the hypergraph representation of the cluster ensemble. In (Strehl and Ghosh, 2002), the authors apply the proposed consensus functions on hard cluster ensembles, suggesting that they could be extended to a fuzzy clustering integration scenario. Such extensions (in particular, the soft versions of the CSPA and MCLA consensus functions, sCSPA and sMCLA, respectively) were introduced in (Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).
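To make the hypergraph representation more tangible, the sketch below builds the binary cluster membership indicator matrix of a hard ensemble (one column per cluster of every partition, i.e. one hyperedge each) and derives from it the CSPA-style pairwise similarity, namely the fraction of partitions in which two objects are co-clustered. The subsequent METIS-based partitioning of this similarity graph is not reproduced here, and the function names and toy data are illustrative assumptions.

```python
import numpy as np

def indicator_matrix(E, k):
    """Binary membership matrix H (n x kl): column i*k + c indicates the
    objects assigned to cluster c by the i-th partition (one hyperedge each)."""
    l, n = E.shape
    H = np.zeros((n, l * k), dtype=int)
    for i in range(l):
        for j in range(n):
            H[j, i * k + E[i, j]] = 1
    return H

def cspa_similarity(E, k):
    """CSPA-style pairwise similarity: fraction of the l partitions in which
    each pair of objects is co-clustered, computed as (1/l) * H @ H.T."""
    l = E.shape[0]
    H = indicator_matrix(E, k)
    return H @ H.T / l

E = np.array([[0, 0, 0, 2, 2, 2, 1, 1, 1],
              [1, 1, 1, 0, 0, 0, 2, 2, 2],
              [1, 1, 1, 2, 2, 2, 0, 0, 0]])
S = cspa_similarity(E, k=3)
print(S[0, 1], S[0, 3])   # 1.0 (always together) and 0.0 (never together)
```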

The clustering combination problem was also formulated as a graph partitioning problem in (Fern and Brodley, 2004). In particular, a bipartite graph is built from a hard cluster ensemble, although the authors suggest that the same method can be applied for combining soft clustering solutions after introducing minor modifications to the proposed consensus function, which is called HBGF for Hybrid Bipartite Graph Formulation. As in previous graph partitioning approaches to clustering combination, the desired number of clusters must be set a priori and passed as a parameter of the consensus function. In contrast, HBGF considers object and cluster similarity simultaneously when creating the consensus clustering solution, an issue not considered by other graph partition based consensus functions such as CSPA and MCLA (Strehl and Ghosh, 2002).

More recently, the BALLS consensus function (Gionis, Mannila, and Tsaparas, 2007) operates on a graph representation of the pairwise object co-dissociation matrix, viewing its vertices as the objects in the data set, its edges being weighted by the pairwise object distances. The rationale of the consensus clustering creation process is the iterative construction of consensus clusters from compact and relatively isolated sets of close vertices, which are then removed from the graph.

2.2.3 Consensus functions based on object co-association measures

The approach to consensus clustering based on object co-association measures relies on the assumption that objects belonging to a natural cluster are likely to be co-located in the same cluster by the different clusterings of the cluster ensemble. Therefore, pairwise object co-occurrences are deemed as votes, which are aggregated in an n × n object co-association matrix (where n is the number of instances contained in the data collection). A great advantage of this type of method is that it avoids cluster disambiguation processes, as the cluster ensemble is inspected object-wise, rather than cluster-wise. However, a downside of consensus functions based on object co-association is that their time and space complexities scale quadratically with n, thus making their application on large data sets highly costly or even unfeasible.

One of the pioneering works on the combination of hard clustering solutions based on object co-association metrics is the evidence accumulation (EAC) approach. In the original form of the evidence accumulation consensus function, the consensus clustering solution is obtained by applying a simple majority voting scheme on the co-association matrix (Fred, 2001). In subsequent versions, consensus is derived by applying the single-link hierarchical clustering algorithm on the object co-association matrix, regarding it as a measure of the similarity between objects (Fred and Jain, 2002a); a virtually identical proposal is found in a contemporary work (Zeng et al., 2002). In (Fred and Jain, 2003), the evidence accumulation approach is formulated in an information-theoretic framework, defining the optimal consensus clustering solution as the one maximizing the sum of normalized mutual information (φ(NMI)) with respect to all the partitions in the cluster ensemble. The authors prove that, by maximizing the number of shared objects in the consolidated clusters, the EAC consensus function maximizes the aforementioned information theoretical objective function, although reaching its global optimum is not ensured in all situations. Moreover, cutting the dendrograms resulting from the application of the single-link clustering on the co-association matrix at the highest lifetime level leads to a minimization of the variance of the average φ(NMI), which guarantees the robustness of the clustering solution to small variations in the composition of the cluster ensemble; furthermore, this also avoids making assumptions on the number of clusters, a significant advantage with respect to other consensus functions. A compendium on the evidence accumulation consensus clustering approach is presented in (Fred and Jain, 2005), extending the previous consensus functions through the application of other hierarchical clustering algorithms on the pairwise object co-association matrix.
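A minimal sketch of the evidence accumulation idea under these definitions: pairwise co-occurrence votes are accumulated into an n × n co-association matrix, which is then turned into a dissimilarity and fed to single-link hierarchical clustering. For brevity, the dendrogram is cut into a fixed number of clusters k rather than at the highest lifetime level, so this is a simplification of EAC, not a faithful reimplementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_matrix(E):
    """n x n matrix whose (i, j) entry is the fraction of the l partitions
    in E (l x n hard ensemble) that place objects i and j in the same cluster."""
    l, n = E.shape
    C = np.zeros((n, n))
    for labels in E:
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / l

def eac_consensus(E, k):
    """Evidence-accumulation-style consensus: single-link clustering of the
    co-association matrix, here simply cut into k consensus clusters."""
    C = coassociation_matrix(E)
    D = 1.0 - C                       # turn co-association into a dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="single")
    return fcluster(Z, t=k, criterion="maxclust")

E = np.array([[1, 1, 1, 3, 3, 3, 2, 2, 2],
              [2, 2, 2, 1, 1, 1, 3, 3, 3],
              [2, 2, 2, 3, 3, 3, 1, 1, 1]])
print(eac_consensus(E, k=3))   # three consensus clusters, e.g. [1 1 1 2 2 2 3 3 3]
```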

The clustering of high dimensional data is the main motivation of the work presented in (Fern and Brodley, 2003). In this scenario, Random Projection (RP) is an efficient dimensionality reduction technique, although it often gives rise to highly unstable clustering results. In order to reduce this variability, the authors propose creating cluster ensembles by compiling partitions resulting from distinct RP runs, combining them using a consensus function very similar to EAC, as it applies an agglomerative clustering algorithm on an object similarity matrix.

One of the two consensus functions presented by Dudoit and Fridlyand (2003), named BagClust2, resembles evidence accumulation, as it builds a pairwise object dissimilarity matrix which is subject to a partitioning process for obtaining the consensus clustering. However, BagClust2 and EAC differ in that the former requires that the desired number of clusters be passed as a parameter to the consensus function (the same happens with BagClust1).

In (Greene et al., 2004), consensus clustering was conducted by means of variants of the EAC consensus functions using distinct hierarchical clustering algorithms (i.e. single-link, complete-link and average-link) for partitioning the pairwise object co-association matrix, as proposed in (Fred and Jain, 2005). However, the central matter of study in that work is the analysis of the diversity of the cluster ensemble as a factor determining the quality of the consensus clustering. In this sense, the authors focused on random techniques for introducing diversity in the cluster ensemble, such as random subspacing, random algorithm initialization, random number of clusters or random feature projection.

A related work is the Majority Rule consensus function of (Goder and Filkov, 2008), which is also based on clustering the pairwise object co-dissociation matrix. This can be done by simply setting a dissimilarity threshold, as in the first version of EAC (Fred, 2001), or by applying the average-link hierarchical clustering algorithm, as in the latest versions of EAC (Fred and Jain, 2005).

Moreover, there exist several consensus functions that make indirect use of pairwise object co-association (or co-dissociation) matrices, even though the way the consensus clustering is obtained differs from that of EAC. Examples of this include some graph partition-based consensus functions, such as CSPA (Strehl and Ghosh, 2002) and BALLS (Gionis, Mannila, and Tsaparas, 2007), the Iterative Pairwise Consensus (IPC) (Nguyen and Caruana, 2007) (a consensus function based on cluster centroids in which objects are iteratively reassigned to the clusters of the consensus partition according to their similarity), or the CC Pivot consensus function (Goder and Filkov, 2008), which obtains the consensus partition by conducting an iterative pivoting on the object dissimilarity matrix.

2.2.4 Consensus functions based on categorical clustering

A different approach to consensus clustering is the one related to categorical clustering, which basically consists of transforming the contents of the cluster ensemble into quantitative features that represent the objects, and subsequently clustering them according to this novel representation, thus obtaining the consensus partition. The QMI (Quadratic Mutual Information) consensus function of (Topchy, Jain, and Punch, 2003) posed the problem of combining the partitions contained in a hard cluster ensemble in an information theoretic framework, and consists of applying the k-means clustering algorithm on this new feature space, which forces the user to set the desired number of clusters k in advance.

In (Punera and Ghosh, 2007), a novel fuzzy consensus function based on the Information Theoretic K-means (ITK) algorithm was presented. Its rationale follows a similar approach to that of (Topchy, Jain, and Punch, 2003). In this case, though, consensus clustering is conducted on soft cluster ensembles (i.e. the compilation of the outcomes of multiple fuzzy clustering processes), so each object in the data set is represented by means of the concatenated posterior cluster membership probability distributions corresponding to each one of the l fuzzy partitions in the cluster ensemble. Thus, using the Kullback-Leibler divergence (KLD) between those probability distributions as a measure of the distance between objects, the k-means algorithm is applied so as to obtain the consensus clustering solution. Note that the ITK consensus function is capable of combining fuzzy partitions with a variable number of clusters, while producing a crisp consensus clustering solution. Moreover, this consensus function allows assigning distinct weights to each clustering in the cluster ensemble, which can be useful for the user to express his/her confidence in the quality of some individual clusterings.
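Following this categorical-clustering view, the sketch below represents every object by its concatenated posterior cluster membership probabilities (one column of the soft ensemble) and reclusters those feature vectors. Plain Euclidean k-means is used here as a simplifying assumption; the actual ITK consensus function relies on an information-theoretic k-means driven by the Kullback-Leibler divergence. The function name and toy values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def categorical_consensus(E_soft, k, seed=0):
    """Treat each object's concatenated posteriors (a kl-dimensional column of
    the soft ensemble) as its new feature vector and recluster with k-means,
    yielding a crisp consensus clustering of the n objects."""
    features = E_soft.T                  # n x kl: one row of posteriors per object
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(features)

# Toy soft ensemble: l = 2 fuzzy clusterings, k = 2, n = 4 objects (kl x n)
E_soft = np.array([[0.9, 0.8, 0.2, 0.1],
                   [0.1, 0.2, 0.8, 0.9],
                   [0.3, 0.1, 0.7, 0.6],
                   [0.7, 0.9, 0.3, 0.4]])
print(categorical_consensus(E_soft, k=2))   # e.g. [0 0 1 1]
```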

2.2.5 Consensus functions based on probabilistic approaches

Consensus clustering has also been approached from a probabilistic perspective. One of the pioneering works in this direction was the Expectation-Maximization (EM) consensus function proposed in (Topchy, Jain, and Punch, 2004), where a probabilistic model of the consensus clustering solution is defined in the space of the contributing clusters. Such a model is based on a finite mixture of multinomial distributions, each component of which corresponds to a cluster of the combined clustering, which is obtained as the solution to the maximum likelihood problem solved by means of the EM algorithm. Contrasting with other consensus functions, the authors highlight the low computational complexity of the proposed method and its ability to combine partitions with different numbers of clusters.

Another probabilistic approach to the consensus clustering problem was presented in (Long, Zhang, and Yu, 2005). The central matter in that work was finding a solution to the cluster correspondence problem (which, as mentioned earlier, is due to the symbolic identification of clusters caused by the unsupervised nature of the clustering problem). In particular, the goal was to derive a correspondence matrix that disambiguates the clusters of each individual clustering in the cluster ensemble (represented as a probabilistic or binary membership matrix depending on whether the cluster ensemble is soft or hard) with regard to a hypothetical consensus clustering solution membership matrix. The goal is to find a correspondence matrix that yields the best projection of each individual clustering on the space defined by the consensus clustering solution. From a practical viewpoint, both the correspondence and consensus clustering matrices are derived simultaneously, using an EM-like approach.

The beautiful proposal of (Lange and Buhmann, 2005) introduced a consensus function named Probabilistic Label Aggregation (PLA), which operates on soft cluster ensembles (although it also works on crisp ones). Its rationale is as follows: given a single fuzzy partition, a pairwise object co-association matrix is created by simply multiplying the membership probability matrix by its own transpose. Repeating this process on all the partitions in the soft cluster ensemble and aggregating (and subsequently normalizing) the resulting matrices gives rise to a joint probability matrix of finding two objects in the same cluster. Neatly, the authors propose subjecting this joint probability matrix to a non-negative matrix factorization process that yields estimates for class-likelihoods and class-posteriors, upon which the consensus clustering solution is based. This factorization process is posed as an optimization problem which is solved by applying the EM algorithm. Besides the elegance of the proposed solution, this work also stands out for the fact that it supports an out-of-sample extension that makes it possible to assign previously unseen objects to classes of the consensus clustering solution. Moreover, the proposed method also allows combining weighted partitions, i.e. it gives the user the chance to assign different degrees of relevance to the cluster ensemble partitions.
A closely related proposal is the application of Non-Negative Matrix Factorization<br />

(NMF) for solving the consensus clustering problem presented (Li, Ding, and Jordan, 2007).<br />

In contrast to (<strong>La</strong>nge and Buhmann, 2005), the aim is to combine crisp partitions, which<br />

imposes a series of constraints on the optimization problem that is solved via symmetric<br />

NMF—which, from an algorithmic viewpoint, is implemented by means of multiplicative<br />

rules. Moreover, the same approach is employed for conducting semi-supervised clustering,<br />

a problem that lies beyond the scope of this work.<br />

2.2.6 Consensus functions based on reinforcement learning

Reinforcement learning has also been applied to the construction of consensus clustering solutions (Agogino and Tumer, 2006). In that work, the average φ(NMI) of the consensus clustering solution with respect to the cluster ensemble is regarded as the reward that must be maximized by the actions of the agents. In this case, each agent casts a vote indicating which cluster each object should be assigned to (i.e. it operates on hard cluster ensembles). The application of a majority voting scheme on these votes yields the consensus clustering solution, which is iteratively refined as the agents learn how to vote so as to maximize the average φ(NMI). The authors highlight the ease of their approach for combining clusterings in distributed scenarios, which makes it especially suitable in failure-prone domains.
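The reward described above is straightforward to evaluate for any candidate consensus labeling. Assuming that scikit-learn's normalized mutual information score is an acceptable stand-in for φ(NMI), a minimal sketch is:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def average_nmi(candidate, E):
    """Average normalized mutual information between a candidate consensus
    labeling and every individual partition of the hard ensemble E (l x n);
    this is the quantity used as a reward in (Agogino and Tumer, 2006)."""
    return float(np.mean([normalized_mutual_info_score(candidate, row) for row in E]))

E = np.array([[1, 1, 1, 3, 3, 3, 2, 2, 2],
              [2, 2, 2, 1, 1, 1, 3, 3, 3],
              [2, 2, 2, 3, 3, 3, 1, 1, 1]])
candidate = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
print(average_nmi(candidate, E))   # 1.0: the candidate matches every partition
```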

2.2.7 Consensus functions based on interpeting object similarity as data<br />

The work by Kuncheva, Hadjitodorov, and Todorova (2006) introduced three consensus<br />

functions based on interpreting object similarity as data. That is, each object is represented<br />

by n features, where the jth feature of the ith object corresponds to the co-association<br />

40


Chapter 2. Cluster ensembles and consensus clustering<br />

strength between the ith and the jth objects. The authors base these consensus functions<br />

on the proved suitability of using similarity measures as object features in classification<br />

problems (Kalska, 2005). Thus, the consensus clustering solution is obtained by applying<br />

standard clustering algorithms on the pairwise object co-association matrix. In particular,<br />

the hierarchical average-link, single-link and k-means clustering algorithms are applied,<br />

giving rise to the ALSAD, SLSAD and KMSAD consensus functions. Notice that these<br />

consensus functions, despite being based on partitioning the object co-association matrix (just like EAC, for instance), differ from the latter in that its contents are not interpreted as measures of similarity between objects, but rather as attributes in a new feature space.
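A minimal sketch of this "similarity as data" idea follows (assuming NumPy and scikit-learn are available; the helper names are ours): each row of the pairwise co-association matrix is used as the feature vector of the corresponding object and fed to k-means, roughly in the spirit of KMSAD.

import numpy as np
from sklearn.cluster import KMeans

def coassociation(labels):
    """n x n matrix whose (i, j) entry is the fraction of partitions co-clustering i and j."""
    labels = np.asarray(labels)                                  # (l, n)
    return (labels[:, :, None] == labels[:, None, :]).mean(axis=0)

def kmsad_like(labels, k, seed=0):
    """Cluster the objects using the rows of the co-association matrix as features
    (similarities interpreted as data), in the spirit of KMSAD."""
    features = coassociation(labels)                             # each object -> n features
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)

if __name__ == "__main__":
    ensemble = np.array([[0, 0, 0, 1, 1, 1],
                         [0, 0, 1, 1, 2, 2],
                         [1, 1, 1, 0, 0, 0]])
    print(kmsad_like(ensemble, k=2))

Replacing KMeans with average-link or single-link agglomerative clustering on the same feature matrix would give the ALSAD- and SLSAD-style variants described above.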

2.2.8 Consensus functions based on cluster centroids<br />

The time and memory scalability problems derived from combining clusterings of large data<br />

sets is the principal motivation of several works that tackle the consensus clustering problem<br />

following a centroid-based approach (Hore, Hall, and Goldgof, 2006; Nguyen and Caruana,<br />

2007). The underlying rationale consists of representing the cluster ensemble components<br />

in terms of the centroids of their clusters, instead of label vectors. By doing so, storage<br />

inconveniences are alleviated, as the number of clusters k is usually much lower than the<br />

number of objects n. In (Hore, Hall, and Goldgof, 2006), moreover, the authors highlight<br />

that the parallelization of the cluster ensemble construction process would dramatically<br />

decrease its time complexity. As regards the creation of the consensus clustering solution,<br />

it is based on computing the average centroid of each cluster, after disambiguating them<br />

by means of the Hungarian algorithm. Furthermore, that work introduced the possibility<br />

of discarding bad clusters from the cluster ensemble at consensus clustering creation time.<br />

In (Nguyen and Caruana, 2007), three iterative consensus functions were presented and<br />

empirically compared with eleven other clustering combiners in a complete experimental<br />

study. The proposal in that work derived the consensus clustering solution in terms of the<br />

centroids of its clusters. Although the proposed consensus functions are capable of combining<br />

clusterings with a variable number of clusters, all the individual partitions contained<br />

in the hard cluster ensembles used in the experiments have the same number of clusters<br />

for simplicity. The first consensus function proposed in this work, called Iterative Voting<br />

Consensus (IVC), is based on recursively computing the centroids of the consensus solution<br />

clusters, and assigning each object to the nearest cluster, which is determined in terms<br />

of the Hamming distance. This procedure is iterated until the centroids of the consensus<br />

clustering solution reach a stable state. The second proposal (named Iterative Probabilistic<br />

Voting Consensus or IPVC), is a variant of IVC in which objects are iteratively assigned<br />

to consensus clusters in terms of their distance with respect to the objects that have been<br />

previously assigned to them. And in the third proposed consensus function, Iterative Pairwise<br />

Consensus or IPC, objects are iteratively reassigned to consensus clusters according to<br />

their similarity as measured by the pairwise object co-association matrix.<br />
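The following sketch illustrates the flavour of the first of these proposals (a simplified reading of IVC under our own naming, not the authors' reference implementation): objects are represented by their vectors of ensemble labels, the consensus cluster centroids are the per-partition majority labels, and objects are reassigned to the closest centroid in terms of Hamming distance until the assignment stabilizes.

import numpy as np

def ivc(labels, k, max_iter=100, seed=0):
    """Simplified Iterative Voting Consensus on a hard cluster ensemble.
    labels: (l, n) array, one row per ensemble partition."""
    x = np.asarray(labels).T                          # (n, l): label vector per object
    n, l = x.shape
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=n)               # random initial consensus partition
    for _ in range(max_iter):
        # Centroid of each consensus cluster: per-partition majority label.
        cent = np.zeros((k, l), dtype=int)
        for c in range(k):
            members = x[assign == c]
            if len(members) == 0:
                members = x[rng.integers(0, n, size=1)]   # re-seed an empty cluster
            for j in range(l):
                cent[c, j] = np.bincount(members[:, j]).argmax()
        # Reassign each object to the centroid at minimum Hamming distance.
        dist = (x[:, None, :] != cent[None, :, :]).sum(axis=2)   # (n, k)
        new_assign = dist.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign

if __name__ == "__main__":
    ensemble = [[0, 0, 0, 1, 1, 1],
                [0, 0, 1, 1, 2, 2],
                [1, 1, 1, 0, 0, 0]]
    print(ivc(ensemble, k=2))

IPVC and IPC would replace the centroid step by distances to previously assigned objects and by the pairwise co-association matrix, respectively.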

2.2.9 Consensus functions based on correlation clustering<br />

Recently, the connection between the late-emerging problem of correlation clustering and<br />

consensus clustering was exploited for deriving novel consensus functions capable of determining<br />

the most natural number of clusters (Gionis, Mannila, and Tsaparas, 2007). In<br />

that work, the cluster ensemble is modeled as a graph resembling a pairwise object co-<br />


dissociation matrix, and the consensus clustering solution is defined as the one minimizing<br />

the disagreements with respect to the individual partitions contained in the cluster ensemble.<br />

In particular, three consensus functions based on correlation clustering were presented<br />

in this work, which are briefly described next. Firstly, the AGGLOMERATIVE consensus function<br />

results from applying a standard bottom-up procedure for correlation clustering on<br />

cluster ensembles. Resorting to the graph view of the pairwise object distance matrix, the<br />

AGGLOMERATIVE algorithm follows an iterative merging process that joins objects in clusters<br />

depending on whether their average distance is below a predefined threshold, stopping when<br />

further cluster merging does not reduce the number of disagreements of the consensus solution<br />

with respect to the cluster ensemble. Secondly, the FURTHEST consensus function can<br />

be regarded as the converse of AGGLOMERATIVE , as it consists of a top-down procedure that<br />

iteratively separates maximally distant graph vertices into consensus clusters, assigning<br />

the remaining objects to the cluster that minimizes the overall number of disagreements.<br />

This process is stopped when no disagreement reduction is achieved from additional cluster<br />

splitting. And thirdly, the LOCALSEARCH algorithm is derived from the application of a local<br />

search correlation clustering heuristic, which is based on a greedy procedure that, starting<br />

with a specific (possibly random) partition of the graph, tries to minimize the number of<br />

disagreements resulting from moving objects to different clusters or creating new singleton<br />

clusters, stopping when no move can decrease the disagreements rate. Interestingly, the authors<br />

point out that, despite its high computational cost, the LOCALSEARCH algorithm can be<br />

employed as a post-processing step for refining a previously obtained consensus clustering<br />

solution.<br />

2.2.10 Consensus functions based on search techniques<br />

In (Goder and Filkov, 2008), two consensus functions based on search techniques were<br />

introduced. Their rationale consists of building the consensus clustering solution by means<br />

of a greedy search process aiming to minimize the cost function —the authors implement<br />

such search processes either by means of Simulated Annealing (SA), as in (Filkov and Skiena, 2004), or by means of successive single object movements that guarantee the largest decrease of the

cost function (Best One Element Moves or BOEM).<br />
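As a rough illustration of the second idea (best single-object moves), the sketch below greedily applies the relabelling of one object that most decreases a pairwise co-clustering disagreement cost; it is an illustrative reading of BOEM under these assumptions and naming, not the code of (Goder and Filkov, 2008).

import numpy as np

def disagreements(candidate, labels):
    """Pairwise co-clustering disagreements between a candidate partition and the ensemble."""
    cand_co = candidate[:, None] == candidate[None, :]
    return sum(int(np.sum(cand_co != (lab[:, None] == lab[None, :]))) // 2 for lab in labels)

def boem(labels, k, max_moves=200, seed=0):
    """Greedy Best One Element Moves: repeatedly apply the single-object relabelling
    that yields the largest decrease of the disagreement cost, until no move helps."""
    labels = np.asarray(labels)
    n = labels.shape[1]
    rng = np.random.default_rng(seed)
    cand = rng.integers(0, k, size=n)
    cost = disagreements(cand, labels)
    for _ in range(max_moves):
        best_move, best_cost = None, cost
        for i in range(n):
            for c in range(k):
                if c == cand[i]:
                    continue
                trial = cand.copy()
                trial[i] = c
                trial_cost = disagreements(trial, labels)
                if trial_cost < best_cost:
                    best_move, best_cost = (i, c), trial_cost
        if best_move is None:
            break
        cand[best_move[0]] = best_move[1]
        cost = best_cost
    return cand

if __name__ == "__main__":
    ensemble = np.array([[0, 0, 0, 1, 1, 1],
                        [0, 0, 1, 1, 2, 2],
                        [1, 1, 1, 0, 0, 0]])
    print(boem(ensemble, k=2))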

2.2.11 Consensus functions based on cluster ensemble component selection<br />

Recall that the aim of any consensus clustering process is to obtain a single partition<br />

from a collection of l clustering solutions. As an alternative means for achieving that<br />

goal, cluster ensemble component selection techniques are based on obtaining the consensus<br />

clustering solution by selection, not by combination. For instance, the BESTCLUSTERING<br />

algorithm (Gionis, Mannila, and Tsaparas, 2007) is not a consensus function proper, as it<br />

identifies as the consensus clustering the individual partition that minimizes the number of<br />

disagreements with respect to the remaining clusterings in the cluster ensemble.<br />

Following a very similar approach, the Best of K (BOK) consensus function is based on<br />

selecting that individual clustering from the cluster ensemble that minimizes the number<br />

of pairwise co-clustering disagreements between the individual partitions in the cluster<br />

ensemble (Goder and Filkov, 2008).<br />
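A small sketch of this selection-by-disagreement idea (again under our own naming and a brute-force pairwise cost, purely for illustration) could look as follows:

import numpy as np

def pairwise_disagreement(a, b):
    """Number of object pairs that one partition co-clusters and the other does not."""
    return int(np.sum((a[:, None] == a[None, :]) != (b[:, None] == b[None, :]))) // 2

def best_of_k(labels):
    """BOK-style selection: return the ensemble member minimizing its total
    pairwise co-clustering disagreement with the remaining partitions."""
    labels = np.asarray(labels)
    costs = [sum(pairwise_disagreement(p, q) for q in labels) for p in labels]
    return labels[int(np.argmin(costs))]

if __name__ == "__main__":
    ensemble = np.array([[0, 0, 0, 1, 1, 1],
                         [0, 0, 1, 1, 2, 2],
                         [1, 1, 1, 0, 0, 0]])
    print(best_of_k(ensemble))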


2.2.12 Other interesting works on consensus clustering<br />

There exist several works in the literature devoted to the experimental comparison of the<br />

performance of consensus functions, the main interest of which lies in the evaluation of the<br />

quality of the consensus clusterings obtained.<br />

Examples of this include the work by Minaei-Bidgoli, Topchy, and Punch (2004), where<br />

a data resampling scheme was presented as a means for improving the robustness and stability<br />

of the consensus clustering solution. In that work, the EAC, BagClust2, QMI, CSPA,<br />

HGPA and MCLA consensus functions are experimentally compared when operating on<br />

hard cluster ensembles created from bootstrap partitions on several artificial and real data<br />

collections. The main conclusion drawn is that, as expected, there exists no universally superior<br />

consensus function, as each consensus function explores the data set in different ways,<br />

thus its efficiency greatly depends on the existing structure in the data set. Another extensive<br />

and interesting performance comparison between several consensus functions operating<br />

on small hard cluster ensembles is presented in (Kuncheva, Hadjitodorov, and Todorova,<br />

2006). Recently, the application of consensus clustering as a means for avoiding the obtention<br />

of suboptimal clustering solutions when applying non-parametric clustering algorithms<br />

on text document collections is tackled in both (Gonzàlez and Turmo, 2008a) and (Gonzàlez<br />

and Turmo, 2008b). These works compared i) the performance of several non-parametric<br />

clustering algorithms across six text corpora, and ii) the quality of the consensus clustering<br />

solution when it is built –using some of the consensus functions presented in (Gionis,<br />

Mannila, and Tsaparas, 2007)– upon homogeneous and heterogeneous cluster ensembles.<br />

In most of these inter-consensus functions comparisons, the evaluation of their computational<br />

complexity is often given marginal importance, although it becomes a critical aspect<br />

when it comes to their application in practice, especially when dealing with large data<br />

collections or cluster ensembles containing many partitions. This is the main motivation<br />

behind the data sampling strategy proposed in (Greene and Cunningham, 2006; Gionis,<br />

Mannila, and Tsaparas, 2007). The proposal of the latter work is the SAMPLING algorithm,<br />

which consists of performing a sufficient subsampling of the objects in the data set –thus<br />

constructing the consensus clustering solution on a reduced cluster ensemble–, and the subsequent<br />

extension of the combined clustering solution on those objects that have been left<br />

out of the subsampling process. The time complexity of these two processes is linear with<br />

the data set size, which can lead to relevant time savings when dealing with large data<br />

collections.<br />

Another variant of the consensus clustering problem is the weighted combination of<br />

clusterings, which constitutes the central point of (Gonzàlez and Turmo, 2006). The idea<br />

behind weighted consensus clustering is the possibility of giving more relevance to some<br />

components of the cluster ensemble, as they may better describe the structure of the data<br />

set. Thus, it makes sense to combine clusterings in a weighted manner, emphasizing the<br />

contribution of those components deemed as the best ones in the ensemble. Besides designing<br />

consensus functions capable of combining weighted partitions, it is necessary to devise<br />

strategies for setting the proper weight of each individual clustering, which is not trivial in<br />

an unsupervised scenario. In this work, hypergraph-based (Strehl and Ghosh, 2002) and<br />

probabilistic (Topchy, Jain, and Punch, 2004) consensus functions are modified so as to<br />

handle weighted partitions. Moreover, the best weighting scheme is determined by creating<br />

differently weighted cluster ensembles, and subsequently selecting the best option in an<br />

unsupervised manner through the maximization of a scoring function. Moreover, this con-<br />


sensus function allows assigning distinct weights to each clustering in the cluster ensemble,<br />

which can be useful for the user to express his/her confidence on the quality of some individual<br />

clusterings. The ITK consensus function of (Punera and Ghosh, 2007) also supports this kind of weighted combination of the individual clusterings in the ensemble.

Chapter 3

Hierarchical consensus architectures

As outlined in section 1.5, our proposal for building robust multimedia clustering systems<br />

lies in the creation of consensus clustering solutions upon cluster ensembles. These ensembles<br />

are made up of a large number of individual clusterings resulting from the execution of<br />

multiple clustering algorithms on several unimodal and multimodal representations of the<br />

objects contained in the data set.<br />

Indeed, the massive crossing between clustering algorithms, object representations and<br />

data modalities is a simple and parallelizable manner of generating highly diverse heterogeneous<br />

cluster ensembles, entrusting the obtention of a meaningful combined clustering<br />

solution to the consensus clustering task.<br />

Given the unsupervised nature of the clustering problem, we think this is a pretty<br />

sensible way of proceeding so as to obtain clustering solutions robust to the influence of the<br />

clustering indeterminacies, as sticking to the use of a handful of clustering algorithms or<br />

object representations can lead to an involuntary and undesirable limitation as regards the<br />

quality and diversity of the cluster ensemble components.<br />

However, at the same time that this strategy allows the creation of rich cluster ensembles,<br />

it also introduces several drawbacks that affect the consensus clustering task:<br />

– the large number of individual clustering solutions contained in the cluster ensemble,<br />

resulting from the aforementioned combination of clustering algorithms, object representations<br />

and data modalities, often leads to a notable increase in the computational<br />

cost of the execution of the consensus function, which can even become prohibitive.<br />

– this same fact affects the diversity and quality of the cluster ensemble components,<br />

and, while moderate diversity has been found to be beneficial as far as consensus<br />

clustering is concerned (Hadjitodorov, Kuncheva, and Todorova, 2006; Fern and Lin,<br />

2008), the existence of poor quality clustering solutions in the cluster ensemble may<br />

cause a detrimental effect on the quality of the consensus clustering solution.<br />

Allowing for these considerations, in this thesis we introduce the concept of self-refining<br />

hierarchical consensus architectures (SHCA), defined as a generic means for fighting against:<br />


– the computational complexity of combining a large number of individual partitions, by<br />

means of hierarchical consensus architectures, which consist in the layered construction<br />

of the consensus clustering solution through a hierarchical structure of low complexity<br />

intermediate consensus processes.<br />

– the negative bias induced by poor quality clusterings in the consensus clustering solution,<br />

by means of a self-refining post-processing that, using the obtained consensus<br />

clustering solution as a reference, builds a select and reduced cluster ensemble (i.e. a<br />

subset of the original cluster ensemble), deriving a new and refined consensus clustering<br />

upon it in a fully unsupervised manner.<br />

Although both strategies are complementary (indeed, they can be naturally combined<br />

giving rise to SHCA), their description and study is decoupled in the present and the<br />

next chapters, respectively. Thus, in our description and analysis of hierarchical consensus<br />

architectures (chapter 3), we ultimately aim to design computationally optimal consensus<br />

architectures and, consequently, we will solely focus on aspects regarding their time complexity.<br />

Meanwhile, the study of consensus self-refining procedures, which are presented in<br />

chapter 4, is centered on improving the quality of the consensus solutions yielded by the<br />

most computationally efficient consensus architectures devised in the present chapter.<br />

The introduction, the discussion of their rationale and the theoretical description of<br />

hierarchical consensus architectures are complemented by the presentation of multiple experiments<br />

analyzing multiple aspects of their performance on several real data collections.<br />

Last but not least, it is to note that although all the proposals put forward in this chapter are focused on a hard cluster ensemble scenario, they are also applicable to the combination of fuzzy clusterings.

3.1 Motivation<br />

The construction of consensus clustering solutions is usually tackled as a one-step process,<br />

in the sense that the whole cluster ensemble E is input to the consensus function F at once<br />

—see figure 3.1(a). This is what we call flat consensus clustering. However, as outlined in<br />

chapter 2, the time and space complexities of consensus functions typically scale linearly or<br />

quadratically with the size of the cluster ensemble l –i.e. O (l w ), where w ∈{1, 2}–, which<br />

may lead to a highly costly or even impossible execution of the consensus clustering task if<br />

it is to be conducted on a cluster ensemble containing a large number of partitions 1 .<br />

For this reason, a natural way for avoiding this limitation besides reducing the computational<br />

complexity of the consensus solution creation process consists in applying the<br />

classic divide-and-conquer strategy (Dasgupta, Papadimitriou, and Vazirani, 2006) which<br />

basically:<br />

– breaks the original problem into subproblems which are nothing but smaller instances of the same type of problem

– recursively solves these subproblems

– appropriately combines their outcomes

1 Moreover, the time complexity of consensus functions also depends –linearly or quadratically, see appendix A.5– on the number of objects in the data set n and the number of clusters k of the clusterings in the ensemble. However, as we assume that these two factors are constant for a given cluster ensemble corresponding to a specific data set, the only dependence of concern is that referring to the cluster ensemble size l.

[Figure 3.1: Flat vs hierarchical construction of a consensus clustering solution on a hard cluster ensemble. (a) Flat construction of a consensus clustering solution on a hard cluster ensemble; (b) hierarchical construction of a consensus clustering solution on a hard cluster ensemble.]

Transferring this strategy to the consensus clustering problem is equivalent to segmenting<br />

the cluster ensemble into subsets (referred to as mini-ensembles hereafter), building<br />

an intermediate consensus solution upon each mini-ensemble, and subsequently combining<br />

these halfway consensus clusterings into the final consensus clustering solution λc —see<br />

figure 3.1(b). Due to the fact that successive layers (or stages) of consensus solutions are<br />

created, we give this approach the name of hierarchical consensus architecture (HCA), as<br />

opposed to the traditional flat topology of consensus clustering processes.<br />

The rationale of hierarchical consensus architectures is pretty simple. By reducing the<br />

time and space complexities of each intermediate consensus clustering –which is achieved by<br />

creating it upon a smaller ensemble–, we aim to reduce the overall execution requirements<br />

(i.e. memory and, especially, CPU time), although a larger number of low cost consensus<br />

clustering processes must be run. However, this strategy is capable of yielding computational<br />

gains, as for large enough values of l, the execution of the original problem becomes<br />

slower than the recursive execution of the subproblems into which it is divided (Dasgupta,<br />

Papadimitriou, and Vazirani, 2006).<br />


An additional and very relevant point as regards the computational efficiency of hierarchical<br />

consensus architectures is that they naturally allow the parallel execution of the<br />

consensus clustering processes of every HCA stage —quite obviously, this will ultimately<br />

depend on the availability of computing resources. Thus, the degree of parallelism in executing<br />

the consensus of every HCA stage will set the lower and upper bounds of the time<br />

required for obtaining the final consensus clustering λc.<br />

In the best-case scenario, the HCA running time can be as low as the sum of the<br />

execution times of the longest-lasting consensus task of each stage of the architecture,<br />

provided that the available computational resources allow the parallel computation of all<br />

the intermediate consensus solutions of any given stage.<br />

On the contrary, if the execution of the halfway consensus is serialized, the time required<br />

for running the whole HCA amounts to the sum of the execution times of all the consensus<br />

processes of the stages of the hierarchical consensus architecture, which constitutes the<br />

upper bound of the running time of a hierarchical consensus architecture.<br />

Therefore, depending on the design of the HCA, the simultaneously available computing<br />

resources and the characteristics of the data set, structuring the consensus clustering task<br />

in a hierarchical manner may be more or less computationally beneficial (or not beneficial<br />

at all) as compared to its flat counterpart. From a practical viewpoint, our general idea is<br />

to provide the user with simple tools that, for a given consensus clustering problem, make it possible to decide a priori whether hierarchical consensus architectures are more computationally

efficient than traditional flat consensus and, if so, implement the HCA variant of minimal<br />

complexity.<br />

Moreover, it is important to highlight the fact that, in cases where the flat execution<br />

of the consensus function F becomes impossible due to memory limitations caused by the<br />

large size of the cluster ensemble, a carefully designed HCA will allow obtaining a consensus<br />

clustering solution.<br />

Let us now elaborate briefly on several notational definitions regarding hierarchical<br />

consensus architectures that will be of help when describing our proposals in detail. We<br />

suggest the reader resort to the generic HCA topology depicted in figure 3.1(b) for a better<br />

understanding of the concepts we are about to expose.<br />

Firstly, a hierarchical consensus architecture is structured in s successive stages. The<br />

number of intermediate consensus solutions obtained at the output of the ith stage is denoted<br />

as Ki —notice that Ks = 1 (i.e. the last stage yields the single final consensus<br />

clustering solution λc). The notation used for designating the jth halfway consensus clustering<br />

created at the ith HCA stage is λ^i_{cj}, where i ∈ [1, s−1] and j ∈ [1, Ki].

Another important factor in the definition of HCAs is the size of the mini-ensembles,<br />

which may vary from stage to stage or even within the same stage. For this reason, we<br />

denote as bij the size of the mini-ensemble upon which the jth consensus process of the ith<br />

HCA stage is conducted. Notice that bs1 = Ks−1 (i.e. the last consensus stage combines<br />

all the intermediate clusterings output by the previous stage into the single final consensus<br />

clustering solution λc), while, in the HCA presented in figure 3.1(b), b1j =2∀j ∈ [1,K1].<br />

Moreover, notice that hierarchical architectures naturally allow the use of distinct consensus<br />

functions across the distinct stages (or even within the same stage). However, in<br />

this work we assume that a single consensus function F is applied for conducting all the<br />

consensus processes involved.<br />


In this chapter, we propose two strategies for constructing hierarchical consensus architectures,<br />

which differ in i) the way mini-ensembles are created, and ii) which HCA<br />

parameters are tuned by the user. As a result, two HCA implementation alternatives are<br />

put forward:<br />

– random hierarchical consensus architectures, whose tunable parameter is the size of<br />

the mini-ensembles –the components of which are selected at random–, which eventually<br />

determines the HCA topology (i.e. its number of stages).<br />

– deterministic hierarchical consensus architectures, where the construction of the miniensembles<br />

is driven by the cluster ensemble creation process —in particular, the diversity<br />

factors used in creating the ensemble determine the number of HCA stages<br />

and the mini-ensembles components.<br />

The following sections are devoted to describing the rationale and implementation details<br />

of both HCA variants, specifying how the number of stages, the number of consensus<br />

processes per stage and the size of the mini-ensembles are determined in each case. This<br />

description is completed by an analysis of their computational complexity.<br />

3.2 Random hierarchical consensus architectures<br />

In this section, we introduce random hierarchical consensus architectures (RHCA for short), define their topology from a generic perspective and briefly describe their foundations, followed by an analysis of their computational complexity.

3.2.1 Rationale and definition<br />

The idea behind random hierarchical consensus architectures is to construct a regular pyramidal<br />

structure of intermediate consensus processes that delivers, at its top, the final consensus<br />

clustering solution λc. The term random refers not to the consensus architecture<br />

itself, but to the way mini-ensembles are created. In particular, the randomness of RHCA<br />

lies in the fact that the clusterings input to each stage of the hierarchical architecture are<br />

shuffled randomly.<br />

Besides this fact, the most distinctive feature of RHCA is that the user determines<br />

the size of the mini-ensembles, setting it to b, keeping it constant across the stages of<br />

the consensus architecture. Therefore, given a cluster ensemble containing l component<br />

clusterings and a mini-ensemble size set equal to b by the user, the number of stages s of<br />

the resulting RHCA is computed by equation (3.1).<br />

s = \begin{cases}
\lfloor \log_b(l) \rceil & \text{if } \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \le 1 \ \text{and} \ \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} > 1 \\
\lfloor \log_b(l) \rceil - 1 & \text{if } \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \le 1 \ \text{and} \ \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} = 1 \\
\lfloor \log_b(l) \rceil + 1 & \text{if } \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} > 1
\end{cases} \qquad (3.1)



where ⌊x⌉ denotes the operation of rounding x to the nearest integer (Hastad et al., 1988).<br />

The second option in equation (3.1) reduces the number of stages by one in the case that<br />

the penultimate RHCA stage already yields one consensus clustering, whereas the third one<br />

adds a supplementary stage so as to ensure the obtention of a single consensus solution at<br />

the output of the RHCA.<br />

Furthermore, the number of consensus solutions computed at the ith stage of the RHCA<br />

(where i ∈ [1,s]) is determined by the expression in equation (3.2).

K_i = \max\left( \left\lfloor \frac{l}{b^i} \right\rfloor , 1 \right) \qquad (3.2)

where ⌊x⌋ stands for the greatest integer less than or equal to x (i.e. the result of applying<br />

the floor function on number x).<br />

However, it is important to notice that it will only be possible to keep the mini-ensembles<br />

size constant all along the hierarchy (i.e. bij = b, ∀i ∈ [1,s] and ∀j ∈ [1,Ki]) when l is

an integer power of b. In the likely case that this condition is not met, in the current<br />

implementation of RHCA we choose, for simplicity, integrating the spare clusterings2 of the<br />

ith RHCA stage into its last (i.e. the Kith) mini-ensemble, thus introducing a bounded<br />

increase of its size, as b ≤ biKi < 2b. Moreover, the size of the mini-ensemble input to the<br />

sth stage is set to be equal to the number of halfway consensus output by the penultimate<br />

RHCA stage, as defined in equation (3.3).<br />

b_{ij} = \begin{cases}
b & \text{if } i < s \ \text{and} \ j < K_i \\
K_{i-1} - (K_i - 1)\, b & \text{if } i < s \ \text{and} \ j = K_i \\
K_{s-1} & \text{if } i = s
\end{cases} \qquad (3.3)

where K_0 = l denotes the size of the original cluster ensemble.



the mini-ensembles size, notice that the size of the third mini-ensemble of the first RHCA<br />

stage is increased (b13 =3) so that all the l = 7 components of the cluster ensemble are<br />

involved in one of the K1 = 3 consensus processes of the first RHCA stage. This also<br />

happens in the second stage, where b21 = 3 and K2 = 1, which, as just mentioned, yields a

single consensus at its output.<br />

The interested reader will find a more detailed description of these and other RHCA<br />

configuration examples in appendix C.1.<br />
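For illustration, the following Python sketch (ours, not the thesis implementation) reproduces this bookkeeping: each stage groups its input clusterings into mini-ensembles of size b, folds the spare clusterings into the last mini-ensemble, and adds stages until a single consensus clustering remains. It follows the textual description above rather than evaluating equation (3.1) literally, and it reproduces the toy topologies of figure 3.2.

def rhca_topology(l, b):
    """Mini-ensemble sizes per stage of a RHCA: group the inputs of each stage into
    mini-ensembles of size b, fold the spares into the last one, and iterate until a
    single consensus clustering remains (mirrors the behaviour of eqs. (3.1)-(3.3))."""
    topology, inputs = [], l
    while inputs > 1:
        k = max(inputs // b, 1)                           # consensus processes in this stage
        sizes = [b] * (k - 1) + [inputs - (k - 1) * b]    # last mini-ensemble absorbs spares
        topology.append(sizes)
        inputs = k                                        # next stage combines the k outputs
    return topology

if __name__ == "__main__":
    for l, b in [(8, 2), (9, 2), (7, 2), (57, 7)]:
        topo = rhca_topology(l, b)
        print(f"l={l:2d}, b={b}: s={len(topo)} stages, "
              f"K_i={[len(stage) for stage in topo]}, sizes={topo}")

For instance, l = 7 and b = 2 yields two stages with mini-ensemble sizes [2, 2, 3] and [3], matching the configuration discussed above.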

3.2.2 Computational complexity<br />

In the following paragraphs, we present a study of the asymptotic computational complexity<br />

of RHCA, considering both its fully serial and parallel implementations, which, as<br />

aforementioned, constitute the upper and lower bounds of the RHCA execution time.<br />

Serial RHCA<br />

For starters, the time complexity of the fully serialized implementation is considered. This<br />

means that the intermediate consensus tasks of each RHCA stage must be sequentially executed<br />

on a single computation unit. Recall that the time complexity of consensus functions<br />

typically grows linearly or quadratically with the cluster ensemble size, that is, it can be<br />

expressed as O (l w ), where w ∈{1, 2}. Therefore, the serial time complexity of a RHCA<br />

(STCRHCA) with s stages boils down to systematically adding the time complexities of all

the consensus processes executed across the whole RHCA, as defined in equation (3.4).<br />

\mathrm{STC}_{\mathrm{RHCA}} = \sum_{i=1}^{s} \sum_{j=1}^{K_i} O\!\left(b_{ij}^{\,w}\right) \qquad (3.4)

where Ki refers to the number of consensus processes executed in the ith RHCA stage, bij is<br />

the mini-ensemble size corresponding to the jth consensus process executed at the ith stage<br />

of the hierarchy —the exact value of these parameters is computed according to equations<br />

(3.2) and (3.3), respectively—, and O (bij w ) reflects the complexity of each intermediate<br />

consensus process.<br />

Equation (3.4) can be reformulated so as to obtain a compact expression of an upper<br />

bound of STCRHCA as a function of the user defined mini-ensembles size b. This requires<br />

recalling that, in the current RHCA implementation, the effective mini-ensembles size is<br />

bounded, that is, bij < 2b, ∀i ∈ [1,s] and ∀j ∈ [1,Ki]. Therefore, we can write:

\mathrm{STC}_{\mathrm{RHCA}} < \sum_{i=1}^{s} \sum_{j=1}^{K_i} O\!\left((2b)^w\right) \qquad (3.5)

Notice that, from an algorithmic viewpoint, equation (3.5) can be regarded as two nested<br />

loops where the number of iterations of the inner loop (Ki) depends on the value of the<br />

outer loop’s index (i). The number of iterations of the inner loop as a function of the outer<br />

loop’s index is presented in table 3.1.<br />

i      # inner loop iterations
1      K1
2      K2
...    ...
s      1

Table 3.1: Number of inner loop iterations as a function of the outer loop's index i.

Thus, it can be observed that the total number of times the mini-ensemble consensus of maximum complexity O((2b)^w) is executed is \sum_{i=1}^{s} K_i. The value of this summation can be approximated by considering that the number of consensus per stage K_i is also bounded, as K_i \le \frac{l}{b^i} (see equation (3.2)), yielding:

\sum_{i=1}^{s} K_i \le \sum_{i=1}^{s} \frac{l}{b^i} = l \cdot \frac{\frac{1}{b} - \frac{1}{b^{s+1}}}{1 - \frac{1}{b}} = l \cdot \frac{b^s - 1}{b^s\,(b-1)} \qquad (3.6)

which equals the partial sum of a geometric series whose common ratio is 1/b. Therefore, the upper bound of the time complexity STCRHCA can be rewritten as:

\mathrm{STC}_{\mathrm{RHCA}} < l \cdot \frac{b^s - 1}{b^s\,(b-1)} \cdot O\!\left((2b)^w\right) \qquad (3.7)

[Figure 3.2: Three examples of topologies of random hierarchical consensus architectures with distinct relationships between the cluster ensemble and mini-ensembles sizes, l and b, respectively. (a) Topology of a RHCA with l = 8 and b = 2, with s = 3; (b) topology of a RHCA with l = 9 and b = 2, with s = 3; (c) topology of a RHCA with l = 7 and b = 2, with s = 2.]

Parallel RHCA

Let us now consider the fully parallel implementation of RHCA, which assumes that enough computation units are simultaneously available to execute all the consensus processes of a given stage at once (i.e. K1 units, as K1 > Ki, ∀i ∈ [2,s]). In this case of maximum parallelism, the parallel time complexity of a RHCA (PTCRHCA) of s stages is formulated according to equation (3.9).

\mathrm{PTC}_{\mathrm{RHCA}} = \sum_{i=1}^{s} \max_{j \in [1, K_i]} \left( O\!\left(b_{ij}^{\,w}\right) \right) \qquad (3.9)

That is, the parallel time complexity of a RHCA is equal to the sum of the time complexities<br />

of the most time-consuming consensus process of each RHCA stage. Notice that<br />

it is not difficult to find an upper bound to PTCRHCA, as finding the maximum of O (bij w )<br />

just requires taking into account that bij < 2b, ∀i, j. Thus:<br />

\mathrm{PTC}_{\mathrm{RHCA}} < \sum_{i=1}^{s} O\!\left((2b)^w\right) = s \cdot O\!\left((2b)^w\right) = O\!\left(s \cdot (2b)^w\right) \qquad (3.10)

If the number of RHCA stages is approximated as s ≈ log b (l), and constants are<br />

dropped, equation (3.10) can be rewritten as a function of l and b:<br />

\mathrm{PTC}_{\mathrm{RHCA}} < O\!\left(\log_b(l)\,(2b)^w\right) = O\!\left(b^w \log_b(l)\right) \qquad (3.11)

[Figure 3.3: Evolution of RHCA parameters as a function of the mini-ensembles size b for cluster ensemble sizes ranging from 100 to 10000. (a) Evolution of the number of RHCA stages s as a function of b; (b) evolution of the total number of RHCA consensus processes as a function of b; (c) evolution of the mean mini-ensembles size as a function of b.]

3.2.3 Running time minimization

In light of the expressions of the upper bounds of the serial and parallel time complexities<br />

of RHCA, a naturally arising question is which particular RHCA configuration yields, for<br />

a given cluster ensemble, the minimal running time —notice that the user’s election of the<br />

mini-ensembles size b determines both the number of stages and of consensus computed per<br />

stage, see equations (3.1) and (3.2), which ultimately determines the running time of the<br />

RHCA.<br />

In fact, there exists a trade-off between the value of b and the execution time of the<br />

whole RHCA, as selecting a small value for b simultaneously reduces the time complexity of<br />

each consensus while increasing the total number of stages (s) and of consensus processes<br />

of the RHCA (\sum_{i=1}^{s} K_i), and vice versa.

With the purpose of visualizing the dependence between b and these factors, figure 3.3<br />

depicts their value for different cluster ensembles sizes l ∈ [100, 10000] as a function of the<br />

mini-ensembles size b ∈ [2, ⌊l/2⌋].

Firstly, figure 3.3(a) shows the exponential decay of the number of RHCA stages s as a<br />

function of b, caused by the fact that s is computed as the b-base logarithm of the cluster<br />

ensemble size l. Secondly, the evolution of the total number of consensus processes follows<br />

a fast exponential decay (hard to appreciate in this doubly logarithmic chart), as depicted<br />

in figure 3.3(b). And finally, figure 3.3(c) presents the mean value of the effective mini-ensembles size bij across the whole RHCA (which obviously scales linearly with b) as a rough

indicator of the complexity of each halfway consensus process, which will approximately be<br />

(linearly or quadratically) proportional to this value.<br />

Allowing for the evident dependence between the user defined mini-ensembles size b and<br />


the running of RHCA, it seems necessary to accurately choose the value of this parameter<br />

—regardless of whether the serial or parallel version of RHCA is implemented (in this latter<br />

case, notice that the RHCA running time still depends on b (via s) although it becomes insensitive to the total number of consensus processes, \sum_{i=1}^{s} K_i)—, as RHCA variants with different values of b may have dramatically different running times.

thespecificvalueofb that gives rise to the RHCA variant of minimal running time, making<br />

it also possible to decide aprioriwhether it is more computationally efficient than flat<br />

consensus.<br />

To do so, we propose a simple yet effective methodology for comparing distinct RHCA<br />

variants (i.e. with different b values) on computational grounds, a detailed description of<br />

which is presented in table 3.2. In a nutshell, the proposed strategy is based on estimating<br />

the RHCA variants running time using equations (3.4) or (3.9) (depending on whether the<br />

serial or parallel version is to be implemented), replacing the theoretical time complexity of<br />

intermediate consensus processes (O (bij w )) by the average real execution time of c consensus<br />

processes using a specific consensus function F on a mini-ensemble of size bij —recall that<br />

the values of bij can be computed by means of equation (3.3).<br />

It is to note that, for a specific RHCA variant (that is, for a given mini-ensemble size<br />

b), the parameter bij may take a few different values —for instance, in the three toy RHCA<br />

variants depicted in figure 3.2, bij = {2, 3} although b = 2 in all of them. That is, the<br />

number of consensus processes to be executed for estimating the running time of the RHCA<br />

is usually small, which makes the proposed procedure little computationally demanding.<br />

Futhermore, notice that a more robust estimation can be obtained by averaging the running<br />

times of c>1 executions of the consensus function F on mini-ensembles of size bij, although<br />

this would increase the time required for completing the estimation procedure.<br />

By means of the proposed methodology, the user is provided with an estimation of<br />

which is the most computationally efficient RHCA configuration and which is its running<br />

time for the consensus clustering problem at hand. However, it is necessary to decide<br />

whether this allegedly optimal RHCA variant is faster than flat consensus. The running<br />

time of flat consensus can be estimated by extrapolating from the running times of the<br />

consensus processes executed upon mini-ensembles of size bij. A simpler although less<br />

efficient alternative consists of launching the execution of flat consensus, halting it as soon<br />

as its running time exceeds the estimated execution time of the allegedly optimal RHCA<br />

variant.<br />

The next section is devoted to illustrate the performance of the proposed running time<br />

estimation methodology by means of several experiments, as well as for evaluating the<br />

computational efficiency of RHCA in front of flat consensus.<br />

3.2.4 Experiments<br />

In the following paragraphs, we present a set of experiments oriented to i) evaluate the<br />

performance of the computationally optimal consensus architecture prediction methodology,<br />

and ii) analyze the computational efficiency of random hierarchical consensus architectures.<br />

To do so, we have designed several experiments which are outlined next.<br />

55<br />

i=1


3.2. Random hierarchical consensus architectures<br />

1. Given the cluster ensemble size l, create a set of mini-ensembles sizes b with values<br />

sweeping from 2 to ⌊ l<br />

2 ⌋.<br />

2. For each b value –which corresponds to a RHCA variant– compute the number of<br />

stages of the RHCA s according to equation (3.1). With these results in hand, limit<br />

the sweep of values of b according to two criteria:<br />

i) as there exist multiple values of b that yield RHCA variants with the same number<br />

of stages, consider only the largest and smallest values of b that yield the same<br />

number of RHCA stages s.<br />

ii) keep those values of b that uniquely give rise to RHCA variants with a specific<br />

number of stages.<br />

3. For the reduced set of b values, compute the total number of consensus processes<br />

s<br />

Ki, and the real mini-ensembles sizes bij of the corresponding RHCA variants,<br />

i=1<br />

according to equations (3.2) and (3.3), respectively.<br />

4. Measure the time required for executing the consensus function F on c randomly<br />

picked mini-cluster ensembles of the sizes bij corresponding to each value of b.<br />

5. Employ the computed parameters of each RHCA variant (i.e. number of stages s,<br />

s<br />

total number of consensus processes Ki and the running times of the consensus<br />

i=1<br />

function F) to estimate the running times of the whole hierarchical architecture,<br />

using equations (3.4) or (3.9) depending on whether its fully serial or parallel version<br />

is to be implemented in practice.<br />

Table 3.2: Methodology for estimating the running time of multiple RHCA variants.<br />

Experimental design<br />

– What do we want to measure?<br />

i) The time complexity of random hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology for predicting the computationally optimal<br />

RHCA variant, in both the fully serial and parallel implementations.<br />

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel RHCA variants is<br />

measured in terms of the CPU time required for their execution —serial running<br />

time (SRTRHCA) and parallel running time (PRTRHCA).<br />

ii) The estimated running times of the same RHCA variants –serial estimated running<br />

time (SERTRHCA) and parallel estimated running time (PERTRHCA)– are<br />

computed by means of the proposed running time estimation methodology, which<br />

is based on the measured running time of c = 1 consensus clustering process. Predictions<br />

regarding the computationally optimal RHCA variant will be successful<br />

56


Chapter 3. Hierarchical consensus architectures<br />

in case that both the real and estimated running times are minimized by the<br />

same RHCA variant, and the percentage of experiments in which prediction is<br />

successful is given as a measure of its performance. In order to measure the<br />

impact of incorrect predictions, we also measure the execution time differences<br />

(in both absolute and relative terms) between the truly and the allegedly fastest<br />

RHCA variants in the case prediction fails. This evaluation process is replicated<br />

for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />

on the prediction accuracy of the proposed methodology.<br />

– How are the experiments designed? All the RHCA variants corresponding to<br />

the sweep of values of b resulting from the proposed running time estimation methodology<br />

have been implemented (see table 3.2). In order to test our proposals under a<br />

wide spectrum of experimental situations, consensus processes have been conducted<br />

using the seven consensus functions for hard cluster ensembles presented in appendix<br />

A.5 (i.e. CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), employing<br />

cluster ensembles of the sizes corresponding to the four diversity scenarios described<br />

in appendix A.4 —which basically boils down to compiling the clusterings output by<br />

|dfA| = {1, 10, 19, 28} clustering algorithms. In all cases, the real running times correspond<br />

to an average of 10 independent runs of the whole RHCA, in order to obtain<br />

representative real running time values (recall that the mini-ensemble components<br />

change from run to run, as they are randomly selected). For a description of the<br />

computational resources employed in or experiments, see appendix A.6.<br />

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the RHCA variants are depicted by means of<br />

curves representing their average values.<br />

– Which data sets are employed? For brevity reasons, this section only describes<br />

the results of the experiments conducted on the Zoo data collection. The presentation<br />

of the results of these same experiments on the Iris, Wine, Glass, Ionosphere, WDBC,<br />

Balance and MFeat unimodal data collections is deferred to appendix C.2.<br />

One word before proceeding to present the results obtained. In practice, only serial<br />

RHCA have been implemented in our experiments. The real execution times of their parallel<br />

counterparts are, in fact, an estimation based on retrieving the execution time of the longestlasting<br />

consensus process of each stage of the serial RHCA and plugging them into equation<br />

(3.9).<br />

Diversity scenario |df A| =1<br />

Firstly, figure 3.4 presents the results corresponding to the lowest diversity scenario, i.e.<br />

the one resulting from using a single randomly chosen clustering algorithm for generating<br />

the cluster ensemble —that is, the cardinality of the algorithmic diversity factor is equal<br />

to one, i.e. |dfA| = 1, which, on this data set, gives rise to a cluster ensemble size l = 57.<br />

Following the methodology of table 3.2, the sweep of values of the mini-ensemble size is<br />

b = {2, 3, 4, 6, 7, 28, 57} —recall that each value of b gives rise to a distinct RHCA variant.<br />

Figure 3.4(a) presents the serial RHCA estimated running time (SERTRHCA), while figure<br />

3.4(b) depicts the real serial running time (or SRTRHCA) of the implemented RHCA<br />

57


3.2. Random hierarchical consensus architectures<br />

SERT RHCA (sec.)<br />

PERT RHCA (sec.)<br />

10 0<br />

10 −1<br />

s : number of stages<br />

5 4 3 3 2 2 1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(a) Serial estimated running time<br />

5<br />

10<br />

4 3 3 2 2 1<br />

0<br />

s : number of stages<br />

10 −1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(c) Parallel estimated running time<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

SRT RHCA (sec.)<br />

PRT RHCA (sec.)<br />

10 0<br />

10 −1<br />

s : number of stages<br />

5 4 3 3 2 2 1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(b) Serial real running time<br />

5<br />

10<br />

4 3 3 2 2 1<br />

0<br />

s : number of stages<br />

10 −1<br />

2 3 4 6 7 28 57<br />

b : mini−ensemble size<br />

(d) Parallel real running time<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

Figure 3.4: Estimated and real running times of the serial RHCA on the Zoo data collection<br />

in the diversity scenario corresponding to a cluster ensemble of size l = 57.<br />

variants. Figures 3.4(c) and 3.4(d) present their counterparts for the parallel RHCA implementation.<br />

The lower horizontal axis of each chart presents the mini-ensembles size b of<br />

each RHCA variant, and the superior horizontal axis indicates the corresponding number<br />

of stages s of the RHCA. Notice, for instance, that s = 1 for b = 57, which corresponds to

flat consensus.<br />

If the estimated and real execution times of the serial implementation of the RHCA are<br />

analyzed separately (figures 3.4(a) and 3.4(b)), it can be observed that flat consensus is<br />

faster than any RHCA variant regardless of the consensus function employed. This is due<br />

to the small size of the cluster ensemble (l = 57) in this low diversity scenario, which makes<br />

any hierarchical consensus architecture slower than its one-step counterpart.<br />

Moreover, the visual comparison of figures 3.4(a) and 3.4(b) shows that SERTRHCA is<br />

a fairly accurate estimation of SRTRHCA. However, it is to notice that our goal is not<br />

to predict the exact value of SRTRHCA, but to use SERTRHCA to predict where the real<br />

running time will attain its minimum value —which is equivalent to determining the most<br />

computationally efficient RHCA variant, a goal that is perfectly accomplished in this case.<br />

Figures 3.4(c) and 3.4(d) depict the estimated and real running times of the fully<br />

parallel implementation of the same RHCA variants as before. The observation of these<br />

charts reveals that PERTRHCA succeeds notably in predicting the location of the minima<br />

[Figure 3.5: Estimated and real running times of the serial RHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 570. Panels: (a) serial estimated running time; (b) serial real running time; (c) parallel estimated running time; (d) parallel real running time; curves for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

of PRTRHCA —in fact, the only prediction error occurs in the case the SLSAD consensus<br />

function is employed. In this case, according to PERTRHCA (figure 3.4(c)), the most efficient<br />

consensus architecture is the RHCA variant with s = 2 stages using mini-ensembles of<br />

size b = 28. However, the real execution times (figure 3.4(d)) reveal that flat consensus is<br />

the fastest option when this consensus function is employed for combining the clusterings.<br />

Nevertheless, we would like to highlight that the cost (measured in terms of running<br />

time) of selecting this computationally suboptimal RHCA variant based on the PERTRHCA<br />

prediction is almost negligible in absolute terms, as the difference between the running<br />

times of the truly and allegedly optimal parallel RHCA variants is smaller than a tenth of<br />

a second in this case.<br />

Diversity scenario |df A| =10<br />

Figure 3.5 presents the estimated and real execution times of several architectural variants of<br />

both the fully serial and parallel RHCA implementations in the second diversity scenario,<br />

the one resulting from employing |dfA| = 10 randomly chosen clustering algorithms for<br />

generating a cluster ensemble of size l = 570. In this case, the sweep of values of the<br />

mini-ensembles size is b = {2, 3, 4, 5, 7, 8, 19, 20, 285, 570}.<br />

If the estimated and real execution times of the serial implementation of the RHCA are<br />


observed (figures 3.5(a) and 3.5(b)), it can be noticed that i) SERTRHCA is again a pretty<br />

accurate estimation of SRTRHCA, and ii) for several consensus functions (in fact, all but EAC) there exists at least one RHCA variant that is more computationally efficient than

flat consensus. In general terms, the difference between the running times of the fastest<br />

RHCA variant and flat consensus is small, although in the case of the MCLA consensus<br />

function, the execution of flat consensus (i.e. b = l = 570) is six times as costly as the<br />

fastest RHCA variant (the one with b = 20).<br />

Three main conclusions can be drawn at this point: firstly, increasing the size of the cluster ensemble makes hierarchical consensus architectures a computationally competitive alternative to flat consensus. Secondly, it is necessary to predict accurately which is the fastest RHCA variant (i.e. the specific value of the mini-ensemble size b) so as to obtain significant execution time savings. And thirdly, the computational optimality of a particular RHCA variant is local to the consensus function employed.

As regards the estimated and real running times of the fully parallel RHCA implementation, depicted in figures 3.5(c) and 3.5(d), we can conclude that, again, PERTRHCA is a good indicator of the most computationally efficient RHCA variant. Furthermore, notice that the differences between the running times of flat consensus and the optimal RHCA are beyond one order of magnitude for most consensus functions, which highlights the interest of RHCA in computational terms, as well as the need for being able to predict which is the least time consuming consensus architecture.

Diversity scenario |dfA| = 19

The results corresponding to the third diversity scenario (i.e. cluster ensembles of size<br />

l = 1083 using |dfA| = 19 randomly chosen clustering algorithms) are presented in figure<br />

3.6. In this case, the mini-ensembles size sweep is b = {2, 3, 4, 5, 6, 8, 9, 26, 27, 541, 1083}.<br />

As regards the serial implementation of the RHCA –whose estimated and real running<br />

times are presented in figures 3.6(a) and 3.6(b), respectively–, a few observations must be<br />

made. Firstly, notice that the curves in figure 3.6(a) present a high degree of resemblance<br />

to the ones in figure 3.6(b), which indicates that SERTRHCA is a notably accurate predictor<br />

of SRTRHCA. Again, we would like to highlight the fact that our main interest is that<br />

the former is a good predictor of the location of the minima of the latter, a goal which is<br />

pretty successfully achieved in this case. Secondly, notice the influence of the consensus<br />

function employed for conducting the clustering combination on the running time of the<br />

RHCA. Whereas most of them yield a similar running time pattern (i.e. they have a<br />

more or less pronounced minimum around b = 26 or b = 27), two consensus functions

stand out for their particular behaviour: i) when the EAC consensus function is employed,<br />

flat consensus is faster than any serial RHCA variant, and ii) when consensus is created<br />

by means of the MCLA consensus function, the space complexity requirements of MCLA<br />

make flat consensus not executable, as this is the only consensus function (among the ones<br />

employed in this work) whose complexity scales quadratically with the cluster ensemble size<br />

(see appendix A.5).<br />

If the estimated and real parallel RHCA implementation running times are evaluated<br />

(see figures 3.6(c) and 3.6(d)), it can be observed that, whatever the consensus function<br />

employed, there always exists at least one parallel RHCA variant which performs more<br />

efficiently than flat consensus. Moreover, notice that just like in the previous diversity<br />



Figure 3.6: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1083. Panels (a) and (b) show the serial estimated and real running times (SERT RHCA and SRT RHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT RHCA and PRT RHCA, in sec.), as a function of the mini-ensemble size b (and the corresponding number of stages s), for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

scenario, there exists a notable difference between the running times of the most efficient<br />

parallel RHCA and flat consensus, which can be as high as two orders of magnitude.<br />

Diversity scenario |dfA| = 28

Figure 3.7 depicts the estimated and real execution times corresponding to the highest<br />

diversity scenario —i.e. the one resulting from applying the |dfA| = 28 clustering algorithms<br />

from the CLUTO clustering package for generating cluster ensembles of size l = 1596. In<br />

this case, the mini-ensembles size sweep is b = {2, 3, 4, 5, 6, 10, 11, 32, 33, 798, 1596}.<br />

The results obtained are pretty similar to those obtained on the previous diversity<br />

scenario, although the following remarks must be made: firstly, notice that the large size of<br />

the cluster ensemble may not only impede flat consensus, but also the execution of those<br />

RHCA variants using larger mini-ensembles (see the curves corresponding to the MCLA<br />

consensus function). And secondly, it is noteworthy that the larger the cluster ensemble, the greater the running time savings –which can be as high as two orders of magnitude– derived from using the computationally optimal RHCA variant instead of flat consensus (when the latter is executable), regardless of the consensus function employed.



Figure 3.7: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1596. Panels (a) and (b) show the serial estimated and real running times (SERT RHCA and SRT RHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT RHCA and PRT RHCA, in sec.), as a function of the mini-ensemble size b (and the corresponding number of stages s), for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

Conclusions regarding the computational efficiency of RHCA

The observation of the results obtained across the four diversity scenarios (together with the experiments presented in appendix C.2) allows drawing several conclusions as regards the computational efficiency of hierarchical and flat consensus architectures:

– hierarchical consensus architectures can constitute a feasible way to obtain a consensus<br />

clustering solution in cases where one-step consensus is not affordable, which<br />

ultimately depends on the size of the cluster ensemble, the characteristics of the consensus<br />

function and the computational resources at hand.<br />

– as expected, parallel RHCA are highly efficient, being faster than flat consensus even<br />

in low diversity scenarios.<br />

– serial RHCA implementations become computationally competitive in medium to high<br />

diversity scenarios.<br />

– depending on the characteristics of the consensus function(s) employed for conducting<br />

clustering combination, large variations of the overall execution time of consensus<br />

architectures are observed.<br />


Dataset      Serial RHCA                           Parallel RHCA
             % correct    ΔRT      ΔRT             % correct    ΔRT      ΔRT
             predictions  (sec.)   (%)             predictions  (sec.)   (%)
Zoo          72.2         1.11     109.7           54.9         0.10     67.2
Iris         90.4         0.05     26.1            56.7         0.12     102.7
Wine         77.1         0.60     37.8            46.8         0.21     139.5
Glass        74.6         0.49     26.5            25.9         0.26     97.3
Ionosphere   73.1         2.63     16.6            67.8         0.77     110.5
WDBC         63.0         12.11    39.1            38.6         8.17     113.9
Balance      92.4         0.31     29.5            73.2         3.09     87.3
MFeat        83.4         7.02     27.7            76.3         14.41    50.3
average      78.3         3.04     39.1            55.0         3.39     96.1

Table 3.3: Evaluation of the minimum complexity RHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions.

Conclusions regarding the optimal RHCA prediction methodology<br />

As far as the proposed running time estimation methodology is concerned, the following<br />

conclusions are drawn:<br />

– the computation of SERTRHCA and PERTRHCA constitutes a reasonable, simple and fast means for predicting whether flat or hierarchical consensus should be conducted.

– the selection of computationally suboptimal consensus architectures caused by prediction errors of the proposed methodology entails a (usually acceptable) execution time overhead.

In order to provide the reader with a more quantitative analysis of the predictive power of the proposed running time estimation methodology, we have computed the percentage of experiments –considering the eight data sets over which they have been conducted– in which the minimum value of the estimated and real running times is obtained for the same consensus architecture. If, for a given experiment, both functions are simultaneously minimized, then our methodology succeeds in determining a priori which is the fastest consensus architecture. If not, we compute the difference between the real running times of the truly (i.e. the one that minimizes SRTRHCA or PRTRHCA) and the allegedly (that is, the one minimizing SERTRHCA or PERTRHCA) computationally optimal consensus architectures, so as to provide a measure of the impact of choosing a suboptimal consensus configuration, both in absolute and relative terms.
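For illustration purposes, the following Python sketch (hypothetical variable names and toy data; the original experiments were run in Matlab) reproduces this evaluation logic: it checks whether the estimated and real running times are minimized by the same architecture and, when they are not, accumulates the absolute and relative penalty of executing the allegedly optimal one.

```python
def evaluate_predictions(estimated, real):
    """Compare estimated vs. real running times of candidate consensus architectures.

    estimated, real: lists of dicts mapping architecture id -> running time (sec.),
    one dict per experiment. Returns the percentage of correct predictions and the
    average absolute/relative running time penalty of mistaken predictions.
    """
    correct, abs_penalties, rel_penalties = 0, [], []
    for est, rt in zip(estimated, real):
        predicted = min(est, key=est.get)   # architecture minimizing the estimate
        truly_best = min(rt, key=rt.get)    # architecture minimizing the real time
        if predicted == truly_best:
            correct += 1
        else:
            delta = rt[predicted] - rt[truly_best]               # seconds lost
            abs_penalties.append(delta)
            rel_penalties.append(100.0 * delta / rt[truly_best]) # penalty in %
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return 100.0 * correct / len(real), avg(abs_penalties), avg(rel_penalties)

# Toy usage with two hypothetical experiments (times in seconds):
est = [{"flat": 1.2, "s=2": 0.4}, {"flat": 0.3, "s=2": 0.5}]
real = [{"flat": 1.5, "s=2": 0.6}, {"flat": 0.45, "s=2": 0.4}]
print(evaluate_predictions(est, real))   # ≈ (50.0, 0.05, 12.5)
```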

Table 3.3 presents the percentage of experiments where the minima of SERTRHCA<br />

and PERTRHCA predict the most efficient consensus architecture correctly (expressed as<br />

‘% correct predictions’). In this case, SERTRHCA and PERTRHCA have been estimated<br />

upon a single execution (c = 1) of a consensus process on the mini-ensembles of size bij,<br />

i.e. no statistical running time averaging is conducted. Moreover, the difference between<br />

the real running times (ΔRT, measured both in seconds and in relative percentage) of the<br />

truly and the allegedly computationally optimal consensus architectures is also presented,<br />


which corresponds to an average across 50 independent runs of the experiments conducted<br />

on each data collection.<br />

It is worth noting that, as already observed in the experiment described in this section and in those presented in appendix C.2, SERTRHCA is a pretty accurate estimator of SRTRHCA (despite being based on the running time of a single consensus) and, as such, it succeeds notably in determining the most computationally efficient consensus architecture, achieving an average accuracy superior to 78% across the eight data sets employed in these experiments. In spite of this notably high accuracy percentage, notice that the running time increase derived from inaccurate predictions is pretty high when measured in relative percentage —e.g. for the Zoo data set, the average running time of truly optimal consensus architectures is more than doubled when suboptimal ones are selected. However, if this average execution time increase is measured in absolute terms, we can conclude that it is perfectly acceptable from a practical viewpoint in most cases —after all, the large relative deviation observed in the Zoo data collection results only in a one-second running time rise.

As regards the parallel implementation of the RHCA, there are a couple of issues worth noting: firstly, the proposed prediction methodology performs worse than in the serial case. This is due to the fact that, as observed across all the experiments conducted, PERTRHCA is a poorer estimator of PRTRHCA. However, although ΔRT reaches very high values in relative terms, the absolute running time deviations between the truly and allegedly fastest consensus architectures are, again, not of paramount importance (i.e. the modest running time overhead caused by a slightly erroneous estimation of the fastest RHCA variant is clearly preferable to the hypothetical execution of the least efficient consensus architecture).

Recall that the proposed running time estimation methodology is based on capturing the execution times of several (namely c) runs of the consensus process on mini-ensembles of the sizes bij corresponding to each RHCA variant. As aforementioned, the results just reported have been obtained upon a single execution (c = 1). Presumably, the larger the value of c, the more accurate the estimation, but also the more costly it is to compute.
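A minimal sketch of this timing step, assuming a generic consensus_fn callable and the Python standard library (the names and the averaging strategy are illustrative, not the original Matlab implementation), is the following:

```python
import random
import time

def time_consensus(consensus_fn, ensemble, b, c=1):
    """Average wall-clock time of c consensus runs on random mini-ensembles of size b.

    consensus_fn: callable taking a list of b clusterings and returning one clustering.
    ensemble: the full cluster ensemble (a list of clusterings).
    """
    elapsed = []
    for _ in range(c):
        mini = random.sample(ensemble, b)        # randomly picked mini-ensemble
        t0 = time.perf_counter()
        consensus_fn(mini)
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / c

# These per-size averages are then combined, following the estimation methodology
# described in the text, into SERT/PERT figures for each candidate architecture.
```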

So as to evaluate the influence of this factor, figure 3.8 depicts the evolution of the<br />

percentage of correct predictions (for both the serial and parallel implementations, referred<br />

to as %CPS and %CPP, respectively) and of the running time deviations (ΔRTS and ΔRTP<br />

in both absolute and relative terms) as a function of the parameter c, varying its value<br />

between 1 and 20, averaged across the eight data collections employed in this experiment.<br />

It can be observed that, despite the relatively wide sweep of values of c, the variation in the percentage of correct predictions is below 6% for both the serial and parallel RHCA implementations —see figure 3.8(a). In terms of the difference between the running times of the truly and allegedly fastest consensus architectures, this results in a slight reduction of ΔRTS and ΔRTP –figure 3.8(b)–, which is, in any case, lower than 1.7 seconds —a maximum reduction of 22% in relative percentage terms, see figure 3.8(c).

Thus, we can conclude that, despite being a coarse approximation, using the running<br />

time of a single consensus process as the basis for estimating the execution time of the<br />

whole RHCA yields pretty accurate results as far as the prediction of the most efficient<br />

consensus architecture is concerned. Furthermore, when this prediction methodology fails,<br />

the execution time overhead is, in most cases, not dramatic from a practical standpoint.<br />



Figure 3.8: Evolution of the accuracy of the serial and parallel RHCA running time estimation as a function of the number of consensus processes c used in the estimation, measured in terms of (a) the percentage of correct predictions (%CPS and %CPP), and the (b) absolute and (c) relative running time deviations (ΔRTS and ΔRTP) between the truly and allegedly optimal consensus architectures, averaged across the eight data sets employed.

Summary of the most computationally efficient RHCA variants<br />

Following the proposed methodology, we have estimated which are the most computationally<br />

efficient consensus architectures for the twelve unimodal data collections described in<br />

appendix A.2.1. The results corresponding to the fully serial and parallel implementations<br />

are presented in tables 3.4 and 3.5, respectively.<br />

From a notational viewpoint, RHCA variants are expressed in terms of their number of stages s (in case there exist two variants with the same number of stages, we denote whether it corresponds to the implementation using the minimum or maximum mini-ensemble size by the symbols bm and bM, respectively). Moreover, successful predictions of the computationally optimal consensus architectures (i.e. the minima of SERTRHCA –or PERTRHCA– and SRTRHCA –or PRTRHCA– are yielded by the same consensus architecture) are denoted by the dagger symbol (†). Obviously, this applies to the first eight data collections (from Zoo to MFeat), where the true running times of all consensus architectures have been measured after their real execution. For the remaining four data sets (from miniNG to PenDigits), the allegedly optimal consensus architectures are presented —however, we think it is not an outlandish assumption to consider that a rate of correct predictions of the computationally optimal consensus architecture comparable to that presented in table 3.3 can be expected in these cases.

As regards the consensus architecture serial implementation (table 3.4), a few observations<br />

can be made: firstly, the higher the degree of diversity (i.e. the larger the cluster<br />

ensembles), the more efficient RHCA variants become when compared to flat consensus.<br />

As observed earlier, the most notable exception to this rule of thumb occurs when clustering<br />

combination is conducted by means of the EAC consensus function, whereas it can<br />

be observed that the remaining ones show, in general terms, a pretty similar behaviour as<br />

regards the type of consensus architecture (flat or hierarchical) that minimizes the total<br />


running time. Secondly, notice that flat consensus tends to be computationally optimal in those data sets having small cluster ensembles even in high diversity scenarios (e.g. Iris, Balance or MFeat). Thirdly, for data collections containing a large number of objects n (such as PenDigits), only the HGPA and MCLA consensus functions are executable in our experimental conditions (as they are the only ones whose complexity scales linearly with the data set size, see appendix A.5). And last, notice the predominance of RHCA variants with s = 2 and s = 3 stages among the fastest ones, which seems to indicate that, from a computational perspective, RHCA variants with few stages are more efficient, even though their consensus processes are conducted on large mini-cluster ensembles.

Most of these observations can be extrapolated to the case of the fully parallel consensus<br />

implementation (table 3.5), where we can observe a pretty overwhelming prevalence of<br />

RHCA variants over flat consensus, a trend that was already reported earlier in this section<br />

and also in appendix C.2.<br />

In the remainder of this work, experiments concerning random hierarchical consensus architectures have been limited, for the sake of brevity, to those RHCA variants of minimum estimated running time.



Consensus function / diversity scenario; dataset columns in order: Zoo, Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC, PenDigits.

CSPA   |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
CSPA   |dfA| = 10:  s=2,bM† flat† s=2,bM† s=2,bM† s=2,bM† s=2,bM† flat† flat† s=2,bM flat flat –
CSPA   |dfA| = 19:  s=3,bM flat† s=2,bM† s=2,bM† s=2,bM s=2,bM† flat† flat† s=2,bm s=2,bM s=2,bM –
CSPA   |dfA| = 28:  s=2,bm† flat† s=2,bM† s=2,bM s=2,bm s=2,bM flat† flat† s=3,bM s=2,bM s=2,bM –
EAC    |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 10:  flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 19:  flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 28:  flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
HGPA   |dfA| = 1:   flat† flat† flat† flat† flat† flat flat† flat† s=3,bM s=2,bM flat s=2,bM
HGPA   |dfA| = 10:  s=2,bM† flat† s=2,bM† s=2,bM† s=2,bm† s=2,bm† flat† s=2,bM† s=4,bM s=2,bm s=2,bm s=2,bm
HGPA   |dfA| = 19:  s=2,bm† flat† s=2,bM† s=2,bM s=2,bm† s=3,bM† s=2,bM† s=2,bm† s=2,bm s=3,bM s=2,bm s=3,bm
HGPA   |dfA| = 28:  s=2,bm† s=2,bM† s=3,bM s=2,bm s=3,bM† s=3,bM s=2,bM† s=3,bM s=3,bM s=4,bM s=6 s=3,bm
MCLA   |dfA| = 1:   flat† flat† flat† flat† flat† flat flat† flat† s=3,bm s=2,bm s=3,bM s=2,bM
MCLA   |dfA| = 10:  s=3,bM flat† s=2,bm† s=2,bm† s=3,bM† s=2,bm† flat s=2,bm† s=4,bM s=4,bM s=3,bM s=3,bm
MCLA   |dfA| = 19:  s=2,bm† s=2,bM† s=2,bm s=2,bm† s=2,bm s=3,bM s=2,bM† s=2,bm† s=2,bm s=3,bm s=4,bm s=3,bm
MCLA   |dfA| = 28:  s=2,bm† s=2,bM† s=2,bm s=3,bM† s=2,bm† s=3,bM† s=2,bM s=3,bm s=3,bM s=4,bM s=3,bM s=4,bM
ALSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
ALSAD  |dfA| = 10:  s=2,bm† flat† s=2,bM s=2,bM s=2,bM† flat† flat† flat† s=2,bM flat flat –
ALSAD  |dfA| = 19:  s=2,bm† flat† s=2,bm† s=2,bm† s=2,bm† s=2,bM flat† flat† s=2,bm flat flat –
ALSAD  |dfA| = 28:  s=3,bM s=2,bM† s=2,bm† s=3,bM s=2,bm† s=2,bm flat† flat† s=3,bm flat flat –
KMSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
KMSAD  |dfA| = 10:  s=2,bm† flat† s=2,bM† s=2,bM† s=3,bM† s=2,bM† flat† flat† flat flat flat –
KMSAD  |dfA| = 19:  s=2,bm† flat s=2,bm† s=2,bM† s=3,bM† s=2,bm† flat† flat† s=2,bM s=2,bM s=2,bM –
KMSAD  |dfA| = 28:  s=2,bm† s=2,bM s=3,bM s=2,bm† s=2,bm† s=2,bm† flat† flat† s=3,bM s=3,bM s=2,bM –
SLSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
SLSAD  |dfA| = 10:  s=2,bm† flat† s=2,bM s=2,bm† s=2,bM s=2,bM† flat† flat† s=2,bM flat flat –
SLSAD  |dfA| = 19:  s=2,bm† flat† s=2,bm† s=3,bM s=3,bM s=2,bM flat† flat† s=2,bM flat flat –
SLSAD  |dfA| = 28:  s=2,bm† s=2,bM s=3,bM s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bm flat flat –

Table 3.4: Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully serial implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.


Consensus function / diversity scenario; dataset columns in order: Zoo, Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC, PenDigits.

CSPA   |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† s=2,bm s=2,bm flat –
CSPA   |dfA| = 10:  s=2,bm† flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM s=2,bm s=2,bm –
CSPA   |dfA| = 19:  s=3,bm flat s=3,bM s=3,bM s=2,bm† s=3,bM flat† s=3,bM s=3,bM s=2,bm s=2,bm –
CSPA   |dfA| = 28:  s=2,bm† s=3,bM s=3,bm s=3,bM s=2,bm† s=3,bM flat s=3,bm s=3,bM s=2,bm s=2,bm –
EAC    |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 10:  s=2,bm† flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† flat s=2,bm s=3,bM –
EAC    |dfA| = 19:  s=2,bm† s=3,bM s=3,bM s=3,bM s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –
EAC    |dfA| = 28:  s=2,bm† s=3,bM s=2,bm† s=3,bM s=2,bm† s=3,bM s=2,bm flat† s=2,bm s=2,bm s=2,bm –
HGPA   |dfA| = 1:   flat† flat† flat† flat† s=3,bM s=3,bM flat† flat† s=3,bm s=3,bm s=2,bm s=3,bM
HGPA   |dfA| = 10:  s=3,bM flat† s=2,bm s=3,bM s=4,bM s=3,bm s=2,bm† s=3,bM s=6 s=4,bm s=2,bm s=4,bM
HGPA   |dfA| = 19:  s=3,bm† s=3,bM s=3,bM s=4,bM s=2,bm† s=4,bM s=2,bm† s=4,bM s=2,bM s=4,bM s=6 s=3,bm
HGPA   |dfA| = 28:  s=3,bm s=3,bM s=3,bm s=3,bM s=3,bm s=3,bm† s=3,bm s=3,bm† s=3,bM s=5,bm s=6 s=4,bM
MCLA   |dfA| = 1:   s=2,bm† flat† flat† flat† s=3,bM s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm s=3,bM
MCLA   |dfA| = 10:  s=3,bm† s=2,bm† s=2,bm s=3,bm s=4,bM s=2,bm† s=2,bm† s=3,bM s=5 s=4,bm s=4,bm s=4,bM
MCLA   |dfA| = 19:  s=2,bm s=3,bM s=3,bM s=4,bM s=2,bm† s=4,bM s=2,bm† s=4,bM s=2,bM s=4,bM s=4,bm s=3,bm
MCLA   |dfA| = 28:  s=3,bm† s=2,bm† s=3,bm s=3,bm† s=4,bM s=4,bm s=2,bm† s=3,bm† s=3,bM s=5,bm s=3,bM s=4,bM
ALSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
ALSAD  |dfA| = 10:  s=3,bM flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM flat s=3,bM –
ALSAD  |dfA| = 19:  s=2,bm† s=3,bM s=3,bM s=2,bm† s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –
ALSAD  |dfA| = 28:  s=4,bM s=2,bm† s=3,bM s=3,bm s=2,bm† s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm –
KMSAD  |dfA| = 1:   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
KMSAD  |dfA| = 10:  s=2,bm† flat† s=2,bm s=3,bM s=4,bM s=2,bm† flat† flat† s=3,bM s=3,bm s=2,bM –
KMSAD  |dfA| = 19:  s=2,bm† s=3,bM s=3,bM s=3,bm s=3,bm s=3,bM s=2,bm flat† s=3,bM s=2,bm s=2,bm –
KMSAD  |dfA| = 28:  s=2,bm† s=2,bm† s=4,bM s=3,bM s=2,bm† s=3,bM s=4,bM flat† s=3,bM s=2,bm s=3,bm –
SLSAD  |dfA| = 1:   s=2,bM flat† flat† flat† flat† flat† flat† flat† flat flat flat –
SLSAD  |dfA| = 10:  s=3,bM flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM s=2,bm s=3,bM –
SLSAD  |dfA| = 19:  s=2,bm† s=2,bM s=3,bM s=4,bM s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –
SLSAD  |dfA| = 28:  s=3,bm s=2,bm† s=3,bM s=3,bm s=2,bm† s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm –

Table 3.5: Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully parallel implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.



3.3 Deterministic hierarchical consensus architectures<br />

This section is devoted to the description of deterministic hierarchical consensus architectures<br />

(or DHCA). As in the previous section, we present a generic definition of this<br />

architectural variant along with a study of its computational complexity.<br />

3.3.1 Rationale and definition<br />

As opposed to random HCA, this proposal drives the creation of the mini-ensembles by<br />

a deterministic criterion. The main idea behind DHCA is to exploit the distinct ways of<br />

introducing diversity in the cluster ensemble as the guiding principle for creating the mini-ensembles

upon which the intermediate consensus clustering solutions are built. That is,<br />

a key differential factor between DHCA and RHCA is that the former type of architecture<br />

is indirectly designed by the user when creating the cluster ensemble, whereas the latter<br />

requires the user to fix an architectural defining factor (i.e. assign a value to the size of the<br />

mini-ensembles b).<br />

Enlarging on the relationship between the creation of the cluster ensemble and the<br />

configuration of the DHCA, it is important to recall the strategies employed for introducing<br />

diversity in cluster ensembles (see section 2.1).<br />

For instance, heterogeneous cluster ensembles –whose components are generated by<br />

the execution of multiple clustering algorithms on the data set– have a single diversity<br />

factor, i.e. the set of distinct clustering algorithms employed. Meanwhile, when creating<br />

homogeneous cluster ensembles (those compiling the outcomes of multiple runs of a single<br />

clustering algorithm), a wider spectrum of diversity factors can be applied, such as the<br />

random starting configuration of a stochastic algorithm, or the use of distinct attributes for<br />

representing the objects in the data set, among others.<br />

As aforementioned, in this work we combine both the homogeneous and heterogeneous<br />

approaches for creating cluster ensembles, aiming not only to obtain highly diverse cluster<br />

ensembles, but also to design a strategy for fighting against clustering indeterminacies. This<br />

means that we employ several mutually crossed diversity factors (e.g. multiple clustering<br />

algorithms are run on several data representations with varying dimensionalities), and this<br />

will be the scenario where DHCA will be defined.<br />

In general terms, let us denote the number of diversity factors employed in the cluster<br />

ensemble creation process as f. Each diversity factor dfi ∀i ∈ [1,f] has a cardinality |dfi|<br />

—e.g. |dfi| denotes the number of clustering algorithms employed for creating the cluster<br />

ensemble in case that the ith diversity factor dfi represents the algorithmic diversity of the<br />

ensemble.<br />

Finally, notice that, if fully mutual crossing between all diversity factors is ensured (e.g.<br />

each cluster ensemble component is the result of running each clustering algorithm on each<br />

document representation of each distinct dimensionality), the cluster ensemble size l can be<br />

expressed as:

l = \prod_{k=1}^{f} |df_k|     (3.12)

Let us see how the design of the cluster ensemble determines the topology of a deterministic<br />

hierarchical consensus architecture. The guiding principle is that the consensus<br />


processes conducted at each stage of a DHCA combine those clusterings stemming from a<br />

single diversity factor (e.g. those ensemble components obtained by applying all the available<br />

algorithms on a particular object representation with a specific dimensionality). Then,<br />

quite obviously, the number of stages of a DHCA is equal to the number of diversity factors<br />

employed in creating the cluster ensemble, i.e. s = f.<br />

Besides selecting the diversity factors (and their cardinality) used in generating the cluster<br />

ensemble, the user must make an additional choice that affects the DHCA configuration:<br />

deciding which diversity factor is subject to consensus at each DHCA stage. This is specified<br />

by means of an ordered list of diversity factors, O = {df1, df2, . . . , dff}, so that dfi will

refer hereafter to the diversity factor which is subject to consensus at the ith stage of the<br />

DHCA.<br />

As regards the number of consensus processes executed at the ith DHCA stage (Ki), it is equal to the product of the cardinalities of the diversity factors that have been subject to consensus neither in the present nor in any previous stage —except for the last stage, where a single consensus is conducted—, as defined in equation (3.13).

K_i = \begin{cases} \prod_{k=i+1}^{f} |df_k| & \text{if } 1 \le i < f \\ 1 & \text{if } i = f \end{cases}     (3.13)


Figure 3.9: An example of a deterministic hierarchical consensus architecture operating on a cluster ensemble created using three diversity factors: three clustering algorithms (|dfA| = 3), two object representations (|dfR| = 2) of three dimensionalities each (|dfD| = 3). The cluster ensemble component obtained by running the ith clustering algorithm on the jth object representation and the kth dimensionality is denoted as λi,j,k. Consensus clusterings are sequentially created across the algorithmic, dimensional and representational diversity factors (dfA, dfD and dfR, respectively).

Therefore, a total of K2 = |dfR| = 2 consensus processes are run, each on a mini-ensemble of size b2j = |dfD| = 3, ∀j ∈ [1, 2]. The halfway consensus clustering solutions obtained after this second stage are designated as λA,j,D.

And finally, the last DHCA stage combines the clusterings output by the previous one, which only differ in their original object representation. Being the final stage of the hierarchy, a single consensus process is executed (K3 = 1), and the size of the mini-ensemble coincides with the cardinality of the representation diversity factor, i.e. b3j = |dfR| = 2.
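To make the topology computation concrete, the following Python sketch (hypothetical helper, not part of the original implementation) derives the number of consensus processes per stage Ki —equation (3.13)— and the mini-ensemble sizes for the ordered cardinalities of the example of figure 3.9; the total ensemble size under full mutual crossing —equation (3.12)— is also computed.

```python
from math import prod

def dhca_topology(cardinalities):
    """Topology of a DHCA given the ordered diversity factor cardinalities |df_1..f|.

    Returns, per stage i, the pair (K_i, b_i): K_i is the number of consensus
    processes run at stage i (equation 3.13) and b_i the mini-ensemble size |df_i|.
    """
    f = len(cardinalities)                       # number of stages s = f
    stages = []
    for i in range(1, f + 1):
        K_i = prod(cardinalities[i:]) if i < f else 1
        stages.append((K_i, cardinalities[i - 1]))
    return stages

# Example of figure 3.9: O = {df_A, df_D, df_R} with |df_A| = 3, |df_D| = 3, |df_R| = 2
print(dhca_topology([3, 3, 2]))   # -> [(6, 3), (2, 3), (1, 2)]
print(prod([3, 3, 2]))            # l = 18 under full mutual crossing (equation 3.12)
```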

3.3.2 Computational complexity<br />

The maximum and minimum time complexities of DHCA –corresponding to the serial and<br />

parallel execution of the consensus processes of each stage, respectively– are estimated in<br />

the following paragraphs. In this case, the goal is to express these complexities in terms of<br />

the cardinality and number of diversity factors employed in the cluster ensemble creation<br />

process. Recall that the time complexity of consensus functions typically grows linearly or quadratically with the cluster ensemble size, i.e. it is O(l^w), where w ∈ {1, 2}.


Serial DHCA<br />

Firstly, let us consider the fully serialized version of the DHCA. In this case, the time<br />

complexity amounts to the sum of all the consensus processes, as defined by equation (3.14).<br />

Notice that STCDHCA can be expressed ultimately in terms of the number and cardinalities<br />

of the diversity factors employed in the generation of the cluster ensemble.<br />

STC_{DHCA} = \sum_{i=1}^{s} \sum_{j=1}^{K_i} O(b_{ij}^{w}) = \sum_{i=1}^{s} K_i \cdot O(b_{ij}^{w}) = \sum_{i=1}^{f} \left( \prod_{k=i+1}^{f} |df_k| \right) \cdot O(|df_i|^{w})     (3.14)

Keeping the higher order terms, serial DHCA time complexity is:

STC_{DHCA} = O\left( \left( \prod_{k=2}^{f} |df_k| \right) |df_1|^{w} \right)     (3.15)

Parallel DHCA

And secondly, the time complexity of the parallelized execution of the DHCA is presented<br />

in equation (3.16). As all the consensus processes in a given DHCA stage are equally costly,<br />

the value of PTCDHCA amounts to the addition of the complexities of one of the consensus<br />

processes run on each of the s stages of the hierarchy.<br />

PTC_{DHCA} = \sum_{i=1}^{s} O(b_{ij}^{w}) = \sum_{i=1}^{f} O(|df_i|^{w})     (3.16)

Notice that the parallel execution of a DHCA can be regarded as a sequence of f instructions of complexity O(|dfi|^w), ∀i ∈ [1, f]. Therefore, applying the sum rule of asymptotic notation, PTCDHCA can be rewritten as:

PTC_{DHCA} = O\left( \max_{i \in [1,f]} |df_i|^{w} \right)     (3.17)

3.3.3 Running time minimization

As in section 3.2, a naturally arising question regarding the practical implementation of<br />

deterministic hierarchical consensus architectures is the following: given a cluster ensemble<br />

of size l created upon a set of diversity factors dfi (for i = {1,...,f}), which is the least<br />

time consuming DHCA variant that can be built?<br />

Indeed, as the topology of a deterministic hierarchical consensus architecture is ultimately<br />

determined by an ordered list O of the f diversity factors indicating upon which<br />

diversity factor consensus is conducted at each DHCA stage, there exist f! distinct DHCA

variants for a given consensus clustering problem —one for each of the possible ways of ordering<br />

the f diversity factors. Then, the question transforms into: how should the diversity<br />

factors be ordered so as to minimize the total running time of the DHCA?<br />


Notice that, as opposed to what was observed in random HCA, the distinct DHCA<br />

variants do not differ in their number of stages (which is, in all cases, equal to the number<br />

of diversity factors, i.e. s = f), but in the time complexity of each stage of the architecture.<br />

Thus, in order to determine which is the computationally optimal DHCA variant, it is<br />

necessary to analyze the dependence between the ordering of the diversity factors and the<br />

total number of consensus processes executed and their complexity. In this section, we tackle<br />

this issue for both the fully serial and parallel implementation of deterministic hierarchical<br />

consensus architectures.<br />

Without loss of generality, let us assume that consensus clustering is to be conducted<br />

on a cluster ensemble of size l generated upon f = 3 mutually crossed diversity factors. By<br />

means of an ordered list O, these three factors are associated to one of the stages of the<br />

DHCA, i.e. O = {df1,df2,df3} —recall that, according to the definition of DHCA, i) the<br />

numerical subindex of each diversity factor identifies the stage it is associated to, and ii)<br />

Ki consensus processes of complexity O(|dfi|^w) (where w = {1, 2}) are conducted in the ith

stage, with i = {1, 2, 3} in this case.<br />

As aforementioned, the total number of consensus processes depends on the cardinality<br />

of the diversity factors, which in this particular case amounts to the expression presented<br />

in equation (3.18):

\sum_{i=1}^{f} K_i = \sum_{i=1}^{3} K_i = \prod_{k=2}^{3} |df_k| + \prod_{k=3}^{3} |df_k| + 1 = |df_2||df_3| + |df_3| + 1     (3.18)

where the number of consensus per stage Ki is computed according to equation (3.13).<br />

Firstly, let us analyze the running time of the fully parallel DHCA implementation. As<br />

in section 3.2, we assume that sufficient computing resources allow the concurrent execution<br />

of all the consensus processes of any of the DHCA stages —notice that this amounts to<br />

having as many as |df2||df3| parallel computation units capable of running simultaneously<br />

all the consensus processes of the first stage, which is the one with the largest number of<br />

consensus.<br />

If this condition is met, the running time of any parallel DHCA variant becomes independent<br />

of the ordering of the diversity factors. This is due to the fact that, assuming<br />

the fully simultaneous execution of all the consensus processes corresponding to the same<br />

DHCA stage, the running time of the whole consensus architecture will be proportional to<br />

O(|df1|^w) + O(|df2|^w) + O(|df3|^w) –as the running time of each DHCA stage is equivalent to

the execution of a single consensus process–, which is independent of which diversity factor<br />

is assigned to each stage.<br />

However, although the diversity factor ordering does not affect the running time of parallel

DHCA variants, this factor has a significant impact on the dimensioning of the necessary<br />

resources for the entirely parallel execution of all the consensus processes involved, as it is<br />

directly related to the total number of consensus that must be executed in the DHCA.

From equation (3.18), it is straightforward to see that \sum_{i=1}^{f} K_i is independent of the cardinality of the diversity factor associated to the first DHCA stage (df1). Thus, notice that the total number of consensus of a DHCA is minimized if the diversity factors are arranged in the ordered list O according to their cardinality and in decreasing order, i.e. |df1| ≥ |df2| ≥ |df3|. By doing so, the number of consensus processes conducted in the

1. Given the cluster ensemble size l generated upon a set of f mutually crossed diversity<br />

factors, create f! ordered lists corresponding to all the possible permutations of the<br />

diversity factors, each giving rise to a DHCA variant.<br />

2. For each one of the ordered lists, compute the total number of consensus processes<br />

per stage Ki, according to equation (3.13).<br />

4. Measure the time required for executing the consensus function F on c randomly<br />

picked mini-cluster ensembles of sizes |dfi| for i ∈{1,...,f}.<br />

5. Employ the computed parameters of each DHCA variant (i.e. the number of stages s = f, the total number of consensus processes \sum_{i=1}^{s} K_i, and the measured running times of the consensus function F) to estimate the running time of the whole hierarchical architecture, using equations (3.14) or (3.16) depending on whether its fully serial or parallel version is to be implemented in practice.

Table 3.6: Methodology for estimating the running time of multiple DHCA variants.<br />

first stage of the DHCA is minimized, which is equivalent to minimizing the necessary<br />

computation units for the fully parallel implementation of the DHCA.<br />

If the running time of the serial implementation of the DHCA is now considered, the<br />

total number of consensus to be executed is not the only factor to take into account. In<br />

fact, arranging the diversity factors in decreasing order, while minimizing the total number<br />

of consensus to be executed across the DHCA, brings about a contradictory collateral effect<br />

if the complexity and number of the consensus processes run at each stage are considered.<br />

Indeed, notice that it is in the first DHCA stage where the largest number of consensus<br />

processes is executed (K1 = |df2||df3|), and they have the highest complexity (O(|df1|^w), where w = {1, 2}), as |df1| ≥ |df2| ≥ |df3|. Moreover, a single minimum complexity (i.e. O(|df3|^w)) consensus process is conducted at the third stage, as K3 = 1.

Thus, as far as the running time of the serial implementation of DHCA is concerned,<br />

there exists an apparent trade-off involving the order of diversity factors, and the number<br />

and complexity of the associated consensus processes. It is important to note that the computationally<br />

optimal solution ultimately depends on the growth rates of the total number<br />

of consensus and of their time complexities with respect to the cardinality of the diversity<br />

factors (|dfi|, for i = {1, . . . , f}). However, while the latter grow according to a linear or

quadratic law, the former follows a usually steeper multiplicative growth rate.<br />

Similarly to what has been discussed in section 3.2, table 3.6 presents a methodology<br />

for estimating the running times of the f! DHCA variants, which allows comparing them<br />

and, as a consequence, predicting which is the most computationally efficient.<br />
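A compact sketch of this estimation loop is given below (hypothetical names; the timing argument stands for the measured per-consensus running time of step 4 of table 3.6, replaced here by an illustrative quadratic cost model):

```python
from itertools import permutations
from math import prod

def estimate_dhca_variants(factors, timing, serial=True):
    """Estimate running times of the f! DHCA variants (methodology of table 3.6).

    factors: dict mapping diversity factor name -> cardinality, e.g. {'A': 28, 'D': 14, 'R': 5}.
    timing:  function returning the measured time (sec.) of one consensus run on a
             mini-ensemble of size b (step 4 of the methodology).
    Returns a dict mapping variant acronym (e.g. 'ADR') -> estimated running time.
    """
    estimates = {}
    for order in permutations(factors):              # one ordered list O per variant
        cards = [factors[name] for name in order]
        f = len(cards)
        total = 0.0
        for i in range(1, f + 1):
            K_i = prod(cards[i:]) if i < f else 1    # consensus processes at stage i (eq. 3.13)
            t_i = timing(cards[i - 1])               # time of one consensus of size |df_i|
            total += K_i * t_i if serial else t_i    # eq. (3.14) vs. eq. (3.16)
        estimates["".join(order)] = total
    return estimates

# Toy usage with a purely illustrative quadratic cost model (w = 2):
est = estimate_dhca_variants({"A": 10, "D": 14, "R": 5}, timing=lambda b: 1e-3 * b**2)
print(min(est, key=est.get))   # acronym of the allegedly fastest serial variant
```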



3.3.4 Experiments<br />


This section presents the results of multiple experiments oriented to illustrate the computational<br />

efficiency of DHCA, as well as to evaluate the predictive power of the running time<br />

estimation methodology of table 3.6. Their design follows the scheme presented next.<br />

Experimental design<br />

– What do we want to measure?<br />

i) The time complexity of deterministic hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology for predicting the computationally optimal<br />

DHCA variant, in both the fully serial and parallel implementations.<br />

iii) The predictive power of the proposed methodology based on running time estimation<br />

vs the computational optimality criterion based on designing the DHCA<br />

according to a decreasing diversity factor cardinality order, in both the fully<br />

serial and parallel implementations.<br />

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel DHCA variants is<br />

measured in terms of the CPU time required for their execution —serial running<br />

time (SRTDHCA) and parallel running time (PRTDHCA).<br />

ii) The estimated running times of the same DHCA variants –serial estimated running<br />

time (SERTDHCA) and parallel estimated running time (PERTDHCA)– are<br />

computed by means of the proposed running time estimation methodology, which<br />

is based on the measured running time of c = 1 consensus clustering process. Predictions<br />

regarding the computationally optimal DHCA variant will be successful<br />

in case that both the real and estimated running times are minimized by the<br />

same DHCA variant, and the percentage of experiments in which prediction is<br />

successful is given as a measure of its performance. In order to measure the<br />

impact of incorrect predictions, we also measure the execution time differences<br />

(in both absolute and relative terms) between the truly and the allegedly fastest<br />

DHCA variants in the case prediction fails. This evaluation process is replicated<br />

for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />

on the prediction accuracy of the proposed methodology.<br />

iii) Both computationally optimal DHCA variants prediction approaches are compared<br />

in terms of the percentage of experiments in which prediction is successful,<br />

and in terms of the execution time overheads (in both absolute and relative terms)<br />

between the truly and the allegedly fastest DHCA variants in the case prediction<br />

fails.<br />

– How are the experiments designed? The f! DHCA variants corresponding to<br />

all the possible permutations of the f diversity factors employed in the generation<br />

of the cluster ensemble have been implemented (see table 3.6). As described in appendix<br />

A.4, cluster ensembles have been created by the mutual crossing of f = 3

diversity factors: clustering algorithms (dfA), object representations (dfR) and data<br />

dimensionalities (dfD). Thus, in all our experiments, the number of DHCA variants is<br />


f! = 3! = 6, which are identified by an acronym describing the order in which diversity<br />

factors are assigned to stages —for instance, ADR describes the DHCA variant<br />

defined by the ordered list O = {df1 = dfA,df2 = dfD,df3 = dfR}. For a given data<br />

collection, the cardinalities of the representational and dimensional diversity factors<br />

(|dfR| and |dfD|, respectively) are constant, while the cardinality of the algorithmic<br />

diversity factor takes four distinct values |dfA| = {1, 10, 19, 28}, giving rise to the four<br />

diversity scenarios where our proposals are analyzed. Moreover, consensus clustering<br />

has been conducted by means of the seven consensus functions for hard cluster<br />

ensembles described in appendix A.5, which allows evaluating the behaviour of our<br />

proposals under distinct consensus paradigms. In all cases, the real running times<br />

correspond to an average of 10 independent runs of the whole DHCA, in order to

obtain representative real running time values. As described in appendix A.6, all the<br />

experiments have been executed under Matlab 7.0.4 on Pentium 4 3GHz/1 GB RAM<br />

computers.<br />

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the DHCA variants are depicted by means of<br />

curves representing their average values.<br />

– Which data sets are employed? For brevity reasons, this section only describes<br />

the results of the experiments conducted on the Zoo data collection. On this data<br />

set, the cardinalities of the representational and dimensional diversity factors are<br />

|dfR| = 5 and |dfD| = 14, respectively. The presentation of the results of these<br />

same experiments on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat<br />

unimodal data collections is deferred to appendix C.3.<br />

Diversity scenario |dfA| = 1

Figure 3.10 presents the estimated and real running times of the serial and parallel DHCA<br />

implementations in the lowest diversity scenario corresponding to the use of |dfA| = 1<br />

randomly chosen clustering algorithm for creating a cluster ensemble of size l = 57. The<br />

DHCA variants are identified in the horizontal axis of each chart. Meanwhile, the values<br />

of SERTDHCA and PERTDHCA correspond to an arbitrarily chosen estimation experiment<br />

based on a single consensus run (i.e. c = 1).

On one hand, figures 3.10(a) and 3.10(b) present the estimated and real running times<br />

of the serial DHCA implementation (SERTDHCA and SRTDHCA). Notice that SERTDHCA<br />

is a fairly good estimator of the real execution time of the DHCA variants. Moreover, it<br />

successfully predicts the fastest consensus architecture —which is what we are ultimately<br />

interested in. Notice that DRA is the DHCA variant minimizing SRTDHCA, which corresponds<br />

to the decreasing ordering of the diversity factors in terms of their cardinality.<br />
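Incidentally, this decreasing-cardinality rule is straightforward to express; a minimal sketch (hypothetical variable names) for the Zoo cardinalities of this scenario (|dfA| = 1, |dfD| = 14, |dfR| = 5):

```python
factors = {"A": 1, "D": 14, "R": 5}   # Zoo, |dfA| = 1 diversity scenario
acronym = "".join(sorted(factors, key=factors.get, reverse=True))
print(acronym)   # -> 'DRA'
```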

On the other hand, figures 3.10(c) and 3.10(d) depict the estimated and real running<br />

times corresponding to the fully parallel implementation of DHCA (PERTDHCA and<br />

PRTDHCA, respectively). There are two issues worth noting in this case: firstly, notice that<br />

the real execution time of the distinct DHCA variants shows a notably lower dispersion than<br />

in the serial case, which somehow corroborates our conjecture regarding the unimportance<br />

of the diversity factors ordering in parallel DHCA variants. And secondly, it is clear that<br />

PERTDHCA does not perform as accurately as regards the determination of the fastest con-<br />



Figure 3.10: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 57 (|dfA| = 1, |dfD| = 14, |dfR| = 5). Panels (a) and (b) show the serial estimated and real running times (SERT DHCA and SRT DHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT DHCA and PRT DHCA, in sec.), for the ADR, ARD, DAR, DRA, RAD, RDA and flat variants, under the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

sensus architecture as in the serial case. Moreover, notice that in this low diversity scenario,<br />

hierarchical consensus architectures are, in general terms, slower than flat consensus.<br />

Diversity scenario |dfA| = 10

Figure 3.11 presents the results obtained in the diversity scenario corresponding to the generation<br />

of the cluster ensemble by the application of |dfA| = 10 clustering algorithms chosen<br />

at random. In this case, there exists at least one serial DHCA variant that performs faster<br />

than flat consensus —the only exception occurs when consensus is created by the EAC

consensus function (recall this also happened with RHCA, see section 3.2.4). As in the previous<br />

diversity scenario, all the parallel DHCA variants yield pretty similar running times,<br />

as opposed to what is observed in the serial case, where there exist significant execution<br />

time differences between variants. Last, as regards the prediction of the computationally

optimal consensus architecture, notice that its performance is pretty accurate in both the<br />

serial and parallel cases.<br />

Diversity scenario |dfA| = 19

As the diversity level of the cluster ensemble increases, a shift in the computationally optimal<br />

serial DHCA variant can be observed —see figure 3.12. Indeed, as figure 3.12(b) shows,<br />

the ADR variant of DHCA becomes the least computationally expensive serial hierarchical<br />



Figure 3.11: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 570 (|dfA| = 10, |dfD| = 14, |dfR| = 5). Panels (a) and (b) show the serial estimated and real running times (SERT DHCA and SRT DHCA, in sec.), and panels (c) and (d) the parallel estimated and real running times (PERT DHCA and PRT DHCA, in sec.), for the ADR, ARD, DAR, DRA, RAD, RDA and flat variants, under the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

consensus architecture except when the EAC consensus function is employed —however, this behaviour is not always successfully predicted by SERTDHCA, as depicted in figure 3.12(a). We believe this is due to the fact that our estimation is founded on the execution time of a single consensus process.

As regards the parallel implementation of DHCA, the six DHCA variants yield very<br />

similar real execution times, as shown in figure 3.12(d), a trend that the running time<br />

estimation also captures —see figure 3.12(c). However, notice that this fact makes it difficult<br />

that the absolute minima of PERTDHCA and PRTDHCA coincide, which will probably harm<br />

the predictive accuracy of our proposal in the parallel implementation context. Moreover,<br />

notice that when the MCLA consensus function is employed, flat consensus is not executable<br />

(with the resources available in our experiments, see appendix A.6), while all the DHCA<br />

variants are.<br />

Figure 3.12: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1083. Panels: (a) serial estimated running time, (b) serial real running time, (c) parallel estimated running time and (d) parallel real running time.

Diversity scenario |dfA| = 28

Figure 3.13 presents the estimated and real running times of the serial and parallel DHCA implementations in the highest diversity scenario, i.e. the one corresponding to the creation of the cluster ensemble by means of the |dfA| = 28 clustering algorithms of the CLUTO clustering toolbox —which gives rise to a cluster ensemble containing l = 1596 components. In this context, arranging the diversity factors in decreasing cardinality order for defining their association to the DHCA stages again yields the most computationally efficient serial

DHCA variant (ADR in this case). This fact somehow reinforces the idea that, when compared to the typically linear or quadratic time complexity of consensus functions, the multiplicative growth rate of the total number of consensus imposes a stronger constraint as far as the running time of the DHCA is concerned. As observed in the previous diversity scenarios, the selection of a particular DHCA variant is a less critical matter when the fully parallel implementation of DHCA is considered, as all six variants yield pretty similar real execution times. In the serial case, in contrast, the accuracy of the running time estimation methodology is much more crucial, although SERTDHCA performs as a reasonably successful predictor.

Figure 3.13: Estimated and real running times of the serial and parallel DHCA on the Zoo data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1596. Panels: (a) serial estimated running time, (b) serial real running time, (c) parallel estimated running time and (d) parallel real running time.

As mentioned earlier, these same experiments have been run on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat unimodal data collections, and the corresponding results are presented in appendix C.3. Most of the conclusions drawn regarding the computational efficiency of hierarchical and flat consensus architectures in the analysis of random hierarchical consensus architectures are also applicable in the context of DHCA, such as the high computational efficiency of i) parallel DHCA even in low diversity scenarios, and ii) serial DHCA in medium and high diversity scenarios, or the dependence between the characteristics of the consensus function employed for conducting clustering combination and the execution time of consensus architectures.

Moreover, two extra conclusions regarding the selection of the computationally optimal DHCA variant must be discussed. Firstly, defining the DHCA architecture by means of an ordered list of diversity factors arranged in decreasing cardinality order (i.e. associating the

most numerous diversity factor to the first stage, the second most populated to the second DHCA stage, and so on) seems to give rise to the most computationally efficient serial DHCA variant. And secondly, the execution time of fully parallel DHCA appears to be pretty insensitive to the way diversity factors are associated to the stages of the hierarchical consensus architecture.
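By way of illustration, this decreasing cardinality ordering rule can be expressed in a few lines of code. The following sketch is illustrative only (it is not part of the experimental software employed in this work); it derives the allegedly fastest serial DHCA variant from the diversity factor cardinalities of the Zoo data set in the highest diversity scenario:

```python
def decreasing_cardinality_variant(diversity_factors):
    """Return the DHCA variant obtained by ordering the diversity factors
    by decreasing cardinality (largest factor assigned to the first stage)."""
    ordered = sorted(diversity_factors.items(), key=lambda item: item[1],
                     reverse=True)
    return "".join(name for name, _ in ordered)

# Example: the Zoo data set in the highest diversity scenario
factors = {"A": 28, "D": 14, "R": 5}   # |dfA|, |dfD|, |dfR|
print(decreasing_cardinality_variant(factors))   # -> ADR
```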

In practice, these two latter facts may play down the accuracy of the computationally optimal consensus architecture prediction methodology presented in table 3.6, as it seems possible to make well-grounded a priori decisions as regards the selection of the fastest deterministic hierarchical consensus architecture variant without the need of any running time estimation. For this reason, the next section presents an exhaustive comparative evaluation of these two strategies for predicting the computationally optimal consensus architecture.

Evaluation of the optimal DHCA prediction methodology based on running time estimation

Following a procedure analogous to that of section 3.2.4, we have computed the percentage of experiments in which the estimated and real running times are simultaneously minimized by the same consensus architecture. The impact of the failures of this prediction methodology is measured in terms of the absolute and relative differences between the real execution times –ΔRT– of the truly (i.e. the one minimizing SRTDHCA or PRTDHCA) and the allegedly (the one that minimizes SERTDHCA or PERTDHCA) computationally optimal consensus architectures. The results corresponding to an averaging across 20 independent running time estimation experiments are presented in table 3.7.

Dataset        Serial DHCA                           Parallel DHCA
               % correct   ΔRT (sec.)   ΔRT (%)      % correct   ΔRT (sec.)   ΔRT (%)
Zoo               55.9        1.32       227.3          40.0        0.03        34.0
Iris              93.4        0.56       180.7          51.2        0.10        63.7
Wine              76.1        1.19        48.2          36.8        0.14        48.1
Glass             76.4        0.76        32.1          39.3        0.23        60.7
Ionosphere        83.2       11.9         33.1          47.9        1.38        46.5
WDBC              76.2       77.73        58.1          39.7        4.93        44.5
Balance           90.6        0.52        26.5          71.1        0.93        51.7
MFeat             88.6        5.40        27.4          68.4        5.83        28.9
average           80.0       12.57        78.0          49.3        1.70        47.3

Table 3.7: Evaluation of the minimum complexity DHCA variant estimation methodology in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions.
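For reference, the two figures of merit reported in table 3.7 could be computed along the following lines. This is a minimal sketch that assumes the estimated and real running times of all candidate consensus architectures are available, for every experiment, as dictionaries; all function and variable names are illustrative and do not correspond to the actual implementation used in our experiments:

```python
def evaluate_prediction(estimated, real):
    """Compare the allegedly optimal architecture (minimum estimated running
    time) with the truly optimal one (minimum real running time).

    estimated, real: dicts mapping an architecture label (e.g. 'ADR', 'flat')
    to its running time in seconds.
    """
    predicted = min(estimated, key=estimated.get)   # allegedly optimal
    truly = min(real, key=real.get)                 # truly optimal
    delta_sec = real[predicted] - real[truly]       # overhead of a mistake
    delta_pct = 100.0 * delta_sec / real[truly]
    return predicted == truly, delta_sec, delta_pct

def summarize(experiments):
    """Aggregate (estimated, real) timing pairs into the three columns of
    table 3.7: % correct predictions and average deltas of mistaken ones."""
    rows = [evaluate_prediction(est, rt) for est, rt in experiments]
    pct_correct = 100.0 * sum(ok for ok, _, _ in rows) / len(rows)
    mistakes = [(s, p) for ok, s, p in rows if not ok]
    avg_sec = sum(s for s, _ in mistakes) / len(mistakes) if mistakes else 0.0
    avg_pct = sum(p for _, p in mistakes) / len(mistakes) if mistakes else 0.0
    return pct_correct, avg_sec, avg_pct
```

Note that, in this sketch, the ΔRT averages are taken over the mistaken predictions only, consistently with the way the penalizations are reported in table 3.7.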

It can be observed that, in the serial case, SERTDHCA is a pretty accurate predictor, achieving correct prediction rates of the computationally optimal consensus architecture above 75% in all but one of the data sets. The running time overheads associated with incorrect predictions are usually negligible in absolute terms (ΔRT (sec.)) —except for the WDBC data collection, where the large real execution times of any of the consensus architectures make any mistake costly.

As regards the performance of the proposed methodology for predicting the most efficient parallel implementation of DHCA, its degree of accuracy is lower than in the serial case –a circumstance already observed in the context of RHCA–, although the penalization caused by this lower level of precision remains below one second of extra execution time, which constitutes an assumable cost from a practical viewpoint —the WDBC and MFeat data collections stand out as the exceptions to this rule, although the corresponding ΔRT (sec.) overheads (around five seconds) are again of little importance in practice.

Finally, we have also evaluated the influence of employing the execution times of c > 1 consensus executions for estimating the running times of the DHCA variants. As expected, the larger c, the more accurate the running time estimation and, consequently, the smaller the running time overheads derived from incorrect predictions. On the flip side, however, this will slow down the prediction process —recall that, in the experiments presented up to now, c = 1.
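The role of c can be sketched as follows. The snippet is purely illustrative: consensus_fn and mini_ensembles are placeholders for whatever consensus function and mini-ensemble sampling are being used, and the averaged time is meant to feed the SERT/PERT extrapolation described earlier rather than to replace it:

```python
import time

def average_consensus_time(consensus_fn, mini_ensembles, c=1):
    """Average the measured running time of c consensus processes, each run
    on one mini-ensemble; this average is the per-consensus building block
    from which the stage-wise running time estimates are extrapolated."""
    sample = list(mini_ensembles)[:c]     # assumes at least c mini-ensembles
    timings = []
    for ensemble in sample:
        start = time.perf_counter()
        consensus_fn(ensemble)            # e.g. one consensus run on the sample
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```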

As in section 3.2.4, a sweep of values of c ∈ [1, 20] has been conducted, computing the percentage of fastest consensus architecture correct predictions and the absolute and relative running time deviations associated to prediction errors at each step of the sweep, averaging the results of twenty independent runs of this experiment on each one of the eight unimodal data collections —see figure 3.14.

Figure 3.14: Evolution of the accuracy of the serial and parallel DHCA running time estimation as a function of the number of consensus processes used in the estimation, measured in terms of (a) the percentage of correct predictions, and (b) the absolute and (c) relative running time deviations between the truly and allegedly optimal consensus architectures.

It can be observed that, despite the gradual increase of correct predictions (figure 3.14(a)), the running time deviations suffer a steep decrease as soon as c = 4 consensus processes are employed for computing SERTDHCA. Moreover, notice that using larger

values of c does not imply significant reductions in ΔRT, which shows a pretty stationary behaviour for c > 5. Last, it is worth observing that, in the parallel case, the correct prediction percentage increase and relative ΔRT decrease obtained for c > 1 result in almost negligible absolute ΔRT reductions, which again reveals the lesser importance of the a priori selection of a particular parallel DHCA variant.

This adds to the fact that, as suggested earlier, it seems unnecessary to conduct any running time estimation process for determining the fastest hierarchical consensus architecture variant, as assigning the diversity factors to DHCA stages in decreasing cardinality order apparently gives rise to the serial DHCA variant of minimum time complexity. With the purpose of evaluating the validity of this latter hypothesis, we have conducted the experiments presented throughout the following paragraphs.

Evaluation of the optimal DHCA prediction methodology based on decreasing cardinality diversity factor ordering

As far as the serial DHCA implementation is concerned, we have computed the percentage of experiments where the minimum real running time is achieved by the variant corresponding to the decreasing cardinality ordering of the diversity factors. Moreover, in case this prediction fails, we have computed the running time overhead resulting from selecting a computationally suboptimal hierarchical consensus architecture as the fastest one. The average results obtained for each one of the eight unimodal data collections are presented in table 3.8.

Dataset        Serial DHCA
               % correct    ΔRT (sec.)   ΔRT (%)
Zoo              100            –           –
Iris              96.4         0.01         1.2
Wine              89.3         0.69        16.2
Glass             92.9         0.09         5.4
Ionosphere        85.7        23.10        35.3
WDBC              92.9         2.03         2.2
Balance           75.0         1.12        17.1
MFeat             71.4       233.6         18.5
average           88.0        32.6         11.9

Table 3.8: Evaluation of the minimum complexity serial DHCA variant prediction based on decreasing diversity factor ordering in terms of the percentage of correct predictions and running time penalizations resulting from mistaken predictions.

It can be observed, for instance, that the DHCA variant defined by the ordered list of diversity factors in decreasing cardinality order is always the fastest hierarchical consensus architecture in the Zoo data set —which is equivalent to a 100% correct prediction rate. Using this prediction method, the lowest accuracy is obtained in the MFeat data collection (71.4%) —in this case, the average running time deviation derived from the 28.6% of

incorrect predictions amounts to 233.6 seconds (a very high value due to the large absolute real execution times of hierarchical consensus architectures on this data set), which is equivalent to an average deviation of 18.5% in relative percentage terms.

An averaging across data sets yields a prediction accuracy of 88%, i.e. it performs better than the prediction methodology based on running time estimation, which attained 80% of correct predictions (see table 3.7). This result reinforces the notion that the decreasing cardinality diversity factor ordering approach to selecting the computationally optimal serial DHCA variant is an alternative worth considering, as it requires no previous consensus execution besides obtaining higher levels of prediction accuracy.

Aiming to support the conjecture that there is no computationally superior DHCA variant<br />

when its fully parallel implementation is considered, we have conducted an experiment<br />

seeking to quantify the differences between the least and most time consuming DHCA variants.<br />

So as to provide a valid contrast to these results, the same computation has been<br />

conducted regarding the most and least computationally efficient serial DHCA variants,<br />

proving that making an accurate selection is much more important in the serial than in the<br />

parallel case. Table 3.9 presents the results obtained, averaged across all the experiments<br />

conducted on each of the eight unimodal data collections.<br />

It can be observed that the running time differences between the most and least computationally<br />

efficient DHCA variants are very notable in the serial case —in fact, it takes from<br />

5 to 18 times longer to run the slowest DHCA variant than the computationally optimal<br />

one. In contrast, these variations are much smaller when the fully parallel implementation<br />

of DHCA is considered. In this case, as expected, a greater running time uniformity is<br />

observed across DHCA variants, as the least computationally efficient variant is at most 2.5<br />

times slower than the fastest one.<br />

Dataset        max(SRTDHCA) − min(SRTDHCA)      max(PRTDHCA) − min(PRTDHCA)
                  (sec.)        (%)                (sec.)        (%)
Zoo                 7.36       547.6                 0.08        42.3
Iris                5.34       707.4                 0.12        53.2
Wine               12.66       636.8                 0.23        70.1
Glass               8.78       387.1                 0.34        90.8
Ionosphere        487.73      1650.3                 2.82        92.3
WDBC             2095.36      1357.6                15.91        80.5
Balance           187.77       736.5                 8.57        96.2
MFeat           16667.23      1562.4              1104.08       154.4

Table 3.9: Running time differences between the most and least computationally efficient DHCA variants in both the serial and parallel implementations.

To sum up, the decreasing cardinality diversity factor ordering provides the user with a pretty accurate notion of which is the most computationally efficient DHCA configuration without the need of executing a single consensus process. However, this strategy does not allow deciding whether the allegedly fastest DHCA variant is more efficient than flat consensus. To do so, we propose estimating the running time of the computationally optimal DHCA variant (following the strategy presented in table 3.6), and then i) estimating the running time of flat consensus by extrapolating the execution times of the consensus processes conducted upon mini-ensembles of size |dfi| (for i = {1, ..., f}), or ii) launching the execution of flat consensus, halting it as soon as its running time exceeds the estimated execution time of the allegedly optimal DHCA variant —which is a simpler but less efficient alternative.
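Strategy ii) can be sketched as a simple watchdog around the flat consensus execution. The snippet below is a hedged illustration: run_flat_consensus stands for the actual consensus function call, and launching it in a separate process with a timeout is only one possible way of implementing the early halt:

```python
import multiprocessing as mp

def flat_or_dhca(run_flat_consensus, cluster_ensemble, estimated_dhca_time):
    """Strategy ii): run flat consensus, but halt it if it exceeds the
    estimated running time of the allegedly optimal DHCA variant.

    run_flat_consensus must be a picklable, top-level function.
    Returns 'flat' if flat consensus finished within the budget, and
    'DHCA' if it had to be halted (the hierarchical architecture should
    then be executed instead).
    """
    worker = mp.Process(target=run_flat_consensus, args=(cluster_ensemble,))
    worker.start()
    worker.join(timeout=estimated_dhca_time)
    if worker.is_alive():                 # time budget exceeded
        worker.terminate()
        worker.join()
        return "DHCA"
    return "flat"
```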

Summary of the most computationally efficient DHCA variants<br />

To end this section, we have estimated which are the most computationally efficient consensus<br />

architectures for the twelve unimodal data collections described in appendix A.2.1.<br />

The results corresponding to the fully serial and parallel implementations are presented in<br />

tables 3.10 and 3.11, respectively.<br />

As regards the serial consensus architecture implementation (table 3.10), a few notational<br />

observations must be made: successful predictions of the computationally optimal<br />

consensus architecture (i.e. the minima of SERTDHCA and SRTDHCA are yielded by the<br />

same consensus architecture) are denoted with a dagger (†). Moreover, we highlight the<br />

cases where the minimum time complexity consensus architecture is the DHCA variant defined<br />

by the ordered list of diversity factors arranged in decreasing cardinality order using<br />

the double dagger (‡) symbol. Quite obviously, this only applies to the first eight data<br />

collections (Zoo to MFeat), as in these cases both the estimated and real execution times<br />

are available. For the remaining data sets (miniNG to PenDigits), we have only estimated<br />

which are the computationally optimal consensus architectures.<br />

Firstly, notice the large number of † symbols in table 3.10, which indicates the reasonably high accuracy of the proposed optimal consensus architecture prediction methodology. Moreover, notice that, in most of the cases where we correctly predict that the least time consuming consensus architecture is a DHCA variant, its architecture is created by arranging the diversity factors in decreasing order of cardinality (which is denoted by the ‡ symbol).

Secondly, it is important to highlight that the higher the degree of diversity, the more efficient DHCA variants become when compared to flat consensus —as already observed throughout all the experiments reported, the EAC consensus function constitutes an exception to this rule. However, notice that flat consensus tends to be computationally optimal in those data sets having small cluster ensembles even in high diversity scenarios (e.g. Iris, Balance or MFeat).

Thirdly, as the number of objects n contained in the data set increases (such as in the PenDigits collection), only the HGPA and MCLA consensus functions are executable (as they are the only ones whose time complexity scales linearly with the data set size, see appendix A.5), and hierarchical consensus architectures are the most computationally efficient ones. However, if the data set were even larger, neither flat consensus nor DHCA would be affordable from a computational perspective —with the resources employed in our experiments, see appendix A.6.

Most of these observations can be extrapolated to the case of the fully parallel consensus implementation (table 3.11), where a pretty overwhelming prevalence of DHCA variants over flat consensus can be observed, a trend that was already reported earlier in this section and can also be observed in the experiments described in appendix C.3.

For brevity reasons, the experiments presented in the remainder of this work concerning deterministic hierarchical consensus architectures will solely refer to those DHCA variants of minimum estimated running time —i.e. those presented in tables 3.10 and 3.11.

Consensus  Diversity     Zoo    Iris   Wine   Glass  Ionosphere  WDBC   Balance  MFeat  miniNG  Segmentation  BBC   PenDigits
function   scenario
CSPA       |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  flat†  flat†  DAR‡        flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  flat†  ADR‡   DAR‡        DAR‡   flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   flat†    flat†  ADR     flat          flat  –
EAC        |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  ADR     flat          flat  –
HGPA       |dfA| = 1     flat†  flat†  flat†  flat†  flat†       DRA    flat†    RDA    DRA     DRA           DRA   DRA
           |dfA| = 10    DAR‡   flat†  flat†  ADR‡   DAR‡        DRA    flat†    ARD‡   DAR     DAR           DAR   DAR
           |dfA| = 19    ARD‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   ARD‡     ARD‡   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   DAR‡        ADR    ARD†     ARD‡   ADR     ADR           ADR   ADR
MCLA       |dfA| = 1     DRA    flat†  flat†  flat†  flat†       flat†  flat†    flat†  DRA     DRA           DRA   DRA
           |dfA| = 10    DAR‡   flat†  DAR‡   ADR‡   DAR‡        DAR‡   flat†    ARD‡   DAR     DAR           DAR   DAR
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   ARD‡     ARD‡   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR‡   ARD‡   ADR‡   ADR‡   DAR‡        ADR‡   ARD‡     ARD‡   ADR     ADR           ADR   ADR
ALSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR         flat†  flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   ADR‡        flat†  flat†    flat†  ADR     flat          flat  –
KMSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       DRA    flat†    RAD    flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  DAR    flat†  DAR‡        DAR‡   flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR‡        DAR‡   flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR‡   DAR         ADR‡   flat†    flat†  ADR     flat          flat  –
SLSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR‡   flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR‡   flat†  ADR‡   ADR‡   DAR‡        flat†  flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR‡   flat†  ADR‡   ADR    DAR         flat†  flat†    flat†  ADR     flat          flat  –

Table 3.10: Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully serial implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions. The double dagger (‡) identifies DHCA variants defined by the ordered list of diversity factors in decreasing cardinality order.


Consensus  Diversity     Zoo    Iris   Wine   Glass  Ionosphere  WDBC   Balance  MFeat  miniNG  Segmentation  BBC   PenDigits
function   scenario
CSPA       |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  DRA     flat          flat  –
           |dfA| = 10    DAR†   flat†  DAR†   ADR    DAR†        DAR    flat†    ARD    DAR     DAR           DAR   –
           |dfA| = 19    ADR†   ARD    ADR†   ADR†   DAR         DAR†   flat†    ARD    ADR     ADR           ADR   –
           |dfA| = 28    ADR†   ARD†   ADR†   ADR†   DAR         ADR†   ARD†     ARD    ADR     DAR           ADR   –
EAC        |dfA| = 1     DRA    flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR    flat†  DAR    ADR    DAR         DAR    flat†    flat†  flat    flat          flat  –
           |dfA| = 19    ADR†   ARD    ADR†   ADR†   DAR         DAR    flat†    flat†  flat    flat          flat  –
           |dfA| = 28    ADR†   ARD    ADR    ADR    DAR         ADR    flat†    flat†  flat    flat          flat  –
HGPA       |dfA| = 1     DRA    flat†  flat†  DRA    DRA†        DRA†   flat†    RAD    DRA     DRA           DRA   DRA
           |dfA| = 10    DAR    ARD    DAR    ADR†   DAR†        DAR    ARD      ARD†   DAR     DAR           DAR   DAR
           |dfA| = 19    ADR    ARD†   ADR    ADR†   DAR†        DAR    ARD      ARD†   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR†   ARD†   ADR†   ADR†   DAR†        ADR    ARD      ARD    ADR     ADR           ADR   ADR
MCLA       |dfA| = 1     DRA†   flat†  flat†  flat†  DRA†        DRA†   flat†    flat†  DRA     DRA           DRA   DRA
           |dfA| = 10    DAR†   ARD    DAR    ADR†   DAR†        DAR    ARD      ARD†   DAR     DAR           DAR   DAR
           |dfA| = 19    ADR†   ARD†   ADR†   ADR    DAR†        DAR    ARD†     ARD†   ADR     ADR           ADR   ADR
           |dfA| = 28    ADR†   ARD†   ADR†   ADR†   DAR         ADR†   ARD      ARD†   ADR     ADR           ADR   ADR
ALSAD      |dfA| = 1     flat†  flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR    flat†  DAR    ADR†   DAR         DAR    flat†    flat†  DAR     flat          flat  –
           |dfA| = 19    ADR†   flat†  ADR    ADR    DAR         DAR    flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR†   ARD†   ADR    ADR    DAR         ADR†   flat†    flat†  ADR     flat          flat  –
KMSAD      |dfA| = 1     flat†  flat†  DRA    DRA    flat†       DRA†   flat†    RAD    DRA     flat          flat  –
           |dfA| = 10    DAR    ARD    DAR†   ADR    DAR         DAR    ARD      ARD    DAR     DAR           DAR   –
           |dfA| = 19    ADR    ARD†   ADR†   ADR    DAR         DAR    ARD      ARD    ADR     ADR           ADR   –
           |dfA| = 28    ADR†   ARD    ADR†   ADR    DAR         ADR†   ARD†     ARD    ADR     ADR           ADR   –
SLSAD      |dfA| = 1     DRA    flat†  flat†  flat†  flat†       flat†  flat†    flat†  flat    flat          flat  –
           |dfA| = 10    DAR    flat†  DAR    ADR    DAR         DAR    flat†    flat†  DAR     flat          flat  –
           |dfA| = 19    ADR    ARD    ADR    ADR    DAR         DAR    flat†    flat†  ADR     flat          flat  –
           |dfA| = 28    ADR    ARD    ADR†   ADR    DAR         ARD    flat†    flat†  ADR     ADR           flat  –

Table 3.11: Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully parallel implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.


3.4 Flat vs. hierarchical consensus

In sections 3.2 and 3.3, two specific implementations of hierarchical consensus architectures have been proposed, alongside a methodology for determining a priori which is the fastest (random or deterministic) HCA variant, and for deciding whether it is computationally advantageous with respect to classic flat consensus. In this section, we present a direct twofold comparison between flat consensus and those DHCA and RHCA variants deemed as the most computationally efficient by the proposed running time estimation methodologies. Firstly, we compare them in terms of computational complexity. In fact, such a comparison could be made upon the results presented in sections 3.2 and 3.3, but we think that a comparison considering only the allegedly best performing variants may simplify the process of drawing meaningful conclusions. And secondly, these least time consuming hierarchical consensus architecture variants will be compared with flat consensus in terms of the quality of the consensus clustering solutions they yield. By doing so, we intend to present a comprehensive picture of our hierarchical consensus architecture proposals in terms of the two main factors that condition robust clustering by consensus: time complexity and quality.

3.4.1 Running time comparison

This section compares the real execution times of the allegedly fastest DHCA and RHCA variants and flat consensus. The experiments conducted follow the design outlined next.

Experimental design

– What do we want to measure? The time complexity of the allegedly fastest DHCA and RHCA variants and flat consensus.

– How do we measure it? We measure the CPU time required for the execution of the aforementioned consensus architectures.

– How are the experiments designed? Such a comparison entails the running times of ten independent runs of each one of the compared consensus architectures. So as to evaluate their computational efficiency under distinct experimental conditions, the consensus processes involved have been conducted by means of the seven consensus functions for hard cluster ensembles employed in this work —see appendix A.5. Moreover, experiments have been replicated on the four diversity scenarios described in appendix A.4 —recall that they differ in the algorithmic diversity factor, as a set of |dfA| = {1, 10, 19, 28} randomly chosen clustering algorithms are employed for creating the cluster ensemble in each diversity scenario.

– How are results presented? In formal terms, the measured execution times are presented by means of boxplot charts, so as to provide the reader with a notion of the degree of dispersion and asymmetry of the running times of each consensus architecture. When comparing boxplots, notice that non-overlapping box notches indicate that the medians of the compared running times differ at the 5% significance level, which allows a quick inference of the statistical significance of the results (an illustrative sketch of this kind of notched boxplot comparison is given right after this list).

– Which data sets are employed? A detailed description of the results of this comparison on the Zoo data collection is presented in the following paragraphs. Recall that the cardinalities of the dimensional and representational diversity factors of this data set are |dfD| = 14 and |dfR| = 5, respectively. For brevity reasons, the results obtained on eleven more unimodal data sets are described in detail in appendix C.4. However, at the end of this section, the running times of the three compared consensus architectures measured across the experiments conducted on the twelve unimodal data collections employed in this work are compiled and compared. The goal of such comparison is to analyze whether any of the consensus architectures is inherently faster than the rest.

Figure 3.15: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 57. Panels: (a) serial implementation running time, (b) parallel implementation running time, with one subplot per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).
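By way of illustration, the kind of notched boxplot comparison shown in figure 3.15 and used throughout this section can be generated with a few lines of matplotlib (our experiments were not produced with this snippet; the timing values below are placeholders rather than measured data):

```python
import matplotlib.pyplot as plt

# Placeholder CPU times (sec.) of ten runs per consensus architecture
rhca_times = [0.31, 0.29, 0.35, 0.30, 0.33, 0.28, 0.32, 0.31, 0.34, 0.30]
dhca_times = [0.26, 0.24, 0.27, 0.25, 0.28, 0.23, 0.26, 0.25, 0.27, 0.24]
flat_times = [0.12, 0.11, 0.13, 0.12, 0.14, 0.11, 0.12, 0.13, 0.12, 0.11]

fig, ax = plt.subplots()
# notch=True draws median confidence notches: non-overlapping notches
# suggest that the medians differ at (roughly) the 5% significance level
ax.boxplot([rhca_times, dhca_times, flat_times],
           notch=True, labels=["RHCA", "DHCA", "flat"])
ax.set_ylabel("CPU time (sec.)")
ax.set_title("CSPA")   # one such subplot per consensus function
plt.show()
```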

Diversity scenario |dfA| = 1

The running times of the estimated computationally optimal serial and parallel DHCA and RHCA implementations and flat consensus architectures in the lowest diversity scenario are presented in figure 3.15.

As regards the fully serial implementation (figure 3.15(a)), it can be observed that flat consensus is 1.2 to 5 times faster than the fastest hierarchical consensus variant regardless of the consensus function employed, and that such differences are, in all cases, statistically significant. As far as the hierarchical consensus architectures are concerned, notice that the fastest DHCA variant (DRA) is more computationally efficient than its RHCA counterpart (which has s = 2 stages and mini-ensembles of size b = 28), except when consensus processes are conducted by means of the EAC and SLSAD consensus functions —in this case, statistically equivalent running times are attained by both HCA. If these results are contrasted with the predicted computationally optimal consensus architectures presented in tables 3.4 and 3.10, a single prediction error is detected (flat consensus turns out to be faster than the DRA DHCA variant when consensus is conducted by MCLA), which reinforces the notion that the proposed computationally optimal consensus architecture prediction methodology performs pretty well.

Figure 3.15(b) presents the running times of the fully parallel optimal hierarchical consensus architectures and flat consensus. As in the serial case, it can be noticed that flat consensus tends to be more efficient than RHCA and DHCA. The only exception occurs when consensus is conducted by means of the MCLA consensus function —which is due to the fact that it is the only combiner whose computational complexity increases quadratically with the size of the cluster ensemble. Last but not least, it is worth noting that, as opposed to what was observed in the serial implementation, the fastest RHCA is less time consuming than the most efficient DHCA variant, and the running time differences between them are statistically significant. The reason for this lies in the fact that this specific RHCA variant has s = 2 stages and consensus is conducted on mini-ensembles of size b = 7, whereas the DHCA variant consists of three consensus stages, in one of which consensus are built on larger mini-ensembles of size |dfD| = 14, which is responsible for the higher computational cost of parallel DHCA in this case.

Diversity scenario |dfA| = 10

The results corresponding to the experiments conducted in the second diversity scenario (i.e. cluster ensembles generated by the compilation of the clusterings output by |dfA| = 10 randomly selected clustering algorithms, giving rise to cluster ensembles of size l = 570) are presented in figure 3.16. In particular, figure 3.16(a) depicts the execution time boxplots of the serial implementation of hierarchical consensus architectures. The first noticeable situation is that, in contrast to what was observed in the lowest diversity scenario, the computationally optimal RHCA variant (s = 2 and b = 20 or b = 285, depending on the consensus function employed) is faster than its DHCA counterpart (DAR). This is due to the fact that the rise of the algorithmic diversity factor (from |dfA| = 1 to |dfA| = 10) entails an increase of the computational cost of one of the DHCA stages that exceeds the increment of the complexity of the RHCA caused by the same factor. Meanwhile, regarding the computational efficiency of flat consensus, two opposed behaviours are observed depending on the consensus function employed: while being faster than any hierarchical architecture when consensus are built using the CSPA and EAC consensus functions, one-step consensus is slower when the remaining clustering combiners are employed. Moreover, the differences between the running times of these consensus architectures are statistically significant at the 5% significance level in all cases.

Figure 3.16(b) presents the results corresponding to the fully parallel implementation of consensus architectures. In this case, flat consensus is at least four times more computationally costly than the DHCA and RHCA variants. Moreover, the optimal RHCA

is faster than its DHCA counterpart (except for the MCLA consensus function), although they both attain very similar execution times, their differences being statistically significant just below the 5% significance level.

Figure 3.16: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 570. Panels: (a) serial implementation running time, (b) parallel implementation running time.

Diversity scenario |dfA| = 19

The running times of the consensus architectures corresponding to the experiments conducted on the third diversity scenario (i.e. cluster ensembles of size l = 1083) are presented in figure 3.17. Figure 3.17(a) depicts the execution time boxplots of the serially implemented consensus architectures. In most cases, hierarchical consensus architectures are faster than their flat counterpart —the only exception occurs when consensus are built using the EAC consensus function, a trend that was already observed in sections 3.2 and 3.3. Notice, moreover, that flat consensus is not executable when MCLA is the consensus function employed for creating the consensus clustering solutions.

When the entirely parallel implementation of hierarchical consensus architectures is evaluated from a computational viewpoint, the optimal RHCA and DHCA variants (see tables

3.5 and 3.11) are at least eight times faster than flat consensus –see figure 3.17(b)– attaining very similar execution times (being statistically equivalent when the HGPA, ALSAD and SLSAD consensus functions are employed). This is quite logical provided that DHCA has 3 stages, building consensus on mini-ensembles of sizes |dfA| = 19, |dfD| = 14 and |dfR| = 5, while the fastest RHCA has two or three stages (depending on the consensus function employed) where consensus is built on mini-ensembles of size b = 26 or b = 27. At the end of the day, the number of stages and the mini-ensemble sizes of DHCA and RHCA are counterbalanced, yielding, as mentioned earlier, pretty similar running times. Notice, however, that when consensus are built using the MCLA consensus function, RHCA is penalized with respect to DHCA, given the larger size of its mini-ensembles and the aforementioned quadratic dependence of this consensus function's running time on this factor.

Figure 3.17: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1083. Panels: (a) serial implementation running time, (b) parallel implementation running time.

Diversity scenario |dfA| = 28

A very similar behaviour to the one just reported is observed when the size of the cluster ensembles is increased. Indeed, in the highest diversity scenario, i.e. the one corresponding to the use of the |dfA| = 28 clustering algorithms of the CLUTO toolbox for creating cluster

ensembles of size l = 1596, almost identical running time boxplot charts are obtained —see figure 3.18.

Figure 3.18: Running times of the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1596. Panels: (a) serial implementation running time, (b) parallel implementation running time.

As aforementioned, this same experiment has been conducted on the eleven remaining unimodal data collections, and the corresponding execution time boxplot charts are presented in appendix C.4. From their analysis, the following conclusions have been drawn:

– the predictions regarding the computational optimality of flat or hierarchical consensus architectures are made with a high degree of accuracy (in fact, average prediction accuracies of 93.45% and 91.07% are obtained across our experiments in the serial and parallel implementation cases, respectively).

– flat consensus is the most efficient consensus architecture when the EAC consensus function is employed, regardless of the diversity scenario or whether the serial or parallel implementation of hierarchical architectures is employed.

– in large datasets, only HGPA and MCLA are executable, as the time and space complexities of the remaining consensus functions scale quadratically with the number of objects in the collection, as they employ object co-association matrices as a basic element of their consensus processes.

– regarding which type of hierarchical consensus architecture (RHCA or DHCA) is more computationally efficient, little can be said a priori, as it depends on the specific configurations of the hierarchical architectures, i.e. their number of stages and the sizes of the mini-ensembles.

– using the MCLA consensus function penalizes those architectures with large mini-ensembles, as its time complexity depends quadratically on this factor.

Comparison across diversity scenarios and data collections<br />

Aiming to reveal the existence of any global computational superiority pattern between<br />

the two hierarchical consensus architecture variants, besides confirming the hypotheses put<br />

forward earlier, we have compiled the real running times of the assumedly computationally<br />

optimal RHCA and DHCA variants and of flat consensus in each diversity scenario using<br />

each consensus function, across the twelve data collections employed in these experiments.<br />

This process has been replicated for both the fully serial and parallel implementations<br />

of hierarchical consensus architectures, and as a result, the running time boxplot charts<br />

presented in figures 3.19 and 3.20 have been obtained.<br />

For starters, figure 3.19 presents the running times corresponding to the entirely serial<br />

implementation of the computationally optimal RHCA and DHCA variants and flat consensus.<br />

Notice the notable height of the boxes in the boxplots, caused by the fact that they<br />

represent running times of the consensus architectures across data collections with fairly<br />

distinct characteristics (i.e. number of objects and clusters). Nevertheless, the focus of this<br />

analysis should be placed on detecting relative differences between the boxes corresponding<br />

to the three consensus architectures. In general terms, it can be observed how flat consensus<br />

becomes gradually slower than hierarchical consensus as the size of the cluster ensembles<br />

grows (i.e. as the cardinality of the algorithmic diversity factor |dfA| is increased). As reported<br />

earlier, consensus architectures based on the EAC consensus function constitute the<br />

only exception to this rule. As regards the comparison between the running times of RHCA<br />

and DHCA, the most significant differences are observed when the ALSAD and SLSAD<br />

consensus functions are employed —in these cases, the random HCA variants are faster<br />

than their deterministic counterparts, a trend that becomes more apparent as the cluster<br />

ensembles size grows. Finally, notice that, in absolute terms, consensus architectures based<br />

on the HGPA, MCLA and CSPA consensus functions are faster than those employing the<br />

EAC, ALSAD, KMSAD and SLSAD clustering combiners.<br />

And secondly, the execution time boxplots corresponding to flat consensus and the fully<br />

parallel implementation of consensus architectures are depicted in figure 3.20. As in the serial<br />

case, the use of the HGPA, MCLA and CSPA consensus functions gives rise, in general,<br />

to faster consensus architectures than when consensus clustering solutions are generated by<br />

means of the EAC, ALSAD, KMSAD and SLSAD clustering combiners. Moreover, the superiority<br />

of hierarchical architectures in front of flat consensus becomes manifest in diversity<br />

scenarios with |dfA| ≥10, depending on the consensus function employed. As regards the<br />

comparison between RHCA and DHCA, a wide spectrum of behaviours is observed. When<br />

consensus is built upon the CSPA, EAC, HGPA and KMSAD consensus functions, little significant<br />

differences between both hierarchical architectures are detected. Meanwhile, RHCA<br />

94


[Figure omitted from the text extraction: a grid of log-scale CPU time (sec.) boxplots, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), each comparing the RHCA, DHCA and flat architectures; sub-figures (a) to (d) show the serial implementation running time for |dfA| = 1, 10, 19 and 28.]

Figure 3.19: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures across all data collections for the diversity scenarios corresponding to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.


tends to be slightly faster to execute than DHCA in high diversity scenarios, although the fact that the mini-ensemble sizes employed by the fastest random architecture variants are usually larger than those employed in DHCA penalizes the consensus architectures based on the MCLA consensus function. And last, as already observed in the serial case, RHCA tends to be slightly more efficient than DHCA when clustering combination is conducted by means of consensus functions that treat hierarchical clustering similarity measures as data (i.e. ALSAD and SLSAD).

To sum up, we can conclude that there exists a strong dependence between the computationally optimal type of consensus architecture, the size of the cluster ensemble upon which consensus is built and the consensus function employed. From a practical standpoint, when faced with a specific consensus clustering problem (i.e. a cluster ensemble of a given size l and a particular computational resources configuration), the user should take into account how these factors interact when deciding which type of consensus architecture is to be implemented. However, this decision should not be made on computational efficiency grounds only. In fact, it should also take into account the quality of the consensus clustering solution obtained, as quickly obtaining a poor consensus data grouping would be of little use in practice. For this reason, the next section evaluates the quality of the consensus label vectors output by the same consensus architectures that have just been analyzed in computational terms.

3.4.2 Consensus quality comparison<br />

In this section, we evaluate the quality of the consensus clustering solutions yielded by the<br />

fastest DHCA and RHCA variants and flat consensus architectures, which constitutes an<br />

indicator of their suitability for conducting robust clustering. The experiments conducted<br />

to this end follow the design described next.<br />

Experimental design<br />

– What do we want to measure?<br />

i) The suitability of the allegedly fastest DHCA and RHCA variants and flat consensus<br />

for obtaining clustering results robust to the inherent indeterminacies of<br />

clustering.<br />

ii) A further goal of this section is to determine whether certain consensus architectures<br />

tend to outperform others as regards the quality of the consensus clusterings<br />

they obtain.<br />

– How do we measure it?<br />

i) We analyze the quality of the consensus clustering solutions obtained by these consensus architectures, comparing it with that of the individual clusterings contained in the cluster ensemble E upon which consensus is built. The more similar the qualities of the consensus clustering solution and the top quality cluster ensemble components, the higher the robustness to the clustering indeterminacies attained. As mentioned in section 1.2.2, in this work we evaluate clustering solutions by means of an external cluster validity index, i.e. we compare the consensus clustering solution embodied in the labeling vector λc with a predefined



[Figure omitted from the text extraction: a grid of log-scale CPU time (sec.) boxplots, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), each comparing the RHCA, DHCA and flat architectures; sub-figures (a) to (d) show the parallel implementation running time for |dfA| = 1, 10, 19 and 28.]

Figure 3.20: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures across all data collections for the diversity scenarios corresponding to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.


and allegedly correct cluster structure (or ground truth), measuring their degree of resemblance in terms of normalized mutual information, φ (NMI) —recall that φ (NMI) ∈ [0, 1] and the higher its value, the better the quality of the consensus clustering solution. We measure the percentage of experiments in which the consensus clusterings outperform the cluster ensemble components of maximum and median φ (NMI) score, as well as the relative φ (NMI) differences between them.

ii) We compare the φ (NMI) scores of the consensus clusterings obtained by the three<br />

consensus architectures subject to evaluation.<br />

– How are the experiments designed? The experimental methodology followed is<br />

the same as when the computational efficiency of consensus architectures was analyzed<br />

in the previous sections. That is, the consensus quality comparison has been<br />

conducted on the four diversity scenarios described in appendix A.4. In each diversity<br />

scenario, ten independent experiments have been conducted using the seven consensus<br />

functions for hard cluster ensembles employed in this work (CSPA, EAC, HGPA,<br />

MCLA, ALSAD, KMSAD and SLSAD). From a formal viewpoint, the φ (NMI) values<br />

of the consensus clustering solutions obtained in the 10 experiments corresponding to<br />

each consensus function and diversity scenario are presented.<br />

– How are results presented? The measured φ (NMI) values are presented by means of boxplot charts. By doing so, we can see the quality scatter of each consensus function and architecture. Again, non-overlapping box notches indicate that the medians of the compared φ (NMI) values differ at the 5% significance level.

– Which data sets are employed? In this section, we present in detail the results<br />

obtained on the Zoo data collection —for the sake of brevity, the results obtained in<br />

the remaining eleven unimodal data sets are deferred to appendix C.4. However, at<br />

the end of this section, the φ (NMI) scores of the three compared consensus architectures<br />

measured across the experiments conducted on the twelve unimodal data collections<br />

employed in this work are compiled and compared. The goal of such a comparison is to analyze whether any of the consensus architectures tends to yield better consensus clustering solutions than the rest.

One last remark before proceeding: notice that the only differences between serial and parallel hierarchical consensus architectures refer to their time complexity, not to the quality of the consensus clustering solutions they yield. For this reason, the distinction between serial and parallel architectures is not made in this section.
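Since all the evaluations that follow are based on φ (NMI), a minimal Python sketch of how such a score could be computed is given below. It is not part of the original experimental setup: the toy labelings and variable names are placeholders, and scikit-learn's geometric-mean normalization is used because it matches the normalization of Strehl and Ghosh (2002).

```python
# Minimal sketch, assuming scikit-learn is available; the labelings below are
# toy placeholders, not data from the thesis experiments.
from sklearn.metrics import normalized_mutual_info_score

ground_truth     = [0, 0, 0, 1, 1, 1, 2, 2]   # allegedly correct cluster structure
consensus_labels = [0, 0, 1, 1, 1, 1, 2, 2]   # consensus labeling lambda_c

# phi_NMI lies in [0, 1]; the closer to 1, the better the consensus clustering
# resembles the ground truth. Geometric averaging follows Strehl and Ghosh (2002).
phi_nmi = normalized_mutual_info_score(ground_truth, consensus_labels,
                                       average_method="geometric")
print(f"phi_NMI = {phi_nmi:.3f}")
```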

Diversity scenario |dfA| = 1

Firstly, the φ (NMI) values of the estimated optimal serial and parallel DHCA and RHCA implementations and flat consensus architectures in the lowest diversity scenario are presented in figure 3.21. Each chart presents four boxes that, from left to right, represent the φ (NMI) values of the components of the cluster ensemble E, and of the consensus clustering solutions output by the RHCA, DHCA and flat consensus architectures, respectively. It can be observed that the three consensus architectures yield, in general, consensus solutions of fairly similar quality (in fact, the differences between them are statistically non-significant at the 5% level for the CSPA, MCLA, ALSAD and KMSAD consensus functions). The



[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.21: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 57.

largest inter-consensus architecture deviations are found when consensus clustering is based<br />

on the HGPA consensus function, as the notches of the corresponding φ (NMI) boxes are far<br />

from overlapping. If the consensus functions are compared in terms of the quality of the<br />

clustering solutions they yield, it can be observed that the best results are obtained by the<br />

EAC, ALSAD, KMSAD and SLSAD consensus functions. In these cases, the medians of<br />

the consensus solutions output by the consensus architectures are better than 75% of the components of the cluster ensemble E, which denotes a notable level of robustness to

the clustering indeterminacies.<br />

Diversity scenario |dfA| = 10

Secondly, the quality of the consensus clustering solutions output by the consensus architectures in the experiments conducted on the diversity scenario corresponding to cluster ensembles of size l = 570 is presented in figure 3.22. The trends detected in the previous diversity scenario are largely confirmed in this case. That is, those consensus functions based on evidence accumulation (EAC) and object similarity as data (ALSAD, KMSAD and SLSAD) yield the best quality consensus clustering solutions, and they show a high degree of independence with respect to the topology of the consensus architecture. In fact, EAC and SLSAD based consensus architectures give rise to consensus clusterings which are better than 80% of the cluster ensemble components, which again reveals the ability of these consensus functions to attain clustering solutions robust to the uncertainties inherent to clustering. In contrast, consensus labelings obtained by hypergraph-based consensus functions (CSPA, HGPA and MCLA) attain lower φ (NMI) values, while showing a larger quality variability (the latter applies only to the HGPA and MCLA consensus functions).

Diversity scenario |dfA| = 19

The results corresponding to the experiments conducted in the third diversity scenario (i.e.<br />

cluster ensembles generated by the compilation of the clusterings output by |dfA| = 19 randomly selected clustering algorithms, giving rise to cluster ensembles of size l = 1083)

are presented in figure 3.23. The behaviour detected in the previous diversity scenarios<br />

is also found in this case. Again, the largest inter-consensus architecture variations are<br />


[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.22: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 570.

[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.23: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1083.

observed when consensus is built using the HGPA consensus function. In the remaining cases, much smaller deviations are found (perhaps with the exception of ALSAD, though the observed dispersions are smaller than those of HGPA) —in fact, statistically non-significant differences between the three consensus architectures are observed in the EAC, MCLA, KMSAD and SLSAD based architectures.

Diversity scenario |dfA| = 28

And last, the φ (NMI) values of the consensus clustering solutions yielded by the hierarchical<br />

and flat consensus architectures corresponding to the experiments conducted on the<br />

highest diversity scenario (i.e. cluster ensembles of size l = 1596) are presented in figure<br />

3.24. This scenario is ideal for analyzing the variability of the quality of the consensus<br />

clustering solutions output by the distinct consensus functions, as exactly the same cluster<br />

ensemble has been employed in the ten experiments analyzed —in contrast, in the previous<br />

diversity scenarios, the cluster ensemble employed in each one of the ten experiments was<br />

created by compiling the clustering components generated by |dfA| = {1, 10, 19} randomly<br />

picked clustering algorithms (that is, two superimposed randomness factors underlie the<br />

boxplots presented in figures 3.21 to 3.23). In this sense, it can be observed that, for any<br />



[Figure omitted from the text extraction: boxplots of φ (NMI), one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); each panel shows, from left to right, the cluster ensemble E and the RHCA, DHCA and flat consensus solutions.]

Figure 3.24: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity scenario corresponding to a cluster ensemble of size l = 1596.

given consensus architecture, the HGPA, MCLA and KMSAD consensus functions have the highest quality variability, due to the existence of some random underlying process in their consensus generation procedures (e.g. the random initialization of k-means in the KMSAD consensus function). In contrast, the qualities of the consensus clusterings output by those consensus architectures based on CSPA, EAC, ALSAD and SLSAD show very small (or even null) variations. As regards the inter-consensus architecture quality divergences, those based on the HGPA and ALSAD consensus functions show the most disparate results, whereas in other cases (e.g. CSPA, EAC, MCLA or KMSAD) statistically equivalent qualities are yielded by the three consensus architectures. And last, as far as the robustness of the consensus clustering solutions is concerned, notice that the EAC, ALSAD, KMSAD and SLSAD based consensus architectures yield the highest quality clustering results, getting fairly close to the top-quality component of the cluster ensemble E and being, in most cases, better than 75% of the clusterings contained in it.

Comparison across diversity scenarios and data collections<br />

So as to provide the reader with a global comparative view of the consensus architectures in<br />

terms of the quality of the consensus clustering solutions they yield, we have compiled the<br />

φ (NMI) values obtained across all the experiments conducted on the twelve unimodal data<br />

collections in each diversity scenario, representing them in the boxplots depicted in figure<br />

3.25. Recall that, when comparing boxplots, non-overlapping box notches indicate that the medians of the compared magnitudes differ at the 5% significance level.
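As a purely illustrative aside (not part of the thesis experiments), the sketch below shows how such notched boxplots could be produced with matplotlib; the three arrays are random placeholders standing in for the per-experiment φ (NMI) scores of each architecture.

```python
# Illustrative sketch only: random placeholder scores, not real results.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
phi_nmi_rhca = rng.uniform(0.55, 0.85, size=120)
phi_nmi_dhca = rng.uniform(0.55, 0.85, size=120)
phi_nmi_flat = rng.uniform(0.50, 0.80, size=120)

# notch=True draws confidence notches around each median; if the notches of two
# boxes do not overlap, their medians differ at roughly the 5% significance level.
plt.boxplot([phi_nmi_rhca, phi_nmi_dhca, phi_nmi_flat], notch=True)
plt.xticks([1, 2, 3], ["RHCA", "DHCA", "flat"])
plt.ylabel("phi_NMI")
plt.show()
```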

A twofold qualitative analysis can be made in view of these results. The first aspect of<br />

study is an intra-consensus function comparison among consensus architectures. A quick<br />

inspection of any of the rows of figure 3.25 reveals that the optimality of consensus architectures<br />

is a property that is local to the consensus function applied. When the clustering<br />

combination process is based on the CSPA consensus function, the three consensus architectures<br />

yield pretty similar quality consensus solutions (as the boxes have a notable overlap),<br />

although DHCA tends to attain slightly higher φ (NMI) values —a similar pattern is observed<br />

in the boxplots presented in the column corresponding to the EAC consensus function. In<br />

contrast, flat consensus architectures yield higher quality consensus than their hierarchical<br />

counterparts when they are based on the HGPA clustering combiner. The analysis of the<br />


results obtained when the MCLA consensus function is employed as the basis of consensus<br />

architectures must be made with care, as flat consensus is not executable when it must<br />

be conducted on large cluster ensembles. For this reason, the boxplots presented in the<br />

MCLA column only reflect the φ (NMI) values corresponding to those experiments where the<br />

three consensus architectures are executable. In these cases, the best consensus clustering<br />

solutions are obtained by the flat and DHCA consensus architectures. A greater degree of<br />

evenness between consensus architectures is observed in the consensus functions that treat<br />

object similarity as object features, i.e. ALSAD, KMSAD and SLSAD —notice the large<br />

overlap between boxes. However, whereas DHCA yields slightly lower quality consensus<br />

clustering solutions than the RHCA and flat consensus architectures when the ALSAD and<br />

KMSAD consensus functions are employed, it is the flat consensus approach that attains<br />

the lowest φ (NMI) values among the SLSAD based consensus architectures.<br />

And secondly, if an inter-consensus function comparison is conducted, we can conclude that the excellent performance of the EAC consensus function on the Zoo data collection apparently constitutes an exception to the rule, as –together with HGPA and SLSAD– it yields the lowest φ (NMI) values (i.e. the poorest consensus clustering solutions) when the results obtained across all the data sets and diversity scenarios are compiled. In contrast, CSPA, MCLA, ALSAD and KMSAD tend to yield comparatively better consensus clustering solutions from a global perspective.

From a more quantitative perspective, we have compared the quality of the consensus clustering solutions yielded by the three consensus architectures with that of the components of the cluster ensemble upon which consensus is conducted. In particular, this comparison has taken into account the cluster ensemble components of median and maximum φ (NMI) with respect to the ground truth (referred to as the median ensemble component, or MEC, and the best ensemble component, or BEC, respectively). This comparison makes sense inasmuch as we regard consensus clustering as a means for becoming robust to the inherent indeterminacies that affect the clustering problem. More specifically, the higher the φ (NMI) of the consensus clustering solution with respect to that of the cluster ensemble components, the higher the robustness achieved. The median and maximum φ (NMI) components are used as a summarized reference of the quality of the cluster ensemble contents.

For this reason, we have evaluated i) the percentage of experiments in which the consensus<br />

clustering solution attains a higher φ (NMI) than the MEC and the BEC, and ii)<br />

the relative percentage φ (NMI) variation between the median and the best cluster ensemble<br />

components and the consensus clustering solution.<br />
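A small sketch of how these two figures of merit could be computed is given below. It assumes that per-experiment φ (NMI) arrays are available for the consensus solution and for the MEC/BEC references, and it averages the relative gain only over the experiments in which the consensus wins; this averaging convention, like all names and values in the snippet, is an assumption for illustration, not a statement of how the reported numbers were obtained.

```python
# Sketch only; names, arrays and the exact averaging convention are assumptions.
import numpy as np

def compare_to_reference(consensus_nmi, reference_nmi):
    """Return (% of experiments where the consensus beats the reference,
    mean relative % phi_NMI gain over the reference in those experiments)."""
    c = np.asarray(consensus_nmi, dtype=float)
    r = np.asarray(reference_nmi, dtype=float)
    wins = c > r
    pct_wins = 100.0 * wins.mean()
    gain = float("nan")
    if wins.any():
        gain = float(np.mean(100.0 * (c[wins] - r[wins]) / r[wins]))
    return pct_wins, gain

# Toy values: one entry per experiment.
consensus_nmi = [0.71, 0.64, 0.80]
mec_nmi       = [0.55, 0.70, 0.62]   # median ensemble component per experiment
bec_nmi       = [0.78, 0.81, 0.79]   # best ensemble component per experiment

print(compare_to_reference(consensus_nmi, mec_nmi))
print(compare_to_reference(consensus_nmi, bec_nmi))
```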

As regards the first issue, table 3.12 presents the percentage of experiments (considering all data sets in the highest diversity scenario) where the consensus clustering solution is better than the median normalized mutual information cluster ensemble component (MEC). It can be observed that the average percentage of such experiments is 53.1%, which indicates that in more than half of the experiments, consensus yields a clustering solution better than the component located halfway up the quality ranking of the cluster ensemble. When the relative percentage φ (NMI) gains between consensus clustering solutions and the MEC are computed, a reasonable average gain of 59% is obtained —see table 3.13 for a detailed presentation of the results per consensus function and consensus architecture.



[Figure omitted from the text extraction: boxplots of φ (NMI) for the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD); sub-figures (a) to (d) correspond to |dfA| = 1, 10, 19 and 28.]

Figure 3.25: φ (NMI) of the consensus solutions obtained by the computationally optimal parallel RHCA, DHCA and flat consensus architectures across all data collections for the diversity scenarios corresponding to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.


Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             58.3    25      42.3    33.2    66.7    69.8    41.7
RHCA             69.8    24.7    15.9    74.2    79.1    79.1    50.4
DHCA             83.3    8.3     3.9     77.1    75      77.9    58.3

Table 3.12: Percentage of experiments in which the consensus clustering solution is better than the median cluster ensemble component.

given ground truth (i.e. the BEC), we observe that the consensus clustering solution attains higher φ (NMI) values in only 0.1% of the experiments —see table 3.14. If the degree of improvement of those consensus clustering solutions that attain a higher φ (NMI) than the BEC is measured in terms of relative percentage φ (NMI) increase, a modest 0.6% φ (NMI) gain is obtained on average —see table 3.15 for a detailed view across consensus functions and architectures.

Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             90      12.9    80.2    50.1    107.1   87.7    24.6
RHCA             78.8    16.3    9.2     96.4    94.7    90.6    33.3
DHCA             73.7    11.6    5.6     53.1    83.7    72.2    67.6

Table 3.13: Relative percentage φ (NMI) gain between the consensus clustering solution and the median cluster ensemble component.

As a conclusion, we can see that the application of consensus clustering processes on a collection of partitions of a given data collection provides a means for obtaining a summarized clustering that, although rarely better than the best component available in the cluster ensemble, is fairly often substantially better than the median data partition. However, despite these fairly good results, we aim to obtain clustering solutions more robust to the inherent indeterminacies of clustering (i.e. closer to, or even better than, the maximum quality cluster ensemble component). For this reason, chapter 4 introduces what we call consensus self-refining procedures, which aim to improve the quality of the consensus clustering solutions obtained from either hierarchical or flat consensus architectures.

3.5 Discussion<br />

Our proposal for building clustering systems that behave robustly in the face of the indeterminacies inherent to unsupervised classification problems relies on applying consensus clustering processes on large cluster ensembles created by the use of multiple mutually crossed diversity factors.

In the consensus clustering literature, however, relatively few works face the problem of combining large numbers of clusterings, as most authors tend to employ rather small cluster ensembles for evaluating their proposals. Yet the application of certain consensus clustering approaches in computationally demanding scenarios can be difficult. Typical examples of this include consensus functions based on object co-association measures that become inapplicable on large data collections, or clustering combiners not executable on large cluster ensembles if their complexity increases quadratically with the number of components in the ensemble.


Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             0       0       0       0       0       0.2     0
RHCA             0.4     0       0       0       1.2     0.1     0
DHCA             0       0       0       0       0       0.3     0

Table 3.14: Percentage of experiments in which the consensus clustering solution is better than the best cluster ensemble component.

Consensus        Consensus function
architecture     CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat             –       –       –       –       –       0.07    –
RHCA             0.85    –       –       –       0.8     0.07    –
DHCA             –       –       –       –       –       1.1     –

Table 3.15: Relative percentage φ (NMI) gain between the consensus clustering solution and the best cluster ensemble component.

To our knowledge, most previous proposals addressing this problem deal with subsampling strategies as a means for reducing the computational complexity of consensus processes. That is, if the clustering combination task becomes more costly as the number of objects in the data set and/or the number of cluster ensemble components grows, a natural solution consists in applying the consensus clustering process on a reduced version of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007) and/or of the cluster ensemble (Greene and Cunningham, 2006), created by means of a suitable subsampling procedure. Once the consensus process is completed on the reduced data set (or cluster ensemble), it is extended to those entities (objects or cluster ensemble components) that have been left out of the mentioned subset. Whereas reducing the size of the data collection and/or the cluster ensemble subject to the consensus clustering process entails an automatic reduction of the time complexity, one should take into account the cost associated with the subsampling and extension processes, which is often linear with the size of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007).

In contrast, our hierarchical consensus architecture proposals are based on reducing the time complexity of consensus processes without discarding any of the objects in the data set or any of the cluster ensemble components. By means of a divide-and-conquer approach (Dasgupta, Papadimitriou, and Vazirani, 2006), we break a single consensus clustering problem into multiple smaller problems, which gives rise to hierarchical consensus architectures that make it possible to achieve substantial computational time savings, especially in high diversity scenarios —i.e. the ones we might find ourselves in if the strategy of using multiple mutually crossed diversity factors for creating large cluster ensembles is followed.
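The following schematic sketch illustrates the divide-and-conquer idea in its simplest form: mini-ensembles of a fixed size are combined stage by stage until a single consensus labeling remains. It is only a simplified illustration of the principle, not the exact DHCA or RHCA construction described earlier in the chapter, and `consensus_function` stands for any of the consensus functions used in this work (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD or SLSAD).

```python
# Simplified sketch of a multi-stage (hierarchical) consensus pass; it is not
# the exact DHCA/RHCA construction. `consensus_function` maps a list of
# labelings to a single consensus labeling and is assumed to be given.
from typing import Callable, List, Sequence

Labeling = List[int]

def hierarchical_consensus(ensemble: Sequence[Labeling],
                           consensus_function: Callable[[Sequence[Labeling]], Labeling],
                           mini_ensemble_size: int) -> Labeling:
    if mini_ensemble_size < 2:
        raise ValueError("mini-ensembles must contain at least two clusterings")
    b = mini_ensemble_size
    current = list(ensemble)
    # Each stage combines consecutive mini-ensembles of size b into one labeling
    # (a trailing group may contain fewer than b labelings), shrinking the
    # ensemble until a single consensus labeling is left.
    while len(current) > 1:
        current = [consensus_function(current[i:i + b])
                   for i in range(0, len(current), b)]
    return current[0]
```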

As far as we know, the use of divide-and-conquer approaches to the consensus clustering problem has only been reported in (He et al., 2005) as a means for clustering data sets that contain both numeric and categorical attributes. This proposal consists of dividing the original data collection into a purely numeric and a purely categorical subset, conducting


separate clustering processes on each type of features (employing well-established clustering algorithms designed for that purpose), and subsequently combining the resulting clustering solutions by means of consensus functions. Thus, this divide-and-conquer consensus clustering proposal is aimed at dealing with objects composed of multi-type features, rather than at reducing the overall time complexity of consensus processes.

In this chapter, two versions of hierarchical consensus architectures have been proposed.<br />

In each of them, one of the two factors that define the topology of the architecture is fixed in advance, i.e. the number of stages (in deterministic HCA) or the size of the mini-ensembles

(in random hierarchical consensus architectures). Structuring the whole consensus clustering<br />

task as a set of partial consensus processes that take place in successive stages gives<br />

the user the chance to apply different consensus functions across the hierarchy —a possibility<br />

that, to our knowledge, remains unexplored. Moreover, the decomposition of a classic<br />

one-step problem into a set of smaller instances of the same problem naturally allows its parallelization<br />

—provided that sufficient computational resources are available. At this point,<br />

we would like to highlight the fact that, though posed in the context of the robust clustering<br />

problem, hierarchical consensus architectures are applicable to any consensus clustering<br />

task involving large cluster ensembles.<br />

From a practical perspective, we have presented a simple running time estimation<br />

methodology that, for a given consensus clustering problem, allows a fast and fairly accurate prediction of the computationally optimal consensus architecture. However,

the reasonably good performance of the proposed methodology could be further improved<br />

by means of a more complex (probably statistical) modeling of the consensus running times<br />

which constitute the basis of the estimation.<br />

Based on these predictions, we have presented an experimental study in which the flat<br />

and the fastest hierarchical consensus architectures are, firstly, compared in terms of their<br />

execution time. Such a comparison has taken into account the most and least computationally

costly HCA implementations (i.e. fully serial and parallel), so as to provide a notion of the<br />

upper and lower bounds of the time complexity of hierarchical consensus architectures.<br />

One of the most expected conclusions drawn from the conducted experiments is that the

computational optimality of a given consensus architecture is local to the consensus function<br />

F employed for combining the clusterings. In particular, as far as the execution time of<br />

hierarchical consensus architectures is concerned, the main issue to take into account is<br />

the dependence between the time complexity of F and the size of the mini-ensembles upon<br />

which consensus is conducted. For instance, the use of consensus functions whose complexity scales quadratically with the number of clusterings upon which consensus is created (e.g.

MCLA) clearly favours hierarchical consensus architectures. In contrast, flat consensus<br />

is more efficient than the fastest serial hierarchical consensus architectures even in high<br />

diversity scenarios when consensus functions such as EAC are employed.<br />

Besides analyzing their computational aspects, we have also compared hierarchical and<br />

flat consensus architectures in terms of the quality of the consensus clustering solutions<br />

they yield. In this sense, inter-consensus architecture variability is highly dependent on the<br />

characteristics of the cluster ensemble and the consensus function employed. For instance,<br />

hierarchical and flat consensus architectures based on the CSPA, EAC, ALSAD and SLSAD<br />

consensus functions yield consensus clusterings of fairly similar quality, whereas greater variability is observed when the remaining consensus functions are used. Moreover, in general

terms, we have observed that consensus architectures based on EAC and HGPA typically<br />


yield lower quality consensus clustering solutions when compared to the other consensus<br />

functions.<br />

Thus, in some sense, we face a further indeterminacy, this time regarding which consensus function to apply. However, this indeterminacy can be overcome by taking advantage of the capability of creating several consensus clustering solutions by means of multiple consensus functions in computationally optimal time, and subsequently applying a supraconsensus function that selects the highest quality consensus clustering solution in a fully unsupervised manner, as proposed in (Strehl and Ghosh, 2002).

Besides this use, supraconsensus strategies constitute a basic ingredient of the consensus self-refining procedure presented in the next chapter, which is oriented towards improving the quality of consensus clustering solutions as a means for building robust clustering systems upon consensus clustering processes.

3.6 Related publications<br />

Our first approach to hierarchical consensus architectures dealt with deterministic HCA (Sevillano et al., 2007a), although it was solely focused on the analysis of the quality of the consensus clusterings obtained, not on their computational aspects. The details of this publication, presented as a poster at the ECIR 2007 conference held in Rome, are described next.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />

Title: A Hierarchical Consensus Architecture for Robust Document Clustering<br />

In: Proceedings of 29th European Conference on Information Retrieval (ECIR 2007)<br />

Publisher: Springer<br />

Series: Lecture Notes in Computer Science<br />

Volume: 4425<br />

Editors: Giambattista Amati, Claudio Carpineto and Giovanni Romano<br />

Pages: 741-744<br />

Year: 2007<br />

Abstract: A major problem encountered by text clustering practitioners is the difficulty<br />

of determining a priori which is the optimal text representation and clustering

technique for a given clustering problem. As a step towards building robust document<br />

partitioning systems, we present a strategy based on a hierarchical consensus clustering<br />

architecture that operates on a wide diversity of document representations and<br />

partitions. The conducted experiments show that the proposed method is capable of<br />

yielding a consensus clustering that is comparable to the best individual clustering<br />

available even in the presence of a large number of poor individual labelings, outperforming<br />

classic non-hierarchical consensus approaches in terms of performance and<br />

computational cost.<br />



Chapter 4

Self-refining consensus architectures

As described in chapter 3, our proposal for building clustering systems robust to the inherent<br />

indeterminacies that affect the clustering problem consists of i) creating a cluster ensemble<br />

E composed of a large number of individual partitions generated by the use of as many<br />

diversity factors (e.g. clustering algorithms, object representations, etc.) as possible, and ii)

deriving a unique clustering solution λc upon that cluster ensemble through the application<br />

of a consensus clustering process.<br />

As mentioned earlier, the use of such a large cluster ensemble entails two negative consequences.<br />

The first one refers to the fact that the construction of the consensus clustering<br />

solution can become costly or even unfeasible, as the space and time complexity of consensus<br />

functions scales up linearly or even quadratically with the size of the cluster ensemble.<br />

In order to overcome such difficulty, in chapter 3 we put forward the concept of hierarchical<br />

consensus architectures, which are based on applying a divide-and-conquer approach<br />

to consensus clustering. Moreover, by means of a simple running time estimation methodology,<br />

the user is capable of deciding a priori, with a notable degree of accuracy, which

is the most computationally efficient consensus architecture for solving a given consensus<br />

clustering problem.<br />

The other main downside to the use of large cluster ensembles is the negative bias induced<br />

on the quality of the consensus clustering solution λc by the expected presence

of poor1 individual clusterings in E, caused by the somewhat indiscriminate generation of<br />

cluster ensemble components that our proposal indirectly encourages. In order to overcome<br />

this inconvenience, we propose a simple consensus self-refining process that, in a fully unsupervised<br />

manner, makes it possible to improve the quality of the derived consensus clustering solution

λc. Moreover, an additional benefit derived from this automatic consensus refining procedure<br />

is the uniformization of the quality of the consensus clustering solutions yielded by<br />

distinct consensus architectures, which allows selecting the most appropriate one based on<br />

1 By good quality clustering solutions we refer to those partitions that reflect the true group structure<br />

of the data. Provided that we evaluate our clustering results by means of an external cluster validity index<br />

–normalized mutual information (φ (NMI) ) with respect to the ground truth, i.e. an allegedly correct group<br />

structure of the data–, the highest quality clustering results will be those attaining a φ (NMI) close to 1,<br />

whereas the φ (NMI) values associated with poor quality partitions will tend to zero, as φ (NMI) ∈ [0, 1] by

definition (Strehl and Ghosh, 2002).<br />


computational efficiency criteria alone. While put forward in a hard clustering scenario, this proposal could be exported to a fuzzy context by introducing several minor modifications.

This chapter is organized as follows: section 4.1 describes the proposed self-refining consensus procedure. Next, several experiments regarding the application of the self-refining process on the consensus clustering solutions output by the three types of consensus architectures described in the previous chapter are presented in section 4.2. An alternative procedure based on cluster ensemble component selection for obtaining refined consensus clustering solutions upon a given cluster ensemble is described in section 4.3, and finally, section 4.4 closes the chapter with a discussion and conclusions.

4.1 Description of the consensus self-refining procedure<br />

The proposed approach for refining the quality of the consensus clustering solution λc is<br />

pretty straightforward, and it is based on the notion of average normalized mutual information<br />

φ (ANMI) (Strehl and Ghosh, 2002) between a cluster ensemble E and a consensus<br />

clustering solution λc built upon it, as defined by equation (4.1).<br />

\phi^{(\mathrm{ANMI})}(\mathbf{E}, \lambda_c) \;=\; \frac{1}{l} \sum_{i=1}^{l} \phi^{(\mathrm{NMI})}(\lambda_i, \lambda_c) \qquad (4.1)

where l represents the number of partitions (or components) contained in the cluster ensemble<br />

E and λi is the ith of these components.<br />

The higher φ (ANMI)(E, λc), the more information the consensus clustering solution λc shares with all the clusterings in E, and thus the better it can be considered to capture the information contained in the ensemble. In fact, the computation of the φ (ANMI) between a given cluster ensemble E and a set of consensus clustering solutions obtained by means of different consensus functions is proposed in (Strehl and Ghosh, 2002) as a means for choosing among them in an unsupervised fashion, giving rise to what is called a supraconsensus function.
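A minimal sketch of equation (4.1) is given below. It assumes scikit-learn's NMI with geometric-mean normalization as the φ (NMI) implementation, and `ensemble` and `consensus` are placeholder names for the list of l labelings and the consensus labeling λc.

```python
# Minimal sketch of equation (4.1); names are placeholders.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(ensemble, consensus):
    """Average NMI between the components of a cluster ensemble and a consensus labeling."""
    # Geometric-mean normalization, matching Strehl and Ghosh (2002).
    return float(np.mean([
        normalized_mutual_info_score(component, consensus, average_method="geometric")
        for component in ensemble
    ]))
```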

It is important to notice that each term of the summation in equation (4.1) measures<br />

the resemblance between each cluster ensemble component and λc. As a consequence, those<br />

cluster ensemble components more similar to the consensus clustering solution contribute<br />

in a greater proportion to the sum in equation (4.1).<br />

Assuming that the consensus function F applied for obtaining the consensus clustering solution λc delivers a moderately good performance –in the sense that the quality of λc will be reasonably higher than that of the poorest components of the cluster ensemble E–, then the normalized mutual information φ (NMI) between λc and each cluster ensemble component λi (∀i ∈ [1, ..., l]) gives an approximate measure of the quality of the latter (Fern and Lin, 2008).

Taking this fact into account, the proposed consensus self-refining procedure is based on ranking the l components of the cluster ensemble E according to their φ (NMI) with respect to the consensus clustering solution λc. The result of this sorting process is represented by means of an ordered list O_{φ(NMI)} = {λ_{φ(NMI)1}, λ_{φ(NMI)2}, ..., λ_{φ(NMI)l}}, the subindices of which refer to the aforementioned φ (NMI) based ranking, i.e. λ_{φ(NMI)1} denotes the cluster ensemble component that attains the highest normalized mutual information with respect to λc,




λ_{φ(NMI)2} is the component with the second highest φ (NMI) with respect to the consensus clustering solution, and so on.

Subsequently, the highest ranked p percent of the l individual partitions is selected so as to form a select cluster ensemble Ep –see equation (4.2)–, upon which a refined consensus clustering solution λ_{c_p} will be derived through the application of the consensus function F. Notice that the larger the percentage p, the more components are included in the select cluster ensemble Ep —ultimately, Ep = E if p = 100.

\mathbf{E}_p = \left( \lambda_{\phi^{(\mathrm{NMI})}1},\; \lambda_{\phi^{(\mathrm{NMI})}2},\; \ldots,\; \lambda_{\phi^{(\mathrm{NMI})}\lfloor\frac{p}{100}l\rceil} \right)^{\mathrm{T}} \qquad (4.2)

Following the rationale of the proposed self-refining procedure, it can be assumed that, with a high probability, the worst components of the initial cluster ensemble E will have been excluded from Ep. Therefore, the self-refined consensus clustering solution λ_{c_p} obtained through the application of the consensus function F on Ep will probably improve on the initial consensus labeling λc, as we will experimentally demonstrate in later sections.

Finally, three additional remarks so as to conclude this description. Firstly, notice that the consensus process run on the select cluster ensemble Ep can be conducted following either a flat or a hierarchical approach, depending on the consensus function applied, the characteristics of the data set and the value of p, which, as mentioned above, determines the size of Ep. As reported in the previous chapter, the proposed running time estimation methodologies constitute an efficient means for deciding whether the self-refined consensus solution λ_{c_p} should be derived following either a flat or a hierarchical consensus architecture.

Secondly, notice that the proposed consensus self-refining process is entirely automatic<br />

and unsupervised (hence its name), as it is solely based on the cluster ensemble E, the<br />

consensus clustering solution λc and a similarity measure –φ (NMI) – that requires no external<br />

knowledge for its computation. The only user-driven decision refers to the selection of the<br />

value of the percentage p used for creating the select cluster ensemble Ep.<br />

The third remark deals precisely with this latter issue. Quite obviously, the selection of the percentage p is made blindly. So as to avoid the negative consequences of choosing a suboptimal value of p at random, our consensus self-refining proposal is completed by the (possibly parallelized) creation of multiple refined consensus clustering solutions using P distinct percentage values p = {p1, p2, ..., pP}, i.e. λ_{c_{p_i}} for i = 1, 2, ..., P, selecting as the final refined consensus clustering solution λ_c^{final} the one maximizing φ (ANMI) with respect to the cluster ensemble E, as defined by equation (4.3) —in fact, this unsupervised a posteriori clustering selection process is equivalent to the supraconsensus function proposed in (Strehl and Ghosh, 2002).

$$
\lambda_c^{final} = \operatorname*{arg\,max}_{\lambda}\ \phi^{(ANMI)}(\mathrm{E}, \lambda), \qquad \lambda \in \{\lambda_c, \lambda_{c_{p_1}}, \ldots, \lambda_{c_{p_P}}\} \qquad (4.3)
$$
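The supraconsensus selection of equation (4.3) admits an equally small sketch. This is an illustrative fragment under the same assumptions as the previous one (it reuses the hypothetical phi_nmi helper), with φ(ANMI) computed as the average φ(NMI) of a candidate labeling against every ensemble component.

```python
def phi_anmi(ensemble, labels):
    # phi(ANMI): average phi(NMI) of `labels` against every component of the cluster ensemble.
    return sum(phi_nmi(component, labels) for component in ensemble) / len(ensemble)

def supraconsensus(ensemble, candidates):
    """Select, among candidate labelings, the one maximizing phi(ANMI) w.r.t. the ensemble E."""
    return max(candidates, key=lambda labels: phi_anmi(ensemble, labels))
```

Here, candidates would contain the non-refined solution λc together with the refined solutions λcp1, ..., λcpP.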

For summarization purposes, table 4.1 describes the steps that constitute the proposed consensus self-refining procedure.


1. Given a cluster ensemble E containing l components:

$$
\mathrm{E} = \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_l \end{pmatrix}
$$

and a pre-computed consensus clustering solution λc, compute the φ(NMI) between λc and each of the components of the cluster ensemble, that is:

$$
\phi^{(NMI)}(\lambda_c, \lambda_k), \quad \forall k = 1, \ldots, l
$$

2. Generate an ordered list Oφ(NMI) = {λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)l} of cluster ensemble components ranked in decreasing order according to their φ(NMI) with respect to λc.

3. Create a set of P select cluster ensembles Epi by compiling the first ⌊(pi/100) l⌉ components of Oφ(NMI):

$$
\mathrm{E}_{p_i} = \begin{pmatrix} \lambda_{\phi^{(NMI)}1} \\ \lambda_{\phi^{(NMI)}2} \\ \vdots \\ \lambda_{\phi^{(NMI)}\lfloor\frac{p_i}{100}l\rceil} \end{pmatrix}, \quad p_i \in (0, 100), \ \forall i = 1, \ldots, P
$$

4. Run a (flat or hierarchical) consensus architecture based on a consensus function F on Epi, obtaining a self-refined consensus clustering solution λcpi.

5. Apply the supraconsensus function on the non-refined consensus clustering solution λc and the set of self-refined consensus clustering solutions λcpi, i.e. select as the final consensus solution the one maximizing its φ(ANMI) with respect to the cluster ensemble E.

Table 4.1: Methodology of the consensus self-refining procedure.
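Putting the two previous sketches together, the whole methodology of table 4.1 reduces to a short driver loop. The consensus function F is kept abstract: run_consensus below is a hypothetical placeholder for any of the consensus functions of appendix A.5, not an actual implementation.

```python
def self_refine(ensemble, run_consensus,
                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90)):
    """Consensus-based self-refining (table 4.1), returning lambda_c^final."""
    # Non-refined consensus clustering lambda_c on the full ensemble (assumed pre-computable).
    consensus_labels = run_consensus(ensemble)
    candidates = [consensus_labels]
    for p in percentages:
        # Steps 1-3: rank components by phi(NMI) w.r.t. lambda_c and keep the top p%.
        selected = select_cluster_ensemble(ensemble, consensus_labels, p)
        # Step 4: flat (or hierarchical) consensus on the select cluster ensemble E_p.
        candidates.append(run_consensus(selected))
    # Step 5: unsupervised selection of the final solution via supraconsensus.
    return supraconsensus(ensemble, candidates)
```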

4.2 Flat vs. hierarchical self-refining

In this section, we present several experiments regarding the application of the consensus self-refining procedure described in section 4.1 on the consensus clustering solutions output by the three consensus architectures described in chapter 3. In all cases, our main interest is focused on analyzing the quality of the clusterings obtained by the proposed self-refining procedure, not on evaluating the computational aspects of the self-refining process, as the decision regarding whether it is implemented according to a hierarchical or a flat consensus architecture can be efficiently made using the running time estimation methodologies proposed in chapter 3.

The experiments conducted follow the design described next.



– What do we want to measure?

i) The quality of the self-refined consensus clusterings obtained by the proposed methodology when applied on the consensus clusterings output by the flat and allegedly fastest RHCA and DHCA consensus architectures.

ii) The ability of the proposed self-refining procedure to obtain a consensus clustering of higher quality than that of a) its non-refined counterpart, and b) the highest and median quality cluster ensemble components.

iii) The quality of the self-refined consensus clustering of maximum quality compared to its non-refined counterpart.

iv) The ability of the supraconsensus function to select, in a fully unsupervised manner, the highest quality self-refined consensus clustering among the set of self-refined clusterings generated.

v) Whether self-refining constitutes a means for uniformizing the quality of the consensus clustering solutions yielded by the flat and allegedly fastest RHCA and DHCA consensus architectures, thus making it possible to decide, on computational grounds only, which is the most suitable consensus architecture for a given clustering problem.

– How do we measure it?

i) The quality of the self-refined consensus clusterings is measured in terms of the φ(NMI) with respect to the ground truth of each data collection.

ii) The percentage of experiments in which the proposed self-refining procedure gives rise to at least one self-refined consensus clustering of higher quality than that of a) its non-refined counterpart, and b) the highest and median quality cluster ensemble components.

iii) We measure the relative φ(NMI) percentage difference between the self-refined consensus clustering of maximum quality and its non-refined counterpart (a minimal code sketch of this and the next measurement is given right after this list).

iv) The precision of the supraconsensus function is measured in terms of the percentage of experiments in which it manages to select the highest quality self-refined consensus clustering.

v) We compare the average variance between the φ(NMI) scores of the consensus clusterings λc yielded by the three evaluated consensus architectures (i.e. prior to self-refining) with the variance between the φ(NMI) scores of the consensus clusterings selected by the supraconsensus function (λc^final) after the self-refining procedure is conducted.

– How are the experiments designed? We only analyze the results of the consensus self-refining process executed on the highest diversity scenario (i.e. the one where cluster ensembles are created by applying the |dfA| = 28 clustering algorithms from the CLUTO clustering package). The reason for this is twofold: besides brevity, this limitation on our analysis prevents the results of the self-refining process from being masked by the consensus quality variability observed in lower diversity scenarios —recall that, in those cases, the quality of the consensus clustering solutions shows larger variances, as the cluster ensemble changes from experiment to experiment due to the random selection of |dfA| = {1, 10, 19} clustering algorithms, whereas exactly the same cluster ensemble is employed across the ten experiments in the highest diversity scenario. As in all the experimental sections of this thesis, consensus processes have been replicated using the set of seven consensus functions described in appendix A.5, namely: CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD. Results are averaged across ten independent experiments consisting of ten consensus function runs each. With the objective of analyzing the expected dependence between the degree of refinement of the consensus clustering solution and the percentage p of cluster ensemble components included in the select cluster ensemble Ep, the experiments have been replicated for a set of percentage values in the range p ∈ [2, 90]. Subsequently, the final consensus label vector λc^final is selected among all the available (i.e. non-refined and refined) consensus clustering solutions through the application of the supraconsensus function presented in equation (4.3). Lastly, it is important to state that, although it is possible (and, in fact, advisable) to apply either flat or hierarchical consensus on the select cluster ensemble depending on which is the most computationally efficient option, all self-refining consensus processes in our experiments have been conducted, for simplicity, using a flat consensus architecture.

– How are results presented? Results are presented by means of boxplot charts of the φ(NMI) values corresponding to the consensus self-refining process. In particular, each subfigure depicts –from left to right– the φ(NMI) values of: i) the components of the cluster ensemble E, ii) the non-refined consensus clustering solution (i.e. the one resulting from the application of either a hierarchical or a flat consensus architecture, denoted as λc), and iii) the self-refined consensus labelings λcpi obtained upon select cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}. Moreover, the consensus clustering solution deemed as the optimal one (across a majority of experiment runs) by the supraconsensus function is identified by means of a vertical green dashed line. Finally, the quality comparisons between the self-refined consensus clusterings, the non-refined consensus clusterings and the cluster ensemble components are presented by means of tables showing the average values of the measured magnitudes.

– Which data sets are employed? These experiments span the twelve unimodal data collections employed in this work. For brevity reasons, and following the presentation scheme of the previous chapter, this section only describes in detail the results of the self-refining procedure obtained on the Zoo data set, deferring the portrayal of the results obtained on the remaining data collections to appendix D.1. However, the global evaluation of the self-refining and the supraconsensus processes encompasses the results obtained on the twelve unimodal collections employed in this work.
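As announced in the list above, the following minimal sketch illustrates how two of these measurements can be computed for a single experiment (illustrative names only; the φ(NMI) scores with respect to the ground truth are assumed to be available solely for evaluation purposes).

```python
def relative_gain_pct(refined_scores, non_refined_score):
    # Relative phi(NMI) percentage gain of the best self-refined solution over the non-refined one.
    best = max(refined_scores)
    return 100.0 * (best - non_refined_score) / non_refined_score

def supraconsensus_hit(candidate_scores, selected_index):
    # True when the unsupervised supraconsensus choice coincides with the top quality candidate.
    return candidate_scores[selected_index] == max(candidate_scores)
```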

Figure 4.1 presents the boxplot charts of the φ(NMI) values corresponding to the consensus self-refining process applied on the Zoo data set. Notice that figure 4.1 is organized into three columns of subfigures, each one of which corresponds to one of the three consensus architectures, i.e. flat, RHCA and DHCA.

[Figure 4.1 appears here: seven rows of boxplot panels (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), one column per architecture: (a) flat, (b) RHCA, (c) DHCA; vertical axis: φ(NMI).]
Figure 4.1: φ(NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λcpi output by the flat, RHCA, and DHCA consensus architectures on the Zoo data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

Pretty varied results can be observed in figure 4.1, as regards the performance of both the self-refining process itself and the supraconsensus selection function. For instance, when the consensus clustering solution output by the flat consensus architecture using the CSPA consensus function is subject to self-refining (see the leftmost boxplot on the top row of figure 4.1), we can observe that two of the refined solutions yield clearly higher φ(NMI) values than their non-refined counterpart —in particular, the ones obtained using 30% and 60% of the whole cluster ensemble, i.e. λc30 and λc60. Moreover, notice that supraconsensus selects λc30 as the final consensus clustering solution (that is, it performs correctly).

In other cases, supraconsensus fails to select the highest quality consensus clustering solution. See, for instance, that supraconsensus selects the non-refined consensus clustering solution λc yielded by the flat consensus architecture based on the EAC consensus function as the optimal one, whereas the refined clusterings λc40, λc50 and λc60 attain higher φ(NMI) values (leftmost boxplot chart on the second row of figure 4.1). Furthermore, notice that in a minority of cases, the self-refining procedure introduces little or no improvement, as when it is applied on the consensus solution output by the RHCA using the ALSAD consensus function —central column boxplot on the fifth row of figure 4.1.

Lastly, notice that the boxplot corresponding to the refining of the consensus clustering solution output by the flat consensus architecture using MCLA –leftmost boxplot on the fourth row– only presents the φ(NMI) values corresponding to the cluster ensemble E. This is due to the fact that, for this particular consensus function and diversity scenario, flat consensus is not executable with our computational resources —see appendix A.6. Moreover, as all self-refining consensus processes in our experiments have been conducted using a flat consensus architecture, the self-refined consensus clustering solutions output by RHCA and DHCA are not computed from λc40 onwards due to memory limitations when using the MCLA consensus function. However, recall from chapter 3 that hierarchical consensus architectures would allow the computation of consensus clustering solutions in situations where flat consensus is not executable.

A deeper and more quantitative evaluation of the proposed consensus self-refining procedure requires analyzing two of its facets. Firstly, it is necessary to evaluate the self-refining process in itself, answering questions such as: i) how often does the self-refining process yield a higher quality consensus clustering solution than the non-refined one? ii) to what extent are the top quality self-refined consensus clustering solutions better than their non-refined counterpart? iii) how do the best self-refined consensus clustering solutions compare to the cluster ensemble components? and iv) does the self-refining procedure reduce the differences between the quality of the consensus clustering solutions output by distinct consensus architectures? The answers to these questions are presented in section 4.2.1.

And secondly, given a set of self-refined consensus clustering solutions, a supraconsensus function capable of blindly selecting the highest quality self-refined solution is required. Its performance can be evaluated in terms of i) the percentage of occasions in which the supraconsensus function selects the highest quality consensus solution, and ii) the degree of quality loss due to the supraconsensus selection of suboptimal consensus clustering solutions. These aspects are evaluated in section 4.2.2.

4.2.1 Evaluation of the consensus-based self-refining process

As regards the evaluation of the self-refining process, we have firstly analyzed the percentage of self-refining experiments in which at least one of the self-refined consensus clustering solutions attains a φ(NMI) with respect to the ground truth that is higher than the one achieved by the consensus clustering solution available prior to self-refinement. The results presented in table 4.2, which correspond to an average across all the data sets for each consensus architecture and consensus function, reveal that the proposed self-refining procedure performs pretty successfully, giving rise to at least one self-refined consensus clustering solution that improves the consensus clustering available prior to refining in 83% of the experiments conducted on average.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            90.4    63.6    70      77.25   80      76.9    90
RHCA            85.7    83.2    97      94.2    73.4    76.5    82.4
DHCA            91.1    78.5    98      88.5    89.8    88.1    67.8

Table 4.2: Percentages of self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA        MCLA    ALSAD   KMSAD   SLSAD
flat            16.5    273.3   53.3        14.9    9.8     16.1    433.4
RHCA            10.9    157.1   294779.8    200.8   14.1    15.1    205.6
DHCA            24.5    66.4    152450.9    79.9    38.4    30.9    224.9

Table 4.3: Relative φ(NMI) gain percentage of the top quality self-refined consensus clustering solutions with respect to their non-refined counterparts, averaged across the twelve data collections.

Moreover, we have also computed the relative φ(NMI) percentage gain between the non-refined and the top quality self-refined consensus clustering solution —considering only those experiments where self-refining yields a better clustering solution, i.e. 83% of the total. The results presented in table 4.3, which again correspond to an average across all the data sets for each consensus architecture and consensus function, reveal that the proposed self-refining procedure performs in an overwhelmingly successful manner, giving rise to an average relative percentage φ(NMI) gain of 21386% across all the experiments conducted. This exceptionally large figure is due to the fact that, although seldom, extremely poor quality consensus clustering solutions are available prior to self-refining in some cases. In particular, this situation is found when the HGPA consensus function is employed for refining the consensus clustering solutions yielded by hierarchical consensus architectures on the WDBC and BBC data collections (see, for instance, figure D.10 in appendix D). Despite being exceptional, this situation introduces a large bias in the averaged values of φ(NMI) gains. However, if this kind of artifact is ignored, relative φ(NMI) gains between 10% and 430% are consistently obtained in all cases, which gives an idea of the suitability of the proposed self-refining procedure for bettering consensus clustering solutions.

Besides comparing the top quality self-refined consensus clustering solution with its non-refined counterpart, we have also contrasted its quality with that of the highest and median φ(NMI) components of the cluster ensemble E, referred to as BEC (best ensemble component) and MEC (median ensemble component), respectively. Using the quality of these two components as a reference, we have evaluated i) the percentage of experiments where the maximum φ(NMI) (either refined or non-refined) consensus clustering solution attains a higher quality than that of the BEC and MEC, and ii) the relative percentage φ(NMI) variation between them and the top quality consensus clustering solution. Again, all the results presented correspond to an average across all the experiments conducted on the twelve unimodal data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            8.3     0       0       0       25      23.9    8.3
RHCA            8.3     0       0       16.7    28.3    23.8    4
DHCA            16.7    0       0.1     16.6    16.7    18.2    8.3

Table 4.4: Percentages of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            2.7     –       –       –       3.5     1.1     0.1
RHCA            2.5     –       –       1.7     4.2     1       0.1
DHCA            3.3     –       2.2     1.4     1.4     1.2     0.8

Table 4.5: Relative percentage φ(NMI) gains between the best (non-refined or self-refined) consensus clustering solution and the best cluster ensemble component, averaged across the twelve data collections.

As regards the first issue, table 4.4 presents the percentage of experiments where the highest quality consensus clustering solution (either refined or non-refined) is better than the BEC (i.e. it attains a φ(NMI) that is higher than that of the cluster ensemble component that best describes the group structure of the data in terms of normalized mutual information with respect to the given ground truth). On average, this happens in 10.6% of the conducted experiments, which is a frequency of occurrence 100 times higher than the one obtained when non-refined clustering solutions were considered (see table 3.14 in chapter 3). Again, this result reveals the notable consensus improvement introduced by the proposed self-refining procedure. Moreover, notice the poor results obtained with the EAC and the HGPA consensus functions, which were already reported to be the worst performing ones in chapter 3.

In addition, the relative percentage φ(NMI) gains between the top quality consensus clustering solution and the BEC are presented in table 4.5, attaining a modest average increase of 1.8%. However, recall that this figure was as low as 0.6% when the non-refined consensus clustering solution was considered (see table 3.15 in section 3.4), which indicates that the consensus self-refining procedure again introduces notable quality improvements.

If this comparison is now referred to the median ensemble component, it can be observed that, on average, the best (non-refined or self-refined) consensus clustering solution attains a φ(NMI) that is higher than that of the cluster ensemble component that has the median normalized mutual information with respect to the given ground truth in 67.7% of the experiments conducted (see table 4.6). Recall that this percentage was 53.1% when the consensus clustering solution prior to self-refining was compared to the MEC —see table 3.12 in section 3.4.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            67.7    41.7    62.1    25      75      75      66.7
RHCA            72.3    46.4    69.8    81.9    82      83.3    66.6
DHCA            83.3    33.3    69.2    88.2    83.3    83.1    66.7

Table 4.6: Percentage of experiments in which the best (non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            107.4   33.1    82.4    91.1    113.8   109.1   73.3
RHCA            96.6    24      98.7    109.5   118.4   114.3   70.4
DHCA            113.3   25.6    100.2   108.8   118.9   114.7   87.8

Table 4.7: Relative percentage φ(NMI) gain between the best (non-refined or self-refined) consensus clustering solution and the median cluster ensemble component, averaged across the twelve data collections.

If the degree of improvement of the best (non-refined or self-refined) consensus clustering solutions that attain a higher φ(NMI) than the MEC is measured in terms of relative percentage φ(NMI) increase, a notable 91% relative φ(NMI) gain is obtained on average —see table 4.7 for a detailed view across consensus functions and architectures. Again, the beneficial effect of self-refining becomes evident if this result is compared to the one obtained from the analysis of the non-refined consensus clustering solution as, in that case, the observed relative φ(NMI) gain was 59% (see table 3.13).

Furthermore, we have also measured the ability of the self-refining procedure to uniformize the quality of the consensus clustering solutions output by the distinct consensus architectures. So as to evaluate this issue, we have computed, for each individual experiment, the average variance of the φ(NMI) values of the non-refined consensus solutions yielded by the RHCA, DHCA and flat consensus architectures —the smaller the variance, the more similar the φ(NMI) values. This procedure has been repeated for the top quality (either refined or non-refined) consensus clustering solutions obtained at each experiment. The results of this analysis are presented in table 4.8. Except for the EAC consensus function, we can observe a notable reduction in the variance between the φ(NMI) of the consensus solutions output by the three considered consensus architectures, keeping it below 0.01 in most cases. In global average terms, variance is dramatically reduced by an approximate factor of 20, from 0.105 to 0.0056. For this reason, it can be conjectured that, besides bettering the quality of consensus clustering solutions as already reported, the proposed self-refining procedure also helps make the quality of the self-refined consensus clustering solution more independent from the consensus architecture employed —so that it can be selected following computational criteria solely.
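A minimal sketch of how this uniformity measure can be computed is shown below (illustrative code, not the scripts actually used in the experiments): the variance of the φ(NMI) scores attained by the flat, RHCA and DHCA architectures is computed within each experiment, and the per-experiment variances are then averaged.

```python
import numpy as np

def mean_cross_architecture_variance(scores_per_experiment):
    """scores_per_experiment: one [phi_flat, phi_rhca, phi_dhca] triplet per experiment."""
    # Variance across the three architectures within each experiment, then averaged over experiments.
    return float(np.mean([np.var(scores) for scores in scores_per_experiment]))

# The smaller the returned value, the more architecture-independent the consensus quality is,
# e.g. mean_cross_architecture_variance([[0.62, 0.60, 0.61], [0.55, 0.57, 0.54]]).
```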

As a conclusion, it can be asserted that the proposed consensus self-refining procedure is reasonably successful since, in general terms, it introduces a quality increase that brings self-refined consensus clustering solutions closer to the best individual components available in the cluster ensemble, which would ultimately constitute the goal of robust clustering systems based on consensus clustering.

It is of paramount importance to notice that, in the analysis of all the previous results, we have assumed the use of the top quality self-refined consensus clustering solution. Quite obviously, achieving the encouraging results reported would require a supraconsensus function that, in an automatic manner, detects the best self-refined consensus clustering in any given situation. The next section is devoted to the performance analysis of such a supraconsensus function.

Consensus                             Consensus function
solution                CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
non-refined             0.011   0.011   0.026   0.638   0.009   0.011   0.029
best non/self-refined   0.004   0.019   0.005   0.002   0.002   0.001   0.006

Table 4.8: φ(NMI) variance of the non-refined and the best non/self-refined consensus clustering solutions across the flat, RHCA and DHCA consensus architectures, averaged across the twelve data collections.

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            30.4    50      53.1    11      25      23.8    37.5
RHCA            25      35.9    38.4    3.9     24.4    24.1    36.6
DHCA            12.5    42      40.1    17      0       29.5    37.5

Table 4.9: Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution, averaged across the twelve data collections.

4.2.2 Evaluation of the supraconsensus process

As regards the performance of the supraconsensus function proposed in (Strehl and Ghosh, 2002), we have firstly evaluated the percentage of experiments in which the supraconsensus function selects the highest quality consensus clustering solution. Table 4.9 presents the results averaged across all the data collections, for each consensus function and architecture. The average accuracy with which the supraconsensus function selects the top quality self-refined consensus clustering solution is 29%, i.e. it manages to select the best solution in less than a third of the experiments conducted.

This somewhat contradicts the optimistic conclusions of (Strehl and Ghosh, 2002), where φ(ANMI)(E, λc) is presented as a suitable surrogate of φ(NMI)(γ, λc) for selecting the best consensus clustering solutions in real scenarios, where a ground truth γ is not available. Such a conclusion was supported by the fact that both φ(ANMI)(E, λc) and φ(NMI)(γ, λc) follow very similar growth patterns (i.e. the higher φ(NMI)(γ, λc), the higher φ(ANMI)(E, λc)). However, such claims were sustained by experiments using synthetic clustering results. In several of our experiments, in contrast, we have witnessed that this behaviour is not always strictly obeyed.

Just for illustration purposes, we have conducted a toy experiment, in which a set of 300 randomly picked cluster ensemble components corresponding to the Zoo data collection have been evaluated in terms of i) their φ(NMI) with respect to the ground truth, and ii) their φ(ANMI) with respect to the 299 remaining clusterings selected. Figure 4.2 depicts both magnitudes, where the horizontal axis of each subfigure corresponds to an index of the clusterings in the ensemble arranged in decreasing order of φ(NMI) with respect to the ground truth.

[Figure 4.2 appears here: (a) decreasingly ordered φ(NMI) (wrt ground truth) and (b) φ(ANMI) values (wrt the toy cluster ensemble), both plotted against the clustering index.]
Figure 4.2: Decreasingly ordered φ(NMI) (wrt ground truth) values of the 300 clusterings included in the toy cluster ensemble (left), and their corresponding φ(ANMI) values (wrt the toy cluster ensemble) (right).

Consensus                          Consensus function
architecture    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
flat            8.8     14.5    17.8    12      6.7     8.9     12.1
RHCA            6.3     26.6    24.7    27.3    8.6     8.4     16.9
DHCA            15.9    27      22.9    19.6    9.4     11      8.7

Table 4.10: Relative percentage φ(NMI) losses due to suboptimal self-refined consensus clustering solution selection by supraconsensus, averaged across the twelve data collections.

Notice how the monotonically decreasing behaviour of φ(NMI) –figure 4.2(a)– is not strictly observed in φ(ANMI) (see figure 4.2(b), where a fifth order fitting red dashed curve is overlaid for comparison). In fact, the clustering attaining the maximum φ(ANMI) is the one with the fiftieth largest φ(NMI). Thus, in practice, φ(ANMI) seems to constitute a means for identifying good clustering solutions, but not the best one. For this reason, requiring φ(ANMI)(E, λc) to select the single self-refined consensus clustering solution of highest quality seems a far too restrictive constraint, which leads to the slightly disappointing results presented in table 4.9.
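The toy experiment itself can be reproduced in a few lines. The sketch below is illustrative and assumes the hypothetical phi_nmi and phi_anmi helpers introduced earlier, a list of label vectors as the toy ensemble and the ground truth labeling; it returns the φ(NMI) rank of the component that maximizes φ(ANMI).

```python
def anmi_argmax_rank(ensemble, ground_truth):
    """Rank (1 = highest phi(NMI)) of the component maximizing phi(ANMI) within the ensemble."""
    nmi_scores = [phi_nmi(ground_truth, component) for component in ensemble]
    # phi(ANMI) of each component with respect to the remaining ones (leave-one-out).
    anmi_scores = [phi_anmi(ensemble[:k] + ensemble[k + 1:], component)
                   for k, component in enumerate(ensemble)]
    best_by_anmi = anmi_scores.index(max(anmi_scores))
    # Position of that component once all components are sorted by decreasing phi(NMI).
    order = sorted(range(len(ensemble)), key=lambda i: nmi_scores[i], reverse=True)
    return order.index(best_by_anmi) + 1  # the text above reports a rank of 50 for the Zoo toy ensemble
```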

In order to evaluate the influence of this apparent lack of precision of the supraconsensus function, we have measured the relative percentage φ(NMI) loss derived from a suboptimal consensus solution selection, using the top quality consensus clustering solution as a reference (i.e. the one that should be selected by an ideal supraconsensus function). The results, which are presented in table 4.10, show that the modest selection accuracy of the supraconsensus function leads to an average relative φ(NMI) loss of 14.9%.

To conclude, it can be asserted that, while the proposed consensus self-refining procedure introduces notable gains as regards the quality of consensus clustering solutions, there is still room for taking full advantage of its performance, as the entirely unsupervised selection of the highest quality consensus solution is not a fully solved problem yet.

4.3 Selection-based self-refining

The consensus self-refining procedure proposed in section 4.1 is based on using a consensus clustering solution as a reference for computing the φ(NMI) of the cluster ensemble components, which in turn guides the creation of the select cluster ensemble Ep, upon which the self-refining process is conducted.

In this section, we propose an alternative procedure for obtaining a self-refined consensus clustering solution. The only difference between this proposal and the one put forward in section 4.1 lies in the fact that the computation of the φ(NMI) of the cluster ensemble components –a step prior to the creation of the select cluster ensemble Ep– is not referred to a previously obtained consensus labeling λc, but to one of the components of the cluster ensemble E.

By doing so, we aim to devise an alternative means for obtaining a high quality clustering from a large cluster ensemble that does not require the execution of any consensus process for obtaining the reference clustering against which the cluster ensemble components are compared in terms of normalized mutual information —with the obvious computational savings this conveys.

Given that this proposed method is based on selecting one of the cluster ensemble components for initiating the consensus self-refining process, we have called it selection-based self-refining, and its constituent steps are presented in table 4.11.

In the next paragraphs, we analyze the performance of this second self-refining proposal, following the same experimental scheme employed in section 4.2. That is, we firstly review the results obtained on the Zoo data collection at a qualitative level (the analysis corresponding to the remaining data collections is presented in appendix D.2), followed by a quantitative study of the quality of the self-refined consensus clustering solutions across all the experiments conducted.

With the objective of making the results of selection-based consensus self-refining comparable to those presented in the previous section, we have followed the same experimental procedure, that is: i) the experiments have been replicated for a set of self-refining percentage values in the range p ∈ [2, 90], and ii) the experiments have been executed on the cluster ensembles corresponding to the highest diversity scenario.

For starters, figure 4.3 depicts the boxplot charts of the φ(NMI) values corresponding to the selection-based consensus self-refining process. Each chart depicts –from left to right– the φ(NMI) values of: i) the components of the cluster ensemble E, ii) the cluster ensemble component with maximum φ(ANMI) with respect to the whole ensemble, i.e. λref, and iii) the self-refined consensus clusterings λcpi obtained upon select cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}.

Firstly, we can notice the high quality of the selected cluster ensemble component λref, the φ(NMI) of which is pretty close to that of the highest quality component of the cluster ensemble E. Thus, it seems that the proposed selection procedure constitutes, by itself, a fairly good approach for obtaining clustering solutions that are robust to the inherent indeterminacies of the clustering problem. When the self-refining procedure is applied on the select cluster ensemble created upon λref, distinct performances are observed. Whereas in some cases none of the self-refined clustering solutions λcpi attains higher φ(NMI) values than λref (see, for instance, figure 4.3(a)), the opposite is observed when self-refining based on the EAC and SLSAD consensus functions is applied on λref —see figures 4.3(b) and 4.3(g), respectively.

1. Given a cluster ensemble E containing l components:

$$
\mathrm{E} = \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_l \end{pmatrix}
$$

compute the φ(ANMI) between each one of them and the cluster ensemble, that is:

$$
\phi^{(ANMI)}(\mathrm{E}, \lambda_k) = \frac{1}{l}\sum_{i=1}^{l}\phi^{(NMI)}(\lambda_i, \lambda_k), \quad \forall k = 1, \ldots, l
$$

2. Select the cluster ensemble component that maximizes its φ(ANMI) with respect to the whole ensemble as the reference for the self-refining process:

$$
\lambda_{ref} = \operatorname*{arg\,max}_{\lambda_k}\ \phi^{(ANMI)}(\mathrm{E}, \lambda_k)
$$

3. Compute the φ(NMI) between λref and each of the components of the cluster ensemble, that is:

$$
\phi^{(NMI)}(\lambda_{ref}, \lambda_k), \quad \forall k = 1, \ldots, l
$$

4. Generate an ordered list Oφ(NMI) = {λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)l} of cluster ensemble components ranked in decreasing order according to their φ(NMI) with respect to λref.

5. Create a set of P select cluster ensembles Epi by compiling the first ⌊(pi/100) l⌉ components of Oφ(NMI):

$$
\mathrm{E}_{p_i} = \begin{pmatrix} \lambda_{\phi^{(NMI)}1} \\ \lambda_{\phi^{(NMI)}2} \\ \vdots \\ \lambda_{\phi^{(NMI)}\lfloor\frac{p_i}{100}l\rceil} \end{pmatrix}, \quad p_i \in (0, 100), \ \forall i = 1, \ldots, P
$$

6. Run a (flat or hierarchical) consensus architecture based on a consensus function F on Epi, obtaining a self-refined consensus clustering solution λcpi.

7. Apply the supraconsensus function on the selected cluster ensemble component λref and the set of self-refined consensus clustering solutions λcpi, i.e. select as the final consensus solution the one maximizing its φ(ANMI) with respect to the cluster ensemble E:

$$
\lambda_c^{final} = \operatorname*{arg\,max}_{\lambda}\ \phi^{(ANMI)}(\mathrm{E}, \lambda), \quad \lambda \in \{\lambda_{ref}, \lambda_{c_{p_1}}, \ldots, \lambda_{c_{p_P}}\}
$$

Table 4.11: Methodology of the cluster ensemble component selection-based consensus self-refining procedure.
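Since the selection-based variant only changes the origin of the reference partition, a minimal sketch of steps 1 to 7 of table 4.11 can reuse the hypothetical helpers introduced in the previous sections (phi_anmi, select_cluster_ensemble and supraconsensus); run_consensus again stands for an arbitrary consensus function F and is not an actual implementation.

```python
def selection_based_self_refine(ensemble, run_consensus,
                                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90)):
    """Selection-based self-refining (table 4.11), returning lambda_c^final."""
    # Steps 1-2: the reference is the component maximizing phi(ANMI); no consensus run is needed.
    reference = max(ensemble, key=lambda labels: phi_anmi(ensemble, labels))
    candidates = [reference]
    for p in percentages:
        # Steps 3-5: rank components by phi(NMI) w.r.t. lambda_ref and keep the top p%.
        selected = select_cluster_ensemble(ensemble, reference, p)
        # Step 6: (flat or hierarchical) consensus on the select cluster ensemble E_p_i.
        candidates.append(run_consensus(selected))
    # Step 7: supraconsensus over lambda_ref and the self-refined solutions.
    return supraconsensus(ensemble, candidates)
```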

As in section 4.2, the consensus clustering solution deemed as the optimal one (across a majority of experiment runs) by the supraconsensus function is identified by means of a vertical green dashed line. As regards its performance, we can observe that it manages to select the highest quality clustering solution in all cases except when self-refining is based on the ALSAD and KMSAD consensus functions.

[Figure 4.3 appears here: φ(NMI) boxplot panels, one per consensus function: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD.]
Figure 4.3: φ(NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λcpi on the Zoo data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

So as to provide the reader with a more comprehensive and quantitative analysis of the performance of the proposed selection-based consensus self-refining procedure, the following sections present a separate study of the results yielded by the self-refining procedure itself and by the supraconsensus function that, a posteriori, must select the best self-refined consensus clustering solution.

4.3.1 Evaluation of the selection-based self-refining process

As far as the evaluation of the selection-based consensus self-refining procedure is concerned, four analyses have been conducted. For starters, we have measured the percentage of experiments in which the self-refining procedure yields a better quality clustering than the cluster ensemble component selected as a reference (i.e. the one maximizing its φ(ANMI) with respect to the cluster ensemble E, referred to as λref). The results, averaged across all the unimodal data collections employed in this work, are presented in table 4.12 as a function of the consensus function employed. On average, the self-refining procedure, when conducted on select cluster ensembles created upon the selection of λref, yields better clustering solutions in 56% of the conducted experiments.

                   Consensus function
CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
54.5    28.2    10.3    69.6    81.8    82      65.4

Table 4.12: Percentage of self-refining experiments in which one of the self-refined consensus clustering solutions is better than the selected cluster ensemble component reference λref, averaged across the twelve data collections.

This figure is notably lower than the one obtained when the select cluster ensemble creation uses a previously derived consensus clustering solution (it was 83%). This is due to the fact that the cluster ensemble component selection usually results in using a reference clustering λref of higher quality than the consensus clustering solution λc.

                   Consensus function
CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
26.9    9.1     20.8    15.2    15.5    11.4    7.8

Table 4.13: Relative φ(NMI) gain percentage of the top quality self-refined consensus clustering solutions with respect to the maximum φ(ANMI) cluster ensemble component, averaged across the twelve data collections.

Secondly, in those experiments where self-refined consensus solutions are better than λref, we have measured the relative degree of improvement achieved (quantified in terms of relative percentage φ(NMI) increase). The results, presented in table 4.13, show notable quality improvements, averaging a 15.2% relative φ(NMI) gain across all data sets and consensus functions. These quality gains are nonetheless much smaller than those obtained in the self-refining experiments based on a previously derived consensus clustering solution (see section 4.2), again due to the superior quality of the reference clustering the self-refining procedure is based upon.

Next, the maximum and median φ(NMI) components of the cluster ensemble E –referred to as BEC (best ensemble component) and MEC (median ensemble component), respectively– are compared to either the top quality self-refined consensus clustering solution or λref (depending on which has the largest φ(NMI) with respect to the ground truth). As in the previous section, we have evaluated i) the percentage of experiments where the maximum φ(NMI) consensus clustering solution attains a higher quality than that of the BEC and MEC, and ii) the relative percentage φ(NMI) variation between them and the top quality consensus clustering solution. Once more, all the results presented correspond to an average across all the experiments conducted on the twelve unimodal data collections.

On the one hand, table 4.14 presents the aforementioned magnitudes referred to the best cluster ensemble component. On average, the highest quality clustering (either λref or one of the self-refined consensus solutions) is better than the BEC in 14.1% of the conducted experiments, achieving an average relative percentage φ(NMI) gain of 1.6%. It is important to notice that these results are pretty similar to those obtained when self-refining is based on a previously derived consensus clustering solution (see section 4.2), as these percentages were equal to 10.6% and 1.8%, respectively.

                              Consensus function
                        CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
% of experiments        9.1     9.1     0       16.7    36.4    27.6    0
relative % φ(NMI) gain  2.1     0.2     –       2.6     1.8     1.3     –

Table 4.14: Percentage of experiments where either the top quality self-refined consensus clustering solution or λref improves upon the best cluster ensemble component, and relative φ(NMI) gain percentage with respect to it, averaged across the twelve data collections.

On the other hand, table 4.15 presents the results of the same experiment, but referred to the median ensemble component (or MEC). In this case, the selection and self-refining procedure yields clusterings better than the MEC in 98% of the cases, attaining an average relative φ(NMI) gain of 112.7%. These figures indicate that the selection-based consensus self-refining yields better results –when compared to the MEC– than its consensus-based counterpart, where the two aforementioned figures were 67.7% and 91%, respectively.

                              Consensus function
                        CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
% of experiments        100     95.4    95      100     100     100     95.4
relative % φ(NMI) gain  118.7   100.7   118.3   114.9   116.1   112.5   107.4

Table 4.15: Percentage of experiments where either the top quality self-refined consensus clustering solution or λref improves upon the median cluster ensemble component, and relative φ(NMI) gain percentage with respect to it, averaged across the twelve data collections.

To summarize this analysis of the selection-based consensus self-refining proposal, we can conclude that, firstly, it constitutes a fairly good approach as far as obtaining a high quality partition of the data is concerned. When compared to the consensus-based self-refining procedure put forward in section 4.1, it can be observed that, while the relative quality gains introduced by the self-refining process itself are smaller in selection-based consensus self-refining, the top quality clustering results obtained are superior to those yielded by consensus-based self-refining. We believe that both phenomena are due to the differences in the quality of the clustering solution that constitutes the starting point of the self-refining process —in the case of consensus-based self-refining, this reference is a previously derived consensus clustering λc, which typically is a poorer data partition than the maximum φ(ANMI) cluster ensemble component λref (see the figures in appendices D.1 and D.2 for a quick visual comparison). This fact makes selection-based self-refining an even more attractive alternative, all the more so since no previous consensus process execution is required —with the obvious computational savings this implies.

4.3.2 Evaluation of the supraconsensus process

As regards the performance of the supraconsensus function proposed in (Strehl and Ghosh, 2002), we have conducted a twofold evaluation. On the one hand, we have analyzed the percentage of experiments in which the highest quality clustering solution and the one selected via supraconsensus coincide. On the other hand, we have measured the relative percentage φ(NMI) loss derived from a suboptimal consensus solution selection, using the top quality clustering solution as a reference (i.e. the one that should be selected by an ideal supraconsensus function). The results of these two experiments are presented in table 4.16, averaged across all the data collections and for each consensus function.

                              Consensus function
                        CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
% of experiments        37.5    61.2    84.6    30.6    12.5    23      60
relative % φ(NMI) loss  24.6    10.4    16.8    12.2    10.9    12.7    11.9

Table 4.16: Percentage of experiments in which the supraconsensus function selects the top quality clustering solution, and relative percentage φ(NMI) losses between the top quality clustering solution and the one selected by supraconsensus, averaged across the twelve data collections.

The average accuracy with which the supraconsensus function selects the top quality self-refined consensus clustering solution is 44.2%, i.e. it manages to select the best solution in less than half of the experiments conducted. Moreover, this apparent lack of precision entails an average relative φ(NMI) reduction of 14.2%.

These results reinforce the idea that the supraconsensus function proposed in (Strehl and Ghosh, 2002) is still far from constituting the most appropriate means for selecting, in a completely unsupervised manner, the best consensus clustering solution among a set of them, especially if they have fairly similar qualities. This is the reason why the average selection accuracy attained by the supraconsensus function in the selection-based self-refining scenario is higher than in the consensus-based context (44.2% vs. 29%): the φ(NMI) differences between the top quality clustering solution and the remaining ones are notably larger in the former case than in the latter —in selection-based self-refining, the selected cluster ensemble component λref is often of notably higher quality than the self-refined consensus clustering solutions λcpi (see appendix D.2).

In contrast, similar results are obtained in both the selection-based and consensus-based self-refining scenarios when the efficiency of the supraconsensus function is measured in terms of the φ(NMI) loss caused by erroneous selections (i.e. when a clustering solution other than the highest quality one is selected by the supraconsensus function). In selection-based self-refining, this relative percentage φ(NMI) loss is 14.2%, compared to 14.9% in the consensus-based self-refining context.

4.4 Discussion

In this chapter, we have put forward two proposals oriented to obtaining a high quality clustering solution given a cluster ensemble and a similarity measure between partitions, using consensus clustering and following a fully unsupervised procedure. Together with the computationally efficient consensus architectures presented in chapter 3, these proposals constitute the basis for constructing robust consensus clustering systems.

Our proposals are based on applying consensus clustering on a set of clusterings –compiled in a select cluster ensemble– which are chosen from the cluster ensemble according to their similarity with respect to an initially available clustering solution. By doing so, we have experimentally shown that a refined consensus clustering solution of higher quality than the original one is very likely to be obtained.

The main difference between our two proposals lies in the origin of the clustering employed as a reference for creating the select cluster ensemble. In the first proposal, referred to as consensus-based self-refining, this initial clustering is the consensus clustering solution λc resulting from a previous consensus process run on the whole cluster ensemble. In our second proposal, the starting point of the refining process is one of the components of the cluster ensemble, which is selected using an average normalized mutual information criterion —giving rise to what we call selection-based self-refining.

Unfortunately, the optimal configuration of this self-refining procedure (e.g. the size of the select cluster ensemble, or the consensus function employed for creating the refined clustering solutions) is local to each particular experiment. This drawback, which is by no means new in the consensus clustering literature, can be tackled by means of a supraconsensus function that, in a blind manner, selects the highest quality clustering solution among several candidates created using distinct self-refining configurations. However, the application of one of the most widely used supraconsensus functions (the one proposed in (Strehl and Ghosh, 2002)) in our experiments has yielded somewhat disappointing results, as it is only capable of selecting the highest quality clustering solution in a relatively low percentage of the experiments conducted. Moreover, alternative supraconsensus functions based on average normalized mutual information gave rise to even poorer selection accuracies (not reported here due to the limited interest of the results obtained), which suggests that further research is necessary in order to devise novel supraconsensus functions capable of satisfying such a restrictive constraint as the one imposed here, i.e. selecting, among a set of clustering solutions, the top quality one in a fully unsupervised fashion.

As mentioned above, the concepts of consensus self-refining and supraconsensus functions are closely related. In fact, supraconsensus is originally presented in (Strehl and Ghosh, 2002) as a means for selecting the best consensus clustering solution among several candidates created using different consensus functions. Therefore, it seems logical to consider the application of supraconsensus not on a set of previously derived consensus clustering solutions, but on the cluster ensemble components themselves, so as to select the highest quality ones.
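For reference, the following minimal sketch (ours, under the assumption that the selection criterion is the average normalized mutual information with the ensemble, in the spirit of Strehl and Ghosh (2002)) illustrates this kind of blind selection; the nmi() helper is taken from scikit-learn and the variable names are illustrative only.

from sklearn.metrics import normalized_mutual_info_score as nmi

def anmi(candidate, ensemble):
    """Average NMI between a candidate labeling and all cluster ensemble components."""
    return sum(nmi(candidate, component) for component in ensemble) / len(ensemble)

def supraconsensus(candidates, ensemble):
    """Blindly pick the candidate clustering that maximizes its ANMI with the ensemble."""
    return max(candidates, key=lambda labels: anmi(labels, ensemble))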

Some very recent works have dealt with this issue, such as (Gionis, Mannila, and Tsaparas, 2007), where the BESTCLUSTERING algorithm is defined as a means for identifying the individual partition that minimizes the number of disagreements with respect to the remaining components of the cluster ensemble. Nevertheless, no subsequent consensus clustering based refinement process is applied on this presumably high quality cluster ensemble component, a process which, as we have experimentally proved, may bring about important quality gains.

More recently, the use of clustering solution refinement procedures based on consensus clustering has been studied in (Fern and Lin, 2008), concurrently with the completion of this thesis. That work and ours have multiple points in common, such as i) the primary purpose of avoiding the negative influence of poor clusterings contained in large cluster ensembles on the quality of the consensus clustering solutions built upon them, ii) the use of one of the components of the cluster ensemble as the reference partition for creating the reduced select cluster ensemble, as we propose in selection-based self-refining, iii) the analysis of the quality of several refined consensus clustering solutions generated upon multiple select cluster ensembles of distinct sizes, and iv) the use of normalized mutual information as the guiding principle for comparing clusterings.

However, there also exist several differences between both works, as in (Fern and Lin,<br />

2008) i) refining is not presented as a means for bettering the quality of a previously<br />

derived consensus clustering solution, but as a means for obtaining a good quality one<br />

upon a select cluster ensemble resulting from discarding those poor components that may<br />

induce a quality loss in it, ii) the criteria employed for choosing the components included<br />

in the select cluster ensemble consider both clustering quality and diversity, iii) clustering<br />

refinement results obtained by a single consensus function (CSPA) are reported, and iv) no<br />

supraconsensus methodology for selecting the best refined consensus clustering is studied.<br />

To conclude, we would like to highlight again the significant quality improvements that can be obtained by means of self-refining consensus procedures. However, it is also important to be aware that we cannot take full advantage of these gains unless a well performing supraconsensus methodology allows us to select the top quality self-refined clustering solution with a high level of confidence. For this reason, in our opinion, devising such a technique is a matter of paramount importance as regards the further research to be conducted in this particular field.

In the future, it would be interesting to analyze how consensus self-refining procedures perform if the cluster ensemble selection process were based on clustering similarity measures other than normalized mutual information. Furthermore, we also intend to study the possibility of creating the select cluster ensemble by including in it all those clusterings exceeding a certain φ(ANMI) threshold with respect to the reference clustering, instead of selecting a fixed percentage p of the cluster ensemble components.
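To make the contrast between the two selection strategies concrete, the following sketch (ours; since a single reference clustering is used, the similarity to it reduces to plain NMI) shows a percentage-based and a threshold-based construction of the select cluster ensemble. Function and variable names are illustrative, not the thesis implementation.

from sklearn.metrics import normalized_mutual_info_score as nmi

def select_by_percentage(ensemble, reference, p):
    """Keep the p% of ensemble components most similar to the reference clustering."""
    ranked = sorted(ensemble, key=lambda labels: nmi(labels, reference), reverse=True)
    keep = max(1, round(len(ranked) * p / 100))
    return ranked[:keep]

def select_by_threshold(ensemble, reference, threshold):
    """Alternative: keep every component whose similarity to the reference exceeds a threshold."""
    return [labels for labels in ensemble if nmi(labels, reference) > threshold]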

4.5 Related publications<br />

As mentioned earlier, the aim of the proposed self-refining consensus clustering approach is to obtain partitions which are robust to the indeterminacies inherent to the clustering problem. This has been the main driving force of our research since its early days, and it has been reflected in several publications in international conferences and national journals. The application focus of these works was document clustering, so they were mainly published in Information Retrieval and Natural Language Processing forums. In none of these works, however, are self-refining procedures included as a means for obtaining improved quality clustering results, so our proposals in this specific area remain, for the moment, unpublished.

The first publication regarding robust document clustering based on cluster ensembles was presented as a poster at the SIGIR 2006 conference held in Seattle. The details of this publication follow.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />

Title: Feature Diversity in Cluster Ensembles for Robust Document Clustering<br />

In: Proceedings of the 29th ACM SIGIR Conference<br />

Pages: 697-698<br />


Year: 2006<br />

Abstract: The performance of document clustering systems depends on employing<br />

optimal text representations, which are not only difficult to determine beforehand,<br />

but also may vary from one clustering problem to another. As a first step towards<br />

building robust document clusterers, a strategy based on feature diversity and cluster<br />

ensembles is presented in this work. Experiments conducted on a binary clustering<br />

problem show that our method is robust to near-optimal model order selection and<br />

able to detect constructive interactions between different document representations in<br />

the test bed.<br />

A subsequent extension of this work was published in the Journal of the Spanish Society for Natural Language Processing (Procesamiento del Lenguaje Natural) after its presentation at the SEPLN 2006 conference held in Zaragoza.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />

Title: Robust Document Clustering by Exploiting Feature Diversity in Cluster Ensembles<br />

In: Journal of the Spanish Society for Natural Language Processing (Procesamiento del Lenguaje Natural)

Volume: 37

Pages: 169-176

Year: 2006<br />

Abstract: The performance of document clustering systems is conditioned by the use<br />

of optimal text representations, which are not only difficult to determine beforehand,<br />

but also may vary from one clustering problem to another. This work presents an<br />

approach based on feature diversity and cluster ensembles as a first step towards<br />

building document clustering systems that behave robustly across different clustering<br />

problems. Experiments conducted on three binary clustering problems of increasing<br />

difficulty show that the proposed method is i) robust to near-optimal model order<br />

selection, and ii) able to detect constructive interactions between different document<br />

representations, thus being capable of yielding consensus clusterings superior to any<br />

of the individual clusterings available.<br />

Last, a global analysis regarding clustering indeterminacies and how they can be overcome via cluster ensembles was presented at the ICA 2007 conference as an oral presentation.

Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró

Title: Text Clustering on Latent Thematic Spaces: Variants, Strengths and Weaknesses

In: Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation



Publisher: Springer

Series: Lecture Notes in Computer Science

Volume: 4666

Editors: Mike E. Davies, Christopher J. James, Samer A. Abdallah and Mark D. Plumbley

Pages: 794-801<br />

Year: 2007<br />

Abstract: Deriving a thematically meaningful partition of an unlabeled document<br />

corpus is a challenging task. In this context, the use of document representations<br />

based on latent thematic generative models can lead to improved clustering. However,<br />

determining a priori the optimal document indexing technique is not straightforward,

as it depends on the clustering problem faced and the partitioning strategy adopted.<br />

So as to overcome this indeterminacy, we propose deriving a single consensus labeling<br />

upon the results of clustering processes executed on several document representations.<br />

Experiments conducted on subsets of two standard text corpora evaluate distinct<br />

clustering strategies based on latent thematic spaces and highlight the usefulness<br />

of consensus clustering to overcome the indeterminacy regarding optimal document<br />

indexing.<br />



Chapter 5

Multimedia clustering based on cluster ensembles

As already outlined in section 1.3, multimodality is an increasingly noticeable trend as<br />

far as the nature of data is concerned. Given the growing ubiquity of multimedia data,<br />

it seems logical to consider that the derivation of robust clustering strategies as a means<br />

for organizing the increasingly larger multimodal repositories available is already a field of<br />

interest in itself.<br />

However, it is important to take into account that the direct application of classic clustering algorithms for partitioning multimedia data collections may turn out to be suboptimal. The reason for this is twofold. Firstly, the usual indeterminacies that condition the performance of clustering algorithms are multiplied due to the existence of several data modalities. This means that the user must not only decide which object representation or clustering algorithm is expected to yield the best partition of the data (i.e. the one best describing its natural group structure), but must also decide on which of the m data modalities clustering is to be conducted, as classic clustering algorithms are designed to operate on unimodal data.

And secondly, notice that clustering multimedia data using a single modality entails ignoring the presumable positive synergies that may exist between the different modalities, which could be of interest for deriving a better partition of the data. The only way a classic clustering algorithm can take advantage of the possible benefits of multimodality consists in creating multimodal representations of the objects, conducting an early fusion of the features corresponding to distinct modalities prior to clustering. That is, clustering is conducted on a single, artificially generated multimodal representation of the objects created by the combination of the m original modalities. However, feature fusion may or may not be beneficial as regards the quality of the clustering results, as reported in appendix B.2, which turns the application of this strategy into a further indeterminacy to deal with.

For these reasons, in this chapter we propose applying consensus clustering as a means<br />

for clustering multimedia data robustly, as it provides a natural way for combining i) the<br />

results of clustering processes run on each one of the m distinct modalities —thus conducting<br />

a late fusion of modalities, and ii) the partitions obtained upon the multimodal data<br />

representation derived by the early fusion of the features of the m modalities. By doing so,<br />


Figure 5.1: Block diagram of the proposed multimodal consensus clustering system. The diagram comprises the multimodal cluster ensemble generation stage, in which each modality X1, ..., Xm of the multimedia data set X, plus the representation obtained by feature fusion, undergoes object representation and clustering so as to build the cluster ensemble E; this is followed by a flat or hierarchical consensus architecture and a consensus self-refining stage that outputs the final clustering λc^final.

we can take advantage of both modality fusion approaches, which can be of help to reveal the group structure of the data.

The proposed multimodal consensus clustering approach follows the schematic block diagram of figure 5.1 and consists of the following steps: the generation of the multimodal cluster ensemble E, followed by the application of a computationally efficient consensus architecture and a consensus-based self-refining procedure, which yields the final partition of the multimodal data collection subject to clustering, λc^final.

In this chapter, the phases that constitute the multimodal consensus clustering process are described and contextualized in the framework of the experiments conducted in this work. First, section 5.1 presents the strategies followed in the creation of the multimodal cluster ensemble. Next, section 5.2 describes the particularities of the consensus architecture and the self-refining procedure that give rise to the multimodal data partition. Last, the results of the multimedia consensus clustering experiments run in this thesis are presented in section 5.3, and with the conclusions discussed in section 5.4 the present chapter comes to an end.

5.1 Generation of multimodal cluster ensembles<br />

The key point for conducting a multimodal consensus clustering process lies in the creation<br />

of a cluster ensemble that contains both clusterings derived on each data modality separately<br />

and on fused modalities. In this section, we describe a general procedure for creating a<br />

multimodal cluster ensemble E upon a multimedia data collection.<br />

Without loss of generality, let us assume that the multimodal data collection subject to<br />

clustering contains n objects represented by numeric attributes. Thus, the whole data set<br />

can be formally represented by means of a d × n real-valued matrix X, where each object<br />

is represented by means of a d-dimensional column vector xi, ∀i ∈ [1,n].<br />


Moreover, suppose that our multimedia data collection is composed of m modalities. That is, each of the n objects is simultaneously represented by m disjoint sets of real-valued attributes (each of which corresponds to one of the m modalities) of sizes d1, d2, ..., dm, so that d1 + d2 + ... + dm = d. Thus, the multimodal data set matrix X can be decomposed into m submatrices X1, X2, ..., Xm. Each of these matrices Xi (of size di × n, ∀i ∈ [1, m]) represents all the objects in the data set according to one of the m modalities it contains (see figure 5.1).
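As a minimal illustration (ours, not part of the original text), the following NumPy sketch shows how such a multimodal data matrix can be split into its per-modality submatrices; the dimensions used are arbitrary toy values.

import numpy as np

# Toy dimensions: n objects described by m = 2 modalities of sizes d1 and d2 (d = d1 + d2).
n, d1, d2 = 100, 40, 25
X = np.random.rand(d1 + d2, n)          # full d x n multimodal data matrix

modality_sizes = [d1, d2]
offsets = np.cumsum([0] + modality_sizes)
# X_i is the d_i x n submatrix holding the representation of all objects in modality i.
submatrices = [X[offsets[i]:offsets[i + 1], :] for i in range(len(modality_sizes))]

assert all(Xi.shape == (di, n) for Xi, di in zip(submatrices, modality_sizes))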

Given this scenario, a first subset of the clusterings that will constitute the multimodal cluster ensemble E is generated¹ through the application, upon each submatrix Xi (∀i ∈ [1, m]), of f mutually crossed diversity factors dfj, ∀j ∈ [1, f]. If the same set of diversity factors is applied on the m modalities, the number of clusterings generated in this first subset is equal to:

l_1mod = m |df1| |df2| ... |dff|     (5.1)

where the |·| operator denotes the cardinality of a set.

¹ As these clusterings are created by running multiple clustering processes separately on each modality, we refer to them using the symbol 1_mod, which stands for "one modality".

Secondly, another subset of clusterings is created by the application of a set of diversity factors (not necessarily equal to the previous one) upon a fused multimodal representation of the data set. This representation can be generated by means of any early feature fusion process, such as the application of a projection-based object representation technique on the d-dimensional vectors resulting from the concatenation of the features corresponding to the m modalities (La Cascia, Sethi, and Sclaroff, 1998; Zhao and Grosky, 2002; Benitez and Chang, 2002; Snoek, Worring, and Smeulders, 2005; Gunes and Piccardi, 2005). This second subset of clusterings will be referred to using the symbol m_mod, as they are obtained upon an object representation that combines the m modalities into a single one.

Assuming, for simplicity, that the same set of diversity factors is employed for creating the subsets of unimodal and multimodal clusterings, the number of multimodal partitions created becomes:

l_mmod = |df1| |df2| ... |dff|     (5.2)

Finally, the mere compilation of the unimodal and multimodal partitions constitutes the multimedia cluster ensemble E, the size of which is equal to:

l = l_1mod + l_mmod = (m + 1) |df1| |df2| ... |dff|     (5.3)
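To make the generation procedure concrete, the following Python sketch (illustrative only; cluster_algorithm and the diversity factor values are placeholders, not the actual CLUTO calls used in this thesis) assembles a multimodal cluster ensemble by clustering every modality, plus the early-fusion representation, under every combination of diversity factor values, so that the resulting ensemble size follows equation (5.3).

import itertools

def build_multimodal_ensemble(submatrices, fused_matrix, diversity_factors, cluster_algorithm):
    """Cluster each modality and the fused representation under every combination
    of diversity factor values, yielding (m + 1) * prod(|df_j|) partitions."""
    ensemble = []
    views = list(submatrices) + [fused_matrix]          # m unimodal views + 1 fused view
    for view in views:
        for combo in itertools.product(*diversity_factors.values()):
            config = dict(zip(diversity_factors.keys(), combo))
            ensemble.append(cluster_algorithm(view, **config))
    return ensemble

# Example: two diversity factors of cardinalities 4 and 10 on m = 2 modalities
# would yield an ensemble of (2 + 1) * 4 * 10 = 120 partitions.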

As regards the creation of the multimodal cluster ensembles E on the four multimedia data collections employed in this work (see appendix A.2.2 for a description), three diversity factors have been applied: clustering algorithms (dfA), object representations (dfR) and object representation dimensionalities (dfD). In the following paragraphs, a detailed description of these diversity factors and their role in the cluster ensemble creation process is presented.

Starting with the original object features, which constitute the baseline representation, additional object representations are derived by means of feature extraction based on Principal Component Analysis, Independent Component Analysis, Random Projection and Non-negative Matrix Factorization (this last representation can only be applied when the original object features are non-negative). Thus, the total number of object representations is either |dfR| = 4 (for the CAL500 and IsoLetters collections) or |dfR| = 5 (for the InternetAds and Corel data sets). It is important to notice that these feature extraction based object representations are derived for each of the m = 2 modalities these data sets contain, plus for the multimodal baseline representation created by concatenating the features of both modalities.

For each feature extraction based object representation and modality, a set of distinct representations is created by conducting a sweep of dimensionalities, which constitutes the second diversity factor, dfD. Quite obviously, its cardinality depends on the data set and modality. The range and cardinality of dfD per modality corresponding to each of the four multimedia data sets employed in the experimental section of this chapter are presented in table 5.1.

Data set       Modality               dfD range      |dfD|
CAL500         audio                  [30, 120]      10
               text                   [30, 70]       5
               audio + text           [30, 200]      18
IsoLetters     speech                 [100, 600]     11
               image                  [3, 16]        14
               speech + image         [100, 600]     11
InternetAds    object                 [60, 120]      7
               collateral             [100, 1000]    19
               object + collateral    [100, 1000]    19
Corel          image                  [50, 350]      7
               text                   [100, 450]     8
               image + text           [100, 800]     15

Table 5.1: Range and cardinality of the dimensional diversity factor dfD per modality for each one of the four multimedia data sets.

Finally, the clusterings that make up the multimodal cluster ensembles E are created by running |dfA| = 28 clustering algorithms from the CLUTO clustering package (see appendix A.1) on each distinct object representation. As a result, a total of l = 2856, 3108, 5124 and 3444 partitions are obtained for the CAL500, IsoLetters, InternetAds and Corel multimodal data collections, respectively. Notice that, in our case, the diversity factors are not mutually crossed, as the baseline object representations lack dimensional diversity. Therefore, the generic expressions of equations (5.1) to (5.3) do not apply in our case.
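As a quick sanity check (our own reconstruction, not part of the original text), the reported ensemble sizes are consistent with counting, for each of the three views (the two modalities plus their fusion), one baseline representation without a dimensionality sweep plus the feature extraction representations swept over the |dfD| values of table 5.1, all multiplied by the |dfA| = 28 clustering algorithms:

# Per data set: (number of feature-extraction representations, i.e. |dfR| - 1,
#                and the |dfD| cardinalities of the three views, taken from table 5.1)
datasets = {
    "CAL500":      (3, [10, 5, 18]),
    "IsoLetters":  (3, [11, 14, 11]),
    "InternetAds": (4, [7, 19, 19]),
    "Corel":       (4, [7, 8, 15]),
}

n_algorithms = 28  # |dfA|
for name, (n_extracted_reps, dims_per_view) in datasets.items():
    # Each view contributes 1 baseline partition per algorithm plus one partition
    # per (extracted representation, dimensionality) pair per algorithm.
    representations = sum(1 + n_extracted_reps * d for d in dims_per_view)
    print(name, representations * n_algorithms)
# Expected output: CAL500 2856, IsoLetters 3108, InternetAds 5124, Corel 3444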

5.2 Self-refining multimodal consensus architecture<br />

Once the multimodal cluster ensemble E is built, the next step consists in deriving a consensus<br />

clustering solution λc upon it. Recall that, according to the conclusions drawn in<br />

chapter 3, it may be more computationally efficient to tackle this task by means of a flat or<br />

a hierarchical consensus architecture depending on the size of the cluster ensemble.<br />


As far as this latter issue is concerned, it is important to notice that, if the cluster ensemble generation process proposed in section 5.1 is followed, multimodality induces an important increase in the cluster ensemble size. Indeed, if a set of f mutually crossed diversity factors of cardinalities |df1|, |df2|, ..., |dff| is applied on a particular unimodal data collection, we obtain a cluster ensemble of size:

l_unimodal = |df1| |df2| ... |dff|     (5.4)

However, if the same data collection were multimodal and contained m modalities, the application of the previously presented multimedia cluster ensemble creation procedure (using exactly the same diversity factors applied on the unimodal version of the data set) would yield an ensemble of size:

l_multimodal = (m + 1) |df1| |df2| ... |dff|     (5.5)

That is, multimodality increases the size of cluster ensembles by a factor of (m + 1); i.e. even in the minimally multimodal case, m = 2, the cluster ensembles obtained are three times larger than those that would be created in a unimodal scenario. For this reason, and allowing for the conclusions of chapter 3, it seems that hierarchical consensus architectures are likely to be the most computationally efficient implementation alternative for deriving consensus clustering solutions upon multimodal cluster ensembles. In any case, the running time estimation process proposed in chapter 3 constitutes a valid tool for selecting a priori, and with a high degree of precision, the computationally optimal consensus architecture for solving a specific consensus clustering problem.

Regardless of the consensus architecture employed, it will output a consensus clustering solution λc. The next and final step consists in applying the consensus self-refining procedure proposed in section 4.1, so as to obtain a presumably higher quality refined consensus clustering solution λc^final. Quite obviously, in a multimedia clustering scenario, the self-refining process is based on selecting a subset of the components of the multimodal cluster ensemble E for creating a select cluster ensemble. Otherwise, it follows exactly the steps presented in table 4.1.

In this work, a three-stage deterministic hierarchical consensus architecture (or DHCA for short) has been applied for deriving the consensus clustering solution λc upon our multimodal cluster ensembles. This is due to the fact that we have deemed multimodality a diversity factor (denoted as dfM) in itself. Moreover, in contrast to the procedure followed in the previous chapters, the multimodal cluster ensemble E considered in each individual experiment conducted in this chapter only contains clusterings created by a single clustering algorithm, although, as mentioned earlier, |dfA| = 28 of them have been employed. In other words, for each data collection, experiments on |dfA| = 28 different cluster ensembles have been conducted. The components of each of these ensembles only differ in the representational (dfR), dimensional (dfD) and multimodal (dfM) diversity factors, while having been created by means of a single clustering algorithm.

According to the conclusions drawn in section 3.3, the DHCA variant that minimizes the<br />

number of executed consensus processes and the running time of its serial implementation<br />

is the one in which consensus processes are sequentially run on the distinct diversity factors<br />

that make up the cluster ensemble E arranged in decreasing cardinality order.<br />


Therefore, it is necessary to determine the cardinality of the three diversity factors so as to devise the computationally optimal DHCA topology. As described in section 5.1, the cardinality of the representational diversity factor is either |dfR| = 4 or |dfR| = 5, depending on the data set. The inspection of table 5.1 reveals that the dimensional diversity factor adopts a wide range of cardinalities, but all of them fall in the [5, 19] interval. Last, the newly introduced modality diversity factor entails not only the m = 2 original data modalities, but also the multimodal one resulting from the feature-level fusion of the former; thus, its cardinality is equal to |dfM| = 3. For this reason, the specific DHCA variant implemented is referred to as DRM (as |dfD| > |dfR| > |dfM|), and consensus will be sequentially conducted across dimensionalities, representations and modalities at each of its three stages.

For illustration purposes, figure 5.2 depicts a toy DHCA DRM variant applied on a 27-component multimodal cluster ensemble created using dimension, representation and modality diversity factors, all of cardinality equal to 3. In its first stage, consensus processes are conducted across dimensionalities, thus yielding a first set of intermediate consensus clusterings denoted as λD,j,k, where j and k index object representations and modalities, respectively. Subsequently, the second consensus stage executes consensus processes across the distinct object representations, giving rise to a second set of partial consensus clustering solutions, denoted as λD,R,k, where k designates modalities. Assuming that two of these modalities are truly original modes, and that the third one is created by feature-level fusion of the former, the clusterings output by the second stage of the DHCA are also denoted as λc^mod_1, λc^mod_2 and λc^mod_1+mod_2, respectively. Finally, the execution of a consensus process on these three intermediate clusterings yields the final consensus clustering solution λc, which is referred to as intermodal hereafter.

Notice that conducting consensus by means of this DHCA variant instead of a flat or a random hierarchical consensus architecture is especially interesting from an analytic viewpoint, as it makes it possible to compare the effect of the consensus process on each of the three modalities by simply evaluating the three intermediate consensus clusterings input to the last consensus stage, i.e. λc^mod_1, λc^mod_2 and λc^mod_1+mod_2 in figure 5.2.
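To illustrate the control flow of such a three-stage architecture, here is a schematic Python sketch (our own illustration; consensus_function stands for any of the consensus functions used in this thesis). Partitions are assumed to be stored in a dictionary indexed by (dimensionality, representation, modality), and consensus is taken first across dimensionalities, then across representations, and finally across modalities.

from collections import defaultdict

def drm_dhca(ensemble, consensus_function):
    """ensemble: dict mapping (dim, rep, mod) -> partition (label list).
    Returns the intermodal consensus clustering of a DRM-ordered DHCA."""
    # Stage 1: consensus across dimensionalities, one per (representation, modality) pair.
    stage1 = defaultdict(list)
    for (dim, rep, mod), labels in ensemble.items():
        stage1[(rep, mod)].append(labels)
    lambda_d = {key: consensus_function(parts) for key, parts in stage1.items()}

    # Stage 2: consensus across representations, one per modality.
    stage2 = defaultdict(list)
    for (rep, mod), labels in lambda_d.items():
        stage2[mod].append(labels)
    lambda_dr = {mod: consensus_function(parts) for mod, parts in stage2.items()}

    # Stage 3: consensus across modalities (the two original ones plus the fused one).
    return consensus_function(list(lambda_dr.values()))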

5.3 Multimodal consensus clustering results<br />


In this section, the results of the proposed multimodal consensus clustering experiments<br />

conducted in this work are described. The design of these experiments has followed the<br />

rationale described next.<br />

– What do we want to measure?<br />


i) In section 5.3.1, we evaluate the quality of the partial consensus clusterings obtained on each separate modality and on the one resulting from multimodal feature-level fusion (i.e. λc^mod_1, λc^mod_2 and λc^mod_1+mod_2, respectively), plus the intermodal clustering λc resulting from applying the consensus process for combining the three aforementioned consensus clustering solutions. Moreover, we also analyze how the unimodal, multimodal and intermodal consensus clusterings compare to each other and to the components of the cluster ensembles they are created upon.



Figure 5.2: Deterministic hierarchical consensus architecture DRM variant operating on a cluster ensemble created using three diversity factors: three dimensionalities (|dfD| = 3) of three object representations (|dfR| = 3) on three modalities (|dfM| = 3). The cluster ensemble component obtained upon the jth object representation with the ith dimensionality on the kth modality is denoted as λi,j,k. Consensus clusterings are sequentially created across the dimension, representation and modality diversity factors (dfD, dfR and dfM, respectively).

ii) In section 5.3.2, we analyze the quality of the self-refined intermodal consensus clusterings λc^pi obtained upon select cluster ensembles containing a percentage pi of the partitions of the original multimodal cluster ensemble E. Moreover, we also evaluate the performance of the supraconsensus function as a means for selecting, in a fully unsupervised manner, the top quality (either refined or non-refined) consensus clustering.

– How do we measure it?<br />

i) The quality of the unimodal, multimodal and intermodal consensus clusterings obtained is evaluated in terms of their φ(NMI) with respect to the ground truth of the data set. Inter-consensus clustering comparisons are conducted in terms of their relative percentage φ(NMI) differences, and of the percentage of experiments in which one of them attains higher φ(NMI) scores than the rest. Moreover, comparisons between the consensus clusterings and their associated cluster ensembles are made in terms of relative percentage φ(NMI) differences, the percentage of cluster ensemble components that attain φ(NMI) scores higher than that of the evaluated consensus clusterings, and the percentage of experiments and relative percentage φ(NMI) differences between them and the cluster ensemble components of maximum and median quality.

ii) The quality of the self-refined consensus clusterings is measured in terms of their<br />

φ (NMI) with respect to the ground truth of the data set. We also compute the percentage<br />

of experiments in which the top quality self-refined consensus clustering<br />

attains a higher φ (NMI) score than its non-refined counterpart, besides the relative<br />

percentage φ (NMI) differences between them and the maximum and median<br />

cluster ensemble components. On the other hand, the ability of the supraconsensus<br />

function is evaluated by the computation of the percentage of experiments in<br />

which it succeeds in selecting the highest quality consensus clustering available<br />

as the final partition of the data, besides the relative percentage φ (NMI) losses<br />

suffered when it does not.<br />

– How are the experiments designed? Just like in all the experimental sections<br />

of this thesis, consensus clusterings have been derived by means of the seven consensus<br />

functions described in appendix A.5, namely CSPA, EAC, HGPA, MCLA,<br />

ALSAD, KMSAD and SLSAD. By doing so, it is possible to compare their performances<br />

across all the consensus clustering problems conducted. It is important to<br />

note that, in this chapter, we are solely interested in analyzing the quality of the consensus<br />

clustering solutions obtained, as the main purpose of the proposed multimodal<br />

consensus clustering approach is achieving robustness to clustering indeterminacies,<br />

which, as aforementioned, are increased due to multimodality. For brevity reasons,<br />

only the results corresponding to cluster ensembles based on four distinct clustering<br />

algorithms are graphically displayed. These four clustering algorithms –namely<br />

agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2 – cover all the clustering<br />

approaches encompassed in the CLUTO clustering package (see appendix A.1 for a<br />

description). However, when global analyses are presented, the results obtained on<br />

the |dfA| = 28 multimodal cluster ensembles are considered.<br />

– How are results presented?<br />

i) The quality of the unimodal, multimodal and intermodal consensus clusterings obtained is presented by means of φ(NMI) score boxplots. Recall that non-overlapping box notches indicate that the medians of the compared distributions differ at the 5% significance level, which allows a quick inference of the statistical significance of the results. Quantitative performance evaluation measures are presented in the shape of numeric tables showing the average values of the magnitudes analyzed (mainly, percentages of experiments and relative percentage φ(NMI) differences).

ii) Results are presented by means of boxplot charts of the φ(NMI) values corresponding to the consensus self-refining process. In particular, each subfigure depicts, from left to right, the φ(NMI) values of: i) the components of the cluster ensemble E, ii) the non-refined consensus clustering solution (i.e. the one resulting from the application of either a hierarchical or a flat consensus architecture, denoted as λc), and iii) the self-refined consensus labelings λc^pi obtained upon select cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 75}.


Moreover, the consensus clustering solution deemed the optimal one (across a majority of experiment runs) by the supraconsensus function is identified by means of a vertical green dashed line. In addition, the quality comparisons between the self-refined consensus clusterings, the non-refined consensus clusterings and the cluster ensemble components are presented by means of tables showing the average values of the measured magnitudes on each one of the four multimodal data collections employed in this work.

– Which data sets are employed? Only the multimodal consensus clustering results<br />

obtained on the IsoLetters data collection are described in detail in this section —a<br />

thorough portrayal of the experiments corresponding to the three other multimedia<br />

data collections employed in this work (CAL500, InternetAds and Corel) can be found<br />

in appendix E. However, the global evaluation of our proposals encompasses the<br />

results obtained on the four multimodal data collections, presenting the average values<br />

as a result.<br />

5.3.1 Consensus clustering per modality and across modalities<br />

This section is devoted to the evaluation of the intermediate (i.e. unimodal and multimodal)<br />

and final (that is, intermodal) consensus clustering solutions yielded by the proposed multimodal<br />

deterministic hierarchical consensus architecture applied on the IsoLetters data set.<br />

Recall that the objects contained in this data collection are instances of the letters of the<br />

English alphabet expressed in two original modalities, speech and image.<br />

We start with a visual evaluation of the quality of the aforementioned consensus clusterings, measured in terms of their normalized mutual information φ(NMI) with respect to the ground truth of the data set (the closer its value to unity, the higher the quality of the corresponding clustering). Moreover, we also contrast them with the components of the multimodal cluster ensemble they are created upon.
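For reference (an illustration of ours, not part of the original text), φ(NMI) between a clustering and the ground truth can be computed, for instance, with scikit-learn; the label vectors below are toy examples, and the exact normalization used in this thesis may differ slightly from the library default.

from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 1, 1, 2, 2]   # reference labels of the data set
clustering   = [1, 1, 0, 0, 2, 2]   # labels produced by some clustering process

# 1.0 means the clustering perfectly matches the ground truth (up to label permutation).
score = normalized_mutual_info_score(ground_truth, clustering)
print(round(score, 3))  # -> 1.0 for this toy example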

For starters, figure 5.3 depicts four boxplot charts corresponding to the φ(NMI) scores of the cluster ensemble components and the unimodal, multimodal and intermodal consensus clusterings, in the case the cluster ensemble compiles partitions output by the agglo-cos-upgma clustering algorithm. In each of the boxplots, the φ(NMI) values of the components of the cluster ensemble E and of the consensus clusterings yielded by each of the seven consensus functions employed in this work across ten experiment runs are shown. In particular, figures 5.3(a) to 5.3(c) depict the quality of the intermediate unimodal and multimodal consensus clustering solutions λc^image, λc^speech and λc^image+speech, respectively. Last, figure 5.3(d) shows the boxplots corresponding to the intermodal consensus clustering λc, resulting from the combination of the previous three.

There are several observations worth making in view of these results. Firstly, it is worth noting that consensus clusterings of widely varying quality are obtained depending on the consensus function employed. Clearly, EAC and HGPA yield the worst results, whereas the five other consensus functions tend to yield better consensus partitions, often being able to compete with the highest quality cluster ensemble components.

Secondly, focusing solely on the unimodal consensus clustering processes (figures 5.3(a) and 5.3(b)), notice the substantial differences between the φ(NMI) values of the cluster ensemble components corresponding to the image and speech modalities. Undoubtedly, this is a determining factor behind the fact that the qualities of the consensus clustering solutions λc^image are notably higher than those of λc^speech, as average relative percentage φ(NMI) differences of 57.4% between both modalities are obtained. That is, although consensus clustering manages to yield reasonable quality results in both modalities, selecting a single modality for clustering a multimedia data collection can be a highly suboptimal option, as it can severely limit the quality of the obtained partition.

Figure 5.3: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

Thirdly, when consensus clustering is conducted on the multimodal representation resulting from feature-level fusion (figure 5.3(c)), even better consensus clustering results are obtained (in average relative φ(NMI) terms, 13.6% better than those obtained on the image modality), which indicates the existence of positive synergies between both modalities on this data collection.

And finally, if the λc^image, λc^speech and λc^image+speech consensus clustering solutions are combined (figure 5.3(d)), fairly good clusterings are obtained, especially when the CSPA, ALSAD and KMSAD consensus functions are employed (in these cases, relative φ(NMI) differences with respect to the image and image+speech modalities below 5% are obtained). For the remaining consensus functions, the intermodal consensus clustering λc attains lower φ(NMI) scores, thus constituting a trade-off between the consensus clusterings of the unimodal and fused modalities.

Figure 5.4 presents the results of the same process, but executed on the multimodal cluster ensemble E created by means of the direct-cos-i2 clustering algorithm. We can observe a very similar behaviour to the one just reported. Here, however, the intermodal consensus clustering solution λc (see figure 5.4(d)) is, in some cases (e.g. when consensus is based on CSPA), better (from 3.1% to 13.9% in relative percentage φ(NMI) differences) than any of its unimodal and multimodal counterparts (figures 5.4(a) to 5.4(c)).

The quality of the unimodal, multimodal and intermodal consensus clustering solutions obtained by applying the multimodal DHCA to the cluster ensemble generated upon the graph-cos-i2 clustering algorithm of the CLUTO toolbox is presented in figure 5.5. In this case, a larger performance uniformity among consensus functions is observed, at least as far as the image and image+speech modalities are concerned (figures 5.5(a) and 5.5(c)). On the other hand, the consensus clusterings obtained upon the multimodal representation



Figure 5.4: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

Figure 5.5: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

of the data (λc^image+speech) attain a higher quality than any of their unimodal counterparts (16.1% better in relative terms). Moreover, intermodal consensus (figure 5.5(d)) gives rise to clusterings that, at best, are comparable to the partitions obtained on the multimodal modality (e.g. ALSAD and KMSAD, where average relative φ(NMI) losses of 3% are observed) and, in the worst cases, constitute a trade-off between the combined modalities.

Last, notice that fairly similar results are obtained when this consensus process is applied on the cluster ensemble created by compiling the partitions output by the rb-cos-i2 CLUTO clustering algorithm (see figure 5.6). Once more, the comparison of the consensus clusterings obtained on the unimodal and multimodal modalities (figures 5.6(a) to 5.6(c)) reveals the superiority of the latter on this data collection. When these three modalities are fused in the final consensus stage of the DHCA, the intermodal consensus clustering solutions λc yielded by the CSPA, ALSAD, KMSAD and SLSAD consensus functions attain φ(NMI) values only 0.9% to 7.1% worse than those attained on the multimodal data representation, as depicted in figure 5.6(d).

Figure 5.6: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the IsoLetters data set (panels: (a) image modality, (b) speech modality, (c) multimodal image+speech, (d) intermodal).

While the boxplots depicted in figures 5.3 to 5.6 provide the reader with a qualitative though partial vision of the results of unimodal, multimodal and intermodal consensus

clustering, it is necessary to conduct a more quantitative and generic analysis across the<br />

experiments conducted on the |dfA| = 28 cluster ensembles created using all the clustering<br />

algorithms from the CLUTO toolbox upon the four multimodal data collections employed<br />

in this work.<br />

Unimodal and multimodal consensus clustering vs their cluster ensembles<br />

Firstly, we have evaluated the quality of the two unimodal (λc^mod_1 and λc^mod_2) and the multimodal (λc^mod_1+mod_2) consensus clusterings with respect to the cluster ensembles they have been created upon.

In order to evaluate how the consensus clusterings compare to their associated cluster ensembles, we have computed the percentage of cluster ensemble components that attain a higher φ(NMI) than the evaluated consensus clustering. Quite obviously, the smaller this percentage, the higher the robustness to clustering indeterminacies achieved. The results of this analysis are presented in table 5.2.
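For clarity (an illustration of ours, not part of the original evaluation code), this percentage can be computed as follows, where φ(NMI) is evaluated against the ground truth labels of the data set.

from sklearn.metrics import normalized_mutual_info_score as nmi

def pct_components_above(consensus_labels, ensemble, ground_truth):
    """Percentage of ensemble components whose phi(NMI) w.r.t. the ground truth
    exceeds that of the evaluated consensus clustering (lower is better)."""
    consensus_score = nmi(ground_truth, consensus_labels)
    better = sum(1 for labels in ensemble if nmi(ground_truth, labels) > consensus_score)
    return 100.0 * better / len(ensemble)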

Care must be taken in analyzing the presented percentages, due to the fact that the two<br />

unimodal and the multimodal consensus clusterings have been created upon different cluster<br />

ensembles, which makes comparisons across columns (i.e. across consensus functions) fair,<br />

but the same does not hold for comparisons across rows (i.e. across consensus clusterings).<br />

If the performance of the seven consensus functions is contrasted, it can be observed that EAC and HGPA yield, in most cases, the worst results, as the consensus clusterings they yield (either unimodal or multimodal) are, on average, worse than 76.9% and 78.8% of the components of the cluster ensemble they are created upon, a percentage that goes below 30% in the case of the best performing consensus functions (CSPA, ALSAD and KMSAD). These results confirm that there may exist great differences between consensus functions as far as the quality of the consensus clusterings is concerned, so care must be taken when choosing which ones are applied.

If averages are taken for summarization purposes, unimodal consensus clusterings are better than 49.2% of their corresponding cluster ensemble components, while this percentage rises to 56.5% when multimodal consensus clusterings are considered.


                                             Consensus function
Data set       Consensus type            CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
IsoLetters     λc^image                  21.1   60.1   63.4   40.9   20.4    26.6    40.2
               λc^speech                 43.9   91.5   79.4   47.4   18.3    34.6    77.4
               λc^image+speech           17.1   51.8   59.9   34.1   16.7    25.0    39.1
CAL500         λc^audio                  16.5   86.3   67.7   29.9    9.9    12.0    70.3
               λc^text                   29.7   83.5   52.6   59.3   32.6    30.5    75.7
               λc^audio+text             21.9   94.9   54.8   48.0   16.9    20.3    61.9
InternetAds    λc^object                 47.8   62.2   99.1   44.7   39.8    40.5    61.6
               λc^collateral             55.3   52.4   99.4   40.9   45.1    37.6    59.6
               λc^object+collateral      41.6   64.5   99.7   29.9   29.2    34.6    37.3
Corel          λc^image                  12.7   93.8   96.2   50.0   19.1    26.7    79.1
               λc^text                   10.4   89.2   88.1   43.7   28.8    23.7    77.8
               λc^image+text              7.5   92.2   85.1   40.2   10.9    20.2    64.1

Table 5.2: Percentage of cluster ensemble components that attain a higher φ(NMI) than the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions.



The second analysis consists of comparing the unimodal and multimodal consensus clusterings with the cluster ensemble components of maximum and median φ(NMI) (which we call best and median ensemble components, or BEC and MEC for short). Taking these two cluster ensemble components as a reference, we have computed i) the percentage of experiments in which the evaluated consensus clustering attains a higher φ(NMI), and ii) the relative percentage φ(NMI) differences between them and the evaluated consensus clustering. The results of this analysis are presented, per data collection and consensus function, in tables 5.3 and 5.4, where evaluation is referred to the BEC and the MEC, respectively.

It can be observed that, as already noticed in previous experiments, the EAC and HGPA consensus functions perform notably worse than the remaining ones. On average, unimodal consensus clusterings are better than their corresponding BEC in 6.5% of the experiments, whereas the multimodal consensus clustering solutions attain a higher φ(NMI) than the BEC in 8.7% of the cases (see table 5.3). If this comparison is made in terms of the relative percentage φ(NMI) differences, we see that, on average, the unimodal consensus clusterings are 33.5% worse than the BEC, while this percentage is reduced to 28.2% when the multimodal consensus clusterings are considered.

If the median ensemble component is taken as a reference (see table 5.4), we observe that unimodal consensus clusterings are better than the MEC in 54% of the experiments conducted. In contrast, when the multimodal consensus clustering solution is considered, superiority with respect to the MEC is obtained in 62.3% of the cases. If the MEC and the consensus clusterings are compared in terms of relative percentage φ(NMI) differences, we see that unimodal consensus yields clusterings which are 21.1% better than the MEC, while this percentage rises to 39.6% in the case of multimodal consensus.

Thus, a conclusion we can draw at this point is that, in view of the results just reported,<br />

the execution of consensus processes on multimodal cluster ensembles yields better quality<br />

consensus clusterings than those obtained upon cluster ensembles based on a single modality,<br />

which somehow constitutes a claim in favour of early fusion techniques. However, this<br />

statement must be made with caution, as it is not supported by evidence in all the data<br />

collections (for instance, the CAL500 collection constitutes an exception to this rule).<br />

Multimodal vs unimodal consensus clustering

Secondly, we have compared the quality of the multimodal consensus clusterings (that is, λ_c^mod1+mod2) with the quality of their unimodal counterparts (λ_c^mod1 and λ_c^mod2). Again, this comparison has been made in terms of the percentage of experiments in which the former attains a higher φ(NMI) than the latter, and the relative percentage φ(NMI) differences between them, taking the unimodal consensus clusterings as a reference. The results are presented in table 5.5.

In average terms, the multimodal consensus clustering λ_c^mod1+mod2 is better than its two unimodal counterparts in 53.3% of the experiments conducted, and worse than both of them in only 13% of the cases. If the φ(NMI) differences between these consensus clusterings are measured, we observe that multimodal consensus yields partitions that, in average relative percentage φ(NMI) terms, are 4802.4% better. Although this large figure is mainly caused by two outliers (on the InternetAds data set using the ALSAD and KMSAD consensus functions), the results presented in table 5.5 show an overwhelming majority of positive Δφ(NMI) values, which reinforces the notion that multimodal consensus processes, when compared to their unimodal counterparts, constitute, in most cases, a better option.

Intermodal vs unimodal and multimodal consensus clustering

Furthermore, we have also investigated whether the combination of unimodal and multimodal consensus clusterings (i.e. the execution of intermodal consensus processes) can lead to better partitions.

For this reason, table 5.6 presents the detailed results corresponding to the comparison of the intermodal consensus clustering solutions with respect to their unimodal and multimodal counterparts, across all the data sets and consensus functions. Once again, such comparison is twofold, as it takes into account the percentage of experiments in which intermodal consensus is better than unimodal and multimodal consensus, and the relative percentage φ(NMI) differences between them (taking the unimodal and multimodal consensus clusterings as a reference).

If averages across data collections and consensus functions are taken, the following results are obtained: when compared to the unimodal consensus clusterings, intermodal consensus clusterings are better in 59.5% of the experiments conducted, attaining an average relative φ(NMI) gain of 2821.7% with respect to them. That is, intermodal consensus clusterings are, in general terms, superior to their unimodal counterparts. Possibly the clearest exceptions to this rule are found in the audio modality of the CAL500 data set and the image modality of the Corel collection.

However, intermodal consensus clusterings are superior to their multimodal counterparts in just 34.7% of the cases, reaching a quality that, measured in average relative φ(NMI) percentage terms, is 65.5% better. Thus, as already suggested by the boxplot charts presented in figures 5.3 to 5.6, intermodal consensus clusterings tend to become a trade-off between the multimodal and unimodal consensus clustering solutions they are based on.

Furthermore, if the quality of the intermodal consensus clustering is contrasted to that of the cluster ensemble it is created upon (that is, the one compiling both unimodal and multimodal clusterings), we obtain that it is better than 52.9% of its components (recall that this percentage was 49.2% and 56.5% when referred to the unimodal and multimodal consensus clusterings), which reinforces the notion that, in general terms, intermodal consensus is a trade-off between its unimodal and multimodal counterparts.

Notice that quite different situations are found among the data sets used in this experiment. For instance, the intermodal consensus clustering is clearly inferior to its multimodal counterpart on the IsoLetters data set, whereas quite the opposite is observed on the InternetAds collection. Therefore, we consider that creating an intermodal consensus clustering is a fairly generic way of proceeding, as sometimes it can be advantageous to combine unimodal and multimodal consensus clusterings. Its occasionally inferior quality (when compared to either its unimodal or multimodal counterparts) can be compensated by the consensus self-refining procedure presented in section 5.2. The results of applying it on the intermodal consensus clustering λ_c are described in the following section.


Data set      Consensus type           Measure    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters    λ_c^image                %exp        3.6     0       0       0      17.9     5       3.6
                                       Δφ(NMI)    -7.3   -42.9   -39.2   -18.8   -10.1    -9.1   -22
              λ_c^speech               %exp        0       0       0       0      10.7     7.1     3.6
                                       Δφ(NMI)   -17.3   -59.8   -37.5   -20.3    -4.8   -10.2   -42.1
              λ_c^image+speech         %exp       10.7     0       0       0       7.1     8.6     0
                                       Δφ(NMI)    -5.8   -33.2   -38.2   -16.1    -8.6    -8.5   -16.2
CAL500        λ_c^audio                %exp        3.5     0       0       1.4    25       7.8     0
                                       Δφ(NMI)    -9.4   -50.5   -37.5   -17      -5.8    -7     -37.3
              λ_c^text                 %exp       28.6     0      13.5     1.4    21.4    26.4     7.1
                                       Δφ(NMI)    -4.1   -24.2   -11.7   -14.9    -5.5    -4.6   -23.1
              λ_c^audio+text           %exp       21.4     0       5.7     0.7    28.6    21.4     3.6
                                       Δφ(NMI)    -4.3   -44.8   -18.1   -19.6    -4.2    -6.1   -23.3
InternetAds   λ_c^object               %exp        0       3.6     0       0       0       0       0
                                       Δφ(NMI)   -45.1   -72.8   -99.9   -45.4   -43.3   -41.7   -61.9
              λ_c^collateral           %exp        0       0       0       0       3.6     4.3     0
                                       Δφ(NMI)   -81.7   -80.2   -99.9   -70.5   -70.3   -64.8   -87.7
              λ_c^object+collateral    %exp        0       0       0      25       3.5     3.6     3.6
                                       Δφ(NMI)   -41     -82.5   -99.9   -38.6   -34.7   -42.3   -42.8
Corel         λ_c^image                %exp       25       0       0       3.6    39.3     9.3     3.5
                                       Δφ(NMI)    -2     -56.7   -43.5   -28.9    -4.8    -7.1   -28.8
              λ_c^text                 %exp       35.7     0       5.7     1.4    14.3    15      10.7
                                       Δφ(NMI)    21.3   -56.3   -34.3   -24.6    -9.5    -8.4   -34.2
              λ_c^image+text           %exp       39.3     0       0      11.4    32.1    10       7.1
                                       Δφ(NMI)     6.6   -60.6   -40.7   -24.4    -4.9    -8.2   -29.4

Table 5.3: Evaluation of the unimodal and multimodal consensus clusterings with respect to the best cluster ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the evaluated consensus clustering is superior to the BEC, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



Data set      Consensus type           Measure    CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters    λ_c^image                %exp       92.3    21.4    22.9    85     100      90.7    67.9
                                       Δφ(NMI)    80     -10.7     5.2    45.9    56.3    69.3    35.1
              λ_c^speech               %exp       89.3     3.5     0      75.7   100      98.6     7.2
                                       Δφ(NMI)     5.6   -47.4   -20.1     0.7    23.6    17     -23.4
              λ_c^image+speech         %exp      100      50      42.1    91.4   100      92.9    85.7
                                       Δφ(NMI)    96.3    11.2    13.8    64      70.4    83.9    63
CAL500        λ_c^audio                %exp      100       3.6    12.9    86.4   100      99.3    21.4
                                       Δφ(NMI)    24.4   -31.9   -15.4    12      28.8    27.8   -14.1
              λ_c^text                 %exp       78.6    14.3    45      39.3    75      73.6    28.6
                                       Δφ(NMI)     9.9   -13.3     1.1    -2.4     8.4     9.5   -11.8
              λ_c^audio+text           %exp       96.4     3.5    41.4    55.7    85.7    92.9    35.7
                                       Δφ(NMI)    16.3   -33.2    -0.6    -2.9    16.4    14      -6.8
InternetAds   λ_c^object               %exp       71.4    32.1     0      77.1    67.9    68.6    46.2
                                       Δφ(NMI)    64.5   110.2   -99.9    30.4    69     115.4    93.6
              λ_c^collateral           %exp       50      50       0      61.4    57.1    67.9    21.4
                                       Δφ(NMI)   111.7   173.5   -99.7   109.2   143.8   184     -33.5
              λ_c^object+collateral    %exp       71.4    25       0      78.6    82.1    66.4    67.9
                                       Δφ(NMI)   236.9   -13     -99.4   137.5   186.1   202.9    14.6
Corel         λ_c^image                %exp      100       0       0      52.1    96.4    85      14.3
                                       Δφ(NMI)    23.4   -46.7   -33.1   -17.3    18.7    16.5    -9.4
              λ_c^text                 %exp       96.4    10.7     8.6    62.9    85.7    87.9    21.4
                                       Δφ(NMI)    48.8   -45.7   -19.7    -8.2    13.4    14.9   -16.9
              λ_c^image+text           %exp      100       3.6     0      63.6   100      95.7    17.9
                                       Δφ(NMI)    51     -45.7   -21.5    -3.3    30.7    27.4    -0.2

Table 5.4: Evaluation of the unimodal and multimodal consensus clusterings with respect to the median cluster ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the evaluated consensus clustering is superior to the MEC, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



Data set      Consensus type    Measure    CSPA     EAC      HGPA     MCLA     ALSAD     KMSAD     SLSAD
IsoLetters    λ_c^image         %exp        92.9     92.9     93.6     92.9     92.9      92.9      85.7
                                Δφ(NMI)     10.7     37.4     11.2     14.3     10.7       9.7      20.1
              λ_c^speech        %exp        96.4     92.9     95       95.7     85.7      92.9      92.9
                                Δφ(NMI)     81.1    177.7     56.5     70.7     54.3      65.9     155.4
CAL500        λ_c^audio         %exp         7.1     10.7     30.7      2.1      3.6       5        25
                                Δφ(NMI)    -26      -21.7     -7.3    -31.4    -28.7     -29.4     -11.2
              λ_c^text          %exp       100       17.9     85       79.3     96.4      93.6      82.1
                                Δφ(NMI)     19.9    -12.3     11.8     13.8     22.2      18.7      23.4
InternetAds   λ_c^object        %exp        64.3     35.7     93.6     62.1     64.3      59.3      71.4
                                Δφ(NMI)     80      234     1616      348    53833      1454      1270
              λ_c^collateral    %exp        67.9     50       78.6     73.6     82.1      64.3      78.6
                                Δφ(NMI)   1690      160      -10      910     3750    200660      1030
Corel         λ_c^image         %exp        85.7     28.6     42.9     65.7     60.7      61.4      53.6
                                Δφ(NMI)     -2.5     -0.8     96.3     12.9     -5.9      -5.7      -6.7
              λ_c^text          %exp        89.3     96.4     88.6     91.4     96.4      96.4      96.5
                                Δφ(NMI)    142.9    149.9    129.9    165.8    169.6     151.7     192

Table 5.5: Evaluation of the multimodal consensus clusterings with respect to their unimodal counterparts, across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the multimodal consensus clustering is superior to the unimodal consensus clusterings, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



Data set      Consensus type           Measure    CSPA     EAC      HGPA     MCLA     ALSAD     KMSAD      SLSAD
IsoLetters    λ_c^image                %exp        92.9     42.9      1.4      8.6     89.3      86.4       46.4
                                       Δφ(NMI)      9       -5.2    -24.3    -12.1      8.9      10.1        3
              λ_c^speech               %exp        96.4     89.3     52.9     87.1     82.1      92.9       92.9
                                       Δφ(NMI)     78.2     90.2      6.6     30.5     49.6      59.5      114.2
              λ_c^image+speech         %exp        17.9      3.6      0        0.7     17.9      12.1        7.1
                                       Δφ(NMI)     -1.4    -28.6    -31.8    -22.5     -0.3       1.8      -12.7
CAL500        λ_c^audio                %exp         7.1     17.9      8.6      1.4      7.1       7.2       28.6
                                       Δφ(NMI)    -23.8    -10.1    -19.7    -31.9    -23.7     -25.8       -8.9
              λ_c^text                 %exp        96.4     53.6     32.9     80.7    100        99.3       89.3
                                       Δφ(NMI)     25        0.5     -3.4     13.8     30.7      24.2       28
              λ_c^audio+text           %exp        75       78.6     11.4     40.7     75        72.1       57.1
                                       Δφ(NMI)      4.2     17.3    -13.1      3        7.3       5          6.8
InternetAds   λ_c^object               %exp        57.1     57.2     90       45.7     57.1      53.6       42.9
                                       Δφ(NMI)      2        3      -10      104    44152       465        -38
              λ_c^collateral           %exp        85.7     82.1     70.7     78.6     92.9      71.4       64.3
                                       Δφ(NMI)   1250       60      -30      940     4880    105030         60
              λ_c^object+collateral    %exp        42.9     67.9     75       30       71.4      55.7       32.1
                                       Δφ(NMI)    479.1     41.7    -19.2     59.4     41.1    1322.6       22.2
Corel         λ_c^image                %exp        75       10.7      3.6      7.1     17.9       6.4        3.6
                                       Δφ(NMI)     -0.7    -17.1    -24.4    -22.4    -10.8     -11.6      -18
              λ_c^text                 %exp       100       92.9     91.4     84.3    100       100        100
                                       Δφ(NMI)    145.9    136.8     30.1     83.5    153.8     143.4      157.9
              λ_c^image+text           %exp        14.3     46.4     14.4      5.7     14.3      14.2       17.9
                                       Δφ(NMI)      6.7      1.1    -38.2    -25.2      3.2       7.5       -1.6

Table 5.6: Evaluation of the intermodal consensus clustering with respect to the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the intermodal consensus clustering is superior to the unimodal and multimodal consensus clusterings, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.



5.3.2 Self-refined consensus clustering across modalities

In this section, we analyze the results of subjecting the intermodal consensus clustering solution λ_c to a self-refining procedure based on a cluster ensemble E, the components of which correspond to both unimodal and multimodal partitions.

As described in chapter 4, the self-refining process is based on the creation of a select cluster ensemble E_pi (containing a percentage pi of the clusterings in E), followed by the application of a consensus process (either flat or hierarchical) on it, which yields a self-refined consensus clustering λ_c^pi. In this work, the refining consensus processes are conducted by means of a flat consensus architecture, and the set of percentages employed is pi ∈ {2, 5, 10, 15, 20, 30, 40, 50, 75}.
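A minimal sketch of this refining loop is given below. The selection criterion shown, keeping the pi% of ensemble components most similar (in NMI terms) to the non-refined consensus clustering λ_c, is an assumption made here for illustration only (the actual criterion is described in chapter 4), and consensus_function is a placeholder for any of the seven consensus functions used in this work.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def self_refine(lambda_c, ensemble, consensus_function,
                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 75)):
    """Generate one self-refined consensus clustering per percentage pi.

    lambda_c           : non-refined consensus clustering (integer label vector)
    ensemble           : list of label vectors forming the cluster ensemble E
    consensus_function : callable mapping a list of labelings to a consensus labeling
    """
    # Rank ensemble components by similarity to the non-refined consensus
    # clustering (an illustrative selection criterion, not the thesis's own).
    similarity = np.array([nmi(lambda_c, comp) for comp in ensemble])
    order = np.argsort(similarity)[::-1]

    refined = {}
    for p in percentages:
        n_keep = max(1, int(round(len(ensemble) * p / 100.0)))
        select_ensemble = [ensemble[i] for i in order[:n_keep]]   # E_pi
        refined[p] = consensus_function(select_ensemble)          # lambda_c^pi
    return refined
```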

For this reason, the final consensus clustering solution λ_c^final is obtained by applying a supraconsensus selection function to the set of self-refined clusterings λ_c^pi (see section 4.1 for a description of the φ(ANMI)-based supraconsensus function employed in this work). In the following paragraphs, the performances of these two processes (i.e. the self-refining procedure and the supraconsensus function) are evaluated separately.
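Under the same assumptions as above, the φ(ANMI)-based supraconsensus selection reduces to choosing, among the non-refined and self-refined candidates, the clustering that maximizes the average NMI with respect to the components of the cluster ensemble; the sketch below is one minimal reading of the function described in section 4.1.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def anmi(labeling, ensemble):
    """Average normalized mutual information of a labeling w.r.t. an ensemble."""
    return float(np.mean([nmi(labeling, comp) for comp in ensemble]))

def supraconsensus(candidates, ensemble):
    """Select the candidate clustering (non-refined or self-refined) with the
    highest phi(ANMI); candidates maps an identifier to a label vector."""
    scores = {key: anmi(lab, ensemble) for key, lab in candidates.items()}
    best_key = max(scores, key=scores.get)
    return best_key, candidates[best_key]   # lambda_c^final
```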

As in the previous section, the results of the execution of these processes on the IsoLetters data collection are described in detail next. Again, for brevity, the presentation of the results corresponding to the CAL500, InternetAds and Corel data sets is deferred to appendix E.

Evaluation of the multimodal self-refining process

For starters, figure 5.7 depicts the results of the self-refining process using the multimodal cluster ensemble resulting from gathering the partitions output by the agglo-cos-upgma clustering algorithm of the CLUTO package. Each one of the seven boxplots presented (one per consensus function) shows, from left to right, the φ(NMI) values of the multimodal cluster ensemble E components, of the intermodal consensus clustering λ_c, and of the self-refined consensus clustering solutions λ_c^pi. The consensus clustering selected by the supraconsensus function, λ_c^final, is highlighted by a vertical green dashed line.

Notice that, regardless of the consensus function employed, there always exists at least one self-refined consensus clustering solution that attains a φ(NMI) value that is statistically significantly higher than the one achieved by the non-refined version λ_c. In some cases, as when consensus is based on CSPA (figure 5.7(a)), relatively small φ(NMI) gains are obtained, especially if they are compared with the dramatic φ(NMI) increases brought about by self-refining when, for instance, the EAC, HGPA or SLSAD consensus functions are employed (see figures 5.7(b), 5.7(c) and 5.7(g)).

As regards the ability of the supraconsensus function to select the highest quality (either refined or non-refined) consensus clustering as the final partition of the data, note that it performs reasonably well, as it mostly succeeds in selecting the clustering solution of maximum φ(NMI) as λ_c^final. A deeper analysis of the supraconsensus function performance will be presented in the next section.

Notice that the proposed intermodal consensus self-refining procedure shows a very similar behaviour in the experiments conducted on the multimodal cluster ensembles created upon the partitions output by the direct-cos-i2, the graph-cos-i2 and the rb-cos-i2 clustering algorithms from the CLUTO toolbox (see figures 5.8, 5.9 and 5.10, respectively).


[Figure 5.7 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.7: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the IsoLetters data set.

[Figure 5.8 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.8: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the IsoLetters data set.

A comprehensive evaluation of the results of the intermodal consensus self-refining procedure is presented throughout the following paragraphs.



[Figure 5.9 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.9: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the IsoLetters data set.

Data set       CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters     96.4    96.5   100      98.6   100     100     100
CAL500         89.3    92.9    99.3    97.8    67.9    83.6    82.1
InternetAds    75      57.1    46.4    98.5    78.6    90      85.7
Corel         100      89.2   100     100     100     100     100

Table 5.7: Percentage of multimodal self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart, across the four multimedia data collections and the seven consensus functions.

This analysis considers the experiments conducted upon the cluster ensembles created using the |dfA| = 28 clustering algorithms across the four multimedia data collections employed in this work.

For starters, in order to evaluate the ability of the self-refining process to create high quality partitions, we have measured the percentage of experiments in which there exists at least one self-refined consensus clustering λ_c^pi that attains a higher φ(NMI) than its non-refined counterpart λ_c. The results per data set and consensus function, which are presented in table 5.7, reveal that self-refining is capable of yielding a beneficial effect in a large majority (an average of 90.2%) of the experiments conducted. This figure is of the same order of magnitude as the one obtained in the consensus-based unimodal self-refining experiments presented in section 4.2.1, which indicates that multimodality does not constitute an obstacle as far as the performance of the proposed self-refining procedure is concerned.


[Figure 5.10 appears here: seven φ(NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each showing, on a 0 to 1 scale, the ensemble E, the intermodal consensus clustering λ_c and the self-refined clusterings λ_c^2 through λ_c^75.]

Figure 5.10: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the IsoLetters data set.

Moreover, so as to evaluate the quality improvement that the proposed self-refining procedure is able to introduce, we have computed the relative φ(NMI) gain between the top quality self-refined consensus clustering and its non-refined counterpart, measured in those experiments where there exists a self-refined consensus clustering superior to the non-refined version (i.e. 90.2% of the total). Table 5.8 presents the results corresponding to each data collection and consensus function. It can be observed that φ(NMI) gains over 10% are consistently obtained in most cases; again, these results are of comparable magnitude to those obtained in the unimodal scenario (see section 4.2.1). Notice that the results obtained on the InternetAds data set stand out among the rest, as relative percentage φ(NMI) gains of the order of 10^3 to 10^5 are observed. These extremely large figures are due to the extremely low quality of the consensus clusterings available prior to refining on this collection, which transforms any φ(NMI) increase caused by self-refining into a huge figure when expressed in relative percentage difference terms referred to the non-refined clustering. In the 9.8% of the experiments in which none of the self-refined consensus clusterings (λ_c^pi) attains a higher quality than the non-refined one (λ_c), the difference between the top quality λ_c^pi and λ_c is an average relative percentage φ(NMI) loss of 19%.

Besides comparing the top quality self-refined consensus clustering solution with its non-refined counterpart, we have also contrasted its quality with respect to the clusterings that make up the cluster ensemble.

Firstly, we have computed the percentage of cluster ensemble components that attain a higher φ(NMI) score (with respect to the ground truth) than that of the top quality self-refined consensus clustering, as this figure constitutes a fairly clear indicator of how it compares to the cluster ensemble it is created upon (see table 5.9). In average terms, the top quality self-refined consensus clustering is better than 78.3% of the cluster ensemble components, which is a sign of notable robustness to the indeterminacies of multimodal clustering. Moreover, recall that this percentage was 52.9% prior to self-refining, which again provides evidence of the benefits of the proposed consensus self-refining procedure.


Data set        CSPA      EAC       HGPA      MCLA     ALSAD    KMSAD    SLSAD
IsoLetters       8.7       93        121.1     31.3     23.5     16       35.4
CAL500          14.1       31.1       57.4     22.3     13.8     18.6     19.4
InternetAds  26284      12207     467710     1991.5   1686.8   1788.5   1742.3
Corel           11.1       64.2      177.1     42.2     32       33.6     44.7

Table 5.8: Relative φ(NMI) percentage gain of the top quality self-refined consensus clustering solution with respect to its non-refined counterpart, across the four multimedia data collections and the seven consensus functions.

Data set       CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters      2.2    19.9    16.8     5.4     0.5     1.7     8.7
CAL500         14.1    61.6    17.6    17.1    15.4    14.4    41.2
InternetAds    16.9    35.9    94.1     8.3    18.1    17.6    38.5
Corel           0.9    63.4    29.5     4.9     0.1     1.2    40.9

Table 5.9: Percentage of the cluster ensemble components that attain a higher φ(NMI) score than the top quality self-refined consensus clustering solution, across the four multimedia data collections and the seven consensus functions.

Furthermore, we have compared the top quality self-refined consensus clustering with the highest and median φ(NMI) components of the multimodal cluster ensemble E, referred to as BEC (best ensemble component) and MEC (median ensemble component), respectively. Using the quality of these two components as a reference, we have evaluated i) the percentage of experiments where the maximum φ(NMI) consensus clustering solution attains a higher quality than that of the BEC and MEC, and ii) the relative percentage φ(NMI) variation between them and the top quality consensus clustering solution. Again, all the results presented correspond to an average across all the experiments conducted upon the cluster ensembles generated using the twenty-eight clustering algorithms from the CLUTO toolbox employed in this work.

Table 5.10 presents, for each data set and consensus function, the percentage of experiments in which the top quality consensus clustering (either non-refined or refined) attains a higher quality (i.e. a higher φ(NMI)) than the cluster ensemble component of maximum quality (or BEC). In each cell of the table, the percentage of experiments in which the non-refined consensus clustering reaches a higher φ(NMI) score than the BEC is shown in brackets. By doing so, the effect of the self-refining process can be evaluated at a glance. Notice, then, that the equality of the bracketed and the unbracketed figures shown in any cell indicates that none of the refined consensus clusterings attains a higher quality than its non-refined counterpart.

On average, a self-refined consensus clustering better than the BEC is obtained in 19.1% of the experiments conducted, whereas this figure was as low as 1.6% prior to self-refining. Notice that, depending on the data collection, quite diverse results are obtained (i.e. consensus self-refining seems to achieve a greater level of success when applied on the IsoLetters and Corel collections than on the CAL500 and InternetAds data sets).


Data set       CSPA     EAC    HGPA   MCLA    ALSAD   KMSAD   SLSAD
IsoLetters     46.4      3.6    0     26.4    64.3    51.4    25
               (3.5)    (0)    (0)    (0)     (7.1)   (5)     (0)
CAL500          7.1      0      0      0       3.6     2.9     3.5
               (0)      (0)    (0)    (0)     (3.6)   (2.9)   (3.5)
InternetAds     3.6      0      0      0       0       1.4     0
               (0)      (0)    (0)    (0)     (0)     (0)     (0)
Corel          75        0      0     45      92.9    72.9    10.7
              (14.3)    (0)    (0)    (0)     (7.1)   (0)     (0)

Table 5.10: Percentage of experiments in which the best (either non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The percentages prior to self-refining are shown in brackets.

Furthermore, notice the aforementioned differences between the results offered by the distinct consensus functions.

Data set       CSPA    EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
IsoLetters      4.2    11.2    –      2.4    8.2     6.9     8.3
               (2.1)   (–)    (–)    (–)    (5.7)   (3.2)   (–)
CAL500          5.6     –      –      –     10.4     3.2     3.3
               (–)     (–)    (–)    (–)   (10.4)   (3.2)   (3.3)
InternetAds     6.2     –      –      –      –       2.9     –
               (–)     (–)    (–)    (–)    (–)     (–)     (–)
Corel           6.9     –      –      2.1    6.1     5.4    13.7
               (4.2)   (–)    (–)    (–)    (1.1)   (–)     (–)

Table 5.11: Relative φ(NMI) percentage difference of the top quality (either non-refined or self-refined) consensus clustering solution with respect to the best ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The relative φ(NMI) percentage differences prior to self-refining are shown in brackets.

Moreover, we have computed the relative percentage φ(NMI) gains between the top quality (non-refined or refined) consensus clustering solution and the BEC (limited to those experiments in which the former is superior to the latter), obtaining the results presented in table 5.11. If the φ(NMI) gains are averaged across all the experiments conducted, a 6.3% relative percentage φ(NMI) increase is obtained. Again, the percentages corresponding to the same magnitude measured prior to refining are presented in brackets in each cell of the table, attaining an average φ(NMI) gain of 4.1%. That is, self-refining not only gives rise to a larger number of clusterings better than the BEC, but it also increases the φ(NMI) gains with respect to it. However, in those experiments in which the top quality (non-refined or refined) consensus clustering attains a φ(NMI) score which is lower than that of the BEC (i.e. in 80.9% of the total), its quality is 28.2% lower, measured in averaged relative percentage φ(NMI) terms.


5.3. Multimodal consensus clustering results<br />

Data set       CSPA     EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters    100       92.9   100      99.3   100     100     100
             (96.4)    (21.4)   (2.9)  (82.9) (100)   (100)   (96.4)
CAL500        100       21.4    97.1    99.3   100     100      71.4
             (100)      (7.1)  (15)    (40)   (100)    (93.6)  (32.1)
InternetAds    96.4     60.7     2.9   100      85.7    92.1    60.7
              (75)     (32.1)   (0)    (72.9)  (75)    (66.4)  (17.9)
Corel         100       14.3    91.4   100     100     100      60.8
             (100)     (14.3)  (10)    (15.7)  (89.3)  (85)    (17.9)

Table 5.12: Percentage of experiments in which the top quality (either non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The percentages prior to self-refining are shown in brackets.

As regards the comparison with the median quality cluster ensemble component (or MEC), its quality is surpassed by the top quality consensus clustering in an average of 83.8% of the experiments conducted (see table 5.12 for a detailed view of these results across data sets and consensus functions). It can be observed that, in a vast majority of cases, self-refining guarantees obtaining partitions that are better than the median clustering available in the cluster ensemble. Again, the percentage of experiments in which the non-refined intermodal consensus clustering attains a higher φ(NMI) than the MEC is presented in brackets in each cell of the table, yielding an average of 55.7%; that is, self-refining increases the chances of obtaining a partition better than the median one by almost 30 percentage points.

But better to what extent? So as to answer this question, table 5.13 presents the relative φ(NMI) percentage differences between the top quality consensus clustering and the MEC, considering only those experiments in which the former attains a higher φ(NMI) value than the latter. On average, a relative percentage gain of 142.7% is achieved, which again reinforces the notion that the proposed multimodal consensus self-refining process is, by itself, capable of yielding good quality partitions upon a previously derived consensus clustering solution. Moreover, each cell in table 5.13 shows, in brackets, the relative φ(NMI) percentage differences between the non-refined intermodal consensus clustering λ_c and the MEC. Notice how the self-refining procedure, besides yielding consensus clusterings superior to the MEC in a larger number of experiments, also increases the difference with respect to it, raising it from an average of 103.7% to the aforementioned 142.7%. In the experiments in which the top quality consensus clustering fails to attain a higher φ(NMI) value than that of the MEC (i.e. 16.2% of the total), its quality is 32.1% lower, measured in average relative percentage φ(NMI) terms.

However, notice that, in tables 5.7 to 5.13, the performance of the self-refining procedure has been evaluated taking the highest φ(NMI) self-refined consensus clustering solution as a reference. But, as aforementioned, the self-refining process generates multiple self-refined clusterings λ_c^pi using distinct percentages pi of the original cluster ensemble E. Therefore, the subsequent application of the average normalized mutual information (φ(ANMI)) based supraconsensus function is required so as to obtain the final partition of the multimodal data set, λ_c^final. As already described in chapter 4, the ability of the supraconsensus function to select the top quality consensus clustering solution is a crucial issue as regards the overall performance of the multimodal self-refining consensus clustering system.


Data set       CSPA      EAC      HGPA     MCLA     ALSAD    KMSAD    SLSAD
IsoLetters      86.6      55.6     63.9     81.1    100.3     95.2     78.8
               (76.8)    (15.4)    (2.4)   (28.1)   (65.3)   (67.8)   (34)
CAL500          33.4      29.3     31.6     28.1     28.9     32.6     16
               (18.9)     (4.4)    (7.6)    (–)      (–)      (–)      (–)
InternetAds    382.9     191.2    196.7    412      408.5    446.3    406.9
              (314.2)   (174.1)    (–)    (236.4)  (222.7)  (291.5)  (356.6)
Corel          113.5     258       40.8     35.5    106.1    106      130.9
               (83.8)    (39.7)    (2.6)    (4.6)   (58)     (54.8)  (225.5)

Table 5.13: Relative φ(NMI) percentage difference of the top quality (either non-refined or self-refined) consensus clustering solution with respect to the median ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The relative φ(NMI) percentage differences prior to self-refining are shown in brackets.

Data set       CSPA    EAC     HGPA    MCLA    ALSAD   KMSAD   SLSAD
IsoLetters     53.6    67.9    70      50.7    39.3    42.9    39.3
CAL500         64.3    35.7    17.9    10.7    21.4    46.4    21.5
InternetAds     7.1    53.6    81.4    10      21.4    12.1    35.7
Corel          57.1    57.2    70      57.1    39.2    45      50

Table 5.14: Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution, across the four multimedia data collections and the seven consensus functions.

For this reason, the following paragraphs are devoted to its evaluation.

Evaluation of the supraconsensus process

Firstly, we have evaluated the supraconsensus function in terms of the percentage of experiments in which it succeeds, i.e. it selects the highest quality consensus clustering solution as the final partition λ_c^final. The results obtained for each data set and consensus function are presented in table 5.14: the supraconsensus function performs correctly in an average of 42.1% of the experiments conducted. That is, it is able to select the best available clustering in fewer than half of the cases, which reveals (as outlined in chapter 4) that there is still room for improving the performance of supraconsensus functions.

And secondly, we have analyzed how the consensus clustering selected by supraconsensus, λ_c^final, compares to the components of the cluster ensemble it is created upon, taking again the cluster ensemble components of maximum and median φ(NMI) (respectively referred to as best and median ensemble component, or BEC and MEC for short) as a reference. Hence, we have measured the relative percentage φ(NMI) differences between the consensus clustering selected by supraconsensus and these cluster ensemble components, so as to provide the reader with a notion of the impact of the apparent lack of accuracy of the φ(ANMI)-based supraconsensus function.

The results, averaged across all the consensus functions, are presented in table 5.15.



                Relative φ(NMI) difference with respect to
Data set             BEC          MEC
IsoLetters          -17.5         53.2
CAL500              -34.7         13.2
InternetAds         -65.2        138.7
Corel               -28.9        169.1

Table 5.15: Relative φ(NMI) percentage differences between the best and median components of the cluster ensemble and the consensus clustering λ_c^final selected by supraconsensus, for the four multimedia data collections, averaged across the seven consensus functions employed.

On average, λ_c^final is, in relative percentage terms, 36.6% worse than the BEC and 93.5% better than the MEC. As expected, the price to pay for the lack of precision of the supraconsensus function is a reduction in the quality of the final clustering solution.

5.4 Discussion

In this chapter, we have proposed and experimentally explored the use of consensus clustering strategies for partitioning multimedia data collections robustly. From our viewpoint, this application constitutes a natural extension of the computationally efficient consensus architectures presented in chapter 3 and the self-refining procedures proposed in chapter 4. As mentioned earlier, the growing ubiquity of multimedia data makes the proposals put forward in the present chapter even more appealing.

Across the experiments presented in this chapter and in appendix B.2, we have observed that partitioning multimodal data sets in a robust manner is more challenging than doing so in a unimodal scenario, as the existence of multiple modalities in the data increases the already numerous indeterminacies inherent to the clustering problem.

As a means of counteracting this, modality fusion has become a recurrent issue in the multimedia data analysis literature. Indeed, assuming that combining the distinct data modalities can be of interest is quite natural, as it is an obvious way to take advantage of the constructive dependences that can be expected to exist between them. In this sense, and focusing on the clustering problem, there exist two main approaches to modality fusion: early (also known as feature level) fusion and late (classification decision) fusion.

However, our experiments have revealed that neither of these fusion strategies is, by itself, capable of ensuring robust clustering results: in some cases, feature level fusion gives rise to the best clustering results, whereas, in other cases, it simply constitutes a trade-off between modalities. For this reason, our multimodal self-refining consensus clustering architectures constitute a generic approach for partitioning multimedia data collections with a reasonable degree of robustness, with the advantage of encompassing early and late fusion techniques simultaneously.

To our knowledge, most works dealing with multimodal clustering focus on feature level fusion, deriving novel early fusion approaches for combining modalities (see section 1.4 for a brief review). However, they often disregard the fact that, in some data sets, early fusion may not be advantageous (that is, there may exist modalities that do not contribute positively to obtaining a good partition of the data, see appendix B.2). In our opinion, this constitutes one of the main strengths of our proposal, as it allows the user to employ any modality created by feature-level fusion, besides the original modalities of the data, for obtaining the final partition of the data.

This a priori positive openness entails two negative consequences: firstly, it increases the computational complexity of the consensus process, although this inconvenience can be overcome by applying the computationally efficient hierarchical consensus architectures proposed in chapter 3. The second drawback is the inclusion of the poorest modality clustering results in the consensus clustering process, but this can be alleviated by the use of the consensus-based self-refining procedures described in chapter 4, which achieve notable results when applied on the intermodal consensus clustering solutions.

In future work, the implementation of selection-based self-refining processes (see section 4.3) on multimodal cluster ensembles will be investigated, as we expect that it may yield higher quality multimodal partitions than the ones presented in this chapter. Furthermore, as already stated in chapter 4, it will be necessary to devise novel, more precise supraconsensus functions, capable of selecting the top quality consensus clustering solution with a higher degree of accuracy in an unsupervised manner.

5.5 Related publications

None of the work regarding multimedia clustering presented in this chapter has been published yet. Nevertheless, we would like to highlight the following paper, focused on applying early fusion of modalities for jointly conducting multimodal data analysis and synthesis of facial video sequences (Sevillano et al., 2009). The details of this work, published as a book chapter, are presented next.

Authors: Xavier Sevillano, Javier Melenchón, Germán Cobo, Joan Claudi Socoró and Francesc Alías
Title: Audiovisual Analysis and Synthesis for Multimodal Human-Computer Interfaces
In: Engineering the User Interface: From Research to Practice
Publisher: Springer
Editors: Miguel Redondo, Crescencio Bravo and Manuel Ortega
Pages: 179–194
Year: 2009
ISBN: 978-1-84800-135-0
Abstract: Multimodal signal processing techniques are called to play a salient role in the implementation of natural computer-human interfaces. In particular, the development of efficient interface front ends that emulate interpersonal communication would benefit from the use of techniques capable of processing the visual and auditory modes jointly. This work introduces the application of audiovisual analysis and synthesis techniques based on Principal Component Analysis and Non-negative Matrix Factorization on facial audiovisual sequences. Furthermore, the applicability of the extracted audiovisual bases is analyzed throughout several experiments that evaluate the quality of audiovisual resynthesis using both objective and subjective criteria.


Chapter 6

Voting based consensus functions for soft cluster ensembles

As outlined in section 1.2.1, clustering algorithms can be divided into two large categories, depending on the number of clusters every object is assigned to. On the one hand, hard (or crisp) clustering algorithms assign each object to a single cluster. For this reason, the result of applying a hard clustering process on a data set containing n objects is usually represented as an n-dimensional integer-valued row vector of labels (or labeling) λ, each component of which identifies to which of the k clusters each object is assigned, that is:

λ = [λ_1 λ_2 ... λ_n]    (6.1)

where λ_i ∈ [1, k], ∀i ∈ [1, n].

On the other hand, soft (or fuzzy) clustering algorithms allow the objects to belong to all clusters to a certain extent. Thus, the results of their application for partitioning a data set containing n objects into k clusters are usually represented by means of a k×n real-valued clustering matrix Λ (see equation (6.2)), the (i,j)th entry of which indicates the degree of association between the jth object and the ith cluster.

        ( λ_11  λ_12  ...  λ_1n )
    Λ = ( λ_21  λ_22  ...  λ_2n )    (6.2)
        (  ...   ...  ...   ... )
        ( λ_k1  λ_k2  ...  λ_kn )

where λ_ij ∈ R, ∀i ∈ [1, k] and ∀j ∈ [1, n].

For illustration purposes, we resort to the toy clustering example presented in chapter 2, in which clustering is conducted on the two-dimensional artificial data set presented in figure 6.1. This toy data collection contains n = 9 objects, and the desired number of clusters k is set to 3.

If the classic k-means hard clustering algorithm is applied on this data set, the label vector presented in equation (6.3) is obtained. Recall that the labels λ_i contained in λ are purely symbolic (i.e. the labelings λ = [1 1 1 2 2 2 3 3 3] or λ = [3 3 3 2 2 2 1 1 1] represent exactly the same partition of the data).
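This label-symbol invariance can be checked programmatically; for instance, any partition-comparison measure that ignores label identities, such as the NMI used throughout this thesis, scores the two labelings above as identical. A minimal check, using scikit-learn's NMI implementation as a stand-in:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

a = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
b = np.array([3, 3, 3, 2, 2, 2, 1, 1, 1])
print(nmi(a, b))   # 1.0: both labelings describe the same partition
```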

[Figure 6.1 appears here: a two-dimensional scatterplot, both axes spanning −0.2 to 0.6, showing the nine numbered objects and the cluster centroids marked with black stars.]

Figure 6.1: Scatterplot of an artificially generated two-dimensional data set containing n = 9 objects, which are represented by coloured symbols and identified by a number. The black star symbols represent the position of the cluster centroids found by the k-means algorithm.

λ = [2 2 2 1 1 1 3 3 3]    (6.3)

As regards the results of applying a fuzzy clustering algorithm on this data collection, these clearly differ depending on the way the degree of association between objects and clusters is codified. Usually, the scalar values λ_ij contained in the clustering matrix Λ represent cluster membership probabilities (i.e. the higher the value of λ_ij, the more strongly the jth object is associated with the ith cluster). For instance, this is the way the well-known fuzzy c-means (FCM) clustering algorithm codifies its clustering results (Höppner, Klawonn, and Kruse, 1999). In fact, if this algorithm is applied on the previously described artificial data set, the clustering matrix presented in equation (6.4) is obtained.
data set, the clustering matrix presented in equation (6.4) is obtained.<br />

⎛<br />

0.054 0.026 0.057 0.969 0.976 0.959 0.009 0.016<br />

⎞<br />

0.010<br />

Λ = ⎝0.921<br />

0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017⎠<br />

(6.4)<br />

0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972<br />

It can be observed that any row permutation in Λ would yield an equivalent fuzzy partition.<br />

Moreover, notice that Λ can be transformed into a hard clustering solution by simply<br />

assigning each object to the cluster with maximum membership probability.<br />

However, the degree of association between objects and clusters can be described in<br />

terms of other parameters, such as the distance of each object to the cluster centroids (such<br />

as k-means, that despite being a hard clustering algorithm, can output this information). In<br />

fact, the object to centroid1 distance matrix obtained after applying k-means on the same<br />

toy data set as before is presented in equation (6.5).<br />

⎛<br />

0.362 0.325 0.672 0.002 0.002 0.005 0.160 0.092<br />

⎞<br />

0.125<br />

Λ = ⎝0.010<br />

0.009 0.027 0.397 0.490 0.436 0.251 0.320 0.209⎠<br />

(6.5)<br />

0.170 0.202 0.445 0.090 0.125 0.162 0.002 0.005 0.002<br />

In this case, the conversion of Λ into a crisp partition requires assigning every object<br />

to the closest (i.e. minimum distance) cluster. Thus, depending on the nature of the soft<br />

1 The cluster centroids are represented by means of a black star symbol in figure 6.1<br />

164


Chapter 6. Voting based consensus functions for soft cluster ensembles<br />

Thus, depending on the nature of the soft clustering process, the interpretation of the scalar values λ_ij contained in Λ differs (i.e. if λ_ij represent membership probabilities, the larger their value, the stronger the object-cluster association, whereas the opposite interpretation should be made in case they represent object-to-centroid distances).
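Both conversions can be written compactly; the following sketch assumes Λ is stored as a k × n NumPy array. Applied to the matrix of equation (6.4), and to that of equation (6.5) with values_are_distances=True, it recovers the hard labeling of equation (6.3).

```python
import numpy as np

def soft_to_hard(soft_matrix, values_are_distances=False):
    """Convert a k x n soft clustering matrix into a hard label vector.

    If the entries are membership probabilities, each object is assigned to the
    cluster of maximum value; if they are object-to-centroid distances, to the
    cluster of minimum value. Labels are returned in the range 1..k, as in
    equation (6.1).
    """
    soft_matrix = np.asarray(soft_matrix)
    if values_are_distances:
        return soft_matrix.argmin(axis=0) + 1
    return soft_matrix.argmax(axis=0) + 1
```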

Either way, fuzzy clustering can be regarded as a version of hard clustering with relaxed<br />

object membership restrictions. Such relaxation is particularly useful when the clusters are<br />

not well separated, or when their boundaries are ambiguous. Moreover, object to cluster<br />

association information may be of help in discovering more complex relations between a<br />

given object and the clusters (Xu and Wunsch II, 2005). Furthermore, notice that soft<br />

clustering can also be regarded as a generalization of its hard counterpart, as a crisp partition<br />

can always be derived from a fuzzy one, whereas the opposite does not hold.

However, these apparent strengths of soft clustering algorithms have barely been reflected<br />

in the development of consensus functions especially devised for combining the outcomes<br />

of multiple fuzzy clustering processes, as most proposals in this area are oriented<br />

towards their application on hard clustering scenarios. Nevertheless, as described in section<br />

2.2, there exist a few works in the consensus clustering literature devoted to the derivation<br />

of consensus functions for soft cluster ensembles, such as the VMA consensus function of<br />

(Dimitriadou, Weingessel, and Hornik, 2002) or the ITK consensus function of (Punera and<br />

Ghosh, 2007). Moreover, several other consensus functions can be indistinctly applied on<br />

both hard and soft cluster ensembles, such as PLA (Lange and Buhmann, 2005) or HGBF

(Fern and Brodley, 2004), while others, originally devised for hard cluster ensembles, can be<br />

adapted for their use as soft partition combiners with relative ease (e.g. (Strehl and Ghosh,<br />

2002)).<br />

In this chapter, we make several proposals regarding the application of consensus processes<br />

on soft cluster ensembles. For starters, the notion of soft cluster ensembles is reviewed<br />

in section 6.1. Next, in section 6.2, we describe a procedure for adapting the hard consensus<br />

functions employed in this work to soft cluster ensembles. In section 6.3, a family of<br />

novel soft consensus functions based on the application of cluster disambiguation and voting<br />

strategies is proposed. Finally, the results of several experiments regarding the performance<br />

of the proposed soft consensus functions are presented in section 6.4, and the discussion of section 6.5 concludes the chapter.

6.1 Soft cluster ensembles<br />

As described in chapter 2, cluster ensembles are nothing but the compilation of the outputs<br />

of multiple (namely l) clustering processes. Focusing on a fuzzy clustering scenario, and<br />

making the simplifying assumption that the l clustering processes partition the data into k

clusters, a soft cluster ensemble E is represented by means of a kl × n real-valued matrix:<br />

E = \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \\ \vdots \\ \Lambda_l \end{pmatrix}    (6.6)

where Λi is the k × n real-valued clustering matrix resulting from the ith soft clustering<br />

process (∀i ∈ [1,l]).<br />
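As an illustration of this representation, the following Python sketch builds a kl × n soft cluster ensemble by stacking l soft clusterings of a toy data set. Since no fuzzy c-means implementation is assumed here, each soft clustering is emulated by turning k-means object-to-centroid distances into membership-like scores with a softmax; this conversion is an assumption of the sketch, not a procedure prescribed in this work.

    import numpy as np
    from sklearn.cluster import KMeans

    def soft_clustering(data, k, seed):
        """Stand-in for a fuzzy clusterer: run k-means and convert the object-to-centroid
        distances into a k x n matrix of membership-like scores (softmax over -distance)."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        d = km.transform(data).T               # k x n object-to-centroid distances
        scores = np.exp(-d)
        return scores / scores.sum(axis=0)     # each column sums to one

    rng = np.random.default_rng(0)
    data = rng.normal(size=(9, 2))             # toy data set, n = 9 objects
    l, k = 4, 3
    E = np.vstack([soft_clustering(data, k, seed) for seed in range(l)])  # equation (6.6)
    print(E.shape)                              # -> (12, 9), i.e. (k*l, n)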


Recall that the contents of each clustering matrix Λi enclosed in the soft cluster ensemble E result from the execution of a fuzzy clustering process, and that, depending on its nature, the interpretation of the scalar values that ultimately make up E may differ considerably. Thus, for conducting a consensus process on the soft cluster ensemble E, it is necessary that such values hold the same type of proportionality with respect to the degree of association between objects and clusters (i.e. they are all either directly or inversely proportional).

This prerequisite becomes more evident if an analogy between soft clustering and voting<br />

procedures is established. Such analogy is inspired by the parallelism between supervised<br />

classification processes and voting drawn in (van Erp, Vuurpijl, and Schomaker, 2002).<br />

According to this analogy, the process of fuzzily clustering an object can be regarded as an<br />

election, in which the clusterer (regarded as a voter) casts its preference for each one of the<br />

clusters (or candidates). Put that way, it becomes quite obvious that, when the results of<br />

several fuzzy clustering processes are gathered into a soft cluster ensemble with the purpose<br />

of building a consolidated clustering solution upon it, they should be directly comparable (possibly after applying some scale normalization).

Regardless of the characteristics and nature of soft cluster ensembles, it is interesting<br />

to evaluate how classic consensus functions (i.e. those originally designed to combine crisp<br />

partitions) can be applied on the fuzzy consensus clustering problem. The next section<br />

deals with this very issue.<br />

6.2 Adapting consensus functions to soft cluster ensembles<br />

The consensus functions employed so far in the experimental sections of this work (i.e.<br />

CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD) are originally designed to<br />

operate on hard cluster ensembles (see appendix A.5 for a description). Nevertheless, they<br />

can be easily adapted for combining fuzzy partitions. The key point is that all these<br />

consensus functions base their clustering combination processes on object co-association<br />

matrices (i.e. matrices the contents of which estimate the degree of similarity between<br />

objects upon the partitions contained in the cluster ensemble). Fortunately, this type of<br />

co-association matrices can be derived not only from hard cluster ensembles, but also from<br />

their soft counterparts, which makes these consensus functions easily applicable on the<br />

fuzzy consensus clustering problem. In the following paragraphs, we elaborate on this issue resorting again to the previous toy example, continuously drawing parallelisms between the hard and soft clustering scenarios. For brevity, the comparison is restricted to fuzzy clustering processes that codify object to cluster associations in terms of membership probabilities, although an equivalent study could also be formulated in case these were expressed by means of magnitudes inversely proportional to the strength of object to cluster associations, such as object to cluster centroid distances.

Consider the hard clustering solution of equation (6.3) corresponding to our clustering<br />

toy example, that is:<br />

λ = [2 2 2 1 1 1 3 3 3]

Notice that an equivalent representation of this partition can also be given by a k × n

incidence matrix Iλ (called binary membership indicator matrix in (Strehl and Ghosh,<br />

2002)), the (i,j)th entry of which is equal to 1 in the case the jth object is assigned to<br />


the ith cluster according to λ, and 0 otherwise —see equation (6.7), which presents the<br />

incidence matrix corresponding to the label vector λ of equation (6.3).

I_\lambda = \begin{pmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}    (6.7)

Notice that the information contained in Iλ is somehow comparable to the contents of<br />

a soft clustering matrix Λ, in the sense that they both express the degree of association<br />

between objects and clusters. For illustration purposes, equation (6.8) presents the fuzzy<br />

clustering matrix Λ output by the FCM clustering algorithm on the artificial data set of<br />

figure 6.1. In fact, rounding each element of this clustering matrix Λ to the nearest integer<br />

would indeed yield the incidence matrix Iλ of equation (6.7).<br />

\Lambda = \begin{pmatrix}
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010 \\
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972
\end{pmatrix}    (6.8)

The construction of object co-association matrices given the incidence matrix Iλ built<br />

upon a hard clustering solution λ is pretty straightforward, and it only requires computing<br />

some matrix products.<br />

In particular, the object co-association matrix Oλ is computed as the product between<br />

the transpose of Iλ and Iλ. The object co-association matrix corresponding to the hard<br />

clustering solution obtained on our toy clustering example is presented in equation (6.9).<br />

In fact, Oλ is an n × n adjacency matrix, the (i,j)th entry of which equals 1 if the ith

and the jth objects are placed in the same cluster, or 0 otherwise (Kuncheva, Hadjitodorov,<br />

and Todorova, 2006). The name object co-association matrix stems from the fact that the<br />

contents of Oλ indicate, from a clustering viewpoint, the degree of similarity between the<br />

n objects in the data set.<br />


O_\lambda = I_\lambda^T I_\lambda = \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}    (6.9)
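A minimal NumPy sketch of these two steps follows (illustrative only; the helper name incidence is not from the original text). It builds the incidence matrix of a label vector and the object co-association matrix of equations (6.7) and (6.9); the final comment indicates the fuzzy counterpart of equation (6.10).

    import numpy as np

    def incidence(labels, k):
        """k x n binary membership indicator matrix I_lambda of a crisp labeling (labels 1..k)."""
        labels = np.asarray(labels)
        return (np.arange(1, k + 1)[:, None] == labels[None, :]).astype(int)

    labels = [2, 2, 2, 1, 1, 1, 3, 3, 3]        # toy labeling of equation (6.3)
    I = incidence(labels, k=3)                   # equation (6.7)
    O = I.T @ I                                  # object co-association matrix, equation (6.9)
    print(O[:3, :3])                             # block of ones: objects 1-3 share a cluster

    # For a k x n soft clustering matrix Lam, the fuzzy analogue of equation (6.10) is
    #   O_fuzzy = Lam.T @ Lam, with its diagonal subsequently set to one.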

In a fuzzy clustering scenario, the object co-association matrix can easily be derived by<br />

simply multiplying the transpose clustering matrix Λ by itself. Resorting again to our toy<br />

clustering example, the resulting object co-association matrix (denoted as OΛ) is presented

in equation (6.10).<br />

It is easy to see that OΛ is indeed a fuzzy adjacency matrix, as the statistical independence of the probabilities of assigning objects i and j to the same cluster makes it possible to interpret its (i,j)th entry as the joint probability that objects i and j are placed in the same cluster by the clusterer. However, statistical independence does not hold for the elements of the diagonal of OΛ, as each object is always “co-clustered” with itself, which requires setting the elements on the diagonal of OΛ equal to 1.



O_\Lambda = \Lambda^T \Lambda = \begin{pmatrix}
0.852 & 0.861 & 0.838 & 0.076 & 0.070 & 0.080 & 0.038 & 0.075 & 0.040 \\
0.861 & 0.871 & 0.847 & 0.049 & 0.043 & 0.053 & 0.054 & 0.091 & 0.057 \\
0.838 & 0.847 & 0.824 & 0.078 & 0.073 & 0.082 & 0.050 & 0.086 & 0.053 \\
0.076 & 0.049 & 0.078 & 0.940 & 0.946 & 0.930 & 0.015 & 0.022 & 0.016 \\
0.070 & 0.043 & 0.073 & 0.946 & 0.953 & 0.937 & 0.014 & 0.021 & 0.015 \\
0.080 & 0.053 & 0.082 & 0.930 & 0.937 & 0.921 & 0.020 & 0.027 & 0.021 \\
0.038 & 0.054 & 0.050 & 0.015 & 0.014 & 0.020 & 0.953 & 0.908 & 0.949 \\
0.075 & 0.091 & 0.086 & 0.022 & 0.021 & 0.027 & 0.908 & 0.866 & 0.904 \\
0.040 & 0.057 & 0.053 & 0.016 & 0.015 & 0.021 & 0.949 & 0.904 & 0.945
\end{pmatrix}    (6.10)

Quite obviously, consensus functions based on object co-association matrices do not<br />

operate on matrices Oλ or OΛ, as they are derived upon a single clustering solution. However,<br />

the computation of object co-association matrices can easily be extended to a set of<br />

clustering solutions compiled in either a hard or a soft cluster ensemble. We start with the<br />

derivation of the object co-association matrix upon a hard cluster ensemble E containing l<br />

clusterings, represented by means of an l × n integer valued matrix:

E = \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_l \end{pmatrix}    (6.11)

In this case, the incidence matrix of the hard cluster ensemble, denoted as IEλ, is a

kl × n matrix resulting from the concatenation of the incidence matrices of the l clusterings<br />

that make up the ensemble (see equation (6.12)).<br />

I_{E_\lambda} = \begin{pmatrix} I_{\lambda_1} \\ I_{\lambda_2} \\ \vdots \\ I_{\lambda_l} \end{pmatrix}    (6.12)

As in the case of a single clustering, the object co-association matrix of the hard cluster<br />

ensemble, denoted as OEλ , can be derived upon IEλ by simple matrix multiplication, as<br />

presented in equation (6.13).<br />

O_{E_\lambda} = I_{E_\lambda}^T I_{E_\lambda}    (6.13)



Drawing a parallelism with respect to the interpretation of the object co-association<br />

matrix Oλ built upon a single clustering, it can be stated that the (i,j)th entry of OEλ<br />

indicates the proportion of clusterers that put the ith and jth objects in the same cluster.<br />

Porting this same idea to the soft clustering scenario, the soft version of the object co-association matrix OEλ (that we denote as OEΛ) is computed following an analogous procedure to the one just reported, summarized in equation (6.14).

O_{E_\Lambda} = E^T E = \begin{pmatrix} \Lambda_1^T & \Lambda_2^T & \cdots & \Lambda_l^T \end{pmatrix} \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \\ \vdots \\ \Lambda_l \end{pmatrix}    (6.14)

where E is the kl × n matrix representing the soft cluster ensemble made up by the compilation<br />

of the l fuzzy clustering matrices Λi, ∀i ∈ [1,l]. Just like in the case of the fuzzy<br />

adjacency matrix OΛ derived upon a single clustering (see equation (6.10)), it is necessary<br />

to set the elements of the diagonal of OEΛ to unity, as each object is always “co-clustered”<br />

with itself.<br />
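The ensemble-level computation can be sketched in the same way. The helpers below (illustrative names, NumPy only, not the implementation used in this work) build the object co-association matrix of a hard cluster ensemble from stacked incidence matrices, and the fuzzy one of equation (6.14) from the kl × n ensemble matrix E; the hard version is divided by l so that each entry becomes the proportion of clusterers that co-cluster a pair of objects, as described above.

    import numpy as np

    def incidence(labels, k):
        labels = np.asarray(labels)
        return (np.arange(1, k + 1)[:, None] == labels[None, :]).astype(int)

    def hard_coassociation(label_vectors, k):
        """O_E of a hard cluster ensemble: stack the incidence matrices (equation (6.12)),
        multiply as in equation (6.13) and normalise by the number of clusterings l."""
        I = np.vstack([incidence(lv, k) for lv in label_vectors])
        return (I.T @ I) / len(label_vectors)

    def soft_coassociation(E):
        """Fuzzy O_E of equation (6.14); the diagonal is forced to one because every
        object is always co-clustered with itself."""
        O = E.T @ E
        np.fill_diagonal(O, 1.0)
        return O

    # Example with two crisp labelings of the toy data set:
    O_hard = hard_coassociation([[2, 2, 2, 1, 1, 1, 3, 3, 3],
                                 [1, 1, 3, 3, 3, 3, 3, 2, 2]], k=3)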

As aforementioned, all the hard consensus functions used so far in this work employ<br />

object co-association matrices, and most of them explicitly base their consensus processes<br />

on them. This is the case of the CSPA, EAC, ALSAD, KMSAD and SLSAD consensus<br />

functions, which differ among themselves in the way the object co-association matrix OEλ is interpreted. On the one hand, some of them construe OEλ

as an object similarity matrix,<br />

obtaining the consensus partition by applying some similarity-based clustering algorithm,<br />

such as graph partitioning (in CSPA (Strehl and Ghosh, 2002)) or hierarchical clustering<br />

(EAC (Fred and Jain, 2005)).<br />

On the other hand, the so-called similarity-as-data consensus functions interpret the ith row of the object co-association matrix OEλ as a set of n new features representing the ith object, thus applying some standard clustering algorithm on it for obtaining the consensus clustering solution. The application of the single-link and average-link hierarchical clustering algorithms gives rise to the SLSAD and ALSAD consensus functions, whereas the KMSAD consensus function consists in conducting this partitioning by means of k-means (Kuncheva, Hadjitodorov, and Todorova, 2006).
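As a rough illustration of the similarity-as-data idea, the sketch below clusters the rows of an object co-association matrix with k-means, in the spirit of KMSAD; it is a simplified stand-in and not the implementation evaluated in this work.

    from sklearn.cluster import KMeans

    def kmsad_like(O, k, seed=0):
        """Treat the ith row of the object co-association matrix O as a new feature vector
        for the ith object and run k-means on those rows (labels returned in 1..k)."""
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(O) + 1

    # O can be any of the co-association matrices computed in the previous sketch.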

For its part, the HGPA consensus function considers the cluster ensemble incidence

matrix IEλ to be the adjacency matrix of a hypergraph with n vertices and kl hyperedges.<br />

The consensus clustering process is regarded as the partitioning of such hypergraph by<br />

cutting the minimum number of hyperedges, a procedure that takes the object co-association<br />

matrix OEλ as its input (Strehl and Ghosh, 2002).<br />

Finally, the MCLA consensus function tackles the consensus clustering problem as a process of clustering clusters, which also implies interpreting the cluster ensemble incidence matrix IEλ as a hypergraph adjacency matrix. Again, the algorithmic analysis² of this consensus function reveals that such procedure requires the object co-association matrix OEλ as one of its parameters (Strehl and Ghosh, 2002).

Given that all seven consensus functions employ the object co-association matrix OEλ as their main input parameter when operating on hard cluster ensembles, devising

² The Matlab source code of the CSPA, HGPA and MCLA consensus functions is available for download at http://www.strehl.com.


their version for soft cluster ensembles becomes pretty straightforward, as it simply requires<br />

substituting the object co-association matrix derived from a hard cluster ensemble OEλ by<br />

its fuzzy counterpart, OEΛ. Notice that, despite taking object co-association matrices

derived upon a soft cluster ensemble as their input, all these consensus functions output a<br />

hard consensus clustering solution λc.<br />

As mentioned in the introduction of this section, we have established an analogy between hard and soft consensus functions based on the similarities between object co-association matrices, assuming that the results of fuzzy clustering are expressed in terms of membership probabilities. However, we consider that the present study is quite generic, as it could be extended to the case in which fuzzy clustering results are expressed in any other form, by transforming them into membership probabilities prior to computing the corresponding object co-association matrices.

6.3 Voting based consensus functions<br />

In this section, we put forward a set of proposals in the shape of a family of novel consensus<br />

functions especially devised for their application on soft cluster ensembles. These consensus<br />

functions are inspired by voting strategies, which have also been a source of inspiration for

the development of systems for combining supervised classifiers (van Erp, Vuurpijl, and<br />

Schomaker, 2002), search engines (Aslam and Montague, 2001), or word sense disambiguation<br />

systems (Buscaldi and Rosso, 2007). A distinguishing factor of the consensus functions<br />

we propose in this section is that they yield fuzzy consensus clustering solutions, whereas<br />

other soft consensus functions found in the literature output crisp consensus clusterings, despite being applied on soft cluster ensembles (Punera and Ghosh, 2007).

In fact, voting is a formal way of combining the opinions of several voters into a single<br />

decision (e.g. the election of a president). Therefore, it seems quite logical that voting<br />

strategies can be readily applied for combining the outcomes of multiple decision systems,<br />

using the voting strategy that best fits the way these decisions are expressed.<br />

Given that a clusterer is an unsupervised classifier, the most natural parallelism we can<br />

establish is related to voting based supervised classifier combination (aka classifier ensembles).<br />

In this case, each classifier is a voter, the possible categories objects can be filed under<br />

are the candidates, and an election is the classification of an object (van Erp, Vuurpijl, and<br />

Schomaker, 2002). Quite obviously, the voting strategy employed for combining the votes<br />

–and thus obtain the winner of the election (i.e. the resulting classification of the object by<br />

the ensemble of classifiers)– depends on how votes are expressed, be it an assignment to a<br />

single class (i.e. single-label classification (Sebastiani, 2002)), or an either ranked or scored<br />

list of classes. The former case calls for the application of unweighted voting methods such as plurality or majority voting, whereas the latter makes it possible to apply more sophisticated

voting strategies such as confidence and ranking voting methods (van Erp, Vuurpijl,<br />

and Schomaker, 2002).<br />

Nevertheless, prior to the application of any voting strategy on soft cluster ensembles,<br />

there is a crucial problem to be solved. This has to do with the inherent unsupervised<br />

nature of clustering processes, which causes clusters to be ambiguously identified. Therefore,

it is necessary to perform a cluster alignment (or disambiguation) process before voting.<br />

Notice that this is not an issue of concern when applying voting strategies on supervised<br />


classifier ensembles, as categories (i.e. candidates) are univocally defined in that case. We<br />

elaborate on the cluster disambiguation technique employed in this work in section 6.3.1. It<br />

is important to highlight that consensus functions based on object co-association matrices<br />

circumvent this inconvenience (Lange and Buhmann, 2005), although their main drawback

is that the complexity of the object co-association matrix computation is quadratic with<br />

the number of objects in the data set (Long, Zhang, and Yu, 2005).<br />

The problem of combining the outcomes of multiple soft clustering processes by means<br />

of voting strategies implies interpreting the contents of the soft cluster ensemble as the<br />

preference of each voter (clusterer) for each candidate (cluster), as soft clustering algorithms<br />

output the degree of association of each object to all the clusters. For this reason, voting<br />

methods capable of dealing with voters’ preferences (in particular, confidence and ranking<br />

voting strategies) are the basis of our consensus functions, as they lend themselves naturally<br />

to be applied in this context. However, care must be taken as regards how these preferences<br />

are expressed, that is, whether they are directly or inversely proportional to the strength of the

association between objects and clusters (e.g. membership probabilities or distances to<br />

centroids, respectively). In section 6.3.2, we describe four voting strategies that give rise to<br />

the proposed consensus functions.<br />

6.3.1 Cluster disambiguation<br />

In this section, we elaborate on the problem of cluster disambiguation, also known as the<br />

cluster correspondence problem.<br />

As pointed out earlier, a single hard clustering solution can be expressed by multiple<br />

equivalent labeling vectors λ, due to the symbolic nature of the labels clusters are identified<br />

with. This also occurs in soft clustering, as the permutation of the rows of a clustering<br />

matrix Λ also gives rise to equivalent fuzzy partitions. Quite obviously, this cluster identification<br />

ambiguity also arises between the multiple clustering solutions compiled in a cluster

ensemble E, and thus, it becomes an issue of concern when it comes to applying voting<br />

strategies for conducting consensus clustering, given the equivalence between clusters and<br />

candidates defined by the previously described analogy with voting procedures. For this<br />

reason, our voting-based consensus functions for soft cluster ensembles make use of a cluster<br />

disambiguation technique prior to proper voting.<br />

In particular, we require from such a method the ability to solve the cluster re-labeling

problem —an instance of the cluster correspondence problem in which a one to one correspondence<br />

between clusters is considered (recall that, in this work, all the clusterings in the<br />

ensemble and the consensus clustering are assumed to have the same number of clusters,<br />

namely k).<br />

To solve the cluster re-labeling problem we make use of the Hungarian method (also<br />

known as Kuhn-Munkres algorithm or Munkres assignment algorithm) (Kuhn, 1955), a<br />

technique that makes it possible to obtain the most consistent alignment among the different clusterings

(Ayad and Kamel, 2008).<br />

Given a pair of clustering solutions with k clusters each, the Hungarian method is capable<br />

of finding, among the k! possible cluster permutations, the one that maximizes the overlap<br />

between them. In particular, such cluster permutation is applied on one of the two clustering<br />

solutions, while the other is taken as a reference. Put in probabilistic terms, the Hungarian<br />

algorithm selects the cluster permutation that best fits the empirical cluster assignment<br />


probabilities estimated from the reference clustering solution (i.e. it finds the optimal<br />

cluster permutation that yields the largest probability mass over all cluster assignment<br />

probabilities (Fischer and Buhmann, 2003)). Depending on whether the aforementioned<br />

clustering solutions correspond to hard or fuzzy partitions, cluster permutations amount to<br />

label reassignments or to row order rearrangements, respectively.<br />

The Hungarian algorithm poses the cluster correspondence problem as a weighted bipartite<br />

matching problem, solving it in O(k³) time. A beautiful analysis of its error probability

can be found in (Topchy et al., 2004). In this work, we have employed the implementation<br />

of (Buehren, 2008), which bases the cluster disambiguation process upon a measure of

the dissimilarity between the clusters of the two clustering solutions under consideration.<br />

Cluster dissimilarity is usually embodied in a k × k matrix, the (i,j)th entry of which is<br />

proportional to the degree of dissimilarity between the ith cluster of one of the clustering<br />

solutions and the jth cluster of the other one.<br />

Cluster dissimilarity can easily be derived upon the considered pair of clustering solutions,<br />

regardless of whether they are hard or fuzzy partitions, as we show next. In the crisp<br />

case, a cluster similarity matrix S λ1 ,λ 2 can be obtained by simple matrix products between<br />

the incidence matrices of both clusterings, denoted as λ1 and λ2 —see equation (6.15).<br />

S_{\lambda_1,\lambda_2} = I_{\lambda_1} I_{\lambda_2}^T    (6.15)

For illustration purposes, consider the two crisp clustering solutions of equation (6.16):<br />

λ1 = [2 2 2 1 1 1 3 3 3]
λ2 = [1 1 3 3 3 3 3 2 2]    (6.16)

The incidence matrices corresponding to λ1 and λ2 are presented in equation (6.17).<br />

I_{\lambda_1} = \begin{pmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\qquad
I_{\lambda_2} = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0
\end{pmatrix}    (6.17)

The cluster similarity matrix derived upon these two clustering solutions is the one<br />

presented in equation (6.18).<br />


S_{\lambda_1,\lambda_2} = I_{\lambda_1} I_{\lambda_2}^T = \begin{pmatrix}
0 & 0 & 3 \\
2 & 0 & 1 \\
0 & 2 & 1
\end{pmatrix}    (6.18)

Firstly, notice that Sλ1,λ2 is not a symmetric matrix (as object co-association matrices are), due to the fact that its rows and columns correspond to different entities. In fact, the (i,j)th element of Sλ1,λ2 is equal to the number of objects that are assigned to the ith cluster of λ1 and to the jth cluster of λ2, thus clearly indicating the degree of resemblance of these clusters. Deriving a cluster dissimilarity matrix from Sλ1,λ2 is pretty straightforward, given that the implementation of the Hungarian method employed in this work does not require that the cluster dissimilarity measures verify any special property as regards their scale. The result of the cluster disambiguation method applied on this toy example is the cluster correspondence vector πλ1,λ2 presented in equation (6.19).

πλ1,λ2 = [3 1 2]    (6.19)

The interpretation of the cluster correspondence vector πλ1,λ2 is that the cluster identified with the ‘1’ label in λ1 corresponds to the cluster with label ‘3’ of λ2, the cluster labeled as ‘2’ in λ1 must be aligned with the cluster with label ‘1’ of λ2, and the cluster ‘3’ of λ1 is equivalent to the cluster with label ‘2’ of λ2.

The most usual way of representing the information contained in the cluster correspondence vector πλ1,λ2 is by means of a cluster permutation matrix Pλ1,λ2. In general, Pλ1,λ2 is a k × k matrix whose entries are all zero except that the πλ1,λ2(i)-th entry of the ith row is equal to 1. The cluster permutation matrix corresponding to our toy example is presented in equation (6.20). Notice how all of its entries are zero except for the third entry of the first row (as πλ1,λ2(1) = 3), the first entry of the second row (as πλ1,λ2(2) = 1) and the second entry of the third row (as πλ1,λ2(3) = 2).

P_{\lambda_1,\lambda_2} = \begin{pmatrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0
\end{pmatrix}    (6.20)

In order to obtain the cluster permuted version of the clustering λ1, it is necessary to multiply the transpose of the cluster permutation matrix Pλ1,λ2 by the incidence matrix associated to this clustering, Iλ1, which yields the cluster permuted incidence matrix Iλ1^{πλ1,λ2}, as indicated in equation (6.21).

I_{\lambda_1}^{\pi_{\lambda_1,\lambda_2}} = P_{\lambda_1,\lambda_2}^T I_{\lambda_1}    (6.21)

In our example, the cluster permuted incidence matrix is:



I_{\lambda_1}^{\pi_{\lambda_1,\lambda_2}} = \begin{pmatrix}
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
= \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0
\end{pmatrix}    (6.22)

Therefore, assigning each object to the cluster it is most strongly associated to transforms the cluster permuted incidence matrix Iλ1^{πλ1,λ2} into the disambiguated crisp clustering λ1^{πλ1,λ2}. In equations (6.23) and (6.24), this clustering is presented alongside λ2, the clustering that has been taken as the reference of the cluster disambiguation process.

λ1^{πλ1,λ2} = [1 1 1 3 3 3 2 2 2]    (6.23)
λ2 = [1 1 3 3 3 3 3 2 2]    (6.24)

In the context of our voting-based soft consensus functions, though, cluster disambiguation<br />

is conducted on pairs of soft clustering solutions. In order to illustrate how to proceed<br />

in this case, we use a toy example that is the fuzzy version of the one just reported. For<br />

brevity, we will only consider the case in which object-to-cluster associations are expressed<br />

in terms of membership probabilities, although an analogous procedure could be devised in

the case these were expressed by means of other metrics. Therefore, given the two fuzzy<br />

partitions Λ1 and Λ2 of equation (6.25):<br />

\Lambda_1 = \begin{pmatrix}
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010 \\
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972
\end{pmatrix}
\Lambda_2 = \begin{pmatrix}
0.932 & 0.921 & 0.019 & 0.030 & 0.014 & 0.025 & 0.057 & 0.017 & 0.055 \\
0.042 & 0.025 & 0.005 & 0.011 & 0.009 & 0.006 & 0.038 & 0.972 & 0.929 \\
0.026 & 0.054 & 0.976 & 0.959 & 0.976 & 0.969 & 0.905 & 0.010 & 0.016
\end{pmatrix}    (6.25)

The cluster similarity matrix SΛ1,Λ2 is computed upon the soft clustering matrices themselves,

as described by equation (6.26).<br />

S_{\Lambda_1,\Lambda_2} = \Lambda_1 \Lambda_2^T = \begin{pmatrix}
0.143 & 0.054 & 2.878 \\
1.738 & 0.136 & 1.042 \\
0.188 & 1.845 & 0.969
\end{pmatrix}    (6.26)



The interpretation of the contents of SΛ1,Λ2 is the same as in the crisp scenario, i.e. its (i,j)th element is proportional to the similarity between the ith cluster of Λ1 and the jth cluster of Λ2. Again, transforming SΛ1,Λ2 into a cluster dissimilarity matrix is the final step before solving the weighted bipartite matching problem using the Hungarian method implementation of (Buehren, 2008), thus obtaining the cluster correspondence vector πΛ1,Λ2 of equation (6.27) (notice that this is exactly the same permutation vector of equation (6.19), as the present toy example is the fuzzy version of the former).

πΛ1,Λ2 = [3 1 2]    (6.27)

Although the interpretation of the cluster correspondence vector is equivalent in both<br />

the hard and the soft clustering scenarios (i.e. the cluster that is given the number ‘1’<br />

identifier in Λ1 corresponds to the cluster number ‘3’ of Λ2, and so on), recall that, in the<br />

fuzzy case, cluster permutations are equivalent to row order rearrangements.<br />

Consequently, in order to obtain the cluster permuted version of the fuzzy partition Λ1, it is necessary to multiply the transpose of the cluster permutation matrix PΛ1,Λ2 associated to the cluster correspondence vector πΛ1,Λ2 by the fuzzy partition Λ1 itself. As a result, the cluster permuted soft clustering Λ1^{πΛ1,Λ2} is obtained (see equation (6.28) for the pair of cluster aligned fuzzy clustering solutions of our toy example).

\Lambda_1^{\pi_{\Lambda_1,\Lambda_2}} = \begin{pmatrix}
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972 \\
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010
\end{pmatrix}
\Lambda_2 = \begin{pmatrix}
0.932 & 0.921 & 0.019 & 0.030 & 0.014 & 0.025 & 0.057 & 0.017 & 0.055 \\
0.042 & 0.025 & 0.005 & 0.011 & 0.009 & 0.006 & 0.038 & 0.972 & 0.929 \\
0.026 & 0.054 & 0.976 & 0.959 & 0.976 & 0.969 & 0.905 & 0.010 & 0.016
\end{pmatrix}    (6.28)

Given a cluster ensemble E containing a set of l soft clustering solutions, the cluster disambiguation process consists in taking one of them as a reference and applying the Hungarian method sequentially to the remaining l − 1 clustering solutions (Topchy et al., 2004). As a result, a cluster aligned version of the cluster ensemble is obtained, and voting procedures can be readily applied on it.
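A compact Python sketch of this alignment step is given below. It uses SciPy's linear_sum_assignment as the Hungarian solver (the original work relies on the Matlab implementation of (Buehren, 2008)); the cluster similarity matrix of each pair plays the role of equation (6.26), and the solver is fed its negation so that maximising similarity becomes a minimum-cost assignment. Function names are illustrative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def align(reference, clustering):
        """Permute the rows of `clustering` (k x n) so that its clusters best match those of
        `reference`. Returns the permuted matrix and the permutation found."""
        S = reference @ clustering.T              # cluster similarity, as in equation (6.26)
        _, perm = linear_sum_assignment(-S)       # Hungarian method on a dissimilarity matrix
        return clustering[perm, :], perm

    def align_ensemble(clusterings):
        """Sequentially align every soft clustering against the first one and stack them
        into the kl x n cluster-aligned ensemble."""
        reference = clusterings[0]
        aligned = [reference] + [align(reference, c)[0] for c in clusterings[1:]]
        return np.vstack(aligned)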

6.3.2 Voting strategies<br />

Once the correspondence between the k clusters of each one of the l soft clustering solutions<br />

compiled in the cluster ensemble E has been resolved and the corresponding cluster<br />

permutations have been applied, it is time to derive the consensus clustering solution upon<br />

E, a task we tackle by means of voting procedures. In this section, we describe four voting<br />

methods, which give rise to as many consensus functions.<br />

Before proceeding to their description, recall that the scalar elements of a soft cluster<br />

ensemble E are considered, from a voting standpoint, as the expression of the degree of<br />

preference of each voter (i.e. clusterer) for each candidate (cluster) in the present election<br />

(clusterization of an object). The result of the election (i.e. the consolidated clusterization<br />

of the object under consideration based upon the decisions of the l clusterers comprised<br />


in the ensemble) will depend on the voting procedure applied, which, at the same time, is<br />

conditioned by the way the voters’ preferences are expressed.<br />

Given that in fuzzy cluster ensembles voters express their preference for each and every<br />

one of the k candidates, the soft consensus functions proposed in this chapter make use of<br />

confidence and positional voting methods (van Erp, Vuurpijl, and Schomaker, 2002), which<br />

are applicable in voting scenarios in which voters grade candidates according to their degree<br />

of confidence. The former makes direct use of the specific values of the preference scores<br />

the voters emit –thus, they are sensitive to their scaling–, whereas the latter are based on<br />

ranking the candidates according to the degree of confidence expressed by the voters.<br />

As mentioned earlier, the way fuzzy clusterers express their preference for the clusters<br />

can be either directly or inversely proportional to the strength of association between objects<br />

and clusters (e.g. membership probabilities or distances to centroids, respectively). In fact,<br />

it is possible that both types of clusterings are intermingled in E, and, for this reason, the<br />

voting method must somehow be informed of this fact, so that appropriate scale or ranking<br />

transformations are applied —depending on whether a confidence or a positional voting<br />

strategy is employed.<br />

In the following sections, we present four consensus functions for soft cluster ensembles,<br />

each of which is based on a specific voting mechanism. Besides their generic description,<br />

we illustrate them by means of a toy example using the soft cluster ensemble E containing<br />

the l = 2 cluster aligned fuzzy clustering solutions presented in equation (6.28).<br />

E = \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \end{pmatrix} = \begin{pmatrix}
0.921 & 0.932 & 0.905 & 0.025 & 0.019 & 0.030 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.042 & 0.038 & 0.006 & 0.005 & 0.011 & 0.976 & 0.929 & 0.972 \\
0.054 & 0.026 & 0.057 & 0.969 & 0.976 & 0.959 & 0.009 & 0.016 & 0.010 \\
0.932 & 0.921 & 0.019 & 0.030 & 0.014 & 0.025 & 0.057 & 0.017 & 0.055 \\
0.042 & 0.025 & 0.005 & 0.011 & 0.009 & 0.006 & 0.038 & 0.972 & 0.929 \\
0.026 & 0.054 & 0.976 & 0.959 & 0.976 & 0.969 & 0.905 & 0.010 & 0.016
\end{pmatrix}    (6.29)

Notice that, in this toy example, the l = 2 clustering solutions (voters) compiled in<br />

the ensemble (Λ1 and Λ2) express object to cluster associations (their preferences for candidates)<br />

by means of membership probabilities, which makes the scalar elements of both<br />

clusterings directly comparable, thus avoiding the need for applying any scale transformations.<br />

However, this need not be the general case, and we will address how to deal with<br />

cluster ensembles containing unequal clusterings as the proposed consensus functions are<br />

presented throughout the following paragraphs.<br />

Confidence voting<br />

Consensus functions based on confidence voting methods derive the consolidated clustering<br />

solution upon the values of the confidence scores each clusterer assigns to each cluster. For<br />

this reason, a prerequisite for using these voting methods is that these confidence values are<br />

comparable in magnitude (van Erp, Vuurpijl, and Schomaker, 2002). Assuming this is true,<br />

we propose the application of the sum and product confidence voting rules, which gives rise<br />

to the following two consensus functions:<br />

– SumConsensus (SC): this consensus function is based on the application of the<br />

confidence voting sum rule, which simply consists of adding the confidence values<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Sum voting matrix ΣE
Data: k clusters, n objects

Hungarian(E)
ΣE = 0_{k×n}
for i = 1 . . . l do
    if Λi not membership probabilities then
        Probabilize(Λi)
    end
    ΣE = ΣE + Λi
end

Algorithm 6.1: Symbolic description of the soft consensus function SumConsensus. Probabilize and Hungarian are symbolic representations of the conversion of fuzzy clusterings to membership probability matrices and the cluster disambiguation procedures, respectively, while 0_{k×n} represents a k × n zero matrix.

that all the voters cast for each candidate. As a result, a k × n sum matrix ΣE is<br />

obtained, the (i,j)th entry of which equals the sum of the preference scores of assigning<br />

the jth object to the ith cluster across the l cluster ensemble components, as presented<br />

in equation (6.30).<br />

\Sigma_E = \sum_{i=1}^{l} \Lambda_i    (6.30)

where Λi refers to the ith clustering contained in the soft cluster ensemble E, once<br />

the cluster disambiguation process has been conducted.<br />

A schematic and generic description of the SumConsensus consensus function is presented<br />

in algorithm 6.1. As it can be observed, we propose transforming all the fuzzy<br />

clusterings compiled in the cluster ensemble E into membership probability matrices<br />

–which is symbolically represented by the procedure called Probabilize–, thus making<br />

the sum voting method directly applicable on them once the cluster alignment<br />

problem is solved by means of the Hungarian procedure. According to algorithm<br />

6.1, SumConsensus outputs the sum voting matrix ΣE, which can be interpreted<br />

readily as a fuzzy consensus clustering. However, it can easily be converted into a classic membership probability based fuzzy consensus clustering Λc, or a crisp consensus clustering λc, as we show in the following paragraphs (as will be seen, such postprocessing could also be included as the final step of SumConsensus).

The application of SumConsensus on the toy cluster ensemble of equation (6.29) gives<br />

rise to the sum matrix ΣE presented in equation (6.31). Notice that the execution of<br />

the Probabilize and the Hungarian procedures is not required in this case, as the<br />

l = 2 fuzzy clusterings considered express object to cluster associations by means of<br />

membership probabilities and their clusters are aligned.<br />


\Sigma_E = \begin{pmatrix}
1.853 & 1.853 & 0.924 & 0.055 & 0.033 & 0.055 & 0.071 & 0.072 & 0.072 \\
0.067 & 0.067 & 0.043 & 0.017 & 0.014 & 0.017 & 1.014 & 1.901 & 1.901 \\
0.08 & 0.08 & 1.033 & 1.928 & 1.952 & 1.928 & 0.914 & 0.026 & 0.026
\end{pmatrix}    (6.31)

Notice that the higher the value of the (i,j)th entry of ΣE, the more likely it is that the jth object belongs to the ith cluster. Of course, this is due to the fact that the l = 2

fuzzy clusterings contained in the soft cluster ensemble of our toy example express<br />

object to cluster associations by means of membership probabilities, which are directly<br />

proportional to the strength of association between objects and clusters.<br />

Moreover, notice that if each column of ΣE is divided by its L1-norm (i.e. the sum of its elements), its entries become cluster membership probabilities, and, therefore, a classic fuzzy consensus clustering solution Λc can be obtained (see equation

(6.32)).<br />

\Lambda_c = \begin{pmatrix}
0.926 & 0.926 & 0.462 & 0.027 & 0.016 & 0.028 & 0.035 & 0.036 & 0.036 \\
0.033 & 0.033 & 0.021 & 0.008 & 0.007 & 0.008 & 0.507 & 0.951 & 0.951 \\
0.041 & 0.041 & 0.517 & 0.965 & 0.977 & 0.964 & 0.458 & 0.013 & 0.013
\end{pmatrix}    (6.32)

Furthermore, notice that Λc can be transformed into a crisp consensus clustering<br />

λc by simply assigning each object to the cluster it is most strongly associated to,<br />

breaking hypothetical ties at random. Referring once more to our toy example, the<br />

crisp consensus clustering obtained by hardening Λc is the one presented in equation<br />

(6.33).<br />

λc = [1 1 3 3 3 3 2 2 2]    (6.33)

– ProductConsensus (PC): the only difference between this consensus function and<br />

SumConsensus is that the preference values per candidate are multiplied instead of<br />

added. Quite obviously, the product rule is highly sensitive to low values, which could ruin the chances of a candidate of winning the election, no matter what its other confidence

values are (van Erp, Vuurpijl, and Schomaker, 2002). Equation (6.34) presents<br />

the voting process that constitutes the core of the ProductConsensus consensus function.<br />

It is important to notice that Λi correspond to the cluster ensemble components<br />

once cluster alignment has been conducted, and matrix products are computed entrywise.<br />

As a result, the k × n product matrix ΠE is obtained.<br />

\Pi_E = \prod_{i=1}^{l} \Lambda_i    (6.34)

Algorithm 6.2 presents the schematic description of the ProductConsensus consensus<br />

function. As in the previous consensus function, we propose converting the fuzzy<br />

clusterings Λi into membership probability matrices, which makes it possible to apply the product

voting rule on them once the cluster correspondence problem has been solved by<br />

means of the Hungarian algorithm. It can be observed that ProductConsensus yields<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Product voting matrix ΠE
Data: k clusters, n objects

Hungarian(E)
ΠE = 1_{k×n}
for i = 1 . . . l do
    if Λi not membership probabilities then
        Probabilize(Λi)
    end
    ΠE = ΠE ◦ Λi
end

Algorithm 6.2: Symbolic description of the soft consensus function ProductConsensus. Probabilize and Hungarian are symbolic representations of the conversion of fuzzy clusterings to membership probability matrices and the cluster disambiguation procedures, respectively, while 1_{k×n} represents a k × n unit matrix, and ◦ represents the Hadamard (or entrywise) matrix product.

the product voting matrix ΠE as its output. However, as in the case of SumConsensus,<br />

it can be transformed into a fuzzy or a crisp consensus clustering, a process that<br />

can be included as the final step of ProductConsensus.<br />

Applying the product rule on the toy cluster ensemble of equation (6.29) yields the product matrix ΠE presented in equation (6.35).

\Pi_E = \begin{pmatrix}
0.858 & 0.858 & 0.017 & 7.5\cdot10^{-4} & 2.7\cdot10^{-4} & 7.5\cdot10^{-4} & 7.9\cdot10^{-4} & 9.3\cdot10^{-4} & 9.3\cdot10^{-4} \\
0.001 & 0.001 & 1.9\cdot10^{-4} & 6.6\cdot10^{-5} & 4.5\cdot10^{-5} & 6.6\cdot10^{-5} & 0.037 & 0.903 & 0.903 \\
0.001 & 0.001 & 0.056 & 0.929 & 0.953 & 0.929 & 0.008 & 1.6\cdot10^{-4} & 1.6\cdot10^{-4}
\end{pmatrix}    (6.35)

Dividing each column of ΠE by its L1-norm gives rise to the fuzzy consensus clustering<br />

solution Λc based on membership probabilities of equation (6.36), and assigning each<br />

object to the cluster it is most strongly associated to (breaking ties randomly) yields<br />

the crisp consensus clustering λc of equation (6.37).<br />

\Lambda_c = \begin{pmatrix}
0.997 & 0.997 & 0.235 & 8.1\cdot10^{-4} & 2.8\cdot10^{-4} & 8.1\cdot10^{-4} & 0.017 & 0.001 & 0.001 \\
0.001 & 0.001 & 0.003 & 7.1\cdot10^{-5} & 4.7\cdot10^{-5} & 7.1\cdot10^{-5} & 0.806 & 0.998 & 0.998 \\
0.002 & 0.002 & 0.762 & 0.999 & 0.999 & 0.999 & 0.177 & 1.8\cdot10^{-4} & 1.8\cdot10^{-4}
\end{pmatrix}    (6.36)

λc = [1 1 3 3 3 3 2 2 2]    (6.37)

Notice that the differences between the fuzzy consensus clusterings Λc obtained by SumConsensus and ProductConsensus (equations (6.32) and (6.36)) –due to the different way voters' preferences are combined– are lost when they are transformed into crisp consensus clusterings. A minimal code sketch of both confidence voting rules is given right after this list.
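The following NumPy sketch condenses the two confidence voting rules above (function names are illustrative, and the probabilize helper is only a hypothetical instantiation of the symbolic Probabilize step of algorithms 6.1 and 6.2, since the text does not prescribe a particular transformation). It assumes the clusterings have already been cluster aligned, and ties are broken by taking the first maximum rather than at random.

    import numpy as np

    def probabilize(clustering, proportionality="direct"):
        """Map a k x n soft clustering to column-stochastic membership probabilities.
        Inversely proportional scores (e.g. distances) are inverted before normalising."""
        scores = clustering if proportionality == "direct" else 1.0 / (clustering + 1e-12)
        return scores / scores.sum(axis=0)

    def sum_consensus(aligned):
        """SumConsensus: add the voters' preferences per cluster (equation (6.30))."""
        return np.sum([probabilize(c) for c in aligned], axis=0)

    def product_consensus(aligned):
        """ProductConsensus: multiply the voters' preferences entrywise (equation (6.34))."""
        return np.prod([probabilize(c) for c in aligned], axis=0)

    def to_fuzzy_and_crisp(votes):
        """L1-normalise each column into a fuzzy consensus clustering and harden it (labels 1..k)."""
        fuzzy = votes / votes.sum(axis=0)
        return fuzzy, fuzzy.argmax(axis=0) + 1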



Positional voting<br />


Positional (aka ranking) voting methods rank the candidates according to the confidence<br />

scores emitted by the voters. Thus, fine-grain information regarding preference differences<br />

between candidates is ignored, although problems in scaling the voters' confidence scores

are avoided —that is, positional voting is useful in situations where confidence values are<br />

hard to scale correctly (van Erp, Vuurpijl, and Schomaker, 2002).<br />

As an aid for describing the positional voting methods that constitute the core of our<br />

consensus functions, equation (6.38) defines Λi (the ith component of the soft cluster ensemble<br />

E) in terms of its columns, represented by vectors λij (∀j =1,...,n).<br />

E = \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \\ \vdots \\ \Lambda_l \end{pmatrix} \quad \text{where} \quad \Lambda_i = \begin{pmatrix} \lambda_{i1} & \lambda_{i2} & \cdots & \lambda_{in} \end{pmatrix}    (6.38)

In this work, we propose employing two positional voting strategies (namely the Borda<br />

and the Condorcet voting methods) for deriving the consensus clustering solution, which<br />

gives rise to the following consensus functions:<br />

– BordaConsensus (BC): the Borda voting method (Borda, 1781) computes the mean<br />

rank of each candidate over all voters, reranking them according to their mean rank.<br />

This process results in a grading of all the n objects with respect to each of the k<br />

clusters, which is embodied in a k × n Borda voting matrix BE. Such grading process<br />

is conducted as follows: firstly, for each object (election), clusters (candidates) are<br />

ranked according to their degree of association with respect to it (from the most to<br />

the least strongly associated). Then, the top ranked candidate receives k points, the<br />

second ranked cluster receives k − 1 points, and so on. After iterating this procedure<br />

across the l cluster ensemble components, the grading matrix BE is obtained. The<br />

whole process is described in algorithm 6.3. Notice that the Rank procedure orders<br />

the clusters from the most to the least strongly associated to each object, yielding a<br />

ranking vector r which is a list of the k clusters ordered according to their degree of<br />

association with respect to the object under consideration (i.e. its first component (r(1)) identifies the most strongly associated cluster, and so on). Thus, the Rank

procedure must take into account whether the scalar values contained in λab are<br />

directly or inversely proportional to the strength of association between objects and<br />

clusters.<br />

When applied on our toy example, the resulting Borda grading matrix is the one<br />

presented in equation (6.39).<br />

B_E = \begin{pmatrix}
6 & 6 & 5 & 4 & 4 & 4 & 4 & 4 & 4 \\
3 & 3 & 2 & 2 & 2 & 2 & 4 & 6 & 6 \\
3 & 3 & 5 & 6 & 6 & 6 & 4 & 2 & 2
\end{pmatrix}    (6.39)

According to Borda voting, the higher the value of the (i,j)th entry of BE, the more

likely the jth object belongs to the ith cluster. Thus, again, dividing each column of<br />

matrix BE by its L1-norm transforms it into a cluster membership probability matrix,<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Borda voting matrix BE
Data: k clusters, n objects

Hungarian(E)
BE = 0_{k×n}
for a = 1 . . . l do
    for b = 1 . . . n do
        r = Rank(λab)
        for c = 1 . . . k do
            BE(r(c), b) = BE(r(c), b) + (k − c + 1)
        end
    end
end

Algorithm 6.3: Symbolic description of the soft consensus function BordaConsensus. Hungarian and Rank are symbolic representations of the cluster disambiguation and cluster ordering procedures, respectively, while the vector λab represents the bth column of the ath cluster ensemble component Λa, r is a clusters ranking vector and 0_{k×n} represents a k × n zero matrix.

i.e. a soft consensus clustering solution Λc (see equation (6.40)), and assigning each<br />

object to the cluster it is most strongly associated to –breaking ties randomly– yields<br />

the crisp consensus clustering of equation (6.41).<br />

\Lambda_c = \begin{pmatrix}
0.5 & 0.5 & 0.417 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 \\
0.25 & 0.25 & 0.166 & 0.167 & 0.167 & 0.167 & 0.333 & 0.5 & 0.5 \\
0.25 & 0.25 & 0.417 & 0.5 & 0.5 & 0.5 & 0.333 & 0.167 & 0.167
\end{pmatrix}    (6.40)

λc = [1 1 3 3 3 3 3 2 2]    (6.41)

– CondorcetConsensus (CC): just like Borda voting, the Condorcet voting method’s<br />

origin dates from the French revolution period, as an effort to address the shortcomings<br />

of simple majority voting when there are more than two candidates (Condorcet,<br />

1785). Although often considered to be a multi-step unweighted voting algorithm,

the Condorcet voting method can also be regarded as a positional voting strategy, as<br />

it employs the voters’ preference choices between any given pair of candidates (van<br />

Erp, Vuurpijl, and Schomaker, 2002). In particular, this voting method performs an<br />

exhaustive pairwise candidate ranking comparison across voters, and the winner of<br />

each one of these one-to-one confrontations scores a point. The result of this process<br />

is the Condorcet score matrix CE, the (i,j)th element of which indicates how many candidates the ith candidate beats in one-to-one comparisons in the jth election

(where candidates are clusters and an election corresponds to the clusterization of an<br />

object).<br />

Algorithm 6.4 presents a description of the CondorcetConsensus consensus function.<br />

As in BordaConsensus, the Rank procedure must take into account whether the scalar<br />


Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1 . . . l)
Output: Condorcet voting matrix CE
Data: k clusters, n objects

Hungarian(E)
for b = 1 . . . n do
    M = 0_{k×k}
    for a = 1 . . . l do
        r = Rank(λab)
        for c = 1 . . . k do
            M(r(c), r(c+1 : k)) = M(r(c), r(c+1 : k)) + 1
        end
    end
    for c = 1 . . . k do
        CE(c, b) = Count(M(c, 1 : k) ≥ l/2)
    end
end

Algorithm 6.4: Symbolic description of the soft consensus function CondorcetConsensus. Hungarian and Rank are symbolic representations of the cluster disambiguation and cluster ordering procedures, respectively, while the vector λab represents the bth column of the ath cluster ensemble component Λa, r is a clusters ranking vector and 0_{k×k} represents a k × k zero matrix.

values contained in λab are directly or inversely proportional to the strength of association<br />

between objects and clusters. In each election, the (i,j)th entry of the square<br />

matrix M (usually referred to as the Condorcet sum matrix) counts the number of<br />

times the ith cluster is preferred over the jth one. The Count procedure is used for<br />

counting the number of elements of the cth row of matrix M that are greater than or equal to l/2, which means that at least half of the voters preferred one cluster over another.

Resorting again to our toy example, equation (6.42) presents the Condorcet score<br />

matrix resulting from applying CondorcetConsensus on it.<br />

\[
C_E = \begin{pmatrix}
2 & 2 & 2 & 1 & 1 & 1 & 2 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & 2 & 2 & 2 \\
1 & 1 & 2 & 2 & 2 & 2 & 2 & 0 & 0
\end{pmatrix} \tag{6.42}
\]

Again, dividing each column of matrix CE by its L1-norm transforms the Condorcet<br />

score matrix into a soft consensus clustering solution Λc, whose (i,j)th entry represents

the probability that the jth object belongs to the ith cluster (see equation (6.43) for<br />

the fuzzy consensus clustering solution obtained by CondorcetConsensus on our toy<br />

example). And assigning each object to the cluster it is most strongly associated to<br />

–breaking ties randomly– yields the crisp consensus clustering of equation (6.44).<br />

\[
\Lambda_c = \begin{pmatrix}
0.500 & 0.500 & 0.500 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 & 0.333 \\
0.250 & 0.250 & 0 & 0 & 0 & 0 & 0.333 & 0.667 & 0.667 \\
0.250 & 0.250 & 0.500 & 0.667 & 0.667 & 0.667 & 0.333 & 0 & 0
\end{pmatrix} \tag{6.43}
\]


\[
\lambda_c = \begin{pmatrix} 1 & 1 & 1 & 3 & 3 & 3 & 1 & 2 & 2 \end{pmatrix} \tag{6.44}
\]

Notice that the fuzzy consensus clusterings Λc output by BordaConsensus and CondorcetConsensus<br />

(equations (6.40) and (6.43)) differ notably from those obtained by<br />

SumConsensus and ProductConsensus (equations (6.32) and (6.36)) —see the double<br />

and triple ties obtained at the clusterization of the third and seventh objects, which<br />

are due to the intrinsic differences between the distinct voting strategies applied.<br />

Moreover, notice that the two positional voting based consensus functions (BC and CC) yield structurally similar fuzzy consensus clusterings Λc, although their contents differ slightly. However, their hardened versions λc (equations (6.41) and (6.44)) differ to a larger extent, due to the random way ties are broken. A small illustrative sketch of both positional scoring schemes is presented below.
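For illustration purposes only, the following is a minimal sketch of the Borda and Condorcet scoring steps described above, written in Python/NumPy rather than in the Matlab environment used for the thesis experiments. It assumes that the cluster ensemble components have already been disambiguated (e.g. by means of the Hungarian method), that larger membership values denote stronger object-to-cluster association, and that all function and variable names are merely illustrative.

import numpy as np

def borda_scores(ensemble):
    """Borda scoring over an aligned soft ensemble (list of k x n membership matrices).

    For each object (election) and each clustering (voter), clusters are ranked by
    decreasing membership; the top-ranked cluster receives k points, the next k-1, etc."""
    k, n = ensemble[0].shape
    B = np.zeros((k, n))
    for Lam in ensemble:
        for b in range(n):
            ranking = np.argsort(-Lam[:, b])        # clusters ordered by decreasing membership
            for position, cluster in enumerate(ranking):
                B[cluster, b] += k - position       # k, k-1, ..., 1 points
    return B

def condorcet_scores(ensemble):
    """Condorcet scoring: exhaustive pairwise cluster confrontations across voters."""
    k, n = ensemble[0].shape
    l = len(ensemble)
    C = np.zeros((k, n))
    for b in range(n):
        M = np.zeros((k, k))                        # Condorcet sum matrix of this election
        for Lam in ensemble:
            ranking = np.argsort(-Lam[:, b])
            for position, cluster in enumerate(ranking):
                M[cluster, ranking[position + 1:]] += 1   # preferred over lower-ranked clusters
        C[:, b] = (M >= l / 2.0).sum(axis=1)        # point per cluster beaten by at least half the voters
    return C

def harden(scores, rng=None):
    """L1-normalise columns (soft consensus) and pick the strongest cluster per object,
    breaking ties randomly (crisp consensus)."""
    rng = np.random.default_rng(0) if rng is None else rng
    soft = scores / scores.sum(axis=0, keepdims=True)
    crisp = np.array([rng.choice(np.flatnonzero(col == col.max())) for col in soft.T])
    return soft, crisp + 1                          # 1-based cluster labels, as in the toy example

Applying harden to the output of either scoring function reproduces the column-wise L1 normalization and random tie-breaking used to obtain equations (6.40)-(6.41) and (6.43)-(6.44).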

6.4 Experiments<br />

This section presents several consensus clustering experiments evaluating the consensus<br />

functions for soft cluster ensembles proposed in the previous section. These experiments<br />

are conducted according to the following design.<br />

– What do we want to measure? We are interested in comparing both the quality

of the consensus clustering solutions obtained and the time complexity of the proposed<br />

consensus functions.<br />

– How do we measure it? As regards the time complexity aspect, all consensus<br />

processes follow a flat architecture (i.e. one step consensus), and we measure the<br />

CPU time required for their execution, using the computational resources described<br />

in appendix A.6. As far as the evaluation of the quality of the consensus clustering<br />

results is concerned, although the proposed consensus functions output fuzzy consensus

clustering solutions, we have compared their hardened version with respect to the<br />

ground truth of each data set in terms of normalized mutual information φ(NMI). The

reason for this is twofold: firstly, a soft ground truth is not available for these data sets,<br />

so fuzzy consensus clusterings cannot be directly evaluated. And secondly, provided<br />

that the CSPA, HGPA, MCLA and EAC consensus functions output hard consensus<br />

clustering solutions, fair inter-consensus functions comparison requires converting the<br />

soft consensus clustering matrices Λc output by VMA, BC, CC, PC and SC to crisp<br />

consensus labelings λc —recall that this simply boils down to assigning each object<br />

to the cluster it is most strongly associated to.

– How are the experiments designed? In each consensus clustering experiment we<br />

have applied our four voting-based consensus functions –SumConsensus (SC), ProductConsensus<br />

(PC), BordaConsensus (BC) and CondorcetConsensus (CC)–, besides<br />

the fuzzy versions of CSPA, EAC, HGPA and MCLA (see section 6.2) plus one of<br />

the pioneering soft consensus functions, namely VMA (Voting Merging Algorithm)<br />

(Dimitriadou, Weingessel, and Hornik, 2002) —see appendix A.5 for a description.<br />

Experiments have been conducted on the twelve unimodal data collections employed<br />

in this work, which are described in appendix A.2.1. As regards the creation of the<br />

soft cluster ensemble components, we have employed the fuzzy c-means and the k-means

clustering algorithms. Whereas the former is fuzzy by nature, the latter is not.<br />


However, if object-to-centroid distances are inverted and normalized using a softmax normalization, they can be interpreted as membership probabilities (i.e. the k-means clustering solutions are fuzzified) —see the sketch after this list. For the sake of greater

algorithmic diversity, variants of k-means using the Euclidean, city block, cosine and<br />

correlation distance measures have been employed. Thus, the cardinality of the algorithmic<br />

diversity factor is |dfA| = 5. Applying all these clustering algorithms on each<br />

and every one of the distinct object representations created by the mutually crossed<br />

application of the representational and dimensional diversity factors of each data set,<br />

gives rise to soft cluster ensembles of the sizes l presented in table 6.1. In order to obtain<br />

a representative analysis of the aforementioned consensus functions performance,<br />

we have conducted multiple experiments on distinct diversity scenarios. To do so,<br />

besides using the cluster ensemble of size l, we have also generated cluster ensembles<br />

of sizes ⌊l/20⌋, ⌊l/10⌋, ⌊l/5⌋ and ⌊l/2⌋, which are created by randomly picking a subset

of the original cluster ensemble components. For each distinct cluster ensemble, ten<br />

independent runs of each consensus function are executed.<br />

– How are results presented? The performances of the nine soft consensus functions<br />

are summarized by means of a quality (φ (NMI) with respect to the ground truth) versus<br />

time complexity (CPU time measured in seconds) diagram that describes, in a compact manner, the qualities of the consensus functions compared. For each

consensus function, the depicted scatterplot corresponds to the region limited by the<br />

mean ± 2-standard deviation curves corresponding to the two associated magnitudes<br />

(i.e. φ (NMI) and CPU time) computed throughout all the experiments conducted<br />

on each data collection —ten independent experiment runs on each one of the five<br />

cluster ensemble sizes. In order to determine whether the differences between the<br />

compared consensus functions are significant or not, we have conducted a pairwise<br />

comparison (both in CPU time and φ (NMI) terms) among them applying a t-paired<br />

test, measuring the significance level p at which the null hypothesis (equal means with<br />

possibly unequal variances) is rejected. If the typical 95% confidence interval for true<br />

difference in means is taken as a reference, significance level values of p < 0.05 indicate statistically significant differences between the compared consensus functions.
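As a small illustration of the fuzzification step just described, the sketch below (Python/NumPy, illustrative names only) converts a matrix of object-to-centroid distances into column-stochastic memberships by negating the distances and applying a softmax; the exact inversion used in the thesis experiments is not reproduced here.

import numpy as np

def fuzzify_kmeans_distances(distances):
    """Turn a (k x n) matrix of object-to-centroid distances into soft memberships.

    Smaller distances must yield larger memberships, so distances are negated before
    a column-wise softmax; every column then sums to one and can be read as the
    membership probabilities of the corresponding object."""
    scores = -np.asarray(distances, dtype=float)
    scores -= scores.max(axis=0, keepdims=True)     # numerical stability before exponentiation
    expd = np.exp(scores)
    return expd / expd.sum(axis=0, keepdims=True)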



Data set       Soft cluster ensemble size l
Zoo            285
Iris           45
Wine           225
Glass          145
Ionosphere     485
WDBC           565
Balance        35
Mfeat          30
miniNG         365
Segmentation   260
BBC            285
PenDigits      285

Table 6.1: Soft cluster ensemble sizes l corresponding to the unimodal data sets.

[Figure 6.2 — plot omitted. Panel: ZOO; x-axis: CPU time (sec.), 0–4; y-axis: φ(NMI), 0–1; legend: CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC, SC.]

Figure 6.2: φ(NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus functions on the Zoo data collection.

The superior computational efficiency of VMA is to be expected, due to the fact that it simultaneously solves the

cluster correspondence problem and voting following an iterative procedure (Dimitriadou,<br />

Weingessel, and Hornik, 2002), whereas in SC, PC, BC and CC, these two processes are<br />

sequentially conducted.<br />

As regards the quality of the consensus clustering solutions, notice that the four consensus<br />

functions proposed achieve φ(NMI) scores almost identical to those of the best performing

state-of-the-art alternative, VMA.<br />

Table 6.2 presents the significance level values obtained from all the t-paired tests conducted<br />

on the Zoo data set. The upper and lower triangular sections of the table correspond<br />

to the comparison between consensus functions in terms of CPU time and φ (NMI) , respectively.<br />

When pairwise comparisons between the ith and the jth consensus functions result<br />

in statistically significant differences, the corresponding significance level value p is presented<br />

in the (i,j)th entry of the table (or in the (j,i)th entry, depending on whether it is a<br />

comparison in terms of CPU time or φ (NMI) ). Otherwise, the lack of statistically significant<br />


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     ×       ×       0.0001  0.0001  0.0001  0.0001  0.0013  0.0012
EAC    0.0001  ———     ×       0.0001  0.0002  0.0001  0.0001  0.002   0.0019
HGPA   0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0017  0.0016
MCLA   ×       0.0001  0.0001  ———     0.0001  0.0009  ×       0.0003  0.0003
VMA    0.0001  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0001  0.0001  0.0001  0.0001  ×       ———     0.0001  ×       ×
CC     0.0001  0.0001  0.0001  0.0001  ×       ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     0.0001  0.0001  0.0001  0.0001  ×       0.0337  0.0419  ×       ———

Table 6.2: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a t-paired test on the Zoo data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

differences is denoted by means of the × symbol.<br />
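The pairwise significance testing behind tables such as table 6.2 can be outlined as follows, here using scipy's paired t-test on the per-run scores of two consensus functions; this is only a sketch of the procedure (with hypothetical variable names), not the Matlab code actually employed in the experiments.

import numpy as np
from scipy.stats import ttest_rel

def paired_comparison(scores_a, scores_b, alpha=0.05):
    """Paired t-test between two consensus functions evaluated on the same runs.

    scores_a and scores_b hold one phi(NMI) value (or CPU time) per experiment run.
    Returns the significance level p and whether the difference is significant at
    the usual 95% confidence level (p < alpha)."""
    _, p_value = ttest_rel(np.asarray(scores_a), np.asarray(scores_b))
    return p_value, bool(p_value < alpha)

# e.g. p, significant = paired_comparison(bc_nmi_runs, mcla_nmi_runs)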

For instance, let us see how BordaConsensus (BC) compares to the eight remaining consensus functions in terms of execution CPU time —for an easier identification, the

contents of the corresponding boxes of table 6.2 are italicized. In fact, they tell us that the<br />

differences observed in figure 6.2 (according to which BC is apparently faster than MCLA<br />

and CC, and slower than CSPA, EAC, HGPA, VMA, PC and SC) are statistically significant<br />

with respect to all but the PC and SC consensus functions.<br />

If this comparison is based on the φ (NMI) of the consensus clustering solutions, figure 6.2<br />

suggests that BC performs better than CSPA, EAC, HGPA and MCLA, which is true from a<br />

statistical significance standpoint, as the corresponding entries of table 6.2 (which are typed<br />

in boldface for ease of identification) confirm. In contrast, the small differences between<br />

the φ (NMI) values of BC, VMA, CC and PC appreciated in figure 6.2 are statistically non<br />

significant, whereas the difference with respect to SC is significant despite its apparent closeness.

In order to provide the reader with a global perspective that illustrates the performance<br />

of the proposed consensus functions compared to their state-of-the-art counterparts across<br />

the twelve unimodal collections employed in this work, we have computed the total percentage<br />

of experiments in which the latter yield better, equivalent or worse results than the<br />

voting-based consensus functions —considering the statistical significance of the differences<br />

between the compared magnitudes (CPU time and φ (NMI) ).<br />

Firstly, table 6.3 presents the results of such a comparative analysis when it refers

to the quality of the consensus clusterings output by the consensus functions in all the<br />

experiments conducted. It can be observed that the four proposed consensus functions<br />

outperform EAC, HGPA and MCLA in a pretty overwhelming percentage of the experiments<br />

(an average 94.4% of the total). When compared to CSPA and VMA, we can appreciate<br />

certain differences between the performance of the consensus functions based on confidence<br />

voting (PC and SC) and the ones based on positional voting (BC and CC). In general terms,<br />

SC and PC perform slightly better than BC and CC. Moreover, notice that BordaConsensus<br />

and CondorcetConsensus attain exactly the same results, whereas the similarity between<br />

the results of SC and PC is also very noticeable. We conjecture that these high degrees<br />

of resemblance are due to the fact that evaluation is conducted upon a hardened version

of the soft consensus clustering output by these consensus functions. Thus, the intrinsic<br />


φ(NMI)                      BC      CC      PC      SC
CSPA   better than ...      27.3%   27.3%   9.1%    9.1%
       equivalent to ...    9.1%    9.1%    45.4%   45.4%
       worse than ...       63.6%   63.6%   45.4%   45.4%
EAC    better than ...      0%      0%      0%      0%
       equivalent to ...    0%      0%      0%      0%
       worse than ...       100%    100%    100%    100%
HGPA   better than ...      0%      0%      0%      0%
       equivalent to ...    0%      0%      0%      0%
       worse than ...       100%    100%    100%    100%
MCLA   better than ...      0%      0%      0%      0%
       equivalent to ...    16.7%   16.7%   16.7%   16.7%
       worse than ...       83.3%   83.3%   83.3%   83.3%
VMA    better than ...      25%     25%     0%      0%
       equivalent to ...    33.3%   33.3%   100%    91.7%
       worse than ...       41.7%   41.7%   0%      8.3%

Table 6.3: Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) yield (statistically significant) better/equivalent/worse consensus clustering solutions than the four proposed consensus functions (BC, CC, PC and SC).

differences between the distinct voting strategies are somehow lost.

In either case, the φ (NMI) scores obtained by the four proposed voting based consensus<br />

functions are statistically significantly lower than those of CSPA and VMA in only 15.3% of

the experiments conducted, which clearly indicates that, from a consensus quality perspective,<br />

our proposals constitute an attractive alternative for conducting consensus clustering<br />

on soft cluster ensembles.<br />

And secondly, the results of the previously described comparison, but referred to execution<br />

CPU time, are presented in table 6.4. In general terms, the state-of-the-art consensus<br />

functions (except for EAC) are faster than the proposed consensus functions based on positional<br />

voting methods (BC and CC). This is due to the candidates ranking step that<br />

precedes the voting process itself (see algorithms 6.3 and 6.4). Moreover, the execution of<br />

CC takes longer than BC, due to the exhaustive pairwise candidate comparison involved in<br />

the Condorcet voting method. In contrast, the confidence voting based consensus functions<br />

(PC and SC) are more computationally efficient, being as fast or faster than CSPA, EAC,<br />

HGPA and MCLA in an average 80.7% of the experiments. However, they are unable to<br />

match the computational efficiency of VMA, which, as mentioned earlier, is caused by its<br />

simultaneous and iterative cluster disambiguation and voting procedure.<br />

As a conclusion, it can be stated that the four voting based consensus functions proposed<br />

in this chapter are indeed worthy of being considered as an alternative when it comes to<br />

creating consensus clustering solutions on soft cluster ensembles, as they are capable of<br />

delivering high quality consensus clustering solutions at an acceptable computational cost<br />

—this is especially true for those consensus functions based on confidence voting methods

(i.e. PC and SC). The higher computational complexity of positional voting based consensus<br />

functions (BC and CC) suggests limiting their application to those cases in which the<br />


CPU time                     BC      CC      PC      SC
CSPA   faster than ...       45.4%   72.7%   18.2%   18.2%
       equivalent to ...     18.2%   0%      18.2%   18.2%
       slower than ...       36.4%   27.3%   63.6%   63.6%
EAC    faster than ...       9.1%    27.3%   9.1%    9.1%
       equivalent to ...     27.3%   9.1%    0%      0%
       slower than ...       63.6%   63.6%   90.9%   90.9%
HGPA   faster than ...       91.7%   91.7%   33.3%   33.3%
       equivalent to ...     8.3%    8.3%    41.7%   41.7%
       slower than ...       0%      0%      25%     25%
MCLA   faster than ...       66.7%   83.3%   16.7%   16.7%
       equivalent to ...     25%     16.7%   41.7%   41.7%
       slower than ...       8.3%    0%      41.7%   41.7%
VMA    faster than ...       100%    100%    91.7%   91.7%
       equivalent to ...     0%      0%      8.3%    8.3%
       slower than ...       0%      0%      0%      0%

Table 6.4: Percentage of experiments in which the state-of-the-art consensus functions (CSPA, EAC, HGPA, MCLA and VMA) are executed (statistically significantly) faster/equivalent/slower than the four proposed consensus functions (BC, CC, PC and SC).

confidence values contained in the soft cluster ensemble are difficult to scale correctly (van<br />

Erp, Vuurpijl, and Schomaker, 2002).<br />

6.5 Discussion<br />

The main motivation of the proposals put forward in this chapter is the fact that most of the<br />

literature on cluster ensembles is focused on the application of consensus clustering

processes on hard cluster ensembles. In our opinion, however, soft consensus clustering is<br />

an alternative worth considering, inasmuch as crisp clustering is in fact a simplification of<br />

fuzzy clustering —a simplification that may give rise to the loss of valuable information.<br />

The initial source of inspiration for the soft consensus functions just presented was<br />

metasearch (aka information fusion) systems, the main purpose of which is to obtain improved<br />

search results by combining the ranked lists of documents returned by multiple<br />

search engines in response to a given query. Although the resemblance between metasearch<br />

and consensus clustering was already reported in (Gionis, Mannila, and Tsaparas, 2007),<br />

direct inspiration came from the works of Aslam and Montague (Aslam and Montague,<br />

2001; Montague and Aslam, 2002), where metasearch algorithms based on positional voting<br />

were devised —notice that this type of voting technique lends itself to being applied in

this context, as search engines return lists of ranked documents. From that point on, the<br />

analogy between object-to-cluster association scores in a soft cluster ensemble and voters’<br />

preferences for candidates became the key issue for deriving consensus functions based on<br />

positional and confidence voting methods.<br />

Nevertheless, the application of voting methods for combining clustering solutions is not<br />

new. For instance, unweighted voting strategies (van Erp, Vuurpijl, and Schomaker, 2002)

such as plurality and majority voting have been applied for deriving consensus clustering<br />


solutions on hard cluster ensembles (Dudoit and Fridlyand, 2003; Fischer and Buhmann,<br />

2003; Greene and Cunningham, 2006). To our knowledge, the only voting-based consensus<br />

function for soft cluster ensembles is the Voting-Merging Algorithm (VMA) of (Dimitriadou,<br />

Weingessel, and Hornik, 2002), which employs a weighted version of the sum rule for<br />

confidence voting. Moreover, all these works share a common point in that they use the<br />

Hungarian algorithm for solving the cluster correspondence problem.<br />
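As a concrete illustration of this cluster correspondence step, the sketch below poses the disambiguation of two soft clusterings as a linear assignment problem and solves it with scipy's implementation of the Hungarian method; the overlap-based profit matrix is one reasonable choice chosen for illustration, not necessarily the exact criterion used in the works cited here.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_clusters(reference, target):
    """Relabel the clusters of `target` so that they best match those of `reference`.

    Both arguments are (k x n) soft membership matrices over the same n objects.
    The profit of matching reference cluster i with target cluster j is taken as the
    overlap (inner product) of their membership rows; the Hungarian method then finds
    the one-to-one cluster assignment of maximum total profit."""
    profit = reference @ target.T                       # (k x k) cluster-to-cluster overlap
    _, col_ind = linear_sum_assignment(-profit)         # maximise profit = minimise -profit
    return target[col_ind, :]                           # rows reordered to match the reference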

Additional techniques for cluster disambiguation developed in the consensus clustering<br />

literature include the correspondence estimation based on common space cluster representation<br />

by clusters clustering or Singular Value Decomposition (Boulis and Ostendorf, 2004),<br />

the Soft Correspondence Ensemble Clustering algorithm, which is based on establishing a<br />

soft correspondence between clusters (in the sense that a cluster of a given clustering corresponds<br />

to every cluster in another clustering with different weights) (Long, Zhang, and<br />

Yu, 2005), the cumulative voting approach, that, unlike common one-to-one voting schemes<br />

(e.g. Hungarian), computes a probabilistic mapping between clusters (Ayad and Kamel,<br />

2008), or the FullSearch, Greedy and LargeKGreedy cluster alignment algorithms (Jakobsson

and Rosenberg, 2007). The first two approaches coincide in that they can be applied interchangeably for aligning the clusters of crisp and fuzzy partitions. Given the key importance of

the cluster disambiguation process as a prior step to voting, we plan to evaluate these alternatives<br />

to the Hungarian method, so as to investigate their impact on both the quality of<br />

the consensus clusterings obtained and the time complexity of the whole consensus process.<br />

The comparative performance analysis of the four proposed consensus functions has<br />

revealed that they constitute a feasible alternative for conducting consensus clustering processes<br />

on soft cluster ensembles, as they are capable of yielding consensus clustering solutions<br />

of comparable or superior quality to those obtained by state-of-the-art clustering combiners<br />

at a reasonable computational cost. An additional appealing feature of our proposals is<br />

that they naturally deliver fuzzy consensus clustering solutions, which makes perfect sense in a soft clustering scenario —a fact that other recent consensus functions for soft cluster ensembles

–as the one presented in (Punera and Ghosh, 2007)– do not consider. However, the lack<br />

of a fuzzy ground truth has not allowed evaluating the soft consensus clusterings obtained,<br />

which constitutes one of the future directions of research of the work conducted in this<br />

chapter. As mentioned earlier, this would probably make the differences between the proposed<br />

consensus functions more evident, as it would highlight the differences between the<br />

distinct voting methods employed.<br />

As reported earlier, the sequential application of the cluster disambiguation and the<br />

voting processes penalizes the time complexity of our proposals, especially when they are

compared to VMA. Thus, in the future, we plan to adopt the iterative cluster alignment<br />

plus voting strategy employed by this consensus function, which, in our opinion, will surely<br />

reduce the execution time of the proposed voting-based consensus functions without significantly<br />

reducing the quality of the consensus clusterings obtained.

Another significant conclusion is that the EAC and HGPA consensus functions yield<br />

the lowest quality consensus clusterings, as already noticed in the vast majority of the<br />

experiments conducted in the hard clustering scenario.<br />


6.6 Related publications<br />

Our first approach to voting based soft consensus functions was the derivation of BordaConsensus (Sevillano, Alías, and Socoró, 2007b). The details of this publication, presented as a poster at the SIGIR 2007 conference held in Amsterdam, are described next.

Authors: Xavier Sevillano, Francesc Alías and Joan Claudi Socoró<br />

Title: BordaConsensus: a New Consensus Function for Soft Cluster Ensembles<br />

In: Proceedings of the 30th ACM SIGIR Conference<br />

Pages: 743-744<br />

Year: 2007<br />

Abstract: Consensus clustering is the task of deriving a single labeling by applying<br />

a consensus function on a cluster ensemble. This work introduces BordaConsensus, a<br />

new consensus function for soft cluster ensembles based on the Borda voting scheme.<br />

In contrast to classic, hard consensus functions that operate on labelings, our proposal<br />

considers cluster membership information, thus being able to tackle multiclass<br />

clustering problems. Initial small scale experiments reveal that, compared to state-of-the-art

consensus functions, BordaConsensus constitutes a good performance vs.<br />

complexity trade-off.<br />



Chapter 7<br />

Conclusions<br />

The contributions put forward in this thesis constitute a unitary proposal for robust clustering<br />

based on cluster ensembles, with a specific focus on the increasingly interesting application<br />

of multimedia data clustering and a view on its generalization in fuzzy clustering<br />

scenarios. In this chapter, we summarize the main features of our proposals, highlighting<br />

their strengths and weaknesses, and outlining some interesting directions for future research.<br />

As for the robustness of clustering, recall that the unsupervised nature of this problem<br />

makes it difficult (if not impossible) to select a priori the clustering system configuration

that gives rise to the best 1 data partition. Furthermore, given the myriad of options –e.g.<br />

clustering algorithms, data representations, etc.– available to the clustering practitioner,<br />

such important decision making is marked by a high degree of uncertainty. As suboptimal<br />

configuration decisions may give rise to little meaningful partitions of the data, it turns out<br />

that these clustering indeterminacies end up being very relevant in practice, which, in our<br />

opinion, justifies research efforts oriented to overcome them (such as the present one). This<br />

was the main motivation of our first approaches to robust clustering via cluster ensembles<br />

(Sevillano et al., 2006a; Sevillano et al., 2006b; Sevillano et al., 2007c), which have attracted<br />

the attention of several researchers (Tjhi and Chen, 2007; Pinto, 2008; Gonzàlez and Turmo,<br />

2008b; Tjhi and Chen, 2009).<br />

For these reasons, our approach to robust clustering intentionally reduces user decision<br />

making as much as possible, thus following an approach that is nearly opposite to the<br />

procedure usually employed in clustering: instead of using a specific clustering configuration<br />

(which is often selected blindly unless domain knowledge is available), the clustering<br />

practitioner is encouraged to use and combine all the clustering configurations at hand,<br />

compiling the resulting clusterings into a cluster ensemble, upon which a consensus clustering<br />

is derived. The more similar this consensus clustering is to the highest quality clustering<br />

contained in the cluster ensemble, the greater the robustness to clustering indeterminacies.<br />

In this context, it must be noted that our particular approach to robust clustering foments<br />

the creation of large cluster ensembles. This motivates that one of our main issues<br />

of concern is the computationally efficient derivation of a high quality consolidated cluster-<br />

1 The best data partition is an elusive concept in itself, as it basically depends on how the clustered<br />

data is interpreted. However, for any given interpretation criterion, some clustering algorithms may obtain<br />

better clusters than others (Jain, Murty, and Flynn, 1999). In this work, the quality of clusterings has been<br />

evaluated by comparison with a allegedly correct cluster structure of the data, referred to as ground truth,<br />

measuring their degree of resemblance by means of normalized mutual information, or φ (NMI) .<br />


ing upon the aforementioned cluster ensemble, which gives rise to the first two proposals<br />

put forward in this thesis: hierarchical consensus architectures and consensus self-refining<br />

procedures, which are reviewed in sections 7.1 and 7.2, respectively.<br />

Our proposals for robust clustering based on cluster ensembles find a natural field of<br />

application in multimedia data clustering, as the existence of multiple data modalities<br />

poses additional indeterminacies that challenge the obtention of robust clustering results.<br />

Moreover, our strategy naturally allows the simultaneous use of early and late multimodal<br />

fusion techniques, which constitutes a highly generic approach to the problem of multimedia<br />

clustering. In section 7.3, our proposals in this area are reviewed.<br />

The last proposal of this thesis can be regarded as a first step for generalizing our cluster<br />

ensembles based robust clustering approach, as it consists of several voting based consensus<br />

functions for soft cluster ensembles —recall that crisp clustering is in fact a simplification of<br />

its fuzzy counterpart. These consensus functions, which are reviewed in section 7.4, can also<br />

be considered a response to the relatively few efforts devoted to the derivation of consensus<br />

clustering strategies in the context of fuzzy clustering.<br />

We have given great importance to the experimental evaluation of all our proposals. To<br />

that effect, we have employed several state-of-the-art consensus functions for hard cluster<br />

ensembles –hypergraph based (CSPA, HGPA, MCLA) (Strehl and Ghosh, 2002), evidence<br />

accumulation (EAC) (Fred and Jain, 2005) and similarity-as-data based (ALSAD, KMSAD,<br />

SLSAD) (Kuncheva, Hadjitodorov, and Todorova, 2006)– to implement our self-refining<br />

hierarchical consensus architectures. Moreover, the fuzzy versions of CSPA, EAC, HGPA<br />

and MCLA, plus the VMA soft consensus function (Dimitriadou, Weingessel, and Hornik,<br />

2002) have been used as an evaluation benchmark for our voting based consensus functions<br />

for soft cluster ensembles. Our proposals have been tested over a total of sixteen unimodal<br />

and multimodal data collections, which contain a number of objects ranging from hundreds<br />

to several thousands. In particular, the performance of self-refining hierarchical consensus<br />

architectures has been evaluated on both unimodal (chapters 3 and 4) and multimodal data<br />

collections (chapter 5), whereas the experiments concerning soft consensus functions have<br />

been conducted on the 12 unimodal collections —see chapter 6. However, in the near future,<br />

we plan to extend these latter experiments towards multimodal data sets. We expect such an extension to involve little cost, since any consensus function can easily accommodate our early

plus late fusion multimedia clustering proposal. In this sense, we also intend to apply<br />

our multimedia clustering system on well-known multimodal data sets such as VideoClef<br />

(composed of video data along with speech recognition transcripts, metadata and shot-level<br />

keyframes) (VideoCLEF, accessed on May 2009) and ImageClef (still images annotated with<br />

text) (ImageCLEF, accessed on May 2009).<br />

Furthermore, we have conducted our experiments on cluster ensembles of very different<br />

sizes (from 6 to 5124 clusterings), in order to evaluate the influence of this factor on the<br />

computational facet of our proposals. In all the experiments, the statistical significance<br />

of the results at the 5% significance level has been evaluated, either explicitly or by their<br />

presentation by means of boxplot charts.<br />

As mentioned in appendix A.6, the experiments conducted in this thesis have been<br />

run under Matlab 7.0.4 on Dual Pentium 4 3GHz/1 GB RAM computers. A total of<br />

20 computers were employed during approximately 9 months at an almost constant pace,<br />

combining for an estimated total running time of more than 10 years!<br />

This experimental variability has provided us with a comparative view of the state-


of-the-art consensus functions employed, both in terms of the quality of the consensus clusterings

they yield and their computational complexity, and the most relevant conclusions<br />

drawn are enumerated next. As regards the consensus functions for hard cluster ensembles,<br />

we have observed that, in general terms, the EAC and HGPA consensus functions deliver,<br />

by far, the poorest quality consensus clusterings. We believe that the low performance of<br />

EAC is due to the fact that it was originally devised for consolidating clusterings with a very<br />

high number of clusters into a consensus partition with a smaller number of clusters (Fred<br />

and Jain, 2005). However, in our experiments, both the cluster ensemble components and<br />

the consensus clustering have the same number of clusters, which probably has a detrimental<br />

effect on the quality of the consensus partitions obtained by the evidence accumulation<br />

approach.<br />

From a computational standpoint, the HGPA and MCLA consensus functions are applicable<br />

on larger data collections than the rest, as their complexity is linear with the number<br />

of objects n. However, the execution time of MCLA is penalized when it is executed on<br />

large cluster ensembles, as its time complexity is quadratic with l (the number of cluster<br />

ensemble components). As regards the soft consensus functions, VMA constitutes the most<br />

attractive alternative, as it yields pretty high quality consensus clusterings while being fast<br />

to execute, thanks to the simultaneous execution of the cluster disambiguation and voting<br />

procedures. A rather opposite behaviour is shown by the soft versions of the EAC and<br />

HGPA consensus functions: the former is notably time consuming, while the latter outputs<br />

really poor quality consensus clusterings.<br />

Let us get critical for a while: possibly one of the major sources of criticism for this<br />

work refers to the rather unrealistic assumption (though not uncommon in the literature)<br />

that the number of clusters the objects must be grouped into (referred to as k) is a known

parameter. In practice, however, the user seldom knows how many clusters should be found,<br />

so it becomes a further indeterminacy to deal with.<br />

In this work, all the clusterings involved in any process (i.e. the cluster ensemble components<br />

and the consensus clusterings) have the same number of clusters, which coincides<br />

with the number of true groups in the data, defined by the ground truth that constitutes<br />

the gold standard for ultimately evaluating the quality our results. By doing so, we have<br />

also disregarded a common diversity factor employed in the creation of cluster ensembles,<br />

which often contain clusterings with different numbers of clusters (often chosen randomly).<br />

However, we would like to highlight at this point that not many of the consensus functions<br />

found in the literature are capable of estimating the correct number of clusters in<br />

the data set, thus making it necessary to specify the desired value of k as one of their parameters.<br />

Quite obviously, two of the clearest future research directions of this work are i)<br />

estimating the number of clusters of the consensus clustering solution, and ii) adapting the<br />

proposed consensus functions for dealing with cluster ensemble components with distinct<br />

numbers of clusters. The achievement of these goals would constitute the ultimate step<br />

towards a fully generic approach to robust clustering based on cluster ensembles.<br />

7.1 Hierarchical consensus architectures<br />

As regards the computational efficiency of consensus processes, the fact that their space and<br />

time complexities usually scale linearly or quadratically with the cluster ensemble size can<br />


make the execution of traditional one-step (aka flat) consensus processes (in which the whole<br />

cluster ensemble is input to the consensus function at once) very costly or even unfeasible<br />

when conducted on highly populated cluster ensembles. For this reason, the application of<br />

a divide-and-conquer strategy on the cluster ensemble –which gives rise to the hierarchical<br />

consensus architectures (HCA) proposed in chapter 3– constitutes an alternative to classic<br />

flat consensus that, besides leaving out none of the l cluster ensemble components, is also<br />

naturally parallelizable, making it even more computationally appealing.<br />

In particular, two types of hierarchical consensus architectures have been proposed:<br />

random and deterministic HCA. Both architectures differ in the way the user intervenes in<br />

their design. In random HCA, the user selects the size (b) of the mini-ensembles upon which intermediate consensus processes are conducted, which, together with the cluster ensemble size l

determines the number of stages of the consensus architecture. Compared to them, deterministic<br />

HCA provide a more modular approach to consensus clustering, as clusterings of<br />

the same nature are combined at each stage of the hierarchy. In fact, our first approach to<br />

hierarchical consensus architectures dealt with deterministic HCA (Sevillano et al., 2007a),<br />

although it was solely focused on the analysis of the quality of the consensus clusterings<br />

obtained, not on its computational aspect.<br />

Extensive experiments have proven that their computational efficiency is highly dependent<br />

on the characteristics of the consensus function employed for combining the clusterings<br />

(in particular, it depends on how its time complexity scales with the number of clusterings<br />

combined). For instance, flat consensus based on the EAC consensus function is more efficient<br />

than any hierarchical architecture, whereas a rather opposite behaviour is observed<br />

when MCLA is used.<br />

Moreover, we have observed that HCAs become faster than flat consensus when operating<br />

on highly populated cluster ensembles, regardless of whether their fully serial or parallel<br />

implementation is considered (except when the EAC consensus function is employed). As expected,

the fully parallel version of HCAs outperforms flat consensus (often by a large<br />

margin), even when small cluster ensembles are employed. An additional interesting feature<br />

of hierarchical consensus architectures is that they provide a means for obtaining a<br />

consensus clustering solution in scenarios where the complexity of flat consensus makes its<br />

execution impossible (for a given set of computational resources).<br />

Given the fact that multiple specific implementation variants of a HCA exist, and that<br />

their time complexities can differ largely, it seems necessary to provide the user with tools<br />

that allow to predict, for a given consensus clustering problem, which is the most computationally<br />

efficient one. In this sense, a simple methodology for estimating their running time<br />

–and thus, selecting which is the least time consuming– has also been proposed. Despite<br />

its simplicity, the proposed methodology is capable of achieving an accuracy close to 80%<br />

when predicting the fastest serially implemented HCA variant, while this percentage goes

down to nearly 50% in the parallel implementation case. This difference is caused by the<br />

fact that, in the parallel case, the running time estimation is more sensitive to random<br />

deviations of the measured running times the estimation is based upon, as it often ends up<br />

depending on a single execution time sample. However, the impact of incorrect predictions,<br />

when measured in running time overheads with respect to the truly fastest HCA variant,<br />

is well below 10 seconds in a vast majority of the experiments conducted —of course, the<br />

relative importance of such deviations will ultimately depend on the time requirements of<br />

the specific application the HCA is embedded in.<br />


Though put forward in a robust clustering via cluster ensembles framework, hierarchical<br />

consensus architectures can be of interest in any consensus clustering related problem where<br />

cluster ensembles containing a large number of components are involved. Furthermore,<br />

HCAs are directly portable to a fuzzy clustering scenario with no modifications.<br />

In our opinion, the main weakness of this proposal lies in the rather simplistic approach<br />

taken in the running time estimation methodology, which employs the execution time of a<br />

single consensus process run for estimating the time complexity of the whole HCA. Although

experiments have demonstrated that its performance is pretty good, we conjecture that<br />

a possible means for improving it –especially in the parallel case, where lower prediction<br />

accuracies are obtained– would consist in modelling statistically the running times of the<br />

consensus processes the estimation is based on.<br />

7.2 Consensus self-refining procedures<br />

Besides the computational difficulties that have motivated the development of hierarchical<br />

consensus architectures, the use of large cluster ensembles also poses a challenge as far as the<br />

quality of the obtained consensus clustering is concerned. Indeed, the somewhat indiscriminate<br />

generation of clusterings encouraged by our robust clustering via cluster ensembles<br />

proposal may presumably lead to the creation of low quality cluster ensemble components,<br />

which affects the quality of consensus clustering negatively. In order to mitigate the undesired<br />

influence of these components, we have devised an unsupervised strategy for excluding<br />

them from the consensus process.<br />

The rationale of such strategy is the following: starting with a reference clustering,<br />

we measure its similarity (in terms of average normalized mutual information, or φ (ANMI) )<br />

with respect to the l cluster ensemble components. Subsequently, a percentage p of these<br />

components is selected, after ranking them according to their similarity with respect to the<br />

aforementioned reference clustering. Last, the self-refined consensus clustering is obtained

by combining the clusterings included in such reduced cluster ensemble, according to either<br />

a flat or a hierarchical architecture —a decision that can be reliably made using the running<br />

time estimation methodology mentioned earlier.<br />
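The select-and-recombine step just described can be outlined as follows, assuming crisp label vectors, sklearn's normalized_mutual_info_score as the φ(NMI) implementation and a generic consensus_fn placeholder; it is a sketch of the procedure, not the actual implementation.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def self_refine(reference, ensemble, p, consensus_fn):
    """Keep the p% of ensemble components most similar (in NMI terms) to the reference
    clustering and derive the self-refined consensus from that reduced ensemble."""
    similarities = np.array([nmi(reference, labels) for labels in ensemble])
    n_keep = max(1, int(round(len(ensemble) * p / 100.0)))
    keep = np.argsort(-similarities)[:n_keep]             # most similar components first
    return consensus_fn([ensemble[i] for i in keep])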

Following this generic approach, two self-refining strategies have been proposed. They<br />

solely differ in the origin of the clustering used as the reference of the self-refining procedure.<br />

In the first version (denominated consensus based self-refining), the reference clustering is<br />

the result of a previous consensus process conducted upon the cluster ensemble at hand.<br />

In contrast, the second self-refining procedure (referred to as selection based self-refining)<br />

employs one of the cluster ensemble components as the reference clustering, which is selected<br />

by means of a φ(ANMI) maximization criterion.

We would like to highlight the fact that the self-refining procedure is almost fully unsupervised.<br />

The only user intervention is the selection of the percentage p of the l cluster<br />

ensemble components included in the select cluster ensemble the self-refined consensus clustering<br />

is derived upon. In order to minimize the risk of negatively biasing the self-refining procedure results by a suboptimal selection of p, the user is prompted to select not a

single, but a set of values of p. The self-refining procedure will produce a self-refined consensus<br />

clustering for each distinct value of p, selecting a posteriori, in a fully unsupervised<br />

manner, the one with maximum average normalized mutual information with respect to the<br />


cluster ensemble (a selection process that is given the name of supraconsensus (Strehl and<br />

Ghosh, 2002)).<br />

The analysis of the quality (measured in terms of normalized mutual information with<br />

respect to the ground truth) of the set of self-refined consensus clusterings obtained at each<br />

experiment reveals that the proposed self-refining procedure is notably successful, as it is<br />

higher than that of the reference clustering in 83% (for consensus based self-refining) or 56% (for selection based self-refining) of the experiments conducted. Furthermore, we have

also observed that producing multiple self-refined consensus clusterings is a highly beneficial<br />

approach, as the highest quality self-refined clustering is obtained for very disparate values<br />

of p depending on the experiment —from p = 2% to p = 90%—, so it would be quite easy to

select a suboptimal value of p if a single one was chosen. As far as the quality gains induced<br />

by the self-refining procedure are concerned, relative percentage φ (NMI) increases (referred<br />

to the non-refined consensus clustering) higher –and quite often much higher– than 10%<br />

are obtained in a vast majority of the experiments conducted.<br />

A further advantage of the self-refining procedure is its ability to uniformize the quality of<br />

the consensus clustering solutions created by distinct consensus architectures –reducing the<br />

variances between their φ (NMI) scores by a factor of 20–, thus making it easier to decide which<br />

is the most appropriate consensus architecture for a given consensus clustering problem on<br />

computational grounds solely.<br />

However, the good performance of the proposed self-refining procedure is somewhat<br />

tarnished by the limited accuracy of the supraconsensus selection process, which manages<br />

to select the highest quality self-refined consensus clustering in less than half of the

experiments conducted, which causes an average 14% relative φ (NMI) reduction between the<br />

consensus clustering selected by supraconsensus and the top quality one.<br />

For this reason, the main research activities in this area should be directed, in our<br />

opinion, towards the derivation of accurate supraconsensus selection techniques capable of<br />

choosing, in a fully blind manner and as precisely as possible, the highest quality consensus<br />

clustering among a given bunch of them.<br />

Last, we have been pleased to notice that fighting the expected quality decrease suffered

by consensus clusterings created upon large cluster ensembles has also drawn the interest<br />

of other authors. Curiously enough, this issue has also been tackled in (Fern and Lin,<br />

2008) in a very similar fashion to our selection based self-refining procedure, which can be<br />

interpreted as a sign of the good sense of our proposals.<br />

7.3 Multimedia clustering based on cluster ensembles<br />

Undoubtedly, ‘going multimedia’ is a beneficial trend, as it provides a richer vision of<br />

information. However, it poses a challenge when multimodal data is to be processed by<br />

means of unsupervised learning techniques (e.g. clustering), as the existence of multiple<br />

modalities increases the uncertainties about what is the best way to represent, classify or<br />

describe the data. In this sense, intuition tends to suggest that constructive interactions<br />

between the distinct modalities exist, which should lead to a better explanation of the<br />

data. However, it is not clear how this modality fusion should be conducted, either at a<br />

feature level (early fusion) or at a decision level (late fusion). Indeed, our experiments have<br />

demonstrated that early fusion is not always advantageous as regards the quality of the<br />


clustering results —although in other contexts, such as joint multimodal data analysis

and synthesis, it becomes a crucial process (Sevillano et al., 2009).<br />

For this reason, the key point of our approach to robust multimedia clustering consists<br />

in neither prioritizing nor discarding any of the modalities. On the contrary, the user is

encouraged to create clusterings upon each separate modality and on feature level fused<br />

modalities, compiling them all into a multimodal cluster ensemble, upon which a consensus<br />

clustering is created.<br />

Interestingly enough, the application of this strategy –which is nothing but a generalized<br />

version of our approach to robust clustering– naturally calls for the use of hierarchical<br />

consensus architectures, as the existence of multiple (say m) modalities increases cluster<br />

ensemble sizes by a minimum factor of m + 1 (as we consider the m original object representations<br />

plus the one created by their feature level fusion), which poses a computational<br />

challenge to the execution of flat consensus clustering. Furthermore, the hypothetical inclusion<br />

of low quality components in such a large cluster ensemble makes the application of<br />

self-refining procedures an attractive alternative for obtaining good consensus clusterings<br />

upon the aforementioned multimodal cluster ensemble.<br />

In order to evaluate the effect of multimodality in a modular manner, separate consensus<br />

processes have been conducted for each original data modality and for the modality<br />

derived from the early fusion of these. To that effect, a deterministic hierarchical consensus<br />

architecture has been employed in our multimodal consensus clustering experiments, as it<br />

allows a structured construction of consensus clusterings both within and across modalities.<br />

As regards within modality consensus, the results obtained reveal that consensus clusterings<br />

obtained on the multimodal modality (i.e. the one resulting from the early fusion<br />

of the original modalities of the data) attain higher φ (NMI) scores than their unimodal<br />

counterparts in an average 56% of the experiments conducted.<br />

When the unimodal and multimodal consensus clusterings are combined –thus giving<br />

rise to intermodal consensus clusterings– we observe that, in terms of φ (NMI) with respect<br />

to the ground truth, these are better than the unimodal ones in 59.5% of the experiments,

while this percentage is 34.7% when compared to the multimodal consensus clustering.<br />

However, the fairly distinct results obtained depending on the data set and consensus<br />

function employed suggest that creating an intermodal consensus clustering is a pretty<br />

generic way of proceeding, as its occasionally inferior quality can be compensated by means of

a subsequent self-refining procedure followed by an unsupervised supraconsensus selection<br />

of the final consensus clustering.<br />

If the maximum and median quality components of the multimodal cluster ensemble<br />

are taken as reference thresholds for evaluating the robustness of the self-refined consensus<br />

clustering selected by supraconsensus, we observe that it is 36.6% worse than the former and 93.5% better than the latter (measured in relative percentage φ(NMI) variations). As in

the unimodal case, this performance would be improved if a better supraconsensus selection<br />

process was devised —which, as aforementioned, is one of the future work priorities.<br />

As regards the future research lines in the multimodal clustering area, we plan to investigate<br />

early multimodal fusion techniques capable of unveiling constructive interactions<br />

between modalities, besides applying selection based consensus self-refining on the multimodal<br />

cluster ensemble, as we conjecture that will probably yield higher quality clusterings<br />

than those obtained by consensus based self-refining.<br />


7.4 Voting based soft consensus functions<br />

The outcome of a fuzzy clustering process is much more informative than its crisp counterpart,<br />

as it indicates the strength of association of each object to each cluster. Despite<br />

this fact, soft clustering combination strategies are a minority in the consensus clustering<br />

literature. In light of this, we have made several proposals in this area, aiming to extend

all our previous proposals to the more generic framework of soft clustering.<br />

There exists a pretty evident parallelism between the strength of association of each<br />

object to each cluster in a fuzzy clustering solution and the degree of preference of a voter<br />

for a candidate in an election. This fact directly allows the application of certain voting<br />

methods for consolidating soft clusterings, considering the clusters as the candidates, the<br />

cluster ensemble components as voters, and the clusterization of each object as an election.<br />

However, given the ambiguous identification of clusters inherent to clustering, a cluster<br />

alignment between the cluster ensemble components is required prior to voting.<br />

In this work, we have proposed four consensus functions for soft cluster ensembles, which<br />

are the result of applying as many voting strategies for combining the clusterings in the<br />

ensemble. In particular, we have employed two confidence voting methods –the sum and<br />

product rules, which give rise to the SumConsensus (SC) and ProductConsensus (PC) consensus<br />

functions–, and two positional voting techniques —the Borda and Condorcet voting<br />

strategies that constitute the basis of the BordaConsensus (BC) and CondorcetConsensus<br />

(CC) clustering combiners. The main difference between these two families of voting methods<br />

lies in the fact that the former operate directly on the object-to-cluster association<br />

values that make up the cluster ensemble components, whereas the latter operate on the<br />

candidates ranking according to the voters’ preferences. For disambiguating the clusters,<br />

we have employed the classic Hungarian algorithm (Kuhn, 1955).<br />
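For illustration, minimal sketches of the two confidence voting rules over an aligned soft ensemble are given below (Python/NumPy, illustrative only); the small eps term in the product rule is an added numerical safeguard against all-zero columns rather than part of the original formulation.

import numpy as np

def sum_consensus(aligned_ensemble):
    """SumConsensus: add the aligned (k x n) membership matrices and L1-normalise columns."""
    total = np.sum(np.asarray(aligned_ensemble, dtype=float), axis=0)
    return total / total.sum(axis=0, keepdims=True)

def product_consensus(aligned_ensemble, eps=1e-12):
    """ProductConsensus: multiply memberships element-wise and L1-normalise columns."""
    prod = np.prod(np.asarray(aligned_ensemble, dtype=float) + eps, axis=0)
    return prod / prod.sum(axis=0, keepdims=True)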

The experiments conducted have evaluated our four consensus functions (SC, PC, BC<br />

and CC), comparing them with several state-of-the-art soft consensus functions in terms<br />

of their computational complexity and the quality of the consensus clusterings they yield.<br />

In terms of execution time, confidence voting consensus functions are faster than their<br />

positional voting counterparts, as the candidate ranking process penalizes the latter from a<br />

computational standpoint. In this sense, CC is the slowest proposal, due to the exhaustive<br />

pairwise candidate confrontation implicit in the Condorcet voting method. Contrarily, the<br />

more computationally efficient PC and SC consensus functions are as fast or faster than<br />

CSPA, EAC, HGPA and MCLA in a 81% of the experiments conducted —however, they<br />

are slower than VMA in a 92% of the cases.<br />

If the quality of the hardened version of the fuzzy consensus clusterings (measured in terms of φ(NMI) with respect to the ground truth) is used as the comparison factor, we observe that the four proposed consensus functions yield statistically significantly better results than any of the state-of-the-art consensus functions in 72% of the experiments conducted on average, which is a clear indicator of the merit of our proposals. It is important to highlight that it has been impossible to evaluate the fuzzy consensus clusterings output by the four proposed consensus functions directly, due to the unavailability of soft labels in the data sets employed. As a future direction of research, we plan to conduct this fuzzy evaluation, and we conjecture that greater differences between SC, PC, BC and CC will be observed, as the differences between the results of the voting strategies they are based upon are somewhat masked when the fuzzy consensus clusterings they yield are hardened.
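As a point of reference, the hardening and φ(NMI) evaluation step mentioned above can be sketched as follows, assuming scikit-learn's implementation of normalized mutual information with the geometric averaging used by Strehl and Ghosh (2002); the variable names are illustrative only.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def harden(soft_consensus):
    # turn an n x k soft consensus clustering into crisp labels
    return np.argmax(soft_consensus, axis=1)

def phi_nmi(ground_truth, soft_consensus):
    # phi(NMI) between the hardened consensus clustering and the ground truth
    return normalized_mutual_info_score(ground_truth,
                                        harden(soft_consensus),
                                        average_method='geometric')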

Ours is not the first approach to soft consensus clustering based on voting. In fact, the VMA consensus function employs a weighted version of the sum voting rule (Dimitriadou, Weingessel, and Hornik, 2002). However, to our knowledge, BordaConsensus (first introduced in (Sevillano, Alías, and Socoró, 2007b)) and CondorcetConsensus are the first consensus functions based on positional voting.

As mentioned above, our proposals deal with clusterings with a constant number of clusters k, and it would be of paramount interest to adapt them to combine clusterings with different numbers of clusters. A possible way to do so would be to complete those clusterings that have fewer clusters with dummy clusters (Ayad and Kamel, 2008), as suggested in the VMA consensus function (Dimitriadou, Weingessel, and Hornik, 2002).
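For illustration, such a dummy-cluster completion could be sketched as follows; this is a minimal example assuming that the ensemble components are n x k_i soft membership matrices, and pad_with_dummy_clusters is a hypothetical helper rather than part of the cited methods.

import numpy as np

def pad_with_dummy_clusters(membership, k_max):
    # append zero-membership dummy clusters so that every ensemble component
    # has the same number of columns (k_max) before alignment and voting
    n, k = membership.shape
    if k >= k_max:
        return membership
    return np.hstack([membership, np.zeros((n, k_max - k))])

# usage: k_max = max(m.shape[1] for m in ensemble)
#        ensemble = [pad_with_dummy_clusters(m, k_max) for m in ensemble]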

Besides this, possibly the clearest direction for future research in this area is to adapt the simultaneous cluster disambiguation and voting mechanism of VMA, which would probably i) reduce the time complexity of the proposed consensus functions, and ii) require introducing some adjustments to the voting methods employed. Moreover, we are also interested in exploring other existing techniques for solving the cluster disambiguation problem, analyzing their impact both on the quality of the consensus clustering solutions obtained and on the overall computational complexity of the consensus function.



References<br />


Agogino, A. and K. Tumer. 2006. Efficient agent-based cluster ensembles. In Proceedings of<br />

the 5th International Joint Conference on Autonomous Agents and Multi-Agent Systems.<br />

Agrawal, R., J. Gehrke, D. Gunopulos, and P. Raghavan. 1998. Automatic subspace<br />

clustering of high dimensional data for data mining applications. In Proceedings of the<br />

ACM-SIGMOD Conference on the Management of Data, pages 94–105, Seattle, WA,<br />

USA.<br />

Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on<br />

Automatic Control, 19(6):716–722.<br />

Al-Sultan, K. 1995. A Tabu search approach to the clustering problem. Pattern Recognition,<br />

28(9):1443–1451.<br />

Anderberg, M.R. 1973. Cluster Analysis for Applications. Monographs and Textbooks on<br />

Probability and Mathematical Statistics. New York: Academic Press, Inc.<br />

Anderson, L.W., D.R. Krathwohl, P.W. Airasian, K.A. Cruikshank, R.E. Mayer, P.R. Pintrich,<br />

J. Raths, and M.C. Wittrock. 2001. A Taxonomy for Learning, Teaching, and<br />

Assessing – A Revision of Bloom’s Taxonomy of Educational Objectives. Addison Wesley<br />

Longman, Inc.<br />

Aslam, J.-A. and M. Montague. 2001. Models for metasearch. In Proceedings of the 24th<br />

ACM SIGIR Conference, pages 276–284, New Orleans, LA, USA.<br />

Asuncion, A. and D.J. Newman. 1999. UCI Machine Learning Repository.<br />

http://www.ics.uci.edu/∼mlearn/MLRepository.html. University of California, Irvine,<br />

School of Information and Computer Sciences.<br />

Ayad, H.G. and M.S. Kamel. 2008. Cumulative Voting Consensus Method for Partitions<br />

with Variable Number of Clusters. IEEE Transactions on Pattern Analysis and Machine<br />

Intelligence, 30(1):160–173.<br />

Ball, G.H. and D.J. Hall. 1965. ISODATA, a novel method of data analysis and classification.<br />

Technical Report, Stanford University, Stanford, CA, USA.<br />

Barnard, K., P. Duygulu, D.A. Forsyth, N. De Freitas, D.M. Blei, and M.I. Jordan. 2003.<br />

Matching Words and Pictures. Journal on Machine Learning Research, 3:1107–1135.<br />

Barnard, K. and D.A. Forsyth. 2001. Learning the semantics of words and pictures. In<br />

Proceedings of the IEEE International Conference on Computer Vision, volume II, pages<br />

408–415.<br />

Barthelemy, J.P., B. Laclerc, and B. Monjardet. 1986. On the use of ordered sets in

problems of comparison and consensus of classifications. Journal of Classification, 3:225–<br />

256.<br />

Bekkerman, R. and J. Jeon. 2007. Multi-modal clustering for multimedia collections. In<br />

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern<br />

Recognition, pages 1–8.<br />


Ben-Hur, A., D. Horn, H. Siegelmann, and V. Vapnik. 2001. Support vector clustering.<br />

Journal on Machine Learning Research, 2:125–137.<br />

Benitez, A.B and S.F. Chang. 2002. Perceptual knowledge construction from annotated<br />

image collections. In Columbia University ADVENT, pages 26–29.<br />

Berkhin, P. 2002. Survey of clustering data mining techniques. Available<br />

online at http://www.accrue.com/products/rp cluster review.pdf or<br />

http://citeseer.nj.nec.com/berkhin02survey.html.<br />

Biggs, J.B. and K. Collis. 1982. Evaluating the Quality of Learning: the SOLO taxonomy.<br />

New York: Academic Press.<br />

Bingham, E. and H. Mannila. 2001. Random projection in dimensionality reduction: applications<br />

to image and text data. In Proceedings of the 7th ACM SIGKDD International<br />

Conference on Knowledge Discovery and Data Mining, pages 245–250, San Francisco,<br />

CA, USA.<br />

Bloom, B.S. 1956. Taxonomy of Educational Objectives: The Classification of Educational<br />

Goals. Susan Fauer Company, Inc.<br />

Borda, J.C. de. 1781. Memoire sur les Elections au Scrutin. Histoire de l'Académie Royale des Sciences, Paris.

Boulis, C. and M. Ostendorf. 2004. Combining multiple clustering systems. In J.F. Boulicaut,<br />

F. Esposito, F. Giannotti, and D. Pedreschi, editors, Proceedings of the 8th European<br />

Conference on Principles and Practice of Knowledge Discovery in Databases,<br />

LNCS vol. 3202, pages 63–74. Springer.<br />

Brachman, R. and T. Anand. 1996. The process of knowledge discovery in databases: A<br />

human-centered approach. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,<br />

editors, Advances in Knowledge Discovery and Data Mining, pages 37–58, Menlo<br />

Park, CA, USA. AAAI Press.<br />

Buehren, M. 2008. Functions for the rectangular assignment problem.<br />

http://www.mathworks.com/matlabcentral/fileexchange/6543.<br />

Buscaldi, D. and P. Rosso. 2007. Upv-wsd : Combining different wsd methods by means of<br />

fuzzy borda voting. In Proceedings of the International SemEval Workshop, ACL 2007,<br />

pages 434–437, Prague, Czech Republic.<br />

Cai, D., X. He, Z. Li, W.Y. Ma, and J.R. Wen. 2004. Hierarchical clustering of www<br />

image search results using visual, textual and link information. In Proceedings of the<br />

12th Annual ACM International Conference on Multimedia, pages 952–959.<br />

Calinski, R.B. and J. Harabasz. 1974. A Dendrite Method for Cluster Analysis. Communications<br />

in Statistics, 3:1–27.<br />

Carpenter, G. and S. Grossberg. 1987. A massively parallel architecture for a self-organizing<br />

neural pattern recognition machine. Computer Vision, Graphics and Image Processing,<br />

37:54–115.<br />


Carpenter, G., S. Grossberg, and D. Rosen. 1991. Fuzzy ART: Fast stable learning and<br />

categorization of analog patterns by an adaptive resonance system. Neural Networks,<br />

4:759–771.<br />

Carreira-Perpiñán, M.A. 1997. A review of dimension reduction techniques. Technical<br />

Report CS-96-09, Department of Computer Science, University of Sheffield, Sheffield,<br />

UK.<br />

Chakaravathy, S.V. and J. Ghosh. 1996. Scale based clustering using a radial basis function<br />

network. IEEE Transactions on Neural Networks, 2(5):1250–1261.<br />

Cheeseman, P. and J. Stutz. 1996. Bayesian classification (Autoclass): theory and results.<br />

In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in<br />

Knowledge Discovery and Data Mining, pages 153–180, Menlo Park, CA. AAAI Press.<br />

Chen, F., U. Gargi, L. Niles, and H. Schütze. 1999. Multi-modal browsing of images in<br />

web documents. In D. Doermann, editor, Proceedings of the 1999 Symposium on Image<br />

Understanding Technology, pages 265–276, Annapolis, MD, USA. UMD.<br />

Chen, N. 2006. A Survey of Indexing and Retrieval of Multimodal Documents: Text and<br />

Images. Technical Report 2006-505, School of Computing, Queens University, Kingston,<br />

Ontario, Canada.<br />

Chiang, J. and P. Hao. 2003. A new kernel-based fuzzy clustering approach: support vector<br />

clustering with cell growing. IEEE Transactions on Fuzzy Systems, 11(4):518–527.<br />

Chu, S. and J. Roddick. 2000. A clustering algorithm using the Tabu search approach with<br />

simulated annealing. In N. Ebecken and C. Brebbia, editors, Data Mining II–Proceedings<br />

of the 2nd International Conference on Data Mining Methods and Databases, pages 515–<br />

523.<br />

Cobo, G., X. Sevillano, F. Alías, and J.C. Socoró. 2006. Técnicas de representación de textos para clasificación no supervisada de documentos. Journal of the Spanish Society for Natural Language Processing (Procesamiento del Lenguaje Natural), in Spanish, 37:329–336.

Condorcet, M. de. 1785. Essai sur l'application de l'analyse à la probabilité des decisions rendues à la pluralité des voix.

Cover, T.M. and J.A. Thomas. 1991. Elements of information theory. John Wiley and<br />

Sons.<br />

Cutting, D.R., D.R. Karger, J.O. Pedersen, and J.W. Tukey. 1992. Scatter/Gather: a<br />

cluster-based approach to browsing large document collections. In Proceedings of the<br />

15th annual international ACM SIGIR conference on Research and Development in<br />

Information Retrieval, pages 318–329, Copenhagen, Denmark, June.<br />

Dasgupta, S., C. Papadimitriou, and U. Vazirani. 2006. Algorithms. McGraw-Hill.<br />

Davies, D.L and D.W. Bouldin. 1979. A Cluster Separation Measure. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 1:224–227.<br />


Deerwester, S., S.-T. Dumais, G.-W. Furnas, T.-K. Landauer, and R. Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Denoeud, L. and A. Guénoche. 2006. Comparison of distance indices between partitions.<br />

In V. Batagelj, H.H. Bock, A. Ferligoj, and A. Žiberna, editors, Data Science and

Classification, pages 21–28. Springer.<br />

Dhillon, I., J. Fan, and Y. Guan. 2001. Efficient clustering of very large document collections.<br />

In R.L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R.R. Namburu,<br />

editors, Data Mining for Scientific and Engineering Applications. Kluwer Academic<br />

Publishers.<br />

Dietterich, T.G. 2000. Ensemble methods in machine learning. In J. Kittler and F. Roli,<br />

editors, Multiple Classifier Systems, LNCS vol. 1857, pages 1–15. Springer.<br />

Dimitriadou, E., A. Weingessel, and K. Hornik. 2001. Voting-merging: An ensemble<br />

method for clustering. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial<br />

Neural Networks-ICANN 2001, LNCS vol. 2130, pages 217–224. Springer.<br />

Dimitriadou, E., A. Weingessel, and K. Hornik. 2002. A combination scheme for fuzzy<br />

clustering. International Journal of Pattern Recognition and Artificial Intelligence,<br />

16(7):901–912.<br />

Ding, C., X. He, and H.D. Simon. 2005. On the equivalence of nonnegative matrix factorization<br />

and spectral clustering. In Proceedings of the 2005 SIAM Conference on Data<br />

Mining, pages 606–610.<br />

Duda, R.O., P.E. Hart, and D.G. Stork. 2001. Pattern Classification. Wiley Interscience.<br />

Dudoit, S. and J. Fridlyand. 2003. Bagging to Improve the Accuracy of a Clustering<br />

Procedure. Bioinformatics, 19(9):1090–1099.<br />

Dunn, J.C. 1973. A fuzzy relative of the ISODATA process and its use in detecting compact<br />

well-separated clusters. Journal on Cybernetics, 3:32–57.<br />

Duygulu, P., K. Barnard, N. de Freitas, and D. Forsyth. 2002. Object recognition as<br />

machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of<br />

the Seventh European Conference on Computer Vision, volume 4, pages 97–112. Springer<br />

Verlag.<br />

Dy, J.G. and C.E. Brodley. 2004. Feature Selection for Unsupervised Learning. Journal of<br />

Machine Learning Research, 5:845–889.<br />

Ertz, L., M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and<br />

densities in noisy, high dimensional data. In Proceedings of the 2nd SIAM International<br />

Conference on Data Mining, pages 47–58, San Francisco, CA, USA.<br />

Ester, M., H.P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering<br />

clusters in large spatial data sets with noise. In Proceedings of the 2nd International<br />

Conference on Knowledge Discovery and Data Mining, pages 226–231, Portland, OR,<br />

USA.<br />


Fayyad, U. 1996. Data mining and knowledge discovery: making sense out of data. IEEE<br />

Expert: Intelligent Systems and Their Applications, 11(5):20–25.<br />

Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth. 1996. From data mining to knowledge<br />

discovery: an overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,<br />

editors, Advances in Knowledge Discovery and Data Mining, pages 1–30, Menlo<br />

Park, CA, USA. AAAI Press.<br />

Fenty, J. 2004. Analyzing distances. The Stata Journal, 4(1):1–26.<br />

Fern, X.Z. and C.E. Brodley. 2003. Random Projection for High Dimensional Data Clustering:<br />

A Cluster Ensemble Approach. In Proceedings of 20th International Conference<br />

on Machine Learning, Washington DC, VA, USA.<br />

Fern, X.Z. and C.E. Brodley. 2004. Solving cluster ensemble problems by bipartite graph<br />

partitioning. In Proceedings of the 21st International Conference on Machine Learning,<br />

pages 281–288.<br />

Fern, X.Z. and W. Lin. 2008. Cluster ensemble selection. In Proceedings of the 2008 SIAM<br />

International Conference on Data Mining.<br />

Filkov, V. and S. Skiena. 2004. Integrating microarray data by consensus clustering.<br />

International Journal of Artificial Intelligence Tools, 13(4):863–880.<br />

Fischer, B. and J.M. Buhmann. 2003. Bagging for path-based clustering. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 25(11):1411–1415.<br />

Focardi, S.M. 2001. Clustering economic and financial time series: exploring the existence<br />

of stable correlation conditions. Technical Report 2001-04, The Intertek Group, Paris,<br />

France.<br />

Fodor, I.K. 2002. A survey of dimension reduction techniques. Technical Report UCRL-<br />

ID-148494, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory,

Livermore, CA.<br />

Forgy, E. 1965. Cluster analysis of multivariate data: efficiency vs. interpretability of<br />

classifications. Biometrics, 21:768–780.<br />

Fowlkes, E.B. and C.L. Mallows. 1983. A method for comparing two hierarchical clusterings.<br />

Journal of the American Statistical Association, 78(383):553–569.<br />

Fred, A. 2001. Finding consistent clusters in data partitions. In J. Kittler and F. Roli,<br />

editors, Multiple Classifier Systems, LNCS vol. 2096, pages 309–318. Springer.<br />

Fred, A. and A.K. Jain. 2002a. Data clustering using evidence accumulation. In Proceedings<br />

of the 16th International Conference on Pattern Recognition, pages 276–280.<br />

Fred, A. and A.K. Jain. 2002b. Evidence accumulation clustering based on the k-means<br />

algorithm. In T. Caelli, A. Amin, R.P.W. Duin, M. Kamel, and D. de Ridder, editors,<br />

Structural, Syntactic, and Statistical Pattern Recognition, LNCS vol. 2396, pages 442–<br />

451. Springer.<br />


Fred, A. and A.K. Jain. 2003. Robust data clustering. In Proceedings of the 2003 IEEE<br />

Computer Society Conference on Computer Vision and Pattern Recognition, volume 2,

pages 128–133.<br />

Fred, A. and A.K. Jain. 2005. Combining Multiple Clusterings Using Evidence Accumulation.<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850.<br />

Gao, B., T.Y. Liu, T. Qin, X. Zheng, Q.S. Cheng, and W.Y. Ma. 2005. Web image clustering<br />

by consistent utilization of visual features and surrounding texts. In Proceedings of the<br />

13th Annual ACM International Conference on Multimedia, pages 112–121.<br />

Gionis, A., H. Mannila, and P. Tsaparas. 2007. Clustering Aggregation. ACM Transactions<br />

on Knowledge Discovery from Data, 1(1):1–30.<br />

Goder, A. and V. Filkov. 2008. Consensus clustering algorithms: Comparison and refinement.<br />

In Proceedings of the 2008 SIAM Workshop on Algorithm Engineering and<br />

Experiments (ALENEX), pages 109–117.<br />

Gonzàlez, E. and J. Turmo. 2006. Unsupervised Document Clustering by Weighted Combination.<br />

LSI Research Report LSI-06-17-R, Departament de Llenguatges i Sistemes<br />

Informátics, Barcelona.<br />

Gonzàlez, E. and J. Turmo. 2008a. Comparing Non-Parametric Ensemble Methods. In<br />

E. Kapetanios, V. Sugumaran, and M. Spiliopoulou, editors, Proceedings of the 13th<br />

International Conference on Applications of Natural Language to Information Systems,

LNCS vol. 5039, pages 245–256. Springer.<br />

Gonzàlez, E. and J. Turmo. 2008b. Non-Parametric Document Clustering by Ensemble<br />

Methods. Procesamiento del Lenguaje Natural, 40:91–98.<br />

Gopal, S. and C. Woodcock. 1994. Theory and methods for accuracy assessment of thematic<br />

maps using fuzzy sets. Photogrammetric Engineering and Remote Sensing, 60(2):181–<br />

188.<br />

Greene, D. and P. Cunningham. 2006. Efficient ensemble methods for document clustering.<br />

Technical Report CS-2006-48, Trinity College Dublin.<br />

Greene, D., A. Tsymbal, N. Bolshakova, and P. Cunningham. 2004. Ensemble clustering in<br />

medical diagnostics. In Proceedings of the 17th IEEE Symposium on Computer-Based<br />

Medical Systems, pages 576–581. IEEE Computer Society.<br />

Gunes, H. and M. Piccardi. 2005. Affect recognition from face and body: early fusion vs.<br />

late fusion. In Proceedings of 2005 IEEE International Conference on Systems, Man<br />

and Cybernetics, vol. 4, pages 3437–3443.<br />

Hadjitodorov, S.T. and L.I. Kuncheva. 2007. Selecting diversifying heuristics for cluster<br />

ensembles. In M. Haindl, J. Kittler, and F. Roli, editors, Proceedings of the 7th International<br />

Workshop on Multiple Classifier Systems, LNCS vol. 4472, pages 200–209.<br />

Springer.<br />

Hadjitodorov, S.T., L.I. Kuncheva, and L.P. Todorova. 2006. Moderate diversity for better<br />

cluster ensembles. Information Fusion, 7(3):264–275.<br />


Halkidi, M., Y. Batistakis, and M. Vazirgiannis. 2002a. Cluster Validity Methods : Part I.<br />

ACM SIGMOD Record, 31(2):40–45.<br />

Halkidi, M., Y. Batistakis, and M. Vazirgiannis. 2002b. Cluster Validity Methods : Part<br />

II. ACM SIGMOD Record, 31(3):19–27.<br />

Hall, L., I. Özyurt, and J. Bezdek. 1999. Clustering with a genetically optimized approach.<br />

IEEE Transactions on Evolutionary Computation, 3(2):103–112.<br />

Hastad, J., B. Just, J.C. Lagarias, and C.P. Schnorr. 1988. Polynomial Time Algorithms

for Finding Integer Relations among Real Numbers. SIAM Journal of Computing,<br />

18(5):859–881.<br />

He, Z., X. Xu, and S. Deng. 2005. A cluster ensemble method for clustering categorical

data. Information Fusion, 6(2):143–151.<br />

Hearst, M.A. 2006. Clustering versus faceted categories for information exploration. Communications<br />

of the ACM, 49(4):59–61.<br />

Hettich, S. and S.D. Bay. 1999. The UCI KDD Archive. http://kdd.ics.uci.edu. University<br />

of California at Irvine, Dept. of Information and Computer Science.<br />

Hinneburg, A. and D. Keim. 1998. An efficient approach to clustering in large multimedia<br />

data sets with noise. In Proceedings of the 4th International Conference on Knowledge<br />

Discovery and Data Mining, pages 58–65.<br />

Hofmann, T. and J. Buhmann. 1997. Pairwise data clustering by deterministic annealing.<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14.<br />

Höppner, F., F. Klawonn, and R. Kruse. 1999. Fuzzy Cluster Analysis: Methods for<br />

Classification, Data Analysis, and Image Recognition. Wiley.<br />

Hore, P., L. Hall, and D. Goldgof. 2006. A Cluster Ensemble Framework for Large Data

sets. In Proceedings of the 2006 IEEE International Conference on Systems, Man and<br />

Cybernetics, volume 4, pages 3342–3347.<br />

Hoyer, P.O. 2004. Non-Negative Matrix Factorization with Sparseness Constraints. Journal<br />

on Machine Learning Research, 5:1457–1469.<br />

Hubert, L. and P. Arabie. 1985. Comparing Partitions. Journal of Classification, 2:193–<br />

218.<br />

Hyvärinen, A. 1999. Fast and Robust Fixed-Point Algorithms for Independent Component<br />

Analysis. IEEE Trans. on Neural Networks, 10(3):626–634.<br />

Hyvärinen, A., J. Karhunen, and E. Oja. 2001. Independent Component Analysis. John<br />

Wiley and Sons.<br />

ImageCLEF. accessed on May 2009. The CLEF cross language image retrieval track.<br />

http://www.imageclef.org.<br />


Ingaramo, D., D. Pinto, P. Rosso, and M. Errecalde. 2008. Evaluation of internal validity<br />

measures in short-text corpora. In A. Gelbukh, editor, Proceedings of the 9th<br />

International Conference on Intelligent Text Processing and Computational Linguistics,<br />

volume 4919 of Lecture Notes in Computer Science, pages 555–567. Springer Verlag,<br />

Berlin, Heidelberg, New York.<br />

InternetWorldStats.com. accessed on February 2009. Internet usage statistics february<br />

2009. http://www.internetworldstats.com/stats.htm.<br />

Jäger, G. and U. Benz. 2000. Measures of Classification Accuracy Based on Fuzzy Similarity.<br />

IEEE Transactions on Geoscience and Remote Sensing, 38(3):1462–1467.<br />

Jain, A.K. 1996. Image segmentation using clustering. In K. Bowyer and N. Ahuja, editors,<br />

Advances in Image Understanding. IEEE Press.<br />

Jain, A.K. and R.C. Dubes. 1988. Algorithms for clustering data. Prentice Hall.<br />

Jain, A.K., M.N. Murty, and P.J. Flynn. 1999. Data Clustering: a Survey. ACM Computing<br />

Surveys, 31(3):264–323.<br />

Jakobsson, M. and N.A. Rosenberg. 2007. CLUMPP: a cluster matching and permutation<br />

program for dealing with label switching and multimodality in analysis of population<br />

structure. Bioinformatics, 23:1801–1806.<br />

Jiang, D., C. Tang, and A. Zhang. 2004. Cluster analysis for gene expression data: a<br />

survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370–1386.<br />

Jolliffe, I.T. 1986. Principal Component Analysis. Springer.<br />

Jomier, J., V. LeDigarcher, and S.R. Aylward. 2005. Comparison of vessel segmentations<br />

using STAPLE. In J. Duncan, editor, Proceedings of the 8th International Conference on<br />

Medical Image Computing and Computer-Assisted Intervention, pages 523–530, LNCS<br />

3749. Springer.<br />

Kaban, A. and M. Girolami. 2000. Unsupervised Topic Separation and Keyword Identification<br />

in Document Collections: A Projection Approach. Technical Report No. 10, Dept.<br />

of Computing and Information Systems, University of Paisley.<br />

Käki, M. 2005. Findex: search result categories help users when document ranking fails. In<br />

Proc. ACM SIGCHI Int’l Conference on Human Factors in Computing Systems, pages<br />

131–140. ACM Press.<br />

Kalska, E.P. 2005. Dissimilarity Representations in Pattern Recognition. Ph.D. thesis,<br />

Delft University of Technology, The Netherlands.<br />

Karayiannis, N., J. Bezdek, N. Pal, R. Hathaway, and P. Pai. 1996. Repairs to GLVQ: A<br />

new family of competitive learning schemes. IEEE Transactions on Neural Networks,<br />

7(5):1062–1071.<br />

Karypis, G., R. Aggarwal, V. Kumar, and S. Shekhar. 1997. Multilevel hypergraph partitioning:<br />

applications in VLSI domain. In Proceedings of the 34th Design and Automation<br />

Conference, pages 526–529.<br />


Karypis, G., E. Han, and V. Kumar. 1999. Chameleon: Hierarchical clustering using<br />

dynamic modeling. IEEE Computer, 32(8):68–75.<br />

Karypis, G. and V. Kumar. 1998. A fast and high quality multilevel scheme for partitioning<br />

irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392.<br />

Kaski, S. 1998. Dimensionality Reduction by Random Mapping: Fast Similarity Computation<br />

for Clustering. In Proceedings of the International Joint Conference on Neural<br />

Networks, pages 413–418, Anchorage, AK, USA.<br />

Kaufman, L. and P. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster<br />

Analysis. New York, NY: John Wiley and Sons.<br />

Kleinberg, J. 2002. An impossibility theorem for clustering. Proceedings of the 2002<br />

Conference on Advances in Neural Information Processing Systems, 15:463–470.<br />

Klosgen, W., J.M. Zytkow, and J. Zyt. 2002. Handbook of Data Mining and Knowledge<br />

Discovery. USA: Oxford University Press.<br />

Kohavi, R. and G.H. John. 1998. The wrapper approach. In H. Liu and H. Motoda,<br />

editors, Feature Extraction, Construction and Selection: A Data Mining Perspective,<br />

pages 33–50. Springer-Verlag.<br />

Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480.<br />

Kotsiantis, S. and P. Pintelas. 2004. Recent advances in clustering: a brief survey. WSEAS<br />

Transactions on Information Science and Applications, 1(1):73–81.<br />

Kuhn, H. 1955. The Hungarian Method for the Assignment Problem. Naval Research<br />

Logistic Quarterly, 2:83–97.<br />

Kuncheva, L.I., S.T. Hadjitodorov, and L.P. Todorova. 2006. Experimental comparison<br />

of cluster ensemble methods. In Proceedings of the 9th International Conference on<br />

Information Fusion, pages 24–28.<br />

La Cascia, M., S. Sethi, and S. Sclaroff. 1998. Combining textual and visual cues for

content-based image retrieval on the World Wide Web. In Proceedings of the IEEE<br />

Workshop on Content-Based Access of Image and Video Libraries, pages 24–28.<br />

Lange, T. and J.M. Buhmann. 2005. Combining partitions by probabilistic label aggregation.

In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge<br />

Discovery and Data Mining, pages 147–156. ACM Press.<br />

Larsen, B. and C. Aone. 1999. Fast and effective text mining using linear time document

clustering. In Proceedings of the 5th International Conference on Knowledge Discovery<br />

and Data Mining, pages 16–22.<br />

Lee, D.D. and H.S. Seung. 1999. Learning the Parts of Objects by Non-Negative Matrix<br />

Factorization. Nature, 401:788–791.<br />

Lee, D.D. and H.S. Seung. 2001. Algorithms for Non-Negative Matrix Factorization. Advances<br />

in Neural Information Processing Systems, 13.<br />


Li, S.Z. and G. GuoDong. 2000. Content-based Audio Classification and Retrieval using<br />

SVM Learning. In Proceedings of the 1st IEEE Pacific-Rim Conference on Multimedia<br />

(Invited talk).<br />

Li, T., C. Ding, and M.I. Jordan. 2007. Solving Consensus and Semi-supervised Clustering<br />

Problems Using Nonnegative Matrix Factorization. In Proceedings of the 7th IEEE<br />

International Conference on Data Mining, pages 577–582.<br />

Lin, J. and D. Gunopulos. 2003. Dimensionality Reduction by Random Projection and<br />

Latent Semantic Indexing. In Proceedings of the 2003 SIAM Conference on Data Mining.

Linnaeus, C. 1758. Systema Naturae per regna tria naturae, secundum classes, ordines,<br />

genera, species, cum characteribus, differentiis, synonymis, locis. Editio decima, reformata.<br />

Holmiae: Laurentius Salvius.

Liu, W. and Y. Luo. 2005. Applications of clustering data mining in customer analysis<br />

in department store. In Proceedings of the 2005 IEEE International Conference on<br />

Services Systems and Services Management, volume 2, pages 1042–1046.<br />

Loeff, N., C. Ovesdotter-Alm, and D.A. Forsyth. 2006. Discriminating image senses by clustering<br />

with multimodal features. In Proceedings of the COLING/ACL 2006 Conference,<br />

pages 547–554.<br />

Long, B., Z.M. Zhang, and P.S. Yu. 2005. Combining Multiple Clusterings by Soft Correspondence.<br />

In Proceedings of the 5th IEEE International Conference on Data Mining,<br />

pages 282–289.<br />

Maimon, O. and L. Rokach. 2005. Data Mining and Knowledge Discovery Handbook. New<br />

York: Springer.<br />

Mancas-Thillou, C. and B. Gosselin. 2007. Natural scene text understanding. In G. Obinata<br />

and A. Dutta, editors, Vision Systems, Segmentation and Pattern Recognition, pages<br />

307–333, Vienna, Austria. I-Tech Education and Publishing.<br />

Maulik, U. and S. Bandyopadhyay. 2002. Performance Evaluation of Some Clustering<br />

Algorithms and Validity Indices. IEEE Transactions on Pattern Analysis and Machine<br />

Intelligence, 24(12):1650–1654.<br />

McLachlan, G. and T. Krishnan. 1997. The EM Algorithm and Extensions. New York:

Wiley.<br />

Meila, M. 2003. Comparing clusterings by the variation of information. In B. Scholkopf and<br />

M.K. Warmuth, editors, Proceedings of the 16th Annual Conference on Computational<br />

Learning Theory, pages 173–187, LNAI 2777. Springer.<br />

Minaei-Bidgoli, B., A. Topchy, and W.F. Punch. 2004. Ensembles of partitions via data<br />

resampling. In Proceedings of the 2004 International Conference on Information Technology,<br />

volume 2, pages 188–192.<br />

Mirkin, B.G. 1975. On the problem of reconciling partitions. In H.M. Blalock, A. Aganbegian,<br />

F.M. Borodkin, R. Boudon, and V. Capecchi, editors, Quantitative Sociology: International<br />

Perspectives on Mathematical and Statistical Modelling—Quantitative Studies<br />

in Social Relations, pages 441–449, New York. Academic Press.<br />


Miyajima, K. and A. Ralescu. 1993. Modeling of Natural Objects Including Fuzziness and<br />

Application to Image Understanding. In Proceedings of the 2nd IEEE International<br />

Conference on Fuzzy Systems, pages 1049–1054.<br />

Molina, L.C., L. Belanche, and A. Nebot. 2002. Feature selection algorithms: a survey and<br />

experimental evaluation. In Proceedings of the 2002 IEEE International Conference on<br />

Data Mining, pages 306–313.<br />

Montague, M. and J.A. Aslam. 2002. Condorcet Fusion for Improved Retrieval. In Proceedings<br />

of the 2002 ACM Conference on Information and Knowledge Management, pages<br />

538–548.<br />

NetCraft.com. accessed on February 2009. February 2009 web server survey.<br />

http://news.netcraft.com/archives/2009/02/index.html.<br />

Neumann, D.A. and V.T. Norton. 1986. Clustering and isolation in the consensus problem<br />

for partitions. Journal of Classification, 3:281–298.<br />

Ng, A., M.I. Jordan, and Y. Weiss. 2002. On spectral clustering: analysis and an algorithm.<br />

In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information<br />

Processing Systems, volume 14. MIT Press.<br />

Nguyen, N. and R. Caruana. 2007. Consensus Clusterings. In Proceedings of the 7th IEEE<br />

International Conference on Data Mining, pages 607–612.<br />

Oja, E. 1992. Principal components minor components, and linear neural networks. Neural<br />

Networks, 5:927–935.<br />

Patrikainen, A. and M. Meila. 2005. Spectral Clustering for Microsoft Netscan Data. Technical<br />

report, UW-CSE-2005-06-05, Department of Computer Science and Engineering,<br />

University of Washington, Seattle, July.<br />

Piatetsky-Shapiro, G. 1991. Knowledge Discovery in Real Databases: A Report on the<br />

IJCAI-89 Workshop. AI Magazine, 11(5):68–70.<br />

Pinto, D.E. 2008. On Clustering and Evaluation of Narrow Domain Short-Text Corpora.<br />

Ph.D. thesis, Universidad Politécnica de Valencia, July.<br />

Pinto, F.R., J.A. Carriço, M. Ramirez, and J.S. Almeida. 2007. Ranked Adjusted Rand:<br />

integrating distance and partition information in a measure of clustering agreement.<br />

BMC Bioinformatics, 8(44):1–13.<br />

Punera, K. and J. Ghosh. 2007. Soft Consensus Clustering. In J. Oliveira and W. Pedrycz,<br />

editors, Advances in Fuzzy Clustering and its Applications, pages 69–92. Wiley.<br />

Rand, W.M. 1971. Objective criteria for the evaluation of clustering methods. Journal of<br />

the American Statistics Association, 66:846–850.<br />

Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464.<br />

Scott, G., D. Clark, and T. Pham. 2001. A genetic clustering algorithm guided by a descent<br />

algorithm. In Proceedings of the Congress on Evolutionary Computation, volume 2,<br />

pages 734–740, Piscataway, NJ, USA.<br />


Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing<br />

Surveys, 34(1):1–47.<br />

Selim, S. and K. Al-Sultan. 1991. A simulated annealing algorithm for the clustering<br />

problems. Pattern Recognition, 24(10):1003–1008.<br />

Sevillano, X., F. Alías, and J.C. Socoró. 2007b. BordaConsensus: a New Consensus Function<br />

for Soft Cluster Ensembles. In Proceedings of the 30th ACM SIGIR Conference,<br />

pages 743–744, Amsterdam, The Netherlands, July.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2006a. Feature Diversity in Cluster<br />

Ensembles for Robust Document Clustering. In Proceedings of the 29th ACM SIGIR<br />

Conference, pages 697–698, Seattle, WA, USA, August.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2007a. A hierarchical consensus architecture<br />

for robust document clustering. In G. Amati, C. Carpineto, and G. Romano,<br />

editors, Proceedings of 29th European Conference on Information Retrieval, volume 4425<br />

of Lecture Notes in Computer Science, pages 741–744, Rome, Italy. Springer Verlag.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2007c. Text clustering on latent thematic<br />

spaces: Variants, strenghts and weaknesses. In M.E. Davies, C.C. James, S.A. Abdallah,<br />

and M.D. Plumbley, editors, Proceedings of 7th International Conference on Independent<br />

Component Analysis and Signal Separation, volume 4666 of Lecture Notes in Computer<br />

Science, pages 794–801, London, UK. Springer Verlag.<br />

Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2006b. Robust Document Clustering by<br />

Exploiting Feature Diversity in Cluster Ensembles. Journal of the Spanish Society for<br />

Natural <strong>La</strong>nguage Processing (Procesamiento del Lenguaje Natural), 37:169–176.<br />

Sevillano, X., J. Melenchón, G. Cobo, J.C. Socoró, and F. Alías. 2009. Audiovisual analysis<br />

and synthesis for multimodal human-computer interfaces. In M. Redondo, C. Bravo,<br />

and M. Ortega, editors, Engineering the User Interface: From Research to Practice,<br />

pages 179–194, London. Springer Verlag.<br />

Shafiei, M., S. Wang, R. Zhang, E. Milios, B. Tang, J. Tougas, and R. Spiteri. 2006.<br />

A Systematic Study of Document Representation and Dimension Reduction for Text<br />

Clustering. Technical report, CS-2006-05, Dalhousie University.<br />

Shahnaz, F., M.W. Berry, V.P. Pauca, and R.J. Plemmons. 2004. Document clustering<br />

using nonnegative matrix factorization. Information Processing and Management,<br />

42:373–386.<br />

Sharan, R. and R. Shamir. 2000. CLICK: A clustering algorithm with applications to gene<br />

expression analysis. In Proceedings of the 8th International Conference on Intelligent<br />

Systems for Molecular Biology, pages 307–316.<br />

Sheikholeslami, C., S. Chatterjee, and A. Zhang. 1998. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Data sets. In A. Gupta, O. Shmueli, and J. Widom, editors, Proceedings of the 24th International Conference on Very Large Data Bases, pages 428–439, New York, NY, USA. Morgan Kaufmann.

Shi, J. and J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 22(8):888–905.<br />


Snoek, C.G.M., M. Worring, and A.W.M. Smeulders. 2005. Early versus <strong>La</strong>te Fusion in<br />

Semantic Video Analysis. In Proceedings of the 13th ACM International Conference on<br />

Multimedia, pages 399–402.<br />

Srinivasan, S.H. 2002. Features for Unsupervised Document Classification. In Proceedings<br />

of the 6th Conference on Computational Natural Language Learning, pages 36–42,

Taipei, Taiwan.<br />

Stein, B. and O. Niggemann. 1999. On the nature of structure and its identification. In<br />

P. Widmayer, G. Neyer, and S. Eidenbenz, editors, Proceedings of the 25th International<br />

Workshop on Graph-Theoretic Concepts in Computer Science, volume 1665 of Lecture<br />

Notes in Computer Science, pages 122–134. Springer Verlag, Berlin, Heidelberg, New<br />

York.<br />

Steinbach, M., G. Karypis, and V. Kumar. 2004. A comparison of common document<br />

clustering techniques. In Proceedings of the KDD Workshop on Text Mining, pages<br />

17–26, Boston, MA, USA.<br />

Strehl, A. 2002. Relationship-based Clustering and Cluster Ensembles for High-dimensional<br />

Data Mining. Ph.D. thesis, Faculty of the Graduate School of The University of Texas<br />

at Austin, May.<br />

Strehl, A. and J. Ghosh. 2002. Cluster Ensembles – A Knowledge Reuse Framework for<br />

Combining Multiple Partitions. Journal on Machine Learning Research, 3:583–617.<br />

Tang, B., X. Luo, M.I. Heywood, and M. Shepherd. 2004. A Comparative Study of Dimension<br />

Reduction Techniques for Document Clustering. Technical Report CS-2004-14,<br />

Faculty of Computer Science, Dalhousie University, Halifax, Canada.<br />

Tang, B., M. Shepherd, E. Milios, and M.I. Heywood. 2005. Comparing and Combining<br />

Dimension Reduction Techniques for Efficient Text Clustering. In Proceedings of the<br />

International Workshop on Feature Selection for Data Mining, pages 17–26, Newport<br />

Beach, CA, USA.<br />

Theodoridis, S. and K. Koutroumbas. 1999. Pattern Recognition. Academic Press.<br />

Tjhi, W.C. and L. Chen. 2007. Dual Fuzzy-possibilistic Co-clustering for Document Categorization<br />

and Summarization. In Optimization-based Data Mining Techniques with<br />

Applications Workshop of the IEEE International Conference on Data Mining, pages<br />

259–264.<br />

Tjhi, W.C. and L. Chen. 2009. Dual Fuzzy-possibilistic Co-clustering for Categorization of<br />

Documents. IEEE Transactions on Fuzzy Systems. Accepted for future publication (as<br />

of May 2009).<br />

Tombros, A., R. Villa, and C.J. van Rijsbergen. 2002. The effectiveness of query-specific<br />

hierarchic clustering in information retrieval. International Journal on Information<br />

Processing and Management, 38(4):559–582.<br />

Topchy, A., A.K. Jain, and W. Punch. 2003. Combining Multiple Weak Clusterings. In<br />

Proceedings of the 3rd IEEE International Conference on Data Mining, pages 331–338,<br />

Melbourne, FLA, USA.<br />


Topchy, A., A.K. Jain, and W. Punch. 2004. A Mixture Model for Clustering Ensembles.<br />

In Proceedings of the 2004 SIAM Conference on Data Mining, pages 379–390.<br />

Topchy, A., M. Law, A.K. Jain, and A. Fred. 2004. Analysis of consensus partition in

clustering ensemble. In Proceedings of the 4th International Conference on Data Mining,<br />

pages 225–232, Brighton, UK.<br />

Torkkola, K. 2003. Discriminative features for text document classification. Pattern Analysis<br />

and Applications, 6(4):301–308.<br />

Tseng, L. and S. Yang. 2001. A genetic approach to the automatic clustering problem.<br />

Pattern Recognition, 34:415–424.<br />

Turnbull, D., L. Barrington, D. Torres, and G. Lanckriet. 2007. Towards Musical Query-by-Semantic-Description

using the CAL500 Dataset. In Proceedings of the 30th ACM<br />

SIGIR Conference, pages 439–446, Amsterdam, The Netherlands, July.<br />

Valencia, A. 2002. Search and retrieve. EMBO Reports, 3(5):396–400.<br />

van Dongen, S. 2000. Performance criteria for graph clustering and Markov cluster experiments.<br />

Technical Report INS-R0012, Centrum voor Wiskunde en Informatica.<br />

van Erp, M., L. Vuurpijl, and L. Schomaker. 2002. An overview and comparison of voting<br />

methods for pattern recognition. In Proceedings of the Eighth International Workshop<br />

on Frontiers in Handwriting Recognition, pages 195–200, Ontario, Canada, August.<br />

van Rijsbergen, C.J. 1979. Information Retrieval. Buttersworth-Heinemann.<br />

VideoCLEF. accessed on May 2009. The CLEF cross language video retrieval track.<br />

http://www.cdvp.dcu.ie/VideoCLEF.<br />

von Luxburg, U. 2006. A tutorial on spectral clustering. Technical Report TR-149, Department<br />

for Empirical Inference, Max Planck Institute for Biological Cybernetics.<br />

Wallace, C. and D. Dowe. 1994. Intrinsic classification by MML – the Snob program.<br />

In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pages<br />

37–44, Armidale, Australia.<br />

Wang, W., J. Yang, and R. Muntz. 1997. Sting: A statistical information grid approach<br />

to spatial data mining. In M. Jarke, M.J. Carey, K.R. Dittrich, F.H. Lochovsky,<br />

P. Loucopoulos, and M.A. Jeusfeld, editors, Proceedings of the 23rd International Conference<br />

on Very Large Data Bases, pages 186–195, Athens, Greece. Morgan Kaufmann.

Witten, I.H. and E. Frank. 2005. Data mining: practical machine learning tools and<br />

techniques. Morgan Kauffman Publishers.<br />

Wunsch, D., T. Caudell, C. Capps, R. Marks, and R. Falk. 1993. An optoelectronic<br />

implementation of the adaptive resonance neural network. IEEE Transactions on Neural<br />

Networks, 4(4):673–684.<br />

www.who.int. accessed on February 2009. World Health Organization International Classification<br />

of Diseases (ICD). http://www.who.int/classifications/icd/en/.<br />


Xu, R. and D. Wunsch II. 2005. Survey of Clustering Algorithms. IEEE Transactions on<br />

Neural Networks, 16(2):645–678.<br />

Xu, W., X. Liu, and Y. Gong. 2003. Document Clustering Based on Non-Negative Matrix<br />

Factorization. In Proceedings of the 26th ACM SIGIR Conference, volume 2, pages<br />

267–273, Toronto, Canada.<br />

Yang, J. and S. Olafsson. 2005. Near-optimal feature selection. In Proceedings of the<br />

International Workshop on Feature Selection for Data Mining, pages 27–34, Newport<br />

Beach, CA.<br />

Zahn, C.T. 1971. Graph-theoretical methods for detecting and describing Gestalt clusters.<br />

IEEE Transactions on Computers, 20(1):68–86.<br />

Zeng, Y., J. Tang, J. Garcia-Frias, and G.R. Gao. 2002. An adaptive meta-clustering<br />

approach: combining the information from different clustering results. In Proceedings<br />

of the IEEE Computer Society Conference on Bioinformatics, pages 276–287.<br />

Zhao, R. and W.I. Grosky. 2002. Negotiating the semantic gap: from feature maps to<br />

semantic landscapes. Pattern Recognition, 35:593–600.<br />

Zhao, Y. and G. Karypis. 2001. Criterion functions for document clustering: Experiments<br />

and analysis. Technical Report TR #0140, Department of Computer Science, University<br />

of Minnesota, Minneapolis.<br />

Zhao, Y. and G. Karypis. 2003a. Clustering in life sciences. In A. Khodursky and Brownstein<br />

M., editors, Functional Genomics: Methods and Protocols, pages 183–218. Humana<br />

Press.<br />

Zhao, Y. and G. Karypis. 2003b. Hierarchical Clustering Algorithms for Document<br />

Datasets. Technical Report UMN CS #03-027, Department of Computer Science, University<br />

of Minnesota, Minneapolis.<br />



Appendix A<br />

Experimental setup<br />

A.1 The CLUTO clustering package<br />

All the clustering algorithms employed in the experimental sections of this work have been extracted from the CLUTO clustering toolkit. In its authors' words, "CLUTO is a software package for clustering low- and high-dimensional data sets and for analyzing the characteristics of the obtained clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, geographic information systems, science, and biology". It is available online for download at http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download. We chose CLUTO as our clustering algorithm provider due to its ease of use, robustness, speed and scalability, as CLUTO's algorithms have been optimized for operating on very large data sets, both in terms of the number of objects (up to ∼10^5) and the number of features (up to ∼10^4).

CLUTO provides clustering algorithms based on the partitional, agglomerative, and graph-partitioning paradigms. Most of the algorithms implemented in CLUTO treat clustering as an optimization problem, thus seeking to maximize or minimize a particular clustering criterion function, which can be defined either globally or locally over the entire clustering solution space. As in any clustering process, computing the value of these criterion functions requires measuring the similarity between the objects in the data set. This means that, in order to apply a specific CLUTO clustering algorithm, it is necessary to select the desired:

– clustering strategy (clustering paradigm and specific implementation): CLUTO includes six implementations of partitional, hierarchical agglomerative and graph-based clustering strategies.

– criterion function: CLUTO provides a total of eleven criterion functions for driving its clustering algorithms.

– similarity measure: CLUTO allows measuring the similarity between objects using four distinct alternatives.

The six implementations of partitional, hierarchical agglomerative and graph-based clustering strategies available in the CLUTO clustering toolkit are briefly described in the following paragraphs:

1. direct: this method computes the desired k-way clustering solution by simultaneously finding all k clusters.

2. rb: repeated-bisecting clustering process, in which the desired k-way clustering solution is computed by performing a sequence of k − 1 repeated bisections of the data set. At each bisecting step, one of the obtained clusters is selected and bisected further, so that each partial 2-way clustering solution optimizes the selected clustering criterion function locally.

3. rbr: a refined repeated-bisecting method that performs a global optimization of the clustering solution obtained by the rb algorithm.

4. agglo: agglomerative hierarchical clustering that locally optimizes the selected criterion function, stopping the agglomeration process when k clusters are obtained.

5. bagglo: biased agglomerative clustering, which applies the agglo clustering method on an augmented representation of the objects, created by concatenating the d original attributes of each object and √n new features which are proportional to the similarity between that object and its cluster centroid according to a √n-way partitional clustering solution that is initially computed on the data set by means of the rb algorithm.

6. graph: graph-based clustering, in which the data set is modelled as a nearest-neighbor graph (each object is a vertex connected to the vertices representing its most similar objects) that is partitioned into k clusters according to one of the graph criterion functions.
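As an illustration of how the rb strategy proceeds, the following simplified sketch (not CLUTO code) performs k − 1 successive bisections with a 2-way k-means, always splitting the currently largest cluster; CLUTO itself selects the cluster to bisect and optimizes each split according to the chosen criterion function.

import numpy as np
from sklearn.cluster import KMeans

def repeated_bisection(X, k, random_state=0):
    # simplified repeated-bisecting clustering: start from a single cluster and
    # perform k-1 bisections (assumes the largest cluster always holds >= 2 objects)
    labels = np.zeros(len(X), dtype=int)
    for new_label in range(1, k):
        target = np.argmax(np.bincount(labels))      # largest cluster so far
        idx = np.where(labels == target)[0]
        split = KMeans(n_clusters=2, n_init=10,
                       random_state=random_state).fit_predict(X[idx])
        labels[idx[split == 1]] = new_label          # one half keeps the old label
    return labels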

An enumeration of the eleven criterion functions implemented in the CLUTO software<br />

package follows (Zhao and Karypis, 2001):<br />

a. i1 : internal criterion function that maximizes the sum of the average pairwise similarities between the objects assigned to each cluster, weighted according to the cluster's size. Its maximization is equivalent to minimizing the sum of squared distances between the objects in the same cluster, as in traditional k-means (Zhao and Karypis, 2001).

b. i2 : internal criterion function that maximizes the similarity between each object and<br />

the centroid of the cluster it is assigned to.<br />

c. e1 : external criterion function that minimizes the proximity between each cluster’s<br />

centroid and the common centroid of the rest of the data set.<br />

d. h1 : hybrid criterion function that simultaneously maximizes i1 and minimizes e1.

e. g1 : MinMaxCut criterion function applied on the graph obtained by computing pairwise<br />

object similarities, partitioning the objects into groups by minimizing the edgecut<br />

of each partition (for graph-based clustering only).<br />

f. g1p: normalized Cut criterion function applied on the graph obtained by viewing the<br />

objects and their features as a bipartite graph, simultaneously partitioning the objects<br />

and their features (for graph-based clustering only).<br />


g. slink: traditional single-link criterion function (for agglomerative hierarchical clustering<br />

only).<br />

h. wslink: cluster-weighted single-link criterion function (for agglomerative hierarchical<br />

clustering only).<br />

i. clink: traditional complete-link criterion function (for agglomerative hierarchical clustering<br />

only).<br />

j. wclink: cluster-weighted complete-link criterion function (for agglomerative hierarchical<br />

clustering only).<br />

k. upgma: traditional unweighted pair-group method with arithmetic means criterion<br />

function (for agglomerative hierarchical clustering only).<br />

Finally, as regards the similarity measures that can be employed by the clustering algorithms<br />

implemented in CLUTO, they are described next (Zhao and Karypis, 2001):<br />

i. cos: the similarity between objects is computed using the cosine function.<br />

ii. corr : the similarity between objects is computed using the correlation coefficient.<br />

iii. dist: the similarity between objects is computed to be inversely proportional to the<br />

Euclidean distance (for graph-based clustering only).<br />

iv. jacc: the similarity between objects is computed using the extended Jaccard coefficient<br />

(for graph-based clustering only).<br />
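To make the role of the criterion functions and similarity measures more tangible, the sketch below evaluates the i2 criterion (the sum of cosine similarities between each object and the centroid of its cluster) for a given partition; it is an illustrative reimplementation based on the definitions in (Zhao and Karypis, 2001), not CLUTO code.

import numpy as np

def cosine_sim(a, b):
    # the 'cos' similarity measure between two object vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def i2_criterion(X, labels):
    # i2: sum over all objects of the cosine similarity between each object and
    # the centroid of the cluster it is assigned to (to be maximized)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += sum(cosine_sim(x, centroid) for x in members)
    return total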

For further insight on the distinct implementations of the clustering strategies, formal<br />

definitions of the criterion functions and similarity measures, or the criterion functions<br />

optimization procedure, the interested reader is referred to (Zhao and Karypis, 2001; Zhao<br />

and Karypis, 2003b).<br />

As the reader may infer from the previous enumerations, not all the clustering strategy-criterion function-similarity measure combinations are possible in CLUTO. Table A.1 presents which triplets are allowed (denoted by √), which are not allowed (denoted by ×), and which have been employed in our experiments (denoted by •): 28 out of the 68 combinations allowed by CLUTO.

In the experiments, each specific algorithm is identified by the clustering strategy-similarity measure-criterion function triplet employed, e.g. agglo-cos-slink (agglomerative clustering using the single link criterion and measuring object proximity with the cosine similarity), graph-jacc-i2 (graph-based clustering using the internal criterion function #2 and the extended Jaccard coefficient), etc.

A.2 Data sets<br />

In this work we have applied clustering processes on a total of sixteen data sets, twelve unimodal and four multimodal. In this section, we present their main features, such as their origin, the number of objects they contain (denoted throughout this work by n), the number (d) and meaning of their attributes, and the expected number of categories (k).

Table A.1: Cross-option table indicating which clustering strategy-criterion function-similarity measure combinations are allowed (√), not allowed (×) and employed in our experiments (•). Its rows correspond to the six clustering strategies (rb, rbr, direct, agglo, bagglo and graph), each combined with the four similarity measures (cos, corr, dist and jacc), and its columns correspond to the eleven criterion functions (i1, i2, e1, h1, g1, g1p, slink, wslink, clink, wclink and upgma).



A.2.1 Unimodal data sets<br />


A total of twelve unimodal data sets have been used in the experimental sections of this thesis. Unless noted otherwise, these data sets have been obtained from two classic public data repositories of the data mining and machine learning research communities: the UCI Knowledge Discovery in Databases Archive (Hettich and Bay, 1999) and the UCI Machine Learning Repository (Asuncion and Newman, 1999). In the following paragraphs, we present a brief description of each data set, summarizing their most relevant characteristics in Table A.2 as a quick reference source.

1. Zoo: the goal of this data set is to learn to classify animals into seven classes given 17<br />

binary attributes representing features such as the presence of hair, feathers, backbone<br />

or teeth, or whether it is an aquatic or airborne animal, among others. The number<br />

of objects (i.e. animals) in the data set is 101.<br />

2. Iris: a classic data set in machine learning and pattern recognition. It contains 150<br />

objects (instances of Iris plants) represented by four real-valued features measuring<br />

the width and length of its petals and sepals. The goal is to classify the objects into<br />

one of the three classes of Iris plants, one of which is linearly separable from the<br />

others, while the latter two are not linearly separable from each other.<br />

3. Wine: this data set’s goal is to determine the origin of wines by means of chemical<br />

analysis. It contains 178 samples of wine which must be categorized into three wine<br />

classes based on their contents of 13 constituents such as alcohol, malic acid, or<br />

magnesium, represented as real-valued features.<br />

4. Glass: in this data set, 214 instances of glass are represented by 10 real-valued attributes<br />

corresponding to their contents in chemical elements such as aluminium,<br />

sodium, calcium, etc. The goal is to classify the objects into one of the predefined six<br />

categories (types of glass).<br />

5. Ionosphere: the contents of this data set are 351 radar returns from the ionosphere,<br />

classified either as good or bad depending on whether they show evidence of some<br />

type of structure in the ionosphere or not. Each radar return is described by 34<br />

autocorrelation-based real-valued features.<br />

6. WDBC : its complete name is Wisconsin Diagnostic Breast Cancer data set. It contains<br />

569 objects (breast mass images) represented by 32 real-valued features describing<br />

characteristics of the cell nuclei present in the image (radius, texture, perimeter,

etc.). The goal is to classify these objects into one of the possible cancer diagnostics<br />

(malignant or benign).<br />

7. Balance: this data set was generated to model psychological experimental results.<br />

Each of the 625 objects is classified into three classes (as having the balance scale tip<br />

to the right, tip to the left, or balanced). The integer-valued attributes are the left<br />

weight, the left distance, the right weight, and the right distance.<br />

8. Mfeat: its original name is Multiple Features data set, as it represents the objects<br />

it contains (handwritten numerals from 0 to 9) using different real-valued features<br />

such as Fourier coefficients, profile correlations, Karhunen-Loève coefficients, pixel<br />

averages, Zernike moments and morphological attributes.<br />

221


A.2. Data sets<br />

Data set Number of Number of Number of Class<br />

name objects (n) attributes (d) classes (k) imbalance<br />

Zoo 101 17 7 40.6%–3.9%<br />

Iris 150 4 3 33.3%–33.3%<br />

Wine 178 13 3 39.9%–26.9%<br />

Glass 214 10 6 35.5%–4.2%<br />

Ionosphere 351 34 2 64.1%–35.9%<br />

WDBC 569 32 2 62.7%–37.3%<br />

Balance 625 4 3 46.1%–7.8%<br />

Mfeat 2000 649 10 10%–10%<br />

miniNG 2000 6679 20 5%–5%<br />

Segmentation 2100 19 7 14.3%–14.3%<br />

BBC 2225 6767 5 22.9%–17.3%<br />

PenDigits 7494 16 10 10.4%–9.6%<br />

Table A.2: Summary of the unimodal data sets employed in the experimental sections of<br />

this thesis. The “Class imbalance” column presents the percentage of objects in the data<br />

set belonging to the most and least populated categories, respectively.<br />

9. miniNG: this is a reduced version of the 20 Newsgroups text data set, as it contains only 2000 objects (text articles posted on Usenet) belonging to one of the 20 predefined thematic classes (e.g. sci.electronics, rec.sport.baseball or talk.politics.mideast). Typical text preprocessing steps such as the removal of stop words and of terms appearing in fewer than 4 documents (document frequency thresholding) give rise to a bag-of-words representation of each article in a 6679-dimensional tfidf-weighted (i.e. real-valued) term space (Sebastiani, 2002); a sketch of this kind of preprocessing pipeline is given after this list.

10. Segmentation: known as the Image Segmentation data set, it contains 2100 outdoor image regions represented by 19-dimensional real-valued feature vectors that should

be classified into one of seven texture classes: brickface, sky, foliage, cement, window,<br />

path and grass. We have employed the test subset of the Segmentation collection.<br />

11. BBC : this data set has been obtained from the online repository of the Machine Learning<br />

Group of the University College Dublin (http://mlg.ucd.ie/content/view/21/). It<br />

consists of 2225 documents from the BBC news website corresponding to stories in<br />

five topical areas (business, entertainment, politics, sport, tech). The original documents’<br />

representation used a 9636-dimensional term space which was reduced to 6767<br />

real-valued attributes after removing those terms with a document frequency smaller<br />

or equal to 4 (Sebastiani, 2002).<br />

12. PenDigits: its original name is Pen-Based Recognition of Handwritten Digits data<br />

set, whose training subset contains 7494 digitized handwritten digits (from 0 to 9)<br />

captured using a pressure sensitive tablet. Each object is represented by 16 integer<br />

attributes corresponding to the (x, y) coordinates of the electronic pen on the tablet<br />

sampled every 100 milliseconds.
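The document-frequency thresholding and tfidf weighting mentioned for the text collections (miniNG and BBC) can be summarized with a minimal sketch. The snippet below uses scikit-learn as an illustrative stand-in for the Matlab-based preprocessing actually used in this work; the toy corpus, the English stop-word list and the exact tokenization are assumptions, not the real setup.

```python
# Minimal sketch of the bag-of-words preprocessing described for the text
# collections: stop-word removal, document-frequency thresholding (terms in
# fewer than 4 documents are dropped) and tfidf weighting. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus (one string per article); the shared term "clustering"
# survives the min_df=4 threshold, the remaining terms are pruned.
documents = [
    "clustering of usenet electronics articles",
    "clustering of baseball game reports",
    "clustering of middle east politics threads",
    "clustering of space mission discussions",
    "clustering of medical newsgroup posts",
]

vectorizer = TfidfVectorizer(stop_words="english", min_df=4)
X_termspace = vectorizer.fit_transform(documents)   # n x d sparse tfidf matrix
print(vectorizer.get_feature_names_out(), X_termspace.shape)
```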

A.2.2 Multimodal data sets<br />


A total of four multimodal data sets have been used in the experimental sections of this<br />

thesis. Three of these data sets are multimodal in nature, whereas the remaining one has<br />

been generated artificially by the combination of two unimodal data collections. In the<br />

following paragraphs, we present a brief description of each data set, summarizing their<br />

most relevant features in table A.3 as a quick reference source.<br />

1. CAL500: the Computer Audition Lab 500-song data set is a collection of five hundred

Western popular songs represented by means of two modalities: acoustic features and<br />

textual annotations (Turnbull et al., 2007). As regards the acoustic modality, we have<br />

employed the mean and standard deviations of the original real-valued delta MFCC<br />

(Mel-Frequency Cepstrum Coefficients) features as the acoustic attributes of each song<br />

(Li and GuoDong, 2000). As described in (Turnbull et al., 2007), the text modality<br />

was generated by means of an auditory experience survey, where 55 listeners were<br />

asked to annotate each song with several terms extracted from a musically relevant<br />

174-word vocabulary. As a result, each song was annotated with those terms assigned<br />

by at least three listeners. These annotating terms describe song-related semantic<br />

concepts as instruments (e.g. bass, piano, trumpet), vocals (e.g. falsetto, breathy,<br />

aggressive), usage (e.g. at a party, waking up, driving) or emotion (e.g. cheerful,

calming, tender). In order to evaluate the clustering processes conducted on this<br />

data set, we have used the annotating term “Best Genre” as the class of each object<br />

(it reflects the musical genre which best fits each song 1 ). There exist sixteen “Best<br />

Genre” labels, such as Alternative, Classic Rock, Punk, Country, and so on. Finally,<br />

we selected the subset of 297 songs which are assigned a single “Best Genre” label.<br />

2. IsoLetters: this multimodal data collection is the result of the artificial fusion of two<br />

unimodal data sets of the UCI Machine Learning Repository (Asuncion and Newman,<br />

1999): Isolet and LetterRecognition. Both data sets contain the same type of object:<br />

the 26 letters of the English alphabet. Whereas the Isolet collection is oriented to<br />

the spoken letter recognition problem (each object is the name of a letter uttered<br />

by a speaker, and represented by a total of 617 real-valued acoustic attributes such<br />

as spectral coefficients, contour features, or sonorant features), the LetterRecognition<br />

data set contains 16 visual features (statistical moments and edge counts) extracted<br />

from black and white images of the twenty-six capital letters of the English alphabet.<br />

Thus, IsoLetters, the multimodal data collection we have created as a combination<br />

of both unimodal data sets, pursues the goal of recognizing letters upon acoustic and<br />

visual features. The total number of objects in the IsoLetters data set (1559) is fixed<br />

by the size of the test subset of the Isolet collection.<br />

3. InternetAds: this data set (which is also found in the UCI Machine Learning Repository)<br />

represents a set of images which are possible advertisements on Internet pages.<br />

The goal is to classify them as an advertisement or not. Each object is described by<br />

means of 1558 real-valued attributes, including image geometry features, its textual<br />

caption, the alt text, phrases occurring in the URL, the image's URL, the anchor text, and words occurring near the anchor text. We consider this data set as multimodal

1<br />

So as to avoid biasing the results, all the genre-related annotating terms were eliminated, reducing the<br />

size of the vocabulary to 127 terms.<br />


Data set name (modalities)        Number of objects (n)   Attributes per mode (d1/d2)   Number of classes (k)   Class imbalance
CAL500 (audio/text)               297                     78/127                        16                      14.5%–1.3%
IsoLetters (speech/image)         1559                    617/16                        26                      3.8%–3.8%
InternetAds (object/collateral)   2359                    133/1425                      2                       83.8%–16.2%
Corel (image/text)                3960                    500/371                       44                      2.2%–2.3%

Table A.3: Summary of the multimodal data sets employed in the experimental sections of this thesis. The “Class imbalance” column presents the percentage of objects in the data set belonging to the most and least populated categories, respectively.

in the sense that some attributes are directly related to the object (such as the image geometry features, the caption and the alt text, which total 133 features), while the remaining 1425 features refer to collateral elements such as the anchor text or

the URL. We have removed those objects in the data set with missing features (28%<br />

of the total), obtaining a reduced version of the InternetAds data set containing 2359<br />

objects.<br />

4. Corel: this is a rather classic multimodal data set (Duygulu et al., 2002), consisting of<br />

5000 text-annotated images from 50 Corel Stock Photo CDs. Each CD contains 100<br />

images on the same topic, such as “Sunrises and Sunsets”, “Mountains of America”<br />

or “Wild Animals” (Bekkerman and Jeon, 2007). Experiments have been conducted<br />

on the subset of 3960 images of the training subset of the Corel collection which are<br />

assigned at least one topic. The visual modality is codified as follows (Duygulu et<br />

al., 2002): images, represented using 33 color features, are segmented into regions,<br />

and these regions are clustered into 500 smaller connected regions (aka blobs), which<br />

are deemed visual terms, so that each image can be expressed in terms of these. As<br />

regards the textual modality, every image has a caption (i.e. a brief description of the<br />

scene) and an annotation (a list of objects appearing in the image). The vocabulary<br />

contains 371 words, and the term vectors are parameterized using the tfidf weighting<br />

scheme (van Rijsbergen, 1979).<br />

A.3 Data representations<br />

A.3.1 Unimodal data representations<br />

As mentioned in section 1.1, in this work objects are expressed in terms of d numerical<br />

attributes, so each object is represented as a column vector x ∈ R^d. Therefore, a whole

data set containing n objects is mathematically expressed by means of a d × n matrix X.<br />

This original object representation is referred to as baseline throughout the thesis.<br />

Starting from the baseline representation, four other object representations have been<br />


generated by applying the following well-known feature extraction techniques 2 : Principal<br />

Component Analysis (PCA), Independent Component Analysis (ICA), Non-Negative Matrix<br />

Factorization (NMF) and Random Projection (RP). Besides providing diversity as regards data representation, these techniques are also employed for dimensionality reduction purposes. In this work, we refer to the reduced dimensionality of the resulting feature space by r, which takes a whole range of values in the interval [3, d]. As a result of each feature extraction procedure, a batch of r × n matrices (X^r_PCA, X^r_ICA, X^r_NMF or X^r_RP) is obtained.

The following paragraphs are devoted to a brief description of the main concepts regarding the aforementioned feature extraction techniques; an illustrative code sketch covering the four of them is provided after this list.

– Principal Component Analysis, which is one of the most typical feature extraction<br />

techniques, is based on projecting the data onto a dimensionally reduced feature space<br />

such that i) the newly obtained features are decorrelated, and ii) the variance of the

original data is maximally retained. For these reasons, PCA is said to be capable of<br />

removing data redundancies while keeping the most relevant information contained<br />

in the data. There exist several ways for conducting PCA, from the eigenanalysis of<br />

the covariance matrix of X (Jolliffe, 1986) to neural network approaches (Oja, 1992).<br />

In this work, PCA is implemented by means of Singular Value Decomposition (SVD),<br />

following a similar approach to that of Latent Semantic Analysis (Deerwester et al.,

1990). More specifically, given the d×n data matrix X, its SVD is expressed according<br />

to equation (A.1).<br />

X = U · Σ · V^T    (A.1)

Matrix Σ contains the singular values of X ordered in decreasing order and the<br />

columns of matrices U and V are the left and right singular vectors of X, respectively.<br />

Dimensionality reduction is achieved by retaining the r largest singular values in Σ and the corresponding columns of matrix V, so that the r × n matrix X^r_PCA = Σ_r · V_r^T –where Σ_r and V_r are the reduced versions of the singular values and right singular vectors matrices, respectively– will contain the location of the n objects in the r-dimensional PCA space, where clustering is conducted.

– Independent Component Analysis (ICA) can be regarded as an extension of PCA<br />

for non-Gaussian data (Xu and Wunsch II, 2005), in which the projected data components<br />

are forced to be statistically independent—a stronger condition than PCA’s<br />

decorrelation. Being tightly bound to the blind source separation problem (Hyvärinen,<br />

Karhunen, and Oja, 2001), the application of ICA for feature extraction usually assumes<br />

the existence of a generative model that, in its simplest version, defines the<br />

observed data as the result of an unknown linear noiseless combination of r statistically<br />

independent latent variables (the so-called independent components). The<br />

goal of ICA algorithms is to recover the independent components making no further<br />

2 We chose feature extraction over feature selection given its greater ease of application (Jain, Murty, and Flynn, 1999), as our main goal is creating representational diversity, not elaborating on object representations. In informal experiments not reported here, other object representations based on feature selection plus a change of basis (Srinivasan, 2002) were also tested but finally discarded as, in general terms, they gave rise to lower quality clustering results.


assumption than their statistical independence. The application of ICA in feature extraction<br />

is usually preceded by PCA with dimensionality reduction, as this procedure<br />

is equivalent to the usual whitening step that simplifies ICA algorithms (Hyvärinen,<br />

Karhunen, and Oja, 2001). Applying ICA on the PCA data yields an estimation of<br />

the r independent latent variables which generated the observed data:<br />

X r ICA = WX r PCA<br />

(A.2)<br />

where matrix W is known as the separating matrix. Equation (A.2) can be interpreted<br />

as a linear transformation of the data through its projection on the basis vectors<br />

contained in the rows of the separating matrix W. In this work, a version of the<br />

FastICA algorithm (Hyvärinen, 1999) that maximizes the skewness of the data is<br />

employed for obtaining the ICA representation of the data (Kaban and Girolami,<br />

2000).<br />

– Non-Negative Matrix Factorization (NMF) (Lee and Seung, 1999) is a feature extraction<br />

technique based on linear representations of non-negative data—i.e. NMF can<br />

only be applied when the original representation of the data is non-negative. Intuitively,<br />

NMF can be interpreted as a linear generative model somewhat similar to that<br />

of ICA but subject to non-negativity constraints, as it factorizes the non-negative data<br />

matrix X into the approximate product of two non-negative matrices W and H, as<br />

defined in equation (A.3). Thus, it can be argued that the data set is generated by<br />

the sum of a set of the latent non-negative variables contained in matrix H, while the<br />

elements of W are the weights of their linear combination.<br />

X ≈ W · H, where X^r_NMF = H    (A.3)

From a practical viewpoint, i) NMF is usually implemented by means of iterative<br />

algorithms which try to minimize a cost function proportional to the reconstruction<br />

error ||X−W·H|| (Lee and Seung, 2001), and ii) dimensionality reduction is achieved<br />

by setting the respective sizes of the factorization matrices W and H to d × r and<br />

r × n at the time of computing the approximate factorization of equation (A.3).<br />

In this work, the NMF-based object representation XNMF is obtained by applying a<br />

mean square reconstruction error minimization algorithm from NMFPACK, a software<br />

package for NMF in Matlab (Hoyer, 2004). Besides its use as a feature extraction<br />

technique, the vision of NMF as a means for obtaining a parts-based description of<br />

the data has motivated alternative NMF-based clustering strategies (Xu, Liu, and<br />

Gong, 2003; Shahnaz et al., 2004), alongside studies on the theoretical connections<br />

between NMF and classic clustering approaches (Ding, He, and Simon, 2005).<br />

Compared to PCA and ICA, NMF is advantageous as the non-negativity of its basis<br />

vectors favours their interpretability. On the flip side, the derivation of the NMF<br />

representation is usually more computationally demanding.<br />

– Random Projection (RP) is a computationally efficient dimensionality reduction technique,<br />

proposed as an alternative to those feature extraction techniques that become<br />

too costly when the dimensionality of the original representation (d) is very high

(Kaski, 1998). The rationale behind RP is pretty straightforward: a dimensionality<br />

reduction method is effective as long as the distance between the objects in the<br />

226


Appendix A. Experimental setup<br />

original feature space is approximately preserved in the reduced r-dimensional space.<br />

Allowing for this fact, Kaski proved that this could be achieved using a random linear<br />

mapping embodied in a r × d random projection matrix R, where the columns of R<br />

are realizations of independent and identically distributed (i.i.d.) zero-mean normal<br />

variables, scaled to have unit length (Fodor, 2002).<br />

X^r_RP = R · X    (A.4)

Several experimental studies bear witness to the fact that i) RP takes a fraction of the time required for executing other feature extraction techniques such as PCA or ICA, among others, and ii) clustering results on RP data representations are sometimes comparable to or even better than those obtained using, for instance, PCA —which

somehow reinforces the notion of the data representation indeterminacy outlined in<br />

section 1.4 (Kaski, 1998; Bingham and Mannila, 2001; Lin and Gunopulos, 2003; Tang<br />

et al., 2005).<br />
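As announced above, the following is an illustrative sketch of the four feature extraction steps for a single target dimensionality r. It relies on NumPy and scikit-learn as stand-ins for the Matlab implementations actually used in this work (SVD-based PCA, the skewness-maximizing FastICA variant, NMFPACK and Kaski-style random projections); the synthetic d × n matrix, the fixed r and the library defaults are assumptions made only for the sake of the example.

```python
# Illustrative sketch of the four feature extraction steps (PCA via SVD,
# ICA, NMF and Random Projection) for a single target dimensionality r.
import numpy as np
from sklearn.decomposition import FastICA, NMF

rng = np.random.default_rng(0)
d, n, r = 20, 100, 5                      # attributes, objects, reduced dimensionality
X = np.abs(rng.standard_normal((d, n)))   # d x n data matrix (non-negative, so NMF applies)

# --- PCA via SVD (cf. equation A.1): keep the r largest singular values ---
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = np.diag(S[:r]) @ Vt[:r, :]        # r x n, X^r_PCA = Sigma_r * V_r^T

# --- ICA applied on the PCA data (cf. equation A.2) ---
ica = FastICA(n_components=r, random_state=0)
X_ica = ica.fit_transform(X_pca.T).T      # r x n estimate of the independent components

# --- NMF (cf. equation A.3): X ~ W H, the reduced representation is H ---
nmf = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X.T)                # sklearn factorizes (n x d) ~ (n x r)(r x d)
X_nmf = W.T                               # r x n reduced representation

# --- Random Projection (cf. equation A.4): r x d Gaussian matrix, unit-length columns ---
R = rng.standard_normal((r, d))
R /= np.linalg.norm(R, axis=0, keepdims=True)
X_rp = R @ X                              # r x n
```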

A.3.2 Multimodal data representations<br />

As regards the representation of the objects of the multimodal data sets described in section<br />

A.2.2, two distinct approaches have been followed. Firstly, unimodal representations have<br />

been created for each mode separately, applying the same strategies as in the unimodal<br />

data sets—thus, we will not expand on this point. And secondly, we have generated truly<br />

multimodal data representations by combining both modalities. We elaborate on this latter<br />

issue in the following paragraphs.<br />

The simple concatenation of the baseline feature vectors of both modalities (previously<br />

normalized to unit length³) gives rise to the multimodal baseline representation, represented

on a (d1 + d2)-dimensional attribute space —where d1 and d2 are the dimensionalities of<br />

the baseline representation of each modality.<br />

Subsequently, the feature extraction techniques described in section A.3.1 (i.e. PCA,<br />

ICA, RP and NMF—this latter only when data is non-negative) are applied on the multimodal<br />

baseline representation, yielding representations of dimensionalities r ∈ [3,d1 + d2].<br />

This procedure, known as early fusion or feature-level fusion in the literature, is a common<br />

strategy for creating representations of multimodal data from unimodal representations,<br />

and it has been applied in content-based image retrieval (La Cascia, Sethi, and Sclaroff,

1998; Zhao and Grosky, 2002), semantic video analysis (Snoek, Worring, and Smeulders,<br />

2005), human affect recognition (Gunes and Piccardi, 2005), audiovisual video sequence<br />

analysis (Sevillano et al., 2009), as well as multimodal clustering (Benitez and Chang, 2002).
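A minimal sketch of this feature-level fusion step is given below, assuming two hypothetical d1 × n and d2 × n baseline matrices; the dimensionalities are borrowed from the CAL500 row of table A.3, but the data themselves are random stand-ins.

```python
# Minimal sketch of the early (feature-level) fusion used to build the
# multimodal baseline: each modality's feature vector is normalized to unit
# length and the two vectors are concatenated per object.
import numpy as np

def early_fusion(X1: np.ndarray, X2: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return the (d1 + d2) x n multimodal baseline matrix."""
    X1n = X1 / (np.linalg.norm(X1, axis=0, keepdims=True) + eps)  # unit-norm columns
    X2n = X2 / (np.linalg.norm(X2, axis=0, keepdims=True) + eps)
    return np.vstack([X1n, X2n])

# Example with random stand-in data (n = 50 objects, d1 = 78, d2 = 127 as in CAL500)
rng = np.random.default_rng(0)
X_multimodal = early_fusion(rng.standard_normal((78, 50)), rng.standard_normal((127, 50)))
print(X_multimodal.shape)  # (205, 50)
```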

A.4 Cluster ensembles<br />

In this section, we briefly describe the cluster ensembles employed in the experimental<br />

sections of this thesis. As described in section 2.1, in this work we combine both the homogeneous<br />

and heterogeneous approaches for creating cluster ensembles. This means that we<br />

3 An uneven weighting of the concatenated vectors would give more importance to one of the modalities. As it is not clear how to appropriately weight each modality a priori, we forced each subvector to have unit norm so as to avoid any bias.


Data set name |dfA| =1 |dfA| =10 |dfA| =19 |dfA| =28<br />

Zoo 57 570 1083 1596<br />

Iris 9 90 171 252<br />

Wine 45 450 855 1260<br />

Glass 29 290 551 812<br />

Ionosphere 97 970 1843 2716<br />

WDBC 113 1130 2147 3164<br />

Balance 7 70 133 196<br />

Mfeat 6 60 114 168<br />

miniNG 73 730 1387 2044<br />

Segmentation 52 520 988 1456<br />

BBC 57 570 1083 1596<br />

PenDigits 57 570 1083 1596<br />

Table A.4: Cluster ensemble sizes l corresponding to distinct algorithmic diversity configurations<br />

for the unimodal data sets.<br />

employ several mutually crossed diversity factors (the twenty-eight clustering algorithms of<br />

the CLUTO clustering package presented in section A.1 are run on the data representations<br />

with varying dimensionalities described in section A.3) so as to generate the individual<br />

components of our cluster ensembles.<br />
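The crossing of these diversity factors can be pictured with the following sketch, in which both the clustering algorithms and the data representations are placeholders (two scikit-learn clusterers standing in for the 28 CLUTO configurations, and random matrices standing in for the representations of section A.3); it only illustrates how every algorithm is run on every representation to populate the ensemble.

```python
# Sketch of how the heterogeneous/homogeneous diversity factors are crossed
# to build a cluster ensemble: every clustering algorithm is run on every
# data representation (and every reduced dimensionality).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def build_ensemble(representations, algorithms, k):
    """representations: dict name -> (r x n) matrix; algorithms: dict name -> factory(k).
    Returns an l x n array of label vectors (the cluster ensemble)."""
    ensemble = []
    for rep_name, X_r in representations.items():
        for alg_name, make_alg in algorithms.items():
            labels = make_alg(k).fit_predict(X_r.T)   # clusterers expect objects as rows
            ensemble.append(labels)
    return np.vstack(ensemble)

# Toy usage with stand-in algorithms and representations
rng = np.random.default_rng(0)
reps = {"baseline": rng.standard_normal((10, 60)), "pca-3d": rng.standard_normal((3, 60))}
algs = {"kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
        "agglo":  lambda k: AgglomerativeClustering(n_clusters=k)}
E = build_ensemble(reps, algs, k=4)
print(E.shape)   # (l, n) = (4, 60)
```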

However, several cluster ensemble instances have been generated by limiting the cardinality<br />

of the algorithmic diversity factor |dfA| (i.e. the number of clustering algorithms<br />

considered in creating the cluster ensemble components) to a discrete set of values: |dfA| =<br />

{1, 10, 19, 28}. This strategy is adopted with the objective of experimentally evaluating our<br />

proposals both in terms of i) their sensitivity to the cluster ensemble diversity (as the larger<br />

|dfA|, the more diverse the cluster ensemble), and ii) their computational scalability as regards<br />

the cluster ensemble size l (since this factor is proportional to |dfA|). Notice that the<br />

cluster ensembles with |dfA| = {1, 10, 19} are randomly sampled subsets of the maximally<br />

diverse cluster ensemble (the one corresponding to |dfA| =28).<br />

Tables A.4 and A.5 present the sizes of the cluster ensembles corresponding to the<br />

distinct diversity scenarios (i.e. cardinalities of the algorithmic diversity factor dfA) on the

unimodal and multimodal data collections employed in this work.<br />

Firstly, table A.4 presents the cluster ensemble sizes corresponding to the unimodal<br />

data sets. As expected, the cluster ensemble size l grows linearly with the value of |dfA|.<br />

Depending on the cardinalities of the representational and dimensional diversity factors of<br />

each data collection, fairly distinct cluster ensemble sizes are obtained (from the modest

values of the Iris data set to the highly populated cluster ensembles of the WDBC collection).<br />
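As a quick worked check of this proportionality, reading the Zoo row of table A.4: with |dfA| = 1 the ensemble contains l = 57 components (one per representation/dimensionality configuration), and the remaining columns simply scale this value by the number of algorithms considered, e.g. 10 × 57 = 570 and 28 × 57 = 1596 for the maximally diverse configuration.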

Last, table A.5 presents the cluster ensembles corresponding to the four multimodal data

collections employed in this work for each diversity scenario. It is important to highlight<br />

the fact that the values of l presented in this table encompass the two unimodal and the<br />

multimodal data representations of the objects contained in these data sets.<br />

The reader is referred to appendix B for an analysis of the quality and diversity of the<br />

components of these cluster ensembles.<br />


Data set name |dfA| =1 |dfA| =10 |dfA| =19 |dfA| =28<br />

CAL500 102 1020 1938 2856<br />

IsoLetters 111 1110 2109 3108<br />

InternetAds 183 1830 3477 5124<br />

Corel 123 1230 2337 3444<br />

Table A.5: Cluster ensemble sizes l corresponding to distinct algorithmic diversity configurations<br />

for the multimodal data sets.<br />

A.5 Consensus functions<br />

In this section, we briefly describe the consensus functions employed in the experimental<br />

section of this thesis, placing special emphasis on specific implementation details when<br />

necessary. Moreover, we present the time complexity of each consensus function for a given<br />

cluster ensemble size (l), number of objects in the data set (n) and clusters (k).<br />

The first seven consensus functions are employed in experiments considering both hard and soft cluster ensembles (i.e. chapters 3 to 6). For its part, the last one (VMA) is only applied on soft cluster ensembles, that is, in chapter 6.

The Matlab source code of the first three consensus functions is available for download<br />

at http://www.strehl.com, whereas the remaining ones were implemented ad hoc for this<br />

work. For a more theoretical description of these consensus functions, see section 2.2.<br />

– CSPA (Cluster-based Similarity Partitioning Algorithm): this consensus function<br />

shares a lot of the rationale of the Evidence Accumulation consensus function (see<br />

below), as it is based on deriving a pairwise object similarity measure from the cluster<br />

ensemble and applying a similarity-based clustering algorithm on it—the METIS<br />

graph partitioning algorithm (Karypis and Kumar, 1998) in this case. Its computational<br />

complexity is O(n²kl) (Strehl and Ghosh, 2002).

– HGPA (HyperGraph Partitioning Algorithm): this clustering combiner exploits a hypergraph<br />

representation of the cluster ensemble, re-partitioning the data by finding<br />

a hyperedge separator that cuts a minimal number of hyperedges, yielding k unconnected<br />

components of approximately the same size—which makes HGPA not an<br />

appropriate consensus function when clusters are highly imbalanced. The hypergraph<br />

partition is conducted by means of the HMETIS package (Karypis et al., 1997). Its<br />

time complexity is O(mkl) (Strehl and Ghosh, 2002).

– MCLA (Meta-CLustering Algorithm): as in HGPA, each cluster corresponds to a hyperedge<br />

of the hypergraph representing the cluster ensemble. Subsequently, related<br />

hyperedges are detected by grouping them using the METIS graph-based clustering algorithm<br />

(Karypis and Kumar, 1998). Next, related hyperedges are collapsed and each<br />

object is assigned to the collapsed hyperedge in which it participates most strongly.<br />

Its computational complexity is O(mk²l²) (Strehl and Ghosh, 2002).

– EAC (Evidence Accumulation): this is a pretty direct implementation of the consensus<br />

function presented in (Fred and Jain, 2002a). It consists of the computation of the

pairwise object co-association matrix and the subsequent application of the single-link<br />

229


A.6. Computational resources<br />

hierarchical clustering algorithm on it. The main difference between our implementation<br />

and the original one lies in the fact that we cut the resulting dendrogram at the<br />

desired number of clusters k, whereas Fred proposes performing the cut at the highest<br />

lifetime level, so that the consensus function itself finds the natural number of clusters in the data set. Its computational complexity is O(n²l) (Fred and Jain, 2005). A minimal sketch of this co-association scheme is provided after this list.

– ALSAD (Average-Link on Similarity As Data): this is one of the three consensus<br />

functions presented in (Kuncheva, Hadjitodorov, and Todorova, 2006) based on considering<br />

object similarity measures as object features. Despite the authors do not give<br />

a specific name to this family of consensus functions, we have named them xxSAD<br />

so as to indicate that similarities are deemed as data, replacing xx by the acronym<br />

of the particular clustering algorithm used for obtaining the consensus clustering solution.<br />

In this case, the pairwise object co-association matrix is partitioned using the<br />

average-link (AL) hierarchical clustering algorithm, cutting the resulting dendrogram<br />

at the desired number of clusters. Its computational complexity is O(n²l) for creating the object similarity matrix plus O(n²) for partitioning it with the hierarchical AL clustering algorithm (Xu and Wunsch II, 2005).

– KMSAD (K-Means on Similarity As Data): this consensus function belongs to the<br />

same family as the previous one. This time, the object co-association matrix is clustered<br />

using the classic k-means (KM) partitional algorithm. Its computational complexity<br />

is O(n²l) for creating the object similarity matrix plus O(tkm) for partitioning

it with the k-means clustering algorithm (Xu and Wunsch II, 2005) —where t is the<br />

number of iterations of k-means.<br />

– SLSAD (Single-Link on Similarity As Data): following the same approach as the AL-<br />

SAD and KMSAD consensus functions, the pairwise object co-association matrix is<br />

partitioned by means of the single-link (SL) hierarchical clustering algorithm in this<br />

case. As in the ALSAD consensus function, the consensus clustering solution is obtained<br />

by cutting the dendrogram at the desired number of clusters. Its computational<br />

cost is the same as that of ALSAD.

– VMA (Voting Merging Algorithm): this consensus function is based on sequentially<br />

solving the cluster correspondence problem on pairs of cluster ensemble components,<br />

and, at each iteration, applying a weighted version of the sum rule confidence voting<br />

method. This algorithm scales linearly in the number of objects in the data set and<br />

the number of cluster ensemble components, i.e. its complexity is O(nl) (Dimitriadou,

Weingessel, and Hornik, 2002).<br />
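As a complement to the descriptions above, the following is a minimal sketch of the co-association idea shared by EAC and the xxSAD family: the pairwise co-clustering frequencies are accumulated over the l ensemble components and the resulting matrix is partitioned, here by cutting a single-link dendrogram at k clusters. NumPy and SciPy are used as stand-ins for the Matlab implementation actually employed, and the toy ensemble is an assumption made only for the example.

```python
# Minimal sketch of the co-association scheme behind EAC (and the xxSAD family):
# build the n x n matrix of pairwise co-clustering frequencies over the l
# ensemble components, then cut a single-link dendrogram at k clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def eac_consensus(labelings: np.ndarray, k: int) -> np.ndarray:
    """labelings: l x n array, one hard clustering per row. Returns n consensus labels."""
    l, n = labelings.shape
    coassoc = np.zeros((n, n))
    for labels in labelings:                       # accumulate co-clustering evidence
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= l                                   # co-association values in [0, 1]
    distance = 1.0 - coassoc                       # turn similarity into a distance
    np.fill_diagonal(distance, 0.0)
    Z = linkage(squareform(distance, checks=False), method="single")
    return fcluster(Z, t=k, criterion="maxclust")  # cut the dendrogram at k clusters

# Toy usage: three components of a cluster ensemble over six objects, k = 2
ensemble = np.array([[0, 0, 0, 1, 1, 1],
                     [0, 0, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0, 0]])
print(eac_consensus(ensemble, k=2))
```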

A.6 Computational resources<br />

All the experiments conducted in this thesis have been executed under Matlab 7.0.4 on<br />

Dual Pentium 4 3GHz/1 GB RAM computers. The reason for choosing Matlab as the<br />

programming language for coding our proposals is threefold: besides the fact that we are familiar with it, the existence of multiple built-in functions simplifies the implementation of

many of the processes involved in our proposals (Principal Component Analysis and Random<br />

Projection feature extraction, for instance). Moreover, the availability of the full Matlab<br />

source code of several components of our proposals (e.g. hypergraph consensus functions<br />

230


Appendix A. Experimental setup<br />

or Non-negative Matrix Factorization feature extraction) has been a further incentive for<br />

this decision. However, the downside of this choice is the relatively slow execution of<br />

our proposals implementation, due to the fact that Matlab is an interpreted programming<br />

language.<br />



Appendix B<br />

Experiments on clustering indeterminacies

The goal of this appendix is to present experimental evidence of the indeterminacies affecting

the practical selection of a specific clustering configuration introduced in chapter 1. In<br />

particular, we focus on the indeterminacies regarding the selection of the data representation<br />

and clustering algorithm that yields the best clustering results for both unimodal and<br />

multimodal data collections.<br />

As already noted in chapter 1, the evaluation of the clustering results is based on computing<br />

the normalized mutual information φ (NMI) between a given label vector and the ground truth, which is not available to the clustering process and is only used for evaluation purposes. Recall that φ (NMI) ranges from 0 to 1: the higher its value, the more similar the clustering result is to the ground truth.
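As an illustration of this evaluation step, the short snippet below compares a hypothetical clustering against a hypothetical ground truth using scikit-learn's normalized mutual information; note that NMI admits several normalizations, so the value returned here is a stand-in for the exact φ (NMI) definition given in chapter 1 rather than a reproduction of it.

```python
# Minimal sketch of the evaluation step: compare a clustering label vector
# against the ground truth with normalized mutual information.
from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical class labels
clustering   = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # labels produced by some algorithm

phi_nmi = normalized_mutual_info_score(ground_truth, clustering)
print(f"phi(NMI) = {phi_nmi:.3f}")            # 1.0 would be a perfect match
```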

B.1 Clustering indeterminacies in unimodal data sets<br />

In this section, we analyze which clustering configurations (data representation plus clustering<br />

algorithm) give rise to the best partitioning of the unimodal data sets described in<br />

section A.2.1. We aim to demonstrate the dependence between the quality of the clustering<br />

results and the selection of the way objects are represented and clustered.<br />

As described in section A.3.1, starting with the original data representation (denoted<br />

as baseline), four additional representations have been created by applying several feature<br />

extraction techniques with multiple dimensionalities, namely Principal Component Analysis<br />

(PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization<br />

(NMF) and Random Projection (RP) 1 .<br />

On each distinct object representation, the 28 clustering algorithms from the CLUTO<br />

toolbox presented in section A.1 have been applied, which gives rise to the number of partitions<br />

per data representation presented in table B.1. Notice that, in those data sets not<br />

satisfying non-negativity constraints, the NMF representation was not derived. Moreover,<br />

1 The only exception to this rule is the MFeat data set, where no attribute transformation was applied,<br />

as its original form already presents data representation diversity through the use of several features, see<br />

section A.2.1.<br />


Data set Data representation<br />

name Baseline PCA ICA NMF RP<br />

Zoo 28 392 392 392 392<br />

Iris 28 56 56 56 56<br />

Wine 28 308 308 308 308<br />

Glass 28 196 196 196 196<br />

Ionosphere 28 896 896 - 896<br />

WDBC 28 784 784 784 784<br />

Balance 28 56 - 56 56<br />

Mfeat<br />

28 on each of its 6 representations<br />

(FAC, FOU, KAR, MOR, PIX and ZER)<br />

miniNG 28 504 504 504 504<br />

Segmentation 28 476 476 - 476<br />

BBC 28 392 392 392 392<br />

PenDigits 28 392 392 392 392<br />

Table B.1: Number of individual clusterings per data representation on each unimodal data<br />

set.<br />

the ICA algorithm employed for deriving the homonymous object representation presented<br />

convergence problems when executed on the Balance data collection, so no ICA representation<br />

was created on this data set.<br />

In the next paragraphs, we describe the clustering results obtained on each data set,<br />

emphasizing which clustering configurations lead to the best clustering results in each case.<br />

B.1.1 Zoo data set<br />

Figure B.1 presents the histograms of the φ (NMI) values (ranging in the [0,1] interval) obtained<br />

by all the clustering algorithms on each data representation for the Zoo data collection.<br />

Recall that φ (NMI) = 1 corresponds to a perfect match between the ground truth and<br />

a clustering solution. The analysis of these histograms helps us to interpret the influence of

the clustering indeterminacies on the quality of the clustering results.<br />

[Figure B.1: Histograms of the φ (NMI) values obtained on each data representation in the Zoo data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

Firstly, by inspecting the histogram corresponding to the clustering results obtained by<br />

applying the 28 algorithms on the baseline object representation (figure B.1(a)), we can<br />


see that φ (NMI) values scattered in a range extending approximately from φ (NMI) =0.45 to<br />

φ (NMI) =0.85 are obtained. It is important to notice that such diverse results are solely due<br />

to the clustering algorithm selection indeterminacy, as this histogram presents the results<br />

of running multiple distinct clustering algorithms on a single data representation.<br />

If this analysis is extended to the remaining histograms (figures B.1(b) to B.1(e)), it can<br />

be observed that the φ (NMI) scatter extends across an even wider range for each distinct type<br />

of representation. This somehow gives an idea of the dependence between the quality of the<br />

clustering results and the selection of the clustering algorithm. However, this conclusion<br />

cannot be drawn as directly as in the baseline representation, given that histograms B.1(b)<br />

to B.1(e) present the results of running the 28 algorithms on multiple representations with<br />

distinct dimensionalities derived by each feature extraction technique. In other words,<br />

the diversity observed in these histograms is produced by the joint effect of the clustering<br />

algorithm and dimensionality reduction data representation selection indeterminacies.<br />

However, if figures B.1(a) to B.1(e) are compared among themselves, the different histogram<br />

distributions reveal the effect of the clustering indeterminacy regarding the type of<br />

data representation. For example, clustering results on the NMF representations of this<br />

data set span across a comparatively narrower and higher range of φ (NMI) values than their<br />

PCA, ICA and RP counterparts, indicating that it is more probable to obtain better results<br />

if clustering is run on NMF representations than on the remaining ones.<br />

B.1.2 Iris data set<br />

Compared to other data sets, a pretty small number of clustering solutions have been generated<br />

on the Iris collection. Regardless of this fact, the effect of the clustering indeterminacies<br />

can also be observed in figure B.2.<br />

[Figure B.2: Histograms of the φ (NMI) values obtained on each data representation in the Iris data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

In this case, the wide span of the φ (NMI) histograms of the PCA and ICA representations<br />

(figures B.2(b) and B.2(c)) is the clearest indicator of the representation dimensionality and<br />

algorithm selection indeterminacies.<br />

If the qualities of the clustering solutions obtained for the distinct types of object representation<br />

are compared, we can observe that the highest φ (NMI) values are obtained using<br />

the RP and the baseline representations.<br />


B.1.3 Wine data set<br />

The histograms of the φ (NMI) values obtained by each clustering algorithm across all the<br />

data representations employed in the Wine data set are presented in figure B.3.<br />

[Figure B.3: Histograms of the φ (NMI) values obtained on each data representation in the Wine data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

The clustering indeterminacy regarding the selection of both the clustering algorithm<br />

and the dimensionality of the data representation is clearly observed in figures B.3(b) and<br />

B.3(c). For both the PCA and ICA data representations, a rather even histogram is obtained,<br />

spanning from φ (NMI) =0.04 to φ (NMI) =0.84.<br />

Moreover, notice that it is only with these data representations (PCA and ICA) that<br />

φ (NMI) values above 0.5 are obtained on this data set, which reinforces the importance of<br />

using the optimal type of features for obtaining good clustering results.

B.1.4 Glass data set<br />

The φ (NMI) histograms corresponding to the Glass data set are presented in figure B.4.<br />

[Figure B.4: Histograms of the φ (NMI) values obtained on each data representation in the Glass data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

Notice the distinct histogram distributions obtained for each data representation, which<br />

gives an idea of how the selection of a particular data representation influences the quality of<br />

the clustering results. Additionally, a pretty wide range of values of φ (NMI) are observed in<br />

the histograms corresponding to the feature extraction based data representations (figures<br />

B.4(b) to B.4(e)), thus evidencing the effect of the dimensionality reduction and clustering<br />

algorithm selection indeterminacy.<br />

B.1.5 Ionosphere data set<br />


As regards the Ionosphere data collection, pretty similar φ (NMI) distributions are obtained<br />

for the PCA, ICA and RP representations (see figures B.5(b) to B.5(d)). Thus, in this<br />

case, there apparently exists a lower dependence between the quality of clustering and the<br />

feature extraction technique used for representing the objects. Nevertheless, despite the<br />

notable concentration of clustering results on the leftmost part of the histograms (i.e. poor<br />

clusterings with low values of φ (NMI) ), there exist some clustering solutions reaching φ (NMI)<br />

values above 0.5 using PCA and ICA feature extraction (see figures B.5(b) and B.5(c)).<br />

Moreover, notice that pretty poor quality clusterings are obtained when operating on the<br />

baseline object representation (figure B.5(a)).<br />

[Figure B.5: Histograms of the φ (NMI) values obtained on each data representation in the Ionosphere data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.6 WDBC data set<br />

As regards the WDBC data collection, there exists a notable difference between the profiles<br />

of the histograms of the PCA, ICA and NMF representations when compared to the<br />

baseline and RP histograms. Indeed, the former present a sharp peak located in the lowest<br />

region of the φ (NMI) range, whereas the latter do not—which reflects the data representation<br />

clustering indeterminacy. The notably large differences between the highest and lowest<br />

φ (NMI) values of all the histograms reveal the influence of the clustering algorithm and data<br />

dimensionality selection on the quality of the partition results.<br />

[Figure B.6: Histograms of the φ (NMI) values obtained on each data representation in the WDBC data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]


B.1.7 Balance data set<br />

The approximately even distributions of the φ (NMI) histograms corresponding to the four object representations employed in the Balance data set (with the exception of the peak around φ (NMI) = 0.04 in figure B.7(c)) convey the idea that the chances of randomly selecting a good or a bad clustering configuration are roughly equal in this data collection.

[Figure B.7: Histograms of the φ (NMI) values obtained on each data representation in the Balance data set. Panels: (a) Baseline, (b) PCA, (c) NMF, (d) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.8 MFeat data set<br />

In this data set, six distinct feature types were employed for representing the objects, each<br />

with a single dimensionality. Therefore, the φ (NMI) scatter observed in each of the figures<br />

from B.8(a) to B.8(f) is solely due to the algorithm selection indeterminacy.<br />

Notice that, in all these histograms, a pretty high density of clustering solutions around<br />

φ (NMI) =0.5 can be observed. Nevertheless, notably better clustering results (φ (NMI) ≈ 0.8)<br />

can be obtained using the KAR and PIX object representations (see figures B.8(c) and<br />

B.8(e)), which reveals the data representation indeterminacy effect.<br />

B.1.9 miniNG data set<br />

The wide spread of the φ (NMI) values observed in figure B.9(a) –from φ (NMI) =0.06 to<br />

φ (NMI) =0.64– is clear evidence of how the selection of a particular clustering algorithm

affects the quality of the clustering results.<br />

Moreover, notice that the clustering solutions obtained on the RP representation yield<br />

φ (NMI) values below 0.3, whereas the best results obtained on the remaining representations<br />

reach and even surpass φ (NMI) =0.5 —i.e. distinct object representations can significantly<br />

alter the results of a clustering process.<br />

B.1.10 Segmentation data set<br />

As regards the effect of applying distinct clustering algorithms on the same object representation,<br />

figure B.10(a) shows how, despite the accumulation of clustering solutions around<br />

φ (NMI) =0.35, a maximum quality of φ (NMI) =0.65 can be obtained on the baseline representation

of the objects.<br />

[Figure B.8: Histograms of the φ (NMI) values obtained on each data representation in the MFeat data set. Panels: (a) FAC, (b) FOU, (c) KAR, (d) MOR, (e) PIX, (f) ZER; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

[Figure B.9: Histograms of the φ (NMI) values obtained on each data representation in the miniNG data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

Furthermore, if figures B.10(b) and B.10(c) are compared to figure B.10(d), it is easy to<br />

see that whereas the two former present a wide and sharp peak centered around φ (NMI) =0.7<br />

(thus indicating that clustering solutions this good are likely to be obtained using the PCA<br />

and ICA representations of the objects), the latter has its acme around φ (NMI) =0.35—i.e.<br />

the quality of the RP based clustering solutions tends to be lower in this data set.<br />

B.1.11 BBC data set<br />

The BBC data collection constitutes another example where very diverse clustering solutions<br />

–with qualities ranging from φ (NMI) =0.01 to φ (NMI) =0.81– are obtained when clustering<br />

is conducted on the original representation of the objects (see figure B.11(a)).<br />

As far as the remaining data representations are concerned, the best results seem to<br />

be obtained using the NMF feature extraction technique, as its corresponding histogram is<br />

more scarcely and densely populated at the low and high ranges of φ (NMI) , respectively.<br />


[Figure B.10: Histograms of the φ (NMI) values obtained on each data representation in the Segmentation data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

[Figure B.11: Histograms of the φ (NMI) values obtained on each data representation in the BBC data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.12 PenDigits data set<br />

In this case, the distinct object representations present a reasonably similar behaviour<br />

according to the histograms depicted in figure B.12. Assuming a simplifying viewpoint,<br />

these can be decomposed into a negatively skewed peak with its acme around φ (NMI) =0.6,<br />

and two other narrow peaks, one located near φ (NMI) =0.8 and the other on the low range<br />

of the histogram. Thus, as opposed to what has been observed in other data collections, the<br />

application of the twenty-eight clustering algorithms on the distinct object representations<br />

yields comparable quality results in this data set.

[Figure B.12: Histograms of the φ (NMI) values obtained on each data representation in the PenDigits data set. Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; horizontal axis: φ (NMI) in [0, 1], vertical axis: clustering count.]

B.1.13 Summary<br />


So as to provide a summarized vision of the data representation and the clustering algorithm<br />

selection indeterminacies across all the analyzed data sets, table B.2 presents the φ (NMI)<br />

corresponding to the best clustering solution achieved by each one of the six families of

clustering algorithms employed in this work, namely agglomerative (agglo), biased agglomerative<br />

(bagglo), direct, graph, repeated-bisecting (rb) and refined repeated-bisecting (rbr),<br />

indicating the type of object representation employed in each case (either baseline, PCA,<br />

ICA, NMF or RP).<br />
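The selection summarized in table B.2 can be expressed compactly in code. The sketch below assumes a hypothetical data layout (a mapping from (family, representation) pairs to the corresponding maximum φ(NMI)) and is not the thesis implementation: for every family it keeps the representation attaining the highest φ(NMI) and then sorts the families by that value.

def best_per_family(results):
    # results: dict mapping (family, representation) -> maximum phi(NMI) of that pair.
    best = {}
    for (family, representation), nmi in results.items():
        if family not in best or nmi > best[family][1]:
            best[family] = (representation, nmi)
    # Sort families from highest to lowest phi(NMI), as in table B.2.
    return sorted(best.items(), key=lambda item: item[1][1], reverse=True)

# Hypothetical excerpt (two families, three representations each):
example = {("agglo", "baseline"): 0.61, ("agglo", "PCA-13d"): 0.74, ("agglo", "NMF-13d"): 0.865,
           ("graph", "baseline"): 0.55, ("graph", "PCA-6d"): 0.730, ("graph", "RP-12d"): 0.70}
print(best_per_family(example))
# -> [('agglo', ('NMF-13d', 0.865)), ('graph', ('PCA-6d', 0.73))]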

Several facts are worth observing as regards the data representation indeterminacy. Notice that, in some data sets (e.g. Zoo or miniNG), there exists a notable diversity as regards the type of representation that yields the top clustering result for each family of clustering algorithms. In contrast, in other data collections there seems to exist a particular object representation that reveals the data set structure regardless of the type of clustering algorithm applied. This behaviour is observed in the Iris and Balance collections and also, to a lesser extent, in the WDBC and Segmentation data sets. Moreover, notice the variability of these optimal object representations across the analyzed data sets, which is a clear indicator of the clustering indeterminacy regarding data representations.

As far as the selection of the optimal clustering algorithm is concerned, it is important to note that each of the families of clustering algorithms employed in this work attains the best absolute performance in at least one of the analyzed data sets, which gives an idea of the algorithm selection indeterminacy. Moreover, notice that choosing the wrong type of clustering algorithm may affect the quality of the clustering solution dramatically (see the Ionosphere and Balance collections) or hardly at all (as in the Segmentation data set).

B.2 Clustering indeterminacies in multimodal data sets<br />

The goal of this section is to evaluate the effect of clustering indeterminacies in the context of multimodal data collections. Besides the data representation and clustering algorithm selection indeterminacies, multimodality introduces a further source of uncertainty, as it is not evident whether the combination of the m modalities will benefit the quality of the obtained clustering solution or not. Moreover, it is important to recall that all these indeterminacies are local to each data collection, so, in general, it is not possible to draw universally valid conclusions.

As done in appendix B.1, we start by presenting the total number of individual clustering solutions obtained by applying the 28 clustering algorithms extracted from the CLUTO toolbox on all the data representations of the objects contained in the employed multimodal data sets2 (see table B.3). Notice that the CAL500 and InternetAds collections lack the NMF representation, as their features do not satisfy the necessary non-negativity constraints.

In the next paragraphs, we describe the clustering results obtained on the four multimodal<br />

data sets, placing special emphasis on which data representations and modalities<br />

lead to the best clustering results in each case.<br />

2 See appendices A.1, A.2.2, and A.3.2 for a description of the clustering algorithms, the multimodal collections and the multimodal object representations employed in this work.


For each data set, the entries below list the top clustering result of every algorithm family, sorted from highest to lowest maximum φ(NMI), in the format family: representation (φ(NMI)).

Zoo: agglo: NMF-13d (0.865); bagglo: PCA-13d (0.858); direct: baseline (0.853); rb: RP-12d (0.848); rbr: RP-12d (0.848); graph: PCA-6d (0.730)
Iris: bagglo: baseline (0.899); direct: baseline (0.899); rb: baseline (0.899); rbr: baseline (0.899); agglo: baseline (0.837); graph: baseline (0.821)
Wine: direct: PCA-10d (0.836); rbr: PCA-10d (0.836); bagglo: ICA-12d (0.802); rb: PCA-10d (0.795); graph: PCA-8d (0.755); agglo: ICA-5d (0.720)
Glass: agglo: RP-8d (0.487); direct: RP-5d (0.442); rb: PCA-3d (0.423); rbr: PCA-6d (0.418); bagglo: PCA-4d (0.417); graph: PCA-3d (0.392)
Ionosphere: graph: PCA-8d (0.656); agglo: PCA-31d (0.314); bagglo: RP-14d (0.309); direct: RP-9d (0.234); rb: RP-9d (0.234); rbr: RP-9d (0.234)
WDBC: graph: NMF-3d (0.637); bagglo: NMF-3d (0.603); direct: NMF-3d (0.563); rb: NMF-3d (0.563); rbr: NMF-3d (0.563); agglo: RP-15d (0.522)
Balance: agglo: PCA-3d (0.703); bagglo: PCA-3d (0.411); rbr: PCA-3d (0.394); rb: PCA-3d (0.388); direct: PCA-3d (0.370); graph: PCA-3d (0.324)
MFeat: graph: PIX (0.811); rbr: PIX (0.676); direct: PIX (0.669); agglo: KAR (0.664); bagglo: PIX (0.606); rb: KAR (0.585)
miniNG: rb: baseline (0.638); rbr: PCA-70d (0.597); bagglo: baseline (0.559); direct: ICA-50d (0.558); graph: PCA-50d (0.412); agglo: NMF-50d (0.377)
Segmentation: graph: PCA-11d (0.786); rbr: PCA-13d (0.741); agglo: ICA-13d (0.733); bagglo: PCA-13d (0.731); rb: PCA-14d (0.728); direct: ICA-17d (0.720)
BBC: rbr: baseline (0.808); direct: baseline (0.808); graph: ICA-5d (0.777); agglo: ICA-5d (0.750); bagglo: ICA-5d (0.744); rb: baseline (0.726)
PenDigits: graph: NMF-13d (0.839); agglo: ICA-9d (0.724); rbr: NMF-9d (0.682); direct: NMF-10d (0.681); bagglo: NMF-11d (0.665); rb: NMF-7d (0.648)

Table B.2: Top clustering results obtained by each clustering algorithm family across all the unimodal data sets, sorted from highest to lowest φ(NMI).


Data representation   Modality   CAL500   Corel   InternetAds   IsoLetters
Baseline              MM           28       28        28            28
                      M1           28       28        28            28
                      M2           28       28        28            28
PCA                   MM          504      420       308           532
                      M1          280      196       308           196
                      M2          140      224       392           532
ICA                   MM          504      420       308           532
                      M1          280      196       308           196
                      M2          140      224       392           532
NMF                   MM           –       420        –            532
                      M1           –       196        –            196
                      M2           –       224        –            532
RP                    MM          504      420       308           532
                      M1          280      196       308           196
                      M2          140      224       392           532

Table B.3: Number of individual clusterings per data representation on each multimodal data set, where MM, M1 and M2 stand for multimodal, mode #1 and mode #2, respectively.

B.2.1 CAL500 data set<br />

The φ(NMI) histograms presented in figure B.13 summarize the clustering results obtained by running the aforementioned twenty-eight algorithms on each type of object representation for each of the two modalities, as well as for the multimodal representations.

If the histograms are compared representation-wise, we observe that all representations yield clustering solutions whose quality spans similarly wide ranges below φ(NMI) = 0.5. For a given modality, there exists no clearly superior object representation.

However, if the histograms are compared across modalities, it can be observed that better results are obtained when clustering is conducted on the audio modality of this data set, regardless of the type of representation employed. Moreover, the multimodal data representation seems to yield clustering results of intermediate quality (i.e. slightly better than clustering on text only, but worse than clustering solely on audio), which reveals that the early fusion of acoustic and textual features is not beneficial in this case.
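For reference, the early fusion scheme referred to above amounts to concatenating the per-object feature vectors of the two modalities before clustering. The sketch below assumes both modalities are available as object-by-feature matrices over the same set of objects; the variable names and matrix sizes are illustrative, not taken from the thesis.

import numpy as np

def early_fusion(features_m1, features_m2):
    # Objects must be aligned row-wise across modalities.
    assert features_m1.shape[0] == features_m2.shape[0]
    # Early fusion simply concatenates the per-object feature vectors,
    # so a single clustering algorithm sees both modalities at once.
    return np.hstack([features_m1, features_m2])

# Illustrative sizes only: 502 songs with hypothetical acoustic and textual features.
audio = np.random.rand(502, 68)
text = np.random.rand(502, 173)
multimodal = early_fusion(audio, text)   # shape (502, 241)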

B.2.2 Corel data set<br />

Figure B.14 presents the φ(NMI) histograms corresponding to the multimodal and unimodal clustering of the captioned images of the Corel data set.

As regards the comparison across object representations, it can be observed that, especially for the image and multimodal modalities, the RP representation offers a large number of good clustering solutions, whereas the quality of the clusterings obtained on the remaining representations is scattered over a wide range of φ(NMI) values.

If the clustering results obtained on the two modalities are compared, we can see that the image modality is the one yielding the best clustering results (up to φ(NMI) = 0.68), which are far better than those obtained on the text modality (always below φ(NMI) = 0.3). Last but not least, the multimodal object representation seems to benefit slightly from the early fusion of the visual and textual features of both modalities, as better clustering results are obtained in this case, although by a very small margin.

Figure B.13: Histograms of the φ(NMI) values obtained on the CAL500 data set for each data representation (Baseline, PCA, ICA, RP) and modality (multimodal, audio, text).

B.2.3 InternetAds data set<br />

The clustering results corresponding to the InternetAds data collection are summarized in figure B.15. Many poor clustering results are obtained on this data set, as revealed by the high peaks located on the leftmost regions of the histograms. The distinct data representations and modalities behave rather erratically, as discussed next.

If the two modalities are compared, the best clustering results are obtained, in general terms, using the collateral information of the Internet advertisements (which are the objects in this data set). However, the multimodal composition of the objects tends to yield clusterings of superior (although still poor) quality, except for the PCA representation.



Figure B.14: Histograms of the φ(NMI) values obtained on the Corel data set for each data representation (Baseline, PCA, ICA, NMF, RP) and modality (multimodal, image, text).

B.2.4 IsoLetters data set<br />

The collection of clustering solutions obtained on the IsoLetters artificial multimodal data collection is presented representation- and modality-wise in figure B.16.

In this case, the quality of the clusterings created on the distinct object representations presents two clearly different histogram patterns depending on the modality. For instance, in the speech modality (figures B.16(e) to B.16(h)), the baseline and RP histograms present a main peak and a secondary minor one, whereas the PCA and ICA representations yield a fairly uniform distribution of clusterings. In contrast, a totally different distribution is found when clustering is run on the visual mode, where a single negatively skewed bell shape is observed (see figures B.16(i) to B.16(l)).

Finally, it is worth noting that, regardless of the object representation employed, the early fusion of the speech and visual features of this data set gives rise to a notable increase in the quality of the clustering results (a 16.2% average relative improvement with respect to the top quality individual clustering solution).


Figure B.15: Histograms of the φ(NMI) values obtained on the InternetAds data set for each data representation (Baseline, PCA, ICA, NMF, RP) and modality (multimodal, object, collateral).

B.2.5 Summary<br />

Following the same format as in appendix B.1, table B.4 presents the φ(NMI) values attained by the top clustering solution achieved by the best representative of each of the families of clustering algorithms employed in this work (i.e. agglo, bagglo, direct, graph, rb and rbr), indicating the type of representation employed (baseline, PCA, ICA, NMF or RP) and the modality (multimodal, MM; mode #1, M1; or mode #2, M2). The idea is to present a condensed view of the influence of the data representation and clustering algorithm selection indeterminacies.

Notice the distinct ordering of the families of clustering algorithms in every data set. A clear indicator of the algorithm selection indeterminacy is the fact that the rbr algorithms yield the top clustering solution in three of the four data sets, while offering the poorest performance in the InternetAds collection.

The indeterminacy regarding the use of multimodal or unimodal data representations also becomes evident: in two of the data sets (Corel and IsoLetters) the multimodal representations dominate the best clustering results across all the families of algorithms, whereas a unimodal representation does so in the CAL500 and InternetAds collections. And finally, notice the diversity of the types of representation appearing in table B.4, which suggests that, for a given data set, it is very difficult to select the data representation and clustering strategy that yield the best clustering results.
Figure B.16: Histograms of the φ(NMI) values obtained on the IsoLetters data set for each data representation (Baseline, PCA, ICA, RP) and modality (multimodal, speech, image).


For each multimodal data set, the entries below list the top clustering result of every algorithm family, sorted from highest to lowest φ(NMI), in the format family: representation-modality, dimensionality (φ(NMI)).

CAL500: rbr: RP-M1, 120d (0.411); direct: RP-M1, 100d (0.404); agglo: RP-M1, 100d (0.401); rb: RP-M1, 120d (0.384); bagglo: RP-M1, 120d (0.381); graph: baseline-M1 (0.364)
Corel: rbr: NMF-MM, 550d (0.675); graph: RP-MM, 400d (0.672); direct: NMF-MM, 450d (0.671); rb: NMF-MM, 300d (0.641); bagglo: baseline-M1 (0.624); agglo: baseline-MM (0.622)
InternetAds: bagglo: RP-M1, 70d (0.430); graph: NMF-MM, 150d (0.319); agglo: baseline-M2 (0.258); direct: NMF-M2, 550d (0.087); rb: NMF-M2, 550d (0.087); rbr: NMF-M2, 550d (0.087)
IsoLetters: rbr: PCA-MM, 100d (0.897); direct: PCA-MM, 100d (0.875); graph: PCA-MM, 100d (0.846); agglo: RP-MM, 600d (0.790); rb: baseline-MM (0.751); bagglo: baseline-MM (0.728)

Table B.4: Top clustering results obtained by each clustering algorithm family across all the multimodal data sets, sorted from highest to lowest φ(NMI).



Appendix C

Experiments on hierarchical consensus architectures

This appendix presents several experiments regarding self-refining hierarchical consensus<br />

architectures.<br />

C.1 Configuration of a random hierarchical consensus architecture<br />

In this section, we present some examples that describe, in detail, the configuration process of random hierarchical consensus architectures (RHCA). The aim is to demonstrate how, given a cluster ensemble size l and a mini-ensemble size b, equations (C.1), (C.2) and (C.3) determine the number of stages s, the number of consensus per stage K_i and the effective size of each mini-ensemble b_ij of the corresponding RHCA (a compact code sketch of these rules is provided after table C.4 below).

To begin with, let us carefully examine the three RHCA examples presented in section 3.2. In these toy examples, the mini-ensemble size is set to b = 2, while the respective cluster ensembles have l = 7, 8 and 9 components. The first step of the RHCA design process consists of determining the number of stages of the hierarchy, s, according to equation (C.1).

$$
s = \begin{cases}
\lfloor \log_b(l) \rceil & \text{if}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \right\rfloor \leq 1\ \text{and}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} \right\rfloor > 1 \\[6pt]
\lfloor \log_b(l) \rceil - 1 & \text{if}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \right\rfloor \leq 1\ \text{and}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil - 1}} \right\rfloor = 1 \\[6pt]
\lfloor \log_b(l) \rceil + 1 & \text{if}\ \left\lfloor \dfrac{l}{b^{\lfloor \log_b(l) \rceil}} \right\rfloor > 1
\end{cases}
\qquad \text{(C.1)}
$$

where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer and $\lfloor \cdot \rfloor$ denotes the floor operator.

Table C.1 presents the results of this computation for the three aforementioned examples<br />

(one row per example), specifying the values of the decision factors used for selecting one<br />

of the three options presented in equation (C.1).<br />

Once the number of stages of the RHCA is computed, the next step consists of determining how many consensus processes are to be executed at each RHCA stage.

         l / b^⌊log_b(l)⌉    l / b^(⌊log_b(l)⌉ − 1)    s
l = 7         0.875                  1.75              ⌊log_b(l)⌉ − 1 = 2
l = 8         1                      2                 ⌊log_b(l)⌉ = 3
l = 9         1.125                  2.25              ⌊log_b(l)⌉ = 3

Table C.1: Examples of the computation of the number of stages s of a RHCA on cluster ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.

The number of consensus processes per stage, designated by K_i (where the subindex i denotes the stage number), is computed according to equation (C.2).

$$
K_i = \max\left(\left\lfloor \frac{l}{b^i} \right\rfloor,\ 1\right)
\qquad \text{(C.2)}
$$

The number of consensus processes per stage of each of the three RHCA examples discussed is presented in table C.2.

l = 7: stage 1: K_1 = max(⌊3.5⌋, 1) = 3, as l/b = 3.5; stage 2: K_2 = max(⌊1.75⌋, 1) = 1, as l/b² = 1.75
l = 8: stage 1: K_1 = max(⌊4⌋, 1) = 4, as l/b = 4; stage 2: K_2 = max(⌊2⌋, 1) = 2, as l/b² = 2; stage 3: K_3 = max(⌊1⌋, 1) = 1, as l/b³ = 1
l = 9: stage 1: K_1 = max(⌊4.5⌋, 1) = 4, as l/b = 4.5; stage 2: K_2 = max(⌊2.25⌋, 1) = 2, as l/b² = 2.25; stage 3: K_3 = max(⌊1.125⌋, 1) = 1, as l/b³ = 1.125

Table C.2: Examples of the computation of the number of consensus per stage (K_i) of a RHCA on cluster ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.

And finally, the effective mini-ensemble sizes b_ij, ∀i ∈ [1, s] and ∀j ∈ [1, K_i], must be computed. Recall that the effective size of all the mini-ensembles of the RHCA is equal to the user-defined mini-ensemble size (i.e. b_ij = b ∀i, j) if and only if l is an integer power of b. In practice, the effective mini-ensemble sizes are computed according to equation (C.3), which adjusts this factor so that all the original and intermediate clusterings are subject to a consensus process. The b_ij values corresponding to the three RHCA examples are presented in table C.3, along with the corresponding number of consensus K_i at each RHCA stage in brackets.

$$
b_{ij} = \begin{cases}
b & \text{if}\ i < s\ \text{and}\ j < K_i \\
b + (K_{i-1} \bmod b) & \text{if}\ i < s\ \text{and}\ j = K_i \\
K_{s-1} & \text{if}\ i = s
\end{cases}
\qquad \text{(C.3)}
$$

where, by convention, $K_0 = l$ (the size of the original cluster ensemble).
b if i


l = 7 [K_1 = 3, K_2 = 1]: stage 1: b_11 = b = 2, b_12 = b = 2, b_13 = b + l mod b = 3; stage 2: b_21 = K_1 = 3
l = 8 [K_1 = 4, K_2 = 2, K_3 = 1]: stage 1: b_11 = b_12 = b_13 = b = 2, b_14 = b + l mod b = 2; stage 2: b_21 = b = 2, b_22 = b + K_1 mod b = 2; stage 3: b_31 = K_2 = 2
l = 9 [K_1 = 4, K_2 = 2, K_3 = 1]: stage 1: b_11 = b_12 = b_13 = b = 2, b_14 = b + l mod b = 3; stage 2: b_21 = b = 2, b_22 = b + K_1 mod b = 2; stage 3: b_31 = K_2 = 2

Table C.3: Examples of the computation of the mini-ensemble sizes of a RHCA on cluster ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.
ensembles of size l =7, 8 and 9, being the mini-ensembles size b =2.<br />

Additionally, table C.4 describes the configuration of RHCA variants built upon a cluster ensemble of an arbitrarily chosen size of l = 30, using the following predefined mini-ensemble sizes: b = {2, 3, 5, 15}.

It can be observed that the larger the mini-ensemble size b, the smaller the number of stages s. Moreover, notice that, for any given RHCA, the number of consensus per stage K_i progressively converges to unity (i.e. K_s = 1), giving rise to a regular pyramidal hierarchy of consensus processes. Last, it is worth observing how the effective size of the mini-ensembles b_ij is determined. Notice that b_ij is regularly set to b, except possibly for the last (i.e. the K_i-th) consensus of each stage and/or the single consensus of the final stage, whose sizes may vary so as to accommodate the necessary number of clusterings into the associated consensus process.

Configuration of RHCA topologies for a cluster ensemble of size l = 30:

b = 2: s = 4; K_i = {15, 7, 3, 1}; b_1j = 2 ∀j ∈ [1, 15]; b_2j = 2 ∀j ∈ [1, 6], b_27 = 3; b_3j = 2 ∀j ∈ [1, 2], b_33 = 3; b_41 = 3
b = 3: s = 3; K_i = {10, 3, 1}; b_1j = 3 ∀j ∈ [1, 10]; b_2j = 3 ∀j ∈ [1, 2], b_23 = 4; b_31 = 3
b = 5: s = 2; K_i = {6, 1}; b_1j = 5 ∀j ∈ [1, 6]; b_21 = 6
b = 15: s = 2; K_i = {2, 1}; b_1j = 15 ∀j ∈ [1, 2]; b_21 = 2

Table C.4: Configuration of RHCA topologies on a cluster ensemble of size l = 30 with varying mini-ensemble sizes b = {2, 3, 5, 15}.
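The three configuration rules above can be condensed into a short procedure. The following sketch is one possible reading of equations (C.1), (C.2) and (C.3), written so that it reproduces the worked examples of tables C.1 to C.4; it is not the original implementation used in this thesis.

import math

def rhca_configuration(l, b):
    # Returns (s, K, B): number of stages, consensus per stage, and mini-ensemble sizes.
    r = round(math.log(l, b))                 # nearest-integer rounding of log_b(l)
    # Equation (C.1): number of stages s.
    if l // b ** r > 1:
        s = r + 1
    elif l // b ** (r - 1) > 1:
        s = r
    else:
        s = r - 1
    # Equation (C.2): K_i = max(floor(l / b^i), 1) consensus processes at stage i.
    K = [max(l // b ** (i + 1), 1) for i in range(s)]
    # Equation (C.3): effective mini-ensemble sizes b_ij.
    B = []
    previous = l                              # clusterings entering the current stage (K_0 = l)
    for i in range(s):
        if i == s - 1:
            row = [previous]                  # final stage: a single consensus over all survivors
        else:
            row = [b] * K[i]
            row[-1] = b + previous % b        # the last mini-ensemble absorbs the remainder
        B.append(row)
        previous = K[i]
    return s, K, B

# Reproduces, e.g., the b = 2 column of table C.4: s = 4, K = [15, 7, 3, 1].
print(rhca_configuration(30, 2))

For instance, for l = 7 and b = 2 the sketch returns s = 2, K = [3, 1] and mini-ensemble sizes [[2, 2, 3], [3]], matching table C.3.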


C.2 Estimation of the computationally optimal RHCA<br />

Section 3.2 presents a methodology for selecting the most computationally efficient implementation variant of random hierarchical consensus architectures. In short, this methodology consists of estimating the running time of several RHCA variants differing in the mini-ensemble size b and selecting the one that yields the minimum estimated running time, which is the only variant actually executed (a simplified sketch of this selection logic is given after the experimental design outline below).

So as to validate this procedure, in this section we present the estimated and real running times of several variants of the fully serial and parallel implementations of RHCA on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat unimodal data sets (see appendix A.2.1 for a description of these collections) across the four experimental diversity scenarios employed in this work (see appendix A.4). The objective of this experiment is twofold: firstly, we seek to verify whether the proposed strategy succeeds in predicting the most computationally efficient RHCA variant; and secondly, we intend to analyze the conditions under which random hierarchical consensus architectures are computationally advantageous compared to flat consensus clustering. The experimental design that has been followed is outlined next.

– What do we want to measure?<br />

i) The time complexity of random hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology to predict the computationally optimal RHCA variant, in both the fully serial and parallel implementations.

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel RHCA variants is measured in terms of the CPU time required for their execution: the serial running time (SRT_RHCA) and the parallel running time (PRT_RHCA).

ii) The estimated running times of the same RHCA variants, i.e. the serial estimated running time (SERT_RHCA) and the parallel estimated running time (PERT_RHCA), are computed by means of the proposed running time estimation methodology, which is based on the measured running time of c = 1 consensus clustering process. A prediction regarding the computationally optimal RHCA variant is deemed successful when both the real and the estimated running times are minimized by the same RHCA variant, and the percentage of experiments in which the prediction is successful is given as a measure of its performance. In order to measure the impact of incorrect predictions, we also measure the execution time differences (in both absolute and relative terms) between the truly and the allegedly fastest RHCA variants whenever the prediction fails. This evaluation process is replicated for a range of values of c ∈ [1, 20], so as to measure the influence of this factor on the prediction accuracy of the proposed methodology.

– How are the experiments designed? All the RHCA variants corresponding to the sweep of values of b resulting from the proposed running time estimation methodology have been implemented (see table 3.2). In order to test our proposals under a wide spectrum of experimental situations, consensus processes have been conducted using the seven consensus functions for hard cluster ensembles presented in appendix A.5 (i.e. CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), employing cluster ensembles of the sizes corresponding to the four diversity scenarios described in appendix A.4, which basically boils down to compiling the clusterings output by |dfA| = {1, 10, 19, 28} clustering algorithms. In all cases, the real running times correspond to an average of 10 independent runs of the whole RHCA, in order to obtain representative real running time values (recall that the mini-ensemble components change from run to run, as they are randomly selected). For a description of the computational resources employed in our experiments, see appendix A.6.

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the RHCA variants are depicted by means of<br />

curves representing their average values.<br />
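The selection logic just outlined can be illustrated with a deliberately simplified sketch. It assumes, purely for illustration, that every consensus process in a hierarchy costs roughly the time t_c measured for a single reference consensus run, so the serial estimate sums that time over all consensus processes and the parallel estimate counts one consensus per stage. The actual SERT_RHCA/PERT_RHCA estimators of section 3.2 are more refined; only the final minimum-selection step is common to both.

def estimate_running_times(variants, t_c):
    # variants: dict mapping b -> (s, K_list); t_c: measured time of one consensus (seconds).
    estimates = {}
    for b, (s, K) in variants.items():
        serial = t_c * sum(K)     # a single CPU runs every consensus sequentially
        parallel = t_c * s        # one consensus per stage lies on the critical path
        estimates[b] = (serial, parallel)
    return estimates

def fastest_variant(estimates, parallel=False):
    idx = 1 if parallel else 0
    return min(estimates, key=lambda b: estimates[b][idx])

# Example on a cluster ensemble of size l = 30 (stage counts taken from table C.4):
variants = {2: (4, [15, 7, 3, 1]), 3: (3, [10, 3, 1]), 5: (2, [6, 1]), 15: (2, [2, 1])}
estimates = estimate_running_times(variants, t_c=0.8)
print(fastest_variant(estimates), fastest_variant(estimates, parallel=True))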

C.2.1 Iris data set<br />

To begin with, let us analyze the results corresponding to the Iris data collection. In this case, the four diversity scenarios correspond to cluster ensembles of size l = 9, 90, 171 and 252, respectively. The left and right columns of figure C.1 present the estimated and real running times of several variants of the serial implementation of the RHCA on this data set across the four diversity scenarios. It can be observed that, as the size of the cluster ensemble grows, there appear RHCA variants that are more computationally efficient than flat consensus (especially when the MCLA and KMSAD consensus functions are employed). However, there are no significant differences between the running times of the fastest RHCA variant and flat consensus, probably due to the small size of this data set and of the associated cluster ensembles. For this reason, the inaccuracies of the running time prediction based on SERT_RHCA are of little importance in practice.

Figure C.2 presents the estimated and real running times of the parallel implementation of the RHCA in the four diversity scenarios analyzed. According to PERT_RHCA, the parallel RHCA variants with the s = 2/lowest b and s = 3/highest b configurations yield the maximum computational efficiency, except for the lowest diversity scenario, where flat consensus is correctly designated as the fastest option. If these predictions are compared to the real running times presented on the right column of figure C.2, it can be observed that, as the diversity level grows, they remain accurate as regards the identification of the fastest consensus architecture for most consensus functions. Moreover, even when the prediction strategy fails to identify the fastest RHCA variant, the penalty in absolute running time is perfectly assumable, as the real running times of parallel RHCA are below one second for the particular case of this data set.

C.2.2 Wine data set<br />

In this section, we present the estimated and real running times of the serial and parallel implementations of RHCA on the Wine data collection. As aforementioned, this experiment has been replicated across four diversity scenarios that, in the case of this data set, correspond to cluster ensembles of size l = 45, 450, 855 and 1260. Thus, notice that considerably larger cluster ensembles are obtained in this case, especially if compared to those of the Iris data collection. This is due to the fact that the Wine data set has a much richer dimensional diversity as regards the distinct object representations generated (approximately five times richer), which boosts the size of the cluster ensemble.

Figure C.1: Estimated and real running times (RT) of the serial RHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.2: Estimated and real running times (RT) of the parallel RHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

Firstly, figure C.3 depicts the results corresponding to the fully serial RHCA implementation across the four diversity scenarios (estimated running times on the left column, real running times on the right). The first remarkable fact is that SERT_RHCA is a fairly accurate predictor of SRT_RHCA, which can be easily verified by comparing the pair of subfigures presented on each row of figure C.3. Again, RHCA becomes more computationally attractive as the size of the cluster ensemble increases (except for the EAC consensus function). Moreover, among the distinct RHCA variants executed in each experiment, the greatest efficiency is achieved by the ones with 2 or 3 stages.

The estimated and real execution times of the parallel implementation of RHCA are depicted in figure C.4. As already observed in the Iris data set, PERT_RHCA is a modestly accurate estimator of PRT_RHCA, although it is a fairly good predictor of the most computationally efficient consensus architecture. Notice that, in the most diverse scenario (|dfA| = 28), the least time consuming RHCA variant is nearly two orders of magnitude faster than flat consensus; thus, being able to predict which RHCA configuration requires the least computation time constitutes a significantly advantageous strategy compared to the traditional one-step approach to consensus clustering.

C.2.3 Glass data set<br />

This section presents the results of estimating the execution times of the fully serial and parallel implementations of RHCA in the four diversity scenarios for the Glass data set, which give rise to cluster ensembles of sizes l = 29, 290, 551 and 812, respectively.

Firstly, figure C.5 depicts both the estimated and real running times of several serial RHCA variants. These results are quite comparable to those obtained in the previous data collections. That is, except for the EAC consensus function, the RHCA variants with s = 2 and s = 3 stages become the most computationally efficient as the size of the cluster ensemble increases. Moreover, in the most diverse scenario (|dfA| = 28), flat consensus is not executable if the MCLA consensus function is employed as the clustering combiner, whereas hierarchical consensus does provide a means for obtaining a consolidated clustering solution upon the same cluster ensemble using this consensus function. Furthermore, notice that the proposed methodology for estimating the running time of serial RHCA yields fairly reliable predictions of the real execution time.

And secondly, the results corresponding to the parallel implementation of RHCA are presented in figure C.6. Again, it can be observed that the estimated running time of the parallel RHCA is only a moderately accurate approximation of the real execution time. However, this lack of accuracy is tolerable inasmuch as i) the location of the minima of PERT_RHCA mostly coincides with that of the minima of PRT_RHCA, which means that the fastest consensus architecture is successfully predicted, and ii) the selection of a computationally suboptimal RHCA variant involves only a slight penalty in terms of real execution time.

C.2.4 Ionosphere data set<br />

This section describes the results of the minimum complexity RHCA variant selection based on running time estimation. In the case of the Ionosphere data collection, cluster ensembles of sizes l = 97, 970, 1843 and 2716 correspond to the four diversity scenarios where this experiment is conducted.



Figure C.3: Estimated and real running times (RT) of the serial RHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.4: Estimated and real running times (RT) of the parallel RHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.5: Estimated and real running times (RT) of the serial RHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.6: Estimated and real running times (RT) of the parallel RHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.

To begin with, figure C.7 presents the estimated and real execution times of several variants of the fully serial implementation of RHCA. If the estimated and real running times are compared, it can be observed that it is possible to accurately predict the real execution time of the serial RHCA variants, which, at the same time, allows the precise prediction of the most computationally efficient RHCA variant, the ultimate goal of the proposed methodology.

Figure C.8 depicts the estimated and real running times of the parallel RHCA implementation across the sweep of values of b for all four diversity scenarios. In comparison to what is observed in other data sets, PERT_RHCA is a better estimator of PRT_RHCA in this case. Moreover, as the diversity of the cluster ensemble grows, the computational savings derived from employing the fastest RHCA instead of flat consensus are noteworthy (especially for the HGPA, CSPA, ALSAD and SLSAD consensus functions). Last, note that flat consensus is not executable in three of the four diversity scenarios if consensus is obtained by means of MCLA, due to the large size of the mini-ensembles involved.

C.2.5 WDBC data set<br />

In this section, we present the results of estimating the execution times of the serial and<br />

parallel RHCA implementations on the WDBC data collection. According to the four<br />

diversity scenarios generated by employing |dfA| =1, 10, 19 and 28 clustering algorithms<br />

for generating the cluster ensembles, these contain l = 113, 1130, 2147 and 3164 individual<br />

partitions.<br />

The estimated and real running times corresponding to the serial implementation of<br />

RHCA are depicted in figure C.9. As observed in the remaining data collections, SERTRHCA<br />

is a fairly accurate estimator of SRTRHCA, which allows predicting the fastest consensus<br />

architecture with a high precision. Notice that, as already noted in other collections, RHCA<br />

becomes a competitive option as the size of the cluster ensemble grows, except when the EAC<br />

consensus function is employed. Notice that, for the most diverse scenarios, all consensus<br />

architectures are highly costly (in terms of execution time), so being able to predict which<br />

is the fastest can lead to important computation savings.<br />

As regards the fully parallel implementation of RHCA, the estimated and real running<br />

times corresponding to the four aforementioned diversity scenarios are presented in figure<br />

C.10. Although the estimation of the real execution time is not as accurate as in the serial<br />

case, PERTRHCA is a reasonable predictor of the fastest consensus architecture in most<br />

cases.<br />

C.2.6 Balance data set<br />

This section presents the estimated and real execution times of multiple variants of random<br />

hierarchical consensus architectures on the Balance data collection, both in its serial and<br />

parallel versions. The low cardinality of the dimensional diversity factor of this data set<br />

gives rise to relatively small cluster ensembles in the four diversity scenarios, which are<br />

equal to l =7, 70, 133 and 196 in this case.<br />

Firstly, figure C.11 depicts the estimated and real running times of the serial RHCA<br />

implementation in the four diversity scenarios. As already observed in the previous data<br />

[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.7: Estimated and real running times (RT) of the serial RHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.8: Estimated and real running times (RT) of the parallel RHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.9: Estimated and real running times (RT) of the serial RHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.10: Estimated and real running times (RT) of the parallel RHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



sets, our proposed method allows obtaining a pretty accurate estimation of the execution<br />

time of any serial RHCA variant which, at the same time, allows the user to make a reliable<br />

decision regarding the most computationally efficient consensus architecture, regardless of the<br />

diversity scenario and the consensus function employed.<br />

Secondly, figure C.12 presents the corresponding magnitudes when the fully<br />

parallel implementation of RHCA is employed. In this situation, the estimation of the real<br />

execution time is not as accurate as in the serial case, although the running time deviation<br />

suffered when a suboptimal RHCA architecture is selected is, from a practical viewpoint,<br />

perfectly acceptable, a fact that has already been reported for the previous data sets.<br />

C.2.7 MFeat data set<br />

In this section, the results of estimating the execution times of RHCA are compared to their<br />

real counterparts across four diversity scenarios in the context of the MFeat data collection.<br />

The cluster ensemble sizes corresponding to these diversity scenarios are l =6, 60, 114 and 168,<br />

respectively.<br />

Figure C.13 presents the estimated and real running times of multiple variants of the<br />

serial implementation of RHCA on this data set. Besides the notably high accuracy of<br />

the estimation, we would like to highlight that flat consensus turns out to be the most<br />

efficient consensus architecture in the four diversity scenarios for all but two of the consensus<br />

functions employed (MCLA and HGPA), a behaviour that has already been observed in<br />

other data collections with small cluster ensembles (e.g. the Iris data set).<br />

The results corresponding to the parallel implementation of RHCA are depicted in<br />

figure C.14. In this case, the use of the HGPA and MCLA consensus functions as clustering<br />

combiners also makes the RHCA variants with s = 2 and s = 3 stages computationally<br />

optimal. However, for the remaining consensus functions, flat consensus prevails as<br />

the most efficient consensus architecture in most diversity scenarios.<br />

[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.11: Estimated and real running times (RT) of the serial RHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.12: Estimated and real running times (RT) of the parallel RHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT RHCA) and real (SRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.13: Estimated and real running times (RT) of the serial RHCA implementation on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT RHCA) and real (PRT RHCA) running times in seconds versus the mini-ensemble size b and the corresponding number of stages s, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), for |dfA| = 1, 10, 19, 28.]<br />

Figure C.14: Estimated and real running times (RT) of the parallel RHCA implementation on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



C.3 Estimation of the computationally optimal DHCA<br />

The methodology for selecting the most computationally efficient implementation variant<br />

of deterministic hierarchical consensus architectures presented in section 3.3 –which consists of<br />

estimating the running time of several DHCA variants that differ in the order in which diversity factors<br />

are associated to the stages of the hierarchical consensus architecture, and selecting the one that<br />

yields the minimum estimated running time, which is the variant actually executed– has been applied<br />

on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat unimodal data sets (see<br />

appendix A.2.1 for a description of these collections).<br />
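
In outline, this selection procedure amounts to the following sketch (illustrative Python, not the thesis code; estimate_rt stands for the SERT/PERT estimate built from the timed consensus runs and is an assumed callable):<br />

```python
from itertools import permutations

def select_fastest_dhca(factors, estimate_rt):
    """Enumerate the f! DHCA variants (one per ordering of the diversity factors,
    e.g. ('A', 'D', 'R') -> 'ADR') and return the variant with the smallest
    estimated running time, i.e. the one that is actually executed."""
    best_name, best_rt = None, float('inf')
    for order in permutations(factors):
        rt = estimate_rt(order)
        if rt < best_rt:
            best_name, best_rt = ''.join(order), rt
    return best_name, best_rt

# Example call with the three diversity factors used in this work:
# variant, rt = select_fastest_dhca(('A', 'D', 'R'), estimate_rt)
```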

In these experiments, the fully serial and parallel implementations of DHCA variants<br />

have been considered across the four experimental diversity scenarios employed in this work<br />

—see appendix A.4. The objective of this experiment is twofold: firstly, we seek to verify<br />

whether the proposed strategy succeeds in predicting the most computationally efficient<br />

DHCA variant. And secondly, we intend to analyze the conditions under which deterministic<br />

hierarchical consensus architectures are computationally advantageous compared to flat<br />

consensus clustering. We have followed the experimental design outlined next.<br />

– What do we want to measure?<br />

i) The time complexity of deterministic hierarchical consensus architectures.<br />

ii) The ability of the proposed methodology for predicting the computationally optimal<br />

DHCA variant, in both the fully serial and parallel implementations.<br />

iii) The predictive power of the proposed methodology based on running time estimation<br />

vs the computational optimality criterion based on designing the DHCA<br />

according to a decreasing diversity factor cardinality order, in both the fully<br />

serial and parallel implementations.<br />

– How do we measure it?<br />

i) The time complexity of the implemented serial and parallel DHCA variants is<br />

measured in terms of the CPU time required for their execution —serial running<br />

time (SRTDHCA) and parallel running time (PRTDHCA).<br />

ii) The estimated running times of the same DHCA variants –serial estimated running<br />

time (SERTDHCA) and parallel estimated running time (PERTDHCA)– are<br />

computed by means of the proposed running time estimation methodology, which<br />

is based on the measured running time of c = 1 consensus clustering process. A prediction<br />

regarding the computationally optimal DHCA variant is considered successful<br />

if both the real and estimated running times are minimized by the<br />

same DHCA variant, and the percentage of experiments in which prediction<br />

succeeds is given as a measure of its performance. In order to measure the<br />

impact of incorrect predictions, we also measure the execution time differences<br />

(in both absolute and relative terms) between the truly and the allegedly fastest<br />

DHCA variants when prediction fails (a sketch of this evaluation is given at the<br />

end of this experimental design outline). This evaluation process is replicated<br />

for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />

on the prediction accuracy of the proposed methodology.<br />

iii) Both approaches for predicting the computationally optimal DHCA variant are compared<br />

in terms of the percentage of experiments in which prediction is successful,<br />


and in terms of the execution time overheads (in both absolute and relative terms)<br />

between the truly and the allegedly fastest DHCA variants in the case prediction<br />

fails.<br />

– How are the experiments designed? The f! DHCA variants corresponding to<br />

all the possible permutations of the f diversity factors employed in the generation<br />

of the cluster ensemble have been implemented (see table 3.6). As described in appendix<br />

A.4, cluster ensembles have been created by the mutual crossing of f =3<br />

diversity factors: clustering algorithms (dfA), object representations (dfR) and data<br />

dimensionalities (dfD). Thus, in all our experiments, the number of DHCA variants is<br />

f! = 3! = 6, which are identified by an acronym describing the order in which diversity<br />

factors are assigned to stages —for instance, ADR describes the DHCA variant<br />

defined by the ordered list O = {df1 = dfA, df2 = dfD, df3 = dfR}. For a given data<br />

collection, the cardinalities of the representational and dimensional diversity factors<br />

(|dfR| and |dfD|, respectively) are constant, while the cardinality of the algorithmic<br />

diversity factor takes four distinct values |dfA| = {1, 10, 19, 28}, giving rise to the four<br />

diversity scenarios where our proposals are analyzed. Moreover, consensus clustering<br />

has been conducted by means of the seven consensus functions for hard cluster<br />

ensembles described in appendix A.5, which allows evaluating the behaviour of our<br />

proposals under distinct consensus paradigms. In all cases, the real running times<br />

correspond to an average of 10 independent runs of the whole DHCA, in order to<br />

obtain representative real running time values. As described in appendix A.6, all the<br />

experiments have been executed under Matlab 7.0.4 on Pentium 4 3GHz/1 GB RAM<br />

computers.<br />

– How are results presented? Both the real and estimated running times of the<br />

serial and parallel implementations of the DHCA variants are depicted by means of<br />

curves representing their average values.<br />

– Which data sets are employed? For brevity reasons, the main text (section 3.3) only describes<br />

the results of the experiments conducted on the Zoo data collection, on which<br />

the cardinalities of the representational and dimensional diversity factors are<br />

|dfR| = 5 and |dfD| = 14, respectively. The results of these<br />

same experiments on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat<br />

unimodal data collections are presented in the following subsections of this appendix.<br />
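
The success and overhead bookkeeping described under "How do we measure it?" can be sketched as follows (illustrative Python; the dictionaries map DHCA variant names such as 'ADR' to running times in seconds, and all names are hypothetical):<br />

```python
def prediction_outcome(estimated_rt, real_rt):
    """Compare the variant predicted fastest (minimum estimated RT) with the
    truly fastest one (minimum real RT). On a miss, also report the absolute
    and relative running time overhead incurred by executing the predicted variant."""
    predicted = min(estimated_rt, key=estimated_rt.get)
    fastest = min(real_rt, key=real_rt.get)
    if predicted == fastest:
        return True, 0.0, 0.0
    abs_overhead = real_rt[predicted] - real_rt[fastest]
    return False, abs_overhead, abs_overhead / real_rt[fastest]

def success_percentage(experiments):
    """Percentage of (estimated, real) experiment pairs with a successful prediction."""
    hits = sum(prediction_outcome(est, real)[0] for est, real in experiments)
    return 100.0 * hits / len(experiments)
```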

C.3.1 Iris data set<br />

In this section, we present the estimated and real running times of the serial and parallel implementations<br />

of DHCA on the Iris data collection. As aforementioned, this experiment has<br />

been replicated across four diversity scenarios that, in the case of this data set, correspond<br />

to cluster ensembles of size l =9, 90, 171 and 252.<br />

The left and right columns of figure C.15 present the estimated and real running times<br />

of several variants of the serial implementation of the DHCA and flat consensus on this data<br />

set across the four diversity scenarios. There are a couple of issues worth noting: firstly,<br />

SERTDHCA is a pretty accurate estimator of the real execution time of the serial DHCA<br />

implementation, SRTDHCA. Secondly, notice that flat consensus is faster than even the most<br />

efficient DHCA variant, regardless of the consensus function and the diversity scenario.<br />


Furthermore, we would like to highlight that the computationally optimal DHCA variants<br />

are those defined by an ordered list of diversity factors in decreasing cardinality order, a<br />

trend that is notably well captured by SERTDHCA.<br />
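
As a trivial illustration of this criterion (the cardinalities are those shown in figure C.15 for the |dfA| = 28 scenario; the snippet is hypothetical, not the thesis code), the decreasing cardinality ordering can be read off directly from the factor cardinalities:<br />

```python
# Iris, |dfA| = 28 scenario: |dfA| = 28, |dfR| = 5, |dfD| = 2.
cardinalities = {'A': 28, 'R': 5, 'D': 2}
variant = ''.join(sorted(cardinalities, key=cardinalities.get, reverse=True))
print(variant)  # -> 'ARD', the variant suggested by the decreasing-cardinality criterion
```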

Figure C.16 presents the estimated and real execution times of the fully parallel implementations<br />

of the DHCA variants. Compared to the serial case, the running time<br />

estimation PERTDHCA is not as accurate. Moreover, notice that, as already outlined in<br />

section 3.3, the execution times of the distinct DHCA variants tend to be quite similar.<br />

Last, DHCA variants become faster than flat consensus in the highest diversity scenario.<br />

C.3.2 Wine data set<br />

This section presents the estimated and real execution times of multiple variants of deterministic<br />

hierarchical consensus architectures on the Wine data collection, both in its serial<br />

and parallel versions. The high cardinality of the dimensional diversity factor of this data<br />

set gives rise to relatively large cluster ensembles in the four diversity scenarios, the sizes<br />

of which are equal to l =45, 450, 855 and 1260 in this case.<br />

Figure C.17 depicts the results of this experiment when the fully serial implementation<br />

of DHCA is considered. As already observed in section C.3.1, SERTDHCA is a pretty<br />

good estimator of SRTDHCA. Moreover, as regards the computational efficiency of DHCA<br />

variants, notice that i) those defined by the decreasing cardinality ordered list of diversity<br />

factors are the fastest, and ii) they become faster than flat consensus as the size of the<br />

cluster ensemble is increased.<br />

Meanwhile, figure C.18 presents the results corresponding to the parallel DHCA implementation.<br />

Again, the time complexities of DHCA variants tend to converge, which<br />

reinforces our hypothesis regarding the irrelevance of the way diversity factors are associated<br />

to stages in parallel scenarios. Moreover, notice that DHCA variants are faster than<br />

flat consensus in all but one of the diversity scenarios —except when the EAC consensus<br />

function is employed.<br />

C.3.3 Glass data set<br />

In this section, we present the results of estimating the execution times of the serial and parallel<br />

DHCA implementations on the Glass data collection. In the four diversity<br />

scenarios, generated by employing |dfA| = 1, 10, 19 and 28 clustering algorithms to build<br />

the cluster ensembles, these contain l = 29, 290, 551 and 812 individual partitions.<br />

For starters, the results corresponding to the serial implementation of deterministic<br />

hierarchical consensus architectures are presented in figure C.19. It can be observed that<br />

the estimation of the real execution time is pretty accurate, both in absolute terms (i.e.<br />

SERTDHCA is a good approximation of SRTDHCA) and as regards the determination of the<br />

computationally optimal consensus architecture. Furthermore, notice how the definition of<br />

DHCA variants by diversity factors arranged in decreasing cardinality order gives rise to<br />

the least time consuming configurations, which are even faster than flat consensus as the<br />

cluster ensemble size increases —again, consensus architectures based on the EAC consensus<br />

function constitute the only exception to this rule. Last, notice that when consensus clusterings are<br />

built using the MCLA consensus function, flat consensus is not executable in the highest<br />

diversity scenario.<br />

[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 2 and |dfR| = 5.]<br />

Figure C.15: Estimated and real running times (RT) of the serial DHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 2 and |dfR| = 5.]<br />

Figure C.16: Estimated and real running times (RT) of the parallel DHCA implementation on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 11 and |dfR| = 5.]<br />

Figure C.17: Estimated and real running times (RT) of the serial DHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 11 and |dfR| = 5.]<br />

Figure C.18: Estimated and real running times (RT) of the parallel DHCA implementation on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



Figure C.20 depicts the estimated and real execution times of fully parallel DHCA variants<br />

and flat consensus. The patterns presented in both columns of this figure (estimated<br />

running times on the left column, real execution times on the right) reveal the same behaviour<br />

observed on the previous data sets. That is, all DHCA variants have comparable<br />

running times, which would make running time estimations unnecessary as far as the selection<br />

of the fastest DHCA variant is concerned. However, this estimation is necessary to<br />

decide whether hierarchical consensus is faster than its flat alternative, which occurs in all<br />

the diversity scenarios but the lowest diversity one.<br />

C.3.4 Ionosphere data set<br />

In this section, the results of estimating the execution times of DHCA are compared to their<br />

real counterparts across four diversity scenarios on the Ionosphere data collection. The cluster<br />

ensemble sizes corresponding to these diversity scenarios are l =97, 970, 1843 and 2716,<br />

respectively.<br />

Firstly, figure C.21 depicts the estimated and real execution times considering the fully<br />

serial implementation of deterministic hierarchical consensus architectures. In this case,<br />

SERTDHCA is a fairly good estimator of SRTDHCA, and it constitutes a good basis for<br />

predicting the least time-consuming consensus architecture. Notice that, when consensus<br />

clusterings are built by means of the MCLA consensus function, flat consensus execution<br />

becomes impossible (given the computational resources employed in our experiments, see<br />

appendix A.6), so its hierarchical counterpart becomes a feasible alternative. Moreover, if<br />

hierarchical consensus is implemented by means of the DHCA variant defined by an ordered<br />

list of diversity factors arranged in decreasing cardinality order, notable computation time<br />

savings can be obtained.<br />

And secondly, as far as the fully parallel DHCA implementation is concerned (see figure<br />

C.22), the following observations can be made: i) PERTDHCA is a pretty accurate estimator<br />

of PRTDHCA, ii) there are no significant differences between the running times of the<br />

distinct variants of DHCA, and iii) flat consensus is more computationally costly than its<br />

hierarchical counterpart in all but one of the diversity scenarios considered.<br />

C.3.5 WDBC data set<br />

In this section, let us analyze the results corresponding to the WDBC data collection. In<br />

this case, each diversity scenario corresponds to a cluster ensemble of size l = 113, 1130, 2147<br />

and 3164, respectively. Firstly, the left and right columns of figure C.23 present the<br />

estimated and real running times of the variants of the serial implementation of the DHCA<br />

on this data set across the four diversity scenarios.<br />

It can be observed that the proposed methodology yields a pretty good estimation<br />

of the real running time of DHCA variants. This allows the user to make well-grounded<br />

decisions regarding the most efficient hierarchical consensus architectures. For this data set,<br />

flat consensus is the computationally optimal architecture except in the highest diversity<br />

scenario (unless the EAC consensus function is employed).<br />

Secondly, figure C.24 depicts the results corresponding to the parallel implementation<br />

of DHCA. The same conclusions drawn for the previous data collections are also<br />

applicable to the WDBC data set. That is, running times are almost independent of the<br />

[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 7 and |dfR| = 5.]<br />

Figure C.19: Estimated and real running times (RT) of the serial DHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 7 and |dfR| = 5.]<br />

Figure C.20: Estimated and real running times (RT) of the parallel DHCA implementation on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (SERT DHCA) and real (SRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 32 and |dfR| = 4.]<br />

Figure C.21: Estimated and real running times (RT) of the serial DHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />


[Figure: curves of estimated (PERT DHCA) and real (PRT DHCA) running times in seconds for each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA) and for flat consensus, one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD); panels correspond to |dfA| = 1, 10, 19, 28 with |dfD| = 32 and |dfR| = 4.]<br />

Figure C.22: Estimated and real running times (RT) of the parallel DHCA implementation on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />



DHCA variant implemented, at least, no significant differences are observed among them (as opposed to the serial implementation case), while hierarchical consensus is more efficient than flat consensus as soon as the cluster ensemble size grows.

C.3.6 Balance data set

This section presents the results of estimating the execution times of the fully serial and parallel implementations of DHCA in the four diversity scenarios for the Balance data set, which give rise to cluster ensembles of sizes l = 7, 70, 133 and 196, respectively.

Firstly, figure C.25 presents the estimated and real execution times corresponding to the serial implementation context. It is quite apparent that SERTDHCA provides the user with a good estimation of the real running time of consensus architectures (SRTDHCA) and, as such, it allows determining the computationally optimal consensus architecture with a high degree of accuracy. In this case, given the small size of the cluster ensemble in any of the four diversity scenarios, flat consensus is faster than most serial DHCA variants.
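Since the estimates closely track the real timings, they can be used directly to pick the cheapest architecture before any consensus process is actually run. The following Python sketch illustrates the idea; it is not the code used in this thesis, and both the variant names and the timing values are merely illustrative placeholders.

```python
# Minimal sketch: choose the consensus architecture with the smallest estimated
# running time. The estimates dictionary is a hypothetical stand-in for the
# SERT/PERT values computed for each DHCA variant and for flat consensus.

def select_optimal_architecture(estimated_rt):
    """Return the architecture name whose estimated running time is smallest."""
    return min(estimated_rt, key=estimated_rt.get)

# Made-up estimates (in seconds): with a small cluster ensemble, flat consensus wins.
estimates = {"ADR": 12.4, "ARD": 11.9, "DAR": 13.1, "DRA": 12.7,
             "RAD": 12.2, "RDA": 11.5, "flat": 3.8}
print(select_optimal_architecture(estimates))  # -> flat
```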

Secondly, the results corresponding to the parallel implementation of DHCA are depicted in figure C.26. Once more, all DHCA variants have very similar running times. As in the serial case, flat consensus is faster than any of its hierarchical counterparts, except when the HGPA and MCLA consensus functions are employed as clustering combiners.

C.3.7 MFeat data set

This section describes the results of the minimum complexity DHCA variant selection based on running time estimation. In the case of the MFeat data collection, cluster ensembles of sizes l = 6, 60, 114 and 168 correspond to the four diversity scenarios where this experiment is conducted.

Figure C.27 depicts the estimated and real execution times of the fully serial implementation of DHCA. In this case, SERTDHCA is a fairly accurate estimator of SRTDHCA and, as such, it is a good predictor of the most computationally efficient consensus architecture. In most cases, however, due to the relatively small sizes of the cluster ensembles in this data set, flat consensus is faster than any of the DHCA variants, except when the HGPA and MCLA consensus functions are employed in high diversity scenarios.

Last, figure C.28 presents the results corresponding to the parallel DHCA implementation. Again, the time complexities of the DHCA variants reach very similar values. However, notice that DHCA variants are slower than flat consensus in most of the diversity scenarios, except when the HGPA and MCLA consensus functions are used for combining the clusterings.



Figure C.23: Estimated and real running times (RT) of the serial DHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (SERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (SRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 28 and |dfR| = 5.


Figure C.24: Estimated and real running times (RT) of the parallel DHCA implementation on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (PERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (PRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 28 and |dfR| = 5.


Figure C.25: Estimated and real running times (RT) of the serial DHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (SERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (SRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 2 and |dfR| = 4.


Figure C.26: Estimated and real running times (RT) of the parallel DHCA implementation on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (PERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (PRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 2 and |dfR| = 4.


Figure C.27: Estimated and real running times (RT) of the serial DHCA implementation on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (SERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (SRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 1 and |dfR| = 6.


Figure C.28: Estimated and real running times (RT) of the parallel DHCA implementation on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c), (e) and (g) plot the estimated RT (PERT DHCA, in seconds) and panels (b), (d), (f) and (h) the real RT (PRT DHCA, in seconds) of each DHCA variant (ADR, ARD, DAR, DRA, RAD, RDA and flat) for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with |dfD| = 1 and |dfR| = 6.



C.4 Computationally optimal RHCA, DHCA and flat consensus comparison

In this section, we compare those random and deterministic hierarchical consensus architectures deemed to be computationally optimal against classic flat consensus in terms of two factors: i) their execution times, and ii) the quality of the consensus clustering solutions they yield. This twofold comparison is intended to determine under which conditions any of the aforementioned consensus architectures outperforms the others, not only in terms of computational efficiency, but also as far as their performance for the construction of robust clustering systems is concerned.

This comparison has been conducted across the following eleven unimodal data collections: Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC and PenDigits. For each data set, ten independent experiments have been conducted on four diversity scenarios. Each diversity scenario is characterized by the use of cluster ensembles generated by applying a certain number of clustering algorithms, |dfA| = {1, 10, 19, 28}. In each experiment, the CPU time required for executing either the whole hierarchical consensus architecture or flat consensus is measured, and the quality of the consensus clustering solution is evaluated in terms of its normalized mutual information φ (NMI) with respect to the ground truth.
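As a concrete illustration of this quality measure, the following Python sketch scores a consensus labeling against a ground truth partition using scikit-learn's normalized mutual information; the label vectors are invented for the example, and the library's default normalization may differ slightly from the φ (NMI) definition used in this thesis.

```python
# Minimal sketch (assumes scikit-learn): score a consensus clustering against the
# ground truth with normalized mutual information. Both label vectors are invented.
from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # reference class labels
consensus    = [1, 1, 0, 0, 0, 0, 2, 2, 2]   # labels yielded by some consensus process

phi_nmi = normalized_mutual_info_score(ground_truth, consensus)
print(round(phi_nmi, 3))  # 1.0 only when both partitions agree up to a relabeling
```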

From a visualization perspective, both the execution times and the φ (NMI) values are presented by means of their respective boxplots, each of which comprises the ten independent experiments conducted on each diversity scenario for each data collection. When comparing boxplots, notice that non-overlapping box notches indicate that the medians of the compared running times differ at the 5% significance level, which allows a quick inference of the statistical significance of the results.
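The sketch below shows how such notched boxplots can be produced with matplotlib; it is only a minimal illustration of the plotting convention, and the timing values are randomly generated rather than taken from the experiments.

```python
# Minimal sketch (assumes matplotlib and numpy): one notched box per consensus
# architecture, built from ten runs of a single diversity scenario. Values are made up.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
runs = {"RHCA": rng.normal(2.0, 0.2, 10),
        "DHCA": rng.normal(1.8, 0.2, 10),
        "flat": rng.normal(3.5, 0.3, 10)}

fig, ax = plt.subplots()
ax.boxplot(list(runs.values()), notch=True, labels=list(runs.keys()))
ax.set_ylabel("CPU time (sec.)")
plt.show()  # non-overlapping notches suggest the medians differ at the 5% level
```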

C.4.1 Iris data set

In this section, the running time and consensus quality comparison experiments are conducted on the Iris data collection. The four diversity scenarios correspond to cluster ensembles of sizes l = 9, 90, 171 and 252.

Running time comparison

Figure C.29 presents the running times of the allegedly computationally optimal RHCA and DHCA variants (considering their serial implementation) and flat consensus. Due to the relatively small cluster ensembles on this data set, it can be observed that flat consensus is the fastest option in most cases, regardless of the diversity scenario and the consensus function employed. The only exceptions occur when consensus is built using the MCLA consensus function in the two highest diversity scenarios (in these cases, DHCA turns out to be the most efficient consensus architecture), as this is the only consensus function whose time complexity grows quadratically with the size of the cluster ensemble l. Among the hierarchical consensus architectures, DHCA tends to outperform RHCA in computational terms, except when the ALSAD consensus function is employed (although the differences between RHCA and DHCA are in general minor).

The execution times corresponding to flat consensus and the entirely parallel implementation of hierarchical architectures are presented in figure C.30. It can be observed that flat consensus gradually moves from being the optimal consensus architecture in the lowest diversity scenario to being the slowest in the highest diversity one. Compared to the serial implementation, the running time differences between RHCA and DHCA are less significant in this case, except when the ALSAD consensus function is the base of the consensus architecture.

Consensus quality comparison

As regards the quality of the consensus clustering solutions yielded by each consensus architecture as a function of the consensus function employed and the diversity scenario, the results obtained are presented in figure C.31. A few observations can be made. Firstly, if the results obtained by the seven consensus functions are compared, notice that fairly different performances are obtained: for instance, HGPA gives rise to considerably poorer quality consensus than the remaining consensus functions, as none of its boxes exceeds the φ (NMI) = 0.6 level. Moreover, these relative performances are maintained across the different diversity scenarios. Secondly, the variability of the quality of the consensus clustering solutions can be evaluated by observing figure C.31(d), as the depicted boxplots correspond to ten independent runs of the consensus clustering processes on the same cluster ensemble. Notice that the major differences are observed in the HGPA, MCLA and KMSAD consensus functions, which, as aforementioned, contain some random parameters that make their performance vary (largely, as in HGPA, or slightly, as in MCLA) from run to run. Thirdly, the relative comparison of the quality of the consensus solutions yielded by the two HCA and flat consensus is local to the consensus function employed. Whereas DHCA seems to give rise to better consensus clustering solutions when the CSPA, ALSAD and KMSAD consensus functions are employed, it tends to be outperformed by RHCA and flat consensus when clusterings are combined by EAC or HGPA. Last, notice that the highest level of similarity between the top-quality cluster ensemble components and the consensus clustering solutions corresponds to DHCA based on the CSPA consensus function.

C.4.2 Wine data set

This section presents the comparison between flat consensus and the computationally optimal consensus architectures in terms of CPU execution time and normalized mutual information between the ground truth and the consensus clustering solution yielded by each one of them. On this data collection, the cluster ensemble sizes corresponding to the four diversity scenarios are l = 45, 450, 855 and 1260.

Running time comparison

As regards the execution times of the fully serial implementations of the estimated optimal RHCA and DHCA variants and flat consensus, two distinct evolution patterns can be observed (see figure C.32). Firstly, those consensus architectures using the CSPA, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions follow the same evolution pattern observed, for instance, in the Iris data collection (i.e. the larger the cluster ensemble, the more efficient hierarchical architectures become compared to flat consensus). In contrast, the consensus architectures based on the EAC consensus function present a fairly different behaviour, as flat consensus is the fastest option regardless of the diversity scenario.


Figure C.29: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.30: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.31: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the φ (NMI) boxplots of the cluster ensemble components (labelled E) and of the RHCA, DHCA and flat consensus solutions for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.



That is, as already observed in sections C.2.5 and C.3.5, the time complexity behaviour of consensus architectures is local to the consensus function employed for combining the clusterings.

As regards the computational complexity of the parallel implementation of HCA (see figure C.33), it can be observed that hierarchical architectures become much faster than flat consensus as soon as the size of the cluster ensemble is increased. As in the previous data collection, the running times of parallel RHCA and DHCA are quite similar.

Consensus quality comparison

As far as the quality of the consensus clustering solutions obtained by the distinct consensus architectures is concerned, figure C.34 depicts the corresponding φ (NMI) boxplots. Again, performances are highly local to the consensus function employed: in this case, those consensus architectures based on the EAC, HGPA and SLSAD consensus functions give rise to the lowest quality consensus clusterings. If the three consensus architectures are compared, it can be observed that RHCA and flat consensus tend to perform quite similarly, while worse clustering solutions are generally obtained from DHCA. Notice that the highest robustness to clustering indeterminacies (i.e. consensus clustering solutions of comparable quality to the cluster ensemble components of highest φ (NMI)) is obtained from the RHCA and flat consensus architectures based on MCLA, ALSAD and KMSAD.

C.4.3 Glass data set

In this section, we present the running times and quality evaluation (by means of φ (NMI) values) of the consensus clustering processes implemented by means of the serial and parallel RHCA and DHCA implementations and flat consensus on the Glass data collection. The cluster ensemble sizes corresponding to the four diversity scenarios in which our experiments are conducted are l = 29, 290, 551 and 812.

Running time comparison

Figure C.35 presents the boxplot charts that represent the running times of the three implemented consensus architectures, considering the entirely serial implementation of the hierarchical ones. As in the previous data collections, flat consensus is the fastest option in the lowest diversity scenario, whereas hierarchical consensus architectures become more computationally efficient as soon as the size of the cluster ensemble increases (for all but the EAC consensus function). This again highlights the interest of structuring consensus processes in a hierarchical manner as a means for i) reducing their time complexity when they are to be conducted on large cluster ensembles, and ii) obtaining a consensus clustering solution when the execution of flat consensus becomes unfeasible (e.g. when the MCLA consensus function is employed in the highest diversity scenario).

The computational complexity of the consensus architectures presents a very similar behaviour when the parallel implementation of the hierarchical versions is studied (see figure C.36). In this case, though, the differences between the running times of flat and hierarchical consensus architectures are even larger.



Figure C.32: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.33: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.34: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the φ (NMI) boxplots of the cluster ensemble components (labelled E) and of the RHCA, DHCA and flat consensus solutions for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


Figure C.35: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a) to (d) show, for |dfA| = 1, 10, 19 and 28 respectively, the CPU time (in seconds) of RHCA, DHCA and flat consensus for each of the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.36: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


Consensus quality comparison

As regards the quality of the consensus clustering process, figure C.37 presents the boxplots depicting the φ(NMI) values of the components of the cluster ensemble E and of the consensus clustering solutions output by the RHCA, DHCA and flat consensus architectures. On this data collection, the φ(NMI) differences between the clustering solutions output by the three consensus architectures are, in general terms, small, except when the EAC consensus function is employed, in which case flat consensus is clearly superior. Moreover, ALSAD and SLSAD stand out as the top performers among the remaining consensus functions.
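The per-architecture comparison summarized in these boxplots amounts to computing φ(NMI) between each consensus partition and the ground truth and grouping the scores by architecture. The following is a minimal sketch of that evaluation, assuming the partitions are available as integer label arrays; the variable names and the toy data are hypothetical.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi_scores(partitions, ground_truth):
    """phi(NMI) of each partition (array of cluster labels) w.r.t. the ground truth."""
    return np.array([normalized_mutual_info_score(ground_truth, p) for p in partitions])

# Hypothetical example: 10 independent runs per consensus architecture on a 6-object toy problem.
rng = np.random.default_rng(0)
ground_truth = np.array([0, 0, 1, 1, 2, 2])
results = {
    "RHCA": [rng.integers(0, 3, size=6) for _ in range(10)],
    "DHCA": [rng.integers(0, 3, size=6) for _ in range(10)],
    "flat": [rng.integers(0, 3, size=6) for _ in range(10)],
}

for arch, partitions in results.items():
    scores = nmi_scores(partitions, ground_truth)
    # The median and interquartile range are what the phi(NMI) boxplots in this appendix summarize.
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{arch}: median={med:.3f}  IQR=[{q1:.3f}, {q3:.3f}]")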

C.4.4 Ionosphere data set

This section presents the execution times of the computationally optimal RHCA, DHCA and flat consensus architectures and the φ(NMI) values of the consensus clustering solutions they yield on the Ionosphere data collection. The results correspond to experiments conducted across the four diversity scenarios, whose cluster ensemble sizes are l = 97, 970, 1843 and 2716, respectively.
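The quoted ensemble sizes follow a simple product rule: on this data collection each unit of |dfA| contributes 97 partitions, so l = 97·|dfA| for |dfA| = 1, 10, 19 and 28, and the same pattern, with a different per-dataset base, is consistent with the figures quoted for the WDBC and Balance collections described below. The sketch below makes that arithmetic explicit; the rule is an observation that matches the quoted values, not a definition taken from this appendix.

base_partitions = {"Ionosphere": 97, "WDBC": 113, "Balance": 7}   # partitions per unit of |dfA|, from the text
diversity_scenarios = (1, 10, 19, 28)                             # the four values of |dfA|

for dataset, base in base_partitions.items():
    sizes = [base * dfa for dfa in diversity_scenarios]
    print(dataset, sizes)
# Ionosphere -> [97, 970, 1843, 2716]; WDBC -> [113, 1130, 2147, 3164]; Balance -> [7, 70, 133, 196]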

Running time comparison

The execution times of flat consensus and of the serially implemented RHCA and DHCA are depicted in the boxplot charts of figure C.38. The relative behaviour of the three consensus architectures is very similar to the one observed on the previous data collections: i) flat consensus becomes slower than its hierarchical counterparts as the size of the cluster ensemble grows, except when the EAC consensus function is employed, and ii) RHCA tends to be faster than DHCA when the EAC, ALSAD and SLSAD consensus functions are used, whereas the opposite behaviour is observed for the hypergraph-based consensus functions.

The running times obtained when the hierarchical consensus architectures are implemented in an entirely parallel manner are presented in figure C.39. As expected, RHCA and DHCA become noticeably more efficient than flat consensus. Notice, however, that notable differences between the running times of the two hierarchical architectures can be found under certain consensus functions, such as EAC, ALSAD or SLSAD.
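The serial and parallel figures differ only in whether the independent mini-consensus processes of a hierarchical stage are executed one after another or dispatched concurrently. The sketch below illustrates that measurement setup with a deliberately toy combiner (a per-object plurality vote over already aligned labelings); it is not one of the consensus functions evaluated here, and the stage decomposition is assumed for illustration rather than taken from the thesis.

import time
import numpy as np
from multiprocessing import Pool

def toy_combiner(mini_ensemble):
    """Toy stand-in for a consensus function: per-object plurality vote over aligned labelings."""
    stacked = np.stack(mini_ensemble)  # shape: (partitions, objects)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, stacked)

def run_stage(mini_ensembles, parallel):
    """One hierarchical stage: combine each mini-ensemble, serially or with a worker pool."""
    start = time.perf_counter()
    if parallel:
        with Pool() as pool:
            outputs = pool.map(toy_combiner, mini_ensembles)
    else:
        outputs = [toy_combiner(m) for m in mini_ensembles]
    return outputs, time.perf_counter() - start

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical stage: 40 mini-ensembles of 5 partitions each over 2000 objects.
    stage = [[rng.integers(0, 4, size=2000) for _ in range(5)] for _ in range(40)]
    _, t_serial = run_stage(stage, parallel=False)
    _, t_parallel = run_stage(stage, parallel=True)
    print(f"serial: {t_serial:.2f}s   parallel: {t_parallel:.2f}s")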

Consensus quality comparison

As far as the quality of the consensus clustering process is concerned, the φ(NMI) boxplots corresponding to the consensus clustering solutions obtained by the RHCA, DHCA and flat consensus architectures across the four diversity scenarios on the Ionosphere data collection are presented in figure C.40. A notably high variability in the relative optimality of the consensus architectures is found depending on the consensus function employed. For instance, DHCA tends to yield the highest quality results when consensus is conducted by means of SLSAD, flat consensus gives the best consensus clustering solutions when they are derived by HGPA, and when MCLA is chosen as the clustering combiner, RHCA attains higher φ(NMI) values than the remaining consensus architectures. In contrast, only marginal differences are observed between the qualities of the consensus clustering solutions derived by the three consensus architectures when the remaining consensus functions are employed.
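A compact way to read boxplot grids such as figure C.40 is to reduce each architecture's score distribution to its median and select, per consensus function, the architecture with the highest one. A minimal sketch follows; the numbers are hypothetical placeholders, not results from this appendix.

import numpy as np

# Hypothetical phi(NMI) samples per consensus function and architecture.
scores = {
    "SLSAD": {"RHCA": [0.31, 0.35], "DHCA": [0.42, 0.44], "flat": [0.30, 0.33]},
    "HGPA":  {"RHCA": [0.10, 0.12], "DHCA": [0.11, 0.13], "flat": [0.25, 0.27]},
    "MCLA":  {"RHCA": [0.40, 0.43], "DHCA": [0.36, 0.38], "flat": [0.35, 0.37]},
}

for function, by_arch in scores.items():
    best = max(by_arch, key=lambda arch: np.median(by_arch[arch]))
    print(f"{function}: best architecture by median phi(NMI) -> {best}")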



[Boxplot panels omitted; same layout as figure C.34.]

Figure C.37: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.38: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.39: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.34.]

Figure C.40: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


C.4.5 WDBC data set

In this section, we present the running times and the qualities of the consensus clustering solutions obtained by the hierarchical and flat consensus architectures on the WDBC data collection. In this case, the four diversity scenarios correspond to cluster ensembles of sizes l = 113, 1130, 2147 and 3164, respectively.

Running time comparison

Figure C.41 presents the running times of flat consensus and of the serial implementations of the fastest RHCA and DHCA variants across the four diversity scenarios. In this case, the relationship between the execution times of these consensus architectures differs somewhat from what has been observed on the previous data collections. In particular, flat consensus is a more competitive alternative, being faster than or almost as fast as RHCA in all diversity scenarios when consensus is based on the CSPA, EAC, ALSAD and SLSAD clustering combiners. In contrast, DHCA is notably slower than RHCA in most cases. This is due to the large cardinality of the dimensional diversity factor (|dfD| = 28 on this data set), which makes the DHCA stage where consensus is conducted along this diversity factor much more computationally costly than the intermediate consensus processes of RHCA.

In figure C.42, the execution times of the computationally optimal parallel RHCA and DHCA variants and of flat consensus are presented. Two trends are observed in these boxplots: firstly, the hierarchical architectures are faster than flat consensus, especially in the diversity scenarios where large cluster ensembles are employed; and secondly, the parallel DHCA variants are in general slower than their RHCA counterparts, for the reason stated above.
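The penalty attributed to the |dfD| = 28 stage can be illustrated with a back-of-the-envelope cost model: if the cost of a single consensus process grows super-linearly with the number of partitions it merges, combining 28 partitions in one shot is considerably more expensive than consolidating them through several small merges. The sketch below assumes, purely for illustration, a quadratic per-process cost and pairwise RHCA merges; neither assumption is taken from the thesis.

def process_cost(num_partitions, alpha=1.0):
    # Assumed cost model: quadratic in the number of partitions combined in one consensus process.
    return alpha * num_partitions ** 2

def pairwise_tree_cost(num_partitions):
    # Cost of consolidating the same partitions through a binary tree of pairwise merges.
    total, remaining = 0.0, num_partitions
    while remaining > 1:
        merges = remaining // 2
        total += merges * process_cost(2)
        remaining = merges + remaining % 2
    return total

print(process_cost(28))        # single 28-wide stage (DHCA-style): 784.0
print(pairwise_tree_cost(28))  # 27 pairwise merges (RHCA-style):   108.0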

Consensus quality comparison

Figure C.43 presents the φ(NMI) of the consensus clustering solutions yielded by the RHCA, DHCA and flat consensus architectures across the four diversity scenarios on the WDBC data collection. Firstly, notice that the EAC and SLSAD consensus functions give rise to very low-quality consensus clusterings regardless of the consensus architecture employed. In contrast, flat consensus yields reasonably good consensus clusterings when they are derived by means of HGPA, whereas the hierarchical consensus architectures based on this consensus function output poor consensus clustering solutions. Meanwhile, the remaining clustering combiners yield fairly good consensus clusterings, with slightly better results obtained when consensus is derived by means of the RHCA and flat consensus architectures.

C.4.6 Balance data set

This section presents the execution times of the estimated computationally optimal serial and parallel implementations of RHCA and DHCA and of flat consensus in the four diversity scenarios for the Balance data set, which give rise to cluster ensembles of sizes l = 7, 70, 133 and 196, respectively. Moreover, the quality of the consensus clustering solutions output by each consensus architecture is evaluated in terms of the normalized mutual information (φ(NMI)) with respect to the ground truth.



[Boxplot panels omitted; same layout as figure C.35.]

Figure C.41: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.42: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.34.]

Figure C.43: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


Running time comparison

The characteristics of this data set, in particular the low cardinalities of the diversity factors associated with it, make flat consensus the fastest consensus architecture when compared to the serial implementations of RHCA and DHCA, regardless of the diversity scenario (see figure C.44). The only exception is the MCLA consensus function, whose time complexity scales quadratically with the size of the ensemble, which penalizes one-step consensus processes compared to their hierarchical counterparts.
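The MCLA exception follows directly from that quadratic scaling: for a combiner whose cost is roughly linear in the ensemble size l, a single flat run over a small ensemble is hard to beat, but a quadratic combiner can already be cheaper to apply over several small mini-ensembles plus a final stage. The sketch below contrasts the two regimes for the Balance ensemble sizes under assumed cost models; the constants and the two-level decomposition are illustrative, not measurements.

def flat_cost(l, cost):
    return cost(l)

def two_level_cost(l, cost, group_size=7):
    # Two-level hierarchy: split the ensemble into groups, combine each, then combine the outputs.
    groups = -(-l // group_size)  # ceiling division
    return groups * cost(group_size) + cost(groups)

linear = lambda l: l            # e.g. a combiner whose cost grows linearly with l
quadratic = lambda l: l ** 2    # e.g. an MCLA-like combiner

for l in (7, 70, 133, 196):     # Balance ensemble sizes for |dfA| = 1, 10, 19, 28
    print(l,
          flat_cost(l, linear), two_level_cost(l, linear),
          flat_cost(l, quadratic), two_level_cost(l, quadratic))
# Under the linear model flat consensus is never worse, whereas under the quadratic model the
# hierarchical decomposition wins as soon as the ensemble grows beyond a single group.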

A somewhat milder version of this same behaviour is observed in the running time analysis of the parallel implementations of the hierarchical consensus architectures, presented in figure C.45. In this case, though, RHCA and DHCA are as fast as or faster than flat consensus when the HGPA, MCLA and KMSAD consensus functions are employed.

Consensus quality comparison

As regards the quality of the consensus clustering solutions output by the three consensus architectures, figure C.46 shows the results obtained on the Balance data collection. Notice that the EAC and HGPA consensus functions yield, in general, the lowest quality results. For the remaining consensus functions, consensus solutions of fairly similar quality are obtained by means of the three architectures, except for the ALSAD and SLSAD consensus functions, where notable differences are observed between the hierarchical architectures and flat consensus.



[Boxplot panels omitted; same layout as figure C.35.]

Figure C.44: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.35.]

Figure C.45: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Boxplot panels omitted; same layout as figure C.34.]

Figure C.46: φ(NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.



C.4.7 MFeat data set

This section describes the performance of the minimum complexity serial and parallel RHCA and DHCA variants on the MFeat data collection, which are compared to classic flat consensus in terms of the time required for their execution and of the quality of the consensus clustering solutions they yield. Cluster ensembles of sizes l = 6, 60, 114 and 168 correspond to the four diversity scenarios in which these experiments are conducted.

Running time comparison

Figure C.47 presents the running times of flat consensus and of the serial implementations of RHCA and DHCA. Notice that, except when the HGPA and MCLA consensus functions are employed, flat consensus is faster than any of its hierarchical counterparts regardless of the size of the cluster ensemble (i.e. it is faster in all the diversity scenarios).

When the parallel implementation of the HCA is considered (see figure C.48), the observed behaviour is very similar to the one just reported. That is, flat consensus is the most computationally efficient consensus architecture, except when consensus functions based on hypergraph partitioning are employed. This is due to the fact that, on the MFeat data collection, the low cardinality of the diversity factors gives rise to relatively small cluster ensembles, which makes flat consensus a competitive alternative to hierarchical consensus architectures.

Consensus quality comparison

Figure C.49 presents the quality of the consensus clustering solutions yielded by the flat and hierarchical consensus architectures in the form of φ (NMI) boxplot diagrams. An inter-consensus function analysis reveals that EAC, HGPA and SLSAD yield, in general terms, the lowest quality results, while CSPA, ALSAD and KMSAD stand out as the best performing consensus functions, as they yield consensus clustering solutions whose quality is comparable to that of the cluster ensemble components that best reveal the true cluster structure of the data set (i.e. those attaining the highest φ (NMI) values). Meanwhile, an intra-consensus function study shows that, whereas the three consensus architectures yield consensus solutions of fairly similar quality when based on CSPA and ALSAD, larger differences between RHCA, DHCA and flat consensus are observed in other cases, such as when consensus clustering is conducted by means of the EAC, HGPA, MCLA or SLSAD consensus functions.

C.4.8 miniNG data set

In this section, we present the running times and quality evaluation (by means of φ (NMI) values) of the consensus clustering processes implemented by the serial and parallel RHCA and DHCA variants and by flat consensus on the miniNG data collection. The cluster ensemble sizes corresponding to the four diversity scenarios in which our experiments are conducted are l = 73, 730, 1387 and 2044.


[Figure C.47: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.48: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.49: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the MFeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.46.]



Running time comparison

The miniNG data collection is one of those cases where the cardinality of the diversity factors employed for generating the cluster ensemble, together with the number of objects it contains, makes flat consensus non-executable (for all but the EAC consensus function) in those scenarios where the cluster ensemble size is relatively large. In this situation, hierarchical consensus architectures become a means of making consensus clustering feasible.
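To make the idea concrete, the following minimal Python sketch illustrates the general principle only, not the actual RHCA or DHCA procedure used in this thesis: the ensemble is split into small mini-ensembles, a consensus function is applied to each of them, and the resulting partitions form a smaller ensemble that feeds the next stage. The toy_consensus function (majority voting after a naive greedy label alignment) is a hypothetical stand-in for any of the consensus functions employed in the experiments.

    # Illustrative sketch (not the thesis' RHCA/DHCA implementation): combine a
    # large cluster ensemble in stages, s partitions at a time, so that each call
    # to the consensus function only sees a small mini-ensemble.
    import numpy as np

    def align_labels(reference, labels):
        """Greedy relabeling of `labels` so that its clusters match `reference`
        (a crude stand-in for a proper Hungarian assignment)."""
        aligned = np.empty_like(labels)
        for c in np.unique(labels):
            mask = labels == c
            # map cluster c to the reference label it overlaps with the most
            aligned[mask] = np.bincount(reference[mask]).argmax()
        return aligned

    def toy_consensus(mini_ensemble):
        """Majority vote after aligning every partition to the first one; acts
        as a placeholder for any consensus function (CSPA, EAC, MCLA, ...)."""
        ref = mini_ensemble[0]
        aligned = np.vstack([align_labels(ref, p) for p in mini_ensemble])
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, aligned)

    def hierarchical_consensus(ensemble, s=5):
        """Apply toy_consensus to groups of at most s partitions per stage until
        a single consensus partition remains."""
        current = list(ensemble)
        while len(current) > 1:
            current = [toy_consensus(current[i:i + s])
                       for i in range(0, len(current), s)]
        return current[0]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        truth = rng.integers(0, 3, size=200)            # 200 objects, 3 clusters
        # ensemble of 30 noisy copies of the true partition
        ensemble = [np.where(rng.random(200) < 0.2,
                             rng.integers(0, 3, 200), truth) for _ in range(30)]
        print(hierarchical_consensus(ensemble, s=5)[:20])

The memory footprint of each call only depends on the mini-ensemble size s, which is why this kind of staged combination remains executable where a single flat consensus over the whole ensemble is not.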

As regards the serial implementations of RHCA and DHCA –figure C.50–, the former tends to be faster than the latter, except when the HGPA and MCLA consensus functions are employed. The same relative behaviour between the two consensus architectures is also observed in the parallel implementation case, presented in figure C.51.

Consensus quality comparison

The quality of the consensus clustering solutions output by the flat and hierarchical consensus architectures can be analyzed on the basis of the φ (NMI) boxplot charts depicted in figure C.52. A single remark as regards the performance of the distinct consensus functions: notice that the CSPA-, ALSAD- and KMSAD-based consensus solutions are the best ones in quality terms. Lastly, the φ (NMI) values of the consensus clusterings output by the two hierarchical consensus architectures –the only ones able to operate across all the diversity scenarios– are fairly similar in most cases.

C.4.9 Segmentation data set

This section presents the comparison between flat consensus and the computationally optimal consensus architectures in terms of CPU execution time and of the normalized mutual information between the ground truth and the consensus clustering solution yielded by each of them (a minimal sketch of how such a φ (NMI) score can be computed is given below). On the Segmentation data collection, the cluster ensemble sizes corresponding to the four diversity scenarios are l = 52, 520, 988 and 1456.
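As a reference, the sketch below shows one standard way of computing such a normalized mutual information score between a consensus labeling and the ground truth. It assumes the geometric-mean normalization of Strehl and Ghosh's φ (NMI); the implementation actually used in this thesis may differ in minor details (log base, handling of degenerate partitions).

    # Minimal sketch of a phi(NMI) computation between two labelings, assuming
    # normalization by the geometric mean of the two entropies.
    import numpy as np

    def nmi(labels_a, labels_b):
        labels_a = np.asarray(labels_a)
        labels_b = np.asarray(labels_b)
        n = labels_a.size
        # contingency table between the two partitions
        a_vals, a_idx = np.unique(labels_a, return_inverse=True)
        b_vals, b_idx = np.unique(labels_b, return_inverse=True)
        cont = np.zeros((a_vals.size, b_vals.size))
        np.add.at(cont, (a_idx, b_idx), 1)
        pxy = cont / n
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
        hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
        hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
        return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

    # identical groupings (up to cluster relabeling) score ~1, unrelated ones ~0
    print(nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # ~1.0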

Running time comparison

Figure C.53 presents the execution times of the flat consensus architecture and the estimated computationally optimal serial random and deterministic hierarchical consensus architectures. In this case, flat consensus is faster than RHCA and DHCA regardless of the cluster ensemble size (in our range of observation), except when the HGPA and MCLA consensus functions are employed —in fact, MCLA-based flat consensus is unfeasible in the two largest diversity scenarios. Moreover, the relative speed comparison between RHCA and DHCA yields different results depending on the consensus function employed: RHCA is faster than DHCA if consensus is based on CSPA, EAC, ALSAD or SLSAD, while the opposite behaviour is observed when the HGPA, MCLA and KMSAD consensus functions are used.

Fairly similar results are obtained when the running times of the fully parallel implementations of RHCA and DHCA are analyzed, as figure C.54 reveals. The main difference with respect to what has just been reported is the logical speed-up of the HCAs, which makes them faster than flat consensus in the highest diversity scenario.
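This speed-up can be understood with a small back-of-the-envelope estimate. Assuming, as an idealization, that all the intermediate consensus processes of a given stage run concurrently while the stages themselves execute one after another, the serial running time adds up every individual consensus run, whereas the parallel running time only adds the slowest run of each stage. The following hypothetical sketch (invented timings, not measurements from these experiments) illustrates the difference.

    # Hedged sketch: idealized estimate of serial vs. parallel running times of a
    # staged hierarchical consensus architecture from per-run timings.
    # stage_times[s][j] is the CPU time of the j-th consensus run of stage s.

    def serial_time(stage_times):
        # every consensus run executes back to back
        return sum(sum(stage) for stage in stage_times)

    def parallel_time(stage_times):
        # stages are sequential, but all runs within a stage overlap completely
        return sum(max(stage) for stage in stage_times)

    # hypothetical example: 3 stages with 8, 2 and 1 consensus runs respectively
    stage_times = [[0.4] * 8, [0.5] * 2, [0.6]]
    print(round(serial_time(stage_times), 2))    # 4.8
    print(round(parallel_time(stage_times), 2))  # 1.5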


[Figure C.50: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the miniNG data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.51: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the miniNG data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.52: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the miniNG data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.46.]


[Figure C.53: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]


[Figure C.54: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.45.]



Consensus quality comparison

The φ (NMI) values of the consensus clustering solutions yielded by the flat, random hierarchical and deterministic hierarchical consensus architectures –see figure C.55– follow a pattern that is quite similar to what has been observed on the previous data collections, at least as far as the performance of the distinct consensus functions is concerned. That is, the lowest quality consensus solutions are obtained by means of the EAC and HGPA consensus functions, whereas ALSAD tends to yield the best results.

C.4.10 BBC data set

In this section, the running time and consensus quality comparison experiments are conducted on the BBC data collection. The four diversity scenarios correspond to cluster ensembles of sizes l = 57, 570, 1083 and 1596.

Running time comparison

As far as the running times of the entirely serial implementation of RHCA and DHCA and of flat consensus are concerned, the boxplots depicted in figure C.56 show that flat consensus constitutes the most computationally competitive consensus architecture in most cases —in fact, the only exceptions occur when the HGPA and MCLA consensus functions are employed.

When the parallel implementations of the hierarchical consensus architectures are considered, they become more competitive (in computational terms), reversing the situation observed in the serial case for the CSPA and KMSAD consensus functions —see figure C.57.

Consensus quality comparison

As regards the quality of the consensus clustering solutions yielded by the three consensus architectures (measured as the φ (NMI) with respect to the ground truth that defines the true group structure of the BBC data collection), we can observe great differences between the performance of the distinct consensus functions –see figure C.58–: while the MCLA, ALSAD and KMSAD clustering combiners tend to yield consensus clusterings of quality comparable to the best components of the cluster ensemble, the clustering solutions output by consensus architectures based on EAC, HGPA and SLSAD are notably poorer.

C.4.11 PenDigits data set

This section presents the execution times of the computationally optimal RHCA, DHCA and flat consensus architectures and the φ (NMI) values of the consensus clustering solutions yielded by them on the PenDigits data collection. The presented results correspond to experiments conducted across four diversity scenarios, whose cluster ensemble sizes are l = 57, 570, 1083 and 1596, respectively. Due to the number of objects contained in this data set, only the HGPA and MCLA consensus functions are executable on it, as they are the only ones the space complexity of which scales linearly with this


[Figure C.55: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Boxplot figure with the same layout as figure C.46.]


[Figure C.56 here. Panels (a)-(d): serial implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, one subplot per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]

Figure C.56: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.



[Figure C.57 here. Panels (a)-(d): parallel implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, one subplot per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]

Figure C.57: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.



[Figure C.58 here. Panels (a)-(d): φ (NMI) of the consensus solutions for |dfA| = 1, 10, 19 and 28, comparing the cluster ensemble E, RHCA, DHCA and flat consensus for each consensus function.]

Figure C.58: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.




attribute. Moreover, flat consensus is unfeasible even when based on the aforementioned consensus functions.

Running time comparison

Figure C.59 shows the running times corresponding to the serial implementation of RHCA and DHCA. It can be observed that, as the cluster ensemble size increases, RHCA becomes faster than DHCA. This is because the ensemble growth is driven by a larger cardinality of the algorithmic diversity factor |dfA|, which directly increases the time complexity of a single DHCA stage, whereas its impact is spread across the several stages of a RHCA.

Approximately the same behaviour is observed when the parallel implementation of both hierarchical consensus architectures is analyzed (see figure C.60).

Consensus quality comparison

Figure C.61 presents the φ (NMI) corresponding to the two aforementioned consensus architectures. There are a couple of issues worth noting in this case. Firstly, notice that HGPA yields very poor consensus clustering solutions on this data collection (recall that this fact has also been reported on several of the previous data sets). Secondly, it is important to highlight the notable differences between the quality of the consensus clusterings output by RHCA and DHCA, as the latter consensus architecture tends to yield far better consensus clustering solutions than the former, a trend that becomes more evident in high diversity scenarios.



[Figure C.59 here. Panels (a)-(d): serial implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, for the HGPA and MCLA consensus functions.]

Figure C.59: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Figure C.60 here. Panels (a)-(d): parallel implementation running time (CPU time, sec.) for |dfA| = 1, 10, 19 and 28, for the HGPA and MCLA consensus functions.]

Figure C.60: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


[Figure C.61 here. Panels (a)-(d): φ (NMI) of the consensus solutions for |dfA| = 1, 10, 19 and 28, comparing the cluster ensemble E, RHCA, DHCA and flat consensus for the HGPA and MCLA consensus functions.]

Figure C.61: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.


Appendix D

Experiments on self-refining consensus architectures

This appendix presents several experiments regarding the self-refining flat and hierarchical consensus architectures described in chapter 4. In appendix D.1, the proposal for automatically refining a previously derived consensus clustering solution (what is called consensus-based self-refining) is experimentally evaluated, whereas appendix D.2 presents the experiments regarding the creation of a refined consensus clustering solution upon the selection of a high quality cluster ensemble component, i.e. selection-based self-refining.

In both cases, the experiments are conducted on eleven unimodal data collections, namely: Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC and PenDigits. The results of the self-refining experiments are displayed by means of boxplot charts showing the normalized mutual information (φ (NMI) ) with respect to the ground truth of each data set, compiled across 100 independent experiment runs, of i) the cluster ensemble E each experiment is conducted upon, ii) the clustering solution employed as the reference of the self-refining procedure, and iii) the self-refined consensus clustering solutions λc_pi obtained upon the select cluster ensembles E_pi created by the selection of a percentage pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90} of the whole ensemble E. As in all the experimental sections of this thesis, consensus processes have been replicated using the set of seven consensus functions described in appendix A.5, namely: CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD.
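For reference, the fragment below shows how the per-run φ (NMI) values that feed a single boxplot entry could be computed. It is a minimal sketch assuming Python and scikit-learn; normalized_mutual_info_score applies an arithmetic normalization that may differ from the NMI variant used in this thesis, and all data in the example are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def phi_nmi(labels_pred, labels_true):
    """phi(NMI) of a clustering with respect to the ground truth labels."""
    return normalized_mutual_info_score(labels_true, labels_pred)

# Toy illustration of compiling the 100 per-run scores behind one boxplot.
rng = np.random.default_rng(0)
ground_truth = rng.integers(0, 3, size=150)                  # synthetic 3-class labelling
runs = [rng.integers(0, 3, size=150) for _ in range(100)]    # 100 independent clusterings
scores = np.array([phi_nmi(labels, ground_truth) for labels in runs])
print(scores.min(), np.median(scores), scores.max())         # summary of the boxplot spread
```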

D.1 Experiments on consensus-based self-refining

In this section, the results of applying the consensus-based self-refining procedure described in section 4.1 on the aforementioned eleven data sets are presented. The self-refining process is intended to improve the quality of a consensus clustering solution λc output by a flat, a random (RHCA) and a deterministic hierarchical consensus architecture (DHCA).
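The fragment below sketches the overall flow of one such refining run. It is only an illustration under the assumption that each select cluster ensemble E_pi groups the pi% of ensemble components that are most similar, in NMI terms, to the reference consensus λc; the exact selection criterion is the one defined in section 4.1, and consensus_fn stands for any of the consensus functions of appendix A.5.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def consensus_based_self_refining(ensemble, lambda_c, consensus_fn,
                                  percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90)):
    """Build the select ensembles E_pi and derive one refined consensus per percentage."""
    # Rank the ensemble components by their similarity to the reference consensus.
    similarities = np.array([nmi(lambda_c, component) for component in ensemble])
    order = np.argsort(similarities)[::-1]        # most similar components first
    refined = {}
    for p in percentages:
        n_selected = max(1, int(round(len(ensemble) * p / 100.0)))
        selected = [ensemble[i] for i in order[:n_selected]]
        refined[p] = consensus_fn(selected)       # lambda_c_pi for this percentage
    return refined
```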

For this reason, the results are displayed as a matrix of boxplot charts with three columns (the leftmost one for flat consensus, the central one presenting the results of RHCA, and DHCA on the right), and as many rows as consensus functions are employed on each particular data collection (seven in all cases except for the PenDigits data set, where only two consensus functions are applicable due to memory limitations given our computational resources; see appendix A.6).
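As a purely illustrative aid, the snippet below shows one way such a matrix of boxplot charts could be assembled with matplotlib; results, consensus_functions and candidate_labels are hypothetical names introduced for this example only.

```python
import matplotlib.pyplot as plt

def plot_boxplot_matrix(results, consensus_functions, candidate_labels):
    """results[f][a] holds, for consensus function f and architecture a, one array of
    per-run phi(NMI) values per candidate (E, lambda_c, lambda_c_p2, ...)."""
    architectures = ("flat", "RHCA", "DHCA")
    fig, axes = plt.subplots(len(consensus_functions), len(architectures),
                             figsize=(12, 2.0 * len(consensus_functions)), squeeze=False)
    for i, function in enumerate(consensus_functions):
        for j, architecture in enumerate(architectures):
            ax = axes[i][j]
            ax.boxplot(results[function][architecture], labels=candidate_labels)
            ax.set_ylim(0.0, 1.0)                  # phi(NMI) lies in [0, 1]
            ax.set_title(f"{function} - {architecture}")
            if j == 0:
                ax.set_ylabel("phi (NMI)")
    fig.tight_layout()
    return fig
```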

Moreover, the clustering solution deemed optimal by the supraconsensus function described in section 4.1 in a majority of experiment runs is highlighted by means of a vertical green dashed line, so that its performance can be qualitatively evaluated at a glance.
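For illustration, the snippet below implements one plausible supraconsensus criterion, namely selecting the candidate clustering that maximizes the average NMI with respect to the ensemble components (an ANMI-style rule in the spirit of Strehl and Ghosh); whether this coincides with the exact function of section 4.1 is an assumption of the example.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def supraconsensus(candidates, ensemble):
    """Return the candidate clustering with the highest average NMI
    against all the components of the cluster ensemble."""
    def average_nmi(labels):
        return float(np.mean([nmi(labels, component) for component in ensemble]))
    scores = [average_nmi(candidate) for candidate in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores
```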

D.1.1 Iris data set

Figure D.1 presents the results of the self-refining consensus procedure applied on the Iris data set. As regards the results obtained using the CSPA consensus function, we can see that self-refining introduces no variations with respect to the quality of the non-refined consensus clustering solution λc in the case of the flat and RHCA consensus architectures. In contrast, slight but noticeable φ (NMI) gains are obtained in the case of DHCA with the refined clustering solutions λc_20 and λc_40. Unfortunately, the supraconsensus function fails to select one of the highest quality clustering solutions in this case. A very similar situation is observed in the self-refining experiments based on the EAC, ALSAD and SLSAD consensus functions.

Examples in which self-refining and supraconsensus perform successfully are the ones regarding both hierarchical consensus architectures using the HGPA consensus function. In these cases, self-refined consensus clustering solutions of higher quality than that of λc are obtained and selected by the supraconsensus function. In contrast, in the experiments based on MCLA, little (if any) φ (NMI) gain is obtained via refining, and supraconsensus tends to select a clustering solution of slightly lower quality than λc.

D.1.2 Wine data set

The results corresponding to the application of the consensus-based self-refining procedure on the Wine data set are depicted in figure D.2. As regards the refining of the consensus solution output by the flat consensus architecture (leftmost column of figure D.2), we can see that quality improvements with respect to the initial consensus clustering λc are obtained in all cases, except when the HGPA consensus function is employed. In some cases, supraconsensus manages to select the highest quality clustering, such as when consensus is based on MCLA and SLSAD, whereas suboptimal solutions are selected in other cases (see, for instance, the CSPA, EAC, HGPA and ALSAD boxplots).

We would like to highlight the especially good results obtained on the self-refining of the consensus output by the DHCA architecture (see the rightmost column of figure D.2). Regardless of the consensus function employed, the self-refining procedure gives rise to higher quality clustering solutions, and the supraconsensus function selects the top quality one in most cases.

D.1.3 Glass data set

Figure D.3 presents the results of the consensus self-refining process when applied on the Glass data collection. In this case, self-refining yields little φ (NMI) gain for most consensus functions. The clearest exception is EAC, where notable quality increases are observed, especially when self-refining is applied on the consensus solutions output by the hierarchical consensus architectures (RHCA and DHCA).



[Figure D.1 here. Panels: (a) flat, (b) RHCA, (c) DHCA; one boxplot row per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), with the candidates E, λc and λc_2 to λc_90 on the horizontal axis and φ (NMI) on the vertical axis.]

Figure D.1: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Iris data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.2 here; same layout as figure D.1.]

Figure D.2: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Wine data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.




As regards the performance of the supraconsensus function, the generally small quality differences between the non-refined and the self-refined consensus clustering solutions make its lack of precision relatively unimportant in most cases. Again, the only exceptions to this behaviour occur in the refining of the consensus clustering solutions output by RHCA and DHCA when EAC is employed. In these cases, the supraconsensus function erroneously selects the non-refined consensus clustering solution λc as the final clustering solution, although higher quality self-refined partitions are available.

D.1.4 Ionosphere data set

The application of the consensus-based self-refining procedure on the Ionosphere data collection yields the φ (NMI) boxplots presented in figure D.4. On this collection, self-refining introduces quality gains in a few cases, such as the refining of the consensus clusterings output by i) RHCA and DHCA using HGPA, or ii) the flat consensus architecture and RHCA based on the SLSAD consensus function. In the remaining cases, the self-refining procedure brings about little (if any) quality gain.

As regards the selection accuracy of the supraconsensus function, it consistently selects a good quality clustering solution, if not the highest quality one.

D.1.5 WDBC data set

Figure D.5 presents the φ (NMI) boxplots corresponding to the application of the consensus-based self-refining procedure on the WDBC data set.

Fairly distinct results are obtained depending on the consensus function employed. For instance, when consensus is based on EAC and SLSAD, self-refining brings about no improvement. In contrast, spectacular quality gains are obtained on the hierarchically derived consensus clusterings that employ HGPA. In the remaining cases, more modest φ (NMI) increases are observed.

Last, notice that the supraconsensus function performs fairly accurately, as it selects good quality clustering solutions in most cases, although it rarely chooses the top quality one.

D.1.6 Balance data set

The application of the self-refining consensus procedure on the Balance data set yields the results summarized by the boxplots presented in figure D.6. It can be observed that, for most consensus functions and consensus architectures, the self-refined consensus clusterings show higher φ (NMI) values than those of their non-refined counterpart, λc. In some cases, these quality gains are notable, as, for instance, when consensus self-refining is based on the SLSAD consensus function on the flat consensus architecture (bottom row of figure D.6). In other cases, as in MCLA-based self-refining, the achieved φ (NMI) increases are more modest.

As regards the ability of the supraconsensus function to select the top quality (non-refined or refined) clustering solution, it can be observed that this rarely occurs, which explains the low selection accuracy percentage reported in section 4.2.2.



[Figure D.3 here; same layout as figure D.1.]

Figure D.3: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Glass data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.4 here; same layout as figure D.1.]

Figure D.4: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Ionosphere data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.5 here; same layout as figure D.1.]

Figure D.5: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the WDBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.6 here; same layout as figure D.1.]

Figure D.6: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and DHCA consensus architectures on the Balance data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



D.1. Experiments on consensus-based self-refining<br />

D.1.7 MFeat data set

Figure D.7 presents the boxplots of the clusterings resulting from running the consensus self-refining procedure on the MFeat data collection. In this case, fairly varied behaviours are observed. For instance, when a high quality consensus clustering solution λc is available prior to self-refining, none of the refined consensus clusterings achieves a higher φ (NMI) —see, for instance, the boxplots corresponding to the CSPA, ALSAD and KMSAD consensus functions. In contrast, in cases in which λc has a low φ (NMI), self-refining sometimes brings about notable quality gains, such as the ones observed in the EAC or SLSAD based flat and RHCA consensus architectures. However, the supraconsensus function tends to select the non-refined clustering solution as the final partition of the process in a majority of cases.

D.1.8 miniNG data set

The boxplot charts depicted in figure D.8 summarize the performance of the consensus-based self-refining procedure when applied on the miniNG data set. It is interesting to note that, except when self-refining is based on the EAC consensus function, important quality gains are obtained —in most cases, there exists at least one self-refined consensus clustering with higher φ (NMI) than the non-refined clustering λc. Notice the large quality gains obtained when self-refining is based on MCLA, as we move from a very low φ (NMI) non-refined consensus clustering solution λc to self-refined clusterings that are comparable to the highest quality components in the cluster ensemble E. However, when self-refining is based on EAC, little (if any) φ (NMI) improvement is introduced by the self-refining procedure. Last, notice how, in most cases, the supraconsensus function selects high quality clustering solutions as the final partition.

D.1.9 Segmentation data set

Figure D.9 presents the boxplots of the non-refined and self-refined consensus clustering solutions obtained by the consensus-based self-refining procedure applied on the Segmentation data collection. Notice that, thanks to the proposed refining process, at least one self-refined clustering solution of higher quality than that of the non-refined consensus clustering λc is obtained in most cases. In fact, the only exceptions occur in the refinement of the λc output by the flat and DHCA consensus architectures based on the EAC consensus function.

As regards the performance of the supraconsensus function, we can see that it casts a shadow over the good results of the self-refining process just reported, as it rarely picks the highest quality consensus clustering solution —although it usually selects one of the higher quality ones.
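The supraconsensus function must make this choice blindly, that is, without access to the ground truth labels behind φ (NMI). The sketch below illustrates one such blind selection rule; it is an illustrative assumption (ranking candidates by their average NMI against the cluster ensemble components, computed with scikit-learn), not a verbatim transcription of the implementation used in this thesis.

```python
# Hypothetical sketch of a blind (label-free) supraconsensus selection:
# among candidate consensus partitions, keep the one whose average NMI
# with respect to the cluster ensemble components is highest.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def supraconsensus_select(candidates, ensemble):
    """candidates, ensemble: lists of 1-D integer label arrays."""
    def avg_nmi(partition):
        return np.mean([nmi(partition, component) for component in ensemble])
    scores = [avg_nmi(candidate) for candidate in candidates]
    return candidates[int(np.argmax(scores))]
```

Because such a ranking measures agreement with the (possibly noisy) ensemble rather than φ (NMI) itself, a candidate of lower true quality can be preferred, which is consistent with the behaviour reported above.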

D.1.10 BBC data set

The qualities of the clusterings resulting from applying the consensus-based self-refining procedure on the BBC data set are presented in the boxplots of figure D.10. Notice that, although the quality of the non-refined consensus clustering λc is highly dependent on the consensus function employed (from the high φ (NMI) values in CSPA, MCLA, ALSAD



[Figure D.7 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.7: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the MFeat data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


[Figure D.8 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.8: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the miniNG data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.



[Figure D.9 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.9: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the Segmentation data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


or KMSAD to the poorer qualities for EAC, HGPA or SLSAD), the self-refining process manages to yield better clusterings in most cases, although the observed φ (NMI) increases are, in general, modest.

Notice that the supraconsensus function is reasonably successful in blindly selecting the top quality consensus clustering solution.

D.1.11 PenDigits data set

Figure D.11 depicts the φ (NMI) values of the non-refined and self-refined consensus clusterings resulting from the application of the consensus-based self-refining procedure on the PenDigits data set. Remember that, on this collection, only the HGPA and MCLA consensus functions are applicable using the hierarchical consensus architectures. Whereas the quality of the clusterings obtained using HGPA is dramatically poor, the results obtained with MCLA are quite encouraging. The large φ (NMI) gain observed when refining the consensus clustering λc output by RHCA is noteworthy. Moreover, notice that supraconsensus correctly selects the highest quality clustering solution in this case.



[Figure D.10 plot area, panels (a) flat, (b) RHCA and (c) DHCA: φ (NMI) boxplot rows for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.10: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the flat, RHCA, and DHCA consensus architectures on the BBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


[Figure D.11 plot area, panels (a) RHCA and (b) DHCA: φ (NMI) boxplot rows for the HGPA and MCLA consensus functions, with columns E, λc and λc p for p = 2 to 90.]

Figure D.11: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc pi output by the RHCA and DHCA consensus architectures on the PenDigits data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2 Experiments on selection-based self-refining

This section presents the results of the clustering self-refining procedure based on the selection of a cluster ensemble component λref by means of an average normalized mutual information (φ (ANMI)) criterion —see section 4.3.

The results are presented in a very similar fashion to that of the previous section, that is, by means of boxplot charts displaying the φ (NMI) of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus solutions λc pi, with pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}. Moreover, the clustering solution designated as optimal by the supraconsensus function is highlighted by a vertical green dashed line, which provides a simple and fast means for evaluating its performance qualitatively.
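As a concrete illustration, the sketch below outlines how the reference component λref and the percentage-indexed select cluster ensembles behind the λc pi solutions can be obtained. The exact construction (keeping, for each percentage p, the components most similar to the reference partition) is stated here as an illustrative assumption rather than a verbatim transcription of section 4.3, and scikit-learn is used for the pairwise NMI computations.

```python
# Hedged sketch of selection-based self-refining (illustrative assumptions,
# not a verbatim transcription of the procedure of section 4.3):
#  1. lambda_ref = ensemble component with maximum average NMI (phi_ANMI),
#  2. for each percentage p, keep the p% of components closest to lambda_ref;
#     a consensus function is then run on each select cluster ensemble.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

PERCENTAGES = [2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90]

def select_reference(ensemble):
    """ensemble: list of 1-D integer label arrays, one per clustering."""
    anmi = [np.mean([nmi(lam, other) for other in ensemble]) for lam in ensemble]
    return ensemble[int(np.argmax(anmi))]

def select_cluster_ensembles(ensemble, lam_ref, percentages=PERCENTAGES):
    # rank the components by decreasing similarity (NMI) to the reference
    order = np.argsort([-nmi(lam_ref, lam) for lam in ensemble])
    for p in percentages:
        size = max(1, int(round(len(ensemble) * p / 100.0)))
        yield p, [ensemble[i] for i in order[:size]]
```

Each select cluster ensemble would then be fed to one of the seven consensus functions (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD or SLSAD) to obtain the corresponding self-refined solution λc p.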

D.2.1 Iris data set

Figure D.12 presents the results of the selection-based self-refining procedure applied on the Iris data set. It can be observed that the selected cluster ensemble component λref is of notable quality —i.e. well above the median partition in the cluster ensemble E. Notice how the self-refining process brings about relevant φ (NMI) gains depending on the consensus function employed. This is the case of the CSPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions. However, the supraconsensus function selects λref as the optimal partition, thus ignoring the improvements introduced by the self-refining process in the aforementioned cases. This again highlights the need for well performing supraconsensus functions.



[Figure D.12 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.12: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Iris data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2.2 Wine data set

The results obtained by the application of the selection-based self-refining procedure on the Wine data collection are presented in figure D.13. Notice that the cluster ensemble component selected by means of the average normalized mutual information criterion, λref, is nearly the top quality partition contained in the cluster ensemble E. In this case, the creation of select cluster ensembles brings about no quality gains, regardless of the consensus function employed.

As regards the performance of the supraconsensus function, it selects λref as the final clustering solution in a majority of cases, choosing a noticeably worse partition in those experiments based on the KMSAD and CSPA consensus functions.

D.2.3 Glass data set

Figure D.14 presents the φ (NMI) boxplots corresponding to the selection-based self-refining procedure applied on the Glass data set. Notice how the notably high quality clustering


[Figure D.13 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.13: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Wine data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

solution λref is hardly surpassed by any of the refined consensus clustering solutions —in fact, this only happens when self-refining is based on the EAC and SLSAD consensus functions. In most cases, supraconsensus selects the cluster ensemble component λref as the final clustering solution.

D.2.4 Ionosphere data set

The application of the selection-based self-refining process on the Ionosphere data collection gives rise to modest quality increases, as depicted in figure D.15. Notice that clustering solutions of higher φ (NMI) than λref are only obtained when the self-refining procedure is based on the CSPA and HGPA consensus functions.

Furthermore, notice that the supraconsensus function selects, in most cases, the highest quality clustering solutions —unfortunately, the only exceptions occur when self-refining is based on CSPA and HGPA, i.e. the cases in which self-refining introduces some significant quality gains.



[Figure D.14 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.14: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Glass data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2.5 WDBC data set

Figure D.16 presents the φ (NMI) boxplots corresponding to the selection-based self-refining process applied on the WDBC data set. Firstly, notice that the cluster ensemble component selected by means of the φ (ANMI) criterion, λref, is pretty close to the highest quality partition contained in the cluster ensemble. Secondly, the effect of self-refining is highly dependent on the consensus function employed. For instance, no quality gains are achieved when CSPA, EAC or HGPA are used. In contrast, φ (NMI) gains (although modest) are obtained when self-refining is conducted using the MCLA, ALSAD, KMSAD and SLSAD consensus functions. Last, notice that the supraconsensus function selects very high quality clusterings as the final partition.

D.2.6 Balance data set

As regards the performance of the selection-based self-refining process when applied on the Balance data collection, figure D.17 shows that self-refined clustering solutions of higher


[Figure D.15 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.15: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Ionosphere data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

quality than that of the selected cluster ensemble component λref are obtained for most consensus functions —in fact, the clearest exceptions to this behaviour are EAC and HGPA. However, the supraconsensus function is not capable of selecting those better quality clusterings as the final partition in most cases, which again gives an idea of its limited performance.

D.2.7 MFeat data set

Figure D.18 presents the φ (NMI) boxplots of the clusterings obtained after applying the selection-based self-refining process on the MFeat data set. Notice how, for four out of the seven consensus functions (CSPA, MCLA, ALSAD and KMSAD), notable quality gains are obtained (i.e. at least one of the refined clusterings attains a higher φ (NMI) than the selected cluster ensemble component λref). Unfortunately, the supraconsensus function fails to select these high quality partitions as the final clustering solution λc final, as it systematically selects λref as the optimal one, which is only the correct option when self-refining is based on the EAC, HGPA and SLSAD consensus functions.
[Figure D.16 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.16: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the WDBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

D.2.8 miniNG data set

The application of the selection-based self-refining procedure on the miniNG data collection yields the boxplots presented in figure D.19. Notice that the selected cluster ensemble component λref is comparable to the best individual partitions contained in the ensemble E. Despite this, the self-refining process manages to obtain even higher quality consensus clusterings when based on the CSPA, MCLA, ALSAD and KMSAD consensus functions. Unfortunately, the supraconsensus function fails on most occasions to select the maximum φ (NMI) clustering —in fact, it only makes the correct choice when the EAC, HGPA and SLSAD consensus functions are employed.

D.2.9 Segmentation data set

Figure D.20 presents the φ (NMI) boxplots of the selection-based self-refined clustering solutions obtained on the Segmentation data set. Despite the notable quality of the selected cluster ensemble component λref, notice that important quality gains are obtained when
[Figure D.17 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.17: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Balance data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

self-refining is applied, especially when the CSPA, ALSAD and KMSAD consensus functions are employed —more modest improvements are obtained when using MCLA or SLSAD, whereas none is attained when self-refining is based on EAC and HGPA.

As regards the ability of the supraconsensus function to select the top quality clustering solution as the final one, it only succeeds when consensus is based on EAC, HGPA and MCLA. However, the φ (NMI) losses caused by suboptimal supraconsensus selection are, in general, moderate.

D.2.10 BBC data set

The application of the selection-based self-refining procedure on the BBC data set yields the φ (NMI) boxplots depicted in figure D.21. Notice that, in this data collection, the cluster ensemble component λref selected via average φ (NMI) is very close (if not equal) to the maximum quality individual partition contained in the cluster ensemble E. Starting from this high quality reference point, the self-refining procedure manages to yield slightly better clusterings when it is based on the MCLA, ALSAD and KMSAD consensus functions.
[Figure D.18 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.18: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the MFeat data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

Moreover, notice that, regardless of the consensus function employed, the supraconsensus function tends to select pretty high quality clustering solutions as the final ones.

D.2.11 PenDigits data set

Figure D.22 presents the φ (NMI) boxplots corresponding to the application of the selection-based clustering self-refining procedure on the PenDigits data set. Recall that, due to its size, only the HGPA and MCLA consensus functions are executable on this data collection. As regards the results obtained, notice that the selected cluster ensemble component λref has a notably high quality. However, the results obtained when it is self-refined differ dramatically depending on the consensus function applied. In the case of HGPA, self-refining brings about no quality gains, and the supraconsensus function correctly selects λref as the final clustering solution. In contrast, refined clusterings yielded by MCLA are capable of achieving slightly higher φ (NMI) values than the selected cluster ensemble component λref. However, supraconsensus conducts a suboptimal selection, as it does not choose the maximum quality refined clustering as the final partition.
[Figure D.19 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.19: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the miniNG data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.
[Figure D.20 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.20: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the Segmentation data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.
[Figure D.21 plot area, panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ (NMI) boxplots with columns E, λref and λc p for p = 2 to 90.]

Figure D.21: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc pi on the BBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.

Figure D.22: φ(NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc^pi on the PenDigits data collection across all the consensus functions employed (panels: (a) HGPA, (b) MCLA). The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.


Appendix E

Experiments on multimodal consensus clustering

This appendix presents several experiments regarding the multimodal self-refining consensus architectures described in chapter 5, applied to the CAL500, InternetAds and Corel data collections. Due to space limitations, the experiments described correspond to the application of the proposed methodology on cluster ensembles derived from four of the twenty-eight clustering algorithms employed in this thesis, namely agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2.

For each one of the data sets, two facets of the experiments are presented separately. Firstly, the consensus clusterings obtained on each modality and across modalities are qualitatively evaluated. To do so, a set of boxplot charts is presented displaying the φ(NMI) values of the components of the corresponding cluster ensemble E, and of the unimodal, multimodal and intermodal consensus clusterings obtained by the seven consensus functions employed in this work across 10 independent runs.

Secondly, the quality of the self-refined consensus clustering solutions output by the proposed consensus self-refining procedure is evaluated with the help of boxplot diagrams displaying the φ(NMI) values of the corresponding cluster ensembles, of the non-refined consensus clustering λc and of the self-refined consensus clustering solutions λc^pi. As regards the latter, a set of refined clusterings is obtained using a range of percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75} of the whole ensemble E. The performance of the φ(ANMI)-based supraconsensus function for selecting one of the λc^pi is also qualitatively evaluated.
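By way of illustration, the self-refining loop and the φ(ANMI)-based supraconsensus selection just outlined could be wired together as in the following minimal sketch. The consensus function is a mere placeholder for any of the seven combiners used here, and ranking the ensemble components by their NMI with respect to the non-refined consensus λc is an assumption of this sketch rather than a statement of the exact selection criterion employed in this thesis.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def anmi(candidate, ensemble):
    # average NMI of a candidate clustering against every ensemble component
    return np.mean([nmi(candidate, comp) for comp in ensemble])

def self_refine(ensemble, consensus,
                percentages=(2, 5, 10, 15, 20, 30, 40, 50, 60, 75)):
    lambda_c = consensus(ensemble)            # non-refined consensus clustering
    # assumed criterion: rank components by their NMI against lambda_c
    ranked = sorted(ensemble, key=lambda comp: nmi(lambda_c, comp), reverse=True)
    candidates = {0: lambda_c}                # key 0 stands for the non-refined solution
    for p in percentages:
        top = ranked[:max(1, round(len(ranked) * p / 100))]
        candidates[p] = consensus(top)        # self-refined consensus, one per percentage
    # phi(ANMI)-based supraconsensus: keep the candidate closest to the whole ensemble
    best = max(candidates, key=lambda p: anmi(candidates[p], ensemble))
    return candidates[best], best

In the experiments reported below, consensus would stand for CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD or SLSAD, each applied over 10 independent runs.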

E.1 CAL500 data set

In this section, the results of the multimodal consensus clustering experiments conducted on the CAL500 data collection are described. The modalities contained in this data set are audio and text (see appendix A.2.2 for a description).


Figure E.1: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).

E.1.1 Consensus clustering per modality and across modalities

For starters, the quality of the consensus clustering solutions obtained on i) the two original modalities, ii) the fused audio+text multimodal modality, and iii) across the previous three modalities is evaluated. Figure E.1 presents the results obtained after the application of the proposed multimodal consensus architecture on the cluster ensemble resulting from the compilation of the partitions output by the agglo-cos-upgma clustering algorithm. It can be observed that the quality of the clusterings corresponding to the audio modality is notably better than that obtained on the text modality (except when the EAC consensus function is employed). The early fusion of the auditory and textual features does not introduce any beneficial effect, rather the contrary. The quality of the intermodal consensus clusterings λc, which correspond to the combination of the three modalities, is approximately a trade-off between them.

Figures E.2, E.3 and E.4 depict, respectively, the results obtained on the cluster ensembles created upon the direct-cos-i2, graph-cos-i2 and rb-cos-i2 CLUTO clustering algorithms. Very similar results to the ones just reported are obtained in all cases: the consensus clusterings based on the audio modality attain higher quality than those based on the remaining modalities, while the multimodal and intermodal consensus clustering solutions constitute a trade-off between modalities.
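For readers who prefer pseudocode, the experimental design compared in figures E.1 to E.4 can be summarised as below. This is only an illustrative sketch under our own assumptions (in particular, treating the intermodal consensus as a consensus over the pooled ensembles of the three modalities); it is not the thesis implementation, and consensus is again a placeholder for any of the seven consensus functions.

def multimodal_consensus_experiment(audio_ensemble, text_ensemble, fused_ensemble, consensus):
    # unimodal consensus clusterings, one per original modality (panels (a) and (b))
    audio_consensus = consensus(audio_ensemble)
    text_consensus = consensus(text_ensemble)
    # multimodal consensus on the ensemble built from early-fused audio+text features (panel (c))
    multimodal_consensus = consensus(fused_ensemble)
    # intermodal consensus across the three modalities (panel (d)); pooling the ensembles
    # (assumed to be plain lists of partitions) is an assumption made for this sketch
    intermodal_consensus = consensus(audio_ensemble + text_ensemble + fused_ensemble)
    return audio_consensus, text_consensus, multimodal_consensus, intermodal_consensus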



Figure E.2: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).

Figure E.3: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).

Figure E.4: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the CAL500 data set (panels: (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text), (d) Intermodal).


Figure E.5: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

E.1.2 Self-refined consensus clustering across modalities

In this section, the results of running the self-refining procedure on the intermodal consensus clustering solution λc are evaluated. Firstly, the results of the process applied on the cluster ensemble created by the compilation of the clusterings output by the agglo-cos-upgma clustering algorithm are presented in figure E.5. On each one of the seven boxplot charts displayed (one per consensus function), the clustering solution selected by the supraconsensus function, λc^final, is highlighted by a green dashed vertical line. Notice that in all cases there exists at least one self-refined consensus clustering λc^pi that attains a higher φ(NMI) than the non-refined consensus clustering solution λc. However, the supraconsensus function mostly fails to select the top quality clustering as the final partition of the data; in fact, it only does so in the experiments based on the CSPA, ALSAD and KMSAD consensus functions. This situation clearly illustrates both the advantages of the proposed self-refining procedure and the shortcomings of the φ(ANMI)-based supraconsensus function.

Figures E.6 to E.8 present the self-refining results obtained on the cluster ensembles constructed upon the clusterings generated by the direct-cos-i2, graph-cos-i2 and rb-cos-i2 clustering algorithms. Notice that, in most cases, the self-refining procedure yields at least one clustering of higher quality than the non-refined consensus clustering solution. Exceptions to this behaviour occur, for instance, when the KMSAD and EAC consensus functions are employed for clustering combination on the direct-cos-i2 and graph-cos-i2 cluster ensembles (see figures E.6(f) and E.7(b), respectively). Again, the supraconsensus function performs with modest accuracy, managing to select the top quality clustering solution in some cases (see figure E.8(b)) and failing clamorously in others (as in figure E.7(g)).

Figure E.6: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.7: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).



Figure E.8: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the CAL500 data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

E.2 InternetAds data set

In the following paragraphs, the results corresponding to the execution of self-refining multimodal consensus clustering on the InternetAds collection are presented. In this case, the modalities are object (i.e. image) attributes and collateral image attributes (see appendix A.2.2 for a description).

E.2.1 Consensus clustering per modality and across modalities

In this section, the quality of the unimodal, multimodal and intermodal consensus clusterings obtained on the cluster ensembles generated upon the agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2 clustering algorithms is evaluated.

Firstly, the results corresponding to the agglo-cos-upgma cluster ensemble are depicted in figure E.9. The first thing to notice is the extremely low quality of the cluster ensemble components, regardless of the modality. This fact conditions the consensus clustering results obtained, which are also of very low quality. Moreover, in contrast to what has been observed in the rest of the experiments, there is very little difference among the performances of the distinct consensus functions employed.

A very similar behaviour is observed in the experiments conducted on the direct-cos-i2 and rb-cos-i2 cluster ensembles (figures E.10 and E.12). However, notably different results are obtained on the graph-cos-i2 cluster ensemble (see figure E.11). In this case, the execution of consensus clustering on the collateral and the multimodal modalities yields better results.

Figure E.9: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the InternetAds data set (panels: (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral), (d) Intermodal).

Figure E.10: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the InternetAds data set (panels: (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral), (d) Intermodal).

E.2.2 Self-refined consensus clustering across modalities

The results of the application of the self-refining procedure on the intermodal consensus clustering λc are presented next. Again, the consensus clustering selected by the supraconsensus function, λc^final, is highlighted by a green vertical dashed line.

Figures E.13, E.14 and E.16, which show the results corresponding to the agglo-cos-upgma, direct-cos-i2 and rb-cos-i2 cluster ensembles, reveal that little is achieved by self-refining in these cases. In contrast, the usual growing φ(NMI) patterns induced by self-refining are observed in figure E.15, especially when the MCLA, ALSAD and KMSAD consensus functions are employed (see figures E.15(d), E.15(e) and E.15(f)). Unfortunately, the supraconsensus function mostly fails to select the top quality clustering in these cases.
cases.<br />

365


E.3. Corel data set<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

object<br />

λ graph−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(a) Modality 1<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

collateral<br />

λ graph−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(b) Modality 2<br />

φ (NMI)<br />

object+collateral<br />

λ graph−cos−i2<br />

c<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(c) Multimodal<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

λ c graph−cos−i2<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(d) Intermodal<br />

Figure E.11: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the graph-cos-i2 algorithm on the InternetAds data set.<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

object<br />

λ rb−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(a) Modality 1<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

collateral<br />

λ rb−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(b) Modality 2<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

object+collateral<br />

λ rb−cos−i2<br />

c<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(c) Multimodal<br />

φ (NMI)<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

λ c rb−cos−i2<br />

E<br />

CSPA<br />

EAC<br />

HGPA<br />

MCLA<br />

ALSAD<br />

KMSAD<br />

SLSAD<br />

(d) Intermodal<br />

Figure E.12: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />

solutions using the rb-cos-i2 algorithm on the InternetAds data set.<br />

E.3 Corel data set

This section is devoted to the presentation of the results of the multimodal consensus clustering experiments executed on the Corel data collection. In this data set, the modalities are image and text features.

E.3.1 Consensus clustering per modality and across modalities

For starters, figure E.17 depicts the unimodal, multimodal and intermodal consensus clusterings obtained on the agglo-cos-upgma cluster ensemble. Notice the notable differences between the two modalities, as clustering this collection using the textual features of the objects leads to better partitions than those obtained on the image modality. The multimodal modality resulting from the early fusion of textual and visual features yields clusterings whose quality is equal to or slightly lower than that of the textual ones. Thus, in this case, multimodality brings about no gains as regards obtaining higher quality partitions. Last, the intermodal consensus clustering λc attains φ(NMI) values comparable to those of the best text-based partitions when the CSPA, ALSAD and KMSAD consensus functions are employed, while it constitutes a trade-off between modalities when the remaining clustering combiners are used.

A very similar performance is observed when the consensus process is applied on the direct-cos-i2, graph-cos-i2 and rb-cos-i2 cluster ensembles, as figures E.18, E.19 and E.20 reveal.


Figure E.13: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.14: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).



Figure E.15: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.16: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the InternetAds data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).



Figure E.17: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).

Figure E.18: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).

Figure E.19: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).

E.3.2 Self-refined consensus clustering across modalities

In the following paragraphs, the results of applying the consensus self-refining procedure on the intermodal consensus clustering λc are qualitatively described.

For starters, the φ(NMI) values of the non-refined and self-refined consensus clusterings obtained on the agglo-cos-upgma cluster ensemble are presented in figure E.21. We can observe that, in all cases, there exists at least one refined consensus clustering that attains a higher φ(NMI) value than the non-refined clustering λc. Moreover, in this case, the supraconsensus function is quite successful in selecting the top quality consensus clustering as the final partition λc^final, which is again highlighted by means of a green dashed vertical line.

The performance of the self-refining procedure is equally satisfying when conducted on the cluster ensembles created by means of the three remaining clustering algorithms, as figures E.22, E.23 and E.24 depict. However, the selection accuracy of the supraconsensus function is somewhat inconsistent, as already observed on the previous data sets.


Figure E.20: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the Corel data set (panels: (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image), (d) Intermodal).


Figure E.21: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).


Figure E.22: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).

Figure E.23: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).


Figure E.24: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the Corel data set (panels: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD).
c


Appendix F

Experiments on soft consensus clustering

This appendix presents the results of the consensus clustering experiments on soft cluster ensembles. The main purpose of these experiments is to compare the four voting-based consensus functions put forward in chapter 6, namely BordaConsensus (BC), CondorcetConsensus (CC), ProductConsensus (PC) and SumConsensus (SC), with five state-of-the-art clustering combiners: the soft versions of the hypergraph-based hard consensus functions CSPA, HGPA and MCLA (Strehl and Ghosh, 2002), and the evidence accumulation approach (EAC) (Fred and Jain, 2005) (see section 6.2), plus the voting-merging soft consensus function (VMA) of Dimitriadou, Weingessel, and Hornik (2002).

Such a comparison entails two aspects: the quality of the consensus clustering solutions obtained (measured in terms of normalized mutual information, φ(NMI), with respect to the ground truth of each data set), and the time complexity of each consensus function (measured in terms of the CPU time required for its execution; see appendix A.6 for a description of the computational resources employed in this work).
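For reference, under the normalization by the geometric mean of the partition entropies used by Strehl and Ghosh (2002), this quality measure can be written as

\[
\phi^{(\mathrm{NMI})}(\lambda_a, \lambda_b) = \frac{I(\lambda_a, \lambda_b)}{\sqrt{H(\lambda_a)\, H(\lambda_b)}},
\]

where I(λa, λb) denotes the mutual information between the two partitions and H(·) their entropies, so that the measure ranges from 0 (independent partitions) to 1 (identical partitions). The exact definition adopted in this work is the one introduced in an earlier chapter; the expression above is included only as a reminder of the usual form.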

From a formal viewpoint, the results of these experiments are presented by means of a φ(NMI) vs. CPU time diagram, in which the performance of each consensus function is depicted by means of a scatterplot covering the mean ± 2-standard-deviation region of the corresponding magnitudes (i.e. φ(NMI) and CPU time). Moreover, the statistical significance of the results is evaluated by means of Student's t-tests that compare all the consensus functions on a pairwise basis, thus analyzing whether the hypothetical superiority of any of them is sustained on firm statistical grounds, using the traditional 95% confidence level as a reference for distinguishing between significant and non-significant differences.
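The following minimal sketch illustrates one such pairwise test. The score arrays are hypothetical placeholders rather than values taken from the experiments, and scipy's independent two-sample t-test is assumed as the concrete implementation.

from scipy import stats

# hypothetical phi(NMI) scores of two consensus functions over repeated runs
phi_nmi_sc = [0.78, 0.81, 0.79, 0.80, 0.82]    # e.g. SumConsensus (SC)
phi_nmi_cspa = [0.74, 0.76, 0.75, 0.77, 0.73]  # e.g. CSPA

t_stat, p_value = stats.ttest_ind(phi_nmi_sc, phi_nmi_cspa)
# 95% confidence level: the difference is deemed significant when p < 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant = {p_value < 0.05}")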

These soft consensus clustering experiments have been conducted on the twelve unimodal data collections employed in this work (see appendix A.2.1 for a description). The results corresponding to the Zoo data collection are presented in chapter 6, and the following paragraphs describe the results obtained on the eleven remaining data sets.




[Figure F.1: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Iris data collection.]

F.1 Iris data set

This section describes the results of the soft consensus clustering experiments run on the Iris data set. Figure F.1 presents the φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the nine consensus functions compared. Quite obviously, the closer the scatterplot of a consensus function lies to the top left corner of the diagram, the better its performance (i.e. it yields high quality consensus clustering solutions with low time complexity).
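For reference, a minimal sketch of how such mean ± 2-standard-deviation regions can be drawn is given below; the per-run measurements are illustrative placeholders, and the plotting details of the actual figures may differ.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    # Per-run (CPU time, phi(NMI)) measurements; placeholder values only.
    results = {
        "VMA": (np.array([0.05, 0.06, 0.05]), np.array([0.74, 0.75, 0.73])),
        "SC":  (np.array([0.07, 0.06, 0.08]), np.array([0.75, 0.74, 0.76])),
    }

    fig, ax = plt.subplots()
    for name, (cpu, nmi) in results.items():
        cx, cy = cpu.mean(), nmi.mean()
        w, h = 4 * cpu.std(), 4 * nmi.std()   # rectangle spans +/- 2 std on each axis
        ax.add_patch(Rectangle((cx - w / 2, cy - h / 2), w, h, alpha=0.3, label=name))
        ax.annotate(name, (cx, cy))
    ax.set_xlabel("CPU time (sec.)")
    ax.set_ylabel("phi(NMI)")
    ax.set_xlim(0, 0.2)
    ax.set_ylim(0, 1)
    ax.legend()
    plt.show()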

In this case, the proposed SC and PC consensus functions match the performance of VMA, both in terms of time complexity and consensus quality. The performance of the other two proposed consensus functions (BC and CC) is fairly comparable as far as the quality of the consensus clustering solutions is concerned, but their computational complexity is higher. As regards the state-of-the-art consensus functions, CSPA seems to yield slightly better quality results, although its CPU time is more than double that of our proposals, making it the most costly option. For its part, MCLA is competitive from a computational viewpoint, but it yields lower quality consensus clusterings. Last, EAC and HGPA are the worst performing consensus functions.

If the statistical significance of the results is evaluated (see table F.1), it can be observed that the φ(NMI) superiority of CSPA is only apparent, as the differences with respect to BC and CC are not statistically significant, and the quality of the consensus clusterings output by SC and PC is significantly better than that of CSPA. Moreover, SC and PC are statistically equivalent to VMA both in terms of quality and execution time.

F.2 Wine data set

The soft consensus clustering results obtained on the Wine data collection are presented next. Figure F.2 displays the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to the nine consensus functions compared. In general terms, it can be observed that VMA is the fastest alternative, while the best quality consensus clustering solutions are the ones output by two of the proposed consensus functions: BC and CC.

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
EAC    0.0001  ———     0.0014  0.0001  0.0001  0.0001  0.001   0.0001  0.0001
HGPA   0.0001  0.0001  ———     ×       0.0001  ×       ×       0.0001  0.0001
MCLA   0.043   0.0001  0.0001  ———     0.0001  ×       ×       0.0001  0.0001
VMA    0.0146  0.0001  0.0001  0.0001  ———     0.0001  0.0001  ×       ×
BC     ×       0.0001  0.0001  0.0001  0.0377  ———     0.0163  0.0001  0.0001
CC     ×       0.0001  0.0001  0.0001  0.0373  ×       ———     0.0001  0.0001
PC     0.0377  0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     0.0289  0.0001  0.0001  0.0001  ×       ×       ×       ×       ———

Table F.1: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Iris data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

[Figure F.2: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Wine data collection.]

The analysis of the statistical significance of these results reinforces these notions (see table F.2). Indeed, the CPU time differences between VMA and the remaining consensus functions are always statistically significant, with very low p-values (around 0.0001). Moreover, in terms of φ(NMI), BC and CC are significantly better than any of the alternatives. For their part, as already suggested by the diagram of figure F.2, SC and PC are not statistically different from VMA as far as the quality of the consensus clustering solutions is concerned.

F.3 Glass data set

This section describes the results of the quality and time complexity comparison experiments between the nine soft consensus functions employed in this work, when applied to the Glass data set.


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.001   0.0001  ×       0.0001  0.0249  0.0005  0.0001  0.0001
EAC    0.0001  ———     0.0001  ×       0.0001  ×       0.0105  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0004  0.0001  0.0001  0.0001  ×       ×
MCLA   ×       0.0001  0.0001  ———     0.0001  ×       0.0199  0.0013  0.0014
VMA    0.0001  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0002  0.0002
BC     0.0001  0.0001  0.0001  0.0001  0.0006  ———     ×       0.0001  0.0001
CC     0.0001  0.0001  0.0001  0.0001  0.0006  ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0001  ×       0.001   0.001   ———     ×
SC     0.0001  0.0001  0.0001  0.0001  ×       0.0129  0.0129  ×       ———

Table F.2: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Wine data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

[Figure F.3: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Glass data collection.]

As figure F.3 suggests, VMA is again the least time consuming consensus function. As mentioned earlier, this is due to the simultaneity of the cluster disambiguation and voting processes in this consensus function. In contrast, the proposed CC consensus function is by far the slowest, probably due to the exhaustive pairwise cluster confrontation implicit in the Condorcet voting method.
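The quadratic cost of that confrontation can be illustrated with a generic Condorcet tally (not the exact CC implementation), in which every pair of candidate clusters is compared across all voters.

    from itertools import combinations

    def condorcet_winner(rankings, candidates):
        # rankings: list of candidate lists, best first (one per voter).
        wins = {c: 0 for c in candidates}
        for a, b in combinations(candidates, 2):        # O(k^2) pairwise duels
            a_over_b = sum(r.index(a) < r.index(b) for r in rankings)
            b_over_a = len(rankings) - a_over_b
            if a_over_b > b_over_a:
                wins[a] += 1
            elif b_over_a > a_over_b:
                wins[b] += 1
        return max(wins, key=wins.get)                  # candidate winning most duels

    rankings = [["c1", "c2", "c3"], ["c2", "c1", "c3"], ["c1", "c3", "c2"]]
    print(condorcet_winner(rankings, ["c1", "c2", "c3"]))   # -> "c1"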

In terms of quality, the VMA, PC and SC consensus functions are apparently tied, attaining the highest φ(NMI) scores. The CSPA, BC, CC and MCLA consensus functions apparently yield lower quality consensus clustering solutions.

When the statistical significance of these results is analyzed (see table F.3), we see that the apparent time complexity superiority of VMA is statistically significant. As regards the quality of the consensus clustering solutions, it can be observed that the performances of VMA, SC and PC are statistically equivalent, whereas the differences between these and BC and CC are indeed significant.


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0002  0.0001  0.0345  0.0001  ×       0.0001  0.0016  0.0017
EAC    0.0001  ———     0.0001  ×       0.0001  ×       0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0002  0.0001  0.0001  0.0001  ×       ×
MCLA   ×       0.0001  0.0001  ———     0.0001  ×       0.0001  0.002   0.002
VMA    0.0001  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0233  0.0001  0.0001  0.0001  0.0025  ———     0.0001  0.0037  0.0038
CC     0.0247  0.0001  0.0001  0.0001  0.0022  ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0001  ×       0.0092  0.0084  ———     ×
SC     0.0001  0.0001  0.0001  0.0001  ×       0.0064  0.0058  ×       ———

Table F.3: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Glass data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.4 Ionosphere data set

In the following paragraphs, the results of the soft consensus clustering experiments conducted on the Ionosphere data collection are described.

Figure F.4 displays the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to the nine consensus functions compared in this experiment. It can be observed that rather low quality consensus clustering solutions (φ(NMI) < 0.1) are yielded by all clustering combiners. The highest φ(NMI) scores are obtained by CSPA, BC and CC, whose performance is significantly better, in statistical terms, than that of the other six consensus functions (see table F.4 for the statistical significance analysis of the results of this experiment).

As regards time complexity, VMA is the most computationally efficient option, closely followed by HGPA. The proposed PC and SC consensus functions are comparable to CSPA and MCLA in computational terms, while the positional voting based BC and CC consensus functions are, together with EAC, the most time consuming alternatives. The differences between these three groups are statistically significant, as can be inferred from the measurements presented in table F.4.

F.5 WDBC data set

This section describes the results of the soft consensus clustering experiments conducted on the WDBC data set.

The φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the consensus functions are depicted in figure F.5. Once again, VMA is the most computationally efficient consensus function (which, as mentioned earlier, is due to the simultaneity of its cluster disambiguation and voting processes), closely followed by HGPA. However, the confidence voting based consensus functions (PC and SC) are quite close to VMA in CPU time terms, being slightly faster than CSPA and MCLA. As already noticed in the previous experiments, positional voting makes the BC and CC consensus functions more computationally costly (in this case, CC is slightly faster than BC, since the low number of clusters in this data set, k = 2, does not penalize CondorcetConsensus), while EAC is the least efficient option.

[Figure F.4: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Ionosphere data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  ×       0.0001  0.0002  0.0002  ×       ×
EAC    0.0001  ———     0.0001  0.0001  0.0001  ×       ×       0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0008  0.0001  0.0001  0.0001  0.0033  0.0031
MCLA   0.012   0.0004  0.0492  ———     0.0001  0.0009  0.0014  ×       ×
VMA    0.0336  0.0001  0.0001  ×       ———     0.0001  0.0001  0.0001  0.0001
BC     ×       0.0001  0.0001  0.0002  0.0001  ———     ×       0.0003  0.0003
CC     ×       0.0001  0.0001  0.0002  0.0001  ×       ———     0.0004  0.0004
PC     0.0312  0.0001  0.0001  ×       ×       0.0001  0.0001  ———     ×
SC     0.0302  0.0001  0.0001  ×       ×       0.0001  0.0001  ×       ———

Table F.4: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Ionosphere data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.


As far as the quality of the consensus clustering solutions is concerned, PC and SC match VMA as the best performing consensus functions, showing smaller dispersion in φ(NMI) terms than BC, CC and MCLA.

If the statistical significance of the CPU time and φ(NMI) differences between consensus functions is evaluated (see table F.5), it can be observed that PC and SC are, in execution time terms, equivalent to CSPA and MCLA. As regards the quality of the consensus clustering solutions, no significant differences are observed between VMA, BC, CC, PC and SC, which, as noted above, turn out to be the best performing consensus functions.



[Figure F.5: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the WDBC data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  ×       0.0001  0.0001  0.0002  ×       ×
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0026  0.0001  0.0001  0.0001  0.0267  0.0249
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0002  0.0004  ×       ×
VMA    0.0001  0.0001  0.0001  0.0057  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0001  0.0001  0.0001  ×       ×       ———     ×       0.0001  0.0001
CC     0.0001  0.0001  0.0001  ×       ×       ×       ———     0.0001  0.0001
PC     0.0001  0.0001  0.0001  0.0025  ×       ×       ×       ———     ×
SC     0.0001  0.0001  0.0001  0.0103  ×       ×       ×       ×       ———

Table F.5: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the WDBC data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.6 Balance data set

In this section, the performance of the soft consensus functions is compared through a set of consensus clustering experiments conducted on the Balance data set.

Figure F.6 depicts the diagram that qualitatively compares the nine consensus functions in terms of the CPU time required for their execution and the φ(NMI) of the consensus clustering solutions they yield.

As regards the former aspect, we can observe that VMA, PC and SC are the most efficient consensus functions, and that the differences between them, though small, are statistically significant according to the results of the paired t-tests presented in table F.6. Moreover, we can also observe that the BC and CC consensus functions achieve a mid-range time complexity, being slower than MCLA and HGPA, but faster than CSPA and EAC.

Last, as far as the quality of the consensus clustering solutions is concerned, there is a high degree of parity between consensus functions. In fact, the differences between the top performing consensus functions (CSPA, VMA, PC and SC) are not statistically significant.

[Figure F.6: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Balance data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0003  ———     0.0419  0.0001  0.0004  0.0001  0.0001  0.0001
MCLA   0.0291  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001
VMA    ×       0.0001  0.0001  ×       ———     0.0001  0.0001  0.0001  0.0001
BC     0.0139  0.0001  0.0001  ×       ×       ———     0.0044  0.0001  0.0001
CC     0.0139  0.0001  0.0001  ×       ×       ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  ×       ×       0.0322  0.0322  ———     ×
SC     ×       0.0001  0.0001  ×       ×       ×       ×       ×       ———

Table F.6: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Balance data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.7 MFeat data set

The results of the soft consensus clustering experiments conducted on the MFeat data set are presented in this section. Figure F.7 depicts the diagram displaying the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to the nine soft consensus functions compared, and table F.7 presents the results of the paired t-tests that compare them pairwise for statistical significance.

In time complexity terms, VMA is the best performing consensus function, closely followed by MCLA, HGPA, PC and SC (the two latter being statistically equivalent). Of the two proposed positional voting based consensus functions, BC is clearly more efficient than CC. This is probably due to the larger number of classes (i.e. candidates) in this data set, which makes CC more costly because of the exhaustive pairwise candidate confrontation involved in the Condorcet voting method. However, executing CC takes approximately as long as running CSPA, and much less time than doing so with EAC, which is, by far, the least efficient consensus function.


When the quality of the consensus clustering solutions delivered by these consensus functions is compared, we can see that PC, BC and CC obtain the highest φ(NMI) scores (with no significant differences among them), closely followed by VMA, SC and CSPA.

[Figure F.7: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the MFeat data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0022  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0013  0.0001  0.0015  0.0001  0.0266  0.026
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  ×       ×
VMA    ×       0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0007  0.0001  0.0001  0.0001  0.0034  ———     0.0001  0.0001  0.0001
CC     0.0008  0.0001  0.0001  0.0001  0.0043  ×       ———     0.0001  0.0001
PC     0.0382  0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     ×       0.0001  0.0001  0.0001  ×       0.0024  0.003   ×       ———

Table F.7: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the MFeat data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.8 miniNG data set

In this section we present the results of the soft consensus clustering experiments conducted on the miniNG data collection. The φ(NMI) vs CPU time diagram of figure F.8 reveals that three of the proposed voting based consensus functions (BC, PC and SC) constitute a good trade-off between consensus quality and time complexity.


[Figure F.8: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the miniNG data collection.]

Indeed, the consensus clustering solutions they yield are statistically significantly better than those output by state-of-the-art consensus functions such as VMA (which is the least time consuming), CSPA or MCLA (see table F.8 for further details regarding the statistical significance of the differences between consensus functions). The fourth proposed consensus function (CC) also yields higher quality than VMA, CSPA and MCLA, but its time complexity is notably higher, due to the nature of the Condorcet voting method.

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  ×       0.0001  0.0015  0.0001  0.0114  0.0115
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     0.0041  0.0001  0.0001  0.0001  0.0001  0.0001
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0025  0.0001  0.0185  0.0187
VMA    ×       0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0051  0.0001  0.0001  0.0001  0.0038  ———     0.0001  ×       ×
CC     0.008   0.0001  0.0001  0.0001  0.0061  ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     0.0001  0.0001  0.0001  0.0001  0.0001  ×       ×       0.0126  ———

Table F.8: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the miniNG data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.9 Segmentation data set

The results of applying the nine soft consensus functions to the Segmentation data set are described next. Figure F.9 presents the φ(NMI) vs CPU time mean ± 2-standard-deviation regions employed for comparing them.

Again, VMA is the most computationally efficient consensus function. The proposed confidence voting based clustering combiners (PC and SC), together with MCLA and HGPA, are the immediate followers. Between the two proposed consensus functions based on positional voting, BC is once more the most efficient (being comparable to CSPA in terms of execution CPU time), as Borda voting is less computationally demanding than Condorcet voting.
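The lower cost of positional voting can be illustrated with a generic Borda tally (not the exact BC implementation): each ballot is traversed once, awarding decreasing points by rank, so no pairwise duels between candidates are needed.

    def borda_winner(rankings, candidates):
        # rankings: list of candidate lists, best first (one per voter).
        points = {c: 0 for c in candidates}
        k = len(candidates)
        for ballot in rankings:                      # single pass per voter
            for position, candidate in enumerate(ballot):
                points[candidate] += k - 1 - position
        return max(points, key=points.get)

    rankings = [["c1", "c2", "c3"], ["c2", "c1", "c3"], ["c1", "c3", "c2"]]
    print(borda_winner(rankings, ["c1", "c2", "c3"]))   # -> "c1"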



[Figure F.9: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the Segmentation data collection.]


       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  ×       0.0001  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     ×       0.0001  0.0002  0.0001  ×       ×
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.006   0.0001  ×       ×
VMA    ×       0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.0006  0.0001  0.0001  0.0001  0.0069  ———     0.0001  0.0007  0.0007
CC     0.0006  0.0001  0.0001  0.0001  0.0069  ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  0.0001  ×       0.02    0.02    ———     ×
SC     ×       0.0001  0.0001  0.0001  ×       0.0307  0.0307  ×       ———

Table F.9: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the Segmentation data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

However, these two consensus functions (BC and CC) are the ones that obtain the highest quality consensus clustering solutions, and the difference between their φ(NMI) scores and those of the remaining clustering combiners is statistically significant, as the figures shown in table F.9 reveal. The quality of the consensus clusterings output by the other two voting based consensus functions (PC and SC) is, from a statistical standpoint, equivalent to that of the VMA and CSPA consensus functions.


[Figure F.10: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the BBC data collection.]

       CSPA    EAC     HGPA    MCLA    VMA     BC      CC      PC      SC
CSPA   ———     0.0001  0.0001  0.0001  0.0001  0.0002  0.0012  0.0001  0.0001
EAC    0.0001  ———     0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
HGPA   0.0001  0.0001  ———     ×       0.0001  0.0001  0.0001  ×       ×
MCLA   0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  ×       ×
VMA    0.0456  0.0001  0.0001  0.0001  ———     0.0001  0.0001  0.0001  0.0001
BC     0.004   0.0001  0.0001  0.0001  ×       ———     0.0001  0.0002  0.0002
CC     0.004   0.0001  0.0001  0.0001  ×       ×       ———     0.0001  0.0001
PC     ×       0.0001  0.0001  0.0001  ×       ×       ×       ———     ×
SC     ×       0.0001  0.0001  0.0001  ×       0.0279  0.0279  ×       ———

Table F.10: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the BBC data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.

F.10 BBC data set

This section presents the results of the soft consensus clustering experiments conducted on the BBC data set. A qualitative description of them is provided by the φ(NMI) vs CPU time diagram of figure F.10, and the results of the statistical significance study of the differences between consensus functions are presented in table F.10.

It can be observed that VMA is again the fastest consensus function. The confidence voting consensus functions (PC and SC) are, in statistical terms, as fast as MCLA and HGPA. The positional voting consensus functions (BC and CC) are slower than those, with BC being faster than CSPA and CC slower than it.

As regards the quality of the consensus clustering solutions obtained, CSPA, PC and SC yield the highest φ(NMI) scores, being equivalent from a statistical significance viewpoint. The BordaConsensus and CondorcetConsensus clustering combiners also deliver fairly good performances, together with the VMA consensus function, being notably better than MCLA (and far better than EAC and HGPA, which yield extremely poor consensus clustering solutions).



F.11 PenDigits data set

The results of the soft consensus clustering experiments conducted on the PenDigits data set are described in the following paragraphs. Due to the number of objects n contained in this collection, the CSPA and EAC consensus functions were not executable, as their space complexity scales quadratically with n.
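A back-of-the-envelope estimate shows why an n × n co-association or similarity matrix becomes prohibitive for collections of this size; the object counts below are illustrative, not the exact size of PenDigits.

    def coassociation_memory_gib(n_objects, bytes_per_entry=8):
        # Memory needed for a dense n x n matrix of 8-byte floats, in GiB.
        return (n_objects ** 2) * bytes_per_entry / 2 ** 30

    for n in (1_000, 10_000, 100_000):
        print(f"n = {n:>7d}  ->  {coassociation_memory_gib(n):8.2f} GiB")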

Thus, as a means of comparing the seven consensus functions applied to this data set, figure F.11 depicts the φ(NMI) vs CPU time mean ± 2-standard-deviation regions corresponding to them. It can be observed that, in this case, the four proposed voting based consensus functions are the most time consuming. However, those based on confidence voting (PC and SC) are fairly comparable to HGPA and MCLA, while BC and CC are the most computationally costly (especially the latter). As in the previous cases, VMA is the most efficient of the consensus functions compared.

When the comparison refers to the φ(NMI) of the consensus clustering solutions yielded by the seven consensus functions, we can observe that the highest quality is obtained by PC and SC, which match VMA in this respect. The other two voting based consensus functions (BC and CC) perform slightly worse, but far better than MCLA and HGPA.


[Figure F.11: φ(NMI) vs CPU time mean ± 2-standard-deviation regions of the soft consensus functions on the PenDigits data collection.]

       HGPA    MCLA    VMA     BC      CC      PC      SC
HGPA   ———     ×       0.0001  0.0005  0.0001  ×       ×
MCLA   0.0001  ———     0.0015  0.0001  0.0001  0.0273  0.0269
VMA    0.0001  0.0001  ———     0.0001  0.0001  0.0002  0.0002
BC     0.0001  0.0001  0.0001  ———     0.0001  0.0097  0.0098
CC     0.0001  0.0001  0.0001  ×       ———     0.0001  0.0001
PC     0.0001  0.0001  ×       0.0001  0.0001  ———     ×
SC     0.0001  0.0001  ×       0.0001  0.0001  ×       ———

Table F.11: Significance levels p corresponding to the pairwise comparison of soft consensus functions using a paired t-test on the PenDigits data set. The upper and lower triangular sections of the table correspond to the comparison in terms of CPU time and φ(NMI), respectively. Statistically non-significant differences (p > 0.05) are denoted by the symbol ×.




This Doctoral Thesis was defended on the ____ day of __________________, ____,
at the Centre _______________________________________________________________
of Universitat Ramon Llull,
before the Examination Board formed by the undersigned Doctors, having obtained the grade:

President
_______________________________

Board member
_______________________________

Board member
_______________________________

Board member
_______________________________

Secretary
_______________________________

Doctoral candidate

