TESI DOCTORAL - La Salle
DOCTORAL THESIS

Title: Hierarchical self-refining consensus architectures and soft consensus functions for robust multimedia clustering
Written by: Xavier Sevillano Domínguez
Centre: Enginyeria i Arquitectura La Salle
Department: Comunicacions i Teoria del Senyal
Supervised by: Dr. Francesc Alías Pujol and Dr. Joan Claudi Socoró Carrié

C. Claravall, 1-3
08022 Barcelona
Tel. 936 022 200
Fax 936 022 249
E-mail: urlsc@sec.url.es
www.url.es
Acknowledgements

This thesis is the fruit of many hours of personal work. Nevertheless, there are many people to whom I am grateful for their support over these years.

First of all, I want to mention my fantastic team of co-supervisors: Joan Claudi Socoró, whom I thank for having been a magnificent advisor and supervisor since the now distant days of my TFC (adaptive equalization, phew!), besides having given me the freedom to write the thesis I wanted and always offering his help at critical moments. And Francesc Alías (Xuti), for the push that ended up becoming the start of this thesis, for his constant spirit of improvement and, above all, for his friendship, which goes back to even more distant times.

To the direct superiors I have had throughout these years, David Badia, Elisa Martínez and Gabriel Fernández, I am grateful for having granted me a space sheltered, on many occasions, from the usual downpour of thankless tasks.

I am also very grateful to my good friend and disciple Germán Cobo, who took with me the first steps of what has ended up being this thesis, and with whom I hope to keep working in the future, even if at a distance (that is the way things are at the UOC).

The long hours of simulations would have been even longer had it not been for the Maintenance staff (Chus, Héctor, Gerard, Oscar, Raúl), who helped me and made it easy for me to use (almost monopolize) a good handful of PCs. I am also very grateful to my section colleagues who let me take over their computers, often to their own detriment: Germán, Joan Claudi, José Antonio Montero and Ester Cierco. Thanks to Lluís Formiga for opening the doors of la Multimodal to me, which greatly accelerated the experimentation process.

Thanks to Berta Martínez, Àngel Calzada and Lluís Cortés for making the summer of 2008 more bearable, and to Germán (again!) for introducing me to Deezer, which has provided the soundtrack to this thesis. Thanks, in general, to all my colleagues in the former Tractament section, in the Department of Communications and in the current DTM.

Thanks to my mother for her love and support throughout my whole life. Thanks to my father for instilling in me a passion for study, and thanks to both of them for the efforts they made so that I could pursue it in the best conditions. And thanks to all my family and friends in general: although you may not know it, it was very comforting for me whenever you asked how the thesis was going.

And thanks to Susana, for her patience throughout this whole process, encouraging me and always believing in me, and for still being there so that we can enjoy together what is to come... once the thesis is finished.
Resum

When partitioning a data collection in an unsupervised manner, the user must make multiple decisions (which algorithm to apply, how to represent the objects, how many groups to cluster them into, among others) that condition, to a large extent, the quality of the resulting partition. Unfortunately, the unsupervised nature of the problem makes it difficult (if not impossible) to make these decisions in a well-founded manner, unless some domain knowledge is available.

In an attempt to combat these uncertainties, this thesis proposes an approach to the problem that intentionally minimizes the decision making required from the user. On the contrary, the use of as many unsupervised classification systems as possible is encouraged, combining them to obtain the final partition of the data (or consensus partition). The more similar this partition is to the highest quality partition delivered by the classification systems subject to combination, the greater the degree of robustness achieved with respect to the indeterminacies inherent to unsupervised classification.

Nevertheless, the indiscriminate combination of unsupervised classifiers raises two main difficulties, namely i) the increase in the computational complexity of the combination process, to the point that its execution may become unfeasible if the number of systems to combine is excessive, and ii) low quality consensus partitions caused by the inclusion of poor classification systems. To fight these problems, this thesis introduces self-refining hierarchical consensus architectures as a way of obtaining good quality consensus partitions at a low computational cost, as confirmed by the numerous experiments conducted.

With the intention of exporting this robust unsupervised classification strategy to a generalist framework, a set of voting based consensus functions for the combination of fuzzy classifiers is proposed. Several experiments show that their performance is comparable or superior to a good part of the state of the art.

Our proposals are naturally applicable to the robust classification of multimodal data (a problem of great current interest given the ubiquity of multimedia), since the existence of multiple modalities raises additional indeterminacies that hinder obtaining robust partitions. The basis of our proposal is the creation of multimodal partition ensembles, which allows the natural and simultaneous use of early and late modality fusion techniques, giving rise to a generic and efficient approach to multimedia classification, the results of which are analyzed across multiple experiments.
Resumen

When partitioning a data collection in an unsupervised manner, the user must make multiple decisions (which algorithm to apply, how to represent the objects, how many groups to cluster them into, among others) that largely condition the quality of the resulting partition. Unfortunately, the unsupervised nature of the problem makes it difficult (if not impossible) to make these decisions in a well-founded manner, unless some domain knowledge is available.

In an attempt to combat these uncertainties, this thesis proposes an approach to the problem that intentionally minimizes the decision making required from the user. On the contrary, the use of as many unsupervised classification systems as possible is encouraged, combining them to obtain the final partition of the data (or consensus partition). The more similar this partition is to the highest quality partition delivered by the classification systems subject to combination, the greater the degree of robustness with respect to the indeterminacies inherent to unsupervised classification.

However, the indiscriminate combination of unsupervised classifiers raises two main difficulties, namely i) the increase in the computational complexity of the combination process, to the point that its execution may become unfeasible if the number of systems to combine is excessive, and ii) low quality consensus partitions caused by the inclusion of poor classification systems. To fight these problems, this thesis introduces self-refining hierarchical consensus architectures as a way of obtaining good quality consensus partitions at a low computational cost, as confirmed by the numerous experiments conducted.

With the intention of exporting this robust unsupervised classification strategy to a generalist framework, a set of voting based consensus functions for the combination of fuzzy classifiers is proposed. Several experiments show that their performance is comparable or superior to a good part of the state of the art.

Our proposals are naturally applicable to the robust classification of multimodal data (a problem of current interest given the ubiquity of multimedia), since the existence of multiple modalities raises additional indeterminacies that hinder obtaining robust partitions. The basis of our proposal is the creation of multimodal partition ensembles, which allows the natural and simultaneous use of early and late modality fusion techniques, giving rise to a generic and efficient approach to multimedia classification, the results of which are analyzed across multiple experiments.
Abstract

When facing the task of partitioning a data collection in an unsupervised fashion, the clustering practitioner must make several crucial decisions (which clustering algorithm to apply, how to represent the objects in the data set, how many clusters to find, among others) that condition, to a large extent, the quality of the resulting partition. However, the unsupervised nature of the clustering problem makes it difficult (if not impossible) to make well-founded decisions unless domain knowledge is available.
In an attempt to fight these indeterminacies, we propose an approach to the clustering problem that intentionally reduces user decision making as much as possible. Instead, the clustering practitioner is encouraged to simultaneously employ as many clustering systems as possible (compiling their outcomes into a cluster ensemble) and to combine them in order to obtain the final partition (or consensus clustering). The more similar the consensus clustering is to the highest quality cluster ensemble component, the greater the degree of robustness achieved against the inherent indeterminacies of clustering.
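As a concrete illustration of the idea above, the following minimal Python sketch (a hypothetical toy, not one of the consensus functions studied in this thesis) builds a consensus clustering from several hard partitions: cluster labels are first aligned to a reference partition by a simple greedy matching (an optimal assignment would use the Hungarian algorithm), and each object is then assigned by majority vote.

```python
import numpy as np

def align_labels(reference, labels, k):
    """Greedily relabel `labels` so its clusters best match `reference`.

    A simplification: the Hungarian algorithm would give the optimal matching.
    """
    # contingency[i, j] = objects in reference cluster i and labels cluster j
    contingency = np.zeros((k, k), dtype=int)
    for r, l in zip(reference, labels):
        contingency[r, l] += 1
    mapping = {}
    cont = contingency.copy()
    for _ in range(k):
        # Repeatedly pick the largest remaining overlap.
        i, j = np.unravel_index(np.argmax(cont), cont.shape)
        mapping[j] = i
        cont[i, :] = -1
        cont[:, j] = -1
    return np.array([mapping[l] for l in labels])

def voting_consensus(ensemble, k):
    """Majority-vote consensus over a list of hard partitions (label arrays)."""
    reference = ensemble[0]
    votes = np.zeros((len(reference), k), dtype=int)
    for labels in ensemble:
        aligned = align_labels(reference, labels, k)
        for idx, lab in enumerate(aligned):
            votes[idx, lab] += 1
    return votes.argmax(axis=1)

# Toy ensemble: three partitions of six objects into k=2 clusters, with
# permuted label names and one disagreement in the third component.
ensemble = [
    np.array([0, 0, 0, 1, 1, 1]),
    np.array([1, 1, 1, 0, 0, 0]),   # same grouping, labels swapped
    np.array([0, 0, 1, 1, 1, 1]),   # one object misplaced
]
consensus = voting_consensus(ensemble, k=2)
print(consensus)  # -> [0 0 0 1 1 1]
```

Note how the label permutation of the second component is irrelevant once alignment is performed, and the single disagreement in the third component is outvoted.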
However, the indiscriminate creation of cluster ensemble components poses two main challenges to the clustering combination process, namely i) an increase in its computational complexity, to the point that creating the consensus clustering can become unfeasible if the number of clustering systems combined is too large, and ii) a low quality consensus partition caused by the inclusion of poor clustering systems in the cluster ensemble. To overcome these difficulties, this thesis introduces hierarchical self-refining consensus architectures as a means of obtaining good quality partitions at a reduced computational cost, as confirmed by extensive experimental evaluation.
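The staged combination underlying hierarchical consensus architectures can be sketched as follows. This is a deliberately simplified illustration (it assumes label-aligned partitions and a plain majority-vote consensus, both hypothetical stand-ins for the architectures studied in Chapter 3): the ensemble is split into mini-ensembles of a fixed branch size, an intermediate consensus is derived for each, and the process recurses, so no single consensus step ever combines more than `branch` partitions.

```python
import numpy as np

def majority_vote(partitions):
    """Plain majority vote over label-aligned hard partitions."""
    votes = np.stack(partitions)
    return np.array([np.bincount(votes[:, i]).argmax()
                     for i in range(votes.shape[1])])

def hierarchical_consensus(ensemble, branch=3):
    """Combine the ensemble in stages: derive an intermediate consensus for
    each mini-ensemble of size `branch`, then recurse on the intermediate
    results until a single consensus partition remains."""
    while len(ensemble) > 1:
        ensemble = [majority_vote(ensemble[i:i + branch])
                    for i in range(0, len(ensemble), branch)]
    return ensemble[0]

# Nine noisy copies of the partition [0, 0, 1, 1]: two components disagree
# on one object each.
base = [0, 0, 1, 1]
ensemble = [np.array(base) for _ in range(7)]
ensemble.insert(1, np.array([1, 0, 1, 1]))
ensemble.insert(5, np.array([0, 0, 0, 1]))
consensus = hierarchical_consensus(ensemble, branch=3)
print(consensus)  # -> [0 0 1 1]
```

With a branch size of 3, the nine components are reduced to three intermediate consensus partitions and then to one, instead of a single flat combination of all nine at once.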
Aiming to port this robust clustering strategy to a more generic framework, a set of voting based consensus functions for the combination of fuzzy clustering systems is proposed. Several experiments demonstrate that the quality of the consensus clusterings they yield is comparable to or better than that of multiple state-of-the-art soft consensus functions.
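A minimal sketch of the voting idea applied to soft cluster ensembles is given below. It assumes the membership matrices have already been aligned, i.e. that the cluster disambiguation step has been solved beforehand (the actual disambiguation and voting strategies are those described in Chapter 6; this toy simply averages memberships and takes the per-object maximum).

```python
import numpy as np

def soft_voting_consensus(memberships):
    """Average a list of pre-aligned soft membership matrices, each of shape
    (n_objects, k), and assign each object to the cluster with the highest
    mean membership."""
    mean = np.mean(memberships, axis=0)
    return mean.argmax(axis=1)

# Two aligned fuzzy clusterings of four objects into k=2 clusters; rows are
# objects, columns are per-cluster membership degrees summing to one.
soft_ensemble = [
    np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]),
    np.array([[0.7, 0.3], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]),
]
labels = soft_voting_consensus(soft_ensemble)
print(labels)  # -> [0 0 1 1]
```

Unlike hard voting, the soft variant lets a confident component (membership 0.9) outweigh a hesitant one (membership 0.6) before any hard decision is made.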
Our proposals find a natural field of application in the robust clustering of multimodal data (a problem of current interest due to the growing ubiquity of multimedia), as the existence of multiple data modalities poses additional indeterminacies that hinder obtaining robust clustering results. The basis of our proposal is the creation of multimodal cluster ensembles, which naturally allows the simultaneous use of early and late modality fusion techniques, thus providing a highly generic and efficient approach to multimedia clustering, the performance of which is analyzed in multiple experiments.
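The construction of a multimodal cluster ensemble can be illustrated as follows. The tiny k-means and the two-modality toy data are hypothetical stand-ins, but the structure mirrors the strategy described above: one late-fusion component per modality, plus an early-fusion component built on the concatenated features, all pooled into a single ensemble.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means, just enough to generate ensemble components."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def multimodal_ensemble(modalities, k):
    """Pool late-fusion components (one clustering per modality) with an
    early-fusion component (a clustering of the concatenated features)."""
    components = [kmeans(X, k, seed=i) for i, X in enumerate(modalities)]
    components.append(kmeans(np.hstack(modalities), k, seed=42))
    return components

# Two toy modalities describing the same six objects (say, text and image
# features); objects 0-2 and 3-5 form two well-separated groups.
mod_a = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
mod_b = np.array([[1.0], [1.1], [0.9], [9.0], [9.2], [9.1]])
ensemble = multimodal_ensemble([mod_a, mod_b], k=2)
print(len(ensemble))  # -> 3 components: two late-fusion, one early-fusion
```

The resulting ensemble can then be fed to any consensus function, so modality fusion and clustering combination are handled by one and the same mechanism.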
Contents<br />
Resum iii<br />
Resumen v<br />
Abstract vii<br />
List of tables xvi<br />
List of figures xxii<br />
List of algorithms xxxiii<br />
List of symbols xxxiv<br />
1 Framework of the thesis 1<br />
1.1 Knowledge discovery and data mining . . . . . . . . . . . . . . . . . . . . . 3<br />
1.2 Clustering in knowledge discovery and data mining . . . . . . . . . . . . . . 7<br />
1.2.1 Overview of clustering methods . . . . . . . . . . . . . . . . . . . . . 8<br />
1.2.2 Evaluation of clustering processes . . . . . . . . . . . . . . . . . . . . 15<br />
1.3 Multimodal clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18<br />
1.4 Clustering indeterminacies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
1.5 Motivation and contributions of the thesis . . . . . . . . . . . . . . . . . . . 25<br />
2 Cluster ensembles and consensus clustering 27<br />
2.1 Related work on cluster ensembles . . . . . . . . . . . . . . . . . . . . . . . 31<br />
2.2 Related work on consensus functions . . . . . . . . . . . . . . . . . . . . . . 33<br />
2.2.1 Consensus functions based on voting . . . . . . . . . . . . . . . . . . 34<br />
2.2.2 Consensus functions based on graph partitioning . . . . . . . . . . . 36<br />
2.2.3 Consensus functions based on object co-association measures . . . . 37<br />
2.2.4 Consensus functions based on categorical clustering . . . . . . . . . 39<br />
2.2.5 Consensus functions based on probabilistic approaches . . . . . . . . 39<br />
xi
Contents<br />
2.2.6 Consensus functions based on reinforcement learning . . . . . . . . . 40<br />
2.2.7 Consensus functions based on interpreting object similarity as data . 40<br />
2.2.8 Consensus functions based on cluster centroids . . . . . . . . . . . . 41<br />
2.2.9 Consensus functions based on correlation clustering . . . . . . . . . 41<br />
2.2.10 Consensus functions based on search techniques . . . . . . . . . . . . 42<br />
2.2.11 Consensus functions based on cluster ensemble component selection 42<br />
2.2.12 Other interesting works on consensus clustering . . . . . . . . . . . . 43<br />
3 Hierarchical consensus architectures 45<br />
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46<br />
3.2 Random hierarchical consensus architectures . . . . . . . . . . . . . . . . . 49<br />
3.2.1 Rationale and definition . . . . . . . . . . . . . . . . . . . . . . . . . 49<br />
3.2.2 Computational complexity . . . . . . . . . . . . . . . . . . . . . . . . 51<br />
3.2.3 Running time minimization . . . . . . . . . . . . . . . . . . . . . . . 54<br />
3.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55<br />
3.3 Deterministic hierarchical consensus architectures . . . . . . . . . . . . . . . 69<br />
3.3.1 Rationale and definition . . . . . . . . . . . . . . . . . . . . . . . . . 69<br />
3.3.2 Computational complexity . . . . . . . . . . . . . . . . . . . . . . . . 71<br />
3.3.3 Running time minimization . . . . . . . . . . . . . . . . . . . . . . . 72<br />
3.3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75<br />
3.4 Flat vs. hierarchical consensus . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
3.4.1 Running time comparison . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
3.4.2 Consensus quality comparison . . . . . . . . . . . . . . . . . . . . . . 96<br />
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104<br />
3.6 Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107<br />
4 Self-refining consensus architectures 109<br />
4.1 Description of the consensus self-refining procedure . . . . . . . . . . . . . . 110<br />
4.2 Flat vs. hierarchical self-refining . . . . . . . . . . . . . . . . . . . . . . . . 112<br />
4.2.1 Evaluation of the consensus-based self-refining process . . . . . . . . 116<br />
4.2.2 Evaluation of the supraconsensus process . . . . . . . . . . . . . . . 120<br />
4.3 Selection-based self-refining . . . . . . . . . . . . . . . . . . . . . . . . . . . 122<br />
4.3.1 Evaluation of the selection-based self-refining process . . . . . . . . . 124<br />
4.3.2 Evaluation of the supraconsensus process . . . . . . . . . . . . . . . 126<br />
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127<br />
4.5 Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129<br />
xii
Contents<br />
5 Multimedia clustering based on cluster ensembles 133<br />
5.1 Generation of multimodal cluster ensembles . . . . . . . . . . . . . . . . . . 134<br />
5.2 Self-refining multimodal consensus architecture . . . . . . . . . . . . . . . . 136<br />
5.3 Multimodal consensus clustering results . . . . . . . . . . . . . . . . . . . . 138<br />
5.3.1 Consensus clustering per modality and across modalities . . . . . . . 141<br />
5.3.2 Self-refined consensus clustering across modalities . . . . . . . . . . . 152<br />
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160<br />
5.5 Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161<br />
6 Voting based consensus functions for soft cluster ensembles 163<br />
6.1 Soft cluster ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165<br />
6.2 Adapting consensus functions to soft cluster ensembles . . . . . . . . . . . . 166<br />
6.3 Voting based consensus functions . . . . . . . . . . . . . . . . . . . . . . . . 171<br />
6.3.1 Cluster disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . 172<br />
6.3.2 Voting strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176<br />
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184<br />
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189<br />
6.6 Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191<br />
7 Conclusions 193<br />
7.1 Hierarchical consensus architectures . . . . . . . . . . . . . . . . . . . . . . 195<br />
7.2 Consensus self-refining procedures . . . . . . . . . . . . . . . . . . . . . . . 197<br />
7.3 Multimedia clustering based on cluster ensembles . . . . . . . . . . . . . . . 198<br />
7.4 Voting based soft consensus functions . . . . . . . . . . . . . . . . . . . . . 200<br />
References 202<br />
Appendices 216<br />
A Experimental setup 217<br />
A.1 The CLUTO clustering package . . . . . . . . . . . . . . . . . . . . . . . . . 217<br />
A.2 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219<br />
A.2.1 Unimodal data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221<br />
A.2.2 Multimodal data sets . . . . . . . . . . . . . . . . . . . . . . . . . . 223<br />
A.3 Data representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224<br />
A.3.1 Unimodal data representations . . . . . . . . . . . . . . . . . . . . . 224<br />
A.3.2 Multimodal data representations . . . . . . . . . . . . . . . . . . . . 227<br />
A.4 Cluster ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227<br />
xiii
Contents<br />
A.5 Consensus functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229<br />
A.6 Computational resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230<br />
B Experiments on clustering indeterminacies 233<br />
B.1 Clustering indeterminacies in unimodal data sets . . . . . . . . . . . . . . . 233<br />
B.1.1 Zoo data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234<br />
B.1.2 Iris data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235<br />
B.1.3 Wine data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236<br />
B.1.4 Glass data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236<br />
B.1.5 Ionosphere data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237<br />
B.1.6 WDBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237<br />
B.1.7 Balance data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238<br />
B.1.8 MFeat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238<br />
B.1.9 miniNG data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238<br />
B.1.10 Segmentation data set . . . . . . . . . . . . . . . . . . . . . . . . . . 238<br />
B.1.11 BBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239<br />
B.1.12 PenDigits data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240<br />
B.1.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241<br />
B.2 Clustering indeterminacies in multimodal data sets . . . . . . . . . . . . . . 241<br />
B.2.1 CAL500 data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243<br />
B.2.2 Corel data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243<br />
B.2.3 InternetAds data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 244<br />
B.2.4 IsoLetters data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245<br />
B.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246<br />
C Experiments on hierarchical consensus architectures 249<br />
C.1 Configuration of a random hierarchical consensus architecture . . . . . . . . 249<br />
C.2 Estimation of the computationally optimal RHCA . . . . . . . . . . . . . . 252<br />
C.2.1 Iris data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253<br />
C.2.2 Wine data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253<br />
C.2.3 Glass data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256<br />
C.2.4 Ionosphere data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256<br />
C.2.5 WDBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261<br />
C.2.6 Balance data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261<br />
C.2.7 MFeat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266<br />
C.3 Estimation of the computationally optimal DHCA . . . . . . . . . . . . . . 271<br />
C.3.1 Iris data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272<br />
xiv
Contents<br />
C.3.2 Wine data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273<br />
C.3.3 Glass data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273<br />
C.3.4 Ionosphere data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278<br />
C.3.5 WDBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278<br />
C.3.6 Balance data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283<br />
C.3.7 MFeat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283<br />
C.4 Computationally optimal RHCA, DHCA and flat consensus comparison . . 290<br />
C.4.1 Iris data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290<br />
C.4.2 Wine data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291<br />
C.4.3 Glass data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295<br />
C.4.4 Ionosphere data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301<br />
C.4.5 WDBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306<br />
C.4.6 Balance data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306<br />
C.4.7 MFeat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314<br />
C.4.8 miniNG data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314<br />
C.4.9 Segmentation data set . . . . . . . . . . . . . . . . . . . . . . . . . . 318<br />
C.4.10 BBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324<br />
C.4.11 PenDigits data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324<br />
D Experiments on self-refining consensus architectures 333<br />
D.1 Experiments on consensus-based self-refining . . . . . . . . . . . . . . . . . 333<br />
D.1.1 Iris data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334<br />
D.1.2 Wine data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334<br />
D.1.3 Glass data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334<br />
D.1.4 Ionosphere data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337<br />
D.1.5 WDBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337<br />
D.1.6 Balance data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337<br />
D.1.7 MFeat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342<br />
D.1.8 miniNG data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342<br />
D.1.9 Segmentation data set . . . . . . . . . . . . . . . . . . . . . . . . . . 342<br />
D.1.10 BBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342<br />
D.1.11 PenDigits data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346<br />
D.2 Experiments on selection-based self-refining . . . . . . . . . . . . . . . . . . 348<br />
D.2.1 Iris data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348<br />
D.2.2 Wine data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349<br />
D.2.3 Glass data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349<br />
D.2.4 Ionosphere data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350<br />
xv
Contents<br />
D.2.5 WDBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351<br />
D.2.6 Balance data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351<br />
D.2.7 MFeat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352<br />
D.2.8 miniNG data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353<br />
D.2.9 Segmentation data set . . . . . . . . . . . . . . . . . . . . . . . . . . 353<br />
D.2.10 BBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354<br />
D.2.11 PenDigits data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355<br />
E Experiments on multimodal consensus clustering 359<br />
E.1 CAL500 data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359<br />
E.1.1 Consensus clustering per modality and across modalities . . . . . . . 360<br />
E.1.2 Self-refined consensus clustering across modalities . . . . . . . . . . . 362<br />
E.2 InternetAds data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364<br />
E.2.1 Consensus clustering per modality and across modalities . . . . . . . 364<br />
E.2.2 Self-refined consensus clustering across modalities . . . . . . . . . . . 365<br />
E.3 Corel data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366<br />
E.3.1 Consensus clustering per modality and across modalities . . . . . . . 366<br />
E.3.2 Self-refined consensus clustering across modalities . . . . . . . . . . . 369<br />
F Experiments on soft consensus clustering 373<br />
F.1 Iris data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374<br />
F.2 Wine data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374<br />
F.3 Glass data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375<br />
F.4 Ionosphere data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377<br />
F.5 WDBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377<br />
F.6 Balance data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379<br />
F.7 MFeat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380<br />
F.8 miniNG data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381<br />
F.9 Segmentation data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382<br />
F.10 BBC data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384<br />
F.11 PenDigits data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385<br />
List of Tables<br />
1.1 Illustration of the clustering algorithm indeterminacy on the BBC and PenDigits<br />
data sets clustered by the direct-cos-i2 and graph-cos-i2 algorithms . . . 23<br />
1.2 Illustration of the clustering indeterminacies on the CAL500, Corel, InternetAds<br />
and IsoLetters multimodal data sets . . . . . . . . . . . . . . . . . . . 24<br />
2.1 Taxonomy of consensus functions according to their theoretical basis . . . . 34<br />
3.1 Number of inner loop iterations as a function of the outer loop's index i . 53<br />
3.2 Methodology for estimating the running time of multiple RHCA variants . 56<br />
3.3 Evaluation of the minimum complexity RHCA variant estimation methodology<br />
in terms of the percentage of correct predictions and running time<br />
penalizations resulting from mistaken predictions . . . . . . . . . . . . . . . 63<br />
3.4 Computationally optimal consensus architectures (flat or RHCA) on the unimodal<br />
data collections assuming a fully serial implementation . . . . . . . . 67<br />
3.5 Computationally optimal consensus architectures (flat or RHCA) on the unimodal<br />
data collections assuming a fully parallel implementation . . . . . . . 68<br />
3.6 Methodology for estimating the running time of multiple DHCA variants . 74<br />
3.7 Evaluation of the minimum complexity DHCA variant estimation methodology<br />
in terms of the percentage of correct predictions and running time<br />
penalizations resulting from mistaken predictions . . . . . . . . . . . . . . . 81<br />
3.8 Evaluation of the minimum complexity serial DHCA variant prediction based<br />
on decreasing diversity factor ordering in terms of the percentage of correct<br />
predictions and running time penalizations resulting from mistaken predictions 83<br />
3.9 Running time differences between the most and least computationally efficient<br />
DHCA variants in both the serial and parallel implementations . . . . 84<br />
3.10 Computationally optimal consensus architectures (flat or DHCA) on the unimodal<br />
data collections assuming a fully serial implementation . . . . . . . . 86<br />
3.11 Computationally optimal consensus architectures (flat or DHCA) on the unimodal<br />
data collections assuming a fully parallel implementation . . . . . . . 87<br />
3.12 Percentage of experiments in which the consensus clustering solution is better<br />
than the median cluster ensemble component . . . . . . . . . . . . . . . . . 104<br />
xvii
List of Tables<br />
3.13 Relative percentage φ (NMI) gain between the consensus clustering solution<br />
and the median cluster ensemble component . . . . . . . . . . . . . . . . . . 104<br />
3.14 Percentage of experiments in which the consensus clustering solution is better<br />
than the best cluster ensemble component . . . . . . . . . . . . . . . . . . . 105<br />
3.15 Relative percentage φ (NMI) gain between the consensus clustering solution<br />
and the best cluster ensemble component . . . . . . . . . . . . . . . . . . . 105<br />
4.1 Methodology of the consensus self-refining procedure . . . . . . . . . . . . . 112<br />
4.2 Percentage of self-refining experiments in which one of the self-refined consensus<br />
clustering solutions is better than its non-refined counterpart . . . . 117<br />
4.3 Relative φ (NMI) gain percentage between the top quality self-refined consensus<br />
clustering solutions and their non-refined counterparts . . . . . . . . . . . . 117<br />
4.4 Percentage of experiments in which the best (non-refined or self-refined) consensus<br />
clustering solution is better than the best cluster ensemble component 118<br />
4.5 Relative percentage φ (NMI) gain between the best (non-refined or self-refined)<br />
consensus clustering solution and the best cluster ensemble component . . . 118<br />
4.6 Percentage of experiments in which the best (non-refined or self-refined) consensus<br />
clustering solution is better than the median cluster ensemble component 119<br />
4.7 Relative percentage φ (NMI) gain between the best (non-refined or self-refined)<br />
consensus clustering solution and the median cluster ensemble component . 119<br />
4.8 φ (NMI) variance of the non-refined and the best non/self-refined consensus<br />
clustering solutions across the flat, RHCA and DHCA consensus architectures 120<br />
4.9 Percentage of experiments in which the supraconsensus function selects the<br />
top quality consensus clustering solution . . . . . . . . . . . . . . . . . . . . 120<br />
4.10 Relative percentage φ (NMI) losses due to suboptimal self-refined consensus<br />
clustering solution selection by supraconsensus . . . . . . . . . . . . . . . . 121<br />
4.11 Methodology of the cluster ensemble component selection-based consensus<br />
self-refining procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123<br />
4.12 Percentage of self-refining experiments in which one of the self-refined consensus<br />
clustering solutions is better than the selected cluster ensemble component<br />
reference λref . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
4.13 Relative φ (NMI) gain percentage between the top quality self-refined consensus<br />
clustering solutions with respect to the maximum φ (ANMI) cluster ensemble<br />
component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
4.14 Percentage of experiments where either the top quality self-refined consensus<br />
clustering solution or λref outperforms the best cluster ensemble component,<br />
and relative φ (NMI) gain percentage with respect to it . . . . . . . . . . . . . 126<br />
4.15 Percentage of experiments where either the top quality self-refined consensus<br />
clustering solution or λref outperforms the median cluster ensemble component,<br />
and relative φ (NMI) gain percentage with respect to it . . . . . . . . . . . . . 126<br />
4.16 Percentage of experiments in which the supraconsensus function selects the<br />
top quality clustering solution, and relative percentage φ (NMI) losses between<br />
the top quality clustering solution and the one selected by supraconsensus,<br />
averaged across the twelve data collections . . . . . . . . . . . . . . . . . . . 127<br />
5.1 Range and cardinality of the dimensional diversity factor dfD per modality<br />
for each one of the four multimedia data sets . . . . . . . . . . . . . . . . . 136<br />
5.2 Percentage of cluster ensemble components that attain a higher φ (NMI) than<br />
the unimodal and multimodal consensus clusterings, across the four multimedia<br />
data collections and the seven consensus functions . . . . . . . . . . . 145<br />
5.3 Evaluation of the unimodal and multimodal consensus clusterings with respect<br />
to the best cluster ensemble component, across the four multimedia<br />
data collections and the seven consensus functions . . . . . . . . . . . . . . 148<br />
5.4 Evaluation of the unimodal and multimodal consensus clusterings with respect<br />
to the median cluster ensemble component, across the four multimedia<br />
data collections and the seven consensus functions . . . . . . . . . . . . . . 149<br />
5.5 Evaluation of the multimodal consensus clusterings with respect to their<br />
unimodal counterparts, across the four multimedia data collections and the<br />
seven consensus functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150<br />
5.6 Evaluation of the intermodal consensus clustering with respect to the unimodal<br />
and multimodal consensus clusterings, across the four multimedia data<br />
collections and the seven consensus functions . . . . . . . . . . . . . . . . . 151<br />
5.7 Percentage of multimodal self-refining experiments in which one of the self-refined<br />
consensus clustering solutions is better than its non-refined counterpart 154<br />
5.8 Relative φ (NMI) gain percentage between the top quality self-refined consensus<br />
clustering solutions and their non-refined counterparts . . . . . . . . . . . . 156<br />
5.9 Percentage of the cluster ensemble components that attain a higher φ (NMI)<br />
score than the top quality self-refined consensus clustering solution . . . . . 156<br />
5.10 Percentage of experiments in which the best (either non-refined or self-refined)<br />
consensus clustering solution is better than the best cluster ensemble component<br />
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157<br />
5.11 Relative φ (NMI) percentage difference between the top quality (either non-refined<br />
or self-refined) consensus clustering solution with respect to the best<br />
ensemble component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157<br />
5.12 Percentage of experiments in which the top quality (either non-refined or<br />
self-refined) consensus clustering solution is better than the median cluster<br />
ensemble component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158<br />
5.13 Relative φ (NMI) percentage difference between the top quality (either non-refined<br />
or self-refined) consensus clustering solution with respect to the median<br />
ensemble component . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159<br />
5.14 Percentage of experiments in which the supraconsensus function selects the<br />
top quality consensus clustering solution . . . . . . . . . . . . . . . . . . . . 159<br />
5.15 Relative φ (NMI) percentage differences between the best and median components<br />
of the cluster ensemble and the consensus clustering λ_c^final selected by<br />
supraconsensus, across the four multimedia data collections . . . . . . . . . 160<br />
6.1 Soft cluster ensemble sizes of the unimodal data sets . . . . . . . . . . . . . 186<br />
6.2 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Zoo data set . . . . . . . . . . . . 187<br />
6.3 Percentage of experiments in which the state-of-the-art consensus functions<br />
(CSPA, EAC, HGPA, MCLA and VMA) yield better/equivalent/worse consensus<br />
clustering solutions than the four proposed consensus functions (BC,<br />
CC, PC and SC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188<br />
6.4 Percentage of experiments in which the state-of-the-art consensus functions<br />
(CSPA, EAC, HGPA, MCLA and VMA) are executed faster/equivalent/slower<br />
than the four proposed consensus functions (BC, CC, PC and SC) . . . . . 189<br />
A.1 Cross-option table indicating the clustering strategy-criterion function-similarity<br />
measure combinations available in CLUTO . . . . . . . . . . . . . . . . . . 220<br />
A.2 Summary of the unimodal data sets employed in the experiments . . . . . . 222<br />
A.3 Summary of the multimodal data sets employed in the experiments . . . . . 224<br />
A.4 Cluster ensemble sizes corresponding to distinct algorithmic diversity configurations<br />
for the unimodal data sets . . . . . . . . . . . . . . . . . . . . . . 228<br />
A.5 Cluster ensemble sizes corresponding to distinct algorithmic diversity configurations<br />
for the multimodal data sets . . . . . . . . . . . . . . . . . . . . . 229<br />
B.1 Number of individual clusterings per data representation on each unimodal<br />
data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234<br />
B.2 Top clustering results of each clustering algorithm family sorted from highest<br />
to lowest φ (NMI) on the unimodal collections . . . . . . . . . . . . . . . . . . 242<br />
B.3 Number of individual clusterings per data representation on each multimodal<br />
data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243<br />
B.4 Top clustering results of each clustering algorithm family sorted from highest<br />
to lowest φ (NMI) on the multimodal collections . . . . . . . . . . . . . . . . 248<br />
C.1 Examples of computation of the number of stages s of a RHCA with l =<br />
7, 8 and 9 and b = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250<br />
C.2 Examples of computation of the number of consensus per stage (Ki) of a<br />
RHCA with l = 7, 8 and 9 and b = 2 . . . . . . . . . . . . . . . . . . . . . . 250<br />
C.3 Examples of computation of the mini-ensembles size of a RHCA with l =<br />
7, 8 and 9 and b = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251<br />
C.4 Configuration of RHCA topologies on a cluster ensemble of size l = 30 with<br />
varying mini-ensembles sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 251<br />
F.1 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Iris data set . . . . . . . . . . . . 375<br />
F.2 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Wine data set . . . . . . . . . . . 376<br />
F.3 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Glass data set . . . . . . . . . . . 377<br />
F.4 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Ionosphere data set . . . . . . . . 378<br />
F.5 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the WDBC data set . . . . . . . . . . 379<br />
F.6 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Balance data set . . . . . . . . . . 380<br />
F.7 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the MFeat data set . . . . . . . . . . . 381<br />
F.8 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the miniNG data set . . . . . . . . . . 382<br />
F.9 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Segmentation data set . . . . . . . . . 383<br />
F.10 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the BBC data set . . . . . . . . . . . . . . 384<br />
F.11 Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the PenDigits data set . . . . . . . . . 386<br />
List of Figures<br />
1.1 Evolution of the total number of websites across all Internet domains, from<br />
November 1995 to February 2009 . . . . . . . . . . . . . . . . . . . . . . . . 2<br />
1.2 Schematic diagram of the steps involved in a knowledge discovery process . 4<br />
1.3 Taxonomy of data mining methods . . . . . . . . . . . . . . . . . . . . . . . 6<br />
1.4 Toy example of a hierarchical clustering dendrogram . . . . . . . . . . . . . 10<br />
1.5 Illustration of the data representation indeterminacy on the Wine and miniNG<br />
data sets clustered by the rbr-corr-e1 algorithm. . . . . . . . . . . . . . 21<br />
1.6 Block diagram of the robust multimodal clustering system based on self-refining<br />
hierarchical consensus architectures . . . . . . . . . . . . . . . . . . 26<br />
2.1 Scatterplot of an artificially generated two-dimensional toy data set containing<br />
n = 9 objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />
2.2 Schematic representation of a consensus clustering process on a hard cluster<br />
ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />
3.1 Flat vs hierarchical construction of a consensus clustering solution on a hard<br />
cluster ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
3.2 Three examples of topologies of random hierarchical consensus architectures 52<br />
3.3 Evolution of RHCA parameters as a function of the mini-ensembles size b . 54<br />
3.4 Estimated and real running times of the serial and parallel RHCA implementations<br />
on the Zoo data collection in the |dfA| = 1 diversity scenario . . . . 58<br />
3.5 Estimated and real running times of the serial and parallel RHCA implementations<br />
on the Zoo data collection in the |dfA| = 10 diversity scenario . . . . 59<br />
3.6 Estimated and real running times of the serial and parallel RHCA implementations<br />
on the Zoo data collection in the |dfA| = 19 diversity scenario . . . . 61<br />
3.7 Estimated and real running times of the serial and parallel RHCA implementations<br />
on the Zoo data collection in the |dfA| = 28 diversity scenario . . . . 62<br />
3.8 Evolution of the accuracy of RHCA running time estimation as a function of<br />
the number of consensus processes . . . . . . . . . . . . . . . . . . . . . . . 65<br />
3.9 An example of a deterministic hierarchical consensus architecture . . . . . . 71<br />
3.10 Estimated and real running times of the serial and parallel DHCA implementations<br />
on the Zoo data collection in the |dfA| = 1 diversity scenario . . . . 77<br />
3.11 Estimated and real running times of the serial and parallel DHCA implementations<br />
on the Zoo data collection in the |dfA| = 10 diversity scenario . . . . 78<br />
3.12 Estimated and real running times of the serial and parallel DHCA implementations<br />
on the Zoo data collection in the |dfA| = 19 diversity scenario . . . . 79<br />
3.13 Estimated and real running times of the serial and parallel DHCA implementations<br />
on the Zoo data collection in the |dfA| = 28 diversity scenario . . . . 80<br />
3.14 Evolution of the accuracy of DHCA running time estimation as a function of<br />
the number of consensus processes . . . . . . . . . . . . . . . . . . . . . . . 82<br />
3.15 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />
architectures on the Zoo data collection for the diversity scenario<br />
corresponding to a cluster ensemble of size l = 57 . . . . . . . . . . . . . . . 89<br />
3.16 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />
architectures on the Zoo data collection for the diversity scenario<br />
corresponding to a cluster ensemble of size l = 570 . . . . . . . . . . . . . . 91<br />
3.17 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />
architectures on the Zoo data collection for the diversity scenario<br />
corresponding to a cluster ensemble of size l = 1083 . . . . . . . . . . . . . 92<br />
3.18 Running times of the computationally optimal RHCA, DHCA and flat consensus<br />
architectures on the Zoo data collection for the diversity scenario<br />
corresponding to a cluster ensemble of size l = 1596 . . . . . . . . . . . . . 93<br />
3.19 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures across all data collections for all the diversity scenarios 95<br />
3.20 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures across all data collections for all the diversity<br />
scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97<br />
3.21 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />
for the diversity scenario corresponding to a cluster ensemble of size l = 57 . 99<br />
3.22 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />
for the diversity scenario corresponding to a cluster ensemble of size l = 570 100<br />
3.23 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />
for the diversity scenario corresponding to a cluster ensemble of size l = 1083 100<br />
3.24 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection<br />
for the diversity scenario corresponding to a cluster ensemble of size l = 1596 101<br />
3.25 φ (NMI) of the consensus solutions obtained by the computationally optimal<br />
parallel RHCA, DHCA and flat consensus architectures across all data collections<br />
for all the diversity scenarios . . . . . . . . . . . . . . . . . . . . . . 103<br />
4.1 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Zoo<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
4.2 Decreasingly ordered φ (NMI) (wrt ground truth) values of the 300 clusterings<br />
included in the toy cluster ensemble (left), and their corresponding φ (ANMI)<br />
values (wrt the toy cluster ensemble) (right) . . . . . . . . . . . . . . . . . . 121<br />
4.3 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the Zoo data collection . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />
5.1 Block diagram of the proposed multimodal consensus clustering system . . 134<br />
5.2 An example of a deterministic hierarchical consensus architecture DRM variant 139<br />
5.3 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the agglo-cos-upgma algorithm on the IsoLetters data<br />
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142<br />
5.4 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the direct-cos-i2 algorithm on the IsoLetters data set 143<br />
5.5 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the graph-cos-i2 algorithm on the IsoLetters data set 143<br />
5.6 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the rb-cos-i2 algorithm on the IsoLetters data set . . 144<br />
5.7 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the agglo-cos-upgma algorithm on the IsoLetters data set . . . . . . . 153<br />
5.8 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the direct-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . 153<br />
5.9 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the graph-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . 154<br />
5.10 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the rb-cos-i2 algorithm on the IsoLetters data set . . . . . . . . . . . 155<br />
6.1 Scatterplot of an artificially generated two-dimensional data set containing<br />
n = 9 objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164<br />
6.2 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Zoo data collection . . . . . . . . . . . . . . . . . . . . 186<br />
B.1 φ (NMI) histograms on the Zoo data set . . . . . . . . . . . . . . . . . . . . . 234<br />
B.2 φ (NMI) histograms on the Iris data set . . . . . . . . . . . . . . . . . . . . . 235<br />
B.3 φ (NMI) histograms on the Wine data set . . . . . . . . . . . . . . . . . . . . 236<br />
B.4 φ (NMI) histograms on the Glass data set . . . . . . . . . . . . . . . . . . . . 236<br />
B.5 φ (NMI) histograms on the Ionosphere data set . . . . . . . . . . . . . . . . . 237<br />
B.6 φ (NMI) histograms on the WDBC data set . . . . . . . . . . . . . . . . . . . 237<br />
B.7 φ (NMI) histograms on the Balance data set . . . . . . . . . . . . . . . . . . . 238<br />
B.8 φ (NMI) histograms on the MFeat data set . . . . . . . . . . . . . . . . . . . 239<br />
B.9 φ (NMI) histograms on the miniNG data set . . . . . . . . . . . . . . . . . . . 239<br />
B.10 φ (NMI) histograms on the Segmentation data set . . . . . . . . . . . . . . . . 240<br />
B.11 φ (NMI) histograms on the BBC data set . . . . . . . . . . . . . . . . . . . . 240<br />
B.12 φ (NMI) histograms on the PenDigits data set . . . . . . . . . . . . . . . . . . 240<br />
B.13 φ (NMI) histograms on the CAL500 data set . . . . . . . . . . . . . . . . . . 244<br />
B.14 φ (NMI) histograms on the Corel data set . . . . . . . . . . . . . . . . . . . . 245<br />
B.15 φ (NMI) histograms on the InternetAds data set . . . . . . . . . . . . . . . . 246<br />
B.16 φ (NMI) histograms on the IsoLetters data set . . . . . . . . . . . . . . . . . . 247<br />
C.1 Estimated and real running times of the serial RHCA implementation on the<br />
Iris data collection in the four diversity scenarios . . . . . . . . . . . . . . . 254<br />
C.2 Estimated and real running times of the parallel RHCA implementation on<br />
the Iris data collection in the four diversity scenarios . . . . . . . . . . . . . 255<br />
C.3 Estimated and real running times of the serial RHCA implementation on the<br />
Wine data collection in the four diversity scenarios . . . . . . . . . . . . . . 257<br />
C.4 Estimated and real running times of the parallel RHCA implementation on<br />
the Wine data collection in the four diversity scenarios . . . . . . . . . . . . 258<br />
C.5 Estimated and real running times of the serial RHCA implementation on the<br />
Glass data collection in the four diversity scenarios . . . . . . . . . . . . . . 259<br />
C.6 Estimated and real running times of the parallel RHCA implementation on<br />
the Glass data collection in the four diversity scenarios . . . . . . . . . . . . 260<br />
C.7 Estimated and real running times of the serial RHCA implementation on the<br />
Ionosphere data collection in the four diversity scenarios . . . . . . . . . . . 262<br />
C.8 Estimated and real running times of the parallel RHCA implementation on<br />
the Ionosphere data collection in the four diversity scenarios . . . . . . . . . 263<br />
C.9 Estimated and real running times of the serial RHCA implementation on the<br />
WDBC data collection in the four diversity scenarios . . . . . . . . . . . . . 264<br />
C.10 Estimated and real running times of the parallel RHCA implementation on<br />
the WDBC data collection in the four diversity scenarios . . . . . . . . . . . 265<br />
C.11 Estimated and real running times of the serial RHCA implementation on the<br />
Balance data collection in the four diversity scenarios . . . . . . . . . . . . 267<br />
C.12 Estimated and real running times of the parallel RHCA implementation on<br />
the Balance data collection in the four diversity scenarios . . . . . . . . . . 268<br />
C.13 Estimated and real running times of the serial RHCA implementation on the<br />
Mfeat data collection in the four diversity scenarios . . . . . . . . . . . . . . 269<br />
C.14 Estimated and real running times of the parallel RHCA implementation on<br />
the Mfeat data collection in the four diversity scenarios . . . . . . . . . . . 270<br />
C.15 Estimated and real running times of the serial DHCA implementation on the<br />
Iris data collection in the four diversity scenarios . . . . . . . . . . . . . . . 274<br />
C.16 Estimated and real running times of the parallel DHCA implementation on<br />
the Iris data collection in the four diversity scenarios . . . . . . . . . . . . . 275<br />
C.17 Estimated and real running times of the serial DHCA implementation on the<br />
Wine data collection in the four diversity scenarios . . . . . . . . . . . . . . 276<br />
C.18 Estimated and real running times of the parallel DHCA implementation on<br />
the Wine data collection in the four diversity scenarios . . . . . . . . . . . . 277<br />
C.19 Estimated and real running times of the serial DHCA implementation on the<br />
Glass data collection in the four diversity scenarios . . . . . . . . . . . . . . 279<br />
C.20 Estimated and real running times of the parallel DHCA implementation on<br />
the Glass data collection in the four diversity scenarios . . . . . . . . . . . . 280<br />
C.21 Estimated and real running times of the serial DHCA implementation on the<br />
Ionosphere data collection in the four diversity scenarios . . . . . . . . . . . 281<br />
C.22 Estimated and real running times of the parallel DHCA implementation on<br />
the Ionosphere data collection in the four diversity scenarios . . . . . . . . . 282<br />
C.23 Estimated and real running times of the serial DHCA implementation on the<br />
WDBC data collection in the four diversity scenarios . . . . . . . . . . . . . 284<br />
C.24 Estimated and real running times of the parallel DHCA implementation on<br />
the WDBC data collection in the four diversity scenarios . . . . . . . . . . . 285<br />
C.25 Estimated and real running times of the serial DHCA implementation on the<br />
Balance data collection in the four diversity scenarios . . . . . . . . . . . . 286<br />
C.26 Estimated and real running times of the parallel DHCA implementation on<br />
the Balance data collection in the four diversity scenarios . . . . . . . . . . 287<br />
C.27 Estimated and real running times of the serial DHCA implementation on the<br />
Mfeat data collection in the four diversity scenarios . . . . . . . . . . . . . . 288<br />
C.28 Estimated and real running times of the parallel DHCA implementation on<br />
the Mfeat data collection in the four diversity scenarios . . . . . . . . . . . 289<br />
C.29 Running times of the computationally optimal serial RHCA, DHCA and<br />
flat consensus architectures on the Iris data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 292<br />
C.30 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Iris data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 293<br />
C.31 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Iris data collection in<br />
the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . 294<br />
C.32 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the Wine data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 296<br />
C.33 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Wine data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 297<br />
C.34 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Wine data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 298<br />
C.35 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the Glass data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 299<br />
C.36 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Glass data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 300<br />
C.37 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Glass data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 302<br />
C.38 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the Ionosphere data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 303<br />
C.39 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Ionosphere data collection in the four<br />
diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 304<br />
C.40 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Ionosphere data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . 305<br />
C.41 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the WDBC data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 307<br />
C.42 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the WDBC data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 308<br />
C.43 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the WDBC data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 309<br />
C.44 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the Balance data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 311<br />
C.45 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Balance data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . 312<br />
C.46 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Balance data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . 313<br />
C.47 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the MFeat data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 315<br />
C.48 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the MFeat data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 316<br />
xxviii
List of Figures<br />
C.49 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the MFeat data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 317<br />
C.50 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the miniNG data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 319<br />
C.51 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the miniNG data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . 320<br />
C.52 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the miniNG data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . 321<br />
C.53 Running times of the computationally optimal serial RHCA, DHCA and<br />
flat consensus architectures on the Segmentation data collection in the four<br />
diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 322<br />
C.54 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Segmentation data collection in the four<br />
diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . 323<br />
C.55 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Segmentation data<br />
collection in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . 325<br />
C.56 Running times of the computationally optimal serial RHCA, DHCA and<br />
flat consensus architectures on the BBC data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 326<br />
C.57 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the BBC data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 327<br />
C.58 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the BBC data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . 328<br />
C.59 Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the PenDigits data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . . . . . 330<br />
C.60 Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the PenDigits data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . . . . . . . . . . . . 331<br />
C.61 φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the PenDigits data collection<br />
in the four diversity scenarios |dfA| = {1, 10, 19, 28} . . . . . . . . . . 332<br />
D.1 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Iris<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335<br />
D.2 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Wine<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336<br />
D.3 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Glass<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338<br />
D.4 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Ionosphere<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339<br />
D.5 φ (NMI) boxplots of the self-refined consensus clustering solutions on the WDBC<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340<br />
D.6 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Balance<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341<br />
D.7 φ (NMI) boxplots of the self-refined consensus clustering solutions on the MFeat<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343<br />
D.8 φ (NMI) boxplots of the self-refined consensus clustering solutions on the miniNG<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344<br />
D.9 φ (NMI) boxplots of the self-refined consensus clustering solutions on the Segmentation<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345<br />
D.10 φ (NMI) boxplots of the self-refined consensus clustering solutions on the BBC<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347<br />
D.11 φ (NMI) boxplots of the self-refined consensus clustering solutions on the PenDigits<br />
data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348<br />
D.12 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the Iris data collection . . . . . . . . . . . . . . . . . . . . . . . . . 349<br />
D.13 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the Wine data collection . . . . . . . . . . . . . . . . . . . . . . . . 350<br />
D.14 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the Glass data collection . . . . . . . . . . . . . . . . . . . . . . . . 351<br />
D.15 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the Ionosphere data collection . . . . . . . . . . . . . . . . . . . . . 352<br />
D.16 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the WDBC data collection . . . . . . . . . . . . . . . . . . . . . . . 353<br />
D.17 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the Balance data collection . . . . . . . . . . . . . . . . . . . . . . 354<br />
D.18 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the MFeat data collection . . . . . . . . . . . . . . . . . . . . . . . 355<br />
D.19 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the miniNG data collection . . . . . . . . . . . . . . . . . . . . . . 356<br />
D.20 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the Segmentation data collection . . . . . . . . . . . . . . . . . . . 357<br />
D.21 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the BBC data collection . . . . . . . . . . . . . . . . . . . . . . . . 358<br />
D.22 φ (NMI) boxplots of the selection-based self-refined consensus clustering solutions<br />
on the PenDigits data collection . . . . . . . . . . . . . . . . . . . . . 358<br />
E.1 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the agglo-cos-upgma algorithm on the CAL500 data<br />
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360<br />
E.2 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the direct-cos-i2 algorithm on the CAL500 data set . 361<br />
E.3 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the graph-cos-i2 algorithm on the CAL500 data set . 361<br />
E.4 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the rb-cos-i2 algorithm on the CAL500 data set . . . 361<br />
E.5 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the agglo-cos-upgma algorithm on the CAL500 data set . . . . . . . . 362<br />
E.6 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the direct-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . 363<br />
E.7 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the graph-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . 363<br />
E.8 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the rb-cos-i2 algorithm on the CAL500 data set . . . . . . . . . . . . 364<br />
E.9 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the agglo-cos-upgma algorithm on the InternetAds data<br />
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365<br />
E.10 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the direct-cos-i2 algorithm on the InternetAds data<br />
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365<br />
E.11 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the graph-cos-i2 algorithm on the InternetAds data<br />
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366<br />
E.12 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the rb-cos-i2 algorithm on the InternetAds data set . 366<br />
E.13 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the agglo-cos-upgma algorithm on the InternetAds data set . . . . . . 367<br />
E.14 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the direct-cos-i2 algorithm on the InternetAds data set . . . . . . . . 367<br />
E.15 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the graph-cos-i2 algorithm on the InternetAds data set . . . . . . . . 368<br />
E.16 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the rb-cos-i2 algorithm on the InternetAds data set . . . . . . . . . . 368<br />
E.17 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the agglo-cos-upgma algorithm on the Corel data set 369<br />
E.18 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the direct-cos-i2 algorithm on the Corel data set . . . 369<br />
E.19 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the graph-cos-i2 algorithm on the Corel data set . . . 369<br />
E.20 φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the rb-cos-i2 algorithm on the Corel data set . . . . . 370<br />
E.21 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the agglo-cos-upgma algorithm on the Corel data set . . . . . . . . . . 370<br />
E.22 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the direct-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . 371<br />
E.23 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the graph-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . 371<br />
E.24 φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the rb-cos-i2 algorithm on the Corel data set . . . . . . . . . . . . . . 372<br />
F.1 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Iris data collection . . . . . . . . . . . . . . . . . . . . 374<br />
F.2 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Wine data collection . . . . . . . . . . . . . . . . . . . 375<br />
F.3 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Glass data collection . . . . . . . . . . . . . . . . . . . 376<br />
F.4 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Ionosphere data collection . . . . . . . . . . . . . . . . 378<br />
F.5 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the WDBC data collection . . . . . . . . . . . . . . . . . . 379<br />
F.6 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Balance data collection . . . . . . . . . . . . . . . . . . 380<br />
F.7 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the MFeat data collection . . . . . . . . . . . . . . . . . . . 381<br />
F.8 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the miniNG data collection . . . . . . . . . . . . . . . . . . 382<br />
F.9 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Segmentation data collection . . . . . . . . . . . . . . . 383<br />
F.10 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the BBC data collection . . . . . . . . . . . . . . . . . . . . 384<br />
F.11 φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the PenDigits data collection . . . . . . . . . . . . . . . . . 386<br />
List of Algorithms<br />
6.1 Symbolic description of the soft consensus function SumConsensus . . . . . . 178<br />
6.2 Symbolic description of the soft consensus function ProductConsensus . . . . 180<br />
6.3 Symbolic description of the soft consensus function BordaConsensus . . . . . 182<br />
6.4 Symbolic description of the soft consensus function CondorcetConsensus . . 183<br />
List of symbols<br />
A: set of algorithms used for creating a cluster ensemble<br />
BE: Borda voting score matrix related to a cluster ensemble E<br />
CE: Condorcet voting score matrix related to a cluster ensemble E<br />
b: size of the mini-ensembles of a hierarchical consensus architecture<br />
c: number of executions of a consensus function in the running time estimation process<br />
Cλ: cluster co-association matrix of a clustering λ<br />
d: number of attributes used for representing an object<br />
D: set of object representation dimensionalities used for creating a cluster ensemble<br />
dfi: ith diversity factor employed in the generation of a cluster ensemble<br />
DHCA: deterministic hierarchical consensus architecture<br />
E: (hard or soft) cluster ensemble<br />
f: number of diversity factors employed in the generation of a cluster ensemble<br />
F: consensus function<br />
γ: ground truth vector<br />
HCA: hierarchical consensus architecture<br />
Iλ: incidence matrix of a hard clustering solution λ<br />
k: number of clusters<br />
Ki: number of consensus processes executed in the ith stage of a hierarchical consensus<br />
architecture<br />
l: number of clusterings contained in a cluster ensemble<br />
λ: label vector resulting from a hard clustering process<br />
Λ: clustering matrix resulting from a soft clustering process<br />
λc: label vector resulting from a hard consensus clustering process<br />
Λc: clustering matrix resulting from a soft consensus clustering process<br />
m: number of modalities of a multimodal data set<br />
n: number of objects in a data set<br />
OEλ: object co-association matrix of a hard cluster ensemble<br />
OEΛ: object co-association matrix of a soft cluster ensemble<br />
Oλ: object co-association matrix of a hard clustering solution λ<br />
OΛ: object co-association matrix of a soft clustering solution Λ<br />
πλ1,λ2: cluster correspondence vector between the hard clustering solutions λ1 and λ2<br />
πΛ1,Λ2: cluster correspondence vector between the soft clustering solutions Λ1 and Λ2<br />
Pλ1,λ2: cluster permutation matrix between the hard clustering solutions λ1 and λ2<br />
PΛ1,Λ2: cluster permutation matrix between the soft clustering solutions Λ1 and Λ2<br />
ΠE: product rule voting score matrix related to a cluster ensemble E<br />
φ (ANMI): average normalized mutual information<br />
φ (NMI): normalized mutual information<br />
PERTDHCA: estimated running time of a parallel DHCA<br />
PRTDHCA: real running time of a parallel DHCA<br />
PERTRHCA: estimated running time of a parallel RHCA<br />
PRTRHCA: real running time of a parallel RHCA<br />
r: number of attributes of an object after a dimensionality reduction process<br />
R: set of object representations used for creating a cluster ensemble<br />
RHCA: random hierarchical consensus architecture<br />
s: number of stages of a hierarchical consensus architecture<br />
Sλ1,λ2: cluster similarity matrix between the hard clustering solutions λ1 and λ2<br />
SΛ1,Λ2: cluster similarity matrix between the soft clustering solutions Λ1 and Λ2<br />
SERTDHCA: estimated running time of a serial DHCA<br />
SRTDHCA: real running time of a serial DHCA<br />
SERTRHCA: estimated running time of a serial RHCA<br />
SRTRHCA: real running time of a serial RHCA<br />
ΣE: sum rule voting score matrix related to a cluster ensemble E<br />
w: power proportionality factor of the time complexity of a consensus function<br />
xi: d-dimensional column vector denoting the ith object contained in a data set<br />
X: d × n matrix denoting a data set<br />
Chapter 1<br />
Framework of the thesis<br />
Information and communications technologies (ICT) play a key role in the construction<br />
of the so-called global knowledge society. In fact, the increasingly rapid development of<br />
ICT and the democratization of their use have facilitated information generation and access<br />
–a basic component of knowledge acquisition– for large segments of the population.<br />
Possibly, the most paradigmatic example of this evolution is the World Wide Web<br />
(WWW, or “the Web” for short), which offers almost universal access to information to<br />
over 1,500 million users worldwide, a user base that has grown by 336% since the year<br />
2000 (InternetWorldStats.com, accessed on February 2009). In parallel, this evolution has<br />
boosted the number of existing websites, which has grown exponentially to over 215<br />
million (NetCraft.com, accessed on February 2009)—see figure 1.1. Quite obviously, this<br />
latter fact affects the quality of the information available on the Web (e.g. webpages<br />
with replicated, erroneous or even forged content are commonly found), making it sometimes<br />
difficult to separate the wheat from the chaff.<br />
This is an example of how the development of the ICT entails two intrinsically contradictory<br />
consequences: while facilitating knowledge acquisition by making information easier<br />
to share and access, it has also fostered the generation of ever-growing amounts of<br />
digital information, giving rise to the so-called data overload effect —a problem that started<br />
to attract the attention of researchers more than a decade ago (Fayyad, Piatetsky-Shapiro,<br />
and Smyth, 1996).<br />
Indeed, as computing technologies have become increasingly efficient in the compression,<br />
transmission, storage and manipulation of data, the real bottleneck has moved towards the<br />
user side: that is, human information interpretation capabilities are often exceeded by the<br />
sheer amount of data. Not infrequently, this situation is aggravated by the existence of data<br />
inconsistencies, making knowledge extraction even more difficult.<br />
Although exemplified in the WWW context, the data overload effect is far from being<br />
exclusive to it. At a smaller scale, large amounts of scientific and social data are being<br />
generated and made either freely or commercially available. Examples of these include<br />
experimental or observational data sets in the physics, chemistry, biomedical, marketing,<br />
financial or social sciences fields, whose sizes can even exceed the terabyte range (Valencia,<br />
2002; Witten and Frank, 2005).<br />
In addition to boosting the volume of information repositories, the evolution of the<br />
ICT has also brought about an increase of data complexity (Hinneburg and Keim, 1998).<br />
Figure 1.1: Evolution of the total number of websites across all Internet<br />
domains, from November 1995 to February 2009 (extracted from<br />
http://news.netcraft.com/archives/2009/02/index.html).<br />
Resorting to the WWW example again, we have all witnessed, over the last decade, how<br />
web pages have evolved from static plain text to dynamic multimedia contents. That is,<br />
the information available on the Web is, to a large extent, no longer restricted to a single<br />
modality (e.g. news in text format). On the contrary, data is increasingly becoming<br />
multimodal, i.e. a combination of several modalities (e.g. text news accompanied by<br />
photos, graphics, audio or video).<br />
This shift towards data multimodality can be regarded as a change of paradigm which<br />
is also found in many other domains (Klosgen, Zytkow, and Zyt, 2002). For instance,<br />
meteorological information often combines satellite and radar imagery with meteorological<br />
data in numerical form (e.g. temperature, humidity, wind speed, rainfall, etc.). In medical<br />
contexts, repositories often contain data obtained from several diagnostic tests (e.g.<br />
blood analysis, radiography, electrocardiography, electroencephalography, functional magnetic<br />
resonance) whose results are represented under distinct modalities (nominal and numerical<br />
data, images, etc.).<br />
To sum up, despite providing enormous quantities of information on a silver platter, the<br />
development and expansion of the ICT pose a serious challenge to human analytic and<br />
understanding capabilities, not only because of the large volumes of data available, but also<br />
because of their growing complexity. Therefore, it seems logical to highlight the importance of developing<br />
automatic tools that allow knowledge extraction from large multimodal data repositories,<br />
regardless of their domain (Witten and Frank, 2005). The techniques supporting these tools<br />
belong to the fields of knowledge discovery and data mining (Klosgen, Zytkow, and Zyt,<br />
2002), which constitute, in a broad sense, the frame of reference of this thesis.<br />
When it comes to extracting knowledge from a given data collection, one of the primary<br />
tasks one thinks of is organization: clearly, arranging the contents of a data repository<br />
according to some meaningful structure helps to gain some perspective on it –in fact, organizing<br />
information is one of the most innate activities involved in human learning (Anderberg,<br />
1973). In general terms, the structures according to which objects 1 are classified are<br />
known as taxonomies, and, although their shape can vary widely (e.g. from parent-child<br />
hierarchical trees to network schemes or simple group structures), they share the common<br />
goal of allowing knowledge extraction by implementing a structure in an unstructured world<br />
of information. Taxonomies have been proposed by experts in almost every scientific field,<br />
such as biology (e.g. the Linnaean taxonomy (Linnaeus, 1758), which settled the basis of<br />
species classification), medicine (for instance, the International Classification of Diseases of<br />
the World Health Organization (www.who.int, accessed on February 2009)) or education<br />
(from classifications of the different learning objectives and skills that educators set for<br />
students (Bloom, 1956; Anderson et al., 2001) to taxonomic models that describe the levels<br />
of increasing complexity in students’ understanding of subjects (Biggs and Collis, 1982)).<br />
When dealing with digital data, the manual creation of a taxonomy can become a very<br />
challenging and burdensome task, as it requires previous domain knowledge (which is not<br />
always available) and/or careful inspection of the whole data collection before designing<br />
the most suitable taxonomic structure. For this reason, it would be very useful to develop<br />
systems capable of organizing data in a fully automatic manner, so that neither expert supervision<br />
nor domain knowledge is required. If this goal were accomplished, the role of expert<br />
taxonomists would be minimized —good news given the dramatic pace at which digital<br />
data is generated.<br />
Regardless of the taxonomic scheme’s layout, data organization criteria are typically<br />
based on analyzing the similarities between objects, grouping them according to their degree<br />
of similarity —i.e. the goal is to place dissimilar instances in separate and distant groups (or<br />
clusters), while placing similar objects in the same group (or in different but closely located<br />
clusters). This task, known as unsupervised classification or clustering, is an important<br />
process which underlies many automated knowledge discovery processes (Fayyad, Piatetsky-<br />
Shapiro, and Smyth, 1996; Klosgen, Zytkow, and Zyt, 2002; Witten and Frank, 2005).<br />
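The grouping-by-similarity principle described above can be illustrated with a minimal k-means sketch in Python. This is a purely illustrative toy example under our own assumptions (deterministic initialization, made-up 2-D data), not one of the clustering algorithms studied in this thesis:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: place similar (nearby) objects in the same
    cluster and dissimilar ones in separate, distant clusters."""
    # Naive deterministic initialization; practical implementations use
    # random restarts or k-means++-style seeding instead.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each object joins its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return clusters

# Two visually separated groups of 2-D objects.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
groups = kmeans(data, k=2)
```

With the toy data above, the two tight groups end up in separate clusters after a few iterations, which is exactly the taxonomic goal stated in the text: similar objects together, dissimilar objects apart.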
The remainder of this chapter provides an insight into the general framework of the<br />
thesis, highlighting the importance of clustering processes as a part of automatic knowledge<br />
discovery systems, and introducing the central focus of this thesis: the robust clustering of<br />
multimodal data.<br />
1.1 Knowledge discovery and data mining<br />
The subject of knowledge discovery from data repositories has come a long way. In fact,<br />
the interest in this field emerged more than a decade ago as a response to the data overload<br />
effect (Fayyad, Piatetsky-Shapiro, and Smyth, 1996; Fayyad, 1996), when the growth of<br />
digital data repositories started to surpass human analytic capabilities. Indeed, while the<br />
analysis and understanding abilities of human analysts remain more or less the same, computers’<br />
seemingly ever-growing storage capacity is holding back a veritable avalanche of data<br />
(Witten and Frank, 2005). Nowadays, many textbooks, journals, workshops and conferences<br />
are devoted to this scientific area, indicating that it is still a very active research field (Klosgen,<br />
Zytkow, and Zyt, 2002).<br />
1 By object, we refer to anything -animate beings, inanimate objects, places, concepts, events, properties,<br />
or relationships- liable to be classified according to some taxonomic scheme.<br />
Figure 1.2: Schematic diagram of the steps involved in the KD process (extracted from<br />
(Fayyad, 1996)).<br />
It is a commonplace that potentially useful and beneficial information patterns lie in<br />
digital data repositories awaiting analysis (Witten and Frank, 2005). However, the concept<br />
of analysis and its goals are highly local to the context in which it is applied (Fayyad, 1996).<br />
Typical application scenarios can be as disparate as i) mining records of buyers’ choices<br />
for creating marketing campaigns adapted to distinct customer profiles (Witten and Frank,<br />
2005), ii) analyzing the credit card transaction history of bank customers so as to detect possible<br />
fraudulent operations from unauthorized users (Fayyad, 1996) or iii) locating and<br />
cataloging geologic objects of interest in remotely sensed images of planets or asteroids<br />
(Fayyad, Piatetsky-Shapiro, and Smyth, 1996).<br />
Thus, be it either economic or scientific, there exists a great interest in replacing (or, at<br />
least, augmenting) human analytic capabilities by computer-based means. The field of computer<br />
science devoted to the extraction of useful patterns from data has been given different<br />
names in the literature, such as information discovery, information harvesting or data archaeology<br />
(Fayyad, Piatetsky-Shapiro, and Smyth, 1996), with knowledge discovery2 (KD)<br />
and data mining (DM) being the two most common denominations.<br />
However, the use of KD and DM as synonymous concepts has been a matter of dispute in<br />
the research community (Klosgen, Zytkow, and Zyt, 2002): while deemed equivalent by some<br />
authors (Witten and Frank, 2005), others refer to KD as the whole process of extracting<br />
knowledge from data, defining DM as the central constituting step of KD processes (Fayyad,<br />
Piatetsky-Shapiro, and Smyth, 1996), as depicted in figure 1.2.<br />
According to this latter standpoint (to which we adhere in this thesis), KD is defined<br />
as the ‘non-trivial process of identifying valid, novel, useful and ultimately understandable<br />
patterns in data’, whereas DM is ‘the application of specific algorithms for extracting patterns<br />
from data’ (Fayyad, Piatetsky-Shapiro, and Smyth, 1996). By ‘extracting patterns<br />
from data’ we refer to making any high-level description of a set of data, e.g. fitting<br />
a model to data or finding structure from it (Fayyad, Piatetsky-Shapiro, and Smyth,<br />
1996). Thus, according to this point of view, KD and DM constitute what could be called<br />
2 Although this discipline was originally named KDD —for Knowledge Discovery in Databases (Piatetsky-<br />
Shapiro, 1991)— in this work we assume that operations are conducted on a flat file extracted from the<br />
database, i.e. we remove the second D in KDD and focus on the knowledge discovery process.<br />
the general and specific frames of reference of this work, respectively. Due to its generic<br />
definition, KD is a crossroad of several disciplines, and, as such, it attracts researchers and<br />
practitioners from the fields of statistics, machine learning, pattern recognition, information<br />
retrieval, or visualization, to name a few (Fayyad, 1996).<br />
As shown in the flow diagram presented in figure 1.2, the KD process can be regarded<br />
as a succession of five steps, namely: selection, preprocessing, transformation, data mining<br />
and interpretation/evaluation. That is, extracting knowledge from a given data set is a<br />
multistage, interactive and iterative process (Brachman and Anand, 1996), as<br />
the user evaluation of the extracted patterns can lead to a re-execution of any of the stages<br />
(as denoted by the dashed arrows in figure 1.2). However, depending on the nature of the<br />
data and the problem at hand, the first two steps may even be skipped. The following<br />
paragraphs present a brief description of each of these five phases.<br />
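The five phases can be sketched as a chain of functions over a toy record set. This is a hedged, purely illustrative Python sketch: the function names, the missing-data policy, and the trivial "pattern" mined here (a per-feature mean) are our own assumptions, not taken from the cited literature:

```python
def select(database, wanted_fields):
    """Selection: build the target data set from the repository."""
    return [{k: rec[k] for k in wanted_fields} for rec in database]

def preprocess(records):
    """Preprocessing: drop records with missing fields (one simple policy)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: map each record to a numeric feature vector."""
    return [tuple(float(v) for v in r.values()) for r in records]

def mine(vectors):
    """Data mining: extract a (trivial) pattern, the per-feature mean."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def interpret(pattern):
    """Interpretation/evaluation: render the pattern for the analyst."""
    return {"mean_vector": pattern}

db = [{"temp": 20, "hum": 0.4, "id": "a"},
      {"temp": 22, "hum": None, "id": "b"},
      {"temp": 24, "hum": 0.6, "id": "c"}]
report = interpret(mine(transform(preprocess(select(db, ["temp", "hum"])))))
```

In a real KD system each stage would of course be far richer, and, as the dashed arrows in figure 1.2 indicate, the analyst may loop back and re-execute any stage after inspecting the report.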
For starters, the target data set that will be subject to the KD process is created in the<br />
selection phase. This typically implies selecting a subset of the available objects in the<br />
database, although, in some cases, no selection is conducted, and all the data items in the<br />
repository are included in the target data set.<br />
Optionally, this stage can also involve representing the objects by means of a subset of the<br />
variables (aka attributes or features) that constitute them. In general terms, these variables<br />
can either be numerical or nominal. In the former case, attributes usually represent the<br />
value of a quantitatively measurable magnitude (e.g. temperature, altitude or population).<br />
In the latter case, features can only take one of a predefined set of categorical values, such<br />
as the outlook feature in the classic weather data repository, i.e. outlook = {overcast,<br />
sunny, rainy} (Witten and Frank, 2005). In this work, all objects will be represented by<br />
means of a set of d numeric attributes gathered into real-valued d-dimensional vectors.<br />
Secondly, a data preprocessing step is carried out, which includes data parameterization,<br />
noise and/or outlier removal and missing data field handling, among others.<br />
Thirdly, the objects in the data set are subject to a transformation process, which<br />
basically consists of finding useful features for representing the objects according to the<br />
goals of the overall KD process. In general terms, this step aims at decreasing the number<br />
of variables used for representing the objects (dimensionality reduction) so as to improve<br />
the results and/or the computational complexity of the data mining step of the KD process.<br />
The reasoning underlying dimensionality reduction is based on the fact that the original<br />
data representation is often redundant, e.g. there may exist high levels of correlation between<br />
several variables, or the values of some features may be so small that they are almost<br />
irrelevant (Carreira-Perpiñán, 1997). Moreover, when the number of variables associated<br />
with each object is too high, the scalability of KD systems is negatively affected, as the<br />
time complexity of the DM stage is usually proportional to the number of attributes (Yang<br />
and Olafsson, 2005). Furthermore, many methods suffer performance breakdowns when the<br />
dimensionality of the feature space is very high (Fodor, 2002).<br />
There exist two main strategies to conduct dimensionality reduction: feature selection<br />
and feature extraction. In feature selection, the reduced feature set is a subset of the original<br />
object attributes, whereas in feature extraction, the original variables are transformed into<br />
a set of new features, typically obtained as combinations of the original ones. For insight<br />
into feature selection and extraction techniques, the reader is referred to (Molina, Belanche,<br />
and Nebot, 2002; Dy and Brodley, 2004) and (Fodor, 2002), respectively.<br />
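The two strategies can be contrasted in a short numpy sketch. This is illustrative only: variance ranking as the selection criterion and PCA as the extraction method are common examples chosen here, not techniques prescribed by the references above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # n = 100 objects (rows), d = 5 attributes
X[:, 4] *= 0.01                 # make the last feature nearly irrelevant

# Feature selection: the reduced set is a subset of the original
# attributes, here the 3 with the largest sample variance.
keep = np.sort(np.argsort(X.var(axis=0))[::-1][:3])
X_sel = X[:, keep]

# Feature extraction: new features are combinations of the original
# ones, here linear combinations given by the top-3 PCA directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_ext = Xc @ Vt[:3].T

print(X_sel.shape, X_ext.shape)  # both reduced to (100, 3)
assert 4 not in keep             # the near-constant feature is discarded
```

Note that selection preserves the meaning of the retained attributes, whereas the extracted PCA features are linear mixtures of all the original ones.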
1.1. Knowledge discovery and data mining<br />
[Figure 1.3 shows a taxonomy tree: data mining methods divide into verification methods (goodness of fit, hypothesis testing, analysis of variance) and discovery methods; discovery methods divide into prediction (classification, regression) and description (clustering, summarization).]<br />
Figure 1.3: Taxonomy of data mining methods (adapted from (Maimon and Rokach, 2005)).<br />
Next, depending on the goals of the KD process, a suitable data mining task must<br />
be chosen. According to the taxonomy presented in figure 1.3, data mining methods can<br />
be classified into two main groups: verification-oriented and discovery-oriented (Fayyad,<br />
Piatetsky-Shapiro, and Smyth, 1996). In verification-oriented DM, the role of the system<br />
is to evaluate a hypothesis proposed by the user, and this task is usually accomplished<br />
by means of traditional statistical methods. In discovery-oriented data mining, the goal of<br />
the system is to discover useful patterns in the data. In this work, we focus on this latter<br />
family of DM tasks.<br />
Among discovery-oriented data mining methods, one can distinguish between prediction<br />
and description DM tasks. In prediction tasks, the goal of the system is to build<br />
a behavioral model upon the data, whereas in description tasks, the system aims to find<br />
human-understandable patterns that facilitate knowledge extraction from data.<br />
According to figure 1.3, there exist two main prediction DM tasks: classification and<br />
regression. In classification problems, the goal is to learn a mapping between the categories<br />
in a known taxonomic scheme and a set of pre-classified objects, so that any unseen object<br />
can be categorized into any of these predefined classes. The aim of regression tasks is to<br />
learn a function that maps unseen data objects to a real-valued prediction variable.<br />
As aforementioned, description-oriented DM methods focus on finding understandable<br />
representations of the underlying structure of the data (Maimon and Rokach, 2005). One of<br />
the most common descriptive DM tasks is clustering, which consists of identifying a finite<br />
set of categories to describe the data with no previous knowledge (i.e. deriving a taxonomy<br />
solely from the data). Another description-oriented data mining task is summarization,<br />
whose aim is to find a compact description for a subset of data. To do so, summarization<br />
techniques often make use of multivariate visualization methods.<br />
Once the data mining task that fits the goals of the KD process is identified, there<br />
comes the time to select the specific data mining algorithm to be applied. This selection<br />
must take into account not only which models and parameters are the most appropriate<br />
from an algorithmic viewpoint, but also the desired level of accuracy, utility, and intel-<br />
ligibility of the descriptions of the structural patterns of the data (Fayyad, 1996). This<br />
latter issue is of paramount importance with a view to the final stage of the KD process —<br />
evaluation/interpretation—, which often involves visualizing the mined patterns and/or<br />
the data according to these. As mentioned earlier, depending on the user’s evaluation of the<br />
extracted patterns, it may be necessary to re-execute any of the previous steps for further<br />
refinement of the KD process (Halkidi, Batistakis, and Vazirgiannis, 2002a).<br />
As the reader may have observed, the user must make several critical decisions at every<br />
step of the knowledge discovery process (Fayyad, Piatetsky-Shapiro, and Smyth, 1996).<br />
By critical we mean that a wrong choice in any of the intermediate stages could lead to<br />
the extraction of scarcely meaningful patterns and, as a consequence, to the objectives of the<br />
whole KD process not being reached. This issue becomes especially tricky when the DM<br />
stage is based on unsupervised learning techniques, as when, for instance, clustering is the<br />
data mining task selected for the DM phase of the KD process. In this situation, the user’s<br />
decisions are usually made blindly, which may result in an unsatisfactory evaluation, thus<br />
requiring a (blind) re-execution of one or several of the phases of the KD process, which,<br />
in the worst-case scenario, can end in a tedious iterative loop while groping for the right<br />
decisions.<br />
This thesis is focused on the discovery and description-oriented data mining task of<br />
clustering, placing special emphasis on its application to multimodal data. In particular,<br />
one of our main goals is the design of clustering systems robust to the indeterminacies<br />
induced by the fact that many decisions surrounding clustering processes must be made<br />
blindly (e.g. which features should be used, which clustering algorithm should be applied).<br />
In the next section, we present a by no means exhaustive overview of several fundamental<br />
aspects of the clustering problem, which will lead to a description of the key problems<br />
addressed in this thesis.<br />
1.2 Clustering in knowledge discovery and data mining<br />
Clustering can be defined as the process of separating a finite unlabeled data set into a<br />
finite and discrete set of natural clusters based on similarity (Xu and Wunsch II, 2005;<br />
Jain, Murty, and Flynn, 1999).<br />
After a successful clustering process, the presumably high number of objects contained in<br />
the data set can be represented by means of a comparatively smaller number of meaningful<br />
clusters, which necessarily implies a loss of certain fine details, but yields a simplified cluster-based<br />
data model (Berkhin, 2002). In other words, clustering is a description-oriented data<br />
mining task, as the obtained clusters should somehow reflect the mechanisms that cause<br />
some objects to be more similar to one another than to the remaining ones (Witten and<br />
Frank, 2005).<br />
It is important to notice that clustering is an unsupervised classification task, as the<br />
objects in the data set are unlabeled (i.e. there is no prior knowledge about how they should<br />
be grouped). This fact marks a clear difference between clustering and the prediction-oriented<br />
task of supervised classification (see figure 1.3). In this latter case, we are provided<br />
with a collection of labeled (i.e. pre-classified) objects so as to learn the descriptions of<br />
classes, which in turn are used to categorize new data items. In clustering, objects are also<br />
assigned labels, but these are data driven —that is, cluster labels are obtained solely from<br />
the data, not provided by an external source (Jain, Murty, and Flynn, 1999).<br />
Being such a generic task, clustering has found applications in multiple research fields,<br />
among which we can name the following few:<br />
– information retrieval, where clustering has been applied for organizing the results<br />
returned by a search engine in response to a user’s query (i.e. post-retrieval clustering)<br />
(Tombros, Villa, and van Rijsbergen, 2002; Hearst, 2006), for refining ambiguous<br />
queries input to retrieval systems (Käki, 2005), or for improving their performance<br />
(van Rijsbergen, 1979).<br />
– text mining, where browsing through large document collections is simplified if they<br />
are previously clustered (Cutting et al., 1992; Steinbach, Karypis, and Kumar, 2004;<br />
Dhillon, Fan, and Guan, 2001).<br />
– computational genomics, where clustering of gene expression data from DNA microarray<br />
experiments is applied for identifying the functionality of genes, finding out what<br />
genes are co-regulated or distinguishing the important genes between abnormal and<br />
normal tissues (Zhao and Karypis, 2003a; Jiang, Tang, and Zhang, 2004).<br />
– economics, where clustering economic and financial time series can be employed for<br />
identifying i) areas or sectors for policy-making purposes, ii) structural similarities<br />
in economic processes for economic forecasting, iii) stable dependencies for risk management<br />
and investment management (Focardi, 2001), or iv) customer profiles and<br />
customers-products relationships (Liu and Luo, 2005).<br />
– computer vision, where clustering is applied for common procedures such as image<br />
preprocessing (Jain, 1996), segmentation (Mancas-Thillou and Gosselin, 2007) and<br />
matching (Miyajima and Ralescu, 1993).<br />
Regardless of the application, the desired result of any clustering process is a maximally<br />
representative partition of the data set, which usually corresponds to clusters with high<br />
intra-cluster and low inter-cluster object similarities. In the quest for this goal, a myriad<br />
of clustering methods have been proposed. With no claim of being exhaustive, the next<br />
section presents an overview of some of the most relevant clustering methods, highlighting<br />
some important concepts in this context.<br />
1.2.1 Overview of clustering methods<br />
Several excellent and extensive surveys on clustering can be found in the literature (Jain,<br />
Murty, and Flynn, 1999; Berkhin, 2002; Kotsiantis and Pintelas, 2004; Xu and Wunsch II,<br />
2005). Providing a detailed description of the existing clustering methods lies beyond the<br />
scope of this work, so the reader interested in their ins and outs is referred to the previously<br />
cited surveys and references therein. However, due to the central role of clustering processes<br />
in this thesis, this section presents a brief description of several key issues in this context,<br />
such as:<br />
1. A categorization of clustering algorithms according to generic criteria.<br />
2. A brief discussion on similarity measures, one of the central notions in clustering.<br />
3. An outline of the foundations of several representative clustering methods.<br />
For starters, let us introduce a few notational conventions which will be employed<br />
throughout this work:<br />
– in general terms, any object in a data set will be represented by means of a d-dimensional<br />
column vector x = [x1 x2 ... xd]^T (where T denotes transposition). As<br />
mentioned in section 1.1, in this work we consider only object representations based<br />
on numerical attributes, so that each feature xi is a real number and, hence, x ∈ R^d.<br />
– so as to refer to a particular data item (e.g. the ith object in the data repository) we<br />
will use the notation xi = [xi1 xi2 ... xid]^T.<br />
– a data set is defined as a compilation of n objects, mathematically represented<br />
by a d × n matrix X = [x1 x2 ... xn].<br />
– the number of clusters into which the objects will be partitioned is denoted as k. Thus,<br />
each cluster resulting from the clustering process will be assigned an integer-valued<br />
label in the range {1, ..., k}.<br />
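These conventions translate directly into code; a minimal numpy illustration (the numeric values below are made up for the example):

```python
import numpy as np

d, n, k = 2, 9, 3   # dimensions, objects, clusters (as defined above)

# Objects are d-dimensional column vectors gathered into a d x n matrix X.
X = np.array([[0.0, 0.1, 0.2, 0.5, 0.6, 0.7, 1.0, 1.1, 1.2],
              [0.0, 0.0, 0.1, 0.5, 0.5, 0.6, 1.0, 1.0, 1.1]])
assert X.shape == (d, n)

xi = X[:, 0]        # the first object x1, a vector in R^d
assert xi.shape == (d,)

# A labeling assigns each object an integer cluster label in {1, ..., k}.
labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
assert labels.shape == (n,)
assert set(labels.tolist()) <= set(range(1, k + 1))
```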
Categorization of clustering methods<br />
Without taking into account their theoretical foundations, there exist many ways of classifying<br />
clustering algorithms (Jain, Murty, and Flynn, 1999). However, they can be split<br />
into two clearly separated groups according to two universal criteria: i) the structure of the<br />
final clustering solution, and ii) the overlap of the obtained clusters.<br />
On one hand, if clustering algorithms are analyzed in terms of the structure of the clustering<br />
solution, one can distinguish between partitional and hierarchical clustering methods.<br />
1. Partitional clustering algorithms: the clustering solution output by this type of algorithm<br />
corresponds to the most intuitive notion of clustering: a single partition of the<br />
objects in the data set into the desired number of clusters k.<br />
2. Hierarchical clustering algorithms: the structure of the clustering solution is a tree<br />
of clusters (Kotsiantis and Pintelas, 2004), which is built sequentially. Depending on<br />
whether this tree is constructed bottom-up or top-down (i.e. from n singleton clusters<br />
to a sole cluster or vice versa), hierarchical algorithms are called agglomerative or divisive.<br />
In either case, the algorithm decides, at each level of the tree, which two clusters<br />
should be merged (if agglomerative) or split (if divisive) depending on their degree of<br />
similarity, which is typically measured according to either the minimum, maximum<br />
or average of the distances between all pairs of objects drawn from both clusters,<br />
giving rise to the single-link, complete-link or average-link criteria (Jain, Murty, and<br />
Flynn, 1999). The clustering solution structure yielded by hierarchical algorithms is<br />
usually represented by means of a binary tree of clusters or dendrogram, an example of<br />
which is depicted in figure 1.4. Although hierarchical clustering algorithms typically<br />
compute the complete hierarchy of clusters 3 , a single partition can be obtained by<br />
3 As a consequence, when compared to their partitional counterparts, hierarchical clustering algorithms<br />
tend to be more computationally demanding (Jain, Murty, and Flynn, 1999).<br />
Figure 1.4: A hierarchical clustering toy example: (a) Scatterplot of an artificially generated<br />
two-dimensional data set containing n = 9 objects, each of them identified by a<br />
numerical label. (b) Dendrogram resulting from applying the single-link hierarchical agglomerative<br />
clustering algorithm to this data, using the Euclidean distance as the similarity<br />
measure. The dashed horizontal line performs a cut on the dendrogram, yielding a 4-way<br />
partition with a Euclidean distance between clusters ranging between 0.112 and 0.255.<br />
performing a cut through the dendrogram at the desired level of cluster similarity or<br />
by setting the desired number of clusters k, as shown by the dashed horizontal line in<br />
figure 1.4(b).<br />
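This cut-based procedure can be reproduced with standard tools; a sketch assuming scipy is available (the 9-object data set below is invented, not the exact coordinates of figure 1.4):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A toy 2-D data set with n = 9 objects forming three tight groups.
X = np.array([[0.00, 0.00], [0.05, 0.00], [0.00, 0.05],
              [0.50, 0.50], [0.55, 0.50], [0.50, 0.55],
              [0.00, 0.50], [0.05, 0.50], [0.00, 0.55]])

# Agglomerative (bottom-up) single-link clustering on Euclidean
# distances; Z encodes the full dendrogram.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram at a desired inter-cluster distance...
labels_dist = fcluster(Z, t=0.2, criterion='distance')
# ...or so as to obtain a desired number of clusters k.
labels_k = fcluster(Z, t=3, criterion='maxclust')

print(labels_k)  # a 3-way partition: each tight group in one cluster
```

Both cuts recover the same three groups here; only the way the cut level is specified differs.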
On the other hand, the overlap of the clusters into which objects are grouped is an<br />
additional factor which allows splitting clustering algorithms into two large categories: hard<br />
and soft algorithms.<br />
1. Hard (aka crisp) clustering algorithms: this type of algorithm partitions the data<br />
set into k disjoint clusters, i.e. each object is assigned to one and only one cluster.<br />
In mathematical terms, the result of a hard clustering process on the data set X<br />
is an n-dimensional integer-valued row label vector λ = [λ1 λ2 ... λn], where λi ∈<br />
{1, 2, ..., k}, ∀i ∈ [1, n]. That is, the ith component of the label vector (or labeling<br />
for short) contains the label of the cluster the ith object in the data set is assigned<br />
to. For instance, the label vector obtained after applying a classic hard clustering<br />
algorithm such as k-means on the artificial toy data set depicted in figure 1.4(a),<br />
setting k = 3, is λ = [2 2 2 1 1 1 3 3 3]. Notice the symbolic nature of the cluster labels,<br />
as the same clustering result would be represented by permuted label vectors<br />
such as λ = [1 1 1 2 2 2 3 3 3] or λ = [3 3 3 1 1 1 2 2 2].<br />
2. Soft (aka fuzzy) clustering algorithms: they generate a set of k overlapping clusters,<br />
i.e. each object is associated to each of the k clusters to a certain degree. Hence, the<br />
result of conducting a soft clustering process on the data set X is a k × n real-valued<br />
clustering matrix Λ, whose (i, j)th entry indicates the degree of association between<br />
the ith cluster and the jth object. This degree of association is typically expressed<br />
in terms of the probability of membership of each object to each cluster, as done by<br />
the well-known fuzzy c-means (FCM) soft clustering algorithm. Continuing with the<br />
toy two-dimensional data set presented in figure 1.4(a), the membership probability<br />
matrix yielded by the FCM clustering algorithm (with k = 3) is the following:<br />
Λ = ⎛ 0.0544 0.0418 0.0572 0.0254 0.0192 0.0301 0.9764 0.9285 0.9723 ⎞<br />
    ⎜ 0.0252 0.0258 0.0375 0.9688 0.9758 0.9586 0.0144 0.0554 0.0173 ⎟<br />
    ⎝ 0.9205 0.9324 0.9053 0.0059 0.0050 0.0114 0.0092 0.0160 0.0104 ⎠<br />
It is easy to see that permuting the rows of matrix Λ would not alter the clustering<br />
results, as cluster identifiers are symbolic. Moreover, notice that a clustering matrix<br />
Λ can always be transformed into a label vector λ by simply assigning each object<br />
to the cluster it is more strongly associated with (e.g. the cluster with the highest<br />
membership probability).<br />
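The hardening step just mentioned (turning Λ into λ) is a one-liner; a numpy sketch using the FCM membership matrix reproduced above:

```python
import numpy as np

# The k x n FCM membership matrix from the text
# (k = 3 clusters as rows, n = 9 objects as columns).
L = np.array([
    [0.0544, 0.0418, 0.0572, 0.0254, 0.0192, 0.0301, 0.9764, 0.9285, 0.9723],
    [0.0252, 0.0258, 0.0375, 0.9688, 0.9758, 0.9586, 0.0144, 0.0554, 0.0173],
    [0.9205, 0.9324, 0.9053, 0.0059, 0.0050, 0.0114, 0.0092, 0.0160, 0.0104]])

# Assign each object to the cluster with the highest membership
# probability; labels are taken in {1, ..., k} as in the text.
lam = L.argmax(axis=0) + 1
print(lam)  # [3 3 3 2 2 2 1 1 1]

# Cluster labels are symbolic: permuting the rows of L only relabels
# the clusters, it does not change the underlying partition.
lam_perm = L[[2, 0, 1], :].argmax(axis=0) + 1
assert (lam_perm == [1, 1, 1, 3, 3, 3, 2, 2, 2]).all()
```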
Summarizing, the structure of the clustering solution and the overlap of the resulting<br />
clusters define a two-dimensional frame of reference which allows categorizing clustering<br />
algorithms in a broad sense, i.e. without resorting to their theoretical foundations.<br />
Distance and similarity measures<br />
According to the definition of clustering presented at the beginning of section 1.2, measuring<br />
the resemblance between the objects in the data set is central to clustering processes. For<br />
this reason, these next paragraphs will be devoted to a brief description of several ways<br />
of measuring similarity. More specifically, we focus solely on measures for computing the<br />
resemblance between objects under numeric feature representation, although there exist<br />
equivalent measures for comparing objects represented by means of ordinal or nominal<br />
attributes (Xu and Wunsch II, 2005; Jain, Murty, and Flynn, 1999).<br />
Hence, let us consider two objects in the data set represented as vectors in R^d,<br />
namely xi = [xi1 xi2 ... xid]^T and xj = [xj1 xj2 ... xjd]^T.<br />
There exist two complementary ways of comparing xi and xj: i) measuring their degree<br />
of similarity, denoted as S(xi, xj), or ii) measuring the distance between them, i.e.<br />
D(xi, xj). Although when dealing with objects represented by numerical features it is more<br />
usual to measure the distance between them than their similarity (Jain, Murty, and Flynn,<br />
1999; Xu and Wunsch II, 2005), both types of measures will be described next—furthermore,<br />
there are multiple ways of transforming a similarity measure into a distance and vice versa<br />
(Fenty, 2004).<br />
1. Distance measures<br />
– Minkowski distance: properly speaking, it is a family of distances. In general<br />
terms, the Minkowski distance of order n is defined according to equation (1.1):<br />
D(xi, xj) = ( ∑_{l=1}^{d} |xil − xjl|^n )^{1/n}   (1.1)<br />
– Euclidean distance: possibly the most widely used metric, it is obtained as the<br />
particularization of the Minkowski metric for n =2:<br />
D(xi, xj) = ( ∑_{l=1}^{d} |xil − xjl|^2 )^{1/2}   (1.2)<br />
This is the distance measure used in the most classic implementation of the k-means<br />
clustering algorithm, and it tends to form hyperspherical clusters (Xu and<br />
Wunsch II, 2005).<br />
– Manhattan distance: also known as city block distance, it is defined as a particular<br />
case of the Minkowski metric for n = 1 (see equation (1.3)), and it tends to<br />
create hyperrectangular clusters (Xu and Wunsch II, 2005).<br />
D(xi, xj) = ∑_{l=1}^{d} |xil − xjl|   (1.3)<br />
– Mahalanobis distance: it can be regarded as a modified version of the Euclidean<br />
distance, that takes into account the covariance among the attributes. It is<br />
defined as follows:<br />
D(xi, xj) = (xi − xj)^T S^{−1} (xi − xj)   (1.4)<br />
where S is the sample covariance matrix computed over all the data set (Jain,<br />
Murty, and Flynn, 1999). Algorithms using this distance tend to create hyperellipsoidal<br />
clusters (Xu and Wunsch II, 2005).<br />
2. Similarity measures<br />
– Cosine similarity: it consists of measuring the angle between the vectors<br />
representing the objects, and, as such, it does not depend on their length.<br />
It is defined as follows:<br />
S(xi, xj) = (xi^T xj) / (||xi|| ||xj||)   (1.5)<br />
where || · || denotes vector norm.<br />
– Pearson correlation coefficient: a classic concept in the probability theory and<br />
statistics fields, correlation measures the strength and direction of the linear<br />
relationship between vectors xi and xj. The most widely used correlation index<br />
is the Pearson correlation coefficient, which is defined in equation (1.6):<br />
S(xi, xj) = ∑_{l=1}^{d} (xil − x̄i)(xjl − x̄j) / ( ∑_{l=1}^{d} (xil − x̄i)^2 ∑_{l=1}^{d} (xjl − x̄j)^2 )^{1/2}   (1.6)<br />
where xil is the lth component of vector xi, and x̄i denotes its sample mean.<br />
– Extended Jaccard coefficient: whereas the cosine and Pearson correlation coefficient<br />
similarity measures consider that vectors xi and xj are similar if they point<br />
in the same direction (i.e. they have roughly the same set of features and in the<br />
same proportion), the extended Jaccard coefficient –defined in equation (1.7)–<br />
accounts both for the angle and magnitude of the vectors.<br />
S(xi, xj) = (xi^T xj) / (||xi||^2 + ||xj||^2 − xi^T xj)   (1.7)<br />
For further insight into these and other distance and similarity measures, their properties<br />
and other characteristics, see (Duda, Hart, and Stork, 2001; Xu and Wunsch II, 2005).<br />
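A direct numpy transcription of the measures above may help fix the notation. This is a sketch: equation (1.4) is transcribed as given, i.e. the squared Mahalanobis form without a square root, and the extended Jaccard coefficient is implemented in its usual squared-norm form.

```python
import numpy as np

def minkowski(xi, xj, n):        # eq. (1.1)
    return float((np.abs(xi - xj) ** n).sum() ** (1.0 / n))

def euclidean(xi, xj):           # eq. (1.2): Minkowski with n = 2
    return minkowski(xi, xj, 2)

def manhattan(xi, xj):           # eq. (1.3): Minkowski with n = 1
    return minkowski(xi, xj, 1)

def mahalanobis(xi, xj, S):      # eq. (1.4); S is the sample covariance
    diff = xi - xj
    return float(diff @ np.linalg.inv(S) @ diff)

def cosine(xi, xj):              # eq. (1.5)
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

def pearson(xi, xj):             # eq. (1.6)
    ci, cj = xi - xi.mean(), xj - xj.mean()
    return float(ci @ cj / np.sqrt((ci ** 2).sum() * (cj ** 2).sum()))

def ext_jaccard(xi, xj):         # eq. (1.7), squared-norm form
    dot = xi @ xj
    return float(dot / (xi @ xi + xj @ xj - dot))

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(manhattan(xi, xj))         # 6.0
print(cosine(xi, xj))            # 1.0 (collinear vectors)
print(pearson(xi, xj))           # 1.0 (perfect linear relationship)
```

The collinear test vectors illustrate the remark above: cosine and Pearson similarity are maximal even though the vectors differ in magnitude, whereas the extended Jaccard coefficient is below 1 for this pair.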
Approaches to clustering<br />
As aforementioned, multiple clustering algorithms with fairly different foundations can be<br />
found in the literature. Our interest here is to present a brief description of the most well-known<br />
theoretical approaches clustering algorithms are based on, enumerating some specific<br />
implementations derived from them.<br />
1. Square error based clustering: in this approach, the goal is minimizing the sum of<br />
squared distances between the objects and the centroid (i.e. the center of gravity)<br />
of the cluster they are assigned to. This minimization process is usually executed<br />
iteratively, as in the case of the k-means clustering algorithm, which is probably<br />
the most representative example of this type of algorithm (Forgy, 1965). A slight<br />
variant of the k-means algorithm is k-medoids, where each cluster is represented by<br />
one of its member objects instead of its centroid, which makes it more robust to<br />
outliers (Kaufman and Rousseeuw, 1990). Moreover, techniques that allow splitting<br />
and merging the resulting clusters have also been developed. An example is the<br />
ISODATA algorithm (Ball and Hall, 1965), which divides high variance clusters and<br />
joins close clusters—quite obviously, setting the thresholds that govern the cluster<br />
merging and splitting decisions is a key issue in ISODATA.<br />
2. Mixture densities based clustering: this approach follows a probabilistic perspective,<br />
as it assumes that the objects in the data set have been generated according to several<br />
probability distributions—typically one per cluster. Finding the clusters boils down to<br />
making assumptions on these probability distributions (they are often assumed to be<br />
Gaussian) and estimating the parameters of the underlying models usually following<br />
a maximum likelihood approach (Jain, Murty, and Flynn, 1999). The iterative optimization<br />
of the maximum likelihood criterion using the expectation-maximization<br />
(EM) algorithm has given rise to the most popular clustering algorithm based on<br />
mixture densities, EM clustering (McLachlan and Krishnan, 1997). Other algorithms<br />
based on this kind of approach are AutoClass (which extends mixture densities to<br />
Poisson, Bernoulli and log-normal probability distributions) (Cheeseman and Stutz,<br />
1996), or the SNOB algorithm, which uses a mixture model in conjunction with the<br />
minimum message length principle (Wallace and Dowe, 1994).<br />
3. Graph based clustering: this type of algorithm applies the concepts and properties<br />
of graph theory to the clustering problem, mapping data objects onto the nodes of a<br />
weighted graph whose edges reflect the similarity between each pair of objects. This<br />
makes graph based clustering conceptually similar to hierarchical clustering (Jain and<br />
Dubes, 1988), a good example of which is the Chameleon hierarchical clustering algorithm,<br />
which is based on the k-nearest neighbour graph (Karypis, Han, and Kumar,<br />
1999). Other (non-hierarchical) graph based clustering algorithms are i) Zahn’s algorithm<br />
(Zahn, 1971), which is based on discarding the edges with the largest lengths<br />
on the minimal spanning tree so as to create clusters, ii) CLICK, based on computing<br />
the minimum weight cut to form clusters (Sharan and Shamir, 2000), and iii)<br />
MajorClust, which is based on the weighted partial connectivity of the graph, a measure<br />
whose maximization implicitly determines the optimum number of clusters to be<br />
found (Stein and Niggemann, 1999). Lastly, the more recent family of spectral clustering<br />
algorithms can be included in the context of graph based clustering. This type<br />
of algorithm is often reported to outperform traditional clustering techniques<br />
(von Luxburg, 2006). In addition, spectral clustering is simple to implement and can<br />
be solved efficiently by standard linear algebra methods, as, in short, it boils down to<br />
computing the eigenvectors of the Laplacian matrix of the graph. Several variants of<br />
spectral clustering algorithms have been proposed, differing in the way the similarity<br />
graph and the Laplacian matrix are computed (e.g. (Shi and Malik, 2000; Ng, Jordan,<br />
and Weiss, 2002)).<br />
4. Clustering based on combinatorial search: this approach is based on considering clustering<br />
as a combinatorial optimization problem that can be solved by applying search<br />
techniques for finding the global (or approximate global) optimum clustering solution.<br />
In this context, two paradigms have been followed in the design of clustering algorithms:<br />
stochastic optimization methods and deterministic search techniques. Among<br />
the former, some of the most popular approaches are based on evolutionary computation<br />
(e.g. hard or soft clustering based on genetic algorithms (Hall, Özyurt,<br />
and Bezdek, 1999; Tseng and Yang, 2001)), simulated annealing (e.g. (Selim and<br />
Al-Sultan, 1991)), Tabu search (Al-Sultan, 1995) and hybrid solutions (Chu and Roddick,<br />
2000; Scott, Clark, and Pham, 2001), whereas deterministic annealing is the most<br />
typical deterministic search technique applied to clustering (Hofmann and Buhmann,<br />
1997).<br />
5. Clustering based on neural networks: the well-known learning and modelling abilities<br />
of neural networks have been exploited to solve clustering problems. The two most<br />
successful neural networks paradigms applied to clustering are i) competitive learning,<br />
where the Self-Organizing Maps (Kohonen, 1990) and Generalized Learning Vector<br />
Quantization (Karayiannis et al., 1996) play a salient role, and ii) adaptive resonance<br />
theory (Carpenter and Grossberg, 1987), which encompasses a whole family of neural<br />
networks architectures that can be used for hierarchical (Wunsch et al., 1993) and<br />
soft (Carpenter, Grossberg, and Rosen, 1991) clustering.<br />
6. Kernel based clustering: the rationale of kernel-based learning algorithms is simplifying<br />
the task of separating the objects in the data set by nonlinearly transforming them<br />
into a higher-dimensional feature space. Through the design of an inner-product kernel,<br />
the time-consuming and sometimes even infeasible process of explicitly describing<br />
the nonlinear mapping and computing the corresponding data points in the transformed<br />
space can be avoided (Xu and Wunsch II, 2005). A recent example of this<br />
approach is Support Vector Clustering (SVC) (Ben-Hur et al., 2001), which employs<br />
the radial basis function as its kernel, being capable of forming either agglomerative<br />
or divisive hierarchical clusters. Moreover, SVC can be further extended to allow for<br />
fuzzy membership (Chiang and Hao, 2003). Kernel-based clustering algorithms have<br />
many advantages, such as the ability to form arbitrary clustering shapes or to deal<br />
with noise and outliers (Xu and Wunsch II, 2005).<br />
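The kernel trick mentioned above can be illustrated directly: with an inner-product kernel K(x, y) = ⟨φ(x), φ(y)⟩, distances in the transformed space are computed from kernel values alone, so φ never has to be evaluated. A minimal sketch (function names are ours; the radial basis function is the kernel used by SVC):<br />

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = <phi(x), phi(y)> for an implicit, infinite-dimensional phi;
    the nonlinear mapping itself is never evaluated."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def feature_space_dist2(x, y, kernel):
    """Squared distance ||phi(x) - phi(y)||^2 expanded into kernel values only:
    K(x, x) - 2 K(x, y) + K(y, y)."""
    return kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
```

Kernel k-means, for instance, replaces every point-to-centroid distance by an expression of this form involving averages of kernel entries over cluster members.<br />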
7. Density based clustering: algorithms of this type form clusters based on the<br />
density of objects in a region of the feature space, so that the neighborhood of each<br />
object in a cluster must contain a minimum number of objects. This principle allows<br />
the growth of clusters in any direction, thus being able to discover arbitrarily shaped<br />
clusters, besides providing a natural protection against outliers. There are two main<br />
approaches to density-based clustering, differing in the way density is computed: the<br />
first class of algorithms computes density directly from the objects in the data<br />
set (such as the DBSCAN algorithm (Ester et al., 1996)). In contrast, the second type<br />
of density-based clustering strategies create an analytical model of the density over<br />
all the feature space, using influence functions that describe the impact of each data<br />
object within its neighbourhood, thus identifying clusters by determining the maximum<br />
of the overall density function—as the DENCLUE algorithm does (Hinneburg<br />
and Keim, 1998). More recently, another density-based clustering algorithm called<br />
Shared Nearest Neighbors (Ertz, Steinbach, and Kumar, 2003) has been proposed: it<br />
measures object similarity based on the number of common nearest neighbours of each<br />
pair of objects, thus identifying core points, around which clusters are grown.<br />
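The first, object-referred approach can be sketched along the lines of DBSCAN: a point whose ε-neighborhood contains at least min_pts objects is a core point, clusters are grown from core points, and points reachable from no core point are labeled noise. A minimal sketch (ours, not the original implementation):<br />

```python
def dbscan_like(points, eps, min_pts):
    """Minimal density-based clustering: grow clusters from 'core' points whose
    eps-neighborhood (including the point itself) holds at least min_pts
    objects; points reachable from no core point are labeled noise (-1)."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n)
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps * eps]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point: provisionally noise
            labels[i] = -1
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # previously noise: border point of this cluster
                labels[j] = cluster
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:        # j is itself a core point: keep expanding
                queue.extend(js)
        cluster += 1
    return labels
```

Because clusters are grown one neighborhood at a time, they can take arbitrary shapes, and isolated points never force a cluster into existence.<br />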
8. Grid based clustering: in this approach, the clustering space is quantized into a finite<br />
number of hyperrectangular cells. Those cells containing a number of objects over<br />
a predetermined threshold are connected to form the clusters. Possibly, the three<br />
most well-known clustering algorithms based on this approach are STING (Wang,<br />
Yang, and Muntz, 1997), WaveCluster (Sheikholeslami, Chatterjee, and Zhang, 1998)<br />
and CLIQUE (Agrawal et al., 1998). The main difference among them lies in the cell<br />
generation procedure. In STING, the feature space is divided into several levels, thus<br />
forming a hierarchical cell structure. In contrast, CLIQUE follows a recursive process<br />
for generating (k+1)-dimensional dense cells by associating dense k-dimensional cells,<br />
starting with k = 1. In turn, WaveCluster follows a fairly different approach, as it<br />
applies the Wavelet transform on the original feature space into a frequency space<br />
where the natural clusters in the data become distinguishable (i.e. cells are somehow<br />
defined on the transformed space).<br />
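The shared skeleton of these grid-based methods (quantize, threshold, connect) can be sketched as follows; cell generation is deliberately the simplest possible (fixed hypercubic cells), unlike the hierarchical, recursive or wavelet-based schemes of the algorithms above. Names are ours:<br />

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Quantize points into hypercubic cells, keep 'dense' cells holding at
    least min_count objects, and merge face-adjacent dense cells into clusters.
    Points falling in sparse cells are labeled -1."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(int(c // cell_size) for c in p)].append(idx)
    dense = {c for c, members in cells.items() if len(members) >= min_count}
    labels, cluster = {}, 0
    for start in sorted(dense):               # flood-fill over dense cells
        if start in labels:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            c = stack.pop()
            for d in range(len(c)):
                for step in (-1, 1):          # face-adjacent neighbor cells
                    nb = c[:d] + (c[d] + step,) + c[d + 1:]
                    if nb in dense and nb not in labels:
                        labels[nb] = cluster
                        stack.append(nb)
        cluster += 1
    out = [-1] * len(points)
    for c, members in cells.items():
        if c in labels:
            for idx in members:
                out[idx] = labels[c]
    return out
```

Note that the cost of the connection step depends only on the number of dense cells, not on the number of objects, which is the main appeal of the grid-based family.<br />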
1.2.2 Evaluation of clustering processes<br />
According to the flow diagram depicted in figure 1.2, the final stage of knowledge discovery<br />
processes involves the user’s evaluation of the mined patterns (Fayyad, Piatetsky-Shapiro,<br />
and Smyth, 1996). When clustering is the central data mining task of the KD process, this<br />
implies validating the obtained clusters.<br />
As far as this issue is concerned, three distinct approaches can be followed depending<br />
on the reference used for evaluating the clustering solution:<br />
– the data itself, by determining whether the evaluated cluster structure provides a<br />
proper description of the data, which is measured by means of internal cluster validity<br />
indices.<br />
– a predefined and allegedly correct cluster structure, measuring its degree of resemblance<br />
to the obtained clustering solution by means of external cluster validity indices.<br />
– a clustering solution resulting from another clustering process (e.g. a distinct execution<br />
of the same clustering algorithm but using different parameters), measuring<br />
their relative merit so as to decide which one may best reveal the characteristics of<br />
the objects using relative cluster validity indices.<br />
All three types of evaluation criteria can be used for validating individual clusters,<br />
as well as the output of partitional and hierarchical clustering algorithms (Jain and Dubes,<br />
1988; Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and Vazirgiannis,<br />
2002b).<br />
As regards their applicability, only internal and relative evaluation criteria can be used<br />
to evaluate clustering solutions in real-life scenarios. This is due to the unsupervised<br />
nature of the clustering problem, as the class membership of objects is unknown in practice.<br />
However, in a research context –where ‘correct’ category labels presumably assigned by an<br />
expert to the objects in the data set are usually known, but not available to the clustering<br />
algorithm–, it is sometimes more appropriate to use external validity indices, since clusters<br />
are ultimately evaluated externally by humans (Strehl, 2002). For this reason, in this work<br />
we will make use of external evaluation criteria solely. However, there exist some recent<br />
efforts that aim to find correlations between internal and external cluster validity indices,<br />
such as (Ingaramo et al., 2008). Nevertheless, for further insight on internal and relative<br />
validity indices, see (Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and<br />
Vazirgiannis, 2002b; Maulik and Bandyopadhyay, 2002).<br />
Therefore, evaluation will consist of testing whether the clustering solution reflects the<br />
true group structure of the data set, captured in a reference clustering or ground truth4.<br />
A further advantage of this evaluation approach lies in the fact that external evaluation<br />
measures can be used to compare fairly the performance of clustering algorithms regardless<br />
of their foundations, as they make no assumption about the mechanisms used for finding<br />
the clusters (Strehl, 2002).<br />
There exist multiple ways of comparing a clustering solution to a ground truth. Quite<br />
obviously, different approaches must be followed depending on the nature of the clustering<br />
solution (i.e. whether it is hard or soft, hierarchical or partitional).<br />
As regards the evaluation of soft clustering solutions, the main difficulty lies not in the<br />
comparison process (see (Gopal and Woodcock, 1994; Jäger and Benz, 2000) for some classic<br />
approaches), but in the creation of a fuzzy ground truth, which may require applying an<br />
averaging scheme that accounts for systematic biases in the answers of the expert labelers<br />
(Jäger and Benz, 2000; Jomier, LeDigarcher, and Aylward, 2005).<br />
As far as the validation of hierarchical clustering solutions is concerned, it is necessary to<br />
use a hierarchical taxonomy as the ground truth. However, this type of ground truth is<br />
4 The unsupervised nature of clustering means that the performance of clustering algorithms cannot be<br />
judged with the same certitude as that of supervised classifiers, as the external categorization (ground truth)<br />
might not be optimal. For instance, the way web pages are organized in the Yahoo! taxonomy is possibly not<br />
the best possible structure, but achieving a grouping similar to the Yahoo! taxonomy is certainly indicative<br />
of successful clustering (Strehl, 2002).<br />
not always available, as some domains are more prone to be organized hierarchically than<br />
others. Possibly due to this fact, not much research has been done on external hierarchical<br />
clustering evaluation (Patrikainen and Meila, 2005). Some examples of the few existing<br />
hierarchical clustering comparison methods are simple layer-wise comparison (Fowlkes and<br />
Mallows, 1983) and cophenetic matrices (Theodoridis and Koutroumbas, 1999), although<br />
they also have their shortcomings (Patrikainen and Meila, 2005). For this reason, the most<br />
widespread strategy is to compare the clusterings found at a certain level of the dendrogram<br />
with a partitional ground truth. Unfortunately, this approach does not take into account the<br />
cluster hierarchy in any way, which is clearly insufficient if the hierarchical clustering<br />
solution is to be validated as a whole.<br />
Allowing for all these considerations, and provided that the outputs of soft and hierarchical<br />
clustering algorithms can always be converted to hard and partitional clustering<br />
solutions, respectively (see section 1.2.1), the most common cluster validation procedure<br />
consists in comparing hard partitional clustering solutions (i.e. label vectors) with the<br />
same type of ground truths (i.e. comparing cluster labels with class labels) (Strehl, 2002).<br />
The following paragraphs are devoted to a description of some relevant external validity<br />
indices for evaluating hard partitional clustering solutions.<br />
The multiple possible ways for comparing the class labels contained in the ground truth<br />
label vector γ and the cluster labels in a label vector λ can be categorized into two groups<br />
depending on whether they are based on i) object pairwise matching, or ii) cluster matching.<br />
Object pairwise matching cluster validity indices are based on counting how many object<br />
pairs (xi, xj), ∀ i ≠ j, are clustered together and separately in both γ and λ (the more coincidences,<br />
the higher the similarity between the clustering solution and the ground truth).<br />
Following this rationale, several validity indices have been proposed, such as the Rand index<br />
(Rand, 1971), the Adjusted Rand index (Hubert and Arabie, 1985), the Fowlkes-Mallows<br />
index (Fowlkes and Mallows, 1983) or the Jaccard index, among others—see (Halkidi, Batistakis,<br />
and Vazirgiannis, 2002a; Denoeud and Guénoche, 2006).<br />
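As an example of the pairwise-counting rationale, the original Rand index reduces to a few lines (a direct O(n²) sketch; the function name is ours):<br />

```python
from itertools import combinations

def rand_index(gamma, lam):
    """Fraction of object pairs treated consistently by both labelings:
    grouped together in both, or separated in both (Rand, 1971)."""
    pairs = list(combinations(range(len(gamma)), 2))
    agree = sum((gamma[i] == gamma[j]) == (lam[i] == lam[j]) for i, j in pairs)
    return agree / len(pairs)
```

Because only co-membership of pairs is compared, the index is invariant to permutations of the cluster labels, which is exactly what comparing a clustering against class labels requires.<br />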
Cluster matching cluster validity indices measure the degree of agreement between the<br />
assignment of objects to classes (according to γ) and clusters (as designated by λ). Typical<br />
examples of this kind of validity indices are the Larsen index (Larsen and Aone, 1999), the<br />
Van Dongen index (van Dongen, 2000), variation of information (Meila, 2003), entropy or<br />
mutual information (Cover and Thomas, 1991), to name a few.<br />
In all the experimental sections of this work, the cluster validity index used for evaluating<br />
clustering results is normalized mutual information, denoted as φ^(NMI). This choice is<br />
motivated by the fact that φ^(NMI) is theoretically well-founded, unbiased, symmetric with<br />
respect to λ and γ, and normalized to the [0, 1] interval (the higher the value of φ^(NMI), the<br />
more similar λ and γ are) (Strehl, 2002). Mathematically, normalized mutual information<br />
is defined as follows:<br />
$$
\phi^{(\mathrm{NMI})}(\gamma, \lambda) =
\frac{\displaystyle\sum_{h=1}^{k} \sum_{l=1}^{k} n_{h,l} \log\!\left( \frac{n \, n_{h,l}}{n_h^{(\gamma)} \, n_l^{(\lambda)}} \right)}
{\sqrt{ \left( \displaystyle\sum_{h=1}^{k} n_h^{(\gamma)} \log \frac{n_h^{(\gamma)}}{n} \right) \left( \displaystyle\sum_{l=1}^{k} n_l^{(\lambda)} \log \frac{n_l^{(\lambda)}}{n} \right) }}
\qquad (1.8)
$$<br />
where n_h^(γ) is the number of objects in cluster h according to γ, n_l^(λ) is the number of objects<br />
in cluster l according to λ, n_{h,l} denotes the number of objects in cluster h according to γ as<br />
well as in group l according to λ, n is the number of objects contained in the data set, and<br />
k is the number of clusters into which objects are clustered according to λ and γ (Strehl,<br />
2002).<br />
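Equation (1.8) translates directly into code; the following sketch (ours) computes φ^(NMI) from two label vectors using the counts n_h^(γ), n_l^(λ) and n_{h,l} defined above:<br />

```python
import math
from collections import Counter

def nmi(gamma, lam):
    """phi^(NMI)(gamma, lambda) as in equation (1.8): the mutual information of
    the two labelings, normalized by the geometric mean of their entropies."""
    n = len(gamma)
    n_g = Counter(gamma)             # n_h^(gamma): class sizes in the ground truth
    n_l = Counter(lam)               # n_l^(lambda): cluster sizes in the solution
    n_gl = Counter(zip(gamma, lam))  # n_{h,l}: co-occurrence counts
    num = sum(c * math.log(n * c / (n_g[h] * n_l[l]))
              for (h, l), c in n_gl.items())
    den_g = sum(c * math.log(c / n) for c in n_g.values())
    den_l = sum(c * math.log(c / n) for c in n_l.values())
    if den_g == 0 or den_l == 0:     # a one-cluster labeling carries no information
        return 0.0
    return num / math.sqrt(den_g * den_l)
```

For identical labelings (up to a permutation of the cluster labels) the score is 1, while for independent labelings it approaches 0.<br />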
Thus, the more similar the clustering solution represented by the label vector λ and<br />
the ground truth γ, the closer φ^(NMI)(γ, λ) will be to 1. As the ground truth γ is assumed<br />
to represent the true partition of the data, high quality clusterings will attain φ^(NMI)(γ, λ)<br />
values close to unity. As a consequence, given two label vectors λ1 and λ2, the former will<br />
be considered to be better than the latter if φ^(NMI)(γ, λ1) > φ^(NMI)(γ, λ2), and vice versa.<br />
1.3 Multimodal clustering<br />
The ubiquity of multimedia data has motivated an increasing interest in clustering techniques<br />
capable of dealing with multimodal data. In the following paragraphs we review<br />
some of the most relevant works on clustering multimodal data.<br />
Possibly one of the first works that mentioned the multimedia clustering problem was<br />
(Hinneburg and Keim, 1998). The authors placed special emphasis on highlighting the two<br />
main challenges faced by clustering algorithms in this context: the high dimensionality<br />
of the feature vectors and the existence of noise. To tackle these problems, the authors<br />
proposed DENCLUE, a density-based clustering algorithm capable of dealing satisfactorily<br />
with both issues. However, in that work, multimodality seemed to be more of a pretext to<br />
justify the challenges of clustering high dimensional noisy data than an interest in itself.<br />
This was not the case for the browsing and retrieval system for collections of text-annotated<br />
web images presented in (Chen et al., 1999), which was a multimodal extension<br />
of the Scatter-Gather document browser of (Cutting et al., 1992). In this case, multiple<br />
clusterings were created upon text and image features independently. In particular, clustering<br />
on image features was employed as part of a query refinement process. Therefore, this<br />
proposal is multimodal in the sense that the features of the distinct modalities are employed<br />
for clustering the image collection, but, still, they were not fully integrated in the clustering<br />
process.<br />
In contrast, the true multimodality of the clustering approach proposed in (Barnard<br />
and Forsyth, 2001) is guaranteed by modeling the probabilities of word and image feature<br />
occurrences and co-occurrences. It consisted of a statistical hierarchical generative model<br />
fitted with the EM algorithm, which organizes image databases using both image features<br />
and their associated text. In subsequent works, the learnt joint distribution of image regions<br />
and words was exploited in several applications, such as the prediction of words associated<br />
with whole images or with image regions (Barnard et al., 2003).<br />
Multimodal clustering has also been applied to the discovery of perceptual clusters for<br />
disambiguating the semantic meaning of text annotated images (Benitez and Chang, 2002).<br />
To do so, the images are clustered based on the visual or the text feature descriptors. Moreover,<br />
the system could also conduct multimodal clustering upon any combination of text<br />
and visual feature descriptors by conducting an early fusion of these. Principal Component<br />
Analysis was used to integrate and to reduce the dimensionality of feature descriptors<br />
before clustering. As regards the results obtained by multimodal clustering, the authors<br />
highlighted the uncorrelatedness of visual and text feature descriptors, which suggests that<br />
they should be integrated in the knowledge extraction process.<br />
In the multimodal clustering context, the field that has motivated the largest amount<br />
of research effort is the clustering of web image search results based not only on visual<br />
features, but also on the surrounding text and link information, since organizing the<br />
results into distinct semantic clusters may facilitate user browsing (Cai et al., 2004).<br />
In that work, each image returned by the search engine is represented using three kinds<br />
of information: visual information, textual information and link information (text and link<br />
data are recovered from the surroundings of the image). The rationale of this approach is<br />
based on the fact that the textual and link based representations can reflect the semantic<br />
relationship of images better than visual features. The proposed system implements a<br />
two level clustering algorithm: in the first level, clustering is conducted using the textual<br />
and link representation of images (separately or jointly). In the second level, clustering<br />
is conducted on the images assigned to each cluster resulting from the previous stage. In<br />
this case, low level visual features are employed to re-organize the images in the first level<br />
clusters, so as to group visually similar images and facilitate user browsing.<br />
A second paper dealing with web image search results clustering was (Gao et al., 2005).<br />
In that work, a tripartite graph was used to model the relations among low-level features,<br />
images and their surrounding texts. Thus, the method was formulated as a constrained<br />
multiobjective optimization problem, which can be efficiently solved by semi-definite programming.<br />
In a similar context, clustering was applied for image sense discrimination for web images<br />
retrieved from ambiguous keywords (Loeff, Ovesdotter-Alm, and Forsyth, 2006). Its goal<br />
was presenting the image search results in semantically sensible clusters for improved image<br />
browsing. To do so, spectral clustering was applied on multimodal features: simple local and<br />
global image features, and a bag of words representation of the text in the embedding web<br />
page. Multimodal fusion was achieved by combining pairwise object similarities measured<br />
on both image and textual features in the graph affinity matrix of the spectral clustering<br />
algorithm.<br />
Finally, the notion that each of the multiple modalities in a multimedia collection contributes<br />
its own perspective to the collection's organization was the driving force behind the<br />
proposal in (Bekkerman and Jeon, 2007). That work presents the Comraf* model, a lightweight<br />
version of combinatorial Markov random fields. In Comraf*, multimodal clustering<br />
is faced as the problem of simultaneously constructing a partition of each data modality.<br />
By clustering modalities simultaneously, the statistical sparseness of the data representation<br />
can be overcome, obtaining a dense and smooth joint distribution of the modalities.<br />
However, not every modality has to be clustered, as long as the so-called target modality<br />
is.<br />
The reader interested in multimedia indexing and retrieval is referred to the recent<br />
and complete survey of (Chen, 2006), although it is mainly focused on text plus image<br />
modalities.<br />
1.4 Clustering indeterminacies<br />
As mentioned at the end of section 1.1, the accomplishment of a knowledge discovery process<br />
requires making several critical decisions at each of its stages, which may have to be<br />
re-executed if the user is not satisfied with the evaluation of the mined patterns. In the case<br />
that clustering is the data mining task of the knowledge discovery process, these important<br />
decisions are often made blindly, due to the unsupervised nature of the clustering problem.<br />
Unfortunately, these decisions determine, to a large extent, the effectiveness of the clustering<br />
task (Jain, Murty, and Flynn, 1999), so they should not be made carelessly.<br />
Thus, obtaining a good quality clustering solution relies heavily on making optimal<br />
(or quasi-optimal) decisions at every stage of the KD process. The doubts that seize<br />
clustering practitioners at the time of making such decisions are caused by what we call<br />
clustering indeterminacies, which mainly concern the selection of i) the way objects are<br />
represented, and ii) the clustering algorithm to be applied.<br />
As regards the decision on data representation, ideal features should permit distinguishing<br />
objects belonging to different clusters, besides being robust to noise, easy to extract and<br />
interpret (Xu and Wunsch II, 2005). In a blind quest for finding such a data representation,<br />
the clustering practitioner is struck by the following questions:<br />
– how should the objects be represented? Should we stick to their original representation,<br />
select a subset of the original attributes (i.e. feature selection) or transform<br />
them into a new feature space (i.e. feature extraction)?<br />
– if the original data representation is subject to a dimensionality reduction process,<br />
which should be the dimensionality of the reduced feature space?<br />
– if the original data representation is subject to a feature selection process, which<br />
criterion should be followed?<br />
– if feature extraction is applied, which criterion should guide it?<br />
Regrettably, whereas these questions are easy to answer in a supervised classification<br />
scenario (e.g. the optimal feature subset can be chosen by maximizing some function of<br />
predictive classification performance (Kohavi and John, 1998) or by applying a feature<br />
transformation driven by class labels (Torkkola, 2003)), they have no clear or universal<br />
answer in an unsupervised context. This is due to the fact that, in clustering, the lack of<br />
class labels makes feature selection a necessarily ad hoc and often trial-and-error process (Dy<br />
and Brodley, 2004). Moreover, studies comparing the influence of object representations<br />
based on feature extraction on clustering performance often come up with contradictory<br />
conclusions (Tang et al., 2004; Shafiei et al., 2006; Cobo et al., 2006; Sevillano, Alías, and<br />
Socoró, 2007b).<br />
To illustrate the effect and importance of the data representation clustering indeterminacy,<br />
in the following paragraphs we present experimental evidence that the selection of<br />
a specific object representation can condition the quality of a clustering process to a large<br />
extent. In particular, we have represented the objects contained in the Wine and the miniNG<br />
data collections using multiple data representations: the original attributes (referred<br />
to as baseline) and feature extraction-based representations —obtained by means of Principal<br />
Component Analysis (PCA), Independent Component Analysis (ICA), Non-Negative<br />
Matrix Factorization (NMF) and Random Projection (RP)— on a range of distinct dimensionalities.<br />
Upon each object representation, we have applied a refined repeated bisecting<br />
clustering algorithm based on the correlation similarity measure for obtaining a partition<br />
[Two line plots of φ^(NMI) (vertical axis) against the number of dimensions (horizontal axis) for the<br />
Baseline, PCA, ICA, NMF and RP representations.]<br />
Figure 1.5: Illustration of the data representation indeterminacy on the clustering results<br />
of the (a) Wine and (b) miniNG data sets clustered by the rbr-corr-e1 algorithm.<br />
of the data5. In all cases, the desired number of clusters k is set to the number of classes<br />
defined by the ground truth of each data set.<br />
The results of these clustering processes are presented in figure 1.5, which displays<br />
the normalized mutual information (φ^(NMI)) values between these clustering solutions and<br />
the ground truth of each data collection. It can be observed that, in the Wine data set<br />
(figure 1.5(a)), the clustering solution obtained operating on the original representation<br />
is worse (i.e. it attains a lower φ^(NMI) score) than all but three of the feature extraction-<br />
based representations. In particular, the maximum value of φ^(NMI) is attained using an<br />
8-dimensional PCA transformation of the original data. However, these results cannot be<br />
generalized. In fact, it is the baseline object representation that yields the best results<br />
when the same clustering algorithm is applied on the miniNG data set (see figure 1.5(b)).<br />
If we analyze the data representation that yields the best clustering results across the<br />
12 unimodal data sets described in appendix A.2, we observe a rather even distribution:<br />
baseline (23% of the time), LSA (31%), NMF (31%) and RP (15%), which somehow<br />
reinforces the notion that no intrinsically superior data representation exists—see appendix<br />
B.1 for more experimental results regarding data representation clustering indeterminacies.<br />
Moreover, notice the remarkable influence of the data representation dimensionality on<br />
the value of φ^(NMI): it is not only important to select the right type of representation,<br />
but its dimensionality is also a critical factor. Although there exist several approaches for<br />
determining the most natural dimensionality of the data (dnat) –such as spectrum of eigenvectors<br />
in PCA (Duda, Hart, and Stork, 2001) or reconstruction error in NMF (Sevillano<br />
et al., 2009)–, it is not trivial to ensure that any clustering algorithm will yield its best<br />
performance when operating on the dnat-dimensional representation of the data.<br />
To sum up, this modest example demonstrates the relevance of the data representation<br />
indeterminacy, as an incorrect choice of data representation may ruin the results<br />
of a clustering process.<br />
5 For a detailed description of the clustering algorithms, the data sets and the data representations employed<br />
in the experimental sections of this thesis, see appendices A.1, A.2 and A.3, respectively.<br />
The other major source of indeterminacy is the selection of the particular clustering<br />
algorithm to apply. In this sense, there are several critical questions that must be answered:<br />
– what type of algorithm should we apply? Hierarchical or partitional? Hard or soft?<br />
– once the type of clustering algorithm is selected, which specific clustering algorithm<br />
should be applied?<br />
– how should the parameters of the clustering algorithm be tuned?<br />
– in how many clusters should the data objects be grouped?<br />
As far as the selection of the type of clustering algorithm is concerned, it depends on<br />
the desired shape of the clustering solution. In any case, it is worth noting that soft and<br />
hierarchical clustering algorithms can be regarded as a generalization of their hard and<br />
partitional counterparts, as the latter can always be obtained from the former.<br />
Moreover, it is commonly accepted that no universally superior clustering algorithm exists,<br />
as most of the proposals found in the literature have been designed to solve particular<br />
problems in specific fields (Xu and Wunsch II, 2005), thus being able to outperform the<br />
other existing algorithms in a concrete context, but not in others (Jain, Murty, and Flynn,<br />
1999). This fact has been theoretically analyzed and demonstrated by the impossibility<br />
theorem in (Kleinberg, 2002). Thus, unless some domain knowledge enables clustering<br />
practitioners to choose a specific algorithm, this selection is often made blindly to a large<br />
extent.<br />
Once the algorithm is chosen, its parameters must be set. Again, this is not a trivial<br />
choice, as they largely determine its behaviour. Several examples concerning the sensitivity<br />
of some popular clustering algorithms to their parameter tuning are easy to find in the<br />
literature: for instance, there is no universal method for identifying the initial centroids<br />
of k-means. The EM clustering algorithm is highly sensitive to the selection of its initial<br />
parameters and the effect of a singular covariance matrix, as it can converge to a local<br />
optimum (Xu and Wunsch II, 2005). In combinatorial search based clustering, there exist<br />
no theoretical guidelines for selecting appropriate and effective parameters, while the selection<br />
of the graph Laplacian is a major issue that affects the performance of spectral clustering<br />
algorithms, just as selecting the width of the Gaussian kernels affects kernel-based<br />
clustering algorithms (Xu and Wunsch II, 2005).<br />
And finally, one has to decide the number of clusters k into which the objects must be<br />
grouped, as many clustering algorithms (e.g. partitional) require this value to be passed as<br />
one of their parameters. Unfortunately, in most cases the number of classes in a data set<br />
is unknown, so it is one more parameter to guess about. Moreover, determining which is<br />
the ‘correct’ number of clusters in a data set is an open question: in some cases, equally<br />
satisfying (though substantially different) clustering solutions can be obtained with different<br />
values of k for the same data, proving that the right value of k often depends on the scale<br />
at which the data is inspected (Chakaravathy and Ghosh, 1996).<br />
Notwithstanding, there exist several practical procedures for determining the most suitable<br />
value of k. Possibly, the most intuitive approach consists in visualizing the data set on<br />
a two-dimensional space, although this strategy is of little use for complex data sets (Xu and<br />
Wunsch II, 2005). Additionally, relative cluster validity indices (such as the Davies-Bouldin<br />
φ^(NMI)      direct-cos-i2   graph-cos-i2<br />
BBC          0.807 (P100)    0.603 (P83)<br />
PenDigits    0.658 (P84)     0.829 (P99)<br />
Table 1.1: Illustration of the clustering algorithm indeterminacy on the BBC and PenDigits<br />
data sets clustered by the direct-cos-i2 and graph-cos-i2 algorithms.<br />
index (Davies and Bouldin, 1979), Dunn’s index (Dunn, 1973), the Calinski-Harabasz index<br />
(Calinski and Harabasz, 1974) or the I index (Maulik and Bandyopadhyay, 2002)) can be<br />
applied for determining the most appropriate number of clusters by comparing the relative<br />
merit of several clustering solutions obtained for distinct values of k (Halkidi, Batistakis,<br />
and Vazirgiannis, 2002b)—unfortunately, the performance of these indices is data dependent,<br />
which gives rise to a further indeterminacy (Xu and Wunsch II, 2005). And last, in<br />
the context of mixture densities based clustering, the number of clusters can be determined<br />
through the optimization of criterion functions such as the Akaike’s Information Criterion<br />
(Akaike, 1974) or the Bayesian Inference Criterion (Schwarz, 1978), among others.<br />
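As an illustration of how a relative validity index guides the choice of k, the following sketch (ours) computes the Davies-Bouldin index of a hard partition; in practice one clusters the data for several candidate values of k and keeps the value minimizing the index:<br />

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case ratio between
    the summed within-cluster scatters of two clusters and the distance between
    their centroids; lower is better, so candidate clusterings (e.g. obtained
    for different values of k) are ranked by minimizing it."""
    ks = sorted(set(labels))
    dim = len(points[0])
    cent, scat = {}, {}
    for c in ks:
        members = [p for p, l in zip(points, labels) if l == c]
        cent[c] = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        scat[c] = sum(math.dist(p, cent[c]) for p in members) / len(members)
    return sum(max((scat[i] + scat[j]) / math.dist(cent[i], cent[j])
                   for j in ks if j != i)
               for i in ks) / len(ks)
```

Compact, well-separated partitions score low, while partitions that split natural groups or merge distant ones score high, which is what makes the index usable as a proxy for the unknown number of classes.<br />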
In this work, the number of clusters is assumed to be known from the start of the<br />
clustering process, and k is set to be equal to the number of classes defined by the ground<br />
truth of each data set. However, in a real-world scenario, this parameter should be tuned<br />
using some of the previously mentioned techniques.<br />
To illustrate the clustering algorithm selection indeterminacy, table 1.1 presents the values of φ(NMI) (and their percentiles in the global φ(NMI) distribution of each case) obtained from evaluating the clustering solutions yielded by the graph-cos-i2 and direct-cos-i2 algorithms (graph-based and direct clustering algorithms using the cosine distance, respectively) operating on the baseline data representation of the objects contained in the BBC and PenDigits data sets⁶. As just mentioned, the desired number of clusters k is set to the number of classes in each data set. It can be observed that these two algorithms, despite using the same object similarity measure and optimizing the same criterion function, offer almost opposite performances on these two specific data collections, so no absolute claim about the superiority of either of them can be made (see appendix B.1 for more experimental results regarding the clustering algorithm selection indeterminacy).
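Since φ(NMI) is the evaluation measure used throughout this section, a minimal sketch of how such a score can be computed may help. We assume here the normalization by the geometric mean of the two entropies, as in Strehl and Ghosh (2002); the exact variant used in the thesis is specified in its evaluation appendices.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same
    objects, normalized by sqrt(H(A) * H(B)) (Strehl and Ghosh, 2002)."""
    n = len(labels_a)
    pa = Counter(labels_a)                 # cluster sizes in labeling A
    pb = Counter(labels_b)                 # cluster sizes in labeling B
    pab = Counter(zip(labels_a, labels_b)) # joint co-occurrence counts
    mi = sum((nab / n) * math.log((nab * n) / (pa[a] * pb[b]))
             for (a, b), nab in pab.items())
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

Note that the measure is invariant to label permutations: `nmi([1, 1, 2, 2], [5, 5, 7, 7])` equals 1.0, since both labelings describe the same partition under different symbolic labels.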
It is worth noticing that the decision problems caused by the previously described clustering<br />
indeterminacies are multiplied in the case of clustering multimodal data. In this<br />
context, besides the data representation and clustering algorithm selection indeterminacies,<br />
the clustering practitioner must face an additional set of questions with no clear answer,<br />
such as:<br />
– should one modality dominate the clustering process? If so, which one, and to what extent?
– should the modalities be fused? If so, how should the fusion process be conducted?<br />
To illustrate the effect of these and the aforementioned indeterminacies in a multimodal<br />
clustering scenario, we have conducted several clustering experiments on the multimodal<br />
data sets presented in appendix A.2.2.<br />
6 Refer to appendices A.1, A.2 and A.3 for a detailed description of the clustering algorithms, the data<br />
sets and the data representations employed in the experimental sections of this thesis.<br />
1.4. Clustering indeterminacies<br />
Data set      Best results     Mode #1            Mode #2            Multimodal
CAL500        φ(NMI)           0.411 (P100)       0.249 (P74)        0.310 (P88)
              algorithm        rbr-cos-i2         graph-cos-i2       graph-jacc-i2
              representation   RP, r=120          PCA, r=40          ICA, r=100
Corel         φ(NMI)           0.669 (P99)        0.270 (P40)        0.675 (P100)
              algorithm        rbr-corr-i2        rb-cos-e1          rbr-corr-i2
              representation   RP, r=300          NMF, r=400         NMF, r=550
InternetAds   φ(NMI)           0.430 (P100)       0.258 (P98)        0.319 (P99)
              algorithm        bagglo-cos-slink   agglo-corr-clink   graph-jacc-i2
              representation   RP, r=70           Baseline           NMF, r=150
IsoLetters    φ(NMI)           0.754 (P93)        0.537 (P67)        0.897 (P100)
              algorithm        rbr-corr-i2        graph-jacc-i2      rbr-corr-i2
              representation   Baseline           ICA, r=12          PCA, r=100

Table 1.2: Illustration of the clustering indeterminacies on the CAL500, Corel, InternetAds and IsoLetters multimodal data sets. Each column presents the top-performing clustering configuration for each separate modality and for the multimodal data representation.
The twenty-eight clustering algorithms employed in this work (see table A.1 for a quick<br />
reference) are run on i) each modality of the data set, and ii) on the multimodal representation<br />
obtained by early feature fusion, as described in appendix A.3.2. In both cases,<br />
clustering is run on the baseline and feature extraction based data representations (see<br />
section A.3.1 of appendix A).<br />
As a summary of the obtained results and an illustration of the clustering indeterminacies, table 1.2 presents the highest quality clustering results obtained in each case (i.e. when clustering is conducted on each separate mode (mode #1 and mode #2 columns) and on the multimodal representation), indicating the corresponding value of φ(NMI), its percentile in the global φ(NMI) distribution obtained, and the top-performing clustering configuration (i.e. algorithm plus data representation plus dimensionality of the reduced feature space r, when needed).
As expected, there is no predominant modality across all the data sets. For the CAL500<br />
collection, the clustering results obtained on mode #1 (text) are clearly superior to the rest.<br />
A similar behaviour is observed in the InternetAds data set, where it is again mode #1 (image size and aspect ratio, plus caption and alternate text in this case) that yields the best clustering results, although its predominance is not as clear as in CAL500. In contrast,
the highest quality clustering solutions are obtained from multimodal representations in the<br />
Corel and IsoLetters data sets.<br />
Last but not least, notice the effect of the data representation and clustering algorithm
indeterminacies on all the data sets. Again, there is no universally superior clustering<br />
algorithm nor data representation that guarantees the best clustering results. For a more<br />
detailed description of the experimental results regarding the clustering indeterminacies in<br />
multimodal data collections, see appendix B.2.<br />
1.5 Motivation and contributions of the thesis<br />
The main motivation of this thesis is the construction of an efficient multimodal clustering system that performs as autonomously as possible, avoiding the re-execution of the different stages of the knowledge discovery process. As these feedback loops are caused by suboptimal decision-making, our idea is to set clustering practitioners free from the obligation of making such critical decisions blindly, while obtaining, at the same time, clustering solutions that are robust to the clustering indeterminacies presented in the previous section.
Instead of being forced to blindly select a single clustering configuration, the user is<br />
encouraged to use and combine all the data modalities, representations and clustering algorithms<br />
at hand, generating as many individual clustering solutions (compiled into a cluster<br />
ensemble) as possible. It is then the proposed system which, in a fully unsupervised manner,
outputs a consensus clustering solution that will hopefully be comparable to (or even better<br />
than) the one achieved using the best clustering configuration among the available ones.<br />
As the informed reader may have guessed, the approach followed in the quest for this goal lies within the consensus clustering framework, which is defined as "the problem of
combining multiple partitionings of a set of objects into a single consolidated clustering<br />
without accessing the features or algorithms that determined these partitionings” (Strehl<br />
and Ghosh, 2002). That is, the data representations, modalities and clustering algorithms<br />
employed for generating the individual partitions are not of the system’s concern, as it will<br />
operate on the individual clustering solutions regardless of the way they were created.<br />
However, applying consensus clustering on cluster ensembles as a means for obtaining robust clustering solutions is not new; in fact, it has been a central or collateral matter in several works (Strehl and Ghosh, 2002; Fred and Jain, 2003; Sevillano et al., 2006a; Fern and Lin, 2008). Nevertheless, this thesis deals with several crucial and, to our knowledge, scarcely addressed issues in this context, such as:
– the computational burden imposed by the use of large cluster ensembles generated by<br />
crossing multiple data modalities, representations and clustering algorithms.<br />
– the quality decrease of the consensus clustering solution caused by the wide diversity<br />
of the cluster ensemble.<br />
– the application of cluster ensembles on the multimodal clustering problem.<br />
– the definition of methods for building consensus clustering solutions (either crisp or<br />
fuzzy) from the outputs of soft clustering algorithms.<br />
As a systematic response to these challenges, this thesis puts forward the following<br />
proposals:<br />
– parallelizable hierarchical consensus architectures for creating consensus clustering<br />
solutions in a computationally efficient way (see chapter 3).<br />
– fully unsupervised consensus self-refining procedures, so as to drive the quality of the consensus clustering solution near or even above that of the best available individual clustering configuration (see chapter 4).
[Figure 1.6 appears here: a block diagram in which the multimedia data set X undergoes object representation (driven by the representational and dimensional diversity factors dfR and dfD), hard/soft clustering (algorithmic diversity factor dfA) yielding the cluster ensemble E, a flat or hierarchical (serial/parallel) hard/soft consensus architecture outputting λc or Λc, and a consensus self-refining stage (governed by a percentage threshold %) outputting λc^final or Λc^final.]

Figure 1.6: Block diagram of the robust multimodal clustering system based on self-refining hierarchical consensus architectures.
– the construction of multimodal cluster ensembles and the application of self-refining<br />
hierarchical consensus architectures for robust multimodal clustering (see chapter 5).<br />
– consensus functions based on voting strategies for combining fuzzy partitions contained in soft cluster ensembles (see chapter 6).
These contributions can be articulated in a unitary proposal for robust multimodal<br />
clustering based on cluster ensembles, a block diagram of which is shown in figure 1.6. The<br />
procedure for deriving the partition of a multimodal data collection X according to our
proposal goes as follows: firstly, multiple representations of the objects contained in X are<br />
created by the application of a set of representational and dimensional diversity factors<br />
provided by the user (denoted as dfR and dfD in figure 1.6). Next, a set of either hard or<br />
soft clustering algorithms (referred to as the algorithmic diversity factor dfA) are applied on<br />
the distinct object representations obtained from the previous step, giving rise to a set of<br />
clusterings compiled in the cluster ensemble E. Notice that, up to this point, the only choices<br />
made by the user refer to the object representation techniques and clustering algorithms<br />
employed for creating the ensemble. As mentioned earlier, the user is encouraged to employ<br />
the widest possible range of diversity factors, thus creating maximally diverse clusterings<br />
so as to break free from the indeterminacies inherent to clustering. The admittedly high computational cost associated with this cluster ensemble generation strategy can be mitigated to some extent, given that it is a highly parallelizable process (Hore, Hall, and Goldgof, 2006).
Subsequently, the process for deriving the partition of the data set X upon the cluster<br />
ensemble E starts by applying a consensus clustering procedure. This can either be<br />
conducted according to a flat or a hierarchical consensus architecture, a decision that is<br />
automatically made by the system based on the characteristics of the data set X, the cluster<br />
ensemble E and the consensus function F employed for combining the clusterings in E<br />
—which is selected by the user. In case a hierarchical consensus architecture is employed,<br />
an additional decision (also made with no user supervision) is the one related to its serial or<br />
parallel execution, which ultimately depends on the availability of computational resources.<br />
As a result, a consensus clustering solution is obtained, which can either be represented<br />
by a consensus label vector λc or a consensus clustering matrix Λc, depending on whether<br />
a crisp or a fuzzy clustering approach is taken. Subsequently, this consensus clustering is<br />
subjected to an almost fully autonomous self-refining procedure, which requires the user<br />
to specify a percentage threshold (denoted by the symbol '%' in figure 1.6). Finally, the final partition of the data set X is obtained, denoted as λc^final (or Λc^final in the fuzzy case).
Before proceeding with the description of our proposals, the next chapter presents an<br />
overview of related work in the area of cluster ensembles.<br />
Chapter 2<br />
Cluster ensembles and consensus<br />
clustering<br />
In our quest for overcoming clustering indeterminacies in a multimodal context, the notions<br />
of cluster ensembles and consensus clustering play a central role. As mentioned at the<br />
end of chapter 1, our strategy for clustering multimodal data in a robust manner is based<br />
on the massive creation of multiple partitions of the target data set and the subsequent<br />
combination of these into a single consensus clustering solution. Therefore, an appropriate<br />
way to start this chapter is by formally defining the two closely related concepts of cluster ensembles and consensus clustering¹.
For starters, a cluster ensemble E is defined as the compilation of the outcomes of l clustering processes. For simplicity, we assume in this work that the l clustering processes group the data into the same number of clusters, namely k, although this is not a strictly necessary constraint². Depending on whether the clustering processes are crisp or fuzzy, E will be a hard or a soft cluster ensemble.
In the former case, E is mathematically defined as an l×n integer-valued matrix compiling l row label vectors λi (∀i ∈ [1,...,l]) resulting from the respective hard clustering processes (see equation (2.1)).
\[
\mathbf{E} = \begin{pmatrix} \boldsymbol{\lambda}_1 \\ \boldsymbol{\lambda}_2 \\ \vdots \\ \boldsymbol{\lambda}_l \end{pmatrix} = \begin{pmatrix} \lambda_{11} & \lambda_{12} & \cdots & \lambda_{1n} \\ \lambda_{21} & \lambda_{22} & \cdots & \lambda_{2n} \\ \vdots & & \ddots & \vdots \\ \lambda_{l1} & \lambda_{l2} & \cdots & \lambda_{ln} \end{pmatrix} \qquad (2.1)
\]
where λij ∈{1,...,k} (∀i ∈ [1,...,l], and ∀j ∈ [1,...,n]), i.e. each component of each<br />
1 In some works, the term ‘cluster ensemble’ is used to designate the framework for combining multiple<br />
partitionings obtained from separate clustering runs into a final consensus clustering (Strehl and Ghosh,<br />
2002; Punera and Ghosh, 2007). In this work, however, we stick to the literal meaning of this expression, and use it to designate the result of gathering several clustering solutions.
2 Since our goal is to combine partitions differing only in the way data are represented and clustered, we<br />
set the number of clusters k to be equal across the l clustering processes. However, combining clustering<br />
solutions with a variable number of clusters is a common practice in the cluster ensembles literature. This<br />
can be useful for clustering complex data sets upon simple individual partitions (Fred and Jain, 2005), or<br />
for discovering the natural number of clusters in the data set (Strehl and Ghosh, 2002), although these<br />
potentialities are not exploited in this work.<br />
[Figure 2.1 appears here: the scatterplot described in the caption below, with both axes spanning roughly −0.2 to 0.6.]

Figure 2.1: Scatterplot of an artificially generated two-dimensional toy data set containing n = 9 objects grouped into k = 3 natural clusters. Each object is identified by a numerical label.
labeling is an integer label identifying the cluster (out of the k available) to which each of the n objects in the data set is assigned.
For illustration purposes, and resorting to the toy clustering example presented in section 1.2.1, equation (2.2) presents a hard cluster ensemble created by compiling the outcomes of l = 3 independent runs of the k-means clustering algorithm on the two-dimensional data set presented in figure 2.1, which contains n = 9 objects, setting the desired number of clusters k equal to 3.

\[
\mathbf{E} = \begin{pmatrix} 1 & 1 & 1 & 3 & 3 & 3 & 2 & 2 & 2 \\ 2 & 2 & 2 & 1 & 1 & 1 & 3 & 3 & 3 \\ 2 & 2 & 2 & 3 & 3 & 3 & 1 & 1 & 1 \end{pmatrix} \qquad (2.2)
\]
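To make the construction of such an ensemble concrete, the following sketch (not the actual experimental toolchain of the thesis) compiles the label vectors of l = 3 plain k-means runs, with random centroid initialization as the diversity source, into an l × n list of lists playing the role of E in equation (2.1). The toy coordinates merely mimic the layout of figure 2.1 and are made up.

```python
import random

def kmeans(points, k, seed):
    """Plain k-means with random centroid initialization; the seed plays
    the role of the stochastic diversity factor across ensemble runs."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    labels = []
    for _ in range(50):  # fixed iteration budget, enough for this toy set
        # assignment step: nearest centroid, 1-based labels as in eq. (2.1)
        labels = [1 + min(range(k),
                          key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                            for d in range(len(p))))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c + 1]
            if members:
                centroids[c] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(len(members[0])))
    return labels

# toy data mimicking figure 2.1: n = 9 objects in k = 3 natural groups
points = [(0.0, 0.0), (0.05, 0.1), (0.1, 0.0),
          (0.5, 0.5), (0.45, 0.55), (0.55, 0.5),
          (0.0, 0.5), (0.05, 0.55), (0.1, 0.5)]
# hard cluster ensemble E: one row label vector per run, an l x n matrix
E = [kmeans(points, k=3, seed=s) for s in range(3)]
```

Because the cluster labels are purely symbolic, different runs may describe the same partition with permuted label numbers, exactly as in equation (2.2).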
On its part, a soft cluster ensemble E is defined as the compilation of the outcomes of l fuzzy clustering processes, and as such, it is mathematically expressed as a kl×n matrix, as presented in equation (2.3) (Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).

\[
\mathbf{E} = \begin{pmatrix} \boldsymbol{\Lambda}_1 \\ \boldsymbol{\Lambda}_2 \\ \vdots \\ \boldsymbol{\Lambda}_l \end{pmatrix} \qquad (2.3)
\]
where Λi is the k × n real-valued clustering matrix resulting from the ith soft clustering<br />
process (∀i ∈ [1,...,l]).<br />
Continuing with the same toy example, equation (2.4) presents a soft cluster ensemble<br />
created by collecting the outcomes of l = 3 independent executions of the fuzzy c-means<br />
clustering algorithm on the same data set as before. As k = 3, the first three rows of<br />
E correspond to the clustering probability membership matrix output by the first soft<br />
clustering process, the next three are the outcome of the second fuzzy clusterer, and so on.<br />
[Figure 2.2 appears here: the l = 3 label vectors of the ensemble E feeding a consensus function F, which outputs the consensus labeling λc = (1 1 1 2 2 2 3 3 3).]

Figure 2.2: Schematic representation of the obtention of a consensus labeling λc by applying a consensus function F on a hard cluster ensemble E containing l = 3 individual label vectors, with n = 9 and k = 3.
\[
\mathbf{E} = \begin{pmatrix}
0.921 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017 \\
0.920 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.054 & 0.042 & 0.057 & 0.025 & 0.019 & 0.030 & 0.976 & 0.929 & 0.972 \\
0.920 & 0.932 & 0.905 & 0.006 & 0.005 & 0.011 & 0.009 & 0.016 & 0.010 \\
0.025 & 0.026 & 0.038 & 0.969 & 0.976 & 0.959 & 0.014 & 0.055 & 0.017
\end{pmatrix} \qquad (2.4)
\]
Notice that, just as observed in section 1.2.1 regarding crisp and fuzzy clustering solutions, soft cluster ensembles can be converted to hard cluster ensembles by assigning each object to the cluster it is most strongly associated to. In fact, by doing so, the soft ensemble in equation (2.4) would be converted to the hard cluster ensemble in equation (2.2). Moreover, notice that the l = 3 components that make up both cluster ensembles are identical, given the symbolic nature of cluster labels.
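The soft-to-hard conversion just described is a per-column argmax over each Λi block of the kl×n matrix. A minimal sketch, using a made-up 2-run, k = 2, n = 3 ensemble rather than the one in equation (2.4), could read:

```python
def harden(soft_ensemble, k):
    """Convert a kl x n soft cluster ensemble (list of rows) into an l x n
    hard ensemble by assigning each object to its most probable cluster."""
    n = len(soft_ensemble[0])
    l = len(soft_ensemble) // k
    hard = []
    for i in range(l):
        block = soft_ensemble[i * k:(i + 1) * k]  # the k x n matrix Lambda_i
        # argmax down each column, emitting 1-based symbolic labels
        hard.append([1 + max(range(k), key=lambda c: block[c][j])
                     for j in range(n)])
    return hard

# hypothetical soft ensemble: l = 2 fuzzy clusterings, k = 2, n = 3
soft = [[0.9, 0.2, 0.1],
        [0.1, 0.8, 0.9],   # end of Lambda_1
        [0.3, 0.6, 0.2],
        [0.7, 0.4, 0.8]]   # end of Lambda_2
# harden(soft, 2) yields [[1, 2, 2], [2, 1, 2]]
```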
As for consensus clustering, it is defined as the process of obtaining a consolidated<br />
clustering solution through the application of a consensus function F on a cluster ensemble<br />
E (Strehl and Ghosh, 2002). In other words, consensus clustering can be regarded as the<br />
problem of combining several clustering solutions without accessing the features representing<br />
the clustered objects. Figure 2.2 depicts a schematic representation of a consensus clustering<br />
process conducted on the hard cluster ensemble resulting from our toy example. In this<br />
case, the result of the consensus clustering process is a consensus label vector λc which,<br />
quite obviously, represents the same partition as the individual label vectors that compose<br />
the cluster ensemble. However, in a real context, a higher degree of diversity among the<br />
clustering solutions embedded in the cluster ensemble can be expected (which, in fact, is<br />
desirable), a situation consensus clustering algorithms take advantage of for consolidating<br />
richer consensus clustering solutions (Pinto et al., 2007).<br />
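As a concrete, if simplistic, instance of a consensus function F, the sketch below follows the evidence accumulation idea of Fred and Jain (2003), cited above: it builds the co-association matrix of the ensemble and groups objects that co-cluster in a majority of the partitions. The majority cut used here is a simplification of the hierarchical clustering of the co-association matrix applied in the original method; the `consensus` helper name is ours.

```python
def consensus(E):
    """Evidence-accumulation-style consensus function: objects that
    co-cluster in a majority of the ensemble partitions end up in the
    same consensus cluster (connected components of the majority graph)."""
    l, n = len(E), len(E[0])
    # co-association: fraction of partitions placing objects i and j together
    together = [[sum(row[i] == row[j] for row in E) / l for j in range(n)]
                for i in range(n)]
    labels = [0] * n          # 0 marks "not yet assigned"
    next_label = 0
    for i in range(n):
        if labels[i] == 0:
            next_label += 1
            stack = [i]
            while stack:      # flood-fill one connected component
                u = stack.pop()
                if labels[u] == 0:
                    labels[u] = next_label
                    stack.extend(v for v in range(n)
                                 if labels[v] == 0 and together[u][v] > 0.5)
    return labels

# the hard cluster ensemble of equation (2.2)
E = [[1, 1, 1, 3, 3, 3, 2, 2, 2],
     [2, 2, 2, 1, 1, 1, 3, 3, 3],
     [2, 2, 2, 3, 3, 3, 1, 1, 1]]
# consensus(E) recovers the common partition: [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

On the toy ensemble the result trivially matches each input partition (up to label permutation), as discussed above; on diverse real ensembles the co-association values fall strictly between 0 and 1 and the combination becomes informative.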
Quite obviously, the design of the consensus function F is a central issue as regards<br />
consensus clustering. Most works in the consensus clustering literature focus on combining<br />
the outcomes of hard clustering processes (as in the example depicted in figure 2.2), although<br />
some consensus functions can be applied to either hard or soft cluster ensembles alike, possibly after introducing some minor modifications (Strehl and Ghosh, 2002; Fern and Brodley, 2004; Lange and Buhmann, 2005). However, little effort has been devoted to the design of specific consensus functions for soft cluster ensembles that generate
fuzzy consensus clustering solutions (Dimitriadou, Weingessel, and Hornik, 2002; Punera<br />
and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).<br />
Regardless of whether the cluster ensemble is hard or soft, combining the results of<br />
several clustering processes has multiple applications, a good description of which can be<br />
found in (Strehl and Ghosh, 2002). In a nutshell, consensus clustering is useful for:<br />
– knowledge reuse: in some scenarios, one may want to create a partition of a set of<br />
objects, but the access to the original data may be restricted due to copyright or privacy<br />
reasons (customer databases are the most prototypical examples of this type of<br />
situation). However, if a set of legacy partitions of the data exists (e.g. segmentations of a customer database based on distinct criteria, such as residence, purchasing patterns, age, etc.), consensus clustering provides a means for reconciling the knowledge contained in those legacy clusterings.
– distributed clustering: due to security or operational reasons, there exist situations in<br />
which the data to be clustered is scattered across different locations. In this context,<br />
as an alternative to gathering and processing all the data at one site –which can be<br />
unfeasible, for instance, due to storage costs–, the data available at each location<br />
would be subject to a clustering process, and the label vectors obtained would be<br />
combined by means of consensus clustering, yielding a consolidated classification of<br />
the data.<br />
– robust clustering: in this case, the goal is to obtain a consensus clustering solution<br />
that improves the quality of the component clusterings, based on the fact that if the<br />
distinct clustering processes disagree, combining their outcomes may offer additional information and discriminatory power, thus yielding a better combined clustering, closer to a hypothetical true classification (Pinto et al., 2007).
It is in this latter application that consensus clustering can be more clearly regarded as<br />
the unsupervised counterpart of classifier committees, as the objective of both strategies is to<br />
combine the outcomes of several classification processes aiming to improve the quality of the<br />
component classifiers (Dietterich, 2000). However, the purely symbolic nature of the labels<br />
returned by unsupervised classifiers makes consensus clustering a more challenging task.<br />
Possibly due to this fact, consensus clustering has historically been far less popular than classifier committees, and it has only begun to draw considerable attention from researchers during the last decade.
In the quest for obtaining good quality consensus clustering solutions, the design of<br />
both the cluster ensemble and the consensus function are of critical importance. Although<br />
having a cluster ensemble is always necessary in order to conduct consensus clustering, some works focus mainly on the design of the consensus function, relegating the construction of the cluster ensemble to a secondary role, and vice versa. Given the importance of both elements, we split the
revision of the related work in this field into two separate parts, devoting section 2.1 to the<br />
previous work regarding the construction of cluster ensembles and section 2.2 to overview<br />
the existing approaches to the design of consensus functions.<br />
2.1 Related work on cluster ensembles<br />
Our aim in this section is to review the strategies applied in the literature as regards the<br />
construction of cluster ensembles, given its influence on the consensus clustering process results.<br />
Two alternative approaches have been traditionally followed in this context, differing<br />
in the number of distinct clustering algorithms used for generating the individual partitions<br />
in the ensemble.<br />
The first cluster ensemble creation strategy consists of compiling the outcomes of multiple<br />
runs of a single clustering algorithm, which gives rise to what is known as a homogeneous<br />
cluster ensemble (Hadjitodorov, Kuncheva, and Todorova, 2006). In this case, the diversity<br />
of the ensemble components can be induced by several means, often in a combined manner:<br />
– application of a stochastic clustering algorithm: this strategy relies on the fact that<br />
the outcome of a stochastic clustering algorithm depends on how its parameters are<br />
adjusted. For instance, diverse clustering solutions can be obtained by the random<br />
initialization of the starting centroids of k-means (Fred, 2001; Fred and Jain, 2002a;<br />
Fred and Jain, 2003; Dimitriadou, Weingessel, and Hornik, 2001; Greene et al., 2004;<br />
Long, Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Kuncheva, Hadjitodorov,<br />
and Todorova, 2006; Li, Ding, and Jordan, 2007; Nguyen and Caruana, 2007; Ayad<br />
and Kamel, 2008; Fern and Lin, 2008) or fuzzy c-means (Dimitriadou, Weingessel,<br />
and Hornik, 2002), or the initial settings of EM clustering (Punera and Ghosh, 2007;<br />
Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b).<br />
– random number of clusters: in this case, at each run of the clustering algorithm,<br />
the number of clusters to be found is set randomly (Fred and Jain, 2002b; Fred and<br />
Jain, 2005; Topchy, Jain, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova,<br />
2006; Hadjitodorov and Kuncheva, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and<br />
Turmo, 2008b; Ayad and Kamel, 2008). In general terms, this number of clusters<br />
is usually set to be much larger than the expected number of categories in the data<br />
set (Dimitriadou, Weingessel, and Hornik, 2001; Fred and Jain, 2002a), being often<br />
selected at random from a predefined interval (Long, Zhang, and Yu, 2005; Hore, Hall,<br />
and Goldgof, 2006).<br />
– distinct object representations: another source of diversity lies in the way objects<br />
are represented. Indeed, as we showed in section 1.4, running the same clustering<br />
algorithm on distinct representations of the same data set often leads to pretty diverse<br />
clustering solutions. Allowing for this fact, cluster ensembles have been created by<br />
running a single clustering algorithm on different data representations generated by<br />
random feature selection (Agogino and Tumer, 2006; Hadjitodorov and Kuncheva,<br />
2007; Fern and Lin, 2008), random feature extraction (Greene et al., 2004; Long,<br />
Zhang, and Yu, 2005; Hore, Hall, and Goldgof, 2006; Hadjitodorov and Kuncheva,<br />
2007; Fern and Lin, 2008) or deterministic feature extraction (Sevillano et al., 2006a;<br />
Sevillano et al., 2006b; Sevillano et al., 2007a; Sevillano, Alías, and Socoró, 2007b).<br />
– data subsampling: the creation of multiple clustering solutions upon distinct random<br />
subsamples of the data set has been applied as a means for generating diverse cluster<br />
ensembles (Fischer and Buhmann, 2003; Dudoit and Fridlyand, 2003; Minaei-Bidgoli,<br />
Topchy, and Punch, 2004; Kuncheva, Hadjitodorov, and Todorova, 2006; Punera and<br />
Ghosh, 2007).<br />
– weak clustering: another approach to the generation of homogeneous cluster ensembles is the repeated application of computationally cheap and conceptually simple clustering procedures that, although yielding poor clustering solutions by themselves (this is why they are said to be weak), may lead to better data clustering if combined. This type of strategy is of special interest when clustering high dimensional and/or large data collections, as deriving multiple partitions by traditional means may become too costly (Fern and Brodley, 2003). Examples of this include using random hyperplanes for splitting the data (Topchy, Jain, and Punch, 2003) or prematurely halted executions of k-means (Hadjitodorov and Kuncheva, 2007).
– noise injection: the random perturbation of the representation of the objects (Hadjitodorov,<br />
Kuncheva, and Todorova, 2006) or the labels contained in the individual<br />
clustering solutions (Hadjitodorov and Kuncheva, 2007) through noise injection has<br />
also been applied in a few research works, although these strategies constitute a far<br />
less natural way of creating diverse cluster ensembles if compared to the previous ones.<br />
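As an illustration of the weak clustering strategy listed above, the following sketch partitions data with random hyperplanes in the spirit of Topchy, Jain, and Punch (2003): each object's cluster label is the pattern of sides it falls on with respect to a handful of random hyperplanes. It is a minimal illustration on made-up data, not the thesis setup.

```python
import random

def random_hyperplane_partition(points, n_planes, rng):
    """Weak clusterer: n_planes random hyperplanes split the space, and the
    side pattern of each object becomes its (purely symbolic) cluster label,
    so up to 2**n_planes clusters can appear."""
    dim = len(points[0])
    labels = [0] * len(points)
    for _ in range(n_planes):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random normal vector
        b = rng.uniform(-1.0, 1.0)                     # random offset
        for i, p in enumerate(points):
            side = sum(wd * pd for wd, pd in zip(w, p)) + b > 0
            labels[i] = labels[i] * 2 + int(side)      # append this bit
    return labels

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]
# homogeneous cluster ensemble from l = 10 cheap weak clusterings
E = [random_hyperplane_partition(points, 2, rng=rng) for _ in range(10)]
```

Each individual partition is very crude, but the ensemble as a whole captures neighbourhood structure that a consensus function can later exploit.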
The second approach for creating cluster ensembles consists of applying several distinct<br />
clustering algorithms for generating the individual components of the ensemble, which gives<br />
rise to what are known as heterogeneous cluster ensembles. If clustering algorithms with<br />
substantially different biases are employed, cluster ensembles with a high degree of diversity<br />
can be obtained. This strategy has been applied in several works, such as (Strehl and<br />
Ghosh, 2002; Lange and Buhmann, 2005; Gonzàlez and Turmo, 2006; Gionis, Mannila, and
Tsaparas, 2007; Gonzàlez and Turmo, 2008a; Gonzàlez and Turmo, 2008b).<br />
Notice that the strategies used for creating homogeneous and heterogeneous cluster<br />
ensembles can be combined so as to create even more diverse ensembles, as in (Sevillano et<br />
al., 2007c), where a set of clustering algorithms is applied on different representations (obtained by means of multiple feature extraction techniques with distinct dimensionalities) of the objects in the data set. In this work, we will follow this approach as regards the
of the objects in the data set. In this work, we will follow this approach as regards the<br />
generation of cluster ensembles, using the clustering algorithms and object representations<br />
described in appendices A.1 and A.3, as our goal is to overcome the indeterminacies resulting<br />
from the selection of a particular clustering configuration.<br />
There exist several recent works in the literature dealing with the design of the cluster<br />
ensemble. In general terms, they can be divided into two categories: i) those works focused<br />
on analyzing which strategies should be followed for creating cluster ensemble components<br />
that give rise to good quality consensus clustering solutions, and ii) those centered on<br />
obtaining a good quality consensus clustering given a particular cluster ensemble.<br />
Among the first group, we highlight the works by Kuncheva and Hadjitodorov. In (Hadjitodorov, Kuncheva, and Todorova, 2006), the authors analyze the diversity of the individual
partitions composing a hard cluster ensemble and its effect on the quality of the<br />
consensus clustering. To do so, several measures for evaluating the diversity of a cluster ensemble<br />
are proposed and evaluated. Moreover, such measures are employed in the derivation<br />
of a procedure for selecting the candidate with the median diversity among a population<br />
of cluster ensembles, a criterion that leads to the obtention of equal or better consensus<br />
clustering solutions than those obtained on arbitrarily chosen cluster ensembles.<br />
The notion that moderately diverse cluster ensembles lead to good quality consensus<br />
is reinforced by the experimental results presented in (Hadjitodorov and Kuncheva, 2007),<br />
where the authors apply a standard genetic algorithm for driving the selection of the cluster<br />
Chapter 2. Cluster ensembles and consensus clustering<br />
ensemble components, which are generated by random feature extraction and selection, weak<br />
clusterers, or random number of clusters, among others. Unfortunately, the fitness function<br />
driving the genetic algorithm used to select the best cluster ensemble is the classification<br />
accuracy of the respective ensemble with respect to the correct object labels (ground truth),<br />
which makes this strategy of limited use in practice.<br />
Another interesting work that delves into the heuristics of the cluster ensemble construction process is (Kuncheva, Hadjitodorov, and Todorova, 2006), in which the authors
recommend creating individual partitions with a variable number of clusters as a means for<br />
obtaining good quality consensus labelings.<br />
As mentioned earlier, an alternative way of obtaining a good quality consensus clustering<br />
solution is based not on designing the cluster ensemble components according to certain<br />
heuristics, but on refining its contents. The rationale of such strategies is based on the fact<br />
that the quality of the consensus clustering solution is penalized by the inclusion of poor<br />
individual clustering solutions in the ensemble. For this reason, it seems logical to develop<br />
techniques capable of discarding such cluster ensemble components prior to conducting the<br />
consensus clustering process.<br />
In this sense, the application of quality and/or diversity criteria for selecting a small<br />
subset of a large cluster ensemble was evaluated in (Fern and Lin, 2008) as a means for obtaining<br />
a consensus clustering solution that equals or betters the one that would be obtained<br />
using the whole ensemble. Pursuing the same goal, the authors of (Goder and Filkov, 2008)<br />
propose creating smaller subsets of the cluster ensemble that will yield better consensus.<br />
These mini-cluster ensembles are generated by clustering the individual partitions of the<br />
cluster ensemble using a hierarchical agglomerative average-link clustering algorithm.<br />
2.2 Related work on consensus functions<br />
The goal of this section is to review the state of the art in the area of consensus clustering.
Although a considerable corpus of theoretical work on combining classifications was<br />
developed in the 80s and earlier (e.g. (Mirkin, 1975; Barthélemy, Leclerc, and Monjardet,
1986; Neumann and Norton, 1986)), it was not until the start of the present decade that this field experienced a significant surge of activity.
Despite this relatively recent awakening, multiple approaches to the combination of<br />
several clusterings can be found in the literature. In general terms, consensus clustering<br />
can be posed as an optimization problem the goal of which is to minimize a cost function<br />
measuring the dissimilarity between the consensus clustering solution and the partitions in<br />
the cluster ensemble –often, the cost function is expressed in terms of the number of pairwise<br />
co-clustering disagreements between the individual partitions in the cluster ensemble and<br />
the consensus clustering solution. Unfortunately, finding the partition that minimizes the<br />
proposed symmetric difference distance metric (i.e. the so-called median partition) is an
NP-hard problem (Goder and Filkov, 2008), and this is the reason why it is necessary to<br />
resort to distinct heuristics so as to conduct clustering combination.<br />
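The cost function and the combinatorial explosion behind this NP-hardness can be made concrete with a small sketch (function names and the brute-force search are our own illustration; enumerating labelings is feasible only for toy problems, since the number of set partitions grows as the Bell number):

```python
from itertools import combinations, product

def disagreement(p, q):
    """Pairwise co-clustering disagreements between two labelings
    (the symmetric difference / Mirkin distance)."""
    return sum((p[i] == p[j]) != (q[i] == q[j])
               for i, j in combinations(range(len(p)), 2))

def consensus_cost(candidate, ensemble):
    """Total pairwise disagreement of a candidate consensus w.r.t. the ensemble."""
    return sum(disagreement(candidate, p) for p in ensemble)

def median_partition_bruteforce(ensemble):
    """Exact median partition by enumerating every labeling of the n objects.
    The n**n candidate labelings make this unusable beyond toy sizes, which is
    why practical consensus functions resort to heuristics."""
    n = len(ensemble[0])
    best, best_cost = None, float("inf")
    for labels in product(range(n), repeat=n):  # superset of all set partitions
        cost = consensus_cost(labels, ensemble)
        if cost < best_cost:
            best, best_cost = list(labels), cost
    return best, best_cost
```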
Aiming to provide the reader with a global perspective on the distinct existing approaches<br />
in this field, table 2.1 presents a taxonomy of some of the most relevant consensus<br />
functions according to the theoretical foundations that guide the consensus process. Notice
that some consensus functions appear under more than one theoretical approach, as some-<br />
times the limits between them are somewhat vague. Throughout the following paragraphs, the main features of these consensus functions are described.

Voting: VMA (Dimitriadou, Weingessel, and Hornik, 2002); BagClust1 (Dudoit and Fridlyand, 2003); URCV, RCV, ACV (Ayad and Kamel, 2008); also in (Fischer and Buhmann, 2003; Greene and Cunningham, 2006)
Graph partitioning: CSPA, HGPA, MCLA (Strehl and Ghosh, 2002); HBGF (Fern and Brodley, 2004); BALLS (Gionis, Mannila, and Tsaparas, 2007)
Object co-association: EAC (Fred and Jain, 2005); CSPA (Strehl and Ghosh, 2002); BagClust2 (Dudoit and Fridlyand, 2003); IPC (Nguyen and Caruana, 2007); BALLS (Gionis, Mannila, and Tsaparas, 2007); Majority Rule, CC Pivot (Goder and Filkov, 2008); also in (Greene et al., 2004)
Categorical clustering: QMI (Topchy, Jain, and Punch, 2003); ITK (Punera and Ghosh, 2007)
Probabilistic: EM (Topchy, Jain, and Punch, 2004); PLA (Lange and Buhmann, 2005); also in (Long, Zhang, and Yu, 2005; Li, Ding, and Jordan, 2007)
Reinforcement learning: (Agogino and Tumer, 2006)
Similarity as data: ALSAD, KMSAD, SLSAD (Kuncheva, Hadjitodorov, and Todorova, 2006)
Centroid based: IVC, IPVC, IPC (Nguyen and Caruana, 2007); also in (Hore, Hall, and Goldgof, 2006)
Correlation clustering: Agglomerative, Furthest, LocalSearch (Gionis, Mannila, and Tsaparas, 2007)
Search techniques: SA, BOEM (Goder and Filkov, 2008)
Cluster ensemble component selection: BestClustering (Gionis, Mannila, and Tsaparas, 2007); BOK (Goder and Filkov, 2008); also in (Fern and Lin, 2008)

Table 2.1: Taxonomy of consensus functions according to their theoretical basis.
2.2.1 Consensus functions based on voting<br />
The main idea underlying consensus functions based on voting strategies is the notion that<br />
objects assigned to a particular cluster by many partitions in the ensemble should also be<br />
located in that cluster according to the consensus clustering solution. One obvious way to<br />
achieve this goal is to consider cluster labels as votes, thus consolidating different clusterings<br />
by means of voting procedures. However, due to the symbolic nature of clusters (caused<br />
by the unsupervised nature of the clustering problem), it is necessary to disambiguate the<br />
clusters across the l components of the cluster ensemble prior to voting.<br />
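A minimal sketch of this two-step scheme follows, using a greedy maximum-overlap matching as a simple stand-in for an optimal label correspondence (e.g. the Hungarian method), followed by plurality voting; the function names are our own:

```python
from collections import Counter

def align_labels(reference, partition):
    """Greedily relabel `partition` so that each of its clusters takes the
    label of the reference cluster it overlaps most (a simple stand-in for
    optimal matching via the Hungarian method)."""
    mapping = {}
    for lab in set(partition):
        members = [i for i, l in enumerate(partition) if l == lab]
        overlap = Counter(reference[i] for i in members)
        mapping[lab] = overlap.most_common(1)[0][0]
    return [mapping[l] for l in partition]

def plurality_vote(ensemble):
    """Align every partition to the first one, then give each object the
    label it receives most often across the aligned partitions."""
    reference = ensemble[0]
    aligned = [reference] + [align_labels(reference, p) for p in ensemble[1:]]
    return [Counter(col).most_common(1)[0][0] for col in zip(*aligned)]
```

Note that the greedy matching may map two clusters onto the same reference label; the optimal one-to-one assignment avoids this, at the cost of solving an assignment problem per partition.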
One of the pioneering works in voting-based consensus clustering was the voting-merging<br />
algorithm (VMA) of Dimitriadou, Weingessel, and Hornik (2001). In that work, cluster<br />
disambiguation is conducted by matching those clusters sharing the highest percentage
of objects, iterating this cluster matching process across all the partitions in the cluster<br />
ensemble. As a result of the voting step, a fuzzy partition of the data set is obtained.<br />
Subsequently, a merging procedure is conducted on this soft partition, fusing those clusters<br />
which are closest to each other. By imposing a stopping criterion based on the sureness<br />
of the obtained clusters, this merging process is capable of finding the natural number<br />
of clusters in the data set. In (Dimitriadou, Weingessel, and Hornik, 2002), the authors<br />
define the consensus clustering solution as the one minimizing the average square distance<br />
with respect to all the partitions in the cluster ensemble. They demonstrate that obtaining
such consensus clustering boils down to finding the optimal re-labeling of the clusters of<br />
all the individual clusterings, which becomes an unfeasible problem if approached directly,<br />
since it requires an enumeration of all possible permutations. Therefore, they resort to<br />
the VMA consensus function of (Dimitriadou, Weingessel, and Hornik, 2001) for finding an<br />
approximate solution to the problem, extending its application to soft cluster ensembles.<br />
One of the two consensus functions proposed in (Dudoit and Fridlyand, 2003), called<br />
BagClust1, is based on applying plurality voting on the labelings in the cluster ensemble<br />
after a label disambiguation process based on measuring the overlap between clusters. The<br />
generation of the cluster ensemble components follows a resampling strategy similar to<br />
bagging, aiming to reduce the variability in the partitioning results via consensus clustering.<br />
A related proposal was the one presented in (Fischer and Buhmann, 2003). In that work,<br />
consensus clustering is viewed as a means for improving the quality and reliability of the<br />
results of path-based clustering, applying bagging for creating the hard cluster ensemble.<br />
The consensus clustering solution is obtained through a maximum likelihood mapping in<br />
which the label permutation problem is solved by means of the Hungarian method (Kuhn,<br />
1955), which somehow resembles the application of plurality voting on the disambiguated<br />
individual partitions in the cluster ensemble (Ayad and Kamel, 2008). Moreover, a related<br />
reliability measure chooses the number of clusters with the highest stability as the preferable<br />
consensus solution.<br />
In (Greene and Cunningham, 2006), a majority voting strategy was applied to generate
the consensus clustering solution, after disambiguating the clusters using the Hungarian<br />
algorithm. An additional interest of that work is that it was one of the first research<br />
efforts that considered the problem of creating and combining a large number of clustering<br />
solutions in the context of high dimensional data sets (such as text document collections).<br />
Indeed, the authors point out that using large ensembles boosts computational cost, while<br />
small ensembles tend to produce unstable consensus clustering solutions. In this context,<br />
the authors propose basing the cluster ensemble construction and the consensus clustering<br />
tasks on a prototype reduction technique that allows representing the whole data set by<br />
means of a minimal set of objects, ensuring that the clustering results will approximate<br />
those that would be obtained on the original data set. By doing so, the final clustering<br />
solution can be extended to those objects that have been left out of the reduced data set<br />
representation while alleviating the overall computational cost of the whole process. In particular,<br />
the reduced version of the cluster ensemble is obtained by projecting the pairwise<br />
object similarity matrix by means of a kernel matrix.<br />
The recent work of (Ayad and Kamel, 2008) presented a set of consensus functions based<br />
on cumulative voting –named URCV, RCV and ACV– whose time complexity scales linearly<br />
with the size of the data set. Another interesting feature is their capability of combining<br />
crisp partitions with different numbers of clusters, although the desired number of clusters
k in the consensus clustering solution is a necessary parameter for the execution of their<br />
consensus functions. The proposals presented in this work are based on the computation of a probabilistic mapping for solving the cluster correspondence problem (instead of the classic one-to-one cluster mapping), which allows combining partitions with different numbers of clusters while avoiding the addition of dummy clusters as in (Dimitriadou, Weingessel, and Hornik,
2002). In particular, three ways for deriving such probabilistic mapping based on the idea<br />
of cumulative voting are presented. The construction of the consensus clustering solution is<br />
a two-stage procedure: firstly, based on the cumulative vote mapping, a tentative consensus<br />
is derived as a summary of the cluster ensemble that maximizes the information content in terms of entropy; secondly, the final consensus clustering solution is extracted by applying an agglomerative clustering algorithm that minimizes the average
generalized Jensen-Shannon divergence within each cluster.<br />
As already mentioned, solving the cluster correspondence problem paves the way for the<br />
application of voting strategies for combining the outcomes of multiple clustering processes.<br />
This issue is the central focus of (Boulis and Ostendorf, 2004), which presented several<br />
methods for finding the correspondence between the clusters of the individual partitions in<br />
the cluster ensemble. Two of these proposals are based on linear optimization techniques,<br />
which are applied on an objective function that measures the degree of agreement among<br />
clusters. In contrast, the third cluster correspondence method is based on Singular Value<br />
Decomposition, and it sets cluster correspondences based on cluster correlation. All these<br />
methods operate on a common space where the clusters of the distinct partitions (which<br />
can be either crisp or fuzzy) are represented by means of cluster co-association matrices.<br />
2.2.2 Consensus functions based on graph partitioning<br />
The work by Strehl and Ghosh on consensus clustering based on graph partitioning is<br />
probably one of the most classic references in the field of cluster ensembles (Strehl and<br />
Ghosh, 2002). To our knowledge, they were the first to formulate the consensus clustering<br />
problem in an information theoretic framework –i.e. the consensus clustering solution<br />
should be the one maximizing the mutual information with respect to all the individual<br />
partitions in the cluster ensemble–, a path followed by other authors in subsequent works<br />
(Fred and Jain, 2003). In view of its prohibitive cost when formulated as a combinatorial<br />
optimization problem in terms of shared mutual information, the authors propose<br />
three clustering combination heuristics based on deriving a hypergraph representation of<br />
the cluster ensemble —all of which require the desired number of clusters k as one of their<br />
parameters. The first consensus function (called Cluster-based Similarity Partitioning Algorithm<br />
or CSPA) induces a pairwise object similarity measure from the cluster ensemble<br />
(as in (Fred and Jain, 2002a)), obtaining the consensus partition by reclustering the objects<br />
with the METIS graph partitioning algorithm (Karypis and Kumar, 1998). For this<br />
reason, we have enclosed the CSPA consensus function in both the graph partitioning and<br />
object co-association categories in table 2.1. In the second clustering combiner proposed<br />
in (Strehl and Ghosh, 2002) –named HGPA for HyperGraph Partitioning Algorithm–, the<br />
cluster ensemble problem is posed as the partitioning of a hypergraph, whose hyperedges represent clusters, into k unconnected components of approximately the same size, cutting
a minimum number of hyperedges. And the third consensus function (Meta-CLustering<br />
Algorithm or MCLA) views the clustering integration process as a cluster correspondence<br />
problem that is solved by identifying and consolidating groups of clusters (meta-clusters),<br />
which is done by applying graph-based clustering to hyperedges in the hypergraph representation<br />
of the cluster ensemble. In (Strehl and Ghosh, 2002), the authors apply the proposed<br />
consensus functions on hard cluster ensembles, suggesting that they could be extended to a<br />
fuzzy clustering integration scenario. Such extensions (in particular, the soft versions of the<br />
CSPA and MCLA consensus functions, sCSPA and sMCLA, respectively) were introduced<br />
in (Punera and Ghosh, 2007; Sevillano, Alías, and Socoró, 2007b).<br />
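The hypergraph representation shared by these heuristics can be sketched as a binary incidence matrix in which every cluster of every partition contributes one hyperedge (a simplified illustration of our own; the actual graph partitioning step, e.g. METIS, is omitted):

```python
def hypergraph_incidence(ensemble):
    """Binary incidence matrix H (n objects x total clusters): H[i][e] = 1
    iff object i belongs to hyperedge e, where every cluster of every
    partition in the ensemble contributes one hyperedge."""
    n = len(ensemble[0])
    H = [[] for _ in range(n)]
    for partition in ensemble:
        for lab in sorted(set(partition)):
            for i in range(n):
                H[i].append(1 if partition[i] == lab else 0)
    return H
```

For a hard ensemble of l partitions, the row-wise inner products of H divided by l recover the pairwise object similarity matrix on which CSPA operates, which is why CSPA appears in two rows of table 2.1.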
The clustering combination problem was also formulated as a graph partitioning problem<br />
in (Fern and Brodley, 2004). In particular, a bipartite graph is built from a hard cluster<br />
ensemble, although the authors suggest that the same method can be applied for combining<br />
soft clustering solutions after introducing minor modifications on the proposed consensus<br />
function, which is called HBGF for Hybrid Bipartite Graph Formulation. As in previous<br />
graph partitioning approaches to clustering combination, the desired number of clusters<br />
must be set a priori and passed as a parameter of the consensus function. In contrast, HBGF
considers object and cluster similarity simultaneously when creating the consensus clustering<br />
solution, an issue not considered by other graph partitioning based consensus functions such as
CSPA and MCLA (Strehl and Ghosh, 2002).<br />
More recently, the BALLS consensus function (Gionis, Mannila, and Tsaparas, 2007) operates on a graph representation of the pairwise object co-dissociation matrix, whose vertices are the objects in the data set and whose edges are weighted by the pairwise object
distances. The rationale of the consensus clustering creation process is the iterative construction<br />
of consensus clusters from compact and relatively isolated sets of close vertices,<br />
which are then removed from the graph.<br />
2.2.3 Consensus functions based on object co-association measures<br />
The approach to consensus clustering based on object co-association measures relies on the assumption that objects belonging to a natural cluster are likely to be co-located in
the same cluster by different clusterings of the cluster ensemble. Therefore, pairwise object<br />
co-occurrences are deemed as votes which are aggregated in a n × n object co-association<br />
matrix (where n is the number of instances contained in the data collection). A great advantage of these methods is that they avoid cluster disambiguation processes, as the
cluster ensemble is inspected object-wise, rather than cluster-wise. However, a downside of<br />
consensus functions based on object co-association is that their time and space complexities
scale quadratically with n, thus making their application on large data sets highly costly or<br />
even unfeasible.<br />
One of the pioneering works on the combination of hard clustering solutions based on<br />
object co-association metrics is the evidence accumulation (EAC) approach. In the original<br />
form of the evidence accumulation consensus function, the consensus clustering solution is<br />
obtained by applying a simple majority voting scheme on the co-association matrix (Fred,<br />
2001). In subsequent versions, consensus is derived by applying the single-link hierarchical<br />
clustering algorithm on the object co-association matrix, regarding it as a measure of the<br />
similarity between objects (Fred and Jain, 2002a)—a virtually identical proposal is found<br />
in a contemporary work (Zeng et al., 2002). In (Fred and Jain, 2003), the evidence accumulation<br />
approach is formulated in an information-theoretic framework, defining the optimal<br />
consensus clustering solution as the one maximizing the sum of normalized mutual information (φ(NMI)) with respect to all the partitions in the cluster ensemble. The authors
prove that, by maximizing the number of shared objects in the consolidated clusters, the<br />
EAC consensus function maximizes the aforementioned information theoretical objective<br />
function, although reaching its global optimum is not ensured in all situations. Moreover,<br />
cutting the dendrograms resulting from the application of the single-link clustering on the<br />
co-association matrix at the highest lifetime level leads to a minimization of the variance<br />
of the average φ(NMI), which guarantees the robustness of the clustering solution to small
variations in the composition of the cluster ensemble —furthermore, this also avoids making<br />
assumptions on the number of clusters, a significant advantage with respect to other<br />
consensus functions. A compendium on the evidence accumulation consensus clustering<br />
approach is presented in (Fred and Jain, 2005), extending the previous consensus functions<br />
through the application of other hierarchical clustering algorithms on the pairwise object<br />
co-association matrix.<br />
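The core of evidence accumulation can be sketched as follows, in the simple majority-voting form of (Fred, 2001): build the co-association matrix and link every object pair co-clustered by more than half of the partitions (the single-link dendrogram variants of the later papers are omitted for brevity; function names are our own):

```python
from itertools import combinations

def co_association(ensemble):
    """n x n matrix whose (i, j) entry is the fraction of partitions in the
    ensemble that place objects i and j in the same cluster."""
    n, l = len(ensemble[0]), len(ensemble)
    C = [[0.0] * n for _ in range(n)]
    for p in ensemble:
        for i, j in combinations(range(n), 2):
            if p[i] == p[j]:
                C[i][j] += 1 / l
                C[j][i] += 1 / l
    return C

def majority_consensus(ensemble):
    """Link every pair co-clustered by more than half of the partitions and
    return the resulting connected components as the consensus clusters."""
    C = co_association(ensemble)
    n = len(C)
    labels = list(range(n))  # merge components by flooding labels
    for i, j in combinations(range(n), 2):
        if C[i][j] > 0.5:
            old, new = labels[j], labels[i]
            labels = [new if l == old else l for l in labels]
    return labels
```

The quadratic size of C is precisely the scalability limitation discussed above.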
The clustering of high dimensional data is the main motivation of the work presented<br />
in (Fern and Brodley, 2003). In this scenario, Random Projection (RP) is an efficient<br />
dimensionality reduction technique, although it often gives rise to highly unstable clustering<br />
results. In order to reduce this variability, the authors propose creating cluster ensembles<br />
by compiling partitions resulting from distinct RP runs, combining them using a consensus<br />
function very similar to EAC, as it applies an agglomerative clustering algorithm on an<br />
object similarity matrix.<br />
One of the two consensus functions presented by Dudoit and Fridlyand (2003), named<br />
BagClust2, resembles evidence accumulation, as it builds a pairwise object dissimilarity<br />
matrix which is subject to a partitioning process for obtaining the consensus clustering.<br />
However, BagClust2 and EAC differ in that the former requires the desired number of clusters to be passed as a parameter to the consensus function (as does BagClust1).
In (Greene et al., 2004), consensus clustering was conducted by means of variants of the<br />
EAC consensus functions using distinct hierarchical clustering algorithms (i.e. single-link,<br />
complete-link and average-link) for partitioning the pairwise object co-association matrix,<br />
as proposed in (Fred and Jain, 2005). However, the central matter of study in that work<br />
is the analysis of the diversity of the cluster ensemble as a factor determining the quality<br />
of the consensus clustering. In this sense, the authors focused on random techniques for<br />
introducing diversity in the cluster ensemble, such as random subspacing, random algorithm<br />
initialization, random number of clusters or random feature projection.<br />
A related work is the Majority Rule consensus function of (Goder and Filkov, 2008),<br />
which is also based on clustering the pairwise object co-dissociation matrix, which can be<br />
done by simply setting a dissimilarity threshold, as in the first version of EAC (Fred, 2001), or by applying the average-link hierarchical clustering algorithm, as in the latest
versions of EAC (Fred and Jain, 2005).<br />
Moreover, there exist several consensus functions that make indirect use of pairwise<br />
object co-association (or co-dissociation) matrices, although the way the consensus clustering is obtained differs from that of EAC. Examples of this include some graph partition-based
consensus functions, such as CSPA (Strehl and Ghosh, 2002) and BALLS (Gionis, Mannila,<br />
and Tsaparas, 2007), the Iterative Pairwise Consensus (IPC) (Nguyen and Caruana, 2007)<br />
(a consensus function based on cluster centroids in which objects are iteratively reassigned<br />
to the clusters of the consensus partition according to their similarity), or the CC Pivot<br />
consensus function (Goder and Filkov, 2008), which obtains the consensus partition by<br />
conducting an iterative pivoting on the object dissimilarity matrix.<br />
2.2.4 Consensus functions based on categorical clustering<br />
A different approach to consensus clustering is the one related to categorical clustering,<br />
which basically consists of transforming the contents of the cluster ensemble into quantitative<br />
features that represent the objects, for subsequently clustering them according to this<br />
novel representation —thus obtaining the consensus partition. The QMI (Quadratic Mutual<br />
Information) consensus function of (Topchy, Jain, and Punch, 2003) posed the problem of<br />
combining the partitions contained in a hard cluster ensemble in an information theoretic<br />
framework, and consists of applying the k-means clustering algorithm on this new feature
space, which forces the user to set the desired number of clusters k in advance.<br />
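A sketch of this categorical-clustering idea follows (our own simplified stand-in for QMI: concatenated one-hot label features plus a deterministic toy k-means, rather than the information-theoretic formulation of the cited work):

```python
def one_hot_features(ensemble):
    """Represent each object by the concatenated one-hot encodings of its
    cluster labels across all partitions in the ensemble."""
    n = len(ensemble[0])
    feats = [[] for _ in range(n)]
    for p in ensemble:
        labels = sorted(set(p))
        for i in range(n):
            feats[i] += [1.0 if p[i] == lab else 0.0 for lab in labels]
    return feats

def toy_kmeans(points, k, iters=20):
    """Plain Lloyd k-means, deterministically seeded with the first k
    distinct points (assumes at least k distinct feature vectors)."""
    seeds = []
    for x in points:
        if x not in seeds:
            seeds.append(list(x))
        if len(seeds) == k:
            break
    centroids = seeds
    assign = [0] * len(points)
    for _ in range(iters):
        for i, x in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(x, centroids[c])))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def categorical_consensus(ensemble, k):
    """Consensus clustering by re-clustering objects in the new feature space."""
    return toy_kmeans(one_hot_features(ensemble), k)
```

Note that the label-swapped second partition below poses no problem: in the one-hot feature space, objects grouped together end up with identical feature vectors regardless of the symbolic labels.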
In (Punera and Ghosh, 2007), a novel fuzzy consensus function based on the Information<br />
Theoretic K-means (ITK) algorithm was presented. Its rationale follows a similar approach<br />
to that of (Topchy, Jain, and Punch, 2003). In this case, though, consensus clustering<br />
is conducted on soft cluster ensembles (i.e. the compilation of the outcomes of multiple<br />
fuzzy clustering processes), so each object in the data set is represented by means of the<br />
concatenated posterior cluster membership probability distributions corresponding to each<br />
one of the l fuzzy partitions in the cluster ensemble. Thus, using the Kullback-Leibler<br />
divergence (KLD) between those probability distributions as a measure of the distance<br />
between objects, the k-means algorithm is applied so as to obtain the consensus clustering<br />
solution. Note that the ITK consensus function is capable of combining fuzzy partitions<br />
with variable number of clusters, while producing a crisp consensus clustering solution.<br />
Moreover, this consensus function allows assigning distinct weights to each clustering in<br />
the cluster ensemble, which can be useful for the user to express his/her confidence in the
quality of some individual clusterings.<br />
2.2.5 Consensus functions based on probabilistic approaches<br />
Consensus clustering has also been approached from a probabilistic perspective. One of<br />
the pioneering works in this direction was the Expectation-Maximization (EM) consensus<br />
function proposed in (Topchy, Jain, and Punch, 2004), where a probabilistic model of the<br />
consensus clustering solution is defined in the space of the contributing clusters. This model is based on a finite mixture of multinomial distributions, each component of which
corresponds to a cluster of the combined clustering, which is obtained as the solution to<br />
the maximum likelihood problem solved by means of the EM algorithm. Contrasting with<br />
other consensus functions, the authors highlight the low computational complexity of the<br />
proposed method and its ability to combine partitions with different numbers of clusters.<br />
Another probabilistic approach to the consensus clustering problem was presented in<br />
(Long, Zhang, and Yu, 2005). The central matter in that work was finding a solution to<br />
the cluster correspondence problem (which, as mentioned earlier, is due to the symbolic<br />
identification of clusters caused by the unsupervised nature of the clustering problem). In<br />
particular, the goal was to derive a correspondence matrix that disambiguates the clusters
of each individual clustering in the cluster ensemble (represented as a probabilistic or binary<br />
membership matrix depending on whether the cluster ensemble is soft or hard) with regard<br />
to a hypothetical consensus clustering solution membership matrix. The goal is to find<br />
a correspondence matrix that yields the best projection of each individual clustering on<br />
the space defined by the consensus clustering solution. From a practical viewpoint, both<br />
the correspondence and consensus clustering matrices are derived simultaneously, using an<br />
EM-like approach.<br />
The beautiful proposal of (Lange and Buhmann, 2005) introduced a consensus function named Probabilistic Label Aggregation (PLA), which operates on soft cluster ensembles
(although it also works on crisp ones). Its rationale is as follows: given a single fuzzy<br />
partition, a pairwise object co-association matrix is created by simply multiplying the
membership probabilities matrix by its own transpose. Repeating this process on all the<br />
partitions in the soft cluster ensemble and aggregating (and subsequently normalizing) the<br />
resulting matrices gives rise to a joint probability matrix of finding two objects in the<br />
same cluster. Notably, the authors propose subjecting this joint probability matrix to a
non-negative matrix factorization process that yields estimates for class-likelihoods and<br />
class-posteriors, upon which the consensus clustering solution is based. This factorization<br />
process is posed as an optimization problem which is solved by applying the EM algorithm.<br />
Besides the elegance of the proposed solution, this work also stands out by the fact that<br />
it supports an out-of-sample extension that makes it possible to assign previously unseen<br />
objects to classes of the consensus clustering solution. Moreover, the proposed method also<br />
allows combining weighted partitions, i.e. it gives the user the chance to assign different<br />
degrees of relevance to the cluster ensemble partitions.<br />
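The construction of PLA's joint probability matrix can be sketched directly from the soft membership matrices (the subsequent NMF/EM factorization step is omitted; the function name is our own):

```python
def joint_coassociation(memberships):
    """Average of U_m * U_m^T over all soft partitions U_m (each an n x k_m
    row-stochastic membership matrix): entry (i, j) estimates the probability
    of finding objects i and j in the same cluster."""
    l = len(memberships)
    n = len(memberships[0])
    S = [[0.0] * n for _ in range(n)]
    for U in memberships:
        for i in range(n):
            for j in range(n):
                S[i][j] += sum(a * b for a, b in zip(U[i], U[j])) / l
    return S
```

Crisp partitions are the special case in which every membership row is a one-hot vector, so the matrix degenerates to the usual 0/1 co-association entries.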
A closely related proposal is the application of Non-Negative Matrix Factorization (NMF) for solving the consensus clustering problem, presented in (Li, Ding, and Jordan, 2007). In contrast to (Lange and Buhmann, 2005), the aim is to combine crisp partitions, which
imposes a series of constraints on the optimization problem that is solved via symmetric<br />
NMF, which, from an algorithmic viewpoint, is implemented by means of multiplicative update
rules. Moreover, the same approach is employed for conducting semi-supervised clustering,<br />
a problem that lies beyond the scope of this work.<br />
2.2.6 Consensus functions based on reinforcement learning<br />
Reinforcement learning has also been applied for the construction of consensus clustering<br />
solutions (Agogino and Tumer, 2006). In that work, the average φ(NMI) of the consensus
clustering solution with respect to the cluster ensemble is regarded as the reward that must<br />
be maximized by the actions of the agents. In this case, each agent casts a vote indicating<br />
which cluster each object should be assigned to (i.e. it operates on hard cluster ensembles).<br />
The application of a majority voting scheme on these votes yields the consensus clustering<br />
solution, which is iteratively refined as the agents learn how to vote so as to maximize the<br />
average φ(NMI). The authors highlight the ease with which their approach combines clusterings in distributed scenarios, which makes it especially suitable for failure-prone domains.
2.2.7 Consensus functions based on interpreting object similarity as data
The work by Kuncheva, Hadjitodorov, and Todorova (2006) introduced three consensus<br />
functions based on interpreting object similarity as data. That is, each object is represented<br />
by n features, where the jth feature of the ith object corresponds to the co-association<br />
strength between the ith and the jth objects. The authors base these consensus functions<br />
on the proven suitability of using similarity measures as object features in classification problems (Pękalska, 2005). Thus, the consensus clustering solution is obtained by applying
standard clustering algorithms on the pairwise object co-association matrix. In particular,<br />
the hierarchical average-link, single-link and k-means clustering algorithms are applied,<br />
giving rise to the ALSAD, SLSAD and KMSAD consensus functions. Notice that, despite<br />
being based on partitioning the object co-association matrix (just like EAC, for instance),<br />
these consensus functions differ from the latter in that the matrix contents are not interpreted<br />
as measures of similarity between objects, but rather as attributes in a new feature space.<br />
2.2.8 Consensus functions based on cluster centroids<br />
The time and memory scalability problems derived from combining clusterings of large data<br />
sets are the principal motivation of several works that tackle the consensus clustering problem<br />
following a centroid-based approach (Hore, Hall, and Goldgof, 2006; Nguyen and Caruana,<br />
2007). The underlying rationale consists of representing the cluster ensemble components<br />
in terms of the centroids of their clusters, instead of label vectors. By doing so, storage<br />
inconveniences are alleviated, as the number of clusters k is usually much lower than the<br />
number of objects n. In (Hore, Hall, and Goldgof, 2006), moreover, the authors highlight<br />
that the parallelization of the cluster ensemble construction process would dramatically<br />
decrease its time complexity. As regards the creation of the consensus clustering solution,<br />
it is based on computing the average centroid of each cluster, after disambiguating them<br />
by means of the Hungarian algorithm. Furthermore, that work introduced the possibility<br />
of discarding bad clusters from the cluster ensemble at consensus clustering creation time.<br />
In (Nguyen and Caruana, 2007), three iterative consensus functions were presented and<br />
empirically compared with eleven other clustering combiners in a comprehensive experimental<br />
study. The proposal in that work derived the consensus clustering solution in terms of the<br />
centroids of its clusters. Although the proposed consensus functions are capable of combining<br />
clusterings with a variable number of clusters, all the individual partitions contained<br />
in the hard cluster ensembles used in the experiments have the same number of clusters<br />
for simplicity. The first consensus function proposed in that work, called Iterative Voting<br />
Consensus (IVC), is based on recursively computing the centroids of the consensus solution<br />
clusters, and assigning each object to the nearest cluster, which is determined in terms<br />
of the Hamming distance. This procedure is iterated until the centroids of the consensus<br />
clustering solution reach a stable state. The second proposal (named Iterative Probabilistic<br />
Voting Consensus or IPVC) is a variant of IVC in which objects are iteratively assigned<br />
to consensus clusters in terms of their distance with respect to the objects that have been<br />
previously assigned to them. And in the third proposed consensus function, Iterative Pairwise<br />
Consensus or IPC, objects are iteratively reassigned to consensus clusters according to<br />
their similarity as measured by the pairwise object co-association matrix.<br />
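The IVC procedure described above can be sketched as follows; a minimal illustration on hard label vectors that assumes a user-supplied initial consensus labeling (`ivc` and its parameters are illustrative names, not the authors' code):<br />

```python
from collections import Counter

def ivc(ensemble, k, init, iters=50):
    """Iterative Voting Consensus sketch.

    `ensemble` is a list of l label vectors over n objects; `init` is an
    initial consensus labeling with k clusters. Each object is described
    by its l ensemble labels, each consensus cluster by the per-partition
    majority label of its members, and objects are reassigned to the
    centroid at minimum Hamming distance until the labeling stabilizes."""
    n = len(init)
    features = [[p[x] for p in ensemble] for x in range(n)]  # l labels per object
    labels = list(init)
    for _ in range(iters):
        centroids = []
        for c in range(k):
            members = [features[x] for x in range(n) if labels[x] == c]
            if not members:
                centroids.append(None)  # empty cluster: skip it when assigning
                continue
            centroids.append([Counter(col).most_common(1)[0][0]
                              for col in zip(*members)])
        new = [min((c for c in range(k) if centroids[c] is not None),
                   key=lambda c: sum(f != g for f, g in
                                     zip(features[x], centroids[c])))
               for x in range(n)]
        if new == labels:
            break
        labels = new
    return labels
```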
2.2.9 Consensus functions based on correlation clustering<br />
Recently, the connection between the emerging problem of correlation clustering and<br />
consensus clustering was exploited to derive novel consensus functions capable of determining<br />
the most natural number of clusters (Gionis, Mannila, and Tsaparas, 2007). In<br />
that work, the cluster ensemble is modeled as a graph resembling a pairwise object<br />
co-dissociation matrix, and the consensus clustering solution is defined as the one minimizing<br />
the disagreements with respect to the individual partitions contained in the cluster ensemble.<br />
In particular, three consensus functions based on correlation clustering were presented<br />
in that work, which are briefly described next. Firstly, the AGGLOMERATIVE consensus function<br />
results from applying a standard bottom-up procedure for correlation clustering on<br />
cluster ensembles. Resorting to the graph view of the pairwise object distance matrix, the<br />
AGGLOMERATIVE algorithm follows an iterative merging process that joins objects in clusters<br />
depending on whether their average distance is below a predefined threshold, stopping when<br />
further cluster merging does not reduce the number of disagreements of the consensus solution<br />
with respect to the cluster ensemble. Secondly, the FURTHEST consensus function can<br />
be regarded as the converse of AGGLOMERATIVE , as it consists of a top-down procedure that<br />
iteratively separates maximally distant graph vertices into consensus clusters, assigning<br />
the remaining objects to the cluster that minimizes the overall number of disagreements.<br />
This process is stopped when no disagreement reduction is achieved from additional cluster<br />
splitting. Thirdly, the LOCALSEARCH algorithm is derived from the application of a local<br />
search correlation clustering heuristic, which is based on a greedy procedure that, starting<br />
with a specific (possibly random) partition of the graph, tries to minimize the number of<br />
disagreements resulting from moving objects to different clusters or creating new singleton<br />
clusters, stopping when no move can decrease the disagreement rate. Interestingly, the authors<br />
point out that, despite its high computational cost, the LOCALSEARCH algorithm can be<br />
employed as a post-processing step for refining a previously obtained consensus clustering<br />
solution.<br />
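The greedy move-based heuristic behind LOCALSEARCH can be sketched as follows, under the simplifying assumption that the cost is the plain pairwise co-clustering disagreement count against the ensemble; all function names are illustrative:<br />

```python
from itertools import combinations

def total_disagreements(labels, ensemble):
    """Sum, over ensemble members, of object pairs that are co-clustered
    in one partition but not in the other."""
    pairs = list(combinations(range(len(labels)), 2))
    return sum((labels[x] == labels[y]) != (p[x] == p[y])
               for p in ensemble for x, y in pairs)

def local_search(init, ensemble):
    """Greedily move single objects to an existing or a brand-new singleton
    cluster while the disagreement count strictly decreases."""
    labels = list(init)
    best = total_disagreements(labels, ensemble)
    improved = True
    while improved:
        improved = False
        for x in range(len(labels)):
            for c in set(labels) | {max(labels) + 1}:
                if c == labels[x]:
                    continue
                candidate = labels.copy()
                candidate[x] = c
                cost = total_disagreements(candidate, ensemble)
                if cost < best:
                    labels, best, improved = candidate, cost, True
    return labels
```

Each cost evaluation is O(n²·l), which illustrates why this heuristic is regarded as computationally costly and better suited as a post-processing refinement step.<br />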
2.2.10 Consensus functions based on search techniques<br />
In (Goder and Filkov, 2008), two consensus functions based on search techniques were<br />
introduced. Their rationale consists of building the consensus clustering solution by means<br />
of a greedy search process that aims to minimize a disagreement-based cost function. The<br />
authors implement such search processes either by means of Simulated Annealing (SA), as<br />
in (Filkov and Skiena, 2004), or through successive single object movements that guarantee<br />
the largest decrease of the cost function (Best One Element Moves, or BOEM).<br />
2.2.11 Consensus functions based on cluster ensemble component selection<br />
Recall that the aim of any consensus clustering process is to obtain a single partition<br />
from a collection of l clustering solutions. As an alternative means for achieving that<br />
goal, cluster ensemble component selection techniques are based on obtaining the consensus<br />
clustering solution by selection, not by combination. For instance, the BESTCLUSTERING<br />
algorithm (Gionis, Mannila, and Tsaparas, 2007) is not a consensus function proper, as it<br />
identifies as the consensus clustering the individual partition that minimizes the number of<br />
disagreements with respect to the remaining clusterings in the cluster ensemble.<br />
Following a very similar approach, the Best of K (BOK) consensus function is based on<br />
selecting that individual clustering from the cluster ensemble that minimizes the number<br />
of pairwise co-clustering disagreements between the individual partitions in the cluster<br />
ensemble (Goder and Filkov, 2008).<br />
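A toy sketch of this selection scheme, with the pairwise disagreement count computed by brute force (illustrative names, not the authors' code):<br />

```python
from itertools import combinations

def disagreements(p, q):
    """Object pairs co-clustered in one partition but not in the other."""
    return sum((p[x] == p[y]) != (q[x] == q[y])
               for x, y in combinations(range(len(p)), 2))

def best_of_k(ensemble):
    """BOK sketch: pick the ensemble member with the lowest total number
    of pairwise co-clustering disagreements against the other members."""
    return min(ensemble,
               key=lambda p: sum(disagreements(p, q) for q in ensemble))
```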
2.2.12 Other interesting works on consensus clustering<br />
There exist several works in the literature devoted to the experimental comparison of the<br />
performance of consensus functions, the main interest of which lies in the evaluation of the<br />
quality of the consensus clusterings obtained.<br />
Examples of this include the work by Minaei-Bidgoli, Topchy, and Punch (2004), where<br />
a data resampling scheme was presented as a means for improving the robustness and stability<br />
of the consensus clustering solution. In that work, the EAC, BagClust2, QMI, CSPA,<br />
HGPA and MCLA consensus functions are experimentally compared when operating on<br />
hard cluster ensembles created from bootstrap partitions on several artificial and real data<br />
collections. The main conclusion drawn is that, as expected, there exists no universally superior<br />
consensus function, as each consensus function explores the data set in different ways,<br />
thus its efficiency greatly depends on the existing structure in the data set. Another extensive<br />
and interesting performance comparison between several consensus functions operating<br />
on small hard cluster ensembles is presented in (Kuncheva, Hadjitodorov, and Todorova,<br />
2006). Recently, the application of consensus clustering as a means of avoiding<br />
suboptimal clustering solutions when applying non-parametric clustering algorithms<br />
on text document collections is tackled in both (Gonzàlez and Turmo, 2008a) and (Gonzàlez<br />
and Turmo, 2008b). These works compared i) the performance of several non-parametric<br />
clustering algorithms across six text corpora, and ii) the quality of the consensus clustering<br />
solution when it is built –using some of the consensus functions presented in (Gionis,<br />
Mannila, and Tsaparas, 2007)– upon homogeneous and heterogeneous cluster ensembles.<br />
In most of these comparisons between consensus functions, the evaluation of computational<br />
complexity is often given marginal importance, although it becomes a critical aspect<br />
when it comes to their application in practice, especially when dealing with large data<br />
collections or cluster ensembles containing many partitions. This is the main motivation<br />
behind the data sampling strategy proposed in (Greene and Cunningham, 2006; Gionis,<br />
Mannila, and Tsaparas, 2007). The proposal of the latter work is the SAMPLING algorithm,<br />
which consists of performing a sufficient subsampling of the objects in the data set –thus<br />
constructing the consensus clustering solution on a reduced cluster ensemble–, and the subsequent<br />
extension of the combined clustering solution to those objects that have been left<br />
out of the subsampling process. The time complexity of these two processes is linear in<br />
the data set size, which can lead to significant time savings when dealing with large data<br />
collections.<br />
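By way of illustration, the extension step might be sketched as follows; this is an assumed variant (not the algorithm of the cited works) that assigns each left-out object to the consensus cluster whose sampled members it is, on average, most co-associated with, in time linear in n; all names are illustrative:<br />

```python
import numpy as np

def extend_consensus(sample_idx, sample_labels, coassoc_full):
    """Assign each object left out of the subsample to the consensus
    cluster whose sampled members it is most strongly co-associated with.

    `sample_idx` indexes the subsampled objects, `sample_labels` is the
    consensus labeling obtained on the reduced cluster ensemble, and
    `coassoc_full` is the n x n pairwise co-association matrix."""
    n = coassoc_full.shape[0]
    labels = -np.ones(n, dtype=int)
    labels[sample_idx] = sample_labels
    clusters = sorted(set(sample_labels))
    for x in range(n):
        if labels[x] == -1:
            # average co-association of x with the sampled members of each cluster
            scores = [coassoc_full[x, [i for i in sample_idx
                                       if labels[i] == c]].mean()
                      for c in clusters]
            labels[x] = clusters[int(np.argmax(scores))]
    return labels
```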
Another variant of the consensus clustering problem is the weighted combination of<br />
clusterings, which constitutes the central point of (Gonzàlez and Turmo, 2006). The idea<br />
behind weighted consensus clustering is the possibility of giving more relevance to some<br />
components of the cluster ensemble, as they may better describe the structure of the data<br />
set. Thus, it makes sense to combine clusterings in a weighted manner, emphasizing the<br />
contribution of those components deemed as the best ones in the ensemble. Besides designing<br />
consensus functions capable of combining weighted partitions, it is necessary to devise<br />
strategies for setting the proper weight of each individual clustering, which is not trivial in<br />
an unsupervised scenario. In that work, hypergraph-based (Strehl and Ghosh, 2002) and<br />
probabilistic (Topchy, Jain, and Punch, 2004) consensus functions are modified so as to<br />
handle weighted partitions. Moreover, the best weighting scheme is determined by creating<br />
differently weighted cluster ensembles, and subsequently selecting the best option in an<br />
unsupervised manner through the maximization of a scoring function. Moreover, this consensus<br />
function allows assigning distinct weights to each clustering in the cluster ensemble,<br />
which can be useful for the user to express his/her confidence in the quality of some individual<br />
clusterings. The ITK consensus function of (Punera and Ghosh, 2007) also supports<br />
such per-clustering weighting.<br />
Chapter 3<br />
Hierarchical consensus<br />
architectures<br />
As outlined in section 1.5, our proposal for building robust multimedia clustering systems<br />
relies on the creation of consensus clustering solutions upon cluster ensembles. These ensembles<br />
are made up of a large number of individual clusterings resulting from the execution of<br />
multiple clustering algorithms on several unimodal and multimodal representations of the<br />
objects contained in the data set.<br />
Indeed, the massive crossing of clustering algorithms, object representations and<br />
data modalities is a simple and parallelizable way of generating highly diverse, heterogeneous<br />
cluster ensembles, entrusting the consensus clustering task with the derivation of a<br />
meaningful combined clustering solution.<br />
Given the unsupervised nature of the clustering problem, we regard this as a sensible<br />
way of obtaining clustering solutions robust to the influence of the clustering indeterminacies,<br />
since sticking to a handful of clustering algorithms or object representations can involuntarily<br />
and undesirably limit the quality and diversity of the cluster ensemble components.<br />
However, while this strategy allows the creation of rich cluster ensembles,<br />
it also introduces several drawbacks that affect the consensus clustering task:<br />
– the large number of individual clustering solutions contained in the cluster ensemble,<br />
resulting from the aforementioned combination of clustering algorithms, object representations<br />
and data modalities, often leads to a notable increase in the computational<br />
cost of the execution of the consensus function, which can even become prohibitive.<br />
– this same fact affects the diversity and quality of the cluster ensemble components,<br />
and, while moderate diversity has been found to be beneficial as far as consensus<br />
clustering is concerned (Hadjitodorov, Kuncheva, and Todorova, 2006; Fern and Lin,<br />
2008), the existence of poor quality clustering solutions in the cluster ensemble may<br />
cause a detrimental effect on the quality of the consensus clustering solution.<br />
Allowing for these considerations, in this thesis we introduce the concept of self-refining<br />
hierarchical consensus architectures (SHCA), defined as a generic means for fighting against:<br />
– the computational complexity of combining a large number of individual partitions, by<br />
means of hierarchical consensus architectures, which consist in the layered construction<br />
of the consensus clustering solution through a hierarchical structure of low complexity<br />
intermediate consensus processes.<br />
– the negative bias induced by poor quality clusterings in the consensus clustering solution,<br />
by means of a self-refining post-processing that, using the obtained consensus<br />
clustering solution as a reference, builds a select and reduced cluster ensemble (i.e. a<br />
subset of the original cluster ensemble), deriving a new and refined consensus clustering<br />
upon it in a fully unsupervised manner.<br />
Although both strategies are complementary (indeed, they can be naturally combined,<br />
giving rise to SHCA), their description and study are decoupled in the present and the<br />
next chapters, respectively. Thus, in our description and analysis of hierarchical consensus<br />
architectures (chapter 3), we ultimately aim to design computationally optimal consensus<br />
architectures and, consequently, we will solely focus on aspects regarding their time complexity.<br />
Meanwhile, the study of consensus self-refining procedures, which are presented in<br />
chapter 4, is centered on improving the quality of the consensus solutions yielded by the<br />
most computationally efficient consensus architectures devised in the present chapter.<br />
The introduction, the discussion of their rationale and the theoretical description of<br />
hierarchical consensus architectures are complemented by the presentation of multiple experiments<br />
analyzing various aspects of their performance on several real data collections.<br />
Last but not least, note that although all the proposals put forward in this chapter<br />
are focused on a hard cluster ensemble scenario, they are also applicable to the combination<br />
of fuzzy clusterings.<br />
3.1 Motivation<br />
The construction of consensus clustering solutions is usually tackled as a one-step process,<br />
in the sense that the whole cluster ensemble E is input to the consensus function F at once<br />
—see figure 3.1(a). This is what we call flat consensus clustering. However, as outlined in<br />
chapter 2, the time and space complexities of consensus functions typically scale linearly or<br />
quadratically with the size of the cluster ensemble l (i.e. O(l^w), where w ∈ {1, 2}), which<br />
may lead to a highly costly or even impossible execution of the consensus clustering task if<br />
it is to be conducted on a cluster ensemble containing a large number of partitions¹.<br />
For this reason, a natural way of circumventing this limitation, while also reducing the<br />
computational complexity of the consensus solution creation process, consists in applying the<br />
classic divide-and-conquer strategy (Dasgupta, Papadimitriou, and Vazirani, 2006), which<br />
basically:<br />
– breaks the original problem into subproblems which are nothing but smaller instances<br />
of the same type of problem<br />
1 Moreover, the time complexity of consensus functions also depends –linearly or quadratically, see appendix<br />
A.5– on the number of objects in the data set n and the number of clusters k of the clusterings<br />
in the ensemble. However, as we assume that these two factors are constant for a given cluster ensemble<br />
corresponding to a specific data set, the only dependence of concern is that referring to the cluster ensemble<br />
size l.<br />
[Figure: diagram contrasting (a) the flat construction of a consensus clustering solution on a hard cluster ensemble, where the whole ensemble is input to a single consensus function, with (b) its hierarchical construction, where mini-ensembles are combined by intermediate consensus functions whose outputs are merged in successive stages.]<br />
Figure 3.1: Flat vs hierarchical construction of a consensus clustering solution on a hard<br />
cluster ensemble.<br />
– recursively solves these subproblems<br />
– appropriately combines their outcomes<br />
Transferring this strategy to the consensus clustering problem is equivalent to segmenting<br />
the cluster ensemble into subsets (referred to as mini-ensembles hereafter), building<br />
an intermediate consensus solution upon each mini-ensemble, and subsequently combining<br />
these halfway consensus clusterings into the final consensus clustering solution λc —see<br />
figure 3.1(b). Due to the fact that successive layers (or stages) of consensus solutions are<br />
created, we give this approach the name of hierarchical consensus architecture (HCA), as<br />
opposed to the traditional flat topology of consensus clustering processes.<br />
The rationale of hierarchical consensus architectures is pretty simple. By reducing the<br />
time and space complexities of each intermediate consensus clustering –which is achieved by<br />
creating it upon a smaller ensemble–, we aim to reduce the overall execution requirements<br />
(i.e. memory and, especially, CPU time), although a larger number of low-cost consensus<br />
clustering processes must be run. However, this strategy is capable of yielding computational<br />
gains, as for large enough values of l, the execution of the original problem becomes<br />
slower than the recursive execution of the subproblems into which it is divided (Dasgupta,<br />
Papadimitriou, and Vazirani, 2006).<br />
An additional and very relevant point as regards the computational efficiency of hierarchical<br />
consensus architectures is that they naturally allow the parallel execution of the<br />
consensus clustering processes of every HCA stage —quite obviously, this will ultimately<br />
depend on the availability of computing resources. Thus, the degree of parallelism in executing<br />
the consensus of every HCA stage will set the lower and upper bounds of the time<br />
required for obtaining the final consensus clustering λc.<br />
In the best-case scenario, the HCA running time can be as low as the sum of the<br />
execution times of the longest-lasting consensus task of each stage of the architecture,<br />
provided that the available computational resources allow the parallel computation of all<br />
the intermediate consensus solutions of any given stage.<br />
On the contrary, if the execution of the halfway consensus is serialized, the time required<br />
for running the whole HCA amounts to the sum of the execution times of all the consensus<br />
processes of the stages of the hierarchical consensus architecture, which constitutes the<br />
upper bound of the running time of a hierarchical consensus architecture.<br />
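These two bounds can be illustrated with a toy computation on hypothetical per-consensus execution times (the figures below are made up for illustration):<br />

```python
# Hypothetical execution times (seconds) of the intermediate consensus
# processes of a two-stage HCA: three mini-ensemble consensus processes
# in stage 1, one final consensus in stage 2.
stage_times = [[4.0, 4.0, 9.0], [3.0]]

# Lower bound: each stage runs its consensus processes fully in parallel,
# so it costs as much as its longest-lasting consensus.
parallel_bound = sum(max(stage) for stage in stage_times)

# Upper bound: every consensus process is serialized on a single unit.
serial_bound = sum(sum(stage) for stage in stage_times)

print(parallel_bound, serial_bound)  # 12.0 20.0
```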
Therefore, depending on the design of the HCA, the simultaneously available computing<br />
resources and the characteristics of the data set, structuring the consensus clustering task<br />
in a hierarchical manner may be more or less computationally beneficial (or not beneficial<br />
at all) as compared to its flat counterpart. From a practical viewpoint, our general idea is<br />
to provide the user with simple tools that, for a given consensus clustering problem, allow<br />
to decide a priori whether hierarchical consensus architectures are more computationally<br />
efficient than traditional flat consensus and, if so, implement the HCA variant of minimal<br />
complexity.<br />
Moreover, it is important to highlight the fact that, in cases where the flat execution<br />
of the consensus function F becomes impossible due to memory limitations caused by the<br />
large size of the cluster ensemble, a carefully designed HCA will allow obtaining a consensus<br />
clustering solution.<br />
Let us now elaborate briefly on several notational definitions regarding hierarchical<br />
consensus architectures that will be of help when describing our proposals in detail. We<br />
suggest the reader resort to the generic HCA topology depicted in figure 3.1(b) for a better<br />
understanding of the concepts we are about to expose.<br />
Firstly, a hierarchical consensus architecture is structured in s successive stages. The<br />
number of intermediate consensus solutions obtained at the output of the ith stage is denoted<br />
as Ki —notice that Ks = 1 (i.e. the last stage yields the single final consensus<br />
clustering solution λc). The notation used for designating the jth halfway consensus clustering<br />
created at the ith HCA stage is λ^i_cj, where i ∈ [1, s−1] and j ∈ [1, K_i].<br />
Another important factor in the definition of HCAs is the size of the mini-ensembles,<br />
which may vary from stage to stage or even within the same stage. For this reason, we<br />
denote as bij the size of the mini-ensemble upon which the jth consensus process of the ith<br />
HCA stage is conducted. Notice that b_s1 = K_{s−1} (i.e. the last consensus stage combines<br />
all the intermediate clusterings output by the previous stage into the single final consensus<br />
clustering solution λc), while, in the HCA presented in figure 3.1(b), b_1j = 2 ∀j ∈ [1, K_1].<br />
Moreover, notice that hierarchical architectures naturally allow the use of distinct consensus<br />
functions across the distinct stages (or even within the same stage). However, in<br />
this work we assume that a single consensus function F is applied for conducting all the<br />
consensus processes involved.<br />
In this chapter, we propose two strategies for constructing hierarchical consensus architectures,<br />
which differ in i) the way mini-ensembles are created, and ii) which HCA<br />
parameters are tuned by the user. As a result, two HCA implementation alternatives are<br />
put forward:<br />
– random hierarchical consensus architectures, whose tunable parameter is the size of<br />
the mini-ensembles –the components of which are selected at random–, which eventually<br />
determines the HCA topology (i.e. its number of stages).<br />
– deterministic hierarchical consensus architectures, where the construction of the mini-ensembles<br />
is driven by the cluster ensemble creation process —in particular, the diversity<br />
factors used in creating the ensemble determine the number of HCA stages<br />
and the mini-ensembles components.<br />
The following sections are devoted to describing the rationale and implementation details<br />
of both HCA variants, specifying how the number of stages, the number of consensus<br />
processes per stage and the size of the mini-ensembles are determined in each case. This<br />
description is completed by an analysis of their computational complexity.<br />
3.2 Random hierarchical consensus architectures<br />
In this section, we introduce random hierarchical consensus architectures (RHCA for short),<br />
defining their topology from a generic perspective and briefly describing their foundations,<br />
followed by an analysis of their computational complexity.<br />
3.2.1 Rationale and definition<br />
The idea behind random hierarchical consensus architectures is to construct a regular pyramidal<br />
structure of intermediate consensus processes that delivers, at its top, the final consensus<br />
clustering solution λc. The term random refers not to the consensus architecture<br />
itself, but to the way mini-ensembles are created. In particular, the randomness of RHCA<br />
lies in the fact that the clusterings input to each stage of the hierarchical architecture are<br />
shuffled randomly.<br />
Besides this fact, the most distinctive feature of RHCA is that the user determines<br />
the size of the mini-ensembles, setting it to b and keeping it constant across the stages of<br />
the consensus architecture. Therefore, given a cluster ensemble containing l component<br />
clusterings and a mini-ensemble size set equal to b by the user, the number of stages s of<br />
the resulting RHCA is computed by equation (3.1).<br />
s = ⌊log_b(l)⌉ if ⌊l/b^⌊log_b(l)⌉⌋ ≤ 1 and ⌊l/b^(⌊log_b(l)⌉−1)⌋ > 1<br />
s = ⌊log_b(l)⌉ − 1 if ⌊l/b^⌊log_b(l)⌉⌋ ≤ 1 and ⌊l/b^(⌊log_b(l)⌉−1)⌋ = 1     (3.1)<br />
s = ⌊log_b(l)⌉ + 1 if ⌊l/b^⌊log_b(l)⌉⌋ > 1<br />
where ⌊x⌉ denotes the operation of rounding x to the nearest integer (Hastad et al., 1988).<br />
The second option in equation (3.1) reduces the number of stages by one in the case that<br />
the penultimate RHCA stage already yields one consensus clustering, whereas the third one<br />
adds a supplementary stage so as to ensure the obtention of a single consensus solution at<br />
the output of the RHCA.<br />
Furthermore, the number of consensus solutions computed at the ith stage of the RHCA<br />
(where i ∈ [1,s]) is determined by the expression in equation (3.2).<br />
K_i = max(⌊l/b^i⌋, 1)     (3.2)<br />
where ⌊x⌋ stands for the greatest integer less than or equal to x (i.e. the result of applying<br />
the floor function on number x).<br />
However, it is important to notice that it will only be possible to keep the mini-ensemble<br />
size constant all along the hierarchy (i.e. b_ij = b, ∀i ∈ [1, s] and ∀j ∈ [1, K_i]) when l is<br />
an integer power of b. In the likely case that this condition is not met, in the current<br />
implementation of RHCA we choose, for simplicity, to integrate the spare clusterings² of the<br />
ith RHCA stage into its last (i.e. the K_ith) mini-ensemble, thus introducing a bounded<br />
increase of its size, as b ≤ b_iK_i < 2b. Moreover, the size of the mini-ensemble input to the<br />
sth stage is set equal to the number of halfway consensus clusterings output by the penultimate<br />
RHCA stage, as defined in equation (3.3).<br />
b_ij = b if i < s and j < K_i<br />
b_ij = K_{i−1} − b(K_i − 1) if i < s and j = K_i (with K_0 = l)     (3.3)<br />
b_ij = K_{s−1} if i = s<br />
Chapter 3. Hierarchical consensus architectures<br />
the mini-ensembles size, notice that the size of the third mini-ensemble of the first RHCA<br />
stage is increased (b13 =3) so that all the l = 7 components of the cluster ensemble are<br />
involved in one of the K1 = 3 consensus processes of the first RHCA stage. This also<br />
happens in the second stage, where b21 =3andK2 =1,which,asjustmentioned,yieldsa<br />
single consensus at its output.<br />
The interested reader will find a more detailed description of these and other RHCA<br />
configuration examples in appendix C.1.<br />
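The topology rules of equations (3.1)-(3.3) can be sketched in code as follows; a minimal illustration that folds the spare clusterings of each stage into its last mini-ensemble, as described above (`rhca_topology` is an illustrative name):<br />

```python
import math

def rhca_topology(l, b):
    """Compute the number of stages s, the consensus processes per stage
    K_i, and the mini-ensemble sizes b_ij of a RHCA (eqs. 3.1-3.3)."""
    r = round(math.log(l, b))          # nearest-integer rounding of log_b(l)
    if l // b**r > 1:                  # eq. (3.1), third case
        s = r + 1
    elif l // b**(r - 1) == 1:         # second case: penultimate stage
        s = r - 1                      # already yields a single consensus
    else:
        s = r
    K = [max(l // b**i, 1) for i in range(1, s + 1)]   # eq. (3.2)
    sizes = []
    prev = l                           # clusterings entering stage i (K_0 = l)
    for i in range(1, s + 1):
        if i == s:
            sizes.append([prev])       # final stage combines all K_{s-1} clusterings
        else:
            row = [b] * K[i - 1]
            row[-1] += prev - b * K[i - 1]   # spare clusterings go into the last mini-ensemble
            sizes.append(row)
            prev = K[i - 1]
    return s, K, sizes
```

For instance, the l = 7, b = 2 example above yields s = 2, K = [3, 1] and mini-ensemble sizes [2, 2, 3] in the first stage.<br />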
3.2.2 Computational complexity<br />
In the following paragraphs, we present a study of the asymptotic computational complexity<br />
of RHCA, considering both its fully serial and parallel implementations, which, as<br />
aforementioned, constitute the upper and lower bounds of the RHCA execution time.<br />
Serial RHCA<br />
Firstly, the time complexity of the fully serialized implementation is considered. This<br />
means that the intermediate consensus tasks of each RHCA stage must be sequentially executed<br />
on a single computation unit. Recall that the time complexity of consensus functions<br />
typically grows linearly or quadratically with the cluster ensemble size, that is, it can be<br />
expressed as O(l^w), where w ∈ {1, 2}. Therefore, the serial time complexity of a RHCA<br />
(STC_RHCA) with s stages boils down to systematically adding the time complexities of all<br />
the consensus processes executed across the whole RHCA, as defined in equation (3.4).<br />
STC_RHCA = Σ_{i=1}^{s} Σ_{j=1}^{K_i} O(b_ij^w)     (3.4)<br />
where K_i refers to the number of consensus processes executed in the ith RHCA stage, b_ij is<br />
the mini-ensemble size corresponding to the jth consensus process executed at the ith stage<br />
of the hierarchy —the exact value of these parameters is computed according to equations<br />
(3.2) and (3.3), respectively—, and O(b_ij^w) reflects the complexity of each intermediate<br />
consensus process.<br />
Equation (3.4) can be reformulated so as to obtain a compact expression of an upper<br />
bound of STC_RHCA as a function of the user-defined mini-ensemble size b. This requires<br />
recalling that, in the current RHCA implementation, the effective mini-ensemble size is<br />
bounded, that is, b_ij < 2b, ∀i ∈ [1, s] and ∀j ∈ [1, K_i]. Therefore, we can write:<br />
STC_RHCA < Σ_{i=1}^{s} Σ_{j=1}^{K_i} O((2b)^w)     (3.5)<br />
Notice that, from an algorithmic viewpoint, equation (3.5) can be regarded as two nested<br />
loops where the number of iterations of the inner loop (Ki) depends on the value of the<br />
outer loop’s index (i). The number of iterations of the inner loop as a function of the outer<br />
loop’s index is presented in table 3.1.<br />
3.2. Random hierarchical consensus architectures<br />
[Figure 3.2: Three examples of topologies of random hierarchical consensus architectures<br />
with distinct relationships between the cluster ensemble and mini-ensemble sizes, l and b,<br />
respectively: (a) a RHCA with l = 8 and b = 2 (s = 3); (b) a RHCA with l = 9 and b = 2<br />
(s = 3); (c) a RHCA with l = 7 and b = 2 (s = 2).]<br />
Chapter 3. Hierarchical consensus architectures<br />
i | # inner loop iterations<br />
1 | K1<br />
2 | K2<br />
... | ...<br />
s | 1<br />
Table 3.1: Number of inner loop iterations as a function of the outer loop's index i.<br />
Thus, it can be observed that the total number of times a mini-ensemble consensus of<br />
maximum complexity O((2b)^w) is executed equals ∑_{i=1}^{s} Ki. The value of this summation can be<br />
bounded by considering that the number of consensus processes per stage Ki is also bounded,<br />
as Ki ≤ l/b^i (see equation (3.2)), yielding:<br />
$$\sum_{i=1}^{s} K_i \;\le\; \sum_{i=1}^{s} \frac{l}{b^{i}} \;=\; l \cdot \sum_{i=1}^{s} \left(\frac{1}{b}\right)^{i} \;=\; l \cdot \frac{\tfrac{1}{b} - \tfrac{1}{b^{s+1}}}{1 - \tfrac{1}{b}} \;=\; l \cdot \frac{b^{s} - 1}{b^{s}\,(b - 1)} \qquad (3.6)$$<br />
which equals the partial sum of a geometric series whose common ratio is 1/b. Therefore, the<br />
upper bound of the time complexity STCRHCA can be rewritten as:<br />
$$\mathrm{STC_{RHCA}} < l \cdot \frac{b^{s} - 1}{b^{s}\,(b - 1)} \cdot O\!\left((2b)^{w}\right) \qquad (3.7)$$<br />
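As a quick numerical sanity check (not part of the original derivation), the closed form shared by equations (3.6) and (3.7) can be verified against the explicit sums:

```python
def check_geometric_bound(l, b, s):
    """Verify that l * sum_{i=1}^s (1/b)**i equals the closed form
    l * (b**s - 1) / (b**s * (b - 1)) of equation (3.6), and that this
    closed form bounds the integer consensus counts sum_i floor(l / b**i)."""
    partial = l * sum((1.0 / b) ** i for i in range(1, s + 1))
    closed = l * (b ** s - 1) / (b ** s * (b - 1))
    bounded = sum(l // b ** i for i in range(1, s + 1)) <= closed + 1e-9
    return abs(partial - closed) < 1e-9 and bounded
```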
Parallel RHCA<br />
We now consider the fully parallel RHCA implementation, which assumes that enough computation<br />
units are available to execute all the consensus processes of a given stage simultaneously<br />
(i.e. K1 units suffice, as K1 > Ki, ∀i ∈ [2,s]). In this case of maximum parallelism, the parallel time complexity of a<br />
RHCA (PTCRHCA) of s stages is formulated according to equation (3.9).<br />
$$\mathrm{PTC_{RHCA}} = \sum_{i=1}^{s} \max_{j \in [1,K_i]} O\!\left(b_{ij}^{\,w}\right) \qquad (3.9)$$<br />
That is, the parallel time complexity of a RHCA is equal to the sum of the time complexities<br />
of the most time-consuming consensus processes of each RHCA stage. Notice that<br />
it is not difficult to find an upper bound to PTCRHCA, as finding the maximum of O(bij^w)<br />
just requires taking into account that bij < 2b, ∀i, j. Thus:<br />
$$\mathrm{PTC_{RHCA}} < \sum_{i=1}^{s} O\!\left((2b)^{w}\right) = s \cdot O\!\left((2b)^{w}\right) = O\!\left(s \cdot (2b)^{w}\right) \qquad (3.10)$$<br />
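Equations (3.9) and (3.10) can be sketched directly as follows (stage sizes are passed explicitly; the function names are illustrative only):

```python
def parallel_complexity(stage_sizes, w=1):
    """Equation (3.9) with the O(.) constants dropped: the parallel cost is
    the sum, over stages, of the cost of the largest (slowest) consensus
    process of each stage."""
    return sum(max(size ** w for size in sizes) for sizes in stage_sizes)

def parallel_bound(s, b, w=1):
    """Upper bound of equation (3.10): s * (2b)**w."""
    return s * (2 * b) ** w
```

For the figure 3.2(c) topology (mini-ensemble sizes {2, 2, 3} and {3}, with b = 2 and s = 2), the parallel cost for w = 1 is 3 + 3 = 6, below the bound 2·(2·2) = 8.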
If the number of RHCA stages is approximated as s ≈ log_b(l), and constants are<br />
dropped, equation (3.10) can be rewritten as a function of l and b:<br />
$$\mathrm{PTC_{RHCA}} \in O\!\left(\log_b(l) \cdot (2b)^{w}\right) = O\!\left(b^{w} \log_b(l)\right) \qquad (3.11)$$<br />
[Figure 3.3: Evolution of RHCA parameters as a function of the mini-ensemble size b, for<br />
cluster ensemble sizes ranging from 100 to 10000: (a) the number of RHCA stages s, (b) the<br />
total number of consensus processes ∑_{i=1}^{s} Ki, and (c) the mean mini-ensemble size bij.]<br />
3.2.3 Running time minimization<br />
In light of the expressions of the upper bounds of the serial and parallel time complexities<br />
of RHCA, a naturally arising question is which particular RHCA configuration yields, for<br />
a given cluster ensemble, the minimal running time —notice that the user's choice of the<br />
mini-ensemble size b determines both the number of stages and the number of consensus<br />
processes computed per stage (see equations (3.1) and (3.2)), which ultimately determines<br />
the running time of the RHCA.<br />
In fact, there exists a trade-off between the value of b and the execution time of the<br />
whole RHCA, as selecting a small value for b simultaneously reduces the time complexity of<br />
each consensus while increasing the total number of stages (s) and of consensus processes<br />
of the RHCA (∑_{i=1}^{s} Ki), and vice versa.<br />
With the purpose of visualizing the dependence between b and these factors, figure 3.3<br />
depicts their values for different cluster ensemble sizes l ∈ [100, 10000] as a function of the<br />
mini-ensemble size b ∈ [2, ⌊l/2⌋].<br />
Firstly, figure 3.3(a) shows the exponential decay of the number of RHCA stages s as a<br />
function of b, caused by the fact that s is computed as the base-b logarithm of the cluster<br />
ensemble size l. Secondly, the evolution of the total number of consensus processes follows<br />
a fast exponential decay (hard to appreciate in this doubly logarithmic chart), as depicted<br />
in figure 3.3(b). And finally, figure 3.3(c) presents the mean value of the effective mini-ensemble<br />
size bij across the whole RHCA (which obviously scales linearly with b) as a rough<br />
indicator of the complexity of each intermediate consensus process, which will be approximately<br />
(linearly or quadratically) proportional to this value.<br />
Allowing for the evident dependence between the user-defined mini-ensemble size b and<br />
the running time of RHCA, it seems necessary to choose the value of this parameter carefully<br />
—regardless of whether the serial or parallel version of RHCA is implemented (in this latter<br />
case, notice that the RHCA running time still depends on b (via s) although it becomes<br />
insensitive to the total number of consensus processes, ∑_{i=1}^{s} Ki)—, as RHCA variants with<br />
different values of b may have dramatically different running times.<br />
As mentioned earlier, we aim to design an automatic mechanism that allows selecting<br />
the specific value of b that gives rise to the RHCA variant of minimal running time, making<br />
it also possible to decide a priori whether it is more computationally efficient than flat<br />
consensus.<br />
To do so, we propose a simple yet effective methodology for comparing distinct RHCA<br />
variants (i.e. with different b values) on computational grounds, a detailed description of<br />
which is presented in table 3.2. In a nutshell, the proposed strategy is based on estimating<br />
the RHCA variants' running times using equation (3.4) or (3.9) (depending on whether the<br />
serial or parallel version is to be implemented), replacing the theoretical time complexity of<br />
intermediate consensus processes (O(bij^w)) by the average real execution time of c consensus<br />
processes using a specific consensus function F on a mini-ensemble of size bij —recall that<br />
the values of bij can be computed by means of equation (3.3).<br />
It is to note that, for a specific RHCA variant (that is, for a given mini-ensemble size<br />
b), the parameter bij may take only a few distinct values —for instance, in the three toy RHCA<br />
variants depicted in figure 3.2, bij ∈ {2, 3} although b = 2 in all of them. That is, the<br />
number of consensus processes to be executed for estimating the running time of the RHCA<br />
is usually small, which makes the proposed procedure computationally inexpensive.<br />
Furthermore, notice that a more robust estimation can be obtained by averaging the running<br />
times of c > 1 executions of the consensus function F on mini-ensembles of size bij, although<br />
this would increase the time required for completing the estimation procedure.<br />
By means of the proposed methodology, the user is provided with an estimation of<br />
which is the most computationally efficient RHCA configuration and of its running<br />
time for the consensus clustering problem at hand. However, it is necessary to decide<br />
whether this allegedly optimal RHCA variant is faster than flat consensus. The running<br />
time of flat consensus can be estimated by extrapolating from the running times of the<br />
consensus processes executed upon mini-ensembles of size bij. A simpler although less<br />
efficient alternative consists of launching the execution of flat consensus and halting it as soon<br />
as its running time exceeds the estimated execution time of the allegedly optimal RHCA<br />
variant.<br />
The next section is devoted to illustrating the performance of the proposed running time<br />
estimation methodology by means of several experiments, as well as to evaluating the<br />
computational efficiency of RHCA against flat consensus.<br />
3.2.4 Experiments<br />
In the following paragraphs, we present a set of experiments oriented to i) evaluate the<br />
performance of the computationally optimal consensus architecture prediction methodology,<br />
and ii) analyze the computational efficiency of random hierarchical consensus architectures.<br />
To do so, we have designed several experiments which are outlined next.<br />
1. Given the cluster ensemble size l, create a set of mini-ensemble sizes b with values<br />
sweeping from 2 to ⌊l/2⌋.<br />
2. For each b value –which corresponds to a RHCA variant– compute the number of<br />
stages of the RHCA s according to equation (3.1). With these results in hand, limit<br />
the sweep of values of b according to two criteria:<br />
i) as there exist multiple values of b that yield RHCA variants with the same number<br />
of stages, consider only the largest and smallest values of b that yield the same<br />
number of RHCA stages s.<br />
ii) keep those values of b that uniquely give rise to RHCA variants with a specific<br />
number of stages.<br />
3. For the reduced set of b values, compute the total number of consensus processes<br />
∑_{i=1}^{s} Ki and the real mini-ensemble sizes bij of the corresponding RHCA variants,<br />
according to equations (3.2) and (3.3), respectively.<br />
4. Measure the time required for executing the consensus function F on c randomly<br />
picked mini-cluster ensembles of the sizes bij corresponding to each value of b.<br />
5. Employ the computed parameters of each RHCA variant (i.e. number of stages s,<br />
total number of consensus processes ∑_{i=1}^{s} Ki and the running times of the consensus<br />
function F) to estimate the running time of the whole hierarchical architecture,<br />
using equation (3.4) or (3.9) depending on whether its fully serial or parallel version<br />
is to be implemented in practice.<br />
Table 3.2: Methodology for estimating the running time of multiple RHCA variants.<br />
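Steps 4 and 5 of the methodology above can be sketched as follows. Here consensus_fn stands for the consensus function F and make_ensemble for a sampler of random mini-ensembles; both are placeholders for the reader's own implementations, not functions defined in this thesis.

```python
import time

def estimate_rhca_times(stage_sizes, consensus_fn, make_ensemble, c=1):
    """Time consensus_fn on c random mini-ensembles of each distinct size
    b_ij, then combine the mean times as in equation (3.4) (serial: add
    every consensus) and equation (3.9) (parallel: add per-stage maxima)."""
    mean_time = {}
    for size in {s for sizes in stage_sizes for s in sizes}:
        runs = []
        for _ in range(c):
            ensemble = make_ensemble(size)
            t0 = time.perf_counter()
            consensus_fn(ensemble)
            runs.append(time.perf_counter() - t0)
        mean_time[size] = sum(runs) / c
    serial = sum(mean_time[s] for sizes in stage_sizes for s in sizes)
    parallel = sum(max(mean_time[s] for s in sizes) for sizes in stage_sizes)
    return serial, parallel
```

Repeating this estimation for every candidate b and keeping the variant with the smallest estimate implements the prediction step; increasing c trades estimation time for robustness, as discussed above.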
Experimental design<br />
– What do we want to measure?<br />
i) The time complexity of random hierarchical consensus architectures.<br />
ii) The ability of the proposed methodology for predicting the computationally optimal<br />
RHCA variant, in both the fully serial and parallel implementations.<br />
– How do we measure it?<br />
i) The time complexity of the implemented serial and parallel RHCA variants is<br />
measured in terms of the CPU time required for their execution —serial running<br />
time (SRTRHCA) and parallel running time (PRTRHCA).<br />
ii) The estimated running times of the same RHCA variants –serial estimated running<br />
time (SERTRHCA) and parallel estimated running time (PERTRHCA)– are<br />
computed by means of the proposed running time estimation methodology, which<br />
is based on the measured running time of c = 1 consensus clustering process. Predictions<br />
regarding the computationally optimal RHCA variant will be successful<br />
if both the real and estimated running times are minimized by the<br />
same RHCA variant, and the percentage of experiments in which prediction is<br />
successful is given as a measure of its performance. In order to measure the<br />
impact of incorrect predictions, we also measure the execution time differences<br />
(in both absolute and relative terms) between the truly and the allegedly fastest<br />
RHCA variants when prediction fails. This evaluation process is replicated<br />
for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />
on the prediction accuracy of the proposed methodology.<br />
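The success criterion and the ΔRT penalties reported later (table 3.3) can be expressed compactly; this is an illustrative sketch, with the candidate b values as dictionary keys:

```python
def prediction_outcome(estimated, real):
    """Given the estimated and the real running time of each RHCA variant
    (dicts keyed by b), report whether the estimate picks the truly fastest
    variant, plus the absolute and relative penalty paid when it does not."""
    b_pred = min(estimated, key=estimated.get)   # allegedly fastest variant
    b_true = min(real, key=real.get)             # truly fastest variant
    delta = real[b_pred] - real[b_true]          # extra seconds if mistaken
    return b_pred == b_true, delta, 100.0 * delta / real[b_true]
```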
– How are the experiments designed? All the RHCA variants corresponding to<br />
the sweep of values of b resulting from the proposed running time estimation methodology<br />
have been implemented (see table 3.2). In order to test our proposals under a<br />
wide spectrum of experimental situations, consensus processes have been conducted<br />
using the seven consensus functions for hard cluster ensembles presented in appendix<br />
A.5 (i.e. CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), employing<br />
cluster ensembles of the sizes corresponding to the four diversity scenarios described<br />
in appendix A.4 —which basically boils down to compiling the clusterings output by<br />
|dfA| = {1, 10, 19, 28} clustering algorithms. In all cases, the real running times correspond<br />
to an average of 10 independent runs of the whole RHCA, in order to obtain<br />
representative real running time values (recall that the mini-ensemble components<br />
change from run to run, as they are randomly selected). For a description of the<br />
computational resources employed in our experiments, see appendix A.6.<br />
– How are results presented? Both the real and estimated running times of the<br />
serial and parallel implementations of the RHCA variants are depicted by means of<br />
curves representing their average values.<br />
– Which data sets are employed? For brevity reasons, this section only describes<br />
the results of the experiments conducted on the Zoo data collection. The presentation<br />
of the results of these same experiments on the Iris, Wine, Glass, Ionosphere, WDBC,<br />
Balance and MFeat unimodal data collections is deferred to appendix C.2.<br />
One remark before proceeding to present the results obtained: in practice, only serial<br />
RHCA have been implemented in our experiments. The real execution times of their parallel<br />
counterparts are, in fact, an estimation based on retrieving the execution times of the longest-lasting<br />
consensus process of each stage of the serial RHCA and plugging them into equation<br />
(3.9).<br />
Diversity scenario |dfA| = 1<br />
Firstly, figure 3.4 presents the results corresponding to the lowest diversity scenario, i.e.<br />
the one resulting from using a single randomly chosen clustering algorithm for generating<br />
the cluster ensemble —that is, the cardinality of the algorithmic diversity factor is equal<br />
to one, i.e. |dfA| = 1, which, on this data set, gives rise to a cluster ensemble size l = 57.<br />
Following the methodology of table 3.2, the sweep of values of the mini-ensemble size is<br />
b = {2, 3, 4, 6, 7, 28, 57} —recall that each value of b gives rise to a distinct RHCA variant.<br />
Figure 3.4(a) presents the serial RHCA estimated running time (SERTRHCA), while figure<br />
3.4(b) depicts the real serial running time (or SRTRHCA) of the implemented RHCA<br />
variants. Figures 3.4(c) and 3.4(d) present their counterparts for the parallel RHCA implementation.<br />
The lower horizontal axis of each chart presents the mini-ensemble size b of<br />
each RHCA variant, and the upper horizontal axis indicates the corresponding number<br />
of stages s of the RHCA. Notice, for instance, that s = 1 for b = 57, which corresponds to<br />
flat consensus.<br />
[Figure 3.4: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection<br />
in the diversity scenario corresponding to a cluster ensemble of size l = 57: (a) serial estimated,<br />
(b) serial real, (c) parallel estimated, (d) parallel real running time.]<br />
If the estimated and real execution times of the serial implementation of the RHCA are<br />
analyzed separately (figures 3.4(a) and 3.4(b)), it can be observed that flat consensus is<br />
faster than any RHCA variant regardless of the consensus function employed. This is due<br />
to the small size of the cluster ensemble (l = 57) in this low diversity scenario, which makes<br />
any hierarchical consensus architecture slower than its one-step counterpart.<br />
Moreover, the visual comparison of figures 3.4(a) and 3.4(b) shows that SERTRHCA is<br />
a fairly accurate estimation of SRTRHCA. However, it is to notice that our goal is not<br />
to predict the exact value of SRTRHCA, but to use SERTRHCA to predict where the real<br />
running time will attain its minimum value —which is equivalent to determining the most<br />
computationally efficient RHCA variant, a goal that is perfectly accomplished in this case.<br />
Figures 3.4(c) and 3.4(d) depict the estimated and real running times of the fully<br />
parallel implementation of the same RHCA variants as before. The observation of these<br />
charts reveals that PERTRHCA succeeds notably in predicting the location of the minima<br />
of PRTRHCA —in fact, the only prediction error occurs when the SLSAD consensus<br />
function is employed. In this case, according to PERTRHCA (figure 3.4(c)), the most efficient<br />
consensus architecture is the RHCA variant with s = 2 stages using mini-ensembles of<br />
size b = 28. However, the real execution times (figure 3.4(d)) reveal that flat consensus is<br />
the fastest option when this consensus function is employed for combining the clusterings.<br />
Nevertheless, we would like to highlight that the cost (measured in terms of running<br />
time) of selecting this computationally suboptimal RHCA variant based on the PERTRHCA<br />
prediction is almost negligible in absolute terms, as the difference between the running<br />
times of the truly and allegedly optimal parallel RHCA variants is smaller than a tenth of<br />
a second in this case.<br />
[Figure 3.5: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection<br />
in the diversity scenario corresponding to a cluster ensemble of size l = 570.]<br />
Diversity scenario |dfA| = 10<br />
Figure 3.5 presents the estimated and real execution times of several architectural variants of<br />
both the fully serial and parallel RHCA implementations in the second diversity scenario,<br />
the one resulting from employing |dfA| = 10 randomly chosen clustering algorithms for<br />
generating a cluster ensemble of size l = 570. In this case, the sweep of values of the<br />
mini-ensembles size is b = {2, 3, 4, 5, 7, 8, 19, 20, 285, 570}.<br />
If the estimated and real execution times of the serial implementation of the RHCA are<br />
observed (figures 3.5(a) and 3.5(b)), it can be noticed that i) SERTRHCA is again a fairly<br />
accurate estimation of SRTRHCA, and ii) for several consensus functions (in fact, all but<br />
EAC) there exists at least one RHCA variant that is more computationally efficient than<br />
flat consensus. In general terms, the difference between the running times of the fastest<br />
RHCA variant and flat consensus is small, although in the case of the MCLA consensus<br />
function, the execution of flat consensus (i.e. b = l = 570) is six times as costly as the<br />
fastest RHCA variant (the one with b = 20).<br />
Three main conclusions can be drawn at this point: firstly, increasing the size of the<br />
cluster ensemble makes hierarchical consensus architectures a computationally competitive<br />
alternative to flat consensus. Secondly, it is necessary to predict accurately which is the<br />
fastest RHCA variant (i.e. the specific value of the mini-ensemble size b) so as to obtain<br />
significant execution time savings. And thirdly, the computational optimality of a particular<br />
RHCA variant is specific to the consensus function employed.<br />
As regards the estimated and real running times of the fully parallel RHCA implementation,<br />
depicted in figures 3.5(c) and 3.5(d), we can conclude that, again, PERTRHCA is a<br />
good indicator of the most computationally efficient RHCA variant. Furthermore, notice<br />
that the differences between the running times of flat consensus and the optimal RHCA are<br />
beyond one order of magnitude for most consensus functions, which highlights the appeal<br />
of RHCA in computational terms, as well as the need to predict<br />
which is the least time-consuming consensus architecture.<br />
Diversity scenario |dfA| = 19<br />
The results corresponding to the third diversity scenario (i.e. cluster ensembles of size<br />
l = 1083 using |dfA| = 19 randomly chosen clustering algorithms) are presented in figure<br />
3.6. In this case, the mini-ensembles size sweep is b = {2, 3, 4, 5, 6, 8, 9, 26, 27, 541, 1083}.<br />
As regards the serial implementation of the RHCA –whose estimated and real running<br />
times are presented in figures 3.6(a) and 3.6(b), respectively–, a few observations must be<br />
made. Firstly, notice that the curves in figure 3.6(a) present a high degree of resemblance<br />
to the ones in figure 3.6(b), which indicates that SERTRHCA is a notably accurate predictor<br />
of SRTRHCA. Again, we would like to highlight the fact that our main interest is that<br />
the former is a good predictor of the location of the minima of the latter, a goal which is<br />
pretty successfully achieved in this case. Secondly, notice the influence of the consensus<br />
function employed for conducting the clustering combination on the running time of the<br />
RHCA. Whereas most of them yield a similar running time pattern (i.e. they have a<br />
more or less pronounced minimum around b = 26 or b = 27), two consensus functions<br />
stand out for their particular behaviour: i) when the EAC consensus function is employed,<br />
flat consensus is faster than any serial RHCA variant, and ii) when consensus is created<br />
by means of the MCLA consensus function, the space complexity requirements of MCLA<br />
make flat consensus not executable, as this is the only consensus function (among the ones<br />
employed in this work) whose complexity scales quadratically with the cluster ensemble size<br />
(see appendix A.5).<br />
If the estimated and real parallel RHCA implementation running times are evaluated<br />
(see figures 3.6(c) and 3.6(d)), it can be observed that, whatever the consensus function<br />
employed, there always exists at least one parallel RHCA variant which performs more<br />
efficiently than flat consensus. Moreover, notice that, just like in the previous diversity<br />
scenario, there exists a notable difference between the running times of the most efficient<br />
parallel RHCA and flat consensus, which can be as high as two orders of magnitude.<br />
[Figure 3.6: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection<br />
in the diversity scenario corresponding to a cluster ensemble of size l = 1083.]<br />
Diversity scenario |dfA| = 28<br />
Figure 3.7 depicts the estimated and real execution times corresponding to the highest<br />
diversity scenario —i.e. the one resulting from applying the |dfA| = 28 clustering algorithms<br />
from the CLUTO clustering package for generating cluster ensembles of size l = 1596. In<br />
this case, the mini-ensembles size sweep is b = {2, 3, 4, 5, 6, 10, 11, 32, 33, 798, 1596}.<br />
The results obtained are pretty similar to those obtained on the previous diversity<br />
scenario, although the following remarks must be made: firstly, notice that the large size of<br />
the cluster ensemble may not only impede flat consensus, but also the execution of those<br />
RHCA variants using larger mini-ensembles (see the curves corresponding to the MCLA<br />
consensus function). And secondly, it is noteworthy that the larger the cluster ensemble,<br />
the greater the running time savings –which can be as high as two orders of magnitude–<br />
derived from using the computationally optimal RHCA variant instead of flat consensus<br />
(when it is executable), regardless of the consensus function employed.<br />
[Figure 3.7: Estimated and real running times of the serial and parallel RHCA on the Zoo data collection<br />
in the diversity scenario corresponding to a cluster ensemble of size l = 1596.]<br />
Conclusions regarding the computational efficiency of RHCA<br />
The observation of the results obtained across the four diversity scenarios (together with<br />
the experiments presented in appendix C.2) allows drawing several conclusions as regards<br />
the computational efficiency of hierarchical and flat consensus architectures:<br />
– hierarchical consensus architectures can constitute a feasible way to obtain a consensus<br />
clustering solution in cases where one-step consensus is not affordable, which<br />
ultimately depends on the size of the cluster ensemble, the characteristics of the consensus<br />
function and the computational resources at hand.<br />
– as expected, parallel RHCA are highly efficient, being faster than flat consensus even<br />
in low diversity scenarios.<br />
– serial RHCA implementations become computationally competitive in medium to high<br />
diversity scenarios.<br />
– depending on the characteristics of the consensus function(s) employed for conducting<br />
clustering combination, large variations of the overall execution time of consensus<br />
architectures are observed.<br />
Chapter 3. Hierarchical consensus architectures<br />
Serial RHCA Parallel RHCA<br />
Dataset % correct predictions ΔRT (sec.) ΔRT (%) % correct predictions ΔRT (sec.) ΔRT (%)<br />
Zoo 72.2 1.11 109.7 54.9 0.10 67.2<br />
Iris 90.4 0.05 26.1 56.7 0.12 102.7<br />
Wine 77.1 0.60 37.8 46.8 0.21 139.5<br />
Glass 74.6 0.49 26.5 25.9 0.26 97.3<br />
Ionosphere 73.1 2.63 16.6 67.8 0.77 110.5<br />
WDBC 63.0 12.11 39.1 38.6 8.17 113.9<br />
Balance 92.4 0.31 29.5 73.2 3.09 87.3<br />
MFeat 83.4 7.02 27.7 76.3 14.41 50.3<br />
average 78.3 3.04 39.1 55.0 3.39 96.1<br />
Table 3.3: Evaluation of the minimum complexity RHCA variant estimation methodology<br />
in terms of the percentage of correct predictions and running time penalizations resulting<br />
from mistaken predictions.<br />
Conclusions regarding the optimal RHCA prediction methodology<br />
As far as the proposed running time estimation methodology is concerned, the following<br />
conclusions are drawn:<br />
– the computation of SERTRHCA and PERTRHCA constitutes a reasonable, simple and<br />
fast means for predicting whether flat or hierarchical consensus should be conducted.<br />
– the selection of computationally suboptimal consensus architectures caused by prediction<br />
errors of the proposed methodology entails a (usually assumable) execution<br />
time overhead.<br />
In order to provide the reader with a more quantitative analysis of the predictive power<br />
of the proposed running time estimation methodology, we have computed the percentage<br />
of experiments –considering the eight data sets over which they have been conducted– in which the<br />
minimum value of the estimated and real running times is obtained for the same consensus<br />
architecture. If, for a given experiment, both functions are simultaneously minimized, then<br />
our methodology succeeds in determining a priori which is the fastest consensus architecture.<br />
If not, we compute the difference between the real running times of the truly (i.e. the<br />
one that minimizes SRTRHCA or PRTRHCA) and the allegedly (that is, the one minimizing<br />
SERTRHCA or PERTRHCA) computationally optimal consensus architectures, so as to<br />
provide a measure of the impact of choosing a suboptimal consensus configuration both in<br />
absolute and relative terms.<br />
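This comparison can be sketched as a minimal Python illustration (the function name and the dictionary-based interface are ours, not part of the thesis software):

```python
def evaluate_prediction(estimated_rt, real_rt):
    """Check whether the estimated running times pick the truly fastest
    consensus architecture; if not, quantify the penalty (ΔRT).

    estimated_rt, real_rt: dicts mapping architecture -> running time (sec.).
    Returns (correct_prediction, delta_rt_sec, delta_rt_pct).
    """
    predicted = min(estimated_rt, key=estimated_rt.get)  # allegedly fastest
    truly = min(real_rt, key=real_rt.get)                # truly fastest
    if predicted == truly:
        return True, 0.0, 0.0
    delta = real_rt[predicted] - real_rt[truly]          # absolute ΔRT (sec.)
    return False, delta, 100.0 * delta / real_rt[truly]  # relative ΔRT (%)
```

For instance, if the estimates pick the flat architecture but some RHCA variant is actually faster, the returned ΔRT values measure the overhead incurred by the suboptimal choice, in seconds and in percentage.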
Table 3.3 presents the percentage of experiments where the minima of SERTRHCA<br />
and PERTRHCA predict the most efficient consensus architecture correctly (expressed as<br />
‘% correct predictions’). In this case, SERTRHCA and PERTRHCA have been estimated<br />
upon a single execution (c = 1) of a consensus process on the mini-ensembles of size bij,<br />
i.e. no statistical running time averaging is conducted. Moreover, the difference between<br />
the real running times (ΔRT, measured both in seconds and in relative percentage) of the<br />
truly and the allegedly computationally optimal consensus architectures is also presented,<br />
which corresponds to an average across 50 independent runs of the experiments conducted<br />
on each data collection.<br />
Note that, as already observed in the experiment described in this section and<br />
those presented in appendix C.2, SERTRHCA is a pretty accurate estimator of SRTRHCA<br />
(despite being based on the running time of a single consensus) and, as such, it succeeds<br />
notably in determining the most computationally efficient consensus architecture, achieving<br />
an average level of accuracy superior to 78% across the eight data sets employed in these<br />
experiments. In spite of this notably high accuracy, notice that the resulting<br />
running time increase derived from inaccurate predictions is pretty high when measured in<br />
relative percentage —e.g. for the Zoo data set, the average running time of truly optimal<br />
consensus architectures is more than doubled when suboptimal ones are selected. However,<br />
if this average execution time increase is measured in absolute terms, we can conclude that<br />
it is perfectly assumable from a practical viewpoint in most cases —after all, the large<br />
relative deviation observed in the Zoo data collection results only in a one second running<br />
time rise.<br />
As regards the parallel implementation of the RHCA, there are a couple of issues worth<br />
noting: firstly, the proposed prediction methodology performs worse than in the serial case.<br />
This is due to the fact that, as observed across all the experiments conducted, PERTRHCA is<br />
a poorer estimator of PRTRHCA. However, although ΔRT reaches very high values in relative<br />
terms, the absolute running time deviations between the truly and allegedly fastest consensus<br />
architectures are, again, not of paramount importance (i.e. the running time overhead<br />
due to a slightly erroneous estimation of the fastest RHCA variant is clearly preferable to the<br />
hypothetical execution of the least efficient consensus architecture).<br />
Recall that the proposed running time estimation methodology is based on capturing the<br />
execution times of several (namely c) runs of the consensus process on mini-ensembles of the<br />
sizes bij corresponding to each RHCA variant. As aforementioned, the results just reported<br />
have been obtained upon a single execution (c = 1). Expectedly, the larger the value<br />
of c, the more accurate the estimation, but also the more costly the estimation process in<br />
terms of computation time.<br />
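The mechanics of this estimation can be sketched as follows (a simplified illustration under our own naming; the consensus function and the stage description are placeholders, not the thesis implementation):

```python
import time

def estimate_serial_rt(consensus_fn, stages, c=1):
    """Estimate the serial RHCA running time (SERT) by timing c executions
    of the consensus function on one sample mini-ensemble per stage and
    extrapolating to the K_i consensus processes of that stage.

    stages: list of (K_i, sample_mini_ensemble) pairs, one per stage.
    """
    total = 0.0
    for k_i, sample in stages:
        t0 = time.perf_counter()
        for _ in range(c):
            consensus_fn(sample)
        per_run = (time.perf_counter() - t0) / c   # average over the c runs
        total += k_i * per_run                     # K_i serial consensus per stage
    return total
```

Increasing c averages out timing noise at the price of running the consensus function c times per stage, which is precisely the accuracy/cost trade-off discussed above.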
So as to evaluate the influence of this factor, figure 3.8 depicts the evolution of the<br />
percentage of correct predictions (for both the serial and parallel implementations, referred<br />
to as %CPS and %CPP, respectively) and of the running time deviations (ΔRTS and ΔRTP<br />
in both absolute and relative terms) as a function of the parameter c, varying its value<br />
between 1 and 20, averaged across the eight data collections employed in this experiment.<br />
It can be observed that, despite the relatively wide sweep of values of c, the variation<br />
in the percentage of correct predictions is below 6% for both the serial and parallel RHCA<br />
implementations —see figure 3.8(a). In terms of the difference between the running times<br />
of the truly and allegedly fastest consensus architectures, this results in a slight reduction<br />
of ΔRTS and ΔRTP –figure 3.8(b)–, which is, in any case, lower than 1.7 seconds —amounting,<br />
in relative percentage terms, to a maximum reduction of 22% —see figure 3.8(c).<br />
Thus, we can conclude that, despite being a coarse approximation, using the running<br />
time of a single consensus process as the basis for estimating the execution time of the<br />
whole RHCA yields pretty accurate results as far as the prediction of the most efficient<br />
consensus architecture is concerned. Furthermore, when this prediction methodology fails,<br />
the execution time overhead is, in most cases, not dramatic from a practical standpoint.<br />
[Figure: three panels plotting, as a function of c (the number of consensus processes, swept from 1 to 20): (a) the percentage of correct optimal consensus architecture predictions (%CPS and %CPP), (b) the absolute running time differences (sec.) between the truly and allegedly optimal consensus architectures (ΔRTS and ΔRTP), and (c) the same differences in relative percentage terms.]<br />
Figure 3.8: Evolution of the accuracy of the serial and parallel RHCA running time estimation<br />
as a function of the number of consensus processes c used in the estimation, measured in<br />
terms of (a) the percentage of correct predictions, and the (b) absolute and (c) relative running<br />
time deviations between the truly and allegedly optimal consensus architectures, averaged<br />
across the eight data sets employed.<br />
Summary of the most computationally efficient RHCA variants<br />
Following the proposed methodology, we have estimated which are the most computationally<br />
efficient consensus architectures for the twelve unimodal data collections described in<br />
appendix A.2.1. The results corresponding to the fully serial and parallel implementations<br />
are presented in tables 3.4 and 3.5, respectively.<br />
From a notational viewpoint, RHCA variants are expressed in terms of their number of<br />
stages s (in case there exist two variants with the same number of stages, we denote whether<br />
it corresponds to the implementation using the minimum or maximum mini-ensemble size<br />
by the symbols bm and bM, respectively). Moreover, successful predictions of the computationally<br />
optimal consensus architectures (i.e. the minima of SERTRHCA –or PERTRHCA–<br />
and SRTRHCA –or PRTRHCA– are yielded by the same consensus architecture) are denoted<br />
with the dagger symbol (†). Obviously, this applies to the first eight data collections<br />
(from Zoo to MFeat), where the true running times of all consensus architectures have<br />
been measured after their real execution. For the remaining four data sets (from miniNG<br />
to PenDigits), the allegedly optimal consensus architectures are presented —however, we<br />
think it is not an outlandish assumption to consider that a rate of computationally optimal<br />
consensus architecture correct predictions comparable to those presented in table 3.3 can<br />
be expected in these cases.<br />
As regards the consensus architecture serial implementation (table 3.4), a few observations<br />
can be made: firstly, the higher the degree of diversity (i.e. the larger the cluster<br />
ensembles), the more efficient RHCA variants become when compared to flat consensus.<br />
As observed earlier, the most notable exception to this rule of thumb occurs when clustering<br />
combination is conducted by means of the EAC consensus function, whereas it can<br />
be observed that the remaining ones show, in general terms, a pretty similar behaviour as<br />
regards the type of consensus architecture (flat or hierarchical) that minimizes the total<br />
running time. Secondly, notice that flat consensus tends to be computationally optimal in<br />
those data sets having small cluster ensembles even in high diversity scenarios (e.g. Iris,<br />
Balance or MFeat). Thirdly, for data collections containing a large number of objects n<br />
(such as PenDigits), only the HGPA and MCLA consensus functions are executable in our<br />
experimental conditions (as they are the only ones whose complexity scales linearly with the<br />
data set size, see appendix A.5). And last, notice the predominance of RHCA variants<br />
with s = 2 and s = 3 stages among the fastest ones, which seems to indicate that, from a<br />
computational perspective, RHCA variants with few stages are more efficient, even though<br />
their consensus processes are conducted on large mini-cluster ensembles.<br />
Most of these observations can be extrapolated to the case of the fully parallel consensus<br />
implementation (table 3.5), where we can observe a pretty overwhelming prevalence of<br />
RHCA variants over flat consensus, a trend that was already reported earlier in this section<br />
and also in appendix C.2.<br />
In the remainder of this work, experiments concerning random hierarchical consensus<br />
architectures have been limited, for the sake of brevity, to those RHCA variants of minimum<br />
estimated running time.<br />
Consensus Diversity Dataset<br />
function scenario Zoo Iris Wine Glass Ionosphere WDBC Balance MFeat miniNG Segmentation BBC PenDigits<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
CSPA |dfA| = 10 s=2,bM† flat† s=2,bM† s=2,bM† s=2,bM† s=2,bM† flat† flat† s=2,bM flat flat –<br />
|dfA| = 19 s=3,bM flat† s=2,bM† s=2,bM† s=2,bM s=2,bM† flat† flat† s=2,bm s=2,bM s=2,bM –<br />
|dfA| = 28 s=2,bm† flat† s=2,bM† s=2,bM s=2,bm s=2,bM flat† flat† s=3,bM s=2,bM s=2,bM –<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
EAC |dfA| = 10 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
|dfA| = 19 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
|dfA| = 28 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat flat† flat† s=3,bM s=2,bM flat s=2,bM<br />
HGPA |dfA| = 10 s=2,bM† flat† s=2,bM† s=2,bM† s=2,bm† s=2,bm† flat† s=2,bM† s=4,bM s=2,bm s=2,bm s=2,bm<br />
|dfA| = 19 s=2,bm† flat† s=2,bM† s=2,bM s=2,bm† s=3,bM† s=2,bM† s=2,bm† s=2,bm s=3,bM s=2,bm s=3,bm<br />
|dfA| = 28 s=2,bm† s=2,bM† s=3,bM s=2,bm s=3,bM† s=3,bM s=2,bM† s=3,bM s=3,bM s=4,bM s=6 s=3,bm<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat flat† flat† s=3,bm s=2,bm s=3,bM s=2,bM<br />
MCLA |dfA| = 10 s=3,bM flat† s=2,bm† s=2,bm† s=3,bM† s=2,bm† flat s=2,bm† s=4,bM s=4,bM s=3,bM s=3,bm<br />
|dfA| = 19 s=2,bm† s=2,bM† s=2,bm s=2,bm† s=2,bm s=3,bM s=2,bM† s=2,bm† s=2,bm s=3,bm s=4,bm s=3,bm<br />
|dfA| = 28 s=2,bm† s=2,bM† s=2,bm s=3,bM† s=2,bm† s=3,bM† s=2,bM s=3,bm s=3,bM s=4,bM s=3,bM s=4,bM<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
ALSAD |dfA| = 10 s=2,bm† flat† s=2,bM s=2,bM s=2,bM† flat† flat† flat† s=2,bM flat flat –<br />
|dfA| = 19 s=2,bm† flat† s=2,bm† s=2,bm† s=2,bm† s=2,bM flat† flat† s=2,bm flat flat –<br />
|dfA| = 28 s=3,bM s=2,bM† s=2,bm† s=3,bM s=2,bm† s=2,bm flat† flat† s=3,bm flat flat –<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
KMSAD |dfA| = 10 s=2,bm† flat† s=2,bM† s=2,bM† s=3,bM† s=2,bM† flat† flat† flat flat flat –<br />
|dfA| = 19 s=2,bm† flat s=2,bm† s=2,bM† s=3,bM† s=2,bm† flat† flat† s=2,bM s=2,bM s=2,bM –<br />
|dfA| = 28 s=2,bm† s=2,bM s=3,bM s=2,bm† s=2,bm† s=2,bm† flat† flat† s=3,bM s=3,bM s=2,bM –<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
SLSAD |dfA| = 10 s=2,bm† flat† s=2,bM s=2,bm† s=2,bM s=2,bM† flat† flat† s=2,bM flat flat –<br />
|dfA| = 19 s=2,bm† flat† s=2,bm† s=3,bM s=3,bM s=2,bM flat† flat† s=2,bM flat flat –<br />
|dfA| = 28 s=2,bm† s=2,bM s=3,bM s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bm flat flat –<br />
Table 3.4: Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully serial<br />
implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.
Consensus Diversity Dataset<br />
function scenario Zoo Iris Wine Glass Ionosphere WDBC Balance MFeat miniNG Segmentation BBC PenDigits<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† s=2,bm s=2,bm flat –<br />
CSPA |dfA| = 10 s=2,bm† flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM s=2,bm s=2,bm –<br />
|dfA| = 19 s=3,bm flat s=3,bM s=3,bM s=2,bm† s=3,bM flat† s=3,bM s=3,bM s=2,bm s=2,bm –<br />
|dfA| = 28 s=2,bm† s=3,bM s=3,bm s=3,bM s=2,bm† s=3,bM flat s=3,bm s=3,bM s=2,bm s=2,bm –<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
EAC |dfA| = 10 s=2,bm† flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† flat s=2,bm s=3,bM –<br />
|dfA| = 19 s=2,bm† s=3,bM s=3,bM s=3,bM s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –<br />
|dfA| = 28 s=2,bm† s=3,bM s=2,bm† s=3,bM s=2,bm† s=3,bM s=2,bm flat† s=2,bm s=2,bm s=2,bm –<br />
|dfA| = 1 flat† flat† flat† flat† s=3,bM s=3,bM flat† flat† s=3,bm s=3,bm s=2,bm s=3,bM<br />
HGPA |dfA| = 10 s=3,bM flat† s=2,bm s=3,bM s=4,bM s=3,bm s=2,bm† s=3,bM s=6 s=4,bm s=2,bm s=4,bM<br />
|dfA| = 19 s=3,bm† s=3,bM s=3,bM s=4,bM s=2,bm† s=4,bM s=2,bm† s=4,bM s=2,bM s=4,bM s=6 s=3,bm<br />
|dfA| = 28 s=3,bm s=3,bM s=3,bm s=3,bM s=3,bm s=3,bm† s=3,bm s=3,bm† s=3,bM s=5,bm s=6 s=4,bM<br />
|dfA| = 1 s=2,bm† flat† flat† flat† s=3,bM s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm s=3,bM<br />
MCLA |dfA| = 10 s=3,bm† s=2,bm† s=2,bm s=3,bm s=4,bM s=2,bm† s=2,bm† s=3,bM s=5 s=4,bm s=4,bm s=4,bM<br />
|dfA| = 19 s=2,bm s=3,bM s=3,bM s=4,bM s=2,bm† s=4,bM s=2,bm† s=4,bM s=2,bM s=4,bM s=4,bm s=3,bm<br />
|dfA| = 28 s=3,bm† s=2,bm† s=3,bm s=3,bm† s=4,bM s=4,bm s=2,bm† s=3,bm† s=3,bM s=5,bm s=3,bM s=4,bM<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
ALSAD |dfA| = 10 s=3,bM flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM flat s=3,bM –<br />
|dfA| = 19 s=2,bm† s=3,bM s=3,bM s=2,bm† s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –<br />
|dfA| = 28 s=4,bM s=2,bm† s=3,bM s=3,bm s=2,bm† s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm –<br />
|dfA| = 1 flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
KMSAD |dfA| = 10 s=2,bm† flat† s=2,bm s=3,bM s=4,bM s=2,bm† flat† flat† s=3,bM s=3,bm s=2,bM –<br />
|dfA| = 19 s=2,bm† s=3,bM s=3,bM s=3,bm s=3,bm s=3,bM s=2,bm flat† s=3,bM s=2,bm s=2,bm –<br />
|dfA| = 28 s=2,bm† s=2,bm† s=4,bM s=3,bM s=2,bm† s=3,bM s=4,bM flat† s=3,bM s=2,bm s=3,bm –<br />
|dfA| = 1 s=2,bM flat† flat† flat† flat† flat† flat† flat† flat flat flat –<br />
SLSAD |dfA| = 10 s=3,bM flat† s=2,bm s=3,bM s=2,bm† s=2,bm† flat† flat† s=3,bM s=2,bm s=3,bM –<br />
|dfA| = 19 s=2,bm† s=2,bM s=3,bM s=4,bM s=2,bm† s=3,bM flat† flat† s=3,bM s=2,bm s=2,bm –<br />
|dfA| = 28 s=3,bm s=2,bm† s=3,bM s=3,bm s=2,bm† s=3,bM flat† flat† s=3,bm s=2,bm s=2,bm –<br />
Table 3.5: Computationally optimal consensus architectures (flat or RHCA) on the unimodal data collections assuming a fully parallel<br />
implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.
3.3 Deterministic hierarchical consensus architectures<br />
This section is devoted to the description of deterministic hierarchical consensus architectures<br />
(or DHCA). As in the previous section, we present a generic definition of this<br />
architectural variant along with a study of its computational complexity.<br />
3.3.1 Rationale and definition<br />
As opposed to random HCA, this proposal drives the creation of the mini-ensembles by<br />
a deterministic criterion. The main idea behind DHCA is to exploit the distinct ways of<br />
introducing diversity in the cluster ensemble as the guiding principle for creating the mini-ensembles<br />
upon which the intermediate consensus clustering solutions are built. That is,<br />
a key differential factor between DHCA and RHCA is that the former type of architecture<br />
is indirectly designed by the user when creating the cluster ensemble, whereas the latter<br />
requires the user to fix an architectural defining factor (i.e. assign a value to the size of the<br />
mini-ensembles b).<br />
Enlarging on the relationship between the creation of the cluster ensemble and the<br />
configuration of the DHCA, it is important to recall the strategies employed for introducing<br />
diversity in cluster ensembles (see section 2.1).<br />
For instance, heterogeneous cluster ensembles –whose components are generated by<br />
the execution of multiple clustering algorithms on the data set– have a single diversity<br />
factor, i.e. the set of distinct clustering algorithms employed. Meanwhile, when creating<br />
homogeneous cluster ensembles (those compiling the outcomes of multiple runs of a single<br />
clustering algorithm), a wider spectrum of diversity factors can be applied, such as the<br />
random starting configuration of a stochastic algorithm, or the use of distinct attributes for<br />
representing the objects in the data set, among others.<br />
As aforementioned, in this work we combine both the homogeneous and heterogeneous<br />
approaches for creating cluster ensembles, aiming not only to obtain highly diverse cluster<br />
ensembles, but also to design a strategy for fighting against clustering indeterminacies. This<br />
means that we employ several mutually crossed diversity factors (e.g. multiple clustering<br />
algorithms are run on several data representations with varying dimensionalities), and this<br />
will be the scenario where DHCA will be defined.<br />
In general terms, let us denote the number of diversity factors employed in the cluster<br />
ensemble creation process as f. Each diversity factor dfi ∀i ∈ [1,f] has a cardinality |dfi|<br />
—e.g. |dfi| denotes the number of clustering algorithms employed for creating the cluster<br />
ensemble in case that the ith diversity factor dfi represents the algorithmic diversity of the<br />
ensemble.<br />
Finally, notice that, if fully mutual crossing between all diversity factors is ensured (e.g.<br />
each cluster ensemble component is the result of running each clustering algorithm on each<br />
document representation of each distinct dimensionality), the cluster ensemble size l can be<br />
expressed as:<br />
l = \prod_{k=1}^{f} |df_k| (3.12)<br />
Let us see how the design of the cluster ensemble determines the topology of a deterministic<br />
hierarchical consensus architecture. The guiding principle is that the consensus<br />
processes conducted at each stage of a DHCA combine those clusterings stemming from a<br />
single diversity factor (e.g. those ensemble components obtained by applying all the available<br />
algorithms on a particular object representation with a specific dimensionality). Then,<br />
quite obviously, the number of stages of a DHCA is equal to the number of diversity factors<br />
employed in creating the cluster ensemble, i.e. s = f.<br />
Besides selecting the diversity factors (and their cardinality) used in generating the cluster<br />
ensemble, the user must make an additional choice that affects the DHCA configuration:<br />
deciding which diversity factor is subject to consensus at each DHCA stage. This is specified<br />
by means of an ordered list of diversity factors, O = {df1, df2, ..., dff}, so that dfi will<br />
refer hereafter to the diversity factor which is subject to consensus at the ith stage of the<br />
DHCA.<br />
As regards the number of consensus processes executed on the ith DHCA stage (Ki), it is equal<br />
to the product of the cardinalities of the diversity factors that have not been subject to<br />
consensus either in the present or in any previous stage —except for the last stage,<br />
where a single consensus is conducted, as defined in equation (3.13).<br />
K_i = \begin{cases} \prod_{k=i+1}^{f} |df_k| & \text{if } 1 \le i < f \\ 1 & \text{if } i = f \end{cases} (3.13)<br />
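Equations (3.12) and (3.13) can be illustrated with a short sketch (the function name is ours; note that the product over an empty index range equals 1, which recovers the single consensus of the last stage):

```python
from math import prod

def dhca_layout(cardinalities):
    """Given the ordered cardinalities |df_1|, ..., |df_f|, return the
    cluster ensemble size l (equation (3.12)) and the per-stage consensus
    counts K_i (equation (3.13)).  prod([]) == 1, so the last stage
    contributes a single consensus process."""
    f = len(cardinalities)
    l = prod(cardinalities)                              # eq. (3.12)
    K = [prod(cardinalities[i + 1:]) for i in range(f)]  # eq. (3.13)
    return l, K
```

For the example of figure 3.9 (cardinalities 3, 3 and 2 in consensus order), this yields l = 18 and K = [6, 2, 1].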
[Figure: DHCA diagram. The l = 18 cluster ensemble components λi,j,k enter six first-stage consensus functions yielding the intermediate clusterings λA,j,k; a second stage of two consensus functions yields λA,j,D; a final consensus function outputs the consensus clustering λc.]<br />
Figure 3.9: An example of a deterministic hierarchical consensus architecture operating on a<br />
cluster ensemble created using three diversity factors: three clustering algorithms (|dfA| = 3),<br />
and two object representations (|dfR| = 2) of three dimensionalities each (|dfD| = 3). The cluster<br />
ensemble component obtained by running the ith clustering algorithm on the jth object<br />
representation and the kth dimensionality is denoted as λi,j,k. Consensus clusterings are sequentially<br />
created across the algorithmic, dimensional and representational diversity factors (dfA, dfD<br />
and dfR, respectively).<br />
Therefore, a total of K2 = |dfR| = 2 consensus processes are run, each on a mini-ensemble<br />
of size b2j = |dfD| = 3, ∀j ∈ [1, 2]. The halfway consensus clustering solutions obtained<br />
after this second stage are designated as λA,j,D.<br />
And finally, the last DHCA stage combines the clusterings output by the previous one,<br />
which only differ in their original object representation. Being the final stage of the hierarchy,<br />
a single consensus process is executed (K3 = 1), and the size of the mini-ensemble coincides<br />
with the cardinality of the representation diversity factor, i.e. b3j = |dfR| = 2.<br />
3.3.2 Computational complexity<br />
The maximum and minimum time complexities of DHCA –corresponding to the serial and<br />
parallel execution of the consensus processes of each stage, respectively– are estimated in<br />
the following paragraphs. In this case, the goal is to express these complexities in terms of<br />
the cardinality and number of diversity factors employed in the cluster ensemble creation<br />
process. Recall that the time complexity of consensus functions typically grows linearly or<br />
quadratically with the cluster ensemble size, i.e. it is O(l^w), where w ∈ {1, 2}.<br />
Serial DHCA<br />
Firstly, let us consider the fully serialized version of the DHCA. In this case, the time<br />
complexity amounts to the sum of all the consensus processes, as defined by equation (3.14).<br />
Notice that STCDHCA can be expressed ultimately in terms of the number and cardinalities<br />
of the diversity factors employed in the generation of the cluster ensemble.<br />
STC_{DHCA} = \sum_{i=1}^{s} \sum_{j=1}^{K_i} O(b_{ij}^{w}) = \sum_{i=1}^{s} K_i \cdot O(b_{ij}^{w}) = \sum_{i=1}^{f} \left( \prod_{k=i+1}^{f} |df_k| \right) \cdot O(|df_i|^{w}) (3.14)<br />
Keeping the higher order terms, the serial DHCA time complexity is:<br />
STC_{DHCA} = O\left( \left( \prod_{k=2}^{f} |df_k| \right) |df_1|^{w} \right) (3.15)<br />
Parallel DHCA<br />
And secondly, the time complexity of the parallelized execution of the DHCA is presented<br />
in equation (3.16). As all the consensus processes in a given DHCA stage are equally costly,<br />
the value of PTCDHCA amounts to the addition of the complexities of one of the consensus<br />
processes run on each of the s stages of the hierarchy.<br />
PTC_{DHCA} = \sum_{i=1}^{s} O(b_{ij}^{w}) = \sum_{i=1}^{f} O(|df_i|^{w}) (3.16)<br />
Notice that the parallel execution of a DHCA can be regarded as a sequence of f<br />
instructions of complexity O(|df_i|^{w}), ∀i ∈ [1, f]. Therefore, applying the sum rule of<br />
asymptotic notation, PTCDHCA can be rewritten as:<br />
PTC_{DHCA} = O\left( \max_i (|df_i|)^{w} \right) (3.17)<br />
3.3.3 Running time minimization<br />
As in section 3.2, a naturally arising question regarding the practical implementation of<br />
deterministic hierarchical consensus architectures is the following: given a cluster ensemble<br />
of size l created upon a set of diversity factors dfi (for i = {1,...,f}), which is the least<br />
time consuming DHCA variant that can be built?<br />
Indeed, as the topology of a deterministic hierarchical consensus architecture is ultimately<br />
determined by an ordered list O of the f diversity factors indicating upon which<br />
diversity factor consensus is conducted at each DHCA stage, there exist f! distinct DHCA<br />
variants for a given consensus clustering problem —one for each of the possible ways of ordering<br />
the f diversity factors. Then, the question transforms into: how should the diversity<br />
factors be ordered so as to minimize the total running time of the DHCA?<br />
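Since f is typically small, this question can be explored by brute force; the sketch below (our own naming, not the thesis's analytical answer) enumerates the f! orderings and scores each with the serial cost of equation (3.14), treating each O(|df_i|^w) term as a unit-cost |df_i|^w:

```python
from itertools import permutations
from math import prod

def serial_dhca_cost(order, w=2):
    """Serial DHCA cost per equation (3.14): sum over stages of
    K_i * |df_i|^w, with K_i the product of the cardinalities of the
    diversity factors not yet subject to consensus (equation (3.13))."""
    f = len(order)
    return sum(prod(order[i + 1:]) * order[i] ** w for i in range(f))

def fastest_serial_ordering(cardinalities, w=2):
    """Brute-force search over the f! diversity factor orderings."""
    return min(permutations(cardinalities),
               key=lambda order: serial_dhca_cost(order, w))
```

For example, with two diversity factors of cardinalities 2 and 5 and w = 2, the ordering (2, 5) costs 5·4 + 25 = 45 versus 2·25 + 4 = 54 for (5, 2), so the former is selected.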
Notice that, as opposed to what was observed in random HCA, the distinct DHCA<br />
variants do not differ in their number of stages (which is, in all cases, equal to the number<br />
of diversity factors, i.e. s = f), but in the time complexity of each stage of the architecture.<br />
Thus, in order to determine which is the computationally optimal DHCA variant, it is<br />
necessary to analyze the dependence between the ordering of the diversity factors and the<br />
total number of consensus processes executed and their complexity. In this section, we tackle<br />
this issue for both the fully serial and parallel implementation of deterministic hierarchical<br />
consensus architectures.<br />
Without loss of generality, let us assume that consensus clustering is to be conducted<br />
on a cluster ensemble of size l generated upon f = 3 mutually crossed diversity factors. By<br />
means of an ordered list O, these three factors are associated to one of the stages of the<br />
DHCA, i.e. O = {df1,df2,df3} —recall that, according to the definition of DHCA, i) the<br />
numerical subindex of each diversity factor identifies the stage it is associated to, and ii)<br />
Ki consensus processes of complexity O(|df_i|^w) (where w ∈ {1, 2}) are conducted in the ith<br />
stage, with i = {1, 2, 3} in this case.<br />
As aforementioned, the total number of consensus processes depends on the cardinality<br />
of the diversity factors, which in this particular case amounts to the expression presented<br />
in equation (3.18):<br />
\sum_{i=1}^{f} K_i = \sum_{i=1}^{3} K_i = \prod_{k=2}^{3} |df_k| + \prod_{k=3}^{3} |df_k| + 1 = |df_2||df_3| + |df_3| + 1 (3.18)<br />
where the number of consensus per stage Ki is computed according to equation (3.13).<br />
Firstly, let us analyze the running time of the fully parallel DHCA implementation. As<br />
in section 3.2, we assume that sufficient computing resources allow the concurrent execution<br />
of all the consensus processes of any of the DHCA stages —notice that this amounts to<br />
having as many as |df2||df3| parallel computation units capable of running simultaneously<br />
all the consensus processes of the first stage, which is the one with the largest number of<br />
consensus.<br />
If this condition is met, the running time of any parallel DHCA variant becomes independent<br />
of the ordering of the diversity factors. This is due to the fact that, assuming<br />
the fully simultaneous execution of all the consensus processes corresponding to the same<br />
DHCA stage, the running time of the whole consensus architecture will be proportional to<br />
$O(|df_1|^w) + O(|df_2|^w) + O(|df_3|^w)$ –as the running time of each DHCA stage is equivalent to<br />
the execution of a single consensus process–, which is independent of which diversity factor<br />
is assigned to each stage.<br />
However, although the ordering of the diversity factors does not affect the running time of parallel<br />
DHCA variants, it does have a significant impact on the dimensioning of the necessary<br />
resources for the entirely parallel execution of all the consensus processes involved, as it is<br />
directly related to the total number of consensus processes that must be executed in the DHCA.<br />
From equation (3.18), it is straightforward to see that $\sum_{i=1}^{f} K_i$ is independent of the<br />
cardinality of the diversity factor associated to the first DHCA stage (df1). Thus, notice<br />
that the total number of consensus of a DHCA is minimized if the diversity factors are<br />
arranged in the ordered list O according to their cardinality and in decreasing order, i.e.<br />
|df1| ≥ |df2| ≥ |df3|. By doing so, the number of consensus processes conducted in the<br />
first stage of the DHCA is minimized, which is equivalent to minimizing the necessary<br />
computation units for the fully parallel implementation of the DHCA.<br />
3.3. Deterministic hierarchical consensus architectures<br />
1. Given the cluster ensemble size l generated upon a set of f mutually crossed diversity<br />
factors, create f! ordered lists corresponding to all the possible permutations of the<br />
diversity factors, each giving rise to a DHCA variant.<br />
2. For each one of the ordered lists, compute the total number of consensus processes<br />
per stage Ki, according to equation (3.13).<br />
4. Measure the time required for executing the consensus function F on c randomly<br />
picked mini-cluster ensembles of sizes |dfi| for i ∈ {1,...,f}.<br />
5. Employ the computed parameters of each DHCA variant (i.e. number of stages s = f,<br />
total number of consensus processes $\sum_{i=1}^{s} K_i$ and the running times of the consensus<br />
function F) to estimate the running times of the whole hierarchical architecture,<br />
using equations (3.14) or (3.16) depending on whether its fully serial or parallel<br />
version is to be implemented in practice.<br />
Table 3.6: Methodology for estimating the running time of multiple DHCA variants.<br />
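The estimation methodology of table 3.6 can be sketched as follows (an illustrative Python sketch, not the Matlab implementation used in our experiments; `time_of` is a hypothetical stand-in for the measured consensus running times of step 4, and the serial and parallel estimates are simple stand-ins for equations (3.14) and (3.16)):<br />

```python
from itertools import permutations
from math import prod

def estimate_dhca_running_times(factors, time_of):
    """Estimate serial and parallel running times of every DHCA variant.

    `factors` maps factor names to cardinalities; `time_of(m)` stands in
    for the measured running time of the consensus function F on a
    mini-ensemble of size m. The serial estimate charges all K_i consensus
    processes of every stage, while the parallel estimate charges a single
    consensus per stage (all K_i of a stage run concurrently).
    """
    estimates = {}
    for order in permutations(factors):
        cards = [factors[name] for name in order]
        f = len(cards)
        K = [prod(cards[i + 1:]) for i in range(f)]        # equation (3.13)
        serial = sum(K[i] * time_of(cards[i]) for i in range(f))
        parallel = sum(time_of(c) for c in cards)
        estimates["".join(order)] = (serial, parallel)
    return estimates

# Zoo-like cardinalities and a hypothetical linear (w = 1) consensus cost.
est = estimate_dhca_running_times({"A": 1, "D": 14, "R": 5},
                                  time_of=lambda m: m)
fastest_serial = min(est, key=lambda name: est[name][0])   # "DRA"
```

Under this linear toy cost, the sketch predicts DRA as the fastest serial variant and identical parallel estimates for all six orderings, in line with the preceding discussion.<br />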
If the running time of the serial implementation of the DHCA is now considered, the<br />
total number of consensus to be executed is not the only factor to take into account. In<br />
fact, arranging the diversity factors in decreasing order, while minimizing the total number<br />
of consensus to be executed across the DHCA, brings about a contradictory collateral effect<br />
if the complexity and number of the consensus processes run at each stage are considered.<br />
Indeed, notice that it is in the first DHCA stage where the largest number of consensus<br />
processes is executed (K1 = |df2||df3|), and they have the highest complexity ($O(|df_1|^w)$,<br />
where w = {1, 2}), as |df1| ≥ |df2| ≥ |df3|. Moreover, a single consensus process of minimum<br />
complexity (i.e. $O(|df_3|^w)$) is conducted at the third stage, as K3 = 1.<br />
Thus, as far as the running time of the serial implementation of DHCA is concerned,<br />
there exists an apparent trade-off involving the order of diversity factors, and the number<br />
and complexity of the associated consensus processes. It is important to note that the computationally<br />
optimal solution ultimately depends on the growth rates of the total number<br />
of consensus and of their time complexities with respect to the cardinality of the diversity<br />
factors (|dfi|, for i = {1,...,f}). However, while the latter grow according to a linear or<br />
quadratic law, the former follows a usually steeper multiplicative growth rate.<br />
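This trade-off can be made concrete with toy numbers (a hypothetical Python sketch; the cardinalities are illustrative, and the cost model simply charges each stage its Ki consensus processes at cost |dfi|^w):<br />

```python
from itertools import permutations
from math import prod

def serial_cost(cards, w):
    """Sum over stages of K_i * |df_i|^w, with K_i the product of the
    cardinalities assigned to the later stages, as in equation (3.13)."""
    return sum(prod(cards[i + 1:]) * cards[i] ** w
               for i in range(len(cards)))

factors = {"A": 1, "D": 14, "R": 5}          # hypothetical cardinalities
best = {w: min(permutations(factors),
               key=lambda order: serial_cost([factors[n] for n in order], w))
        for w in (1, 2)}
# With these toy numbers the decreasing-cardinality ordering ("DRA") is
# optimal for a linear consensus function, but not for a quadratic one.
assert "".join(best[1]) == "DRA" and "".join(best[2]) != "DRA"
```

The fact that the optimal ordering can shift with the complexity exponent w is precisely what motivates an estimation-based prediction of the fastest variant.<br />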
Similarly to what has been discussed in section 3.2, table 3.6 presents a methodology<br />
for estimating the running times of the f! DHCA variants, which allows comparing them<br />
and, as a consequence, predicting which is the most computationally efficient.<br />
3.3.4 Experiments<br />
Chapter 3. Hierarchical consensus architectures<br />
This section presents the results of multiple experiments designed to illustrate the computational<br />
efficiency of DHCA, as well as to evaluate the predictive power of the running time<br />
estimation methodology of table 3.6. Their design follows the scheme presented next.<br />
Experimental design<br />
– What do we want to measure?<br />
i) The time complexity of deterministic hierarchical consensus architectures.<br />
ii) The ability of the proposed methodology for predicting the computationally optimal<br />
DHCA variant, in both the fully serial and parallel implementations.<br />
iii) The predictive power of the proposed methodology based on running time estimation<br />
vs the computational optimality criterion based on designing the DHCA<br />
according to a decreasing diversity factor cardinality order, in both the fully<br />
serial and parallel implementations.<br />
– How do we measure it?<br />
i) The time complexity of the implemented serial and parallel DHCA variants is<br />
measured in terms of the CPU time required for their execution —serial running<br />
time (SRTDHCA) and parallel running time (PRTDHCA).<br />
ii) The estimated running times of the same DHCA variants –serial estimated running<br />
time (SERTDHCA) and parallel estimated running time (PERTDHCA)– are<br />
computed by means of the proposed running time estimation methodology, which<br />
is based on the measured running time of c = 1 consensus clustering process. Predictions<br />
regarding the computationally optimal DHCA variant are considered successful<br />
if both the real and estimated running times are minimized by the<br />
same DHCA variant, and the percentage of experiments in which prediction is<br />
successful is given as a measure of its performance. In order to measure the<br />
impact of incorrect predictions, we also measure the execution time differences<br />
(in both absolute and relative terms) between the truly and the allegedly fastest<br />
DHCA variants when prediction fails. This evaluation process is replicated<br />
for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />
on the prediction accuracy of the proposed methodology.<br />
iii) Both approaches for predicting the computationally optimal DHCA variant are compared<br />
in terms of the percentage of experiments in which prediction is successful,<br />
and in terms of the execution time overheads (in both absolute and relative terms)<br />
between the truly and the allegedly fastest DHCA variants when prediction<br />
fails.<br />
– How are the experiments designed? The f! DHCA variants corresponding to<br />
all the possible permutations of the f diversity factors employed in the generation<br />
of the cluster ensemble have been implemented (see table 3.6). As described in appendix<br />
A.4, cluster ensembles have been created by the mutual crossing of f =3<br />
diversity factors: clustering algorithms (dfA), object representations (dfR) and data<br />
dimensionalities (dfD). Thus, in all our experiments, the number of DHCA variants is<br />
f! = 3! = 6, which are identified by an acronym describing the order in which diversity<br />
factors are assigned to stages —for instance, ADR describes the DHCA variant<br />
defined by the ordered list O = {df1 = dfA, df2 = dfD, df3 = dfR}. For a given data<br />
collection, the cardinalities of the representational and dimensional diversity factors<br />
(|dfR| and |dfD|, respectively) are constant, while the cardinality of the algorithmic<br />
diversity factor takes four distinct values |dfA| = {1, 10, 19, 28}, giving rise to the four<br />
diversity scenarios where our proposals are analyzed. Moreover, consensus clustering<br />
has been conducted by means of the seven consensus functions for hard cluster<br />
ensembles described in appendix A.5, which allows evaluating the behaviour of our<br />
proposals under distinct consensus paradigms. In all cases, the real running times<br />
correspond to an average of 10 independent runs of the whole DHCA, in order to<br />
obtain representative real running time values. As described in appendix A.6, all the<br />
experiments have been executed under Matlab 7.0.4 on Pentium 4 3GHz/1 GB RAM<br />
computers.<br />
– How are results presented? Both the real and estimated running times of the<br />
serial and parallel implementations of the DHCA variants are depicted by means of<br />
curves representing their average values.<br />
– Which data sets are employed? For brevity reasons, this section only describes<br />
the results of the experiments conducted on the Zoo data collection. On this data<br />
set, the cardinalities of the representational and dimensional diversity factors are<br />
|dfR| = 5 and |dfD| = 14, respectively. The presentation of the results of these<br />
same experiments on the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat<br />
unimodal data collections is deferred to appendix C.3.<br />
Diversity scenario |dfA| = 1<br />
Figure 3.10 presents the estimated and real running times of the serial and parallel DHCA<br />
implementations in the lowest diversity scenario corresponding to the use of |dfA| = 1<br />
randomly chosen clustering algorithm for creating a cluster ensemble of size l = 57. The<br />
DHCA variants are identified in the horizontal axis of each chart. Meanwhile, the values<br />
of SERTDHCA and PERTDHCA correspond to an arbitrarily chosen estimation experiment<br />
based on a single consensus run (i.e. c = 1).<br />
On the one hand, figures 3.10(a) and 3.10(b) present the estimated and real running times<br />
of the serial DHCA implementation (SERTDHCA and SRTDHCA). Notice that SERTDHCA<br />
is a fairly good estimator of the real execution time of the DHCA variants. Moreover, it<br />
successfully predicts the fastest consensus architecture —which is what we are ultimately<br />
interested in. Notice that DRA is the DHCA variant minimizing SRTDHCA, which corresponds<br />
to the decreasing ordering of the diversity factors in terms of their cardinality.<br />
On the other hand, figures 3.10(c) and 3.10(d) depict the estimated and real running<br />
times corresponding to the fully parallel implementation of DHCA (PERTDHCA and<br />
PRTDHCA, respectively). There are two issues worth noting in this case: firstly, notice that<br />
the real execution time of the distinct DHCA variants shows a notably lower dispersion than<br />
in the serial case, which somehow corroborates our conjecture regarding the unimportance<br />
of the diversity factors ordering in parallel DHCA variants. And secondly, it is clear that<br />
PERTDHCA does not perform as accurately as regards the determination of the fastest consensus<br />
architecture as in the serial case. Moreover, notice that in this low diversity scenario,<br />
hierarchical consensus architectures are, in general terms, slower than flat consensus.<br />
Figure 3.10: Estimated and real running times of the serial and parallel DHCA on the Zoo<br />
data collection in the diversity scenario corresponding to a cluster ensemble of size l = 57:<br />
(a) serial estimated (SERTDHCA), (b) serial real (SRTDHCA), (c) parallel estimated<br />
(PERTDHCA) and (d) parallel real (PRTDHCA) running times, for the CSPA, EAC, HGPA,<br />
MCLA, ALSAD, KMSAD and SLSAD consensus functions.<br />
Diversity scenario |dfA| = 10<br />
Figure 3.11 presents the results obtained in the diversity scenario corresponding to the generation<br />
of the cluster ensemble by the application of |dfA| = 10 clustering algorithms chosen<br />
at random. In this case, there exists at least one serial DHCA variant that performs faster<br />
than flat consensus —the only exception occurs when consensus are created by the EAC<br />
consensus function (recall this also happened with RHCA, see section 3.2.4). As in the previous<br />
diversity scenario, all the parallel DHCA variants yield pretty similar running times,<br />
as opposed to what is observed in the serial case, where there exist significant execution<br />
time differences between variants. Last, as regards the prediction of the computationally<br />
optimal consensus architecture, notice that its performance is pretty accurate in both the<br />
serial and parallel cases.<br />
Diversity scenario |dfA| = 19<br />
As the diversity level of the cluster ensemble increases, a shift in the computationally optimal<br />
serial DHCA variant can be observed —see figure 3.12. Indeed, as figure 3.12(b) shows,<br />
the ADR variant of DHCA becomes the least computationally expensive serial hierarchical<br />
consensus architecture except when the EAC consensus function is employed —however,<br />
this behaviour is not always successfully predicted by SERTDHCA, as depicted in figure<br />
3.12(a). We believe this is due to the fact that our estimation is founded on the execution<br />
time of a single consensus process.<br />
Figure 3.11: Estimated and real running times of the serial and parallel DHCA on the Zoo<br />
data collection in the diversity scenario corresponding to a cluster ensemble of size l = 570:<br />
(a) serial estimated, (b) serial real, (c) parallel estimated and (d) parallel real running times,<br />
for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.<br />
As regards the parallel implementation of DHCA, the six DHCA variants yield very<br />
similar real execution times, as shown in figure 3.12(d), a trend that the running time<br />
estimation also captures —see figure 3.12(c). However, notice that this fact makes it unlikely<br />
that the absolute minima of PERTDHCA and PRTDHCA coincide, which will probably harm<br />
the predictive accuracy of our proposal in the parallel implementation context. Moreover,<br />
notice that when the MCLA consensus function is employed, flat consensus is not executable<br />
(with the resources available in our experiments, see appendix A.6), while all the DHCA<br />
variants are.<br />
Diversity scenario |dfA| = 28<br />
Figure 3.13 presents the estimated and real running times of the serial and parallel DHCA<br />
implementations in the highest diversity scenario, i.e. the one corresponding to the creation<br />
of the cluster ensemble by means of the |dfA| = 28 clustering algorithms of the CLUTO<br />
clustering toolbox —which gives rise to a cluster ensemble containing l = 1596 components.<br />
In this context, arranging the diversity factors in decreasing cardinality order for defining<br />
their association to the DHCA stages again yields the most computationally efficient serial<br />
DHCA variant (ADR in this case). This fact somehow reinforces the idea that, when<br />
compared to the typically linear or quadratic time complexity of consensus functions, the<br />
multiplicative growth rate of the total number of consensus imposes a stronger constraint<br />
as far as the running time of the DHCA is concerned. As observed in the previous diversity<br />
scenarios, the selection of a particular DHCA variant is a less critical matter when the fully<br />
parallel implementation of DHCA is considered, as all six variants yield pretty similar real<br />
execution times. In the serial case, in contrast, the accuracy of the running time estimation<br />
methodology is much more crucial, although SERTDHCA performs as a reasonably successful<br />
predictor.<br />
Figure 3.12: Estimated and real running times of the serial and parallel DHCA on the Zoo<br />
data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1083:<br />
(a) serial estimated, (b) serial real, (c) parallel estimated and (d) parallel real running times,<br />
for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.<br />
As mentioned earlier, these same experiments have been run on the Iris, Wine, Glass,<br />
Ionosphere, WDBC, Balance and MFeat unimodal data collections, and the corresponding<br />
results are presented in appendix C.3. Most of the conclusions drawn regarding the computational<br />
efficiency of hierarchical and flat consensus architectures in the analysis of random<br />
hierarchical consensus architectures are also applicable in the context of DHCA, such as<br />
the high computational efficiency of i) parallel DHCA even in low diversity scenarios, and<br />
ii) serial DHCA in medium and high diversity scenarios, or the dependence between the<br />
characteristics of the consensus function employed for conducting clustering combination<br />
and the execution time of consensus architectures.<br />
Moreover, two extra conclusions regarding the selection of the computationally optimal<br />
DHCA variant must be discussed. Firstly, defining the DHCA architecture by means of an<br />
ordered list of diversity factors arranged in decreasing cardinality order (i.e. associating the<br />
most numerous diversity factor to the first stage, the second most populated to the second<br />
DHCA stage, and so on) seems to give rise to the most computationally efficient serial<br />
DHCA variant. And secondly, the execution time of fully parallel DHCA appears to be<br />
pretty insensitive to the way diversity factors are associated to the stages of the hierarchical<br />
consensus architecture.<br />
Figure 3.13: Estimated and real running times of the serial and parallel DHCA on the Zoo<br />
data collection in the diversity scenario corresponding to a cluster ensemble of size l = 1596:<br />
(a) serial estimated, (b) serial real, (c) parallel estimated and (d) parallel real running times,<br />
for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.<br />
In practice, these two latter facts may reduce the practical relevance of the computationally<br />
optimal consensus architecture prediction methodology presented in table 3.6, as it seems<br />
possible to make well-grounded a priori decisions as regards the selection of the fastest<br />
deterministic hierarchical consensus architecture variant without the need of any running time<br />
estimation. For this reason, the next section presents an exhaustive comparative evaluation<br />
of these two strategies for predicting the computationally optimal consensus architecture.<br />
Evaluation of the optimal DHCA prediction methodology based on running<br />
time estimation<br />
Following an analogous procedure as in section 3.2.4, we have computed the percentage of<br />
experiments in which the estimated and real running times are simultaneously minimized by<br />
the same consensus architecture. The impact of the failures of this prediction methodology is<br />
measured in terms of the absolute and relative differences between the real execution times<br />
–ΔRT– of the truly (i.e. the one minimizing SRTDHCA or PRTDHCA) and the allegedly<br />
(the one that minimizes SERTDHCA or PERTDHCA) computationally optimal consensus<br />
architectures. The results corresponding to an averaging across 20 independent running<br />
time estimation experiments are presented in table 3.7.<br />
Serial DHCA Parallel DHCA<br />
Dataset % correct ΔRT % correct ΔRT<br />
predictions (sec.) (%) predictions (sec.) (%)<br />
Zoo 55.9 1.32 227.3 40.0 0.03 34.0<br />
Iris 93.4 0.56 180.7 51.2 0.10 63.7<br />
Wine 76.1 1.19 48.2 36.8 0.14 48.1<br />
Glass 76.4 0.76 32.1 39.3 0.23 60.7<br />
Ionosphere 83.2 11.9 33.1 47.9 1.38 46.5<br />
WDBC 76.2 77.73 58.1 39.7 4.93 44.5<br />
Balance 90.6 0.52 26.5 71.1 0.93 51.7<br />
MFeat 88.6 5.40 27.4 68.4 5.83 28.9<br />
average 80.0 12.57 78.0 49.3 1.70 47.3<br />
Table 3.7: Evaluation of the minimum complexity DHCA variant estimation methodology<br />
in terms of the percentage of correct predictions and running time penalizations resulting<br />
from mistaken predictions.<br />
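The evaluation criterion used here can be sketched as follows (illustrative Python with hypothetical running times, not the actual experimental measurements):<br />

```python
def evaluate_prediction(real_rt, estimated_rt):
    """Check a fastest-variant prediction and its cost when it fails.

    Both arguments map DHCA variant names to running times in seconds:
    `real_rt` holds measured values, `estimated_rt` the estimates. Returns
    whether the truly and allegedly fastest variants coincide, plus the
    absolute and relative running time overheads ΔRT of a wrong choice.
    """
    truly = min(real_rt, key=real_rt.get)
    allegedly = min(estimated_rt, key=estimated_rt.get)
    delta = real_rt[allegedly] - real_rt[truly]       # ΔRT (sec.)
    relative = 100.0 * delta / real_rt[truly]         # ΔRT (%)
    return truly == allegedly, delta, relative

# Hypothetical running times for three of the six variants.
real = {"ADR": 2.0, "DRA": 1.0, "RDA": 1.5}
est = {"ADR": 1.8, "DRA": 1.2, "RDA": 1.1}    # the estimate picks "RDA"
correct, delta, relative = evaluate_prediction(real, est)
assert (correct, delta, relative) == (False, 0.5, 50.0)
```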
It can be observed that, in the serial case, SERTDHCA is a pretty accurate predictor,<br />
achieving correct prediction rates of the computationally optimal consensus architecture<br />
above 75% in all but one of the data sets. The running time overheads associated<br />
to incorrect predictions are usually negligible in absolute terms (ΔRT (sec.)) —except for<br />
the WDBC data collection, where the large real execution times of any of the consensus<br />
architectures make any mistake costly.<br />
As regards the performance of the proposed methodology for predicting the most efficient<br />
parallel implementation of DHCA, its accuracy is lower than in the serial case –a<br />
circumstance already observed in the context of RHCA–, although the penalization caused<br />
by this lower level of precision ranges below one second of extra execution time, which<br />
constitutes an assumable cost from a practical viewpoint —the WDBC and MFeat data<br />
collections stand out as the exceptions to this rule, although the corresponding ΔRT (sec.)<br />
overheads (around five seconds) are again of little importance in practice.<br />
Finally, we have also evaluated the influence of employing the execution times of c > 1<br />
consensus executions for estimating the running times of the DHCA variants. As expected,<br />
the larger c, the more accurate the running time estimation and, consequently, the smaller<br />
the running time overheads derived from incorrect predictions. On the flip side, however,<br />
this will slow down the prediction process —recall that, in the experiments presented up to<br />
now, c =1.<br />
As in section 3.2.4, a sweep of values of c ∈ [1, 20] has been conducted, computing<br />
the percentage of fastest consensus architecture correct predictions and the absolute and<br />
relative running time deviations associated to prediction errors at each step of the sweep,<br />
averaging the results of twenty independent runs of this experiment on each one of the eight<br />
unimodal data collections —see figure 3.14.<br />
It can be observed that, despite the gradual increase of correct predictions (figure<br />
3.14(a)), the running time deviations suffer a steep decrease as soon as c = 4 consensus<br />
processes are employed for computing SERTDHCA.<br />
Figure 3.14: Evolution of the accuracy of the serial and parallel DHCA running time estimation<br />
as a function of the number of consensus processes c used in the estimation, measured in<br />
terms of (a) the percentage of correct optimal consensus architecture predictions, and the<br />
(b) absolute and (c) relative running time deviations between the truly and allegedly optimal<br />
consensus architectures.<br />
Moreover, notice that using larger values of c does not imply significant reductions in ΔRT,<br />
which shows a pretty stationary<br />
behaviour for c > 5. Last, it is worth observing that, in the parallel case, the correct<br />
prediction percentage increase and relative ΔRT decrease obtained for c > 1 result in almost<br />
negligible absolute ΔRT reductions, which again reveals the lesser importance of the<br />
a priori selection of a particular parallel DHCA variant.<br />
This adds to the fact that, as suggested earlier, it seems unnecessary to conduct any<br />
running time estimation process for determining the fastest hierarchical consensus architecture<br />
variant, as assigning the diversity factors to DHCA stages in decreasing cardinality<br />
order apparently gives rise to the serial DHCA variant of minimum time complexity. With<br />
the purpose of evaluating the validity of this latter hypothesis, we have conducted the<br />
experiments presented throughout the following paragraphs.<br />
Evaluation of the optimal DHCA prediction methodology based on decreasing<br />
cardinality diversity factor ordering<br />
As far as the serial DHCA implementation is concerned, we have computed the percentage<br />
of experiments where the minimum real running time is achieved by the variant corresponding<br />
to the decreasing cardinality ordering of the diversity factors. Moreover, in case this<br />
prediction fails, we have computed the running time overhead resulting from selecting a<br />
computationally suboptimal hierarchical consensus architecture as the fastest one. The average<br />
results obtained for each one of the eight unimodal data collections are presented in<br />
table 3.8.<br />
It can be observed, for instance, that the DHCA variant defined by the ordered list of<br />
diversity factors in decreasing cardinality order is always the fastest hierarchical consensus<br />
architecture in the Zoo data set —which is equivalent to a 100% correct prediction rate.<br />
Using this prediction method, the lowest accuracy is obtained in the MFeat data collection<br />
(71.4%) —in this case, the average running time deviation derived from the 28.6% of incorrect<br />
predictions amounts to 233.6 seconds (a very high value due to the large absolute real<br />
execution times of hierarchical consensus architectures on this data set), which is equivalent<br />
to an average deviation of 18.5% in relative percentage terms.<br />
Serial DHCA<br />
Dataset % correct ΔRT<br />
predictions (sec.) (%)<br />
Zoo 100 – –<br />
Iris 96.4 0.01 1.2<br />
Wine 89.3 0.69 16.2<br />
Glass 92.9 0.09 5.4<br />
Ionosphere 85.7 23.10 35.3<br />
WDBC 92.9 2.03 2.2<br />
Balance 75.0 1.12 17.1<br />
MFeat 71.4 233.6 18.5<br />
average 88.0 32.6 11.9<br />
Table 3.8: Evaluation of the minimum complexity serial DHCA variant prediction based on<br />
decreasing diversity factor ordering in terms of the percentage of correct predictions and<br />
running time penalizations resulting from mistaken predictions.<br />
An averaging across data sets yields a prediction accuracy of 88%, i.e. this approach performs better<br />
than the prediction methodology based on running time estimation, which made 80% of<br />
correct predictions (see table 3.7). This result reinforces the notion that the decreasing<br />
cardinality diversity factor ordering approach to selecting the computationally optimal serial<br />
DHCA variant is an alternative worth considering, as it requires no previous consensus<br />
execution while obtaining higher levels of prediction accuracy.<br />
Aiming to support the conjecture that there is no computationally superior DHCA variant<br />
when its fully parallel implementation is considered, we have conducted an experiment<br />
seeking to quantify the differences between the least and most time consuming DHCA variants.<br />
So as to provide a valid contrast to these results, the same computation has been<br />
conducted regarding the most and least computationally efficient serial DHCA variants,<br />
proving that making an accurate selection is much more important in the serial than in the<br />
parallel case. Table 3.9 presents the results obtained, averaged across all the experiments<br />
conducted on each of the eight unimodal data collections.<br />
It can be observed that the running time differences between the most and least computationally<br />
efficient DHCA variants are very notable in the serial case —in fact, it takes from<br />
5 to 18 times longer to run the slowest DHCA variant than the computationally optimal<br />
one. In contrast, these variations are much smaller when the fully parallel implementation<br />
of DHCA is considered. In this case, as expected, a greater running time uniformity is<br />
observed across DHCA variants, as the least computationally efficient variant is at most 2.5<br />
times slower than the fastest one.<br />
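Assuming the percentage column of table 3.9 reports 100 · (max − min)/min, the slowdown factors quoted above follow directly from it; a minimal sketch:

```python
def slowdown_ratio(rel_diff_percent):
    """Convert a relative difference 100 * (max - min) / min, given in
    percent, into a max/min slowdown ratio."""
    return 1.0 + rel_diff_percent / 100.0

# Serial figures from table 3.9: Glass and Ionosphere bracket the
# "5 to 18 times longer" range quoted in the text
glass = slowdown_ratio(387.1)        # ~4.9x
ionosphere = slowdown_ratio(1650.3)  # ~17.5x
# Parallel worst case (MFeat): the slowest variant stays within ~2.5x
mfeat = slowdown_ratio(154.4)
```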
To sum up, the decreasing cardinality diversity factor ordering provides the user with a
fairly accurate notion of which DHCA configuration is the most computationally efficient,
without the need to execute a single consensus process. However, this strategy does not allow
the user to decide whether the allegedly fastest DHCA variant is more efficient than flat consensus.
To do so, we propose estimating the running time of the computationally optimal DHCA
3.3. Deterministic hierarchical consensus architectures<br />
Serial DHCA Parallel DHCA<br />
Dataset max (SRTDHCA) − min (SRTDHCA) max (PRTDHCA) − min (PRTDHCA)<br />
(sec.) (%) (sec.) (%)<br />
Zoo 7.36 547.6 0.08 42.3<br />
Iris 5.34 707.4 0.12 53.2<br />
Wine 12.66 636.8 0.23 70.1<br />
Glass 8.78 387.1 0.34 90.8<br />
Ionosphere 487.73 1650.3 2.82 92.3<br />
WDBC 2095.36 1357.6 15.91 80.5<br />
Balance 187.77 736.5 8.57 96.2<br />
MFeat 16667.23 1562.4 1104.08 154.4<br />
Table 3.9: Running time differences between the most and least computationally efficient<br />
DHCA variants in both the serial and parallel implementations.<br />
variant (following the strategy presented in table 3.6), and then i) estimate the running time<br />
of flat consensus by extrapolating the execution times of the consensus processes conducted<br />
upon mini-ensembles of size |dfi| (for i = {1,...,f}), or ii) launch the execution of flat<br />
consensus, halting it as soon as its running time exceeds the estimated execution time of<br />
the allegedly optimal DHCA variant —which is a simpler but less efficient alternative.<br />
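Alternative ii) can be sketched as a budgeted execution. Here flat consensus is modelled as a hypothetical iterable of incremental steps, which is our assumption for illustration rather than the actual implementation:

```python
import time

def run_flat_with_budget(flat_consensus_steps, budget_seconds):
    """Run flat consensus as an iterable of incremental steps, halting it
    as soon as the elapsed time exceeds the budget (i.e. the estimated
    running time of the allegedly optimal DHCA variant). Returns the last
    result, or None if the budget was exhausted and the DHCA variant
    should be run instead."""
    deadline = time.monotonic() + budget_seconds
    result = None
    for result in flat_consensus_steps:
        if time.monotonic() > deadline:
            return None  # flat consensus is slower: fall back to DHCA
    return result
```

The cost of this strategy is bounded by the DHCA estimate itself, which is why it is simpler but less efficient than extrapolating from the mini-ensemble execution times.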
Summary of the most computationally efficient DHCA variants<br />
To end this section, we have estimated which are the most computationally efficient consensus<br />
architectures for the twelve unimodal data collections described in appendix A.2.1.<br />
The results corresponding to the fully serial and parallel implementations are presented in<br />
tables 3.10 and 3.11, respectively.<br />
As regards the serial consensus architecture implementation (table 3.10), a few notational<br />
observations must be made: successful predictions of the computationally optimal<br />
consensus architecture (i.e. the minima of SERTDHCA and SRTDHCA are yielded by the<br />
same consensus architecture) are denoted with a dagger (†). Moreover, we highlight the<br />
cases where the minimum time complexity consensus architecture is the DHCA variant defined<br />
by the ordered list of diversity factors arranged in decreasing cardinality order using<br />
the double dagger (‡) symbol. Quite obviously, this only applies to the first eight data<br />
collections (Zoo to MFeat), as in these cases both the estimated and real execution times<br />
are available. For the remaining data sets (miniNG to PenDigits), we have only estimated<br />
which are the computationally optimal consensus architectures.<br />
Firstly, notice the large number of † symbols in table 3.10, which indicates the reasonably
high accuracy of the proposed optimal consensus architecture prediction methodology.
Moreover, notice that, in most of the cases where we correctly predict that the least time consuming
consensus architecture is a DHCA variant, that variant is the one created by arranging the
diversity factors in decreasing order of cardinality (which is denoted by the ‡ symbol).
Secondly, it is important to highlight that the higher the degree of diversity, the more<br />
efficient DHCA variants become when compared to flat consensus —as already observed<br />
throughout all the experiments reported, the EAC consensus function constitutes an exception<br />
to this rule. However, notice that flat consensus tends to be computationally optimal<br />
Chapter 3. Hierarchical consensus architectures<br />
in those data sets having small cluster ensembles even in high diversity scenarios (e.g. Iris,<br />
Balance or MFeat).<br />
Thirdly, as the number of objects n contained in the data set increases (such as in the
PenDigits collection), only the HGPA and MCLA consensus functions are executable (as
they are the only ones whose time complexity scales linearly with the data set size, see appendix
A.5), and hierarchical consensus architectures are the most computationally efficient ones.
However, if the data set were even larger, neither flat consensus nor DHCA would be affordable from a
computational perspective —at least with the resources employed in our experiments, see appendix
A.6.
Most of these observations can be extrapolated to the case of the fully parallel consensus
implementation (table 3.11), where an overwhelming prevalence of DHCA variants over
flat consensus can be observed, a trend that was already reported earlier in this section and
can also be observed in the experiments described in appendix C.3.
For brevity reasons, the experiments presented in the remainder of this work concerning
deterministic hierarchical consensus architectures will refer solely to those DHCA variants
of minimum estimated running time —i.e. those presented in tables 3.10 and 3.11.
Consensus  Diversity    Dataset
function   scenario     Zoo   Iris  Wine  Glass Ionosphere WDBC  Balance MFeat miniNG Segmentation BBC  PenDigits
CSPA   |dfA| = 1    flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
CSPA   |dfA| = 10   DAR‡ flat† flat† flat† DAR‡ flat† flat† flat† flat flat flat –
CSPA   |dfA| = 19   ADR‡ flat† flat† ADR‡ DAR‡ DAR‡ flat† flat† ADR flat flat –
CSPA   |dfA| = 28   ADR‡ flat† ADR‡ ADR‡ DAR‡ DAR‡ flat† flat† ADR flat flat –
EAC    |dfA| = 1    flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 10   flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 19   flat† flat† flat† flat† flat† flat† flat† flat† ADR flat flat –
EAC    |dfA| = 28   flat† flat† flat† flat† flat† flat† flat† flat† ADR flat flat –
HGPA   |dfA| = 1    flat† flat† flat† flat† flat† DRA flat† RDA DRA DRA DRA DRA
HGPA   |dfA| = 10   DAR‡ flat† flat† ADR‡ DAR‡ DRA flat† ARD‡ DAR DAR DAR DAR
HGPA   |dfA| = 19   ARD‡ flat† ADR‡ ADR‡ DAR‡ DAR‡ ARD‡ ARD‡ ADR ADR ADR ADR
HGPA   |dfA| = 28   ADR‡ flat† ADR‡ ADR‡ DAR‡ ADR ARD† ARD‡ ADR ADR ADR ADR
MCLA   |dfA| = 1    DRA flat† flat† flat† flat† flat† flat† flat† DRA DRA DRA DRA
MCLA   |dfA| = 10   DAR‡ flat† DAR‡ ADR‡ DAR‡ DAR‡ flat† ARD‡ DAR DAR DAR DAR
MCLA   |dfA| = 19   ADR‡ flat† ADR‡ ADR‡ DAR‡ DAR‡ ARD‡ ARD‡ ADR ADR ADR ADR
MCLA   |dfA| = 28   ADR‡ ARD‡ ADR‡ ADR‡ DAR‡ ADR‡ ARD‡ ARD‡ ADR ADR ADR ADR
ALSAD  |dfA| = 1    flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
ALSAD  |dfA| = 10   DAR‡ flat† flat† flat† flat† flat† flat† flat† flat flat flat –
ALSAD  |dfA| = 19   ADR‡ flat† ADR‡ ADR‡ DAR flat† flat† flat† ADR flat flat –
ALSAD  |dfA| = 28   ADR‡ flat† ADR‡ ADR‡ ADR‡ flat† flat† flat† ADR flat flat –
KMSAD  |dfA| = 1    flat† flat† flat† flat† flat† DRA flat† RAD flat flat flat –
KMSAD  |dfA| = 10   DAR‡ flat† DAR flat† DAR‡ DAR‡ flat† flat† flat flat flat –
KMSAD  |dfA| = 19   ADR‡ flat† ADR‡ ADR‡ DAR‡ DAR‡ flat† flat† ADR flat flat –
KMSAD  |dfA| = 28   ADR‡ flat† ADR‡ ADR‡ DAR ADR‡ flat† flat† ADR flat flat –
SLSAD  |dfA| = 1    flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
SLSAD  |dfA| = 10   DAR‡ flat† flat† flat† flat† flat† flat† flat† flat flat flat –
SLSAD  |dfA| = 19   ADR‡ flat† ADR‡ ADR‡ DAR‡ flat† flat† flat† ADR flat flat –
SLSAD  |dfA| = 28   ADR‡ flat† ADR‡ ADR DAR flat† flat† flat† ADR flat flat –
Table 3.10: Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully serial
implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions. The double dagger (‡) identifies DHCA
variants defined by the ordered list of diversity factors in decreasing cardinality order.
Consensus  Diversity    Dataset
function   scenario     Zoo   Iris  Wine  Glass Ionosphere WDBC  Balance MFeat miniNG Segmentation BBC  PenDigits
CSPA   |dfA| = 1    flat† flat† flat† flat† flat† flat† flat† flat† DRA flat flat –
CSPA   |dfA| = 10   DAR† flat† DAR† ADR DAR† DAR flat† ARD DAR DAR DAR –
CSPA   |dfA| = 19   ADR† ARD ADR† ADR† DAR DAR† flat† ARD ADR ADR ADR –
CSPA   |dfA| = 28   ADR† ARD† ADR† ADR† DAR ADR† ARD† ARD ADR DAR ADR –
EAC    |dfA| = 1    DRA flat† flat† flat† flat† flat† flat† flat† flat flat flat –
EAC    |dfA| = 10   DAR flat† DAR ADR DAR DAR flat† flat† flat flat flat –
EAC    |dfA| = 19   ADR† ARD ADR† ADR† DAR DAR flat† flat† flat flat flat –
EAC    |dfA| = 28   ADR† ARD ADR ADR DAR ADR flat† flat† flat flat flat –
HGPA   |dfA| = 1    DRA flat† flat† DRA DRA† DRA† flat† RAD DRA DRA DRA DRA
HGPA   |dfA| = 10   DAR ARD DAR ADR† DAR† DAR ARD ARD† DAR DAR DAR DAR
HGPA   |dfA| = 19   ADR ARD† ADR ADR† DAR† DAR ARD ARD† ADR ADR ADR ADR
HGPA   |dfA| = 28   ADR† ARD† ADR† ADR† DAR† ADR ARD ARD ADR ADR ADR ADR
MCLA   |dfA| = 1    DRA† flat† flat† flat† DRA† DRA† flat† flat† DRA DRA DRA DRA
MCLA   |dfA| = 10   DAR† ARD DAR ADR† DAR† DAR ARD ARD† DAR DAR DAR DAR
MCLA   |dfA| = 19   ADR† ARD† ADR† ADR DAR† DAR ARD† ARD† ADR ADR ADR ADR
MCLA   |dfA| = 28   ADR† ARD† ADR† ADR† DAR ADR† ARD ARD† ADR ADR ADR ADR
ALSAD  |dfA| = 1    flat† flat† flat† flat† flat† flat† flat† flat† flat flat flat –
ALSAD  |dfA| = 10   DAR flat† DAR ADR† DAR DAR flat† flat† DAR flat flat –
ALSAD  |dfA| = 19   ADR† flat† ADR ADR DAR DAR flat† flat† ADR flat flat –
ALSAD  |dfA| = 28   ADR† ARD† ADR ADR DAR ADR† flat† flat† ADR flat flat –
KMSAD  |dfA| = 1    flat† flat† DRA DRA flat† DRA† flat† RAD DRA flat flat –
KMSAD  |dfA| = 10   DAR ARD DAR† ADR DAR DAR ARD ARD DAR DAR DAR –
KMSAD  |dfA| = 19   ADR ARD† ADR† ADR DAR DAR ARD ARD ADR ADR ADR –
KMSAD  |dfA| = 28   ADR† ARD ADR† ADR DAR ADR† ARD† ARD ADR ADR ADR –
SLSAD  |dfA| = 1    DRA flat† flat† flat† flat† flat† flat† flat† flat flat flat –
SLSAD  |dfA| = 10   DAR flat† DAR ADR DAR DAR flat† flat† DAR flat flat –
SLSAD  |dfA| = 19   ADR ARD ADR ADR DAR DAR flat† flat† ADR flat flat –
SLSAD  |dfA| = 28   ADR ARD ADR† ADR DAR ARD flat† flat† ADR ADR flat –
Table 3.11: Computationally optimal consensus architectures (flat or DHCA) on the unimodal data collections assuming a fully parallel
implementation. The dagger (†) symbolizes optimal consensus architecture correct predictions.
3.4 Flat vs. hierarchical consensus<br />
In sections 3.2 and 3.3, two specific implementations of hierarchical consensus architectures
have been proposed, alongside a methodology for determining a priori which is the fastest
(random or deterministic) HCA variant and deciding whether it is computationally advantageous
with respect to classic flat consensus. In this section, we present a direct twofold
comparison between flat consensus and those DHCA and RHCA variants deemed the
most computationally efficient by the proposed running time estimation methodologies.
Firstly, we compare them in terms of computational complexity. Such a comparison
could in fact be drawn from the results presented in sections 3.2 and 3.3, but we believe that
considering only the allegedly best performing variants simplifies the process
of drawing meaningful conclusions. Secondly, these least time consuming hierarchical
consensus architecture variants are compared with flat consensus in terms of the quality
of the consensus clustering solutions they yield. By doing so, we intend to present a comprehensive
picture of our hierarchical consensus architecture proposals in terms of the two
main factors that condition robust clustering by consensus: time complexity and quality.
3.4.1 Running time comparison<br />
This section compares the real execution times of the allegedly fastest DHCA and RHCA<br />
variants and flat consensus. The experiments conducted follow the design outlined next.<br />
Experimental design<br />
– What do we want to measure? The time complexity of the allegedly fastest<br />
DHCA and RHCA variants and flat consensus.<br />
– How do we measure it? We measure the CPU time required for the execution of<br />
the aforementioned consensus architectures.<br />
– How are the experiments designed? The comparison comprises ten independent
runs of each of the compared consensus architectures. So
as to evaluate their computational efficiency under distinct experimental conditions,<br />
the consensus processes involved have been conducted by means of the seven consensus<br />
functions for hard cluster ensembles employed in this work —see appendix A.5.<br />
Moreover, experiments have been replicated on the four diversity scenarios described<br />
in appendix A.4 —recall that they differ in the algorithmic diversity factor, as a set of<br />
|dfA| = {1, 10, 19, 28} randomly chosen clustering algorithms are employed for creating<br />
the cluster ensemble in each diversity scenario.<br />
– How are results presented? The measured execution times are
presented by means of boxplot charts, so as to give the reader a notion
of the degree of dispersion and asymmetry of the running times of each consensus
architecture. When comparing boxplots, notice that non-overlapping box notches
indicate that the medians of the compared running times differ at the 5% significance
level, which allows a quick inference of the statistical significance of the results.
– Which data sets are employed? A detailed description of the results of this<br />
comparison on the Zoo data collection is presented in the following paragraphs. Recall<br />
[Boxplot charts omitted: CPU time (sec.) of the RHCA, DHCA and flat consensus architectures
for each consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD);
panel (a) shows the serial implementation running time, panel (b) the parallel implementation running time.]
Figure 3.15: Running times of the computationally optimal RHCA, DHCA and flat consensus
architectures on the Zoo data collection for the diversity scenario corresponding to
a cluster ensemble of size l = 57.
that the cardinalities of the dimensional and representational diversity factors of this
data set are |dfD| = 14 and |dfR| = 5, respectively. For brevity reasons, the results
obtained on the eleven remaining unimodal data sets are described in detail in appendix C.4.
However, at the end of this section, the running times of the three compared consensus
architectures, measured across the experiments conducted on the twelve unimodal
data collections employed in this work, are compiled and compared. The goal of such
comparison is to analyze whether any of the consensus architectures is inherently faster
than the rest.
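The notch-based median comparison described in the experimental design can be sketched with the common median ± 1.57 · IQR/√n notch heuristic (the one used by standard boxplot implementations); the function names are ours:

```python
import statistics

def notch_interval(samples):
    """Boxplot notch around the median: median +/- 1.57 * IQR / sqrt(n)."""
    s = sorted(samples)
    q1, _, q3 = statistics.quantiles(s, n=4)
    half = 1.57 * (q3 - q1) / len(s) ** 0.5
    med = statistics.median(s)
    return med - half, med + half

def medians_differ(runs_a, runs_b):
    """Non-overlapping notches indicate that the medians of the compared
    running times differ at roughly the 5% significance level."""
    lo_a, hi_a = notch_interval(runs_a)
    lo_b, hi_b = notch_interval(runs_b)
    return hi_a < lo_b or hi_b < lo_a
```

For instance, ten runs of one architecture clustered around 1.1 s and ten runs of another around 6.1 s yield non-overlapping notches, so their median running times differ significantly.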
Diversity scenario |dfA| = 1
The running times of the estimated computationally optimal serial and parallel DHCA and<br />
RHCA implementations and flat consensus architectures in the lowest diversity scenario are<br />
presented in figure 3.15.<br />
As regards the fully serial implementation (figure 3.15(a)), it can be observed that flat<br />
consensus is 1.2 to 5 times faster than the fastest hierarchical consensus variant regardless<br />
89
3.4. Flat vs. hierarchical consensus<br />
of the consensus function employed, and that such differences are, in all cases, statistically<br />
significant. As far as the hierarchical consensus architectures are concerned, notice that the<br />
fastest DHCA variant (DRA) is more computationally efficient than its RHCA counterpart<br />
(which has s = 2 stages and mini-ensembles of size b = 28), except when consensus processes<br />
are conducted by means of the EAC and SLSAD consensus functions —in those cases, statistically
equivalent running times are attained by both HCAs. If these results are contrasted
with the predicted computationally optimal consensus architectures presented in tables 3.4<br />
and 3.10, a single prediction error is detected (flat consensus turns out to be faster than the<br />
DRA DHCA variant when consensus is conducted by MCLA), which reinforces the notion<br />
that the proposed computationally optimal consensus architecture prediction methodology<br />
performs pretty well.<br />
Figure 3.15(b) presents the running times of the fully parallel optimal hierarchical consensus
architectures and flat consensus. As in the serial case, it can be noticed that flat
consensus tends to be more efficient than RHCA and DHCA. The only exception occurs
when consensus is conducted by means of the MCLA consensus function —which is due
to the fact that it is the only combiner whose computational complexity increases
quadratically with the size of the cluster ensemble. Last but not least, it is worth noting that,
as opposed to what was observed in the serial implementation, the fastest RHCA is less
time consuming than the most efficient DHCA variant, and the running time differences
between them are statistically significant. The reason lies in the fact that this
specific RHCA variant has s = 2 stages and consensus is conducted on mini-ensembles of
size b = 7, whereas the DHCA variant consists of three consensus stages, in one of which
consensus is built on larger mini-ensembles of size |dfD| = 14, which is responsible for the
higher computational cost of parallel DHCA in this case.
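The serial/parallel accounting behind this reasoning can be sketched as follows; the per-consensus cost model and the exact stage layouts are illustrative assumptions, not measured figures:

```python
def serial_runtime(stages, cost):
    """Fully serial implementation: every consensus process of every
    stage runs one after another."""
    return sum(cost(b) for stage in stages for b in stage)

def parallel_runtime(stages, cost):
    """Fully parallel implementation: the processes within a stage run
    concurrently, so a stage lasts as long as its consensus on the
    largest mini-ensemble; the stages themselves remain sequential."""
    return sum(max(cost(b) for b in stage) for stage in stages)

# Hypothetical quadratic cost per consensus on a mini-ensemble of size b
cost = lambda b: b ** 2

# Illustrative stage layouts (not the exact ones of the experiments):
# an RHCA with s = 2 stages of mini-ensembles of size b = 7, and a
# 3-stage DHCA whose middle stage uses mini-ensembles of size |dfD| = 14
rhca = [[7] * 9, [7, 7]]
dhca = [[1] * 60, [14] * 5, [5]]

# In parallel, the DHCA stage with b = 14 dominates, making it slower
faster_parallel = parallel_runtime(rhca, cost) < parallel_runtime(dhca, cost)
```

Under this toy model the parallel runtime of each architecture is governed by its largest mini-ensemble per stage, which is precisely why the b = 7 RHCA outpaces the DHCA containing a |dfD| = 14 stage.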
Diversity scenario |dfA| = 10
The results corresponding to the experiments conducted in the second diversity scenario
(i.e. cluster ensembles generated by the compilation of the clusterings output by |dfA| = 10
randomly selected clustering algorithms, giving rise to cluster ensembles of size l = 570) are
presented in figure 3.16. In particular, figure 3.16(a) depicts the execution time boxplots
of the serial implementation of hierarchical consensus architectures. The first noticeable
observation is that, in contrast to what was observed in the lowest diversity scenario, the
computationally optimal RHCA variant (s = 2 and b = 20 or b = 285, depending on the
consensus function employed) is faster than its DHCA counterpart (DAR). This is due to the
fact that the rise of the algorithmic diversity factor (from |dfA| = 1 to |dfA| = 10) entails an
increase of the computational cost of one of the DHCA stages that exceeds the increment
of the complexity of the RHCA caused by the same factor. Meanwhile, regarding the
computational efficiency of flat consensus, two opposed behaviours are observed depending
on the consensus function employed: while faster than any hierarchical architecture
when consensus is built using the CSPA and EAC consensus functions, one-step consensus
is slower when the remaining clustering combiners are employed. Moreover, the differences
between the running times of these consensus architectures are statistically significant at the
5% significance level in all cases.
Figure 3.16(b) presents the results corresponding to the fully parallel implementation<br />
of consensus architectures. In this case, flat consensus is at least four times more computationally<br />
costly than the DHCA and RHCA variants. Moreover, the optimal RHCA<br />
[Boxplot charts omitted: CPU time (sec.) of the RHCA, DHCA and flat consensus architectures
for each consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD);
panel (a) shows the serial implementation running time, panel (b) the parallel implementation running time.]
Figure 3.16: Running times of the computationally optimal RHCA, DHCA and flat consensus
architectures on the Zoo data collection for the diversity scenario corresponding to
a cluster ensemble of size l = 570.
is faster than its DHCA counterpart (except for the MCLA consensus function), although
both attain very similar execution times, their differences being statistically significant
at just below the 5% significance level.
Diversity scenario |dfA| = 19
The running times of the consensus architectures corresponding to the experiments conducted
on the third diversity scenario (i.e. cluster ensembles of size l = 1083) are presented next.
Figure 3.17(a) depicts the execution time boxplots of the serially implemented consensus
architectures. In most cases, hierarchical consensus architectures are faster than their flat
counterpart —the only exception occurs when consensus is built using the EAC consensus
function, a trend that was already observed in sections 3.2 and 3.3. Notice, moreover,
that flat consensus is not executable when MCLA is the consensus function employed for
creating the consensus clustering solutions.
When the entirely parallel implementation of hierarchical consensus architectures is evaluated<br />
from a computational viewpoint, the optimal RHCA and DHCA variants (see tables<br />
[Boxplot charts omitted: CPU time (sec.) of the RHCA, DHCA and flat consensus architectures
for each consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD);
panel (a) shows the serial implementation running time, panel (b) the parallel implementation running time.]
Figure 3.17: Running times of the computationally optimal RHCA, DHCA and flat consensus
architectures on the Zoo data collection for the diversity scenario corresponding to
a cluster ensemble of size l = 1083.
3.5 and 3.11) are at least eight times faster than flat consensus –see figure 3.17(b)– attaining
very similar execution times (which are statistically equivalent when the HGPA, ALSAD and
SLSAD consensus functions are employed). This is quite logical, provided that the DHCA has 3
stages, building consensus on mini-ensembles of sizes |dfA| = 19, |dfD| = 14, and |dfR| = 5,
while the fastest RHCA has two or three stages (depending on the consensus function employed)
where consensus is built on mini-ensembles of size b = 26 or b = 27. At the end of
the day, the number of stages and the mini-ensemble sizes of DHCA and RHCA are counterbalanced,
yielding, as mentioned earlier, fairly similar running times. Notice, however,
that when consensus is built using the MCLA consensus function, RHCA is penalized
with respect to DHCA, given the larger size of its mini-ensembles and the aforementioned
quadratic dependence of this consensus function's running time on this factor.
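Under a quadratic cost model, the penalty incurred by the larger RHCA mini-ensembles can be illustrated with the stage sizes quoted above (a toy computation, not a measured result):

```python
# Hypothetical quadratic consensus cost, dominated per stage by the
# largest mini-ensemble in a parallel implementation
cost = lambda b: b ** 2

dhca_stage_sizes = [19, 14, 5]   # |dfA|, |dfD|, |dfR| on Zoo
rhca_stage_sizes = [26, 26]      # two stages with b = 26 (illustrative)

dhca_cost = sum(cost(b) for b in dhca_stage_sizes)  # 361 + 196 + 25 = 582
rhca_cost = sum(cost(b) for b in rhca_stage_sizes)  # 676 + 676 = 1352
```

Even though the RHCA has fewer stages, its larger mini-ensembles dominate once the combiner's cost grows quadratically with the ensemble size.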
Diversity scenario |dfA| = 28
A very similar behaviour to the one just reported is observed when the size of the cluster<br />
ensembles is increased. Indeed, in the highest diversity scenario, i.e. the one corresponding<br />
to the use of the |dfA| = 28 clustering algorithms of the CLUTO toolbox for creating cluster<br />
[Boxplot charts omitted: CPU time (sec.) of the RHCA, DHCA and flat consensus architectures
for each consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD);
panel (a) shows the serial implementation running time, panel (b) the parallel implementation running time.]
Figure 3.18: Running times of the computationally optimal RHCA, DHCA and flat consensus
architectures on the Zoo data collection for the diversity scenario corresponding to
a cluster ensemble of size l = 1596.
ensembles of size l = 1596, almost identical running time boxplot charts are obtained —see<br />
figure 3.18.<br />
As mentioned above, this same experiment has been conducted on the eleven remaining
unimodal data collections, and the corresponding execution time boxplot charts are
presented in appendix C.4. From their analysis, the following conclusions have been drawn:
– the predictions regarding the computational optimality of flat or hierarchical consensus
architectures are made with a high degree of accuracy (in fact, average prediction
accuracies of 93.45% and 91.07% are obtained across our experiments in the serial
and parallel implementation cases, respectively).
– flat consensus is the most efficient consensus architecture when the EAC consensus<br />
function is employed, regardless of the diversity scenario or whether the serial or<br />
parallel implementation of hierarchical architectures is employed.<br />
– in large datasets, only HGPA and MCLA are executable, as the time and space<br />
complexities of the remaining consensus functions scale quadratically with the number<br />
of objects in the collection, since they employ object co-association matrices as a basic<br />
element of their consensus processes.<br />
– regarding which type of hierarchical consensus architecture (RHCA or DHCA) is<br />
more computationally efficient, little can be said a priori, as it depends on the specific<br />
configurations of the hierarchical architectures, i.e. their number of stages and the<br />
sizes of the mini-ensembles.<br />
– using the MCLA consensus function penalizes those architectures with large mini-ensembles,<br />
as its time complexity depends quadratically on this factor.<br />
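The quadratic scaling of the co-association-based consensus functions mentioned above can be made concrete with a small sketch. The following is an illustrative, minimal accumulation of an EAC-style co-association matrix (names are illustrative; this is not the implementation used in this thesis). For n objects it stores and updates O(n²) entries, which is what rules these consensus functions out on large datasets:

```python
from itertools import combinations

def coassociation_matrix(ensemble, n):
    """Accumulate pairwise co-association evidence from a cluster ensemble.

    ensemble: list of label vectors (one per ensemble component), each a
              length-n sequence of cluster ids.
    Returns an n x n matrix whose (i, j) entry is the fraction of ensemble
    components that place objects i and j in the same cluster. Note the
    O(n^2) storage, independent of the ensemble size.
    """
    coassoc = [[0.0] * n for _ in range(n)]
    for labels in ensemble:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                coassoc[i][j] += 1.0
                coassoc[j][i] += 1.0
    l = len(ensemble)
    for i in range(n):
        coassoc[i][i] = 1.0
        for j in range(n):
            if i != j:
                coassoc[i][j] /= l
    return coassoc
```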
Comparison across diversity scenarios and data collections<br />
Aiming to reveal the existence of any global computational superiority pattern between<br />
the two hierarchical consensus architecture variants, besides confirming the hypotheses put<br />
forward earlier, we have compiled the real running times of the assumedly computationally<br />
optimal RHCA and DHCA variants and of flat consensus in each diversity scenario using<br />
each consensus function, across the twelve data collections employed in these experiments.<br />
This process has been replicated for both the fully serial and parallel implementations<br />
of hierarchical consensus architectures, and as a result, the running time boxplot charts<br />
presented in figures 3.19 and 3.20 have been obtained.<br />
First, figure 3.19 presents the running times corresponding to the entirely serial<br />
implementation of the computationally optimal RHCA and DHCA variants and flat consensus.<br />
Notice the notable height of the boxes in the boxplots, caused by the fact that they<br />
represent running times of the consensus architectures across data collections with fairly<br />
distinct characteristics (i.e. number of objects and clusters). Nevertheless, the focus of this<br />
analysis should be placed on detecting relative differences between the boxes corresponding<br />
to the three consensus architectures. In general terms, it can be observed how flat consensus<br />
becomes gradually slower than hierarchical consensus as the size of the cluster ensembles<br />
grows (i.e. as the cardinality of the algorithmic diversity factor |dfA| is increased). As reported<br />
earlier, consensus architectures based on the EAC consensus function constitute the<br />
only exception to this rule. As regards the comparison between the running times of RHCA<br />
and DHCA, the most significant differences are observed when the ALSAD and SLSAD<br />
consensus functions are employed —in these cases, the random HCA variants are faster<br />
than their deterministic counterparts, a trend that becomes more apparent as the cluster<br />
ensembles size grows. Finally, notice that, in absolute terms, consensus architectures based<br />
on the HGPA, MCLA and CSPA consensus functions are faster than those employing the<br />
EAC, ALSAD, KMSAD and SLSAD clustering combiners.<br />
And secondly, the execution time boxplots corresponding to flat consensus and the fully<br />
parallel implementation of consensus architectures are depicted in figure 3.20. As in the serial<br />
case, the use of the HGPA, MCLA and CSPA consensus functions gives rise, in general,<br />
to faster consensus architectures than when consensus clustering solutions are generated by<br />
means of the EAC, ALSAD, KMSAD and SLSAD clustering combiners. Moreover, the superiority<br />
of hierarchical architectures over flat consensus becomes manifest in diversity<br />
scenarios with |dfA| ≥ 10, depending on the consensus function employed. As regards the<br />
comparison between RHCA and DHCA, a wide spectrum of behaviours is observed. When<br />
consensus is built upon the CSPA, EAC, HGPA and KMSAD consensus functions, few significant<br />
differences between the two hierarchical architectures are detected. Meanwhile, RHCA<br />
[Figure: boxplots of CPU time (sec.), log scale, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), each comparing RHCA, DHCA and flat consensus; panels (a)-(d): serial implementation running times for |dfA| = 1, 10, 19 and 28.]<br />
Figure 3.19: Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures across all data collections for the diversity scenarios corresponding<br />
to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.<br />
tends to be a bit quicker to execute than DHCA in high diversity scenarios, although<br />
consensus architectures based on the MCLA consensus function are penalized by the fact<br />
that the mini-ensemble sizes employed in the fastest random architecture variants are<br />
usually larger than those employed in DHCA. And last, as already perceived in the serial case, RHCA<br />
tends to be slightly more efficient than DHCA when clustering combination is conducted by<br />
means of consensus functions based on treating hierarchical clustering similarity measures<br />
as data (i.e. ALSAD and SLSAD).<br />
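The trade-off between mini-ensemble size and number of stages can be illustrated with a rough sketch (illustrative names, not the thesis code) that counts the ensemble size at each stage of a hierarchical consensus architecture in which every stage groups the surviving clusterings into mini-ensembles of at most b components:

```python
import math

def stage_sizes(l, b):
    """Ensemble sizes at successive stages of a hierarchical consensus
    architecture combining mini-ensembles of at most b clusterings.

    Each stage replaces every mini-ensemble by a single intermediate
    consensus clustering, so the ensemble shrinks by a factor of roughly
    b per stage until a single consensus solution remains.
    """
    if b < 2:
        raise ValueError("mini-ensembles must contain at least 2 clusterings")
    sizes = [l]
    while sizes[-1] > 1:
        sizes.append(math.ceil(sizes[-1] / b))
    return sizes
```

For instance, `stage_sizes(1596, 40)` yields `[1596, 40, 1]`, i.e. two consensus stages. Larger mini-ensembles mean fewer stages but, for a consensus function such as MCLA whose cost grows quadratically with the mini-ensemble size, each stage becomes more expensive.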
To sum up, we can conclude that there exists a strong dependence among<br />
the computationally optimal type of consensus architecture, the size of the cluster ensemble<br />
upon which consensus is built and the consensus function employed. From a practical<br />
standpoint, when facing a specific consensus clustering problem (i.e. a cluster ensemble of a<br />
given size l and a particular computational resources configuration), the user should take<br />
into account how these factors interact when deciding which type of consensus<br />
architecture is to be implemented. However, this decision should not be made on<br />
computational efficiency grounds alone. It should also take into account the quality of the consensus<br />
clustering solution obtained, as quickly obtaining a poor consensus data grouping<br />
would be of little use in practice. For this reason, the next section evaluates the quality of<br />
the consensus label vectors output by the same consensus architectures that have just been<br />
analyzed in computational terms.<br />
3.4.2 Consensus quality comparison<br />
In this section, we evaluate the quality of the consensus clustering solutions yielded by the<br />
fastest DHCA and RHCA variants and flat consensus architectures, which constitutes an<br />
indicator of their suitability for conducting robust clustering. The experiments conducted<br />
to this end follow the design described next.<br />
Experimental design<br />
– What do we want to measure?<br />
i) The suitability of the allegedly fastest DHCA and RHCA variants and flat consensus<br />
for obtaining clustering results robust to the inherent indeterminacies of<br />
clustering.<br />
ii) A further goal of this section is to determine whether certain consensus architectures<br />
tend to outperform others as regards the quality of the consensus clusterings<br />
they obtain.<br />
– How do we measure it?<br />
i) We analyze the quality of the consensus clustering solutions obtained by these<br />
consensus architectures, comparing it with that of the individual clusterings contained<br />
in the cluster ensemble E upon which consensus is built. The more similar<br />
the qualities of the consensus clustering solution and the top-quality cluster ensemble<br />
components, the higher the robustness to the clustering indeterminacies is<br />
attained. As mentioned in section 1.2.2, in this work we evaluate clustering<br />
solutions by means of an external cluster validity index, i.e. we compare the consensus<br />
clustering solution embodied in the labeling vector λc with a predefined<br />
[Figure: boxplots of CPU time (sec.), log scale, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), each comparing RHCA, DHCA and flat consensus; panels (a)-(d): parallel implementation running times for |dfA| = 1, 10, 19 and 28.]<br />
Figure 3.20: Running times of the computationally optimal parallel RHCA, DHCA and flat<br />
consensus architectures across all data collections for the diversity scenarios corresponding<br />
to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.<br />
and allegedly correct cluster structure (or ground truth), measuring their degree<br />
of resemblance in terms of normalized mutual information, φ (NMI) —recall that<br />
φ (NMI) ∈ [0, 1] and the higher its value, the better the quality of the consensus<br />
clustering solution. We measure the percentage of experiments and the relative<br />
φ (NMI) differences between the consensus clusterings and the cluster ensemble<br />
components of maximum and median φ (NMI) score.<br />
ii) We compare the φ (NMI) scores of the consensus clusterings obtained by the three<br />
consensus architectures subject to evaluation.<br />
– How are the experiments designed? The experimental methodology followed is<br />
the same as when the computational efficiency of consensus architectures was analyzed<br />
in the previous sections. That is, the consensus quality comparison has been<br />
conducted on the four diversity scenarios described in appendix A.4. In each diversity<br />
scenario, ten independent experiments have been conducted using the seven consensus<br />
functions for hard cluster ensembles employed in this work (CSPA, EAC, HGPA,<br />
MCLA, ALSAD, KMSAD and SLSAD). From a formal viewpoint, the φ (NMI) values<br />
of the consensus clustering solutions obtained in the 10 experiments corresponding to<br />
each consensus function and diversity scenario are presented.<br />
– How are results presented? In formal terms, the measured φ (NMI) values are<br />
presented by means of boxplot charts. By doing so, we can see the quality scatter<br />
of each consensus function and architecture. Again, non-overlapping box notches<br />
indicate that the medians of the compared φ (NMI) values differ at the 5% significance level.<br />
– Which data sets are employed? In this section, we present in detail the results<br />
obtained on the Zoo data collection —for the sake of brevity, the results obtained in<br />
the remaining eleven unimodal data sets are deferred to appendix C.4. However, at<br />
the end of this section, the φ (NMI) scores of the three compared consensus architectures<br />
measured across the experiments conducted on the twelve unimodal data collections<br />
employed in this work are compiled and compared. The goal of such comparison is<br />
to analyze whether any of the consensus architectures tends to yield better consensus<br />
clustering solutions than the rest.<br />
One last remark before proceeding: notice that the only differences between serial and parallel<br />
hierarchical consensus architectures refer to their time complexity, not the quality of the<br />
consensus clustering solutions they yield. For this reason, no distinction between serial and<br />
parallel architectures is made in this section.<br />
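For reference, φ (NMI) can be computed as in the following minimal sketch, which implements the standard mutual information between two partitions normalized by the geometric mean of their entropies (the normalization used by Strehl and Ghosh); it is an illustrative implementation, not the thesis code:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same
    objects, normalized by the geometric mean of the two partition
    entropies so that the score lies in [0, 1] (1 = identical partitions
    up to cluster renaming)."""
    n = len(labels_a)
    assert n == len(labels_b)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information: sum over joint cells of p(a,b) log(p(a,b)/(p(a)p(b)))
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * math.log(n_ab * n / (ca[a] * cb[b]))
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:
        # Degenerate single-cluster partitions: identical or uninformative.
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)
```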
Diversity scenario |dfA| = 1<br />
Firstly, the φ (NMI) values of the estimated optimal serial and parallel DHCA and RHCA<br />
implementations and flat consensus architectures in the lowest diversity scenario are presented<br />
in figure 3.21. Each chart presents four boxes that, from left to right, represent the<br />
φ (NMI) values of the components of the cluster ensemble E, and of the consensus clustering<br />
solutions output by the RHCA, DHCA and flat consensus architectures, respectively. It<br />
can be observed that the three consensus architectures yield, in general, fairly similar quality<br />
consensus solutions (in fact, the differences between them are statistically non-significant<br />
at the 5% level for the CSPA, MCLA, ALSAD and KMSAD consensus functions). The<br />
[Figure: boxplots of φ (NMI) for the cluster ensemble E and the RHCA, DHCA and flat consensus solutions, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure 3.21: φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity<br />
scenario corresponding to a cluster ensemble of size l = 57.<br />
largest inter-consensus architecture deviations are found when consensus clustering is based<br />
on the HGPA consensus function, as the notches of the corresponding φ (NMI) boxes are far<br />
from overlapping. If the consensus functions are compared in terms of the quality of the<br />
clustering solutions they yield, it can be observed that the best results are obtained by the<br />
EAC, ALSAD, KMSAD and SLSAD consensus functions. In these cases, the medians of<br />
the consensus solutions output by the consensus architectures are better than 75% of<br />
the components of the cluster ensemble E, which denotes a notable level of robustness to<br />
the clustering indeterminacies.<br />
Diversity scenario |dfA| = 10<br />
Secondly, the quality of the consensus clustering solutions output by the consensus architectures<br />
in the experiments conducted on the diversity scenario with cluster ensembles of<br />
size l = 570 is presented in figure 3.22. The trends detected in the<br />
previous diversity scenario are largely confirmed in this case. That is, those consensus<br />
functions based on evidence accumulation (EAC) and object similarity as data (ALSAD,<br />
KMSAD and SLSAD) yield the best quality consensus clustering solutions, and they show<br />
a high degree of independence with respect to the topology of the consensus architecture.<br />
In fact, EAC and SLSAD based consensus architectures give rise to consensus clusterings<br />
which are better than 80% of the cluster ensemble components, which again reveals the<br />
ability of these consensus functions to attain clustering solutions robust to the uncertainties<br />
inherent to clustering. In contrast, consensus labelings obtained by hypergraph-based<br />
consensus functions (CSPA, HGPA and MCLA) attain lower φ (NMI) values, while showing<br />
a larger quality variability (this only applies to the HGPA and MCLA consensus functions).<br />
Diversity scenario |dfA| = 19<br />
The results corresponding to the experiments conducted in the third diversity scenario (i.e.<br />
cluster ensembles generated by the compilation of the clusterings output by |dfA| =19<br />
randomly selected clustering algorithms, giving rise to cluster ensembles of size l = 1083)<br />
are presented in figure 3.23. The behaviour detected in the previous diversity scenarios<br />
is also found in this case. Again, the largest inter-consensus architecture variations are<br />
[Figure: boxplots of φ (NMI) for the cluster ensemble E and the RHCA, DHCA and flat consensus solutions, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure 3.22: φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity<br />
scenario corresponding to a cluster ensemble of size l = 570.<br />
[Figure: boxplots of φ (NMI) for the cluster ensemble E and the RHCA, DHCA and flat consensus solutions, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure 3.23: φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity<br />
scenario corresponding to a cluster ensemble of size l = 1083.<br />
observed when consensus is built using the HGPA consensus function. In the remaining<br />
cases, much smaller deviations are found (with the possible exception of ALSAD, whose<br />
dispersions are nonetheless smaller than those of HGPA) —in fact, statistically non-significant<br />
differences between the three consensus architectures are observed in the EAC, MCLA,<br />
KMSAD and SLSAD based architectures.<br />
Diversity scenario |dfA| = 28<br />
And last, the φ (NMI) values of the consensus clustering solutions yielded by the hierarchical<br />
and flat consensus architectures corresponding to the experiments conducted on the<br />
highest diversity scenario (i.e. cluster ensembles of size l = 1596) are presented in figure<br />
3.24. This scenario is ideal for analyzing the variability of the quality of the consensus<br />
clustering solutions output by the distinct consensus functions, as exactly the same cluster<br />
ensemble has been employed in the ten experiments analyzed —in contrast, in the previous<br />
diversity scenarios, the cluster ensemble employed in each one of the ten experiments was<br />
created by compiling the clustering components generated by |dfA| = {1, 10, 19} randomly<br />
picked clustering algorithms (that is, two superimposed randomness factors underlie the<br />
boxplots presented in figures 3.21 to 3.23). In this sense, it can be observed that, for any<br />
[Figure: boxplots of φ (NMI) for the cluster ensemble E and the RHCA, DHCA and flat consensus solutions, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure 3.24: φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Zoo data collection for the diversity<br />
scenario corresponding to a cluster ensemble of size l = 1596.<br />
given consensus architecture, the HGPA, MCLA and KMSAD consensus functions have the<br />
highest quality variability, due to the existence of some random underlying process in their<br />
consensus generation procedures (e.g. the random initialization of k-means in the KMSAD<br />
consensus function). In contrast, the qualities of the consensus clusterings output by<br />
those consensus architectures based on CSPA, EAC, ALSAD and SLSAD show very small<br />
(or even null) variations. As regards the inter-consensus architecture quality divergences,<br />
those based on the HGPA and ALSAD consensus functions show the most disparate results,<br />
whereas in other cases (e.g. CSPA, EAC, MCLA or KMSAD) statistically equivalent qualities<br />
are yielded by the three consensus architectures. And last, as far as the robustness of<br />
the consensus clustering solutions is concerned, notice that the EAC, ALSAD, KMSAD and<br />
SLSAD based consensus architectures yield the highest quality clustering results, getting<br />
pretty close to the top-quality component of the cluster ensemble E, being, in most cases,<br />
better than 75% of the clusterings contained in it.<br />
Comparison across diversity scenarios and data collections<br />
So as to provide the reader with a global comparative view of the consensus architectures in<br />
terms of the quality of the consensus clustering solutions they yield, we have compiled the<br />
φ (NMI) values obtained across all the experiments conducted on the twelve unimodal data<br />
collections in each diversity scenario, representing them in the boxplots depicted in figure<br />
3.25. Recall that, when comparing boxplots, non-overlapping box notches indicate that<br />
the medians of the compared magnitudes differ at the 5% significance level.<br />
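The notch criterion used throughout these comparisons can be reproduced as follows. A common convention (used, e.g., by MATLAB's boxplot) places the notch at median ± 1.57·IQR/√n, so that non-overlapping notches suggest a median difference at roughly the 5% significance level. The sketch below uses illustrative names and this conventional formula:

```python
import math
import statistics

def notch_interval(sample):
    """Boxplot notch interval: median +/- 1.57 * IQR / sqrt(n), the
    conventional approximate 95% comparison interval for the median."""
    s = sorted(sample)
    q1, _, q3 = statistics.quantiles(s, n=4)  # quartiles (exclusive method)
    half = 1.57 * (q3 - q1) / math.sqrt(len(s))
    med = statistics.median(s)
    return med - half, med + half

def notches_overlap(sample_a, sample_b):
    """True when the notch intervals overlap, i.e. the median difference
    is NOT significant at the ~5% level under this heuristic."""
    lo_a, hi_a = notch_interval(sample_a)
    lo_b, hi_b = notch_interval(sample_b)
    return lo_a <= hi_b and lo_b <= hi_a
```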
A twofold qualitative analysis can be made in view of these results. The first aspect of<br />
study is an intra-consensus function comparison among consensus architectures. A quick<br />
inspection of any of the rows of figure 3.25 reveals that the optimality of consensus architectures<br />
is a property that is local to the consensus function applied. When the clustering<br />
combination process is based on the CSPA consensus function, the three consensus architectures<br />
yield pretty similar quality consensus solutions (as the boxes have a notable overlap),<br />
although DHCA tends to attain slightly higher φ (NMI) values —a similar pattern is observed<br />
in the boxplots presented in the column corresponding to the EAC consensus function. In<br />
contrast, flat consensus architectures yield higher quality consensus than their hierarchical<br />
counterparts when they are based on the HGPA clustering combiner. The analysis of the<br />
results obtained when the MCLA consensus function is employed as the basis of consensus<br />
architectures must be made with care, as flat consensus is not executable when it must<br />
be conducted on large cluster ensembles. For this reason, the boxplots presented in the<br />
MCLA column only reflect the φ (NMI) values corresponding to those experiments where the<br />
three consensus architectures are executable. In these cases, the best consensus clustering<br />
solutions are obtained by the flat and DHCA consensus architectures. A greater degree of<br />
evenness between consensus architectures is observed in the consensus functions that treat<br />
object similarity as object features, i.e. ALSAD, KMSAD and SLSAD —notice the large<br />
overlap between boxes. However, whereas DHCA yields slightly lower quality consensus<br />
clustering solutions than the RHCA and flat consensus architectures when the ALSAD and<br />
KMSAD consensus functions are employed, it is the flat consensus approach that attains<br />
the lowest φ (NMI) values among the SLSAD based consensus architectures.<br />
And secondly, if an inter-consensus functions comparison is conducted, we can conclude<br />
that the excellent performance of the EAC consensus function on the Zoo data collection<br />
apparently constitutes an exception to the rule, as –together with HGPA and SLSAD–<br />
it yields the lowest φ (NMI) values (i.e. the poorest consensus clustering solutions) when<br />
the results obtained across all the data sets and diversity scenarios are compiled. In contrast,<br />
CSPA, MCLA, ALSAD and KMSAD tend to yield comparatively better consensus<br />
clustering solutions in a global perspective.<br />
From a more quantitative perspective, we have compared the quality of the consensus clustering solutions yielded by the three consensus architectures with that of the components of the cluster ensemble upon which consensus is conducted. In particular, this comparison has taken into account the cluster ensemble components of median and maximum φ (NMI) with respect to the ground truth (referred to as the median ensemble component, or MEC, and best ensemble component, or BEC, respectively). This comparison makes sense inasmuch as we regard consensus clustering as a means of achieving robustness to the inherent indeterminacies that affect the clustering problem. More specifically, the higher the φ (NMI) of the consensus clustering solution with respect to that of the cluster ensemble components, the higher the robustness achieved. The median and maximum φ (NMI) components are used as a summarized reference of the quality of the cluster ensemble contents.
For this reason, we have evaluated i) the percentage of experiments in which the consensus<br />
clustering solution attains a higher φ (NMI) than the MEC and the BEC, and ii)<br />
the relative percentage φ (NMI) variation between the median and the best cluster ensemble<br />
components and the consensus clustering solution.<br />
As regards the first issue, table 3.12 presents the percentage of experiments (considering all data sets in the highest diversity scenario) where the consensus clustering solution is better than the median normalized mutual information cluster ensemble component (MEC). It can be observed that, on average, the consensus clustering solution is better than the MEC in 53.1% of the experiments, which indicates that in more than half of the experiments, consensus yields a clustering solution better than the one located halfway among the cluster ensemble components. When the relative percentage φ (NMI) gains between the consensus clustering solutions and the MEC are computed, a reasonable average gain of 59% is obtained —see table 3.13 for a detailed presentation of the results per consensus function and consensus architecture.
If such comparison is referred to the cluster ensemble component that best describes the<br />
group structure of the data in terms of normalized mutual information with respect to the<br />
[Figure: four rows of boxplots, one per diversity scenario (a) |dfA| = 1, (b) |dfA| = 10, (c) |dfA| = 19 and (d) |dfA| = 28; each row contains one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), with φ (NMI) on the vertical axis (0 to 1) and one box per consensus architecture (RHCA, DHCA and flat).]
Figure 3.25: φ (NMI) of the consensus solutions obtained by the computationally optimal<br />
parallel RHCA, DHCA and flat consensus architectures across all data collections for the<br />
diversity scenarios corresponding to the use of |dfA| = {1, 10, 19 and 28} clustering algorithms.<br />
Consensus architecture   CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
flat                     58.3   25     42.3   33.2   66.7    69.8    41.7
RHCA                     69.8   24.7   15.9   74.2   79.1    79.1    50.4
DHCA                     83.3    8.3    3.9   77.1   75      77.9    58.3
Table 3.12: Percentage of experiments in which the consensus clustering solution is better<br />
than the median cluster ensemble component.<br />
given ground truth (i.e. the BEC), we observe that the consensus clustering solution attains higher φ (NMI) values in only 0.1% of the experiments —see table 3.14. If the degree of improvement of those consensus clustering solutions that attain a higher φ (NMI) than the BEC is measured in terms of relative percentage φ (NMI) increase, a modest 0.6% φ (NMI) gain is obtained on average —see table 3.15 for a detailed view across consensus functions and architectures.
Consensus architecture   CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
flat                     90     12.9   80.2   50.1   107.1   87.7    24.6
RHCA                     78.8   16.3    9.2   96.4    94.7   90.6    33.3
DHCA                     73.7   11.6    5.6   53.1    83.7   72.2    67.6
Table 3.13: Relative percentage φ (NMI) gain between the consensus clustering solution and<br />
the median cluster ensemble component.<br />
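The averages quoted in the text can be recomputed, up to rounding, from the tables themselves. The following snippet is a sanity check only (values transcribed from tables 3.12 and 3.13), not part of the original experimental code:

```python
from statistics import mean

# Rows of tables 3.12 (win % vs. the MEC) and 3.13 (% phi(NMI) gain vs. the
# MEC), flattened in the order flat / RHCA / DHCA:
wins_vs_mec = [58.3, 25, 42.3, 33.2, 66.7, 69.8, 41.7,
               69.8, 24.7, 15.9, 74.2, 79.1, 79.1, 50.4,
               83.3, 8.3, 3.9, 77.1, 75, 77.9, 58.3]
gain_vs_mec = [90, 12.9, 80.2, 50.1, 107.1, 87.7, 24.6,
               78.8, 16.3, 9.2, 96.4, 94.7, 90.6, 33.3,
               73.7, 11.6, 5.6, 53.1, 83.7, 72.2, 67.6]

avg_wins = mean(wins_vs_mec)   # approx. 53, matching the 53.1% quoted in the text
avg_gain = mean(gain_vs_mec)   # approx. 59, matching the quoted average gain
```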
As a conclusion, we can see that the application of consensus clustering processes on a collection of partitions of a given data collection provides a means for obtaining a summarized clustering that, although rarely better than the best component available in the cluster ensemble, is quite often noticeably better than the median data partition. However, despite these fairly good results, we aim to obtain clustering solutions more robust to the inherent indeterminacies of clustering (i.e. closer to, or even better than, the maximum quality cluster ensemble component). For this reason, chapter 4 introduces what we call consensus self-refining procedures, which aim to improve the quality of the consensus clustering solutions obtained from either hierarchical or flat consensus architectures.
3.5 Discussion<br />
Our proposal for building clustering systems that behave robustly in the face of the indeterminacies inherent to unsupervised classification problems relies on applying consensus clustering processes to large cluster ensembles created by the use of multiple mutually crossed diversity factors.
In the consensus clustering literature, however, relatively few works address the problem of combining large numbers of clusterings, as most authors tend to employ rather small cluster ensembles for evaluating their proposals. Nevertheless, the application of certain consensus clustering approaches in computationally demanding scenarios can be difficult. Typical examples include consensus functions based on object co-association measures that become inapplicable on large data collections, or clustering combiners not executable on
Consensus architecture   CSPA   EAC   HGPA   MCLA   ALSAD   KMSAD   SLSAD
flat                     0      0     0      0      0       0.2     0
RHCA                     0.4    0     0      0      1.2     0.1     0
DHCA                     0      0     0      0      0       0.3     0
Table 3.14: Percentage of experiments in which the consensus clustering solution is better<br />
than the best cluster ensemble component.<br />
Consensus architecture   CSPA   EAC   HGPA   MCLA   ALSAD   KMSAD   SLSAD
flat                     –      –     –      –      –       0.07    –
RHCA                     0.85   –     –      –      0.8     0.07    –
DHCA                     –      –     –      –      –       1.1     –
Table 3.15: Relative percentage φ (NMI) gain between the consensus clustering solution and<br />
the best cluster ensemble component.<br />
large cluster ensembles if their complexity increases quadratically with the number of components<br />
in the ensemble.<br />
To our knowledge, most previous proposals oriented towards this aim deal with subsampling strategies as a means of reducing the computational complexity of consensus processes. That is, if the clustering combination task becomes more costly as the number of objects in the data set and/or the number of cluster ensemble components grows, a natural solution consists in applying the consensus clustering process to a reduced version of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007) and/or the cluster ensemble (Greene and Cunningham, 2006), created by means of some suitable subsampling procedure. Once the consensus process is completed on the reduced data set (or cluster ensemble), it is extended to those entities (objects or cluster ensemble components) that were left out of the mentioned subset. Whereas reducing the size of the data collection and/or the cluster ensemble subject to the consensus clustering process entails an automatic reduction in time complexity, one should take into account the cost associated with the subsampling and extension processes, which is often linear in the size of the data collection (Lange and Buhmann, 2005; Greene and Cunningham, 2006; Gionis, Mannila, and Tsaparas, 2007).
In contrast, our hierarchical consensus architecture proposals are based on reducing the time complexity of consensus processes without discarding any of the objects in the data set or any of the cluster ensemble components. By means of a divide and conquer type of approach (Dasgupta, Papadimitriou, and Vazirani, 2006), we break a single consensus clustering problem into multiple smaller problems, which gives rise to hierarchical consensus architectures that achieve important computational time savings, especially in high diversity scenarios —i.e. the ones we might find ourselves in if the strategy of using multiple mutually crossed diversity factors for creating large cluster ensembles is followed.
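The divide and conquer decomposition can be sketched in a few lines. This is a toy illustration, not any of the thesis' actual consensus functions; `majority_vote` is a hypothetical stand-in for F that assumes the labelings are already aligned:

```python
def majority_vote(mini_ensemble):
    """Toy stand-in for a consensus function F: per-object majority label over
    a mini-ensemble of aligned labelings."""
    n_objects = len(mini_ensemble[0])
    consensus = []
    for j in range(n_objects):
        column = [labeling[j] for labeling in mini_ensemble]
        consensus.append(max(set(column), key=column.count))
    return consensus

def hierarchical_consensus(ensemble, F, b):
    """Combine an ensemble stage by stage: split it into mini-ensembles of at
    most b components, run F on each, and repeat on the partial consensus
    solutions until a single clustering remains."""
    while len(ensemble) > 1:
        ensemble = [F(ensemble[i:i + b]) for i in range(0, len(ensemble), b)]
    return ensemble[0]
```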
As far as we know, the use of divide and conquer approaches to the consensus clustering problem has only been reported in (He et al., 2005), as a means for clustering data sets that contain both numeric and categorical attributes. This proposal consists of dividing the original data collection into a purely numeric and a purely categorical subset, conducting separate clustering processes on each type of features (employing well-established clustering algorithms designed to that end), and subsequently combining the resulting clustering solutions by means of consensus functions. Thus, this divide and conquer consensus clustering proposal is aimed at dealing with objects composed of multi-type features, rather than at reducing the overall time complexity of consensus processes.
In this chapter, two versions of hierarchical consensus architectures have been proposed. In each of them, one of the two factors that define the topology of the architecture is fixed beforehand, i.e. the number of stages (in deterministic HCA) or the size of the mini-ensembles (in random hierarchical consensus architectures). Structuring the whole consensus clustering task as a set of partial consensus processes that take place in successive stages gives the user the chance to apply different consensus functions across the hierarchy —a possibility that, to our knowledge, remains unexplored. Moreover, the decomposition of a classic one-step problem into a set of smaller instances of the same problem naturally allows its parallelization —provided that sufficient computational resources are available. At this point, we would like to highlight the fact that, though posed in the context of the robust clustering problem, hierarchical consensus architectures are applicable to any consensus clustering task involving large cluster ensembles.
From a practical perspective, we have presented a simple running time estimation methodology that, for a given consensus clustering problem, allows a fast and fairly accurate prediction of the computationally optimal consensus architecture. However, the reasonably good performance of the proposed methodology could be further improved by means of a more complex (probably statistical) modeling of the consensus running times that constitute the basis of the estimation.
Based on these predictions, we have presented an experimental study in which the flat and the fastest hierarchical consensus architectures are, firstly, compared in terms of their execution time. Such a comparison has taken into account the most and least computationally costly HCA implementations (i.e. fully serial and fully parallel), so as to provide a notion of the upper and lower bounds of the time complexity of hierarchical consensus architectures.
One of the most predictable conclusions drawn from the conducted experiments is that the computational optimality of a given consensus architecture is local to the consensus function F employed for combining the clusterings. In particular, as far as the execution time of hierarchical consensus architectures is concerned, the main issue to take into account is the dependence between the time complexity of F and the size of the mini-ensembles upon which consensus is conducted. For instance, the use of consensus functions whose complexity scales quadratically with the number of clusterings upon which consensus is created (e.g. MCLA) clearly favours hierarchical consensus architectures. In contrast, flat consensus is more efficient than the fastest serial hierarchical consensus architectures, even in high diversity scenarios, when consensus functions such as EAC are employed.
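This effect can be illustrated with a toy cost model. The sketch below assumes a consensus function whose running time grows quadratically with the number of combined clusterings (an MCLA-like behaviour); the cost units are arbitrary and not the thesis' measured running times:

```python
import math

def flat_cost(l):
    """Cost of combining l clusterings in a single flat consensus step,
    assuming quadratic scaling in the ensemble size."""
    return l * l

def serial_hierarchical_cost(l, b):
    """Cost of a fully serial hierarchical architecture with mini-ensembles of
    size b: each stage runs ceil(l/b) consensus processes over at most b
    clusterings, and feeds its outputs to the next stage."""
    total = 0
    while l > 1:
        n_minis = math.ceil(l / b)
        total += n_minis * b * b
        l = n_minis
    return total
```

Under this model, combining l = 1000 clusterings with mini-ensembles of b = 10 costs 11,100 units for the serial hierarchy against 1,000,000 for flat consensus, which is the behaviour that favours hierarchical architectures for quadratic consensus functions.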
Besides analyzing their computational aspects, we have also compared hierarchical and flat consensus architectures in terms of the quality of the consensus clustering solutions they yield. In this sense, inter-consensus architecture variability is highly dependent on the characteristics of the cluster ensemble and the consensus function employed. For instance, hierarchical and flat consensus architectures based on the CSPA, EAC, ALSAD and SLSAD consensus functions yield consensus clusterings of fairly similar quality, whereas greater variances are observed when the remaining consensus functions are used. Moreover, in general terms, we have observed that consensus architectures based on EAC and HGPA typically
yield lower quality consensus clustering solutions when compared to the other consensus<br />
functions.<br />
Thus, in some sense, we face a further indeterminacy, this one referring to the choice of consensus function to apply. However, this indeterminacy can be overcome by taking advantage of the capability of creating several consensus clustering solutions by means of multiple consensus functions in computationally optimal time, and subsequently applying a supraconsensus function that allows selecting the highest quality consensus clustering solution in a fully unsupervised manner, as proposed in (Strehl and Ghosh, 2002).
Besides this use, supraconsensus strategies constitute a basic ingredient of the consensus self-refining procedure presented in the next chapter, which is oriented to improving the quality of consensus clustering solutions as a means of building robust clustering systems upon consensus clustering processes.
3.6 Related publications<br />
Our first approach to hierarchical consensus architectures dealt with deterministic HCA (Sevillano et al., 2007a), although it was solely focused on the analysis of the quality of the consensus clusterings obtained, not on their computational aspects. The details of this publication, presented as a poster at the ECIR 2007 conference held in Rome, are described next.
Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />
Title: A Hierarchical Consensus Architecture for Robust Document Clustering<br />
In: Proceedings of 29th European Conference on Information Retrieval (ECIR 2007)<br />
Publisher: Springer<br />
Series: Lecture Notes in Computer Science<br />
Volume: 4425<br />
Editors: Giambattista Amati, Claudio Carpineto and Giovanni Romano<br />
Pages: 741-744<br />
Year: 2007<br />
Abstract: A major problem encountered by text clustering practitioners is the difficulty of determining a priori which is the optimal text representation and clustering technique for a given clustering problem. As a step towards building robust document
partitioning systems, we present a strategy based on a hierarchical consensus clustering<br />
architecture that operates on a wide diversity of document representations and<br />
partitions. The conducted experiments show that the proposed method is capable of<br />
yielding a consensus clustering that is comparable to the best individual clustering<br />
available even in the presence of a large number of poor individual labelings, outperforming<br />
classic non-hierarchical consensus approaches in terms of performance and<br />
computational cost.<br />
Chapter 4
Self-refining consensus architectures
As described in chapter 3, our proposal for building clustering systems robust to the inherent indeterminacies that affect the clustering problem consists of i) creating a cluster ensemble E composed of a large number of individual partitions generated by the use of as many diversity factors (e.g. clustering algorithms, object representations, etc.) as possible, and ii) deriving a unique clustering solution λc upon that cluster ensemble through the application of a consensus clustering process.
As mentioned earlier, the use of such a large cluster ensemble entails two negative consequences. The first one refers to the fact that the construction of the consensus clustering solution can become costly or even infeasible, as the space and time complexity of consensus functions scales up linearly or even quadratically with the size of the cluster ensemble. In order to overcome such difficulty, in chapter 3 we put forward the concept of hierarchical consensus architectures, which are based on applying a divide-and-conquer approach to consensus clustering. Moreover, by means of a simple running time estimation methodology, the user is capable of deciding a priori, with a notable degree of accuracy, which is the most computationally efficient consensus architecture for solving a given consensus clustering problem.
The other main downside to the use of large cluster ensembles is the negative bias induced on the quality of the consensus clustering solution λc by the expectable presence of poor¹ individual clusterings in E, caused by the somewhat indiscriminate generation of cluster ensemble components that our proposal indirectly encourages. In order to overcome this inconvenience, we propose a simple consensus self-refining process that, in a fully unsupervised manner, makes it possible to improve the quality of the derived consensus clustering solution λc. Moreover, an additional benefit derived from this automatic consensus refining procedure is the uniformization of the quality of the consensus clustering solutions yielded by distinct consensus architectures, which allows selecting the most appropriate one based on
¹ By good quality clustering solutions we refer to those partitions that reflect the true group structure of the data. Given that we evaluate our clustering results by means of an external cluster validity index –normalized mutual information (φ (NMI) ) with respect to the ground truth, i.e. an allegedly correct group structure of the data–, the highest quality clustering results will be those attaining a φ (NMI) close to 1, whereas the φ (NMI) values associated with poor quality partitions will tend to zero, as φ (NMI) ∈ [0, 1] by definition (Strehl and Ghosh, 2002).
computational efficiency criteria alone. While put forward in a hard clustering scenario, this proposal could be exported to a fuzzy context by introducing several minor modifications.
This chapter is organized as follows: section 4.1 describes the proposed self-refining consensus procedure. Next, several experiments regarding the application of the self-refining process to the consensus clustering solutions output by the three types of consensus architectures described in the previous chapter are presented in section 4.2. An alternative procedure based on cluster ensemble component selection for obtaining refined consensus clustering solutions upon a given cluster ensemble is described in section 4.3, and finally, the discussion and conclusions presented in section 4.4 conclude the chapter.
4.1 Description of the consensus self-refining procedure<br />
The proposed approach for refining the quality of the consensus clustering solution λc is fairly straightforward, being based on the notion of average normalized mutual information φ (ANMI) (Strehl and Ghosh, 2002) between a cluster ensemble E and a consensus clustering solution λc built upon it, as defined by equation (4.1).
φ (ANMI) (E, λc) = (1/l) · Σ_{i=1}^{l} φ (NMI) (λi, λc)    (4.1)
where l represents the number of partitions (or components) contained in the cluster ensemble<br />
E and λi is the ith of these components.<br />
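Equation (4.1) translates directly into code. The φ (NMI) implementation below uses the geometric-mean normalization of (Strehl and Ghosh, 2002) and is a from-scratch illustration rather than the code used in the thesis (the logarithm base cancels in the ratio):

```python
import math
from collections import Counter

def nmi(a, b):
    """phi(NMI) between two labelings of the same n objects: mutual information
    normalized by the geometric mean of the two label entropies."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    # Mutual information from the joint and marginal label counts
    mi = sum((v / n) * math.log(n * v / (ca[x] * cb[y]))
             for (x, y), v in cab.items())
    ha = -sum((v / n) * math.log(v / n) for v in ca.values())
    hb = -sum((v / n) * math.log(v / n) for v in cb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

def anmi(ensemble, consensus):
    """Equation (4.1): average phi(NMI) between a consensus solution and the
    l components of the cluster ensemble."""
    return sum(nmi(component, consensus) for component in ensemble) / len(ensemble)
```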
The higher φ (ANMI) (E, λc), the more information the consensus clustering solution λc shares with all the clusterings in E, and thus the better it can be considered to capture the information contained in the ensemble. In fact, the computation of the φ (ANMI) between a given cluster ensemble E and a set of consensus clustering solutions obtained by means of different consensus functions is proposed in (Strehl and Ghosh, 2002) as a means for choosing among them in an unsupervised fashion, giving rise to what is called a supraconsensus function.
It is important to notice that each term of the summation in equation (4.1) measures<br />
the resemblance between each cluster ensemble component and λc. As a consequence, those<br />
cluster ensemble components more similar to the consensus clustering solution contribute<br />
in a greater proportion to the sum in equation (4.1).<br />
Assuming that the consensus function F applied for obtaining the consensus clustering solution λc delivers a moderately good performance –in the sense that the quality of λc will be reasonably higher than that of the poorest components of the cluster ensemble E–, the normalized mutual information (φ (NMI) ) between λc and each cluster ensemble component λi (∀i ∈ [1,...,l]) gives an approximate measure of the quality of the latter (Fern and Lin, 2008).
Allowing for this fact, the proposed consensus self-refining procedure is based on ranking the l components of the cluster ensemble E according to their φ (NMI) with respect to the consensus clustering solution λc. The result of this sorting process is represented by means of an ordered list Oφ(NMI) = {λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)l}, the subindices of which refer to the aforementioned φ (NMI) based ranking, i.e. λφ(NMI)1 denotes the cluster ensemble component that attains the highest normalized mutual information with respect to λc,
λφ(NMI)2 is the component with the second highest φ (NMI) with respect to the consensus clustering solution, and so on.
Subsequently, the top p percent of the ranked individual partitions are selected so as to form a select cluster ensemble Ep –see equation (4.2)–, upon which a refined consensus clustering solution λcp will be derived through the application of the consensus function F. Notice that the larger the percentage p, the more components are included in the select cluster ensemble Ep —ultimately, Ep = E if p = 100.
Ep = (λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)⌊p·l/100⌉)    (4.2)
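In code, the ranking and the construction of Ep reduce to a sort and a slice. This sketch uses a hypothetical `agreement` similarity as a stand-in for φ (NMI) (any implementation could be passed in via `sim`), and rounds p·l/100 to the nearest integer:

```python
def agreement(a, b):
    """Toy stand-in for phi(NMI): fraction of objects on which two aligned
    labelings coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def select_ensemble(ensemble, consensus, p, sim=agreement):
    """Build the select cluster ensemble Ep of equation (4.2): the top p
    percent of ensemble components, ranked by similarity to the consensus
    clustering solution."""
    ranked = sorted(ensemble, key=lambda lam: sim(lam, consensus), reverse=True)
    keep = max(round(p * len(ensemble) / 100), 1)  # round(p*l/100), at least one
    return ranked[:keep]
```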
Following the rationale of the proposed self-refining procedure, it can be assumed that, with high probability, the worst components of the initial cluster ensemble E will have been excluded from Ep. Therefore, the self-refined consensus clustering solution λcp obtained through the application of the consensus function F on Ep will probably improve on the initial consensus labeling λc, as we will experimentally demonstrate in later sections.
Finally, three additional remarks so as to conclude this description. Firstly, notice that the consensus process run on the select cluster ensemble Ep can be conducted following either a flat or a hierarchical approach, depending on the consensus function applied, the characteristics of the data set and the value of p, which, as aforementioned, determines the size of Ep. As reported in the previous chapter, the proposed running time estimation methodologies constitute an efficient means for deciding whether the self-refined consensus solution λcp should be derived following either a flat or a hierarchical consensus architecture.
Secondly, notice that the proposed consensus self-refining process is entirely automatic<br />
and unsupervised (hence its name), as it is solely based on the cluster ensemble E, the<br />
consensus clustering solution λc and a similarity measure –φ (NMI) – that requires no external<br />
knowledge for its computation. The only user-driven decision refers to the selection of the<br />
value of the percentage p used for creating the select cluster ensemble Ep.<br />
And the third remark deals just with this latter issue. Quite obviously, the selection<br />
of the percentage p is made blindly. So as to avoid the negative consequences of choosing<br />
a suboptimal value of p at random, our consensus self-refining proposal is completed by<br />
the (possibly parallelized) creation of multiple refined consensus clustering solutions using<br />
P distinct percentage values p = {p1, p2, ..., pP}, i.e. λcpi for i = {1, 2, ..., P}, selecting as the final refined consensus clustering solution λc^final the one maximizing φ (ANMI) with respect to the cluster ensemble E, as defined by equation (4.3) —in fact, this unsupervised a posteriori clustering selection process is equivalent to the supraconsensus function proposed in (Strehl and Ghosh, 2002).
λc^final = arg max φ (ANMI) (E, λ), λ ∈ {λc, λcp1, ..., λcpP}    (4.3)
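The selection of equation (4.3) is a one-liner once φ (ANMI) is available. In this sketch, `agreement` is a hypothetical stand-in similarity, not the thesis' φ (NMI):

```python
def agreement(a, b):
    """Toy stand-in for phi(NMI): fraction of coinciding labels between two
    aligned labelings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def supraconsensus(ensemble, candidates, sim=agreement):
    """Equation (4.3): among the non-refined and self-refined consensus
    clustering solutions, return the one maximizing the average similarity
    (phi(ANMI) in the thesis) with respect to the cluster ensemble."""
    def avg_sim(lam):
        return sum(sim(component, lam) for component in ensemble) / len(ensemble)
    return max(candidates, key=avg_sim)
```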
For summarization purposes, table 4.1 describes the steps that constitute the proposed<br />
consensus self-refining procedure.<br />
1. Given a cluster ensemble E containing l components, E = (λ1, λ2, ..., λl), and a pre-computed consensus clustering solution λc, compute the φ (NMI) between λc and each of the components of the cluster ensemble, that is: φ (NMI) (λc, λk), ∀k = 1, ..., l.

2. Generate an ordered list Oφ(NMI) = {λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)l} of cluster ensemble components ranked in decreasing order according to their φ (NMI) with respect to λc.

3. Create a set of P select cluster ensembles Epi by compiling the first ⌊pi·l/100⌉ components of Oφ(NMI): Epi = (λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)⌊pi·l/100⌉), where pi ∈ (0, 100), ∀i = 1, ..., P.

4. Run a (flat or hierarchical) consensus architecture based on a consensus function F on Epi, obtaining a self-refined consensus clustering solution λcpi.

5. Apply the supraconsensus function on the non-refined consensus clustering solution λc and the set of self-refined consensus clustering solutions λcpi, i.e. select as the final consensus solution the one maximizing its φ (ANMI) with respect to the cluster ensemble E.
Table 4.1: Methodology of the consensus self-refining procedure.<br />
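Steps 1 to 3 of table 4.1 reduce to ranking the ensemble components by their φ (NMI) against λc and keeping the top fraction. A minimal sketch follows, assuming labelings are integer lists; the NMI helper uses the sqrt-normalized definition, and the rounding ⌊·⌉ is approximated with Python's built-in round.

```python
from collections import Counter
from math import log, sqrt

def nmi(a, b):
    # sqrt-normalized mutual information between two labelings
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(n * c / (ca[x] * cb[y])) for (x, y), c in cab.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

def select_ensemble(ensemble, consensus, p):
    # steps 1-3 of table 4.1: rank the l components by phi(NMI) with respect
    # to the consensus clustering and keep the first round(p/100 * l) of them
    ranked = sorted(ensemble, key=lambda lam: nmi(consensus, lam), reverse=True)
    size = max(1, round(p / 100 * len(ensemble)))
    return ranked[:size]
```

Running a consensus function on each select ensemble Epi (step 4) then yields the refined candidates λc^pi that are fed to the supraconsensus selection of step 5.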
4.2 Flat vs. hierarchical self-refining<br />
In this section, we present several experiments regarding the application of the consensus<br />
self-refining procedure described in section 4.1 on the consensus clustering solutions output<br />
by the three consensus architectures described in chapter 3. In all cases, our main interest<br />
is focused on analyzing the qualities of the clusterings obtained by the proposed self-refining<br />
procedure, not on evaluating the computational aspects of the self-refining process, as the<br />
decision regarding whether it is implemented according to a hierarchical or a flat consensus<br />
architecture can be efficiently made using the running time estimation methodologies<br />
proposed in chapter 3.<br />
The experiments conducted follow the design described next.<br />
– What do we want to measure?<br />
i) The quality of the self-refined consensus clusterings obtained by the proposed<br />
methodology when applied on the consensus clusterings output by the flat and<br />
allegedly fastest RHCA and DHCA consensus architectures.<br />
ii) The ability of the proposed self-refining procedure to obtain a consensus clustering<br />
of higher quality than that of a) its non-refined counterpart, and b) the<br />
highest and median quality cluster ensemble components.<br />
iii) The quality of the self-refined consensus clustering of maximum quality compared<br />
to its non-refined counterpart.<br />
iv) The ability of the supraconsensus function to select, in a fully unsupervised<br />
manner, the highest quality self-refined consensus clustering among the set of<br />
self-refined clusterings generated.<br />
v) Whether self-refining constitutes a means for uniformizing the quality<br />
of the consensus clustering solutions yielded by the flat and allegedly fastest<br />
RHCA and DHCA consensus architectures, thus making it possible to decide, on<br />
computational grounds only, which is the most suitable consensus architecture<br />
for a given clustering problem.<br />
– How do we measure it?<br />
i) The quality of the self-refined consensus clusterings is measured in terms of the<br />
φ (NMI) with respect to the ground truth of each data collection.<br />
ii) The percentage of experiments in which the proposed self-refining procedure gives<br />
rise to at least one self-refined consensus clustering of higher quality than that<br />
of a) its non-refined counterpart, and b) the highest and median quality cluster<br />
ensemble components.<br />
iii) We measure the relative φ (NMI) percentage difference between the self-refined<br />
consensus clustering of maximum quality and its non-refined counterpart.<br />
iv) The precision of the supraconsensus function is measured in terms of the percentage<br />
of experiments in which it manages to select the highest quality self-refined<br />
consensus clustering.<br />
v) We compare the average variance between the φ (NMI) scores of the consensus<br />
clusterings λc yielded by the three evaluated consensus architectures (i.e. prior to<br />
self-refining) with the variance between the φ (NMI) scores of the consensus clusterings<br />
selected by the supraconsensus function (λc^final) after the self-refining procedure is<br />
conducted.<br />
– How are the experiments designed? We only analyze the results of the consensus<br />
self-refining process executed on the highest diversity scenario (i.e. the one where<br />
cluster ensembles are created by applying the |dfA| = 28 clustering algorithms from<br />
the CLUTO clustering package). The reason for this is twofold: besides brevity, this<br />
limitation on our analysis prevents the results of the self-refining process from being<br />
masked by the consensus quality variability observed in lower diversity scenarios —recall<br />
that, in those cases, the quality of the consensus clustering solutions shows larger<br />
variances, as the cluster ensemble changes from experiment to experiment due to the random<br />
selection of |dfA| = {1, 10, 19} clustering algorithms, whereas exactly the same cluster<br />
ensemble is employed across the ten experiments in the highest diversity scenario. As<br />
in all the experimental sections of this thesis, consensus processes have been replicated<br />
using the set of seven consensus functions described in appendix A.5, namely: CSPA,<br />
EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD. Results are averaged across ten<br />
independent experiments consisting of ten consensus function runs each. With the<br />
objective of analyzing the expected dependence between the degree of refinement of<br />
the consensus clustering solution and the percentage p of cluster ensemble components<br />
included in the select cluster ensemble Ep, the experiments have been replicated for a<br />
set of percentage values in the range p ∈ [2, 90]. Subsequently, the final consensus label<br />
vector λc^final is selected among all the available (i.e. non-refined and refined) consensus<br />
clustering solutions through the application of the supraconsensus function presented<br />
in equation (4.3). Last, it is important to state that, although it is possible (and, in<br />
fact, advisable) to apply either flat or hierarchical consensus on the select cluster<br />
ensemble depending on which is the most computationally efficient option, all self-refining<br />
consensus processes in our experiments have been conducted, for simplicity,<br />
using a flat consensus architecture.<br />
– How are results presented? Results are presented by means of boxplot charts of<br />
the φ (NMI) values corresponding to the consensus self-refining process. In particular,<br />
each subfigure depicts –from left to right– the φ (NMI) values of: i) the components of<br />
the cluster ensemble E, ii) the non-refined consensus clustering solution (i.e. the one<br />
resulting from the application of either a hierarchical or a flat consensus architecture,<br />
denoted as λc), and iii) the self-refined consensus labelings λc^pi obtained from select<br />
cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}.<br />
Moreover, the consensus clustering solution deemed as the optimal one (across a<br />
majority of experiment runs) by the supraconsensus function is identified by means<br />
of a vertical green dashed line. Finally, the quality comparisons between the self-refined<br />
consensus clusterings, the non-refined consensus clusterings and the cluster<br />
ensemble components are presented by means of tables showing the average values of<br />
the measured magnitudes.<br />
– Which data sets are employed? These experiments span the twelve unimodal data<br />
collections employed in this work. For brevity reasons, and following the presentation<br />
scheme of the previous chapter, this section only describes in detail the results of<br />
the self-refining procedure obtained on the Zoo data set, deferring the portrayal of<br />
the results obtained on the remaining data collections to appendix D.1. However,<br />
the global evaluation of the self-refining and the supraconsensus processes encompasses the<br />
results obtained on all twelve unimodal collections.<br />
Figure 4.1 presents the boxplot charts of the φ (NMI) values corresponding to the consensus<br />
self-refining process applied on the Zoo data set. Notice that figure 4.1 is organized into<br />
three columns of subfigures, each one of which corresponds to one of the three consensus<br />
architectures, i.e. flat, RHCA and DHCA.<br />
Fairly varied results can be observed in figure 4.1, as regards both the performance of<br />
the self-refining process in itself and of the supraconsensus selection function. For instance,<br />
when the consensus clustering solution output by the flat consensus architecture using the<br />
CSPA consensus function is subject to self-refining (see the leftmost boxplot on the top<br />
row of figure 4.1), we can observe that two of the refined solutions yield clearly higher<br />
φ (NMI) values than their non-refined counterpart —in particular, the ones obtained using<br />
Figure 4.1: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc^pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the Zoo data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by the<br />
supraconsensus function in each experiment.<br />
30% and 60% of the whole cluster ensemble, i.e. λc30 and λc60. Moreover, notice that<br />
supraconsensus selects λc30 as the final consensus clustering solution (that is, it performs<br />
correctly).<br />
In other cases, supraconsensus fails to select the highest quality consensus clustering<br />
solution. See, for instance, that supraconsensus selects the non-refined consensus clustering<br />
solution λc yielded by the flat consensus architecture based on the EAC consensus function<br />
as the optimal one, whereas the refined clusterings λc40, λc50 and λc60 attain higher φ (NMI)<br />
values (leftmost boxplot chart on the second row of figure 4.1). Furthermore, notice that<br />
in a minority of cases, the self-refining procedure introduces little or no improvement,<br />
as when it is applied on the consensus solution output by the RHCA using the ALSAD<br />
consensus function —central column boxplot on the fifth row of figure 4.1.<br />
Last, notice that the boxplot corresponding to the refining of the consensus clustering<br />
solution output by the flat consensus architecture using MCLA –leftmost boxplot on the<br />
fourth row– only presents the φ (NMI) values corresponding to the cluster ensemble E. This<br />
is due to the fact that, for this particular consensus function and diversity scenario, flat<br />
consensus is not executable with our computational resources —see appendix A.6. Moreover,<br />
as all self-refining consensus processes in our experiments have been conducted using a<br />
flat consensus architecture, the self-refined versions of the consensus clustering solutions output by<br />
RHCA and DHCA are not computed from λc40 onwards due to memory limitations when using<br />
the MCLA consensus function. However, recall from chapter 3 that hierarchical consensus<br />
architectures would allow the computation of consensus clustering solutions in situations<br />
where flat consensus is not executable.<br />
A deeper and more quantitative evaluation of the proposed consensus self-refining procedure<br />
requires analyzing two of its facets. Firstly, it is necessary to evaluate the self-refining<br />
process in itself, answering questions such as: i) how often does the self-refining process yield<br />
a higher quality consensus clustering solution than the non-refined one? ii) to which extent<br />
are the top quality self-refined consensus clustering solutions better than their non-refined<br />
counterpart? iii) how do the best self-refined consensus clustering solutions compare to<br />
the cluster ensemble components? or iv) does the self-refining procedure reduce the differences<br />
between the quality of the consensus clustering solutions output by distinct consensus<br />
architectures? The answers to these questions are presented in section 4.2.1.<br />
And secondly, given a set of self-refined consensus clustering solutions, a supraconsensus<br />
function capable of blindly selecting the highest quality self-refined solution is required. Its<br />
performance can be evaluated in terms of i) the percentage of occasions the supraconsensus<br />
function selects the highest quality consensus solution, and ii) the quality loss degree due<br />
to the supraconsensus selection of suboptimal consensus clustering solutions. These aspects<br />
are evaluated in section 4.2.2.<br />
4.2.1 Evaluation of the consensus-based self-refining process<br />
As regards the evaluation of the self-refining process, we have firstly analyzed the percentage<br />
of self-refining experiments in which at least one of the self-refined consensus clustering<br />
solutions attains a φ (NMI) with respect to the ground truth that is higher than the one<br />
achieved by the consensus clustering solution available prior to self-refinement. The results<br />
presented in table 4.2, which correspond to an average across all the data sets for each<br />
consensus architecture and consensus function, reveal that the proposed self-refining<br />
procedure performs quite successfully, giving rise to at least one self-refined consensus<br />
clustering solution that improves the consensus clustering available prior to refining in an<br />
average 83% of the experiments conducted.<br />
Consensus      Consensus function<br />
architecture   CSPA   EAC    HGPA       MCLA    ALSAD   KMSAD   SLSAD<br />
flat           90.4   63.6   70         77.25   80      76.9    90<br />
RHCA           85.7   83.2   97         94.2    73.4    76.5    82.4<br />
DHCA           91.1   78.5   98         88.5    89.8    88.1    67.8<br />
Table 4.2: Percentages of self-refining experiments in which one of the self-refined consensus<br />
clustering solutions is better than its non-refined counterpart, averaged across the twelve<br />
data collections.<br />
Consensus      Consensus function<br />
architecture   CSPA   EAC    HGPA       MCLA    ALSAD   KMSAD   SLSAD<br />
flat           16.5   273.3  53.3       14.9    9.8     16.1    433.4<br />
RHCA           10.9   157.1  294779.8   200.8   14.1    15.1    205.6<br />
DHCA           24.5   66.4   152450.9   79.9    38.4    30.9    224.9<br />
Table 4.3: Relative φ (NMI) percentage gain of the top quality self-refined consensus<br />
clustering solution with respect to its non-refined counterpart, averaged across the twelve<br />
data collections.<br />
Moreover, we have also computed the relative φ (NMI) percentage gain between the non-refined<br />
and the top quality self-refined consensus clustering solution —considering only<br />
those experiments where self-refining yields a better clustering solution, i.e. 83% of<br />
the total. The results presented in table 4.3, which again correspond to an average across<br />
all the data sets for each consensus architecture and consensus function, reveal that the<br />
proposed self-refining procedure performs in an overwhelmingly successful manner, giving<br />
rise to an average relative percentage φ (NMI) gain of 21386% across all the experiments<br />
conducted. This exceptionally large figure is due to the fact that, in a few cases, extremely<br />
poor quality consensus clustering solutions are available prior to self-refining.<br />
In particular, this situation is found when the HGPA consensus function is employed for<br />
refining the consensus clustering solutions yielded by hierarchical consensus architectures<br />
on the WDBC and BBC data collections (see, for instance, figure D.10 in appendix D).<br />
Despite its exceptionality, this fact introduces a large bias on the averaged values of φ (NMI)<br />
gains. However, if this kind of artifact is ignored, relative φ (NMI) gains between 10% and<br />
430% are consistently obtained in all cases, which gives an idea of the suitability of the<br />
proposed self-refining procedure for improving consensus clustering solutions.<br />
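The averaging artifact just described is easy to reproduce. The following toy illustration, with made-up φ (NMI) values not taken from the experiments, shows how a single near-zero baseline inflates the averaged relative gain while the median stays representative:

```python
def rel_gain(refined, baseline):
    # relative phi(NMI) percentage gain of a refined solution over its baseline
    return 100.0 * (refined - baseline) / baseline

# two ordinary experiments plus one with a near-zero baseline (made-up values)
pairs = [(0.66, 0.60), (0.55, 0.50), (0.30, 0.0001)]
gains = [rel_gain(r, b) for r, b in pairs]

mean_gain = sum(gains) / len(gains)           # blown up by the outlier
median_gain = sorted(gains)[len(gains) // 2]  # robust to it
```

The first two experiments each gain about 10%, yet the near-zero baseline pushes the mean to hundreds of thousands of percent, mirroring the 21386% figure reported above; a median would be the robust summary here.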
Besides comparing the top quality self-refined consensus clustering solution with its<br />
non-refined counterpart, we have also contrasted its quality with that of the highest and<br />
median φ (NMI) components of the cluster ensemble E, referred to as BEC (best ensemble<br />
component) and MEC (median ensemble component), respectively. Using the quality of<br />
these two components as a reference, we have evaluated i) the percentage of experiments<br />
where the maximum φ (NMI) (either refined or non-refined) consensus clustering solution<br />
attains a quality higher than that of the BEC and MEC, and ii) the relative percentage<br />
φ (NMI) variation between them and the top quality consensus clustering solution. Again,<br />
Consensus Consensus function<br />
architecture CSPA EAC HGPA MCLA ALSAD KMSAD SLSAD<br />
flat 8.3 0 0 0 25 23.9 8.3<br />
RHCA 8.3 0 0 16.7 28.3 23.8 4<br />
DHCA 16.7 0 0.1 16.6 16.7 18.2 8.3<br />
Table 4.4: Percentages of experiments in which the best (non-refined or self-refined) consensus<br />
clustering solution is better than the best cluster ensemble component, averaged across<br />
the twelve data collections.<br />
Consensus Consensus function<br />
architecture CSPA EAC HGPA MCLA ALSAD KMSAD SLSAD<br />
flat 2.7 – – – 3.5 1.1 0.1<br />
RHCA 2.5 – – 1.7 4.2 1 0.1<br />
DHCA 3.3 – 2.2 1.4 1.4 1.2 0.8<br />
Table 4.5: Relative percentage φ (NMI) gains between the best (non-refined or self-refined)<br />
consensus clustering solution and the best cluster ensemble component, averaged across the<br />
twelve data collections.<br />
all the results presented correspond to an average across all the experiments conducted on<br />
the twelve unimodal data collections.<br />
As regards the first issue, table 4.4 presents the percentage of experiments where the<br />
highest quality consensus clustering solution (either refined or non-refined) is better than the<br />
BEC (i.e. it attains a φ (NMI) that is higher than that of the cluster ensemble component that<br />
best describes the group structure of the data in terms of normalized mutual information<br />
with respect to the given ground truth). On average, this happens in 10.6% of the conducted<br />
experiments, which is a frequency of occurrence 100 times higher than what was obtained<br />
when non-refined clustering solutions were considered (see table 3.14 in chapter 3). Again,<br />
this result reveals the notable consensus improvement introduced by the proposed self-refining<br />
procedure. Moreover, notice the poor results obtained with the EAC and the<br />
HGPA consensus functions, which were already reported to be the worst performing ones<br />
in chapter 3.<br />
Moreover, the relative percentage φ (NMI) gains between the top quality consensus clustering<br />
solution and the BEC are presented in table 4.5, attaining a modest average increase<br />
of 1.8%. However, recall that this figure was as low as 0.6% when the non-refined consensus<br />
clustering solution was considered (see table 3.15 in section 3.4), which indicates that the<br />
consensus self-refining procedure again introduces notable quality improvements.<br />
If this comparison is now made against the median ensemble component, it can be observed<br />
that, on average, the best (non-refined or self-refined) consensus clustering solution attains<br />
a φ (NMI) that is higher than that of the cluster ensemble component that has the median<br />
normalized mutual information with respect to the given ground truth in 67.7% of the<br />
experiments conducted (see table 4.6). Recall that this percentage was 53.1% when the<br />
consensus clustering solution prior to self-refining was compared to the MEC —see table<br />
3.12 in section 3.4.<br />
If the degree of improvement between the best (non-refined or self-refined) consensus<br />
clustering solutions that attain a higher φ (NMI) than the MEC is measured in terms of<br />
Consensus Consensus function<br />
architecture CSPA EAC HGPA MCLA ALSAD KMSAD SLSAD<br />
flat 67.7 41.7 62.1 25 75 75 66.7<br />
RHCA 72.3 46.4 69.8 81.9 82 83.3 66.6<br />
DHCA 83.3 33.3 69.2 88.2 83.3 83.1 66.7<br />
Table 4.6: Percentage of experiments in which the best (non-refined or self-refined) consensus<br />
clustering solution is better than the median cluster ensemble component, averaged<br />
across the twelve data collections.<br />
Consensus Consensus function<br />
architecture CSPA EAC HGPA MCLA ALSAD KMSAD SLSAD<br />
flat 107.4 33.1 82.4 91.1 113.8 109.1 73.3<br />
RHCA 96.6 24 98.7 109.5 118.4 114.3 70.4<br />
DHCA 113.3 25.6 100.2 108.8 118.9 114.7 87.8<br />
Table 4.7: Relative percentage φ (NMI) gain between the best (non-refined or self-refined)<br />
consensus clustering solution and the median cluster ensemble component, averaged across<br />
the twelve data collections.<br />
relative percentage φ (NMI) increase, a notable 91% relative φ (NMI) gain is obtained on average<br />
—see table 4.7 for a detailed view across consensus functions and architectures. Again,<br />
the beneficial effect of self-refining becomes evident if this result is compared to the one<br />
obtained from the analysis of the non-refined consensus clustering solution as, in that case,<br />
the observed relative φ (NMI) gain was 59% (see table 3.13).<br />
Furthermore, we have also measured the ability of the self-refining procedure to uniformize<br />
the quality of the consensus clustering solutions output by the distinct consensus<br />
architectures. To evaluate this issue, we have computed, for each individual experiment,<br />
the average variance of the φ (NMI) values of the non-refined consensus solutions<br />
yielded by the RHCA, DHCA and flat consensus architectures —the smaller the variance,<br />
the more similar the φ (NMI) values. This procedure has been repeated for the top quality (either<br />
refined or non-refined) consensus clustering solutions obtained at each experiment. The results<br />
of this analysis are presented in table 4.8. Except for the EAC consensus function,<br />
we can observe a notable reduction in the variance between the φ (NMI) of the consensus<br />
solutions output by the three considered consensus architectures, keeping it below 0.01<br />
in most cases. In global average terms, the variance is dramatically reduced<br />
by an approximate factor of 20, from 0.105 to 0.0056. For this reason, it can be conjectured<br />
that, besides improving the quality of consensus clustering solutions as already reported, the<br />
proposed self-refining procedure also helps make the quality of the self-refined consensus<br />
clustering solution more independent of the consensus architecture employed —so that<br />
the architecture can be selected on computational grounds alone.<br />
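Per experiment, the figures of table 4.8 are simply the variance of the φ (NMI) scores attained by the three architectures. A small sketch with illustrative, made-up scores shows the kind of spread reduction being measured:

```python
def quality_variance(phis):
    # population variance of the phi(NMI) scores attained by the flat,
    # RHCA and DHCA consensus architectures on one experiment
    m = sum(phis) / len(phis)
    return sum((v - m) ** 2 for v in phis) / len(phis)

# made-up scores before and after self-refining
var_before = quality_variance([0.75, 0.30, 0.55])  # architectures disagree
var_after = quality_variance([0.71, 0.68, 0.70])   # refined solutions align
```

Averaging these per-experiment variances across data sets yields the rows of table 4.8.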
As a conclusion, it can be asserted that the proposed consensus self-refining procedure is<br />
reasonably successful, as, in general terms, it introduces a quality increase that brings self-refined<br />
consensus clustering solutions closer to the best individual components available<br />
in the cluster ensemble, which would ultimately constitute the goal of robust clustering<br />
systems based on consensus clustering.<br />
It is of paramount importance to notice that, in the analysis of all the previous results,<br />
Consensus Consensus function<br />
solution CSPA EAC HGPA MCLA ALSAD KMSAD SLSAD<br />
non-refined 0.011 0.011 0.026 0.638 0.009 0.011 0.029<br />
best non/self-refined 0.004 0.019 0.005 0.002 0.002 0.001 0.006<br />
Table 4.8: φ (NMI) variance of the non-refined and the best non/self-refined consensus clustering<br />
solutions across the flat, RHCA and DHCA consensus architectures, averaged across<br />
the twelve data collections.<br />
Consensus Consensus function<br />
architecture CSPA EAC HGPA MCLA ALSAD KMSAD SLSAD<br />
flat 30.4 50 53.1 11 25 23.8 37.5<br />
RHCA 25 35.9 38.4 3.9 24.4 24.1 36.6<br />
DHCA 12.5 42 40.1 17 0 29.5 37.5<br />
Table 4.9: Percentage of experiments in which the supraconsensus function selects the top<br />
quality consensus clustering solution, averaged across the twelve data collections.<br />
we have assumed the use of the top quality self-refined consensus clustering solution. Quite<br />
obviously, achieving the encouraging results reported would require using a supraconsensus<br />
function that, in an automatic manner, detects the best self-refined consensus clustering<br />
in any given situation. The next section is devoted to the performance analysis of<br />
such supraconsensus function.<br />
4.2.2 Evaluation of the supraconsensus process<br />
As regards the performance of the supraconsensus function proposed by (Strehl and Ghosh,<br />
2002), we have firstly evaluated the percentage of experiments in which the supraconsensus<br />
function selects the highest quality consensus clustering solution. Table 4.9 presents the<br />
results averaged across all the data collections, for each consensus function and architecture.<br />
The average accuracy with which the supraconsensus function selects the top quality self-refined<br />
consensus clustering solution is 29%, i.e. it manages to select the best solution in less than a<br />
third of the experiments conducted.<br />
This somewhat contradicts the conclusions of (Strehl and Ghosh, 2002), where<br />
φ (ANMI) (E, λc) is presented as a suitable surrogate of φ (NMI) (γ, λc) for selecting the best<br />
consensus clustering solutions in real scenarios, where a ground truth γ is not available.<br />
That conclusion was supported by the fact that both φ (ANMI) (E, λc) and φ (NMI) (γ, λc)<br />
follow very similar growth patterns (i.e. the higher φ (NMI) (γ, λc), the<br />
higher φ (ANMI) (E, λc)). However, such claims were sustained by experiments using synthetic<br />
clustering results. In several of our experiments, in contrast, we have observed that<br />
this behaviour is not always strictly obeyed.<br />
Just for illustration purposes, we have conducted a toy experiment, in which a set of<br />
300 randomly picked cluster ensemble components corresponding to the Zoo data collection<br />
have been evaluated in terms of i) their φ (NMI) with respect to the ground truth, and ii)<br />
their φ (ANMI) with respect to the remaining 299 clusterings. Figure 4.2 depicts<br />
both magnitudes, where the horizontal axis of each figure corresponds to an index of the<br />
clusterings in the ensemble arranged in decreasing order of φ (NMI) with respect to the<br />
ground truth.<br />
Figure 4.2: Decreasingly ordered φ (NMI) (wrt ground truth) values of the 300 clusterings<br />
included in the toy cluster ensemble (left), and their corresponding φ (ANMI) values (wrt the<br />
toy cluster ensemble) (right).<br />
Consensus      Consensus function<br />
architecture   CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD<br />
flat           8.8    14.5   17.8   12     6.7     8.9     12.1<br />
RHCA           6.3    26.6   24.7   27.3   8.6     8.4     16.9<br />
DHCA           15.9   27     22.9   19.6   9.4     11      8.7<br />
Table 4.10: Relative percentage φ (NMI) losses due to suboptimal self-refined consensus<br />
clustering solution selection by supraconsensus, averaged across the twelve data collections.<br />
Notice how the monotonic decreasing behaviour of φ (NMI) –figure 4.2(a)– is<br />
not strictly observed in φ (ANMI) (see figure 4.2(b), where a fifth order fitting red dashed<br />
curve is overlayed for comparison). In fact, the clustering attaining the maximum φ (ANMI)<br />
is the one with the fiftieth largest φ (NMI) . Thus, in practice, φ (ANMI) seems to constitute<br />
a means for identifying good clustering solutions, but not the best one. For this reason,<br />
it seems that requiring φ (ANMI) (E, λc) to select the one self-refined consensus clustering<br />
solution of highest quality is a far too restrictive constraint, which leads to the slightly<br />
disappointing results presented in table 4.9.<br />
In order to evaluate the influence of this apparent lack of precision of the supraconsensus
function, we have measured the relative percentage φ (NMI) loss derived from a suboptimal
consensus solution selection, using the top quality consensus clustering solution as a
reference (i.e. the one that an ideal supraconsensus function should select). The results,
which are presented in table 4.10, show that the modest selection accuracy of the
supraconsensus function leads to an average relative φ (NMI) loss of 14.9%.
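For clarity, the relative percentage φ (NMI) loss reported here and in table 4.10 is simply the normalized gap between the best available solution and the selected one; a one-line sketch (variable names are illustrative):

```python
def relative_nmi_loss(nmi_best, nmi_selected):
    """Relative percentage phi(NMI) loss of a suboptimal selection,
    taking the top quality consensus solution as the reference."""
    return 100.0 * (nmi_best - nmi_selected) / nmi_best

# e.g. if the best solution scores 0.80 and the selected one 0.68:
print(round(relative_nmi_loss(0.80, 0.68), 1))  # 15.0
```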
To conclude, it can be asserted that, while the proposed consensus self-refining procedure
introduces notable gains in the quality of consensus clustering solutions, there is still
room for taking full advantage of it, as the entirely unsupervised selection of the highest
quality consensus solution is not a fully solved problem yet.
4.3 Selection-based self-refining<br />
The consensus self-refining procedure proposed in section 4.1 is based on using a consensus
clustering solution as a reference for computing the φ (NMI) of the cluster ensemble
components, which, in turn, guides the creation of the select cluster ensemble Ep upon
which the self-refining process is conducted.
In this section, we propose an alternative procedure for obtaining a self-refined consensus<br />
clustering solution. The only difference between this proposal and the one put forward in<br />
section 4.1 lies in the fact that the computation of the φ (NMI) of the cluster ensemble<br />
components –a step prior to the creation of the select cluster ensemble Ep– is not referred<br />
to a previously obtained consensus labeling λc, but to one of the components of the cluster<br />
ensemble E.<br />
By doing so, we aim to devise an alternative means for obtaining a high quality clustering
from a large cluster ensemble that does not require executing any consensus process to
obtain the reference clustering against which the cluster ensemble components are compared
in terms of normalized mutual information, with the obvious computational savings this
conveys.
Since this proposed method is based on selecting one of the cluster ensemble components
for initiating the consensus self-refining process, we have called it selection-based
self-refining, and its constituent steps are presented in table 4.11.
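The selection and ranking steps of this procedure (max-ANMI reference selection followed by NMI-based ranking and truncation) can be sketched as follows. This is an illustrative reimplementation assuming scikit-learn's normalized_mutual_info_score as the φ (NMI) measure, not the code used in the experiments:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def select_cluster_ensemble(ensemble, p):
    """Pick the max-ANMI component as the reference lambda_ref, then
    keep the top p% of components ranked by NMI against it."""
    l = len(ensemble)
    # Step 1: ANMI of each component with respect to the whole ensemble.
    anmi = [np.mean([nmi(lam_i, lam_k) for lam_i in ensemble])
            for lam_k in ensemble]
    # Step 2: the reference is the component maximizing its ANMI.
    ref = ensemble[int(np.argmax(anmi))]
    # Steps 3-4: rank all components by decreasing NMI wrt the reference.
    order = sorted(range(l), key=lambda k: nmi(ref, ensemble[k]),
                   reverse=True)
    # Step 5: the select ensemble keeps the first round(p*l/100) components.
    n_keep = max(int(round(p * l / 100.0)), 1)
    return ref, [ensemble[k] for k in order[:n_keep]]
```

A consensus function would then be run on the returned select ensemble, and the supraconsensus function applied to λref and the resulting self-refined solutions (steps 6-7 of table 4.11).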
In the next paragraphs, we analyze the performance of this second self-refining proposal,
following the same experimental scheme employed in section 4.2. That is, we first review
the results obtained on the Zoo data collection at a qualitative level (the analysis
corresponding to the remaining data collections is presented in appendix D.2), followed by
a quantitative study of the quality of the self-refined consensus clustering solutions
across all the experiments conducted.
With the objective of making the results of selection-based consensus self-refining comparable<br />
to those presented in the previous section, we have followed the same experimental<br />
procedure, that is: i) the experiments have been replicated for a set of self-refining percentage<br />
values in the range p ∈ [2, 90], ii) the experiments have been executed on the cluster<br />
ensembles corresponding to the highest diversity scenario.<br />
For starters, figure 4.3 depicts the boxplot charts of the φ (NMI) values corresponding to
the selection-based consensus self-refining process. Each chart depicts –from left to right–
the φ (NMI) values of: i) the components of the cluster ensemble E, ii) the cluster ensemble
component with maximum φ (ANMI) with respect to the whole ensemble, i.e. λref, and iii)
the self-refined consensus clusterings λc_pi obtained upon select cluster ensembles created
using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}.
Firstly, we can notice the high quality of the selected cluster ensemble component λref,
whose φ (NMI) is pretty close to that of the highest quality component of the cluster
ensemble E. Thus, it seems that the proposed selection procedure constitutes, by itself, a
fairly good approach for obtaining clustering solutions that are robust to the inherent
indeterminacies of the clustering problem. When the self-refining procedure is applied on
the select cluster ensemble created upon λref, distinct performances are observed. Whereas
in some cases none of the self-refined clustering solutions λc_pi attains higher φ (NMI)
values than λref (see, for instance, figure 4.3(a)), the opposite is observed when
self-refining based on the EAC and SLSAD consensus functions is applied on λref —see
figures 4.3(b) and 4.3(g), respectively.

As in section 4.2, the consensus clustering solution deemed as the optimal one (across
a majority of experiment runs) by the supraconsensus function is identified by means of a
vertical green dashed line. As regards its performance, we can observe that it manages to
select the highest quality clustering solution in all cases except when self-refining is
based on the ALSAD and KMSAD consensus functions.

1. Given a cluster ensemble E containing l components:

   E = (λ1, λ2, ..., λl)^T

   compute the φ (ANMI) between each one of them and the cluster ensemble, that is:

   φ (ANMI) (E, λk) = (1/l) · Σ_{i=1}^{l} φ (NMI) (λi, λk), ∀k = 1,...,l

2. Select the cluster ensemble component that maximizes its φ (ANMI) with respect to the
   whole ensemble as the reference for the self-refining process:

   λref = arg max_{λk} φ (ANMI) (E, λk)

3. Compute the φ (NMI) between λref and each of the components of the cluster ensemble,
   that is:

   φ (NMI) (λref, λk), ∀k = 1,...,l

4. Generate an ordered list Oφ(NMI) = {λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)l} of cluster
   ensemble components ranked in decreasing order according to their φ (NMI) with respect
   to λref.

5. Create a set of P select cluster ensembles Epi by compiling the first ⌊(pi/100)·l⌉
   components of Oφ(NMI):

   Epi = (λφ(NMI)1, λφ(NMI)2, ..., λφ(NMI)⌊(pi/100)·l⌉)^T

   where pi ∈ (0, 100), ∀i = 1,...,P.

6. Run a (flat or hierarchical) consensus architecture based on a consensus function F on
   Epi, obtaining a self-refined consensus clustering solution λc_pi.

7. Apply the supraconsensus function on the selected cluster ensemble component λref and
   the set of self-refined consensus clustering solutions λc_pi, i.e. select as the final
   consensus solution the one maximizing its φ (ANMI) with respect to the cluster
   ensemble E:

   λc_final = arg max_λ φ (ANMI) (E, λ), λ ∈ {λref, λc_p1, ..., λc_pP}

Table 4.11: Methodology of the cluster ensemble component selection-based consensus
self-refining procedure.
So as to provide the reader with a more comprehensive and quantitative analysis of the
performance of the proposed selection-based consensus self-refining procedure, the
following sections present a separate study of the results yielded by the self-refining
procedure itself and by the supraconsensus function that must, a posteriori, select the
best self-refined consensus clustering solution.

[Figure 4.3: seven boxplot panels, one per consensus function –(a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD– each showing the φ (NMI) values of E, λref and λc_pi for pi ∈ {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}]

Figure 4.3: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble
component λref and the self-refined consensus clustering solutions λc_pi on the Zoo data
collection across all the consensus functions employed. The green dashed vertical line
identifies the clustering solution selected by the supraconsensus function in each
experiment.
4.3.1 Evaluation of the selection-based self-refining process<br />
As far as the evaluation of the selection-based consensus self-refining procedure is
concerned, four analyses have been conducted. For starters, we have measured the percentage
of experiments in which the self-refining procedure yields a better quality clustering
than the cluster ensemble component selected as a reference (i.e. the one maximizing its
φ (ANMI) with respect to the cluster ensemble E, referred to as λref). The results, averaged
across all the unimodal data collections employed in this work, are presented in table 4.12
as a function of the consensus function employed. On average, the self-refining procedure,
when conducted on select cluster ensembles created upon the selection of λref, yields better
clustering solutions in 56% of the conducted experiments.
              Consensus function
CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
54.5   28.2   10.3   69.6   81.8    82      65.4

Table 4.12: Percentage of self-refining experiments in which one of the self-refined consensus
clustering solutions is better than the selected cluster ensemble component reference λref,
averaged across the twelve data collections.

              Consensus function
CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
26.9   9.1    20.8   15.2   15.5    11.4     7.8

Table 4.13: Relative φ (NMI) gain percentage of the top quality self-refined consensus
clustering solutions with respect to the maximum φ (ANMI) cluster ensemble component,
averaged across the twelve data collections.

This figure is notably lower than what was obtained when the select cluster ensemble
creation uses a previously derived consensus clustering solution (it was 83%). This is due
to the fact that the cluster ensemble component selection usually results in a reference
clustering λref of higher quality than the consensus clustering solution λc.
Secondly, in those experiments where self-refined consensus solutions are better than<br />
λref, we have measured the relative degree of improvement achieved (quantified in terms<br />
of relative percentage φ (NMI) increase). The results, presented in table 4.13, show notable
quality improvements, averaging a 15.2% relative φ (NMI) gain across all data sets and
consensus functions. These quality gains are, however, much smaller than those obtained in
the self-refining experiments based on a previously derived consensus clustering solution
(see section 4.2), again due to the superior quality of the reference clustering the
self-refining procedure is based upon.
Next, the maximum and median φ (NMI) components of the cluster ensemble E –referred
to as BEC (best ensemble component) and MEC (median ensemble component), respectively–
are compared to either the top quality self-refined consensus clustering solution or λref
(depending on which has the largest φ (NMI) with respect to the ground truth). As in the
previous section, we have evaluated i) the percentage of experiments where the maximum
φ (NMI) consensus clustering solution attains a higher quality than that of the BEC and MEC,
and ii) the relative percentage φ (NMI) variation between them and the top quality consensus
clustering solution. Once more, all the results presented correspond to an average across
all the experiments conducted on the twelve unimodal data collections.
On one hand, table 4.14 presents the aforementioned magnitudes referred to the best
cluster ensemble component. On average, the highest quality clustering (either λref or one
of the self-refined consensus solutions) is better than the BEC in 14.1% of the conducted
experiments, achieving an average relative percentage φ (NMI) gain of 1.6%. It is important
to notice that these results are pretty similar to those obtained when self-refining is based
on a previously derived consensus clustering solution (see section 4.2), where these
percentages were equal to 10.6% and 1.8%, respectively.
                          Consensus function
                          CSPA   EAC   HGPA   MCLA   ALSAD   KMSAD   SLSAD
% of experiments           9.1   9.1     0    16.7   36.4    27.6      0
relative % φ (NMI) gain    2.1   0.2     –     2.6    1.8     1.3      –

Table 4.14: Percentage of experiments where either the top quality self-refined consensus
clustering solution or λref betters the best cluster ensemble component, and relative φ (NMI)
gain percentage with respect to it, averaged across the twelve data collections.

                          Consensus function
                          CSPA    EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
% of experiments          100     95.4   95    100    100     100      95.4
relative % φ (NMI) gain   118.7  100.7  118.3  114.9  116.1   112.5   107.4

Table 4.15: Percentage of experiments where either the top quality self-refined consensus
clustering solution or λref betters the median cluster ensemble component, and relative
φ (NMI) gain percentage with respect to it, averaged across the twelve data collections.

On the other hand, table 4.15 presents the results of the same experiment, but referred
to the median ensemble component (or MEC). In this case, the selection and self-refining
procedure yields clusterings better than the MEC in 98% of the cases, attaining an average
relative φ (NMI) gain of 112.7%. These figures indicate that selection-based consensus
self-refining yields better results –when compared to the MEC– than its consensus-based
counterpart, where the two aforementioned percentages were 67.7% and 91%, respectively.
To summarize this analysis of the selection-based consensus self-refining proposal,
we can conclude that, firstly, it constitutes a fairly good approach as far as obtaining
a high quality partition of the data is concerned. When compared to the consensus-based
self-refining procedure put forward in section 4.1, it can be observed that, while the
relative quality gains introduced by the self-refining process itself are smaller in
selection-based consensus self-refining, the top quality clustering results obtained are
superior to those yielded by consensus-based self-refining. We believe that both phenomena
are due to the differences in the quality of the clustering solution that constitutes
the starting point of the self-refining process —in the case of consensus-based self-refining,
this reference is a previously derived consensus clustering λc, which is typically a poorer
data partition than the maximum φ (ANMI) cluster ensemble component λref (see the figures in
appendices D.1 and D.2 for a quick visual comparison). This fact makes selection-based
self-refining an even more attractive alternative, all the more so since no previous consensus
process execution is required —with the obvious computational savings this implies.
4.3.2 Evaluation of the supraconsensus process<br />
As regards the performance of the supraconsensus function proposed in (Strehl and Ghosh,
2002), we have conducted a twofold evaluation. On one hand, we have analyzed the percentage
of experiments in which the clustering solution selected via supraconsensus coincides with
the highest quality one. On the other hand, we have measured the relative percentage
φ (NMI) loss derived from a suboptimal consensus solution selection, using the top quality
clustering solution as a reference (i.e. the one that an ideal supraconsensus function
should select). The results of these two experiments are presented in table 4.16, averaged
across all the data collections and for each consensus function.

                          Consensus function
                          CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD
% of experiments          37.5   61.2   84.6   30.6   12.5    23      60
relative % φ (NMI) loss   24.6   10.4   16.8   12.2   10.9    12.7    11.9

Table 4.16: Percentage of experiments in which the supraconsensus function selects the top
quality clustering solution, and relative percentage φ (NMI) losses between the top quality
clustering solution and the one selected by supraconsensus, averaged across the twelve data
collections.
The average accuracy with which the supraconsensus function selects the top quality
self-refined consensus solution is 44.2%, i.e. it manages to select the best solution in
less than half of the experiments conducted. Moreover, this apparent lack of precision
entails an average relative φ (NMI) reduction of 14.2%.
These results reinforce the idea that the supraconsensus function proposed in (Strehl
and Ghosh, 2002) is still far from constituting the most appropriate means for selecting, in
a completely unsupervised manner, the best consensus clustering solution among a set of
candidates, especially if they have very similar qualities. This also explains why the average
level of selection accuracy attained by the supraconsensus function in the selection-based
self-refining scenario is higher than in the consensus-based context (44.2% vs. 29%), as
the φ (NMI) differences between the top quality clustering solution and the remaining ones
are notably larger in the former case than in the latter —in selection-based self-refining,
the selected cluster ensemble component λref is often of notably higher quality than the
self-refined consensus clustering solutions λc_pi, see appendix D.2.
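For reference, the supraconsensus rule of (Strehl and Ghosh, 2002) discussed here reduces to an argmax of φ (ANMI) over the candidate solutions; a minimal sketch follows, assuming scikit-learn's normalized_mutual_info_score as the φ (NMI) implementation:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def supraconsensus(ensemble, candidates):
    """Select the candidate clustering maximizing its average NMI
    (i.e. phi(ANMI)) with respect to the cluster ensemble components."""
    scores = [np.mean([nmi(lam, cand) for lam in ensemble])
              for cand in candidates]
    return candidates[int(np.argmax(scores))]
```

As the experiments above show, this blind criterion tends to find a good candidate rather than the best one whenever the candidates' qualities are close.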
In contrast, similar results are obtained in both the selection-based and consensus-based
self-refining scenarios when the efficiency of the supraconsensus function is measured
in terms of the φ (NMI) loss caused by erroneous selections (i.e. when a clustering solution
other than the highest quality one is selected by the supraconsensus function). In
selection-based self-refining, this relative percentage φ (NMI) loss is 14.2%, compared with
14.9% in the consensus-based self-refining context.
4.4 Discussion<br />
In this chapter, we have put forward two proposals aimed at obtaining a high quality
clustering solution given a cluster ensemble and a similarity measure between partitions,
using consensus clustering and following a fully unsupervised procedure. Together with the
computationally efficient consensus architectures presented in chapter 3, these proposals
constitute the basis for constructing robust consensus clustering systems.
Our proposals are based on applying consensus clustering on a set of clusterings –<br />
compiled in a select cluster ensemble– which are chosen from the cluster ensemble according<br />
to their similarity with respect to an initially available clustering solution. By doing so, we<br />
have experimentally proved that it is very likely to obtain a refined consensus clustering<br />
solution of higher quality than the original one.<br />
The main difference between our two proposals lies in the origin of the clustering employed
as a reference for creating the select cluster ensemble. In the first proposal, referred
to as consensus-based self-refining, this initial clustering is the consensus clustering solution
λc resulting from a previous consensus process run on the whole cluster ensemble. In our
second proposal, the starting point of the refining process is one of the components of
the cluster ensemble, which is selected using an average normalized mutual information
criterion —giving rise to what we call selection-based self-refining.
Unfortunately, the optimal configuration of this self-refining procedure –e.g. the size<br />
of the select cluster ensemble, or the consensus function employed for creating the refined<br />
clustering solutions– is local to each particular experiment. This inconvenience, which<br />
is by no means new in the consensus clustering literature, can be tackled by means of<br />
a supraconsensus function that, in a blind manner, selects the highest quality clustering<br />
solution among a set of candidates created using distinct self-refining configurations. However,
the application of one of the most widespread supraconsensus functions (the one proposed in
(Strehl and Ghosh, 2002)) in our experiments has yielded somewhat disappointing results, as it
is capable of selecting the highest quality clustering solution in only a relatively low
percentage of the experiments conducted. Moreover, alternative supraconsensus functions based on
average normalized mutual information gave rise to even poorer selection accuracies (not
reported here due to the limited interest of the results obtained), which suggests that
further research is needed to devise novel supraconsensus functions capable of satisfying
such a restrictive constraint as the one imposed here —i.e. selecting, among a set of
clustering solutions, the top quality one in a fully unsupervised fashion.
As aforementioned, the concepts of consensus self-refining and supraconsensus functions<br />
are closely related. In fact, supraconsensus is originally presented in (Strehl and Ghosh,<br />
2002) as a means for selecting the best consensus clustering solution among a bunch of<br />
them, created using different consensus functions. Therefore, it seems logical to consider<br />
the application of supraconsensus not on a set of previously derived consensus clustering<br />
solutions, but on the cluster ensemble components themselves, so as to select the highest<br />
quality ones.<br />
Some very recent works have dealt with this issue, such as (Gionis, Mannila, and
Tsaparas, 2007), where the BESTCLUSTERING algorithm is defined as a means for identifying
the individual partition that minimizes the number of disagreements with respect to
the remaining components of the cluster ensemble. Nevertheless, no subsequent consensus
clustering based refinement process is applied on this presumably high quality cluster
ensemble component, a process which, as we have experimentally proved, may bring about
important quality gains.
More recently, the use of clustering solution refinement procedures based on consensus
clustering has been studied in (Fern and Lin, 2008), concurrently with the completion of this
thesis. That work and ours have multiple points in common, such as i) the primary purpose
of avoiding the negative influence of poor clusterings contained in large cluster ensembles<br />
on the quality of the consensus clustering solutions built upon them, ii) the use of one of<br />
the components of the cluster ensemble as the reference partition for creating the reduced<br />
select cluster ensemble, as we propose in selection-based self-refining, iii) the analysis of<br />
the quality of several refined consensus clustering solutions generated upon multiple select<br />
cluster ensembles of distinct sizes, and iv) the use of normalized mutual information as the<br />
guiding principle for comparing clusterings.<br />
However, there also exist several differences between both works, as in (Fern and Lin,
2008) i) refining is not presented as a means for improving the quality of a previously
derived consensus clustering solution, but as a means for obtaining a good quality one
upon a select cluster ensemble resulting from discarding those poor components that may<br />
induce a quality loss in it, ii) the criteria employed for choosing the components included<br />
in the select cluster ensemble consider both clustering quality and diversity, iii) clustering<br />
refinement results obtained by a single consensus function (CSPA) are reported, and iv) no<br />
supraconsensus methodology for selecting the best refined consensus clustering is studied.<br />
To conclude, we would like to highlight again the significant quality improvements that
can be obtained by means of self-refining consensus procedures. However, it is also important
to be aware that we cannot take full advantage of these gains unless a well-performing
supraconsensus methodology allows selecting the top quality self-refined clustering solution
with a high level of confidence. For this reason, in our opinion, devising such a technique is
a matter of paramount importance as regards the further research to be conducted in this
particular field.
In the future, it would be interesting to analyze how consensus self-refining procedures
perform if the cluster ensemble selection process were based on clustering similarity measures
other than normalized mutual information. Furthermore, we also intend to study the
possibility of creating the select cluster ensemble by including all those clusterings
exceeding a certain φ (ANMI) threshold with respect to the reference clustering, instead of
selecting a percentage p of the cluster ensemble components.
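With a single reference clustering, such a φ (ANMI) threshold reduces to a plain φ (NMI) threshold against λref; this hypothetical variant could be sketched as follows (scikit-learn NMI assumed, threshold value purely illustrative):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def select_by_threshold(ensemble, reference, threshold=0.5):
    """Hypothetical variant: keep every ensemble component whose NMI
    with the reference clustering exceeds a fixed threshold, instead
    of keeping a fixed percentage p of the ranked components."""
    return [lam for lam in ensemble if nmi(reference, lam) > threshold]
```

Unlike the percentage-based scheme, the size of the resulting select ensemble would then adapt to how many components actually resemble the reference.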
4.5 Related publications<br />
As mentioned earlier, the aim of the proposed self-refining consensus clustering approach
is to obtain partitions which are robust to the indeterminacies inherent to the clustering
problem. This has been the main driver of our research since the early days, which is
reflected in several publications at international conferences and in national journals.
The application focus of these works was document clustering, so they were mainly published
at Information Retrieval and Natural Language Processing forums. In none of these works,
however, are self-refining procedures included as a means for obtaining improved quality
clustering results, so our proposals in this specific area remain, for the moment, unpublished.
The first publication regarding robust document clustering based on cluster ensembles
was presented as a poster at the SIGIR 2006 conference held in Seattle. The details of this
publication follow.
Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />
Title: Feature Diversity in Cluster Ensembles for Robust Document Clustering<br />
In: Proceedings of the 29th ACM SIGIR Conference<br />
Pages: 697-698<br />
Year: 2006<br />
Abstract: The performance of document clustering systems depends on employing<br />
optimal text representations, which are not only difficult to determine beforehand,<br />
but also may vary from one clustering problem to another. As a first step towards<br />
building robust document clusterers, a strategy based on feature diversity and cluster<br />
ensembles is presented in this work. Experiments conducted on a binary clustering<br />
problem show that our method is robust to near-optimal model order selection and<br />
able to detect constructive interactions between different document representations in<br />
the test bed.<br />
A subsequent extension of this work was published in the Journal of the Spanish Society
for Natural Language Processing (Procesamiento del Lenguaje Natural) after its presentation
at the SEPLN 2006 conference held in Zaragoza.
Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />
Title: Robust Document Clustering by Exploiting Feature Diversity in Cluster Ensembles<br />
In: Journal of the Spanish Society for Natural Language Processing (Procesamiento
del Lenguaje Natural)
Volume: 37<br />
Pages: 169-176
Year: 2006<br />
Abstract: The performance of document clustering systems is conditioned by the use<br />
of optimal text representations, which are not only difficult to determine beforehand,<br />
but also may vary from one clustering problem to another. This work presents an<br />
approach based on feature diversity and cluster ensembles as a first step towards<br />
building document clustering systems that behave robustly across different clustering<br />
problems. Experiments conducted on three binary clustering problems of increasing<br />
difficulty show that the proposed method is i) robust to near-optimal model order<br />
selection, and ii) able to detect constructive interactions between different document<br />
representations, thus being capable of yielding consensus clusterings superior to any<br />
of the individual clusterings available.<br />
Last, a global analysis regarding clustering indeterminacies and how they can be overcome
via cluster ensembles was presented at the ICA 2007 conference as an oral presentation.
Authors: Xavier Sevillano, Germán Cobo, Francesc Alías and Joan Claudi Socoró<br />
Title: Text Clustering on Latent Thematic Spaces: Variants, Strengths and Weaknesses
In: Proceedings of the 7th International Conference on Independent Component Analysis
and Signal Separation
Publisher: Springer<br />
Series: Lecture Notes in Computer Science<br />
Volume: 4666<br />
Editors: Mike E. Davies, Christopher J. James, Samer A. Abdallah and Mark D.<br />
Plumbley<br />
Pages: 794-801<br />
Year: 2007<br />
Abstract: Deriving a thematically meaningful partition of an unlabeled document<br />
corpus is a challenging task. In this context, the use of document representations<br />
based on latent thematic generative models can lead to improved clustering. However,<br />
determining a priori the optimal document indexing technique is not straightforward,<br />
as it depends on the clustering problem faced and the partitioning strategy adopted.<br />
So as to overcome this indeterminacy, we propose deriving a single consensus labeling<br />
upon the results of clustering processes executed on several document representations.<br />
Experiments conducted on subsets of two standard text corpora evaluate distinct<br />
clustering strategies based on latent thematic spaces and highlight the usefulness<br />
of consensus clustering to overcome the indeterminacy regarding optimal document<br />
indexing.<br />
Chapter 5<br />
Multimedia clustering based on<br />
cluster ensembles<br />
As already outlined in section 1.3, multimodality is an increasingly noticeable trend as<br />
far as the nature of data is concerned. Given the growing ubiquity of multimedia data,<br />
the derivation of robust clustering strategies for organizing the ever larger multimodal<br />
repositories available has become a field of interest in itself.<br />
However, it is important to take into account that the direct application of classic<br />
clustering algorithms for partitioning multimedia data collections may turn out to be suboptimal.<br />
The reason for this is twofold: firstly, the usual indeterminacies that condition<br />
the performance of clustering algorithms are multiplied by the existence of several data<br />
modalities. This means that the user must not only decide which object representation<br />
and clustering algorithm are expected to yield the best partition of the data —i.e. the ones<br />
best describing its natural group structure—, but also on which of the m data modalities<br />
clustering is to be conducted, as classic clustering algorithms are designed to operate on<br />
unimodal data.<br />
And secondly, notice that clustering multimedia data using a single modality entails<br />
ignoring the presumable positive synergies that may exist between the different modalities,<br />
which could be of interest for deriving a better partition of the data. The only way a<br />
classic clustering algorithm can take advantage of the possible benefits of multimodality<br />
consists in creating multimodal representations of the objects, conducting an early fusion<br />
of the features corresponding to distinct modalities prior to clustering. That is, clustering<br />
is conducted on a single, artificially generated multimodal representation of the objects<br />
created by the combination of the m original modalities. However, feature fusion may or<br />
may not be beneficial as regards the quality of the clustering results, as reported in appendix<br />
B.2, which turns the application of this strategy into a further indeterminacy to deal with.<br />
For these reasons, in this chapter we propose applying consensus clustering as a means<br />
for clustering multimedia data robustly, as it provides a natural way for combining i) the<br />
results of clustering processes run on each one of the m distinct modalities —thus conducting<br />
a late fusion of modalities, and ii) the partitions obtained upon the multimodal data<br />
representation derived by the early fusion of the features of the m modalities. By doing so,<br />
[Figure: block diagram showing the m modality submatrices X1, ..., Xm and their feature-level fusion, each followed by an object representation + clustering stage, feeding the multimodal cluster ensemble E into a flat or hierarchical consensus architecture and a consensus self-refining stage that outputs λc^final.]<br />
Figure 5.1: Block diagram of the proposed multimodal consensus clustering system.<br />
we can take advantage of both modality fusion approaches, which can be of help to reveal<br />
the group structure of the data.<br />
The proposed multimodal consensus clustering approach follows the schematic block<br />
diagram of figure 5.1, and consists of the following steps: the generation of the multimodal<br />
cluster ensemble E, plus the application of a computationally efficient consensus architecture<br />
that, followed by a consensus-based self-refining procedure, yields the final partition of the<br />
multimodal data collection subject to clustering, λc^final.<br />
In this chapter, the phases that constitute the multimodal consensus clustering process<br />
are described and contextualized in the framework of the experiments conducted in this<br />
work. For starters, section 5.1 presents the strategies followed in the creation of the multimodal<br />
cluster ensemble. Next, section 5.2 describes the particularities of the consensus<br />
architecture and the self-refining procedure that give rise to the multimodal data partition.<br />
Last, the results of the multimedia consensus clustering experiments run in this thesis<br />
are presented in section 5.3, and the chapter closes with the conclusions discussed in<br />
section 5.4.<br />
5.1 Generation of multimodal cluster ensembles<br />
The key point for conducting a multimodal consensus clustering process lies in the creation<br />
of a cluster ensemble that contains both clusterings derived on each data modality separately<br />
and on fused modalities. In this section, we describe a general procedure for creating a<br />
multimodal cluster ensemble E upon a multimedia data collection.<br />
Without loss of generality, let us assume that the multimodal data collection subject to<br />
clustering contains n objects represented by numeric attributes. Thus, the whole data set<br />
can be formally represented by means of a d × n real-valued matrix X, where each object<br />
is represented by means of a d-dimensional column vector xi, ∀i ∈ [1,n].<br />
Moreover, suppose that our multimedia data collection is composed of m modalities.<br />
That is, each of the n objects is simultaneously represented by m disjoint sets of real-valued<br />
attributes –each one of which corresponds to one of the m modalities– of sizes d1, d2, ...,<br />
dm, so that d1 + d2 + ... + dm = d. Thus, the multimodal data set matrix X can be<br />
decomposed into m submatrices X1, X2, ..., Xm. Each one of these matrices Xi (of size<br />
di × n, ∀i ∈ [1,m]) represents all the objects in the data set according to each one of the<br />
m modalities it contains —see figure 5.1.<br />
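This row-wise decomposition of X into modality submatrices can be sketched with NumPy; the sizes d1 = 4, d2 = 3 and n = 6 below are illustrative assumptions, not values taken from the thesis:<br />

```python
import numpy as np

# Toy multimodal data set: n = 6 objects and m = 2 modalities with
# d1 = 4 and d2 = 3 features, so d = d1 + d2 = 7 (illustrative sizes).
d_sizes = [4, 3]                       # d_1, ..., d_m
n = 6
X = np.random.rand(sum(d_sizes), n)    # d x n matrix, one column per object

# Split X row-wise into the m modality submatrices X_1, ..., X_m,
# each of size d_i x n, mirroring the decomposition described above.
bounds = np.cumsum(d_sizes)[:-1]
X1, X2 = np.split(X, bounds, axis=0)
```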
Given this scenario, a first subset of the clusterings that will constitute the multimodal<br />
cluster ensemble E are generated through the application, upon each submatrix¹<br />
Xi (∀i ∈ [1,m]), of f mutually crossed diversity factors dfj, ∀j ∈ [1,f]. If the same set of<br />
diversity factors is applied on the m modalities, the number of clusterings generated in this<br />
first subset is equal to:<br />
l1 mod = m|df1||df2| ...|dff| (5.1)<br />
where the |·| operator denotes the cardinality of a set.<br />
Secondly, another subset of clusterings is created by the application of a set of diversity<br />
factors (not necessarily equal to the previous one) upon a fused multimodal representation<br />
of the data set. This representation can be generated by means of any early feature fusion<br />
process, such as the application of a projection-based object representation technique on<br />
the d-dimensional vectors resulting from the concatenation of the features corresponding<br />
to the m modalities (<strong>La</strong> Cascia, Sethi, and Sclaroff, 1998; Zhao and Grosky, 2002; Benitez<br />
and Chang, 2002; Snoek, Worring, and Smeulders, 2005; Gunes and Piccardi, 2005). This<br />
second subset of clusterings will be referred to using the symbol m mod, as they are obtained<br />
upon an object representation that combines the m modalities into a single one.<br />
Assuming, for simplicity, that the same set of diversity factors are employed for creating<br />
the subsets of unimodal and multimodal clusterings, the number of multimodal partitions<br />
created becomes:<br />
lm mod = |df1||df2| ...|dff| (5.2)<br />
Finally, the mere compilation of the unimodal and multimodal partitions constitutes the<br />
multimedia cluster ensemble E, the size of which is equal to:<br />
l = l1 mod + lm mod = (m + 1)|df1||df2| ...|dff| (5.3)<br />
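The ensemble sizes given by equations (5.1)-(5.3) can be computed as follows; the cardinalities in the example are made up for illustration, not those used in the experiments of this chapter:<br />

```python
# Ensemble sizes per equations (5.1)-(5.3) for m modalities and
# f mutually crossed diversity factors of the given cardinalities.
from functools import reduce
from operator import mul

def ensemble_sizes(m, df_cardinalities):
    """Return (l_1mod, l_mmod, l) for the unimodal subset, the
    fused-modality subset and the full multimodal cluster ensemble."""
    prod = reduce(mul, df_cardinalities, 1)
    l_1mod = m * prod    # eq. (5.1): clusterings per separate modality
    l_mmod = prod        # eq. (5.2): clusterings on the fused features
    return l_1mod, l_mmod, l_1mod + l_mmod   # eq. (5.3): (m + 1) * prod
```

For instance, with m = 2 modalities and two diversity factors of cardinalities 4 and 10, the ensemble holds (2 + 1) · 4 · 10 = 120 partitions.<br />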
As regards the creation of the multimodal cluster ensembles E on the four multimedia<br />
data collections employed in this work (see appendix A.2.2 for a description), three diversity<br />
factors have been applied: clustering algorithms (dfA), object representations (dfR)<br />
and object representation dimensionalities (dfD). In the following paragraphs, a detailed<br />
description of these diversity factors and their role in the cluster ensemble creation process<br />
are presented.<br />
Starting with the original object features, which constitute the baseline representation,<br />
additional object representations are derived by means of feature extraction based on<br />
¹ As these clusterings are created by running multiple clustering processes separately on each modality,<br />
we refer to them using the symbol 1 mod, which stands for “one modality”.<br />
Data set      Modality              dfD range     |dfD|<br />
CAL500        audio                 [30,120]      10<br />
              text                  [30,70]       5<br />
              audio + text          [30,200]      18<br />
IsoLetters    speech                [100,600]     11<br />
              image                 [3,16]        14<br />
              speech + image        [100,600]     11<br />
InternetAds   object                [60,120]      7<br />
              collateral            [100,1000]    19<br />
              object + collateral   [100,1000]    19<br />
Corel         image                 [50,350]      7<br />
              text                  [100,450]     8<br />
              image + text          [100,800]     15<br />
Table 5.1: Range and cardinality of the dimensional diversity factor dfD per modality for<br />
each one of the four multimedia data sets.<br />
Principal Component Analysis, Independent Component Analysis, Random Projection and<br />
Non-negative Matrix Factorization (this last representation can only be applied when the<br />
original object features are non-negative). Thus, the total number of object representations<br />
is either |dfR| = 4 (for the CAL500 and IsoLetters collections) or |dfR| = 5 (for the<br />
InternetAds and Corel data sets). It is important to notice that these feature extraction<br />
based object representations are derived for each one of the m = 2 modalities these data<br />
sets contain, as well as for the multimodal baseline representation created by concatenating<br />
the features of both modes.<br />
For each feature extraction based object representation and modality, a set of distinct<br />
representations are created by conducting a sweep of dimensionalities, which constitutes<br />
the second diversity factor, dfD. Quite obviously, its cardinality depends on the data set<br />
and modality. The range and cardinality of dfD per modality corresponding to each one<br />
of the four multimedia data sets employed in the experimental section of this chapter are<br />
presented in table 5.1.<br />
And finally, the clusterings that make up the multimodal cluster ensemble E are created<br />
by running |dfA| = 28 clustering algorithms from the CLUTO clustering package<br />
(see appendix A.1) on each distinct object representation. As a result, a total of l =<br />
2856, 3108, 5124 and 3444 partitions are obtained for the CAL500, IsoLetters, InternetAds<br />
and Corel multimodal data collections, respectively. Notice that, in our case, diversity<br />
factors are not mutually crossed, as the baseline object representations lack dimensional<br />
diversity. Therefore, the generic expressions of equations (5.1) to (5.3) do not apply in our<br />
case.<br />
5.2 Self-refining multimodal consensus architecture<br />
Once the multimodal cluster ensemble E is built, the next step consists in deriving a consensus<br />
clustering solution λc upon it. Recall that, according to the conclusions drawn in<br />
chapter 3, it may be more computationally efficient to tackle this task by means of a flat or<br />
a hierarchical consensus architecture depending on the size of the cluster ensemble.<br />
As far as this latter issue is concerned, it is important to notice that, if the cluster<br />
ensemble generation process proposed in section 5.1 is followed, multimodality induces an<br />
important increase in the cluster ensemble size. Indeed, if a set of f mutually crossed<br />
diversity factors of cardinalities |df1|, |df2|, ..., |dff | are applied on a particular unimodal<br />
data collection, we obtain a cluster ensemble of size:<br />
lunimodal = |df1||df2| ...|dff| (5.4)<br />
However, if the same data collection was multimodal and contained m modalities, the<br />
application of the previously presented multimedia cluster ensemble creation procedure –<br />
using exactly the same diversity factors applied on the unimodal version of the data set–<br />
would yield an ensemble of size:<br />
lmultimodal = (m + 1)|df1||df2| ...|dff| (5.5)<br />
That is, multimodality increases the size of cluster ensembles by a factor of (m + 1)<br />
—i.e. in the minimally multimodal case, m = 2, the cluster ensembles obtained are three<br />
times larger than those that would be created in a unimodal scenario. For this reason, and<br />
allowing for the conclusions of chapter 3, it seems that hierarchical consensus architectures<br />
are likely to be the most computationally efficient implementation alternative for deriving<br />
consensus clustering solutions upon multimodal cluster ensembles —however, the running<br />
time estimation process proposed in chapter 3 constitutes a valid tool for selecting a priori,<br />
and with a high degree of precision, the computationally optimal consensus<br />
architecture for solving a specific consensus clustering problem.<br />
Regardless of the consensus architecture employed, it will output a consensus clustering<br />
solution λc. The next and final step consists in applying the consensus self-refining procedure<br />
proposed in section 4.1, so as to obtain a presumably higher quality refined consensus<br />
clustering solution λc^final. Quite obviously, in a multimedia clustering scenario, the self-refining<br />
process will be based on selecting a subset of the components of the multimodal<br />
cluster ensemble E for creating a select cluster ensemble. Otherwise, it follows exactly the<br />
steps presented in table 4.1.<br />
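As a rough sketch of this selection step (a simplification of the procedure in table 4.1, under the assumption that components are ranked by their φ(NMI) with respect to the non-refined consensus λc and the top p% are retained):<br />

```python
# Sketch of the select-ensemble step of consensus self-refining:
# rank the ensemble components by similarity (NMI) to the
# non-refined consensus clustering and keep the top p%. The refined
# consensus would then be derived by re-running any consensus
# function (CSPA, EAC, ...) on the returned subset.
from sklearn.metrics import normalized_mutual_info_score as nmi

def select_ensemble(ensemble, lambda_c, p):
    """Return the p% of ensemble components most similar to lambda_c.

    ensemble : list of label vectors; lambda_c : label vector;
    p : percentage in (0, 100].
    """
    ranked = sorted(ensemble, key=lambda lam: nmi(lam, lambda_c),
                    reverse=True)
    keep = max(1, round(len(ranked) * p / 100))
    return ranked[:keep]
```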
In this work, a three-stage deterministic hierarchical consensus architecture (or DHCA<br />
for short) has been applied for deriving the consensus clustering solution λc upon our<br />
multimodal cluster ensembles. This is due to the fact that we have deemed multimodality<br />
as a diversity factor (denoted as dfM) in itself. Moreover, in contrast to the procedure<br />
followed in the previous chapters, the multimodal cluster ensemble E considered in each<br />
individual experiment conducted in this chapter only contains clusterings created by a<br />
single clustering algorithm, although, as mentioned earlier, |dfA| = 28 of them have been<br />
employed. In other words, for each data collection, experiments on |dfA| = 28 different<br />
cluster ensembles have been conducted. The components of each one of these ensembles<br />
only differ in the representational (dfR), dimensional (dfD) and multimodal (dfM) diversity<br />
factors, while having been created by means of a single clustering algorithm.<br />
According to the conclusions drawn in section 3.3, the DHCA variant that minimizes the<br />
number of executed consensus processes and the running time of its serial implementation<br />
is the one in which consensus processes are sequentially run on the distinct diversity factors<br />
that make up the cluster ensemble E arranged in decreasing cardinality order.<br />
Therefore, it is necessary to determine the cardinality of the three diversity factors so<br />
as to devise the computationally optimal DHCA topology. As described in section 5.1, the<br />
cardinality of the representational diversity factor is either |dfR| = 4 or |dfR| = 5, depending<br />
on the data set. The inspection of table 5.1 reveals that the dimensional diversity factor<br />
adopts a wide range of cardinalities, but all of them fall in the [5, 19] interval. Last, the newly<br />
introduced modality diversity factor entails not only the m = 2 original data modalities,<br />
but also the multimodal one resulting from the feature-level fusion of the former —thus, its<br />
cardinality is equal to |dfM| = 3. For this reason, the specific DHCA variant implemented is<br />
referred to as DRM (as |dfD| > |dfR| > |dfM|), and consensus will be sequentially conducted<br />
across dimensionalities, representations and modalities at each of its three stages.<br />
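The control flow of the three-stage DRM architecture can be sketched as follows. The `consensus` stand-in is a naive per-object majority vote that assumes label-aligned partitions; real consensus functions (CSPA, MCLA, ...) also solve the label correspondence problem, so this is only meant to make the staging explicit:<br />

```python
# Sketch of the three-stage DRM hierarchical consensus architecture:
# partitions lam[i][j][k], indexed by dimensionality i, representation j
# and modality k, are merged stage by stage in decreasing cardinality
# order (|dfD| >= |dfR| >= |dfM|).
from collections import Counter

def consensus(partitions):
    """Toy per-object majority vote over label vectors (placeholder;
    ignores the label correspondence problem)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*partitions)]

def drm_dhca(lam):
    """lam[i][j][k] -> final intermodal consensus labeling."""
    n_i, n_j, n_k = len(lam), len(lam[0]), len(lam[0][0])
    # Stage 1: consensus across dimensionalities i -> lambda_{D,j,k}
    stage1 = [[consensus([lam[i][j][k] for i in range(n_i)])
               for k in range(n_k)] for j in range(n_j)]
    # Stage 2: consensus across representations j -> lambda_{D,R,k}
    stage2 = [consensus([stage1[j][k] for j in range(n_j)])
              for k in range(n_k)]
    # Stage 3: consensus across modalities k -> intermodal lambda_c
    return consensus(stage2)
```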
For illustration purposes, figure 5.2 depicts a toy DHCA DRM variant applied on a<br />
27-component multimodal cluster ensemble created upon dimension, representation and<br />
modality diversity factors, all of cardinality equal to 3. In its first stage, consensus processes are conducted<br />
across dimensionalities, thus yielding a first set of intermediate consensus clusterings<br />
denoted as λD,j,k, where j and k index object representations and modalities, respectively.<br />
Subsequently, the second consensus stage executes consensus processes across the distinct<br />
object representations, giving rise to a second set of partial consensus clustering solutions,<br />
denoted as λD,R,k, where k designates modalities. Assuming that two of these modalities<br />
are truly original modes, and that the third one is created by the feature-level fusion of the<br />
former, the clusterings output by the second stage of the DHCA are also denoted as λc^mod1,<br />
λc^mod2 and λc^mod1+mod2, respectively. Finally, the execution of a consensus process on<br />
these three intermediate clusterings yields the final consensus clustering solution λc, which<br />
is referred to as intermodal hereafter.<br />
Notice that conducting consensus by means of this DHCA variant instead of a flat<br />
or a random hierarchical consensus architecture is especially interesting from an analytic<br />
viewpoint, as it makes it possible to compare the effect of the consensus process on each<br />
of the three modalities, by simply evaluating the three intermediate consensus clusterings<br />
input to the last consensus stage –i.e. λc^mod1, λc^mod2 and λc^mod1+mod2 in figure 5.2.<br />
5.3 Multimodal consensus clustering results<br />
In this section, the results of the proposed multimodal consensus clustering experiments<br />
conducted in this work are described. The design of these experiments has followed the<br />
rationale described next.<br />
– What do we want to measure?<br />
i) In section 5.3.1, we evaluate the quality of the partial consensus clusterings obtained<br />
on each separate modality and on the one resulting from multimodal<br />
feature-level fusion (i.e. λc^mod1, λc^mod2 and λc^mod1+mod2, respectively), plus the<br />
intermodal clustering λc resulting from applying the consensus process for combining<br />
the three aforementioned consensus clustering solutions. Moreover, we<br />
also analyze how the unimodal, multimodal and intermodal consensus clusterings<br />
compare to each other and to the components of the cluster ensembles they<br />
are created upon.<br />
[Figure: three-stage DHCA diagram; the λi,j,k ensemble components are merged by consensus functions across dimensionalities into λD,j,k, then across representations into λD,R,k (i.e. λc^mod1, λc^mod2 and λc^mod1+mod2), and finally into the intermodal consensus λc.]<br />
Figure 5.2: Deterministic hierarchical consensus architecture DRM variant operating on<br />
a cluster ensemble created using three diversity factors: three dimensionalities (|dfD| = 3),<br />
three object representations (|dfR| = 3) and three modalities (|dfM| = 3). The cluster ensemble<br />
component obtained upon the jth object representation with the ith dimensionality on the<br />
kth modality is denoted as λi,j,k. Consensus clusterings are sequentially created across the dimension,<br />
representation and modality diversity factors (dfD, dfR and dfM, respectively).<br />
ii) In section 5.3.2, we analyze the quality of the self-refined intermodal consensus<br />
clusterings λc^pi obtained upon select cluster ensembles containing a percentage<br />
pi of the partitions of the original multimodal cluster ensemble E. Moreover,<br />
we also evaluate the performance of the supraconsensus function as a means<br />
for selecting, in a fully unsupervised manner, the top quality (either refined or<br />
non-refined) consensus clustering.<br />
– How do we measure it?<br />
i) The quality of the unimodal, multimodal and intermodal consensus clusterings<br />
obtained is evaluated in terms of their φ (NMI) with respect to the ground truth of<br />
the data set. Inter-consensus clustering comparisons are conducted in terms of<br />
their relative percentage φ (NMI) differences, and the percentage of experiments in<br />
which one of them attains higher φ (NMI) scores than the rest. Moreover, comparisons<br />
between the consensus clusterings and their associated cluster ensembles<br />
are made in terms of relative percentage φ (NMI) differences, the percentage of<br />
cluster ensemble components that attain φ (NMI) scores higher than that of the<br />
evaluated consensus clusterings, and the percentage of experiments and relative<br />
percentage φ (NMI) differences between them and the cluster ensemble components<br />
of maximum and median quality.<br />
ii) The quality of the self-refined consensus clusterings is measured in terms of their<br />
φ (NMI) with respect to the ground truth of the data set. We also compute the percentage<br />
of experiments in which the top quality self-refined consensus clustering<br />
attains a higher φ (NMI) score than its non-refined counterpart, besides the relative<br />
percentage φ (NMI) differences between them and the maximum and median<br />
cluster ensemble components. On the other hand, the ability of the supraconsensus<br />
function is evaluated by the computation of the percentage of experiments in<br />
which it succeeds in selecting the highest quality consensus clustering available<br />
as the final partition of the data, besides the relative percentage φ (NMI) losses<br />
suffered when it does not.<br />
– How are the experiments designed? Just like in all the experimental sections<br />
of this thesis, consensus clusterings have been derived by means of the seven consensus<br />
functions described in appendix A.5, namely CSPA, EAC, HGPA, MCLA,<br />
ALSAD, KMSAD and SLSAD. By doing so, it is possible to compare their performances<br />
across all the consensus clustering problems conducted. It is important to<br />
note that, in this chapter, we are solely interested in analyzing the quality of the consensus<br />
clustering solutions obtained, as the main purpose of the proposed multimodal<br />
consensus clustering approach is achieving robustness to clustering indeterminacies,<br />
which, as aforementioned, are increased due to multimodality. For brevity reasons,<br />
only the results corresponding to cluster ensembles based on four distinct clustering<br />
algorithms are graphically displayed. These four clustering algorithms –namely<br />
agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2 – cover all the clustering<br />
approaches encompassed in the CLUTO clustering package (see appendix A.1 for a<br />
description). However, when global analyses are presented, the results obtained on<br />
the |dfA| = 28 multimodal cluster ensembles are considered.<br />
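Among the consensus functions named above, evidence accumulation (EAC) admits a compact sketch: count how often each pair of objects is co-clustered across the ensemble, then cut the resulting co-association matrix with single-linkage hierarchical clustering. This is a generic reading of the EAC idea assuming SciPy is available, not the exact implementation used in this thesis:<br />

```python
# Sketch of evidence accumulation (EAC) consensus: a co-association
# matrix accumulates pairwise co-clustering frequencies over the
# ensemble, and its complement is cut by single-linkage clustering.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def eac_consensus(ensemble, n_clusters):
    """ensemble: l partitions x n objects (label vectors);
    returns a consensus labeling with n_clusters clusters."""
    ensemble = np.asarray(ensemble)
    l, n = ensemble.shape
    # coassoc[a, b] = fraction of partitions grouping objects a and b together
    coassoc = sum((p[:, None] == p[None, :]).astype(float)
                  for p in ensemble) / l
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="single")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```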
– How are results presented?<br />
i) The quality of the unimodal, multimodal and intermodal consensus clusterings<br />
obtained is presented by means of φ (NMI) score boxplots. Recall that non-overlapping<br />
box notches indicate that the medians of the compared φ (NMI) scores<br />
differ at the 5% significance level, which allows a quick inference of the statistical<br />
significance of the results. Quantitative performance evaluation measures<br />
are presented in the shape of numeric tables showing the average values of the<br />
magnitudes analyzed (mainly, percentage of experiments and relative percentage<br />
φ (NMI) differences).<br />
ii) Results are presented by means of boxplot charts of the φ (NMI) values corresponding<br />
to the consensus self-refining process. In particular, each subfigure depicts<br />
–from left to right– the φ (NMI) values of: i) the components of the cluster ensemble<br />
E, ii) the non-refined consensus clustering solution (i.e. the one resulting from<br />
the application of either a hierarchical or a flat consensus architecture, denoted<br />
as λc), and iii) the self-refined consensus labelings λc^pi obtained upon select<br />
cluster ensembles created using percentages pi = {2, 5, 10, 15, 20, 30, 40, 50, 75}.<br />
Moreover, the consensus clustering solution deemed as the optimal one (across<br />
a majority of experiment runs) by the supraconsensus function is identified by<br />
means of a vertical green dashed line. In addition, the quality comparisons between<br />
the self-refined consensus clusterings, the non-refined consensus clusterings and<br />
the cluster ensemble components are presented by means of tables showing the<br />
average values of the measured magnitudes on each one of the four multimodal<br />
data collections employed in this work.<br />
– Which data sets are employed? Only the multimodal consensus clustering results<br />
obtained on the IsoLetters data collection are described in detail in this section —a<br />
thorough portrayal of the experiments corresponding to the three other multimedia<br />
data collections employed in this work (CAL500, InternetAds and Corel) can be found<br />
in appendix E. However, the global evaluation of our proposals encompasses the<br />
results obtained on the four multimodal data collections, reporting the average<br />
values across them.<br />
5.3.1 Consensus clustering per modality and across modalities<br />
This section is devoted to the evaluation of the intermediate (i.e. unimodal and multimodal)<br />
and final (that is, intermodal) consensus clustering solutions yielded by the proposed multimodal<br />
deterministic hierarchical consensus architecture applied on the IsoLetters data set.<br />
Recall that the objects contained in this data collection are instances of the letters of the<br />
English alphabet expressed in two original modalities, speech and image.<br />
We start with a visual evaluation of the quality of the aforementioned consensus clusterings,<br />
measured in terms of their normalized mutual information φ (NMI) with respect to<br />
the ground truth of the data set (the closer its value to unity the higher the quality of the<br />
corresponding clustering). Moreover, we also contrast them with the components of the<br />
multimodal cluster ensemble they are created upon.<br />
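For reference, φ (NMI) can be computed directly from label co-occurrence counts. The sketch below uses the √(H(A)H(B)) normalization, which is an assumption about the exact variant employed in the thesis:<br />

```python
# Sketch of the phi(NMI) quality measure: normalized mutual information
# between a clustering and the ground-truth labels; 1 for a perfect
# match (up to label permutation), 0 for an unrelated labeling.
from collections import Counter
from math import log, sqrt

def phi_nmi(labels_a, labels_b):
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label counts
    mi = sum((c / n) * log(n * c / (pa[a] * pb[b]))
             for (a, b), c in pab.items())
    # Marginal entropies used for normalization
    ha = -sum((c / n) * log(c / n) for c in pa.values())
    hb = -sum((c / n) * log(c / n) for c in pb.values())
    if ha == 0 or hb == 0:          # degenerate single-cluster case
        return 1.0 if ha == hb else 0.0
    return mi / sqrt(ha * hb)
```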
For starters, figure 5.3 depicts four boxplot charts corresponding to the φ (NMI) scores of<br />
the cluster ensemble components and the unimodal, multimodal and intermodal consensus<br />
clusterings, in the case the cluster ensemble compiles partitions output by the agglo-cos-upgma<br />
clustering algorithm. In each one of the boxplots, the φ (NMI) values of the components<br />
of the cluster ensemble E and of the consensus clusterings yielded by each one of the<br />
seven consensus functions employed in this work across ten experiment runs are shown. In<br />
particular, figures from 5.3(a) to 5.3(c) depict the quality of the intermediate unimodal and<br />
multimodal consensus clustering solutions λc^image, λc^speech and λc^image+speech, respectively.<br />
Last, figure 5.3(d) shows the boxplots corresponding to the intermodal consensus clustering<br />
λc, resulting from the combination of the previous three.<br />
There are several observations worth making in view of these results. Firstly, it is worth<br />
noting that consensus clusterings of quite diverse quality are obtained depending on the consensus<br />
function employed. Clearly, EAC and HGPA yield the worst results, whereas the five other<br />
consensus functions tend to yield better consensus partitions, often being able to compete<br />
with the highest quality cluster ensemble components.<br />
Secondly, focusing on the unimodal consensus clustering processes solely (figures 5.3(a)<br />
and 5.3(b)), notice the substantial differences between the φ (NMI) values of the cluster ensemble<br />
components corresponding to the image and speech modalities. Undoubtedly, this<br />
[Figure: four φ (NMI) boxplot panels –(a) Modality 1 (image), (b) Modality 2 (speech), (c) Multimodal (image+speech), (d) Intermodal– each comparing the components of the cluster ensemble E with the consensus clusterings yielded by CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD.]<br />
Figure 5.3: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the agglo-cos-upgma algorithm on the IsoLetters data set.<br />
is a determining factor in that the quality of the consensus clustering solutions λc^image is notably higher than that of λc^speech, with an average relative percentage φ(NMI) difference between both modalities of 57.4%. That is, although consensus clustering manages to yield reasonable quality results in both modalities, selecting a single modality for clustering a multimedia data collection can be a highly suboptimal option, as it can severely limit the quality of the obtained partition.
Thirdly, when consensus clustering is conducted on the multimodal modality resulting from feature-level fusion (figure 5.3(c)), even better consensus clustering results are obtained (in average relative φ(NMI) terms, 13.6% better than those obtained on the image modality), which indicates the existence of positive synergies between both modalities on this data collection.
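The feature-level (early) fusion mentioned above amounts to concatenating, for each object, its feature vectors from the individual modalities before clustering. A minimal sketch follows; the function name and the toy feature values are illustrative.

```python
def feature_level_fusion(modality_features):
    """Early fusion: build the multimodal representation of each object by
    concatenating its per-modality feature vectors (e.g. image + speech)."""
    fused = []
    for vectors in zip(*modality_features):  # one tuple of vectors per object
        row = []
        for v in vectors:
            row.extend(v)
        fused.append(row)
    return fused

# e.g. two objects, image features of length 3 and speech features of length 2
image = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speech = [[1.0, 2.0], [3.0, 4.0]]
fused = feature_level_fusion([image, speech])  # each fused row has 3 + 2 values
```

In practice the per-modality features would typically be normalized before concatenation, so that no single modality dominates the fused representation.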
And finally, if the λc^image, λc^speech and λc^image+speech consensus clustering solutions are combined (figure 5.3(d)), fairly good clusterings are obtained, especially when the CSPA, ALSAD and KMSAD consensus functions are employed (in these cases, relative φ(NMI) differences below 5% with respect to the image and image+speech modalities are obtained). For the remaining consensus functions, the intermodal consensus clustering λc attains lower φ(NMI) scores, thus constituting a trade-off between the consensus clusterings of the unimodal and fused modalities.
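To make the notion of a consensus function concrete, the sketch below implements an evidence-accumulation-style combiner in the spirit of EAC: objects that co-cluster in a majority of the ensemble partitions are merged via union-find. This is a deliberate simplification (the actual EAC applies hierarchical clustering to the full co-association matrix), and all names and the toy ensemble are illustrative.

```python
def coassociation_consensus(ensemble, threshold=0.5):
    """Evidence-accumulation-style consensus: objects that co-cluster in more
    than `threshold` of the ensemble partitions are merged (union-find)."""
    n = len(ensemble[0])
    parent = list(range(n))

    def find(i):  # path-halving union-find lookup
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            votes = sum(1 for labels in ensemble if labels[i] == labels[j])
            if votes / len(ensemble) > threshold:
                parent[find(i)] = find(j)

    # Relabel connected components as consecutive cluster ids.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

ensemble = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
print(coassociation_consensus(ensemble))  # → [0, 0, 1, 1]
```

Even though the third partition disagrees on object 1, the majority vote across the ensemble recovers the dominant two-cluster structure.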
Figure 5.4 presents the results of the same process, but executed on the multimodal cluster ensemble E created by means of the direct-cos-i2 clustering algorithm. We can observe a very similar behaviour to the one just reported. In this case, however, the intermodal consensus clustering solution λc (see figure 5.4(d)) is, in some cases (e.g. when consensus is based on CSPA), better (by 3.1% to 13.9% in relative percentage φ(NMI) terms) than any of its unimodal and multimodal counterparts (figures 5.4(a) to 5.4(c)).

The quality of the unimodal, multimodal and intermodal consensus clustering solutions obtained by applying the multimodal DHCA to the cluster ensemble generated upon the graph-cos-i2 clustering algorithm of the CLUTO toolbox is presented in figure 5.5. In this case, a larger performance uniformity among consensus functions is observed, at least as far as the image and image+speech modalities are concerned (figures 5.5(a) and 5.5(c)). Moreover, the consensus clusterings obtained upon the multimodal representation
Chapter 5. Multimedia clustering based on cluster ensembles

Figure 5.4: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the IsoLetters data set. [Boxplot panels omitted: (a) Modality 1 (image), (b) Modality 2 (speech), (c) Multimodal (image+speech), (d) Intermodal; each panel plots φ(NMI) from 0 to 1 for E and the seven consensus functions.]
Figure 5.5: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the IsoLetters data set. [Boxplot panels omitted: (a) Modality 1 (image), (b) Modality 2 (speech), (c) Multimodal (image+speech), (d) Intermodal; each panel plots φ(NMI) from 0 to 1 for E and the seven consensus functions.]
of the data (λc^image+speech) attain a higher quality than any of their unimodal counterparts (16.1% better in relative terms). Moreover, intermodal consensus (figure 5.5(d)) gives rise to clusterings that, at best, are comparable to the partitions obtained on the multimodal modality (e.g. ALSAD and KMSAD, where average relative φ(NMI) losses of 3% are observed) and, in the worst cases, constitute a trade-off between the combined modalities.
Last, notice that fairly similar results are obtained when this consensus process is applied on the cluster ensemble created by compiling the partitions output by the rb-cos-i2 CLUTO clustering algorithm (see figure 5.6). Once more, the comparison of the consensus clusterings obtained on the unimodal and multimodal modalities (figures 5.6(a) to 5.6(c)) reveals the superiority of the latter on this data collection. When these three modalities are fused in the final consensus stage of the DHCA, the intermodal consensus clustering solutions λc yielded by the CSPA, ALSAD, KMSAD and SLSAD consensus functions attain φ(NMI) values only 0.9% to 7.1% worse than those attained on the multimodal data representation, as depicted in figure 5.6(d).
While the boxplots depicted in figures 5.3 to 5.6 provide the reader with a qualitative though partial view of the results of unimodal, multimodal and intermodal consensus
Figure 5.6: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the IsoLetters data set. [Boxplot panels omitted: (a) Modality 1 (image), (b) Modality 2 (speech), (c) Multimodal (image+speech), (d) Intermodal; each panel plots φ(NMI) from 0 to 1 for E and the seven consensus functions.]
clustering, it is necessary to conduct a more quantitative and general analysis across the experiments conducted on the |dfA| = 28 cluster ensembles created using all the clustering algorithms from the CLUTO toolbox upon the four multimodal data collections employed in this work.
Unimodal and multimodal consensus clustering vs their cluster ensembles<br />
Firstly, we have evaluated the quality of the two unimodal (λc^mod1 and λc^mod2) and the multimodal (λc^mod1+mod2) consensus clusterings with respect to the cluster ensembles they have been created upon.
In order to evaluate how the consensus clusterings compare to their associated cluster ensembles, we have computed the percentage of cluster ensemble components that attain a higher φ(NMI) than the evaluated consensus clustering. Quite obviously, the smaller this percentage, the higher the achieved robustness to clustering indeterminacies. The results of this analysis are presented in table 5.2.
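The figures reported in table 5.2 follow directly from counting, per experiment, how many ensemble components beat the consensus clustering. A sketch (the function name and toy φ(NMI) values are illustrative):

```python
def pct_components_above(ensemble_nmis, consensus_nmi):
    """Percentage of cluster ensemble components whose phi(NMI) exceeds
    that of the evaluated consensus clustering (lower is better)."""
    above = sum(1 for v in ensemble_nmis if v > consensus_nmi)
    return 100.0 * above / len(ensemble_nmis)

# e.g. 2 of 8 components beat the consensus clustering
print(pct_components_above([0.4, 0.5, 0.55, 0.6, 0.62, 0.7, 0.8, 0.9], 0.75))  # → 25.0
```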
Care must be taken when analyzing the presented percentages, because the two unimodal and the multimodal consensus clusterings have been created upon different cluster ensembles. This makes comparisons across columns (i.e. across consensus functions) fair, but the same does not hold for comparisons across rows (i.e. across consensus clusterings).
If the performance of the seven consensus functions is contrasted, it can be observed that EAC and HGPA yield, in most cases, the worst results, as the consensus clusterings they produce (either unimodal or multimodal) are, on average, worse than 76.9% and 78.8% of the components of the cluster ensemble they are created upon, a percentage that drops below 30% in the case of the best performing consensus functions (CSPA, ALSAD and KMSAD). These results confirm that there may exist great differences between consensus functions as far as the quality of the consensus clusterings is concerned, so care must be taken when choosing which ones are applied.
If averages are taken for summarization purposes, unimodal consensus clusterings are better than 49.2% of their corresponding cluster ensemble components, while this percentage rises to 56.5% when multimodal consensus clusterings are considered.
Data set      Consensus type          CSPA   EAC    HGPA   MCLA   ALSAD  KMSAD  SLSAD
IsoLetters    λc^image                21.1   60.1   63.4   40.9   20.4   26.6   40.2
              λc^speech               43.9   91.5   79.4   47.4   18.3   34.6   77.4
              λc^image+speech         17.1   51.8   59.9   34.1   16.7   25.0   39.1
CAL500        λc^audio                16.5   86.3   67.7   29.9    9.9   12.0   70.3
              λc^text                 29.7   83.5   52.6   59.3   32.6   30.5   75.7
              λc^audio+text           21.9   94.9   54.8   48.0   16.9   20.3   61.9
InternetAds   λc^object               47.8   62.2   99.1   44.7   39.8   40.5   61.6
              λc^collateral           55.3   52.4   99.4   40.9   45.1   37.6   59.6
              λc^object+collateral    41.6   64.5   99.7   29.9   29.2   34.6   37.3
Corel         λc^image                12.7   93.8   96.2   50.0   19.1   26.7   79.1
              λc^text                 10.4   89.2   88.1   43.7   28.8   23.7   77.8
              λc^image+text            7.5   92.2   85.1   40.2   10.9   20.2   64.1

Table 5.2: Percentage of cluster ensemble components that attain a higher φ(NMI) than the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions.
The second analysis consists of comparing the unimodal and multimodal consensus clusterings with the cluster ensemble components of maximum and median φ(NMI) (which we call the best and median ensemble components, or BEC and MEC for short). Taking these two cluster ensemble components as a reference, we have computed i) the percentage of experiments in which the evaluated consensus clustering attains a higher φ(NMI), and ii) the relative percentage φ(NMI) differences between them and the evaluated consensus clustering. The results of this analysis are presented, per data collection and consensus function, in tables 5.3 and 5.4, where the evaluation is referred to the BEC and the MEC, respectively.
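The two quantities reported in tables 5.3 and 5.4 (%exp and Δφ(NMI)) can be computed per experiment run as sketched below; the function name and the single toy run are illustrative, not thesis data.

```python
import statistics

def bec_mec_evaluation(runs):
    """For each run, compare the consensus phi(NMI) against the best (BEC)
    and median (MEC) ensemble components; return the percentage of runs won
    and the mean relative percentage phi(NMI) difference for each reference."""
    wins_bec = wins_mec = 0
    diffs_bec, diffs_mec = [], []
    for ensemble_nmis, consensus_nmi in runs:
        bec = max(ensemble_nmis)
        mec = statistics.median(ensemble_nmis)
        wins_bec += consensus_nmi > bec
        wins_mec += consensus_nmi > mec
        diffs_bec.append(100.0 * (consensus_nmi - bec) / bec)   # Δφ(NMI) wrt BEC
        diffs_mec.append(100.0 * (consensus_nmi - mec) / mec)   # Δφ(NMI) wrt MEC
    n = len(runs)
    return (100.0 * wins_bec / n, statistics.mean(diffs_bec),
            100.0 * wins_mec / n, statistics.mean(diffs_mec))

# one toy run: ensemble phi(NMI) values [0.2, 0.4, 0.6], consensus at 0.5
pct_bec, dphi_bec, pct_mec, dphi_mec = bec_mec_evaluation([([0.2, 0.4, 0.6], 0.5)])
```

Here the consensus loses to the BEC (0.5 < 0.6, a negative Δφ(NMI)) but beats the MEC (0.5 > 0.4), mirroring the typical pattern in the tables.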
It can be observed that, as already noticed in previous experiments, the EAC and HGPA consensus functions perform notably worse than the remaining ones. On average, unimodal consensus clusterings are better than their corresponding BEC in 6.5% of the experiments, whereas the multimodal consensus clustering solutions attain a higher φ(NMI) than the BEC on 8.7% of the occasions (see table 5.3). If this comparison is made in terms of the relative percentage φ(NMI) differences, we see that, on average, the unimodal consensus clusterings are 33.5% worse than the BEC, while this percentage drops to 28.2% when the multimodal consensus clusterings are considered.
If the median ensemble component is taken as a reference (see table 5.4), we observe that unimodal consensus clusterings are better than the MEC in 54% of the experiments conducted. In contrast, when the multimodal consensus clustering solution is considered, superiority with respect to the MEC is obtained in 62.3% of the cases. If the MEC and the consensus clusterings are compared in terms of relative percentage φ(NMI) differences, we see that unimodal consensus yields clusterings that are 21.1% better than the MEC, while this percentage rises to 39.6% in the case of multimodal consensus.
Thus, a conclusion we can draw at this point is that, in view of the results just reported,<br />
the execution of consensus processes on multimodal cluster ensembles yields better quality<br />
consensus clusterings than those obtained upon cluster ensembles based on a single modality,<br />
which somehow constitutes a claim in favour of early fusion techniques. However, this<br />
statement must be made with caution, as it is not supported by evidence in all the data<br />
collections (for instance, the CAL500 collection constitutes an exception to this rule).<br />
Multimodal vs unimodal consensus clustering<br />
Secondly, we have compared the quality of the multimodal consensus clusterings (that is, λc^mod1+mod2) with the quality of their unimodal counterparts (λc^mod1 and λc^mod2). Again,
this comparison has been made in terms of the percentage of experiments in which the<br />
former attains a higher φ (NMI) than the latter, and the relative percentage φ (NMI) differences<br />
between them, taking the unimodal consensus clusterings as a reference. The results are<br />
presented in table 5.5.<br />
In average terms, the multimodal consensus clustering λc^mod1+mod2 is better than its two unimodal counterparts in 53.3% of the experiments conducted, and worse than both of them in only 13% of the occasions. If the φ(NMI) differences between these consensus clusterings are measured, we observe that multimodal consensus yields partitions that, in average relative percentage φ(NMI) terms, are 4802.4% better. Although this large figure is mainly caused by two outliers (on the InternetAds data set using the ALSAD and KMSAD consensus functions), the results presented in table 5.5 show an overwhelming majority of positive Δφ(NMI) values, which reinforces the notion that multimodal consensus processes,
when compared to their unimodal counterparts, constitute, in most cases, a better option.<br />
Intermodal vs unimodal and multimodal consensus clustering<br />
Furthermore, we have also investigated whether the combination of unimodal and multimodal consensus clusterings (i.e. the execution of intermodal consensus processes) can lead to better partitions.
For this reason, table 5.6 presents the detailed results corresponding to the comparison of<br />
the intermodal consensus clustering solutions with respect to their unimodal and multimodal<br />
counterparts, across all the data sets and consensus functions. Once again, such comparison<br />
is twofold, as it takes into account the percentage of experiments in which intermodal<br />
consensus is better than unimodal and multimodal, and the relative percentage φ (NMI)<br />
differences between them (taking the unimodal and multimodal consensus clusterings as a<br />
reference).<br />
If averages across data collections and consensus functions are taken, the following results are obtained: when compared to the unimodal consensus clusterings, intermodal consensus clusterings are better in 59.5% of the experiments conducted, attaining an average relative φ(NMI) gain of 2821.7% with respect to them. That is, intermodal consensus clusterings are, in general terms, superior to their unimodal counterparts. Possibly the clearest exceptions to this rule are found in the audio modality of the CAL500 data set and the image modality of the Corel collection.
However, intermodal consensus clusterings are superior to their multimodal counterparts in just 34.7% of the occasions, reaching a quality that, measured in average relative percentage φ(NMI) terms, is 65.5% better. Thus, as already suggested by the boxplot charts presented in figures 5.3 to 5.6, intermodal consensus clusterings tend to become a trade-off between the multimodal and unimodal consensus clustering solutions they are based on.
Furthermore, if the quality of the intermodal consensus clustering is contrasted to that of the cluster ensemble it is created upon (that is, the one compiling both unimodal and multimodal clusterings), we obtain that it is better than 52.9% of its components. Recall that this percentage was 49.2% and 56.5% when referred to the unimodal and multimodal consensus clusterings, which reinforces the notion that, in general terms, intermodal consensus is a trade-off between its unimodal and multimodal counterparts.
Notice that quite different situations are found among the data sets used in this experiment. For instance, the intermodal consensus clustering is clearly inferior to its multimodal counterpart on the IsoLetters data set, whereas quite the opposite is observed on the InternetAds collection. Therefore, we consider that creating an intermodal consensus clustering is a fairly generic way of proceeding, as it can sometimes be advantageous to combine unimodal and multimodal consensus clusterings. Its occasionally inferior quality (when compared to either its unimodal or multimodal counterparts) can be compensated by the consensus self-refining procedure presented in section 5.2. The results of applying it to the intermodal consensus clustering λc are described in the following section.
Data set      Consensus type                    CSPA   EAC    HGPA   MCLA   ALSAD  KMSAD  SLSAD
IsoLetters    λc^image             %exp          3.6    0      0      0     17.9    5      3.6
                                   Δφ(NMI)      -7.3  -42.9  -39.2  -18.8  -10.1   -9.1  -22
              λc^speech            %exp          0      0      0      0     10.7    7.1    3.6
                                   Δφ(NMI)     -17.3  -59.8  -37.5  -20.3   -4.8  -10.2  -42.1
              λc^image+speech      %exp         10.7    0      0      0      7.1    8.6    0
                                   Δφ(NMI)      -5.8  -33.2  -38.2  -16.1   -8.6   -8.5  -16.2
CAL500        λc^audio             %exp          3.5    0      0      1.4   25      7.8    0
                                   Δφ(NMI)      -9.4  -50.5  -37.5  -17     -5.8   -7    -37.3
              λc^text              %exp         28.6    0     13.5    1.4   21.4   26.4    7.1
                                   Δφ(NMI)      -4.1  -24.2  -11.7  -14.9   -5.5   -4.6  -23.1
              λc^audio+text        %exp         21.4    0      5.7    0.7   28.6   21.4    3.6
                                   Δφ(NMI)      -4.3  -44.8  -18.1  -19.6   -4.2   -6.1  -23.3
InternetAds   λc^object            %exp          0      3.6    0      0      0      0      0
                                   Δφ(NMI)     -45.1  -72.8  -99.9  -45.4  -43.3  -41.7  -61.9
              λc^collateral        %exp          0      0      0      0      3.6    4.3    0
                                   Δφ(NMI)     -81.7  -80.2  -99.9  -70.5  -70.3  -64.8  -87.7
              λc^object+collateral %exp          0      0      0     25      3.5    3.6    3.6
                                   Δφ(NMI)     -41    -82.5  -99.9  -38.6  -34.7  -42.3  -42.8
Corel         λc^image             %exp         25      0      0      3.6   39.3    9.3    3.5
                                   Δφ(NMI)      -2    -56.7  -43.5  -28.9   -4.8   -7.1  -28.8
              λc^text              %exp         35.7    0      5.7    1.4   14.3   15     10.7
                                   Δφ(NMI)      21.3  -56.3  -34.3  -24.6   -9.5   -8.4  -34.2
              λc^image+text        %exp         39.3    0      0     11.4   32.1   10      7.1
                                   Δφ(NMI)       6.6  -60.6  -40.7  -24.4   -4.9   -8.2  -29.4

Table 5.3: Evaluation of the unimodal and multimodal consensus clusterings with respect to the best cluster ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the evaluated consensus clustering is superior to the BEC, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.
Data set      Consensus type                    CSPA   EAC    HGPA   MCLA   ALSAD  KMSAD  SLSAD
IsoLetters    λc^image             %exp         92.3   21.4   22.9   85    100     90.7   67.9
                                   Δφ(NMI)      80    -10.7    5.2   45.9   56.3   69.3   35.1
              λc^speech            %exp         89.3    3.5    0     75.7  100     98.6    7.2
                                   Δφ(NMI)       5.6  -47.4  -20.1    0.7   23.6   17    -23.4
              λc^image+speech      %exp        100     50     42.1   91.4  100     92.9   85.7
                                   Δφ(NMI)      96.3   11.2   13.8   64     70.4   83.9   63
CAL500        λc^audio             %exp        100      3.6   12.9   86.4  100     99.3   21.4
                                   Δφ(NMI)      24.4  -31.9  -15.4   12     28.8   27.8  -14.1
              λc^text              %exp         78.6   14.3   45     39.3   75     73.6   28.6
                                   Δφ(NMI)       9.9  -13.3    1.1   -2.4    8.4    9.5  -11.8
              λc^audio+text        %exp         96.4    3.5   41.4   55.7   85.7   92.9   35.7
                                   Δφ(NMI)      16.3  -33.2   -0.6   -2.9   16.4   14     -6.8
InternetAds   λc^object            %exp         71.4   32.1    0     77.1   67.9   68.6   46.2
                                   Δφ(NMI)      64.5  110.2  -99.9   30.4   69    115.4   93.6
              λc^collateral        %exp         50     50      0     61.4   57.1   67.9   21.4
                                   Δφ(NMI)     111.7  173.5  -99.7  109.2  143.8  184    -33.5
              λc^object+collateral %exp         71.4   25      0     78.6   82.1   66.4   67.9
                                   Δφ(NMI)     236.9  -13    -99.4  137.5  186.1  202.9   14.6
Corel         λc^image             %exp        100      0      0     52.1   96.4   85     14.3
                                   Δφ(NMI)      23.4  -46.7  -33.1  -17.3   18.7   16.5   -9.4
              λc^text              %exp         96.4   10.7    8.6   62.9   85.7   87.9   21.4
                                   Δφ(NMI)      48.8  -45.7  -19.7   -8.2   13.4   14.9  -16.9
              λc^image+text        %exp        100      3.6    0     63.6  100     95.7   17.9
                                   Δφ(NMI)      51    -45.7  -21.5   -3.3   30.7   27.4   -0.2

Table 5.4: Evaluation of the unimodal and multimodal consensus clusterings with respect to the median cluster ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the evaluated consensus clustering is superior to the MEC, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.
Data set      Consensus type             CSPA    EAC    HGPA   MCLA   ALSAD   KMSAD    SLSAD
IsoLetters    λc^image      %exp         92.9    92.9   93.6   92.9    92.9    92.9     85.7
                            Δφ(NMI)      10.7    37.4   11.2   14.3    10.7     9.7     20.1
              λc^speech     %exp         96.4    92.9   95     95.7    85.7    92.9     92.9
                            Δφ(NMI)      81.1   177.7   56.5   70.7    54.3    65.9    155.4
CAL500        λc^audio      %exp          7.1    10.7   30.7    2.1     3.6     5       25
                            Δφ(NMI)     -26     -21.7   -7.3  -31.4   -28.7   -29.4    -11.2
              λc^text       %exp        100      17.9   85     79.3    96.4    93.6     82.1
                            Δφ(NMI)      19.9   -12.3   11.8   13.8    22.2    18.7     23.4
InternetAds   λc^object     %exp         64.3    35.7   93.6   62.1    64.3    59.3     71.4
                            Δφ(NMI)      80     234    1616   348    53833    1454     1270
              λc^collateral %exp         67.9    50     78.6   73.6    82.1    64.3     78.6
                            Δφ(NMI)    1690     160     -10   910    3750    200660    1030
Corel         λc^image      %exp         85.7    28.6   42.9   65.7    60.7    61.4     53.6
                            Δφ(NMI)      -2.5    -0.8   96.3   12.9    -5.9    -5.7     -6.7
              λc^text       %exp         89.3    96.4   88.6   91.4    96.4    96.4     96.5
                            Δφ(NMI)     142.9   149.9  129.9  165.8   169.6   151.7    192

Table 5.5: Evaluation of the multimodal consensus clusterings with respect to their unimodal counterparts, across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the multimodal consensus clustering is superior to the unimodal consensus clusterings, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.
Data set      Consensus type                    CSPA    EAC    HGPA   MCLA   ALSAD   KMSAD    SLSAD
IsoLetters    λc^image             %exp         92.9    42.9    1.4    8.6    89.3    86.4     46.4
                                   Δφ(NMI)       9      -5.2  -24.3  -12.1     8.9    10.1      3
              λc^speech            %exp         96.4    89.3   52.9   87.1    82.1    92.9     92.9
                                   Δφ(NMI)      78.2    90.2    6.6   30.5    49.6    59.5    114.2
              λc^image+speech      %exp         17.9     3.6    0      0.7    17.9    12.1      7.1
                                   Δφ(NMI)      -1.4   -28.6  -31.8  -22.5    -0.3     1.8    -12.7
CAL500        λc^audio             %exp          7.1    17.9    8.6    1.4     7.1     7.2     28.6
                                   Δφ(NMI)     -23.8   -10.1  -19.7  -31.9   -23.7   -25.8     -8.9
              λc^text              %exp         96.4    53.6   32.9   80.7   100      99.3     89.3
                                   Δφ(NMI)      25       0.5   -3.4   13.8    30.7    24.2     28
              λc^audio+text        %exp         75      78.6   11.4   40.7    75      72.1     57.1
                                   Δφ(NMI)       4.2    17.3  -13.1    3       7.3     5        6.8
InternetAds   λc^object            %exp         57.1    57.2   90     45.7    57.1    53.6     42.9
                                   Δφ(NMI)       2       3    -10    104    44152    465      -38
              λc^collateral        %exp         85.7    82.1   70.7   78.6    92.9    71.4     64.3
                                   Δφ(NMI)    1250      60    -30    940    4880    105030     60
              λc^object+collateral %exp         42.9    67.9   75     30      71.4    55.7     32.1
                                   Δφ(NMI)     479.1    41.7  -19.2   59.4    41.1   1322.6    22.2
Corel         λc^image             %exp         75      10.7    3.6    7.1    17.9     6.4      3.6
                                   Δφ(NMI)      -0.7   -17.1  -24.4  -22.4   -10.8   -11.6    -18
              λc^text              %exp        100      92.9   91.4   84.3   100     100      100
                                   Δφ(NMI)     145.9   136.8   30.1   83.5   153.8   143.4    157.9
              λc^image+text        %exp         14.3    46.4   14.4    5.7    14.3    14.2     17.9
                                   Δφ(NMI)       6.7     1.1  -38.2  -25.2     3.2     7.5     -1.6

Table 5.6: Evaluation of the intermodal consensus clustering with respect to the unimodal and multimodal consensus clusterings, across the four multimedia data collections and the seven consensus functions. The symbol %exp denotes the percentage of experiments in which the intermodal consensus clustering is superior to the unimodal and multimodal consensus clusterings, and Δφ(NMI) stands for the relative percentage φ(NMI) difference of the former wrt the latter.
5.3.2 Self-refined consensus clustering across modalities<br />
In this section, we analyze the results of subjecting the intermodal consensus clustering<br />
solution λc to a self-refining procedure based on a cluster ensemble E, the components of<br />
which correspond to both unimodal and multimodal partitions.<br />
As described in chapter 4, the self-refining process is based on the creation of a select cluster ensemble Epi (containing a percentage pi of the clusterings in E) followed by the application of a (either flat or hierarchical) consensus process on it, which yields a self-refined consensus clustering λc^pi. In this work, the refining consensus processes are conducted by means of a flat consensus architecture, and the set of percentages employed is pi = {2, 5, 10, 15, 20, 30, 40, 50, 75}.
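Building the select cluster ensemble Epi amounts to ranking the clusterings in E and keeping the top pi percent. The sketch below assumes a generic per-clustering score (the actual selection criterion is the one defined in chapter 4); the function name and toy data are illustrative.

```python
def select_ensemble(ensemble, scores, pi):
    """Keep the top pi percent of clusterings in E, ranked by `scores`
    (a per-clustering quality or similarity estimate), to form E_pi."""
    k = max(1, round(len(ensemble) * pi / 100))
    ranked = sorted(range(len(ensemble)), key=lambda i: scores[i], reverse=True)
    return [ensemble[i] for i in ranked[:k]]

# four toy clusterings with associated scores; pi = 50 keeps the best two
ensemble = [[0, 1], [1, 0], [0, 0], [1, 1]]
scores = [0.9, 0.1, 0.5, 0.7]
e50 = select_ensemble(ensemble, scores, 50)
```

A consensus function is then applied to `e50` to obtain the self-refined clustering λc^50, and likewise for the remaining values of pi.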
Consequently, the final consensus clustering solution λc^final is obtained by applying a supraconsensus selection function to the set of self-refined clusterings λc^pi (see section 4.1 for a description of the φ(ANMI)-based supraconsensus function employed in this work). In the following paragraphs, the performances of these two processes (i.e. the self-refining procedure and the supraconsensus function) are evaluated separately.
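The φ(ANMI)-based supraconsensus function selects, among the candidate (refined and non-refined) consensus clusterings, the one with the highest average similarity to the ensemble components. A sketch follows; for brevity it takes the similarity measure as a parameter and exercises it with a simple pairwise-agreement score instead of NMI, and all names and toy data are illustrative.

```python
def supraconsensus(candidates, ensemble, sim):
    """ANMI-style supraconsensus: return the candidate clustering whose
    average similarity (e.g. NMI) to all ensemble components is highest."""
    def average_sim(candidate):
        return sum(sim(candidate, labels) for labels in ensemble) / len(ensemble)
    return max(candidates, key=average_sim)

def pair_agreement(a, b):
    """Toy similarity: fraction of object pairs on which the two partitions
    make the same co-assignment decision (together vs apart)."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    same = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return same / len(pairs)

ensemble = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
candidates = [[0, 0, 1, 1], [0, 1, 2, 3]]
best = supraconsensus(candidates, ensemble, pair_agreement)  # → [0, 0, 1, 1]
```

Note that this selection needs no ground truth: it favours the candidate most consistent with the ensemble as a whole, which is what makes it usable as an unsupervised model-selection step.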
As in the previous section, the results of the execution of these processes on the IsoLetters<br />
data collection are described in detail next. Again, for brevity reasons, the presentation of<br />
the results corresponding to the CAL500, InternetAds and Corel data sets is deferred to<br />
appendix E.<br />
Evaluation of the multimodal self-refining process<br />
To begin with, figure 5.7 depicts the results of the self-refining process using the multimodal cluster ensemble resulting from gathering the partitions output by the agglo-cos-upgma clustering algorithm of the CLUTO package. Each of the seven boxplots presented (one per consensus function) shows, from left to right, the φ(NMI) values of the components of the multimodal cluster ensemble E, of the intermodal consensus clustering λc, and of the self-refined consensus clustering solutions λc^pi. The consensus clustering selected by the supraconsensus function, λc^final, is highlighted by a vertical green dashed line.
Notice that, regardless of the consensus function employed, there always exists at least one self-refined consensus clustering solution that attains a φ(NMI) value that is statistically significantly higher than the one achieved by the non-refined version λc. In some cases, as when consensus is based on CSPA (figure 5.7(a)), relatively small φ(NMI) gains are obtained, especially if they are compared with the dramatic φ(NMI) increases brought about by self-refining when, for instance, the EAC, HGPA or SLSAD consensus functions are employed (see figures 5.7(b), 5.7(c) and 5.7(g)).
As regards the ability of the supraconsensus function to select the highest quality (either refined or non-refined) consensus clustering as the final partition of the data, it performs reasonably well, as it mostly succeeds in picking the clustering solution of maximum φ(NMI) as λc^final. A deeper analysis of the supraconsensus function performance will be presented in the next section.
Notice that the proposed intermodal consensus self-refining procedure shows a very similar behaviour in the experiments conducted on the multimodal cluster ensembles created upon the partitions output by the direct-cos-i2, the graph-cos-i2 and the rb-cos-i2 clustering algorithms from the CLUTO toolbox (see figures 5.8, 5.9 and 5.10, respectively).
Figure 5.7: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the IsoLetters data set. [Boxplot panels omitted: (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD; each panel plots φ(NMI) from 0 to 1 for E, λc and the self-refined clusterings λc^2, λc^5, λc^10, λc^15, λc^20, λc^30, λc^40, λc^50 and λc^75.]
[Figure: seven φ (NMI) boxplot panels, one per consensus function ((a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD), each showing the cluster ensemble E, the non-refined consensus clustering λc and the self-refined consensus clusterings λ_c^p for p ∈ {2, 5, 10, 15, 20, 30, 40, 50, 75}.]<br />
Figure 5.8: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the IsoLetters data set.<br />
5.3. Multimodal consensus clustering results<br />
[Figure: seven φ (NMI) boxplot panels, one per consensus function ((a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD), each showing the cluster ensemble E, the non-refined consensus clustering λc and the self-refined consensus clusterings λ_c^p for p ∈ {2, 5, 10, 15, 20, 30, 40, 50, 75}.]<br />
Figure 5.9: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the IsoLetters data set.<br />
<br />
Data set       CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD<br />
IsoLetters     96.4   96.5   100    98.6   100     100     100<br />
CAL500         89.3   92.9   99.3   97.8   67.9    83.6    82.1<br />
InternetAds    75     57.1   46.4   98.5   78.6    90      85.7<br />
Corel          100    89.2   100    100    100     100     100<br />
Table 5.7: Percentage of multimodal self-refining experiments in which one of the self-refined consensus clustering solutions is better than its non-refined counterpart, across the four multimedia data collections and the seven consensus functions.<br />
<br />
A comprehensive evaluation of the results of the intermodal consensus self-refining procedure is presented throughout the following paragraphs. This analysis considers the experiments conducted upon the cluster ensembles created using the |dfA| = 28 clustering algorithms across the four multimedia data collections employed in this work.<br />
For starters, in order to evaluate the ability of the self-refining process to create high quality partitions, we have measured the percentage of experiments in which there exists at least one self-refined consensus clustering λ_c^pi that attains a higher φ (NMI) than its non-refined counterpart λc. The results per data set and consensus function, which are presented in table 5.7, reveal that self-refining is capable of yielding a beneficial effect in a large majority (an average 90.2%) of the experiments conducted. This figure is of the same order of magnitude as the one obtained in the consensus-based unimodal self-refining experiments presented in section 4.2.1, which indicates that multimodality does not constitute an obstacle as far as the performance of the proposed self-refining procedure is concerned.<br />
[Figure: seven φ (NMI) boxplot panels, one per consensus function ((a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD, (g) SLSAD), each showing the cluster ensemble E, the non-refined consensus clustering λc and the self-refined consensus clusterings λ_c^p for p ∈ {2, 5, 10, 15, 20, 30, 40, 50, 75}.]<br />
Figure 5.10: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the IsoLetters data set.<br />
<br />
Moreover, so as to evaluate the quality improvement that the proposed self-refining procedure is able to introduce, we have computed the relative φ (NMI) gain between the top quality self-refined consensus clustering and its non-refined counterpart, measured in those experiments where there exists a self-refined consensus clustering superior to the non-refined version (i.e. 90.2% of the total). Table 5.8 presents the results corresponding to<br />
each data collection and consensus function. It can be observed that φ (NMI) gains over 10% are consistently obtained in most cases; again, these results are of comparable magnitude to those obtained in the unimodal scenario (see section 4.2.1). Notice that the results obtained on the InternetAds data set stand out among the rest, as relative percentage φ (NMI) gains of the order of 10^3 to 10^5 are observed. These extremely large figures are due to the extremely low quality of the consensus clusterings available prior to refining on this collection, which transforms any φ (NMI) increase caused by self-refining into a huge figure when expressed as a relative percentage difference with respect to the non-refined clustering. In the 9.8% of the experiments in which none of the self-refined consensus clusterings λ_c^pi attains a higher quality than the non-refined one λc, the difference between the top quality λ_c^pi and λc is an average relative percentage φ (NMI) loss of 19%.<br />
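The relative percentage φ (NMI) gain used throughout these comparisons is simply 100 · (φ_refined − φ_non-refined) / φ_non-refined, so a near-zero non-refined baseline (as on InternetAds) inflates even a modest absolute improvement into a huge relative figure. A minimal sketch with illustrative values (not taken from the experiments):<br />

```python
def relative_nmi_gain(nmi_refined, nmi_non_refined):
    """Relative percentage phi(NMI) gain of a self-refined consensus
    clustering with respect to its non-refined counterpart."""
    return 100.0 * (nmi_refined - nmi_non_refined) / nmi_non_refined

# A moderate baseline yields a moderate relative gain:
print(round(relative_nmi_gain(0.55, 0.50), 1))   # 10.0

# A near-zero baseline turns a small absolute improvement into a
# figure of the order of 10^5, as observed on InternetAds:
print(round(relative_nmi_gain(0.10, 0.0001)))    # 99900
```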
Besides comparing the top quality self-refined consensus clustering solution with its non-refined counterpart, we have also contrasted its quality with respect to the clusterings that make up the cluster ensemble.<br />
Firstly, we have computed the percentage of cluster ensemble components that attain a higher φ (NMI) score (with respect to the ground truth) than that of the top quality self-refined consensus clustering, as this figure constitutes a clear indicator of how it compares to the cluster ensemble it is created upon (see table 5.9). In average terms, the top quality self-refined consensus clustering is better than 78.3% of the cluster ensemble components, which is a sign of notable robustness to the indeterminacies of multimodal clustering. Moreover, recall that this percentage was 52.9% prior to self-refining, which again provides evidence of the benefits of the proposed consensus self-refining procedure.<br />
<br />
Data set       CSPA    EAC     HGPA     MCLA     ALSAD    KMSAD    SLSAD<br />
IsoLetters     8.7     93      121.1    31.3     23.5     16       35.4<br />
CAL500         14.1    31.1    57.4     22.3     13.8     18.6     19.4<br />
InternetAds    26284   12207   467710   1991.5   1686.8   1788.5   1742.3<br />
Corel          11.1    64.2    177.1    42.2     32       33.6     44.7<br />
Table 5.8: Relative φ (NMI) gain percentage of the top quality self-refined consensus clustering solution with respect to its non-refined counterpart, across the four multimedia data collections and the seven consensus functions.<br />
<br />
Data set       CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD<br />
IsoLetters     2.2    19.9   16.8   5.4    0.5     1.7     8.7<br />
CAL500         14.1   61.6   17.6   17.1   15.4    14.4    41.2<br />
InternetAds    16.9   35.9   94.1   8.3    18.1    17.6    38.5<br />
Corel          0.9    63.4   29.5   4.9    0.1     1.2     40.9<br />
Table 5.9: Percentage of the cluster ensemble components that attain a higher φ (NMI) score than the top quality self-refined consensus clustering solution, across the four multimedia data collections and the seven consensus functions.<br />
Furthermore, we have compared the top quality self-refined consensus clustering with the highest and median φ (NMI) components of the multimodal cluster ensemble E, referred to as BEC (best ensemble component) and MEC (median ensemble component), respectively. Using the quality of these two components as a reference, we have evaluated i) the percentage of experiments where the maximum φ (NMI) consensus clustering solution attains a higher quality than that of the BEC and MEC, and ii) the relative percentage φ (NMI) variation between them and the top quality consensus clustering solution. Again, all the results presented correspond to an average across all the experiments conducted upon the cluster ensembles generated using the twenty-eight clustering algorithms from the CLUTO toolbox employed in this work.<br />
Table 5.10 presents, for each data set and consensus function, the percentage of experiments<br />
in which the top quality consensus clustering (either non-refined or refined) attains<br />
a higher quality (i.e. higher φ (NMI) ) than the cluster ensemble component of maximum<br />
quality (or BEC). In each box of the table, the percentage of experiments in which the<br />
non-refined consensus clustering reaches a higher φ (NMI) score than the BEC is shown in<br />
brackets. By doing so, the effect of the self-refining process can be evaluated at a glance.<br />
Notice, then, that the equality of the bracketed and the unbracketed figures shown in any<br />
table box indicates that none of the refined consensus clusterings attains a higher quality<br />
than its non-refined counterpart.<br />
On average, a self-refined consensus clustering better than the BEC is obtained in 19.1% of the experiments conducted, whereas this figure was as low as 1.6% prior to self-refining. Notice that, depending on the data collection, quite diverse results are obtained (consensus self-refining achieves a greater level of success when applied on the IsoLetters and Corel collections than on the CAL500 and InternetAds data sets); furthermore, notice the aforementioned differences between the results offered by the distinct consensus functions.<br />
<br />
Data set       CSPA         EAC       HGPA    MCLA       ALSAD        KMSAD       SLSAD<br />
IsoLetters     46.4 (3.5)   3.6 (0)   0 (0)   26.4 (0)   64.3 (7.1)   51.4 (5)    25 (0)<br />
CAL500         7.1 (0)      0 (0)     0 (0)   0 (0)      3.6 (3.6)    2.9 (2.9)   3.5 (3.5)<br />
InternetAds    3.6 (0)      0 (0)     0 (0)   0 (0)      0 (0)        1.4 (0)     0 (0)<br />
Corel          75 (14.3)    0 (0)     0 (0)   45 (0)     92.9 (7.1)   72.9 (0)    10.7 (0)<br />
Table 5.10: Percentage of experiments in which the best (either non-refined or self-refined) consensus clustering solution is better than the best cluster ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The percentages prior to self-refining are shown in brackets.<br />
Data set       CSPA        EAC        HGPA    MCLA      ALSAD         KMSAD       SLSAD<br />
IsoLetters     4.2 (2.1)   11.2 (–)   – (–)   2.4 (–)   8.2 (5.7)     6.9 (3.2)   8.3 (–)<br />
CAL500         5.6 (–)     – (–)      – (–)   – (–)     10.4 (10.4)   3.2 (3.2)   3.3 (3.3)<br />
InternetAds    6.2 (–)     – (–)      – (–)   – (–)     – (–)         2.9 (–)     – (–)<br />
Corel          6.9 (4.2)   – (–)      – (–)   2.1 (–)   6.1 (1.1)     5.4 (–)     13.7 (–)<br />
Table 5.11: Relative φ (NMI) percentage difference between the top quality (either non-refined or self-refined) consensus clustering solution and the best ensemble component (or BEC), across the four multimedia data collections and the seven consensus functions. The relative φ (NMI) percentage differences prior to self-refining are shown in brackets.<br />
Moreover, we have computed the relative percentage φ (NMI) gains between the top quality (non-refined or refined) consensus clustering solution and the BEC, limited to those experiments in which the former is superior to the latter, obtaining the results presented in table 5.11. If the φ (NMI) gains are averaged across all the experiments conducted, a 6.3% relative percentage φ (NMI) increase is obtained. Again, the percentages corresponding to the same magnitude measured prior to refining are presented in brackets in each box of the table, attaining an average φ (NMI) gain of 4.1%. That is, self-refining not only gives rise to a larger number of clusterings better than the BEC, but it also increases the φ (NMI) gains with respect to it. However, in those experiments in which the top quality (non-refined or refined) consensus clustering attains a φ (NMI) score lower than that of the BEC (i.e. in 80.9% of the total), its quality is 28.2% lower, measured in averaged relative percentage φ (NMI) terms.<br />
Data set       CSPA         EAC           HGPA        MCLA          ALSAD        KMSAD         SLSAD<br />
IsoLetters     100 (96.4)   92.9 (21.4)   100 (2.9)   99.3 (82.9)   100 (100)    100 (100)     100 (96.4)<br />
CAL500         100 (100)    21.4 (7.1)    97.1 (15)   99.3 (40)     100 (100)    100 (93.6)    71.4 (32.1)<br />
InternetAds    96.4 (75)    60.7 (32.1)   2.9 (0)     100 (72.9)    85.7 (75)    92.1 (66.4)   60.7 (17.9)<br />
Corel          100 (100)    14.3 (14.3)   91.4 (10)   100 (15.7)    100 (89.3)   100 (85)      60.8 (17.9)<br />
Table 5.12: Percentage of experiments in which the top quality (either non-refined or self-refined) consensus clustering solution is better than the median cluster ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The percentages prior to self-refining are shown in brackets.<br />
<br />
As regards the comparison with the median quality cluster ensemble component (or MEC), its quality is surpassed by the top quality consensus clustering in an average 83.8% of the experiments conducted (see table 5.12 for a detailed view of these results across data sets and consensus functions). It can be observed that, in a vast majority of cases, self-refining guarantees obtaining partitions that are better than the median clustering available in the cluster ensemble. Again, the percentage of experiments in which the non-refined intermodal consensus clustering attains a higher φ (NMI) than the MEC is presented in brackets in each box of the table, yielding an average of 55.7%; that is, self-refining increases the chances of obtaining a partition better than the median one by almost 30%.<br />
But better to what extent? So as to answer this question, table 5.13 presents the relative φ (NMI) percentage differences between the top quality consensus clustering and the MEC, considering only those experiments in which the former attains a higher φ (NMI) value than the latter. On average, a relative percentage gain of 142.7% is achieved, which again reinforces the notion that the proposed multimodal consensus self-refining process is, by itself, capable of yielding good quality partitions upon a previously derived consensus clustering solution. Moreover, each box in table 5.13 shows, in brackets, the relative φ (NMI) percentage difference between the non-refined intermodal consensus clustering λc and the MEC. Notice how the self-refining procedure, besides yielding consensus clusterings superior to the MEC in a larger number of experiments, also increases the difference with respect to it, raising it from an average 103.7% to the aforementioned 142.7%. In the experiments in which the top quality consensus clustering fails to attain a higher φ (NMI) value than that of the MEC (i.e. 16.2% of the total), its quality is 32.1% lower, measured in average relative percentage φ (NMI) terms.<br />
Note, however, that in tables 5.7 to 5.13 the performance of the self-refining procedure has been evaluated taking the highest φ (NMI) self-refined consensus clustering solution as a reference. But, as aforementioned, the self-refining process generates multiple self-refined clusterings λ_c^pi using distinct percentages pi of the original cluster ensemble E. Therefore, the subsequent application of the average normalized mutual information (φ (ANMI)) based supraconsensus function is required so as to obtain the final partition of the multimodal data set, λ_c^final. As already described in chapter 4, the ability of the supraconsensus function to select the top quality consensus clustering solution is a crucial issue as regards the overall performance of the multimodal self-refining consensus clustering system, and for this reason, the following paragraphs are devoted to its evaluation.<br />
<br />
Data set       CSPA            EAC             HGPA         MCLA          ALSAD           KMSAD           SLSAD<br />
IsoLetters     86.6 (76.8)     55.6 (15.4)     63.9 (2.4)   81.1 (28.1)   100.3 (65.3)    95.2 (67.8)     78.8 (34)<br />
CAL500         33.4 (18.9)     29.3 (4.4)      31.6 (7.6)   28.1 (–)      28.9 (–)        32.6 (–)        16 (–)<br />
InternetAds    382.9 (314.2)   191.2 (174.1)   196.7 (–)    412 (236.4)   408.5 (222.7)   446.3 (291.5)   406.9 (356.6)<br />
Corel          113.5 (83.8)    258 (39.7)      40.8 (2.6)   35.5 (4.6)    106.1 (58)      106 (54.8)      130.9 (225.5)<br />
Table 5.13: Relative φ (NMI) percentage difference between the top quality (either non-refined or self-refined) consensus clustering solution and the median ensemble component (or MEC), across the four multimedia data collections and the seven consensus functions. The relative φ (NMI) percentage differences prior to self-refining are shown in brackets.<br />
<br />
Data set       CSPA   EAC    HGPA   MCLA   ALSAD   KMSAD   SLSAD<br />
IsoLetters     53.6   67.9   70     50.7   39.3    42.9    39.3<br />
CAL500         64.3   35.7   17.9   10.7   21.4    46.4    21.5<br />
InternetAds    7.1    53.6   81.4   10     21.4    12.1    35.7<br />
Corel          57.1   57.2   70     57.1   39.2    45      50<br />
Table 5.14: Percentage of experiments in which the supraconsensus function selects the top quality consensus clustering solution, across the four multimedia data collections and the seven consensus functions.<br />
Evaluation of the supraconsensus process<br />
Firstly, we have evaluated the supraconsensus function in terms of the percentage of experiments in which it succeeds, i.e. in which it selects the highest quality consensus clustering solution as the final partition λ_c^final. The results obtained for each data set and consensus function are presented in table 5.14: the supraconsensus function performs correctly in an average 42.1% of the experiments conducted. That is, it is able to select the best clustering available in less than half of the occasions, which reveals (as outlined in chapter 4) that there is still room for improving the performance of supraconsensus functions.<br />
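For reference, the selection rule being evaluated here can be sketched as follows: among the candidate consensus clusterings (non-refined and self-refined), pick the one with the highest average NMI against the components of the cluster ensemble. The Python sketch below is a minimal illustration, not the thesis implementation; it uses the sqrt-of-entropies NMI normalization of Strehl and Ghosh, and the toy labelings are invented for the example.<br />

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two labelings,
    normalized by sqrt(H(a) * H(b)) as in Strehl and Ghosh."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((nij / n) * math.log(n * nij / (pa[i] * pb[j]))
             for (i, j), nij in pab.items())
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0

def supraconsensus(candidates, ensemble):
    """Return the candidate clustering maximizing the average NMI
    (ANMI) with respect to the cluster ensemble components."""
    return max(candidates,
               key=lambda lam: sum(nmi(lam, e) for e in ensemble) / len(ensemble))

# Toy example: the candidate agreeing with most ensemble
# components is selected as the final partition.
ensemble = [[1, 1, 2, 2], [1, 1, 2, 2], [1, 2, 2, 2]]
candidates = [[1, 1, 2, 2], [1, 2, 1, 2]]
print(supraconsensus(candidates, ensemble))  # [1, 1, 2, 2]
```

Since the selection is unsupervised (it never looks at the ground truth), it can pick a suboptimal candidate, which is precisely the accuracy gap quantified in table 5.14.<br />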
And secondly, we have analyzed how the consensus clustering selected by supraconsensus, λ_c^final, compares to the components of the cluster ensemble it is created upon, taking again the cluster ensemble components of maximum and median φ (NMI) (respectively referred to as best and median ensemble component, or BEC and MEC for short) as a reference. Hence, we have measured the relative percentage φ (NMI) differences between the consensus clustering selected by supraconsensus and these cluster ensemble components, so as to provide the reader with a notion of the impact of the apparent lack of accuracy of the φ (ANMI)-based supraconsensus function.<br />
The results, averaged across all the consensus functions, are presented in table 5.15.<br />
Data set       Relative φ (NMI) difference with respect to<br />
               BEC      MEC<br />
IsoLetters     -17.5    53.2<br />
CAL500         -34.7    13.2<br />
InternetAds    -65.2    138.7<br />
Corel          -28.9    169.1<br />
Table 5.15: Relative φ (NMI) percentage differences between the best and median components of the cluster ensemble and the consensus clustering λ_c^final selected by supraconsensus, for the four multimedia data collections, averaged across the seven consensus functions employed.<br />
On average, λ_c^final is, in relative percentage terms, 36.6% worse than the BEC and 93.5% better than the MEC. As expected, the price to pay for the lack of precision of the supraconsensus function is a reduction in the quality of the final clustering solution.<br />
5.4 Discussion<br />
In this chapter, we have proposed and experimentally explored the use of consensus clustering<br />
strategies for partitioning multimedia data collections robustly. From our viewpoint,<br />
this application constitutes a natural extension of the computationally efficient consensus<br />
architectures presented in chapter 3 and the self-refining procedures proposed in chapter<br />
4. As mentioned earlier, the growing ubiquity of multimedia data makes the proposals put<br />
forward in the present chapter even more appealing.<br />
Across the experiments presented in this chapter and in appendix B.2, we have observed that partitioning multimodal data sets in a robust manner is trickier than doing so in a unimodal scenario, as the existence of multiple modalities in the data increases the already numerous indeterminacies inherent to the clustering problem.<br />
As a means of fighting this fact, modality fusion has become a recurrent issue in the multimedia data analysis literature. Indeed, it is logical to assume that combining the distinct data modalities can be of interest, as it is an obvious way to take advantage of the constructive dependences that can be expected to exist between them. In this sense, and focusing on the clustering problem, there exist two main approaches to modality fusion: early (a.k.a. feature level) fusion and late (classification decision level) fusion.<br />
However, our experiments have revealed that none of these fusion strategies is, by itself,<br />
capable of ensuring robust clustering results: in some cases, feature level fusion gives rise to<br />
the best clustering results, whereas, in other cases, it simply constitutes a trade-off between<br />
modalities. For this reason, our multimodal self-refining consensus clustering architectures<br />
constitute a generic approach for partitioning multimedia data collections with a reasonable<br />
degree of robustness, with the advantage of encompassing, simultaneously, early and late<br />
fusion techniques.<br />
To our knowledge, most works dealing with multimodal clustering focus on feature level fusion, deriving novel early fusion approaches for combining modalities (see section 1.4 for a brief review). However, they often disregard the fact that, in some data sets, early fusion may not be advantageous (that is, there may exist modalities that do not contribute positively to obtaining a good partition of the data; see appendix B.2). In our opinion, this constitutes one of the main strengths of our proposal, as it allows the user to employ any modality created by feature-level fusion, besides the original modalities, for obtaining the final partition of the data.<br />
This a priori positive openhandedness entails two negative consequences. Firstly, it increases the computational complexity of the consensus process, although this inconvenience can be sorted out by applying the efficient hierarchical consensus architectures proposed in chapter 3. The second drawback is the inclusion of the poorest modality clustering results in the consensus clustering process, but this can be alleviated by the use of the consensus-based self-refining procedures described in chapter 4, which achieve notable results when applied on the intermodal consensus clustering solutions.<br />
In future works, the implementation of selection-based self-refining processes (see section<br />
4.3) on multimodal cluster ensembles will be investigated, as we expect that it may yield<br />
higher quality multimodal partitions than the ones presented in this chapter. Furthermore,<br />
as already stated in chapter 4, it will be necessary to devise novel, more precise supraconsensus<br />
functions, capable of selecting with a higher degree of accuracy the top quality<br />
consensus clustering solution in an unsupervised manner.<br />
5.5 Related publications<br />
None of the work regarding multimedia clustering presented in this chapter has been published yet. Nevertheless, we would like to highlight the following paper, focused on applying early fusion of modalities for jointly conducting multimodal data analysis and synthesis of facial video sequences (Sevillano et al., 2009). The details of this work, published as a book chapter, are presented next.<br />
Authors: Xavier Sevillano, Javier Melenchón, Germán Cobo, Joan Claudi Socoró<br />
and Francesc Alías<br />
Title: Audiovisual Analysis and Synthesis for Multimodal Human-Computer Interfaces<br />
In: Engineering the User Interface: From Research to Practice<br />
Publisher: Springer<br />
Editors: Miguel Redondo, Crescencio Bravo and Manuel Ortega<br />
Pages: 179–194<br />
Year: 2009<br />
ISBN: 978-1-84800-135-0<br />
Abstract: Multimodal signal processing techniques are called to play a salient role<br />
in the implementation of natural computer-human interfaces. In particular, the development<br />
of efficient interface front ends that emulate interpersonal communication<br />
would benefit from the use of techniques capable of processing the visual and auditory<br />
modes jointly. This work introduces the application of audiovisual analysis and synthesis<br />
techniques based on Principal Component Analysis and Non-negative Matrix<br />
Factorization on facial audiovisual sequences. Furthermore, the applicability of the<br />
extracted audiovisual bases is analyzed throughout several experiments that evaluate<br />
the quality of audiovisual resynthesis using both objective and subjective criteria.<br />
Chapter 6<br />
Voting based consensus functions<br />
for soft cluster ensembles<br />
As outlined in section 1.2.1, clustering algorithms can be divided into two large categories, depending on the number of clusters every object is assigned to. On one hand, hard (or crisp) clustering algorithms assign each object to a single cluster. For this reason, the result of applying a hard clustering process on a data set containing n objects is usually represented as an n-dimensional integer-valued row vector of labels (or labeling) λ, each component of which identifies which of the k clusters the corresponding object is assigned to, that is:<br />
λ = [λ1 λ2 ... λn]   (6.1)<br />
where λi ∈ [1, k], ∀i ∈ [1, n].<br />
On the other hand, soft (or fuzzy) clustering algorithms allow the objects to belong to
all clusters to a certain extent. Thus, the result of their application for partitioning a data
set containing n objects into k clusters is usually represented by means of a k × n real-valued
clustering matrix Λ (see equation (6.2)), the (i,j)th entry of which indicates the degree of
association between the jth object and the ith cluster.
Λ = ⎛ λ11 λ12 ... λ1n ⎞
    ⎜ λ21 λ22 ... λ2n ⎟
    ⎜  ⋮    ⋮    ⋱  ⋮ ⎟
    ⎝ λk1 λk2 ... λkn ⎠   (6.2)
where λij ∈ R, ∀i ∈ [1,k] and ∀j ∈ [1,n].
For illustration purposes, we resort to the toy clustering example presented in chapter<br />
2, in which clustering is conducted on the two-dimensional artificial data set presented in<br />
figure 6.1. This toy data collection contains n = 9 objects, and the desired number of<br />
clusters k is set to 3.<br />
If the classic k-means hard clustering algorithm is applied on this data set, the label
vector presented in equation (6.3) is obtained. Recall that the labels λi contained in λ
are purely symbolic (i.e. the labelings λ = [1 1 1 2 2 2 3 3 3] or λ = [3 3 3 2 2 2 1 1 1]
represent exactly the same partition of the data).
[Figure omitted from this extraction]
Figure 6.1: Scatterplot of an artificially generated two-dimensional data set containing n = 9
objects, which are represented by coloured symbols and identified by a number. The black
star symbols represent the position of the cluster centroids found by the k-means algorithm.
λ = [2 2 2 1 1 1 3 3 3]   (6.3)
As regards the results of applying a fuzzy clustering algorithm on this data collection,<br />
these clearly differ depending on the way the degree of association between objects and<br />
clusters is codified. Usually, the scalar values λij contained in the clustering matrix Λ<br />
represent cluster membership probabilities (i.e. the higher the value of λij, the more strongly<br />
the jth object is associated to the ith cluster). For instance, this is the way the well-known<br />
fuzzy c-means (FCM) clustering algorithm codifies its clustering results (Höppner, Klawonn,<br />
and Kruse, 1999). In fact, if this algorithm is applied on the previously described artificial<br />
data set, the clustering matrix presented in equation (6.4) is obtained.<br />
Λ = ⎛ 0.054 0.026 0.057 0.969 0.976 0.959 0.009 0.016 0.010 ⎞
    ⎜ 0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017 ⎟
    ⎝ 0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972 ⎠   (6.4)
It can be observed that any row permutation in Λ would yield an equivalent fuzzy partition.<br />
Moreover, notice that Λ can be transformed into a hard clustering solution by simply<br />
assigning each object to the cluster with maximum membership probability.<br />
However, the degree of association between objects and clusters can also be described in
terms of other parameters, such as the distance of each object to the cluster centroids
(information which algorithms like k-means, despite being hard clustering methods, can also
output). In fact, the object to centroid¹ distance matrix obtained after applying k-means on
the same toy data set as before is presented in equation (6.5).
Λ = ⎛ 0.362 0.325 0.672 0.002 0.002 0.005 0.160 0.092 0.125 ⎞
    ⎜ 0.010 0.009 0.027 0.397 0.490 0.436 0.251 0.320 0.209 ⎟
    ⎝ 0.170 0.202 0.445 0.090 0.125 0.162 0.002 0.005 0.002 ⎠   (6.5)
In this case, the conversion of Λ into a crisp partition requires assigning every object
to the closest (i.e. minimum distance) cluster. Thus, depending on the nature of the soft
clustering process, the interpretation of the scalar values λij contained in Λ differs (i.e. if the
λij represent membership probabilities, the larger their value the stronger the object-cluster
association, whereas the opposite interpretation should be made in case they represent
object to centroid distances).
¹ The cluster centroids are represented by means of a black star symbol in figure 6.1.
Either way, fuzzy clustering can be regarded as a version of hard clustering with relaxed<br />
object membership restrictions. Such relaxation is particularly useful when the clusters are<br />
not well separated, or when their boundaries are ambiguous. Moreover, object to cluster<br />
association information may be of help in discovering more complex relations between a<br />
given object and the clusters (Xu and Wunsch II, 2005). Furthermore, notice that soft
clustering can also be regarded as a generalization of its hard counterpart, as a crisp partition
can always be derived from a fuzzy one, whereas the converse does not hold.
However, these apparent strengths of soft clustering algorithms have barely been reflected<br />
in the development of consensus functions especially devised for combining the outcomes<br />
of multiple fuzzy clustering processes, as most proposals in this area are oriented<br />
towards their application on hard clustering scenarios. Nevertheless, as described in section<br />
2.2, there exist a few works in the consensus clustering literature devoted to the derivation<br />
of consensus functions for soft cluster ensembles, such as the VMA consensus function of<br />
(Dimitriadou, Weingessel, and Hornik, 2002) or the ITK consensus function of (Punera and<br />
Ghosh, 2007). Moreover, several other consensus functions can be indistinctly applied on<br />
both hard and soft cluster ensembles, such as PLA (Lange and Buhmann, 2005) or HGBF
(Fern and Brodley, 2004), while others, originally devised for hard cluster ensembles, can be<br />
adapted for their use as soft partition combiners with relative ease (e.g. (Strehl and Ghosh,<br />
2002)).<br />
In this chapter, we make several proposals regarding the application of consensus processes
on soft cluster ensembles. First, the notion of soft cluster ensembles is reviewed
in section 6.1. Next, in section 6.2, we describe a procedure for adapting the hard consensus
functions employed in this work to soft cluster ensembles. In section 6.3, a family of
novel soft consensus functions based on the application of cluster disambiguation and voting
strategies is proposed. Finally, the results of several experiments regarding the performance
of the proposed soft consensus functions are presented in section 6.4, and the discussion of
section 6.5 concludes the chapter.
6.1 Soft cluster ensembles<br />
As described in chapter 2, cluster ensembles are nothing but the compilation of the outputs
of multiple (namely l) clustering processes. Focusing on a fuzzy clustering scenario, and
making the simplifying assumption that all l clustering processes partition the data into k
clusters, a soft cluster ensemble E is represented by means of a kl × n real-valued matrix:
E = ⎛ Λ1 ⎞
    ⎜ Λ2 ⎟
    ⎜ ⋮  ⎟
    ⎝ Λl ⎠   (6.6)
where Λi is the k × n real-valued clustering matrix resulting from the ith soft clustering<br />
process (∀i ∈ [1,l]).<br />
Recall that the contents of each clustering matrix Λi enclosed in the soft cluster ensemble
E result from the execution of a fuzzy clustering process, and that, depending on its nature,
the interpretation of the scalar values that ultimately make up E may differ largely. Thus,
for conducting a consensus process on the soft cluster ensemble E it is necessary that such
values hold the same type of proportionality with respect to the degree of association between
objects and clusters (i.e. they are all either directly or inversely proportional to it).
This prerequisite becomes more evident if an analogy between soft clustering and voting
procedures is established. This analogy is inspired by the parallelism between supervised
classification processes and voting drawn in (van Erp, Vuurpijl, and Schomaker, 2002).
According to it, the process of fuzzily clustering an object can be regarded as an
election, in which the clusterer (regarded as a voter) casts its preference for each one of the
clusters (or candidates). Put that way, it becomes quite obvious that, when the results of
several fuzzy clustering processes are gathered into a soft cluster ensemble with the purpose
of building a consolidated clustering solution upon it, they should be directly comparable
(possibly after applying some scale normalization).
Regardless of the characteristics and nature of soft cluster ensembles, it is interesting<br />
to evaluate how classic consensus functions (i.e. those originally designed to combine crisp<br />
partitions) can be applied on the fuzzy consensus clustering problem. The next section<br />
deals with this very issue.<br />
6.2 Adapting consensus functions to soft cluster ensembles<br />
The consensus functions employed so far in the experimental sections of this work (i.e.<br />
CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD) are originally designed to<br />
operate on hard cluster ensembles (see appendix A.5 for a description). Nevertheless, they<br />
can be easily adapted for combining fuzzy partitions. The key point is that all these<br />
consensus functions base their clustering combination processes on object co-association<br />
matrices (i.e. matrices the contents of which estimate the degree of similarity between<br />
objects upon the partitions contained in the cluster ensemble). Fortunately, this type of<br />
co-association matrices can be derived not only from hard cluster ensembles, but also from<br />
their soft counterparts, which makes these consensus functions easily applicable on the<br />
fuzzy consensus clustering problem. In the following paragraphs, we elaborate on this issue,
resorting again to the previous toy example and drawing parallels between the hard and
soft clustering scenarios. For brevity, the comparison is restricted to fuzzy clustering
processes that codify object to cluster associations in terms of membership probabilities,
although an equivalent study could be formulated for the case where these are expressed
by means of magnitudes inversely proportional to the strength of object to cluster
associations, such as object to cluster centroid distances.
Consider the hard clustering solution of equation (6.3) corresponding to our clustering<br />
toy example, that is:<br />
λ = [2 2 2 1 1 1 3 3 3]
Notice that an equivalent representation of this partition can also be given by a k × n
incidence matrix Iλ (called binary membership indicator matrix in (Strehl and Ghosh,
2002)), the (i,j)th entry of which is equal to 1 in the case the jth object is assigned to
the ith cluster according to λ, and 0 otherwise; see equation (6.7), which presents the
incidence matrix corresponding to the label vector λ of equation (6.3).
Iλ = ⎛ 0 0 0 1 1 1 0 0 0 ⎞
     ⎜ 1 1 1 0 0 0 0 0 0 ⎟
     ⎝ 0 0 0 0 0 0 1 1 1 ⎠   (6.7)
Notice that the information contained in Iλ is somehow comparable to the contents of<br />
a soft clustering matrix Λ, in the sense that they both express the degree of association<br />
between objects and clusters. For illustration purposes, equation (6.8) presents the fuzzy<br />
clustering matrix Λ output by the FCM clustering algorithm on the artificial data set of<br />
figure 6.1. In fact, rounding each element of this clustering matrix Λ to the nearest integer<br />
would indeed yield the incidence matrix Iλ of equation (6.7).<br />
Λ = ⎛ 0.054 0.026 0.057 0.969 0.976 0.959 0.009 0.016 0.010 ⎞
    ⎜ 0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017 ⎟
    ⎝ 0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972 ⎠   (6.8)
The construction of object co-association matrices given the incidence matrix Iλ built<br />
upon a hard clustering solution λ is pretty straightforward, and it only requires computing<br />
some matrix products.<br />
In particular, the object co-association matrix Oλ is computed as the product between<br />
the transpose of Iλ and Iλ. The object co-association matrix corresponding to the hard<br />
clustering solution obtained on our toy clustering example is presented in equation (6.9).<br />
In fact, Oλ is an n × n adjacency matrix, the (i,j)th entry of which equals 1 if the ith
and the jth objects are placed in the same cluster, or 0 otherwise (Kuncheva, Hadjitodorov,<br />
and Todorova, 2006). The name object co-association matrix stems from the fact that the<br />
contents of Oλ indicate, from a clustering viewpoint, the degree of similarity between the<br />
n objects in the data set.<br />
Oλ = Iλᵀ Iλ = ⎛ 1 1 1 0 0 0 0 0 0 ⎞
              ⎜ 1 1 1 0 0 0 0 0 0 ⎟
              ⎜ 1 1 1 0 0 0 0 0 0 ⎟
              ⎜ 0 0 0 1 1 1 0 0 0 ⎟
              ⎜ 0 0 0 1 1 1 0 0 0 ⎟
              ⎜ 0 0 0 1 1 1 0 0 0 ⎟
              ⎜ 0 0 0 0 0 0 1 1 1 ⎟
              ⎜ 0 0 0 0 0 0 1 1 1 ⎟
              ⎝ 0 0 0 0 0 0 1 1 1 ⎠   (6.9)
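The chain from label vector to incidence matrix to object co-association matrix can be reproduced in a few lines. The following is an illustrative sketch with our own function names, not code from any of the referenced implementations:

```python
import numpy as np

def incidence_matrix(labels, k):
    """Build the k x n binary membership indicator matrix of a hard labeling."""
    n = len(labels)
    I = np.zeros((k, n), dtype=int)
    for j, lab in enumerate(labels):
        I[lab - 1, j] = 1  # labels are 1-based, rows correspond to clusters
    return I

# Hard labeling of equation (6.3).
labels = [2, 2, 2, 1, 1, 1, 3, 3, 3]
I = incidence_matrix(labels, k=3)

# Object co-association matrix of equation (6.9): O = I^T I.
O = I.T @ I

# O[i, j] is 1 iff objects i and j share a cluster, so O is block diagonal here.
assert O[0, 1] == 1 and O[0, 3] == 0 and O[6, 8] == 1
```

Because the toy objects are ordered by cluster, the resulting matrix exhibits the three blocks of ones visible in equation (6.9).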
In a fuzzy clustering scenario, the object co-association matrix can easily be derived by
simply multiplying the transpose of the clustering matrix Λ by Λ itself. Resorting again to our
toy clustering example, the resulting object co-association matrix (denoted as OΛ) is presented
in equation (6.10).
It is easy to see that OΛ is indeed a fuzzy adjacency matrix: assuming statistical independence
between the cluster assignments of objects i and j, its (i,j)th entry can be interpreted
as the joint probability that objects i and j are placed in the same cluster
by the clusterer. However, statistical independence does not hold for the elements of the
diagonal of OΛ, as each object is always “co-clustered” with itself, which requires
setting the elements on the diagonal of OΛ to 1.
OΛ = Λᵀ Λ = ⎛ 0.852 0.861 0.838 0.076 0.070 0.080 0.038 0.075 0.040 ⎞
            ⎜ 0.861 0.871 0.847 0.049 0.043 0.053 0.054 0.091 0.057 ⎟
            ⎜ 0.838 0.847 0.824 0.078 0.073 0.082 0.050 0.086 0.053 ⎟
            ⎜ 0.076 0.049 0.078 0.940 0.946 0.930 0.015 0.022 0.016 ⎟
            ⎜ 0.070 0.043 0.073 0.946 0.953 0.937 0.014 0.021 0.015 ⎟
            ⎜ 0.080 0.053 0.082 0.930 0.937 0.921 0.020 0.027 0.021 ⎟
            ⎜ 0.038 0.054 0.050 0.015 0.014 0.020 0.953 0.908 0.949 ⎟
            ⎜ 0.075 0.091 0.086 0.022 0.021 0.027 0.908 0.866 0.904 ⎟
            ⎝ 0.040 0.057 0.053 0.016 0.015 0.021 0.949 0.904 0.945 ⎠   (6.10)
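The fuzzy case follows the same recipe; this minimal sketch (our own, using the FCM memberships of equation (6.8)) shows the only subtlety, namely forcing the diagonal to unity:

```python
import numpy as np

# Fuzzy clustering matrix of equation (6.8) (FCM memberships, rows = clusters).
Lambda = np.array([
    [0.054, 0.026, 0.057, 0.969, 0.976, 0.959, 0.009, 0.016, 0.010],
    [0.921, 0.932, 0.905, 0.025, 0.019, 0.030, 0.014, 0.055, 0.017],
    [0.025, 0.042, 0.038, 0.006, 0.005, 0.011, 0.976, 0.929, 0.972],
])

# Soft object co-association matrix of equation (6.10): O_Lambda = Lambda^T Lambda.
O_soft = Lambda.T @ Lambda

# Statistical independence does not hold on the diagonal, so set it to 1.
np.fill_diagonal(O_soft, 1.0)

# Entry (0, 1) approximates the joint probability that objects 1 and 2 co-cluster.
print(round(O_soft[0, 1], 3))  # → 0.861
```

The off-diagonal entries reproduce those of equation (6.10) up to rounding.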
Quite obviously, consensus functions based on object co-association matrices do not
operate on the matrices Oλ or OΛ, as these are derived from a single clustering solution. However,
the computation of object co-association matrices can easily be extended to a set of
clustering solutions compiled in either a hard or a soft cluster ensemble. We start with the
derivation of the object co-association matrix upon a hard cluster ensemble E containing l
clusterings, represented by means of an l × n integer-valued matrix:
E = ⎛ λ1 ⎞
    ⎜ λ2 ⎟
    ⎜ ⋮  ⎟
    ⎝ λl ⎠   (6.11)
In this case, the incidence matrix of the hard cluster ensemble, denoted as IEλ, is a
kl × n matrix resulting from the concatenation of the incidence matrices of the l clusterings
that make up the ensemble (see equation (6.12)).
IEλ = ⎛ Iλ1 ⎞
      ⎜ Iλ2 ⎟
      ⎜  ⋮  ⎟
      ⎝ Iλl ⎠   (6.12)
As in the case of a single clustering, the object co-association matrix of the hard cluster<br />
ensemble, denoted as OEλ , can be derived upon IEλ by simple matrix multiplication, as<br />
presented in equation (6.13).<br />
OEλ = IEλᵀ IEλ   (6.13)
Drawing a parallelism with respect to the interpretation of the object co-association<br />
matrix Oλ built upon a single clustering, it can be stated that the (i,j)th entry of OEλ<br />
indicates the proportion of clusterers that put the ith and jth objects in the same cluster.<br />
Porting this same idea to the soft clustering scenario, the soft version of the object
co-association matrix OEλ (that we denote as OEΛ) is computed following a procedure
analogous to the one just reported, summarized in equation (6.14).
OEΛ = Eᵀ E = [Λ1ᵀ Λ2ᵀ ... Λlᵀ] ⎛ Λ1 ⎞
                               ⎜ Λ2 ⎟
                               ⎜ ⋮  ⎟
                               ⎝ Λl ⎠   (6.14)
where E is the kl × n matrix representing the soft cluster ensemble made up by the compilation<br />
of the l fuzzy clustering matrices Λi, ∀i ∈ [1,l]. Just like in the case of the fuzzy<br />
adjacency matrix OΛ derived upon a single clustering (see equation (6.10)), it is necessary<br />
to set the elements of the diagonal of OEΛ to unity, as each object is always “co-clustered”<br />
with itself.<br />
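Putting the pieces together, the ensemble-level co-association matrix of equation (6.14) amounts to stacking the l clustering matrices and taking one matrix product. The sketch below is our own illustration on two randomly generated membership matrices; dividing by l, so that the entries remain comparable across ensemble sizes, is our own choice of scale:

```python
import numpy as np

k, n = 3, 9
rng = np.random.default_rng(0)

def random_soft_partition():
    """A hypothetical membership-coded clustering matrix: columns sum to 1."""
    M = rng.random((k, n))
    return M / M.sum(axis=0)

Lambdas = [random_soft_partition(), random_soft_partition()]
l = len(Lambdas)

# Soft cluster ensemble of equation (6.6): a (k*l) x n stacked matrix.
E = np.vstack(Lambdas)

# Ensemble co-association matrix of equation (6.14), averaged over the l clusterers.
O_ens = (E.T @ E) / l
np.fill_diagonal(O_ens, 1.0)  # each object always co-clusters with itself

assert O_ens.shape == (n, n)
assert np.allclose(O_ens, O_ens.T)
```

The resulting n × n matrix can then be fed to any of the co-association-based consensus functions discussed next.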
As aforementioned, all the hard consensus functions used so far in this work employ<br />
object co-association matrices, and most of them explicitly base their consensus processes<br />
on them. This is the case of the CSPA, EAC, ALSAD, KMSAD and SLSAD consensus<br />
functions, which differ among them in the way the object co-association matrix OEλ is<br />
interpreted. On one hand, some of them construe OEλ<br />
as an object similarity matrix,<br />
obtaining the consensus partition by applying some similarity-based clustering algorithm,<br />
such as graph partitioning (in CSPA (Strehl and Ghosh, 2002)) or hierarchical clustering<br />
(EAC (Fred and Jain, 2005)).<br />
On the other hand, the so-called similarity-as-data consensus functions interpret the ith
row of the object co-association matrix OEλ as a set of n new features representing the
ith object, thus applying some standard clustering algorithm on it for obtaining the consensus
clustering solution. The application of the single-link and average-link
hierarchical clustering algorithms gives rise to the SLSAD and ALSAD consensus functions,
whereas the KMSAD consensus function conducts this partitioning by means
of k-means (Kuncheva, Hadjitodorov, and Todorova, 2006).
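To make the similarity-based route concrete, the toy sketch below derives a consensus partition from a co-association matrix by linking every pair of objects whose co-association exceeds a threshold and taking connected components as clusters. This is a deliberately simplified stand-in for the full hierarchical clustering used by EAC or SLSAD, not their actual implementation:

```python
import numpy as np

def consensus_by_thresholding(O, threshold=0.5):
    """Toy consensus: connect objects with co-association > threshold,
    then return connected components as 1-based cluster labels."""
    n = O.shape[0]
    labels = [0] * n
    current = 0
    for seed in range(n):
        if labels[seed]:
            continue
        current += 1
        stack = [seed]
        while stack:  # depth-first flood fill over the similarity graph
            i = stack.pop()
            if labels[i]:
                continue
            labels[i] = current
            stack.extend(j for j in range(n) if not labels[j] and O[i, j] > threshold)
    return labels

# Block-diagonal co-association matrix of the toy example (equation (6.9)).
O = np.kron(np.eye(3), np.ones((3, 3)))
print(consensus_by_thresholding(O))  # → [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

On the clean block-diagonal matrix of the toy example, any threshold in (0, 1) recovers the three original clusters; real ensembles produce graded co-associations, which is why the actual consensus functions use proper clustering algorithms instead of a single cut.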
For its part, the HGPA consensus function considers the cluster ensemble incidence
matrix IEλ to be the adjacency matrix of a hypergraph with n vertices and kl hyperedges.<br />
The consensus clustering process is regarded as the partitioning of such hypergraph by<br />
cutting the minimum number of hyperedges, a procedure that takes the object co-association<br />
matrix OEλ as its input (Strehl and Ghosh, 2002).<br />
Lastly, the MCLA consensus function tackles the consensus clustering problem as a
process of clustering clusters, which also implies interpreting the cluster ensemble incidence<br />
matrix IEλ as a hypergraph adjacency matrix. Again, the algorithmic analysis² of this
consensus function reveals that such procedure requires the object co-association matrix
OEλ as one of its parameters (Strehl and Ghosh, 2002).
Given that all seven consensus functions employ the object co-association
matrix OEλ as their main input parameter when operating on hard cluster ensembles, devising
their versions for soft cluster ensembles becomes pretty straightforward, as it simply requires
substituting the object co-association matrix derived from a hard cluster ensemble, OEλ, by
its fuzzy counterpart, OEΛ. Notice that, despite taking object co-association matrices
derived upon a soft cluster ensemble as their input, all these consensus functions output a
hard consensus clustering solution λc.
² The Matlab source code of the CSPA, HGPA and MCLA consensus functions is available for download
at http://www.strehl.com.
As mentioned in the introduction of this section, we have established an analogy between
hard and soft consensus functions based on the similarities between object co-association
matrices, assuming that the results of fuzzy clustering are expressed in terms of membership
probabilities. However, we consider the present study to be quite generic, as it could be
extended to the case where fuzzy clustering results are expressed in any other form, by
transforming them into membership probabilities prior to computing the corresponding
object co-association matrices.
6.3 Voting based consensus functions<br />
In this section, we put forward a set of proposals in the shape of a family of novel consensus
functions especially devised for their application on soft cluster ensembles. These consensus
functions are inspired by voting strategies, which have also been a source of inspiration for
the development of systems for combining supervised classifiers (van Erp, Vuurpijl, and
Schomaker, 2002), search engines (Aslam and Montague, 2001), or word sense disambiguation
systems (Buscaldi and Rosso, 2007). A distinguishing factor of the consensus functions
we propose in this section is that they yield fuzzy consensus clustering solutions, whereas
other soft consensus functions found in the literature output crisp consensus clusterings,
despite being applied on soft cluster ensembles (Punera and Ghosh, 2007).
In fact, voting is a formal way of combining the opinions of several voters into a single<br />
decision (e.g. the election of a president). Therefore, it seems quite logical that voting<br />
strategies can be readily applied for combining the outcomes of multiple decision systems,<br />
using the voting strategy that best fits the way these decisions are expressed.<br />
Given that a clusterer is an unsupervised classifier, the most natural parallelism we can<br />
establish is related to voting based supervised classifier combination (aka classifier ensembles).<br />
In this case, each classifier is a voter, the possible categories objects can be filed under<br />
are the candidates, and an election is the classification of an object (van Erp, Vuurpijl, and<br />
Schomaker, 2002). Quite obviously, the voting strategy employed for combining the votes<br />
–and thus obtain the winner of the election (i.e. the resulting classification of the object by<br />
the ensemble of classifiers)– depends on how votes are expressed, be it an assignment to a<br />
single class (i.e. single-label classification (Sebastiani, 2002)), or an either ranked or scored<br />
list of classes. The former case calls for the application of unweighted voting methods such
as plurality or majority voting, whereas the latter makes it possible to apply more sophisticated
voting strategies such as confidence and ranking voting methods (van Erp, Vuurpijl,
and Schomaker, 2002).
Nevertheless, prior to the application of any voting strategy on soft cluster ensembles,<br />
there is a crucial problem to be solved. This has to do with the inherent unsupervised<br />
nature of clustering processes, which leaves clusters ambiguously identified. Therefore,
it is necessary to perform a cluster alignment (or disambiguation) process before voting.<br />
Notice that this is not an issue of concern when applying voting strategies on supervised<br />
171
6.3. Voting based consensus functions<br />
classifier ensembles, as categories (i.e. candidates) are univocally defined in that case. We<br />
elaborate on the cluster disambiguation technique employed in this work in section 6.3.1. It<br />
is important to highlight that consensus functions based on object co-association matrices<br />
circumvent this inconvenience (Lange and Buhmann, 2005), although their main drawback
is that the complexity of the object co-association matrix computation is quadratic with<br />
the number of objects in the data set (Long, Zhang, and Yu, 2005).<br />
The problem of combining the outcomes of multiple soft clustering processes by means<br />
of voting strategies implies interpreting the contents of the soft cluster ensemble as the<br />
preference of each voter (clusterer) for each candidate (cluster), as soft clustering algorithms<br />
output the degree of association of each object to all the clusters. For this reason, voting<br />
methods capable of dealing with voters’ preferences (in particular, confidence and ranking<br />
voting strategies) are the basis of our consensus functions, as they lend themselves naturally<br />
to be applied in this context. However, care must be taken as regards how these preferences<br />
are expressed, that is, if they are directly or inversely proportional to the strength of the<br />
association between objects and clusters (e.g. membership probabilities or distances to<br />
centroids, respectively). In section 6.3.2, we describe four voting strategies that give rise to<br />
the proposed consensus functions.<br />
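As a foretaste of how such preferences can be combined, one simple confidence-voting scheme sums the (previously disambiguated) membership matrices and renormalizes the columns, yielding a fuzzy consensus clustering matrix. The sketch below is our own illustration, assuming probability-coded and already aligned clusterings; it is not necessarily one of the exact strategies detailed in section 6.3.2:

```python
import numpy as np

def confidence_voting(aligned_lambdas):
    """Sum the (already disambiguated) membership matrices of all clusterers
    and renormalize the columns, yielding a fuzzy consensus clustering matrix."""
    total = np.sum(aligned_lambdas, axis=0)  # elementwise sum over the l voters
    return total / total.sum(axis=0)         # columns sum to 1 again

# Two hypothetical aligned clusterers voting on 3 objects and 2 clusters.
L1 = np.array([[0.9, 0.2, 0.6],
               [0.1, 0.8, 0.4]])
L2 = np.array([[0.7, 0.4, 0.4],
               [0.3, 0.6, 0.6]])

consensus = confidence_voting([L1, L2])
print(consensus[:, 0])  # → [0.8 0.2]
```

Note that the output is itself a valid clustering matrix, so the result can be kept fuzzy or hardened afterwards by the maximum-membership rule.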
6.3.1 Cluster disambiguation<br />
In this section, we elaborate on the problem of cluster disambiguation, also known as the<br />
cluster correspondence problem.<br />
As pointed out earlier, a single hard clustering solution can be expressed by multiple<br />
equivalent labeling vectors λ, due to the symbolic nature of the labels clusters are identified<br />
with. This also occurs in soft clustering, as the permutation of the rows of a clustering<br />
matrix Λ also gives rise to equivalent fuzzy partitions. Quite obviously, this cluster identification<br />
ambiguity also arises among the multiple clustering solutions compiled in a cluster
ensemble E, and thus, it becomes an issue of concern when it comes to applying voting<br />
strategies for conducting consensus clustering, given the equivalence between clusters and<br />
candidates defined by the previously described analogy with voting procedures. For this<br />
reason, our voting-based consensus functions for soft cluster ensembles make use of a cluster<br />
disambiguation technique prior to proper voting.<br />
In particular, we require from this method the ability to solve the cluster re-labeling
problem, an instance of the cluster correspondence problem in which a one-to-one correspondence
between clusters is considered (recall that, in this work, all the clusterings in the
ensemble and the consensus clustering are assumed to have the same number of clusters,
namely k).
To solve the cluster re-labeling problem we make use of the Hungarian method (also
known as the Kuhn-Munkres algorithm or Munkres assignment algorithm) (Kuhn, 1955), a
technique that makes it possible to obtain the most consistent alignment among the different
clusterings (Ayad and Kamel, 2008).
Given a pair of clustering solutions with k clusters each, the Hungarian method is capable<br />
of finding, among the k! possible cluster permutations, the one that maximizes the overlap<br />
between them. In particular, such cluster permutation is applied on one of the two clustering<br />
solutions, while the other is taken as a reference. Put in probabilistic terms, the Hungarian<br />
algorithm selects the cluster permutation that best fits the empirical cluster assignment<br />
172
Chapter 6. Voting based consensus functions for soft cluster ensembles<br />
probabilities estimated from the reference clustering solution (i.e. it finds the optimal<br />
cluster permutation that yields the largest probability mass over all cluster assignment<br />
probabilities (Fischer and Buhmann, 2003)). Depending on whether the aforementioned<br />
clustering solutions correspond to hard or fuzzy partitions, cluster permutations amount to<br />
label reassignments or to row order rearrangements, respectively.<br />
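For small k, the optimal relabeling can also be found by exhaustive search over the k! permutations, which makes the behaviour of the method easy to inspect (the Hungarian algorithm reaches the same optimum without enumerating all permutations). The sketch below, with our own function names, aligns the two toy hard labelings used later in this section:

```python
from itertools import permutations

def best_relabeling(ref, other, k):
    """Exhaustively find the permutation of 'other''s cluster labels that
    maximizes the overlap with the reference labeling 'ref' (labels 1..k)."""
    best_perm, best_overlap = None, -1
    for perm in permutations(range(1, k + 1)):
        # perm[c-1] is the new identity given to cluster c of 'other'
        relabeled = [perm[lab - 1] for lab in other]
        overlap = sum(r == o for r, o in zip(ref, relabeled))
        if overlap > best_overlap:
            best_perm, best_overlap = perm, overlap
    return best_perm, best_overlap

lambda1 = [2, 2, 2, 1, 1, 1, 3, 3, 3]
lambda2 = [1, 1, 3, 3, 3, 3, 3, 2, 2]

perm, overlap = best_relabeling(lambda1, lambda2, k=3)
print(perm, overlap)  # → (2, 3, 1) 7
```

The returned tuple maps each cluster of λ2 onto a cluster label of λ1; under this alignment seven of the nine objects end up consistently labeled, which is the best achievable overlap for this pair of partitions.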
The Hungarian algorithm poses the cluster correspondence problem as a weighted bipartite
matching problem, solving it in O(k³) time. A beautiful analysis of its error probability
can be found in (Topchy et al., 2004). In this work, we have employed the implementation
of (Buehren, 2008), which bases the cluster disambiguation process upon a measure of
the dissimilarity between the clusters of the two clustering solutions under consideration.
Cluster dissimilarity is usually embodied in a k × k matrix, the (i,j)th entry of which is<br />
proportional to the degree of dissimilarity between the ith cluster of one of the clustering<br />
solutions and the jth cluster of the other one.<br />
Cluster dissimilarity can easily be derived upon the considered pair of clustering solutions,
regardless of whether they are hard or fuzzy partitions, as we show next. In the crisp
case, a cluster similarity matrix Sλ1,λ2 can be obtained by a simple matrix product between
the incidence matrices of both clusterings, λ1 and λ2 (see equation (6.15)).

Sλ1,λ2 = Iλ1 Iλ2ᵀ   (6.15)
For illustration purposes, consider the two crisp clustering solutions of equation (6.16):

λ1 = [2 2 2 1 1 1 3 3 3]
λ2 = [1 1 3 3 3 3 3 2 2]   (6.16)
The incidence matrices corresponding to λ1 and λ2 are presented in equation (6.17).

Iλ1 = ⎛ 0 0 0 1 1 1 0 0 0 ⎞
      ⎜ 1 1 1 0 0 0 0 0 0 ⎟
      ⎝ 0 0 0 0 0 0 1 1 1 ⎠

Iλ2 = ⎛ 1 1 0 0 0 0 0 0 0 ⎞
      ⎜ 0 0 0 0 0 0 0 1 1 ⎟
      ⎝ 0 0 1 1 1 1 1 0 0 ⎠   (6.17)
The cluster similarity matrix derived upon these two clustering solutions is the one<br />
presented in equation (6.18).<br />
6.3. Voting based consensus functions<br />
S_{λ1,λ2} = I_{λ1} I_{λ2}^T =
[ 0 0 3 ]
[ 2 0 1 ]
[ 0 2 1 ]   (6.18)
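The construction of equations (6.15)–(6.18) can be reproduced mechanically. The snippet below is an illustrative sketch only (the function names are ours, and 1-based cluster labels are assumed, as in the toy example):

```python
def incidence(labels, k):
    # k x n incidence matrix: row i marks the objects assigned to cluster i+1
    return [[1 if lab == i + 1 else 0 for lab in labels] for i in range(k)]

def cluster_similarity(l1, l2, k):
    # S[i][j] = number of objects assigned both to cluster i+1 of l1
    # and to cluster j+1 of l2, i.e. the product I_l1 * I_l2^T of eq. (6.15)
    I1, I2 = incidence(l1, k), incidence(l2, k)
    return [[sum(a * b for a, b in zip(I1[i], I2[j])) for j in range(k)]
            for i in range(k)]

# The toy clusterings of equation (6.16)
S = cluster_similarity([2, 2, 2, 1, 1, 1, 3, 3, 3],
                       [1, 1, 3, 3, 3, 3, 3, 2, 2], 3)
# S reproduces the matrix of equation (6.18)
```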
Firstly, notice that S_{λ1,λ2} is not a symmetric matrix (as object co-association matrices are), since its rows and columns correspond to different entities. In fact, the (i,j)th element of S_{λ1,λ2} is equal to the number of objects assigned both to the ith cluster of λ1 and to the jth cluster of λ2, thus clearly indicating the degree of resemblance of these clusters. Deriving a cluster dissimilarity matrix from S_{λ1,λ2} is straightforward, since the implementation of the Hungarian method employed in this work does not require the cluster dissimilarity measures to verify any special property as regards their scale. The result of the cluster disambiguation method applied on this toy example is the cluster correspondence vector π_{λ1,λ2} presented in equation (6.19).
π_{λ1,λ2} = [3 1 2]   (6.19)
The interpretation of the cluster correspondence vector π_{λ1,λ2} is that the cluster identified with the ‘1’ label in λ1 corresponds to the cluster with label ‘3’ of λ2, the cluster labeled as ‘2’ in λ1 must be aligned with the cluster with label ‘1’ of λ2, and the cluster ‘3’ of λ1 is equivalent to the cluster with label ‘2’ of λ2.
The most usual way of representing the information contained in the cluster correspondence vector π_{λ1,λ2} is by means of a cluster permutation matrix P_{λ1,λ2}. In general, P_{λ1,λ2} is a k × k matrix whose entries are all zero except that the π_{λ1,λ2}(i)th entry of the ith row is equal to 1. The cluster permutation matrix corresponding to our toy example is presented in equation (6.20). Notice how all of its entries are zero except for the third entry of the first row (as π_{λ1,λ2}(1) = 3), the first entry of the second row (as π_{λ1,λ2}(2) = 1) and the second entry of the third row (as π_{λ1,λ2}(3) = 2).
P_{λ1,λ2} =
[ 0 0 1 ]
[ 1 0 0 ]
[ 0 1 0 ]   (6.20)
In order to obtain the cluster permuted version of the clustering λ1, it is necessary to multiply the transpose of the cluster permutation matrix P_{λ1,λ2} by the incidence matrix associated to this clustering, I_{λ1}, which yields the cluster permuted incidence matrix I_{λ1}^{π_{λ1,λ2}}, as indicated in equation (6.21).

I_{λ1}^{π_{λ1,λ2}} = P_{λ1,λ2}^T I_{λ1}   (6.21)

In our example, the cluster permuted incidence matrix is:
I_{λ1}^{π_{λ1,λ2}} =
[ 0 1 0 ]   [ 0 0 0 1 1 1 0 0 0 ]   [ 1 1 1 0 0 0 0 0 0 ]
[ 0 0 1 ] · [ 1 1 1 0 0 0 0 0 0 ] = [ 0 0 0 0 0 0 1 1 1 ]
[ 1 0 0 ]   [ 0 0 0 0 0 0 1 1 1 ]   [ 0 0 0 1 1 1 0 0 0 ]   (6.22)
Therefore, assigning each object to the cluster it is most strongly associated to transforms the cluster permuted incidence matrix I_{λ1}^{π_{λ1,λ2}} into the disambiguated crisp clustering λ1^{π_{λ1,λ2}}. In equations (6.23) and (6.24), this clustering is presented alongside λ2 —the clustering that has been taken as the reference of the cluster disambiguation process.

λ1^{π_{λ1,λ2}} = [1 1 1 3 3 3 2 2 2]   (6.23)
λ2 = [1 1 3 3 3 3 3 2 2]   (6.24)
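The whole crisp disambiguation walkthrough above can be condensed into a short sketch. For clarity it replaces the O(k³) Hungarian method with a brute-force search over the k! cluster permutations (equivalent at these toy sizes, hopeless for large k); the function name is ours, not the thesis’:

```python
from itertools import permutations

def disambiguate(l1, l2, k):
    # Relabel l1 so that its clusters match those of the reference l2.
    # S[i][j]: objects assigned both to cluster i+1 of l1 and cluster j+1 of l2
    S = [[sum(1 for a, b in zip(l1, l2) if a == i + 1 and b == j + 1)
          for j in range(k)] for i in range(k)]
    # Brute-force stand-in for the Hungarian method: pi[i] is the label of
    # the l2 cluster matched to cluster i+1 of l1 (maximum total similarity)
    pi = max(permutations(range(1, k + 1)),
             key=lambda p: sum(S[i][p[i] - 1] for i in range(k)))
    return [pi[lab - 1] for lab in l1], list(pi)

aligned, pi = disambiguate([2, 2, 2, 1, 1, 1, 3, 3, 3],
                           [1, 1, 3, 3, 3, 3, 3, 2, 2], 3)
# pi reproduces equation (6.19); aligned reproduces equation (6.23)
```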
In the context of our voting-based soft consensus functions, though, cluster disambiguation is conducted on pairs of soft clustering solutions. In order to illustrate how to proceed in this case, we use a toy example that is the fuzzy version of the one just reported. For brevity, we will only consider the case in which object-to-cluster associations are expressed in terms of membership probabilities, although an analogous procedure could be devised if they were expressed by means of other metrics. Therefore, given the two fuzzy partitions Λ1 and Λ2 of equation (6.25):
Λ1 =
[ 0.054 0.026 0.057 0.969 0.976 0.959 0.009 0.016 0.010 ]
[ 0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017 ]
[ 0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972 ]

Λ2 =
[ 0.932 0.921 0.019 0.030 0.014 0.025 0.057 0.017 0.055 ]
[ 0.042 0.025 0.005 0.011 0.009 0.006 0.038 0.972 0.929 ]
[ 0.026 0.054 0.976 0.959 0.976 0.969 0.905 0.010 0.016 ]   (6.25)
The cluster similarity matrix S_{Λ1,Λ2} is computed upon the soft clustering matrices themselves, as described by equation (6.26).
S_{Λ1,Λ2} = Λ1 Λ2^T =
[ 0.143 0.054 2.878 ]
[ 1.738 0.136 1.042 ]
[ 0.188 1.845 0.969 ]   (6.26)
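The entries of equation (6.26) are plain row-by-row dot products, so they can be checked mechanically (a verification sketch, not part of the consensus functions themselves):

```python
L1 = [[0.054, 0.026, 0.057, 0.969, 0.976, 0.959, 0.009, 0.016, 0.010],
      [0.921, 0.932, 0.905, 0.025, 0.019, 0.030, 0.014, 0.055, 0.017],
      [0.025, 0.042, 0.038, 0.006, 0.005, 0.011, 0.976, 0.929, 0.972]]
L2 = [[0.932, 0.921, 0.019, 0.030, 0.014, 0.025, 0.057, 0.017, 0.055],
      [0.042, 0.025, 0.005, 0.011, 0.009, 0.006, 0.038, 0.972, 0.929],
      [0.026, 0.054, 0.976, 0.959, 0.976, 0.969, 0.905, 0.010, 0.016]]

# S[i][j] = dot product of the ith row of L1 with the jth row of L2,
# i.e. the (i,j)th entry of L1 * L2^T
S = [[sum(a * b for a, b in zip(r1, r2)) for r2 in L2] for r1 in L1]
# The first row rounds to [0.143, 0.054, 2.878], as in equation (6.26)
```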
The interpretation of the contents of S_{Λ1,Λ2} is the same as in the crisp scenario, i.e. its (i,j)th element is proportional to the similarity between the ith cluster of Λ1 and the jth cluster of Λ2. Again, transforming S_{Λ1,Λ2} into a cluster dissimilarity matrix is the final step before solving the weighted bipartite matching problem using the Hungarian method implementation of (Buehren, 2008), thus obtaining the cluster correspondence vector π_{Λ1,Λ2}
of equation (6.27) (notice that this is exactly the same permutation vector of equation (6.19),<br />
as the present toy example is the fuzzy version of the former).<br />
π_{Λ1,Λ2} = [3 1 2]   (6.27)
Although the interpretation of the cluster correspondence vector is equivalent in both<br />
the hard and the soft clustering scenarios (i.e. the cluster that is given the number ‘1’<br />
identifier in Λ1 corresponds to the cluster number ‘3’ of Λ2, and so on), recall that, in the<br />
fuzzy case, cluster permutations are equivalent to row order rearrangements.<br />
Consequently, in order to obtain the cluster permuted version of the fuzzy partition Λ1, it is necessary to multiply the transpose of the cluster permutation matrix P_{Λ1,Λ2} associated to the cluster correspondence vector π_{Λ1,Λ2} by the fuzzy partition Λ1 itself. As a result, the cluster permuted soft clustering Λ1^{π_{Λ1,Λ2}} is obtained —see equation (6.28) for the pair of cluster aligned fuzzy clustering solutions of our toy example.
Λ1^{π_{Λ1,Λ2}} =
[ 0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017 ]
[ 0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972 ]
[ 0.054 0.026 0.057 0.969 0.976 0.959 0.009 0.016 0.010 ]

Λ2 =
[ 0.932 0.921 0.019 0.030 0.014 0.025 0.057 0.017 0.055 ]
[ 0.042 0.025 0.005 0.011 0.009 0.006 0.038 0.972 0.929 ]
[ 0.026 0.054 0.976 0.959 0.976 0.969 0.905 0.010 0.016 ]   (6.28)
Given a cluster ensemble E containing a set of l soft clustering solutions, the cluster disambiguation process consists in taking one of them as a reference and applying the Hungarian method sequentially on the remaining l − 1 clustering solutions (Topchy et al., 2004). As a result, a cluster aligned version of the cluster ensemble is obtained, and voting procedures can readily be applied on it.
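This sequential alignment against a reference can be sketched as follows (again with a brute-force permutation search standing in for the Hungarian method, and with hypothetical function names; for fuzzy partitions a cluster permutation is a row reordering):

```python
from itertools import permutations

def align_ensemble(ensemble):
    # Align every k x n soft clustering to the first one, taken as reference
    ref = ensemble[0]
    k = len(ref)
    aligned = [ref]
    for L in ensemble[1:]:
        # S[i][j]: similarity between cluster i of L and cluster j of ref
        S = [[sum(a * b for a, b in zip(L[i], ref[j])) for j in range(k)]
             for i in range(k)]
        # best[i]: reference cluster matched to cluster i of L
        best = max(permutations(range(k)),
                   key=lambda p: sum(S[i][p[i]] for i in range(k)))
        rows = [None] * k
        for i, j in enumerate(best):
            rows[j] = L[i]          # row i of L becomes row j
        aligned.append(rows)
    return aligned

# Tiny sanity check: a clustering whose rows were swapped is swapped back
ref = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
out = align_ensemble([ref, [ref[1], ref[0]]])
```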
6.3.2 Voting strategies<br />
Once the correspondence between the k clusters of each one of the l soft clustering solutions<br />
compiled in the cluster ensemble E has been resolved and the corresponding cluster<br />
permutations have been applied, it is time to derive the consensus clustering solution upon<br />
E, a task we tackle by means of voting procedures. In this section, we describe four voting<br />
methods, which give rise to as many consensus functions.<br />
Before proceeding to their description, recall that the scalar elements of a soft cluster<br />
ensemble E are considered, from a voting standpoint, as the expression of the degree of<br />
preference of each voter (i.e. clusterer) for each candidate (cluster) in the present election<br />
(clusterization of an object). The result of the election (i.e. the consolidated clusterization<br />
of the object under consideration based upon the decisions of the l clusterers comprised<br />
in the ensemble) will depend on the voting procedure applied, which, at the same time, is<br />
conditioned by the way the voters’ preferences are expressed.<br />
Given that in fuzzy cluster ensembles voters express their preference for each and every<br />
one of the k candidates, the soft consensus functions proposed in this chapter make use of<br />
confidence and positional voting methods (van Erp, Vuurpijl, and Schomaker, 2002), which<br />
are applicable in voting scenarios in which voters grade candidates according to their degree<br />
of confidence. The former make direct use of the specific values of the preference scores the voters emit –thus, they are sensitive to their scaling–, whereas the latter are based on ranking the candidates according to the degree of confidence expressed by the voters.
As mentioned earlier, the way fuzzy clusterers express their preference for the clusters<br />
can be either directly or inversely proportional to the strength of association between objects<br />
and clusters (e.g. membership probabilities or distances to centroids, respectively). In fact,<br />
it is possible that both types of clusterings are intermingled in E, and, for this reason, the<br />
voting method must somehow be informed of this fact, so that appropriate scale or ranking<br />
transformations are applied —depending on whether a confidence or a positional voting<br />
strategy is employed.<br />
In the following sections, we present four consensus functions for soft cluster ensembles,<br />
each of which is based on a specific voting mechanism. Besides their generic description,<br />
we illustrate them by means of a toy example using the soft cluster ensemble E containing<br />
the l = 2 cluster aligned fuzzy clustering solutions presented in equation (6.28).<br />
E = [ Λ1 ; Λ2 ] =
[ 0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017 ]
[ 0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972 ]
[ 0.054 0.026 0.057 0.969 0.976 0.959 0.009 0.016 0.010 ]
[ 0.932 0.921 0.019 0.030 0.014 0.025 0.057 0.017 0.055 ]
[ 0.042 0.025 0.005 0.011 0.009 0.006 0.038 0.972 0.929 ]
[ 0.026 0.054 0.976 0.959 0.976 0.969 0.905 0.010 0.016 ]   (6.29)
Notice that, in this toy example, the l = 2 clustering solutions (voters) compiled in<br />
the ensemble (Λ1 and Λ2) express object to cluster associations (their preferences for candidates)<br />
by means of membership probabilities, which makes the scalar elements of both<br />
clusterings directly comparable, thus avoiding the need for applying any scale transformations.<br />
However, this need not be the general case, and we will address how to deal with<br />
cluster ensembles containing unequal clusterings as the proposed consensus functions are<br />
presented throughout the following paragraphs.<br />
Confidence voting<br />
Consensus functions based on confidence voting methods derive the consolidated clustering<br />
solution upon the values of the confidence scores each clusterer assigns to each cluster. For<br />
this reason, a prerequisite for using these voting methods is that these confidence values are<br />
comparable in magnitude (van Erp, Vuurpijl, and Schomaker, 2002). Assuming this is true,<br />
we propose the application of the sum and product confidence voting rules, which gives rise<br />
to the following two consensus functions:<br />
– SumConsensus (SC): this consensus function is based on the application of the confidence voting sum rule, which simply consists of adding the confidence values that all the voters cast for each candidate. As a result, a k × n sum matrix ΣE is obtained, the (i,j)th entry of which equals the sum of the preference scores of assigning the jth object to the ith cluster across the l cluster ensemble components, as presented in equation (6.30).

ΣE = Σ_{i=1}^{l} Λi   (6.30)

where Λi refers to the ith clustering contained in the soft cluster ensemble E, once the cluster disambiguation process has been conducted.

Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1...l)
Output: Sum voting matrix ΣE
Data: k clusters, n objects
Hungarian (E)
ΣE = 0_{k×n}
for i = 1...l do
    if Λi not membership probabilities then
        Probabilize (Λi)
    end
    ΣE = ΣE + Λi
end
Algorithm 6.1: Symbolic description of the soft consensus function SumConsensus. Probabilize and Hungarian are symbolic representations of the conversion of fuzzy clusterings to membership probability matrices and the cluster disambiguation procedure, respectively, while 0_{k×n} represents a k × n zero matrix.
A schematic and generic description of the SumConsensus consensus function is presented<br />
in algorithm 6.1. As it can be observed, we propose transforming all the fuzzy<br />
clusterings compiled in the cluster ensemble E into membership probability matrices<br />
–which is symbolically represented by the procedure called Probabilize–, thus making<br />
the sum voting method directly applicable on them once the cluster alignment<br />
problem is solved by means of the Hungarian procedure. According to algorithm<br />
6.1, SumConsensus outputs the sum voting matrix ΣE, which can readily be interpreted as a fuzzy consensus clustering. However, it can easily be converted into a classic membership probability based fuzzy consensus clustering Λc, or a crisp consensus clustering λc, as we show in the following paragraphs —as will be seen, such postprocessing could also be included as the final step of SumConsensus.
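The Probabilize step is left symbolic in algorithm 6.1; one plausible realization (the softmax normalization of inverted distances hinted at later in this chapter, sketched here under our own naming) is:

```python
import math

def probabilize(distances):
    # distances: k x n matrix, entry (i, j) = distance from object j to
    # centroid i (inversely proportional to the strength of association).
    # Returns a k x n matrix whose columns are membership probabilities.
    k, n = len(distances), len(distances[0])
    out = [[0.0] * n for _ in range(k)]
    for j in range(n):
        col = [math.exp(-distances[i][j]) for i in range(k)]  # softmax numerators
        z = sum(col)
        for i in range(k):
            out[i][j] = col[i] / z
    return out

P = probabilize([[0.0, 2.0], [2.0, 0.0]])
# Each column sums to 1, and the closer centroid gets the larger probability
```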
The application of SumConsensus on the toy cluster ensemble of equation (6.29) gives<br />
rise to the sum matrix ΣE presented in equation (6.31). Notice that the execution of<br />
the Probabilize and the Hungarian procedures is not required in this case, as the<br />
l = 2 fuzzy clusterings considered express object to cluster associations by means of<br />
membership probabilities and their clusters are aligned.<br />
ΣE =
[ 1.853 1.853 0.924 0.055 0.033 0.055 0.071 0.072 0.072 ]
[ 0.067 0.067 0.043 0.017 0.014 0.017 1.014 1.901 1.901 ]
[ 0.08  0.08  1.033 1.928 1.952 1.928 0.914 0.026 0.026 ]   (6.31)
Notice that the higher the value of the (i,j)th entry of ΣE, the more likely it is that the jth object belongs to the ith cluster. Of course, this is due to the fact that the l = 2 fuzzy clusterings contained in the soft cluster ensemble of our toy example express object to cluster associations by means of membership probabilities, which are directly proportional to the strength of association between objects and clusters.

Moreover, notice that if each column of ΣE is divided by its L1-norm (i.e. the sum of its elements), the entries of each column become cluster membership probabilities, and, therefore, a classic fuzzy consensus clustering solution Λc can be obtained (see equation (6.32)).
Λc =
[ 0.926 0.926 0.462 0.027 0.016 0.028 0.035 0.036 0.036 ]
[ 0.033 0.033 0.021 0.008 0.007 0.008 0.507 0.951 0.951 ]
[ 0.041 0.041 0.517 0.965 0.977 0.964 0.458 0.013 0.013 ]   (6.32)
Furthermore, notice that Λc can be transformed into a crisp consensus clustering<br />
λc by simply assigning each object to the cluster it is most strongly associated to,<br />
breaking hypothetical ties at random. Referring once more to our toy example, the<br />
crisp consensus clustering obtained by hardening Λc is the one presented in equation<br />
(6.33).<br />
λc = [1 1 3 3 3 3 2 2 2]   (6.33)
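On already-aligned, already-probabilized matrices, the whole SumConsensus pipeline reduces to an entrywise sum plus the optional hardening step. A sketch under those assumptions (ties are broken deterministically here, by lowest cluster index, rather than at random):

```python
L1 = [[0.921, 0.932, 0.905, 0.025, 0.019, 0.030, 0.014, 0.055, 0.017],
      [0.025, 0.042, 0.038, 0.006, 0.005, 0.011, 0.976, 0.929, 0.972],
      [0.054, 0.026, 0.057, 0.969, 0.976, 0.959, 0.009, 0.016, 0.010]]
L2 = [[0.932, 0.921, 0.019, 0.030, 0.014, 0.025, 0.057, 0.017, 0.055],
      [0.042, 0.025, 0.005, 0.011, 0.009, 0.006, 0.038, 0.972, 0.929],
      [0.026, 0.054, 0.976, 0.959, 0.976, 0.969, 0.905, 0.010, 0.016]]

def sum_consensus(ensemble):
    # Equation (6.30): entrywise sum of the aligned membership matrices
    k, n = len(ensemble[0]), len(ensemble[0][0])
    return [[sum(L[i][j] for L in ensemble) for j in range(n)] for i in range(k)]

def harden(M):
    # Assign each object to its most strongly associated cluster (1-based)
    k, n = len(M), len(M[0])
    return [max(range(k), key=lambda i: M[i][j]) + 1 for j in range(n)]

sigma = sum_consensus([L1, L2])   # reproduces equation (6.31)
```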
– ProductConsensus (PC): the only difference between this consensus function and<br />
SumConsensus is that the preference values per candidate are multiplied instead of<br />
added. Quite obviously, the product rule is highly sensitive to low values, which could ruin a candidate’s chances of winning the election, no matter what its other confidence values are (van Erp, Vuurpijl, and Schomaker, 2002). Equation (6.34) presents
the voting process that constitutes the core of the ProductConsensus consensus function.<br />
It is important to notice that Λi correspond to the cluster ensemble components<br />
once cluster alignment has been conducted, and matrix products are computed entrywise.<br />
As a result, the k × n product matrix ΠE is obtained.<br />
ΠE = ∏_{i=1}^{l} Λi   (6.34)
Algorithm 6.2 presents the schematic description of the ProductConsensus consensus function. As in the previous consensus function, we propose converting the fuzzy clusterings Λi into membership probability matrices, which allows applying the product voting rule on them once the cluster correspondence problem has been solved by means of the Hungarian algorithm. It can be observed that ProductConsensus yields the product voting matrix ΠE as its output. However, as in the case of SumConsensus, it can be transformed into a fuzzy or a crisp consensus clustering, a process that can be included as the final step of ProductConsensus.

Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1...l)
Output: Product voting matrix ΠE
Data: k clusters, n objects
Hungarian (E)
ΠE = 1_{k×n}
for i = 1...l do
    if Λi not membership probabilities then
        Probabilize (Λi)
    end
    ΠE = ΠE ∘ Λi
end
Algorithm 6.2: Symbolic description of the soft consensus function ProductConsensus. Probabilize and Hungarian are symbolic representations of the conversion of fuzzy clusterings to membership probability matrices and the cluster disambiguation procedure, respectively, while 1_{k×n} represents a k × n all-ones matrix, and ∘ represents the Hadamard (or entrywise) matrix product.
The result of applying the product rule on the toy cluster ensemble of equation (6.29) is the product matrix ΠE presented in equation (6.35).

ΠE =
[ 0.858 0.858 0.017 7.5·10^-4 2.7·10^-4 7.5·10^-4 7.9·10^-4 9.3·10^-4 9.3·10^-4 ]
[ 0.001 0.001 1.9·10^-4 6.6·10^-5 4.5·10^-5 6.6·10^-5 0.037 0.903 0.903 ]
[ 0.001 0.001 0.056 0.929 0.953 0.929 0.008 1.6·10^-4 1.6·10^-4 ]   (6.35)
Dividing each column of ΠE by its L1-norm gives rise to the fuzzy consensus clustering<br />
solution Λc based on membership probabilities of equation (6.36), and assigning each<br />
object to the cluster it is most strongly associated to (breaking ties randomly) yields<br />
the crisp consensus clustering λc of equation (6.37).<br />
Λc =
[ 0.997 0.997 0.235 8.1·10^-4 2.8·10^-4 8.1·10^-4 0.017 0.001 0.001 ]
[ 0.001 0.001 0.003 7.1·10^-5 4.7·10^-5 7.1·10^-5 0.806 0.998 0.998 ]
[ 0.002 0.002 0.762 0.999 0.999 0.999 0.177 1.8·10^-4 1.8·10^-4 ]   (6.36)

λc = [1 1 3 3 3 3 2 2 2]   (6.37)
Notice that the differences between the fuzzy consensus clusterings Λc obtained by SumConsensus and ProductConsensus (equations (6.32) and (6.36)) –due to the different way voters’ preferences are combined– are lost when they are transformed into crisp consensus clusterings.
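The product rule differs from the sum rule only in the accumulation operator; a matching sketch (same assumptions as before: aligned, probabilized inputs and deterministic tie-breaking):

```python
def product_consensus(ensemble):
    # Equation (6.34): Hadamard (entrywise) product of the aligned matrices
    k, n = len(ensemble[0]), len(ensemble[0][0])
    pi = [[1.0] * n for _ in range(k)]
    for L in ensemble:
        pi = [[pi[i][j] * L[i][j] for j in range(n)] for i in range(k)]
    return pi

def harden(M):
    # Most strongly associated cluster per object (1-based)
    k, n = len(M), len(M[0])
    return [max(range(k), key=lambda i: M[i][j]) + 1 for j in range(n)]

L1 = [[0.921, 0.932, 0.905, 0.025, 0.019, 0.030, 0.014, 0.055, 0.017],
      [0.025, 0.042, 0.038, 0.006, 0.005, 0.011, 0.976, 0.929, 0.972],
      [0.054, 0.026, 0.057, 0.969, 0.976, 0.959, 0.009, 0.016, 0.010]]
L2 = [[0.932, 0.921, 0.019, 0.030, 0.014, 0.025, 0.057, 0.017, 0.055],
      [0.042, 0.025, 0.005, 0.011, 0.009, 0.006, 0.038, 0.972, 0.929],
      [0.026, 0.054, 0.976, 0.959, 0.976, 0.969, 0.905, 0.010, 0.016]]

prod = product_consensus([L1, L2])   # reproduces equation (6.35)
```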
Positional voting<br />
Positional (aka ranking) voting methods rank the candidates according to the confidence scores emitted by the voters. Thus, fine-grain information regarding preference differences between candidates is ignored, although problems in scaling the voters’ confidence scores are avoided —that is, positional voting is useful in situations where confidence values are hard to scale correctly (van Erp, Vuurpijl, and Schomaker, 2002).
As an aid for describing the positional voting methods that constitute the core of our<br />
consensus functions, equation (6.38) defines Λi (the ith component of the soft cluster ensemble<br />
E) in terms of its columns, represented by vectors λij (∀j =1,...,n).<br />
E =
[ Λ1 ]
[ Λ2 ]
[ ⋮  ]   where Λi = [ λi1 λi2 ... λin ]   (6.38)
[ Λl ]
In this work, we propose employing two positional voting strategies (namely the Borda<br />
and the Condorcet voting methods) for deriving the consensus clustering solution, which<br />
gives rise to the following consensus functions:<br />
– BordaConsensus (BC): the Borda voting method (Borda, 1781) computes the mean<br />
rank of each candidate over all voters, reranking them according to their mean rank.<br />
This process results in a grading of all the n objects with respect to each of the k<br />
clusters, which is embodied in a k × n Borda voting matrix BE. Such grading process<br />
is conducted as follows: firstly, for each object (election), clusters (candidates) are<br />
ranked according to their degree of association with respect to it (from the most to<br />
the least strongly associated). Then, the top ranked candidate receives k points, the<br />
second ranked cluster receives k − 1 points, and so on. After iterating this procedure<br />
across the l cluster ensemble components, the grading matrix BE is obtained. The<br />
whole process is described in algorithm 6.3. Notice that the Rank procedure orders<br />
the clusters from the most to the least strongly associated to each object, yielding a<br />
ranking vector r which is a list of the k clusters ordered according to their degree of<br />
association with respect to the object under consideration (i.e. its first component<br />
(r(1)) identifies the most strongly associated cluster, and so on. Thus, the Rank<br />
procedure must take into account whether the scalar values contained in λab are<br />
directly or inversely proportional to the strength of association between objects and<br />
clusters.<br />
When applied on our toy example, the resulting Borda grading matrix is the one<br />
presented in equation (6.39).<br />
BE =
[ 6 6 5 4 4 4 4 4 4 ]
[ 3 3 2 2 2 2 4 6 6 ]
[ 3 3 5 6 6 6 4 2 2 ]   (6.39)
Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1...l)
Output: Borda voting matrix BE
Data: k clusters, n objects
Hungarian (E)
BE = 0_{k×n}
for a = 1...l do
    for b = 1...n do
        r = Rank (λab);
        for c = 1...k do
            BE (r(c), b) = BE (r(c), b) + (k − c + 1);
        end
    end
end
Algorithm 6.3: Symbolic description of the soft consensus function BordaConsensus. Hungarian and Rank are symbolic representations of the cluster disambiguation and cluster ordering procedures, respectively, while the vector λab represents the bth column of the ath cluster ensemble component Λa, r is a cluster ranking vector and 0_{k×n} represents a k × n zero matrix.

According to Borda voting, the higher the value of the (i,j)th entry of BE, the more likely the jth object belongs to the ith cluster. Thus, again, dividing each column of matrix BE by its L1-norm transforms it into a cluster membership probability matrix, i.e. a soft consensus clustering solution Λc (see equation (6.40)), and assigning each object to the cluster it is most strongly associated to –breaking ties randomly– yields the crisp consensus clustering of equation (6.41).
Λc =
[ 0.5   0.5   0.417 0.333 0.333 0.333 0.333 0.333 0.333 ]
[ 0.25  0.25  0.166 0.167 0.167 0.167 0.333 0.5   0.5   ]
[ 0.25  0.25  0.417 0.5   0.5   0.5   0.333 0.167 0.167 ]   (6.40)

λc = [1 1 3 3 3 3 3 2 2]   (6.41)
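Algorithm 6.3’s grading loop can be sketched directly (assuming associations directly proportional to strength, so no Rank inversion is needed; the final hardening and its tie handling are omitted):

```python
def borda_consensus(ensemble):
    # For every object (election) and voter, the top-ranked cluster scores
    # k points, the second k - 1, ..., the last 1; points accumulate in B
    k, n = len(ensemble[0]), len(ensemble[0][0])
    B = [[0] * n for _ in range(k)]
    for L in ensemble:
        for j in range(n):
            ranking = sorted(range(k), key=lambda i: L[i][j], reverse=True)
            for pos, c in enumerate(ranking):
                B[c][j] += k - pos
    return B

L1 = [[0.921, 0.932, 0.905, 0.025, 0.019, 0.030, 0.014, 0.055, 0.017],
      [0.025, 0.042, 0.038, 0.006, 0.005, 0.011, 0.976, 0.929, 0.972],
      [0.054, 0.026, 0.057, 0.969, 0.976, 0.959, 0.009, 0.016, 0.010]]
L2 = [[0.932, 0.921, 0.019, 0.030, 0.014, 0.025, 0.057, 0.017, 0.055],
      [0.042, 0.025, 0.005, 0.011, 0.009, 0.006, 0.038, 0.972, 0.929],
      [0.026, 0.054, 0.976, 0.959, 0.976, 0.969, 0.905, 0.010, 0.016]]

B = borda_consensus([L1, L2])   # reproduces equation (6.39)
```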
– CondorcetConsensus (CC): just like Borda voting, the Condorcet voting method’s origin dates from the French Revolution period, as an effort to address the shortcomings of simple majority voting when there are more than two candidates (Condorcet, 1785). Although often considered to be a multi-step unweighted voting algorithm, the Condorcet voting method can also be regarded as a positional voting strategy, as it employs the voters’ preference choices between any given pair of candidates (van Erp, Vuurpijl, and Schomaker, 2002). In particular, this voting method performs an exhaustive pairwise candidate ranking comparison across voters, and the winner of each of these one-to-one confrontations scores a point. The result of this process is the Condorcet score matrix CE, the (i,j)th element of which indicates how many candidates the ith candidate beats in one-to-one comparisons in the jth election (where candidates are clusters and an election corresponds to the clusterization of an object).
Algorithm 6.4 presents a description of the CondorcetConsensus consensus function. As in BordaConsensus, the Rank procedure must take into account whether the scalar values contained in λab are directly or inversely proportional to the strength of association between objects and clusters. In each election, the (i,j)th entry of the square matrix M (usually referred to as the Condorcet sum matrix) counts the number of times the ith cluster is preferred over the jth one. The Count procedure counts the number of elements of the cth row of matrix M that are greater than or equal to l/2, which means that at least half of the voters preferred one cluster over another.

Input: Soft cluster ensemble E containing l fuzzy clusterings Λi (∀i = 1...l)
Output: Condorcet voting matrix CE
Data: k clusters, n objects
Hungarian (E)
for b = 1...n do
    M = 0_{k×k}
    for a = 1...l do
        r = Rank (λab);
        for c = 1...k do
            M (r(c), r(c+1 ÷ k)) = M (r(c), r(c+1 ÷ k)) + 1
        end
    end
    for c = 1...k do
        CE (c, b) = Count (M(c, 1 ÷ k) ≥ l/2)
    end
end
Algorithm 6.4: Symbolic description of the soft consensus function CondorcetConsensus. Hungarian and Rank are symbolic representations of the cluster disambiguation and cluster ordering procedures, respectively, while the vector λab represents the bth column of the ath cluster ensemble component Λa, r is a cluster ranking vector and 0_{k×k} represents a k × k zero matrix.
Resorting again to our toy example, equation (6.42) presents the Condorcet score<br />
matrix resulting from applying CondorcetConsensus on it.<br />
CE =
[ 2 2 2 1 1 1 2 1 1 ]
[ 1 1 0 0 0 0 2 2 2 ]
[ 1 1 2 2 2 2 2 0 0 ]   (6.42)
Again, dividing each column of matrix CE by its L1-norm transforms the Condorcet score matrix into a soft consensus clustering solution Λc, whose (i,j)th entry represents the probability that the jth object belongs to the ith cluster (see equation (6.43) for the fuzzy consensus clustering solution obtained by CondorcetConsensus on our toy example). And assigning each object to the cluster it is most strongly associated to –breaking ties randomly– yields the crisp consensus clustering of equation (6.44).

Λc =
[ 0.500 0.500 0.500 0.333 0.333 0.333 0.333 0.333 0.333 ]
[ 0.250 0.250 0     0     0     0     0.333 0.667 0.667 ]
[ 0.250 0.250 0.500 0.667 0.667 0.667 0.333 0     0     ]   (6.43)
λc = [1 1 1 3 3 3 1 2 2]   (6.44)
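Algorithm 6.4’s pairwise tournament can be sketched likewise (same assumptions as the Borda sketch; the diagonal of M is excluded from the count, which matches the l/2 threshold behaviour since M(c,c) is always zero):

```python
def condorcet_consensus(ensemble):
    l = len(ensemble)
    k, n = len(ensemble[0]), len(ensemble[0][0])
    C = [[0] * n for _ in range(k)]
    for j in range(n):
        M = [[0] * k for _ in range(k)]   # Condorcet sum matrix
        for L in ensemble:
            ranking = sorted(range(k), key=lambda i: L[i][j], reverse=True)
            for pos, c in enumerate(ranking):
                for d in ranking[pos + 1:]:
                    M[c][d] += 1          # this voter prefers cluster c over d
        for c in range(k):
            # one point per rival preferred by at least half of the voters
            C[c][j] = sum(1 for d in range(k) if d != c and M[c][d] >= l / 2)
    return C

L1 = [[0.921, 0.932, 0.905, 0.025, 0.019, 0.030, 0.014, 0.055, 0.017],
      [0.025, 0.042, 0.038, 0.006, 0.005, 0.011, 0.976, 0.929, 0.972],
      [0.054, 0.026, 0.057, 0.969, 0.976, 0.959, 0.009, 0.016, 0.010]]
L2 = [[0.932, 0.921, 0.019, 0.030, 0.014, 0.025, 0.057, 0.017, 0.055],
      [0.042, 0.025, 0.005, 0.011, 0.009, 0.006, 0.038, 0.972, 0.929],
      [0.026, 0.054, 0.976, 0.959, 0.976, 0.969, 0.905, 0.010, 0.016]]

C = condorcet_consensus([L1, L2])   # reproduces equation (6.42)
```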
Notice that the fuzzy consensus clusterings Λc output by BordaConsensus and CondorcetConsensus (equations (6.40) and (6.43)) differ notably from those obtained by SumConsensus and ProductConsensus (equations (6.32) and (6.36)) —see the double and triple ties obtained at the clusterization of the third and seventh objects, which are due to the intrinsic differences between the distinct voting strategies applied. Moreover, notice that the two positional voting based consensus functions (BC and CC) yield structurally similar fuzzy consensus clusterings Λc, although their contents differ slightly. However, their hardened versions λc (equations (6.41) and (6.44)) differ to a larger extent, due to the random way ties are broken.
6.4 Experiments<br />
This section presents several consensus clustering experiments evaluating the consensus<br />
functions for soft cluster ensembles proposed in the previous section. These experiments<br />
are conducted according to the following design.<br />
– What do we want to measure? We are interested in comparing both the quality of the consensus clustering solutions obtained and the time complexity of the proposed consensus functions.
– How do we measure it? As regards time complexity, all consensus processes follow a flat architecture (i.e. one-step consensus), and we measure the CPU time required for their execution, using the computational resources described in appendix A.6. As for the evaluation of the quality of the consensus clustering results, although the proposed consensus functions output fuzzy consensus clustering solutions, we have compared their hardened versions with respect to the ground truth of each data set in terms of normalized mutual information φ (NMI). The reason for this is twofold: firstly, a soft ground truth is not available for these data sets, so fuzzy consensus clusterings cannot be evaluated directly. Secondly, given that the CSPA, HGPA, MCLA and EAC consensus functions output hard consensus clustering solutions, a fair comparison between consensus functions requires converting the soft consensus clustering matrices Λc output by VMA, BC, CC, PC and SC into crisp consensus labelings λc (recall that this simply boils down to assigning each object to the cluster it is most strongly associated with).
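The hardening step and the φ (NMI) evaluation just described can be sketched as follows (an illustrative sketch; it relies on scikit-learn's NMI implementation rather than the original Matlab code, using the geometric-mean normalization that matches the φ (NMI) definition of Strehl and Ghosh):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def harden(consensus):
    """Convert a k x n soft consensus clustering matrix (rows: clusters,
    columns: objects) into a crisp labeling by assigning each object to
    the cluster it is most strongly associated with."""
    return np.argmax(np.asarray(consensus), axis=0)

# Toy 3 x 5 soft consensus matrix and a made-up ground truth
soft = np.array([[0.7, 0.6, 0.2, 0.1, 0.1],
                 [0.2, 0.3, 0.7, 0.1, 0.2],
                 [0.1, 0.1, 0.1, 0.8, 0.7]])
labels = harden(soft)                      # crisp consensus labeling
ground_truth = [0, 0, 1, 2, 2]

# Geometric averaging reproduces phi(NMI); perfect agreement yields 1
print(normalized_mutual_info_score(ground_truth, labels,
                                   average_method="geometric"))
```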
– How are the experiments designed? In each consensus clustering experiment we<br />
have applied our four voting-based consensus functions –SumConsensus (SC), ProductConsensus<br />
(PC), BordaConsensus (BC) and CondorcetConsensus (CC)–, besides<br />
the fuzzy versions of CSPA, EAC, HGPA and MCLA (see section 6.2) plus one of<br />
the pioneering soft consensus functions, namely VMA (Voting Merging Algorithm)<br />
(Dimitriadou, Weingessel, and Hornik, 2002) —see appendix A.5 for a description.<br />
Experiments have been conducted on the twelve unimodal data collections employed<br />
in this work, which are described in appendix A.2.1. As regards the creation of the<br />
soft cluster ensemble components, we have employed the fuzzy c-means and the k-means clustering algorithms. Whereas the former is fuzzy by nature, the latter is not.
Chapter 6. Voting based consensus functions for soft cluster ensembles<br />
However, if object-to-cluster centroid distances are inverted and normalized using a softmax normalization, they can be interpreted as membership probabilities (i.e. the k-means clustering solutions are fuzzified). For the sake of greater
algorithmic diversity, variants of k-means using the Euclidean, city block, cosine and<br />
correlation distance measures have been employed. Thus, the cardinality of the algorithmic<br />
diversity factor is |dfA| = 5. Applying all these clustering algorithms on each<br />
and every one of the distinct object representations created by the mutually crossed<br />
application of the representational and dimensional diversity factors of each data set,<br />
gives rise to soft cluster ensembles of the sizes l presented in table 6.1. In order to obtain<br />
a representative analysis of the performance of the aforementioned consensus functions,
we have conducted multiple experiments on distinct diversity scenarios. To do so,<br />
besides using the cluster ensemble of size l, we have also generated cluster ensembles<br />
of sizes ⌊l/20⌋, ⌊l/10⌋, ⌊l/5⌋ and ⌊l/2⌋, which are created by randomly picking a subset
of the original cluster ensemble components. For each distinct cluster ensemble, ten<br />
independent runs of each consensus function are executed.<br />
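The fuzzification of crisp k-means solutions described above can be sketched as follows (a hypothetical illustration: the exact inversion used is not reproduced here; negating the distances before the softmax is one simple choice):

```python
import numpy as np

def fuzzify_kmeans(distances):
    """Turn a k x n matrix of object-to-centroid distances into soft
    memberships: invert the distances (closer centroid -> larger score)
    and apply a softmax normalization per object (column), so each
    column sums to one and can be read as membership probabilities."""
    d = np.asarray(distances, dtype=float)
    scores = -d                        # inversion: smaller distance, larger score
    scores -= scores.max(axis=0)       # subtract the column max for stability
    e = np.exp(scores)
    return e / e.sum(axis=0)

# Two centroids, three objects: object 0 is close to centroid 0,
# object 1 to centroid 1, and object 2 is equidistant from both.
dist = np.array([[0.1, 2.0, 1.0],
                 [1.9, 0.2, 1.0]])
memberships = fuzzify_kmeans(dist)
print(memberships.sum(axis=0))         # each column sums to 1
```

The equidistant third object receives equal memberships of 0.5, which is exactly the kind of soft information a crisp k-means labeling would discard.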
– How are results presented? The performances of the nine soft consensus functions are summarized by means of a quality (φ (NMI) with respect to the ground truth) versus time complexity (CPU time measured in seconds) diagram that concisely describes the behavior of the compared consensus functions. For each
consensus function, the depicted scatterplot corresponds to the region limited by the<br />
mean ± 2-standard deviation curves corresponding to the two associated magnitudes<br />
(i.e. φ (NMI) and CPU time) computed throughout all the experiments conducted<br />
on each data collection —ten independent experiment runs on each one of the five<br />
cluster ensemble sizes. In order to determine whether the differences between the<br />
compared consensus functions are significant or not, we have conducted a pairwise<br />
comparison (both in CPU time and φ (NMI) terms) among them applying a t-paired<br />
test, measuring the significance level p at which the null hypothesis (equal means with<br />
possibly unequal variances) is rejected. If the typical 95% confidence interval for the true difference in means is taken as a reference, significance level values of p < 0.05 indicate that the observed differences are statistically significant.
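The pairwise significance testing just described can be reproduced with a paired t-test such as SciPy's `ttest_rel` (an illustrative sketch; the scores below are made up, not thesis results):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Hypothetical phi(NMI) scores of two consensus functions over the
# same 50 experiment runs (paired samples)
scores_a = rng.normal(loc=0.80, scale=0.02, size=50)
scores_b = scores_a + rng.normal(loc=0.03, scale=0.01, size=50)

t_stat, p_value = ttest_rel(scores_a, scores_b)
# p < 0.05 -> reject the null hypothesis of equal means at the 95% level
print(p_value < 0.05)
```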
Data set Soft cluster ensemble size l<br />
Zoo 285<br />
Iris 45<br />
Wine 225<br />
Glass 145<br />
Ionosphere 485<br />
WDBC 565<br />
Balance 35<br />
Mfeat 30<br />
miniNG 365<br />
Segmentation 260<br />
BBC 285<br />
PenDigits 285<br />
Table 6.1: Soft cluster ensemble sizes l corresponding to the unimodal data sets.<br />
[Figure 6.2: scatterplot; panel title: ZOO; y-axis: φ (NMI), 0 to 1; x-axis: CPU time (sec.), 0 to 4; legend: CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC, SC]
Figure 6.2: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Zoo data collection.<br />
efficiency of VMA is to be expected, as it solves the cluster correspondence problem and conducts voting simultaneously, following an iterative procedure (Dimitriadou, Weingessel, and Hornik, 2002), whereas in SC, PC, BC and CC these two processes are conducted sequentially.
As regards the quality of the consensus clustering solutions, notice that the four proposed consensus functions achieve almost identical φ (NMI) scores to the best performing state-of-the-art alternative, VMA.
Table 6.2 presents the significance level values obtained from all the t-paired tests conducted<br />
on the Zoo data set. The upper and lower triangular sections of the table correspond<br />
to the comparison between consensus functions in terms of CPU time and φ (NMI) , respectively.<br />
When pairwise comparisons between the ith and the jth consensus functions result<br />
in statistically significant differences, the corresponding significance level value p is presented<br />
in the (i,j)th entry of the table (or in the (j,i)th entry, depending on whether it is a<br />
comparison in terms of CPU time or φ (NMI) ). Otherwise, the lack of statistically significant<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— × × 0.0001 0.0001 0.0001 0.0001 0.0013 0.0012<br />
EAC 0.0001 ——— × 0.0001 0.0002 0.0001 0.0001 0.002 0.0019<br />
HGPA 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0017 0.0016<br />
MCLA × 0.0001 0.0001 ——— 0.0001 0.0009 × 0.0003 0.0003<br />
VMA 0.0001 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.0001 0.0001 0.0001 0.0001 × ——— 0.0001 × ×<br />
CC 0.0001 0.0001 0.0001 0.0001 × × ——— 0.0001 0.0001<br />
PC 0.0001 0.0001 0.0001 0.0001 × × × ——— ×<br />
SC 0.0001 0.0001 0.0001 0.0001 × 0.0337 0.0419 × ———<br />
Table 6.2: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Zoo data set. The upper and lower triangular sections<br />
of the table correspond to the comparison in terms of CPU time and φ (NMI) , respectively.<br />
Statistically non-significant differences (p >0.05) are denoted by the symbol ×.<br />
differences is denoted by means of the × symbol.<br />
For instance, let us see how BordaConsensus (BC) compares to the eight remaining consensus functions in terms of execution CPU time (for easier identification, the contents of the corresponding boxes of table 6.2 are italicized). These entries reveal that the differences observed in figure 6.2 (according to which BC is apparently faster than MCLA and CC, and slower than CSPA, EAC, HGPA, VMA, PC and SC) are statistically significant with respect to all but the PC and SC consensus functions.
If this comparison is based on the φ (NMI) of the consensus clustering solutions, figure 6.2<br />
suggests that BC performs better than CSPA, EAC, HGPA and MCLA, which is true from a<br />
statistical significance standpoint, as the corresponding entries of table 6.2 (which are typed<br />
in boldface for ease of identification) confirm. In contrast, the small differences between the φ (NMI) values of BC, VMA, CC and PC observed in figure 6.2 are statistically non-significant, whereas the difference with respect to SC is significant despite its apparent closeness.
In order to provide the reader with a global perspective that illustrates the performance<br />
of the proposed consensus functions compared to their state-of-the-art counterparts across<br />
the twelve unimodal collections employed in this work, we have computed the total percentage<br />
of experiments in which the latter yield better, equivalent or worse results than the<br />
voting-based consensus functions —considering the statistical significance of the differences<br />
between the compared magnitudes (CPU time and φ (NMI) ).<br />
Firstly, table 6.3 presents the results of such comparative analysis when it is referred<br />
to the quality of the consensus clusterings output by the consensus functions in all the<br />
experiments conducted. It can be observed that the four proposed consensus functions<br />
outperform EAC, HGPA and MCLA in an overwhelming percentage of the experiments
(an average 94.4% of the total). When compared to CSPA and VMA, we can appreciate<br />
certain differences between the performance of the consensus functions based on confidence<br />
voting (PC and SC) and the ones based on positional voting (BC and CC). In general terms,<br />
SC and PC perform slightly better than BC and CC. Moreover, notice that BordaConsensus<br />
and CondorcetConsensus attain exactly the same results, whereas the similarity between<br />
the results of SC and PC is also very noticeable. We conjecture that these high degrees of resemblance are due to the fact that evaluation is conducted upon a hardened version
of the soft consensus clustering output by these consensus functions. Thus, the intrinsic<br />
φ (NMI)                    BC      CC      PC      SC
CSPA   better than ...     27.3%   27.3%   9.1%    9.1%
       equivalent to ...   9.1%    9.1%    45.4%   45.4%
       worse than ...      63.6%   63.6%   45.4%   45.4%
EAC    better than ...     0%      0%      0%      0%
       equivalent to ...   0%      0%      0%      0%
       worse than ...      100%    100%    100%    100%
HGPA   better than ...     0%      0%      0%      0%
       equivalent to ...   0%      0%      0%      0%
       worse than ...      100%    100%    100%    100%
MCLA   better than ...     0%      0%      0%      0%
       equivalent to ...   16.7%   16.7%   16.7%   16.7%
       worse than ...      83.3%   83.3%   83.3%   83.3%
VMA    better than ...     25%     25%     0%      0%
       equivalent to ...   33.3%   33.3%   100%    91.7%
       worse than ...      41.7%   41.7%   0%      8.3%
Table 6.3: Percentage of experiments in which the state-of-the-art consensus functions<br />
(CSPA, EAC, HGPA, MCLA and VMA) yield (statistically significant) better/equivalent/worse<br />
consensus clustering solutions than the four proposed consensus functions<br />
(BC, CC, PC and SC).<br />
differences between the distinct voting strategies are somehow lost.
In either case, the φ (NMI) scores obtained by the four proposed voting based consensus functions are statistically significantly lower than those of CSPA and VMA in only 15.3% of the experiments conducted, which clearly indicates that, from a consensus quality perspective,
our proposals constitute an attractive alternative for conducting consensus clustering<br />
on soft cluster ensembles.<br />
Secondly, the results of the previously described comparison, now referring to execution CPU time, are presented in table 6.4. In general terms, the state-of-the-art consensus functions (except for EAC) are faster than the proposed consensus functions based on positional voting methods (BC and CC). This is due to the candidate ranking step that
precedes the voting process itself (see algorithms 6.3 and 6.4). Moreover, the execution of CC takes longer than that of BC, due to the exhaustive pairwise candidate comparison involved in
the Condorcet voting method. In contrast, the confidence voting based consensus functions<br />
(PC and SC) are more computationally efficient, being as fast or faster than CSPA, EAC,<br />
HGPA and MCLA in an average 80.7% of the experiments. However, they are unable to<br />
match the computational efficiency of VMA, which, as mentioned earlier, is caused by its<br />
simultaneous and iterative cluster disambiguation and voting procedure.<br />
As a conclusion, it can be stated that the four voting based consensus functions proposed<br />
in this chapter are indeed worthy of being considered as an alternative when it comes to<br />
creating consensus clustering solutions on soft cluster ensembles, as they are capable of<br />
delivering high quality consensus clustering solutions at an acceptable computational cost; this is especially true for those consensus functions based on confidence voting methods (i.e. PC and SC). The higher computational complexity of positional voting based consensus
functions (BC and CC) suggests limiting their application to those cases in which the<br />
CPU time                   BC      CC      PC      SC
CSPA   faster than ...     45.4%   72.7%   18.2%   18.2%
       equivalent to ...   18.2%   0%      18.2%   18.2%
       slower than ...     36.4%   27.3%   63.6%   63.6%
EAC    faster than ...     9.1%    27.3%   9.1%    9.1%
       equivalent to ...   27.3%   9.1%    0%      0%
       slower than ...     63.6%   63.6%   90.9%   90.9%
HGPA   faster than ...     91.7%   91.7%   33.3%   33.3%
       equivalent to ...   8.3%    8.3%    41.7%   41.7%
       slower than ...     0%      0%      25%     25%
MCLA   faster than ...     66.7%   83.3%   16.7%   16.7%
       equivalent to ...   25%     16.7%   41.7%   41.7%
       slower than ...     8.3%    0%      41.7%   41.7%
VMA    faster than ...     100%    100%    91.7%   91.7%
       equivalent to ...   0%      0%      8.3%    8.3%
       slower than ...     0%      0%      0%      0%
Table 6.4: Percentage of experiments in which the state-of-the-art consensus functions<br />
(CSPA, EAC, HGPA, MCLA and VMA) are executed (statistically significantly)<br />
faster/equivalent/slower than the four proposed consensus functions (BC, CC, PC and SC).<br />
confidence values contained in the soft cluster ensemble are difficult to scale correctly (van<br />
Erp, Vuurpijl, and Schomaker, 2002).<br />
6.5 Discussion<br />
The main motivation of the proposals put forward in this chapter is the fact that most of the<br />
literature on cluster ensembles is mainly focused on the application of consensus clustering<br />
processes on hard cluster ensembles. In our opinion, however, soft consensus clustering is<br />
an alternative worth considering, inasmuch as crisp clustering is in fact a simplification of<br />
fuzzy clustering —a simplification that may give rise to the loss of valuable information.<br />
The initial source of inspiration for the soft consensus functions just presented was<br />
metasearch (aka information fusion) systems, the main purpose of which is to obtain improved<br />
search results by combining the ranked lists of documents returned by multiple<br />
search engines in response to a given query. Although the resemblance between metasearch<br />
and consensus clustering was already reported in (Gionis, Mannila, and Tsaparas, 2007),<br />
direct inspiration came from the works of Aslam and Montague (Aslam and Montague,<br />
2001; Montague and Aslam, 2002), where metasearch algorithms based on positional voting<br />
were devised; notice that this type of voting technique naturally lends itself to this context, as search engines return lists of ranked documents. From that point on, the
analogy between object-to-cluster association scores in a soft cluster ensemble and voters’<br />
preferences for candidates became the key issue for deriving consensus functions based on<br />
positional and confidence voting methods.<br />
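The confidence voting idea behind SumConsensus and ProductConsensus can be sketched as follows, assuming the clusters of all ensemble components have already been aligned (illustrative code under that assumption, not the thesis implementation):

```python
import numpy as np

def confidence_voting(ensemble, rule="sum"):
    """Combine a list of aligned k x n soft clusterings by summing
    (sum rule) or multiplying (product rule) the per-cluster
    confidence votes, then renormalizing each object's memberships
    (columns) so they sum to one."""
    stack = np.stack([np.asarray(m, dtype=float) for m in ensemble])
    combined = stack.sum(axis=0) if rule == "sum" else stack.prod(axis=0)
    return combined / combined.sum(axis=0)

# Two aligned soft clusterings of 3 objects into k = 2 clusters
m1 = [[0.9, 0.4, 0.2], [0.1, 0.6, 0.8]]
m2 = [[0.8, 0.6, 0.1], [0.2, 0.4, 0.9]]
consensus = confidence_voting([m1, m2], rule="sum")
print(np.argmax(consensus, axis=0))   # hardened consensus labeling
```

Note that the second object ends up in a perfect tie under the sum rule, again illustrating why tie-breaking matters when hardening soft consensus clusterings.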
Nevertheless, the application of voting methods for combining clustering solutions is not<br />
new. For instance, unweighted voting strategies (van Erp, Vuurpijl, and Schomaker, 2002)
such as plurality and majority voting have been applied for deriving consensus clustering<br />
solutions on hard cluster ensembles (Dudoit and Fridlyand, 2003; Fischer and Buhmann,<br />
2003; Greene and Cunningham, 2006). To our knowledge, the only voting-based consensus<br />
function for soft cluster ensembles is the Voting-Merging Algorithm (VMA) of (Dimitriadou,<br />
Weingessel, and Hornik, 2002), which employs a weighted version of the sum rule for<br />
confidence voting. Moreover, all these works share a common point in that they use the<br />
Hungarian algorithm for solving the cluster correspondence problem.<br />
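The Hungarian method mentioned above can be applied to the cluster correspondence problem as in the following sketch (illustrative; it uses SciPy's `linear_sum_assignment`, which solves the underlying assignment problem, and assumes overlap between membership rows as the matching criterion):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_clusters(reference, other):
    """Relabel the clusters of `other` (a k x n soft clustering) so
    they best match those of `reference`, by maximizing the total
    overlap (inner product) between paired cluster membership rows."""
    ref = np.asarray(reference, dtype=float)
    oth = np.asarray(other, dtype=float)
    overlap = ref @ oth.T                         # k x k agreement matrix
    rows, cols = linear_sum_assignment(-overlap)  # negate to maximize
    return oth[cols]                              # reorder rows of `other`

# `other` is `reference` with its two cluster rows swapped
reference = np.array([[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]])
other     = np.array([[0.1, 0.2, 0.9], [0.9, 0.8, 0.1]])
print(align_clusters(reference, other))           # rows swapped back
```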
Additional techniques for cluster disambiguation developed in the consensus clustering<br />
literature include the correspondence estimation based on common space cluster representation<br />
by clusters clustering or Singular Value Decomposition (Boulis and Ostendorf, 2004),<br />
the Soft Correspondence Ensemble Clustering algorithm, which is based on establishing a<br />
soft correspondence between clusters (in the sense that a cluster of a given clustering corresponds<br />
to every cluster in another clustering with different weights) (Long, Zhang, and<br />
Yu, 2005), the cumulative voting approach, that, unlike common one-to-one voting schemes<br />
(e.g. Hungarian), computes a probabilistic mapping between clusters (Ayad and Kamel,<br />
2008), or the FullSearch, Greedy and LargeKGreedy cluster alignment algorithms (Jakobsson and Rosenberg, 2007). The first two approaches coincide in that they can be applied interchangeably for aligning the clusters of crisp and fuzzy partitions. Given the key importance of
the cluster disambiguation process as a prior step to voting, we plan to evaluate these alternatives<br />
to the Hungarian method, so as to investigate their impact on both the quality of<br />
the consensus clusterings obtained and the time complexity of the whole consensus process.<br />
The comparative performance analysis of the four proposed consensus functions has<br />
revealed that they constitute a feasible alternative for conducting consensus clustering processes<br />
on soft cluster ensembles, as they are capable of yielding consensus clustering solutions<br />
of comparable or superior quality to those obtained by state-of-the-art clustering combiners<br />
at a reasonable computational cost. An additional appealing feature of our proposals is<br />
that they naturally deliver fuzzy consensus clustering solutions, which makes full sense in a soft clustering scenario; a fact that other recent consensus functions for soft cluster ensembles, such as the one presented in (Punera and Ghosh, 2007), do not consider. However, the lack of a fuzzy ground truth has prevented us from evaluating the soft consensus clusterings obtained,
which constitutes one of the future directions of research of the work conducted in this<br />
chapter. As mentioned earlier, this would probably make the differences between the proposed<br />
consensus functions more evident, as it would highlight the differences between the<br />
distinct voting methods employed.<br />
As reported earlier, the sequential application of the cluster disambiguation and the<br />
voting processes penalizes the time complexity of our proposals, especially when they are compared to VMA. Thus, in the future, we plan to adopt the iterative cluster alignment plus voting strategy employed by this consensus function, which, in our opinion, should reduce the execution time of the proposed voting-based consensus functions without significantly reducing the quality of the consensus clusterings obtained.
Another significant conclusion is that the EAC and HGPA consensus functions yield<br />
the lowest quality consensus clusterings, as already noticed in the vast majority of the<br />
experiments conducted in the hard clustering scenario.<br />
6.6 Related publications<br />
Our first approach to voting based soft consensus functions was the derivation of Borda-<br />
Consensus (Sevillano, Alías, and Socoró, 2007b). The details of this publication, presented<br />
as a poster at the SIGIR 2007 conference held at Amsterdam, are described next.<br />
Authors: Xavier Sevillano, Francesc Alías and Joan Claudi Socoró<br />
Title: BordaConsensus: a New Consensus Function for Soft Cluster Ensembles<br />
In: Proceedings of the 30th ACM SIGIR Conference<br />
Pages: 743-744<br />
Year: 2007<br />
Abstract: Consensus clustering is the task of deriving a single labeling by applying<br />
a consensus function on a cluster ensemble. This work introduces BordaConsensus, a<br />
new consensus function for soft cluster ensembles based on the Borda voting scheme.<br />
In contrast to classic, hard consensus functions that operate on labelings, our proposal<br />
considers cluster membership information, thus being able to tackle multiclass<br />
clustering problems. Initial small scale experiments reveal that, compared to state-of-the-art consensus functions, BordaConsensus constitutes a good performance vs.
complexity trade-off.<br />
Chapter 7<br />
Conclusions<br />
The contributions put forward in this thesis constitute a unitary proposal for robust clustering<br />
based on cluster ensembles, with a specific focus on the increasingly interesting application<br />
of multimedia data clustering and a view on its generalization in fuzzy clustering<br />
scenarios. In this chapter, we summarize the main features of our proposals, highlighting<br />
their strengths and weaknesses, and outlining some interesting directions for future research.<br />
As for the robustness of clustering, recall that the unsupervised nature of this problem<br />
makes it difficult (if not impossible) to select a priori the clustering system configuration that gives rise to the best¹ data partition. Furthermore, given the myriad of options –e.g.
clustering algorithms, data representations, etc.– available to the clustering practitioner,<br />
such important decision making is marked by a high degree of uncertainty. As suboptimal<br />
configuration decisions may give rise to scarcely meaningful partitions of the data, it turns out
that these clustering indeterminacies end up being very relevant in practice, which, in our<br />
opinion, justifies research efforts oriented to overcome them (such as the present one). This<br />
was the main motivation of our first approaches to robust clustering via cluster ensembles<br />
(Sevillano et al., 2006a; Sevillano et al., 2006b; Sevillano et al., 2007c), which have attracted<br />
the attention of several researchers (Tjhi and Chen, 2007; Pinto, 2008; Gonzàlez and Turmo,<br />
2008b; Tjhi and Chen, 2009).<br />
For these reasons, our approach to robust clustering intentionally reduces user decision<br />
making as much as possible, thus following an approach that is nearly opposite to the<br />
procedure usually employed in clustering: instead of using a specific clustering configuration<br />
(which is often selected blindly unless domain knowledge is available), the clustering<br />
practitioner is encouraged to use and combine all the clustering configurations at hand,<br />
compiling the resulting clusterings into a cluster ensemble, upon which a consensus clustering<br />
is derived. The more similar this consensus clustering is to the highest quality clustering<br />
contained in the cluster ensemble, the greater the robustness to clustering indeterminacies.<br />
In this context, it must be noted that our particular approach to robust clustering encourages the creation of large cluster ensembles. For this reason, one of our main concerns is the computationally efficient derivation of a high quality consolidated clustering upon the aforementioned cluster ensemble, which gives rise to the first two proposals put forward in this thesis: hierarchical consensus architectures and consensus self-refining procedures, which are reviewed in sections 7.1 and 7.2, respectively.

¹The best data partition is an elusive concept in itself, as it basically depends on how the clustered data is interpreted. However, for any given interpretation criterion, some clustering algorithms may obtain better clusters than others (Jain, Murty, and Flynn, 1999). In this work, the quality of clusterings has been evaluated by comparison with an allegedly correct cluster structure of the data, referred to as ground truth, measuring their degree of resemblance by means of normalized mutual information, or φ (NMI).
Our proposals for robust clustering based on cluster ensembles find a natural field of<br />
application in multimedia data clustering, as the existence of multiple data modalities<br />
poses additional indeterminacies that challenge the attainment of robust clustering results.
Moreover, our strategy naturally allows the simultaneous use of early and late multimodal<br />
fusion techniques, which constitutes a highly generic approach to the problem of multimedia<br />
clustering. In section 7.3, our proposals in this area are reviewed.<br />
The last proposal of this thesis can be regarded as a first step for generalizing our cluster<br />
ensembles based robust clustering approach, as it consists of several voting based consensus<br />
functions for soft cluster ensembles —recall that crisp clustering is in fact a simplification of<br />
its fuzzy counterpart. These consensus functions, which are reviewed in section 7.4, can also<br />
be considered a response to the relatively few efforts devoted to the derivation of consensus<br />
clustering strategies in the context of fuzzy clustering.<br />
We have given great importance to the experimental evaluation of all our proposals. To<br />
that effect, we have employed several state-of-the-art consensus functions for hard cluster<br />
ensembles –hypergraph based (CSPA, HGPA, MCLA) (Strehl and Ghosh, 2002), evidence<br />
accumulation (EAC) (Fred and Jain, 2005) and similarity-as-data based (ALSAD, KMSAD,<br />
SLSAD) (Kuncheva, Hadjitodorov, and Todorova, 2006)– to implement our self-refining<br />
hierarchical consensus architectures. Moreover, the fuzzy versions of CSPA, EAC, HGPA<br />
and MCLA, plus the VMA soft consensus function (Dimitriadou, Weingessel, and Hornik,<br />
2002) have been used as an evaluation benchmark for our voting based consensus functions<br />
for soft cluster ensembles. Our proposals have been tested over a total of sixteen unimodal<br />
and multimodal data collections, which contain a number of objects ranging from hundreds<br />
to several thousands. In particular, the performance of self-refining hierarchical consensus<br />
architectures has been evaluated on both unimodal (chapters 3 and 4) and multimodal data<br />
collections (chapter 5), whereas the experiments concerning soft consensus functions have<br />
been conducted on the 12 unimodal collections (see chapter 6). However, in the near future, we plan to extend these latter experiments to multimodal data sets. We expect such an extension to involve little extra cost, since any consensus function can easily accommodate our early plus late fusion multimedia clustering proposal. In this sense, we also intend to apply
our multimedia clustering system on well-known multimodal data sets such as VideoClef<br />
(composed of video data along with speech recognition transcripts, metadata and shot-level<br />
keyframes) (VideoCLEF, accessed on May 2009) and ImageClef (still images annotated with<br />
text) (ImageCLEF, accessed on May 2009).<br />
Furthermore, we have conducted our experiments on cluster ensembles of very different<br />
sizes (from 6 to 5124 clusterings), in order to evaluate the influence of this factor on the<br />
computational facet of our proposals. In all the experiments, the statistical significance<br />
of the results at the 5% significance level has been evaluated, either explicitly or by their<br />
presentation by means of boxplot charts.<br />
As mentioned in appendix A.6, the experiments conducted in this thesis have been<br />
run under Matlab 7.0.4 on Dual Pentium 4 3GHz/1 GB RAM computers. A total of<br />
20 computers were employed during approximately 9 months at an almost constant pace,<br />
combining for an estimated total running time of more than 10 years!<br />
This experimental variability has provided us with a comparative view of the state-of-the-art consensus functions employed, both in terms of the quality of the consensus clusterings
they yield and their computational complexity, and the most relevant conclusions<br />
drawn are enumerated next. As regards the consensus functions for hard cluster ensembles,<br />
we have observed that, in general terms, the EAC and HGPA consensus functions deliver,<br />
by far, the poorest quality consensus clusterings. We believe that the low performance of<br />
EAC is due to the fact that it was originally devised for consolidating clusterings with a very<br />
high number of clusters into a consensus partition with a smaller number of clusters (Fred<br />
and Jain, 2005). However, in our experiments, both the cluster ensemble components and<br />
the consensus clustering have the same number of clusters, which probably has a detrimental<br />
effect on the quality of the consensus partitions obtained by the evidence accumulation<br />
approach.<br />
From a computational standpoint, the HGPA and MCLA consensus functions are applicable to larger data collections than the rest, as their complexity is linear in the number
of objects n. However, the execution time of MCLA is penalized when it is executed on<br />
large cluster ensembles, as its time complexity is quadratic with l (the number of cluster<br />
ensemble components). As regards the soft consensus functions, VMA constitutes the most<br />
attractive alternative, as it yields pretty high quality consensus clusterings while being fast<br />
to execute, thanks to the simultaneous execution of the cluster disambiguation and voting<br />
procedures. A rather opposite behaviour is shown by the soft versions of the EAC and<br />
HGPA consensus functions: the former is notably time consuming, while the latter outputs<br />
really poor quality consensus clusterings.<br />
Let us get critical for a while: possibly one of the major sources of criticism for this<br />
work refers to the rather unrealistic assumption (though not uncommon in the literature)<br />
that the number of clusters the objects must be grouped into (referred to as k) is a known parameter. In practice, however, the user seldom knows how many clusters should be found,
so it becomes a further indeterminacy to deal with.<br />
In this work, all the clusterings involved in any process (i.e. the cluster ensemble components<br />
and the consensus clusterings) have the same number of clusters, which coincides<br />
with the number of true groups in the data, defined by the ground truth that constitutes<br />
the gold standard for ultimately evaluating the quality of our results. By doing so, we have
also disregarded a common diversity factor employed in the creation of cluster ensembles,<br />
which often contain clusterings with different numbers of clusters (often chosen randomly).<br />
However, we would like to highlight at this point that not many of the consensus functions<br />
found in the literature are capable of estimating the correct number of clusters in<br />
the data set, thus making it necessary to specify the desired value of k as one of their parameters.<br />
Quite obviously, two of the clearest future research directions of this work are i)<br />
estimating the number of clusters of the consensus clustering solution, and ii) adapting the<br />
proposed consensus functions for dealing with cluster ensemble components with distinct<br />
numbers of clusters. The achievement of these goals would constitute the ultimate step<br />
towards a fully generic approach to robust clustering based on cluster ensembles.<br />
7.1 Hierarchical consensus architectures<br />
As regards the computational efficiency of consensus processes, the fact that their space and<br />
time complexities usually scale linearly or quadratically with the cluster ensemble size can<br />
make the execution of traditional one-step (aka flat) consensus processes (in which the whole<br />
cluster ensemble is input to the consensus function at once) very costly or even unfeasible<br />
when conducted on highly populated cluster ensembles. For this reason, the application of<br />
a divide-and-conquer strategy on the cluster ensemble –which gives rise to the hierarchical<br />
consensus architectures (HCA) proposed in chapter 3– constitutes an alternative to classic<br />
flat consensus that, besides leaving out none of the l cluster ensemble components, is also<br />
naturally parallelizable, making it even more computationally appealing.<br />
In particular, two types of hierarchical consensus architectures have been proposed:<br />
random and deterministic HCA. Both architectures differ in the way the user intervenes in<br />
their design. In random HCA, the user selects the size (b) of the mini-ensembles upon which<br />
intermediate consensus processes are conducted, which, together with the cluster ensemble size l,<br />
determines the number of stages of the consensus architecture. Compared to them, deterministic<br />
HCA provide a more modular approach to consensus clustering, as clusterings of<br />
the same nature are combined at each stage of the hierarchy. In fact, our first approach to<br />
hierarchical consensus architectures dealt with deterministic HCA (Sevillano et al., 2007a),<br />
although it was solely focused on the analysis of the quality of the consensus clusterings<br />
obtained, not on its computational aspect.<br />
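To make the divide-and-conquer idea behind random HCA concrete, the following minimal sketch (ours, for illustration; not the thesis codebase) recursively combines mini-ensembles of size b. The plurality_consensus stand-in is a hypothetical toy consensus function that assumes pre-aligned cluster labels, whereas the thesis employs functions such as EAC or MCLA at each intermediate node.<br />

```python
from collections import Counter

def plurality_consensus(mini_ensemble):
    """Toy stand-in consensus function: per object, take the most frequent
    label across the combined clusterings (assumes pre-aligned labels)."""
    n = len(mini_ensemble[0])
    return [Counter(c[i] for c in mini_ensemble).most_common(1)[0][0]
            for i in range(n)]

def random_hca(ensemble, b, consensus=plurality_consensus):
    """Random HCA: split the ensemble into mini-ensembles of size b, run an
    intermediate consensus on each, and iterate on the resulting (smaller)
    ensemble until a single consensus clustering remains (requires b >= 2)."""
    while len(ensemble) > 1:
        ensemble = [consensus(ensemble[i:i + b])
                    for i in range(0, len(ensemble), b)]
    return ensemble[0]

# l = 5 clusterings of n = 6 objects into k = 2 clusters
ensemble = [[0, 0, 1, 1, 0, 1],
            [0, 0, 1, 1, 1, 1],
            [0, 1, 1, 1, 0, 1],
            [0, 0, 0, 1, 0, 1],
            [0, 0, 1, 1, 0, 0]]
print(random_hca(ensemble, b=2))  # → [0, 0, 1, 1, 0, 1]
```

With b = 2, the clusterings are combined through roughly log2(l) stages, and the intermediate processes within each stage are mutually independent, which is what makes the architecture naturally parallelizable.<br />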
Extensive experiments have proven that the computational efficiency of HCAs is highly dependent<br />
on the characteristics of the consensus function employed for combining the clusterings<br />
(in particular, it depends on how its time complexity scales with the number of clusterings<br />
combined). For instance, flat consensus based on the EAC consensus function is more efficient<br />
than any hierarchical architecture, whereas a rather opposite behaviour is observed<br />
when MCLA is used.<br />
Moreover, we have observed that HCAs become faster than flat consensus when operating<br />
on highly populated cluster ensembles, regardless of whether their fully serial or parallel<br />
implementation is considered (except when the EAC consensus function is employed). As expected,<br />
the fully parallel version of HCAs outperforms flat consensus (often by a large<br />
margin), even when small cluster ensembles are employed. An additional interesting feature<br />
of hierarchical consensus architectures is that they provide a means for obtaining a<br />
consensus clustering solution in scenarios where the complexity of flat consensus makes its<br />
execution impossible (for a given set of computational resources).<br />
Given the fact that multiple specific implementation variants of a HCA exist, and that<br />
their time complexities can differ largely, it seems necessary to provide the user with tools<br />
for predicting, for a given consensus clustering problem, which is the most computationally<br />
efficient one. In this sense, a simple methodology for estimating their running time<br />
–and thus, selecting which is the least time consuming– has also been proposed. Despite<br />
its simplicity, the proposed methodology is capable of achieving an accuracy close to 80%<br />
when predicting the fastest serially implemented HCA variant, while this percentage goes<br />
down to nearly 50% in the parallel implementation case. This difference is caused by the<br />
fact that, in the parallel case, the running time estimation is more sensitive to random<br />
deviations of the measured running times the estimation is based upon, as it often ends up<br />
depending on a single execution time sample. However, the impact of incorrect predictions,<br />
when measured in running time overheads with respect to the truly fastest HCA variant,<br />
is well below 10 seconds in a vast majority of the experiments conducted —of course, the<br />
relative importance of such deviations will ultimately depend on the time requirements of<br />
the specific application the HCA is embedded in.<br />
Chapter 7. Conclusions<br />
Though put forward in a robust clustering via cluster ensembles framework, hierarchical<br />
consensus architectures can be of interest in any consensus clustering related problem where<br />
cluster ensembles containing a large number of components are involved. Furthermore,<br />
HCAs are directly portable to a fuzzy clustering scenario with no modifications.<br />
In our opinion, the main weakness of this proposal lies in the rather simplistic approach<br />
taken in the running time estimation methodology, which employs the execution time of a<br />
single consensus process run for estimating the time complexity of the whole HCA. Despite<br />
experiments have demonstrated that its performance is pretty good, we conjecture that<br />
a possible means for improving it –especially in the parallel case, where lower prediction<br />
accuracies are obtained– would consist in modelling statistically the running times of the<br />
consensus processes the estimation is based on.<br />
7.2 Consensus self-refining procedures<br />
Besides the computational difficulties that have motivated the development of hierarchical<br />
consensus architectures, the use of large cluster ensembles also poses a challenge as far as the<br />
quality of the obtained consensus clustering is concerned. Indeed, the somewhat indiscriminate<br />
generation of clusterings encouraged by our robust clustering via cluster ensembles<br />
proposal may presumably lead to the creation of low quality cluster ensemble components,<br />
which affects the quality of consensus clustering negatively. In order to mitigate the undesired<br />
influence of these components, we have devised an unsupervised strategy for excluding<br />
them from the consensus process.<br />
The rationale of this strategy is the following: starting with a reference clustering,<br />
we measure its similarity (in terms of average normalized mutual information, or φ(ANMI))<br />
with respect to the l cluster ensemble components. Subsequently, a percentage p of these<br />
components is selected, after ranking them according to their similarity with respect to the<br />
aforementioned reference clustering. Last, the self-refined consensus clustering is obtained<br />
by combining the clusterings included in this reduced cluster ensemble, according to either<br />
a flat or a hierarchical architecture —a decision that can be reliably made using the running<br />
time estimation methodology mentioned earlier.<br />
Following this generic approach, two self-refining strategies have been proposed. They<br />
solely differ in the origin of the clustering used as the reference of the self-refining procedure.<br />
In the first version (denominated consensus based self-refining), the reference clustering is<br />
the result of a previous consensus process conducted upon the cluster ensemble at hand.<br />
In contrast, the second self-refining procedure (referred to as selection based self-refining)<br />
employs one of the cluster ensemble components as the reference clustering, which is selected<br />
by means of a φ(ANMI) maximization criterion.<br />
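A minimal sketch of the component selection step common to both self-refining variants, assuming the sqrt-normalized mutual information of Strehl and Ghosh (2002); the function names are ours, for illustration only.<br />

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two label vectors, with the
    sqrt normalization used by Strehl and Ghosh (2002)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda p: -sum(c / n * math.log(c / n) for c in p.values())
    ha, hb = h(pa), h(pb)
    if ha == 0.0 or hb == 0.0:
        return 0.0
    mi = sum(c / n * math.log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    return mi / math.sqrt(ha * hb)

def refine_ensemble(reference, ensemble, p):
    """Rank the cluster ensemble components by their NMI with the reference
    clustering and keep the top p percent; a consensus on this reduced
    ensemble yields the self-refined consensus clustering."""
    ranked = sorted(ensemble, key=lambda c: nmi(reference, c), reverse=True)
    keep = max(1, round(p / 100 * len(ensemble)))
    return ranked[:keep]

ensemble = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1]]
print(refine_ensemble([0, 0, 1, 1], ensemble, p=50))
# → [[0, 0, 1, 1], [1, 1, 0, 0]] (NMI ignores label permutations)
```

Note that NMI is invariant to cluster relabelings, so [1, 1, 0, 0] is rightly treated as identical to the reference partition [0, 0, 1, 1].<br />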
We would like to highlight the fact that the self-refining procedure is almost fully unsupervised.<br />
The only user intervention is the selection of the percentage p of the l cluster<br />
ensemble components included in the selected cluster ensemble the self-refined consensus clustering<br />
is derived upon. In order to minimize the risk of negatively biasing the self-refining<br />
procedure results by a suboptimal selection of p, the user is prompted to select not a<br />
single, but a set of values of p. The self-refining procedure will produce a self-refined consensus<br />
clustering for each distinct value of p, selecting a posteriori, in a fully unsupervised<br />
manner, the one with maximum average normalized mutual information with respect to the<br />
cluster ensemble (a selection process that is given the name of supraconsensus (Strehl and<br />
Ghosh, 2002)).<br />
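The supraconsensus selection step can be sketched as follows: among a set of candidate (here, self-refined) consensus clusterings, pick the one maximizing the average NMI with respect to the cluster ensemble components. This is a minimal illustration under our naming, not the thesis implementation.<br />

```python
import math
from collections import Counter

def nmi(a, b):
    # sqrt-normalized mutual information between two label vectors
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda p: -sum(c / n * math.log(c / n) for c in p.values())
    ha, hb = h(pa), h(pb)
    if ha == 0.0 or hb == 0.0:
        return 0.0
    mi = sum(c / n * math.log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    return mi / math.sqrt(ha * hb)

def supraconsensus(candidates, ensemble):
    """Among candidate consensus clusterings, pick the one maximizing the
    average NMI (φ(ANMI)) with respect to the cluster ensemble components."""
    anmi = lambda c: sum(nmi(c, e) for e in ensemble) / len(ensemble)
    return max(candidates, key=anmi)

ensemble = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
candidates = [[0, 1, 0, 1], [0, 0, 1, 1]]
print(supraconsensus(candidates, ensemble))  # → [0, 0, 1, 1]
```

The selection is fully unsupervised: the ground truth never enters the criterion, only the ensemble itself.<br />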
The analysis of the quality (measured in terms of normalized mutual information with<br />
respect to the ground truth) of the set of self-refined consensus clusterings obtained at each<br />
experiment reveals that the proposed self-refining procedure is notably successful, as this quality is<br />
higher than that of the reference clustering in 83% (for consensus based self-refining) or<br />
56% (for selection based self-refining) of the experiments conducted. Furthermore, we have<br />
also observed that producing multiple self-refined consensus clusterings is a highly beneficial<br />
approach, as the highest quality self-refined clustering is obtained for very disparate values<br />
of p depending on the experiment —from p=2% to p=90%, so it would be pretty easy to<br />
select a suboptimal value of p if a single one were chosen. As far as the quality gains induced<br />
by the self-refining procedure are concerned, relative percentage φ(NMI) increases (with respect<br />
to the non-refined consensus clustering) higher –and quite often much higher– than 10%<br />
are obtained in a vast majority of the experiments conducted.<br />
A further advantage of the self-refining procedure is its ability to homogenize the quality of<br />
the consensus clustering solutions created by distinct consensus architectures –reducing the<br />
variance between their φ(NMI) scores by a factor of 20–, thus making it easier to decide which<br />
is the most appropriate consensus architecture for a given consensus clustering problem on<br />
computational grounds solely.<br />
However, the good performance of the proposed self-refining procedure is somewhat<br />
tarnished by the limited accuracy of the supraconsensus selection process, which manages<br />
to select the highest quality self-refined consensus clustering in less than half of the<br />
experiments conducted, causing an average 14% relative φ(NMI) reduction between the<br />
consensus clustering selected by supraconsensus and the top quality one.<br />
For this reason, the main research activities in this area should be directed, in our<br />
opinion, towards the derivation of accurate supraconsensus selection techniques capable of<br />
choosing, in a fully blind manner and as precisely as possible, the highest quality consensus<br />
clustering among a given bunch of them.<br />
Last, we are pleased to note that fighting the expectable quality decrease suffered<br />
by consensus clusterings created upon large cluster ensembles has also drawn the interest<br />
of other authors. Curiously enough, this issue has also been tackled in (Fern and Lin,<br />
2008) in a very similar fashion to our selection based self-refining procedure, which can be<br />
interpreted as a sign of the soundness of our proposals.<br />
7.3 Multimedia clustering based on cluster ensembles<br />
Undoubtedly, ‘going multimedia’ is a beneficial trend, as it provides a richer vision of<br />
information. However, it poses a challenge when multimodal data is to be processed by<br />
means of unsupervised learning techniques (e.g. clustering), as the existence of multiple<br />
modalities increases the uncertainties about what is the best way to represent, classify or<br />
describe the data. In this sense, intuition tends to suggest that constructive interactions<br />
between the distinct modalities exist, which should lead to a better explanation of the<br />
data. However, it is not clear how this modality fusion should be conducted, either at a<br />
feature level (early fusion) or at a decision level (late fusion). Indeed, our experiments have<br />
demonstrated that early fusion is not always advantageous as regards the quality of the<br />
clustering results —although in other contexts, such as joint multimodal data analysis<br />
and synthesis, it becomes a crucial process (Sevillano et al., 2009).<br />
For this reason, the key point of our approach to robust multimedia clustering consists<br />
in neither prioritizing nor discarding any of the modalities. On the contrary, the user is<br />
encouraged to create clusterings upon each separate modality and on feature level fused<br />
modalities, compiling them all into a multimodal cluster ensemble, upon which a consensus<br />
clustering is created.<br />
Interestingly enough, the application of this strategy –which is nothing but a generalized<br />
version of our approach to robust clustering– naturally calls for the use of hierarchical<br />
consensus architectures, as the existence of multiple (say m) modalities increases cluster<br />
ensemble sizes by a minimum factor of m + 1 (as we consider the m original object representations<br />
plus the one created by their feature level fusion), which poses a computational<br />
challenge to the execution of flat consensus clustering. Furthermore, the hypothetical inclusion<br />
of low quality components in such a large cluster ensemble makes the application of<br />
self-refining procedures an attractive alternative for obtaining good consensus clusterings<br />
upon the aforementioned multimodal cluster ensemble.<br />
In order to evaluate the effect of multimodality in a modular manner, separate consensus<br />
processes have been conducted for each original data modality and for the modality<br />
derived from the early fusion of these. To that effect, a deterministic hierarchical consensus<br />
architecture has been employed in our multimodal consensus clustering experiments, as it<br />
allows a structured construction of consensus clusterings both within and across modalities.<br />
As regards within modality consensus, the results obtained reveal that consensus clusterings<br />
obtained on the multimodal modality (i.e. the one resulting from the early fusion<br />
of the original modalities of the data) attain higher φ(NMI) scores than their unimodal<br />
counterparts in an average of 56% of the experiments conducted.<br />
When the unimodal and multimodal consensus clusterings are combined –thus giving<br />
rise to intermodal consensus clusterings– we observe that, in terms of φ(NMI) with respect<br />
to the ground truth, these are better than the unimodal ones in 59.5% of the experiments,<br />
while this percentage is 34.7% when compared to the multimodal consensus clustering.<br />
However, the fairly distinct results obtained depending on the data set and consensus<br />
function employed suggest that creating an intermodal consensus clustering is a pretty<br />
generic way of proceeding, as its occasionally inferior quality can be compensated by means of<br />
a subsequent self-refining procedure followed by an unsupervised supraconsensus selection<br />
of the final consensus clustering.<br />
If the maximum and median quality components of the multimodal cluster ensemble<br />
are taken as reference thresholds for evaluating the robustness of the self-refined consensus<br />
clustering selected by supraconsensus, we observe that it is 36.6% worse than the former<br />
and 93.5% better than the latter (measured in relative percentage φ(NMI) variations). As in<br />
the unimodal case, this performance would improve if a better supraconsensus selection<br />
process were devised —which, as mentioned above, is one of the future work priorities.<br />
As regards the future research lines in the multimodal clustering area, we plan to investigate<br />
early multimodal fusion techniques capable of unveiling constructive interactions<br />
between modalities, besides applying selection based consensus self-refining on the multimodal<br />
cluster ensemble, as we conjecture that it will probably yield higher quality clusterings<br />
than those obtained by consensus based self-refining.<br />
7.4 Voting based soft consensus functions<br />
The outcome of a fuzzy clustering process is much more informative than its crisp counterpart,<br />
as it indicates the strength of association of each object to each cluster. Despite<br />
this fact, soft clustering combination strategies are a minority in the consensus clustering<br />
literature. In view of this, we have made several proposals in this area, aiming to extend<br />
all our previous proposals to the more generic framework of soft clustering.<br />
There exists a pretty evident parallelism between the strength of association of each<br />
object to each cluster in a fuzzy clustering solution and the degree of preference of a voter<br />
for a candidate in an election. This fact directly allows the application of certain voting<br />
methods for consolidating soft clusterings, considering the clusters as the candidates, the<br />
cluster ensemble components as voters, and the cluster assignment of each object as an election.<br />
However, given the ambiguous identification of clusters inherent to clustering, a cluster<br />
alignment between the cluster ensemble components is required prior to voting.<br />
In this work, we have proposed four consensus functions for soft cluster ensembles, which<br />
are the result of applying as many voting strategies for combining the clusterings in the<br />
ensemble. In particular, we have employed two confidence voting methods –the sum and<br />
product rules, which give rise to the SumConsensus (SC) and ProductConsensus (PC) consensus<br />
functions–, and two positional voting techniques —the Borda and Condorcet voting<br />
strategies that constitute the basis of the BordaConsensus (BC) and CondorcetConsensus<br />
(CC) clustering combiners. The main difference between these two families of voting methods<br />
lies in the fact that the former operate directly on the object-to-cluster association<br />
values that make up the cluster ensemble components, whereas the latter operate on the<br />
ranking of candidates according to the voters’ preferences. For disambiguating the clusters,<br />
we have employed the classic Hungarian algorithm (Kuhn, 1955).<br />
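A minimal sketch of a sum-rule soft consensus in the spirit of SC; for readability, a brute-force search over the k! column permutations stands in for the Hungarian algorithm used in the thesis (same result, only practical for small k), and all names are illustrative.<br />

```python
from itertools import permutations

def align(reference, soft):
    """Permute the columns (clusters) of the soft clustering `soft` so that
    they best match `reference`, maximizing the summed membership overlap.
    Brute force over the k! permutations stands in for the Hungarian
    algorithm (equivalent result, only practical for small k)."""
    k = len(reference[0])
    overlap = lambda i, j: sum(r[i] * s[j] for r, s in zip(reference, soft))
    best = max(permutations(range(k)),
               key=lambda p: sum(overlap(i, p[i]) for i in range(k)))
    return [[row[best[i]] for i in range(k)] for row in soft]

def sum_consensus(soft_ensemble):
    """Sum voting rule: align every soft clustering (an n x k membership
    matrix) to the first one, add up the object-to-cluster association
    values, and harden the result by taking the most voted cluster."""
    ref = soft_ensemble[0]
    n, k = len(ref), len(ref[0])
    total = [[0.0] * k for _ in range(n)]
    for soft in soft_ensemble:
        for i, row in enumerate(align(ref, soft)):
            for j, v in enumerate(row):
                total[i][j] += v
    return [max(range(k), key=lambda j: row[j]) for row in total]

# three soft clusterings of 3 objects into 2 clusters; the second has its
# cluster labels swapped, which the alignment step undoes before voting
e = [[[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]],
     [[0.2, 0.8], [0.1, 0.9], [0.7, 0.3]],
     [[0.6, 0.4], [0.7, 0.3], [0.4, 0.6]]]
print(sum_consensus(e))  # → [0, 0, 1]
```

Replacing the accumulation by a product of membership values would give a PC-style combiner instead, with the alignment step unchanged.<br />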
The experiments conducted have evaluated our four consensus functions (SC, PC, BC<br />
and CC), comparing them with several state-of-the-art soft consensus functions in terms<br />
of their computational complexity and the quality of the consensus clusterings they yield.<br />
In terms of execution time, confidence voting consensus functions are faster than their<br />
positional voting counterparts, as the candidate ranking process penalizes the latter from a<br />
computational standpoint. In this sense, CC is the slowest proposal, due to the exhaustive<br />
pairwise candidate confrontation implicit in the Condorcet voting method. In contrast, the<br />
more computationally efficient PC and SC consensus functions are as fast as or faster than<br />
CSPA, EAC, HGPA and MCLA in 81% of the experiments conducted —however, they<br />
are slower than VMA in 92% of the cases.<br />
If the quality of the hardened version of the fuzzy consensus clusterings (measured<br />
in terms of φ(NMI) with respect to the ground truth) is used as the comparison factor,<br />
we observe that the four proposed consensus functions yield (statistically significantly)<br />
better results than any of the state-of-the-art consensus functions in an average of 72% of the<br />
experiments conducted, which is a clear indicator of the merits of our proposals. It is<br />
important to highlight that it has been impossible to evaluate directly the fuzzy consensus<br />
clusterings output by the four proposed consensus functions, due to the unavailability of<br />
soft labels in the data sets employed. As a future direction of research, we plan to conduct<br />
this fuzzy evaluation, and we conjecture that greater differences between SC, PC, BC and<br />
CC will be observed, as the differences between the results of the voting strategies they<br />
are based upon are somewhat masked when the fuzzy consensus clusterings they yield are<br />
hardened.<br />
Ours is not the first approach to soft consensus clustering based on voting. In fact,<br />
the VMA consensus function employs a weighted version of the sum voting rule (Dimitriadou,<br />
Weingessel, and Hornik, 2002). However, to our knowledge, BordaConsensus (first<br />
introduced in (Sevillano, Alías, and Socoró, 2007b)) and CondorcetConsensus are pioneering<br />
positional voting based consensus functions.<br />
As aforementioned, our proposals deal with clusterings with a constant number of clusters<br />
k, and it would be of paramount interest to adapt them to combine clusterings with<br />
different numbers of clusters. A possible way to do so would consist in completing those<br />
clusterings with fewer clusters with dummy clusters (Ayad and Kamel, 2008), as suggested<br />
in the VMA consensus function (Dimitriadou, Weingessel, and Hornik, 2002).<br />
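The dummy-cluster completion suggested above can be sketched as follows (an illustration under our own naming, not the VMA implementation): clusterings with fewer clusters are padded with all-zero membership columns up to the largest k found in the ensemble.<br />

```python
def pad_with_dummy_clusters(soft, k_max):
    """Append empty 'dummy' clusters (all-zero membership columns) to a soft
    clustering so that every component reaches k_max clusters; no object is
    ever associated with a dummy cluster."""
    return [row + [0.0] * (k_max - len(row)) for row in soft]

soft_ensemble = [[[0.9, 0.1], [0.2, 0.8]],            # k = 2
                 [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]]  # k = 3
k_max = max(len(component[0]) for component in soft_ensemble)
padded = [pad_with_dummy_clusters(component, k_max)
          for component in soft_ensemble]
print(padded[0])  # → [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]]
```

After padding, all components share a common k and the cluster alignment and voting machinery described above applies unchanged.<br />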
Besides this, possibly the clearest direction for future research in this area consists of<br />
adapting the simultaneous cluster disambiguation and voting mechanism of VMA, which<br />
would probably i) reduce the time complexity of the proposed consensus functions, and ii)<br />
require introducing some adjustments to the voting methods employed. Moreover, we are<br />
also interested in exploring other existing techniques for solving the cluster disambiguation<br />
problem, analyzing their impact both in terms of the quality of the consensus clustering<br />
solutions obtained and the overall computational complexity of the consensus function.<br />
References<br />
Agogino, A. and K. Tumer. 2006. Efficient agent-based cluster ensembles. In Proceedings of<br />
the 5th International Joint Conference on Autonomous Agents and Multi-Agent Systems.<br />
Agrawal, R., J. Gehrke, D. Gunopulos, and P. Raghavan. 1998. Automatic subspace<br />
clustering of high dimensional data for data mining applications. In Proceedings of the<br />
ACM-SIGMOD Conference on the Management of Data, pages 94–105, Seattle, WA,<br />
USA.<br />
Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on<br />
Automatic Control, 19(6):716–722.<br />
Al-Sultan, K. 1995. A Tabu search approach to the clustering problem. Pattern Recognition,<br />
28(9):1443–1451.<br />
Anderberg, M.R. 1973. Cluster Analysis for Applications. Monographs and Textbooks on<br />
Probability and Mathematical Statistics. New York: Academic Press, Inc.<br />
Anderson, L.W., D.R. Krathwohl, P.W. Airasian, K.A. Cruikshank, R.E. Mayer, P.R. Pintrich,<br />
J. Raths, and M.C. Wittrock. 2001. A Taxonomy for Learning, Teaching, and<br />
Assessing – A Revision of Bloom’s Taxonomy of Educational Objectives. Addison Wesley<br />
Longman, Inc.<br />
Aslam, J.-A. and M. Montague. 2001. Models for metasearch. In Proceedings of the 24th<br />
ACM SIGIR Conference, pages 276–284, New Orleans, LA, USA.<br />
Asuncion, A. and D.J. Newman. 1999. UCI Machine Learning Repository.<br />
http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine,<br />
School of Information and Computer Sciences.<br />
Ayad, H.G. and M.S. Kamel. 2008. Cumulative Voting Consensus Method for Partitions<br />
with Variable Number of Clusters. IEEE Transactions on Pattern Analysis and Machine<br />
Intelligence, 30(1):160–173.<br />
Ball, G.H. and D.J. Hall. 1965. ISODATA, a novel method of data analysis and classification.<br />
Technical Report, Stanford University, Stanford, CA, USA.<br />
Barnard, K., P. Duygulu, D.A. Forsyth, N. De Freitas, D.M. Blei, and M.I. Jordan. 2003.<br />
Matching Words and Pictures. Journal on Machine Learning Research, 3:1107–1135.<br />
Barnard, K. and D.A. Forsyth. 2001. Learning the semantics of words and pictures. In<br />
Proceedings of the IEEE International Conference on Computer Vision, volume II, pages<br />
408–415.<br />
Barthelemy, J.P., B. Leclerc, and B. Monjardet. 1986. On the use of ordered sets in<br />
problems of comparison and consensus of classifications. Journal of Classification, 3:225–<br />
256.<br />
Bekkerman, R. and J. Jeon. 2007. Multi-modal clustering for multimedia collections. In<br />
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern<br />
Recognition, pages 1–8.<br />
Ben-Hur, A., D. Horn, H. Siegelmann, and V. Vapnik. 2001. Support vector clustering.<br />
Journal on Machine Learning Research, 2:125–137.<br />
Benitez, A.B and S.F. Chang. 2002. Perceptual knowledge construction from annotated<br />
image collections. In Columbia University ADVENT, pages 26–29.<br />
Berkhin, P. 2002. Survey of clustering data mining techniques. Available<br />
online at http://www.accrue.com/products/rp_cluster_review.pdf or<br />
http://citeseer.nj.nec.com/berkhin02survey.html.<br />
Biggs, J.B. and K. Collis. 1982. Evaluating the Quality of Learning: the SOLO taxonomy.<br />
New York: Academic Press.<br />
Bingham, E. and H. Mannila. 2001. Random projection in dimensionality reduction: applications<br />
to image and text data. In Proceedings of the 7th ACM SIGKDD International<br />
Conference on Knowledge Discovery and Data Mining, pages 245–250, San Francisco,<br />
CA, USA.<br />
Bloom, B.S. 1956. Taxonomy of Educational Objectives: The Classification of Educational<br />
Goals. Susan Fauer Company, Inc.<br />
Borda, J.C. de. 1781. Mémoire sur les élections au scrutin. Histoire de l’Académie Royale<br />
des Sciences, Paris.<br />
Boulis, C. and M. Ostendorf. 2004. Combining multiple clustering systems. In J.F. Boulicaut,<br />
F. Esposito, F. Giannotti, and D. Pedreschi, editors, Proceedings of the 8th European<br />
Conference on Principles and Practice of Knowledge Discovery in Databases,<br />
LNCS vol. 3202, pages 63–74. Springer.<br />
Brachman, R. and T. Anand. 1996. The process of knowledge discovery in databases: A<br />
human-centered approach. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,<br />
editors, Advances in Knowledge Discovery and Data Mining, pages 37–58, Menlo<br />
Park, CA, USA. AAAI Press.<br />
Buehren, M. 2008. Functions for the rectangular assignment problem.<br />
http://www.mathworks.com/matlabcentral/fileexchange/6543.<br />
Buscaldi, D. and P. Rosso. 2007. Upv-wsd : Combining different wsd methods by means of<br />
fuzzy borda voting. In Proceedings of the International SemEval Workshop, ACL 2007,<br />
pages 434–437, Prague, Czech Republic.<br />
Cai, D., X. He, Z. Li, W.Y. Ma, and J.R. Wen. 2004. Hierarchical clustering of www<br />
image search results using visual, textual and link information. In Proceedings of the<br />
12th Annual ACM International Conference on Multimedia, pages 952–959.<br />
Calinski, R.B. and J. Harabasz. 1974. A Dendrite Method for Cluster Analysis. Communications<br />
in Statistics, 3:1–27.<br />
Carpenter, G. and S. Grossberg. 1987. A massively parallel architecture for a self-organizing<br />
neural pattern recognition machine. Computer Vision, Graphics and Image Processing,<br />
37:54–115.<br />
Carpenter, G., S. Grossberg, and D. Rosen. 1991. Fuzzy ART: Fast stable learning and<br />
categorization of analog patterns by an adaptive resonance system. Neural Networks,<br />
4:759–771.<br />
Carreira-Perpiñán, M.A. 1997. A review of dimension reduction techniques. Technical<br />
Report CS-96-09, Department of Computer Science, University of Sheffield, Sheffield,<br />
UK.<br />
Chakaravathy, S.V. and J. Ghosh. 1996. Scale based clustering using a radial basis function<br />
network. IEEE Transactions on Neural Networks, 2(5):1250–1261.<br />
Cheeseman, P. and J. Stutz. 1996. Bayesian classification (Autoclass): theory and results.<br />
In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in<br />
Knowledge Discovery and Data Mining, pages 153–180, Menlo Park, CA. AAAI Press.<br />
Chen, F., U. Gargi, L. Niles, and H. Schütze. 1999. Multi-modal browsing of images in<br />
web documents. In D. Doermann, editor, Proceedings of the 1999 Symposium on Image<br />
Understanding Technology, pages 265–276, Annapolis, MD, USA. UMD.<br />
Chen, N. 2006. A Survey of Indexing and Retrieval of Multimodal Documents: Text and<br />
Images. Technical Report 2006-505, School of Computing, Queen’s University, Kingston,<br />
Ontario, Canada.<br />
Chiang, J. and P. Hao. 2003. A new kernel-based fuzzy clustering approach: support vector<br />
clustering with cell growing. IEEE Transactions on Fuzzy Systems, 11(4):518–527.<br />
Chu, S. and J. Roddick. 2000. A clustering algorithm using the Tabu search approach with<br />
simulated annealing. In N. Ebecken and C. Brebbia, editors, Data Mining II–Proceedings<br />
of the 2nd International Conference on Data Mining Methods and Databases, pages 515–<br />
523.<br />
Cobo, G., X. Sevillano, F. Alías, and J.C. Socoró. 2006. Técnicas de representación de<br />
textos para clasificación no supervisada de documentos. Journal of the Spanish Society<br />
for Natural Language Processing (Procesamiento del Lenguaje Natural)–in Spanish,<br />
37:329–336.<br />
Condorcet, M. de. 1785. Essai sur l’application de l’analyse à la probabilité des décisions<br />
rendues à la pluralité des voix.<br />
Cover, T.M. and J.A. Thomas. 1991. Elements of information theory. John Wiley and<br />
Sons.<br />
Cutting, D.R., D.R. Karger, J.O. Pedersen, and J.W. Tukey. 1992. Scatter/Gather: a<br />
cluster-based approach to browsing large document collections. In Proceedings of the<br />
15th annual international ACM SIGIR conference on Research and Development in<br />
Information Retrieval, pages 318–329, Copenhagen, Denmark, June.<br />
Dasgupta, S., C. Papadimitriou, and U. Vazirani. 2006. Algorithms. McGraw-Hill.<br />
Davies, D.L. and D.W. Bouldin. 1979. A Cluster Separation Measure. IEEE Transactions<br />
on Pattern Analysis and Machine Intelligence, 1:224–227.<br />
Deerwester, S., S.-T. Dumais, G.-W. Furnas, T.-K. Landauer, and R. Harshman. 1990.<br />
Indexing by Latent Semantic Analysis. Journal of the American Society for Information<br />
Science, 41(6):391–407.<br />
Denoeud, L. and A. Guénoche. 2006. Comparison of distance indices between partitions.<br />
In V. Batagelj, H.H. Bock, A. Ferligoj, and A. Žiberna, editors, Data Science and<br />
Classification, pages 21–28. Springer.<br />
Dhillon, I., J. Fan, and Y. Guan. 2001. Efficient clustering of very large document collections.<br />
In R.L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R.R. Namburu,<br />
editors, Data Mining for Scientific and Engineering Applications. Kluwer Academic<br />
Publishers.<br />
Dietterich, T.G. 2000. Ensemble methods in machine learning. In J. Kittler and F. Roli,<br />
editors, Multiple Classifier Systems, LNCS vol. 1857, pages 1–15. Springer.<br />
Dimitriadou, E., A. Weingessel, and K. Hornik. 2001. Voting-merging: An ensemble<br />
method for clustering. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial<br />
Neural Networks-ICANN 2001, LNCS vol. 2130, pages 217–224. Springer.<br />
Dimitriadou, E., A. Weingessel, and K. Hornik. 2002. A combination scheme for fuzzy<br />
clustering. International Journal of Pattern Recognition and Artificial Intelligence,<br />
16(7):901–912.<br />
Ding, C., X. He, and H.D. Simon. 2005. On the equivalence of nonnegative matrix factorization<br />
and spectral clustering. In Proceedings of the 2005 SIAM Conference on Data<br />
Mining, pages 606–610.<br />
Duda, R.O., P.E. Hart, and D.G. Stork. 2001. Pattern Classification. Wiley Interscience.<br />
Dudoit, S. and J. Fridlyand. 2003. Bagging to Improve the Accuracy of a Clustering<br />
Procedure. Bioinformatics, 19(9):1090–1099.<br />
Dunn, J.C. 1973. A fuzzy relative of the ISODATA process and its use in detecting compact<br />
well-separated clusters. Journal of Cybernetics, 3:32–57.<br />
Duygulu, P., K. Barnard, N. de Freitas, and D. Forsyth. 2002. Object recognition as<br />
machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of<br />
the Seventh European Conference on Computer Vision, volume 4, pages 97–112. Springer<br />
Verlag.<br />
Dy, J.G. and C.E. Brodley. 2004. Feature Selection for Unsupervised Learning. Journal of<br />
Machine Learning Research, 5:845–889.<br />
Ertöz, L., M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and<br />
densities in noisy, high dimensional data. In Proceedings of the 2nd SIAM International<br />
Conference on Data Mining, pages 47–58, San Francisco, CA, USA.<br />
Ester, M., H.P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering<br />
clusters in large spatial data sets with noise. In Proceedings of the 2nd International<br />
Conference on Knowledge Discovery and Data Mining, pages 226–231, Portland, OR,<br />
USA.<br />
Fayyad, U. 1996. Data mining and knowledge discovery: making sense out of data. IEEE<br />
Expert: Intelligent Systems and Their Applications, 11(5):20–25.<br />
Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth. 1996. From data mining to knowledge<br />
discovery: an overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,<br />
editors, Advances in Knowledge Discovery and Data Mining, pages 1–30, Menlo<br />
Park, CA, USA. AAAI Press.<br />
Fenty, J. 2004. Analyzing distances. The Stata Journal, 4(1):1–26.<br />
Fern, X.Z. and C.E. Brodley. 2003. Random Projection for High Dimensional Data Clustering:<br />
A Cluster Ensemble Approach. In Proceedings of 20th International Conference<br />
on Machine Learning, Washington, DC, USA.<br />
Fern, X.Z. and C.E. Brodley. 2004. Solving cluster ensemble problems by bipartite graph<br />
partitioning. In Proceedings of the 21st International Conference on Machine Learning,<br />
pages 281–288.<br />
Fern, X.Z. and W. Lin. 2008. Cluster ensemble selection. In Proceedings of the 2008 SIAM<br />
International Conference on Data Mining.<br />
Filkov, V. and S. Skiena. 2004. Integrating microarray data by consensus clustering.<br />
International Journal of Artificial Intelligence Tools, 13(4):863–880.<br />
Fischer, B. and J.M. Buhmann. 2003. Bagging for path-based clustering. IEEE Transactions<br />
on Pattern Analysis and Machine Intelligence, 25(11):1411–1415.<br />
Focardi, S.M. 2001. Clustering economic and financial time series: exploring the existence<br />
of stable correlation conditions. Technical Report 2001-04, The Intertek Group, Paris,<br />
France.<br />
Fodor, I.K. 2002. A survey of dimension reduction techniques. Technical Report UCRL-<br />
ID-148494, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory,<br />
Livermore, CA.<br />
Forgy, E. 1965. Cluster analysis of multivariate data: efficiency vs. interpretability of<br />
classifications. Biometrics, 21:768–780.<br />
Fowlkes, E.B. and C.L. Mallows. 1983. A method for comparing two hierarchical clusterings.<br />
Journal of the American Statistical Association, 78(383):553–569.<br />
Fred, A. 2001. Finding consistent clusters in data partitions. In J. Kittler and F. Roli,<br />
editors, Multiple Classifier Systems, LNCS vol. 2096, pages 309–318. Springer.<br />
Fred, A. and A.K. Jain. 2002a. Data clustering using evidence accumulation. In Proceedings<br />
of the 16th International Conference on Pattern Recognition, pages 276–280.<br />
Fred, A. and A.K. Jain. 2002b. Evidence accumulation clustering based on the k-means<br />
algorithm. In T. Caelli, A. Amin, R.P.W. Duin, M. Kamel, and D. de Ridder, editors,<br />
Structural, Syntactic, and Statistical Pattern Recognition, LNCS vol. 2396, pages 442–<br />
451. Springer.<br />
Fred, A. and A.K. Jain. 2003. Robust data clustering. In Proceedings of the 2003 IEEE<br />
Computer Society Conference on Computer Vision and Pattern Recognition, volume 2,<br />
pages 128–133.<br />
Fred, A. and A.K. Jain. 2005. Combining Multiple Clusterings Using Evidence Accumulation.<br />
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850.<br />
Gao, B., T.Y. Liu, T. Qin, X. Zheng, Q.S. Cheng, and W.Y. Ma. 2005. Web image clustering<br />
by consistent utilization of visual features and surrounding texts. In Proceedings of the<br />
13th Annual ACM International Conference on Multimedia, pages 112–121.<br />
Gionis, A., H. Mannila, and P. Tsaparas. 2007. Clustering Aggregation. ACM Transactions<br />
on Knowledge Discovery from Data, 1(1):1–30.<br />
Goder, A. and V. Filkov. 2008. Consensus clustering algorithms: Comparison and refinement.<br />
In Proceedings of the 2008 SIAM Workshop on Algorithm Engineering and<br />
Experiments (ALENEX), pages 109–117.<br />
Gonzàlez, E. and J. Turmo. 2006. Unsupervised Document Clustering by Weighted Combination.<br />
LSI Research Report LSI-06-17-R, Departament de Llenguatges i Sistemes<br />
Informàtics, Barcelona.<br />
Gonzàlez, E. and J. Turmo. 2008a. Comparing Non-Parametric Ensemble Methods. In<br />
E. Kapetanios, V. Sugumaran, and M. Spiliopoulou, editors, Proceedings of the 13th<br />
International Conference on Applications of Natural Language to Information Systems,<br />
LNCS vol. 5039, pages 245–256. Springer.<br />
Gonzàlez, E. and J. Turmo. 2008b. Non-Parametric Document Clustering by Ensemble<br />
Methods. Procesamiento del Lenguaje Natural, 40:91–98.<br />
Gopal, S. and C. Woodcock. 1994. Theory and methods for accuracy assessment of thematic<br />
maps using fuzzy sets. Photogrammetric Engineering and Remote Sensing, 60(2):181–<br />
188.<br />
Greene, D. and P. Cunningham. 2006. Efficient ensemble methods for document clustering.<br />
Technical Report CS-2006-48, Trinity College Dublin.<br />
Greene, D., A. Tsymbal, N. Bolshakova, and P. Cunningham. 2004. Ensemble clustering in<br />
medical diagnostics. In Proceedings of the 17th IEEE Symposium on Computer-Based<br />
Medical Systems, pages 576–581. IEEE Computer Society.<br />
Gunes, H. and M. Piccardi. 2005. Affect recognition from face and body: early fusion vs.<br />
late fusion. In Proceedings of 2005 IEEE International Conference on Systems, Man<br />
and Cybernetics, vol. 4, pages 3437–3443.<br />
Hadjitodorov, S.T. and L.I. Kuncheva. 2007. Selecting diversifying heuristics for cluster<br />
ensembles. In M. Haindl, J. Kittler, and F. Roli, editors, Proceedings of the 7th International<br />
Workshop on Multiple Classifier Systems, LNCS vol. 4472, pages 200–209.<br />
Springer.<br />
Hadjitodorov, S.T., L.I. Kuncheva, and L.P. Todorova. 2006. Moderate diversity for better<br />
cluster ensembles. Information Fusion, 7(3):264–275.<br />
Halkidi, M., Y. Batistakis, and M. Vazirgiannis. 2002a. Cluster Validity Methods: Part I.<br />
ACM SIGMOD Record, 31(2):40–45.<br />
Halkidi, M., Y. Batistakis, and M. Vazirgiannis. 2002b. Cluster Validity Methods: Part<br />
II. ACM SIGMOD Record, 31(3):19–27.<br />
Hall, L., I. Özyurt, and J. Bezdek. 1999. Clustering with a genetically optimized approach.<br />
IEEE Transactions on Evolutionary Computation, 3(2):103–112.<br />
Håstad, J., B. Just, J.C. Lagarias, and C.P. Schnorr. 1988. Polynomial Time Algorithms<br />
for Finding Integer Relations among Real Numbers. SIAM Journal on Computing,<br />
18(5):859–881.<br />
He, Z., X. Xu, and S. Deng. 2005. A cluster ensemble method for clustering categorical<br />
data. Information Fusion, 6(2):143–151.<br />
Hearst, M.A. 2006. Clustering versus faceted categories for information exploration. Communications<br />
of the ACM, 49(4):59–61.<br />
Hettich, S. and S.D. Bay. 1999. The UCI KDD Archive. http://kdd.ics.uci.edu. University<br />
of California at Irvine, Dept. of Information and Computer Science.<br />
Hinneburg, A. and D. Keim. 1998. An efficient approach to clustering in large multimedia<br />
data sets with noise. In Proceedings of the 4th International Conference on Knowledge<br />
Discovery and Data Mining, pages 58–65.<br />
Hofmann, T. and J. Buhmann. 1997. Pairwise data clustering by deterministic annealing.<br />
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14.<br />
Höppner, F., F. Klawonn, and R. Kruse. 1999. Fuzzy Cluster Analysis: Methods for<br />
Classification, Data Analysis, and Image Recognition. Wiley.<br />
Hore, P., L. Hall, and D. Goldgof. 2006. A Cluster Ensemble Framework for Large Data<br />
sets. In Proceedings of the 2006 IEEE International Conference on Systems, Man and<br />
Cybernetics, volume 4, pages 3342–3347.<br />
Hoyer, P.O. 2004. Non-Negative Matrix Factorization with Sparseness Constraints. Journal<br />
of Machine Learning Research, 5:1457–1469.<br />
Hubert, L. and P. Arabie. 1985. Comparing Partitions. Journal of Classification, 2:193–<br />
218.<br />
Hyvärinen, A. 1999. Fast and Robust Fixed-Point Algorithms for Independent Component<br />
Analysis. IEEE Transactions on Neural Networks, 10(3):626–634.<br />
Hyvärinen, A., J. Karhunen, and E. Oja. 2001. Independent Component Analysis. John<br />
Wiley and Sons.<br />
ImageCLEF. accessed on May 2009. The CLEF cross language image retrieval track.<br />
http://www.imageclef.org.<br />
Ingaramo, D., D. Pinto, P. Rosso, and M. Errecalde. 2008. Evaluation of internal validity<br />
measures in short-text corpora. In A. Gelbukh, editor, Proceedings of the 9th<br />
International Conference on Intelligent Text Processing and Computational Linguistics,<br />
volume 4919 of Lecture Notes in Computer Science, pages 555–567. Springer Verlag,<br />
Berlin, Heidelberg, New York.<br />
InternetWorldStats.com. accessed on February 2009. Internet usage statistics, February<br />
2009. http://www.internetworldstats.com/stats.htm.<br />
Jäger, G. and U. Benz. 2000. Measures of Classification Accuracy Based on Fuzzy Similarity.<br />
IEEE Transactions on Geoscience and Remote Sensing, 38(3):1462–1467.<br />
Jain, A.K. 1996. Image segmentation using clustering. In K. Bowyer and N. Ahuja, editors,<br />
Advances in Image Understanding. IEEE Press.<br />
Jain, A.K. and R.C. Dubes. 1988. Algorithms for clustering data. Prentice Hall.<br />
Jain, A.K., M.N. Murty, and P.J. Flynn. 1999. Data Clustering: a Survey. ACM Computing<br />
Surveys, 31(3):264–323.<br />
Jakobsson, M. and N.A. Rosenberg. 2007. CLUMPP: a cluster matching and permutation<br />
program for dealing with label switching and multimodality in analysis of population<br />
structure. Bioinformatics, 23:1801–1806.<br />
Jiang, D., C. Tang, and A. Zhang. 2004. Cluster analysis for gene expression data: a<br />
survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370–1386.<br />
Jolliffe, I.T. 1986. Principal Component Analysis. Springer.<br />
Jomier, J., V. LeDigarcher, and S.R. Aylward. 2005. Comparison of vessel segmentations<br />
using STAPLE. In J. Duncan, editor, Proceedings of the 8th International Conference on<br />
Medical Image Computing and Computer-Assisted Intervention, pages 523–530, LNCS<br />
3749. Springer.<br />
Kaban, A. and M. Girolami. 2000. Unsupervised Topic Separation and Keyword Identification<br />
in Document Collections: A Projection Approach. Technical Report No. 10, Dept.<br />
of Computing and Information Systems, University of Paisley.<br />
Käki, M. 2005. Findex: search result categories help users when document ranking fails. In<br />
Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages<br />
131–140. ACM Press.<br />
Pękalska, E. 2005. Dissimilarity Representations in Pattern Recognition. Ph.D. thesis,<br />
Delft University of Technology, The Netherlands.<br />
Karayiannis, N., J. Bezdek, N. Pal, R. Hathaway, and P. Pai. 1996. Repairs to GLVQ: A<br />
new family of competitive learning schemes. IEEE Transactions on Neural Networks,<br />
7(5):1062–1071.<br />
Karypis, G., R. Aggarwal, V. Kumar, and S. Shekhar. 1997. Multilevel hypergraph partitioning:<br />
applications in VLSI domain. In Proceedings of the 34th Design Automation<br />
Conference, pages 526–529.<br />
Karypis, G., E. Han, and V. Kumar. 1999. Chameleon: Hierarchical clustering using<br />
dynamic modeling. IEEE Computer, 32(8):68–75.<br />
Karypis, G. and V. Kumar. 1998. A fast and high quality multilevel scheme for partitioning<br />
irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392.<br />
Kaski, S. 1998. Dimensionality Reduction by Random Mapping: Fast Similarity Computation<br />
for Clustering. In Proceedings of the International Joint Conference on Neural<br />
Networks, pages 413–418, Anchorage, AK, USA.<br />
Kaufman, L. and P. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster<br />
Analysis. New York, NY: John Wiley and Sons.<br />
Kleinberg, J. 2002. An impossibility theorem for clustering. In Advances in Neural<br />
Information Processing Systems, volume 15, pages 463–470.<br />
Klosgen, W., J.M. Zytkow, and J. Zyt. 2002. Handbook of Data Mining and Knowledge<br />
Discovery. USA: Oxford University Press.<br />
Kohavi, R. and G.H. John. 1998. The wrapper approach. In H. Liu and H. Motoda,<br />
editors, Feature Extraction, Construction and Selection: A Data Mining Perspective,<br />
pages 33–50. Springer-Verlag.<br />
Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480.<br />
Kotsiantis, S. and P. Pintelas. 2004. Recent advances in clustering: a brief survey. WSEAS<br />
Transactions on Information Science and Applications, 1(1):73–81.<br />
Kuhn, H. 1955. The Hungarian Method for the Assignment Problem. Naval Research<br />
Logistic Quarterly, 2:83–97.<br />
Kuncheva, L.I., S.T. Hadjitodorov, and L.P. Todorova. 2006. Experimental comparison<br />
of cluster ensemble methods. In Proceedings of the 9th International Conference on<br />
Information Fusion, pages 24–28.<br />
La Cascia, M., S. Sethi, and S. Sclaroff. 1998. Combining textual and visual cues for<br />
content-based image retrieval on the World Wide Web. In Proceedings of the IEEE<br />
Workshop on Content-Based Access of Image and Video Libraries, pages 24–28.<br />
Lange, T. and J.M. Buhmann. 2005. Combining partitions by probabilistic label aggregation.<br />
In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pages 147–156. ACM Press.<br />
Larsen, B. and C. Aone. 1999. Fast and effective text mining using linear time document<br />
clustering. In Proceedings of the 5th International Conference on Knowledge Discovery<br />
and Data Mining, pages 16–22.<br />
Lee, D.D. and H.S. Seung. 1999. Learning the Parts of Objects by Non-Negative Matrix<br />
Factorization. Nature, 401:788–791.<br />
Lee, D.D. and H.S. Seung. 2001. Algorithms for Non-Negative Matrix Factorization. Advances<br />
in Neural Information Processing Systems, 13.<br />
Li, S.Z. and G. Guo. 2000. Content-based Audio Classification and Retrieval using<br />
SVM Learning. In Proceedings of the 1st IEEE Pacific-Rim Conference on Multimedia<br />
(Invited talk).<br />
Li, T., C. Ding, and M.I. Jordan. 2007. Solving Consensus and Semi-supervised Clustering<br />
Problems Using Nonnegative Matrix Factorization. In Proceedings of the 7th IEEE<br />
International Conference on Data Mining, pages 577–582.<br />
Lin, J. and D. Gunopulos. 2003. Dimensionality Reduction by Random Projection and<br />
Latent Semantic Indexing. In Proceedings of the 2003 SIAM Conference on Data Mining.<br />
Linnaeus, C. 1758. Systema Naturae per regna tria naturae, secundum classes, ordines,<br />
genera, species, cum characteribus, differentiis, synonymis, locis. Editio decima, reformata.<br />
Holmiae: Laurentius Salvius.<br />
Liu, W. and Y. Luo. 2005. Applications of clustering data mining in customer analysis<br />
in department store. In Proceedings of the 2005 IEEE International Conference on<br />
Services Systems and Services Management, volume 2, pages 1042–1046.<br />
Loeff, N., C. Ovesdotter-Alm, and D.A. Forsyth. 2006. Discriminating image senses by clustering<br />
with multimodal features. In Proceedings of the COLING/ACL 2006 Conference,<br />
pages 547–554.<br />
Long, B., Z.M. Zhang, and P.S. Yu. 2005. Combining Multiple Clusterings by Soft Correspondence.<br />
In Proceedings of the 5th IEEE International Conference on Data Mining,<br />
pages 282–289.<br />
Maimon, O. and L. Rokach. 2005. Data Mining and Knowledge Discovery Handbook. New<br />
York: Springer.<br />
Mancas-Thillou, C. and B. Gosselin. 2007. Natural scene text understanding. In G. Obinata<br />
and A. Dutta, editors, Vision Systems, Segmentation and Pattern Recognition, pages<br />
307–333, Vienna, Austria. I-Tech Education and Publishing.<br />
Maulik, U. and S. Bandyopadhyay. 2002. Performance Evaluation of Some Clustering<br />
Algorithms and Validity Indices. IEEE Transactions on Pattern Analysis and Machine<br />
Intelligence, 24(12):1650–1654.<br />
McLachlan, G. and T. Krishnan. 1997. The EM Algorithm and Extensions. New York:<br />
Wiley.<br />
Meila, M. 2003. Comparing clusterings by the variation of information. In B. Scholkopf and<br />
M.K. Warmuth, editors, Proceedings of the 16th Annual Conference on Computational<br />
Learning Theory, pages 173–187, LNAI 2777. Springer.<br />
Minaei-Bidgoli, B., A. Topchy, and W.F. Punch. 2004. Ensembles of partitions via data<br />
resampling. In Proceedings of the 2004 International Conference on Information Technology,<br />
volume 2, pages 188–192.<br />
Mirkin, B.G. 1975. On the problem of reconciling partitions. In H.M. Blalock, A. Aganbegian,<br />
F.M. Borodkin, R. Boudon, and V. Capecchi, editors, Quantitative Sociology: International<br />
Perspectives on Mathematical and Statistical Modelling—Quantitative Studies<br />
in Social Relations, pages 441–449, New York. Academic Press.<br />
Miyajima, K. and A. Ralescu. 1993. Modeling of Natural Objects Including Fuzziness and<br />
Application to Image Understanding. In Proceedings of the 2nd IEEE International<br />
Conference on Fuzzy Systems, pages 1049–1054.<br />
Molina, L.C., L. Belanche, and A. Nebot. 2002. Feature selection algorithms: a survey and<br />
experimental evaluation. In Proceedings of the 2002 IEEE International Conference on<br />
Data Mining, pages 306–313.<br />
Montague, M. and J.A. Aslam. 2002. Condorcet Fusion for Improved Retrieval. In Proceedings<br />
of the 2002 ACM Conference on Information and Knowledge Management, pages<br />
538–548.<br />
NetCraft.com. accessed on February 2009. February 2009 web server survey.<br />
http://news.netcraft.com/archives/2009/02/index.html.<br />
Neumann, D.A. and V.T. Norton. 1986. Clustering and isolation in the consensus problem<br />
for partitions. Journal of Classification, 3:281–298.<br />
Ng, A., M.I. Jordan, and Y. Weiss. 2002. On spectral clustering: analysis and an algorithm.<br />
In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information<br />
Processing Systems, volume 14. MIT Press.<br />
Nguyen, N. and R. Caruana. 2007. Consensus Clusterings. In Proceedings of the 7th IEEE<br />
International Conference on Data Mining, pages 607–612.<br />
Oja, E. 1992. Principal components, minor components, and linear neural networks. Neural<br />
Networks, 5:927–935.<br />
Patrikainen, A. and M. Meila. 2005. Spectral Clustering for Microsoft Netscan Data. Technical<br />
report, UW-CSE-2005-06-05, Department of Computer Science and Engineering,<br />
University of Washington, Seattle, July.<br />
Piatetsky-Shapiro, G. 1991. Knowledge Discovery in Real Databases: A Report on the<br />
IJCAI-89 Workshop. AI Magazine, 11(5):68–70.<br />
Pinto, D.E. 2008. On Clustering and Evaluation of Narrow Domain Short-Text Corpora.<br />
Ph.D. thesis, Universidad Politécnica de Valencia, July.<br />
Pinto, F.R., J.A. Carriço, M. Ramirez, and J.S. Almeida. 2007. Ranked Adjusted Rand:<br />
integrating distance and partition information in a measure of clustering agreement.<br />
BMC Bioinformatics, 8(44):1–13.<br />
Punera, K. and J. Ghosh. 2007. Soft Consensus Clustering. In J. Oliveira and W. Pedrycz,<br />
editors, Advances in Fuzzy Clustering and its Applications, pages 69–92. Wiley.<br />
Rand, W.M. 1971. Objective criteria for the evaluation of clustering methods. Journal of<br />
the American Statistical Association, 66:846–850.<br />
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464.<br />
Scott, G., D. Clark, and T. Pham. 2001. A genetic clustering algorithm guided by a descent<br />
algorithm. In Proceedings of the Congress on Evolutionary Computation, volume 2,<br />
pages 734–740, Piscataway, NJ, USA.<br />
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing<br />
Surveys, 34(1):1–47.<br />
Selim, S. and K. Al-Sultan. 1991. A simulated annealing algorithm for the clustering<br />
problem. Pattern Recognition, 24(10):1003–1008.<br />
Sevillano, X., F. Alías, and J.C. Socoró. 2007b. BordaConsensus: a New Consensus Function<br />
for Soft Cluster Ensembles. In Proceedings of the 30th ACM SIGIR Conference,<br />
pages 743–744, Amsterdam, The Netherlands, July.<br />
Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2006a. Feature Diversity in Cluster<br />
Ensembles for Robust Document Clustering. In Proceedings of the 29th ACM SIGIR<br />
Conference, pages 697–698, Seattle, WA, USA, August.<br />
Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2007a. A hierarchical consensus architecture<br />
for robust document clustering. In G. Amati, C. Carpineto, and G. Romano,<br />
editors, Proceedings of 29th European Conference on Information Retrieval, volume 4425<br />
of Lecture Notes in Computer Science, pages 741–744, Rome, Italy. Springer Verlag.<br />
Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2007c. Text clustering on latent thematic<br />
spaces: Variants, strengths and weaknesses. In M.E. Davies, C.C. James, S.A. Abdallah,<br />
and M.D. Plumbley, editors, Proceedings of 7th International Conference on Independent<br />
Component Analysis and Signal Separation, volume 4666 of Lecture Notes in Computer<br />
Science, pages 794–801, London, UK. Springer Verlag.<br />
Sevillano, X., G. Cobo, F. Alías, and J.C. Socoró. 2006b. Robust Document Clustering by<br />
Exploiting Feature Diversity in Cluster Ensembles. Journal of the Spanish Society for<br />
Natural Language Processing (Procesamiento del Lenguaje Natural), 37:169–176.<br />
Sevillano, X., J. Melenchón, G. Cobo, J.C. Socoró, and F. Alías. 2009. Audiovisual analysis<br />
and synthesis for multimodal human-computer interfaces. In M. Redondo, C. Bravo,<br />
and M. Ortega, editors, Engineering the User Interface: From Research to Practice,<br />
pages 179–194, London. Springer Verlag.<br />
Shafiei, M., S. Wang, R. Zhang, E. Milios, B. Tang, J. Tougas, and R. Spiteri. 2006.<br />
A Systematic Study of Document Representation and Dimension Reduction for Text<br />
Clustering. Technical report, CS-2006-05, Dalhousie University.<br />
Shahnaz, F., M.W. Berry, V.P. Pauca, and R.J. Plemmons. 2004. Document clustering<br />
using nonnegative matrix factorization. Information Processing and Management,<br />
42:373–386.<br />
Sharan, R. and R. Shamir. 2000. CLICK: A clustering algorithm with applications to gene<br />
expression analysis. In Proceedings of the 8th International Conference on Intelligent<br />
Systems for Molecular Biology, pages 307–316.<br />
Sheikholeslami, G., S. Chatterjee, and A. Zhang. 1998. WaveCluster: A Multi-Resolution<br />
Clustering Approach for Very Large Spatial Databases. In A. Gupta, O. Shmueli, and<br />
J. Widom, editors, Proceedings of the 24th International Conference on Very Large Data<br />
Bases, pages 428–439, New York, NY, USA. Morgan Kaufmann.<br />
Shi, J. and J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions<br />
on Pattern Analysis and Machine Intelligence, 22(8):888–905.<br />
Snoek, C.G.M., M. Worring, and A.W.M. Smeulders. 2005. Early versus <strong>La</strong>te Fusion in<br />
Semantic Video Analysis. In Proceedings of the 13th ACM International Conference on<br />
Multimedia, pages 399–402.<br />
Srinivasan, S.H. 2002. Features for Unsupervised Document Classification. In Proceedings<br />
of the 6th Conference on Computational Natural Language Learning, pages 36–42,<br />
Taipei, Taiwan.<br />
Stein, B. and O. Niggemann. 1999. On the nature of structure and its identification. In<br />
P. Widmayer, G. Neyer, and S. Eidenbenz, editors, Proceedings of the 25th International<br />
Workshop on Graph-Theoretic Concepts in Computer Science, volume 1665 of Lecture<br />
Notes in Computer Science, pages 122–134. Springer Verlag, Berlin, Heidelberg, New<br />
York.<br />
Steinbach, M., G. Karypis, and V. Kumar. 2004. A comparison of common document<br />
clustering techniques. In Proceedings of the KDD Workshop on Text Mining, pages<br />
17–26, Boston, MA, USA.<br />
Strehl, A. 2002. Relationship-based Clustering and Cluster Ensembles for High-dimensional<br />
Data Mining. Ph.D. thesis, Faculty of the Graduate School of The University of Texas<br />
at Austin, May.<br />
Strehl, A. and J. Ghosh. 2002. Cluster Ensembles – A Knowledge Reuse Framework for<br />
Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617.<br />
Tang, B., X. Luo, M.I. Heywood, and M. Shepherd. 2004. A Comparative Study of Dimension<br />
Reduction Techniques for Document Clustering. Technical Report CS-2004-14,<br />
Faculty of Computer Science, Dalhousie University, Halifax, Canada.<br />
Tang, B., M. Shepherd, E. Milios, and M.I. Heywood. 2005. Comparing and Combining<br />
Dimension Reduction Techniques for Efficient Text Clustering. In Proceedings of the<br />
International Workshop on Feature Selection for Data Mining, pages 17–26, Newport<br />
Beach, CA, USA.<br />
Theodoridis, S. and K. Koutroumbas. 1999. Pattern Recognition. Academic Press.<br />
Tjhi, W.C. and L. Chen. 2007. Dual Fuzzy-possibilistic Co-clustering for Document Categorization<br />
and Summarization. In Optimization-based Data Mining Techniques with<br />
Applications Workshop of the IEEE International Conference on Data Mining, pages<br />
259–264.<br />
Tjhi, W.C. and L. Chen. 2009. Dual Fuzzy-possibilistic Co-clustering for Categorization of<br />
Documents. IEEE Transactions on Fuzzy Systems. Accepted for future publication (as<br />
of May 2009).<br />
Tombros, A., R. Villa, and C.J. van Rijsbergen. 2002. The effectiveness of query-specific<br />
hierarchic clustering in information retrieval. International Journal on Information<br />
Processing and Management, 38(4):559–582.<br />
Topchy, A., A.K. Jain, and W. Punch. 2003. Combining Multiple Weak Clusterings. In<br />
Proceedings of the 3rd IEEE International Conference on Data Mining, pages 331–338,<br />
Melbourne, FL, USA.<br />
Topchy, A., A.K. Jain, and W. Punch. 2004. A Mixture Model for Clustering Ensembles.<br />
In Proceedings of the 2004 SIAM Conference on Data Mining, pages 379–390.<br />
Topchy, A., M. Law, A.K. Jain, and A. Fred. 2004. Analysis of consensus partition in<br />
clustering ensemble. In Proceedings of the 4th International Conference on Data Mining,<br />
pages 225–232, Brighton, UK.<br />
Torkkola, K. 2003. Discriminative features for text document classification. Pattern Analysis<br />
and Applications, 6(4):301–308.<br />
Tseng, L. and S. Yang. 2001. A genetic approach to the automatic clustering problem.<br />
Pattern Recognition, 34:415–424.<br />
Turnbull, D., L. Barrington, D. Torres, and G. Lanckriet. 2007. Towards Musical Query-by-Semantic-Description<br />
using the CAL500 Dataset. In Proceedings of the 30th ACM<br />
SIGIR Conference, pages 439–446, Amsterdam, The Netherlands, July.<br />
Valencia, A. 2002. Search and retrieve. EMBO Reports, 3(5):396–400.<br />
van Dongen, S. 2000. Performance criteria for graph clustering and Markov cluster experiments.<br />
Technical Report INS-R0012, Centrum voor Wiskunde en Informatica.<br />
van Erp, M., L. Vuurpijl, and L. Schomaker. 2002. An overview and comparison of voting<br />
methods for pattern recognition. In Proceedings of the Eighth International Workshop<br />
on Frontiers in Handwriting Recognition, pages 195–200, Ontario, Canada, August.<br />
van Rijsbergen, C.J. 1979. Information Retrieval. Butterworth-Heinemann.<br />
VideoCLEF. accessed on May 2009. The CLEF cross language video retrieval track.<br />
http://www.cdvp.dcu.ie/VideoCLEF.<br />
von Luxburg, U. 2006. A tutorial on spectral clustering. Technical Report TR-149, Department<br />
for Empirical Inference, Max Planck Institute for Biological Cybernetics.<br />
Wallace, C. and D. Dowe. 1994. Intrinsic classification by MML – the Snob program.<br />
In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pages<br />
37–44, Armidale, Australia.<br />
Wang, W., J. Yang, and R. Muntz. 1997. STING: A statistical information grid approach<br />
to spatial data mining. In M. Jarke, M.J. Carey, K.R. Dittrich, F.H. Lochovsky,<br />
P. Loucopoulos, and M.A. Jeusfeld, editors, Proceedings of the 23rd International Conference<br />
on Very Large Data Bases, pages 186–195, Athens, Greece. Morgan Kaufmann.<br />
Witten, I.H. and E. Frank. 2005. Data mining: practical machine learning tools and<br />
techniques. Morgan Kaufmann Publishers.<br />
Wunsch, D., T. Caudell, C. Capps, R. Marks, and R. Falk. 1993. An optoelectronic<br />
implementation of the adaptive resonance neural network. IEEE Transactions on Neural<br />
Networks, 4(4):673–684.<br />
www.who.int. accessed on February 2009. World Health Organization International Classification<br />
of Diseases (ICD). http://www.who.int/classifications/icd/en/.<br />
Xu, R. and D. Wunsch II. 2005. Survey of Clustering Algorithms. IEEE Transactions on<br />
Neural Networks, 16(2):645–678.<br />
Xu, W., X. Liu, and Y. Gong. 2003. Document Clustering Based on Non-Negative Matrix<br />
Factorization. In Proceedings of the 26th ACM SIGIR Conference, volume 2, pages<br />
267–273, Toronto, Canada.<br />
Yang, J. and S. Olafsson. 2005. Near-optimal feature selection. In Proceedings of the<br />
International Workshop on Feature Selection for Data Mining, pages 27–34, Newport<br />
Beach, CA.<br />
Zahn, C.T. 1971. Graph-theoretical methods for detecting and describing Gestalt clusters.<br />
IEEE Transactions on Computers, 20(1):68–86.<br />
Zeng, Y., J. Tang, J. Garcia-Frias, and G.R. Gao. 2002. An adaptive meta-clustering<br />
approach: combining the information from different clustering results. In Proceedings<br />
of the IEEE Computer Society Conference on Bioinformatics, pages 276–287.<br />
Zhao, R. and W.I. Grosky. 2002. Negotiating the semantic gap: from feature maps to<br />
semantic landscapes. Pattern Recognition, 35:593–600.<br />
Zhao, Y. and G. Karypis. 2001. Criterion functions for document clustering: Experiments<br />
and analysis. Technical Report TR #0140, Department of Computer Science, University<br />
of Minnesota, Minneapolis.<br />
Zhao, Y. and G. Karypis. 2003a. Clustering in life sciences. In A. Khodursky and M. Brownstein, editors, Functional Genomics: Methods and Protocols, pages 183–218. Humana Press.
Zhao, Y. and G. Karypis. 2003b. Hierarchical Clustering Algorithms for Document<br />
Datasets. Technical Report UMN CS #03-027, Department of Computer Science, University<br />
of Minnesota, Minneapolis.<br />
Appendix A<br />
Experimental setup<br />
A.1 The CLUTO clustering package<br />
All the clustering algorithms employed in the experimental sections of this work have been<br />
extracted from the CLUTO clustering toolkit. In its authors’ words, “CLUTO is a software<br />
package for clustering low- and high-dimensional data sets and for analyzing the characteristics<br />
of the obtained clusters. CLUTO is well-suited for clustering data sets arising in many<br />
diverse application areas including information retrieval, customer purchasing transactions,<br />
web, geographic information systems, science, and biology”. It is available online for download<br />
at http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download. We chose CLUTO as<br />
our clustering algorithm provider due to its ease of use, robustness, speed and scalability,<br />
as CLUTO’s algorithms have been optimized for operating on very large data sets both in terms of the number of objects (up to ∼10⁵) and the number of features (up to ∼10⁴).
CLUTO provides clustering algorithms based on the partitional, agglomerative, and<br />
graph-partitioning paradigms. Most of the algorithms implemented in CLUTO treat clustering<br />
as an optimization problem, thus seeking to maximize or minimize a particular clustering<br />
criterion function, which can be defined either globally or locally over the entire<br />
clustering solution space. As in any clustering process, computing the value of these criterion<br />
functions requires measuring the similarity between the objects in the data set. This<br />
means that, to apply a specific CLUTO clustering algorithm, it is necessary to select
the desired:<br />
– clustering strategy (clustering paradigm and specific implementation): CLUTO includes<br />
six implementations of partitional, hierarchical agglomerative and graph-based<br />
clustering strategies.<br />
– criterion function: CLUTO provides a total of eleven criterion functions for driving<br />
its clustering algorithms.<br />
– similarity measure: CLUTO allows the similarity between objects to be measured using four distinct alternatives.
The six implementations of partitional, hierarchical agglomerative and graph-based clustering<br />
strategies available in the CLUTO clustering toolkit are briefly described in the<br />
following paragraphs:<br />
1. direct: this method computes the desired k-way clustering solution by simultaneously<br />
finding all k clusters.<br />
2. rb: repeated-bisecting clustering process, in which the desired k-way clustering solution<br />
is computed by performing a sequence of k−1 repeated bisections of the data set.
At each bisecting step, one of the obtained clusters is selected and bisected further, so<br />
that each partial 2-way clustering solution optimizes the selected clustering criterion<br />
function locally.<br />
3. rbr: a refined repeated-bisecting method that performs a global optimization of the<br />
clustering solution obtained by the rb algorithm.<br />
4. agglo: agglomerative hierarchical clustering that locally optimizes the selected criterion<br />
function, stopping the agglomeration process when k clusters are obtained.<br />
5. bagglo: biased agglomerative clustering, which applies the agglo clustering method<br />
on an augmented representation of the objects, created by concatenating the d original attributes of each object with √n new features that are proportional to the similarity between that object and its cluster centroid, according to a √n-way partitional clustering solution that is initially computed on the data set by means of the rb algorithm.
6. graph: graph-based clustering, in which the data set is modelled as a nearest-neighbor<br />
graph (each object is a vertex connected to the vertices representing its most similar<br />
objects) that is partitioned into k clusters according to one of the graph criterion<br />
functions.<br />
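To make the repeated-bisection idea concrete, here is a minimal Python sketch of an rb-style clustering loop. It is an illustration only: it bisects with plain 2-means and always splits the largest cluster, whereas CLUTO selects the cluster whose bisection optimizes the chosen criterion function, and all function names here are our own.

```python
import numpy as np

def two_means(X, seed=0, iters=50):
    """Bisect the columns of X (d x n) with plain 2-means; returns 0/1 labels."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    C = X[:, rng.choice(n, size=2, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        d0 = ((X - C[:, [0]]) ** 2).sum(axis=0)
        d1 = ((X - C[:, [1]]) ** 2).sum(axis=0)
        labels = (d1 < d0).astype(int)
        for j in (0, 1):
            if (labels == j).any():
                C[:, j] = X[:, labels == j].mean(axis=1)
    if labels.min() == labels.max():      # degenerate split: force two halves
        labels[: n // 2] = 0
        labels[n // 2:] = 1
    return labels

def repeated_bisection(X, k, seed=0):
    """k-way clustering via k-1 bisections, always splitting the largest cluster."""
    labels = np.zeros(X.shape[1], dtype=int)
    for new_label in range(1, k):
        target = np.bincount(labels).argmax()   # cluster chosen for bisection
        idx = np.where(labels == target)[0]
        sub = two_means(X[:, idx], seed=seed + new_label)
        labels[idx[sub == 1]] = new_label
    return labels
```

The sketch assumes every cluster selected for bisection still contains at least two objects, which holds for reasonable k.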
An enumeration of the eleven criterion functions implemented in the CLUTO software<br />
package follows (Zhao and Karypis, 2001):<br />
a. i1 : internal criterion function that maximizes the sum of the average pairwise similarities<br />
between the objects assigned to each cluster, weighted according to its size. Its maximization is equivalent to minimizing the sum of squared distances between the objects in the same cluster, as in traditional k-means (Zhao and Karypis, 2001).
b. i2 : internal criterion function that maximizes the similarity between each object and<br />
the centroid of the cluster it is assigned to.<br />
c. e1 : external criterion function that minimizes the proximity between each cluster’s<br />
centroid and the common centroid of the rest of the data set.<br />
d. h1 : hybrid criterion function that simultaneously maximizes i1 and minimizes e1.
e. g1 : MinMaxCut criterion function applied on the graph obtained by computing pairwise<br />
object similarities, partitioning the objects into groups by minimizing the edge-cut of each partition (for graph-based clustering only).
f. g1p: normalized Cut criterion function applied on the graph obtained by viewing the<br />
objects and their features as a bipartite graph, simultaneously partitioning the objects<br />
and their features (for graph-based clustering only).<br />
218
Appendix A. Experimental setup<br />
g. slink: traditional single-link criterion function (for agglomerative hierarchical clustering<br />
only).<br />
h. wslink: cluster-weighted single-link criterion function (for agglomerative hierarchical<br />
clustering only).<br />
i. clink: traditional complete-link criterion function (for agglomerative hierarchical clustering<br />
only).<br />
j. wclink: cluster-weighted complete-link criterion function (for agglomerative hierarchical<br />
clustering only).<br />
k. upgma: traditional unweighted pair-group method with arithmetic means criterion<br />
function (for agglomerative hierarchical clustering only).<br />
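To illustrate how a criterion function scores a candidate partition, the sketch below implements the verbal definition of i2 given above (the sum of cosine similarities between each object and the centroid of its cluster). It follows the description in this appendix, not CLUTO's exact internal formulation, and the function name is ours.

```python
import numpy as np

def i2_criterion(X, labels):
    """Sum over all objects of cos(x, centroid of its cluster); X is d x n."""
    score = 0.0
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        centroid = Xc.mean(axis=1)
        # cosine similarity of every column of Xc with the cluster centroid
        num = centroid @ Xc
        den = np.linalg.norm(centroid) * np.linalg.norm(Xc, axis=0)
        score += float((num / den).sum())
    return score
```

A partitional algorithm would seek the label assignment maximizing this score; a tight, well-separated partition scores close to n, while a random one scores near zero.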
Finally, as regards the similarity measures that can be employed by the clustering algorithms<br />
implemented in CLUTO, they are described next (Zhao and Karypis, 2001):<br />
i. cos: the similarity between objects is computed using the cosine function.<br />
ii. corr : the similarity between objects is computed using the correlation coefficient.<br />
iii. dist: the similarity between objects is computed to be inversely proportional to the<br />
Euclidean distance (for graph-based clustering only).<br />
iv. jacc: the similarity between objects is computed using the extended Jaccard coefficient<br />
(for graph-based clustering only).<br />
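These four measures can be sketched in a few lines of Python. Two caveats: the exact proportionality used by CLUTO for dist is not specified above, so the 1/(1+distance) form below is an assumption, and the extended Jaccard formula is the common definition from the literature, not taken from CLUTO itself.

```python
import numpy as np

def cos_sim(x, y):
    """Cosine of the angle between feature vectors x and y."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def corr_sim(x, y):
    """Pearson correlation coefficient between x and y."""
    return float(np.corrcoef(x, y)[0, 1])

def dist_sim(x, y):
    """Inversely proportional to the Euclidean distance (exact form assumed)."""
    return float(1.0 / (1.0 + np.linalg.norm(x - y)))

def ext_jaccard(x, y):
    """Extended Jaccard coefficient for real-valued vectors (common definition)."""
    dot = float(x @ y)
    return dot / (float(x @ x) + float(y @ y) - dot)
```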
For further insight on the distinct implementations of the clustering strategies, formal<br />
definitions of the criterion functions and similarity measures, or the criterion functions<br />
optimization procedure, the interested reader is referred to (Zhao and Karypis, 2001; Zhao<br />
and Karypis, 2003b).<br />
As the reader may infer from the previous enumerations, not all the clustering strategy-criterion function-similarity measure combinations are possible in CLUTO. Table A.1 presents which triplets are allowed (denoted by the symbol ✓), which are not allowed (denoted by ×), and which have been employed in our experiments (denoted by •)—28 out of the 68 combinations allowed by CLUTO.
In the experiments, each specific algorithm is identified by the clustering strategy-similarity measure-criterion function triplet employed, e.g. agglo-cos-slink (agglomerative clustering using the single-link criterion and measuring object proximity with the cosine similarity), graph-jacc-i2 (graph-based clustering using the internal criterion function #2 and the extended Jaccard similarity), etc.
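The admissible combinations and the triplet naming convention can be expressed compactly in code; the admissibility sets below are transcribed from our reading of Table A.1, and the enumeration reproduces the 68 allowed combinations quoted above.

```python
# Criterion functions admissible for each clustering strategy (from Table A.1)
PARTITIONAL = ["i1", "i2", "e1", "h1"]
HIERARCHICAL = ["slink", "wslink", "clink", "wclink", "upgma"]
GRAPH = ["i1", "i2", "e1", "h1", "g1", "g1p"]

ALLOWED = {
    "rb":     (["cos", "corr"], PARTITIONAL),
    "rbr":    (["cos", "corr"], PARTITIONAL),
    "direct": (["cos", "corr"], PARTITIONAL),
    "agglo":  (["cos", "corr"], HIERARCHICAL),
    "bagglo": (["cos", "corr"], HIERARCHICAL),
    "graph":  (["cos", "corr", "dist", "jacc"], GRAPH),
}

def triplets():
    """Enumerate every admissible strategy-similarity-criterion triplet name."""
    return [f"{strategy}-{sim}-{cf}"
            for strategy, (sims, cfs) in ALLOWED.items()
            for sim in sims
            for cf in cfs]
```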
A.2 Data sets<br />
In this work, we have applied clustering processes to a total of sixteen data sets, twelve unimodal and four multimodal. In this section, we present their main features, such as their origin,
the number of objects they contain (denoted throughout this work by n), the number (d)<br />
and meaning of their attributes, and the expected number of categories (k).<br />
Strategy   Similarity   i1   i2   e1   h1   g1   g1p   slink   wslink   clink   wclink   upgma
rb         cos          ✓    •    •    ✓    ×    ×     ×       ×        ×       ×        ×
           corr         ✓    •    •    ✓    ×    ×     ×       ×        ×       ×        ×
           dist         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
           jacc         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
rbr        cos          ✓    •    •    ✓    ×    ×     ×       ×        ×       ×        ×
           corr         ✓    •    •    ✓    ×    ×     ×       ×        ×       ×        ×
           dist         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
           jacc         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
direct     cos          ✓    •    •    ✓    ×    ×     ×       ×        ×       ×        ×
           corr         •    ✓    •    ✓    ×    ×     ×       ×        ×       ×        ×
           dist         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
           jacc         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
agglo      cos          ×    ×    ×    ×    ×    ×     •       ✓        •       ✓        •
           corr         ×    ×    ×    ×    ×    ×     •       ✓        •       ✓        •
           dist         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
           jacc         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
bagglo     cos          ×    ×    ×    ×    ×    ×     •       ✓        •       ✓        •
           corr         ×    ×    ×    ×    ×    ×     •       ✓        •       ✓        •
           dist         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
           jacc         ×    ×    ×    ×    ×    ×     ×       ×        ×       ×        ×
graph      cos          ✓    •    •    ✓    ✓    ✓     ×       ×        ×       ×        ×
           corr         ✓    ✓    ✓    ✓    ✓    ✓     ×       ×        ×       ×        ×
           dist         ✓    ✓    ✓    ✓    ✓    ✓     ×       ×        ×       ×        ×
           jacc         ✓    •    •    ✓    ✓    ✓     ×       ×        ×       ×        ×

Table A.1: Cross-option table indicating which clustering strategy-criterion function-similarity measure combinations are allowed (✓), not allowed (×) and employed in our experiments (•).
A.2.1 Unimodal data sets<br />
A total of twelve unimodal data sets have been used in the experimental sections of this<br />
thesis. Unless noted otherwise, these data sets have been obtained from two classic public data repositories of the data mining and machine learning research communities: the UCI Knowledge Discovery in Databases Archive (Hettich and Bay, 1999) and the UCI Machine Learning Repository (Asuncion and Newman, 1999). In the following paragraphs, we
present a brief description of each data set, summarizing their most relevant characteristics<br />
in table A.2 as a quick reference source.<br />
1. Zoo: the goal of this data set is to learn to classify animals into seven classes given 17<br />
binary attributes representing features such as the presence of hair, feathers, backbone<br />
or teeth, or whether it is an aquatic or airborne animal, among others. The number<br />
of objects (i.e. animals) in the data set is 101.<br />
2. Iris: a classic data set in machine learning and pattern recognition. It contains 150<br />
objects (instances of Iris plants) represented by four real-valued features measuring<br />
the width and length of its petals and sepals. The goal is to classify the objects into<br />
one of the three classes of Iris plants, one of which is linearly separable from the<br />
others, while the latter two are not linearly separable from each other.<br />
3. Wine: this data set’s goal is to determine the origin of wines by means of chemical<br />
analysis. It contains 178 samples of wine which must be categorized into three wine<br />
classes based on their contents of 13 constituents such as alcohol, malic acid, or<br />
magnesium, represented as real-valued features.<br />
4. Glass: in this data set, 214 instances of glass are represented by 10 real-valued attributes<br />
corresponding to their contents in chemical elements such as aluminium,<br />
sodium, calcium, etc. The goal is to classify the objects into one of the predefined six<br />
categories (types of glass).<br />
5. Ionosphere: the contents of this data set are 351 radar returns from the ionosphere,<br />
classified either as good or bad depending on whether they show evidence of some<br />
type of structure in the ionosphere or not. Each radar return is described by 34<br />
autocorrelation-based real-valued features.<br />
6. WDBC : its complete name is Wisconsin Diagnostic Breast Cancer data set. It contains<br />
569 objects (breast mass images) represented by 32 real-valued features describing<br />
characteristics of the cell nuclei present in the image (radius, texture, perimeter,
etc.). The goal is to classify these objects into one of the possible cancer diagnostics<br />
(malignant or benign).<br />
7. Balance: this data set was generated to model psychological experimental results.<br />
Each of the 625 objects is classified into three classes (as having the balance scale tip<br />
to the right, tip to the left, or balanced). The integer-valued attributes are the left<br />
weight, the left distance, the right weight, and the right distance.<br />
8. Mfeat: its original name is Multiple Features data set, as it represents the objects<br />
it contains (handwritten numerals from 0 to 9) using different real-valued features<br />
such as Fourier coefficients, profile correlations, Karhunen-Loève coefficients, pixel<br />
averages, Zernike moments and morphological attributes.<br />
Data set name      Objects (n)      Attributes (d)      Classes (k)      Class imbalance
Zoo 101 17 7 40.6%–3.9%<br />
Iris 150 4 3 33.3%–33.3%<br />
Wine 178 13 3 39.9%–26.9%<br />
Glass 214 10 6 35.5%–4.2%<br />
Ionosphere 351 34 2 64.1%–35.9%<br />
WDBC 569 32 2 62.7%–37.3%<br />
Balance 625 4 3 46.1%–7.8%<br />
Mfeat 2000 649 10 10%–10%<br />
miniNG 2000 6679 20 5%–5%<br />
Segmentation 2100 19 7 14.3%–14.3%<br />
BBC 2225 6767 5 22.9%–17.3%<br />
PenDigits 7494 16 10 10.4%–9.6%<br />
Table A.2: Summary of the unimodal data sets employed in the experimental sections of<br />
this thesis. The “Class imbalance” column presents the percentage of objects in the data<br />
set belonging to the most and least populated categories, respectively.<br />
9. miniNG: this is a reduced version of the 20 Newsgroups text data set, as it contains<br />
only 2000 objects (text articles posted in Usenet) belonging to one of the 20 predefined<br />
thematic classes (e.g. sci.electronics, rec.sport.baseball or talk.politics.mideast).<br />
Typical text preprocessing tasks, such as the removal of stop words and of terms appearing in fewer than 4 documents (document frequency thresholding), give rise to a bag-of-words representation of each article on a 6679-dimensional tfidf-weighted (i.e.
real-valued) term space (Sebastiani, 2002).<br />
10. Segmentation: known as the Image Segmentation data set, it contains 2100 outdoor image regions represented by 19-dimensional real-valued feature vectors that should
be classified into one of seven texture classes: brickface, sky, foliage, cement, window,<br />
path and grass. We have employed the test subset of the Segmentation collection.<br />
11. BBC : this data set has been obtained from the online repository of the Machine Learning<br />
Group of the University College Dublin (http://mlg.ucd.ie/content/view/21/). It<br />
consists of 2225 documents from the BBC news website corresponding to stories in<br />
five topical areas (business, entertainment, politics, sport, tech). The original documents’<br />
representation used a 9636-dimensional term space which was reduced to 6767<br />
real-valued attributes after removing those terms with a document frequency smaller<br />
or equal to 4 (Sebastiani, 2002).<br />
12. PenDigits: its original name is Pen-Based Recognition of Handwritten Digits data<br />
set, whose training subset contains 7494 digitized handwritten digits (from 0 to 9)<br />
captured using a pressure sensitive tablet. Each object is represented by 16 integer<br />
attributes corresponding to the (x, y) coordinates of the electronic pen on the tablet<br />
sampled every 100 milliseconds.
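The bag-of-words construction used for the miniNG and BBC data sets (document frequency thresholding followed by tfidf weighting) can be sketched as follows. The particular tfidf variant (raw term frequency times log(N/df)) is one common choice and is assumed here, since the text does not fix the exact formula; the function name is ours.

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_df=2):
    """docs: list of token lists. Returns (vocabulary, list of {term: weight})."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    # document frequency thresholding: keep terms appearing in >= min_df documents
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t])
                        for t in vocab if tf[t] > 0})
    return vocab, vectors
```

For the miniNG data set the threshold would be min_df=4, matching the removal of terms appearing in fewer than 4 documents.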
A.2.2 Multimodal data sets<br />
A total of four multimodal data sets have been used in the experimental sections of this<br />
thesis. Three of these data sets are multimodal in nature, whereas the remaining one has<br />
been generated artificially by the combination of two unimodal data collections. In the<br />
following paragraphs, we present a brief description of each data set, summarizing their<br />
most relevant features in table A.3 as a quick reference source.<br />
1. CAL500: the Computer Audition Lab 500-song data set is a collection of five hundred Western popular songs represented by means of two modalities: acoustic features and textual annotations (Turnbull et al., 2007). As regards the acoustic modality, we have employed the means and standard deviations of the original real-valued delta MFCC (Mel-Frequency Cepstral Coefficients) features as the acoustic attributes of each song (Li and GuoDong, 2000). As described in (Turnbull et al., 2007), the text modality was generated by means of an auditory experience survey, in which 55 listeners were asked to annotate each song with several terms extracted from a musically relevant 174-word vocabulary. As a result, each song was annotated with those terms assigned by at least three listeners. These annotating terms describe song-related semantic concepts such as instruments (e.g. bass, piano, trumpet), vocals (e.g. falsetto, breathy, aggressive), usage (e.g. at a party, waking up, driving) or emotion (e.g. cheerful, calming, tender). In order to evaluate the clustering processes conducted on this data set, we have used the annotating term “Best Genre” as the class of each object (it reflects the musical genre which best fits each song¹). There exist sixteen “Best Genre” labels, such as Alternative, Classic Rock, Punk, Country, and so on. Finally, we selected the subset of 297 songs which are assigned a single “Best Genre” label.
2. IsoLetters: this multimodal data collection is the result of the artificial fusion of two<br />
unimodal data sets of the UCI Machine Learning Repository (Asuncion and Newman,<br />
1999): Isolet and LetterRecognition. Both data sets contain the same type of object:<br />
the 26 letters of the English alphabet. Whereas the Isolet collection is oriented to<br />
the spoken letter recognition problem (each object is the name of a letter uttered<br />
by a speaker, and represented by a total of 617 real-valued acoustic attributes such<br />
as spectral coefficients, contour features, or sonorant features), the LetterRecognition<br />
data set contains 16 visual features (statistical moments and edge counts) extracted<br />
from black and white images of the twenty-six capital letters of the English alphabet.<br />
Thus, IsoLetters, the multimodal data collection we have created as a combination<br />
of both unimodal data sets, pursues the goal of recognizing letters upon acoustic and<br />
visual features. The total number of objects in the IsoLetters data set (1559) is fixed<br />
by the size of the test subset of the Isolet collection.<br />
3. InternetAds: this data set (which is also found in the UCI Machine Learning Repository)<br />
represents a set of images which are possible advertisements on Internet pages.<br />
The goal is to classify them as an advertisement or not. Each object is described by<br />
means of 1558 real-valued attributes, including image geometry features, its textual<br />
caption, the alt text, phrases occurring in the URL, the image’s URL, the anchor text, and words occurring near the anchor text. We consider this data set as multimodal in the sense that some attributes are directly related to the object (the image geometry features, the caption and the alt text, which total 133 features), while the remaining 1425 features refer to collateral elements such as the anchor text or the URL. We have removed those objects in the data set with missing features (28% of the total), obtaining a reduced version of the InternetAds data set containing 2359 objects.

¹ So as to avoid biasing the results, all the genre-related annotating terms were eliminated, reducing the size of the vocabulary to 127 terms.

Data set name                      Objects (n)   Attributes per mode (d1/d2)   Classes (k)   Class imbalance
CAL500 (audio/text)                    297               78/127                    16          14.5%–1.3%
IsoLetters (speech/image)             1559               617/16                    26           3.8%–3.8%
InternetAds (object/collateral)       2359              133/1425                    2          83.8%–16.2%
Corel (image/text)                    3960              500/371                    44           2.2%–2.3%

Table A.3: Summary of the multimodal data sets employed in the experimental sections of this thesis. The “Class imbalance” column presents the percentage of objects in the data set belonging to the most and least populated categories, respectively.
4. Corel: this is a rather classic multimodal data set (Duygulu et al., 2002), consisting of<br />
5000 text-annotated images from 50 Corel Stock Photo CDs. Each CD contains 100<br />
images on the same topic, such as “Sunrises and Sunsets”, “Mountains of America”<br />
or “Wild Animals” (Bekkerman and Jeon, 2007). Experiments have been conducted<br />
on the subset of 3960 images of the training subset of the Corel collection which are<br />
assigned at least one topic. The visual modality is codified as follows (Duygulu et<br />
al., 2002): images, represented using 33 color features, are segmented into regions,<br />
and these regions are clustered into 500 smaller connected regions (aka blobs), which<br />
are deemed visual terms, so that each image can be expressed in terms of these. As<br />
regards the textual modality, every image has a caption (i.e. a brief description of the<br />
scene) and an annotation (a list of objects appearing in the image). The vocabulary<br />
contains 371 words, and the term vectors are parameterized using the tfidf weighting<br />
scheme (van Rijsbergen, 1979).<br />
A.3 Data representations<br />
A.3.1 Unimodal data representations<br />
As mentioned in section 1.1, in this work objects are expressed in terms of d numerical attributes, so each object is represented as a column vector x ∈ R^d. Therefore, a whole data set containing n objects is mathematically expressed by means of a d × n matrix X. This original object representation is referred to as baseline throughout the thesis.
Starting from the baseline representation, four other object representations have been generated by applying the following well-known feature extraction techniques²: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NMF) and Random Projection (RP). Besides providing diversity as regards data representation, these techniques are also employed for dimensionality reduction purposes. In this work, we refer to the reduced dimensionality of the resulting feature space as r, which takes a whole range of values in the interval [3, d]. As a result of each feature extraction procedure, a batch of r × n matrices (X^r_PCA, X^r_ICA, X^r_NMF or X^r_RP) is obtained.
The following paragraphs are devoted to a brief description of the main concepts regarding<br />
the aforementioned feature extraction techniques.<br />
– Principal Component Analysis, which is one of the most typical feature extraction<br />
techniques, is based on projecting the data onto a dimensionally reduced feature space<br />
such that i) the newly obtained features are decorrelated, and ii) the variance of the original data is maximally retained. For these reasons, PCA is said to be capable of
removing data redundancies while keeping the most relevant information contained<br />
in the data. There exist several ways for conducting PCA, from the eigenanalysis of<br />
the covariance matrix of X (Jolliffe, 1986) to neural network approaches (Oja, 1992).<br />
In this work, PCA is implemented by means of Singular Value Decomposition (SVD),<br />
following a similar approach to that of <strong>La</strong>tent Semantic Analysis (Deerwester et al.,<br />
1990). More specifically, given the d×n data matrix X, its SVD is expressed according<br />
to equation (A.1).<br />
X = U · Σ · V^T    (A.1)
Matrix Σ contains the singular values of X ordered in decreasing order and the<br />
columns of matrices U and V are the left and right singular vectors of X, respectively.<br />
Dimensionality reduction is achieved by retaining the r largest singular values in Σ and the corresponding columns of matrix V, so that the r × n matrix X^r_PCA = Σ_r · V_r^T –where Σ_r and V_r are the reduced versions of the singular values and right singular vectors matrices, respectively– will contain the location of the n objects in the r-dimensional PCA space, where clustering is conducted.
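The PCA-by-SVD construction of equation (A.1) reduces to a few lines of linear algebra. A minimal numpy sketch (the function name is ours; mean-centering is omitted, mirroring the LSA-style usage described above):

```python
import numpy as np

def pca_svd(X, r):
    """X: d x n data matrix. Returns the r x n matrix Sigma_r @ V_r^T of eq. (A.1)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
    return np.diag(s[:r]) @ Vt[:r, :]                  # object coordinates in PCA space
```

Note that Σ_r · V_r^T equals U_r^T · X, i.e. the coordinates of the objects after projection onto the r leading left singular vectors.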
– Independent Component Analysis (ICA) can be regarded as an extension of PCA<br />
for non-Gaussian data (Xu and Wunsch II, 2005), in which the projected data components<br />
are forced to be statistically independent—a stronger condition than PCA’s<br />
decorrelation. Being tightly bound to the blind source separation problem (Hyvärinen,<br />
Karhunen, and Oja, 2001), the application of ICA for feature extraction usually assumes<br />
the existence of a generative model that, in its simplest version, defines the<br />
observed data as the result of an unknown linear noiseless combination of r statistically<br />
independent latent variables (the so-called independent components). The<br />
goal of ICA algorithms is to recover the independent components making no further assumption than their statistical independence.

² We choose applying feature extraction over feature selection given its greater ease of application (Jain, Murty, and Flynn, 1999), as our main goal is creating representational diversity, not elaborating on object representations. In informal experiments not reported here, other object representations based on feature selection plus change of basis (Srinivasan, 2002) were also tested but finally discarded as, in general terms, they gave rise to lower quality clustering results.

The application of ICA in feature extraction
is usually preceded by PCA with dimensionality reduction, as this procedure<br />
is equivalent to the usual whitening step that simplifies ICA algorithms (Hyvärinen,<br />
Karhunen, and Oja, 2001). Applying ICA on the PCA data yields an estimation of<br />
the r independent latent variables which generated the observed data:<br />
X^r_ICA = W · X^r_PCA    (A.2)
where matrix W is known as the separating matrix. Equation (A.2) can be interpreted<br />
as a linear transformation of the data through its projection on the basis vectors<br />
contained in the rows of the separating matrix W. In this work, a version of the<br />
FastICA algorithm (Hyvärinen, 1999) that maximizes the skewness of the data is<br />
employed for obtaining the ICA representation of the data (Kaban and Girolami,<br />
2000).<br />
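The fixed-point-plus-decorrelation structure behind equation (A.2) can be sketched compactly. Two hedges: this sketch uses the cubic (kurtosis-based) FastICA nonlinearity rather than the skewness-maximizing variant employed in the thesis, and the whitening helper merely stands in for the PCA preprocessing step; both function names are ours.

```python
import numpy as np

def whiten(X):
    """Zero-mean, identity-covariance version of X (d x n)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ Xc

def fastica(Z, n_iter=200, seed=0):
    """Symmetric FastICA on whitened Z (r x n) with g(u) = u^3.
    Returns the separating matrix W of equation (A.2)."""
    rng = np.random.default_rng(seed)
    r, n = Z.shape
    W, _ = np.linalg.qr(rng.standard_normal((r, r)))   # random orthogonal start
    for _ in range(n_iter):
        Y = W @ Z
        # fixed-point update: E[g(Wz) z^T] - E[g'(Wz)] W, with E[g'] = 3 here
        W = (Y ** 3) @ Z.T / n - 3.0 * W
        U, _, Vt = np.linalg.svd(W)                    # symmetric decorrelation
        W = U @ Vt
    return W
```

Applying the returned W to the whitened data yields estimates of the independent components, up to the usual permutation and sign indeterminacies.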
– Non-Negative Matrix Factorization (NMF) (Lee and Seung, 1999) is a feature extraction<br />
technique based on linear representations of non-negative data—i.e. NMF can<br />
only be applied when the original representation of the data is non-negative. Intuitively,<br />
NMF can be interpreted as a linear generative model somewhat similar to that<br />
of ICA but subject to non-negativity constraints, as it factorizes the non-negative data<br />
matrix X into the approximate product of two non-negative matrices W and H, as<br />
defined in equation (A.3). Thus, it can be argued that the data set is generated by<br />
the sum of a set of the latent non-negative variables contained in matrix H, while the<br />
elements of W are the weights of their linear combination.<br />
X ≈ W · H, where X^r_NMF = H    (A.3)
From a practical viewpoint, i) NMF is usually implemented by means of iterative<br />
algorithms which try to minimize a cost function proportional to the reconstruction error ||X − W·H|| (Lee and Seung, 2001), and ii) dimensionality reduction is achieved
by setting the respective sizes of the factorization matrices W and H to d × r and<br />
r × n at the time of computing the approximate factorization of equation (A.3).<br />
In this work, the NMF-based object representation X_NMF is obtained by applying a
mean square reconstruction error minimization algorithm from NMFPACK, a software<br />
package for NMF in Matlab (Hoyer, 2004). Besides its use as a feature extraction<br />
technique, the vision of NMF as a means for obtaining a parts-based description of<br />
the data has motivated alternative NMF-based clustering strategies (Xu, Liu, and<br />
Gong, 2003; Shahnaz et al., 2004), alongside studies on the theoretical connections<br />
between NMF and classic clustering approaches (Ding, He, and Simon, 2005).<br />
Compared to PCA and ICA, NMF is advantageous as the non-negativity of its basis<br />
vectors favours their interpretability. On the flip side, the derivation of the NMF<br />
representation is usually more computationally demanding.<br />
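The multiplicative-update rules of Lee and Seung (2001) give a compact way to compute the factorization of equation (A.3). The sketch below minimizes the squared reconstruction error, as the NMFPACK algorithm used here does, but with simplified random initialization and a fixed iteration count; the function name is ours.

```python
import numpy as np

def nmf(X, r, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative factorization X ≈ W @ H
    (X: d x n non-negative, W: d x r, H: r x n) by multiplicative updates."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.uniform(0.1, 1.0, (d, r))
    H = rng.uniform(0.1, 1.0, (r, n))
    for _ in range(n_iter):
        # multiplicative updates keep every entry of W and H non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In the notation of equation (A.3), the returned H is the r × n reduced representation of the objects.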
– Random Projection (RP) is a computationally efficient dimensionality reduction technique,<br />
proposed as an alternative to those feature extraction techniques that become<br />
too costly when the dimensionality of the original representation (d) is very high (Kaski, 1998). The rationale behind RP is pretty straightforward: a dimensionality
reduction method is effective as long as the distance between the objects in the<br />
original feature space is approximately preserved in the reduced r-dimensional space.<br />
Allowing for this fact, Kaski proved that this could be achieved using a random linear<br />
mapping embodied in an r × d random projection matrix R, where the columns of R
are realizations of independent and identically distributed (i.i.d.) zero-mean normal<br />
variables, scaled to have unit length (Fodor, 2002).<br />
X_r^RP = RX    (A.4)
Several experimental studies bear witness to the fact that i) RP takes a fraction of
the time required by other feature extraction techniques such as PCA or ICA,
among others, and ii) clustering results on RP data representations are sometimes<br />
comparable to or even better than those obtained using, for instance, PCA —which
somehow reinforces the notion of the data representation indeterminacy outlined in<br />
section 1.4 (Kaski, 1998; Bingham and Mannila, 2001; Lin and Gunopulos, 2003; Tang<br />
et al., 2005).<br />
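As a sketch (illustrative NumPy, following the construction above: an r × d matrix R with i.i.d. zero-mean normal columns scaled to unit length), the whole RP technique of equation (A.4) reduces to a single matrix product, which explains its speed:

```python
import numpy as np

def random_projection(X, r, seed=0):
    """Project d-dimensional data down to r dimensions with a random
    linear map, as in equation (A.4).

    X : (d, n) data matrix (objects as columns).
    R : (r, d) matrix whose columns are i.i.d. zero-mean normal
        realizations scaled to unit length.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    R = rng.standard_normal((r, d))
    R /= np.linalg.norm(R, axis=0)   # scale columns to unit length
    return R @ X                     # (r, n) reduced representation

X = np.random.default_rng(2).standard_normal((1000, 50))  # d = 1000, n = 50
Xr = random_projection(X, r=20)
```

No eigendecomposition or iterative optimization is involved, so the cost is just that of one matrix multiplication, in contrast with PCA, ICA or NMF.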
A.3.2 Multimodal data representations<br />
As regards the representation of the objects of the multimodal data sets described in section<br />
A.2.2, two distinct approaches have been followed. Firstly, unimodal representations have<br />
been created for each mode separately, applying the same strategies as in the unimodal<br />
data sets—thus, we will not expand on this point. And secondly, we have generated truly<br />
multimodal data representations by combining both modalities. We elaborate on this latter<br />
issue in the following paragraphs.<br />
The simple concatenation of the baseline feature vectors of both modalities (previously<br />
normalized to unit length³) gives rise to the multimodal baseline representation, defined
on a (d1 + d2)-dimensional attribute space —where d1 and d2 are the dimensionalities of
the baseline representation of each modality.<br />
Subsequently, the feature extraction techniques described in section A.3.1 (i.e. PCA,<br />
ICA, RP and NMF—this latter only when data is non-negative) are applied on the multimodal<br />
baseline representation, yielding representations of dimensionalities r ∈ [3,d1 + d2].<br />
This procedure, known as early fusion or feature-level fusion in the literature, is a common<br />
strategy for creating representations of multimodal data from unimodal representations,<br />
and it has been applied in content-based image retrieval (La Cascia, Sethi, and Sclaroff,
1998; Zhao and Grosky, 2002), semantic video analysis (Snoek, Worring, and Smeulders,<br />
2005), human affect recognition (Gunes and Piccardi, 2005), audiovisual video sequence<br />
analysis (Sevillano et al., 2009), besides multimodal clustering (Benitez and Chang, 2002).<br />
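A minimal sketch of this early fusion step (illustrative Python; the two modality vectors below are hypothetical placeholders for the d1- and d2-dimensional baseline features):

```python
import numpy as np

def early_fusion(x1, x2, eps=1e-12):
    """Feature-level fusion: normalize each modality's feature vector to
    unit length, then concatenate into a (d1 + d2)-dimensional vector,
    so that neither modality dominates the fused representation."""
    v1 = x1 / (np.linalg.norm(x1) + eps)
    v2 = x2 / (np.linalg.norm(x2) + eps)
    return np.concatenate([v1, v2])

modality_a = np.array([3.0, 4.0])        # d1 = 2, norm 5
modality_b = np.array([1.0, 2.0, 2.0])   # d2 = 3, norm 3
fused = early_fusion(modality_a, modality_b)   # 5-dimensional vector
```

After this step, the fused vectors can be fed to any of the feature extraction techniques of section A.3.1, exactly as a unimodal representation would be.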
A.4 Cluster ensembles<br />
In this section, we briefly describe the cluster ensembles employed in the experimental<br />
sections of this thesis. As described in section 2.1, in this work we combine both the homogeneous<br />
and heterogeneous approaches for creating cluster ensembles. This means that we<br />
3 An uneven weighting of the concatenated vectors would give more importance to one of the modalities.<br />
As it is not clear how to appropriately weight each modality a priori, we forced each subvector to have
unit norm so as to avoid any bias.
Data set name   |dfA| = 1   |dfA| = 10   |dfA| = 19   |dfA| = 28
Zoo 57 570 1083 1596<br />
Iris 9 90 171 252<br />
Wine 45 450 855 1260<br />
Glass 29 290 551 812<br />
Ionosphere 97 970 1843 2716<br />
WDBC 113 1130 2147 3164<br />
Balance 7 70 133 196<br />
Mfeat 6 60 114 168<br />
miniNG 73 730 1387 2044<br />
Segmentation 52 520 988 1456<br />
BBC 57 570 1083 1596<br />
PenDigits 57 570 1083 1596<br />
Table A.4: Cluster ensemble sizes l corresponding to distinct algorithmic diversity configurations<br />
for the unimodal data sets.<br />
employ several mutually crossed diversity factors (the twenty-eight clustering algorithms of<br />
the CLUTO clustering package presented in section A.1 are run on the data representations<br />
with varying dimensionalities described in section A.3) so as to generate the individual<br />
components of our cluster ensembles.<br />
However, several cluster ensemble instances have been generated by limiting the cardinality<br />
of the algorithmic diversity factor |dfA| (i.e. the number of clustering algorithms<br />
considered in creating the cluster ensemble components) to a discrete set of values: |dfA| ∈
{1, 10, 19, 28}. This strategy is adopted with the objective of experimentally evaluating our<br />
proposals both in terms of i) their sensitivity to the cluster ensemble diversity (as the larger<br />
|dfA|, the more diverse the cluster ensemble), and ii) their computational scalability as regards<br />
the cluster ensemble size l (since this factor is proportional to |dfA|). Notice that the<br />
cluster ensembles with |dfA| ∈ {1, 10, 19} are randomly sampled subsets of the maximally
diverse cluster ensemble (the one corresponding to |dfA| = 28).
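As a consistency check, the Zoo row of table A.4 can be reproduced from the representation count of that data set (a sketch assuming, as inferred from tables A.4 and B.1, 1 baseline representation plus 4 feature extraction techniques with 14 dimensionalities each, i.e. 57 representations):

```python
# Cluster ensemble size: l = (number of data representations) x |dfA|.
# For Zoo, the representation count assumed here is
# 1 baseline + 4 feature extraction techniques x 14 dimensionalities.
n_representations = 1 + 4 * 14   # 57 representations
sizes = {dfA: n_representations * dfA for dfA in (1, 10, 19, 28)}
# sizes -> {1: 57, 10: 570, 19: 1083, 28: 1596}, matching the Zoo row
```

The same product structure explains why l grows linearly with |dfA| for every data set in tables A.4 and A.5.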
Tables A.4 and A.5 present the sizes of the cluster ensembles corresponding to the<br />
distinct diversity scenarios (i.e. cardinalities of the algorithmic diversity factor dfA) on the
unimodal and multimodal data collections employed in this work.<br />
Firstly, table A.4 presents the cluster ensemble sizes corresponding to the unimodal<br />
data sets. As expected, the cluster ensemble size l grows linearly with the value of |dfA|.<br />
Depending on the cardinalities of the representational and dimensional diversity factors of<br />
each data collection, fairly distinct cluster ensemble sizes are obtained (from the modest
values of the Iris data set to the highly populated cluster ensembles of the WDBC collection).<br />
Last, table A.5 presents the cluster ensembles corresponding to the four multimodal data
collections employed in this work for each diversity scenario. It is important to highlight<br />
the fact that the values of l presented in this table encompass the two unimodal and the<br />
multimodal data representations of the objects contained in these data sets.<br />
The reader is referred to appendix B for an analysis of the quality and diversity of the<br />
components of these cluster ensembles.<br />
Data set name   |dfA| = 1   |dfA| = 10   |dfA| = 19   |dfA| = 28
CAL500 102 1020 1938 2856<br />
IsoLetters 111 1110 2109 3108<br />
InternetAds 183 1830 3477 5124<br />
Corel 123 1230 2337 3444<br />
Table A.5: Cluster ensemble sizes l corresponding to distinct algorithmic diversity configurations<br />
for the multimodal data sets.<br />
A.5 Consensus functions<br />
In this section, we briefly describe the consensus functions employed in the experimental<br />
section of this thesis, placing special emphasis on specific implementation details when<br />
necessary. Moreover, we present the time complexity of each consensus function for a given<br />
cluster ensemble size (l), number of objects in the data set (n) and clusters (k).<br />
The first seven consensus functions are employed in experiments considering both hard
and soft cluster ensembles (i.e. chapters 3 to 6). In turn, the last one (VMA) is
only applied to soft cluster ensembles, that is, in chapter 6.
The Matlab source code of the first three consensus functions is available for download<br />
at http://www.strehl.com, whereas the remaining ones were implemented ad hoc for this<br />
work. For a more theoretical description of these consensus functions, see section 2.2.<br />
– CSPA (Cluster-based Similarity Partitioning Algorithm): this consensus function<br />
shares a lot of the rationale of the Evidence Accumulation consensus function (see<br />
below), as it is based on deriving a pairwise object similarity measure from the cluster<br />
ensemble and applying a similarity-based clustering algorithm on it—the METIS<br />
graph partitioning algorithm (Karypis and Kumar, 1998) in this case. Its computational<br />
complexity is O(n²kl) (Strehl and Ghosh, 2002).
– HGPA (HyperGraph Partitioning Algorithm): this clustering combiner exploits a hypergraph<br />
representation of the cluster ensemble, re-partitioning the data by finding<br />
a hyperedge separator that cuts a minimal number of hyperedges, yielding k unconnected<br />
components of approximately the same size—which makes HGPA an inappropriate
consensus function when clusters are highly imbalanced. The hypergraph
partition is conducted by means of the HMETIS package (Karypis et al., 1997). Its<br />
time complexity is O(nkl) (Strehl and Ghosh, 2002).
– MCLA (Meta-CLustering Algorithm): as in HGPA, each cluster corresponds to a hyperedge<br />
of the hypergraph representing the cluster ensemble. Subsequently, related<br />
hyperedges are detected by grouping them using the METIS graph-based clustering algorithm<br />
(Karypis and Kumar, 1998). Next, related hyperedges are collapsed and each<br />
object is assigned to the collapsed hyperedge in which it participates most strongly.<br />
Its computational complexity is O(nk²l²) (Strehl and Ghosh, 2002).
– EAC (Evidence Accumulation): this is a pretty direct implementation of the consensus<br />
function presented in (Fred and Jain, 2002a). It consists in the computation of the<br />
pairwise object co-association matrix and the subsequent application of the single-link<br />
hierarchical clustering algorithm on it. The main difference between our implementation<br />
and the original one lies in the fact that we cut the resulting dendrogram at the<br />
desired number of clusters k, whereas Fred proposes performing the cut at the highest
lifetime level, so that the consensus function itself finds the natural number of clusters
in the data set. Its computational complexity is O(n²l) (Fred and Jain, 2005).
– ALSAD (Average-Link on Similarity As Data): this is one of the three consensus<br />
functions presented in (Kuncheva, Hadjitodorov, and Todorova, 2006) based on considering<br />
object similarity measures as object features. Although the authors do not give
a specific name to this family of consensus functions, we have named them xxSAD<br />
so as to indicate that similarities are treated as data, replacing xx by the acronym
of the particular clustering algorithm used for obtaining the consensus clustering solution.<br />
In this case, the pairwise object co-association matrix is partitioned using the<br />
average-link (AL) hierarchical clustering algorithm, cutting the resulting dendrogram<br />
at the desired number of clusters. Its computational complexity is O(n²l) for creating
the object similarity matrix plus O(n²) for partitioning it with the hierarchical AL
clustering algorithm (Xu and Wunsch II, 2005).
– KMSAD (K-Means on Similarity As Data): this consensus function belongs to the<br />
same family as the previous one. This time, the object co-association matrix is clustered<br />
using the classic k-means (KM) partitional algorithm. Its computational complexity<br />
is O(n²l) for creating the object similarity matrix plus O(tkn) for partitioning
it with the k-means clustering algorithm (Xu and Wunsch II, 2005) —where t is the
number of iterations of k-means.
– SLSAD (Single-Link on Similarity As Data): following the same approach as the AL-<br />
SAD and KMSAD consensus functions, the pairwise object co-association matrix is<br />
partitioned by means of the single-link (SL) hierarchical clustering algorithm in this<br />
case. As in the ALSAD consensus function, the consensus clustering solution is obtained<br />
by cutting the dendrogram at the desired number of clusters. Its computational<br />
cost is the same as that of ALSAD.
– VMA (Voting Merging Algorithm): this consensus function is based on sequentially<br />
solving the cluster correspondence problem on pairs of cluster ensemble components,<br />
and, at each iteration, applying a weighted version of the sum rule confidence voting<br />
method. This algorithm scales linearly in the number of objects in the data set and<br />
the number of cluster ensemble components, i.e. its complexity is O(nl) (Dimitriadou,
Weingessel, and Hornik, 2002).<br />
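To make the co-association rationale shared by CSPA, EAC and the xxSAD family more tangible, the following sketch (illustrative Python with SciPy, not the thesis implementation) accumulates pairwise evidence from a toy hard cluster ensemble and cuts a single-link dendrogram at the desired k, as our EAC variant does:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def eac_consensus(labels_ensemble, k):
    """Evidence-accumulation style consensus: build the n x n
    co-association matrix (fraction of ensemble components in which two
    objects share a cluster), then cut a single-link dendrogram at k."""
    labels_ensemble = np.asarray(labels_ensemble)   # (l, n)
    l, n = labels_ensemble.shape
    coassoc = np.zeros((n, n))
    for lab in labels_ensemble:
        coassoc += (lab[:, None] == lab[None, :])   # co-clustered pairs
    coassoc /= l
    dist = 1.0 - coassoc                 # turn similarity into distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='single')
    return fcluster(Z, t=k, criterion='maxclust')

# Toy ensemble: four partitions of six objects, one disagreeing on object 2
ensemble = [[0, 0, 0, 1, 1, 1],
            [1, 1, 1, 0, 0, 0],   # a label permutation of the first one
            [0, 0, 0, 1, 1, 1],
            [0, 0, 1, 1, 1, 1]]
consensus = eac_consensus(ensemble, k=2)
```

Note that the second component is a label permutation of the first: the co-association matrix is insensitive to label correspondence, which is precisely why this family of consensus functions needs no explicit cluster relabeling.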
A.6 Computational resources<br />
All the experiments conducted in this thesis have been executed under Matlab 7.0.4 on<br />
Dual Pentium 4 3GHz/1 GB RAM computers. The reason for choosing Matlab as the<br />
programming language for implementing our proposals is threefold: besides the fact that we are
familiar with it, the existence of multiple built-in functions simplifies the implementation of<br />
many of the processes involved in our proposals (Principal Component Analysis and Random<br />
Projection feature extraction, for instance). Moreover, the availability of the full Matlab<br />
source code of several components of our proposals (e.g. hypergraph consensus functions<br />
or Non-negative Matrix Factorization feature extraction) has been a further incentive for<br />
this decision. However, the downside of this choice is the relatively slow execution of<br />
our proposals' implementations, due to the fact that Matlab is an interpreted programming
language.<br />
Appendix B<br />
Experiments on clustering<br />
indeterminacies<br />
The goal of this appendix is to present experimental evidence of the indeterminacies, introduced
in chapter 1, that affect the practical selection of a specific clustering configuration. In
particular, we focus on the indeterminacies regarding the selection of the data representation
and clustering algorithm that yield the best clustering results for both unimodal and
multimodal data collections.<br />
As already noted in chapter 1, the evaluation of the clustering results is based on computing
the normalized mutual information φ (NMI) between a given label vector and the ground
truth, which is not available to the clustering process and is used only for evaluation purposes.
Recall that φ (NMI) ranges from 0 to 1: the higher its value, the more similar the
clustering result is to the ground truth.
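For reference, φ (NMI) can be computed as follows (an illustrative sketch assuming the geometric-mean normalization of Strehl and Ghosh; not necessarily the exact implementation used in this work):

```python
import numpy as np

def nmi(labels_a, labels_b):
    """phi^(NMI): mutual information between two labelings, normalized
    by the geometric mean of their entropies (Strehl-Ghosh variant)."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    cont = np.zeros((a.max() + 1, b.max() + 1))   # contingency table
    for i, j in zip(a, b):
        cont[i, j] += 1
    p = cont / n
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    if ha == 0 or hb == 0:
        return 0.0
    return mi / np.sqrt(ha * hb)

truth = [0, 0, 0, 1, 1, 1]
score = nmi(truth, [1, 1, 1, 0, 0, 0])   # a label permutation of truth
```

Because the measure is built from the contingency table, any permutation of the cluster labels leaves φ (NMI) unchanged, which is essential when comparing partitions whose labels carry no intrinsic meaning.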
B.1 Clustering indeterminacies in unimodal data sets<br />
In this section, we analyze which clustering configurations (data representation plus clustering<br />
algorithm) give rise to the best partitioning of the unimodal data sets described in<br />
section A.2.1. We aim to demonstrate the dependence between the quality of the clustering<br />
results and the selection of the way objects are represented and clustered.<br />
As described in section A.3.1, starting with the original data representation (denoted<br />
as baseline), four additional representations have been created by applying several feature<br />
extraction techniques with multiple dimensionalities, namely Principal Component Analysis<br />
(PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization<br />
(NMF) and Random Projection (RP)¹.
On each distinct object representation, the 28 clustering algorithms from the CLUTO<br />
toolbox presented in section A.1 have been applied, which gives rise to the number of partitions<br />
per data representation presented in table B.1. Notice that, for those data sets not
satisfying non-negativity constraints, the NMF representation was not derived. Moreover,<br />
1 The only exception to this rule is the MFeat data set, where no attribute transformation was applied,<br />
as its original form already presents data representation diversity through the use of several features, see<br />
section A.2.1.<br />
Data set Data representation<br />
name Baseline PCA ICA NMF RP<br />
Zoo 28 392 392 392 392<br />
Iris 28 56 56 56 56<br />
Wine 28 308 308 308 308<br />
Glass 28 196 196 196 196<br />
Ionosphere 28 896 896 - 896<br />
WDBC 28 784 784 784 784<br />
Balance 28 56 - 56 56<br />
Mfeat 28 on each of its 6 representations (FAC, FOU, KAR, MOR, PIX and ZER)
miniNG 28 504 504 504 504<br />
Segmentation 28 476 476 - 476<br />
BBC 28 392 392 392 392<br />
PenDigits 28 392 392 392 392<br />
Table B.1: Number of individual clusterings per data representation on each unimodal data<br />
set.<br />
the ICA algorithm employed for deriving the homonymous object representation presented<br />
convergence problems when executed on the Balance data collection, so no ICA representation<br />
was created on this data set.<br />
In the next paragraphs, we describe the clustering results obtained on each data set,<br />
emphasizing which clustering configurations lead to the best clustering results in each case.<br />
B.1.1 Zoo data set<br />
Figure B.1 presents the histograms of the φ (NMI) values (ranging in the [0,1] interval) obtained<br />
by all the clustering algorithms on each data representation for the Zoo data collection.<br />
Recall that φ (NMI) = 1 corresponds to a perfect match between the ground truth and<br />
a clustering solution. The analysis of these histograms helps us to interpret the influence of
the clustering indeterminacies on the quality of the clustering results.<br />
Figure B.1: Histograms of the φ (NMI) values obtained on each data representation in the Zoo data set. [Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; each plots clustering count against φ (NMI).]
Firstly, by inspecting the histogram corresponding to the clustering results obtained by<br />
applying the 28 algorithms on the baseline object representation (figure B.1(a)), we can<br />
see that φ (NMI) values scattered in a range extending approximately from φ (NMI) = 0.45 to
φ (NMI) = 0.85 are obtained. It is important to notice that such diverse results are solely due
to the clustering algorithm selection indeterminacy, as this histogram presents the results<br />
of running multiple distinct clustering algorithms on a single data representation.<br />
If this analysis is extended to the remaining histograms (figures B.1(b) to B.1(e)), it can<br />
be observed that the φ (NMI) scatter extends across an even wider range for each distinct type<br />
of representation. This somehow gives an idea of the dependence between the quality of the<br />
clustering results and the selection of the clustering algorithm. However, this conclusion<br />
cannot be drawn as directly as in the baseline representation, given that histograms B.1(b)<br />
to B.1(e) present the results of running the 28 algorithms on multiple representations with<br />
distinct dimensionalities derived by each feature extraction technique. In other words,<br />
the diversity observed in these histograms is produced by the joint effect of the indeterminacies
in selecting the clustering algorithm and the dimensionality of the reduced data representation.
Nevertheless, if figures B.1(a) to B.1(e) are compared among themselves, the different histogram
distributions reveal the effect of the clustering indeterminacy regarding the type of
data representation. For example, clustering results on the NMF representations of this<br />
data set span across a comparatively narrower and higher range of φ (NMI) values than their<br />
PCA, ICA and RP counterparts, indicating that better results are more likely to be obtained
if clustering is run on NMF representations than on the remaining ones.
B.1.2 Iris data set<br />
Compared to other data sets, a pretty small number of clustering solutions have been generated<br />
on the Iris collection. Regardless of this fact, the effect of the clustering indeterminacies<br />
can also be observed in figure B.2.<br />
Figure B.2: Histograms of the φ (NMI) values obtained on each data representation in the Iris data set. [Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; each plots clustering count against φ (NMI).]
In this case, the wide span of the φ (NMI) histograms of the PCA and ICA representations<br />
(figures B.2(b) and B.2(c)) is the clearest indicator of the representation dimensionality and<br />
algorithm selection indeterminacies.<br />
If the qualities of the clustering solutions obtained for the distinct types of object representation<br />
are compared, we can observe that the highest φ (NMI) values are obtained using<br />
the RP and the baseline representations.<br />
B.1.3 Wine data set<br />
The histograms of the φ (NMI) values obtained by each clustering algorithm across all the<br />
data representations employed in the Wine data set are presented in figure B.3.<br />
Figure B.3: Histograms of the φ (NMI) values obtained on each data representation in the Wine data set. [Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; each plots clustering count against φ (NMI).]
The clustering indeterminacy regarding the selection of both the clustering algorithm<br />
and the dimensionality of the data representation is clearly observed in figures B.3(b) and<br />
B.3(c). For both the PCA and ICA data representations, a rather even histogram is obtained,<br />
spanning from φ (NMI) = 0.04 to φ (NMI) = 0.84.
Moreover, notice that it is only with these data representations (PCA and ICA) that<br />
φ (NMI) values above 0.5 are obtained on this data set, which reinforces the importance of<br />
using the optimal type of features for obtaining good clustering results.
B.1.4 Glass data set<br />
The φ (NMI) histograms corresponding to the Glass data set are presented in figure B.4.<br />
Figure B.4: Histograms of the φ (NMI) values obtained on each data representation in the Glass data set. [Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; each plots clustering count against φ (NMI).]
Notice the distinct histogram distributions obtained for each data representation, which<br />
gives an idea of how the selection of a particular data representation influences the quality of<br />
the clustering results. Additionally, a pretty wide range of φ (NMI) values is observed in
the histograms corresponding to the feature extraction based data representations (figures<br />
B.4(b) to B.4(e)), thus evidencing the effect of the dimensionality reduction and clustering<br />
algorithm selection indeterminacy.<br />
B.1.5 Ionosphere data set<br />
As regards the Ionosphere data collection, pretty similar φ (NMI) distributions are obtained<br />
for the PCA, ICA and RP representations (see figures B.5(b) to B.5(d)). Thus, in this<br />
case, there apparently exists a lower dependence between the quality of clustering and the<br />
feature extraction technique used for representing the objects. Nevertheless, despite the<br />
notable concentration of clustering results on the leftmost part of the histograms (i.e. poor<br />
clusterings with low values of φ (NMI) ), there exist some clustering solutions reaching φ (NMI)<br />
values above 0.5 using PCA and ICA feature extraction (see figures B.5(b) and B.5(c)).<br />
Moreover, notice that pretty poor quality clusterings are obtained when operating on the<br />
baseline object representation (figure B.5(a)).<br />
Figure B.5: Histograms of the φ (NMI) values obtained on each data representation in the Ionosphere data set. [Panels: (a) Baseline, (b) PCA, (c) ICA, (d) RP; each plots clustering count against φ (NMI).]
B.1.6 WDBC data set<br />
As regards the WDBC data collection, there exists a notable difference between the profiles<br />
of the histograms of the PCA, ICA and NMF representations when compared to the<br />
baseline and RP histograms. Indeed, the former present a sharp peak located in the lowest<br />
region of the φ (NMI) range, whereas the latter do not—which reflects the data representation<br />
clustering indeterminacy. The notably large differences between the highest and lowest<br />
φ (NMI) values of all the histograms reveal the influence of the clustering algorithm and data<br />
dimensionality selection on the quality of the partition results.<br />
Figure B.6: Histograms of the φ (NMI) values obtained on each data representation in the WDBC data set. [Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; each plots clustering count against φ (NMI).]
B.1.7 Balance data set<br />
The approximately even distributions of the φ (NMI) histograms corresponding to the four<br />
object representations employed in the Balance data set (with the exception of the peak<br />
around φ (NMI) = 0.04 in figure B.7(c)) suggest that randomly selecting a good or a bad
clustering configuration is roughly equiprobable in this data collection.
Figure B.7: Histograms of the φ (NMI) values obtained on each data representation in the Balance data set. [Panels: (a) Baseline, (b) PCA, (c) NMF, (d) RP; each plots clustering count against φ (NMI).]
B.1.8 MFeat data set<br />
In this data set, six distinct feature types were employed for representing the objects, each<br />
with a single dimensionality. Therefore, the φ (NMI) scatter observed in each of the figures<br />
from B.8(a) to B.8(f) is solely due to the algorithm selection indeterminacy.<br />
Notice that, in all these histograms, a pretty high density of clustering solutions around<br />
φ (NMI) = 0.5 can be observed. Nevertheless, notably better clustering results (φ (NMI) ≈ 0.8)
can be obtained using the KAR and PIX object representations (see figures B.8(c) and<br />
B.8(e)), which reveals the data representation indeterminacy effect.<br />
B.1.9 miniNG data set<br />
The wide spread of the φ (NMI) values observed in figure B.9(a) –from φ (NMI) = 0.06 to
φ (NMI) = 0.64– is clear evidence of how the selection of a particular clustering algorithm
affects the quality of the clustering results.<br />
Moreover, notice that the clustering solutions obtained on the RP representation yield<br />
φ (NMI) values below 0.3, whereas the best results obtained on the remaining representations<br />
reach and even surpass φ (NMI) = 0.5 —i.e. distinct object representations can significantly
alter the results of a clustering process.<br />
B.1.10 Segmentation data set<br />
As regards the effect of applying distinct clustering algorithms on the same object representation,<br />
figure B.10(a) shows how, despite the accumulation of clustering solutions around<br />
φ (NMI) = 0.35, a maximum quality of φ (NMI) = 0.65 can be obtained on the baseline representation
of the objects.
Figure B.8: Histograms of the φ (NMI) values obtained on each data representation in the MFeat data set. [Panels: (a) FAC, (b) FOU, (c) KAR, (d) MOR, (e) PIX, (f) ZER; each plots clustering count against φ (NMI).]
Figure B.9: Histograms of the φ (NMI) values obtained on each data representation in the miniNG data set. [Panels: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP; each plots clustering count against φ (NMI).]
Furthermore, if figures B.10(b) and B.10(c) are compared to figure B.10(d), it is easy to<br />
see that, whereas the former two present a wide and sharp peak centered around φ (NMI) =0.7<br />
(thus indicating that clustering solutions this good are likely to be obtained using the PCA<br />
and ICA representations of the objects), the latter has its acme around φ (NMI) =0.35, i.e.<br />
the quality of the RP-based clustering solutions tends to be lower in this data set.<br />
B.1.11 BBC data set<br />
The BBC data collection constitutes another example where very diverse clustering solutions<br />
–with qualities ranging from φ (NMI) =0.01 to φ (NMI) =0.81– are obtained when clustering<br />
is conducted on the original representation of the objects (see figure B.11(a)).<br />
As far as the remaining data representations are concerned, the best results seem to<br />
be obtained using the NMF feature extraction technique, as its corresponding histogram is<br />
the most sparsely populated at the low range of φ (NMI) and the most densely populated at the high range.<br />
[Figure B.10 shows four histograms of clustering count vs. φ (NMI), one per Segmentation representation: (a) Baseline, (b) PCA, (c) ICA, (d) RP.]<br />
Figure B.10: Histograms of the φ (NMI) values obtained on each data representation in the<br />
Segmentation data set.<br />
[Figure B.11 shows five histograms of clustering count vs. φ (NMI), one per BBC representation: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP.]<br />
Figure B.11: Histograms of the φ (NMI) values obtained on each data representation in the<br />
BBC data set.<br />
B.1.12 PenDigits data set<br />
In this case, the distinct object representations present a reasonably similar behaviour<br />
according to the histograms depicted in figure B.12. From a simplified viewpoint,<br />
these can be decomposed into a negatively skewed peak with its acme around φ (NMI) =0.6,<br />
and two other narrow peaks, one located near φ (NMI) =0.8 and the other on the low range<br />
of the histogram. Thus, as opposed to what has been observed in other data collections, the<br />
application of the twenty-eight clustering algorithms on the distinct object representations<br />
yields comparable quality results in this data set.<br />
[Figure B.12 shows five histograms of clustering count vs. φ (NMI), one per PenDigits representation: (a) Baseline, (b) PCA, (c) ICA, (d) NMF, (e) RP.]<br />
Figure B.12: Histograms of the φ (NMI) values obtained on each data representation in the<br />
PenDigits data set.<br />
B.1.13 Summary<br />
To provide a summary view of the data representation and clustering algorithm<br />
selection indeterminacies across all the analyzed data sets, table B.2 presents the φ (NMI)<br />
corresponding to the best clustering solution achieved by each one of the six families of<br />
clustering algorithms employed in this work, namely agglomerative (agglo), biased agglomerative<br />
(bagglo), direct, graph, repeated-bisecting (rb) and refined repeated-bisecting (rbr),<br />
indicating the type of object representation employed in each case (either baseline, PCA,<br />
ICA, NMF or RP).<br />
There are several facts worth observing as regards the data representation indeterminacy.<br />
Notice that, in some data sets (e.g. Zoo or miniNG), there exists a notable diversity as<br />
regards the type of representation that yields the top clustering result for each family of<br />
clustering algorithms. In contrast, in other data collections, there seems to exist a particular<br />
object representation that apparently reveals the data set structure regardless of the type of<br />
clustering algorithm applied. This behaviour is observed in the Iris and Balance collections,<br />
and also, to a lesser extent, in the WDBC and Segmentation data sets. Moreover, notice<br />
the variability of these optimal object representations across the analyzed data sets, which<br />
is a clear indicator of the clustering indeterminacy regarding data representations.<br />
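As an aside, the alternative representations referenced above (PCA, ICA, NMF and random projection, RP) are standard feature extraction techniques. Of these, RP is the simplest to sketch; the following toy implementation (illustrative only, with hypothetical parameter choices, not the thesis code) projects the data onto d random Gaussian directions:

```python
import random

def random_projection(X, d, seed=0):
    """Project n x D data onto d random Gaussian directions (an RP representation)."""
    rng = random.Random(seed)
    D = len(X[0])
    # Entries drawn from N(0, 1/d), so squared norms are roughly preserved on average.
    R = [[rng.gauss(0, 1) / d ** 0.5 for _ in range(d)] for _ in range(D)]
    return [[sum(x[k] * R[k][j] for k in range(D)) for j in range(d)] for x in X]
```

PCA, ICA and NMF are typically obtained through standard library implementations rather than written by hand.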
As far as the selection of the optimal clustering algorithm is concerned, it is important<br />
to note that each of the six families of clustering algorithms employed in this work<br />
reaches the best absolute performance in at least one of the analyzed data sets,<br />
which gives an idea of the algorithm selection indeterminacy. Moreover, notice that choosing<br />
the wrong type of clustering algorithm may affect the quality of the clustering solution<br />
dramatically (see the Ionosphere and Balance collections) or hardly at all (as in the Segmentation<br />
data set).<br />
B.2 Clustering indeterminacies in multimodal data sets<br />
The goal of this section is to evaluate the effect of clustering indeterminacies in the context<br />
of multimodal data collections. Along with the data representation and clustering algorithm<br />
selection indeterminacies, multimodality introduces a further source of uncertainty, as it is<br />
not obvious whether the combination of the m modalities will benefit the quality<br />
of the obtained clustering solution or not. And again, to make things worse, it is important<br />
to recall that all these indeterminacies are local to each data collection, so, in general, it is<br />
not possible to draw universally valid conclusions.<br />
As done in appendix B.1, we start by presenting the total number of individual clustering<br />
solutions obtained by applying the 28 clustering algorithms extracted from the CLUTO<br />
toolbox on all the data representations of the objects contained in the employed multimodal<br />
data sets² (see table B.3). Notice that the CAL500 and InternetAds collections lack the<br />
NMF representations, as their features do not satisfy the necessary non-negativity constraints.<br />
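For reference, φ (NMI) denotes the normalized mutual information between a clustering and the ground-truth labeling. A minimal self-contained sketch, assuming the geometric-mean normalization of Strehl and Ghosh (the normalization actually used in the thesis is defined elsewhere in the document):

```python
from collections import Counter
from math import log

def phi_nmi(labels, truth):
    """Normalized mutual information between two labelings of the same objects."""
    n = len(labels)
    ca, cb = Counter(labels), Counter(truth)
    joint = Counter(zip(labels, truth))
    # Mutual information between the two partitions.
    mi = sum((nij / n) * log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in joint.items())
    ha = -sum((c / n) * log(c / n) for c in ca.values())
    hb = -sum((c / n) * log(c / n) for c in cb.values())
    denom = (ha * hb) ** 0.5  # geometric mean of the two entropies
    # Degenerate single-cluster partitions yield zero entropy; treat as perfect match.
    return mi / denom if denom > 0 else 1.0
```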
In the next paragraphs, we describe the clustering results obtained on the four multimodal<br />
data sets, placing special emphasis on which data representations and modalities<br />
lead to the best clustering results in each case.<br />
2 See appendices A.1, A.2.2, and A.3.2 for a description of the clustering algorithms, the multimodal<br />
collections and the multimodal objects representations employed in this work.<br />
Data set: highest maximum φ (NMI) −→ lowest maximum φ (NMI), listed as family / representation / (φ (NMI)):<br />
Zoo: agglo NMF-13d (0.865), bagglo PCA-13d (0.858), direct baseline (0.853), rb RP-12d (0.848), rbr RP-12d (0.848), graph PCA-6d (0.730)<br />
Iris: bagglo baseline (0.899), direct baseline (0.899), rb baseline (0.899), rbr baseline (0.899), agglo baseline (0.837), graph baseline (0.821)<br />
Wine: direct PCA-10d (0.836), rbr PCA-10d (0.836), bagglo ICA-12d (0.802), rb PCA-10d (0.795), graph PCA-8d (0.755), agglo ICA-5d (0.720)<br />
Glass: agglo RP-8d (0.487), direct RP-5d (0.442), rb PCA-3d (0.423), rbr PCA-6d (0.418), bagglo PCA-4d (0.417), graph PCA-3d (0.392)<br />
Ionosphere: graph PCA-8d (0.656), agglo PCA-31d (0.314), bagglo RP-14d (0.309), direct RP-9d (0.234), rb RP-9d (0.234), rbr RP-9d (0.234)<br />
WDBC: graph NMF-3d (0.637), bagglo NMF-3d (0.603), direct NMF-3d (0.563), rb NMF-3d (0.563), rbr NMF-3d (0.563), agglo RP-15d (0.522)<br />
Balance: agglo PCA-3d (0.703), bagglo PCA-3d (0.411), rbr PCA-3d (0.394), rb PCA-3d (0.388), direct PCA-3d (0.370), graph PCA-3d (0.324)<br />
MFeat: graph PIX (0.811), rbr PIX (0.676), direct PIX (0.669), agglo KAR (0.664), bagglo PIX (0.606), rb KAR (0.585)<br />
miniNG: rb baseline (0.638), rbr PCA-70d (0.597), bagglo baseline (0.559), direct ICA-50d (0.558), graph PCA-50d (0.412), agglo NMF-50d (0.377)<br />
Segmentation: graph PCA-11d (0.786), rbr PCA-13d (0.741), agglo ICA-13d (0.733), bagglo PCA-13d (0.731), rb PCA-14d (0.728), direct ICA-17d (0.720)<br />
BBC: rbr baseline (0.808), direct baseline (0.808), graph ICA-5d (0.777), agglo ICA-5d (0.750), bagglo ICA-5d (0.744), rb baseline (0.726)<br />
PenDigits: graph NMF-13d (0.839), agglo ICA-9d (0.724), rbr NMF-9d (0.682), direct NMF-10d (0.681), bagglo NMF-11d (0.665), rb NMF-7d (0.648)<br />
Table B.2: Top clustering results obtained by each clustering algorithm family across all<br />
the unimodal data sets, sorted from highest to lowest φ (NMI) .<br />
242
Appendix B. Experiments on clustering indeterminacies<br />
Data representation / modality: CAL500, Corel, InternetAds, IsoLetters<br />
Baseline: MM 28, 28, 28, 28; M1 28, 28, 28, 28; M2 28, 28, 28, 28<br />
PCA: MM 504, 420, 308, 532; M1 280, 196, 308, 196; M2 140, 224, 392, 532<br />
ICA: MM 504, 420, 308, 532; M1 280, 196, 308, 196; M2 140, 224, 392, 532<br />
NMF: MM –, 420, –, 532; M1 –, 196, –, 196; M2 –, 224, –, 532<br />
RP: MM 504, 420, 308, 532; M1 280, 196, 308, 196; M2 140, 224, 392, 532<br />
Table B.3: Number of individual clusterings per data representation on each multimodal<br />
data set, where MM, M1 and M2 stand for multimodal, mode #1 and mode #2, respectively.<br />
B.2.1 CAL500 data set<br />
The φ (NMI) histograms presented in figure B.13 summarize the clustering results obtained by<br />
running the aforementioned twenty-eight algorithms on each type of object representation<br />
for each one of the two modalities, and for the multimodal representations as well.<br />
If the histograms are compared representation-wise, we observe that all representations<br />
yield clustering solutions whose quality spans similarly wide ranges below φ (NMI) =0.5:<br />
for a given modality, there exists no clearly superior object representation.<br />
However, if the histograms are compared across the modalities, it can be observed that<br />
better results are obtained when clustering is conducted on the audio modality of this data<br />
set, regardless of the type of representation employed. Moreover, the multimodal data<br />
representation seems to yield intermediate quality clustering results (i.e. slightly better<br />
than clustering on text only, but worse than clustering solely on audio), which reveals that<br />
the early fusion of acoustic and textual features is not beneficial in this case.<br />
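Early fusion, as used throughout this appendix, amounts to concatenating the per-object feature vectors of the individual modalities before clustering; a minimal sketch (it assumes both modalities list their objects in the same order, and omits any per-modality normalization):

```python
def early_fusion(modality1, modality2):
    """Concatenate per-object feature vectors of two modalities into one."""
    assert len(modality1) == len(modality2), "one feature vector per object"
    return [f1 + f2 for f1, f2 in zip(modality1, modality2)]
```

In practice, some form of feature scaling is usually applied first, since a modality with larger numeric ranges would otherwise dominate the fused representation.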
B.2.2 Corel data set<br />
Figure B.14 presents the φ (NMI) histograms corresponding to the multimodal and unimodal<br />
clustering of the captioned images of the Corel data set.<br />
As regards the comparison across object representations, it can be observed that, especially<br />
for the image and multimodal modalities, the RP representation offers a large number<br />
of good clustering solutions, whereas the quality of the clusterings obtained on the remaining<br />
representations is scattered over a wide range of φ (NMI) values.<br />
If the clustering results obtained on the two modalities are compared, we can see that<br />
the image modality is the one yielding the best clustering results (up to φ (NMI) =0.68),<br />
[Figure B.13 shows twelve histograms of clustering count vs. φ (NMI) for CAL500: Baseline, PCA, ICA and RP representations, each on the multimodal (a)-(d), audio (e)-(h) and text (i)-(l) data.]<br />
Figure B.13: Histograms of the φ (NMI) values obtained on the CAL500 data set for each<br />
data representation and modality.<br />
which are far better than those obtained on the text modality (always below φ (NMI) =0.3).<br />
Last but not least, the multimodal object representation seems to benefit slightly from the<br />
early fusion of the visual and textual features of both modalities, as better clustering results<br />
are obtained in this case, although by a very small margin.<br />
B.2.3 InternetAds data set<br />
The clustering results corresponding to the InternetAds data collection are summarized in<br />
figure B.15. Many poor clustering results are obtained on this data set, as the high peaks<br />
located on the leftmost regions of the histograms reveal. The distinct data representations<br />
and modalities present a rather erratic behaviour, as discussed next.<br />
If the two modalities are compared, the best clustering results are obtained, in general<br />
terms, using the collateral information of the Internet advertisements (which are the objects<br />
in this data set). However, the multimodal composition of the objects tends to yield superior<br />
–although still poor– quality clusterings, except for the PCA representation.<br />
[Figure B.14 shows fifteen histograms of clustering count vs. φ (NMI) for Corel: Baseline, PCA, ICA, NMF and RP representations, each on the multimodal (a)-(e), image (f)-(j) and text (k)-(o) data.]<br />
Figure B.14: Histograms of the φ (NMI) values obtained on the Corel data set for each<br />
data representation and modality.<br />
B.2.4 IsoLetters data set<br />
The collection of clustering solutions obtained on the IsoLetters artificial multimodal data<br />
collection is presented representation- and modality-wise in figure B.16.<br />
In this case, the quality of the clusterings created on the distinct object representations<br />
presents two clearly different histogram patterns depending on the modality. For instance,<br />
in the speech modality (figures B.16(e) to B.16(h)), the baseline and RP histograms present<br />
a main peak and a secondary minor one, whereas the PCA and ICA representations yield a<br />
fairly uniform distribution of clusterings. In contrast, a totally different distribution is<br />
found when clustering is run on the visual mode, where a single negatively skewed bell<br />
shape is observed (see figures B.16(i) to B.16(l)).<br />
Finally, it is worth noting that, regardless of the object representation employed, the early<br />
fusion of the speech and visual features of this data set gives rise to a notable increase in<br />
the quality of the clustering results (16.2% in averaged relative terms with respect to the top<br />
quality individual clustering solution).<br />
[Figure B.15 shows fifteen histograms of clustering count vs. φ (NMI) for InternetAds: Baseline, PCA, ICA, NMF and RP representations, each on the multimodal (a)-(e), object (f)-(j) and collateral (k)-(o) data.]<br />
Figure B.15: Histograms of the φ (NMI) values obtained on the InternetAds data set for each<br />
data representation and modality.<br />
B.2.5 Summary<br />
Repeating the format employed in section B.1, table B.4 presents the φ (NMI) values attained<br />
by the top clustering solution achieved by the best representative of each of the six<br />
families of clustering algorithms employed in this work (i.e. agglo, bagglo, direct, graph,<br />
rb and rbr), indicating the type of representation employed (baseline, PCA, ICA, NMF or<br />
RP) and the modality (multimodal –MM–, mode #1 –M1– or mode #2 –M2–). The idea<br />
is to present a condensed view of the influence of the data representation and clustering<br />
algorithm selection indeterminacies.<br />
Notice the distinct ordering of the six families of clustering algorithms in every data<br />
set. A clear indicator of the algorithm selection indeterminacy is the fact that the rbr type<br />
of algorithms yields the top clustering solution in three of the four data sets, while offering<br />
the poorest performance in the InternetAds collection.<br />
The indeterminacy regarding the use of multimodal or unimodal data representations<br />
also becomes evident, as in two of the data sets (Corel and IsoLetters) the multimodal<br />
246
[Figure B.16 shows twelve histograms of clustering count vs. φ (NMI) for IsoLetters: Baseline, PCA, ICA and RP representations, each on the multimodal (a)-(d), speech (e)-(h) and image (i)-(l) data.]<br />
Figure B.16: Histograms of the φ (NMI) values obtained on the IsoLetters data set for each<br />
data representation and modality.<br />
representations dominate the best clustering results across all the families of algorithms,<br />
whereas one of the unimodal representations does so in the CAL500 and InternetAds<br />
collections. And finally, notice the diversity of representation types appearing<br />
in table B.4, which suggests that, for a given data set, it is very difficult to select the data<br />
representation and clustering strategy that yield the best clustering results.<br />
Data set: highest φ (NMI) −→ lowest φ (NMI), listed as family / representation-modality (dimensionality) / (φ (NMI)):<br />
CAL500: rbr RP-M1 120d (0.411), direct RP-M1 100d (0.404), agglo RP-M1 100d (0.401), rb RP-M1 120d (0.384), bagglo RP-M1 120d (0.381), graph baseline-M1 (0.364)<br />
Corel: rbr NMF-MM 550d (0.675), graph RP-MM 400d (0.672), direct NMF-MM 450d (0.671), rb NMF-MM 300d (0.641), bagglo baseline-M1 (0.624), agglo baseline-MM (0.622)<br />
InternetAds: bagglo RP-M1 70d (0.430), graph NMF-MM 150d (0.319), agglo baseline-M2 (0.258), direct NMF-M2 550d (0.087), rb NMF-M2 550d (0.087), rbr NMF-M2 550d (0.087)<br />
IsoLetters: rbr PCA-MM 100d (0.897), direct PCA-MM 100d (0.875), graph PCA-MM 100d (0.846), agglo RP-MM 600d (0.790), rb baseline-MM (0.751), bagglo baseline-MM (0.728)<br />
Table B.4: Top clustering results obtained by each clustering algorithm family across all<br />
the multimodal data sets, sorted from highest to lowest φ (NMI) .<br />
Appendix C<br />
Experiments on hierarchical<br />
consensus architectures<br />
This appendix presents several experiments regarding self-refining hierarchical consensus<br />
architectures.<br />
C.1 Configuration of a random hierarchical consensus architecture<br />
In this section, we present some examples that describe, in detail, the configuration process<br />
of random hierarchical consensus architectures (RHCA). The aim is to demonstrate how,<br />
given a cluster ensemble size l and a mini-ensemble size b, equations (C.1), (C.2) and (C.3)<br />
allow determining the number of stages s, the number of consensus processes per stage Ki and the<br />
effective size of each mini-ensemble bij of the corresponding RHCA.<br />
For starters, let us evaluate carefully the three RHCA examples presented in section 3.2.<br />
In these toy examples, the mini-ensemble size is set to b = 2, while the respective cluster<br />
ensembles have l =7, 8 and 9 components. The first step of the RHCA design process<br />
consists of determining the number of stages of the hierarchy, s, according to equation<br />
(C.1).<br />
⎧ ⌊log_b(l)⌉ if ⌊l/b^⌊log_b(l)⌉⌋ ≤ 1 and ⌊l/b^(⌊log_b(l)⌉−1)⌋ > 1<br />
s = ⎨ ⌊log_b(l)⌉ − 1 if ⌊l/b^⌊log_b(l)⌉⌋ ≤ 1 and ⌊l/b^(⌊log_b(l)⌉−1)⌋ = 1 (C.1)<br />
⎩ ⌊log_b(l)⌉ + 1 if ⌊l/b^⌊log_b(l)⌉⌋ > 1<br />
where ⌊·⌉ denotes rounding to the nearest integer and ⌊·⌋ the floor operator.<br />
Table C.1 presents the results of this computation for the three aforementioned examples<br />
(one row per example), specifying the values of the decision factors used for selecting one<br />
of the three options presented in equation (C.1).<br />
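Equation (C.1) is straightforward to turn into code. A minimal sketch (Python used purely for illustration; `round` plays the role of the nearest-integer operator ⌊·⌉, and integer division implements the floored decision factors, so the routine reproduces the values reported in table C.1):

```python
import math

def num_stages(l, b):
    """Number of stages s of a RHCA with ensemble size l and mini-ensemble size b
    (equation C.1)."""
    r = round(math.log(l, b))  # nearest-integer rounding of log_b(l)
    if l // b ** r > 1:        # more than one full mini-ensemble left: extra stage
        return r + 1
    if l // b ** (r - 1) > 1:
        return r
    return r - 1
```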
Once the number of stages of the RHCA is computed, the next step consists of determining<br />
how many consensus processes are to be executed at each RHCA stage. This factor,<br />
l/b^⌊log_b(l)⌉, l/b^(⌊log_b(l)⌉−1), s:<br />
l = 7: 0.875, 1.75, s = ⌊log_b(l)⌉ − 1 = 2<br />
l = 8: 1, 2, s = ⌊log_b(l)⌉ = 3<br />
l = 9: 1.125, 2.25, s = ⌊log_b(l)⌉ = 3<br />
Table C.1: Examples of the computation of the number of stages s of a RHCA on cluster<br />
ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.<br />
which is designated by Ki (where the subscript i denotes the stage number), is computed<br />
according to equation (C.2).<br />
Ki = max(⌊l/b^i⌋, 1) (C.2)<br />
The number of consensus processes per stage for the three RHCA examples discussed<br />
is presented in table C.2.<br />
l = 7: stage 1: K1 = max(⌊3.5⌋, 1) = 3 (as l/b^1 = 3.5); stage 2: K2 = max(⌊1.75⌋, 1) = 1 (as l/b^2 = 1.75)<br />
l = 8: stage 1: K1 = max(⌊4⌋, 1) = 4 (as l/b^1 = 4); stage 2: K2 = max(⌊2⌋, 1) = 2 (as l/b^2 = 2); stage 3: K3 = max(⌊1⌋, 1) = 1 (as l/b^3 = 1)<br />
l = 9: stage 1: K1 = max(⌊4.5⌋, 1) = 4 (as l/b^1 = 4.5); stage 2: K2 = max(⌊2.25⌋, 1) = 2 (as l/b^2 = 2.25); stage 3: K3 = max(⌊1.125⌋, 1) = 1 (as l/b^3 = 1.125)<br />
Table C.2: Examples of the computation of the number of consensus processes per stage (Ki) of a RHCA<br />
on cluster ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.<br />
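Likewise, equation (C.2) amounts to one floored division per stage; a minimal sketch (integer floor division implements ⌊·⌋):

```python
def consensus_per_stage(l, b, s):
    """K_i = max(floor(l / b**i), 1) for stages i = 1..s (equation C.2)."""
    return [max(l // b ** i, 1) for i in range(1, s + 1)]
```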
And finally, the real mini-ensemble sizes bij, ∀i ∈ [1, s] and ∀j ∈ [1, Ki], must be computed.<br />
Recall that the effective size of all the mini-ensembles of the RHCA is equal to the<br />
user-defined mini-ensemble size (i.e. bij = b ∀i, j) if and only if l is an integer power of b. In<br />
practice, the effective mini-ensemble sizes are computed according to equation (C.3), which<br />
adjusts this factor so that all the original and intermediate clusterings are subject to a consensus<br />
process. The bij values corresponding to the three RHCA examples are presented<br />
in table C.3, along with the corresponding number of consensus processes Ki at each RHCA stage in<br />
brackets.<br />
⎧ b if i < s and j < Ki<br />
bij = ⎨ b + (K_{i−1} mod b) if i < s and j = Ki (C.3)<br />
⎩ K_{s−1} if i = s<br />
where K_0 = l, so that the last mini-ensemble of the first stage absorbs the remainder l mod b.<br />
l = 7: stage 1 [K1 = 3]: b11 = b = 2, b12 = b = 2, b13 = b + l mod b = 3; stage 2 [K2 = 1]: b21 = K1 = 3<br />
l = 8: stage 1 [K1 = 4]: b11 = b12 = b13 = b = 2, b14 = b + l mod b = 2; stage 2 [K2 = 2]: b21 = b = 2, b22 = b + K1 mod b = 2; stage 3 [K3 = 1]: b31 = K2 = 2<br />
l = 9: stage 1 [K1 = 4]: b11 = b12 = b13 = b = 2, b14 = b + l mod b = 3; stage 2 [K2 = 2]: b21 = b = 2, b22 = b + K1 mod b = 2; stage 3 [K3 = 1]: b31 = K2 = 2<br />
Table C.3: Examples of the computation of the mini-ensemble sizes of a RHCA on cluster<br />
ensembles of size l = 7, 8 and 9, with mini-ensemble size b = 2.<br />
Table C.4 presents the configuration of RHCA topologies on a cluster ensemble of an<br />
arbitrarily chosen size of l = 30, using the following predefined mini-ensemble<br />
sizes: b = {2, 3, 5, 15}.<br />
It can be observed that the larger the mini-ensemble size b, the smaller the number of<br />
stages s. Moreover, notice that, for any given RHCA, the number of consensus processes per stage Ki<br />
progressively converges to unity, i.e. Ks = 1, giving rise to a regular pyramidal hierarchy of<br />
consensus processes. Last, it is worth observing how the effective size of the mini-ensembles<br />
bij is determined. Notice that bij is regularly set to b, except possibly for the last (i.e. the Ki-th)<br />
consensus of each stage and/or the only consensus of the final stage, which may vary their<br />
size so as to accommodate the necessary number of clusterings into the associated consensus<br />
process.<br />
l = 30, b = 2: s = 4; Ki = {15, 7, 3, 1}; b1j = 2 ∀j ∈ [1, 15]; b2j = 2 ∀j ∈ [1, 6], b27 = 3; b3j = 2 ∀j ∈ [1, 2], b33 = 3; b41 = 3<br />
l = 30, b = 3: s = 3; Ki = {10, 3, 1}; b1j = 3 ∀j ∈ [1, 10]; b2j = 3 ∀j ∈ [1, 2], b23 = 4; b31 = 3<br />
l = 30, b = 5: s = 2; Ki = {6, 1}; b1j = 5 ∀j ∈ [1, 6]; b21 = 6<br />
l = 30, b = 15: s = 2; Ki = {2, 1}; b1j = 15 ∀j ∈ [1, 2]; b21 = 2<br />
Table C.4: Configuration of RHCA topologies on a cluster ensemble of size l = 30 with<br />
varying mini-ensemble sizes b = {2, 3, 5, 15}.<br />
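The effective mini-ensemble sizes of tables C.3 and C.4 follow a simple pattern: regular mini-ensembles have size b, the last consensus of each intermediate stage absorbs the remainder K_{i−1} mod b (with K_0 = l), and the single final consensus merges the K_{s−1} surviving clusterings. A sketch encoding this pattern (an illustration consistent with the tables, not the thesis code):

```python
def mini_ensemble_sizes(l, b, K):
    """Effective mini-ensemble sizes b_ij per stage (cf. equation C.3).
    K is the list [K_1, ..., K_s] of consensus counts per stage; K_0 = l."""
    prev = [l] + K  # prev[i] = K_i, with K_0 = l
    s = len(K)
    sizes = []
    for i in range(1, s + 1):
        if i == s:
            # Final stage: a single consensus over the K_{s-1} surviving clusterings.
            sizes.append([prev[s - 1]])
        else:
            # K_i - 1 regular mini-ensembles of size b; the last takes the remainder.
            sizes.append([b] * (K[i - 1] - 1) + [b + prev[i - 1] % b])
    return sizes
```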
C.2 Estimation of the computationally optimal RHCA<br />
Section 3.2 presents a methodology for selecting the most computationally efficient implementation<br />
variant of random hierarchical consensus architectures. In short, this methodology<br />
consists of estimating the running time of several RHCA variants differing in the<br />
mini-ensemble size b, and selecting the one that yields the minimum estimated running time,<br />
which is the one actually executed.<br />
So as to validate this procedure, in this section we present the estimated and real running<br />
times of several variants of the fully serial and parallel implementations of RHCA on the<br />
Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat unimodal data sets (see appendix<br />
A.2.1 for a description of these collections) across the four experimental diversity scenarios<br />
employed in this work —see appendix A.4. The objective of this experiment is twofold:<br />
firstly, to verify whether the proposed strategy succeeds in predicting the most<br />
computationally efficient RHCA variant; and secondly, to analyze the conditions<br />
under which random hierarchical consensus architectures are computationally advantageous<br />
compared to flat consensus clustering. The experimental design that has been followed is<br />
outlined next.<br />
– What do we want to measure?<br />
i) The time complexity of random hierarchical consensus architectures.<br />
ii) The ability of the proposed methodology to predict the computationally optimal<br />
RHCA variant, in both the fully serial and parallel implementations.<br />
– How do we measure it?<br />
i) The time complexity of the implemented serial and parallel RHCA variants is<br />
measured in terms of the CPU time required for their execution —serial running<br />
time (SRTRHCA) and parallel running time (PRTRHCA).<br />
ii) The estimated running times of the same RHCA variants –serial estimated running<br />
time (SERTRHCA) and parallel estimated running time (PERTRHCA)– are<br />
computed by means of the proposed running time estimation methodology, which<br />
is based on the measured running time of c = 1 consensus clustering process. A prediction<br />
regarding the computationally optimal RHCA variant is deemed successful<br />
when both the real and estimated running times are minimized by the<br />
same RHCA variant, and the percentage of experiments in which the prediction is<br />
successful is given as a measure of its performance. In order to measure the<br />
impact of incorrect predictions, we also measure the execution time differences<br />
(in both absolute and relative terms) between the truly and the allegedly fastest<br />
RHCA variants whenever the prediction fails. This evaluation process is replicated<br />
for a range of values of c ∈ [1, 20], so as to measure the influence of this factor<br />
on the prediction accuracy of the proposed methodology.<br />
– How are the experiments designed? All the RHCA variants corresponding to<br />
the sweep of values of b resulting from the proposed running time estimation methodology<br />
have been implemented (see table 3.2). In order to test our proposals under a<br />
wide spectrum of experimental situations, consensus processes have been conducted<br />
using the seven consensus functions for hard cluster ensembles presented in appendix<br />
A.5 (i.e. CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD), employing<br />
cluster ensembles of the sizes corresponding to the four diversity scenarios described<br />
in appendix A.4 —which basically boils down to compiling the clusterings output by<br />
|dfA| = {1, 10, 19, 28} clustering algorithms. In all cases, the real running times correspond<br />
to an average of 10 independent runs of the whole RHCA, in order to obtain<br />
representative real running time values (recall that the mini-ensemble components<br />
change from run to run, as they are randomly selected). For a description of the<br />
computational resources employed in our experiments, see appendix A.6.<br />
– How are results presented? Both the real and estimated running times of the<br />
serial and parallel implementations of the RHCA variants are depicted by means of<br />
curves representing their average values.<br />
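The estimation-and-selection procedure outlined above can be summarized with a small sketch. The code below is a simplified illustration, not the thesis' actual estimator: it assumes a caller-supplied timing model `t_of(size)` for a single consensus process (in practice extrapolated from the measured running time of c = 1 consensus), sums all per-consensus times for the serial estimate (SERT), and sums the slowest consensus of each stage for the parallel estimate (PERT):

```python
def stage_sizes(l, b):
    """Mini-ensemble sizes per stage of an RHCA, assuming the remainder
    clusterings go to the last consensus of each stage."""
    stages, n = [], l
    while n > 1:
        k = max(n // b, 1)
        sizes = [n] if k == 1 else [b] * (k - 1) + [b + n % b]
        stages.append(sizes)
        n = k
    return stages

def estimate_running_times(l, b, t_of):
    """Estimate serial (SERT) and parallel (PERT) RHCA running times from a
    per-consensus timing model t_of(size) -> seconds.
    SERT sums every consensus; PERT sums the slowest consensus per stage,
    as all consensus of a stage run concurrently in the parallel case."""
    stages = stage_sizes(l, b)
    sert = sum(t_of(s) for sizes in stages for s in sizes)
    pert = sum(max(t_of(s) for s in sizes) for sizes in stages)
    return sert, pert

# Hypothetical linear timing model (1 second per clustering combined):
sert, pert = estimate_running_times(30, 2, lambda s: s)
```

Under this hypothetical linear model, l = 30 and b = 2 give SERT = 55 and PERT = 11; with a superlinear cost model the gap with respect to flat consensus (a single consensus of size l) widens, which is consistent with the behaviour reported in these experiments.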
C.2.1 Iris data set<br />
For starters, let us analyze the results corresponding to the Iris data collection. In this<br />
case, each diversity scenario corresponds to a cluster ensemble of size l =9, 90, 171 and 252,<br />
respectively. The left and right columns of figure C.1 present the estimated and real running<br />
times of several variants of the serial implementation of the RHCA on this data set across<br />
the four diversity scenarios. It can be observed that, as the size of the cluster ensemble<br />
grows, RHCA variants appear that are more computationally efficient than flat consensus<br />
(especially when the MCLA and KMSAD consensus functions are employed). However,<br />
there are no significant differences between the running times of the fastest RHCA variant<br />
and flat consensus, probably due to the small size of this data set and of the associated<br />
cluster ensembles. For this reason, the inaccuracies of the running time prediction based<br />
on SERTRHCA are of little importance in practice.<br />
Figure C.2 presents the estimated and real running times of the parallel implementation<br />
of the RHCA in the four diversity scenarios analyzed. According to PERTRHCA, the<br />
parallel RHCA variants with the s = 2/lowest b and s = 3/highest b configurations yield<br />
the maximum computational efficiency —except for the lowest diversity scenario, where flat<br />
consensus is correctly designated to be the fastest option. If these predictions are compared<br />
to the real running times presented on the right column of figure C.2, it can be observed<br />
that, as the diversity level grows, they maintain their accuracy as regards the identification<br />
of the fastest consensus architecture for most consensus functions. However, when the<br />
prediction strategy fails to identify the fastest RHCA variant, the error made in terms of<br />
absolute running time penalization is perfectly acceptable, as the real running times of<br />
parallel RHCA are below one second for the particular case of this data set.<br />
C.2.2 Wine data set<br />
In this section, we present the estimated and real running times of the serial and parallel<br />
implementations of RHCA on the Wine data collection. As aforementioned, this experiment<br />
has been replicated across four diversity scenarios that, in the case of this data set, correspond<br />
to cluster ensembles of size l =45, 450, 855 and 1260. Thus, notice that considerably<br />
large cluster ensembles are obtained in this case, especially if compared to those of the Iris<br />
data collection. This is due to the fact that the Wine data set has a much richer dimensional<br />
diversity as regards the distinct object representations generated (approximately five times<br />
richer), which boosts the size of the cluster ensemble.<br />
Figure C.1: Estimated and real running times (RT) of the serial RHCA implementation on<br />
the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}. Panels (a), (c),<br />
(e), (g) show estimated RT and panels (b), (d), (f), (h) real RT, plotted against the mini-ensemble<br />
size b for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.<br />
Figure C.2: Estimated and real running times (RT) of the parallel RHCA implementation<br />
on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}, with the<br />
same panel layout as figure C.1.<br />
Firstly, figure C.3 depicts the results corresponding to the fully serial RHCA implementation<br />
across the four diversity scenarios (estimated running times on the left column, real<br />
running times on the right). The first remarkable fact is that SERTRHCA is a pretty accurate<br />
predictor of SRTRHCA, which can be easily verified if the pair of subfigures presented on<br />
each row of figure C.3 are compared. Again, RHCA becomes more computationally attractive<br />
as the size of the cluster ensemble increases (except for the EAC consensus function).<br />
Moreover, among the distinct RHCA variants executed in each experiment, the greatest<br />
efficiency is achieved by the ones with 2 or 3 stages.<br />
The estimated and real execution times of the parallel implementation of RHCA are<br />
depicted in figure C.4. As already observed in the previous data sets, PERTRHCA is a<br />
modestly accurate estimator of PRTRHCA, although it is a fairly good predictor of the most<br />
computationally efficient consensus architecture. Notice that, in the most diverse scenario<br />
(|dfA| = 28), the least time consuming RHCA variant is nearly two orders of magnitude<br />
faster than flat consensus —thus, it can be argued that being able to predict which RHCA<br />
configuration requires the least computation time constitutes a significantly advantageous<br />
strategy compared to the traditional one-step approach to consensus clustering.<br />
C.2.3 Glass data set<br />
This section presents the results of estimating the execution times of the fully serial and<br />
parallel implementations of RHCA in the four diversity scenarios for the Glass data set,<br />
which give rise to cluster ensembles of sizes l = 29, 290, 551 and 812, respectively.<br />
Firstly, figure C.5 depicts both the estimated and real running times of several serial<br />
RHCA variants. These results are quite comparable to those obtained in the previous<br />
data collections. That is, except for the EAC consensus function, the RHCA variants with<br />
s = 2 and s = 3 stages become the most computationally efficient as the size of the cluster<br />
ensemble increases. Moreover, in the most diverse scenario (|dfA| = 28) flat consensus is not<br />
executable if the MCLA consensus function is employed as the clustering combiner, whereas<br />
hierarchical consensus does provide a means for obtaining a consolidated clustering solution<br />
upon the same cluster ensemble using this consensus function. Furthermore, notice that the<br />
proposed methodology for estimating the running time of serial RHCA yields fairly reliable<br />
predictions of their real execution time.<br />
And secondly, the results corresponding to the parallel implementation of RHCA are<br />
presented in figure C.6. Again, it can be observed that the estimated running time of the<br />
parallel RHCA is only a moderately accurate approximation of the real execution time. However,<br />
notice that this lack of accuracy is tolerable inasmuch as i) the location of the minima of<br />
PERTRHCA mostly coincides with the minima of PRTRHCA —which means that the fastest<br />
consensus architecture is successfully predicted, and ii) the selection of a computationally<br />
suboptimal RHCA variant involves a light penalization in terms of real execution time.<br />
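The success criterion and penalty measurements used throughout these experiments can be expressed compactly. The sketch below is our own illustration (with hypothetical variable names), not code from the thesis: a prediction succeeds when the estimated and real running times are minimized by the same variant, and otherwise the penalty is the extra real time paid for executing the predicted variant instead of the truly fastest one:

```python
def evaluate_prediction(estimated, real):
    """Given dicts mapping each RHCA variant (e.g. its mini-ensemble size b)
    to its estimated and real running times, report whether the prediction
    succeeds, plus the absolute and relative real-time penalty of running
    the predicted variant instead of the truly fastest one."""
    predicted = min(estimated, key=estimated.get)  # variant with min estimate
    fastest = min(real, key=real.get)              # variant with min real time
    abs_penalty = real[predicted] - real[fastest]
    rel_penalty = abs_penalty / real[fastest]
    return predicted == fastest, abs_penalty, rel_penalty

# Toy example: the estimate picks b = 3 while b = 2 is truly fastest.
ok, abs_p, rel_p = evaluate_prediction(
    {2: 1.2, 3: 1.0, 5: 2.0},   # estimated running times (seconds)
    {2: 0.9, 3: 1.0, 5: 2.1})   # real running times (seconds)
```

In this toy case the prediction fails, but the penalty is small (0.1 s absolute, roughly 11% relative), mirroring the "light penalization" observed in these experiments.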
C.2.4 Ionosphere data set<br />
This section describes the results of the minimum complexity RHCA variant selection based<br />
on running time estimation. In the case of the Ionosphere data collection, cluster ensembles<br />
of sizes l = 97, 970, 1843 and 2716 correspond to the four diversity scenarios where this<br />
experiment is conducted.<br />
Figure C.3: Estimated and real running times (RT) of the serial RHCA implementation on<br />
the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
Figure C.4: Estimated and real running times (RT) of the parallel RHCA implementation<br />
on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
Figure C.5: Estimated and real running times (RT) of the serial RHCA implementation on<br />
the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
Figure C.6: Estimated and real running times (RT) of the parallel RHCA implementation<br />
on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
For starters, figure C.7 presents the estimated and real execution times of several variants<br />
of the fully serial implementation of RHCA. If the estimated and real running times are<br />
compared, it can be observed that it is possible to accurately predict the real execution time<br />
of serial RHCA variants, which in turn allows the precise prediction of the most<br />
computationally efficient RHCA variant —the ultimate goal of the proposed methodology.<br />
Figure C.8 depicts the estimated and real running times of the parallel RHCA implementation<br />
across the sweep of values of b for all four diversity scenarios. In comparison to<br />
what is observed in other data sets, PERTRHCA is a better estimator of PRTRHCA in this<br />
case. Moreover, as the diversity of the cluster ensemble grows, the computational savings<br />
derived from employing the fastest RHCA instead of flat consensus are noteworthy (especially<br />
for the HGPA, CSPA, ALSAD and SLSAD consensus functions). Last, note that<br />
flat consensus is not executable in three of the four diversity scenarios if consensus is<br />
obtained by means of MCLA, due to the large size of the mini-ensembles b.<br />
C.2.5 WDBC data set<br />
In this section, we present the results of estimating the execution times of the serial and<br />
parallel RHCA implementations on the WDBC data collection. According to the four<br />
diversity scenarios generated by employing |dfA| =1, 10, 19 and 28 clustering algorithms<br />
for generating the cluster ensembles, these contain l = 113, 1130, 2147 and 3164 individual<br />
partitions.<br />
The estimated and real running times corresponding to the serial implementation of<br />
RHCA are depicted in figure C.9. As observed in the remaining data collections, SERTRHCA<br />
is a fairly accurate estimator of SRTRHCA, which allows predicting the fastest consensus<br />
architecture with a high precision. Notice that, as already noted in other collections, RHCA<br />
becomes a competitive option as the size of the cluster ensemble grows, except when the EAC<br />
consensus function is employed. Notice that, for the most diverse scenarios, all consensus<br />
architectures are highly costly (in terms of execution time), so being able to predict which<br />
is the fastest can lead to important computation savings.<br />
As regards the fully parallel implementation of RHCA, the estimated and real running<br />
times corresponding to the four aforementioned diversity scenarios are presented in figure<br />
C.10. Although the estimation of the real execution time is not as accurate as in the serial<br />
case, PERTRHCA is a reasonable predictor of the fastest consensus architecture in most<br />
cases.<br />
C.2.6 Balance data set<br />
This section presents the estimated and real execution times of multiple variants of random<br />
hierarchical consensus architectures on the Balance data collection, both in its serial and<br />
parallel versions. The low cardinality of the dimensional diversity factor of this data set<br />
gives rise to relatively small cluster ensembles in the four diversity scenarios, of sizes<br />
l = 7, 70, 133 and 196 in this case.<br />
Firstly, figure C.11 depicts the estimated and real running times of the serial RHCA<br />
implementation in the four diversity scenarios. As already observed in the previous data<br />
261
C.2. Estimation of the computationally optimal RHCA<br />
SERT RHCA (sec.)<br />
SERT RHCA (sec.)<br />
SERT RHCA (sec.)<br />
SERT RHCA (sec.)<br />
10 2<br />
10 1<br />
10 0<br />
10 3<br />
10 2<br />
10 1<br />
s : number of stages<br />
6 4 4 3 3 2 2 1<br />
2 3 4 5 8 9 48 97<br />
b : mini−ensemble size<br />
(a) Estimated RT, |dfA| =1<br />
s : number of stages<br />
9 6 5 5 4 4 3 3 2 2 1<br />
2 3 4 5 6 8 9 25 26 485 970<br />
b : mini−ensemble size<br />
(c) Estimated RT, |dfA| =10<br />
s : number of stages<br />
10 7 6 5 4 4 3 3 2 2 1<br />
10 3<br />
10 2<br />
10 1<br />
2 3 4 5 6 10 11 35 36 921 1843<br />
b : mini−ensemble size<br />
(e) Estimated RT, |dfA| =19<br />
s : number of stages<br />
11 7 6 5 5 4 4 3 3 2 2 1<br />
10 3<br />
10 2<br />
10 1<br />
2 3 4 5 6 7 12 13 42 43 1358 2716<br />
b : mini−ensemble size<br />
(g) Estimated RT, |dfA| =28<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD<br />
SRT RHCA (sec.)<br />
SRT RHCA (sec.)<br />
SRT RHCA (sec.)<br />
SRT RHCA (sec.)<br />
10 2<br />
10 1<br />
10 0<br />
10 3<br />
10 2<br />
10 1<br />
s : number of stages<br />
6 4 4 3 3 2 2 1<br />
2 3 4 5 8 9 48 97<br />
b : mini−ensemble size<br />
(b) Real RT, |dfA| =1<br />
s : number of stages<br />
9 6 5 5 4 4 3 3 2 2 1<br />
2 3 4 5 6 8 9 25 26 485 970<br />
b : mini−ensemble size<br />
(d) Real RT, |dfA| =10<br />
s : number of stages<br />
10 7 6 5 4 4 3 3 2 2 1<br />
10 3<br />
10 2<br />
10 1<br />
2 3 4 5 6 10 11 35 36 921 1843<br />
b : mini−ensemble size<br />
(f) Real RT, |dfA| =19<br />
s : number of stages<br />
11 7 6 5 5 4 4 3 3 2 2 1<br />
10 3<br />
10 2<br />
10 1<br />
2 3 4 5 6 7 12 13 42 43 1358 2716<br />
b : mini−ensemble size<br />
(h) Real RT, |dfA| =28<br />
Figure C.7: Estimated and real running times (RT) of the serial RHCA implementation on<br />
the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
262<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
ALSAD<br />
KMSAD<br />
SLSAD
PERT RHCA (sec.)<br />
PERT RHCA (sec.)<br />
PERT RHCA (sec.)<br />
PERT RHCA (sec.)<br />
10 1<br />
10 0<br />
10 1<br />
10 0<br />
10 1<br />
10 0<br />
10 1<br />
10 0<br />
Appendix C. Experiments on hierarchical consensus architectures<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (PERT RHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (PRT RHCA), plotted against b: mini-ensemble size and s: number of stages, with one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.8: Estimated and real running times (RT) of the parallel RHCA implementation<br />
on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
C.2. Estimation of the computationally optimal RHCA<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (SERT RHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (SRT RHCA), plotted against b: mini-ensemble size and s: number of stages, with one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.9: Estimated and real running times (RT) of the serial RHCA implementation on<br />
the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (PERT RHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (PRT RHCA), plotted against b: mini-ensemble size and s: number of stages, with one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.10: Estimated and real running times (RT) of the parallel RHCA implementation<br />
on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
sets, our proposed method yields a fairly accurate estimation of the execution time of<br />
any serial RHCA variant which, in turn, allows the user to make a reliable decision<br />
regarding the most computationally efficient consensus architecture, whatever the diversity<br />
scenario and consensus function employed.<br />
Secondly, figure C.12 presents the corresponding magnitudes in the case that the fully<br />
parallel implementations of RHCA are employed. In this situation, the estimation of the real<br />
execution time is not as accurate as in the serial case, although the running time deviation<br />
incurred when a suboptimal RHCA architecture is selected is, from a practical viewpoint,<br />
perfectly acceptable, a fact already observed on the previous data sets.<br />
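To make the notion of running time deviation concrete, the following sketch (a hypothetical helper, not code from the thesis; the function name and the toy timings are invented for illustration) computes the extra time paid when the architecture minimizing the estimated RT is executed instead of the truly fastest one:

```python
def selection_overhead(estimated_rt, real_rt):
    """Given two dicts mapping each candidate consensus architecture to its
    estimated and real running times (in seconds), return the absolute and
    relative running-time deviation incurred by executing the architecture
    that minimizes the *estimated* RT instead of the truly fastest one."""
    predicted = min(estimated_rt, key=estimated_rt.get)  # architecture we would run
    optimal = min(real_rt, key=real_rt.get)              # architecture we should run
    absolute = real_rt[predicted] - real_rt[optimal]
    relative = absolute / real_rt[optimal]
    return absolute, relative

# Toy timings (illustrative only): the estimate picks a slightly suboptimal variant.
est = {"flat": 12.0, "s=2": 9.5, "s=3": 10.1}
real = {"flat": 11.0, "s=2": 10.4, "s=3": 9.8}
abs_dev, rel_dev = selection_overhead(est, real)  # deviation of about 0.6 s (~6%)
```

A small relative deviation under this measure is what the text above calls a practically acceptable estimation error.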
C.2.7 MFeat data set<br />
In this section, the results of estimating the execution times of RHCA are compared to their<br />
real counterparts across four diversity scenarios in the context of the MFeat data collection.<br />
The cluster ensemble sizes corresponding to these diversity scenarios are l = 6, 60, 114 and<br />
168, respectively.<br />
Figure C.13 presents the estimated and real running times of multiple variants of the<br />
serial implementation of RHCA on this data set. Besides the notably high accuracy of<br />
the estimation, we would like to highlight that flat consensus turns out to be the most<br />
efficient consensus architecture in the four diversity scenarios for all but two of the consensus<br />
functions employed (MCLA and HGPA), a behaviour that has already been observed in<br />
other data collections with small cluster ensembles (e.g. the Iris data set).<br />
The results corresponding to the parallel implementation of RHCA are depicted in<br />
figure C.14. In this case, the use of the HGPA and MCLA consensus functions as clustering<br />
combiners also makes the RHCA variants with s = 2 and s = 3 stages computationally<br />
optimal. However, for the remaining consensus functions, flat consensus prevails as the<br />
most efficient consensus architecture in most diversity scenarios.<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (SERT RHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (SRT RHCA), plotted against b: mini-ensemble size and s: number of stages, with one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.11: Estimated and real running times (RT) of the serial RHCA implementation<br />
on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (PERT RHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (PRT RHCA), plotted against b: mini-ensemble size and s: number of stages, with one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.12: Estimated and real running times (RT) of the parallel RHCA implementation<br />
on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (SERT RHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (SRT RHCA), plotted against b: mini-ensemble size and s: number of stages, with one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.13: Estimated and real running times (RT) of the serial RHCA implementation<br />
on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (PERT RHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (PRT RHCA), plotted against b: mini-ensemble size and s: number of stages, with one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.14: Estimated and real running times (RT) of the parallel RHCA implementation<br />
on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
C.3 Estimation of the computationally optimal DHCA<br />
The methodology for selecting the most computationally efficient implementation variant<br />
of deterministic hierarchical consensus architectures presented in section 3.3 (consisting of<br />
estimating the running time of several DHCA variants, which differ in the order in which<br />
diversity factors are associated with the stages of the hierarchical consensus architecture,<br />
and selecting the one that yields the minimum running time, which is the one actually<br />
executed) has been applied to the Iris, Wine, Glass, Ionosphere, WDBC, Balance and MFeat<br />
unimodal data sets (see appendix A.2.1 for a description of these collections).<br />
In these experiments, the fully serial and parallel implementations of DHCA variants<br />
have been considered across the four experimental diversity scenarios employed in this work<br />
(see appendix A.4). The objective of this experiment is twofold: firstly, we seek to verify<br />
whether the proposed strategy succeeds in predicting the most computationally efficient<br />
DHCA variant; secondly, we intend to analyze the conditions under which deterministic<br />
hierarchical consensus architectures are computationally advantageous compared to flat<br />
consensus clustering. We have followed the experimental design outlined next.<br />
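The selection strategy just described can be sketched as follows (a minimal illustration; `estimate_rt` is a hypothetical stand-in for the running time estimation model, and the toy cost function is invented, so none of these names come from the original code):

```python
from itertools import permutations

def pick_dhca_variant(factor_cards, estimate_rt):
    """factor_cards: dict mapping a diversity factor label to its cardinality,
    e.g. {'A': 28, 'D': 2, 'R': 5}. estimate_rt: callable returning the
    estimated running time (seconds) of the DHCA variant defined by a given
    ordering of the factors. Returns the acronym of the variant predicted to
    be fastest, which is the only one that would actually be executed."""
    orderings = list(permutations(factor_cards))  # the f! candidate variants
    best = min(orderings, key=estimate_rt)        # minimum estimated running time
    return "".join(best)

# Toy estimate (illustration only): pretend that placing high-cardinality
# factors in later stages is more expensive.
cards = {"A": 28, "D": 2, "R": 5}
toy_estimate = lambda order: sum((stage + 1) * cards[f] for stage, f in enumerate(order))
variant = pick_dhca_variant(cards, toy_estimate)  # "ARD": decreasing cardinality order
```

Under this toy cost model the predicted winner orders the factors by decreasing cardinality, mirroring the trend reported in the experiments below.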
– What do we want to measure?<br />
i) The time complexity of deterministic hierarchical consensus architectures.<br />
ii) The ability of the proposed methodology to predict the computationally optimal<br />
DHCA variant, in both the fully serial and parallel implementations.<br />
iii) The predictive power of the proposed methodology based on running time estimation<br />
versus the computational optimality criterion based on designing the DHCA<br />
according to a decreasing diversity factor cardinality order, in both the fully<br />
serial and parallel implementations.<br />
– How do we measure it?<br />
i) The time complexity of the implemented serial and parallel DHCA variants is<br />
measured in terms of the CPU time required for their execution —serial running<br />
time (SRTDHCA) and parallel running time (PRTDHCA).<br />
ii) The estimated running times of the same DHCA variants (serial estimated running<br />
time, SERTDHCA, and parallel estimated running time, PERTDHCA) are<br />
computed by means of the proposed running time estimation methodology, which<br />
is based on the measured running time of c = 1 consensus clustering process. A<br />
prediction regarding the computationally optimal DHCA variant is deemed successful<br />
if both the real and estimated running times are minimized by the same<br />
DHCA variant, and the percentage of experiments in which prediction succeeds<br />
is given as a measure of its performance. In order to measure the impact of<br />
incorrect predictions, we also measure the execution time differences (in both<br />
absolute and relative terms) between the truly and the allegedly fastest DHCA<br />
variants whenever prediction fails. This evaluation process is replicated for a<br />
range of values of c ∈ [1, 20], so as to measure the influence of this factor on the<br />
prediction accuracy of the proposed methodology.<br />
iii) Both approaches to predicting the computationally optimal DHCA variant are compared<br />
in terms of the percentage of experiments in which prediction is successful,<br />
and in terms of the execution time overheads (in both absolute and relative terms)<br />
between the truly and the allegedly fastest DHCA variants in the case prediction<br />
fails.<br />
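As a concrete reading of these evaluation criteria, the success percentage and the failure overheads could be computed along the following lines (an illustrative sketch; the function name and the toy trial data are hypothetical, not taken from the experiments):

```python
def prediction_stats(trials):
    """trials: list of (estimated_rt, real_rt) pairs, one per experiment,
    each a dict mapping a DHCA variant label to a running time in seconds.
    Returns the percentage of successful predictions and the list of
    relative execution-time overheads incurred whenever prediction fails."""
    successes, overheads = 0, []
    for estimated_rt, real_rt in trials:
        predicted = min(estimated_rt, key=estimated_rt.get)  # allegedly fastest
        optimal = min(real_rt, key=real_rt.get)              # truly fastest
        if predicted == optimal:
            successes += 1
        else:
            overheads.append((real_rt[predicted] - real_rt[optimal]) / real_rt[optimal])
    return 100.0 * successes / len(trials), overheads

# Two toy experiments: one correct prediction, one failure with 25% overhead.
trials = [
    ({"ADR": 1.0, "ARD": 2.0}, {"ADR": 1.1, "ARD": 2.2}),  # success
    ({"ADR": 1.0, "ARD": 2.0}, {"ADR": 2.5, "ARD": 2.0}),  # failure
]
pct, overheads = prediction_stats(trials)  # 50.0, [0.25]
```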
– How are the experiments designed? The f! DHCA variants corresponding to<br />
all the possible permutations of the f diversity factors employed in the generation<br />
of the cluster ensemble have been implemented (see table 3.6). As described in appendix<br />
A.4, cluster ensembles have been created by the mutual crossing of f = 3<br />
diversity factors: clustering algorithms (dfA), object representations (dfR) and data<br />
dimensionalities (dfD). Thus, in all our experiments, the number of DHCA variants is<br />
f! = 3! = 6, each identified by an acronym describing the order in which diversity<br />
factors are assigned to stages; for instance, ADR describes the DHCA variant<br />
defined by the ordered list O = {df1 = dfA, df2 = dfD, df3 = dfR}. For a given data<br />
collection, the cardinalities of the representational and dimensional diversity factors<br />
(|dfR| and |dfD|, respectively) are constant, while the cardinality of the algorithmic<br />
diversity factor takes four distinct values |dfA| = {1, 10, 19, 28}, giving rise to the four<br />
diversity scenarios where our proposals are analyzed. Moreover, consensus clustering<br />
has been conducted by means of the seven consensus functions for hard cluster<br />
ensembles described in appendix A.5, which allows evaluating the behaviour of our<br />
proposals under distinct consensus paradigms. In all cases, the real running times<br />
correspond to an average of 10 independent runs of the whole DHCA, in order to<br />
obtain representative real running time values. As described in appendix A.6, all the<br />
experiments have been executed under Matlab 7.0.4 on Pentium 4 3 GHz / 1 GB RAM<br />
computers.<br />
– How are results presented? Both the real and estimated running times of the<br />
serial and parallel implementations of the DHCA variants are depicted by means of<br />
curves representing their average values.<br />
– Which data sets are employed? The results of these experiments on the Zoo<br />
data collection, where the cardinalities of the representational and dimensional diversity<br />
factors are |dfR| = 5 and |dfD| = 14, respectively, were described in section 3.3.<br />
This appendix presents the results of the same experiments on the Iris, Wine, Glass,<br />
Ionosphere, WDBC, Balance and MFeat unimodal data collections.<br />
C.3.1 Iris data set<br />
In this section, we present the estimated and real running times of the serial and parallel<br />
implementations of DHCA on the Iris data collection. As aforementioned, this experiment<br />
has been replicated across four diversity scenarios that, in the case of this data set, correspond<br />
to cluster ensembles of size l = 9, 90, 171 and 252.<br />
The left and right columns of figure C.15 present the estimated and real running times<br />
of several variants of the serial implementation of the DHCA and flat consensus on this data<br />
set across the four diversity scenarios. There are a couple of issues worth noting: firstly,<br />
SERTDHCA is a pretty accurate estimator of the real execution time of the serial DHCA<br />
implementation, SRTDHCA. Secondly, notice that flat consensus is faster than the most<br />
efficient DHCA variants regardless of the consensus function and the diversity scenario.<br />
Furthermore, we would like to highlight that the computationally optimal DHCA variants<br />
are those defined by an ordered list of diversity factors in decreasing cardinality order, a<br />
trend that is notably well captured by SERTDHCA.<br />
Figure C.16 presents the estimated and real execution times of the fully parallel implementations<br />
of the DHCA variants. If compared to the serial case, the running time<br />
estimation PERTDHCA is not as accurate. Moreover, notice that, as already outlined in<br />
section 3.3, the execution times of the distinct DHCA variants tend to be quite similar.<br />
Last, DHCA variants become faster than flat consensus in the highest diversity scenario.<br />
C.3.2 Wine data set<br />
This section presents the estimated and real execution times of multiple variants of deterministic<br />
hierarchical consensus architectures on the Wine data collection, in both their serial<br />
and parallel versions. The high cardinality of the dimensional diversity factor of this data<br />
set gives rise to relatively large cluster ensembles in the four diversity scenarios, the sizes<br />
of which are equal to l = 45, 450, 855 and 1260 in this case.<br />
Figure C.17 depicts the results of this experiment when the fully serial implementation<br />
of DHCA is considered. As already observed in section C.3.1, SERTDHCA is a pretty<br />
good estimator of SRTDHCA. Moreover, as regards the computational efficiency of DHCA<br />
variants, notice that i) those defined by the decreasing cardinality ordered list of diversity<br />
factors are the fastest, and ii) they become faster than flat consensus as the size of the<br />
cluster ensemble is increased.<br />
Meanwhile, figure C.18 presents the results corresponding to the parallel DHCA implementation.<br />
Again, the time complexities of DHCA variants tend to converge, which<br />
reinforces our hypothesis regarding the irrelevance of the way diversity factors are associated<br />
with stages in parallel scenarios. Moreover, notice that DHCA variants are faster than<br />
flat consensus in all but one of the diversity scenarios, except when the EAC consensus<br />
function is employed.<br />
C.3.3 Glass data set<br />
In this section, we present the results of estimating the execution times of the serial and parallel<br />
DHCA implementations on the Glass data collection. According to the four diversity<br />
scenarios generated by employing |dfA| = 1, 10, 19 and 28 clustering algorithms for generating<br />
the cluster ensembles, these contain l = 29, 290, 551 and 812 individual partitions.<br />
First, the results corresponding to the serial implementation of deterministic<br />
hierarchical consensus architectures are presented in figure C.19. It can be observed that<br />
the estimation of the real execution time is pretty accurate, both in absolute terms (i.e.<br />
SERTDHCA is a good approximation of SRTDHCA) and as regards the determination of the<br />
computationally optimal consensus architecture. Furthermore, notice how the definition of<br />
DHCA variants by diversity factors arranged in decreasing cardinality order gives rise to<br />
the least time consuming configurations, which even become faster than flat consensus as<br />
the cluster ensemble size increases; again, consensus architectures based on the EAC consensus<br />
function constitute the only exception to this rule. Last, notice that when consensus is<br />
built using the MCLA consensus function, flat consensus is not executable in the highest<br />
diversity scenario.<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (SERT DHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (SRT DHCA) for the DHCA variants ADR, ARD, DAR, DRA, RAD and RDA and for flat consensus, with |dfD| = 2 and |dfR| = 5, and one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.15: Estimated and real running times (RT) of the serial DHCA implementation<br />
on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Plots omitted: panels (a), (c), (e) and (g) show the estimated running time (PERT DHCA, in seconds, logarithmic scale) and panels (b), (d), (f) and (h) the real running time (PRT DHCA) for the DHCA variants ADR, ARD, DAR, DRA, RAD and RDA and for flat consensus, with |dfD| = 2 and |dfR| = 5, and one curve per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD).]<br />
Figure C.16: Estimated and real running times (RT) of the parallel DHCA implementation<br />
on the Iris data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
C.3. Estimation of the computationally optimal DHCA<br />
[Figure C.17, panels (a)–(h): log-scale bar plots of the estimated (SERTDHCA, left column) and real (SRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 11 and |df R | = 5, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.17: Estimated and real running times (RT) of the serial DHCA implementation<br />
on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.18, panels (a)–(h): log-scale bar plots of the estimated (PERTDHCA, left column) and real (PRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 11 and |df R | = 5, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.18: Estimated and real running times (RT) of the parallel DHCA implementation<br />
on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
Figure C.20 depicts the estimated and real execution times of fully parallel DHCA variants and flat consensus.<br />
The patterns presented in both columns of this figure (estimated running times on the left column, real execution times on the right) reveal the same behaviour observed on the previous data sets.<br />
That is, all DHCA variants have comparable running times, which would make running time estimations unnecessary as far as the selection of the fastest DHCA variant is concerned.<br />
However, this estimation is still necessary to decide whether hierarchical consensus is faster than its flat alternative, which occurs in all diversity scenarios except the lowest diversity one.<br />
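The decision procedure described above reduces to a minimum selection over the estimated running times, with flat consensus included as one more candidate. The following is a minimal sketch; the variant names follow the appendix, but the timing values are hypothetical placeholders, not measurements from these experiments.

```python
# Hypothetical estimated running times (seconds) for one consensus
# function; illustrative values only, not thesis measurements.
estimates = {
    "ADR": 0.91, "ARD": 0.91, "DAR": 0.90, "DRA": 0.91,
    "RAD": 0.92, "RDA": 0.92, "flat": 1.35,
}

# Pick the architecture with the smallest estimated running time.
best = min(estimates, key=estimates.get)

# Decide whether hierarchical consensus beats its flat alternative.
hierarchical_wins = estimates[best] < estimates["flat"]
print(best, hierarchical_wins)
```

When, as here, the DHCA variants are nearly tied, the estimation mainly settles the hierarchical-versus-flat question rather than the choice among variants.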
C.3.4 Ionosphere data set<br />
In this section, the results of estimating the execution times of DHCA are compared to their real counterparts across four diversity scenarios on the Ionosphere data collection.<br />
The cluster ensemble sizes corresponding to these diversity scenarios are l = 97, 970, 1843 and 2716, respectively.<br />
Firstly, figure C.21 depicts the estimated and real execution times considering the fully serial implementation of deterministic hierarchical consensus architectures.<br />
In this case, SERTDHCA is a fairly good estimator of SRTDHCA, and it constitutes a good basis for predicting the least time-consuming consensus architecture.<br />
Notice that, when consensus clusterings are built by means of the MCLA consensus function, flat consensus execution becomes infeasible (given the computational resources employed in our experiments, see appendix A.6), so its hierarchical counterpart becomes a feasible alternative.<br />
Moreover, if hierarchical consensus is implemented by means of the DHCA variant defined by an ordered list of diversity factors arranged in decreasing cardinality order, notable computation time savings can be obtained.<br />
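The decreasing cardinality ordering mentioned above can be derived mechanically from the sizes of the three diversity factors. A minimal sketch, using the Ionosphere cardinalities shown in figure C.21 for the |df A | = 10 scenario:

```python
# Cardinalities of the three diversity factors (A, D, R) for the
# Ionosphere scenario with |df_A| = 10, as reported in figure C.21.
cardinalities = {"A": 10, "D": 32, "R": 4}

# Arrange the diversity factors by decreasing cardinality; the
# resulting letter sequence names the corresponding DHCA variant.
variant = "".join(sorted(cardinalities, key=cardinalities.get, reverse=True))
print(variant)  # DAR
```

For the |df A | = 1 scenario the same rule would yield DRA instead, since the random factor then outnumbers the algorithmic one.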
Secondly, as far as the fully parallel DHCA implementation is concerned (see figure C.22), the following observations can be made: i) PERTDHCA is a fairly accurate estimator of PRTDHCA, ii) there are no significant differences between the running times of the distinct DHCA variants, and iii) flat consensus is computationally costlier than its hierarchical counterpart in all but one of the diversity scenarios considered.<br />
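The gap between serial and parallel running times can be illustrated with a simplistic cost model in which the serial implementation pays for every consensus process, whereas the parallel implementation only pays for the slowest process of each stage. This is just a sketch of the intuition with hypothetical stage timings; the actual SERTDHCA and PERTDHCA estimators employed in the thesis are more detailed.

```python
# Hypothetical per-stage running times (seconds) of a three-stage DHCA;
# each inner list holds the consensus processes of one stage.
# Illustrative numbers only, not measurements.
stage_times = [
    [0.10, 0.12, 0.11, 0.09],  # stage 1: four intermediate consensus runs
    [0.20, 0.18],              # stage 2: two intermediate consensus runs
    [0.35],                    # stage 3: final consensus
]

# Serial: every consensus process runs one after another.
serial_rt = sum(sum(stage) for stage in stage_times)

# Parallel: processes within a stage run concurrently, so each stage
# costs as much as its slowest process.
parallel_rt = sum(max(stage) for stage in stage_times)
print(serial_rt, parallel_rt)
```

Under this model the parallel cost is dominated by the final consensus stage, which matches the observation that parallel DHCA variants end up with very similar running times.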
C.3.5 WDBC data set<br />
In this section, let us analyze the results corresponding to the WDBC data collection.<br />
In this case, each diversity scenario corresponds to a cluster ensemble of size l = 113, 1130, 2147 and 3164, respectively.<br />
First, the left and right columns of figure C.23 present the estimated and real running times of the variants of the serial implementation of DHCA on this data set across the four diversity scenarios.<br />
It can be observed that the proposed methodology yields a fairly accurate estimation of the real running time of the DHCA variants.<br />
This allows the user to make well-grounded decisions regarding the most efficient hierarchical consensus architectures.<br />
For this data set, flat consensus is the computationally optimal architecture except in the highest diversity scenario, save when the EAC consensus function is employed.<br />
Secondly, figure C.24 depicts the results corresponding to the parallel implementation of DHCA.<br />
The same conclusions drawn for the previous data collections also apply to the WDBC data set. That is, running times are almost independent of the<br />
[Figure C.19, panels (a)–(h): log-scale bar plots of the estimated (SERTDHCA, left column) and real (SRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 7 and |df R | = 5, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.19: Estimated and real running times (RT) of the serial DHCA implementation<br />
on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.20, panels (a)–(h): log-scale bar plots of the estimated (PERTDHCA, left column) and real (PRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 7 and |df R | = 5, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.20: Estimated and real running times (RT) of the parallel DHCA implementation<br />
on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.21, panels (a)–(h): log-scale bar plots of the estimated (SERTDHCA, left column) and real (SRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 32 and |df R | = 4, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.21: Estimated and real running times (RT) of the serial DHCA implementation<br />
on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.22, panels (a)–(h): log-scale bar plots of the estimated (PERTDHCA, left column) and real (PRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 32 and |df R | = 4, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.22: Estimated and real running times (RT) of the parallel DHCA implementation<br />
on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
DHCA variant implemented (at least, no significant differences are observed among them, in contrast to the serial implementation case), while hierarchical consensus is more efficient than flat consensus as soon as the cluster ensemble size grows.<br />
C.3.6 Balance data set<br />
This section presents the results of estimating the execution times of the fully serial and parallel implementations of DHCA in the four diversity scenarios for the Balance data set, which give rise to cluster ensembles of sizes l = 7, 70, 133 and 196, respectively.<br />
Firstly, figure C.25 presents the estimated and real execution times corresponding to the serial implementation context.<br />
It is quite apparent that SERTDHCA provides the user with a good estimation of the real running time of consensus architectures (SRTDHCA) and, as such, it allows determining the computationally optimal consensus architecture with a high degree of accuracy.<br />
In this case, given the small size of the cluster ensemble in any of the four diversity scenarios, flat consensus is faster than most serial DHCA variants.<br />
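The claim that SERTDHCA tracks SRTDHCA closely can be quantified with a relative-error check, which also verifies that the estimated and real measurements agree on the winning architecture. The numbers below are hypothetical placeholders, not the Balance measurements.

```python
# Hypothetical estimated vs. real running times (seconds) of three
# architectures in one scenario; illustrative values only.
estimated = {"ADR": 0.80, "DAR": 0.60, "flat": 0.50}
real      = {"ADR": 0.85, "DAR": 0.63, "flat": 0.48}

# Relative estimation error per architecture: small errors mean the
# estimator can be trusted to pick the computationally optimal choice.
errors = {v: abs(estimated[v] - real[v]) / real[v] for v in real}

# Check that the estimator and the real measurements name the same winner.
same_winner = min(estimated, key=estimated.get) == min(real, key=real.get)
print(errors, same_winner)
```

Even with errors of several percent per architecture, the ranking (and hence the selected architecture) is typically preserved, which is all the selection procedure needs.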
Secondly, the results corresponding to the parallel implementation of DHCA are depicted in figure C.26.<br />
Once more, all DHCA variants have very similar running times.<br />
As in the serial case, flat consensus is faster than any of its hierarchical counterparts, except when the HGPA and MCLA consensus functions are employed as clustering combiners.<br />
C.3.7 MFeat data set<br />
This section describes the results of the minimum-complexity DHCA variant selection based on running time estimation.<br />
In the case of the MFeat data collection, cluster ensembles of sizes l = 6, 60, 114 and 168 correspond to the four diversity scenarios where this experiment is conducted.<br />
Figure C.27 depicts the estimated and real execution times of the fully serial implementation of DHCA.<br />
In this case, SERTDHCA is a fairly accurate estimator of SRTDHCA and, as such, it is a good predictor of the most computationally efficient consensus architecture.<br />
In most cases, however, due to the relatively small sizes of the cluster ensembles in this data set, flat consensus is faster than any of the DHCA variants, except when the HGPA and MCLA consensus functions are employed in high diversity scenarios.<br />
Last, figure C.28 presents the results corresponding to the parallel DHCA implementation.<br />
Again, the running times of the DHCA variants are very similar.<br />
However, notice that DHCA variants are slower than flat consensus in most of the diversity scenarios, except when the HGPA and MCLA consensus functions are used for combining the clusterings.<br />
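The per-consensus-function comparison that recurs throughout this appendix can likewise be automated: for each consensus function, compare the flat estimate against the best hierarchical estimate and keep the functions for which the hierarchy wins. The values below are hypothetical, chosen only to mirror the qualitative pattern reported for HGPA and MCLA.

```python
# Hypothetical estimated running times (seconds) per consensus function:
# flat consensus vs. the fastest DHCA variant. Illustrative values only.
flat_rt = {"CSPA": 1.2, "EAC": 0.9, "HGPA": 8.0, "MCLA": 15.0}
best_dhca_rt = {"CSPA": 1.5, "EAC": 1.1, "HGPA": 2.5, "MCLA": 3.0}

# Consensus functions for which the hierarchical architecture is the
# computationally cheaper option.
hierarchy_wins = sorted(f for f in flat_rt if best_dhca_rt[f] < flat_rt[f])
print(hierarchy_wins)
```

With these placeholder values the check singles out HGPA and MCLA, mirroring the exceptions noted in the text above.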
[Figure C.23, panels (a)–(h): log-scale bar plots of the estimated (SERTDHCA, left column) and real (SRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 28 and |df R | = 5, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.23: Estimated and real running times (RT) of the serial DHCA implementation<br />
on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.24, panels (a)–(h): log-scale bar plots of the estimated (PERTDHCA, left column) and real (PRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 28 and |df R | = 5, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.24: Estimated and real running times (RT) of the parallel DHCA implementation<br />
on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.25, panels (a)–(h): log-scale bar plots of the estimated (SERTDHCA, left column) and real (SRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 2 and |df R | = 4, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.25: Estimated and real running times (RT) of the serial DHCA implementation<br />
on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.26, panels (a)–(h): log-scale bar plots of the estimated (PERTDHCA, left column) and real (PRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 2 and |df R | = 4, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.26: Estimated and real running times (RT) of the parallel DHCA implementation<br />
on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.27, panels (a)–(h): log-scale bar plots of the estimated (SERTDHCA, left column) and real (SRTDHCA, right column) running times in seconds for |df A | = {1, 10, 19, 28}, with |df D | = 1 and |df R | = 6, across the six DHCA variants and flat consensus for each consensus function.]<br />
Figure C.27: Estimated and real running times (RT) of the serial DHCA implementation<br />
on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
288
Appendix C. Experiments on hierarchical consensus architectures<br />
Figure C.28: Estimated and real running times (RT) of the parallel DHCA implementation<br />
on the Mfeat data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
C.4 Computationally optimal RHCA, DHCA and flat consensus<br />
comparison<br />
In this section, we compare those random and deterministic hierarchical consensus architectures<br />
deemed to be computationally optimal against classic flat consensus in terms of two<br />
factors: i) their execution times, and ii) the quality of the consensus clustering solutions<br />
they yield. This twofold comparison is intended to determine under which conditions any of<br />
the aforementioned consensus architectures outperforms the others, not only in terms of<br />
their computational efficiency, but also as far as their performance for the construction of<br />
robust clustering systems is concerned.<br />
This comparison has been conducted across the following eleven unimodal data collections:<br />
Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation, BBC<br />
and PenDigits. For each data set, ten independent experiments have been conducted on four<br />
diversity scenarios. Each diversity scenario is characterized by the use of cluster ensembles<br />
generated by applying a certain number of clustering algorithms, |dfA| = {1, 10, 19, 28}.<br />
In each experiment, the CPU time required for executing either the whole hierarchical consensus<br />
architecture or flat consensus is measured, and the quality of the consensus clustering<br />
solution is evaluated in terms of its normalized mutual information φ (NMI) with respect to<br />
the ground truth.<br />
From a visualization perspective, both the execution times and the φ (NMI) values are<br />
presented by means of their respective boxplots —each of which summarizes the ten independent<br />
experiments conducted on each diversity scenario for each data collection. When<br />
comparing boxplots, notice that non-overlapping box notches indicate that the medians<br />
of the compared running times differ at the 5% significance level, which allows a quick<br />
inference of the statistical significance of the results.<br />
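This notch criterion can be reproduced numerically. Assuming MATLAB-style notches, which extend roughly 1.57·IQR/√n around the median (McGill's rule), two medians differ at approximately the 5% level when the notch intervals do not overlap; the running-time samples below are invented for illustration.

```python
# Hedged sketch: the 5%-level median comparison behind boxplot notches.
# Assumes MATLAB-style notches of half-width 1.57 * IQR / sqrt(n) (McGill's
# rule); the two running-time samples are made up for illustration.
import statistics

def notch(sample):
    """(lower, upper) notch bounds around the sample median."""
    s = sorted(sample)
    n = len(s)
    median = statistics.median(s)
    q1 = statistics.median(s[: n // 2])          # lower half
    q3 = statistics.median(s[(n + 1) // 2:])     # upper half
    half = 1.57 * (q3 - q1) / n ** 0.5
    return median - half, median + half

def notches_overlap(a, b):
    lo_a, hi_a = notch(a)
    lo_b, hi_b = notch(b)
    return lo_a <= hi_b and lo_b <= hi_a

flat_rt = [0.30, 0.32, 0.31, 0.29, 0.33, 0.30, 0.31, 0.32, 0.30, 0.31]
dhca_rt = [0.45, 0.47, 0.46, 0.44, 0.48, 0.45, 0.46, 0.47, 0.45, 0.46]
print(notches_overlap(flat_rt, dhca_rt))  # clearly separated medians -> False
```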
C.4.1 Iris data set<br />
In this section, the running time and consensus quality comparison experiments are conducted<br />
on the Iris data collection. The four diversity scenarios correspond to cluster ensembles<br />
of sizes l = 9, 90, 171 and 252.<br />
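The listed sizes grow linearly with the number of clustering algorithms employed. A minimal sketch of this scaling, under the assumption that the single-algorithm ensemble size (9 for Iris, 45 for Wine) acts as a base factor multiplied by |dfA|:

```python
# Hedged sketch: cluster ensemble size per diversity scenario, assuming the
# sizes scale linearly with |df_A|. l_base is the ensemble size in the
# single-algorithm scenario and is assumed to stem from the remaining
# diversity factors.
def ensemble_sizes(l_base, df_a_values):
    """Cluster ensemble size for each diversity scenario."""
    return [l_base * df_a for df_a in df_a_values]

print(ensemble_sizes(9, [1, 10, 19, 28]))   # Iris -> [9, 90, 171, 252]
print(ensemble_sizes(45, [1, 10, 19, 28]))  # Wine -> [45, 450, 855, 1260]
```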
Running time comparison<br />
Figure C.29 presents the running times of the allegedly computationally optimal RHCA<br />
and DHCA variants (considering their serial implementation) and flat consensus. Due to<br />
the relatively small cluster ensembles on this data set, it can be observed that flat consensus<br />
is the fastest option in most cases, regardless of the diversity scenario and the consensus<br />
function employed. The only exceptions occur when consensus clusterings are built using the<br />
MCLA consensus function in the two highest diversity scenarios (in these cases, DHCA turns<br />
out to be the most efficient consensus architecture), as this is the only consensus function<br />
whose time complexity grows quadratically with the size of the cluster ensemble l. Among the<br />
hierarchical consensus architectures, DHCA tends to outperform RHCA in computational<br />
terms, except when the ALSAD consensus function is employed (although the differences<br />
between RHCA and DHCA are in general minor).<br />
The execution times corresponding to flat consensus and the entirely parallel implementation<br />
of hierarchical architectures are presented in figure C.30. It can be observed that<br />
flat consensus gradually moves from being the optimal consensus architecture in the lowest<br />
diversity scenario to being the slowest in the highest diversity one. Compared to the serial<br />
implementation, the running time differences between RHCA and DHCA are less significant<br />
in this case —except when the ALSAD consensus function is the base of the consensus<br />
architecture.<br />
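The gap between the serial and parallel implementations can be illustrated with a simple cost model: a hierarchical architecture executes several intermediate consensus processes per stage, so a serial run pays for all of them, while an idealized fully parallel run pays only for the slowest one in each stage. The stage times below are invented for illustration, and real overheads (scheduling, memory contention) are ignored:

```python
# Hedged sketch of a serial vs. fully parallel HCA runtime model.
# stages: list of stages, each a list of per-run consensus times (sec.).
# Serial cost sums every run; idealized parallel cost pays only the slowest
# run of each stage. All figures below are invented.
def serial_runtime(stages):
    return sum(sum(stage) for stage in stages)

def parallel_runtime(stages):
    return sum(max(stage) for stage in stages)

# e.g. a two-stage architecture: 4 intermediate runs, then 1 final run
stages = [[0.5, 0.75, 0.25, 0.5], [0.75]]
print(serial_runtime(stages))    # 2.75
print(parallel_runtime(stages))  # 1.5
```

Under this model, parallelization helps most when stages contain many runs of similar cost, which is consistent with the shrinking gap between parallel RHCA and DHCA observed above.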
Consensus quality comparison<br />
As regards the quality of the consensus clustering solutions yielded by each consensus architecture<br />
as a function of the consensus function employed and the diversity scenario, the<br />
results obtained are presented in figure C.31. A few observations can be made: firstly, if<br />
the results obtained by the seven consensus functions are compared, it is noticeable that<br />
fairly different performances are obtained: for instance, HGPA gives rise to considerably<br />
poorer quality consensus clusterings than the remaining consensus functions, as none of its<br />
boxes exceeds the φ (NMI) = 0.6 level. Moreover, these relative performances are maintained<br />
across the different diversity scenarios. Secondly, the variability of the quality of the consensus<br />
clustering solutions can be evaluated by observing figure C.31(d), as the depicted boxplots<br />
correspond to ten independent runs of the consensus clustering processes on the same cluster<br />
ensemble. Notice that the major differences are observed in the HGPA, MCLA and KMSAD<br />
consensus functions, which, as mentioned above, contain some random parameters that make<br />
their performance vary (largely, as in HGPA, or slightly, as in MCLA) from run to run.<br />
Thirdly, the relative comparison of the quality of the consensus solutions yielded by the two<br />
HCAs and flat consensus is local to the consensus function employed. Whereas DHCA seems<br />
to give rise to better consensus clustering solutions when the CSPA, ALSAD and KMSAD<br />
consensus functions are employed, it tends to be outperformed by RHCA and flat consensus<br />
when clusterings are combined by EAC or HGPA. Last, notice that the highest level of<br />
similarity between the top-quality cluster ensemble components and the consensus clustering<br />
solutions corresponds to DHCA based on the CSPA consensus function.<br />
C.4.2 Wine data set<br />
This section presents the comparison between flat consensus and the computationally optimal<br />
consensus architectures in terms of CPU execution time and normalized mutual information<br />
between the ground truth and the consensus clustering solution yielded by each<br />
one of them. On this data collection, the cluster ensemble sizes corresponding to the four<br />
diversity scenarios are l = 45, 450, 855 and 1260.<br />
Running time comparison<br />
As regards the execution times of the fully serial implementations of the estimated optimal<br />
RHCA and DHCA variants and flat consensus, a twofold evolution can be<br />
observed —see figure C.32. Firstly, those consensus architectures using the CSPA, HGPA,<br />
MCLA, ALSAD, KMSAD and SLSAD consensus functions follow the same evolution pattern<br />
observed, for instance, in the Iris data collection (i.e. the larger the cluster ensemble,<br />
the more efficient hierarchical architectures become compared to flat consensus). In contrast,<br />
the consensus architectures based on the EAC consensus function present a fairly different<br />
Figure C.29: Running times of the computationally optimal serial RHCA, DHCA and flat<br />
consensus architectures on the Iris data collection in the four diversity scenarios |dfA| =<br />
{1, 10, 19, 28}.<br />
Figure C.30: Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Iris data collection in the four diversity scenarios |dfA| =<br />
{1, 10, 19, 28}.<br />
Figure C.31: φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the Iris data collection in the four diversity<br />
scenarios |dfA| = {1, 10, 19, 28}.<br />
behaviour, as flat consensus is the fastest option regardless of the diversity scenario. That is,<br />
as already observed in sections C.2.5 and C.3.5, the time complexity behaviour of consensus<br />
architectures is local to the consensus function employed for combining the clusterings.<br />
As regards the computational complexity of the parallel implementation of the HCAs (see<br />
figure C.33), it can be observed that they become much faster than flat consensus as soon<br />
as the size of the cluster ensemble is increased. As in the previous data collection, the<br />
running times of parallel RHCA and DHCA are quite similar.<br />
Consensus quality comparison<br />
As far as the quality of the consensus clustering solutions obtained by the distinct consensus<br />
architectures is concerned, figure C.34 depicts the corresponding φ (NMI) boxplots. Again,<br />
performances are highly local to the consensus function employed: in this case, those consensus<br />
architectures based on the EAC, HGPA and SLSAD consensus functions give rise to the<br />
lowest quality consensus clusterings. If the three consensus architectures are compared, it can<br />
be observed that RHCA and flat consensus tend to perform quite similarly, while worse<br />
clustering solutions are generally obtained from DHCA. Notice that the highest robustness<br />
to clustering indeterminacies (i.e. consensus clustering solutions of comparable quality to<br />
the cluster ensemble components of highest φ (NMI) ) is obtained from the RHCA and flat<br />
consensus architectures based on MCLA, ALSAD and KMSAD.<br />
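The robustness notion used here can be made concrete as the gap between the best φ (NMI) found among the ensemble components and the φ (NMI) of the consensus solution: the smaller the gap, the more robust the combination. A minimal sketch with invented φ (NMI) values:

```python
# Hedged sketch of the robustness criterion: a consensus architecture is
# deemed robust to clustering indeterminacies when its consensus solution's
# phi(NMI) is comparable to the best phi(NMI) among the cluster ensemble
# components. All phi values below are invented for illustration.
def robustness_gap(consensus_nmi, ensemble_nmis):
    """Best component NMI minus consensus NMI (<= 0 means consensus wins)."""
    return max(ensemble_nmis) - consensus_nmi

ensemble_nmis = [0.41, 0.58, 0.73, 0.62, 0.35]  # hypothetical components
print(round(robustness_gap(0.71, ensemble_nmis), 2))  # small gap -> robust
```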
C.4.3 Glass data set<br />
In this section, we present the running times and quality evaluation (by means of φ (NMI)<br />
values) of the consensus clustering processes implemented by means of the serial and parallel<br />
RHCA and DHCA implementations and flat consensus on the Glass data collection. The cluster<br />
ensemble sizes corresponding to the four diversity scenarios in which our experiments are<br />
conducted are l = 29, 290, 551 and 812.<br />
Running time comparison<br />
Figure C.35 presents the boxplot charts that represent the running times of the three implemented<br />
consensus architectures, considering the entirely serial implementation of the<br />
hierarchical ones. As in the previous data collections, flat consensus is the fastest option<br />
in the lowest diversity scenario, whereas hierarchical consensus architectures become more<br />
computationally efficient as soon as the size of the cluster ensemble increases —for all but<br />
the EAC consensus function—, which again highlights the interest of structuring consensus<br />
processes in a hierarchical manner as a means for i) reducing their time complexity when<br />
they are to be conducted on large cluster ensembles, and ii) obtaining a consensus clustering<br />
solution when the execution of flat consensus becomes unfeasible (e.g. when the MCLA<br />
consensus function is employed in the highest diversity scenario).<br />
The computational complexity of the consensus architectures presents a very similar<br />
behaviour when the parallel implementation of the hierarchical versions is studied (see figure<br />
C.36). In this case, though, the differences between the running times of flat and hierarchical<br />
consensus architectures are even larger.<br />
295
C.4. Computationally optimal RHCA, DHCA and flat consensus comparison<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
1.2<br />
1<br />
0.8<br />
0.6<br />
0.4<br />
0.2<br />
5<br />
4<br />
3<br />
2<br />
1<br />
8<br />
7<br />
6<br />
5<br />
4<br />
8<br />
7.5<br />
7<br />
6.5<br />
CSPA<br />
RHCA<br />
DHCA<br />
flat<br />
CSPA<br />
RHCA<br />
DHCA<br />
flat<br />
CSPA<br />
RHCA<br />
DHCA<br />
flat<br />
CSPA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
0.8<br />
0.6<br />
0.4<br />
0.2<br />
4.5<br />
4<br />
3.5<br />
3<br />
2.5<br />
2<br />
1.5<br />
1<br />
0.5<br />
2.6<br />
2.4<br />
2.2<br />
2<br />
1.8<br />
1.6<br />
1.4<br />
1.2<br />
1<br />
CPU time (sec.)<br />
0<br />
3<br />
2.5<br />
2<br />
1.5<br />
EAC<br />
RHCA<br />
DHCA<br />
flat<br />
EAC<br />
RHCA<br />
DHCA<br />
flat<br />
EAC<br />
RHCA<br />
DHCA<br />
flat<br />
EAC<br />
CPU time (sec.)<br />
0.7<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
0<br />
HGPA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
0.8<br />
0.7<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
MCLA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
0.8<br />
0.6<br />
0.4<br />
0.2<br />
0<br />
ALSAD<br />
RHCA<br />
DHCA<br />
flat<br />
(a) Serial implementation running time, |dfA| =1<br />
CPU time (sec.)<br />
2.4<br />
2.2<br />
2<br />
1.8<br />
1.6<br />
1.4<br />
1.2<br />
HGPA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
6<br />
5.5<br />
5<br />
4.5<br />
4<br />
3.5<br />
3<br />
2.5<br />
2<br />
MCLA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
5<br />
4<br />
3<br />
2<br />
1<br />
ALSAD<br />
RHCA<br />
DHCA<br />
flat<br />
(b) Serial implementation running time, |dfA| =10<br />
CPU time (sec.)<br />
4.5<br />
4<br />
3.5<br />
3<br />
2.5<br />
HGPA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
20<br />
15<br />
10<br />
5<br />
MCLA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
3.5<br />
3<br />
2.5<br />
2<br />
1.5<br />
1<br />
ALSAD<br />
RHCA<br />
DHCA<br />
flat<br />
(c) Serial implementation running time, |dfA| =19<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
9<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
HGPA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
45<br />
40<br />
35<br />
30<br />
25<br />
20<br />
15<br />
10<br />
5<br />
MCLA<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
ALSAD<br />
RHCA<br />
DHCA<br />
(d) Serial implementation running time, |dfA| =28<br />
flat<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
0<br />
2<br />
1.8<br />
1.6<br />
1.4<br />
1.2<br />
1<br />
0.8<br />
4<br />
3.5<br />
3<br />
2.5<br />
2<br />
1.5<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
0<br />
4<br />
3.5<br />
3<br />
2.5<br />
2<br />
1.5<br />
1<br />
0.5<br />
4<br />
3.5<br />
3<br />
2.5<br />
2<br />
1.5<br />
1<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat<br />
Figure C.32: Running times of the computationally optimal serial RHCA, DHCA and<br />
flat consensus architectures on the Wine data collection in the four diversity scenarios<br />
|dfA| = {1, 10, 19, 28}.<br />
296<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat
[Figure C.33: boxplots of the running times of the computationally optimal parallel RHCA,<br />
DHCA and flat consensus architectures on the Wine data collection; panels (a)–(d) for<br />
|dfA| = {1, 10, 19, 28}.]<br />
2<br />
1<br />
0<br />
ALSAD<br />
RHCA<br />
DHCA<br />
flat<br />
(d) Parallel implementation running time, |dfA| =28<br />
CPU time (sec.)<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
KMSAD<br />
RHCA<br />
DHCA<br />
flat<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
CPU time (sec.)<br />
0.5<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
1<br />
0.8<br />
0.6<br />
0.4<br />
0.2<br />
0<br />
4<br />
3.5<br />
3<br />
2.5<br />
2<br />
1.5<br />
1<br />
0.5<br />
0<br />
CPU time (sec.)<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat<br />
SLSAD<br />
RHCA<br />
DHCA<br />
flat<br />
Figure C.33: Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the Wine data collection in the four diversity scenarios<br />
|dfA| = {1, 10, 19, 28}.<br />
297
C.4. Computationally optimal RHCA, DHCA and flat consensus comparison<br />
[Figure C.34 boxplot panels omitted: φ (NMI) of the consensus solutions, panels (a)-(d) for |dfA| = 1, 10, 19, 28.]<br />
Figure C.34: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Wine data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.35 boxplot panels omitted: serial implementation running times (CPU time, sec.), panels (a)-(d) for |dfA| = 1, 10, 19, 28.]<br />
Figure C.35: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.36 boxplot panels omitted: parallel implementation running times (CPU time, sec.), panels (a)-(d) for |dfA| = 1, 10, 19, 28.]<br />
Figure C.36: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
Consensus quality comparison<br />
As regards the quality of the consensus clustering process, figure C.37 presents the boxplots<br />
depicting the φ (NMI) values of the components of the cluster ensemble E and of the consensus<br />
clustering solutions output by the RHCA, DHCA and flat consensus architectures. On<br />
this data collection, the φ (NMI) differences between the clustering solutions output by the<br />
three consensus architectures are, in general terms, small, except when the EAC consensus<br />
function is employed, in which case flat consensus is clearly superior. Moreover,<br />
we can observe that ALSAD and SLSAD stand out among the consensus functions<br />
as the top performers.<br />
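The φ (NMI) figures reported throughout this comparison measure the agreement between two partitions of the same objects. A minimal sketch of the measure, assuming the geometric-mean normalization of Strehl and Ghosh (the labelings below are illustrative, not taken from the experiments):

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings of the same objects."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal cluster-size counts.
    mi = sum((nab / n) * log(n * nab / (ca[a] * cb[b]))
             for (a, b), nab in joint.items())
    def entropy(counts):
        return -sum((m / n) * log(m / n) for m in counts.values())
    ha, hb = entropy(ca), entropy(cb)
    if ha == 0.0 or hb == 0.0:
        return 0.0  # a single-cluster partition carries no information
    return mi / (ha * hb) ** 0.5  # geometric-mean normalization

# NMI is invariant to label renaming: identical partitions score 1.
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # → 1.0
```

Because the measure is symmetric and permutation-invariant, it can compare a consensus partition either against the ground truth or against another ensemble component without aligning cluster labels first.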
C.4.4 Ionosphere data set<br />
This section presents the execution times of the computationally optimal RHCA, DHCA<br />
and flat consensus architectures, together with the φ (NMI) values of the consensus clustering<br />
solutions they yield on the Ionosphere data collection. The presented results cover the<br />
experiments conducted across the four diversity scenarios, whose corresponding cluster<br />
ensemble sizes are l = 97, 970, 1843 and 2716, respectively.<br />
Running time comparison<br />
The execution times of flat consensus and of the serially implemented RHCA and DHCA are<br />
depicted in the boxplot charts presented in figure C.38. The relative behaviour of the<br />
three consensus architectures is very similar to that observed on the previous data<br />
collections. That is, i) flat consensus becomes slower than its hierarchical counterparts as<br />
the size of the cluster ensemble grows, except when the EAC consensus function is<br />
employed, and ii) RHCA tends to be faster than DHCA when the EAC, ALSAD or SLSAD<br />
consensus functions are used, whereas the opposite behaviour is observed with the<br />
hypergraph-based consensus functions.<br />
The running times obtained when the hierarchical consensus architectures are<br />
implemented in an entirely parallel manner are presented in figure C.39. As expected,<br />
RHCA and DHCA become considerably more efficient than flat consensus. Notice, however,<br />
that notable differences between the running times of the two hierarchical architectures can<br />
be found under certain consensus functions, such as EAC, ALSAD or SLSAD.<br />
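The distinction between the serial and parallel running times reported here can be summarized as follows: a serial implementation executes every intermediate consensus process one after another, so the total cost is the sum of all stage times, whereas a fully parallel implementation runs the processes within each stage concurrently, so each stage only costs as much as its slowest process. A minimal sketch of this accounting (the stage structure and times are hypothetical, not measured values from these experiments):

```python
def serial_time(stages):
    """Total CPU time when every consensus process runs one after another."""
    return sum(t for stage in stages for t in stage)

def parallel_time(stages):
    """Total time when the processes within each stage run concurrently:
    each stage costs as much as its slowest process."""
    return sum(max(stage) for stage in stages)

# Hypothetical two-stage hierarchical architecture: four intermediate
# consensus processes in the first stage, one final consensus in the second.
stages = [[0.4, 0.5, 0.3, 0.45], [0.6]]
print(round(serial_time(stages), 2))    # → 2.25
print(round(parallel_time(stages), 2))  # → 1.1
```

The gap between the two totals grows with the number of processes per stage, which is why the hierarchical architectures benefit far more from parallelization than flat consensus, whose single consensus process cannot be split across stages.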
Consensus quality comparison<br />
As far as the quality of the consensus clustering process is concerned, the φ (NMI) boxplots<br />
corresponding to the consensus clustering solutions obtained by the RHCA, DHCA and flat<br />
consensus architectures across the four diversity scenarios on the Ionosphere data collection<br />
are presented in figure C.40. The relative optimality of the different consensus architectures<br />
varies notably depending on the consensus function employed.<br />
For instance, DHCA tends to yield the highest quality results when consensus is conducted<br />
by means of SLSAD, flat consensus gives the best consensus clustering solutions derived by<br />
HGPA, and when MCLA is chosen as the clustering combiner, RHCA attains higher<br />
φ (NMI) values than the remaining consensus architectures. In contrast, only marginal<br />
differences are observed between the qualities of the consensus clustering solutions derived<br />
by the three consensus architectures when the remaining consensus functions are employed.<br />
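One simple way to read such per-architecture comparisons off the boxplots is to compare the median φ (NMI) attained by each architecture under a given consensus function. A small illustrative sketch (the values are hypothetical, not results from these experiments):

```python
from statistics import median

# Hypothetical phi(NMI) samples per architecture for one consensus function.
results = {
    "RHCA": [0.31, 0.35, 0.33],
    "DHCA": [0.30, 0.34, 0.32],
    "flat": [0.28, 0.29, 0.31],
}

# Pick the architecture whose runs have the highest median phi(NMI).
best = max(results, key=lambda arch: median(results[arch]))
print(best)  # → RHCA
```

Comparing medians rather than means keeps the ranking robust to the occasional outlier run visible in the boxplot whiskers.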
[Figure C.37 boxplot panels omitted: φ (NMI) of the consensus solutions, panels (a)-(d) for |dfA| = 1, 10, 19, 28.]<br />
Figure C.37: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Glass data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.38 boxplot panels omitted: serial implementation running times (CPU time, sec.), panels (a)-(d) for |dfA| = 1, 10, 19, 28.]<br />
Figure C.38: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.39 boxplot panels omitted: parallel implementation running times (CPU time, sec.), panels (a)-(d) for |dfA| = 1, 10, 19, 28.]<br />
Figure C.39: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
Appendix C. Experiments on hierarchical consensus architectures

[Figure C.40 boxplots: φ (NMI) of the consensus solutions yielded by the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions under the RHCA, DHCA and flat architectures; panels (a)–(d) for |dfA| = 1, 10, 19 and 28.]
Figure C.40: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Ionosphere data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
C.4. Computationally optimal RHCA, DHCA and flat consensus comparison<br />
C.4.5 WDBC data set<br />
In this section, we present the running times and the consensus clustering solution qualities of the hierarchical and flat consensus architectures on the WDBC data collection. In this case, the four diversity scenarios correspond to cluster ensembles of sizes l = 113, 1130, 2147 and 3164, respectively.
Running time comparison<br />
Figure C.41 presents the running time of flat consensus and of the serial implementations of the fastest random and deterministic hierarchical consensus architectures across the four diversity scenarios. In this case, the relationship between the execution times of these consensus architectures differs somewhat from what was observed on the previous data collections. In particular, flat consensus is a more competitive alternative, being faster than, or almost as fast as, RHCA in all diversity scenarios when consensus is based on the CSPA, EAC, ALSAD and SLSAD clustering combiners. In contrast, DHCA is notably slower than RHCA in most cases. This is due to the large cardinality of the dimensional diversity factor (|dfD| = 28 on this data set), which makes the DHCA stage where consensus is conducted on this diversity factor much more computationally costly than the intermediate consensus processes of RHCA.
Figure C.42 presents the execution times of the computationally optimal parallel RHCA and DHCA variants and of flat consensus. Two trends are observed in these boxplots: firstly, the hierarchical architectures are faster than flat consensus, especially in the diversity scenarios that employ large cluster ensembles; and secondly, the parallel DHCA variants are generally slower than their RHCA counterparts, for the reason stated above.
Consensus quality comparison<br />
Figure C.43 presents the φ (NMI) of the consensus clustering solutions yielded by the RHCA, DHCA and flat consensus architectures across the four diversity scenarios on the WDBC data collection. Firstly, notice that the EAC and SLSAD consensus functions give rise to very low quality consensus clusterings regardless of the consensus architecture employed. In contrast, flat consensus yields reasonably good consensus clusterings when derived by means of HGPA, whereas hierarchical consensus architectures based on this consensus function output poor consensus clustering solutions. Meanwhile, the remaining clustering combiners yield fairly good consensus clusterings, with slightly better results obtained when consensus is derived by means of the RHCA and flat consensus architectures.
C.4.6 Balance data set<br />
This section presents the execution times of the estimated computationally optimal serial and parallel implementations of RHCA and DHCA and of flat consensus in the four diversity scenarios for the Balance data set, which give rise to cluster ensembles of sizes l = 7, 70, 133 and 196, respectively. Moreover, the quality of the consensus clustering solutions output by each consensus architecture is evaluated in terms of normalized mutual information (φ (NMI)) with respect to the ground truth.
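The φ (NMI) scores used throughout these comparisons measure the agreement between a consensus labeling and the ground truth. As a minimal sketch (assuming the standard entropy-based definition of normalized mutual information with the geometric-mean normalization; the function name is illustrative, not taken from the thesis code), the measure can be computed as:

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings:
    NMI = I(A;B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Mutual information I(A;B) from the joint and marginal counts
    mi = sum((nij / n) * log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in cab.items())
    # Entropies H(A) and H(B)
    ha = -sum((c / n) * log(c / n) for c in ca.values())
    hb = -sum((c / n) * log(c / n) for c in cb.values())
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

A labeling that matches the ground truth up to a permutation of cluster identifiers scores 1, while a labeling independent of the ground truth scores 0, which is why the boxplots below are bounded to the [0, 1] range.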
[Figure C.41 boxplots: CPU time (sec.) of the serial RHCA, DHCA and flat consensus architectures for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions; panels (a)–(d), serial implementation running times for |dfA| = 1, 10, 19 and 28.]
Figure C.41: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.42 boxplots: CPU time (sec.) of the parallel RHCA, DHCA and flat consensus architectures for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions; panels (a)–(d), parallel implementation running times for |dfA| = 1, 10, 19 and 28.]
Figure C.42: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.43 boxplots: φ (NMI) of the consensus solutions yielded by the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions under the RHCA, DHCA and flat architectures; panels (a)–(d) for |dfA| = 1, 10, 19 and 28.]
Figure C.43: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the WDBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
Running time comparison<br />
The characteristics of this data set (in particular, the low cardinalities of its associated diversity factors) make flat consensus the fastest consensus architecture when compared to the serial implementations of RHCA and DHCA, regardless of the diversity scenario (see figure C.44). The only exception is the MCLA consensus function, whose time complexity scales quadratically with the size of the cluster ensemble, which penalizes one-step consensus processes with respect to their hierarchical counterparts.
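The penalty that quadratically scaling consensus functions impose on one-step consensus can be illustrated with a toy cost model. Assuming, purely for illustration, that combining m clusterings costs on the order of m² units (the constants and the exact exponent are hypothetical, not measurements from these experiments), splitting an ensemble of l clusterings into groups of b and recursing on the intermediate solutions is much cheaper than a single flat run:

```python
def flat_cost(l):
    # One consensus process over the whole ensemble: ~ l^2 units
    return l ** 2

def hierarchical_cost(l, b):
    """Total serial cost of combining l clusterings in groups of
    (at most) b per stage, then recursing on the intermediate
    solutions until a single consensus clustering remains."""
    total = 0
    while l > 1:
        groups = -(-l // b)              # ceil(l / b) consensus processes
        total += groups * min(l, b) ** 2  # each costs at most b^2 units
        l = groups                        # one intermediate clustering per group
    return total
```

For instance, with the largest Balance ensemble (l = 196) and groups of b = 7, the flat cost is 196² = 38416 units, while the hierarchical total is orders of magnitude smaller, which matches the MCLA behaviour observed in figure C.44.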
A milder version of this same behaviour is observed in the running time analysis of the parallel implementations of the HCAs, presented in figure C.45. In this case, though, RHCA and DHCA are as fast as or faster than flat consensus when the HGPA, MCLA and KMSAD consensus functions are employed.
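The gap between the serial and parallel variants follows from how stage costs aggregate: a serial implementation adds up the running times of every consensus process, whereas a fully parallel one only pays the slowest process of each stage, with stages executed in sequence. A hedged sketch of the two aggregations (stage_times is a hypothetical list of per-process running times for each consensus stage, not data from these experiments):

```python
def serial_time(stage_times):
    # All consensus processes run one after another.
    return sum(sum(stage) for stage in stage_times)

def parallel_time(stage_times):
    # Processes within a stage run concurrently; stages are sequential,
    # so each stage contributes only its slowest process.
    return sum(max(stage) for stage in stage_times)

# Example: two stages of intermediate consensus plus a final one.
stages = [[3.0, 2.5, 3.2], [1.5, 1.4], [0.8]]
# serial_time(stages)   -> 12.4
# parallel_time(stages) -> 5.5
```

This is why the parallel boxplots below show hierarchical architectures closing the gap with, or overtaking, flat consensus even on a small data set such as Balance.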
Consensus quality comparison<br />
As regards the quality of the consensus clustering solutions output by the three consensus architectures, figure C.46 shows the results obtained on the Balance data collection. Notice that the EAC and HGPA consensus functions yield, in general, the lowest quality results. For the remaining consensus functions, consensus solutions of fairly similar quality are obtained by means of the three architectures, except for the ALSAD and SLSAD consensus functions, where notable differences are observed between the HCAs and flat consensus.
[Figure C.44 boxplots: CPU time (sec.) of the serial RHCA, DHCA and flat consensus architectures for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions; panels (a)–(d), serial implementation running times for |dfA| = 1, 10, 19 and 28.]
Figure C.44: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.45 boxplots: CPU time (sec.) of the parallel RHCA, DHCA and flat consensus architectures for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions; panels (a)–(d), parallel implementation running times for |dfA| = 1, 10, 19 and 28.]
Figure C.45: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.46 boxplots: φ (NMI) of the consensus solutions yielded by the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions under the RHCA, DHCA and flat architectures; panels (a)–(d) for |dfA| = 1, 10, 19 and 28.]
Figure C.46: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Balance data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
C.4. Computationally optimal RHCA, DHCA and flat consensus comparison<br />
C.4.7 MFeat data set<br />
This section describes the performance of the serial and parallel variants of the minimum<br />
complexity RHCA and DHCA on the MFeat data collection, comparing them to classic flat<br />
consensus in terms of execution time and of the quality of the consensus clustering solutions<br />
they yield. The four diversity scenarios in which these experiments are conducted correspond<br />
to cluster ensembles of sizes l = 6, 60, 114 and 168.<br />
Running time comparison<br />
Figure C.47 presents the running times of flat consensus and of the serial implementations<br />
of RHCA and DHCA. Notice that, except when the HGPA and MCLA consensus functions<br />
are employed, flat consensus is faster than any of its hierarchical counterparts regardless of<br />
the size of the cluster ensemble (i.e. it is faster in all the diversity scenarios).<br />
When the parallel implementation of the HCA is considered (see figure C.48), the observed<br />
behaviour is very similar to the one just reported: flat consensus is the most computationally<br />
efficient consensus architecture, except when consensus functions based on hypergraph<br />
partitioning are employed. This is because, on the MFeat data collection, the low cardinality<br />
of the diversity factors gives rise to relatively small cluster ensembles, which makes flat<br />
consensus a competitive alternative to hierarchical consensus architectures.<br />
Consensus quality comparison<br />
Figure C.49 presents the quality of the consensus clustering solutions yielded by the flat<br />
and hierarchical consensus architectures, in the form of φ (NMI) boxplot diagrams. An<br />
inter-consensus function analysis reveals that EAC, HGPA and SLSAD yield, in general<br />
terms, the lowest quality results, while CSPA, ALSAD and KMSAD stand out as the best<br />
performing consensus functions: they yield consensus clustering solutions whose quality is<br />
comparable to that of the cluster ensemble components that best reveal the true cluster<br />
structure of the data set (i.e. those attaining the highest φ (NMI) values). Meanwhile, an<br />
intra-consensus function analysis shows that, whereas the three consensus architectures<br />
yield consensus solutions of fairly similar quality when based on CSPA and ALSAD, larger<br />
differences between RHCA, DHCA and flat consensus are observed in other cases, such as<br />
when consensus clustering is conducted by means of the EAC, HGPA, MCLA or SLSAD<br />
consensus functions.<br />
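The φ (NMI) values compared throughout this appendix measure the normalized mutual information between a clustering and the ground truth labels. As a reference, the following is a minimal sketch of one common NMI variant, normalizing the mutual information by the geometric mean of the two cluster entropies; the function name and the label-vector representation are ours, not the thesis code:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings given as
    label vectors, normalized by the geometric mean of the entropies."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal cluster counts.
    mi = sum((n_ab / n) * math.log((n * n_ab) / (ca[a] * cb[b]))
             for (a, b), n_ab in joint.items())
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:
        # Convention for degenerate single-cluster partitions.
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)
```

Under this normalization, φ (NMI) = 1 for a partition identical (up to label permutation) to the ground truth, and φ (NMI) = 0 for statistically independent partitions.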
C.4.8 miniNG data set<br />
In this section, we present the running times and the quality evaluation (by means of<br />
φ (NMI) values) of the consensus clustering processes implemented by the serial and parallel<br />
RHCA and DHCA variants and by flat consensus on the miniNG data collection. The<br />
cluster ensemble sizes corresponding to the four diversity scenarios in which our experiments<br />
are conducted are l = 73, 730, 1387 and 2044.<br />
Appendix C. Experiments on hierarchical consensus architectures<br />
[Boxplot panels (a) to (d): serial implementation running times (CPU time, sec.) of the RHCA, DHCA and flat architectures for |dfA| = 1, 10, 19 and 28, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]<br />
Figure C.47: Running times of the computationally optimal serial RHCA, DHCA and<br />
flat consensus architectures on the MFeat data collection in the four diversity scenarios<br />
|dfA| = {1, 10, 19, 28}.<br />
[Boxplot panels (a) to (d): parallel implementation running times (CPU time, sec.) of the RHCA, DHCA and flat architectures for |dfA| = 1, 10, 19 and 28, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]<br />
Figure C.48: Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the MFeat data collection in the four diversity scenarios<br />
|dfA| = {1, 10, 19, 28}.<br />
[Boxplot panels (a) to (d): φ (NMI) of the consensus solutions for |dfA| = 1, 10, 19 and 28; each panel compares the E, RHCA, DHCA and flat variants for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]<br />
Figure C.49: φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the MFeat data collection in the four<br />
diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
Running time comparison<br />
The miniNG data collection is one of those cases where the cardinality of the diversity<br />
factors employed for generating the cluster ensemble, together with the number of objects<br />
the collection contains, makes flat consensus non-executable (for all but the EAC consensus<br />
function) in the scenarios where the cluster ensemble is relatively large. In this situation,<br />
hierarchical consensus architectures become a means of making consensus clustering feasible.<br />
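The way a hierarchy makes large consensus problems feasible can be sketched as follows: split the oversized cluster ensemble into small groups, derive an intermediate consensus for each group, and recursively combine the intermediate solutions. The sketch below uses simple co-association voting with single-link merging as a stand-in consensus function; the group size and all function names are our illustrative choices, not the thesis implementation:

```python
def coassoc_consensus(ensemble, k):
    """Combine a small group of label vectors into one clustering:
    count how often each object pair co-clusters, then single-link
    merge the most co-clustered pairs until k clusters remain."""
    n = len(ensemble[0])
    sim = [[0] * n for _ in range(n)]
    for labels in ensemble:
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(((sim[i][j], i, j) for i in range(n)
                    for j in range(i + 1, n)), reverse=True)
    clusters = n
    for _, i, j in pairs:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
    roots = [find(i) for i in range(n)]
    order = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [order[r] for r in roots]


def hierarchical_consensus(ensemble, k, group_size=3):
    """Recursive hierarchical consensus: an intermediate consensus per
    small group of clusterings, then a consensus of those solutions."""
    if len(ensemble) <= group_size:
        return coassoc_consensus(ensemble, k)
    mids = [coassoc_consensus(ensemble[i:i + group_size], k)
            for i in range(0, len(ensemble), group_size)]
    return hierarchical_consensus(mids, k, group_size)
```

Each stage only ever runs consensus on `group_size` clusterings at a time, which is what keeps the memory and time cost of every individual consensus process bounded even when the full ensemble is very large.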
As regards the serial implementations of RHCA and DHCA (figure C.50), the former<br />
tends to be faster than the latter, except when the HGPA and MCLA consensus functions<br />
are employed. The same relative behaviour between the two architectures is observed in<br />
the parallel implementation case, presented in figure C.51.<br />
Consensus quality comparison<br />
The quality of the consensus clustering solutions output by the flat and hierarchical<br />
consensus architectures can be analyzed by means of the φ (NMI) boxplot charts depicted<br />
in figure C.52. As regards the performance of the distinct consensus functions, notice that<br />
the CSPA, ALSAD and KMSAD based consensus solutions are the best ones in terms of<br />
quality. Finally, the φ (NMI) values of the consensus clusterings output by the two<br />
hierarchical consensus architectures (the only ones able to operate across all the diversity<br />
scenarios) are very similar in most cases.<br />
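Each box in these φ (NMI) charts summarizes the distribution of quality values obtained over the experiment runs. As a reminder of what such a box encodes, here is a minimal sketch of the five-number summary behind a boxplot (the helper name is ours, and the sample values in the usage are illustrative, not experimental results):

```python
import statistics

def five_number_summary(values):
    """The five numbers a boxplot displays: minimum, lower quartile,
    median, upper quartile and maximum of the sample."""
    v = sorted(values)
    q1, q2, q3 = statistics.quantiles(v, n=4)
    return (v[0], q1, q2, q3, v[-1])

# Illustrative φ (NMI) sample from several runs of one architecture.
summary = five_number_summary([0.2, 0.4, 0.5, 0.6, 0.8])
```

Comparing two architectures then amounts to comparing these summaries: overlapping boxes indicate similar quality distributions, while a clearly higher median and box signal a better-performing architecture.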
C.4.9 Segmentation data set<br />
This section presents the comparison between flat consensus and the computationally<br />
optimal hierarchical consensus architectures in terms of CPU execution time and of the<br />
normalized mutual information between the ground truth and the consensus clustering<br />
solution yielded by each of them. On the Segmentation data collection, the cluster ensemble<br />
sizes corresponding to the four diversity scenarios are l = 52, 520, 988 and 1456.<br />
Running time comparison<br />
Figure C.53 presents the execution times of the flat consensus architecture and of the<br />
estimated computationally optimal serial random and deterministic hierarchical consensus<br />
architectures. In this case, flat consensus is faster than RHCA and DHCA regardless of<br />
the cluster ensemble size (in our range of observation), except when the HGPA and MCLA<br />
consensus functions are employed; in fact, MCLA-based flat consensus is infeasible in the<br />
two largest diversity scenarios. Moreover, the relative speed comparison between RHCA<br />
and DHCA yields different results depending on the consensus function employed: RHCA<br />
is faster than DHCA if consensus is based on CSPA, EAC, ALSAD or SLSAD, while the<br />
opposite behaviour is observed when the HGPA, MCLA and KMSAD consensus functions<br />
are used.<br />
Very similar results are obtained when the running times of the fully parallel implementations<br />
of RHCA and DHCA are analyzed, as figure C.54 reveals. The main difference<br />
with respect to what has just been reported is the logical speed-up of the HCA, which makes<br />
them faster than flat consensus in the highest diversity scenario.<br />
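The speed-up of the parallel implementation follows from the fact that the intermediate consensus processes within one stage of the hierarchy can run concurrently, so a stage costs only as much as its slowest branch, while successive stages still execute sequentially. A small sketch of this estimate (the stage and branch timings are illustrative values, not measurements from these experiments):

```python
def serial_time(stages):
    """Serial HCA running time: every intermediate consensus in every
    stage is executed one after another, so all branch times add up."""
    return sum(sum(stage) for stage in stages)

def parallel_time(stages):
    """Parallel HCA running time: branches inside a stage run
    concurrently, so each stage costs only its slowest branch."""
    return sum(max(stage) for stage in stages)

# Illustrative two-stage architecture: four first-stage consensus
# processes followed by a single final merge (times in seconds).
stages = [[4.0, 5.0, 3.5, 4.5], [6.0]]
```

Under this model the parallel time can never exceed the serial time, and the gap widens with the number of branches per stage, which is why the advantage shows up most clearly in the highest diversity scenario.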
[Boxplot panels (a) to (d): serial implementation running times (CPU time, sec.) of the RHCA, DHCA and flat architectures for |dfA| = 1, 10, 19 and 28, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]<br />
Figure C.50: Running times of the computationally optimal serial RHCA, DHCA and<br />
flat consensus architectures on the miniNG data collection in the four diversity scenarios<br />
|dfA| = {1, 10, 19, 28}.<br />
[Boxplot panels (a) to (d): parallel implementation running times (CPU time, sec.) of the RHCA, DHCA and flat architectures for |dfA| = 1, 10, 19 and 28, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD).]<br />
Figure C.51: Running times of the computationally optimal parallel RHCA, DHCA and<br />
flat consensus architectures on the miniNG data collection in the four diversity scenarios<br />
|dfA| = {1, 10, 19, 28}.<br />
[Boxplot panels (a) to (d): φ (NMI) of the consensus solutions for |dfA| = 1, 10, 19 and 28; each panel compares the E, RHCA, DHCA and flat variants for the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]<br />
Figure C.52: φ (NMI) of the consensus solutions yielded by the computationally optimal<br />
RHCA, DHCA and flat consensus architectures on the miniNG data collection in the four<br />
diversity scenarios |dfA| = {1, 10, 19, 28}.<br />
[Figure C.53 appears here: boxplots of CPU time (sec.) for the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), in panels (a) to (d) for the serial implementation with |dfA| = 1, 10, 19 and 28.]

Figure C.53: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
Appendix C. Experiments on hierarchical consensus architectures

[Figure C.54 appears here: boxplots of CPU time (sec.) for the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), in panels (a) to (d) for the parallel implementation with |dfA| = 1, 10, 19 and 28.]

Figure C.54: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
Consensus quality comparison<br />
The φ (NMI) values of the consensus clustering solutions yielded by the flat, random hierarchical and deterministic hierarchical consensus architectures follow a pattern quite similar to the one observed on the previous data collections, at least as far as the relative performance of the distinct consensus functions is concerned: the lowest quality consensus solutions are obtained by means of the EAC and HGPA consensus functions, whereas ALSAD tends to yield the best results.
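The φ (NMI) measure used throughout these comparisons is the normalized mutual information between a clustering and the ground truth. As a minimal illustrative sketch (not the code used in the thesis), it can be computed directly from two label vectors:

```python
import math
from collections import Counter

def phi_nmi(labels_a, labels_b):
    """phi(NMI) = I(A;B) / sqrt(H(A) * H(B)) between two labelings."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Mutual information between the two partitions
    mi = sum((nab / n) * math.log(n * nab / (ca[a] * cb[b]))
             for (a, b), nab in cab.items())
    # Shannon entropy of each partition
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:  # a single-cluster partition carries no information
        return 0.0
    return mi / math.sqrt(ha * hb)

# Identical partitions (up to label renaming) score 1, independent ones 0
print(round(phi_nmi([0, 0, 1, 1], [1, 1, 0, 0]), 3))  # 1.0
print(round(phi_nmi([0, 0, 1, 1], [0, 1, 0, 1]), 3))  # 0.0
```

Since φ (NMI) is invariant to cluster label permutations, it can compare a consensus clustering against the ground truth without any label alignment step.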
C.4.10 BBC data set<br />
In this section, the running time and consensus quality comparison experiments are conducted on the BBC data collection. The four diversity scenarios correspond to cluster ensembles of sizes l = 57, 570, 1083 and 1596.
Running time comparison<br />
As far as the running times of the entirely serial implementation of RHCA and DHCA and of flat consensus are concerned, the boxplots depicted in figure C.56 show that flat consensus constitutes the most computationally competitive consensus architecture in most cases; the only exceptions occur when the HGPA and MCLA consensus functions are employed.
When the parallel implementations of the hierarchical consensus architectures are considered, these become more computationally competitive, reversing the situation observed in the serial case for the CSPA and KMSAD consensus functions (see figure C.57).
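The serial versus parallel contrast can be made concrete: if each stage of a hierarchical architecture consists of several independent consensus jobs, the serial running time is the sum of all job times, whereas an idealized parallel running time (one processor per job, as many as needed) is the sum of each stage's slowest job. A small sketch with hypothetical job times, not taken from the reported experiments:

```python
def serial_time(stages):
    """Total CPU time when every consensus job runs sequentially."""
    return sum(sum(stage) for stage in stages)

def parallel_time(stages):
    """Idealized wall-clock time with one processor per job:
    each stage finishes with its slowest job, stages run in sequence."""
    return sum(max(stage) for stage in stages)

# Hypothetical per-job times (seconds) for a three-stage hierarchy
stages = [[4.0, 3.5, 4.2, 3.8], [2.1, 2.4], [1.3]]
print(round(serial_time(stages), 1))    # 21.3
print(round(parallel_time(stages), 1))  # 7.9
```

The gap between the two quantities grows with the number of jobs per stage, which is why parallelization benefits hierarchical architectures far more than flat consensus, whose single job cannot be split this way.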
Consensus quality comparison<br />
As regards the quality of the consensus clustering solutions yielded by the three consensus architectures (measured as the φ (NMI) with respect to the ground truth that defines the true group structure of the BBC data collection), great differences can be observed in the performance of the distinct consensus functions (see figure C.58): while the MCLA, ALSAD and KMSAD clustering combiners tend to yield consensus clusterings of quality comparable to the best components of the cluster ensemble, the clustering solutions output by consensus architectures based on EAC, HGPA and SLSAD are notably poorer.
C.4.11 PenDigits data set<br />
This section presents the execution times of the computationally optimal RHCA, DHCA and flat consensus architectures and the φ (NMI) values of the consensus clustering solutions they yield on the PenDigits data collection. The experiments are conducted across the four diversity scenarios, whose corresponding cluster ensemble sizes are l = 57, 570, 1083 and 1596, respectively. Due to the number of objects contained in this data set, only the HGPA and MCLA consensus functions are executable on it, as they are the only ones whose space complexity scales linearly with this
[Figure C.55 appears here: boxplots of φ (NMI) for the ensemble components (E) and the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), in panels (a) to (d) for |dfA| = 1, 10, 19 and 28.]

Figure C.55: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the Segmentation data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.56 appears here: boxplots of CPU time (sec.) for the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), in panels (a) to (d) for the serial implementation with |dfA| = 1, 10, 19 and 28.]

Figure C.56: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.57 appears here: boxplots of CPU time (sec.) for the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), in panels (a) to (d) for the parallel implementation with |dfA| = 1, 10, 19 and 28.]

Figure C.57: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.58 appears here: boxplots of φ (NMI) for the ensemble components (E) and the RHCA, DHCA and flat consensus architectures, one panel per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD), in panels (a) to (d) for |dfA| = 1, 10, 19 and 28.]

Figure C.58: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the BBC data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
attribute. Moreover, flat consensus is infeasible on this data set even when based on the aforementioned consensus functions.
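The scalability argument can be illustrated with a back-of-the-envelope memory count (a hedged sketch; the partition count `l` and cluster count `k` per partition below are illustrative assumptions, not figures from the thesis): pairwise-similarity combiners such as CSPA or EAC materialize an n × n co-association matrix, whereas the hypergraph-based HGPA and MCLA operate on an incidence matrix with one column per cluster, whose size grows linearly with the number of objects n.

```python
def coassociation_entries(n):
    """Entries of the n x n co-association matrix used by
    pairwise combiners such as CSPA or EAC: quadratic in n."""
    return n * n

def hypergraph_entries(n, l, k):
    """Entries of the hypergraph incidence matrix used by HGPA or
    MCLA: one column per cluster of each of the l partitions,
    i.e. roughly n * l * k entries, linear in n."""
    return n * l * k

# For 10,000 objects and 57 partitions of ~10 clusters each (illustrative)
print(coassociation_entries(10_000))       # 100000000
print(hypergraph_entries(10_000, 57, 10))  # 5700000
```

For object counts in the tens of thousands, the quadratic co-association structure dwarfs the hypergraph representation, which is why only HGPA and MCLA remain executable on PenDigits.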
Running time comparison<br />
Figure C.59 shows the running times corresponding to the serial implementation of RHCA and DHCA. It can be observed that, as the cluster ensemble size increases, RHCA becomes faster than DHCA. This is because the ensemble growth is induced by an increase in the cardinality of the algorithmic diversity factor |dfA|, which directly raises the time complexity of a single DHCA stage, whereas in a RHCA the impact of this growth is spread across its successive stages.
Approximately the same behaviour is observed when the parallel implementations of both hierarchical consensus architectures are analyzed (see figure C.60).
Consensus quality comparison<br />
Figure C.61 presents the φ (NMI) values corresponding to the two aforementioned consensus architectures. Two issues are worth noting in this case. Firstly, HGPA yields very poor consensus clustering solutions on this data collection (a behaviour also reported on several of the previous data sets). Secondly, there are notable differences between the quality of the consensus clusterings output by RHCA and DHCA, as the latter tends to yield far better consensus clustering solutions than the former, a trend that becomes more evident in high diversity scenarios.
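A quality gap like the one between RHCA and DHCA can be summarized numerically by, for instance, comparing the medians of the per-run φ (NMI) values. A minimal sketch; the values below are hypothetical placeholders, not results read off figure C.61:

```python
from statistics import median

# Hypothetical phi(NMI) values over independent runs of each architecture
rhca_runs = [0.38, 0.41, 0.35, 0.44, 0.40]
dhca_runs = [0.61, 0.58, 0.65, 0.60, 0.63]

# Median gap between the two architectures
gap = median(dhca_runs) - median(rhca_runs)
print(round(gap, 2))  # 0.21
```

Medians are preferred over means here because boxplot-style run distributions can be skewed by occasional degenerate consensus runs.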
[Figure C.59 appears here: boxplots of CPU time (sec.) for the RHCA and DHCA architectures under the HGPA and MCLA consensus functions, in panels (a) to (d) for the serial implementation with |dfA| = 1, 10, 19 and 28.]

Figure C.59: Running times of the computationally optimal serial RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.60 appears here: boxplots of CPU time (sec.) for the RHCA and DHCA architectures under the HGPA and MCLA consensus functions, in panels (a) to (d) for the parallel implementation with |dfA| = 1, 10, 19 and 28.]

Figure C.60: Running times of the computationally optimal parallel RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
[Figure C.61 appears here: boxplots of φ (NMI) for the ensemble components (E) and the RHCA and DHCA architectures under the HGPA and MCLA consensus functions, in panels (a) to (d) for |dfA| = 1, 10, 19 and 28.]

Figure C.61: φ (NMI) of the consensus solutions yielded by the computationally optimal RHCA, DHCA and flat consensus architectures on the PenDigits data collection in the four diversity scenarios |dfA| = {1, 10, 19, 28}.
332<br />
MCLA<br />
MCLA<br />
MCLA<br />
MCLA<br />
DHCA<br />
DHCA<br />
DHCA<br />
DHCA<br />
flat<br />
flat<br />
flat<br />
flat
Appendix D<br />
Experiments on self-refining<br />
consensus architectures<br />
This appendix presents several experiments regarding self-refining flat and hierarchical consensus<br />
architectures described in chapter 4. In appendix D.1, the proposal for automatically<br />
refining a previously derived consensus clustering solution –what is called consensus-based<br />
self-refining– is experimentally evaluated, whereas appendix D.2 presents the experiments<br />
regarding the creation of a refined consensus clustering solution upon the selection of a high<br />
quality cluster ensemble component, i.e. selection-based self-refining.<br />
In both cases, the experiments are conducted on eleven unimodal data collections,<br />
namely: Iris, Wine, Glass, Ionosphere, WDBC, Balance, MFeat, miniNG, Segmentation,<br />
BBC and PenDigits. The results of the self-refining experiments are displayed by means<br />
of boxplot charts showing the normalized mutual information (φ (NMI) ) with respect to the<br />
ground truth of each data set compiled across 100 independent experiment runs of i) the<br />
cluster ensemble E each experiment is conducted upon, ii) the clustering solution employed<br />
as the reference of the self-refining procedure, and iii) the self-refined consensus clustering<br />
solutions λc_pi obtained upon the select cluster ensembles E_pi created by the selection of<br />
a percentage pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90} of the whole ensemble E. As in all<br />
the experimental sections of this thesis, consensus processes have been replicated using the<br />
set of seven consensus functions described in appendix A.5, namely: CSPA, EAC, HGPA,<br />
MCLA, ALSAD, KMSAD and SLSAD.<br />
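As a rough sketch of the evaluation pipeline described above (not the thesis implementation): φ (NMI) is assumed here to be the Strehl–Ghosh style normalization of mutual information by the geometric mean of the two entropies, and `select_ensemble` is a hypothetical helper that builds a select ensemble E_pi by keeping the pi% of ensemble components closest to a reference clustering.<br />

```python
from collections import Counter
from math import log, sqrt

def nmi(a, b):
    """phi(NMI) between two labelings: mutual information normalized by the
    geometric mean of the two entropies (Strehl & Ghosh style)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))

    def entropy(counts):
        return -sum(m / n * log(m / n) for m in counts.values())

    mi = sum(m / n * log(n * m / (ca[x] * cb[y])) for (x, y), m in cab.items())
    denom = sqrt(entropy(ca) * entropy(cb))
    return mi / denom if denom > 0 else 0.0

def select_ensemble(E, reference, p):
    """Build the select ensemble E_p: keep the p% of components of E that are
    most similar (in NMI terms) to the reference clustering."""
    keep = max(1, round(len(E) * p / 100))
    return sorted(E, key=lambda lam: nmi(lam, reference), reverse=True)[:keep]
```

A refined solution λc_pi would then be obtained by running the chosen consensus function on `select_ensemble(E, λc, pi)`.<br />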
D.1 Experiments on consensus-based self-refining<br />
In this section, the results of applying the consensus-based self-refining procedure described<br />
in section 4.1 on the aforementioned eleven data sets are presented. The self-refining process<br />
is intended to improve the quality of a consensus clustering solution λc output by a flat, a<br />
random (RHCA) and a deterministic hierarchical consensus architecture (DHCA).<br />
For this reason, the results are displayed as a matrix of boxplot charts with three columns<br />
(the leftmost one for flat consensus, the central one presenting the results of RHCA, and<br />
DHCA on the right), and as many rows as consensus functions are employed on each<br />
particular data collection (seven in all cases except for the PenDigits data set, where only<br />
two consensus functions are applicable due to memory limitations given our computational<br />
333
D.1. Experiments on consensus-based self-refining<br />
resources —see appendix A.6).<br />
Moreover, the clustering solution deemed optimal by the supraconsensus function<br />
described in section 4.1 in a majority of experiment runs is highlighted by a vertical<br />
green dashed line, so that its performance can be qualitatively assessed at a glance.<br />
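As a rough illustration of how such a selection might operate, the sketch below implements an ANMI-style criterion in the spirit of Strehl and Ghosh: among the candidate partitions (λc and the self-refined λc_pi), it returns the one sharing, on average, the most information with the ensemble components. This is an assumption for illustration only, not necessarily the supraconsensus function of section 4.1; `_nmi` and `supraconsensus` are hypothetical names.<br />

```python
from collections import Counter
from math import log, sqrt

def _nmi(a, b):
    # Geometric-mean-normalized mutual information between two labelings.
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda c: -sum(m / n * log(m / n) for m in c.values())
    mi = sum(m / n * log(n * m / (ca[x] * cb[y])) for (x, y), m in cab.items())
    d = sqrt(h(ca) * h(cb))
    return mi / d if d > 0 else 0.0

def supraconsensus(candidates, ensemble):
    """Return the candidate partition that shares, on average, the most
    information (NMI) with the components of the cluster ensemble."""
    return max(candidates, key=lambda lam: sum(_nmi(lam, e) for e in ensemble))
```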
D.1.1 Iris data set<br />
Figure D.1 presents the results of the self-refining consensus procedure applied on the Iris<br />
data set. As regards the results obtained using the CSPA consensus function, we can see<br />
that self-refining introduces no variations with respect to the quality of the non-refined<br />
consensus clustering solution λc in the case of the flat and RHCA consensus architectures.<br />
In contrast, slight but noteworthy φ (NMI) gains are obtained in the case of DHCA with the<br />
refined clustering solutions λc_20 and λc_40. Unfortunately, in this case the supraconsensus<br />
function fails to select one of the highest quality clustering solutions. A very similar situation is<br />
observed on the self-refining experiments based on the EAC, ALSAD and SLSAD consensus<br />
functions.<br />
Examples in which self-refining and supraconsensus perform successfully are those<br />
involving both hierarchical consensus architectures and the HGPA consensus function. In<br />
these cases, self-refined consensus clustering solutions of higher quality than λc<br />
are obtained and selected by the supraconsensus function. In contrast, in the experiments<br />
based on MCLA, little (if any) φ (NMI) gain is obtained via refining, and supraconsensus<br />
tends to select a clustering solution of slightly lower quality than λc.<br />
D.1.2 Wine data set<br />
The results corresponding to the application of the consensus-based self-refining procedure<br />
on the Wine data set are depicted in figure D.2. As regards the refining of the consensus<br />
solution output by the flat consensus architecture (leftmost column of figure D.2), we can<br />
see that quality improvements with respect to the initial consensus clustering λc are obtained<br />
in all cases, except when the HGPA consensus function is employed. In some cases,<br />
supraconsensus manages to select the highest quality clustering, such as when consensus<br />
is based on MCLA and SLSAD, whereas suboptimal solutions are selected in other cases<br />
—see for instance the CSPA, EAC, HGPA and ALSAD boxplots.<br />
We would like to highlight the especially good results obtained on the self-refining of<br />
the consensus output by the DHCA architecture, see the rightmost column of figure D.2.<br />
Regardless of the consensus function employed, the self-refining procedure gives rise to<br />
higher quality clustering solutions, and the supraconsensus function selects the top quality<br />
one in most cases.<br />
D.1.3 Glass data set<br />
Figure D.3 presents the results of the consensus self-refining process when applied on the<br />
Glass data collection. In this case, self-refining yields little φ (NMI) gain for most<br />
consensus functions. The clearest exception is EAC, where notable quality increases are<br />
observed, especially when self-refining is applied on the consensus solutions output by the<br />
hierarchical consensus architectures (RHCA and DHCA).<br />
334
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
CSPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
EAC<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
HGPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
MCLA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
ALSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
KMSAD<br />
SLSAD<br />
(a) flat<br />
Appendix D. Experiments on self-refining consensus architectures<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 20<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
CSPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
EAC<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
HGPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
MCLA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
ALSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
KMSAD<br />
SLSAD<br />
(b) RHCA<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 15<br />
λ c 20<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
CSPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
EAC<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
HGPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
MCLA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
ALSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
KMSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
SLSAD<br />
(c) DHCA<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 15<br />
λ c 20<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
Figure D.1: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc p i output by the flat, RHCA, and<br />
DHCA consensus architectures on the Iris data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by the<br />
supraconsensus function in each experiment.<br />
335
[Figure D.2: three panels, (a) flat, (b) RHCA and (c) DHCA, each containing one φ (NMI) boxplot row per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD) over E, λc and λc_2 through λc_90.]<br />
Figure D.2: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the Wine data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by the<br />
supraconsensus function in each experiment.<br />
As regards the performance of the supraconsensus function, the generally small quality<br />
differences between the non-refined and self-refined consensus clustering solutions make<br />
its lack of selection precision relatively unimportant in most cases. Again, the<br />
only exceptions to this behaviour occur in the refining of the consensus clustering solutions<br />
output by RHCA and DHCA when EAC is employed. In these cases, the supraconsensus<br />
function erroneously selects the non-refined consensus clustering solution λc as the final<br />
clustering solution, although higher quality self-refined partitions are available.<br />
D.1.4 Ionosphere data set<br />
The application of the consensus-based self-refining procedure on the Ionosphere data collection<br />
yields the φ (NMI) boxplots presented in figure D.4. On this collection, self-refining<br />
introduces quality gains in a few cases, such as the refining of the consensus clustering<br />
output by i) RHCA and DHCA using HGPA, or ii) the flat consensus architecture and RHCA<br />
based on the SLSAD consensus function. In the remaining cases, the self-refining procedure<br />
brings about little (if any) quality gains.<br />
As regards the selection accuracy of the supraconsensus function, it consistently selects<br />
a good quality clustering solution, if not the highest quality one.<br />
D.1.5 WDBC data set<br />
Figure D.5 presents the φ (NMI) boxplots corresponding to the application of the<br />
consensus-based self-refining procedure on the WDBC data set.<br />
Fairly distinct results are obtained depending on the consensus function employed. For<br />
instance, when consensus is based on EAC and SLSAD, self-refining brings about no improvement.<br />
In contrast, remarkable quality gains are obtained on the hierarchically derived consensus<br />
clusterings that employ HGPA. In the remaining cases, more modest φ (NMI) increases are<br />
observed.<br />
Last, notice that the supraconsensus function performs fairly accurately, as it selects<br />
good quality clustering solutions in most cases, although it rarely chooses the top quality<br />
one.<br />
D.1.6 Balance data set<br />
The application of the self-refining consensus procedure on the Balance data set yields the<br />
results summarized by the boxplots presented in figure D.6. It can be observed that, for<br />
most consensus functions and consensus architectures, the self-refined consensus clusterings<br />
show higher φ (NMI) values than those of their non-refined counterpart, λc. In some cases,<br />
these quality gains are notable, as, for instance, when consensus self-refining is based on<br />
the SLSAD consensus function on the flat consensus architecture —bottom row of figure<br />
D.6. In other cases, as in MCLA-based self-refining, the achieved φ (NMI) increases are more<br />
modest.<br />
As regards the ability of the supraconsensus function to select the top quality (non-refined<br />
or refined) clustering solution, it can be observed that this rarely occurs, which explains<br />
the low selection accuracy percentages reported in section 4.2.2.<br />
337
D.1. Experiments on consensus-based self-refining<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
CSPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
EAC<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
HGPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
MCLA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
ALSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
KMSAD<br />
SLSAD<br />
(a) flat<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 20<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
CSPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
EAC<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
HGPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
MCLA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
ALSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
KMSAD<br />
SLSAD<br />
(b) RHCA<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 15<br />
λ c 20<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
φ (NMI)<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
1<br />
0.5<br />
0<br />
CSPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
EAC<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
HGPA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
MCLA<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
ALSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
KMSAD<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
λ c 15<br />
λ c 20<br />
E λc<br />
λ c 2<br />
λ c 5<br />
λ c 10<br />
SLSAD<br />
(c) DHCA<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 15<br />
λ c 20<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
Figure D.3: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc p i output by the flat, RHCA, and<br />
DHCA consensus architectures on the Glass data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by<br />
the supraconsensus function in each experiment.<br />
338
[Figure D.4: three panels, (a) flat, (b) RHCA and (c) DHCA, each containing one φ (NMI) boxplot row per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD) over E, λc and λc_2 through λc_90.]<br />
Figure D.4: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the Ionosphere data collection across all the consensus<br />
functions employed. The green dashed vertical line identifies the clustering solution selected<br />
by the supraconsensus function in each experiment.<br />
[Figure D.5: three panels, (a) flat, (b) RHCA and (c) DHCA, each containing one φ (NMI) boxplot row per consensus function (CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD, SLSAD) over E, λc and λc_2 through λc_90.]<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
λ c 15<br />
λ c 20<br />
λ c 30<br />
λ c 40<br />
λ c 50<br />
λ c 60<br />
λ c 75<br />
λ c 90<br />
Figure D.5: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc^pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the WDBC data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by<br />
the supraconsensus function in each experiment.<br />
Appendix D. Experiments on self-refining consensus architectures<br />
[Boxplot panels (a) flat, (b) RHCA and (c) DHCA appear here: φ (NMI) of E, λc and λc^p for p = 2 to 90, one row per consensus function.]<br />
Figure D.6: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc^pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the Balance data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by<br />
the supraconsensus function in each experiment.<br />
D.1. Experiments on consensus-based self-refining<br />
D.1.7 MFeat data set<br />
Figure D.7 presents the boxplots of the clusterings resulting from running the consensus<br />
self-refining procedure on the MFeat data collection. In this case, quite varied behaviours<br />
are observed. For instance, when a high quality consensus clustering solution λc is available<br />
prior to self-refining, none of the refined consensus clusterings achieves a higher φ (NMI) —<br />
see, for instance, the boxplots corresponding to the CSPA, ALSAD and KMSAD consensus<br />
functions. In contrast, in cases in which λc has a low φ (NMI) , self-refining brings about<br />
sometimes notable quality gains, such as the ones observed in the EAC or SLSAD based<br />
flat and RHCA consensus architectures. However, the supraconsensus function tends to<br />
select the non-refined clustering solution as the final partition of the process in a majority<br />
of cases.<br />
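Throughout this appendix, the quality measure φ (NMI) is the normalized mutual information between a clustering and the ground truth partition. As a minimal, self-contained sketch of how such a score can be computed (a pure-Python illustration assuming the geometric-mean normalization of Strehl and Ghosh; the implementation actually used in these experiments may differ):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the
    same n objects: I(A;B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from joint and marginal cluster counts.
    mi = sum((nij / n) * math.log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in joint.items())
    # Shannon entropies of each partition.
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    return 0.0 if ha == 0.0 or hb == 0.0 else mi / math.sqrt(ha * hb)

# Identical partitions (up to label renaming) score 1; unrelated ones 0.
print(round(nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]), 6))  # -> 1.0
print(round(nmi([0, 0, 1, 1], [0, 1, 0, 1]), 6))              # -> 0.0
```

Note that φ (NMI) is invariant to cluster label permutations, which is why relabelled but identical partitions score 1 in the example above.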
D.1.8 miniNG data set<br />
The boxplot charts depicted in figure D.8 summarize the performance of the consensus-based<br />
self-refining procedure when applied to the miniNG data set. It is interesting to<br />
note that, except when self-refining is based on the EAC consensus function, important<br />
quality gains are obtained —in most cases, there exists at least one self-refined consensus<br />
clustering with higher φ (NMI) than the non-refined clustering λc. Notice the large quality<br />
gains obtained when self-refining is based on MCLA, as we move from a very low φ (NMI)<br />
non-refined consensus clustering solution λc to self-refined clusterings that are comparable<br />
to the highest quality components in the cluster ensemble E. However, when self-refining is<br />
based on EAC, little (if any) φ (NMI) increase is introduced by the self-refining procedure.<br />
Last, notice how, in most cases, the supraconsensus function selects high quality clustering<br />
solutions as the final partition.<br />
D.1.9 Segmentation data set<br />
Figure D.9 presents the boxplots of the non-refined and self-refined consensus clustering<br />
solutions obtained by the consensus-based self-refining procedure applied on the Segmentation<br />
data collection. Notice that, thanks to the proposed refining process, at least one<br />
self-refined clustering solution of higher quality than that of the non-refined consensus clustering<br />
λc is obtained in most cases. In fact, the only exceptions occur in the refinement of<br />
the λc output by the flat and DHCA consensus architectures based on the EAC consensus<br />
function.<br />
As regards the performance of the supraconsensus function, we can see that it casts a<br />
shadow over the good results of the self-refining process just reported, as it rarely picks<br />
the highest quality consensus clustering solution —although it usually selects one of the<br />
higher quality ones.<br />
D.1.10 BBC data set<br />
The qualities of the clusterings resulting from applying the consensus-based self-refining<br />
procedure on the BBC data set are presented in the boxplots of figure D.10. Notice that,<br />
although the quality of the non-refined consensus clustering λc is highly dependent on<br />
the consensus function employed (from the high φ (NMI) values in CSPA, MCLA, ALSAD<br />
[Boxplot panels (a) flat, (b) RHCA and (c) DHCA appear here: φ (NMI) of E, λc and λc^p for p = 2 to 90, one row per consensus function.]<br />
Figure D.7: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc^pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the MFeat data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by<br />
the supraconsensus function in each experiment.<br />
[Boxplot panels (a) flat, (b) RHCA and (c) DHCA appear here: φ (NMI) of E, λc and λc^p for p = 2 to 90, one row per consensus function.]<br />
Figure D.8: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc^pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the miniNG data collection across all the consensus<br />
functions employed. The green dashed vertical line identifies the clustering solution selected<br />
by the supraconsensus function in each experiment.<br />
[Boxplot panels (a) flat, (b) RHCA and (c) DHCA appear here: φ (NMI) of E, λc and λc^p for p = 2 to 90, one row per consensus function.]<br />
Figure D.9: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc^pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the Segmentation data collection across all the consensus<br />
functions employed. The green dashed vertical line identifies the clustering solution selected<br />
by the supraconsensus function in each experiment.<br />
or KMSAD to the poorer qualities for EAC, HGPA or SLSAD), the self-refining process<br />
manages to yield better clusterings in most cases, although the observed φ (NMI) increases<br />
are, in general, modest.<br />
Notice that, in this case, the supraconsensus function is reasonably successful in blindly<br />
selecting the top quality consensus clustering solution.<br />
D.1.11 PenDigits data set<br />
Figure D.11 depicts the φ (NMI) values of the non-refined and self-refined consensus clusterings<br />
resulting from the application of the consensus-based self-refining procedure on the<br />
PenDigits data set. Remember that, on this collection, only the HGPA and MCLA consensus<br />
functions are applicable using the hierarchical consensus architectures. Whereas the<br />
quality of the clusterings obtained using HGPA is dramatically poor, the results obtained<br />
with MCLA are quite encouraging. The large φ (NMI) gain observed when refining the consensus<br />
clustering λc output by RHCA is noteworthy. Moreover, notice that the supraconsensus function<br />
correctly selects the highest quality clustering solution in this case.<br />
[Boxplot panels (a) flat, (b) RHCA and (c) DHCA appear here: φ (NMI) of E, λc and λc^p for p = 2 to 90, one row per consensus function.]<br />
Figure D.10: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering<br />
λc and the self-refined consensus clustering solutions λc^pi output by the flat, RHCA, and<br />
DHCA consensus architectures on the BBC data collection across all the consensus functions<br />
employed. The green dashed vertical line identifies the clustering solution selected by the<br />
supraconsensus function in each experiment.<br />
[Boxplot panels (a) RHCA and (b) DHCA appear here: φ (NMI) of E, λc and λc^p for p = 2 to 90, for the HGPA and MCLA consensus functions.]<br />
Figure D.11: φ (NMI) boxplots of the cluster ensemble E, the original consensus clustering λc and the self-refined consensus clustering solutions λc_pi output by the flat, RHCA and DHCA consensus architectures on the PenDigits data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
D.2 Experiments on selection-based self-refining<br />
This section presents the results of the clustering self-refining procedure based on the selection<br />
of a cluster ensemble component λref by means of an average normalized mutual<br />
information (φ (ANMI) ) criterion —see section 4.3.<br />
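In essence, λref is the ensemble component that maximizes the average NMI against the remaining components. A minimal, self-contained sketch of this criterion is shown below; the NMI implementation is the generic textbook formulation (mutual information normalized by the geometric mean of the two entropies), and whether λref itself is excluded from the average is a detail assumed here, so both may differ from the exact variant used in the thesis.<br />

```python
from collections import Counter
from math import log

def nmi(a, b):
    """Normalized mutual information between two labelings of the same objects."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(n * c / (pa[x] * pb[y])) for (x, y), c in pab.items())
    ha = -sum(c / n * log(c / n) for c in pa.values())
    hb = -sum(c / n * log(c / n) for c in pb.values())
    return mi / (ha * hb) ** 0.5 if ha > 0 and hb > 0 else 0.0

def select_reference(ensemble):
    """Return the component maximizing the average NMI against the rest
    of the ensemble (the phi-ANMI selection criterion)."""
    def anmi(cand):
        others = [lam for lam in ensemble if lam is not cand]
        return sum(nmi(cand, lam) for lam in others) / len(others)
    return max(ensemble, key=anmi)
```

Here the selected component is simply looked up among the ensemble members; the same average-NMI idea is also used by the supraconsensus function to pick the final clustering solution.<br />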
The results are presented in a fashion very similar to that of the previous section, that is, by means of boxplot charts displaying the φ (NMI) of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus solutions λc_pi, with pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90}. Moreover, the clustering solution designated as optimal by the supraconsensus function is highlighted by a vertical green dashed line, which provides a simple and fast means of evaluating its performance qualitatively.<br />
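Each boxplot in the following figures condenses a distribution of φ (NMI) values into a handful of summary statistics. As a reference for reading them, the sketch below computes the five numbers a basic boxplot displays (whisker and outlier conventions vary between plotting tools and are ignored here):<br />

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile and maximum:
    the quantities summarized by a basic boxplot."""
    q1, med, q3 = quantiles(sorted(values), n=4)
    return min(values), q1, med, q3, max(values)
```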
D.2.1 Iris data set<br />
Figure D.12 presents the results of the selection-based self-refining procedure applied on the Iris data set. It can be observed that the selected cluster ensemble component λref is of notable quality —i.e. well above the median partition in the cluster ensemble E. Notice how the self-refining process brings about relevant φ (NMI) gains for several consensus functions, namely CSPA, MCLA, ALSAD, KMSAD and SLSAD. However, the supraconsensus function selects λref as the optimal partition, thus ignoring the improvements introduced by the self-refining process in the aforementioned cases. This again highlights the need for well-performing supraconsensus functions.<br />
[Figure D.12 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.12: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the Iris data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
D.2.2 Wine data set<br />
The results obtained by applying the selection-based self-refining procedure on the Wine data collection are presented in figure D.13. Notice that the cluster ensemble component selected by means of the average normalized mutual information criterion, λref, is nearly the top-quality partition contained in the cluster ensemble E. In this case, the creation of select cluster ensembles brings about no quality gains, regardless of the consensus function employed.<br />
As regards the performance of the supraconsensus function, it selects λref as the final clustering solution in a majority of cases, choosing a noticeably worse partition in those experiments based on the KMSAD and CSPA consensus functions.<br />
D.2.3 Glass data set<br />
Figure D.14 presents the φ (NMI) boxplots corresponding to the selection-based self-refining<br />
procedure applied on the Glass data set. Notice how the notably high quality clustering<br />
[Figure D.13 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.13: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the Wine data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
solution λref is hardly surpassed by any of the refined consensus clustering solutions —in fact, this only happens when self-refining is based on the EAC and SLSAD consensus functions. In most cases, the supraconsensus function designates the selected cluster ensemble component λref as the final clustering solution.<br />
D.2.4 Ionosphere data set<br />
The application of the selection-based self-refining process on the Ionosphere data collection gives rise to modest quality increases, as depicted in figure D.15. Notice that clustering solutions with a higher φ (NMI) than λref are only obtained when the self-refining procedure is based on the CSPA and HGPA consensus functions.<br />
Furthermore, notice that the supraconsensus function selects, in most cases, the highest quality clustering solutions —unfortunately, the only exceptions occur when self-refining is based on CSPA and HGPA, i.e. precisely the cases in which self-refining introduces significant quality gains.<br />
[Figure D.14 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.14: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the Glass data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
D.2.5 WDBC data set<br />
Figure D.16 presents the φ (NMI) boxplots corresponding to the selection-based self-refining process applied on the WDBC data set. Firstly, notice that the cluster ensemble component selected by means of the φ (ANMI) criterion, λref, is very close to the highest quality partition contained in the cluster ensemble. Secondly, the effect of self-refining is highly dependent on the consensus function employed. For instance, no quality gains are achieved when CSPA, EAC or HGPA are used. In contrast, modest φ (NMI) gains are obtained when self-refining is conducted using the MCLA, ALSAD, KMSAD and SLSAD consensus functions. Lastly, notice that the supraconsensus function selects very high quality clusterings as the final partition.<br />
D.2.6 Balance data set<br />
As regards the performance of the selection-based self-refining process applied on the Balance data collection, figure D.17 shows that self-refined clustering solutions of higher<br />
[Figure D.15 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.15: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the Ionosphere data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
quality than that of the selected cluster ensemble component λref are obtained for most consensus functions —in fact, the clearest exceptions to this behaviour are EAC and HGPA. However, the supraconsensus function is not capable of selecting those better quality clusterings as the final partition in most cases, which again gives an idea of its limited performance.<br />
D.2.7 MFeat data set<br />
Figure D.18 presents the φ (NMI) boxplots of the clusterings obtained after applying the selection-based self-refining process on the MFeat data set. Notice how, for four out of the seven consensus functions (CSPA, MCLA, ALSAD and KMSAD), notable quality gains are obtained (i.e. at least one of the refined clusterings attains a higher φ (NMI) than the selected cluster ensemble component λref). Unfortunately, the supraconsensus function fails to select these high quality partitions as the final clustering solution λc_final, as it constantly selects λref as the optimal one —a choice that is only correct when self-refining is based on the EAC, HGPA and SLSAD consensus functions.<br />
[Figure D.16 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.16: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the WDBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
D.2.8 miniNG data set<br />
The application of the selection-based self-refining procedure on the miniNG data collection yields the boxplots presented in figure D.19. Notice that the selected cluster ensemble component λref is comparable to the best individual partitions contained in the ensemble E. Despite this, the self-refining process manages to obtain even higher quality consensus clusterings when based on the CSPA, MCLA, ALSAD and KMSAD consensus functions. Unfortunately, the supraconsensus function fails on most occasions to select the maximum-φ (NMI) clustering —in fact, it only makes the correct choice when the EAC, HGPA and SLSAD consensus functions are employed.<br />
D.2.9 Segmentation data set<br />
Figure D.20 presents the φ (NMI) boxplots of the selection-based self-refined clustering solutions<br />
obtained on the Segmentation data set. Despite the notable quality of the selected<br />
cluster ensemble component λref, notice that important quality gains are obtained when<br />
[Figure D.17 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.17: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the Balance data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
self-refining is applied, especially when the CSPA, ALSAD and KMSAD consensus functions<br />
are employed —more modest improvements are obtained when using MCLA or SLSAD,<br />
whereas none is attained when self-refining is based on EAC and HGPA.<br />
As regards the ability of the supraconsensus function to select the top quality clustering solution as the final one, it only succeeds when consensus is based on EAC, HGPA and MCLA. However, the φ (NMI) losses caused by suboptimal supraconsensus selection are, in general, moderate.<br />
D.2.10 BBC data set<br />
The application of the selection-based self-refining procedure on the BBC data set yields<br />
the φ (NMI) boxplots depicted in figure D.21. Notice that, in this data collection, the cluster ensemble component λref selected via the φ (ANMI) criterion is very close (if not equal) to the maximum quality individual partition contained in the cluster ensemble E. Starting<br />
from this high quality reference point, the self-refining procedure manages to yield slightly<br />
better clusterings when it is based on the MCLA, ALSAD and KMSAD consensus functions.<br />
[Figure D.18 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.18: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the MFeat data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
Moreover, notice that, regardless of the consensus function employed, the supraconsensus function tends to select fairly high quality clustering solutions as the final ones.<br />
D.2.11 PenDigits data set<br />
Figure D.22 presents the φ (NMI) boxplots corresponding to the application of the selection-based clustering self-refining procedure on the PenDigits data set. Recall that, due to its size, only the HGPA and MCLA consensus functions are executable on this data collection. As regards the results obtained, notice that the selected cluster ensemble component λref has a notably high quality. However, the results obtained when it is self-refined differ dramatically depending on the consensus function applied. In the case of HGPA, self-refining brings about no quality gains, and the supraconsensus function correctly selects λref as the final clustering solution. In contrast, refined clusterings yielded by MCLA are capable of achieving slightly higher φ (NMI) values than the selected cluster ensemble component λref. However, supraconsensus conducts a suboptimal selection, as it does not choose the maximum quality refined clustering as the final partition.<br />
[Figure D.19 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.19: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the miniNG data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
[Figure D.20 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.20: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the Segmentation data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
[Figure D.21 appears here: φ (NMI) boxplot panels for (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD.]<br />
Figure D.21: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the BBC data collection across all the consensus functions employed. The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
[Figure D.22 appears here: φ (NMI) boxplot panels for (a) HGPA and (b) MCLA.]<br />
Figure D.22: φ (NMI) boxplots of the cluster ensemble E, the selected cluster ensemble component λref and the self-refined consensus clustering solutions λc_pi on the PenDigits data collection across the two consensus functions executable on it (HGPA and MCLA). The green dashed vertical line identifies the clustering solution selected by the supraconsensus function in each experiment.<br />
Appendix E<br />
Experiments on multimodal<br />
consensus clustering<br />
This appendix presents several experiments regarding the multimodal self-refining consensus architectures described in chapter 5, applied to the CAL500, InternetAds and Corel data collections. Due to space limitations, the experiments described correspond to applying the proposed methodology to cluster ensembles resulting from four of the twenty-eight clustering algorithms employed in this thesis, namely agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2.<br />
For each one of the data sets, two facets of the experiments are presented separately. Firstly, the consensus clusterings obtained on each modality and across modalities are qualitatively evaluated. To do so, a set of boxplot charts is presented, displaying the φ (NMI) values of the components of the corresponding cluster ensemble E, and of the unimodal, multimodal and intermodal consensus clusterings obtained by the seven consensus functions employed in this work across 10 independent runs.<br />
And secondly, the quality of the self-refined consensus clustering solutions output by<br />
the proposed consensus self-refining procedure is also evaluated with the help of boxplot<br />
diagrams displaying the φ (NMI) values of the corresponding cluster ensembles, of the nonrefined<br />
consensus clustering λc and of the self-refined consensus clustering solutions λc p i .<br />
As regards the latter, a set of refined clusterings are obtained using a range of percentages<br />
pi = {2, 5, 10, 15, 20, 30, 40, 50, 60, 75} of the whole ensemble E. The performance of the<br />
φ (ANMI) -based supraconsensus function for picking up one of the λc p i is also qualitatively<br />
evaluated.<br />
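The φ(ANMI)-based supraconsensus selection ranks candidate consensus clusterings by their average NMI against the whole ensemble. A minimal pure-Python sketch (function names are ours; φ(NMI) follows the Strehl and Ghosh geometric-mean normalization used throughout this thesis):

```python
import math
from collections import Counter

def phi_nmi(a, b):
    """Strehl & Ghosh normalized mutual information between two labelings:
    phi(NMI) = I(a;b) / sqrt(H(a) * H(b)), labels given per object."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda c: -sum((v / n) * math.log(v / n) for v in c.values())
    mi = sum((v / n) * math.log((v / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), v in pab.items())
    denom = math.sqrt(h(pa) * h(pb))
    return mi / denom if denom > 0 else 0.0

def phi_anmi(candidate, ensemble):
    """phi(ANMI): average NMI between a candidate and every ensemble component."""
    return sum(phi_nmi(candidate, lam) for lam in ensemble) / len(ensemble)

def supraconsensus(candidates, ensemble):
    """Select the candidate clustering maximizing phi(ANMI) w.r.t. the ensemble."""
    return max(candidates, key=lambda c: phi_anmi(c, ensemble))
```

In the experiments that follow, supraconsensus would be handed the set {λc, λc^2, ..., λc^75} produced by the self-refining procedure.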
E.1 CAL500 data set

In this section, the results of the multimodal consensus clustering experiments conducted on the CAL500 data collection are described. The modalities contained in this data set are audio and text (see appendix A.2.2 for a description).
[Boxplot panels (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.1: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the CAL500 data set.
E.1.1 Consensus clustering per modality and across modalities

For starters, the quality of the consensus clustering solutions obtained on i) the two original modalities, ii) the fused audio+text multimodal modality, and iii) across the previous three modalities is evaluated. Figure E.1 presents the results obtained after applying the proposed multimodal consensus architecture to the cluster ensemble resulting from the compilation of the partitions output by the agglo-cos-upgma clustering algorithm. It can be observed that the quality of the clusterings corresponding to the audio modality is notably better than that of those obtained on the text modality (except when the EAC consensus function is employed). The early fusion of the auditory and textual features does not introduce any beneficial effect, rather the contrary. The quality of the intermodal consensus clusterings λc corresponding to the combination of the three modalities is approximately a trade-off between them.

Figures E.2, E.3 and E.4 depict, respectively, the results obtained on the cluster ensembles created upon the direct-cos-i2, graph-cos-i2 and rb-cos-i2 CLUTO clustering algorithms. Very similar results to the ones just reported are obtained in all cases: the consensus clusterings based on the audio modality attain higher quality than those based on the remaining modalities, while the multimodal and intermodal consensus clustering solutions constitute a kind of trade-off between modalities.
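Each box in these charts condenses the φ(NMI) scores of one method across the 10 independent runs. As a small illustration (the scores below are hypothetical), the five-number summary a boxplot displays can be computed with the standard library:

```python
import statistics

def box_summary(scores):
    """Five-number summary (min, Q1, median, Q3, max) of a list of phi(NMI)
    scores, i.e. the quantities one box displays for a single method
    across independent runs."""
    q1, med, q3 = statistics.quantiles(scores, n=4)
    return min(scores), q1, med, q3, max(scores)
```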
[Boxplot panels (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.2: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the CAL500 data set.
[Boxplot panels (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.3: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the CAL500 data set.
[Boxplot panels (a) Modality 1 (audio), (b) Modality 2 (text), (c) Multimodal (audio+text) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.4: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the CAL500 data set.
[Boxplot panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ(NMI) of E, λc and the self-refined λc^pi across the range of percentages pi.]

Figure E.5: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the CAL500 data set.
E.1.2 Self-refined consensus clustering across modalities

In this section, the results of running the self-refining procedure on the intermodal consensus clustering solution λc are evaluated. Firstly, the results of the process applied to the cluster ensemble created by the compilation of the clusterings output by the agglo-cos-upgma clustering algorithm are presented in figure E.5. On each of the seven boxplot charts displayed (one per consensus function), the clustering solution selected by the supraconsensus function, λc^final, is highlighted by a green dashed vertical line. Notice that in all cases there exists at least one self-refined consensus clustering λc^pi that attains a higher φ(NMI) than the non-refined consensus clustering solution λc. However, the supraconsensus function mostly fails to select the top quality clustering as the final partition of the data; in fact, it only does so in the experiments based on the CSPA, ALSAD and KMSAD consensus functions. This situation clearly illustrates both the advantages of the proposed self-refining procedure and the shortcomings of the φ(ANMI)-based supraconsensus function.

Figures E.6 to E.8 present the self-refining results obtained on the cluster ensembles constructed upon the clusterings generated by the direct-cos-i2, graph-cos-i2 and rb-cos-i2 clustering algorithms. Notice that, in most cases, the self-refining procedure yields at least one clustering of superior quality to the non-refined consensus clustering solution. Exceptions to this behaviour occur, for instance, when the KMSAD and EAC consensus functions are employed for clustering combination on the direct-cos-i2 and graph-cos-i2 cluster ensembles (see figures E.6(f) and E.7(b), respectively). Again, the supraconsensus function performs with modest accuracy, managing to select the top quality clustering solution in some cases (see figure E.8(b)) and failing clamorously in others (as in figure
[Boxplot panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ(NMI) of E, λc and the self-refined λc^pi across the range of percentages pi.]

Figure E.6: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the CAL500 data set.
[Boxplot panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ(NMI) of E, λc and the self-refined λc^pi across the range of percentages pi.]

Figure E.7: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the CAL500 data set.
E.7(g)).
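The self-refining procedure evaluated in this section derives each refined consensus λc^p from a p% subset of the ensemble, on which a consensus function is then re-run. The sketch below assumes, for illustration only, that the subset keeps the components most similar (by φ(NMI)) to the non-refined consensus λc; see chapter 5 for the exact selection criterion. It is self-contained, so it re-defines the NMI helper:

```python
import math
from collections import Counter

def phi_nmi(a, b):
    """Strehl & Ghosh NMI between two labelings (geometric-mean normalization)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda c: -sum((v / n) * math.log(v / n) for v in c.values())
    mi = sum((v / n) * math.log((v / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), v in pab.items())
    d = math.sqrt(h(pa) * h(pb))
    return mi / d if d > 0 else 0.0

def refining_subset(ensemble, lambda_c, p):
    """Keep the p% of ensemble components that agree most (by NMI) with the
    non-refined consensus lambda_c; a refined consensus lambda_c^p is then
    obtained by re-running a consensus function on this subset."""
    k = max(1, round(len(ensemble) * p / 100))
    ranked = sorted(ensemble, key=lambda lam: phi_nmi(lambda_c, lam), reverse=True)
    return ranked[:k]
```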
[Boxplot panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ(NMI) of E, λc and the self-refined λc^pi across the range of percentages pi.]

Figure E.8: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the CAL500 data set.
E.2 InternetAds data set

In the following paragraphs, the results corresponding to the execution of self-refining multimodal consensus clustering on the InternetAds collection are presented. In this case, the modalities are object (i.e. image) attributes and collateral image attributes (see appendix A.2.2 for a description).
E.2.1 Consensus clustering per modality and across modalities

In this section, the quality of the unimodal, multimodal and intermodal consensus clusterings obtained on the cluster ensembles generated upon the agglo-cos-upgma, direct-cos-i2, graph-cos-i2 and rb-cos-i2 clustering algorithms is evaluated.

Firstly, the results corresponding to the agglo-cos-upgma cluster ensemble are depicted in figure E.9. The first thing to notice is the extremely low quality of the cluster ensemble components, regardless of the modality. This fact conditions the consensus clustering results obtained, which are also of very low quality. Moreover, in contrast to what has been observed in the rest of the experiments, there exist very few differences among the performances of the distinct consensus functions employed.

A very similar behaviour is observed in the experiments conducted on the direct-cos-i2 and rb-cos-i2 cluster ensembles (figures E.10 and E.12). However, notably different results are obtained on the graph-cos-i2 cluster ensemble (see figure E.11). In this case, the execution of consensus clustering on the collateral and the multimodal modalities yields better results.
[Boxplot panels (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.9: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the InternetAds data set.
[Boxplot panels (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.10: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the InternetAds data set.
E.2.2 Self-refined consensus clustering across modalities

The results of the application of the self-refining procedure on the intermodal consensus clustering λc are presented next. Again, the consensus clustering selected by the supraconsensus function, λc^final, is highlighted by a green vertical dashed line.

Figures E.13, E.14 and E.16, which show the results corresponding to the agglo-cos-upgma, direct-cos-i2 and rb-cos-i2 cluster ensembles, reveal that little is achieved by self-refining in these cases. In contrast, the usual growing φ(NMI) patterns induced by self-refining are observed in figure E.15, especially when the MCLA, ALSAD and KMSAD consensus functions are employed (see figures E.15(d), E.15(e) and E.15(f)). Unfortunately, the supraconsensus function mostly fails to select the top quality clustering in these cases.
[Boxplot panels (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.11: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the InternetAds data set.
[Boxplot panels (a) Modality 1 (object), (b) Modality 2 (collateral), (c) Multimodal (object+collateral) and (d) Intermodal: φ(NMI) of the cluster ensemble E and of the consensus clusterings λc obtained with the CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD consensus functions.]

Figure E.12: φ(NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering solutions using the rb-cos-i2 algorithm on the InternetAds data set.
E.3 Corel data set

This section presents the results of the multimodal consensus clustering experiments executed on the Corel data collection. On this data set, the modalities are image and text features.
E.3.1 Consensus clustering per modality and across modalities

For starters, figure E.17 depicts the unimodal, multimodal and intermodal consensus clusterings obtained on the agglo-cos-upgma cluster ensemble. Notice the notable differences between the two modalities: clustering this collection using the textual features of the objects leads to better partitions than those obtained on the image modality. Apparently, the multimodal modality resulting from the early fusion of textual and visual features yields clusterings whose quality is equal to or slightly lower than that of the textual ones. Thus, in this case, multimodality brings about no gains as regards the obtention of higher quality partitions. Lastly, the intermodal consensus clustering λc attains φ(NMI) values
[Boxplot panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ(NMI) of E, λc and the self-refined λc^pi across the range of percentages pi.]

Figure E.13: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the agglo-cos-upgma algorithm on the InternetAds data set.
[Boxplot panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ(NMI) of E, λc and the self-refined λc^pi across the range of percentages pi.]

Figure E.14: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the direct-cos-i2 algorithm on the InternetAds data set.
comparable to the best text-modality-based partitions when the CSPA, ALSAD and KMSAD consensus functions are employed, while it constitutes a trade-off between modalities when the remaining clustering combiners are used.

A very similar performance is observed when the consensus process is applied on the
[Boxplot panels (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD: φ(NMI) of E, λc and the self-refined λc^pi across the range of percentages pi.]

Figure E.15: φ(NMI) boxplots of the self-refined intermodal consensus clustering solutions using the graph-cos-i2 algorithm on the InternetAds data set.
[Figure E.16: seven φ (NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each plotting the ensemble E, λc and the self-refined clusterings λ_c^2 through λ_c^75 on a 0 to 1 scale.]<br />
Figure E.16: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the rb-cos-i2 algorithm on the InternetAds data set.<br />
direct-cos-i2, graph-cos-i2 and rb-cos-i2 cluster ensembles, as figures E.18, E.19 and E.20<br />
reveal.<br />
Appendix E. Experiments on multimodal consensus clustering<br />
[Figure E.17: four φ (NMI) boxplot panels, (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image) and (d) Intermodal, comparing E, CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD on a 0 to 1 scale.]<br />
Figure E.17: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the agglo-cos-upgma algorithm on the Corel data set.<br />
[Figure E.18: four φ (NMI) boxplot panels, (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image) and (d) Intermodal, comparing E, CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD on a 0 to 1 scale.]<br />
Figure E.18: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the direct-cos-i2 algorithm on the Corel data set.<br />
[Figure E.19: four φ (NMI) boxplot panels, (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image) and (d) Intermodal, comparing E, CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD on a 0 to 1 scale.]<br />
Figure E.19: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the graph-cos-i2 algorithm on the Corel data set.<br />
E.3.2 Self-refined consensus clustering across modalities<br />
In the following paragraphs, the results of applying the consensus self-refining procedure on<br />
the intermodal consensus clustering λc are qualitatively described.<br />
For starters, the φ (NMI) values of the non-refined and self-refined consensus clusterings<br />
obtained on the agglo-cos-upgma cluster ensembles are presented in figure E.21. We can observe that, in<br />
all cases, there exists at least one refined consensus clustering that attains a higher φ (NMI)<br />
value than the non-refined clustering λc. Moreover, in this case, the supraconsensus function<br />
is pretty successful in selecting the top quality consensus clustering as the final partition<br />
[Figure E.20: four φ (NMI) boxplot panels, (a) Modality 1 (text), (b) Modality 2 (image), (c) Multimodal (text+image) and (d) Intermodal, comparing E, CSPA, EAC, HGPA, MCLA, ALSAD, KMSAD and SLSAD on a 0 to 1 scale.]<br />
Figure E.20: φ (NMI) boxplots of the unimodal, multimodal and intermodal consensus clustering<br />
solutions using the rb-cos-i2 algorithm on the Corel data set.<br />
λ_c^final, which is again highlighted by means of a vertical green dashed line.<br />
The self-refining procedure performs equally well when conducted on the cluster<br />
ensembles created by means of the three remaining clustering algorithms, as<br />
figures E.22, E.23 and E.24 depict. However, the selection accuracy of the supraconsensus<br />
function is somewhat inconsistent, as already observed on the previous data sets.<br />
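The supraconsensus selection step discussed above can be sketched as follows. This is a minimal illustration that assumes the average normalized mutual information (ANMI) criterion of Strehl and Ghosh is used to pick among the candidate self-refined clusterings; the function names are illustrative and need not match the thesis implementation.

```python
from collections import Counter
from math import log, sqrt

def nmi(a, b):
    """Normalized mutual information between two flat label sequences,
    normalized by the geometric mean of the two partition entropies."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())
    ha = -sum(c / n * log(c / n) for c in pa.values())
    hb = -sum(c / n * log(c / n) for c in pb.values())
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 1.0

def supraconsensus(candidates, ensemble):
    """Return the candidate partition with the highest average NMI
    against every partition in the cluster ensemble (ANMI criterion)."""
    return max(candidates, key=lambda c: sum(nmi(c, p) for p in ensemble))
```

A candidate that reproduces the ensemble's dominant structure obtains ANMI close to 1 and is selected; note this selection only sees the ensemble, not the ground truth, which is one plausible source of the inconsistent selection accuracy noted above.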
[Figure E.21: seven φ (NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each plotting the ensemble E, λc and the self-refined clusterings λ_c^2 through λ_c^75 on a 0 to 1 scale.]<br />
Figure E.21: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the agglo-cos-upgma algorithm on the Corel data set.<br />
[Figure E.22: seven φ (NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each plotting the ensemble E, λc and the self-refined clusterings λ_c^2 through λ_c^75 on a 0 to 1 scale.]<br />
Figure E.22: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the direct-cos-i2 algorithm on the Corel data set.<br />
[Figure E.23: seven φ (NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each plotting the ensemble E, λc and the self-refined clusterings λ_c^2 through λ_c^75 on a 0 to 1 scale.]<br />
Figure E.23: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the graph-cos-i2 algorithm on the Corel data set.<br />
[Figure E.24: seven φ (NMI) boxplot panels, (a) CSPA, (b) EAC, (c) HGPA, (d) MCLA, (e) ALSAD, (f) KMSAD and (g) SLSAD, each plotting the ensemble E, λc and the self-refined clusterings λ_c^2 through λ_c^75 on a 0 to 1 scale.]<br />
Figure E.24: φ (NMI) boxplots of the self-refined intermodal consensus clustering solutions<br />
using the rb-cos-i2 algorithm on the Corel data set.<br />
Appendix F<br />
Experiments on soft consensus<br />
clustering<br />
This appendix presents the results of the consensus clustering experiments on soft cluster<br />
ensembles. The main purpose of these experiments is to compare the four voting based consensus<br />
functions put forward in chapter 6 –namely BordaConsensus (BC), CondorcetConsensus<br />
(CC), ProductConsensus (PC) and SumConsensus (SC)– with five state-of-the-art<br />
clustering combiners: the soft versions of the hypergraph based hard consensus functions<br />
CSPA, HGPA and MCLA (Strehl and Ghosh, 2002), and the evidence accumulation approach<br />
(EAC) (Fred and Jain, 2005) (see section 6.2), plus the voting-merging soft consensus<br />
function (VMA) of (Dimitriadou, Weingessel, and Hornik, 2002).<br />
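As a rough illustration of the positional-voting principle underlying BordaConsensus (the actual consensus function is defined in chapter 6 and differs in its details), a plain Borda count over per-object cluster rankings might look like the sketch below; deriving the rankings from soft membership probabilities is an assumption made here for illustration only.

```python
def borda_winner(rankings):
    """Plain Borda count: each ranking awards len-1 points to its first
    choice, len-2 to the second, and so on; the highest total wins."""
    scores = {}
    for ranking in rankings:
        k = len(ranking)
        for pos, cluster in enumerate(ranking):
            scores[cluster] = scores.get(cluster, 0) + (k - 1 - pos)
    return max(scores, key=scores.get)

def assign_object(memberships):
    """Hypothetical use on a soft ensemble: each component ranks the
    clusters for one object by descending membership probability,
    and the object is assigned to the Borda winner."""
    rankings = [sorted(m, key=m.get, reverse=True) for m in memberships]
    return borda_winner(rankings)
```

For example, three components voting a > b > c, a > c > b and b > a > c give a the totals 2 + 2 + 1 = 5 points, so a wins.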
This comparison entails two aspects: the quality of the consensus clustering solutions<br />
obtained (measured in terms of normalized mutual information –φ (NMI) – with respect to<br />
the ground truth of each data set), and the time complexity of each consensus function<br />
(measured in terms of the CPU time required for their execution —see appendix A.6 for a<br />
description of the computational resources employed in this work).<br />
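Measuring time complexity as CPU time can be reproduced in spirit with process time rather than wall-clock time, so that unrelated system load does not pollute the comparison. A minimal sketch, with illustrative names:

```python
import time

def timed(consensus_fn, *args):
    """Run a consensus function and return its result together with the
    CPU seconds it consumed; process_time excludes time spent in other
    processes, unlike wall-clock time."""
    start = time.process_time()
    result = consensus_fn(*args)
    return result, time.process_time() - start
```

Usage would be, for instance, `labels, secs = timed(my_consensus, ensemble)`, where `my_consensus` stands for any of the clustering combiners compared here.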
From a formal viewpoint, the results of these experiments are presented by means of a<br />
φ (NMI) vs. CPU time diagram, onto which the performance of each consensus function is<br />
described by means of a scatterplot covering the mean ± 2-standard deviation region of the<br />
corresponding magnitude (i.e. φ (NMI) and CPU time). Moreover, the statistical significance<br />
of the results is evaluated by means of Student’s t-tests that compare all the consensus<br />
functions on a pairwise basis, thus analyzing whether the hypothetical superiority of any of<br />
them is sustained on firm statistical grounds, using the traditional 95% confidence level<br />
as a reference for distinguishing between significant and non-significant differences.<br />
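The pairwise testing protocol can be sketched as follows. This is a simplified version that reports the paired t statistic against a caller-supplied critical value (for example, t ≈ 2.26 for 9 degrees of freedom at the two-sided 95% level) instead of exact p-values, which would require the t-distribution CDF; names and the table encoding are illustrative.

```python
from itertools import combinations
from math import sqrt

def paired_t(x, y):
    """Paired t statistic for two matched samples of equal length."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / sqrt(var / n)

def significance_table(samples, t_crit):
    """Compare every pair of named result samples; '×' marks pairs whose
    difference is not significant at the chosen critical value, mirroring
    the notation used in the tables of this appendix."""
    table = {}
    for (na, xa), (nb, xb) in combinations(samples.items(), 2):
        t = paired_t(xa, xb)
        table[(na, nb)] = '×' if abs(t) < t_crit else round(t, 3)
    return table
```

The same routine is run twice in the experiments' spirit: once on the φ (NMI) samples and once on the CPU time samples, filling the lower and upper triangles of each table respectively.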
These soft consensus clustering experiments have been conducted on the twelve unimodal<br />
data collections employed in this work (see appendix A.2.1 for a description). The results<br />
corresponding to the Zoo data collection are presented in chapter 6, and the following<br />
paragraphs describe the results obtained on the eleven remaining data sets.<br />
[Figure F.1: φ (NMI) (0 to 1) vs CPU time (sec.) scatterplot regions on the Iris data set for CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC.]<br />
Figure F.1: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Iris data collection.<br />
F.1 Iris data set<br />
This section describes the results of the soft consensus clustering experiments run on the<br />
Iris data set. Figure F.1 presents the φ (NMI) vs CPU time mean ± 2-standard deviation<br />
regions of the nine consensus functions compared. Quite obviously, the closer the scatterplot<br />
of a consensus function lies to the top left corner of the diagram, the better its performance<br />
(i.e. it yields high quality consensus clustering solutions with low time<br />
complexity).<br />
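The mean ± 2-standard-deviation regions drawn in these diagrams come from repeated experiment runs; each axis interval can be computed as in this small sketch (sample standard deviation assumed):

```python
from math import sqrt

def region(values):
    """Mean ± 2 * sample-standard-deviation interval of repeated
    measurements, i.e. one side (x or y) of a scatterplot region."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean - 2 * sd, mean + 2 * sd
```

Applying it once to the φ (NMI) samples and once to the CPU time samples of a consensus function yields the rectangle plotted for that function.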
In this case, the proposed SC and PC consensus functions match the performance of the<br />
VMA, both in terms of time complexity and consensus quality. The performance of the other<br />
two consensus functions proposed (BC and CC) is pretty comparable as far as the quality<br />
of the consensus clustering solutions is concerned, but their computational complexity is<br />
higher. As regards the state-of-the-art consensus functions, CSPA seems to yield slightly<br />
better quality results, although its CPU time more than doubles that of our proposals, making it the<br />
most costly. For its part, MCLA seems to be competitive from a computational viewpoint,<br />
but it yields lower quality consensus clusterings. Last, EAC and HGPA are the worst<br />
performing consensus functions.<br />
If the statistical significance of the results is evaluated –see table F.1–, it can be observed<br />
that the φ (NMI) superiority of CSPA is only apparent, as the differences with respect<br />
to BC and CC are not statistically significant, and the consensus clusterings<br />
output by SC and PC are significantly better than those of CSPA. Moreover, SC and PC<br />
are statistically equivalent to VMA both in terms of quality and execution time.<br />
F.2 Wine data set<br />
The soft consensus clustering results obtained on the Wine data collection are presented<br />
next. For starters, figure F.2 displays the φ (NMI) vs CPU time mean ± 2-standard deviation<br />
regions corresponding to the nine consensus functions compared. In general terms, it can<br />
be observed that VMA is the fastest alternative, while the best quality consensus clustering<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
EAC 0.0001 ——— 0.0014 0.0001 0.0001 0.0001 0.001 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— × 0.0001 × × 0.0001 0.0001<br />
MCLA 0.043 0.0001 0.0001 ——— 0.0001 × × 0.0001 0.0001<br />
VMA 0.0146 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 × ×<br />
BC × 0.0001 0.0001 0.0001 0.0377 ——— 0.0163 0.0001 0.0001<br />
CC × 0.0001 0.0001 0.0001 0.0373 × ——— 0.0001 0.0001<br />
PC 0.0377 0.0001 0.0001 0.0001 × × × ——— ×<br />
SC 0.0289 0.0001 0.0001 0.0001 × × × × ———<br />
Table F.1: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a paired t-test on the Iris data set. The upper and lower triangular sections<br />
of the table correspond to the comparison in terms of CPU time and φ (NMI) , respectively.<br />
Statistically non-significant differences (p >0.05) are denoted by the symbol ×.<br />
[Figure F.2: φ (NMI) (0 to 1) vs CPU time (sec.) scatterplot regions on the Wine data set for CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC.]<br />
Figure F.2: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Wine data collection.<br />
solutions are the ones output by two of the proposed consensus functions: BC and CC.<br />
The analysis of the statistical significance of these results reinforces these notions (see<br />
table F.2). Indeed, the CPU time differences between VMA and the remaining consensus<br />
functions are always statistically significant, with very small p-values (around<br />
0.0001). Moreover, in terms of φ (NMI) , BC and CC are significantly better than any of the<br />
alternatives. On their part, as already suggested by the diagram of figure F.2, SC and PC<br />
are not statistically different from VMA as far as the quality of the consensus clustering<br />
solutions is concerned.<br />
F.3 Glass data set<br />
This section describes the results of the quality and time complexity comparison experiments<br />
between the nine soft consensus functions employed in this work, when applied on the Glass<br />
data set.<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.001 0.0001 × 0.0001 0.0249 0.0005 0.0001 0.0001<br />
EAC 0.0001 ——— 0.0001 × 0.0001 × 0.0105 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— 0.0004 0.0001 0.0001 0.0001 × ×<br />
MCLA × 0.0001 0.0001 ——— 0.0001 × 0.0199 0.0013 0.0014<br />
VMA 0.0001 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0002 0.0002<br />
BC 0.0001 0.0001 0.0001 0.0001 0.0006 ——— × 0.0001 0.0001<br />
CC 0.0001 0.0001 0.0001 0.0001 0.0006 × ——— 0.0001 0.0001<br />
PC 0.0001 0.0001 0.0001 0.0001 × 0.001 0.001 ——— ×<br />
SC 0.0001 0.0001 0.0001 0.0001 × 0.0129 0.0129 × ———<br />
Table F.2: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a paired t-test on the Wine data set. The upper and lower triangular sections<br />
of the table correspond to the comparison in terms of CPU time and φ (NMI) , respectively.<br />
Statistically non-significant differences (p >0.05) are denoted by the symbol ×.<br />
[Figure F.3: φ (NMI) (0 to 1) vs CPU time (sec.) scatterplot regions on the Glass data set for CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC.]<br />
Figure F.3: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Glass data collection.<br />
As figure F.3 suggests, VMA is again the least time consuming consensus function. As<br />
mentioned earlier, this is due to the simultaneity of the cluster disambiguation and voting<br />
processes in this consensus function. In contrast, the proposed CC consensus function is by<br />
far the slowest, probably due to the exhaustive pairwise cluster confrontation implicit in<br />
the Condorcet voting method.<br />
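The quadratic cost attributed to CC stems from the Condorcet method itself, which confronts every pair of candidates. A generic sketch (not the chapter 6 implementation) makes that pairwise structure explicit:

```python
def condorcet_winner(rankings):
    """Generic Condorcet method over a list of rankings: a candidate wins
    if it beats every other candidate in a pairwise majority; returns
    None when a voting cycle leaves no winner. Confronting all pairs is
    what makes the method quadratic in the number of candidates
    (here, candidate clusters)."""
    candidates = rankings[0]
    for c in candidates:
        beats_all = all(
            sum(r.index(c) < r.index(o) for r in rankings) * 2 > len(rankings)
            for o in candidates if o != c)
        if beats_all:
            return c
    return None
```

With k clusters, each object triggers on the order of k^2 pairwise confrontations, which is consistent with CC being the slowest of the proposed functions here.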
In terms of quality, the VMA, PC and SC consensus functions appear evenly matched,<br />
attaining the highest φ (NMI) scores, whereas the CSPA, BC, CC and MCLA<br />
consensus functions apparently yield lower quality consensus clustering solutions.<br />
When the statistical significance of these results is analyzed –see table F.3–, we see that<br />
the apparent time complexity superiority of VMA is statistically significant. As regards the<br />
quality of the consensus clustering solutions, it can be observed that the performances of<br />
VMA, SC and PC are statistically equivalent, whereas the differences between these and<br />
BC and CC are indeed significant.<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0002 0.0001 0.0345 0.0001 × 0.0001 0.0016 0.0017<br />
EAC 0.0001 ——— 0.0001 × 0.0001 × 0.0001 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— 0.0002 0.0001 0.0001 0.0001 × ×<br />
MCLA × 0.0001 0.0001 ——— 0.0001 × 0.0001 0.002 0.002<br />
VMA 0.0001 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.0233 0.0001 0.0001 0.0001 0.0025 ——— 0.0001 0.0037 0.0038<br />
CC 0.0247 0.0001 0.0001 0.0001 0.0022 × ——— 0.0001 0.0001<br />
PC 0.0001 0.0001 0.0001 0.0001 × 0.0092 0.0084 ——— ×<br />
SC 0.0001 0.0001 0.0001 0.0001 × 0.0064 0.0058 × ———<br />
Table F.3: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a paired t-test on the Glass data set. The upper and lower triangular sections<br />
of the table correspond to the comparison in terms of CPU time and φ (NMI) , respectively.<br />
Statistically non-significant differences (p >0.05) are denoted by the symbol ×.<br />
F.4 Ionosphere data set<br />
In the following paragraphs, the results of the soft consensus clustering experiments conducted<br />
on the Ionosphere data collection are described.<br />
For starters, figure F.4 displays the φ (NMI) vs CPU time mean ± 2-standard deviation<br />
regions corresponding to the nine consensus functions compared in this experiment. It can<br />
be observed that pretty low quality consensus clustering solutions (φ (NMI) < 0.1) are yielded<br />
by all clustering combiners. The highest φ (NMI) scores are obtained by CSPA, BC and CC,<br />
whose performance is statistically significantly better than that of the other six consensus<br />
functions —see table F.4 for the results of the statistical significance analysis of the results<br />
of this experiment.<br />
As regards time complexity, it can be observed that VMA is the most computationally<br />
efficient option, closely followed by HGPA. The proposed PC and SC consensus functions<br />
are comparable to CSPA and MCLA in computational terms, while the positional voting<br />
based BC and CC consensus functions are, together with EAC, the most time consuming<br />
alternatives. The differences between these three groups are statistically significant, as it<br />
can be inferred from the data measurements presented in table F.4.<br />
F.5 WDBC data set<br />
This section describes the results of the soft consensus clustering experiments conducted on<br />
the WDBC data set.<br />
The φ (NMI) vs CPU time mean ± 2-standard deviation regions of the consensus functions<br />
are depicted in figure F.5. Once again, VMA is the most computationally efficient<br />
consensus function (which, as mentioned earlier, is due to the simultaneity of the cluster<br />
disambiguation and voting processes), closely followed by HGPA. However, the confidence<br />
voting based consensus functions (PC and SC) are pretty close to VMA in CPU time terms,<br />
being slightly faster than CSPA and MCLA. As already noticed in the previous experiments,<br />
positional voting makes the BC and CC consensus functions more computationally costly<br />
(in this case, CC is slightly faster than BC, due to the fact that the low number of clusters<br />
[Figure F.4: φ (NMI) (0 to 1) vs CPU time (sec.) scatterplot regions on the Ionosphere data set for CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC.]<br />
Figure F.4: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Ionosphere data collection.<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 × 0.0001 0.0002 0.0002 × ×<br />
EAC 0.0001 ——— 0.0001 0.0001 0.0001 × × 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— 0.0008 0.0001 0.0001 0.0001 0.0033 0.0031<br />
MCLA 0.012 0.0004 0.0492 ——— 0.0001 0.0009 0.0014 × ×<br />
VMA 0.0336 0.0001 0.0001 × ——— 0.0001 0.0001 0.0001 0.0001<br />
BC × 0.0001 0.0001 0.0002 0.0001 ——— × 0.0003 0.0003<br />
CC × 0.0001 0.0001 0.0002 0.0001 × ——— 0.0004 0.0004<br />
PC 0.0312 0.0001 0.0001 × × 0.0001 0.0001 ——— ×<br />
SC 0.0302 0.0001 0.0001 × × 0.0001 0.0001 × ———<br />
Table F.4: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a paired t-test on the Ionosphere data set. The upper and lower triangular<br />
sections of the table correspond to the comparison in terms of CPU time and φ (NMI) ,<br />
respectively. Statistically non-significant differences (p >0.05) are denoted by the symbol<br />
×.<br />
in this data set –k = 2– does not penalize CondorcetConsensus), while EAC is the least<br />
efficient option.<br />
As far as the quality of the consensus clustering solutions is concerned, PC and SC match<br />
VMA as the best performing consensus functions, showing smaller dispersion in<br />
φ (NMI) terms than BC, CC and MCLA.<br />
If the statistical significance of the CPU time and φ (NMI) differences between consensus<br />
functions is evaluated –see table F.5– it can be observed that PC and SC are, in execution<br />
time terms, equivalent to CSPA and MCLA. As regards the quality of the consensus<br />
clustering solutions, no significant differences are observed between VMA, BC, CC, PC and<br />
SC, which, as aforementioned, turn out to be the best performing consensus functions.<br />
378
φ (NMI)<br />
1<br />
0.8<br />
0.6<br />
0.4<br />
0.2<br />
Appendix F. Experiments on soft consensus clustering<br />
WDBC<br />
0<br />
0 2 4 6<br />
CPU time (sec.)<br />
CSPA<br />
EAC<br />
HGPA<br />
MCLA<br />
VMA<br />
BC<br />
CC<br />
PC<br />
SC<br />
Figure F.5: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the WDBC data collection.<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 × 0.0001 0.0001 0.0002 × ×<br />
EAC 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— 0.0026 0.0001 0.0001 0.0001 0.0267 0.0249<br />
MCLA 0.0001 0.0001 0.0001 ——— 0.0001 0.0002 0.0004 × ×<br />
VMA 0.0001 0.0001 0.0001 0.0057 ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.0001 0.0001 0.0001 × × ——— × 0.0001 0.0001<br />
CC 0.0001 0.0001 0.0001 × × × ——— 0.0001 0.0001<br />
PC 0.0001 0.0001 0.0001 0.0025 × × × ——— ×<br />
SC 0.0001 0.0001 0.0001 0.0103 × × × × ———<br />
Table F.5: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the WDBC data set. The upper and lower triangular<br />
sections of the table correspond to the comparison in terms of CPU time and φ (NMI) ,<br />
respectively. Statistically non-significant differences (p >0.05) are denoted by the symbol<br />
×.<br />
F.6 Balance data set<br />
In this section, the performance of the soft consensus functions is compared through a set<br />
of consensus clustering experiments conducted on the Balance data set.<br />
Figure F.6 depicts the diagram that qualitatively compares the nine consensus functions<br />
in terms of CPU time required for their execution and the φ (NMI) of the consensus clustering<br />
solutions they yield.<br />
As regards the former aspect, we can observe that VMA, PC and SC are the most efficient consensus functions, and the differences between them, though small, are statistically significant according to the t-paired tests presented in table F.6. We can also observe that the BC and CC consensus functions exhibit mid-range execution times, being slower than MCLA and HGPA but faster than CSPA and EAC.
Last, as far as the quality of the consensus clustering solutions is concerned, the consensus functions perform very similarly. In fact, the differences between the top
[Plot: φ (NMI) (0 to 1) vs CPU time (sec., log scale) regions of CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC; BALANCE panel.]
Figure F.6: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Balance data collection.<br />
performing consensus functions (CSPA, VMA, PC and SC) are not statistically significant.<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
EAC 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
HGPA 0.0001 0.0003 ——— 0.0419 0.0001 0.0004 0.0001 0.0001 0.0001<br />
MCLA 0.0291 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0001<br />
VMA × 0.0001 0.0001 × ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.0139 0.0001 0.0001 × × ——— 0.0044 0.0001 0.0001<br />
CC 0.0139 0.0001 0.0001 × × × ——— 0.0001 0.0001<br />
PC × 0.0001 0.0001 × × 0.0322 0.0322 ——— ×<br />
SC × 0.0001 0.0001 × × × × × ———<br />
Table F.6: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Balance data set. The upper and lower triangular<br />
sections of the table correspond to the comparison in terms of CPU time and φ (NMI) ,<br />
respectively. Statistically non-significant differences (p >0.05) are denoted by the symbol<br />
×.<br />
F.7 MFeat data set<br />
The results of the soft consensus clustering experiments conducted on the MFeat data set<br />
are presented in this section. For this purpose, figure F.7 depicts the diagram displaying<br />
the φ (NMI) vs CPU time mean ± 2-standard deviation regions corresponding to the nine<br />
soft consensus functions compared, and table F.7 presents the results of the statistical significance t-paired tests that compare them pairwise.
In time complexity terms, VMA is the best performing consensus function, closely followed by MCLA, HGPA, PC and SC (the latter two being statistically equivalent). Of the two proposed positional voting based consensus functions, BC is clearly more efficient than CC. This is probably due to the larger number of classes (i.e. candidates) in this data set, which makes CC more costly because of the exhaustive pairwise candidate confrontation
involved in the Condorcet voting method. However, executing CC takes approximately as<br />
long as running CSPA, and much less time than doing so with EAC, which is, by far, the<br />
least efficient consensus function.<br />
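The exhaustive pairwise confrontation at the heart of the Condorcet method can be sketched as follows. This is a generic Condorcet tally, not the actual CC implementation; the ballots and candidates are hypothetical, and ranks are assumed strict (no ties). The k(k-1)/2 duels are what makes the cost grow quadratically with the number of candidates.

```python
from itertools import combinations

def condorcet_winner(ballots, candidates):
    """Candidate that wins every pairwise duel, or None if none exists.
    Each ballot maps candidate -> rank (lower rank = more preferred)."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):   # k*(k-1)/2 duels
        a_pref = sum(1 for r in ballots if r[a] < r[b])
        if a_pref > len(ballots) - a_pref:
            wins[a] += 1
        elif a_pref < len(ballots) - a_pref:
            wins[b] += 1
    full = len(candidates) - 1
    return next((c for c, w in wins.items() if w == full), None)

ballots = [{"A": 0, "B": 1, "C": 2},
           {"A": 0, "C": 1, "B": 2},
           {"B": 0, "A": 1, "C": 2}]
print(condorcet_winner(ballots, ["A", "B", "C"]))  # A wins both of its duels
```

Each duel scans every ballot, so the tally costs O(m·k²) for m ballots, which is consistent with CC slowing down as k grows.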
When the quality of the consensus clustering solutions delivered by these consensus<br />
functions is compared, we can see that PC, BC and CC obtain the highest φ (NMI) scores<br />
–with no significant differences among them–, closely followed by VMA, SC and CSPA.<br />
[Plot: φ (NMI) (0 to 1) vs CPU time (sec., log scale) regions of CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC; MFEAT panel.]
Figure F.7: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the MFeat data collection.<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0022 0.0001 0.0001<br />
EAC 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— 0.0013 0.0001 0.0015 0.0001 0.0266 0.026<br />
MCLA 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 × ×<br />
VMA × 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.0007 0.0001 0.0001 0.0001 0.0034 ——— 0.0001 0.0001 0.0001<br />
CC 0.0008 0.0001 0.0001 0.0001 0.0043 × ——— 0.0001 0.0001<br />
PC 0.0382 0.0001 0.0001 0.0001 × × × ——— ×<br />
SC × 0.0001 0.0001 0.0001 × 0.0024 0.003 × ———<br />
Table F.7: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the MFeat data set. The upper and lower triangular<br />
sections of the table correspond to the comparison in terms of CPU time and φ (NMI) ,<br />
respectively. Statistically non-significant differences (p >0.05) are denoted by the symbol<br />
×.<br />
F.8 miniNG data set<br />
In this section we present the results of the soft consensus clustering experiments conducted<br />
on the miniNG data collection. The φ (NMI) vs CPU time diagram of figure F.8 reveals that<br />
three of the proposed voting based consensus functions (BC, PC and SC) constitute a good<br />
trade-off between consensus quality and time complexity.<br />
[Plot: φ (NMI) (0 to 1) vs CPU time (sec., log scale) regions of CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC; miniNG panel.]
Figure F.8: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the miniNG data collection.<br />
Indeed, the consensus clustering solutions they yield are statistically significantly better than those output by state-of-the-art consensus functions such as VMA (which is the least time consuming), CSPA or MCLA (see table F.8 for further details regarding the statistical significance of the differences between consensus functions). The fourth proposed consensus function (CC) also yields higher quality than VMA, CSPA and MCLA, but its time complexity is notably higher, due to the nature of the Condorcet voting method.
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 × 0.0001 0.0015 0.0001 0.0114 0.0115<br />
EAC 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— 0.0041 0.0001 0.0001 0.0001 0.0001 0.0001<br />
MCLA 0.0001 0.0001 0.0001 ——— 0.0001 0.0025 0.0001 0.0185 0.0187<br />
VMA × 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.0051 0.0001 0.0001 0.0001 0.0038 ——— 0.0001 × ×<br />
CC 0.008 0.0001 0.0001 0.0001 0.0061 × ——— 0.0001 0.0001<br />
PC × 0.0001 0.0001 0.0001 × × × ——— ×<br />
SC 0.0001 0.0001 0.0001 0.0001 0.0001 × × 0.0126 ———<br />
Table F.8: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the miniNG data set. The upper and lower triangular<br />
sections of the table correspond to the comparison in terms of CPU time and φ (NMI) ,<br />
respectively. Statistically non-significant differences (p >0.05) are denoted by the symbol<br />
×.<br />
F.9 Segmentation data set<br />
The results of applying the nine soft consensus functions to the Segmentation data set are described next. Figure F.9 presents the φ (NMI) vs CPU time mean ± 2-standard deviation regions employed for comparing them.
Again, VMA is the most computationally efficient consensus function. The proposed<br />
preference voting based clustering combiners (PC and SC), together with MCLA and<br />
[Plot: φ (NMI) (0 to 1) vs CPU time (sec., log scale) regions of CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC; SEGMENTATION panel.]
Figure F.9: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the Segmentation data collection.<br />
HGPA, are the immediate followers. Of the two proposed consensus functions based on positional voting, BC is once more the most efficient (being comparable to CSPA in terms of execution CPU time), as Borda voting is less computationally demanding than Condorcet voting.
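In contrast to the O(k²) pairwise duels of Condorcet voting, a Borda tally needs only one linear pass over each ballot: rank r among k candidates contributes k-1-r points, and the candidate with the most points wins. A generic sketch (not the BC implementation; the ballots are hypothetical):

```python
def borda_winner(ballots, candidates):
    """Each ballot maps candidate -> rank (0 = most preferred); rank r
    among k candidates is worth k - 1 - r points."""
    k = len(candidates)
    points = {c: 0 for c in candidates}
    for ranking in ballots:            # one linear pass per ballot: O(m*k)
        for c, r in ranking.items():
            points[c] += k - 1 - r
    return max(points, key=points.get)

ballots = [{"A": 0, "B": 1, "C": 2},
           {"A": 0, "C": 1, "B": 2},
           {"B": 0, "A": 1, "C": 2}]
print(borda_winner(ballots, ["A", "B", "C"]))  # A: 5 points vs 3 and 1
```

This O(m·k) cost, versus O(m·k²) for an exhaustive pairwise confrontation, is the reason BC scales better than CC as the number of classes grows.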
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 0.0001 0.0001 × 0.0001 0.0001 0.0001<br />
EAC 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— × 0.0001 0.0002 0.0001 × ×<br />
MCLA 0.0001 0.0001 0.0001 ——— 0.0001 0.006 0.0001 × ×<br />
VMA × 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.0006 0.0001 0.0001 0.0001 0.0069 ——— 0.0001 0.0007 0.0007<br />
CC 0.0006 0.0001 0.0001 0.0001 0.0069 × ——— 0.0001 0.0001<br />
PC × 0.0001 0.0001 0.0001 × 0.02 0.02 ——— ×<br />
SC × 0.0001 0.0001 0.0001 × 0.0307 0.0307 × ———<br />
Table F.9: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the Segmentation data set. The upper and lower triangular<br />
sections of the table correspond to the comparison in terms of CPU time and φ (NMI) ,<br />
respectively. Statistically non-significant differences (p >0.05) are denoted by the symbol<br />
×.<br />
However, these two consensus functions (BC and CC) obtain the highest quality consensus clustering solutions, and the difference between their φ (NMI) scores and those of the remaining clustering combiners is statistically significant, as the figures shown in table F.9 reveal. The quality of the consensus clusterings output by the other two voting based consensus functions (PC and SC) is, from a statistical standpoint, equivalent to that of the VMA and CSPA consensus functions.
[Plot: φ (NMI) (0 to 1) vs CPU time (sec., log scale) regions of CSPA, EAC, HGPA, MCLA, VMA, BC, CC, PC and SC; BBC panel.]
Figure F.10: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the BBC data collection.<br />
CSPA EAC HGPA MCLA VMA BC CC PC SC<br />
CSPA ——— 0.0001 0.0001 0.0001 0.0001 0.0002 0.0012 0.0001 0.0001<br />
EAC 0.0001 ——— 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001<br />
HGPA 0.0001 0.0001 ——— × 0.0001 0.0001 0.0001 × ×<br />
MCLA 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 × ×<br />
VMA 0.0456 0.0001 0.0001 0.0001 ——— 0.0001 0.0001 0.0001 0.0001<br />
BC 0.004 0.0001 0.0001 0.0001 × ——— 0.0001 0.0002 0.0002<br />
CC 0.004 0.0001 0.0001 0.0001 × × ——— 0.0001 0.0001<br />
PC × 0.0001 0.0001 0.0001 × × × ——— ×<br />
SC × 0.0001 0.0001 0.0001 × 0.0279 0.0279 × ———<br />
Table F.10: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the BBC data set. The upper and lower triangular sections<br />
of the table correspond to the comparison in terms of CPU time and φ (NMI) , respectively.<br />
Statistically non-significant differences (p >0.05) are denoted by the symbol ×.<br />
F.10 BBC data set<br />
This section presents the results of the soft consensus clustering experiments conducted on the BBC data set. A qualitative description is provided by the φ (NMI) vs CPU time diagram of figure F.10, and the results of the statistical significance study of the differences between consensus functions are presented in table F.10.
It can be observed that VMA is again the fastest consensus function. The confidence voting consensus functions (PC and SC) are, in statistical terms, as fast as MCLA and HGPA. The positional voting consensus functions (BC and CC) are slower than those: BC is still faster than CSPA, while CC is slower than it.
As regards the quality of the consensus clustering solutions obtained, CSPA, PC and SC<br />
yield the highest φ (NMI) scores, being equivalent from a statistical significance viewpoint.<br />
The BordaConsensus and CondorcetConsensus clustering combiners, together with the VMA consensus function, also perform well, being notably better than MCLA (and far better than EAC and HGPA, which yield extremely poor consensus clustering solutions).
F.11 PenDigits data set<br />
The results of the soft consensus clustering experiments conducted on the PenDigits data set are described in the following paragraphs. Due to the number of objects n contained in this collection, the CSPA and EAC consensus functions could not be executed, as their space complexity scales quadratically with n.
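A back-of-the-envelope estimate makes that quadratic footprint concrete; the n below is the usual full PenDigits size, assumed here for illustration.

```python
def coassoc_bytes(n, bytes_per_entry=8):
    """Memory needed by a dense n x n co-association matrix of doubles."""
    return n * n * bytes_per_entry

n = 10992                                # assumed number of PenDigits objects
print(coassoc_bytes(n) // 2**20, "MiB")  # 921 MiB for a single dense matrix
```

Several such matrices (plus intermediate similarity structures) would quickly exhaust the memory of a typical experimental setup, which is consistent with CSPA and EAC not being executable on this collection.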
Thus, as a means for comparing the seven consensus functions applied to this data set, figure F.11 depicts the φ (NMI) vs CPU time mean ± 2-standard deviation regions corresponding to them. It can be observed that, in this case, the four proposed voting based consensus functions are the most time consuming. However, those based on confidence voting (PC and SC) are roughly comparable to HGPA and MCLA, while BC and CC are the most computationally costly (especially the latter). As in the previous cases, VMA is the most efficient of the consensus functions compared.
When the comparison focuses on the φ (NMI) of the consensus clustering solutions yielded by the seven consensus functions, we can observe that the highest quality is obtained by PC and SC, which match VMA in this respect. The other two voting based consensus functions (BC and CC) perform slightly worse, but far better than MCLA and HGPA.
[Plot: φ (NMI) (0 to 1) vs CPU time (sec., log scale) regions of HGPA, MCLA, VMA, BC, CC, PC and SC; PENDIGITS panel.]
Figure F.11: φ (NMI) vs CPU time mean ± 2-standard deviation regions of the soft consensus<br />
functions on the PenDigits data collection.<br />
HGPA MCLA VMA BC CC PC SC<br />
HGPA ——— × 0.0001 0.0005 0.0001 × ×<br />
MCLA 0.0001 ——— 0.0015 0.0001 0.0001 0.0273 0.0269<br />
VMA 0.0001 0.0001 ——— 0.0001 0.0001 0.0002 0.0002<br />
BC 0.0001 0.0001 0.0001 ——— 0.0001 0.0097 0.0098<br />
CC 0.0001 0.0001 0.0001 × ——— 0.0001 0.0001<br />
PC 0.0001 0.0001 × 0.0001 0.0001 ——— ×<br />
SC 0.0001 0.0001 × 0.0001 0.0001 × ———<br />
Table F.11: Significance levels p corresponding to the pairwise comparison of soft consensus<br />
functions using a t-paired test on the PenDigits data set. The upper and lower triangular<br />
sections of the table correspond to the comparison in terms of CPU time and φ (NMI) ,<br />
respectively. Statistically non-significant differences (p >0.05) are denoted by the symbol<br />
×.<br />
This Doctoral Thesis was defended on the day ____ of __________________, ____
at the Centre _______________________________________________________________
of Universitat Ramon Llull
before the Examination Board formed by the undersigned Doctors, having obtained the grade:
President
_______________________________
Member
_______________________________
Member
_______________________________
Member
_______________________________
Secretary
_______________________________
Doctoral candidate