29.04.2013 Views

TESI DOCTORAL - La Salle


Chapter 1. Framework of the thesis

φ (NMI)      direct-cos-i2    graph-cos-i2
BBC          0.807 (P100)     0.603 (P83)
PenDigits    0.658 (P84)      0.829 (P99)

Table 1.1: Illustration of the clustering algorithm indeterminacy on the BBC and PenDigits data sets clustered by the direct-cos-i2 and graph-cos-i2 algorithms.

index (Davies and Bouldin, 1979), Dunn’s index (Dunn, 1973), the Calinski-Harabasz index (Calinski and Harabasz, 1974) or the I index (Maulik and Bandyopadhyay, 2002)) can be applied for determining the most appropriate number of clusters by comparing the relative merit of several clustering solutions obtained for distinct values of k (Halkidi, Batistakis, and Vazirgiannis, 2002b). Unfortunately, the performance of these indices is data dependent, which gives rise to a further indeterminacy (Xu and Wunsch II, 2005). Finally, in the context of clustering based on mixture densities, the number of clusters can be determined through the optimization of criterion functions such as Akaike’s Information Criterion (Akaike, 1974) or the Bayesian Information Criterion (Schwarz, 1978), among others.
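The idea of comparing clustering solutions obtained for distinct values of k through a relative validity index can be sketched as follows. This is a minimal illustration, not the procedure used in this thesis: it implements the Davies-Bouldin index in plain Python (Euclidean distance, lower values are better) and applies it to a made-up two-dimensional toy data set with two hand-written candidate partitions. The names `davies_bouldin`, `points` and `candidates` are assumptions introduced only for this example.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a hard partition (lower is better)."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    # Centroid and mean point-to-centroid distance (scatter) of each cluster.
    centroids, scatter = {}, {}
    for label, members in clusters.items():
        dim = len(members[0])
        centroid = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
        centroids[label] = centroid
        scatter[label] = sum(math.dist(m, centroid) for m in members) / len(members)

    # For every cluster keep its worst (largest) scatter-to-separation ratio.
    # Note: this sketch assumes that no two centroids coincide.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data (illustrative only): three well-separated groups of three points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

# Hand-written candidate partitions for k = 2 and k = 3.
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1, 1],  # merges the last two groups
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
}

# Pick the k whose partition minimises the index (k = 3 on this toy data).
best_k = min(candidates, key=lambda k: davies_bouldin(points, candidates[k]))
```

On real collections the ranking produced by such an index may change from one data set to another, which is precisely the data dependence noted above.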

In this work, the number of clusters is assumed to be known from the start of the clustering process, and k is set equal to the number of classes defined by the ground truth of each data set. However, in a real-world scenario, this parameter should be tuned using one or more of the previously mentioned techniques.

To illustrate the clustering algorithm selection indeterminacy, Table 1.1 presents the values of φ (NMI) (and their percentiles in the global φ (NMI) distribution of each case) obtained from evaluating the clustering solutions yielded by the graph-cos-i2 and direct-cos-i2 algorithms (graph-based and direct clustering algorithms using the cosine distance, respectively) operating on the baseline data representation of the objects contained in the BBC and PenDigits data sets (see footnote 6). As just mentioned, the desired number of clusters k is set to the number of classes in each data set. It can be observed that these two algorithms, despite using the same object similarity measure and optimizing the same criterion function, offer almost opposite performances on these two data collections: direct-cos-i2 reaches the 100th percentile on BBC (0.807) but only the 84th on PenDigits (0.658), whereas graph-cos-i2 reaches the 99th percentile on PenDigits (0.829) but only the 83rd on BBC (0.603). Hence, no absolute claim about the superiority of either algorithm can be made (see appendix B.1 for more experimental results regarding the clustering algorithm selection indeterminacy).
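As a rough illustration of how a φ (NMI) score and its percentile could be computed, the sketch below evaluates a clustering against ground-truth labels and ranks a score within a population of scores. It rests on assumptions that this excerpt does not confirm: mutual information is normalised by the geometric mean of the two entropies, and the percentile is taken as the share of scores less than or equal to the given one. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural logarithm) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, predicted):
    """Normalised mutual information: I(T;P) / sqrt(H(T) * H(P)).

    The geometric-mean normalisation is an assumption of this sketch.
    """
    if len(truth) != len(predicted):
        raise ValueError("both labellings must cover the same objects")
    n = len(truth)
    joint = Counter(zip(truth, predicted))
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    mutual_info = sum(
        (c / n) * math.log(c * n / (truth_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denominator = math.sqrt(entropy(truth) * entropy(predicted))
    # A single-cluster labelling has zero entropy; return 0.0 by convention here.
    return mutual_info / denominator if denominator > 0 else 0.0


def percentile_rank(score, population):
    """Percentage of values in `population` that are <= `score`."""
    return 100.0 * sum(value <= score for value in population) / len(population)
```

Because the measure depends only on how objects are grouped, relabelling the clusters (for example, swapping cluster identifiers 0 and 1) leaves the score unchanged.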

It is worth noting that the decision problems caused by the previously described clustering indeterminacies are multiplied in the case of clustering multimodal data. In this context, besides the data representation and clustering algorithm selection indeterminacies, the clustering practitioner must face an additional set of questions with no clear answer, such as:

– should one modality dominate the clustering process? If so, which one, and to what extent?

– should the modalities be fused? If so, how should the fusion process be conducted?
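To make these questions concrete, the following sketch shows one possible way of fusing two modalities at the feature level, with a single weight controlling how much one modality dominates. It is only an illustrative assumption, not the fusion strategy studied in this thesis: each modality vector is scaled to unit length and the two are concatenated with factors sqrt(w) and sqrt(1 - w), so that the cosine similarity between fused objects equals w times the similarity in the first modality plus (1 - w) times the similarity in the second. All names and toy vectors are hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


def fuse(modality_a, modality_b, weight_a):
    """Weighted concatenation of two unit-normalised modality vectors.

    weight_a = 1.0 lets the first modality dominate completely,
    weight_a = 0.0 lets the second one dominate completely.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    scale_a = math.sqrt(weight_a)
    scale_b = math.sqrt(1.0 - weight_a)
    return [scale_a * x for x in unit(modality_a)] + [
        scale_b * x for x in unit(modality_b)
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy objects (illustrative only): dissimilar in modality A, identical in modality B.
object1 = fuse([1.0, 0.0], [1.0, 1.0], weight_a=0.25)
object2 = fuse([0.0, 1.0], [1.0, 1.0], weight_a=0.25)
fused_similarity = cosine(object1, object2)  # 0.25 * 0 + 0.75 * 1
```

The choice of the weight is itself one more parameter without an obvious setting, which is the kind of indeterminacy the questions above point to.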

To illustrate the effect of these and the aforementioned indeterminacies in a multimodal clustering scenario, we have conducted several clustering experiments on the multimodal data sets presented in appendix A.2.2.

6 Refer to appendices A.1, A.2 and A.3 for a detailed description of the clustering algorithms, the data sets and the data representations employed in the experimental sections of this thesis.
