
Chapter 1. Framework of the thesis

[Figure: two panels plotting φ(NMI) against the number of dimensions for the Baseline, PCA, ICA, NMF and RP representations; panel (a) Wine data set, rbr-corr-e1, 3–13 dimensions; panel (b) miniNG data set, rbr-corr-e1, 50–390 dimensions.]

Figure 1.5: Illustration of the data representation indeterminacy on the clustering results of the (a) Wine and (b) miniNG data sets clustered by the rbr-corr-e1 algorithm.

of the data⁵. In all cases, the desired number of clusters k is set to the number of classes defined by the ground truth of each data set.

The results of these clustering processes are presented in figure 1.5, which displays the normalized mutual information (φ(NMI)) values between these clustering solutions and the ground truth of each data collection. It can be observed that, on the Wine data set (figure 1.5(a)), the clustering solution obtained by operating on the original representation is worse (i.e. it attains a lower φ(NMI) score) than all but three of the feature-extraction based representations. In particular, the maximum φ(NMI) value is attained using an 8-dimensional PCA transformation of the original data. However, these results cannot be generalized: in fact, it is the baseline object representation that yields the best results when the same clustering algorithm is applied to the miniNG data set (see figure 1.5(b)).
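The comparison above can be sketched in a few lines. The following is a minimal illustration, not a reproduction of the thesis experiments: k-means stands in for the rbr-corr-e1 algorithm, and scikit-learn's bundled Wine data replaces the preprocessed version used here; the 8-dimensional PCA setting mirrors the one mentioned in the text.

```python
# Sketch: comparing the baseline representation against a PCA-reduced one
# via normalized mutual information (NMI) with the ground-truth classes.
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
k = len(set(y))  # desired number of clusters = number of ground-truth classes

def nmi_of_clustering(data):
    # Cluster the given representation and score it against the ground truth.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return normalized_mutual_info_score(y, labels)

baseline_nmi = nmi_of_clustering(X)
pca_nmi = nmi_of_clustering(PCA(n_components=8).fit_transform(X))
print(f"baseline phi(NMI) = {baseline_nmi:.3f}")
print(f"8-dim PCA phi(NMI) = {pca_nmi:.3f}")
```

Running the same loop over several representations and dimensionalities yields exactly the kind of φ(NMI)-versus-dimensions curves shown in figure 1.5.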

If we analyze the data representation that yields the best clustering results across the 12 unimodal data sets described in appendix A.2, we observe a rather even distribution: baseline (23% of the time), LSA (31%), NMF (31%) and RP (15%), which reinforces the notion that no intrinsically superior data representation exists (see appendix B.1 for more experimental results regarding data representation clustering indeterminacies).

Moreover, notice the remarkable influence of the dimensionality of the data representation on the value of φ(NMI): it is not only important to select the right type of representation, but its dimensionality is also a critical factor. Although several approaches exist for determining the most natural dimensionality of the data (d_nat), such as the eigenvalue spectrum in PCA (Duda, Hart, and Stork, 2001) or the reconstruction error in NMF (Sevillano et al., 2009), it is not trivial to ensure that a clustering algorithm will yield its best performance when operating on the d_nat-dimensional representation of the data.
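One of the approaches just mentioned, reading d_nat off the PCA eigenvalue spectrum, can be sketched as follows. The 90% explained-variance threshold is an illustrative assumption, not a value taken from the thesis, and scikit-learn's Wine data again stands in for the preprocessed collection.

```python
# Sketch: estimating a "natural" dimensionality d_nat from the PCA
# eigenvalue spectrum via a cumulative explained-variance cut-off.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest dimensionality whose components retain 90% of the variance
# (the 0.9 threshold is an assumption for illustration only).
d_nat = int(np.searchsorted(cumvar, 0.9) + 1)
print(f"d_nat = {d_nat} of {X.shape[1]} dimensions")
```

Note that, as the text warns, nothing guarantees that a clustering algorithm attains its best φ(NMI) at this d_nat; the estimate only describes the spectrum of the data, not the behavior of the clusterer.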

To sum up, this modest example illustrates the relevance of the data representation indeterminacy: an incorrect choice of data representation may ruin the results of a clustering process.

⁵ For a detailed description of the clustering algorithms, the data sets and the data representations employed in the experimental sections of this thesis, see appendices A.1, A.2 and A.3, respectively.
