
1.4. Clustering indeterminacies

The other major source of indeterminacy is the selection of the particular clustering algorithm to apply. In this sense, there are several critical questions that must be answered:

– What type of algorithm should we apply? Hierarchical or partitional? Hard or soft?

– Once the type of clustering algorithm is selected, which specific clustering algorithm should be applied?

– How should the parameters of the clustering algorithm be tuned?

– Into how many clusters should the data objects be grouped?

The selection of the type of clustering algorithm depends mainly on the desired shape of the clustering solution. In any case, it is worth noting that soft and hierarchical clustering algorithms can be regarded as generalizations of their hard and partitional counterparts, as the latter can always be obtained from the former.
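This relationship is easy to make concrete. The following Python sketch (a minimal illustration assuming NumPy and SciPy are available; the membership values and data are invented for the example) shows how a hard partition is recovered from soft membership degrees by maximum-membership assignment, and how a flat partitional solution is obtained from a hierarchy by cutting the dendrogram at a given number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Soft clustering: each row holds one object's membership degrees in 3 clusters.
memberships = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])
# Hardening: assign each object to the cluster of maximum membership.
hard_labels = memberships.argmax(axis=1)
print(hard_labels)  # -> [0 1 2]

# Hierarchical clustering: cutting the dendrogram at a fixed number of
# clusters yields a flat (partitional) solution.
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 2))
tree = linkage(data, method='average')
flat_labels = fcluster(tree, t=3, criterion='maxclust')
print(flat_labels)  # cluster labels in {1, 2, 3}
```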

Moreover, it is commonplace that no universally superior clustering algorithm exists, as most of the proposals found in the literature have been designed to solve particular problems in specific fields (Xu and Wunsch II, 2005), thus being able to outperform the other existing algorithms in a concrete context, but not in others (Jain, Murty, and Flynn, 1999). This fact has been theoretically analyzed and demonstrated by the impossibility theorem in (Kleinberg, 2002). Thus, unless some domain knowledge enables clustering practitioners to choose a specific algorithm, this selection is often made blindly to a large extent.

Once the algorithm is chosen, its parameters must be set. Again, this is not a trivial choice, as they largely determine its behaviour. Examples of the sensitivity of popular clustering algorithms to their parameter tuning are easy to find in the literature: for instance, there is no universal method for identifying the initial centroids of k-means. The EM clustering algorithm is highly sensitive to the selection of its initial parameters and to the effect of a singular covariance matrix, as it can converge to a local optimum (Xu and Wunsch II, 2005). In combinatorial search based clustering, there are no theoretical guidelines for selecting appropriate and effective parameters, while the selection of the graph Laplacian is a major issue affecting the performance of spectral clustering algorithms, much as kernel-based clustering algorithms are affected by the choice of the width of their Gaussian kernels (Xu and Wunsch II, 2005).
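The sensitivity of k-means to its initialization is straightforward to observe empirically. The sketch below (a minimal illustration assuming scikit-learn; the synthetic blob data and seed values are invented for the example) runs k-means once per random initialization and reports the final inertia, which can differ noticeably across seeds.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

for seed in range(3):
    # n_init=1 forces a single run per seed, so each result reflects
    # one particular random choice of initial centroids.
    km = KMeans(n_clusters=5, init='random', n_init=1, random_state=seed)
    km.fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
```

In practice, this sensitivity is commonly mitigated by restarting the algorithm from several initializations and keeping the solution with the lowest inertia, which is precisely what a larger n_init value does.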

Finally, one has to decide the number of clusters k into which the objects must be grouped, as many clustering algorithms (e.g. partitional ones) require this value to be passed as one of their parameters. Unfortunately, in most cases the number of classes in a data set is unknown, so it becomes one more parameter to be guessed. Moreover, determining the ‘correct’ number of clusters in a data set is an open question: in some cases, equally satisfying (though substantially different) clustering solutions can be obtained with different values of k for the same data, proving that the right value of k often depends on the scale at which the data is inspected (Chakravarthy and Ghosh, 1996).

Notwithstanding, there exist several practical procedures for determining the most suitable value of k. Possibly the most intuitive approach consists in visualizing the data set in a two-dimensional space, although this strategy is of little use for complex data sets (Xu and Wunsch II, 2005). Additionally, relative cluster validity indices (such as the Davies-Bouldin index) can be employed to estimate the most suitable number of clusters.
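As an illustration of this index-based procedure, the following sketch (a minimal illustration assuming scikit-learn; the synthetic data and the candidate range of k are invented for the example) computes the Davies-Bouldin index for several values of k and keeps the value that minimizes it, since lower values indicate better-separated, more compact clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic data with 4 groups; in real use the true k is unknown.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

# Davies-Bouldin is a minimization criterion: pick the k with the lowest score.
best_k = min(scores, key=scores.get)
print(f"best k by Davies-Bouldin: {best_k}")
```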

