29.04.2013 Views

TESI DOCTORAL - La Salle



1.2. Clustering in knowledge discovery and data mining<br />

indices.<br />

– a predefined and allegedly correct cluster structure, measuring its degree of resemblance<br />

to the obtained clustering solution by means of external cluster validity indices.<br />

– a clustering solution resulting from another clustering process (e.g. a distinct execution<br />

of the same clustering algorithm but using different parameters), measuring<br />

their relative merit so as to decide which one may best reveal the characteristics of<br />

the objects using relative cluster validity indices.<br />

All three types of evaluation criteria can be used for validating individual clusters,<br />

as well as the output of partitional and hierarchical clustering algorithms (Jain and Dubes,<br />

1988; Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and Vazirgiannis,<br />

2002b).<br />

As regards their applicability, only internal and relative evaluation criteria can be used<br />

to evaluate clustering solutions in real-life scenarios. This is due to the unsupervised<br />

nature of the clustering problem, as the class membership of objects is unknown in practice.<br />

However, in a research context (where ‘correct’ category labels, presumably assigned by an<br />

expert to the objects in the data set, are usually known but not available to the clustering<br />

algorithm), it is sometimes more appropriate to use external validity indices, since clusters<br />

are ultimately evaluated externally by humans (Strehl, 2002). For this reason, this work<br />

relies solely on external evaluation criteria. That said, some recent efforts aim to find<br />

correlations between internal and external cluster validity indices, such as<br />

(Ingaramo et al., 2008). For further insight into internal and relative<br />

validity indices, see (Halkidi, Batistakis, and Vazirgiannis, 2002a; Halkidi, Batistakis, and<br />

Vazirgiannis, 2002b; Maulik and Bandyopadhyay, 2002).<br />

Therefore, evaluation will consist of testing whether the clustering solution reflects the<br />

true group structure of the data set, captured in a reference clustering or ground truth (see footnote 4).<br />

A further advantage of this evaluation approach lies in the fact that external evaluation<br />

measures can be used to compare fairly the performance of clustering algorithms regardless<br />

of their foundations, as they make no assumption about the mechanisms used for finding<br />

the clusters (Strehl, 2002).<br />

There exist multiple ways of comparing a clustering solution to a ground truth. Quite<br />

obviously, different approaches must be followed depending on the nature of the clustering<br />

solution (i.e. whether it is hard or soft, hierarchical or partitional).<br />
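As an illustration of the hard, partitional case, the following minimal sketch compares a clustering solution to a ground truth using two widely known external indices, the Rand index and purity. The choice of these two indices, the function names and the toy labels are assumptions made for this example; the text above does not prescribe a particular index.

```python
from collections import Counter
from itertools import combinations


def rand_index(truth, predicted):
    """Fraction of object pairs on which the two labelings agree.

    A pair counts as an agreement when both labelings place the two
    objects together, or both place them apart.
    """
    if len(truth) != len(predicted):
        raise ValueError("labelings must cover the same objects")
    pairs = list(combinations(range(len(truth)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agreements = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def purity(truth, predicted):
    """Share of objects that belong to the majority true class of their cluster."""
    if len(truth) != len(predicted):
        raise ValueError("labelings must cover the same objects")
    clusters = {}
    for true_label, cluster_id in zip(truth, predicted):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in clusters.values())
    return majority_total / len(truth)


# Toy data (invented for illustration): six objects, two true classes.
ground_truth = ["a", "a", "a", "b", "b", "b"]
clustering = [0, 0, 1, 1, 1, 1]

ri = rand_index(ground_truth, clustering)
pu = purity(ground_truth, clustering)
```

Both functions look only at the two label vectors, never at how the clusters were produced, which is what allows algorithms with different foundations to be compared on equal terms. Neither depends on the actual cluster identifiers, so relabeling the clusters leaves the scores unchanged.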

As regards the evaluation of soft clustering solutions, the main difficulty lies not in the<br />

comparison process (see (Gopal and Woodcock, 1994; Jäger and Benz, 2000) for some classic<br />

approaches), but in the creation of a fuzzy ground truth, which may require applying an<br />

averaging scheme that accounts for systematic biases in the answers of the expert labelers<br />

(Jäger and Benz, 2000; Jomier, LeDigarcher, and Aylward, 2005).<br />
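To make the idea of a fuzzy ground truth concrete, the sketch below derives membership degrees by plain averaging of expert votes. It is a simplified illustration only: the function name, the class names and the votes are invented, and the correction for systematic labeler biases discussed in the cited works is deliberately left out.

```python
from collections import Counter


def fuzzy_ground_truth(votes, classes):
    """Turn per-object expert votes into membership degrees that sum to 1.

    votes: one list of class labels per object, one label per expert.
    classes: the ordered list of class names defining the output columns.
    Assumption: every expert is weighted equally (no bias correction).
    """
    memberships = []
    for object_votes in votes:
        if not object_votes:
            raise ValueError("every object needs at least one expert label")
        counts = Counter(object_votes)
        memberships.append([counts[c] / len(object_votes) for c in classes])
    return memberships


# Hypothetical votes from four experts on three objects.
expert_votes = [
    ["forest", "forest", "forest", "forest"],
    ["forest", "water", "forest", "water"],
    ["water", "water", "water", "forest"],
]
reference = fuzzy_ground_truth(expert_votes, classes=["forest", "water"])
```

Each row of `reference` holds the degrees of membership of one object in each class. A bias-aware scheme would replace the equal weighting with per-expert corrections before averaging.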

As far as the validation of hierarchical clustering solutions is concerned, it is necessary to<br />

use a hierarchical taxonomy as the ground truth. However, this type of ground truth is<br />

4 The unsupervised nature of clustering means that the performance of clustering algorithms cannot be<br />

judged with the same certitude as for supervised classifiers, as the external categorization (ground truth)<br />

might not be optimal. For instance, the way web pages are organized in the Yahoo! taxonomy is possibly not<br />

the best structure possible, but achieving a grouping similar to the Yahoo! taxonomy is certainly indicative<br />

of successful clustering (Strehl, 2002).<br />
