
1.4. Clustering indeterminacies

executed if the user is not satisfied with the evaluation of the mined patterns. When clustering is the data mining task of the knowledge discovery process, these important decisions are often made blindly, due to the unsupervised nature of the clustering problem. Unfortunately, these decisions determine, to a large extent, the effectiveness of the clustering task (Jain, Murty, and Flynn, 1999), so they should not be made lightly.

Thus, obtaining a good-quality clustering solution relies heavily on making optimal (or quasi-optimal) decisions at every stage of the KD process. The doubts that beset clustering practitioners when making these decisions are caused by what we call clustering indeterminacies, which mainly concern the selection of i) the way objects are represented, and ii) the clustering algorithm to be applied.

As regards the decision on data representation, ideal features should make it possible to distinguish objects belonging to different clusters, while also being robust to noise and easy to extract and interpret (Xu and Wunsch II, 2005). In the blind quest for such a data representation, the clustering practitioner is confronted with the following questions:

– how should the objects be represented? Should we stick to their original representation, select a subset of the original attributes (i.e. feature selection), or transform them into a new feature space (i.e. feature extraction)?
– if the original data representation is subject to a dimensionality reduction process, what should the dimensionality of the reduced feature space be?
– if the original data representation is subject to a feature selection process, which criterion should be followed?
– if feature extraction is applied, which criterion should guide it?

Regrettably, whereas these questions are easy to answer in a supervised classification scenario (e.g. the optimal feature subset can be chosen by maximizing some function of predictive classification performance (Kohavi and John, 1998), or by applying a feature transformation driven by class labels (Torkkola, 2003)), they have no clear or universal answer in an unsupervised context. This is because, in clustering, the lack of class labels makes feature selection a necessarily ad hoc and often trial-and-error process (Dy and Brodley, 2004). Moreover, studies comparing the influence of feature extraction-based object representations on clustering performance often reach contradictory conclusions (Tang et al., 2004; Shafiei et al., 2006; Cobo et al., 2006; Sevillano, Alías, and Socoró, 2007b).
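To make the contrast concrete, the sketch below illustrates how the availability of class labels turns feature selection into a well-posed search: a wrapper procedure in the spirit of Kohavi and John (1998) can score candidate feature subsets by cross-validated predictive accuracy. It is a hypothetical illustration assuming scikit-learn, not the procedure used in this thesis; the Wine dataset and the subset size are chosen purely for the example.

```python
# Minimal sketch (illustrative, not the thesis's procedure): a wrapper-style
# feature selection made possible by class labels. Assumes scikit-learn.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=5),
    n_features_to_select=5,      # illustrative subset size
    scoring="accuracy",          # the supervised criterion the labels enable
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
# In clustering there is no y, so no analogous objective criterion exists.
```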

To illustrate the effect and importance of the data representation clustering indeterminacy, in the following paragraphs we present experimental evidence that the selection of a specific object representation can condition the quality of a clustering process to a large extent. In particular, we have represented the objects contained in the Wine and miniNG data collections using multiple data representations: the original attributes (referred to as baseline) and feature extraction-based representations, obtained by means of Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NMF) and Random Projection (RP), on a range of distinct dimensionalities.
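By way of illustration, the following is a minimal sketch of how such a pool of representations could be generated. It assumes scikit-learn, uses only the publicly available Wine data (miniNG is a collection specific to this thesis and is not reproduced here), and the target dimensionalities are arbitrary example values; it is not the exact experimental code.

```python
# Minimal sketch: baseline attributes plus PCA, ICA, NMF and RP projections
# at several target dimensionalities. Assumes scikit-learn.
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA, FastICA, NMF
from sklearn.random_projection import GaussianRandomProjection

# Scale to [0, 1] so that NMF's non-negativity requirement is met.
X = MinMaxScaler().fit_transform(load_wine().data)

extractors = {
    "PCA": lambda d: PCA(n_components=d),
    "ICA": lambda d: FastICA(n_components=d, max_iter=1000),
    "NMF": lambda d: NMF(n_components=d, max_iter=1000, init="nndsvda"),
    "RP":  lambda d: GaussianRandomProjection(n_components=d),
}

representations = {"baseline": X}
for dim in (2, 5, 10):                 # a range of distinct dimensionalities
    for name, make in extractors.items():
        representations[f"{name}-{dim}"] = make(dim).fit_transform(X)

for key, Xr in representations.items():
    print(f"{key}: {Xr.shape}")
```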

On each object representation, we have applied a refined repeated bisecting clustering algorithm based on the correlation similarity measure for obtaining a partition

