
TESI DOCTORAL - La Salle


1.2. Clustering in knowledge discovery and data mining

the data, not provided by an external source (Jain, Murty, and Flynn, 1999).

Being such a generic task, clustering has found applications in multiple research fields, among which we can name the following few:

– information retrieval, where clustering has been applied for organizing the results returned by a search engine in response to a user's query (i.e., post-retrieval clustering) (Tombros, Villa, and van Rijsbergen, 2002; Hearst, 2006), for refining ambiguous queries input to retrieval systems (Käki, 2005), or for improving their performance (van Rijsbergen, 1979).

– text mining, where browsing through large document collections is simplified if they are previously clustered (Cutting et al., 1992; Steinbach, Karypis, and Kumar, 2004; Dhillon, Fan, and Guan, 2001).

– computational genomics, where clustering of gene expression data from DNA microarray experiments is applied for identifying the functionality of genes, finding out which genes are co-regulated, or distinguishing the important genes between abnormal and normal tissues (Zhao and Karypis, 2003a; Jiang, Tang, and Zhang, 2004).

– economics, where clustering economic and financial time series can be employed for identifying i) areas or sectors for policy-making purposes, ii) structural similarities in economic processes for economic forecasting, iii) stable dependencies for risk management and investment management (Focardi, 2001), or iv) customer profiles and customer-product relationships (Liu and Luo, 2005).

– computer vision, where clustering is applied for common procedures such as image preprocessing (Jain, 1996), segmentation (Mancas-Thillou and Gosselin, 2007) and matching (Miyajima and Ralescu, 1993).

Regardless of the application, the desired result of any clustering process is a maximally representative partition of the data set, which usually corresponds to clusters with high intra-cluster and low inter-cluster object similarities. In the quest for this goal, a myriad of clustering methods have been proposed. With no claim of being exhaustive, the next section presents an overview of some of the most relevant clustering methods, highlighting some important concepts in this context.
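As an informal illustration of the criterion of high intra-cluster and low inter-cluster similarity, the following Python sketch compares two partitions of the same toy data set. It is not a method taken from this work: the choice of cosine similarity, the function names, the data points and both label assignments are assumptions made purely for the example.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def mean_intra_inter_similarity(points, labels):
    """Return (mean intra-cluster, mean inter-cluster) pairwise similarity.

    Assumes the partition yields at least one same-cluster pair and at
    least one different-cluster pair.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        sim = cosine_similarity(points[i], points[j])
        # Pairs sharing a label count as intra-cluster, all others as inter-cluster.
        (intra if labels[i] == labels[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two groups of 2-D objects pointing in different directions.
points = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.1, 1.0), (0.2, 0.9), (0.0, 1.0)]
good_labels = [0, 0, 0, 1, 1, 1]  # partition that follows the two groups
poor_labels = [0, 1, 0, 1, 0, 1]  # partition that mixes the two groups

good_intra, good_inter = mean_intra_inter_similarity(points, good_labels)
poor_intra, poor_inter = mean_intra_inter_similarity(points, poor_labels)

print(f"good partition: intra={good_intra:.3f}, inter={good_inter:.3f}")
print(f"poor partition: intra={poor_intra:.3f}, inter={poor_inter:.3f}")
```

Under these assumptions, the partition that follows the two groups shows a higher mean intra-cluster similarity and a lower mean inter-cluster similarity than the partition that mixes them, which is the sense in which it is the more representative of the two.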

1.2.1 Overview of clustering methods<br />

Several excellent and extensive surveys on clustering can be found in the literature (Jain, Murty, and Flynn, 1999; Berkhin, 2002; Kotsiantis and Pintelas, 2004; Xu and Wunsch II, 2005). Providing a detailed description of the existing clustering methods lies beyond the scope of this work, so the reader interested in their ins and outs is referred to the previously cited surveys and references therein. However, due to the central role of clustering processes in this thesis, this section presents a brief description of several key issues in this context, such as:

1. A categorization of clustering algorithms according to generic criteria.

2. A brief discussion on similarity measures, one of the central notions in clustering.
