29.04.2013 Views

TESI DOCTORAL - La Salle


1.2. Clustering in knowledge discovery and data mining

– Euclidean distance: possibly the most widely used metric, it is obtained as the particularization of the Minkowski metric for n = 2:

$$D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^{2} \right)^{1/2} \tag{1.2}$$

This is the distance measure used in the most classic implementation of the k-means clustering algorithm, and it tends to form hyperspherical clusters (Xu and Wunsch II, 2005).

– Manhattan distance: also known as city block distance, it is defined as a particular case of the Minkowski metric for n = 1 (see equation (1.3)), and it tends to create hyperrectangular clusters (Xu and Wunsch II, 2005).

$$D(x_i, x_j) = \sum_{l=1}^{d} |x_{il} - x_{jl}| \tag{1.3}$$

– Mahalanobis distance: it can be regarded as a modified version of the Euclidean distance that takes into account the covariance among the attributes. It is defined as follows:

$$D(x_i, x_j) = (x_i - x_j)^{T} S^{-1} (x_i - x_j) \tag{1.4}$$

where S is the sample covariance matrix computed over the whole data set (Jain, Murty, and Flynn, 1999). Algorithms using this distance tend to create hyperellipsoidal clusters (Xu and Wunsch II, 2005).
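The three distance measures in equations (1.2) to (1.4) can be sketched in plain Python as follows. This is only an illustrative sketch: the function names (`minkowski`, `sample_covariance`, `solve`, and so on) are hypothetical helpers, not part of the original text, and the sample covariance is assumed to use the unbiased m − 1 denominator. The Mahalanobis function follows equation (1.4) as written, that is, without a square root; some references and libraries take the square root of this quantity.

```python
def minkowski(x, y, n):
    """Minkowski distance of order n between two equal-length vectors."""
    return sum(abs(a - b) ** n for a, b in zip(x, y)) ** (1.0 / n)


def euclidean(x, y):
    # Equation (1.2): Minkowski metric with n = 2.
    return minkowski(x, y, 2)


def manhattan(x, y):
    # Equation (1.3): Minkowski metric with n = 1.
    return minkowski(x, y, 1)


def sample_covariance(data):
    """Sample covariance matrix S (d x d) of a list of d-dimensional points.

    Assumption: unbiased estimate, dividing by (m - 1).
    """
    m, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / m for k in range(d)]
    return [
        [
            sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in data) / (m - 1)
            for b in range(d)
        ]
        for a in range(d)
    ]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [list(A[r]) + [b[r]] for r in range(size)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, size))
        z[r] = (M[r][size] - tail) / M[r][r]
    return z


def mahalanobis(x, y, S):
    # Equation (1.4): (x - y)^T S^{-1} (x - y), computed by solving
    # S z = (x - y) instead of forming the inverse explicitly.
    diff = [a - b for a, b in zip(x, y)]
    z = solve(S, diff)
    return sum(a * b for a, b in zip(diff, z))
```

When S is the identity matrix, `mahalanobis` reduces to the squared Euclidean distance, which illustrates why it can be seen as a covariance-aware variant of equation (1.2).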

2. Similarity measures

– Cosine similarity: it consists of measuring the cosine of the angle between the vectors representing the objects and, as such, it does not depend on their length. It is defined as follows:

$$S(x_i, x_j) = \frac{x_i^{T} x_j}{\|x_i\| \, \|x_j\|} \tag{1.5}$$

where ||·|| denotes the vector norm.

– Pearson correlation coefficient: a classic concept in probability theory and statistics, correlation measures the strength and direction of the linear relationship between vectors x_i and x_j. The most widely used correlation index is the Pearson correlation coefficient, which is defined in equation (1.6):

$$S(x_i, x_j) = \frac{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)^{2} \, \sum_{l=1}^{d} (x_{jl} - \bar{x}_j)^{2}}} \tag{1.6}$$

where x_{il} is the lth component of vector x_i, and \bar{x}_i denotes its sample mean.
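The two similarity measures in equations (1.5) and (1.6) can be sketched in plain Python as shown below. The function names are illustrative choices rather than anything defined in the original text, the norm is assumed to be the Euclidean norm, and vectors with zero norm (or zero variance, for Pearson) are not handled and would raise a division error.

```python
import math


def cosine_similarity(x, y):
    # Equation (1.5): inner product divided by the product of the norms
    # (assumed here to be Euclidean norms).
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    # Equation (1.6): covariance-like numerator over the square root of the
    # product of the summed squared deviations from each vector's mean.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return num / den
```

Comparing the two definitions shows that the Pearson coefficient equals the cosine similarity of the mean-centred vectors, so it is insensitive to both the scale and the offset of the components, whereas cosine similarity is insensitive to scale only.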
