
1.2. Clustering in knowledge discovery and data mining

[Figure 1.4 appears here: panel (a) is a scatterplot of the toy two-dimensional data set with objects labeled 1 through 9; panel (b) is the single-link dendrogram, plotting Euclidean distance against object index.]

Figure 1.4: A hierarchical clustering toy example: (a) Scatterplot of an artificially generated two-dimensional data set containing n = 9 objects, each identified by a numerical label. (b) Dendrogram resulting from applying the single-link hierarchical agglomerative clustering algorithm to this data set, using the Euclidean distance as the dissimilarity measure. The dashed horizontal line performs a cut on the dendrogram, yielding a 4-way partition with a Euclidean distance between clusters ranging between 0.112 and 0.255.

performing a cut through the dendrogram at the desired level of cluster similarity or by setting the desired number of clusters k, as shown by the dashed horizontal line in figure 1.4(b).
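The cut just described is easy to reproduce with off-the-shelf tools. What follows is a minimal sketch, not taken from this thesis, using SciPy's hierarchical clustering routines; the random data matrix X stands in for the toy data set of figure 1.4(a), and the threshold 0.2 is an arbitrary illustrative value.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-in for the n = 9 toy objects of figure 1.4(a).
X = np.random.rand(9, 2)

# Single-link agglomerative clustering using Euclidean distances.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram at a chosen distance level (cf. the dashed line)...
labels_by_distance = fcluster(Z, t=0.2, criterion='distance')

# ...or request a fixed number of clusters k directly.
labels_by_k = fcluster(Z, t=4, criterion='maxclust')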

On the other hand, whether or not the clusters into which objects are grouped may overlap is an additional factor that splits clustering algorithms into two large categories: hard and soft algorithms.

1. Hard (aka crisp) clustering algorithms: algorithms of this type partition the data set into k disjoint clusters, i.e. each object is assigned to one and only one cluster. In mathematical terms, the result of a hard clustering process on the data set X is an n-dimensional integer-valued row label vector λ = [λ1 λ2 … λn], where λi ∈ {1, 2, …, k}, ∀i ∈ [1, n]. That is, the ith component of the label vector (or labeling, for short) contains the label of the cluster the ith object in the data set is assigned to. For instance, the label vector obtained after applying a classic hard clustering algorithm such as k-means on the artificial toy data set depicted in figure 1.4(a), setting k = 3, is λ = [2 2 2 1 1 1 3 3 3] (see the first code sketch after this list). Notice the symbolic nature of the cluster labels, as the same clustering result would be represented by label-permuted vectors such as λ = [1 1 1 2 2 2 3 3 3] or λ = [3 3 3 1 1 1 2 2 2].

2. Soft (aka fuzzy) clustering algorithms: these algorithms generate a set of k overlapping clusters, i.e. each object is associated with each of the k clusters to a certain degree. Hence, the result of conducting a soft clustering process on the data set X is a k × n real-valued clustering matrix Λ, whose (i, j)th entry indicates the degree of association between the ith cluster and the jth object. This degree of association is typically expressed in terms of the probability of membership of each object in each cluster (see the second code sketch after this list), as done by
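As a concrete illustration of the first category, the following minimal sketch, not taken from this thesis, obtains a hard label vector with scikit-learn's k-means; the random data matrix X is again a hypothetical stand-in for the toy data set.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the toy data set of figure 1.4(a).
X = np.random.rand(9, 2)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# km.labels_ plays the role of λ: its ith entry is the cluster assigned
# to the ith object. scikit-learn numbers clusters 0..k-1 rather than
# 1..k; since labels are symbolic, any permutation of them denotes the
# same partition.
print(km.labels_)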
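For the second category, this minimal sketch, likewise not from the thesis, derives a k × n clustering matrix Λ from the posterior membership probabilities of a Gaussian mixture model, one common way of producing probabilistic soft assignments.

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the toy data set of figure 1.4(a).
X = np.random.rand(9, 2)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# predict_proba returns an n × k matrix of membership probabilities;
# its transpose is the k × n clustering matrix Λ, whose (i, j)th entry
# is the degree of association between cluster i and object j. Each
# column of Λ sums to 1.
Lambda = gmm.predict_proba(X).T
print(Lambda)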
