29.04.2013 Views

TESI DOCTORAL - La Salle


Chapter 1. Framework of the thesis<br />

not always available, as some domains are more prone to be organized hierarchically than<br />

others. Possibly due to this fact, not much research has been done on external hierarchical<br />

clustering evaluation (Patrikainen and Meila, 2005). Some examples of the few existing<br />

hierarchical clustering comparison methods are simple layer-wise comparison (Fowlkes and<br />

Mallows, 1983) and cophenetic matrices (Theodoridis and Koutroumbas, 1999), although<br />

they also have their shortcomings (Patrikainen and Meila, 2005). For this reason, the most<br />

widespread strategy is to compare the clustering found at a certain level of the dendrogram<br />

with a partitional ground truth. Unfortunately, this approach does not take into account the<br />

cluster hierarchy in any way, which defeats the purpose if the hierarchical clustering<br />

solution is to be validated as a whole.<br />

Given all these considerations, and provided that the outputs of soft and hierarchical<br />

clustering algorithms can always be converted to hard and partitional clustering<br />

solutions, respectively (see section 1.2.1), the most common cluster validation procedure<br />

consists in comparing hard partitional clustering solutions (i.e. label vectors) with the<br />

same type of ground truths (i.e. comparing cluster labels with class labels) (Strehl, 2002).<br />

The following paragraphs are devoted to a description of some relevant external validity<br />

indices for evaluating hard partitional clustering solutions.<br />

The many possible ways of comparing the class labels contained in the ground truth<br />

label vector γ and the cluster labels in a label vector λ can be categorized into two groups<br />

depending on whether they are based on i) object pairwise matching, or ii) cluster matching.<br />

Object pairwise matching cluster validity indices are based on counting how many object<br />

pairs (xi, xj), ∀ i ≠ j, are clustered together and separately in both γ and λ (the more coincidences,<br />

the higher the similarity between the clustering solution and the ground truth).<br />

Following this rationale, several validity indices have been proposed, such as the Rand index<br />

(Rand, 1971), the Adjusted Rand index (Hubert and Arabie, 1985), the Fowlkes-Mallows<br />

index (Fowlkes and Mallows, 1983) or the Jaccard index, among others; see (Halkidi, Batistakis,<br />

and Vazirgiannis, 2002a; Denoeud and Guénoche, 2006).<br />
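
The pair-counting rationale above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names `pair_counts`, `rand_index` and `jaccard_index` are hypothetical, the label vectors are assumed to be plain sequences of equal length, and the quadratic loop over pairs is only meant to make the counting explicit, not to be efficient.

```python
from itertools import combinations


def pair_counts(gamma, lam):
    """Classify every object pair (i, j), i != j, by how gamma and lam treat it.

    Returns (a, b, c, d):
      a: together in both gamma and lam
      b: separated in both gamma and lam
      c: together in gamma only
      d: together in lam only
    """
    if len(gamma) != len(lam):
        raise ValueError("label vectors must have the same length")
    a = b = c = d = 0
    for i, j in combinations(range(len(gamma)), 2):
        same_class = gamma[i] == gamma[j]
        same_cluster = lam[i] == lam[j]
        if same_class and same_cluster:
            a += 1
        elif not same_class and not same_cluster:
            b += 1
        elif same_class:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(gamma, lam):
    """Fraction of object pairs on which gamma and lam agree."""
    a, b, c, d = pair_counts(gamma, lam)
    total = a + b + c + d
    if total == 0:
        raise ValueError("at least two objects are required")
    return (a + b) / total


def jaccard_index(gamma, lam):
    """Like the Rand index, but ignoring pairs separated in both vectors."""
    a, _, c, d = pair_counts(gamma, lam)
    if a + c + d == 0:
        raise ValueError("no pair is grouped together in either vector")
    return a / (a + c + d)
```

For instance, with γ = [0, 0, 1, 1] and λ = [0, 1, 1, 1], one pair is grouped together in both vectors, two are separated in both, and three are treated differently, so the Rand index is 3/6 and the Jaccard index is 1/4.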

Cluster matching cluster validity indices measure the degree of agreement between the<br />

assignment of objects to classes (according to γ) and clusters (as designated by λ). Typical<br />

examples of this kind of validity indices are the Larsen index (Larsen and Aone, 1999), the<br />

Van Dongen index (van Dongen, 2000), variation of information (Meila, 2003), entropy or<br />

mutual information (Cover and Thomas, 1991), to name a few.<br />

In all the experimental sections of this work, the cluster validity index used for evaluating<br />

clustering results is normalized mutual information, denoted as φ(NMI). This choice is<br />

motivated by the fact that φ(NMI) is theoretically well-founded, unbiased, symmetric with<br />

respect to λ and γ, and normalized in the [0, 1] interval: the higher the value of φ(NMI), the<br />

more similar λ and γ are (Strehl, 2002). Mathematically, normalized mutual information<br />

is defined as follows:<br />

$$
\phi^{(\mathrm{NMI})}(\gamma, \lambda) =
\frac{\displaystyle\sum_{h=1}^{k}\sum_{l=1}^{k} n_{h,l}\,\log\left(\frac{n \cdot n_{h,l}}{n_h^{(\gamma)}\, n_l^{(\lambda)}}\right)}
{\sqrt{\left(\displaystyle\sum_{h=1}^{k} n_h^{(\gamma)} \log\frac{n_h^{(\gamma)}}{n}\right)\left(\displaystyle\sum_{l=1}^{k} n_l^{(\lambda)} \log\frac{n_l^{(\lambda)}}{n}\right)}}
\tag{1.8}
$$

where $n_h^{(\gamma)}$ is the number of objects in cluster $h$ according to $\gamma$, $n_l^{(\lambda)}$ is the number of objects
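
As an illustration of how equation (1.8) might be evaluated from two label vectors, the following Python sketch counts the class sizes, the cluster sizes and the joint class-cluster counts, and then divides the mutual information term by the geometric mean of the two entropy terms, the normalization used by Strehl (2002). The function name `normalized_mutual_information` is hypothetical, empty class-cluster intersections are skipped (treating 0 log 0 as 0), and returning 0.0 when either vector contains a single group is a convention assumed here for the undefined 0/0 case, not something stated in the text.

```python
import math
from collections import Counter


def normalized_mutual_information(gamma, lam):
    """Evaluate phi^(NMI)(gamma, lam) for two label vectors of equal length."""
    n = len(gamma)
    if n == 0 or n != len(lam):
        raise ValueError("label vectors must be non-empty and of equal length")

    n_gamma = Counter(gamma)          # objects per class according to gamma
    n_lambda = Counter(lam)           # objects per cluster according to lam
    n_joint = Counter(zip(gamma, lam))  # objects shared by each class-cluster pair

    # Numerator: only non-empty intersections appear in n_joint,
    # so terms with a zero count are skipped automatically.
    numerator = sum(
        count * math.log(n * count / (n_gamma[h] * n_lambda[l]))
        for (h, l), count in n_joint.items()
    )

    # Each sum below is non-positive, so their product is non-negative.
    sum_gamma = sum(c * math.log(c / n) for c in n_gamma.values())
    sum_lambda = sum(c * math.log(c / n) for c in n_lambda.values())
    denominator = math.sqrt(sum_gamma * sum_lambda)

    if denominator == 0.0:
        # Assumed convention: a vector with a single group carries no information.
        return 0.0
    return numerator / denominator
```

With γ = [0, 0, 1, 1], the relabelled but identical partition λ = [1, 1, 0, 0] scores 1 (up to floating-point rounding), while λ = [0, 1, 0, 1], which splits every class evenly across clusters, scores 0.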
