
in cluster l according to λ, n_{h,l} denotes the number of objects in cluster h according to γ as well as in group l according to λ, n is the number of objects contained in the data set, and k is the number of clusters into which the objects are clustered according to λ and γ (Strehl, 2002).

Thus, the more similar the clustering solutions represented by the label vector λ and the ground truth γ, the closer φ^(NMI)(γ, λ) will be to 1. As the ground truth γ is assumed to represent the true partition of the data, high quality clusterings will attain φ^(NMI)(γ, λ) values close to unity. As a consequence, given two label vectors λ_1 and λ_2, the former will be considered to be better than the latter if φ^(NMI)(γ, λ_1) > φ^(NMI)(γ, λ_2), and vice versa.
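
As a concrete illustration of this comparison, the short Python sketch below computes φ^(NMI)(γ, λ) from the contingency counts n_{h,l} and the cluster sizes n_h and n_l defined above, following the sample estimate of Strehl (2002). The function name, the NumPy implementation and the toy label vectors are illustrative choices rather than material from the thesis.

```python
import numpy as np

def nmi(gamma, lam):
    """phi^(NMI)(gamma, lam): normalized mutual information between two
    label vectors, following the sample estimate of Strehl (2002).
    (Function name and implementation details are illustrative.)"""
    gamma, lam = np.asarray(gamma), np.asarray(lam)
    n = gamma.size                                  # number of objects
    _, g_idx = np.unique(gamma, return_inverse=True)
    _, l_idx = np.unique(lam, return_inverse=True)
    # n_{h,l}: objects in cluster h according to gamma and group l according to lam
    contingency = np.zeros((g_idx.max() + 1, l_idx.max() + 1))
    np.add.at(contingency, (g_idx, l_idx), 1)
    n_h = contingency.sum(axis=1)                   # cluster sizes under gamma
    n_l = contingency.sum(axis=0)                   # cluster sizes under lam
    nz = contingency > 0                            # skip empty cells (0 log 0 = 0)
    mutual_info = (contingency[nz] *
                   np.log(n * contingency[nz] / np.outer(n_h, n_l)[nz])).sum()
    # denominator: geometric mean of the (scaled) entropies of the two partitions
    h_gamma = -(n_h * np.log(n_h / n)).sum()
    h_lam = -(n_l * np.log(n_l / n)).sum()
    return mutual_info / np.sqrt(h_gamma * h_lam)

gamma = [0, 0, 1, 1, 2, 2]      # toy ground truth
lam1  = [1, 1, 0, 0, 2, 2]      # same partition, different label names
lam2  = [0, 1, 0, 1, 0, 1]      # cuts across the ground truth clusters
print(nmi(gamma, lam1), nmi(gamma, lam2))
```

In the toy example, λ_1 reproduces the ground truth partition up to a renaming of the labels and scores 1, whereas λ_2 cuts across it and scores 0, so λ_1 would be preferred.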

1.3 Multimodal clustering

The ubiquity of multimedia data has motivated an increasing interest in clustering techniques capable of dealing with multimodal data. In the following paragraphs we review some of the most relevant works on clustering multimodal data.

Possibly one of the first works to mention the multimedia clustering problem was (Hinneburg and Keim, 1998). The authors placed special emphasis on highlighting the two main challenges faced by clustering algorithms in this context: the high dimensionality of the feature vectors and the existence of noise. To tackle these problems, they proposed DENCLUE, a density-based clustering algorithm capable of dealing satisfactorily with both issues. However, in that work, multimodality seemed to be more of a pretext for justifying the challenges of clustering high-dimensional noisy data than an interest in itself.

This was not the case of the browsing and retrieval system for collections of text-annotated web images presented in (Chen et al., 1999), which was a multimodal extension of the Scatter/Gather document browser of (Cutting et al., 1992). In this case, multiple clusterings were created upon text and image features independently. In particular, clustering on image features was employed as part of a query refinement process. Therefore, this proposal is multimodal in the sense that the features of the distinct modalities are employed for clustering the image collection, but they were still not fully integrated into the clustering process.

In contrast, the true multimodality of the clustering approach proposed in (Barnard and Forsyth, 2001) was guaranteed by modeling the probabilities of word and image feature occurrences and co-occurrences. It consisted of a statistical hierarchical generative model, fitted with the EM algorithm, which organizes image databases using both image features and their associated text. In subsequent works, the learnt joint distribution of image regions and words was exploited in several applications, such as the prediction of words associated with whole images or with image regions (Barnard et al., 2003).
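
The following toy sketch illustrates that kind of word prediction in a deliberately simplified form: instead of the hierarchical generative model of Barnard and Forsyth, it approximates the joint distribution of image-region clusters and words by a plain co-occurrence table and suggests annotation words for an image by averaging the word distributions of its regions. All names, sizes and counts here are hypothetical placeholders, not details from the cited works.

```python
import numpy as np

# Hypothetical stand-in for a learnt joint distribution of image-region
# clusters and annotation words: counts[r, w] records how often region
# cluster r co-occurs with word w in a training set (random placeholders).
vocabulary = ["sky", "water", "grass", "building"]
n_region_clusters = 50
counts = np.random.default_rng(0).integers(
    1, 20, size=(n_region_clusters, len(vocabulary)))

# p(word | region cluster), estimated from the co-occurrence counts
p_word_given_region = counts / counts.sum(axis=1, keepdims=True)

def predict_words(region_clusters, top_k=2):
    """Suggest annotation words for a whole image from the clusters assigned
    to its regions, by averaging the per-region word distributions."""
    scores = p_word_given_region[region_clusters].mean(axis=0)
    return [vocabulary[i] for i in np.argsort(scores)[::-1][:top_k]]

print(predict_words([3, 17, 42]))  # words suggested for an image with these regions
```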

Multimodal clustering has also been applied to the discovery of perceptual clusters for disambiguating the semantic meaning of text-annotated images (Benitez and Chang, 2002). To do so, the images are clustered based on either the visual or the text feature descriptors. Moreover, the system could also conduct multimodal clustering upon any combination of text and visual feature descriptors by performing an early fusion of these. Principal Component Analysis was used to integrate the feature descriptors and to reduce their dimensionality before clustering.
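
A minimal sketch of such an early-fusion pipeline follows, assuming the visual and text descriptors are already available as per-image feature matrices. The feature dimensionalities, the number of retained components and the use of k-means are illustrative choices, not details reported by Benitez and Chang.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
visual = rng.normal(size=(500, 128))   # placeholder visual descriptors (one row per image)
text = rng.normal(size=(500, 300))     # placeholder text descriptors

# Early fusion: standardize each modality so neither dominates, then concatenate.
def standardize(x):
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)

fused = np.hstack([standardize(visual), standardize(text)])

# PCA integrates the fused descriptors and reduces their dimensionality.
reduced = PCA(n_components=30).fit_transform(fused)

# Cluster the reduced multimodal representation.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(reduced)
```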

As regards the results obtained by multimodal clustering, the authors highlighted the uncorrelatedness of visual and text feature descriptors, which suggests that

