29.04.2013 Views

TESI DOCTORAL - La Salle



Data set name                     Number of     Number of attributes    Number of     Class imbalance
                                  objects (n)   per mode (d_1/d_2)      classes (k)
CAL500 (audio/text)               297           78/127                  16            14.5%–1.3%
IsoLetters (speech/image)         1559          617/16                  26            3.8%–3.8%
InternetAds (object/collateral)   2359          133/1425                2             83.8%–16.2%
Corel (image/text)                3960          500/371                 44            2.2%–2.3%

Table A.3: Summary of the multimodal data sets employed in the experimental sections of this thesis. The “Class imbalance” column presents the percentage of objects in the data set belonging to the most and least populated categories, respectively.
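The “Class imbalance” figures can be reproduced from a list of class labels. The following is a minimal sketch, not the procedure used in the thesis; the function name `class_imbalance` and the toy labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def class_imbalance(labels):
    """Return the shares (in percent) of the most and least populated classes."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)  # number of objects per class
    n = len(labels)           # total number of objects
    shares = [100.0 * c / n for c in counts.values()]
    return max(shares), min(shares)


# Hypothetical toy labels: 5 objects in class "a", 3 in "b", 2 in "c".
toy_labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
most, least = class_imbalance(toy_labels)
print(f"{most:.1f}%–{least:.1f}%")
```

A perfectly balanced data set yields two equal percentages, as in the IsoLetters row of the table.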

in the sense that some attributes are directly related to the object (such as the image geometry features, the caption and the alt text, which total 133 features), while the remaining 1425 features refer to collateral elements such as the anchor text or the URL. We have removed the objects with missing features from the data set (28% of the total), obtaining a reduced version of the InternetAds data set containing 2359 objects.

4. Corel: this is a classic multimodal data set (Duygulu et al., 2002), consisting of 5000 text-annotated images from 50 Corel Stock Photo CDs. Each CD contains 100 images on the same topic, such as “Sunrises and Sunsets”, “Mountains of America” or “Wild Animals” (Bekkerman and Jeon, 2007). Experiments have been conducted on the 3960 images of the training subset of the Corel collection that are assigned at least one topic. The visual modality is encoded as follows (Duygulu et al., 2002): images, represented using 33 color features, are segmented into regions, and these regions are clustered into 500 smaller connected regions (also known as blobs), which are treated as visual terms, so that each image can be expressed in terms of them. As regards the textual modality, every image has a caption (i.e. a brief description of the scene) and an annotation (a list of objects appearing in the image). The vocabulary contains 371 words, and the term vectors are weighted using the tf-idf scheme (van Rijsbergen, 1979).
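The tf-idf weighting mentioned for the textual modality can be sketched as follows. Several tf-idf variants exist and the text does not state which one was used, so this example assumes a common one: raw term frequency multiplied by log(n/df), where n is the number of documents and df the number of documents containing the word. The function name `tfidf_vectors` and the toy captions are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocabulary):
    """Return one tf-idf vector per document, with entries ordered as in vocabulary.

    Assumed variant: weight = tf * log(n / df).
    """
    n = len(documents)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append([
            tf[w] * math.log(n / df[w]) if df[w] else 0.0
            for w in vocabulary
        ])
    return vectors


# Hypothetical toy annotations, one word list per image.
docs = [["sky", "sun", "sun"], ["sky", "mountain"], ["sky", "tiger", "grass"]]
vocab = ["sky", "sun", "mountain", "tiger", "grass"]
vectors = tfidf_vectors(docs, vocab)
print(vectors[0])
```

A word that occurs in every document (here "sky") receives a weight of zero, because it does not help to tell the documents apart.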

A.3 Data representations<br />

A.3.1 Unimodal data representations<br />

As mentioned in section 1.1, in this work objects are expressed in terms of d numerical attributes, so each object is represented as a column vector x ∈ R^d. Therefore, a whole data set containing n objects is mathematically expressed by means of a d × n matrix X. This original object representation is referred to as baseline throughout the thesis.
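The column-wise layout described above can be illustrated with a small sketch in plain Python. The helper names `to_column_matrix` and `column` and the toy values are hypothetical; the only point is that object j occupies column j of the d × n matrix X.

```python
def to_column_matrix(objects):
    """Stack n objects (each a list of d numbers) as the columns of a d x n matrix."""
    d = len(objects[0])
    if any(len(x) != d for x in objects):
        raise ValueError("all objects must have the same number of attributes")
    # Row i of X collects attribute i across all n objects.
    return [[x[i] for x in objects] for i in range(d)]


def column(X, j):
    """Recover object j, i.e. the j-th column of X."""
    return [row[j] for row in X]


# Hypothetical toy data: n = 2 objects with d = 3 attributes each.
objects = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
X = to_column_matrix(objects)
print(len(X), "x", len(X[0]))  # d x n
```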

Starting from the baseline representation, four other object representations have been<br />
