29.04.2013 Views

TESI DOCTORAL - La Salle


A.2. Data sets

Data set name   Number of objects (n)   Number of attributes (d)   Number of classes (k)   Class imbalance
Zoo             101                     17                         7                       40.6%–3.9%
Iris            150                     4                          3                       33.3%–33.3%
Wine            178                     13                         3                       39.9%–26.9%
Glass           214                     10                         6                       35.5%–4.2%
Ionosphere      351                     34                         2                       64.1%–35.9%
WDBC            569                     32                         2                       62.7%–37.3%
Balance         625                     4                          3                       46.1%–7.8%
Mfeat           2000                    649                        10                      10%–10%
miniNG          2000                    6679                       20                      5%–5%
Segmentation    2100                    19                         7                       14.3%–14.3%
BBC             2225                    6767                       5                       22.9%–17.3%
PenDigits       7494                    16                         10                      10.4%–9.6%

Table A.2: Summary of the unimodal data sets employed in the experimental sections of this thesis. The “Class imbalance” column presents the percentage of objects in the data set belonging to the most and least populated categories, respectively.
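The “Class imbalance” figures can be reproduced from a list of class labels. The following is a minimal sketch, not the thesis code: the function names `class_imbalance` and `format_imbalance` are hypothetical, and rounding to one decimal place is an assumption based on how the table entries are printed.

```python
from collections import Counter


def class_imbalance(labels):
    """Return (most, least): the percentage of objects in the most and
    least populated classes, in that order."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)          # objects per class
    n = len(labels)                   # total number of objects
    most = 100.0 * max(counts.values()) / n
    least = 100.0 * min(counts.values()) / n
    return most, least


def format_imbalance(labels):
    """Format the two percentages in the style of the table column
    (one decimal place is an assumption)."""
    most, least = class_imbalance(labels)
    return f"{most:.1f}%\u2013{least:.1f}%"


# Toy example: three perfectly balanced classes of 50 objects each.
toy_labels = ["a"] * 50 + ["b"] * 50 + ["c"] * 50
print(format_imbalance(toy_labels))
```

For a perfectly balanced data set both percentages coincide, as in the Iris, Mfeat, miniNG and Segmentation rows.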

9. miniNG: a reduced version of the 20 Newsgroups text data set containing only 2000 objects (text articles posted on Usenet), each belonging to one of 20 predefined thematic classes (e.g. sci.electronics, rec.sport.baseball or talk.politics.mideast). Typical text preprocessing tasks, such as the removal of stop words and of terms appearing in fewer than 4 documents (document frequency thresholding), give rise to a bag-of-words representation of each article in a 6679-dimensional tf-idf-weighted (i.e. real-valued) term space (Sebastiani, 2002).

10. Segmentation: known as the Image Segmentation data set, it contains 2100 outdoor image regions represented by 19-dimensional real-valued feature vectors, each to be classified into one of seven texture classes: brickface, sky, foliage, cement, window, path and grass. We have employed the test subset of the Segmentation collection.

11. BBC: this data set was obtained from the online repository of the Machine Learning Group of University College Dublin (http://mlg.ucd.ie/content/view/21/). It consists of 2225 documents from the BBC news website corresponding to stories in five topical areas (business, entertainment, politics, sport, tech). The original document representation used a 9636-dimensional term space, which was reduced to 6767 real-valued attributes after removing terms with a document frequency smaller than or equal to 4 (Sebastiani, 2002).

12. PenDigits: originally named the Pen-Based Recognition of Handwritten Digits data set, its training subset contains 7494 digitized handwritten digits (from 0 to 9) captured using a pressure-sensitive tablet. Each object is represented by 16 integer attributes corresponding to the (x, y) coordinates of the electronic pen on the tablet, sampled every 100 milliseconds.
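The text preprocessing described for the miniNG and BBC data sets (stop word removal, document frequency thresholding and tf-idf weighting) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline used in the thesis: the function name `build_tfidf`, the whitespace tokenizer, the stop word list and the weighting variant tf × log(N/df) are all assumptions, since the text does not specify them. With this parameterisation, keeping terms that appear in at least 4 documents (miniNG) corresponds to `min_df=4`, while removing terms with a document frequency smaller than or equal to 4 (BBC) corresponds to `min_df=5`.

```python
import math
from collections import Counter


def build_tfidf(docs, stop_words, min_df=4):
    """Return (vocab, matrix), where matrix[i][j] is the tf-idf weight of
    term vocab[j] in document i.

    Assumptions (not taken from the thesis): lowercase whitespace
    tokenization and the weighting tf * log(N / df).
    """
    # Tokenize and drop stop words.
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in docs
    ]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Document frequency thresholding: keep terms with df >= min_df.
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    index = {term: j for j, term in enumerate(vocab)}

    n_docs = len(docs)
    matrix = []
    for tokens in tokenized:
        tf = Counter(tok for tok in tokens if tok in index)
        row = [0.0] * len(vocab)
        for term, count in tf.items():
            row[index[term]] = count * math.log(n_docs / df[term])
        matrix.append(row)
    return vocab, matrix


# Toy corpus with a low threshold so that some terms survive.
toy_docs = ["the cat sat", "the cat ran", "a dog ran", "the bird sat"]
vocab, matrix = build_tfidf(toy_docs, stop_words={"the", "a"}, min_df=2)
print(vocab)
```

Each row of `matrix` is a real-valued bag-of-words vector whose dimensionality equals the number of terms that survive the threshold, which is how vocabularies of the size reported above (6679 and 6767 terms) would arise from the full collections.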
