10.11.2016 Views

Learning Data Mining with Python


Also, we want to set the final column (column index #1558), which is the class, to a binary feature. In the Adult dataset, we created a new feature for this. In this dataset, we will convert the feature while we load it.

converters[1558] = lambda x: 1 if x.strip() == "ad." else 0
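To see what this converter does in isolation, the following minimal sketch applies the same expression to a few raw strings. The name `class_converter` and the sample strings are assumptions made for illustration; they are not rows taken from ad.data.

```python
# The same conversion as in the text, bound to a hypothetical name for testing.
class_converter = lambda x: 1 if x.strip() == "ad." else 0

# Illustrative inputs (assumed, not taken from the real file).
# strip() removes surrounding whitespace before the comparison.
samples = ["ad.", " ad. ", "nonad.", ""]
labels = [class_converter(s) for s in samples]
print(labels)
```

Anything that is not exactly "ad." after stripping whitespace is mapped to 0, so unexpected or empty values are silently treated as non-advertisements.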

Chapter 5

Now we can load the dataset using read_csv. We use the converters parameter to pass our custom conversion into pandas:

ads = pd.read_csv(data_filename, header=None, converters=converters)
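The effect of the converters parameter can be tried on a small in-memory file before loading the full dataset. This is a sketch under stated assumptions: the three rows, the four-column layout, and the names `raw`, `sample_converters`, and `sample` are invented for illustration, and column 3 stands in for column 1558.

```python
import io
import pandas as pd

# A tiny stand-in for ad.data: three numeric columns and a class column.
# These rows are made up for illustration only.
raw = "125,125,1.0,ad.\n57,468,8.2105,ad.\n60,120,2.0,nonad.\n"

# Map a column index to the function applied to each raw string in that column.
sample_converters = {3: lambda x: 1 if x.strip() == "ad." else 0}

# header=None tells pandas the file has no header row, so columns are numbered 0..3.
sample = pd.read_csv(io.StringIO(raw), header=None, converters=sample_converters)
print(sample)
```

The class column arrives as integers, while the columns without a converter are parsed by pandas as usual.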

The resulting dataset is quite large, with 1,559 features and more than 2,000 rows. The first five rows can be printed by entering ads[:5] into a new cell.

This dataset describes images on websites, with the goal of determining whether a given image is an advertisement or not.

The features in this dataset are not described well by their headings. There are two files accompanying the ad.data file that have more information: ad.DOCUMENTATION and ad.names. The first three features are the height, width, and ratio of the image size. The final feature is 1 if it is an advertisement and 0 if it is not.

Each of the other features is 1 if a certain word is present in the URL, alt text, or caption of the image. These words, such as sponsor, are used to determine whether the image is likely to be an advertisement. Many of the features overlap considerably, as they are combinations of other features. Therefore, this dataset has a lot of redundant information.
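One simple way to get a feel for this kind of redundancy is to look for columns that are exact copies of one another. The sketch below is only an illustration on a toy frame: the column names and values are hypothetical, and exact duplication is just the most extreme form of overlap (correlated but non-identical columns would not be caught).

```python
import pandas as pd

# Toy binary features; names and values are hypothetical, not from ad.data.
toy = pd.DataFrame({
    "word_in_url":     [1, 0, 1, 0],
    "word_in_alt":     [1, 0, 1, 0],  # identical to word_in_url
    "word_in_caption": [0, 1, 1, 0],
})

# Transpose so each feature becomes a row, then flag rows that repeat an earlier one.
duplicate_mask = toy.T.duplicated()
redundant = duplicate_mask[duplicate_mask].index.tolist()
print(redundant)
```

On a frame with over a thousand columns the same check still works, although the transpose makes a full copy of the data.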

With our dataset loaded in pandas, we will now extract the X and y data for our classification algorithms. The X matrix will be all of the columns in our DataFrame, except for the last column. In contrast, the y array will be only that last column, feature #1558. Let's look at the code:

X = ads.drop(1558, axis=1).values

y = ads[1558]
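The same split can be checked on a toy frame to confirm the resulting shapes. This is a minimal sketch: the two rows are invented, and `label_column` is a hypothetical name playing the role of column 1558.

```python
import pandas as pd

# Two made-up rows: three feature columns followed by a binary class column.
toy = pd.DataFrame([[125, 125, 1.0, 1], [60, 120, 2.0, 0]])
label_column = 3  # stands in for column 1558 in the real dataset

# drop(..., axis=1) removes a column; .values returns the data as a NumPy array.
X = toy.drop(label_column, axis=1).values
# Selecting a single column returns a pandas Series.
y = toy[label_column]
print(X.shape, y.shape)
```

Note that y here is a pandas Series rather than a NumPy array; appending .values, as is done for X, would give a plain array if one is needed.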

