10.11.2016 Views

Learning Data Mining with Python


Extracting Features with Transformers

Next, we need to load the dataset with pandas. First, we set the data's filename as always:

import os
from collections import defaultdict

import numpy as np
import pandas as pd

data_folder = os.path.join(os.path.expanduser("~"), "Data")
data_filename = os.path.join(data_folder, "Ads", "ad.data")

There are a couple of issues with this dataset that stop us from loading it easily. First, the first few features are numerical, but pandas will load them as strings. To fix this, we need to write a converting function that converts strings to numbers where possible. Where that is not possible, the function gives us a NaN (short for Not a Number), a special floating-point value indicating that the input could not be interpreted as a number. It plays a role similar to None in Python or null in other programming languages.
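
To make the behavior of NaN concrete, here is a small sketch. The sample array `values` is made up for illustration and is not taken from the dataset. It shows that NaN is an ordinary float that never compares equal to itself, that it is not the same object as None, and that it propagates through arithmetic unless we use a NaN-aware function.

```python
import numpy as np

nan_value = np.nan

# NaN is a regular Python float, not a separate type.
print(type(nan_value))

# NaN never compares equal to anything, including itself,
# so we test for it with np.isnan instead of ==.
print(nan_value == nan_value)
print(np.isnan(nan_value))

# Unlike None, NaN can sit in a float array alongside real numbers.
values = np.array([1.0, np.nan, 3.0])  # illustrative data only

# An ordinary mean is "poisoned" by the NaN; np.nanmean skips it.
plain_mean = np.mean(values)
nan_aware_mean = np.nanmean(values)
print(plain_mean, nan_aware_mean)
```

Because NaN is a float, a column that contains missing values can still be stored as a numeric column, which is why we prefer it to a placeholder string here.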

Another issue with this dataset is that some values are missing. These are represented in the dataset using the string ?. Luckily, the question mark doesn't convert to a float, so we can turn those values into NaNs using the same approach. In later chapters, we will look at other ways of dealing with missing values like this.
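
As a quick check of that claim, the following sketch tries to parse a few strings with Python's built-in float. The helper name `parses_as_float` and the sample strings are hypothetical and only serve the illustration.

```python
def parses_as_float(text):
    """Return True if float() accepts the string, False otherwise."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Illustrative raw values: two numbers (one padded with spaces),
# the missing-value marker, and a non-numeric label.
raw_values = ["125", "  468", "?", "   ?", "ad."]

for raw in raw_values:
    print(repr(raw), parses_as_float(raw))
```

Note that float tolerates surrounding whitespace in numeric strings, while the question mark fails with or without padding.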

We will create a function that does this conversion for us:

def convert_number(x):

First, we try to convert the string to a number. We surround the conversion in a try/except block and catch the ValueError exception, which is what Python raises if a string cannot be converted into a number this way:

    try:
        return float(x)
    except ValueError:

Finally, if the conversion fails, we return a NaN, which comes from the NumPy library we imported previously:

        return np.nan
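
Since the function was built up piece by piece, it may help to see it in one place. The block below assembles the same function and tries it on a few made-up strings. The docstring and the sample inputs are our additions.

```python
import numpy as np


def convert_number(x):
    """Return x as a float, or NaN if it cannot be parsed as a number."""
    try:
        return float(x)
    except ValueError:
        return np.nan


# Illustrative inputs: two numeric strings, the missing-value marker,
# and a non-numeric label.
converted = [convert_number(v) for v in ["125", "1.5", "?", "ad."]]
print(converted)
```

Only ValueError is caught, so an input that is not a string or number at all (for example None, which raises a TypeError) would still surface as an error rather than silently becoming NaN.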

Now, we create a dictionary for the conversion. We want to convert all of the features to floats:

converters = defaultdict(convert_number)
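
To show how such a dictionary is consumed, here is a minimal sketch that passes converters to pandas' read_csv on a tiny in-memory sample. Several things here are assumptions made for the illustration: the two sample rows, the column count `n_numeric`, and the use of a plain dict that lists each column index explicitly. The plain dict is a swapped-in choice rather than the defaultdict used in the text. We make it because a defaultdict calls its factory with no arguments when a key is missing, whereas convert_number expects one argument.

```python
import io

import numpy as np
import pandas as pd


def convert_number(x):
    """Return x as a float, or NaN if it cannot be parsed as a number."""
    try:
        return float(x)
    except ValueError:
        return np.nan


# Made-up sample: four numeric columns followed by a class label.
sample_csv = "125,125,1.0,1,ad.\n?,?,?,0,nonad.\n"

# Hypothetical column count for this sample only.
n_numeric = 4

# Map each numeric column index to the converting function.
converters = {i: convert_number for i in range(n_numeric)}

sample = pd.read_csv(io.StringIO(sample_csv), header=None,
                     converters=converters)
print(sample)
```

The question marks in the second row come back as NaN, while the label column, which has no converter, is left as a string.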
