10.11.2016 Views

Learning Data Mining with Python


Extracting Features with Transformers

The result is 10, which corresponds to finishing one year past high school. If we didn't have this numerical feature, we could compute the median by creating an ordering over the education values.
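As a minimal sketch of that ordering idea, the snippet below ranks a handful of education labels and picks the middle one. The label list, its order, and the sample are illustrative assumptions, not values read from the Adult dataset.

```python
# Hypothetical ordering of a few education levels, lowest to highest.
# Both the labels and the sample are illustrative assumptions.
education_order = ["HS-grad", "Some-college", "Bachelors", "Masters"]
rank = {level: i for i, level in enumerate(education_order)}

sample = ["Bachelors", "HS-grad", "Some-college", "HS-grad", "Masters"]

# Sort the sample by rank, then take the middle element
# (the lower of the two middle elements for even-length samples).
ordered = sorted(sample, key=rank.__getitem__)
median_level = ordered[(len(ordered) - 1) // 2]
print(median_level)  # Some-college
```

Taking the lower middle element avoids averaging two labels, which has no meaning for ordered categories.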

Features can also be categorical. For instance, a ball can be a tennis ball, cricket ball, football, or any other type of ball. Categorical features are also referred to as nominal features. For nominal features, the values are either the same or they are different. While we could rank balls by size or weight, the category alone isn't enough to compare things. A tennis ball is not a cricket ball, and it is also not a football. We could argue that a tennis ball is more similar to a cricket ball (say, in size), but the category alone doesn't capture this: two values are either the same, or they are not.

We can convert categorical features to numerical features using one-hot encoding, as we saw in Chapter 3, Predicting Sports Winners with Decision Trees. For the aforementioned categories of balls, we can create three new binary features: is a tennis ball, is a cricket ball, and is a football. For a tennis ball, the vector would be [1, 0, 0]. A cricket ball has the values [0, 1, 0], while a football has the values [0, 0, 1]. These features are binary, but many algorithms can treat them as continuous features. One key reason for doing this is that it allows direct numerical comparison (such as computing the distance between samples).
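The encoding described above can be sketched in plain Python as follows. The helper names `one_hot` and `euclidean` are hypothetical, introduced only for this illustration; in practice a library encoder would normally be used instead.

```python
import math

categories = ["tennis ball", "cricket ball", "football"]

def one_hot(value):
    """Return a binary vector with a 1 in the position of `value` (hypothetical helper)."""
    return [1 if value == c else 0 for c in categories]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (hypothetical helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

tennis = one_hot("tennis ball")    # [1, 0, 0]
cricket = one_hot("cricket ball")  # [0, 1, 0]
football = one_hot("football")     # [0, 0, 1]

# Distinct categories are all equally far apart; identical ones are at distance 0.
print(euclidean(tennis, cricket))
print(euclidean(tennis, football))
print(euclidean(tennis, tennis))
```

Because every pair of different categories ends up at the same distance, the encoding makes samples numerically comparable without implying that any one category ranks above another.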

The Adult dataset contains several categorical features, with Work-Class being one example. While we could argue that some values are of higher rank than others (for instance, a person with a job is likely to have a better income than a person without), such a ranking doesn't make sense for all values. For example, a person working for the state government is not more or less likely to have a higher income than someone working in the private sector.

We can view the unique values for this feature in the dataset using the unique() method:

adult["Work-Class"].unique()

The result shows the unique values in this column:

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked', nan], dtype=object)

There are some missing values in this column (they appear in the preceding output as nan and as the ' ?' placeholder), but they won't affect our computations in this example.
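If the missing values did matter for a later computation, one possible way to inspect them is sketched below. The small Series is a hypothetical stand-in for adult["Work-Class"], and treating ' ?' as missing is an assumption about how the placeholder should be handled.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adult["Work-Class"]; the values are illustrative.
work_class = pd.Series(
    [" State-gov", " Private", " ?", " Private", np.nan], name="Work-Class"
)

# Strip the leading space that each category carries in the output above.
cleaned = work_class.str.strip()

# Assumption: treat the "?" placeholder as missing, alongside genuine NaN values.
cleaned = cleaned.mask(cleaned == "?")

n_missing = int(cleaned.isna().sum())
categories = sorted(cleaned.dropna().unique())
print(n_missing)   # 2
print(categories)  # ['Private', 'State-gov']
```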
