10.11.2016 Views

Learning Data Mining with Python


Chapter 5

The downside to transforming data with PCA is that the resulting features are often complex combinations of the original features. For example, the first feature of the preceding code starts with [-0.092, -0.995, -0.024], that is, multiply the first feature in the original dataset by -0.092, the second by -0.995, and the third by -0.024. This feature has 1,558 values of this form, one for each of the original features (although many are zeros). Such features are hard for humans to interpret, and it is hard to glean much relevant information from them without a lot of experience working with them.
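To make the "weighted combination" idea concrete, here is a minimal sketch. It assumes a small synthetic dataset with three features (not the chapter's dataset) and uses scikit-learn's `PCA`. It checks that the first transformed feature equals the mean-centered original features multiplied by the weights in the first row of `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 100 samples, 3 features with different scales.
rng = np.random.RandomState(14)
X = rng.normal(size=(100, 3)) * np.array([1.0, 5.0, 0.5])

pca = PCA(n_components=2)
Xd = pca.fit_transform(X)

# Each row of components_ holds one weight per original feature.
first_component = pca.components_[0]
print(first_component)

# The first PCA feature is the weighted sum of the mean-centered original features.
manual_first_feature = (X - pca.mean_) @ first_component
matches = np.allclose(manual_first_feature, Xd[:, 0])
print(matches)
```

With 1,558 original features, the same row would hold 1,558 weights, which is why the transformed features are hard to read by eye.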

Using PCA can result in models that not only approximate the original dataset, but can also improve the performance in classification tasks:

clf = DecisionTreeClassifier(random_state=14)
scores_reduced = cross_val_score(clf, Xd, y, scoring='accuracy')

The resulting score is 0.9356, which is (slightly) higher than our original score using all of the original features. PCA won't always give a benefit like this, but it does more often than not.
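The comparison can be reproduced in a self-contained form as sketched below. The dataset here is synthetic (generated with `make_classification`), and the choice of five components is an assumption for illustration, so the scores will not match the 0.9356 reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (assumption): 500 samples, 50 features.
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=14)

# Reduce to five components (assumed value), playing the role of Xd in the text.
Xd = PCA(n_components=5).fit_transform(X)

clf = DecisionTreeClassifier(random_state=14)
scores_original = cross_val_score(clf, X, y, scoring='accuracy')
scores_reduced = cross_val_score(clf, Xd, y, scoring='accuracy')

print("Original features: {:.4f}".format(np.mean(scores_original)))
print("PCA features: {:.4f}".format(np.mean(scores_reduced)))
```

Note that this sketch fits PCA on the whole dataset before cross-validating, mirroring the text. Wrapping PCA and the classifier in a scikit-learn `Pipeline` would instead refit the transformation inside each fold.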

We are using PCA here to reduce the number of features in our dataset. As a general rule, you shouldn't use it to reduce overfitting in your data mining experiments. The reason for this is that PCA doesn't take classes into account. A better solution is to use regularization. An introduction, with code, is available at http://blog.datadive.net/selecting-good-features-part-ii-linear-modelsand-regularization/.
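As a rough illustration of what regularization does, the sketch below fits two logistic regression models on synthetic data (an assumption, not the chapter's dataset) with different values of scikit-learn's `C` parameter. Smaller `C` means a stronger penalty, which shrinks the learned weights. This is only one of several ways to regularize and is not taken from the linked article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 300 samples, 30 features.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=14)

# In scikit-learn, C is the inverse of the regularization strength.
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

# Stronger regularization pulls the weights toward zero.
weak_norm = np.abs(weak.coef_).sum()
strong_norm = np.abs(strong.coef_).sum()
print("Sum of absolute weights, weak penalty:", weak_norm)
print("Sum of absolute weights, strong penalty:", strong_norm)
```

Unlike PCA, the penalty is applied while fitting against the class labels, so the model decides which features to downweight based on how useful they are for the classification task.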

Another advantage is that PCA allows you to plot datasets that you otherwise couldn't easily visualize. For example, we can plot the first two features returned by PCA.

First, we tell IPython to display plots inline and import pyplot:

%matplotlib inline
from matplotlib import pyplot as plt

Next, we get all of the distinct classes in our dataset (there are only two: is ad or not ad):

classes = set(y)

We also assign colors to each of these classes:

colors = ['red', 'green']

We use zip to iterate over both lists at the same time:

for cur_class, color in zip(classes, colors):
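The body of this loop is not shown in this excerpt. As a self-contained sketch of the plotting steps above, the following uses synthetic data and an assumed five-component PCA in place of the chapter's `Xd` and `y`. The loop body (a boolean mask per class and a `scatter` call) is an assumption about how such a plot is typically drawn, not the book's code. The classes are sorted so that the class-to-color pairing is stable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): two classes, 20 features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=14)
Xd = PCA(n_components=5).fit_transform(X)

classes = sorted(set(y))   # sorted so each class always gets the same color
colors = ['red', 'green']

fig, ax = plt.subplots()
for cur_class, color in zip(classes, colors):
    # Select only the rows that belong to the current class.
    mask = (y == cur_class)
    # Plot the first PCA feature against the second.
    ax.scatter(Xd[mask, 0], Xd[mask, 1], marker='o', color=color,
               label="class {}".format(cur_class))
ax.set_xlabel("First PCA feature")
ax.set_ylabel("Second PCA feature")
ax.legend()
```

In a notebook with `%matplotlib inline`, the figure displays automatically. In a script, call `fig.savefig(...)` with a filename of your choice to write it to disk.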
