
Extracting Features with Transformers

Feature selection

We will often have a large number of features to choose from, but we wish to select only a small subset. There are many possible reasons for this:

• Reducing complexity: Many data mining algorithms need more time and resources as the number of features increases. Reducing the number of features is a great way to make an algorithm run faster or with fewer resources.

• Reducing noise: Adding extra features doesn't always lead to better performance. Extra features may confuse the algorithm, which may find correlations and patterns that have no real meaning (this is common in smaller datasets). Choosing only the appropriate features reduces the chance of such spurious correlations.

• Creating readable models: While many data mining algorithms will happily compute an answer for models with thousands of features, the results may be difficult for a human to interpret. In these cases, it may be worth using fewer features and creating a model that a human can understand.

Some classification algorithms can handle data with issues such as these. Even so, getting the data right and choosing features that effectively describe the dataset you are modeling can still assist algorithms.

There are some basic tests we can perform, such as ensuring that the features are at least different. If a feature's values are all the same, it can't give us any extra information to perform our data mining.
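As a quick sketch of this test (the sample values here are illustrative, not from the text), a feature whose values are all identical has zero variance, which is easy to check with NumPy:

```python
import numpy as np

# A feature whose values are all the same has zero variance,
# so it cannot help us distinguish one sample from another
constant_feature = np.array([5, 5, 5, 5])
varying_feature = np.array([1, 4, 2, 7])

print(np.var(constant_feature))  # 0.0 -- carries no information
print(np.var(varying_feature))   # positive -- worth keeping
```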

The VarianceThreshold transformer in scikit-learn, for instance, will remove any feature that doesn't have at least a minimum level of variance in its values. To show how this works, we first create a simple matrix using NumPy:

import numpy as np
X = np.arange(30).reshape((10, 3))

The result is the numbers zero to 29, in three columns and 10 rows. This represents a synthetic dataset with 10 samples and three features:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])
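To sketch how VarianceThreshold could then be applied to this matrix, we can deliberately make one feature constant and let the transformer drop it. The choice of the second column and the default threshold of zero are illustrative assumptions here, not taken from the text:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Recreate the synthetic dataset: 10 samples, three features
X = np.arange(30).reshape((10, 3))

# Make the second feature constant so it carries no information
X[:, 1] = 1

# With the default threshold of 0, only zero-variance features are removed
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

print(vt.variances_)  # per-feature variances; the middle one is 0
print(Xt.shape)       # the constant column has been dropped: (10, 2)
```

After fitting, the `variances_` attribute shows the computed variance of each original feature, which is useful for choosing a sensible threshold on real data.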
