Building Machine Learning Systems with Python - Richert, Coelho

Chapter 11

Coming back to Scikit-learn, we find various excellent wrapper classes in the sklearn.feature_selection package. A real workhorse in this field is RFE, which stands for recursive feature elimination. It takes an estimator and the desired number of features to keep as parameters, and then repeatedly trains the estimator on shrinking feature sets until it has found a feature subset of the desired size. The RFE instance itself behaves like an estimator, thereby wrapping the provided estimator.

In the following example, we create an artificial classification problem of 100 samples using the convenient make_classification() function of the datasets module. It lets us specify the creation of 10 features, out of which only three are really valuable for solving the classification problem:

>>> from sklearn.feature_selection import RFE
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=10,
...                            n_informative=3, random_state=0)
>>> clf = LogisticRegression()
>>> clf.fit(X, y)
>>> selector = RFE(clf, n_features_to_select=3)
>>> selector = selector.fit(X, y)
>>> print(selector.support_)
[False  True False  True False False False False  True False]
>>> print(selector.ranking_)
[4 1 3 1 8 5 7 6 1 2]

The problem in real-world scenarios is, of course, how can we know the right value for n_features_to_select? The truth is, we can't. Most of the time, however, we can use a sample of the data and play with it using different settings to quickly get a feeling for the right ballpark.

The good thing is that we don't have to be that exact when using wrappers. Let's try different values for n_features_to_select to see how support_ and ranking_ change:

n_features_to_select  support_                                                        ranking_
1                     [False False False  True False False False False False False]   [ 6  3  5  1 10  7  9  8  2  4]
2                     [False False False  True False False False False  True False]   [5 2 4 1 9 6 8 7 1 3]
3                     [False  True False  True False False False False  True False]   [4 1 3 1 8 5 7 6 1 2]
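That ballpark search is easy to automate. A minimal sketch that regenerates the table above by refitting RFE for several candidate values (the range 1 to 3 is our choice for illustration, not anything prescribed by Scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=3, random_state=0)

for n in (1, 2, 3):
    selector = RFE(LogisticRegression(), n_features_to_select=n).fit(X, y)
    # support_ flags the kept features; ranking_ assigns 1 to every kept
    # feature and higher numbers to features eliminated earlier.
    print(n, selector.support_, selector.ranking_)
```

Notice how the selected subsets nest as n grows, which is why being off by one or two in n_features_to_select is usually not catastrophic.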

