
Chapter 5

Putting it all together

Now that we have a tested transformer, it is time to put it into action. Using what we have learned so far, we create a Pipeline, set the first step to the MeanDiscrete transformer, and the second step to a Decision Tree Classifier. We then run a cross-validation and print out the result. Let's look at the code:

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([('mean_discrete', MeanDiscrete()),
                     ('classifier', DecisionTreeClassifier(random_state=14))])
scores_mean_discrete = cross_val_score(pipeline, X, y,
                                       scoring='accuracy')
print("Mean Discrete performance: {0:.3f}".format(scores_mean_discrete.mean()))
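The MeanDiscrete transformer plugged into this pipeline was built and tested earlier in the chapter. As a reminder of the shape such a transformer takes, here is a minimal sketch of a compatible implementation (the chapter's full version also performs input validation; this stripped-down form is for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanDiscrete(BaseEstimator, TransformerMixin):
    """Binarize each feature: 1 if a value is above that feature's mean."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Store the per-feature mean computed on the training data
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Compare each value against the mean learned during fit
        return (X > self.mean_).astype(int)
```

Inheriting from BaseEstimator (in addition to TransformerMixin) lets cross_val_score clone the transformer cleanly between folds, and TransformerMixin supplies fit_transform for free.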

The result is 0.803, which is not as good as before, but not bad for simple binary features.

Summary

In this chapter, we looked at features and transformers and how they can be used in the data mining pipeline. We discussed what makes a good feature and how to algorithmically choose good features from a standard set. However, creating good features is more art than science and often requires domain knowledge and experience.
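To recall what "algorithmically choosing good features" looks like in practice, a brief sketch in scikit-learn (using chi-squared scoring with SelectKBest, one of the selection methods covered in this chapter; the tiny dataset here is purely illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy dataset: three non-negative features, binary target (illustrative only)
X = np.array([[1, 0, 10],
              [0, 1, 20],
              [1, 0, 30],
              [0, 1, 40]])
y = np.array([0, 1, 0, 1])

# Keep the two features with the highest chi-squared score against the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
```

After fitting, selector.scores_ holds the per-feature chi-squared statistics, which can be inspected to understand why particular features were kept.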

We then created our own transformer using an interface that allows us to use it in scikit-learn's helper functions. We will be creating more transformers in later chapters so that we can perform effective testing using existing functions.

In the next chapter, we use feature extraction on a corpus of text documents. There are many transformers and feature types for text, each with its own advantages and disadvantages.

