10.11.2016 Views

Learning Data Mining with Python


Chapter 2

Pipelines take a list of steps as input, representing the chain of the data mining application. The last step needs to be an Estimator, while all previous steps are Transformers. The input dataset is altered by each Transformer, with the output of one step being the input of the next step. Finally, the samples are classified by the last step's Estimator. In our pipeline, we have two steps:

1. Use MinMaxScaler to scale the feature values to the range 0 to 1
2. Use KNeighborsClassifier as the classification algorithm

Each step is then represented by a tuple ('name', step). We can then create our pipeline:

scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])

The key here is the list of tuples. The first tuple is our scaling step and the second tuple is the predicting step. We give each step a name: the first we call scale and the second we call predict, but you can choose your own names. The second part of each tuple is the actual Transformer or Estimator object.
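To make the chaining concrete, the following is a minimal sketch that fits the same two-step pipeline and then repeats the chain by hand. The toy arrays X_demo and y_demo are hypothetical stand-ins, not the dataset used in this chapter, and the import paths assume a reasonably recent scikit-learn release.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (an assumption for illustration only):
# 40 samples, 3 features on very different scales.
rng = np.random.RandomState(14)
X_demo = rng.rand(40, 3) * [1.0, 100.0, 1000.0]
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# The pipeline: each step is a ('name', step) tuple.
scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])
scaling_pipeline.fit(X_demo, y_demo)
pipeline_predictions = scaling_pipeline.predict(X_demo)

# The same chain written out by hand: the Transformer's output
# becomes the input of the final Estimator.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_demo)
classifier = KNeighborsClassifier()
classifier.fit(X_scaled, y_demo)
manual_predictions = classifier.predict(scaler.transform(X_demo))

# Steps can be looked up again by the names we chose.
step_names = [name for name, _ in scaling_pipeline.steps]
fitted_scaler = scaling_pipeline.named_steps['scale']
```

Here pipeline_predictions and manual_predictions should agree, since both routes apply the same scaling followed by the same classifier. The named_steps lookup is one way the names chosen in the tuples become useful later.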

Running this pipeline is now very easy, using the cross-validation code from before:

scores = cross_val_score(scaling_pipeline, X_broken, y,
                         scoring='accuracy')
# Average the per-fold accuracies and report them as a percentage
print("The pipeline scored an average accuracy of {0:.1f}%".
      format(np.mean(scores) * 100))

This gives us the same score as before (82.3 percent), which is expected, as we are effectively running the same steps.
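One subtlety is that a pipeline refits the scaler on each training fold during cross-validation, whereas scaling the whole dataset up front uses every sample. The sketch below runs both variants side by side. The arrays X_demo and y_demo, and the way every second feature is shrunk to imitate a badly scaled dataset, are assumptions made for illustration rather than the chapter's data. The import from sklearn.model_selection assumes a recent scikit-learn release.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data (not the chapter's dataset).
rng = np.random.RandomState(14)
X_demo = rng.rand(100, 4)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Imitate a badly scaled dataset by shrinking every second feature.
X_demo_broken = X_demo.copy()
X_demo_broken[:, ::2] /= 10

# Variant 1: scaling happens inside the pipeline, once per fold.
pipeline = Pipeline([('scale', MinMaxScaler()),
                     ('predict', KNeighborsClassifier())])
pipeline_scores = cross_val_score(pipeline, X_demo_broken, y_demo,
                                  scoring='accuracy', cv=5)

# Variant 2: scale the whole dataset first, then cross-validate.
X_prescaled = MinMaxScaler().fit_transform(X_demo_broken)
prescaled_scores = cross_val_score(KNeighborsClassifier(), X_prescaled,
                                   y_demo, scoring='accuracy', cv=5)

print("Pipeline: {0:.1f}%".format(np.mean(pipeline_scores) * 100))
print("Pre-scaled: {0:.1f}%".format(np.mean(prescaled_scores) * 100))
```

The two averages are usually close, but they need not be identical, because the per-fold scaler only sees the training portion of each split.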

In later chapters, we will use more advanced testing methods, and setting up pipelines is a great way to ensure that the code complexity does not grow unmanageably.

