
Classifying with scikit-learn Estimators

Putting it all together

We can now create a workflow by combining the code from the previous sections, using the broken dataset computed earlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Rescale every feature to the [0, 1] range, then cross-validate
X_transformed = MinMaxScaler().fit_transform(X_broken)
estimator = KNeighborsClassifier()
transformed_scores = cross_val_score(estimator, X_transformed, y,
                                     scoring='accuracy')
print("The average accuracy is {0:.1f}%".format(
    np.mean(transformed_scores) * 100))

This gives us back our score of 82.3 percent accuracy. The MinMaxScaler resulted in features of the same scale, meaning that no features overpowered others simply by having larger values. While the Nearest Neighbor algorithm can be confused by features with larger values, some algorithms handle scale differences without issue, and some handle them much worse!
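To see what MinMaxScaler actually does, here is a minimal standalone sketch; the tiny array is made up purely for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A made-up array where the second feature is 1,000 times larger
X_example = np.array([[1.0, 1000.0],
                      [2.0, 3000.0],
                      [3.0, 2000.0]])

# Each feature (column) is rescaled independently to the [0, 1] range
print(MinMaxScaler().fit_transform(X_example))
# [[0.  0. ]
#  [0.5 1. ]
#  [1.  0.5]]

After scaling, both columns span the same range, so neither dominates distance computations such as those used by Nearest Neighbor.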

Pipelines

As experiments grow, so does the complexity of the operations. We may split up our dataset, binarize features, perform feature-based scaling, perform sample-based scaling, and many more operations.

Keeping track of all of these operations can get quite confusing and can make results impossible to replicate. Common problems include forgetting a step, applying a transformation incorrectly, or adding a transformation that wasn't needed.

Another issue is the order of the code. In the previous section, we created our X_transformed dataset and then created a new estimator for the cross-validation. If we had multiple steps, we would need to track all of these changes to the dataset in the code.

Pipelines are a construct that addresses these problems (and others, which we will see in the next chapter). Pipelines store the steps in your data mining workflow. They can take your raw data in, perform all the necessary transformations, and then create a prediction. This lets us use pipelines with functions such as cross_val_score, which expect an estimator. First, import the Pipeline object:

from sklearn.pipeline import Pipeline
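As a sketch of how this fits together with the earlier example, a Pipeline takes a list of (name, step) pairs, where every step but the last is a transformer and the last is an estimator; the step names 'scale' and 'predict' below are labels of our own choosing:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Chain the scaling transformation and the classifier into one object
scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])

# The pipeline behaves like a single estimator, so it can be passed
# directly to cross_val_score on the raw (broken) data
scores = cross_val_score(scaling_pipeline, X_broken, y,
                         scoring='accuracy')
print("The pipeline scored an average accuracy of {0:.1f}%".format(
    np.mean(scores) * 100))

Because the scaling now happens inside the pipeline, there is no separate X_transformed variable to keep track of, which removes the ordering problem described above.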
