Learning Data Mining with Python
Chapter 9

Evaluation

It is generally never a good idea to base an assessment on a single number. That said, the f-score is usually more robust than metrics that can be tricked into giving good scores despite not being useful. An example of such a metric is accuracy. As we saw in the previous chapter, a spam classifier could predict everything as being spam and get over 80 percent accuracy, even though that solution is not useful at all. For that reason, it is usually worth going more in-depth on the results.
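To see this contrast in a small, self-contained sketch (not taken from the book, using made-up labels), compare accuracy with the f-score when a "classifier" blindly predicts the majority class:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 80 percent of the made-up labels are spam (1), 20 percent are not spam (0)
y_true = np.array([1] * 80 + [0] * 20)
# a useless classifier that predicts spam for everything
y_pred_all_spam = np.ones(100, dtype=int)

print(accuracy_score(y_true, y_pred_all_spam))         # 0.8, which looks good
print(f1_score(y_true, y_pred_all_spam, pos_label=0))  # 0.0 for the not-spam class

The accuracy looks respectable, while the f-score for the minority class exposes that nothing useful has been learned.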

To start with, we will look at the confusion matrix, as we did in Chapter 8, Beating CAPTCHAs with Neural Networks. Before we can do that, we need predictions for a testing set. The previous code used cross_val_score, which doesn't actually give us a trained model we can use, so we will need to refit one. To do that, we need training and testing subsets:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn versions
training_documents, testing_documents, y_train, y_test = train_test_split(documents, classes, random_state=14)

Next, we fit the pipeline to our training documents and create our predictions for the testing set:

pipeline.fit(training_documents, y_train)
y_pred = pipeline.predict(testing_documents)

At this point, you might be wondering what the best combination of parameters actually was. We can extract this quite easily from our grid search object (which is the classifier step of our pipeline):

print(pipeline.named_steps['classifier'].best_params_)

The results give you all of the parameters for the classifier. However, most of the parameters are the defaults that we didn't touch. The ones we did search for were C and kernel, which were set to 1 and linear, respectively.
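For reference, such a pipeline could be assembled along the following lines. This is a hedged sketch: the feature extraction step and the parameter grid shown here are assumptions for illustration, not necessarily the exact code used earlier in the chapter.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer scikit-learn versions
from sklearn.pipeline import Pipeline

# search over the two parameters mentioned above
parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10]}
grid = GridSearchCV(SVC(), parameters)
pipeline = Pipeline([('feature_extraction', CountVectorizer(analyzer='char', ngram_range=(3, 3))),
                     ('classifier', grid)])

After fitting, pipeline.named_steps['classifier'] returns the grid search object, and its best_params_ attribute holds the winning combination.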

Now we can create a confusion matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_pred, y_test)
# normalize each row so the values are proportions rather than raw counts
cm = cm / cm.astype(np.float64).sum(axis=1, keepdims=True)

Next, we get our authors so that we can label the axes correctly. For this purpose, we use the authors dictionary that was created when we loaded our Enron dataset. The code is as follows:

sorted_authors = sorted(authors.keys(), key=lambda x: authors[x])
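With the author names in class order, the normalized matrix can then be drawn, for example with matplotlib. The styling below is a sketch that assumes the authors dictionary maps author names to class indices; it is not the book's exact figure code.

from matplotlib import pyplot as plt

plt.figure(figsize=(10, 10))
plt.imshow(cm, cmap='Blues', interpolation='nearest')
tick_marks = np.arange(len(sorted_authors))
# with confusion_matrix(y_pred, y_test) above, rows are predictions and columns are actual authors
plt.xticks(tick_marks, sorted_authors, rotation=90)
plt.yticks(tick_marks, sorted_authors)
plt.xlabel("Actual author")
plt.ylabel("Predicted author")
plt.colorbar()
plt.show()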

