
If it doesn't exist, quotequail couldn't find a quoted reply. It is possible that it found other text in the e-mail; if so, we return only that text. The code is as follows:

elif 'text' in r:
    return r['text']

Finally, if we couldn't get a result, we just return the e-mail contents, hoping they offer some benefit to our data analysis:

return email_contents
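Pulling those pieces together, here is a minimal sketch of what the whole remove_replies function might look like. The use of quotequail's unwrap function and the 'text_top' key in the first branch are assumptions based on quotequail's API; only the last two branches appear verbatim above:

import quotequail

def remove_replies(email_contents):
    # unwrap returns None if quotequail finds no reply structure
    r = quotequail.unwrap(email_contents)
    if r is None:
        return email_contents
    # Assumed first branch: 'text_top' holds the new text written
    # above a quoted reply
    if 'text_top' in r:
        return r['text_top']
    # Other text quotequail found in the e-mail
    elif 'text' in r:
        return r['text']
    # No usable result; fall back to the full contents
    return email_contents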

We can now preprocess all of our documents by running this function on each of them:

documents = [remove_replies(document) for document in documents]

Our preceding e-mail sample is greatly clarified now and contains only the e-mail written by Mark Greenberg:

I am disappointed on the timing but I understand. Thanks. Mark

Putting it all together

We can use the existing parameter space and classifier from our previous experiments; all we need to do is refit it on our new data. By default, training in scikit-learn is done from scratch: subsequent calls to fit() will discard any previous information.
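As a sketch of what that refitting looks like (the pipeline's exact parameters here are an assumption, not the verbatim code from the earlier experiments), we simply call fit() on the new documents:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumed pipeline: character n-gram counts feeding an SVM,
# in the style of the earlier authorship experiments
pipeline = Pipeline([('feature_extraction',
                      CountVectorizer(analyzer='char', ngram_range=(3, 3))),
                     ('classifier', SVC(kernel='linear'))])

# fit() trains from scratch, discarding any previously learned model
pipeline.fit(documents, classes)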

There is a class of algorithms called online learning that update the training with new samples and don't restart their training each time. We will see online learning in action later in this book, including in the next chapter, Chapter 10, Clustering News Articles.
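As a quick preview of online learning (an illustrative sketch, not code from this book), scikit-learn exposes it through the partial_fit method on estimators such as SGDClassifier:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()

# The full set of classes must be declared on the first call
clf.partial_fit([[0.0, 1.0], [1.0, 0.0]], [0, 1], classes=np.array([0, 1]))

# Later calls update the existing model instead of restarting training
clf.partial_fit([[0.5, 0.5]], [1])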

As before, we can compute our scores by using cross_val_score and print the results. The code is as follows:

import numpy as np
from sklearn.cross_validation import cross_val_score

scores = cross_val_score(pipeline, documents, classes, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))

The result is 0.523, which is a reasonable result for such a messy dataset. Adding more data (such as increasing max_docs_author in the dataset loading) can improve these results.

