10.11.2016 Views

Learning Data Mining with Python


Authorship Attribution

Authorship analysis is predominantly a text mining task that aims to identify certain aspects of an author based only on the content of their writing. These aspects could include characteristics such as age, gender, or background. In the specific task of authorship attribution, we aim to identify which of a set of authors wrote a particular document. This is a classic classification task. In many ways, authorship analysis tasks are performed using standard data mining methodologies, such as cross-fold validation, feature extraction, and classification algorithms.

In this chapter, we will use the problem of authorship attribution to piece together the parts of the data mining methodology we developed in the previous chapters. We identify the problem and discuss its background and the existing knowledge about it. This lets us choose which features to extract, and we will build a pipeline to extract them. We will test two different types of features: function words and character n-grams. Finally, we will perform an in-depth analysis of the results. We will work with a book dataset, and then with a very messy real-world corpus of e-mails.
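To make the two feature types concrete, here is a minimal sketch using only the Python standard library. The function names, the short `FUNCTION_WORDS` list, and the sample sentence are hypothetical and chosen for illustration; they are not the feature extractors built later in the chapter.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in"]


def function_word_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each function word in the text."""
    tokens = text.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in vocabulary}


def character_ngrams(text, n=3):
    """Count overlapping character n-grams, including spaces."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


sample = "the cat sat on the mat"
fw = function_word_frequencies(sample)
ngrams = character_ngrams(sample, n=3)
```

Function-word frequencies capture how often an author uses common, topic-independent words, while character n-grams capture short sequences of letters, spaces, and punctuation. Either dictionary could be turned into a feature vector for a classifier.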

The topics we will cover in this chapter are as follows:

• Feature engineering and how the features differ based on application
• Revisiting the bag-of-words model with a specific goal in mind
• Feature types and the character n-grams model
• Support vector machines
• Cleaning up a messy dataset for data mining

