
Chapter 9

Accessing the Enron dataset

The full set of Enron e-mails is available at https://www.cs.cmu.edu/~./enron/.

The full dataset is 423 MB, compressed with gzip. If you don't have a Linux-based machine to decompress (unzip) this file, get an alternative program, such as 7-zip (http://www.7-zip.org/).

Download the full corpus and decompress it into your data folder. By default, this will decompress into a folder called enron_mail_20110402.

As we are looking for authorship information, we only want the e-mails we can attribute to a specific author. For that reason, we will look in each user's sent folder—that is, e-mails they have sent.

In the Notebook, set up the data folder for the Enron dataset:

import os
enron_data_folder = os.path.join(os.path.expanduser("~"), "Data",
                                 "enron_mail_20110402", "maildir")

Creating a dataset loader

We can now create a function that will choose a couple of authors at random and return each of the e-mails in their sent folders. Specifically, we are looking for the payloads: the message content, rather than the e-mail files themselves (which also contain headers). For that, we will need an e-mail parser. The code is as follows:

from email.parser import Parser
p = Parser()

We will be using this later to extract the payloads from the e-mail files that are in the data folder.
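As a quick illustration, the parser's parsestr method turns a raw e-mail string into a message object, and get_payload returns just the body without the headers. The e-mail below is a made-up example; real files in the Enron dataset follow the same headers, blank line, body layout:

```python
from email.parser import Parser

p = Parser()

# A made-up raw e-mail for illustration only
raw_email = """From: alice@example.com
To: bob@example.com
Subject: Meeting

Let's meet at noon tomorrow."""

message = p.parsestr(raw_email)
print(message.get_payload())  # prints only the body, not the headers
```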

We will be choosing authors at random, so we will be using a random state that allows us to replicate the results if we want:

from sklearn.utils import check_random_state

