10.11.2016 Views

Learning Data Mining with Python


Authorship Attribution

Our data loading function takes several options, most of which keep the dataset
relatively balanced. Some authors will have thousands of e-mails in their sent mail,
while others will have only a few dozen. We limit our search to authors with at least
10 e-mails using the min_docs_author parameter and take a maximum of 100 e-mails
from each author using the max_docs_author parameter. We also specify how many
authors we want (10 by default) using the num_authors parameter. The code is as
follows:

def get_enron_corpus(num_authors=10, data_folder=data_folder,
                     min_docs_author=10, max_docs_author=100,
                     random_state=None):
    random_state = check_random_state(random_state)

Next, we list all of the folders in the data folder, each of which corresponds to the
e-mail address of an Enron employee. We then randomly shuffle them, allowing us to
choose a new set every time the code is run. Remember that setting the random state
will allow us to replicate this result:

    email_addresses = sorted(os.listdir(data_folder))
    random_state.shuffle(email_addresses)

It may seem odd that we sort the e-mail addresses, only to shuffle them
around. The os.listdir function returns entries in arbitrary order, which can
differ between systems, so we sort first to get a stable starting point. We then
shuffle using a random state, which means our shuffling can reproduce a past
result if needed.
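The sort-then-shuffle idea can be illustrated with a small, dependency-free sketch. It uses Python's standard `random.Random` as a stand-in for the random state object returned by `check_random_state`, and the helper name `shuffled_addresses`, the folder names, and the seed value are all hypothetical, chosen only for this illustration:

```python
import random


def shuffled_addresses(names, seed):
    """Sort for a stable starting order, then shuffle with a seeded generator."""
    ordered = sorted(names)      # removes any dependence on listing order
    rng = random.Random(seed)    # stand-in for the seeded random state
    rng.shuffle(ordered)         # in-place shuffle, reproducible for a given seed
    return ordered


# The same made-up folder names, as two directory listings might return them.
listing_a = ["author-c", "author-a", "author-d", "author-b"]
listing_b = ["author-b", "author-d", "author-a", "author-c"]

first = shuffled_addresses(listing_a, seed=14)
second = shuffled_addresses(listing_b, seed=14)

# Both runs give the same shuffled order despite the different input order.
print(first == second)
```

Without the sort, the same seed applied to two differently ordered listings would generally give two different shuffles, so the seed alone would not be enough to reproduce a result.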

We then set up our documents and classes lists. We also create author_num, a counter
that tells us which class number to assign to each new author. We won't use the
enumerate trick we used earlier, because we may not choose every author. For example,
an author with fewer than 10 sent e-mails is skipped. The code is as follows:

    documents = []
    classes = []
    author_num = 0

We are also going to record which authors we used and which class number we
assigned to them. This isn't for the data mining, but will be used in the visualization
so we can identify the authors more easily. The dictionary will simply map e-mail
usernames to class values. The code is as follows:

    authors = {}
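The rest of the function is not shown on this page. As a rough sketch of how the bookkeeping described so far could fit together (not the book's own code), the following assumes the e-mails have already been loaded into a dictionary mapping each author to a list of documents. The function name `assign_classes` and the toy data are hypothetical:

```python
def assign_classes(docs_by_author, num_authors=10,
                   min_docs_author=10, max_docs_author=100):
    """Keep qualifying authors and give each one a consecutive class number."""
    documents = []
    classes = []
    authors = {}      # maps author name -> class number
    author_num = 0    # next class number to hand out

    for author, docs in docs_by_author.items():
        if len(docs) < min_docs_author:
            continue  # skipped author: author_num does not advance
        selected = docs[:max_docs_author]  # cap the e-mails taken per author
        documents.extend(selected)
        classes.extend([author_num] * len(selected))
        authors[author] = author_num
        author_num += 1
        if author_num >= num_authors:
            break     # enough authors collected
    return documents, classes, authors


# Toy data: the second author has too few e-mails and is skipped.
toy = {
    "author-a": ["mail"] * 12,
    "author-b": ["mail"] * 3,
    "author-c": ["mail"] * 150,
}
documents, classes, authors = assign_classes(toy)
print(authors)
```

Because the counter only advances when an author is kept, the class numbers stay consecutive (0 and 1 here). Numbering with enumerate would instead give the third author class 2, leaving a gap where the skipped author was.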

