
Learning Data Mining with Python


Authorship Attribution

We then record the class number we used for this author and then increment it:

    authors[user] = author_num
    author_num += 1

We then check if we have enough authors and, if so, we break out of the loop to return the dataset. The code is as follows:

    if author_num >= num_authors or author_num >= len(email_addresses):
        break

We then return the dataset's documents and classes, along with our author mapping. The code is as follows:

    return documents, np.array(classes), authors
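The bookkeeping in the last few steps can be seen in isolation with a small, self-contained sketch. The `assign_author_classes` helper and the toy `toy_mail` dictionary are hypothetical stand-ins, not part of the `get_enron_corpus` function; only the pattern of numbering authors, stopping early, and returning the classes as a NumPy array is taken from the text.

```python
import numpy as np


def assign_author_classes(mail_by_user, num_authors):
    """Give each author a class number, stopping after num_authors authors.

    mail_by_user is a hypothetical {user: [document, ...]} mapping that
    stands in for the e-mails read from disk.
    """
    documents, classes, authors = [], [], {}
    email_addresses = sorted(mail_by_user)
    author_num = 0
    for user in email_addresses:
        for text in mail_by_user[user]:
            documents.append(text)
            classes.append(author_num)
        # Record the class number used for this author, then increment it.
        authors[user] = author_num
        author_num += 1
        # Stop once we have enough authors (or have run out of them).
        if author_num >= num_authors or author_num >= len(email_addresses):
            break
    return documents, np.array(classes), authors


# Made-up users and messages, for illustration only.
toy_mail = {
    "user_a": ["first message", "second message"],
    "user_b": ["third message"],
    "user_c": ["fourth message"],
}
documents, classes, authors = assign_author_classes(toy_mail, num_authors=2)
print(authors)  # {'user_a': 0, 'user_b': 1}
print(classes)  # [0 0 1]
```

Because the check runs after each author is processed, the third user is never read once two authors have been collected.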

Outside this function, we can now get a dataset by making the following function call. We are going to use a random state of 14 here (as always in this book), but you can try other values or set it to None to get a random set each time the function is called:

    documents, classes, authors = get_enron_corpus(data_folder=enron_data_folder,
                                                   random_state=14)
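To see why a fixed random state matters, the sketch below shuffles a list with NumPy's `RandomState`. It assumes the random state is used to shuffle the authors before they are selected, which is not shown in this section; the `shuffled_authors` helper and the author names are made up for illustration.

```python
import numpy as np


def shuffled_authors(names, random_state=None):
    """Return a shuffled copy of names.

    An integer seed gives a repeatable order; None seeds from the
    operating system, so the order can differ on every call.
    """
    rng = np.random.RandomState(random_state)
    names = sorted(names)  # start from a fixed order
    rng.shuffle(names)
    return names


names = ["user_a", "user_b", "user_c", "user_d", "user_e"]
first = shuffled_authors(names, random_state=14)
second = shuffled_authors(names, random_state=14)
print(first == second)  # True: the same seed reproduces the same order
print(shuffled_authors(names, random_state=None))  # order may change per run
```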

If you have a look at the dataset, there is still a further preprocessing step we need to undertake. Our e-mails are quite messy, but one of the worst bits (from a data analysis perspective) is that these e-mails contain writing from other authors, in the form of attached replies. Take the following e-mail, which is documents[100], for instance:

    I am disappointed on the timing but I understand. Thanks. Mark
    -----Original Message-----
    From: Greenberg, Mark
    Sent: Friday, September 28, 2001 4:19 PM
    To: Haedicke, Mark E.
    Subject: Web Site
    Mark -
    FYI - I have attached below a screen shot of the proposed new look and feel for the site. We have a couple of tweaks to make, but I believe this is a much cleaner look than what we have now.
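One simple way to approach this, shown here only as a sketch, is to cut each message at the first quoted-reply marker. It assumes every attached reply begins with an `-----Original Message-----` line like the one in the e-mail above, which will not hold for all e-mails; `strip_quoted_reply` is a hypothetical helper, not a function from this book or from any library.

```python
import re

# Matches a line such as "-----Original Message-----" (assumed marker format).
REPLY_MARKER = re.compile(r"^-+\s*Original Message\s*-+\s*$",
                          re.IGNORECASE | re.MULTILINE)


def strip_quoted_reply(text):
    """Return only the text written before the first quoted-reply marker."""
    match = REPLY_MARKER.search(text)
    if match is None:
        return text.strip()  # no marker found: keep the whole message
    return text[:match.start()].strip()


email = """I am disappointed on the timing but I understand. Thanks. Mark
-----Original Message-----
From: Greenberg, Mark
Sent: Friday, September 28, 2001 4:19 PM
Subject: Web Site
Mark -
FYI - I have attached below a screen shot of the proposed new look."""

print(strip_quoted_reply(email))
# I am disappointed on the timing but I understand. Thanks. Mark
```

Only the first line survives, which is the part written by the author we are trying to attribute.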
