10.11.2016 Views

Learning Data Mining with Python


Chapter 9

This document contains another e-mail attached to the bottom as a reply, a common e-mail pattern. The first part of the e-mail is from Mark Haedicke, while the second is a previous e-mail written to Mark Haedicke by Mark Greenberg. Only the text preceding the first instance of -----Original Message----- can be attributed to the author, and this is the only part we are interested in.
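To make the idea concrete, here is a minimal sketch that keeps only the text before the first -----Original Message----- marker. It handles just this one reply pattern, and the helper name `text_before_first_reply` and the sample message are hypothetical, made up for illustration.

```python
REPLY_MARKER = "-----Original Message-----"


def text_before_first_reply(email_contents, marker=REPLY_MARKER):
    """Return the text that precedes the first reply marker."""
    # str.partition splits at the first occurrence only; if the marker is
    # absent, the whole string comes back as the first element.
    top, _, _ = email_contents.partition(marker)
    return top.strip()


# Hypothetical sample message, not taken from the dataset.
sample = (
    "Sounds good, let's go ahead.\n"
    "\n"
    " -----Original Message-----\n"
    "From: (earlier sender)\n"
    "Can we go ahead with this?\n"
)
print(text_before_first_reply(sample))
```

A fixed marker like this only covers e-mails that use exactly this reply style, so it should be read as an illustration of the goal rather than a general solution.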

Extracting this information in general is not easy. E-mail is a notoriously inconsistently used format. Different e-mail clients add their own headers, mark replies in different ways, and otherwise do things however they want. It is surprising that e-mail works at all in the current environment.

There are some commonly used patterns that we can look for. The quotequail package looks for these and can find the new part of the e-mail, discarding replies and other information.

You can install quotequail using pip: pip3 install quotequail

We are going to write a simple function to wrap the quotequail functionality, allowing us to easily call it on all of our documents. First, we import quotequail and set up the function definition:

import quotequail

def remove_replies(email_contents):

Next, we use quotequail to unwrap the e-mail, which returns a dictionary containing the different parts of the e-mail. The code is as follows:

    r = quotequail.unwrap(email_contents)

In some cases, r can be None. This happens if the e-mail couldn't be parsed. In this case, we just return the full e-mail contents. This kind of messy fallback is often necessary when working with real-world datasets. The code is as follows:

    if r is None:
        return email_contents

The part of the e-mail we are actually interested in is called text_top by quotequail. If this key exists, we return its value as the interesting part of the e-mail. The code is as follows:

    if 'text_top' in r:
        return r['text_top']
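Put together, the steps so far can be sketched as a single function. So that the sketch runs without quotequail installed, it takes the unwrapping function as a second argument. That parameter, the `fake_unwrap` stand-in, and the final fallback line are assumptions made for illustration and are not shown in the text above.

```python
def remove_replies(email_contents, unwrap):
    """Return the new part of an e-mail, dropping quoted replies.

    `unwrap` is expected to behave like quotequail.unwrap as described
    above: it returns a dictionary of e-mail parts, or None.
    """
    r = unwrap(email_contents)
    if r is None:
        # Nothing could be unwrapped, so keep the full text.
        return email_contents
    if 'text_top' in r:
        # The text above the first quoted reply is the author's own writing.
        return r['text_top']
    # Assumption: this case is not covered above; fall back to the full text.
    return email_contents


# Hypothetical stand-in for quotequail.unwrap, used only to exercise the
# branches; it recognizes a single reply marker.
def fake_unwrap(text):
    marker = "-----Original Message-----"
    if marker not in text:
        return None
    top, _, bottom = text.partition(marker)
    return {'text_top': top.strip(), 'text': bottom.strip()}


print(remove_replies("Thanks.\n-----Original Message-----\nOld text", fake_unwrap))
print(remove_replies("Just a plain message", fake_unwrap))
```

With quotequail installed, the same function could be called as `remove_replies(email_contents, quotequail.unwrap)`.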
