
Working with Big Data

Before we start collecting blog posts, we need to get the gender of the author of the blog. While we don't normally use the filename as part of MapReduce jobs, the need arises often enough (as in this case) that the functionality is available. The current file is stored as an environment variable, which we can obtain using the following line of code:

filename = os.environ["map_input_file"]

We then split the filename to get the gender (which is the second token):

gender = filename.split(".")[1]
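
To see why the second token is the gender, consider how the corpus names its files. Assuming the id.gender.age.industry.starsign naming scheme of the blog corpus, a hypothetical filename splits like this:

# Hypothetical filename; the naming scheme is an assumption about the corpus
filename = "51.male.33.Student.Leo.xml"
tokens = filename.split(".")
# tokens == ['51', 'male', '33', 'Student', 'Leo', 'xml']
gender = tokens[1]  # 'male'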

We remove whitespace from the start and end of the line (there is a lot of whitespace in these documents) and then do our post-based tracking as before:

line = line.strip()
if line == "<post>":
    self.post_start = True
elif line == "</post>":
    self.post_start = False

Rather than storing the posts in a list, as we did earlier, we yield them. This allows mrjob to track the output. We yield both the gender and the post so that we keep a record of which gender each post belongs to. The rest of this function is defined in the same way as our loop above:

    yield gender, repr("\n".join(self.post))
    self.post = []
elif self.post_start:
    self.post.append(line)

Finally, outside the function and class, we set the script to run this MapReduce job when it is called from the command line:

if __name__ == '__main__':
    ExtractPosts.run()
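
Putting the fragments together, the complete script looks roughly like this. This is a minimal sketch: the imports and the class skeleton around the mapper are assumed from earlier in the chapter, so treat it as a reconstruction rather than the exact listing.

# extract_posts.py -- sketch assembled from the fragments above
import os
from mrjob.job import MRJob

class ExtractPosts(MRJob):
    post_start = False
    post = []

    def mapper(self, key, line):
        # Hadoop streaming exposes the current input file as an
        # environment variable
        filename = os.environ["map_input_file"]
        # The gender is the second dot-separated token of the filename
        gender = filename.split(".")[1]
        line = line.strip()
        if line == "<post>":
            self.post_start = True
        elif line == "</post>":
            self.post_start = False
            # Yield the finished post so mrjob can track the output
            yield gender, repr("\n".join(self.post))
            self.post = []
        elif self.post_start:
            self.post.append(line)

if __name__ == '__main__':
    ExtractPosts.run()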

Now, we can run this MapReduce job using the following shell command. Note that we are using Python 2, and not Python 3, to run this:

python extract_posts.py <your_data_folder>/blogs/51* --output-dir=<your_data_folder>/blogposts --no-output
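
If you want to sanity-check the results, mrjob's default output protocol writes tab-separated, JSON-encoded key/value pairs into files named part-00000, part-00001, and so on in the output directory. A hypothetical quick look at one shard:

# Hypothetical check of one output shard; substitute your output directory
import json

with open("blogposts/part-00000") as f:
    for line in f:
        gender, post = line.split("\t", 1)
        # Key and value are JSON-encoded; the value is the repr() of the post
        print(json.loads(gender), json.loads(post)[:50])
        break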

