10.11.2016 Views

Learning Data Mining with Python


Working with Big Data

The first value is the word and the second is a dictionary mapping the genders to the frequency of that word in that gender's writings.
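As a rough illustration of that record shape, a single entry could be represented as in the sketch below. The word, the gender labels, and the frequencies are invented for this example and are not taken from the dataset.

```python
# Hypothetical record: (word, {gender: frequency}); all values are invented.
record = ("hello", {"female": 0.002, "male": 0.001})

word, frequencies = record

# Pick the gender whose writings use this word most often.
most_frequent_gender = max(frequencies, key=frequencies.get)
print(word, most_frequent_gender)
```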

Open a new file in your Python IDE or text editor. We will again need the os and re libraries, as well as NumPy and MRJob from mrjob. We also need itemgetter, as we will be sorting a dictionary:

import os
import re
import numpy as np
from mrjob.job import MRJob
from operator import itemgetter
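For readers who have not used itemgetter for this before, here is a minimal sketch of sorting a dictionary by its values. The counts are made up for the example, and this is not necessarily how the job itself will use itemgetter.

```python
from operator import itemgetter

# Hypothetical word counts, invented for this example.
counts = {"the": 12, "data": 5, "python": 7}

# items() yields (key, value) pairs; itemgetter(1) selects the value
# of each pair as the sort key.
sorted_counts = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(sorted_counts)
```

A dictionary has no sort method of its own, so sorting its items gives a list of pairs ordered by value.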

We will also need MRStep, which outlines a step in a MapReduce job. Our previous job had only a single step, defined as a mapping function followed by a reducing function. This job will have two steps: we Map and Reduce, and then Reduce again. The intuition is the same as the pipelines we used in earlier chapters, where the output of one step is the input to the next step:

from mrjob.step import MRStep
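To make the chaining of steps concrete, the following sketch simulates it in plain Python, in memory. It does not use mrjob and is not how mrjob is implemented; run_step, split_words, count_words, and most_common are hypothetical names made up for this illustration, and the input lines are invented.

```python
from collections import defaultdict


def run_step(pairs, mapper=None, reducer=None):
    """Run one simulated step over (key, value) pairs, in memory."""
    # With no mapper, the pairs pass through unchanged (an identity map).
    if mapper is not None:
        pairs = [out for key, value in pairs for out in mapper(key, value)]
    if reducer is None:
        return list(pairs)
    # Group values by key, as happens between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [out for key in sorted(grouped) for out in reducer(key, grouped[key])]


def split_words(_, line):
    # Map: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1


def count_words(word, counts):
    # Reduce: total the counts, emitting under one shared key (None)
    # so that the next step sees every word in a single group.
    yield None, (sum(counts), word)


def most_common(_, count_word_pairs):
    # Second reduce: keep only the pair with the highest count.
    yield max(count_word_pairs)


# Two steps: Map and Reduce, then Reduce again.
steps = [
    {"mapper": split_words, "reducer": count_words},
    {"reducer": most_common},
]

data = [(None, "big data"), (None, "big job")]
for step in steps:
    # The output of one step is the input to the next.
    data = run_step(data, **step)
print(data)
```

The second step has no mapper, so it receives the first reducer's output directly and only regroups it by key.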

We then create our word search regular expression and compile it, allowing us to find word boundaries. This type of regular expression is much more powerful than the simple split we used in some previous chapters, but if you are looking for a more accurate word splitter, I recommend using NLTK as we did in Chapter 6, Social Media Insight using Naive Bayes:

word_search_re = re.compile(r"[\w']+")
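A quick comparison shows how this pattern differs from a plain split. The sample sentence is invented for the example.

```python
import re

word_search_re = re.compile(r"[\w']+")

# Hypothetical sample text, chosen to include punctuation and apostrophes.
text = "Don't panic: it's only data-mining, isn't it?"

# A plain split keeps punctuation attached to the words.
split_words = text.split()

# The pattern matches runs of word characters and apostrophes only, so
# punctuation is dropped and the hyphenated word becomes two tokens.
regex_words = word_search_re.findall(text)

print(split_words)
print(regex_words)
```

Contractions such as "Don't" survive intact because the apostrophe is part of the character class.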

We define a new class for our training:

class NaiveBayesTrainer(MRJob):

We define the steps of our MapReduce job. There are two steps. The first step will extract the word occurrence probabilities. The second step will compare the two genders and output the probabilities for each to our output file. In each MRStep, we define the mapper and reducer functions as needed (the second step has only a reducer). These are methods of this NaiveBayesTrainer class (we will write those functions next):

    def steps(self):
        return [
            MRStep(mapper=self.extract_words_mapping,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.compare_words_reducer),
        ]

