10.11.2016 Views

Learning Data Mining with Python


Working with Big Data

This gives us the information we needed for our Naive Bayes implementation:

def compare_words_reducer(self, word, values):
    per_gender = {}
    for value in values:
        gender, s = value
        per_gender[gender] = s
    yield word, per_gender
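To see what this reducer emits, the same logic can be exercised outside MapReduce as a plain function. This is only a minimal sketch: the standalone function name, the sample word, and the numbers are made up for illustration, and it assumes each incoming value is a (gender, score) pair, as in the method above.

```python
def compare_words(word, values):
    """Standalone copy of the reducer logic, runnable without MapReduce."""
    per_gender = {}
    for gender, s in values:
        # Each value is assumed to be a (gender, score) pair for this word.
        per_gender[gender] = s
    yield word, per_gender


# Hypothetical values that earlier steps might have produced for one word.
sample_values = [("female", 0.002), ("male", 0.001)]
result = list(compare_words("coffee", sample_values))
print(result)
```

The reducer yields a single record per word, with one dictionary entry per gender.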

Finally, we set the code to run this model when the file is run as a script:

if __name__ == '__main__':
    NaiveBayesTrainer.run()

We can then run this script. Its input is the output of the previous post-extractor script (the two can also be separate steps in the same MapReduce job if you are so inclined):

python nb_train.py /blogposts/ --output-dir=/models/ --no-output

The output directory is the folder that stores the output of this MapReduce job, in one or more files: the probabilities we need to run our Naive Bayes classifier.

Putting it all together<br />

We can now actually run the Naive Bayes classifier using these probabilities. We will do this in an IPython Notebook, and can go back to using Python 3 (phew!).

First, take a look at the models folder that was specified in the last MapReduce job. If the job produced more than one output file, we can merge them by appending them to each other with a shell command run from within the models directory:

cat * > model.txt

If you do this, you'll need to update the following code with model.txt as the model filename.
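Reading the model file back in could look roughly like the sketch below. It assumes the job wrote each record as a JSON-encoded word, a tab, and a JSON-encoded dictionary of per-gender values. That layout is an assumption about the output format, so check a line of your own file first. The function name and the sample lines are hypothetical.

```python
import json


def load_model(lines):
    """Parse 'word<TAB>{...}' records into a dict of per-gender values."""
    model = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. left over from merging files
        key, value = line.split("\t", 1)
        model[json.loads(key)] = json.loads(value)
    return model


# Hypothetical records in the assumed format.
sample_lines = [
    '"coffee"\t{"female": 0.002, "male": 0.001}\n',
    '"football"\t{"male": 0.003}\n',
]
model = load_model(sample_lines)
print(model["coffee"])

# With a real file, pass the open file object instead:
# with open("model.txt") as f:
#     model = load_model(f)
```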

Back in our Notebook, we first import the standard modules we need for our script:

import os
import re
import numpy as np
from collections import defaultdict
from operator import itemgetter
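As a rough illustration of where this is heading, the sketch below scores a document against a model of per-gender word probabilities by summing log probabilities and picking the highest-scoring gender. It is not the book's implementation: the function name, the tokenizing regular expression, the fallback probability, and the toy model are all assumptions. It also ignores class priors and uses math.log from the standard library so that it runs without NumPy.

```python
import math
import re
from operator import itemgetter

# Assumed tokenizer: runs of word characters and apostrophes.
word_search_re = re.compile(r"[\w']+")


def nb_predict(model, document, floor=1e-15):
    """Return the gender with the highest summed log probability."""
    words = set(word_search_re.findall(document.lower()))
    log_scores = {"male": 0.0, "female": 0.0}
    for word in words:
        if word not in model:
            continue  # words never seen in training carry no evidence
        for gender in log_scores:
            # Fall back to a tiny probability when a gender has no entry.
            log_scores[gender] += math.log(model[word].get(gender, floor))
    return max(log_scores.items(), key=itemgetter(1))[0]


# Toy model with made-up probabilities, for illustration only.
toy_model = {
    "coffee": {"female": 0.002, "male": 0.001},
    "football": {"female": 0.0005, "male": 0.003},
}
prediction = nb_predict(toy_model, "Football and more football")
print(prediction)
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow on long documents.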

