10.11.2016 Views

Learning Data Mining with Python


We again redefine our word search regular expression. In a real application, I recommend defining it in one central place, because words must be extracted in the same way for training and testing:

word_search_re = re.compile(r"[\w']+")
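To illustrate what this pattern extracts, here is a minimal sketch. The helper name `extract_words` is hypothetical (it does not appear in the original code), and the lowercasing step assumes the same normalization that the MapReduce jobs apply:

```python
import re

# Same pattern as in the text: runs of word characters and apostrophes.
word_search_re = re.compile(r"[\w']+")


def extract_words(text):
    """Hypothetical helper: lowercase the text and return the matched tokens."""
    return word_search_re.findall(text.lower())


tokens = extract_words("I don't think it's late, do you?")
print(tokens)
```

Because the apostrophe is part of the character class, contractions such as don't stay together as single tokens, while other punctuation acts as a separator. Keeping one shared helper like this is one way to make sure training and testing tokenize identically.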

Next, we create the function that loads our model from a given filename:

def load_model(model_filename):

Chapter 12

The model parameters take the form of a dictionary of dictionaries, where the outer key is a word and the inner dictionary maps each gender to a probability. We use defaultdicts, which return zero if a value isn't present:

    model = defaultdict(lambda: defaultdict(float))
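The behavior of this nested structure can be sketched on its own. The probability value below is made up purely for illustration:

```python
from collections import defaultdict

# Outer default: a fresh inner defaultdict; inner default: float() == 0.0.
model = defaultdict(lambda: defaultdict(float))

model["hello"]["male"] = 0.25  # made-up value, for illustration only

known = model["hello"]["male"]
missing_gender = model["hello"]["female"]  # key absent, falls back to 0.0
missing_word = model["unseen"]["male"]     # word absent, falls back to 0.0
print(known, missing_gender, missing_word)

# Side effect worth knowing: a lookup inserts the missing key.
unseen_was_inserted = "unseen" in model
print(unseen_was_inserted)

# Assigning a plain dict replaces the inner defaultdict for that word,
# so a missing inner key then raises KeyError instead of returning 0.0.
model["plain"] = {"male": 0.5}
try:
    model["plain"]["female"]
    plain_raises = False
except KeyError:
    plain_raises = True
print(plain_raises)
```

The last part matters whenever a plain dictionary is assigned as the inner value: the zero default then applies only to words that are missing altogether, not to a missing gender on a known word.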

We then open the model file and parse each line:

    with open(model_filename) as inf:
        for line in inf:

Each line is split into two sections, separated by whitespace. The first is the word itself and the second is a dictionary of probabilities. We run eval on each section to recover the actual value, which was stored using repr in the previous code:

            word, values = line.split(maxsplit=1)
            word = eval(word)
            values = eval(values)

We then store the values for the word in our model:

            model[word] = values
    return model
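Since eval executes whatever expression it is given, a reader loading model files from an untrusted source may prefer `ast.literal_eval`, which accepts only Python literals. The following is a sketch of that variant, not the book's code: the name `load_model_safely` is hypothetical, the sample lines and their values are made up, and the file format (a repr of the word, whitespace, a repr of the dictionary) is assumed from the description above.

```python
import ast
import os
import tempfile
from collections import defaultdict


def load_model_safely(model_filename):
    """Hypothetical variant of load_model that only parses Python literals."""
    model = defaultdict(lambda: defaultdict(float))
    with open(model_filename) as inf:
        for line in inf:
            if not line.strip():
                continue  # skip blank lines
            # maxsplit=1 keeps the dictionary intact even though it contains spaces
            word, values = line.split(maxsplit=1)
            word = ast.literal_eval(word)
            values = ast.literal_eval(values)
            # update() keeps the inner defaultdict, so a missing gender gives 0.0
            model[word].update(values)
    return model


# Demonstration with made-up values written to a temporary file.
sample = {"i": {"male": 0.5, "female": 0.75}, "don't": {"male": 0.125}}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for word, values in sample.items():
        f.write(repr(word) + "\t" + repr(values) + "\n")
    path = f.name

demo_model = load_model_safely(path)
os.remove(path)
print(demo_model["i"]["male"], demo_model["don't"]["female"])
```

Using `update` rather than assignment is a design choice in this sketch: it keeps the zero default for genders that never appeared with a given word.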

Next, we load our actual model. You may need to change the model filename; it will be in the output directory of the last MapReduce job:

model_filename = os.path.join(os.path.expanduser("~"), "models",
                              "part-00000")
model = load_model(model_filename)
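If you are unsure which file to load, one option is to list the candidate output files first. This is a minimal sketch that assumes the job wrote files whose names start with `part-` into the `models` directory used above; the helper name `find_part_files` is hypothetical.

```python
import glob
import os


def find_part_files(output_dir):
    """Hypothetical helper: return the part-* files in output_dir, sorted by name."""
    return sorted(glob.glob(os.path.join(output_dir, "part-*")))


output_dir = os.path.join(os.path.expanduser("~"), "models")
part_files = find_part_files(output_dir)
if part_files:
    print("Candidate model files:", part_files)
else:
    print("No part-* files found in", output_dir)
```

If more than one file is listed, each would need to be loaded and merged; with a single file, the path can be passed directly to load_model.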

As an example, we can see the difference in usage of the word i (all words are converted to lowercase in the MapReduce jobs) between males and females:

model["i"]["male"], model["i"]["female"]
