10.11.2016 Views

Learning Data Mining with Python


Chapter 12

One problem with using log probabilities is that they don't handle zero values well (although neither does multiplying by zero probabilities). This is because log(0) is undefined. Some implementations of Naive Bayes add 1 to all counts to avoid this, which is a simple form of smoothing, but there are other ways to address the problem. In our code, we just return a very small value if the word hasn't been seen for our given gender.
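To make the two options concrete, here is a minimal sketch of both: a tiny fallback probability for unseen words, and add-one smoothing. The toy counts, the function names, and the fallback value of 1e-15 are assumptions made for illustration; they are not taken from the chapter's model.

```python
import math
from collections import Counter

# Hypothetical toy counts: how often each word appeared for each gender.
counts = {
    "male": Counter({"car": 3, "deck": 2}),
    "female": Counter({"car": 1, "garden": 4}),
}
vocabulary = set(counts["male"]) | set(counts["female"])


def log_prob_fallback(word, gender, tiny=1e-15):
    """Log probability, using a very small value for unseen words.

    The value of `tiny` is an assumption; math.log(0) would raise ValueError.
    """
    total = sum(counts[gender].values())
    count = counts[gender][word]  # Counter returns 0 for missing words
    prob = count / total if count > 0 else tiny
    return math.log(prob)


def log_prob_add_one(word, gender):
    """Log probability with add-one smoothing: every count is increased by 1."""
    total = sum(counts[gender].values())
    return math.log((counts[gender][word] + 1) / (total + len(vocabulary)))


# "garden" was never seen for "male", yet both functions return a finite value.
unseen_fallback = log_prob_fallback("garden", "male")
unseen_add_one = log_prob_add_one("garden", "male")
```

The fallback approach leaves the probabilities of seen words unchanged, while add-one smoothing shifts every estimate slightly, so the two can rank borderline documents differently.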

Returning to our prediction function, we can test it by copying a post from our dataset:

new_post = """ Every day should be a half day. Took the afternoon
off to hit the dentist, and while I was out I managed to get my oil
changed, too. Remember that business with my car dealership this
winter? Well, consider this the epilogue. The friendly fellas at the
Valvoline Instant Oil Change on Snelling were nice enough to notice
that my dipstick was broken, and the metal piece was too far down in
its little dipstick tube to pull out. Looks like I'm going to need a
magnet. Damn you, Kline Nissan, daaaaaaammmnnn yooouuuu.... Today
I let my boss know that I've submitted my Corps application. The news
has been greeted by everyone in the company with a level of enthusiasm
that really floors me. The back deck has finally been cleared off
by the construction company working on the place. This company, for
anyone who's interested, consists mainly of one guy who spends his
days cursing at his crew of Spanish-speaking laborers. Construction
of my deck began around the time Nixon was getting out of office.
"""

We then predict with the following code:

nb_predict(model, new_post)<br />

The resulting prediction, male, is correct for this example. Of course, we never test a model on a single sample. We used the file starting with 51 for training this model. That wasn't many samples, so we can't expect very high accuracy.

The first thing we should do is train on more samples. We will test on any file that starts with a 6 or 7 and train on the rest of the files.
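The selection rule itself is simple enough to sketch in Python. The function name and the example file names below are hypothetical and only illustrate the rule (names starting with 6 or 7 go to the test set, everything else to the training set); they do not describe the dataset's real file names.

```python
def split_by_prefix(filenames, test_prefixes=("6", "7")):
    """Split file names into (train, test) by their leading character."""
    train, test = [], []
    for name in filenames:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if name.startswith(test_prefixes):
            test.append(name)
        else:
            train.append(name)
    return train, test


# Hypothetical file names, for illustration only.
example_files = ["5114.xml", "6023.xml", "7150.xml", "1234.xml"]
train_files, test_files = split_by_prefix(example_files)
```

Splitting on the file name keeps each file wholly in one set, so no post from a test file can leak into training.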

In the command line and in your data folder (cd
