10.11.2016 Views

Learning Data Mining with Python


Chapter 12

The first parameter, /blogs/51* (just remember to prepend the full path to your data folder), obtains a sample of the data: all files starting with 51, which is only 11 documents. We then set the output directory to a new folder, which we put in the data folder, and specify not to output the streamed data. Without this last option, the output data is printed to the command line when we run the script, which isn't very helpful to us and slows down the computer quite a lot.

Run the script, and each of the blog posts will quickly be extracted and stored in our output folder. The script ran on only a single thread on the local computer, so we got no speedup at all, but we now know that the code runs.

We can now look in the output folder for the results. Several files are created, and each file contains one blog post per line, with each post preceded by the gender of its author.

Training Naive Bayes

Now that we have extracted the blog posts, we can train our Naive Bayes model on them. The intuition is that we record, for each word, the probability of that word being written by a particular gender. To classify a new sample, we multiply the probabilities of its words for each gender and pick the most likely gender.
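As a rough illustration of the classification step just described, the following sketch scores a list of words against a small word-frequency table. It is not the book's implementation: the `classify` function, the `floor` value used when a word was never seen for a gender, and the equal treatment of both genders (no prior) are all assumptions made for this example. The three table entries are taken from the sample output shown later in this section. The sketch sums log probabilities instead of multiplying raw probabilities, which picks the same winner while avoiding numerical underflow on long posts.

```python
import math

# Word -> {gender: frequency}; the values come from the sample output
# in this section, the variable name is only illustrative.
word_freqs = {
    "'an": {"male": 0.0030581039755351682, "female": 0.004273504273504274},
    "'behind": {"male": 0.0024390243902439024},
    "'bout": {"female": 0.002034152292059272},
}


def classify(words, word_freqs, genders=("male", "female"), floor=1e-5):
    """Return (most likely gender, per-gender log scores) for a list of words.

    `floor` is an assumed fallback probability for a word that appears in
    the table but was never recorded for the gender being scored.
    """
    scores = {}
    for gender in genders:
        log_score = 0.0
        for word in words:
            if word not in word_freqs:
                continue  # word unseen in training: ignore it
            probability = word_freqs[word].get(gender, floor)
            # Adding logs is equivalent to multiplying the probabilities.
            log_score += math.log(probability)
        scores[gender] = log_score
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    label, scores = classify(["'an", "'bout"], word_freqs)
    print(label)  # female
```

Here "'bout" was only recorded for female authors, so the male score falls back to the much smaller `floor` value for that word and the female label wins.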

The aim of the training code is to output a file that lists each word in the corpus, along with the frequencies of that word for each gender. The output file will look something like this:

"'ailleurs" {"female": 0.003205128205128205}
"'air" {"female": 0.003205128205128205}
"'an" {"male": 0.0030581039755351682, "female": 0.004273504273504274}
"'angoisse" {"female": 0.003205128205128205}
"'apprendra" {"male": 0.0013047113868622459, "female": 0.0014172668603481887}
"'attendent" {"female": 0.00641025641025641}
"'autistic" {"male": 0.002150537634408602}
"'auto" {"female": 0.003205128205128205}
"'avais" {"female": 0.00641025641025641}
"'avait" {"female": 0.004273504273504274}
"'behind" {"male": 0.0024390243902439024}
"'bout" {"female": 0.002034152292059272}

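Each line of the output shown above is a JSON string (the word) followed by a JSON object (the per-gender frequencies). A minimal sketch of reading such lines back into a dictionary might look like the following. The names `parse_line` and `load_model` are hypothetical, and the code assumes only that the two JSON values are separated by whitespace (a tab or a space).

```python
import json

_decoder = json.JSONDecoder()


def parse_line(line):
    """Split one output line into (word, {gender: frequency})."""
    line = line.strip()
    # raw_decode parses one JSON value and reports where it ended.
    word, end = _decoder.raw_decode(line)
    # The remainder, after the separating whitespace, is the JSON object.
    freqs, _ = _decoder.raw_decode(line[end:].lstrip())
    return word, freqs


def load_model(lines):
    """Build a word -> frequencies dictionary from an iterable of lines."""
    model = {}
    for line in lines:
        if line.strip():  # skip blank lines
            word, freqs = parse_line(line)
            model[word] = freqs
    return model


if __name__ == "__main__":
    sample = [
        '"\'an" {"male": 0.0030581039755351682, "female": 0.004273504273504274}',
        '"\'behind"\t{"male": 0.0024390243902439024}',
    ]
    model = load_model(sample)
    print(sorted(model))  # ["'an", "'behind"]
```

Passing an open file object to `load_model` works the same way as passing a list, since both yield one line at a time.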
