
Learning Data Mining with Python


Social Media Insight Using Naive Bayes

Putting it all together

Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before, pointing to both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows:

import os
import json  # needed below to parse the tweet and label files

input_filename = os.path.join(os.path.expanduser("~"), "Data",
                              "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data",
                               "twitter", "python_classes.json")

Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that. The code is as follows:

tweets = []
with open(input_filename) as inf:
    for line in inf:
        # Skip blank lines, then keep only each tweet's text field
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])

Load the labels for each of the tweets:

with open(labels_filename) as inf:
    labels = json.load(inf)

Now, create a pipeline putting together the components from before. Our pipeline has three parts:

• The NLTKBOW transformer we created earlier (a minimal sketch of such a transformer is shown after the pipeline code below)

• A DictVectorizer transformer

• A BernoulliNB classifier

The code is as follows:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# NLTKBOW is the bag-of-words transformer created earlier in the chapter
pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])
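If the NLTKBOW class from earlier in the chapter is not already defined in the current Notebook, the pipeline above will fail with a NameError. As a reminder, a minimal sketch of a compatible transformer is given below; this is an illustrative reconstruction rather than the exact code from the earlier section. It tokenizes each document with NLTK's word_tokenize and returns one dictionary per document mapping each word to True, which is the dictionary-of-features format that DictVectorizer expects:

from sklearn.base import BaseEstimator, TransformerMixin
from nltk import word_tokenize

class NLTKBOW(BaseEstimator, TransformerMixin):
    # Bag-of-words transformer (sketch): records word presence, not counts
    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit simply returns self
        return self

    def transform(self, X):
        # One {word: True} dictionary per document, ready for DictVectorizer
        return [{word: True for word in word_tokenize(document)}
                for document in X]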

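Finally, a quick way to check that the whole pipeline fits together is to cross-validate it on the tweets and labels. The snippet below is a sketch rather than the exact evaluation code from the book: it assumes the labels are the binary values we assigned earlier and uses the F1 score, which is more informative than raw accuracy when one class is much rarer than the other:

from sklearn.model_selection import cross_val_score
import numpy as np

# Run cross-validation over the full pipeline (tokenize -> vectorize -> classify)
scores = cross_val_score(pipeline, tweets, labels, scoring='f1')
print("F1-score: {:.3f}".format(np.mean(scores)))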
