10.11.2016 Views

Learning Data Mining with Python


Social Media Insight Using Naive Bayes

As stated, we will use the json library, so import that too:

import json

We create a list that will store the tweets we received from the file:

tweets = []

We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows:

with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))
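The loading step can be tried in isolation. The following is a minimal sketch, assuming a file with one JSON-encoded tweet per line and blank lines in between. The function name load_tweets, the temporary file, and the sample tweet texts are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def load_tweets(input_filename):
    """Return a list of tweets from a file holding one JSON object per line.

    Blank lines (the separators between tweets) are skipped.
    """
    tweets = []
    with open(input_filename) as inf:
        for line in inf:
            if len(line.strip()) == 0:
                continue
            tweets.append(json.loads(line))
    return tweets


# Build a small, made-up input file: two tweets separated by blank lines.
sample = [
    {"text": "Learning python with pandas"},
    {"text": "Saw a python at the zoo"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as outf:
    for tweet in sample:
        outf.write(json.dumps(tweet))
        outf.write("\n\n")
    sample_filename = outf.name

loaded_tweets = load_tweets(sample_filename)
os.remove(sample_filename)  # clean up the throwaway file
print(len(loaded_tweets))
```

Wrapping the loop in a function keeps the list local, so rerunning a notebook cell does not append the same tweets twice.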

We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means that it refers to the programming language Python). We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets that lets us easily and quickly classify the tweets as relevant or not.

The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled.
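The viewer itself relies on the notebook's HTML and JavaScript support. The present, label, store cycle it implements can also be sketched in plain Python. This is a stand-in under stated assumptions, not the notebook-based viewer: label_tweets, its ask parameter, the 1/0 label encoding, and the sample tweets are all hypothetical, and each tweet is assumed to carry its content under a "text" key.

```python
def label_tweets(tweets, labels, ask=input):
    """Ask for a label for every tweet that does not have one yet.

    `labels` is extended in place, so labeling resumes after the last
    stored label. `ask` is any callable taking a prompt and returning text.
    """
    for tweet in tweets[len(labels):]:
        answer = ask("{}\nRelevant? (1 = yes, 0 = no): ".format(tweet["text"]))
        # Assumption: anything other than "1" counts as not relevant.
        labels.append(1 if answer.strip() == "1" else 0)
    return labels


# Made-up tweets, with scripted answers standing in for keyboard input.
demo_tweets = [
    {"text": "New python release adds pattern matching"},
    {"text": "A python escaped from the zoo"},
    {"text": "python list comprehensions are neat"},
]
scripted_answers = iter(["1", "0", "1"])
demo_labels = label_tweets(
    demo_tweets, [], ask=lambda prompt: next(scripted_answers)
)
print(demo_labels)
```

Because the loop starts at len(labels), passing in a partly filled list picks up at the first unlabeled tweet.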

First, we create a list for storing the labels. Each label records whether or not the given tweet refers to the programming language Python, and together they will allow our classifier to learn how to differentiate between meanings.

We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows:

labels = []
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)
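Loading labels only helps if they are also written out regularly. Here is a minimal sketch of a matching save step and a round trip through it, assuming the labels are kept as a JSON list and that the os module is imported. The names save_labels and load_labels and the temporary file path are hypothetical, not code from the original.

```python
import json
import os
import tempfile


def load_labels(labels_filename):
    """Return previously saved labels, or an empty list if none exist."""
    if os.path.exists(labels_filename):
        with open(labels_filename) as inf:
            return json.load(inf)
    return []


def save_labels(labels, labels_filename):
    """Write the current labels to disk as a JSON list."""
    with open(labels_filename, "w") as outf:
        json.dump(labels, outf)


# Hypothetical path inside a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_labels_filename = os.path.join(demo_dir, "demo_labels.json")

first_run = load_labels(demo_labels_filename)   # nothing saved yet
save_labels([1, 0, 1], demo_labels_filename)    # checkpoint midway
second_run = load_labels(demo_labels_filename)  # resume from the checkpoint
print(first_run, second_run)
```

Calling the save step after every label, or every few labels, bounds how much work a crash can cost.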

