10.11.2016 Views

Learning Data Mining with Python


Social Media Insight Using Naive Bayes

As a last bit of JavaScript for this chapter (I promise), we call the
load_next_tweet() function. This sets the first tweet to be labeled and then
closes off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox alongside the first tweet's text.
Click in the textbox and enter 1 if the tweet is relevant to our goal (in this case, if it
is related to the programming language Python) and 0 if it is not. After you do this,
the next tweet will load. Enter its label and the next one will load. This continues
until the tweets run out.

When you finish all of this, simply save the labels to the output filename we defined
earlier for the class values:

# Write the labels collected so far to disk as JSON
with open(labels_filename, 'w') as outf:
    json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have
done up to that point will be saved. Running this Notebook again will pick up where
you left off, and you can keep labeling your tweets.
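
The save-and-resume behavior described above can be sketched in plain Python. This is a minimal illustration, not the Notebook's actual loading code: the helper names load_labels and save_labels, the demo file name, and the use of a temporary directory are all assumptions made for this example.

```python
import json
import os
import tempfile


def load_labels(path):
    """Return previously saved labels, or an empty list if none exist yet."""
    if os.path.exists(path):
        with open(path) as inf:
            return json.load(inf)
    return []


def save_labels(labels, path):
    """Write the labels collected so far to disk as JSON."""
    with open(path, 'w') as outf:
        json.dump(labels, outf)


# Hypothetical demo: a temporary file stands in for labels_filename.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "labels_demo.json")

    # First session: nothing saved yet, so we start from an empty list.
    labels = load_labels(demo_path)
    labels.extend([1, 0, 1])  # three tweets labeled by hand
    save_labels(labels, demo_path)

    # Later session: reload and continue from the first unlabeled tweet.
    resumed = load_labels(demo_path)
    next_index = len(resumed)
```

Because the number of saved labels equals the number of tweets already handled, next_index points at the first tweet that still needs a label.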

This might take a while! If you have a lot of tweets in your dataset, you'll
need to classify all of them. If you are pushed for time, you can download the same
dataset I used, which contains classifications.

Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining
algorithms; they also appear in the data collection, the environment, and many other
factors. Being able to replicate your results is important, as it enables you to verify
or improve upon them.

Getting 80 percent accuracy on one dataset with algorithm X, and
90 percent accuracy on another dataset with algorithm Y, doesn't
mean that Y is better. We need to be able to test on the same
dataset under the same conditions to be able to compare them properly.
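
As a rough illustration of testing under the same conditions, the sketch below evaluates two toy classifiers on one identical, seeded train/test split. Everything here is an assumption made for the example: the synthetic data, the seed values, and the two stand-in "algorithms" (a majority-class baseline and a simple threshold rule) are hypothetical and are not the methods used in this chapter.

```python
import random


def split_dataset(samples, test_fraction=0.25, seed=14):
    """Shuffle with a fixed seed so the same split is produced on every run."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_set):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(1 for x, y in test_set if predict(x) == y)
    return correct / len(test_set)


# Synthetic toy data: one feature in [0, 1), labeled 1 when it exceeds 0.5.
data_rng = random.Random(0)
samples = []
for _ in range(200):
    value = data_rng.random()
    samples.append((value, int(value > 0.5)))

train, test = split_dataset(samples)

# Stand-in "algorithm X": always predict the majority class seen in training.
majority = int(sum(y for _, y in train) * 2 >= len(train))


def algorithm_x(x):
    return majority


# Stand-in "algorithm Y": threshold halfway between the two class means.
zeros = [x for x, y in train if y == 0]
ones = [x for x, y in train if y == 1]
threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def algorithm_y(x):
    return int(x > threshold)


# Both algorithms are scored on exactly the same held-out samples.
acc_x = accuracy(algorithm_x, test)
acc_y = accuracy(algorithm_y, test)
print("X: {:.2f}  Y: {:.2f}".format(acc_x, acc_y))
```

Since both scores come from the same split, any difference between them reflects the algorithms rather than the data each happened to be tested on.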
