Building Machine Learning Systems with Python - Richert, Coelho

Classification II – Sentiment Analysis

Fetching the Twitter data

Naturally, we need tweets and their corresponding labels that tell us whether a tweet contains positive, negative, or neutral sentiment. In this chapter, we will use the corpus from Niek Sanders, who has done an awesome job of manually labeling more than 5,000 tweets and granted us permission to use it in this chapter.

To comply with Twitter's terms of service, we will not provide any data from Twitter nor show any real tweets in this chapter. Instead, we can use Sanders' hand-labeled data, which contains the tweet IDs and their hand-labeled sentiment, and use his script, install.py, to fetch the corresponding Twitter data. As the script plays nicely with Twitter's servers, it will take quite some time to download all the data for more than 5,000 tweets, so it is a good idea to start it now.
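In spirit, the script's fetching loop looks something like the following rough sketch. This is not install.py's actual code; fetch_tweet stands in for whatever API call the script really makes, and the one-second delay is only an illustrative rate limit:

import time

def fetch_all(tweet_ids, fetch_tweet, delay_seconds=1.0):
    # fetch_tweet is a hypothetical stand-in for a real Twitter API call
    tweets = {}
    for tweet_id in tweet_ids:
        tweets[tweet_id] = fetch_tweet(tweet_id)  # one request per labeled ID
        time.sleep(delay_seconds)  # pause between requests to play nicely with the servers
    return tweets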

The data comes with four sentiment labels:

>>> X, Y = load_sanders_data()
>>> classes = np.unique(Y)
>>> for c in classes:
...     print("#%s: %i" % (c, sum(Y == c)))
#irrelevant: 543
#negative: 535
#neutral: 2082
#positive: 482

We will treat the irrelevant and neutral labels together as one class and ignore all non-English tweets, resulting in 3,642 tweets. The non-English tweets can be easily filtered using the data provided by Twitter.
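As a toy illustration of the relabeling step (the labels here are stand-ins, not the real corpus), folding irrelevant into neutral is a one-liner with NumPy:

>>> import numpy as np
>>> Y = np.array(["positive", "irrelevant", "neutral", "negative"])
>>> Y[Y == "irrelevant"] = "neutral"  # treat irrelevant as neutral
>>> np.unique(Y)
array(['negative', 'neutral', 'positive'], dtype='<U10')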

Introducing the Naive Bayes classifier

Naive Bayes is probably one of the most elegant machine learning algorithms out there that is of practical use. Despite its name, it is not that naive when you look at its classification performance. It proves to be quite robust to irrelevant features, which it kindly ignores. It learns fast and predicts equally fast. It does not require lots of storage. So, why is it then called naive?

The "naive" refers to one assumption that is required for Bayes' theorem to work optimally: all features must be independent of each other. This, however, is rarely the case for real-world applications. Nevertheless, Naive Bayes still returns very good accuracy in practice even when the independence assumption does not hold.
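In formulas, the independence assumption lets us factor the joint likelihood of the features F_1, ..., F_n into a product of per-feature likelihoods (this is the standard statement of Naive Bayes):

P(C \mid F_1, \dots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C)

Here, C is the sentiment class. Being able to estimate each P(F_i | C) on its own is precisely what makes both training and prediction so fast and storage-friendly.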

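To make the classifier's behavior concrete, here is a minimal, self-contained sketch using scikit-learn's MultinomialNB, the counts-based Naive Bayes variant suited to text. The four training tweets are made up for illustration and are not from the Sanders corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data, not the Sanders corpus
train_tweets = ["I love this phone", "awful battery, hate it",
                "great camera, love it", "terrible, broken screen"]
train_labels = ["positive", "negative", "positive", "negative"]

# Turn raw text into word-count vectors
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_tweets)

# Fit Naive Bayes and classify an unseen tweet
clf = MultinomialNB()
clf.fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(["I love this camera"])))
# ['positive']

Even with only four training examples, the counts for "love" and "camera" are enough to tip the prediction toward the positive class.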
