Building Machine Learning Systems with Python - Richert, Coelho
Classification II – Sentiment Analysis
Fetching the Twitter data
Naturally, we need tweets and their corresponding labels that tell us whether a tweet contains positive, negative, or neutral sentiment. In this chapter, we will use the corpus from Niek Sanders, who has done an awesome job of manually labeling more than 5,000 tweets and granted us permission to use it in this chapter.
To comply with Twitter's terms of service, we will not provide any data from Twitter nor show any real tweets in this chapter. Instead, we can use Sanders' hand-labeled data, which contains the tweet IDs and their hand-labeled sentiment, and use his script, install.py, to fetch the corresponding Twitter data. Because the script plays nicely with Twitter's servers, it will take quite some time to download all the data for more than 5,000 tweets, so it is a good idea to start it now.
The data comes with four sentiment labels:
>>> X, Y = load_sanders_data()
>>> classes = np.unique(Y)
>>> for c in classes:
...     print("#%s: %i" % (c, sum(Y==c)))
#irrelevant: 543
#negative: 535
#neutral: 2082
#positive: 482
We will treat the irrelevant and neutral labels together and ignore all non-English tweets, resulting in 3,642 tweets. The non-English tweets can be easily filtered out using the metadata provided by Twitter.
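Merging the two labels can be sketched as follows; the small label array here is a made-up stand-in for the Y returned by load_sanders_data(), and mapping both classes onto the name "neutral" is just one reasonable convention:

```python
import numpy as np

# Made-up stand-in for the label array Y returned by load_sanders_data()
Y = np.array(["positive", "irrelevant", "neutral",
              "negative", "irrelevant", "positive"])

# Treat "irrelevant" and "neutral" as one class
Y = np.array(["neutral" if y in ("irrelevant", "neutral") else y
              for y in Y])

print(np.unique(Y))  # three remaining classes
```

After this step, the classifier only has to distinguish positive, negative, and (merged) neutral tweets.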
Introducing the Naive Bayes classifier
Naive Bayes is probably one of the most elegant machine learning algorithms out there that is of practical use. Despite its name, it is not that naive when you look at its classification performance. It proves to be quite robust to irrelevant features, which it kindly ignores. It learns fast and predicts equally fast, and it does not require lots of storage. So, why is it then called naive?
The naive was added to account for one assumption that is required for Bayes' theorem to work optimally: all features must be independent of each other. This, however, is rarely the case for real-world applications. Nevertheless, it still returns very good accuracy in practice even when the independence assumption does not hold.
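To see Naive Bayes in action on text, the following sketch trains scikit-learn's MultinomialNB on a tiny bag-of-words corpus; the documents and labels are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training documents and sentiment labels
docs = ["awesome great phone",
        "awful terrible battery",
        "great battery",
        "terrible phone"]
labels = ["positive", "negative", "positive", "negative"]

# Turn the texts into word-count feature vectors
vect = CountVectorizer()
X = vect.fit_transform(docs)

# Fit the Naive Bayes classifier on the counts
clf = MultinomialNB()
clf.fit(X, labels)

# Classify an unseen document
pred = clf.predict(vect.transform(["great awesome battery"]))
print(pred[0])  # → positive
```

Each word is treated as an independent feature, which is exactly the independence assumption discussed above; despite it being violated in real text, the predictions are usually sensible.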