08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Classification II – Sentiment Analysis

With our first try of using Naive Bayes on vectorized TF-IDF trigram features, we get an accuracy of 80.5 percent and a P/R AUC of 87.8 percent. The P/R chart shown in the following screenshot looks much more encouraging than the plots we saw in the previous chapter:

For the first time, the results are quite encouraging. They get even more impressive when we realize that 100 percent accuracy is probably never achievable in a sentiment classification task. For some tweets, even humans often disagree on the classification label.
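To make the P/R AUC figure quoted above more concrete, here is a minimal sketch of how such a value can be computed from classifier scores. It assumes trapezoidal integration over the precision/recall points and the convention of precision 1 at recall 0; the function names `precision_recall_points` and `pr_auc` and the toy data are ours, and the result need not match the method used to produce the 87.8 percent reported here.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct score threshold.

    Assumes y_true holds 0/1 labels with at least one positive example.
    """
    # Sort by descending score: lowering the threshold admits examples in this order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    points = [(0.0, 1.0)]  # assumed convention: precision 1 at recall 0
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once a run of tied scores has been fully counted.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def pr_auc(y_true, scores):
    """Area under the precision/recall curve via the trapezoidal rule."""
    points = precision_recall_points(y_true, scores)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Toy data (hypothetical): 1 = positive tweet, 0 = negative tweet.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(pr_auc(labels, scores), 3))  # 0.903
```

A classifier that ranks every positive tweet above every negative one reaches an area of 1.0 under this definition, so the value summarizes ranking quality across all thresholds rather than at a single cut-off.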

Using all the classes

But again, we simplified our task a bit, since we used only positive or negative tweets. That means we assumed a perfect classifier that first decided whether a tweet contains a sentiment at all and forwarded only those tweets to our Naive Bayes classifier.

So, how well do we perform if we also classify whether a tweet contains any sentiment at all? To find out, let us first write a convenience function that takes a list of sentiments we would like to interpret as positive and returns a correspondingly modified class array:

def tweak_labels(Y, pos_sent_list):
    pos = Y==pos_sent_list[0]
    for sent_label in pos_sent_list[1:]:
        pos |= Y==sent_label
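The listing above breaks off before the function returns anything. The following is a minimal sketch of one way to complete it, assuming that `Y` is a NumPy array of label strings and that the result should be an integer array with 1 for the chosen sentiments and 0 for everything else. The return statement and the label names in the usage example are our assumptions, not taken from the listing.

```python
import numpy as np


def tweak_labels(Y, pos_sent_list):
    # Boolean mask: True where the label equals the first "positive" sentiment.
    pos = Y == pos_sent_list[0]
    # OR in the masks for the remaining sentiments we want to treat as positive.
    for sent_label in pos_sent_list[1:]:
        pos |= Y == sent_label

    # Assumed completion: encode the mask as 1 (positive) / 0 (all other labels).
    return pos.astype(int)


# Hypothetical label set, for illustration only.
Y = np.array(["positive", "negative", "neutral", "irrelevant", "positive"])

# Treat both positive and negative tweets as "contains a sentiment".
has_sentiment = tweak_labels(Y, ["positive", "negative"])
print(has_sentiment)  # [1 1 0 0 1]
```

Passing a different list re-targets the same classifier setup at a different binary question. For instance, `tweak_labels(Y, ["positive"])` would pit positive tweets against all the rest. In current NumPy, `np.isin(Y, pos_sent_list)` should build the same mask in a single call.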
