08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 6

    # start with all zeros, that is, everything in the negative class
    Y = np.zeros(Y.shape[0])
    # mark the entries selected by pos as the positive class
    Y[pos] = 1
    Y = Y.astype(int)
    return Y

Note that we are now talking about two different kinds of positive. The sentiment of a tweet can be positive, which must be distinguished from the positive class of the training data. If, for example, we want to find out how well we can separate tweets that carry sentiment from neutral ones, we could do this as follows:

>>> Y = tweak_labels(Y, ["positive", "negative"])

In Y we now have a 1 (positive class) for all tweets that are either positive or negative and a 0 (negative class) for neutral and irrelevant ones.
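The relabeling described here can be illustrated on a handful of made-up labels. The following is only a minimal sketch: `binarize_labels` and the toy array are hypothetical names invented for this illustration, and the sketch does not claim to reproduce how `tweak_labels` is implemented.

```python
import numpy as np


def binarize_labels(labels, positive_labels):
    """Hypothetical helper: 1 where a label is in positive_labels, else 0."""
    labels = np.asarray(labels)
    # Boolean mask that is True for every label counted as the positive class
    pos = np.isin(labels, positive_labels)
    # Convert True/False to 1/0
    return pos.astype(int)


# Toy labels, invented for illustration only (not taken from the tweet corpus)
toy_Y = np.array(["positive", "neutral", "negative", "irrelevant", "neutral"])
toy_binary = binarize_labels(toy_Y, ["positive", "negative"])
print(toy_binary)
```

Tweets labeled "positive" or "negative" end up as 1, and everything else ends up as 0, which matches the class assignment described in the text.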

>>> train_model(create_ngram_model, X, Y, plot=True)
0.767 0.014 0.670 0.022

As expected, the P/R AUC drops considerably, to only 67 percent. The accuracy is still high, but only because the dataset is highly imbalanced. Out of 3,642 total tweets, only 1,017 are either positive or negative, which is about 28 percent. This means that a classifier that always labeled a tweet as containing no sentiment would already reach an accuracy of 72 percent. This is another example of why you should always look at precision and recall when the training and test data are unbalanced.
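The arithmetic behind this argument can be checked in a few lines. This is a small sketch that uses only the two counts quoted in the text; the variable names are made up for the illustration, and reporting precision as 0.0 when nothing is predicted positive is a convention assumed here, since the ratio is strictly undefined in that case.

```python
total_tweets = 3642
sentiment_tweets = 1017  # tweets that are either positive or negative

positive_share = sentiment_tweets / total_tweets

# Confusion counts for a classifier that always predicts "no sentiment"
true_positives = 0
false_positives = 0
false_negatives = sentiment_tweets
true_negatives = total_tweets - sentiment_tweets

accuracy = (true_positives + true_negatives) / total_tweets
recall = true_positives / (true_positives + false_negatives)

# Precision is 0/0 here; 0.0 is an assumed convention for this sketch
predicted_positive = true_positives + false_positives
precision = true_positives / predicted_positive if predicted_positive else 0.0

print(f"share of sentiment tweets: {positive_share:.3f}")
print(f"accuracy: {accuracy:.3f}  precision: {precision:.3f}  recall: {recall:.3f}")
```

The always-negative classifier gets roughly 72 percent accuracy while finding none of the sentiment tweets, which is why accuracy alone is misleading on this dataset.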
