08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Classification II – Sentiment Analysis

Solving an easy problem first

As we have seen when we looked at our tweet data, the tweets are not just positive or negative. The majority of tweets actually do not contain any sentiment, but are neutral or irrelevant, containing, for instance, raw information (New book: Building Machine Learning ... http://link). This leads to four classes. To avoid complicating the task too much, let us for now focus only on the positive and negative tweets:

>>> import numpy as np
>>> pos_neg_idx = np.logical_or(Y == "positive", Y == "negative")
>>> X = X[pos_neg_idx]
>>> Y = Y[pos_neg_idx]
>>> Y = (Y == "positive")

Now X holds the raw tweet texts and Y holds the binary labels as a Boolean array: False (0) for negative and True (1) for positive tweets.
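To make the filtering step concrete, here is a minimal sketch on a handful of made-up tweets. The texts and the label names `neutral` and `irrelevant` are assumptions for illustration only; the real X and Y come from the tweet corpus loaded earlier. The results are stored under the hypothetical names `X_bin` and `Y_bin` so that the original arrays stay visible.

```python
import numpy as np

# Hypothetical toy data standing in for the real tweet corpus.
X = np.array(["love it", "hate it", "New book: ... http://link",
              "no idea", "great phone"])
Y = np.array(["positive", "negative", "neutral", "irrelevant", "positive"])

# Boolean mask that is True only for positive or negative tweets.
pos_neg_idx = np.logical_or(Y == "positive", Y == "negative")

X_bin = X[pos_neg_idx]                 # keep only the sentiment-bearing tweets
Y_bin = Y[pos_neg_idx] == "positive"   # True for positive, False for negative

print(X_bin)
print(Y_bin.astype(int))               # the same labels written as 0 and 1
```

Because the mask is applied to both arrays, texts and labels stay aligned after the filtering.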

As we have learned in the previous chapters, we can use TfidfVectorizer to convert the raw tweet text into TF-IDF feature values, which we then use together with the labels to train our first classifier. For convenience, we will use the Pipeline class, which allows us to join the vectorizer and the classifier together and provides the same interface as a single classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # Word-level unigrams, bigrams, and trigrams weighted by TF-IDF
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = MultinomialNB()
    # Chain vectorizer and classifier so they behave like one estimator
    pipeline = Pipeline([('vect', tfidf_ngrams), ('clf', clf)])
    return pipeline

The Pipeline instance returned by create_ngram_model() can now be used for fit() and predict() as if we had a normal classifier.
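As a rough illustration of that interface, the sketch below fits the pipeline on six made-up tweets and classifies a new text. The toy texts, the labels, and the names `X_toy`, `Y_toy`, and `model` are assumptions for illustration and are not part of the book's dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def create_ngram_model():
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = MultinomialNB()
    return Pipeline([('vect', tfidf_ngrams), ('clf', clf)])


# Hypothetical toy tweets; True marks a positive tweet, False a negative one.
X_toy = np.array([
    "I love this phone", "what a great day", "love the new update",
    "I hate this phone", "what a terrible day", "hate the new update",
])
Y_toy = np.array([True, True, True, False, False, False])

model = create_ngram_model()
model.fit(X_toy, Y_toy)            # vectorizes the texts, then trains the classifier

new_tweets = np.array(["love great"])
pred = model.predict(new_tweets)         # Boolean class labels
proba = model.predict_proba(new_tweets)  # columns follow model.classes_
print(pred, proba)
```

The raw strings go straight into fit() and predict(); the vectorizer step is applied inside the pipeline, so no separate transform call is needed.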

Since we do not have that much data, we should do cross-validation. This time, however, we will not use KFold, which partitions the data into consecutive folds, but ShuffleSplit instead. It shuffles the data for us, but it does not prevent the same data instance from appearing in multiple folds. For each fold, we then keep track of the area under the Precision-Recall curve and the accuracy.
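The evaluation loop described above could be sketched as follows. This is not the book's own listing: it assumes a recent scikit-learn in which ShuffleSplit lives in `sklearn.model_selection`, and the helper name `train_model`, its parameters, and the generated toy data are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def create_ngram_model():
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    return Pipeline([('vect', tfidf_ngrams), ('clf', MultinomialNB())])


def train_model(clf_factory, X, Y, n_splits=5, test_size=0.5, seed=0):
    """Hypothetical helper: mean accuracy and mean P/R AUC over random splits."""
    cv = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    scores, pr_scores = [], []
    for train_idx, test_idx in cv.split(X):
        clf = clf_factory()                      # fresh model for every split
        clf.fit(X[train_idx], Y[train_idx])
        scores.append(clf.score(X[test_idx], Y[test_idx]))   # accuracy
        # Probability of the positive class (second column, classes_[1])
        proba = clf.predict_proba(X[test_idx])[:, 1]
        precision, recall, _ = precision_recall_curve(Y[test_idx], proba)
        pr_scores.append(auc(recall, precision))
    return np.mean(scores), np.mean(pr_scores)


# Made-up toy corpus: 10 positive and 10 negative two-word "tweets".
pos_words = ["love", "great", "awesome", "happy", "nice"]
neg_words = ["hate", "terrible", "awful", "sad", "bad"]
topics = ["phone", "update"]
X_toy = np.array([f"{w} {t}" for w in pos_words + neg_words for t in topics])
Y_toy = np.array([True] * 10 + [False] * 10)

acc, pr_auc = train_model(create_ngram_model, X_toy, Y_toy)
print("mean accuracy: %.3f, mean P/R AUC: %.3f" % (acc, pr_auc))
```

Each split draws its test set independently of the others, which is why an instance can land in the test set of several splits. On a corpus this small the scores say nothing about real tweets; the sketch only shows the bookkeeping.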
