Building Machine Learning Systems with Python - Richert, Coelho
Classification II – Sentiment Analysis
Solving an easy problem first
As we have seen when we looked at our tweet data, the tweets are not just positive or negative. The majority of tweets actually do not contain any sentiment at all, but are neutral or irrelevant, containing, for instance, raw information (New book: Building Machine Learning ... http://link). This leads to four classes. To avoid complicating the task too much, let us for now focus only on the positive and negative tweets:
>>> pos_neg_idx = np.logical_or(Y == "positive", Y == "negative")
>>> X = X[pos_neg_idx]
>>> Y = Y[pos_neg_idx]
>>> Y = Y == "positive"
Now we have the raw tweet texts in X and the binary class labels in Y: 0 (False) for negative and 1 (True) for positive tweets.
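The filtering step above can be sketched on a handful of made-up labels (the tweet texts and labels here are hypothetical stand-ins for the real corpus):

```python
import numpy as np

# Hypothetical tweets and their four-class labels
X = np.array(["great book!", "new release", "waste of money",
              "http://link", "loving it"])
Y = np.array(["positive", "neutral", "negative", "irrelevant", "positive"])

# Boolean mask keeping only the positive and negative tweets
pos_neg_idx = np.logical_or(Y == "positive", Y == "negative")
X = X[pos_neg_idx]
Y = Y[pos_neg_idx]

# Convert the remaining string labels into booleans (True = positive)
Y = Y == "positive"

print(X)  # ['great book!' 'waste of money' 'loving it']
print(Y)  # [ True False  True]
```

Note that the final assignment replaces the string labels entirely, which is why the mask has to be applied to Y before the conversion.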
As we have learned in the previous chapters, we can construct a TfidfVectorizer to convert the raw tweet text into TF-IDF feature values, which we then use together with the labels to train our first classifier. For convenience, we will use the Pipeline class, which allows us to join the vectorizer and the classifier together while providing the same interface:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = MultinomialNB()
    pipeline = Pipeline([('vect', tfidf_ngrams), ('clf', clf)])
    return pipeline
The Pipeline instance returned by create_ngram_model() can now be used for fit() and predict() as if it were a normal classifier.
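To illustrate this interface, here is a minimal sketch that trains the pipeline on a few invented tweets and predicts on a new one (the texts and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = MultinomialNB()
    return Pipeline([('vect', tfidf_ngrams), ('clf', clf)])

# Hypothetical toy tweets and binary labels (True = positive)
X = ["love this book", "awesome read", "terrible book", "hate this"]
Y = [True, True, False, False]

model = create_ngram_model()
model.fit(X, Y)            # vectorizer and classifier are fit in one call
print(model.predict(["awesome book"]))
```

The pipeline takes care of passing the TF-IDF features from the vectorizer into the classifier, so the raw strings go directly into fit() and predict().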
Since we do not have that much data, we should use cross-validation. This time, however, we will not use KFold, which partitions the data into consecutive folds, but ShuffleSplit instead. This shuffles the data for us, but does not prevent the same data instance from appearing in multiple folds. For each fold, we then keep track of the area under the precision-recall curve and of the accuracy.
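A sketch of such a cross-validation loop is shown below, using the modern sklearn.model_selection API (the book's edition predates it and imported ShuffleSplit from sklearn.cross_validation); the tweets and labels are again invented stand-ins:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score, average_precision_score

# Hypothetical labeled tweets standing in for the real corpus
pos = ["love it", "great fun", "awesome book", "nice read", "so good",
       "brilliant work", "really enjoyed it", "fantastic", "superb", "well done"]
neg = ["hate it", "terrible", "waste of money", "boring text", "so bad",
       "awful work", "really disliked it", "dreadful", "poor", "badly done"]
X = np.array(pos + neg)
Y = np.array([True] * len(pos) + [False] * len(neg))

pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 3), analyzer="word")),
    ('clf', MultinomialNB()),
])

# ShuffleSplit draws a fresh random train/test split for every fold
cv = ShuffleSplit(n_splits=10, test_size=0.4, random_state=0)
scores, pr_aucs = [], []
for train, test in cv.split(X):
    pipeline.fit(X[train], Y[train])
    proba = pipeline.predict_proba(X[test])[:, 1]
    scores.append(accuracy_score(Y[test], pipeline.predict(X[test])))
    pr_aucs.append(average_precision_score(Y[test], proba))

print("Mean accuracy: %.2f" % np.mean(scores))
print("Mean P/R AUC:  %.2f" % np.mean(pr_aucs))
```

average_precision_score summarizes the precision-recall curve in a single number, which is one common way to measure the "area under the P/R curve" tracked per fold.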