08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 6

        for d in documents:
            allcaps.append(np.sum([t.isupper()
                                   for t in d.split() if len(t) > 2]))
            exclamation.append(d.count("!"))
            question.append(d.count("?"))
            hashtag.append(d.count("#"))
            mentioning.append(d.count("@"))

        result = np.array([obj_val, pos_val, neg_val,
                           nouns, adjectives, verbs, adverbs,
                           allcaps, exclamation, question,
                           hashtag, mentioning]).T
        return result
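
The per-document counts in the loop above can be tried in isolation. The following is only a minimal sketch: the function name surface_counts and the two sample tweets are made up for illustration, and plain Python lists stand in for the NumPy arrays used in the listing.

```python
def surface_counts(documents):
    """Return one row per document:
    [allcaps, exclamation, question, hashtag, mentioning].

    Hypothetical helper that mirrors only the counting loop of the listing.
    """
    rows = []
    for d in documents:
        # Count fully upper-case tokens, ignoring tokens of one or two characters
        allcaps = sum(t.isupper() for t in d.split() if len(t) > 2)
        rows.append([allcaps,
                     d.count("!"),
                     d.count("?"),
                     d.count("#"),
                     d.count("@")])
    return rows


# Made-up sample tweets, not taken from the book's data set
tweets = ["I LOVE this phone!!! #happy", "@bob why is it SO slow?"]
rows = surface_counts(tweets)
print(rows)  # [[1, 3, 0, 1, 0], [0, 0, 1, 0, 1]]
```

Note that "SO" in the second tweet is not counted as all-caps because of the len(t) > 2 filter.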

Putting everything together

Nevertheless, using these linguistic features in isolation, without the words themselves,
will not take us very far. Therefore, we have to combine TfidfVectorizer with
the linguistic features. This can be done with scikit-learn's FeatureUnion class. It
is initialized the same way as Pipeline, but instead of evaluating the estimators
in sequence, with each one passing its output to the next,
FeatureUnion evaluates them in parallel and joins the output vectors afterwards:

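    import re

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import FeatureUnion

    # emo_repl, emo_repl_order, re_repl and LinguisticVectorizer are not defined
    # in this excerpt; they are assumed to come from earlier in the chapter.

    def create_union_model(params=None):
        def preprocessor(tweet):
            tweet = tweet.lower()
            # Replace emoticons first, in the order given by emo_repl_order
            for k in emo_repl_order:
                tweet = tweet.replace(k, emo_repl[k])
            # Then apply the regular-expression replacements
            # (items() works on both Python 2 and 3, unlike iteritems())
            for r, repl in re_repl.items():
                tweet = re.sub(r, repl, tweet)
            return tweet.replace("-", " ").replace("_", " ")

        tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                       analyzer="word")
        ling_stats = LinguisticVectorizer()
        # Run both vectorizers on the same input and concatenate their outputs
        all_features = FeatureUnion([('ling', ling_stats),
                                     ('tfidf', tfidf_ngrams)])
        clf = MultinomialNB()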
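
Because create_union_model depends on helpers that are not part of this excerpt, the parallel-then-join behaviour of FeatureUnion can be tried on its own. The following is a minimal sketch under stated assumptions: PunctuationCounter is a hypothetical stand-in for LinguisticVectorizer that emits only two counts per document, and the three documents are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class PunctuationCounter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for LinguisticVectorizer: two counts per document."""

    def fit(self, documents, y=None):
        return self  # nothing to learn from the data

    def transform(self, documents):
        # One row per document: [number of "!", number of "?"]
        return np.array([[d.count("!"), d.count("?")] for d in documents])


# Made-up documents, not taken from the book's data set
docs = ["great phone!", "why so slow?", "great great battery"]

union = FeatureUnion([("punct", PunctuationCounter()),
                      ("tfidf", TfidfVectorizer())])

# Each transformer sees the same input; the outputs are stacked column-wise
X = union.fit_transform(docs)
dense = X.toarray()  # the result is sparse because TfidfVectorizer is sparse

n_words = len(dict(union.transformer_list)["tfidf"].vocabulary_)
print(n_words)  # 6
print(X.shape)  # (3, 8): 2 punctuation columns + 6 TF-IDF columns
```

The columns appear in the order in which the transformers are listed, so here the first two columns hold the punctuation counts and the remaining ones hold the TF-IDF weights.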

