Building Machine Learning Systems with Python - Richert, Coelho


Classification II – Sentiment Analysis

    r"\bwouldn't\b": "would not",
    r"\bcan't\b": "can not",
    r"\bcannot\b": "can not",
}

def create_ngram_model(params=None):
    def preprocessor(tweet):
        global emoticons_replaced
        tweet = tweet.lower()
        #return tweet.lower()
        # Replace emoticons in the predefined order
        for k in emo_repl_order:
            tweet = tweet.replace(k, emo_repl[k])
        # Expand abbreviations; items() works on Python 2 and 3
        for r, repl in re_repl.items():
            tweet = re.sub(r, repl, tweet)
        return tweet

    tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                   analyzer="word")
    # ...

Certainly, many more abbreviations could be added here. But even with this limited set, the sentiment versus non-sentiment task (Pos/neg vs. irrelevant/neutral) improves by half a point, to 70.7 percent:

== Pos vs. neg ==
0.804 0.022 0.886 0.011

== Pos/neg vs. irrelevant/neutral ==
0.797 0.009 0.707 0.029

== Pos vs. rest ==
0.884 0.005 0.527 0.025

== Neg vs. rest ==
0.886 0.011 0.640 0.032
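To see what the preprocessing step does to a single tweet, the following is a minimal, self-contained sketch. The emoticon dictionary and its replacement words are shortened, hypothetical stand-ins rather than the chapter's full tables, and the ordering by key length is one assumed way to make sure longer emoticons are replaced before shorter ones that they contain.

```python
import re

# Hypothetical, heavily shortened stand-in for the chapter's emoticon table.
# Keys are lowercase because the tweet is lowercased before replacement.
emo_repl = {
    ":dd": " good ",
    ":d": " good ",
    ":(": " bad ",
}

# Assumed ordering: longest keys first, so ":dd" is handled before ":d".
emo_repl_order = sorted(emo_repl, key=len, reverse=True)

# The abbreviation patterns visible in the listing above.
re_repl = {
    r"\bwouldn't\b": "would not",
    r"\bcan't\b": "can not",
    r"\bcannot\b": "can not",
}


def preprocessor(tweet):
    """Lowercase a tweet, then replace emoticons and expand abbreviations."""
    tweet = tweet.lower()
    for k in emo_repl_order:
        tweet = tweet.replace(k, emo_repl[k])
    for pattern, repl in re_repl.items():
        tweet = re.sub(pattern, repl, tweet)
    return tweet


# Split on whitespace to inspect the resulting tokens.
tokens = preprocessor("I CAN'T wait :DD").split()
print(tokens)
```

Replacing `:d` before `:dd` would leave a stray `d` behind, which is why the order of the emoticon replacements matters.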

Taking the word types into account

So far, we have used the words independently of one another, hoping that a bag-of-words approach would suffice. Intuition suggests, however, that neutral tweets probably contain a higher fraction of nouns, while positive or negative tweets are more colorful and use more adjectives and verbs. What if we could use this linguistic information as well? If we could find out how many words in a tweet are nouns, verbs, adjectives, and so on, the classifier could take that into account too.
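As a rough illustration of the idea, the sketch below turns an already tagged tweet into word-type fractions. It assumes the tokens carry Penn Treebank part-of-speech tags (the kind a tagger such as NLTK's `nltk.pos_tag` returns); the function name `word_type_fractions`, the coarse grouping by tag prefix, and the hand-tagged example are hypothetical choices made for this example only.

```python
from collections import Counter

# Assumed grouping of Penn Treebank tag prefixes into coarse word types.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}
WORD_TYPES = ["noun", "verb", "adjective", "adverb", "other"]


def word_type_fractions(tagged_tokens):
    """Return the fraction of tokens that fall into each coarse word type.

    tagged_tokens is a list of (word, tag) pairs with Penn Treebank tags.
    """
    counts = Counter()
    for _, tag in tagged_tokens:
        # The first two characters identify the tag family (NN, NNS, NNP -> NN).
        counts[COARSE.get(tag[:2], "other")] += 1
    total = float(len(tagged_tokens))
    if total == 0:
        return {name: 0.0 for name in WORD_TYPES}
    return {name: counts[name] / total for name in WORD_TYPES}


# Hand-tagged example; in practice the tags would come from a POS tagger.
tagged = [("great", "JJ"), ("phone", "NN"), ("love", "VBP"), ("it", "PRP")]
fractions = word_type_fractions(tagged)
print(fractions)
```

Fractions like these could be appended to the TF-IDF features as extra columns, so that the classifier sees both the words and the mix of word types.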
