

r"\bwouldn't\b": "would not",<br />

r"\bcan't\b": "can not",<br />

r"\bcannot\b": "can not",<br />

}<br />

def create_ngram_model(params=None):<br />

def preprocessor(tweet):<br />

global emoticons_replaced<br />

tweet = tweet.lower()<br />

#return tweet.lower()<br />

for k in emo_repl_order:<br />

tweet = tweet.replace(k, emo_repl[k])<br />

for r, repl in re_repl.iteritems():<br />

tweet = re.sub(r, repl, tweet)<br />

return tweet<br />

tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,<br />

analyzer="word")<br />

# ...<br />
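If we pull preprocessor out to the module level for a quick sanity check, we can see both replacement passes at work. The exact output depends on the emoticon dictionary defined earlier; here we assume that emo_repl maps ":)" to " good ":

>>> preprocessor("I can't believe it :)")
'i can not believe it  good '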

Certainly, there are many more abbreviations that could be handled here. But even with this limited set, we get an improvement of half a point for sentiment versus no sentiment, which now stands at 70.7 percent (in each row below, the first two numbers are the mean and standard deviation of the accuracy, the last two the mean and standard deviation of the P/R AUC):

== Pos vs. neg ==
0.804   0.022   0.886   0.011
== Pos/neg vs. irrelevant/neutral ==
0.797   0.009   0.707   0.029
== Pos vs. rest ==
0.884   0.005   0.527   0.025
== Neg vs. rest ==
0.886   0.011   0.640   0.032

Taking the word types into account

So far, our hope was that simply treating the words independently of each other in a bag-of-words approach would suffice. Intuitively, however, neutral tweets probably contain a higher fraction of nouns, while positive or negative tweets are more colorful, requiring more adjectives and verbs. What if we could use this linguistic information as well? If we could find out how many words in a tweet were nouns, verbs, adjectives, and so on, the classifier could maybe take that into account too.
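As a quick sketch of the idea (not the full approach, and count_word_types is a hypothetical helper), NLTK's default part-of-speech tagger can be used to compute the fraction of nouns, verbs, and adjectives per tweet:

import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data

def count_word_types(tweet):
    tagged = nltk.pos_tag(nltk.word_tokenize(tweet))
    n = len(tagged) or 1
    # Penn Treebank tags: NN* = nouns, VB* = verbs, JJ* = adjectives
    nouns = sum(1 for _, tag in tagged if tag.startswith("NN"))
    verbs = sum(1 for _, tag in tagged if tag.startswith("VB"))
    adjectives = sum(1 for _, tag in tagged if tag.startswith("JJ"))
    # fractions that could be appended as extra features
    return nouns / n, verbs / n, adjectives / n

These three fractions could then be concatenated with the TF-IDF features before training the classifier.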

