08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Classification II – Sentiment Analysis

The devastating results for positive tweets against the rest and negative tweets against the rest will improve if we configure the vectorizer and classifier with the parameters that we have just found:

== Pos vs. rest ==

0.883 0.005 0.520 0.028

== Neg vs. rest ==

0.888 0.009 0.631 0.031

Indeed, the P/R curves look much better (note that the graphs are from the median of the fold classifiers and thus have slightly diverging AUC values).
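To see why a plot taken from the median fold can show a slightly different AUC than the averaged figure, here is a minimal sketch. The helper name `median_fold_index` and the per-fold values are hypothetical and chosen only for illustration; they are not results from the text.

```python
def median_fold_index(fold_scores):
    """Return the index of the fold whose score sits in the middle of the ranking.

    With an even number of folds, the upper of the two middle folds is returned.
    """
    order = sorted(range(len(fold_scores)), key=lambda i: fold_scores[i])
    return order[len(order) // 2]


# Hypothetical per-fold P/R AUC values (assumption, not taken from the text).
pr_auc_per_fold = [0.49, 0.55, 0.51, 0.58, 0.47]

median_idx = median_fold_index(pr_auc_per_fold)
mean_auc = sum(pr_auc_per_fold) / len(pr_auc_per_fold)

# The plotted curve belongs to one concrete fold, while the table reports the mean.
print("median fold:", median_idx, "AUC:", pr_auc_per_fold[median_idx])
print("mean AUC: %.3f" % mean_auc)
```

Plotting the median fold rather than an averaged curve keeps the graph tied to one classifier that was actually trained, at the cost of a small mismatch with the reported mean.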

Nevertheless, we probably still wouldn't use those classifiers. Time for something completely different!

Cleaning tweets

New constraints lead to new forms. Twitter is no exception in this regard. Because text has to fit into 140 characters, people naturally develop new language shortcuts to say the same in fewer characters. So far, we have ignored all the diverse emoticons and abbreviations. Let's see how much we can improve by taking them into account. For this endeavor, we will have to provide our own preprocessor() to TfidfVectorizer.
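As a rough sketch of what such a preprocessor could look like: the replacement tables below are tiny, hypothetical examples, and the helper `make_vectorizer()` is an illustrative name, not something prescribed by the text. The only library feature relied on is the `preprocessor` parameter of scikit-learn's `TfidfVectorizer`.

```python
import re

# Hypothetical, deliberately tiny replacement tables; extend them for real data.
EMOTICON_WORDS = {
    ":-)": " good ",
    ":)": " good ",
    ":d": " good ",  # tweets are lowercased first, so ":D" arrives as ":d"
    ":-(": " bad ",
    ":(": " bad ",
}
ABBREVIATIONS = {
    r"\bu\b": "you",
    r"\br\b": "are",
    r"\bluv\b": "love",
}

# Longest first, so that a short emoticon never clobbers part of a longer one.
EMOTICONS_BY_LENGTH = sorted(EMOTICON_WORDS, key=len, reverse=True)


def preprocessor(tweet):
    """Lowercase a tweet and spell out emoticons and abbreviations as words."""
    tweet = tweet.lower()
    for emoticon in EMOTICONS_BY_LENGTH:
        tweet = tweet.replace(emoticon, EMOTICON_WORDS[emoticon])
    for pattern, word in ABBREVIATIONS.items():
        tweet = re.sub(pattern, word, tweet)
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(tweet.split())


def make_vectorizer():
    """Build a TfidfVectorizer that runs preprocessor() before tokenizing."""
    # Imported here so preprocessor() can be tried without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer
    return TfidfVectorizer(preprocessor=preprocessor)


print(preprocessor("U r great :-)"))
```

Two details are worth keeping in mind. Supplying a custom preprocessor replaces the vectorizer's built-in lowercasing and accent stripping, which is why the function lowercases the text itself. And because the default tokenizer keeps only word tokens of two or more characters, emoticons would otherwise be dropped entirely; mapping them to ordinary words is what lets them reach the classifier.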

