

>>> print(tfidf("a", a, D))
0.0
>>> print(tfidf("b", abb, D))
0.270310072072
>>> print(tfidf("a", abc, D))
0.0
>>> print(tfidf("b", abc, D))
0.135155036036
>>> print(tfidf("c", abc, D))
0.366204096223

We see that a carries no meaning for any document, since it is contained in every one of them. The term b is more important for the document abb than for abc, as it occurs twice in abb but only once in abc.
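If you don't have the definition from the previous page at hand, the following sketch reproduces these numbers. Treat it as a reconstruction: the exact contents of the toy documents a, abb, and abc, the length-normalized term frequency, and the natural-log IDF are inferred from the printed values rather than quoted from the original listing:

>>> import math
>>> def tfidf(term, doc, corpus):
...     # Term frequency: occurrences of term, normalized by document length.
...     tf = float(doc.count(term)) / len(doc)
...     # Document frequency: the number of documents containing the term.
...     df = len([d for d in corpus if term in d])
...     # Natural logarithm, which matches the printed values.
...     idf = math.log(float(len(corpus)) / df)
...     return tf * idf
>>> a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
>>> D = [a, abb, abc]

For instance, tfidf("c", abc, D) works out to (1/3) * ln(3), approximately 0.3662, matching the output above.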

In reality, there are more corner cases to handle than the preceding example covers. Thanks to scikit-learn, we don't have to think about them; they are already nicely packaged in TfidfVectorizer, which inherits from CountVectorizer. And of course, we don't want to leave out our stemmer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
...         return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
>>> vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english',
...                                     decode_error='ignore')

(In older versions of scikit-learn, the decode_error parameter was called charset_error.)

The resulting document vectors will no longer contain counts. Instead, they will contain the individual TF-IDF value of each term.
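As a quick sanity check, here is a usage sketch. The two example posts are invented for illustration, and english_stemmer is assumed to be the NLTK SnowballStemmer set up earlier in the chapter:

>>> import nltk.stem
>>> english_stemmer = nltk.stem.SnowballStemmer('english')
>>> posts = ["imaging databases", "imaging databases store images"]
>>> X = vectorizer.fit_transform(posts)
>>> X.shape
(2, 3)

Each row of X is the vector of one post. Because 'imaging' and 'images' both stem to imag, they end up in the same dimension, which leaves only three features (databas, imag, and store), each now weighted by TF-IDF instead of a raw count.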

Our achievements and goals

Our current text preprocessing phase includes the following steps:

1. Tokenizing the text.
2. Throwing away words that occur way too often to be of any help in detecting relevant posts.
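Both steps are easy to verify on the vectorizer's analyzer. A quick sketch, with an invented sample sentence (note that our overridden build_analyzer returns a generator, so we materialize it with list()):

>>> analyze = vectorizer.build_analyzer()
>>> list(analyze("The images of imaging databases"))
['imag', 'imag', 'databas']

The sentence is lowercased and tokenized, the stop words "the" and "of" are dropped, and the surviving words are stemmed.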

