
Building Machine Learning Systems with Python - Richert, Coelho


Clustering – Finding Related Posts

This is only after tokenization, lowercasing, and stop word removal. If we also subtract those words that will be filtered out later via min_df and max_df, which happens in fit_transform, it gets even worse:

>>> list(set(analyzer(z[5][1])).intersection(
...     vectorizer.get_feature_names()))
[u'cs', u'faq', u'thank']
>>> list(set(analyzer(z[6][1])).intersection(
...     vectorizer.get_feature_names()))
[u'bh', u'thank']

Furthermore, most of the words occur frequently in other posts as well, as we can check with the IDF scores. Remember that the higher the TF-IDF, the more discriminative a term is for a given post. And as IDF is a multiplicative factor here, a low IDF value signals that the term is not of great value in general:

>>> for term in ['cs', 'faq', 'thank', 'bh', 'thank']:
...     print('IDF(%s)=%.2f' % (term,
...         vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))

IDF(cs)=3.23
IDF(faq)=4.17
IDF(thank)=2.23
IDF(bh)=6.57
IDF(thank)=2.23
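In case you wonder where these numbers come from, here is a sketch of the weighting that scikit-learn's TfidfVectorizer applies with its default smooth_idf=True (the exact formula may differ slightly between versions, so treat this as an assumption, not the vectorizer's guaranteed behavior):

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF as used by TfidfVectorizer(smooth_idf=True):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# A term that appears in every document gets the minimum IDF of 1.0 ...
print(smoothed_idf(3414, 3414))

# ... while a rare term gets a much higher IDF, which is exactly the
# "discriminative power" we are reading off the scores above.
print(smoothed_idf(3414, 11) > smoothed_idf(3414, 1000))
```

The document counts passed in here are purely illustrative; the real values depend on the newsgroup subset loaded earlier in the chapter.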

So, except for bh, which is close to the maximum overall IDF value of 6.74, the terms don't have much discriminative power. Understandably, posts from different newsgroups will be clustered together.

For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup our training data came from is of no special interest.

Tweaking the parameters

So what about all the other parameters? Can we tweak them all to get better results? Sure. We could, of course, tweak the number of clusters or play with the vectorizer's max_features parameter (you should try that!). Also, we could experiment with different cluster center initializations. There are also more exciting alternatives to KMeans itself. There are, for example, clustering approaches that let you use different similarity measurements, such as cosine similarity, Pearson correlation, or the Jaccard coefficient. An exciting field for you to play in.
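One cheap trick worth knowing before reaching for a different clustering algorithm: KMeans minimizes Euclidean distance, but if you L2-normalize the TF-IDF vectors first (for example with sklearn.preprocessing.normalize), Euclidean distance becomes a monotonic function of cosine similarity, so plain KMeans then effectively clusters by cosine. A minimal NumPy sketch of the underlying identity, ||a - b||² = 2·(1 - cos(a, b)) for unit vectors:

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(10)
b = rng.rand(10)

# L2-normalize both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = a.dot(b)
sq_euclidean = np.sum((a - b) ** 2)

# For unit vectors, minimizing squared Euclidean distance (what KMeans
# does) is the same as maximizing cosine similarity.
print(np.allclose(sq_euclidean, 2 * (1 - cos_sim)))  # -> True
```

This is only a sketch of the identity; whether normalized KMeans is good enough for your data, compared to a clusterer with native cosine support, is something to measure.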

