Building Machine Learning Systems with Python - Richert, Coelho
Clustering – Finding Related Posts
This is only after tokenization, lowercasing, and stop word removal. If we also subtract those words that will later be filtered out via min_df and max_df, which happens in fit_transform, it gets even worse:
>>> list(set(analyzer(z[5][1])).intersection(
...     vectorizer.get_feature_names()))
[u'cs', u'faq', u'thank']
>>> list(set(analyzer(z[6][1])).intersection(
...     vectorizer.get_feature_names()))
[u'bh', u'thank']
Furthermore, most of the words occur frequently in other posts as well, as we can check with the IDF scores. Remember that the higher the TF-IDF, the more discriminative a term is for a given post. As IDF is a multiplicative factor here, a low IDF value signals that the term is not of great value in general:
>>> for term in ['cs', 'faq', 'thank', 'bh', 'thank']:
...     print('IDF(%s)=%.2f' % (term,
...         vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
IDF(cs)=3.23
IDF(faq)=4.17
IDF(thank)=2.23
IDF(bh)=6.57
IDF(thank)=2.23
So, except for bh, which is close to the maximum overall IDF value of 6.74, the terms don't have much discriminative power. Understandably, posts from different newsgroups will be clustered together.
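The same inspection works on a toy corpus, assuming a recent scikit-learn where the fitted vectorizer exposes `idf_` directly instead of the private `_tfidf.idf_` used above; the posts here are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy posts; 'bh' appears in only one document,
# while 'thank' and 'faq' each appear in two.
corpus = [
    "thank you thank you",
    "cs faq thank",
    "bh rarely appears",
    "faq again",
]
vec = TfidfVectorizer()
vec.fit(corpus)

for term in ["thank", "faq", "bh"]:
    # idf_ is indexed by the term's position in the vocabulary
    idf = vec.idf_[vec.vocabulary_[term]]
    print("IDF(%s)=%.2f" % (term, idf))
```

The term appearing in the fewest documents receives the highest IDF, matching the pattern seen with bh above.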
For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup our training data came from is of no special interest.
Tweaking the parameters
So what about all the other parameters? Can we tweak them all to get better results? Sure. We could, of course, tweak the number of clusters or play with the vectorizer's max_features parameter (you should try that!). Also, we could play with different cluster center initializations. There are also more exciting alternatives to KMeans itself. There are, for example, clustering approaches that also let you use different similarity measurements, such as cosine similarity, Pearson, or Jaccard. An exciting field for you to play in.
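One cheap way to get KMeans to behave like a cosine-similarity clusterer relies on the fact that TfidfVectorizer's default norm="l2" produces unit-length rows, and for unit vectors Euclidean distance is a monotone function of cosine similarity. A minimal sketch on a hypothetical corpus (the posts and cluster count are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Two hypothetical disk-related posts and two graphics-related posts.
posts = [
    "disk drive problems with my hard disk",
    "hard disk failure, drive not spinning",
    "graphics card drivers for opengl",
    "opengl rendering and graphics performance",
]

# norm="l2" (the default) makes every row a unit vector, so
# Euclidean KMeans on X effectively clusters by cosine similarity:
# ||a - b||^2 = 2 - 2*cos(a, b) for unit vectors a, b.
X = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(posts)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

With this toy data, the two disk posts end up in one cluster and the two graphics posts in the other, because their TF-IDF vectors point in similar directions regardless of post length.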