
Topic Modeling

Sparsity means that while you may have large matrices and vectors, in principle, most of the values are zero (or so small that we can round them to zero as a good approximation). Therefore, only a few things are relevant at any given time.

Often, problems that seem too big to solve are actually feasible because the data is sparse. For example, even though any webpage can link to any other webpage, the graph of links is actually very sparse, as each webpage links to only a tiny fraction of all other webpages.
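As a small illustration, here is a minimal sketch (the toy documents are hypothetical, not from the book's dataset) of how gensim exploits this sparsity: a bag-of-words document is stored as a list of (token_id, count) pairs for the words that actually occur, and all the zeros are left implicit.

from gensim import corpora

# Hypothetical toy documents; gensim only stores the non-zero counts
documents = [["machine", "learning", "python"],
             ["python", "topic", "modeling", "python"]]
dictionary = corpora.Dictionary(documents)

# doc2bow returns a sparse list of (token_id, count) pairs; any word
# absent from a document is an implicit zero, never materialized
corpus = [dictionary.doc2bow(doc) for doc in documents]
print(corpus[1])  # for example, [(2, 2), (3, 1), (4, 1)]

Even with a vocabulary of millions of words, each document vector only costs as much memory as it has distinct words.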

In the previous graph, we can see that about 150 documents have 5 topics, while the majority deal with around 10 to 12 of them. No document talks about more than 20 topics.

To a large extent, this is a function of the parameters used, namely the alpha parameter. The exact meaning of alpha is a bit abstract, but bigger values for alpha will result in more topics per document. Alpha needs to be positive, but it is typically very small, usually smaller than one. By default, gensim will set alpha equal to 1.0/len(corpus), but you can set it yourself as follows:

>>> from gensim import models
>>> # override the default alpha (1.0/len(corpus)) with a larger value
>>> model = models.ldamodel.LdaModel(
...     corpus,
...     num_topics=100,
...     id2word=corpus.id2word,
...     alpha=1)

This is a larger alpha than the default, which should lead to more topics per document. We could also use a smaller value. As we can see in the combined histogram given next, gensim behaves as we expected:
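For reference, here is a minimal sketch of how such a histogram can be computed (the plotting choices and variable names are assumptions, not the book's exact code): querying the model with a document returns only the topics whose weight is above gensim's minimum-probability cutoff, so the length of that list is the number of topics the document uses.

import matplotlib.pyplot as plt

# model[doc] returns a sparse list of (topic_id, weight) pairs,
# keeping only topics above the minimum-probability threshold
num_topics_used = [len(model[doc]) for doc in corpus]

plt.hist(num_topics_used, bins=range(25))
plt.xlabel('Number of topics per document')
plt.ylabel('Number of documents')
plt.show()

Running the same loop for both models (the default alpha and alpha=1) and overlaying the two histograms gives the combined plot.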
