08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho



Topic Modeling

Alternatively, we can look at the least talked about topic:

>>> words = model.show_topic(counts.argmin(), 64)

The least talked about topic is the former French colonies in Central Africa. Just 1.5 percent of documents touch upon it, and it represents 0.08 percent of the words. Had we performed this exercise on the French Wikipedia, we would probably have obtained a very different result.
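The idea of counting how many documents touch each topic, and then picking the rarest one, can be sketched in plain Python. This is only an illustration under stated assumptions: `doc_topics` is toy data shaped like a list of `(topic_id, weight)` pairs per document, the helper names `topic_document_counts` and `least_used_topic` are hypothetical, and the way `counts` was actually built earlier in the chapter may differ.

```python
def topic_document_counts(doc_topics, num_topics):
    """Count, for each topic, how many documents mention it at all."""
    counts = [0] * num_topics
    for topics in doc_topics:
        for topic_id, _weight in topics:
            counts[topic_id] += 1
    return counts


def least_used_topic(counts):
    """Plain-Python equivalent of counts.argmin() on a NumPy array."""
    return min(range(len(counts)), key=counts.__getitem__)


# Toy data (an assumption, not real model output): four documents,
# three topics, each document listing the topics it touches.
doc_topics = [
    [(0, 0.7), (2, 0.3)],
    [(0, 0.9), (1, 0.1)],
    [(2, 1.0)],
    [(0, 0.5), (2, 0.5)],
]

counts = topic_document_counts(doc_topics, num_topics=3)
rarest = least_used_topic(counts)
share = counts[rarest] / len(doc_topics)  # fraction of documents touching it
print(rarest, counts, share)
```

With a real model, the index returned here is what would be passed to `model.show_topic` to list the words of the rarest topic.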

Choosing the number of topics

So far, we have used a fixed number of topics, which is 100. This was a purely arbitrary number; we could just as well have used 20 or 200 topics. Fortunately, for many users, this number does not really matter. If you are only going to use the topics as an intermediate step, as we did previously, the final behavior of the system is rarely very sensitive to the exact number of topics. This means that as long as you use enough topics, whether you use 100 topics or 200, the recommendations that result from the process will not be very different. One hundred is often a good number (while 20 is too few for a general collection of text documents). The same is true of setting the alpha (α) value. While playing around with it can change the topics, the final results are again robust against this change.
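One way to check this robustness on your own data is to compare the nearest neighbors that two different topic representations produce for the same documents. The following is a minimal sketch, not the method used in this chapter: the function names are hypothetical, cosine distance is an assumed choice of metric, and the two small sets of vectors are hand-written toy values standing in for models with different numbers of topics.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbor(vectors, i):
    """Index of the document closest to document i (excluding i itself)."""
    others = (j for j in range(len(vectors)) if j != i)
    return min(others, key=lambda j: cosine_distance(vectors[i], vectors[j]))


def neighbor_agreement(vectors_a, vectors_b):
    """Fraction of documents whose nearest neighbor is the same in both spaces."""
    n = len(vectors_a)
    same = sum(
        nearest_neighbor(vectors_a, i) == nearest_neighbor(vectors_b, i)
        for i in range(n)
    )
    return same / n


# Toy topic vectors for the same four documents (assumed values):
# a coarse model with 2 topics and a finer one with 3 topics.
coarse = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
fine = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]

agreement = neighbor_agreement(coarse, fine)
print(agreement)
```

An agreement close to 1 on real data would suggest that the recommendations barely depend on the number of topics; a low value would be a reason to look at that parameter more carefully.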

Topic modeling is often a means to an end. In that case, it is not always important exactly which parameters you choose. Different numbers of topics or values for parameters such as alpha will result in systems whose end results are almost identical.

