Chapter 4
If you are going to explore the topics yourself or build a visualization tool, you
should probably try a few values for the number of topics and see which gives you
the most useful or most appealing results.
However, there are a few methods that will automatically determine the number of
topics for you depending on the dataset. One popular model is called the hierarchical
Dirichlet process. Again, the full mathematical model behind it is complex and beyond
the scope of this book, but the fable we can tell is this: instead of the topics being
fixed a priori and our task being to reverse engineer the data to get them back, the
topics themselves were generated along with the data. Whenever the writer was going
to start a new document, they had the option of using the topics that already existed
or creating a completely new one.
This means that the more documents we have, the more topics we will end up with.
This is one of those statements that is unintuitive at first, but makes perfect sense upon
reflection. We are learning topics, and the more examples we have, the more finely we
can break them up. If we only have a few examples of news articles, then sports will be a
single topic. However, as we have more, we start to break it up into individual sports
such as hockey, soccer, and so on. With even more data, we can start to tell
nuances apart: articles about individual teams and even individual players. The same
is true for people. In a group of people from many different backgrounds, with a few
"computer people", you might put them together; in a slightly larger group, you would
have separate gatherings for programmers and systems managers. In the real world, we
even have different gatherings for Python and Ruby programmers.
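The growth of topics with data can be illustrated with a small simulation. The following is only a sketch of the fable, not the full hierarchical Dirichlet process and not what gensim computes: it assumes a simplified Chinese restaurant process in which each new document is assigned to exactly one topic, chosen in proportion to how many documents already use it, or to a brand-new topic with weight alpha. The function name simulate_topic_growth and the parameter alpha are made up for this illustration.

```python
import random


def simulate_topic_growth(num_documents, alpha=1.0, seed=0):
    """Return the list of topic sizes after assigning num_documents documents.

    Simplified Chinese restaurant process (an assumption for illustration):
    after n documents, the next one joins existing topic k with probability
    size_k / (n + alpha) and opens a new topic with probability
    alpha / (n + alpha).
    """
    rng = random.Random(seed)
    topic_sizes = []
    for n in range(num_documents):
        # Existing topics hold n documents in total; alpha is the extra
        # weight reserved for "create a completely new topic".
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for k, size in enumerate(topic_sizes):
            cumulative += size
            if r < cumulative:
                topic_sizes[k] += 1  # reuse a topic that already exists
                break
        else:
            topic_sizes.append(1)  # no existing topic chosen: open a new one
    return topic_sizes


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, "documents ->", len(simulate_topic_growth(n)), "topics")
```

With a fixed seed, a longer run only extends a shorter one, so the number of topics never shrinks as documents are added. Larger values of alpha make new topics more likely.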
The hierarchical Dirichlet process (HDP) is available in gensim, and using it is
trivial. Taking the previous code for LDA, we just need to replace the call to
gensim.models.ldamodel.LdaModel with a call to the HdpModel constructor as follows:
>>> hdp = gensim.models.hdpmodel.HdpModel(mm, id2word)<br />
That's it (except that it takes a bit longer to compute; there are no free lunches). Now,
we can use this model in much the same way as we used the LDA model, except that we
did not need to specify the number of topics.
Summary<br />
In this chapter, we discussed a more advanced form of grouping documents, which is<br />
more flexible than simple clustering as we allow each document to be present in more<br />
than one group. We explored the basic LDA model using a new package, gensim, but<br />
were able to integrate it easily into the standard Python scientific ecosystem.<br />