
Chapter 4

If you are going to explore the topics yourself or build a visualization tool, you should probably try a few values and see which gives you the most useful or most appealing results.

However, there are a few methods that will automatically determine the number of topics for you from the dataset. One popular model is called the hierarchical Dirichlet process. Again, the full mathematical model behind it is complex and beyond the scope of this book, but the fable we can tell is this: instead of the topics being fixed a priori and our task being to reverse engineer the data to get them back, the topics themselves were generated along with the data. Whenever the writer was going to start a new document, he had the option of using the topics that already existed or creating a completely new one.

This means that the more documents we have, the more topics we will end up with. This is one of those statements that is unintuitive at first, but makes perfect sense upon reflection. We are learning topics, and the more examples we have, the more finely we can break them up. If we only have a few examples of news articles, then sports will be a single topic. However, as we have more, we start to break it up into individual sports such as hockey, soccer, and so on. As we have even more data, we can start to tell nuances apart: articles about individual teams and even individual players. The same is true for people. In a group of many different backgrounds, with a few "computer people", you might put them together; in a slightly larger group, you would have separate gatherings for programmers and systems managers. In the real world, we even have different gatherings for Python and Ruby programmers.
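
The claim that topics multiply as documents accumulate can be illustrated with a small simulation. What follows is a minimal sketch, not the full HDP model and not what gensim does internally: it uses the simpler Chinese restaurant process, in which each new document either joins an existing topic (with probability proportional to that topic's size) or opens a new one (with probability proportional to a concentration parameter). The function name `count_topics` and the parameters `alpha` and `seed` are assumptions made for this illustration, and the sketch assigns each document to a single topic, whereas a real topic model lets a document mix several.

```python
import random


def count_topics(num_documents, alpha=1.0, seed=0):
    """Simulate a Chinese restaurant process.

    Returns a list whose entry n-1 is the number of distinct topics
    in use after the first n documents have been written.
    """
    rng = random.Random(seed)
    topic_sizes = []  # topic_sizes[k] = documents assigned to topic k so far
    history = []
    for n in range(num_documents):
        # With probability alpha / (n + alpha) the writer invents a new topic.
        # For the very first document (n == 0) this probability is 1.
        if rng.random() < alpha / (n + alpha):
            topic_sizes.append(1)
        else:
            # Otherwise reuse an existing topic, favouring popular ones.
            k = rng.choices(range(len(topic_sizes)), weights=topic_sizes)[0]
            topic_sizes[k] += 1
        history.append(len(topic_sizes))
    return history


history = count_topics(10_000)
for n in (10, 100, 1_000, 10_000):
    print(n, "documents ->", history[n - 1], "topics")
```

Under this process the number of topics never decreases, and its expected value grows roughly like `alpha * log(n)`: slowly, but without bound, which matches the intuition that more data lets us make finer distinctions.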

The hierarchical Dirichlet process (HDP) is available in gensim, and using it is trivial. Taking the previous code for LDA, we just need to replace the call to gensim.models.ldamodel.LdaModel with a call to the HdpModel constructor as follows:

>>> hdp = gensim.models.hdpmodel.HdpModel(mm, id2word)<br />

That's it (except that it takes a bit longer to compute; there are no free lunches). Now, we can use this model in much the same way as we used the LDA model, except that we did not need to specify the number of topics.

Summary

In this chapter, we discussed a more advanced form of grouping documents, which is more flexible than simple clustering as we allow each document to be present in more than one group. We explored the basic LDA model using a new package, gensim, but were able to integrate it easily into the standard Python scientific ecosystem.

