08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 4

If you are going to explore the topics yourself or build a visualization tool, you should probably try a few values and see which gives you the most useful or most appealing results.

However, there are a few methods that will automatically determine the number of topics for you, depending on the dataset. One popular model is called the hierarchical Dirichlet process. Again, the full mathematical model behind it is complex and beyond the scope of this book, but the fable we can tell is this: instead of the topics being fixed a priori, with our task being to reverse engineer the data to get them back, the topics themselves were generated along with the data. Whenever the writer was going to start a new document, he had the option of using the topics that already existed or creating a completely new one.

This means that the more documents we have, the more topics we will end up with. This is one of those statements that is unintuitive at first, but makes perfect sense upon reflection. We are learning topics, and the more examples we have, the more finely we can break them up. If we only have a few examples of news articles, then sports will be a single topic. However, as we have more, we start to break it up into individual sports such as hockey, soccer, and so on. As we have even more data, we can start to tell nuances apart: articles about individual teams and even individual players. The same is true for people. In a group of many different backgrounds, with a few "computer people", you might put them together; in a slightly larger group, you would have separate gatherings for programmers and systems managers. In the real world, we even have different gatherings for Python and Ruby programmers.
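To make the fable concrete, here is a minimal sketch of a toy simulation. It is not the hierarchical Dirichlet process itself, only a simplified "Chinese restaurant process" in which each new document either reuses an existing topic or opens a new one. The function name `simulate_topic_growth` and the parameter `alpha` (how eager the writer is to start a new topic) are our own choices for illustration, not part of gensim.

```python
import random


def simulate_topic_growth(n_documents, alpha=1.0, seed=0):
    """Toy Chinese-restaurant-process simulation (not the full HDP).

    Each new document opens a new topic with probability proportional
    to alpha, or reuses an existing topic with probability proportional
    to the number of documents already assigned to it.
    Returns a list with the number of documents in each topic.
    """
    rng = random.Random(seed)
    topic_counts = []
    for n_seen in range(n_documents):
        # The first document always opens a topic (the ratio is 1);
        # afterwards, new topics become rarer as documents accumulate.
        if rng.random() < alpha / (n_seen + alpha):
            topic_counts.append(1)
        else:
            # Popular topics are more likely to be reused.
            chosen = rng.choices(range(len(topic_counts)),
                                 weights=topic_counts)[0]
            topic_counts[chosen] += 1
    return topic_counts


for n in (10, 100, 1000, 10000):
    counts = simulate_topic_growth(n)
    print(n, "documents ->", len(counts), "topics")
```

Because the seed is fixed, a longer run is an extension of a shorter one, so the number of topics can only stay the same or grow as documents are added. In this toy model the growth is slow: new topics keep appearing, but ever more rarely.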

One of the methods for automatically determining the number of topics is called the hierarchical Dirichlet process (HDP), and it is available in gensim. Using it is trivial. Taking the previous code for LDA, we just need to replace the call to gensim.models.ldamodel.LdaModel with a call to the HdpModel constructor as follows:

>>> hdp = gensim.models.hdpmodel.HdpModel(mm, id2word)<br />

That's it (except that it takes a bit longer to compute; there are no free lunches). Now, we can use this model in much the same way as we used the LDA model, except that we did not need to specify the number of topics.

Summary

In this chapter, we discussed a more advanced form of grouping documents, which is more flexible than simple clustering because we allow each document to be present in more than one group. We explored the basic LDA model using a new package, gensim, and were able to integrate it easily into the standard Python scientific ecosystem.
