08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Topic Modeling

Although daunting at first glance, we can clearly see that the topics are not just random words, but are connected. We can also see that these topics refer to older news items, from when the Soviet Union still existed and Gorbachev was its General Secretary. We can also represent the topics as word clouds, making more likely words larger. For example, this is the visualization of a topic that deals with the Middle East and politics:

We can also see that some of the words should perhaps be removed (for example, the word I), as they carry little information; such words are called stop words. In topic modeling, it is important to filter out stop words, as otherwise you might end up with a topic consisting entirely of stop words, which is not very informative. We may also wish to reduce the words to their stems in order to normalize plurals and verb forms. This process was covered in the previous chapter, and you can refer to it for details. If you are interested, you can download the code from the companion website of the book and try all these variations to draw different pictures.
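As a minimal sketch of the filtering step described above, the following uses only the standard library. The stop list, the `preprocess` function, and the toy `naive_stem` helper are all assumptions made for illustration; they are not the book's code, and in practice you would plug in a full stop list and a real stemmer (for example, the one used in the previous chapter).

```python
import re

# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = frozenset(
    {"i", "a", "an", "the", "and", "of", "to", "in", "is", "it", "that", "was"}
)


def naive_stem(token):
    """Toy plural stripper, for illustration only; not a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text, stop_words=STOP_WORDS, stem=None):
    """Lowercase, tokenize, drop stop words, and optionally stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    if stem is not None:
        # Any callable mapping a token to its stem can be passed in here.
        kept = [stem(t) for t in kept]
    return kept


print(preprocess("I think the talks in the Middle East", stem=naive_stem))
```

Passing the stemmer as a parameter keeps the filtering logic independent of any particular stemming library, so the same function can be rerun with different variations before the topic model is rebuilt.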

Building a word cloud like the one in the previous screenshot can be done with several different pieces of software. For the previous graphic, I used the online tool Wordle (http://www.wordle.net), which generates particularly attractive images. Since I only had a few examples, I copied and pasted the list of words manually, but it is possible to use it as a web service and call it directly from Python.
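Whatever tool draws the cloud, the core idea of "more likely words are larger" can be sketched in a few lines. The function below is a hypothetical helper, not part of Wordle or of the book's code: it assumes a topic is given as a list of `(word, weight)` pairs and maps each weight linearly onto a font-size range. The size limits and the example weights are made up for illustration.

```python
def font_sizes(word_weights, min_size=10, max_size=48):
    """Map each word's weight to a font size by linear interpolation."""
    if not word_weights:
        return {}
    lo = min(weight for _, weight in word_weights)
    hi = max(weight for _, weight in word_weights)
    span = hi - lo
    sizes = {}
    for word, weight in word_weights:
        # If all weights are equal, draw every word at the maximum size.
        frac = (weight - lo) / span if span > 0 else 1.0
        sizes[word] = round(min_size + frac * (max_size - min_size))
    return sizes


# Made-up weights, standing in for the word probabilities of one topic.
topic = [("peace", 3.0), ("talks", 2.0), ("border", 1.0)]
print(font_sizes(topic))
```

The resulting mapping could then be handed to any drawing or layout tool; a linear scale is only one choice, and a logarithmic scale may look better when a few words dominate the topic.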

Comparing similarity in topic space<br />

Topics can be useful on their own for building small word vignettes like the one shown in the previous screenshot. These visualizations could be used to navigate a large collection of documents and, in fact, they have been used in just this way.
