08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Topic Modeling

In the previous chapter we clustered texts into groups. This is a very useful tool, but it is not always appropriate. Clustering results in each text belonging to exactly one cluster. This book is about machine learning and Python. Should it be grouped with other Python-related works or with machine learning-related works? In the paper book age, a bookstore would need to make this decision when deciding where to stock it. In the Internet store age, however, the answer is that this book is about both machine learning and Python, and it can be listed in both sections. We will, however, not list it in the food section.

In this chapter, we will learn methods that do not assign each object to a single cluster, but instead associate it with a small number of groups called topics. We will also learn how to distinguish between topics that are central to the text and others that are only vaguely mentioned (this book mentions plotting every so often, but it is not a central topic in the way machine learning is). The subfield of machine learning that deals with these problems is called topic modeling.
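The difference between a single cluster label and a set of weighted topics can be sketched in a few lines of Python. The topic names, the weights, and the two thresholds below are hypothetical values chosen only for illustration; they are not the output of any real model, and the helper functions are not part of any library.

```python
# Hypothetical topic weights for one document (illustrative numbers only,
# not produced by a real topic model). They are chosen to sum to 1.
topic_weights = {
    "machine learning": 0.55,
    "python": 0.35,
    "plotting": 0.08,
    "food": 0.02,
}


def hard_cluster(weights):
    """Clustering-style answer: exactly one group per document."""
    return max(weights, key=weights.get)


def split_topics(weights, central=0.25, mentioned=0.05):
    """Topic-style answer: several groups, separated by how central they are.

    Both thresholds are arbitrary choices made for this sketch.
    """
    central_topics = [t for t, w in weights.items() if w >= central]
    minor_topics = [t for t, w in weights.items() if mentioned <= w < central]
    return central_topics, minor_topics


cluster = hard_cluster(topic_weights)
central_topics, minor_topics = split_topics(topic_weights)

print("single cluster:   ", cluster)
print("central topics:   ", central_topics)
print("vaguely mentioned:", minor_topics)
```

With these made-up weights, the clustering view keeps only one label, while the topic view keeps both main subjects and still records plotting as a minor theme.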

Latent Dirichlet allocation (LDA)<br />

LDA and LDA: unfortunately, there are two methods in machine learning with the initials LDA: latent Dirichlet allocation, which is a topic modeling method; and linear discriminant analysis, which is a classification method. They are completely unrelated, except for the fact that the initials LDA can refer to either, which can be confusing. Scikit-learn has a submodule, sklearn.lda, which implements linear discriminant analysis. At the moment, scikit-learn does not implement latent Dirichlet allocation.

The simplest topic model (on which all others are based) is latent Dirichlet allocation (LDA). The mathematical ideas behind LDA are fairly complex, and we will not go into the details here.
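Without going into the mathematics, the intuition behind latent Dirichlet allocation can be illustrated by how it is usually described as a generative model: each document draws its own mixture of topics, and each word is produced by first picking a topic from that mixture and then picking a word from that topic. The following is only a rough sketch of that story under simplifying assumptions: the two toy topics, their word probabilities, and the function names are invented for this example, and the topics are fixed by hand rather than drawn from a Dirichlet prior or learned from data as a real model would do.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one toy document following the LDA-style generative story.

    topics: list of dicts mapping word -> probability (one dict per topic)
    alpha:  Dirichlet concentration parameters, one per topic
    """
    # Per-document topic mixture: how much of each topic this document has.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the document's mixture.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word according to that topic's word distribution.
        vocab = list(topics[k])
        probs = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Two hand-written toy topics (hypothetical vocabulary and probabilities).
topics = [
    {"model": 0.5, "training": 0.3, "classifier": 0.2},
    {"python": 0.6, "numpy": 0.3, "import": 0.1},
]

rng = random.Random(0)  # fixed seed so repeated runs give the same document
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=10, rng=rng)

print("topic mixture:", [round(t, 3) for t in theta])
print("document:     ", " ".join(words))
```

Fitting a topic model amounts to running this story in reverse: given only the documents, it estimates the topics and each document's mixture.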
