01.04.2015 Views

1FfUrl0

Topic Modeling

In the previous chapter, we clustered texts into groups. This is a very useful tool, but
it is not always appropriate. Clustering results in each text belonging to exactly one
cluster. This book is about machine learning and Python. Should it be grouped with
other Python-related works or with machine learning-related works? In the paper book age,
a bookstore would need to make this decision when deciding where to stock it. In
the Internet store age, however, the answer is that this book is about both machine
learning and Python, and it can be listed in both sections. We will, however,
not list it in the food section.

In this chapter, we will learn methods that do not assign each object to a single cluster,
but instead associate it with a small number of groups called topics. We will also learn
how to differentiate between topics that are central to the text and others that are only
vaguely mentioned (this book mentions plotting every so often, but it is not a central
topic in the way that machine learning is). The subfield of machine learning that deals
with these problems is called topic modeling.
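
To make the contrast concrete, here is a minimal sketch of the difference between a
clustering-style assignment and a topic-style description of one document. The topic
names, the weights, the 0.2 threshold, and the helper functions `hard_cluster` and
`central_topics` are all hypothetical and chosen only for illustration; they are not the
output of any topic model.

```python
# Hypothetical topic weights for this book (illustrative numbers, not model output).
# Each weight is the share of the document attributed to that topic; shares sum to 1.
book_topics = {
    "machine learning": 0.55,
    "python": 0.35,
    "plotting": 0.08,
    "food": 0.02,
}


def hard_cluster(topic_weights):
    """Clustering-style view: keep only the single heaviest topic."""
    return max(topic_weights, key=topic_weights.get)


def central_topics(topic_weights, threshold=0.2):
    """Topic-style view: keep every topic whose weight reaches the threshold."""
    return sorted(t for t, w in topic_weights.items() if w >= threshold)


print(hard_cluster(book_topics))    # one label only
print(central_topics(book_topics))  # several labels, minor topics filtered out
```

Under these assumed weights, the clustering-style view files the book under a single
heading, while the topic-style view lists it under both machine learning and Python and
still leaves out plotting and food.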

Latent Dirichlet allocation (LDA)<br />

LDA and LDA: unfortunately, there are two methods in machine learning with
the initials LDA: latent Dirichlet allocation, which is a topic modeling method; and
linear discriminant analysis, which is a classification method. They are completely
unrelated, except for the fact that the initials LDA can refer to either, which
can be confusing. At the time of writing, scikit-learn had a submodule, sklearn.lda,
which implemented linear discriminant analysis, and it did not implement latent
Dirichlet allocation. (Later scikit-learn releases moved linear discriminant analysis
to sklearn.discriminant_analysis and added latent Dirichlet allocation as
sklearn.decomposition.LatentDirichletAllocation.)

The simplest topic model (on which all others are based) is latent Dirichlet
allocation (LDA). The mathematical ideas behind LDA are fairly complex,
and we will not go into the details here.