Foundations of Data Science


9 Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation

In the chapter on learning and VC dimension, we saw many model-fitting problems. There we were given labeled data and simple classes of functions: half-spaces, support vector machines, etc. The problem was to fit the best model from a class of functions to the data. Model fitting is of course more general, and in this chapter we discuss some useful models. These general models are often computationally infeasible, in the sense that they do not admit provably efficient algorithms. Nevertheless, data often falls into special cases of these models that can be solved efficiently.

9.1 Topic Models

A topic model is a model for representing a large collection of documents. Each document is viewed as a combination of topics, and each topic has a set of word frequencies. For a collection of news articles over a period, the topics may be politics, sports, science, etc. For the topic politics, words like "president" and "election" may have high frequencies, and for the topic sports, words like "batter" and "goal" may have high frequencies. A news item document may be 60% on politics and 40% on sports. The word frequencies in the document will then be a convex combination of the word frequencies for the two topics, politics and sports, with weights 0.6 and 0.4 respectively. We describe this more formally with vectors and matrices.
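The 60/40 example above can be sketched numerically. The sketch below uses a made-up four-word vocabulary and made-up topic frequencies (the numbers are illustrative only, not from the text):

```python
import numpy as np

# Hypothetical word frequencies for two topics over a tiny vocabulary
# ["president", "election", "batter", "goal"]; each topic's
# frequencies sum to 1.
politics = np.array([0.5, 0.5, 0.0, 0.0])
sports   = np.array([0.0, 0.0, 0.5, 0.5])

# A document that is 60% politics and 40% sports has word frequencies
# equal to the convex combination with weights 0.6 and 0.4.
doc = 0.6 * politics + 0.4 * sports

print(doc)        # [0.3 0.3 0.2 0.2]
print(doc.sum())  # 1.0 -- a convex combination of distributions is a distribution
```

Because the weights are nonnegative and sum to one, the resulting vector is again a valid frequency distribution.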

Each document is viewed as a "bag of words". We disregard the order and context in which each term occurs in the document and instead only list the frequency of occurrences of each term. Frequency is the number of occurrences of the term divided by the total number of all terms in the document. Discarding context information may seem wasteful, but this approach works well in practice and is widely used. Each document is an n-dimensional vector, where n is the total number of different terms in all the documents in the collection. Each component of the vector is the frequency of a particular term in the document. Terms are words or phrases. Not all words are chosen as terms; articles, simple verbs, and pronouns like "a", "is", and "it" may be ignored.
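A minimal sketch of this bag-of-words representation, assuming a hypothetical helper `term_frequencies` and a tiny hand-picked stop-word set (neither is from the text):

```python
from collections import Counter

def term_frequencies(text, stop_words=frozenset({"a", "is", "it"})):
    """Bag of words: ignore order and context, keep only normalized counts.

    Frequency of a term = (occurrences of the term) / (total terms kept).
    """
    terms = [w for w in text.lower().split() if w not in stop_words]
    counts = Counter(terms)
    total = len(terms)
    return {t: c / total for t, c in counts.items()}

# "a" and "is" are dropped; "president" makes up both remaining terms.
print(term_frequencies("a president is a president"))  # {'president': 1.0}
```

A real system would also handle punctuation, stemming, and multi-word phrases; this sketch keeps only the normalization step the text describes.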

Represent the collection of documents by an n × m matrix A, called the term-document matrix, with one column per document in the collection. The topic model hypothesizes that there are r topics and each of the m documents is a combination of topics. The number of topics r is usually much smaller than the number of terms n. So corresponding to each document, there is a vector with r components telling us the fraction of the document that is on each of the topics. In the example above, this vector will have 0.6 in the component for politics and 0.4 in the component for sports. Arrange these vectors as the columns of an r × m matrix C, called the topic-document matrix. There is a third matrix B which is n × r. Each column of B corresponds to a topic; each component of the column gives the frequency of a term in that topic. In the simplest model, the term frequencies in documents are exact combinations of the term frequencies in the various topics.
