08.10.2016 Views

Foundations of Data Science


that make up the document. So, $a_{ij}$, the frequency of the $i$th term in the $j$th document, is the sum over all topics $l$ of the fraction of document $j$ which is on topic $l$ times the frequency of term $i$ in topic $l$. In matrix notation, $A = BC$, that is, $a_{ij} = \sum_l b_{il} c_{lj}$.

Pictorially, we can represent this as:

\[
\underset{\substack{n \times m \\ \text{term} \times \text{document}}}{A}
\;=\;
\underset{\substack{n \times r \\ \text{term} \times \text{topic}}}{B}
\;
\underset{\substack{r \times m \\ \text{topic} \times \text{document}}}{C}
\]
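As a small illustration of the product above, the following sketch builds a toy example in NumPy. The matrices are hypothetical values chosen for illustration (4 terms, 2 topics, 3 documents), not data from the text; the only assumption is that each column of B and each column of C sums to one.

```python
import numpy as np

# Hypothetical toy data: 4 terms, 2 topics, 3 documents.
# Column l of B holds the term frequencies within topic l (sums to one).
B = np.array([
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
])
# Column j of C holds the fraction of document j on each topic (sums to one).
C = np.array([
    [1.0, 0.5, 0.2],
    [0.0, 0.5, 0.8],
])

# Matrix form of the model: A is terms x documents.
A = B @ C

# Entry-wise form for one term/document pair: sum over topics l.
i, j = 2, 1
a_ij = sum(B[i, l] * C[l, j] for l in range(B.shape[1]))

print(A)
print(a_ij, A[i, j])
print(A.sum(axis=0))  # total term frequency per document
```

In this example the matrix product and the explicit sum over topics agree, and every column of A sums to one because the columns of B and C do.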

This model is too simple to be realistic since the frequency of each term in the document is unlikely to be exactly what is given by the equation $A = BC$. So, a more sophisticated stochastic model is used in practice.

From the document collection we observe the $n \times m$ matrix $A$. Can we find $B$ and $C$ such that $A = BC$? The top $r$ singular vectors from a singular value decomposition of $A$ give a factorization $BC$ (exact when $A$ has rank at most $r$). But there are additional constraints stemming from the fact that the frequencies of terms in one particular topic are nonnegative reals summing to one, and from the fact that the fractions of a particular document devoted to each topic are also nonnegative reals summing to one. Altogether the constraints are:

1. $A = BC$ and $\sum_i a_{ij} = 1$ for every document $j$.

2. The entries of $B$ and $C$ are all nonnegative.

3. Each column of $B$ and each column of $C$ sums to one.
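To see why the singular value decomposition alone does not satisfy these constraints, here is a minimal sketch, assuming NumPy and a randomly generated rank-2 matrix (the sizes, the seed, and the names `B_true`, `C_true`, `B_svd`, `C_svd` are hypothetical). It factors A from its top r singular vectors and checks the signs of the factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2  # hypothetical sizes: terms, documents, topics

# Build a nonnegative A = B_true @ C_true with column-stochastic factors.
B_true = rng.random((n, r))
B_true /= B_true.sum(axis=0)
C_true = rng.random((r, m))
C_true /= C_true.sum(axis=0)
A = B_true @ C_true  # rank at most r

# Factor A using its top r singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B_svd = U[:, :r] * s[:r]  # n x r, singular values folded into the left factor
C_svd = Vt[:r, :]         # r x m

reconstructs = np.allclose(B_svd @ C_svd, A)
has_negative = bool((B_svd < 0).any() or (C_svd < 0).any())

print(reconstructs)  # the SVD factors reproduce A
print(has_negative)  # but they violate the nonnegativity constraint
```

The product of the SVD factors reproduces A, yet some entries are negative: singular vectors are mutually orthogonal, so they cannot all be entrywise positive. This is why condition 2 is a genuine extra requirement.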

Given the first two conditions, we can achieve the third by multiplying the $k$th column of $B$ by a positive real number and dividing the $k$th row of $C$ by the same real number without violating $A = BC$. By doing this, one may assume that each column of $B$ sums to one. Since $\sum_i a_{ij}$ is the total frequency of all terms in document $j$, $\sum_i a_{ij} = 1$. Now $a_{ij} = \sum_k b_{ik} c_{kj}$ implies

\[
\sum_i a_{ij} = \sum_{i,k} b_{ik} c_{kj} = \sum_k \Bigl( \sum_i b_{ik} \Bigr) c_{kj} = \sum_k c_{kj} = 1 .
\]

Thus, the columns of $C$ also sum to one.
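The rescaling argument can be checked numerically. The sketch below assumes NumPy and uses hypothetical nonnegative factors whose columns do not sum to one, chosen so that each column of A = BC sums to one (condition 1). It then rescales the columns of B and the matching rows of C.

```python
import numpy as np

# Hypothetical nonnegative factors: 3 terms, 2 topics, 3 documents.
B = np.array([
    [2.0, 0.0],
    [1.0, 0.5],
    [1.0, 1.5],
])  # column sums are 4 and 2, not 1
C = np.array([
    [0.25, 0.1, 0.0],
    [0.0, 0.3, 0.5],
])
A = B @ C  # each column of A sums to one by construction

# Divide column k of B by its sum s[k] and multiply row k of C by s[k].
s = B.sum(axis=0)
B_norm = B / s           # broadcasting scales each column of B
C_norm = C * s[:, None]  # broadcasting scales each row of C

print(np.allclose(B_norm @ C_norm, A))  # the product is unchanged
print(B_norm.sum(axis=0))               # columns of B now sum to one
print(C_norm.sum(axis=0))               # columns of C sum to one as a consequence
```

Only the columns of B are normalized explicitly; the column sums of C come out equal to one because the columns of A sum to one, as in the derivation above.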

The problem can be posed as one of factoring the given matrix $A$ into the product of two matrices with nonnegative entries, a problem called nonnegative matrix factorization.
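The text does not specify an algorithm here. As one possible illustration, the sketch below uses the well-known multiplicative update heuristic for the squared Frobenius error; this is a swapped-in technique, not necessarily the method the text goes on to develop. The function name `nmf_multiplicative`, the iteration count, the seed, and the toy data are all assumptions.

```python
import numpy as np

def nmf_multiplicative(A, r, iters=500, eps=1e-12, seed=0):
    """Heuristic NMF sketch: multiplicative updates for ||A - BC||_F."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # Strictly positive random start; the updates keep entries nonnegative.
    B = rng.random((n, r)) + 0.1
    C = rng.random((r, m)) + 0.1
    for _ in range(iters):
        # eps guards against division by zero.
        C *= (B.T @ A) / (B.T @ B @ C + eps)
        B *= (A @ C.T) / (B @ C @ C.T + eps)
    return B, C

# Hypothetical test matrix with an exact nonnegative rank-2 factorization.
rng = np.random.default_rng(1)
B_true = rng.random((6, 2))
B_true /= B_true.sum(axis=0)
C_true = rng.random((2, 5))
C_true /= C_true.sum(axis=0)
A = B_true @ C_true

B0, C0 = nmf_multiplicative(A, r=2, iters=0)  # the random starting point
B, C = nmf_multiplicative(A, r=2)             # same start, after the updates

err_init = np.linalg.norm(A - B0 @ C0)
err_final = np.linalg.norm(A - B @ C)
print(err_init, err_final)
```

The updates only multiply by nonnegative ratios, so B and C stay nonnegative throughout. The heuristic may stop at a local optimum, and the recovered factors need not equal `B_true` and `C_true`, since such factorizations are generally not unique.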
