08.10.2016 Views

Foundations of Data Science

2dLYwbK

2dLYwbK

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

So from the theorem E(||AA T − CR|| 2 F ) ≤ ||AAT || 2 F provided<br />

s ≥ (σ2 1 + σ2 2 + . . .) 2<br />

.<br />

σ1 4 + σ2 4 + . . .<br />

If rank(A) = r, then there are r non-zero σ t and the best general upper bound on the<br />

ratio (σ2 1 +σ2 2 +...)2 is r, so in general, s needs to be at least r. If A is full rank, this means<br />

σ1 4+σ4 2<br />

sampling will +...<br />

not gain us anything over taking the whole matrix!<br />

then,<br />

However, if there is a constant c and a small integer p such that<br />

σ 2 1 + σ 2 2 + . . . + σ 2 p ≥ c(σ 2 1 + σ 2 2 + · · · + σ 2 r), (7.5)<br />

(σ 2 1 + σ 2 2 + . . .) 2<br />

σ 4 1 + σ 4 2 + . . .<br />

≤ c 2 (σ2 1 + σ 2 2 + . . . + σ 2 p) 2<br />

σ 4 1 + σ 4 2 + . . . + σ 2 p<br />

≤ c 2 p,<br />

and so s ≥ c 2 p gives us a better estimate than the zero matrix. Increasing s by a factor<br />

decreases the error by the same factor. The condition (7.5) is indeed the hypothesis <strong>of</strong><br />

the subject <strong>of</strong> Principal Component Analysis (PCA) and there are many situations when<br />

the data matrix does satisfy the condition and so sampling algorithms are useful.<br />

7.3.2 Implementing Length Squared Sampling in two passes<br />

Traditional matrix algorithms <strong>of</strong>ten assume that the input matrix is in Random Access<br />

Memory (RAM) and so any particular entry <strong>of</strong> the matrix can be accessed in unit time.<br />

For massive matrices, RAM may be too small to hold the entire matrix, but may be able<br />

to hold and compute with the sampled columns and rows.<br />

Consider a high-level model where the input matrix (or matrices) have to be read from<br />

“external memory” using a “pass”. In one pass, one can read sequentially all entries <strong>of</strong><br />

the matrix and do some “sampling on the fly”.<br />

It is easy to see that two passes suffice to draw a sample <strong>of</strong> columns <strong>of</strong> A according<br />

to length squared probabilities, even if the matrix is not in row-order or column-order<br />

and entries are presented as a linked list. In the first pass, compute the length squared <strong>of</strong><br />

each column and store this information in RAM. The lengths squared can be computed as<br />

running sums. Then, use a random number generator in RAM to determine according to<br />

length squared probability the columns to be sampled. Then, make a second pass picking<br />

the columns to be sampled.<br />

If the matrix is already presented in external memory in column-order? Then, one<br />

pass will do. This is left as an exercise for the reader.<br />

252

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!