08.10.2016 Views

Foundations of Data Science


This can be viewed as the variance of $X$, defined as the sum of the variances of all its entries.

$$\mathrm{Var}(X) = \sum_{i=1}^{m}\sum_{j=1}^{p} \mathrm{Var}(x_{ij}) = \sum_{ij}\left(E\left(x_{ij}^2\right) - E(x_{ij})^2\right) = \sum_{ij}\sum_k p_k \frac{1}{p_k^2}\, a_{ik}^2 b_{kj}^2 - ||AB||_F^2.$$
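The last equality can be spelled out term by term. This is a sketch that assumes, as the formula above and the later reference to (7.1) suggest, that $X$ takes the value $A(:,k)B(k,:)/p_k$ with probability $p_k$:

$$E\left(x_{ij}^2\right) = \sum_k p_k \left(\frac{a_{ik} b_{kj}}{p_k}\right)^2 = \sum_k p_k \frac{1}{p_k^2}\, a_{ik}^2 b_{kj}^2, \qquad E(x_{ij}) = \sum_k p_k \frac{a_{ik} b_{kj}}{p_k} = (AB)_{ij}.$$

Summing $E(x_{ij})^2 = (AB)_{ij}^2$ over all $i, j$ gives $||AB||_F^2$, which is the subtracted term.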

We want to choose $p_k$ to minimize this quantity, and notice that we can ignore the $||AB||_F^2$ term since it doesn't depend on the $p_k$'s at all. We can now simplify by exchanging the order of summations to get

$$\sum_{ij}\sum_k p_k \frac{1}{p_k^2}\, a_{ik}^2 b_{kj}^2 = \sum_k \frac{1}{p_k}\left(\sum_i a_{ik}^2\right)\left(\sum_j b_{kj}^2\right) = \sum_k \frac{1}{p_k}\, |A(:,k)|^2\, |B(k,:)|^2.$$

What is the best choice of $p_k$ to minimize this sum? It can be seen by calculus$^{29}$ that the minimizing $p_k$ are proportional to $|A(:,k)|\,|B(k,:)|$. In the important special case when $B = A^T$, pick columns of $A$ with probabilities proportional to the squared length of the columns. Even in the general case when $B$ is not $A^T$, doing so simplifies the bounds, so we will use it. This sampling is called “length squared sampling”. If $p_k$ is proportional to $|A(:,k)|^2$, i.e., $p_k = \frac{|A(:,k)|^2}{||A||_F^2}$, then

$$E\left(||AB - X||_F^2\right) = \mathrm{Var}(X) \le ||A||_F^2 \sum_k |B(k,:)|^2 = ||A||_F^2\, ||B||_F^2.$$
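The variance formula and the bound can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the matrix sizes, and the random seed are illustrative assumptions, and $X$ is assumed to equal $A(:,k)B(k,:)/p_k$ with probability $p_k$. It computes $\mathrm{Var}(X)$ exactly (no sampling) for three choices of $p_k$: the minimizing one, length squared, and uniform.

```python
import numpy as np

def single_trial_variance(A, B, p):
    """Exact Var(X) = sum_k |A(:,k)|^2 |B(k,:)|^2 / p_k - ||AB||_F^2."""
    col_sq = np.sum(A**2, axis=0)   # |A(:,k)|^2 for each column k
    row_sq = np.sum(B**2, axis=1)   # |B(k,:)|^2 for each row k
    return np.sum(col_sq * row_sq / p) - np.linalg.norm(A @ B, "fro")**2

def expected_sq_error(A, B, p):
    """E ||AB - X||_F^2, computed by enumerating every possible choice of k."""
    AB = A @ B
    return sum(
        p[k] * np.linalg.norm(AB - np.outer(A[:, k], B[k, :]) / p[k], "fro")**2
        for k in range(A.shape[1])
    )

# Illustrative data (assumption): small Gaussian matrices with a fixed seed.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))

col_len = np.linalg.norm(A, axis=0)   # |A(:,k)|
row_len = np.linalg.norm(B, axis=1)   # |B(k,:)|

p_opt = col_len * row_len / np.sum(col_len * row_len)  # proportional to |A(:,k)||B(k,:)|
p_lsq = col_len**2 / np.sum(col_len**2)                # length squared sampling
p_uni = np.full(A.shape[1], 1.0 / A.shape[1])          # uniform, for comparison

var_opt = single_trial_variance(A, B, p_opt)
var_lsq = single_trial_variance(A, B, p_lsq)
var_uni = single_trial_variance(A, B, p_uni)
bound = np.linalg.norm(A, "fro")**2 * np.linalg.norm(B, "fro")**2

print("optimal:", var_opt, "length squared:", var_lsq, "uniform:", var_uni)
print("bound ||A||_F^2 ||B||_F^2:", bound)
```

Under these assumptions the minimizing probabilities should give the smallest variance of the three, and the length squared variance should not exceed the bound.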

To reduce the variance, we can do $s$ independent trials. Each trial $i$, $i = 1, 2, \ldots, s$, yields a matrix $X_i$ as in (7.1). We take $\frac{1}{s}\sum_{i=1}^{s} X_i$ as our estimate of $AB$. Since the variance of a sum of independent random variables is the sum of variances, the variance of $\frac{1}{s}\sum_{i=1}^{s} X_i$ is $\frac{1}{s}\mathrm{Var}(X)$ and so is at most $\frac{1}{s}||A||_F^2\, ||B||_F^2$. Let $k_1, \ldots, k_s$ be the $k$'s chosen in each trial. Expanding this gives:

$$\frac{1}{s}\sum_{i=1}^{s} X_i = \frac{1}{s}\left(\frac{A(:,k_1)\,B(k_1,:)}{p_{k_1}} + \frac{A(:,k_2)\,B(k_2,:)}{p_{k_2}} + \cdots + \frac{A(:,k_s)\,B(k_s,:)}{p_{k_s}}\right). \qquad (7.2)$$
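The averaged estimate in (7.2) can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `sample_product`, the matrix sizes, and the seed are hypothetical, and the probabilities are the length squared probabilities of the columns of $A$.

```python
import numpy as np

def sample_product(A, B, s, rng):
    """Estimate A @ B by averaging s rank-one samples, as in (7.2)."""
    p = np.sum(A**2, axis=0) / np.sum(A**2)    # p_k = |A(:,k)|^2 / ||A||_F^2
    ks = rng.choice(A.shape[1], size=s, p=p)   # k_1, ..., k_s, drawn independently
    est = np.zeros((A.shape[0], B.shape[1]))
    for k in ks:
        est += np.outer(A[:, k], B[k, :]) / p[k]   # A(:,k) B(k,:) / p_k
    return est / s

# Illustrative usage (assumption): Gaussian matrices with a fixed seed.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 40))
B = rng.standard_normal((40, 5))

est = sample_product(A, B, s=50, rng=rng)
err = np.linalg.norm(A @ B - est, "fro")**2
bound = np.linalg.norm(A, "fro")**2 * np.linalg.norm(B, "fro")**2
print("squared error:", err, " bound / s:", bound / 50)
```

A single run can land above or below `bound / s`, since the bound is on the expected squared error; averaged over many runs the error should stay below it.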

We will find it convenient to write this as the product of an $m \times s$ matrix with an $s \times p$ matrix as follows: Let $C$ be the $m \times s$ matrix consisting of the following columns, which are scaled versions of the chosen columns of $A$:

$$\frac{A(:,k_1)}{\sqrt{s\,p_{k_1}}},\ \frac{A(:,k_2)}{\sqrt{s\,p_{k_2}}},\ \ldots,\ \frac{A(:,k_s)}{\sqrt{s\,p_{k_s}}}.$$

Note that the scaling has a nice property (which the reader is asked to verify):<br />

$$E\left(CC^T\right) = AA^T. \qquad (7.3)$$

$^{29}$ By taking derivatives, for any set of nonnegative numbers $c_k$, $\sum_k \frac{c_k}{p_k}$ is minimized with $p_k$ proportional to $\sqrt{c_k}$.
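Property (7.3) can also be checked numerically. The sketch below is illustrative only: the helper names `build_C` and `expected_CCt`, the matrix size, and the seed are assumptions. It builds $C$ from sampled column indices and computes $E(CC^T)$ exactly by enumerating the possible choices for one column, using the fact that the $s$ columns are identically distributed.

```python
import numpy as np

def build_C(A, ks, p):
    """m x s matrix whose i-th column is A(:,k_i) / sqrt(s * p_{k_i})."""
    s = len(ks)
    return A[:, ks] / np.sqrt(s * p[ks])   # broadcasting scales each column

def expected_CCt(A, p, s):
    """Exact E(C C^T): s times the expected outer product of one scaled column."""
    m, n = A.shape
    per_column = np.zeros((m, m))
    for k in range(n):
        c = A[:, k] / np.sqrt(s * p[k])        # scaled column if index k is drawn
        per_column += p[k] * np.outer(c, c)    # weighted by its probability
    return s * per_column

# Illustrative data (assumption): a small Gaussian matrix with a fixed seed.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 30))
p = np.sum(A**2, axis=0) / np.sum(A**2)    # length squared probabilities
s = 10
ks = rng.choice(A.shape[1], size=s, p=p)   # sampled column indices k_1, ..., k_s
C = build_C(A, ks, p)

print("max |E(CC^T) - AA^T| =", np.max(np.abs(expected_CCt(A, p, s) - A @ A.T)))
```

The printed difference should be at the level of floating-point rounding, whereas a single sampled `C @ C.T` is only an approximation of `A @ A.T`.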
