
A GPU-based Approximate SVD Algorithm

Blake Foster, Sridhar Mahadevan, Rui Wang

University of Massachusetts Amherst


Introduction

• Singular Value Decomposition (SVD): A = UΣVᵀ (see the sketch below)
• A: m × n matrix (m ≥ n)
• U, V: orthogonal matrices
• Σ: diagonal matrix
• Exact SVD can be computed in O(mn²) time.
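For reference, a minimal NumPy sketch of the exact thin SVD (matrix sizes here are illustrative):

```python
import numpy as np

m, n = 6, 4
A = np.random.rand(m, n)                           # m x n with m >= n

# Thin SVD: U is m x n, Sigma is n x n diagonal, V is n x n. O(mn^2) cost.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma = np.diag(s)

assert np.allclose(A, U @ Sigma @ Vt)              # A = U Sigma V^T
```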


Introduction

• Approximate SVD (illustrated below)
• A: m × n matrix whose intrinsic dimensionality is far less than n.
• Usually sufficient to compute the k (≪ n) largest singular values and the corresponding vectors.
• Approximate SVD can be solved in O(mnk) time (compared to O(mn²) for the full SVD).
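A small illustration of the low-intrinsic-rank setting: when A has rank k ≪ n, keeping only the k largest singular values loses essentially nothing (sizes are illustrative):

```python
import numpy as np

m, n, k = 500, 200, 10
A = np.random.rand(m, k) @ np.random.rand(k, n)    # intrinsic rank <= k << n

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k truncation

print(np.linalg.norm(A - A_k))                     # ~0: rank k captures A
```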


Introduction

• Sample-based Approximate SVD
• Construct a basis V by directly sampling or taking linear combinations of rows from A.
• An approximate SVD can be extracted from the projection of A onto V.
• Sampling Methods (an illustrative length-squared sketch follows below):
• Length-squared [Frieze et al. 2004, Deshpande et al. 2006]
• Random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009]
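A hedged sketch of length-squared row sampling, the first method above; this shows only the sampling distribution, not the full estimators from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 200))

p = np.sum(A * A, axis=1)
p = p / p.sum()                       # row probabilities ~ squared lengths

idx = rng.choice(A.shape[0], size=50, replace=False, p=p)
S = A[idx]                            # sampled rows seed the basis V
```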


Introduction

• QUIC-SVD [Holmes et al. NIPS 2008]
• Sample-based approximate SVD.
• Progressively constructs a basis V using a cosine tree.
• Results have controllable L2 error; basis construction adapts to the matrix and the desired error.


Our Goal

• Map QUIC-SVD onto the GPU (Graphics Processing Unit)
• Powerful data-parallel computation platform.
• High computation density, high memory bandwidth.
• Relatively low cost.

NVIDIA GTX 580:
• 512 cores
• 1.6 TFLOPS
• 3 GB memory
• 200 GB/s bandwidth
• $549


Related Work

• OpenGL-based: rasterization, framebuffers, shaders
• [Galoppo et al. 2005]
• [Bondhugula et al. 2006]
• CUDA-based:
• Matrix factorizations [Volkov et al. 2008]
• QR decomposition [Kerr et al. 2009]
• SVD [Lahabar et al. 2009]
• Commercial GPU linear algebra package [CULA 2010]


Our Work

• A CUDA implementation of QUIC-SVD.
• Up to 6–7× speedup over an optimized CPU version.
• A partitioned version that can process large, out-of-core matrices on a single GPU.
• The largest we tested were 22,000 × 22,000 matrices (double-precision floating point).


Overview of QUIC-SVD
[Holmes et al. 2008]

[Figure: matrix A and the basis set V under construction]


Overview of QUIC-SVD
[Holmes et al. 2008]

Start from a root node that owns the entire matrix (all rows).


Overview of QUIC-SVD
[Holmes et al. 2008]

Select a pivot row by importance-sampling the squared magnitudes of the rows.


Overview of QUIC-SVD
[Holmes et al. 2008]

Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).


Overview of QUIC-SVD
[Holmes et al. 2008]

Based on the inner products, the rows are partitioned into two subsets (those closer to zero and the rest), each represented by a child node. A sketch of one split follows below.
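A minimal sketch of one cosine-tree split, combining the pivot selection, inner products, and partition from the last three slides; the median split rule here is a simplification, not necessarily the paper's exact grouping:

```python
import numpy as np

def split_node(rows, rng):
    # Importance-sample a pivot row by squared magnitude.
    p = np.sum(rows * rows, axis=1)
    pivot = rows[rng.choice(len(rows), p=p / p.sum())]
    # Inner product of every row with the pivot (the "cosine" step).
    dots = np.abs(rows @ pivot)
    # Partition: inner products closer to zero vs. the rest.
    near_zero = dots <= np.median(dots)
    return rows[near_zero], rows[~near_zero]
```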


Overview of QUIC-SVD
[Holmes et al. 2008]

Compute the row mean (average) of each subset and add it to the current basis set V (kept orthogonalized).


Overview of QUIC-SVD
[Holmes et al. 2008]

Repeat until the estimated L2 error is below a threshold. The basis set now provides a good low-rank approximation.


Overview of QUIC-SVD

• Monte Carlo Error Estimation (a rough sketch follows below):
• Input: any subset of rows A_s, the basis V, δ ∈ [0,1]
• Output: an estimated error ê such that, with probability at least 1 − δ, the true L2 error of projecting A_s onto V is at most ê.
• Termination Criterion: stop when the estimated error satisfies ê ≤ ε·‖A‖² for the chosen relative error threshold ε.
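A rough sketch of the estimator's idea, with uniform row sampling for clarity; the actual method importance-samples rows and applies a concentration bound to obtain the 1 − δ guarantee:

```python
import numpy as np

def estimate_sq_error(A_s, V, n_samples, rng):
    # V is assumed to have orthonormal columns.
    idx = rng.choice(A_s.shape[0], size=n_samples, replace=False)
    rows = A_s[idx]
    resid = rows - (rows @ V) @ V.T         # residual after projecting onto V
    per_row = np.sum(resid * resid, axis=1)
    return A_s.shape[0] * per_row.mean()    # scale sample mean to all rows
```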


Overview of QUIC-SVD

• 5 Steps (a condensed sketch of the loop follows below):
1. Select the leaf node with maximum estimated L2 error;
2. Split the node and create two child nodes;
3. Calculate the mean vector of each child and insert it into the basis set V (replacing the parent's vector), orthogonalized;
4. Estimate the error of the newly added child nodes;
5. Repeat steps 1–4 until the termination criterion is satisfied.
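A condensed, illustrative sketch of the loop; exact per-node errors stand in for the Monte Carlo estimates, child means are appended rather than replacing the parent's basis vector, and the split rule is the simplified one from before. This sketches the structure only, not the authors' implementation:

```python
import heapq
import numpy as np

def add_to_basis(V, v):
    v = v - V @ (V.T @ v)                    # orthogonalize against V
    norm = np.linalg.norm(v)
    return np.hstack([V, (v / norm)[:, None]]) if norm > 1e-12 else V

def node_error(rows, V):
    resid = rows - (rows @ V) @ V.T          # exact stand-in for the estimate
    return float(np.sum(resid * resid))

def quic_svd_basis(A, eps, rng):
    V = np.empty((A.shape[1], 0))
    total = float(np.sum(A * A))
    heap = [(-node_error(A, V), 0, A)]       # max-heap keyed on node error
    count = 1
    while -heap[0][0] > eps * total:
        _, _, rows = heapq.heappop(heap)     # 1. leaf with max error
        p = np.sum(rows * rows, axis=1)
        pivot = rows[rng.choice(len(rows), p=p / p.sum())]
        dots = np.abs(rows @ pivot)
        mask = dots <= np.median(dots)
        for child in (rows[mask], rows[~mask]):        # 2. two children
            if len(child) == 0:
                continue
            V = add_to_basis(V, child.mean(axis=0))    # 3. add child mean
            heapq.heappush(heap, (-node_error(child, V), count, child))  # 4.
            count += 1
    return V                                 # 5. stop once below eps * total

V = quic_svd_basis(np.random.rand(200, 60), 0.05, np.random.default_rng(0))
```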


GPU Implementation of QUIC-SVD

• Computation cost is mostly in constructing the tree.
• Most expensive steps:
• Computing vector inner products and row means
• Basis orthogonalization


Parallelize Vector Inner Products and Row Means

• Compute all inner products and row means in parallel.
• For a large matrix, there are enough rows to keep all GPU threads busy.
• An index array is used to point to the physical rows (sketched below).
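The index-array idea in NumPy terms, standing in for the CUDA kernels: a node stores row indices rather than copies, and the batched inner products and mean are single vectorized passes over the physical matrix:

```python
import numpy as np

A = np.random.rand(10000, 512)          # the physical matrix
idx = np.arange(0, 10000, 2)            # one node's rows, stored by index

pivot = A[idx[0]]
dots = A[idx] @ pivot                   # all inner products in one pass
mean = A[idx].mean(axis=0)              # the node's row mean
```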


Parallelize Gram-Schmidt Orthogonalization

• Classical Gram-Schmidt: assume v_1, v_2, ..., v_n are in the current basis set (i.e. they are orthonormal vectors). To add a new vector r:

r' = r − proj(r, v_1) − proj(r, v_2) − ... − proj(r, v_n), where proj(r, v) = (r · v) v
v_{n+1} = r' / ‖r'‖

• This is easy to parallelize (each projection is independently calculated), but it has poor numerical stability (see the sketch below).
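A sketch of the classical variant; because every projection uses the original r, the coefficient computation is a single matrix-vector product:

```python
import numpy as np

def classical_gs_add(V, r):
    # V: (d, n) with orthonormal columns; r: new vector of length d.
    coeffs = V.T @ r               # all n projections at once (parallel)
    r = r - V @ coeffs             # subtract projections of the ORIGINAL r
    return r / np.linalg.norm(r)   # normalized new basis vector
```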


Parallelize Gram-Schmidt Orthogonalization

• Modified Gram-Schmidt (sketched below):

r_1 = r − proj(r, v_1),
r_2 = r_1 − proj(r_1, v_2),
...
r_n = r_{n−1} − proj(r_{n−1}, v_n)
v_{n+1} = r_n / ‖r_n‖

• This has good numerical stability, but it is hard to parallelize, as the computation is sequential!
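The modified variant as a sketch; each projection uses the updated residual, so the loop must run in order:

```python
import numpy as np

def modified_gs_add(V, r):
    for k in range(V.shape[1]):    # inherently sequential
        v = V[:, k]
        r = r - (r @ v) * v        # project out v from the CURRENT residual
    return r / np.linalg.norm(r)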


Parallelize Gram-Schmidt Orthogonalization

• Our approach: Blocked Gram-Schmidt (sketched below). To add a new vector v against basis vectors u_1, ..., u_n in blocks of size m:

v^(1) = v − proj(v, u_1) − proj(v, u_2) − ... − proj(v, u_m)
v^(2) = v^(1) − proj(v^(1), u_{m+1}) − proj(v^(1), u_{m+2}) − ... − proj(v^(1), u_{2m})
...
u_{n+1} = v^(n/m) / ‖v^(n/m)‖

• This hybrid approach combines the advantages of the previous two: numerically stable, and GPU-friendly.
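A sketch of the blocked variant with the block size m as a tunable parameter: within a block the projections share one fixed residual (parallel, like classical), and the residual is refreshed between blocks (stable, like modified):

```python
import numpy as np

def blocked_gs_add(V, r, m):
    for start in range(0, V.shape[1], m):   # sequential over blocks
        B = V[:, start:start + m]
        r = r - B @ (B.T @ r)               # parallel within the block
    return r / np.linalg.norm(r)
```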


Extract the Final Result

• Input: matrix A, subspace basis V
• Output: U, Σ, V, the best approximate SVD of A within the subspace V.
• Algorithm (sketched below):
• Compute AV, then (AV)ᵀAV, and the SVD U′Σ′V′ᵀ = (AV)ᵀAV (this is the SVD of only a k × k matrix).
• Let V = VV′, Σ = (Σ′)^(1/2), U = (AV)V′Σ⁻¹.
• Return U, Σ, V.
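A NumPy sketch of the extraction, assuming V has orthonormal columns and AV has full rank (so Σ is invertible):

```python
import numpy as np

def extract_svd(A, V):
    AV = A @ V                                 # m x k
    Up, sp, Vpt = np.linalg.svd(AV.T @ AV)     # SVD of a small k x k matrix
    Sigma = np.sqrt(sp)                        # Sigma = (Sigma')^(1/2)
    U = (AV @ Vpt.T) / Sigma                   # U = (AV) V' Sigma^(-1)
    return U, Sigma, V @ Vpt.T                 # V <- V V'
```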


Partitioned QUIC-SVD

• The advantage of the GPU is only apparent for large-scale problems, e.g. matrices larger than 10,000².
• For dense matrices, this quickly becomes an out-of-core problem.
• We modified our algorithm to use a matrix partitioning scheme, so that each partition fits in GPU memory (blocking pattern sketched below).
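The deck does not spell out how per-partition results are merged, so this sketch shows only the blocking pattern that keeps each piece small enough for GPU memory:

```python
import numpy as np

def row_partitions(A, rows_per_block):
    # Yield fixed-size row blocks A_1, A_2, ..., A_s one at a time.
    for start in range(0, A.shape[0], rows_per_block):
        yield A[start:start + rows_per_block]
```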


Partitioned QUIC-SVD

• Partition the matrix into fixed-size blocks A_1, A_2, ..., A_s.




Partitioned QUIC-SVD

• The termination criterion is still guaranteed: if each partition A_i meets its error bound, the per-partition errors sum to at most the overall bound ε·‖A‖² for the whole matrix.


Implementation Details

• Main implementation done in CUDA.
• Use the CULA library to extract the final SVD (this involves processing a k × k matrix, where k ≪ n).
• Selection of the splitting node is done with a priority queue, maintained on the CPU side.
• Source code will be made available on our site: http://graphics.cs.umass.edu


Performance

• Test environment:
• CPU: Intel Core i7 2.66 GHz (8 hyper-threads), 6 GB memory
• GPU: NVIDIA GTX 480
• Test data (construction sketched below):
• Given a rank k and matrix size n, we generate an n × k left matrix and a k × n right matrix, both filled with random numbers; the product of the two gives the test matrix.
• Result accuracy:
• Both CPU and GPU versions use double precision; termination error ε = 10⁻¹², Monte Carlo estimation δ = 10⁻¹².
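A direct sketch of the test-matrix construction just described (the function name is illustrative):

```python
import numpy as np

def make_test_matrix(n, k, rng):
    L = rng.random((n, k))        # n x k left matrix
    R = rng.random((k, n))        # k x n right matrix
    return L @ R                  # n x n test matrix of rank at most k

A = make_test_matrix(4000, 50, np.random.default_rng(0))
```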


Performance

• Comparisons:
1. Matlab's svds;
2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented using the Intel Math Kernel Library);
3. GPU-based QUIC-SVD;
4. Tygert SVD (approximate SVD algorithm, implemented in Matlab, based on random projection).


Performance – Running Time

[Chart: running-time comparison]


Performance – Speedup Factors

[Charts: speedup factors]


Performance – CPU vs. GPU QUIC-SVD

[Charts: running time and speedup]


Performance – GPU QUIC-SVD vs. Tygert SVD

[Charts: running time and speedup]


Integration with Manifold Alignment Framework


Conclusion

• A fast GPU-based implementation of an approximate SVD algorithm.
• Reasonable (up to 6–7×) speedup over an optimized CPU implementation and over Tygert SVD.
• Our partitioned version allows processing out-of-core matrices.


Future Work

• Support for sparse matrices.
• Application to solving large graph Laplacians.
• Components from this project can become building blocks for other algorithms: random projection trees, diffusion wavelets, etc.


Acknowledgements

• Alexander Gray
• PPAM reviewers
• NSF Grant FODAVA-1025120
