Overview of QUIC-SVD - UMass Graphics Lab - University of ...
A GPU-based Approximate SVD Algorithm
Blake Foster, Sridhar Mahadevan, Rui Wang
University of Massachusetts Amherst
Introduction
• Singular Value Decomposition (SVD): A = U Σ Vᵀ
  • A: m × n matrix (m ≥ n)
  • U, V: orthogonal matrices
  • Σ: diagonal matrix
• Exact SVD can be computed in O(mn²) time.
Introduction
• Approximate SVD
  • A: m × n matrix whose intrinsic dimensionality is far less than n.
  • Usually sufficient to compute the k (≪ n) largest singular values and corresponding vectors.
  • Approximate SVD can be computed in O(mnk) time (compared to O(mn²) for the full SVD).
Introduction
• Sample-based Approximate SVD
  • Construct a basis V by directly sampling or taking linear combinations of rows from A.
  • An approximate SVD can be extracted from the projection of A onto V.
• Sampling Methods
  • Length-squared [Frieze et al. 2004, Deshpande et al. 2006]
  • Random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009]
Introduction
• QUIC-SVD [Holmes et al. NIPS 2008]
  • Sample-based approximate SVD.
  • Progressively constructs a basis V using a cosine tree.
  • Results have controllable L2 error; the basis construction adapts to the matrix and the desired error.
Our Goal
• Map QUIC-SVD onto the GPU (Graphics Processing Unit)
  • Powerful data-parallel computation platform.
  • High computation density, high memory bandwidth.
  • Relatively low cost.

NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 3 GB memory, 200 GB/s bandwidth, $549
Related Work
• OpenGL-based: rasterization, framebuffers, shaders
  • [Galoppo et al. 2005]
  • [Bondhugula et al. 2006]
• CUDA-based:
  • Matrix factorizations [Volkov et al. 2008]
  • QR decomposition [Kerr et al. 2009]
  • SVD [Lahabar et al. 2009]
  • Commercial GPU linear algebra package [CULA 2010]
Our Work
• A CUDA implementation of QUIC-SVD.
  • Up to 6-7× speedup over an optimized CPU version.
• A partitioned version that can process large, out-of-core matrices on a single GPU.
  • The largest we tested are 22,000 × 22,000 matrices (double-precision floating point).
Overview of QUIC-SVD [Holmes et al. 2008]
(figure: matrix A and the basis set V)
Overview of QUIC-SVD [Holmes et al. 2008]
Start from a root node that owns the entire matrix (all rows).
Overview of QUIC-SVD [Holmes et al. 2008]
Select a pivot row by importance sampling the squared magnitudes of the rows.
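Length-squared pivot selection can be sketched in a few lines of stdlib Python (an illustration, not the authors' CUDA code; `select_pivot` is a hypothetical name):

```python
import random

def select_pivot(rows, rng):
    """Pick a pivot row index with probability proportional to
    each row's squared magnitude (length-squared sampling)."""
    weights = [sum(x * x for x in row) for row in rows]
    # random.choices draws one index using the squared norms as weights
    return rng.choices(range(len(rows)), weights=weights, k=1)[0]

rows = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
rng = random.Random(0)
picks = [select_pivot(rows, rng) for _ in range(100)]
# row 1 has zero norm, so it is never picked; row 0 dominates (weight 25 vs 1)
```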
Overview of QUIC-SVD [Holmes et al. 2008]
Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).
Overview of QUIC-SVD [Holmes et al. 2008]
Based on the inner products, the rows are partitioned into two subsets (those closer to zero, and the rest), each represented by a child node.
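The split can be sketched as follows; note the fixed 0.5 cosine threshold is an illustrative assumption, not the paper's exact partitioning rule:

```python
def split_node(rows, pivot, threshold=0.5):
    """Partition row indices by |cosine| with the pivot row: rows whose
    cosine is closer to zero go to one child, the rest to the other.
    (The fixed threshold is an assumption for illustration only.)"""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    pnorm = dot(pivot, pivot) ** 0.5
    near_zero, rest = [], []
    for i, row in enumerate(rows):
        rnorm = dot(row, row) ** 0.5
        cos = 0.0 if rnorm * pnorm == 0 else dot(row, pivot) / (rnorm * pnorm)
        (near_zero if abs(cos) < threshold else rest).append(i)
    return near_zero, rest

rows = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.1]]
near_zero, rest = split_node(rows, [1.0, 0.0])
# row 1 is orthogonal to the pivot; rows 0 and 2 are nearly parallel to it
```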
Overview of QUIC-SVD [Holmes et al. 2008]
Compute the row mean (average) of each subset and add it to the current basis set V (keeping the set orthogonalized).
Overview of QUIC-SVD [Holmes et al. 2008]
Repeat until the estimated L2 error is below a threshold. The basis set now provides a good low-rank approximation.
Overview of QUIC-SVD
• Monte Carlo Error Estimation:
  • Input: any subset of rows A_s, the basis V, δ ∈ [0, 1]
  • Output: estimated error ê, such that with probability at least 1 − δ, the L2 error ‖A_s − A_s V Vᵀ‖² ≤ ê
• Termination Criterion: stop when the estimated error is below the user-specified fraction ε of the matrix's squared norm.
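The idea behind the estimator can be sketched as below; for simplicity this sketch samples rows uniformly, whereas QUIC-SVD importance-samples them, and it omits the confidence bound that δ controls:

```python
import random

def estimated_sq_error(rows, V, n_samples, rng):
    """Monte Carlo estimate of the squared projection error: sample rows,
    average their squared residuals after projecting onto span(V), and
    scale by the number of rows.  V holds orthonormal vectors (as rows)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total = 0.0
    for _ in range(n_samples):
        row = rng.choice(rows)
        resid = list(row)
        for v in V:
            c = dot(resid, v)  # orthonormal V -> plain dot-product coeffs
            resid = [r - c * x for r, x in zip(resid, v)]
        total += dot(resid, resid)
    return len(rows) * total / n_samples

rows = [[1.0, 0.0], [0.0, 2.0]]
V = [[1.0, 0.0]]  # basis spans the first coordinate only
err = estimated_sq_error(rows, V, 200, random.Random(1))
# the true squared error is 4; the estimate fluctuates around it
```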
Overview of QUIC-SVD
• 5 Steps:
  1. Select the leaf node with maximum estimated L2 error;
  2. Split the node and create two child nodes;
  3. Calculate the mean vector of each child and insert it into the basis set V (replacing the parent's vector), orthogonalized;
  4. Estimate the error of the newly added child nodes;
  5. Repeat steps 1-4 until the termination criterion is satisfied.
GPU Implementation of QUIC-SVD
• Computation cost is mostly in constructing the tree.
• Most expensive steps:
  • Computing vector inner products and row means
  • Basis orthogonalization
Parallelize Vector Inner Products and Row Means
• Compute all inner products and row means in parallel.
• For a large matrix, there are enough rows to engage all GPU threads.
• An index array is used to point to the physical rows.
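The per-node computation with index-array indirection might look like this sketch (sequential Python; on the GPU one thread would handle one indexed row):

```python
def node_dots_and_mean(matrix, index, pivot):
    """A tree node stores an index array into the shared matrix instead
    of copying rows; compute each indexed row's inner product with the
    pivot, and the node's row mean, through that indirection."""
    n_cols = len(matrix[0])
    dots = []
    mean = [0.0] * n_cols
    for i in index:
        row = matrix[i]
        dots.append(sum(x * y for x, y in zip(row, pivot)))
        mean = [m + x for m, x in zip(mean, row)]
    mean = [m / len(index) for m in mean]
    return dots, mean

M = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
dots, mean = node_dots_and_mean(M, [0, 2], [1.0, 0.0])
# dots == [1.0, 5.0]; mean == [3.0, 4.0]
```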
Parallelize Gram-Schmidt Orthogonalization
• Classical Gram-Schmidt: assume v1, v2, ..., vn are in the current basis set (i.e., they are orthonormal vectors). To add a new vector v:

  r = v − proj(v, v1) − proj(v, v2) − ... − proj(v, vn), where proj(r, v) = (r · v) v
  v(n+1) = r / ‖r‖

• This is easy to parallelize (each projection is independently calculated), but it has poor numerical stability.
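One classical Gram-Schmidt step can be sketched as follows; every projection coefficient uses the original v, so in a parallel setting they could all be computed at once:

```python
def classical_gram_schmidt_step(basis, v):
    """One CGS step: subtract v's projection onto every existing basis
    vector, then normalize.  All coefficients use the ORIGINAL v, so
    they are independent (parallel-friendly, but numerically fragile).
    basis holds orthonormal vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    coeffs = [dot(v, u) for u in basis]          # independent -> parallel
    r = list(v)
    for c, u in zip(coeffs, basis):
        r = [ri - c * ui for ri, ui in zip(r, u)]
    nrm = dot(r, r) ** 0.5
    return [ri / nrm for ri in r]

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
u = classical_gram_schmidt_step(basis, [1.0, 1.0, 2.0])
# u is the unit vector along the third axis
```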
Parallelize Gram-Schmidt Orthogonalization
• Modified Gram-Schmidt:

  r1 = v − proj(v, v1),
  r2 = r1 − proj(r1, v2),
  ...
  rk = r(k−1) − proj(r(k−1), vk),
  v(n+1) = rn / ‖rn‖

• This has good numerical stability, but is hard to parallelize, as the computation is sequential!
Parallelize Gram-Schmidt Orthogonalization
• Our approach: Blocked Gram-Schmidt (block size m)

  v(1) = v − proj(v, u1) − proj(v, u2) − ... − proj(v, um)
  v(2) = v(1) − proj(v(1), u(m+1)) − proj(v(1), u(m+2)) − ... − proj(v(1), u(2m))
  ...
  uk = v(n/m) / ‖v(n/m)‖

• This hybrid approach combines the advantages of the previous two: numerically stable, and GPU-friendly.
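The blocked scheme can be sketched like this: within a block, all projections use the same intermediate vector (CGS-style, parallel-friendly); between blocks, the vector is refreshed (MGS-style, for stability). Block size m = 1 recovers modified Gram-Schmidt and m = n recovers the classical variant:

```python
def blocked_gram_schmidt_step(basis, v, m):
    """One blocked GS step: sweep over the orthonormal basis in blocks
    of size m; coefficients within a block are independent, the working
    vector is updated between blocks.  Returns the normalized result."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    r = list(v)
    for start in range(0, len(basis), m):
        block = basis[start:start + m]
        coeffs = [dot(r, u) for u in block]   # independent within a block
        for c, u in zip(coeffs, block):
            r = [ri - c * ui for ri, ui in zip(r, u)]
    nrm = dot(r, r) ** 0.5
    return [ri / nrm for ri in r]

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
u1 = blocked_gram_schmidt_step(basis, [1.0, 2.0, 2.0], 1)  # MGS-like
u2 = blocked_gram_schmidt_step(basis, [1.0, 2.0, 2.0], 2)  # CGS-like
# both agree here because the basis is exactly orthonormal
```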
Extract the Final Result
• Input: matrix A, subspace basis V
• Output: U, Σ, V, the best approximate SVD of A within the subspace spanned by V.
• Algorithm:
  • Compute AV, then (AV)ᵀAV, and its SVD: U′Σ′V′ᵀ = (AV)ᵀAV (this is the SVD of a small k × k matrix).
  • Let V = VV′, Σ = (Σ′)^(1/2), U = (AV)V′Σ⁻¹.
  • Return U, Σ, V.
Partitioned QUIC-SVD
• The advantage of the GPU is only obvious for large-scale problems, e.g. matrices larger than 10,000².
• For dense matrices, this quickly becomes an out-of-core problem.
• We modified our algorithm to use a matrix partitioning scheme, so that each partition fits in GPU memory.
Partitioned QUIC-SVD
• Partition the matrix into fixed-size row blocks A1, A2, ..., As.
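The partitioning itself is straightforward; a sketch of splitting row indices into fixed-size contiguous blocks (so each sub-matrix A_i can be staged into GPU memory in turn):

```python
def partition_rows(n_rows, block_rows):
    """Split row indices 0..n_rows-1 into contiguous blocks of at most
    block_rows rows each, one block per sub-matrix A_1..A_s."""
    return [list(range(i, min(i + block_rows, n_rows)))
            for i in range(0, n_rows, block_rows)]

blocks = partition_rows(10, 4)
# three blocks: two full ones and a remainder block
```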
Partitioned QUIC-SVD
• The termination criterion is still guaranteed.
Implementation Details
• Main implementation done in CUDA.
• Use the CULA library to extract the final SVD (this involves processing a k × k matrix, where k ≪ n).
• Selection of the splitting node is done using a priority queue, maintained on the CPU side.
• Source code will be made available on our site: http://graphics.cs.umass.edu
Performance
• Test environment:
  • CPU: Intel Core i7 2.66 GHz (8 hardware threads), 6 GB memory
  • GPU: NVIDIA GTX 480
• Test data:
  • Given a rank k and matrix size n, we generate an n × k left matrix and a k × n right matrix, both filled with random numbers; the product of the two gives the test matrix.
• Result accuracy:
  • Both CPU and GPU versions use double precision; termination error ε = 10⁻¹², Monte Carlo estimation δ = 10⁻¹².
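The test-data construction described above can be sketched as follows (stdlib Python; `make_test_matrix` is a hypothetical name):

```python
import random

def make_test_matrix(n, k, rng):
    """Rank-k test matrix: the product of a random n x k left matrix
    and a random k x n right matrix, as in the evaluation setup."""
    L = [[rng.random() for _ in range(k)] for _ in range(n)]
    R = [[rng.random() for _ in range(n)] for _ in range(k)]
    return [[sum(L[i][t] * R[t][j] for t in range(k)) for j in range(n)]
            for i in range(n)]

A = make_test_matrix(6, 2, random.Random(0))
# a 6 x 6 matrix whose rank is at most 2
```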
Performance
• Comparisons:
  1. Matlab's svds;
  2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented using the Intel Math Kernel Library);
  3. GPU-based QUIC-SVD;
  4. Tygert SVD (approximate SVD algorithm, implemented in Matlab, based on random projection).
Performance – Running Time (figure)
Performance – Speedup Factors (figures)
Performance – CPU vs. GPU QUIC-SVD (running time and speedup figures)
Performance – GPU QUIC-SVD vs. Tygert SVD (running time and speedup figures)
Integration with Manifold Alignment Framework (figure)
Conclusion
• A fast GPU-based implementation of an approximate SVD algorithm.
• Reasonable (up to 6-7×) speedup over an optimized CPU implementation and over Tygert SVD.
• Our partitioned version allows processing out-of-core matrices.
Future Work
• Support sparse matrices.
• Applications in solving large graph Laplacians.
• Components from this project can become building blocks for other algorithms: random projection trees, diffusion wavelets, etc.
Acknowledgements
• Alexander Gray
• PPAM reviewers
• NSF Grant FODAVA-1025120