Simon Layton, Boston University - Nvidia
Simon Layton, Boston University - Nvidia
Simon Layton, Boston University - Nvidia
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Classical Algebraic Multigrid<br />
for Engineering Applications<br />
<strong>Simon</strong> <strong>Layton</strong>, Lorena Barba (BU)<br />
With thanks to Justin Luitjens, Jonathan Cohen (NVIDIA)
Problem Statement<br />
‣Solve sparse linear systems from engineering problems<br />
Au = b<br />
‣Pressure Poisson equation in fluids<br />
r 2<br />
= r·u ⇤<br />
- 90%+ of total run time!<br />
2
What is Multigrid and why do we care?<br />
‣Hierarchical<br />
- Repeatedly reduce size of the problem<br />
‣ Optimal Complexity<br />
- O(N)<br />
‣Parallel & scalable<br />
- 100k+ cores (Hypre)<br />
3
Algebraic vs. Geometric<br />
‣Coarsen from matrix entries<br />
- Not restricted to structured grids<br />
vs<br />
4
Matrix as a graph<br />
‣Variables as vertices<br />
‣Non-zeros as edges<br />
5
Component (1) - Strength of Connection<br />
‣ Measure of how strongly vertices depend on each other<br />
Each edge must<br />
either Strong or Weak<br />
6
Component (2) - Selector<br />
‣Choose vertices with highest weights<br />
‣Weighting is # of strong edges to vertex<br />
7
Component (3) - Interpolator / Restrictor<br />
‣Transfer residuals between levels<br />
- Construct next level<br />
Distance 2<br />
- Looks at neighbours<br />
of neighbours<br />
8
Component (4) - Galerkin Product<br />
‣Generate next level in hierarchy<br />
A k+1 = R k A k P k<br />
1.5<br />
1<br />
0.5<br />
< 1% 2% 2% < 1%<br />
13%<br />
Triple matrix product<br />
0<br />
39%<br />
−0.5<br />
‣<br />
−1<br />
Prolongate & correct<br />
44%<br />
Restrict residual<br />
Compute R<br />
Compute A<br />
Compute P<br />
Smooth<br />
Other<br />
−1.5<br />
−1.5 −1 −0.5 0 0.5 1 1.5<br />
Bottleneck!<br />
9
Component (5) - Solver Cycle<br />
‣Smooth errors on all levels<br />
k=1<br />
V-Cycle<br />
- Simplest option<br />
- lots of SpMV!<br />
Restriction<br />
k=M<br />
Interpolation<br />
10
GPU Implementation - Justification<br />
‣Algorithm entirely parallel!<br />
- Fine-grained parallelism available<br />
‣If we get ~2x+ speedup, massive savings in runtime<br />
- Bigger runs!<br />
- Less time to solution!<br />
11
GPU Implementation - First Thoughts<br />
‣Most operations easily expose parallelism<br />
‣Interpolator is bottleneck<br />
‣Ensure correctness<br />
- Compare against Hypre<br />
- Produce identical results<br />
12
Interpolator (again)<br />
‣Interpolation weights:<br />
‣Where:<br />
0<br />
P ij = 1 @a ij + X<br />
ã ii<br />
ã ii = a ii +<br />
X<br />
n2N w i<br />
\Ĉi<br />
k2F s i<br />
P<br />
a ik ā ki<br />
a in + X<br />
l2Ĉi[{i} ākl<br />
k2F s i<br />
a ik<br />
1<br />
A ,j 2 Ĉi<br />
P<br />
ā ki<br />
l2Ĉi[{i} ākl<br />
‣And:<br />
Ĉ i = C s i [ [<br />
j2F s i<br />
C s j<br />
‣Repeated set union is tricky..<br />
13
Repeated Set Union<br />
Ĉ i = C s i [ [<br />
C s j<br />
j2F s i<br />
‣Operation is conceptually simple<br />
14
Repeated Set Union<br />
Ĉ i = C s i [ [<br />
C s j<br />
j2F s i<br />
‣Operation is conceptually simple<br />
14
Differing Approaches<br />
1. Worst case storage, sort & unique<br />
- Reliant on thrust<br />
2. Construct boolean matrices for connections<br />
- Matrix-Matrix multiply<br />
- Reliant on Cusp<br />
15
Results - Convergence<br />
‣Regular Poisson grids<br />
- Problem size invariance for convergence<br />
1<br />
0.9<br />
0.8<br />
Average Convergence Rate<br />
0.7<br />
0.6<br />
0.5<br />
0.4<br />
0.3<br />
5−pt<br />
9pt<br />
7pt<br />
27pt<br />
0.2<br />
0.1<br />
0 2 4 6 8 10<br />
N<br />
x 10 5<br />
16
Results - Performance<br />
‣System from slow flow past cylinder compared to Hypre<br />
1.5<br />
0.5<br />
-0.5<br />
-1<br />
-1.5<br />
2 Code Time Speedup<br />
1<br />
0<br />
-2<br />
-1 0 1 2 3 4<br />
CUDA 3.57s 1.87x<br />
Hypre<br />
(1C)<br />
Hypre<br />
(2C)<br />
Hypre<br />
(4C)<br />
Hypre<br />
(6C)<br />
6.69s -<br />
5.29s 1.26x<br />
4.16s 1.61x<br />
4.00s 1.67x<br />
17
Profiling<br />
‣1,000,000 unknowns<br />
‣Error norm of 10 -5<br />
‣Single Tesla C2050 GPU<br />
18
Profiling - Breakdown by Routine<br />
‣Generating Interpolation matrix, coarse A most expensive<br />
1.5<br />
1<br />
< 1% 2% 2% < 1%<br />
13%<br />
Triple matrix<br />
product<br />
- coarse A<br />
0.5<br />
0<br />
−0.5<br />
−1<br />
39%<br />
Prolongate & correct<br />
44%<br />
Restrict residual<br />
Compute R<br />
Compute A<br />
Compute P<br />
Smooth<br />
Other<br />
−1.5<br />
−1.5 −1 −0.5 0 0.5 1 1.5<br />
Interpolation<br />
matrix<br />
19
Conclusions & Further Work<br />
‣Classical AMG entirely on GPU<br />
- Validated against Hypre<br />
- Not optimised<br />
‣Multiple approaches<br />
- Test different methods & compare<br />
20