Simon Layton, Boston University - Nvidia

Classical Algebraic Multigrid 

for Engineering Applications 

Simon Layton, Lorena Barba (BU) 

With thanks to Justin Luitjens, Jonathan Cohen (NVIDIA)

Problem Statement 

‣Solve sparse linear systems from engineering problems 

Au = b 

‣Pressure Poisson equation in fluids 

r 2 

= r·u ⇤ 

- 90%+ of total run time! 

2

What is Multigrid and why do we care? 

‣Hierarchical 

- Repeatedly reduce size of the problem 

‣ Optimal Complexity 

- O(N) 

‣Parallel & scalable 

- 100k+ cores (Hypre) 

3

Algebraic vs. Geometric 

‣Coarsen from matrix entries 

- Not restricted to structured grids 

vs 

4

Matrix as a graph 

‣Variables as vertices 

‣Non-zeros as edges 

5

Component (1) - Strength of Connection 

‣ Measure of how strongly vertices depend on each other 

Each edge must 

either Strong or Weak 

6

Component (2) - Selector 

‣Choose vertices with highest weights 

‣Weighting is # of strong edges to vertex 

7

Component (3) - Interpolator / Restrictor 

‣Transfer residuals between levels 

- Construct next level 

Distance 2 

- Looks at neighbours 

of neighbours 

8

Component (4) - Galerkin Product 

‣Generate next level in hierarchy 

A k+1 = R k A k P k 

1.5 

1 

0.5 

< 1% 2% 2% < 1% 

13% 

Triple matrix product 

0 

39% 

−0.5 

‣ 

−1 

Prolongate & correct 

44% 

Restrict residual 

Compute R 

Compute A 

Compute P 

Smooth 

Other 

−1.5 

−1.5 −1 −0.5 0 0.5 1 1.5 

Bottleneck! 

9

Component (5) - Solver Cycle 

‣Smooth errors on all levels 

k=1 

V-Cycle 

- Simplest option 

- lots of SpMV! 

Restriction 

k=M 

Interpolation 

10

GPU Implementation - Justification 

‣Algorithm entirely parallel! 

- Fine-grained parallelism available 

‣If we get ~2x+ speedup, massive savings in runtime 

- Bigger runs! 

- Less time to solution! 

11

GPU Implementation - First Thoughts 

‣Most operations easily expose parallelism 

‣Interpolator is bottleneck 

‣Ensure correctness 

- Compare against Hypre 

- Produce identical results 

12

Interpolator (again) 

‣Interpolation weights: 

‣Where: 

0 

P ij = 1 @a ij + X 

ã ii 

ã ii = a ii + 

X 

n2N w i 

\Ĉi 

k2F s i 

P 

a ik ā ki 

a in + X 

l2Ĉi[{i} ākl 

k2F s i 

a ik 

1 

A ,j 2 Ĉi 

P 

ā ki 

l2Ĉi[{i} ākl 

‣And: 

Ĉ i = C s i [ [ 

j2F s i 

C s j 

‣Repeated set union is tricky.. 

13

Repeated Set Union 

Ĉ i = C s i [ [ 

C s j 

j2F s i 

‣Operation is conceptually simple 

14

Repeated Set Union 

Ĉ i = C s i [ [ 

C s j 

j2F s i 

‣Operation is conceptually simple 

14

Differing Approaches 

1. Worst case storage, sort & unique 

- Reliant on thrust 

2. Construct boolean matrices for connections 

- Matrix-Matrix multiply 

- Reliant on Cusp 

15

Results - Convergence 

‣Regular Poisson grids 

- Problem size invariance for convergence 

1 

0.9 

0.8 

Average Convergence Rate 

0.7 

0.6 

0.5 

0.4 

0.3 

5−pt 

9pt 

7pt 

27pt 

0.2 

0.1 

0 2 4 6 8 10 

N 

x 10 5 

16

Results - Performance 

‣System from slow flow past cylinder compared to Hypre 

1.5 

0.5 

-0.5 

-1 

-1.5 

2 Code Time Speedup 

1 

0 

-2 

-1 0 1 2 3 4 

CUDA 3.57s 1.87x 

Hypre 

(1C) 

Hypre 

(2C) 

Hypre 

(4C) 

Hypre 

(6C) 

6.69s - 

5.29s 1.26x 

4.16s 1.61x 

4.00s 1.67x 

17

Profiling 

‣1,000,000 unknowns 

‣Error norm of 10 -5 

‣Single Tesla C2050 GPU 

18

Profiling - Breakdown by Routine 

‣Generating Interpolation matrix, coarse A most expensive 

1.5 

1 

< 1% 2% 2% < 1% 

13% 

Triple matrix 

product 

- coarse A 

0.5 

0 

−0.5 

−1 

39% 

Prolongate & correct 

44% 

Restrict residual 

Compute R 

Compute A 

Compute P 

Smooth 

Other 

−1.5 

−1.5 −1 −0.5 0 0.5 1 1.5 

Interpolation 

matrix 

19

Conclusions & Further Work 

‣Classical AMG entirely on GPU 

- Validated against Hypre 

- Not optimised 

‣Multiple approaches 

- Test different methods & compare 

20

Simon Layton, Boston University - Nvidia

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?