16.04.2014 Views

Simon Layton, Boston University - Nvidia

Simon Layton, Boston University - Nvidia

Simon Layton, Boston University - Nvidia

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Classical Algebraic Multigrid<br />

for Engineering Applications<br />

<strong>Simon</strong> <strong>Layton</strong>, Lorena Barba (BU)<br />

With thanks to Justin Luitjens, Jonathan Cohen (NVIDIA)


Problem Statement<br />

‣Solve sparse linear systems from engineering problems<br />

Au = b<br />

‣Pressure Poisson equation in fluids<br />

r 2<br />

= r·u ⇤<br />

- 90%+ of total run time!<br />

2


What is Multigrid and why do we care?<br />

‣Hierarchical<br />

- Repeatedly reduce size of the problem<br />

‣ Optimal Complexity<br />

- O(N)<br />

‣Parallel & scalable<br />

- 100k+ cores (Hypre)<br />

3


Algebraic vs. Geometric<br />

‣Coarsen from matrix entries<br />

- Not restricted to structured grids<br />

vs<br />

4


Matrix as a graph<br />

‣Variables as vertices<br />

‣Non-zeros as edges<br />

5


Component (1) - Strength of Connection<br />

‣ Measure of how strongly vertices depend on each other<br />

Each edge must<br />

either Strong or Weak<br />

6


Component (2) - Selector<br />

‣Choose vertices with highest weights<br />

‣Weighting is # of strong edges to vertex<br />

7


Component (3) - Interpolator / Restrictor<br />

‣Transfer residuals between levels<br />

- Construct next level<br />

Distance 2<br />

- Looks at neighbours<br />

of neighbours<br />

8


Component (4) - Galerkin Product<br />

‣Generate next level in hierarchy<br />

A k+1 = R k A k P k<br />

1.5<br />

1<br />

0.5<br />

< 1% 2% 2% < 1%<br />

13%<br />

Triple matrix product<br />

0<br />

39%<br />

−0.5<br />

‣<br />

−1<br />

Prolongate & correct<br />

44%<br />

Restrict residual<br />

Compute R<br />

Compute A<br />

Compute P<br />

Smooth<br />

Other<br />

−1.5<br />

−1.5 −1 −0.5 0 0.5 1 1.5<br />

Bottleneck!<br />

9


Component (5) - Solver Cycle<br />

‣Smooth errors on all levels<br />

k=1<br />

V-Cycle<br />

- Simplest option<br />

- lots of SpMV!<br />

Restriction<br />

k=M<br />

Interpolation<br />

10


GPU Implementation - Justification<br />

‣Algorithm entirely parallel!<br />

- Fine-grained parallelism available<br />

‣If we get ~2x+ speedup, massive savings in runtime<br />

- Bigger runs!<br />

- Less time to solution!<br />

11


GPU Implementation - First Thoughts<br />

‣Most operations easily expose parallelism<br />

‣Interpolator is bottleneck<br />

‣Ensure correctness<br />

- Compare against Hypre<br />

- Produce identical results<br />

12


Interpolator (again)<br />

‣Interpolation weights:<br />

‣Where:<br />

0<br />

P ij = 1 @a ij + X<br />

ã ii<br />

ã ii = a ii +<br />

X<br />

n2N w i<br />

\Ĉi<br />

k2F s i<br />

P<br />

a ik ā ki<br />

a in + X<br />

l2Ĉi[{i} ākl<br />

k2F s i<br />

a ik<br />

1<br />

A ,j 2 Ĉi<br />

P<br />

ā ki<br />

l2Ĉi[{i} ākl<br />

‣And:<br />

Ĉ i = C s i [ [<br />

j2F s i<br />

C s j<br />

‣Repeated set union is tricky..<br />

13


Repeated Set Union<br />

Ĉ i = C s i [ [<br />

C s j<br />

j2F s i<br />

‣Operation is conceptually simple<br />

14


Repeated Set Union<br />

Ĉ i = C s i [ [<br />

C s j<br />

j2F s i<br />

‣Operation is conceptually simple<br />

14


Differing Approaches<br />

1. Worst case storage, sort & unique<br />

- Reliant on thrust<br />

2. Construct boolean matrices for connections<br />

- Matrix-Matrix multiply<br />

- Reliant on Cusp<br />

15


Results - Convergence<br />

‣Regular Poisson grids<br />

- Problem size invariance for convergence<br />

1<br />

0.9<br />

0.8<br />

Average Convergence Rate<br />

0.7<br />

0.6<br />

0.5<br />

0.4<br />

0.3<br />

5−pt<br />

9pt<br />

7pt<br />

27pt<br />

0.2<br />

0.1<br />

0 2 4 6 8 10<br />

N<br />

x 10 5<br />

16


Results - Performance<br />

‣System from slow flow past cylinder compared to Hypre<br />

1.5<br />

0.5<br />

-0.5<br />

-1<br />

-1.5<br />

2 Code Time Speedup<br />

1<br />

0<br />

-2<br />

-1 0 1 2 3 4<br />

CUDA 3.57s 1.87x<br />

Hypre<br />

(1C)<br />

Hypre<br />

(2C)<br />

Hypre<br />

(4C)<br />

Hypre<br />

(6C)<br />

6.69s -<br />

5.29s 1.26x<br />

4.16s 1.61x<br />

4.00s 1.67x<br />

17


Profiling<br />

‣1,000,000 unknowns<br />

‣Error norm of 10 -5<br />

‣Single Tesla C2050 GPU<br />

18


Profiling - Breakdown by Routine<br />

‣Generating Interpolation matrix, coarse A most expensive<br />

1.5<br />

1<br />

< 1% 2% 2% < 1%<br />

13%<br />

Triple matrix<br />

product<br />

- coarse A<br />

0.5<br />

0<br />

−0.5<br />

−1<br />

39%<br />

Prolongate & correct<br />

44%<br />

Restrict residual<br />

Compute R<br />

Compute A<br />

Compute P<br />

Smooth<br />

Other<br />

−1.5<br />

−1.5 −1 −0.5 0 0.5 1 1.5<br />

Interpolation<br />

matrix<br />

19


Conclusions & Further Work<br />

‣Classical AMG entirely on GPU<br />

- Validated against Hypre<br />

- Not optimised<br />

‣Multiple approaches<br />

- Test different methods & compare<br />

20

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!