
Tridiagonal Solvers on the GPU and Applications to Fluid Simulation

Nikolai Sakharnykh, NVIDIA

nsakharnykh@nvidia.com


Agenda

• Introduction and Problem Statement
• Governing Equations
• ADI Numerical Method
• GPU Implementation and Optimizations
• Results and Future Work


Introduction

• Approaches to turbulence simulation:
  – Direct Numerical Simulation (DNS): all scales of turbulence; expensive
  – Large-Eddy Simulation (LES)
  – Reynolds-Averaged Navier-Stokes (RANS)
• Research at the Computer Science department of Moscow State University
  – Paskonov V.M., Berezin S.B.


Problem Statement

• Viscous incompressible fluid in a 3D domain
• Initial and boundary conditions
• Eulerian coordinates: velocity and temperature


Definitions

• ρ – density, ρ = const
• u = (u, v, w) – velocity
• T – temperature
• p – pressure
• State equation: p = ρRT
  – R – gas constant for air


Governing Equations

• Continuity equation:

  div u = 0

• Navier-Stokes equations (dimensionless form):

  ∂u/∂t + (u·∇)u = −∇T + (1/Re) ∇²u

  – Re – Reynolds number


Reynolds number

• Similarity parameter
  – the ratio of inertia forces to viscosity forces
• 3D channel:

  Re = V′L′ / ν′

  V′ – mean velocity
  L′ – length of pipe
  ν′ – kinematic viscosity
• High Re – turbulent flow
• Low Re – laminar flow


Governing Equations

• Energy equation (dimensionless form):

  ∂T/∂t + (u·∇)T = (1/(Pr·Re)) ∇²T + ((γ−1)/Re) Φ

  – Pr – Prandtl number
  – γ – heat capacity ratio
  – Φ – dissipative function


Numerical Method

• Alternating Direction Implicit (ADI)
• The 3D problem

  ∂u/∂t = ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²

  is split into three 1D problems, one per direction:

  X: ∂u/∂t = ∂²u/∂x²
  Y: ∂u/∂t = ∂²u/∂y²
  Z: ∂u/∂t = ∂²u/∂z²


ADI – Heat Conduction

• 3 fractional steps – X, Y, Z
• Implicit finite-difference scheme, X step (time step τ, grid step Δx):

  (u[i,j,k]^(n+1/3) − u[i,j,k]^n) / τ =
      (u[i+1,j,k]^(n+1/3) − 2 u[i,j,k]^(n+1/3) + u[i−1,j,k]^(n+1/3)) / Δx²

• With q = τ/Δx² this rearranges into a tridiagonal system along each X line:

  q u[i−1,j,k]^(n+1/3) − (1 + 2q) u[i,j,k]^(n+1/3) + q u[i+1,j,k]^(n+1/3) = −u[i,j,k]^n

• The Y and Z fractional steps are analogous
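The implicit X step reduces to a tridiagonal system per grid line. As a sanity check (an illustrative sketch, not from the slides; τ, Δx, and the sample values below are arbitrary), the tridiagonal form q·u′(i−1) − (1+2q)·u′(i) + q·u′(i+1) = −u(i), with q = τ/Δx², is algebraically equivalent to the scheme (u′(i) − u(i))/τ = (u′(i+1) − 2u′(i) + u′(i−1))/Δx²:

```cpp
#include <cmath>

// u0c is the old-layer value; u1m, u1c, u1p are the new-layer values
// at i-1, i, i+1; q = tau/dx^2.

// old-layer value implied by the implicit scheme
double implied_old(double q, double u1m, double u1c, double u1p) {
    return u1c - q * (u1p - 2.0 * u1c + u1m);
}

// residual of the tridiagonal form; zero iff the scheme holds
double tridiag_residual(double q, double u1m, double u1c, double u1p,
                        double u0c) {
    return q * u1m - (1.0 + 2.0 * q) * u1c + q * u1p + u0c;
}
```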


ADI – Navier-Stokes

• Equation for the X velocity:

  ∂u/∂t + u ∂u/∂x + v ∂u/∂y + w ∂u/∂z =
      −∂T/∂x + (1/Re)(∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²)

• Splitting by direction:

  X: ∂u/∂t + u ∂u/∂x = −∂T/∂x + (1/Re) ∂²u/∂x²
  Y: ∂u/∂t + v ∂u/∂y = (1/Re) ∂²u/∂y²
  Z: ∂u/∂t + w ∂u/∂z = (1/Re) ∂²u/∂z²

  – need iterations for non-linear PDEs


ADI – Time Step

(n−1) time step → (n) time step → (n+1) time step

• Within each time step: splitting by X → splitting by Y → splitting by Z
• Global iterations around the whole step update the non-linear parameters


ADI – Fractional Time Step

• Linear PDEs
  – Fields: u (x-velocity), v (y-velocity), w (z-velocity), T (temperature)
  – Previous layer → N time layer → Sweep → N+1 time layer → Next layer
  – Sweep solves many tridiagonal systems independently


ADI – Fractional Time Step

• Non-Linear PDEs
  – Fields: u (x-velocity), v (y-velocity), w (z-velocity), T (temperature)
  – Previous layer → N time layer → Sweep → N+½ time layer → Update → N+1 time layer → Next layer
  – Sweep solves many tridiagonal systems independently
  – Local iterations repeat the Sweep/Update pair within the fractional step


Main Stages of the Algorithm

• Solve many independent tridiagonal systems
  – Computationally intensive
  – Easy to parallelize
• Subtasks:
  – Evaluate the dissipation term
  – Update non-linear parameters


Tridiagonal Solvers Overview

• Simplified Gauss elimination
  – Also known as the Thomas algorithm, or sweep
  – The fastest serial approach
• Cyclic Reduction methods
  – Attend Yao Zhang’s talk “Fast Tridiagonal Solvers” afterwards!


Sweep algorithm

• Memory requirements
  – one additional array of size N
• Forward elimination step
• Backward substitution step
• Complexity: O(N)


GPU Implementation

• All data arrays are stored on the GPU
• Several 3D time layers
  – overall 1GB for a 192x192x192 grid in DP
• Main kernels
  – Sweep
  – Dissipative function evaluation
  – Non-linear update
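The 1GB figure is consistent with simple arithmetic (a back-of-the-envelope sketch; the exact number of stored arrays is not stated in the slides): one 192³ array of doubles takes 192³ × 8 ≈ 56.6 MB, so 1GB holds roughly 18 such arrays across the time layers and fields.

```cpp
#include <cstdint>

// bytes occupied by one n x n x n array of doubles
std::uint64_t grid_bytes(std::uint64_t n) {
    return n * n * n * sizeof(double);
}
```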


Sweep on the GPU

• One thread solves one system
  – N² systems on each fractional step (splitting by X, Y, or Z)
• Each thread operates on a 1D slice in the corresponding direction


Sweep – Performance

• Time steps/sec for Sweep X, Sweep Y, and Sweep Z, in float and double (NVIDIA Tesla C1060)
• X splitting is much slower than the Y/Z ones


Sweep – Going into Details

• Memory bound – need to optimize access to the memory

  Sweep X: uncoalesced
  Sweep Y: coalesced
  Sweep Z: coalesced
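The coalescing pattern follows from the linearized index. A small sketch (assuming a row-major layout with x fastest-varying, matching the indexing style of the code shown later): at each step of the X sweep, adjacent threads touch addresses nx elements apart, while in the Y sweep they touch consecutive addresses.

```cpp
#include <cstddef>

// linear index for a 3D array stored with x fastest-varying
std::size_t idx3(std::size_t i, std::size_t j, std::size_t k,
                 std::size_t nx, std::size_t ny) {
    return i + nx * (j + ny * k);
}

// Address touched by thread t at step s of the X sweep:
// threads enumerate the (j,k) lines and each walks along x.
std::size_t sweep_x_addr(std::size_t t, std::size_t s,
                         std::size_t nx, std::size_t ny) {
    return idx3(s, t % ny, t / ny, nx, ny);
}

// Address touched by thread t at step s of the Y sweep:
// threads enumerate the (i,k) lines and each walks along y.
std::size_t sweep_y_addr(std::size_t t, std::size_t s,
                         std::size_t nx, std::size_t ny) {
    return idx3(t % nx, s, t / nx, nx, ny);
}
```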


Sweep – Optimization

• Solution for X-splitting
  – Reorder the data arrays and run Y-splitting instead
  – Needs a few additional 3D matrix transposes
• Speedup over the original X sweep (time steps/sec): 2.5x in float, 1.7x in double
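The reordering can be sketched as a 3D transpose that swaps the x and y axes (an illustrative CPU version; a real GPU kernel would tile the copy through shared memory to keep both reads and writes coalesced):

```cpp
#include <vector>
#include <cstddef>

// Swap the x and y axes of a 3D array stored x-fastest:
// out[j + ny*(i + nx*k)] = in[i + nx*(j + ny*k)]
std::vector<double> transpose_xy(const std::vector<double>& in,
                                 std::size_t nx, std::size_t ny,
                                 std::size_t nz) {
    std::vector<double> out(in.size());
    for (std::size_t k = 0; k < nz; ++k)
        for (std::size_t j = 0; j < ny; ++j)
            for (std::size_t i = 0; i < nx; ++i)
                out[j + ny * (i + nx * k)] = in[i + nx * (j + ny * k)];
    return out;
}
```

After the transpose, lines that ran along x now run along the fastest-varying axis of the new layout, so the X fractional step can reuse the coalesced Y-sweep kernel.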


Code analysis

• GPU version is based on the CPU code

// boundary conditions
switch (dir)
{
    case X: case X_as_Y: bc_x0(…); break;
    case Y:              bc_y0(…); break;
    case Z:              bc_z0(…); break;
}

a[1] = - c1 / c2;
u_next[base_idx] = f_i / c2;

// forward trace of sweep
int idx = base_idx;
int idx_prev;
for (int k = 1; k < n; k++)
{
    idx_prev = idx;
    idx += p.stride;

    double c = v_temp[idx];
    c1 =  p.m_c13 * c - p.h;
    c2 =  p.m_c2;
    c3 = -p.m_c13 * c - p.h;

    double q = c3 * a[k] + c2;
    double t = 1 / q;
    a[k+1] = - c1 * t;
    u_next[idx] = (f[idx] - c3 * u_next[idx_prev]) * t;
}


Performance Comparison

• Test data
  – Grid sizes of 128 and 192 per dimension
  – 8 non-linear iterations (2 inner x 4 outer)
• Hardware
  – NVIDIA Tesla C1060
  – Intel Core2 Quad (4 threads)
  – Intel Core i7 Nehalem (8 threads)


Performance – 128 – float

• Time steps/sec for the Dissipation, Sweep, NonLinear, and Total stages
• GPU speedups over the CPU baselines: 20x (Dissipation), 7x (Sweep), 7x (NonLinear), 9x (Total)
• Hardware: NVIDIA Tesla C1060; Intel Core i7 Nehalem 2.93GHz (4 cores); Intel Core 2 Quad 2.4GHz (4 cores)


Performance – 128 – double

• Time steps/sec for the Dissipation, Sweep, NonLinear, and Total stages
• GPU speedups over the CPU baselines: 10x (Dissipation), 4x (Sweep), 4x (NonLinear), 5x (Total)
• Hardware: NVIDIA Tesla C1060; Intel Core i7 Nehalem 2.93GHz (4 cores); Intel Core 2 Quad 2.4GHz (4 cores)


Performance – 192 – float

• Time steps/sec for the Dissipation, Sweep, NonLinear, and Total stages
• GPU speedups over the CPU baselines: 28x (Dissipation), 8x (Sweep), 8x (NonLinear), 11x (Total)
• Hardware: NVIDIA Tesla C1060; Intel Core i7 Nehalem 2.93GHz (4 cores); Intel Core 2 Quad 2.4GHz (4 cores)


Performance – 192 – double

• Time steps/sec for the Dissipation, Sweep, NonLinear, and Total stages
• GPU speedups over the CPU baselines: 13x (Dissipation), 4x (Sweep), 5x (NonLinear), 5x (Total)
• Hardware: NVIDIA Tesla C1060; Intel Core i7 Nehalem 2.93GHz (4 cores); Intel Core 2 Quad 2.4GHz (4 cores)


GPU Performance – SP/DP

• Time steps/sec for the dissipation, sweep, nonlinear, and total stages, in float and double
• In double precision the GPU is only 2x slower than in single precision


Visual Results

• Boundary conditions
  – Constant flow at start: u = 1, v = w = 0
  – No-slip on sides: u = v = w = 0
  – Free at far end: ∂²u/∂x² = ∂²v/∂x² = ∂²w/∂x² = 0


Visual Results – X-slice

• u, v, w, and T fields at the slice x = 0.9, t = 6, Re = 1000


Future Work

• Effective multi-GPU usage
  – Distributed memory systems
• Performance improvements
• High resolution grids, high Reynolds numbers


Conclusion

• High performance and efficiency of GPUs in complex 3D fluid simulation
• CUDA is an easy-to-use tool for GPU compute programming
• GPUs enable new possibilities for research


Questions?

Thank you!

• Keywords: ADI, Tridiagonal Solvers, DNS, Turbulence
• E-mail: nsakharnykh@nvidia.com


References

• Paskonov V.M., Berezin S.B., Korukhova E.S. (2007), “A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids”, Moscow University Computational Mathematics and Cybernetics
• Douglas Jr., Jim (1962), “Alternating direction methods for three space variables”, Numerische Mathematik 4: 41–63


Dissipative Function

Φ = 2 [ (∂u/∂x)² + (∂v/∂y)² + (∂w/∂z)² ]
    + (∂u/∂y + ∂v/∂x)² + (∂v/∂z + ∂w/∂y)² + (∂w/∂x + ∂u/∂z)²
