# Tridiagonal Solvers on the GPU and Applications to Fluid Simulation


Nikolai Sakharnykh, NVIDIA

nsakharnykh@nvidia.com

Agenda

• Introduction and Problem Statement
• Governing Equations
• GPU Implementation and Optimizations
• Results and Future Work

Introduction

• Turbulence simulation approaches:
  – Direct Numerical Simulation (resolves all scales of turbulence, expensive)
  – Large-Eddy Simulation
  – Reynolds-Averaged Navier-Stokes
• Research at the Computer Science department of Moscow State University

Problem Statement

• Viscous incompressible fluid in a 3D domain
• Initial and boundary conditions
• Euler coordinates: velocity and temperature

Definitions

• Density ρ = const
• Velocity u = (u, v, w)
• Temperature T
• Pressure p
• State equation: p = ρRT
  – R – gas constant for air

Governing Equations

• Continuity equation

  div u = 0

• Navier-Stokes equations
  – dimensionless form:

  ∂u/∂t + (u·∇)u = −∇p + (1/Re) ∇²u

  – Re – Reynolds number

Reynolds number

• Similarity parameter: the ratio of inertial forces to viscous forces
• 3D channel:

  Re = V′L′ / μ′

  – V′ – mean velocity
  – L′ – length of pipe
  – μ′ – dynamic viscosity
• High Re – turbulent flow
• Low Re – laminar flow

Governing Equations

• Energy equation
  – dimensionless form:

  ∂T/∂t + (u·∇)T = (1/(Pr·Re)) ∇²T + ((γ − 1)/Re) Φ

  – Pr – Prandtl number
  – γ – heat capacity ratio
  – Φ – dissipative function

Numerical Method

• ADI splitting of the diffusion operator

  ∂u/∂t = ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²

  into three 1D problems:

  X: ∂u/∂t = ∂²u/∂x²
  Y: ∂u/∂t = ∂²u/∂y²
  Z: ∂u/∂t = ∂²u/∂z²

• 3 fractional steps – X, Y, Z
• Implicit finite-difference scheme

• X fractional step (for each j, k):

  (u_{i,j,k}^{n+1/3} − u_{i,j,k}^{n}) / Δt = (u_{i−1,j,k}^{n+1/3} − 2 u_{i,j,k}^{n+1/3} + u_{i+1,j,k}^{n+1/3}) / Δx²,  i = 1, …, nx − 1

  which is a tridiagonal system along i:

  −q u_{i−1,j,k}^{n+1/3} + (1 + 2q) u_{i,j,k}^{n+1/3} − q u_{i+1,j,k}^{n+1/3} = u_{i,j,k}^{n},  q = Δt / Δx²

• Equation for X velocity, split over the three directions:

  X: ∂u/∂t + u ∂u/∂x = (1/Re) ∂²u/∂x²
  Y: ∂u/∂t + v ∂u/∂y = (1/Re) ∂²u/∂y²
  Z: ∂u/∂t + w ∂u/∂z = (1/Re) ∂²u/∂z²

  – need iterations for non-linear PDEs

• Time step structure (global iterations):

  (n−1) time step → Splitting by X → Splitting by Y → Splitting by Z → (n) time step → updating non-linear parameters → (n+1) time step

• Linear PDEs

  Previous layer → N time layer (u: x-velocity, v: y-velocity, w: z-velocity, T: temperature) → Sweep (solves many tridiagonal systems independently) → N + 1 time layer → Next layer

• Non-Linear PDEs

  Previous layer → N time layer (u, v, w, T) → Sweep (solves many tridiagonal systems independently) → N + ½ time layer → Update (local iterations back to Sweep) → N + 1 time layer → Next layer

Main Stages of the Algorithm

• Solve a lot of independent tridiagonal systems
  – Computationally intensive
  – Easy to parallelize
• Evaluate dissipation term
• Update non-linear parameters

Tridiagonal Solvers Overview

• Simplified Gauss elimination
  – Also known as Thomas algorithm, Sweep
  – The fastest serial approach
• Cyclic Reduction methods
  – Attend Yao Zhang's talk "Fast Tridiagonal Solvers" afterwards!

Sweep algorithm

• Memory requirements: one additional array of size N
• Forward elimination step
• Backward substitution step
• Complexity: O(N)

GPU Implementation

• All data arrays are stored on the GPU
• Several 3D time layers
  – overall 1 GB for a 192×192×192 grid in DP
• Main kernels
  – Sweep
  – Dissipative function evaluation
  – Non-linear update

Sweep on the GPU

• One thread solves one system
  – N² systems on each fractional step (Splitting by X / Y / Z)
• Each thread operates on a 1D slice in the corresponding direction

Sweep – performance

[Chart: time steps/sec for Sweep X, Sweep Y, Sweep Z, in float and double, on NVIDIA Tesla C1060]

• X splitting is much slower than the Y/Z ones

Sweep – going into details

• Memory bound – need to optimize memory access

  Sweep X: uncoalesced | Sweep Y: coalesced | Sweep Z: coalesced

Sweep – optimization

• Solution for X-splitting
  – Reorder data arrays and run Y-splitting instead
  – Needs a few additional 3D matrix transposes

[Chart: time steps/sec, original vs. optimized – 2.5x speedup in float, 1.7x in double]

Code analysis

GPU version is based on the CPU code:

    // boundary conditions
    switch (dir)
    {
    case X: case X_as_Y: bc_x0(…); break;
    case Y: bc_y0(…); break;
    case Z: bc_z0(…); break;
    }
    a[1] = -c1 / c2;
    u_next[base_idx] = f_i / c2;

    // forward trace of sweep
    int idx = base_idx;
    int idx_prev;
    for (int k = 1; k < n; k++)
    {
        idx_prev = idx;
        idx += p.stride;

        double c = v_temp[idx];
        c1 = p.m_c13 * c - p.h;
        c2 = p.m_c2;
        c3 = -p.m_c13 * c - p.h;

        double q = c3 * a[k] + c2;
        double t = 1 / q;
        a[k+1] = -c1 * t;
        u_next[idx] = (f[idx] - c3 * u_next[idx_prev]) * t;
    }

Performance Comparison

• Test data
  – Grid sizes of 128 and 192
  – 8 non-linear iterations (2 inner × 4 outer)
• Hardware
  – NVIDIA Tesla C1060
  – Intel Core i7 Nehalem (8 threads)

Performance – 128 – float

[Chart: time steps/sec – NVIDIA Tesla C1060 vs. Intel Core i7 Nehalem 2.93GHz (4 cores) vs. Intel Core 2 Quad 2.4GHz (4 cores)]

• GPU speedups: Dissipation 20x, Sweep 7x, NonLinear 7x, Total 9x

Performance – 128 – double

[Chart: time steps/sec – NVIDIA Tesla C1060 vs. Intel Core i7 Nehalem 2.93GHz (4 cores) vs. Intel Core 2 Quad 2.4GHz (4 cores)]

• GPU speedups: Dissipation 10x, Sweep 4x, NonLinear 4x, Total 5x

Performance – 192 – float

[Chart: time steps/sec – NVIDIA Tesla C1060 vs. Intel Core i7 Nehalem 2.93GHz (4 cores) vs. Intel Core 2 Quad 2.4GHz (4 cores)]

• GPU speedups: Dissipation 28x, Sweep 8x, NonLinear 8x, Total 11x

Performance – 192 – double

[Chart: time steps/sec – NVIDIA Tesla C1060 vs. Intel Core i7 Nehalem 2.93GHz (4 cores) vs. Intel Core 2 Quad 2.4GHz (4 cores)]

• GPU speedups: Dissipation 13x, Sweep 4x, NonLinear 5x, Total 5x

GPU performance – SP/DP

[Chart: time steps/sec in float vs. double for dissipation, sweep, nonlinear, total]

• In double precision the GPU is only 2x slower than in single precision

Visual results

• Boundary conditions
  – Constant flow at start: u = 1, v = w = 0
  – No-slip on sides: u = v = w = 0
  – Free at far end: ∂²u/∂x² = ∂²v/∂x² = ∂²w/∂x² = 0

Visual Results – X-slice

[Slices of u, v, w, T at x = 0.9, t = 6, Re = 1000]

Future Work

• Effective multi-GPU usage
  – Distributed memory systems
• Performance improvements
• High-resolution grids, high Reynolds numbers

Conclusion

• High performance and efficiency of GPUs in complex 3D fluid simulation
• CUDA is an easy-to-use tool for GPU compute programming
• The GPU enables new possibilities for research

Questions?

Thank you!

• Keywords: ADI, Tridiagonal Solvers, DNS, Turbulence
• E-mail: nsakharnykh@nvidia.com

References

• Paskonov V.M., Berezin S.B., Korukhova E.S. (2007). "A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids". Moscow University Computational Mathematics and Cybernetics.
• Douglas Jr., Jim (1962). "Alternating direction methods for three space variables". Numerische Mathematik 4: 41–63. (ADI method)

Dissipative Function

• Φ in terms of the velocity gradients:

  Φ = 2[(∂u/∂x)² + (∂v/∂y)² + (∂w/∂z)²]
      + (∂u/∂y + ∂v/∂x)² + (∂v/∂z + ∂w/∂y)² + (∂u/∂z + ∂w/∂x)²
      − (2/3)(∂u/∂x + ∂v/∂y + ∂w/∂z)²
