Tridiagonal Solvers on the GPU and Applications to Fluid Simulation
Tridiagonal Solvers on the GPU and Applications to Fluid Simulation
Tridiagonal Solvers on the GPU and Applications to Fluid Simulation
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<str<strong>on</strong>g>Tridiag<strong>on</strong>al</str<strong>on</strong>g> <str<strong>on</strong>g>Solvers</str<strong>on</strong>g> <strong>on</strong> <strong>the</strong> <strong>GPU</strong> <strong>and</strong><br />
Applicati<strong>on</strong>s <strong>to</strong> <strong>Fluid</strong> Simulati<strong>on</strong><br />
Nikolai Sakharnykh, NVIDIA<br />
nsakharnykh@nvidia.com
Agenda<br />
• Introducti<strong>on</strong> <strong>and</strong> Problem Statement<br />
• Governing Equati<strong>on</strong>s<br />
• ADI Numerical Method<br />
• <strong>GPU</strong> Implementati<strong>on</strong> <strong>and</strong> Optimizati<strong>on</strong>s<br />
• Results <strong>and</strong> Future Work
Introducti<strong>on</strong><br />
Direct Numerical<br />
Simulati<strong>on</strong><br />
• all scales of turbulence<br />
• expensive<br />
Turbulence<br />
simulati<strong>on</strong><br />
Large-Eddy<br />
Simulati<strong>on</strong><br />
• Research at Computer Science<br />
department of Moscow State University<br />
– Pask<strong>on</strong>ov V.M., Berezin S.B.<br />
Reynolds-Averaged<br />
Navier-S<strong>to</strong>kes
Problem Statement<br />
• Viscid incompressible fluid in 3D domain<br />
• Initial <strong>and</strong> boundary c<strong>on</strong>diti<strong>on</strong>s<br />
• Euler coordinates: velocity <strong>and</strong><br />
temperature
Definiti<strong>on</strong>s<br />
Density<br />
Velocity<br />
Temperature<br />
Pressure<br />
• State equati<strong>on</strong><br />
R<br />
p<br />
u<br />
T<br />
p<br />
RT<br />
– gas c<strong>on</strong>stant for air<br />
c<strong>on</strong>st<br />
( u,<br />
v,<br />
w)<br />
RT<br />
1
Governing Equati<strong>on</strong>s<br />
• C<strong>on</strong>tinuity equati<strong>on</strong><br />
div u<br />
0<br />
• Navier-S<strong>to</strong>kes equati<strong>on</strong>s<br />
– dimensi<strong>on</strong>less form<br />
u 1 2<br />
t<br />
Re<br />
u<br />
u<br />
T<br />
Re<br />
– Reynolds number<br />
u
Reynolds number<br />
• Similarity parameter<br />
– <strong>the</strong> ratio of inertia forces <strong>to</strong> viscosity<br />
forces<br />
• 3D channel:<br />
Re<br />
V ' L'<br />
'<br />
• High Re – turbulent flow<br />
• Low Re – laminar flow<br />
V ' - mean velocity<br />
L' - length of pipe<br />
' - dynamic viscosity
Governing Equati<strong>on</strong>s<br />
• Energy equati<strong>on</strong><br />
– dimensi<strong>on</strong>less form<br />
T<br />
u<br />
t<br />
Pr<br />
T<br />
T<br />
1<br />
Pr Re<br />
– Pr<strong>and</strong>tl number<br />
– heat capacity ratio<br />
– dissipative functi<strong>on</strong><br />
T<br />
1<br />
Re
Numerical Method<br />
• Alternating Directi<strong>on</strong> Implicit (ADI)<br />
u<br />
t<br />
2<br />
u<br />
2<br />
x<br />
u<br />
t<br />
2<br />
u<br />
2<br />
x<br />
u<br />
t<br />
2<br />
u<br />
2<br />
y<br />
2<br />
u<br />
2<br />
y<br />
2<br />
u<br />
2<br />
z<br />
X Y Z<br />
u<br />
t<br />
2<br />
u<br />
2<br />
z
ADI – Heat C<strong>on</strong>ducti<strong>on</strong><br />
• 3 fracti<strong>on</strong>al steps – X, Y, Z<br />
• Implicit finite-difference scheme<br />
2<br />
3<br />
/<br />
1<br />
,<br />
,<br />
1<br />
3<br />
/<br />
1<br />
,<br />
,<br />
3<br />
/<br />
1<br />
,<br />
,<br />
1<br />
,<br />
,<br />
3<br />
/<br />
1<br />
,<br />
,<br />
2<br />
x<br />
u<br />
u<br />
u<br />
t<br />
u<br />
u<br />
n<br />
k<br />
j<br />
i<br />
n<br />
k<br />
j<br />
i<br />
n<br />
k<br />
j<br />
i<br />
n<br />
k<br />
j<br />
i<br />
n<br />
k<br />
j<br />
i<br />
2<br />
2<br />
x<br />
u<br />
t<br />
u<br />
n<br />
k<br />
j<br />
nx<br />
n<br />
k<br />
j<br />
i<br />
n<br />
k<br />
j<br />
n<br />
k<br />
j<br />
nx<br />
n<br />
k<br />
j<br />
i<br />
n<br />
k<br />
j<br />
u<br />
u<br />
u<br />
t<br />
x<br />
u<br />
u<br />
u<br />
q<br />
q<br />
q<br />
q<br />
,<br />
,<br />
,<br />
,<br />
,<br />
,<br />
1<br />
2<br />
3<br />
/<br />
1<br />
,<br />
,<br />
3<br />
/<br />
1<br />
,<br />
,<br />
3<br />
/<br />
1<br />
,<br />
,<br />
1<br />
1<br />
0<br />
1<br />
1<br />
1<br />
1<br />
1<br />
1<br />
0<br />
1<br />
�<br />
�<br />
�<br />
�<br />
�<br />
t<br />
x<br />
q<br />
2<br />
2
ADI – Navier-S<strong>to</strong>kes<br />
• Equati<strong>on</strong> for X velocity<br />
u<br />
t<br />
u<br />
u<br />
x<br />
Z<br />
v<br />
Y<br />
X<br />
u<br />
y<br />
w<br />
u<br />
t<br />
u<br />
t<br />
u<br />
t<br />
u<br />
z<br />
u<br />
v<br />
w<br />
u<br />
x<br />
– need iterati<strong>on</strong>s for n<strong>on</strong>-linear PDEs<br />
u<br />
y<br />
u<br />
z<br />
T<br />
x<br />
1<br />
Re<br />
T<br />
x<br />
1<br />
Re<br />
1<br />
Re<br />
1<br />
Re<br />
2<br />
u<br />
2<br />
y<br />
2<br />
u<br />
2<br />
z<br />
2<br />
u<br />
2<br />
x<br />
2<br />
u<br />
2<br />
x<br />
2<br />
u<br />
2<br />
y<br />
2<br />
u<br />
2<br />
z
ADI – Time Step<br />
(n-1) time step<br />
Splitting by<br />
X<br />
Splitting by<br />
Y<br />
(n) time step<br />
Splitting by<br />
Z<br />
Global iterati<strong>on</strong>s<br />
Updating<br />
n<strong>on</strong>-linear<br />
parameters<br />
(n+1) time step
ADI – Fracti<strong>on</strong>al Time Step<br />
• Linear PDEs<br />
Previous layer<br />
N time<br />
layer<br />
u: x-velocity<br />
v: y-velocity<br />
w: z-velocity<br />
T: temperature<br />
Sweep<br />
Solves many<br />
tridiag<strong>on</strong>al<br />
systems<br />
independently<br />
N + 1 time<br />
layer<br />
Next layer
ADI – Fracti<strong>on</strong>al Time Step<br />
• N<strong>on</strong>-Linear PDEs<br />
Previous layer<br />
N time<br />
layer<br />
u: x-velocity<br />
v: y-velocity<br />
w: z-velocity<br />
T: temperature<br />
N + ½ time<br />
layer<br />
Sweep<br />
Solves many<br />
tridiag<strong>on</strong>al<br />
systems<br />
independently<br />
Update<br />
N + 1 time<br />
layer<br />
Local<br />
iterati<strong>on</strong>s<br />
Next layer
Main Stages of <strong>the</strong> Algorithm<br />
• Solve a lot of independent tridiag<strong>on</strong>al<br />
systems<br />
– Computati<strong>on</strong>ally intensive<br />
– Easy <strong>to</strong> parallelize<br />
• Subtasks:<br />
– Evaluate dissipati<strong>on</strong> term<br />
– Update n<strong>on</strong>-linear parameters
<str<strong>on</strong>g>Tridiag<strong>on</strong>al</str<strong>on</strong>g> <str<strong>on</strong>g>Solvers</str<strong>on</strong>g> Overview<br />
• Simplified Gauss eliminati<strong>on</strong><br />
– Also known as Thomas algorithm, Sweep<br />
– The fastest serial approach<br />
• Cyclic Reducti<strong>on</strong> methods<br />
– Attend Yao Zhang’s talk “Fast <str<strong>on</strong>g>Tridiag<strong>on</strong>al</str<strong>on</strong>g><br />
<str<strong>on</strong>g>Solvers</str<strong>on</strong>g>” afterwards!
Sweep algorithm<br />
• Memory requirements<br />
– <strong>on</strong>e additi<strong>on</strong>al array of size N<br />
• Forward eliminati<strong>on</strong> step<br />
• Backward substituti<strong>on</strong> step<br />
• Complexity: O(N)
<strong>GPU</strong> Implementati<strong>on</strong><br />
• All data arrays are s<strong>to</strong>red <strong>on</strong> <strong>GPU</strong><br />
• Several 3D time-layers<br />
– overall 1GB for 192x192x192 grid in DP<br />
• Main kernels<br />
– Sweep<br />
– Dissipative functi<strong>on</strong> evaluati<strong>on</strong><br />
– N<strong>on</strong>-linear update
Sweep <strong>on</strong> <strong>the</strong> <strong>GPU</strong><br />
• One thread solves <strong>on</strong>e system<br />
– N^2 systems <strong>on</strong> each fracti<strong>on</strong>al step<br />
Splitting by X Splitting by Y Splitting by Z<br />
• Each thread operates with 1D slice in<br />
corresp<strong>on</strong>ding directi<strong>on</strong>
Sweep – performance<br />
time steps/sec<br />
9<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
Sweep X Sweep Y Sweep Z<br />
• X splitting is much slower than Y/Z <strong>on</strong>es<br />
float<br />
double<br />
NVIDIA Tesla C1060
Sweep – going in<strong>to</strong> details<br />
• Memory bound – need <strong>to</strong> optimize<br />
access <strong>to</strong> <strong>the</strong> memory<br />
Sweep X Sweep Y Sweep Z<br />
uncoalesced coalesced coalesced
Sweep – optimizati<strong>on</strong><br />
• Soluti<strong>on</strong> for X-splitting<br />
– Reorder data arrays <strong>and</strong> run Y-splitting<br />
– Need few additi<strong>on</strong>al 3D matrix transposes<br />
time steps/sec<br />
2.0<br />
1.8<br />
1.6<br />
1.4<br />
1.2<br />
1.0<br />
0.8<br />
0.6<br />
0.4<br />
0.2<br />
0.0<br />
2.5x<br />
1.7x<br />
float double<br />
original<br />
optimized
Code analysis<br />
• <strong>GPU</strong> versi<strong>on</strong> is based <strong>on</strong> <strong>the</strong> CPU code<br />
// boundary c<strong>on</strong>diti<strong>on</strong>s<br />
switch (dir)<br />
{<br />
case X: case X_as_Y: bc_x0(…); break;<br />
case Y: bc_y0(…); break;<br />
case Z: bc_z0(…); break;<br />
}<br />
a[1] = - c1 / c2;<br />
u_next[base_idx] = f_i / c2;<br />
// forward trace of sweep<br />
int idx = base_idx;<br />
int idx_prev;<br />
for (int k = 1; k < n; k++)<br />
{<br />
idx_prev = idx;<br />
idx += p.stride;<br />
}<br />
double c = v_temp[idx];<br />
c1 = p.m_c13 * c - p.h;<br />
c2 = p.m_c2;<br />
c3 = - p.m_c13 * c - p.h;<br />
double q = (c3 * a[k] + c2);<br />
double t = 1 / q;<br />
a[k+1] = - c1 * t;<br />
u_next[idx] = (f[idx] - c3 * u_next[idx_prev]) * t;
Performance Comparis<strong>on</strong><br />
• Test data<br />
– Grid size of 128/192<br />
– 8 n<strong>on</strong>-linear iterati<strong>on</strong>s (2 inner x 4 outer)<br />
• Hardware<br />
– NVIDIA Tesla C1060<br />
– Intel Core2 Quad (4 threads)<br />
– Intel Core i7 Nehalem (8 threads)
Performance – 128 - float<br />
time steps/sec<br />
30<br />
27<br />
24<br />
21<br />
18<br />
15<br />
12<br />
9<br />
6<br />
3<br />
0<br />
20x<br />
7x<br />
7x<br />
9x<br />
Dissipati<strong>on</strong> Sweep N<strong>on</strong>Linear Total<br />
NVIDIA Tesla C1060<br />
Intel Core i7 Nehalem 2.93GHz<br />
(4 cores)<br />
Intel Core 2 Quad 2.4GHz (4<br />
cores)
Performance – 128 - double<br />
time steps/sec<br />
14<br />
13<br />
12<br />
11<br />
10<br />
9<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
10x<br />
4x<br />
4x<br />
5x<br />
Dissipati<strong>on</strong> Sweep N<strong>on</strong>Linear Total<br />
NVIDIA Tesla C1060<br />
Intel Core i7 Nehalem 2.93GHz<br />
(4 cores)<br />
Intel Core 2 Quad 2.4GHz (4<br />
cores)
Performance – 192 - float<br />
time steps/sec<br />
9<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
28x<br />
8x<br />
8x<br />
11x<br />
Dissipati<strong>on</strong> Sweep N<strong>on</strong>Linear Total<br />
NVIDIA Tesla C1060<br />
Intel Core i7 Nehalem 2.93GHz<br />
(4 cores)<br />
Intel Core 2 Quad 2.4GHz (4<br />
cores)
Performance – 192 - double<br />
time steps/sec<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
13x<br />
4x<br />
5x<br />
5x<br />
Dissipati<strong>on</strong> Sweep N<strong>on</strong>Linear Total<br />
NVIDIA Tesla C1060<br />
Intel Core i7 Nehalem 2.93GHz<br />
(4 cores)<br />
Intel Core 2 Quad 2.4GHz (4<br />
cores)
<strong>GPU</strong> performance – SP/DP<br />
time steps/sec<br />
9<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
0<br />
dissipati<strong>on</strong> sweep n<strong>on</strong>linear <strong>to</strong>tal<br />
• In double precisi<strong>on</strong> <strong>GPU</strong> is <strong>on</strong>ly 2x<br />
slower than in single precisi<strong>on</strong><br />
float<br />
double
Visual results<br />
• Boundary c<strong>on</strong>diti<strong>on</strong>s<br />
u=1<br />
no-slip<br />
– C<strong>on</strong>stant flow at start:<br />
– No-slip <strong>on</strong> sides:<br />
– Free at far end:<br />
u<br />
u<br />
2<br />
u<br />
2<br />
x<br />
free<br />
1, v w<br />
v<br />
2<br />
v<br />
2<br />
x<br />
w<br />
0<br />
0<br />
2<br />
w<br />
2<br />
x<br />
0
Visual Results – X-slice<br />
u v<br />
w<br />
x = 0,9<br />
t = 6<br />
T<br />
Re = 1000
Future Work<br />
• Effective multi-<strong>GPU</strong> usage<br />
– Distributed memory systems<br />
• Performance improvements<br />
• High resoluti<strong>on</strong> grids, high Reynolds<br />
numbers
C<strong>on</strong>clusi<strong>on</strong><br />
• High performance <strong>and</strong> efficiency of<br />
<strong>GPU</strong>s in complex 3D fluid simulati<strong>on</strong><br />
• CUDA is an easy-<strong>to</strong>-use <strong>to</strong>ol for <strong>GPU</strong><br />
compute programming<br />
• <strong>GPU</strong> enables new possibilities for<br />
researching
Questi<strong>on</strong>s?<br />
• Keywords:<br />
Thank you!<br />
– ADI, <str<strong>on</strong>g>Tridiag<strong>on</strong>al</str<strong>on</strong>g> <str<strong>on</strong>g>Solvers</str<strong>on</strong>g>, DNS, Turbulence<br />
• E-mail: nsakharnykh@nvidia.com
References<br />
• Pask<strong>on</strong>ov V.M., Berezin S.B., Korukhova E.S. (2007) “A dynamic<br />
visualizati<strong>on</strong> system for multiprocessor computers with comm<strong>on</strong><br />
memory <strong>and</strong> its applicati<strong>on</strong> for numerical modeling of <strong>the</strong><br />
turbulent flows of viscous fluids”, Moscow University<br />
Computati<strong>on</strong>al Ma<strong>the</strong>matics <strong>and</strong> Cybernetics<br />
• ADI method - Douglas Jr., Jim (1962), "Alternating directi<strong>on</strong><br />
methods for three space variables", Numerische Ma<strong>the</strong>matik 4:<br />
41–63
Dissipative Functi<strong>on</strong><br />
z<br />
v<br />
y<br />
w<br />
z<br />
u<br />
x<br />
w<br />
z<br />
w<br />
z<br />
v<br />
z<br />
u<br />
y<br />
w<br />
z<br />
v<br />
y<br />
u<br />
x<br />
v<br />
y<br />
w<br />
y<br />
v<br />
y<br />
u<br />
x<br />
w<br />
z<br />
u<br />
x<br />
v<br />
y<br />
u<br />
x<br />
w<br />
x<br />
v<br />
x<br />
u<br />
z<br />
y<br />
x<br />
z<br />
y<br />
x<br />
2<br />
2<br />
2<br />
2<br />
2<br />
2<br />
2<br />
2<br />
2<br />
2<br />
2<br />
2