Implementing Finite Volume algorithms on GPUs
Philip Blakely & Sean Lovett
Laboratory for Scientific Computing, University of Cambridge
14th December 2010
Thanks to: Nikos Nikiforakis (Supervisor)
NVIDIA for supplying Tesla and Fermi GPUs

Applications of Finite Volume methods
Finite volume methods used in a wide range of areas:
Semiconductors
Mining - explosives and airblasts
Defence - detonations
Climate modelling
Relativity
Need fast and accurate simulations for operational use.

Shock-bubble simulation
shockBubble.mpg

Finite Volume Methods
Finite Volume methods store cell-averages of conserved quantities: density, momentum, energy
FV methods maintain conservation of these quantities
Use rectangular structured grids for efficiency
Discretise with cell-centred averages u_{i,j}
General update formula:
u^{n+1}_{i,j} = u^n_{i,j} + ∆t (F_{i−1/2,j} − F_{i+1/2,j})
Need to calculate appropriate fluxes F_{i+1/2,j} for each cell boundary
Each flux calculation is independent of all others (a per-cell update sketch follows)
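As a minimal sketch (not the authors' code), the update formula maps naturally onto one thread per cell; the kernel below assumes a hypothetical array faceFlux holding F_{i−1/2,j} at the same index as cell (i, j), with row-major storage.

// One thread per cell applies u^{n+1} = u^n + dt*(F_{i-1/2,j} - F_{i+1/2,j}).
// `faceFlux` is a hypothetical array of x-face fluxes, indexed like the cells.
__global__ void conservativeUpdate(const float* u, const float* faceFlux,
                                   float* uNew, float dt, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= nx - 1 || j >= ny) return;   // keep outer cells fixed

    int idx = j * nx + i;
    uNew[idx] = u[idx] + dt * (faceFlux[idx] - faceFlux[idx + 1]);
}
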
Euler's equations
Flux-conservative form:
∂u/∂t + ∇ · F(u) = 0
Euler's equations:
∂ρ/∂t + ∇ · (ρv) = 0,
∂(ρv)/∂t + ∇ · (ρ v ⊗ v + p I) = 0,
∂E/∂t + ∇ · ((E + p) v) = 0

Demonstration problem
Solve Euler's equations in 2D, using a finite-volume approach, with slope-limited reconstruction and an exact Riemann solver (MUSCL-Hancock)
Use single precision only
No boundary-condition enforcement - just keep the outer two cells of the grid fixed
Time-step calculation - basic reduction to find the maximum wave-speed across all cells (a reduction sketch follows this list)
Using a single Tesla C2050 (Fermi) card provided by NVIDIA
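A minimal sketch of such a reduction, assuming the primitive variables are stored in separate arrays as in the Struct-of-Arrays layout described below, and an ideal gas with γ = 1.4; each block writes its own maximum and the per-block maxima are combined afterwards (names and structure are illustrative, not the authors' code).

// Block-level maximum of |v| + c; launch with blockDim.x a power of two and
// shared-memory size blockDim.x * sizeof(float).
__global__ void maxWaveSpeed(const float* rho, const float* vx,
                             const float* vy, const float* p,
                             float* blockMax, int nCells)
{
    extern __shared__ float smax[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    float s = 0.0f;
    if (gid < nCells) {
        float c = sqrtf(1.4f * p[gid] / rho[gid]);             // sound speed (gamma = 1.4 assumed)
        s = sqrtf(vx[gid] * vx[gid] + vy[gid] * vy[gid]) + c;  // |v| + c
    }
    smax[tid] = s;
    __syncthreads();

    // Standard tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            smax[tid] = fmaxf(smax[tid], smax[tid + stride]);
        __syncthreads();
    }
    if (tid == 0)
        blockMax[blockIdx.x] = smax[0];
}
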
Data layout in memory
We use a Struct-of-Arrays approach and store:
ρ_{1,1}, ρ_{2,1}, ρ_{3,1}, ..., ρ_{N,N}, v^x_{1,1}, v^x_{2,1}, ..., v^y_{N,N}, ...
so that accessing
u_{i,j} = (ρ_{i,j}, v^x_{i,j}, v^y_{i,j}, p_{i,j})
from each thread (i, j) results in coalesced access from global memory
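An indexing sketch of this layout (the Grid name appears later in the slides; its members and the helper below are assumptions):

#define NUM_VARS 4   // rho, vx, vy, p

struct Grid {
    float* data;   // NUM_VARS contiguous planes of N*N floats each
    int    N;      // grid dimension
};

// Index of variable `var` in cell (i, j) under the Struct-of-Arrays layout.
__host__ __device__ inline int soaIndex(const Grid& g, int var, int i, int j)
{
    return var * g.N * g.N + j * g.N + i;
}

__global__ void readCell(Grid g, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= g.N || j >= g.N) return;

    // Adjacent threads (consecutive i) touch consecutive addresses for each
    // variable, so each of these loads coalesces into few transactions.
    float rho = g.data[soaIndex(g, 0, i, j)];
    float vx  = g.data[soaIndex(g, 1, i, j)];
    float vy  = g.data[soaIndex(g, 2, i, j)];
    float p   = g.data[soaIndex(g, 3, i, j)];
    out[j * g.N + i] = rho + vx + vy + p;  // placeholder use of the values
}
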
MUSCL-Hancock scheme
Piecewise-constant data
Piecewise-linear reconstruction
Slope-limit
Advance by half time-step
Solve Riemann Problem at interface to find flux
(a per-variable sketch of the reconstruction and limiting steps follows)
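A minimal sketch of the reconstruction and limiting steps for a single variable in one direction (the slides do not name a limiter; minmod is assumed here, and the half-step predictor and Riemann solver are omitted):

// Minmod limiter: returns zero at extrema, otherwise the smaller slope.
__device__ inline float minmod(float a, float b)
{
    if (a * b <= 0.0f) return 0.0f;
    return (fabsf(a) < fabsf(b)) ? a : b;
}

// Given cell averages uL = u_{i-1}, uC = u_i, uR = u_{i+1}, compute limited
// boundary-extrapolated values at the left and right faces of cell i.
__device__ inline void reconstruct(float uL, float uC, float uR,
                                   float* faceL, float* faceR)
{
    float slope = minmod(uC - uL, uR - uC);
    *faceL = uC - 0.5f * slope;
    *faceR = uC + 0.5f * slope;
}
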
Loading global data into shared memory
For each cell, need data from three cells to compute slope-limited values.
Load four cell-centred solution vectors into shared memory
Load left data into shared memory
Load right data into shared memory
Now have three vectors per thread in shared memory
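A sketch of the load step under these assumptions (one-dimensional for brevity; the names, the SoA stride argument and the slot layout are illustrative, not the authors' kernel):

#define NUM_VARS 4

// Each thread copies the solution vectors of its own cell and its left and
// right neighbours into its own shared-memory slots; a fourth slot is kept
// as scratch.  Blocks are launched with overlap (see the thread-block layout
// below) because edge threads cannot complete the later flux steps, which
// read reconstructed values belonging to neighbouring threads.
template<int BLOCK_X>
__global__ void loadThreeVectors(const float* u, int nx, int varStride)
{
    __shared__ float s[4][NUM_VARS][BLOCK_X];   // left, centre, right, scratch

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    if (i >= 1 && i < nx - 1) {
        for (int v = 0; v < NUM_VARS; ++v) {
            s[0][v][tid] = u[v * varStride + i - 1];  // left neighbour
            s[1][v][tid] = u[v * varStride + i];      // own cell
            s[2][v][tid] = u[v * varStride + i + 1];  // right neighbour
        }
    }
    __syncthreads();
    // Reconstruction, limiting, half-step and the Riemann solve follow,
    // working entirely out of shared memory.
}
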
Within shared memory

               Thread i−1                          Thread i                      Thread i+1
Initial        u_{i−2}, u_{i−1}, u_i               u_{i−1}, u_i, u_{i+1}         u_i, u_{i+1}, u_{i+2}
Reconstruct    u^L_{i−1}, u_{i−1}, u^R_{i−1}       u^L_i, u_i, u^R_i             u^L_{i+1}, u_{i+1}, u^R_{i+1}
Slope limit    u^L*_{i−1}, u_{i−1}, u^R*_{i−1}     u^L*_i, u_i, u^R*_i           u^L*_{i+1}, u_{i+1}, u^R*_{i+1}
Advance ½∆t    u^L_{i−1}, u_{i−1}, u^R_{i−1}       u^L_i, u_i, u^R_i             u^L_{i+1}, u_{i+1}, u^R_{i+1}
__syncthreads()
Solve R.P.     f(u^R_{i−2}, u^L_{i−1})             f(u^R_{i−1}, u^L_i)           f(u^R_i, u^L_{i+1})
__syncthreads()
Final Flux     ∆t × (f_{i−2} − f_{i−1})            ∆t × (f_{i−1} − f_i)          ∆t × (f_i − f_{i+1})

Functions such as fluxInPlace(u) mean that we only need 4 solution vectors per thread.
Overall shared memory: 4 * NUM_VARS * blockDim.y * blockDim.x
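A sketch of what an in-place flux routine could look like (the name fluxInPlace comes from the slides; the body, the conserved-variable ordering and γ = 1.4 are assumptions). Overwriting a solution vector with its flux avoids a fifth temporary vector per thread, which is what keeps shared-memory use at 4 vectors per thread:

// Replace the conserved vector (rho, rho*vx, rho*vy, E) with its x-flux.
__device__ void fluxInPlace(float* u)
{
    float rho = u[0];
    float vx  = u[1] / rho;
    float vy  = u[2] / rho;
    float E   = u[3];
    float p   = 0.4f * (E - 0.5f * rho * (vx * vx + vy * vy));  // ideal gas, gamma = 1.4

    u[0] = rho * vx;               // mass flux
    u[1] = rho * vx * vx + p;      // x-momentum flux
    u[2] = rho * vx * vy;          // y-momentum flux
    u[3] = (E + p) * vx;           // energy flux
}
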
Thread-block layout
Each cell depends on data from adjacent cells
Need overlapping thread-blocks to hold sufficient data
For calculating fluxes in the x-direction (6 × 6 block), consecutive blocks overlap by a few cells
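A sketch of the corresponding launch geometry (assumed, not from the slides): because only the interior threads of a block can produce fluxes, consecutive blocks are offset by less than the block width. The two-cell overlap per side used here is an assumption.

#include <cuda_runtime.h>

const int HALO = 2;   // assumed cells of overlap on each side of a block

// Number of blocks needed so that every cell's flux is produced by the
// interior threads of some block (x-sweep; no overlap needed in y).
dim3 fluxGrid(int nx, int ny, dim3 block)
{
    int usefulX = block.x - 2 * HALO;            // threads that produce output
    int gx = (nx + usefulX - 1) / usefulX;
    int gy = (ny + block.y - 1) / block.y;
    return dim3(gx, gy);
}
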
How fast?
We should now check how fast our implementation goes
CPU: Optimized serial code on i7 @ 2.80GHz
GPU: Tesla C2050 (Fermi), 448 cores @ 1.15GHz (using 48KB shared memory)
Benchmark with:
1000 × 1000 cells
2D Riemann problem initial data
CPU time: 902.53s
GPU time: 9.6s (94.0×)

Template over blockDim
Initially allocated shared memory dynamically
Drawbacks:
Offsets only calculated at run-time
Overflow of shared memory only found at run-time
So template all functions over blockDim.x and blockDim.y:

template<int blockDim_x, int blockDim_y>
__global__ void getSLICflux_GPU(Grid u, Grid flux, float dt, int coord)
{
    __shared__ float temp[4][NUM_VARS][blockDim_y][blockDim_x];
    // ... flux calculation ...
}

Compiler can pre-compute memory offsets
Use of excess shared memory can be detected at compile-time
Drawback: must determine block-size at compile-time (a launch-dispatch sketch follows)
Time now: 8.23s (109.7×)
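One common way to reconcile a compile-time block size with a run-time choice (a sketch under assumptions; the dispatch functions are not from the slides) is to instantiate the templated kernel for the block sizes of interest and select among them at launch:

// Launch the templated kernel above for a block size known at compile time.
template<int BX, int BY>
void launchFlux(Grid u, Grid flux, float dt, int coord, dim3 grid)
{
    getSLICflux_GPU<BX, BY><<<grid, dim3(BX, BY)>>>(u, flux, dt, coord);
}

// Run-time dispatch over the handful of block sizes actually benchmarked.
void launchFluxFor(int bx, int by, Grid u, Grid flux, float dt, int coord, dim3 grid)
{
    if      (bx == 16 && by == 8)  launchFlux<16, 8>(u, flux, dt, coord, grid);
    else if (bx == 32 && by == 4)  launchFlux<32, 4>(u, flux, dt, coord, grid);
    else if (bx == 8  && by == 32) launchFlux<8, 32>(u, flux, dt, coord, grid);
    // further instantiations as required
}
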
Templating over coordinate direction
We call getFluxes() for the x-coordinate, and then the y-coordinate
Some functions, such as the flux calculation, depend on the coordinate
For example: f_x(ρ) = ρ v_x, f_y(ρ) = ρ v_y
Non-divergent branching, but could be determined at compile-time if we pass coord as a template parameter:

template<...>
__device__ void getFlux(const float* cons, float* F, int coord)

changes to:

template<..., int coord>
__device__ void getFlux(const float* cons, float* F)

Time now: 7.73s (116.8×)
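A sketch of what the coordinate-templated flux could look like (assumed, not the authors' code; the template parameter list is reduced to coord alone, and the conserved-variable ordering, the X_COORD/Y_COORD values and γ = 1.4 are assumptions). With coord fixed at compile time, the selections below are resolved by the compiler and no run-time branch remains:

#define X_COORD 0   // assumed values for the coordinate tags
#define Y_COORD 1

template<int coord>
__device__ void getFlux(const float* cons, float* F)
{
    float rho = cons[0];
    float vx  = cons[1] / rho;
    float vy  = cons[2] / rho;
    float E   = cons[3];
    float p   = 0.4f * (E - 0.5f * rho * (vx * vx + vy * vy));  // gamma = 1.4
    float vn  = (coord == X_COORD) ? vx : vy;   // normal velocity, compile-time choice

    F[0] = rho * vn;                                        // mass flux
    F[1] = rho * vn * vx + ((coord == X_COORD) ? p : 0.0f); // x-momentum flux
    F[2] = rho * vn * vy + ((coord == Y_COORD) ? p : 0.0f); // y-momentum flux
    F[3] = (E + p) * vn;                                    // energy flux
}
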
Avoid flux-update kernel
Initially, performed the update as:

calcFlux(u, flux, X_COORD)        flux = f_x(u)
addFlux(u, flux)                  u = u + f_x(u)
calcFlux(u, flux, Y_COORD)        flux = f_y(u)
addFlux(u, flux)                  u = u + f_y(u)

Instead, we can calculate the updated solution directly in an extra array:

advanceSoln(u, u_plus, X_COORD)   u_plus = u + f_x(u)
advanceSoln(u_plus, u, Y_COORD)   u = u_plus + f_y(u_plus)

Time now: 7.13s (126.6×)
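The gain comes from fusing the flux calculation and the update, so the intermediate flux array never passes through global memory. A minimal sketch of the idea on a deliberately simplified problem (1D linear advection with first-order upwind fluxes, not the authors' Euler kernel):

__global__ void advanceSoln1D(const float* u, float* u_plus,
                              float a, float dtdx, int nx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= nx - 1) return;

    // Upwind fluxes at the two faces of cell i (advection speed a > 0 assumed).
    float fL = a * u[i - 1];
    float fR = a * u[i];

    // Fused update: the flux lives only in registers, and the advanced
    // solution is written straight into the spare array.
    u_plus[i] = u[i] + dtdx * (fL - fR);
}
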
What block-size should we use? (x-direction)
When performing the flux calculation, we want to reduce the number of overlap cells relative to the block size

Block size    Overall time
16 × 8        7.13s
32 × 4        6.93s
64 × 2        6.95s
32 × 8        7.02s
64 × 4        6.98s

The first four lines can have 4 thread blocks per multiprocessor (register limitation)
The last two lines are limited to 3 thread blocks per multiprocessor.

What block-size should we use? (y-direction)
In the y-direction, different effects come into play
For global-memory access coalescence, we want to read several adjacent cells in the x-direction.
So 4 × 32 is not the obvious answer

Block size    Overall time
16 × 8        6.93s
32 × 4        8.41s
4 × 32        6.54s
8 × 16        6.51s
8 × 32        6.45s

Performance on different sized grids
Attain reasonable speed-up even on smaller grids

Speed-up demonstration
pipeDemo.mpg

Summary
Achieved 147× speed-up for the Finite Volume solver
For optimal performance, had to use templates and tuning of block-size.
Future GPU-capable compilers/languages and APIs may be able to tune for this
Will need significant effort in optimization technology
Alternatively, develop C++ expression-template-like structures to do this automatically.

Implications for industry
Many companies need accurate fluid-dynamics solvers
Many would also benefit from faster computations
Work-flow of end-users changed dramatically:
Can adjust model parameters and get feedback much more quickly
Potential to interact with an evolving simulation, opening up new development strategies

Future investigations
Efficient extension to three dimensions
Investigation of this approach for larger systems of equations (limitations on shared memory)
More in-depth profiling on various compute-capability cards
Extension to using adaptive mesh refinement