
Implementing Finite Volume algorithms on GPUs

Philip Blakely & Sean Lovett

Laboratory for Scientific Computing, University of Cambridge

14th December 2010

Thanks to: Nikos Nikiforakis (Supervisor)

NVIDIA for supplying Tesla and Fermi GPUs




Applications of Finite Volume methods

Finite volume methods are used in a wide range of areas:

Semiconductors
Mining - explosives and airblasts
Defence - detonations
Climate modelling
Relativity

Need fast and accurate simulations for operational use.



Shock-bubble simulation

shockBubble.mpg



Finite Volume Methods

Finite Volume methods store cell-averages of conserved quantities:
density, momentum, energy

FV methods maintain conservation of these quantities

Use rectangular structured grids for efficiency

Discretise with cell-centred averages u_{i,j}

General update formula (a minimal kernel sketch follows this list):

u^{n+1}_{i,j} = u^n_{i,j} + Δt (F_{i-1/2,j} - F_{i+1/2,j})

Need to calculate appropriate fluxes F_{i+1/2,j} for each cell boundary

Each flux calculation is independent of all others
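
The update step maps naturally onto one thread per cell. The following is a minimal 1-D sketch, not the authors' code (array names and the handling of any 1/Δx factor are assumptions), showing how pre-computed face fluxes would be applied:

__global__ void applyFluxDifference(const float* u_old, const float* faceFlux,
                                    float* u_new, int N, float dt)
{
    // faceFlux[i] holds F_{i-1/2}; faceFlux[i+1] holds F_{i+1/2}
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    u_new[i] = u_old[i] + dt * (faceFlux[i] - faceFlux[i + 1]);
}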



Euler's equations

Flux-conservative form:

∂u/∂t + ∇·F(u) = 0

∂ρ/∂t + ∇·(ρv) = 0 ,
∂(ρv)/∂t + ∇·(ρ v vᵀ + p I) = 0 ,
∂E/∂t + ∇·((E + p) v) = 0



Demonstration problem

Solve Euler's equations in 2D, using a finite-volume approach, with
slope-limited reconstruction and an exact Riemann solver (MUSCL-Hancock)

Use single precision only

No boundary-condition enforcement - just keep the outer two cells of the grid fixed

Time-step calculation - basic reduction to find the maximum wave-speed
across all cells (see the sketch after this list)

Using a single Tesla C2050 (Fermi) card provided by NVIDIA
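
A minimal sketch of the kind of reduction meant here (not the authors' kernel): per-cell wave speeds are assumed pre-computed, the block size a power of two, and the kernel launched with blockDim.x * sizeof(float) bytes of dynamic shared memory. Each block writes one partial maximum; a second pass or the host combines them, and the CFL condition then gives Δt.

__global__ void maxWaveSpeed(const float* speed, float* blockMax, int N)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < N) ? speed[i] : 0.0f;
    __syncthreads();
    // Tree reduction within the block (blockDim.x assumed a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0) blockMax[blockIdx.x] = s[0];   // partial maximum for this block
}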



Data layout in memory

We use a Struct-of-Arrays approach and store:

ρ_{1,1}, ρ_{2,1}, ρ_{3,1}, ..., ρ_{N,N}, vx_{1,1}, vx_{2,1}, ..., vy_{N,N}, ...

so that accessing

u_{i,j} = (ρ_{i,j}, vx_{i,j}, vy_{i,j}, p_{i,j})

from each thread (i, j) results in coalesced access from global memory
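
For illustration, the indexing this layout implies might look as follows (a sketch; the index order and names are assumptions): each variable occupies its own contiguous N×N block, so a warp reading the same variable for consecutive i touches consecutive addresses and the loads coalesce.

#define NUM_VARS 4   // rho, vx, vy, p

// Struct-of-Arrays access: variable 'var' of cell (i, j) on an N x N grid.
__device__ float loadVar(const float* data, int var, int i, int j, int N)
{
    return data[var * N * N + j * N + i];   // i varies fastest across a warp
}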





MUSCL-Hancock scheme

Piecewise-constant data

Piecewise-linear reconstruction

Slope-limit

Advance by half time-step

Solve Riemann Problem at interface to find flux



Loading global data into shared memory

For each cell, need data from three cells to compute slope-limited values.

Load four cell-centred solution vectors into shared memory
Load left data into shared memory
Load right data into shared memory

Now have three vectors per thread in shared memory





Within shared memory

               Thread i-1                        Thread i                     Thread i+1
Initial        u_{i-2}, u_{i-1}, u_i             u_{i-1}, u_i, u_{i+1}        u_i, u_{i+1}, u_{i+2}
Reconstruct    u^L_{i-1}, u_{i-1}, u^R_{i-1}     u^L_i, u_i, u^R_i            u^L_{i+1}, u_{i+1}, u^R_{i+1}
Slope limit    u^L*_{i-1}, u_{i-1}, u^R*_{i-1}   u^L*_i, u_i, u^R*_i          u^L*_{i+1}, u_{i+1}, u^R*_{i+1}
Advance ½Δt    u^L_{i-1}, u_{i-1}, u^R_{i-1}     u^L_i, u_i, u^R_i            u^L_{i+1}, u_{i+1}, u^R_{i+1}
syncthreads()
Solve R.P.     f(u^R_{i-2}, u^L_{i-1})           f(u^R_{i-1}, u^L_i)          f(u^R_i, u^L_{i+1})
syncthreads()
Final Flux     Δt × (f_{i-2} - f_{i-1})          Δt × (f_{i-1} - f_i)         Δt × (f_i - f_{i+1})

Functions such as fluxInPlace(u) mean that we only need 4 solution vectors per thread.

Overall shared memory: 4 * NUM_VARS * blockDim.y * blockDim.x
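
Put together, the steps above suggest a kernel structured roughly as follows. This is a structural sketch only (the name and the elided per-step arithmetic are assumptions); the point is where the two syncthreads() barriers sit and that the whole workspace lives in shared memory:

__global__ void fluxKernelSketch(const float* u, float* flux, float dt)
{
    // 4 solution vectors per thread: 4 * NUM_VARS * blockDim.y * blockDim.x floats
    extern __shared__ float temp[];

    // 1. Load u_{i-1}, u_i, u_{i+1} for this thread from global memory (coalesced).
    // 2. Reconstruct and slope-limit to obtain u^L*, u^R*.
    // 3. Advance the boundary states by dt/2 (MUSCL-Hancock predictor).
    __syncthreads();   // neighbours' boundary states must be complete

    // 4. Solve the Riemann problem f(u^R_{i-1}, u^L_i) at this thread's face.
    __syncthreads();   // both fluxes adjacent to each cell must be ready

    // 5. Accumulate dt * (f_{i-1} - f_i) and write the result back to global memory.
}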





Thread-block layout

Each cell depends on data from adjacent cells

Need overlapping thread-blocks to hold sufficient data

For calculating fluxes in the x-direction, a 6 × 6 block is used (a sketch of the
overlapping indexing follows).
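
A minimal sketch of how overlapping blocks can be indexed (the halo width of two cells and the helper names are assumptions, not taken from the slides): each block covers blockDim.x cells including halo cells at each end, so consecutive blocks start blockDim.x - 2*HALO cells apart and only the interior threads write results.

#define HALO 2   // assumed: cells of overlap at each end of a block

// Global x-index of this thread's cell, for an array that includes ghost cells.
__device__ int globalCellX()
{
    return blockIdx.x * (blockDim.x - 2 * HALO) + threadIdx.x;
}

// Only non-halo threads write a flux result back to global memory.
__device__ bool isInteriorThread()
{
    return threadIdx.x >= HALO && threadIdx.x < blockDim.x - HALO;
}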





How fast?

We should now check how fast our implementation goes

CPU: Optimized serial code on i7 @ 2.80 GHz
GPU: Tesla C2050 (Fermi), 448 cores @ 1.15 GHz (using 48 KB shared memory)

Benchmark with:
1000 × 1000 cells
2D Riemann problem initial data

CPU time: 902.53s
GPU time: 9.6s (94.0×)



Template over blockDim

Initially allocated shared memory dynamically

Drawbacks:
Offsets only calculated at run-time
Overflow of shared memory only found at run-time

So template all functions over blockDim.x and blockDim.y:

template<int blockDim_x, int blockDim_y>
__global__ void getSLICflux_GPU(Grid u, Grid flux, float dt, int coord)
{
    __shared__ float temp[4][NUM_VARS][blockDim_y][blockDim_x];
}

Compiler can pre-compute memory offsets

Use of excess shared memory can be detected at compile-time

Drawback: must determine block-size at compile-time

Time now: 8.23s (109.7×)
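
For illustration, a hypothetical launch of the templated kernel (not the authors' launch code; the grid dimensions here ignore block overlap for brevity): the run-time block size must now match the compile-time template arguments exactly.

void launchFlux(Grid u, Grid flux, float dt, int coord, int N)
{
    dim3 block(16, 8);                 // must match the template arguments below
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    getSLICflux_GPU<16, 8><<<grid, block>>>(u, flux, dt, coord);
}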



Templating over coordinate direction

We call getFluxes() for the x-coordinate, and then the y-coordinate

Some functions, such as the flux calculation, depend on the coordinate

For example: f_x(ρ) = ρ v_x, f_y(ρ) = ρ v_y

Non-divergent branching, but could be determined at compile-time if
we pass coord as a template parameter:

template<int blockDim_x, int blockDim_y>
__device__ void getFlux(const float* cons, float* F, int coord)

changes to:

template<int blockDim_x, int blockDim_y, int coord>
__device__ void getFlux(const float* cons, float* F)

Time now: 7.73s (116.8×)
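
As an illustration of the compile-time branch, a coord-templated Euler flux might look like the sketch below. This is not the authors' code: the conserved-variable ordering, the ideal-gas GAMMA, and dropping the blockDim parameters are assumptions. After instantiation the coord tests cost nothing at run time.

#define GAMMA 1.4f   // assumed ideal-gas ratio of specific heats

// coord = 0 for x-fluxes, 1 for y-fluxes; resolved at compile time.
template<int coord>
__device__ void getFlux(const float* cons, float* F)
{
    const float rho = cons[0];
    const float vx  = cons[1] / rho;
    const float vy  = cons[2] / rho;
    const float E   = cons[3];
    const float p   = (GAMMA - 1.0f) * (E - 0.5f * rho * (vx * vx + vy * vy));
    const float vn  = (coord == 0) ? vx : vy;      // normal velocity for this direction

    F[0] = rho * vn;                                // mass flux
    F[1] = cons[1] * vn + (coord == 0 ? p : 0.0f);  // x-momentum flux
    F[2] = cons[2] * vn + (coord == 1 ? p : 0.0f);  // y-momentum flux
    F[3] = (E + p) * vn;                            // energy flux
}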



Avoid flux-update kernel

Initially, performed the update as:

calcFlux(u, flux, X_COORD)      flux = f_x(u)
addFlux(u, flux)                u = u + f_x(u)
calcFlux(u, flux, Y_COORD)      flux = f_y(u)
addFlux(u, flux)                u = u + f_y(u)

Instead, we can calculate the updated solution directly in an extra array:

advanceSoln(u, u_plus, X_COORD)     u⁺ = u + f_x(u)
advanceSoln(u_plus, u, Y_COORD)     u = u⁺ + f_y(u⁺)

Time now: 7.13s (126.6×)
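
Schematically (a hedged sketch; the template parameter and launch variables are carried over from the earlier slides and are assumptions), the two arrays alternate roles within each time step, so no separate flux-addition pass over global memory is needed:

// x-sweep writes into u_plus, y-sweep writes back into u, so u again holds
// the current solution when the time loop repeats.
advanceSoln<0><<<gridX, blockX>>>(u, u_plus, dt);   // u_plus = u + f_x(u)
advanceSoln<1><<<gridY, blockY>>>(u_plus, u, dt);   // u = u_plus + f_y(u_plus)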



What block-size should we use? (x-direction)

When performing the flux calculation, we want to reduce the number of overlap
cells relative to the block size

Block size    Overall time
16 × 8        7.13s
32 × 4        6.93s
64 × 2        6.95s
32 × 8        7.02s
64 × 4        6.98s

First 4 lines can have 4 thread blocks per multiprocessor (register
limitation)

Last two lines are limited to 3 thread blocks per multiprocessor.



What block-size should we use? (y-direction)

In the y-direction, different effects come into play

For global memory access coalescence, we want to read several adjacent
cells in the x-direction.

So, 4 × 32 is not the obvious answer

Block size    Overall time
16 × 8        6.93s
32 × 4        8.41s
4 × 32        6.54s
8 × 16        6.51s
8 × 32        6.45s



Performance on different sized grids

Attain reasonable speed-up even on smaller grids



Speed-up demonstration

pipeDemo.mpg



Summary

Achieved 147× speed-up for the Finite Volume solver

For optimal performance, had to use templates and tuning of block-size.

Future GPU-capable compilers/languages and APIs may be able to tune for this

Will need significant effort in optimization technology

Alternatively, develop C++ expression-template-like structures to do this
automatically.



Implications for industry

Many companies need accurate fluid dynamics solvers

Many would also benefit from faster computations

Work-flow of end-users changed dramatically:

Can adjust model parameters and get feedback much more quickly

Potential to interact with an evolving simulation,
opening up new development strategies



Future investigations

Efficient extension to three dimensions

Investigation of this approach for larger systems of equations
(limitations on shared memory)

More in-depth profiling on various compute-capability cards

Extension to using adaptive mesh refinement

