Implementing Finite Volume algorithms on GPUs - many-core.group ...

Template over blockDim

Initially, shared memory was allocated dynamically.
Drawbacks:
  Offsets are only calculated at run-time
  Overflow of shared memory is only found at run-time

So template all functions over blockDim.x and blockDim.y:

    template<int blockDim_x, int blockDim_y>
    __global__ void getSLICflux_GPU(Grid u, Grid flux, float dt, int coord)
    {
        __shared__ float temp[4][NUM_VARS][blockDim_y][blockDim_x];
    }

Compiler can pre-compute memory offsets
Use of excess shared memory can be detected at compile-time
Drawback: the block size must be fixed at compile-time

Time now: 8.23s (109.7×)

Finite Volume Methods, Laboratory for Scientific Computing, slide 13 / 22
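The compile-time check described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the talk's actual code: the names BLOCK_X, BLOCK_Y, sharedBytes, SHARED_LIMIT and the KERNEL/SHARED macros are introduced here, the Grid struct is a hypothetical stand-in, NUM_VARS = 4 assumes the four conserved variables of the 2D Euler equations, and the 48 KB figure is an assumed per-block budget. The macros only exist so the same file builds with nvcc or with a plain C++ compiler, which lets the check be tried without a GPU.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define KERNEL __global__
#define SHARED __shared__
#else   // plain C++ build: qualifiers become no-ops so the templates can be checked on the host
#define KERNEL
#define SHARED
#endif

constexpr int NUM_VARS = 4;                      // assumption: rho, rho*vx, rho*vy, E
constexpr std::size_t SHARED_LIMIT = 48 * 1024;  // assumption: shared-memory budget per block, in bytes

struct Grid { float* data; int nx; int ny; };    // hypothetical stand-in for the talk's Grid type

// Bytes needed by temp[4][NUM_VARS][BLOCK_Y][BLOCK_X]; evaluated by the compiler.
template <int BLOCK_X, int BLOCK_Y>
constexpr std::size_t sharedBytes() {
    return sizeof(float) * 4 * NUM_VARS * BLOCK_Y * BLOCK_X;
}

template <int BLOCK_X, int BLOCK_Y>
KERNEL void getSLICflux_GPU(Grid u, Grid flux, float dt, int coord) {
    // An oversized block is rejected when the template is instantiated,
    // not when the kernel is launched.
    static_assert(sharedBytes<BLOCK_X, BLOCK_Y>() <= SHARED_LIMIT,
                  "block size needs more shared memory than the assumed limit");

    // Array extents are compile-time constants, so index offsets can be precomputed.
    SHARED float temp[4][NUM_VARS][BLOCK_Y][BLOCK_X];

    // The flux computation itself is elided here, as it is on the slide.
    (void)u; (void)flux; (void)dt; (void)coord; (void)temp;
}
```

With nvcc, the launch would need a block shape that matches the template arguments, for example getSLICflux_GPU<16, 8><<<numBlocks, dim3(16, 8)>>>(u, flux, dt, coord), where numBlocks is a hypothetical grid dimension. For a 16 × 8 block, sharedBytes<16, 8>() evaluates to 8192 bytes.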
Templating over coordinate direction

We call getFluxes() for the x-coordinate, and then for the y-coordinate.
Some functions, such as the flux calculation, depend on the coordinate.
For example: f_x(ρ) = ρ v_x, f_y(ρ) = ρ v_y
The branching is non-divergent, but it could be resolved at compile-time if we pass coord as a template parameter:

    template<...>
    __device__ void getFlux(const float* cons, float* F, int coord)

changes to:

    template<..., int coord>
    __device__ void getFlux(const float* cons, float* F)

Time now: 7.73s (116.8×)
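A minimal sketch of the same idea for the Euler flux is given below. It rests on several assumptions that are not in the slides: the conserved variables are ordered (ρ, ρv_x, ρv_y, E), the gas is ideal with γ = 1.4, and the names COORD, GAMMA, DEVICE and getFluxDispatch are introduced purely for illustration. The function is marked for both host and device here (the slide's version is device-only) so that it can be exercised without a GPU.

```cpp
#include <cassert>

#ifdef __CUDACC__
#define DEVICE __host__ __device__
#else   // plain C++ build: the qualifier becomes a no-op
#define DEVICE
#endif

constexpr float GAMMA = 1.4f;   // assumption: ideal-gas ratio of specific heats

// COORD = 0 selects the x-direction flux, COORD = 1 the y-direction flux.
// cons = (rho, rho*vx, rho*vy, E) is an assumed variable ordering.
template <int COORD>
DEVICE void getFlux(const float* cons, float* F) {
    assert(cons[0] > 0.0f);                       // density must be positive
    const float rho = cons[0];
    const float vx  = cons[1] / rho;
    const float vy  = cons[2] / rho;
    const float E   = cons[3];
    const float p   = (GAMMA - 1.0f) * (E - 0.5f * rho * (vx * vx + vy * vy));

    // COORD is a compile-time constant, so these selections can be
    // resolved by the compiler instead of being tested per thread.
    const float vn = (COORD == 0) ? vx : vy;

    F[0] = rho * vn;                              // mass flux: rho * v_x or rho * v_y
    F[1] = cons[1] * vn + (COORD == 0 ? p : 0.0f);
    F[2] = cons[2] * vn + (COORD == 1 ? p : 0.0f);
    F[3] = (E + p) * vn;
}

// Converts a run-time coord into a template argument in one place.
DEVICE inline void getFluxDispatch(const float* cons, float* F, int coord) {
    if (coord == 0) getFlux<0>(cons, F);
    else            getFlux<1>(cons, F);
}
```

In this sketch the single remaining branch sits in getFluxDispatch. If the calling kernel is itself templated on the coordinate, that branch can move to the host-side launch, so no device code tests coord at run-time.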
