Lattice-Boltzmann Simulations on GPUs

Multiple relaxation time LB with ESPResSo

Dominic Röhm

Institute for Computational Physics

11.10.2012




Motivation

• Molecular dynamics (MD) simulations ⇒ simulating the solvent explicitly is expensive
• Simple way ⇒ use a thermostat (e.g. Langevin)
• Problems: confined geometries and no hydrodynamic interaction between particles
• Solution: Lattice-Boltzmann method



LB on GPUs

• Lattice based, well suited for the Single Instruction Multiple Data (SIMD) scheme
• GPUs ⇒ execute the same code massively parallel
• Different types of memory, explicitly accessed; atomic operations
• Hierarchical thread scheme
• Asynchronous memory copy, streams (see the sketch below)

[Figures: CPU vs. GPU architecture, CUDA memory hierarchy (registers, shared, global, constant, texture memory), and the grid/block/thread organization]
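As a reminder of what these CUDA features look like in code, here is a minimal, hypothetical sketch (not taken from the ESPResSo sources; all names are illustrative): a kernel launched over a hierarchical grid of thread blocks, an atomic update in global memory, and an asynchronous host-to-device copy issued on a stream.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative kernel: one thread per lattice site, all threads execute the
    // same code (SIMD-like); flagged sites are counted with an atomic update
    // in global memory.
    __global__ void count_boundary_nodes(const int *flags, int n, int *n_boundary) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // hierarchical thread index
        if (i < n && flags[i] != 0)
            atomicAdd(n_boundary, 1);                    // atomic operation in global memory
    }

    int main() {
        const int n = 64 * 64 * 64;                      // e.g. one flag per lattice node
        int *h_flags, *d_flags, *d_count;
        cudaMallocHost(&h_flags, n * sizeof(int));       // pinned host memory for async copies
        cudaMalloc(&d_flags, n * sizeof(int));
        cudaMalloc(&d_count, sizeof(int));
        cudaMemset(d_count, 0, sizeof(int));
        for (int i = 0; i < n; ++i) h_flags[i] = (i % 64 == 0);

        // Asynchronous copy and kernel launch on a stream: the host could keep
        // working (e.g. on the MD force loop) while the device is busy.
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(d_flags, h_flags, n * sizeof(int), cudaMemcpyHostToDevice, stream);
        int threads = 128, blocks = (n + threads - 1) / threads;   // grid of blocks of threads
        count_boundary_nodes<<<blocks, threads, 0, stream>>>(d_flags, n, d_count);
        cudaStreamSynchronize(stream);

        int count;
        cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
        printf("boundary nodes: %d\n", count);
        cudaFree(d_flags); cudaFree(d_count); cudaFreeHost(h_flags);
        cudaStreamDestroy(stream);
        return 0;
    }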



Theoretical power of (NVIDIA) GPUs

• Floating point operations per second
• Bandwidth of the VRAM

Theoretical speedup: almost 10x



Lattice-Boltzmann on GPUs

• Velocity space resides in (large but “slow”) global memory
• Double buffering to avoid race conditions during streaming
• Memory layout is optimized for coalesced access to the fluid velocities
• Kernels for solvent update and particle-solvent interaction

Solvent kernel
• One thread per lattice node
• Mode space transformation into registers
• Relax modes, thermalize modes, apply (external) forces, normalization
• Thermalization uses a dedicated Gaussian RNG per lattice node
• Back transformation into velocity space and streaming step with periodic boundaries at once (see the sketch below)
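A minimal CUDA sketch of this kernel structure, not the actual ESPResSo solvent kernel: for brevity it uses a single-relaxation-time (BGK) collision instead of the MRT mode-space collision and omits thermalization and external forces, but it shows one thread per lattice node, the populations held in registers, the coalesced structure-of-arrays layout, and the fused streaming step with periodic boundaries into a second buffer (double buffering). All names are hypothetical.

    #include <cuda_runtime.h>

    // D3Q19 velocity set and weights (standard lattice-Boltzmann constants).
    __constant__ int c19[19][3] = {
        { 0, 0, 0},
        { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
        { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
        { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
        { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}};
    __constant__ float w19[19] = {
        1.f/3.f,
        1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f,
        1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f,
        1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f};

    // One thread per lattice node.  Populations are stored structure-of-arrays,
    // f[i * n + node], so that neighbouring threads access neighbouring
    // addresses (coalesced).  Post-collision populations are streamed directly
    // into a second buffer (double buffering) with periodic wrapping.
    __global__ void lb_update(const float *f_src, float *f_dst,
                              int nx, int ny, int nz, float omega) {
        int node = blockIdx.x * blockDim.x + threadIdx.x;
        int n = nx * ny * nz;
        if (node >= n) return;
        int x = node % nx, y = (node / nx) % ny, z = node / (nx * ny);

        // Load the 19 populations into registers; compute density and momentum.
        float f[19], rho = 0.f, jx = 0.f, jy = 0.f, jz = 0.f;
        for (int i = 0; i < 19; ++i) {
            f[i] = f_src[i * n + node];
            rho += f[i];
            jx += f[i] * c19[i][0]; jy += f[i] * c19[i][1]; jz += f[i] * c19[i][2];
        }
        float ux = jx / rho, uy = jy / rho, uz = jz / rho;
        float usq = ux*ux + uy*uy + uz*uz;

        // Collision and streaming with periodic boundaries in one pass.
        for (int i = 0; i < 19; ++i) {
            float cu  = c19[i][0]*ux + c19[i][1]*uy + c19[i][2]*uz;
            float feq = w19[i] * rho * (1.f + 3.f*cu + 4.5f*cu*cu - 1.5f*usq);
            float fpost = f[i] - omega * (f[i] - feq);   // relax towards equilibrium
            // (The real kernel relaxes and thermalizes the modes in mode space
            //  and applies external forces here.)
            int xn = (x + c19[i][0] + nx) % nx;          // periodic wrap
            int yn = (y + c19[i][1] + ny) % ny;
            int zn = (z + c19[i][2] + nz) % nz;
            f_dst[i * n + (zn * ny + yn) * nx + xn] = fpost;
        }
    }

Such a kernel would be launched with one thread per node, e.g. lb_update<<<(nx*ny*nz + 127)/128, 128>>>(f_a, f_b, nx, ny, nz, omega), swapping the two buffers after every step.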



Particle-solvent interaction kernel
• One thread per particle
• Interpolate the fluid velocity at the position of the particle (registers)
• Fluid force acting on the particle
• Distribute the reaction force back to the lattice nodes (requires atomic operations); see the sketch below

[Figure: coupling forces F(0,0), F(0,1), F(1,0), F(1,1) distributed to the surrounding lattice nodes]
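A minimal sketch of this coupling kernel (hypothetical names, not the ESPResSo implementation), assuming a simple friction coupling F = -gamma (v_particle - u_fluid) and particle positions already folded into the box: one thread per particle, trilinear interpolation of the fluid velocity from the eight surrounding nodes, and the opposite force scattered back to the same nodes with atomicAdd.

    #include <cuda_runtime.h>

    // One thread per particle: gather the interpolated fluid velocity into
    // registers, compute the friction force on the particle, and scatter the
    // reaction back to the eight surrounding nodes.  The scatter needs
    // atomicAdd because several particles may write to the same node.
    __global__ void particle_fluid_coupling(const float3 *pos, const float3 *vel,
                                            float3 *part_force,      // force on the particles
                                            const float3 *u_fluid,   // fluid velocity per node
                                            float3 *node_force,      // force on the fluid nodes
                                            int np, int nx, int ny, int nz,
                                            float agrid, float gamma) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= np) return;

        // Cell containing the particle and the interpolation fractions.
        float fx = pos[p].x / agrid, fy = pos[p].y / agrid, fz = pos[p].z / agrid;
        int x0 = (int)floorf(fx), y0 = (int)floorf(fy), z0 = (int)floorf(fz);
        float dx = fx - x0, dy = fy - y0, dz = fz - z0;

        // Gather: trilinear interpolation of the fluid velocity.
        float3 u = make_float3(0.f, 0.f, 0.f);
        float w[2][2][2];
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                for (int k = 0; k < 2; ++k) {
                    w[i][j][k] = (i ? dx : 1.f - dx) * (j ? dy : 1.f - dy) * (k ? dz : 1.f - dz);
                    int node = (((z0 + k) % nz) * ny + (y0 + j) % ny) * nx + (x0 + i) % nx;
                    u.x += w[i][j][k] * u_fluid[node].x;
                    u.y += w[i][j][k] * u_fluid[node].y;
                    u.z += w[i][j][k] * u_fluid[node].z;
                }

        // Friction coupling: force on the particle, opposite force on the fluid.
        float3 F = make_float3(-gamma * (vel[p].x - u.x),
                               -gamma * (vel[p].y - u.y),
                               -gamma * (vel[p].z - u.z));
        part_force[p] = F;

        // Scatter: distribute the reaction with the same weights (atomics!).
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                for (int k = 0; k < 2; ++k) {
                    int node = (((z0 + k) % nz) * ny + (y0 + j) % ny) * nx + (x0 + i) % nx;
                    atomicAdd(&node_force[node].x, -w[i][j][k] * F.x);
                    atomicAdd(&node_force[node].y, -w[i][j][k] * F.y);
                    atomicAdd(&node_force[node].z, -w[i][j][k] * F.z);
                }
    }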


Hydrodynamics in confined geometries (movie)


Hydrodynamic (long range) interactions (movie)



Parallel execution scheme

CPU (MPI parallelization):
• get system properties via the TCL interface on the master node
• initialize cell structures and distribute particles
• gather particle pos/vel on the master node
• calculate short range forces
• calculate long range forces
• distribute particle forces to the worker nodes
• propagate particle positions

GPU:
• initialize fluid, set up boundary conditions, allocate particle memory
• calculate fluid update
• fetch particle pos/vel from the master node
• calculate particle-fluid interaction
• send particle forces to the master node

Transferred via PCI Express between the master node and the GPU: fluid parameters, particle positions and velocities, particle-fluid forces (see the host-side sketch below).
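A hypothetical host-side sketch of one coupled time step (illustrative names, not the ESPResSo source): particle positions/velocities are copied to the GPU asynchronously, the fluid update and coupling kernels run on a stream, and the CPU works on the short- and long-range forces in the meantime, only synchronizing before the coupling forces are added and the particles are propagated. This overlap is the interleaving referred to in the benchmark slides.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstring>
    #include <cstdlib>

    // Dummy stand-in for the LB fluid update and particle-fluid coupling
    // kernels sketched earlier (hypothetical, for illustration only).
    __global__ void gpu_fluid_and_coupling(const float3 *pos_vel, float3 *coupling_force, int np) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p < np)
            coupling_force[p] = make_float3(-pos_vel[p].x, -pos_vel[p].y, -pos_vel[p].z);
    }

    // Dummy stand-in for the short- and long-range force loop on the CPU.
    static void cpu_short_and_long_range_forces(float3 *force, int np) {
        for (int p = 0; p < np; ++p) force[p] = make_float3(0.f, 0.f, 0.f);
    }

    int main() {
        const int np = 1000;
        float3 *h_pos_vel, *h_coupling, *d_pos_vel, *d_coupling;
        cudaMallocHost(&h_pos_vel, np * sizeof(float3));   // pinned memory for async copies
        cudaMallocHost(&h_coupling, np * sizeof(float3));
        memset(h_pos_vel, 0, np * sizeof(float3));
        cudaMalloc(&d_pos_vel, np * sizeof(float3));
        cudaMalloc(&d_coupling, np * sizeof(float3));
        float3 *h_force = (float3 *)calloc(np, sizeof(float3));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        for (int step = 0; step < 10; ++step) {
            // 1. particle pos/vel gathered on the master node go over PCI Express
            cudaMemcpyAsync(d_pos_vel, h_pos_vel, np * sizeof(float3),
                            cudaMemcpyHostToDevice, stream);
            // 2. fluid update + particle-fluid interaction run on the GPU ...
            gpu_fluid_and_coupling<<<(np + 127) / 128, 128, 0, stream>>>(d_pos_vel, d_coupling, np);
            cudaMemcpyAsync(h_coupling, d_coupling, np * sizeof(float3),
                            cudaMemcpyDeviceToHost, stream);
            // 3. ... while the CPU computes short- and long-range forces in parallel
            //    (this overlap is why the LB cost is nearly hidden).
            cpu_short_and_long_range_forces(h_force, np);
            // 4. wait for the GPU, add the coupling forces, then propagate particles
            cudaStreamSynchronize(stream);
            for (int p = 0; p < np; ++p) {
                h_force[p].x += h_coupling[p].x;
                h_force[p].y += h_coupling[p].y;
                h_force[p].z += h_coupling[p].z;
            }
            // ... particle propagation on the CPU would follow here ...
        }
        printf("done after 10 steps\n");
        cudaFree(d_pos_vel); cudaFree(d_coupling);
        cudaFreeHost(h_pos_vel); cudaFreeHost(h_coupling); free(h_force);
        cudaStreamDestroy(stream);
        return 0;
    }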



Performance benchmark: Suspension

Particle interaction: Lennard-Jones radius 2.5 on INTEL XEON Quadcore@2.4GHz + NVIDIA Tesla C2050

[Plot: time per integration step in ms vs. simulation box length (16–100) for LV (Langevin), LB GPU, and LB CPU]

LB cost negligible for densities > 0.3 due to interleaving



Performance benchmark: Electro-osmotic flow

Particle interaction: L-J radius 1.1225, P3M+ELC, constraints, fluid boundaries on INTEL XEON Quadcore@2.4GHz + NVIDIA Tesla C2050

[Plot: time in µs vs. simulation box length for LB GPU, LB CPU, and LV thermostat with 100, 500, 1000, and 5000 particles]

LB cost negligible for > 100 particles due to interleaving



Performance benchmark: Thermalized fluid

two INTEL XEON Quadcore@2.4GHz + NVIDIA Tesla C2050

[Plot: time in µs vs. simulation box length, LB GPU @ Tesla C2050 vs. LB CPU @ XEON E5620]

Thermalized fluid ⇒ 50x speedup




Performance benchmark: Thermalized fluid

[Plot: MLUps vs. number of CPU cores for lattice sizes 24³, 48³, 96³, compared to ideal scaling]

A CPU cluster cannot achieve single-GPU performance!



Lattice-Boltzmann with ESPResSo

• lbfluid (cpu/gpu) tau 0.1 agrid 1.0 visc 0.8
• lbfluid friction 10.0 ext_force $fx $fy $fz
• Ratio of setmd time_step / (lbfluid tau)
• Parameter range, e.g. lbfluid agrid 0.5, lbfluid tau $tau/8
• lbboundary wall normal -1 0 0 dist [expr -$box_l_x+1]
  lbboundary wall normal 1 0 0 dist [expr +1]
  also available: sphere, cylinder, rhomboid
• lbfluid print (vtk) velocity/boundary $filename


HI + confined geometries (movie)



Conclusions

• GPU code on a Tesla C2050 is up to 50 times faster than CPU code on two XEON E5620 CPUs
• Get hydrodynamics for free, or compute pure hydrodynamics faster than on a CPU cluster
• D. Röhm and A. Arnold, Eur. Phys. J. Special Topics 210, 89–100 (2012)

Outlook

• GPU-accelerated Coulomb solver (P3M, ELC, MMM...)
• GPU-accelerated Poisson-Boltzmann solver, included in ESPResSo (G. Rempfer)
