
SC11 - Nvidia

Combined CPU-GPU Simulation in Gromacs

Erik Lindahl
erik@kth.se
Royal Institute of Technology
Center for Biomembrane Research (CBR)


Molecular Dynamics

Protein Folding
Membrane Proteins
Free Energy & Drug Design

GROMACS
www.gromacs.org


GPU Computing


Our first attempts...

First Gromacs GPU project in 2002, with Ian Buck & Pat Hanrahan, Stanford

Promise of theoretical high FP performance on the GeForce4
Severe limitations in practice...

But we learned an important lesson: everything we've done the last decade(s) has been about avoiding floating-point operations - we cannot just implement those algorithms on GPUs.


Mixing CPU-GPU code?

• The PCI Express bus turns into a bottleneck
• Internal BW: >120 GB/s
• PCI Express x16: ~5 GB/s

We might have to STAY on the GPU!
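A rough back-of-the-envelope estimate (assumed system size, not a number from the slides): for $10^5$ atoms, shipping single-precision coordinates across the bus every step costs about

$$10^5 \times 3 \times 4\,\mathrm{B} \approx 1.2\,\mathrm{MB}, \qquad \frac{1.2\,\mathrm{MB}}{5\,\mathrm{GB/s}} \approx 0.24\,\mathrm{ms},$$

with a similar cost for the forces coming back - a large fraction of a sub-millisecond MD step before any transfer latency is even counted.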


OpenMM

• Standardized API
• Interface is fully public
• Possibly multiple future implementations
• Commercial libraries A-OK
• Hardware-agnostic plugin architecture
• Stanford, Stockholm, Nvidia & AMD


The OpenMM tile approach

[Figure: the N x N interaction matrix divided into 32x32 tiles, with atom indices 0, 32, 64, 96, ... along both axes.]

Absolute performance critical, not speedup relative to a reference implementation!

All-vs-all (CUDA book): N^2    Newton's 3rd law: (N^2)/2    Sort atoms in tiles: N log N

Scott LeGrand, Peter Eastman
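As a reminder of how a 32-atom tile is processed, here is a minimal CUDA sketch of the classic shared-memory all-vs-all pattern from the CUDA book that the slide refers to; the kernel name, the bare q_i q_j / r^2 force and the lack of exclusions or cutoffs are illustrative assumptions, not the actual OpenMM kernels.

// Hypothetical sketch: each thread owns atom i, the block loads 32 atoms j
// at a time into shared memory (one "tile") and every thread loops over them.
#define TILE 32

__global__ void allVsAllForces(const float4 *xq, float3 *f, int n)
{
    __shared__ float4 tile[TILE];                 // current tile of (x, y, z, q)
    int i = blockIdx.x * TILE + threadIdx.x;
    float4 xi = (i < n) ? xq[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 fi = make_float3(0.f, 0.f, 0.f);

    for (int start = 0; start < n; start += TILE) {
        int j = start + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? xq[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();
        for (int k = 0; k < TILE; k++) {
            int ja = start + k;
            if (i < n && ja < n && ja != i) {
                float dx = xi.x - tile[k].x;
                float dy = xi.y - tile[k].y;
                float dz = xi.z - tile[k].z;
                float rinv  = rsqrtf(dx*dx + dy*dy + dz*dz);
                float fscal = xi.w * tile[k].w * rinv * rinv * rinv;  // q_i q_j / r^3
                fi.x += fscal * dx;
                fi.y += fscal * dy;
                fi.z += fscal * dz;
            }
        }
        __syncthreads();
    }
    if (i < n) f[i] = fi;
}

// Launch with one thread per atom, e.g.:
//   allVsAllForces<<<(n + TILE - 1) / TILE, TILE>>>(d_xq, d_f, n);

The "Newton's 3rd law" and "sort atoms in tiles" variants on the slide cut this work down by computing each tile pair only once and by skipping tiles of spatially distant atoms, respectively.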


Gromacs & OpenMM in practice

• GPUs supported in Gromacs 4.5: mdrun ... -device "OpenMM:Cuda"
• Same input files, same output files: "It just works"
• Subset of features work on GPUs
• Amazing implicit solvent performance
• Supports both Cuda & OpenCL


OpenMM performance over x86 CPU

[Bar charts, ns/day, quad-core x86 vs. Tesla C2050, 2 fs time steps: BPTI (~21k atoms) with PME and reaction-field electrostatics, and Villin (600 atoms) with implicit solvent and the all-vs-all kernel.]


Why?


CPUs/GPUs are good at different things


Tiling circles is difficult!

• You need a lot of cubes to cover a sphere
• All interactions beyond the cutoff need to be zero


The art of calculating zeroes
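A back-of-the-envelope illustration (my numbers, not from the slides): if candidate pairs are taken from the 3x3x3 block of cells of side $r_c$ around an atom, only

$$\frac{\tfrac{4}{3}\pi r_c^3}{(3 r_c)^3} \approx 0.16$$

of them actually lie inside the cutoff sphere, so the overwhelming majority of distances a naive cell scheme evaluates belong to interactions that are exactly zero.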


The 3rd Stage


CPUs have changed too

Cores are fairly cheap
...but they get slower!

Performance by scaling
Cray XE6: ~300 ns/day


Speedup vs. Performance

Hannes Loeffler & Martyn Winn, Daresbury
http://www.cse.scitech.ac.uk/cbg/benchmarks/Report_II.pdf


CPU trick 1: all-bond constraints

• Δt limited by fast motions - 1 fs
• Remove bond vibrations
• SHAKE (iterative, slow) - 2 fs
• Problematic in parallel (won't work)
• Compromise: constrain h-bonds only - 1.4 fs
• GROMACS (LINCS):
  • LINear Constraint Solver
  • Approximate matrix inversion expansion (sketched after the figure below)
  • Fast & stable - much better than SHAKE
  • Non-iterative
  • Enables 2-3 fs timesteps
  • Parallel: P-LINCS (from Gromacs 4.0)

[Figure, the LINCS update: A) move without constraints (t=1 → t=2'); B) project out motion along the bonds (t=2''); C) correct for the rotational extension of the bond (t=2).]
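A simplified sketch of the projection LINCS performs (following Hess et al., J. Comput. Chem. 18, 1463 (1997); notation abbreviated): with B the matrix of constraint directions, M the diagonal mass matrix and d the constraint lengths, the unconstrained positions are corrected as

$$x' = x_{\mathrm{unc}} - M^{-1}B^{T}\,\bigl(BM^{-1}B^{T}\bigr)^{-1}\bigl(Bx_{\mathrm{unc}} - d\bigr).$$

The inverse is never formed explicitly: it is rewritten as $S\,(I - A_n)^{-1}S$ with a diagonal scaling $S$ and a sparse coupling matrix $A_n$ whose eigenvalues are smaller than one, so $(I - A_n)^{-1}$ is approximated by the truncated series $I + A_n + A_n^2 + \dots$ (a few terms suffice). This is what makes LINCS non-iterative and parallelizable; a final correction accounts for the rotational lengthening of the bonds (panel C above).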


CPU trick 2: Virtual sites

• The next fastest motions are H-angle vibrations and rotations of CH3/NH2 groups
• Try to remove them:
  • Ideal H positions are constructed from the heavy atoms
  • CH3/NH2 groups are made rigid
  • Calculate forces, then project them back onto the heavy atoms
  • Integrate only the heavy-atom positions, reconstruct the H's
• Enables 5 fs timesteps! (the simplest construction is sketched after the figure below)

[Figure: virtual-site construction types - 2-site linear (weights a, 1-a), 3fd, 3fad, 3out and 4fd - contrasting the interactions with the remaining degrees of freedom.]
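For the simplest (2-site linear) construction in the figure, both steps are plain linear algebra: a virtual site placed on the line between atoms i and j sits at

$$x_v = (1-a)\,x_i + a\,x_j,$$

and because the construction is linear, the chain rule distributes the force acting on the site exactly as

$$F_i \mathrel{+}= (1-a)\,F_v, \qquad F_j \mathrel{+}= a\,F_v,$$

so neither total force nor total torque is lost when the virtual site is excluded from the integration.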


8th-sphere decomposition

[Figure: communication zones for the (a) half-shell, (b) eighth-shell ("8th-sphere") and (c) midpoint methods, illustrated for a 2D domain decomposition with cutoff radius r_c. The lines with circles show examples of pair interactions assigned to the central cell; for (a) and (b) the assignment is based on the endpoints of the line, for (c) on its midpoint.]

Bonded interactions are assigned to the processor with the home cell where the smallest coordinates reside. This procedure works as long as the largest distance between charge groups involved in bonded interactions is not larger than the cutoff radius. To check that this is the case, we count the number of assigned bonded interactions during domain decomposition and compare it to the total number of bonded interactions in the system.

For full dynamic load balancing the boundaries between the cells need to move during the simulation. For 1D domain decomposition this is trivial, but for a 3D decomposition the cell boundaries in the last two dimensions need to be staggered along the first dimensions to allow for complete load balancing.

[Figure: the zones (0-7, A/A', B/B', C/C') communicated to a neighboring processor in the eighth-shell scheme.]

Load balancing works for arbitrary triclinic cells.


What if we don't want to choose?

CPU: ~0.5 TFLOP, random memory access OK (not great)

vs.

GPU: ~1 TFLOP, random memory access won't work


Multiple Program, Multiple Data Revisited

[Flow chart of one MD step with separate real-space (particle) and PME node groups.

Real-space (particle) nodes: communicate coordinates to construct virtual sites; construct virtual sites; on a neighbor-search step, do domain decomposition and (local) neighborsearching and send charges to the peer PME processor; send x and the box to the peer PME processor; communicate x with real-space neighbor processors; evaluate potential/forces; communicate f with real-space neighbor processors; receive forces/energy/virial from the peer PME processor; spread forces on virtual sites and communicate the virtual-site forces; integrate coordinates; constrain bond lengths (parallel LINCS); sum energies over all real-space processors; repeat while more steps remain.

PME nodes: on a neighbor-search step, receive charges from the peer real-space processors; receive x and the box from the peer real-space processors until all local coordinates have arrived; communicate some atoms to neighbor PME processors; spread charges on the grid; communicate grid overlap with PME neighbor processors; parallel 3D FFT; solve PME (convolution); parallel inverse 3D FFT; communicate grid overlap again; interpolate forces from the grid; communicate some forces to neighbor PME processors; send forces/energy/virial to the peer real-space processors; repeat while more steps remain.]


Gromacs 4.6: Balancing over heterogeneous compute resources

CPU: proximity searching, bonded interactions, PME, integration (plus communication)
GPU: direct-space nonbonded interactions

Required a new multi-threaded PME solver!
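The reason work can be shifted between the two sides at all (a general property of PME-style methods, not something stated on this slide) is the Ewald split of the electrostatics:

$$E_{\mathrm{coul}} = \underbrace{\tfrac{1}{2}\sum_{i\neq j,\; r_{ij}<r_c} \frac{q_i q_j\,\mathrm{erfc}(\beta r_{ij})}{r_{ij}}}_{\text{direct space (GPU)}} \;+\; \underbrace{E_{\mathrm{recip}}(\beta,\ \text{mesh})}_{\text{PME mesh (CPU)}} \;+\;\dots$$

Increasing $r_c$ while scaling $\beta$ so that $\mathrm{erfc}(\beta r_c)$ stays fixed moves work from the CPU mesh part to the GPU direct-space kernel at essentially constant accuracy; the "PME auto-tuned cutoff" in the later benchmark slides exploits exactly this freedom.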


R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:           Nodes   Number   G-Cycles   Seconds      %
-----------------------------------------------------------------------
 Neighbor search          1      332     18.979       5.9    16.4
 Launch GPU calc.         1     3311      0.869       0.3     0.8
 Force                    1     3311     28.572       8.9    24.7
 PME mesh                 1     3311     45.625      14.2    39.5
 Wait for GPU calc.       1     3311      0.132       0.0     0.1
 Write traj.              1        1      0.424       0.1     0.4
 Update                   1     3311      4.196       1.3     3.6
 Constraints              1     3311      5.393       1.7     4.7
 Rest                     1              11.419       3.6     9.9
-----------------------------------------------------------------------
 Total                    1             115.610      36.0   100.0
-----------------------------------------------------------------------
 PME spread/gather        1     6622     32.128      10.0    27.8
 PME 3D-FFT               1     6622     10.374       3.2     9.0
 PME solve                1     3311      3.084       1.0     2.7
-----------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------
 Computing:                Number   Seconds   ms/step      %
-----------------------------------------------------------------------
 Neighborlist H2D             332      0.46     1.388    2.4
 Nonbonded H2D               3311      0.39     0.118    2.0
 Nonbonded calc.             3311     18.17     5.488   94.4
 Nonbonded D2H               3311      0.24     0.071    1.2
-----------------------------------------------------------------------
 Total                                19.26     5.817  100.0
-----------------------------------------------------------------------

Force evaluation time GPU/CPU: 5.817 ms / 6.971 ms = 0.834
For optimal performance this ratio should be 1!

               NODE (s)   Real (s)      (%)
 Time:          143.360     35.964    398.6
                   2:23
               (Mnbf/s)   (GFlops)  (ns/day)  (hour/ns)
 Performance:     0.000      4.831    15.909      6.014


Gromacs Gen3-GPU strategy

• Single kernel: push data, compute, return data
• Extremely low amount of GPU code
• Every single feature of Gromacs works
  • Triclinic boxes
  • Pressure coupling
  • Virtual sites, all-bond constraints
  • All force fields supported
• Automatic multithread & GPU balancing


Performance

[Bar charts, ns/day (0-50): CPU only vs. CPU+GPU for a 4-core Phenom II X4 with a GTX 470, and a 4-core Core i5 with a C2050. Benchmark: Lysozyme, 25k atoms, rhombic dodecahedron box, virtual sites enabled, 5 fs time steps.]


[Timeline of one MD step:

CPU: pair search (on pair-search steps, every 10-20 iterations) → transfer pair-list to the GPU → transfer x,q → bonded F + PME → busy-wait for the GPU → transfer F, E back → integration, constraints.

GPU: idle during the pair search and the integration; computes the non-bonded F and prunes the pair list while the CPU works on bonded F and PME.

Avg. overlap CPU-GPU: 65-70% per iteration.]
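A hedged host-side sketch of this overlap pattern; all buffer names, helper functions and launch parameters below are placeholders rather than the actual Gromacs 4.6 code - only the CUDA runtime calls themselves are real.

// Sketch of the per-step CPU/GPU overlap in the timeline above, assuming a
// nonbonded kernel nb_kernel() and CPU-side helpers defined elsewhere.
#include <cuda_runtime.h>

__global__ void nb_kernel(const float4 *xq, const int *pairlist, float3 *f);
void cpu_pair_search(int *h_pairlist);                     // placeholder helpers
void cpu_bonded_and_pme(float3 *f_cpu);
void cpu_reduce_integrate_constrain(const float3 *f_gpu, const float3 *f_cpu);

void md_loop(int nsteps, int nstlist,
             float4 *h_xq, int *h_pairlist, float3 *h_f, float3 *h_f_cpu,
             float4 *d_xq, int *d_pairlist, float3 *d_f,
             size_t xq_bytes, size_t list_bytes, size_t f_bytes,
             dim3 grid, dim3 block)
{
    cudaStream_t s;
    cudaStreamCreate(&s);
    for (int step = 0; step < nsteps; step++) {
        if (step % nstlist == 0) {                         // pair search every 10-20 steps
            cpu_pair_search(h_pairlist);
            cudaMemcpyAsync(d_pairlist, h_pairlist, list_bytes,
                            cudaMemcpyHostToDevice, s);    // transfer pair-list
        }
        cudaMemcpyAsync(d_xq, h_xq, xq_bytes, cudaMemcpyHostToDevice, s);  // transfer x,q
        nb_kernel<<<grid, block, 0, s>>>(d_xq, d_pairlist, d_f);           // non-bonded F
        cudaMemcpyAsync(h_f, d_f, f_bytes, cudaMemcpyDeviceToHost, s);     // transfer F, E

        cpu_bonded_and_pme(h_f_cpu);                       // overlaps with the GPU work
        cudaStreamSynchronize(s);                          // "busy-wait for GPU"
        cpu_reduce_integrate_constrain(h_f, h_f_cpu);      // integration + (P-)LINCS
    }
    cudaStreamDestroy(s);
}

The point of the single stream is simply that the H2D copy, the kernel and the D2H copy are queued asynchronously, so the CPU can start its bonded and PME work immediately and only synchronizes once its own force work is done.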


[Figure: an i-j supercell pair within the rlist radius, subdivided along x, y and z into j-i subcell pairs.]

We term this the tiles-in-voxels, or "3D Tixels", approach - see the poster by Szilard Pall!


State-of-the-art efficiency for nonbonded GPU kernels


[Timeline of one MD step with domain decomposition, using two GPU streams:

CPU: MPI receive of non-local x → transfer local x,q → transfer non-local x,q → bonded F + PME → wait for the non-local F → MPI send of non-local F → wait for the local F → integration, constraints.

GPU, local stream: local non-bonded F → transfer local F.
GPU, non-local stream: non-local non-bonded F → transfer non-local F.]


[Plot: CUDA non-bonded force kernel weak scaling with PME, cutoff = 1.0 nm. Iteration time per 1000 atoms (0-0.4 ms/step) vs. system size (1.5k to 3072k atoms) for Tesla M2090, Tesla C2075, GeForce GTX 580 and GeForce GTX 550 Ti.]


[Plot: PME weak scaling, Xeon X5650 3T + C2075 per process. Iteration time per 1000 atoms (0-0.35 ms/step) vs. system size per GPU (1.5k to 3072k atoms), showing the 1x C2075 CUDA force kernel alone and the CPU totals for 1, 2 and 4 C2075 GPUs.]


Current performance limit?

How fast can we be when the limit is latency?

[Bar chart, μs/day (0-0.6): 4T + C2050 vs. 8T + C2050. Benchmark: 3000 atoms (ridiculously small), 5 fs steps, 2x Xeon E5620 2.4 GHz, C2050 with ECC enabled.]
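To see why latency becomes the limit (simple arithmetic from the slide's numbers): 0.5 μs/day at 5 fs per step is

$$\frac{0.5\,\mu\mathrm{s/day}}{5\,\mathrm{fs/step}} = 10^{8}\ \mathrm{steps/day} \approx 1160\ \mathrm{steps/s},$$

i.e. under a millisecond of wall time per MD step, a regime where kernel-launch, PCIe and synchronization latencies matter as much as arithmetic throughput.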


State of GROMACS today - 24k atom protein


[Plot: strong scaling of reaction-field and PME for a 1.5M-atom waterbox; RF cutoff = 0.9 nm, PME with auto-tuned cutoff. Performance (0.1-100 ns/day, log scale) vs. number of processes/GPUs, with linear-scaling reference lines for both RF and PME.]


The Future?

Multi-core
Multi-architecture


Acknowledgments

• GROMACS: Berk Hess, David van der Spoel, Per Larsson
• Gromacs-GPU: Szilard Pall, Berk Hess
• Multi-threaded PME: Roland Schulz, Berk Hess
• Gromacs-OpenMM: Rossen Apostolov, Szilard Pall, Peter Eastman, Vijay Pande
• Nvidia: Scott LeGrand, Duncan Poole, Andrew Walsh, Chris Butler


Research areas: Bioinformatics & Structure Modeling; Simulation Software & Distributed Computing; Ion Channel Studies (Kv1.2, NaV1.7, GlyRa1); Screening, Docking & Free Energy Calculation

Group: Björn Wallner, Arjun Ray, Per Larsson, Samuel Murail, Aron Hennerdal, Pär Bjelkmar, Christine Schwaiger, Peter Kasson, Teemu Murtola, Berk Hess, Rossen Apostolov, Wiktor Jurkowski, Sander Pronk, Szilard Pall, Anna Johansson, Björn Wesén

Collaborators: Arne Elofsson & Gunnar von Heijne, Stockholm University; Vijay Pande, Stanford University; Jim Trudell & Edward Bertaccini, Stanford Medical School
