
SC11 - Nvidia

Combined CPU-GPU Simulation in Gromacs

Erik Lindahl
erik@kth.se
Royal Institute of Technology
Center for Biomembrane Research (CBR)


Molecular Dynamics

Protein Folding
Membrane Proteins
Free Energy & Drug Design

GROMACS
www.gromacs.org


GPU Computing


Our first attempts...

First Gromacs GPU project in 2002, with Ian Buck & Pat Hanrahan, Stanford

Promise of theoretical high FP performance on the GeForce4
Severe limitations in practice...

But we learned an important lesson: everything we've done the last decade(s) has been about avoiding floating-point operations - we cannot just implement those algorithms on GPUs.


Mixing CPU-GPU code?

• The PCI Express bus turns into a bottleneck
• Internal BW: >120 GB/s
• PCI Express x16: ~5 GB/s

We might have to STAY on the GPU!
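A rough back-of-the-envelope estimate (assumed system size, not a number from the slides): for $10^5$ atoms, shipping single-precision coordinates across the bus every step costs about

$$10^5 \times 3 \times 4\,\mathrm{B} \approx 1.2\,\mathrm{MB}, \qquad \frac{1.2\,\mathrm{MB}}{5\,\mathrm{GB/s}} \approx 0.24\,\mathrm{ms},$$

with a similar cost for the forces coming back - a large fraction of a sub-millisecond MD step before any transfer latency is even counted.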


OpenMM

• Standardized API
• Interface is fully public
• Possibly multiple future implementations
• Commercial libraries A-OK
• Hardware-agnostic plugin architecture
• Stanford, Stockholm, Nvidia & AMD


The OpenMM tile approach

[Figure: the N x N interaction matrix divided into 32x32 tiles, with atom indices 0, 32, 64, 96, ... along both axes.]

Absolute performance critical, not speedup relative to a reference implementation!

All-vs-all (CUDA book): N^2    Newton's 3rd law: (N^2)/2    Sort atoms in tiles: N log N

Scott LeGrand, Peter Eastman
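As a reminder of how a 32-atom tile is processed, here is a minimal CUDA sketch of the classic shared-memory all-vs-all pattern from the CUDA book that the slide refers to; the kernel name, the bare q_i q_j / r^2 force and the lack of exclusions or cutoffs are illustrative assumptions, not the actual OpenMM kernels.

// Hypothetical sketch: each thread owns atom i, the block loads 32 atoms j
// at a time into shared memory (one "tile") and every thread loops over them.
#define TILE 32

__global__ void allVsAllForces(const float4 *xq, float3 *f, int n)
{
    __shared__ float4 tile[TILE];                 // current tile of (x, y, z, q)
    int i = blockIdx.x * TILE + threadIdx.x;
    float4 xi = (i < n) ? xq[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 fi = make_float3(0.f, 0.f, 0.f);

    for (int start = 0; start < n; start += TILE) {
        int j = start + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? xq[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();
        for (int k = 0; k < TILE; k++) {
            int ja = start + k;
            if (i < n && ja < n && ja != i) {
                float dx = xi.x - tile[k].x;
                float dy = xi.y - tile[k].y;
                float dz = xi.z - tile[k].z;
                float rinv  = rsqrtf(dx*dx + dy*dy + dz*dz);
                float fscal = xi.w * tile[k].w * rinv * rinv * rinv;  // q_i q_j / r^3
                fi.x += fscal * dx;
                fi.y += fscal * dy;
                fi.z += fscal * dz;
            }
        }
        __syncthreads();
    }
    if (i < n) f[i] = fi;
}

// Launch with one thread per atom, e.g.:
//   allVsAllForces<<<(n + TILE - 1) / TILE, TILE>>>(d_xq, d_f, n);

The "Newton's 3rd law" and "sort atoms in tiles" variants on the slide cut this work down by computing each tile pair only once and by skipping tiles of spatially distant atoms, respectively.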


Gromacs & OpenMM in practice

• GPUs supported in Gromacs 4.5: mdrun ... -device "OpenMM:Cuda"
• Same input files, same output files: "It just works"
• Subset of features work on GPUs
• Amazing implicit solvent performance
• Supports both Cuda & OpenCL


OpenMM performance over x86 CPU

[Bar charts, ns/day, quad-core x86 vs. Tesla C2050, 2 fs time steps: BPTI (~21k atoms) with PME and reaction-field electrostatics, and Villin (600 atoms) with implicit solvent and the all-vs-all kernel.]


Why?


CPUs/GPUs are good at different things


Tiling circles is difficult!

• You need a lot of cubes to cover a sphere
• All interactions beyond the cutoff need to be zero


The art of calculating zeroes
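A back-of-the-envelope illustration (my numbers, not from the slides): if candidate pairs are taken from the 3x3x3 block of cells of side $r_c$ around an atom, only

$$\frac{\tfrac{4}{3}\pi r_c^3}{(3 r_c)^3} \approx 0.16$$

of them actually lie inside the cutoff sphere, so the overwhelming majority of distances a naive cell scheme evaluates belong to interactions that are exactly zero.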


The 3rd Stage


CPUs have changed too

Cores are fairly cheap
...but they get slower!

Performance by scaling
Cray XE6: ~300 ns/day


Speedup vs. Performance

Hannes Loeffler & Martyn Winn, Daresbury
http://www.cse.scitech.ac.uk/cbg/benchmarks/Report_II.pdf


CPU trick 1: all-bond constraints

• Δt limited by fast motions - 1 fs
• Remove bond vibrations
• SHAKE (iterative, slow) - 2 fs
• Problematic in parallel (won't work)
• Compromise: constrain h-bonds only - 1.4 fs
• GROMACS (LINCS):
  • LINear Constraint Solver
  • Approximate matrix inversion expansion (sketched after the figure below)
  • Fast & stable - much better than SHAKE
  • Non-iterative
  • Enables 2-3 fs timesteps
  • Parallel: P-LINCS (from Gromacs 4.0)

[Figure, the LINCS update: A) move without constraints (t=1 → t=2'); B) project out motion along the bonds (t=2''); C) correct for the rotational extension of the bond (t=2).]
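A simplified sketch of the projection LINCS performs (following Hess et al., J. Comput. Chem. 18, 1463 (1997); notation abbreviated): with B the matrix of constraint directions, M the diagonal mass matrix and d the constraint lengths, the unconstrained positions are corrected as

$$x' = x_{\mathrm{unc}} - M^{-1}B^{T}\,\bigl(BM^{-1}B^{T}\bigr)^{-1}\bigl(Bx_{\mathrm{unc}} - d\bigr).$$

The inverse is never formed explicitly: it is rewritten as $S\,(I - A_n)^{-1}S$ with a diagonal scaling $S$ and a sparse coupling matrix $A_n$ whose eigenvalues are smaller than one, so $(I - A_n)^{-1}$ is approximated by the truncated series $I + A_n + A_n^2 + \dots$ (a few terms suffice). This is what makes LINCS non-iterative and parallelizable; a final correction accounts for the rotational lengthening of the bonds (panel C above).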


CPU trick 2: Virtual sites

• The next fastest motions are H-angle vibrations and rotations of CH3/NH2 groups
• Try to remove them:
  • Ideal H positions are constructed from the heavy atoms
  • CH3/NH2 groups are made rigid
  • Calculate forces, then project them back onto the heavy atoms
  • Integrate only the heavy-atom positions, reconstruct the H's
• Enables 5 fs timesteps! (the simplest construction is sketched after the figure below)

[Figure: virtual-site construction types - 2-site linear (weights a, 1-a), 3fd, 3fad, 3out and 4fd - contrasting the interactions with the remaining degrees of freedom.]
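For the simplest (2-site linear) construction in the figure, both steps are plain linear algebra: a virtual site placed on the line between atoms i and j sits at

$$x_v = (1-a)\,x_i + a\,x_j,$$

and because the construction is linear, the chain rule distributes the force acting on the site exactly as

$$F_i \mathrel{+}= (1-a)\,F_v, \qquad F_j \mathrel{+}= a\,F_v,$$

so neither total force nor total torque is lost when the virtual site is excluded from the integration.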


8th-sphere decomposition

[Figure: communication zones for the (a) half-shell, (b) eighth-shell ("8th-sphere") and (c) midpoint methods, illustrated for a 2D domain decomposition with cutoff radius r_c. The lines with circles show examples of pair interactions assigned to the central cell; for (a) and (b) the assignment is based on the endpoints of the line, for (c) on its midpoint.]

Bonded interactions are assigned to the processor with the home cell where the smallest coordinates reside. This procedure works as long as the largest distance between charge groups involved in bonded interactions is not larger than the cutoff radius. To check that this is the case, we count the number of assigned bonded interactions during domain decomposition and compare it to the total number of bonded interactions in the system.

For full dynamic load balancing the boundaries between the cells need to move during the simulation. For 1D domain decomposition this is trivial, but for a 3D decomposition the cell boundaries in the last two dimensions need to be staggered along the first dimensions to allow for complete load balancing.

[Figure: the zones (0-7, A/A', B/B', C/C') communicated to a neighboring processor in the eighth-shell scheme.]

Load balancing works for arbitrary triclinic cells.


What if we don't want to choose?

CPU: ~0.5 TFLOP, random memory access OK (not great)

vs.

GPU: ~1 TFLOP, random memory access won't work


Multiple Program, Multiple Data Revisited

[Flow chart of one MD step with separate real-space (particle) and PME node groups.

Real-space (particle) nodes: communicate coordinates to construct virtual sites; construct virtual sites; on a neighbor-search step, do domain decomposition and (local) neighborsearching and send charges to the peer PME processor; send x and the box to the peer PME processor; communicate x with real-space neighbor processors; evaluate potential/forces; communicate f with real-space neighbor processors; receive forces/energy/virial from the peer PME processor; spread forces on virtual sites and communicate the virtual-site forces; integrate coordinates; constrain bond lengths (parallel LINCS); sum energies over all real-space processors; repeat while more steps remain.

PME nodes: on a neighbor-search step, receive charges from the peer real-space processors; receive x and the box from the peer real-space processors until all local coordinates have arrived; communicate some atoms to neighbor PME processors; spread charges on the grid; communicate grid overlap with PME neighbor processors; parallel 3D FFT; solve PME (convolution); parallel inverse 3D FFT; communicate grid overlap again; interpolate forces from the grid; communicate some forces to neighbor PME processors; send forces/energy/virial to the peer real-space processors; repeat while more steps remain.]


Gromacs 4.6: Balancing over heterogeneous compute resources

CPU: proximity searching, bonded interactions, PME, integration (plus communication)
GPU: direct-space nonbonded interactions

Required a new multi-threaded PME solver!
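The reason work can be shifted between the two sides at all (a general property of PME-style methods, not something stated on this slide) is the Ewald split of the electrostatics:

$$E_{\mathrm{coul}} = \underbrace{\tfrac{1}{2}\sum_{i\neq j,\; r_{ij}<r_c} \frac{q_i q_j\,\mathrm{erfc}(\beta r_{ij})}{r_{ij}}}_{\text{direct space (GPU)}} \;+\; \underbrace{E_{\mathrm{recip}}(\beta,\ \text{mesh})}_{\text{PME mesh (CPU)}} \;+\;\dots$$

Increasing $r_c$ while scaling $\beta$ so that $\mathrm{erfc}(\beta r_c)$ stays fixed moves work from the CPU mesh part to the GPU direct-space kernel at essentially constant accuracy; the "PME auto-tuned cutoff" in the later benchmark slides exploits exactly this freedom.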


R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:           Nodes   Number   G-Cycles   Seconds      %
-----------------------------------------------------------------------
 Neighbor search          1      332     18.979       5.9    16.4
 Launch GPU calc.         1     3311      0.869       0.3     0.8
 Force                    1     3311     28.572       8.9    24.7
 PME mesh                 1     3311     45.625      14.2    39.5
 Wait for GPU calc.       1     3311      0.132       0.0     0.1
 Write traj.              1        1      0.424       0.1     0.4
 Update                   1     3311      4.196       1.3     3.6
 Constraints              1     3311      5.393       1.7     4.7
 Rest                     1              11.419       3.6     9.9
-----------------------------------------------------------------------
 Total                    1             115.610      36.0   100.0
-----------------------------------------------------------------------
 PME spread/gather        1     6622     32.128      10.0    27.8
 PME 3D-FFT               1     6622     10.374       3.2     9.0
 PME solve                1     3311      3.084       1.0     2.7
-----------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------
 Computing:                Number   Seconds   ms/step      %
-----------------------------------------------------------------------
 Neighborlist H2D             332      0.46     1.388    2.4
 Nonbonded H2D               3311      0.39     0.118    2.0
 Nonbonded calc.             3311     18.17     5.488   94.4
 Nonbonded D2H               3311      0.24     0.071    1.2
-----------------------------------------------------------------------
 Total                                19.26     5.817  100.0
-----------------------------------------------------------------------

Force evaluation time GPU/CPU: 5.817 ms / 6.971 ms = 0.834
For optimal performance this ratio should be 1!

               NODE (s)   Real (s)      (%)
 Time:          143.360     35.964    398.6
                   2:23
               (Mnbf/s)   (GFlops)  (ns/day)  (hour/ns)
 Performance:     0.000      4.831    15.909      6.014


Gromacs Gen3-GPU strategy

• Single kernel: push data, compute, return data
• Extremely low amount of GPU code
• Every single feature of Gromacs works
  • Triclinic boxes
  • Pressure coupling
  • Virtual sites, all-bond constraints
  • All force fields supported
• Automatic multithread & GPU balancing


Performance

[Bar charts, ns/day (0-50): CPU only vs. CPU+GPU for a 4-core Phenom II X4 with a GTX 470, and a 4-core Core i5 with a C2050. Benchmark: Lysozyme, 25k atoms, rhombic dodecahedron box, virtual sites enabled, 5 fs time steps.]


[Timeline of one MD step:

CPU: pair search (on pair-search steps, every 10-20 iterations) → transfer pair-list to the GPU → transfer x,q → bonded F + PME → busy-wait for the GPU → transfer F, E back → integration, constraints.

GPU: idle during the pair search and the integration; computes the non-bonded F and prunes the pair list while the CPU works on bonded F and PME.

Avg. overlap CPU-GPU: 65-70% per iteration.]
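A hedged host-side sketch of this overlap pattern; all buffer names, helper functions and launch parameters below are placeholders rather than the actual Gromacs 4.6 code - only the CUDA runtime calls themselves are real.

// Sketch of the per-step CPU/GPU overlap in the timeline above, assuming a
// nonbonded kernel nb_kernel() and CPU-side helpers defined elsewhere.
#include <cuda_runtime.h>

__global__ void nb_kernel(const float4 *xq, const int *pairlist, float3 *f);
void cpu_pair_search(int *h_pairlist);                     // placeholder helpers
void cpu_bonded_and_pme(float3 *f_cpu);
void cpu_reduce_integrate_constrain(const float3 *f_gpu, const float3 *f_cpu);

void md_loop(int nsteps, int nstlist,
             float4 *h_xq, int *h_pairlist, float3 *h_f, float3 *h_f_cpu,
             float4 *d_xq, int *d_pairlist, float3 *d_f,
             size_t xq_bytes, size_t list_bytes, size_t f_bytes,
             dim3 grid, dim3 block)
{
    cudaStream_t s;
    cudaStreamCreate(&s);
    for (int step = 0; step < nsteps; step++) {
        if (step % nstlist == 0) {                         // pair search every 10-20 steps
            cpu_pair_search(h_pairlist);
            cudaMemcpyAsync(d_pairlist, h_pairlist, list_bytes,
                            cudaMemcpyHostToDevice, s);    // transfer pair-list
        }
        cudaMemcpyAsync(d_xq, h_xq, xq_bytes, cudaMemcpyHostToDevice, s);  // transfer x,q
        nb_kernel<<<grid, block, 0, s>>>(d_xq, d_pairlist, d_f);           // non-bonded F
        cudaMemcpyAsync(h_f, d_f, f_bytes, cudaMemcpyDeviceToHost, s);     // transfer F, E

        cpu_bonded_and_pme(h_f_cpu);                       // overlaps with the GPU work
        cudaStreamSynchronize(s);                          // "busy-wait for GPU"
        cpu_reduce_integrate_constrain(h_f, h_f_cpu);      // integration + (P-)LINCS
    }
    cudaStreamDestroy(s);
}

The point of the single stream is simply that the H2D copy, the kernel and the D2H copy are queued asynchronously, so the CPU can start its bonded and PME work immediately and only synchronizes once its own force work is done.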


[Figure: an i-j supercell pair within the rlist radius, subdivided along x, y and z into j-i subcell pairs.]

We term this the tiles-in-voxels, or "3D Tixels", approach - see the poster by Szilard Pall!


State-of-the-art efficiency for nonbonded GPU kernels


[Timeline of one MD step with domain decomposition, using two GPU streams:

CPU: MPI receive of non-local x → transfer local x,q → transfer non-local x,q → bonded F + PME → wait for the non-local F → MPI send of non-local F → wait for the local F → integration, constraints.

GPU, local stream: local non-bonded F → transfer local F.
GPU, non-local stream: non-local non-bonded F → transfer non-local F.]


[Plot: CUDA non-bonded force kernel weak scaling with PME, cutoff = 1.0 nm. Iteration time per 1000 atoms (0-0.4 ms/step) vs. system size (1.5k to 3072k atoms) for Tesla M2090, Tesla C2075, GeForce GTX 580 and GeForce GTX 550 Ti.]


[Plot: PME weak scaling, Xeon X5650 3T + C2075 per process. Iteration time per 1000 atoms (0-0.35 ms/step) vs. system size per GPU (1.5k to 3072k atoms), showing the 1x C2075 CUDA force kernel alone and the CPU totals for 1, 2 and 4 C2075 GPUs.]


Current performance limit?

How fast can we be when the limit is latency?

[Bar chart, μs/day (0-0.6): 4T + C2050 vs. 8T + C2050. Benchmark: 3000 atoms (ridiculously small), 5 fs steps, 2x Xeon E5620 2.4 GHz, C2050 with ECC enabled.]
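To see why latency becomes the limit (simple arithmetic from the slide's numbers): 0.5 μs/day at 5 fs per step is

$$\frac{0.5\,\mu\mathrm{s/day}}{5\,\mathrm{fs/step}} = 10^{8}\ \mathrm{steps/day} \approx 1160\ \mathrm{steps/s},$$

i.e. under a millisecond of wall time per MD step, a regime where kernel-launch, PCIe and synchronization latencies matter as much as arithmetic throughput.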


State of GROMACS today - 24k atom protein


[Plot: strong scaling of reaction-field and PME for a 1.5M-atom waterbox; RF cutoff = 0.9 nm, PME with auto-tuned cutoff. Performance (0.1-100 ns/day, log scale) vs. number of processes/GPUs, with linear-scaling reference lines for both RF and PME.]


The Future?

Multi-core
Multi-architecture


Acknowledgments

• GROMACS: Berk Hess, David van der Spoel, Per Larsson
• Gromacs-GPU: Szilard Pall, Berk Hess
• Multi-threaded PME: Roland Schulz, Berk Hess
• Gromacs-OpenMM: Rossen Apostolov, Szilard Pall, Peter Eastman, Vijay Pande
• Nvidia: Scott LeGrand, Duncan Poole, Andrew Walsh, Chris Butler


Research areas: Bioinformatics & Structure Modeling; Simulation Software & Distributed Computing; Ion Channel Studies (Kv1.2, NaV1.7, GlyRa1); Screening, Docking & Free Energy Calculation

Group: Björn Wallner, Arjun Ray, Per Larsson, Samuel Murail, Aron Hennerdal, Pär Bjelkmar, Christine Schwaiger, Peter Kasson, Teemu Murtola, Berk Hess, Rossen Apostolov, Wiktor Jurkowski, Sander Pronk, Szilard Pall, Anna Johansson, Björn Wesén

Collaborators: Arne Elofsson & Gunnar von Heijne, Stockholm University; Vijay Pande, Stanford University; Jim Trudell & Edward Bertaccini, Stanford Medical School
