Computational Chemistry Benchmarks - Nvidia

Updated: February 4, 2013


Molecular Dynamics (MD) Applications

Application | Features Supported | GPU Perf | Release Status | Notes/Benchmarks
AMBER | PMEMD explicit solvent & GB implicit solvent | >100 ns/day, JAC NVE on 2x K20s | Released; multi-GPU, multi-node | AMBER 12, GPU support revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks
CHARMM | Implicit (5x) and explicit (2x) solvent via OpenMM | 2x C2070 equals 32-35x X5667 CPUs | Released; single & multi-GPU in a single node | Release C37b1; http://www.charmm.org/news/c37b1.html#postjump
DL_POLY | Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV | 4x | Released, Version 4.03; multi-GPU, multi-node | Source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx
GROMACS | Implicit (5x), explicit (2x) | 165 ns/day, DHFR on 4x C2075s | Released; multi-GPU, multi-node | Release 4.6; first multi-GPU support
LAMMPS | Lennard-Jones, Gay-Berne, Tersoff & many more potentials | 3.5-18x on Titan | Released; multi-GPU, multi-node | http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan
NAMD | Full electrostatics with PME and most simulation features | 4.0 ns/day, F1-ATPase on 1x K20X | Released; 100M-atom capable; multi-GPU, multi-node | NAMD 2.9

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
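Most rows above quote throughput in ns/day. As a reference, converting raw simulation speed into ns/day is simple arithmetic; a minimal sketch, with an illustrative timestep and step rate that are not taken from the table:

```python
# ns/day = timestep (fs/step) x steps per second x seconds per day / 1e6 (fs -> ns)
def ns_per_day(timestep_fs: float, steps_per_second: float) -> float:
    return timestep_fs * steps_per_second * 86_400 / 1_000_000

# Example (illustrative values): a 2 fs timestep advanced at ~580 steps/s is ~100 ns/day.
print(f"{ns_per_day(2.0, 580):.0f} ns/day")
```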


New/Additional MD Applications Ramping

Application | Features Supported | GPU Perf | Release Status | Notes
Abalone | Simulations | 4-29x (on 1060 GPU) | Released, Version 1.8.51; single GPU | Agile Molecule, Inc.
Ascalaph | Computation of non-valent interactions | 4-29x (on 1060 GPU) | Released, Version 1.1.4; single GPU | Agile Molecule, Inc.
ACEMD | Written for use only on GPUs | 150 ns/day, DHFR on 1x K20 | Released; single and multi-GPU | Production bio-molecular dynamics (MD) software specially optimized to run on GPUs
Folding@Home | Powerful distributed computing molecular dynamics system; implicit solvent and folding | Depends upon number of GPUs | Released; GPUs and CPUs | http://folding.stanford.edu; GPUs get 4x the points of CPUs
GPUGrid.net | High-performance all-atom biomolecular simulations; explicit solvent and binding | Depends upon number of GPUs | Released; NVIDIA GPUs only | http://www.gpugrid.net/
HALMD | Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations) | Up to 66x on 2090 vs. 1 CPU core | Released, Version 0.2.0; single GPU | http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen
HOOMD-Blue | Written for use only on GPUs | Kepler 2x faster than Fermi | Released, Version 0.11.2; single and multi-GPU on 1 node | http://codeblue.umich.edu/hoomd-blue/; multi-GPU w/ MPI in March 2013
OpenMM | Implicit and explicit solvent, custom forces | Implicit: 127-213 ns/day; explicit: 18-55 ns/day (DHFR) | Released, Version 4.1.1; multi-GPU | Library and application for molecular dynamics on high-performance

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Quantum Chemistry Applications

Application | Features Supported | GPU Perf | Release Status | Notes
Abinit | Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization | 1.3-2.7x | Released, Version 7.0.5; multi-GPU support | www.abinit.org
ACES III | Integrating scheduling GPU into SIAL programming language and SIP runtime environment | 10x on kernels | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
ADF | Fock matrix, Hessians | TBD | Pilot project completed, under development; multi-GPU support | www.scm.com
BigDFT | DFT; Daubechies wavelets, part of Abinit | 5-25x (1 CPU core to GPU kernel) | Released June 2009, current release 1.6.0; multi-GPU support | http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf
Casino | TBD | TBD | Under development, Spring 2013 release; multi-GPU support | http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
CP2K | DBCSR (sparse matrix multiply library) | 2-7x | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
GAMESS-US | Libqc with Rys Quadrature algorithm; Hartree-Fock, MP2 and CCSD in Q4 2012 | 1.3-1.6x, 2.3-2.9x HF | Released; multi-GPU support | Next release Q4 2012; http://www.msg.ameslab.gov/gamess/index.html

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Quantum Chemistry Applications

Application | Features Supported | GPU Perf | Release Status | Notes
GAMESS-UK | (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics | 8x | Released in 2012; multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963
Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development; multi-GPU support | Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm
GPAW | Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS) | 8x | Released; multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
Jaguar | Investigating GPU acceleration | TBD | Under development; multi-GPU support | Schrodinger, Inc.; http://www.schrodinger.com/kb/278
MOLCAS | cuBLAS support | 1.1x | Released, Version 7.8; single GPU, additional GPU support coming in Version 8 | www.molcas.org
MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT | 1.7-2.3x projected | Under development; multiple GPUs | www.molpro.net; Hans-Joachim Werner

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Quantum Chemistry Applications

Application | Features Supported | GPU Perf | Release Status | Notes
MOPAC2009 | Pseudodiagonalization, full diagonalization, and density matrix assembling | 3.8-14x | Under development; single GPU | Academic port; http://openmopac.net
NWChem | Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers | 3-10x projected | Release targeting March 2013; multiple GPUs | Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf
Octopus | DFT and TDDFT | TBD | Released | http://www.tddft.org/programs/octopus/
PEtot | Density functional theory (DFT) plane wave pseudopotential calculations | 6-10x | Released; multi-GPU | First-principles materials code that computes the behavior of the electron structures of materials
Q-CHEM | RI-MP2 | 8x-14x | Released, Version 4.0 | http://www.qchem.com/doc_for_web/qchem_manual_4.0.pdf

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Quantum Chemistry Applications

Application | Features Supported | GPU Perf | Release Status | Notes
QMCPACK | Main features | 3-4x | Released; multiple GPUs | NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK
Quantum Espresso/PWscf | PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs | 2.5-3.5x | Released, Version 5.0; multiple GPUs | Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/
TeraChem | "Full GPU-based solution" | 44-650x vs. GAMESS CPU version | Released, Version 1.5; multi-GPU/single node | Completely redesigned to exploit GPU parallelism; YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf
VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2x; 2 GPUs comparable to 128 CPU cores | Available on request; multiple GPUs | By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf
WL-LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development; multi-GPU support | NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Viz, "Docking" and Related Applications Growing

Related Applications | Features Supported | GPU Perf | Release Status | Notes
Amira 5® | 3D visualization of volumetric data and surfaces | 70x | Released, Version 5.3.3; single GPU | Visualization from Visage Imaging; next release, 5.4, will use the GPU for general-purpose processing in some functions; http://www.visageimaging.com/overview.html
BINDSURF | Allows fast processing of large ligand databases | 100x | Available upon request to authors; single GPU | High-throughput parallel blind virtual screening; http://www.biomedcentral.com/1471-2105/13/S14/S13
BUDE | Empirical free energy forcefield | 6.5-13.4x | Released; single GPU | University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm
Core Hopping | GPU-accelerated application | 3.75-5000x | Released, Suite 2011; single and multi-GPU | Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/
FastROCS | Real-time shape similarity searching/comparison | 800-3000x | Released; single and multi-GPU | OpenEye Scientific Software; http://www.eyesopen.com/fastrocs
PyMol | Lines: 460% increase; cartoons: 1246%; surface: 1746%; spheres: 753%; ribbon: 426% | 1700x | Released, Version 1.5; single GPU | http://pymol.org/
VMD | High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals | 100-125x or greater on kernels | Released, Version 1.9 | Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Bioinformatics Applications

Application | Features Supported | GPU Speedup | Release Status | Website
BarraCUDA | Alignment of short sequencing reads | 6-10x | Version 0.6.2 (3/2012); multi-GPU, multi-node | http://seqbarracuda.sourceforge.net/
CUDASW++ | Parallel search of Smith-Waterman database | 10-50x | Version 2.0.8 (Q1/2012); multi-GPU, multi-node | http://sourceforge.net/projects/cudasw/
CUSHAW | Parallel, accurate long-read aligner for large genomes | 10x | Version 1.0.40 (6/2012); multiple GPUs | http://cushaw.sourceforge.net/
GPU-BLAST | Protein alignment according to BLASTP | 3-4x | Version 2.2.26 (3/2012); single GPU | http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html
GPU-HMMER | Parallel local and global search of Hidden Markov Models | 60-100x | Version 2.3.2 (Q1/2012); multi-GPU, multi-node | http://www.mpihmmer.org/installguideGPUHMMER.htm
mCUDA-MEME | Scalable motif discovery algorithm based on MEME | 4-10x | Version 3.0.12; multi-GPU, multi-node | https://sites.google.com/site/yongchaosoftware/mcuda-meme
SeqNFind | Hardware and software for reference assembly, blast, SW, HMM, de novo assembly | 400x | Released; multi-GPU, multi-node | http://www.seqnfind.com/
UGENE | Fast short read alignment | 6-8x | Version 1.11 (5/2012); multi-GPU, multi-node | http://ugene.unipro.ru/
WideLM | Parallel linear regression on multiple similarly-shaped models | 150x | Version 0.1-1 (3/2012); multi-GPU, multi-node | http://insilicos.com/products/widelm

GPU Perf compared against same or similar code running on a single-CPU machine. Performance measured internally or independently.


Performance Relative to CPU Only

[Chart: MD average speedups for CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, and CPU + 2x K20X nodes]
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.
Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases.
Error bars show the maximum and minimum speedup for each hardware configuration.
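A minimal sketch of how the averaged bars and min/max error bars above are computed; the per-application speedup values below are hypothetical placeholders, since the chart reports only the aggregated result:

```python
# Average speedup for one hardware configuration, with min/max "error bars".
# Placeholder values standing in for the 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases.
speedups_cpu_plus_k20x = [7.4, 3.1, 2.9, 2.6, 5.5, 3.3, 2.8, 3.0, 2.9, 2.8, 3.0]

avg = sum(speedups_cpu_plus_k20x) / len(speedups_cpu_plus_k20x)
lo, hi = min(speedups_cpu_plus_k20x), max(speedups_cpu_plus_k20x)
print(f"average {avg:.1f}x, error bar from {lo:.1f}x to {hi:.1f}x")
```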






Built from Ground Up for GPUs
Computational Chemistry

What: Study disease & discover drugs; predict drug and protein interactions.
Why: Speed of simulations is critical. Enables study of longer timeframes, larger systems, and more simulations.
How: GPUs increase throughput & accelerate simulations.

AMBER 11 Application: 4.6x performance increase with 2 GPUs with only a 54% added cost*
* AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
* Cost of CPU node assumed to be $9,333. Cost of adding two (2) 2090s to a single node is assumed to be $5,333.

GPU READY APPLICATIONS: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem


AMBER 12
GPU Support Revision 12.2
1/22/2013


Kepler - Our Fastest Family of GPUs Yet

[Chart: Factor IX throughput, nanoseconds/day]
Running AMBER 12 GPU Support Revision 12.1.
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU.
Factor IX throughput: 3.42 ns/day (1 CPU node), 11.85 ns/day (3.5x, + M2090), 18.90 ns/day (5.6x, + K10), 22.44 ns/day (6.6x, + K20), 25.39 ns/day (7.4x, + K20X).
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.


K10 Accelerates Simulations of All Sizes

[Chart: speedup compared to CPU only across the AMBER benchmark suite]
Running AMBER 12 GPU Support Revision 12.1.
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10 GPU.
Benchmarks: TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), Nucleosome (GB). Speedups over the CPU-only node: 2.00x, 5.04x, 5.50x, 5.53x, 19.98x, and 24.00x.
Gain 24x performance by adding just 1 GPU when compared to dual-CPU performance (Nucleosome).


K20 Accelerates Simulations of All Sizes

[Chart: speedup compared to CPU only across the AMBER benchmark suite]
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.
Benchmarks: TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), Nucleosome (GB). Speedups over the CPU-only baseline (1.00x): 2.66x, 6.50x, 6.56x, 7.28x, 25.56x, and 28.00x.
Gain 28x throughput/performance by adding just one K20 GPU when compared to dual-CPU performance (Nucleosome).
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012


K20X Accelerates Simulations of All Sizes

[Chart: speedup compared to CPU only across the AMBER benchmark suite]
Running AMBER 12 GPU Support Revision 12.1.
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K20X GPU.
Benchmarks: TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), Nucleosome (GB). Speedups over the CPU-only node: 2.79x, 7.15x, 7.43x, 8.30x, 28.59x, and 31.30x.
Gain 31x performance by adding just one K20X GPU when compared to dual-CPU performance (Nucleosome).
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012


K10 Strong Scaling over Nodes

[Chart: Cellulose 408K atoms (NPT), nanoseconds/day vs. number of nodes (1, 2, 4)]
Running AMBER 12 with CUDA 4.2, ECC off.
The blue nodes contain 2x Intel X5670 CPUs (6 Cores per CPU). The green nodes contain 2x Intel X5670 CPUs (6 Cores per CPU) plus 2x NVIDIA K10 GPUs.
GPU-accelerated nodes delivered speedups of 2.4x, 3.6x, and 5.1x over CPU-only nodes across the 1-, 2-, and 4-node runs.
GPUs significantly outperform CPUs while scaling over multiple nodes.


Kepler - Universally Faster

[Chart: speedups compared to CPU only for JAC, Factor IX, and Cellulose]
Running AMBER 12 GPU Support Revision 12.1.
The CPU-only node contains Dual E5-2687W CPUs (8 Cores per CPU). The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU.
The Kepler GPUs accelerated all simulations (JAC, Factor IX, Cellulose), up to 8x.


K10 Extreme Performance

[Chart: JAC 23K atoms (NVE), nanoseconds/day]
Running AMBER 12 GPU Support Revision 12.1.
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). The green node contains Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10 GPUs.
DHFR (JAC) throughput: 12.47 ns/day on the CPU-only node vs. 97.99 ns/day on the GPU-accelerated node.
Gain 7.8x performance by adding just 2 GPUs when compared to dual-CPU performance.
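The speedup figures quoted throughout this deck are simple throughput ratios; a minimal check using the JAC numbers from the chart above:

```python
# Speedup = GPU-node throughput / CPU-only throughput (ns/day values from the chart above).
cpu_ns_per_day = 12.47   # dual E5-2687W node
gpu_ns_per_day = 97.99   # same node + 2x K10

speedup = gpu_ns_per_day / cpu_ns_per_day
print(f"Speedup: {speedup:.1f}x")   # ~7.9x, reported as 7.8x on the slide
```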


K20 Extreme Performance

[Chart: DHFR JAC 23K atoms (NVE), nanoseconds/day]
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 2x NVIDIA K20 GPUs.
DHFR throughput: 12.47 ns/day on the CPU-only node vs. 95.59 ns/day on the GPU-accelerated node.
Gain >7.5x throughput/performance by adding just 2 K20 GPUs when compared to dual-CPU performance.
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012


Replace 8 Nodes with 1 K20 GPU

[Chart: DHFR throughput (nanoseconds/day) and node cost]
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.
Eight CPU nodes: 65.00 ns/day at $32,000. One CPU node + 1x K20: 81.09 ns/day at $6,500.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Cut down simulation costs to ¼ and gain higher performance.
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
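A quick check of the throughput and cost claims above, using only the numbers shown on the slide:

```python
# Figures from the slide: 8 CPU-only nodes vs. 1 CPU node + 1x K20.
cpu_cluster_cost, cpu_cluster_ns_day = 32_000, 65.00
gpu_node_cost,   gpu_node_ns_day     = 6_500, 81.09

print(f"Cost ratio: {gpu_node_cost / cpu_cluster_cost:.2f}")              # ~0.20 of the cluster cost
print(f"Throughput ratio: {gpu_node_ns_day / cpu_cluster_ns_day:.2f}x")   # ~1.25x higher performance
```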


Replace 7 Nodes with 1 K10 GPU

[Chart: performance on JAC NVE (nanoseconds/day) and node cost]
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU). The green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K10 GPU.
CPU-only cluster cost: $32,000. GPU-enabled node cost: $7,000.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Cut down simulation costs to ¼ and increase performance by 70% (DHFR).
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012


Extra CPUs Decrease Performance

[Chart: Cellulose NVE, nanoseconds/day for CPU only and CPU with dual K20s, each with 1 or 2 CPUs]
Running AMBER 12 GPU Support Revision 12.1.
The orange bars use one E5-2687W CPU (8 Cores per CPU); the blue bars use Dual E5-2687W CPUs (8 Cores per CPU).
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.


Kepler - Greener Science

[Chart: energy used in simulating 1 ns of DHFR (JAC), kJ; lower is better]
Running AMBER 12 GPU Support Revision 12.1.
The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235W each).
Energy Expended = Power x Time.
The GPU-accelerated systems use 65-75% less energy.
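A minimal sketch of the energy comparison above. The TDPs come from the slide; the throughput figures are illustrative, reusing the CPU-only JAC value quoted earlier in this deck (12.47 ns/day) and assuming roughly 7x higher throughput for the K20X node:

```python
# Energy to simulate 1 ns = node power (W) x wall-clock time (s), expressed in kJ.
SECONDS_PER_DAY = 86_400

def energy_kj(node_power_w: float, ns_per_day: float) -> float:
    time_s = SECONDS_PER_DAY / ns_per_day      # seconds of wall clock per simulated ns
    return node_power_w * time_s / 1_000       # J -> kJ

cpu_only = energy_kj(2 * 150, 12.47)           # dual 150 W CPUs, JAC throughput from this deck
gpu_node = energy_kj(2 * 150 + 235, 12.47 * 7) # add one 235 W K20X, ~7x throughput (assumed)

print(f"CPU only: {cpu_only:.0f} kJ/ns, GPU node: {gpu_node:.0f} kJ/ns")
print(f"Energy saving: {1 - gpu_node / cpu_only:.0%}")   # lands in the 65-75% range shown above
```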


Recommended GPU Node Configuration for AMBER Computational Chemistry

Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
CPU speed (GHz): 2.66+
System memory per node (GB): 16
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good for 4 fast serial GPU runs)
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 16x or higher
Server storage: 2 TB
Network configuration: InfiniBand QDR or better

Scale to multiple nodes with same single node configuration.
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012


Benefits of GPU AMBER Accelerated Computing
- Faster than CPU-only systems in all tests
- Most major compute-intensive aspects of classical MD ported
- Large performance boost with marginal price increase
- Energy usage cut by more than half
- GPUs scale well within a node and over multiple nodes
- K20 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012


NAMD 2.9


Kepler - Our Fastest Family of GPUs Yet

[Chart: ApoA1 (Apolipoprotein A1) throughput, nanoseconds/day]
Running NAMD version 2.9.
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU.
ApoA1 throughput: 1.37 ns/day (1 CPU node), 2.63 ns/day (1.9x, + M2090), 3.45 ns/day (2.5x, + K10), 3.57 ns/day (2.6x, + K20), 4.00 ns/day (2.9x, + K20X).
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node.
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012


Accelerates Simulations of All Sizes

[Chart: speedup compared to CPU only for ApoA1, F1-ATPase, and STMV]
Running NAMD 2.9 with CUDA 4.0, ECC off.
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.
Speedups of 2.4x-2.7x across ApoA1, F1-ATPase, and STMV.
Gain 2.5x throughput/performance by adding just 1 GPU when compared to dual-CPU performance (Apolipoprotein A1).
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012


Kepler - Universally Faster

[Chart: speedup compared to CPU only for F1-ATPase, ApoA1, and STMV; average acceleration printed in bars]
Running NAMD version 2.9.
The CPU-only node contains Dual E5-2687W CPUs (8 Cores per CPU). The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.
Average accelerations: 2.4x (1x K10), 2.6x (1x K20), 2.9x (1x K20X), 4.3x (2x K10), 4.7x (2x K20), 5.1x (2x K20X).
The Kepler GPUs accelerate all simulations, up to 5x (F1-ATPase).


Outstanding Strong Scaling with Multi-STMV

[Chart: 100 STMV on hundreds of nodes, nanoseconds/day vs. number of nodes (32-768)]
Running NAMD version 2.9.
Each blue XE6 CPU node contains 1x AMD 1600 Opteron (16 Cores per CPU). Each green XK6 CPU+GPU node contains 1x AMD 1600 Opteron (16 Cores per CPU) and an additional 1x NVIDIA X2090 GPU.
GPU-accelerated XK6 nodes delivered 2.7x-3.8x the CPU-only XE6 throughput across 32 to 768 nodes (concatenation of 100 Satellite Tobacco Mosaic Virus systems).
Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.


Replace 3 Nodes with 1 2090 GPU

[Chart: F1-ATPase throughput (nanoseconds/day) and node cost]
Running NAMD version 2.9.
Each blue node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU). The green node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU) and 1x NVIDIA M2090 GPU.
4 CPU nodes: 0.63 ns/day at $8,000. 1 CPU node + 1x M2090: 0.74 ns/day at $4,000.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Speedup of 1.2x for 50% the cost (F1-ATPase).


K20 - Greener: Twice The Science Per Watt

[Chart: energy used in simulating 1 nanosecond of ApoA1, kJ; lower is better]
Running NAMD version 2.9.
Each blue node contains Dual E5-2687W CPUs (95W, 4 Cores per CPU). Each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 Cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU).
Energy Expended = Power x Time.
Cut down energy usage by ½ with GPUs.
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012


Kepler - Greener: Twice The Science/Joule

[Chart: energy used in simulating 1 ns of STMV, kJ; lower is better]
Running NAMD version 2.9.
The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).
Energy Expended = Power x Time.
Cut down energy usage by ½ with GPUs (Satellite Tobacco Mosaic Virus).


Recommended GPU Node Configuration for NAMD Computational Chemistry

Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand

Scale to multiple nodes with same single node configuration.
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012


Summary/Conclusions
Benefits of GPU Accelerated Computing
- Faster than CPU-only systems in all tests
- Large performance boost with small marginal price increase
- Energy usage cut in half
- GPUs scale very well within a node and over multiple nodes
- Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date

Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012


LAMMPS, Jan. 2013 or later


More Science for Your Money

[Chart: Embedded Atom Model, speedup compared to CPU only]
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU). Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).
Speedups over the CPU-only node: 1.7x, 2.47x, 2.92x, 3.3x, 4.5x, and 5.5x across the single- and dual-GPU K10, K20, and K20X configurations.
Experience performance increases of up to 5.5x with Kepler GPU nodes.


K20X, the Fastest GPU Yet

[Chart: speedup relative to CPU alone for CPU only, CPU + 2x M2090, CPU + K20X, and CPU + 2x K20X]
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU). Green nodes have 2x E5-2687W and 2 NVIDIA M2090 or K20X GPUs (235W).
Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s.


Get a CPU Rebate to Fund Part of Your GPU Budget

[Chart: acceleration in loop time computation by additional GPUs, normalized to CPU only]
Running NAMD version 2.9.
The blue node contains Dual X5670 CPUs (6 Cores per CPU). The green nodes contain Dual X5570 CPUs (4 Cores per CPU) and 1-4 NVIDIA M2090 GPUs.
Acceleration: 5.31x (1x M2090), 9.88x (2x M2090), 12.9x (3x M2090), 18.2x (4x M2090).
Increase performance 18x when compared to CPU-only nodes. Cheaper CPUs used with GPUs AND still faster overall performance when compared to more expensive CPUs!


Excellent Strong Scaling on Large Clusters

[Chart: LAMMPS Gay-Berne, 134M atoms; loop time (seconds) vs. nodes (300-900)]
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU). Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090.
GPU-accelerated XK6 speedups of 3.55x, 3.48x, and 3.45x across the node counts measured.
From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance compared to XE6 CPU nodes.
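The speedups above come from comparing loop times at the same node count; a minimal sketch with hypothetical loop times, since the chart reports only the resulting ratios (~3.45x-3.55x):

```python
# Strong-scaling comparison at a fixed node count: speedup = CPU-only loop time / GPU-accelerated loop time.
# Loop times below are hypothetical placeholders consistent with ~3.5x ratios.
xe6_loop_time_s = {300: 510.0, 600: 260.0, 900: 175.0}   # CPU-only Cray XE6 (placeholders)
xk6_loop_time_s = {300: 144.0, 600:  75.0, 900:  50.0}   # GPU-accelerated Cray XK6 (placeholders)

for nodes in sorted(xe6_loop_time_s):
    speedup = xe6_loop_time_s[nodes] / xk6_loop_time_s[nodes]
    print(f"{nodes} nodes: {speedup:.2f}x")
```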


GPUs Sustain 5x Performance for Weak Scaling

[Chart: weak scaling with 32K atoms per node; loop time (seconds) vs. nodes (1-729)]
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU). Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090.
Speedups of 4.8x, 5.8x, and 6.7x annotated on the chart for the GPU-accelerated nodes.
Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.


Faster, Greener - Worth It!

[Chart: energy consumed in one loop of EAM, kJ; lower is better]
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU) and CUDA 4.2.9. Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
Energy Expended = Power x Time. Power calculated by combining the components' TDPs.
GPU-accelerated computing uses 53% less energy than CPU only.
Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive


Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer
W. Michael Brown
National Center for Computational Sciences, Oak Ridge National Laboratory
NVIDIA Technology Theater, Supercomputing 2012, November 14, 2012


Early Kepler Benchmarks on Titan

[Charts: loop time (seconds) vs. number of nodes (1-128) for the Atomic Fluid and Bulk Copper benchmarks, comparing XK6 (CPU only), XK6+GPU, and XK7+GPU nodes.]


Early Kepler Benchmarks on Titan

[Charts: loop time (seconds) vs. number of nodes (1-128) for the Protein and Liquid Crystal benchmarks, comparing XK6 (CPU only), XK6+GPU, and XK7+GPU nodes.]


Early Titan XK6/XK7 Benchmarks

Speedup with Acceleration on XK6/XK7 Nodes (1 Node = 32K Particles; 900 Nodes = 29M Particles)

Configuration | Atomic Fluid (cutoff = 2.5σ) | Atomic Fluid (cutoff = 5.0σ) | Bulk Copper | Protein | Liquid Crystal
XK6 (1 Node) | 1.92 | 4.33 | 2.12 | 2.6 | 5.82
XK7 (1 Node) | 2.90 | 8.38 | 3.66 | 3.36 | 15.70
XK6 (900 Nodes) | 1.68 | 3.96 | 2.15 | 1.56 | 5.60
XK7 (900 Nodes) | 2.75 | 7.48 | 2.86 | 1.95 | 10.14
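One way to read the table above is to divide the XK7 speedup by the XK6 speedup at the same node count, which isolates the gain from the newer-generation node; a small sketch using the 1-node column:

```python
# Relative gain of XK7 over XK6 nodes, computed from the speedup table above (1-node column).
xk6_1node = {"Atomic Fluid (2.5 sigma)": 1.92, "Atomic Fluid (5.0 sigma)": 4.33,
             "Bulk Copper": 2.12, "Protein": 2.6, "Liquid Crystal": 5.82}
xk7_1node = {"Atomic Fluid (2.5 sigma)": 2.90, "Atomic Fluid (5.0 sigma)": 8.38,
             "Bulk Copper": 3.66, "Protein": 3.36, "Liquid Crystal": 15.70}

for benchmark, xk6 in xk6_1node.items():
    print(f"{benchmark}: XK7/XK6 = {xk7_1node[benchmark] / xk6:.2f}x")
# Liquid Crystal shows the largest generational gain (~2.7x); Protein the smallest (~1.3x).
```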


Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand

Scale to multiple nodes with same single node configuration.


GROMACS 4.6 Final, Pre-Beta and 4.6 Beta


Kepler - Our Fastest Family of GPUs Yet

[Chart: GROMACS 4.6 final release, Waters (192K atoms), ns/day]
Running GROMACS 4.6 Final Release.
The blue nodes contain either single or dual E5-2687W CPUs (8 Cores per CPU). The green nodes contain either single or dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU.
Speedups over the CPU-only node: 1.7x with the M2090 and 2.8x-3.0x with the K10, K20, or K20X.
Single Sandy Bridge CPU per node with single K10, K20 or K20X produces best performance.


Great Scaling in Small Systems

[Chart: RNase in water (16,816 atoms, truncated dodecahedron box), nanoseconds/day vs. number of nodes (1-3)]
Running GROMACS 4.6 pre-beta with CUDA 4.1.
Each blue node contains 1x Intel X5550 CPU (95W TDP, 4 Cores per CPU). Each green node contains 1x Intel X5550 CPU (95W TDP, 4 Cores per CPU) and 1x NVIDIA M2090 (225W TDP per GPU).
GPU-accelerated throughput of 8.36, 13.01, and 21.68 ns/day at 1, 2, and 3 nodes, a 3.2x-3.7x speedup over the CPU-only nodes.
Get up to 3.7x performance compared to CPU-only nodes.


Additional Strong Scaling on Larger System

[Chart: 128K water molecules, nanoseconds/day vs. number of nodes (8-128)]
Running GROMACS 4.6 pre-beta with CUDA 4.1.
Each blue node contains 1x Intel X5670 (95W TDP, 6 Cores per CPU). Each green node contains 1x Intel X5670 (95W TDP, 6 Cores per CPU) and 1x NVIDIA M2070 (225W TDP per GPU).
GPU-accelerated nodes delivered 2x-3.1x the CPU-only throughput across the node counts measured.
Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x performance when compared to CPU-only nodes.


Replace 3 Nodes with 2 GPUs

[Chart: ADH in water (134K atoms), throughput (nanoseconds/day) and node cost]
Running GROMACS 4.6 pre-beta with CUDA 4.1.
The blue node contains 2x Intel X5550 CPUs (95W TDP, 4 Cores per CPU). The green node contains 2x Intel X5550 CPUs (95W TDP, 4 Cores per CPU) and 2x NVIDIA M2090s as the GPU (225W TDP per GPU).
4 CPU nodes: 6.7 ns/day at $8,000. 1 CPU node + 2x M2090 GPUs: 8.36 ns/day at $6,500.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Save thousands of dollars and perform 25% faster.


Greener Science

[Chart: ADH in water (134K atoms), energy expended (kJ) per simulated nanosecond; lower is better]
Running GROMACS 4.6 with CUDA 4.1.
The blue nodes contain 2x Intel X5550 CPUs (95W TDP, 4 Cores per CPU). The green node contains 2x Intel X5550 CPUs (4 Cores per CPU) and 2x NVIDIA M2090 GPUs (225W TDP per GPU).
Energy Expended = Power x Time. The 4 CPU nodes draw 760 Watts; the 1 node + 2x M2090 draws 640 Watts.
In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.
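Combining the power figures above with the ADH throughput from the previous slide (6.7 vs. 8.36 ns/day) reproduces the energy claim; a short check:

```python
# Energy per simulated nanosecond = node power (W) x time to simulate 1 ns (s), in kJ.
SECONDS_PER_DAY = 86_400

def kj_per_ns(power_w: float, ns_per_day: float) -> float:
    return power_w * (SECONDS_PER_DAY / ns_per_day) / 1_000

cpu_cluster = kj_per_ns(760, 6.70)    # 4 CPU-only nodes
gpu_node    = kj_per_ns(640, 8.36)    # 1 node + 2x M2090

print(f"{cpu_cluster:.0f} kJ/ns vs {gpu_node:.0f} kJ/ns "
      f"-> {1 - gpu_node / cpu_cluster:.0%} less energy")   # ~33%, matching the slide
```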


The Power of Kepler

[Chart: RNase solvated protein (24K atoms), ns/day for M2090 vs. K20X in 1 CPU + 1 GPU, 1 CPU + 2 GPU, 2 CPU + 1 GPU, and 2 CPU + 2 GPU configurations]
Running GROMACS version 4.6 beta.
The grey nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 Cores per CPU) and 1 or 2 NVIDIA M2090s. The green nodes contain 1 or 2 E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).
Upgrading an M2090 to a K20X increases performance 10-45% (Ribonuclease).


K20X – Fast

[Chart: RNase solvated protein (24K atoms), nanoseconds/day, CPU only vs. with 1 K20X, for 1 and 2 CPUs]
Running GROMACS version 4.6 beta.
The blue nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 Cores per CPU). The green nodes contain 1 or 2 E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).
Adding a K20X increases performance by up to 3x (Ribonuclease).


K20X, the Fastest Yet

[Chart: 192K water molecules, nanoseconds/day for CPU, CPU + K20X, and CPU + 2x K20X]
Running GROMACS version 4.6-beta2 and CUDA 5.0.35.
The blue node contains 2 E5-2687W CPUs (150W each, 8 Cores per CPU). The green nodes contain 2 E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).
Using K20X nodes increases performance by 2.5x.
Try GPU accelerated GROMACS 4.6 for free – www.nvidia.com/GPUTestDrive


Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1x. Kepler-based GPUs (K20X, K20 or K10) need a fast Sandy Bridge, perhaps the very fastest Westmeres, or high-end AMD Opterons.
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand

Scale to multiple nodes with same single node configuration.


CHARMM Release C37b1


GPUs Outperform CPUs

[Chart: Daresbury Crambin (19.6K atoms), nanoseconds/day]
Running CHARMM release C37b1.
The blue bar represents 44 X5667 CPUs (95W, 4 Cores per CPU). The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Configurations and cost: 44x X5667 ($44,000), 2x X5667 + 1x C2070 ($3,000), 2x X5667 + 2x C2070 ($4,000).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
1 GPU = 15 CPUs.


More Bang for Your Buck

[Chart: Daresbury Crambin (19.6K atoms), scaled performance/price]
Running CHARMM release C37b1.
The blue bar represents 44 X5667 CPUs (95W, 4 Cores per CPU). The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W).
Configurations: 44x X5667, 2x X5667 + 1x C2070, 2x X5667 + 2x C2070.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Using GPUs delivers 10.6x the performance for the same cost.


Greener Science with NVIDIA

[Chart: energy used in simulating 1 ns of Daresbury G1nBP (61.2K atoms), kJ; lower is better]
Running CHARMM release C37b1.
The blue bar represents 64 X5667 CPUs (95W, 4 Cores per CPU). The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Configurations: 64x X5667, 2x X5667 + 1x C2070, 2x X5667 + 2x C2070.
Energy Expended = Power x Time.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Using GPUs will decrease energy use by 75%.


www.acellera.com

470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms); 116 ns/day on 1 GPU for DHFR (23K atoms).

M. Harvey, G. Giupponi and G. De Fabritiis, ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale, J. Chem. Theory Comput. 5, 1632 (2009)
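To connect the ns/day figures above with the "microseconds timescale" in the citation, a quick conversion; the per-week and per-month figures below are derived from the slide's DHFR number, not stated on it:

```python
# Convert sustained throughput (ns/day) into aggregate simulated time over longer wall-clock periods.
dhfr_ns_per_day = 116.0          # DHFR (23K atoms) on 1 GPU, from the slide

ns_per_week  = dhfr_ns_per_day * 7
ns_per_month = dhfr_ns_per_day * 30

print(f"{ns_per_week:.0f} ns/week (~{ns_per_week / 1_000:.1f} microseconds)")
print(f"{ns_per_month:.0f} ns/month (~{ns_per_month / 1_000:.1f} microseconds)")
# ~0.8 microseconds/week and ~3.5 microseconds/month of simulated time on a single GPU.
```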


www.acellera.com

NVT, NPT, PME, TCL, PLUMED, CAMSHIFT [1]

[1] M. J. Harvey and G. De Fabritiis, An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware, J. Chem. Theory Comput., 5, 2371-2377 (2009)
[2] For a list of selected references see http://www.acellera.com/acemd/publications


Quantum Chemistry Applications

Application | Features Supported | GPU Perf | Release Status | Notes
Abinit | Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization | 1.3-2.7x | Released since Version 6.12; multi-GPU support | www.abinit.org
ACES III | Integrating scheduling GPU into SIAL programming language and SIP runtime environment | 10x on kernels | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
ADF | Fock matrix, Hessians | TBD | Pilot project completed, under development; multi-GPU support | www.scm.com
BigDFT | DFT; Daubechies wavelets, part of Abinit | 5-25x (1 CPU core to GPU kernel) | Released June 2009, current release 1.6; multi-GPU support | http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf
Casino | TBD | TBD | Under development, Spring 2013 release; multi-GPU support | http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
CP2K | DBCSR (sparse matrix multiply library) | 2-7x | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
GAMESS-US | Libqc with Rys Quadrature algorithm; Hartree-Fock, MP2 and CCSD in Q4 2012 | 1.3-1.6x, 2.3-2.9x HF | Released; multi-GPU support | Next release Q4 2012; http://www.msg.ameslab.gov/gamess/index.html

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Quantum Chemistry Applications

Application | Features Supported | GPU Perf | Release Status | Notes
GAMESS-UK | (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics | 8x | Released in Summer 2012; multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963
Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development; multi-GPU support | Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm
GPAW | Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS) | 8x | Released; multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
Jaguar | Investigating GPU acceleration | TBD | Under development; multi-GPU support | Schrodinger, Inc.; http://www.schrodinger.com/kb/278
LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development; multi-GPU support | NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
MOLCAS | cuBLAS support | 1.1x | Released, Version 7.8; single GPU, additional GPU support coming in Version 8 | www.molcas.org
MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT | 1.7-2.3x projected | Under development; multiple GPUs | www.molpro.net; Hans-Joachim Werner

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Quantum Chemistry Applications

MOPAC2009
  Features: Pseudodiagonalization, full diagonalization, and density matrix assembling
  GPU Perf: 3.8-14X
  Release Status: Under development; single GPU
  Notes: Academic port. http://openmopac.net

NWChem
  Features: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
  GPU Perf: 3-10X projected
  Release Status: Release targeting end of 2012; multiple GPUs
  Notes: Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
  Features: DFT and TDDFT
  GPU Perf: TBD
  Release Status: Released
  Notes: http://www.tddft.org/programs/octopus/

PEtot
  Features: Density functional theory (DFT) plane wave pseudopotential calculations
  GPU Perf: 6-10X
  Release Status: Released; multi-GPU
  Notes: First principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
  Features: RI-MP2
  GPU Perf: 8x-14x
  Release Status: Released, Version 4.0
  Notes: http://www.qchem.com/doc_for_web/qchem_manual_4.0.pdf

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


Quantum Chemistry Applications

QMCPACK
  Features: Main features
  GPU Perf: 3-4x
  Release Status: Released; multiple GPUs
  Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
  Features: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
  GPU Perf: 2.5-3.5x
  Release Status: Released, Version 5.0; multiple GPUs
  Notes: Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
  Features: “Full GPU-based solution”
  GPU Perf: 44-650X vs. GAMESS CPU version
  Release Status: Released, Version 1.5; multi-GPU/single node
  Notes: Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
  Features: Hybrid Hartree-Fock DFT functionals including exact exchange
  GPU Perf: 2x; 2 GPUs comparable to 128 CPU cores
  Release Status: Available on request; multiple GPUs
  Notes: By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


BigDFT


CP2K


Performance Relative to CPU Only
Kepler, it's faster
[Chart: CP2K performance per node relative to CPU only]
Running CP2K version 12413-trunk on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).
Configurations: CPU only, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, CPU + 2x K20X.
Using GPUs delivers up to 12.6x the performance per node


Speedup Relative to 256 non-GPU Cores
Strong Scaling
[Chart: XK6 with GPUs vs. XK6 without GPUs]
Conducted on Cray XK6 using matrix-matrix multiplication, NREP=6 and N=159,000 with 50% occupation.
With GPUs, the speedup over the 256-core CPU-only run is 2.3x at 256 cores, 2.9x at 512 cores, and 3x at 768 cores.
Speedups increase as more cores are added, up to 3x at 768 cores
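To make the metric explicit: each bar is the walltime of the 256-core CPU-only reference divided by the walltime of the GPU-accelerated run at the given core count. The sketch below reproduces that arithmetic with placeholder walltimes chosen only to match the quoted 2.3x/2.9x/3x ratios; they are not measured values from the benchmark.

```python
# Speedup relative to a fixed CPU-only reference (256 cores), as used in the
# strong-scaling chart above. Walltimes are illustrative placeholders.
reference_walltime_s = 100.0                          # 256 CPU-only cores (assumed)
gpu_walltimes_s = {256: 43.5, 512: 34.5, 768: 33.3}   # assumed GPU-run times

for cores, t in sorted(gpu_walltimes_s.items()):
    speedup = reference_walltime_s / t
    print(f"{cores} cores with GPUs: {speedup:.1f}x vs. 256 CPU-only cores")
```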


Energy Expended (kJ)
Kepler, keeping the planet green
[Chart: energy expended per run; lower is better]
Running CP2K version 12413-trunk on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA K20 GPUs (235W each).
Energy Expended = Power x Time
Configurations: CPU only, CPU + K20, CPU + 2x K20.
Using K20s will lower energy use by over 75% for the same simulation
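The claim follows directly from Energy = Power x Time: the GPU node draws more power but finishes far sooner. The sketch below applies the formula using the node powers stated on this slide; the walltime and the 12.6x runtime reduction (the per-node speedup quoted earlier) are assumptions used purely for illustration.

```python
# Energy = Power x Time for a CPU-only node vs. the same node with 2x K20.
# Node powers are from the slide; the walltime and speedup are assumptions
# used only to illustrate how a ">75% less energy" figure arises.
cpu_node_power_w = 2 * 150             # 2x E5-2687W
gpu_node_power_w = 2 * 150 + 2 * 235   # same CPUs plus 2x K20

cpu_time_s = 3600.0                    # assumed CPU-only walltime
gpu_time_s = cpu_time_s / 12.6         # assumed GPU-accelerated walltime

cpu_energy_kj = cpu_node_power_w * cpu_time_s / 1000.0
gpu_energy_kj = gpu_node_power_w * gpu_time_s / 1000.0
saving = 1.0 - gpu_energy_kj / cpu_energy_kj

print(f"{cpu_energy_kj:.0f} kJ (CPU only) vs. {gpu_energy_kj:.0f} kJ (with GPUs): "
      f"{saving:.0%} less energy")
```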


GAUSSIAN


Gaussian
Key quantum chemistry code.
ACS Fall 2011 press release: joint collaboration between Gaussian, NVIDIA and PGI for GPU acceleration: http://www.gaussian.com/g_press/nvidia_press.htm
No such release exists for Intel MIC or AMD GPUs.
Mike Frisch quote:
“Calculations using Gaussian are limited primarily by the available computing resources,” said Dr. Michael Frisch, president of Gaussian, Inc. “By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian’s customers need to address.”



GAMESS


GAMESS Partnership Overview
Mark Gordon and Andrey Asadchev, key developers of GAMESS, in collaboration with NVIDIA. Mark Gordon is a recipient of an NVIDIA Professor Partnership Award.
Quantum chemistry is one of the major consumers of CPU cycles at national supercomputer centers.
NVIDIA developer resources fully allocated to the GAMESS code.
“We like to push the envelope as much as we can in the direction of highly scalable, efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs.”
Prof. Mark Gordon, Distinguished Professor, Department of Chemistry, Iowa State University, and Director, Applied Mathematical Sciences Program, Ames Laboratory



GAMESS Aug. 2011 Release: Relative Performance for Two Small Molecules
GAMESS August 2011 GPU Performance
First GPU-supported GAMESS release via "libqc", a library for fast quantum chemistry on multiple NVIDIA GPUs in multiple nodes, with CUDA software.
2e- AO integrals and their assembly into a closed-shell Fock matrix.
[Chart: relative performance of 4x E5640 CPUs vs. 4x E5640 CPUs + 4x Tesla C2070s for Ginkgolide (53 atoms) and Vancomycin (176 atoms)]


Upcoming GAMESS Q4 2012 Release
Multiple nodes with multiple GPUs supported
Rys Quadrature Hartree-Fock: adding an M2070 to 8 CPU cores yields a 2.3-2.9x speedup over 8 CPU cores alone (see the 2012 publication)
Møller–Plesset perturbation theory (MP2): preliminary code completed; paper in development
Coupled Cluster SD(T): CCSD code completed, (T) in progress



GAMESS - New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock
[Chart: Hartree-Fock GPU speedups* of 2.3x-2.9x for Taxol (6-31G, 6-31G(d), 6-31G(2d,2p), 6-31++G(d,p)) and Valinomycin (6-31G, 6-31G(d), 6-31G(2d,2p))]
Adding 1x 2070 GPU speeds up computations by 2.3x to 2.9x.
* A. Asadchev, M.S. Gordon, “New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock,” Journal of Chemical Theory and Computation (2012)



GPAW


Used with permission from Samuli Hakala


NWChem


NWChem - Speedup of the non-iterative calculation for various configurations/tile sizes
System: cluster consisting of dual-socket nodes constructed from:
• 8-core AMD Interlagos processors
• 64 GB of memory
• Tesla M2090 (Fermi) GPUs
The nodes are connected using a high-performance QDR InfiniBand interconnect.
Courtesy of Kowalski, K., Bhaskaran-Nair, et al. @ PNNL, JCTC (submitted)


Quantum Espresso/PWscf


Performance Relative to CPU Only
Kepler, fast science
[Chart: AUsurf performance relative to CPU only]
Running Quantum Espresso version 5.0-build7 on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively).
Configurations: CPU only, CPU + M2090, CPU + K10, CPU + 2x M2090, CPU + 2x K10.
Using K10s delivers up to 11.7x the performance per node over CPUs, and 1.7x the performance of M2090s


Scaled to CPU Only
Extreme Performance/Price from 1 GPU
[Chart: price and performance of the CPU+GPU system vs. the CPU-only system, scaled to the CPU-only system, for the Shilu-3 and Water-on-Calcite (calcite structure) tests]
Simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU; an NVIDIA C2050 was used for the GPU.
Adding a GPU can improve performance by 3.7x while only increasing price by 25%
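The performance-per-dollar argument is simple arithmetic on the two figures quoted above. The sketch below spells it out with prices normalised to the CPU-only system; the absolute prices are not given on the slide, so only the ratios are used.

```python
# Performance per dollar when one C2050 is added, using the 3.7x speedup and
# +25% price increase quoted above. Prices are normalised, not actual quotes.
cpu_only_price = 1.00
cpu_gpu_price = 1.25          # +25% with the GPU added
speedup = 3.7                 # best case quoted on the slide

perf_per_dollar_gain = speedup / (cpu_gpu_price / cpu_only_price)
print(f"Performance per dollar improves by about {perf_per_dollar_gain:.1f}x")
```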


Extreme Performance/Price from 1 GPU
[Chart: price and performance scaled to the CPU-only system for AUSURF112 (k-point) and AUSURF112 (gamma-point)]
Simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU; an NVIDIA C2050 was used for the GPU.
Calculation done for a gold surface of 112 atoms.
Adding a GPU can improve performance by 3.5x while only increasing price by 25%


Elapsed Time (minutes)
Replace 72 CPUs with 8 GPUs
LSMO-BFO (120 atoms), 8 k-points
Simulations run on PLX @ CINECA. Intel 6-core 2.66 GHz X5550s were used for the CPUs; NVIDIA M2070s were used for the GPUs.
[Chart: elapsed time of 223 minutes for 120 CPUs ($42,000) vs. 219 minutes for 48 CPUs + 8 GPUs ($32,800)]
The GPU-accelerated setup performs faster and costs about 22% less ($32,800 vs. $42,000)


Power Consumption (Watts)
QE/PWscf - Green Science
LSMO-BFO (120 atoms), 8 k-points; lower is better
Simulations run on PLX @ CINECA. Intel 6-core 2.66 GHz X5550s were used for the CPUs; NVIDIA M2070s were used for the GPUs.
[Chart: power consumption of 120 CPUs ($42,000) vs. 48 CPUs + 8 GPUs ($32,800)]
Over a year, the lower power consumption would save $4,300 on energy bills
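A saving of that size can be sanity-checked from the power gap between the two configurations: annual saving = (power difference) x (hours per year) x (electricity rate). The chart only shows the relative bars, so the wattages and the rate in the sketch below are assumptions chosen for illustration.

```python
# Rough estimate of an annual energy-bill saving from a lower power draw.
# The power figures and electricity rate are assumptions, not chart values.
cpu_only_power_kw = 10.0      # assumed draw of the 120-CPU configuration
cpu_gpu_power_kw = 6.0        # assumed draw of 48 CPUs + 8 GPUs
rate_per_kwh = 0.12           # assumed electricity price, $/kWh

hours_per_year = 24 * 365
saving = (cpu_only_power_kw - cpu_gpu_power_kw) * hours_per_year * rate_per_kwh
print(f"Estimated annual saving: ${saving:,.0f}")
```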


Energy Consumption (kWh)
NVIDIA GPUs Use Less Energy
Energy consumption on different tests (CPU only vs. CPU+GPU); lower is better
Simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU; an NVIDIA C2050 was used for the GPU.
[Chart: energy consumed for the Shilu-3, AUSURF112 and Water-on-Calcite tests, with the GPU-accelerated runs using 54-58% less energy]
In all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system


Time (s)
QE/PWscf - Great Strong Scaling in Parallel
CdSe-159: walltime of 1 full SCF; lower is better
Simulations run on STONEY @ ICHEC. Two quad-core 2.87 GHz Intel X5560s were used in each node; two NVIDIA M2090s were used in each node for the CPU+GPU test.
[Chart: CPU vs. CPU+GPU walltime at 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96) and 14 (112) nodes (total CPU cores), with GPU speedups of roughly 2.1x-2.5x]
The system is a 159-atom cadmium selenide nanodot.
Speedups up to 2.5x with GPU acceleration


Time (s)
QE/PWscf - More Powerful Strong Scaling
GeSnTe134: walltime of a full SCF; lower is better
Simulations run on PLX @ CINECA. Two 6-core 2.4 GHz Intel E5645s were used in each node; two NVIDIA M2070s were used in each node for the CPU+GPU test.
[Chart: CPU vs. CPU+GPU walltime at 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384) and 44 (528) nodes (total CPU cores), with GPU speedups of roughly 1.6x-2.4x]
Accelerate your cluster by up to 2.1x with NVIDIA GPUs
Try GPU accelerated Quantum Espresso for free – www.nvidia.com/GPUTestDrive


TeraChem


Time (Seconds)
TeraChem: Supercomputer Speeds on GPUs
Time for SCF step on the giant fullerene C240 molecule; lower is better.
TeraChem running on 8 C2050s in 1 node; NWChem running on 4096 quad-core CPUs in the Chinook supercomputer.
[Chart: SCF-step time for 4096 quad-core CPUs ($19,000,000) vs. 8 C2050s ($31,000)]
Similar performance from just a handful of GPUs


Performance/Price relative to the Supercomputer
TeraChem: Bang for the Buck
TeraChem running on 8 C2050s in 1 node; NWChem running on 4096 quad-core CPUs in the Chinook supercomputer. Giant fullerene C240 molecule.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
[Chart: performance/price of 493 for 8 C2050s ($31,000) vs. 1 for 4096 quad-core CPUs ($19,000,000)]
Dollars spent on GPUs do nearly 500x more science than those spent on CPUs
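The ~493x bar is a performance/price ratio: relative throughput divided by relative cost. The prices in the sketch below come from the slide; the SCF-step times are placeholders standing in for the "similar performance" shown two slides earlier, not measured values.

```python
# Performance/price of the 8-GPU node relative to the 4096-CPU supercomputer.
# Prices are from the slide; the SCF times are illustrative placeholders.
supercomputer_price = 19_000_000    # 4096 quad-core CPUs
gpu_node_price = 31_000             # 8x C2050 in one node

supercomputer_time_s = 40.0         # assumed SCF-step time
gpu_node_time_s = 50.0              # assumed, slightly slower than the supercomputer

relative_throughput = supercomputer_time_s / gpu_node_time_s   # work per unit time
relative_cost = gpu_node_price / supercomputer_price
print(f"Performance/price advantage: ~{relative_throughput / relative_cost:.0f}x")
```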


Kepler's Even Better
TeraChem running on C2050 and K20C; Olestra, BLYP, 453 atoms.
[Charts: runtime in seconds on C2050 vs. K20C; the first graph is BLYP/6-31G(d), the second is B3LYP/6-31G(d)]
Kepler (K20C) performs about 2x faster than the Fermi-based C2050


Viz, "Docking" and Related Applications Growing

Amira 5®
  Features: 3D visualization of volumetric data and surfaces
  GPU Perf: 70x
  Release Status: Released, Version 5.3.3; single GPU
  Notes: Visualization from Visage Imaging. Next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF
  Features: Allows fast processing of large ligand databases
  GPU Perf: 100X
  Release Status: Available upon request to authors; single GPU
  Notes: High-throughput parallel blind virtual screening; http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
  Features: Empirical free-energy forcefield
  GPU Perf: 6.5-13.4X
  Release Status: Released; single GPU
  Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
  Features: GPU-accelerated application
  GPU Perf: 3.75-5000X
  Release Status: Released, Suite 2011; single and multi-GPU
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
  Features: Real-time shape similarity searching/comparison
  GPU Perf: 800-3000X
  Release Status: Released; single and multi-GPU
  Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs

PyMol
  Features: Lines: 460% increase; cartoons: 1246% increase; surface: 1746% increase; spheres: 753% increase; ribbon: 426% increase
  GPU Perf: 1700x
  Release Status: Released, Version 1.5; single GPU
  Notes: http://pymol.org/

VMD
  Features: High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals
  GPU Perf: 100-125X or greater on kernels
  Release Status: Released, Version 1.9
  Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.


FastROCS
OpenEye Japan
Hideyuki Sato, Ph.D.
© 2012 OpenEye Scientific Software


ROCS on the GPU: FastROCS
[Chart: shape overlays per second, CPU vs. GPU]


Riding Moore's Law
[Chart: FastROCS shape overlays per second across GPU generations: C1060, C2050, C2075, C2090, K10, K20]


Conformers per Second
[Chart: FastROCS conformers per second scaling across 1-8 individual K10 GPUs]
FastROCS scaling across 4x K10s (2 physical GPUs per K10 board).
Dataset: 53 million conformers (10.9 million compounds from PubChem at 5 conformers per molecule).
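To connect a throughput curve like this to screening time, the sketch below turns a per-GPU overlay rate into the time needed for the 53-million-conformer set, assuming idealised linear scaling. Both the per-GPU rate and the perfect scaling are assumptions for illustration, not values read from the chart.

```python
# Idealised linear FastROCS scaling across K10 GPUs and the resulting time to
# screen the 53M-conformer set. The per-GPU rate is an assumed figure.
conformers = 53_000_000
per_gpu_rate = 1_000_000    # assumed shape overlays per second for one GPU

for n_gpus in range(1, 9):  # 1..8 individual GPUs (4 K10 boards x 2 GPUs each)
    rate = per_gpu_rate * n_gpus
    print(f"{n_gpus} GPUs: {rate:>9,} conformers/s -> "
          f"{conformers / rate:5.0f} s for the full set")
```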


Benefits of GPU Accelerated Computing
Faster than CPU-only systems in all tests
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
The K20 GPU is our fastest and lowest-power high-performance GPU yet


Try GPU accelerated TeraChem for free – www.nvidia.com/GPUTestDrive


GPU Test Drive
Experience GPU Acceleration
For Computational Chemistry Researchers, Biophysicists
Preconfigured with Molecular Dynamics Apps
Remotely Hosted GPU Servers
Free & Easy – Sign up, Log in and See Results
www.nvidia.com/gputestdrive

