Computational Chemistry Benchmarks - NVIDIA
Updated: February 4, 2013
Molecular Dynamics (MD) Applications<br />
AMBER<br />
Features supported: PMEMD explicit solvent & GB implicit solvent<br />
GPU perf: >100 ns/day (JAC NVE on 2x K20s)<br />
Release status: Released; multi-GPU, multi-node<br />
Notes/benchmarks: AMBER 12, GPU support revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks<br />
CHARMM<br />
Features supported: Implicit (5x) and explicit (2x) solvent via OpenMM<br />
GPU perf: 2x C2070 equals 32-35x X5667 CPUs<br />
Release status: Released; single & multi-GPU in a single node<br />
Notes/benchmarks: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump<br />
DL_POLY<br />
Features supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV<br />
GPU perf: 4x<br />
Release status: Release V4.03 (source only, results published); multi-GPU, multi-node<br />
Notes/benchmarks: http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx<br />
GROMACS<br />
Features supported: Implicit (5x) and explicit (2x) solvent<br />
GPU perf: 165 ns/day (DHFR on 4x C2075s)<br />
Release status: Released; multi-GPU, multi-node<br />
Notes/benchmarks: Release 4.6; first multi-GPU support<br />
LAMMPS<br />
Features supported: Lennard-Jones, Gay-Berne, Tersoff, and many more potentials<br />
GPU perf: 3.5-18x on Titan<br />
Release status: Released; multi-GPU, multi-node<br />
Notes/benchmarks: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan<br />
NAMD<br />
Features supported: Full electrostatics with PME and most simulation features<br />
GPU perf: 4.0 ns/day (F1-ATPase on 1x K20X)<br />
Release status: Released; 100M-atom capable; multi-GPU, multi-node<br />
Notes/benchmarks: NAMD 2.9<br />
GPU perf compared against a multi-core x86 CPU socket. GPU perf is benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.<br />
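The ns/day figures quoted throughout these tables follow directly from the integrator timestep and the sustained step rate. A minimal illustrative sketch (the 580 steps/s figure is an assumption chosen to land near the JAC NVE number above, not a value from the benchmark reports):

```python
def ns_per_day(timestep_fs: float, steps_per_second: float) -> float:
    """Simulated nanoseconds per wall-clock day for a given timestep and step rate."""
    SECONDS_PER_DAY = 86_400.0
    FS_PER_NS = 1.0e6
    return timestep_fs * steps_per_second * SECONDS_PER_DAY / FS_PER_NS

# Hypothetical: a 2 fs timestep advanced at ~580 steps/s sustains ~100 ns/day,
# the ballpark AMBER reports for JAC NVE on 2x K20s.
print(round(ns_per_day(2.0, 580.0), 1))
```

The same conversion explains why longer timesteps (e.g. with hydrogen mass repartitioning) raise ns/day even at a fixed step rate.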
New/Additional MD Applications Ramping<br />
Abalone<br />
Features supported: Simulations<br />
GPU perf: 4-29X (on 1060 GPU)<br />
Release status: Released, Version 1.8.51; single GPU<br />
Notes: Agile Molecule, Inc.<br />
Ascalaph<br />
Features supported: Computation of non-valent interactions<br />
GPU perf: 4-29X (on 1060 GPU)<br />
Release status: Released, Version 1.1.4; single GPU<br />
Notes: Agile Molecule, Inc.<br />
ACEMD<br />
Features supported: Written for use only on GPUs<br />
GPU perf: 150 ns/day (DHFR on 1x K20)<br />
Release status: Released; single and multi-GPU<br />
Notes: Production bio-molecular dynamics (MD) software specially optimized to run on GPUs<br />
Folding@Home<br />
Features supported: Powerful distributed-computing molecular dynamics system; implicit solvent and folding<br />
GPU perf: Depends upon number of GPUs<br />
Release status: Released; GPUs and CPUs<br />
Notes: http://folding.stanford.edu; GPUs get 4X the points of CPUs<br />
GPUGrid.net<br />
Features supported: High-performance all-atom biomolecular simulations; explicit solvent and binding<br />
GPU perf: Depends upon number of GPUs<br />
Release status: Released; NVIDIA GPUs only<br />
Notes: http://www.gpugrid.net/<br />
HALMD<br />
Features supported: Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)<br />
GPU perf: Up to 66x on 2090 vs. 1 CPU core<br />
Release status: Released, Version 0.2.0; single GPU<br />
Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen<br />
HOOMD-Blue<br />
Features supported: Written for use only on GPUs<br />
GPU perf: Kepler 2X faster than Fermi<br />
Release status: Released, Version 0.11.2; single and multi-GPU on 1 node<br />
Notes: http://codeblue.umich.edu/hoomd-blue/; multi-GPU w/ MPI in March 2013<br />
OpenMM<br />
Features supported: Implicit and explicit solvent, custom forces<br />
GPU perf: Implicit: 127-213 ns/day; explicit: 18-55 ns/day (DHFR)<br />
Release status: Released, Version 4.1.1; multi-GPU<br />
Notes: Library and application for molecular dynamics on high-performance GPUs<br />
Quantum Chemistry Applications<br />
Abinit<br />
Features supported: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization<br />
GPU perf: 1.3-2.7X<br />
Release status: Released, Version 7.0.5; multi-GPU support<br />
Notes: www.abinit.org<br />
ACES III<br />
Features supported: Integrating GPU scheduling into the SIAL programming language and SIP runtime environment<br />
GPU perf: 10X on kernels<br />
Release status: Under development; multi-GPU support<br />
Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf<br />
ADF<br />
Features supported: Fock matrix, Hessians<br />
GPU perf: TBD<br />
Release status: Pilot project completed, under development; multi-GPU support<br />
Notes: www.scm.com<br />
BigDFT<br />
Features supported: DFT; Daubechies wavelets, part of Abinit<br />
GPU perf: 5-25X (1 CPU core to GPU kernel)<br />
Release status: Released June 2009, current release 1.6.0; multi-GPU support<br />
Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf<br />
Casino<br />
Features supported: TBD<br />
GPU perf: TBD<br />
Release status: Under development, Spring 2013 release; multi-GPU support<br />
Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html<br />
CP2K<br />
Features supported: DBCSR (sparse matrix multiply library)<br />
GPU perf: 2-7X<br />
Release status: Under development; multi-GPU support<br />
Notes: http://www.olcf.ornl.gov/wpcontent/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf<br />
GAMESS-US<br />
Features supported: Libqc with Rys quadrature algorithm, Hartree-Fock; MP2 and CCSD in Q4 2012<br />
GPU perf: 1.3-1.6X; 2.3-2.9x HF<br />
Release status: Released; multi-GPU support<br />
Notes: Next release Q4 2012. http://www.msg.ameslab.gov/gamess/index.html<br />
Quantum Chemistry Applications<br />
GAMESS-UK<br />
Features supported: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics<br />
GPU perf: 8x<br />
Release status: Released in 2012; multi-GPU support<br />
Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963<br />
Gaussian<br />
Features supported: Joint PGI, NVIDIA & Gaussian collaboration<br />
GPU perf: TBD<br />
Release status: Under development; multi-GPU support<br />
Notes: Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm<br />
GPAW<br />
Features supported: Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS)<br />
GPU perf: 8x<br />
Release status: Released; multi-GPU support<br />
Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O’Grady (SLAC)<br />
Jaguar<br />
Features supported: Investigating GPU acceleration<br />
GPU perf: TBD<br />
Release status: Under development; multi-GPU support<br />
Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278<br />
MOLCAS<br />
Features supported: CUBLAS support<br />
GPU perf: 1.1x<br />
Release status: Released, Version 7.8; single GPU (additional GPU support coming in Version 8)<br />
Notes: www.molcas.org<br />
MOLPRO<br />
Features supported: Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT<br />
GPU perf: 1.7-2.3X projected<br />
Release status: Under development; multiple GPUs<br />
Notes: www.molpro.net; Hans-Joachim Werner<br />
Quantum Chemistry Applications<br />
MOPAC2009<br />
Features supported: Pseudodiagonalization, full diagonalization, and density matrix assembling<br />
GPU perf: 3.8-14X<br />
Release status: Under development; single GPU<br />
Notes: Academic port; http://openmopac.net<br />
NWChem<br />
Features supported: Triples part of Reg-CCSD(T); CCSD & EOMCCSD task schedulers<br />
GPU perf: 3-10X projected<br />
Release status: Release targeting March 2013; multiple GPUs<br />
Notes: Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf<br />
Octopus<br />
Features supported: DFT and TDDFT<br />
GPU perf: TBD<br />
Release status: Released<br />
Notes: http://www.tddft.org/programs/octopus/<br />
PEtot<br />
Features supported: Density functional theory (DFT) plane-wave pseudopotential calculations<br />
GPU perf: 6-10X<br />
Release status: Released; multi-GPU<br />
Notes: First-principles materials code that computes the behavior of the electron structures of materials<br />
Q-CHEM<br />
Features supported: RI-MP2<br />
GPU perf: 8x-14x<br />
Release status: Released, Version 4.0<br />
Notes: http://www.qchem.com/doc_for_web/qchem_manual_4.0.pdf<br />
Quantum Chemistry Applications<br />
QMCPACK<br />
Features supported: Main features<br />
GPU perf: 3-4x<br />
Release status: Released; multiple GPUs<br />
Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK<br />
Quantum Espresso/PWscf<br />
Features supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs<br />
GPU perf: 2.5-3.5x<br />
Release status: Released, Version 5.0; multiple GPUs<br />
Notes: Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/<br />
TeraChem<br />
Features supported: “Full GPU-based solution”<br />
GPU perf: 44-650X vs. GAMESS CPU version<br />
Release status: Released, Version 1.5; multi-GPU/single node<br />
Notes: Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf<br />
VASP<br />
Features supported: Hybrid Hartree-Fock DFT functionals including exact exchange<br />
GPU perf: 2x (2 GPUs comparable to 128 CPU cores)<br />
Release status: Available on request; multiple GPUs<br />
Notes: By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf<br />
WL-LSMS<br />
Features supported: Generalized Wang-Landau method<br />
GPU perf: 3x with 32 GPUs vs. 32 (16-core) CPUs<br />
Release status: Under development; multi-GPU support<br />
Notes: NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf<br />
Viz, “Docking” and Related Applications Growing<br />
Amira 5®<br />
Features supported: 3D visualization of volumetric data and surfaces<br />
GPU perf: 70x<br />
Release status: Released, Version 5.3.3; single GPU<br />
Notes: Visualization from Visage Imaging. Next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html<br />
BINDSURF<br />
Features supported: Allows fast processing of large ligand databases<br />
GPU perf: 100X<br />
Release status: Available upon request to authors; single GPU<br />
Notes: High-throughput parallel blind virtual screening; http://www.biomedcentral.com/1471-2105/13/S14/S13<br />
BUDE<br />
Features supported: Empirical free-energy forcefield<br />
GPU perf: 6.5-13.4X<br />
Release status: Released; single GPU<br />
Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm<br />
Core Hopping<br />
Features supported: GPU-accelerated application<br />
GPU perf: 3.75-5000X<br />
Release status: Released, Suite 2011; single and multi-GPU<br />
Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/<br />
FastROCS<br />
Features supported: Real-time shape similarity searching/comparison<br />
GPU perf: 800-3000X<br />
Release status: Released; single and multi-GPU<br />
Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs<br />
PyMol<br />
Features supported: Lines (460% increase), cartoons (1246%), surface (1746%), spheres (753%), ribbon (426%)<br />
GPU perf: 1700x<br />
Release status: Released, Version 1.5; single GPU<br />
Notes: http://pymol.org/<br />
VMD<br />
Features supported: High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals<br />
GPU perf: 100-125X or greater on kernels<br />
Release status: Released, Version 1.9<br />
Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/<br />
Bioinformatics Applications<br />
BarraCUDA<br />
Features supported: Alignment of short sequencing reads<br />
GPU speedup: 6-10x<br />
Release status: Version 0.6.2 – 3/2012; multi-GPU, multi-node<br />
Website: http://seqbarracuda.sourceforge.net/<br />
CUDASW++<br />
Features supported: Parallel Smith-Waterman database search<br />
GPU speedup: 10-50x<br />
Release status: Version 2.0.8 – Q1/2012; multi-GPU, multi-node<br />
Website: http://sourceforge.net/projects/cudasw/<br />
CUSHAW<br />
Features supported: Parallel, accurate long-read aligner for large genomes<br />
GPU speedup: 10x<br />
Release status: Version 1.0.40 – 6/2012; multiple GPUs<br />
Website: http://cushaw.sourceforge.net/<br />
GPU-BLAST<br />
Features supported: Protein alignment according to BLASTP<br />
GPU speedup: 3-4x<br />
Release status: Version 2.2.26 – 3/2012; single GPU<br />
Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html<br />
GPU-HMMER<br />
Features supported: Parallel local and global search of hidden Markov models<br />
GPU speedup: 60-100x<br />
Release status: Version 2.3.2 – Q1/2012; multi-GPU, multi-node<br />
Website: http://www.mpihmmer.org/installguideGPUHMMER.htm<br />
mCUDA-MEME<br />
Features supported: Scalable motif discovery algorithm based on MEME<br />
GPU speedup: 4-10x<br />
Release status: Version 3.0.12; multi-GPU, multi-node<br />
Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme<br />
SeqNFind<br />
Features supported: Hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly<br />
GPU speedup: 400x<br />
Release status: Released; multi-GPU, multi-node<br />
Website: http://www.seqnfind.com/<br />
UGENE<br />
Features supported: Fast short-read alignment<br />
GPU speedup: 6-8x<br />
Release status: Version 1.11 – 5/2012; multi-GPU, multi-node<br />
Website: http://ugene.unipro.ru/<br />
WideLM<br />
Features supported: Parallel linear regression on multiple similarly-shaped models<br />
GPU speedup: 150x<br />
Release status: Version 0.1-1 – 3/2012; multi-GPU, multi-node<br />
Website: http://insilicos.com/products/widelm<br />
GPU perf compared against the same or similar code running on a single-CPU machine. Performance measured internally or independently.<br />
Performance Relative to CPU Only: MD Average Speedups<br />
[Bar chart: average MD speedup over a CPU-only node for the configurations CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, and CPU + 2x K20X. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.]<br />
Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.<br />
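An average-with-error-bars chart like this reduces each hardware configuration to a min/mean/max summary over its per-test speedups. A minimal sketch with made-up speedup values (not the measured data behind the chart):

```python
def summarize_speedups(speedups):
    """Mean plus (min, max) range, as used for one bar with error bars."""
    mean = sum(speedups) / len(speedups)
    return mean, min(speedups), max(speedups)

# Hypothetical per-test speedups for one configuration
# (e.g. the AMBER, NAMD, LAMMPS, and GROMACS cases):
tests = [3.5, 5.2, 7.4, 6.1]
mean, low, high = summarize_speedups(tests)
print(round(mean, 2), low, high)
```

The error bars then span from `low` to `high` around the plotted `mean`.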
Built from Ground Up for GPUs<br />
Computational Chemistry<br />
What: Study disease & discover drugs; predict drug and protein interactions.<br />
Why: Speed of simulations is critical. Enables study of longer timeframes, larger systems, and more simulations.<br />
How: GPUs increase throughput & accelerate simulations.<br />
AMBER 11 Application<br />
4.6x performance increase with 2 GPUs with only a 54% added cost*<br />
• AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)<br />
• Cost of CPU node assumed to be $9333. Cost of adding two (2) C2090s to a single node is assumed to be $5333.<br />
GPU-READY APPLICATIONS: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem<br />
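The price/performance claim above is a simple perf-per-dollar ratio. A hedged sketch using the node prices quoted on the slide (the 4.6x speedup and the dollar figures come from the slide; the helper function itself is illustrative):

```python
def perf_per_dollar_gain(speedup: float, base_cost: float, added_cost: float) -> float:
    """How much performance-per-dollar improves when a node gains GPUs."""
    cpu_perf_per_dollar = 1.0 / base_cost            # baseline perf normalized to 1
    gpu_perf_per_dollar = speedup / (base_cost + added_cost)
    return gpu_perf_per_dollar / cpu_perf_per_dollar

# AMBER 11 Cellulose NPT figures from the slide: 4.6x speedup,
# $9333 CPU node, $5333 to add two C2090s.
print(round(perf_per_dollar_gain(4.6, 9333.0, 5333.0), 2))
```

In words: even after paying for the GPUs, each dollar of node cost buys roughly three times as much throughput.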
AMBER 12<br />
GPU Support Revision 12.2<br />
1/22/2013<br />
Kepler - Our Fastest Family of GPUs Yet<br />
[Bar chart: Factor IX throughput in nanoseconds/day, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X. Results: 3.42 ns/day (CPU only), 11.85 ns/day (3.5x, M2090), 18.90 ns/day (5.6x, K10), 22.44 ns/day (6.6x, K20), 25.39 ns/day (7.4x, K20X).]<br />
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.<br />
K10 Accelerates Simulations of All Sizes<br />
[Bar chart: speedup compared to CPU only, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10 GPU. Benchmarks: TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), Nucleosome (GB); speedups range from 2.0x to 24x.]<br />
Gain 24x performance by adding just 1 GPU when compared to dual-CPU performance (Nucleosome).<br />
K20 Accelerates Simulations of All Sizes<br />
[Bar chart: speedup compared to CPU only, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU. Benchmarks: TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), Nucleosome (GB); speedups range from 2.66x to 28x.]<br />
Gain 28x throughput/performance by adding just one K20 GPU when compared to dual-CPU performance (Nucleosome).<br />
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012<br />
K20X Accelerates Simulations of All Sizes<br />
[Bar chart: speedup compared to CPU only, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K20X GPU. Benchmarks: TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), Nucleosome (GB); speedups range from 2.79x to 31.30x.]<br />
Gain 31x performance by adding just one K20X GPU when compared to dual-CPU performance (Nucleosome).<br />
K10 Strong Scaling over Nodes<br />
[Line chart: Cellulose 408K atoms (NPT), nanoseconds/day on 1, 2, and 4 nodes, running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU); the green nodes contain 2x Intel X5670 CPUs (6 cores per CPU) plus 2x NVIDIA K10 GPUs. GPU speedups of 2.4x-5.1x over CPU-only across the node counts.]<br />
GPUs significantly outperform CPUs while scaling over multiple nodes.<br />
Kepler – Universally Faster<br />
[Bar chart: speedups compared to CPU only for JAC, Factor IX, and Cellulose, running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU.]<br />
The Kepler GPUs accelerated all simulations, up to 8x.<br />
K10 Extreme Performance<br />
[Bar chart: JAC 23K atoms (NVE), nanoseconds/day, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU): 12.47 ns/day. The green node contains dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10 GPUs: 97.99 ns/day.]<br />
Gain 7.8X performance by adding just 2 GPUs when compared to dual-CPU performance (DHFR).<br />
K20 Extreme Performance<br />
[Bar chart: DHFR JAC 23K atoms (NVE), nanoseconds/day, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU): 12.47 ns/day. The green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 2x NVIDIA K20 GPUs: 95.59 ns/day.]<br />
Gain >7.5X throughput/performance by adding just 2 K20 GPUs when compared to dual-CPU performance.<br />
Replace 8 Nodes with 1 K20 GPU<br />
[Bar chart: DHFR throughput (nanoseconds/day) and cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. Eight blue nodes, each containing 2x Intel E5-2687W CPUs (8 cores per CPU): 65.00 ns/day at $32,000. One green node containing 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU: 81.09 ns/day at $6,500. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.]<br />
Cut down simulation costs to ¼ and gain higher performance.<br />
Replace 7 Nodes with 1 K10 GPU<br />
[Bar chart: performance on JAC NVE (nanoseconds/day) and cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU): $32,000 total. The green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K10 GPU: $7,000. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.]<br />
Cut down simulation costs to ¼ and increase performance by 70% (DHFR).<br />
Extra CPUs Decrease Performance<br />
[Bar chart: Cellulose NVE, nanoseconds/day, running AMBER 12 GPU Support Revision 12.1. The orange bars use one E5-2687W CPU (8 cores per CPU); the blue bars use dual E5-2687W CPUs (8 cores per CPU). Configurations compared: CPU only and CPU with dual K20s.]<br />
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.<br />
Kepler - Greener Science<br />
[Bar chart: energy expended (kJ) in simulating 1 ns of DHFR JAC; lower is better. Energy expended = power x time. Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235W each).]<br />
The GPU-accelerated systems use 65-75% less energy.<br />
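The "energy expended = power x time" relation is easy to sanity-check: the GPU node draws more power but finishes so much sooner that total energy drops. A sketch with illustrative wattages and throughputs (the 4 and 25 ns/day figures are assumptions for the example, not the chart's measured data):

```python
def energy_kj(node_power_w: float, ns_per_day: float, ns: float = 1.0) -> float:
    """Energy in kJ to simulate `ns` nanoseconds at a given node power and throughput."""
    seconds = ns / ns_per_day * 86_400.0
    return node_power_w * seconds / 1_000.0

# Illustrative: a 300 W CPU-only node (2x 150 W) at 4 ns/day vs. the same
# node plus one 235 W K20X at 25 ns/day, each simulating 1 ns of DHFR.
cpu_only = energy_kj(300.0, 4.0)            # 6480 kJ
with_gpu = energy_kj(300.0 + 235.0, 25.0)   # ~1849 kJ
print(round(1.0 - with_gpu / cpu_only, 2))  # fraction of energy saved
```

With these assumed numbers the GPU node saves about 71% of the energy, inside the 65-75% band the slide reports.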
Recommended GPU Node Configuration for AMBER Computational Chemistry<br />
Workstation or Single Node Configuration<br />
# of CPU sockets: 2<br />
Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)<br />
CPU speed (GHz): 2.66+<br />
System memory per node (GB): 16<br />
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075<br />
# of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good for doing 4 fast serial GPU runs)<br />
GPU memory preference (GB): 6<br />
GPU to CPU connection: PCIe 2.0 x16 or higher<br />
Server storage: 2 TB<br />
Network configuration: InfiniBand QDR or better<br />
Scale to multiple nodes with the same single-node configuration.<br />
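A node spec like the one above is straightforward to encode as a checklist. A minimal sketch (the dictionary keys and threshold values mirror the table, but the helper itself is a hypothetical illustration, not an NVIDIA tool):

```python
# Minimums from the recommended single-node configuration above.
RECOMMENDED = {
    "cpu_sockets": 2,
    "cores_per_socket": 4,   # 1 CPU core drives 1 GPU
    "cpu_ghz": 2.66,
    "system_mem_gb": 16,
    "gpu_mem_gb": 6,
}

def shortfalls(node: dict) -> list:
    """Return the recommendation keys this node fails to meet."""
    return sorted(k for k, v in RECOMMENDED.items() if node.get(k, 0) < v)

# Example: a node with a slow CPU and too little system memory.
print(shortfalls({"cpu_sockets": 2, "cores_per_socket": 8,
                  "cpu_ghz": 2.4, "system_mem_gb": 8, "gpu_mem_gb": 6}))
```

An empty list means the node meets every minimum in the table.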
Benefits of GPU-Accelerated Computing with AMBER<br />
Faster than CPU-only systems in all tests<br />
Most major compute-intensive aspects of classical MD ported<br />
Large performance boost with marginal price increase<br />
Energy usage cut by more than half<br />
GPUs scale well within a node and over multiple nodes<br />
K20 GPU is our fastest and lowest-power high-performance GPU yet<br />
Try GPU-accelerated AMBER for free – www.nvidia.com/GPUTestDrive<br />
NAMD 2.9
Kepler - Our Fastest Family of GPUs Yet<br />
[Bar chart: ApoA1 throughput in nanoseconds/day, running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU): 1.37 ns/day. The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, K10, K20, or K20X: 2.63 ns/day (1.9x), 3.45 ns/day (2.5x), 3.57 ns/day (2.6x), and 4.00 ns/day (2.9x) respectively.]<br />
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node (Apolipoprotein A1).<br />
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012<br />
K20 Accelerates Simulations of All Sizes<br />
[Bar chart: speedup compared to CPU only, running NAMD 2.9 with CUDA 4.0, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU. Speedups: ApoA1 2.6x, F1-ATPase 2.4x, STMV 2.7x.]<br />
Gain 2.5x throughput/performance by adding just 1 GPU when compared to dual-CPU performance (Apolipoprotein A1).<br />
Kepler – Universally Faster<br />
[Bar chart: speedups compared to CPU only for F1-ATPase, ApoA1, and STMV, running NAMD version 2.9, average acceleration printed in bars. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs. Average accelerations: 2.4x (1x K10), 2.6x (1x K20), 2.9x (1x K20X), 4.3x (2x K10), 4.7x (2x K20), 5.1x (2x K20X).]<br />
The Kepler GPUs accelerate all simulations, up to 5x.<br />
Outstanding Strong Scaling with Multi-STMV<br />
100 STMV on Hundreds of Nodes (a concatenation of 100 Satellite Tobacco Mosaic Virus systems)<br />
[Chart: nanoseconds/day vs. number of nodes (32, 64, 128, 256, 512, 640, 768) for CPU-only XE6 and Fermi XK6 nodes — GPU speedups of 2.7x, 2.9x, 3.6x, and 3.8x across the range]<br />
Running NAMD version 2.9<br />
Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU).<br />
Each green XK6 CPU+GPU node contains 1x AMD Opteron 1600 (16 cores per CPU) and an additional 1x NVIDIA X2090 GPU.<br />
Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers
Replace 3 Nodes with 1x M2090 GPU<br />
[Chart: F1-ATPase throughput and cost — 4 CPU nodes: 0.63 ns/day at $8,000; 1 CPU node + 1x M2090 GPU: 0.74 ns/day at $4,000]<br />
Running NAMD version 2.9<br />
Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU).<br />
The green node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU) and 1x NVIDIA M2090 GPU.<br />
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.<br />
Speedup of 1.2x for 50% of the cost.
K20 - Greener: Twice the Science per Watt<br />
Energy Used in Simulating 1 Nanosecond of ApoA1 (lower is better)<br />
[Chart: energy expended (kJ) — 1 Node vs. 1 Node + 2x K20; the GPU-accelerated node uses roughly half the energy]<br />
Running NAMD version 2.9<br />
Each blue node contains dual E5-2687W CPUs (95W, 4 cores per CPU).<br />
Each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU).<br />
Energy Expended = Power x Time<br />
Cut energy usage in half with GPUs.<br />
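The "Energy Expended = Power x Time" estimate used on these energy slides can be reproduced in a few lines. The helper and the wattage/throughput numbers below are hypothetical stand-ins for illustration, not the measured values behind the chart:

```python
# Sketch of the "Energy Expended = Power x Time" estimate used on the
# energy slides. Node power is approximated by summing component TDPs.
# All numbers here are hypothetical, not slide data.

def energy_kj(power_w: float, ns_simulated: float, ns_per_day: float) -> float:
    """Energy (kJ) to simulate `ns_simulated` ns at `ns_per_day` throughput."""
    wall_seconds = ns_simulated / ns_per_day * 86_400
    return power_w * wall_seconds / 1_000  # W * s = J; /1000 -> kJ

# A node that draws more power but finishes much sooner can use less energy:
cpu_only = energy_kj(power_w=1000, ns_simulated=1.0, ns_per_day=10)   # 8640.0 kJ
with_gpus = energy_kj(power_w=1500, ns_simulated=1.0, ns_per_day=40)  # 3240.0 kJ
print(cpu_only, with_gpus)
```

Whether the accelerated node wins depends on whether its speedup exceeds its increase in total TDP.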
Kepler - Greener: Twice the Science per Joule<br />
Energy used in simulating 1 ns of STMV (lower is better)<br />
[Chart: energy expended (kJ) — CPU Only vs. CPU + 2x K10, CPU + 2x K20, CPU + 2x K20X; the GPU-accelerated nodes use roughly half the energy]<br />
Running NAMD version 2.9<br />
The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU).<br />
The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).<br />
Energy Expended = Power x Time<br />
Cut energy usage in half with GPUs.<br />
STMV = Satellite Tobacco Mosaic Virus
Recommended GPU Node Configuration for NAMD <strong>Computational</strong> <strong>Chemistry</strong><br />
Workstation or Single Node Configuration<br />
# of CPU sockets: 2<br />
Cores per CPU socket: 6+<br />
CPU speed (GHz): 2.66+<br />
System memory per socket (GB): 32<br />
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075<br />
# of GPUs per CPU socket: 1-2<br />
GPU memory preference (GB): 6<br />
GPU to CPU connection: PCIe 2.0 or higher<br />
Server storage: 500 GB or higher<br />
Network configuration: Gemini, InfiniBand<br />
Scale to multiple nodes with the same single-node configuration.<br />
Summary/Conclusions<br />
Benefits of GPU-Accelerated Computing<br />
Faster than CPU-only systems in all tests<br />
Large performance boost with a small marginal price increase<br />
Energy usage cut in half<br />
GPUs scale very well within a node and across multiple nodes<br />
The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date<br />
Try GPU-accelerated NAMD for free – www.nvidia.com/GPUTestDrive<br />
LAMMPS, Jan. 2013 or later
More Science for Your Money<br />
Embedded Atom Model<br />
[Chart: speedup compared to CPU-only — CPU + 1x K10: 1.7x; + 1x K20: 2.47x; + 1x K20X: 2.92x; + 2x K10: 3.3x; + 2x K20: 4.5x; + 2x K20X: 5.5x]<br />
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU).<br />
The green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).<br />
Experience performance increases of up to 5.5x with Kepler GPU nodes.
K20X, the Fastest GPU Yet<br />
[Chart: speedup relative to CPU alone — CPU Only, CPU + 2x M2090, CPU + K20X, CPU + 2x K20X]<br />
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU).<br />
The green nodes have 2x E5-2687W CPUs and 2x NVIDIA M2090 or K20X GPUs (235W each).<br />
Experience performance increases of up to 6.2x with Kepler GPU nodes.<br />
One K20X performs as well as two M2090s.
Get a CPU Rebate to Fund Part of Your GPU Budget<br />
Acceleration in Loop-Time Computation by Additional GPUs<br />
[Chart: speedup normalized to CPU-only — 1 Node + 1x M2090: 5.31x; + 2x M2090: 9.88x; + 3x M2090: 12.9x; + 4x M2090: 18.2x]<br />
Running NAMD version 2.9<br />
The blue node contains dual X5670 CPUs (6 cores per CPU).<br />
The green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs.<br />
Increase performance 18x when compared to CPU-only nodes.<br />
Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone!
Excellent Strong Scaling on Large Clusters<br />
LAMMPS Gay-Berne, 134M Atoms<br />
[Chart: loop time (seconds) vs. number of nodes (300-900) for the GPU-accelerated XK6 and the CPU-only XE6 — speedups of 3.45x-3.55x across the range]<br />
From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance compared to XE6 CPU nodes.<br />
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).<br />
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
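Strong-scaling claims like the 3.5x above come from comparing loop times at matched node counts; parallel efficiency shows how well that holds as the node count grows. A small sketch with hypothetical loop times (not the measured XE6/XK6 data):

```python
# Compute speedup and strong-scaling efficiency from loop times.
# The timings below are hypothetical, shaped like the loop-time
# curves above, not actual XE6/XK6 measurements.

def strong_scaling(t_base: float, n_base: int, t: float, n: int) -> tuple:
    """Speedup vs. the baseline run, and parallel efficiency of the scale-up."""
    speedup = t_base / t
    efficiency = speedup * n_base / n  # 1.0 = perfect linear scaling
    return speedup, efficiency

# e.g. a run taking 540 s on 300 nodes and 180 s on 900 nodes:
s, e = strong_scaling(540.0, 300, 180.0, 900)
print(f"speedup {s:.1f}x, efficiency {e:.0%}")  # speedup 3.0x, efficiency 100%
```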
GPUs Sustain 5x Performance for Weak Scaling<br />
Weak Scaling with 32K Atoms per Node<br />
[Chart: loop time (seconds) vs. number of nodes (1, 8, 27, 64, 125, 216, 343, 512, 729) — GPU-accelerated nodes run 4.8x, 5.8x, and 6.7x faster at various scales]<br />
Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.<br />
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).<br />
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
Faster, Greener — Worth It!<br />
Energy Consumed in One Loop of EAM (lower is better)<br />
[Chart: energy expended (kJ) — 1 Node, 1 Node + 1x K20X, 1 Node + 2x K20X]<br />
GPU-accelerated computing uses 53% less energy than CPU-only.<br />
Energy Expended = Power x Time; power calculated by combining the components' TDPs.<br />
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU) and CUDA 4.2.9.<br />
The green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.<br />
Try GPU-accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
Molecular Dynamics with LAMMPS<br />
on a Hybrid Cray Supercomputer<br />
W. Michael Brown<br />
National Center for <strong>Computational</strong> Sciences<br />
Oak Ridge National Laboratory<br />
NVIDIA Technology Theater, Supercomputing 2012<br />
November 14, 2012
Early Kepler <strong>Benchmarks</strong> on Titan<br />
[Charts: time (s) vs. number of nodes (1-128, log scale) for the Atomic Fluid and Bulk Copper benchmarks, comparing XK7+GPU, XK6, and XK6+GPU configurations]
Early Kepler <strong>Benchmarks</strong> on Titan<br />
[Charts: time (s) vs. number of nodes (1-128, log scale) for the Protein and Liquid Crystal benchmarks, comparing XK7+GPU, XK6, and XK6+GPU configurations]
Early Titan XK6/XK7 <strong>Benchmarks</strong><br />
Speedup with Acceleration on XK6/XK7 Nodes<br />
1 Node = 32K Particles; 900 Nodes = 29M Particles<br />
Configuration | Atomic Fluid (cutoff = 2.5σ) | Atomic Fluid (cutoff = 5.0σ) | Bulk Copper | Protein | Liquid Crystal<br />
XK6 (1 Node) | 1.92 | 4.33 | 2.12 | 2.60 | 5.82<br />
XK7 (1 Node) | 2.90 | 8.38 | 3.66 | 3.36 | 15.70<br />
XK6 (900 Nodes) | 1.68 | 3.96 | 2.15 | 1.56 | 5.60<br />
XK7 (900 Nodes) | 2.75 | 7.48 | 2.86 | 1.95 | 10.14
Recommended GPU Node Configuration for LAMMPS <strong>Computational</strong> <strong>Chemistry</strong><br />
Workstation or Single Node Configuration<br />
# of CPU sockets: 2<br />
Cores per CPU socket: 6+<br />
CPU speed (GHz): 2.66+<br />
System memory per socket (GB): 32<br />
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075<br />
# of GPUs per CPU socket: 1-2<br />
GPU memory preference (GB): 6<br />
GPU to CPU connection: PCIe 2.0 or higher<br />
Server storage: 500 GB or higher<br />
Network configuration: Gemini, InfiniBand<br />
Scale to multiple nodes with the same single-node configuration.
GROMACS 4.6 Final, Pre-Beta<br />
and 4.6 Beta
Kepler - Our Fastest Family of GPUs Yet<br />
GROMACS 4.6 Final Release, Waters (192K atoms)<br />
[Chart: ns/day for CPU-only nodes vs. nodes with 1x M2090, K10, K20, or K20X — bar labels: 1.7x, 1.7x, 2.8x, 2.8x, and 3.0x]<br />
Running GROMACS 4.6 Final Release<br />
The blue nodes contain either single or dual E5-2687W CPUs (8 cores per CPU).<br />
The green nodes contain either single or dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X GPU.<br />
A single Sandy Bridge CPU per node with a single K10, K20, or K20X produces the best performance.
Great Scaling in Small Systems<br />
[Chart: nanoseconds/day vs. number of nodes (1-3), CPU-only vs. with GPU — GPU-accelerated throughput reaches 8.36 ns/day on 1 node, 13.01 on 2 nodes, and 21.68 on 3 nodes, for speedups of 3.2x-3.7x]<br />
Running GROMACS 4.6 pre-beta with CUDA 4.1<br />
Each blue node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU).<br />
Each green node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU) and 1x NVIDIA M2090 (225W TDP per GPU).<br />
Benchmark system: RNAse in water with 16,816 atoms in a truncated dodecahedron box.<br />
Get up to 3.7x performance compared to CPU-only nodes.
Additional Strong Scaling on a Larger System<br />
128K Water Molecules<br />
[Chart: nanoseconds/day vs. number of nodes (8-128), CPU-only vs. with GPU — GPU speedups of 2x, 2.8x, and 3.1x across the range]<br />
Running GROMACS 4.6 pre-beta with CUDA 4.1<br />
Each blue node contains 1x Intel X5670 CPU (95W TDP, 6 cores per CPU).<br />
Each green node contains 1x Intel X5670 CPU (95W TDP, 6 cores per CPU) and 1x NVIDIA M2070 (225W TDP per GPU).<br />
Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x performance when compared to CPU-only nodes.
Replace 3 Nodes with 2 GPUs<br />
ADH in Water (134K Atoms)<br />
[Chart: throughput and cost — 4 CPU nodes: 6.7 ns/day at $8,000; 1 CPU node + 2x M2090 GPUs: 8.36 ns/day at $6,500]<br />
Running GROMACS 4.6 pre-beta with CUDA 4.1<br />
The blue node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU).<br />
The green node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU) and 2x NVIDIA M2090 GPUs (225W TDP per GPU).<br />
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.<br />
Save thousands of dollars and perform 25% faster.
Greener Science<br />
ADH in Water (134K Atoms) — energy per simulated nanosecond (lower is better)<br />
[Chart: energy expended (kJ) — 4 Nodes (760 Watts) vs. 1 Node + 2x M2090 (640 Watts)]<br />
Running GROMACS 4.6 with CUDA 4.1<br />
The blue nodes contain 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU).<br />
The green node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU) and 2x NVIDIA M2090 GPUs (225W TDP per GPU).<br />
Energy Expended = Power x Time<br />
In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.
The Power of Kepler<br />
RNase Solvated Protein, 24K Atoms<br />
[Chart: ns/day for M2090 vs. K20X in four configurations — 1 CPU + 1 GPU, 1 CPU + 2 GPUs, 2 CPUs + 1 GPU, 2 CPUs + 2 GPUs]<br />
Running GROMACS version 4.6 beta<br />
The grey nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU) and 1 or 2 NVIDIA M2090s.<br />
The green nodes contain 1 or 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).<br />
Upgrading from an M2090 to a K20X increases performance 10-45%.<br />
RNase = Ribonuclease
K20X – Fast<br />
RNase Solvated Protein, 24K Atoms<br />
[Chart: nanoseconds/day for 1 CPU and 2 CPUs, CPU-only vs. with 1x K20X]<br />
Running GROMACS version 4.6 beta<br />
The blue nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU).<br />
The green nodes contain 1 or 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).<br />
Adding a K20X increases performance by up to 3x.<br />
RNase = Ribonuclease
K20X, the Fastest Yet<br />
192K Water Molecules<br />
[Chart: nanoseconds/day — CPU, CPU + K20X, CPU + 2x K20X]<br />
Running GROMACS version 4.6-beta2 and CUDA 5.0.35<br />
The blue node contains 2x E5-2687W CPUs (150W each, 8 cores per CPU).<br />
The green nodes contain 2x E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).<br />
Using K20X nodes increases performance by 2.5x.<br />
Try GPU-accelerated GROMACS 4.6 for free – www.nvidia.com/GPUTestDrive
Recommended GPU Node Configuration for GROMACS <strong>Computational</strong> <strong>Chemistry</strong><br />
Workstation or Single Node Configuration<br />
# of CPU sockets: 2<br />
Cores per CPU socket: 6+<br />
CPU speed (GHz): 2.66+<br />
System memory per socket (GB): 32<br />
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075<br />
# of GPUs per CPU socket: 1x — Kepler-based GPUs (K20X, K20, or K10) need a fast Sandy Bridge CPU, the very fastest Westmeres, or high-end AMD Opterons<br />
GPU memory preference (GB): 6<br />
GPU to CPU connection: PCIe 2.0 or higher<br />
Server storage: 500 GB or higher<br />
Network configuration: Gemini, InfiniBand<br />
Scale to multiple nodes with the same single-node configuration.
CHARMM Release C37b1
GPUs Outperform CPUs<br />
Daresbury Crambin, 19.6K Atoms<br />
[Chart: nanoseconds/day — 44x X5667 CPUs ($44,000) vs. 2x X5667 + 1x C2070 ($3,000) vs. 2x X5667 + 2x C2070 ($4,000)]<br />
Running CHARMM release C37b1<br />
The blue nodes contain 44 X5667 CPUs (95W, 4 cores per CPU).<br />
The green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each).<br />
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.<br />
1 GPU = 15 CPUs
More Bang for Your Buck<br />
Daresbury Crambin, 19.6K Atoms<br />
[Chart: scaled performance/price — 44x X5667 vs. 2x X5667 + 1x C2070 vs. 2x X5667 + 2x C2070]<br />
Running CHARMM release C37b1<br />
The blue nodes contain 44 X5667 CPUs (95W, 4 cores per CPU).<br />
The green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each).<br />
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.<br />
Using GPUs delivers 10.6x the performance for the same cost.
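Performance-per-price figures like the 10.6x above are throughput divided by system cost, normalized to the baseline configuration. A sketch with hypothetical ns/day read-offs (the prices match the slide; the throughputs are invented for illustration):

```python
# Normalize performance/price to a baseline configuration.
# Prices match the slide; the ns/day values are hypothetical.

def perf_per_dollar(ns_per_day: float, price_usd: float) -> float:
    return ns_per_day / price_usd

configs = {
    "44x X5667":           (60.0, 44_000),  # hypothetical ns/day
    "2x X5667 + 1x C2070": (40.0, 3_000),   # hypothetical ns/day
}
baseline = perf_per_dollar(*configs["44x X5667"])
for name, (perf, price) in configs.items():
    ratio = perf_per_dollar(perf, price) / baseline
    print(f"{name}: {ratio:.1f}x the performance per dollar")
```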
Greener Science with NVIDIA<br />
Energy Used in Simulating 1 ns of Daresbury G1nBP, 61.2K Atoms (lower is better)<br />
[Chart: energy expended (kJ) — 64x X5667 vs. 2x X5667 + 1x C2070 vs. 2x X5667 + 2x C2070]<br />
Running CHARMM release C37b1<br />
The blue nodes contain 64 X5667 CPUs (95W, 4 cores per CPU).<br />
The green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each).<br />
Energy Expended = Power x Time<br />
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.<br />
Using GPUs will decrease energy use by 75%.
www.acellera.com<br />
470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms)<br />
116 ns/day on 1 GPU for DHFR (23K atoms)<br />
M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009)
www.acellera.com<br />
NVT, NPT, PME, TCL, PLUMED, CAMSHIFT¹<br />
¹ M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput., 5, 2371-2377 (2009)<br />
² For a list of selected references see http://www.acellera.com/acemd/publications
Quantum <strong>Chemistry</strong> Applications<br />
Application | Features Supported | GPU Perf | Release Status | Notes<br />
Abinit | Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization | 1.3-2.7x | Released since Version 6.12; Multi-GPU support | www.abinit.org<br />
ACES III | Integrating scheduling GPU into SIAL programming language and SIP runtime environment | 10x on kernels | Under development; Multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf<br />
ADF | Fock Matrix, Hessians | TBD | Pilot project completed, under development; Multi-GPU support | www.scm.com<br />
BigDFT | DFT; Daubechies wavelets; part of Abinit | 5-25x (1 CPU core to GPU kernel) | Released June 2009, current release 1.6; Multi-GPU support | http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf<br />
Casino | TBD | TBD | Under development, Spring 2013 release; Multi-GPU support | http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html<br />
CP2K | DBCSR (sparse matrix multiply library) | 2-7x | Under development; Multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf<br />
GAMESS-US | Libqc with Rys Quadrature Algorithm; Hartree-Fock, MP2 and CCSD in Q4 2012 | 1.3-1.6x; 2.3-2.9x HF | Released; Multi-GPU support | Next release Q4 2012; http://www.msg.ameslab.gov/gamess/index.html<br />
GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
Quantum <strong>Chemistry</strong> Applications<br />
Application | Features Supported | GPU Perf | Release Status | Notes<br />
GAMESS-UK | (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics | 8x | Released Summer 2012; Multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963<br />
Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development; Multi-GPU support | Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm<br />
GPAW | Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS) | 8x | Released; Multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)<br />
Jaguar | Investigating GPU acceleration | TBD | Under development; Multi-GPU support | Schrodinger, Inc.; http://www.schrodinger.com/kb/278<br />
LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development; Multi-GPU support | NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf<br />
MOLCAS | CU_BLAS support | 1.1x | Released, Version 7.8; single GPU, additional GPU support coming in Version 8 | www.molcas.org<br />
MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT | 1.7-2.3x projected | Under development; multiple GPUs | www.molpro.net, Hans-Joachim Werner
Quantum <strong>Chemistry</strong> Applications<br />
Application | Features Supported | GPU Perf | Release Status | Notes<br />
MOPAC2009 | Pseudodiagonalization, full diagonalization, and density matrix assembling | 3.8-14x | Under development; single GPU | Academic port; http://openmopac.net<br />
NWChem | Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers | 3-10x projected | Release targeting end of 2012; multiple GPUs | Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf<br />
Octopus | DFT and TDDFT | TBD | Released | http://www.tddft.org/programs/octopus/<br />
PEtot | Density functional theory (DFT) plane wave pseudopotential calculations | 6-10x | Released; Multi-GPU | First-principles materials code that computes the behavior of the electron structures of materials<br />
Q-CHEM | RI-MP2 | 8x-14x | Released, Version 4.0 | http://www.qchem.com/doc_for_web/qchem_manual_4.0.pdf
Quantum <strong>Chemistry</strong> Applications<br />
Application | Features Supported | GPU Perf | Release Status | Notes<br />
QMCPACK | Main features | 3-4x | Released; multiple GPUs | NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK<br />
Quantum Espresso/PWscf | PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs | 2.5-3.5x | Released, Version 5.0; multiple GPUs | Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/<br />
TeraChem | "Full GPU-based solution" | 44-650x vs. GAMESS CPU version | Released, Version 1.5; multi-GPU/single node | Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf<br />
VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2x (2 GPUs comparable to 128 CPU cores) | Available on request; multiple GPUs | By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf
BigDFT
CP2K
Kepler, It's Faster<br />
[Chart: performance relative to CPU-only — CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, CPU + 2x K20X]<br />
Running CP2K version 12413-trunk on CUDA 5.0.36<br />
The blue node contains 2x E5-2687W CPUs (150W, 8 cores per CPU).<br />
The green nodes contain 2x E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).<br />
Using GPUs delivers up to 12.6x the performance per node.
Strong Scaling<br />
[Chart: speedup relative to 256 non-GPU cores, XK6 with GPUs vs. XK6 without GPUs — 2.3x at 256 cores, 2.9x at 512 cores, 3x at 768 cores]<br />
Conducted on a Cray XK6 using matrix-matrix multiplication, NREP=6 and N=159,000 with 50% occupation.<br />
Speedups increase as more cores are added, up to 3x at 768 cores.
Kepler, Keeping the Planet Green<br />
[Chart: energy expended (kJ) — CPU Only, CPU + K20, CPU + 2x K20; lower is better]<br />
Running CP2K version 12413-trunk on CUDA 5.0.36<br />
The blue node contains 2x E5-2687W CPUs (150W, 8 cores per CPU).<br />
The green nodes contain 2x E5-2687W CPUs and 1 or 2 NVIDIA K20 GPUs (235W each).<br />
Energy Expended = Power x Time<br />
Using K20s will lower energy use by over 75% for the same simulation.
GAUSSIAN
Gaussian<br />
Key quantum chemistry code<br />
ACS Fall 2011 press release: joint collaboration between Gaussian, NVIDIA, and PGI for GPU acceleration: http://www.gaussian.com/g_press/nvidia_press.htm<br />
No such release exists for Intel MIC or AMD GPUs.<br />
Mike Frisch quote:<br />
"Calculations using Gaussian are limited primarily by the available computing resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian's customers need to address."<br />
NVIDIA Confidential
GAMESS
GAMESS Partnership Overview<br />
Mark Gordon and Andrey Asadchev, key developers of GAMESS, are collaborating with NVIDIA. Mark Gordon is a recipient of an NVIDIA Professor Partnership Award.<br />
Quantum <strong>Chemistry</strong> is one of the major consumers of CPU cycles at national supercomputer centers.<br />
NVIDIA developer resources are fully allocated to the GAMESS code.<br />
"We like to push the envelope as much as we can in the direction of highly scalable efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE Laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs."<br />
Prof. Mark Gordon, Distinguished Professor, Department of <strong>Chemistry</strong>, Iowa State University and Director, Applied Mathematical Sciences Program, AMES Laboratory
GAMESS Aug. 2011 Release — Relative Performance for Two Small Molecules<br />
GAMESS August 2011 GPU Performance<br />
First GPU-supported GAMESS release via "libqc", a library for fast quantum chemistry on multiple NVIDIA GPUs in multiple nodes, built with CUDA software.<br />
Covers 2e- AO integrals and their assembly into a closed-shell Fock matrix.<br />
[Chart: relative performance of 4x E5640 CPUs vs. 4x E5640 CPUs + 4x Tesla C2070s for Ginkgolide (53 atoms) and Vancomycin (176 atoms)]
Upcoming GAMESS Q4 2012 Release<br />
Multi-node with multi-GPU supported<br />
Rys Quadrature Hartree-Fock: 8 CPU cores + M2070 yields a 2.3-2.9x speedup over 8 CPU cores alone. See the 2012 publication.<br />
Møller–Plesset perturbation theory (MP2): preliminary code completed; paper in development.<br />
Coupled Cluster SD(T): CCSD code completed, (T) in progress.
GAMESS - New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock<br />
[Chart: Hartree-Fock GPU speedups* — Taxol 6-31G: 2.3x; Taxol 6-31G(d): 2.5x; Taxol 6-31G(2d,2p): 2.5x; Taxol 6-31++G(d,p): 2.3x; Valinomycin 6-31G: 2.4x; Valinomycin 6-31G(d): 2.3x; Valinomycin 6-31G(2d,2p): 2.9x]<br />
Adding 1x 2070 GPU speeds up computations by 2.3x to 2.9x.<br />
* A. Asadchev, M.S. Gordon, "New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock," Journal of Chemical Theory and Computation (2012)
GPAW
Used with permission from Samuli Hakala
NWChem
NWChem - Speedup of the non-iterative calculation for various configurations/tile sizes<br />
System: cluster consisting of dual-socket nodes constructed from:<br />
• 8-core AMD Interlagos processors<br />
• 64 GB of memory<br />
• Tesla M2090 (Fermi) GPUs<br />
The nodes are connected using a high-performance QDR InfiniBand interconnect.<br />
Courtesy of Kowalski, K., Bhaskaran-Nair, et al. @ PNNL, JCTC (submitted)
Quantum Espresso/PWscf<br />
Kepler, Fast Science<br />
[Chart: AUsurf performance relative to CPU only, for CPU Only, CPU + M2090, CPU + K10,<br />
CPU + 2x M2090, and CPU + 2x K10 nodes]<br />
Running Quantum Espresso version 5.0-build7 on CUDA 5.0.36.<br />
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).<br />
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA M2090 or K10 GPUs<br />
(225W and 235W respectively).<br />
Using K10s delivers up to 11.7x the performance per node over CPUs,<br />
and 1.7x the performance compared to M2090s.<br />
Extreme Performance/Price from 1 GPU<br />
[Chart: price and performance, scaled to the CPU-only system, for CPU Only vs. CPU + GPU<br />
on the Shilu-3 and Water-on-Calcite benchmarks; calcite structure shown]<br />
Simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU;<br />
an NVIDIA C2050 was used for the GPU.<br />
Adding a GPU can improve performance by 3.7x while only increasing price by 25%.<br />
Extreme Performance/Price from 1 GPU<br />
[Chart: price and performance, scaled to the CPU-only system, for CPU Only vs. CPU + GPU<br />
on AUSURF112 at the k-point and gamma-point]<br />
Simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU;<br />
an NVIDIA C2050 was used for the GPU.<br />
Calculation done for a gold surface of 112 atoms.<br />
Adding a GPU can improve performance by 3.5x while only increasing price by 25%.<br />
Replace 72 CPUs with 8 GPUs<br />
[Chart: elapsed time (minutes) for LSMO-BFO (120 atoms), 8 k-points:<br />
120 CPUs ($42,000): 223 minutes; 48 CPUs + 8 GPUs ($32,800): 219 minutes]<br />
Simulations run on PLX @ CINECA. Intel 6-core 2.66 GHz X5550s were used for the CPUs;<br />
NVIDIA M2070s were used for the GPUs.<br />
The GPU-accelerated setup performs faster and costs 24% less.<br />
QE/PWscf - Green Science<br />
[Chart: power consumption (watts) for LSMO-BFO (120 atoms), 8 k-points - lower is better:<br />
120 CPUs ($42,000) vs. 48 CPUs + 8 GPUs ($32,800)]<br />
Simulations run on PLX @ CINECA. Intel 6-core 2.66 GHz X5550s were used for the CPUs;<br />
NVIDIA M2070s were used for the GPUs.<br />
Over a year, the lower power consumption would save $4,300 on energy bills.<br />
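The annual-savings figure follows from simple arithmetic on the power gap. The sketch below assumes round-the-clock operation and an electricity price of $0.10/kWh, neither of which is stated on the slide:<br />

```python
def annual_energy_savings(watts_saved, usd_per_kwh=0.10, hours_per_year=8760):
    """Dollars saved per year by drawing `watts_saved` fewer watts continuously."""
    return watts_saved / 1000.0 * hours_per_year * usd_per_kwh

# An assumed ~5 kW reduction at $0.10/kWh lands near the quoted $4,300/year:
print(annual_energy_savings(5000))  # 4380.0
```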
NVIDIA GPUs Use Less Energy<br />
[Chart: energy consumption (kWh) on different tests - lower is better. CPU Only vs. CPU+GPU<br />
for Shilu-3, AUSURF112, and Water-on-Calcite, with energy reductions of 54-58%]<br />
Simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU;<br />
an NVIDIA C2050 was used for the GPU.<br />
In all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system.<br />
QE/PWscf - Great Strong Scaling in Parallel<br />
[Chart: CdSe-159 walltime (seconds) of one full SCF cycle - lower is better - for CPU vs.<br />
CPU+GPU across 2 (16) to 14 (112) nodes (total CPU cores), with per-point speedups of 2.1x-2.5x]<br />
Simulations run on STONEY @ ICHEC. Two quad-core 2.87 GHz Intel X5560s were used in each node;<br />
two NVIDIA M2090s were used in each node for the CPU+GPU test.<br />
159 cadmium selenide nanodots<br />
Speedups up to 2.5x with GPU acceleration.<br />
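The per-point speedups in this kind of strong-scaling chart are just ratios of walltimes at equal node counts. Two small helpers make the bookkeeping explicit; the walltimes below are made-up values shaped like the chart, not measured data:<br />

```python
def speedup(t_cpu, t_gpu):
    """GPU speedup at a fixed node count: CPU-only walltime over CPU+GPU walltime."""
    return t_cpu / t_gpu

def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Parallel efficiency going from n_base to n nodes (1.0 = ideal scaling)."""
    return (t_base * n_base) / (t_n * n)

# Illustrative walltimes (seconds) at 2 and 14 nodes:
cpu = {2: 30000, 14: 6000}
gpu = {2: 12000, 14: 2800}
print(round(speedup(cpu[2], gpu[2]), 2))                            # 2.5
print(round(strong_scaling_efficiency(cpu[2], 2, cpu[14], 14), 2))  # 0.71
```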
QE/PWscf - More Powerful Strong Scaling<br />
[Chart: GeSnTe134 walltime (seconds) of a full SCF cycle - lower is better - for CPU vs.<br />
CPU+GPU across 4 (48) to 44 (528) nodes (total CPU cores), with per-point speedups of 1.6x-2.4x]<br />
Simulations run on PLX @ CINECA. Two 6-core 2.4 GHz Intel E5645s were used in each node;<br />
two NVIDIA M2070s were used in each node for the CPU+GPU test.<br />
Accelerate your cluster by up to 2.1x with NVIDIA GPUs.<br />
Try GPU-accelerated Quantum Espresso for free - www.nvidia.com/GPUTestDrive<br />
TeraChem<br />
Supercomputer Speeds on GPUs<br />
[Chart: time for one SCF step (seconds) on the giant fullerene C240 molecule:<br />
4096 quad-core CPUs ($19,000,000) vs. 8 C2050s ($31,000)]<br />
TeraChem running on 8 C2050s in 1 node; NWChem running on 4096 quad-core CPUs<br />
in the Chinook supercomputer.<br />
Similar performance from just a handful of GPUs.<br />
TeraChem - Bang for the Buck<br />
[Chart: performance/price relative to the supercomputer, giant fullerene C240 molecule:<br />
4096 quad-core CPUs ($19,000,000): 1; 8 C2050s ($31,000): 493]<br />
TeraChem running on 8 C2050s in 1 node; NWChem running on 4096 quad-core CPUs<br />
in the Chinook supercomputer.<br />
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node<br />
configuration. Contact your preferred HW vendor for actual pricing.<br />
Dollars spent on GPUs do roughly 500x more science than those spent on CPUs.<br />
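The 493x figure is a performance-per-dollar ratio. A minimal sketch, treating performance as the inverse of SCF time and using the slide's prices with hypothetical (equal) timings for illustration:<br />

```python
def perf_per_dollar_ratio(time_a, price_a, time_b, price_b):
    """How many times more performance-per-dollar system B gives than system A.

    Performance is taken as 1/time, so this reduces to
    (time_a * price_a) / (time_b * price_b).
    """
    return (1.0 / time_b / price_b) / (1.0 / time_a / price_a)

# Slide prices with equal hypothetical SCF times; the measured 493x on the slide
# implies the GPU run took somewhat longer in absolute time:
print(round(perf_per_dollar_ratio(60.0, 19_000_000, 60.0, 31_000)))  # 613
```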
Kepler's Even Better<br />
[Charts: time (seconds) on Olestra (453 atoms), C2050 vs. K20C.<br />
The first chart is BLYP/6-31G(d); the second is B3LYP/6-31G(d).]<br />
TeraChem running on C2050 and K20C.<br />
Kepler performs 2x faster than Tesla (Fermi).<br />
Viz, "Docking" and Related Applications Growing<br />
Application | Features Supported | GPU Perf | Release Status | Notes<br />
Amira 5® | 3D visualization of volumetric data and surfaces | 70x | Released, Version 5.3.3; single GPU | Visualization from Visage Imaging. Next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html<br />
BINDSURF | Allows fast processing of large ligand databases | 100x | Available upon request to authors; single GPU | High-throughput parallel blind virtual screening. http://www.biomedcentral.com/1471-2105/13/S14/S13<br />
BUDE | Empirical free energy forcefield | 6.5-13.4x | Released; single GPU | University of Bristol. http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm<br />
Core Hopping | GPU-accelerated application | 3.75-5000x | Released, Suite 2011; single and multi-GPU | Schrodinger, Inc. http://www.schrodinger.com/products/14/32/<br />
FastROCS | Real-time shape similarity searching/comparison | 800-3000x | Released; single and multi-GPU | OpenEye Scientific Software. http://www.eyesopen.com/fastrocs<br />
PyMol | Lines: 460% increase; cartoons: 1246%; surfaces: 1746%; spheres: 753%; ribbons: 426% | 1700x | Released, Version 1.5; single GPU | http://pymol.org/<br />
VMD | High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multi-GPU support for display of molecular orbitals | 100-125x or greater on kernels | Released, Version 1.9 | Visualization from University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/vmd/<br />
GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.<br />
FastROCS<br />
OpenEye Japan - Hideyuki Sato, Ph.D.<br />
© 2012 OpenEye Scientific Software<br />
ROCS on the GPU: FastROCS<br />
[Chart: shape overlays per second, CPU vs. GPU (y-axis up to 400,000)]<br />
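ROCS-style shape comparison rests on Gaussian volume overlap (Grant & Pickup). The sketch below is a minimal first-order pairwise version, not OpenEye's implementation; the prefactor `p` and a single width `alpha` per atom (roughly the carbon value) are assumed rather than taken from the talk:<br />

```python
import math

def gaussian_overlap_volume(mol_a, mol_b, p=2.7, alpha=0.836):
    """First-order Gaussian overlap volume between two sets of atom centers.

    mol_a / mol_b: lists of (x, y, z) coordinates in angstroms.
    Each atom is a spherical Gaussian p * exp(-alpha * |r - R|^2).
    """
    total = 0.0
    for ax, ay, az in mol_a:
        for bx, by, bz in mol_b:
            r2 = (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
            # The product of two spherical Gaussians integrates to this closed form:
            total += p * p * (math.pi / (alpha + alpha)) ** 1.5 * math.exp(
                -alpha * alpha * r2 / (alpha + alpha)
            )
    return total

def shape_tanimoto(mol_a, mol_b):
    """Shape Tanimoto: V_AB / (V_AA + V_BB - V_AB); 1.0 for identical shapes."""
    vab = gaussian_overlap_volume(mol_a, mol_b)
    vaa = gaussian_overlap_volume(mol_a, mol_a)
    vbb = gaussian_overlap_volume(mol_b, mol_b)
    return vab / (vaa + vbb - vab)
```

A screening run evaluates this score over many trial superpositions per conformer pair, which is why the all-pairs inner loop parallelizes so well on GPUs.<br />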
Riding Moore's Law<br />
[Chart: shape overlays per second across GPU generations - C1060, C2050, C2075, C2090,<br />
K10, K20 (y-axis up to 2,000,000)]<br />
FastROCS scaling across 4x K10s (2 physical GPUs per K10)<br />
[Chart: conformers per second (y-axis up to ~8,000,000) vs. number of individual K10 GPUs, 1-8.<br />
Note: each K10 has 2 physical GPUs on the board.]<br />
53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule)<br />
Benefits of GPU-Accelerated Computing<br />
Faster than CPU-only systems in all tests<br />
Large performance boost with a marginal price increase<br />
Energy usage cut by more than half<br />
GPUs scale well within a node and over multiple nodes<br />
The K20 GPU is our fastest and lowest-power high-performance GPU yet<br />
Try GPU-accelerated TeraChem for free - www.nvidia.com/GPUTestDrive<br />
GPU Test Drive<br />
Experience GPU Acceleration<br />
For Computational Chemistry Researchers and Biophysicists<br />
Preconfigured with Molecular Dynamics Apps<br />
Remotely Hosted GPU Servers<br />
Free & Easy - Sign Up, Log In and See Results<br />
www.nvidia.com/gputestdrive<br />