HPC with CUDA
High Performance<br />
Computing <strong>with</strong> <strong>CUDA</strong><br />
Supercomputing 2011 Tutorial<br />
Cyril Zeller, NVIDIA Corporation<br />
© NVIDIA Corporation 2011
Welcome<br />
• Goal: an introduction to high performance computing <strong>with</strong> <strong>CUDA</strong><br />
• <strong>CUDA</strong> = NVIDIA’s architecture for GPU computing<br />
• Outline:<br />
• Motivation and introduction<br />
• <strong>CUDA</strong> C/C++<br />
• <strong>CUDA</strong> Fortran and <strong>CUDA</strong> libraries<br />
• Optimizations<br />
• Multi-GPU programming<br />
• Case studies
GPUs are Fast!<br />
8x Higher Linpack<br />
Performance (Gflops): CPU Server 80.1, GPU-CPU Server 656.1<br />
Performance / $ (Gflops / $K): CPU Server 11, GPU-CPU Server 60<br />
Performance / watt (Gflops / kwatt): CPU Server 146, GPU-CPU Server 656<br />
CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kw<br />
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kw
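The per-dollar and per-watt figures on this slide follow from the Linpack results and the two server configurations listed above. A minimal C++ sketch of that arithmetic is given here; the struct, constant, and function names are illustrative assumptions and not part of the tutorial material, and the only data used are the numbers quoted on the slide.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Figures copied from the slide; all names below are illustrative only.
struct Server {
    const char* name;
    double linpack_gflops;  // Linpack performance in Gflops
    double price_kusd;      // system price in thousands of dollars
    double power_kw;        // system power in kilowatts
};

const Server kCpuServer    = {"CPU Server",     80.1,  7.0,  0.55};
const Server kGpuCpuServer = {"GPU-CPU Server", 656.1, 11.0, 1.0};

// Performance per dollar, in Gflops per $K.
double gflops_per_kusd(const Server& s) {
    assert(s.price_kusd > 0.0);
    return s.linpack_gflops / s.price_kusd;
}

// Performance per watt, in Gflops per kilowatt.
double gflops_per_kw(const Server& s) {
    assert(s.power_kw > 0.0);
    return s.linpack_gflops / s.power_kw;
}

// Ratio of Linpack performance between two servers.
double speedup(const Server& base, const Server& other) {
    assert(base.linpack_gflops > 0.0);
    return other.linpack_gflops / base.linpack_gflops;
}

// Round to the nearest integer, as the slide's bar labels appear to do.
long rounded(double x) { return std::lround(x); }

// Print one row of derived metrics for a server.
void print_report(const Server& s) {
    std::printf("%-15s %6.1f Gflops  %5.1f Gflops/$K  %6.1f Gflops/kW\n",
                s.name, s.linpack_gflops, gflops_per_kusd(s), gflops_per_kw(s));
}
```

Calling `print_report(kCpuServer)` and `print_report(kGpuCpuServer)` from a `main` function prints the derived metrics; rounded to whole numbers they agree with the 11, 60, 146, and 656 shown on the slide, and `speedup(kCpuServer, kGpuCpuServer)` rounds to the quoted 8x.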
World’s Fastest MD Simulation<br />
Sustained Performance of 1.87 Petaflops<br />
MD Simulation for Crystalline Silicon<br />
Institute of Process Engineering (IPE)<br />
Chinese Academy of Sciences (CAS)<br />
Used all 7168 Tesla GPUs on<br />
Tianhe-1A GPU Supercomputer
World’s Greenest Petaflop Supercomputer<br />
Tsubame 2.0<br />
Tokyo Institute of Technology<br />
1.19 Petaflops<br />
4,224 Tesla M2050 GPUs
Increasing Number of Professional <strong>CUDA</strong> Applications<br />
Categories: Tools & Libraries, Oil & Gas, Numerical Analytics, Finance, Other<br />
Status labels: Available Now, Future, Available, Announced<br />
<strong>CUDA</strong> C/C++<br />
NVIDIA NPP Perf Primitives<br />
py<strong>CUDA</strong><br />
Headwave Suite<br />
ffA SVI Pro<br />
LabVIEW Libraries<br />
NAG RNG<br />
Siemens 4D Ultrasound<br />
Manifold GIS<br />
Parallel Nsight Vis Studio IDE<br />
PGI Fortran<br />
R-Stream Reservoir Labs<br />
OpenGeo Solns OpenSEIS<br />
Paradigm SKUA<br />
AccelerEyes Jacket: MATLAB<br />
Numerix CounterpartyRisk<br />
Digisens CT<br />
MVTech Mach Vision<br />
NVIDIA Video Libraries<br />
Thrust C++ Template Lib<br />
PBSWorks<br />
GeoStar Seismic<br />
VSG Open Inventor<br />
MATLAB<br />
SciComp SciFinance<br />
Schrodinger Core Hopping<br />
Dalsa Mach Vision<br />
ParaTools VampirTrace<br />
Bright Cluster Manager<br />
MOAB Adaptive Comp<br />
Acceleware RTM Solver<br />
Paradigm GeoDepth RTM<br />
Mathematica<br />
Aquimin AlphaVision<br />
Useful Prog Medical Imag<br />
WRF Weather<br />
PGI Accelerators<br />
CAPS HMPP<br />
Torque Adaptive Comp<br />
StoneRidge RTM<br />
VSG Avizo<br />
Hanweck Volera Options Analysi<br />
ASUCA Weather Model<br />
EMPhotonics CULAPACK<br />
MAGMA<br />
TotalView Debugger<br />
Seismic City RTM<br />
SVI Pro<br />
Murex MACS<br />
Allinea DDT Debugger<br />
GPU Packages For R Stats Pkg<br />
IMSL<br />
Tsunami RTM<br />
SEA 3D Pro 2010<br />
Tau<strong>CUDA</strong> Perf Tools<br />
Platform LSF Cluster Mgr<br />
Schlumberger Omega<br />
PGI <strong>CUDA</strong>-X86<br />
GPU.net<br />
Schlumberger Petrel<br />
Paradigm VoxelGeo
Increasing Number of Professional <strong>CUDA</strong> Applications<br />
Categories: Bio-Chemistry, Bio-Informatics, EDA, CAE, Video, Rendering<br />
Status labels: Available Now, Future, Available, Announced<br />
Acellera ACEMD<br />
GAMESS<br />
AMBER<br />
TeraChem<br />
<strong>CUDA</strong>-BLASTP<br />
<strong>CUDA</strong>-EC<br />
<strong>CUDA</strong>-MEME<br />
<strong>CUDA</strong> SW++<br />
OpenEye ROCS<br />
GPU-HMMR<br />
MUMmerGPU<br />
Agilent EMPro 2010<br />
ACUSIM/Altair AcuSolve<br />
Adobe Premiere Pro<br />
Bunkspeed Shot (iray)<br />
mental images iray (OEM)<br />
CST Microwave<br />
Autodesk Moldflow<br />
Elemental Live & Server<br />
Refractive SW Octane<br />
NVIDIA OptiX (SDK)<br />
NAMD<br />
BigDFT<br />
ABINIT<br />
PIPER Docking<br />
SPEAG SEMCAD X<br />
ANSYS Mechanical<br />
MS Expression Encoder<br />
Chaos Group V-Ray RT<br />
Caustic OpenRL (SDK)<br />
GROMACS<br />
GROMOS<br />
HOOMD<br />
VMD<br />
HEX Protein Docking<br />
LAMMPS<br />
DL-POLY<br />
ANSOFT Nexxim<br />
Agilent ADS SPICE Sim<br />
Remcom XFdtd<br />
Synopsys TCAD<br />
Gauda OPC<br />
SIMULIA Abaqus/Std<br />
MotionDSP Ikena Video<br />
Autodesk 3ds Max (iray)<br />
Weta Digital PantaRay<br />
Impetus AFEA<br />
MainConcept <strong>CUDA</strong> H.264<br />
Dassault Catia v6 (iray)<br />
Works Zebra Zeany<br />
Metacomp CFD++<br />
Sorenson Squeeze 7<br />
Lightworks Artisan, Author<br />
FluiDyna Culises OpenFOAM<br />
Fraunhofer JPEG2000<br />
LSTC LS-DYNA 972<br />
MSC.Software Marc<br />
Cebas finalRender
<strong>CUDA</strong> by the Numbers<br />
300,000,000 <strong>CUDA</strong> Capable GPUs<br />
500,000 <strong>CUDA</strong> Toolkit Downloads<br />
100,000 Active <strong>CUDA</strong> Developers<br />
400 Universities Teaching <strong>CUDA</strong><br />
100% OEMs offer <strong>CUDA</strong> GPU PCs
GPU Computing Applications<br />
Libraries & Middleware: CUBLAS, CUFFT, CULAPACK; NPP & CUDPP; Video; PhysX (physics); OptiX (ray tracing); mental ray, iray (rendering); Reality Server (3D web services)<br />
C, C++, Fortran, Java & Python, OpenCL, Direct Compute, Directives (Accelerator, HMPP, …)<br />
NVIDIA GPU <strong>with</strong> <strong>CUDA</strong> Parallel Computing Architecture<br />
Fermi architecture (compute capability 2.x)<br />
Tesla architecture (compute capability 1.x)<br />
Entertainment: GeForce 500 series, GeForce 400 series, GeForce 200 series, GeForce 9 series, GeForce 8 series<br />
Professional Graphics: Quadro Fermi series, Quadro FX series, QuadroPlex series, Quadro NVS series<br />
High Performance Computing: Tesla 20 series, Tesla 10 series<br />
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Tesla Data Center & Workstation GPU Solutions<br />
Tesla M-series GPUs (Servers & Blades)<br />
Model: M2090 | M2070 | M2050<br />
Cores: 512 | 448 | 448<br />
Memory: 6 GB | 6 GB | 3 GB<br />
Memory bandwidth (ECC off): 177.6 GB/s | 150 GB/s | 148.8 GB/s<br />
Peak single precision (Gflops): 1331 | 1030 | 1030<br />
Peak double precision (Gflops): 665 | 515 | 515<br />
Tesla C-series GPUs (Workstations)<br />
Model: C2070 | C2050<br />
Cores: 448 | 448<br />
Memory: 6 GB | 3 GB<br />
Memory bandwidth (ECC off): 148.8 GB/s | 148.8 GB/s<br />
Peak single precision (Gflops): 1030 | 1030<br />
Peak double precision (Gflops): 515 | 515
NVIDIA Developer Ecosystem<br />
Numerical Packages: MATLAB, Mathematica, NI LabView, py<strong>CUDA</strong><br />
GPGPU Consultants & Training: ANEO, GPU Tech<br />
Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView<br />
GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python<br />
Parallelizing Compilers: PGI Accelerator, CAPS HMPP, m<strong>CUDA</strong>, OpenMP<br />
Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib<br />
OEM Solution Providers
Parallel Nsight: Visual Studio<br />
Visual Profiler: Windows/Linux/Mac<br />
cuda-gdb: Linux/Mac
Schedule<br />
08:30 AM Introduction<br />
08:45 AM <strong>CUDA</strong> C/C++ Basics<br />
Cyril Zeller, NVIDIA<br />
09:45 AM Break<br />
10:00 AM <strong>CUDA</strong> Fortran and <strong>CUDA</strong> Libraries<br />
Justin Luitjens, NVIDIA<br />
11:00 AM Break<br />
11:15 AM <strong>CUDA</strong> Optimizations<br />
Paulius Micikevicius, NVIDIA<br />
12:30 PM Lunch<br />
Schedule<br />
2:00 PM Multi-GPU Programming<br />
Paulius Micikevicius, NVIDIA<br />
2:45 PM Break<br />
3:00 PM Exploiting Thread Locality: Case of Many Small Linear Solves<br />
Vasily Volkov, UC Berkeley<br />
3:45 PM Break<br />
4:00 PM <strong>CUDA</strong>-Accelerated Monte Carlo for <strong>HPC</strong><br />
Andrew Sheppard, Fountainhead<br />
4:45 PM Close<br />