15.08.2012 Views

HPC with CUDA

High Performance Computing with CUDA
Supercomputing 2011 Tutorial
Cyril Zeller, NVIDIA Corporation
© NVIDIA Corporation 2011


Welcome

- Goal: an introduction to high performance computing with CUDA
- CUDA = NVIDIA’s architecture for GPU computing
- Outline:
  - Motivation and introduction
  - CUDA C/C++
  - CUDA Fortran and CUDA libraries
  - Optimizations
  - Multi-GPU programming
  - Case studies


GPUs are Fast!

[Figure: three bar charts comparing a CPU server with a GPU-CPU server]

Performance (Gflops): CPU Server 80.1, GPU-CPU Server 656.1 (8x higher Linpack)
Performance / $ (Gflops / $K): CPU Server 11, GPU-CPU Server 60
Performance / watt (Gflops / kW): CPU Server 146, GPU-CPU Server 656

CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW


World’s Fastest MD Simulation

Sustained performance of 1.87 Petaflops
MD simulation for crystalline silicon
Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS)
Used all 7168 Tesla GPUs on the Tianhe-1A GPU supercomputer


World’s Greenest Petaflop Supercomputer

Tsubame 2.0, Tokyo Institute of Technology
1.19 Petaflops
4,224 Tesla M2050 GPUs


Increasing Number of Professional CUDA Applications

Categories: Tools & Libraries, Oil & Gas, Numerical Analytics, Finance, Other
Axis: Available Now to Future. Legend: Available, Announced.

CUDA C/C++
NVIDIA NPP Perf Primitives
pyCUDA
Headwave Suite
ffA SVI Pro
LabVIEW Libraries
NAG RNG
Siemens 4D Ultrasound
Manifold GIS
Parallel Nsight Vis Studio IDE
PGI Fortran
R-Stream Reservoir Labs
OpenGeo Solns OpenSEIS
Paradigm SKUA
AccelerEyes Jacket: MATLAB
Numerix CounterpartyRisk
Digisens CT
MVTech Mach Vision
NVIDIA Video Libraries
Thrust C++ Template Lib
PBSWorks
GeoStar Seismic
VSG Open Inventor
MATLAB
SciComp SciFinance
Schrodinger Core Hopping
Dalsa Mach Vision
ParaTools VampirTrace
Bright Cluster Manager
MOAB Adaptive Comp
Acceleware RTM Solver
Paradigm GeoDepth RTM
Mathematica
Aquimin AlphaVision
Useful Prog Medical Imag
WRF Weather
PGI Accelerators
CAPS HMPP
Torque Adaptive Comp
StoneRidge RTM
VSG Avizo
Hanweck Volera Options Analysis
ASUCA Weather Model
EMPhotonics CULAPACK
MAGMA
TotalView Debugger
Seismic City RTM
SVI Pro
Murex MACS
Allinea DDT Debugger
GPU Packages For R Stats Pkg
IMSL
Tsunami RTM
SEA 3D Pro 2010
TauCUDA Perf Tools
Platform LSF Cluster Mgr
Schlumberger Omega
PGI CUDA-X86
GPU.net
Schlumberger Petrel
Paradigm VoxelGeo


Increasing Number of Professional CUDA Applications

Categories: Bio-Chemistry, Bio-Informatics, EDA, CAE, Video, Rendering
Axis: Available Now to Future. Legend: Available, Announced.

Acellera ACEMD
GAMESS
AMBER
TeraChem
CUDA-BLASTP
CUDA-EC
CUDA-MEME
CUDA SW++
OpenEye ROCS
GPU-HMMR
MUMmerGPU
Agilent EMPro 2010
ACUSIM/Altair AcuSolve
Adobe Premiere Pro
Bunkspeed Shot (iray)
mental images iray (OEM)
CST Microwave
Autodesk Moldflow
Elemental Live & Server
Refractive SW Octane
NVIDIA OptiX (SDK)
NAMD
BigDFT
ABINIT
PIPER Docking
SPEAG SEMCAD X
ANSYS Mechanical
MS Expression Encoder
Chaos Group V-Ray RT
Caustic OpenRL (SDK)
GROMACS
GROMOS
HOOMD
VMD
HEX Protein Docking
LAMMPS
DL-POLY
ANSOFT Nexxim
Agilent ADS SPICE Sim
Remcom XFdtd
Synopsys TCAD
Gauda OPC
SIMULIA Abaqus/Std
MotionDSP Ikena Video
Autodesk 3ds Max (iray)
Weta Digital PantaRay
Impetus AFEA
MainConcept CUDA H.264
Dassault Catia v6 (iray)
Works Zebra Zeany
Metacomp CFD++
Sorenson Squeeze 7
Lightworks Artisan, Author
FluiDyna Culises OpenFOAM
Fraunhofer JPEG2000
LSTC LS-DYNA 972
MSC.Software Marc
Cebas finalRender


CUDA by the Numbers

300,000,000 CUDA Capable GPUs
500,000 CUDA Toolkit Downloads
100,000 Active CUDA Developers
400 Universities Teaching CUDA
100% OEMs offer CUDA GPU PCs


[Diagram: the CUDA software and hardware stack, from applications down to GPUs]

GPU Computing Applications

Libraries & Middleware: CUBLAS, CUFFT, CULAPACK, NPP & CUDPP, Video, PhysX (physics), OptiX (ray tracing), mental ray and iray (rendering), Reality Server (3D web services)

Languages and APIs: C, C++, Fortran, OpenCL, DirectCompute, Java & Python, Directives (Accelerator, HMPP, …)

NVIDIA GPU with CUDA Parallel Computing Architecture

Fermi architecture (compute capability 2.x): GeForce 500 series, GeForce 400 series, Quadro Fermi series, Tesla 20 series
Tesla architecture (compute capability 1.x): GeForce 200 series, GeForce 9 series, GeForce 8 series, Quadro FX series, QuadroPlex series, Quadro NVS series, Tesla 10 series

Market segments: Entertainment, Professional Graphics, High Performance Computing

OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.


Tesla Data Center & Workstation GPU Solutions

Tesla M-series GPUs (M2090 | M2070 | M2050): Servers & Blades
Tesla C-series GPUs (C2070 | C2050): Workstations

                                  M2090        M2070        M2050        C2070        C2050
Cores                             512          448          448          448          448
Memory                            6 GB         6 GB         3 GB         6 GB         3 GB
Memory bandwidth (ECC off)        177.6 GB/s   150 GB/s     148.8 GB/s   148.8 GB/s   148.8 GB/s
Peak single precision (Gflops)    1331         1030         1030         1030         1030
Peak double precision (Gflops)    665          515          515          515          515


NVIDIA Developer Ecosystem

Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA
GPGPU Consultants & Training: ANEO, GPU Tech
Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView
GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib
OEM Solution Providers


Parallel Nsight: Visual Studio
Visual Profiler: Windows/Linux/Mac
cuda-gdb: Linux/Mac


Schedule

08:30 AM  Introduction
08:45 AM  CUDA C/C++ Basics (Cyril Zeller, NVIDIA)
09:45 AM  Break
10:00 AM  CUDA Fortran and CUDA Libraries (Justin Luitjens, NVIDIA)
11:00 AM  Break
11:15 AM  CUDA Optimizations (Paulius Micikevicius, NVIDIA)
12:30 PM  Lunch


Schedule (continued)

2:00 PM  Multi-GPU Programming (Paulius Micikevicius, NVIDIA)
2:45 PM  Break
3:00 PM  Exploiting Thread Locality: Case of Many Small Linear Solves (Vasily Volkov, UC Berkeley)
3:45 PM  Break
4:00 PM  CUDA-Accelerated Monte Carlo for HPC (Andrew Sheppard, Fountainhead)
4:45 PM  Close
