15.08.2012 Views

HPC with CUDA


High Performance Computing with CUDA

Supercomputing 2011 Tutorial

Cyril Zeller, NVIDIA Corporation

© NVIDIA Corporation 2011


Welcome

- Goal: an introduction to high performance computing with CUDA

- CUDA = NVIDIA’s architecture for GPU computing

- Outline:

  - Motivation and introduction
  - CUDA C/C++
  - CUDA Fortran and CUDA libraries
  - Optimizations
  - Multi-GPU programming
  - Case studies


GPUs are Fast!

Metric                               CPU Server   GPU-CPU Server
Performance (Linpack, Gflops)        80.1         656.1
Performance / $ (Gflops / $K)        11           60
Performance / watt (Gflops / kW)     146          656

8x higher Linpack performance on the GPU-CPU server.

CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW

GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
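
The per-dollar and per-watt figures on this slide follow from the Linpack results together with the listed server prices and power draws. A minimal sketch of that arithmetic is shown here; the `Server` class and its field names are illustrative and do not come from the tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """One server configuration from the slide (class and field names are illustrative)."""
    name: str
    linpack_gflops: float  # measured Linpack performance in Gflops
    price_kusd: float      # price in thousands of dollars
    power_kw: float        # power draw in kilowatts

    def gflops_per_kusd(self) -> float:
        return self.linpack_gflops / self.price_kusd

    def gflops_per_kw(self) -> float:
        return self.linpack_gflops / self.power_kw


# Values copied from the slide.
cpu = Server("CPU 1U server", 80.1, 7.0, 0.55)
gpu = Server("GPU-CPU 1U server", 656.1, 11.0, 1.0)

# Ratio of the two Linpack results.
speedup = gpu.linpack_gflops / cpu.linpack_gflops

if __name__ == "__main__":
    for s in (cpu, gpu):
        print(f"{s.name}: {s.gflops_per_kusd():.0f} Gflops/$K, "
              f"{s.gflops_per_kw():.0f} Gflops/kW")
    print(f"Linpack speedup: {speedup:.1f}x")
```

Rounded to whole numbers, the results reproduce the slide's 11 and 60 Gflops/$K and 146 and 656 Gflops/kW, and the Linpack ratio of about 8.2 matches the "8x" headline.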


World’s Fastest MD Simulation

Sustained performance of 1.87 Petaflops

MD simulation for crystalline silicon

Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS)

Used all 7168 Tesla GPUs on the Tianhe-1A GPU supercomputer


World’s Greenest Petaflop Supercomputer

Tsubame 2.0, Tokyo Institute of Technology

1.19 Petaflops

4,224 Tesla M2050 GPUs


Increasing Number of Professional CUDA Applications

Categories: Tools & Libraries, Oil & Gas, Numerical Analytics, Finance, Other

[Slide grid: entries are grouped as Available Now or Future and color-coded as Available or Announced]

- CUDA C/C++
- NVIDIA NPP Perf Primitives
- pyCUDA
- Headwave Suite
- ffA SVI Pro
- LabVIEW Libraries
- NAG RNG
- Siemens 4D Ultrasound
- Manifold GIS
- Parallel Nsight Vis Studio IDE
- PGI Fortran
- R-Stream Reservoir Labs
- OpenGeo Solns OpenSEIS
- Paradigm SKUA
- AccelerEyes Jacket: MATLAB
- Numerix CounterpartyRisk
- Digisens CT
- MVTech Mach Vision
- NVIDIA Video Libraries
- Thrust C++ Template Lib
- PBSWorks
- GeoStar Seismic
- VSG Open Inventor
- MATLAB
- SciComp SciFinance
- Schrodinger Core Hopping
- Dalsa Mach Vision
- ParaTools VampirTrace
- Bright Cluster Manager
- MOAB Adaptive Comp
- Acceleware RTM Solver
- Paradigm GeoDepth RTM
- Mathematica
- Aquimin AlphaVision
- Useful Prog Medical Imag
- WRF Weather
- PGI Accelerators
- CAPS HMPP
- Torque Adaptive Comp
- StoneRidge RTM
- VSG Avizo
- Hanweck Volera Options Analysi
- ASUCA Weather Model
- EMPhotonics CULAPACK
- MAGMA
- TotalView Debugger
- Seismic City RTM
- SVI Pro
- Murex MACS
- Allinea DDT Debugger
- GPU Packages For R Stats Pkg
- IMSL
- Tsunami RTM
- SEA 3D Pro 2010
- TauCUDA Perf Tools
- Platform LSF Cluster Mgr
- Schlumberger Omega
- PGI CUDA-X86
- GPU.net
- Schlumberger Petrel
- Paradigm VoxelGeo


Increasing Number of Professional CUDA Applications

Categories: Bio-Chemistry, Bio-Informatics, EDA, CAE, Video, Rendering

[Slide grid: entries are grouped as Available Now or Future and color-coded as Available or Announced]

- Acellera ACEMD
- GAMESS
- AMBER
- TeraChem
- CUDA-BLASTP
- CUDA-EC
- CUDA-MEME
- CUDA SW++
- OpenEye ROCS
- GPU-HMMER
- MUMmerGPU
- Agilent EMPro 2010
- ACUSIM/Altair AcuSolve
- Adobe Premiere Pro
- Bunkspeed Shot (iray)
- mental images iray (OEM)
- CST Microwave
- Autodesk Moldflow
- Elemental Live & Server
- Refractive SW Octane
- NVIDIA OptiX (SDK)
- NAMD
- BigDFT
- ABINIT
- PIPER Docking
- SPEAG SEMCAD X
- ANSYS Mechanical
- MS Expression Encoder
- Chaos Group V-Ray RT
- Caustic OpenRL (SDK)
- GROMACS
- GROMOS
- HOOMD
- VMD
- HEX Protein Docking
- LAMMPS
- DL-POLY
- ANSOFT Nexxim
- Agilent ADS SPICE Sim
- Remcom XFdtd
- Synopsys TCAD
- Gauda OPC
- SIMULIA Abaqus/Std
- MotionDSP Ikena Video
- Autodesk 3ds Max (iray)
- Weta Digital PantaRay
- Impetus AFEA
- MainConcept CUDA H.264
- Dassault Catia v6 (iray)
- Works Zebra Zeany
- Metacomp CFD++
- Sorenson Squeeze 7
- Lightworks Artisan, Author
- FluiDyna Culises OpenFOAM
- Fraunhofer JPEG2000
- LSTC LS-DYNA 972
- MSC.Software Marc
- Cebas finalRender


CUDA by the Numbers

300,000,000   CUDA Capable GPUs
500,000       CUDA Toolkit Downloads
100,000       Active CUDA Developers
400           Universities Teaching CUDA
100%          OEMs offer CUDA GPU PCs


[Diagram: the CUDA software and hardware stack, from applications down to GPUs]

GPU Computing Applications

Libraries & Middleware: CUBLAS, CUFFT, CULAPACK, NPP, CUDPP, Video, PhysX (physics), OptiX (ray tracing), mental ray and iray (rendering), Reality Server (3D web services)

Programming interfaces: C, C++, Fortran, OpenCL, DirectCompute, Java & Python, Directives (Accelerator, HMPP, …)

NVIDIA GPU with CUDA Parallel Computing Architecture

Fermi architecture (compute capability 2.x)
  Entertainment: GeForce 500 series, GeForce 400 series
  Professional Graphics: Quadro Fermi series
  High Performance Computing: Tesla 20 series

Tesla architecture (compute capability 1.x)
  Entertainment: GeForce 200 series, GeForce 9 series, GeForce 8 series
  Professional Graphics: Quadro FX series, QuadroPlex series, Quadro NVS series
  High Performance Computing: Tesla 10 series

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.


Tesla Data Center & Workstation GPU Solutions

                                  Tesla M-series (Servers & Blades)      Tesla C-series (Workstations)
                                  M2090        M2070        M2050        C2070        C2050
Cores                             512          448          448          448          448
Memory                            6 GB         6 GB         3 GB         6 GB         3 GB
Memory bandwidth (ECC off)        177.6 GB/s   150 GB/s     148.8 GB/s   148.8 GB/s   148.8 GB/s
Peak single precision (Gflops)    1331         1030         1030         1030         1030
Peak double precision (Gflops)    665          515          515          515          515
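
One way to read the figures on this slide is to divide peak arithmetic throughput by peak memory bandwidth. The result is the number of floating-point operations a kernel would have to perform per byte read from device memory before arithmetic, rather than bandwidth, becomes the limiting factor at peak rates. This ratio is not a figure from the slides; it is derived here only from the listed values, and the dictionary layout and the `flops_per_byte` helper are hypothetical names chosen for the sketch.

```python
# Peak figures copied from the slide, per board:
# (memory bandwidth in GB/s with ECC off,
#  single-precision Gflops, double-precision Gflops).
TESLA_BOARDS = {
    "M2090": (177.6, 1331.0, 665.0),
    "M2070": (150.0, 1030.0, 515.0),
    "M2050": (148.8, 1030.0, 515.0),
    "C2070": (148.8, 1030.0, 515.0),
    "C2050": (148.8, 1030.0, 515.0),
}


def flops_per_byte(board: str, double_precision: bool = True) -> float:
    """Peak flops divided by peak bytes moved (helper name is illustrative)."""
    bandwidth_gbs, sp_gflops, dp_gflops = TESLA_BOARDS[board]
    peak_gflops = dp_gflops if double_precision else sp_gflops
    return peak_gflops / bandwidth_gbs  # (Gflop/s) / (GB/s) = flop per byte


if __name__ == "__main__":
    for board in TESLA_BOARDS:
        print(f"{board}: {flops_per_byte(board, False):.2f} SP flop/byte, "
              f"{flops_per_byte(board, True):.2f} DP flop/byte")
```

On these numbers each board needs roughly 3.4 to 3.7 double-precision operations, or about 6.9 to 7.5 single-precision operations, per byte of memory traffic to reach its arithmetic peak. The listed double-precision peak is also half the single-precision peak on every board.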


NVIDIA Developer Ecosystem

Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA

Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView

GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python

Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP

Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib

GPGPU Consultants & Training: ANEO, GPU Tech

OEM Solution Providers


Parallel Nsight: Visual Studio

Visual Profiler: Windows/Linux/Mac

cuda-gdb: Linux/Mac


Schedule

08:30 AM  Introduction
08:45 AM  CUDA C/C++ Basics
          Cyril Zeller, NVIDIA
09:45 AM  Break
10:00 AM  CUDA Fortran and CUDA Libraries
          Justin Luitjens, NVIDIA
11:00 AM  Break
11:15 AM  CUDA Optimizations
          Paulius Micikevicius, NVIDIA
12:30 PM  Lunch
02:00 PM  Multi-GPU Programming
          Paulius Micikevicius, NVIDIA
02:45 PM  Break
03:00 PM  Exploiting Thread Locality: Case of Many Small Linear Solves
          Vasily Volkov, UC Berkeley
03:45 PM  Break
04:00 PM  CUDA-Accelerated Monte Carlo for HPC
          Andrew Sheppard, Fountainhead
04:45 PM  Close
