HPC with CUDA

High Performance Computing with CUDA

Supercomputing 2011 Tutorial 

Cyril Zeller, NVIDIA Corporation 

© NVIDIA Corporation 2011

Welcome 

• Goal: an introduction to high performance computing with CUDA

• CUDA = NVIDIA’s architecture for GPU computing

• Outline:

    • Motivation and introduction
    • CUDA C/C++
    • CUDA Fortran and CUDA libraries
    • Optimizations
    • Multi-GPU programming
    • Case studies

GPUs are Fast! 

[Figure: three bar charts comparing a CPU server with a GPU-CPU server; the performance chart is labeled "8x Higher Linpack"]

Metric                              CPU Server    GPU-CPU Server
Performance (Gflops)                80.1          656.1
Performance / $ (Gflops / $K)       11            60
Performance / watt (Gflops / kW)    146           656

CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW

GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
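The per-dollar and per-watt figures on this slide follow from the Linpack numbers and the two server configurations listed above. The short Python sketch below recomputes them; the dictionary layout and the function names are illustrative choices, not part of the original slides, and the inputs are simply the values quoted on the slide.

```python
# Figures as quoted on the "GPUs are Fast!" slide.
# The structure and names here are hypothetical, chosen for illustration.
servers = {
    "CPU server":     {"gflops": 80.1,  "price_kusd": 7.0,  "power_kw": 0.55},
    "GPU-CPU server": {"gflops": 656.1, "price_kusd": 11.0, "power_kw": 1.0},
}


def derived_metrics(spec):
    """Return (Gflops per $K, Gflops per kW) for one server configuration."""
    per_kusd = spec["gflops"] / spec["price_kusd"]
    per_kw = spec["gflops"] / spec["power_kw"]
    return per_kusd, per_kw


def linpack_speedup(all_servers):
    """Ratio of GPU-CPU server Linpack throughput to CPU server throughput."""
    return all_servers["GPU-CPU server"]["gflops"] / all_servers["CPU server"]["gflops"]


if __name__ == "__main__":
    for name, spec in servers.items():
        per_kusd, per_kw = derived_metrics(spec)
        print(f"{name}: {per_kusd:.1f} Gflops/$K, {per_kw:.1f} Gflops/kW")
    print(f"Linpack speedup: {linpack_speedup(servers):.1f}x")
```

Rounded to whole numbers, the results agree with the chart labels (11 vs. 60 Gflops/$K, 146 vs. 656 Gflops/kW, and roughly 8x on Linpack).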

World’s Fastest MD Simulation 


Sustained performance of 1.87 Petaflops

MD simulation for crystalline silicon

Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS)

Used all 7168 Tesla GPUs on the Tianhe-1A GPU supercomputer

World’s Greenest Petaflop Supercomputer 

Tsubame 2.0 

Tokyo Institute of Technology 


1.19 Petaflops 

4,224 Tesla M2050 GPUs

Increasing Number of Professional CUDA Applications

[Figure: grid of applications by category; entries are listed in slide order]

Categories: Tools & Libraries | Oil & Gas | Numerical Analytics | Finance | Other

Row labels: Available Now, Future. Legend: Available, Announced.

CUDA C/C++; NVIDIA NPP Perf Primitives; pyCUDA; Headwave Suite; ffA SVI Pro; LabVIEW Libraries; NAG RNG; Siemens 4D Ultrasound; Manifold GIS

Parallel Nsight Vis Studio IDE; PGI Fortran; R-Stream Reservoir Labs; OpenGeo Solns OpenSEIS; Paradigm SKUA; AccelerEyes Jacket: MATLAB; Numerix CounterpartyRisk; Digisens CT; MVTech Mach Vision

NVIDIA Video Libraries; Thrust C++ Template Lib; PBSWorks; GeoStar Seismic; VSG Open Inventor; MATLAB; SciComp SciFinance; Schrodinger Core Hopping; Dalsa Mach Vision

ParaTools VampirTrace; Bright Cluster Manager; MOAB Adaptive Comp; Acceleware RTM Solver; Paradigm GeoDepth RTM; Mathematica; Aquimin AlphaVision; Useful Prog Medical Imag; WRF Weather

PGI Accelerators; CAPS HMPP; Torque Adaptive Comp; StoneRidge RTM; VSG Avizo; Hanweck Volera Options Analysis; ASUCA Weather Model

EMPhotonics CULAPACK; MAGMA; TotalView Debugger; Seismic City RTM; SVI Pro; Murex MACS

Allinea DDT Debugger; GPU Packages For R Stats Pkg; IMSL; Tsunami RTM; SEA 3D Pro 2010

TauCUDA Perf Tools; Platform LSF Cluster Mgr; Schlumberger Omega; PGI CUDA-X86; GPU.net; Schlumberger Petrel; Paradigm VoxelGeo

Increasing Number of Professional CUDA Applications

[Figure: grid of applications by category; entries are listed in slide order]

Categories: Bio-Chemistry | Bio-Informatics | EDA | CAE | Video | Rendering

Row labels: Available Now, Future. Legend: Available, Announced.

Acellera ACEMD; GAMESS; AMBER; TeraChem; CUDA-BLASTP; CUDA-EC; CUDA-MEME; CUDA SW++; OpenEye ROCS; GPU-HMMR; MUMmerGPU; Agilent EMPro 2010; ACUSIM/Altair AcuSolve; Adobe Premiere Pro; Bunkspeed Shot (iray); mental images iray (OEM)

CST Microwave; Autodesk Moldflow; Elemental Live & Server; Refractive SW Octane; NVIDIA OptiX (SDK)

NAMD; BigDFT; ABINIT; PIPER Docking; SPEAG SEMCAD X; ANSYS Mechanical; MS Expression Encoder; Chaos Group V-Ray RT; Caustic OpenRL (SDK)

GROMACS; GROMOS; HOOMD; VMD; HEX Protein Docking; LAMMPS; DL-POLY

ANSOFT Nexxim; Agilent ADS SPICE Sim; Remcom XFdtd; Synopsys TCAD; Gauda OPC; SIMULIA Abaqus/Std; MotionDSP Ikena Video; Autodesk 3ds Max (iray); Weta Digital PantaRay

Impetus AFEA; MainConcept CUDA H.264; Dassault Catia v6 (iray); Works Zebra Zeany

Metacomp CFD++; Sorenson Squeeze 7; Lightworks Artisan, Author

FluiDyna Culises OpenFOAM; Fraunhofer JPEG2000

LSTC LS-DYNA 972; MSC.Software Marc; Cebas finalRender

CUDA by the Numbers 

300,000,000    CUDA Capable GPUs
500,000        CUDA Toolkit Downloads
100,000        Active CUDA Developers
400            Universities Teaching CUDA
100%           OEMs offer CUDA GPU PCs


[Figure: the CUDA software and hardware stack, from top to bottom]

GPU Computing Applications

Libraries & Middleware: CUBLAS, CUFFT, CULAPACK; NPP & CUDPP; Video; PhysX (physics); OptiX (ray tracing); mental ray, iray (rendering); Reality Server (3D web services)

C, C++, OpenCL, DirectCompute, Fortran, Java & Python, Directives (Accelerator, HMPP, …)

NVIDIA GPU with CUDA Parallel Computing Architecture

Fermi architecture (compute capability 2.x): GeForce 500 series (Entertainment); Quadro Fermi series (Professional Graphics); Tesla 20 series (High Performance Computing)

Tesla architecture (compute capability 1.x): Quadro FX series, QuadroPlex series, Quadro NVS series (Professional Graphics); Tesla 10 series (High Performance Computing)

OpenCL is a trademark of Apple Inc., used under license to the Khronos Group Inc.

Tesla Data Center & Workstation GPU Solutions 


Tesla M-series GPUs (M2090 | M2070 | M2050): Servers & Blades

Tesla C-series GPUs (C2070 | C2050): Workstations

                                        M2090        M2070       M2050        C2070        C2050
Cores                                   512          448         448          448          448
Memory                                  6 GB         6 GB        3 GB         6 GB         3 GB
Memory bandwidth (ECC off)              177.6 GB/s   150 GB/s    148.8 GB/s   148.8 GB/s   148.8 GB/s
Peak perf, single precision (Gflops)    1331         1030        1030         1030         1030
Peak perf, double precision (Gflops)    665          515         515          515          515
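The peak figures in the table above can be related to the core counts with a simple estimate: cores × clock × flops per cycle. The sketch below is a rough consistency check under stated assumptions, not NVIDIA's published derivation. The core clocks (1.30 GHz for the M2090, 1.15 GHz for the others) do not appear on the slide and are assumed here, as is the model of one fused multiply-add (2 flops) per core per cycle in single precision with double precision at half that rate. The function name `peak_gflops` is hypothetical.

```python
# Core counts and quoted peak figures come from the table on the slide.
# ASSUMPTION: the clock_ghz values are NOT on the slide; they are assumed
# core clocks used only for this back-of-the-envelope check.
gpus = {
    "M2090": {"cores": 512, "clock_ghz": 1.30, "sp_quoted": 1331, "dp_quoted": 665},
    "M2070": {"cores": 448, "clock_ghz": 1.15, "sp_quoted": 1030, "dp_quoted": 515},
    "M2050": {"cores": 448, "clock_ghz": 1.15, "sp_quoted": 1030, "dp_quoted": 515},
    "C2070": {"cores": 448, "clock_ghz": 1.15, "sp_quoted": 1030, "dp_quoted": 515},
    "C2050": {"cores": 448, "clock_ghz": 1.15, "sp_quoted": 1030, "dp_quoted": 515},
}


def peak_gflops(cores, clock_ghz, flops_per_cycle=2.0):
    """Estimate peak throughput in Gflops.

    ASSUMPTION: each core retires one fused multiply-add per cycle,
    counted as 2 floating-point operations.
    """
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    for name, g in gpus.items():
        sp = peak_gflops(g["cores"], g["clock_ghz"])
        dp = sp / 2.0  # ASSUMPTION: double precision runs at half the single-precision rate
        print(f"{name}: estimated SP {sp:.1f} (quoted {g['sp_quoted']}), "
              f"estimated DP {dp:.1f} (quoted {g['dp_quoted']})")
```

Under these assumptions the estimates land within one Gflops of every quoted value, which also shows the 2:1 single-to-double ratio in the table.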

NVIDIA Developer Ecosystem 


[Figure: ecosystem diagram; groups and their members]

Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA

GPGPU Consultants & Training: ANEO, GPU Tech

Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Visual Studio, Allinea, TotalView

GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python

Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP

Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib

OEM Solution Providers



Visual Studio

Visual Profiler: Windows/Linux/Mac

cuda-gdb: Linux/Mac

Schedule

08:30 AM    Introduction
08:45 AM    CUDA C/C++ Basics (Cyril Zeller, NVIDIA)
09:45 AM    Break
10:00 AM    CUDA Fortran and CUDA Libraries (Justin Luitjens, NVIDIA)
11:00 AM    Break
11:15 AM    CUDA Optimizations (Paulius Micikevicius, NVIDIA)
12:30 PM    Lunch
02:00 PM    Multi-GPU Programming (Paulius Micikevicius, NVIDIA)
02:45 PM    Break
03:00 PM    Exploiting Thread Locality: Case of Many Small Linear Solves (Vasily Volkov, UC Berkeley)
03:45 PM    Break
04:00 PM    CUDA-Accelerated Monte Carlo for HPC (Andrew Sheppard, Fountainhead)
04:45 PM    Close
