CUDA 5.5 Release Candidate

gputechconf.com

June 25, 2013


Linux RPM and Debian Packages

SIMPLIFIED INSTALLATION AND UPGRADES!

Use native package installers

apt-get, yum, zypper

Everything: apt-get install cuda

One package: apt-get install cuda-documentation

Updates: apt-get upgrade

© 2013 NVIDIA


RPM and Debian Packaging Features

Side-by-side installations

apt-get install cuda-5-5 cuda-6-0 (when available)

Cross-platform development (e.g., 32-bit target on 64-bit OS)

apt-get install cuda-cross

Version locking

update to latest version: apt-get install cuda

update to latest 5.5 version: apt-get install cuda-5-5

Everything is installed under /usr/local/cuda-5.5/


5.5 Download Page


RPM and Debian packages

One package repository per supported Linux distribution

Except Ubuntu 10.04 and RHEL 5

Ubuntu Example

$ [ download cuda-repo-__.deb ]

$ sudo dpkg -i cuda-repo-__.deb

$ sudo apt-get update

$ sudo apt-get install cuda


3rd Party Library ISV: myCUDAPluginA.so (with libcudart.so.4.2), myCUDAPluginB.so (with libcudart.so.3.2)

Application Developer: CUDA App (with libcudart.so.5.0)

End-User System: CUDA App, myCUDAPluginA.so, myCUDAPluginB.so, libcudart.so.3.2, libcudart.so.4.2, libcudart.so.5.0

Complex coordination to ship correct components


Static Library CUDART (CUDA Runtime)

3rd Party Library ISV: myCUDAPluginA.so, myCUDAPluginB.so

Application Developer: CUDA App

End-User System: CUDA App, myCUDAPluginA.so, myCUDAPluginB.so

Distribution is made much simpler

Linux, Windows, Mac
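
One way to see the static runtime in practice is a small program that reports the runtime version it carries. The sketch below is illustrative only: the file name and the helper `runtime_version` are hypothetical, and the build lines assume nvcc's `--cudart` option (`static`, `shared`, or `none`).

```cuda
// runtime_version.cu (hypothetical file name)
// Assumed build lines:
//   nvcc --cudart static -o runtime_version runtime_version.cu   (runtime linked in)
//   nvcc --cudart shared -o runtime_version runtime_version.cu   (needs libcudart.so at run time)
#include <cstdio>
#include <cuda_runtime.h>

// Returns the CUDA runtime version used by this binary, encoded as
// 1000 * major + 10 * minor (for example 5050 for CUDA 5.5), or -1 on error.
int runtime_version() {
    int version = 0;
    if (cudaRuntimeGetVersion(&version) != cudaSuccess) return -1;
    return version;
}

int main() {
    int v = runtime_version();
    if (v < 0) {
        fprintf(stderr, "could not query the CUDA runtime version\n");
        return 1;
    }
    printf("CUDA runtime %d.%d\n", v / 1000, (v % 1000) / 10);
    return 0;
}
```

On Linux, running `ldd` on the statically linked binary should show no libcudart.so dependency, which is what removes the need to ship a matching shared runtime alongside each component.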


Stream Priorities Accelerate the Critical Path

No Priorities

Stream 1: Kernel A, Kernel B, Kernel C

Stream 2: Kernel X

[Timeline: when Kernel X is launched vs. when Kernel X runs]

With Priorities (especially useful when Kernel X generates data for MPI_Send())

Stream 1: Kernel A, Kernel B, Kernel C

Stream 2 (High-Priority): Kernel X

[Timeline: when Kernel X is launched vs. when Kernel X runs]
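
The two-stream scenario above can be sketched with the runtime calls `cudaDeviceGetStreamPriorityRange` and `cudaStreamCreateWithPriority`. This is a minimal illustration, not code from the slides: the kernel `spin`, the helper `make_priority_streams`, and all sizes are made up for the example. Numerically lower values mean higher priority, and on devices without priority support both ends of the range are reported as 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for real work (hypothetical): each thread repeatedly updates one element.
__global__ void spin(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Creates one lowest-priority and one highest-priority stream.
// Returns false if any CUDA call fails.
bool make_priority_streams(cudaStream_t *low, cudaStream_t *high,
                           int *least, int *greatest) {
    if (cudaDeviceGetStreamPriorityRange(least, greatest) != cudaSuccess) return false;
    if (cudaStreamCreateWithPriority(low, cudaStreamNonBlocking, *least) != cudaSuccess)
        return false;
    if (cudaStreamCreateWithPriority(high, cudaStreamNonBlocking, *greatest) != cudaSuccess) {
        cudaStreamDestroy(*low);
        return false;
    }
    return true;
}

int main() {
    const int n = 1 << 20;
    const int threads = 256, blocks = (n + threads - 1) / threads;
    float *bulk = NULL, *urgent = NULL;
    cudaStream_t low, high;
    int least = 0, greatest = 0;

    if (cudaMalloc((void **)&bulk, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc((void **)&urgent, n * sizeof(float)) != cudaSuccess ||
        !make_priority_streams(&low, &high, &least, &greatest)) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }
    cudaMemset(bulk, 0, n * sizeof(float));
    cudaMemset(urgent, 0, n * sizeof(float));
    cudaDeviceSynchronize();  // make sure the buffers are cleared before launching
    printf("priority range: least=%d greatest=%d\n", least, greatest);

    // "Kernels A, B, C": long-running work queued on the low-priority stream.
    for (int k = 0; k < 3; ++k) spin<<<blocks, threads, 0, low>>>(bulk, n, 20000);
    // "Kernel X": short critical-path work on the high-priority stream.
    spin<<<blocks, threads, 0, high>>>(urgent, n, 200);

    cudaError_t err = cudaDeviceSynchronize();
    cudaStreamDestroy(low);
    cudaStreamDestroy(high);
    cudaFree(bulk);
    cudaFree(urgent);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Priorities influence which pending thread blocks the scheduler picks next; they do not preempt blocks that are already running, so the benefit depends on the workload.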


CUDA Dynamic Parallelism (CDP)

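CPU code:

    int main() {
        float *data;
        setup(data);

        A <<< ... >>> (data);
        B <<< ... >>> (data);
        C <<< ... >>> (data);

        cudaDeviceSynchronize();
        return 0;
    }

GPU code:

    __global__ void B(float *data)
    {
        do_stuff(data);

        X <<< ... >>> (data);
        Y <<< ... >>> (data);
        Z <<< ... >>> (data);
        cudaDeviceSynchronize();

        do_more_stuff(data);
    }

[Diagram: on the CPU, main launches kernels A, B, C on the GPU; kernel B in turn launches kernels X, Y, Z]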
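
For readers who want something compilable, here is a minimal sketch of a parent kernel launching child grids. It is not the slide's code: the kernels `parent` and `child`, the helper `run_cdp`, and the build line are assumptions for illustration. It relies on the fact that a parent grid is not considered complete until its child grids finish, so a host-side `cudaDeviceSynchronize()` is enough here.

```cuda
// Assumed build line (device runtime and relocatable device code are required):
//   nvcc -arch=<sm_35 or newer> -rdc=true cdp_sketch.cu -o cdp_sketch -lcudadevrt
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child grid: adds 1 to `count` elements starting at `offset`.
__global__ void child(float *data, int offset, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[offset + i] += 1.0f;
}

// Parent grid: one thread per chunk, each launching its own child grid.
__global__ void parent(float *data, int n, int chunk) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int offset = c * chunk;
    if (offset < n) {
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        child<<<(count + 127) / 128, 128>>>(data, offset, count);
    }
}

// Runs the parent over n zero-initialized floats and returns how many
// elements ended up equal to 1.0f, or -1 on a CUDA error.
int run_cdp(int n, int chunk) {
    float *dev = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;
    cudaMemset(dev, 0, bytes);

    int chunks = (n + chunk - 1) / chunk;
    parent<<<(chunks + 31) / 32, 32>>>(dev, n, chunk);

    std::vector<float> host(n);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaDeviceSynchronize() == cudaSuccess &&
              cudaMemcpy(&host[0], dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    if (!ok) return -1;

    int hits = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] == 1.0f) ++hits;
    return hits;
}

int main() {
    int hits = run_cdp(1000, 256);
    if (hits < 0) {
        fprintf(stderr, "CUDA error while running the CDP sketch\n");
        return 1;
    }
    printf("%d of 1000 elements updated by child grids\n", hits);
    return 0;
}
```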


CUDA 5.5 Scheduling Optimizations vs 5.0

~2.5X lower child-child latency

~3.5X lower completion latency

SW scheduler parallelized

Primarily benefits smaller grids

Long dependent chains: ~50% speedup vs. 5.0 for a chain of ten 5 μs grids


[Charts: GPU Utilization % (0 to 100) over Time, Without Hyper-Q vs. With Hyper-Q]


Hyper-Q in CUDA 5.0: Streams


New in CUDA 5.5: Hyper-Q / MPI

FERMI: 1 MPI Task at a Time

KEPLER: 32 Simultaneous MPI Tasks


Multi-Process Server Required for Hyper-Q / MPI

$ mpirun -np 4 my_cuda_app

[Diagram: four CUDA MPI ranks (Rank 0 to Rank 3) are clients of one CUDA Server Process, which is connected to the GPU]

No application re-compile to share the GPU

No user configuration needed

Can be preconfigured by SysAdmin

MPI Ranks using CUDA are clients

Server spawns on-demand per user

One job per user

No isolation between MPI ranks

Exclusive process mode enforces single server

One GPU per rank

No cudaSetDevice(); only CUDA device 0 is visible


Strong Scaling of CP2K on Cray XK7

Hyper-Q with multiple MPI ranks leads to a 2.5X speedup over a single MPI rank using the GPU

Blog post by Peter Messmer of NVIDIA


Multi-user debugging with a single GPU

Nsight Visual Studio Edition (VSE) already supports this

Nsight VSE now bundled in CUDA installer

Now also supported by CUDA-GDB & Nsight Eclipse Edition (EE)

BETA feature

SM 3.5 Only

Debugger & GUI on one GPU

Multi-user debug on one GPU

2 ways to opt in:

Environment variable: CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1

CUDA-GDB command: set cuda software_preemption on

CUDA-GDB and Nsight EE now also support Dynamic Parallelism debug


Remote Debugging using Nsight EE


Automatic Performance Analysis

NEW in 5.5: Step-by-step optimization guidance


Identifying Candidate Kernels

Analysis system estimates which kernels are the best candidates for speedup

Based on execution time and achieved occupancy


Primary Performance Bound

Most likely limiter to performance for a kernel

Memory bandwidth

Compute resources

Instruction and memory latency

Primary bound should be addressed first

Often beneficial to examine secondary bounds as well


Visual Profiler: Performance Bound


Visual Profiler: Memory Efficiency


GPU per-process accounting statistics

$ nvidia-smi -q -d ACCOUNTING

==============NVSMI LOG==============

Timestamp                       : Tue Apr 16 10:36:59 2013
Driver Version                  : 319.15

Attached GPUs                   : 1
GPU 0000:06:00.0
    Accounting Mode             : Enabled
    Accounting Mode Buffer Size : 128
    Accounted Processes
        Process ID              : 28043
            GPU Utilization     : 20 %
            Memory Utilization  : 4 %
            Max memory usage    : 461 MB
            Time                : 5566 ms
        Process ID              : 28085
            GPU Utilization     : 99 %
            Memory Utilization  : 100 %
            Max memory usage    : 101 MB
            Time                : 11888 ms

Requirements

Tesla or Quadro GPU

Kepler

CUDA 5.5

Linux 32 & 64, Windows 64


CUFFT API Enhancements: Extensibility

Existing APIs (still work in 5.5)

    cufftPlanMany(…)
    /* Might recreate plan */
    cufftSetCompatibilityMode(…)
    cufftExecC2C(…)
    cufftDestroy(…)

Each new configuration option may require expensive plan re-creation

New additional APIs in 5.5

    cufftCreate(…)
    cufftSetCompatibilityMode(…)
    cufftMakePlanMany(…)
    cufftExecC2C(…)
    cufftDestroy(…)

Allows new configuration options to be employed without multiple re-plan steps
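
The create-then-make-plan sequence can be sketched as follows. This is an illustrative example under stated assumptions, not code from the slides: the helper `dc_bin_of_ones`, the transform sizes, and the build line are made up, and the optional configuration step is only marked with a comment. It transforms an all-ones signal so the result is easy to check: cuFFT transforms are unnormalized, so bin 0 should come out close to the transform length.

```cuda
// Assumed build line: nvcc cufft_makeplan.cu -o cufft_makeplan -lcufft
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Runs `batch` forward 1D C2C FFTs of length nx on all-ones input and returns
// the real part of bin 0 of the first transform, or -1 on error.
float dc_bin_of_ones(int nx, int batch) {
    std::vector<cufftComplex> host(nx * batch);
    for (size_t i = 0; i < host.size(); ++i) { host[i].x = 1.0f; host[i].y = 0.0f; }

    cufftComplex *dev = NULL;
    size_t bytes = sizeof(cufftComplex) * host.size();
    bool ok = cudaMalloc((void **)&dev, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(dev, &host[0], bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // Step 1: create an empty plan handle.
    cufftHandle plan = 0;
    bool havePlan = ok && cufftCreate(&plan) == CUFFT_SUCCESS;
    ok = havePlan;

    // Step 2: per-plan configuration calls would go here, before the plan is
    // generated, so that each option does not force a re-plan.

    // Step 3: generate the plan once; workSize reports the work-area size in bytes.
    size_t workSize = 0;
    int n[1] = { nx };
    ok = ok && cufftMakePlanMany(plan, 1, n, NULL, 1, nx, NULL, 1, nx,
                                 CUFFT_C2C, batch, &workSize) == CUFFT_SUCCESS;

    // Step 4: execute in place, then copy back (the blocking copy waits for the FFT).
    ok = ok && cufftExecC2C(plan, dev, dev, CUFFT_FORWARD) == CUFFT_SUCCESS;
    ok = ok && cudaMemcpy(&host[0], dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    if (havePlan) cufftDestroy(plan);
    cudaFree(dev);
    return ok ? host[0].x : -1.0f;
}

int main() {
    float dc = dc_bin_of_ones(256, 4);
    if (dc < 0.0f) {
        fprintf(stderr, "cuFFT sketch failed\n");
        return 1;
    }
    printf("bin 0 = %f\n", dc);
    return 0;
}
```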


CUFFT API Enhancements: Memory

Query size of workspace

    cufftEstimate1d(…), cufftEstimate2d(…), cufftEstimate3d(…), and cufftEstimateMany(…)

Helps determine if plan fits in GPU memory

Returned size is not exact in 5.5 (see docs for details)

Control scratch workspace

    cufftCreate(…)
    cufftSetAutoAllocation(…, 0)
    /* returns size of work area */
    cufftMakePlanMany(…)
    cufftSetWorkArea(…)
    cufftExecC2C(…)
    cufftDestroy(…)

Share the same workspace across executions of different plans (non-concurrently)
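
A sketch of sharing one work area between two plans is shown below. The helpers `fft_of_ones` and `shared_work_area_fft`, the sizes, and the build line are hypothetical. The example uses `cufftMakePlan1d` for brevity where the slide shows `cufftMakePlanMany`, and it assumes the auto-allocation switch is spelled `cufftSetAutoAllocation` as in the cuFFT header. The two transforms run one after the other, which is what makes sharing the buffer safe.

```cuda
// Assumed build line: nvcc cufft_workarea.cu -o cufft_workarea -lcufft
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>
#include <cufft.h>

// Transforms nx ones in place with `plan` and stores the real part of bin 0 in *dc.
static bool fft_of_ones(cufftHandle plan, int nx, float *dc) {
    std::vector<cufftComplex> host(nx);
    for (int i = 0; i < nx; ++i) { host[i].x = 1.0f; host[i].y = 0.0f; }
    cufftComplex *dev = NULL;
    size_t bytes = sizeof(cufftComplex) * nx;
    bool ok = cudaMalloc((void **)&dev, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(dev, &host[0], bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cufftExecC2C(plan, dev, dev, CUFFT_FORWARD) == CUFFT_SUCCESS;
    // The blocking copy also guarantees the FFT has finished using the work area.
    ok = ok && cudaMemcpy(&host[0], dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    if (ok) *dc = host[0].x;
    return ok;
}

// Builds two plans that use a single caller-owned work area, then runs them in turn.
bool shared_work_area_fft(int nxA, int nxB, float *dcA, float *dcB) {
    // Rough up-front estimate for the larger transform (not exact).
    size_t estimate = 0;
    if (cufftEstimate1d(std::max(nxA, nxB), CUFFT_C2C, 1, &estimate) != CUFFT_SUCCESS)
        return false;
    printf("estimated work area: %lu bytes\n", (unsigned long)estimate);

    cufftHandle planA = 0, planB = 0;
    bool haveA = cufftCreate(&planA) == CUFFT_SUCCESS;
    bool haveB = cufftCreate(&planB) == CUFFT_SUCCESS;
    bool ok = haveA && haveB;

    // Turn off automatic work-area allocation before generating the plans.
    ok = ok && cufftSetAutoAllocation(planA, 0) == CUFFT_SUCCESS;
    ok = ok && cufftSetAutoAllocation(planB, 0) == CUFFT_SUCCESS;

    // Plan generation reports how much work area each plan needs.
    size_t sizeA = 0, sizeB = 0;
    ok = ok && cufftMakePlan1d(planA, nxA, CUFFT_C2C, 1, &sizeA) == CUFFT_SUCCESS;
    ok = ok && cufftMakePlan1d(planB, nxB, CUFFT_C2C, 1, &sizeB) == CUFFT_SUCCESS;

    // One buffer sized for the larger request (at least 1 byte so it is never NULL).
    void *work = NULL;
    size_t workBytes = std::max(std::max(sizeA, sizeB), (size_t)1);
    ok = ok && cudaMalloc(&work, workBytes) == cudaSuccess;
    ok = ok && cufftSetWorkArea(planA, work) == CUFFT_SUCCESS;
    ok = ok && cufftSetWorkArea(planB, work) == CUFFT_SUCCESS;

    // Execute the plans one after the other, never concurrently.
    ok = ok && fft_of_ones(planA, nxA, dcA);
    ok = ok && fft_of_ones(planB, nxB, dcB);

    if (haveA) cufftDestroy(planA);
    if (haveB) cufftDestroy(planB);
    cudaFree(work);
    return ok;
}

int main() {
    float dcA = 0.0f, dcB = 0.0f;
    if (!shared_work_area_fft(1024, 4096, &dcA, &dcB)) {
        fprintf(stderr, "shared work-area sketch failed\n");
        return 1;
    }
    printf("bin 0: %f and %f\n", dcA, dcB);
    return 0;
}
```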


CUFFT API Enhancements: FFTW support

Easily port from FFTW to CUFFT by changing link library

Supports All Combinations of:

Single and Double Precision

C2C, R2C, C2R Transforms

FFTW Basic Interface

FFTW Advanced Interface

FFTW Guru Interface

Does Not Support:

Extended Precision

Real to Real Transforms

“Split” Memory Layout

Distributed Memory with MPI

*guru64* APIs

FFTW compatibility header file helps developers detect when unsupported FFTW APIs are used
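
As an illustration of the FFTW basic interface running on top of CUFFT, the sketch below is written the way a small FFTW program would be, except that it includes the compatibility header `cufftw.h` directly. The helper `fftw_style_dc_bin` and the build line are assumptions for this example, and it assumes the compatibility layer accepts ordinary host buffers, as FFTW code passes them.

```cuda
// Assumed build line: nvcc fftw_port.cu -o fftw_port -lcufftw -lcufft
#include <cstdio>
#include <cstdlib>
#include <cufftw.h>  // an FFTW build of the same code would include <fftw3.h> instead

// Single-precision forward C2C transform of n ones through the FFTW basic
// interface. Returns the real part of bin 0 (expected to be close to n), or -1 on error.
float fftw_style_dc_bin(int n) {
    fftwf_complex *in = (fftwf_complex *)malloc(sizeof(fftwf_complex) * n);
    fftwf_complex *out = (fftwf_complex *)malloc(sizeof(fftwf_complex) * n);
    if (!in || !out) {
        free(in);
        free(out);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) {
        in[i][0] = 1.0f;  // real part
        in[i][1] = 0.0f;  // imaginary part
    }

    float dc = -1.0f;
    fftwf_plan plan = fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    if (plan) {
        fftwf_execute(plan);
        dc = out[0][0];
        fftwf_destroy_plan(plan);
    }
    free(in);
    free(out);
    return dc;
}

int main() {
    float dc = fftw_style_dc_bin(512);
    if (dc < 0.0f) {
        fprintf(stderr, "FFTW-style transform failed\n");
        return 1;
    }
    printf("bin 0 = %f\n", dc);
    return 0;
}
```

Code that calls an FFTW API outside the supported set listed above is the case the compatibility header is meant to help detect.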


CUDA 5.5 enables new platforms

Enable 3rd party toolchains…

Compiler SDK based on LLVM (libnvvm)

Allows 3rd party ports of new languages to GPUs

Enable ARM…

Cross compilation from x86 or native ARM compilation

CUDA development tools support

SECO mITX board for HPC [photo]


OS and Compiler Support Matrix

OS / Compiler        32   64   Status      Notes
Ubuntu-12.04         X    X    New
Ubuntu-12.10         X    X    New
Ubuntu-10.04         X    X    Continued
Ubuntu-11.10         X    X    Removed
Fedora 18                 X    New
Fedora 16            X    X    Removed
RHEL-5.5+                 X    Continued
RHEL-6.X                  X    Continued
Mac OS X 10.8        X    X    Continued
Mac OS X 10.7.x      X    X    Continued
OpenSUSE-12.1             X    Continued
SLES 11 SP2               X    Continued
WinXP                X    X    Continued
Vista/Win7/Win8      X    X    Continued   Includes Win 2008 and 2012 server
VC 9.0 (VS 2008)     X    X    Continued
VC 10.0 (VS 2010)    X    X    Continued
VC 11.0 (VS 2012)    X    X    New


CUDACasts on YouTube


CUDA 5.5

Linux RPM/DEB installers

Stream Priorities

Static CUDART

Dynamic Parallelism performance improvements

MPS on Linux

Multi-user and remote debugging

New Visual Profiler guided optimization

CUFFT API Enhancements

LLVM based Compiler SDK
