Programming Models for Nvidia Accelerators

Programming Models for Nvidia Accelerators

Programming Modelsfor Accelerators(and the role NVIDIA plays)Duncan Poole, NVIDIA

DOE Funded GPU Computing Revolution20032001

Small Changes, Big Speed-upApplication CodeGPUCompute-Intensive FunctionsUse GPU to ParallelizeRest of SequentialCPU CodeCPU+

CUDA By the Numbers:>375,000,000>1,000,000>120,000>50098CUDA Capable GPUsToolkit DownloadsActive DevelopersUniversities Teaching CUDACUDA Toolkit Downloads / Hour

Programming ModelsLibrariesDirectivesLanguagesToolsManageMath, CommunicationsOpenMP, OpenACCC, C++, Fortran, PGAS – (NVVM)IDE, Debug, Profile – (CUPTI)Management, Scheduling, Monitoring (NVML)

3 Ways to Accelerate ApplicationsApplicationsLibrariesOpenACCDirectivesProgrammingLanguages“Drop-in”AccelerationEasily AccelerateApplicationsMaximumFlexibility

3 Ways to Accelerate ApplicationsApplicationsLibrariesOpenACCDirectivesProgrammingLanguages“Drop-in”AccelerationCUDA Libraries areinteroperable with OpenACCEasily AccelerateApplicationsMaximumFlexibility

3 Ways to Accelerate ApplicationsApplicationsLibraries“Drop-in”AccelerationOpenACCDirectivesEasily AccelerateApplicationsCUDA Languages areinteroperable with OpenACC,too!ProgrammingLanguagesMaximumFlexibility

3 Ways to Accelerate ApplicationsApplicationsLibrariesOpenACCDirectivesProgrammingLanguages“Drop-in”AccelerationEasily AccelerateApplicationsMaximumFlexibility

Easy, High-Quality AccelerationEase of use:“Drop-in”:Quality:Performance:Using libraries enables GPU acceleration without in-depthknowledge of GPU programmingMany GPU-accelerated libraries follow standard APIs, thusenabling acceleration with minimal code changesLibraries offer high-quality implementations of functionsencountered in a broad range of applicationsNVIDIA libraries are tuned by experts

GPU Accelerated Libraries“Drop-in” Acceleration for your ApplicationsNVIDIA cuBLAS NVIDIA cuRAND NVIDIA cuSPARSE NVIDIA NPPVector SignalImage ProcessingGPU AcceleratedLinear AlgebraMatrix Algebra onGPU and MulticoreNVIDIA cuFFTIMSL LibraryBuilding-blockAlgorithms for CUDASparse LinearAlgebraC++ STL Featuresfor CUDA

3 Ways to Accelerate ApplicationsApplicationsLibrariesOpenACCDirectivesProgrammingLanguages“Drop-in”AccelerationEasily AccelerateApplicationsMaximumFlexibility

Programming LanguagesNumerical analyticsFortranCC++PythonC#MATLAB, Mathematica, LabVIEWOpenACC, CUDA FortranOpenACC, CUDA CThrust, CUDA C++PyCUDAGPU.NET

What is CUDA?CUDA ArchitectureExpose GPU parallelism for general-purpose computingRetain performanceCUDA C/C++Based on industry-standard C/C++Small set of extensions to enable heterogeneous programmingStraightforward APIs to manage devices, memory etc.

Standard C CodeCUDA CParallel C Codevoid saxpy(int n, float a,float *x, float *y){for (int i = 0; i < n; ++i)y[i] = a*x[i] + y[i];}int N = 1

3 Ways to Accelerate ApplicationsApplicationsLibrariesOpenACCDirectivesProgrammingLanguages“Drop-in”AccelerationEasily AccelerateApplicationsMaximumFlexibility

OpenACCThe Standard for GPU DirectivesSimple:Directives are the easy path to accelerate computeintensive applicationsOpen:OpenACC is an open GPU directives standard, making GPUprogramming straightforward and portable across paralleland multi-core processorsPowerful: GPU Directives allow complete access to the massiveparallel power of a GPU

High-level BenefitsCreate high-level heterogeneous programsWithout explicit accelerator initializationWithout explicit data or program transfers between host and acceleratorCompiler directives to specify parallel regions in C & FortranOffload parallel regionsPortable across OSes, host CPUs, accelerators, and compilers

High-level… with low-level accessProgramming model allows programmers to start simpleCompiler gives additional guidanceLoop mappings, data location, and other performance detailsCompatible with other GPU languages and librariesInteroperate between CUDA C/Fortran and GPU librariese.g. CUFFT, CUBLAS, CUSPARSE, etc.

3 OpenACC PartnersSingle binary (X86 and GPU) for OpenACCGreat cross platform support on X86 and Windows/Mac and Linux15 day free trial with OpenACCOpenACC cross platform storyIntel MIC and AMD GPU support via HMPP comingSupports any CPU Compiler (ICC)Pro-Active consultant business modelComplete vendor solution: CRAY OpenACC + CRAY HardwareCUDA GDB and Allinea debugger for OpenACC

OpenACC DirectivesCPUGPUSimple Compiler hintsProgram myscience... serial code ...!$acc kernelsdo k = 1,n1do i = 1,n2... parallel code ...enddoenddo!$acc end kernels...End Program myscienceYour originalFortran or C codeOpenACCCompilerHintCompiler Parallelizes codeWorks on many-core GPUs &multicore CPUsCan you mix CUDA and OpenACC?Yes, you can even use CUDA to managememory

Familiar to OpenMP ProgrammersOpenMPOpenACCCPUCPUGPUmain() {double pi = 0.0; long i;main() {double pi = 0.0; long i;#pragma omp parallel for reduction(+:pi)for (i=0; i

A Very Simple Exercise: SAXPYSAXPY in CSAXPY in Fortranvoid saxpy(int n,float a,float *x,float *restrict y){#pragma acc kernelsfor (int i = 0; i < n; ++i)y[i] = a*x[i] + y[i];}...// Perform SAXPY on 1M elementssaxpy(1

5x in 40 Hours 2x in 4 Hours 5x in 8 Hours“Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. Themost important thing is avoiding restructuring of existing code for production applications.”-- Developer at the Global Manufacturer of Navigation SystemsDirectives: Easy & PowerfulReal-Time ObjectDetectionGlobal Manufacturer of NavigationSystemsValuation of Stock Portfoliosusing Monte CarloGlobal Technology Consulting CompanyInteraction of Solvents andBiomoleculesUniversity of Texas at San Antonio

OpenACC Tools EcosystemDebuggersAllinea DDT, CUDAdbg, on CrayAllinea partial support on CAPSProfilersCudaprof works todayLibrariesAlso work underway with other CUPTI partnersExisting CUDA libraries will workpoint developers to SDK examples (New www)called outside the accelerator regionin accelerator data regions

Impact of CUDA 5

CUDA 5Application Acceleration Made EasierDynamic ParallelismSpawn new parallel work from within GPU code on GK110GPU Object LinkingLibraries and plug-ins for GPU codeNew Nsight Eclipse EditionDevelop, Debug, and Optimize… All in one tool!GPUDirectRDMA between GPUs and PCIe devices

Dynamic ParallelismCPU Fermi GPU CPU Kepler GPUFermi: Only CPU can generate GPU workKepler: GPU can generate work for itself

CUDA 5: Separate Compilation & Linkinga.cub.cuc.cumain.cpp+ program.exea.ob.oc.oSeparate compilation allows building independent object filesCUDA 5 can link multiple object files into one program

CUDA 5: Separate Compilation &… ab.culiba.o+b.oprogram.exeprogram2.exeCan also combine object files into static librariesLink and externally call device codeFacilitates code reuse, reduces compile time

NVIDIA GPUDirect now supports RDMASystemMemoryGDDR5MemoryGDDR5MemoryGDDR5MemoryGDDR5MemorySystemMemoryCPUGPU1GPU2GPU1GPU2CPUPCI-ePCI-eNetworkCardNetworkNetworkCardServer 1Server 2RDMA: Remote Direct Memory Access between any GPUs in your cluster

Peer-to-Peer TransfersGPU DirectOpen MPIMVAPICHPlatform MPIGPU-Aware MPI LibrariesIntegrated Support for GPU Computing

NVIDIA ® Nsight , Eclipse EditionCUDA-Aware EditorAutomated CPU to GPU code refactoringSemantic highlighting of CUDA codeIntegrated code samples & docsNsight DebuggerSimultaneously debug of CPU and GPUInspect variables across CUDA threadsUse breakpoints & single-step debuggingNsight ProfilerQuickly identifies performance issuesIntegrated expert systemSource line correlationAvailable for Linux and Mac OS

CUDA Compiler Contributed to Open Source LLVMDevelopers want to buildfront-ends forJava, Python, R, DSLsTarget other processors likeARM, FPGA, GPUs, x86CUDAC, C++, FortranNVIDIAGPUsLLVM CompilerFor CUDAx86CPUsNew LanguageSupportNew ProcessorSupport

CUDA 5 and OpenACC

OpenACC 2.0Procedure CallsWithout InliningSeparate CompilationAllows for device librariesEnabled by CUDA5 linkingNested ParallelismLaunch more paralleloperations without hostreturnProgram stays on deviceEnabled by CUDA5 + Kepler2Unstructured Data RegionOpenACC 1.0 supportsstructured data (enter attop, exit at bottom)Programs want dynamicmalloc/free of data ondevice

OpenACC 2.0 ContinuedFloat** in C/C++for rectangular arraysClarifications to 1.0 specpromote uniformimplementationsOpenACC test suitePromote uniformimplementationsOther items in discussionMultiple device supportProfiling and other toolinterfacesArray ReductionsDeep copy for pointerbased data structures

OpenACC MembershipBylaws model Khronos, OpenMPAdded low-cost Academic Membership ClassFounding members: CAPS, Cray, PGI, NVIDIANew Members: Allinea, Georgia Tech, ORNL, TU-Dresden,University of HoustonOthers welcome!

OpenACC and OpenMPMembers of OpenACC are (almost) all members of OpenMPWhich means, members will ultimately support both standards, until convergenceQ3 meetings will set roadmaps forDraft OpenMP 4.0 features, subcommittees decide what goes in draftDraft OpenACC 2.0 featuresOpenACC strengthsLeadership in features that extract GPGPU performanceGap in features does not appear to be closing

OpenACC at Supercomputing 2012Organizer Type TitleVetter ++ Tutorial Scalable Heterogeneous Computing on GPU ClustersCray Tutorial Productive, Portable Performance on Accelerators Using OpenACC Compilersand ToolsPGI Tutorial Introduction to GPU Computing with OpenACCPGI Tutorial Advanced GPU Computing with OpenACCCrayBroadEngagementAn Effective Standard for Developing Performance Portable Applications forFuture Hybrid SystemsLevesque ++ Paper Hybridizing S3D into an Exascale Application using OpenACCFarber ++ BOF OpenACC API Status and FutureCAPSNV/PGI/AMD/IntelExhibitorForumBOFAdvanced Programming of Many-Core Systems Using CAPS OpenACC CompilerHigh-level Programming Models for Computing Using AcceleratorsNvidia Confidential

NVIDIA ConferencesSupercomputing (SC12) 10 th -16 th 2012Salt Lake City, UtahNVIDIA will help promote partner ecosystemLet us know if you have K20 DemosGPU Technology Conference (GTC13) 18 th -21 st 2013San Jose, CaliforniaCall for Submissions ends October 3 rd

Thank you!

More magazines by this user
Similar magazines