Introducing CUDA 5

gputechconf.com

Mark Harris
Chief Technologist, GPU Computing
© 2012 NVIDIA


CUDA Parallel Computing Platform
www.nvidia.com/getcuda

Programming Approaches
  Libraries: “Drop-in” Acceleration
  OpenACC Directives: Easily Accelerate Apps
  Programming Languages: Maximum Flexibility

Development Environment
  Nsight IDE: Linux, Mac and Windows; GPU Debugging and Profiling
  CUDA-GDB debugger
  NVIDIA Visual Profiler

Open Compiler Tool Chain
  Enables compiling new languages to CUDA platform, and CUDA languages to other architectures

Hardware Capabilities
  SMX, Dynamic Parallelism, HyperQ, GPUDirect


CUDA 5: Application Acceleration Made Easier

Dynamic Parallelism
  Spawn new parallel work from within GPU code on GK110
GPU-callable Libraries
  Libraries and plug-ins for GPU code
New Nsight Eclipse Edition
  Develop, Debug, and Optimize… All in one tool!
GPUDirect
  RDMA between GPUs and PCIe devices


What is CUDA Dynamic Parallelism?

The ability for any GPU thread to launch a parallel GPU kernel
  Dynamically - based on run-time data
  Simultaneously - from multiple threads at once
  Independently - each thread can launch a different grid

[Diagram: CPU and GPU, shown for Fermi and for Kepler]
Fermi: Only CPU can generate GPU work
Kepler: GPU can generate work for itself
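The three properties above can be illustrated with a minimal sketch. It is not taken from the slides: the kernel names `parent` and `child`, the host helper `run_segments`, and the segment layout are all hypothetical, chosen only for illustration. Each parent thread reads a count at run time and launches its own child grid sized to that count, so launches are data-dependent, come from many threads at once, and differ per thread.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Child kernel (hypothetical): each thread increments one element of a segment.
__global__ void child(int *segment, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) segment[i] += 1;
}

// Parent kernel (hypothetical): every thread launches its own child grid,
// sized from run-time data rather than fixed by the host.
__global__ void parent(int *data, const int *offsets, const int *counts,
                       int numSegments)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSegments) return;

    int count = counts[s];
    if (count == 0) return;              // no work for this segment, no launch

    int threads = 64;
    int blocks = (count + threads - 1) / threads;
    child<<<blocks, threads>>>(data + offsets[s], count);
}

// Host helper (hypothetical): builds segments of the given sizes, runs the
// parent kernel, and returns the data. Returns an empty vector on a CUDA error.
std::vector<int> run_segments(const std::vector<int> &counts)
{
    const int n = static_cast<int>(counts.size());
    std::vector<int> offsets(n);
    int total = 0;
    for (int s = 0; s < n; ++s) { offsets[s] = total; total += counts[s]; }

    std::vector<int> result(total, 0);
    if (total == 0) return result;

    int *dData = nullptr, *dOffsets = nullptr, *dCounts = nullptr;
    cudaMalloc(&dData, total * sizeof(int));
    cudaMalloc(&dOffsets, n * sizeof(int));
    cudaMalloc(&dCounts, n * sizeof(int));
    cudaMemset(dData, 0, total * sizeof(int));
    cudaMemcpy(dOffsets, offsets.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCounts, counts.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    parent<<<(n + 63) / 64, 64>>>(dData, dOffsets, dCounts, n);

    // The parent grid is not complete until all of its child grids have
    // finished, so one host-side synchronize covers both levels.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(result.data(), dData, total * sizeof(int),
                         cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) result.clear();

    cudaFree(dData);
    cudaFree(dOffsets);
    cudaFree(dCounts);
    return result;
}
```

Building a sketch like this needs relocatable device code and the device runtime library, for example `nvcc -arch=sm_35 -rdc=true demo.cu -lcudadevrt` (the file name is hypothetical), and a GPU of compute capability 3.5 or higher. Calling `run_segments({3, 0, 130})` from `main` should launch two child grids of different sizes and skip the empty segment.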


Dynamic Parallelism
[Diagram: CPU with Fermi GPU compared to CPU with Kepler GPU]


Dynamic Work Generation

Initial Grid
Fixed Grid: Statically assign conservative worst-case grid
Dynamic Grid: Dynamically assign performance where accuracy is required


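Familiar Syntax and Programming Model

CPU code:

    int main() {
        float *data;
        setup(data);

        // Launch kernels from the host as usual
        A<<< ... >>>(data);
        B<<< ... >>>(data);
        C<<< ... >>>(data);

        cudaDeviceSynchronize();
        return 0;
    }

GPU code:

    __global__ void B(float *data)
    {
        do_stuff(data);

        // Launch kernels from the GPU with the same syntax
        X<<< ... >>>(data);
        Y<<< ... >>>(data);
        Z<<< ... >>>(data);
        cudaDeviceSynchronize();

        do_more_stuff(data);
    }

[Diagram: main on the CPU launches A, B and C on the GPU; B launches X, Y and Z]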


Before CUDA 5: Whole-Program Compilation

[Diagram: a.cu, b.cu and c.cu are included into all.cu; main.cpp + all.cu build program.exe]

Include files together to build
Previous CUDA releases required a single source file for each kernel
Linking with external device code was not supported


CUDA 5: Separate Compilation & Linking

[Diagram: a.cu, b.cu and c.cu compile to a.o, b.o and c.o; main.cpp + a.o, b.o, c.o link into program.exe]

Separate compilation allows building independent object files
CUDA 5 can link multiple object files into one program


CUDA 5: GPU Callable Libraries

[Diagram: a.cu and b.cu compile to a.o and b.o; a.o + b.o form ab.a; main.cpp + kernel() + ab.a link into program.exe]

Can combine object files into static libraries
Link and externally call device code


CUDA 5: GPU Callable Libraries

[Diagram: a.cu and b.cu compile to a.o + b.o, combined into ab.a; main.cpp + foo() + ab.a link into program.exe; main2.cpp + bar() + ab.a link into program2.exe]

Facilitates code reuse, reduces compile time


CUDA 5: Callbacks

Enables closed-source device libraries to call user-defined device callback functions

[Diagram: main.cpp, foo.cu (kernel() and callback()) and vendor.culib link into program.exe]


GPU-Callable Libraries

Direct Access to BLAS and other Libraries from GPU Code

[Diagram: main2.cpp + bar() + CUBLAS link into program2.exe]


NVIDIA® Nsight, Eclipse Edition
CUDA-aware editor

• Semantic highlighting of CUDA code makes it easy to differentiate GPU code from CPU code
• Generate code faster with CUDA-aware auto code completion and inline help
• Easily port CPU loops to CUDA kernels with automatic code refactoring
• Integrated CUDA samples make it quick and easy to get started
• Hyperlink navigation enables faster code browsing
• Supports automatic makefile generation


NVIDIA® Nsight, Eclipse Edition
Nsight Debugger

• Seamless and simultaneous debugging of both CPU and GPU code
• View program variables across several CUDA threads
• Examine execution state and mapping of the kernels and GPUs
• View, navigate and filter to selectively track execution across threads
• Set breakpoints and single-step execution at both source-code and assembly levels
• Includes CUDA-MEMCHECK to help detect memory errors


NVIDIA® Nsight, Eclipse Edition
Nsight Profiler

• Easily identify performance bottlenecks using a unified CPU and GPU trace of application activity
• Expert analysis system pinpoints potential optimization opportunities
• Highlights potential performance problems at specific source lines within application kernels
• Close integration with Nsight editor and builder for fast edit-build-profile optimization cycle
• Integrates with the new nvprof command-line profiler to enable visualization of profile data collected on headless compute nodes


NVIDIA® GPUDirect Support for RDMA

Direct Communication Between GPUs and PCIe devices

[Diagram: Server 1 and Server 2, each with a CPU and system memory, GPU1 and GPU2 with GDDR5 memory, and a network card, all attached over PCI-e; the two network cards are connected through the network]


GPUDirect enables GPU-aware MPI

    // Send: stage the device buffer through a host buffer
    cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);

    // Receive: land in a host buffer, then copy to the device
    MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

Simplifies to

    MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(for CPU and GPU buffers)


New Online CUDA Resource Center

All the Latest Information, All in One Place
  Programming Guides
  API Reference
  Library Manuals
  Code Samples
  Tools Manuals
  Platform Specs

Over 1600 Files!
http://docs.nvidia.com


CUDA Open Compiler Tool Chain

Developers want to build front-ends for Java, Python, R, DSLs
Target other processors like ARM, FPGA, GPUs, x86

[Diagram: CUDA C, C++ and Fortran, plus New Language Support, feed the LLVM Compiler for CUDA, which targets NVIDIA GPUs and x86 CPUs, plus New Processor Support]


Thank you!

Download CUDA 5
http://www.nvidia.com/getcuda


Extra Slides


Device Linker Invocation

Introduction of an optional link step for device code

    nvcc -arch=sm_20 -dc a.cu b.cu
    nvcc -arch=sm_20 -dlink a.o b.o -o link.o
    g++ a.o b.o link.o -L<path> -lcudart

Link device-runtime library for dynamic parallelism

    nvcc -arch=sm_35 -dc a.cu b.cu
    nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
    g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart

Currently, link occurs at cubin level (PTX not yet supported)


Simpler Code: LU Example

LU decomposition (Fermi)

  CPU Code:

    dgetrf(N, N) {
      for j=1 to N
        for i=1 to 64
          idamax
          memcpy
          dswap
          memcpy
          dscal
          dger
        next i
        memcpy
        dlaswap
        dtrsm
        dgemm
      next j
    }

  GPU Code:

    idamax();
    dswap();
    dscal();
    dger();
    dlaswap();
    dtrsm();
    dgemm();

LU decomposition (Kepler)

  CPU Code (CPU is Free):

    dgetrf(N, N) {
      dgetrf
      synchronize();
    }

  GPU Code:

    dgetrf(N, N) {
      for j=1 to N
        for i=1 to 64
          idamax
          dswap
          dscal
          dger
        next i
        dlaswap
        dtrsm
        dgemm
      next j
    }


CUDA Dynamic Parallelism

  Data-Dependent Execution
  Recursive Parallel Algorithms
  GPU-Side Kernel Launch
  Efficiency
  Batching to Help Fill GPU
  Dynamic Load Balancing
  Simplify CPU/GPU Divide
  Library Calls from Kernels
