Compiling CUDA and Other Languages for GPUs - GTC 2012



Compiling CUDA and Other Languages for GPUs
Vinod Grover and Yuan Lin


Agenda
• Vision
• Compiler Architecture
• Scenarios
• SDK Components
• Roadmap
• Deep Dive
• SDK Samples
• Demos


Vision
• Build a platform for GPU computing around the foundations of CUDA.
— Bring other languages to GPUs
— Enable CUDA for other platforms
• Make that platform available for ISVs, researchers, and hobbyists
— Create a flourishing ecosystem
[Diagram: CUDA C/C++ plus new language support feed the LLVM compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs, and new processor support]


Compiler Architecture


CUDA Compiler Architecture (4.1, 4.2 and 5.0)
[Diagram: NVCC pipeline — .cu → CUDA FE → LLVM Optimizer → NVPTX CodeGen → PTX → PTXAS; host code goes to the Host Compiler; execution uses the CUDA Runtime and CUDA Driver]


Open Compiler Architecture
[Diagram: NVCC pipeline — .cu → CUDA FE (libCUDA.LANG) → NVVM IR → libNVVM (LLVM Optimizer + NVPTX CodeGen, open sourced) → PTX → PTXAS; host code goes to the Host Compiler; execution uses the CUDA Runtime and CUDA Driver]


Scenarios


Building Production Quality Compilers
[Diagram: NVCC (via libCUDA.LANG), CUDA Fortran, and OpenACC front ends all target libNVVM and the CUDA Runtime]


Building Domain Specific Languages
[Diagram: a DSL front end targets libNVVM and a DSL runtime layered on the CUDA Runtime, alongside NVCC and libCUDA.LANG]


Enabling Other Platforms
[Diagram: libCUDA.LANG output fed to an x86 LLVM backend and an x86 CUDA runtime, alongside libNVVM and the CUDA Runtime]


Enabling Research in GPU Computing
[Diagram: a modified Clang ("CLANG'") front end for CU++ targets libNVVM, an x86 LLVM backend, and a custom runtime]


Compiler SDK Components
• Open source NVPTX backend
• NVVM IR specification
• libNVVM
• libCUDA.LANG


Enabling open source GPU compilers
• Contribute NVPTX code generator sources back to LLVM trunk
• Sources available directly from LLVM!
— Standard LLVM license
• Invite the community to:
— Use it!
— Contribute improvements to it!


NVVM IR
• An intermediate representation based on LLVM and targeted for GPU computing
• The IR specification describes:
— Address spaces
— Kernels and device functions
— Intrinsics
— … more


libNVVM
• A binary component library for Windows, Linux, and Mac OS targeting PTX 3.1
— Fermi, Kepler, and later architectures
— 32-bit and 64-bit hosts
— NVVM IR verifier
• API document
• Sample applications


libCUDA.LANG
• A binary library that takes a CUDA source program and generates NVVM IR and a host C++ program for supported platforms
— API document
• Sample applications


Roadmap


Availability
• Open source NVPTX backend
— May 2012
• NVVM IR spec, libNVVM, samples
— Preview release (May 2012)
— http://developer.nvidia.com/cuda-llvm-compiler
— Production release (next release after CUDA 5.0)
• libCUDA.LANG
— To be announced


Deep Dive


Deep Dive
• NVVM IR
• libNVVM
• SDK Samples


NVVM IR
• Designed to represent GPU kernel functions and device functions
— Represents the code executed by each CUDA thread
• NVVM IR Specification 1.0


NVVM IR and LLVM IR
• NVVM IR
— Based on LLVM IR
— With a set of rules and intrinsics
• No new types. No new operators. No new reserved words.
• An NVVM IR program works with any standard LLVM IR tool:
  llvm-as, llvm-link, llvm-extract, llvm-dis, llvm-ar, …
• An NVVM IR program can be built with the standard LLVM distribution:
  svn co http://llvm.org/svn/llvm-project/llvm/branches/release_30 llvm


NVVM IR


NVVM IR: Address Spaces


NVVM IR: Address Spaces
• CUDA C/C++
— Address space is a storage qualifier.
— A pointer is a generic pointer, which can point to any address space.

  __device__   int g;   // (the slide's __global__ qualifier corrected: variables use __device__)
  __shared__   int s;
  __constant__ int c;

  void foo(int a) {
    int l;
    int *p;
    switch (a) {
    case 1: p = &g; …
    case 2: p = &s; …
    case 3: p = &c; …
    case 4: p = &l; …
    }
    …
  }


NVVM IR: Address Spaces
• OpenCL C
— Address space is part of the type system.
— A pointer type must be qualified with an address space.

  global   int g;
  constant int c;

  void foo(local int *ps) {
    int l;
    int *p;
    p = &l;
    global int *pg = &g;
    constant int *pc = &c;
    …
  }


NVVM IR: Address Spaces
• CUDA C/C++: address space is a storage qualifier; pointers are generic.
• OpenCL C: address space is part of the type system; pointer types are qualified.
• NVVM IR
— Supports both in the same program.


NVVM IR: Address Spaces
• Define address space numbers
• Allow generic pointers and specific pointers
• Provide intrinsics to perform conversions between generic pointers and specific pointers
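A minimal sketch of what this looks like in IR (not from the slides; address-space numbers and the intrinsic name follow the conventions of the NVVM IR specification, where 0 = generic, 1 = global, 3 = shared, 4 = constant, 5 = local, in LLVM 3.0-era syntax):

```llvm
; a variable in the global address space (1)
@g = internal addrspace(1) global i32 0

define i32 @load_g() {
  ; convert the specific (global) pointer to a generic pointer,
  ; then load through the generic pointer
  %p = call i32* @llvm.nvvm.ptr.global.to.gen.p0i32.p1i32(i32 addrspace(1)* @g)
  %v = load i32* %p
  ret i32 %v
}

declare i32* @llvm.nvvm.ptr.global.to.gen.p0i32.p1i32(i32 addrspace(1)*)
```

Specific pointers keep their address space in the type (`addrspace(1)`), so the backend can emit the cheaper space-specific load when the space is known, and fall back to generic accesses otherwise.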


NVVM IR: GPU Program Properties
• Properties:
— Maximum/minimum expected CTA size for any launch
— Minimum number of CTAs on an SM
— Kernel function vs. device function
— Texture/surface variables
— … more
• Use named metadata
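A sketch of the named-metadata encoding (not from the slides; the `nvvm.annotations` name and the `"kernel"`/`"maxntidx"` property keys follow the NVVM IR specification, shown in LLVM 3.0-era metadata syntax):

```llvm
define void @vadd(float* %a) {
  ret void
}

; mark @vadd as a kernel entry point, and bound its launch to 256 threads in x
!nvvm.annotations = !{!0, !1}
!0 = metadata !{void (float*)* @vadd, metadata !"kernel", i32 1}
!1 = metadata !{void (float*)* @vadd, metadata !"maxntidx", i32 256}
```

Because these are ordinary named metadata, no new IR constructs are needed and standard LLVM tools pass them through untouched.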


NVVM IR: Intrinsics
• Atomic operations
• Barriers
• Address space conversions
• Special register reads
• Texture/surface access
• … more
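As a sketch of the special-register and barrier intrinsics in use (not from the slides; intrinsic names are from the NVVM intrinsic families, computing the usual per-thread global index):

```llvm
define void @step(float* %a) {
  ; read the tid.x / ntid.x / ctaid.x special registers
  %tid  = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ntid = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %bid  = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  ; i = ctaid.x * ntid.x + tid.x
  %t0 = mul i32 %bid, %ntid
  %i  = add i32 %t0, %tid
  ; ... use %i to index %a ...
  ; CTA-wide barrier (the equivalent of __syncthreads())
  call void @llvm.nvvm.barrier0()
  ret void
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
declare void @llvm.nvvm.barrier0()
```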


NVVM IR: NVVM ABI for PTX
• Allows interoperability of PTX code generated by different NVVM compilers:
— How NVVM linkage types are mapped to PTX linkage types
— How NVVM data types are mapped to PTX data types
— How function calls are lowered to PTX calls


libNVVM Library
libNVVM: an optimizing compiler library (NVVM IR → PTX)
• Address space access optimization
• Thread convergence analyses
• Re-materialization
• Load/store coalescing
• Sign extension elimination
• Inline asm (PTX) support
• Tuning of existing optimizations
• New phi elimination


libNVVM Library
[Diagram: the CUDA C compiler, DSL compilers, tools, and your compiler all target libNVVM, an optimizing compiler library (NVVM IR → PTX)]


libNVVM APIs
• Start up and shut down
• Create a compilation unit
— Add NVVM IR modules to a compilation unit.
— Support NVVM IR level linking.
• Verify the input IR
• Compile to PTX
• Get the result
— Get back the PTX string
— Get back the message log


Samples


SDK Sample 1: Simple
• Read in an NVVM IR program in LL or BC format from disk
• Generate PTX
• Execute the PTX on the GPU

  // Read in
  fread(source, size, 1, fh);

  // Compile
  nvvmInit();
  nvvmCreateCU(&cu);
  nvvmCUAddModule(cu, source, size);
  nvvmCompileCU(cu, num_options, options);
  nvvmGetCompiledResultSize(cu, &ptxsize);
  nvvmGetCompiledResult(cu, ptx);
  nvvmDestroyCU(&cu);
  nvvmFini();

  // Execute
  ...
  cuModuleLoadDataEx(phModule, ptx, 0, 0, 0);
  cuModuleGetFunction(phKernel, *phModule, "simple");
  cuLaunchGrid(hKernel, nBlocks, 1);


SDK Sample 1: Simple
[Diagram: IR from file → libNVVM (NVVM IR → PTX) → PTX JIT and execution on GPU]


SDK Sample 2: ptxGen
• Read in a list of NVVM IR programs in LL or BC format from disk
• Perform IR level linking

  for (i=0; i


SDK Sample 2: ptxGen
[Diagram: IR from file → libNVVM (IR verifier, IR level linking, NVVM IR → PTX)]


SDK Sample 3: Kaleidoscope
• Based on the "Kaleidoscope" example in the LLVM tutorial
• An interpreter for a simple language
— Build expressions using primitive operators and user-defined functions
• Can evaluate a function on a GPU over a sequence of arguments
• Uses the LLVM IR builder
— Statically linked with stock LLVM 3.0 libLLVMCore.a and libLLVMBitWriter.a
— Dynamically linked with libnvvm.so and libcuda.so


SDK Sample 3: Kaleidoscope

  ready> def foo(x)
  ready>   4*x;
  ready> eval foo 1 2 3 4;
  ready> Evaluating foo on a gpu:
  Using CUDA Device [0]: GeForce GTX 480
  4.000000 8.000000 12.000000 16.000000
  ready> def bar(a b)
  ready>   foo(a)+b;
  ready> eval bar 1 2 3 4;
  ready> Evaluating bar on a gpu:
  Using CUDA Device [0]: GeForce GTX 480
  6.000000 16.000000


SDK Sample 3: Kaleidoscope
[Diagram: interpreter → LLVM IR Builder → libNVVM (NVVM IR → PTX) → PTX JIT and execution on GPU]


SDK Sample 4: Glang
• A prototype compiler based on Clang
• Compiles a subset of C++ programs to execute on GPUs
• Has its own set of builtin functions


  glang_kernel kernel_vector_add(__global__ const float* a,
                                 __global__ const float* b,
                                 __global__ float* c,
                                 unsigned int size) {
    // Determine thread/block id/size
    int tidX  = __glang_tid_x();
    int bidX  = __glang_ctaid_x();
    int bsize = __glang_ntid_x();

    // Compute offset into vector
    int index = bidX*bsize + tidX;

    // Compute addition for our element
    if (index < size) {
      c[index] = a[index] + b[index];
    }
  }


SDK Samples
[Diagram: clang → libNVVM (IR level linking, user builtin functions, NVVM IR → PTX) → PTX JIT and execution on GPU]


SDK Sample 5: Rg
• R is a language and environment for statistical computing and graphics — www.r-project.org
• Dynamically compile R code and execute it on a GPU
— Supports a useful subset of R
• An example of how to accelerate a DSL using libNVVM


SDK Sample 5: Rg

  > v1 = c(1, 2, 3, 4)
  > dv1 = rg.dv(c(1, 2, 3, 4))
  > dv2 = rg.dv(c(10, 20, 30, 40))
  > dv3 = rg.gapply(function(x, y) { x+y; }, dv1, dv2)
  > as.double(dv3)
  [1] 11 22 33 44


SDK Sample 5: Rg

  # define the code for Mandelbrot
  mandelbrot


Demo


SDK Sample 5: Rg
[Diagram: R → LLVM IR Builder → libNVVM (NVVM IR → PTX) → PTX JIT and execution on GPU]


SDK Samples
[Diagram: interpreter, R, IR from file, and clang front ends → LLVM IR Builder → libNVVM (IR verifier, IR level linking, user builtin functions, NVVM IR → PTX) → PTX JIT and execution on GPU]


How to get it?
• http://developer.nvidia.com/cuda-llvm-compiler
• Open to all NVIDIA registered developers
• Available with the CUDA 5.0 preview release
• File bugs and post questions to the NVIDIA forum
— Tag 'NVVM'


Backup slides


SDK Sample 3: Kaleidoscope
["ks" interpreter]


SDK Sample 4: Glang
[Diagram: glang compiles a.glang into a.glang.cpp and a.glang.hpp; together with host.cpp these build a.out, linked against libglangrt.so and libcuda.so]


SDK Sample 5: Rg (Overview)
• Create R objects as proxies for GPU vector data
— The Rg runtime handles data creation and transfers to and from the GPU
• Write scalar R functions
• Use rg.gapply to map the scalar function to the vector data
— Dynamically create code for the scalar function.
— Generated code is specialized to the runtime type of the respective vector elements.
— Launch the generated code on the GPU and create an R object which is a proxy for the result.


SDK Sample 5: Rg (Compiler)
• rg.gapply(f, v1, v2, …, vn)
— Compiles f and launches the generated code on vector arguments v1, …, vn already resident on the GPU.
• v1, …, vn are atomic vectors of element types t1, …, tn discovered at runtime.
• Creates type-specialized LLVM IR for f with signature f: t1 x … x tn -> T, where T is the inferred type of the body of f.
