
PGI® Accelerator Compilers for x64+GPU Systems

Dave Norton
The Portland Group
Dave.Norton@pgroup.com

Cambridge UK-GPU Conference
December 14, 2010


• C, C++, F2003 compilers
• Optimizing, Vectorizing, Parallelizing
• Graphical parallel debugger, profiler
• AMD & Intel, 32 & 64-bit, SSE & AVX
• PGI Unified Binary technology
• Linux, MacOS, Windows
• Visual Studio integration on Windows
• CUDA Fortran for NVIDIA GPUs
• CUDA C for both x86 and NVIDIA
• PGI Accelerator Programming Model

www.pgroup.com


Emerging Cluster Node Architecture

Commodity Multicore x86 + Commodity Manycore GPUs
4 – 48 CPU Cores
240 – 1920 GPU/Accelerator Cores


Abstracted x64+Fermi Architecture

[Block diagram: x64 host cores and host memory connected over PCIe to a Fermi-class GPU with its own device memory]


CUDA Fortran VADD Host Code

subroutine vadd( A, B, C )
  use cudafor
  use kmod
  real(4), dimension(:) :: A, B
  real(4), pinned, dimension(:) :: C
  real(4), device, allocatable :: Ad(:), Bd(:), Cd(:)
  integer :: N
  N = size( A, 1 )
  allocate( Ad(N), Bd(N), Cd(N) )
  Ad = A(1:N)
  Bd = B(1:N)
  call vaddkernel<<<(N+31)/32,32>>>( Ad, Bd, Cd, N )
  C(1:N) = Cd
  deallocate( Ad, Bd, Cd )
end subroutine


CUDA Fortran VADD Device Code

module kmod
  use cudafor
contains
  attributes(global) subroutine vaddkernel( A, B, C, N )
    real(4), device :: A(N), B(N), C(N)
    integer, value :: N
    integer :: i
    i = (blockidx%x-1)*32 + threadidx%x
    if( i <= N ) C(i) = A(i) + B(i)
  end subroutine
end module
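For reference, a minimal host driver that ties the two slides together might look like the sketch below. The program name, array size, and initial values are illustrative and not part of the original slides; the interface block mirrors the declarations in the vadd subroutine so the assumed-shape arguments can be passed.

program test_vadd
  implicit none
  interface
    subroutine vadd( A, B, C )
      real(4), dimension(:) :: A, B
      real(4), pinned, dimension(:) :: C
    end subroutine
  end interface
  integer, parameter :: N = 100000
  real(4) :: A(N), B(N)
  real(4), pinned, allocatable :: C(:)
  allocate( C(N) )
  A = 1.0
  B = 2.0
  call vadd( A, B, C )    ! device allocation, copies, and kernel launch happen inside vadd
  print *, 'max error:', maxval( abs( C - 3.0 ) )
  deallocate( C )
end program

With the PGI compilers of this era the pieces would typically be built together as CUDA Fortran, e.g. with the -Mcuda flag or by giving the source files a .cuf suffix.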


3 Aspects of GPU Programming

1. Split code between Host and GPU
   • CUDA and OpenCL – function level, done manually by the programmer
   • Modern Compilers – can do this just as well as you can, and a lot faster, and
     enable offloading of regions within functions

2. Manage data allocation/movement between Host and Device
   • CUDA and OpenCL – do this manually with API calls, one or more per argument
     to the device kernel; host code nearly unrecognizable compared to the original
   • Modern Compilers – can do this almost as well as you can; user-driven tuning is
     required, but can and should be quick and easy

3. Tune Device Kernels
   • CUDA and OpenCL – this step is both time-consuming and difficult; must
     optimize grid/thread geometry, optimize memory placement/accesses, etc.
   • Modern Compilers – can help a little here and make the code portable, but this
     step is probably always going to be hard


Explicit programming (CUDA) vs. implicit programming (directives)

CUDA:
+ Good performance with hand-tuned kernels
+ Incremental porting to GPU
- Not portable to non-CUDA platforms
- Requires maintaining two sets of code

Directives:
+ Good performance possible
+ Incremental porting to GPU
+ Portable to non-CUDA platforms, including x64
+ Requires only a single source code
- Obscurity in what the compiler is actually doing
- "Best practices" not clearly established – more data
  from users, vendors, and platforms needed


PGI Accelerator Programming Model

• Built on lessons from 30 years of experience with vector machines
  and 20 years of experience with SMP programming
• Directives to offload compute kernels to a GPU, manage data
  movement between host and GPU, and map loop parallelism onto a GPU
• Fortran 2003 and C99 today, eventually C++
• Programs remain 100% standard compliant and portable to other
  compilers and HW
• Incremental porting/tuning of applications to x64+GPU
• Designed to enable development of applications that are performance
  portable to multiple types of accelerators


PGI Accelerator Program Execution Model

Host
• executes most of the program
• allocates accelerator memory
• initiates data copy from host memory to accelerator
• sends kernel code to accelerator
• queues kernels for execution on accelerator
• waits for kernel completion
• initiates data copy from accelerator to host memory
• deallocates accelerator memory

Accelerator
• executes kernels, one after another
• concurrently, may transfer data between host and accelerator


Accelerating an Application

• Given that the app meets the constraints discussed, simply surround
  the region to be accelerated with directives:

    !$acc region
      ...
    !$acc end region

• Compile for the GPU:

    pgfortran -fast -Minfo=accel -ta=nvidia

• Produces accelerated kernels with correct data movement
• Compiler feedback important for tuning
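As a concrete illustration of this pattern, a minimal sketch in Fortran (the routine name, array names, and loop body are illustrative, not taken from the slides):

subroutine scale2d( a, r, m, n )
  real(4) :: a(m,n), r(m,n)
  integer :: m, n, i, j
!$acc region
  do j = 1, n
    do i = 1, m
      r(i,j) = 2.0 * a(i,j)    ! the compiler generates a GPU kernel for this loop nest
    enddo
  enddo
!$acc end region
end subroutine

Compiled with pgfortran -fast -Minfo=accel -ta=nvidia, the -Minfo=accel output reports how each loop was scheduled and what data movement was generated, in the style shown on the feedback slides later in the deck.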


PGI Directive-based Matrix Multiply for x64+GPU

C code:

void
computeMM(float C[][WB], float A[][WA],
          float B[][WB], int hA,
          int wA, int wB)
{
  int i, j, k;
  #pragma acc region
  {
    for (i = 0; i < hA; ++i) {
      for (j = 0; j < wB; ++j) {
        C[i][j] = 0.0;
      }
      for (k = 0; k < wA; ++k) {
        for (j = 0; j < wB; ++j) {
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }
  }
}

• PGI 2010 automatically generates code for NVIDIA GPUs
• Generated code takes corner cases into account
• Block dimension chosen by the compiler is 16x16 threads
• Each thread of a given block computes one point in the 'C' output matrix
• No use of shared memory for A & B accesses
• Code needs to be structured to enable "cache" accesses to A & B
• Single binary contains optimized versions for both multicore and GPU


void saxpy (float a,
            float *restrict x,
            float *restrict y, int n)
{
  #pragma acc region
  {
    for (int i=1; i<n; i++)
      x[i] = a*x[i] + y[i];
  }
}


Refinements: Loop Schedules

Accelerator kernel generated
    26, #pragma acc for parallel, vector(16)
    27, #pragma acc for parallel, vector(16)

• vector loops correspond to threadidx indices
• parallel loops correspond to blockidx indices
• this schedule corresponds to a CUDA launch with 16x16 thread blocks
  over a 2-D grid of blocks
• Compiler strip-mines to protect against very long loop limits, generates
  clean-up code for arbitrary loop bounds, etc.
• Syntax supports any legal CUDA schedule
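A schedule can also be requested explicitly with loop directives instead of relying on the compiler's choice. A hedged sketch in the PGI Accelerator Fortran syntax (loop bounds, array names, and the particular mapping are illustrative):

subroutine scale_interior( a, r, m, n )
  real(4) :: a(m,n), r(m,n)
  integer :: m, n, i, j
!$acc region
!$acc do parallel, vector(16)      ! outer loop: typically reported as blockIdx.y / threadIdx.y
  do j = 2, n-1
!$acc do parallel, vector(16)      ! inner loop: typically reported as blockIdx.x / threadIdx.x
    do i = 2, m-1
      r(i,j) = a(i,j) * 2.0
    enddo
  enddo
!$acc end region
end subroutine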


Refinements: Data Motion

!$acc region copyin(a(1:m,1:n)), copyout(r)
  do j = 2,n-1          ! Update interior points
    do i = 2,m-1
      r(i,j) = a(i,j) * 2.0
    enddo
  enddo
!$acc end region

• copyin changes the default copyin procedure
• copyout changes the default copyout procedure
• By default, the compiler moves only data that is used
• E.g. for computing on non-halo regions, it is more efficient to move the
  entire array rather than have the compiler generate a move for each
  column/row
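The same clauses cover device-only temporaries. A hedged sketch (routine and array names illustrative) using the local clause from the clause table later in the deck, so a scratch array is allocated on the GPU but never copied across the PCIe bus:

subroutine scale_and_shift( a, r, m, n )
  real(4) :: a(m,n), r(m,n), tmp(m,n)
  integer :: m, n, i, j
!$acc region copyin(a(1:m,1:n)), copyout(r(1:m,1:n)), local(tmp)
  do j = 1, n
    do i = 1, m
      tmp(i,j) = a(i,j) * 2.0    ! tmp exists only in GPU memory
      r(i,j)   = tmp(i,j) + 1.0
    enddo
  enddo
!$acc end region
end subroutine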


Refinements: Leaving Data on GPU

subroutine magma (A, B)
  real(4), dimension(:,:) :: A, B
!$acc reflected (A,B)

• reflected is a new directive in PGI 11.0
• reflected requires visibility in the caller (module or interface block)
• Instructs the compiler that the data is already on the GPU
• Starts to help the programmer work around the "no stack pointer" issue
• Replaces trying to get the compiler to inline called routines
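On the caller side, the arrays would typically be placed on the GPU with a data region before the call. A hedged sketch of the intended usage (the surrounding routine names and computation are illustrative, and the data-region syntax is from the PGI Accelerator model specification rather than this slide):

module solvers
contains
  subroutine magma( A, B )
    real(4), dimension(:,:) :: A, B
    integer :: i, j
!$acc reflected (A,B)
!$acc region
    do j = 1, size(A,2)
      do i = 1, size(A,1)
        A(i,j) = A(i,j) + B(i,j)   ! no copies generated here; device copies already exist
      enddo
    enddo
!$acc end region
  end subroutine
end module

subroutine driver( A, B, nsteps )
  use solvers
  real(4), dimension(:,:) :: A, B
  integer :: nsteps, it
!$acc data region copy(A), copyin(B)   ! move data to the GPU once
  do it = 1, nsteps
    call magma( A, B )                 ! data stays resident across calls
  enddo
!$acc end data region
end subroutine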


Compiler-to-User Feedback

% pgfortran -fast -ta=nvidia -Minfo=accel mm.f
...
62, Loop is parallelizable
64, Loop carried dependence of 'C' prevents parallelization
    Loop carried backward dependence of 'C' prevents vectorization
66, Loop is parallelizable
    Accelerator kernel generated
    62, #pragma acc for parallel, vector(16) /* blockIdx.y threadIdx.y */
    64, #pragma acc for seq(16)
        Cached references to size [16x16] block of 'A'
        Cached references to size [16x16] block of 'B'
    66, #pragma acc for parallel, vector(16) /* blockIdx.x threadIdx.x */
        Using register for 'C'
    CC 1.3 : 27 registers; 2264 shared, 24 constant,
        0 local memory bytes; 50% occupancy
...
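For orientation, the source loops this feedback refers to would look roughly like the sketch below. This is a hypothetical reconstruction; the original mm.f is not included in the slides, and the line numbers in the comments are only indicative of how the messages line up with the loop nest:

!$acc region
      do i = 1, n          ! ~line 62: parallelizable -> blockIdx.y / threadIdx.y
        do k = 1, n        ! ~line 64: carries a dependence on c -> run sequentially
          do j = 1, n      ! ~line 66: parallelizable -> blockIdx.x / threadIdx.x
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
          enddo
        enddo
      enddo
!$acc end region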


Optional Region Clauses for Tuning Data Allocation and Movement

Clause              Scope
if (cond)           region
copy (list)         region, declaration
copyin (list)       region, declaration
copyout (list)      region, declaration
local (list)        region, declaration
mirror (list)       region, declaration (Fortran)
reflected (list)    declaration (Fortran)

See www.pgroup.com/accelerate for a complete specification
of the PGI Accelerator programming model and directives
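A hedged example of how a few of these clauses combine on one region (routine, array names, and the threshold are illustrative): the if clause keeps small problem sizes on the host, where kernel launch and copy overhead would dominate.

subroutine update( a, r, m, n )
  real(4) :: a(m,n), r(m,n)
  integer :: m, n, i, j
!$acc region if(m*n > 100000), copyin(a), copyout(r)
  do j = 1, n
    do i = 1, m
      r(i,j) = a(i,j) * 2.0    ! runs on the GPU only when the if() condition holds
    enddo
  enddo
!$acc end region
end subroutine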


Optional Loop Directive Clauses for Tuning Kernel Schedules

Clause              Scope
host [(width)]      loop
parallel [(width)]  loop
seq [(width)]       loop
vector [(width)]    loop
private (list)      loop
kernel              loop
unroll (width)      loop
cache (list)        loop

See www.pgroup.com/accelerate for a complete specification
of the PGI Accelerator programming model and directives
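A hedged sketch showing two of these loop clauses in use (routine and array names illustrative): private gives each outer-loop iteration its own copy of a scalar accumulator, and seq forces the inner reduction loop to run sequentially inside the kernel.

subroutine matvec( a, x, y, n )
  real(4) :: a(n,n), x(n), y(n), t
  integer :: n, i, k
!$acc region
!$acc do parallel, vector(16), private(t)
  do i = 1, n
    t = 0.0
!$acc do seq
    do k = 1, n
      t = t + a(i,k) * x(k)    ! per-thread accumulator
    enddo
    y(i) = t
  enddo
!$acc end region
end subroutine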


PGI Accelerator vs CUDA

The PGI Accelerator programming model is a high-level implicit
programming model for x64+GPU systems, similar to OpenMP
for multi-core x64:

• Offload compute-intensive loops and code regions using simple
  compiler directives
• Directives are Fortran comments and C pragmas; programs
  remain 100% standard-compliant and portable
• Makes GPGPU programming and optimization incremental and
  accessible to application domain experts
• Supported in both the PGI F2003 and PGCC C99 compilers


Reference Materials

PGI Accelerator programming model – supported for x64+NVIDIA
targets in the PGI Fortran 95/03 and C99 compilers
• http://www.pgroup.com/lit/whitepapers/pgi_accel_prog_model_1.3.pdf

CUDA Fortran – supported on NVIDIA GPUs in the PGI Fortran 95/03 compiler
• http://www.pgroup.com/lit/whitepapers/pgicudaforug.pdf

Understanding the CUDA Data Parallel Threading Model
• http://www.pgroup.com/lit/articles/insider/v2n1a5.htm


Copyright Notice

© Contents copyright 2010, The Portland Group, Inc. This material
may not be reproduced in any manner without the expressed written
permission of The Portland Group.

PGFORTRAN, PGF95, PGI Accelerator and PGI Unified Binary are trademarks; and PGI, PGCC, PGC++, PGI Visual
Fortran, PVF, PGI CDK, Cluster Development Kit, PGPROF, PGDBG, and The Portland Group are registered trademarks
of The Portland Group Incorporated. Other brands and names are the property of their respective owners.
