The GPU Computing Revolution - London Mathematical Society

A KNOWLEDGE TRANSFER REPORT FROM THE LMS AND THE KTN FOR INDUSTRIAL MATHEMATICS

Processing Elements (which it calls ‘threads’) and Compute Units (which it calls ‘thread blocks’), but generally it is considered to be fairly straightforward to port code between CUDA C and similar programming models, such as OpenCL.

CUDA threads execute independently and thus ideally on independent data; this is why data parallelism is such a natural fit for these kinds of architectures. Threads in a thread block are grouped to execute essentially the same program at the same time, but on different pieces of data. This data-parallel approach is known as Single Instruction Multiple Data (SIMD) [38]. On the other hand, different thread blocks may execute completely different programs from one another if the need arises, although most applications tend to run the same program on all the thread blocks at the same time, essentially turning the computation into one large SIMD or vector calculation.
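As a concrete illustration of this data-parallel style, here is a minimal CUDA C sketch (not taken from the report; the kernel name vector_add and its arguments are hypothetical) that adds two arrays element by element. Every thread in every thread block runs the same kernel, and each thread uses its block and thread indices to select the single element it is responsible for.

```
// Minimal sketch of a data-parallel CUDA kernel: each thread handles one
// array element, so the grid as a whole performs the addition in SIMD style.
__global__ void vector_add(const float *a, const float *b, float *c, int n)
{
    // Global index of this thread: its position within its thread block,
    // offset by the position of that block within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)          // guard against the final, partially filled block
        c[i] = a[i] + b[i];
}
```

A launch such as vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n) would request enough 256-thread blocks to cover all n elements; the same kernel runs unchanged with other block sizes, which is one reason this style maps well onto GPUs with different numbers of Compute Units.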
CUDA’s maturity brings a number of benefits for software developers, including a growing range of software development tools such as debuggers and profilers [97]. In 2009 CUDA Fortran arrived, a joint development between NVIDIA and the Portland Group [95]. CUDA Fortran takes the principles of CUDA C and weaves them into a state-of-the-art commercial Fortran 2003 compiler.

OpenCL

OpenCL bears many similarities to CUDA; indeed, NVIDIA is one of the main contributors to the OpenCL standard, so this should be no surprise. The biggest differences lie in the way OpenCL is being developed. Whereas CUDA is a proprietary solution driven by a single vendor, OpenCL is an open standard, instigated by Apple but now driven by a consortium of over 35 companies, including all the major processor vendors such as Intel, IBM and AMD. The consortium is organised and run by the Khronos Group [67].

OpenCL is a much more recent development than CUDA and is correspondingly less mature. However, OpenCL also includes a number of more recent advances for supporting heterogeneous computing in systems combining multiple CPUs and GPUs. The first version of the OpenCL standard was released in December 2008 [66], and OpenCL has been developing rapidly since then. It has already been integrated into recent versions of Apple’s OS X operating system. AMD and NVIDIA have released implementations for their GPUs, the former also including a version that will run on a multi-core x86 host CPU. IBM has demonstrated a version of OpenCL running on its Cell processor and recently released a version for its high-end POWER architecture [56]. Intel released its first OpenCL implementation for its multi-core x86 CPUs in late 2010 [27, 60]. Embedded processor companies are also developing their own OpenCL solutions, including ARM [11, 99], Imagination Technologies [57] and ZiiLabs [122]. These last three companies provide the CPUs and GPUs in most of the popular consumer electronics gadgets, such as smartphones and portable MP3 players.

While OpenCL is less mature than NVIDIA’s CUDA and has some of the drawbacks of committee-designed standards, its benefits are the openness of the standard, the vast resources being ploughed into its development by many companies and, most importantly, its cross-platform capabilities.

OpenCL is quite a low-level solution. It exposes features that many software developers may not have had to deal with before. CUDA has similar features, but includes a higher-level application programmer interface (API) that conveniently handles much of the low-level detail; in OpenCL this is all left to the programmer. One example is the explicit use of queues for sending commands such as ‘run this kernel’ from the host processor to the many-core GPU (the sketch at the end of this section shows how CUDA hides this detail). It is expected that as OpenCL matures, various solutions will emerge to abstract away this lower-level detail, leaving most programmers to operate at a higher level. An interface providing this facility for C++ programs has already been developed.

One of OpenCL’s other important characteristics is that it has been designed to support heterogeneous computing from Day One; that is, it supports running code simultaneously on multiple, different kinds of processors, all within a single OpenCL program. When re-engineering software this is an important consideration: adopting a programming environment that supports a wide range of heterogeneous parallel hardware will give developers the greatest flexibility when deploying their re-engineered codes in the future. For example, an OpenCL program could decide to run one task on one of the host processor cores while running another task on a many-core GPU, and do all of this in parallel. These multiple OpenCL tasks can easily coordinate between themselves, passing data and signals from one to the other. Because almost all processors will
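To make the contrast with OpenCL’s explicit command queues concrete, here is a hedged host-side sketch in CUDA C (again not from the report; the array names, sizes and initial values are invented) that launches the vector_add kernel shown earlier. The memory copies and the kernel launch are simply issued to CUDA’s implicit default stream, whereas an OpenCL program would first create a platform, context and command queue and then enqueue each of these operations on that queue itself.

```
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// The data-parallel kernel from the earlier sketch, repeated here so that
// this example compiles on its own.
__global__ void vector_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                  // one million elements (arbitrary)
    const size_t bytes = n * sizeof(float);

    // Host-side input and output arrays.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device-side buffers.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    // Each call below is queued on the implicit default stream: the runtime
    // sends the 'copy this buffer' and 'run this kernel' commands to the GPU
    // without the programmer creating or managing a queue object, which is
    // the part OpenCL leaves to the programmer.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // blocks until the queued work has finished

    printf("c[0] = %f\n", h_c[0]);          // expect 3.000000

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

In OpenCL the equivalent sequence would involve clCreateCommandQueue, clEnqueueWriteBuffer, clEnqueueNDRangeKernel and clEnqueueReadBuffer, all operating on an explicitly created command queue.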
