The GPU Computing Revolution
From Multi-Core CPUs to Many-Core Graphics Processors
A Knowledge Transfer Report from the London Mathematical Society and Knowledge Transfer Network for Industrial Mathematics
By Simon McIntosh-Smith


Copyright © 2011 by Simon McIntosh-Smith

Front cover image credits:
Top left: Umberto Shtanzman / Shutterstock.com
Top right: godrick / Shutterstock.com
Bottom left: Double Negative Visual Effects
Bottom right: University of Bristol
Background: Serg64 / Shutterstock.com


THE GPU COMPUTING REVOLUTION
From Multi-Core CPUs To Many-Core Graphics Processors
By Simon McIntosh-Smith

Contents
Executive Summary
From Multi-Core to Many-Core: Background and Development
Success Stories
GPUs in Depth
Current Challenges
Next Steps
Appendix 1: Active Researchers and Practitioner Groups
Appendix 2: Software Applications Available on GPUs
References

September 2011
A Knowledge Transfer Report from the London Mathematical Society and the Knowledge Transfer Network for Industrial Mathematics
Edited by Robert Leese and Tom Melham
London Mathematical Society, De Morgan House, 57–58 Russell Square, London WC1B 4HS
KTN for Industrial Mathematics, Surrey Technology Centre, Surrey Research Park, Guildford GU2 7YG


AUTHOR

Simon McIntosh-Smith is head of the Microelectronics Research Group at the University of Bristol and chair of the Many-Core and Reconfigurable Supercomputing Conference (MRSC), Europe's largest conference dedicated to the use of massively parallel computer architectures. Prior to joining the university he spent fifteen years in industry, where he designed massively parallel hardware and software at companies such as Inmos, STMicroelectronics and Pixelfusion, before co-founding ClearSpeed as Vice-President of Architecture and Applications. In 2006 ClearSpeed's many-core processors enabled the creation of one of the fastest and most energy-efficient supercomputers: TSUBAME at Tokyo Tech. He has given many invited talks on how massively parallel, heterogeneous processors are revolutionising high-performance computing, and has published numerous papers on how to exploit the speed of graphics processors for scientific applications. He holds eight patents in parallel hardware and software and sits on the steering and programme committees of most of the well-known international high-performance computing conferences.

www.cs.bris.ac.uk/home/simonm/
simonm@cs.bris.ac.uk


Executive Summary

Computer architectures are undergoing their most radical change in a decade. In the past, processor performance has been improved largely by increasing clock speed — the faster the clock speed, the faster a processor can execute instructions, and thus the greater the performance that is delivered to the end user. This drive to greater and greater clock speeds has stopped, and indeed in some cases is actually reversing. Instead, computer architects are designing processors that include multiple cores that run in parallel. For the purposes of this report we shall think of a core as one independent 'brain', able to execute its own stream of instructions, independently fetching the data it requires and writing its own results back to a main memory. This shift from increasing clock speeds to increasing core counts — and thus increasing parallelism — is presenting significant new opportunities and challenges.

During this architectural shift a new class of many-core computer architectures is emerging from the graphics market into the mainstream. This new class of processor already exploits hundreds or even thousands of cores. Many-core processors potentially offer an order of magnitude greater performance than conventional processors, a significant increase that, if harnessed, can deliver major competitive advantages.

But to take advantage of these powerful new many-core processors, most users will have to radically redesign their software, potentially needing completely new algorithms that can exploit massive parallelism.

The many-core opportunity

For computationally challenging problems in business and scientific research, these new, highly parallel, many-core architectures represent a technological breakthrough that can deliver significant benefits in science and business. Such a step-change in available computing performance could enable faster financial trades, higher-resolution simulations, and the design of more competitive products such as increasingly aerodynamic cars and more effective drugs for the treatment of disease.

Thus new many-core processors represent one of the biggest advances in computing in recent history. They also indicate the direction of all future processor design — we cannot avoid this major paradigm shift into massive parallelism. All processors from all the major vendors will become many-core designs over the next few years. Whether this is many identical mainstream CPU cores or a heterogeneous mix of different kinds of core remains to be seen, but it is inevitable that there will be lots of cores in each processor and, importantly, to harness their performance, most software and algorithms will need redesigning.

There is significant competitive advantage to be gained for those who can embrace and exploit this new, many-core technology, and considerable risk for any who are unable to adapt to a many-core approach. This report explains the background to these developments, presents several success stories, and describes a number of ways to get started with many-core technologies.


From Multi-Core to Many-Core: Background and Development

Historically, advances in computer hardware have delivered performance improvements that have been transparent to the end user — software has run more quickly on newer hardware without having to make any changes to that software. We have enjoyed decades of 'single thread' performance improvement, which relied on converting exponential increases in available transistors (known as Moore's Law [83]) into increasing processor clock speeds and advances in microprocessor pipelining, instruction-level parallelism and out-of-order execution. All these technology developments are essentially 'under the hood', with existing software generally performing faster on next-generation hardware without programmer intervention. Most of these developments have now hit a wall.

In 2003 it started to become apparent that several fundamental physical limitations were going to change everything. Herb Sutter noticed this significant change in his widely cited 2004 essay 'The Free Lunch is Over' [110]. Figure 1 is taken from Sutter's paper, updated in 2009, and shows how single-thread performance increases, based on increasing clock speed and instruction-level parallelism improvements, could only have continued at the cost of significant increases in processor power consumption. Something had to change — the free lunch was over.

Figure 1: Graph of four key technology trends from 'The Free Lunch is Over': transistors per processor, clock speed, power consumption and instruction level parallelism [110].

The answer was, and is, parallelism. It transpires that, for various semiconductor physics-related reasons, a processor containing two 3GHz cores will use significantly less energy (and thus dissipate less waste heat) than an alternative processor containing a single 6GHz core. Yet in theory the dual-core 3GHz device can deliver the same performance as the single 6GHz core (much more on this later). Today even laptop processors contain at least two cores, with most mainstream server CPUs already containing four or six cores. Thus, due to the power consumption problem, increasing core counts have replaced increasing clock speeds as the main method of delivering greater hardware performance. (For more details see Box 1 later.)

Of course the important implication of 'the free lunch is over' is that most existing software will not just go faster when we increase the number of cores inside a processor, unless you are in the fortunate position that your performance-critical software is already 'multi-core aware'.

So, hardware is suddenly and dramatically changing: it is delivering ever more performance, but is now doing so through rapidly increasing parallelism.


Software will therefore need to change radically in order to exploit these new parallel architectures. This shift to parallelism represents one of the largest challenges that software developers have faced. Modifying existing software to make it parallel is often (but not always) difficult. Now might be a good time to re-engineer many older codes from scratch.

Since hardware designers gave up on trying to maintain the free lunch, incredible advances in computer hardware design have been made. While mainstream CPUs have been embracing parallelism relatively slowly, as the majority of software takes time to change to this new model, other kinds of processors have exploited parallelism much more rapidly, becoming massively parallel and delivering large performance improvements as a result.

The most significant class of processor to have fully embraced massive parallelism are graphics processors, often called Graphics Processing Units or GPUs. Contemporary GPUs first appeared in the 1990s and were originally designed to run the 2D and 3D graphical operations used by the burgeoning computer games market. GPUs started as fixed-function devices, designed to excel at graphics-specific functions. GPUs reached an inflection point in 2002, when the main graphics standards, OpenGL [68] and DirectX [82], started to require general-purpose programmability in order to deliver advanced special effects for the latest computer games. The mass-market forces driving the development of GPUs (over 525 million graphics processors were sold in 2010 [63]) drove rapid increases in GPU performance and programmability. GPUs quickly evolved from their fixed-function origins into fully-programmable, massively parallel processors (see Figure 2). Soon many developers were trying to exploit these low-cost yet incredibly high-performance GPUs to solve non-graphics problems. The term 'General-Purpose computation on Graphics Processing Units', or GPGPU, was coined by Mark Harris in 2002 [51], and since then GPUs have continued to become more and more programmable.

Figure 2: The evolution of GPUs from fixed function pipelines on the left, to fully programmable arrays of simple processor cores on the right [69].

Today, GPUs represent the pinnacle of the many-core hardware design philosophy. A modern GPU will consist of hundreds or even thousands of fairly simple processor cores. This degree of parallelism on a single processor is usually called 'many-core' in preference to 'multi-core' — the latter term is typically applied to processors with at most a few dozen cores.

It is not just GPUs that are pushing the development of parallel processing. Led by Moore's Law, mainstream processors are also on an inexorable advance towards many-core designs.


AMD started shipping a 12-core mainstream x86 CPU codenamed Magny-Cours in 2010 [10], while in Spring 2011 Intel launched their 10-core Westmere EX x86 CPU [102]. Some server vendors are using these latest multi-core x86 CPUs to deliver machines with 24 or even 48 cores in a single 1U 'pizza box' server. Early in 2010 Intel announced a 48-core x86 prototype processor which they described as a 'cloud on a chip' [61]. These many-core mainstream processors are probably the future of server CPUs, as they are well suited to large-scale, highly parallel workloads of the sort performed by the Googles and Amazons of the internet industry.

But perhaps the most revealing technology trend is that many of us are already carrying heterogeneous many-core processors in our pockets. This is because smartphones have quickly adopted this new technology to deliver the best performance possible while maintaining energy efficiency.

It is the penetration of many-core architectures into all three predominant processor markets — consumer electronics, personal computers and servers — that demonstrates the inevitability of this trend. We now have to pay for our lunch and embrace explicit parallelism in our software but, if embraced fully, exciting new hardware platforms can deliver breakthroughs in performance and energy efficiency.

Box 1: The Power Equation

Processor power consumption is the primary hardware issue driving the development of many-core architectures. Processor power consumption, P, depends on a number of factors, which can be described by the power equation,

P = ACV²f,

where A is the activity of the basic building blocks, or gates, in the processor, C is the total capacitance of the gates' outputs, V is the processor's voltage and f is its clock frequency.

Since power consumption is proportional to the square of the processor's voltage, one can immediately see that reducing voltage is one of the main methods for keeping power under control. In the past voltage has continually been reducing, most recently to just under 1.0 volt. However, in the types of CMOS process from which most modern silicon chips are made, voltage has a lower limit of about 0.7 volts. If we try to turn voltage down lower than this, the transistors that make up the device simply do not turn on and off properly.

Revisiting the power equation, one can further see that, as we build ever more complex processors using exponentially increasing numbers of transistors, the total gate capacitance C will continue to increase. If we have lost the ability to compensate for this increase by reducing V then there are only two variables left we can play with: A and f, and it is likely that both will have to reduce.

The many-core architecture trend is an inevitable outcome of this power equation: Moore's Law gives us ever more transistors, pushing up C, but CMOS transistors have a lower limit to V which we have now reached, and thus the only effective way to harness all of these transistors is to use more and more cores at lower clock frequencies.

For more detail on why power consumption is driving hardware trends towards many-core designs we refer the reader to two IEEE Computer articles: Mudge's 2001 'Power: a first-class architectural design constraint' [85] and Woo and Lee's 'Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era' from 2008 [121].
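As a worked illustration of the trade-off described in Box 1, the short C program below evaluates P = ACV²f for a single 6GHz core and for a pair of 3GHz cores running at a lower voltage. The activity, capacitance and voltage figures are round, made-up values chosen only to make the arithmetic visible; they are not measurements of any real processor.

#include <stdio.h>

/* Dynamic power of one core: P = A * C * V^2 * f */
static double core_power(double A, double C, double V, double f)
{
    return A * C * V * V * f;
}

int main(void)
{
    /* Illustrative values only, not real processor data */
    const double A = 0.1;     /* average gate activity          */
    const double C = 1.0e-8;  /* total switched capacitance (F) */

    /* A single core at 6 GHz typically needs a higher voltage to reach that frequency */
    double p_single = core_power(A, C, 1.2, 6.0e9);

    /* Two cores at 3 GHz can run at a lower voltage */
    double p_dual = 2.0 * core_power(A, C, 1.0, 3.0e9);

    /* With these numbers: roughly 8.6 W versus 6.0 W for the same nominal throughput */
    printf("single 6 GHz core : %.2f W\n", p_single);
    printf("two 3 GHz cores   : %.2f W\n", p_dual);
    return 0;
}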


Success Stories

Before we analyse a few many-core success stories, it is important to address some of the hype around GPU acceleration. As is often the case with a new technology, there have been lots of inflated claims of performance speedups, some even claiming speedups of hundreds of times compared to regular CPUs. These claims almost never stand up to serious scrutiny and usually contain at least one serious flaw. Common problems include: optimising the GPU code much more than the serial code; running on a single host core rather than all of the host cores at the same time; not using the best compiler optimisation flags; not using the vector instruction set on the CPU (SSE or AVX); using older, slower CPUs; etc. But if we just compare the raw capabilities of the GPU with the CPU then typically there is an order of magnitude advantage for the GPU — not two or three orders of magnitude. If you see speedup claims of greater than about a factor of ten then be suspicious. A genuine speedup of a factor of ten is of course a significant achievement and is more than enough to make it worthwhile investing in GPU solutions.

There is one more common issue to consider, and it is to do with the size of the problem being computed. Very parallel systems such as many-core GPUs achieve their best performance when solving problems that contain a correspondingly high degree of parallelism, enough to give every core enough work so that they can operate at close to their peak efficiency for long periods. For example, when accelerating linear algebra, one may need to be processing matrices of the order of thousands of elements in each dimension to get the best performance from a GPU, whereas on a CPU one may only need matrix dimensions of the order of hundreds of elements to achieve close to their best performance (see Box 2 for more on linear algebra). This discrepancy can lead to apples-to-oranges comparisons, where a GPU system is benchmarked on one size of problem while a CPU system is benchmarked on a different size of problem. In some cases the GPU system may even need a larger problem size than is really warranted to achieve its best performance, again resulting in flawed performance claims.

With these warnings aside, let us look at some genuine success stories in the areas of computational fluid dynamics for the aerospace industry, molecular docking for the pharmaceutical industry, options pricing for the financial services industry, special effects rendering for the creative industries, and data mining for large-scale electronic commerce.

Computational fluid dynamics on GPUs: OP2

Finding solutions of the Navier-Stokes equations in two and three dimensions is an important task for mathematically modelling and analysing flows in fluids, such as one might find when considering the aerodynamics of a new car or when analysing how airborne pollution is blown around the tall buildings of a modern city. Computational fluid dynamics (CFD) is the discipline of using computer-based models for solving the Navier-Stokes equations. CFD is a powerful and widely used tool which also happens to be computationally very expensive. GPU computing has thus been of great interest to the CFD community.
Prof. Mike Giles at the University of Oxford is one of the leading developers of CFD methods that use unstructured grids to solve challenging engineering problems. This work has led to two significant software projects: OPlus, a code designed in collaboration with Rolls-Royce to run on clusters of commodity processors utilising message passing [28], and more recently OP2, which is being designed to utilise the latest many-core architectures [42].

OP2 is an example of a very interesting class of application which aims to enable the user to work at a high level of abstraction while delivering high performance. It achieves this by providing a framework for implementing efficient unstructured grid applications. Developers write their code in a familiar programming language such as C, C++ or Fortran, specifying the important features of their unstructured grid CFD problem at a high level. These features include the nodes, edges and faces of the unstructured grid, flow variables, the mappings from edges to nodes, and parallel loops. From this information OP2 can automatically generate optimised code for a specific target architecture, such as a GPU using the new parallel programming languages CUDA or OpenCL, or multi-core CPUs using OpenMP and vectorisation for their SIMD instruction sets (AVX and SSE) [58]. We will cover all of these approaches to parallelism in more detail later in this report.


At the time of writing, OP2 is still a work in progress, in collaboration with Prof. Paul Kelly at Imperial College, but early results are already impressive. Using an airfoil test code with a single-precision 2D Euler equation, a problem with 0.75 million cells and 1.5 million edges has been solved in 7.4 seconds using a single NVIDIA C2070 GPU. Double-precision performance was roughly half this, taking 15.4 seconds on the same problem. When OP2 generates similarly optimised code for a pair of contemporary Intel 'Westmere' 6-core 2.67GHz X5650 CPUs, the same problem takes 34.2 seconds to solve in single precision and 44.6 seconds in double precision. One GPU has therefore delivered impressive relative speedups of 4.6X and 2.9X for single and double precision respectively, compared to 12 top-of-the-range CPU cores. OP2 is being developed as an open source project and potential users can contact Prof. Giles via the project webpage [44].

Other CFD examples — Another successful project using GPUs for CFD has been Turbostream, developed by Tobias Brandvik and Dr Graham Pullan in the Whittle Laboratory at the University of Cambridge [18, 19]. In contrast to OP2's unstructured grid approach, Turbostream solves the Navier-Stokes equations by discretising the problem onto a structured grid and then performing explicit time-stepping to solve the resulting partial differential equations (PDEs) using a three-dimensional stencil operation. Turbostream is another example of an approach that enables the programmer to work at a higher level of abstraction: rather than programmers having to write complex, low-level code themselves, the programmer specifies the required stencil operations in a high-level description from which Turbostream automatically generates optimised code targeting the latest GPUs and multi-core CPUs.

Turbostream has achieved impressive performance improvements, with speedups in the range of 10–20X on the latest NVIDIA GPUs compared with optimised code running on a fast quad-core Intel CPU. Additionally, Turbostream has shown good scalability, using up to 64 GPUs at once. Turbostream is available to license and the reader is referred to the project website for more information [20].

Others have also been achieving great success in using GPUs to accelerate CFD problems. BAE Systems has announced it is using a GPU-accelerated cluster to deliver near-interactive speeds for their CFD simulations [37]. A team at Stanford University was recently successful in modelling hypersonic airflows using a GPU-based system [33].

These early successes indicate that CFD is an application area that should see increasing benefit from GPUs in the future; for additional information see the CFD Online website [24].

Molecular docking using GPUs: BUDE

The University of Bristol has been at the forefront of molecular docking for the past decade [41]. Molecular docking is the process of simulating the interaction between two molecules, where one molecule is typically a protein of medical interest in the human body, and the second represents a potential new drug.

The Bristol University Docking Engine (BUDE) was originally developed to run on a commodity cluster of x86-based CPUs, but as early as 2006 it was ported to one of the first generation of many-core processors, the CSX architecture from ClearSpeed [75].
Figure 3: GPUs have enabled BAE Systems to investigate the aerodynamic performance of aircraft through advanced CFD simulations (source: BAE Systems).


In 2010 BUDE was ported to the OpenCL parallel programming language. Figure 4 compares the performance of this OpenCL code running on a GPU to the original, optimised Fortran code running on a fast, quad-core CPU. The results for both performance and energy efficiency were compelling for the GPU-based system: NVIDIA C2050 GPUs gave a real-world speedup of 4.0X versus a fast quad-core CPU while using about one half of the energy to complete the same work [76, 77].

Figure 4: BUDE molecular docking performance in seconds (lower is better).

An important benefit of using the new OpenCL parallel programming language was the ability to run exactly the same code on a range of different GPUs from different vendors, and even on multi-core x86 CPUs. Thus BUDE can now use whichever hardware is the fastest available at any given time. BUDE can also use GPUs and CPUs at the same time to deliver the maximum possible aggregate performance.

Options pricing using GPUs: NAG

One of the earliest application areas explored using GPUs was Monte-Carlo-based numerical integration for derivative pricing and risk management in financial markets. The primary need in this application is the ability to generate high-quality pseudo-random or quasi-random numbers rapidly. Fortunately, parallel algorithms for generating these kinds of random numbers already exist, and several of these algorithms are well matched to the GPU's many-core architecture.

The Numerical Algorithms Group (NAG) is a provider of high-quality numerical software libraries. NAG is one of the first vendors to support GPUs — their GPU-accelerated pseudo-random number generators show speedups of between 5X and 34X when running on NVIDIA's C2050 GPU, compared to Intel's highly optimised MKL library when running on all eight cores of a contemporary Intel Core i7 860 operating at 2.8GHz [17]. The exact speedup of these routines depends on the kind of random number generator being used (MRG32k3a, Sobol or MT19937) and also the distribution required (uniform, exponential or normal).

However, in modern financial models, generating the random numbers is typically only a fraction of the overall task, with the models themselves often requiring considerable computation to evaluate. Since the Monte Carlo method is inherently parallel, the entire simulation can be performed on the GPU, where the abundance of computing power means that these complex models can be evaluated very rapidly. This capability has attracted several users from the financial services industry, including several of the more sophisticated insurance companies. This class of potential GPU user is faced with modern risk management requirements such as 'Solvency II', which require large amounts of simulation. When combined with the complex cashflow calculations inherent in insurance portfolios, this application becomes massively parallel and very compute-intensive, making it well suited to modern GPUs.

NAG has published the results it achieved while working with two of its tier-one financial services customers.
One of these customers has used NAG's GPU-accelerated random number generator library to speed up its 'local vol' code and saw speedups of approximately ten times compared to a fast quad-core x86 CPU.

In summary, the generation of large sets of pseudo-random numbers for Monte Carlo methods has been one application area where GPUs have had a big impact.


Speedups for the random number generation routines of 100X or more versus a single CPU core are regularly achieved, often translating into real-world performance improvements of 10X or more compared to a host machine with a total of eight cores, as in the NAG cases described above.

Have we just ignored the earlier warning about excessive speedups? Not quite — this is a rare instance where the GPU really does have an additional performance advantage over the CPU. GPUs uniquely include dedicated hardware for executing certain functions very quickly, in only a few clock cycles. These functions include exponential, sine, cosine, square root and inverse square root, amongst others. Many of these functions are used extensively when generating pseudo-random numbers, and so in this instance GPUs really do have a greater performance advantage over CPUs than the peak speed comparison would indicate.

For an overview of GPUs in computational finance see [43].

Special effects rendering using GPUs: Double Negative

London-based Double Negative is the largest visual effects company in Europe [31]. Their Oscar-winning work can be seen in many of the big blockbuster movies over the last twelve years. One of the biggest industrial users of high-performance computing, visual effects companies require datacentres with thousands of servers to produce their computer-generated visuals.

Water, smoke and fire-based effects have become increasingly popular in movies over the last few years, with Double Negative's proprietary fluid solver 'Squirt' being a key part of their success. The image of the volcanic explosion on the front cover of this report was generated by Double Negative using this software. The computationally expensive process of solving the Navier-Stokes equations required by these fluid-based effects has led Double Negative to investigate the potential of GPUs. Double Negative recently added a GPU-based Poisson solver to their fluid simulations, resulting in as much as a 20X speedup of this component of their workflow. These results motivated Double Negative to install their first GPU-based render-farm.

Double Negative is now developing a specialist language and compiler to make GPUs easier to use for their visual effects artists. Results are again impressive, with a 26X speedup on an NVIDIA C2050 GPU compared to the original single-threaded C++ code running on one core of a 3.33GHz six-core Intel X5680 CPU (the speedup would be about 4.4X if all six cores of the host could be used at once with perfect scaling on the CPU) [14]. These increases in performance have motivated Double Negative to investigate using GPUs for other applications, such as image manipulation tools and deformers, in a bid to accelerate as much as possible of their visual effects workflow.

Accelerating database operations using GPUs: 3DMashUp

It may be clear that many-core architectures will shine on computationally intensive codes, but what about more data-driven applications?

Researchers are now looking at using GPU architectures to improve the performance of computationally intensive database operations.
One of the first companies developing this technology is the California-based 3DMashUp. 3DMashUp has developed a solution using a PostgreSQL-based relational database that embeds GPU code alongside data within the database itself. Users can query the GPU-aware database, transparently causing the GPU code to be executed on the data, all the while remaining within the context of the database.

This approach has a number of benefits, one of which is that the data and the code can remain resident in the database at all times. The overhead of calling OpenCL GPU code from within the database is just 4 microseconds, making this a convenient and efficient way to exploit GPUs for large database users with computationally expensive processing requirements. It also brings GPU acceleration to mainstream markets. For example, users who have very large image databases may now process these images using GPUs just by integrating the appropriate 3DMashUp methods and installing the latest GPUs in their system. Users of the database do not have to care about GPU programming at all in this scenario; their tasks 'just run faster'.

The 3DMashUp solution also transparently handles mapping the most performance-critical data into the GPU's memory as and when it is needed, removing the need for users to concern themselves with these low-level programming details.

This approach appears to be very promising and is one of the few examples where the benefits of GPU acceleration can be invisibly provided to end users, in this case with computationally intensive, large data problems that use databases to store and manage that data. For more information see the 3DMashUp website [26].


GPUs in Depth

While several different many-core architectures have emerged during the last few years, we will focus on GPU-based many-core systems for the following discussion. However, almost all the principles and terminology discussed in this context apply equally to the other many-core architectures, including x86 CPUs and consumer electronics products as previously described.

A modern GPU is a many-core device, meaning it will contain hundreds or even thousands of simple yet fully programmable cores, all on a single chip. These cores are often grouped together into homogeneous sets with varying names. The emerging C-based many-core programming language OpenCL [67] calls these simple cores 'Processing Elements', or PEs, and the groupings of these PEs 'Compute Units', or CUs.

Another common feature of all many-core architectures is a multi-level memory hierarchy. Typically each Processing Element will have a small amount of its own private memory. There is often a larger memory per Compute Unit that can be used as shared memory between all that Compute Unit's Processing Elements. There will also usually be a global memory which can be seen by all the Compute Units and thus by all the Processing Elements. Finally, there is usually a separate memory for the host processor system. OpenCL refers to these four levels in the memory hierarchy as Private, Local, Global and Host, respectively. Figure 5 illustrates the OpenCL memory hierarchy terminology. We will adopt OpenCL's terminology for the rest of this report.

The GPU itself is integrated into a 'host system'. This might mean the GPU is on its own add-in board, plugged into a standard PCI Express expansion slot within a server or desktop computer. Alternatively, the GPU may be integrated alongside the host CPU, as found inside high-end laptops and smartphones today. Increasingly in the future we will see the many-core GPU tightly integrated onto the same chip as the host CPU; in June 2011 AMD officially launched their first 'Fusion' CPU, codenamed Llano, that integrates a quad-core x86 CPU with a many-core GPU capable of running OpenCL programs [104]. NVIDIA already has consumer-level 'Fusion' devices in its Tegra CPU+GPU product line, but at Supercomputing 2010 in New Orleans they announced 'Project Denver'. This is NVIDIA's programme for a high-end Fusion-class device, integrating their cutting-edge GPUs with new, high-end ARM cores [84]. The first Project Denver products will not arrive for several years, but NVIDIA has indicated that this 'fusion' approach of integrating their GPUs alongside their own ARM-based multi-core CPUs is central to their future product roadmap.

The reader may come across three very different configurations of GPU-accelerated systems, illustrated in Figure 6. These three different ways of integrating GPUs within a system may have very different performance characteristics, but their usage models are almost identical and we shall treat them in the same way for the purposes of this report.
An important implication of this wide range of different approaches is that GPU-accelerated computing is not just for supercomputers — almost all systems are capable of exploiting the performance of GPUs, including laptops, tablets and even smartphones.

Figure 5: The memory hierarchy of a generic GPU: work-items, each with private memory, are grouped into work-groups that share local memory, with global and constant memory visible to the whole compute device and a separate host memory on the host (source: Khronos).
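These four levels appear explicitly in OpenCL source code as address-space qualifiers. The kernel below is a minimal sketch written to illustrate the terminology (it is not taken from any of the applications described in this report): each work-group stages a tile of data from global memory into its local memory, and each work-item then operates on a private copy.

// Illustrative OpenCL kernel showing the Private, Local and Global address spaces
__kernel void scaleTile(__global const float *input,   /* Global: visible to all work-items      */
                        __global float *output,
                        __local float *tile,           /* Local: shared within one work-group    */
                        const float factor)
{
    int gid = get_global_id(0);      /* index into the global arrays       */
    int lid = get_local_id(0);       /* index within this work-group       */

    tile[lid] = input[gid];          /* Global -> Local                     */
    barrier(CLK_LOCAL_MEM_FENCE);    /* make the tile visible to the group  */

    float x = tile[lid];             /* Local -> Private (per work-item)    */
    output[gid] = factor * x;        /* Private -> Global                   */
}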


Figure 6: Three different methods of integrating GPUs: add-in graphics cards for a workstation, an integrated but discrete GPU within a laptop, and a fully integrated, single-chip CPU and GPU (source: NVIDIA, AMD).

What are the performance benefits of the current many-core GPU systems? Just how fast are they? GPUs are still so new to the area of scientific computing that no widely used standard benchmarks exist for them. This makes it very difficult for us to compare their performance to each other, or to traditional CPUs. Until GPU-oriented benchmarks mature, we have to 'make do' with much less accurate indicators of performance, such as peak hardware capabilities (FLOPS, bandwidth etc.). There are some early application performance comparisons we can use too; the 'Success Stories' section of this report gives some real-world examples.

Let us consider three different systems as points of reference: a laptop, a desktop and a server. The laptop on which this report was written is a contemporary Apple MacBook Pro and has a dual-core CPU running at 2.4 GHz. Its peak performance is around 40 GFLOPS (GigaFLOPS), measured in single-precision floating point operations per second — 1 GFLOP is one billion (10⁹) floating point operations per second. Double-precision floating point performance is typically one half that of single-precision performance on CPUs. The other important hardware characteristic for performance is memory bandwidth — the same laptop has a peak bandwidth to its main memory of about 8 GBytes/s.

Under the author's desk is a PC containing a reasonably fast CPU: a dual-core x86 running at 3.16 GHz. This desktop's peak speed is nearly 60 GFLOPS single-precision and its peak memory bandwidth is about 12 GBytes/s. The fastest mainstream x86 server CPUs available on the market in the Spring of 2011 were Intel's six-core Westmere-EX devices which, at their fastest speeds of 2.4 GHz, are capable of peak speeds of around 230 GFLOPS per CPU for single precision, and of over 60 GBytes/s of main memory bandwidth. The Westmere figures are truly impressive, although these high-end CPUs are both expensive (with a list price per CPU of $4,616 at launch) and power-hungry (130W for the CPU alone). Even so, a single one of these high-end CPUs would have qualified as a supercomputer in its own right just 15 or 20 years ago [79].

But many-core performance, as embodied by the GPUs, is on a different level. For example, the desktop machine mentioned above also contains an NVIDIA GTX 285 GPU. This has a peak speed of over 1,000 GFLOPS (1 TeraFLOP) and peak memory bandwidth of nearly 160 GBytes/s. Hence the GPU, which only cost a few hundred pounds, has about 17 times the peak single-precision performance and over 13 times the memory bandwidth of the host CPU in the same system. Even compared to the quad-core server CPUs found in most servers today the GPU is fast, with 10 times the peak FLOPS and over 5 times the peak memory bandwidth.

So now the opportunity should be clear: many-core GPUs can potentially offer an order of magnitude greater performance in peak compute and memory bandwidth.
The challenge is translating this into real-world benefits. GPU-based computing is still in its infancy, but early indications are that, for the right kinds of applications (an important proviso), real-world speedups in the range of 5 to 10 times are being achieved using GPUs.

Gaining performance: massive parallelism

The examples we have presented demonstrate that GPUs are very fast and can provide up to a tenfold improvement in performance over traditional CPUs. The trick to getting the most from these many-core architectures is parallelism, and lots of it.


Box 2: Linear Algebra

Linear algebra libraries have been a staple in scientific software since the development of the Basic Linear Algebra Subprograms (BLAS) by Kincaid and Krogh [70], amongst others, in the 1970s. Today the speed of routines for vector and matrix arithmetic is critically important to the performance of a wide range of applications. Not surprisingly, therefore, significant effort has gone into optimising linear algebra libraries for GPUs.

Most vendors supply optimised BLAS and LAPACK (Linear Algebra PACKage) libraries for their hardware. CPU vendor libraries include Intel's MKL [59], AMD's ACML [7] and IBM's ESSL [55]. GPU vendor libraries include NVIDIA's CuBLAS [93] and AMD's GPU ACML [8]. These libraries speed up the BLAS routines by optimising the original BLAS code for the targeted hardware. This approach has been reasonably successful up to now, but there are questions about its long-term scalability. The author has first-hand experience in this area, having led one of the first teams to develop a many-core optimised BLAS/LAPACK library in 2003 while at ClearSpeed. The concern is around this approach to programming, and whether it is realistic to be able to write a serial program that calls a software library that 'magically' hides all the complexity of using massive parallelism in an optimal fashion. Experience has shown that, to get the best performance, one usually has to port most of the application to the massively parallel part of the system, and not just rely on calls to libraries with parallel implementations.

To address this issue, the PLASMA and MAGMA research projects have been looking at the implications of heterogeneous and many-core architectures for future BLAS and LAPACK libraries [5]. These projects have shown that it is possible to achieve a significant fraction of peak performance for LAPACK-level routines, such as Cholesky and LU solvers, on heterogeneous, multi- and many-core systems, going as far as using multiple GPUs in a single accelerated system to achieve more than 1.1 TeraFLOPS of single precision performance [71].

The GPU hardware itself is massively parallel, with today's GPUs already including hundreds or thousands of simple cores, and GPUs with tens of thousands of cores due to appear within the next few years. The most popular programming models for GPUs, NVIDIA's CUDA [96] and the cross-platform open standard OpenCL [67], encourage programmers to expose the maximum parallelism possible in their code in order to achieve the best performance, both now and in the future. In general these are just good software design practices and we encourage all software developers to start thinking this way if they have not already — within a few years almost all processors will be of a many-core design and will require massively parallel software for best performance.

This requirement for massive parallelism may at first appear intimidating. But fortunately the kind of parallelism required does not have to be coarse-grained 'task' parallelism; it can simply be fine-grained 'data' parallelism.
This is usually much easier to find and often corresponds to traditional vector processing with which many software developers are already familiar. Computationally intensive applications almost always have a large degree of data parallelism inherent in the underlying problem being solved, ranging from the particle parallelism in classical N-body problems to the row and column parallelism in matrix-based linear algebra. But it remains a fundamental point: to get the best performance from many-core architectures you will want to expose the maximum parallelism possible in your application.

This is a very important point and is worth restating: we will go so far as to make the potentially controversial statement that most existing software will need to be completely redeveloped, possibly even requiring new algorithms, in order to scale well on the increasingly parallel hardware we will all be using from now on.

There is an equally important implication that we will also state explicitly: all computer hardware is likely to remain on the increasingly parallel trend for at least the next ten to twenty years. Hardware designers predict that processors will become an order of magnitude more parallel every five years, meaning that in fifteen years' time processors will be of the order of a thousand times more parallel than the already very parallel processors that exist today!

We will come back to what this means for all of us as users or developers of software in the 'Next Steps' section later in this report.


GPU parallel programming models

With the rapidly increasing popularity of GPU-based computing, multiple different programming models have emerged for these new platforms. These include vendor-specific examples, such as NVIDIA's CUDA, The Portland Group's PGI Accelerator™ [113], CAPS enterprise's HMPP Workbench [23] and Microsoft's DirectCompute [81], as well as open standards such as OpenCL. We shall now look at the two most widely used GPU programming models in more detail: CUDA and OpenCL.

CUDA

NVIDIA's CUDA is currently the most mature and easiest to use parallel programming model for GPUs, first appearing on the market in 2006 [91]. CUDA was designed to be familiar to as many software developers as possible, and so its first instantiation began as a variant of the C programming language, called CUDA C. Indeed, in developing CUDA C code it is often possible to incrementally port an existing C application, using profiling to identify those parts of the code that consume most of the run-time (the 'hot spots') and converting these to run in parallel on the many-core GPU.

This 'incremental porting' approach can be very effective and certainly lowers the barrier to entry for many users. But in our experience, most software developers find that they can only get so far with this approach. The reason for this limitation usually comes back to the design of the underpinning algorithm. If it was not designed to be massively parallel, then it is likely that the current software implementation cannot easily be modified in an incremental way in order to make it massively parallel. Of course there are always exceptions, but often the incremental porting approach will only take you so far.

Coming back to CUDA C, to demonstrate how familiar it can be, consider the two code examples in Figure 7. On the left is some ordinary ANSI C code for performing a simple vector addition. Because it is written in a serial programming language, this code uses a loop and an array index to run along the two input vectors, producing one output element at each loop iteration.

In the CUDA C example on the right, you can see the loop has completely disappeared. Instead, many copies of the function are executed in parallel, one copy for each parallel data item we wish to process. In this instance we would launch 'size' copies of the vector addition function, all of which could run in parallel. At execution time the run-time system decides how to map efficiently the expressed parallelism onto the available parallel hardware resources.

It is important not to underestimate the level of challenge when writing code for GPUs. The trivial examples in Figure 7 are very small in order to illustrate a point. In a real example, modifying the algorithm in order to express enough parallelism to achieve good performance on a GPU often requires significant intellectual effort.
And while the parallel kernels themselves may often look quite similar to a highly optimised sequential equivalent, significant additional code is required in order to initialise the parallel environment, orchestrate the GPUs in the system, and communicate data efficiently, often by overlapping the data movement with computation.

The CUDA C code on the right hand side of Figure 7 is an example of a 'kernel' — this is common terminology for the parallel version of a routine that has been modified in order to run on a many-core processor.

// ANSI C vector addition example (left hand side of Figure 7)
void vectorAdd(const float *a,
               const float *b,
               float *c,
               unsigned int size)
{
    unsigned int i;
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}
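For reference, the following is a minimal sketch of the kind of CUDA C kernel and launch described above; the thread-block size of 256 and the guarding test against 'size' are choices made for this sketch rather than details taken from Figure 7.

// CUDA C vector addition kernel: one thread computes one output element
__global__ void vectorAdd(const float *a,
                          const float *b,
                          float *c,
                          unsigned int size)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)              // guard threads that fall beyond the end of the vectors
        c[i] = a[i] + b[i];
}

// Illustrative host-side launch: enough 256-thread blocks to cover 'size'
// elements, i.e. effectively 'size' copies of the function run in parallel.
// vectorAdd<<<(size + 255) / 256, 256>>>(d_a, d_b, d_c, size);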


CUDA uses slightly different terminology for the Processing Elements (which it calls 'threads') and Compute Units (which it calls 'thread blocks'), but generally it is considered to be fairly straightforward to port code between CUDA C and similar programming models, such as OpenCL.

CUDA threads execute independently and thus ideally on independent data — this is why data parallelism is such a natural fit for these kinds of architectures. Threads in a thread block are grouped to execute essentially the same program at the same time, but on different pieces of data. This data-parallel approach is known as Single Instruction Multiple Data (SIMD) [38]. On the other hand, different thread blocks may execute completely different programs from one another if the need arises, although most applications tend to run the same program on all the thread blocks at the same time, essentially turning the computation into one large SIMD or vector calculation.

CUDA's maturity brings a number of benefits for software developers, including a growing number of software development tools, including debuggers and profilers [97]. In 2009 CUDA Fortran arrived, as a joint development between NVIDIA and the Portland Group [95]. CUDA Fortran takes the principles of CUDA C and weaves them into a state-of-the-art commercial Fortran 2003 compiler.

OpenCL

OpenCL bears many similarities to CUDA; indeed NVIDIA is one of the main contributors to the OpenCL standard, so this should be no surprise. The biggest differences are in the way OpenCL is being developed. Whereas CUDA is a proprietary solution being driven by a single vendor, OpenCL is an open standard, instigated by Apple but now being driven by a consortium of over 35 companies, including all the major processor vendors such as Intel, IBM and AMD. The consortium is being organised and run by the Khronos Group [67].

OpenCL is a much more recent development than CUDA and is correspondingly less mature. However, OpenCL also includes a number of more recent advances for supporting heterogeneous computing in systems combining multiple CPUs and GPUs. The first version of the OpenCL standard was released in December 2008 [66], and OpenCL has been developing rapidly since then. It has already been integrated into recent versions of Apple's OS X operating system. AMD and NVIDIA have released implementations for their GPUs, the former also including a version that will run on a multi-core x86 host CPU. IBM has demonstrated a version of OpenCL running on its Cell processor and recently released a version for their high-end POWER architecture [56]. Intel released its first OpenCL implementation for its multi-core x86 CPUs in late 2010 [27, 60]. Embedded processor companies are also developing their own OpenCL solutions, including ARM [11, 99], Imagination Technologies [57] and ZiiLabs [122]. These last three companies provide the CPUs and GPUs in most of the popular consumer electronics gadgets, such as smartphones and portable MP3 players.

While OpenCL is less mature than NVIDIA's CUDA and has some of the drawbacks of committee-designed standards, its benefits are the openness of the standard, the vast resource being ploughed into its development by many companies and, most importantly, its cross-platform capabilities.

OpenCL is quite a low-level solution.
It exposes features that many software developers may not have had to deal with before. CUDA has similar features but includes a higher-level application programmer interface (API) that conveniently handles much of the low-level detail. But in OpenCL this is all left up to the programmer. One example is the explicit use of queues for sending commands such as 'run this kernel' from the host processor to the many-core GPU. It is expected that as OpenCL matures, various solutions will emerge to abstract away this lower-level detail, leaving most programmers to operate at a higher level. An interface has already been developed that provides this facility for C++ programs.

One of OpenCL's other important characteristics is that it has been designed to support heterogeneous computing from Day One; that is, it supports running code simultaneously on multiple, different kinds of processors, all within a single OpenCL program. When re-engineering software this is an important consideration: adopting a programming environment that supports a wide range of heterogeneous parallel hardware will give developers the greatest flexibility when deploying their re-engineered codes in the future. For example, an OpenCL program could decide to run one task on one of the host processor cores, while running another task using a many-core GPU, and do this all in parallel. These multiple OpenCL tasks can easily coordinate between themselves, passing data and signals from one to the other. Because almost all processors will combine elements of both heterogeneity and many-core in the future, this is a critical feature to support.
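Returning to the explicit command queues mentioned above, the fragment below sketches the host-side steps an OpenCL 1.x program typically performs to run a kernel like the vector addition of Figure 8: pick a device, create a context and a command queue, build the kernel from source, and enqueue the work. It is purely illustrative; all error checking is omitted and the array size is an arbitrary choice.

#include <CL/cl.h>
#include <stdio.h>

#define N 1024

static const char *src =
    "__kernel void vectorAdd(__global const float *a,   \n"
    "                        __global const float *b,   \n"
    "                        __global float *c)         \n"
    "{                                                  \n"
    "    int i = get_global_id(0);                      \n"
    "    c[i] = a[i] + b[i];                            \n"
    "}                                                  \n";

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Select a GPU device and create a context plus a command queue for it */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Device buffers; the inputs are copied from host memory at creation time */
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    /* Build the kernel from source and set its arguments */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vectorAdd", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

    /* 'Run this kernel': the command is sent to the device through the queue */
    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read: wait for the result and copy it back to host memory */
    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    printf("c[1] = %f\n", c[1]);   /* expect 3.0 */
    return 0;
}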


The ability to run OpenCL code on almost any multi-core mainstream processor makes it extremely attractive as a method for exploiting many-core parallelism, be it on today's increasingly multi-core CPUs or tomorrow's heterogeneous many-core processors.

For completeness, Figure 8 shows the simple vector addition example once again, this time in OpenCL.

// OpenCL vector addition example
__kernel void vectorAdd(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    // Vector element index
    int nIndex = get_global_id(0);
    c[nIndex] = a[nIndex] + b[nIndex];
}

Figure 8: OpenCL vector addition kernel example. Note the similarity with the CUDA example on the right hand side of Figure 7.

To learn more about OpenCL, Scarpino's 'A gentle introduction to OpenCL' [105] is a good place to start. Two OpenCL text books are due to appear at about the same time as this report: Munshi et al's 'OpenCL Programming Guide' [86] and Gaster et al's 'Heterogeneous Computing with OpenCL' [40].

Other many-core programming models

While we have focused our descriptions on the many-core programming models that are emerging to support GPUs, there exist other models that were created primarily to support multi-processor systems. There are many such models, but the most commonly used fall mainly into two categories — message passing and shared memory.

Message passing programming models provide methods for running collections of tasks in parallel, and for communicating data between these tasks by sending messages over communication links.

Shared memory programming models assume a system of parallel processors connected together via a memory that can be seen by all of the processors.

Both of these models are used in the mainstream today and we shall briefly describe the most common examples of each.

Message passing, MPI — MPI 'is a message-passing application programmer interface (API), together with protocol and semantic specifications for how its features must behave in any implementation' [49]. MPI has been proven to scale to large high-performance computing systems consisting of hundreds of thousands of cores and underpins most parallel applications running on large-scale systems. It is most commonly used to communicate between tasks running on multiple servers, with the MPI protocol used to send messages over the network connecting the servers together.

As well as providing basic primitives for point-to-point communications, MPI includes collective communication operations such as broadcast and reduction. The more recent MPI version 2 (MPI-2) adds support for parallel I/O and dynamic process management [112]. Many implementations of MPI exist, with the standard specifying bindings for C, C++ and Fortran.

Shared memory, OpenMP — OpenMP is the most commonly used high-level shared-memory programming model in scientific computing [25]. Its API is implemented in the form of compiler directives for C, C++ and Fortran. Most major hardware vendors support it, as do most major compilers, including GNU gcc, Intel's ICC and Microsoft's Visual Studio.

OpenMP supports task-level parallelism (also known as Multiple Instruction Multiple Data or MIMD) in addition to the data-level parallelism (SIMD) that we have already met [38].
Shared memory, OpenMP — OpenMP is the most commonly used high-level shared-memory programming model in scientific computing [25]. Its API is implemented in the form of compiler directives for C, C++ and Fortran. Most major hardware vendors support it, as do most major compilers, including GNU gcc, Intel's ICC and Microsoft's Visual Studio. OpenMP supports task-level parallelism (also known as Multiple Instruction Multiple Data or MIMD) in addition to the data-level parallelism (SIMD) that we have already met [38]. Sections of code that can be executed in parallel are identified with the addition of special compiler markers, or pragmas, that tell an OpenMP-compatible compiler what to do. In theory these pragmas are simply ignored by non-OpenMP compilers, which would generate valid serial executables from the same source code.
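As a sketch of what such a pragma looks like in C (the report's own example, in Fortran, follows as Figure 9), the directive below asks an OpenMP compiler to share the loop iterations across threads; the arrays a, b and c are assumed to be allocated by the caller.

// Vector addition in C with an OpenMP pragma. A compiler without OpenMP
// support simply ignores the pragma and produces an ordinary serial loop.
void vector_add(const float *a, const float *b, float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}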


For comparison, Figure 9 shows our by now familiar simple vector addition example, this time in OpenMP, and for variety, in Fortran (the OpenMP Fortran pragmas are the lines beginning with '!$omp').

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i)
!$omp do
do i = 1, size
    c(i) = a(i) + b(i)
end do
!$omp end do
!$omp end parallel

Figure 9: Vector addition example in Fortran extended with OpenMP.

Work is already under way developing the next major version of OpenMP [109] and early signs indicate that consideration is being given to extensions for supporting many-core architectures. Thus if the reader is already using OpenMP we would recommend following the developments in this area from the standards committee and also active OpenMP vendors such as Cray.

Hybrid solutions — It is possible to combine two or more of these parallel programming models for hybrid solutions. In the parallel programming mainstream today, one of the most common approaches is to use a hybrid system of OpenMP within each multi-core server node and MPI between nodes. Much has been written about this particular hybrid approach [109].

An increasingly common approach is to combine MPI with one of the new GPU-oriented parallel programming systems, such as CUDA or OpenCL. This enables the creation of systems from multiple nodes, each node including one or more GPUs. Several of the fastest computers in the world have recently been constructed in just this fashion, including TSUBAME 2.0 at Tokyo Tech in Japan [50]. Indeed the second fastest computer in the world in June 2011 was a GPU-powered cluster: China's Tianhe-1A system at the National Supercomputer Center in Tianjin, with a performance of 2.57 PetaFLOPS (2.57×10^15 floating point operations per second, or 2.57 million GFLOPS) [16, 114]. As of the International Supercomputing Conference in June 2011, three of the top ten supercomputers in the Top500 list are GPU accelerated. The fraction of systems in the Top500 that achieve their performance by GPU acceleration is set to increase rapidly.
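A hedged sketch of this hybrid MPI-plus-OpenCL pattern is shown below: MPI provides the processes (typically one per GPU), and each process selects one of the GPUs on its node for its OpenCL work. The simple rank-to-GPU mapping, the fixed device array and the assumption that at least one GPU is present are illustrative simplifications; a production code would use a node-local rank and proper error checking.

// A sketch of the hybrid pattern: MPI between nodes, OpenCL on each GPU.
// Each MPI process picks one of the GPUs visible on its node.
#include <mpi.h>
#include <CL/cl.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Find the GPUs on this node (first platform only, at most 8 devices).
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id gpus[8];
    cl_uint num_gpus = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, gpus, &num_gpus);
    if (num_gpus > 8)
        num_gpus = 8;

    // Simplistic mapping of MPI ranks to GPUs; a real code would use a
    // node-local rank rather than the global rank.
    cl_device_id dev = gpus[rank % num_gpus];
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);

    // ... build kernels and enqueue work on ctx, exchange results with MPI ...

    clReleaseContext(ctx);
    MPI_Finalize();
    return 0;
}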


Current Challenges

Like all breakthroughs in technology, the change from multi-core to many-core computer architectures will not be smooth for everyone. There are significant challenges during the transition, some of which are outlined below.

1. Porting code to massively parallel heterogeneous systems is often (but not always) harder than ports to new hardware have been in the past. Often completely new algorithms are required.

2. Many-core technologies are still relatively new, with implications for the maturity of software tools, the lack of software developers with the right skills and experience, and the paucity of ported application software and libraries.

3. Even though cross-platform programming languages such as OpenCL are now emerging, these have so far focused on source code portability and cannot guarantee performance portability. This is of course not a new issue; any highly optimised code written in a mainstream language such as C, C++ or Fortran has performance portability issues between different architectures. However, differences between GPU architectures are even greater than those between CPU architectures, and so performance portability is set to become a greater challenge in the future.

4. There are multiple competing open and de facto standards, which inevitably confuse the situation for potential adopters.

5. Many current GPGPU products still carry some of their consumer graphics heritage, including the lack of important hardware reliability features such as Error Correcting Codes (ECC) on their memories. Even where these features do exist, they currently incur prohibitive performance penalties that are not present in the corresponding CPU solutions.

6. There is a lot of hype around GPU computing, with many over-inflated claims of performance speedups of 100 times or more. These claims increase the risk of setting expectations too high, with subsequent disappointment from trial projects.

7. The lack of industry standard benchmarks makes it difficult for users to compare competing many-core products simply and accurately.

Of all these challenges, the most fundamental is the design and development of new algorithms that will naturally lend themselves to the massive parallelism of GPUs today, and to the ubiquitous heterogeneous multi-/many-core systems of tomorrow. If as a community we can design adaptive, highly scalable algorithms, ideally with heterogeneity and even fault-tolerance in mind, we will be well placed to exploit the rapid development of parallel architectures over the next two decades.


Next Steps

Audit your software

One valuable practical step we can each take is to perform an audit of the software we use that is performance-critical to our work. For software developed by third parties, find out what their policy is towards supporting many-core processors such as GPUs. Is their software already parallel? If so, how scalable is it? Does it run effectively on a quad-core CPU today? What about the emerging 8, 12 and 16 core CPUs? Do they have a demonstration of their software accelerated on a GPU? What is their roadmap for the software? The software licensing model is something that you also need to be conscious of — is the software licensed per core, processor, node, user, ...? Will you have to pay more for an upgrade that supports many-core processors? Some vendors will supply these features as no-cost upgrades; others will charge extra for them.

It is also important to be specific about parallel acceleration of the particular features you use in the software in question. For example, at the time of writing there are GPU-accelerated versions of dense linear algebra solvers in MATLAB, but not of sparse linear algebra solvers [73]. Just because an application claims to be 'GPU-accelerated', it does not necessarily follow that your particular use of that application will gain the performance benefit of GPUs. Your mileage will definitely vary, so check with the supplier of your software to verify before committing.

Plan for parallelism

If you develop your own software, start thinking about what your own path to parallelism should be. Are the users of your software likely to stick primarily to multi-core processors in mainstream laptops, desktops and servers? If so, you should be thinking about adopting OpenMP, MPI or another widely supported approach for parallel programming. You should probably avoid proprietary approaches that may not support all platforms or be commercially viable in the long term. Instead, use open standards available across multiple platforms and vendors to minimise your risk.

Also consider when many-core processors will feature in your roadmap. These are inevitable — even the mainstream CPUs will rapidly become heterogeneous many-cores, so this really is a 'when' not an 'if'. If you do not want to support many-core processors in the near or medium term, OpenMP and MPI will be good choices. If, however, you may want to support many-core processors within the next few years, you will need a plan to adopt either OpenCL or CUDA sooner rather than later. OpenCL might be a viable alternative to OpenMP on multi-core CPUs in the short term.

If you are going to develop your own many-core aware software there is a tremendous amount of support that you can tap into.

Attend a workshop

In the UK each year there are several training workshops in the use of GPUs. The national supercomputing service HECToR [53] is starting to provide GPU workshops on CUDA and OpenCL programming; the timetable for these is available online [52]. Prof Mike Giles at the University of Oxford regularly runs CUDA programming workshops; for the date of the next one see [43]. His webpage also includes excellent links to other GPU programming resources. A search for GPU training in the UK should turn up offerings from several universities.
There are also commercial training providers in the UK that are worth considering, such as NAG [89]. Daresbury Labs has a team who track the latest processor technologies and who are already experienced in developing software for GPUs. This group holds occasional seminars and workshops, and is willing to offer advice and guidance for newcomers to many-core technologies [30]. GPU vendors will often provide assistance if asked, especially if their assistance could lead to new sales.

There are also many conferences and seminars emerging to address many-core computing. The UK now has a regular GPU developers workshop. Previous years have seen the workshop held in Oxford [1] and Cambridge [2]. The 2011 workshop is due to be held at Imperial College.

A useful GPU computing resource is GPUcomputing.net [47]. In particular it has a page dedicated to GPU computing in the UK [45]. This site is mostly dominated by the use of NVIDIA GPUs, reflecting NVIDIA's lead in the market, but over time the site should see a greater percentage of content coming from work on a wider range of platforms.


A good place to get started with OpenCL is AMD's OpenCL webpage [9]. NVIDIA's OpenCL webpage [98] is also useful, as is their OpenCL 'jump start' guide [92].

We would also recommend reading some of the latest work in the area of parallel algorithms. A good place to start is 'The view from Berkeley' [12], which builds on earlier work by Per Brinch Hansen [21]. Tim Mattson's 'Patterns for Parallel Programming' is another notable book in this area [74]. These works recognise that most forms of scientific computation can be decomposed into a small set of common kernels, often known as 'dwarfs', 'motifs' or 'templates'. Decomposing algorithms into sub-algorithms that may in turn be classified as a particular kind of dwarf makes it easier to exploit the substantial works of the past and present, thus simplifying the task of identifying parallel approaches for your own algorithms where they are already known.

Finally, there is an online list of applications that have already been ported to NVIDIA's GPUs [94].


Appendix 1: Active Researchers and Practitioner Groups

This is a rapidly growing field and there are already many active researchers and developers using GPU computing in the UK and around the world. The GPU developer conference held at the University of Cambridge in December 2010 saw around 100 attendees, and the Many-Core and Reconfigurable Supercomputing Conference at Bristol in April 2011 saw over 130 delegates, all actively working within this area.

In the UK

The list below is not intended to be exhaustive but instead to give a cross section of the GPU computing community in the UK and beyond. Our apologies are offered in advance to anyone we left off this list.

University of Bristol: McIntosh-Smith is a computer architect with extensive industrial experience designing many-core heterogeneous architectures. He has more recently been developing software and algorithms for GPUs, and has been working on new, massively parallel algorithms for a range of different applications including molecular docking, computational chemistry, climate modelling, and linear algebra [78].

University of Oxford: Giles is the leading expert in the UK in the use of many-core GPUs for financial applications and also for unstructured grid problems. Giles runs regular GPU programming workshops in NVIDIA's proprietary CUDA programming language [43]. Also at Oxford, Karastergiou has been investigating the use of GPUs for the real-time processing of data for the future international telescope project, the Square Kilometre Array (SKA) [64].

University of Warwick: Jarvis's group has performed extensive work modelling the performance of large-scale HPC systems, more recently looking at the effects of adding GPUs into these systems. The group has a particular focus on wavefront codes [62].

Imperial College: Kelly is an expert in high-level tools for automatically generating optimised code for many-core architectures. His research interests include parallelising compilers and auto-tuning techniques [65].

UCL: Atkinson is one of the leaders in applying GPUs to medical imaging problems, in particular Magnetic Resonance Imaging (MRI) [13].

Edinburgh Parallel Computing Centre: EPCC is one of the leading providers of high-performance computing training in the UK [34], and can provide consultancy to help parallelise algorithms and software for parallel computer architectures.

University of Manchester: Manchester has a vibrant community of GPU users and runs a 'GPU club' that meets on a semi-regular basis, attracting around one hundred attendees to recent events [72].

Queen's University Belfast: Gillan and Scott lead a group of experts in using many-core architectures for image processing applications [46, 108]. This group provides expertise and consultancy to third parties who need to adapt their codes for GPUs and for reconfigurable systems based on Field-Programmable Gate Arrays (FPGAs).

University of Cambridge: Pullan and Brandvik were two of the earliest adopters of many-core architectures in the UK. They are experts in using GPUs for CFD and in porting structured grid codes to many-core architectures [20].
Also at Cambridge, Gratton has been investigating the use of AMD and NVIDIA GPUs for accelerating cosmology codes [48]. All three are part of Cambridge's Many-Core Computing Group, which also has expertise in the use of GPUs for medical imaging and genomic sequence processing [22].

Daresbury Laboratories: the Distributed Computing (DisCo) group at Daresbury Labs regularly runs workshops in the use of many-core architectures and can provide support in porting codes to GPUs. They also have access to various GPU hardware platforms that they can make available for remote testing [30].


AWE: Turland leads a group using GPUs to accelerate industrial wavefront simulation codes. His group has been developing OpenCL-based methods that abstract away from specific architectures and vendor programming models, allowing them to develop many-core applications that can perform well on a wide range of different hardware [15].

BAE Systems: Appa was one of the earliest adopters of many-core architectures and GPUs in the world. He now leads the team within the mathematical modelling group at BAE Systems that is using GPUs to accelerate CFD computations to great effect [37].

The Numerical Algorithms Group: NAG has a team developing commercial many-core software libraries for linear algebra and random number generators, amongst others. NAG also provides commercial many-core training courses and software development consultancy services [89].

The MathWorks: The MATLAB developer is doing a lot of work to target GPUs from within their software, much of which is taking place at their Cambridge development office [73].

Allinea: Allinea is a UK-based company developing some of the leading multi-core and many-core software development tools, including debuggers and profilers for GPUs [6]. Developers of large-scale software using hundreds or thousands of CPUs and GPUs could benefit significantly from these kinds of parallel debuggers.

Internationally

Innovative Computing Lab, University of Tennessee: Dongarra is a world expert in linear algebra libraries. His group is spearheading the development of highly scalable, dense linear algebra libraries for heterogeneous many-core processors and massively parallel systems [35, 36].

University of Illinois, Urbana-Champaign: Hwu is another leading global figure in GPU-based computing. His particular interest is in using GPUs to accelerate medical imaging applications [54]. UIUC is also one of the leading academic institutions in the use of GPUs to accelerate molecular dynamics and visualisation codes, such as NAMD and VMD [115].

Oak Ridge National Lab: Vetter's Future Technologies group at ORNL is planning one of the largest supercomputers in the world based on GPU technology [116]. Vetter's group has also developed one of the first GPU benchmarks, the SHOC suite, which is useful for comparing the performance of different many-core architectures [29, 117].

Tokyo Institute of Technology: Matsuoka is one of the leading HPC experts in Japan and is a specialist in building highly energy-efficient supercomputers based on GPU technologies. He is the architect of the TSUBAME 2.0 GPU-based supercomputer, the fifth fastest in the world as of June 2011. TSUBAME 2.0 is also one of the most energy-efficient production systems in the world [50]. Matsuoka's group also develops GPU software, and has developed one of the fastest FFT libraries for GPUs [90].

Stanford: Pande's group developed one of the first GPU-accelerated applications, the Folding@Home screensaver-based protein folding project [100]. This group went on to develop the OpenMM open source library for molecular mechanics.
OpenMM takes advantage of GPUs via CUDA and OpenCL and is a good starting point for porting molecular dynamics codes to GPUs [39, 101].

NVIDIA: NVIDIA was the first GPU company to see the real potential of GPU-based computing. Having introduced the CUDA parallel programming language in 2006, NVIDIA has the most applications ported to their many-core architecture. Their ported applications webpage is a good place to find out what software is available for GPUs [94].

The Portland Group: PGI provides software development tools including a parallel Fortran compiler that generates code for GPUs [103].

Acceleware: Acceleware provides multi-core and GPU-accelerated software solutions for the electromagnetics and oil and gas industries [4].

SciComp: SciComp is a financial software services company that provides a suite of software tools that exploit GPUs and multi-core CPUs to deliver significant performance advantages over traditional approaches [106].


Appendix 2: Software Applications Available on GPUs

The table below lists several important applications along with their current multi-core and GPU capabilities. These will be changing all the time so do check for the latest information when considering any application. This list is far from exhaustive; readers are encouraged to investigate further if their favourite codes are not included.

MATLAB
    Multi-core: Some built-in ability to exploit parallelism at the level of libraries such as BLAS and LAPACK. Additionally provides a 'Parallel computing toolbox' to exploit multi-core CPUs [111].
    Many-core: Has a beta release of a GPU-accelerated version of MATLAB [73]. Various third-party solutions available, such as AccelerEyes' Jacket [3].

Mathematica
    Multi-core: Built-in support for parallelism since Mathematica 7 [119]; also provides gridMathematica for large-scale parallel computation [118].
    Many-core: Mathematica 8 now includes support for GPUs via CUDA and OpenCL [120].

NAG
    Multi-core: NAG has a software library product for shared memory and multi-core parallel systems [87].
    Many-core: Has a beta release of random number generators using CUDA on NVIDIA GPUs [88].

R
    Multi-core: Multiple R packages are available for exploiting parallelism on multi-core systems [32].
    Many-core: Numerous open source projects porting R to GPUs, including R+GPU, available in the gputools R package [80].

SciFinance
    Multi-core: This code synthesis tool for building derivatives pricing and risk models can already generate multi-core code using OpenMP [107].
    Many-core: Already supports the generation of CUDA code for PDEs and SDEs [107].

BLAS, LAPACK math libraries
    Multi-core: Almost all BLAS and LAPACK libraries already support multi-core processors: Intel's MKL, AMD's ACML, IBM's ESSL, ATLAS, ScaLAPACK, PLASMA.
    Many-core: This is an area where much software is already available for GPUs: NVIDIA's CUBLAS, AMD's ACML-GPU, Acceleware, EM Photonics' Cula Tools and MAGMA.


References

[1] 1st CUDA Developers' Conference. www.smithinst.ac.uk/Events/CUDA2009, December 2009.

[2] 2nd UK GPU Computing Conference. www.many-core.group.cam.ac.uk/ukgpucc2, December 2010.

[3] AccelerEyes — MATLAB GPU computing. www.accelereyes.com, April 2011.

[4] Acceleware Ltd — Multi-Core Software Solutions. www.acceleware.com, July 2011.

[5] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. Journal of Physics: Conference Series, 180(1), 2009.

[6] Allinea Software. allinea.com, May 2011.

[7] AMD Corp — AMD Core Math Library (ACML). developer.amd.com/cpu/Libraries/acml/Pages, April 2011.

[8] AMD Corp — AMD Core Math Library for graphic processors (ACML-GPU). developer.amd.com/gpu/acmlgpu/pages, April 2011.

[9] AMD Corp — AMD Developer Central: getting started with OpenCL. developer.amd.com/zones/OpenCLZone/Pages/gettingstarted.aspx, April 2011.

[10] AMD Corp — AMD Developer Central: Magny-Cours zone. developer.amd.com/zones/magny-cours/pages/default.aspx, April 2011.

[11] ARM Ltd — ARM heralds new era in embedded graphics with next-generation Mali GPU. www.arm.com/about/newsroom/arm-heralds-new-era-in-embedded-graphics-with-next-generation-mali-gpu.php, April 2010.

[12] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[13] David Atkinson. www.cs.ucl.ac.uk/staff/d.atkinson, July 2011.

[14] Dan Bailey. Fast GPU preconditioning for fluid simulations in film production (GTC 2010 presentation archive). www.nvidia.com/object/gtc2010-presentation-archive.html#session2239, April 2011.

[15] Dave Barrett, Ron Bell, Claire Hepwood, Gavin Pringle, David Turland, and Paul Graham. An investigation of different programming models and their performance on innovative architectures. In The Many-Core and Reconfigurable Supercomputing Conference (MRSC), 2011.

[16] BBC — China's Tianhe-1A crowned supercomputer king. www.bbc.co.uk/news/technology-11766840, November 2010.

[17] Thomas Bradley, Jacques du Toit, Robert Tong, Mike Giles, and Paul Woodhams. Parallelization techniques for random number generations. In Wen-mei W. Hwu, editor, GPU Computing Gems Emerald Edition, chapter 16, pages 231–246. Morgan Kaufmann, 2011.

[18] Tobias Brandvik and Graham Pullan. SBLOCK: a framework for efficient stencil-based PDE solvers on multi-core platforms. In 10th IEEE International Conference on Computer and Information Technology, pages 1181–1188, 2010.

[19] Tobias Brandvik and Graham Pullan. An accelerated 3D Navier-Stokes solver for flows in turbomachines. Journal of Turbomachinery, 133(2):021025, 2011.

[20] Tobias Brandvik and Graham Pullan. Turbostream: ultra-fast CFD for turbomachines. www.turbostream-cfd.com, April 2011.


[21] Per Brinch Hansen. Model programs for computational science: a programming methodology for multicomputers. Concurrency — Practice and Experience, 5(5):407–423, August 1993.

[22] University of Cambridge — Cambridge many-core group projects. www.many-core.group.cam.ac.uk/projects, May 2011.

[23] CAPS entreprise — HMPP Workbench. www.caps-entreprise.com/hmpp.html, August 2011.

[24] CFD Online. www.cfd-online.com, April 2011.

[25] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2007.

[26] Tim Childs. 3DMashUp. www.scribd.com/3dmashup, April 2011.

[27] Peter Clarke. Intel releases alpha OpenCL SDK for Core. www.eetimes.com/electronics-news/4210812/Intel-releases-alpha-OpenCL-SDK-for-Core, November 2010.

[28] P. I. Crumpton and M. B. Giles. Multigrid aircraft computations using the OPlus parallel library. In A. Ecer, J. Periaux, N. Satofuka, and S. Taylor, editors, Parallel Computational Fluid Dynamics: Implementations and Results Using Parallel Computers, pages 339–346. North-Holland, 1996.

[29] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 63–74. ACM, 2010.

[30] DisCo: Daresbury Laboratory distributed computing group. www.cse.scitech.ac.uk/disco, September 2010.

[31] Double Negative Visual Effects. www.dneg.com, April 2011.

[32] Dirk Eddelbuettel. CRAN task view: high-performance and parallel computing with R. cran.r-project.org/web/views/HighPerformanceComputing.html, February 2011.

[33] Erich Elsen, Patrick LeGresley, and Eric Darve. Large calculation of the flow over a hypersonic vehicle using a GPU. Journal of Computational Physics, 227:10148–10161, December 2008.

[34] EPCC. www.epcc.ed.ac.uk, May 2011.

[35] Jack Dongarra et al. PLASMA. icl.cs.utk.edu/plasma/index.html, November 2010.

[36] Jack Dongarra et al. MAGMA. icl.cs.utk.edu/magma/index.html, April 2011.

[37] Mark Fletcher. BAE Systems packs massive power thanks to GPU coprocessing. mcad-online.com/en/bae-systems-packs-massive-power-thanks-to-gpu-co-processing.html?cmp_id=7&news_id=1078, May 2010.

[38] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21:948–960, September 1972.

[39] M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. LeGrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande. Accelerating molecular dynamic simulation on graphics processing units. Journal of Computational Chemistry, 30(6):864–872, 2009.

[40] Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, and Dana Schaa. Heterogeneous Computing with OpenCL. Morgan Kaufmann, 2011.

[41] Nick Gibbs, Anthony R. Clarke, and Richard B. Sessions. Ab initio protein structure prediction using physicochemical potentials and a simplified off-lattice model. Proteins: Structure, Function, and Bioinformatics, 43(2):186–202, 2001.


[42] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. J. Kelly. Performance analysis and optimisation of the OP2 framework on many-core architectures. The Computer Journal, 2011 (to appear).

[43] Mike Giles. Course on CUDA programming. people.maths.ox.ac.uk/gilesm/cuda, August 2011.

[44] Mike Giles. OP2 project page. people.maths.ox.ac.uk/gilesm/op2, April 2011.

[45] Mike Giles and Simon McIntosh-Smith. GPUcomputing.net: GPU Computing UK. www.gpucomputing.net/?q=node/160, April 2011.

[46] Charles J. Gillan. www.ecit.qub.ac.uk/Card/?name=c.gillan, May 2011.

[47] GPUcomputing.net. www.gpucomputing.net, April 2011.

[48] Steve Gratton. Graphics card computing for cosmology. www.ast.cam.ac.uk/~stg20/gpgpu/index.html, July 2011.

[49] William Gropp. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, second edition, 2000.

[50] GSIC, Tokyo Institute of Technology — TSUBAME2. www.gsic.titech.ac.jp/en/tsubame2, April 2011.

[51] Mark Harris. General-purpose computation on graphics hardware. gpgpu.org/about, July 2011.

[52] HECToR training. www.hector.ac.uk/cse/training, April 2011.

[53] HECToR: UK National Supercomputing Service. www.hector.ac.uk, April 2011.

[54] Wen-mei W. Hwu. The IMPACT research group at UIUC. impact.crhc.illinois.edu/people/current/hwu/hwu_research.php, May 2011.

[55] IBM — Engineering and scientific subroutine library (ESSL) and parallel ESSL. www-03.ibm.com/systems/software/essl, April 2011.

[56] IBM — OpenCL development kit for Linux on Power. www.alphaworks.ibm.com/tech/opencl, April 2011.

[57] Imagination Technologies — Imagination Technologies announces POWERVR SGX545 graphics IP core with full DirectX 10.1, OpenGL 3.2 and OpenCL 1.0 capabilities. www.imgtec.com/news/Release/index.asp?NewsID=516, January 2010.

[58] Intel Corp — Intel AVX: new frontiers in performance improvements and energy efficiency. software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency, April 2011.

[59] Intel Corp — Intel math kernel library. software.intel.com/en-us/articles/intel-mkl, April 2011.

[60] Intel Corp — Intel OpenCL SDK. software.intel.com/en-us/articles/intel-opencl-sdk, April 2011.

[61] Intel Corp — Intel Research: single-chip cloud computer. techresearch.intel.com/ProjectDetails.aspx?Id=1, April 2011.

[62] Stephen Jarvis. High Performance Computing group, University of Warwick. www2.warwick.ac.uk/fac/sci/dcs/research/pcav/areas/hpc, May 2011.

[63] Jon Peddie Research — Jon Peddie Research announces 2nd quarter PC graphics shipments. jonpeddie.com/press-releases/details/jon-peddie-research-announces-2nd-quarter-pc-graphics-shipments, July 2010.

[64] Aris Karastergiou. www.physics.ox.ac.uk/users/Karastergiou/Site/home.html, July 2011.

[65] Paul H. J. Kelly. www.doc.ic.ac.uk/~phjk, May 2011.

[66] Khronos Group — Khronos Group releases OpenCL 1.0 specification. www.khronos.org/news/press/releases/the_khronos_group_releases_opencl_1.0_specification, December 2008.


[67] Khronos Group — OpenCL: the open standard for parallel programming of heterogeneous systems. www.khronos.org/opencl, November 2010.

[68] Khronos Group — OpenGL: the industry's foundation for high performance graphics. www.opengl.org, April 2011.

[69] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: a Hands-on Approach. Morgan Kaufmann, 2010.

[70] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software, 5:308–323, September 1979.

[71] Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A scalable high performant Cholesky factorization for multicore with GPU accelerators. In Proceedings of the 9th International Conference on High-Performance Computing for Computational Science, pages 93–101. Springer-Verlag, 2010.

[72] University of Manchester, Computational Science Community Wiki. wiki.rcs.manchester.ac.uk/community/GPU/Club, May 2011.

[73] The MathWorks — MATLAB GPU computing with NVIDIA CUDA-enabled GPUs. www.mathworks.com/discovery/matlab-gpu.html, July 2011.

[74] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004.

[75] Simon McIntosh-Smith and Richard B. Sessions. An accelerated, computer assisted molecular modeling method for drug design. In International Supercomputing, June 2008. www.cs.bris.ac.uk/~simonm/publications/ClearSpeed_BUDE_ISC08.pdf.

[76] Simon McIntosh-Smith, Terry Wilson, Jon Crisp, Amaurys Ávila Ibarra, and Richard B. Sessions. Energy-aware metrics for benchmarking heterogeneous systems. SIGMETRICS Performance Evaluation Review, 38:88–94, March 2011.

[77] Simon McIntosh-Smith, Terry Wilson, Amaurys Ávila Ibarra, Richard B. Sessions, and Jon Crisp. Benchmarking energy efficiency, power costs and carbon emissions on heterogeneous systems. The Computer Journal, 2011 (to appear).

[78] Simon McIntosh-Smith. www.cs.bris.ac.uk/home/simonm, April 2011.

[79] Hans Meuer. Top500 supercomputer sites. www.top500.org, April 2011.

[80] University of Michigan, Molecular and Behavioral Neuroscience Institute — R+GPU. brainarray.mbni.med.umich.edu/brainarray/rgpgpu, May 2011.

[81] Microsoft Corp — DirectCompute. blogs.msdn.com/b/chuckw/archive/2010/07/14/directcompute.aspx, April 2011.

[82] Microsoft Corp — Microsoft DirectX. www.gamesforwindows.com/en-US/directx, April 2011.

[83] Gordon Moore. Cramming more components onto integrated circuits. Electronics Magazine, pages 114–117, April 1965.

[84] Samuel K. Moore. With Denver project NVIDIA and ARM join CPU-GPU integration race. spectrum.ieee.org/tech-talk/semiconductors/processors/with-denver-project-nvidia-and-arm-join-cpugpu-integration-race, January 2011.

[85] Trevor Mudge. Power: a first-class architectural design constraint. IEEE Computer, 34(4):52–58, April 2001.

[86] Aaftab Munshi, Benedict Gaster, Timothy Mattson, James Fung, and Dan Ginsburg. OpenCL Programming Guide. Addison-Wesley, 2011.


[87] NAG Ltd — NAG library for SMP & multicore. www.nag.co.uk/numeric/FL/FSdescription.asp, May 2011.

[88] NAG Ltd — NAG numerical routines for GPUs. www.nag.co.uk/numeric/GPUs, May 2011.

[89] NAG Ltd — NAG training. nag.co.uk/support_training.asp, April 2011.

[90] Akira Nukada and Satoshi Matsuoka. Auto-tuning 3-D FFT library for CUDA GPUs. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 30:1–30:10. ACM, 2009.

[91] NVIDIA Corp — NVIDIA unveils CUDA – the GPU computing revolution begins. www.nvidia.com/object/IO_37226.html, November 2006.

[92] NVIDIA Corp — NVIDIA OpenCL JumpStart Guide. developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf, April 2009.

[93] NVIDIA Corp — CUBLAS library. developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUBLAS_Library.pdf, August 2010.

[94] NVIDIA Corp — CUDA-accelerated applications. www.nvidia.com/object/cuda_app_tesla.html, April 2011.

[95] NVIDIA Corp — CUDA Fortran. www.nvidia.com/object/cuda_fortran.html, April 2011.

[96] NVIDIA Corp — CUDA zone. www.nvidia.com/object/cuda_home_new.html, April 2011.

[97] NVIDIA Corp — NVIDIA developer zone: tools & ecosystem. developer.nvidia.com/tools-ecosystem, April 2011.

[98] NVIDIA Corp — NVIDIA OpenCL. www.nvidia.com/object/cuda_opencl_new.html, April 2011.

[99] Tom Olson. Why OpenCL will be on every smartphone in 2014. blogs.arm.com/multimedia/263-why-opencl-will-be-on-every-smartphone-in-2014, August 2010.

[100] Vijay Pande. Folding@home. folding.stanford.edu, May 2011.

[101] Vijay Pande. OpenMM. simtk.org/home/openmm, April 2011.

[102] Kevin Parrish. Intel to ship 10-core CPUs in first half of 2011. www.tomshardware.com/news/Xeon-westmere-EX-Nehalem-EX-10-cores-160-threads,12230.html, February 2011.

[103] The Portland Group. www.pgroup.com, July 2011.

[104] Don Scansen. Closer look inside AMD's Llano APU at ISSCC. www.eetimes.com/electronics-news/4087527/Closer-look-inside-AMD-s-Llano-APU-at-ISSCC, February 2010.

[105] Matthew Scarpino. A gentle introduction to OpenCL. Dr. Dobb's Journal, July 2011. drdobbs.com/high-performance-computing/231002854.

[106] SciComp Inc — SciGPU achieves stunning acceleration for PDE and MC derivative pricing models. www.scicomp.com/parallel_computing/GPU_OpenMP, July 2011.

[107] SciComp Inc — SciFinance achieves astounding accelerations with parallel computing. www.scicomp.com/parallel_computing/GPU_OpenMP, May 2011.

[108] Stan Scott. www.qub.ac.uk/research-centres/HPDC/Profile/?name=ns.scott, May 2011.

[109] Lorna Smith and Mark Bull. Development of mixed mode MPI / OpenMP applications. Scientific Programming, 9:83–98, August 2001.

[110] Herb Sutter. The free lunch is over: a fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3), March 2005. www.gotw.ca/publications/concurrency-ddj.htm.


[111] The MathWorks — MATLAB's parallel computing toolbox. www.mathworks.com/products/parallel-computing/?s_cid=HP_FP_ML_parallelcomptbx, May 2011.

[112] The MPI Forum — MPI: a message-passing interface standard, Version 2.2. www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf, September 2009.

[113] The Portland Group — PGI Accelerator Compilers. www.pgroup.com/resources/accel.htm, August 2011.

[114] Top500.org — Tianhe-1A. www.top500.org/system/10587, April 2010.

[115] University of Illinois at Urbana-Champaign — GPU acceleration of molecular modeling applications. www.ks.uiuc.edu/Research/gpu, April 2011.

[116] Jeffrey Vetter. Toward exascale computational science with heterogeneous processing. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, page 1. ACM, 2010.

[117] Jeffrey Vetter. The SHOC benchmark suite. ft.ornl.gov/doku/shoc, April 2011.

[118] Wolfram Research — gridMathematica. www.wolfram.com/products/gridmathematica, May 2011.

[119] Wolfram Research — Mathematica 7 built-in parallel computing. www.wolfram.com/products/mathematica/newin7/content/BuiltInParallelComputing, May 2011.

[120] Wolfram Research — Mathematica 8 CUDA and OpenCL support. www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support, May 2011.

[121] Dong Hyuk Woo and H-H. S. Lee. Extending Amdahl's law for energy-efficient computing in the many-core era. IEEE Computer, 41(12):24–31, December 2008.

[122] ZiiLABS Pte Ltd — ZiiLABS OpenCL SDK. www.ziilabs.com/technology/opencl.aspx, April 2011.


The GPU Computing Revolution
From Multi-Core CPUs to Many-Core Graphics Processors
A Knowledge Transfer Report from the London Mathematical Society and the Knowledge Transfer Network for Industrial Mathematics
by Simon McIntosh-Smith

The London Mathematical Society (LMS) is the UK's learned society for mathematics. Founded in 1865 for the promotion and extension of mathematical knowledge, the Society is concerned with all branches of mathematics and its applications. It is an independent and self-financing charity, with a membership of around 2,300 drawn from all parts of the UK and overseas. Its principal activities are the organisation of meetings and conferences, the publication of periodicals and books, the provision of financial support for mathematical activities, and the contribution to public debates on issues related to mathematics research and education. It works collaboratively with other mathematical bodies worldwide. It is the UK's adhering body to the International Mathematical Union and is a member of the Council for the Mathematical Sciences, which also comprises the Institute of Mathematics and its Applications, the Royal Statistical Society, the Operational Research Society and the Edinburgh Mathematical Society.
www.lms.ac.uk

The Knowledge Transfer Network for Industrial Mathematics (IM-KTN) brings to life the underpinning role of mathematics in boosting innovation performance. It enables mathematical modelling and analysis to add value to product design, operations and strategy across all areas of business, recognising that innovation often occurs on the interfaces between existing capabilities. The IM-KTN connects together companies (of all sizes), public sector organisations, government agencies and universities, to exploit these opportunities using mechanisms that are established catalysts of new activity. It is a vehicle for developing new capability, sharing knowledge and experience, and helping shape science and technology strategy. The IM-KTN currently has the active engagement of about 450 organisations and is managed on behalf of the Technology Strategy Board by the Smith Institute for Industrial Mathematics and System Engineering.
www.innovateuk.org/mathsktn

The LMS-KTN Knowledge Transfer Reports are an initiative that is coordinated jointly by the IM-KTN and the Computer Science Committee of the LMS. The reports are being produced as an occasional series, each one addressing an area where mathematics and computing have come together to provide significant new capability that is on the cusp of mainstream industrial uptake. They are written by senior researchers in each chosen area, for a mixed audience in business and government. The reports are designed to raise awareness among managers and decision-makers of new tools and techniques, in a format that allows them to assess rapidly the potential for exploitation in their own fields, alongside information about potential collaborators and suppliers.
