The GPU Computing Revolution
From Multi-Core CPUs to Many-Core Graphics Processors
A Knowledge Transfer Report from the London Mathematical Society and Knowledge Transfer Network for Industrial Mathematics
By Simon McIntosh-Smith


Copyright © 2011 by Simon McIntosh-Smith

Front cover image credits:
Top left: Umberto Shtanzman / Shutterstock.com
Top right: godrick / Shutterstock.com
Bottom left: Double Negative Visual Effects
Bottom right: University of Bristol
Background: Serg64 / Shutterstock.com


THE GPU COMPUTING REVOLUTION
From Multi-Core CPUs To Many-Core Graphics Processors
By Simon McIntosh-Smith

Contents
Executive Summary
From Multi-Core to Many-Core: Background and Development
Success Stories
GPUs in Depth
Current Challenges
Next Steps
Appendix 1: Active Researchers and Practitioner Groups
Appendix 2: Software Applications Available on GPUs
References

September 2011
A Knowledge Transfer Report from the London Mathematical Society and the Knowledge Transfer Network for Industrial Mathematics
Edited by Robert Leese and Tom Melham
London Mathematical Society, De Morgan House, 57–58 Russell Square, London WC1B 4HS
KTN for Industrial Mathematics, Surrey Technology Centre, Surrey Research Park, Guildford GU2 7YG


AUTHOR

Simon McIntosh-Smith is head of the Microelectronics Research Group at the University of Bristol and chair of the Many-Core and Reconfigurable Supercomputing Conference (MRSC), Europe's largest conference dedicated to the use of massively parallel computer architectures. Prior to joining the university he spent fifteen years in industry, where he designed massively parallel hardware and software at companies such as Inmos, STMicroelectronics and Pixelfusion, before co-founding ClearSpeed as Vice-President of Architecture and Applications. In 2006 ClearSpeed's many-core processors enabled the creation of one of the fastest and most energy-efficient supercomputers: TSUBAME at Tokyo Tech. He has given many invited talks on how massively parallel, heterogeneous processors are revolutionising high-performance computing, and has published numerous papers on how to exploit the speed of graphics processors for scientific applications. He holds eight patents in parallel hardware and software and sits on the steering and programme committees of most of the well-known international high-performance computing conferences.

www.cs.bris.ac.uk/home/simonm/
simonm@cs.bris.ac.uk


Executive Summary

Computer architectures are undergoing their most radical change in a decade. In the past, processor performance has been improved largely by increasing clock speed — the faster the clock speed, the faster a processor can execute instructions, and thus the greater the performance that is delivered to the end user. This drive to greater and greater clock speeds has stopped, and indeed in some cases is actually reversing. Instead, computer architects are designing processors that include multiple cores that run in parallel. For the purposes of this report we shall think of a core as one independent 'brain', able to execute its own stream of instructions, independently fetching the data it requires and writing its own results back to a main memory. This shift from increasing clock speeds to increasing core counts — and thus increasing parallelism — is presenting significant new opportunities and challenges.

During this architectural shift a new class of many-core computer architectures is emerging from the graphics market into the mainstream. This new class of processor already exploits hundreds or even thousands of cores. Many-core processors potentially offer an order of magnitude greater performance than conventional processors, a significant increase that, if harnessed, can deliver major competitive advantages.

But to take advantage of these powerful new many-core processors, most users will have to radically redesign their software, potentially needing completely new algorithms that can exploit massive parallelism.

The many-core opportunity

For computationally challenging problems in business and scientific research, these new, highly parallel, many-core architectures represent a technological breakthrough that can deliver significant benefits in science and business. Such a step-change in available computing performance could enable faster financial trades, higher-resolution simulations, and the design of more competitive products such as increasingly aerodynamic cars and more effective drugs for the treatment of disease.

Thus new many-core processors represent one of the biggest advances in computing in recent history. They also indicate the direction of all future processor design — we cannot avoid this major paradigm shift into massive parallelism. All processors from all the major vendors will become many-core designs over the next few years. Whether this is many identical mainstream CPU cores or a heterogeneous mix of different kinds of core remains to be seen, but it is inevitable that there will be lots of cores in each processor and, importantly, to harness their performance, most software and algorithms will need redesigning.

There is significant competitive advantage to be gained for those who can embrace and exploit this new, many-core technology, and considerable risk for any who are unable to adapt to a many-core approach. This report explains the background to these developments, presents several success stories, and describes a number of ways to get started with many-core technologies.


From Multi-Core to Many-Core: Background and Development

Historically, advances in computer hardware have delivered performance improvements that have been transparent to the end user — software has run more quickly on newer hardware without having to make any changes to that software. We have enjoyed decades of 'single thread' performance improvement, which relied on converting exponential increases in available transistors (known as Moore's Law [83]) into increasing processor clock speeds and advances in microprocessor pipelining, instruction-level parallelism and out-of-order execution. All these technology developments are essentially 'under the hood', with existing software generally performing faster on next-generation hardware without programmer intervention. Most of these developments have now hit a wall.

In 2003 it started to become apparent that several fundamental physical limitations were going to change everything. Herb Sutter noticed this significant change in his widely cited 2004 essay 'The Free Lunch is Over' [110]. Figure 1 is taken from Sutter's paper, updated in 2009, and shows how single-thread performance increases, based on increasing clock speed and instruction-level parallelism improvements, could only have continued at the cost of significant increases in processor power consumption. Something had to change — the free lunch was over.

Figure 1: Graph of four key technology trends from 'The Free Lunch is Over': transistors per processor, clock speed, power consumption and instruction level parallelism [110].

The answer was, and is, parallelism. It transpires that, for various semiconductor physics-related reasons, a processor containing two 3GHz cores will use significantly less energy (and thus dissipate less waste heat) than an alternative processor containing a single 6GHz core. Yet in theory the dual-core 3GHz device can deliver the same performance as the single 6GHz core (much more on this later). Today even laptop processors contain at least two cores, with most mainstream server CPUs already containing four or six cores. Thus, due to the power consumption problem, increasing core counts have replaced increasing clock speeds as the main method of delivering greater hardware performance. (For more details see Box 1 later.)

Of course the important implication of 'the free lunch is over' is that most existing software will not just go faster when we increase the number of cores inside a processor, unless you are in the fortunate position that your performance-critical software is already 'multi-core aware'.

So, hardware is suddenly and dramatically changing: it is delivering ever more performance, but is now doing so through rapidly increasing parallelism.


Software will therefore need to change radically in order to exploit these new parallel architectures. This shift to parallelism represents one of the largest challenges that software developers have faced. Modifying existing software to make it parallel is often (but not always) difficult. Now might be a good time to re-engineer many older codes from scratch.

Since hardware designers gave up on trying to maintain the free lunch, incredible advances in computer hardware design have been made. While mainstream CPUs have been embracing parallelism relatively slowly, as the majority of software takes time to change to this new model, other kinds of processors have exploited parallelism much more rapidly, becoming massively parallel and delivering large performance improvements as a result.

The most significant class of processor to have fully embraced massive parallelism are graphics processors, often called Graphics Processing Units or GPUs. Contemporary GPUs first appeared in the 1990s and were originally designed to run the 2D and 3D graphical operations used by the burgeoning computer games market. GPUs started as fixed-function devices, designed to excel at graphics-specific functions. GPUs reached an inflection point in 2002, when the main graphics standards, OpenGL [68] and DirectX [82], started to require general-purpose programmability in order to deliver advanced special effects for the latest computer games. The mass-market forces driving the development of GPUs (over 525 million graphics processors were sold in 2010 [63]) drove rapid increases in GPU performance and programmability. GPUs quickly evolved from their fixed-function origins into fully-programmable, massively parallel processors (see Figure 2). Soon many developers were trying to exploit these low-cost yet incredibly high-performance GPUs to solve non-graphics problems. The term 'General-Purpose computation on Graphics Processing Units', or GPGPU, was coined by Mark Harris in 2002 [51], and since then GPUs have continued to become more and more programmable.

Figure 2: The evolution of GPUs from fixed function pipelines on the left, to fully programmable arrays of simple processor cores on the right [69].

Today, GPUs represent the pinnacle of the many-core hardware design philosophy. A modern GPU will consist of hundreds or even thousands of fairly simple processor cores. This degree of parallelism on a single processor is usually called 'many-core' in preference to 'multi-core' — the latter term is typically applied to processors with at most a few dozen cores.

It is not just GPUs that are pushing the development of parallel processing. Led by Moore's Law, mainstream processors are also on an inexorable advance towards many-core designs.


AMD started shipping a 12-core mainstream x86 CPU codenamed Magny-Cours in 2010 [10], while in Spring 2011 Intel launched their 10-core Westmere EX x86 CPU [102]. Some server vendors are using these latest multi-core x86 CPUs to deliver machines with 24 or even 48 cores in a single 1U 'pizza box' server. Early in 2010 Intel announced a 48-core x86 prototype processor which they described as a 'cloud on a chip' [61]. These many-core mainstream processors are probably the future of server CPUs, as they are well suited to large-scale, highly parallel workloads of the sort performed by the Googles and Amazons of the internet industry.

But perhaps the most revealing technology trend is that many of us are already carrying heterogeneous many-core processors in our pockets. This is because smartphones have quickly adopted this new technology to deliver the best performance possible while maintaining energy efficiency.

It is the penetration of many-core architectures into all three predominant processor markets — consumer electronics, personal computers and servers — that demonstrates the inevitability of this trend. We now have to pay for our lunch and embrace explicit parallelism in our software but, if embraced fully, exciting new hardware platforms can deliver breakthroughs in performance and energy efficiency.

Box 1: The Power Equation

Processor power consumption is the primary hardware issue driving the development of many-core architectures. Processor power consumption, P, depends on a number of factors, which can be described by the power equation,

P = ACV²f,

where A is the activity of the basic building blocks, or gates, in the processor, C is the total capacitance of the gates' outputs, V is the processor's voltage and f is its clock frequency.

Since power consumption is proportional to the square of the processor's voltage, one can immediately see that reducing voltage is one of the main methods for keeping power under control. In the past voltage has continually been reducing, most recently to just under 1.0 volt. However, in the types of CMOS process from which most modern silicon chips are made, voltage has a lower limit of about 0.7 volts. If we try to turn voltage down lower than this, the transistors that make up the device simply do not turn on and off properly.

Revisiting the power equation, one can further see that, as we build ever more complex processors using exponentially increasing numbers of transistors, the total gate capacitance C will continue to increase. If we have lost the ability to compensate for this increase by reducing V then there are only two variables left we can play with: A and f, and it is likely that both will have to reduce.

The many-core architecture trend is an inevitable outcome of this power equation: Moore's Law gives us ever more transistors, pushing up C, but CMOS transistors have a lower limit to V which we have now reached, and thus the only effective way to harness all of these transistors is to use more and more cores at lower clock frequencies.

For more detail on why power consumption is driving hardware trends towards many-core designs we refer the reader to two IEEE Computer articles: Mudge's 2001 'Power: a first-class architectural design constraint' [85] and Woo and Lee's 'Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era' from 2008 [121].
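As a worked illustration of the trade-off described in Box 1, the short C program below evaluates P = ACV²f for a single 6GHz core and for a pair of 3GHz cores running at a lower voltage. The activity, capacitance and voltage figures are round, made-up values chosen only to make the arithmetic visible; they are not measurements of any real processor.

#include <stdio.h>

/* Dynamic power of one core: P = A * C * V^2 * f */
static double core_power(double A, double C, double V, double f)
{
    return A * C * V * V * f;
}

int main(void)
{
    /* Illustrative values only, not real processor data */
    const double A = 0.1;     /* average gate activity          */
    const double C = 1.0e-8;  /* total switched capacitance (F) */

    /* A single core at 6 GHz typically needs a higher voltage to reach that frequency */
    double p_single = core_power(A, C, 1.2, 6.0e9);

    /* Two cores at 3 GHz can run at a lower voltage */
    double p_dual = 2.0 * core_power(A, C, 1.0, 3.0e9);

    /* With these numbers: roughly 8.6 W versus 6.0 W for the same nominal throughput */
    printf("single 6 GHz core : %.2f W\n", p_single);
    printf("two 3 GHz cores   : %.2f W\n", p_dual);
    return 0;
}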


Success Stories

Before we analyse a few many-core success stories, it is important to address some of the hype around GPU acceleration. As is often the case with a new technology, there have been lots of inflated claims of performance speedups, some even claiming speedups of hundreds of times compared to regular CPUs. These claims almost never stand up to serious scrutiny and usually contain at least one serious flaw. Common problems include: optimising the GPU code much more than the serial code; running on a single host core rather than all of the host cores at the same time; not using the best compiler optimisation flags; not using the vector instruction set on the CPU (SSE or AVX); using older, slower CPUs; etc. But if we just compare the raw capabilities of the GPU with the CPU then typically there is an order of magnitude advantage for the GPU — not two or three orders of magnitude. If you see speedup claims of greater than about a factor of ten then be suspicious. A genuine speedup of a factor of ten is of course a significant achievement and is more than enough to make it worthwhile investing in GPU solutions.

There is one more common issue to consider, and it is to do with the size of the problem being computed. Very parallel systems such as many-core GPUs achieve their best performance when solving problems that contain a correspondingly high degree of parallelism, enough to give every core enough work so that they can operate at close to their peak efficiency for long periods. For example, when accelerating linear algebra, one may need to be processing matrices of the order of thousands of elements in each dimension to get the best performance from a GPU, whereas on a CPU one may only need matrix dimensions of the order of hundreds of elements to achieve close to their best performance (see Box 2 for more on linear algebra). This discrepancy can lead to apples-to-oranges comparisons, where a GPU system is benchmarked on one size of problem while a CPU system is benchmarked on a different size of problem. In some cases the GPU system may even need a larger problem size than is really warranted to achieve its best performance, again resulting in flawed performance claims.

With these warnings aside, let us look at some genuine success stories in the areas of computational fluid dynamics for the aerospace industry, molecular docking for the pharmaceutical industry, options pricing for the financial services industry, special effects rendering for the creative industries, and data mining for large-scale electronic commerce.

Computational fluid dynamics on GPUs: OP2

Finding solutions of the Navier-Stokes equations in two and three dimensions is an important task for mathematically modelling and analysing flows in fluids, such as one might find when considering the aerodynamics of a new car or when analysing how airborne pollution is blown around the tall buildings of a modern city. Computational fluid dynamics (CFD) is the discipline of using computer-based models for solving the Navier-Stokes equations. CFD is a powerful and widely used tool which also happens to be computationally very expensive. GPU computing has thus been of great interest to the CFD community.
Prof. Mike Giles at the University of Oxford is one of the leading developers of CFD methods that use unstructured grids to solve challenging engineering problems. This work has led to two significant software projects: OPlus, a code designed in collaboration with Rolls-Royce to run on clusters of commodity processors utilising message passing [28], and more recently OP2, which is being designed to utilise the latest many-core architectures [42].

OP2 is an example of a very interesting class of application which aims to enable the user to work at a high level of abstraction while delivering high performance. It achieves this by providing a framework for implementing efficient unstructured grid applications. Developers write their code in a familiar programming language such as C, C++ or Fortran, specifying the important features of their unstructured grid CFD problem at a high level. These features include the nodes, edges and faces of the unstructured grid, flow variables, the mappings from edges to nodes, and parallel loops. From this information OP2 can automatically generate optimised code for a specific target architecture, such as a GPU using the new parallel programming languages CUDA or OpenCL, or multi-core CPUs using OpenMP and vectorisation for their SIMD instruction sets (AVX and SSE) [58]. We will cover all of these approaches to parallelism in more detail later in this report.


At the time of writing, OP2 is still a work in progress, in collaboration with Prof. Paul Kelly at Imperial College, but early results are already impressive. Using an airfoil test code with a single-precision 2D Euler equation, a problem with 0.75 million cells and 1.5 million edges has been solved in 7.4 seconds using a single NVIDIA C2070 GPU. Double-precision performance was roughly half this, taking 15.4 seconds on the same problem. When OP2 generates similarly optimised code for a pair of contemporary Intel 'Westmere' 6-core 2.67GHz X5650 CPUs, the same problem takes 34.2 seconds to solve in single precision and 44.6 seconds in double precision. One GPU has therefore delivered impressive relative speedups of 4.6X and 2.9X for single and double precision respectively, compared to 12 top-of-the-range CPU cores. OP2 is being developed as an open source project and potential users can contact Prof. Giles via the project webpage [44].

Other CFD examples — Another successful project using GPUs for CFD has been Turbostream, developed by Tobias Brandvik and Dr Graham Pullan in the Whittle Laboratory at the University of Cambridge [18, 19]. In contrast to OP2's unstructured grid approach, Turbostream solves the Navier-Stokes equations by discretising the problem onto a structured grid and then performing explicit time-stepping to solve the resulting partial differential equations (PDEs) using a three-dimensional stencil operation. Turbostream is another example of an approach that enables the programmer to work at a higher level of abstraction: rather than programmers having to write complex, low-level code themselves, the programmer specifies the required stencil operations in a high-level description from which Turbostream automatically generates optimised code targeting the latest GPUs and multi-core CPUs.

Turbostream has achieved impressive performance improvements, with speedups in the range of 10–20X on the latest NVIDIA GPUs compared with optimised code running on a fast quad-core Intel CPU. Additionally, Turbostream has shown good scalability, using up to 64 GPUs at once. Turbostream is available to license and the reader is referred to the project website for more information [20].

Others have also been achieving great success in using GPUs to accelerate CFD problems. BAE Systems has announced it is using a GPU-accelerated cluster to deliver near-interactive speeds for their CFD simulations [37]. A team at Stanford University was recently successful in modelling hypersonic airflows using a GPU-based system [33].

These early successes indicate that CFD is an application area that should see increasing benefit from GPUs in the future; for additional information see the CFD Online website [24].

Molecular docking using GPUs: BUDE

The University of Bristol has been at the forefront of molecular docking for the past decade [41]. Molecular docking is the process of simulating the interaction between two molecules, where one molecule is typically a protein of medical interest in the human body, and the second represents a potential new drug.

The Bristol University Docking Engine (BUDE) was originally developed to run on a commodity cluster of x86-based CPUs, but as early as 2006 it was ported to one of the first generation of many-core processors, the CSX architecture from ClearSpeed [75].
Figure 3: GPUs have enabled BAE Systems to investigate the aerodynamic performance of aircraft through advanced CFD simulations (source: BAE Systems).


In 2010 BUDE was ported to the OpenCL parallel programming language. Figure 4 compares the performance of this OpenCL code running on a GPU to the original, optimised Fortran code running on a fast, quad-core CPU. The results for both performance and energy efficiency were compelling for the GPU-based system: NVIDIA C2050 GPUs gave a real-world speedup of 4.0X versus a fast quad-core CPU while using about one half of the energy to complete the same work [76, 77].

Figure 4: BUDE molecular docking performance in seconds (lower is better).

An important benefit of using the new OpenCL parallel programming language was the ability to run exactly the same code on a range of different GPUs from different vendors, and even on multi-core x86 CPUs. Thus BUDE can now use whichever hardware is the fastest available at any given time. BUDE can also use GPUs and CPUs at the same time to deliver the maximum possible aggregate performance.

Options pricing using GPUs: NAG

One of the earliest application areas explored using GPUs was Monte-Carlo-based numerical integration for derivative pricing and risk management in financial markets. The primary need in this application is the ability to generate high-quality pseudo-random or quasi-random numbers rapidly. Fortunately, parallel algorithms for generating these kinds of random numbers already exist, and several of these algorithms are well matched to the GPU's many-core architecture.

The Numerical Algorithms Group (NAG) is a provider of high-quality numerical software libraries. NAG is one of the first vendors to support GPUs — their GPU-accelerated pseudo-random number generators show speedups of between 5X and 34X when running on NVIDIA's C2050 GPU, compared to Intel's highly optimised MKL library when running on all eight cores of a contemporary Intel Core i7 860 operating at 2.8GHz [17]. The exact speedup of these routines depends on the kind of random number generator being used (MRG32k3a, Sobol or MT19937) and also the distribution required (uniform, exponential or normal).

However, in modern financial models, generating the random numbers is typically only a fraction of the overall task, with the models themselves often requiring considerable computation to evaluate. Since the Monte Carlo method is inherently parallel, the entire simulation can be performed on the GPU, where the abundance of computing power means that these complex models can be evaluated very rapidly. This capability has attracted several users from the financial services industry, including several of the more sophisticated insurance companies. This class of potential GPU user is faced with modern risk management requirements such as 'Solvency II', which require large amounts of simulation. When combined with the complex cashflow calculations inherent in insurance portfolios, this application becomes massively parallel and very compute-intensive, making it well suited to modern GPUs.

NAG has published the results it achieved while working with two of its tier-one financial services customers.
One of these customers has used NAG's GPU-accelerated random number generator library to speed up its 'local vol' code and saw speedups of approximately ten times compared to a fast quad-core x86 CPU.

In summary, the generation of large sets of pseudo-random numbers for Monte Carlo methods has been one application area where GPUs have had a big impact.


Speedups for the random number generation routines of 100X or more versus a single CPU core are regularly achieved, often translating into real-world performance improvements of 10X or more compared to a host machine with a total of eight cores, as in the NAG cases described above.

Have we just ignored the earlier warning about excessive speedups? Not quite — this is a rare instance where the GPU really does have an additional performance advantage over the CPU. GPUs uniquely include dedicated hardware for executing certain functions very quickly, in only a few clock cycles. These functions include exponential, sine, cosine, square root and inverse square root, amongst others. Many of these functions are used extensively when generating pseudo-random numbers, and so in this instance GPUs really do have a greater performance advantage over CPUs than the peak speed comparison would indicate.

For an overview of GPUs in computational finance see [43].

Special effects rendering using GPUs: Double Negative

London-based Double Negative is the largest visual effects company in Europe [31]. Their Oscar-winning work can be seen in many of the big blockbuster movies over the last twelve years. One of the biggest industrial users of high-performance computing, visual effects companies require datacentres with thousands of servers to produce their computer-generated visuals.

Water, smoke and fire-based effects have become increasingly popular in movies over the last few years, with Double Negative's proprietary fluid solver 'Squirt' being a key part of their success. The image of the volcanic explosion on the front cover of this report was generated by Double Negative using this software. The computationally expensive process of solving the Navier-Stokes equations required by these fluid-based effects has led Double Negative to investigate the potential of GPUs. Double Negative recently added a GPU-based Poisson solver to their fluid simulations, resulting in as much as a 20X speedup of this component of their workflow. These results motivated Double Negative to install their first GPU-based render-farm.

Double Negative is now developing a specialist language and compiler to make GPUs easier to use for their visual effects artists. Results are again impressive, with a 26X speedup on an NVIDIA C2050 GPU compared to the original single-threaded C++ code running on one core of a 3.33GHz six-core Intel X5680 CPU (the speedup would be about 4.4X if all six cores of the host could be used at once with perfect scaling on the CPU) [14]. These increases in performance have motivated Double Negative to investigate using GPUs for other applications, such as image manipulation tools and deformers, in a bid to accelerate as much as possible of their visual effects workflow.

Accelerating database operations using GPUs: 3DMashUp

It may be clear that many-core architectures will shine on computationally intensive codes, but what about more data-driven applications?

Researchers are now looking at using GPU architectures to improve the performance of computationally intensive database operations.
One of the first companies developing this technology is the California-based 3DMashUp. 3DMashUp has developed a solution using a PostgreSQL-based relational database that embeds GPU code alongside data within the database itself. Users can query the GPU-aware database, transparently causing the GPU code to be executed on the data, all the while remaining within the context of the database.

This approach has a number of benefits, one of which is that the data and the code can remain resident in the database at all times. The overhead of calling OpenCL GPU code from within the database is just 4 microseconds, making this a convenient and efficient way to exploit GPUs for large database users with computationally expensive processing requirements. It also brings GPU acceleration to mainstream markets. For example, users who have very large image databases may now process these images using GPUs just by integrating the appropriate 3DMashUp methods and installing the latest GPUs in their system. Users of the database do not have to care about GPU programming at all in this scenario; their tasks 'just run faster'.

The 3DMashUp solution also transparently handles mapping the most performance-critical data into the GPU's memory as and when it is needed, removing the need for users to concern themselves with these low-level programming details.

This approach appears to be very promising and is one of the few examples where the benefits of GPU acceleration can be invisibly provided to end users, in this case with computationally intensive, large data problems that use databases to store and manage that data. For more information see the 3DMashUp website [26].


GPUs in Depth

While several different many-core architectures have emerged during the last few years, we will focus on GPU-based many-core systems for the following discussion. However, almost all the principles and terminology discussed in this context apply equally to the other many-core architectures, including x86 CPUs and consumer electronics products as previously described.

A modern GPU is a many-core device, meaning it will contain hundreds or even thousands of simple yet fully programmable cores, all on a single chip. These cores are often grouped together into homogeneous sets with varying names. The emerging C-based many-core programming language OpenCL [67] calls these simple cores 'Processing Elements', or PEs, and the groupings of these PEs 'Compute Units', or CUs.

Another common feature of all many-core architectures is a multi-level memory hierarchy. Typically each Processing Element will have a small amount of its own private memory. There is often a larger memory per Compute Unit that can be used as shared memory between all that Compute Unit's Processing Elements. There will also usually be a global memory which can be seen by all the Compute Units and thus by all the Processing Elements. Finally, there is usually a separate memory for the host processor system. OpenCL refers to these four levels in the memory hierarchy as Private, Local, Global and Host, respectively. Figure 5 illustrates the OpenCL memory hierarchy terminology. We will adopt OpenCL's terminology for the rest of this report.

The GPU itself is integrated into a 'host system'. This might mean the GPU is on its own add-in board, plugged into a standard PCI Express expansion slot within a server or desktop computer. Alternatively, the GPU may be integrated alongside the host CPU, as found inside high-end laptops and smartphones today. Increasingly in the future we will see the many-core GPU tightly integrated onto the same chip as the host CPU; in June 2011 AMD officially launched their first 'Fusion' CPU, codenamed Llano, that integrates a quad-core x86 CPU with a many-core GPU capable of running OpenCL programs [104]. NVIDIA already has consumer-level 'Fusion' devices in its Tegra CPU+GPU product line, but at Supercomputing 2010 in New Orleans they announced 'Project Denver'. This is NVIDIA's programme for a high-end Fusion-class device, integrating their cutting-edge GPUs with new, high-end ARM cores [84]. The first Project Denver products will not arrive for several years, but NVIDIA has indicated that this 'fusion' approach of integrating their GPUs alongside their own ARM-based multi-core CPUs is central to their future product roadmap.

The reader may come across three very different configurations of GPU-accelerated systems, illustrated in Figure 6. These three different ways of integrating GPUs within a system may have very different performance characteristics, but their usage models are almost identical and we shall treat them in the same way for the purposes of this report.
An important implication of this wide range of different approaches is that GPU-accelerated computing is not just for supercomputers — almost all systems are capable of exploiting the performance of GPUs, including laptops, tablets and even smartphones.

Figure 5: The memory hierarchy of a generic GPU: work-items, each with private memory, are grouped into work-groups that share local memory, with global and constant memory visible to the whole compute device and a separate host memory on the host (source: Khronos).
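These four levels appear explicitly in OpenCL source code as address-space qualifiers. The kernel below is a minimal sketch written to illustrate the terminology (it is not taken from any of the applications described in this report): each work-group stages a tile of data from global memory into its local memory, and each work-item then operates on a private copy.

// Illustrative OpenCL kernel showing the Private, Local and Global address spaces
__kernel void scaleTile(__global const float *input,   /* Global: visible to all work-items      */
                        __global float *output,
                        __local float *tile,           /* Local: shared within one work-group    */
                        const float factor)
{
    int gid = get_global_id(0);      /* index into the global arrays       */
    int lid = get_local_id(0);       /* index within this work-group       */

    tile[lid] = input[gid];          /* Global -> Local                     */
    barrier(CLK_LOCAL_MEM_FENCE);    /* make the tile visible to the group  */

    float x = tile[lid];             /* Local -> Private (per work-item)    */
    output[gid] = factor * x;        /* Private -> Global                   */
}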


Figure 6: Three different methods of integrating GPUs: add-in graphics cards for a workstation, an integrated but discrete GPU within a laptop, and a fully integrated, single-chip CPU and GPU (source: NVIDIA, AMD).

What are the performance benefits of the current many-core GPU systems? Just how fast are they? GPUs are still so new to the area of scientific computing that no widely used standard benchmarks exist for them. This makes it very difficult for us to compare their performance to each other, or to traditional CPUs. Until GPU-oriented benchmarks mature, we have to 'make do' with much less accurate indicators of performance, such as peak hardware capabilities (FLOPS, bandwidth etc.). There are some early application performance comparisons we can use too; the 'Success Stories' section of this report gives some real-world examples.

Let us consider three different systems as points of reference: a laptop, a desktop and a server. The laptop on which this report was written is a contemporary Apple MacBook Pro and has a dual-core CPU running at 2.4 GHz. Its peak performance is around 40 GFLOPS (GigaFLOPS), measured in single-precision floating point operations per second — 1 GFLOP is one billion (10⁹) floating point operations per second. Double-precision floating point performance is typically one half that of single-precision performance on CPUs. The other important hardware characteristic for performance is memory bandwidth — the same laptop has a peak bandwidth to its main memory of about 8 GBytes/s.

Under the author's desk is a PC containing a reasonably fast CPU: a dual-core x86 running at 3.16 GHz. This desktop's peak speed is nearly 60 GFLOPS single-precision and its peak memory bandwidth is about 12 GBytes/s. The fastest mainstream x86 server CPUs available on the market in the Spring of 2011 were Intel's six-core Westmere-EX devices which, at their fastest speeds of 2.4 GHz, are capable of peak speeds of around 230 GFLOPS per CPU for single precision, and of over 60 GBytes/s of main memory bandwidth. The Westmere figures are truly impressive, although these high-end CPUs are both expensive (with a list price per CPU of $4,616 at launch) and power-hungry (130W for the CPU alone). Even so, a single one of these high-end CPUs would have qualified as a supercomputer in its own right just 15 or 20 years ago [79].

But many-core performance, as embodied by the GPUs, is on a different level. For example, the desktop machine mentioned above also contains an NVIDIA GTX 285 GPU. This has a peak speed of over 1,000 GFLOPS (1 TeraFLOP) and peak memory bandwidth of nearly 160 GBytes/s. Hence the GPU, which only cost a few hundred pounds, has about 17 times the peak single-precision performance and over 13 times the memory bandwidth of the host CPU in the same system. Even compared to the quad-core server CPUs found in most servers today the GPU is fast, with 10 times the peak FLOPS and over 5 times the peak memory bandwidth.

So now the opportunity should be clear: many-core GPUs can potentially offer an order of magnitude greater performance in peak compute and memory bandwidth.
The challenge is translating this into real-world benefits. GPU-based computing is still in its infancy, but early indications are that, for the right kinds of applications (an important proviso), real-world speedups in the range of 5 to 10 times are being achieved using GPUs.

Gaining performance: massive parallelism

The examples we have presented demonstrate that GPUs are very fast and can provide up to a tenfold improvement in performance over traditional CPUs. The trick to getting the most from these many-core architectures is parallelism, and lots of it.


Box 2: Linear Algebra

Linear algebra libraries have been a staple in scientific software since the development of the Basic Linear Algebra Subprograms (BLAS) by Kincaid and Krogh [70], amongst others, in the 1970s. Today the speed of routines for vector and matrix arithmetic is critically important to the performance of a wide range of applications. Not surprisingly, therefore, significant effort has gone into optimising linear algebra libraries for GPUs.

Most vendors supply optimised BLAS and LAPACK (Linear Algebra PACKage) libraries for their hardware. CPU vendor libraries include Intel's MKL [59], AMD's ACML [7] and IBM's ESSL [55]. GPU vendor libraries include NVIDIA's CuBLAS [93] and AMD's GPU ACML [8]. These libraries speed up the BLAS routines by optimising the original BLAS code for the targeted hardware. This approach has been reasonably successful up to now, but there are questions about its long-term scalability. The author has first-hand experience in this area, having led one of the first teams to develop a many-core optimised BLAS/LAPACK library in 2003 while at ClearSpeed. The concern is around this approach to programming, and whether it is realistic to be able to write a serial program that calls a software library that 'magically' hides all the complexity of using massive parallelism in an optimal fashion. Experience has shown that, to get the best performance, one usually has to port most of the application to the massively parallel part of the system, and not just rely on calls to libraries with parallel implementations.

To address this issue, the PLASMA and MAGMA research projects have been looking at the implications of heterogeneous and many-core architectures for future BLAS and LAPACK libraries [5]. These projects have shown that it is possible to achieve a significant fraction of peak performance for LAPACK-level routines, such as Cholesky and LU solvers, on heterogeneous, multi- and many-core systems, going as far as using multiple GPUs in a single accelerated system to achieve more than 1.1 TeraFLOPS of single precision performance [71].

The GPU hardware itself is massively parallel, with today's GPUs already including hundreds or thousands of simple cores, and GPUs with tens of thousands of cores due to appear within the next few years. The most popular programming models for GPUs, NVIDIA's CUDA [96] and the cross-platform open standard OpenCL [67], encourage programmers to expose the maximum parallelism possible in their code in order to achieve the best performance, both now and in the future. In general these are just good software design practices and we encourage all software developers to start thinking this way if they have not already — within a few years almost all processors will be of a many-core design and will require massively parallel software for best performance.

This requirement for massive parallelism may at first appear intimidating. But fortunately the kind of parallelism required does not have to be coarse-grained 'task' parallelism; it can simply be fine-grained 'data' parallelism.
This is usually much easier to find and often corresponds to traditional vector processing with which many software developers are already familiar. Computationally intensive applications almost always have a large degree of data parallelism inherent in the underlying problem being solved, ranging from the particle parallelism in classical N-body problems to the row and column parallelism in matrix-based linear algebra. But it remains a fundamental point: to get the best performance from many-core architectures you will want to expose the maximum parallelism possible in your application.

This is a very important point and is worth restating: we will go so far as to make the potentially controversial statement that most existing software will need to be completely redeveloped, possibly even requiring new algorithms, in order to scale well on the increasingly parallel hardware we will all be using from now on.

There is an equally important implication that we will also state explicitly: all computer hardware is likely to remain on the increasingly parallel trend for at least the next ten to twenty years. Hardware designers predict that processors will become an order of magnitude more parallel every five years, meaning that in fifteen years' time processors will be of the order of a thousand times more parallel than the already very parallel processors that exist today!

We will come back to what this means for all of us as users or developers of software in the 'Next Steps' section later in this report.


GPU parallel programming models

With the rapidly increasing popularity of GPU-based computing, multiple different programming models have emerged for these new platforms. These include vendor-specific examples, such as NVIDIA's CUDA, The Portland Group's PGI Accelerator™ [113], CAPS enterprise's HMPP Workbench [23] and Microsoft's DirectCompute [81], as well as open standards such as OpenCL. We shall now look at the two most widely used GPU programming models in more detail: CUDA and OpenCL.

CUDA

NVIDIA's CUDA is currently the most mature and easiest to use parallel programming model for GPUs, first appearing on the market in 2006 [91]. CUDA was designed to be familiar to as many software developers as possible, and so its first instantiation began as a variant of the C programming language, called CUDA C. Indeed, in developing CUDA C code it is often possible to incrementally port an existing C application, using profiling to identify those parts of the code that consume most of the run-time (the 'hot spots') and converting these to run in parallel on the many-core GPU.

This 'incremental porting' approach can be very effective and certainly lowers the barrier to entry for many users. But in our experience, most software developers find that they can only get so far with this approach. The reason for this limitation usually comes back to the design of the underpinning algorithm. If it was not designed to be massively parallel, then it is likely that the current software implementation cannot easily be modified in an incremental way in order to make it massively parallel. Of course there are always exceptions, but often the incremental porting approach will only take you so far.

Coming back to CUDA C, to demonstrate how familiar it can be, consider the two code examples in Figure 7. On the left is some ordinary ANSI C code for performing a simple vector addition. Because it is written in a serial programming language, this code uses a loop and an array index to run along the two input vectors, producing one output element at each loop iteration.

In the CUDA C example on the right, you can see the loop has completely disappeared. Instead, many copies of the function are executed in parallel, one copy for each parallel data item we wish to process. In this instance we would launch 'size' copies of the vector addition function, all of which could run in parallel. At execution time the run-time system decides how to map efficiently the expressed parallelism onto the available parallel hardware resources.

It is important not to underestimate the level of challenge when writing code for GPUs. The trivial examples in Figure 7 are very small in order to illustrate a point. In a real example, modifying the algorithm in order to express enough parallelism to achieve good performance on a GPU often requires significant intellectual effort.
And while the parallel kernels themselves may often look quite similar to a highly optimised sequential equivalent, significant additional code is required in order to initialise the parallel environment, orchestrate the GPUs in the system, and communicate data efficiently, often by overlapping the data movement with computation.

The CUDA C code on the right hand side of Figure 7 is an example of a 'kernel' — this is common terminology for the parallel version of a routine that has been modified in order to run on a many-core processor.

// ANSI C vector addition example (left hand side of Figure 7)
void vectorAdd(const float *a,
               const float *b,
               float *c,
               unsigned int size)
{
    unsigned int i;
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}
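For reference, the following is a minimal sketch of the kind of CUDA C kernel and launch described above; the thread-block size of 256 and the guarding test against 'size' are choices made for this sketch rather than details taken from Figure 7.

// CUDA C vector addition kernel: one thread computes one output element
__global__ void vectorAdd(const float *a,
                          const float *b,
                          float *c,
                          unsigned int size)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)              // guard threads that fall beyond the end of the vectors
        c[i] = a[i] + b[i];
}

// Illustrative host-side launch: enough 256-thread blocks to cover 'size'
// elements, i.e. effectively 'size' copies of the function run in parallel.
// vectorAdd<<<(size + 255) / 256, 256>>>(d_a, d_b, d_c, size);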


CUDA uses slightly different terminology for the Processing Elements (which it calls 'threads') and Compute Units (which it calls 'thread blocks'), but generally it is considered to be fairly straightforward to port code between CUDA C and similar programming models, such as OpenCL.

CUDA threads execute independently and thus ideally on independent data — this is why data parallelism is such a natural fit for these kinds of architectures. Threads in a thread block are grouped to execute essentially the same program at the same time, but on different pieces of data. This data-parallel approach is known as Single Instruction Multiple Data (SIMD) [38]. On the other hand, different thread blocks may execute completely different programs from one another if the need arises, although most applications tend to run the same program on all the thread blocks at the same time, essentially turning the computation into one large SIMD or vector calculation.

CUDA's maturity brings a number of benefits for software developers, including a growing number of software development tools, including debuggers and profilers [97]. In 2009 CUDA Fortran arrived, as a joint development between NVIDIA and the Portland Group [95]. CUDA Fortran takes the principles of CUDA C and weaves them into a state-of-the-art commercial Fortran 2003 compiler.

OpenCL

OpenCL bears many similarities to CUDA; indeed NVIDIA is one of the main contributors to the OpenCL standard, so this should be no surprise. The biggest differences are in the way OpenCL is being developed. Whereas CUDA is a proprietary solution being driven by a single vendor, OpenCL is an open standard, instigated by Apple but now being driven by a consortium of over 35 companies, including all the major processor vendors such as Intel, IBM and AMD. The consortium is being organised and run by the Khronos Group [67].

OpenCL is a much more recent development than CUDA and is correspondingly less mature. However, OpenCL also includes a number of more recent advances for supporting heterogeneous computing in systems combining multiple CPUs and GPUs. The first version of the OpenCL standard was released in December 2008 [66], and OpenCL has been developing rapidly since then. It has already been integrated into recent versions of Apple's OS X operating system. AMD and NVIDIA have released implementations for their GPUs, the former also including a version that will run on a multi-core x86 host CPU. IBM has demonstrated a version of OpenCL running on its Cell processor and recently released a version for their high-end POWER architecture [56]. Intel released its first OpenCL implementation for its multi-core x86 CPUs in late 2010 [27, 60]. Embedded processor companies are also developing their own OpenCL solutions, including ARM [11, 99], Imagination Technologies [57] and ZiiLabs [122]. These last three companies provide the CPUs and GPUs in most of the popular consumer electronics gadgets, such as smartphones and portable MP3 players.

While OpenCL is less mature than NVIDIA's CUDA and has some of the drawbacks of committee-designed standards, its benefits are the openness of the standard, the vast resource being ploughed into its development by many companies and, most importantly, its cross-platform capabilities.

OpenCL is quite a low-level solution.
It exposes features that many software developers may not have had to deal with before. CUDA has similar features but includes a higher-level application programmer interface (API) that conveniently handles much of the low-level detail. But in OpenCL this is all left up to the programmer. One example is the explicit use of queues for sending commands such as 'run this kernel' from the host processor to the many-core GPU. It is expected that as OpenCL matures, various solutions will emerge to abstract away this lower-level detail, leaving most programmers to operate at a higher level. An interface has already been developed that provides this facility for C++ programs.

One of OpenCL's other important characteristics is that it has been designed to support heterogeneous computing from Day One; that is, it supports running code simultaneously on multiple, different kinds of processors, all within a single OpenCL program. When re-engineering software this is an important consideration: adopting a programming environment that supports a wide range of heterogeneous parallel hardware will give developers the greatest flexibility when deploying their re-engineered codes in the future. For example, an OpenCL program could decide to run one task on one of the host processor cores, while running another task using a many-core GPU, and do this all in parallel. These multiple OpenCL tasks can easily coordinate between themselves, passing data and signals from one to the other. Because almost all processors will combine elements of both heterogeneity and many-core in the future, this is a critical feature to support.
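Returning to the explicit command queues mentioned above, the fragment below sketches the host-side steps an OpenCL 1.x program typically performs to run a kernel like the vector addition of Figure 8: pick a device, create a context and a command queue, build the kernel from source, and enqueue the work. It is purely illustrative; all error checking is omitted and the array size is an arbitrary choice.

#include <CL/cl.h>
#include <stdio.h>

#define N 1024

static const char *src =
    "__kernel void vectorAdd(__global const float *a,   \n"
    "                        __global const float *b,   \n"
    "                        __global float *c)         \n"
    "{                                                  \n"
    "    int i = get_global_id(0);                      \n"
    "    c[i] = a[i] + b[i];                            \n"
    "}                                                  \n";

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Select a GPU device and create a context plus a command queue for it */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Device buffers; the inputs are copied from host memory at creation time */
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    /* Build the kernel from source and set its arguments */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vectorAdd", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

    /* 'Run this kernel': the command is sent to the device through the queue */
    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read: wait for the result and copy it back to host memory */
    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    printf("c[1] = %f\n", c[1]);   /* expect 3.0 */
    return 0;
}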


The ability to run OpenCL code on almost any multi-core mainstream processor makes it extremely attractive as a method for exploiting many-core parallelism, be it on today's increasingly multi-core CPUs or tomorrow's heterogeneous many-core processors.

For completeness, Figure 8 shows the simple vector addition example once again, this time in OpenCL.

// OpenCL vector addition example
__kernel void vectorAdd(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    // Vector element index
    int nIndex = get_global_id(0);
    c[nIndex] = a[nIndex] + b[nIndex];
}

Figure 8: OpenCL vector addition kernel example. Note the similarity with the CUDA example on the right hand side of Figure 7.

To learn more about OpenCL, Scarpino's 'A gentle introduction to OpenCL' [105] is a good place to start. Two OpenCL text books are due to appear at about the same time as this report: Munshi et al's 'OpenCL Programming Guide' [86] and Gaster et al's 'Heterogeneous Computing with OpenCL' [40].

Other many-core programming models

While we have focused our descriptions on the many-core programming models that are emerging to support GPUs, there exist other models that were created primarily to support multi-processor systems. There are many such models, but the most commonly used fall mainly into two categories — message passing and shared memory.

Message passing programming models provide methods for running collections of tasks in parallel, and for communicating data between these tasks by sending messages over communication links.

Shared memory programming models assume a system of parallel processors connected together via a memory that can be seen by all of the processors.

Both of these models are used in the mainstream today and we shall briefly describe the most common examples of each.

Message passing, MPI — MPI 'is a message-passing application programmer interface (API), together with protocol and semantic specifications for how its features must behave in any implementation' [49]. MPI has been proven to scale to large high-performance computing systems consisting of hundreds of thousands of cores and underpins most parallel applications running on large-scale systems. It is most commonly used to communicate between tasks running on multiple servers, with the MPI protocol used to send messages over the network connecting the servers together.

As well as providing basic primitives for point-to-point communications, MPI includes collective communication operations such as broadcast and reduction. The more recent MPI version 2 (MPI-2) adds support for parallel I/O and dynamic process management [112]. Many implementations of MPI exist, with the standard specifying bindings for C, C++ and Fortran.

Shared memory, OpenMP — OpenMP is the most commonly used high-level shared-memory programming model in scientific computing [25]. Its API is implemented in the form of compiler directives for C, C++ and Fortran. Most major hardware vendors support it, as do most major compilers, including GNU gcc, Intel's ICC and Microsoft's Visual Studio.

OpenMP supports task-level parallelism (also known as Multiple Instruction Multiple Data or MIMD) in addition to the data-level parallelism (SIMD) that we have already met [38].
Shared memory, OpenMP — OpenMP is the most commonly used high-level shared-memory programming model in scientific computing [25]. Its API is implemented in the form of compiler directives for C, C++ and Fortran. Most major hardware vendors support it, as do most major compilers, including GNU gcc, Intel's ICC and Microsoft's Visual Studio. OpenMP supports task-level parallelism (also known as Multiple Instruction Multiple Data or MIMD) in addition to the data-level parallelism (SIMD) that we have already met [38]. Sections of code that can be executed in parallel are identified with the addition of special compiler markers, or pragmas, that tell an OpenMP-compatible compiler what to do. In theory these pragmas are simply ignored by non-OpenMP compilers, which would generate valid serial executables from the same source code.
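As a sketch of what such a pragma looks like in C (the report's own example, in Fortran, follows as Figure 9), the directive below asks an OpenMP compiler to share the loop iterations across threads; the arrays a, b and c are assumed to be allocated by the caller.

// Vector addition in C with an OpenMP pragma. A compiler without OpenMP
// support simply ignores the pragma and produces an ordinary serial loop.
void vector_add(const float *a, const float *b, float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}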


For comparison, Figure 9 shows our by now familiar simple vector addition example, this time in OpenMP, and for variety, in Fortran (the OpenMP Fortran pragmas are the lines beginning with '!$omp').

!$omp parallel default(none) &
!$omp shared(a, b, c) private(i)
!$omp do
do i = 1, size
    c(i) = a(i) + b(i)
end do
!$omp end do
!$omp end parallel

Figure 9: Vector addition example in Fortran extended with OpenMP.

Work is already under way developing the next major version of OpenMP [109] and early signs indicate that consideration is being given to extensions for supporting many-core architectures. Thus if the reader is already using OpenMP we would recommend following the developments in this area from the standards committee and also active OpenMP vendors such as Cray.

Hybrid solutions — It is possible to combine two or more of these parallel programming models for hybrid solutions. In the parallel programming mainstream today, one of the most common approaches is to use a hybrid system of OpenMP within each multi-core server node and MPI between nodes. Much has been written about this particular hybrid approach [109].

An increasingly common approach is to combine MPI with one of the new GPU-oriented parallel programming systems, such as CUDA or OpenCL. This enables the creation of systems from multiple nodes, each node including one or more GPUs. Several of the fastest computers in the world have recently been constructed in just this fashion, including TSUBAME 2.0 at Tokyo Tech in Japan [50]. Indeed the second fastest computer in the world in June 2011 was a GPU-powered cluster: China's Tianhe-1A system at the National Supercomputer Center in Tianjin, with a performance of 2.57 PetaFLOPS (2.57×10^15 floating point operations per second, or 2.57 million GFLOPS) [16, 114]. As of the International Supercomputing Conference in June 2011, three of the top ten supercomputers in the Top500 list are GPU accelerated. The fraction of systems in the Top500 that achieve their performance by GPU acceleration is set to increase rapidly.
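A hedged sketch of this hybrid MPI-plus-OpenCL pattern is shown below: MPI provides the processes (typically one per GPU), and each process selects one of the GPUs on its node for its OpenCL work. The simple rank-to-GPU mapping, the fixed device array and the assumption that at least one GPU is present are illustrative simplifications; a production code would use a node-local rank and proper error checking.

// A sketch of the hybrid pattern: MPI between nodes, OpenCL on each GPU.
// Each MPI process picks one of the GPUs visible on its node.
#include <mpi.h>
#include <CL/cl.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Find the GPUs on this node (first platform only, at most 8 devices).
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id gpus[8];
    cl_uint num_gpus = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, gpus, &num_gpus);
    if (num_gpus > 8)
        num_gpus = 8;

    // Simplistic mapping of MPI ranks to GPUs; a real code would use a
    // node-local rank rather than the global rank.
    cl_device_id dev = gpus[rank % num_gpus];
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);

    // ... build kernels and enqueue work on ctx, exchange results with MPI ...

    clReleaseContext(ctx);
    MPI_Finalize();
    return 0;
}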


Current Challenges

Like all breakthroughs in technology, the change from multi-core to many-core computer architectures will not be smooth for everyone. There are significant challenges during the transition, some of which are outlined below.

1. Porting code to massively parallel heterogeneous systems is often (but not always) harder than ports to new hardware have been in the past. Often completely new algorithms are required.

2. Many-core technologies are still relatively new, with implications for the maturity of software tools, the lack of software developers with the right skills and experience, and the paucity of ported application software and libraries.

3. Even though cross-platform programming languages such as OpenCL are now emerging, these have so far focused on source code portability and cannot guarantee performance portability. This is of course not a new issue; any highly optimised code written in a mainstream language such as C, C++ or Fortran has performance portability issues between different architectures. However, differences between GPU architectures are even greater than those between CPU architectures, and so performance portability is set to become a greater challenge in the future.

4. There are multiple competing open and de facto standards, which inevitably confuse the situation for potential adopters.

5. Many current GPGPU products still carry some of their consumer graphics heritage, including the lack of important hardware reliability features such as Error Correcting Codes (ECC) on their memories. Even where these features do exist, they currently incur prohibitive performance penalties that are not present in the corresponding CPU solutions.

6. There is a lot of hype around GPU computing, with many over-inflated claims of performance speedups of 100 times or more. These claims increase the risk of setting expectations too high, with subsequent disappointment from trial projects.

7. The lack of industry standard benchmarks makes it difficult for users to compare competing many-core products simply and accurately.

Of all these challenges, the most fundamental is the design and development of new algorithms that will naturally lend themselves to the massive parallelism of GPUs today, and to the ubiquitous heterogeneous multi-/many-core systems of tomorrow. If as a community we can design adaptive, highly scalable algorithms, ideally with heterogeneity and even fault-tolerance in mind, we will be well placed to exploit the rapid development of parallel architectures over the next two decades.


Next Steps

Audit your software

One valuable practical step we can each take is to perform an audit of the software we use that is performance-critical to our work. For software developed by third parties, find out what their policy is towards supporting many-core processors such as GPUs. Is their software already parallel? If so, how scalable is it? Does it run effectively on a quad-core CPU today? What about the emerging 8, 12 and 16 core CPUs? Do they have a demonstration of their software accelerated on a GPU? What is their roadmap for the software? The software licensing model is something that you also need to be conscious of — is the software licensed per core, processor, node, user, ...? Will you have to pay more for an upgrade that supports many-core processors? Some vendors will supply these features as no-cost upgrades; others will charge extra for them.

It is also important to be specific about parallel acceleration of the particular features you use in the software in question. For example, at the time of writing there are GPU-accelerated versions of dense linear algebra solvers in MATLAB, but not of sparse linear algebra solvers [73]. Just because an application claims to be 'GPU-accelerated', it does not necessarily follow that your particular use of that application will gain the performance benefit of GPUs. Your mileage will definitely vary, so check with the supplier of your software to verify before committing.

Plan for parallelism

If you develop your own software, start thinking about what your own path to parallelism should be. Are the users of your software likely to stick primarily to multi-core processors in mainstream laptops, desktops and servers? If so, you should be thinking about adopting OpenMP, MPI or another widely supported approach for parallel programming. You should probably avoid proprietary approaches that may not support all platforms or be commercially viable in the long term. Instead, use open standards available across multiple platforms and vendors to minimise your risk.

Also consider when many-core processors will feature in your roadmap. These are inevitable — even the mainstream CPUs will rapidly become heterogeneous many-cores, so this really is a 'when' not an 'if'. If you do not want to support many-core processors in the near or medium term, OpenMP and MPI will be good choices. If, however, you may want to support many-core processors within the next few years, you will need a plan to adopt either OpenCL or CUDA sooner rather than later. OpenCL might be a viable alternative to OpenMP on multi-core CPUs in the short term.

If you are going to develop your own many-core aware software there is a tremendous amount of support that you can tap into.

Attend a workshop

In the UK each year there are several training workshops in the use of GPUs. The national supercomputing service HECToR [53] is starting to provide GPU workshops on CUDA and OpenCL programming; the timetable for these is available online [52]. Prof Mike Giles at the University of Oxford regularly runs CUDA programming workshops; for the date of the next one see [43]. His webpage also includes excellent links to other GPU programming resources. A search for GPU training in the UK should turn up offerings from several universities.
There are also commercial training providers in the UK that are worth considering, such as NAG [89]. Daresbury Labs has a team who track the latest processor technologies and who are already experienced in developing software for GPUs. This group holds occasional seminars and workshops, and is willing to offer advice and guidance for newcomers to many-core technologies [30]. GPU vendors will often provide assistance if asked, especially if their assistance could lead to new sales.

There are also many conferences and seminars emerging to address many-core computing. The UK now has a regular GPU developers workshop. Previous years have seen the workshop held in Oxford [1] and Cambridge [2]. The 2011 workshop is due to be held at Imperial College.

A useful GPU computing resource is GPUcomputing.net [47]. In particular it has a page dedicated to GPU computing in the UK [45]. This site is mostly dominated by the use of NVIDIA GPUs, reflecting NVIDIA's lead in the market, but over time the site should see a greater percentage of content coming from work on a wider range of platforms.


A good place to get started with OpenCL is AMD's OpenCL webpage [9]. NVIDIA's OpenCL webpage [98] is also useful, as is their OpenCL 'jump start' guide [92].

We would also recommend reading some of the latest work in the area of parallel algorithms. A good place to start is 'The view from Berkeley' [12], which builds on earlier work by Per Brinch Hansen [21]. Tim Mattson's 'Patterns for Parallel Programming' is another notable book in this area [74]. These works recognise that most forms of scientific computation can be decomposed into a small set of common kernels, often known as 'dwarfs', 'motifs' or 'templates'. Decomposing algorithms into sub-algorithms that may in turn be classified as a particular kind of dwarf makes it easier to exploit the substantial works of the past and present, thus simplifying the task of identifying parallel approaches for your own algorithms where they are already known.

Finally, there is an online list of applications that have already been ported to NVIDIA's GPUs [94].


Appendix 1: Active Researchers and Practitioner Groups

This is a rapidly growing field and there are already many active researchers and developers using GPU computing in the UK and around the world. The GPU developer conference held at the University of Cambridge in December 2010 saw around 100 attendees, and the Many-Core and Reconfigurable Supercomputing Conference at Bristol in April 2011 saw over 130 delegates, all actively working within this area.

In the UK

The list below is not intended to be exhaustive but instead to give a cross section of the GPU computing community in the UK and beyond. Our apologies are offered in advance to anyone we left off this list.

University of Bristol: McIntosh-Smith is a computer architect with extensive industrial experience designing many-core heterogeneous architectures. He has more recently been developing software and algorithms for GPUs, and has been working on new, massively parallel algorithms for a range of different applications including molecular docking, computational chemistry, climate modelling, and linear algebra [78].

University of Oxford: Giles is the leading expert in the UK in the use of many-core GPUs for financial applications and also for unstructured grid problems. Giles runs regular GPU programming workshops in NVIDIA's proprietary CUDA programming language [43]. Also at Oxford, Karastergiou has been investigating the use of GPUs for the real-time processing of data for the future international telescope project, the Square Kilometre Array (SKA) [64].

University of Warwick: Jarvis's group has performed extensive work modelling the performance of large-scale HPC systems, more recently looking at the effects of adding GPUs into these systems. The group has a particular focus on wavefront codes [62].

Imperial College: Kelly is an expert in high-level tools for automatically generating optimised code for many-core architectures. His research interests include parallelising compilers and auto-tuning techniques [65].

UCL: Atkinson is one of the leaders in applying GPUs to medical imaging problems, in particular Magnetic Resonance Imaging (MRI) [13].

Edinburgh Parallel Computing Centre: EPCC is one of the leading providers of high-performance computing training in the UK [34], and can provide consultancy to help parallelise algorithms and software for parallel computer architectures.

University of Manchester: Manchester has a vibrant community of GPU users and runs a 'GPU club' that meets on a semi-regular basis, attracting around one hundred attendees to recent events [72].

Queen's University Belfast: Gillan and Scott lead a group of experts in using many-core architectures for image processing applications [46, 108]. This group provides expertise and consultancy to third parties who need to adapt their codes for GPUs and for reconfigurable systems based on Field-Programmable Gate Arrays (FPGAs).

University of Cambridge: Pullan and Brandvik were two of the earliest adopters of many-core architectures in the UK. They are experts in using GPUs for CFD and in porting structured grid codes to many-core architectures [20].
Also at Cambridge, Gratton has been investigating the use of AMD and NVIDIA GPUs for accelerating cosmology codes [48]. All three are part of Cambridge's Many-Core Computing Group, which also has expertise in the use of GPUs for medical imaging and genomic sequence processing [22].

Daresbury Laboratories: the Distributed Computing (DisCo) group at Daresbury Labs regularly runs workshops in the use of many-core architectures and can provide support in porting codes to GPUs. They also have access to various GPU hardware platforms that they can make available for remote testing [30].


AWE: Turland leads a group using GPUs to accelerate industrial wavefront simulation codes. His group has been developing OpenCL-based methods that abstract away from specific architectures and vendor programming models, allowing them to develop many-core applications that can perform well on a wide range of different hardware [15].

BAE Systems: Appa was one of the earliest adopters of many-core architectures and GPUs in the world. He now leads the team within the mathematical modelling group at BAE Systems that is using GPUs to accelerate CFD computations to great effect [37].

The Numerical Algorithms Group: NAG has a team developing commercial many-core software libraries for linear algebra and random number generators, amongst others. NAG also provides commercial many-core training courses and software development consultancy services [89].

The MathWorks: The MATLAB developer is doing a lot of work to target GPUs from within their software, much of which is taking place at their Cambridge development office [73].

Allinea: Allinea is a UK-based company developing some of the leading multi-core and many-core software development tools, including debuggers and profilers for GPUs [6]. Developers of large-scale software using hundreds or thousands of CPUs and GPUs could benefit significantly from these kinds of parallel debuggers.

Internationally

Innovative Computing Lab, University of Tennessee: Dongarra is a world expert in linear algebra libraries. His group is spearheading the development of highly scalable, dense linear algebra libraries for heterogeneous many-core processors and massively parallel systems [35, 36].

University of Illinois, Urbana-Champaign: Hwu is another leading global figure in GPU-based computing. His particular interest is in using GPUs to accelerate medical imaging applications [54]. UIUC is also one of the leading academic institutions in the use of GPUs to accelerate molecular dynamics and visualisation codes, such as NAMD and VMD [115].

Oak Ridge National Lab: Vetter's Future Technologies group at ORNL is planning one of the largest supercomputers in the world based on GPU technology [116]. Vetter's group has also developed one of the first GPU benchmarks, the SHOC suite, which is useful for comparing the performance of different many-core architectures [29, 117].

Tokyo Institute of Technology: Matsuoka is one of the leading HPC experts in Japan and is a specialist in building highly energy-efficient supercomputers based on GPU technologies. He is the architect of the TSUBAME 2.0 GPU-based supercomputer, the fifth fastest in the world as of June 2011. TSUBAME 2.0 is also one of the most energy-efficient production systems in the world [50]. Matsuoka's group also develops GPU software, and has developed one of the fastest FFT libraries for GPUs [90].

Stanford: Pande's group developed one of the first GPU-accelerated applications, the Folding@Home screensaver-based protein folding project [100]. This group went on to develop the OpenMM open source library for molecular mechanics.
OpenMM takes advantage of GPUs via CUDA and OpenCL and is a good starting point for porting molecular dynamics codes to GPUs [39, 101].

NVIDIA: NVIDIA was the first GPU company to see the real potential of GPU-based computing. Having introduced the CUDA parallel programming language in 2006, NVIDIA has the most applications ported to their many-core architecture. Their ported applications webpage is a good place to find out what software is available for GPUs [94].

The Portland Group: PGI provides software development tools including a parallel Fortran compiler that generates code for GPUs [103].

Acceleware: Acceleware provides multi-core and GPU-accelerated software solutions for the electromagnetics and oil and gas industries [4].

SciComp: SciComp is a financial software services company that provides a suite of software tools that exploit GPUs and multi-core CPUs to deliver significant performance advantages over traditional approaches [106].


Appendix 2: Software Applications Available on GPUs

The table below lists several important applications along with their current multi-core and GPU capabilities. These will be changing all the time so do check for the latest information when considering any application. This list is far from exhaustive; readers are encouraged to investigate further if their favourite codes are not included.

MATLAB
    Multi-core: Some built-in ability to exploit parallelism at the level of libraries such as BLAS and LAPACK. Additionally provides a 'Parallel computing toolbox' to exploit multi-core CPUs [111].
    Many-core: Has a beta release of a GPU-accelerated version of MATLAB [73]. Various third-party solutions available, such as AccelerEyes' Jacket [3].

Mathematica
    Multi-core: Built-in support for parallelism since Mathematica 7 [119]; also provides gridMathematica for large-scale parallel computation [118].
    Many-core: Mathematica 8 now includes support for GPUs via CUDA and OpenCL [120].

NAG
    Multi-core: NAG has a software library product for shared memory and multi-core parallel systems [87].
    Many-core: Has a beta release of random number generators using CUDA on NVIDIA GPUs [88].

R
    Multi-core: Multiple R packages are available for exploiting parallelism on multi-core systems [32].
    Many-core: Numerous open source projects porting R to GPUs, including R+GPU, available in the gputools R package [80].

SciFinance
    Multi-core: This code synthesis tool for building derivatives pricing and risk models can already generate multi-core code using OpenMP [107].
    Many-core: Already supports the generation of CUDA code for PDEs and SDEs [107].

BLAS, LAPACK math libraries
    Multi-core: Almost all BLAS and LAPACK libraries already support multi-core processors: Intel's MKL, AMD's ACML, IBM's ESSL, ATLAS, ScaLAPACK, PLASMA.
    Many-core: This is an area where much software is already available for GPUs: NVIDIA's CUBLAS, AMD's ACML-GPU, Acceleware, EM Photonics' Cula Tools and MAGMA.


References

[1] 1st CUDA Developers' Conference. www.smithinst.ac.uk/Events/CUDA2009, December 2009.

[2] 2nd UK GPU Computing Conference. www.many-core.group.cam.ac.uk/ukgpucc2, December 2010.

[3] AccelerEyes — MATLAB GPU computing. www.accelereyes.com, April 2011.

[4] Acceleware Ltd — Multi-Core Software Solutions. www.acceleware.com, July 2011.

[5] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. Journal of Physics: Conference Series, 180(1), 2009.

[6] Allinea Software. allinea.com, May 2011.

[7] AMD Corp — AMD Core Math Library (ACML). developer.amd.com/cpu/Libraries/acml/Pages, April 2011.

[8] AMD Corp — AMD Core Math Library for graphic processors (ACML-GPU). developer.amd.com/gpu/acmlgpu/pages, April 2011.

[9] AMD Corp — AMD Developer Central: getting started with OpenCL. developer.amd.com/zones/OpenCLZone/Pages/gettingstarted.aspx, April 2011.

[10] AMD Corp — AMD Developer Central: Magny-Cours zone. developer.amd.com/zones/magny-cours/pages/default.aspx, April 2011.

[11] ARM Ltd — ARM heralds new era in embedded graphics with next-generation Mali GPU. www.arm.com/about/newsroom/arm-heralds-new-era-in-embedded-graphics-with-next-generation-mali-gpu.php, April 2010.

[12] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[13] David Atkinson. www.cs.ucl.ac.uk/staff/d.atkinson, July 2011.

[14] Dan Bailey. Fast GPU preconditioning for fluid simulations in film production (GTC 2010 presentation archive). www.nvidia.com/object/gtc2010-presentation-archive.html#session2239, April 2011.

[15] Dave Barrett, Ron Bell, Claire Hepwood, Gavin Pringle, David Turland, and Paul Graham. An investigation of different programming models and their performance on innovative architectures. In The Many-Core and Reconfigurable Supercomputing Conference (MRSC), 2011.

[16] BBC — China's Tianhe-1A crowned supercomputer king. www.bbc.co.uk/news/technology-11766840, November 2010.

[17] Thomas Bradley, Jacques du Toit, Robert Tong, Mike Giles, and Paul Woodhams. Parallelization techniques for random number generations. In Wen-mei W. Hwu, editor, GPU Computing Gems Emerald Edition, chapter 16, pages 231–246. Morgan Kaufmann, 2011.

[18] Tobias Brandvik and Graham Pullan. SBLOCK: a framework for efficient stencil-based PDE solvers on multi-core platforms. In 10th IEEE International Conference on Computer and Information Technology, pages 1181–1188, 2010.

[19] Tobias Brandvik and Graham Pullan. An accelerated 3D Navier-Stokes solver for flows in turbomachines. Journal of Turbomachinery, 133(2):021025, 2011.

[20] Tobias Brandvik and Graham Pullan. Turbostream: ultra-fast CFD for turbomachines. www.turbostream-cfd.com, April 2011.


[21] Per Brinch Hansen. Model programs for computational science: a programming methodology for multicomputers. Concurrency — Practice and Experience, 5(5):407–423, August 1993.

[22] University of Cambridge — Cambridge many-core group projects. www.many-core.group.cam.ac.uk/projects, May 2011.

[23] CAPS entreprise — HMPP Workbench. www.caps-entreprise.com/hmpp.html, August 2011.

[24] CFD Online. www.cfd-online.com, April 2011.

[25] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2007.

[26] Tim Childs. 3DMashUp. www.scribd.com/3dmashup, April 2011.

[27] Peter Clarke. Intel releases alpha OpenCL SDK for Core. www.eetimes.com/electronics-news/4210812/Intel-releases-alpha-OpenCL-SDK-for-Core, November 2010.

[28] P. I. Crumpton and M. B. Giles. Multigrid aircraft computations using the OPlus parallel library. In A. Ecer, J. Periaux, N. Satofuka, and S. Taylor, editors, Parallel Computational Fluid Dynamics: Implementations and Results Using Parallel Computers, pages 339–346. North-Holland, 1996.

[29] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 63–74. ACM, 2010.

[30] DisCo: Daresbury Laboratory distributed computing group. www.cse.scitech.ac.uk/disco, September 2010.

[31] Double Negative Visual Effects. www.dneg.com, April 2011.

[32] Dirk Eddelbuettel. CRAN task view: high-performance and parallel computing with R. cran.r-project.org/web/views/HighPerformanceComputing.html, February 2011.

[33] Erich Elsen, Patrick LeGresley, and Eric Darve. Large calculation of the flow over a hypersonic vehicle using a GPU. Journal of Computational Physics, 227:10148–10161, December 2008.

[34] EPCC. www.epcc.ed.ac.uk, May 2011.

[35] Jack Dongarra et al. PLASMA. icl.cs.utk.edu/plasma/index.html, November 2010.

[36] Jack Dongarra et al. MAGMA. icl.cs.utk.edu/magma/index.html, April 2011.

[37] Mark Fletcher. BAE Systems packs massive power thanks to GPU coprocessing. mcad-online.com/en/bae-systems-packs-massive-power-thanks-to-gpu-co-processing.html?cmp_id=7&news_id=1078, May 2010.

[38] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21:948–960, September 1972.

[39] M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. LeGrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande. Accelerating molecular dynamic simulation on graphics processing units. Journal of Computational Chemistry, 30(6):864–872, 2009.

[40] Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, and Dana Schaa. Heterogeneous Computing with OpenCL. Morgan Kaufmann, 2011.

[41] Nick Gibbs, Anthony R. Clarke, and Richard B. Sessions. Ab initio protein structure prediction using physicochemical potentials and a simplified off-lattice model. Proteins: Structure, Function, and Bioinformatics, 43(2):186–202, 2001.


[42] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. J. Kelly. Performance analysis and optimisation of the OP2 framework on many-core architectures. The Computer Journal, 2011 (to appear).

[43] Mike Giles. Course on CUDA programming. people.maths.ox.ac.uk/gilesm/cuda, August 2011.

[44] Mike Giles. OP2 project page. people.maths.ox.ac.uk/gilesm/op2, April 2011.

[45] Mike Giles and Simon McIntosh-Smith. GPUcomputing.net: GPU Computing UK. www.gpucomputing.net/?q=node/160, April 2011.

[46] Charles J. Gillan. www.ecit.qub.ac.uk/Card/?name=c.gillan, May 2011.

[47] GPUcomputing.net. www.gpucomputing.net, April 2011.

[48] Steve Gratton. Graphics card computing for cosmology. www.ast.cam.ac.uk/~stg20/gpgpu/index.html, July 2011.

[49] William Gropp. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, second edition, 2000.

[50] GSIC, Tokyo Institute of Technology — TSUBAME2. www.gsic.titech.ac.jp/en/tsubame2, April 2011.

[51] Mark Harris. General-purpose computation on graphics hardware. gpgpu.org/about, July 2011.

[52] HECToR training. www.hector.ac.uk/cse/training, April 2011.

[53] HECToR: UK National Supercomputing Service. www.hector.ac.uk, April 2011.

[54] Wen-mei W. Hwu. The IMPACT research group at UIUC. impact.crhc.illinois.edu/people/current/hwu/hwu_research.php, May 2011.

[55] IBM — Engineering and scientific subroutine library (ESSL) and parallel ESSL. www-03.ibm.com/systems/software/essl, April 2011.

[56] IBM — OpenCL development kit for Linux on Power. www.alphaworks.ibm.com/tech/opencl, April 2011.

[57] Imagination Technologies — Imagination Technologies announces POWERVR SGX545 graphics IP core with full DirectX 10.1, OpenGL 3.2 and OpenCL 1.0 capabilities. www.imgtec.com/news/Release/index.asp?NewsID=516, January 2010.

[58] Intel Corp — Intel AVX: new frontiers in performance improvements and energy efficiency. software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency, April 2011.

[59] Intel Corp — Intel math kernel library. software.intel.com/en-us/articles/intel-mkl, April 2011.

[60] Intel Corp — Intel OpenCL SDK. software.intel.com/en-us/articles/intel-opencl-sdk, April 2011.

[61] Intel Corp — Intel Research: single-chip cloud computer. techresearch.intel.com/ProjectDetails.aspx?Id=1, April 2011.

[62] Stephen Jarvis. High Performance Computing group, University of Warwick. www2.warwick.ac.uk/fac/sci/dcs/research/pcav/areas/hpc, May 2011.

[63] Jon Peddie Research — Jon Peddie Research announces 2nd quarter PC graphics shipments. jonpeddie.com/press-releases/details/jon-peddie-research-announces-2nd-quarter-pc-graphics-shipments, July 2010.

[64] Aris Karastergiou. www.physics.ox.ac.uk/users/Karastergiou/Site/home.html, July 2011.

[65] Paul H. J. Kelly. www.doc.ic.ac.uk/~phjk, May 2011.

[66] Khronos Group — Khronos Group releases OpenCL 1.0 specification. www.khronos.org/news/press/releases/the_khronos_group_releases_opencl_1.0_specification, December 2008.


[67] Khronos Group — OpenCL: the open standard for parallel programming of heterogeneous systems. www.khronos.org/opencl, November 2010.

[68] Khronos Group — OpenGL: the industry's foundation for high performance graphics. www.opengl.org, April 2011.

[69] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: a Hands-on Approach. Morgan Kaufmann, 2010.

[70] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software, 5:308–323, September 1979.

[71] Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A scalable high performant Cholesky factorization for multicore with GPU accelerators. In Proceedings of the 9th International Conference on High-Performance Computing for Computational Science, pages 93–101. Springer-Verlag, 2010.

[72] University of Manchester, Computational Science Community Wiki. wiki.rcs.manchester.ac.uk/community/GPU/Club, May 2011.

[73] The MathWorks — MATLAB GPU computing with NVIDIA CUDA-enabled GPUs. www.mathworks.com/discovery/matlab-gpu.html, July 2011.

[74] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004.

[75] Simon McIntosh-Smith and Richard B. Sessions. An accelerated, computer assisted molecular modeling method for drug design. In International Supercomputing, June 2008. www.cs.bris.ac.uk/~simonm/publications/ClearSpeed_BUDE_ISC08.pdf.

[76] Simon McIntosh-Smith, Terry Wilson, Jon Crisp, Amaurys Ávila Ibarra, and Richard B. Sessions. Energy-aware metrics for benchmarking heterogeneous systems. SIGMETRICS Performance Evaluation Review, 38:88–94, March 2011.

[77] Simon McIntosh-Smith, Terry Wilson, Amaurys Ávila Ibarra, Richard B. Sessions, and Jon Crisp. Benchmarking energy efficiency, power costs and carbon emissions on heterogeneous systems. The Computer Journal, 2011 (to appear).

[78] Simon McIntosh-Smith. www.cs.bris.ac.uk/home/simonm, April 2011.

[79] Hans Meuer. Top500 supercomputer sites. www.top500.org, April 2011.

[80] University of Michigan, Molecular and Behavioral Neuroscience Institute — R+GPU. brainarray.mbni.med.umich.edu/brainarray/rgpgpu, May 2011.

[81] Microsoft Corp — DirectCompute. blogs.msdn.com/b/chuckw/archive/2010/07/14/directcompute.aspx, April 2011.

[82] Microsoft Corp — Microsoft DirectX. www.gamesforwindows.com/en-US/directx, April 2011.

[83] Gordon Moore. Cramming more components onto integrated circuits. Electronics Magazine, pages 114–117, April 1965.

[84] Samuel K. Moore. With Denver project NVIDIA and ARM join CPU-GPU integration race. spectrum.ieee.org/tech-talk/semiconductors/processors/with-denver-project-nvidia-and-arm-join-cpugpu-integration-race, January 2011.

[85] Trevor Mudge. Power: a first-class architectural design constraint. IEEE Computer, 34(4):52–58, April 2001.

[86] Aaftab Munshi, Benedict Gaster, Timothy Mattson, James Fung, and Dan Ginsburg. OpenCL Programming Guide. Addison-Wesley, 2011.


[87] NAG Ltd — NAG library for SMP & multicore. www.nag.co.uk/numeric/FL/FSdescription.asp, May 2011.

[88] NAG Ltd — NAG numerical routines for GPUs. www.nag.co.uk/numeric/GPUs, May 2011.

[89] NAG Ltd — NAG training. nag.co.uk/support_training.asp, April 2011.

[90] Akira Nukada and Satoshi Matsuoka. Auto-tuning 3-D FFT library for CUDA GPUs. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 30:1–30:10. ACM, 2009.

[91] NVIDIA Corp — NVIDIA unveils CUDA – the GPU computing revolution begins. www.nvidia.com/object/IO_37226.html, November 2006.

[92] NVIDIA Corp — NVIDIA OpenCL JumpStart Guide. developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf, April 2009.

[93] NVIDIA Corp — CUBLAS library. developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUBLAS_Library.pdf, August 2010.

[94] NVIDIA Corp — CUDA-accelerated applications. www.nvidia.com/object/cuda_app_tesla.html, April 2011.

[95] NVIDIA Corp — CUDA Fortran. www.nvidia.com/object/cuda_fortran.html, April 2011.

[96] NVIDIA Corp — CUDA zone. www.nvidia.com/object/cuda_home_new.html, April 2011.

[97] NVIDIA Corp — NVIDIA developer zone: tools & ecosystem. developer.nvidia.com/tools-ecosystem, April 2011.

[98] NVIDIA Corp — NVIDIA OpenCL. www.nvidia.com/object/cuda_opencl_new.html, April 2011.

[99] Tom Olson. Why OpenCL will be on every smartphone in 2014. blogs.arm.com/multimedia/263-why-opencl-will-be-on-every-smartphone-in-2014, August 2010.

[100] Vijay Pande. Folding@home. folding.stanford.edu, May 2011.

[101] Vijay Pande. OpenMM. simtk.org/home/openmm, April 2011.

[102] Kevin Parrish. Intel to ship 10-core CPUs in first half of 2011. www.tomshardware.com/news/Xeon-westmere-EX-Nehalem-EX-10-cores-160-threads,12230.html, February 2011.

[103] The Portland Group. www.pgroup.com, July 2011.

[104] Don Scansen. Closer look inside AMD's Llano APU at ISSCC. www.eetimes.com/electronics-news/4087527/Closer-look-inside-AMD-s-Llano-APU-at-ISSCC, February 2010.

[105] Matthew Scarpino. A gentle introduction to OpenCL. Dr. Dobb's Journal, July 2011. drdobbs.com/high-performance-computing/231002854.

[106] SciComp Inc — SciGPU achieves stunning acceleration for PDE and MC derivative pricing models. www.scicomp.com/parallel_computing/GPU_OpenMP, July 2011.

[107] SciComp Inc — SciFinance achieves astounding accelerations with parallel computing. www.scicomp.com/parallel_computing/GPU_OpenMP, May 2011.

[108] Stan Scott. www.qub.ac.uk/research-centres/HPDC/Profile/?name=ns.scott, May 2011.

[109] Lorna Smith and Mark Bull. Development of mixed mode MPI / OpenMP applications. Scientific Programming, 9:83–98, August 2001.

[110] Herb Sutter. The free lunch is over: a fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3), March 2005. www.gotw.ca/publications/concurrency-ddj.htm.


[111] The MathWorks — MATLAB's parallel computing toolbox. www.mathworks.com/products/parallel-computing/?s_cid=HP_FP_ML_parallelcomptbx, May 2011.

[112] The MPI Forum — MPI: a message-passing interface standard, Version 2.2. www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf, September 2009.

[113] The Portland Group — PGI Accelerator Compilers. www.pgroup.com/resources/accel.htm, August 2011.

[114] Top500.org — Tianhe-1A. www.top500.org/system/10587, April 2010.

[115] University of Illinois at Urbana-Champaign — GPU acceleration of molecular modeling applications. www.ks.uiuc.edu/Research/gpu, April 2011.

[116] Jeffrey Vetter. Toward exascale computational science with heterogeneous processing. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, page 1. ACM, 2010.

[117] Jeffrey Vetter. The SHOC benchmark suite. ft.ornl.gov/doku/shoc, April 2011.

[118] Wolfram Research — gridMathematica. www.wolfram.com/products/gridmathematica, May 2011.

[119] Wolfram Research — Mathematica 7 built-in parallel computing. www.wolfram.com/products/mathematica/newin7/content/BuiltInParallelComputing, May 2011.

[120] Wolfram Research — Mathematica 8 CUDA and OpenCL support. www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support, May 2011.

[121] Dong Hyuk Woo and H-H. S. Lee. Extending Amdahl's law for energy-efficient computing in the many-core era. IEEE Computer, 41(12):24–31, December 2008.

[122] ZiiLABS Pte Ltd — ZiiLABS OpenCL SDK. www.ziilabs.com/technology/opencl.aspx, April 2011.


The GPU Computing Revolution
From Multi-Core CPUs to Many-Core Graphics Processors
A Knowledge Transfer Report from the London Mathematical Society and the Knowledge Transfer Network for Industrial Mathematics
by Simon McIntosh-Smith

The London Mathematical Society (LMS) is the UK's learned society for mathematics. Founded in 1865 for the promotion and extension of mathematical knowledge, the Society is concerned with all branches of mathematics and its applications. It is an independent and self-financing charity, with a membership of around 2,300 drawn from all parts of the UK and overseas. Its principal activities are the organisation of meetings and conferences, the publication of periodicals and books, the provision of financial support for mathematical activities, and the contribution to public debates on issues related to mathematics research and education. It works collaboratively with other mathematical bodies worldwide. It is the UK's adhering body to the International Mathematical Union and is a member of the Council for the Mathematical Sciences, which also comprises the Institute of Mathematics and its Applications, the Royal Statistical Society, the Operational Research Society and the Edinburgh Mathematical Society.
www.lms.ac.uk

The Knowledge Transfer Network for Industrial Mathematics (IM-KTN) brings to life the underpinning role of mathematics in boosting innovation performance. It enables mathematical modelling and analysis to add value to product design, operations and strategy across all areas of business, recognising that innovation often occurs on the interfaces between existing capabilities. The IM-KTN connects together companies (of all sizes), public sector organisations, government agencies and universities, to exploit these opportunities using mechanisms that are established catalysts of new activity. It is a vehicle for developing new capability, sharing knowledge and experience, and helping shape science and technology strategy. The IM-KTN currently has the active engagement of about 450 organisations and is managed on behalf of the Technology Strategy Board by the Smith Institute for Industrial Mathematics and System Engineering.
www.innovateuk.org/mathsktn

The LMS-KTN Knowledge Transfer Reports are an initiative that is coordinated jointly by the IM-KTN and the Computer Science Committee of the LMS. The reports are being produced as an occasional series, each one addressing an area where mathematics and computing have come together to provide significant new capability that is on the cusp of mainstream industrial uptake. They are written by senior researchers in each chosen area, for a mixed audience in business and government. The reports are designed to raise awareness among managers and decision-makers of new tools and techniques, in a format that allows them to assess rapidly the potential for exploitation in their own fields, alongside information about potential collaborators and suppliers.
