THE GPU COMPUTING REVOLUTION: From Multi-Core CPUs to Many-Core Graphics Processors

CPU codenamed Magny-Cours in 2010 [10], while in Spring 2011 Intel launched their 10-core Westmere EX x86 CPU [102]. Some server vendors are using these latest multi-core x86 CPUs to deliver machines with 24 or even 48 cores in a single 1U ‘pizza box’ server. Early in 2010 Intel announced a 48-core x86 prototype processor which they described as a ‘cloud on a chip’ [61]. These many-core mainstream processors are probably the future of server CPUs, as they are well suited to large-scale, highly parallel workloads of the sort performed by the Googles and Amazons of the internet industry.

But perhaps the most revealing technology trend is that many of us are already carrying heterogeneous many-core processors in our pockets. This is because smartphones have quickly adopted this new technology to deliver the best performance possible while maintaining energy efficiency.

It is the penetration of many-core architectures into all three predominant processor markets — consumer electronics, personal computers and servers — that demonstrates the inevitability of this trend.
We now have to pay for our lunch and embrace explicit parallelism in our software but, if embraced fully, these exciting new hardware platforms can deliver breakthroughs in performance and energy efficiency.

Box 1: The Power Equation

Processor power consumption is the primary hardware issue driving the development of many-core architectures. Processor power consumption, P, depends on a number of factors, which can be described by the power equation,

P = A C V² f,

where A is the activity of the basic building blocks, or gates, in the processor, C is the total capacitance of the gates’ outputs, V is the processor’s voltage and f is its clock frequency.

Since power consumption is proportional to the square of the processor’s voltage, one can immediately see that reducing voltage is one of the main methods for keeping power under control. In the past voltage has been reduced continually, most recently to just under 1.0 volt. However, in the types of CMOS process from which most modern silicon chips are made, voltage has a lower limit of about 0.7 volts. If we try to turn the voltage down lower than this, the transistors that make up the device simply do not turn on and off properly.

Revisiting the power equation, one can further see that, as we build ever more complex processors using exponentially increasing numbers of transistors, the total gate capacitance C will continue to increase.
If we have lost the ability to compensate for this increase by reducing V, then there are only two variables left we can play with: A and f, and it is likely that both will have to reduce.

The many-core architecture trend is an inevitable outcome of this power equation: Moore’s Law gives us ever more transistors, pushing up C, but CMOS transistors have a lower limit to V which we have now reached, and thus the only effective way to harness all of these transistors is to use more and more cores at lower clock frequencies.

For more detail on why power consumption is driving hardware trends towards many-core designs we refer the reader to two IEEE Computer articles: Mudge’s 2001 ‘Power: a first-class architectural design constraint’ [85] and Woo and Lee’s ‘Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era’ from 2008 [121].
A KNOWLEDGE TRANSFER REPORT FROM THE LMS AND THE KTN FOR INDUSTRIAL MATHEMATICS

Success Stories

Before we analyse a few many-core success stories, it is important to address some of the hype around GPU acceleration. As is often the case with a new technology, there have been lots of inflated claims of performance speedups, some even claiming speedups of hundreds of times compared to regular CPUs. These claims almost never stand up to serious scrutiny and usually contain at least one serious flaw. Common problems include: optimising the GPU code much more than the serial code; running on a single host core rather than all of the host cores at the same time; not using the best compiler optimisation flags; not using the vector instruction set on the CPU (SSE or AVX); and using older, slower CPUs. But if we just compare the raw capabilities of the GPU with those of the CPU then typically there is an order of magnitude advantage for the GPU — not two or three orders of magnitude. If you see speedup claims of greater than about a factor of ten then be suspicious. A genuine speedup of a factor of ten is of course a significant achievement, and is more than enough to make it worthwhile investing in GPU solutions.

There is one more common issue to consider, and it is to do with the size of the problem being computed. Very parallel systems such as many-core GPUs achieve their best performance when solving problems that contain a correspondingly high degree of parallelism, enough to give every core enough work that it can operate at close to its peak efficiency for long periods.
For example, when accelerating linear algebra, one may need to be processing matrices of the order of thousands of elements in each dimension to get the best performance from a GPU, whereas on a CPU matrix dimensions of the order of hundreds of elements may be enough to achieve close to its best performance (see Box 2 for more on linear algebra). This discrepancy can lead to apples-to-oranges comparisons, where a GPU system is benchmarked on one size of problem while a CPU system is benchmarked on a different size of problem. In some cases the GPU system may even need a larger problem size than is really warranted to achieve its best performance, again resulting in flawed performance claims.

With these warnings aside, let us look at some genuine success stories in the areas of computational fluid dynamics for the aerospace industry, molecular docking for the pharmaceutical industry, options pricing for the financial services industry, special effects rendering for the creative industries, and data mining for large-scale electronic commerce.

Computational fluid dynamics on GPUs: OP2

Finding solutions of the Navier-Stokes equations in two and three dimensions is an important task for mathematically modelling and analysing flows in fluids, such as one might find when considering the aerodynamics of a new car or when analysing how airborne pollution is blown around the tall buildings of a modern city. Computational fluid dynamics (CFD) is the discipline of using computer-based models to solve the Navier-Stokes equations. CFD is a powerful and widely used tool which also happens to be computationally very expensive. GPU computing has thus been of great interest to the CFD community.

Prof.
Mike Giles at the University of Oxford is one of the leading developers of CFD methods that use unstructured grids to solve challenging engineering problems. This work has led to two significant software projects: OPlus, a code designed in collaboration with Rolls-Royce to run on clusters of commodity processors utilising message passing [28], and more recently OP2, which is being designed to utilise the latest many-core architectures [42].

OP2 is an example of a very interesting class of application which aims to enable the user to work at a high level of abstraction while delivering high performance. It achieves this by providing a framework for implementing efficient unstructured grid applications. Developers write their code in a familiar programming language such as C, C++ or Fortran, specifying the important features of their unstructured grid CFD problem at a high level. These features include the nodes, edges and faces of the unstructured grid, the flow variables, the mappings from edges to nodes, and the parallel loops. From this information OP2 can automatically generate optimised code for a specific target architecture, such as a GPU using the new parallel programming languages CUDA or OpenCL, or multi-core CPUs using OpenMP and vectorisation for their SIMD instruction sets (AVX and SSE) [58]. We will cover all of these approaches to parallelism in more detail later in this report.

At the time of writing, OP2 is still a work in progress, in collaboration