The GPU Computing Revolution - London Mathematical Society

Success Stories

Before we analyse a few many-core success stories, it is important to address some of the hype around GPU acceleration. As is often the case with a new technology, there have been many inflated claims of performance speedups, some even claiming speedups of hundreds of times compared to regular CPUs. These claims almost never stand up to serious scrutiny and usually contain at least one serious flaw. Common problems include: optimising the GPU code much more heavily than the serial code; running on a single host core rather than on all of the host cores at the same time; not using the best compiler optimisation flags; not using the vector instruction set on the CPU (SSE or AVX); and using older, slower CPUs. If we compare the raw capabilities of the GPU and the CPU, there is typically an order of magnitude advantage for the GPU, not two or three orders of magnitude. If you see speedup claims of greater than about a factor of ten, be suspicious. A genuine speedup of a factor of ten is of course a significant achievement, and is more than enough to make investing in GPU solutions worthwhile.

There is one more common issue to consider, and it concerns the size of the problem being computed. Highly parallel systems such as many-core GPUs achieve their best performance when solving problems that contain a correspondingly high degree of parallelism: enough to give every core sufficient work to operate at close to peak efficiency for long periods. For example, when accelerating linear algebra, one may need to process matrices of the order of thousands of elements in each dimension to get the best performance from a GPU, whereas on a CPU matrix dimensions of the order of hundreds of elements may suffice to achieve close to its best performance (see Box 2 for more on linear algebra). This discrepancy can lead to apples-to-oranges comparisons, in which a GPU system is benchmarked on one size of problem while a CPU system is benchmarked on a different size. In some cases the GPU system may even need a larger problem size than is really warranted to achieve its best performance, again resulting in flawed performance claims.
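This size dependence is easy to observe directly. The following sketch (not from the report) times single-precision matrix multiplication with cuBLAS at a range of matrix sizes, assuming an NVIDIA GPU with the CUDA toolkit installed; the file name is illustrative. A fair CPU comparison would run a multithreaded, vectorised BLAS, built with full compiler optimisation, over the same sizes.

    /* Sketch: measure cuBLAS SGEMM throughput at several matrix sizes.
       Compile with: nvcc sgemm_sizes.cu -lcublas */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        int sizes[] = {256, 512, 1024, 2048, 4096, 8192};
        const float alpha = 1.0f, beta = 0.0f;
        cublasHandle_t handle;
        cublasCreate(&handle);

        for (int i = 0; i < 6; i++) {
            int n = sizes[i];
            size_t bytes = (size_t)n * n * sizeof(float);
            float *A, *B, *C;
            cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);
            cudaMemset(A, 0, bytes); cudaMemset(B, 0, bytes);

            /* Warm-up call so one-off initialisation costs are not timed. */
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, A, n, B, n, &beta, C, n);

            cudaEvent_t start, stop;
            cudaEventCreate(&start); cudaEventCreate(&stop);
            cudaEventRecord(start);
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, A, n, B, n, &beta, C, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            /* An n x n matrix multiply performs 2*n^3 floating-point ops. */
            printf("n = %5d: %8.2f ms, %7.0f GFLOP/s\n",
                   n, ms, 2.0 * n * n * n / (ms * 1e6));

            cudaFree(A); cudaFree(B); cudaFree(C);
            cudaEventDestroy(start); cudaEventDestroy(stop);
        }
        cublasDestroy(handle);
        return 0;
    }

On typical hardware the GFLOP/s figure climbs steeply with n and only flattens out once the matrix dimensions reach the thousands, which is exactly the effect described above.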
With these warnings aside, let us look at some genuine success stories in the areas of computational fluid dynamics for the aerospace industry, molecular docking for the pharmaceutical industry, options pricing for the financial services industry, special effects rendering for the creative industries, and data mining for large-scale electronic commerce.

Computational fluid dynamics on GPUs: OP2

Finding solutions of the Navier-Stokes equations in two and three dimensions is an important task in mathematically modelling and analysing fluid flows, such as one might find when considering the aerodynamics of a new car or when analysing how airborne pollution is blown around the tall buildings of a modern city. Computational fluid dynamics (CFD) is the discipline of using computer-based models to solve the Navier-Stokes equations. CFD is a powerful and widely used tool which also happens to be computationally very expensive, so GPU computing has been of great interest to the CFD community.

Prof. Mike Giles at the University of Oxford is one of the leading developers of CFD methods that use unstructured grids to solve challenging engineering problems. This work has led to two significant software projects: OPlus, a code designed in collaboration with Rolls-Royce to run on clusters of commodity processors using message passing [28], and, more recently, OP2, which is being designed to exploit the latest many-core architectures [42].

OP2 is an example of a very interesting class of application that aims to let the user work at a high level of abstraction while delivering high performance. It achieves this by providing a framework for implementing efficient unstructured grid applications. Developers write their code in a familiar programming language such as C, C++ or Fortran, specifying the important features of their unstructured grid CFD problem at a high level: the nodes, edges and faces of the grid, the flow variables, the mappings from edges to nodes, and the parallel loops. From this information OP2 can automatically generate optimised code for a specific target architecture, such as a GPU using the parallel programming languages CUDA or OpenCL, or multi-core CPUs using OpenMP and vectorisation for their SIMD instruction sets (AVX and SSE) [58]; a sketch of this programming style is given below. We will cover all of these approaches to parallelism in more detail later in this report.

At the time of writing, OP2 is still a work in progress.
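To make the programming model concrete, here is a minimal sketch in the style of OP2's published C API, declaring a toy unstructured grid and expressing the computation as a parallel loop over its edges. The kernel, grid and data are invented for illustration, and the exact names and signatures may differ between OP2 releases, so treat this as indicative rather than authoritative.

    /* Sketch in the style of the OP2 C API: declare sets, a mapping and
       data, then express the computation as a parallel loop over edges.
       From this single source OP2 can generate CUDA, OpenCL or OpenMP code. */
    #include "op_seq.h"   /* header for the sequential reference build */

    /* User kernel, applied once per edge; OP2 decides how to parallelise it.
       x1, x2 are read through the edge-to-node mapping; r1, r2 are
       incremented, so OP2 must manage any write conflicts between edges. */
    void edge_flux(double *x1, double *x2, double *r1, double *r2) {
        double flux = 0.5 * (x2[0] - x1[0]);   /* toy flux along the edge */
        *r1 += flux;
        *r2 -= flux;
    }

    int main(int argc, char **argv) {
        op_init(argc, argv, 2);

        /* A toy grid: four nodes in a chain, joined by three edges. */
        int    nnode = 4, nedge = 3;
        int    edge_to_node[] = {0, 1, 1, 2, 2, 3};
        double x[]   = {0.0, 1.0, 2.0, 3.0};
        double res[] = {0.0, 0.0, 0.0, 0.0};

        op_set nodes = op_decl_set(nnode, "nodes");
        op_set edges = op_decl_set(nedge, "edges");
        op_map pedge = op_decl_map(edges, nodes, 2, edge_to_node, "pedge");
        op_dat p_x   = op_decl_dat(nodes, 1, "double", x,   "p_x");
        op_dat p_res = op_decl_dat(nodes, 1, "double", res, "p_res");

        /* Parallel loop over all edges, reading node positions through the
           edge-to-node mapping and incrementing the two node residuals. */
        op_par_loop(edge_flux, "edge_flux", edges,
            op_arg_dat(p_x,   0, pedge, 1, "double", OP_READ),
            op_arg_dat(p_x,   1, pedge, 1, "double", OP_READ),
            op_arg_dat(p_res, 0, pedge, 1, "double", OP_INC),
            op_arg_dat(p_res, 1, pedge, 1, "double", OP_INC));

        op_exit();
        return 0;
    }

The key design point is that the loop body is written once, independently of the target: the access descriptors (OP_READ, OP_INC) give the framework enough information to plan data movement and avoid write conflicts on each architecture.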
