The GPU Computing Revolution - London Mathematical Society
A KNOWLEDGE TRANSFER REPORT FROM THE LMS AND THE KTN FOR INDUSTRIAL MATHEMATICS

GPUs in Depth

While several different many-core architectures have emerged during the last few years, we will focus on GPU-based many-core systems for the following discussion. However, almost all of the principles and terminology discussed in this context apply equally to the other many-core architectures, including x86 CPUs and consumer electronics products, as previously described.

A modern GPU is a many-core device, meaning it contains hundreds or even thousands of simple yet fully programmable cores, all on a single chip. These cores are often grouped together into homogeneous sets with varying names. The emerging C-based many-core programming language OpenCL [67] calls these simple cores 'Processing Elements' (PEs), and the groupings of these PEs 'Compute Units' (CUs).

Another common feature of all many-core architectures is a multi-level memory hierarchy. Typically each Processing Element has a small amount of its own private memory. There is often a larger memory per Compute Unit that can be shared between all of that Compute Unit's Processing Elements. There will also usually be a global memory, visible to all the Compute Units and thus to all the Processing Elements. Finally, there is usually a separate memory for the host processor system. OpenCL refers to these four levels in the memory hierarchy as Private, Local, Global and Host, respectively. Figure 5 illustrates the OpenCL memory hierarchy terminology. We will adopt OpenCL's terminology for the rest of this report.

The GPU itself is integrated into a 'host system'.
This might mean the GPU is on its own add-in board, plugged into a standard PCI Express expansion slot within a server or desktop computer. Alternatively, the GPU may be integrated alongside the host CPU, as found inside high-end laptops and smartphones today. Increasingly in the future we will see the many-core GPU tightly integrated onto the same chip as the host CPU; in June 2011 AMD officially launched its first 'Fusion' CPU, codenamed Llano, which integrates a quad-core x86 CPU with a many-core GPU capable of running OpenCL programs [104]. NVIDIA already has consumer-level 'Fusion' devices in its Tegra CPU+GPU product line, and at SuperComputing 2010 in New Orleans it announced 'Project Denver', its programme for a high-end Fusion-class device integrating its cutting-edge GPUs with new, high-end ARM cores [84]. The first Project Denver products will not arrive for several years, but NVIDIA has indicated that this 'fusion' approach of integrating its GPUs alongside its own ARM-based multi-core CPUs is central to its future product roadmap.

The reader may come across three very different configurations of GPU-accelerated systems, illustrated in Figure 6. These three ways of integrating GPUs within a system may have very different performance characteristics, but their usage models are almost identical, and we shall treat them in the same way for the purposes of this report.
An important implication of this wide range of different approaches is that GPU-accelerated computing is not just for supercomputers: almost all systems are capable of exploiting the performance of GPUs.

[Figure 5: The memory hierarchy of a generic GPU (source: Khronos). Each Work-Item has its own Private Memory; the Work-Items within a Work-Group share Local Memory; all Work-Groups on the Compute Device share Global Memory and Constant Memory; and the Host has its own Host Memory.]
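The four memory levels described above map directly onto address-space qualifiers in OpenCL C. The following device-code sketch (a hypothetical kernel, not from the report, and not runnable without a separate host program and an OpenCL runtime) shows where each level appears in practice:

```c
// Sketch of an OpenCL C kernel illustrating the memory hierarchy.
// __global  : visible to all Compute Units (Global memory)
// __constant: read-only global data (Constant memory)
// __local   : shared by the work-items of one work-group (Local memory)
// Automatic variables such as 'x' live in each Processing
// Element's Private memory. Host memory is managed separately
// by the host program via the OpenCL runtime API.
__kernel void scale(__global float *data,
                    __constant float *coeff,
                    __local float *scratch)
{
    // Private memory: one copy of 'x' per work-item.
    float x = data[get_global_id(0)];

    // Stage the value in Local memory, shared across the work-group.
    scratch[get_local_id(0)] = x;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write the scaled result back to Global memory.
    data[get_global_id(0)] = scratch[get_local_id(0)] * coeff[0];
}
```

The host program would allocate the `__global` and `__constant` buffers in the compute device's memory and copy data between them and Host memory explicitly; the `__local` scratch buffer exists only for the lifetime of each work-group.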
