The GPU Computing Revolution - London Mathematical Society
A KNOWLEDGE TRANSFER REPORT FROM THE LMS AND THE KTN FOR INDUSTRIAL MATHEMATICS

GPUs in Depth

While several different many-core architectures have emerged during the last few years, we will focus on GPU-based many-core systems for the following discussion. However, almost all of the principles and terminology discussed in this context apply equally to the other many-core architectures, including x86 CPUs and consumer electronics products, as previously described.

A modern GPU is a many-core device, meaning it contains hundreds or even thousands of simple yet fully programmable cores, all on a single chip. These cores are often grouped together into homogeneous sets with varying names. The emerging C-based many-core programming language OpenCL [67] calls these simple cores 'Processing Elements' (PEs), and the groupings of these PEs 'Compute Units' (CUs).

Another common feature of all many-core architectures is a multi-level memory hierarchy. Typically each Processing Element has a small amount of its own private memory. There is often a larger memory per Compute Unit that can be shared between all of that Compute Unit's Processing Elements. There will also usually be a global memory, visible to all the Compute Units and thus to all the Processing Elements. Finally, there is usually a separate memory for the host processor system. OpenCL refers to these four levels in the memory hierarchy as Private, Local, Global and Host, respectively. Figure 5 illustrates the OpenCL memory hierarchy terminology. We will adopt OpenCL's terminology for the rest of this report.

The GPU itself is integrated into a 'host system'.
This might mean the GPU is on its own add-in board, plugged into a standard PCI Express expansion slot within a server or desktop computer. Alternatively, the GPU may be integrated alongside the host CPU, as found inside high-end laptops and smartphones today. Increasingly in the future we will see the many-core GPU tightly integrated onto the same chip as the host CPU; in June 2011 AMD officially launched its first 'Fusion' CPU, codenamed Llano, which integrates a quad-core x86 CPU with a many-core GPU capable of running OpenCL programs [104]. NVIDIA already has consumer-level 'Fusion' devices in its Tegra CPU+GPU product line, and at SuperComputing 2010 in New Orleans it announced 'Project Denver', its programme for a high-end Fusion-class device integrating its cutting-edge GPUs with new, high-end ARM cores [84]. The first Project Denver products will not arrive for several years, but NVIDIA has indicated that this 'fusion' approach of integrating its GPUs alongside its own ARM-based multi-core CPUs is central to its future product roadmap.

The reader may come across three very different configurations of GPU-accelerated systems, illustrated in Figure 6. These three ways of integrating GPUs within a system may have very different performance characteristics, but their usage models are almost identical, and we shall treat them in the same way for the purposes of this report.
An important implication of this wide range of different approaches is that GPU-accelerated computing is not just for supercomputers: almost all systems are capable of exploiting the performance of GPUs.

[Figure 5: The memory hierarchy of a generic GPU (source: Khronos). Each Work-Item has its own Private Memory; the Work-Items within a Work-Group share Local Memory; all Work-Groups on the Compute Device share Global Memory and Constant Memory; and the Host has its own Host Memory.]
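The four memory levels described above map directly onto address-space qualifiers in OpenCL C. The following device-code sketch (a hypothetical kernel, not from the report, and not runnable without a separate host program and an OpenCL runtime) shows where each level appears in practice:

```c
// Sketch of an OpenCL C kernel illustrating the memory hierarchy.
// __global  : visible to all Compute Units (Global memory)
// __constant: read-only global data (Constant memory)
// __local   : shared by the work-items of one work-group (Local memory)
// Automatic variables such as 'x' live in each Processing
// Element's Private memory. Host memory is managed separately
// by the host program via the OpenCL runtime API.
__kernel void scale(__global float *data,
                    __constant float *coeff,
                    __local float *scratch)
{
    // Private memory: one copy of 'x' per work-item.
    float x = data[get_global_id(0)];

    // Stage the value in Local memory, shared across the work-group.
    scratch[get_local_id(0)] = x;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write the scaled result back to Global memory.
    data[get_global_id(0)] = scratch[get_local_id(0)] * coeff[0];
}
```

The host program would allocate the `__global` and `__constant` buffers in the compute device's memory and copy data between them and Host memory explicitly; the `__local` scratch buffer exists only for the lifetime of each work-group.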
