11.07.2015 Views

A Compiler for Parallel Execution of Numerical Python Programs on ...



Chapter 2

AMD RV770 architecture for GPGPU

A modern GPU board consists of the core chip, on-board high-bandwidth RAM and various connectors, all placed on a single board which is then inserted into an interconnect such as PCIe. In this thesis, all experiments were done on AMD’s Radeon HD 4870 GPU, which is one of the fastest GPUs available on the market at the time of writing. The Radeon 4870 is capable of delivering over a teraflop of performance for 32-bit and 240 gigaflops of performance for 64-bit floating point. Therefore, the theoretical peak of the Radeon 4870 is much higher than that of top-of-the-line x86 CPUs from Intel and AMD.

The Radeon 4870 is based on a chip, codenamed the RV770, and pairs it with high-bandwidth GDDR5 memory. RV770 is a massively parallel processing core designed both for graphics and for more general-purpose computing tasks. RV770 powers various GPUs such as the Radeon 4850, Radeon 4870 and Radeon 4830. A power-optimized variant of RV770 powers the Radeon 4890, which is currently the fastest single-GPU solution from AMD. AMD has also released smaller and cheaper chips, such as the RV730 and RV740, containing fewer computational cores, but with designs similar to RV770 and with conceptually the same architecture. Therefore, understanding the RV770 is sufficient for understanding the workings of most of the current GPU designs from AMD.

This chapter covers the RV770 architecture as well as the programming model exposed by AMD for GPGPU tasks and code transformations tailored for the architecture. Understanding the architecture is necessary to produce high-performance code for the GPU. As will be shown, the AMD GPU architecture is different from modern CPU architectures and has very different strengths and weaknesses.

The following terminology will be used in the chapter:

1. fp32 : 32-bit floating-point
2. fp64 : 64-bit floating-point
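The peak figures quoted at the start of this chapter (over a teraflop for fp32, 240 gigaflops for fp64) can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only. The unit counts and clock speed are not stated in the text above: they are assumptions taken from commonly quoted Radeon HD 4870 specifications (800 fp32 lanes, 160 units capable of one fp64 multiply-add per cycle, 750 MHz core clock). The helper name peak_gflops is hypothetical, and a multiply-add is counted as two floating-point operations.

```python
def peak_gflops(units: int, clock_mhz: float, flops_per_unit_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOPS: units * flops per cycle * clock (in GHz)."""
    return units * flops_per_unit_per_cycle * clock_mhz / 1000.0


# Assumed figures (not from the text above): commonly quoted HD 4870 specs.
FP32_UNITS = 800        # assumed number of fp32 lanes
FP64_UNITS = 160        # assumed number of units issuing one fp64 multiply-add per cycle
CORE_CLOCK_MHZ = 750.0  # assumed core clock

fp32_peak = peak_gflops(FP32_UNITS, CORE_CLOCK_MHZ)
fp64_peak = peak_gflops(FP64_UNITS, CORE_CLOCK_MHZ)

print(f"fp32 peak: {fp32_peak:.0f} GFLOPS")
print(f"fp64 peak: {fp64_peak:.0f} GFLOPS")
```

Under these assumptions the calculation gives 1200 GFLOPS for fp32 and 240 GFLOPS for fp64, consistent with the figures quoted in the text. These are theoretical upper bounds; real programs reach them only when every unit issues a multiply-add on every cycle.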
