A Compiler for Parallel Execution of Numerical Python Programs on ...

Examining Committee

Jose Nelson Amaral, Computing Science
Paul Lu, Computing Science
Bruce Cockburn, Electrical and Computer Engineering


To the never-ending advance of technology


Abstract

Modern Graphics Processing Units (GPUs) are providing breakthrough performance for numerical computing at the cost of increased programming complexity. Current programming models for GPUs require that the programmer manually manage the data transfer between CPU and GPU. This thesis proposes a simpler programming model and introduces a new compilation framework to enable Python applications containing numerical computations to be executed on GPUs and multi-core CPUs.

The new programming model minimally extends Python to include type and parallel-loop annotations. Our compiler framework then automatically identifies the data to be transferred between the main memory and the GPU for a particular class of affine array accesses. The compiler also automatically performs loop transformations to improve performance on GPUs.

For kernels with regular loop structure and simple memory access patterns, the GPU code generated by the compiler achieves significant performance improvement over multicore CPU codes.


Acknowledgements

First and foremost, I would like to thank my supervisor Prof. Amaral for his supervision throughout my studies as a Masters student. He helped me to maintain an overall direction in my thesis while still allowing me to freely explore new ideas. He not only taught me the skills necessary to conduct one's research, but he also taught me the importance of communicating ideas to the rest of the research community. This thesis would not have been possible without the patience and guidance of Prof. Amaral.

Next, I would like to thank Kit Barton, Calin Cascaval, George Almasi, Ettore Tiotto and many others from IBM who taught me the difference between toy compilers and real-world production compilers. In particular, Kit voluntarily mentored me and was always ready to answer my naive questions. The experience of working with a world-class group of compiler writers and researchers from IBM Research was very valuable and helped me grow as a researcher and compiler writer.

I would also like to thank Micah Villmoh, Michael Chu and many others from AMD's Stream Computing group. They provided us with hardware for initial experiments and also provided us with information and clarifications about the CAL SDK. Their help is greatly appreciated.

The wonderful open source community behind NumPy provided me with useful feedback and suggestions. In particular, I would like to thank Travis Oliphant and the developers of Cython for their feedback about the syntax of the extensions introduced.

I am also thankful to my committee for providing feedback about the thesis. Their questions and comments have also prompted me to think in greater detail about possible future directions of this thesis.

My Masters studies were partially funded by an iCore international student award.

Finally, I would like to thank my friends Leslie Robinson and Leslie Weigl. Without their support and encouragement, I would not have survived the cold Edmonton winters and would have quit this project long ago.


5.3.3 Multidimensional RCSLMADs with common strides
5.4 Loop tiling for handling large loops
5.5 Conclusions
6 Loop transformations for AMD GPUs
7 Experimental evaluation
7.1 Matrix multiplication
7.2 CP benchmark
7.3 Black-Scholes option pricing
7.4 5-point stencil
7.5 RPES benchmark
7.6 Remarks
8 Related Work
8.1 Compiling Python for CPUs
8.2 Array access analysis
8.3 Compilers for GPGPU
8.4 Conclusions
9 Conclusions


List of Tables

2.1 Comparison of bandwidth available to Radeon 4870 and Phenom II X4 940
7.1 Execution time for matrix multiplication benchmark for 32-bit floating point (seconds)
7.2 Speedups for matrix multiplication using GPU for 32-bit floating point over ATLAS
7.3 Execution time for matrix multiplication benchmark for 64-bit floating point (seconds)
7.4 Speedups for matrix multiplication using GPU for 64-bit floating point over ATLAS
7.5 Execution time for CP benchmark (seconds)
7.6 Speedups for CP using GPU over OpenMP
7.7 Execution time for Black-Scholes benchmark (seconds)
7.8 Speedups for Black-Scholes benchmark using GPU over OpenMP
7.9 Execution time for 5-point stencil benchmark (milliseconds)


List of Figures

2.1 RV770 Block Diagram. Source: AMD. Printed with permission.
2.2 Memory arrangements
2.3 CAL programming model
3.1 Simple example of a Python function.
3.2 C pointer example
3.3 Examples of array slicing in Python.
3.4 Comparison between the state of the art for generation of numerical-intensive applications for execution on CPU/GPU and the proposed programming model.
3.5 Example of decorator used as a type declarator.
4.1 Block diagram of unPython for CPU code generation
4.2 Block diagram of unPython for GPU code generation
5.1 Loop nests considered for array access analysis
5.2 Visualization of a two-dimensional RCSLMAD example


List of acronyms and symbols

ALU       Arithmetic Logic Unit
AMD IL    AMD Intermediate Language
AOT       Ahead-of-time
API       Application Programming Interface
AST       Abstract Syntax Tree
ATLAS     Automatically Tuned Linear Algebra Software
CAL       Compute Abstraction Layer
CUDA      Compute Unified Device Architecture
ERL       Extended Restricted Constant Stride Linear Memory Access Descriptor
FLOP      Floating Point Operation
GDDR      Graphics Double Data Rate
GFLOP     Giga Floating Point Operation
GLSL      OpenGL Shading Language
GPU       Graphics Processing Unit
GPGPU     General Purpose Graphics Processing Unit
HLSL      High Level Shader Language
JIT       Just-in-time
LDS       Local Data Share
LMAD      Linear Memory Access Descriptor
PCIe      Peripheral Component Interconnect Express
RCSLMAD   Restricted Constant Stride Linear Memory Access Descriptor


RAM       Random Access Memory
RISC      Reduced Instruction Set Computer
SIMD      Single Instruction Multiple Data
SP        Stream Processor
SPMD      Single Program Multiple Data
SSE       Streaming SIMD Extensions
TEX       Texture Unit
TP        Thread Processor
UVD       Universal Video Decode
VLIW      Very Long Instruction Word
b         Base of an LMAD
D         Domain of an RCSLMAD
E         Symbol for an ERL
L         Symbol for an LMAD
p_k       Stride of an RCSLMAD in the k-th dimension
u_k       Upper bound of the k-th loop


Chapter 1

Introduction

In recent years, processor designers have hit a wall in increasing the performance of a single processor core. Processors have adopted multicore designs, fitting multiple cores on a single die or a single package. Quad-core CPUs are already common on the desktop and hexa-core processors are available for servers. Another new development in the hardware field has been the emergence of programmable graphics processing units (GPUs). GPUs have evolved from a fixed-function pipeline to a highly programmable pipeline. Modern GPUs are a massively parallel architecture containing hundreds of programmable ALUs that can be utilized for tasks such as numerical computation. A programmable GPU that can be utilized for non-graphics tasks is termed a general-purpose GPU (GPGPU). GPGPUs are less flexible than CPUs but offer much greater floating-point capability. GPGPUs offering up to a teraflop of floating-point performance are already available. This performance is an order of magnitude better than current x86 processors.

The performance gains offered by multicores and GPGPUs come at the cost of increased programming effort. To utilize multicores, a programmer is required to write parallel code. Many abstractions, such as threads and parallel loops, are available for writing parallel programs. For numerical programs, expressing parallelism through parallel for-loops (available in APIs such as OpenMP) is a common choice for multicores as well as for symmetric multiprocessor architectures. Programming for GPGPUs requires the programmer to rewrite much of her code to utilize a GPU-specific API. GPGPUs usually have a completely separate address space and therefore the programmer has to copy data from the system RAM to the GPGPU address space.

On the software side, languages such as Matlab and Python are becoming more popular for scientific computing. Python in particular is starting to attract a lot of attention in the numerical-programming community. Python is a general-purpose object-oriented language with very clean syntax. An extension library for Python, called NumPy, offers very flexible multi-dimensional array abstractions and is particularly useful for scientific programming. However, while Python offers great flexibility to the programmer, performance is usually


much lower than that of equivalent C programs. Python also does not currently provide any simple tools for parallel programming for numerical programs.

This thesis presents a compiler for Python that enables the programmer to easily utilize multicore processors and GPUs for maximum performance. The programmer only needs to add minimal annotations to the program and the compiler can automatically generate multicore CPU and GPU code. The annotations are designed to be portable to systems with or without a GPU.

1.1 Contributions

The thesis presents a compiler system that implements the following:

1. A simple type-annotation system and a simple parallel for-loop API for Python. The parallel for-loop exposes a data-parallel shared-memory paradigm to the programmer.

2. A new array-access analysis algorithm that can automatically compute the memory addresses to be transferred between the CPU and the GPU for a class of parallel loops. The algorithm handles rectangular loop nests with loop-invariant bounds and with a particular class of affine subscript expressions for array references.

3. A set of loop optimizations for the AMD Radeon 4800 series of GPUs.

The compiler is implemented as a three-part system:

1. An ahead-of-time compiler, unPython, that compiles a subset of annotated Python to C++ and OpenMP.

2. A just-in-time compiler, jit4GPU, that generates optimized GPU code from a tree representation of the Python program. Jit4GPU also computes the memory addresses to be transferred between the CPU and the GPU as required by the computation.

3. A run-time system that supports jit4GPU by providing it with a simple API to manage program execution on the GPU as well as data transfer between the CPU and GPU.

Experimental results, discussed in Chapter 7, show that using a GPU resulted in up to 100 times performance improvement over compiler-generated parallel C++ with OpenMP programs.

1.2 Thesis Organization

Chapter 2 discusses the GPGPU architecture used in this thesis, codenamed the AMD RV770. All experiments in this thesis were done using the AMD Radeon 4870 GPU, based


upon the RV770 architecture. A brief description of Python, NumPy and the new annotations proposed is given in Chapter 3. Chapter 4 gives an overview of the compiler implementation. Chapter 5 describes the array-access analysis algorithm used by the compiler to automatically copy relevant data between CPU and GPU. Loop optimizations performed by the compiler for GPUs are described in Chapter 6. Performance of the compiler is evaluated over several benchmarks in Chapter 7. Related work is examined in Chapter 8. Finally, Chapter 9 concludes the thesis.


Chapter 2

AMD RV770 architecture for GPGPU

A modern GPU board consists of the core chip, on-board high-bandwidth RAM and various connectors, all placed on a single board which is then inserted into an interconnect such as PCIe. In this thesis, all experiments were done on AMD's Radeon HD 4870 GPU, which is one of the fastest GPUs available on the market at the time of writing. The Radeon 4870 is capable of delivering over a teraflop of performance for 32-bit floating point and 240 gigaflops of performance for 64-bit floating point. Therefore, the theoretical peak of the Radeon 4870 is much higher than that of top-of-the-line x86 CPUs from Intel and AMD.

The Radeon 4870 is based on a chip, codenamed the RV770, and pairs it with high-bandwidth GDDR5 memory. The RV770 is a massively parallel processing core designed both for graphics and for more general-purpose computing tasks. The RV770 powers various GPUs such as the Radeon 4850, Radeon 4870 and Radeon 4830. A power-optimized variant of the RV770 powers the Radeon 4890, which is currently the fastest single-GPU solution from AMD. AMD has also released smaller and cheaper chips, such as the RV730 and RV740, containing fewer computational cores, but with designs similar to the RV770 and with conceptually the same architecture. Therefore, understanding the RV770 is sufficient for understanding the workings of most of the current GPU designs from AMD.

This chapter covers the RV770 architecture, the programming model exposed by AMD for GPGPU tasks, and code transformations tailored for the architecture. Understanding the architecture is necessary to produce high-performance code for the GPU. As will be shown, the AMD GPU architecture is different from modern CPU architectures and has very different strengths and weaknesses.

The following terminology will be used in this chapter:

1. fp32: 32-bit floating point
2. fp64: 64-bit floating point


3. float: 32-bit floating point
4. double: 64-bit floating point
5. int: 32-bit signed integer
6. float2: a 64-bit structure of 2 floats. If X is a float2, then it has two float components which can be referenced as X.x and X.y. Numerical indexing is not allowed; .x and .y are the only way to access the components.
7. float4: a 128-bit structure of 4 floats. The 4 components are, respectively, x, y, z and w. Numerical indexing is not allowed.
8. double2: a 128-bit structure of 2 doubles with components .x and .y.

2.1 Overview

The RV770 is a massively parallel graphics and computing core. An overview of the chip is provided in Figure 2.1. In the figure, the following blocks of the RV770 are visible:

1. A setup engine and an ultra-threaded dispatcher that dynamically schedules thread execution on the SIMD units.

2. Ten SIMD units. These form the execution cores of the chip.

3. Ten texture units, where each texture unit is aligned with an SIMD unit. Texture units are equivalent to a load unit on a CPU. Each texture unit also has a dedicated L1 texture cache.

4. Four 64-bit memory controllers (for a total 256-bit memory interface) that can be combined with GDDR3 or GDDR5. Each memory controller has a dedicated L2 cache. The memory controllers are connected to the texture units through a crossbar switch.

5. Various connectors such as PCIe and display controllers.

6. A UVD (Universal Video Decode) block for decoding video codecs.

From the GPGPU performance perspective, understanding the SIMD units and the texture units is important for obtaining the best performance out of the chip. The SIMD units and texture units are described next.


Figure 2.1: RV770 Block Diagram. Source: AMD. Printed with permission.


2.1.1 SIMD units

Each SIMD unit is composed of the following:

1. 16 thread processors (TPs). Each TP is a VLIW unit that can execute one 5-wide Very Long Instruction Word (VLIW) instruction each cycle. Each TP contains five ALUs. AMD calls each ALU a stream processor (SP). Each SP can execute one fp32 or one int32 multiply-and-add (MAD) operation per cycle. One of the five SPs in a TP is a "fat" SP, and only the fat SP can execute transcendental functions such as sin or cos. Thus, the peak performance of a TP is ten fp32 FLOPs per cycle. For fp64, the scenario is more complex. The four non-fat ALUs co-operate to execute a single fp64 MAD instruction each cycle, and thus the fp64 peak of each TP is two FLOPs/cycle, or 1/5th of the fp32 peak. If the compiler is unable to pack five instructions into one packet, then the peak is not achieved, because each TP is a VLIW processor.

2. Each SIMD has a huge register file with 64×256 float4/int4 registers. The registers are untyped, with four 32-bit components. Each component of a register can either store a floating-point value or an integer value. Two components of a register can be combined to store a double-precision value. The register file is arranged in sixteen blocks of four banks each. This gives a total of 64 banks, and each bank can do one float4 transfer per cycle. Thus the bandwidth to the register file is 1024 bytes/cycle. The peak transfer rate from the register file is less than the peak demand of the ALUs. This is explained by the fact that there exists a forwarding path where a TP can use a result computed in a previous cycle.

3. A 16-Kbyte shared memory called the local data share (LDS). The LDS is arranged in four banks, each with 256 128-bit entries. Each bank can do one 128-bit transfer per cycle. Thus the LDS can transfer 64 bytes/cycle. Therefore, the bandwidth from the LDS is much smaller than the bandwidth from the register file. The size and layout of the LDS suggest that it is very similar to one block of the register file but with additional indexing hardware.

2.1.2 Texture units

Each texture unit can compute four addresses per cycle. Therefore, the ten texture units can calculate a total of 40 addresses per cycle. Each texture unit has a dedicated L1 cache. The size of the L1 cache has not been published by AMD. Each texture unit is aligned with, and services, exactly one SIMD unit. Given that the SIMD array can execute 800 ALU operations, or equivalently 1600 FLOPs, per cycle, the asymmetry in the ALU:TEX ratio is apparent. The texture units provide many facilities, such as filtering, but those are typically not used in compute workloads and are thus not discussed in this thesis.
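These per-cycle figures determine the chip-level peaks quoted at the start of the chapter and the ALU:TEX balance discussed later in Sections 2.4 and 2.5. The following is a minimal sketch of the arithmetic, using the 750 MHz core clock of the Radeon 4870 quoted in Section 2.5:

# Back-of-the-envelope peaks for the RV770/Radeon 4870 from the
# per-cycle figures above (a MAD counts as two FLOPs).
simds = 10
tps_per_simd = 16
sps_per_tp = 5                       # 5-wide VLIW thread processor
clock_hz = 750e6                     # Radeon 4870 core clock

alu_ops_per_cycle = simds * tps_per_simd * sps_per_tp           # 800 ALU ops/cycle
fp32_peak_tflops = alu_ops_per_cycle * 2 * clock_hz / 1e12      # 1.2 TFLOP/s
fp64_peak_gflops = simds * tps_per_simd * 2 * clock_hz / 1e9    # 240 GFLOP/s

tex_addresses_per_cycle = simds * 4                             # 40 addresses/cycle
ideal_alu_tex_ratio = alu_ops_per_cycle / tex_addresses_per_cycle  # 20

print(fp32_peak_tflops, fp64_peak_gflops, ideal_alu_tex_ratio)  # 1.2 240.0 20.0

The 20:1 ratio computed here is the origin of the rule of thumb, used later in this chapter, that a kernel should contain roughly 20 ALU operations per TEX instruction.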


2.2 AMD CAL programming model

Various APIs and programming languages are available to program the RV770 GPU for GPGPU purposes. The AMD Compute Abstraction Layer (CAL) Application Programming Interface [8] provides the ability to program the GPU using either the R700 family ISA [10] or the AMD Intermediate Language (IL) [9]. OpenCL support and DirectX 11 compute shader 4.1 support are also expected in the future for the RV770. To program the GPU using the CAL API, the programmer explicitly manages the GPU. For best efficiency, the programmer must explicitly manage the GPU's on-board memory and must explicitly transfer data between the system RAM and the GPU on-board RAM. While the hardware does provide some ability to directly access the system RAM without explicit transfers to the GPU on-board RAM, this ability is very limited. Accessing the system memory directly over the PCIe bus is also highly inefficient due to the high latency and relatively low bandwidth.

This document discusses the AMD IL programming model since it corresponds closely to the hardware. AMD IL is a RISC-like program representation derived from the Shader Model 4.0 assembler. The AMD CAL API includes a JIT compiler to compile and run AMD IL programs, and also provides routines for managing the GPU memory, initialization and shutdown of the GPU, and querying the GPU for available resources. The CAL runtime and the CAL compiler are distributed as part of the AMD graphics driver.

2.2.1 Memory management

In the CAL API, the memory on the GPU is allocated as two-dimensional resources. To allocate a resource, the programmer specifies the height, width, format, memory type and location of the resource. The width and height are both restricted to 8192 elements. The format specifies the element type of the resource and can be any of the numeric datatypes up to 128 bits wide. For example, float, float2, float4, int, int2, int4, double and double2 are all valid format specifiers. The location specifies whether the resource is located in the GPU memory or in a driver-allocated portion of system RAM.

For GPU resources, another specifier is the memory type. AMD GPUs are capable of storing resources in two different physical arrangements. The default arrangement is a tiled arrangement where a block of 16×4 bytes is stored contiguously. The other option is to store resources in linear memory, which corresponds to row-major order, similar to two-dimensional arrays in C. The two memory arrangements are illustrated in Figure 2.2.

2.2.2 Context management

A GPU can support one or more execution contexts. All execution on a GPU is done through a context. Within a context, resources can be mapped to one or more predefined names. A resource is mapped to a name that specifies how the resource can be used within


Figure 2.2: Memory arrangements


a context. The predefined names are i0, i1, ..., i7, o0, o1, ..., o7, and g. A resource mapped as i0, i1, ..., i7 can be used as a read-only resource in any kernel launched within the context. A resource mapped as o0, o1, ..., o7 can be used as a write-only render backend within any kernel launched in the context. A resource mapped as g can be used as the unique read-write global buffer that can be read or written by any thread in a kernel. Any resource can be mapped to any of the names, with the only constraint being that the resource mapped to g must have been allocated in linear memory. A resource can be mapped to multiple names, but no guarantees are provided about coherence.

One severe limitation in the current AMD SDK is that the global buffer only supports 128-bit elements. For example, address computation is always done assuming that the elements are float4/int4. This means that write patterns that are not aligned to 128-bit boundaries are very hard to support.

2.2.3 Software execution model

The AMD IL allows two types of shader programs: pixel shaders and compute shaders. The main differences between these programming modes are the way the threads are organized and the memory elements the threads can access.

In pixel-shader mode, threads are organized in a two-dimensional grid and each thread knows its two-dimensional thread id. Pixel-shader programs can output to either the render backends or the global buffer. The write capability to a render backend is limited: a thread with ID (x,y) can only output to index (x,y) in a render backend. The global buffer is indexed through a one-dimensional index. Any thread can read from, or write to, any index in the global buffer, but no guarantees are provided about coherence if there is a data race between different threads reading or writing the same location. Any thread can read from any index of the read-only resources i0 to i7. The end of a kernel is an implicit synchronization point, but no other communication or synchronization is provided between threads. An illustration of the pixel-shader programming model is provided in Figure 2.3.

In compute-shader mode, threads are arranged in one-dimensional thread groups and thread groups are organized in a one-dimensional grid. Each thread knows its one-dimensional absolute id as well as its one-dimensional id within its group. A thread group can have a size of up to 1024 threads. In compute-shader mode, a thread cannot write to render backends and can only write to the global buffer. Threads within a thread group can communicate through the LDS or through shared registers. To utilize the LDS, each thread declares the size of memory that the thread owns in the LDS. Each thread can only write to the piece of memory it owns in the LDS but can read from any index in the LDS. Compute shaders also provide synchronization primitives used to synchronize within a thread group. No communication or synchronization is possible between thread groups.
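To make the pixel-shader rules concrete, the following toy model (plain Python, not the CAL API) mimics a pixel-shader dispatch: one logical thread per (x, y) position, unrestricted reads from the read-only inputs, and writes only to the thread's own element of the render backend. The kernel, grid size, and input used here are illustrative assumptions only.

import numpy as np

def launch_pixel_shader(kernel, width, height, inputs):
    """Toy model of a pixel-shader dispatch: one logical thread per (x, y).
    Threads may read any index of the read-only inputs, but each thread
    writes only its own element of the render backend o0.  Real hardware
    runs the threads in wavefronts in an undefined order; the end of the
    kernel is the only synchronization point."""
    o0 = np.zeros((height, width), dtype=np.float32)
    for y in range(height):
        for x in range(width):
            o0[y, x] = kernel(x, y, inputs)
    return o0

# Example kernel: each thread sums one row of the read-only resource i0.
def row_sum(x, y, inputs):
    i0 = inputs[0]
    return float(i0[y, :].sum())

out = launch_pixel_shader(row_sum, 4, 4, [np.arange(16, dtype=np.float32).reshape(4, 4)])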


Figure 2.3: CAL programming model


2.3 Hardware execution model

2.3.1 Pixel shaders

The programmer can specify millions of threads to be executed. However, not all threads are executed at the same time on the GPU. In pixel-shader mode, the hardware divides the two-dimensional grid of threads into 8×8 groups of threads called wavefronts. An SIMD runs one or more wavefronts in parallel. The wavefront size varies from GPU to GPU: it is 64 on the RV770 but 32 on the RV730. A wavefront starts executing on an SIMD and continues executing until it encounters a non-ALU instruction (such as a texture-related instruction). Then the wavefront is swapped out and replaced with another wavefront, while the original wavefront is sent back to be scheduled on the unit required to complete the instruction that it is waiting on. Thus the GPU uses the massive parallelism to hide the latency of memory loads.

The number of wavefronts that can be run in parallel on an SIMD in pixel-shader mode is decided by the availability of registers. If R denotes the number of registers required per thread, W denotes the wavefront size (64 on the RV770), N is the number of wavefronts running in parallel and P is the number of physical registers in the register file, then N = P/(W × R). The driver is responsible for partitioning the physical register file among all the running wavefronts. When a wavefront is swapped out from an ALU to a texture unit, the registers being used by the wavefront are retained in the register file until the wavefront finishes execution.

For the RV770, the wavefront size of 64 is four times the number of thread processors per SIMD. Thus each thread processor is running four threads in parallel, which is likely to hide the latency of the ALU units. The register file is divided into 64 banks and thus I speculate that each thread in a wavefront reads and writes to exactly one bank.

2.3.2 Compute shaders

Compute shaders are similar to pixel shaders in most respects. To execute a compute shader, 64 contiguous thread ids in a thread group are chosen as a wavefront. The maximum number of wavefronts that can run in parallel in a compute shader is determined by the number of registers used per thread as well as the amount of local data share owned by each thread.

Shared registers

The register file in a compute shader is used not only for the per-thread registers but also for shared registers. Shared registers are registers shared amongst threads in a thread group. However, shared registers have the restriction that a register is actually only shared between threads n, n+64, n+128, and so on within a thread group. Thus thread 0 and thread 1 within a thread group do not actually see the same register. Instead there are 64


distinct and independent copies of a shared register. This is explained by the fact that a shared register is actually shared by threads in different wavefronts.

There are 64 threads in a wavefront and 64 banks in the register file. When a shared register is declared, one entry in each of the 64 banks is reserved as the shared register, thus creating 64 independent copies that are not consistent with each other. Thread 0 and thread 1 can only read/write to different banks. For instance, thread 0 cannot see the copies of the shared register seen by thread 1 because threads in the same wavefront cannot access the same banks. Thread 0 and thread 64, thread 1 and thread 65, etc., are in different wavefronts but associated with the same bank in the register file, and hence can read/write to the 0th and 1st copy of the shared register, respectively. Shared registers can be useful in reduction-type kernels. Shared registers increase register pressure because they reduce the register availability in the register file.

Local Data Share

Apart from shared registers, the LDS can be used for data sharing within a thread group. The size of the LDS allocation is declared in terms of the number of 128-bit values owned by each thread. All the entries owned by a thread are stored in the same bank in contiguous locations. The banks are assigned to threads in a round-robin fashion. Thus, thread 0 is associated with bank 0, thread 1 with bank 1, thread 2 with bank 2, thread 3 with bank 3, thread 4 with bank 0, and so on. Each thread can read/write any index in the LDS by specifying the thread number that owns the data and the offset within the portion of memory owned by the target thread. However, a thread may need to wait for many cycles before reading from or writing to the LDS because the LDS has only 4 banks and each bank can only service one 128-bit value per cycle.

For data sharing, shared-register access is as fast as non-shared register access but has restricted sharing semantics. The LDS has more flexible sharing semantics, but the bandwidth from the LDS is much lower than that of the register file.

2.4 Code generation for matrix multiplication

A matrix-multiplication example can be used to illustrate the effect of the ALU:TEX ratio on the performance of a GPU performing general-purpose computations. If the code were to do N loads and N FLOPs, then the ALUs would be idle 95% of the time. Any code that needs to be executed on the RV770 should execute very few load instructions and lots of arithmetic instructions. Ideally the code should contain 20 ALU operations for every TEX instruction.

Two important observations can be made about the GPU. First, the ALUs can only operate on data present in registers. To execute a load, the ALU invokes a sample instruction


on the texture unit to bring the data first into a register. Second, the texture units run in parallel to the ALUs. Therefore, to predict performance, the number of cycles required to execute the texture instructions and the number of cycles required to execute ALU instructions are calculated separately. The larger of the two values represents the bottleneck in the computation.

Consider matrix multiplication of matrices of size N*N. The syntax used in the examples is very similar to code accepted by the compiler library, calseum, that I wrote. The syntax corresponds roughly to C-like code. First, consider a naive code. This code launches N*N threads, each computing one scalar value. Consider the following pseudo-C code that represents the code being executed by one thread.

float[] i0;
float[] i1;
float o0;
float sum = 0.0;
for(int i = 0; i < N; i++) {
    /* loop body sketched from the description above: each thread
       accumulates one element of the product */
    sum += i0[threadId.x][i] * i1[i][threadId.y];
}
o0 = sum;


float sum = 0.0;
for(int i=0;i


float4 temp1 = i1[i][threadId.y/4];
float4 temp2 = i1[i+1][threadId.y/4];
float4 temp3 = i1[i+2][threadId.y/4];
float4 temp4 = i1[i+3][threadId.y/4];
sum.x += temp.x*temp1.x;
sum.x += temp.y*temp2.x;
sum.x += temp.z*temp3.x;
sum.x += temp.w*temp4.x;
sum.y += temp.x*temp1.y;
sum.y += temp.y*temp2.y;
sum.y += temp.z*temp3.y;
sum.y += temp.w*temp4.y;
sum.z += temp.x*temp1.z;
sum.z += temp.y*temp2.z;
sum.z += temp.z*temp3.z;
sum.z += temp.w*temp4.z;
sum.w += temp.x*temp1.w;
sum.w += temp.y*temp2.w;
sum.w += temp.z*temp3.w;
sum.w += temp.w*temp4.w;
}

The code above contains sixteen ALU operations and five load instructions, resulting in an ALU:TEX ratio of 3.2. This yields a peak theoretical performance of 172 GFLOPs for the Radeon 4870. We can increase the tile size further to increase the ALU:TEX ratio. The tile size is restricted by the number of registers available per thread. To hide memory latency, let us assume we run 8 wavefronts per SIMD, giving 5120 threads on the chip. Then we get 32 registers available per thread, each of size 128 bits. Thirty-two registers allow for a tile of up to 5×8, giving an ALU:TEX ratio of 12 with a predicted performance of 720 GFLOPs. If the memory access latency is not completely hidden, then the actual performance will be less than the predicted value.

From the matrix-multiplication example, it can be concluded that increasing the ratio of ALU instructions to texture instructions is essential for achieving maximum performance on the GPU. Loop transformations such as unrolling and tiling, and the use of packed 128-bit datatypes wherever possible, can substantially reduce the number of texture instructions executed.

2.5 On-chip bandwidth bottlenecks

A simple way to understand the ALU:TEX bottleneck is to compare the on-chip bandwidth available on the GPU with the bandwidth available on a modern CPU. Let us compare the bandwidth from the L1 cache and the bandwidth from the register file on the Radeon 4870 to those of an AMD Phenom II X4 940 CPU. The Phenom II X4 940 is a relatively high-end 3.0 GHz x86 quad-core. We compare the bandwidths available to the X4 940 to the bandwidths available


to the Radeon 4870, which runs at 750 MHz coupled with 900 MHz GDDR5. One core of the X4 940 can do one SSE ADD and one SSE MUL each cycle. Thus it must be capable of reading 16 floats, or 64 bytes, from the SSE register file each cycle. Each core of the X4 940 can also load 32 bytes each cycle from the L1 data cache. The bandwidth estimates for the Radeon 4870 and the Phenom II X4 940 are provided in Table 2.1.

Table 2.1: Comparison of bandwidth available to Radeon 4870 and Phenom II X4 940

Property                  Radeon 4870   Phenom II X4 940   GPU/CPU ratio
Register file bandwidth   7680 GB/s     768 GB/s           10
L1 cache bandwidth        480 GB/s      384 GB/s           1.25
RAM bandwidth             115.2 GB/s    17 GB/s            6.77

The GPU has almost the same bandwidth from the texture units as a CPU has from its L1 data cache. In situations where the cache bandwidth is the limitation, the GPU simply cannot reach anywhere near its theoretical ALU peak. Therefore the programmer has to spend a considerable amount of effort on loading as much data as possible into registers. Further, while the GPU cache sizes have not been published by AMD, the cache sizes on GPUs have traditionally been much smaller than the cache sizes on a CPU. Taking into account the number of threads that are utilizing the cache at the same time, the cache is of very limited use on the GPU when compared to a CPU. The GPU relies almost entirely upon its large register file and multi-threading to hide load latencies.

2.6 Final Remarks

This chapter described the architecture of a typical contemporary GPU, the AMD RV770. It then analysed the issues involved in the generation of efficient code for such a GPU. As this analysis showed, the compiler must take into account the memory hierarchy of the GPU and must properly utilize the register file to efficiently feed the ALU.
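The figures in Table 2.1 follow from the per-cycle widths and clock rates given in this chapter; the sketch below rederives them. The 128-bit fetch per texture address and the 4x data rate of GDDR5 are assumptions introduced here to reproduce the published numbers, and the 17 GB/s CPU RAM figure is taken as given.

# Rederiving the Table 2.1 bandwidth figures from per-cycle widths and clocks.
GB = 1e9
gpu_clock = 750e6                             # Radeon 4870 core clock
cpu_clock = 3.0e9                             # Phenom II X4 940

gpu_regfile = 10 * 1024 * gpu_clock / GB      # 10 SIMDs x 1024 B/cycle = 7680 GB/s
gpu_l1      = 40 * 16 * gpu_clock / GB        # 40 addresses/cycle x 16 B = 480 GB/s
gpu_ram     = 32 * 4 * 900e6 / GB             # 256-bit GDDR5 @ 900 MHz = 115.2 GB/s

cpu_regfile = 4 * 64 * cpu_clock / GB         # 4 cores x 64 B/cycle = 768 GB/s
cpu_l1      = 4 * 32 * cpu_clock / GB         # 4 cores x 32 B/cycle = 384 GB/s
cpu_ram     = 17.0                            # as stated in Table 2.1

print(gpu_regfile / cpu_regfile, gpu_l1 / cpu_l1, gpu_ram / cpu_ram)  # ~10, 1.25, ~6.8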


2.6 Final Remarks

This chapter described the architecture of a typical contemporary GPU, the AMD RV770, and then analysed the issues involved with the generation of efficient code for such a GPU. As this analysis showed, the compiler must take into account the memory hierarchy of the GPU and must properly utilize the register file to efficiently feed the ALU.

Chapter 3
Python and the proposed programming model

The objective of this thesis is to provide the scientific programmer with a high-productivity programming model that is portable to both GPUs and multi-core CPUs. Increasingly, scientific programmers are choosing high-level programming languages such as Matlab and Python over languages such as C and Fortran. For parallel programming, shared-memory paradigms, such as OpenMP, are popular with C/C++ programmers.

This thesis provides the programmer with a programming model based on a combination of Python and the concept of parallel loops inspired by OpenMP parallel loops. The choice of Python was based on multiple reasons.

1. Unlike special-purpose languages such as Matlab, Python is a fully featured general-purpose programming language well known for its simplicity and productivity. NumPy is a multi-dimensional array library for Python providing a very flexible array abstraction with concise notations for operations such as slicing. Python in combination with NumPy is well suited for rapidly developing full applications in a wide variety of domains.

2. Python and NumPy are both open source and therefore it was much easier to understand the inner workings of Python and NumPy.

3. Python, in combination with NumPy, is becoming popular in the scientific and numeric computation community because of its simplicity and productivity. A library collection aimed at scientific computation for Python, called SciPy and built upon NumPy, is also available. SciPy includes libraries for tasks such as function optimization. An active community is forming around NumPy and SciPy.
Various mailing lists, wikis and project listings can be found at the scipy.org website.

This chapter introduces the relevant features of Python and NumPy, followed by the proposed extensions to Python and NumPy.


3.2 NumPy

Python has no native-language support for arrays. NumPy is an extension module that defines a fast multi-dimensional array class. NumPy is implemented in C and uses Python's support for operator overloading to provide flexible indexing. Indexing in NumPy arrays is zero-based and subscripting is bounds-checked.

NumPy arrays are much more general abstractions than arrays in languages such as C. Before going into a detailed description of NumPy arrays, I first examine some properties of C arrays and pointers.

C arrays differ from C pointers in that C arrays are actually allocated regions of memory while pointers are just views into already allocated data. If a C array goes out of scope, then the memory region allocated for the array is also destroyed. If a pointer goes out of scope, the memory being pointed to by it is not automatically freed. Therefore, C arrays can be thought of as owning the data while pointers do not own the data they point to.

NumPy arrays are a general abstraction that encompasses both C arrays and C pointers. NumPy arrays contain a field called data that points to the raw data buffer holding the data. NumPy arrays also have a flag bit that indicates whether or not the NumPy array owns the data. If a NumPy array owns the data, then the raw data buffer is freed when the array destructor is called. If the NumPy array does not own the data, then the data buffer is not freed when the array destructor is called. Therefore, NumPy arrays can either act as true owners of data or as views (or pointers) into data owned by some other object.

NumPy arrays can be indexed with integers just like C arrays or pointers, but the address arithmetic is much more general than that of C pointers. Consider the snippet of C source code in Figure 3.2. The expression p[i][j][k] refers to the address p + s_1*i + s_2*j + s_3*k, where s_1 = m*n, s_2 = n and s_3 = 1. The values s_1, s_2 and s_3 can be thought of as strides in 3 dimensions, representing the number of memory locations moved when the corresponding subscript is incremented by one. Due to the pointer declaration syntax and semantics of C, the last stride (in this case s_3) is always constrained to be 1. The strides are also always positive in C and follow a very regular ordering.
For example, s_i is always greater than s_{i+1} in C.

    int i, j, k;
    /* ... */
    char (*p)[m][n] = some_function();
    char c = p[i][j][k];

Figure 3.2: C pointer example


Unlike C, NumPy does not impose any order upon the strides used in index computations. NumPy disassociates strides from the dimensions of the array and allows the programmer to specify arbitrary integer strides for memory address computation. Each NumPy array has a field called strides. The field strides for an n-dimensional NumPy array is an array of n integers representing the stride in each dimension. A stride in dimension d defines the number of bytes in memory that must be jumped when the index in dimension d is incremented by 1. If arr is a NumPy array with n dimensions, and (i_0, i_1, ..., i_{n-1}) is an index into the array, then the actual memory address M referenced is computed as follows:

    M = arr.data + \sum_{j=0}^{n-1} arr.strides[j] \cdot i_j        (3.1)

The programmer specifies the strides of an array. A stride is an arbitrary integer, including zero. The generalized strides enable a NumPy array to emulate C-style row-major array indexing, Fortran-style column-major indexing and many types of non-contiguous data layouts, depending upon the programmer-specified strides.

Apart from the generalized stride-based indexing, the second key feature of NumPy is the support for multiple views into the same underlying data. NumPy provides many ways to create such views, the most common of which is the slicing operator. Slicing for arrays is defined as follows. Let ar be a one-dimensional array. Let br = ar[start:stop:step] be a slice of the array ar. Then element br[i] is the same element as ar[i*step + start]. The length of the array br is given by ⌈(stop − start)/step⌉. Slices are views into the original array and no data copying occurs when creating a slice. Multi-dimensional arrays can be sliced independently in each direction.
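As a concrete check of Equation (3.1) and of the owns-data flag discussed earlier, consider the short interpreter session below (an illustration added here, with arbitrarily chosen indices):

    import numpy

    arr = numpy.zeros((100, 300))      # row-major float64: strides are (2400, 8)
    v = arr[5:97:3, 3:300:2]           # a view with strides (7200, 16); no copy is made
    print arr.flags['OWNDATA'], v.flags['OWNDATA']   # True False: v only borrows the buffer

    # Equation (3.1): byte offset of v[2, 4] from the start of the view
    print v.strides[0] * 2 + v.strides[1] * 4        # 14464
    arr[5 + 3 * 2, 3 + 2 * 4] = 42.0   # same element, addressed through the parent array
    print v[2, 4]                      # 42.0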
Figure 3.3 shows some examples of array creation and slicing. Line 1 imports the NumPy library. Line 2 allocates an array arr of doubles of size 100×300, initialized to zero. The strides of this array emulate a row-major storage order. Line 3 creates a slice of arr. The notation 5:97:3 states that the slice starts at 5, stops at 97 (excluding 97 itself), and has a step size of 3. Line 4 writes the constant 1.0 into the fifth row, third column of arr. Line 5 prints 1.0 because the element [0,0] of the slice v1 corresponds to the element [5,3] of the array arr. Line 7 prints 2.0, the constant assigned to the array via slice v1 in line 6. Each NumPy array has a field called strides that is a tuple storing the strides of the array in each dimension in bytes. Lines 8 and 9 print (2400,8) and (7200,16), respectively, assuming that each element is 8 bytes. Line 10 creates a new view, v2, of arr where the axes are swapped. Thus, line 11 prints (8,2400). In line 12, a new one-dimensional view v3 of arr is created. Line 13 prints 1.0 because element 300*5+3 of v3 is the same as element [5,3] of arr. Line 14 creates a view v4 of v3 using the slice [3:100*300:5], which starts at element 3 and has a stride of 5 elements, or 40 bytes. Line 15 prints 1.0 because element 300 of v4 is the same as element 5*300+3 of v3 or, equivalently, element [5,3] of arr.


 1  import numpy
 2  arr = numpy.zeros((100,300))
 3  v1 = arr[5:97:3, 3:300:2]
 4  arr[5,3] = 1.0
 5  print v1[0,0]
 6  v1[1,2] = 2.0
 7  print arr[8,7]
 8  print arr.strides
 9  print v1.strides
10  v2 = arr.swapaxes(0,1)
11  print v2.strides
12  v3 = arr.reshape(100*300)
13  print v3[300*5+3]
14  v4 = v3[3:100*300:5]
15  print v4[300]

Figure 3.3: Examples of array slicing in Python.

The strides field can be changed directly, without going through slicing or method calls such as reshape. The strides of a NumPy array can also be changed through the NumPy C API.

3.3 Python-C API

Python combines the development flexibility of a dynamically-typed language with an API for efficient C implementation of performance-critical portions (functions and classes) of an application. This Python-C API provides many utility functions for passing data back and forth between C and Python. Python exposes all Python objects as a PyObject datatype in C. The API provides various convenience functions and macros to convert between PyObject and C native types such as int, long and structures. Various other C datatypes are also provided for Python types such as list and tuple. However, the base type of all these types is PyObject.
NumPy's C-API exposes NumPy arrays as the PyArrayObject datatype in C and allows direct manipulation of all fields of the NumPy object.

For implementing functions, the Python-C API requires that any function that is to be used as a Python function must have a predefined type signature and must pass and return pointers to PyObjects, because the Python interpreter only knows about PyObjects.

Python modules implemented in C utilizing the Python-C API are called Python extension modules. To write an extension module foo in C, a programmer first writes the code for the functions to be implemented in C in one or more C files. The programmer then writes a special module initialization function, which must be named initfoo if the desired module name is foo.


The C code is then compiled into a single dynamically loadable library (a DLL on Windows or a shared object file on Linux) called foo.so on Linux or foo.dll on Windows. The module initialization function passes a symbol table mapping C strings to C pointers to functions or variables of suitable type to be exported to Python. When the Python interpreter encounters an import foo statement, it first looks for a file called foo.so. If foo.so is found, then the interpreter looks for the function initfoo to initialize the module. If the initfoo function is not found, then Python throws an exception. If the file foo.so is not found, then Python looks for a Python source file called foo.py.

For details, see Python's official C-API documentation [6].

3.4 Programmer's workflow

The typical development of applications containing performance-critical, numerically intensive sections of code in Python consists of writing a prototype in Python, profiling the code to identify performance-critical sections, and then re-writing these sections in a compiled language such as C or C++. The programmer also writes glue code in C/C++ using the Python-C API. Figure 3.4(a) presents a block diagram that illustrates this state-of-the-art development process and emphasizes the code that has to be written by a developer. Rewriting the code in C/C++ is a tedious and error-prone process that reduces the developer's productivity and results in a code base that is less maintainable. If a GPU version of the code is also required, then the amount of code to be written by the programmer, and the complexity of the code base, is further increased.

This thesis presents an alternative for the development of such applications in Python, as illustrated in Figure 3.4(b).
It introduces a new compiler system for automatic generation of C++ code for the performance-critical Python sections. The programmer is not required to rewrite the code in C++. Instead, the programmer annotates the Python code with type declarations and parallel-loop annotations.

The compiler framework that supports this new programming model is described in detail in chapter 4. The compiler automatically generates the C++ code as well as the required glue code for the annotated code. The programmer then invokes a standard C++ compiler to compile the code into a DLL. This DLL is a drop-in replacement for the original Python module. To generate code for multi-core CPUs, the programmer only needs to add parallel-loop annotations, where possible, to the code before compilation. To take advantage of GPU acceleration, the programmer simply annotates certain parallel loops to be executed on the GPU and the compiler handles everything else.

The implemented compiler can only deal with a subset of Python and requires type annotation. However, since only a small portion of the application needs to be compiled to C++, only a small portion of the Python code needs to conform to these restrictions. The rest of the application, which does not need to be compiled, remains as is and requires no changes.


Figure 3.4: Comparison between the state of the art for generation of numerical-intensive applications for execution in CPU/GPU and the proposed programming model. (a) State of the Art; (b) Proposed Programming Model.

Furthermore, we designed the annotations in such a way that they are still compliant with the Python grammar and are transparent to the Python interpreter. Thus, if the programmer does not wish to compile the module to C++ during development, to avoid the build and link steps, the annotated Python will run on the interpreter.

3.5 Proposed extensions to Python

For efficient and simple compilation, several extensions to Python were introduced in the form of various annotations. The key idea behind the syntax of the annotations is that no new keywords should be introduced and the grammar of the language should not be altered. The language should remain compatible with the standard Python interpreter and any annotations introduced should be essentially transparent to the interpreter. However, when the unPython compiler encounters the annotations, it treats them as special compiler directives and accordingly generates the appropriate C code. Thus, even if the code is not compiled, the code can run on any standard Python interpreter.
The ability to run on the standard Python interpreter has the benefit of faster development and debugging cycles for the programmer by avoiding the compilation step until the code is fully debugged. The interpreter compatibility also ensures easier debugging of the compiler because a mismatch between the compiled and the interpreted versions of the code usually indicates a bug in the compiler.


3.5.1 Type annotations

Type declarations enable the compiler to generate efficient C code. Python is a dynamically typed language and has no type declaration syntax. For this purpose, an extension was introduced to Python to declare the type signature of a function.

In this extension, the programmer creates a special decorator to annotate each function to be compiled. This decorator declares the types of the function's parameters and of its return value. Decorators are annotations added just before a function definition and are a standard Python syntax feature. Optionally, the types of local variables may also be declared in a similar fashion. If no such declarations are provided, the compiler infers the type of a local variable based on the type of the first value assigned to the variable in the function. In this framework, a variable cannot change its type within a function.

Figure 3.5 shows an example of the type-declaration syntax. The decorator unpython.type is added to the function declaration. Types are passed to the decorator as arguments. If the function takes x parameters, then x + 1 types are required. The first x types are for the parameters and the last type specified is for the return type of the function. In the example, the first type is for the parameter n and the second type is for the return type. The types of sum and i are automatically inferred by the compiler to be int64, which is the default integer type. These types can be forced to be int32 by adding type declarations for the local variables.
    @unpython.type('int64', 'int64')
    def f(n):
        sum = 0
        for i in range(n):
            sum += i
        return sum

Figure 3.5: Example of decorator used as a type declarator.
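Because the annotations must be transparent to the interpreter, annotated code can run even when the unpython module is a plain Python stub. A minimal sketch of such a stub, assuming the module simply ignores the declared types (the support module itself is not shown in the text, so its body here is an assumption):

    # Hypothetical pure-Python unpython module: the decorator records nothing
    # and returns the function unchanged, so Figure 3.5 behaves identically
    # under the standard interpreter.
    def type(*signature):        # shadows the builtin only inside this module
        def identity(func):
            return func
        return identity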


3.5.2 Parallel annotations

The Python interpreter does not provide any support for parallel programming. One way to take advantage of multiple processor cores is to utilize process-based parallelism, where multiple instances of the Python interpreter may be launched. Process-based parallelism is not suitable for all types of problems and is not considered in this thesis. The only way to take advantage of multiple processor cores within a single Python process is to write a multithreaded function in C using an API such as pthreads or OpenMP. However, the standard Python interpreter does not execute Python code concurrently in multiple threads, and calling Python-C API functions from two different C threads is unsafe. Therefore, the programmer must be careful about calling Python-C API functions from multithreaded functions.

I introduce an extension to mark certain for loops as parallel, providing a shared-memory data-parallel model similar to (but less general than) the OpenMP parallel for loop directives. To avoid the complexity of specifying parallel semantics for arbitrary iterators, only a parallel version of the xrange iterator is defined. A special iterator, prange, is introduced to specify that a loop is parallel. The iterator prange is only a hint to the compiler that the programmer intends to execute the loop in parallel, but the compiler is not obligated to compile the loop to a parallel C++ loop. To the Python interpreter, prange is identical to xrange. However, unPython treats the annotation as a special parallel-loop annotation. A for loop over the parallel iterator prange has the following properties:

1. Let the program point immediately before the loop be P. Then the program execution is serial before P.

2. Let the program point immediately after the loop be Q. Then the program may spawn multiple threads between P and Q. The point Q is an implicit join point: program execution does not continue until all spawned threads reach Q. After Q, serial execution of the program continues.

3. Each iteration of the loop is independent and may be executed in any order.

4. Parallel loops may be nested but, in this model, a join point only exists at the end of the outermost parallel loop.

5. Parallel loops are a compiler directive and the compiler is not obligated to parallelize the loop.

6. If there is a data race between any two iterations, then the output of the program is undefined.

7. The body of the loop may only access variables of basic numeric types (float32, float64, int32 and int64) or NumPy arrays of basic numeric types. If any other object type is referenced in the loop, then the loop is not parallelized by the compiler, to avoid the possibility of calling Python-C API functions from two different C threads.
8. All variables defined outside the loop are treated as shared variables among the threads. However, a programmer can define some variables as private by passing the keyword argument private with a value equal to a tuple containing the variable names as strings.

9. All variables local to the loop are treated as private.
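A minimal sketch of how a loop written against these rules might look, assuming the prange iterator is imported from an unpython module and that the private keyword argument is passed to prange itself (the type-declaration decorators of Section 3.5.1 are omitted; this example is not from the text):

    from unpython import prange        # assumed import; when the code runs on the
                                       # plain interpreter, prange behaves like xrange

    def scale(a, x, y, n):
        for i in prange(n):            # iterations are independent: no data races
            y[i] = a * x[i]            # only numeric scalars and NumPy arrays appear

    def scale_private(a, x, y, n):
        tmp = 0.0                      # defined outside the loop, so shared by default
        for i in prange(n, private=('tmp',)):   # request a per-thread copy of tmp
            tmp = a * x[i]
            y[i] = tmp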


UnPython generates OpenMP code for parallel loops that will be executed on multi-core CPUs.

A variant of prange called gpurange is also introduced. This parallel loop type is meant for acceleration by the GPU. The compiler is responsible for generating GPU code as well as for managing data transfers between CPU and GPU automatically. However, the compiler is not obligated to utilize this directive and may not generate GPU code. Further, if the target machine does not have a suitable GPU or if GPU code generation is disabled through compiler command-line options, then the compiler does not generate any GPU code. Thus, the same code can easily be ported to multiple platforms.

3.5.3 Constraints on Python code to be compiled

To efficiently compile Python, unPython only accepts a subset of Python that is critical for the performance of numerical applications. Currently, unPython does not support features such as runtime code execution, higher-order functions, generators, metaclasses and special methods such as __setitem__. UnPython supports 32- and 64-bit integers and 32- and 64-bit floating point numbers, but it does not support arbitrary-precision arithmetic. Ensuring that no overflows occur is the responsibility of the programmer. These restrictions only apply to the compiled code. All features are available for the non-compiled portion of the application code that is executed by the Python interpreter.

3.6 Conclusion

This chapter provided a brief overview of the syntax and semantics of Python, NumPy and the proposed extensions.
These extensions were carefully designed to ensure that the parallel-annotated code written for execution on a GPU can also be run by an unmodified Python interpreter. The flexibility of NumPy arrays combined with the proposed parallel loop extensions provides the programmer with very productive tools to write parallel numerical programs.


Chapter 4
Compiler implementation

The compiler system presented in this thesis is a hybrid combination of two different compilers: unPython, an ahead-of-time compiler, and jit4GPU, a just-in-time compiler. Generating CPU code is solely the responsibility of unPython. For GPU code generation, several analyses need to be performed that cannot be done completely ahead of time. Therefore, code generation for the GPU is primarily done by jit4GPU, assisted by information supplied by unPython. To understand the working of the entire system, it is simpler to consider the code generation process for CPUs and GPUs separately.

4.1 CPU code generation

Consider a Python program to be compiled for a multi-core CPU target. The programmer invokes unPython, passing it the name of the Python file. A block diagram of unPython is given in Figure 4.1. UnPython reads the Python source code and generates C++ and OpenMP code for execution of the program on the CPU. UnPython is a simple translator and therefore it only performs a few transformations.

The front-end of unPython is simply a Python script that parses Python code using the standard Python interpreter's parsing interface.
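A minimal sketch of what such a front-end step can look like using the standard library's ast module, assuming a hypothetical input file name and dump format (the actual front-end script and its output format may differ):

    import ast

    source = open('kernel.py').read()              # hypothetical annotated module
    tree = ast.parse(source)                       # parse with the interpreter's own parser
    open('kernel.ast', 'w').write(ast.dump(tree))  # dump the tree for the middle end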
The parser returns an abstract syntax tree (AST) and the AST is dumped into a file by the front-end script. The AST is then read back into the middle end of unPython. After parsing, unPython performs type-checking, which includes type inference for local variables. The initially untyped AST is converted into a typed AST by the type-checking phase. The typed AST is then lowered into a simpler typed AST. Lowering the AST involves breaking down complex expressions and statements into simpler expressions and statements that are closer to the desired C++ code. After lowering, unPython performs liveness analysis using an iterative dataflow algorithm. The liveness analysis algorithm is currently very restricted and only works on functions that only manipulate numeric scalar types and numeric NumPy arrays. If a function involves objects of other types, the liveness analysis is not performed.


Results of the liveness analysis are used by the compiler to determine which scalar variables referred to within a parallel loop can be declared as private when generating OpenMP code. After liveness analysis is complete, the compiler simply traverses the typed AST to generate the final C++ and OpenMP code.

Figure 4.1: Block diagram of unPython for CPU code generation (Python source code → Parser → Type Checker → Tree Lowerer → Liveness Analysis → Code generator → C++ with OpenMP).

To generate code, the compiler generates the function and method bodies required. All boiler-plate code required (such as wrapper functions for the Python-C API) is generated with the help of a text-templating system. UnPython is implemented in a combination of the Java and Scala languages. Scala is a statically-typed object-oriented language with many features derived from functional programming languages and it runs on the JVM. The FreeMarker text-templating library for Java is used for generating boiler-plate code.

4.2 GPU code generation

For GPU code generation, simple translation is not sufficient. The objective is to map a shared-memory parallel-loop programming model to a heterogeneous system with two distinct address spaces. Therefore, the compiler needs to not only generate GPU code, but also perform analysis to determine which data must be transferred between the two distinct address spaces. The compiler needs to know the memory access patterns of all the array references inside the loop to analyze the data to be transferred to the GPU.


The memory access pattern analysis and GPU code generation are done by a JIT compiler, jit4GPU, and unPython only plays a supporting role in GPU code generation.

4.2.1 Motivation for the JIT compiler

One of the original goals in the project to compile Python was to extend unPython to generate GPU code. However, after some effort, it became clear that unPython does not have sufficient information to perform the analysis required for GPU code generation. For example, the compiler does not know the data layout and the aliasing of NumPy arrays that are parameters to a function being compiled. NumPy arrays are generally views of an underlying data array and are more like pointers than true arrays from the perspective of compiler analysis. The memory layout of a NumPy array is determined by the strides of the NumPy array, and these strides are dynamic quantities unknown to unPython. Two apparently distinct NumPy arrays passed into a function may be two different views of the same piece of memory. A somewhat similar problem arises in C functions that have pointer parameters. Typically, C compilers perform a global analysis and can customize the function differently for different calling contexts. However, such an analysis is not applicable for unPython. UnPython is not a whole-program compiler and therefore cannot do global analysis, such as finding the calling context of all functions.

The objective of unPython is to compile Python libraries that can be used by various applications.
The idea is that unPython will only see a small portion of the application, because only a small portion of the application is performance critical. Python programmers cannot be expected to impose the typing restrictions required by unPython on their entire program. Even if a whole-program compiler were implemented for Python, Python applications often contain bindings to C libraries, and those libraries are opaque to a Python compiler unless it is also interfaced with a full C compiler. The complexity of such a system can be prohibitive to implement. Thus, effectively, unPython cannot do any global analysis. In the absence of a global analysis, one other possible solution for generating code from unPython was to generate different versions of the loop for different cases of layouts of the NumPy arrays accessed within a loop. But NumPy arrays are very general structures and the number of possible loop versions can become intractable as the number of NumPy arrays increases.

A simpler solution to the problem of performing analysis for GPU code generation is to compile a parallel loop to GPU code just before the loop is to be executed. At this stage, jit4GPU can query all the NumPy arrays to determine their data layouts and also knows the values of loop-invariant numeric constants. For example, loop bounds may appear to be unknown symbolic constants to unPython, but those constants are known to jit4GPU.
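For instance, the layout and aliasing facts that are unknown ahead of time can all be read off the arrays at run time, just before the loop is launched. The snippet below is an illustration of what is queryable (jit4GPU gathers the equivalent information internally rather than through these Python attributes):

    import numpy

    a = numpy.zeros((100, 300))
    b = a[::2, ::3]                  # a strided view: no copy, non-trivial strides
    c = a.reshape(100 * 300)         # another view of the same buffer

    print b.shape, b.strides, b.dtype            # the actual memory layout
    print b.base is a, c.base is a               # both views alias a's buffer
    print a.__array_interface__['data'][0]       # raw base address of the data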


Jit4GPU operates on a typed AST of the loop to be compiled and performs several analyses and optimizations detailed later in Chapters 5 and 6.

The JIT compiler is actually not visible to the programmer. As far as the programmer is concerned, an additional library (the JIT compiler) is linked into the application, but the programmer is not concerned with the function or inner workings of the JIT compiler. The JIT compiler also makes no difference to codes that do not target GPUs.

4.2.2 GPU code generation

Now consider the full process of generating GPU code. When unPython encounters a loop that is marked by the programmer to be compiled for execution on the GPU, it outputs two versions of code for that loop. First, unPython outputs a typed AST representation of the parallel loop to be used by jit4GPU for analysis and code generation. The representation is printed out in a Lisp-like S-expression format read by jit4GPU, along with a symbol table and a table of the array references inside the loop. However, jit4GPU is not guaranteed to produce GPU code and can return failure if any of the analysis phases fails. Therefore, unPython also generates a C++ and OpenMP code path for the parallel loop, which is used as a fallback in case jit4GPU fails to generate GPU code.

Jit4GPU is implemented as a multi-phase compiler operating on typed ASTs. An overview of jit4GPU is provided in Figure 4.2. In the first phase, jit4GPU performs an array-access analysis, detailed in Chapter 5, to determine the data to be transferred to the GPU. The array-access analysis phase produces two outputs: a data structure representing the data transfers to be performed and a tree of code to be executed on the GPU. If jit4GPU fails to perform the access analysis, then it returns failure. In the next phase, jit4GPU optimizes the typed AST using the loop optimizations detailed in Chapter 6. The loop optimization phases optimize the AST and can also alter the data layout chosen on the GPU, and can therefore also change the parameters for data transfer.
In the next phase,jit4GPU generates AMD IL assembly from the AST and passes this assembly code to thecompiler that c<strong>on</strong>verts AD IL code to GPU binaries. The AMD IL to binary compilerper<str<strong>on</strong>g>for</str<strong>on</strong>g>ms extensive low-level code trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s, such as physical register allocati<strong>on</strong>, <strong>on</strong>the GPU and instructi<strong>on</strong> scheduling <strong>on</strong> various units <str<strong>on</strong>g>of</str<strong>on</strong>g> the GPU. The exact optimizati<strong>on</strong>sper<str<strong>on</strong>g>for</str<strong>on</strong>g>med by the AMD compiler are not published by AMD but I have observed, looking atthe disassembler output, that the AMD CAL compiler per<str<strong>on</strong>g>for</str<strong>on</strong>g>ms code trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s. Inthe final phase, jit4GPU issues a call to a supporting runtime library to per<str<strong>on</strong>g>for</str<strong>on</strong>g>m the actualdata transfer and to execute the code <strong>on</strong> the GPU.Once the executi<strong>on</strong> is finished and data is transfered back to the CPU, jit4GPU reportssuccess and the executi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> the program resumes <strong>on</strong> the CPU from the program point justafter the parallel loop.31
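The try-then-fall-back contract between the two codepaths can be summarized with a short sketch. The code below is only a minimal illustration of the dispatch logic described above, not the actual unPython/jit4GPU API; the helper names and the tiny NumPy kernel are invented for the example.

    import numpy as np

    def jit4gpu_try_compile(typed_ast, arrays):
        # Stand-in for jit4GPU: a real implementation would run the array-access
        # analysis and loop optimizations and return a GPU kernel, or None if
        # any analysis phase fails.  Here it always fails, to exercise the
        # fallback path.
        return None

    def openmp_fallback(a, b, out):
        # Stand-in for the C++/OpenMP codepath emitted ahead of time by unPython.
        out[:] = a + b

    def run_parallel_loop(typed_ast, a, b, out):
        kernel = jit4gpu_try_compile(typed_ast, (a, b, out))
        if kernel is not None:
            kernel(a, b, out)             # data transfer + GPU execution
        else:
            openmp_fallback(a, b, out)    # multicore CPU codepath

    a, b, out = np.arange(8.0), np.ones(8), np.empty(8)
    run_parallel_loop(None, a, b, out)    # falls back to the CPU path here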


[Figure 4.2: Block diagram of jit4GPU GPU code generation: the typed AST from unPython flows through array access analysis, loop optimizations and IL code generation; the AMD CAL IL compiler turns the AMD IL into GPU binary code, which the data transfer and execution stage runs using the computed data transfer parameters.]

4.3 Final Remarks

This chapter described the design of a new compiler system that combines an ahead-of-time compiler, a just-in-time compiler and the AMD CAL IL compiler to create a compilation path for parallel-annotated Python numerical code. This compilation infrastructure generates flexible code that executes either on a GPU or on a multicore CPU.


Chapter 5

Array access analysis

Array access analysis is performed by a compiler to compute the set of memory locations that may be accessed by array references occurring in a loop nest. Jit4GPU performs array access analysis to compute the set of memory locations accessed in a parallel loop to be executed on the GPU. Jit4GPU then computes a one-to-one mapping of the set of memory locations accessed in the system memory to a set of memory locations on the on-board memory of the GPU. Jit4GPU also checks if all the memory potentially accessed within the loop can fit in the limited on-board memory of the GPU. If the data accessed during the computation does not fit on the GPU, then jit4GPU attempts to tile the loop. Finally, if the array access analysis and the potential tiling of the loop are successful, then jit4GPU generates data transfer parameters and a GPU binary and executes the loop.

This chapter describes a new array access analysis algorithm used by jit4GPU.

5.1 Representation and problem statement

To execute a loop nest on the GPU, the compiler first needs to compute the set of memory locations that can potentially be accessed by the loop nest and then transfer the data to the GPU. The compiler also needs to compute the size of the set of memory locations accessed to determine if the set will fit on the GPU. In this work, the analysis is restricted to loop nests of the form shown in Figure 5.1.

    for i_1 in prange(u_1):
        for i_2 in prange(u_2):
            ...
            for i_m in prange(u_m):
                for i_{m+1} in range(u_{m+1}):
                    ...
                    for i_d in range(u_d):
                        # loop body

    Figure 5.1: Loop nests considered for array access analysis

In Figure 5.1, a loop nest of depth d, where the top m loops are parallel, is shown. Variables i_1 to i_d are loop counters while u_1 to u_d are the upper bounds of the loops. The loop body may contain zero or more array accesses. The set of memory locations accessed by an array reference depends on the subscript expression used in the reference, on the mapping of the subscript expression to memory locations based on the strides of the array, and on the loop bounds.

For array access analysis, the compiler requires a formal representation of the memory access pattern. Such a representation is a mapping from the d-dimensional space of tuples of loop counters to the one-dimensional space of memory locations. The space of memory locations represents the locations in virtual memory. This space starts at zero and ends at positive infinity. The memory space is assumed to consist of discrete cells where each cell is composed of one discrete unit of memory. If all the arrays in the loop have the same element size s (for example, if all arrays are arrays of 4-byte values), and if the starting position of all arrays is aligned on s bytes, then one cell of memory can be thought of as having s bytes. In this work, I restrict myself to the case where one cell of memory is equated to the size of one element of the array.

For array-access analysis, representing and reasoning about arbitrary memory access patterns is not feasible. Thus array-access analysis is typically restricted to representing one particular class of memory access patterns. Linear Memory Access Descriptors, or LMADs, first introduced by Paek et al. [17], are one such representation capable of precisely representing a wide variety of memory access patterns.

Definition 1. An LMAD L is a mapping from a d-dimensional space of tuples of loop counters to the 1-dimensional space of memory locations of the form:

    L(i_1, i_2, ..., i_d) = c_0 + \sum_{k=1}^{d} f_k(i_k)    (5.1)

where ∀(1 ≤ k ≤ d), starting with k = 1 and going to k = d:

    0 ≤ i_k < u_k    (5.2)
    u_k = g_k(i_1, i_2, ..., i_{k-1})    (5.3)
    i_k, u_k, c_0, f_k(i_k), g_k(i_1, i_2, ..., i_{k-1}) ∈ Z    (5.4)

where L is the LMAD, Z is the set of integers, i_k is the k-th loop counter, c_0 to c_d are integer constants and u_k is the upper bound of the k-th loop counter. Functions f_k and g_k are arbitrary integer functions.
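As a concrete illustration, the constant-stride coefficients of such a mapping can be read directly off a NumPy array. The sketch below is purely illustrative; the array shape and the subscript pattern a[i, 3*j] are chosen to match the example used later in this chapter, with addresses measured in elements as in the cell model above.

    import numpy as np

    a = np.zeros((7, 20), dtype=np.float32)   # row-major 7x20 array of 4-byte cells
    # Element-sized strides, following the "one cell = one array element" model.
    row_stride = a.strides[0] // a.itemsize    # 20 elements between consecutive rows
    col_stride = a.strides[1] // a.itemsize    # 1 element between consecutive columns
    base = 0                                   # offset of a[0, 0] within the array
    # For the access a[i, 3*j]:  L(i, j) = base + row_stride*i + (3*col_stride)*j
    print(base, row_stride, 3 * col_stride)    # -> 0 20 3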


LMADs can represent a wide variety of access patterns but require the compiler to retain the symbolic representations of the functions f_k and g_k. Thus, representing and reasoning about general LMADs requires a general symbolic algebra engine to be embedded in the compiler. Instead of dealing with general LMADs, this thesis is limited to a simpler, yet common, subset which I term RCSLMADs, for restricted constant stride LMADs. Intuitively, RCSLMADs represent a class of affine subscript expressions occurring inside a rectangular loop nest with loop-invariant bounds.

Definition 2. An LMAD L is an RCSLMAD if and only if it satisfies all of the following conditions:

1. Let N be the set of integers {1, 2, ..., d}. Then L must be of the following form:

    L(i_1, i_2, ..., i_d) = c_0 + \sum_{k=1}^{d} p_k ∗ i_k    (5.5)

where ∀(k ∈ N):

    0 ≤ i_k < u_k    (5.6)
    p_k, u_k ∈ Z+    (5.7)
    c_0 ∈ Z    (5.8)
    c_0 ≥ 0    (5.9)

Z+ is the set of positive integers.

2. There must exist a one-to-one mapping G from N to N such that ∀(k ∈ N):

    p_{G(k)} > \sum_{n=k+1}^{d} p_{G(n)} ∗ (u_{G(n)} − 1) + p_{G(d)} − 1    (5.10)

where G(k) represents the application of the mapping G on the value k. The mapping G is called the ordering function of L.

For convenience, a number of auxiliary terms can be defined. Let L be an RCSLMAD. Then:

Definition 3. The integer constant c_0 is called the base of the RCSLMAD.

Definition 4. ∀(k ∈ N), the integer constant p_k is called the stride in the k-th dimension.

Definition 5. The mapping G is called the ordering function of the RCSLMAD L.


Definition 6. ∀(k ∈ N), the value p_k ∗ (u_k − 1) is called the span in the k-th dimension and represents the distance traveled in memory space when the k-th loop counter i_k goes from 0 to the upper bound u_k.

The definitions of stride and span are compatible with the definitions given by Paek et al. for the stride and span of general LMADs.

Definition 7. Let there be a set D such that

    D = {(i_1, i_2, ..., i_d) : ∀(1 ≤ k ≤ d), 0 ≤ i_k < u_k}    (5.11)

The set D is called the domain of the RCSLMAD L.

Definition 8. Let D be the domain of RCSLMAD L and let V ∈ D be a d-tuple. Then the value v = c_0 + \sum_{k=1}^{d} V(k) ∗ p_k is called the application of L on V and is denoted by L(V).

To understand the concept of RCSLMADs, consider a two-dimensional example:

    L_1 = 5 + 20 ∗ i + 3 ∗ j    (5.12)
    0 ≤ i < 7    (5.13)
    0 ≤ j < 5    (5.14)

To visualize the RCSLMAD, consider a matrix with 7 rows and 20 columns. The number of rows was chosen to be equal to the upper bound of i while the number of columns was chosen to be equal to the stride of i. Then set the element in the i-th row and j-th column of the matrix to 5 + 20 ∗ i + j. The matrix is effectively constructed to represent a row-major two-dimensional array (such as those present in the programming languages C/C++) of size 7 ∗ 20 beginning at memory address 5. Figure 5.2 highlights all the elements in the matrix which belong to the RCSLMAD L_1. The RCSLMAD represents a regular pattern of accessing the matrix, similar to an access of the form [i][3∗j] of a C-style two-dimensional array. This is the intuitive idea behind RCSLMADs. All the accesses are effectively accessing a very simple strided pattern in a C-style d-dimensional matrix.

RCSLMADs have several useful properties. To describe the properties of RCSLMADs, we will use notation consistent with the above definitions. L will represent an RCSLMAD defined over a d-dimensional domain D. Symbols i_k, u_k, p_k, σ_k will represent the loop counter, upper bound, stride and span respectively in the k-th dimension.

Theorem 1. L is a one-to-one mapping from D to the memory space. Alternatively, each d-tuple of loop counters from the domain D is mapped to a unique location in memory by the RCSLMAD L.


Figure 5.2: Visualization of a two-dimensional RCSLMAD example

Proof. Without loss of generality, the ordering function G is assumed to be the identity function. The theorem can be proved by contradiction. Let us assume that there exist two tuples V_1 and V_2 such that

    L(V_1) = L(V_2), V_1 ∈ D, V_2 ∈ D, V_1 ≠ V_2    (5.15)

Then the d-dimensional tuples V_1 and V_2 must differ in at least one dimension. Without loss of generality, we assume V_1 > V_2 lexicographically. Let m be the smallest integer such that V_1(m) > V_2(m). Then

    L(V_1) − L(V_2) = p_m ∗ (V_1(m) − V_2(m)) − \sum_{k=m+1}^{d} p_k ∗ (V_2(k) − V_1(k)) = 0    (5.16)

    p_m ∗ (V_1(m) − V_2(m)) = \sum_{k=m+1}^{d} p_k ∗ (V_2(k) − V_1(k))    (5.17)

    min(p_m ∗ (V_1(m) − V_2(m))) = p_m    (5.18)

    max(\sum_{k=m+1}^{d} p_k ∗ (V_2(k) − V_1(k))) = \sum_{k=m+1}^{d} p_k ∗ (u_k − 1)    (5.19)

However, by the definition of an RCSLMAD:

    p_m > \sum_{k=m+1}^{d} p_k ∗ (u_k − 1)    (5.20)

    ∴ min(p_m ∗ (V_1(m) − V_2(m))) > max(\sum_{k=m+1}^{d} p_k ∗ (V_2(k) − V_1(k)))    (5.21)

    ∴ L(V_1) − L(V_2) > 0    (5.22)

Thus we have arrived at a contradiction and proved that, for RCSLMADs, each d-tuple in the domain D maps to a distinct location in memory.
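To make the one-to-one property concrete, the example RCSLMAD L_1 = 5 + 20∗i + 3∗j from above can be enumerated directly; the short check below is purely illustrative and also previews the size result proved next.

    # Enumerate L1 = 5 + 20*i + 3*j over 0 <= i < 7, 0 <= j < 5.
    addresses = [5 + 20 * i + 3 * j for i in range(7) for j in range(5)]
    # Theorem 1: every (i, j) maps to a distinct memory location.
    assert len(set(addresses)) == len(addresses)
    # Theorem 2 (below): the set has exactly u1 * u2 = 7 * 5 elements.
    print(len(set(addresses)))   # -> 35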


Theorem 2. The size S of the set represented by the RCSLMAD L is given by S = \prod_{k=1}^{d} u_k.

Proof. The size S of the set is equal to the size of the domain D because each element in the domain maps to a distinct memory location. The size S can simply be computed as the product of the upper bounds because the domain is rectangular.

A loop nest may have multiple RCSLMADs and they may overlap at one or more memory locations. If each RCSLMAD is transferred independently, then the same memory location on the CPU may be copied to multiple copies on the GPU. If there is a read/write dependence involving a memory location, then creating multiple copies on the GPU is cumbersome because the copies must be kept consistent. A better solution is to compute a set union of the RCSLMADs and then transfer the union to the GPU. This way there will be exactly one copy of each memory location on the GPU. The compiler also ensures that no other CPU thread writes to the memory locations copied to the GPU before the GPU execution and data transfers are complete. For correctness, it is not necessary to compute an exact union. If the union is represented by U, then it is sufficient to compute a superset of the union. We now present the problem statement, including the design criteria of an analysis to compute the union:

Problem Statement 1. Given a list {L_1, L_2, ..., L_n} of RCSLMADs defined over a d-dimensional domain D:

1. Compute the set M of memory locations to be transferred to the GPU. The set M should be a superset of the union U of the RCSLMADs. The number of elements in M − U should be as small as possible. M should be easily representable by the compiler.

2. Compute the size S of the set M.

3. Construct a map F with domain M. The range R of F should be a set of valid memory locations in the GPU address space. If F(m), m ∈ M, represents the address of the memory location m on the GPU, then F should be such that F(m_1) ≠ F(m_2) if m_1 ≠ m_2. The range R should be as small as possible. The computational complexity and memory requirement for computing F should be as low as possible.

5.2 General solution

Consider the case where the list contains a single RCSLMAD L defined over a domain D. Without loss of generality, assume that the ordering function of L is the identity function. In such a case, an exact solution for Problem 1 can be calculated by Algorithm 1.


Algorithm 1 solves the data transfer problem for a single RCSLMAD L defined over a domain D.
1: M = {L(V) | V ∈ D}.
2: S = \prod_{k=1}^{d} u_k, from the property of RCSLMADs.
3: Let m = L(V), V ∈ D; then F(m) = \sum_{k=1}^{d} ( V(k) ∗ \prod_{j=k+1}^{d} u_j ), with range R = {0, 1, ..., S − 1}.

To argue that Algorithm 1 does produce a correct function F, we observe that, given the d-dimensional domain D, the composite function F(L) is also an RCSLMAD defined over the domain D, and hence each distinct tuple V is mapped to a distinct location in GPU memory by the composite function F(L). The range R can be directly verified by calculating the maximum and minimum values of F(L) over D.
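The mapping F of Algorithm 1 is simply a row-major linearization of the iteration tuple. The sketch below is an illustration rather than jit4GPU code: it applies the mapping to the running example and checks that the GPU addresses are distinct and densely packed into {0, ..., S − 1}.

    from itertools import product

    def gpu_address(V, u):
        # F(m) for m = L(V): row-major linearization of the iteration tuple V.
        d = len(u)
        addr = 0
        for k in range(d):
            stride = 1
            for j in range(k + 1, d):
                stride *= u[j]
            addr += V[k] * stride
        return addr

    u = (7, 5)                                   # upper bounds of the example RCSLMAD
    gpu = [gpu_address(V, u) for V in product(range(7), range(5))]
    assert sorted(gpu) == list(range(7 * 5))     # one-to-one onto {0, ..., S-1}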
The problem of multiple RCSLMADs can be simplified in many cases by considering the memory intervals spanned by the RCSLMADs in the list. If the memory intervals spanned by the RCSLMADs in the list are completely disjoint, then each RCSLMAD can be analyzed individually, without any need to compute the union, because there can be no dependencies and no possibility of multiple copies. If the intervals spanned by two or more RCSLMADs overlap, then we can form a group of RCSLMADs. A group of RCSLMADs spans an interval given as the union of the intervals of all members of the group. In general, we can partition the list of RCSLMADs into groups of RCSLMADs where the interval represented by each group is disjoint from the intervals of the other groups. An algorithm to compute groups of RCSLMADs is given by Algorithm 2.

Algorithm 2 partitions a list of RCSLMADs defined over a domain D into disjoint groups of RCSLMADs.
Inputs: List {L_1, L_2, ..., L_n} of RCSLMADs defined over a domain D.
Outputs: List {G_1, G_2, ..., G_n} of groups of LMADs such that the interval spanned by each group is disjoint from the interval spanned by any other group.
1: for each RCSLMAD L in {L_1, L_2, ..., L_n} do
2:   Find the start and end addresses of L in memory.
3: end for
4: Construct a graph G where each RCSLMAD is a node.
5: for each pair of nodes in G do
6:   Insert an edge if the start-to-end intervals of the two nodes overlap.
7: end for
8: Find the connected components in the graph G using a suitable algorithm such as a depth-first search. Each connected component represents a group of RCSLMADs.
9: return A list of connected components.
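A compact way to realize Algorithm 2 is to sort the [start, end] intervals and merge overlapping ones, which, for intervals, yields the same groups as the connected-components formulation. The sketch below is illustrative; the tuple encoding of an RCSLMAD as (base, strides, uppers) is invented for the example.

    def group_rcslmads(lmads):
        # Each RCSLMAD is (base, strides, uppers); start = base and
        # end = base + sum of spans, since the base and all strides are positive.
        def interval(base, strides, uppers):
            return base, base + sum(p * (u - 1) for p, u in zip(strides, uppers))

        order = sorted(range(len(lmads)), key=lambda i: interval(*lmads[i])[0])
        groups, current, max_end = [], [], None
        for i in order:
            start, end = interval(*lmads[i])
            if current and start > max_end:      # no overlap with the running group
                groups.append(current)
                current, max_end = [], None
            current.append(i)
            max_end = end if max_end is None else max(max_end, end)
        if current:
            groups.append(current)
        return groups                            # lists of indices into `lmads`

    # Two overlapping accesses and one far-away access:
    print(group_rcslmads([(0, (20, 3), (5, 5)), (21, (20, 3), (5, 5)),
                          (1000, (10,), (4,))]))   # -> [[0, 1], [2]]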


If one or more groups contain more than one RCSLMAD, then the problem of computing the union (or a superset of it) still remains. One general solution is described in Algorithm 3. The algorithm is very conservative and transfers all memory locations in the memory interval spanned by the group. The algorithm is correct but transfers too much data and uses too much GPU on-board memory. Better solutions are needed in most cases.

Algorithm 3 computes a superset of the union of arbitrary RCSLMADs.
Inputs: A group {L_1, L_2, ..., L_n} of RCSLMADs defined over domain D.
Outputs: Set M, size S of set M and function F for mapping CPU memory locations to GPU memory locations.
1: Compute m_1 = min(min(L_1(V), V ∈ D), min(L_2(V), V ∈ D), ..., min(L_n(V), V ∈ D)).
2: Compute m_2 = max(max(L_1(V), V ∈ D), max(L_2(V), V ∈ D), ..., max(L_n(V), V ∈ D)).
3: Compute M = {m | m_1 ≤ m ≤ m_2, m ∈ Z}.
4: Compute S = m_2 − m_1.
5: Construct F(m) = m − m_1 and R = {m | 0 ≤ m ≤ m_2 − m_1, m ∈ Z}.
6: return M, S, and F.

While Algorithms 1, 2 and 3 have only been described for RCSLMADs, they are actually applicable to a slightly wider class of LMADs. For convenience, assume that the ordering function is the identity function and consider the condition p_k > \sum_{j=k+1}^{d} p_j ∗ (u_j − 1) + p_d − 1 imposed on RCSLMADs. If the condition is instead loosened to p_k > \sum_{j=k+1}^{d} p_j ∗ (u_j − 1), then Algorithms 1, 2 and 3 are still valid (as can be verified by substituting the loosened condition in the corresponding proofs).

5.3 More efficient solutions in specific cases

In some cases, it is possible to derive more efficient solutions that transfer less data than the general algorithm presented earlier. Computing and reasoning with unions of arbitrary RCSLMADs is non-trivial. Consider a group of n RCSLMADs. Let p_{jk} be the stride of the j-th RCSLMAD in the k-th dimension. Instead of attempting to compute the union of RCSLMADs in arbitrary cases, this section is limited to the case p_{t_1 k} = p_{t_2 k}, 1 ≤ t_1, t_2 ≤ n, i.e. all RCSLMADs have the same stride in any given dimension k. Such cases occur when a programmer accesses multiple array locations in the loop body with a fixed distance between the array accesses. All the RCSLMADs must have the same ordering function because all the RCSLMADs share strides and are defined over the same domain.
Without loss of generality, assume that the ordering function is the identity throughout this section. One example of the type of problem studied in this section is the following:

    L_1 = 0 + 20 ∗ i + 3 ∗ j    (5.23)
    L_2 = 21 + 20 ∗ i + 3 ∗ j    (5.24)
    0 ≤ i < 5    (5.25)
    0 ≤ j < 5    (5.26)

In this example, the two RCSLMADs share the same strides but have different bases.

5.3.1 Representation of the union

A suitable representation needs to be chosen to represent the union of RCSLMADs. The representation chosen influences how accurately the union can be computed. As in any other compiler analysis, a tradeoff needs to be made between simplicity and accuracy. One potential choice for representing unions is a single RCSLMAD. However, the set of RCSLMADs is not closed over the union operation, i.e. the union of two RCSLMADs cannot always be represented exactly using a single RCSLMAD. Instead, I define a new type of set designed to represent a collection of interleaved RCSLMADs with common strides. For brevity, I term such groups ERLs, for Extended Restricted CSLMADs, since they extend RCSLMADs to multiple bases. Formally, an ERL E is defined as a set over a d-dimensional domain D as follows:

Definition 9. Let there be n RCSLMADs L_1, L_2, ..., L_n defined over the same d-dimensional domain D with the following constraints:

1. Let p_{t_1 k} and p_{t_2 k} represent the k-th strides of L_{t_1} and L_{t_2} respectively. Then the following condition must be satisfied:

    ∀(1 ≤ t_1, t_2 ≤ n, 1 ≤ k ≤ d): p_{t_1 k} = p_{t_2 k}    (5.27)

2. Let b_x be the base of the RCSLMAD L_x. Then

    ∀(1 ≤ x, y ≤ n): b_x − b_y < p_d    (5.28)

Given the above two constraints, the ERL E is defined as the union of the RCSLMADs L_1 to L_n. For d dimensions and n RCSLMADs, an ERL has 2∗d + n parameters: d upper bounds, d strides and n bases.

One example of the type of set represented by an ERL is the following:

    L_1 = 0 + 20 ∗ i + 3 ∗ j    (5.29)
    L_2 = 1 + 20 ∗ i + 3 ∗ j    (5.30)
    0 ≤ i < 5    (5.31)
    0 ≤ j < 5    (5.32)

As can be verified, the above RCSLMADs can be represented by an ERL since all conditions are satisfied. The two RCSLMADs are interleaved, i.e. they never overlap and have the same strides.


In general, an ERL has the useful property that the size of the set being represented can be easily computed. To find an algorithm for computing the size of such a group E, I first prove a theorem.

Theorem 3. Let there be two d-dimensional RCSLMADs L_1 and L_2 defined over the same domain D with equal strides p_k in each dimension k. If the bases b_1 and b_2 of the two RCSLMADs are such that 0 ≤ b_2 − b_1 < p_d, then L_1 and L_2 are completely disjoint.

Proof. I prove the theorem by contradiction. Let there be two d-dimensional tuples V_1 ∈ D and V_2 ∈ D such that L_1(V_1) = L_2(V_2). For the sake of contradiction, I assume that V_1 ≠ V_2. Let m be the smallest dimension in which V_1 and V_2 differ. The rest of the proof is similar to the proof given in Theorem 1. First, consider the case where V_1(m) < V_2(m). We substitute the values of V_1 and V_2 in the definitions of L_1 and L_2. Then we apply the condition p_m > p_d − 1 + \sum_{k=m+1}^{d} p_k ∗ (u_k − 1). Following the logic of the proof of Theorem 1, using this condition it can be proven that L_1(V_1) < L_2(V_2). Therefore, we arrive at a contradiction and the theorem is proved for this case. Second, consider the case V_1(m) > V_2(m). In this case it can be proven that L_1(V_1) > L_2(V_2). Again we arrive at a contradiction and the theorem is proved.

Given the previous theorem, computing the number of distinct elements in a d-dimensional ERL E composed of n RCSLMADs is straightforward.

Theorem 4. Let E be a d-dimensional ERL composed of n RCSLMADs L_1, L_2, ..., L_n with bases b_1, b_2, ..., b_n respectively. Then the number of distinct elements of E is given by q ∗ \prod_{k=1}^{d} u_k, where q is the number of distinct elements in the list {b_1, b_2, b_3, ..., b_n}.

Proof. The theorem is a corollary of the previous theorem.
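Theorems 3 and 4 can be checked numerically on the interleaved example above (L_1 = 0 + 20i + 3j and L_2 = 1 + 20i + 3j over the 5 by 5 domain); the snippet below only illustrates the statements and is not part of the analysis itself.

    s1 = {0 + 20 * i + 3 * j for i in range(5) for j in range(5)}
    s2 = {1 + 20 * i + 3 * j for i in range(5) for j in range(5)}
    # Theorem 3: the bases differ by 1 < p_d = 3, so the two sets never overlap.
    assert s1.isdisjoint(s2)
    # Theorem 4: two distinct bases, so the ERL covers q * u1 * u2 = 2 * 5 * 5 cells.
    print(len(s1 | s2))   # -> 50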
5.3.2 One-dimensional RCSLMADs with common stride

Consider two one-dimensional RCSLMADs L_1 and L_2 having the same stride m and defined over the domain D:

    L_1 = b_1 + m ∗ i    (5.33)
    L_2 = b_2 + m ∗ i    (5.34)
    0 ≤ i < u    (5.35)

Without loss of generality, assume that b_1 ≤ b_2. As was shown earlier, RCSLMADs can be visualized as simple strided accesses into an imaginary C array. If there is more than one RCSLMAD, we can attempt to place both of them in the same imaginary C array.


In this case, assume we have a C array starting at some arbitrary positive integer b_0 ≤ b_1.¹ The RCSLMADs can be rewritten as:

    L_1 = b_0 + m ∗ (i + t_1) + r_1    (5.36)
    t_1 = ⌊(b_1 − b_0)/m⌋    (5.37)
    r_1 = (b_1 − b_0) % m    (5.38)
    L_2 = b_0 + m ∗ (i + t_2) + r_2    (5.39)
    t_2 = ⌊(b_2 − b_0)/m⌋    (5.40)
    r_2 = (b_2 − b_0) % m    (5.41)
    0 ≤ i < u    (5.42)

Further rewriting:

    L_1 = b_0 + m ∗ j + r_1    (5.43)
    t_1 ≤ j < u + t_1    (5.44)
    L_2 = b_0 + m ∗ k + r_2    (5.45)
    t_2 ≤ k < u + t_2    (5.46)

We observe that the value t_1 will be less than or equal to t_2 because b_1 ≤ b_2. We can construct an RCSLMAD L_3 that is a superset of L_1 by increasing the domain of L_1:

    L_3 = (b_0 + r_1) + m ∗ j    (5.47)
    0 ≤ j < u + t_2    (5.48)

Similarly, we can construct an RCSLMAD L_4 as a superset of L_2:

    L_4 = (b_0 + r_2) + m ∗ j    (5.49)
    0 ≤ j < u + t_2    (5.50)

L_3 and L_4 are defined over the same domain D′ = {j | 0 ≤ j < u + t_2}. Further, the difference between the bases of L_3 and L_4 is equal to |r_1 − r_2|, and |r_1 − r_2| < m because r_1 < m and r_2 < m. The union of L_3 and L_4 is therefore a one-dimensional ERL E defined over the domain D′ with stride m and with bases b_0 + r_1 and b_0 + r_2. Since we have not yet chosen b_0, we should attempt to choose a b_0 such that the set D′ is the smallest. The integer b_0 = b_1 gives the smallest set E because any value of b_0 smaller than b_1 will result in either the same or a larger domain D′.

¹ The new base b_0 will later be chosen to minimize the domain of an RCSLMAD that combines the memory references of L_1 and L_2.


The methodology is generalized by construction in Algorithm 4 for n RCSLMADs with b_0 = b_1. The idea is that if the number of distinct bases in the constructed ERL E is q, and if the upper bound of the domain of E is t_n + u, then the total number of elements to be transferred to the GPU is (t_n + u) ∗ q, since t_n + u elements need to be transferred for each distinct base present in the ERL E. Thus, on the GPU, we can allocate space equal to (t_n + u) ∗ q elements. For the address mapping, consider the interval b_0 to b_0 + m − 1 in the system RAM. Out of this interval, q elements accessed by the q non-overlapping RCSLMADs are transferred. This pattern of q out of every m elements is repeated. Therefore, if the elements not accessed by the ERL were discarded, the pattern would be equivalent to q interleaved accesses with stride q.

Algorithm 4 computes the approximate union of n one-dimensional RCSLMADs with the same stride.
Inputs: n one-dimensional RCSLMADs {L_1, L_2, ..., L_n} with stride m and bases {b_1, b_2, ..., b_n} such that b_1 ≤ b_2 ≤ ... ≤ b_n, defined over domain D = {i | 0 ≤ i < u}.
Outputs: ERL E and a list of n expressions representing the transformed address of each RCSLMAD.
1: for each RCSLMAD L_j do
2:   Compute t_j = ⌊(b_j − b_1)/m⌋ and r_j = (b_j − b_1) % m.
3: end for
4: Construct a list R = {b_1 + r_1, b_1 + r_2, ..., b_1 + r_n}.
5: Remove any duplicates from R. Let the number of elements remaining in R be q.
6: Sort R in-place.
7: Construct a one-dimensional ERL E with stride m, bases R and domain D′ = {j | 0 ≤ j < t_n + u}.
8: Construct an empty list I.
9: for each RCSLMAD L_j do
10:   Find x such that R(x) = b_1 + r_j.
11:   Append the expression (represented as an AST or other compiler IR) q ∗ (i + t_j) + x to I, where i is the original loop counter for which the analysis is being conducted.
12: end for
13: return E and I.
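The sketch below is one possible rendering of Algorithm 4 in Python (illustrative only; jit4GPU works on ASTs rather than tuples). It returns the ERL parameters together with, for each input RCSLMAD, the coefficients (q, t_j, x) of its remapped GPU index expression q ∗ (i + t_j) + x.

    def union_1d_common_stride(bases, m, u):
        # Inputs: bases b_1 <= ... <= b_n, common stride m, domain 0 <= i < u.
        b1 = min(bases)
        t = [(b - b1) // m for b in bases]
        r = [(b - b1) % m for b in bases]
        erl_bases = sorted(set(b1 + rj for rj in r))   # duplicates removed, sorted
        q = len(erl_bases)
        upper = max(t) + u                             # ERL domain: 0 <= j < t_n + u
        remapped = [(q, tj, erl_bases.index(b1 + rj)) for tj, rj in zip(t, r)]
        erl = {"stride": m, "bases": erl_bases, "upper": upper}
        return erl, remapped

    # Two accesses a[10*i] and a[10*i + 23] (bases 0 and 23, stride 10, 0 <= i < 4):
    erl, exprs = union_1d_common_stride([0, 23], m=10, u=4)
    print(erl)     # {'stride': 10, 'bases': [0, 3], 'upper': 6}
    print(exprs)   # [(2, 0, 0), (2, 2, 1)]  i.e. 2*(i+0)+0 and 2*(i+2)+1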


5.3.3 Multidimensional RCSLMADs with common strides

Consider n RCSLMADs {L_1, L_2, ..., L_n} with the common strides P = {p_1, p_2, ..., p_d}. Let the RCSLMADs be defined over a d-dimensional domain D = {(i_1, i_2, ..., i_d) | 0 ≤ i_j < u_j}. The approach is to construct an ERL E with the same strides, bases b′_1, ..., b′_n and enlarged upper bounds u′_1, ..., u′_d, and to solve for the unknown parameters of E using an integer programming solver. The equations are derived as follows:

1. E can be constrained such that the m-th component of E must be a superset of the RCSLMAD L_m. The idea is to assume that for each V ∈ D there exists a point V′ ∈ D′ such that V′ = V + (t_{m1}, t_{m2}, ..., t_{md}) and that E(m)(V′) = L_m(V). The values t_{mk} are assumed to be unknown integer constants. The following equations can be stated:

    b_m + \sum_{k=1}^{d} p_k ∗ i_k = b′_m + \sum_{k=1}^{d} p_k ∗ (i_k + t_{mk})    (5.51)
    u′_1 ≥ u_1 + t_{m1}    (5.52)
    u′_2 ≥ u_2 + t_{m2}    (5.53)
    ...
    u′_d ≥ u_d + t_{md}    (5.54)
    t_{m1} ≥ 0    (5.55)
    t_{m2} ≥ 0    (5.56)
    ...
    t_{md} ≥ 0    (5.57)

If suitable integer values are found for the unknowns t_{mk} and u′_k that satisfy the above constraints, then E is a superset of the union of the RCSLMADs by construction.

2. Each component of E must be an RCSLMAD and must therefore satisfy constraints relating the upper bounds and strides. For each integer k such that 2 ≤ k ≤ d:

    p_k ≥ p_d + \sum_{j=k+1}^{d} p_j ∗ (u′_j − 1)    (5.58)

3. From the definition of an ERL, the difference between any pair of bases b′_x, b′_y should be less than the stride p_d in the last dimension. For {(x, y) | 1 ≤ x ≤ n, 1 ≤ y ≤ n, x ≠ y}:

    b′_x − b′_y ≤ p_d − 1    (5.59)

These inequalities form a set of (n − 1)² constraints necessary for ensuring that E is an ERL.

Thus a total of n equality constraints and n∗d + (d − 1) + (n − 1)² inequality constraints can be derived, for a total of n + d + n∗d unknowns.


The unknown variables are u′_k, b′_m and t_{mk}, where 1 ≤ k ≤ d and 1 ≤ m ≤ n. Ideally we want to minimize the size of E, but the size of E is a nonlinear function. To utilize an integer programming solver, I define an objective function that still attempts to minimize the size of E. To choose a suitable objective function, the following observations can be made:

1. Let t_k = max_m(t_{mk}). Then u′_k ≥ u_k + t_k. Observing all constraints involving u′_k, if the linear integer solver finds a feasible solution such that u′_k > u_k + t_k, then simply reducing the value of u′_k to u_k + t_k will still satisfy all constraints.

2. Let q be the number of distinct entries among the bases b′_1, ..., b′_n. Then the size of the ERL E is q ∗ \prod_{k=1}^{d} (u_k + t_k), by the property of ERLs proved earlier.

Therefore we derive a heuristic objective function that should provide a reasonable approximation to minimizing the number of elements transferred:

    Minimize \sum_{k=1}^{d} u′_k    (5.60)

Given the constraints and the objective function, an integer programming solver is used to find the values of all unknowns. However, if the number of dimensions or the number of RCSLMADs exceeds a threshold, then the compiler falls back to Algorithm 3, because integer programming solvers can take a large amount of time as the number of variables increases.

Consider the following example:

    L_1 = 0 + 20 ∗ i + 3 ∗ j    (5.61)
    L_2 = 21 + 20 ∗ i + 3 ∗ j    (5.62)
    0 ≤ i < 5    (5.63)
    0 ≤ j < 5    (5.64)

Then the algorithm will find the following ERL:

    L′_1 = 0 + 20 ∗ i′ + 3 ∗ j′    (5.65)
    L′_2 = 1 + 20 ∗ i′ + 3 ∗ j′    (5.66)
    0 ≤ i′ < 6    (5.67)
    0 ≤ j′ < 5    (5.68)

If the general form (Algorithm 3) were used in this example instead of the ERL approach, then the number of elements transferred would have been 21 + 20 ∗ 4 + 3 ∗ 4 = 113. Using the ERL approach, the number of elements transferred is 6 ∗ 5 ∗ 2 = 60 elements.
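Because this worked example is tiny, the integer programming step can be mimicked by exhaustive search. The sketch below is a brute-force stand-in for the solver, not jit4GPU's implementation: it enumerates small offsets t_{mk}, applies the constraints above, and recovers exactly the ERL of equations (5.65)-(5.68).

    from itertools import product

    def erl_by_search(bases, strides, uppers, t_max=4):
        # Brute-force search over the offsets t[m][k]; returns (bases', uppers')
        # of the smallest candidate ERL found, or None if nothing in the searched
        # range satisfies the constraints.
        n, d = len(bases), len(strides)
        p_d = strides[-1]
        best = None
        for ts in product(product(range(t_max + 1), repeat=d), repeat=n):
            # b'_m = b_m - sum_k p_k * t_mk, from equality (5.51); must stay >= 0
            bp = [bases[m] - sum(strides[k] * ts[m][k] for k in range(d))
                  for m in range(n)]
            if min(bp) < 0:
                continue
            if max(bp) - min(bp) > p_d - 1:   # bases within one innermost stride (5.59)
                continue
            up = [uppers[k] + max(ts[m][k] for m in range(n)) for k in range(d)]
            # every component must remain an RCSLMAD over the enlarged bounds (cf. 5.58)
            if not all(strides[k] >= p_d + sum(strides[j] * (up[j] - 1)
                                               for j in range(k + 1, d))
                       for k in range(d - 1)):
                continue
            cost = sum(up)                    # heuristic objective (5.60)
            if best is None or cost < best[0]:
                best = (cost, bp, up)
        return None if best is None else (best[1], best[2])

    # Example (5.61)-(5.64): L1 = 0 + 20i + 3j, L2 = 21 + 20i + 3j, 0 <= i, j < 5
    print(erl_by_search([0, 21], [20, 3], [5, 5]))   # -> ([0, 1], [6, 5])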


The ERL approach can result in significant savings in the number of elements transferred if only a small tile of a large array is accessed in a computation.

If the number of dimensions is greater than one, then a feasible solution does not necessarily exist. The problem is that we are attempting to interpret the n RCSLMADs as n interleaved accesses into a d-dimensional array, but such an interpretation does not always exist. If the integer programming solver is called and no feasible solution is found, then the compiler falls back to Algorithm 3 for the data transfer analysis.

To generate the mapped address of each array access, Algorithm 5 is used.

Algorithm 5 computes the mapped address for multidimensional RCSLMAD problems solved using the integer linear programming method.
Inputs: Specified constants u_k, and the values t_{mk} and b′_m computed using integer linear programming.
Outputs: A list of n expressions representing the address on the GPU.
1: Construct a vector R1 = [b′_1, b′_2, ..., b′_n].
2: Construct a vector R2 = R1.
3: Remove all duplicate entries of R2.
4: Sort R2 in-place.
5: Construct a vector R3 such that R3[i] = x where R2[x] = R1[i].
6: Compute q = length of R2.
7: for each integer k such that 1 ≤ k ≤ d do
8:   Compute t_k = max(t_{1k}, t_{2k}, ..., t_{nk}).
9: end for
10: Compute b_0 = min(R1).
11: for each RCSLMAD L_m do
12:   Construct an expression I_m = (b′_m − b_0) + q ∗ \sum_{k=1}^{d} ( (i_k + t_{mk}) ∗ \prod_{l=k+1}^{d} (u_l + t_l) ).
13: end for
14: return the list {I_1, I_2, ..., I_n}.

5.4 Loop tiling for handling large loops

Using the previously discussed Algorithms 3 and 5, the number of elements to be transferred to the GPU can be computed. However, if the amount of memory to be transferred does not fit on the GPU, then the compiler must attempt to tile the loop. The implemented compiler only considers tiling the parallel loops, since tiling parallel loops is always legal. Once an acceptable tile size is found, the tiles are executed sequentially on the GPU. The execution of the tiles could be pipelined, but this possibility was not considered due to time constraints.

In the ideal case, the tiling should be such that the total amount of data transferred between the CPU and the GPU is minimized. However, this problem is again hard and therefore a very basic heuristic was used. The compiler simply reduces the upper bound of each parallel loop to half its original value. If there are m parallel loops, then 2^m tiles are formed. The compiler then computes the amount of data to be transferred for each tile and, if it still does not fit in the available memory, the compiler continues tiling the loop. The compiler aborts the attempt to tile the loop, and abandons efforts to use the GPU, if the total number of tiles exceeds 64.
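The halving heuristic can be summarized in a few lines. The sketch below only illustrates the policy described above; footprint_of stands in for the array-access analysis that computes the per-tile transfer size, and the example footprint is invented.

    def choose_tile_bounds(parallel_bounds, footprint_of, gpu_capacity, max_tiles=64):
        # Halve every parallel loop bound until the per-tile footprint fits in
        # GPU memory; each halving step multiplies the tile count by 2**m.
        bounds = list(parallel_bounds)
        tiles = 1
        while footprint_of(bounds) > gpu_capacity:
            bounds = [max(1, b // 2) for b in bounds]
            tiles *= 2 ** len(bounds)
            if tiles > max_tiles:
                return None          # give up on the GPU; use the CPU codepath
        return bounds, tiles

    # Toy footprint: three same-sized arrays touched once per iteration point.
    footprint = lambda bounds: 3 * bounds[0] * bounds[1]
    print(choose_tile_bounds([4096, 4096], footprint, gpu_capacity=2**22))
    # -> ([1024, 1024], 16)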


5.5 Conclusions

This chapter presented a new heuristic algorithm for array access analysis and for automatically transferring data between the system memory and the GPU memory. The algorithm only handles one class of LMADs but can offer significant space savings on the GPU compared to more naive approaches. This chapter also presented a loop tiling algorithm that can automatically scale parallel loop nests so that the data required for the computation fits in the limited GPU memory. The algorithms have been implemented in jit4GPU, which currently only generates code for AMD GPUs, but the algorithms presented in this chapter are equally applicable to any GPGPU system with a separate address space and limited GPU memory.


Chapter 6Loop trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s <str<strong>on</strong>g>for</str<strong>on</strong>g> AMDGPUsTo extract the maximum per<str<strong>on</strong>g>for</str<strong>on</strong>g>mance out <str<strong>on</strong>g>of</str<strong>on</strong>g> a chip such as the RV770, extensive codetrans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s such as loop unrolling and load-coalescing are necessary. Programmersare not expected to do such loop trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s manually <str<strong>on</strong>g>for</str<strong>on</strong>g> two reas<strong>on</strong>s. First, looptrans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s like unrolling reduce programmer productivity and program maintainability.Sec<strong>on</strong>dly, low-level details like the memory arrangement <str<strong>on</strong>g>of</str<strong>on</strong>g> the GPU and in<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong> aboutspecific datatypes are not exposed to the programmer. There<str<strong>on</strong>g>for</str<strong>on</strong>g>e, it is the resp<strong>on</strong>sibility <str<strong>on</strong>g>of</str<strong>on</strong>g>the compiler to do the necessary trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s. In the implemented compiler framework,jit4GPU is resp<strong>on</strong>sible <str<strong>on</strong>g>for</str<strong>on</strong>g> loop trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s while the AMD CAL IL compiler does lowlevel trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s like instructi<strong>on</strong> scheduling and register allocati<strong>on</strong>.One important code trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong> d<strong>on</strong>e by jit4GPU is reducti<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> the number <str<strong>on</strong>g>of</str<strong>on</strong>g>memory load instructi<strong>on</strong>s (called texturing instructi<strong>on</strong>s <strong>on</strong> the GPU) by coalescing loadsinto loads <str<strong>on</strong>g>of</str<strong>on</strong>g> multi-comp<strong>on</strong>ent types such as float2 and float4. Due to the unbalanced ratio<str<strong>on</strong>g>of</str<strong>on</strong>g> ALU to load unit hardware, the number <str<strong>on</strong>g>of</str<strong>on</strong>g> load instructi<strong>on</strong>s should be reduced so thatthe load (texturing) unit is not the bottleneck. By default, the compiler uses all the memoryresources <strong>on</strong> the GPU in single-comp<strong>on</strong>ent <str<strong>on</strong>g>for</str<strong>on</strong>g>mats. Single-comp<strong>on</strong>ent <str<strong>on</strong>g>for</str<strong>on</strong>g>mats <strong>on</strong>ly allowsingle-comp<strong>on</strong>ent loads. The compiler tries to identify GPU resources that can be storedin aligned multi-comp<strong>on</strong>ent <str<strong>on</strong>g>for</str<strong>on</strong>g>mats. Resources stored in aligned multi-comp<strong>on</strong>ent <str<strong>on</strong>g>for</str<strong>on</strong>g>mats<strong>on</strong>ly allow aligned multi-comp<strong>on</strong>ent loads. Thus, the compiler first analyzes if the loadsfrom a particular resource can always be grouped into aligned multi-comp<strong>on</strong>ent loads. 
If the grouping is successful, then the compiler uses the more efficient formats and reduces the number of load instructions. To find groups of loads that can be combined, the compiler looks at the addresses accessed in the loop body and attempts to find loads of the form 4 ∗ e + c, where e is any expression and c is a constant with the value 0, 1, 2 or 3. If all the accesses from a memory resource match this form, then the memory resource can be stored on the GPU using an aligned four-component format. The compiler then attempts to identify loads that can be coalesced.
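As an illustration of this check, the following is a sketch only, not the jit4GPU source: it assumes each load address has already been rewritten into a pair (e, c) standing for 4 ∗ e + c, and the function names and address representation are hypothetical.

    from collections import defaultdict

    # Hypothetical model: every load address from one memory resource is a pair
    # (e, c) meaning the address 4*e + c, where e is an opaque symbolic expression
    # and c is an integer constant.

    def can_use_float4(accesses):
        """True if every access has the form 4*e + c with c in {0, 1, 2, 3}."""
        return all(c in (0, 1, 2, 3) for _, c in accesses)

    def group_float4_loads(accesses):
        """Group accesses that share the same symbolic part e; each complete group
        can be fetched with one aligned float4 load instead of four scalar loads."""
        groups = defaultdict(list)
        for e, c in accesses:
            groups[e].append(c)
        return dict(groups)

    # Example: the loads a[4*i], a[4*i+1], a[4*i+2], a[4*i+3] produced by a 4x unroll.
    accesses = [("i", 0), ("i", 1), ("i", 2), ("i", 3)]
    if can_use_float4(accesses):
        print(group_float4_loads(accesses))   # {'i': [0, 1, 2, 3]} -> one float4 load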


Jit4GPU reduces the number of load instructions only when it can find suitable loads in the loop body to coalesce. Thus, larger loop bodies have more potential for load coalescing. Before doing load-coalescing transformations, jit4GPU carries out a loop-unrolling pass. A larger loop body also helps the AMD IL compiler do better instruction scheduling and register allocation. However, larger loop bodies can increase the register usage per thread, which results in fewer threads running in parallel on the GPU. Therefore, the compiler first uses a heuristic to determine whether the register usage is below a certain threshold, and loop transformations are only done if it is. Estimating register usage can be time consuming but fortunately need not be done by the JIT compiler: register usage is estimated by unPython, which passes the information to jit4GPU. If enough registers are available, then a loop unroll factor of either four or two is applied. The factor four was chosen because in many cases a four-way unroll gives good opportunities to coalesce loads into float4, the widest data type available on GPUs. If the loop upper bound is not divisible by four, then an unroll by two is attempted instead. Inner loops are given higher priority for unrolling.

Under special conditions jit4GPU also performs loop fusion. Loop fusion is a loop transformation in which two loops with the same loop bounds and no dependence between them are fused into a single loop whose body is the concatenation of the two original loop bodies. Thus, loop fusion produces a single larger loop from two smaller loops. However, loop fusion can only be performed when there is no dependence between the loops. Further, if there is some code occurring between the two loops, then loop fusion can only be applied if the intervening code can be safely moved to a location before the first loop. I have not implemented dependence analysis in jit4GPU. Therefore, in general jit4GPU cannot perform loop fusion.
However, under some circumstances the dependence checking is not required. Let loop L1 be a parallel loop and let L2 be a loop contained in the body of L1. If L1 is unrolled, then a copy L2' of L2 is formed. There cannot be any dependence between different iterations of L1 because L1 is parallel. Therefore, there cannot be any dependence between L2 and L2', because L2 and L2' belong to the bodies of two different iterations of L1. Consequently, the compiler can safely fuse the loops L2 and L2'.

Jit4GPU performs all of these loop transformations before generating AMD IL code: loop unrolling and limited loop fusion, followed by load coalescing, as a means to reduce the number of texture-unit instructions executed. The transformations performed by jit4GPU are summarized in Algorithm 6.


Algorithm 6 Loop transformations for GPU code

for all loops with no child loops do
    Mark the loop as a candidate for unrolling.
end for
for all parallel loops do
    Mark the loop as suitable for unrolling if its child loops have no children.
end for
while the estimated register usage is below the threshold do
    Pick the innermost unrolling candidate available and unroll it.
    Update the register-usage estimate of its parent loops.
    Mark the unrolled loop as unsuitable for further unrolling.
end while
Fuse as many loops as possible.
Perform the load-coalescing transformations.
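As an illustration of the effect of these transformations, the following plain-Python sketch shows a parallel outer loop unrolled by two with the duplicated inner loops fused. It is illustrative only: jit4GPU applies the transformation to its typed AST rather than to Python source, and the array names are hypothetical.

    # Illustrative only: a parallel outer loop unrolled by two, with the duplicated
    # inner loops fused into one.  Because the outer loop is parallel, its iterations
    # carry no dependences, so the two inner-loop copies can always be fused.

    def original(a, b, n, m):
        for i in range(n):            # parallel loop L1
            for j in range(m):        # inner loop L2
                a[i][j] += b[i][j]

    def unrolled_and_fused(a, b, n, m):
        # assumes n is even; jit4GPU checks divisibility before picking the factor
        for i in range(0, n, 2):
            for j in range(m):        # fused body of L2 and its copy L2'
                a[i][j] += b[i][j]
                a[i + 1][j] += b[i + 1][j]

    # The larger fused body gives the AMD IL compiler more scheduling freedom and
    # gives the load-coalescing pass more loads to examine in a single loop body.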


Chapter 7

Experimental evaluation

This chapter presents the performance of the generated OpenMP and GPU code on several highly parallel kernel benchmarks. Performance was evaluated against the generated serial C++ code: for each kernel benchmark, the execution time of the generated serial code was compared against the total execution time of the generated OpenMP and GPU versions. Performance on the GPU was measured with and without loop optimizations. For GPU performance, four numbers are reported:

1. GPU total execution time with GPU-specific loop optimizations enabled in jit4GPU. The time reported includes data transfers and JIT compilation overheads.

2. GPU execution time only, with loop optimizations enabled. Only the time taken by the GPU to execute the GPU binary is included; data transfers and JIT compilation times are not.

3. GPU total execution time without any GPU-specific loop optimizations.

4. GPU execution time only, without any GPU-specific loop optimizations enabled.

The kernels chosen for performance evaluation were matrix multiplication, the CP benchmark from the Parboil benchmark suite, Black-Scholes option pricing, a 5-point stencil code and the RPES kernel from the Parboil benchmark suite. The objective of the performance evaluation is to study the performance gains when GPU code generation is enabled against the performance of an OpenMP version of each benchmark. Therefore, the benchmarks were all chosen to be highly parallel kernels suitable for execution on the GPU. Among the chosen benchmarks, the memory access pattern in four of the benchmarks is describable by RCSLMADs and the compiler is able to generate GPU code. In the RPES benchmark, the loops are triangular with indirect memory references and the compiler was unable to generate GPU code.

A more comprehensive study of the percentage of cases in which the compiler is able to generate GPU code will require a standard benchmark suite implemented in Python with suitable GPU code annotations.


Implementing such a benchmark suite is left as future work. However, as mentioned in Chapter 5, jit4GPU is only able to generate GPU code when the memory access pattern is describable by RCSLMADs. Therefore, some examples for which jit4GPU will be unable to generate a GPU version include FFT, conjugate gradient algorithms and matrix solvers with triangular loops.

The important results from the experiments are:

1. Using a GPU delivered up to 100 times speedup over generated OpenMP code running on the CPU.

2. Loop optimizations performed by jit4GPU deliver up to four times performance improvement on the GPU.

3. Benchmarks that perform very little computation per data item accessed are not suitable for the GPU because the data transfer overhead is larger than the computation time in such cases.

All experiments were done using an AMD Phenom X4 9550 (2.2 GHz quad-core) paired with a Radeon HD 4870 and 4 GB of RAM. Frequency scaling was disabled on the CPU. The operating system was Ubuntu 8.10 with Linux kernel version 2.6.27-7 and GCC version 4.3.2. For compiling C++ code, the optimization flag -O3 was passed to the compiler. When compiling for OpenMP, the flag -fopenmp was also passed. Several other flags were also tested but produced no notable performance changes and were excluded from the results presented here. Each experiment was repeated 5 times and the minimum, maximum and mean execution times are presented.

7.1 Matrix multiplication

Matrix multiplication was implemented for 32-bit and 64-bit floating-point matrices. A very simple implementation of matrix multiplication was written in Python and the outer two loops were marked as parallel loops for GPU execution. Performance was studied against the matrix sizes. To compare the performance of the generated GPU code against the CPU, the performance results of the generated OpenMP code as well as performance results from the ATLAS library are included. ATLAS implements a tiled matrix multiplication algorithm and also autotunes itself to best fit the system CPU at the time of installation.
ATLAS is a high-performance library with many years of development effort behind it. By comparison, the Python implementation was a straightforward matrix multiplication written in under ten lines of Python code. From this Python source, the compiler was able to generate GPU code that performed twice as fast as ATLAS and over 100 times faster than the generated OpenMP code.
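For reference, the kernel has roughly the following shape. This is an illustrative plain-Python sketch with hypothetical names and sizes; it omits the unPython type and parallel-loop annotations (in the actual benchmark the two outer loops carry the parallel annotation).

    import numpy as np

    def matmul(a, b, c, n):
        # The two outer loops are the ones marked parallel in the annotated source;
        # plain `range` is used here so the sketch runs unmodified.
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += a[i, k] * b[k, j]
                c[i, j] = s

    n = 64
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    c = np.zeros((n, n), dtype=np.float32)
    matmul(a, b, c, n)
    assert np.allclose(c, a @ b, atol=1e-3)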


The execution times are presented in Table 7.1 for 32-bit floating point and in Table 7.3 for 64-bit floating point. Each case was repeated 5 times and the minimum, maximum and mean execution times are presented. The speedups obtained using the GPU over ATLAS are presented in Table 7.2 and Table 7.4. The speedups are calculated using the minimum execution time.

Table 7.1: Execution time for the matrix multiplication benchmark, 32-bit floating point (seconds)

Problem size              1024     2048     4096      6120
OpenMP 4 threads   min    7.64    68.35    874.42    1213.7
                   max    7.82    68.83    877.98    1219.4
                   mean   7.75    68.67    875.64    1215.14
ATLAS BLAS         min    0.125    0.684     5.12      17.87
                   max    0.126    0.687     5.14      18.05
                   mean   0.125    0.685     5.13      17.91
GPU Total (Opt)    min    0.084    0.39      3.00       8.19
                   max    0.087    0.41      3.00       8.34
                   mean   0.085    0.40      3.00       8.27
GPU Only (Opt)     min    0.025    0.277     1.78       6.86
                   max    0.027    0.288     1.88       7.01
                   mean   0.026    0.282     1.82       6.93
GPU Total (No opt) min    0.159    0.99      8.68      28.16
                   max    0.161    1.04      8.84      28.8
                   mean   0.160    1.01      8.73      28.39
GPU Only (No opt)  min    0.108    0.90      7.46      26.88
                   max    0.110    0.95      7.61      27.52
                   mean   0.109    0.916     7.5       27.11

Table 7.2: Speedups for matrix multiplication using the GPU over ATLAS, 32-bit floating point

Problem size   Speedup (No Opt)   Speedup (Opt)
1024           0.786              1.488
2048           0.69               1.75
4096           0.58               1.7
6120           0.634              2.18

7.2 CP benchmark

The CP benchmark from the Parboil suite was implemented in Python. The benchmark is a simple nested loop and the top two loops are annotated as parallel for GPU execution. The CP benchmark is representative of some computations done in molecular dynamics: it computes the Coulombic potential at each point in a planar grid. The computation of the potential at each grid point is independent of the other grid points, and therefore CP is a highly parallel kernel suitable for GPUs.


Table 7.3: Execution time for the matrix multiplication benchmark, 64-bit floating point (seconds)

Problem size              1024     2048     3072     4096
OpenMP 4 threads   min    8.55    109.5    274.85   1406.82
                   max    8.68    110.4    275.8    1409.5
                   mean   8.59    109.8    275.1    1408.0
ATLAS BLAS         min    0.244     1.34     4.38     11.62
                   max    0.247     1.36     4.47     11.70
                   mean   0.245     1.35     4.41     11.64
GPU Total (Opt)    min    0.147     0.679    3.71      5.34
                   max    0.16      0.695    3.82      5.53
                   mean   0.15      0.685    3.78      5.39
GPU Only (Opt)     min    0.078     0.513    2.34      3.84
                   max    0.09      0.543    2.44      4.06
                   mean   0.081     0.523    2.38      3.91
GPU Total (No opt) min    0.227     1.42     6.21     11.26
                   max    0.24      1.53     6.52     11.73
                   mean   0.232     1.47     6.33     11.4
GPU Only (No opt)  min    0.164     1.249    4.87     10.13
                   max    0.177     1.33     5.182    10.66
                   mean   0.168     1.27     4.96     10.31

Table 7.4: Speedups for matrix multiplication using the GPU over ATLAS, 64-bit floating point

Problem size   Speedup (No Opt)   Speedup (Opt)
1024           1.07               1.659
2048           0.95               1.97
3072           0.7                1.18
4096           1.03               2.17


At each grid point, the potential is computed as a summation of the potentials due to atoms randomly distributed in 3D space. Jit4GPU was able to achieve a speedup of more than 50 times over the generated OpenMP code. Execution times for the benchmark are presented in Table 7.5. For each case, the minimum, maximum and mean execution times are presented. The speedups, calculated from the minimum execution times, are reported in Table 7.6.

Table 7.5: Execution time for the CP benchmark (seconds)

Problem size              128      256      512      1024
Serial Time        min    9.01    36.03    137.99    552.26
                   max    9.04    36.48    138.76    555.35
                   mean   9.11    36.11    137.19    553.08
OpenMP 1 thread    min    8.78    34.56    138.01    552.71
                   max    8.89    35.09    139.52    554.00
                   mean   8.82    34.73    138.62    553.28
OpenMP 4 threads   min    2.234    8.692    34.57    138.06
                   max    2.297    8.982    35.29    140.06
                   mean   2.24     8.925    35.77    138.96
GPU Total (Opt)    min    0.084    0.1979    0.672     2.58
                   max    0.085    0.214     0.713     2.85
                   mean   0.084    0.204     0.692     2.65
GPU Only (Opt)     min    0.05     0.163     0.635     2.524
                   max    0.05     0.182     0.687     2.802
                   mean   0.05     0.169     0.655     2.61
GPU Total (No opt) min    0.161    0.544     2.056     8.129
                   max    0.163    0.57      2.235     8.35
                   mean   0.162    0.55      2.096     8.4
GPU Only (No opt)  min    0.13     0.51      2.022     8.088
                   max    0.15     0.53      2.20      8.30
                   mean   0.14     0.52      2.06      8.16

Table 7.6: Speedups for CP using the GPU over OpenMP

Problem size   Speedup (No Opt)   Speedup (Opt)
128            13.81              26.59
256            15.98              43.92
512            16.81              51.44
1024           16.98              53.51

7.3 Black Scholes option pricing

The Black-Scholes formula is used for option pricing in computational finance. It is a simple scalar computation that produces two scalar results from five scalar inputs; the benchmark prices multiple options in parallel. The Python implementation is derived from a Brook+ implementation provided by AMD with the Stream SDK. The performance of Black-Scholes on the GPU is entirely dominated by the data transfer time, and the execution time on the GPU itself is negligible.


A speedup of about six times is achieved over the generated OpenMP code. The detailed results are presented in Table 7.7 and the speedups are listed in Table 7.8. For each case, the minimum, maximum and mean execution times are presented, and the speedups are calculated based on the minimum execution time.

The benchmark involves functions such as the exponential and the natural logarithm, and these functions were compiled to the corresponding hardware instructions on the GPU. However, a significant difference in the calculated results was observed between the CPU and the GPU. Accurate software implementations of square root, natural logarithm and other functions on the GPU are future work for this thesis.

Table 7.7: Execution time for the Black-Scholes benchmark (seconds)

Problem size              512       1024      2048
Serial Time        min    0.36      1.438     5.71
                   max    0.363     1.46      5.77
                   mean   0.361     1.44      5.73
OpenMP 1 thread    min    0.35      1.46      5.6
                   max    0.352     1.478     5.67
                   mean   0.351     1.465     5.62
OpenMP 4 threads   min    0.1633    0.394     1.4
                   max    0.1633    0.394     1.4
                   mean   0.1633    0.394     1.4
GPU Total (Opt)    min    0.0829    0.106     0.224
                   max    0.084     0.108     0.227
                   mean   0.083     0.1065    0.225
GPU Only (Opt)     min    0.0004    0.0009    0.0020
                   max    0.0004    0.0009    0.0020
                   mean   0.0004    0.0009    0.0020
GPU Total (No Opt) min    0.079     0.111     0.22
                   max    0.079     0.112     0.23
                   mean   0.079     0.111     0.224
GPU Only (No Opt)  min    0.0004    0.0009    0.0025
                   max    0.0004    0.0009    0.0025
                   mean   0.0004    0.0009    0.0025

Table 7.8: Speedups for the Black-Scholes benchmark using the GPU over OpenMP

Problem size   Speedup (No Opt)   Speedup (Opt)
512            2.067              1.96
1024           3.54               3.71
2048           6.36               6.25

7.4 5-point stencil

The 5-point stencil benchmark computes an out-of-place 5-point stencil over a 2-dimensional matrix. For each element in the input matrix, the code computes a weighted average of the element and its 4 neighbors and writes the result to the corresponding element of a matrix of the same dimensions.


The kernel is highly parallel, as each element is processed independently, but the amount of computation per point is very small. The compiler was able to successfully carry out the array access analysis and to generate GPU code for this benchmark. However, the data transfer overhead considerably outweighed the computation time savings of the GPU. Therefore, the benchmark ran slower when using the GPU than with OpenMP. The results are reported in Table 7.9. For each case, the minimum, maximum and mean execution times are presented.

Table 7.9: Execution time for the 5-point stencil benchmark (milliseconds)

Problem size              1024     2048     3072     4096
Serial Time        min    10.8     43.3     97.5     175
                   max    10.9     43.7     98.2     176.6
                   mean   10.8     43.4     97.7     175.8
OpenMP 1 thread    min    10.1     46       70       150
                   max    10.26    47       71.24    152
                   mean   10.17    46.3     70.4     151
OpenMP 4 threads   min     5       24       47.3      58.1
                   max     5       24       47.8      58.5
                   mean    5       24       47.5      58.2
GPU Total (Opt)    min    65       98.8    123.1    1010
                   max    65.8     99.6    124.9    1016
                   mean   65.3     99.1    123.7    1014
GPU Only (Opt)     min     0.39     0.89     2.8      25
                   max     0.39     0.89     2.8      26
                   mean    0.39     0.89     2.8      25.5
GPU Total (No Opt) min    35.3     63.1    112.1    1000.3
                   max    35.6     63.7    112.7    1006
                   mean   35.4     63.25   112.3    1002
GPU Only (No Opt)  min     0.59     2.0      4.2      50
                   max     0.59     2.0      4.2      52
                   mean    0.59     2.0      4.2      51

7.5 RPES benchmark

The RPES benchmark is a Python adaptation of the benchmark from the Parboil suite. The benchmark involves indirect memory loads and triangular loops. UnPython determined that a GPU version could not be generated and therefore did not generate calls to the JIT compiler. Consequently, the performance did not change when the GPU was enabled, because the GPU was never used and the JIT compiler was never called; the compiler only generated an OpenMP version of the benchmark. This benchmark illustrates that the programmer can safely add GPU parallel annotations without any fear of errors in the presence of compiler limitations. The benchmark was only tested with the default parameters provided by the benchmark suite.


The serial version of the benchmark completed in 259 seconds and the parallel version completed in 62 seconds.

7.6 Remarks

This chapter evaluated the implemented compiler on several kernel benchmarks. The generated GPU code delivered over two orders of magnitude better performance than the OpenMP code for some benchmarks. Jit4GPU was also able to deliver better performance using the GPU than highly tuned CPU libraries such as ATLAS. The loop optimizations done by the compiler were also found to be effective and provided considerable speedups over unoptimized GPU code. The performance of some benchmarks, such as the 5-point stencil, was found to be bound by the data transfer overhead, making them unsuitable for execution on the GPU.


Chapter 8

Related Work

The most important previous research related to the main contributions of this thesis can be divided into three categories: the conversion of code originally written to be interpreted in Python into compiled code that runs on a CPU, the analysis of memory accesses through arrays and the transfer of the relevant memory regions between the CPU main memory and the GPU memory, and the construction of compilers for general-purpose computations that execute on GPUs. This chapter discusses each category of related work in the subsequent sections.

8.1 Compiling Python for CPUs

UnPython is not the first compiler attempting to compile Python to C/C++ or another lower-level language. Various compilers have been developed that compile Python or a closely related dialect. Pyrex [5] is a compiler that compiles the Pyrex language to C/C++. The Pyrex language is statically typed with many syntactic similarities to Python, and it has special syntax for interacting with C libraries. The major advantage of Pyrex over unPython is that it allows very easy interaction with C libraries and has a very simple and efficient way of generating Python bindings for them. However, the Pyrex language is not Python and will not run on the Python interpreter. Further, Pyrex currently has no support for NumPy arrays and does not support any parallel programming features.

Cython [2] is a fork of Pyrex and has added many features over Pyrex. Cython's syntax is much closer to Python and it does support efficient access to NumPy arrays. However, Cython does not support any parallel programming features.

Several compilers have attempted to compile pure Python to lower-level languages using type inference.
Shedskin [13] is a compiler that compiles pure, unannotated Python to C++. Shedskin uses type inference, which requires certain implicit typing restrictions on the Python code being compiled. Shedskin uses a much more advanced type inference algorithm than unPython and is designed to be a whole-program compiler.


However, Shedskin does not support efficient access to NumPy arrays and does not have any parallel programming features.

Another compiler that compiles pure Python to a lower-level language is the RPython [11] compiler in the PyPy [4] project. PyPy is an attempt to write the Python interpreter in Python itself. To achieve this goal, the interpreter is written in a subset of Python called RPython that includes various implicit type-based restrictions. A compiler then compiles RPython to various backends such as C. Other backends are also in progress, including an LLVM backend and a Java backend, so in the future PyPy may compile RPython to C, LLVM and Java. However, PyPy does not support efficient access to NumPy arrays; to gain such support, PyPy will require an implementation of the NumPy array in RPython.

In a very different direction, a new project called Unladen Swallow [7] is a branch of the standard Python interpreter with an LLVM-based JIT compiler. The project aims to be a replacement for the standard interpreter and, as such, requires no changes to Python code, but it may break C-API compatibility with the standard interpreter. One of the future goals of the project is a completely thread-safe Python interpreter that will allow true multithreaded applications in Python. The project is looking at replacing the garbage collection (GC) mechanism of the standard Python interpreter with an alternative thread-safe scheme, because the current reference-count-based GC is one of the major obstacles to thread safety in the interpreter.
Unladen Swallow is still a work in progress; if it succeeds in building a high-performance thread-safe interpreter and offering a thread-safe C API, then it has the potential to be a very good complement to unPython. The current Python interpreter severely handicaps the parallel programming capability offered by unPython.

8.2 Array access analysis

The primary objective of array access analysis in this thesis is to find the memory locations to be transferred between the CPU and GPU memories. The array access analysis used is based on the concept of LMADs. LMADs were first defined by Paek et al. [17], who proposed them as an efficient way to capture very accurate array access information. They also discussed several operations on LMADs, such as the set intersection of LMADs, but they did not describe a method to compute the union of LMADs. In this thesis, I proposed a method to approximately compute and represent a union of RCSLMADs, which are a subclass of LMADs.

In this thesis, array access analysis is done by the JIT compiler jit4GPU, and the concept of LMADs is not used by the ahead-of-time (AOT) compiler unPython. One significant advantage of doing array-access analysis just before executing a loop is that the values of loop-invariant symbolic constants are known just before executing the loop.


AOT compilers do not know the values of symbolic constants and thus may have to deal with a very large number of possible memory access patterns, whereas JIT compilers know the values of the symbolic constants and can do a much more precise array-access analysis. Jit4GPU is not the first compiler to do array-access analysis at runtime using LMADs. Rus et al. present a runtime representation of LMADs called RT-LMADs [18], [19]. They were interested in applying array-access analysis to parallelization and hence their implementation focused on computing dependence information. They represented a union of LMADs simply as a list of LMADs. However, this representation is not useful for this thesis: here, dependence information is not required, and instead an efficient representation of the union that maps easily to a different address space is needed. The algorithm presented in this thesis for computing the union is a unique contribution of this thesis.

8.3 Compilers for GPGPU

The field of general-purpose computation on GPUs started with disguising general-purpose programs as graphics-shader programs. The real emergence of GPGPU started with languages such as Brook [12]. Brook is an extension of the C programming language that provides a stream data type and introduces kernel functions. Streams are similar to arrays but represent data to be transferred to the GPU. Kernel functions are functions to be executed by each GPU thread and operate only on streams and scalar values. Streams passed as parameters to kernel functions are marked as input or output streams. At the time the original implementation of Brook was done, GPUs could not write to arbitrary locations, so kernel functions could not either.
To use a GPU for general-purpose computation, the programmer had to first copy data into streams and then pass these streams to kernel functions. The Brook compiler compiled the kernel functions into OpenGL GLSL or DirectX 9 HLSL shaders. AMD has extended Brook to create a language called Brook+ that exposes writes to arbitrary locations and introduces several new data types, but the programming model remains the same. Apart from Brook, several other stream-based programming languages have been proposed for GPGPU, such as StreamIt [21][22], Sh [16] and the Stream Virtual Machine [14].

Currently, APIs such as OpenCL [3] and Nvidia CUDA [1] are available to program GPUs. Nvidia CUDA is an extension of C/C++. CUDA exposes the GPU as a Single-Program Multiple-Data (SPMD) machine where the same code is executed by many GPU threads. In CUDA, the GPU also has its own separate address space and the programmer is required to manually transfer data between the CPU and GPU address spaces. CUDA extends C by providing device functions, which are functions to be executed on the GPU. The code in a device function represents the code to be executed by each GPU thread.


Device functions are written in a subset of C and can have added syntax for using GPU hardware features such as on-chip shared memory. Calls to device functions must also provide, as a parameter, a grid of threads that specifies the number and configuration of GPU threads to be launched. Overall, CUDA gives the programmer almost complete control of the GPU but is a fairly low-level programming model. If code is required to be portable between CPU and GPU, then writing it in CUDA can require significant code duplication, because two versions (a normal C function and a device function) of the same code may need to be implemented. OpenCL is based on concepts similar to CUDA. Compared to CUDA, this thesis provides a simpler alternative for programming GPUs by transferring the burden of generating and optimizing GPU code from the programmer to the compiler. However, the compiler framework presented in this thesis is not always successful in generating GPU code, and the JIT compilation it relies on can also represent a significant performance overhead.

One limitation of APIs such as CUDA and OpenCL is that each performance-critical function is implemented twice: once as a CPU-specific function and once as a GPU-specific device function. One attempt to reduce this duplication is offered by MCUDA [20]. MCUDA compiles CUDA code to multicore x86 code and therefore allows CUDA code to be executed on multicore CPUs. MCUDA is based on the premise that programmers should only write the CUDA version of the performance-critical function, and MCUDA automatically generates a CPU version. Thus, MCUDA is based on a philosophy exactly opposite to that of this thesis: this thesis gives the programmer the illusion of executing the program on a symmetric multiprocessor machine while automatically using the GPU, whereas MCUDA gives the illusion of executing on a GPU but automatically generates a CPU version if a GPU is not present.

The work closest to this thesis is the work by Lee et al. on compiling OpenMP code to Nvidia CUDA [15].
They describe a compiler that can automatically generate Nvidia CUDA code from C/C++ programs with OpenMP annotations. There are three primary differences between this thesis and the work done by Lee et al. First, when analyzing which data to transfer to the GPU, they can only deal with arrays and not with pointers, and they simply transfer the entire array without analyzing which specific memory locations within the array are accessed in the loop. The approach in this thesis is based on analyzing the memory locations accessed and, if used in a C/C++ setting, can be applied to both pointers and arrays. Second, they cannot tile or break up the loop if the data does not fit into the GPU. If one of the arrays referenced in the loop is too big to fit on the GPU, they cannot generate GPU code, whereas this thesis can tile the loop in some cases and thus divide the computation into smaller pieces. Finally, they describe loop optimizations that are suitable for the Nvidia GPU architecture, while this thesis is concerned with the AMD architecture.


Their loop optimizations are completely different from the loop optimizations described in this thesis because the architectures are very different.

Another work that extends OpenMP for GPGPU programming is the EXOCHI framework by Wang et al. [23]. EXOCHI is an extension of OpenMP for C/C++ for heterogeneous systems. Their implementation targets a multicore x86 CPU and an integrated Intel graphics chipset. Unlike the discrete GPUs considered in this thesis, such as the Radeon 4870, Intel graphics chipsets are integrated into the northbridge of the CPU and do not sit on a PCIe bus. These integrated chips also do not have separate onboard memory and can access the system RAM. EXOCHI therefore does not need to copy data; it only needs to remap the memory address translation table from the CPU to the GPU, and this remapping is handled by EXOCHI's runtime. To program the GPU, EXOCHI requires the programmer to write GPU code but not to perform any data transfers, because data transfers are not necessary: the GPU code can directly access any data in system RAM, which simplifies the programming. EXOCHI is only suitable for systems where both the CPU and the accelerator (such as the GPU) can access the system RAM directly and where the address translation table can simply be remapped. Therefore, EXOCHI is not applicable to current-generation discrete GPUs.

8.4 Conclusions

This thesis describes the first Python compiler to provide simple parallel programming support for numerical applications. The implemented compiler is also one of the first to automatically map a shared-memory parallel programming model to a GPGPU system. This thesis describes a new algorithm to automatically transfer the relevant data between a CPU and a GPU. The implemented compiler provides a programming model that is simpler to program than current GPGPU APIs such as CUDA and that relies on compiler analysis and optimization to automatically generate GPU code.


Chapter 9

Conclusions

This thesis introduced a new programming model for more efficient programming of numerical programs in Python for execution on GPUs. The thesis also described the design and implementation of a compilation system that converts numerical Python programs annotated with type and parallel-loop annotations to multi-cores and GPUs. In this programming model, the programmer writes code for a simple shared-memory abstraction and the compiler automatically converts the program to use a GPU as an accelerator. The program remains portable between multicores and GPUs with no code changes.

The compiler system consists of unPython, an ahead-of-time compiler, and jit4GPU, a just-in-time compiler. Jit4GPU implements a new algorithm to analyze the regions of memory accessed by an array reference in a loop nest. The algorithm is restricted to a class of affine accesses termed RCSLMADs. Jit4GPU automatically transfers the data required for the computation between the CPU and the GPU based on the results of the array access analysis. Jit4GPU is not a general-purpose JIT compiler and only works on numerical programs represented as parallel loop nests with array accesses representable as RCSLMADs. Jit4GPU generates GPU code from a typed abstract-syntax-tree (AST) representation of the Python program produced by unPython. Jit4GPU also performs several loop optimizations such as loop unrolling and memory-load coalescing.

The performance evaluation used several numerical kernels. On some kernels, the code generated by jit4GPU performs over 100 times faster than the OpenMP code generated by unPython. Jit4GPU also delivers better performance than some highly tuned CPU libraries, such as ATLAS, without requiring the programmer to do any optimizations such as unrolling or tiling in the original Python source. Compilers such as jit4GPU allow the programmer to easily harness the computational power of modern GPUs for general-purpose computation.


Bibliography

[1] Nvidia CUDA (2009-09-30). http://www.nvidia.com/cuda.

[2] Cython: C-Extensions for Python (2009-09-30). http://www.cython.org.

[3] OpenCL - The open standard for parallel programming of heterogeneous systems (2009-09-30). http://www.khronos.org/opencl/.

[4] PyPy project (2009-09-30). http://codespeak.net/pypy/dist/pypy/doc/.

[5] Pyrex - a Language for Writing Python Extension Modules (2009-09-30). http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/.

[6] Python/C API Reference Manual (2009-09-30). http://www.python.org/doc/2.5/api/api.html.

[7] Unladen Swallow: A faster implementation of Python (2009-09-30). http://code.google.com/p/unladen-swallow.

[8] AMD. ATI Stream Computing User Guide, February 2009.

[9] AMD. Compute Abstraction Layer (CAL) Intermediate Language (IL) Reference Guide, February 2009.

[10] AMD. R700-Family Instruction Set Architecture, March 2009.

[11] D. Ancona, M. Ancona, A. Cuni, and N. Matsakis. RPython: A Step Towards Reconciling Dynamically and Statically Typed OO Languages. In OOPSLA 2007 Proceedings and Companion, DLS '07: Proceedings of the 2007 Symposium on Dynamic Languages, pages 53–64. ACM, 2007.

[12] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Trans. Graph., 23(3):777–786, 2004.

[13] Mark Dufour. Shed Skin - An experimental (restricted) Python to C++ compiler (2009-09-30). http://code.google.com/p/shedskin/.


[14] Francois Lab<strong>on</strong>te, Peter Matts<strong>on</strong>, William Thies, Ian Buck, Christos Kozyrakis, andMark Horowitz. The Stream Virtual Machine. In PACT ’04: Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> the 13thInternati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> <str<strong>on</strong>g>Parallel</str<strong>on</strong>g> Architectures and Compilati<strong>on</strong> Techniques, pages267–277. IEEE Computer Society, 2004.[15] Sey<strong>on</strong>g Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: a compilerframework <str<strong>on</strong>g>for</str<strong>on</strong>g> automatic translati<strong>on</strong> and optimizati<strong>on</strong>. In PPoPP ’09: Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g>the 14th ACM SIGPLAN symposium <strong>on</strong> Principles and practice <str<strong>on</strong>g>of</str<strong>on</strong>g> parallel programming,pages 101–110. ACM, 2009.[16] Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader metaprogramming. InHWWS ’02: Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> the ACM SIGGRAPH/EUROGRAPHICS c<strong>on</strong>ference <strong>on</strong>Graphics hardware, pages 57–68. Eurographics Associati<strong>on</strong>, 2002.[17] Yunheung Paek, Jay Hoeflinger, and David Padua. Efficient and precise array accessanalysis. ACM Trans. Program. Lang. Syst., 24(1):65–109, 2002.[18] Silvius Rus, Lawrence Rauchwerger, and Jay Hoeflinger. Run-time Assisted InterproceduralAnalysis <str<strong>on</strong>g>of</str<strong>on</strong>g> Memory Access Patterns. Technical report, Department <str<strong>on</strong>g>of</str<strong>on</strong>g> ComputerScience, Texas A&M University, 2001.[19] Silvius Rus, Lawrence Rauchwerger, and Jay Hoeflinger. Hybrid analysis: static &dynamic memory reference analysis. Internati<strong>on</strong>al Journal <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>Parallel</str<strong>on</strong>g> Programming,31(4):251–283, 2003.[20] John Stratt<strong>on</strong>, Sam St<strong>on</strong>e, and Wen mei Hwu. MCUDA: An Efficient Implementati<strong>on</strong><str<strong>on</strong>g>of</str<strong>on</strong>g> CUDA Kernels <str<strong>on</strong>g>for</str<strong>on</strong>g> Multi-core CPUs. In 21st Annual Workshop <strong>on</strong> Languages and<str<strong>on</strong>g>Compiler</str<strong>on</strong>g>s <str<strong>on</strong>g>for</str<strong>on</strong>g> <str<strong>on</strong>g>Parallel</str<strong>on</strong>g> Computing (LCPC), July 2008.[21] William Thies, Michael Karczmarek, and Saman Amarasinghe. StreamIt: A Language<str<strong>on</strong>g>for</str<strong>on</strong>g> Streaming Applicati<strong>on</strong>s. In Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> the Internati<strong>on</strong>al C<strong>on</strong>ference <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>Compiler</str<strong>on</strong>g>C<strong>on</strong>structi<strong>on</strong>, 2002.[22] A. Udupa, R. Govindarajan, and M.J Thazhuthaveetil. S<str<strong>on</strong>g>of</str<strong>on</strong>g>tware Pipelined Executi<strong>on</strong><str<strong>on</strong>g>of</str<strong>on</strong>g> Stream <str<strong>on</strong>g>Programs</str<strong>on</strong>g> <strong>on</strong> GPUs. In Internati<strong>on</strong>al Symposium <strong>on</strong> Code Generati<strong>on</strong> andOptimizati<strong>on</strong> (CGO), pages 200–209, 2009.[23] Perry H. 
Wang, Jamis<strong>on</strong> D. Collins, Gautham N. Chinya, H<strong>on</strong>g Jiang, Xinmin Tian,Milind Girkar, Nick Y. Yang, Guei-Yuan Lueh, and H<strong>on</strong>g Wang. EXOCHI: architectureand programming envir<strong>on</strong>ment <str<strong>on</strong>g>for</str<strong>on</strong>g> a heterogeneous multi-core multithreaded system.In PLDI ’07: Proceedings <str<strong>on</strong>g>of</str<strong>on</strong>g> the 2007 ACM SIGPLAN c<strong>on</strong>ference <strong>on</strong> Programminglanguage design and implementati<strong>on</strong>, pages 156–166. ACM, 2007.67
