A Compiler for Parallel Execution of Numerical Python Programs on ...


each parallel loop to half its original value. If there are m parallel loops, then 2^m tiles are formed. The compiler then computes the amount of data to be transferred for each tile and compares the number of elements with the amount of GPU memory available: tiling succeeds once the data for a tile fits, and otherwise the compiler halves the loops again. If the total number of tiles exceeds 64, the compiler stops trying to tile the loop and abandons efforts to use the GPU.

5.5 Conclusions

This chapter presented a new heuristic algorithm for array access analysis and for automatically transferring data between the system memory and the GPU memory. The algorithm only handles one class of LMADs but can offer significant space savings on the GPU compared to more naive approaches. This chapter also presented a loop tiling algorithm that can automatically scale parallel loop nests so that the data required for the computation fits in the limited GPU memory. The algorithms have been implemented in jit4GPU, which currently generates code only for AMD GPUs, but the algorithms presented in this chapter are equally applicable to any GPGPU system with a separate address space and limited GPU memory.
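The loop tiling heuristic from this chapter can be illustrated with a minimal Python sketch. It is not the jit4GPU implementation. It assumes that the halving step is repeated until the data for one tile fits, and that the per-tile data volume comes from a caller-supplied function; the names `choose_tiling`, `elements_needed` and `memory_capacity` are hypothetical. Only the 64-tile limit and the factor of 2^m per halving step are taken from the text.

```python
from math import prod

MAX_TILES = 64  # limit stated in the text


def choose_tiling(extents, elements_needed, memory_capacity):
    """Halve every parallel loop extent until one tile's data fits.

    extents: iteration counts of the m parallel loops.
    elements_needed: hypothetical callback giving the number of array
        elements that one tile with the given extents must transfer.
    memory_capacity: number of elements that fit in GPU memory.
    Returns (tile_extents, tile_count), or None when the GPU is abandoned.
    """
    tile = list(extents)
    tiles = 1
    while elements_needed(tile) > memory_capacity:
        if all(e == 1 for e in tile):
            return None  # nothing left to halve
        # Halving each of the m loops multiplies the tile count by 2**m.
        tile = [(e + 1) // 2 for e in tile]
        tiles *= 2 ** len(tile)
        if tiles > MAX_TILES:
            return None  # too many tiles: give up on the GPU
    return tile, tiles


# Assumed workload: three arrays, each as large as the iteration space.
print(choose_tiling((1024, 1024), lambda t: 3 * prod(t), 1_000_000))
```

With two parallel loops each halving step quadruples the tile count, so under these assumptions at most three halving steps are possible before the 64-tile limit is exceeded.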
Chapter 6
Loop transformations for AMD GPUs

To extract the maximum performance out of a chip such as the RV770, extensive code transformations such as loop unrolling and load coalescing are necessary. Programmers are not expected to do such loop transformations manually, for two reasons. First, loop transformations like unrolling reduce programmer productivity and program maintainability. Second, low-level details like the memory arrangement of the GPU and information about specific datatypes are not exposed to the programmer. Therefore, it is the responsibility of the compiler to do the necessary transformations. In the implemented compiler framework, jit4GPU is responsible for loop transformations, while the AMD CAL IL compiler does low-level transformations like instruction scheduling and register allocation.

One important code transformation done by jit4GPU is reducing the number of memory load instructions (called texturing instructions on the GPU) by coalescing loads into loads of multi-component types such as float2 and float4. Due to the unbalanced ratio of ALU to load unit hardware, the number of load instructions should be reduced so that the load (texturing) unit is not the bottleneck. By default, the compiler uses all the memory resources on the GPU in single-component formats, which only allow single-component loads. The compiler tries to identify GPU resources that can be stored in aligned multi-component formats. Resources stored in aligned multi-component formats only allow aligned multi-component loads. Thus, the compiler first analyzes whether the loads from a particular resource can always be grouped into aligned multi-component loads. If the grouping is successful, then the compiler uses the more efficient formats and reduces the number of load instructions. To find groups of loads that can be combined, the compiler looks at the addresses accessed in the loop body and attempts to find loads of the form 4 * e + c, where e is any expression and c is a constant with the value 0, 1, 2 or 3. If all the accesses from a memory resource match this form, then the memory resource can be stored on the GPU using an aligned four-component format. The compiler then attempts to
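The check for accesses of the form 4 * e + c can be sketched in Python as follows. This is an illustration under stated assumptions, not the jit4GPU code: it assumes each array index is an affine expression given as a dictionary of integer coefficients per loop variable plus an integer constant, and the names `split_index` and `coalesce_loads` are hypothetical.

```python
def split_index(coeffs, const):
    """Try to write sum(coeffs[v] * v) + const as 4*e + c with c in 0..3.

    Returns (e, c), where e is a hashable description of the expression e,
    or None when some coefficient is not a multiple of 4.
    """
    if any(a % 4 for a in coeffs.values()):
        return None
    # divmod keeps c in 0..3 even for negative constants.
    q, c = divmod(const, 4)
    e = (tuple(sorted((v, a // 4) for v, a in coeffs.items())), q)
    return e, c


def coalesce_loads(accesses):
    """Group the loads from one resource by their shared expression e.

    accesses: iterable of (coeffs, const) pairs, one per load.
    Returns {e: set of components c}, or None if any access does not
    match 4*e + c, in which case the single-component format is kept.
    """
    groups = {}
    for coeffs, const in accesses:
        split = split_index(coeffs, const)
        if split is None:
            return None
        e, c = split
        groups.setdefault(e, set()).add(c)
    return groups


# Loads a[4*i], a[4*i+1], a[4*i+2], a[4*i+3] from a loop unrolled 4 times.
unrolled = [({"i": 4}, c) for c in range(4)]
print(coalesce_loads(unrolled))
```

In this sketch every key of the returned dictionary stands for one candidate four-component load, so the four scalar loads of the unrolled loop collapse into a single group, while an access such as a[2*i] makes the whole resource fall back to the single-component format.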